[tor-bugs] #33406 [Internal Services/Tor Sysadmin Team]: automate reboots
Tor Bug Tracker & Wiki
blackhole at torproject.org
Fri May 1 00:04:19 UTC 2020
#33406: automate reboots
-------------------------------------------------+-------------------------
Reporter: anarcat | Owner: anarcat
Type: project | Status: accepted
Priority: Low | Milestone:
Component: Internal Services/Tor Sysadmin Team | Version:
Severity: Major | Resolution:
Keywords: tpa-roadmap-may | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-------------------------------------------------+-------------------------
Changes (by anarcat):
* keywords: tpa-roadmap-april => tpa-roadmap-may
Comment:
I fixed the timeout error and did today's round of upgrades without too
many problems. One issue that came up is that ganeti wasn't happy to
chain-reboot machines: some instances needed an `activate-disks` run
before they would recognize their secondary. That has been added as a
TODO in the code.
I also experimented with feeding LDAP host lists as an argument to the
reboot command, which also worked well. This, for example, rebooted
the `rotation` hosts with a 10-minute delay:
{{{
./reboot -H $(ssh alberti.torproject.org 'ldapsearch -h db.torproject.org -x -ZZ -b dc=torproject,dc=org -LLL "(&(hostname=*.torproject.org)(rebootPolicy=rotation))" hostname | awk "\$1 == \"hostname:\" {print \$2}" | sort') -v
}}}
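The parsing half of that pipeline can be tried without LDAP access. The following is a minimal sketch, not part of the reboot script: the function name `ldif_hostnames` and the sample records are invented for illustration, and only the awk expression is taken from the command above.

```shell
# Hypothetical helper (not from the reboot script): read `ldapsearch -LLL`
# output on stdin and print the value of every "hostname:" attribute,
# sorted, one per line.
ldif_hostnames() {
    awk '$1 == "hostname:" {print $2}' | sort
}

# Sample LDIF invented for this example; real output comes from ldapsearch.
sample_ldif='dn: host=web-fsn-02,ou=hosts,dc=torproject,dc=org
hostname: web-fsn-02.torproject.org

dn: host=nutans,ou=hosts,dc=torproject,dc=org
hostname: nutans.torproject.org'

# Prints the two hostnames, sorted.
printf '%s\n' "$sample_ldif" | ldif_hostnames
```

Keeping the filter in a function like this would make it easy to check against saved `ldapsearch` output before pointing it at a real reboot run.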
I added a modified recipe to the upgrades page, which covers all cases.
I also set the reboot policy on a few hosts that didn't have one, so
they are now classified properly:
manual:
* moly (KVM, requires special handling)
* kvm4 (KVM)
* kvm5 (KVM)
* scw-arm-par-01 (buggy buildbox, see #32920)
* fsn-node-01 (ganeti, requires special handling)
* fsn-node-02 (ganeti)
* fsn-node-03 (ganeti)
* weissii (windows buildbox, no ssh)
* woronowii (windows buildbox, no ssh)
* winklerianum (windows buildbox, no ssh)
justdoit:
* pauli (puppet)
* rude (rt)
* alberti (ldap)
* eugeni (mail)
* majus (translation)
* rouyi (jenkins)
* troodi (trac)
* nevii (dns primary)
* henryi (consensus-health)
* vineale (gitweb)
* gayi (svn)
* polyanthum (bridges)
* materculae (exonerator)
* meronense (metrics.tpo)
* colchicifolium (collector backend)
* carinatum (DocTor)
* build-x86-05 (buildbox)
* build-x86-06 (buildbox)
* build-x86-08 (buildbox)
* build-x86-09 (buildbox)
* perdulce (people.tpo)
* staticiforme (static master)
* forrestii (fpcentral)
* subnotabile (survey)
* crm-int-01 (CRM backend)
* crm-ext-01 (CRM frontend)
* submit-01 (mail)
rotation:
* fallax (DNS secondary)
* omeiense (onionoo backend)
* oo-hetzner-03 (onionoo backend)
* neriniflorum (DNS secondary)
* web-hetzner-01 (web frontend)
* web-cymru-01 (web frontend)
the following were already configured as...
rotation:
* orestis (onionoo backend)
* nutans (DNS secondary)
* cdn-backend-sunet-01 (web frontend)
* hetzner-hel1-02 (DNS secondary)
* hetzner-hel1-03 (web frontend)
* onionoo-backend-01 (onionoo backend)
* web-fsn-01 (web frontend)
* web-fsn-02 (web frontend)
* onionoo-frontend-01 (onionoo frontend)
* cache01 (cache frontend)
* cache-02 (cache frontend)
* onionoo-backend-02 (onionoo backend)
justdoit:
* corsicum (collector)
* hetzner-hel1-01 (nagios)
* bungei (backup storage)
* hetzner-nbg1-01 (prometheus)
* hetzner-nbg1-02 (prometheus)
* archive-01 (non-redundant web frontend)
* loghost01 (syslog)
* static-master-fsn (static master)
* bacula-director-01 (backup director)
* gettor-01 (gettor)
* onionbalance-01 (onionbalance)
* chives (IRC)
* build-arm-10 (buildbox)
* tbb-nightlies-master (static master)
* gitlab-02 (gitlab)
* check-01 (check.tpo)
manual:
* mandos-01 (mandos, requires crypto)
* fsn-node-04
* fsn-node-05
In other words, I made the following diff in LDAP:
{{{
--- policy-before 2020-04-30 19:48:50.158412413 -0400
+++ policy-after 2020-04-30 19:54:15.209832522 -0400
@@ -6,27 +6,35 @@
dn: host=moly,ou=hosts,dc=torproject,dc=org
host: moly
+rebootPolicy: manual
dn: host=pauli,ou=hosts,dc=torproject,dc=org
host: pauli
+rebootPolicy: justdoit
dn: host=rude,ou=hosts,dc=torproject,dc=org
host: rude
+rebootPolicy: justdoit
dn: host=alberti,ou=hosts,dc=torproject,dc=org
host: alberti
+rebootPolicy: justdoit
dn: host=cupani,ou=hosts,dc=torproject,dc=org
host: cupani
+rebootPolicy: justdoit
dn: host=fallax,ou=hosts,dc=torproject,dc=org
host: fallax
+rebootPolicy: rotation
dn: host=eugeni,ou=hosts,dc=torproject,dc=org
host: eugeni
+rebootPolicy: justdoit
dn: host=majus,ou=hosts,dc=torproject,dc=org
host: majus
+rebootPolicy: justdoit
dn: host=listera,ou=hosts,dc=torproject,dc=org
host: listera
@@ -34,63 +42,83 @@
dn: host=rouyi,ou=hosts,dc=torproject,dc=org
host: rouyi
+rebootPolicy: justdoit
dn: host=palmeri,ou=hosts,dc=torproject,dc=org
host: palmeri
+rebootPolicy: justdoit
dn: host=weissii,ou=hosts,dc=torproject,dc=org
host: weissii
+rebootPolicy: manual
dn: host=troodi,ou=hosts,dc=torproject,dc=org
host: troodi
+rebootPolicy: justdoit
dn: host=nevii,ou=hosts,dc=torproject,dc=org
host: nevii
+rebootPolicy: justdoit
dn: host=henryi,ou=hosts,dc=torproject,dc=org
host: henryi
+rebootPolicy: justdoit
dn: host=vineale,ou=hosts,dc=torproject,dc=org
host: vineale
+rebootPolicy: justdoit
dn: host=gayi,ou=hosts,dc=torproject,dc=org
host: gayi
+rebootPolicy: justdoit
dn: host=polyanthum,ou=hosts,dc=torproject,dc=org
host: polyanthum
+rebootPolicy: justdoit
dn: host=materculae,ou=hosts,dc=torproject,dc=org
host: materculae
+rebootPolicy: justdoit
dn: host=omeiense,ou=hosts,dc=torproject,dc=org
host: omeiense
+rebootPolicy: rotation
dn: host=meronense,ou=hosts,dc=torproject,dc=org
host: meronense
+rebootPolicy: justdoit
dn: host=colchicifolium,ou=hosts,dc=torproject,dc=org
host: colchicifolium
+rebootPolicy: justdoit
dn: host=carinatum,ou=hosts,dc=torproject,dc=org
host: carinatum
+rebootPolicy: justdoit
dn: host=build-x86-05,ou=hosts,dc=torproject,dc=org
host: build-x86-05
+rebootPolicy: justdoit
dn: host=build-x86-06,ou=hosts,dc=torproject,dc=org
host: build-x86-06
+rebootPolicy: justdoit
dn: host=perdulce,ou=hosts,dc=torproject,dc=org
host: perdulce
+rebootPolicy: justdoit
dn: host=staticiforme,ou=hosts,dc=torproject,dc=org
host: staticiforme
+rebootPolicy: justdoit
dn: host=woronowii,ou=hosts,dc=torproject,dc=org
host: woronowii
+rebootPolicy: manual
dn: host=winklerianum,ou=hosts,dc=torproject,dc=org
host: winklerianum
+rebootPolicy: manual
dn: host=orestis,ou=hosts,dc=torproject,dc=org
host: orestis
@@ -106,21 +134,27 @@
dn: host=kvm4,ou=hosts,dc=torproject,dc=org
host: kvm4
+rebootPolicy: manual
dn: host=oo-hetzner-03,ou=hosts,dc=torproject,dc=org
host: oo-hetzner-03
+rebootPolicy: rotation
dn: host=forrestii,ou=hosts,dc=torproject,dc=org
host: forrestii
+rebootPolicy: justdoit
dn: host=subnotabile,ou=hosts,dc=torproject,dc=org
host: subnotabile
+rebootPolicy: justdoit
dn: host=kvm5,ou=hosts,dc=torproject,dc=org
host: kvm5
+rebootPolicy: manual
dn: host=neriniflorum,ou=hosts,dc=torproject,dc=org
host: neriniflorum
+rebootPolicy: rotation
dn: host=hetzner-hel1-01,ou=hosts,dc=torproject,dc=org
host: hetzner-hel1-01
@@ -132,12 +166,15 @@
dn: host=build-x86-08,ou=hosts,dc=torproject,dc=org
host: build-x86-08
+rebootPolicy: justdoit
dn: host=web-hetzner-01,ou=hosts,dc=torproject,dc=org
host: web-hetzner-01
+rebootPolicy: rotation
dn: host=scw-arm-par-01,ou=hosts,dc=torproject,dc=org
host: scw-arm-par-01
+rebootPolicy: manual
dn: host=hetzner-hel1-02,ou=hosts,dc=torproject,dc=org
host: hetzner-hel1-02
@@ -149,15 +186,19 @@
dn: host=web-cymru-01,ou=hosts,dc=torproject,dc=org
host: web-cymru-01
+rebootPolicy: rotation
dn: host=crm-int-01,ou=hosts,dc=torproject,dc=org
host: crm-int-01
+rebootPolicy: justdoit
dn: host=crm-ext-01,ou=hosts,dc=torproject,dc=org
host: crm-ext-01
+rebootPolicy: justdoit
dn: host=build-x86-09,ou=hosts,dc=torproject,dc=org
host: build-x86-09
+rebootPolicy: justdoit
dn: host=bungei,ou=hosts,dc=torproject,dc=org
host: bungei
@@ -181,9 +222,11 @@
dn: host=fsn-node-01,ou=hosts,dc=torproject,dc=org
host: fsn-node-01
+rebootPolicy: manual
dn: host=fsn-node-02,ou=hosts,dc=torproject,dc=org
host: fsn-node-02
+rebootPolicy: manual
dn: host=loghost01.torproject.org,ou=hosts,dc=torproject,dc=org
host: loghost01
@@ -243,6 +286,7 @@
dn: host=fsn-node-03,ou=hosts,dc=torproject,dc=org
host: fsn-node-03
+rebootPolicy: manual
dn: host=onionoo-backend-02,ou=hosts,dc=torproject,dc=org
host: onionoo-backend-02
}}}
The policy is being interpreted here as:
* manual: requires manual intervention or special tools (fabric in case
of ganeti, reboot-host in the case of KVM, nothing for windows boxes)
* justdoit: can be rebooted with proper prior warning (10 minutes),
possibly in parallel with each other
* rotation: must not be rebooted together, longer warning (30 minutes)
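As a rough illustration of that interpretation, the mapping from policy to warning delay could be written as a small shell function. This is only a sketch under assumptions: the name `policy_delay_minutes` and the exit codes are hypothetical, and the delays are the ones listed above.

```shell
# Hypothetical helper: print the warning delay, in minutes, for a given
# rebootPolicy value. Fails for "manual" and for unknown values, because
# those hosts should not be rebooted by an unattended run.
policy_delay_minutes() {
    case "$1" in
        justdoit) echo 10 ;;
        rotation) echo 30 ;;
        manual)
            echo "policy 'manual' requires an operator" >&2
            return 1 ;;
        *)
            echo "unknown rebootPolicy: $1" >&2
            return 2 ;;
    esac
}
```

A caller could then do something like `delay=$(policy_delay_minutes "$policy") || continue` to skip hosts that need manual handling.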
I tried to update the "upgrades" docs to reflect this.
I think the last steps here are:
1. add LDAP support in the reboot script
2. parallelize "justdoit" jobs
3. turn ganeti hosts into "rotation" once we officialize this new
procedure
This is therefore likely to be completed in May.
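To make the difference between steps 2 and 3 concrete, here is a hedged sketch of the two scheduling modes in plain shell. It is not the reboot script: `REBOOT_CMD`, `reboot_parallel` and `reboot_serial` are hypothetical names, and the default command only prints the host name so the sketch is safe to run.

```shell
# Hypothetical stand-in for the real reboot command. By default it only
# prints what it would do; override REBOOT_CMD to call something else.
REBOOT_CMD=${REBOOT_CMD:-echo would-reboot}

# "justdoit" hosts: start every reboot in the background, then wait for
# all of the background jobs to finish.
reboot_parallel() {
    for host in "$@"; do
        $REBOOT_CMD "$host" &
    done
    wait
}

# "rotation" hosts: reboot one at a time and stop at the first failure,
# so that redundant peers are never taken down together.
reboot_serial() {
    for host in "$@"; do
        $REBOOT_CMD "$host" || return 1
    done
}
```

For example, `reboot_serial fallax nutans` would handle the two hosts one after the other, while `reboot_parallel pauli rude` would start both at once. Note that `wait` without arguments does not report individual failures, so a real implementation would need to collect exit statuses itself.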
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/33406#comment:11>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online