[tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster
Tor Bug Tracker & Wiki
blackhole at torproject.org
Fri Dec 6 16:55:54 UTC 2019
#32660: onionoo-backend is killing the ganeti cluster
-----------------------------+------------------------------
 Reporter:  anarcat          |          Owner:  metrics-team
     Type:  defect           |         Status:  closed
 Priority:  Medium           |      Milestone:
Component:  Metrics/Onionoo  |        Version:
 Severity:  Normal           |     Resolution:  fixed
 Keywords:                   |  Actual Points:
Parent ID:                   |         Points:
 Reviewer:                   |        Sponsor:
-----------------------------+------------------------------
Changes (by anarcat):
* status: merge_ready => closed
* resolution: => fixed
Old description:
> hello!
>
> today i noticed that, since last friday (UTC) morning, there have been
> pretty big spikes on the internal network between the ganeti nodes, every
> hour. it looks like this, in grafana:
>
> [[Image(snap-2019.12.02-16.06.11.png)]]
>
> We can clearly see a correlation between the two nodes' traffic, in
> reverse. This was confirmed using `iftop` and `tcpdump` on the nodes
> during a surge.
>
> It seems this is due to onionoo-backend-01 blasting the disk and CPU for
> some reason. These are the disk I/O graphs for that host, which correlate
> pretty cleanly with the above graphs:
>
> [[Image(snap-2019.12.02-16.30.33.png)]]
>
> This was confirmed by an inspection of `drbd`, the mechanism that
> synchronizes the disks across the network. It seems there's a huge surge
> of "writes" on the network every hour which lasts anywhere between 20 and
> 30 minutes. This was (somewhat) confirmed by running:
>
> {{{
> watch -n 0.1 -d cat /proc/drbd
> }}}
>
> on the nodes. The device IDs 4, 13 and 17 trigger a lot of changes in
> DRBD. 13 and 17 are the web nodes, so that's expected - probably log
> writes? But device ID 4 is onionoo-backend, which is what led me to the
> big traffic graph.
>
> could someone from metrics investigate?
>
> can i just turn off this machine altogether, considering it's basically
> trying to murder the cluster every hour? :)
New description:
hello!
today i noticed that, since last friday (UTC) morning, there have been
pretty big spikes on the internal network between the ganeti nodes, every
hour. it looks like this, in grafana:
[[Image(snap-2019.12.02-16.06.11.png, 700)]]
We can clearly see a correlation between the two nodes' traffic, in
reverse. This was confirmed using `iftop` and `tcpdump` on the nodes
during a surge.
It seems this is due to onionoo-backend-01 blasting the disk and CPU for
some reason. These are the disk I/O graphs for that host, which correlate
pretty cleanly with the above graphs:
[[Image(snap-2019.12.02-16.30.33.png)]]
This was confirmed by an inspection of `drbd`, the mechanism that
synchronizes the disks across the network. It seems there's a huge surge
of "writes" on the network every hour which lasts anywhere between 20 and
30 minutes. This was (somewhat) confirmed by running:
{{{
watch -n 0.1 -d cat /proc/drbd
}}}
on the nodes. The device IDs 4, 13 and 17 trigger a lot of changes in
DRBD. 13 and 17 are the web nodes, so that's expected - probably log
writes? But device ID 4 is onionoo-backend, which is what led me to the
big traffic graph.
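The hourly surge could also be measured instead of eyeballed with `watch`. Here is a minimal sketch, assuming the DRBD 8.x `/proc/drbd` layout, where each device entry is followed by a line of counters such as `ns`, `nr`, `dw` and `dr` (reported in KiB). The helper names `parse_drbd` and `write_deltas` are hypothetical and not part of DRBD or ganeti:

```python
import re

# Counters of interest (assumed DRBD 8.x layout, values in KiB):
# ns = network send, nr = network receive, dw = disk write, dr = disk read.
FIELDS = ("ns", "nr", "dw", "dr")

# A device entry starts with "<minor>: cs:<connection state> ...".
DEVICE_RE = re.compile(r"^\s*(\d+):\s+cs:")
# Counters look like "name:digits" on the lines after the device entry.
COUNTER_RE = re.compile(r"\b([a-z]+):(\d+)\b")


def parse_drbd(text):
    """Return {minor: {counter: value}} for one /proc/drbd snapshot."""
    devices = {}
    current = None
    for line in text.splitlines():
        match = DEVICE_RE.match(line)
        if match:
            current = int(match.group(1))
            devices[current] = {}
            continue
        if current is not None:
            for name, value in COUNTER_RE.findall(line):
                if name in FIELDS:
                    devices[current][name] = int(value)
    return devices


def write_deltas(before, after):
    """Counter increase per device between two snapshots, most disk writes first."""
    deltas = {}
    for minor, counters in after.items():
        old = before.get(minor, {})
        deltas[minor] = {key: value - old.get(key, 0) for key, value in counters.items()}
    return sorted(deltas.items(), key=lambda item: item[1].get("dw", 0), reverse=True)
```

Reading `/proc/drbd` twice, a minute apart, and passing both parsed snapshots to `write_deltas` would then list the device IDs ordered by how much they wrote in that interval.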
could someone from metrics investigate?
can i just turn off this machine altogether, considering it's basically
trying to murder the cluster every hour? :)
--
Comment:
wow, that *is* a huge improvement! check this out:
https://grafana.torproject.org/d/ER3U2cqmk/node-exporter-server-metrics?orgId=1&from=1575563766753&to=1575650166753&var-node=omeiense.torproject.org:9100&var-node=oo-hetzner-03.torproject.org
in particular:
[[Image(snap-2019.12.06-11.36.33.png, 700)]]
large reduction in CPU and memory usage, significant reduction in load!
[[Image(snap-2019.12.06-11.43.28.png, 700)]]
also a *dramatic* reduction in disk utilization! especially: all that
writing was significantly reduced... but what i find the most interesting
is this:
[[Image(snap-2019.12.06-11.49.27.png, 700)]]
i.e. we write less, but we don't read more! even though we're computing all
those checksums, we don't impose extra load on the disks because of that
reading, which is one thing I was worried about.
but even if we did read more (which we don't) it would still be a
worthwhile tradeoff because (1) we can cache those reads and (2) we (obviously)
don't need to replicate reads across the cluster.
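To illustrate why checksums trade writes for reads, here is a minimal sketch of the general "only write when the content changed" idea. This is not the actual Onionoo change (Onionoo is written in Java and its patch is not shown in this ticket). The function name `write_if_changed` and the choice of SHA-256 are assumptions made for the example:

```python
import hashlib
import os
import tempfile


def write_if_changed(path, data):
    """Write bytes to path only when the content differs; return True if written."""
    new_digest = hashlib.sha256(data).hexdigest()
    try:
        # Reading the existing file costs a read, but reads can be served
        # from cache and are not replicated to the other node.
        with open(path, "rb") as existing:
            old_digest = hashlib.sha256(existing.read()).hexdigest()
    except FileNotFoundError:
        old_digest = None
    if old_digest == new_digest:
        return False  # identical content: no write, so nothing to replicate
    # Write to a temporary file in the same directory, then rename it into place.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
    return True
```

A job that regenerates many documents every hour but only changes a few of them would, with a check like this, skip the write for every unchanged document.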
i can't confirm the effect on the actual ganeti cluster because irl
(thankfully! :) has turned off those jobs on onionoo-backend-01. but i'm
pretty confident the cluster will be happier with this work if/when we turn
it back on.
thank you so much for taking the extra time to fix this and taking care
of our hardware. sometimes it's easier to throw hardware at a problem, but
this seemed like a case where we could improve our algos a little, and I'm
glad it worked out. :)
all in all, i think this can be marked as fixed, at least it is for me.
i'll let other tickets speak for the rest of the work on this onionoo
stuff. from what i understand, there needs to be extra work to bring that
other backend online (or build a new one?) but i'll let you folks figure
out the next steps. :)
do ping me if you need help on that!
cheers, and thanks again!
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/32660#comment:15>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online