[tor-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster
Tor Bug Tracker & Wiki
blackhole at torproject.org
Mon Dec 2 21:55:07 UTC 2019
#32660: onionoo-backend is killing the ganeti cluster
-----------------------------+------------------------------
Reporter: anarcat | Owner: metrics-team
Type: defect | Status: new
Priority: Medium | Milestone:
Component: Metrics/Onionoo | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-----------------------------+------------------------------
Old description:
> hello!
>
> today i noticed that, since last friday (UTC) morning, there has been
> pretty big spikes on the internal network between the ganeti nodes, every
> hour. it seems this is due to onionoo-backend-01 blasting the disk and
> CPU for some reason.
>
> could someone from metrics investigate? can i just turn off this machine
> altogether, considering it's basically trying to murder the cluster every
> hour? :)
>
> (will attach explanatory screenshots)
New description:
hello!
today i noticed that, since last friday (UTC) morning, there have been
pretty big spikes on the internal network between the ganeti nodes, every
hour. it looks like this, in grafana:
[[Image(snap-2019.12.02-16.06.11.png)]]
We can clearly see a correlation between the two nodes' traffic, in
reverse. This was confirmed using `iftop` and `tcpdump` on the nodes
during a surge.
It seems this is due to onionoo-backend-01 blasting the disk and CPU for
some reason. These are the disk I/O graphs for that host, which correlate
pretty cleanly with the above graphs:
[[Image(snap-2019.12.02-16.30.33.png)]]
This was confirmed by an inspection of `drbd`, the mechanism that
synchronizes the disks across the network. It seems there's a huge surge
of "writes" on the network every hour which lasts anywhere between 20 and
30 minutes. This was (somewhat) confirmed by running:
{{{
watch -n 0.1 -d cat /proc/drbd
}}}
on the nodes. The device IDs 4, 13 and 17 trigger a lot of changes in
DRBD. 13 and 17 are the web nodes, so that's expected - probably log
writes? But device ID 4 is onionoo-backend, which is what led me to the
big traffic graph.
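For a less eyeball-driven version of the `watch` check above, here is a minimal sketch that compares two snapshots of `/proc/drbd` and ranks devices by how much their counters grew. It assumes the DRBD 8.x text layout, where each device header line (`4: cs:...`) is followed by a line of counters including `ns` (network send) and `dw` (disk write). The helper names `parse_drbd`, `busiest` and `sample` are hypothetical, made up for illustration, and not part of any DRBD or ganeti tooling.

```python
import re
import time

# Device header line, e.g. " 4: cs:Connected ro:Primary/Secondary ..."
DEVICE_RE = re.compile(r"^\s*(\d+):\s+cs:")
# Counters on the lines after the header, e.g. "ns:1234" or "dw:5678"
COUNTER_RE = re.compile(r"\b(ns|nr|dw|dr):(\d+)")


def parse_drbd(text):
    """Return {minor: {counter_name: value}} from /proc/drbd contents.

    Assumes the DRBD 8.x layout; other versions may differ.
    """
    devices = {}
    current = None
    for line in text.splitlines():
        header = DEVICE_RE.match(line)
        if header:
            current = int(header.group(1))
            devices[current] = {}
            continue
        if current is not None:
            for name, value in COUNTER_RE.findall(line):
                devices[current][name] = int(value)
    return devices


def busiest(before, after, counter="ns"):
    """Rank devices by how much `counter` grew between two snapshots."""
    deltas = {
        minor: after[minor].get(counter, 0) - before[minor].get(counter, 0)
        for minor in after
        if minor in before
    }
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


def sample(path="/proc/drbd", interval=10.0):
    """Read `path` twice, `interval` seconds apart, and rank the devices."""
    with open(path) as handle:
        before = parse_drbd(handle.read())
    time.sleep(interval)
    with open(path) as handle:
        after = parse_drbd(handle.read())
    return busiest(before, after)
```

Calling `sample()` on a node during a surge should return a list of `(device ID, delta)` pairs, largest first, which would make it easier to tell whether device ID 4 really dominates the replication traffic.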
could someone from metrics investigate?
can i just turn off this machine altogether, considering it's basically
trying to murder the cluster every hour? :)
--
Comment (by anarcat):
attached screenshots and further explanations.
the TL;DR here is: can i shut down this backend?
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/32660#comment:1>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online