[metrics-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster
Tor Bug Tracker & Wiki
blackhole at torproject.org
Tue Dec 3 16:28:56 UTC 2019
#32660: onionoo-backend is killing the ganeti cluster
-----------------------------+------------------------------
Reporter: anarcat | Owner: metrics-team
Type: defect | Status: new
Priority: Medium | Milestone:
Component: Metrics/Onionoo | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-----------------------------+------------------------------
Comment (by anarcat):
> Let's first find out what's happening there. We were planning to stop
> this instance this afternoon and set up a new one on the same host. If we
> don't know what's going wrong, we might see the same issue with the new
> instance.
> So, this seems like something that is caused by the hourly updater. Can
> you tell us if omeiense and/or oo-hetzner-03 have similar loads at roughly
> the same times?
They do look similar, now that you mention it:
https://grafana.torproject.org/d/ER3U2cqmk/node-exporter-server-metrics?orgId=1&var-node=omeiense.torproject.org:9100&var-node=onionoo-backend-01.torproject.org:9100&var-node=oo-hetzner-03.torproject.org:9100
The peak bandwidth usage is higher on onionoo-backend-01, but that might
just be because the disks are faster: the peak is higher but shorter, so
the total transfer size is probably equivalent.
> I have suspended the hourly updater on this host, but this is the normal
> expected operation for Onionoo's hourly updater. There are state files
> that get updated every run including those for which nothing has changed.
> This is a known flaw in Onionoo but until now it hasn't been a problem.
Okay, maybe I'm being overly cautious then. It might be a problem in the
medium to long term in the new cluster because of the way it is structured:
all writes are replicated between the two servers, because they act as a
redundant cluster. If one node goes down, the other can take over on the
fly. It also allows us to migrate the machines between the two servers
more or less in real time.
Maybe we should make an exception for this host, and keep the data only on
one server. This would have a few implications:
1. if the server goes down, we need to restore from backups, so there's
up to 24h of data loss
2. when we reboot the servers, the machine will go down for the duration
of the reboot
3. moving the machine around if we decommission the server will require
manual work
> karsten is going to look at how difficult it would be to reduce the
> number of writes performed. Is the problem total IO or is it just the
> writes? Are reads cached? i.e. if we read it again to compare before
> writing, does that help?
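
A minimal sketch of the compare-before-write idea raised in the question above, assuming a state file can be treated as an opaque byte string. The function name `write_if_changed` is hypothetical and this is not Onionoo's actual code. It only illustrates how an unchanged file could be skipped, so that it costs a read instead of a write that then has to be replicated.

```python
import os
import tempfile
from pathlib import Path


def write_if_changed(path: Path, new_content: bytes) -> bool:
    """Write new_content to path only if it differs from what is on disk.

    Hypothetical helper. Returns True if the file was (re)written,
    False if it was left untouched.
    """
    try:
        if path.read_bytes() == new_content:
            return False  # identical content: skip the write entirely
    except FileNotFoundError:
        pass  # no previous version exists, so we have to write

    # Write to a temporary file in the same directory, then rename it
    # into place so readers never observe a half-written file.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(new_content)
        os.replace(tmp_name, path)
    except BaseException:
        os.unlink(tmp_name)  # clean up the temporary file on failure
        raise
    return True
```

Whether this helps in practice depends on how many files really are unchanged between runs, and on whether the extra reads are served from cache rather than from disk.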
A napkin calculation tells me we're writing about 50GiB of data on the
disk every hour. That seems like a *lot*!
(I base this on the graphs, which seem to average about 36MiB/s for 35
minutes on onionoo-backend-01, which means around 56GiB. oo-hetzner-03
writes 21MiB/s for 35 minutes, which means about 46GiB. The average of
the two is about 50GiB.)
Is that about right? What *are* you writing in there? :)
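
The napkin calculation above can be re-run as a short script, assuming binary units (1 GiB = 1024 MiB) and taking the rates and the 35-minute duration exactly as stated. The function name `gib_written` is made up for this sketch. Note that these inputs give totals somewhat different from the figures quoted above, so the graph readings are presumably rough; the order of magnitude, tens of GiB per hourly run, is the same either way.

```python
MIB_PER_GIB = 1024


def gib_written(rate_mib_per_s: float, minutes: float) -> float:
    """Total data written, in GiB, at a sustained rate over a duration."""
    return rate_mib_per_s * minutes * 60 / MIB_PER_GIB


# Rates and duration as read off the graphs in the comment above.
backend_01 = gib_written(36, 35)
hetzner_03 = gib_written(21, 35)
average = (backend_01 + hetzner_03) / 2

print(f"onionoo-backend-01: {backend_01:.1f} GiB")  # 73.8 GiB
print(f"oo-hetzner-03:      {hetzner_03:.1f} GiB")  # 43.1 GiB
print(f"average:            {average:.1f} GiB")     # 58.4 GiB
```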
> There's very little we can do about the CPU load. We already use
> optimized parsing libraries for JSON, and quite simple parsers for Tor
> descriptors. Metrics does involve some computation. If CPU load is a
> problem then perhaps the Ganeti cluster is the wrong place for Onionoo to
> live and we need something else.
I don't mind the CPU load so much, actually. We have some capacity there.
And we do have the capacity on the network too: it's a gigabit link after
all. It's just that this single node is already taking 10% of the capacity
during those peaks, so I was worried it was an anomaly.
But maybe this is much ado about nothing. It just seems strange that
we write all that data all the time...
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/32660#comment:4>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online