[metrics-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

Tor Bug Tracker & Wiki blackhole at torproject.org
Tue Dec 3 16:28:56 UTC 2019


#32660: onionoo-backend is killing the ganeti cluster
-----------------------------+------------------------------
 Reporter:  anarcat          |          Owner:  metrics-team
     Type:  defect           |         Status:  new
 Priority:  Medium           |      Milestone:
Component:  Metrics/Onionoo  |        Version:
 Severity:  Normal           |     Resolution:
 Keywords:                   |  Actual Points:
Parent ID:                   |         Points:
 Reviewer:                   |        Sponsor:
-----------------------------+------------------------------

Comment (by anarcat):

 >  Let's first find out what's happening there. We were planning to stop
 this instance this afternoon and set up a new one on the same host. If we
 don't know what's going wrong, we might see the same issue with the new
 instance.

 > So, this seems like something that is caused by the hourly updater. Can
 you tell us if omeiense and/or oo-hetzner-03 have similar loads at roughly
 the same timing?

 They do look similar, now that you mention it:

 https://grafana.torproject.org/d/ER3U2cqmk/node-exporter-server-metrics?orgId=1&var-node=omeiense.torproject.org:9100&var-node=onionoo-backend-01.torproject.org:9100&var-node=oo-hetzner-03.torproject.org:9100

 The peak bandwidth usage is higher on onionoo-backend-01, but that might
 just be because its disks are faster: the peak is higher but shorter, so
 the total transfer size is probably equivalent.

 > I have suspended the hourly updater on this host, but this is the normal
 expected operation for Onionoo's hourly updater. There are state files
 that get updated every run including those for which nothing has changed.
 This is a known flaw in Onionoo but until now it hasn't been a problem.

 Okay, maybe I'm being overly cautious then. It might become a problem in
 the medium to long term in the new cluster because of the way it is
 structured: all writes are replicated between the two servers, because
 they act as a redundant cluster. If one node goes down, the other can take
 over on the fly. It also allows us to migrate the machines between the two
 servers more or less in real time.

 Maybe we should make an exception for this host, and keep the data only on
 one server. This would have a few implications:

  1. if the server goes down, we need to restore from backups, so there's
 up to 24h of data loss
  2. when we reboot the servers, the machine will go down for the duration
 of the reboot
  3. moving the machine around if we decommission the server will require
 manual work

 > karsten is going to look at how difficult it would be to reduce the
 number of writes performed. Is the problem total IO or is it just the
 writes? Are reads cached? i.e. if we read it again to compare before
 writing, does that help?
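
 The "read it again to compare before writing" idea can be sketched as
 follows. This is only an illustration in Python, not Onionoo's actual
 code; the function name `write_if_changed` and the file name used in the
 usage note are hypothetical. It trades one read per file for skipping the
 write when nothing changed, which matters here because it is the writes
 that get replicated between the two servers. Whether the extra reads are
 cheap depends on how much of the state fits in the page cache.

```python
import os
import tempfile
from pathlib import Path


def write_if_changed(path: Path, new_content: bytes) -> bool:
    """Write new_content to path only if it differs from what is on disk.

    Returns True if the file was (re)written, False if it was left alone.
    """
    try:
        if path.read_bytes() == new_content:
            # Identical content: skip the write entirely.
            return False
    except FileNotFoundError:
        # No previous version, so we have to write.
        pass

    # Write to a temporary file in the same directory, then atomically
    # replace the target so readers never see a half-written file.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(new_content)
        os.replace(tmp_name, path)
    except BaseException:
        os.unlink(tmp_name)
        raise
    return True
```

 For example, calling `write_if_changed(Path("state.json"), data)` twice
 with the same `data` only touches the disk the first time.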

 A napkin calculation tells me we're writing about 50GiB of data on the
 disk every hour. That seems like a *lot*!

 (I base this on the graphs, which seem to average about 36MiB/s for 35
 minutes on onionoo-backend-01, which means around 56GiB. oo-hetzner-03
 writes 21MiB/s for 35 minutes, which means about 46GiB. The rough average
 of the two is about 50GiB.)

 Is that about right? What *are* you writing in there? :)
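
 The napkin calculation is just rate times duration, and can be rechecked
 with a few lines of Python. The helper name `transfer_gib` is made up for
 this sketch, and the inputs are the averages eyeballed from the graphs.
 With the rates and the 35 minute duration exactly as quoted, the totals
 come out somewhat different from the 56GiB and 46GiB given above, which
 suggests the graph readings are rougher than stated. Either way, the order
 of magnitude (tens of GiB written per hourly run) holds.

```python
MIB_PER_GIB = 1024


def transfer_gib(rate_mib_per_s: float, minutes: float) -> float:
    """Total data transferred, in GiB, at a sustained rate for a duration."""
    return rate_mib_per_s * minutes * 60 / MIB_PER_GIB


# Average write rates and duration as read off the graphs (approximate).
backend_01 = transfer_gib(36, 35)
hetzner_03 = transfer_gib(21, 35)

print(f"onionoo-backend-01: {backend_01:.1f} GiB")  # 73.8 GiB
print(f"oo-hetzner-03:      {hetzner_03:.1f} GiB")  # 43.1 GiB
print(f"average:            {(backend_01 + hetzner_03) / 2:.1f} GiB")  # 58.4 GiB
```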

 > There's very little we can do about the CPU load. We already use
 optimized parsing libraries for JSON, and quite simple parsers for Tor
 descriptors. Metrics does involve some computation. If CPU load is a
 problem then perhaps the Ganeti cluster is the wrong place for Onionoo to
 live and we need something else.

 I don't mind the CPU load so much, actually: we have some capacity there.
 And we do have the capacity on the network too; it's a gigabit link after
 all. It's just that this single node is already taking 10% of that
 capacity during those peaks, so I was worried it was an anomaly.

 But maybe this is much ado about nothing. It just seems strange that we
 write all that data all the time...

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/32660#comment:4>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

