[tor-bugs] #29816 [Internal Services/Tor Sysadmin Team]: replace "Tor VM hosts" spreadsheet with Grafana dashboard

Wed Apr 17 20:49:35 UTC 2019

#29816: replace "Tor VM hosts" spreadsheet with Grafana dashboard
-------------------------------------------------+-------------------------
 Reporter:  anarcat                              |          Owner:  anarcat
     Type:  task                                 |         Status:
                                                 |  assigned
 Priority:  Medium                               |      Milestone:
Component:  Internal Services/Tor Sysadmin Team  |        Version:
 Severity:  Minor                                |     Resolution:
 Keywords:                                       |  Actual Points:
Parent ID:                                       |         Points:
 Reviewer:                                       |        Sponsor:
-------------------------------------------------+-------------------------
Description changed by anarcat:

Old description:

> Our KVM allocation strategy is currently managed through a Google
> spreadsheet. This is suboptimal for a few reasons:
>
>  1. it is hard to keep up to date - for example, moly is not listed in
> there even though it's in LDAP as a "KVM host"
>
>  2. it's not real-time data - for example, even if a host is allocated
> one vCPU, it might be totally idle most of the time and doing mostly
> network or disk I/O, while another one might hit the CPU hard. Actual load
> is what matters
>
>  3. it's hosted by Google - that has a few problems, the most important
> of which is that some TPA do not actually *want* to use Google services
> and might be reluctant to update it, worsening problem 1
>
> I propose we shift this to a Grafana dashboard. I already have a
> prototype in the form of the [https://grafana.com/dashboards/405 Node
> exporter server metrics Grafana Dashboard] which shows basic stats for
> multiple hosts in parallel. I set the default of the dashboard in Grafana
> to show the 6 KVM hosts:
>
> <https://grafana.torproject.org/d/ER3U2cqmk/node-exporter-server-metrics?orgId=1&from=now-12h&to=now&var-node=kvm4.torproject.org:9100&var-node=kvm5.torproject.org:9100&var-node=macrum.torproject.org:9100&var-node=moly.torproject.org:9100&var-node=textile.torproject.org:9100&var-node=unifolium.torproject.org:9100>
>
> That looks like this:
>
> [[Image(https://screenshotscdn.firefoxusercontent.com/images/444d04c8-bea4-4ac9-803e-5397126877a2.png, 700px)]]
>
> .. but it's not ideal:
>
>  * it's showing irrelevant stats for this purpose like context switches
> or detailed disk or memory stats
>
>  * it's missing critical information like the number of KVM guests hosted
> on the machine, how many CPUs and how much disk space are allocated, and
> so on
>
> This is the information we should be showing:
>
>  * disk capacity vs allocation
>  * disk utilization
>  * CPU count vs allocation
>  * actual CPU utilization
>  * load?
>  * memory capacity vs allocation
>  * actual memory usage
>
> Some of that information currently lives *only* in the spreadsheet. For
> example, disk allocations are only available there, as the KVM guests run
> on QCOW (QEMU Copy On Write) images that only take space when actually
> used by the guest. This has the advantage of allowing us to
> over-provision, but means we must keep that metadata somewhere else.
>
> So for now it's in the spreadsheet, but we could find a way to move it
> somewhere Prometheus can scrape. One trick is that the Prometheus node
> exporter can expose metrics stored as text files in
> `/var/lib/prometheus/node-exporter/*.prom`. This is how the smartctl and
> APT metrics get shipped, for example: a cron job (well, a systemd timer)
> regularly writes that file, atomically. So one option could be to move
> this information to (say) LDAP or Puppet/Hiera and write it into that
> file using a cron job (LDAP) or Puppet (Hiera).
>
> Then we'd build a custom Grafana dashboard and get rid of the other
> spreadsheet.
>
> A stop-gap measure might be to simplify the spreadsheet and move it to a
> plain text markdown file. We would lose the automatic calculations the
> spreadsheet provides, in exchange for easier updating and transparency.

New description:

 Our KVM allocation strategy is currently managed through a Google
 spreadsheet. This is suboptimal for a few reasons:

  1. it is hard to keep up to date - for example, moly is not listed in
 there even though it's in LDAP as a "KVM host"

  2. it's not real-time data - for example, even if a host is allocated one
 vCPU, it might be totally idle most of the time and doing mostly network
 or disk I/O, while another one might hit the CPU hard. Actual load is what
 matters

  3. it's hosted by Google - that has a few problems, the most important of
 which is that some TPA do not actually *want* to use Google services and
 might be reluctant to update it, worsening problem 1

 I propose we shift this to a Grafana dashboard. I already have a prototype
 in the form of the [https://grafana.com/dashboards/405 Node exporter
 server metrics Grafana Dashboard] which shows basic stats for multiple
 hosts in parallel. I set the default of the dashboard in Grafana to show
 the 6 KVM hosts:

 <https://grafana.torproject.org/d/ER3U2cqmk/node-exporter-server-metrics?orgId=1&from=now-12h&to=now&var-node=kvm4.torproject.org:9100&var-node=kvm5.torproject.org:9100&var-node=macrum.torproject.org:9100&var-node=moly.torproject.org:9100&var-node=textile.torproject.org:9100&var-node=unifolium.torproject.org:9100>

 That looks like this:

 [[Image(https://paste.anarc.at/snaps/snap-2019.04.17-16.48.43.png, 700px)]]

 .. but it's not ideal:

  * it's showing irrelevant stats for this purpose like context switches or
 detailed disk or memory stats

  * it's missing critical information like the number of KVM guests hosted
 on the machine, how many CPUs and how much disk space are allocated, and
 so on

 This is the information we should be showing:

  * disk capacity vs allocation
  * disk utilization
  * CPU count vs allocation
  * actual CPU utilization
  * load?
  * memory capacity vs allocation
  * actual memory usage
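
 As a rough sketch of where the utilization half of that list could come
 from, here are PromQL expressions kept in a small Python table, plus a
 helper that builds a Prometheus HTTP API query URL. The metric names
 assume node exporter 0.16 or later, the base URL passed in is a
 placeholder, and the dictionary keys are made up for illustration. The
 allocation figures are left out because no exporter provides them yet.

```python
from urllib.parse import urlencode

# PromQL expressions for the "actual usage" items; names are illustrative.
QUERIES = {
    "cpu_count": 'count by (instance) (node_cpu_seconds_total{mode="idle"})',
    "cpu_utilization": '1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))',
    "load": "node_load5",
    "memory_capacity": "node_memory_MemTotal_bytes",
    "memory_used": "node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes",
    "disk_utilization": "1 - node_filesystem_avail_bytes / node_filesystem_size_bytes",
}


def query_url(base_url, name):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urlencode({"query": QUERIES[name]})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"
```

 The same expressions could be pasted into Grafana panels directly; the
 URL helper is only there to make them easy to try out by hand.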

 Some of that information currently lives *only* in the spreadsheet. For
 example, disk allocations are only available there, as the KVM guests run
 on QCOW (QEMU Copy On Write) images that only take space when actually
 used by the guest. This has the advantage of allowing us to
 over-provision, but means we must keep that metadata somewhere else.
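
 For what it's worth, the gap between promised and used space can also be
 read from the images themselves. The sketch below assumes file-backed
 images that `qemu-img info --output=json` can open, and relies on the
 `virtual-size` and `actual-size` fields of its report (both in bytes).
 The helper names are made up for illustration.

```python
import json
import subprocess


def qemu_img_info(path):
    """Run qemu-img on one image file and return its JSON report as text."""
    result = subprocess.run(
        ["qemu-img", "info", "--output=json", path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def image_sizes(info_json):
    """Return (virtual, actual) sizes in bytes from a qemu-img JSON report."""
    info = json.loads(info_json)
    # "actual-size" is optional in the report, so fall back to None.
    return info["virtual-size"], info.get("actual-size")


def overcommit_ratio(virtual_sizes, capacity_bytes):
    """Promised disk space over physical capacity; above 1.0 is over-provisioned."""
    return sum(virtual_sizes) / capacity_bytes
```

 On a host with running guests, `qemu-img` may refuse to open an image
 that is locked by QEMU unless `--force-share` (`-U`) is passed.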

 So for now it's in the spreadsheet, but we could find a way to move it
 somewhere Prometheus can scrape. One trick is that the Prometheus node
 exporter can expose metrics stored as text files in
 `/var/lib/prometheus/node-exporter/*.prom`. This is how the smartctl and
 APT metrics get shipped, for example: a cron job (well, a systemd timer)
 regularly writes that file, atomically. So one option could be to move
 this information to (say) LDAP or Puppet/Hiera and write it into that file
 using a cron job (LDAP) or Puppet (Hiera).
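
 A minimal sketch of the "write that file, atomically" step, assuming the
 allocation figures have already been pulled from LDAP or Hiera into a
 dictionary. The `kvm_allocated_*` metric names and the function names are
 hypothetical; the only real constraint is that the final file ends in
 `.prom` inside the directory the node exporter reads.

```python
import os
import tempfile


def format_metrics(allocations):
    """Render allocation figures in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(allocations.items()):
        metric = f"kvm_allocated_{name}"  # hypothetical metric name
        lines.append(f"# TYPE {metric} gauge")
        lines.append(f"{metric} {value}")
    return "\n".join(lines) + "\n"


def write_atomically(path, content):
    """Write to a temporary file next to path, then rename it into place."""
    directory = os.path.dirname(path) or "."
    # The ".tmp" suffix keeps the partial file out of the *.prom glob.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(content)
        os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

 Writing under a temporary name in the same directory and renaming
 matters because the rename is atomic within one filesystem, so the
 exporter never reads a half-written file.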

 Then we'd build a custom Grafana dashboard and get rid of the other
 spreadsheet.

 A stop-gap measure might be to simplify the spreadsheet and move it to a
 plain text markdown file. We would lose the automatic calculations the
 spreadsheet provides, in exchange for easier updating and transparency.

--

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/29816#comment:5>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online