[tor-bugs] #29816 [Internal Services/Tor Sysadmin Team]: replace "Tor VM hosts" spreadsheet with Grafana dashboard
Tor Bug Tracker & Wiki
blackhole at torproject.org
Wed Apr 17 20:49:35 UTC 2019
#29816: replace "Tor VM hosts" spreadsheet with Grafana dashboard
-------------------------------------------------+-------------------------
Reporter: anarcat | Owner: anarcat
Type: task | Status: assigned
Priority: Medium | Milestone:
Component: Internal Services/Tor Sysadmin Team | Version:
Severity: Minor | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-------------------------------------------------+-------------------------
Description changed by anarcat:
Old description:
> Our KVM allocation strategy is currently managed through a Google
> spreadsheet. This is suboptimal for a few reasons:
>
> 1. it is hard to keep up to date - for example, moly is not listed in
> there even though it's in LDAP as a "KVM host"
>
> 2. it's not real time data - for example, even if a host is allocated
> one vCPU, it might be totally idle most of the time and doing mostly
> network or disk, while another one might hit the CPU hard. actual load is
> what matters
>
> 3. it's hosted by Google - that has a few problems, the most important
> of which is that some TPA do not actually *want* to use Google services
> and might be reluctant to update it, worsening problem 1
>
> I propose we shift this to a Grafana dashboard. I already have a
> prototype in the form of the [https://grafana.com/dashboards/405 Node
> exporter server metrics Grafana Dashboard] which shows multiple hosts
> basic stats in parallel. I set the default of the dashboard in Grafana to
> show the 6 KVM hosts:
>
> <https://grafana.torproject.org/d/ER3U2cqmk/node-exporter-server-
> metrics?orgId=1&from=now-12h&to=now&var-node=kvm4.torproject.org:9100
> &var-node=kvm5.torproject.org:9100&var-node=macrum.torproject.org:9100
> &var-node=moly.torproject.org:9100&var-node=textile.torproject.org:9100
> &var-node=unifolium.torproject.org:9100>
>
> That looks like this:
>
> [[Image(https://screenshotscdn.firefoxusercontent.com/images/444d04c8-bea4-4ac9
> -803e-5397126877a2.png, 700px)]]
>
> .. but it's not ideal:
>
> * it's showing irrelevant stats for this purpose like context switches
> or detailed disk or memory stats
>
> * it's missing critical information like the number of KVM guests hosted
> on the machine, how many CPUs and disk space is allocated and so on
>
> This is the information we should be showing:
>
> * disk capacity vs allocation
> * disk utilization
> * CPU count vs allocation
> * actual CPU utilization
> * load?
> * memory capacity vs allocation
> * actual memory usage
>
> Some of that information currently lives *only* in the spreadsheet. For
> example, disk allocations are only available there, as the KVM guests run
> on QCOW (Qemu Copy On Write) filesystems that only take space when
> actually used by the guest. This has the advantage of allowing us to
> over-provision, but means we must keep that metadata somewhere else.
>
> So for now it's in the spreadsheet, but we could find a way to move it
> somewhere Prometheus can scrape. One trick that Prometheus has is that it
> can expose metrics stored as text files in `/var/lib/prometheus/node-
> exporter/*.prom`. This is how the smartctl and APT metrics get shipped
> for example: a cron job (well, a systemd timer) regularly writes that
> file, atomically. So one option could be to move this information to
> (say) LDAP or Puppet/Hiera and write that information into that file
> using a cronjob (LDAP) or Puppet (Hiera).
>
> Then we'd build a custom Grafana dashboard and get rid of the other
> spreadsheet.
>
> A stop-gap measure might be to simplify the spreadsheet and move it to a
> plain text markdown file. We would lose the automatic calculations the
> spreadsheet provide, in exchange for easier updating and transparency.
New description:
Our KVM allocation strategy is currently managed through a Google
spreadsheet. This is suboptimal for a few reasons:
1. it is hard to keep up to date - for example, moly is not listed in
there even though it's in LDAP as a "KVM host"
2. it's not real-time data - for example, even if a host is allocated one
vCPU, it might be totally idle most of the time and doing mostly network
or disk I/O, while another one might hit the CPU hard. Actual load is
what matters
3. it's hosted by Google - that has a few problems, the most important of
which is that some TPA members do not actually *want* to use Google
services and might be reluctant to update it, worsening problem 1
I propose we shift this to a Grafana dashboard. I already have a prototype
in the form of the [https://grafana.com/dashboards/405 Node exporter
server metrics Grafana Dashboard], which shows basic stats for multiple
hosts in parallel. I set the default of the dashboard in Grafana to show
the 6 KVM hosts:
<https://grafana.torproject.org/d/ER3U2cqmk/node-exporter-server-metrics?orgId=1&from=now-12h&to=now&var-node=kvm4.torproject.org:9100&var-node=kvm5.torproject.org:9100&var-node=macrum.torproject.org:9100&var-node=moly.torproject.org:9100&var-node=textile.torproject.org:9100&var-node=unifolium.torproject.org:9100>
That looks like this:
[[Image(https://paste.anarc.at/snaps/snap-2019.04.17-16.48.43.png, 700px)]]
... but it's not ideal:
* it's showing stats that are irrelevant for this purpose, like context
switches or detailed disk or memory stats
* it's missing critical information like the number of KVM guests hosted
on the machine, how many CPUs and how much disk space are allocated, and
so on
This is the information we should be showing:
* disk capacity vs allocation
* disk utilization
* CPU count vs allocation
* actual CPU utilization
* load?
* memory capacity vs allocation
* actual memory usage
Some of that information currently lives *only* in the spreadsheet. For
example, disk allocations are only available there, as the KVM guests run
on QCOW (QEMU Copy On Write) disk images that only take space when
actually used by the guest. This has the advantage of allowing us to
over-provision, but means we must keep that metadata somewhere else.
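To make the "capacity vs allocation" idea concrete, here is a minimal sketch of the kind of calculation the spreadsheet does for us today. The guest names, sizes and host capacity are made up for illustration; they are not real TPA data.

```python
GIB = 1024 ** 3


def overcommit_ratio(allocated, capacity):
    """Return allocated / capacity; a value above 1.0 means over-provisioned."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return allocated / capacity


# Hypothetical guests and the virtual disk size promised to each of them.
guest_disk_allocations = {"guest-a": 40 * GIB, "guest-b": 80 * GIB}
# Hypothetical physical disk capacity of the KVM host.
host_disk_capacity = 100 * GIB

ratio = overcommit_ratio(sum(guest_disk_allocations.values()), host_disk_capacity)
print(f"{ratio:.2f}")  # prints 1.20
```

A ratio above 1.0 is only safe as long as the guests do not actually fill their images, which is why the dashboard would need both the allocation and the real disk utilization.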
So for now it's in the spreadsheet, but we could find a way to move it
somewhere Prometheus can scrape. One trick is that the Prometheus node
exporter can expose metrics stored as text files in
`/var/lib/prometheus/node-exporter/*.prom`. This is how the smartctl and
APT metrics get shipped, for example: a cron job (well, a systemd timer)
regularly writes that file, atomically. So one option could be to move
this information to (say) LDAP or Puppet/Hiera and write it into such a
file using a cron job (LDAP) or Puppet (Hiera).
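As a rough sketch of what such a job could look like, assuming the allocations have already been fetched from LDAP or Hiera into a dictionary: the metric name `kvm_guest_allocated_disk_bytes` and the guest names are hypothetical, not something that exists today. The file is written to a temporary name in the same directory and then renamed, so the exporter never reads a half-written file.

```python
import os
import tempfile


def render_metrics(allocations):
    """Render {guest: bytes} in the Prometheus text exposition format."""
    lines = [
        # Hypothetical metric name, chosen only for this example.
        "# HELP kvm_guest_allocated_disk_bytes Disk space allocated to a KVM guest.",
        "# TYPE kvm_guest_allocated_disk_bytes gauge",
    ]
    for guest, size in sorted(allocations.items()):
        lines.append('kvm_guest_allocated_disk_bytes{guest="%s"} %d' % (guest, size))
    return "\n".join(lines) + "\n"


def write_atomically(path, content):
    """Write content to path via a temporary file and an atomic rename."""
    directory = os.path.dirname(path) or "."
    # The temporary file does not end in .prom, so it is not picked up
    # while it is still being written.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            tmp_file.write(content)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())
        # mkstemp creates the file with mode 0600; make it world-readable.
        os.chmod(tmp_path, 0o644)
        # Renaming within one directory is atomic on POSIX filesystems.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A timer job would then call something like `write_atomically("/var/lib/prometheus/node-exporter/kvm_allocations.prom", render_metrics(allocations))`, where the `kvm_allocations.prom` file name is again only an assumption.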
Then we'd build a custom Grafana dashboard and get rid of the
spreadsheet.
A stop-gap measure might be to simplify the spreadsheet and move it to a
plain text markdown file. We would lose the automatic calculations the
spreadsheet provides, in exchange for easier updating and transparency.
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/29816#comment:5>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online