[tor-bugs] #33941 [Internal Services/Tor Sysadmin Team]: Nagios checks for op-??.onionperf.torproject.net
Tor Bug Tracker & Wiki
blackhole at torproject.org
Tue Apr 28 14:59:36 UTC 2020
#33941: Nagios checks for op-??.onionperf.torproject.net
-------------------------------------------------+---------------------
Reporter: karsten | Owner: tpa
Type: task | Status: new
Priority: Medium | Milestone:
Component: Internal Services/Tor Sysadmin Team | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-------------------------------------------------+---------------------
Comment (by anarcat):
> Hmm, or would #31945 be an option?
Yes, that's exactly what I had in mind: Prometheus has a bunch of
exporters for various things. You can write your own, but stuff like disk
space is already covered by the node exporter. We install it on all TPA
hosts, but it can also be installed on third-party hosts and then scraped
by our secondary Prometheus server.
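To make that concrete, here is a minimal sketch of the scrape job our
Prometheus server might use for such a third-party host. The job name and
the op-ab host name are made up for this example; 9100 is the node
exporter's default port.
{{{
# prometheus.yml fragment (hypothetical): scrape the node exporter
# running on a third-party OnionPerf host. The job name and target
# are illustrative, not an agreed configuration.
scrape_configs:
  - job_name: 'onionperf-node'
    static_configs:
      - targets: ['op-ab.onionperf.torproject.net:9100']
}}}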
> To be consumed by either Nagios or Prometheus?
That would be consumed by Prometheus for now.
> I hope that I don't have to learn much about Prometheus but can treat it
as a black box that runs an application-specific check script and sends me
an alert if something's broken.
That's the rub, isn't it. :) One issue we currently have with Prometheus
is that we don't have alerting set up. It's part of the ongoing
conversation with the metrics team (in #31159), and we crossed a
significant milestone this week when we set up a plugin in Grafana that
can display a history of availability probes. It's not alerting, but it's
an easy-to-use dashboard that shows when something is down.
Alerting is definitely on the roadmap too, but it will require a little
more research before we get it going...
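For context, the kind of probe that feeds such a dashboard is typically
collected with the blackbox exporter. Here is a sketch of what that
scrape job could look like, assuming an ICMP probe module, a made-up
target, and the exporter on its default port (9115):
{{{
# Hypothetical blackbox exporter job: probe a host and record
# probe_success over time, which is what an availability dashboard
# graphs. The target and module are assumptions for this sketch.
scrape_configs:
  - job_name: 'availability'
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets: ['op-ab.onionperf.torproject.net']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115  # where the blackbox exporter runs
}}}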
> If that's an impossible requirement I'll have to make new plans about
keeping an AWS instance alive that I'd prefer to terminate.
It's not impossible, but it's true that it might be more complex than
setting up a Nagios check, which you're familiar with. That said, I would
definitely want to avoid reimplementing the node exporter or NRPE checks:
if you write a check, export application-level metrics and please do not
reimplement stuff like disk checks (see the sketch below). :)
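As a rough sketch of what an application-level check could look like,
here is a minimal exporter using the Python prometheus_client library.
The metric name, results path, and port are assumptions for illustration,
not an agreed interface:
{{{
#!/usr/bin/env python3
# Hypothetical application-level exporter sketch: expose the age of
# the newest OnionPerf result file as a Prometheus gauge, instead of
# reimplementing system-level checks like disk space. The metric name,
# glob path and port below are illustrative assumptions.
import glob
import os
import time

from prometheus_client import Gauge, start_http_server

RESULTS_GLOB = '/srv/onionperf/results/*.json'  # assumed location

last_result_age = Gauge(
    'onionperf_last_result_age_seconds',
    'Seconds since the most recent OnionPerf result file was written')

def update():
    files = glob.glob(RESULTS_GLOB)
    if files:
        newest = max(os.path.getmtime(f) for f in files)
        last_result_age.set(time.time() - newest)

if __name__ == '__main__':
    start_http_server(9183)  # arbitrary port chosen for this sketch
    while True:
        update()
        time.sleep(60)
}}}
Prometheus would then scrape that port like any other exporter, and an
alerting rule could fire on something like
onionperf_last_result_age_seconds > 3600.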
One problem with monitoring system-level metrics in Prometheus is that you
need to write alerting rules for each metric yourself. While Nagios checks
have built-in thresholds (e.g. "80% disk use is warning", "90% is
critical"), Prometheus metrics are just that: metrics; you need to write
your own alerting rules. This is consistent with the philosophy that
alerting should be reserved for application-specific metrics, for which
there are no good templates, but of course it makes our job harder for
now.
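To make the difference concrete, here is a sketch of what reproducing
those Nagios-style disk thresholds would look like as Prometheus alerting
rules, using the node exporter's standard filesystem metrics; the group
name, alert names, and durations are just examples:
{{{
# Hypothetical alerting rules reproducing Nagios-style disk thresholds
# from the node exporter's filesystem metrics. Names and "for"
# durations are illustrative, not a proposed policy.
groups:
  - name: disk-example
    rules:
      - alert: DiskUsageWarning
        expr: >
          (1 - node_filesystem_avail_bytes{fstype!="tmpfs"}
             / node_filesystem_size_bytes{fstype!="tmpfs"}) * 100 > 80
        for: 15m
        labels:
          severity: warning
      - alert: DiskUsageCritical
        expr: >
          (1 - node_filesystem_avail_bytes{fstype!="tmpfs"}
             / node_filesystem_size_bytes{fstype!="tmpfs"}) * 100 > 90
        for: 15m
        labels:
          severity: critical
}}}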
Would you be okay with a dashboard for now? We get that out of the box
with Prometheus + node exporter + Grafana, and we probably could set that
up by the end of the month. That would cover trending.
Alerting is more complicated, because it involves either breaking our
rules with the Nagios server, or implementing alerting in Prometheus.
Neither is "impossible", but both are more work than what I'd commit to
this week.
Maybe we could keep the AWS instance for another week or so? Don't we pay
for this by the minute anyway? :)
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/33941#comment:6>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online