[metrics-bugs] #33941 [Internal Services/Tor Sysadmin Team]: Nagios checks for op-??.onionperf.torproject.net

Mon Apr 27 14:24:07 UTC 2020

#33941: Nagios checks for op-??.onionperf.torproject.net
-------------------------------------------------+---------------------
 Reporter:  karsten                              |          Owner:  tpa
     Type:  task                                 |         Status:  new
 Priority:  Medium                               |      Milestone:
Component:  Internal Services/Tor Sysadmin Team  |        Version:
 Severity:  Normal                               |     Resolution:
 Keywords:                                       |  Actual Points:
Parent ID:                                       |         Points:
 Reviewer:                                       |        Sponsor:
-------------------------------------------------+---------------------

Comment (by anarcat):

 I am trying to adopt a ground principle of monitoring -- or, to be more
 accurate, alerting -- which follows your line of thought: alert on user-
 level metrics (like access time, status codes, integrity) and not system-
 level metrics (like disk space, CPU usage). If the disk space runs out,
 the other metrics will show it, and alerting will ring. (Or, kind of
 perversely: if they do *not* show it, then it's fine and you didn't need
 to follow system-level metrics anyways).

 (Monitoring resources like disk space and CPU usage is not an alerting
 problem, it's the other part of "monitoring", which I call "trending":
 e.g. creating graphs or prediction algorithms on monitored resources to
 proactively fix problems before they come up.)
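
 To make "trending" a bit more concrete, here is a minimal sketch of such a
 prediction: a least-squares line fitted over recent free-space samples,
 extrapolated to the moment the disk would be full. The function name and
 the sample layout are made up for illustration, and it assumes usage grows
 roughly linearly. Prometheus has a built-in `predict_linear()` query
 function for this kind of question, so this only shows the idea.

```python
def seconds_until_full(samples):
    """Estimate seconds until free space reaches zero.

    samples: list of (unix_timestamp, free_bytes) pairs, oldest first.
    Returns None when there is too little data or free space is not shrinking.
    (Hypothetical helper, not part of any monitoring tool.)
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Ordinary least-squares slope, in bytes per second.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope >= 0:
        return None
    intercept = mean_v - slope * mean_t
    # Time at which the fitted line crosses zero free bytes.
    t_zero = -intercept / slope
    return t_zero - samples[-1][0]
```

 The result is measured from the last sample, so it can feed a graph or a
 low-urgency warning rather than a page.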

 Now the reason I like Prometheus better than Nagios is that it allows us
 to do both using the same metrics system. But I understand if you would
 prefer to stick with what you know.

 I will just mention that Prometheus metrics are extremely simple to
 implement, even easier than Nagios: you have an HTTP endpoint which
 exports metrics like:

 {{{
 metric_name value
 }}}

 You also usually add labels like in this real-life (but simplified)
 example:

 {{{
 node_filesystem_avail_bytes{alias="alberti.torproject.org",device="/dev/sda1",fstype="ext4",mountpoint="/"} 16160059392
 }}}

 So if you feel like writing code this week, I would recommend
 instrumenting your service to respond to that emerging standard instead of
 the old Nagios paradigm. ;)
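
 As a rough idea of how little code that takes, here is a minimal sketch
 using only the Python standard library. The metric name
 `example_filesystem_free_bytes`, the `/metrics` path and the port are
 assumptions picked for this example, not anything OnionPerf or our
 Prometheus setup defines. A real service would probably use an existing
 client library instead.

```python
import shutil
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(path="/"):
    """Render one gauge in the Prometheus text exposition format."""
    free = shutil.disk_usage(path).free
    # Metric name and label are hypothetical, chosen for this sketch.
    return (
        "# TYPE example_filesystem_free_bytes gauge\n"
        f'example_filesystem_free_bytes{{mountpoint="{path}"}} {free}\n'
    )


class MetricsHandler(BaseHTTPRequestHandler):
    """Answer GET /metrics with the current metric values."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve metrics on localhost until interrupted (the port is arbitrary)."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

 Calling `serve()` and fetching `http://127.0.0.1:8000/metrics` should
 return the two lines produced by `render_metrics()`, which is the whole
 contract a Prometheus server needs in order to scrape the service.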

 See also:

 https://openmetrics.io/
 https://prometheus.io/docs/instrumenting/exposition_formats/

 I'll also mention that our Nagios setup is a little... hermetic: I
 wouldn't actually *know* how to add a check for a machine completely
 outside of tor in there. It might be possible if the check is only run
 from the Nagios host, but I'm not sure what the consequences would be if
 we tried.

 Really, we set up that second Prometheus server exactly for that use case,
 because we did *not* want to monitor external resources with Nagios (and also
 to expand the use of Prometheus inside the team, because people were
 excited about it!)... So I'd like to stick with that policy if possible...

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/33941#comment:3>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online