[metrics-bugs] #33941 [Internal Services/Tor Sysadmin Team]: Nagios checks for op-??.onionperf.torproject.net
Tor Bug Tracker & Wiki
blackhole at torproject.org
Mon Apr 27 14:24:07 UTC 2020
#33941: Nagios checks for op-??.onionperf.torproject.net
-------------------------------------------------+---------------------
Reporter: karsten | Owner: tpa
Type: task | Status: new
Priority: Medium | Milestone:
Component: Internal Services/Tor Sysadmin Team | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-------------------------------------------------+---------------------
Comment (by anarcat):
I am trying to adopt a ground principle of monitoring -- or, to be more
accurate, alerting -- which follows your line of thought: alert on user-
level metrics (like access time, status codes, integrity) and not system-
level metrics (like disk space, CPU usage). If the disk space runs out,
the other metrics will show it, and alerting will ring. (Or, kind of
perversely: if they do *not* show it, then it's fine and you didn't need
to follow system-level metrics anyways).
(Monitoring resources like disk space and CPU usage is not an alerting
problem, it's the other part of "monitoring", which I call "trending":
e.g. creating graphs or prediction algorithms on monitored resources to
proactively fix problems before they come up.)
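As a rough illustration of what I mean by "trending", here is a minimal Python sketch that fits a straight line to free-disk-space samples and estimates when the disk would fill up. The function name `seconds_until_full` and the sample layout are made up for this example. Prometheus offers something in the same spirit with its `predict_linear()` query function, so you would normally not write this by hand.

```python
from typing import List, Optional, Tuple


def seconds_until_full(samples: List[Tuple[float, float]]) -> Optional[float]:
    """Estimate the seconds remaining until free space reaches zero.

    `samples` is a list of (unix_timestamp, free_bytes) pairs, oldest first.
    Returns None when there is too little data or free space is not shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp, no trend to fit
    # Ordinary least-squares slope, in bytes per second.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope >= 0:
        return None  # free space is stable or growing
    intercept = mean_v - slope * mean_t
    t_zero = -intercept / slope  # time at which the fitted line hits zero
    return max(0.0, t_zero - samples[-1][0])
```

A trending job could graph that estimate or open a ticket when it drops below a few days, without ever paging anyone.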
Now the reason I like Prometheus better than Nagios is that it allows us
to do both using the same metrics system. But I understand if you would
prefer to stick with what you know.
I will just mention that Prometheus metrics are extremely simple to
implement, even easier than Nagios: you have an HTTP endpoint which
exports metrics like:
{{{
metric_name value
}}}
You also usually add labels like in this real-life (but simplified)
example:
{{{
node_filesystem_avail_bytes{alias="alberti.torproject.org",device="/dev/sda1",fstype="ext4",mountpoint="/"} 16160059392
}}}
So if you feel like writing code this week, I would recommend
instrumenting your service to respond to that emerging standard instead of
the old Nagios paradigm. ;)
See also:
https://openmetrics.io/
https://prometheus.io/docs/instrumenting/exposition_formats/
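To give an idea of how little code the text format above needs, here is a minimal sketch using only the Python standard library. The metric names (`example_*`), the `/metrics` path, and port 8000 are assumptions for illustration, not something our setup prescribes. A real service would more likely use an existing Prometheus client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, Optional


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_metric(name: str, value: float,
                  labels: Optional[Dict[str, str]] = None) -> str:
    """Render one sample as `name{label="value",...} value` on a single line."""
    if labels:
        pairs = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in labels.items()
        )
        return f"{name}{{{pairs}}} {value}"
    return f"{name} {value}"


def collect() -> str:
    """Build the response body. These metrics are hypothetical placeholders."""
    samples = [
        format_metric("example_last_measurement_success", 1),
        format_metric("example_download_seconds", 2.5, {"filesize": "50k"}),
    ]
    return "\n".join(samples) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = collect().encode("utf-8")
        self.send_response(200)
        # Content type used by the Prometheus text exposition format.
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host: str = "127.0.0.1", port: int = 8000) -> None:
    """Serve metrics until interrupted. Host and port are arbitrary defaults."""
    HTTPServer((host, port), MetricsHandler).serve_forever()
```

Calling `serve()` and fetching `http://127.0.0.1:8000/metrics` should return the two sample lines, which a Prometheus server could then scrape.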
I'll also mention that our Nagios setup is a little... hermetic: I
wouldn't actually *know* how to add a check for a machine completely
outside of tor in there. It might be possible if the check is only run
from the Nagios host, but I'm not sure what the consequences would be if
we tried.
Really, we set up that second Prometheus server exactly for that use case,
because we did *not* want to monitor external resources with Nagios (and also
to expand the use of Prometheus inside the team, because people were
excited about it!)... So I'd like to stick with that policy if possible...
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/33941#comment:3>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online