[metrics-bugs] #33941 [Internal Services/Tor Sysadmin Team]: Nagios checks for op-??.onionperf.torproject.net
Tor Bug Tracker & Wiki
blackhole at torproject.org
Tue Apr 28 14:59:36 UTC 2020
#33941: Nagios checks for op-??.onionperf.torproject.net
-------------------------------------------------+---------------------
Reporter: karsten | Owner: tpa
Type: task | Status: new
Priority: Medium | Milestone:
Component: Internal Services/Tor Sysadmin Team | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-------------------------------------------------+---------------------
Comment (by anarcat):
> Hmm, or would #31945 be an option?
Yes, that's exactly what I had in mind: Prometheus has a bunch of
exporters for various things. You can write your own, but stuff like disk
space is covered with the node exporter. We install it on all TPA hosts, but
it can also be installed on third-party hosts and then scraped by our
secondary Prometheus server.
> To be consumed by either Nagios or Prometheus?
That would be consumed by Prometheus for now.
> I hope that I don't have to learn much about Prometheus but can treat it
> as a black box that runs an application-specific check script and sends me
> an alert if something's broken.
That's the rub, isn't it. :) One issue we currently have in Prometheus is
that we don't have alerting set up. It's part of the ongoing conversation with
the metrics team (in #31159) and we crossed a significant landmark this
week when we set up a plugin in Grafana that can display a history of
availability probes. It's not alerting, but it's an easy-to-use dashboard
that shows when something is down.
The alerting is definitely on the roadmap too, but will require a little
more research before we get it going...
> If that's an impossible requirement I'll have to make new plans about
> keeping an AWS instance alive that I'd prefer to terminate.
It's not impossible, but it's true it might be more complex than setting
up a Nagios check, which you're familiar with. I would definitely want to
avoid reimplementing the node exporter or NRPE checks, that said: if you
write a check, check application-level metrics and do not reimplement
stuff like disk checks please. :)
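To make "application-level metrics" a bit more concrete, here is a minimal
sketch of a check script that writes one metric into a file for the node
exporter's textfile collector (the directory passed to
`--collector.textfile.directory`, which picks up `*.prom` files). The metric
name `onionperf_last_measurement_age_seconds`, the function names and the
idea of tracking the age of the last measurement are all assumptions made up
for illustration, not something OnionPerf or TPA provides today.

```python
import os
import tempfile
import time


def render_metrics(last_measurement_ts, now=None):
    """Render one application-level gauge in the Prometheus text format.

    The metric name is hypothetical; pick whatever describes the service.
    """
    now = time.time() if now is None else now
    age = max(0.0, now - last_measurement_ts)
    lines = [
        "# HELP onionperf_last_measurement_age_seconds "
        "Seconds since the last completed measurement (hypothetical).",
        "# TYPE onionperf_last_measurement_age_seconds gauge",
        "onionperf_last_measurement_age_seconds %.0f" % age,
    ]
    return "\n".join(lines) + "\n"


def write_textfile(directory, name, content):
    """Write <name>.prom atomically so a partial file is never scraped."""
    # The temporary file does not end in .prom, so the collector skips it.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    os.chmod(tmp_path, 0o644)
    final_path = os.path.join(directory, name + ".prom")
    os.replace(tmp_path, final_path)
    return final_path
```

Run from cron, something like this would leave system-level metrics such as
disk space to the node exporter and only report what the application itself
knows.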
One problem with monitoring system-level metrics in Prometheus is that you
need to write alerting rules for each metric yourself. While Nagios checks
have built-in thresholds (e.g. "80% disk use is warning", "90% is
critical"), Prometheus metrics are just that: metrics; you need to write
your own alerting rules. This is in line with the philosophy of alerting
being for application-specific metrics, for which there are no good
templates, but of course it makes our job harder for now.
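For illustration, the threshold logic that a Nagios disk check ships with
looks roughly like the sketch below. It is not the actual plugin code, and
the function names are made up. The 80/90 values are the ones from the
example above, and the return values follow the usual Nagios plugin exit
codes (0 OK, 1 WARNING, 2 CRITICAL). In Prometheus the same comparison would
have to be written by hand as an alerting rule over node exporter metrics
such as `node_filesystem_size_bytes` and `node_filesystem_avail_bytes`.

```python
# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def disk_use_percent(size_bytes, avail_bytes):
    """Percentage of the filesystem in use, from size and available bytes."""
    return 100.0 * (size_bytes - avail_bytes) / size_bytes


def classify(percent, warn=80.0, crit=90.0):
    """Map a usage percentage to a Nagios-style state."""
    if percent >= crit:
        return CRITICAL
    if percent >= warn:
        return WARNING
    return OK
```

For example, `classify(disk_use_percent(100, 15))` is a warning, since 85%
use sits between the two thresholds.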
Would you be okay with a dashboard for now? We get that out of the box
with Prometheus + node exporter + Grafana, and we probably could set that
up by the end of the month. That would cover trending.
Alerting is more complicated, because it involves either breaking our
rules with the Nagios server, or implementing alerting in Prometheus.
Neither is "impossible", but both are more work than what I'd commit to this
week.
Maybe we could keep the AWS instance for another week or so? Don't we pay
for this by the minute anyway? :)
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/33941#comment:6>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online