[tor-bugs] #29864 [Internal Services/Tor Sysadmin Team]: consider replacing nagios with prometheus
Tor Bug Tracker & Wiki
blackhole at torproject.org
Fri Mar 22 20:11:55 UTC 2019
#29864: consider replacing nagios with prometheus
-----------------------------------------------------+-----------------
Reporter: anarcat | Owner: tpa
Type: project | Status: new
Priority: Low | Milestone:
Component: Internal Services/Tor Sysadmin Team | Version:
Severity: Major | Keywords:
Actual Points: | Parent ID:
Points: | Reviewer:
Sponsor: |
-----------------------------------------------------+-----------------
As a followup to the Prometheus/Grafana setup started in #29681, I am
wondering if we should also consider replacing the Nagios/Icinga server
with Prometheus. I have done a little research on the subject and figured
it might be good to at least document the current state of affairs.
This would remove a complex piece of architecture we have at TPO that was
designed before Puppet was properly deployed. Prometheus has an
interesting federated design that allows it to scale to multiple machines
easily, along with a high availability component for the alertmanager that
allows it to be more reliable than a traditional Nagios configuration. It
would also simplify our architecture as the Nagios server automation is a
complex mix of Debian packages and git hooks that is serving us well, but
hard to comprehend and debug for new administrators. (I managed to wipe
the entire Nagios config myself on my first week on the job by messing up
a configuration file.) Having the monitoring server fully deployed by
Puppet would be a huge improvement, even if it would be done with Nagios
instead of Prometheus, of course.
Right now the Nagios server is actually running Icinga 1.13, a Nagios
fork, on a Hetzner machine (`hetzner-hel1-01`). It's doing its job
generally well, although it feels a *little* noisy, but that's to be
expected from Nagios servers. Reducing the number of alerts seems to be an
objective, explicitly documented in #29410, for example.
Both Grafana and Prometheus can do alerting, with various mechanisms and
plugins. I haven't investigated those deeply, but in general that's not
the hard part of alerting: you fire some script or API call and the rest
happens. I suspect we could port the current Nagios alerting scripts to
Prometheus fairly easily, although I haven't investigated our scripts in
detail.
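To give an idea of the general shape, here is a minimal Alertmanager
configuration sketch (the receiver names, address and webhook URL are
made-up placeholders, not our actual setup): Prometheus evaluates the
alerting rules, and the Alertmanager takes care of grouping, routing and
delivery, whether by email or by calling out to an arbitrary script
behind a webhook.

{{{
# alertmanager.yml -- minimal sketch, all names and addresses are placeholders
route:
  receiver: admins-email           # default receiver (hypothetical name)
  group_by: ['alertname', 'instance']
  group_wait: 30s                  # batch alerts for the same group
  repeat_interval: 4h              # re-notify while the alert keeps firing

receivers:
  - name: admins-email
    email_configs:
      - to: 'admins@example.org'   # placeholder; SMTP settings omitted
  - name: irc-webhook              # e.g. a small IRC relay exposing HTTP
    webhook_configs:
      - url: 'http://localhost:8099/alert'   # hypothetical local relay
}}}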
The problem is reproducing the check scripts and their associated alert
thresholds. In the Nagios world, when a check is installed, it *comes* with
its own health thresholds ("OK", "WARNING", "CRITICAL"), and TPO has
developed a wide variety of such checks. According to the current Nagios
dashboard, it monitors 4612 services on 88 hosts (which is interesting
considering LDAP thinks there are 78). That looks terrifying, but it's
actually a set of 9 commands running on the Nagios server, including the
complex `check_nrpe` system, which is basically a client-side Nagios that
has its own set of checks. And that's where the "cardinality explosion"
happens: on a typical host, there are 315 such checks implemented.
That's the hard part: converting those 324 checks into Prometheus alerts, one
at a time. Unfortunately, there are no "built-in" or even "third-party"
"prometheus alert sets" that I could find in my
[https://anarc.at/blog/2018-01-17-monitoring-prometheus/ original
research], although that might have changed in the last year.
Each check in Prometheus is basically a YAML file describing a Prometheus
query that, when it evaluates to "true" (e.g. disk_space > 90%), sends an
alert. It's not impossible to do that conversion, it's just a lot of work.
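For example, a rule roughly equivalent to a Nagios disk space warning
might look like the following sketch (it assumes the standard
node_exporter filesystem metrics; the threshold and rule names are made
up):

{{{
# disk.rules.yml -- sketch of a Prometheus alerting rule, assuming node_exporter
groups:
  - name: disk
    rules:
      - alert: DiskAlmostFull
        # fire when less than 10% of a filesystem is free (made-up threshold)
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 10
        for: 15m          # must hold for 15 minutes, to avoid flapping
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }}: {{ $labels.mountpoint }} is almost full"
}}}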
To do this progressively while allowing us to make new alerts on
Prometheus instead of Nagios, I suggest proceeding the same way Cloudflare
did, which is to establish a "Nagios to Prometheus" bridge: Nagios stops
sending alerts on its own and instead forwards them to the Prometheus
Alertmanager, through a plugin they called
[https://github.com/cloudflare/promsaint Promsaint].
With the bridge in place, Nagios checks can be migrated into Prometheus
alerts progressively without disruption. Note that Cloudflare documented
their experience with Prometheus in [https://promcon.io/2017-munich/talks
/monitoring-cloudflares-planet-scale-edge-network-with-prometheus/ this
2017 promcon talk]. Cloudflare also made a
[https://github.com/cloudflare/unsee alert dashboard] and an
[https://github.com/cloudflare/alertmanager2es elasticsearch integration]
which might be good to investigate further.
Another useful piece is this
[https://www.robustperception.io/nagios-nrpe-prometheus-exporter NRPE to
Prometheus exporter], which allows Prometheus to directly scrape NRPE
targets. It doesn't include Prometheus alerts and instead relies on a
Grafana dashboard to show possible problems, so I don't think it's that
useful an alternative on its own.
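That said, if we do end up scraping NRPE directly (step 5 below), the
exporter follows the usual Prometheus multi-target pattern, so the scrape
job would look roughly like this (a sketch only: the metrics path,
parameter name, port and target names are assumptions to verify against
the exporter's documentation):

{{{
# prometheus.yml excerpt -- sketch of scraping NRPE through the exporter;
# path, parameter name and port are assumptions to check against its docs
scrape_configs:
  - job_name: nrpe
    metrics_path: /export             # assumed scrape endpoint of the exporter
    params:
      command: [check_load]           # NRPE command to run on each target
    static_configs:
      - targets: ['somehost.torproject.org:5666']   # placeholder NRPE host
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target  # pass the NRPE host as a URL parameter
      - source_labels: [__param_target]
        target_label: instance        # keep the real host as the instance label
      - target_label: __address__
        replacement: 127.0.0.1:9275   # assumed address of the NRPE exporter
}}}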
So, the battle plan is basically this:
1. `apt install prometheus-alertmanager`
2. reimplement the Nagios alerting commands
3. send Nagios alerts through the alertmanager
4. rewrite (non-NRPE) commands (9) as Prometheus alerts
5. optionally, scrape the NRPE metrics with Prometheus
6. optionally, create a dashboard and/or alerts for the NRPE metrics
7. rewrite NRPE commands (300+) as Prometheus alerts (see the example rule after this list)
8. turn off the Nagios server
9. remove all traces of NRPE on all nodes
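To give an idea of what steps 4 and 7 look like in practice, converting
something like an NRPE `check_load` is mostly a matter of finding the
equivalent node_exporter metric and picking a threshold. A sketch (the
metric names come from node_exporter, the threshold is made up):

{{{
# load.rules.yml -- sketch of a check_load-style check rewritten as a rule
groups:
  - name: load
    rules:
      - alert: HighLoad
        # 15-minute load average above twice the CPU count (made-up threshold)
        expr: node_load15 > 2 * count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "high load on {{ $labels.instance }}"
}}}

Each of the 300+ NRPE checks needs a similar translation, which is why I
expect step 7 to be the bulk of the work.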
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/29864>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online