[tor-bugs] #29864 [Internal Services/Tor Sysadmin Team]: consider replacing nagios with prometheus
Tor Bug Tracker & Wiki
blackhole at torproject.org
Fri Mar 22 20:11:55 UTC 2019
#29864: consider replacing nagios with prometheus
-----------------------------------------------------+-----------------
Reporter: anarcat | Owner: tpa
Type: project | Status: new
Priority: Low | Milestone:
Component: Internal Services/Tor Sysadmin Team | Version:
Severity: Major | Keywords:
Actual Points: | Parent ID:
Points: | Reviewer:
Sponsor: |
-----------------------------------------------------+-----------------
As a followup to the Prometheus/Grafana setup started in #29681, I am
wondering if we should also consider replacing the Nagios/Icinga server
with Prometheus. I have done a little research on the subject and figured
it might be good to at least document the current state of affairs.
This would remove a complex piece of architecture we have at TPO that was
designed before Puppet was properly deployed. Prometheus has an
interesting federated design that allows it to scale to multiple machines
easily, along with a high availability component for the alertmanager that
allows it to be more reliable than a traditional Nagios configuration. It
would also simplify our architecture as the Nagios server automation is a
complex mix of Debian packages and git hooks that is serving us well, but
hard to comprehend and debug for new administrators. (I managed to wipe
the entire Nagios config myself on my first week on the job by messing up
a configuration file.) Having the monitoring server fully deployed by
Puppet would be a huge improvement, even if it would be done with Nagios
instead of Prometheus, of course.
Right now the Nagios server is actually running Icinga 1.13, a Nagios
fork, on a Hetzner machine (`hetzner-hel1-01`). It's doing its job
generally well, although it feels a *little* noisy, but that's to be
expected from Nagios servers. Reducing the number of alerts seems to be an
objective, explicitly documented in #29410, for example.
Both Grafana and Prometheus can do alerting, with various mechanisms and
plugins. I haven't investigated those deeply, but in general that's not
the hard part of alerting: you fire some script or API call and the rest
happens. I suspect we could port the current Nagios alerting scripts to
Prometheus fairly easily, although I haven't investigated our scripts in
detail.
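To give an idea of the general shape, here is a minimal Alertmanager
configuration sketch (the receiver names, address and webhook URL are
made-up placeholders, not our actual setup): Prometheus evaluates the
alerting rules, and the Alertmanager takes care of grouping, routing and
delivery, whether by email or by calling out to an arbitrary script
behind a webhook.

{{{
# alertmanager.yml -- minimal sketch, all names and addresses are placeholders
route:
  receiver: admins-email           # default receiver (hypothetical name)
  group_by: ['alertname', 'instance']
  group_wait: 30s                  # batch alerts for the same group
  repeat_interval: 4h              # re-notify while the alert keeps firing

receivers:
  - name: admins-email
    email_configs:
      - to: 'admins@example.org'   # placeholder; SMTP settings omitted
  - name: irc-webhook              # e.g. a small IRC relay exposing HTTP
    webhook_configs:
      - url: 'http://localhost:8099/alert'   # hypothetical local relay
}}}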
The problem is reproducing the check scripts and their associated alert
thresholds. In the Nagios world, when a check is installed, it *comes* with
its own health thresholds ("OK", "WARNING", "CRITICAL"), and TPO has
developed a wide variety of such checks. According to the current Nagios
dashboard, it monitors 4612 services on 88 hosts (which is interesting
considering LDAP thinks there are 78). That looks terrifying, but it's
actually a set of 9 commands running on the Nagios server, including the
complex `check_nrpe` system, which is basically a client-side Nagios that
has its own set of checks. And that's where the "cardinality explosion"
happens: on a typical host, there are 315 such checks implemented.
That's the hard part: converting those 324 checks into Prometheus alerts, one
at a time. Unfortunately, there are no "built-in" or even "third-party"
"prometheus alert sets" that I could find in my
[https://anarc.at/blog/2018-01-17-monitoring-prometheus/ original
research], although that might have changed in the last year.
Each check in Prometheus is basically a YAML file describing a Prometheus
query that, when it evaluates to "true" (e.g. disk_space > 90%), sends an
alert. It's not impossible to do that conversion, it's just a lot of work.
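For example, a rule roughly equivalent to a Nagios disk space warning
might look like the following sketch (it assumes the standard
node_exporter filesystem metrics; the threshold and rule names are made
up):

{{{
# disk.rules.yml -- sketch of a Prometheus alerting rule, assuming node_exporter
groups:
  - name: disk
    rules:
      - alert: DiskAlmostFull
        # fire when less than 10% of a filesystem is free (made-up threshold)
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 10
        for: 15m          # must hold for 15 minutes, to avoid flapping
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }}: {{ $labels.mountpoint }} is almost full"
}}}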
To do this progressively while allowing us to make new alerts on
Prometheus instead of Nagios, I suggest proceeding the same way Cloudflare
did, which is to establish a "Nagios to Prometheus" bridge: Nagios stops
sending alerts on its own and instead forwards them to the Prometheus
Alertmanager, through a plugin they called
[https://github.com/cloudflare/promsaint Promsaint].
With the bridge in place, Nagios checks can be migrated into Prometheus
alerts progressively without disruption. Note that Cloudflare documented
their experience with Prometheus in [https://promcon.io/2017-munich/talks
/monitoring-cloudflares-planet-scale-edge-network-with-prometheus/ this
2017 promcon talk]. Cloudflare also made a
[https://github.com/cloudflare/unsee alert dashboard] and an
[https://github.com/cloudflare/alertmanager2es elasticsearch integration]
which might be good to investigate further.
Another useful piece is this
[https://www.robustperception.io/nagios-nrpe-prometheus-exporter NRPE to
Prometheus exporter], which allows Prometheus to directly scrape NRPE
targets. It doesn't include Prometheus alerts and instead relies on a
Grafana dashboard to show possible problems, so I don't think it's that
useful an alternative on its own.
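That said, if we do end up scraping NRPE directly (step 5 below), the
exporter follows the usual Prometheus multi-target pattern, so the scrape
job would look roughly like this (a sketch only: the metrics path,
parameter name, port and target names are assumptions to verify against
the exporter's documentation):

{{{
# prometheus.yml excerpt -- sketch of scraping NRPE through the exporter;
# path, parameter name and port are assumptions to check against its docs
scrape_configs:
  - job_name: nrpe
    metrics_path: /export             # assumed scrape endpoint of the exporter
    params:
      command: [check_load]           # NRPE command to run on each target
    static_configs:
      - targets: ['somehost.torproject.org:5666']   # placeholder NRPE host
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target  # pass the NRPE host as a URL parameter
      - source_labels: [__param_target]
        target_label: instance        # keep the real host as the instance label
      - target_label: __address__
        replacement: 127.0.0.1:9275   # assumed address of the NRPE exporter
}}}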
So, the battle plan is basically this:
1. `apt install prometheus-alertmanager`
2. reimplement the Nagios alerting commands
3. send Nagios alerts through the alertmanager
4. rewrite (non-NRPE) commands (9) as Prometheus alerts
5. optionally, scrape the NRPE metrics with Prometheus
6. optionally, create a dashboard and/or alerts for the NRPE metrics
7. rewrite NRPE commands (300+) as Prometheus alerts (see the example rule after this list)
8. turn off the Nagios server
9. remove all traces of NRPE on all nodes
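To give an idea of what steps 4 and 7 look like in practice, converting
something like an NRPE `check_load` is mostly a matter of finding the
equivalent node_exporter metric and picking a threshold. A sketch (the
metric names come from node_exporter, the threshold is made up):

{{{
# load.rules.yml -- sketch of a check_load-style check rewritten as a rule
groups:
  - name: load
    rules:
      - alert: HighLoad
        # 15-minute load average above twice the CPU count (made-up threshold)
        expr: node_load15 > 2 * count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "high load on {{ $labels.instance }}"
}}}

Each of the 300+ NRPE checks needs a similar translation, which is why I
expect step 7 to be the bulk of the work.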
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/29864>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online