[tor-dev] Fallback metrics service operator
Karsten Loesing
karsten at torproject.org
Thu Aug 8 15:28:27 UTC 2013
Hi Sebastian,
we discussed that you could become the fallback metrics service operator
and take over operation whenever I have to go on a four-week
round-the-world cruise or in the unlikely case that my application for
the one-way trip to Mars [0] goes through. This is awesome! Thanks a lot!
Moving this discussion to tor-dev@ in case this is interesting to others.
So, you asked what the actual tasks are and how much time I think
they'll take.
Let's start by taking a look at this overview [1] of the various tasks
that are performed by metrics services. Going through the various tool
names stated in brackets, starting with the ones that I think should
have highest priority:
- metrics-db-* is where we collect all the fine Tor network data. The
cronjob behind metrics-db-R has to run once per hour, or our archives
will be missing consensuses and votes published in that hour. The other
metrics-db-*'s can handle short downtimes better than metrics-db-R.
Beware that this is not the best code you've ever read.
- metrics-web-* is the metrics website, including the necessary
PostgreSQL and Tomcat parts. I'm not spending much time on this
codebase anymore, because we should really do something like Thomas does
in his Visionion project. But until that's there, metrics-web is all we
have. And I think it has quite a few visitors. So, if it breaks, we
should fix it ASAP.
- task-6498, task-8462, and task-2718 are additions to metrics-web which
should be properly integrated into metrics-web.git. I should do this,
maybe together with you.
- Onionoo powers Atlas, Compass, and a few other tools. It has become
quite complex over its 1.5 years of existence. It consists of two
parts, a Java backend and a Tomcat frontend. It's actually one of the
few codebases I maintain that has pretty high test coverage. I'm trying
hard to keep this service running 24/7, and I'm impressed by how fast
people report it when it breaks for just half an hour.
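By the way, since people notice Onionoo outages so quickly, a tiny
external freshness check can be handy. Here's a rough Python 2 sketch;
the /summary endpoint, the relays_published field, and the six-hour
threshold are assumptions on my part, so double-check them against the
Onionoo protocol page before relying on it:

#!/usr/bin/env python
# Minimal Onionoo freshness check (sketch). Assumptions: the public
# /summary endpoint is reachable at the URL below and returns a JSON
# document with a "relays_published" timestamp in UTC, formatted
# "YYYY-MM-DD hh:mm:ss"; the six-hour threshold is just a placeholder.
import json
import urllib2
from datetime import datetime, timedelta

ONIONOO_URL = "https://onionoo.torproject.org/summary?limit=1"
MAX_AGE = timedelta(hours=6)

def check_freshness():
    document = json.load(urllib2.urlopen(ONIONOO_URL))
    published = datetime.strptime(document["relays_published"],
                                  "%Y-%m-%d %H:%M:%S")
    age = datetime.utcnow() - published
    if age > MAX_AGE:
        print("WARNING: newest Onionoo relay data is %s old" % age)
    else:
        print("OK: relay data last published %s UTC" % published)

if __name__ == "__main__":
    check_freshness()

Something like that in a cron job would give you a heads-up before the
first "Onionoo is down" mail arrives.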
Those are the highest priority ones. Looking at the lower priorities:
- tor produces descriptors and statistics in its extra-info descriptors
that we collect. The only thing I do is nag directory authority
operators and the bridge authority operator if something's wrong, so
that our descriptor archives will be as complete as possible.
- Torperf is our performance measurement tool. You know it from a few
years ago when you wrote a Python controller for it. It hasn't changed
much since then. I started rewriting it in Twisted, but that's still
work in progress. I'm afraid that if Torperf breaks, we'll have to fix it.
People care about the performance data we gather.
- TorDNSEL produces exit lists that we download and archive. If
TorDNSEL breaks, metrics-db starts complaining, so we learn that it
needs to be fixed (a rough staleness check along those lines is sketched
after this list). It's always unclear who's going to fix it, but in
the end somebody always does.
- BridgeDB exports bridge pool assignments that we sanitize and archive.
There's currently a bug in BridgeDB making these files not as useful as
they could be (#9264). Bugging the BridgeDB maintainer(s) about these
things is part of the metrics service operator job, because we care
about archives being complete.
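And here's a similarly rough sketch of the kind of staleness check I
mean for the TorDNSEL exit list. The check.torproject.org URL, the
"LastStatus" line format, and the three-hour threshold are assumptions,
so adjust them to whatever the real setup uses:

#!/usr/bin/env python
# Rough TorDNSEL exit list staleness check (sketch). Assumptions: the
# exit list is reachable at the URL below and contains lines like
# "LastStatus YYYY-MM-DD hh:mm:ss" in UTC; the three-hour threshold is
# a placeholder.
import urllib2
from datetime import datetime, timedelta

EXIT_LIST_URL = "https://check.torproject.org/exit-addresses"
MAX_AGE = timedelta(hours=3)

def newest_last_status():
    newest = None
    for line in urllib2.urlopen(EXIT_LIST_URL):
        if line.startswith("LastStatus "):
            timestamp = datetime.strptime(line.split(" ", 1)[1].strip(),
                                          "%Y-%m-%d %H:%M:%S")
            if newest is None or timestamp > newest:
                newest = timestamp
    return newest

if __name__ == "__main__":
    newest = newest_last_status()
    if newest is None or datetime.utcnow() - newest > MAX_AGE:
        print("WARNING: exit list looks stale (newest LastStatus: %s)" % newest)
    else:
        print("OK: newest LastStatus %s UTC" % newest)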
And finally, here are the services that I'd say are not part of the
fallback metrics service operator job:
- Atlas is already co-maintained by Philipp since the dev meeting, and
he does a great job there so far.
- Compass isn't really something I operate. I don't know enough about
the web part behind it to fix it if it breaks horribly. I just deploy
patches after doing a quick plausibility check, and if they break, I
revert. I'd say if it breaks and I'm not around, leave it broken until
somebody else fixes it.
- DocTor may soon be replaced by a new tool that Damian is writing,
which will use the new descriptor-fetching stuff in Stem. Ideally, we'll
shut down DocTor two months from now, so that's nothing to worry about
from a metrics service operation point of view.
What did I miss?
So, how do we estimate how much time it will take you to get started?
Depends on how certain you want to be that you'll be able to handle
things if they break.
In a perfect world, I'd say that you should set up each service on a
different machine and run it for a while. And when we get new hardware
in a month or two, we should move services together. I'm aware that
this requires a lot of work on both your side and mine to understand the
code, make it easier to install, and document everything better. But
this would also improve the tools a lot.
In a maybe more realistic world, I'd say that you should take a look at
the codebases and at the service directories on yatei, and then we
discuss why things are set up as they are. This should also put you
in a fine position to rescue broken services, though you might have to
do more research when that happens.
Maybe there are more approaches? What do you prefer?
Oh, and to be honest, I didn't apply for the Mars thing. ;)
All the best,
Karsten
[0]
http://newsfeed.time.com/2013/05/09/78000-people-apply-for-one-way-trip-to-mars/
[1] https://metrics.torproject.org/tools.html