[tor-bugs] #28320 [Metrics/CollecTor]: Rewrite CollecTor relaydescs module using Stem/txtorcon
Tor Bug Tracker & Wiki
blackhole at torproject.org
Mon Nov 5 08:43:47 UTC 2018
#28320: Rewrite CollecTor relaydescs module using Stem/txtorcon
-----------------------------------+--------------------------
Reporter: karsten | Owner: metrics-team
Type: task | Status: new
Priority: Medium | Milestone:
Component: Metrics/CollecTor | Version:
Severity: Normal | Keywords:
Actual Points: | Parent ID:
Points: | Reviewer:
Sponsor: Sponsor13 |
-----------------------------------+--------------------------
The CollecTor service collects and archives data from various nodes and
services in the public Tor network. Internally, it consists of several
modules that run in the background on a predefined schedule. These
modules either download data from other hosts or process data that has
been copied from other hosts to the local file system. The processed data
is then served by a locally running static web server.
CollecTor is written in Java. It uses several APIs provided either in the
JDK or in third-party libraries. For example, it uses
`java.util.concurrent` for scheduling. However, it does not use a dedicated
framework for batch processing, so it has to solve challenges like the
following on its own:
- Scheduling: Make sure modules are running, say, once per hour; avoid
overlapping runs.
- Dependencies: Make sure that module runs don't interfere with each
other; one module writes newly obtained files to disk, another tars them
up, yet another writes an index file and provides that to external
applications.
- Shutdowns: Handle externally triggered shutdowns gracefully and make
sure the service resumes operation after reboot, without missing data.
These are just a few examples, and CollecTor does not resolve all of them
in the best way possible. It also feels like somebody must have solved
these challenges before. We should find out, and the best way is probably
to try it out in practice.
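To make the scheduling and shutdown challenges listed above a bit more concrete, here is a minimal sketch in plain Python, using only the standard library. It is not how CollecTor works today and not a proposal for a specific framework. `ModuleRunner` and all of its parameters are hypothetical names chosen for illustration. A single worker thread per module means a run can never overlap with the next one, and a `threading.Event` lets an external shutdown interrupt the wait between runs.

```python
import logging
import threading
import time


class ModuleRunner:
    """Runs one module periodically, never overlapping, until shut down.

    Hypothetical helper for illustration; not part of CollecTor.
    """

    def __init__(self, name, task, interval_seconds):
        self.name = name
        self.task = task
        self.interval = interval_seconds
        self.runs = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, name=name)

    def _loop(self):
        while not self._stop.is_set():
            started = time.monotonic()
            try:
                # The next run cannot start before this call returns,
                # so runs of the same module never overlap.
                self.task()
            except Exception:
                # A failed run is logged and must not stop future runs.
                logging.exception("run of module %s failed", self.name)
            self.runs += 1
            elapsed = time.monotonic() - started
            # Wait for the rest of the interval, but wake up right away
            # when shutdown() is called.
            self._stop.wait(max(0.0, self.interval - elapsed))

    def start(self):
        self._thread.start()

    def shutdown(self):
        """Request a stop and wait for the current run to finish."""
        self._stop.set()
        self._thread.join()
```

A caller would create one runner per module, for example `ModuleRunner("relaydescs", fetch_and_store, 3600)` with a hypothetical `fetch_and_store` function, call `start()`, and call `shutdown()` from a signal handler. This sketch does not cover dependencies between modules or resuming after a reboot without missing data; those are exactly the parts where an existing framework might help.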
In Mexico City we decided to evaluate existing batch-processing frameworks
by rewriting the CollecTor relaydescs module in Python with Stem or
txtorcon. It should be sufficient to make it work for at least consensuses
and server descriptors as an initial proof of concept. Other descriptor
types can follow later, if we decide to switch from Java to Python for
CollecTor.
The first steps are to write down requirements and possible Python
libraries for the batch-processing parts.
We're done with this task when we have a working prototype of CollecTor in
Python that fetches consensuses and server descriptors from the directory
authorities.
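As a rough starting point for the download part, the following sketch uses Stem's `stem.descriptor.remote` module. It is only an illustration under a few assumptions: the function names are hypothetical, the file naming pattern in `consensus_file_name` is an assumption made for this example, and the sketch relies on Stem picking directory authorities when no mirrors or endpoints are configured. The Stem imports are done lazily so that the naming helper can be used without Stem or network access.

```python
import datetime


def consensus_file_name(valid_after):
    """File name derived from the consensus valid-after time.

    The pattern is an assumption made for this sketch.
    """
    return valid_after.strftime("%Y-%m-%d-%H-%M-%S-consensus")


def fetch_consensus(timeout=60):
    """Download the current consensus as a single document.

    Requires Stem and network access.
    """
    import stem.descriptor
    import stem.descriptor.remote

    # Assumption: without mirrors or explicit endpoints, Stem sends the
    # request to one of the directory authorities.
    downloader = stem.descriptor.remote.DescriptorDownloader(timeout=timeout)
    query = downloader.get_consensus(
        document_handler=stem.descriptor.DocumentHandler.DOCUMENT)
    return query.run()[0]


def fetch_server_descriptors(timeout=60):
    """Download all server descriptors currently being served."""
    import stem.descriptor.remote

    downloader = stem.descriptor.remote.DescriptorDownloader(timeout=timeout)
    return list(downloader.get_server_descriptors().run())


if __name__ == "__main__":
    consensus = fetch_consensus()
    # Store the raw bytes so the archived file matches what was served.
    with open(consensus_file_name(consensus.valid_after), "wb") as out:
        out.write(consensus.get_bytes())
    print("fetched %d server descriptors" % len(fetch_server_descriptors()))
```

A real prototype would also need to query every authority rather than a single one, avoid fetching descriptors it already has, and write into the directory layout that CollecTor serves.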
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/28320>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>