[metrics-bugs] #28342 [Metrics/Statistics]: Share more code between modules
Tor Bug Tracker & Wiki
blackhole at torproject.org
Mon Nov 5 21:21:37 UTC 2018
#28342: Share more code between modules
------------------------------------+--------------------------
Reporter: karsten | Owner: metrics-team
Type: enhancement | Status: new
Priority: Medium | Milestone:
Component: Metrics/Statistics | Version:
Severity: Normal | Keywords:
Actual Points: | Parent ID:
Points: | Reviewer:
Sponsor: |
------------------------------------+--------------------------
We currently have nine data-processing modules in metrics-web. Each of
them reads descriptors from a local directory, aggregates them somehow,
and writes one or more CSV files to a local directory. Some modules use a
database for the aggregation part, others use state files.
Some of these modules have a lot of code in common. Yet they do not share
actual code other than what's in metrics-lib for reading and parsing
descriptors. This duplication means that fixes and improvements have to be
repeated in each module.
I'd like to approach this refactoring from a top-down perspective where we
generalize similar functionality and use it in all modules. The following
list is ordered by topic, not by priority:
- Configuration: Most of the modules can be configured in some way,
including database connection details or file system paths. It would be
easier to configure things once for all modules by using a single config
file, or to have reasonable defaults and modify single options via
command-line arguments.
- Scheduling: Right now, modules run once per day, invoked by cron. We
would like to run them more often, but we need to avoid overlapping runs
and handle shutdowns gracefully. A common scheduler might help with this.
- Descriptor reading and parsing: Each module has its own code for
reading and parsing descriptors using metrics-lib. This includes setting
paths where descriptors are located and paths for parse history files.
- Statistics: We have similar code for computing percentiles and other
statistics distributed over the code base. We might be able to generalize
these computations and provide a common math/statistics interface for
them. We should still use a math library, so this would be mostly a
wrapper for that library.
- Database access: Several of our modules have the same or very similar
code to connect to a database, import parsed descriptor parts into
tables, execute stored procedures for importing data, execute stored
procedures for aggregating data, query one or more result views, and
disconnect from the database. What we need is a more powerful API to
our databases than `java.sql`.
- Output: Our modules write one or more CSV files as their output. Some
modules treat missing values differently in the output, but this code is
mostly the same in all modules. Maybe this is still part of the database
access item above. If not, we should share some code across modules for
writing output files.
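To make the Output item above a bit more concrete, here is a minimal sketch of what a shared CSV row formatter could look like. The class name `CsvFormatter`, its constructor argument, and the `formatRow` method are all hypothetical and do not exist in metrics-web; the quoting rule is an assumption borrowed from common CSV practice. The only point is that the per-module difference (how a missing value is written) becomes a parameter, while the rest of the code is shared.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical shared CSV row formatter (not an existing metrics-web class).
 * The placeholder written for missing values is configurable per module.
 */
class CsvFormatter {

  /** Text written in place of a missing (null) cell, e.g. "" or "NA". */
  private final String missingValue;

  CsvFormatter(String missingValue) {
    this.missingValue = missingValue;
  }

  /** Formats one row, substituting the placeholder for null cells. */
  String formatRow(List<?> cells) {
    List<String> out = new ArrayList<>();
    for (Object cell : cells) {
      out.add(cell == null ? missingValue : escape(String.valueOf(cell)));
    }
    return String.join(",", out);
  }

  /**
   * Quotes a cell if it contains a comma, a double quote, or a line break,
   * doubling any embedded double quotes (assumed convention).
   */
  private static String escape(String cell) {
    if (cell.contains(",") || cell.contains("\"")
        || cell.contains("\n") || cell.contains("\r")) {
      return "\"" + cell.replace("\"", "\"\"") + "\"";
    }
    return cell;
  }
}
```

A module that writes empty fields for missing values would construct `new CsvFormatter("")`, and one that writes `NA` would construct `new CsvFormatter("NA")`. Whether this belongs in a separate output component or in the database access layer is exactly the open question raised above.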
This is not high priority, and it requires discussion prior to making any
code changes. This ticket is supposed to get us started here. (And I said
I wouldn't close #26035 before this ticket exists.)
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/28342>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online