[metrics-bugs] #28342 [Metrics/Statistics]: Share more code between modules

Tor Bug Tracker & Wiki blackhole at torproject.org
Mon Nov 5 21:21:37 UTC 2018


#28342: Share more code between modules
------------------------------------+--------------------------
     Reporter:  karsten             |      Owner:  metrics-team
         Type:  enhancement         |     Status:  new
     Priority:  Medium              |  Milestone:
    Component:  Metrics/Statistics  |    Version:
     Severity:  Normal              |   Keywords:
Actual Points:                      |  Parent ID:
       Points:                      |   Reviewer:
      Sponsor:                      |
------------------------------------+--------------------------
 We currently have nine data-processing modules in metrics-web. Each of
 them reads descriptors from a local directory, aggregates them somehow,
 and writes one or more CSV files to a local directory. Some modules use a
 database for the aggregation part, others use state files.

 Some of these modules have a lot of code in common. Yet they do not share
 actual code other than what's in metrics-lib for reading and parsing
 descriptors. The duplication means that a fix or improvement in one module
 has to be repeated in every other module that contains the same code.

 I'd like to approach this refactoring from a top-down perspective where we
 generalize similar functionality and use it in all modules. The following
 list is ordered by topic, not by priority:

  - Configuration: Most of the modules can be configured in some way,
 including database connection details or file system paths. It would be
 easier to configure things once for all modules by using a single config
 file, or to have reasonable defaults and modify single options via
 command-line arguments.

  - Scheduling: Right now, modules run once per day, started by
 cron. We would like to run them more often, but we need to avoid
 overlapping runs. And we need to handle shutdowns gracefully. A common
 scheduler might help with this.

  - Descriptor reading and parsing: Each module has its own code for
 reading and parsing descriptors using metrics-lib. This includes setting
 paths where descriptors are located and paths for parse history files.

  - Statistics: We have similar code for computing percentiles and other
 statistics distributed over the code base. We might be able to generalize
 these computations and provide a common math/statistics interface for
 them. We should still use a math library, so this would be mostly a
 wrapper for that library.

  - Database access: Several of our modules have the same or very similar
 code to connect to a database, import parsed descriptor parts into
 tables, execute stored procedures for importing data, execute stored
 procedures for aggregating data, query one or more result views, and
 disconnect from the database. What we need is a more powerful API to
 our databases than `java.sql`.

  - Output: Our modules write one or more CSV files as their output. Some
 modules treat missing values differently in the output, but this code is
 mostly the same in all modules. Maybe this is still part of the database
 access item above. If not, we should share some code across modules for
 writing output files.
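
 To make the scheduling item above more concrete, here is a minimal sketch
 of an in-process scheduler that avoids overlapping runs and shuts down
 gracefully. The class `ModuleScheduler` and its methods are hypothetical
 and do not exist in metrics-web. The sketch only assumes that a module run
 can be wrapped in a `Runnable`, and it uses nothing beyond
 `java.util.concurrent`.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical shared scheduler; illustration only, not metrics-web code. */
final class ModuleScheduler {

  // A single worker thread: at most one module run executes at any time.
  private final ScheduledExecutorService executor =
      Executors.newSingleThreadScheduledExecutor();

  /**
   * Runs the module repeatedly. With scheduleWithFixedDelay the delay only
   * starts once the previous run has finished, so runs cannot overlap.
   */
  void schedule(Runnable module, long delay, TimeUnit unit) {
    executor.scheduleWithFixedDelay(() -> {
      try {
        module.run();
      } catch (RuntimeException e) {
        // An exception escaping the task would cancel all future runs,
        // so report it and keep the schedule alive.
        System.err.println("Module run failed: " + e);
      }
    }, 0L, delay, unit);
  }

  /**
   * Stops scheduling further runs and waits for a run in progress to
   * finish. Returns true if everything terminated within the timeout.
   */
  boolean shutdownGracefully(long timeout, TimeUnit unit) {
    executor.shutdown();
    try {
      return executor.awaitTermination(timeout, unit);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

 A caller could register `shutdownGracefully` in a JVM shutdown hook via
 `Runtime.getRuntime().addShutdownHook(...)`. Whether an in-process
 scheduler like this should replace cron is part of the discussion this
 ticket asks for.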

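 For the output item above, shared code for writing CSV lines could look
 roughly like the following sketch. The class `CsvLines` and the idea of
 passing the missing-value marker to the constructor are assumptions made
 for illustration. The modules' actual conventions for missing values
 would have to be checked first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical shared CSV helper; names are illustrative only. */
final class CsvLines {

  // What to write for a missing (null) value, for example "" or "NA".
  private final String missingValue;

  CsvLines(String missingValue) {
    this.missingValue = missingValue;
  }

  /** Formats one row; null cells become the missing-value marker. */
  String formatRow(List<?> cells) {
    return cells.stream()
        .map(cell -> cell == null ? missingValue : escape(cell.toString()))
        .collect(Collectors.joining(","));
  }

  /** Formats the header line followed by one line per row. */
  List<String> formatAll(List<String> header, List<? extends List<?>> rows) {
    List<String> lines = new ArrayList<>();
    lines.add(formatRow(header));
    for (List<?> row : rows) {
      lines.add(formatRow(row));
    }
    return lines;
  }

  /** Quotes a field that contains a comma, a quote, or a line break. */
  private static String escape(String field) {
    boolean needsQuotes = field.contains(",") || field.contains("\"")
        || field.contains("\n") || field.contains("\r");
    return needsQuotes ? "\"" + field.replace("\"", "\"\"") + "\"" : field;
  }
}
```

 Each module would then differ only in the marker it passes in, such as
 `new CsvLines("")` or `new CsvLines("NA")`, while the formatting code is
 shared.
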
 This is not high priority, and it requires discussion prior to making any
 code changes. This ticket is supposed to get us started here. (And I said
 I wouldn't close #26035 before this ticket exists.)

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/28342>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online