[metrics-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

Tor Bug Tracker & Wiki blackhole at torproject.org
Mon May 7 14:43:47 UTC 2018


#26035: Streamline sample quantile types used in the various modules
------------------------------------+--------------------------
     Reporter:  karsten             |      Owner:  metrics-team
         Type:  enhancement         |     Status:  new
     Priority:  Medium              |  Milestone:
    Component:  Metrics/Statistics  |    Version:
     Severity:  Normal              |   Keywords:
Actual Points:                      |  Parent ID:
       Points:                      |   Reviewer:
      Sponsor:  Sponsor13           |
------------------------------------+--------------------------
 While documenting how to reproduce our various statistics, I noticed that
 we're using different methods/formulas for computing sample quantiles,
 that is, the median, quartiles, percentiles, and so on. Ideally, we would
 settle on one method and use that everywhere. The benefit is easier
 documentation and reproducibility.

 Here is a (probably still incomplete) list of graphs for which we
 calculate quantiles (with the tool we use given in parentheses):
  - [https://metrics.torproject.org/userstats-relay-country.html Relay
 users]: Median and inter-quartile range of ratios in censorship detector
 (Python, possibly Java soon)
  - [https://metrics.torproject.org/advbwdist-perc.html Advertised
 bandwidth distribution]: Percentiles, including the unusual 0-th
 percentile (Java) and median (R)
  - [https://metrics.torproject.org/advbwdist-relay.html Advertised
 bandwidth of n-th fastest relays]: Median (R)
  - [https://metrics.torproject.org/connbidirect.html Fraction of
 connections used uni-/bidirectionally]: Quartiles (Java)
  - [https://metrics.torproject.org/torperf.html Time to download files
 over Tor]: Quartiles (PostgreSQL)
  - [https://metrics.torproject.org/hidserv-dir-onions-seen.html Unique
 .onion addresses (version 2 only)]: Quartiles for weighted inter-quartile
 mean (Java)
  - [https://metrics.torproject.org/hidserv-rend-relayed-cells.html Onion-
 service traffic (versions 2 and 3)]: Quartiles for weighted inter-quartile
 mean (Java)

 There are surprisingly many ways to compute quantiles. I found the
 following links to be quite helpful:

  -
 https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample
  -
 https://www.rdocumentation.org/packages/stats/versions/3.5.0/topics/quantile

 Looking at the lists, we should probably pick two types: one discontinuous
 (`R-1` to `R-3`) and one continuous type (`R-4` to `R-9`). And ideally,
 we'd pick types that are either the defaults in the tools we're using or
 that we can easily select to use in those tools.
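
 To illustrate why the choice of type matters, here is a minimal sketch
 using only Python's standard `statistics` module (Python 3.8 or later).
 As far as I can tell, its `inclusive` method corresponds to `R-7` and its
 `exclusive` method to `R-6`; that mapping is my reading of the formulas,
 not something the module documents under those names. The sample is made
 up for illustration.

```python
import statistics

# Made-up sample: the integers 1 through 10.
sample = list(range(1, 11))

# Quartiles with positions based on p * (n - 1) + 1 (assumed to match R-7).
inclusive = statistics.quantiles(sample, n=4, method="inclusive")

# Quartiles with positions based on p * (n + 1) (assumed to match R-6).
exclusive = statistics.quantiles(sample, n=4, method="exclusive")

print(inclusive)  # [3.25, 5.5, 7.75]
print(exclusive)  # [2.75, 5.5, 8.25]
```

 Both are continuous types, and they agree on the median but not on the
 first and third quartile, which is exactly the kind of difference that
 makes our statistics harder to reproduce.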

 Going through our tools:
  - PostgreSQL has two functions, `PERCENTILE_CONT` and `PERCENTILE_DISC`,
 of which we already use the first. I did some experiments with a quite
 large sample set and found that `PERCENTILE_CONT` produces the exact same
 output as `R-7` and `PERCENTILE_DISC` must be either `R-1` or `R-2`. A
 math person might be able to say whether it's `R-1` or `R-2` by looking at
 the PostgreSQL source code. And maybe that person would be able to confirm
 the `R-7` part, too. It seems like we don't have the choice of using other
 types than these in PostgreSQL, though, or at least not easily.
  - R has support for all nine types. After all, they're named after this
 language. It seems like `R-7` is the default type.
  - Java with Apache Commons Math has support for all nine types, `R-1` to
 `R-9`. And in theory, the two types we need shouldn't be terribly hard to
 re-implement, in case we want to avoid pulling in this not-exactly-tiny
 library as a dependency.
  - Python with SciPy/NumPy probably has support for some types, but I
 guess we're not planning to keep our Python code anyway, so this doesn't
 really matter.
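
 To give an idea of how small a re-implementation could be, here is a rough
 sketch of `R-1`, `R-2`, and `R-7`, following the formulas on the Wikipedia
 page linked above. It's written in Python only for brevity; the arithmetic
 should carry over to Java directly. The function names are made up for
 this sketch and don't come from any of our code bases. The sketch also
 shows one way to tell `R-1` and `R-2` apart: they only differ when n * p
 is an integer, so a four-element sample and p = 0.5 is enough. If I read
 the PostgreSQL documentation correctly, `PERCENTILE_DISC` returns the
 first input value whose position in the ordering equals or exceeds the
 fraction, which sounds more like `R-1`, but that still needs checking.

```python
import math


def _prepare(sample, p):
    """Return the sorted sample after basic argument checks."""
    xs = sorted(sample)
    if not xs:
        raise ValueError("sample must not be empty")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return xs


def quantile_r1(sample, p):
    """R-1: inverse empirical CDF; always returns a value from the sample."""
    xs = _prepare(sample, p)
    # Smallest 1-based rank k with k / n >= p; p = 0 maps to the minimum.
    k = max(1, math.ceil(len(xs) * p))
    return xs[k - 1]


def quantile_r2(sample, p):
    """R-2: like R-1, but averages the two neighbors when n * p is an integer."""
    xs = _prepare(sample, p)
    n = len(xs)
    h = n * p  # note: subject to floating-point rounding for some p
    k = math.ceil(h)
    if k == 0:
        return xs[0]
    if h == k and k < n:
        return (xs[k - 1] + xs[k]) / 2
    return xs[k - 1]


def quantile_r7(sample, p):
    """R-7: linear interpolation between the two closest ranks."""
    xs = _prepare(sample, p)
    n = len(xs)
    h = (n - 1) * p            # fractional 0-based index
    lo = math.floor(h)
    hi = min(lo + 1, n - 1)    # avoid running past the end for p = 1
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])


# With n = 4 and p = 0.5, n * p is an integer, so R-1 and R-2 disagree.
print(quantile_r1([1, 2, 3, 4], 0.5))  # 2
print(quantile_r2([1, 2, 3, 4], 0.5))  # 2.5
print(quantile_r7([1, 2, 3, 4], 0.5))  # 2.5
```

 Running `PERCENTILE_DISC(0.5)` over the same four values and comparing the
 result with the first two lines of output would settle the `R-1` versus
 `R-2` question empirically, without having to read the source code.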

 Whee, long ticket. Thoughts?

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/26035>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

