[metrics-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules
Tor Bug Tracker & Wiki
blackhole at torproject.org
Fri May 11 19:37:15 UTC 2018
#26035: Streamline sample quantile types used in the various modules
--------------------------------+---------------------------
Reporter: karsten | Owner: iwakeh
Type: enhancement | Status: accepted
Priority: Medium | Milestone:
Component: Metrics/Statistics | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor: Sponsor13
--------------------------------+---------------------------
Comment (by iwakeh):
This turned out to be longer than intended:
a) Advertised bandwidth distribution:
* With the notation in comment:1 the Java code uses essentially `result =
val(floor((N-1)*percentile))` [https://gitweb.torproject.org/metrics-
web.git/tree/src/main/java/org/torproject/metrics/stats/advbwdist/Main.java#n124
about here].
* [https://gitweb.torproject.org/metrics-
web.git/tree/src/main/R/advbwdist/aggregate.R#n13 R code] takes the median
of the Java calculated percentiles.
b) Advertised bandwidth of n-th fastest relays uses the resulting data
from a).
c) Fraction of connections used uni-/bidirectionally:
[https://gitweb.torproject.org/metrics-
web.git/tree/src/main/java/org/torproject/metrics/stats/connbidirect/Main.java#n468
Uses Java] and calculates `result = val(floor(N*percentile))` for the
three percentiles .25, .5, and .75.
d) Time to download files over Tor: uses percentile_cont
e) Unique .onion addresses (version 2 only):
[https://gitweb.torproject.org/metrics-
web.git/tree/src/main/java/org/torproject/metrics/stats/hidserv/Aggregator.java#n165
The code] doesn't seem to calculate quartiles, rather checks that the
interval is contained in the 25% to 75% interval of the fraction sum.
Hmm, what am I missing here?
f) Onion-service traffic (versions 2 and 3): same as e).
In total, the Java calculations a) and c) use a discrete version of the
percentile calculation and differ only by a slight index shift (`N-1`
versus `N` inside the `floor`).
The calculations in e) and f) are not really 'quartiles'.
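To make the index shift between a) and c) concrete, here is a minimal Java sketch of the two formulas quoted above. The class and method names are made up for illustration and do not exist in metrics-web; the clamp in the second method is my own addition, and a non-empty sample is assumed.

```java
import java.util.Arrays;

/** Hypothetical demo class, not part of metrics-web. */
class DiscreteIndexDemo {

  /** Variant described for a): val(floor((N-1)*percentile)), zero-based. */
  static long floorNMinusOne(long[] values, double percentile) {
    long[] sorted = values.clone();
    Arrays.sort(sorted);
    int index = (int) Math.floor((sorted.length - 1) * percentile);
    return sorted[index];
  }

  /** Variant described for c): val(floor(N*percentile)), zero-based. */
  static long floorN(long[] values, double percentile) {
    long[] sorted = values.clone();
    Arrays.sort(sorted);
    int index = (int) Math.floor(sorted.length * percentile);
    // Assumption: clamp the index so that percentile = 1.0 stays in range.
    index = Math.min(index, sorted.length - 1);
    return sorted[index];
  }

  public static void main(String[] args) {
    long[] sample = {10, 20, 30, 40};
    for (double p : new double[] {0.25, 0.5, 0.75}) {
      System.out.println(p + ": a) gives " + floorNMinusOne(sample, p)
          + ", c) gives " + floorN(sample, p));
    }
  }
}
```

For the sample 10, 20, 30, 40 the variant from a) picks 10, 20, 30 and the variant from c) picks 20, 30, 40 for the three quartiles, so the two results can be one element apart on the same input.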
The remaining calculations use R's median and PostgreSQL's percentile_cont,
where R's standard median is calculated (in pseudo code) as
{{{
#!C
/* Zero-based indexes into the sorted sample val(0..N-1); percentile = 0.5. */
first = floor(percentile * (N - 1));
second = ceil(percentile * (N - 1));
if (first == second) {
  /* Odd N: the index is a whole number, so the value of first is used. */
  result = val(first);
} else {
  /* Even N: first and second differ, take the average. */
  result = val(first) + (0.5 * (val(second) - val(first)));
}
}}}
So R's standard median is the same as 50%-percentile of type R-2 and also
coincides with 50%-percentile of type R-7.
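As a quick check of the claim about type R-2, the following Java sketch compares a textbook median with an R-2 quantile evaluated at 50%. It is only an illustration: the class and method names are invented here, the R-2 index formula is my reading of the usual definition (inverse empirical CDF with averaging at discontinuities), and a non-empty sample is assumed.

```java
import java.util.Arrays;

/** Hypothetical demo class, not part of metrics-web. */
class MedianVersusR2 {

  /** Textbook median: middle value, or mean of the two middle values. */
  static double median(double[] values) {
    double[] sorted = values.clone();
    Arrays.sort(sorted);
    int n = sorted.length;
    if (n % 2 == 1) {
      return sorted[n / 2];
    }
    return (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
  }

  /** Type R-2 quantile, written with zero-based indexes. */
  static double quantileR2(double[] values, double p) {
    double[] sorted = values.clone();
    Arrays.sort(sorted);
    int n = sorted.length;
    // One-based definition: average of x[ceil(N*p)] and x[floor(N*p + 1)].
    int lower = Math.max((int) Math.ceil(n * p) - 1, 0);
    int upper = Math.min((int) Math.floor(n * p), n - 1);
    return (sorted[lower] + sorted[upper]) / 2.0;
  }

  public static void main(String[] args) {
    double[] even = {4, 1, 3, 2};
    double[] odd = {5, 1, 3};
    System.out.println(median(even) + " == " + quantileR2(even, 0.5));
    System.out.println(median(odd) + " == " + quantileR2(odd, 0.5));
  }
}
```

Both methods agree for even and odd sample sizes at p = 0.5; for other percentiles R-2 is still discrete apart from the averaging at discontinuities.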
So there is quite some variety in use. The discrete types are easier to
compute (for example, when trying to reproduce the results). Introducing
the interpolation (or continuous) type in Java would complicate the
current code a little, but could be done w/o commons-math.
Of course, the two calculations in a) and c) should be the same, but
that's only a minor change and not related to the choice of percentile
calculation.
=== I: R-1
If we decide to use the discrete type R-1 throughout, we'd need to
 * replace percentile_cont by percentile_disc, and
 * replace R's median function throughout by the 50%-percentile type R-1
 provided by a utility function named `metricsmedian`.
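For reference, a discrete R-1 quantile could look roughly like the Java sketch below. This is not existing metrics-web code; the class and method names are invented, and the index formula `ceil(N*p)` (one-based) is the usual R-1 definition, which as far as I understand is also what percentile_disc returns. It assumes a non-empty sample and 0 <= p <= 1.

```java
import java.util.Arrays;

/** Hypothetical demo class, not part of metrics-web. */
class DiscreteQuantileR1 {

  /** Type R-1 quantile: always returns an element of the sample. */
  static long quantileR1(long[] values, double p) {
    long[] sorted = values.clone();
    Arrays.sort(sorted);
    // One-based position ceil(N*p), shifted to a zero-based index;
    // p = 0 maps to the smallest element.
    int index = Math.max((int) Math.ceil(sorted.length * p) - 1, 0);
    return sorted[index];
  }

  public static void main(String[] args) {
    long[] sample = {10, 20, 30, 40};
    for (double p : new double[] {0.25, 0.5, 0.75}) {
      System.out.println(p + ": " + quantileR1(sample, p));
    }
  }
}
```

Note that for an even number of samples the R-1 median is the lower of the two middle values (20 in the example), which is exactly where it differs from R's standard median.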
=== II: R-7
If we decide to use the proportionate interpolation method R-7
throughout, there are these work packages:
* implement a simple utility interpolation function for Java (similar to
postgresql's approach)
* and make use of it in a) and c).
* replace R's median function throughout by the 50%-percentile type R-7
provided by utility function named `metricsmedian`.
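The Java utility function mentioned in the first work package could be as small as the following sketch. It is only a suggestion under the assumption that type R-7 means linear interpolation at the zero-based position `(N-1)*p`; the class and method names are invented, and it expects a non-empty sample and 0 <= p <= 1.

```java
import java.util.Arrays;

/** Hypothetical demo class, not part of metrics-web. */
class InterpolatedQuantileR7 {

  /** Type R-7 quantile: linear interpolation between neighboring values. */
  static double quantileR7(double[] values, double p) {
    double[] sorted = values.clone();
    Arrays.sort(sorted);
    // Fractional zero-based position of the requested quantile.
    double position = (sorted.length - 1) * p;
    int lower = (int) Math.floor(position);
    int upper = (int) Math.ceil(position);
    // Weight of the upper neighbor; 0 if the position is a whole number.
    double fraction = position - lower;
    return sorted[lower] + fraction * (sorted[upper] - sorted[lower]);
  }

  public static void main(String[] args) {
    double[] sample = {10, 20, 30, 40};
    for (double p : new double[] {0.25, 0.5, 0.75}) {
      System.out.println(p + ": " + quantileR7(sample, p));
    }
  }
}
```

Unlike the discrete types, the result need not be an element of the sample: for 10, 20, 30, 40 the quartiles come out as 17.5, 25.0 and 32.5. No dependency beyond `java.util.Arrays` is needed.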
In both cases the calculations e) and f) stay as they are, but need more
documentation.
----
PS: (Trucks often have a spare tire mounted somewhere, and in some
countries they use three wheeled trucks ;-)
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/26035#comment:4>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online