[metrics-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules
Tor Bug Tracker & Wiki
blackhole at torproject.org
Tue May 8 15:34:47 UTC 2018
#26035: Streamline sample quantile types used in the various modules
--------------------------------+------------------------------
Reporter: karsten | Owner: metrics-team
Type: enhancement | Status: new
Priority: Medium | Milestone:
Component: Metrics/Statistics | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor: Sponsor13
--------------------------------+------------------------------
Comment (by iwakeh):
Long description, true.
Let's look at the postgresql implementation (using source of 9.6.8, but
assuming it is not subject to frequent change,
./src/backend/utils/adt/orderedsetaggs.c).
The computation of percentile_* as C-ish pseudo-code (which could easily
be used for implementation in Java or python):
{{{
#!C
/*
All values are sorted and indexed according to the order.
first and second are indices; val(k) is the value at index k;
percentile is the wanted percentile. N is the count of values.
*/
first = floor(percentile * N);
second = ceil(percentile * N);
/* if first==second the value of first is used. */
if (first==second) {
result = val(first);
} else { /* if first and second differ the following interpolation is
used. */
proportion = (percentile * N) - first;
/* the value is chosen between the values of first and second */
result = val(first) + (proportion * (val(second) - val(first)));
}
/*-----------------------------------*/
/* For comparison percentile_disc: */
result = val( ceil(N*percentile));
}}}
For values, where fractions make sense, e.g. seconds for onionperf
results, the interpolation or continuous method could be used and for all
else, e.g. user numbers etc., the simpler calculation (refered to as
discontinuous) could be used.
The classification R-x is not really a DIN and I wouldn't rely on an
implementation without checking the source.
Thus, using our own implementation in Java might be less trouble and
smaller dependency counts/space used. Python is not used anymore soon,
and R is only used for the median and either could document the median
calculation in R or implement the 0.5-percentile.
Possible steps:
* keep using postgresql percentile_cont
* implement the equivalent Java functionality and replace commons-math, if
this is the only functionality why it is included.
* document percentile calculation once for Java together with postgresql
* adapt the R median calculation, if necessary
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/26035#comment:1>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
More information about the metrics-bugs
mailing list