[metrics-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules
Tor Bug Tracker & Wiki
blackhole at torproject.org
Wed May 9 08:38:28 UTC 2018
#26035: Streamline sample quantile types used in the various modules
--------------------------------+------------------------------
Reporter: karsten | Owner: metrics-team
Type: enhancement | Status: new
Priority: Medium | Milestone:
Component: Metrics/Statistics | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor: Sponsor13
--------------------------------+------------------------------
Comment (by karsten):
Thanks for this thoughtful response! I have just a few more thoughts on
this:
I'm not entirely sure how that pseudocode handles `percentile` values of
0.0 and 1.0. Is `val()` 0-indexed or 1-indexed? In either case, one of the
two percentiles will take us outside the valid index range. (PostgreSQL's
`PERCENTILE_CONT` does support percentiles 0.0 and 1.0.)
I also looked more closely at types `R-1` and `R-2` and figured that
PostgreSQL's `PERCENTILE_DISC` is `R-1`, because it does not produce any
averages. So, PostgreSQL implements `R-1` and `R-7`.
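To make the difference between the two types and the edge cases at 0.0
and 1.0 concrete, here is a minimal Python sketch. The function names
`quantile_r1` and `quantile_r7` are made up for this comment, and the code
does not guard against floating-point rounding in `n * p`, so treat it as
an illustration of the formulas rather than a reference implementation.

```python
import math


def quantile_r1(values, p):
    """Type 1: inverse of the empirical CDF; always returns a sample value."""
    if not values:
        raise ValueError("need at least one value")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    xs = sorted(values)
    # 1-indexed rank ceil(n * p), clamped to 1 so that p = 0.0 yields the minimum.
    rank = max(1, math.ceil(len(xs) * p))
    return xs[rank - 1]


def quantile_r7(values, p):
    """Type 7: linear interpolation between the two closest order statistics."""
    if not values:
        raise ValueError("need at least one value")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    xs = sorted(values)
    h = (len(xs) - 1) * p              # 0-indexed fractional position
    lo = math.floor(h)
    hi = min(lo + 1, len(xs) - 1)      # clamp so that p = 1.0 stays in range
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])


sample = [1, 2, 3, 4]
print(quantile_r1(sample, 0.5), quantile_r7(sample, 0.5))
print(quantile_r1(sample, 0.0), quantile_r7(sample, 1.0))
```

With the clamping in place, both functions return the minimum for 0.0 and
the maximum for 1.0, regardless of how the underlying array is indexed.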
(By the way, when I refer to them as `R-x`, that's mainly to simplify our
discussion here. I'm happy to specify them with their formulas and only
mention that they are what R defines as type x in their `quantile()`
function.)
Regarding R's `median()` function: that produces the same result as `R-2`
and `R-7` evaluated at 0.5, right?
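One way to convince ourselves is a quick check in Python rather than R.
This sketch assumes that `statistics.median()` uses the usual "average
the two middle values" definition and that `statistics.quantiles()` with
`method="inclusive"` corresponds to type 7; `quantile_r2` is a
hypothetical helper written only for this comparison.

```python
import math
import statistics


def quantile_r2(values, p):
    """Type 2: like type 1, but averages the two neighbours at discontinuities."""
    xs = sorted(values)
    n = len(xs)
    # 1-indexed ranks ceil(n * p) and floor(n * p) + 1, clamped to [1, n].
    lo = min(max(math.ceil(n * p), 1), n)
    hi = min(max(math.floor(n * p) + 1, 1), n)
    return (xs[lo - 1] + xs[hi - 1]) / 2


def median_r7(values):
    """Median via type 7, using the standard library's inclusive method."""
    return statistics.quantiles(values, n=2, method="inclusive")[0]


for sample in ([10, 20, 40, 80], [10, 20, 40, 80, 160]):
    # All three values should agree for even-sized and odd-sized samples.
    print(statistics.median(sample), quantile_r2(sample, 0.5), median_r7(sample))
```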
I wonder if, for the sake of simplicity, we should avoid using
`PERCENTILE_DISC` (which we're not using yet, AFAIK) and only use
`PERCENTILE_CONT` and R's `median()`. That is, use `R-7` everywhere.
I do agree that interpolation between two integers representing user
numbers doesn't make as much sense. But we can always truncate or round
results, if we believe that integers are less confusing.
(I could imagine that if we were to compute percentiles of truly discrete
variables like the number of tires mounted on trucks, we wouldn't want to
return 7, but only actual sample values. I don't think that we need to
worry about that here.)
Regarding Apache Commons Math, we're not using that yet, and I don't feel
strongly about adding it as a dependency or implementing this quite simple
function ourselves, say, in metrics-lib. Worth adding tests, I guess.
Regarding Python, I'm amending my statement above a little bit. It's true
that we're going to replace our last remaining Python code. Still, if we
want to make our numbers reproducible, we'll have to accept that many of
our users will want to reproduce them using Python. We should at least
take a brief look at how this would work.
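As a first brief look, a user with only the Python standard library
(3.8 or later) could probably get `R-7` results along these lines. The
helper names `quartiles_r7` and `percentile_r7` are invented for this
sketch, and it again assumes that `method="inclusive"` matches type 7.
As far as I know, NumPy's `percentile()` also defaults to linear
interpolation, which should give the same numbers.

```python
import statistics


def quartiles_r7(values):
    """Return the three quartiles (25%, 50%, 75%) using the inclusive method."""
    return statistics.quantiles(values, n=4, method="inclusive")


def percentile_r7(values, percent):
    """Return a whole-number percentile between 1 and 99 using the inclusive method."""
    if not 1 <= percent <= 99:
        raise ValueError("percent must be an integer between 1 and 99")
    # quantiles(n=100) returns the 99 cut points for 1% through 99%.
    return statistics.quantiles(values, n=100, method="inclusive")[percent - 1]


sample = [1, 2, 3, 4]
print(quartiles_r7(sample))
print(percentile_r7(sample, 25))
```

Percentiles 0 and 100 are not covered by `statistics.quantiles()`; a user
would take `min()` and `max()` for those.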
So, your possible steps make sense. Is this something you'd like to work
on?
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/26035#comment:2>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online