[metrics-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules
Tor Bug Tracker & Wiki
blackhole at torproject.org
Sat May 12 08:20:12 UTC 2018
#26035: Streamline sample quantile types used in the various modules
--------------------------------+---------------------------
Reporter: karsten | Owner: iwakeh
Type: enhancement | Status: accepted
Priority: Medium | Milestone:
Component: Metrics/Statistics | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor: Sponsor13
--------------------------------+---------------------------
Comment (by karsten):
Thanks, very useful! Let me first try to answer the open questions:
- What's up with a) and c) using slightly different percentile
implementations? The reason is that a) includes the 0th percentile
(minimum) and the 100th percentile (maximum), which c) does not. It's
entirely possible that what we're using right now for a) is a terrible
hack. Maybe we should instead use the formula from c) in a) and handle
percentiles 0 and 100 as special cases, or do whatever the other
implementations do.
- What's up with e) and f) not being quartiles? What we're doing there is
that we're computing the ''weighted'' quartiles. And again, it might be
that it's a hack that we should rewrite. The goal should be to implement a
weighted trimmed mean. The technical report probably has a better
definition. What we cannot do, though, is use the exact same percentile
definition that we're using in the other places.
- I think you left out the Python code that is our current censorship
detector. Which is fine, as I see how we could change that code to match
what we're doing elsewhere.
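To illustrate the weighted trimmed mean mentioned for e) and f), here is
a minimal Python sketch of one possible definition: sort the values, lay
their weights end to end, and average only the part of the total weight
that lies between 25% and 75%. The function name
weighted_interquartile_mean and the (value, weight) input format are
assumptions made for this example. This is not the code currently in our
modules, and the technical report may define it differently.

```python
def weighted_interquartile_mean(pairs):
    """Weighted mean of the middle 50% of the total weight.

    pairs: iterable of (value, weight) tuples with non-negative weights.
    Hypothetical helper for illustration only.
    """
    # Ignore zero-weight entries and sort by value.
    items = sorted((v, w) for v, w in pairs if w > 0)
    total = sum(w for _, w in items)
    if total <= 0:
        raise ValueError("need at least one positive weight")

    lower, upper = 0.25 * total, 0.75 * total
    cumulative = 0.0
    weighted_sum = 0.0
    for value, weight in items:
        start, end = cumulative, cumulative + weight
        # Portion of this entry's weight that falls inside [lower, upper].
        overlap = max(0.0, min(end, upper) - max(start, lower))
        weighted_sum += value * overlap
        cumulative = end
    return weighted_sum / (upper - lower)
```

With equal weights on 1, 2, 3, 4 this averages only 2 and 3. An entry
that straddles a quartile boundary contributes only the part of its
weight inside the boundaries, which is why an unweighted percentile
definition cannot be reused as-is.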
So, I guess the decision we need to make is whether we want to use R-1 or
R-7 everywhere, right?
I'm slightly leaning towards R-7 here.
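For illustration, a minimal Python sketch of the two definitions,
following the type numbering of R's `quantile()`. The function names
quantile_r1 and quantile_r7 are made up for this example, and p is given
as a fraction between 0 and 1 rather than as a percentile. This is a
sketch under those assumptions, not the code from any of our modules.

```python
import math


def quantile_r1(values, p):
    """R-1: inverse of the empirical CDF, never interpolates."""
    xs = sorted(values)
    n = len(xs)
    if n == 0 or not 0 <= p <= 1:
        raise ValueError("need non-empty values and 0 <= p <= 1")
    # Smallest order statistic x_k with k/n >= p; p = 0 maps to the minimum.
    k = max(1, math.ceil(n * p))
    return xs[k - 1]


def quantile_r7(values, p):
    """R-7: linear interpolation between the two closest order statistics."""
    xs = sorted(values)
    n = len(xs)
    if n == 0 or not 0 <= p <= 1:
        raise ValueError("need non-empty values and 0 <= p <= 1")
    h = (n - 1) * p          # fractional zero-based position
    lo = math.floor(h)
    hi = min(lo + 1, n - 1)  # stay in range when p = 1
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])
```

For the values 1, 2, 3, 4 the R-1 median is 2 and the R-7 median is 2.5.
Both sketches return the minimum for p = 0 and the maximum for p = 1.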
One reason is that, if we used R-1, we couldn't use R's default `median()`
anymore, because that interpolates. I found a non-interpolating median
implementation in Python, called
[https://docs.python.org/3/library/statistics.html#statistics.median_low median_low]
(or median_high). And I think the Tor daemon uses a low median
for some things related to directory authority voting. But I believe it's
not the standard.
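The difference only shows up for an even number of values. A short
example using Python's standard `statistics` module (the sample data is
made up):

```python
import statistics

data = [1, 2, 3, 4]

# Interpolating median: mean of the two middle values.
interpolated = statistics.median(data)

# Non-interpolating variants: always return an element of the data.
low = statistics.median_low(data)
high = statistics.median_high(data)

print(interpolated, low, high)  # 2.5 2 3
```

For an odd number of values all three functions return the same middle
element.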
So, if we use R-7, we should have good tool support.
Except for Java where we'd have to implement something ourselves, which
would also have to handle special cases 0 and 100.
By the way, do you feel strongly about avoiding Apache Commons Math? We'd
only have to add it to metrics-web, and it would save us half a day of
writing code and testing it. After all, we also rely on libraries for
things like base64 encoding, which is not rocket science to implement
ourselves. We wouldn't have to add it to the metrics-web .war file!
P.S.: Did I write something about trucks? I meant insect legs! Unless
those have a spare leg mounted somewhere, too, in which case I'll think
even harder about a good example. ;)
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/26035#comment:5>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online