[metrics-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules
Tor Bug Tracker & Wiki
blackhole at torproject.org
Thu May 24 19:32:47 UTC 2018
#26035: Streamline sample quantile types used in the various modules
--------------------------------+--------------------------------
Reporter: karsten | Owner: karsten
Type: enhancement | Status: needs_revision
Priority: High | Milestone:
Component: Metrics/Statistics | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor: Sponsor13
--------------------------------+--------------------------------
Changes (by iwakeh):
* status: needs_review => needs_revision
Comment:
Replying to [comment:15 karsten]:
> Replying to [comment:14 iwakeh]:
> > Taking you up on your offer from comment:13, so I can concentrate on
reviews and tickets of CollecTor.
>
> Alright, happy to implement this change.
Thanks!
>
> Please review [https://gitweb.torproject.org/karsten/metrics-
web.git/log/?h=task-26035 my task-26035 branch] with three commits:
>
> - [https://gitweb.torproject.org/karsten/metrics-
web.git/commit/?h=task-26035&id=4f92894a1ee5315b9e4a17b38f3cdb229612f0f1
4f92894] changes how we're computing median and inter-quartile range in
the censorship detector code, which is still written in Python. I tested
the change by running on our user number estimates. I found that it
changes 159 of 2447 days in our data (6.5%) and leaves the remaining days
entirely unchanged. This also makes sense: with a slightly different
median and inter-quartile range we either include a value or exclude it as
outlier. I'd say we cannot conclude that one of the implementations is
correct and the other is not. The new implementation will simply be more
consistent throughout our code base.
This looks fine and will make it easier to transfer this into Java later.
>
> - [https://gitweb.torproject.org/karsten/metrics-
web.git/commit/?h=task-26035&id=2685c78f13cbf9402d5ba0b4380df03f246e86e5
2685c78] makes the same change to our advertised bandwidth statistics.
Obviously, this changes results a bit, because we're now interpolating
between actually reported advertised bandwidths rather than returning a
value that was actually reported by one of the relays. Still, for the sake
of consistency throughout our code base, we should switch.
>
> - [https://gitweb.torproject.org/karsten/metrics-
web.git/commit/?h=task-26035&id=f9c24cab1006bf5999c662e9d06767c59c71a3e6
f9c24ca] makes the third change in this series, this time to the
connbidirect module. The change is quite significant in years 2011 and
2012 where we had just a handful of relays reporting these statistics.
Then it does make a difference whether we're interpolating or not. Same
argument in favor of doing it now.
The advbwdist module has a new static method, which should be made more
visible and thus facilitate re-use as well as testing.
Of course, for re-use it needs to be made more generic and maybe also
placed in a different class (maybe `**.stats.Utiliy`).
Remarks & suggestions in no particular order:
* the sorting step in advbwdist changes an input parameter, which is bad
practice.
* commons-math Percentile class doesn't require the input data to be
sorted. (The javadoc comment only talks about sorting in order to explain
what will happen for edge cases.)
* Maybe rather use `doubleValue()` instead of casting a Number sub-type to
a primitive.
* Casting of percentile results could be performed by the caller, which
could guarantee that there are only values of for example type short
entered (see connbiderect). Or, provide special utility methods that re-
use code internally.
* connbidirect uses similar code as advbwdist for almost the same
computations. The input fraction list also get changed by the unnecessary
sorting step (this might not matter in that case, but still)
* The Java re-implementation of the python detector will also benefit from
a percentile function.
* The percentile input parameter `int[] percentiles` could be changed to
`int ... percentiles`.
Encapsulation and testability of this type of functionality that is needed
throughout the code is essential and will also make documentation now and
in future much easier.
The functionality should especially be tested b/c of the large impact such
changes have, i.e., re-computation of everything. This should be revised.
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/26035#comment:16>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
More information about the metrics-bugs
mailing list