[metrics-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics
Tor Bug Tracker & Wiki
blackhole at torproject.org
Thu May 17 13:12:54 UTC 2018
#26022: Fix a flaw in the noise-removing code in our onion service statistics
--------------------------------+------------------------------
Reporter: karsten | Owner: metrics-team
Type: defect | Status: needs_review
Priority: Medium | Milestone:
Component: Metrics/Statistics | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
--------------------------------+------------------------------
Comment (by amj703):
> > Why not throw out the outliers, then add the remaining, then do the
adjustment?
>
> The way we're determining whether a reported value was an outlier or not
is by extrapolating all reported values to network totals and discarding
the lowest 25% and highest 25% of ''extrapolated'' values. But
extrapolating values requires us to make these adjustment first, or we'd
extrapolate to the wrong network totals.
It sounds to me like it doesn't make a difference if throw out the
outliers before adjustment or after adjustment. The adjustment preserves
the relative ordering of the values, and so the bottom (and top) 25% of
data points will remain the same. So here's a suggestion: (1) extrapolate
by dividing the raw measurement by the relay's weight; (2) remove the top
and bottom 25% of extrapolated values; (3) add the remaining raw values;
(4) adjust that result by rounding towards the closest bin edge,
subtracting num_relay*bin_size/2, and replacing a negative value with 0;
and then (5) extrapolate the result by dividing by the total weight of the
included relays.
> Here's another idea, though: what if we change the way how we're
removing noise by ''only'' subtracting `bin_size / 2` to undo the binning
step as good as we can and leave the Laplace noise alone. Basically, we'd
only account for the fact that relays always round up to the next multiple
of `bin_size`, but we wouldn't do anything about the positive or negative
noise.
This isn't really the best guess about the value at the relays before
noise is added. While not performing the noise-removal step makes it
easier to explain how the metrics numbers are produced (and to implement
them), I think it makes it harder to understand what they mean. This is a
judgement call that I can see both sides of, though!
> Wait, we're talking about two different things:
>
> 1. Relays internally round ''up'' to the next multiple of `bin_size`.
> 2. metrics-web contains that `removeNoise()` method that this ticket is
all about.
Ah, right, thanks for clarifying that.
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/26022#comment:15>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
More information about the metrics-bugs
mailing list