[metrics-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics

Tor Bug Tracker & Wiki blackhole at torproject.org
Thu May 17 14:30:10 UTC 2018


#26022: Fix a flaw in the noise-removing code in our onion service statistics
--------------------------------+------------------------------
 Reporter:  karsten             |          Owner:  metrics-team
     Type:  defect              |         Status:  needs_review
 Priority:  Medium              |      Milestone:
Component:  Metrics/Statistics  |        Version:
 Severity:  Normal              |     Resolution:
 Keywords:                      |  Actual Points:
Parent ID:                      |         Points:
 Reviewer:                      |        Sponsor:
--------------------------------+------------------------------

Comment (by karsten):

 Replying to [comment:15 amj703]:
 > > > Why not throw out the outliers, then add the remaining, then do the
 adjustment?
 > >
 > > The way we're determining whether a reported value was an outlier or
 not is by extrapolating all reported values to network totals and
 discarding the lowest 25% and highest 25% of ''extrapolated'' values. But
 extrapolating values requires us to make these adjustment first, or we'd
 extrapolate to the wrong network totals.
 >
 > It sounds to me like it doesn't make a difference if throw out the
 outliers before adjustment or after adjustment. The adjustment preserves
 the relative ordering of the values, and so the bottom (and top) 25% of
 data points will remain the same.

 You're right. Agreed, this would work.

 > So here's a suggestion: (1) extrapolate by dividing the raw measurement
 by the relay's weight; (2) remove the top and bottom 25% of extrapolated
 values; (3) add the remaining raw values; (4) adjust that result by
 rounding towards the closest bin edge, subtracting num_relay*bin_size/2,
 and replacing a negative value with 0; and then (5) extrapolate the result
 by dividing by the total weight of the included relays.

 I see how that approach would work and produce similar results to our
 current approach. In particular, I think it does the following things
 differently:

  - We're currently rounding each reported statistic towards the closest
 bin edge, whereas your suggested approach would only do that once in step
 (4). You approach is less likely to introduce errors and relies more on
 noise to cancel out itself. This seems good.

  - Your approach replaces a negative end value with 0 in step (4) which
 our current approach doesn't. We could easily adopt this part, if we
 wanted. So far, it hasn't happened that we got a negative end result,
 though.

 > > Here's another idea, though: what if we change the way how we're
 removing noise by ''only'' subtracting `bin_size / 2` to undo the binning
 step as good as we can and leave the Laplace noise alone. Basically, we'd
 only account for the fact that relays always round up to the next multiple
 of `bin_size`, but we wouldn't do anything about the positive or negative
 noise.
 >
 > This isn't really the best guess about the value at the relays before
 noise is added. While not performing the noise-removal step makes it
 easier to explain how the metrics numbers are produced (and to implement
 them), I think it makes it harder to understand what they mean. This is a
 judgement call that I can see both sides of, though!

 Wait, my suggestion was to subtract `bin_size / 2` but not round to the
 closest bin edge. I think it produces the same result as your suggestion
 above, except that it does not round the end result to the closest bin
 edge in your step (4).

 However, and here comes the catch: I'm running out of time for
 experimenting with different approaches. My goal here was to fix a bug,
 not to try improved estimation algorithms. I can see how your suggestion
 might improve our results, but it's a bigger code change than a single
 method. I'm afraid I'll have to leave that as future work. :(

 I'll wait for the simplified `removeNoise()` method run to finish and will
 post a new graph that compares flawed, fixed, and simplied methods.

 > > Wait, we're talking about two different things:
 > >
 > >  1. Relays internally round ''up'' to the next multiple of `bin_size`.
 > >  2. metrics-web contains that `removeNoise()` method that this ticket
 is all about.
 >
 > Ah, right, thanks for clarifying that.

 Great! Sorry for confusing you earlier.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/26022#comment:16>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online


More information about the metrics-bugs mailing list