[metrics-bugs] #28305 [Metrics/Statistics]: Include client numbers even if we think we got reports from more than 100% of all relays
Tor Bug Tracker & Wiki
blackhole at torproject.org
Thu Nov 29 14:22:39 UTC 2018
#28305: Include client numbers even if we think we got reports from more than 100%
of all relays
--------------------------------+------------------------------
Reporter: karsten | Owner: karsten
Type: defect | Status: accepted
Priority: High | Milestone:
Component: Metrics/Statistics | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor: SponsorV-can
--------------------------------+------------------------------
Comment (by karsten):
Replying to [comment:6 teor]:
> Replying to [comment:5 karsten]:
> > Note the red arrow. At this point `n(H)` grows larger than `n(N)`.
> > That's an issue. By definition, a relay cannot report written directory
> > bytes statistics for a longer time than it's online.
>
> But relays that aren't listed in the consensus can still be acting as
> relays.
You're right, this can happen: a relay that stays online and keeps serving
directory data while it temporarily drops out of the consensus still counts
toward `n(H)` but not toward `n(N)`. We just didn't consider such cases in
the original design of the `frac` formula.
> > A possible mitigation (other than the one I suggested above) could be
> > to replace `n(H)` with `n(N^H)` in the `frac` formula. This would mean
> > that we'd cap the amount of time for which a relay reported written
> > directory bytes to the amount of time it was listed in the consensus.
>
> This seems like a reasonable approach: if the relay is listed in the
> consensus for `n(N^H)` seconds, then we should weight its bandwidth using
> that number of seconds.
Oh, you're raising another important point here: speaking in formula
terms, if we replace `n(H)` with `n(N^H)` we'll also have to replace
`h(H)` with `h(N^H)`.
Similarly, we'll have to replace `h(R^H)` with `h(R^H^N)` and `n(R\H)`
with `n(R^N\H)`.
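For reference, spelling out the full substitution (assuming I'm remembering
the currently deployed `frac` formula correctly; if not, the substitution
pattern is still the same):
{{{
current:   frac = ( h(R^H)   * n(H)   + h(H)   * n(R\H)   ) / ( h(H)   * n(N) )
proposed:  frac = ( h(R^H^N) * n(N^H) + h(N^H) * n(R^N\H) ) / ( h(N^H) * n(N) )
}}}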
Hmmmm. I'm less optimistic now that changing the `frac` formula is a good
idea. It seems like too big a change to make, and we're not even sure that
the result would be more accurate.
> > I'm currently dumping and downloading the database to try this out at
> > home. However, I'm afraid that deploying this fix is going to be much more
> > expensive than making the simple fix suggested above. I'll report here
> > what I find out.
>
> I'm not sure if it will make much of a difference long-term: relays that
> drop out of the consensus should have low bandwidth weights, and therefore
> low bandwidths. (Except when the network is unstable, or there are less
> than 3 bandwidth authorities.)
Agreed.
Let's make the change I suggested above, in a slightly modified way:
{{{
-WHERE a.frac BETWEEN 0.1 AND 1.0
+WHERE a.frac BETWEEN 0.1 AND 1.1
}}}
The reason for accepting `frac` values between `1.0` and `1.1` is that, as
discussed here, there can be relays reporting statistics that temporarily
didn't make it into the consensus.
The reason for not giving up on the upper bound entirely is that, as the
graph above shows, there are still isolated days over the years when `frac`
suddenly went up to `1.2`, `1.5`, or even `2.0`. We should continue
excluding these data points. Therefore we should use `1.1` as the new upper
bound.
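To make the effect of the widened bound concrete, here's a quick sanity
check against a few sample `frac` values (plain PostgreSQL, illustrative
only, not tied to the actual userstats tables):
{{{
-- Compare old and new bounds: 1.05 would now be kept, while the outliers
-- 1.2, 1.5, and 2.0 (and anything below 0.1) stay excluded.
SELECT frac,
       frac BETWEEN 0.1 AND 1.0 AS kept_old,
       frac BETWEEN 0.1 AND 1.1 AS kept_new
FROM (VALUES (0.05), (0.95), (1.05), (1.2), (1.5), (2.0)) AS samples(frac);
}}}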
How does this sound?
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/28305#comment:7>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online