[metrics-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics
Tor Bug Tracker & Wiki
blackhole at torproject.org
Fri May 4 15:44:04 UTC 2018
#26022: Fix a flaw in the noise-removing code in our onion service statistics
------------------------------------+--------------------------
Reporter: karsten | Owner: metrics-team
Type: defect | Status: new
Priority: Medium | Milestone:
Component: Metrics/Statistics | Version:
Severity: Normal | Keywords:
Actual Points: | Parent ID:
Points: | Reviewer:
Sponsor: |
------------------------------------+--------------------------
I think I found a flaw in the aggregation part of our onion service
statistics.
For context, I'm quoting a paragraph from our technical report where we
explain how we're [https://research.torproject.org/techreports
/extrapolating-hidserv-stats-2015-01-31.pdf extrapolating network totals
from hidden-service statistics]:
''"When processing hidden-service statistics, we need to handle the fact
that they have been obfuscated by relays. As first step, we're attempting
to remove the additive Laplace-distributed noise by rounding up or down to
the nearest multiple of `bin_size`. The idea is that it's most likely that
noise was added to the closest right side of a bin than to the right side
of another bin. In step two, we're subtracting half of `bin_size`, because
the relay added between 0 and `bin_size − 1` to the originally observed
value."''
In our Java code, these two steps turned into:
{{{
/** Removes noise from a reported stats value by rounding to the nearest
* right side of a bin and subtracting half of the bin size. */
private long removeNoise(long reportedNumber, long binSize) {
long roundedToNearestRightSideOfTheBin =
((reportedNumber + binSize / 2) / binSize) * binSize;
long subtractedHalfOfBinSize =
roundedToNearestRightSideOfTheBin - binSize / 2;
return subtractedHalfOfBinSize;
}
}}}
I think that this code has a flaw: in `(reportedNumber + binSize / 2) /
binSize`, we use [https://en.wikipedia.org/wiki/Truncation integer
truncation]. However, as described in that Wikipedia article, ''"for
negative numbers truncation does not round in the same direction as the
floor function: truncation always rounds toward zero, the floor function
rounds towards negative infinity."''
I'm going to attach a graph (once this ticket exists) that visualizes the
effect for reported values around zero. In short: the step function has
one really long step, and that seems flawed or at least inconsistent.
My suggestion would be to change that Java code and use `Math.floorDiv()`
rather than integer truncation. Something like this:
{{{
diff --git
a/src/main/java/org/torproject/metrics/stats/hidserv/Parser.java
b/src/main/java/org/torproject/metrics/stats/hidserv/Parser.java
index f2abc789..888c8959 100644
--- a/src/main/java/org/torproject/metrics/stats/hidserv/Parser.java
+++ b/src/main/java/org/torproject/metrics/stats/hidserv/Parser.java
@@ -245,7 +245,7 @@ public class Parser {
* right side of a bin and subtracting half of the bin size. */
private long removeNoise(long reportedNumber, long binSize) {
long roundedToNearestRightSideOfTheBin =
- ((reportedNumber + binSize / 2) / binSize) * binSize;
+ Math.floorDiv((reportedNumber + binSize / 2), binSize) * binSize;
long subtractedHalfOfBinSize =
roundedToNearestRightSideOfTheBin - binSize / 2;
return subtractedHalfOfBinSize;
}}}
See the [https://docs.oracle.com/javase/8/docs/api/java/lang/Math.html
#floorDiv-long-long- JavaDocs for that method] for details. Quoting:
''"Normal integer division operates under the round to zero rounding mode
(truncation). This operation instead acts under the round toward negative
infinity (floor) rounding mode. The floor rounding mode gives different
results than truncation when the exact result is negative."''
I'm yet unclear whether this might affect overall results. Right now,
we're handling all negative reported values wrong by adding 1 `binSize` to
them which we shouldn't do. Of course, negative reported values should be
the exception, not the rule, but there's a reason why we're keeping them
in the process, just like we're keeping values that we think are too large
because of noise. We'll have to see.
But regardless, I think we need to fix this. Opinions?
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/26022>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
More information about the metrics-bugs
mailing list