[tor-dev] Metrics: Estimating fraction of reported directory-request statistics

Sun Apr 17 00:16:23 UTC 2022

I am trying to reproduce the "frac" computation from the Reproducible
Metrics instructions:
https://metrics.torproject.org/reproducible-metrics.html#relay-users
This is also Section 3 of the tech report on counting bridge users:
https://research.torproject.org/techreports/counting-daily-bridge-users-2012-10-24.pdf#page=4

       h(R∧H) * n(H) + h(H) * n(R\H)
frac = -----------------------------
                h(H) * n(N)
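
For reference, the formula transcribes directly into code. This is only
a sketch: the function and argument names are made up for illustration
and are not taken from the metrics code.

```python
def frac(h_r_and_h, n_h, h_h, n_r_minus_h, n_n):
    """Fraction of the network whose directory requests are observed.

    h_r_and_h   -- h(R∧H): written directory bytes while reporting both
    n_h         -- n(H):   hours with a reported write history
    h_h         -- h(H):   written directory bytes
    n_r_minus_h -- n(R\\H): hours with request counts but no write history
    n_n         -- n(N):   hours for all relays in the network
    """
    return (h_r_and_h * n_h + h_h * n_r_minus_h) / (h_h * n_n)


# Sanity check: if every relay reports both statistics for every hour
# (R = H = N), the fraction should be exactly 1.
print(frac(h_r_and_h=10.0, n_h=5.0, h_h=10.0, n_r_minus_h=0.0, n_n=5.0))
```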

My minor goal is to reproduce the "frac" column from the Metrics web
site (which I assume is the same as the frac above, expressed as a
percentage):

https://metrics.torproject.org/userstats-relay-country.csv?start=2022-04-01&end=2022-04-08&country=all&events=off
date,country,users,lower,upper,frac
2022-04-01,,2262557,,,92
2022-04-02,,2181639,,,92
2022-04-03,,2179544,,,93
2022-04-04,,2350360,,,93
2022-04-05,,2388772,,,93
2022-04-06,,2356170,,,93
2022-04-07,,2323184,,,93
2022-04-08,,2310170,,,91
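
To compare against computed values, the published column can be turned
into fractions. A minimal sketch, assuming (as above) that "frac" is an
integer percentage; the CSV text is pasted inline rather than fetched:

```python
import csv
import io

CSV_TEXT = """date,country,users,lower,upper,frac
2022-04-01,,2262557,,,92
2022-04-02,,2181639,,,92
2022-04-03,,2179544,,,93
2022-04-04,,2350360,,,93
2022-04-05,,2388772,,,93
2022-04-06,,2356170,,,93
2022-04-07,,2323184,,,93
2022-04-08,,2310170,,,91
"""

# Map each date to the published frac, converted from percent to a fraction.
published = {
    row["date"]: int(row["frac"]) / 100
    for row in csv.DictReader(io.StringIO(CSV_TEXT))
}

for date, value in published.items():
    print(date, value)
```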

I'm having trouble with the computation of n(R\H) and h(R∧H). I
understand that R is the subset of relays that report directory request
counts (i.e. that have dirreq-stats-end in their extra-info descriptors)
and H is the subset of relays that report directory request byte counts
(i.e. that have dirreq-write-history in their extra-info descriptors).
R and H partially overlap: there are relays that are in R but not H,
others that are in H but not R, and others that are in both.

The computations depend on some values that are directly from
descriptors:
n(R) = sum of hours, for relays with directory request counts
n(H) = sum of hours, for relays with directory write histories
h(H) = sum of written bytes, for relays with directory write histories

> Compute n(R\H) as the number of hours for which responses have been
> reported but no written directory bytes. This fraction is determined
> by summing up all interval lengths and then subtracting the written
> directory bytes interval length from the directory response interval
> length. Negative results are discarded.

I interpret this to mean: add up all the dirreq-stats-end intervals
(this is n(R)), add up all the dirreq-write-history intervals
(this is n(H)), and compute n(R\H) as n(R) − n(H). This seems wrong: it
would only be true when H is a subset of R.
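
A toy example of why subtracting the two sums breaks down when R and H
only partially overlap. The relays and hours are invented, and the
"set-based" line is just the literal definition of n(R\H), not
something confirmed from the metrics code:

```python
def n_r_minus_h_aggregate(r_hours, h_hours):
    # The reading described above: n(R) - n(H), negative results discarded.
    return max(0.0, sum(r_hours.values()) - sum(h_hours.values()))


# Hypothetical network: relay "A" reports only request counts,
# relay "B" reports only write histories, 24 hours each.
r_hours = {"A": 24.0}
h_hours = {"B": 24.0}

aggregate = n_r_minus_h_aggregate(r_hours, h_hours)

# Literal definition: hours of relays that are in R but not in H.
set_based = sum(hours for fp, hours in r_hours.items() if fp not in h_hours)

print(aggregate, set_based)  # 0.0 24.0
```

The two agree only when every relay in H is also in R for the same
hours.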

> Compute h(R∧H) as the number of written directory bytes for the
> fraction of time when a server was reporting both written directory
> bytes and directory responses. As above, this fraction is determined
> by first summing up all interval lengths and then computing the
> minimum of both sums divided by the sum of reported written directory
> bytes.

This seems to be saying to compute h(R∧H) (a count of bytes) as
min(n(R), n(H)) / h(H). This is dimensionally wrong: the units are
hours / bytes. What would be more natural to me is
min(n(R), n(H)) / max(n(R), n(H)) × h(H); i.e., divide the smaller of
n(R) and n(H) by the larger, then multiply this ratio by the observable
byte count. But this, too, only works when H is a subset of R.
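
The min/max reading can be sketched as follows. The function name and
the numbers are illustrative only, and this is the guess described
above, not the documented method:

```python
def h_r_and_h_guess(n_r, n_h, h_h):
    """Scale h(H) by the ratio of the smaller hour sum to the larger.

    The ratio is dimensionless, so the result is in bytes, unlike
    min(n(R), n(H)) / h(H), which would be in hours per byte.
    """
    return min(n_r, n_h) / max(n_r, n_h) * h_h


# Toy values: 10 hours of request counts, 20 hours of write history,
# 1000 bytes written. Half of the bytes are attributed to R∧H.
print(h_r_and_h_guess(n_r=10.0, n_h=20.0, h_h=1000.0))  # 500.0
```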

Where is this computation done in the metrics code? I would like to
refer to it, but I could not find it.

Using the formulas and assumptions above, here's my attempt at computing
recent "frac" values:

date         n(N)   n(H)    h(H)   n(R) n(R\H)  h(R∧H)  frac
2022-04-01 166584 177638 2.24e13 125491      0 1.59e13 0.753
2022-04-02 166951 177466 2.18e13 125686      0 1.54e13 0.753
2022-04-03 167100 177718 2.27e13 127008      0 1.62e13 0.760
2022-04-04 166970 177559 2.43e13 126412      0 1.73e13 0.757
2022-04-05 166729 177585 2.44e13 125389      0 1.72e13 0.752
2022-04-06 166832 177470 2.39e13 127077      0 1.71e13 0.762
2022-04-07 166532 177210 2.48e13 127815      0 1.79e13 0.768
2022-04-08 167695 176879 2.52e13 127697      0 1.82e13 0.761
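
The table can be recomputed from its hour columns alone. The sketch
below uses the rounded values shown above and the interpretations
described earlier (n(R\H) = max(0, n(R) − n(H)), and h(R∧H) as the
min/max ratio times h(H)); the variable names are illustrative. Under
those assumptions h(H) cancels out of the formula, and whenever
n(R) ≤ n(H) the result is simply n(R) / n(N):

```python
# (date, n(N), n(H), n(R)) copied from the table above (rounded values).
rows = [
    ("2022-04-01", 166584, 177638, 125491),
    ("2022-04-02", 166951, 177466, 125686),
    ("2022-04-03", 167100, 177718, 127008),
    ("2022-04-04", 166970, 177559, 126412),
    ("2022-04-05", 166729, 177585, 125389),
    ("2022-04-06", 166832, 177470, 127077),
    ("2022-04-07", 166532, 177210, 127815),
    ("2022-04-08", 167695, 176879, 127697),
]

fracs = []
for date, n_n, n_h, n_r in rows:
    n_r_minus_h = max(0, n_r - n_h)          # negatives discarded
    overlap = min(n_r, n_h) / max(n_r, n_h)  # h(R∧H) / h(H) under my reading
    # frac with numerator and denominator both divided by h(H)
    value = (overlap * n_h + n_r_minus_h) / n_n
    fracs.append(value)
    print(date, f"{value:.3f}", f"{n_r / n_n:.3f}")
```

Both printed columns agree with the frac column of the table, so with
these interpretations the byte counts play no role in the result.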

The "frac" column does not match the CSV. Also notice that n(N) < n(H),
which should be impossible because H is supposed to be a subset of N
(N is the set of all relays). But this is what I get when I estimate
n(N) from a network-status-consensus-3 and n(H) from extra-info
documents. Also notice that n(R) < n(H), which means that H cannot be a
subset of R, even though the formulas as I read them are only valid in
that case.