[tor-dev] Metrics: Estimating fraction of reported directory-request statistics
David Fifield
david at bamsoftware.com
Mon Jun 27 02:33:46 UTC 2022
On Thu, Apr 21, 2022 at 05:47:12PM +0200, Silvia/Hiro wrote:
> On 17/4/22 2:16, David Fifield wrote:
> > I am trying to reproduce the "frac" computation from the Reproducible
> > Metrics instructions:
> > https://metrics.torproject.org/reproducible-metrics.html#relay-users
> > Which is also Section 3 in the tech report on counting bridge users:
> > https://research.torproject.org/techreports/counting-daily-bridge-users-2012-10-24.pdf#page=4
> >
> >        h(R^H) * n(H) + h(H) * n(R\H)
> > frac = -----------------------------
> >                 h(H) * n(N)
> >
> > My minor goal is to reproduce the "frac" column from the Metrics web
> > site (which I assume is the same as the frac above, expressed as a
> > percentage):
> >
> > https://metrics.torproject.org/userstats-relay-country.csv?start=2022-04-01&end=2022-04-08&country=all&events=off
> > date,country,users,lower,upper,frac
> > 2022-04-01,,2262557,,,92
> > 2022-04-02,,2181639,,,92
> > 2022-04-03,,2179544,,,93
> > 2022-04-04,,2350360,,,93
> > 2022-04-05,,2388772,,,93
> > 2022-04-06,,2356170,,,93
> > 2022-04-07,,2323184,,,93
> > 2022-04-08,,2310170,,,91
> >
> > I'm having trouble with the computation of n(R\H) and h(R∧H). I
> > understand that R is the subset of relays that report directory request
> > counts (i.e. that have dirreq-stats-end in their extra-info descriptors)
> > and H is the subset of relays that report directory request byte counts
> > (i.e. that have dirreq-write-history in their extra-info descriptors).
> > R and H partially overlap: there are relays that are in R but not H,
> > others that are in H but not R, and others that are in both.
> >
> > The computations depend on some values that are directly from
> > descriptors:
> >   n(R) = sum of hours, for relays with directory request counts
> >   n(H) = sum of hours, for relays with directory write histories
> >   h(H) = sum of written bytes, for relays with directory write histories
> >
> > > Compute n(R\H) as the number of hours for which responses have been
> > > reported but no written directory bytes. This fraction is determined
> > > by summing up all interval lengths and then subtracting the written
> > > directory bytes interval length from the directory response interval
> > > length. Negative results are discarded.
> > I interpret this to mean: add up all the dirreq-stats-end intervals
> > (this is n(R)), add up all the dirreq-write-history intervals
> > (this is n(H)), and compute n(R\H) as n(R) − n(H). This seems wrong: it
> > would only be true when H is a subset of R.
> >
> > > Compute h(R∧H) as the number of written directory bytes for the
> > > fraction of time when a server was reporting both written directory
> > > bytes and directory responses. As above, this fraction is determined
> > > by first summing up all interval lengths and then computing the
> > > minimum of both sums divided by the sum of reported written directory
> > > bytes.
> > This seems to be saying to compute h(R∧H) (a count of bytes) as
> > min(n(R), n(H)) / h(H). This is dimensionally wrong: the units are
> > hours / bytes. What would be more natural to me is
> > min(n(R), n(H)) / max(n(R), n(H)) × h(H); i.e., divide the smaller of
> > n(R) and n(H) by the larger, then multiply this ratio by the observable
> > byte count. But this, too, only works when H is a subset of R.
> >
> > Where is this computation done in the metrics code? I would like to
> > refer to it, but I could not find it.
> >
> > Using the formulas and assumptions above, here's my attempt at computing
> > recent "frac" values:
> >
> > date        `n(N)`  `n(H)`   `h(H)`   `n(R)`   `n(R\H)`  `h(R∧H)`  frac
> > 2022-04-01  166584  177638.  2.24e13  125491.  0         1.59e13   0.753
> > 2022-04-02  166951  177466.  2.18e13  125686.  0         1.54e13   0.753
> > 2022-04-03  167100  177718.  2.27e13  127008.  0         1.62e13   0.760
> > 2022-04-04  166970  177559.  2.43e13  126412.  0         1.73e13   0.757
> > 2022-04-05  166729  177585.  2.44e13  125389.  0         1.72e13   0.752
> > 2022-04-06  166832  177470.  2.39e13  127077.  0         1.71e13   0.762
> > 2022-04-07  166532  177210.  2.48e13  127815.  0         1.79e13   0.768
> > 2022-04-08  167695  176879.  2.52e13  127697.  0         1.82e13   0.761
> >
> > The "frac" column does not match the CSV. Also notice that n(N) < n(H),
> > which should be impossible because H is supposed to be a subset of N
> > (N is the set of all relays). But this is what I get when I estimate
> > n(N) from a network-status-consensus-3 and n(H) from extra-info
> > documents. Also notice that n(R) < n(H), which means that H cannot be a
> > subset of R, contrary to the observations above.
>
> These computations are a bit hidden in the metrics code. Specifically, they
> are in the website repository, in the SQL init scripts.
>
> This is the view that is responsible for computing the data that are then
> published in the csv:
>
> https://gitlab.torproject.org/tpo/network-health/metrics/website/-/blob/master/src/main/sql/clients/init-userstats.sql#L695
Thank you for this reference. It helped a lot.

Indeed, the Reproducible Metrics instructions for h(R∧H) do not match
the code. The instructions say:

    Compute h(R^H) as the number of written directory bytes for the
    fraction of time when a server was reporting both written
    directory bytes and directory responses. As above, this fraction
    is determined by first summing up all interval lengths and then
    computing the minimum of both sums divided by the sum of
    reported written directory bytes.

But the code does:
https://gitlab.torproject.org/tpo/network-health/metrics/website/-/blob/d1824014f6754f9658d7eb3abb72a460446d070e/src/main/sql/clients/init-userstats.sql#L521

    -- Update results based on nodes reporting both bytes and responses.
    UPDATE aggregated
      SET hrh = aggregated_bytes_responses.hrh
      FROM (
        SELECT bytes.date, bytes.node,
               SUM((LEAST(bytes.seconds, responses.seconds)
                   * bytes.val) / bytes.seconds) AS hrh
        FROM update_no_dimensions bytes
        LEFT JOIN update_no_dimensions responses
          ON bytes.date = responses.date
             AND bytes.fingerprint = responses.fingerprint
             AND bytes.node = responses.node
        WHERE bytes.metric = 'bytes'
          AND responses.metric = 'responses'
          AND bytes.seconds > 0
        GROUP BY bytes.date, bytes.node
      ) aggregated_bytes_responses
    WHERE aggregated.date = aggregated_bytes_responses.date
      AND aggregated.node = aggregated_bytes_responses.node;

The operative part is

    SUM((LEAST(bytes.seconds, responses.seconds) * bytes.val) / bytes.seconds) AS hrh

Two differences, one major and one minor:
1. "Divided by the sum of reported written directory bytes" is wrong. It
is actually *multiplied* by the sum of reported written directory
bytes, then *divided* by the sum of write interval lengths. This
gives the right dimensions: h(R∧H) is a count of bytes.
2. The value is not the result of "first summing up all interval lengths
and then computing the minimum of both sums"; rather it is taking the
minimum of the two interval lengths (bytes.seconds and
responses.seconds) *per entry*, then summing those minima. So, even
though n(R) is the sum of responses.seconds and n(H) is the sum of
bytes.seconds, one cannot just substitute n(R) and n(H) into the
formula for h(R∧H), as I was doing earlier.
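
To make the difference concrete, here is a minimal Python sketch of the
per-entry aggregation as I read the SQL. The numbers and variable names
are invented for illustration, not taken from the Metrics code:

    # Each entry is one (relay, date) pair that reported both statistics:
    # (bytes_seconds, bytes_val, responses_seconds), i.e. the length of the
    # dirreq-write-history interval, the written directory bytes, and the
    # length of the dirreq-stats-end interval. All numbers are made up.
    entries = [
        (86400, 2.0e9, 86400),  # both statistics cover the full day
        (86400, 1.5e9, 43200),  # responses cover half the write interval
        (43200, 0.8e9, 86400),  # write history covers half the day
    ]

    # SUM((LEAST(bytes.seconds, responses.seconds) * bytes.val) / bytes.seconds)
    hrh = sum(min(bs, rs) * bv / bs for bs, bv, rs in entries)
    print(hrh)  # 3550000000.0: the minimum is taken per entry, then summed

Taking the minimum of the two global sums n(R) and n(H) instead, as the
instructions suggest, generally gives a different number, which is part of
why my earlier attempt did not reproduce the CSV.
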
My interpretation of n(R\H) = n(R) − n(H) was basically correct, except
that something like point (2) above also applies: the Metrics SQL code
takes the maximum of zero and the difference within each entry, then sums
those maxima.

    Compute n(R\H) as the number of hours for which responses have
    been reported but no written directory bytes. This fraction is
    determined by summing up all interval lengths and then
    subtracting the written directory bytes interval length from the
    directory response interval length. Negative results are
    discarded.

https://gitlab.torproject.org/tpo/network-health/metrics/website/-/blob/d1824014f6754f9658d7eb3abb72a460446d070e/src/main/sql/clients/init-userstats.sql#L541

    UPDATE aggregated
      SET nrh = aggregated_responses_bytes.nrh
      FROM (
        SELECT responses.date, responses.node,
               SUM(GREATEST(0, responses.seconds
                   - COALESCE(bytes.seconds, 0))) AS nrh
        FROM update_no_dimensions responses
        LEFT JOIN update_no_dimensions bytes
          ON responses.date = bytes.date
             AND responses.fingerprint = bytes.fingerprint
             AND responses.node = bytes.node
        WHERE responses.metric = 'responses'
          AND bytes.metric = 'bytes'
        GROUP BY responses.date, responses.node
      ) aggregated_responses_bytes
    WHERE aggregated.date = aggregated_responses_bytes.date
      AND aggregated.node = aggregated_responses_bytes.node;
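
A similar sketch of the n(R\H) aggregation, as I understand it (again with
made-up numbers and my own variable names):

    # Each entry is one (relay, date) pair: (responses_seconds, bytes_seconds).
    # None stands in for a relay-day that reported responses but no
    # dirreq-write-history at all (COALESCE in the SQL). Numbers are made up.
    entries = [
        (86400, 86400),  # both reported: contributes 0
        (86400, 43200),  # responses exceed write history: contributes 43200
        (86400, None),   # no write history: contributes 86400
        (43200, 86400),  # negative difference: clamped to 0 and discarded
    ]

    # SUM(GREATEST(0, responses.seconds - COALESCE(bytes.seconds, 0)))
    nrh_seconds = sum(max(0, rs - (bs or 0)) for rs, bs in entries)
    print(nrh_seconds / 3600)  # 36.0 hours of responses with no written bytes
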
I updated my code (attached) to take the above into account. Now I get
numbers that are closer to the ones in the Metrics CSV, though they are
still about 1% too high.

date        n(N)      n(H)       h(H)       h(R∧H)     n(R\H)  frac
2022-04-01  166584.0  177638.12  2.244e+13  1.958e+13  126.01  0.931
2022-04-02  166951.0  177466.02  2.176e+13  1.914e+13  185.40  0.936
2022-04-03  167100.0  177717.50  2.265e+13  2.005e+13  160.40  0.942
2022-04-04  166970.0  177559.21  2.433e+13  2.139e+13  124.20  0.936
2022-04-05  166729.0  177585.20  2.440e+13  2.146e+13  97.25   0.937
2022-04-06  166832.0  177470.35  2.392e+13  2.109e+13  157.37  0.939
2022-04-07  166532.0  177224.64  2.477e+13  2.181e+13  132.94  0.938
2022-04-08  167695.0  177072.36  2.519e+13  2.208e+13  197.58  0.927
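
As a sanity check, plugging the 2022-04-01 row of this table back into the
frac formula reproduces the 0.931 in the last column (the variable names
below are mine):

    # frac = (h(R∧H)*n(H) + h(H)*n(R\H)) / (h(H)*n(N)), 2022-04-01 row.
    n_N       = 166584.0   # n(N): total relay hours
    n_H       = 177638.12  # n(H): hours covered by dirreq-write-history
    h_H       = 2.244e13   # h(H): written directory bytes
    h_R_and_H = 1.958e13   # h(R∧H): bytes written while also reporting responses
    n_R_not_H = 126.01     # n(R\H): hours with responses but no written bytes

    frac = (h_R_and_H * n_H + h_H * n_R_not_H) / (h_H * n_N)
    print(round(frac, 3))  # 0.931
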
-------------- next part --------------
A non-text attachment was scrubbed...
Name: frac-20220627.zip
Type: application/zip
Size: 5246 bytes
Desc: not available
URL: <http://lists.torproject.org/pipermail/tor-dev/attachments/20220626/8ab10c27/attachment.zip>