[metrics-bugs] #28137 [Metrics/Statistics]: Modify "Total consensus weights across bandwidth authorities" graph to only include relays that end up in the consensus

Thu Nov 22 11:47:04 UTC 2018

#28137: Modify "Total consensus weights across bandwidth authorities" graph to only
include relays that end up in the consensus
--------------------------------+------------------------------
 Reporter:  karsten             |          Owner:  metrics-team
     Type:  enhancement         |         Status:  new
 Priority:  Medium              |      Milestone:
Component:  Metrics/Statistics  |        Version:
 Severity:  Normal              |     Resolution:
 Keywords:                      |  Actual Points:
Parent ID:                      |         Points:
 Reviewer:                      |        Sponsor:
--------------------------------+------------------------------

Comment (by teor):

 Replying to [comment:5 karsten]:
 > Alright, I implemented the idea above.
 >
 > However, it turns out that matching all vote entries with all consensus
 entries ''cannot'' be done with reasonable effort, at least not with the
 current tools we use. For example, processing 3 days of descriptors takes a
 quite reasonable 5 minutes, but processing 3 weeks of descriptors already
 takes almost 3 hours. This simply doesn't scale to 3 months or 3 years.

 So the process scales non-linearly?
 When processing 3 days, each hour of consensuses and votes takes about 4
 seconds (5 minutes / 72 hours).
 But when processing 3 weeks, each hour of consensuses and votes takes about
 21 seconds (almost 3 hours / 504 hours).
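
 For reference, the per-hour figures can be rechecked with a few lines of
 Python. This is only a rough sketch based on the two timings quoted above
 ("5 minutes" and "almost 3 hours", taken here as exactly 3 hours); the
 function and variable names are made up for illustration. The last value
 is a crude two-point estimate of the growth exponent, not a measurement.

```python
import math

HOURS_PER_DAY = 24

def seconds_per_hour_of_data(total_seconds, days_of_data):
    """Processing time spent per hour of consensuses and votes."""
    return total_seconds / (days_of_data * HOURS_PER_DAY)

# Timings quoted in comment 5; "almost 3 hours" is rounded up to 3 hours.
short_run = seconds_per_hour_of_data(5 * 60, 3)        # 3 days of descriptors
long_run = seconds_per_hour_of_data(3 * 60 * 60, 21)   # 3 weeks of descriptors

# Input grew by 21 / 3 = 7x, total time by (3 * 3600) / (5 * 60) = 36x.
# If time ~ input ** k, then k = log(36) / log(7).
growth_exponent = math.log((3 * 60 * 60) / (5 * 60)) / math.log(21 / 3)

print(f"{short_run:.1f} s/hour vs. {long_run:.1f} s/hour, "
      f"exponent ~ {growth_exponent:.2f}")
```

 With only two data points this merely suggests something close to quadratic
 growth, which would be consistent with a join that gets more expensive as
 the tables fill up.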

 Can you do each consensus separately?

 More precisely:

 For each consensus, in a set of temporary tables:
 * We import all fingerprints in a consensus together with the votes
 referenced from the consensus.
 * We import fingerprints into a table and assign numeric identifiers that
 we use in other tables.
 * We import all fingerprints in a vote together with a way to refer to the
 consensus coming out of it.
 * When aggregating, we join votes with the consensus they refer to, then
 persist relevant data in permanent tables, with permanent identifiers.
 * After aggregating, we delete all votes that we aggregated in the
 previous step, and we delete a consensus once we have aggregated all votes
 referenced from it.
 * If there is any data left, we persist that data in permanent tables,
 with permanent identifiers.
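
 The per-consensus idea above could look roughly like the following sketch.
 It uses Python's sqlite3 module purely for illustration; the actual metrics
 code and database are not shown in this ticket, so the table names
 (tmp_consensus, tmp_vote, measured_totals), the column names and the input
 format are all assumptions. The numeric fingerprint identifiers from the
 list are left out to keep the sketch short.

```python
import sqlite3

def aggregate_one_consensus(conn, valid_after, consensus_fps, votes):
    """Aggregate one consensus and its votes, then discard the temp data.

    votes maps an authority name to {fingerprint: measured bandwidth};
    this input shape is hypothetical.
    """
    cur = conn.cursor()
    # Temporary tables only ever hold a single consensus and its votes,
    # so the join below stays small no matter how much history we process.
    cur.execute("CREATE TEMP TABLE tmp_consensus (fingerprint TEXT PRIMARY KEY)")
    cur.execute("CREATE TEMP TABLE tmp_vote "
                "(authority TEXT, fingerprint TEXT, measured INTEGER)")
    cur.executemany("INSERT INTO tmp_consensus VALUES (?)",
                    [(fp,) for fp in consensus_fps])
    cur.executemany("INSERT INTO tmp_vote VALUES (?, ?, ?)",
                    [(authority, fp, bw)
                     for authority, entries in votes.items()
                     for fp, bw in entries.items()])
    # Join votes with the consensus and persist only the aggregate.
    cur.execute("""
        INSERT INTO measured_totals (valid_after, authority, measured_sum)
        SELECT ?, v.authority, SUM(v.measured)
        FROM tmp_vote v JOIN tmp_consensus c USING (fingerprint)
        WHERE v.measured IS NOT NULL
        GROUP BY v.authority""", (valid_after,))
    conn.commit()
    # Delete everything we just aggregated.
    cur.execute("DROP TABLE tmp_vote")
    cur.execute("DROP TABLE tmp_consensus")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measured_totals "
             "(valid_after TEXT, authority TEXT, measured_sum INTEGER)")
aggregate_one_consensus(
    conn, "2018-11-22 11:00:00",
    consensus_fps=["A", "B"],
    votes={"auth1": {"A": 100, "B": 200, "C": 400},
           "auth2": {"A": 50, "C": 70}})
rows = conn.execute("SELECT authority, measured_sum FROM measured_totals "
                    "ORDER BY authority").fetchall()
print(rows)
```

 In this made-up example relay C is in both votes but not in the consensus,
 so its measured bandwidth is left out of both sums.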

 If this is going to take a lot of effort, then don't worry about it: the
 difference isn't important in this case.

 > We do have these 3 weeks from my tests though, so let's look at the
 results:
 > ...
 > The red line is what's currently on the Tor Metrics website: it contains
 measured bandwidths of all relays in a vote, regardless of whether a relay
 made it into the consensus. The blue line only contains those relays in a
 vote that also appeared in the consensus. I'd say that the difference is
 almost negligible.

 I agree: we could account for it with some documentation.

 > What I'd like to try out is add a third line "Running in vote", which
 would at least kick out relays in a vote that the authority didn't find to
 be running. I'd expect that line to show up between red and blue. However,
 a relay that doesn't have the Running flag in one vote can still go into
 the consensus if the others think it's running. And a relay that has the
 Running flag from one authority can still not show up in the consensus if
 the others disagree. So, I'm unclear whether this really helps. Worth
 trying, and a much smaller change, because it doesn't require us to match
 vote entries with consensus entries.

 Let's see the results for Running, then decide.
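
 To make the three candidate lines concrete, here is a small sketch that
 computes them for a single vote. The VoteEntry type and the sample values
 are invented for illustration and do not reflect how the metrics code
 represents votes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VoteEntry:
    fingerprint: str
    measured: int    # measured bandwidth from this authority's vote
    running: bool    # whether this authority assigned the Running flag

def three_totals(vote_entries, consensus_fingerprints):
    """Return (all in vote, Running in vote, also in consensus) totals."""
    all_measured = sum(e.measured for e in vote_entries)
    running_in_vote = sum(e.measured for e in vote_entries if e.running)
    in_consensus = sum(e.measured for e in vote_entries
                       if e.fingerprint in consensus_fingerprints)
    return all_measured, running_in_vote, in_consensus

vote = [
    VoteEntry("A", 100, True),
    VoteEntry("B", 200, False),  # not Running here, but still in the consensus
    VoteEntry("C", 400, True),   # Running here, but not in the consensus
]
totals = three_totals(vote, consensus_fingerprints={"A", "B"})
print(totals)
```

 Relays B and C are the two cases karsten describes: the Running filter
 only needs the vote itself, but it does not always agree with consensus
 membership.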

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/28137#comment:6>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online