[tor-bugs] #4631 [Core Tor/Tor]: Idea to make consensus voting more resistant

Wed Feb 19 01:19:43 UTC 2020

#4631: Idea to make consensus voting more resistant
-------------------------------------------------+-------------------------
 Reporter:  Sebastian                            |          Owner:  teor
     Type:  defect                               |         Status:
                                                 |  needs_review
 Priority:  Medium                               |      Milestone:  Tor:
                                                 |  0.4.4.x-final
Component:  Core Tor/Tor                         |        Version:
 Severity:  Normal                               |     Resolution:
 Keywords:  needs-dirauth-email needs-torspec-   |  Actual Points:  0.2
  update tor-dirauth robustness voting           |
Parent ID:  #33050                               |         Points:  0.5
 Reviewer:  nickm                                |        Sponsor:
                                                 |  Sponsor55-can
-------------------------------------------------+-------------------------
Changes (by teor):

 * status:  needs_revision => needs_review

Comment:

 Replying to [comment:21 nickm]:
 > Reviewing the idea only, before looking at the patch:
 >
 > So the idea is that we cut off the deadline for POSTed votes early, to
 increase the odds that fetching votes from elsewhere will give everybody
 the same set of votes?  That's plausible, but I'm wondering whether we're
 replacing one race condition with another.

 I think we *are* replacing one race condition with another, but in
 practice, the replacement is less risky. Here's why:
 * in the first phase, each authority uploads *its own vote* to each other
 authority (1->N)
 * in the second phase, each authority downloads *all votes* from each
 other authority (N<-N)

 So the second phase ensures that all well-connected authorities have the
 same set of votes. But it requires "no partial, late uploads", because
 each partial, late upload splits the set of authorities.

 This patch enforces the "no late uploads" rule, which implies "no partial,
 late uploads".
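
 The split described above can be illustrated with a toy model. This is a
 sketch only, not the tor implementation: the function name, the data
 layout, and the assumption that the fetch phase gives every well-connected
 authority the union of all on-time votes are hypothetical simplifications.

```python
def final_vote_sets(authorities, uploads, fetch_time, reject_late):
    """Toy model of the two voting phases (hypothetical, not tor code).

    uploads is a list of (voter, receiver, arrival_time) tuples.
    Returns {authority: set of voters whose votes it holds}.
    """
    held = {a: set() for a in authorities}

    # Phase 1 (1->N): POSTed votes that arrive by the fetch time are stored.
    for voter, receiver, arrival in uploads:
        if arrival <= fetch_time:
            held[receiver].add(voter)

    # Phase 2 (N<-N): assume every authority fetches every vote that any
    # other authority holds, so all of them end up with the same union.
    union = set().union(*held.values())
    held = {a: set(union) for a in authorities}

    # Late POSTs only reach the authorities they were sent to.
    for voter, receiver, arrival in uploads:
        if arrival > fetch_time and not reject_late:
            held[receiver].add(voter)
    return held


authorities = ["B", "C", "D"]
on_time = [(v, r, 1.0) for v in authorities for r in authorities if v != r]
partial_late = [("A", "B", 11.0)]  # A's vote reaches only B, after the fetch
uploads = on_time + partial_late

without_patch = final_vote_sets(authorities, uploads, 10.0, reject_late=False)
with_patch = final_vote_sets(authorities, uploads, 10.0, reject_late=True)
```

 In this model, accepting the late upload leaves B with a different vote set
 from C and D, while rejecting it leaves all three with the same set.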

 > I'm also interested in:
 >   * what happens during the transition period where some authorities
 have upgraded but others have not.

 Some authorities reject late uploads, others do not.

 If the late upload is a full, late upload, then the set of authorities
 splits between:
 * authorities that reject the late upload, and
 * authorities that accept the late upload.
 The mitigation is to ensure that all authorities' clocks are synchronised
 within ~5 minutes.

 Right now, all authority clocks are synchronised to within ~1 second:
 https://consensus-health.torproject.org/#authorityclocks

 If the late upload is a partial, late upload, then the set of authorities
 is already split. The authorities that reject the late upload are
 protected from the split. (Strictly, they reject the late upload and end
 up in the set of authorities that didn't get the vote.)

 In my chutney tests, this patch replaces "mismatched digest" (consensus
 failure) with "Vote received too late" (late vote rejection, consensus
 success).

 I've also done some testing in mixed 0.3.5 and master networks (equal
 numbers of authorities). Consensus failures were common in mixed networks
 before this patch; I haven't seen one since applying it.

 >   * whether the fetch time is the right point for the POST cutoff, or
 whether we'd be better off having it be a second or two before.

 The cutoff is checked at the time the entire vote has been received, so
 the current code works as-is. But ideally, we would want multiple
 authorities to receive each vote before the cutoff.

 For chutney (and other fast networks), the current cutoff should not be
 changed: the intervals are only a few seconds long.

 For the public network (and other slow networks), votes are about 3.5 MB,
 and let's say authorities have about 10 Mbps of bandwidth spare for votes.
 So that's:
 `3 seconds per vote upload * 8 authorities = 24 seconds`
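
 As a check on that estimate, here is the same arithmetic as a small
 sketch. The function name is hypothetical; the 3.5 MB, 10 Mbps and 8
 uploads are the rough figures given above, not measurements.

```python
import math


def upload_seconds(vote_size_mb, spare_mbps):
    """Seconds to send one vote: megabytes -> megabits, then divide by rate."""
    return vote_size_mb * 8 / spare_mbps


per_vote = upload_seconds(3.5, 10)   # about 2.8 seconds
total = math.ceil(per_vote) * 8      # round up to 3 s per vote, 8 uploads
```

 Rounding each upload up to 3 seconds gives the 24 second total.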

 So let's set the cutoff to `fetch_missing_votes - vote_delay/6`?

 Note that fetch_missing_votes is already `start - dist_delay -
 (vote_delay/2)`, so we can't go any earlier than `fetch_missing_votes -
 vote_delay/4`.

 Also, fast networks need about 4 seconds to vote reliably (upload,
 process, download, process), so the extra cutoff should be at most
 `vote_delay/5`.
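
 To make the timing concrete, here is a sketch of the proposed schedule.
 The function and parameter names are hypothetical. It assumes that voting
 begins at `start - dist_delay - vote_delay`, and the example assumes
 `vote_delay` and `dist_delay` are both 300 seconds on the public network.

```python
def voting_schedule(start, vote_delay, dist_delay, cutoff_divisor=6):
    """Return (voting_starts, fetch_missing_votes, post_cutoff) in seconds.

    start is the beginning of the consensus interval. cutoff_divisor=6 is
    the vote_delay/6 proposal from this comment, not a settled value.
    """
    # Assumption: authorities start uploading votes at this time.
    voting_starts = start - dist_delay - vote_delay
    # From the comment: fetch_missing_votes is half way through vote_delay.
    fetch_missing_votes = start - dist_delay - vote_delay / 2
    # Proposed: stop accepting POSTed votes a little before the fetch.
    post_cutoff = fetch_missing_votes - vote_delay / cutoff_divisor
    return voting_starts, fetch_missing_votes, post_cutoff


# Example with start = 3600 s and the assumed public-network delays.
voting_starts, fetch_time, post_cutoff = voting_schedule(3600, 300, 300)
upload_window = post_cutoff - voting_starts  # time available for POSTs
```

 With these assumed values, the upload window is 100 seconds, which is
 larger than the 24 second upload estimate.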

 Interestingly, the "successful vote upload" bandwidth cutoff for public
 authorities is about:
 `3.5 MB * 8 vote uploads / 300 seconds * 8 bits = 0.75 Mbit/s`
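
 The same figure as a small sketch (the function name is hypothetical, and
 the inputs are the rough numbers used above):

```python
def min_upload_mbps(vote_size_mb, uploads, window_seconds):
    """Lowest rate (Mbit/s) that sends every vote upload within the window."""
    return vote_size_mb * uploads * 8 / window_seconds


rate = min_upload_mbps(3.5, 8, 300)  # roughly 0.75 Mbit/s
```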

 > We should also think about our diagnostics for unsynchronized
 consensuses. Are they good enough that we'll be able to tell whether this
 is helping or hurting?

 Yes, I've seen the diagnostics change in chutney from "mismatched digest"
 to "Vote received too late".

 If rejecting late votes splits the consensus, we'll see both "Vote
 received too late" and "mismatched digest" for the same consensus period.

 Replying to [comment:22 nickm]:
 > Teor, how confident are you that this is actually helping out for the
 chutney case?  Since you've been running into sync issues on chutney, I
 guess you've probably been testing this patch in that case?

 It is surprisingly effective. I haven't seen a consensus failure in ~100
 chutney runs since applying this patch.

 > I've added a small comment on the patch. This looks reasonably solid.

 I've fixed some of the logging in the patch.

 Our remaining work is to choose the exact threshold.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/4631#comment:23>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online