[tor-commits] [torspec/master] Modify 210-faster...consensus-bootstrap for exponential backoff
nickm at torproject.org
nickm at torproject.org
Wed Oct 21 15:55:32 UTC 2015
commit fc4256c9073161cac5d8fe1a303b2901d18ecf25
Author: teor (Tim Wilson-Brown) <teor2345 at gmail.com>
Date: Fri Oct 2 15:46:54 2015 +0200
Modify 210-faster...consensus-bootstrap for exponential backoff
To implement #4483 we need to contact multiple directory mirrors
to increase bootstrap reliability. This patch implements the
exponential backoff suggested in
https://trac.torproject.org/projects/tor/ticket/4483#comment:22
The patch also analyses the reliability of the new scheme, and
compares it to the current Tor implementation.
---
.../210-faster-headless-consensus-bootstrap.txt | 124 ++++++++++++++------
1 file changed, 86 insertions(+), 38 deletions(-)
diff --git a/proposals/210-faster-headless-consensus-bootstrap.txt b/proposals/210-faster-headless-consensus-bootstrap.txt
index 6b1502b..42726e5 100644
--- a/proposals/210-faster-headless-consensus-bootstrap.txt
+++ b/proposals/210-faster-headless-consensus-bootstrap.txt
@@ -1,9 +1,10 @@
Filename: 210-faster-headless-consensus-bootstrap.txt
Title: Faster Headless Consensus Bootstrapping
-Author: Mike Perry
+Author: Mike Perry, Tim Wilson-Brown, Peter Palfrader
Created: 01-10-2012
+Last Modified: 02-10-2015
Status: Open
-Target: 0.2.4.x+
+Target: 0.2.8.x+
Overview and Motiviation
@@ -19,19 +20,39 @@ Design: Bootstrap Process Changes
parallel during the bootstrap process, and download the consensus from
the first connection that completes.
- Connection attempts will be done in batches of three. Only one
- connection will be performed to one of the canonical directory
- authorities. Two connections will be performed to randomly chosen hard
- coded directory mirrors.
-
- If no connections complete within 5 seconds, another batch of three
- connections will be launched. Otherwise, the first connection to
- complete will be used to download the consensus document and the others
- will be closed, after which bootstrapping will proceed as normal.
+ Connection attempts will be performed on an exponential backoff basis.
+ Initially, connections will be performed to randomly chosen hard
+ coded directory mirrors. If none of these connections complete within
+ 5 seconds, connections will also be performed to randomly chosen
+ canonical directory authorities.
+
+ We specify that mirror connections retry after half a second, and then
+ double the retry time with every connection:
+ 0, 0.5, 1, 2, 4, 8, 16, ...
+
+ We specify that directory authority connections start after a 5 second
+ delay, and retry after 5 seconds, doubling the retry time with every
+ connection:
+ 5, 10, 20, ...
+
+ The first connection to complete will be used to download the consensus
+ document and the others will be closed, after which bootstrapping will
+ proceed as normal.
+
+ We expect the vast majority of clients to succeed within 4 seconds,
+ after making up to 5 connection attempts to mirrors. Clients which can't
+ connect in the first 5 seconds, will then try to contact a directory
+ authority. We expect almost all clients to succeed within 10 seconds,
+ after up to 6 connection attempts to mirrors and up to 2 connection
+ attempts to authorities. This is a much better success rate than the
+ current Tor implementation, which fails k/n of clients if k of the n
+ directory authorities are down. (Or, if the connection fails in
+ certain ways, (k/n)^2.)
If at any time, the total outstanding bootstrap connection attempts
- exceeds 15, no new connection attempts are to be launched until existing
- connection attempts experience full timeout.
+ exceeds 10, no new connection attempts are to be launched until an
+ existing connection attempt experiences full timeout. The retry time
+ is not doubled when a connection is skipped.
Design: Fallback Dir Mirror Selection
@@ -43,8 +64,8 @@ Design: Fallback Dir Mirror Selection
This list of fallback dir mirrors should be updated with every
major Tor release. In future releases, the number of dir mirrors
- should be set at 20% of the current Guard nodes, rather than fixed at
- 100.
+ should be set at 20% of the current Guard nodes (approximately 200 as
+ of October 2015), rather than fixed at 100.
Performance: Additional Load with Current Parameter Choices
@@ -62,19 +83,20 @@ Performance: Additional Load with Current Parameter Choices
The dangerous case is in the event of a prolonged consensus failure
that induces all clients to enter into the bootstrap process. In this
- case, the number of initial TLS connections to the fallback dir mirrors
- would be 2*C/100, or 10,000 for C=500,000 users. If no connections
- complete before the five retries, this could reach as high as 50,000
- connection attempts, but this is extremely unlikely to happen in full
- aggregate.
+ case, the number of TLS connections to the fallback dir mirrors within
+ the first second would be 3*C/100, or 60,000 for C=2,000,000 users. If
+ no connections complete before the 10 retries, 7 of which go to
+ mirrors, this could reach as high as 140,000 connection attempts, but
+ this is extremely unlikely to happen in full aggregate.
However, in the no-consensus scenario today, the directory authorities
- would already experience C/9 or 55,555 connection attempts. The
- 5-retry scheme increases their total maximum load to about 275,000
- connection attempts, but again this is unlikely to be reached
- in aggregate. Additionally, with this scheme, even if the dirauths
- are taken down by this load, the dir mirrors should be able to survive
- it.
+ would already experience 2*C/9 or 444,444 connection attempts. (Tor
+ currently tries 2 authorities, before delaying the next attempt.) The
+ 10-retry scheme, 3 of which go to authorities, increases their total
+ maximum load to about 666,666 connection attempts, but again this is
+ unlikely to be reached in aggregate. Additionally, with this scheme,
+ even if the dirauths are taken down by this load, the dir mirrors
+ should be able to survive it.
Implementation Notes: Code Modifications
@@ -87,12 +109,16 @@ Implementation Notes: Code Modifications
a single directory server is selected and a connection is
eventually made through directory_initiate_command_rend().
- There appear to be a few options for altering this code to perform
- multiple connections. Without refactoring, one approach would be
- to make multiple calls to directory_initiate_command_routerstatus()
- from directory_get_from_dirserver() if the purpose is
+ There appear to be a few options for altering this code to retry multiple
+ simultaneous connections. Without refactoring, one approach would be to
+ set mirror and authority retry helper function timers in
+ directory_initiate_command_routerstatus() from
+ directory_get_from_dirserver() if the purpose is
DIR_PURPOSE_FETCH_CONSENSUS and the only directory servers available
- are the authorities and the fallback dir mirrors.
+ are the authorities and the fallback dir mirrors. (That is, there is no
+ valid consensus.) The retry helper function would check the list of
+ pending connections and, if it is 10 or greater, skip the connection
+ attempt, and leave the retry time constant.
The code in directory_initiate_command_rend() would then need to be
altered to maintain a list of the dircons created for this purpose as
@@ -104,10 +130,32 @@ Implementation Notes: Code Modifications
altered to examine the list of pending dircons, determine if this one is
the first to complete, and if so, then call directory_send_command() to
download the consensus and close the other pending dircons.
-
- An additional timer would need to be installed to re-call
- update_consensus_networkstatus_downloads() or a related helper after 5
- seconds. connection_dir_finished_connecting() would cancel this timer.
- The helper would check the list of pending connections and ensure it
- never exceeds 15.
-
+ connection_dir_finished_connecting() would also cancel both timers.
+
+Reliability Analysis
+
+ We make the pessimistic assumptions that 50% of connections to directory
+ mirrors fail, and that 20% of connections to authorities fail. (Actual
+ figures depend on relay churn, age of the fallback list, and authority
+ uptime.)
+
+ We expect the first 10 connection retry times to be:
+ Mirror: 0s 0.5s 1s 2s 4s 8s 16s
+ Auth: 5s 10s 20s
+ Success: 50% 75% 87% 94% 97% 99.4% 99.7% 99.94% 99.97% 99.99%
+
+ 97% of clients succeed while only using directory mirrors.
+ 2.4% of clients succeed on their first auth connection.
+ 0.24% of clients succeed after one more mirror and auth connection.
+ 0.05% of clients succeed after two more mirror and auth connections.
+ 0.01% of clients remain, but in this scenario, 3 authorities are down,
+ so the client is most likely blocked from the Tor network.
+
+ The current implementation makes 1 or 2 authority connections within the
+ first second, depending on exactly how the first connection fails. Under
+ the 20% authority failure assumption, these clients would have a success
+ rate of either 80% or 96% within a few seconds. The scheme above has a
+ similar success rate in the first few seconds, while spreading the load
+ among a larger number of directory mirrors. In addition, if all the
+ authorities are blocked, current clients will inevitably fail, as they
+ do not have a list of directory mirrors.
More information about the tor-commits
mailing list