[tor-dev] Next version of the algorithm
George Kadianakis
desnacked at riseup.net
Thu Mar 3 14:15:26 UTC 2016
Ola Bini <obini at thoughtworks.com> writes:
> Hi,
Here are some more comments to the latest version of the proposal, as seen here:
https://github.com/twstrike/torspec
> Filename: 259-guard-selection.txt
> Title: New Guard Selection Behaviour
> Author: Isis Lovecruft, George Kadianakis, [Ola Bini]
> Created: 2015-10-28
> Status: Draft
>
> §1. Overview
>
> Tor uses entry guards to prevent an attacker who controls some
> fraction of the network from observing a fraction of every user's
> traffic. If users chose their entries and exits uniformly at
> random from the list of servers every time they build a circuit,
> then an adversary who had (k/N) of the network would deanonymize
> F=(k/N)^2 of all circuits... and after a given user had built C
> circuits, the attacker would see them at least once with
> probability 1-(1-F)^C. With large C, the attacker would get a
> sample of every user's traffic with probability 1.
>
> To prevent this from happening, Tor clients choose a small number of
> guard nodes (currently 3). These guard nodes are the only nodes
> that the client will connect to directly. If they are not
> compromised, the user's paths are not compromised.
>
> But attacks remain. Consider an attacker who can run a firewall
> between a target user and the Tor network, and make
> many of the guards they don't control appear to be unreachable.
> Or consider an attacker who can identify a user's guards, and mount
> denial-of-service attacks on them until the user picks a guard
> that the attacker controls.
>
> In the presence of these attacks, we can't continue to connect to
> the Tor network unconditionally. Doing so would eventually result
> in the user choosing a hostile node as their guard, and losing
> anonymity.
>
> This proposal outlines a new entry guard selection algorithm, which
> addresses the following concerns:
>
> - Heuristics and algorithms for determining how and which guard(s)
> is(/are) chosen should be kept as simple and easy to understand
> as possible.
>
> - Clients in censored regions or who are behind a fascist firewall
> who connect to the Tor network should not experience any
> significant disadvantage in terms of reachability or usability.
>
> - Tor should make a best attempt at discovering the most
> appropriate behaviour, with as little user input and
> configuration as possible.
>
>
> §2. Design
>
> Alice, an OP attempting to connect to the Tor network, should
> undertake the following steps to determine information about the
> local network and to select (some) appropriate entry guards. In the
> following scenario, it is assumed that Alice has already obtained a
> recent, valid, and verifiable consensus document.
>
> The algorithm is divided into four components such that the full
> algorithm is implemented by first invoking START, then repeatedly
> calling NEXT while adviced it SHOULD_CONTINUE and finally calling
> END. For an example usage see §A. Appendix.
>
> Several components of NEXT can be invoked asynchronously. SHOULD_CONTINUE
> is used for the algorithm to be able to tell the caller whether we
> consider the work done or not - this can be used to retry primary
> guards when we finally are able to connect to a guard after a long
> network outage, for example.
>
> This algorithm keeps track of the unreachability status for guards
> in state private to the algorithm - this is re-initialized every time
> START is called.
>
Hmm, didn't we decide to persist the unreachability status over runs, right?
Or not?
The argument for doing so is "What happens if I'm a hidden service that needs to
make 5 circuits per second, and my first 2 guards happen to be offline? Do I
have to spend time probing them everytime I make a new circuit?".
> The algorithm expects several arguments to guide its behavior. These
> will be defined in §2.1.
>
> The goal of this algorithm is to strongly prefer connecting to the
> same guards we have connected to before, while also trying to detect
> conditions such as a network outage or a network environment that
> blocks most ports. The way it does this is by keeping track of how
> many guards we have exposed ourselves to, and if we have connected
> to too many we will fall back to only retrying the ones we have
> already tried. The algorithm also decides on sample set that should
> be persisted - in order to minimize the risk of an attacker forcing
> enumeration of the whole network by triggering rebuilding of
> circuits.
>
>
> §2.1. The START algorithm
>
> In order to start choosing an entry guard, use the START
> algorithm. This takes four arguments that can be used to fine tune
> the workings:
>
> USED_GUARDS
> This is a list that contains all the guards that have been used
> before by this client. We will prioritize using guards from this
> list in order to minimize our exposure. The list is expected to
> be sorted based on priority, where the first entry will have the
> highest priority.
>
> SAMPLED_UTOPIC_GUARDS
> This is a set that contains all guards that should be considered
> for connection under utopic conditions. This set should be
> persisted between runs. It will be filled in by the algorithm if
> it's empty, or if it contains less than SAMPLE_SET_THRESHOLD
> guards after winnowing out older guards. It should be filled by
> using NEXT_BY_BANDWIDTH with UTOPIC_GUARDS as an argument.
>
Should we use UTOPIC_GUARDS or REMAINING_UTOPIC_GUARDS as the argument?
> SAMPLED_DYSTOPIC_GUARDS
> This is a set that contains all guards that should be considered
> for connection under dystopic conditions. This set should be
> persisted between runs. It will be filled in by the algorithm if
> it's empty, or if it contains less than SAMPLE_SET_THRESHOLD
> guards after winnowing out older guards. It should be filled by
> using NEXT_BY_BANDWIDTH with DYSTOPIC_GUARDS as an argument.
>
> EXCLUDE_NODES
> A set of nodes that we should not consider using as a guard.
>
> N_PRIMARY_GUARDS
> The number of guards we should consider our primary
> guards. These guards will be retried more frequently and will
> take precedence in most situations. By default the primary
> guards will be the first N_PRIMARY_GUARDS guards from USED_GUARDS.
>
> DIR
> If this argument is set, we should only consider guards that can
> be directory guards. If not set, we will consider all guards.
>
> The primary work of START is to initialize the state machine depicted
> in §2.2 The NEXT algorithm.
> The initial state of the machine is defined by:
>
> GUARDS
> This is a set of all guards from the consensus, without
> EXCLUDE_NODES and potentially filtered if DIR is set.
>
> UTOPIC_GUARDS
> This is a set of all guards to use under utopic conditions. It
> will primarily be used to fill in SAMPLED_UTOPIC_GUARDS. This
> set will be initialized to be the same as GUARDS.
>
> DYSTOPIC_GUARDS
> This is a set of all guards to use under dystopic conditions
> (usually when we are subject to a firewall that restricts the
> ports we can connect to). It will primarily be used to fill in
> SAMPLED_UTOPIC_GUARDS. This set will be initialized to be the
I guess you mean SAMPLED_DYSTOPIC_GUARDS.
> subset of GUARDS that listen to ports that are allowed by
> dystopic conditions.
>
> REMAINING_UTOPIC_GUARDS
> This is a running set of the utopic guards we have not yet tried
> to connect to. It should be initialized to be SAMPLED_UTOPIC_GUARDS
> without USED_GUARDS.
>
Maybe here we should also mention that we will reinsert guards that we have not
tried in a long time (GUARDS_RETRY_TIME) as specified by 2.2.2?
> REMAINING_DYSTOPIC_GUARDS
> This is a running set of the dystopic guards we have not yet tried
> to connect to. It should be initialized to be SAMPLED_DYSTOPIC_GUARDS
> without USED_GUARDS.
>
> TRIED_GUARDS
> This set keeps track of all utopic guards we have tried connecting
> to. This should be initialized to the empty set.
>
> TRIED_DYSTOPIC_GUARDS
> This set keeps track of all dystopic guards we have tried connecting
> to. This should be initialized to the empty set.
>
> STATE
> A variable that keeps track of which state in the state
> machine we are currently in. It should be initialized to
> STATE_PRIMARY_GUARDS.
>
> PRIMARY_GUARDS
> This list keeps track of our primary guards. These are guards
> that we will prioritize when trying to connect, and will also
> retry more often in case of failure with other guards.
> It should be initialized by calling algorithm
> NEXT_PRIMARY_GUARD repeatedly until PRIMARY_GUARDS contains
> N_PRIMARY_GUARDS elements.
>
>
> §2.2. The NEXT algorithm
>
> The NEXT algorithm is composed of several different possibly flows. The
> first one is a simple state machine that can transfer between four
> different states. Every time NEXT is invoked, it will resume at the
> state where it left off previously. In the course of selecting an
> entry guard, a new consensus can arrive. When that happens we need
> to update the data structures used, but nothing else should change.
>
> Before jumping in to the state machine, we should first check if it
> was at least PRIMARY_GUARDS_RETRY_INTERVAL minutes since we tried
> any of the PRIMARY_GUARDS. If this is the case, and we are not in
> STATE_PRIMARY_GUARDS, we should save the previous state and set the
> state to STATE_PRIMARY_GUARDS.
>
>
> §2.2.1. The STATE_PRIMARY_GUARDS state
>
> Return each entry in PRIMARY_GUARDS in turn. For each entry, if it
> was not possible to connect to it, mark the entry as unreachable,
> and add it to TRIED_GUARDS.
> [XXX defining "was not possible to connect" as "entry is not live" according
> to current definition of "live entry guard" in tor source code, seems
> to improve success rate on the flaky network scenario.
> See: https://github.com/twstrike/tor_guardsim/issues/1#issuecomment-187374942]
>
Hmm, I'm not sure what this XXX means exactly. I believe we should actually try
to _connect_ to those primary guards and not just check if we think they are live.
> If all entries have been tried, restore the previous state and go
> there. If there is no previous state, transition to STATE_TRY_UTOPIC.
>
>
> §2.2.2. The STATE_TRY_UTOPIC state
>
> In order to give guards that have been marked as unreachable a
> chance to come back, add all entries in TRIED_GUARDS that were
> marked as unreachable more than GUARDS_RETRY_TIME minutes ago back
> to REMAINING_UTOPIC_GUARDS.
>
I'm a bit puzzled by this mechanism. Maybe it's benefits can be explained a bit
more clearly?
When we add guards back to REMAINING_UTOPIC_GUARDS, do we also remove them from
TRIED_GUARDS?
Also, the value here is currently 20 minutes. So this will trigger only when
this algorithm takes over 20 minutes to terminate. I guess this should only
happen when the network is down.
Now that we have persistent SAMPLED_UTOPIC_GUARDS is this still useful? Won't
we have fully populated our SAMPLED_*_GUARDS structures by the point this rule
triggers?
Sorry for the confusion :)
> Return each entry in USED_GUARDS that is not in PRIMARY_GUARDS in
> turn. For each entry, if it was not possible to connect to it, mark
> the entry as unreachable and add it to TRIED_GUARDS.
>
> Return each entry from REMAINING_UTOPIC_GUARDS using
> NEXT_BY_BANDWIDTH. For each entry, if it was not possible to connect
> to it, remove the entry from REMAINING_UTOPIC_GUARDS, mark it as
> unreachable and add it to TRIED_GUARDS.
>
> If no entries remain in REMAINING_UTOPIC_GUARDS, transition to
> STATE_TRY_DYSTOPIC.
>
>
> §2.2.3. The STATE_TRY_DYSTOPIC state
>
> In order to give guards that have been marked as unreachable a
> chance to come back, add all entries in TRIED_DYSTOPIC_GUARDS that were
> marked as unreachable more than GUARDS_RETRY_TIME minutes ago back
> to REMAINING_DYSTOPIC_GUARDS.
>
> Return each dystopic entry in USED_GUARDS that is not in PRIMARY_GUARDS in
> turn. For each entry, if it was not possible to connect to it, mark
> the entry as unreachable and add it to TRIED_DYSTOPIC_GUARDS.
>
> Return each entry from REMAINING_DYSTOPIC_GUARDS using
> NEXT_BY_BANDWIDTH. For each entry, if it was not possible to connect
> to it, remove the entry from REMAINING_DYSTOPIC_GUARDS, mark it as
> unreachable and add it to TRIED_DYSTOPIC_GUARDS.
>
> If no entries remain in REMAINING_DYSTOPIC_GUARDS, transition to
> STATE_PRIMARY_GUARDS.
>
>
> §2.2.5. ON_NEW_CONSENSUS
>
> First, ensure that all guard profiles are updated with information
> about whether they were in the newest consensus or not. If not, the
> guard is considered bad.
>
Maybe instead of "If not" we could say "If a guard is not included in the newest
consensus" to make it a bit clearer.
> If any PRIMARY_GUARDS have become bad, remove the guard from
> PRIMARY_GUARDS. Then ensure that PRIMARY_GUARDS contain
> N_PRIMARY_GUARDS entries by repeatedly calling NEXT_PRIMARY_GUARD.
>
> If any guards in USED_GUARDS have switched from being bad to being
> non-bad, add it back in the place it should have been in
> PRIMARY_GUARDS if it had been non-bad when populating
> PRIMARY_GUARDS. If this results in PRIMARY_GUARDS being larger than
> N_PRIMARY_GUARDS, truncate the list to be N_PRIMARY_GUARDS entries
> long.
> [XXX Does "add it back in the place it should have been in PRIMARY_GUARDS
> if it had been non-bad" implies keeping original order?]
>
If I understand correctly, I think the answer to this XXX is "Ideally, yes.".
I'm curious to see how this mechanism will be implemented because it's important
and it would be nice if it's done cleanly.
Also, we should be careful about when we count 'bad' guards. After a few weeks
of operation, the USED_GUARDS list can accumulate multiple bad guards, and we
should make sure we don't count them when we do our threshold checks.
>
> §2.3. The SHOULD_CONTINUE algorithm
>
> This algorithm takes as an argument a boolean indicating whether the
> circuit was successfully built or not.
>
> After the caller have tried to build a circuit with a returned
> guard, they should invoke SHOULD_CONTINUE to understand if the
> algorithm is finished or not. SHOULD_CONTINUE will always return
> true if the circuit failed. If the circuit succeeded,
> SHOULD_CONTINUE will always return false, unless the guard that
> succeeded was the first guard to succeed after
> INTERNET_LIKELY_DOWN_INTERVAL minutes - in that case it will set the
> state to STATE_PRIMARY_GUARDS and return true.
>
ACK.
Just a reminder that we also discussed adding the "Retry primary guards if we
have looped over the whole guardlist" heuristic somewhere here. Because in many
cases the network can go down and then back up in less than a minute.
>
> §2.4. The END algorithm
>
> The goal of this algorithm is simply to make sure that we keep track
> of successful connections made. This algorithm should be invoked
> with the guard that was used to correctly set up a circuit.
>
> Once invoked, this algorithm will mark the guard as used, and make
> sure it is in USED_GUARDS.
> [XXX if the guard is not in USED_GUARDS should be added *first*?
> Important because it will affect on the building of PRIMARY_GUARDS.]
>
IIUC, if the guard is not in USED_GUARDS it should be added *last* (that is,
with lowest priority).
>
> §2.5. Helper algorithms
>
> These algorithms are used in the above algorithms, but have been
> separated out here in order to make the flow clearer.
>
> NEXT_PRIMARY_GUARD
> - Return the first entry from USED_GUARDS that is not in
> PRIMARY_GUARDS and that is in the most recent consensus.
> - If USED_GUARDS is empty, use NEXT_BY_BANDWIDTH with
> REMAINING_UTOPIC_GUARDS as the argument.
>
> NEXT_BY_BANDWIDTH
> - Takes G as an argument, which should be a set of guards to
> choose from.
> - Return a randomly select element from G, weighted by bandwidth.
>
>
> §3. Consensus Parameters, & Configurable Variables
>
> This proposal introduces several new parameters that ideally should
> be set in the consensus but that should also be possible to
> set or override in the client configuration file. Some of these have
> proposed values, but for others more simulation and trial needs to
> happen.
>
> PRIMARY_GUARDS_RETRY_INTERVAL
> In order to make it more likely we connect to a primary guard,
> we would like to retry the primary guards more often than other
> types of guards. This parameter controls how many minutes should
> pass before we consider retrying primary guards again. The
> proposed value is 3.
>
> GUARDS_RETRY_TIME
> In the normal course of selecting guards, we want to continue
> retrying unreachable guards in case they have become
> reachable. This parameter controls the minimum amount of minutes
> before we should start to consider guards for connecting
> again. Proposed value is 20.
>
> SAMPLE_SET_THRESHOLD
> In order to allow us to recognize dystopic situations or a
> completely unreachable network, we would like to avoid
> connecting to too many guards before switching modes. We also
> want to avoid exposing ourselves to too many nodes in a
> potentially hostile situation. This parameter, expressed as a
> fraction, determines the number of guards we should keep as the
> sampled set of the only guards we will consider connecting
> to. It will be used as a fraction for both the utopic and the
> dystopic sampled set. If we assume there are 1900 utopic guards
> and of them there are 500 dystopic guards, a setting of 0.02
> means we will have a sample set of 38 utopic guards and 10
> dystopic guards. This limits our total exposure. Proposed value
> is 0.02.
>
We should decide if we want to actually use a dynamic percentage here, or just
set the threshold to a constant value.
A dynamic percentage might give us better security and reachability as the
network evolves, but might also cause unpredictable behaviors if we suddently
get too many guards or too many of them disappear.
I don't have a strong opinion here.
> INTERNET_LIKELY_DOWN_INTERVAL
> The number of minutes since we started trying to find an entry
> guard before we should consider the network down and consider
> retrying primary guards before using a functioning guard
> found. Proposed value 20.
>
It seems to me that the value 20 here could get reduced to something like 5 or
even less. Of course 5 is also an arbitrary value and to actually find out the
"best" number here we should test the algorithm ourselves in various network
types.
Let's say we are restricting ourselves to SAMPLE_SET_THRESHOLD guards. If the
network is down, we will start cycling through guards and we will have walked
through all of them in a matter of minutes. After we have exhausted our guard
list once and we did not manage to connect, our network is effectively
down. This might give extra reasons to do the "Retry guards if we have exhausted
our guard list once" heuristic.
> §4. Security properties and behavior under various conditions
>
> Under normal conditions, this algorithm will allow us to quickly
> connect and use guards we have used before with high likelihood of
> working. Assuming the first primary guard is reachable and in the
> consensus, this algorithm will deterministically always return that
> guard.
>
> Under dystopic conditions (when a firewall is in place that blocks
> all ports except for potentially port 80 and 443), this algorithm
> will try to connect to 2% of all guards before switching modes to try
> dystopic guards. Currently, that means trying to connect to circa 40
> guards before getting a successful connection. If we assume a
> connection try will take maximum 10 seconds, that means it will take
> up to 6 minutes to get a working connection.
>
> When the network is completely down, we will try to connect to 2% of
> all guards plus 2% of all dystopic guards before realizing we are
> down. This means circa 50 guards tried assuming there are 1900 guards
> in the network.
>
> In terms of exposure, we will connect to a maximum of 2% of all
> guards plus 2% of all dystopic guards, or 3% of all guards,
> whichever is lower. If N is the number of guards, and k is the
> number of guards an attacker controls, that means an attacker would
> have a probability of 1-(1-(k/N)^2)^(N * 0.03) to have one of their
> guards selected before we fall back. In real terms, this means an
> attacker would need to control over 10% of all guards in order to
> have a larger than 50% chance of controlling a guard for any given client.
>
> In addition, since the sampled set changes slowly (the suggestion
> here is that guards in it expire every month) it is not possible for
> an attacker to force a connection to an entry guard that isn't
> already in the users sampled set.
>
>
> §A. Appendix: An example usage
>
> In order to clarify how this algorithm is supposed to be used, this
> pseudo code illustrates the building of a circuit:
>
> OPEN_CIRCUIT:
> context = ALGO_CHOOSE_ENTRY_GUARD_START(used_guards, [], 3, false)
>
> while True:
> entryGuard = ALGO_CHOOSE_ENTRY_GUARD_NEXT(context)
> circuit = composeCircuitAndConnect(entryGuard)
>
> if not SHOULD_CONTINUE(isSuccessful(circuit)):
> ALGO_CHOOSE_ENTRY_GUARD_END(context, entryGuard)
> return circuit
>
> -*- coding: utf-8 -*-
More information about the tor-dev
mailing list