[tor-dev] Next version of the algorithm

Thu Mar 3 15:44:47 UTC 2016

Great, lots of good comments. I'll respond in depth once the fever has
left my brain! =D

On Thu, Mar 03, 2016 at 03:15:26PM +0100, George Kadianakis wrote:
> Ola Bini <obini at thoughtworks.com> writes:
> 
> > Hi,
> 
> Here are some more comments to the latest version of the proposal, as seen here:
> https://github.com/twstrike/torspec
> 
> > Filename: 259-guard-selection.txt
> > Title: New Guard Selection Behaviour
> > Author: Isis Lovecruft, George Kadianakis, [Ola Bini]
> > Created: 2015-10-28
> > Status: Draft
> > 
> > §1. Overview
> > 
> >   Tor uses entry guards to prevent an attacker who controls some
> >   fraction of the network from observing a fraction of every user's
> >   traffic. If users chose their entries and exits uniformly at
> >   random from the list of servers every time they build a circuit,
> >   then an adversary who had (k/N) of the network would deanonymize
> >   F=(k/N)^2 of all circuits... and after a given user had built C
> >   circuits, the attacker would see them at least once with
> >   probability 1-(1-F)^C.  With large C, the attacker would get a
> >   sample of every user's traffic with probability 1.
> > 
> >   To prevent this from happening, Tor clients choose a small number of
> >   guard nodes (currently 3).  These guard nodes are the only nodes
> >   that the client will connect to directly.  If they are not
> >   compromised, the user's paths are not compromised.
> > 
> >   But attacks remain.  Consider an attacker who can run a firewall
> >   between a target user and the Tor network, and make
> >   many of the guards they don't control appear to be unreachable.
> >   Or consider an attacker who can identify a user's guards, and mount
> >   denial-of-service attacks on them until the user picks a guard
> >   that the attacker controls.
> > 
> >   In the presence of these attacks, we can't continue to connect to
> >   the Tor network unconditionally.  Doing so would eventually result
> >   in the user choosing a hostile node as their guard, and losing
> >   anonymity.
> > 
> >   This proposal outlines a new entry guard selection algorithm, which
> >   addresses the following concerns:
> > 
> >     - Heuristics and algorithms for determining how and which guard(s)
> >       is(/are) chosen should be kept as simple and easy to understand
> >       as possible.
> > 
> >     - Clients in censored regions or who are behind a fascist firewall
> >       who connect to the Tor network should not experience any
> >       significant disadvantage in terms of reachability or usability.
> > 
> >     - Tor should make a best attempt at discovering the most
> >       appropriate behaviour, with as little user input and
> >       configuration as possible.
> > 
> > 
> > §2. Design
> > 
> >   Alice, an OP attempting to connect to the Tor network, should
> >   undertake the following steps to determine information about the
> >   local network and to select (some) appropriate entry guards.  In the
> >   following scenario, it is assumed that Alice has already obtained a
> >   recent, valid, and verifiable consensus document.
> > 
> >   The algorithm is divided into four components such that the full
> >   algorithm is implemented by first invoking START, then repeatedly
> >   calling NEXT while adviced it SHOULD_CONTINUE and finally calling
> >   END. For an example usage see §A. Appendix.
> > 
> >   Several components of NEXT can be invoked asynchronously. SHOULD_CONTINUE
> >   is used for the algorithm to be able to tell the caller whether we
> >   consider the work done or not - this can be used to retry primary
> >   guards when we finally are able to connect to a guard after a long
> >   network outage, for example.
> > 
> >   This algorithm keeps track of the unreachability status for guards
> >   in state private to the algorithm - this is re-initialized every time
> >   START is called.
> > 
> 
> Hmm, didn't we decide to persist the unreachability status over runs, right?
> Or not?
> 
> The argument for doing so is "What happens if I'm a hidden service that needs to
> make 5 circuits per second, and my first 2 guards happen to be offline? Do I
> have to spend time probing them everytime I make a new circuit?".
> 
> >   The algorithm expects several arguments to guide its behavior. These
> >   will be defined in §2.1.
> > 
> >   The goal of this algorithm is to strongly prefer connecting to the
> >   same guards we have connected to before, while also trying to detect
> >   conditions such as a network outage or a network environment that
> >   blocks most ports. The way it does this is by keeping track of how
> >   many guards we have exposed ourselves to, and if we have connected
> >   to too many we will fall back to only retrying the ones we have
> >   already tried. The algorithm also decides on sample set that should
> >   be persisted - in order to minimize the risk of an attacker forcing
> >   enumeration of the whole network by triggering rebuilding of
> >   circuits.
> > 
> > 
> > §2.1. The START algorithm
> > 
> >   In order to start choosing an entry guard, use the START
> >   algorithm. This takes four arguments that can be used to fine tune
> >   the workings:
> > 
> >   USED_GUARDS
> >       This is a list that contains all the guards that have been used
> >       before by this client. We will prioritize using guards from this
> >       list in order to minimize our exposure. The list is expected to
> >       be sorted based on priority, where the first entry will have the
> >       highest priority.
> > 
> >   SAMPLED_UTOPIC_GUARDS
> >       This is a set that contains all guards that should be considered
> >       for connection under utopic conditions. This set should be
> >       persisted between runs. It will be filled in by the algorithm if
> >       it's empty, or if it contains less than SAMPLE_SET_THRESHOLD
> >       guards after winnowing out older guards. It should be filled by
> >       using NEXT_BY_BANDWIDTH with UTOPIC_GUARDS as an argument.
> > 
> 
> Should we use UTOPIC_GUARDS or REMAINING_UTOPIC_GUARDS as the argument?
> 
> >   SAMPLED_DYSTOPIC_GUARDS
> >       This is a set that contains all guards that should be considered
> >       for connection under dystopic conditions. This set should be
> >       persisted between runs. It will be filled in by the algorithm if
> >       it's empty, or if it contains less than SAMPLE_SET_THRESHOLD
> >       guards after winnowing out older guards. It should be filled by
> >       using NEXT_BY_BANDWIDTH with DYSTOPIC_GUARDS as an argument.
> > 
> >   EXCLUDE_NODES
> >       A set of nodes that we should not consider using as a guard.
> > 
> >   N_PRIMARY_GUARDS
> >       The number of guards we should consider our primary
> >       guards. These guards will be retried more frequently and will
> >       take precedence in most situations. By default the primary
> >       guards will be the first N_PRIMARY_GUARDS guards from USED_GUARDS.
> > 
> >   DIR
> >       If this argument is set, we should only consider guards that can
> >       be directory guards. If not set, we will consider all guards.
> > 
> >   The primary work of START is to initialize the state machine depicted
> >   in §2.2 The NEXT algorithm.
> >   The initial state of the machine is defined by:
> > 
> >   GUARDS
> >       This is a set of all guards from the consensus, without
> >       EXCLUDE_NODES and potentially filtered if DIR is set.
> > 
> >   UTOPIC_GUARDS
> >       This is a set of all guards to use under utopic conditions. It
> >       will primarily be used to fill in SAMPLED_UTOPIC_GUARDS. This
> >       set will be initialized to be the same as GUARDS.
> > 
> >   DYSTOPIC_GUARDS
> >       This is a set of all guards to use under dystopic conditions
> >       (usually when we are subject to a firewall that restricts the
> >       ports we can connect to).  It will primarily be used to fill in
> >       SAMPLED_UTOPIC_GUARDS.  This set will be initialized to be the
> 
> I guess you mean SAMPLED_DYSTOPIC_GUARDS.
> 
> >       subset of GUARDS that listen to ports that are allowed by
> >       dystopic conditions.
> > 
> >   REMAINING_UTOPIC_GUARDS
> >       This is a running set of the utopic guards we have not yet tried
> >       to connect to. It should be initialized to be SAMPLED_UTOPIC_GUARDS
> >       without USED_GUARDS.
> > 
> 
> Maybe here we should also mention that we will reinsert guards that we have not
> tried in a long time (GUARDS_RETRY_TIME) as specified by 2.2.2?
> 
> >   REMAINING_DYSTOPIC_GUARDS
> >       This is a running set of the dystopic guards we have not yet tried
> >       to connect to. It should be initialized to be SAMPLED_DYSTOPIC_GUARDS
> >       without USED_GUARDS.
> > 
> >   TRIED_GUARDS
> >       This set keeps track of all utopic guards we have tried connecting
> >       to. This should be initialized to the empty set.
> > 
> >   TRIED_DYSTOPIC_GUARDS
> >       This set keeps track of all dystopic guards we have tried connecting
> >       to. This should be initialized to the empty set.
> > 
> >   STATE
> >       A variable that keeps track of which state in the state
> >       machine we are currently in. It should be initialized to
> >       STATE_PRIMARY_GUARDS.
> > 
> >   PRIMARY_GUARDS
> >       This list keeps track of our primary guards. These are guards
> >       that we will prioritize when trying to connect, and will also
> >       retry more often in case of failure with other guards.
> >       It should be initialized by calling algorithm
> >       NEXT_PRIMARY_GUARD repeatedly until PRIMARY_GUARDS contains
> >       N_PRIMARY_GUARDS elements.
> > 
> > 
> > §2.2. The NEXT algorithm
> > 
> >   The NEXT algorithm is composed of several different possibly flows. The
> >   first one is a simple state machine that can transfer between four
> >   different states. Every time NEXT is invoked, it will resume at the
> >   state where it left off previously. In the course of selecting an
> >   entry guard, a new consensus can arrive. When that happens we need
> >   to update the data structures used, but nothing else should change.
> > 
> >   Before jumping in to the state machine, we should first check if it
> >   was at least PRIMARY_GUARDS_RETRY_INTERVAL minutes since we tried
> >   any of the PRIMARY_GUARDS. If this is the case, and we are not in
> >   STATE_PRIMARY_GUARDS, we should save the previous state and set the
> >   state to STATE_PRIMARY_GUARDS.
> > 
> > 
> > §2.2.1. The STATE_PRIMARY_GUARDS state
> > 
> >   Return each entry in PRIMARY_GUARDS in turn. For each entry, if it
> >   was not possible to connect to it, mark the entry as unreachable,
> >   and add it to TRIED_GUARDS.
> >   [XXX defining "was not possible to connect" as "entry is not live" according
> >        to current definition of "live entry guard" in tor source code, seems
> >        to improve success rate on the flaky network scenario.
> >        See: https://github.com/twstrike/tor_guardsim/issues/1#issuecomment-187374942]
> > 
> 
> Hmm, I'm not sure what this XXX means exactly. I believe we should actually try
> to _connect_ to those primary guards and not just check if we think they are live.
> 
> >   If all entries have been tried, restore the previous state and go
> >   there. If there is no previous state, transition to STATE_TRY_UTOPIC.
> > 
> > 
> > §2.2.2. The STATE_TRY_UTOPIC state
> > 
> >   In order to give guards that have been marked as unreachable a
> >   chance to come back, add all entries in TRIED_GUARDS that were
> >   marked as unreachable more than GUARDS_RETRY_TIME minutes ago back
> >   to REMAINING_UTOPIC_GUARDS.
> > 
> 
> I'm a bit puzzled by this mechanism. Maybe it's benefits can be explained a bit
> more clearly?
> 
> When we add guards back to REMAINING_UTOPIC_GUARDS, do we also remove them from
> TRIED_GUARDS?
> 
> Also, the value here is currently 20 minutes. So this will trigger only when
> this algorithm takes over 20 minutes to terminate. I guess this should only
> happen when the network is down.
> 
> Now that we have persistent SAMPLED_UTOPIC_GUARDS is this still useful?  Won't
> we have fully populated our SAMPLED_*_GUARDS structures by the point this rule
> triggers?
> 
> Sorry for the confusion :)
> 
> >   Return each entry in USED_GUARDS that is not in PRIMARY_GUARDS in
> >   turn. For each entry, if it was not possible to connect to it, mark
> >   the entry as unreachable and add it to TRIED_GUARDS.
> > 
> >   Return each entry from REMAINING_UTOPIC_GUARDS using
> >   NEXT_BY_BANDWIDTH. For each entry, if it was not possible to connect
> >   to it, remove the entry from REMAINING_UTOPIC_GUARDS, mark it as
> >   unreachable and add it to TRIED_GUARDS.
> > 
> >   If no entries remain in REMAINING_UTOPIC_GUARDS, transition to
> >   STATE_TRY_DYSTOPIC.
> > 
> > 
> > §2.2.3. The STATE_TRY_DYSTOPIC state
> > 
> >   In order to give guards that have been marked as unreachable a
> >   chance to come back, add all entries in TRIED_DYSTOPIC_GUARDS that were
> >   marked as unreachable more than GUARDS_RETRY_TIME minutes ago back
> >   to REMAINING_DYSTOPIC_GUARDS.
> > 
> >   Return each dystopic entry in USED_GUARDS that is not in PRIMARY_GUARDS in
> >   turn. For each entry, if it was not possible to connect to it, mark
> >   the entry as unreachable and add it to TRIED_DYSTOPIC_GUARDS.
> > 
> >   Return each entry from REMAINING_DYSTOPIC_GUARDS using
> >   NEXT_BY_BANDWIDTH. For each entry, if it was not possible to connect
> >   to it, remove the entry from REMAINING_DYSTOPIC_GUARDS, mark it as
> >   unreachable and add it to TRIED_DYSTOPIC_GUARDS.
> > 
> >   If no entries remain in REMAINING_DYSTOPIC_GUARDS, transition to
> >   STATE_PRIMARY_GUARDS.
> > 
> > 
> > §2.2.5. ON_NEW_CONSENSUS
> > 
> >   First, ensure that all guard profiles are updated with information
> >   about whether they were in the newest consensus or not. If not, the
> >   guard is considered bad.
> > 
> 
> Maybe instead of "If not" we could say "If a guard is not included in the newest
> consensus" to make it a bit clearer.
> 
> >   If any PRIMARY_GUARDS have become bad, remove the guard from
> >   PRIMARY_GUARDS. Then ensure that PRIMARY_GUARDS contain
> >   N_PRIMARY_GUARDS entries by repeatedly calling NEXT_PRIMARY_GUARD.
> > 
> >   If any guards in USED_GUARDS have switched from being bad to being
> >   non-bad, add it back in the place it should have been in
> >   PRIMARY_GUARDS if it had been non-bad when populating
> >   PRIMARY_GUARDS. If this results in PRIMARY_GUARDS being larger than
> >   N_PRIMARY_GUARDS, truncate the list to be N_PRIMARY_GUARDS entries
> >   long.
> >   [XXX Does "add it back in the place it should have been in PRIMARY_GUARDS
> >   if it had been non-bad" implies keeping original order?]
> > 
> 
> If I understand correctly, I think the answer to this XXX is "Ideally, yes.".
> 
> I'm curious to see how this mechanism will be implemented because it's important
> and it would be nice if it's done cleanly.
> 
> Also, we should be careful about when we count 'bad' guards. After a few weeks
> of operation, the USED_GUARDS list can accumulate multiple bad guards, and we
> should make sure we don't count them when we do our threshold checks.
> 
> > 
> > §2.3. The SHOULD_CONTINUE algorithm
> > 
> >   This algorithm takes as an argument a boolean indicating whether the
> >   circuit was successfully built or not.
> > 
> >   After the caller have tried to build a circuit with a returned
> >   guard, they should invoke SHOULD_CONTINUE to understand if the
> >   algorithm is finished or not. SHOULD_CONTINUE will always return
> >   true if the circuit failed. If the circuit succeeded,
> >   SHOULD_CONTINUE will always return false, unless the guard that
> >   succeeded was the first guard to succeed after
> >   INTERNET_LIKELY_DOWN_INTERVAL minutes - in that case it will set the
> >   state to STATE_PRIMARY_GUARDS and return true.
> > 
> 
> ACK.
> 
> Just a reminder that we also discussed adding the "Retry primary guards if we
> have looped over the whole guardlist" heuristic somewhere here. Because in many
> cases the network can go down and then back up in less than a minute.
> 
> > 
> > §2.4. The END algorithm
> > 
> >   The goal of this algorithm is simply to make sure that we keep track
> >   of successful connections made. This algorithm should be invoked
> >   with the guard that was used to correctly set up a circuit.
> > 
> >   Once invoked, this algorithm will mark the guard as used, and make
> >   sure it is in USED_GUARDS.
> >   [XXX if the guard is not in USED_GUARDS should be added *first*?
> >   Important because it will affect on the building of PRIMARY_GUARDS.]
> > 
> 
> IIUC, if the guard is not in USED_GUARDS it should be added *last* (that is,
> with lowest priority).
> 
> > 
> > §2.5. Helper algorithms
> > 
> >   These algorithms are used in the above algorithms, but have been
> >   separated out here in order to make the flow clearer.
> > 
> >   NEXT_PRIMARY_GUARD
> >       - Return the first entry from USED_GUARDS that is not in
> >         PRIMARY_GUARDS and that is in the most recent consensus.
> >       - If USED_GUARDS is empty, use NEXT_BY_BANDWIDTH with
> >         REMAINING_UTOPIC_GUARDS as the argument.
> > 
> >   NEXT_BY_BANDWIDTH
> >       - Takes G as an argument, which should be a set of guards to
> >         choose from.
> >       - Return a randomly select element from G, weighted by bandwidth.
> > 
> > 
> > §3. Consensus Parameters, & Configurable Variables
> > 
> >   This proposal introduces several new parameters that ideally should
> >   be set in the consensus but that should also be possible to
> >   set or override in the client configuration file. Some of these have
> >   proposed values, but for others more simulation and trial needs to
> >   happen.
> > 
> >   PRIMARY_GUARDS_RETRY_INTERVAL
> >       In order to make it more likely we connect to a primary guard,
> >       we would like to retry the primary guards more often than other
> >       types of guards. This parameter controls how many minutes should
> >       pass before we consider retrying primary guards again. The
> >       proposed value is 3.
> > 
> >   GUARDS_RETRY_TIME
> >       In the normal course of selecting guards, we want to continue
> >       retrying unreachable guards in case they have become
> >       reachable. This parameter controls the minimum amount of minutes
> >       before we should start to consider guards for connecting
> >       again. Proposed value is 20.
> > 
> >   SAMPLE_SET_THRESHOLD
> >       In order to allow us to recognize dystopic situations or a
> >       completely unreachable network, we would like to avoid
> >       connecting to too many guards before switching modes. We also
> >       want to avoid exposing ourselves to too many nodes in a
> >       potentially hostile situation.  This parameter, expressed as a
> >       fraction, determines the number of guards we should keep as the
> >       sampled set of the only guards we will consider connecting
> >       to. It will be used as a fraction for both the utopic and the
> >       dystopic sampled set.  If we assume there are 1900 utopic guards
> >       and of them there are 500 dystopic guards, a setting of 0.02
> >       means we will have a sample set of 38 utopic guards and 10
> >       dystopic guards. This limits our total exposure. Proposed value
> >       is 0.02.
> > 
> 
> We should decide if we want to actually use a dynamic percentage here, or just
> set the threshold to a constant value.
> 
> A dynamic percentage might give us better security and reachability as the
> network evolves, but might also cause unpredictable behaviors if we suddently
> get too many guards or too many of them disappear.
> 
> I don't have a strong opinion here.
> 
> >   INTERNET_LIKELY_DOWN_INTERVAL
> >       The number of minutes since we started trying to find an entry
> >       guard before we should consider the network down and consider
> >       retrying primary guards before using a functioning guard
> >       found. Proposed value 20.
> > 
> 
> It seems to me that the value 20 here could get reduced to something like 5 or
> even less. Of course 5 is also an arbitrary value and to actually find out the
> "best" number here we should test the algorithm ourselves in various network
> types.
> 
> Let's say we are restricting ourselves to SAMPLE_SET_THRESHOLD guards. If the
> network is down, we will start cycling through guards and we will have walked
> through all of them in a matter of minutes. After we have exhausted our guard
> list once and we did not manage to connect, our network is effectively
> down. This might give extra reasons to do the "Retry guards if we have exhausted
> our guard list once" heuristic.
> 
> > §4.  Security properties and behavior under various conditions
> > 
> >   Under normal conditions, this algorithm will allow us to quickly
> >   connect and use guards we have used before with high likelihood of
> >   working. Assuming the first primary guard is reachable and in the
> >   consensus, this algorithm will deterministically always return that
> >   guard.
> > 
> >   Under dystopic conditions (when a firewall is in place that blocks
> >   all ports except for potentially port 80 and 443), this algorithm
> >   will try to connect to 2% of all guards before switching modes to try
> >   dystopic guards. Currently, that means trying to connect to circa 40
> >   guards before getting a successful connection. If we assume a
> >   connection try will take maximum 10 seconds, that means it will take
> >   up to 6 minutes to get a working connection.
> > 
> >   When the network is completely down, we will try to connect to 2% of
> >   all guards plus 2% of all dystopic guards before realizing we are
> >   down. This means circa 50 guards tried assuming there are 1900 guards
> >   in the network.
> > 
> >   In terms of exposure, we will connect to a maximum of 2% of all
> >   guards plus 2% of all dystopic guards, or 3% of all guards,
> >   whichever is lower. If N is the number of guards, and k is the
> >   number of guards an attacker controls, that means an attacker would
> >   have a probability of 1-(1-(k/N)^2)^(N * 0.03) to have one of their
> >   guards selected before we fall back. In real terms, this means an
> >   attacker would need to control over 10% of all guards in order to
> >   have a larger than 50% chance of controlling a guard for any given client.
> > 
> >   In addition, since the sampled set changes slowly (the suggestion
> >   here is that guards in it expire every month) it is not possible for
> >   an attacker to force a connection to an entry guard that isn't
> >   already in the users sampled set.
> > 
> > 
> > §A. Appendix: An example usage
> > 
> >   In order to clarify how this algorithm is supposed to be used, this
> >   pseudo code illustrates the building of a circuit:
> > 
> >     OPEN_CIRCUIT:
> >       context = ALGO_CHOOSE_ENTRY_GUARD_START(used_guards, [], 3, false)
> > 
> >       while True:
> >         entryGuard = ALGO_CHOOSE_ENTRY_GUARD_NEXT(context)
> >         circuit = composeCircuitAndConnect(entryGuard)
> > 
> >         if not SHOULD_CONTINUE(isSuccessful(circuit)):
> >           ALGO_CHOOSE_ENTRY_GUARD_END(context, entryGuard)
> >           return circuit
> > 
> > -*- coding: utf-8 -*-
> 

-- 
 Ola Bini (https://olabini.se)

 "Yields falsehood when quined" yields falsehood when quined.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 931 bytes
Desc: not available
URL: <http://lists.torproject.org/pipermail/tor-dev/attachments/20160303/92c5229e/attachment-0001.sig>