[tor-dev] Detailed algorithm

Tue Feb 9 18:32:12 UTC 2016

Ola Bini <obini at thoughtworks.com> writes:

> OK, with your feedback and thinking a bit more about it, here is a
> revision of the algorithm from yesterday. I think we are starting to
> get close so we will rip out the original simulation code and
> implement something that matches this now. Hopefully, the changes will
> be smaller.
>
> - Start of algorithm (arguments: USED_GUARDS, EXCLUDE_NODES)
>   - If selecting directory guards, GUARDS is all guards from the consensus with the V2Flag,
>     otherwise GUARDS is all guards from the consensus
>   - Set UTOPIC_GUARDS to be all guards to use under utopic conditions from GUARDS
>   - Set DYSTOPIC_GUARDS to be all guards to use under dystopic conditions from GUARDS
>   - Set REMAINING_UTOPIC_GUARDS to be UTOPIC_GUARDS without EXCLUDE_NODES
>   - Set REMAINING_DYSTOPIC_GUARDS to be DYSTOPIC_GUARDS without EXCLUDE_NODES
>   - Create a list of PRIMARY_GUARDS that contain N_PRIMARY_GUARDS that are not bad by:
>     - Taking the next entry from USED_GUARDS
>     - If USED_GUARDS is empty:
>       - randomly select an entry from UTOPIC_GUARDS, weighted by bandwidth
>   - Set TRIED_GUARDS to be an empty set
>   - Set TRIED_DYSTOPIC_GUARDS to be an empty set
>   - Set state = STATE_PRIMARY_GUARDS
>
> - Each iteration of algorithm
>   - If a new consensus has arrived:
>     - Update all guard profiles with new bad/non-bad information
>     - If any PRIMARY_GUARDS have become bad:
>       - re-add to the list of PRIMARY_GUARDS using the same procedure
>     - If any USED_GUARDS have become non-bad:
>       - add it back to PRIMARY_GUARDS at the place it would have been if
>         it was non-bad when running the start of the algorithm. If this
>         results in PRIMARY_GUARDS being larger than N_PRIMARY_GUARDS,
>         remove from the end of the list until the list is N_PRIMARY_GUARDS long
>     - Ensure that UTOPIC_GUARDS and DYSTOPIC_GUARDS are updated with the changes
>       from the consensus
>
>   - If it was at least 3 minutes since we tried the primary guards and we are not in STATE_PRIMARY_GUARDS:
>     - save previous state
>     - set state = STATE_PRIMARY_GUARDS
>
>   - STATE_PRIMARY_GUARDS:
>     - return each entry in PRIMARY_GUARDS in turn
>       - mark each entry as "unreachable" if algorithm doesn't terminate
>     - restore previous state (or STATE_TRY_UTOPIC if no previous)
>
>   - STATE_TRY_UTOPIC:
>     - for each entry in TRIED_GUARDS that was marked as unreachable more than 20 minutes ago
>       - add it back to REMAINING_UTOPIC_GUARDS
>     - return each remaining entry from USED_GUARDS in turn
>       - for each entry, if algorithm doesn't terminate
>         - mark entry as "unreachable"
>         - add entry to TRIED_GUARDS
>         - if the number of entries in TRIED_GUARDS that were tried within GUARDS_TRY_THRESHOLD_TIME
>           is larger than GUARDS_TRY_THRESHOLD, return failure from the algorithm
>         - if the number of entries in TRIED_GUARDS is larger than a GUARDS_FAILOVER_THRESHOLD
>           proportion of UTOPIC_GUARDS, set state = STATE_TRY_DYSTOPIC
>     - return each entry from REMAINING_UTOPIC_GUARDS randomly selected, weighted by bandwidth
>       - remove the returned entry from REMAINING_UTOPIC_GUARDS
>       - for each entry, if algorithm doesn't terminate
>         - mark entry as "unreachable"
>         - add entry to TRIED_GUARDS
>         - if the number of entries in TRIED_GUARDS that were tried within GUARDS_TRY_THRESHOLD_TIME
>           is larger than GUARDS_TRY_THRESHOLD, return failure from the algorithm
>         - if the number of entries in TRIED_GUARDS is larger than a GUARDS_FAILOVER_THRESHOLD
>           proportion of UTOPIC_GUARDS, set state = STATE_TRY_DYSTOPIC
>
>   - STATE_TRY_DYSTOPIC:
>     - for each entry in TRIED_DYSTOPIC_GUARDS that was marked as unreachable more than 20 minutes ago
>       - add it back to REMAINING_DYSTOPIC_GUARDS
>     - return each remaining DYSTOPIC entry from USED_GUARDS in turn
>       - for each entry, if algorithm doesn't terminate
>         - mark entry as "unreachable"
>         - add entry to TRIED_DYSTOPIC_GUARDS
>         - if the number of entries in TRIED_GUARDS+TRIED_DYSTOPIC_GUARDS that were tried within GUARDS_TRY_THRESHOLD_TIME
>           is larger than GUARDS_TRY_THRESHOLD, return failure from the algorithm
>         - if the number of entries in TRIED_DYSTOPIC_GUARDS is larger than a GUARDS_FAILOVER_THRESHOLD
>           proportion of DYSTOPIC_GUARDS:
>           - mark all guards in PRIMARY_GUARDS, TRIED_GUARDS and TRIED_DYSTOPIC_GUARDS as not "unreachable"
>           - return failure from the algorithm
>     - return each entry from REMAINING_DYSTOPIC_GUARDS randomly selected, weighted by bandwidth
>       - remove the returned entry from REMAINING_DYSTOPIC_GUARDS
>       - for each entry, if algorithm doesn't terminate
>         - mark entry as "unreachable"
>         - add entry to TRIED_DYSTOPIC_GUARDS
>         - if the number of entries in TRIED_GUARDS+TRIED_DYSTOPIC_GUARDS that were tried within GUARDS_TRY_THRESHOLD_TIME
>           is larger than GUARDS_TRY_THRESHOLD, return failure from the algorithm
>         - if the number of entries in TRIED_DYSTOPIC_GUARDS is larger than a GUARDS_FAILOVER_THRESHOLD
>           proportion of DYSTOPIC_GUARDS:
>           - mark all guards in PRIMARY_GUARDS, TRIED_GUARDS and TRIED_DYSTOPIC_GUARDS as not "unreachable"
>           - return failure from the algorithm
>
> - End of algorithm
>   - If circuit is set up correctly, let algorithm know
>     - Algorithm marks the guard chosen as used and makes sure it is in USED_GUARDS
>   - Otherwise do another run of the algorithm
>

Ah great! Good improvements. I think we are going the right way.

Here is another quick review. I include a second copy of the algorithm and
comment inline:

>
> - Start of algorithm (arguments: USED_GUARDS, EXCLUDE_NODES)
>   - If selecting directory guards, GUARDS is all guards from the consensus with the V2Flag,
>     otherwise GUARDS is all guards from the consensus
>   - Set UTOPIC_GUARDS to be all guards to use under utopic conditions from GUARDS
>   - Set DYSTOPIC_GUARDS to be all guards to use under dystopic conditions from GUARDS
>   - Set REMAINING_UTOPIC_GUARDS to be UTOPIC_GUARDS without EXCLUDE_NODES
>   - Set REMAINING_DYSTOPIC_GUARDS to be DYSTOPIC_GUARDS without EXCLUDE_NODES

Maybe we also need to exclude USED_GUARDS from these two lists?

>   - Create a list of PRIMARY_GUARDS that contain N_PRIMARY_GUARDS that are not bad by:
>     - Taking the next entry from USED_GUARDS
>     - If USED_GUARDS is empty:
>       - randomly select an entry from UTOPIC_GUARDS, weighted by bandwidth
>   - Set TRIED_GUARDS to be an empty set
>   - Set TRIED_DYSTOPIC_GUARDS to be an empty set
>   - Set state = STATE_PRIMARY_GUARDS
>
> - Each iteration of algorithm
>   - If a new consensus has arrived:
>     - Update all guard profiles with new bad/non-bad information
>     - If any PRIMARY_GUARDS have become bad:
>       - re-add to the list of PRIMARY_GUARDS using the same procedure
>     - If any USED_GUARDS have become non-bad:
>       - add it back to PRIMARY_GUARDS at the place it would have been if
>         it was non-bad when running the start of the algorithm. If this
>         results in PRIMARY_GUARDS being larger than N_PRIMARY_GUARDS,
>         remove from the end of the list until the list is N_PRIMARY_GUARDS long
>     - Ensure that UTOPIC_GUARDS and DYSTOPIC_GUARDS are updated with the changes
>       from the consensus
>

Not sure if this is part of this algorithm, or it's actually another helper
algorithm that is called when a consensus arrives. I feel it might be cleaner
if we do it as a separate algo, but we can proceed like this as well since it's
not too confusing.

>   - If it was at least 3 minutes since we tried the primary guards and we are not in STATE_PRIMARY_GUARDS:
>     - save previous state
>     - set state = STATE_PRIMARY_GUARDS
>
>   - STATE_PRIMARY_GUARDS:
>     - return each entry in PRIMARY_GUARDS in turn
>       - mark each entry as "unreachable" if algorithm doesn't terminate

So IIUC the "algorithm" referenced here is _not_ the algorithm that we are
describing right now (let's call it ALGO_CHOOSE_GUARD). Instead the "algorithm"
here is the _caller of ALGO_CHOOSE_GUARD_, which is the algorithm responsible
for creating circuits, testing whether they work and reporting the results back
to CHOOSE_GUARD_ALGO (let's call this other algorithm ALGO_BUILD_CIRCUIT).

Maybe we can clarify this?

Also, do PRIMARY_GUARDS go into TRIED_GUARDS and count against our thresholds?

>     - restore previous state (or STATE_TRY_UTOPIC if no previous)
>
>   - STATE_TRY_UTOPIC:
>     - for each entry in TRIED_GUARDS that was marked as unreachable more than 20 minutes ago
>       - add it back to REMAINING_UTOPIC_GUARDS
>     - return each remaining entry from USED_GUARDS in turn
>       - for each entry, if algorithm doesn't terminate

In general, I think we should make the context of this algorithm
(ALGO_CHOOSE_GUARD) a bit clearer, so that we know when "Start of algorithm" is
supposed to run, and when "End of algorithm" is supposed to run.

Because for example one could think here that in an asynchronous setting, when
we try a entry from USED_GUARDS and find out whether the circuit succeeded or
not, we need to run the algorithm from the beginning including the "Start of
algorithm" step. Whereas if I understand correctly you assume that we will just
drop in and continue exactly from this point.

I feel that this confusion is caused by the ALGO_CHOOSE_GUARD algorithm being
pseudo-asynchronous. To address this I suggested additional states for when we
are waiting for the results of a circuit construction, but I agree that this
might complicate things too much.

Is there a way we can make this cleaner?

>         - mark entry as "unreachable"
>         - add entry to TRIED_GUARDS
>         - if the number of entries in TRIED_GUARDS that were tried within GUARDS_TRY_THRESHOLD_TIME
>           is larger than GUARDS_TRY_THRESHOLD, return failure from the algorithm
>         - if the number of entries in TRIED_GUARDS is larger than a GUARDS_FAILOVER_THRESHOLD
>           proportion of UTOPIC_GUARDS, set state = STATE_TRY_DYSTOPIC
>     - return each entry from REMAINING_UTOPIC_GUARDS randomly selected, weighted by bandwidth
>       - remove the returned entry from REMAINING_UTOPIC_GUARDS
>       - for each entry, if algorithm doesn't terminate
>         - mark entry as "unreachable"
>         - add entry to TRIED_GUARDS
>         - if the number of entries in TRIED_GUARDS that were tried within GUARDS_TRY_THRESHOLD_TIME
>           is larger than GUARDS_TRY_THRESHOLD, return failure from the algorithm
>         - if the number of entries in TRIED_GUARDS is larger than a GUARDS_FAILOVER_THRESHOLD
>           proportion of UTOPIC_GUARDS, set state = STATE_TRY_DYSTOPIC
>
> <snip>

Finally, what kind of statistics and measurements does the simulator conduct?

For example, I can think of reachability related stats like:

 * Time (or number of guards) till we manage to build first circuit
 * Time till we manage to recover from flaky network

and I can also think of security related stats like:

 * Number of guards we tried before succeeding first circuit
 * Number of guards we exposed ourselves to after time t

etc.