[tor-dev] Next version of the algorithm

Thu Feb 25 11:27:07 UTC 2016

Reinaldo de Souza Jr <rjunior at thoughtworks.com> writes:

> On 2/16/16 12:20, George Kadianakis wrote:
>> The
>> very latest prop259 basically forgets the unreachable guard status as soon as
>> the algorithm terminates. I wonder if we actually want this. Hopefully guardsim
>> has a simulation scenario that will illustrate whether that's a good idea or
>> not.
>
> We are assuming the unreachable state is persisted (by setting
> unreachable_since in the guard) in the current simulation. We are also
> ignoring entries known to be unreachable when choosing guards from
> either USED_GUARDS or REMAINING_UTOPIC_GUARDS to be our PRIMARY_GUARDS.
>

Thanks for the work on the proposal! I haven't had time to read the new version
yet apart from the stuff you point out in your email.

Making the unreachable state persistent seems like a reasonable idea.

However, I'm not so sure about rejecting previously unreachable guards from
being primary guards. The concept of primary guards is to _always_ try to use
the top N guards in your guardlist (as long as they are in the latest
consensus).  So the list of primary guards should stay the same as long as none
of those nodes disappear from the consensus.

Consider the following case: The network was offline for a few seconds and we
ended up marking our top 8 guards as unreachable. Then the network comes back
up, and we want to build a new circuit: When it's time to choose the primary
guards set we end up choosing the 9th, 10th and 11th node instead of the top 3
guards, even though the top 3 guards are actually online. Ideally Tor will
always connect to the top 3 guards, and this is notthe case here.

Maybe the above issue is not possible with the new prop259 algorithm? If that's
the case, why?

I'm also curious to see how your simulation reacts to this new change.

>> Maybe to reduce this
>> exposure, we should try to go back to STATE_PRIMARY_GUARDS in those cases? Tor
>> does a similar trick right now which has been very helpful:
>>     https://gitweb.torproject.org/tor.git/tree/src/or/entrynodes.c?id=tor-0.2.7.6#n803
>> 
>> Maybe an equivalent heuristic would be that if we are in STATE_RETRY_ONLY and
>> we manage to connect to a non-primary guard, we hang up the connection, and go
>> back into STATE_PRIMARY_GUARDS.
>
> Having this in mind, and revisiting
> entry_guard_register_connect_status(), it seems the
> appropriate approach would be marking all USED_GUARDS for retry,
> discarding the current circuit and trying to build a new circuit
> (restart the whole algorithm) rather than marking the PRIMARY_GUARDS and
> going back to STATE_PRIMARY_GUARDS.
>
> This is because by marking all the used guards for retry we can expect
> the next circuit to use the first available USED_GUARD as an entry guard.
>
> In addition to that, my understanding of this behavior in the current
> guard selection algorithm is:
>
> When you succeed to build a circuit with a new guard, you mark all
> previously used guards (the global entry_guards var in tor) for retry
> which are:
>
> - listed in the latest consensus
> - not bad_since some time
> - unreachable_since some time and is "time to retry"*
> - (and it does not have other restrictions which are not important for
> our simulation, like being a bridge when required or fast/stable)
>
> If you find any retryable guard in the condition above, you discard the
> current circuit and try to build a new one:
>
>   - best case: you have a new circuit with one of the previously
> unreachable used guards.
>   - worst case: all previously unreachable used guards marked for retry
> fail again, and you have a new circuit with the same guard you discarded
> before (unless it is also unreachable now :S).
>

I think your evaluation is correct here.

> It's not really "every time you connect to a new guard, retry the
> previously used" but "every time you connect to a new guard, check if
> its time to retry a unreachable previously used guard".
>
> * time to retry is progressive depending on how long the node has been
> unreachable, for example hourly in the first 6 hours, every 4 hours
> until 3 days, etc. (see: entry_is_time_to_retry).
>

>> Can this heuristic be improved? I think it should be considered for 
>> the algorithm.
>
> We have reviewed this strategy in the simulation for the original
> algorithm and we are generating new data. We are also updating the
> proposal to include the described strategy.
>

Great!

BTW, I noticed that the graphs here:
        https://github.com/twstrike/tor_guardsim/issues/1
are not really explained, and also the x and y axes are undocumented some times.
I have not had time to read the new tor_guardsim to understand these
graphs. Some documentation would be really helpful here; otherwise, you can
explain these things to me in Valencia.

Thanks!