[tor-dev] Proposal 328: Make Relays Report When They Are Overloaded

Wed Mar 3 18:06:49 UTC 2021

On 3/3/21 1:14 PM, David Goulet wrote:
> 
> On 02 Mar (20:58:43), Mike Perry wrote:
>>
>>
>> On 3/2/21 6:01 PM, George Kadianakis wrote:
>>>
>>> David Goulet <dgoulet at torproject.org> writes:
>>>
>>>> Greetings,
>>>>
>>>> Attached is a proposal from Mike Perry and me. Merge request is here:
>>>>
>>>> https://gitlab.torproject.org/tpo/core/torspec/-/merge_requests/22
>>>>
>>>
>>> Hello all,
>>>
>>> while working on this proposal I had to change it slightly to add a few
>>> more metrics and also to simplify some engineering issues that we would
>>> encounter. You can find the changes here:
>>>            https://gitlab.torproject.org/asn/torspec/-/commit/b57743b9764bd8e6ef8de689d14483b7ec9c91ec
>>>
>>> Mike, based on your comments in the #40222 ticket, I would appreciate
>>> comments on the way the DNS issues will be reported. David argued that
>>> they should not be part of the "overload-general" line because they are
>>> not an overload and it's not the fault of the network in any way. This
>>> is why we added them as separate lines. Furthermore, David suggested we
>>> turn them into a threshold "only report if 25% of the total requests
>>> have timed out" instead of "only report if at least one timeout has
>>> occurred" since that would be more useful.
>>
>> I'm confused by this confusion. There's pretty clear precedent for
>> treating packet drops as a sign of network capacity overload. We've also
>> seen it experimentally specifically with respect to DNS, during Rob's
>> experiment. We discussed this on Monday.
>>
>> However, I agree there's a chance that a single packet drop can be
>> spurious, and/or could be due to ephemeral overload, such as TCP congestion
>> causes. But 25% is waaaaaaaaaay too high. Even 1% is high IMO, but is
>> more reasonable. We should ask some exits what they see now. The fact
>> that our DNS scanners are not currently seeing this at all, and the
>> issue appeared only for the exact duration of Rob's experiment, suggests
>> that DNS packet drops are extremely rare in healthy network conditions.
> 
> Ok, likely 25% is way too high indeed.
> 
> The idea behind this was simply that a network hiccup or a temporarily faulty
> DNS server would not move traffic away from the Exit for a 72h period
> (reminder that the "overload-general" sticks for 72h in the extrainfo once
> hit).

Yes, it sticks for 72 hours because sbws does not store descriptors.
However, the timestamp should *not* update unless the overload condition
occurs again. In this way, we can defer the logic for deciding whether the
overload signal is "fresh" vs "stale" to sbws, rather than have it on relays.

This also suggests we want to put some kind of counter in there, like
"number of times this has been listed in the past 72 hours" as well.
That way we can also defer the heuristics to respond to temporary hiccup
vs persistent overload to sbws, too, without needing to bake too much of this
logic into relays (which are a PITA to upgrade and may end up running
different versions of this).
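
To make the split concrete, here is a rough sketch of the idea, not
anything from the proposal text: the relay only records when overload
happened and how often in the last 72 hours, and the scanner side decides
what that means. All names (OverloadRecord, classify) and the thresholds
(fresh_seconds, persistent_count) are made up for illustration.

```python
from dataclasses import dataclass, field

WINDOW_SECONDS = 72 * 3600  # the 72-hour window discussed above


@dataclass
class OverloadRecord:
    """Hypothetical relay-side state behind an overload line."""
    events: list = field(default_factory=list)  # timestamps of overload events

    def note_overload(self, now):
        # The reported timestamp only moves when overload occurs again.
        self.events.append(now)

    def report(self, now):
        # Forget events older than the window, then report (last_seen, count).
        self.events = [t for t in self.events if now - t < WINDOW_SECONDS]
        if not self.events:
            return None
        return max(self.events), len(self.events)


def classify(report, now, fresh_seconds=3600, persistent_count=3):
    """Hypothetical scanner-side heuristic; both thresholds are placeholders."""
    if report is None:
        return "none"
    last_seen, count = report
    if now - last_seen > fresh_seconds:
        return "stale"
    return "persistent" if count >= persistent_count else "hiccup"
```

The point of keeping classify() off the relay is that its thresholds can
change without waiting for relays to upgrade.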

George also said you guys felt pressure to rush for the 0.4.6 merge
deadline on this. I would suggest that we not try to bang this out in a
week, but instead try to address these issues with a bit more thought.
If we miss 0.4.6, it's not the end of the world.

Plus this week is shaping up to be pure madness anyway, in other areas.

>> Furthermore, revealing the specific type of overload condition
>> increases the ability for the adversary to use this information for
>> various attacks. I'd rather it be combined in all cases, so that the
>> specific cause is not visible. In all cases, the reaction of our systems
>> should be the same: direct less load to relays with this line. If we
>> need to dig, that's what MetricsPort is for.
>>
>> In fact, this DNS packet drop signal may be particularly useful in
>> traffic analysis attacks. Its reporting, and likely all of this overload
>> reporting, should probably be delayed until something like the top of
>> the hour after it happens. We may even want this delay to be a consensus
>> parameter. Something like "Report only after N minutes", or "Report only
>> N minute windows", perhaps?
> 
> Yes definitely and I would even add a random component in this so not all
> relays will report an overload in a predictable timeframe, which defeats the
> "if the line appears, I know it was hit N hours ago" type of calculation.

Nice.

Wrt what Georg said, the reason for consensus parameter(s) is also for
agility in the face of uncertainty of potential attacks and how much
they may be helped by a particular response time/fuzz factor. Who can
say if Ian's excitement was performance research, or new attack papers
(kidding Ian, but you know how it goes :).
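
For illustration only, the "report at the end of an N minute window, plus
random fuzz" idea could look roughly like the sketch below. The function
name and both parameters (window_minutes, max_fuzz_minutes) are
hypothetical stand-ins for whatever consensus parameters we end up with,
and the default values are arbitrary.

```python
import random


def report_time(event_time, window_minutes=60, max_fuzz_minutes=30, rng=random):
    """Return the earliest time (in seconds) at which an overload event
    that happened at event_time may appear in the descriptor."""
    window = window_minutes * 60
    # Round up to the end of the window that contains the event.
    window_end = (event_time // window + 1) * window
    # Add a per-event random delay so the publication time does not
    # reveal how long ago the overload was hit.
    fuzz = rng.uniform(0, max_fuzz_minutes * 60)
    return window_end + fuzz
```

A real implementation would want a cryptographically strong source of
randomness (for example random.SystemRandom() passed as rng) rather than
the default generator.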

-- 
Mike Perry