[tor-relays] cases where relay overload can be a false positive

Wed Jan 12 19:44:00 UTC 2022

On 1/12/22 5:36 PM, David Goulet wrote:
> On 01 Jan (21:12:38), s7r wrote:
>> One of my relays (guard, not exit) started to report being overloaded one
>> week ago, for the first time in its life.
>>
>> The consensus weight and advertised bandwidth are proper as per what they
>> should be, considering the relay's configuration. More than this, they have
>> not changed for years. So, I started to look at it more closely.
>>
>> Apparently the overload is triggered at 5-6 days by flooding it with circuit
>> creation requests. All I can see in tor.log is:
>>
>> [warn] Your computer is too slow to handle this many circuit creation
>> requests! Please consider using the MaxAdvertisedBandwidth config option or
>> choosing a more restricted exit policy. [68382 similar message(s) suppressed
>> in last 482700 seconds]
>>
>> [warn] Your computer is too slow to handle this many circuit creation
>> requests! Please consider using the MaxAdvertisedBandwidth config option or
>> choosing a more restricted exit policy. [7882 similar message(s) suppressed
>> in last 60 seconds]
>>
>> This message is logged 4 to 6 times, with 1 minute (60 sec) between
>> consecutive warn entries.
>>
>> After that, the relay is back to normal. So it feels like it is being probed
>> or something like this. CPU usage is at 65%, RAM is at under 45%, SSD no
>> problem, bandwidth no problem.
> 
> Very plausible theory, especially in the context of such a "burst" of traffic;
> we can rule out that all of a sudden your relay has become the facebook.onion
> guard.
> 
>> Metrics port says:
>>
>> tor_relay_load_tcp_exhaustion_total 0
>>
>> tor_relay_load_onionskins_total{type="tap",action="processed"} 52073
>> tor_relay_load_onionskins_total{type="tap",action="dropped"} 0
>> tor_relay_load_onionskins_total{type="fast",action="processed"} 0
>> tor_relay_load_onionskins_total{type="fast",action="dropped"} 0
>> tor_relay_load_onionskins_total{type="ntor",action="processed"} 8069522
>> tor_relay_load_onionskins_total{type="ntor",action="dropped"} 273275
>>
>> So if we compare the dropped ntor circuits with the processed ntor circuits
>> we end up with a reasonable % (it's >8 million vs <300k).
> 
> Yeah so this is ~3.38% drop so it immediately triggers the overload signal.
>
>> So the question here is: does the computed consensus weight of a relay
>> change if that relay keeps sending reports to directory authorities that it
>> is being overloaded? If yes, could this be triggered by an attacker, in
>> order to arbitrarily decrease a relay's consensus weight even when it's not
>> really overloaded (to maybe increase the consensus weights of other
>> malicious relays that we don't know about)?
> 
> Correct, this is a possibility indeed. I'm not entirely certain that this is
> the case at the moment as sbws (bandwidth authority software) might not be
> downgrading the bandwidth weights just yet.
> 
> But regardless, the point is that this is where we are heading. We have
> control over this, so now is a good time to notice these problems and act.
> 
> I'll try to get back to you asap after talking with the network team.

My thinking is that sbws would avoid reducing weight of a relay that is 
overloaded until it sees a series of these overload lines, with fresh 
timestamps. For example, just one with a timestamp that never updates 
again could be tracked but not reacted to, until the timestamp changes N 
times.
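
A minimal sketch of that rule in Python, assuming sbws can read the
overload timestamp for each relay. Nothing here exists in sbws:
OverloadTracker, observe() and min_changes (the "N" above) are made-up
names, and the default of 3 is arbitrary.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class OverloadState:
    last_timestamp: Optional[int] = None  # most recent overload timestamp seen
    changes: int = 0                      # times the timestamp moved forward


class OverloadTracker:
    """Hypothetical helper; not part of sbws."""

    def __init__(self, min_changes: int = 3) -> None:
        self.min_changes = min_changes  # the "N" above; 3 is an arbitrary default
        self._relays: Dict[str, OverloadState] = {}

    def observe(self, fingerprint: str, overload_timestamp: int) -> bool:
        """Record a reported overload timestamp.

        Returns True only once the timestamp has changed at least
        min_changes times since it was first seen for this relay.
        """
        state = self._relays.setdefault(fingerprint, OverloadState())
        if state.last_timestamp is None:
            # First sighting: track it, but do not react yet.
            state.last_timestamp = overload_timestamp
        elif overload_timestamp > state.last_timestamp:
            # A fresh timestamp counts as one more change.
            state.last_timestamp = overload_timestamp
            state.changes += 1
        return state.changes >= self.min_changes
```

A single overload line whose timestamp never moves again is recorded but
never acted on. The sketch does not expire old state, which a real
implementation would probably need.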

We can (and should) also have logic that prevents sbws from demoting the 
capacity of a Guard relay so much that it loses the Guard flag, so DoS 
attacks can't easily cause clients to abandon a Guard, unless it goes 
entirely down.
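
One way to picture that safeguard, as a sketch only: clamp the demoted
weight at a floor. The function name, the penalty fraction and
guard_floor_bw are all hypothetical, and the real conditions for keeping
the Guard flag involve more than bandwidth.

```python
def demoted_weight(measured_bw: int, penalty: float,
                   is_guard: bool, guard_floor_bw: int) -> int:
    """Reduce a relay's weight for overload without dropping a Guard
    below guard_floor_bw (a hypothetical stand-in for whatever bandwidth
    keeps the Guard flag)."""
    reduced = int(measured_bw * (1.0 - penalty))
    if is_guard:
        # Never demote below the floor, but never raise the weight above
        # what was actually measured either.
        return max(reduced, min(guard_floor_bw, measured_bw))
    return reduced
```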

Both of these things can be done on the sbws side. This would not stop
short blips of overload from still being reported on the metrics portal,
but maybe we want to keep that property.

>> Also, as a side note, I think that if the dropped/processed ratio is not
>> over 15% or 20% a relay should not consider itself overloaded. Would this be
>> a good idea?
> 
> Plausible that it could be a better idea! Unclear what an optimal percentage is
> but, personally, I'm leaning towards a higher threshold so that the signal is
> not triggered in normal circumstances.
> 
> But I think if we raise this to, let's say, 20%, it might not stop an attacker
> from triggering it. It might just make it take a bit longer.

Hrmm. Parameterizing this threshold as a consensus parameter might be a 
good idea. I think that if we can make it such that an attack has to be 
"severe" and "ongoing" long enough such that a relay has lost capacity 
and/or lost the ability to complete circuits, and that relay can't do 
anything about it, that relay unfortunately should not be used as much. 
It's not like the circuit will be likely to succeed or be fast enough to 
use in that case anyway.
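
To make the threshold discussion concrete, here is a small Python sketch
using the ntor counters quoted above. It assumes the ratio is
dropped / processed, which roughly matches the figure David gave; the
function names and the threshold argument are illustrative and are not
tor's actual code or an existing consensus parameter.

```python
def drop_fraction(processed: int, dropped: int) -> float:
    """Fraction of onionskins dropped relative to those processed."""
    if processed <= 0:
        return 0.0
    return dropped / processed


def is_overloaded(processed: int, dropped: int, threshold: float) -> bool:
    """True if the drop fraction exceeds the given threshold
    (hypothetically supplied by a consensus parameter)."""
    return drop_fraction(processed, dropped) > threshold


# ntor counters from the MetricsPort output quoted above
ntor_processed = 8069522
ntor_dropped = 273275

print(f"{drop_fraction(ntor_processed, ntor_dropped):.2%}")
print(is_overloaded(ntor_processed, ntor_dropped, threshold=0.20))
```

With a 20% threshold this burst would not count as overload. The sketch
covers only the "severe" half of the condition; the "ongoing" half would
need the ratio to stay above the threshold across several periods.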

We need better DoS defenses generally :/

-- 
Mike Perry