[tor-relays] Tor relay occasionally maxing out CPU usage

Wed May 20 12:24:41 UTC 2020

To me it sounds like there isn't actually a problem. This is the way Tor
works now (now == since consensus diffs were added). It's unfortunate
that Tor isn't more multithreaded, that so much happens in the same main
loop, and that client throughput is momentarily impacted, but that's the
way it is and there isn't a problem here to be solved. At least not for
you, the relay operator.

Getting more into tor-dev@ territory here, but doesn't compressing
consensus documents sound like something that could easily be shoved
over into a worker thread? I'm unfamiliar with the subsystem and I'm
sure many of my implicit assumptions are wrong.
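
To make the worker-thread idea concrete, here is a minimal sketch in
Python. It is not Tor's code (Tor is written in C) and says nothing about
how the consensus diff subsystem is actually structured. The function
names, the pool size, and the sample documents are all made up for
illustration. It only shows the general pattern: hand the CPU-heavy
compression to a pool and collect the results later.

```python
import lzma
from concurrent.futures import ThreadPoolExecutor


def compress_document(document: bytes) -> bytes:
    """CPU-heavy step: LZMA-compress one document (hypothetical helper)."""
    return lzma.compress(document, preset=9)


def compress_in_background(pool, documents):
    """Submit every document to the pool and return the futures at once.

    The caller is not blocked here; it only blocks when it asks a
    future for its result.
    """
    return [pool.submit(compress_document, doc) for doc in documents]


if __name__ == "__main__":
    # Made-up stand-ins for consensus documents, not real Tor data.
    docs = [(b"r relay%d AAAA 2020-05-19 15:00:00\n" % i) * 2000
            for i in range(4)]
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = compress_in_background(pool, docs)
        # A main loop could keep servicing other work at this point.
        results = [f.result() for f in futures]
    # Round-trip check: every compressed blob decompresses to its input.
    for original, packed in zip(docs, results):
        assert lzma.decompress(packed) == original
```

Whether something like this would be easy in Tor depends on details of
the subsystem that I don't know.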

Matt

On 5/19/20 11:59, William Kane wrote:
> Okay, so your suspicion was just confirmed:
> 
> consdiffmgr_rescan_flavor_(): The most recent ns consensus is
> valid-after 2020-05-19T15:00:00. We have diffs to this consensus for
> 0/25 older ns consensuses. Generating diffs for the other 25.
> 
> Right after, diffs were compressed with zstd and lzma, causing the CPU
> usage to spike.
> 
> Disabling DirCache still gives me the following warning on Tor 0.4.3.5:
> 
> May 19 17:56:42.909 [warn] DirCache is disabled and we are configured
> as a relay. We will not become a Guard.
> 
> So, unless I sacrifice the Guard flag, there doesn't seem to be an easy
> way to fix this problem.
> 
> Please correct me if I'm wrong.
> 
> 
> 2020-05-19 15:07 GMT, William Kane <ttallink at googlemail.com>:
>> Another thing, from the change-log:
>>
>> - Update the message logged on relays when DirCache is disabled.
>>   Since 0.3.3.5-rc, authorities require DirCache (V2Dir) for the
>>   Guard flag. Fixes bug 24312; bugfix on 0.3.3.5-rc.
>>
>> If I understand this correctly, my relay would no longer be a Guard if
>> I choose to disable DirCache in order to prevent Tor from hogging my
>> CPU?
>>
>> From the code that I have seen, simply not setting the directory port
>> does not stop the relay from caching / compressing diffs.
>>
>> Or has this been changed more recently?
>>
>> Not being a guard would honestly suck, and being a guard but with
>> limited bandwidth due to Tor hogging the CPU also sucks.
>>
>> Any ideas on what to do?
>>
>> 2020-05-19 13:43 GMT, William Kane <ttallink at googlemail.com>:
>>> Dear Alexander,
>>>
>>> I have added 'Log [dirserv]info notice stdout' to my configuration and
>>> will be monitoring the system closely.
>>>
>>> Tor was also upgraded to version 0.4.3.5, and the Linux kernel was
>>> upgraded to version 5.6.13, but I do not think this will change
>>> anything.
>>>
>>> Expect a follow-up within the next 12 hours.
>>>
>>> William
>>>
>>> 2020-05-18 1:40 GMT, Alexander Færøy <ahf at torproject.org>:
>>>> Hello,
>>>>
>>>> On 2020/05/17 18:20, William Kane wrote:
>>>>> Occasionally, the CPU usage hits 100%, and the maximum throughput
>>>>> drops down to around 16 Mbps from its usual 80 Mbps. This happens
>>>>> randomly and not at fixed intervals, which makes it pretty hard to
>>>>> profile.
>>>>
>>>> One of the subsystems that I can think of that could potentially lead
>>>> to the problem that you are describing is our "consensus diff"
>>>> subsystem. The consensus diff subsystem is responsible for turning
>>>> consensus documents into these patch(1)-like diffs that clients can
>>>> fetch without the need to transfer the whole consensus for each minor
>>>> change.
>>>>
>>>> The subsystem also takes care of compression, which includes LZMA, which
>>>> is a beast when it comes to burning CPU cycles.
>>>>
>>>>> No abnormal entries in the log files.
>>>>
>>>> I suspect you're logging at `notice` log-level, which is the reasonable
>>>> thing to do. We need to log at slightly higher granularity to discover
>>>> the problem here.
>>>>
>>>> Could I get you to add `Log [dirserv]info notice syslog` to your
>>>> `torrc`? This line makes Tor log everything at notice log-level (the
>>>> default), to the system logger, except for the directory server
>>>> subsystem, which will be logged at `info` log-level instead. The code
>>>> responsible for generating consensus diffs uses the `dirserv` log
>>>> domain for logging purposes.
>>>>
>>>> If the CPU spike happens right after a log message that says something
>>>> along the lines of "The most recent XXX consensus is valid-after XXX. We
>>>> have diffs to this consensus for XXX/XXX older XXX consensuses.
>>>> Generating diffs for the other XXX." then I think we have our winner.
>>>>
>>>> Please remember to remove the `info` log-level when the experiment is
>>>> over :-)
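
A small script can help spot that message in a saved log. The sketch
below is in Python and assumes each log entry sits on a single line. The
regular expression is modelled on the message wording quoted in this
thread and may need adjusting for other Tor versions. The function name
`find_rescans` is made up for illustration.

```python
import re

# Pattern modelled on the message quoted in this thread; an assumption,
# not taken from Tor's source.
RESCAN_RE = re.compile(
    r"The most recent (?P<flavor>\S+) consensus is valid-after "
    r"(?P<valid_after>\S+)\. We have diffs to this consensus for "
    r"(?P<have>\d+)/(?P<total>\d+) older \S+ consensuses\. "
    r"Generating diffs for the other (?P<missing>\d+)\."
)


def find_rescans(lines):
    """Return one dict per log line that announces diff generation."""
    hits = []
    for line in lines:
        match = RESCAN_RE.search(line)
        if match is None:
            continue
        hits.append({
            "flavor": match.group("flavor"),
            "valid_after": match.group("valid_after"),
            "have": int(match.group("have")),
            "total": int(match.group("total")),
            "missing": int(match.group("missing")),
            "line": line.rstrip("\n"),
        })
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python find_rescans.py /path/to/saved-tor-log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for hit in find_rescans(log):
            print(hit["line"])
```

Comparing the timestamps of the printed lines with the times of the CPU
spikes should show whether the two line up.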
>>>>
>>>> I'm curious what you figure out here. Let me know if you need any help.
>>>>
>>>> All the best,
>>>> Alex.
>>>>
>>>> --
>>>> Alexander Færøy
>>>> _______________________________________________
>>>> tor-relays mailing list
>>>> tor-relays at lists.torproject.org
>>>> https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-relays
>>>>
>>>
>>