[tor-bugs] #925 [Tor Relay]: Tor fails badly when accept(2) returns EMFILE or ENFILE
Tor Bug Tracker & Wiki
torproject-admin at torproject.org
Tue Aug 24 19:24:15 UTC 2010
#925: Tor fails badly when accept(2) returns EMFILE or ENFILE
--------------------------------+-------------------------------------------
Reporter: riastradh | Type: defect
Status: new | Priority: minor
Milestone: Tor: 0.2.2.x-final | Component: Tor Relay
Version: 0.2.0.33 | Resolution: None
Keywords: | Parent:
--------------------------------+-------------------------------------------
New description:
If accept(2) in connection_handle_listener_read returns EMFILE or
ENFILE, Tor logs a failure and returns to the event loop. The
listening socket remains ready for reading, however, so that Tor again
tries to accept a connection. This leads to tens of thousands of
logged failures per second. Here is an excerpt from my syslog:

Feb 11 05:57:36 Tor[20415]: accept failed: Too many open files. Dropping incoming connection.
Feb 11 05:57:54 last message repeated 301536 times
Feb 11 05:57:54 Tor[20415]: Failing because we have 1765 connections already. Please raise your ulimit -n.
Feb 11 05:57:54 Tor[20415]: accept failed: Too many open files. Dropping incoming connection.
Feb 11 05:58:05 last message repeated 184158 times
Feb 11 05:58:05 Tor[20415]: Failing because we have 1765 connections already. Please raise your ulimit -n.
Feb 11 05:58:05 Tor[20415]: accept failed: Too many open files. Dropping incoming connection.
Feb 11 05:58:13 last message repeated 127556 times
Feb 11 05:58:13 Tor[20415]: Failing because we have 1765 connections already. Please raise your ulimit -n.
Feb 11 05:58:13 Tor[20415]: accept failed: Too many open files. Dropping incoming connection.
Feb 11 05:58:26 last message repeated 223556 times
Feb 11 05:58:26 Tor[20415]: Failing because we have 1765 connections already. Please raise your ulimit -n.

I don't know what the right thing to do here is, but spiking the CPU
and spraying log messages is not a very graceful mode of failure. One
way to mitigate the damage might be to close the listening socket,
which I believe won't be reopened until a minute later. This is no
worse for the Tor network than just wedging, and perhaps better, since
prospective connectors would be refused rather than silently forgotten
in a flurry of furious logging.

Also, it would be nice to document the number of file descriptors
generally required by a Tor relay, or a formula for computing it. For
example, is it proportional to the bandwidth and to the number of
relays in the Tor network? Or to the bandwidth and to the number of
users in the Tor network? This way, prospective operators of Tor
relays would not need to repeatedly restart their relays as they test
incremental bumps in the file descriptor ulimits, unless there is some
way to bump them without restarting the relay (but I doubt whether
there is).

(Apologies if this is duplicated: I hit ^A while editing this, in order
to move to the beginning of the line, but the obnoxious !@#!^%&%!^& web
form [and my obnoxiously colluding web browser] interpreted it to mean
something else for which I quickly hit the stop button. I don't know
what hitting ^A actually did.)

[Automatically added by flyspray2trac: Operating System: All]
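
The mitigation suggested in the description (stop accepting for a while
after EMFILE or ENFILE, so the failure is logged once instead of hundreds
of thousands of times) might look roughly like the sketch below. It is
written in Python for brevity and is not Tor code: PausingListener,
accept_fn, cooldown and the injected clock value are all hypothetical
names, and a "paused" flag stands in for actually closing and later
reopening the listening socket.

```python
import errno

# errno values that mean "no file descriptors left"
FD_EXHAUSTION = {errno.EMFILE, errno.ENFILE}


class PausingListener:
    """Hypothetical wrapper: stop accepting for a while after fd exhaustion."""

    def __init__(self, accept_fn, cooldown=60.0, log=print):
        self.accept_fn = accept_fn    # callable that returns a new connection
        self.cooldown = cooldown      # seconds to stay paused after a failure
        self.log = log                # logging callable (assumed interface)
        self.paused_until = 0.0

    def wants_read(self, now):
        """Should the event loop watch this listener for readability?"""
        return now >= self.paused_until

    def handle_read(self, now):
        """Try one accept; return the connection, or None if paused or failed."""
        if not self.wants_read(now):
            return None
        try:
            return self.accept_fn()
        except OSError as e:
            if e.errno in FD_EXHAUSTION:
                # Log once, then stay quiet until the cool-down expires.
                self.paused_until = now + self.cooldown
                self.log("accept failed: %s. Pausing listener for %gs."
                         % (e.strerror, self.cooldown))
                return None
            raise
```

The caller passes in the current time, so an event loop would skip the
listener while wants_read() is false rather than spinning on a socket
that stays readable.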
--
Comment(by nickm):
So part one of the "fix" here, if we want to try it, is for
connection_handle_listener_read() to compare get_n_open_sockets() with
get_options()->ConnLimit [grep through the rest of that file to see how we
do it]. If we have too many sockets, then we should immediately close the
new connection...
...and part two is, if we ever get an EMFILE/ENFILE, to reset our idea of
what our connlimit is based on the number of files we currently have
open...
...but first, we need to look through the code that connects to
ORs/directories, and make sure that we don't actually treat a completed
connect() attempt as meaning that a server is up. I am 97% sure that we
don't.
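
As a rough illustration of parts one and two above, here is a small
Python model. It is not Tor code and only one possible reading of the
proposal: ConnLimiter, on_accept and on_accept_error are hypothetical
names, n_open stands in for get_n_open_sockets(), conn_limit stands in
for ConnLimit, and "reset our idea of what our connlimit is" is assumed
to mean lowering the limit to the number of sockets currently open.

```python
import errno


class ConnLimiter:
    """Hypothetical model of the two-part fix; not Tor's actual code."""

    def __init__(self, conn_limit):
        self.conn_limit = conn_limit  # plays the role of ConnLimit
        self.n_open = 0               # plays the role of get_n_open_sockets()

    def on_accept(self, conn):
        """Part one: close a newly accepted connection if we are over the limit."""
        self.n_open += 1
        if self.n_open > self.conn_limit:
            conn.close()
            self.n_open -= 1
            return None
        return conn

    def on_accept_error(self, err):
        """Part two: on EMFILE/ENFILE, lower the limit to what is open now.

        Returns True if the error was fd exhaustion and the limit was updated.
        """
        if err.errno in (errno.EMFILE, errno.ENFILE):
            self.conn_limit = min(self.conn_limit, self.n_open)
            return True
        return False
```

With the limit lowered this way, later connections are closed by the
part-one check until existing sockets go away. How and when the limit
should be raised again is left open here, as it is in the comment.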
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/925#comment:5>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online