Analysis of the problems many relay operators are currently facing
Sebastian Hahn
mail at sebastianhahn.net
Wed Apr 21 16:04:24 UTC 2010

I'll try to summarize here what I've learned over the past weeks about
the problems we are currently having with the Tor network as a whole and
the issues that individual relay operators have, and describe the issues
we have identified (some of which have been addressed already). The
information comes from #tor-dev on OFTC, bug reports and mailing lists,
but no overview exists, so it seems worthwhile to collect what we know.

For the past months, quite a few relays have been sporadically dropping
out of the consensus, either until they published their next descriptor
or for longer periods of time. For this we have identified the following
problems:

- Some vendors have backported OpenSSL features to older versions,
  rendering those relays either completely useless, as they are unable
  to establish connections and so won't even bootstrap, or useless as
  relays to people using certain OpenSSL versions. Thus, the directory
  authorities couldn't establish connections to them, meaning they
  marked them offline.
  We believe this is now fixed as of 0.2.2.11-alpha. A fix for the
  stable series of Tor has not been released yet.
- Authorities only downloaded descriptors for relays from V2 directory
  authorities if they didn't have them available themselves. As only two
  V2 auths remain, one of which probably disallows most relays from
  publishing descriptors, this led to authorities knowing only about a
  part of the network. Some relays were thus unreachable by the majority
  of dir authorities, meaning they dropped out of the consensus.
  We believe this is now fixed as of 0.2.2.12-alpha. Not all authorities
  have upgraded yet.
- Relays (and authorities) running 0.2.2.11-alpha crash 24 hours after
  start if they have the statistics gathering functionality enabled.
  We believe this is now fixed as of 0.2.2.12-alpha. A workaround is to
  disable statistics gathering.
- Another issue exists that has not been identified yet, where a relay
  is only sporadically reachable from outside, even though there is no
  load. This issue is rare and has not been reproduced reliably.

Another class of problems exists which affects some/many relays: the
relay attracts a huge number of connections, affecting the stability of
network equipment and the operating system. These problems might occur:

- The Tor process runs out of memory because it has too many open
  connections. Tor is then killed by the OS's OOM killer.
- Tor exhausts the ulimit -n that applies to it, meaning random things
  like opening logfiles, establishing new connections or gathering more
  entropy fail, often creating many warnings in Tor's logfile. In some
  cases it appears that Tor spins until a file descriptor becomes
  available, burning all CPU.
- Tor makes a home router, DSL modem or kernel lock up because it cannot
  handle the load. Symptoms include internet access being completely
  nonfunctional even after the relay is stopped, or being extremely
  slow. These symptoms might last until the relevant piece of equipment
  is restarted.
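
For the ulimit -n case above, here is a minimal sketch of how one might
check how close a process is to its file descriptor limit. It is not
part of Tor; the function names are made up for illustration, and it
assumes a Linux-style /proc filesystem. As written it inspects the
Python process itself; looking at a running Tor would need that
process's pid and suitable permissions.

```python
import os
import resource


def fd_headroom(open_fds, soft_limit):
    """Return the fraction of the descriptor limit still unused (0.0 to 1.0)."""
    if soft_limit <= 0:
        raise ValueError("soft_limit must be positive")
    # Clamp at zero so an over-limit count does not yield a negative value.
    return max(0.0, 1.0 - open_fds / soft_limit)


def current_process_fd_usage():
    """Count this process's open descriptors and read its soft limit.

    Assumes /proc/self/fd exists (Linux). The soft limit is the value
    that `ulimit -n` reports in a shell.
    """
    open_fds = len(os.listdir("/proc/self/fd"))
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fds, soft_limit


if __name__ == "__main__":
    used, limit = current_process_fd_usage()
    print(f"{used} of {limit} descriptors in use, "
          f"{fd_headroom(used, limit):.1%} headroom")
```

A relay whose headroom is close to zero is likely to start showing the
failures described above.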

All these share the same underlying problem: Tor is getting more
connections than it can handle. One way to help would be to make sure
unused connections are closed more quickly, so that relays don't need to
maintain as many connections concurrently as they do now. A Tor patch
that logs what state current connections are in [0] shows that on some
systems, around 10% of all connections were previously used for a
begindir operation but no longer have a circuit attached. Generally, the
fraction of connections used exclusively for begindir operations appears
to be high, so it might be worthwhile to close the circuits on them more
quickly and not keep them around for possible later cannibalization.

Another theory is that the fastest relays (by consensus weight) are used
by a large proportion of users. This means that almost every Tor user
will make a connection to those few relays, massively increasing the
number of connections such a relay has to handle at the same time. Some
evidence supporting this: even after the operator lowered blutmagie's
BandwidthRate considerably and the bandwidth authorities then voted its
bandwidth weight down a lot, it was still seeing many concurrent
connections, while the number of new connections per second dropped a
lot.
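
To make the "almost every user" intuition concrete, here is a rough
back-of-the-envelope sketch. It assumes each relay pick is an
independent draw weighted by consensus weight, which is a simplification
of Tor's real path selection, and the share and pick counts in the usage
lines are invented for illustration, not measurements.

```python
def chance_of_contact(share, picks):
    """Probability that a relay is chosen at least once.

    `share` is the relay's assumed probability of being selected in a
    single weighted pick; `picks` is the number of independent picks a
    client makes. Both are hypothetical inputs.
    """
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be between 0 and 1")
    if picks < 0:
        raise ValueError("picks must not be negative")
    # Complement of "never chosen in any of the picks".
    return 1.0 - (1.0 - share) ** picks


if __name__ == "__main__":
    # Hypothetical relay holding 1% of the selection weight.
    for picks in (10, 100, 500):
        print(picks, round(chance_of_contact(0.01, picks), 3))
```

Under these assumptions, even a relay with a small share of the total
weight is contacted by most clients once they have built enough
circuits.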

As many relay operators are forced to turn off their relays because they
don't have the resources to keep them up anymore, the problem only gets
worse for the remaining operators, who need to deal with an unchanged
number of clients.

One last concern is that we're seeing scalability problems with our
current design. Lots of Chinese users are back on the network, as many
relays have been unblocked by the GFW. Some relays are seeing more than
40k active connections while being far from reaching their bandwidth
limits. If usage continues to grow, no clear bug causing the massive
number of connections can be identified, and it turns out that this is
just Tor's popularity growing, alternative designs that don't require
TCP connections might become a necessity very quickly.

I hope I didn't forget any problem/solution/analysis here. If I did,
please add it so we can all track this down as quickly as possible.
Thanks
Sebastian
[0] http://archives.seul.org/or/relays/Apr-2010/msg00066.html