Analysis of the problems many relay operators are currently facing

Sebastian Hahn mail at sebastianhahn.net
Wed Apr 21 16:04:24 UTC 2010


I'll try to summarize here what I've learned over the past weeks about the
problems we are currently having with the Tor network as a whole and the
issues that individual relay operators face, and to describe the issues we
have identified (some of which have been addressed already). The
information comes from #tor-dev on OFTC, bug reports and mailing lists, but
no overview exists, so it seems worthwhile to collect what we know.

For the past months, quite a few relays have been sporadically dropping
from the consensus, either until they published their next descriptor or
for longer periods of time. We have identified the following problems:

     Some vendors have backported openssl features to older versions,
     rendering the affected relays either completely useless, as they are
     unable to establish connections and so won't even bootstrap, or
     useless as relays to people using certain openssl versions. Thus, the
     directory authorities couldn't establish connections to them and
     marked them offline.

     We believe this is now fixed as of 0.2.2.11-alpha. A fix for the
     stable series of Tor has not been released yet.


     Authorities only downloaded descriptors for relays from V2 directory
     authorities if they didn't have them available themselves. As only two
     V2 auths remain, one of which probably disallows most relays from
     publishing descriptors, this led to authorities knowing about only a
     part of the network. Some relays were thus unreachable by the majority
     of dir authorities, meaning they dropped out of the consensus.

     We believe this is now fixed as of 0.2.2.12-alpha. Not all authorities
     have upgraded yet.


     Relays (and authorities) running 0.2.2.11-alpha crash 24 hours after
     start if they have the statistics gathering functionality enabled.

     We believe this is now fixed as of 0.2.2.12-alpha. A workaround is to
     disable statistics gathering.


     Another issue exists that has not been identified yet, where a relay
     is only reachable from outside sporadically, even though there is no
     load. This issue is rare and has not been reproduced reliably.

Another class of problems affects some/many relays: the relay attracts a
huge number of connections, affecting the stability of network equipment
and the operating system. These problems might occur:

     The Tor process runs out of memory, because it has too many open
     connections. Tor is then killed by the OS's OOM-killer.

     Tor exhausts the ulimit -n that applies to it, meaning random things
     like opening logfiles, establishing new connections or gathering more
     entropy fail, often creating many warnings in Tor's logfile. In some
     cases it appears that Tor is spinning until a file descriptor becomes
     available, burning all CPU.

     Tor makes a home router/DSL modem/kernel lock up, because the device
     cannot handle the load. Symptoms include that internet access is
     completely nonfunctional even after the relay is stopped, or that it
     is extremely slow. These symptoms might last until the relevant piece
     of equipment is restarted.
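
To make the file descriptor item above concrete, here is a minimal sketch
of how one might compare a process's descriptor limit with its current
usage. It inspects the Python process it runs in, not Tor itself; the
function names and the `reserve` margin are assumptions made up for
illustration, and /proc/self/fd is Linux-specific.

```python
import os
import resource


def current_fd_usage():
    """Return (soft_limit, open_fds) for the current process.

    open_fds is None where /proc/self/fd is unavailable (non-Linux).
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        open_fds = len(os.listdir("/proc/self/fd"))
    except OSError:
        open_fds = None
    return soft, open_fds


def headroom(soft_limit, open_fds, reserve=64):
    """Descriptors left for new connections after holding back `reserve`.

    `reserve` is a hypothetical margin for logfiles, entropy sources and
    other non-socket descriptors. A negative result means the process is
    already eating into that margin.
    """
    return soft_limit - open_fds - reserve


if __name__ == "__main__":
    soft, used = current_fd_usage()
    print("soft limit (ulimit -n):", soft)
    print("open descriptors:", used)
    if used is not None and soft != resource.RLIM_INFINITY:
        print("headroom:", headroom(soft, used))
```

A relay holding tens of thousands of connections needs a limit well above
the common default of 1024 for the headroom to stay positive.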


     All these share the same underlying problem: Tor is getting more
     connections than it can handle. One way to help would be to make sure
     unused connections are closed more quickly, so that relays don't need
     to maintain as many active connections concurrently as they need to
     do now. A Tor patch that logs what state current connections have [0]
     shows that on some systems, around 10% of all connections were used
     for a begindir operation before, but now don't have a circuit attached
     anymore. Generally, the fraction of connections used exclusively for
     begindir operations appears to be high, so it might be worthwhile to
     close the circuits on them more quickly and not keep them around for
     possible later cannibalization.
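
As a rough sketch of the kind of tally described above, the following
computes what fraction of connections falls into each state. The state
labels and the sample data are invented for illustration; they are not the
output format of the patch in [0].

```python
from collections import Counter


def state_fractions(states):
    """Map each connection state label to its share of all connections."""
    counts = Counter(states)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {state: n / total for state, n in counts.items()}


# Hypothetical sample: one label per open connection.
sample = (
    ["has_circuits"] * 7
    + ["idle_after_begindir"] * 1
    + ["other"] * 2
)

fractions = state_fractions(sample)
for state, share in sorted(fractions.items()):
    print(f"{state}: {share:.0%}")
```

Connections in a state like the hypothetical `idle_after_begindir` are the
ones that closing circuits earlier would free up.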
	
	
     Another theory is that the fastest relays (by consensus weights) are
     used by a large proportion of users. This means that almost every Tor
     user will make a connection to those few relays, massively increasing
     the number of connections the relay has to handle at the same time.
     Some evidence supporting this is that even after the bw authorities
     voted blutmagie's bw weight down a lot after the operator lowered the
     BandwidthRate considerably, it was still seeing many concurrent
     connections, while the number of new connections/s was dropping a lot.
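
A back-of-the-envelope sketch of why this theory is plausible: if a relay
is picked with probability w on each path selection, the chance that a
client has picked it at least once after k independent selections is
1 - (1 - w)^k. This is a deliberate simplification that ignores entry
guards, exit policies and relay-to-relay connections, and the weight share
used below is an assumed number, not a measurement.

```python
def touch_probability(weight_share, selections):
    """Chance that a relay is chosen at least once.

    weight_share: probability of picking the relay in one selection.
    selections:   number of independent selections a client makes.
    """
    if not 0.0 <= weight_share <= 1.0:
        raise ValueError("weight_share must be between 0 and 1")
    if selections < 0:
        raise ValueError("selections must not be negative")
    return 1.0 - (1.0 - weight_share) ** selections


if __name__ == "__main__":
    # Assumed 1% share of the selection weight for one fast relay.
    for k in (1, 10, 100, 500):
        print(k, round(touch_probability(0.01, k), 3))
```

Under these assumptions, a client that builds enough circuits will
eventually touch any relay with a noticeable share of the weight.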
	
	
     As many relay operators are forced to turn off their relay because
     they don't have the resources to keep it up anymore, the problem only
     gets worse for the other operators, who need to deal with an unchanged
     number of clients.
	
	
     One last concern is that we're seeing scalability problems with our
     current design. Lots of Chinese users are back on the network, as many
     relays have been unblocked by the GFW. Some relays are seeing more
     than 40k active connections while being far from reaching their bw
     limits. If usage continues to grow, no clear bug can be identified
     that causes the massive number of connections, and it can be
     determined that this is just Tor's popularity growing, alternative
     designs that don't require tcp connections might become a necessity
     very quickly.
	
I hope I didn't forget any problem/solution/analysis here. If I did, please
add it so we can all track this down as quickly as possible.

Thanks
Sebastian
	
[0] http://archives.seul.org/or/relays/Apr-2010/msg00066.html


