Proposal 168: Reduce default circuit window
Björn Scheuermann
scheuermann at cs.uni-duesseldorf.de
Tue Sep 1 11:50:45 UTC 2009
Hi all,
this sounds like an interesting proposal! However, I'd like to point out some
implications and side effects. Over the past months, one of my students
(Daniel Marks, on CC) has worked on congestion control in Tor. One of his
results is an implementation of Tor circuits in a network simulation framework
(ns-2). We use this implementation to simulate Tor networks of different size
and with different traffic patterns, play with various parameters and measure
throughput, cell latencies, etc. Currently, our aim is to gain a better
understanding of congestion effects in Tor and their implications and
interrelations.
Over the weekend, Daniel has run some simulations with the proposed reduction
of the window size. The results are certainly still a bit preliminary, but
some trends are nevertheless clear.
> 1. Overview
>
> We should reduce the starting circuit "package window" from 1000 to
> 101. The lower package window will mean that clients will only be able
> to receive 101 cells (~50KB) on a circuit before they need to send a
> 'sendme' acknowledgement cell to request 100 more.
>
> Starting with a lower package window on exit relays should save on
> buffer sizes (and thus memory requirements for the exit relay), and
> should save on queue sizes (and thus latency for users).
A central problem currently is the huge queues that build up in the onion
routers. A reduction of the window size will certainly result in shorter
queues. Whether these translate into lower latencies depends on the definition
of "latency", however. If "latency" is the time that a given cell spends
within the Tor overlay, certainly yes. If "latency", however, is the time that
it takes to download, e.g., a certain web page, this does not necessarily hold
true (see below).
> Lowering the package window will induce an extra round-trip for every
> additional 50298 bytes of the circuit. This extra step is clearly a
> slow-down for large streams, but ultimately we hope that a) clients
> fetching smaller streams will see better response, and b) slowing
> down the large streams in this way will produce lower e2e latencies,
> so the round-trips won't be so bad.
As far as we see, the current Tor design largely decouples the queues of
individual circuits (which is good!). If some given circuit's queue in an
onion router is short, then cells from this circuit will experience low
queueing delays, even if the queues of other circuits in the same router are
long - the round-robin scheduler will pick cells from the short queue for
transmission not long after they arrive. (The outgoing TCP buffer also has a
certain impact here, but by far the largest fraction of the queueing delay is
apparently within Tor itself.) This also implies: slowing down the large
streams does not necessarily and automatically lead to shorter download times
for the small ones!
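To make this concrete, here is a minimal Python sketch of per-circuit queues
with a round-robin scheduler - not Tor's actual code, just an abstraction of
the mechanism as we understand it:

from collections import deque

class RouterSketch:
    def __init__(self):
        self.queues = {}       # circuit id -> FIFO of queued cells
        self.active = deque()  # round-robin rotation of circuits with cells

    def enqueue(self, circ, cell):
        if circ not in self.queues or not self.queues[circ]:
            self.active.append(circ)  # circuit (re-)enters the rotation
        self.queues.setdefault(circ, deque()).append(cell)

    def transmit_one(self):
        """Send one cell from the next circuit in round-robin order."""
        if not self.active:
            return None
        circ = self.active.popleft()
        cell = self.queues[circ].popleft()
        if self.queues[circ]:
            self.active.append(circ)  # still non-empty: stays in rotation
        return (circ, cell)

router = RouterSketch()
for i in range(1000):                 # a bulk circuit with a very long queue
    router.enqueue("bulk", i)
router.enqueue("web", "request")      # a circuit with a single queued cell
print(router.transmit_one())          # ('bulk', 0)
print(router.transmit_one())          # ('web', 'request')

The "web" cell leaves after at most one cell per other active circuit, no
matter how many cells are queued for "bulk".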
> 2. Motivation
>
> Karsten's torperf graphs show that the median download time for a 50KB
> file over Tor in mid 2009 is 7.7 seconds, whereas the median download
> time for 1MB and 5MB are around 50s and 150s respectively. The 7.7
> second figure is way too high, whereas the 50s and 150s figures are
> surprisingly low.
Your statement implies that you are looking at something like the
"transmission latency per KB" (which is, by the way, identical to the inverse
throughput). This metric can be treacherous. To illustrate my point, imagine a
(very) long communication link, e.g., going from the earth to the moon
(ok...). Assume it has an enormously large available bandwidth, but also a
very high link latency (say, 7.7 seconds ;-) ). If you send a file of 50 KB
over that link, it will never arrive in less than 7.7 seconds. If you send a
5 MB file (without waiting for ACKs), it will likewise arrive after about 7.7
seconds - so you get a much lower "transmission latency per KB of data", i.e.,
a much higher bandwidth. Artificially limiting the 5 MB transmissions'
bandwidth will not help the 50 KB transmissions to deliver their data any
faster.
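The same example in numbers (the 7.7 s latency is the figure from above; the
bandwidth value is of course made up):

# Back-of-the-envelope version of the earth-moon example: the 7.7 s link
# latency is the figure from above, the bandwidth value is an assumption.
link_latency_s = 7.7        # one-way "propagation delay" of the link
bandwidth_KBps = 100_000    # "enormously large" bandwidth (assumed)

for size_KB in (50, 5_000):
    transfer_s = link_latency_s + size_KB / bandwidth_KBps
    per_KB_ms = 1000 * transfer_s / size_KB
    print(f"{size_KB:>4} KB: {transfer_s:.2f} s total, {per_KB_ms:.1f} ms per KB")
# ->   50 KB: ~7.70 s total, ~154 ms per KB
# -> 5000 KB: ~7.75 s total, ~1.6 ms per KB
# Both transfers need ~7.7 s, yet the "latency per KB" differs by a factor
# of ~100 - throttling the 5 MB transfer does not make the 50 KB one faster.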
This may be a bit (ok, not just a bit) simplistic, but it underlines that
throughput and latency are not independent and can be optimized individually
or traded off against each other only to a certain extent. And if there are
multiple interfering circuits, things become really interesting and often
behave differently than one would initially expect.
The key question here is what the 7.7 seconds are spent on; they are a
symptom, not the cause. Our current impression is that the key problem in Tor
is not so much the maximum window size as the fact that the window is opened
in huge chunks. This makes the traffic extremely bursty and results in
unnecessarily long queues. If you just reduce the maximum window size to 101
but still release 100 cells at once into the overlay, this burstiness will
remain (or even become worse).
> The median round-trip latency appears to be around 2s, with 25% of
> the data points taking more than 5s. That's a lot of variance.
That's also what we see in the simulations. However, it appears that the
variance gets worse if you reduce the window size. The reason is that a "stop-
and-go" traffic pattern emerges: after 101 cells, the source must wait for the
sendme; during that time, the circuit becomes completely empty. Free bandwidth
at the routers goes unused, while cells that are ready to be sent must wait
for a long time at the circuit's source. After the sendme arrives, the first
cells make it through the (now empty) overlay very quickly. The differences
between individual cells' latencies are therefore huge.
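A toy calculation illustrates the idle periods; the circuit bandwidth and RTT
below are assumed values, not measurements:

# Toy numbers for the stop-and-go effect (all values are assumptions):
# one window of cells is released, then the source idles until the sendme
# comes back.
cell_payload = 498        # payload bytes per cell
window_cells = 101        # proposed initial/maximum window
bottleneck_KBps = 100     # assumed circuit bandwidth
rtt_s = 2.0               # assumed circuit round-trip time

burst_s = window_cells * cell_payload / (bottleneck_KBps * 1000)
idle_s = max(0.0, rtt_s - burst_s)
print(f"draining one window takes ~{burst_s:.2f} s")       # ~0.50 s
print(f"circuit then sits empty for ~{idle_s:.2f} s")      # ~1.50 s
print(f"utilization ~{burst_s / (burst_s + idle_s):.0%}")   # ~25%
# The first cells after each sendme cross an empty circuit and arrive
# quickly; the last cells of a burst queue up behind the others - hence
# the large spread of per-cell latencies.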
> We designed Tor originally with the original goal of maximizing
> throughput. We figured that would also optimize other network properties
> like round-trip latency. Looks like we were wrong.
Well - cutting down the bandwidth (at the risk of underutilizing resources in
the Tor overlay) does not automatically improve other performance aspects. I
have a feeling this could turn out to be a case of "out of the frying pan into
the fire".
> (...)
>
> 3.3. What about variable circuit windows?
>
> Once upon a time we imagined adapting the circuit package window to
> the network conditions. That is, we would start the window small,
> and raise it based on the latency and throughput we see.
>
> In theory that crude imitation of TCP's windowing system would allow
> us to adapt to fill the network better. In practice, I think we want
> to stick with the small window and never raise it. The low cap reduces
> the total throughput you can get from Tor for a given circuit. But
> that's a feature, not a bug.
Note that a hard window size limit does not directly cap the throughput, but
the bandwidth-delay product: the achievable throughput is at most the window
size divided by the circuit's RTT. Circuits with longer RTTs would therefore
experience a dramatic breakdown in throughput, because they can only send ~50
KB and must then wait for the sendme to arrive. Circuits with low RTT are much
less affected. Applications will also suffer from extremely bursty traffic
("stop-and-go"), probably worse than in the current Tor overlay. This is
exactly what we observe in our simulations.
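The resulting bound is simply window size divided by RTT; the RTT values in
this small calculation are assumptions for illustration:

# Rough upper bound on circuit throughput implied by a fixed 101-cell window:
# at most one window of data can be in flight per round-trip time.
window_bytes = 101 * 498    # ~50 KB maximum in flight
for rtt_s in (0.5, 2.0, 5.0):
    print(f"RTT {rtt_s} s -> at most ~{window_bytes / rtt_s / 1000:.0f} KB/s")
# RTT 0.5 s -> at most ~101 KB/s
# RTT 2.0 s -> at most ~25 KB/s
# RTT 5.0 s -> at most ~10 KB/s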
(BTW: Could this have implications for security? From the maximum throughput
of a circuit, or from the burst pattern, it might be possible to infer
information about the circuit's length/delay, or about an onion router's
position within the circuit.)
> 4. Evaluation
>
> How do we know this change is actually smart? It seems intuitive that
> it's helpful, and some smart systems people have agreed that it's
> a good idea (or said another way, they were shocked at how big the
> default package window was before).
>
> To get a more concrete sense of the benefit, though, Karsten has been
> running torperf side-by-side on exit relays with the old package window
> vs the new one. The results are mixed currently -- it is slightly faster
> for fetching 40KB files, and slightly slower for fetching 50KB files.
Note that file sizes of 40 or 50 KB cannot tell you much about such a
modification. As you state above, 100 cells is equivalent to ~50 KB of
payload, i.e., the whole transmission fits into the initial transfer window.
So, the data transfer will be over before the circuit's window mechanism has
had a chance to become active at all! Consequently, modifications to the
window mechanism will not directly impact transfers below that threshold.
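A rough calculation (assuming ~498 payload bytes per cell and 1 KB = 1000
bytes) makes this concrete:

# How many cells do the torperf test files need, compared with the window?
cell_payload = 498
for size_KB in (40, 50):
    cells = -(-size_KB * 1000 // cell_payload)   # ceiling division
    print(f"{size_KB} KB -> {cells} cells (initial window: 101 cells)")
# 40 KB -> 81 cells, 50 KB -> 101 cells: such transfers fit (just about)
# into the initial window, so the sendme mechanism hardly comes into play.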
> I think it's going to be tough to get a clear conclusion that this is
> a good design just by comparing one exit relay running the patch. The
> trouble is that the other hops in the circuits are still getting bogged
> down by other clients introducing too much traffic into the network.
Maybe I'm getting something royally wrong - but what exactly are the
mechanisms within an onion router that would result in one circuit
experiencing much longer cell queuing delays just because *another* circuit
has a long queue in that router? In terms of fair resource sharing between
circuits (and the opportunity to forward cells quickly), our impression is
that the round-robin scheduler does its job pretty well. It will be hard to
improve on that just by introducing intentional resource underutilization.
But you guys know the Tor sources much better than we do - are we missing
something important here?
Cheers
Björn
--
Jun.-Prof. Dr. Björn Scheuermann
Mobile and Decentralized Networks Group
Heinrich Heine University
Universitätsstr. 1, D-40225 Düsseldorf, Germany
Building 25.12, Room 02.42
Tel: +49 211 81 11692
Fax: +49 211 81 11638
scheuermann at cs.uni-duesseldorf.de
http://www.cn.uni-duesseldorf.de