Effect of Tor window size on performance
Csaba Kiraly
kiraly at disi.unitn.it
Wed Feb 4 09:25:31 UTC 2009
Dear All,
I'm new to the or-dev list so please excuse me if I mention something
that was already discussed.
We did a study of Tor's flow-control mechanism, and what came out is
that reducing the flow-control window size (CIRCWINDOW) would be
beneficial not just for the individual client, but for the whole Tor
network.
You can find a tech report describing our results (still work in
progress) at
http://disi.unitn.it/locigno/preprints/TR-DISI-08-041.pdf
It also covers other topics (namely, measurements with datagram-based
solutions, in our case using IPsec, and a theoretical evaluation of TCP-based
solutions), but I think what is most interesting for immediate use
in development is how much the flow-control window size can influence
application-level delay in an overload situation. In my understanding,
overload is the natural state of the Tor network, so it is important to
make the system behave well in these conditions.
I've started to discuss this topic with Roger, and he suggested that I
extend the discussion to the or-dev list. See our conversation below and
my original mail at the very end of this long mail.
Any feedback/discussion of the ideas is welcome,
Csaba
Roger Dingledine wrote:
> On Sat, Nov 01, 2008 at 07:03:36PM +0100, Csaba Kiraly wrote:
>
>> One thing that could be of immediate use for Tor is the study of Tor's
>> flow-control mechanism, and its behavior with different window sizes. In
>> short, by reducing the window size from the current 1000 to e.g. 200 in
>> the Tor network, there would be a huge decrease in application level
>> round trip times. It would also mean much smaller buffers in OR nodes,
>> reducing their memory consumption significantly. At the same time,
>> throughput would remain the same. Figure 6 of the tech report shows
>> delay and throughput with some window sizes.
>>
>
> ...
>
> I think these are great ideas. You're right that the initial choice of
> window size was just a random number that I picked that seemed large
> enough to provide good throughput. And on thinking about it more,
> I think you're right that a larger number actually just congests the
> network more.
>
> I've had an item on the 'volunteer' page related to window size for a
> few years now: Item #8 on https://www.torproject.org/volunteer#Research
>
> It seems that variable window sizes would be even more flexible than what
> you propose, since then it could adapt to different levels of congestion.
>
> In particular, I worry about:
>
>
>> Your throughput is limited to cell_size * CIRCWINDOW / D (without
>> considering the size of CIRCWINDOW_INCREMENT and delay of the SENDME
>> cell on the return path)
>> With typical data: cell size = 0.5 Kbytes; CIRCWINDOW = 1000; D = 1 sec,
>> the throughput is limited to 500 Kbytes/sec.
>>
>
> What if D=5 because there's still congestion? If we then cut the circuit
> window size by another factor of 5, we end up with an expected throughput
> of 20 KB/s, which is a bit on the 'uncomfortably low' end. (What if D=10?)
>
As a first attempt, we are trying a fixed, reduced window size. If it becomes
variable, what we introduce in place of the current end-to-end
flow control is some kind of congestion control with an end-to-end
feedback loop. To avoid strange interactions between the various levels of
congestion control, I think this variability should be kept really slow
(and obviously it should be done per circuit).
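To make the numbers in this exchange easy to reproduce, here is a quick sketch of the static-window throughput cap quoted above (cell_size * CIRCWINDOW / D). The constants are the thread's illustrative figures, not measurements, and the function name is mine:

```python
# Throughput cap imposed by a static flow-control window:
# cell_size * CIRCWINDOW / D (SENDME return delay ignored).
# All numbers are the thread's illustrative figures, not measurements.

CELL_KB = 0.5  # Tor cell size, roughly 0.5 Kbytes

def throughput_cap_kbps(circwindow: int, d_seconds: float) -> float:
    """Upper bound on circuit throughput (Kbytes/sec) when a static
    window of `circwindow` cells meets `d_seconds` of path delay."""
    return CELL_KB * circwindow / d_seconds

print(throughput_cap_kbps(1000, 1))  # 500.0 -- the current window
print(throughput_cap_kbps(200, 1))   # 100.0 -- the proposed window
print(throughput_cap_kbps(200, 5))   # 20.0  -- the worrying D=5 case
```

The last line is exactly the case raised above: with D=5, a window of 200 cells caps the circuit at 20 Kbytes/sec.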
>
>> Lets concentrate now on 5-7.
>> 5, This is studied in Fig. 2, comparing TCP-in-TCP with normal TCP on
>> the same path (TCP-in-IP). With one stream, delays are almost the same.
>> TCP is good at adapting to congestion and keeping its delay low.
>>
>
> This introduces the recent discussion about using UDP for transport. See
> e.g. http://archives.seul.org/or/talk/Oct-2008/msg00095.html
>
> One of the big reasons they concluded Tor is slow is that it puts
> both loud and quiet streams together into one TCP connection. When TCP's
> back-off kicks in, all the streams get punished, not just the loud ones.
>
> Ultimately we'd like to resolve this whole TCP flow control thing by
> switching to UDP transport between each relay, and then either have a
> user-space TCP stack sitting at each relay to reconstruct the packets,
> or have one sitting at each end of the circuit to reconstruct them.
> Alas, there are still a big pile of open development problems there
> -- as a first example, there appears to be no free-software stable
> platform-independent secure user-space TCP stack out there.
>
>
In the long run, I'm also more of a fan of a datagram-based solution
with only end-to-end congestion control. What we used in the tech report is
IPsec tunnels, at which point the kernel-level TCP stack comes for free.
With this solution, performance is almost equivalent to native IP
tunneling on the same path.
Obviously, IPsec and kernel TCP have all of their well-known drawbacks, but
that's off-topic here.
We had our reasons to use IPsec, but what's important from Tor's
perspective is that a UDP-based transport would produce almost the same
performance, if OR nodes are strong enough and delay factors 5-7 are
handled well (see the delay factors at the end of my original mail).
Another thing that comes to my mind is that a better separation of
signaling (directory lookup, circuit setup, etc.) and data path would be
beneficial. This would ease things such as experimenting with DTLS or
other transport methods on the overlay tunnel, and experimenting with
different solutions end-to-end.
>> 6, I've noticed this limit (MIN_TLS_FLUSHLEN) in the code. Its size is
>> about 32 cells, so it doesn't seem to be an important delay component.
>>
>
> Right. I added that in because in some cases we were reading in hundreds
> of kilobytes without initiating any writes until all the reads were
> done. I have absolutely no idea whether that was a good idea. It would
> be great to be able to have some data about these things.
>
> Speaking of which, Tor's round-robin when we run low on our bandwidth
> buckets has to matter here. We try to round-robin 16KB at a time, but
> have smaller allocations when our token bucket is getting low. Again,
> I just made up some numbers, and it would be great to get real data there.
>
>
>> 7, The huge part of the delay is here: hundreds if not thousands of
>> cells waiting in a FIFO queue (the OR's output buffer towards another
>> OR) before getting to the TLS tunnel.
>>
>
> So the next question is an implementation one. Right now the window sizes
> are hard-coded at both ends. I've been meaning to extend the protocol
> so sendme cells have a number in them, and so the initial window sizes
> are specified in the 'create' and 'created' cells for circuits and the
> 'begin' and 'connected' cells for streams. But we haven't really fleshed
> out the details of those designs, or how they could be phased in and still
> handle clients and relays that don't use (or know about) the numbers.
>
> So the big deployment question is: is it worth it to work on a design
> for the above, and then either shrink the default window sizes or do
> something smarter like variable window sizes, or should we just be
> patient until a UDP-based solution is more within reach?
>
> One answer is that if you were interested in working on a design proposal
> and patch, it would be much more likely to get implemented. :)
>
We are verifying this. Our lab experiments (the ones in the
tech report) show that there is a huge gain in user-side delay,
while throughput is untouched. Throughput is capped with a static window
size, but I think the cap can be chosen better than it is now.
There should also be a big gain in the memory consumption of ORs,
although we haven't measured it yet. Since the Tor network is kind of
overloaded all the time, memory usage should decrease almost linearly
with the window size.
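The near-linear relation between window size and buffer occupancy can be illustrated with a toy per-step simulation. This is only a sketch, not Tor code: the window of 1000 cells and the SENDME increment of 100 are Tor's actual values, while the bottleneck rate and step count are made up for illustration.

```python
def simulate(window=1000, increment=100, bottleneck_cps=100, steps=30):
    """Crude per-step simulation of Tor's window flow control over a
    single bottleneck: the source refills its window as SENDMEs come
    back, and the bottleneck drains at a fixed rate (cells per step).
    Returns the peak queue length at the OR before the bottleneck."""
    in_flight = 0   # cells sent but not yet SENDME-acknowledged
    queue = 0       # cells queued at the OR before the bottleneck link
    delivered = 0   # cells past the bottleneck, not yet acknowledged
    max_queue = 0
    for _ in range(steps):
        # The source pushes as much as the window permits.
        send = window - in_flight
        in_flight += send
        queue += send
        max_queue = max(max_queue, queue)
        # The bottleneck drains at its fixed rate.
        drained = min(queue, bottleneck_cps)
        queue -= drained
        delivered += drained
        # The exit node returns one SENDME per `increment` cells.
        acks = (delivered // increment) * increment
        delivered -= acks
        in_flight -= acks
    return max_queue

print(simulate(window=1000))  # 1000: the queue grows to the full window
print(simulate(window=200))   # 200: a smaller window keeps the queue short
```

The queue at the bottleneck OR settles at the window size in both cases, which is why both the queueing delay and the per-circuit memory should scale down roughly linearly with the window.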
Currently we are verifying one-sided modification of the circuit, i.e.
whether one side of the connection can reduce the window size on its own,
without explicitly notifying the other side. From the code it seems to
me that this will work, and if so, phasing in a smaller window size in a
new release should not be a problem.
>
>> One negative effect that could be considered is the following: with
>> fewer cells in buffers, there is less of a "mixing" effect. I know that
>> there is no real mixing in Tor; what I mean is that even if cells are not
>> re-ordered, due to the introduced delays, finding the correspondence
>> between an entering cell and the corresponding cell leaving later on is
>> quite hard. I don't think reducing this delay is a problem, but it should
>> at least be mentioned.
>>
>
> This doesn't bother me either. I think we're quite vulnerable already
> to an adversary trying to match up flows.
>
>
>> In the report, we also analyze the effect of external TCP tunnels when
>> several streams are multiplexed over them. In this case, a TCP tunnel
>> becomes a real bottleneck.
>>
>
> Indeed. Is this the same statement that I made above, from Joel's
> thesis?
>
Well, we've had different ways of analysing it, but some of the conclusions
are similar, even if not the same :)
We did not try analysing the effect of mixing different (loud and quiet)
kinds of traffic. What we say is that all streams suffer, even if they
are all of the same kind.
>
>> Finally, we also discuss how the combination of 1) datagram based
>> operation, 2) allowing for cell/packet drops in overlay nodes, and 3)
>> relying on only one end-to-end congestion control loop increases
>> performance. We do this through an example of an IPsec based
>> architecture, but results partially apply to DTLS as well.
>>
>
> Sounds great.
>
> Are you able to send this to the or-dev list too? I think it would be
> more useful to have other people see the thread too.
>
> Thanks!
> --Roger
>
Csaba Kiraly wrote:
>> Dear Roger,
>>
>> As Giuseppe said, the tech report is work in progress, however I think
>> there are already some parts that might be interesting for you and for
>> others involved in Tor development.
>>
>> One thing that could be of immediate use for Tor is the study of Tor's
>> flow-control mechanism, and its behavior with different window sizes. In
>> short, by reducing the window size from the current 1000 to e.g. 200 in
>> the Tor network, there would be a huge decrease in application level
>> round trip times. It would also mean much smaller buffers in OR nodes,
>> reducing their memory consumption significantly. At the same time,
>> throughput would remain the same. Figure 6 of the tech report shows
>> delay and throughput with some window sizes.
>>
>> I hope you don't mind if I give a more detailed Tor-specific explanation
>> of this aspect here below, since it was not (yet) discussed in the tech
>> report in enough detail.
>>
>> As you know, Tor breaks up the end-to-end TCP loop, and replaces it with
>> its own reliable delivery service. It implements reliable delivery by
>> using TCP on the overlay links, and by not dropping cells in OR nodes
>> (it can't drop cells in ORs, because there is no other end-to-end
>> reliability mechanism that would resend a cell).
>>
>> If you also consider that bottlenecks are typically inside the Tor
>> network (and not after the exit node), you get the following
>> phenomenon:
>> The source sends cells as fast as its CIRCWINDOW (or
>> STREAMWINDOW) permits. These cells all get buffered at the OR before the
>> bottleneck overlay link. When some (CIRCWINDOW_INCREMENT) data gets
>> through the bottleneck and arrives at the exit node, a SENDME cell is
>> sent back. The source pushes again, and as a result, the buffer before
>> the bottleneck link is always almost full. This introduces two problems:
>> large delays for end nodes and huge memory consumption for ORs.
>>
>> Large delay :
>> The delay introduced by the FIFO buffer in the OR before the bottleneck
>> can be estimated as:
>> cell_size * CIRCWINDOW / bottleneck_bandwidth
>>
>> With typical data: cell size = 0.5 Kbytes; CIRCWINDOW = 1000;
>> bottleneck_bandwidth = 50 Kbytes/sec;
>> you get a potential delay of 10 seconds. Fortunately, applications are
>> typically not that eager to push that much data in; still, the
>> accumulated delay is huge.
>>
>> Large memory consumption at OR nodes:
>> An OR node may have to buffer cell_size * CIRCWINDOW, i.e. 500
>> Kbytes, per circuit.
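As a sanity check, both example numbers above follow directly from the same constants. The snippet below is plain arithmetic under the figures given in the text, nothing measured:

```python
# Back-of-the-envelope check of the two costs above, using the
# example figures from the text (nothing here is measured).
CELL_KB = 0.5          # cell size in Kbytes
CIRCWINDOW = 1000      # current window, in cells
BOTTLENECK_KBPS = 50   # bottleneck overlay link bandwidth, Kbytes/sec

# Worst-case queueing delay at the OR before the bottleneck link:
queue_delay_s = CELL_KB * CIRCWINDOW / BOTTLENECK_KBPS  # 10.0 seconds

# Worst-case buffer an OR may need to hold, per circuit:
buffer_kb = CELL_KB * CIRCWINDOW                        # 500.0 Kbytes

print(queue_delay_s, buffer_kb)
```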
>>
>> With a smaller window, both the delay and memory consumption would be
>> smaller. Thus the important question is: why not reduce the window? One
>> would say throughput, but the important thing to note is that a smaller
>> (but not too small) window would not reduce throughput. This can be seen
>> on the right side of Fig. 4: all the Tor curves are the same, independent
>> of the window size (and the number of streams).
>>
>> The reason is simple: if there is a bottleneck on the path taken by the
>> circuit, the throughput depends on how the bandwidth of the TCP
>> connection of the bottlenecked OR link is shared among circuits.
>> Buffering more cells before this TCP connection does not make it more
>> effective.
>>
>> Delay and throughput can be analyzed in more detail.
>>
>> In Tor, delay comes from the following factors (I hope I've managed to
>> list them all):
>> 0, circuit setup delay
>> 1, delay outside the Tor network: TCP from application to SOCKS proxy
>> 2, delay outside the Tor network: TCP from exit node to destination node
>> 3, cell forming delay at the edge of the circuit (either at the OP or at
>> the exit OR)
>> 4, IP level delay on each overlay link (due to physical distance, IP
>> hops, etc.)
>> 5, delay introduced by TCP on each overlay link
>> 6, delay introduced by TLS record forming on each overlay link
>> 7, delay of FIFO cell buffers in ORs for each overlay link
>>
>> With CIRCWINDOW=1000, it seems that point 7 is responsible for a large
>> part of the overall delay:
>> 0, is important at the beginning, but not during transmission
>> 1, is almost negligible
>> 2, is in the range of normal unprotected operation. There is not much
>> that can be done to reduce it, except selecting an exit node near the
>> destination, but this could compromise privacy.
>> 3, you already have a mechanism (based on port numbers) to mitigate this
>> 4, there is not much that can be done about this, except changing the OR
>> selection strategy, and thus compromising privacy. There are numerous
>> proposals for that, as far as I remember.
>>
>> These factors add up to some hundreds of milliseconds, depending on the
>> choice of the OR nodes, and how they are connected to the Internet.
>>
>> Lets concentrate now on 5-7.
>> 5, This is studied in Fig. 2, comparing TCP-in-TCP with normal TCP on
>> the same path (TCP-in-IP). With one stream, delays are almost the same.
>> TCP is good at adapting to congestion and keeping its delay low.
>>
>> 6, I've noticed this limit (MIN_TLS_FLUSHLEN) in the code. Its size is
>> about 32 cells, so it doesn't seem to be an important delay component.
>>
>> 7, The huge part of the delay is here: hundreds if not thousands of
>> cells waiting in a FIFO queue (the OR's output buffer towards another
>> OR) before getting to the TLS tunnel.
>>
>> This suggests reducing CIRCWINDOW as much as possible. To
>> understand what "possible" means, throughput should be analyzed:
>> Suppose for a minute that the Tor network is not overloaded, and there
>> is no overlay link bottleneck that would limit your throughput. In this
>> case 7) becomes negligible, while CIRCWINDOW limits your cells in
>> flight, and thus your throughput. Let D be the total delay due to delay
>> factors 1-6.
>>
>> Your throughput is limited to cell_size * CIRCWINDOW / D (without
>> considering the size of CIRCWINDOW_INCREMENT and delay of the SENDME
>> cell on the return path)
>> With typical data: cell size = 0.5 Kbytes; CIRCWINDOW = 1000; D = 1 sec,
>> the throughput is limited to 500 Kbytes/sec.
>>
>> Reducing CIRCWINDOW to 200, this goes down to 100 Kbytes/sec (if D=1
>> sec), which is still much above what is needed. Moreover, the Tor
>> network IS more congested than that (at least this was my impression),
>> so reducing CIRCWINDOW would have absolutely no negative effect on the
>> throughput.
>>
>> One negative effect that could be considered is the following: with
>> fewer cells in buffers, there is less of a "mixing" effect. I know that
>> there is no real mixing in Tor; what I mean is that even if cells are not
>> re-ordered, due to the introduced delays, finding the correspondence
>> between an entering cell and the corresponding cell leaving later on is
>> quite hard. I don't think reducing this delay is a problem, but it should
>> at least be mentioned.
>>
>> In the report, we also analyze the effect of external TCP tunnels when
>> several streams are multiplexed over them. In this case, a TCP tunnel
>> becomes a real bottleneck.
>>
>> Finally, we also discuss how the combination of 1) datagram based
>> operation, 2) allowing for cell/packet drops in overlay nodes, and 3)
>> relying on only one end-to-end congestion control loop increases
>> performance. We do this through an example of an IPsec based
>> architecture, but results partially apply to DTLS as well.
>>
>> I hope you don't mind this long mail!
>> Best regards,
>> Csaba