[tor-scaling] Making tor a High Performance Router
Mike Perry
mikeperry at torproject.org
Mon Feb 25 22:49:25 UTC 2019
Going two-for-one and replying to both Tom and David. I hope this
doesn't get too confusing.
Tom Ritter:
> I think it's safe for me to relay a +1 on behalf of Mozilla... At
> least three of the things you mentioned were things Patrick McManus
> had identified:
>
> - Congestion Control
> - Scaling the daemon up, to essentially operate at line speed; a
> minimum of 5 gbps on a 10gbps link if on appropriate hardware.
The above two are both very big chunks of work. But if we think about
why we want them, there are a lot of smaller chunks of work that can get
us the same types of gains, and will also still continue to provide
benefit after the above moonshots are completed.
The following are straightforward engineering tasks for which
comprehensive research already exists. Some of these might need a
proposal and/or minor protocol tweaks, but they are generally well
understood changes. I've tried to sort these in order of best results
for amount of engineering effort required:
- Tune Circuit Build Timeout
The circuit build timeout is an adaptive mechanism that clients use
to dynamically avoid congested relays by giving up on circuits that
take too long to build (due to network congestion).
Right now, we try to time out circuits at the 80% quantile, so that
clients only use the fastest 80% of paths. We should verify that
we're actually doing this, and we could lower this further, via a
consensus parameter change.
- Use two guards in a balanced way
Using two guards in a balanced way will let clients use Circuit Build
Timeout to automatically shift away from temporarily congested guards,
and has several other benefits due to other single-point-of-failure
problems of trying to use only one guard (and failing to do so, btw).
This is also just a consensus parameter change.
- Cut out garbage relays
As David and Iain have said, we have relays that do not have enough
capacity to cover the amount of overhead caused by sending their
descriptor information to all clients. At the very least, they
should be cut from the network. But relays that are measured
significantly below the network average in terms of stream capacity
by sbws should also be cut.
This is also a simple consensus change.
- Latency-based circuit migration
If a circuit appears to slow down based on regular RTT pings by the client,
clients could rebuild another path to the same RP or exit, and resume
the previous circuit using a Conflux-style UUID. This will help smooth
out congestion, and is the first piece of a Conflux deployment.
(Care needs to be taken so that clients don't migrate circuits too
often, which would enable guard discovery, but moving each circuit up
to K=3-5 times should be fine.)
- Unreliable conflux
(https://www.cypherpunks.ca/~iang/pubs/conflux-pets.pdf)
Unreliable conflux (i.e. the system described in that paper) can be
built very easily using the above RTT pings and migration UUID,
using the RTT load balancer from the paper.
The upsides of Conflux from that paper are dynamic load balancing
and throughput improvements. The downside is that Tor circuits become
twice as unreliable, since if either branch fails, the whole circuit
fails. (See below for reliable conflux).
- Non-bandwidth load balancing feedback
Right now, we cannot achieve load balancing equilibrium to devote
enough traffic to fast relays that can handle it, because many
relays will run into CPU, socket, and memory exhaustion first,
before they are measured at capacity.
This situation will become worse as we move closer to a world where
relays can handle wire speed loads, as Tom Ritter suggested above.
If we had a feedback mechanism for these other load signals, we
could actually send as much traffic as these relays could take.
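As a rough illustration of the "Tune Circuit Build Timeout" item above,
here is a minimal Python sketch of picking a timeout at the 80% quantile
of observed build times. It is not tor's actual estimator, which is more
involved; it uses a plain nearest-rank empirical quantile as a
simplification. The function names and the sample values are made up for
the example.

```python
import math

def build_timeout(build_times_ms, quantile=0.80):
    """Return the build time at `quantile`, using the nearest-rank method.

    Hypothetical helper: a simplification, not tor's real estimator.
    """
    if not build_times_ms:
        raise ValueError("need at least one build time sample")
    if not 0.0 < quantile <= 1.0:
        raise ValueError("quantile must be in (0, 1]")
    ordered = sorted(build_times_ms)
    rank = math.ceil(quantile * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

def fraction_kept(build_times_ms, timeout_ms):
    """Fraction of sampled circuits that would finish within the timeout."""
    kept = sum(1 for t in build_times_ms if t <= timeout_ms)
    return kept / len(build_times_ms)

# Made-up build times in milliseconds, with a slow tail.
samples = [310, 420, 450, 500, 520, 610, 700, 880, 1500, 4000]

timeout = build_timeout(samples, quantile=0.80)
kept = fraction_kept(samples, timeout)
print(f"timeout={timeout} ms keeps {kept:.0%} of sampled circuits")
```

The second helper is the kind of check the "verify that we're actually
doing this" step implies: compare the fraction of circuits that survive
the timeout against the intended quantile. Lowering the quantile
argument models the "lower this further" option.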
The following improvements require research. These are again sorted
in terms of best-gains-for-effort-required:
- Hacky congestion control (either per-hop ECN, or client-measured).
Any congestion control is better than no congestion control, but
unfortunately we need to carefully consider side channel effects
of even ECN, and measure the performance benefits of various
side-channel minimizing implementations. This requires a bit of
research and experimentation.
- Failure resistant conflux
Failure resistant conflux is a conflux implementation that buffers
enough cells to be able to re-transmit any cells that are in flight
over a circuit branch that fails due to connectivity or node uptime.
Failure resistant conflux is great for security and for Snowflake.
However, we will need to do some minimum experiments on how to
detect failures as soon as possible, and how much retransmit window
buffer memory is required to recover from most failures.
- Paid exit pools w/ anon creds
Exits are both a censorship and bandwidth bottleneck. If we allowed
paid exits using anonymous credentials, not only could this help
people avoid Exit bans, but it could also result in faster service.
- Walking Onions, with shorter consensus intervals
Nick has a very interesting pre-proposal to remove the requirement
for clients to know the entire network consensus. This would be
a big win for scalability, but it will also enable faster consensus
generation, which means more dynamic load balancing that reacts
quicker to relay load.
I believe research is needed on the best crypto primitive to use
here, though, as well as how to handle things like exit policies.
- Faster decentralized relay measurement
Systems like Peerflow could be deployed to measure relays much
quicker than the current Torflow/sbws system, and also more securely.
With Walking Onions, this measurement feedback can take effect
much more rapidly than every 4 hours.
- Move to multithreaded networking and crypto
Relays currently stop sending data to the kernel while doing crypto
and other expensive things. If we were able to parallelize this,
there might be less stalling at fast relays. Not to mention using all
cores and reducing the number of TorServersMunich1-12 or similar
going on out there on single machines, which has savings in terms
of consensus overhead.
- Datagram Tor
This is a deep rabbithole of potential side channel leaks, as
well as massive engineering effort. But it will provide massive
scalability benefits by making relay memory requirements for
queuing be O(1) instead of O(#clients). It will also drastically
reduce Tor's latency and performance variance.
- Userland TCP
With a Datagram Tor, we can eliminate TCP termination at exits,
vastly reducing their memory costs for Failure Resistant Conflux
as well as general TCP termination overhead.
Doing this also requires a userland TCP stack on the client side,
to avoid anonymity issues from TCP fingerprinting.
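To make the "Failure resistant conflux" item above a bit more concrete,
here is a minimal Python sketch of a retransmit window: the sender keeps
every unacknowledged cell, and if one branch fails it re-sends that
branch's in-flight cells over the other branch. Everything here (the
class, the sequence numbers, the cumulative ack, the fixed window size)
is an assumption for illustration, not a protocol design from the paper
or from tor.

```python
from collections import OrderedDict

class RetransmitBuffer:
    """Hypothetical sender-side window of cells not yet acknowledged."""

    def __init__(self, max_cells):
        self.max_cells = max_cells          # retransmit window size (assumed)
        self.in_flight = OrderedDict()      # seq -> (branch, payload)

    def can_send(self):
        """True while the window still has room for another cell."""
        return len(self.in_flight) < self.max_cells

    def record_send(self, seq, branch, payload):
        """Remember a cell sent on `branch` until it is acknowledged."""
        if not self.can_send():
            raise BufferError("retransmit window full")
        self.in_flight[seq] = (branch, payload)

    def ack_up_to(self, seq):
        """Drop every cell with sequence number <= seq (cumulative ack)."""
        for s in [s for s in self.in_flight if s <= seq]:
            del self.in_flight[s]

    def fail_branch(self, failed, fallback):
        """Reassign the failed branch's cells; return them for re-sending."""
        resend = []
        for seq, (branch, payload) in list(self.in_flight.items()):
            if branch == failed:
                self.in_flight[seq] = (fallback, payload)
                resend.append((seq, payload))
        return resend

buf = RetransmitBuffer(max_cells=4)
buf.record_send(1, "A", b"c1")
buf.record_send(2, "B", b"c2")
buf.record_send(3, "A", b"c3")
buf.ack_up_to(1)                        # cell 1 arrived
lost = buf.fail_branch("A", "B")        # branch A dies with cell 3 in flight
print(lost)
```

The max_cells parameter is the quantity the experiments mentioned above
would have to size: too small and the sender stalls waiting for acks,
too large and the buffer memory cost grows.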
> - (Maybe not something you said explicitly, but something I thought
> you hinted at) A confusion over where the overhead is in the network.
> Why do we have X amount of advertised bandwidth but we only use X/2 of
> the bandwidth? Why aren't we reaching better utilization?
It's generally not optimal for performance for a network to be anywhere
near saturated. This is because as you get closer to saturation, the
probability of one additional connection causing saturation and backoff
is much higher. We're debating how to communicate this to relay
operators in our metrics and measurement reporting.
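One textbook way to see why running near saturation hurts is a
single-server queue with random arrivals (M/M/1), where the mean time in
the system is the service time multiplied by 1 / (1 - utilization). This
is only an analogy under that queueing assumption, not a model of the Tor
network; the function name is made up for the example.

```python
def mm1_delay_factor(utilization):
    """Mean time in an M/M/1 system, in units of the mean service time.

    Assumes a single server with Poisson arrivals and exponential
    service times; utilization is arrival rate divided by service rate.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return 1.0 / (1.0 - utilization)

for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    print(f"utilization {rho:.2f}: delay factor {mm1_delay_factor(rho):.1f}")
```

Under this assumption, going from 50% to 90% utilization multiplies the
mean delay by five, and the curve keeps steepening as utilization
approaches 100%.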
> > In order to try to go forward instead of complaining, here are my ideas for
> > scaling the traffic relay performance of Tor:
> >
> > - As I said above, the very first thing for me is to change our mindset and
> > now see "tor" as the "crème de la crème" in terms of userspace traffic
> > routing application. With that in mind, the rest follows but we should
> > really not settle for something lower. Not many projects have a network the
> > size we have so not many of "us" on the Internet.
I am deeply concerned about Tor's development agility due to our strict
waterfall model. Key example: When I look at Tor 0.3.5 being supported
until 2022, I can't imagine us getting to a network with a different
end-to-end model before then, but probably not even until the *next* LTS
release after that stops being used by relay operators, which will
probably be 2026.
If we want to radically overhaul Tor for performance and scalability, we
have to consciously decide how to manage change more efficiently. In my
eyes, this means either an experimental network and/or changing our
package distribution methods so that relay operators update much more
frequently (eliminating the need for an LTS).
The alternative is to accept increased anonymity risk due to increased
heterogeneity/lack of uniformity on the main net, which I think most of
us would disagree with out of the gate.
Given the magnitude of the changes we know we need to make to scale and
perform well, something has to change about our methodology here, or we
will ossify ourselves into irrelevance.
> > - We *need* to have developers taking the necessary time to understand,
> > cleanup, improve, and tests our current ways of relaying traffic within
> > little-t tor.
> >
> > In my very strong opinion, we will simply *NOT* scale with our current code
> > and architecture situation. I can outline this whole thing in technical
> > details if need be on the "why".
I have long wondered if an experimental Tor implementation is what we
need here, perhaps 100% in another language like Rust or Go, similar to
what Mozilla did with Servo. This will speed up both research and
development experimentation.
Then we can take pieces of that thing and move it into core-tor or just
switch to it entirely, depending on results.
If it were up to me, I would not require full compatibility with the
current Tor network for this implementation, due to my concerns about
protocol and release rigidity above. But OTOH, if it is too
incompatible/different, then it may not be possible to take pieces of it
for core-tor.
--
Mike Perry