[tor-scaling] Making tor a High Performance Router

Mon Feb 25 22:49:25 UTC 2019

Going two-for-one and replying to both Tom and David. I hope this
doesn't get too confusing.

Tom Ritter:
> I think it's safe for me to relay a +1 on behalf of Mozilla... At
> least three of the things you mentioned were things Patrick McManus
> had identified:
> 
>  - Congestion Control
>  - Scaling the daemon up, to essentially operate at line speed; a
> minimum of 5 gbps on a 10gbps link if on appropriate hardware.

The above two are both very big chunks of work. But if we think about
why we want them, there's a lot of smaller chunks of work that can get
us the same types of gains, and will also still continue to provide
benefit after the above moonshots are completed.

The following are straightforward engineering tasks for which
comprehensive research already exists. Some of these might need a
proposal and/or minor protocol tweaks, but they are generally well
understood changes. I've tried to sort these in order of best results
for amount of engineering effort required:

- Tune Circuit Build Timeout
  The circuit build timeout is an adaptive mechanism that clients use
  to dynamically avoid congested relays by giving up on circuits that
  take too long to build (due to network congestion).

  Right now, we try to time out circuits at the 80% quantile, so that
  clients only use the fastest 80% of paths. We should verify that
  we're actually doing this, and we could lower the quantile further
  via a consensus parameter change.

- Use two guards in a balanced way
  Using two guards in a balanced way will let clients use Circuit Build
  Timeout to automatically shift away from temporarily congested guards,
  and has several other benefits due to other single-point-of-failure
  problems of trying to use only one guard (and failing to do so, btw).

  This is also just a consensus parameter change.

- Cut out garbage relays
  As David and Iain have said, we have relays that do not have enough
  capacity to cover the amount of overhead caused by sending their
  descriptor information to all clients. At the very least, they
  should be cut from the network. But relays that are measured
  significantly below the network average in terms of stream capacity
  by sbws should also be cut.

  This is also a simple consensus change.

- Latency-based circuit migration
  If a circuit appears to slow down based on regular RTT pings by the client,
  clients could rebuild another path to the same RP or exit, and resume
  the previous circuit using a Conflux-style UUID. This will help smooth
  out congestion, and is the first piece of a Conflux deployment.

  (Care needs to be taken so that clients don't migrate circuits too
  often, which would enable guard discovery, but moving each circuit up
  to K=3-5 times should be fine.)

- Unreliable conflux
  (https://www.cypherpunks.ca/~iang/pubs/conflux-pets.pdf)
  Unreliable conflux (i.e. the system described in that paper) can be
  built very easily using the above RTT pings and migration UUID,
  using the RTT load balancer from the paper.

  The upsides of Conflux from that paper are dynamic load balancing
  and throughput improvements. The downside is that Tor circuits become
  twice as unreliable, since if either branch fails, the whole circuit
  fails. (See below for reliable conflux).

- Non-bandwidth load balancing feedback
  Right now, we cannot achieve load balancing equilibrium to devote
  enough traffic to fast relays that can handle it, because many 
  relays will run into CPU, socket, and memory exhaustion first,
  before they are measured at capacity.

  This situation will become worse as we move closer to a world where
  relays can handle wire speed loads, as Tom Ritter suggested above.

  If we had a feedback mechanism for these other load signals, we
  could actually send as much traffic as these relays could take.
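
To make the quantile idea from the Circuit Build Timeout item concrete,
here is a minimal sketch. It assumes a plain empirical (nearest-rank)
quantile over recently observed build times, which is a simplification
for illustration and not a description of what tor does internally. The
function names and the sample data are hypothetical.

```python
def build_timeout(build_times_ms, percentile=80):
    """Return the observed build time at the given percentile.

    Uses the nearest-rank method: the smallest observation such that at
    least `percentile` percent of observations are <= it.
    """
    if not build_times_ms:
        raise ValueError("need at least one observed build time")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(build_times_ms)
    # Integer ceiling of percentile * n / 100, avoiding float rounding.
    rank = (percentile * len(ordered) + 99) // 100
    return ordered[rank - 1]


def fraction_kept(build_times_ms, timeout_ms):
    """Fraction of observed circuits that finished within the timeout."""
    kept = sum(1 for t in build_times_ms if t <= timeout_ms)
    return kept / len(build_times_ms)


# Hypothetical build times in milliseconds, including two slow outliers.
observed = [120, 150, 180, 200, 240, 300, 350, 500, 900, 2500]
timeout = build_timeout(observed, percentile=80)
print(timeout, fraction_kept(observed, timeout))
```

Running the second function against real client data is one way to do
the "verify that we're actually doing this" check: if the fraction of
circuits kept is far from the target, the timeout is not landing where
we think it is.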

The following improvements require research. These are again sorted
in terms of best-gains-for-effort-required:

- Hacky congestion control (either per-hop ECN, or client-measured).
  Any congestion control is better than no congestion control, but
  unfortunately we need to carefully consider side channel effects
  of even ECN, and measure the performance benefits of various
  side-channel minimizing implementations. This requires a bit of
  research and experimentation.

- Failure resistant conflux
  Failure resistant conflux is a conflux implementation that buffers
  enough cells to be able to re-transmit any cells that are in flight
  over a circuit branch that fails due to connectivity or node uptime.
  Failure resistant conflux is great for security and for Snowflake.

  However, we will need to do some minimal experiments on how to
  detect failures as soon as possible, and how much retransmit window
  buffer memory is required to recover from most failures.

- Paid exit pools w/ anon creds
  Exits are both a censorship and bandwidth bottleneck. If we allowed
  paid exits using anonymous credentials, not only could this help
  people avoid Exit bans, but it could also result in faster service.

- Walking Onions, with shorter consensus intervals
  Nick has a very interesting pre-proposal to remove the requirement
  for clients to know the entire network consensus. This would be
  a big win for scalability, but it will also enable faster consensus
  generation, which means more dynamic load balancing that reacts
  quicker to relay load.

  I believe research is needed on the best crypto primitive to use
  here, though, as well as how to handle things like exit policies.

- Faster decentralized relay measurement
  Systems like Peerflow could be deployed to measure relays much
  quicker than the current Torflow/sbws system, and also more securely.
  With Walking Onions, this measurement feedback can take effect
  much more rapidly than every 4 hours.

- Move to multithreaded networking and crypto
  Relays currently stop sending data to the kernel while doing crypto
  and other expensive things. If we were able to parallelize this,
  there might be less stalling at fast relays. Not to mention using all
  cores and reducing the number of TorServersMunich1-12 or similar
  going on out there on single machines, which has savings in terms
  of consensus overhead.

- Datagram Tor
  This is a deep rabbit hole of potential side channel leaks, as
  well as massive engineering effort. But it will provide massive
  scalability benefits by making relay memory requirements for
  queuing be O(1) instead of O(#clients). It will also drastically
  reduce Tor's latency and performance variance.

- Userland TCP
  With a Datagram Tor, we can eliminate TCP termination at exits,
  vastly reducing their memory costs for Failure Resistant Conflux
  as well as general TCP termination overhead.

  Doing this also requires a userland TCP stack on the client side,
  to avoid anonymity issues from TCP fingerprinting.
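
For the Failure resistant conflux item, the retransmit window can be
sketched roughly as follows. This is only an illustration of the
buffering idea under assumptions of my own choosing: cells carry a
sequence number, the far end sends cumulative acks, and there are two
branches. The class and method names are hypothetical and do not come
from the Conflux paper or from tor.

```python
class RetransmitWindow:
    """Keeps a copy of each sent cell until the far end acknowledges it."""

    def __init__(self, max_cells):
        self.max_cells = max_cells  # memory bound on buffered cells
        self.unacked = {}           # seq -> (branch, payload)
        self.next_seq = 0

    def can_send(self):
        return len(self.unacked) < self.max_cells

    def send(self, branch, payload):
        """Record a cell as in flight on `branch`; return its sequence."""
        if not self.can_send():
            raise BufferError("retransmit window full; sender must wait")
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = (branch, payload)
        return seq

    def ack(self, seq):
        """Cumulative ack: forget every buffered cell with sequence <= seq."""
        for s in [s for s in self.unacked if s <= seq]:
            del self.unacked[s]

    def fail_branch(self, failed, fallback):
        """Move in-flight cells off a dead branch and return them, in
        sequence order, so the caller can resend them on `fallback`."""
        resend = []
        for seq, (branch, payload) in sorted(self.unacked.items()):
            if branch == failed:
                self.unacked[seq] = (fallback, payload)
                resend.append((seq, payload))
        return resend


w = RetransmitWindow(max_cells=4)
w.send("A", b"cell0")
w.send("B", b"cell1")
w.send("A", b"cell2")
w.ack(0)                           # cell0 arrived; drop it from the buffer
print(w.fail_branch("A", "B"))     # only cell2 was still in flight on A
```

The open research questions above map onto two knobs here: how quickly
something calls fail_branch(), and how large max_cells has to be so that
the sender rarely stalls while still recovering from most failures.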

>  - (Maybe not something you said explicitly, but something I thought
> you hinted at) A confusion over where the overhead is in the network.
> Why do we have X amount of advertised bandwidth but we only use X/2 of
> the bandwidth? Why aren't we reaching better utilization?

It's generally not optimal for performance for a network to be anywhere
near saturated. This is because as you get closer to saturation, the
probability of one additional connection causing saturation and backoff
is much higher. We're debating how to communicate this to relay
operators in our metrics and measurement reporting.
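
One way to see this is the textbook M/M/1 queue, where the mean time in
the system is 1 / (mu - lambda). This is an assumed toy model, not a
measurement of Tor relays, and the function name is hypothetical, but
it shows how quickly delay grows as utilization approaches 1.

```python
def mean_delay(service_rate, utilization):
    """Mean time in system for an M/M/1 queue: 1 / (mu - lambda)."""
    if service_rate <= 0:
        raise ValueError("service_rate must be positive")
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    arrival_rate = utilization * service_rate
    return 1.0 / (service_rate - arrival_rate)


# With service_rate = 1.0 the result is a multiple of the unloaded
# service time.
for u in (0.5, 0.8, 0.9, 0.99):
    print(f"utilization {u:.2f}: delay factor {mean_delay(1.0, u):.1f}")
```

In this model the delay factor is 1 / (1 - utilization), so going from
50% to 90% utilization multiplies the mean delay by five.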

> > In order to try to go forward instead of complaining, here are my ideas for
> > scaling the traffic relay performance of Tor:
> >
> > - As I said above, the very first thing for me is to change our mindset and
> >   now see "tor" as the "crème de la crème" in terms of userspace traffic
> >   routing application. With that in mind, the rest follows but we should
> >   really not settle for something lower. Not many projects have a network the
> >   size we have so not many of "us" on the Internet.

I am deeply concerned about Tor's development agility due to our strict
waterfall model. Key example: When I look at Tor 0.3.5 being supported
until 2022, I can't imagine us getting to a network with a different
end-to-end model before then, and probably not even until the *next* LTS
release after that stops being used by relay operators, which will
probably be 2026.

If we want to radically overhaul Tor for performance and scalability, we
have to consciously decide how to manage change more efficiently. In my
eyes, this means either an experimental network and/or changing our
package distribution methods so that relay operators update much more
frequently (eliminating the need for an LTS).

The alternative is to accept increased anonymity risk due to increased
heterogeneity/lack of uniformity on the main net, which I think most of
us would disagree with out of the gate.

Given the magnitude of the changes we know we need to make to scale and
perform well, something has to change about our methodology here, or we
will ossify ourselves into irrelevance.

> > - We *need* to have developers taking the necessary time to understand,
> >   cleanup, improve, and tests our current ways of relaying traffic within
> >   little-t tor.
> >
> >   In my very strong opinion, we will simply *NOT* scale with our current code
> >   and architecture situation. I can outline this whole thing in technical
> >   details if need be on the "why".

I have long wondered if an experimental Tor implementation is what we
need here, perhaps 100% in another language like Rust or Go, similar to
what Mozilla did with Servo. This will speed both research and
development experimentation.

Then we can take pieces of that thing and move it into core-tor or just
switch to it entirely, depending on results.

If it were up to me, I would not require full compatibility with the
current Tor network for this implementation, due to my concerns about
protocol and release rigidity above. But OTOH, if it is too
incompatible/different, then it may not be possible to take pieces of it
for core-tor.

-- 
Mike Perry