[tor-dev] Load Balancing in 2.7 series - incompatible with OnionBalance ?
Alec Muffett
alecm at fb.com
Thu Oct 22 16:30:55 UTC 2015
info at tvdw.eu wrote:
> Hi Alec,
Hi Tom! I love your proposal, BTW. :-)
> Most of what you said sounds right, and I agree that caching needs TTLs (not just here, all caches need to have them, always).
Thank you!
> However, you mention that one DC going down could cause a bad experience for users. In most HA/DR setups I've seen there should be enough capacity if something fails, is that not the case for you? Can a single data center not serve all Tor traffic?
It's not the datacentre which worries me - we already know how to deal with those - it's the failure-driven contention for the limited introduction-point space afforded by a maximum (?) of six descriptors, each of which cites 10 introduction points.
A cap of 60 introduction points is a clear protocol bottleneck which - even with your excellent idea - could break a service deployment.
Yes, in the meantime the proper solution is to split the service three ways, or even four, but that's administrative burden which less well-resourced organisations might struggle with.
Many (most?) will have a primary site and a single failover site, and it seems perverse that they could bounce just ONE of those sites and automatically lose 50% of their Onion capacity for up to 24 hours UNLESS they also take down the OTHER site for long enough to invalidate the OnionBalance descriptors.
Such is not the description of a high-availability (HA) service, and it might put people off.
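To make the arithmetic concrete, a back-of-envelope sketch - the constants are the six-descriptor / ten-introduction-point figures above, and the two-site layout is the hypothetical one I just described:

    DESCRIPTORS_MAX = 6       # distinct descriptors OnionBalance can publish
    INTROS_PER_DESC = 10      # introduction points cited in each descriptor
    CAP = DESCRIPTORS_MAX * INTROS_PER_DESC   # 60: the protocol ceiling

    sites = 2                 # hypothetical: primary plus one failover site
    per_site = CAP // sites   # 30 introduction points per site

    # Bounce one site: its intro points die immediately, but the stale
    # descriptors citing them can persist for up to 24 hours.
    live = CAP - per_site
    print(f"{live} of {CAP} advertised intro points live")  # 30 of 60 -> 50% loss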
> If that is a problem, I would suggest adding more data centers to the pool. That way if one fails, you don't lose half of the capacity, but a third (if N=3) or even a tenth (if N=10).
...but you lose it for 1..24 hours, even if you simply reboot the Tor daemon.
> Anyway, such a thing is probably off-topic. To get back to the point about TTLs, I just want to note that retrying failed nodes until all fail is scary:
I find that worrying, also. I'm not sure what I think about it yet, though.
> what will happen if all ten nodes get a 'rolling restart' throughout the day? Wouldn't you eventually end up with all the traffic on a single node, as it's the only one that hadn't been restarted yet?
Precisely.
> As far as I can see the only thing that can avoid holes like that is a TTL, either hard coded to something like an hour, or just specified in the descriptor. Then, if you do a rolling restart, make sure you don't do it all within one TTL length, but at least two or three depending on capacity.
Concur.
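In practice that implies an operator-side pacer something like this - a sketch only; it assumes a TTL ends up being published in the descriptor, and the two-TTL spacing is your suggestion:

    import time

    DESCRIPTOR_TTL = 3600          # assumed: a TTL published in the descriptor
    SPACING = 2 * DESCRIPTOR_TTL   # wait out at least two TTLs between bounces

    def rolling_restart(backends, restart):
        # Restart each backend in turn, pausing long enough for clients
        # to refetch descriptors before the next node goes down.
        for backend in backends:
            restart(backend)
            time.sleep(SPACING)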
desnacked at riseup.net wrote:
> Please see rend_client_get_random_intro_impl(). Clients will pick a random intro point from the descriptor which seems to be the proper behavior here.
That looks great!
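(For anyone following along, the behaviour George describes amounts to roughly this - a Python paraphrase, not the actual C, and the attribute names are invented:)

    import random

    def get_random_intro(desc):
        # Paraphrase of rend_client_get_random_intro_impl(): choose
        # uniformly among intro points not already marked as failed.
        usable = [ip for ip in desc.intro_points if not ip.timed_out]
        return random.choice(usable) if usable else None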
> I can see how a TTL might be useful in high availability scenarios like the one you described. However, it does seem like something with potential security implications (like, set TTL to 1 second for all your descriptors, and now you have your clients keep on making directory circuits to fetch your descs).
Okay, so, how about:
IDEA: if a connection to ANY of a descriptor's introduction points fails AND the descriptor's TTL has been exceeded, THEN refetch the descriptor before retrying?
It strikes me (though I may be wrong?) that the degenerate case for this would be someone with an onion killing their own introduction point in order to force clients to refetch the descriptor - which is what I think would happen anyway?
At the very least this proposal would add a work factor.
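In sketch form, reusing the random-pick paraphrase above (fetch_desc, connect, and ttl_expired are invented helpers here, not real Tor APIs):

    def connect_to_service(onion, fetch_desc, connect, ttl_expired):
        desc = fetch_desc(onion)
        while True:
            intro = get_random_intro(desc)       # as sketched above
            if intro is not None:
                circuit = connect(intro)
                if circuit is not None:
                    return circuit               # success
                intro.timed_out = True           # mark this one as failed
            if intro is None or ttl_expired(desc):
                desc = fetch_desc(onion)         # TTL lapsed: refetch first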
> For this reason I'd be interested to see this specified in a formal Tor proposal (or even as a patch to prop224). It shouldn't be too big! :)
I would hesitate to add it to Prop 224, which strikes me as rather large and distant. I'd love to see this by Christmas :-P
teor2345 at gmail.com wrote:
> Do we connect to introduction points in the order they are listed in the descriptor? If so, that's not ideal, there are surely benefits to a random choice (such as load balancing).
Apparently we don't; per George's reply above, clients pick one at random :-)
> That said, we believe that rendezvous points are the bottleneck in the rendezvous protocol, not introduction points.
Currently, and in most deployments today, yes.
> However, if you were to use proposal #255 to split the introduction and rendezvous to separate tor instances, you would then be limited to:
> - 6*10*N tor introduction points, where there are 6 HSDirs, each receiving 10 different introduction points from different tor instances, and N failover instances of this infrastructure competing to post descriptors. (Where N = 1, 2, 3.)
> - a virtually unlimited number of tor servers doing the rendezvous and exchanging data (say 1 server per M clients, where M is perhaps 100 or so, but ideally dynamically determined based on load/response time).
> In this scenario, you could potentially overload the introduction points.
Exactly my concern, especially when combined with overlong lifetimes of mostly-zombie descriptors.
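For the record, that arithmetic in sketch form (N and M exactly as defined above):

    HSDIRS = 6
    INTROS_PER_INSTANCE = 10

    def max_intro_points(n):
        # 6 HSDirs x 10 intro points each, with N failover instances competing
        return HSDIRS * INTROS_PER_INSTANCE * n

    def rendezvous_instances(clients, m=100):
        # roughly one rendezvous tor per M clients (M ~ 100, ideally dynamic)
        return -(-clients // m)   # ceiling division

    print(max_intro_points(3))            # 180
    print(rendezvous_instances(25000))    # 250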
- alec