Bridge stability

Wed Mar 10 16:29:21 UTC 2010

On Wed, Mar 10, 2010 at 03:20:24PM +0100, Karsten Loesing wrote:
> On 3/10/10 12:39 AM, Roger Dingledine wrote:
> > On Mon, Feb 15, 2010 at 09:05:54PM +0100, Karsten Loesing wrote:
> >>>> However, I cannot take
> >>>> changing IP addresses into account for this analysis, because I removed
> >>>> the IP addresses when sanitizing the bridge descriptors.
> >>>
> >>> What's the process by which we sanitize them? It seems that a fine
> >>> solution would be to hash the IP addresses keyed with a secret that
> >>> remains constant across the hashes. So you could tell if the IP addresses
> >>> are the same without being able to tell what they are. The main challenge
> >>> there is keeping the secret somewhere secret in between batches (and
> >>> maybe rotating the secret monthly, for some level of forward secrecy).
> >>
> >> Yes, we can do something like that. I assume that it'll keep my server
> >> busy for a day or two to parse all the descriptors once more. But I can
> >> do that.
> >>
> >> Instead of the secret input to the hash function, how about we
> >> concatenate bridge identity and IP address as input? Note that we
> >> don't put the bridge identity in the sanitized descriptor, but only
> >> its hash. That way we'd avoid using a secret that we'll lose or forget
> >> anyway and have something reproducible. To be precise, this is what I
> >> have in mind:
> >>
> >>   sanitized bridge identity = H(bridge identity)
> >>
> >>   sanitized IP address = H(bridge identity + IP address)[:4]
> > 
> > Interesting idea. This approach clearly does leak more information:
> > if you learn the bridge identity at any point, you can guess-and-check
> > past IP addresses for the bridge.
> > 
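
The guess-and-check concern above can be illustrated with a minimal
sketch. Everything concrete in it is an assumption rather than
something the thread specifies: SHA-256 stands in for the unspecified
H, "+" is read as byte concatenation, the function names and the
example identity bytes are hypothetical, and 192.0.2.0/24 is a
documentation-only address range.

```python
import hashlib
import ipaddress


def sanitize_ip_unkeyed(identity: bytes, ip: str) -> str:
    """H(bridge identity + IP address)[:4], written as a dotted quad.

    SHA-256 is an assumed stand-in for H; the thread does not name one.
    """
    packed = ipaddress.IPv4Address(ip).packed
    digest = hashlib.sha256(identity + packed).digest()
    # Only the first 4 bytes are kept so the result fits an IPv4 field.
    return str(ipaddress.IPv4Address(digest[:4]))


def guess_and_check(identity: bytes, sanitized_ip: str, network: str) -> list:
    """Return every candidate in `network` whose hash matches.

    This is what someone who learns the bridge identity could do,
    because no secret enters the hash.
    """
    return [
        str(candidate)
        for candidate in ipaddress.ip_network(network)
        if sanitize_ip_unkeyed(identity, str(candidate)) == sanitized_ip
    ]


# Hypothetical 20-byte identity and documentation-range address.
identity = bytes(range(20))
published = sanitize_ip_unkeyed(identity, "192.0.2.77")
matches = guess_and_check(identity, published, "192.0.2.0/24")
```

With the identity in hand, the search cost is only the size of the
candidate range, which is why this variant reveals more than a hash
keyed with a secret.
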
> > The next question is then: so what? Is that something we want to protect?
> 
> A fine question. I don't think this is something we want to protect. My
> understanding of bridges is that they shall make it hard for an
> adversary to block the entry points to the Tor network. That means we
> shouldn't reveal current bridge IP addresses, nor bridge identities
> which can be used to learn about current and future IP addresses.
> 
> But why should we care about past IP addresses of a bridge? What would
> the adversary---who learns about a bridge identity somehow---do with
> this piece of information? Tell that someone has been using Tor via this
> bridge in the past when connecting to that IP address? Is this something
> we want to protect? That would imply that it's considered a security
> feature that bridges change their IP addresses on a regular basis. What
> about bridges on static IP addresses: when an adversary learns about
> such a bridge, does that mean its past users are more screwed than the
> past users of a bridge on a dynamic IP address?
> 

Much of our motivation for using Tor is that you don't know in advance
what behavior you will need to protect, so you should be cautious.
Similarly, and purely speculatively: bridges are run by people who
mostly didn't want to run public nodes but still wanted to help. With
this scheme they would have a public, permanently confirmable record
connected to something in the outside world. They weren't signing up
expecting forward anonymity in any robust sense (at least I hope not).
But without the record, if they run a bridge (say, from a static IP
that they hold indefinitely) and later decide to stop, there is no
public record of their participation unless someone recorded that
bridge usage at the time. So it's a less permanent, hence less scary,
commitment. If at some point someone finds it useful to go through and
look for IP addresses that have run bridges, for whatever currently
unimagined nefarious purpose, then it's better if that information is
not available. I'm not saying this outweighs the reasons for using
hashed, salted, etc. addresses in some publicly listed directory info,
and I'm not even trying to weigh it against the uses mentioned below.
But you asked, so I tried to come up with an answer.

-Paul

> The question is: What are we trying to protect? I'm happy to protect
> past IP addresses of a bridge if there's a reason to do so. But knowing
> what is worth protecting and what is not would be helpful. After all,
> not publishing any bridge descriptors would give us best protection; but
> that's not what we want.
> 
> > There are two benefits to leaking this information. First, we can generate
> > incremental updates to the sanitized bridge descriptor database, and
> > they will be compatible sanitized-IP-address-wise with the existing
> > database. That makes updates more convenient on our side.
> 
> Yes, not including a monthly changing secret in the hash function makes
> the sanitized descriptors more useful for statistics.
> 
> > Second, it is
> > possible to ask questions about where bridges have been over the space
> > of months, not just inside a given month. It's not clear that we plan
> > to ask those questions right now, though.
> 
> Unclear. I don't think we'll be asking these questions.
> 
> > So the conclusion is either "A) yes, we should do it that way, the
> > information leak is not a big deal", or "B) let's do it the safer way for
> > now, to get the answers we are looking for now; and if later we decide we
> > want more detailed answers, we still have the original bridge descriptors,
> > and we can publish slightly less sanitized data at the point we decide
> > we should".
> > 
> > I'm not sure there's a clear answer, but my instinct is to go for B.
> 
> Okay. I went for B by taking the hash of the bridge's IP address plus a
> fixed secret string that I use for all bridges. I'm still hesitant to
> publish these descriptors, though. We might be giving away too much by
> including the bridge's country code (which can be a country with only
> very few IP addresses) plus H(IP address + secret)[:4]. Maybe we should
> do H(IP address + bridge identity + secret)[:4] or something.
> 
> In any case, I'm tempted not to update all the sanitized bridge
> descriptors, but only those for December 2009 and January 2010 which I'm
> using in the bridge-stability analysis. (I pondered using some 2008
> descriptors, but they aren't as meaningful for the current bridge
> stability situation.) How about I do the H(IP address + bridge identity
> + secret)[:4] thing and make these two tarballs available?
> 
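
As a rough sketch of the H(IP address + bridge identity + secret)[:4]
variant discussed above: SHA-256 is assumed in place of the
unspecified H, "+" is read as byte concatenation, and the function
name, identity bytes, secret string, and address are all hypothetical
placeholders, not the values or code actually used.

```python
import hashlib
import ipaddress


def sanitize_ip_keyed(ip: str, identity: bytes, secret: bytes) -> str:
    """H(IP address + bridge identity + secret)[:4] as a dotted quad.

    SHA-256 is an assumption; the thread leaves H unspecified.
    """
    packed = ipaddress.IPv4Address(ip).packed
    digest = hashlib.sha256(packed + identity + secret).digest()
    # Truncate to 4 bytes so the result can replace the IPv4 address.
    return str(ipaddress.IPv4Address(digest[:4]))


# Hypothetical inputs: a 20-byte identity and a fixed secret string.
identity = bytes(range(20))
secret = b"example fixed secret"
first = sanitize_ip_keyed("192.0.2.77", identity, secret)
again = sanitize_ip_keyed("192.0.2.77", identity, secret)
```

The same inputs always give the same pseudo-address, so sanitized
addresses stay comparable for as long as the secret stays fixed.
Rotating the secret would break that comparability from then on.
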
> >> Note that only the first 4 bytes of the result are used, because the
> >> result is written as the bridge's IP address, covering the entire range
> >> between 0.0.0.0 and 255.255.255.255. Of course, there's a reasonable
> >> chance for collisions for a bridge identity with two different IP
> >> addresses.
> > 
> > Right -- the birthday paradox brings us to "once we've looked at 65k
> > addresses, we should expect a collision".
> 
> Should be fine. Even if such a collision happens, it doesn't
> significantly affect the analysis result.
> 
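
The 65k figure is the square root of the 2^32 possible 4-byte outputs.
A quick way to check the order of magnitude is the standard birthday
approximation p = 1 - exp(-n(n-1)/(2N)), assuming hash outputs are
uniformly distributed. The function name below is hypothetical.

```python
import math


def collision_probability(n: int, space: int = 2**32) -> float:
    """Approximate chance that n uniform values in `space` collide.

    Uses the birthday approximation 1 - exp(-n(n-1) / (2 * space)).
    """
    return 1.0 - math.exp(-n * (n - 1) / (2 * space))


p_65k = collision_probability(65536)
p_80k = collision_probability(80000)
```

Under this approximation the chance of at least one collision among
65,536 hashed values is a bit under 40%, and it passes 50% at roughly
77,000 values, so "about 65k" is the right order of magnitude.
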
> >> But I want the network status to contain all relevant
> >> information rather than re-assembling network status entries and bridge
> >> descriptors (which could contain more information in their contact
> >> line). Are there better ways to add 20 bytes to the network status? We
> >> might still add the full hash to the descriptor's contact line.
> > 
> > So far we've been trying to make sure that the sanitized descriptors
> > we publish still happen to conform to dir-spec.txt. At some point this
> > technique is going to break down. We shouldn't be too afraid to abandon
> > that technique when it gets too burdensome, so long as we still give
> > people tools that can parse whatever format we publish.
> 
> True. So far it works okay. I'm trying to conform to dir-spec.txt as
> long as possible. The tools I'm giving to people should already be less
> complex, not more.
> 
> Thanks!
> --Karsten
