Safely collecting data to estimate the number of Tor users

Sun Aug 29 11:17:35 UTC 2010

Hi,

Steven J. Murdoch wrote:
> Another thing to do is to shrink the hash size. If we assume that
> there are going to be, say, no more than 1 million distinct IP
> addresses, we could use a 40 bit hash with only a small number of
> collisions (due to the Birthday Paradox). However, someone who tries
> to reverse the full 32 bit IPv4 space will get many collisions.

with a 40 bit hash, the vast majority of 2^32 IP addresses will not
collide with any other IP address, despite the birthday paradox. So this
would not really be a significant additional level of protection. It
would make a difference only for the small fraction of the IP addresses
that actually do collide.

However, you don't need that many bits anyway, as collisions are not a
big issue when your goal is to only count how many different addresses
you have seen. If you proceed as you suggested, you can, as detailed
below, do well with a 21-bit hash. With additional tweaks, you can
further reduce that to 12-bit hash values.

If n distinct IP addresses have occurred and you use m-bit hash values,
any given one of the 2^m possible hash values will have been recorded at
least once with probability

  1 - (1 - 1/(2^m))^n

You would thus expect that

  2^m * ( 1 - (1 - 1/(2^m))^n )

different hash values have been recorded. This can be approximated (by
a Poisson approximation) by

  2^m * ( 1 - e^(-n/(2^m)) )

If, after a measurement period, you have seen Z different hash values in
your logs, you can invert that formula:

      Z = 2^m * ( 1 - e^(-n/(2^m)) )
 <=>  e^(-n/(2^m)) = 1 - Z / (2^m)
 <=>  -n/(2^m) = ln(1 - Z / (2^m))
 <=>  n = - 2^m * ln(1 - Z / (2^m))

By using the above formula (known as "hit counting"), you can obtain a
very good estimate for the number n of different IP addresses with much
smaller hashes than 40 bits.

In general, this approach works well as long as less than half of the
possible hash values have occurred. To be on the safe side, I'd reduce
that further to, say, one third. If you assume that there are no more
than one million distinct IP addresses, a 21-bit hash should be
perfectly fine to get a very good idea of the number of Tor users. Use
22 or 23 bits to reduce the estimation error further if you want.

That, though, is still suboptimal. If you use a non-uniform hash, you
can do even better. If you replace the uniform hash function by a
specific non-uniform one (yes, along the lines of FM sketches again),
you could get estimates for the number of distinct IP addresses with a
standard error of 0.05 by using only 4096 different hash values for the
IP addresses, i.e., with a non-uniform 12-bit hash function. Such a hash
function is very easy to construct starting from any arbitrary uniform
"standard" hash. I'm happy to provide more details (and code) if you
want.

Best regards

Björn