Safely collecting data to estimate the number of Tor users
Björn Scheuermann
scheuermann at cs.uni-duesseldorf.de
Fri Aug 27 07:07:45 UTC 2010
Hi Karsten,
> So, here's my plan for researching this more: I'd like to run an
> experiment with multiple fast directory mirrors run by the same operator
> on the same host (like Jake's trusted and Pandora*, Olaf's blutmagie*,
> Moritz's torserversNet*, etc.). I'm going to write a patch for Tor to
> accept some key string in its torrc and extend SafeLogging to accept the
> value 'encrypt'. Tor will then pass all client IP addresses through a
> keyed hash function using the provided key string and write the result
> to its logs. I'm also going to implement #1668 to make log granularity
> configurable. The operators configure the same key string for all their
> relays and run them with the new SafeLogging option and logging
> granularity of 15 minutes for, say, a week. Operators then delete the
> key string and only keep the logs. The operators do not give out these
> logs to me or anyone else. I'm going to write Python scripts to analyze
> the logs and publish them for the operators and others to review. The
> operators will run these scripts and publish the results.
>
> I hope to learn more about the overlap of unique IP address sets seen by
> fast directory mirrors and also about client uptime sessions. I'd like
> to try out different schemes to safely combine unique IP address sets to
> come up with a better user count.
>
> Before writing code, what questions or concerns are there about this
> experiment? Are there better ways to achieve what I'm trying to achieve?
I think an interesting alternative option could be to use a so-called FM
sketch for counting distinct IP addresses at the relays. FM sketches are
bitfield-based data structures that can be used to estimate the number
of distinct elements in a multiset, based on a specific kind of hash
function. Originally, they have been proposed by Flajolet and Martin (F
and M...) in the 1980s (Flajolet, Martin: Probabilistic Counting
Algorithms for Data Base Applications, Journal of Computer and System
Sciences, vol 31(2),
http://algo.inria.fr/flajolet/Publications/FlMa85.pdf ).
Basically, FM sketches work as follows: you start with an empty (i.e.,
all-zero) bit field. For each element (i.e., IP address) that you see,
you calculate a hash function which maps the element to one position in
that bit field. The hash function is a specifically "crafted" one which
is not uniformly distributed, that is, not every bit is "hit" with the
same probability. Whenever some bit in the bit field is hit by hashing
an IP address, you set the respective bit to one, regardless of whether
it was zero or whether it was one already. The bit fields that are used
are actually quite small (at most a few thousand bits), and the hash
value distribution is highly non-uniform, so that typically many IP
addresses map to the same bit. You can then extract an estimate for the
number of *distinct* IP addresses that you have "seen" and hashed into
the bit field by analyzing the pattern of bits that emerged in the bit
field in a specific way. I am happy to provide more detailed
explanations and more references if you want.
FM sketches have (at least) three very intriguing properties for the
application that you outline:
1) Since many IP addresses map to the same bit, you cannot reverse the
operation, i.e., at least to me it seems that it would be no problem at
all to exchange the generated bit fields between operators (or between
operators and you), even if the used hash function, keys, etc. are
known.
2) You can easily combine FM sketches from multiple sources if they use
the same hash function. To this end, you simply calculate the bit-wise
OR or the respective bit fields. What you will get is the very same bit
field that would have emerged if all IP addresses from all relays were
hashed to the same bit field. If the same IP address occurred at
different relays, the respective bit will be set in both bit fields, so
that the address will be counted only once (FM sketches are "duplicate
insensitive"), i.e., you can really obtain the overall number of
distinct IP addresses without exchanging and merging any lists, just
based on the bit fields. (I personally think this is *really*
cool. :-) )
3) You can trade off bit field size versus accuracy: larger bit fields
give more accurate estimates. In your context this means that you can
basically trade off anonymization of IP addresses (smaller bit fields =>
more IP addresses are mapped to the same bit) versus accuracy. The
tradeoff is very favorable, though: a few hundred to a few thousand bits
can already give you an accuracy to within a few percent estimation
error.
I have already used FM sketches and variants thereof in several
different projects and contexts, and my feeling is that they are a
perfect match for your problem. I'm happy to help if needed.
Cheers
Björn
--
Jun.-Prof. Dr. Björn Scheuermann
Mobile and Decentralized Networks Group
Heinrich Heine University
Universitätsstr. 1, D-40225 Düsseldorf, Germany
Building 25.12, Room 02.42
Tel: +49 211 81 11692
Fax: +49 211 81 11638
scheuermann at cs.uni-duesseldorf.de
http://www.cn.uni-duesseldorf.de
More information about the tor-dev
mailing list