Anonymity-preserving collection of usage data of a hidden service authoritative directory

Thu Apr 19 02:52:01 UTC 2007

On Sun, Apr 15, 2007 at 12:43:50AM +0200, Karsten Loesing wrote:
> Bad news first: The disadvantage of defining aggregation of findings
> without knowing the real data is that you cannot find unexpected things
> using explorative data analysis. It's not that I did not think about my
> expected findings, but I cannot say that this list is complete. :(

Right. We should also keep in mind that hidden services suck right now
(performance-wise and reliability-wise), so the current usage data may
not reflect what people actually want to do.

There are also a lot of artifacts of the current design -- for example,
I expect you will find far more publishing events than there "should"
be, because we never bothered to code dir authorities to save published
descriptors to disk, and so services need to republish very frequently
in case an authority restarts and forgets them all.

But I agree, collecting info now could be a useful start, and in any
case a data point taken now will be handy down the road, when we take
another one and wonder how the two compare.

> --- begin of specification ---

This looks really good to me, actually, in terms of anonymity risks. I
suspect that it will turn out we want some other data. The problem that
comes to mind first is that the 20th percentile will go down if a bunch
of new hidden services spring up briefly, and it will also go down if
usage is actually declining, and I'm not sure how we'll tell the
difference. But most of the coding and design work will be in
aggregating the data, and we can tweak what we actually report once it's
all up and working.
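
To make the percentile worry concrete, here is a small Python sketch.
The request counts are made up, and the nearest-rank percentile is only
my assumption about how the aggregate would be computed; the
specification may define it differently.

```python
def percentile_nearest_rank(values, p):
    """Return the p-th percentile (integer p, 1..100), nearest-rank method."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    # Integer form of ceil(p * n / 100), avoiding floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical per-service request counts for one reporting period.
baseline = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Case A: existing usage is unchanged, but five short-lived services
# appear and receive one request each.
burst = baseline + [1] * 5

# Case B: no new services, but every service gets half as many requests.
decline = [count // 2 for count in baseline]

p20_baseline = percentile_nearest_rank(baseline, 20)
p20_burst = percentile_nearest_rank(burst, 20)
p20_decline = percentile_nearest_rank(decline, 20)

print(p20_baseline, p20_burst, p20_decline)
```

In both cases the 20th percentile falls below the baseline value, so
that number alone cannot separate "many tiny new services" from "less
usage overall". In this toy data the number of services differs between
the two cases (15 versus 10), which hints at the kind of extra figure
that might help.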

The only major change I would suggest is to not publish the list from
the dir port, but instead just write it to a file in the datadir. That
will be easier to code (and less invasive to the Tor code), and it also
will remove some of the feedback to an attacker that uses Nick's "guess
and check" style attacks. It would then be a bit more of a hassle to
operate, but if we don't plan to be running this feature all the time,
that's not so bad.
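
As a rough illustration of the write-to-a-file variant (in Python, not
Tor's C, and not a proposal for the actual code), the sketch below
writes a report into a data directory atomically, so a reader never
sees a half-written file. The file name "hs-usage-stats" and the line
contents are hypothetical placeholders.

```python
import os
import tempfile


def write_stats_file(datadir, lines, filename="hs-usage-stats"):
    """Write the report lines to datadir/filename via temp file + rename."""
    path = os.path.join(datadir, filename)
    # Create the temp file in the same directory so the rename stays on
    # one filesystem and is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=datadir, prefix=filename + ".")
    try:
        with os.fdopen(fd, "w") as tmp:
            for line in lines:
                tmp.write(line + "\n")
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if writing or renaming failed.
        os.unlink(tmp_path)
        raise
    return path


# Example with a throwaway directory standing in for the datadir.
example_dir = tempfile.mkdtemp()
report = ["placeholder-key-1 value", "placeholder-key-2 value"]
print(write_stats_file(example_dir, report))
```

Nothing here is served over the network, which is the point: the
operator has to go and fetch the file, and an outside attacker gets no
immediate feedback.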

The rephist.c file is where a lot of the statistics should be kept.
(That's already where we keep other statistics that Tor collects.)
Let me know if you have further questions or ideas. :)

Thanks!
--Roger