[tor-dev] Two protocols to measure relay-sensitive hidden-service statistics

George Kadianakis desnacked at riseup.net
Fri Jan 9 12:36:38 UTC 2015


"A. Johnson" <aaron.m.johnson at nrl.navy.mil> writes:

> Hello tor-dev,
>

Hello,

and thanks for posting this to the list.

> While helping design ways to publish statistics about hidden services in a privacy-preserving
> manner, it has become clear to me that certain statistics cannot be safely reported using the
> current  method of having each relay collect and report measurements. I am going to describe a
> couple of simple protocols to handle this problem that I think should be implementable without much
> effort. I'd be happy to get feedback in particular about the security or ease-of-implementation of
> these protocols.
>
> Two HS statistics that we (i.e. people working on Sponsor R) are interested in collecting are:
>   1. The number of descriptor fetches received by a hidden-service directory (HSDir)
>   2. The number of client introduction requests received at introduction points (IPs)
> The privacy issue with #1 is that the set of HSDirs is (likely) unique to an HS, and so
> the number of descriptor fetches at its HSDirs could reveal the number of clients it had during a
> measurement period. Similarly, the privacy issue with #2 is that the set of IPs are (likely)
> unique to an HS, and so the number of client introductions at its IPs could reveal the number of
> client connections it received.
>

I was wondering, why do we care so much about these two statistics?

From what I see in this post, you only care about their total
numbers (without connecting them to specific HSDirs or IPs). Is it
because you are curious about the total number of HS users?

If that's the case, why not focus on "Number of rendezvous requests"
which is not tied to specific hidden services or clients. It seems to
me like an easier target (that would still need to be *thoroughly
analyzed* of course).

> An approach to solving this problem would be to anonymize the reported statistics. Doing so raises
> a couple of challenges, however:
>   1. Anonymous statistics should be authenticated as coming from some relay. Otherwise, statistics
>   could be polluted by any malicious actor.
>   2. Statistical inference should be made robust to outliers. Without the relay identities, it will
>   be difficult to detect and remove values that are incorrect, whether due to faulty measurement or
>   malicious action by a relay.
>
> I propose some simple cryptographic techniques to privately collect the above statistics while
> handling the above challenges. I assume that there exists a set of Statistics Authorities
> (StatAuths), of which at least one must be honest for the protocol to be secure, but all of which
> can be "curious" (that is, we want to maintain privacy from them as well and allow their action to
> be completely public). The Directory
> Authorities could serve as StatAuths. A single server could as well, if you trust it to be honest-
> but-curious. If you have a trusted non-curious server, things become much simpler, of course: just
> have relays report their values to the server and then have it publish a global aggregate only.
>
> The AnonStats1 protocol to privately publish both statistics if we trust relays not to pollute the
> statistics (i.e. #2 is not a problem) is as follows:
>   1. Each StatAuth provides 2k partially-blind signatures on authentication tokens to each relay in
>   a consensus during the measurement period. The blind part of a signed token is simply a random
>   number chosen by the relay. The non-blind part of a token consists of the start time of the
>   current measurement period. The 2k tokens will allow the relay to submit k values to the
>   StatAuths. Note that we could avoid using partially-blind signatures by changing keys at the
>   StatAuths every measurement period and then simply providing blind signatures on random numbers.
>   2. At the end of the measurement period, for each statistic, each relay uploads the following
>   each on its own Tor circuit and accompanied by a unique token from each StatAuth:
>     1. The count
>     2. The ``statistical weight'' of the relay (1/(# HSDirs) for statistic #1 and the probability of
>     selection as an IP for statistic #2)
>   3. The StatAuths verify that each uploaded value is accompanied by a unique token from each
>   StatAuth that is valid for the current measurement period. To infer the global statistic from
>   the anonymous per-relay statistic, the StatAuths add the counts, add the weights, and divide
>   the former by the latter.
>
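Just to check my understanding of the aggregation in step 3: the
StatAuths' final computation reduces to something like this (a Python
sketch of my reading; token verification is elided):

```python
def aggregate(reports):
    """AnonStats1, step 3: given anonymously uploaded (count, weight)
    pairs, estimate the global statistic as sum(counts) / sum(weights)."""
    total_count = sum(count for count, _ in reports)
    total_weight = sum(weight for _, weight in reports)
    return total_count / total_weight

# e.g. two reporting HSDirs, each with statistical weight 0.5:
# aggregate([(10, 0.5), (20, 0.5)]) extrapolates to 30 total fetches
```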

I'm going to mainly focus on AnonStats2, since, IIUC, AnonStats1
leaks each relay's probability of selection as an IP, which in turn
leaks the relay's identity:

> The AnonStats1 protocol is vulnerable to a relay that publishes manipulated statistics (i.e.
> challenge #2). If this is a concern, the AnonStats2 protocol mitigates it by using a median
> statistic instead of a sum:
>   1. Relays are divided into b bins by consensus weight. The bins have exponentially-increasing
>   length, and the rate of increase c is set such that the
>   ith bin by increasing weights has at least r relays each of weight at least some minimum
>   min_weight (say, r=500, min_weight=100). The first bin has weights in [0, min_weight), and the ith
>   bin has weights in [min_weight*c^(i-2), min_weight*c^(i-1)).

I'm curious to see how this binning strategy would work with the
current Tor network. The network and hence the anonymity set of its
relays is not that big. Also, the *big* hidden services are not that
secret lately.

Let's consider the descriptor fetches statistic (#1) and assume that
you are a StatAuth. Also let's say that there are 5 hidden services
that generate most of the descriptor fetches of the network (e.g. botnets).
It should be easy for you to find the responsible HSDirs of those
hidden services, and then it might be possible to match them up with
the blinded relays that report the highest counts of descriptor fetches.
Is this wanted? 
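For concreteness, here is how I read the binning rule of step 1 (a
sketch; c=2 and min_weight=100 are purely illustrative, since the real
c is chosen so that every bin holds at least r relays):

```python
def bin_index(weight, min_weight=100, c=2):
    """Bin 1 holds weights in [0, min_weight); bin i (i >= 2) holds
    weights in [min_weight * c^(i-2), min_weight * c^(i-1))."""
    if weight < min_weight:
        return 1
    i, left_edge = 2, min_weight
    while weight >= left_edge * c:
        left_edge *= c
        i += 1
    return i

def bin_left_edge(i, min_weight=100, c=2):
    """Left edge of bin i, used in step 4 as the estimate's weight."""
    return 0 if i == 1 else min_weight * c ** (i - 2)
```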

>   2. Each StatAuth provides k partially-blind signatures on authentication tokens to each relay in
>   a consensus during the measurement period. The blind part of a signed token is simply a random
>   number chosen by the relay. The non-blind part of a token consists of the start time of the
>   current measurement period and the bin containing the relay's consensus weight.
>   3. At the end of the measurement period, for each statistic, each relay divides each statistic
>   by the relay's ``statistical weight'' (1/(# HSDirs) for statistic #1 and the probability of
>   selection as an IP for statistic #2). The result is the relay's estimate of the global
>   value of the statistic. The relay then uploads this value via a new Tor circuit, accompanied by a
>   unique token from each StatAuth.

It's worth mentioning that different relays use different consensuses
and hence have different views of the network. So, relays will
calculate their <probability of selection as an IP> using
different (possibly older) consensuses than others do.

>   4. The StatAuths verify that each uploaded value is accompanied by a unique token from each
>   StatAuth that is valid for the current measurement period and that contains the same relay-weight
>   bin. To infer a final global statistic from the anonymous per-relay estimates, the StatAuths
> use the weighted median of the received estimates, where the weight of an estimate is taken to be
>   the smallest value of its accompanying bin (i.e. the bin's left edge).

I'm now wondering how useful this final statistic will be for us.

How useful is it to learn the median value of "client introduction
requests" across all the relays of the network? And in our case we
get an extrapolated, weighted version of that median. How can such a
number be used by us?
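(For reference, the weighted median of step 4 is at least cheap to
compute; a sketch, with each estimate weighted by its bin's left edge:)

```python
def weighted_median(estimates, weights):
    """Return the smallest estimate whose cumulative weight, in sorted
    order of estimates, reaches half of the total weight."""
    pairs = sorted(zip(estimates, weights))
    total = sum(weights)
    running = 0
    for value, weight in pairs:
        running += weight
        if running >= total / 2:
            return value
```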

statistically yours!

