[tor-dev] Feedback on obfuscating hidden-service statistics

Thu Nov 20 12:42:43 UTC 2014

"A. Johnson" <aaron.m.johnson at nrl.navy.mil> writes:

>>>> George and I have been working on a small proposal to add two 
>>>> hidden-service related statistics: number of hidden services and 
>>>> total hidden-service traffic.
>>> 
>>> Great, I’m starting to focus more on this project now. Well,
>>> actually I’m going on a trip for a week today, but *then* I’m
>>> focusing more on this project :-)
>> 
>> Sounds great!  We're meeting every Tuesday at 16:00 UTC in #tor-dev.
>> Feel free to drop by.
>
> Excellent. I won’t be there this coming Tuesday, but I’ll be there the next Tuesday.
>
>> Replicas mean that each descriptor is stored under two identifiers, so
>> that's two places.  Further, descriptor identifiers change once per
>> day, so during a 24-hour period, there are up to four descriptor
>> identifiers for a hidden service.
>
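To make the "up to four" count concrete, here is a rough Python sketch. It is loosely modeled on the v2 descriptor-ID computation (daily time period offset by the first byte of the permanent ID, two replicas). The function names, the empty descriptor cookie, and the all-zero 10-byte permanent ID are assumptions for illustration only, not tor's actual code.

```python
import hashlib
import struct

REPLICAS = 2            # assumption: two replicas, as described above
PERIOD = 24 * 60 * 60   # identifiers rotate once per day

def time_period(now, permanent_id):
    # Offset the rotation by the first byte of the permanent ID so that
    # services do not all switch identifiers at the same moment.
    return (now + permanent_id[0] * PERIOD // 256) // PERIOD

def descriptor_id(permanent_id, period, replica, cookie=b""):
    # Inner hash covers the time period, optional cookie, and replica number.
    secret_part = hashlib.sha1(
        struct.pack(">I", period) + cookie + bytes([replica])
    ).digest()
    return hashlib.sha1(permanent_id + secret_part).digest()

def ids_in_window(permanent_id, start, length=PERIOD):
    # A window of at most one day touches at most two time periods.
    periods = {
        time_period(start, permanent_id),
        time_period(start + length - 1, permanent_id),
    }
    return {
        descriptor_id(permanent_id, p, r)
        for p in periods
        for r in range(REPLICAS)
    }

service = bytes(10)                    # hypothetical permanent ID
midday = 16394 * PERIOD + 12 * 3600    # window starting mid-period
print(len(ids_in_window(service, midday)))
```

A window that happens to line up exactly with the service's own time period would see only two identifiers, which is why the count is "up to" four.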
> That makes sense. It would be nice if the statistics would allow you
> to identify how long (i.e. how many one-hour periods) each descriptor was
> observed being published. That would allow us to figure out if there
> are lots of short-lived services or fewer long-lived
> services. Publishing statistics every hour would pretty much take care
> of this. If you are really set on 24 hours, then perhaps you could add
> the total number of published descriptors in addition to the number of
> *unique* published descriptors.
>
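As a sketch of the difference between total and unique publishes, and of what hourly data would add, consider the following Python. The event format (hour index, descriptor ID) and the function name are hypothetical. The per-descriptor breakdown is shown only to illustrate the analysis, not as something a relay would report.

```python
from collections import defaultdict

def summarize_publishes(events):
    """events: iterable of (hour_index, descriptor_id) publish events."""
    hours_seen = defaultdict(set)
    total = 0
    for hour, desc_id in events:
        hours_seen[desc_id].add(hour)
        total += 1
    return {
        "total_publishes": total,
        "unique_descriptors": len(hours_seen),
        # Number of distinct hours in which each descriptor was published.
        "hours_observed": {d: len(h) for d, h in hours_seen.items()},
    }

events = [(0, "a"), (1, "a"), (2, "a"), (0, "b")]
print(summarize_publishes(events))
```

With only a 24-hour unique count, descriptors "a" and "b" look identical. The total (4 versus 2 unique) hints that some descriptors were republished, and hourly reporting would show that "a" was seen in three hours and "b" in one.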
> Also, my suggestion about using additive noise applies equally well to
> the descriptor statistics. And multiplicative noise is a *bad idea* if
> you don’t have some adjustment for small values (e.g. 10% noise of a 0
> value is 0, and 10% of 1 is only 0.1).
>
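The small-value problem with multiplicative noise can be seen in a few lines of Python. This is a minimal sketch: the 10% fraction comes from the example above, while the Laplace scale of 5.0 and the function names are arbitrary assumptions.

```python
import random

def multiplicative_noise(value, fraction=0.10, rng=random):
    # Scales the value by a random factor in [1 - fraction, 1 + fraction].
    return value * (1 + rng.uniform(-fraction, fraction))

def laplace_noise(scale, rng=random):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def additive_noise(value, scale=5.0, rng=random):
    # The noise magnitude does not depend on the true value.
    return value + laplace_noise(scale, rng)

rng = random.Random(2014)
for true_count in (0, 1, 1000):
    m = multiplicative_noise(true_count, rng=rng)
    a = additive_noise(true_count, rng=rng)
    print(true_count, round(m, 2), round(a, 2))
```

Multiplicative noise leaves a true value of 0 at exactly 0 and keeps 1 within 0.1 of 1, so small counts are effectively published in the clear. Additive noise perturbs them by the same amount as large counts.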
>> We have been thinking about many more hidden-service related
>> statistics in a separate document.  We're currently discussing whether
>> we should turn it into a tech report, because we'll probably not want
>> to implement most of those statistics.  If you have remarks or more
>> ideas, please feel free to edit the document.  We're going to have a
>> public review round for this, too, but that might not happen in the
>> next week or two.
>> 
>> https://etherpad.wikimedia.org/p/hs_stats_78281091
>
> Great! I think we should go for at least a little more data in the
> current proposal (what is the timeline for this, btw?). I think we
> should come up with a list of statistics we might imagine gathering
> and identify the subset of those that we’re comfortable gathering at
> this point. For example, I think failure statistics are much more
> innocuous than other data, and they would be very useful. For
> example, they would help us understand how the protocol is failing
> and how to improve it, and they might help us identify misuse of
> hidden services (e.g. by botnet clients stupidly looking for
> non-existent descriptors or by malicious crawlers attempting to
> brute-force descriptors). So
> here are some ideas:
>   1. Number of fetch requests for descriptors that don’t exist (number of fetch requests that do succeed would of course be very useful as well)
>   2. Number of descriptor publishes to the wrong HSDir (actually I suspect that the HSDir doesn’t check this and wants to be accepting of any publish)
>   3. Number of rendezvous circuits that never connect (from the RP perspective)
>   4. Number of rendezvous circuits on which no data cells are ever sent
>
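The four ideas above amount to a handful of counters. A minimal Python sketch follows. The class, field, and method names are hypothetical, and it deliberately ignores where in tor these events would be observed (HSDir versus rendezvous point).

```python
from dataclasses import dataclass

@dataclass
class FailureStats:
    fetch_missing: int = 0         # idea 1: fetches for absent descriptors
    fetch_ok: int = 0              # idea 1: successful fetches
    publish_wrong_hsdir: int = 0   # idea 2: publishes to a non-responsible HSDir
    rend_never_connected: int = 0  # idea 3: rendezvous circuits never joined
    rend_no_data: int = 0          # idea 4: joined, but no data cells sent

    def record_fetch(self, found):
        if found:
            self.fetch_ok += 1
        else:
            self.fetch_missing += 1

    def record_publish(self, responsible):
        if not responsible:
            self.publish_wrong_hsdir += 1

    def record_rend_circuit_closed(self, connected, data_cells):
        if not connected:
            self.rend_never_connected += 1
        elif data_cells == 0:
            self.rend_no_data += 1

stats = FailureStats()
stats.record_fetch(found=False)
stats.record_rend_circuit_closed(connected=True, data_cells=0)
print(stats)
```

Any such counters would still need noise added before being reported, as discussed earlier in this thread.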

(CC'ed [tor-dev])

Thanks for the input, Aaron!

The timeline here is that we are hoping to have the proposal _and_ the
implementation ready by mid-December. Then we are hoping that we
can deploy the code to a few relays so that we have some data by January.

So, time is tight.

I'm currently OK with the two statistics in:
https://people.torproject.org/~karsten/volatile/238-hs-relay-stats.txt

I feel that any other statistics will need to be carefully analyzed.
We should add the ideas you mentioned to the etherpad, and get them
included in the tech report (which we are also hoping to have ready in
some form by mid-January).

The tech report is supposed to contain and analyze most of the HS
statistics we can think of. It will likely contain many stats that we
will never do, but also some stats that might be a good idea. The good
ones we should eventually integrate into the Tor proposal and write code
for.

>> Thanks for the very valuable input!  Let me know if the following
>> draft looks okay, and I'll start another thread on tor-dev.
>> 
>> https://people.torproject.org/~karsten/volatile/238-hs-relay-stats-2014-11-20.txt
>
> "Lab(\epsilon/C)" -> "Lap(\epsilon/C)" (that was my mistake). I think
> having the added noise both parameterized and included in the reported
> statistics is an idea worth thinking about. Making it a parameter
> allows you to easily change it without upgrading. Including it in the
> statistics would allow us to correct better for noise if different
> relays might be adding different amounts of noise due to inconsistent
> opinions of the noise parameter (if this should never happen, then I
> guess this wouldn’t be necessary).
>
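A rough Python sketch of why including the noise parameter in each report helps: if every relay states the Laplace scale it used, an aggregator can compute an error bar for the sum even when relays disagree on the parameter. The report format and function names are hypothetical, and the sketch assumes each relay's noise is independent zero-mean Laplace with variance 2 * scale^2.

```python
import math
import random

def laplace(scale, rng):
    # Difference of two i.i.d. exponentials with mean `scale`.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def make_report(true_count, noise_scale, rng=random):
    # The relay publishes the noisy value together with the scale it used.
    return {
        "value": true_count + laplace(noise_scale, rng),
        "noise_scale": noise_scale,
    }

def combine_reports(reports):
    # Sum the noisy values; independent noise variances add up.
    total = sum(r["value"] for r in reports)
    std = math.sqrt(sum(2 * r["noise_scale"] ** 2 for r in reports))
    return total, std

rng = random.Random(238)
reports = [make_report(100, 5.0, rng), make_report(100, 10.0, rng)]
total, std = combine_reports(reports)
print(round(total, 1), round(std, 1))
```

Without the reported scale, the aggregator would have to assume one value for all relays and would misjudge the uncertainty whenever a relay used a different one.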
> So again, sorry that I’m not going to be very responsive on this for the next week. I’m really happy that you’re working on it!
>
> Best,
> Aaron