[tor-dev] Should we disable the collection of some stats published in extra-infos?

Karsten Loesing karsten at torproject.org
Tue Jan 19 08:45:57 UTC 2016


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 15/01/16 23:00, Rob Jansen wrote:
> Hello,

Hi Rob,

I'm moving this discussion from metrics-team@ to tor-dev@, because I
think it's relevant for little-t-tor devs who are not subscribed to
metrics-team@.  Hope you don't mind.

> I was recently reviewing the statistics that Tor allows relays to
> collect and report to the dir servers [1], which then get published
> in extra-info documents [2]. Most of this can be enabled by simply
> setting a torrc option. There are quite a few statistics that I
> feel should not be collected. I'm wondering if the original purpose
> for collecting many of these statistics still exists, and if we
> still feel that the privacy compromises that were made when the
> collection was implemented are still valid in most cases.
> 
> Here are the stats I am most worried about, and why:
> 
> [unique ips per country code] *-ips (there are many of these, e.g.
> "entry-ips") Usually this involves storing individual user IP
> addresses in memory (in order to track uniqueness) over some period
> of time (usually 24 hours), sometimes for longer than the user
> would have otherwise been known to Tor (if a user's session is 1
> hour, Tor could remember the IP for at most 23 additional hours).
> This is reported, e.g., per entry; there are many cases in the data
> where it is very likely that only one user is connecting to a guard
> from a given country (because it is rounded up to 8). Users in
> small countries have the greatest risk (intersection attacks become
> really easy).

I agree that we might just lose these statistics.  We used them in the
past as a first approximation to counting users, but obviously that only
works as long as clients only connect to a single relay.  The only
place where we're still using them is in a workaround for estimating
bridge users.  See #15469 for more details and #8786 for something
we'd have to implement before taking these statistics out.
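
To make the concern about the rounding concrete, here is a minimal
sketch of the kind of counting Rob describes: remember each client
address per country for the interval, then report the count rounded up
to a multiple of 8.  This is not little-t-tor's code.  The class name,
the method names, and passing the country code in directly (instead of
a GeoIP lookup) are assumptions made for illustration.

```python
from collections import defaultdict

BIN_SIZE = 8  # reported counts are rounded up to a multiple of this


def round_up(count, bin_size=BIN_SIZE):
    """Round a non-negative count up to the next multiple of bin_size."""
    return ((count + bin_size - 1) // bin_size) * bin_size


class UniqueIpStats:
    """Hypothetical per-country unique-IP counter for one interval."""

    def __init__(self):
        # Raw addresses stay in memory until the interval ends.
        self.seen = defaultdict(set)

    def note_client(self, ip, country_code):
        # A set keeps each address once, so repeat visits do not inflate the count.
        self.seen[country_code].add(ip)

    def report(self):
        # Only the rounded counts are published, never the addresses.
        return {cc: round_up(len(ips)) for cc, ips in self.seen.items()}


stats = UniqueIpStats()
for i in range(9):
    stats.note_client("192.0.2.%d" % i, "us")
stats.note_client("192.0.2.1", "us")     # repeat visit, not counted twice
stats.note_client("198.51.100.7", "is")  # a single client from a small country
print(stats.report())  # {'us': 16, 'is': 8}
```

A reported 8 can mean anything from 1 to 8 clients, so the rounding
hides the exact number but not the fact that somebody from that
country connected, and the addresses themselves sit in memory for the
whole interval.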

> [exit statistics by port number] exit-kibibytes-written 
> exit-kibibytes-read exit-streams-opened Tor is classifying its
> traffic into ports, which could uniquely identify the application
> being used by the client. They also track bandwidth usage per port
> (and per exit); again, this is bad for those using random or
> unique-looking ports (that a given exit does not see very often)
> because it could be used to create a fingerprint. Intersection
> attacks become easier with this information.

Agreed, I can see us dropping these statistics, too.  We're currently
not using them.  But also see my suggestion below.

> The less problematic stats:
> 
> [circuit-based cell statistics] cell-processed-cells 
> cell-queued-cells cell-time-in-queue cell-circuits-per-decile This
> provides queue timings and number of cells being processed at a
> relay. The number of cells can be used to compute bandwidth of
> circuits. It may be possible to launch some attacks that create
> several circuits with the intent of moving which decile buckets
> some legitimate circuits get placed into, but this is less
> worrisome of an attack than the others.

I'm less worried about this one.  But, suggestion below.

> Should Tor still be collecting these things? Should Tor disable the
> collection of these statistics until we have a more
> privacy-preserving way to collect and aggregate them?
> 
> The good news is that privacy-preserving techniques exist that can
> reduce information leakage. I'm developing a tool based on the
> secret-sharing variant of PrivEx [3] to collect some of these types
> of statistics while providing privacy guarantees. We are currently
> using it to collect only those stats that are useful for producing
> Tor traffic models. A great advantage of this tool is that the
> various counters that we store during the collection phase get
> noise added and are randomized during initialization; only the
> aggregates are ever known and revealed by the aggregation server,
> limiting the information that is lost if a relay is compromised.
> This is a large improvement over the current collection method,
> which only adds noise before publication and reveals statistics on
> a per-relay basis.
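
For readers who have not looked at secret-sharing counters before,
the idea Rob describes can be sketched roughly as follows.  This is a
simplified additive secret-sharing toy, not the PrivEx protocol and
not Rob's tool: the modulus, the class and function names, and the
fixed integer noise values are all assumptions, and a real deployment
would draw the noise from a proper distribution and send the shares
over encrypted channels.

```python
import secrets

Q = 2**61 - 1  # public modulus for all shares (illustrative choice)


class RelayCounter:
    """Hypothetical relay-side counter that is blinded from the start."""

    def __init__(self, num_keepers, noise):
        # One random share per share keeper; the relay would send these
        # out and then forget them.
        self.outgoing_shares = [secrets.randbelow(Q) for _ in range(num_keepers)]
        # The stored value is noise minus all shares, so on its own it
        # looks like a random number modulo Q.
        self.value = (noise - sum(self.outgoing_shares)) % Q

    def increment(self, amount=1):
        self.value = (self.value + amount) % Q


def aggregate(relay_values, keeper_sums):
    """Combine blinded relay counters with the share keepers' sums."""
    total = (sum(relay_values) + sum(keeper_sums)) % Q
    # Map back to a signed value so negative noise decodes correctly.
    return total - Q if total > Q // 2 else total


relays = [RelayCounter(num_keepers=2, noise=3),
          RelayCounter(num_keepers=2, noise=-1)]
# Each keeper only ever learns the sum of the shares it received.
keeper_sums = [sum(r.outgoing_shares[k] for r in relays) % Q for k in range(2)]
relays[0].increment(5)
relays[1].increment(7)
result = aggregate([r.value for r in relays], keeper_sums)
print(result)  # 14 = (5 + 7) observed events + (3 - 1) noise
```

The shares cancel only when all relay values and all keeper sums are
added together, so a single compromised relay or keeper holds a value
that reveals nothing about that relay's own count.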

Suggestion: How about we evaluate these statistics published by relays
in the past years to see if there are other benefits or risks we
didn't think of, and then we decide whether to leave them in, modify
them, or take them out?

The reason is that I'd want to avoid removing this code only to
realize shortly after that we overlooked a good reason for keeping it.
These statistics have been collected for years now, and it might take
another year or so for relays to upgrade and stop collecting them.  So
what's another month?

Thanks for (re-)starting this discussion!

> All the best, Rob

All the best,
Karsten


> [1] https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt
> [2] https://collector.torproject.org/recent/relay-descriptors/extra-infos/
> [3] www.cypherpunks.ca/~iang/pubs/privex-ccs14.pdf

-----BEGIN PGP SIGNATURE-----
Comment: GPGTools - http://gpgtools.org

iQEcBAEBAgAGBQJWnffFAAoJEJD5dJfVqbCrSREH/1G3nZDwC7a8FGFEySgQX3MZ
n1dlMcylDU6kypgGMEYs2KmMRwhNYmCVFn6fRJgrFN/KgJ0oZJG1pRcglnIFwNwH
5VUWDZp3+frhY5jBbJ3JXqA4Exhi/0mUYCTRwfcFQ/JOyzjlcFbLRW/rkqFHn62H
wLidaVeagiJ+TI/T8zgwtqQjTDSZrZPpmlxvlO57D1bGhW1ZPJVbOeUzuudhNpRS
WVu//MH51juCzcML32MeiMV6wWYYOm1irKZ8lZHVCPbsL98qoh3ewtO6fb62THIa
sWcQZlwMQdcC0+509PmMeYwkRn+40MUwkk84/glyP57dfpgtOwzZCohL3yQRQ1s=
=F6F+
-----END PGP SIGNATURE-----

