[tor-dev] Issues With Ticket #7532 - "Count unique IPs in an anonymous way"

Tue Mar 28 23:01:55 UTC 2017

Ah, I see - PCSA can actually keep track of unique IP's without actually
revealing them. Your first link cleared it up a lot for me. PCSA is a
really cool technique!

I'd love to work on this as a GSoC project. I'll write up a proposal and
send it out soon.

On Tue, Mar 28, 2017 at 2:55 PM, Jaskaran Singh <jvsg1303 at gmail.com> wrote:

> Hi Samir,
>
> Brute force does affect Bloom filter/hashed-values as you rightly
> mentioned, but not Probabilistic Counting by Stochastic Averaging (PCSA).
>
> PCSA works on the principle that in an input the probability of n
> consecutive bits having value '0' from the left side(could be right as
> well, but for now assume it left) is 2^(-(n+1)). Bit 'i' of the
> Bitmap(which is our main data structure) is set if a the number of
> consecutive zeros (from left) is 'i'.
>
> We keep repeating it for every input(IP address). We then end up with a
> Bitmap whose  most significant '1' can be computed to give us an
> approximate number of inputs that must have been gone into the algorithm.
>
> In simple words, if I tell you that I have seen the value 1010000 out of
> a total of 'x' values I examined. You could guess that I had examined a
> total of 2^5 values before I saw that particular value.
>
> We would tweak the algorithm to store only the significant most '1' in
> bitmap instead of storing '1' at every iteration. This would mean that
> all that adversary could get hold of is a bitmap whose just one of the
> bit is '1'.
>
> Example, the adversary might get a data structure that looks like:
>                 000001000000
> and would have no way tell what IP addresses were used as an input.
>
> This was just the basic idea behind PCSA. The actual PCSA makes use of
> complicated looking formula to get the approximate number of unique IP
> addresses in order to keep error rate low.
>
> I hope this makes sense.
>
> For some more information and simulation, please check
>
> [0] https://research.neustar.biz/2013/04/02/sketch-of-the-
> day-probabilistic-counting-with-stochastic-averaging-pcsa/
> [1] http://content.research.neustar.biz/blog/runs.html
>
> Regards,
> Jaskaran
>
> On Wednesday 29 March 2017 01:54 AM, samir menon wrote:
> > This ticket [1] was suggested as a GSoC project, but I think there might
> > be an issue with the security model/perceived threat.
> >
> > To summarize the ticket and its child [1], basically, we currently store
> > all the IP's seen by a node so that we can count unique IP's. The idea
> > is that this is dangerous; if a node is compromised, then all of those
> > IP addresses can be retrieved from memory. Therefore, a variety of
> > mitigation methods have been proposed (most prominently, the
> > 'Probabilistic Counting Algorithm' from [2])
> >
> > Here's my issue: what about brute force?
> >
> > No matter what method we use, we will arrive at a data structure that
> > should be able to, given an IP address, tell us whether it is new (and
> > we should increment the unique counter) or old (and we should leave the
> > unique counter the same), with some reasonably small false positive
> > rate. Basically, we're supposed to use some kind of Bloom filter like
> > structure.
> >
> > Then can't that structure then be brute-forced, offline, by an attacker?
> > IPv4 addresses are 32-bits (~4.3 billion of them), so an attacker could
> > just run whatever method we use to check membership over and over, and
> > then recover the set of IP's. The same happens if we hash the IP's
> > beforehand.
> >
> > So, is this attack acceptable? The only mitigation I've seen is the one
> > referenced by 'Aaron' in the ticket, which is the system that git uses,
> > cryptolog; there, they have a random salt that changes daily. Then, an
> > attacker can only learn the IP's for one day. This sounds like a
> > reasonable compromise to me, but then the implementation becomes rather
> > simple; just hash the IP's with a random salt that changes daily before
> > putting them in the set.
> >
> > IPv6 also solves this (128 bits), but there again, the solution is just
> > to hash the IP's before storing them - the Bloom filter/'Probabilistic
> > Counting Algorithm' is unnecessary.
> >
> > I think I must be missing something about how the 'Probabilistic
> > Counting Algorithm' works - somehow, it needs to keep track of the # of
> > unique IP's without knowing (with a high probability) whether any 1
> > individual IP has been seen.
> >
> > Any help/pointing out of errors in my reasoning would be useful.
> >
> > Thanks,
> > Samir Menon
> > menon.samir at gmail.com <mailto:menon.samir at gmail.com>
> > samir2 at stanford.edu <mailto:samir2 at stanford.edu>
> >
> > [1] https://trac.torproject.org/projects/tor/ticket/7532
> > [2] http://www.mathcs.emory.edu/~cheung/papers/StreamDB/Probab/
> 1985-Flajolet-Probabilistic-counting.pdf
> >
> >
> > _______________________________________________
> > tor-dev mailing list
> > tor-dev at lists.torproject.org
> > https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
> >
>
> --
> Jaskaran Veer Singh (jvsg)
> jvsg1303 at gmail dot com
> PGP 2814 3FB7 A32D 429B 092E 27F0 8AA3 C532 9E1A 6AD8
>
> _______________________________________________
> tor-dev mailing list
> tor-dev at lists.torproject.org
> https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.torproject.org/pipermail/tor-dev/attachments/20170328/40b06d25/attachment-0001.html>