[metrics-team] New way of counting Direct Users per country
David Goulet
dgoulet at torproject.org
Thu May 5 16:16:00 UTC 2016
Hi everyone!
(TL;DR: I describe an algorithm for counting direct users using dirreq requests
and ignoring the dirreq bytes history.)
Last week, armadev and I wanted to reproduce the metrics results for the daily
number of users per country[1]. We looked at the tech report[2] and the metrics
code[3].
It seems that we count direct users (not bridge users) with the same technique
we use for bridges, that is, counting dirreqs and extrapolating using the bytes
history. For bridges, that makes sense (see why in the tech report), but for
direct users the statistics come from relays in the consensus, so _maybe_ there
is a better approach to estimating the number of users per country.
I'll describe what armadev and I came up with. Maybe it's crazy, maybe some
pieces are missing, maybe it's not at all better than what metrics does. This
is why I'm writing this email: to see if all this makes sense.
1) For each relay, we compute the BW fraction over its dirreq stats period
(the interval ending at dirreq-stats-end). We average the relay's bandwidth
over that period (the average of all bw values of the relay in that
interval), then divide that value by the total bandwidth in the network at
that time:
R1_bw / (R1_bw + R2_bw + ... + Rn_bw)
...where n is the total number of relays in the network.
We'll split that value in two when the period spans two days (which actually
happens all the time). Here is some great ASCII art showing the dirreq stats
period P between the 4th and 5th of some month:
   4                      5                      6
   +----------|-----------+----------|-----------+
P:        ^_______________.______^
For the period P, we have 16 hours on the 4th and 8 hours on the 5th, so we can
split a relay's BW fraction into two fractions, one for each day.
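To make step 1 concrete, here is a minimal Python sketch. It assumes that
"splitting" the fraction means weighting it by the share of the stats period
that falls on each UTC day (16/24 and 8/24 in the diagram). The function names
and the sample bandwidth values are hypothetical and are not taken from the
metrics code.

```python
from datetime import datetime, time, timedelta, timezone


def bw_fraction(relay_bw_samples, network_total_bw):
    """Average a relay's bandwidth samples, then divide by the network total."""
    average = sum(relay_bw_samples) / len(relay_bw_samples)
    return average / network_total_bw


def day_shares(start, end):
    """Map each UTC date touched by [start, end) to its share of the period."""
    period = (end - start).total_seconds()
    shares = {}
    cursor = start
    while cursor < end:
        next_midnight = datetime.combine(
            cursor.date() + timedelta(days=1), time.min, tzinfo=timezone.utc
        )
        chunk_end = min(next_midnight, end)
        shares[cursor.date()] = (chunk_end - cursor).total_seconds() / period
        cursor = chunk_end
    return shares


def split_fraction(fraction, shares):
    """Weight the BW fraction by each day's share (assumed meaning of 'split')."""
    return {day: fraction * share for day, share in shares.items()}


# Hypothetical stats period: 08:00 on the 4th to 08:00 on the 5th (UTC).
start = datetime(2016, 3, 4, 8, tzinfo=timezone.utc)
end = datetime(2016, 3, 5, 8, tzinfo=timezone.utc)
shares = day_shares(start, end)
# Hypothetical bandwidth samples and network total, in the same unit.
per_day_fraction = split_fraction(bw_fraction([900, 1100], 1_000_000), shares)
```

With these made-up inputs the relay's fraction is 0.001, of which two thirds
are attributed to the 4th and one third to the 5th.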
2) For each relay, we count all requests per country using "dirreq-v3-reqs"
from the extra-info document. Since the period overlaps two days, we need to
split the count in two as well, as in step 1). For instance, if we have 32
clients for "ao", then on the 4th we have (32 - 4) * (16/24) and on the 5th
we end up with (32 - 4) * (8/24).
(See the technical report on why 4 is subtracted here[4])
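Step 2 could be sketched in Python as follows. The parser assumes the
"dirreq-v3-reqs" line is a keyword followed by comma-separated CC=N pairs; the
function names, the sample line and the day labels are hypothetical. The
correction of 4 is simply the value used in the example above (see [4]).

```python
def parse_dirreq_v3_reqs(line):
    """Parse 'dirreq-v3-reqs ao=32,de=1024' into {'ao': 32, 'de': 1024}."""
    keyword, _, rest = line.strip().partition(" ")
    if keyword != "dirreq-v3-reqs":
        raise ValueError("not a dirreq-v3-reqs line")
    counts = {}
    for item in rest.split(","):
        if not item:
            continue
        country, _, number = item.partition("=")
        counts[country] = int(number)
    return counts


def split_requests(counts, shares, correction=4):
    """Apply (count - correction) * share for every day and country."""
    return {
        day: {cc: (n - correction) * share for cc, n in counts.items()}
        for day, share in shares.items()
    }


# Hypothetical line and the 16h/8h split from the diagram in step 1.
counts = parse_dirreq_v3_reqs("dirreq-v3-reqs ao=32,de=1024")
per_day = split_requests(counts, {"4th": 16 / 24, "5th": 8 / 24})
```

For "ao" this reproduces the example: (32 - 4) * (16/24) on the 4th and
(32 - 4) * (8/24) on the 5th.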
3) At this step, for each relay reporting dirreq-v3-reqs stats, we have a BW
fraction per day, which is basically the chance of being picked by a client.
We also have a per-country count of clients seen per day.
For a relay, take the per-country per-day client number, divide it by the bw
fraction and then divide it by 10 (again, see the tech report on why, but
basically we estimate that a client, over 24h, will do between 8 and 12
directory requests). Summing up that value over all relays gives us the
final client number for that country.
For the 4th:
R1: (cc-users[4th] / bw_fraction[4th]) / 10
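A minimal Python sketch of step 3, following the wording above literally: one
value per relay, then a sum over relays. The function names and the sample
numbers are hypothetical; the divisor 10 is the requests-per-client estimate
mentioned above. Skipping relays with a zero fraction is an assumption made
here to avoid dividing by zero.

```python
REQUESTS_PER_CLIENT_PER_DAY = 10  # midpoint of the 8..12 range mentioned above


def relay_estimate(cc_users, bw_fraction):
    """(cc-users[day] / bw_fraction[day]) / 10 for one relay and one country."""
    return (cc_users / bw_fraction) / REQUESTS_PER_CLIENT_PER_DAY


def country_estimate(relays):
    """Sum the per-relay values; relays is a list of (cc_users, bw_fraction)."""
    return sum(
        relay_estimate(users, fraction)
        for users, fraction in relays
        if fraction > 0  # assumption: ignore relays without a usable fraction
    )


# Hypothetical relay: 18 "ao" requests on the 4th, BW fraction 0.001 that day.
r1 = relay_estimate(18.0, 0.001)
```

With these made-up numbers R1 contributes roughly 1800 clients for that day.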
Relays already have the number of clients they've seen per country, so the
approach here is super simple: take advantage of that and extrapolate using
the relay's weight during the stats period.
Maybe this is overly simplistic, maybe it's been thought of before. However,
the results are the interesting part. For March 5th of 2016, here are the two
estimates for the "de" country, from metrics[5] and this algorithm:
Metrics: 2016-03-05,relay,de,,,158830,204819,183596,71
--> 183596 is the number of estimated clients.
Email: 2016-03-05 - 95745 estimated clients.
As you can see, this estimate is almost half of the metrics one! I ran the
numbers for other, smaller countries, and we are usually closer to what metrics
says for countries with < 10k users. For instance Iran, "ir":
Metrics: 2016-03-05,relay,ir,,,5748,7971,7044,71
--> 7044 is the number of estimated clients.
Email: 2016-03-05 - 5913 estimated clients.
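To put a number on the gap, the ratio between the two estimates can be computed
from the lines quoted above. This small Python sketch assumes only that the
client estimate is the eighth comma-separated field of the metrics line, as
indicated above; the function names are hypothetical.

```python
def metrics_clients(csv_line):
    """Return the eighth field of a metrics CSV line (the client estimate)."""
    return int(csv_line.split(",")[7])


def ratio_to_metrics(email_estimate, csv_line):
    """Ratio of this email's estimate to the metrics estimate."""
    return email_estimate / metrics_clients(csv_line)


de_ratio = ratio_to_metrics(95745, "2016-03-05,relay,de,,,158830,204819,183596,71")
ir_ratio = ratio_to_metrics(5913, "2016-03-05,relay,ir,,,5748,7971,7044,71")
```

That gives a ratio of about 0.52 for "de" and about 0.84 for "ir".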
Now, lots and lots might have gone wrong above with my PoC script, or there may
be issues in the algorithm itself, so I would like the metrics team to pinpoint
obvious issues with the algorithm and maybe suggest ways to improve it!
At least it's out there now :).
Thanks!
David
[1] https://metrics.torproject.org/userstats-relay-country.html
[2] https://research.torproject.org/techreports/counting-daily-bridge-users-2012-10-24.pdf
[3] https://gitweb.torproject.org/metrics-web.git/tree/modules/clients/init-userstats.sql
[4] https://gitweb.torproject.org/metrics-web.git/tree/modules/clients/src/org/torproject/metrics/clients/Main.java#n132
[5] https://metrics.torproject.org/stats/clients.csv