[tor-commits] [metrics-tasks/master] Add a more useful README for detector.py (#2718).
karsten at torproject.org
Tue Aug 9 13:59:09 UTC 2011
commit 56889ddfad3b3a1fdc1b259d4f95f3895e764972
Author: Karsten Loesing <karsten.loesing at gmx.net>
Date: Tue Aug 9 15:52:59 2011 +0200
Add a more useful README for detector.py (#2718).
---
task-2718/README | 120 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 120 insertions(+), 0 deletions(-)
diff --git a/task-2718/README b/task-2718/README
new file mode 100644
index 0000000..8f1d1e2
--- /dev/null
+++ b/task-2718/README
@@ -0,0 +1,120 @@
+Tor Censorship Detector
+=======================
+
+The Tor Censorship Detector is a script that reads a file containing the
+number of daily Tor users and finds anomalies that might be indicative of
+censorship. This README explains how to use the script and then describes
+the math behind it.
+
+We start with downloading the estimated Tor user numbers from the metrics
+website:
+
+ $ wget https://metrics.torproject.org/csv/direct-users.csv
+
+This file contains the estimated number of daily Tor users, with one
+column per country code and one row per date. The following tech report
+describes in detail how these estimates are obtained:
+
+ https://metrics.torproject.org/papers/countingusers-2010-11-30.pdf
+
+An excerpt of direct-users.csv is:
+
+ date,??,a1,a2,ad,ae,...,zw,all
+ 2011-08-06,6936,448,61,24,2460,...,13,337627
+ 2011-08-07,5904,398,53,15,2335,...,13,335626
+
+The "date" column contains the ISO-formatted date. The number in the "??"
+column stands for all IP addresses that could not be resolved to a country
+by the GeoIP database. The columns "a1" to "zw" contain the number of
+users by country, including MaxMind-specific country codes described at
+http://www.maxmind.com/app/iso3166 . The "all" column contains the sum of
+all Tor users on a given day.
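
Reading this layout can be sketched with Python's standard csv module. The
function name read_user_counts is hypothetical and not taken from
detector.py. The sample reuses the two rows above with the elided middle
columns left out, and it assumes that every non-empty cell is an integer.

```python
import csv
import io

def read_user_counts(lines):
    """Return {country_code: {date: users}} from direct-users.csv rows.

    Hypothetical helper for illustration; not part of detector.py.
    """
    counts = {}
    for row in csv.DictReader(lines):
        date = row.pop("date")
        for country, value in row.items():
            # Assumption: an empty or missing cell means "no estimate".
            if value:
                counts.setdefault(country, {})[date] = int(value)
    return counts

sample = io.StringIO(
    "date,??,a1,a2,ad,ae,zw,all\n"
    "2011-08-06,6936,448,61,24,2460,13,337627\n"
    "2011-08-07,5904,398,53,15,2335,13,335626\n"
)
counts = read_user_counts(sample)
print(counts["ae"]["2011-08-06"])
```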
+
+In order to run the Python script to detect anomalies, make an img/
+directory and install the required Python packages:
+
+ $ mkdir img/
+ $ sudo apt-get install python-numpy python-scipy python-matplotlib
+
+Run the detector.py script (this may take a while):
+
+ $ python detector.py
+
+The output consists of a file direct-users-ranges.csv containing the
+expected range of users per day and country, a file img/summary.txt with
+an overview of possible censorship events, and graphs in img/ for all
+countries with possible censorship events.
+
+An excerpt of direct-users-ranges.csv is:
+
+ date,country,minusers,maxusers
+ 2011-08-06,ae,1559.43780955,3880.20765967
+ 2011-08-07,ae,1460.8116866,3452.1615707
+
+This output means that the expected number of users in the United Arab
+Emirates on August 6, 2011 was between 1559 and 3880. The observed number
+of users in direct-users.csv was 2460, which lies within this range, so
+the script does not suspect a censorship event here.
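
The comparison described above can be sketched as follows. The function
name classify is hypothetical. The labels "down" and "up" are borrowed
from img/summary.txt, on the assumption that a value below the range
counts as a downturn and a value above it as an upturn. Boundary handling
is also an assumption.

```python
def classify(observed, minusers, maxusers):
    """Label one observation relative to its expected range.

    Hypothetical helper for illustration; not part of detector.py.
    """
    if observed < minusers:
        return "down"  # possible censorship event
    if observed > maxusers:
        return "up"    # possible release of censorship
    return "ok"        # within the expected range

# United Arab Emirates on 2011-08-06, using the excerpt above.
print(classify(2460, 1559.43780955, 3880.20765967))
```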
+
+The img/summary.txt file begins with the following lines:
+
+ =======================
+ Report for 2011-02-03 to 2011-08-07
+ =======================
+ sc -- down: 17 (up: 25 affected: 148)
+ ly -- down: 13 (up: 17 affected: 29)
+ py -- down: 10 (up: 8 affected: 137)
+
+This output means that, for example, in Libya there were 13 possible
+censorship events (downturns) and 17 possible releases of censorship
+(upturns) between February 3 and August 7, 2011.
+
+The graph img/013-ly-censor.png visualizes these downturns and upturns in
+a time plot.
+
+The core of the censorship detector script is contained in the functions
+make_tendencies_minmax() and write_all() in detector.py.
+
+In make_tendencies_minmax(), the detector is given the user number series
+of the top-50 countries, ranked by their user numbers on the last day in
+direct-users.csv. Each series covers the entire interval contained in
+direct-users.csv. For example, the last 10 values in the series for the
+United States are:
+
+ 66171,72866,76900,76292,77753,75749,81680,68084,77526,75499
+
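+For each of these countries, the detector computes the quotient between
+the number of users on a given day and the number 1 week earlier. The
+first seven quotients below use values from before the excerpt above; the
+last one is 75499 / 76900 = 0.982. The quotients for the same 10 days
+are: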
+
+ 1.048,1.075,1.015,0.987,1.148,0.959,1.176,1.029,1.064,0.982
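
The week-over-week quotient can be sketched as below. The function name
weekly_quotients is hypothetical, and returning None for a zero
denominator is an assumption. Only the last three quotients can be
reproduced from the 10 values quoted above, because the earlier ones need
data from before the excerpt.

```python
def weekly_quotients(series, lag=7):
    """Quotient of each value and the value `lag` days earlier.

    Hypothetical helper for illustration; not part of detector.py.
    """
    quotients = []
    for i in range(lag, len(series)):
        previous = series[i - lag]
        # Assumption: a quotient is undefined when last week's value is 0.
        quotients.append(series[i] / previous if previous else None)
    return quotients

us = [66171, 72866, 76900, 76292, 77753, 75749, 81680, 68084, 77526, 75499]
last_three = [round(q, 3) for q in weekly_quotients(us)]
print(last_three)
```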
+
+For every day in direct-users.csv, the detector considers all non-zero
+quotients of the top-50 countries. It then discards outliers which are
+more than 4 times the interquartile range away from the median. For
+August 7, 2011, all 50 quotients were in the interval from 0.697 to 1.263
+with no outliers.
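
A minimal sketch of this outlier filter, using only the standard library
(Python 3.8 or later). The function name drop_outliers and the sample
values are hypothetical. The quartile method used by statistics.quantiles
is an assumption and may differ from what detector.py does.

```python
import statistics

def drop_outliers(quotients, factor=4.0):
    """Keep non-zero quotients within factor * IQR of the median.

    Hypothetical helper for illustration; not part of detector.py.
    """
    nonzero = [q for q in quotients if q]
    median = statistics.median(nonzero)
    q1, _, q3 = statistics.quantiles(nonzero, n=4)
    iqr = q3 - q1
    return [q for q in nonzero if abs(q - median) <= factor * iqr]

# Made-up quotients: one zero and one obvious outlier (5.0).
sample = [0.90, 0.92, 0.95, 0.97, 0.99, 1.00, 1.01,
          1.03, 1.05, 1.08, 1.10, 5.0, 0.0]
kept = drop_outliers(sample)
print(len(kept), max(kept))
```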
+
+In the next step, the detector fits a normal distribution to these
+quotients and uses the inverse cumulative distribution function to look
+up the 0.01-th and 99.99-th percentiles. These values are the hypothetical
+quotients that are greater than 0.01% and 99.99% of all quotients,
+respectively. For the data mentioned above, the fitted normal distribution
+has a mean of 0.992 and a standard deviation of 0.091, and the looked-up
+percentiles are 0.654 and 1.33.
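
The percentile lookup can be checked with statistics.NormalDist from the
standard library (Python 3.8 or later). The function name quotient_bounds
is hypothetical. The sketch starts from the mean and standard deviation
quoted above rather than refitting the 50 quotients, which are not listed
here.

```python
from statistics import NormalDist

def quotient_bounds(dist, tail=0.0001):
    """Return the 0.01-th and 99.99-th percentiles of a normal distribution.

    Hypothetical helper for illustration; not part of detector.py.
    """
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)

# Parameters as reported in the text for August 7, 2011.
fitted = NormalDist(mu=0.992, sigma=0.091)
low, high = quotient_bounds(fitted)
print(round(low, 3), round(high, 3))
```

The results should be close to the 0.654 and 1.33 quoted above. When
fitting from raw quotients, NormalDist.from_samples() would be one option,
but it uses the sample standard deviation, which may differ slightly from
the fit in detector.py.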
+
+In write_all(), the detector calculates an estimated minimum and maximum
+for a given country and date based on the user number 1 week earlier. The
+detector first looks up the 0.01-th and the 99.99-th percentiles of a
+Poisson distribution whose mean is the user number from 1 week earlier.
+It then multiplies the lower Poisson percentile by the lower network-wide
+quotient and the upper Poisson percentile by the upper network-wide
+quotient calculated above.
+
+For example, in the United States there were 75499 users on August 7,
+2011 and 76900 users 1 week earlier on July 31, 2011. The 0.01-th and
+the 99.99-th percentiles of the Poisson distribution with a mean of 76900
+are 75871 and 77933. The estimated range of users on August 7, 2011
+therefore goes from 0.654 * 75871 = 49620 to 1.33 * 77933 = 103651. The
+observed 75499 users are within this interval, so the detector suspects
+neither a censorship event nor a release of censorship in the United
+States on August 7, 2011.
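
The range calculation for this example can be sketched as below. Note the
swapped-in technique: to stay within the standard library, the sketch
approximates the Poisson percentiles with a normal distribution of mean m
and standard deviation sqrt(m), so its bounds differ from the exact
Poisson values above by a few users. The function name expected_range is
hypothetical.

```python
import math
from statistics import NormalDist

def expected_range(users_last_week, low_quotient, high_quotient, tail=0.0001):
    """Estimate the expected user range from last week's user number.

    Hypothetical helper for illustration; not part of detector.py.
    Uses a normal approximation instead of exact Poisson percentiles.
    """
    approx = NormalDist(mu=users_last_week, sigma=math.sqrt(users_last_week))
    poisson_low = approx.inv_cdf(tail)
    poisson_high = approx.inv_cdf(1 - tail)
    # Scale each percentile by the matching network-wide quotient.
    return low_quotient * poisson_low, high_quotient * poisson_high

# United States: 76900 users on July 31, 2011; quotients 0.654 and 1.33.
minusers, maxusers = expected_range(76900, 0.654, 1.33)
observed = 75499
print(round(minusers), round(maxusers), minusers <= observed <= maxusers)
```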
+