[metrics-bugs] #9316 [Obfuscation/BridgeDB]: BridgeDB should export statistics

Tue Apr 23 20:12:45 UTC 2019

#9316: BridgeDB should export statistics
-------------------------------------------+---------------------------
 Reporter:  asn                            |          Owner:  dgoulet
     Type:  task                           |         Status:  assigned
 Priority:  Medium                         |      Milestone:
Component:  Obfuscation/BridgeDB           |        Version:
 Severity:  Normal                         |     Resolution:
 Keywords:  metrics, bridgedb, prometheus  |  Actual Points:
Parent ID:  #19332                         |         Points:  3
 Reviewer:                                 |        Sponsor:  Sponsor19
-------------------------------------------+---------------------------

Comment (by dcf):

 Replying to [comment:16 phw]:
 > Here's a preliminary list of statistics that we may want, and why we
 want them. Needless to say, we need to figure out how to collect these
 statistics safely.

 If it's possible, I would like to have a guess at what fraction of bridge
 requesters are bots. Proxy-distribution papers usually assume that an
 adversary controls some fraction of the users--it would be great to know
 what the fraction is in this case. For example
 [https://censorbib.nymity.ch/#Mahdian2010a Mahdian2010a] "''n'' users,
 ''k'' of whom [are] adversaries," [https://censorbib.nymity.ch/#Wang2013a
 Wang2013a] "Let ''f'' denote the fraction of malicious users among all
 potential bridge users.... We expect a typical value of ''f'' between 1%
 and 5%...."

 Here are some possible ways to identify bots:
  * IP address clustering--for example if BridgeDB considers all addresses
 in a /24 the same, find the most commonly occurring /20
  * auto-generated email addresses following a pattern
    * To start, you could make a histogram of the lengths of email
 addresses and see if it's concentrated at a single point. Or count the
 frequency of short prefixes and suffixes of email address local-parts and
 see if any appear overwhelmingly more often than others.
  * an anachronistic HTTP User-Agent (for example, Chrome from 2 years ago,
 when most real Chrome users auto-update)
  * inconsistent HTTP headers, for example Chrome or Firefox without
 `Accept-Encoding: gzip`
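
 A rough sketch of the first, second, and fourth heuristics above, in
 Python. This is not BridgeDB code: the function names, the 3-character
 affix length, and the plain-`dict` header representation (with exactly
 these key spellings) are all assumptions made for illustration.

```python
import ipaddress
from collections import Counter


def top_networks(addresses, prefix_len=20, n=5):
    """Count IPv4 requests per enclosing network (a /20 by default)."""
    counts = Counter(
        # strict=False masks off the host bits instead of raising.
        ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        for addr in addresses
    )
    return counts.most_common(n)


def local_part_stats(emails, affix_len=3):
    """Tally local-part lengths, short prefixes, and short suffixes."""
    lengths, prefixes, suffixes = Counter(), Counter(), Counter()
    for email in emails:
        local = email.rsplit("@", 1)[0]
        lengths[len(local)] += 1
        prefixes[local[:affix_len]] += 1
        suffixes[local[-affix_len:]] += 1
    return lengths, prefixes, suffixes


def browser_without_gzip(headers):
    """True if the User-Agent claims Chrome/Firefox but gzip is not offered.

    `headers` is assumed to be a dict keyed by canonical header names.
    """
    ua = headers.get("User-Agent", "")
    enc = headers.get("Accept-Encoding", "").lower()
    return ("Chrome" in ua or "Firefox" in ua) and "gzip" not in enc
```

 A single /20, length, prefix, or suffix that dominates its counter would
 be a hint worth looking at, not proof of a bot.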

 Given some sort of bot-classification heuristic, it would be good to
 analyze the statistics you mentioned already (e.g. fraction
 allowed/denied) separately for bot and non-bot requests.

 I would like to see a graph that shows how long it takes for a single
 bridge to be given to ''n'' different requesters. When BridgeDB starts
 distributing a bridge, how long does it take before 5 people know about
 it? Before 50 people know about it?
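
 To make the question concrete, here is a minimal sketch of how the
 time-to-''n'' figure could be computed. It assumes a hypothetical event
 log of `(timestamp, bridge_id, requester_id)` tuples, with numeric
 timestamps and some privacy-preserving requester identifier. I don't know
 whether BridgeDB records anything like this today.

```python
def time_to_n_requesters(events, n):
    """Map each bridge to the time between its first hand-out and the
    moment its n-th distinct requester received it.

    Bridges that never reach n distinct requesters are omitted.
    """
    first_seen = {}   # bridge_id -> timestamp of first distribution
    requesters = {}   # bridge_id -> set of distinct requester ids
    result = {}
    for ts, bridge, requester in sorted(events, key=lambda e: e[0]):
        first_seen.setdefault(bridge, ts)
        seen = requesters.setdefault(bridge, set())
        if requester in seen:
            continue  # repeat request; not a new person
        seen.add(requester)
        if len(seen) == n:
            result[bridge] = ts - first_seen[bridge]
    return result
```

 Running this for several values of ''n'' (say 5 and 50) and plotting the
 distribution of the results over all bridges would give the graph
 described above.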

 > * Approximate number of ''HTTPS'' requests coming from proxies.
 >   * This may be an indicator of people trying to game the system.

 On this point, specifically I would want to know what fraction of
 requests have an `X-Forwarded-For` or `Via` header, ''and'' how many
 entries it contains. I mention this because not only can these headers
 indicate the use of a proxy, a client may also spoof them. And I seem to
 remember that BridgeDB may process `X-Forwarded-For` incorrectly: it may
 read the entries in the wrong order when there are multiple of them.

 For this analysis, you will have to be aware that requests via Moat always
 have at least one `X-Forwarded-For` (I believe), because Moat is
 implemented using an Apache `ProxyPass` reverse proxy and Apache adds that
 header.
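
 A sketch of the entry-count statistic, assuming each request's headers
 are available as a list of `(name, value)` pairs (a hypothetical
 representation, not BridgeDB's). It splits on commas, which is naive for
 `Via` values whose comments contain commas.

```python
from collections import Counter


def count_entries(values):
    """Count non-empty comma-separated entries across repeated headers."""
    return sum(1 for v in values for item in v.split(",") if item.strip())


def entry_histogram(requests, header="X-Forwarded-For"):
    """Histogram of how many entries `header` carries per request."""
    wanted = header.lower()
    hist = Counter()
    for headers in requests:
        values = [v for name, v in headers if name.lower() == wanted]
        hist[count_entries(values)] += 1
    return hist


def fraction_with_header(hist):
    """Fraction of requests carrying at least one entry."""
    total = sum(hist.values())
    return (total - hist[0]) / total if total else 0.0
```

 For Moat requests, a count of 1 may reflect only the Apache reverse
 proxy, so the Moat histogram should be read separately from the HTTPS
 one.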

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/9316#comment:19>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online