[tor-dev] [tor-talk] Client simulation

Fri Jun 7 18:04:25 UTC 2013

On 6/7/13 2:37 AM, Karsten Loesing wrote:
> (Sorry for cross-posting, but I think this is a topic for tor-dev@, not
> tor-talk@.  If you agree, please reply on tor-dev@ only.  tor-talk@
> people can follow the thread here:

OK, following up on tor-dev.  I thought tor-dev might be the more 
appropriate list, but the description for tor-talk is "all discussion 
about theory, design, and development of Onion Routing," whereas that 
for tor-dev is "discussion regarding Tor development."  Maybe since I 
think more about theory than code, the former description seemed more 
applicable...

> On 6/6/13 7:32 PM, Norman Danner wrote:
>> I have two questions regarding a possible research project.
>>
>> First, the research question:  can one use machine-learning techniques
>> to construct a model of Tor client behavior?  Or in a more general form:
>>   can one use <fill-in-the-blank> to construct a model of Tor client
>> behavior?  A student of mine did some work on this over the last year,
>> and the results are encouraging, though not strong enough to do anything
>> with yet.
>>
>> Second, the meta-question:  is it worthwhile to answer the first
>> question?
> [snip]
> Hi Norman,
>
> yes, it's worthwhile to answer this question!  I can imagine how at
> least Shadow and the Tor path generator would benefit from better client
> models.  User number estimates on the metrics website might benefit from
> them, too.
>
> I found two tickets where we asked similar questions before, and maybe
> there are more tickets like these:
>
> https://trac.torproject.org/projects/tor/ticket/2963
>
> https://trac.torproject.org/projects/tor/ticket/6295
>
> Some very early thoughts:
>
> - How do we make sure that we ask a representative set of people to
> instrument their clients and export data on their usage behavior?  If we
> only ask people who read their favorite news site twice per day, our
> client model will be just that, but not representative for all Tor
> users.  (Still, we would know more than we know now.)
>
> - Can we somehow aggregate usage information enough to make it safe for
> people to send actual usage reports to us?  I could imagine having a
> torrc flag that is disabled by default and that, when enabled, writes
> sanitized usage information to disk.  For this we need a very good idea
> what we're planning to do with the data, and we'll need to specify the
> aggregation approach in a tech report and get it reviewed by the community.
>
> Are your student's results available somewhere?

The written portion of my student Julian Applebaum's senior thesis is 
available at

	http://wesscholar.wesleyan.edu/etd_hon_theses/1042/

Our focus in this project (which I left intentionally vague in my 
posting) was to try to model clients at a very high level. 
Specifically, we wanted to see if we could model something like the 
timing patterns of the Tor cells that clients send to the network.  An 
intended application (not yet completely thought through...) would be to 
use such information to get a more accurate sense of how well timing 
attacks work, by deploying them (in simulation) against presumably 
realistic clients.  Our strategy was roughly:

* Instrument our guard node to record cell arrival times from clients 
pseudonymously (i.e., we know when two different cells belong to the 
same circuit, but we only record circuits as A, B, etc.).
* Record such data for a short period of time.
* Represent each circuit as a time series.
* Cluster the collection of time series using Markov model clustering 
techniques.
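
As a rough illustration of the first three steps (not the code used in the 
thesis), the sketch below turns a log of cell arrivals into one count series 
per pseudonymous circuit.  The function name, the (circuit id, timestamp) 
input format, the "C0", "C1" labels, and the one-second bin width are all 
assumptions made for the example.

```python
from collections import defaultdict
from itertools import count


def circuit_time_series(events, bin_width=1.0):
    """Bin cell arrivals into per-circuit count series.

    events: iterable of (circuit_id, arrival_time_in_seconds) pairs
            (a hypothetical log format, assumed for this sketch).
    Returns {label: [cells in bin 0, cells in bin 1, ...]}, where each
    circuit's bins start at its own first cell.
    """
    labels = {}                  # real circuit id -> opaque label
    next_label = count()
    arrivals = defaultdict(list)

    for circ_id, t in events:
        # Replace the real identifier with an opaque label so that only
        # "same circuit or not" survives in the output.
        if circ_id not in labels:
            labels[circ_id] = f"C{next(next_label)}"
        arrivals[labels[circ_id]].append(t)

    series = {}
    for label, times in arrivals.items():
        start = min(times)
        n_bins = int((max(times) - start) // bin_width) + 1
        counts = [0] * n_bins
        for t in times:
            counts[int((t - start) // bin_width)] += 1
        series[label] = counts
    return series


events = [(101, 0.0), (101, 0.2), (202, 5.0), (101, 2.5), (202, 5.1)]
series = circuit_time_series(events)
```

Each resulting list is one time series that could then be handed to a 
clustering step.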

The intent is that each cluster (represented by a single hidden Markov 
model) represents a "type" of client, even though we don't know for sure 
what that client type does.  We can make some guesses about some:  the 
"type" of steady high-volume cell counts is probably a bulk downloader; 
the "type" of steady zero cell counts is probably an unused circuit; 
etc.  But in some sense, I'm thinking that what counts is the behavior 
of the client, not the reason for that behavior.  We don't have to 
instrument clients for this.  Of course, then one has to ask whether 
this kind of modeling is in fact useful.  It is somewhat different from 
what you are envisioning, I think.
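
To make the "one hidden Markov model per client type" idea concrete, here is 
a minimal sketch of assigning a circuit's series to the best-fitting model.  
It only scores a series against fixed models with the standard forward 
algorithm; it does not learn the models or do the clustering itself.  The 
three models, their parameters, and the symbol alphabet (0 = no cells, 
1 = some cells, 2 = many cells per time bin) are invented for illustration 
and are not taken from the thesis.

```python
import math


def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm for a discrete HMM.

    start[i]   : probability of starting in state i
    trans[i][j]: probability of moving from state i to state j
    emit[i][o] : probability that state i emits symbol o
    """
    n = len(start)
    loglik = 0.0
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    for step, o in enumerate(obs):
        if step > 0:
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][o]
                for j in range(n)
            ]
        total = sum(alpha)
        if total == 0.0:
            return float("-inf")
        # Rescale at every step to avoid underflow on long series.
        loglik += math.log(total)
        alpha = [a / total for a in alpha]
    return loglik


# Hypothetical hand-picked models, one per assumed client "type".
MODELS = {
    "idle": ([1.0], [[1.0]], [[0.90, 0.09, 0.01]]),
    "bulk": ([1.0], [[1.0]], [[0.02, 0.18, 0.80]]),
    "bursty": (
        [0.5, 0.5],
        [[0.8, 0.2], [0.3, 0.7]],
        [[0.90, 0.08, 0.02], [0.10, 0.30, 0.60]],
    ),
}


def classify(obs):
    """Return the name of the model under which obs is most likely."""
    return max(MODELS, key=lambda name: log_likelihood(obs, *MODELS[name]))
```

In an actual clustering procedure the model parameters would be estimated 
from the data rather than written down by hand; the assignment step is the 
part shown here.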

There are about a billion variations (at last count) on this theme.  We 
chose one particular variation as a test case to play with the methodology.  I 
think the methodology is mostly OK, though I'm not completely satisfied 
with the results of the particular variation Julian worked on.  So now 
I'm trying to figure out whether to push this forward and in particular 
what directions and end goals would be useful.

	- Norman

-- 
Norman Danner - ndanner at wesleyan.edu - http://ndanner.web.wesleyan.edu
Department of Mathematics and Computer Science - Wesleyan University