[tor-project] Constructing a real-world dataset for studying website fingerprinting

Sat Apr 22 23:22:37 UTC 2023

Hi Rob,

Your earlier work on online WF and this proposal are exciting. We can 
learn a lot from such a proposed dataset, so please collect it, but I 
think that there are some challenges around framing it as a dataset for 
assessing WF attacks and defenses.

To be brief, I think we solve traffic diversity with BigEnough-style 
datasets that never overlap subpages [0] and that data staleness and 
network overheads are minor problems. Capable attackers (the ones we 
should consider) can build closed worlds of ~100 websites [1], so there 
is no need for open-world data.

The biggest strength of the proposed dataset is also its biggest 
weakness: capturing real-world user diversity in traces. It is more 
user/client diversity than browser diversity (in the sense of being 
broader), because the dataset captures various browsers as well as 
different Tor client implementations, other Tor configurations like VPN 
mode, running on routers, in strange VMs, headless browsers etc., as 
well as a wide range of network configurations and probably more that I 
missed, all used by the very diverse Tor-network userbase. While capturing 
this diversity is super interesting and valuable in different ways, it 
also risks being too filled with junk to help assess WF attacks or defenses.

Do you have any approaches or thoughts around pruning the dataset or 
further refining labels beyond the first domain? Some worries:

- I fear that we will likely have to do much guesswork on interpreting 
results based on the dataset. Do we want to be assessing WF defenses and 
attacks based on random torified curl scrapers, python scripts, and who 
knows what? Right now, we are creeping closer and closer to 
Internet~web, reflected in Tor-traffic. Without any ground truth, how 
can we make sure that most labels aren't junk? What does it mean to 
reach 50% accuracy when evaluating on the dataset? If a WF attack trained 
on TB "fails" to associate a website visit using curl, that's probably a 
feature (not what the attacker was after anyway), not a bug. It would be 
fantastic if it were possible to have a sizeable subset of traces for 
some/most labels containing known configurations (or even just confirmed 
website visits with, say, vanilla TB, maybe some more inline domain 
fingerprinting to set a bool for key website domains of Tranco top-100 
requested on the circuit or something?).

- Filled with an unknown rate of junk, the dataset alone will be 
insufficient to train WF attacks in noisy environments (you want ~1000+ 
samples per class of something coherent, at least with current SOTA DL 
models pushed to their limits; as we move on to transformers, probably 
even more). Suppose you subscribe to the claims of data staleness being 
a factor. In that case, researchers cannot even go through the very 
time-consuming process of collecting adequate labeled training data in the 
same way to show success on the proposed dataset. The exit collection 
vantage point makes this worse.

- Using the dataset as a basis for simulating defenses, one would have 
to simulate the corresponding client and middle traces to feed into the 
simulated defenses based on exit traces. Kinda messy with poor 
signalling for Tor-network characteristics in the dataset. When 
collecting at the client, getting realistic client traces (for the 
particular configuration) is basically for free. At the same time, 
middles change for every circuit, so a half-assed approach seems to get 
you far (speaking from experience of half-assing things here!). I worry 
about half-assing both client and middle traces, however.

If we make this proposed dataset and its method the bar of doing 
"real-world WF", it might lead to too high of a bar. We want more 
real-world implementations and data collection with defenses, not less, 
I think.

Sorry if the above may come across as a bit negative, it's not my 
intent: I *want* the dataset you describe, we can learn a lot from it 
for sure. Wish I had a chance to chat in person in Costa Rica! Please 
don't feel obliged to reply, just food for thought.

Best,
Tobias

[0]: "SoK: A Critical Evaluation of Efficient Website Fingerprinting 
Defenses", https://www-users.cse.umn.edu/~hoppernj/sok_wf_def_sp23.pdf
[1]: "Website Fingerprinting with Website Oracles", 
https://petsymposium.org/2020/files/papers/issue1/popets-2020-0013.pdf

On 20/04/2023 23:16, Jansen, Robert G CIV USN NRL (5543) Washington DC 
(USA) via tor-project wrote:
> Hello Tor friends,
> 
> We are planning to construct a real-world dataset for studying Tor website
> fingerprinting that researchers and developers can use to evaluate potential
> attacks and to design informed defenses that improve Tor’s resistance to such
> attacks. We believe the dataset will help us make Tor safer, because it will
> allow us to design defenses that can be shown to protect *real* Tor traffic
> instead of *synthetic* traffic. This will help ground our evaluation of proposed
> defenses in reality and help us more confidently decide which, if any, defense
> is worth deploying in Tor.
> 
> We have submitted detailed technical plans for constructing a dataset to the Tor
> Research Safety Board and after some iteration have arrived at a plan in which
> we believe the benefits outweigh the risks. We are now sharing an overview of
> our plan with the broader community to provide an opportunity for comment.
> 
> More details are below. Please let us know if you have comments.
> 
> Peace, love, and positivity,
> Rob
> 
> P.S. Apologies for posting near the end of the work-week, but I wanted to get
> this out in case people want to talk to me about it in Costa Rica.
> 
> ===
> 
> BACKGROUND
> 
> Website fingerprinting attacks distill traffic patterns observed between a
> client and Tor entry into a sequence of packet directions: -1 if a packet is
> sent toward the destination, +1 if a packet is sent toward the client. An
> attacker can collect a list of these directions and then train machine learning
> classifiers to associate a website domain name or url with the particular list
> of directions observed when visiting that website. Once it does this training,
> then when it observes a new list of directions it can use the trained model to
> predict which website corresponds to that pattern.
> 
> For example, suppose [-1,-1,+1,+1] is associated with website1 and [-1,+1,-1,+1]
> is associated with website2. There are two steps in an attack:
> 
> Step 1:
> In the first step the attacker itself visits website1 and website2 many times
> and learns:
> [-1,-1,+1,+1] -> website1
> [-1,+1,-1,+1] -> website2
> It trains a machine learning model to learn this association.
> 
> Step 2:
> In the second step, with the trained model in hand, the attacker monitors a Tor
> client (maybe the attacker is the client’s ISP, or some other entity in a
> position to observe a client’s traffic) and when it observes the pattern:
> [-1,-1,+1,+1]
> the model will predict that the client went to website1. This example is
> *extremely* simplified, but I hope gives an idea how the attack works.
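
The two steps above can be sketched in a few lines of Python. This is only a toy illustration of the simplified example: a 1-nearest-neighbour lookup stands in for the machine learning model (real attacks train far more capable classifiers on many traces per site), and the names TRAINING, distance, and predict are made up for this sketch.

```python
# Step 1: direction sequences the attacker recorded while visiting each site
# itself (-1 = toward the destination, +1 = toward the client).
TRAINING = [
    ([-1, -1, +1, +1], "website1"),
    ([-1, +1, -1, +1], "website2"),
]


def distance(a, b):
    """Count positions where two direction sequences differ.

    A difference in length is added so that sequences of unequal
    length are never treated as identical.
    """
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches + abs(len(a) - len(b))


def predict(trace, training=TRAINING):
    """Step 2: return the label of the closest training sequence."""
    _, label = min(training, key=lambda item: distance(trace, item[0]))
    return label
```

Calling predict([-1, -1, +1, +1]) returns "website1", matching the example; a slightly perturbed sequence is still mapped to whichever known pattern it resembles most.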
> 
> PROBLEM
> 
> Because researchers don’t know which websites Tor users are visiting, it’s hard
> to do a very good job creating a representative dataset that can be used to
> accurately evaluate attacks or defenses (i.e., to emulate steps 1 and 2). The
> standard technique has been to just select popular websites from top website
> lists (e.g., Alexa or Tranco) and then set up a Tor webpage crawler to visit the
> front-pages of those websites over and over and over again. Then they use that
> data to write papers. This approach has several problems:
> 
> - Low traffic diversity: Tor users don’t only visit front-pages. For example,
> they may conduct a web search and then click a link that brings them directly to
> an internal page of a website. The patterns produced from front-page visits may
> be simpler and unrepresentative of the patterns that would be observed from more
> complicated internal pages.
> 
> - Low browser diversity: It has been shown by research from Marc Juarez [0] and
> others that webpage crawlers used by researchers lack diversity in important
> aspects that cause us to overestimate the accuracy of WF attacks. For example,
> the browser versions, configuration choices, variation in behavior (e.g., using
> multiple tabs at once), and network location of the client can all significantly
> affect the observable traffic patterns in ways that a crawler methodology does
> not capture.
> 
> - Data staleness: Researchers collect data over a short time-frame and then
> evaluate the attacks assuming this static dataset. In the real world, websites
> are being updated over time, and a model trained on an old version of a website
> may not transfer to the new version.
> 
> In addition to the above problems in methodology, current research also causes
> incidental consequences for the Tor network:
> 
> - Network overhead: machine learning is a hot topic and several research groups
> have crawled tens of thousands of websites over Tor many times each. While each
> individual page load might be insignificant compared with the normal usage of
> Tor, crawling does add additional load to the network and can contribute to
> congestion and performance bottlenecks.
> 
> Researchers have been designing attacks that are shown to be extremely accurate
> using the above synthetic crawling methodology. But because of the above
> problems, we don’t properly understand the *true* threat of the attack against
> the Tor network. It is possible that the simplicity of the crawling approach is
> what makes the attacks work well, and that the attacks would not work as well if
> evaluated with more realistic traffic and browser diversity.
> 
> PLAN
> 
> So our goal is to construct a real-world dataset for studying Tor website
> fingerprinting that researchers and developers can use to evaluate potential
> attacks and to design informed defenses that improve Tor’s resistance to such
> attacks. This dataset would enable researchers to use a methodology that does
> not have any of the above limitations. We believe that such a dataset will help
> us make Tor safer, because it will allow us to design defenses that can be shown
> to protect *real* Tor traffic instead of *synthetic* traffic. This would lead to
> a better understanding of proposed defenses and enable us to more confidently
> decide which, if any, defense is worth deploying in Tor.
> 
> The dataset will be constructed from a 13-week exit relay measurement that is
> based on the measurement process established in recent work [1]. The primary
> information being measured is the directionality of the first 5k cells sent on a
> measurement circuit, and a keyed-HMAC of the first domain name requested on the
> circuit. We also measure relative circuit and cell timestamps (relative to the
> start of measurement). The measurement data is compressed, encrypted using a
> public-key encryption scheme (the secret key is stored offline), and then
> temporarily written to persistent storage before being securely retrieved from
> the relay machine.
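
As a rough sketch of what one measured record could look like under this description: the code below keeps the directions of at most the first 5000 cells, stores timestamps relative to the start of measurement, and replaces the first domain name with a keyed HMAC. The hash function (SHA-256), the lowercasing of the domain, the field names, and the function names are all assumptions made for illustration and are not taken from the plan. The compression and public-key encryption steps are left out.

```python
import hashlib
import hmac

MAX_CELLS = 5000  # "first 5k cells" on a measurement circuit


def domain_label(key: bytes, domain: str) -> str:
    """Keyed HMAC of a domain name (SHA-256 and lowercasing are assumptions)."""
    normalized = domain.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()


def make_record(key, first_domain, cells, measurement_start):
    """Build one hypothetical circuit record.

    cells is an iterable of (absolute_time, direction) pairs, where
    direction is -1 or +1. Only the first MAX_CELLS cells are kept.
    """
    kept = list(cells)[:MAX_CELLS]
    return {
        "label": domain_label(key, first_domain),
        "directions": [direction for _, direction in kept],
        "cell_times": [t - measurement_start for t, _ in kept],
    }
```

With a record like this, two circuits that requested the same first domain share a label without the domain itself appearing in the data, as long as the HMAC key stays secret.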
> 
> We hope that this dataset can become a standard tool that website fingerprinting
> researchers and developers can use to (1) accelerate their study of attacks and
> defenses, and (2) produce evaluations and results that are more directly
> applicable to the Tor network. We plan to share it upon request only with other
> researchers who appear to come from verifiable research organizations, such as
> students from well-known universities. We will require researchers with whom we
> share the data to (1) keep the data private, and (2) direct others who want a
> copy of the data to us to mitigate unauthorized sharing.
> 
> [0] A Critical Evaluation of Website Fingerprinting Attacks. Juarez et al., CCS 2014. https://www1.icsi.berkeley.edu/~sadia/papers/ccs-webfp-final.pdf
> 
> [1] Online Website Fingerprinting: Evaluating Website Fingerprinting Attacks on Tor in the Real World. Cherubin et al., USENIX Security 2022. https://www.usenix.org/conference/usenixsecurity22/presentation/cherubin
> 
> 