[tor-project] Ethics Guidelines; crawling .onion

Sat Jul 23 00:37:20 UTC 2016

> On 21 Jul 2016, at 18:10, Virgil Griffith <i at virgil.gr> wrote:
> 
> > I think you've misinterpreted the ethics guidelines here.
> > "crawling" means running a HSDir to discover .onion addresses that would otherwise be private.
> > It doesn't (necessarily) mean accessing web pages on .onion sites using an automated process.
> 
> If so, this is news to me, and I would be delighted to hear it.
> 
> Can we get a confirmation then that /robots.txt is a totally cool standard?

That's not what I said, Virgil.
There are a number of different opinions here.
No interpretation of the ethics guidelines changes that.
It feels like you're engaging in rules lawyering, trying to find a policy or statement that will let you do what you want to do.
Please engage with people's concerns instead.

For the sake of clarifying what I meant, here is my analysis:

When I've seen people talk about "crawling .onion sites", the issue that has received the most focus is the harvesting of .onion addresses by running a malicious HSDir. We do things to prevent this behaviour, including blacklisting HSDirs. This behaviour is clearly unethical, there is a community consensus about it, and we invest resources in preventing it.

As for accessing .onion sites via an automated process or non-anonymous proxy (e.g. Tor2web), that's something we're still talking about. There are significant issues around client anonymity, server anonymity, and access to sensitive data. We might decide we want to actively prevent it. We might decide we don't want to put any effort into supporting it in future.

There's also the issue of searching these sites. Perhaps some kinds of search are ok, but others are too powerful (like regular expressions, which many search sites avoid). Again, this is something we're discussing.

And there's a final issue here: your previous and current behaviour. You've attempted to monetise Tor2web client information by selling it to investigative agencies, including providing public samples. You've tried to push the discussion (towards supporting you?) by acting in ways that could potentially harm users. You've repeatedly released potentially sensitive client logs (some had minimal redactions). People have asked you to stop. But you still keep on releasing logs and lists of scraped data in your emails.

You just can't seem to stop showing off the kinds of data that you have access to.
You've illustrated exactly the sort of things that an unscrupulous Tor2web operator can do.
(Even if you were doing it for demonstration purposes, the ethical way to demonstrate is with test data, not live client requests from actual people.)

I don't know if I'd trust you to be in a position where you see client requests.
I'm not sure I'd even trust you to run a Guard node, and Tor2web admins see far more than a Guard node does.

-----

As an aside:

You might want to enable automatic redirects from http://onion.link to https://onion.link.
I know some people object to automatic HTTPS redirects. But I think in this case it's an important protection for clients.
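
For illustration, here is a minimal sketch of such a redirect using only Python's standard library. It is not how onion.link or Tor2web is actually configured: the names `https_location`, `RedirectHandler` and `serve` are made up for this example, and it assumes the Host header carries a plain hostname rather than an IPv6 literal. In practice, a single rule in the front-end web server would probably do the same job.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def https_location(host: str, path: str) -> str:
    """Return the HTTPS URL equivalent to a plain-HTTP request."""
    # Drop an explicit port such as ":80" so the redirect uses the default HTTPS port.
    hostname = host.split(":", 1)[0]
    return f"https://{hostname}{path}"


class RedirectHandler(BaseHTTPRequestHandler):
    """Answer every plain-HTTP GET or HEAD with a permanent redirect to HTTPS."""

    def do_GET(self):
        # Fall back to onion.link if the client sent no Host header (an assumption).
        host = self.headers.get("Host", "onion.link")
        self.send_response(301)
        self.send_header("Location", https_location(host, self.path))
        self.send_header("Content-Length", "0")
        self.end_headers()

    do_HEAD = do_GET


def serve(port: int = 8080) -> None:
    """Run the redirector; the port is arbitrary here, a real deployment would use 80."""
    HTTPServer(("127.0.0.1", port), RedirectHandler).serve_forever()
```

Keeping the Host header in the redirect matters for a Tor2web-style site, because the subdomain identifies which onion service the client asked for.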

Normally I'd be concerned you use Google Analytics rather than a local analytics solution. But since you're loading an embedded Google search box, Google gets all the client data anyway, including search queries and client IP addresses. You could use a privacy-preserving search site, or proxy the requests to Google to hide client IP addresses.
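
If you did proxy the search requests, the proxy would also need to avoid passing client-identifying headers upstream. A rough sketch of that filtering step follows, assuming request headers arrive as a plain dict. The function name `scrub_headers` and the header list are illustrative only; the list is certainly not exhaustive.

```python
# Headers that can identify a client or their browsing (illustrative, not exhaustive).
SENSITIVE_HEADERS = {
    "x-forwarded-for",
    "forwarded",
    "via",
    "x-real-ip",
    "cookie",
    "referer",
}


def scrub_headers(headers: dict) -> dict:
    """Return a copy of the headers without the client-identifying ones."""
    # Header names are case-insensitive, so compare in lower case.
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in SENSITIVE_HEADERS
    }
```

The upstream site then sees the proxy's IP address instead of the client's, although the search queries themselves still reach it.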

Your headers look fine. However, onion.link holds the client connection open for a long time when the hidden service accepts a request and keeps its connection open, but never responds. I wonder whether this is a bug in Tor2web, a bug in onion.link, or desired behaviour. Either way, it's a denial of service risk. (It's not that serious, because each open client connection requires an open hidden service connection. And I wonder if it would time out eventually.)
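
One way to bound this is a read timeout on the upstream connection. The sketch below uses Python's standard `socket` module. I haven't checked how Tor2web actually reads from the hidden service, so the function name `recv_with_timeout` and the idea of applying it at this point are assumptions.

```python
import socket


def recv_with_timeout(sock: socket.socket, timeout_s: float, bufsize: int = 4096):
    """Read once from sock, returning None if nothing arrives within timeout_s seconds."""
    sock.settimeout(timeout_s)
    try:
        # Returns b"" if the peer closed the connection cleanly.
        return sock.recv(bufsize)
    except socket.timeout:
        # The peer is still connected but silent; the caller decides what to do next.
        return None
```

A proxy that gets `None` back could close both connections and answer the client with a 504 Gateway Timeout, rather than holding the connection open indefinitely.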

Tim

> 
> -V
> 
> On Thu, Jul 21, 2016 at 12:31 PM, Tim Wilson-Brown - teor <teor2345 at gmail.com> wrote:
> 
> > On 21 Jul 2016, at 14:23, Virgil Griffith <i at virgil.gr> wrote:
> >
> > Does anyone want to vouch for view (A) ?  Note that view (A) is currently enshrined in the ethics guidelines.
> 
> I think you've misinterpreted the ethics guidelines here.
> "crawling" means running a HSDir to discover .onion addresses that would otherwise be private.
> It doesn't (necessarily) mean accessing web pages on .onion sites using an automated process.
> 
> Tim
> 
> >  The following are currently in conflict with (A):
> >
> > * the largest tor2web nodes
> > * MEMEX and other government programs
> > * beloved metrics applications like OnionStats
> >
> > -V
> >
> > On Tuesday, 19 July 2016, Nurmi, Juha <juha.nurmi at ahmia.fi> wrote:
> > Hi,
> >
> > Virgil pointed out several good points with onion search engines.
> >
> > 1) Anonymous vs. hidden
> >
> > >>> Whereas some would say Tor users are "anonymous", others would instead say any and everything Tor is "private".  I believe this needs to be clarified.
> >
> > I am publishing a paper about my onion service experiment: I deployed 100 onion servers and followed TCP traffic to these services. As a result, they got accessed by multiple different scanners (curl, wget, browser, scrapers, ssh). This means that some people do HSDir harvesting and scan onions.
> >
> > 2) Search engines can efficiently map content
> >
> > >>> For what it's worth, ahmia.fi actually supports regex searching right out of the box.  In fact, a single line of JSON spits out all known bitcoin addresses ahmia knows about.
> >
> > At the moment I have no public documentation on how to use regex search, but Ahmia supports this feature. Is this good or not? I know that Google has disabled these kinds of features because of privacy issues.
> >
> > 3) Is a web-site a public place?
> >
> > >>> Here's how I currently see this.  I put on my amateur legal hat and say, "Well, the Internet/world-wide-web is considered a public space.  Onion-sites are like the web, but with masked speakers."
> >
> > Good point! I think you are right.
> >
> > Best,
> > Juha
> >
> >
> > On Thu, Jul 7, 2016 at 9:28 AM, Virgil Griffith <i at virgil.gr> wrote:
> > > you might want to remove the client IP address (X-Forwarded-For) from HTTP headers
> >
> > Agreed!  And yes we already remove x-forwarded-for.
> > https://github.com/globaleaks/Tor2web/blob/master/tor2web/t2w.py#L701
> >
> > I recall that at the very, very beginning we had a python proxy library automatically adding x-forwarded-for, but once we realized it was doing that we corrected it.  FWIW, it was actually Aaron who wrote that code ;)
> >
> > AFAIK Tor2web hasn't leaked any privacy-invading headers for some time.  If any are discovered, they will be fixed ASAP.
> >
> >
> > > Is the opt-out permanent, or does your server re-check every time it connects?
> > > I can imagine there being issues with either model - one involves storing a list, the other, regular connections.
> >
> > I don't know.  This is Google/Bing's department.  Do we have someone on list familiar enough with either?  If I were to guess the Googley/Bingy-way of doing this, I'd imagine them storing the list, and then when crawling the site again they'd do a HEAD request to see if the /robots.txt has changed.  If the /robots.txt has changed, they'd overwrite their stored list.
> >
> >
> > > I am disappointed that we have a Tor2web design where Tor2web needs to connect to a hidden service first, then check if it has given permission for Tor2web to connect to it.
> >
> > /robots.txt isn't a permission to "connect to", it's a permission to crawl/index.  I'm aware of no standard within or outside of Tor to say whether node A has permission to connect to node B.  If such a standard, or even an unofficial convention, exists I'm down for spending some weekends implementing it.
> >
> > > I am also disappointed that this only works for HTTP onions on the default port 80.
> >
> > I agree completely.  But if the issue is operator privacy, isn't it even *better* that tor2web only works for port 80?  As an aside, there is tor2tcp at: https://cryptoparty.at/tor2tcp
> >
> >
> > > I am also concerned about threat models where a single unwanted connection, or a number of unwanted connections, are security factors.
> > > For example:
> > > Imagine there is an (unknown) attack which can determine 1 bit of the 1024-bit RSA key per hidden service connection.
> > > (Some known attacks on broken crypto systems are like this, as are some side-channels.)
> > > Or imagine there is an attack which can determine 1 bit of the IPv4 address per connection.
> > > Is there an alternative to position (A) that supports threat models like this?
> >
> > I don't have a good solution to this.  As stated above, I'm aware of no protocol for saying "Please don't connect to me."  The security person in me is a little skeptical how useful it would be---if someone wanted to make many connections to learn a private key, I presume she won't be obeying said requests.  However, if someone doesn't want to be connected to, upon such a standard existing I would happily abide by it.
> >
> > > there is also the possibility of exerting social pressure to prevent people from running servers that continually connect to tor hidden services.
> >
> > The closest things I know of for social pressure are:
> >
> > (1) Liberal caching headers in the HTTP response:
> >
> > ```
> > Cache-Control: max-age=604800   # can be cached by the browser and any intermediary caches for up to 1 week
> > ```
> >
> > (2) In /robots.txt putting long crawl-delays:
> >
> > ```
> > User-Agent: *
> > Crawl-delay: 86400   #wait 1 day between each fetch.
> > ```
> >
> > > I believe that a technical solution to this threat model is hidden service client authentication (and the next-generation hidden service protocol, when available).
> >
> > Agreed.
> >
> > -V
> >
> > On Thu, Jul 7, 2016 at 1:44 PM, Tim Wilson-Brown - teor <teor2345 at gmail.com> wrote:
> >
> > > On 7 Jul 2016, at 15:24, Virgil Griffith <i at virgil.gr> wrote:
> > >
> > > > How do you make sure that Tor2web users are anonymised (as possible) when accessing hidden services?
> > >
> > > I make a good faith effort not to wantonly reveal personally identifying information.  But in short, it's hard.  I urge people to think of tor2web nodes as closer to Twitter where they record what links you click.  I wholly support having the "where is Tor2web in regards to user privacy" discussion (hopefully could even make some improvements to it!), but it is orthogonal to the "robots.txt on .onion" discussion.  Let's address the robots.txt issue and then we can return to Tor2web user-privacy.
> >
> > Well, as a separate issue, you might want to remove the client IP address (X-Forwarded-For) from HTTP headers your caching proxies send to hidden services. And work out if any of the other headers are sensitive.
> >
> > > On 7 Jul 2016, at 14:40, Virgil Griffith <i at virgil.gr> wrote:
> > >
> > > So now we have *three* different positions among respected members of the Tor community.
> > >
> > > (A) isis et al: robots.txt is insufficient
> > > --- "Consent is not the absence of saying 'no' — it is explicitly saying 'yes'."
> > >
> > > (B) onionlink/ahmia/notevil/grams: we respect robots.txt
> > > --- "Default is yes, but you can always opt-out."
> >
> > Is the opt-out permanent, or does your server re-check every time it connects?
> > I can imagine there being issues with either model - one involves storing a list, the other, regular connections.
> >
> > > (C) onionstats/memex: we ignore robots.txt
> > > --- "Don't care even if you opt-out." (see https://onionscan.org/reports/may2016.html)
> > >
> > >
> > > Isis did a good job arguing for (A) by claiming that (B) and (C) are "blatant and disgusting workaround[s] to the trust and expectations which onion service operators place in the network." https://lists.torproject.org/pipermail/tor-project/2016-May/000356.html
> > >
> > > This is me arguing for (B): https://lists.torproject.org/pipermail/tor-project/2016-May/000411.html
> > >
> > > I have no link arguing for (C).
> >
> > I am disappointed that we have a Tor2web design where Tor2web needs to connect to a hidden service first, then check if it has given permission for Tor2web to connect to it. I am also disappointed that this only works for HTTP onions on the default port 80.
> >
> > I would like to see a much better design for this.
> >
> > I am also concerned about threat models where a single unwanted connection, or a number of unwanted connections, are security factors.
> > For example:
> > Imagine there is an (unknown) attack which can determine 1 bit of the 1024-bit RSA key per hidden service connection.
> > (Some known attacks on broken crypto systems are like this, as are some side-channels.)
> > Or imagine there is an attack which can determine 1 bit of the IPv4 address per connection.
> >
> > For security, a hidden service operator decides to only allow 10 connections before rolling over their hidden service to a new key and server.
> >
> > There are at least 10 connections to known .onion addresses every week, because there are at least 10 Tor2web or memex or onionstats instances on the web.
> > Therefore, every week, the operator must roll over their hidden service, and arrange to notify users of the new address in a secure fashion. Alternately, they must keep the address secret, even from the HSDir hash ring, which is not possible.
> >
> > Is there an alternative to position (A) that supports threat models like this?
> >
> > I believe that a technical solution to this threat model is hidden service client authentication (and the next-generation hidden service protocol, when available).
> > However, there is also the possibility of exerting social pressure to prevent people from running servers that continually connect to tor hidden services.
> >
> > Tim
> >
> > Tim Wilson-Brown (teor)
> >
> > teor2345 at gmail dot com
> > PGP C855 6CED 5D90 A0C5 29F6 4D43 450C BA7F 968F 094B
> > ricochet:ekmygaiu4rzgsk6n
> >
> > _______________________________________________
> > tor-project mailing list
> > tor-project at lists.torproject.org
> > https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-project
> 

Tim Wilson-Brown (teor)

teor2345 at gmail dot com
PGP C855 6CED 5D90 A0C5 29F6 4D43 450C BA7F 968F 094B
OTR 8F39BCAC 9C9DDF9A DF5FAE48 1D7D99D4 3B406880
ricochet:ekmygaiu4rzgsk6n
