[tor-project] Ethics Guidelines; crawling .onion

Thu Jul 21 04:31:09 UTC 2016

> On 21 Jul 2016, at 14:23, Virgil Griffith <i at virgil.gr> wrote:
> 
> Does anyone want to vouch for view (A)?  Note that view (A) is currently enshrined in the ethics guidelines.

I think you've misinterpreted the ethics guidelines here.
"crawling" means running a HSDir to discover .onion addresses that would otherwise be private.
It doesn't (necessarily) mean accessing web pages on .onion sites using an automated process.

Tim

>  The following are currently in conflict with (A):
> 
> * the largest tor2web nodes
> * MEMEX and other government programs
> * beloved metrics applications like OnionStats
> 
> -V
> 
> On Tuesday, 19 July 2016, Nurmi, Juha <juha.nurmi at ahmia.fi> wrote:
> Hi,
> 
> Virgil pointed out several good points with onion search engines.
> 
> 1) Anonymous vs. hidden
> 
> >>> Whereas some would say Tor users are "anonymous", others would instead say any and everything Tor is "private".  I believe this needs to be clarified.
> 
> I am publishing a paper about my onion service experiment: I deployed 100 onion servers and followed TCP traffic to these services. As a result, they got accessed by multiple different scanners (curl, wget, browser, scrapers, ssh). This means that some people do HSDir harvesting and scan onions.
> 
> 2) Search engines can efficiently map content
> 
> >>> For what it's worth, ahmia.fi actually supports regex searching right out of the box.  In fact, a single line of JSON spits out all known bitcoin addresses ahmia knows about.
> 
> At the moment I have no public documentation on how to use regex search, but Ahmia supports this feature. Is this good or not? I know that Google has disabled these kinds of features because of privacy issues.
> 
> 3) Is a web-site a public place?
> 
> >>> Here's how I currently see this.  I put on my amateur legal hat and say, "Well, the Internet/world-wide-web is considered a public space.  Onion-sites are like the web, but with masked speakers."
> 
> Good point! I think you are right.
> 
> Best,
> Juha
> 
> 
> On Thu, Jul 7, 2016 at 9:28 AM, Virgil Griffith <i at virgil.gr> wrote:
> > you might want to remove the client IP address (X-Forwarded-For) from HTTP headers
> 
> Agreed!  And yes we already remove x-forwarded-for.
> https://github.com/globaleaks/Tor2web/blob/master/tor2web/t2w.py#L701
> 
> I recall that at the very, very beginning we had a python proxy library automatically adding x-forwarded-for, but once we realized it was doing that we corrected it.  FWIW, it was actually Aaron who wrote that code ;)
> 
> AFAIK Tor2web hasn't leaked any privacy-invading headers for some time.  If any are discovered, they will be fixed ASAP.
> 
> 
> > Is the opt-out permanent, or does your server re-check every time it connects?
> > I can imagine there being issues with either model - one involves storing a list, the other, regular connections.
> 
> I don't know.  This is Google/Bing's department.  Do we have someone on list familiar enough with either?  If I were to guess the Googley/Bingy-way of doing this, I'd imagine them storing the list, and then when crawling the site again they'd do a HEAD request to see if the /robots.txt has changed and, if it has, overwrite their stored list.
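
A minimal sketch of the "store the list, re-check on the next crawl" model guessed at above, assuming the crawler has already fetched the current /robots.txt text by some means (the fetch itself, HEAD request included, is left out). The class name `OptOutCache`, the host `example.onion` and the user agent `examplebot` are hypothetical; this is not how Google, Bing, or any onion search engine is known to work.

```python
import hashlib
from urllib.robotparser import RobotFileParser


class OptOutCache:
    """Remembers each site's robots.txt and re-parses it only when it changes."""

    def __init__(self, user_agent):
        self.user_agent = user_agent
        self._entries = {}  # host -> (digest of robots.txt, parsed rules)

    def update(self, host, robots_txt):
        """Store the rules for host; return True if they were new or changed."""
        digest = hashlib.sha256(robots_txt.encode("utf-8")).hexdigest()
        cached = self._entries.get(host)
        if cached is not None and cached[0] == digest:
            return False  # unchanged: keep the stored list as it is
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self._entries[host] = (digest, parser)
        return True

    def may_crawl(self, host, path="/"):
        """Position (B): the default is yes unless the stored rules opt out."""
        cached = self._entries.get(host)
        if cached is None:
            return True  # no robots.txt seen yet for this host
        return cached[1].can_fetch(self.user_agent, path)


cache = OptOutCache("examplebot")
cache.update("example.onion", "User-agent: *\nDisallow: /\n")
print(cache.may_crawl("example.onion", "/index.html"))
```

Either way, the trade-off in the question remains: the stored list is itself a record of onion addresses, and detecting a change still costs one connection per re-check.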
> 
> 
> > I am disappointed that we have a Tor2web design where Tor2web needs to connect to a hidden service first, then check if it has given permission for Tor2web to connect to it.
> 
> /robots.txt isn't a permission to "connect to", it's a permission to crawl/index.  I'm aware of no standard within or outside of Tor to say whether node A has permission to connect to node B.  If such a standard, or even an unofficial one, exists I'm down for spending some weekends implementing it.
> 
> > I am also disappointed that this only works for HTTP onions on the default port 80.
> 
> I agree completely.  But if the issue is operator privacy, isn't it even *better* that tor2web only works for port 80?  As an aside, there is tor2tcp at: https://cryptoparty.at/tor2tcp
> 
> 
> > I am also concerned about threat models where a single unwanted connection, or a number of unwanted connections, are security factors.
> > For example:
> > Imagine there is an (unknown) attack which can determine 1 bit of the 1024-bit RSA key per hidden service connection.
> > (Some known attacks on broken crypto systems are like this, as are some side-channels.)
> > Or imagine there is an attack which can determine 1 bit of the IPv4 address per connection.
> > Is there an alternative to position (A) that supports threat models like this?
> 
> I don't have a good solution to this.  As stated above, I'm aware of no protocol for saying "Please don't connect to me."  The security person in me is a little skeptical of how useful it would be: if someone wanted to make many connections to learn a private key, I presume she won't be obeying said requests.  However, if someone doesn't want to be connected to, upon such a standard existing I would happily abide by it.
> 
> > there is also the possibility of exerting social pressure to prevent people from running servers that continually connect to tor hidden services.
> 
> The closest things I know of for social pressure are:
> 
> (1) Liberal caching headers in the HTTP response:
> 
> ```
> Cache-Control: max-age=604800   # can be cached by browser and any intermediary caches for up to 1 week
> ```
> 
> (2) In /robots.txt putting long crawl-delays:
> 
> ```
> User-Agent: *
> Crawl-delay: 86400   #wait 1 day between each fetch.
> ```
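
For what it's worth, here is a rough sketch of how a well-behaved crawler might read both hints and wait for the longer of the two. It assumes the crawler already has the /robots.txt body and the Cache-Control header value in hand; the function names and the `examplebot` user agent are made up for illustration, and nothing here forces a crawler to comply.

```python
import re
from urllib.robotparser import RobotFileParser


def crawl_delay_seconds(robots_txt, user_agent):
    """Return the Crawl-delay that applies to user_agent, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


def max_age_seconds(cache_control):
    """Return the max-age value from a Cache-Control header, or None."""
    match = re.search(r"(?:^|,)\s*max-age\s*=\s*(\d+)", cache_control, re.IGNORECASE)
    return int(match.group(1)) if match else None


def seconds_until_next_fetch(robots_txt, cache_control, user_agent):
    """Wait for the longer of the two hints; 0 if the site gave neither."""
    hints = [
        crawl_delay_seconds(robots_txt, user_agent),
        max_age_seconds(cache_control),
    ]
    return max((h for h in hints if h is not None), default=0)


robots = "User-Agent: *\nCrawl-delay: 86400   #wait 1 day between each fetch.\n"
print(seconds_until_next_fetch(robots, "max-age=604800", "examplebot"))
```

Both values are only requests, which is why they count as social pressure rather than access control.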
> 
> > I believe that a technical solution to this threat model is hidden service client authentication (and the next-generation hidden service protocol, when available).
> 
> Agreed.
> 
> -V
> 
> On Thu, Jul 7, 2016 at 1:44 PM, Tim Wilson-Brown - teor <teor2345 at gmail.com> wrote:
> 
> > On 7 Jul 2016, at 15:24, Virgil Griffith <i at virgil.gr> wrote:
> >
> > > How do you make sure that Tor2web users are anonymised (as possible) when accessing hidden services?
> >
> > I make a good faith effort not to wantonly reveal personally identifying information.  But in short, it's hard.  I urge people to think of tor2web nodes as closer to Twitter where they record what links you click.  I wholly support having the "where is Tor2web in regards to user privacy" discussion (hopefully could even make some improvements to it!), but it is orthogonal to the "robots.txt on .onion" discussion.  Let's address the robots.txt issue and then we can return to Tor2web user-privacy.
> 
> Well, as a separate issue, you might want to remove the client IP address (X-Forwarded-For) from HTTP headers your caching proxies send to hidden services. And work out if any of the other headers are sensitive.
> 
> > On 7 Jul 2016, at 14:40, Virgil Griffith <i at virgil.gr> wrote:
> >
> > So now we have *three* different positions among respected members of the Tor community.
> >
> > (A) isis et al: robots.txt is insufficient
> > --- "Consent is not the absence of saying 'no' — it is explicitly saying 'yes'."
> >
> > (B) onionlink/ahmia/notevil/grams: we respect robots.txt
> > --- "Default is yes, but you can always opt-out."
> 
> Is the opt-out permanent, or does your server re-check every time it connects?
> I can imagine there being issues with either model - one involves storing a list, the other, regular connections.
> 
> > (C) onionstats/memex: we ignore robots.txt
> > --- "Don't care even if you opt-out." (see https://onionscan.org/reports/may2016.html)
> >
> >
> > Isis did a good job arguing for (A) by claiming that (B) and (C) are "blatant and disgusting workaround[s] to the trust and expectations which onion service operators place in the network." https://lists.torproject.org/pipermail/tor-project/2016-May/000356.html
> >
> > This is me arguing for (B): https://lists.torproject.org/pipermail/tor-project/2016-May/000411.html
> >
> > I have no link arguing for (C).
> 
> I am disappointed that we have a Tor2web design where Tor2web needs to connect to a hidden service first, then check if it has given permission for Tor2web to connect to it. I am also disappointed that this only works for HTTP onions on the default port 80.
> 
> I would like to see a much better design for this.
> 
> I am also concerned about threat models where a single unwanted connection, or a number of unwanted connections, are security factors.
> For example:
> Imagine there is an (unknown) attack which can determine 1 bit of the 1024-bit RSA key per hidden service connection.
> (Some known attacks on broken crypto systems are like this, as are some side-channels.)
> Or imagine there is an attack which can determine 1 bit of the IPv4 address per connection.
> 
> For security, a hidden service operator decides to only allow 10 connections before rolling over their hidden service to a new key and server.
> 
> There are at least 10 connections to known .onion addresses every week, because there are at least 10 Tor2web or memex or onionstats instances on the web.
> Therefore, every week, the operator must roll over their hidden service, and arrange to notify users of the new address in a secure fashion. Alternately, they must keep the address secret, even from the HSDir hash ring, which is not possible.
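
The arithmetic behind this example, as a small sketch. The figures (a 10-connection budget, 10 unwanted connections per week, 1 bit leaked per connection) are the hypothetical ones from the scenario above, not measurements, and the function names are invented for illustration.

```python
import math


def weeks_until_rollover(connection_budget, unwanted_connections_per_week):
    """How long a service lasts before crawlers alone exhaust its budget."""
    return connection_budget / unwanted_connections_per_week


def connections_to_recover(secret_bits, bits_leaked_per_connection=1):
    """Connections an attacker needs if each one leaks a fixed number of bits."""
    return math.ceil(secret_bits / bits_leaked_per_connection)


# 10 allowed connections, 10 crawler connections per week: 1 week per key.
print(weeks_until_rollover(10, 10))

# 1024-bit RSA key and 32-bit IPv4 address at 1 bit per connection.
print(connections_to_recover(1024), connections_to_recover(32))
```

The point of the numbers is that the crawlers use up the whole budget before any legitimate user connects.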
> 
> Is there an alternative to position (A) that supports threat models like this?
> 
> I believe that a technical solution to this threat model is hidden service client authentication (and the next-generation hidden service protocol, when available).
> However, there is also the possibility of exerting social pressure to prevent people from running servers that continually connect to Tor hidden services.
> 
> Tim
> 
> Tim Wilson-Brown (teor)
> 
> teor2345 at gmail dot com
> PGP C855 6CED 5D90 A0C5 29F6 4D43 450C BA7F 968F 094B
> ricochet:ekmygaiu4rzgsk6n
> 
> _______________________________________________
> tor-project mailing list
> tor-project at lists.torproject.org
> https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-project

Tim Wilson-Brown (teor)

teor2345 at gmail dot com
PGP C855 6CED 5D90 A0C5 29F6 4D43 450C BA7F 968F 094B
OTR 8F39BCAC 9C9DDF9A DF5FAE48 1D7D99D4 3B406880
ricochet:ekmygaiu4rzgsk6n
