[tor-project] Ethics Guidelines; crawling .onion

Thu Jul 21 04:23:26 UTC 2016

Does anyone want to vouch for view (A)?  Note that view (A) is currently
enshrined in the ethics guidelines.  The following are currently in
conflict with (A):

* the largest tor2web nodes
* MEMEX and other government programs
* beloved metrics applications like OnionStats

-V

On Tuesday, 19 July 2016, Nurmi, Juha <juha.nurmi at ahmia.fi> wrote:

> Hi,
>
> Virgil pointed out several good points with onion search engines.
>
> 1) Anonymous vs. hidden
>
> >>> Whereas some would say Tor users are "anonymous", others would instead
> say any and everything Tor is "private".  I believe this needs to be
> clarified.
>
> I am publishing a paper about my onion service experiment: I deployed 100
> onion servers and followed TCP traffic to these services. As a result, they
> got accessed by multiple different scanners (curl, wget, browser, scrapers,
> ssh). This means that some people do HSDir harvesting and scan onions.
>
> 2) Search engines can efficiently map content
>
> >>> For what it's worth, ahmia.fi actually supports regex searching right
> out of the box.  In fact, a single line of JSON spits out all known bitcoin
> addresses ahmia knows about.
>
> At the moment I have no public documentation on how to use regex search, but
> Ahmia supports this feature. Is this good or not? I know that Google has
> disabled this kind of feature because of privacy issues.
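
To make the privacy question concrete, here is a rough sketch of the kind of pattern a regex search could run over indexed text. It is not Ahmia's query or API; the pattern and the function name are assumptions for illustration, and the pattern only matches the shape of legacy Base58 addresses without verifying any checksum.

```python
import re

# Assumed, approximate pattern for legacy Base58 Bitcoin addresses
# (starting with 1 or 3). It checks shape only, not the checksum,
# so it can produce false positives.
BITCOIN_ADDRESS = re.compile(r"\b[13][a-km-zA-HJ-NP-Z1-9]{25,34}\b")


def find_bitcoin_addresses(text):
    """Return the distinct address-shaped strings found in `text`, sorted."""
    return sorted(set(BITCOIN_ADDRESS.findall(text)))
```

Anyone able to run a pattern like this against a full index can collect every match in one pass, which is why exposing the feature publicly is a judgment call.
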
>
> 3) Is a web-site a public place?
>
> >>> Here's how I currently see this.  I put on my amateur legal hat and
> say, "Well, the Internet/world-wide-web is considered a public space.
> Onion-sites are like the web, but with masked speakers."
>
> Good point! I think you are right.
>
> Best,
> Juha
>
>
> On Thu, Jul 7, 2016 at 9:28 AM, Virgil Griffith <i at virgil.gr> wrote:
>
>> > you might want to remove the client IP address (X-Forwarded-For) from
>> HTTP headers
>>
>> Agreed!  And yes we already remove x-forwarded-for.
>> https://github.com/globaleaks/Tor2web/blob/master/tor2web/t2w.py#L701
>>
>> I recall that at the very, very beginning we had a python proxy library
>> automatically adding x-forwarded-for, but once we realized it was doing
>> that we corrected it.  FWIW, it was actually Aaron who wrote that code ;)
>>
>> AFAIK Tor2web hasn't leaked any privacy-invading headers for some time.
>> If any are discovered they would be fixed ASAP.
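
As a minimal sketch of the header scrubbing discussed here (this is not the Tor2web code linked above), a proxy could drop client-identifying headers before forwarding a request to an onion service. The function name and the header list are assumptions for illustration; only X-Forwarded-For is named in this thread.

```python
# Hypothetical helper, not taken from Tor2web. Every header other than
# X-Forwarded-For is part of an assumed, non-exhaustive list.
IDENTIFYING_HEADERS = {"x-forwarded-for", "forwarded", "x-real-ip", "via"}


def strip_identifying_headers(headers):
    """Return a copy of `headers` without client-identifying entries.

    Header names are compared case-insensitively; the input dict is
    left unmodified.
    """
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in IDENTIFYING_HEADERS
    }
```

A blocklist like this only covers headers someone has thought of. Forwarding an explicit allowlist of headers would be the stricter design.
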
>>
>>
>> > Is the opt-out permanent, or does your server re-check every time it
>> connects?
>> > I can imagine there being issues with either model - one involves
>> storing a list, the other, regular connections.
>>
>> I don't know.  This is Google/Bing's department.  Do we have someone on
>> the list familiar enough with either?  If I were to guess the
>> Googley/Bingy-way of doing this, I'd imagine them storing the list, and
>> then when crawling the site again they'd do a HEAD request to see if the
>> /robots.txt has changed.  If it has changed, they'd overwrite their
>> stored list.
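
The guess above can be sketched as a comparison of HTTP validators. This is only one possible reading of "do a HEAD request to see if it has changed", not a description of how any search engine actually behaves; the function name and the dict-based interface are assumptions.

```python
def robots_txt_changed(stored, head_headers):
    """Guess whether /robots.txt changed since it was last stored.

    `stored` holds the validators saved with the cached copy and
    `head_headers` holds the headers of a fresh HEAD response; both are
    dicts keyed by lower-cased header name. If the server sends no
    validator at all, assume the file changed so the crawler re-fetches.
    """
    for validator in ("etag", "last-modified"):
        new_value = head_headers.get(validator)
        if new_value is not None:
            return stored.get(validator) != new_value
    return True
```

Note that this still requires one connection per re-check, which is exactly the trade-off raised in the question.
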
>>
>>
>> > I am disappointed that we have a Tor2web design where Tor2web needs to
>> connect to a hidden service first, then check if it has given permission
>> for Tor2web to connect to it.
>>
>> /robots.txt isn't a permission to "connect to", it's a permission to
>> crawl/index.  I'm aware of no standard within or outside of Tor to say
>> whether node A has permission to connect to node B.  If such a standard, or
>> even an unofficial one, exists I'm down for spending some weekends
>> implementing it.
>>
>> > I am also disappointed that this only works for HTTP onions on the
>> default port 80.
>>
>> I agree completely.  But if the issue is operator privacy, isn't it even
>> *better* that tor2web only works for port 80?  As an aside, there is
>> tor2tcp at: https://cryptoparty.at/tor2tcp
>>
>>
>> > I am also concerned about threat models where a single unwanted
>> connection, or a number of unwanted connections, are security factors.
>> > For example:
>> > Imagine there is an (unknown) attack which can determine 1 bit of the
>> 1024-bit RSA key per hidden service connection.
>> > (Some known attacks on broken crypto systems are like this, as are some
>> side-channels.)
>> > Or imagine there is an attack which can determine 1 bit of the IPv4
>> address per connection.
>> > Is there an alternative to position (A) that supports threat models
>> like this?
>>
>> I don't have a good solution to this.  As stated above, I'm aware of no
>> protocol for saying "Please don't connect to me."  The security person in
>> me is a little skeptical how useful it would be---if someone wanted to make
>> many connections to learn a private key, I presume she won't be obeying
>> said requests.  However, if someone doesn't want to be connected to, upon
>> such a standard existing I would happily abide by it.
>>
>> > there is also the possibility of exerting social pressure to prevent
>> people from running servers that continually connect to tor hidden services.
>>
>> The closest things I know of for social pressure are:
>>
>> (1) Liberal caching headers in the HTTP response:
>>
>> ```
>> # can be cached by the browser and any intermediary caches for up to 1 week
>> Cache-Control: max-age=604800
>> ```
>>
>> (2) In /robots.txt putting long crawl-delays:
>>
>> ```
>> User-Agent: *
>> Crawl-delay: 86400   #wait 1 day between each fetch.
>> ```
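
For illustration, this is roughly how a well-behaved crawler could read such a file using Python's standard `urllib.robotparser`. The `Disallow` line and the `examplebot` user agent are additions for the example and do not appear in the thread.

```python
from urllib.robotparser import RobotFileParser

# The first two lines mirror the example above; the Disallow rule is an
# assumed addition to show path-level opt-out.
ROBOTS_TXT = """\
User-Agent: *
Crawl-delay: 86400
Disallow: /private/
"""


def load_policy(robots_txt):
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


policy = load_policy(ROBOTS_TXT)
delay_seconds = policy.crawl_delay("examplebot")
may_fetch_index = policy.can_fetch("examplebot", "/index.html")
may_fetch_private = policy.can_fetch("examplebot", "/private/page.html")
```

Both mechanisms are advisory: they only affect crawlers that choose to read and honor the file.
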
>>
>> > I believe that a technical solution to this threat model is hidden
>> service client authentication (and the next-generation hidden service
>> protocol, when available).
>>
>> Agreed.
>>
>> -V
>>
>> On Thu, Jul 7, 2016 at 1:44 PM, Tim Wilson-Brown - teor
>> <teor2345 at gmail.com> wrote:
>>
>>>
>>> > On 7 Jul 2016, at 15:24, Virgil Griffith <i at virgil.gr> wrote:
>>> >
>>> > > How do you make sure that Tor2web users are anonymised (as possible)
>>> when accessing hidden services?
>>> >
>>> > I make a good faith effort not to wantonly reveal personally
>>> identifying information.  But in short, it's hard.  I urge people to think
>>> of tor2web nodes as closer to Twitter where they record what links you
>>> click.  I wholly support having the "where is Tor2web in regards to user
>>> privacy" discussion (hopefully could even make some improvements to it!),
>>> but it is orthogonal to the "robots.txt on .onion" discussion.  Let's
>>> address the robots.txt issue and then we can return to Tor2web user-privacy.
>>>
>>> Well, as a separate issue, you might want to remove the client IP
>>> address (X-Forwarded-For) from HTTP headers your caching proxies send to
>>> hidden services. And work out if any of the other headers are sensitive.
>>>
>>> > On 7 Jul 2016, at 14:40, Virgil Griffith <i at virgil.gr> wrote:
>>> >
>>> > So now we have *three* different positions among respected members of
>>> the Tor community.
>>> >
>>> > (A) isis et al: robots.txt is insufficient
>>> > --- "Consent is not the absence of saying 'no' — it is explicitly
>>> saying 'yes'."
>>> >
>>> > (B) onionlink/ahmia/notevil/grams: we respect robots.txt
>>> > --- "Default is yes, but you can always opt-out."
>>>
>>> Is the opt-out permanent, or does your server re-check every time it
>>> connects?
>>> I can imagine there being issues with either model - one involves
>>> storing a list, the other, regular connections.
>>>
>>> > (C) onionstats/memex: we ignore robots.txt
>>> > --- "Don't care even if you opt-out." (see
>>> https://onionscan.org/reports/may2016.html)
>>> >
>>> >
>>> > Isis did a good job arguing for (A) by claiming that (B)
>>> and (C) are "blatant and disgusting workaround[s] to the trust and
>>> expectations which onion service operators place in the network."
>>> https://lists.torproject.org/pipermail/tor-project/2016-May/000356.html
>>> >
>>> > This is me arguing for (B):
>>> https://lists.torproject.org/pipermail/tor-project/2016-May/000411.html
>>> >
>>> > I have no link arguing for (C).
>>>
>>> I am disappointed that we have a Tor2web design where Tor2web needs to
>>> connect to a hidden service first, then check if it has given permission
>>> for Tor2web to connect to it. I am also disappointed that this only works
>>> for HTTP onions on the default port 80.
>>>
>>> I would like to see a much better design for this.
>>>
>>> I am also concerned about threat models where a single unwanted
>>> connection, or a number of unwanted connections, are security factors.
>>> For example:
>>> Imagine there is an (unknown) attack which can determine 1 bit of the
>>> 1024-bit RSA key per hidden service connection.
>>> (Some known attacks on broken crypto systems are like this, as are some
>>> side-channels.)
>>> Or imagine there is an attack which can determine 1 bit of the IPv4
>>> address per connection.
>>>
>>> For security, a hidden service operator decides to only allow 10
>>> connections before rolling over their hidden service to a new key and
>>> server.
>>>
>>> There are at least 10 connections to known .onion addresses every week,
>>> because there are at least 10 Tor2web or memex or onionstats instances on
>>> the web.
>>> Therefore, every week, the operator must roll over their hidden service,
>>> and arrange to notify users of the new address in a secure fashion.
>>> Alternately, they must keep the address secret, even from the HSDir hash
>>> ring, which is not possible.
>>>
>>> Is there an alternative to position (A) that supports threat models like
>>> this?
>>>
>>> I believe that a technical solution to this threat model is hidden
>>> service client authentication (and the next-generation hidden service
>>> protocol, when available).
>>> However, there is also the possibility of exerting social pressure to
>>> prevent people from running servers that continually connect to tor hidden
>>> services.
>>>
>>> Tim
>>>
>>> Tim Wilson-Brown (teor)
>>>
>>> teor2345 at gmail dot com
>>> PGP C855 6CED 5D90 A0C5 29F6 4D43 450C BA7F 968F 094B
>>> ricochet:ekmygaiu4rzgsk6n
>>>
>>> _______________________________________________
>>> tor-project mailing list
>>> tor-project at lists.torproject.org
>>> https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-project
>>>
>>>
>>
>