[tor-dev] Sanitizing and publishing our web server logs
Karsten Loesing
karsten.loesing at gmx.net
Mon Sep 5 08:28:35 UTC 2011
On 9/2/11 7:32 PM, Marsh Ray wrote:
> On 08/25/2011 03:08 AM, Karsten Loesing wrote:
>> Hi everyone,
>>
>> we have been discussing sanitizing and publishing our web server logs
>> for quite a while now. The idea is to remove all potentially sensitive
>> parts from the logs, publish them in monthly tarballs on the metrics
>> website, and analyze them for top visited pages, top downloaded
>> packages, etc. See the tickets #1641 and #2489 for details.
>
> Why?
>
> I.e., what are the great benefits hoped to arise from such publication
> to outweigh the considerable risks?

The benefits are, for example, that we learn more about our website
visitors and can make our websites more useful for them. We can also
learn which packages users download, including their platforms and
languages, which may help us focus our efforts. These are just two
examples, but I think we agree that analyzing web logs does provide a
benefit.

Our general approach to analyzing potentially sensitive data is to
openly discuss the algorithm for removing any sensitive parts, make the
resulting data publicly available, and analyze only that published data.
Ideally, we would not collect the sensitive parts at all, but sometimes
that's not feasible (IP addresses in bridge descriptors, request order
in web server logs), so we need to post-process the data before
publication.
I think the overall risk of our approach is considerably lower than
that of trying to keep the data you're planning to analyze private,
because there's always the risk of losing data.
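
To make the post-processing idea more concrete, here is a rough sketch in
Python. It is only an illustration, not the algorithm discussed in the
tickets: it assumes Apache's "combined" log format and a server that logs
in UTC, and the names sanitize_line and sanitize_log are made up for this
example. It replaces the client address, drops the user name, referrer
and user agent, strips query strings, zeroes the time of day, and sorts
the result so that the original request order is no longer visible.

```python
import re

# Assumption: Apache "combined" log format; the real logs may look different.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<day>[^:\]]+):[^\]]*\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def sanitize_line(line):
    """Return a sanitized copy of one log line, or None if it cannot be parsed."""
    m = LOG_LINE.match(line)
    if m is None:
        # Drop anything we do not understand rather than publish it as-is.
        return None
    # Query strings may carry user-specific values, so keep only the path.
    path = m.group("path").split("?", 1)[0]
    # The client IP becomes 0.0.0.0, the user name becomes "-", the time of
    # day is zeroed, and referrer and user agent are not copied at all.
    return '0.0.0.0 - - [{day}:00:00:00 +0000] "{method} {path} {proto}" {status} {size}'.format(
        day=m.group("day"),
        method=m.group("method"),
        path=path,
        proto=m.group("proto"),
        status=m.group("status"),
        size=m.group("size"),
    )


def sanitize_log(lines):
    """Sanitize all lines and sort them to hide the original request order."""
    sanitized = [s for s in (sanitize_line(l) for l in lines) if s is not None]
    sanitized.sort()
    return sanitized
```

Sorting is one simple way to hide request order once the timestamps have
been truncated to the day; other approaches are possible, and whether
this is sufficient is exactly the kind of question to discuss openly
before publishing anything.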

See this paper and website for a better answer:

https://metrics.torproject.org/papers/wecsr10.pdf
https://metrics.torproject.org/formats.html

What are the considerable risks you're referring to?

Best,
Karsten