[tor-dev] Sanitizing and publishing our web server logs

Karsten Loesing karsten.loesing at gmx.net
Mon Sep 5 07:51:42 UTC 2011


On 9/2/11 3:06 PM, Sebastian Hahn wrote:
> 
> On Sep 2, 2011, at 2:46 PM, Karsten Loesing wrote:
> 
>> Hi Andrew,
>>
>> On 9/2/11 2:18 AM, Andrew Lewman wrote:
>>> On Thursday, August 25, 2011 04:08:00 Karsten Loesing wrote:
>>>> we have been discussing sanitizing and publishing our web server logs
>>>> for quite a while now.  The idea is to remove all potentially sensitive
>>>> parts from the logs, publish them in monthly tarballs on the metrics
>>>> website, and analyze them for top visited pages, top downloaded
>>>> packages, etc.  See the tickets #1641 and #2489 for details.
>>>
>>> My concern is that we have the data at all.  We shouldn't have any
>>> sensitive information logged on the webservers. Therefore sanitizing the
>>> logs should not be necessary.
>>
>> My concern is that we remove details from the logs and learn in a few
>> months that we wanted to analyze them.  I'd like to sanitize the
>> existing logs first, make them available for people to analyze, and only
>> change the Apache configuration once we're really sure we found the
>> level of detail that we want.  There's no rush in changing the Apache
>> configuration now, right?
> 
> So, if we decide in a few months that we need more detail, we can
> change the logging then. Sure, we won't have history, but that just
> means that the graphs we make start in 2012 instead of 2007.

You're right.  Once we change the logging we'll only have graphs from
then on.  But there's no immediate need to change the logging now.  We
can still do that a few months from now, once we have more experience
with the sanitizing process (which we need anyway, if only for
reordering requests) and the subsequent analysis.
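
To make the sanitizing step a bit more concrete, here is a minimal
sketch in Python.  It is an illustration only, not the script we will
actually run: the choice of which fields to blank out, the 0.0.0.0
placeholder, truncating timestamps to the day, and the names
sanitize_line and sanitize_log are all assumptions made for this
example.  It parses Combined Log Format lines, drops the client
address, user, referer and user agent, and reorders the remaining
requests by sorting them so that the original request order is lost.

```python
import re

# Combined Log Format:
#   host ident user [time] "request" status bytes "referer" "user-agent"
# Only the day part of the timestamp is captured (assumption: the time
# of day is considered sensitive and is dropped).
CLF = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<day>[^:\]]+):[^\]]*\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)


def sanitize_line(line):
    """Return a sanitized copy of one log line, or None if it does not parse."""
    m = CLF.match(line)
    if m is None:
        return None
    # Hypothetical policy: placeholder address, no user, no time of day,
    # no referer, no user agent.  The output is still Combined Log Format.
    return (
        '0.0.0.0 - - [{day}:00:00:00 +0000] "{request}" {status} {size} "-" "-"'
        .format(**m.groupdict())
    )


def sanitize_log(lines):
    """Sanitize all parseable lines and reorder them by sorting."""
    kept = [s for s in map(sanitize_line, lines) if s is not None]
    # Sorting discards the original order in which requests arrived.
    return sorted(kept)
```

Lines that do not parse are dropped rather than passed through, which
errs on the side of not publishing anything unexpected.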

>> Finally, we'll have to find a way to encode the country code in the logs
>> and still keep Apache's Combined Log Format.  And do we still care about
>> the HTTP vs. HTTPS bit?  Because if we use the IP column for the country
>> code, we'll have to encode the HTTP/HTTPS thing somewhere else.
> 
> IP addresses have plenty of bits for a country code and http/https
> encoding, we could for example use the first bytes for country code.

Sounds like a hack to me (not that I'm too opposed to it).  How do other
people encode country codes in Apache logs?
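
For the sake of discussion, here is one way the suggestion could look,
as a small Python sketch.  The layout is purely an assumption made up
for this example: the first two octets carry the ASCII values of the
two-letter country code, the third octet is 1 for HTTPS and 0 for
HTTP, and the fourth octet is unused.  The function names
encode_pseudo_ip and decode_pseudo_ip are hypothetical, too.

```python
from typing import Tuple


def encode_pseudo_ip(country_code: str, https: bool) -> str:
    """Pack a two-letter country code and an HTTPS flag into a fake IPv4 address."""
    cc = country_code.upper()
    if len(cc) != 2 or not cc.isascii() or not cc.isalpha():
        raise ValueError("expected a two-letter ASCII country code")
    # Octets 1 and 2: ASCII values of the letters; octet 3: HTTPS flag.
    return "{}.{}.{}.0".format(ord(cc[0]), ord(cc[1]), 1 if https else 0)


def decode_pseudo_ip(addr: str) -> Tuple[str, bool]:
    """Recover the country code and HTTPS flag from a fake IPv4 address."""
    first, second, flag, _unused = (int(part) for part in addr.split("."))
    return chr(first) + chr(second), flag == 1
```

With this layout, a request from Germany over HTTPS would be logged as
68.69.1.0, and the log line would still parse as Combined Log Format.
Anyone analyzing the logs would have to know that the address column
no longer contains real addresses.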

>> So, it should be possible to implement GeoIP lookups in the future.  I'd
>> like to consider that a separate task from sanitizing the existing web
>> logs, though.
> 
> It's separate, but without the on-the-fly geoip lookups we won't have
> any, because the sanitizing process doesn't get them magically.

Right.  I'm just trying to keep the scope of this first discussion round
small to speed things up.  This is something to revisit a few months
from now.

Thanks for your comments!

Best,
Karsten

