[tor-bugs] #1641 [Metrics]: Make website logs available in the Metrics Portal
Tor Bug Tracker & Wiki
torproject-admin at torproject.org
Wed Feb 2 10:52:19 UTC 2011
#1641: Make website logs available in the Metrics Portal
---------------------+------------------------------------------------------
Reporter: karsten | Owner: karsten
Type: task | Status: assigned
Priority: minor | Milestone:
Component: Metrics | Version:
Keywords: | Points:
Parent: |
---------------------+------------------------------------------------------
Changes (by karsten):
* status: new => assigned
* owner: Karsten => karsten
Comment:
I looked at a web log sample from January 30 from one of our three
current www servers. Here's a sample line:
{{{
0.0.0.1 - - [30/Jan/2011:00:00:00 +0000] "GET /projects/projects.html.en
HTTP/1.1" 200 3029 "https://www.torproject.org/docs/bridges.html.en" "-"
}}}
The format is Apache's Combined Log Format with the following exceptions:
- The client IP address is replaced with either 0.0.0.0 for HTTP requests
or 0.0.0.1 for HTTPS requests.
- The request time is set to 00:00:00 +0000.
- The user-agent string is set to "-".
However, I found CONNECT requests and other non-GET requests in the logs,
which are potentially sensitive. Also, the referer string may be
sensitive, especially if it's a non-Tor URL. We should remove all log
lines except GET requests and set the referer string to "-".
An even better approach is to define the information we want to keep:
- We publish only GET requests with the following data fields:
- 0.0.0.0 for HTTP requests or 0.0.0.1 for HTTPS requests,
- the request date,
- the requested URL,
- the HTTP version,
- the server's HTTP status code, and
- the size of the returned object.
We retain Apache's Combined Log Format for the sanitized logs, so that we
can use standard web log analysis tools.
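
To make the whitelist above concrete, here is a minimal Python sketch of
how a single log line could be sanitized. It is only an illustration, not
the metrics-db parser: the names COMBINED and sanitize_line are made up,
and it assumes each log entry is on one line (the sample above is only
wrapped by the mail client). Dropping lines whose address field is not
already 0.0.0.0 or 0.0.0.1 is my own conservative assumption, as is
dropping anything the regular expression cannot parse.

```python
import re

# Apache Combined Log Format:
#   host ident user [time] "request" status bytes "referer" "user-agent"
# Hypothetical pattern; lines it cannot parse are dropped, not guessed at.
COMBINED = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<day>[^:\]]+):[^\]]*\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<proto>HTTP/\d\.\d)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) "[^"]*" "[^"]*"$'
)


def sanitize_line(line):
    """Return a sanitized Combined Log Format line, or None to drop it."""
    m = COMBINED.match(line.strip())
    if m is None or m.group("method") != "GET":
        return None  # keep GET requests only
    if m.group("host") not in ("0.0.0.0", "0.0.0.1"):
        return None  # assumption: never pass through a real client address
    # Rebuild the line from whitelisted fields only; the time of day,
    # referer and user-agent are replaced with fixed values.
    return '%s - - [%s:00:00:00 +0000] "GET %s %s" %s %s "-" "-"' % (
        m.group("host"), m.group("day"), m.group("url"),
        m.group("proto"), m.group("status"), m.group("size"))
```

Building the output from the whitelisted fields, rather than deleting
fields from the input, means that anything we did not think about is
dropped by default.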
Runa has web server log analysis on her TODO list. I explained this
approach to her yesterday. She agreed to settle on a format like the one
above and said that she'll find a way to work with it.
How do we proceed? Andrew says the sanitizing process cannot take place
on the web servers, because they are quite busy already. Can we set up
copying our web server logs to yatei to do the sanitizing there? I can
write a parser as part of metrics-db and make daily updated sanitized web
logs available in the metrics portal. I also want to make a graph on
downloaded packages per day available on the metrics website. Once Runa
starts her web server log analysis, we can extend this setup to copy the
web server logs, either from the web servers or from yatei, to wherever
she does the analysis.
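
For the downloads-per-day graph, a rough sketch of the counting step over
sanitized log lines could look like this. The suffix list and the name
count_downloads are placeholders I made up for illustration; which files
count as packages, and whether to count only status 200, still needs to
be decided.

```python
from collections import Counter

# Hypothetical list of package file suffixes; the real list is undecided.
PACKAGE_SUFFIXES = (".exe", ".dmg", ".tar.gz")


def count_downloads(lines):
    """Count successful GET requests for package files per log date."""
    counts = Counter()
    for line in lines:
        # Splitting on '"' gives: prefix, request, ' status size ', ...
        parts = line.split('"')
        if len(parts) < 3 or "[" not in parts[0]:
            continue
        request = parts[1].split()
        result = parts[2].split()
        if len(request) != 3 or request[0] != "GET":
            continue
        # Assumption: only complete responses (status 200) count.
        if not result or result[0] != "200":
            continue
        day = parts[0].split("[", 1)[1].split(":", 1)[0]
        path = request[1].split("?", 1)[0]  # ignore any query string
        if path.endswith(PACKAGE_SUFFIXES):
            counts[day] += 1
    return counts
```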
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/1641#comment:1>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online