[tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors
Tor Bug Tracker & Wiki
blackhole at torproject.org
Tue Aug 15 09:59:31 UTC 2017
#23243: write a spec for web-server-access log descriptors
-------------------------------------+-----------------------------------
Reporter: iwakeh | Owner: metrics-team
Type: enhancement | Status: needs_information
Priority: Medium | Milestone:
Component: Metrics/Metrics website | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-------------------------------------+-----------------------------------
Comment (by karsten):
Replying to [ticket:23243 iwakeh]:
> This document should answer the following questions:
Good idea to start such a document! I'll start filling in information below.
> * What will the raw input data look like?
> - compressed logs
Very likely, though compression shouldn't be a strict requirement.
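To illustrate, here's a minimal sketch of how a reader could accept both
compressed and uncompressed logs. The name `open_log` and the suffix table
are made up for this example, and which compression formats the hosts
actually use is an assumption we'd have to confirm with the admins.

```python
import gzip
import lzma
from pathlib import Path

# Hypothetical mapping from file suffix to opener. The ticket does not say
# which compression formats the web servers use, so this is an assumption.
_OPENERS = {".gz": gzip.open, ".xz": lzma.open}


def open_log(path):
    """Open a log file as text, decompressing it if the suffix suggests so.

    Files with an unknown suffix are opened as plain text, so compression
    is supported but not required.
    """
    opener = _OPENERS.get(Path(path).suffix, open)
    return opener(path, mode="rt", encoding="utf-8", errors="replace")
```

Callers would then iterate over `open_log(path)` line by line without
caring whether the file was compressed.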
> - varying dates in log-lines despite the file being tagged with a
>   single date
Yes, to a certain degree. We'll have to ask the admins for details, but I
believe that the date in the file name is set when the log is rotated and
that the date on each line is when the host started processing the request.
So it's possible that some requests are received before midnight and
completed after midnight. And depending on when the log is rotated, it's
possible that some requests are started on the day before the log was
rotated and finished after the rotation.
> - are there only GET log-lines of 200 responses to be expected?
No, there might be other methods and other response codes.
> - size could be huge (in future)
Yes.
> - exact input format (if possible to define)
Good question. We should ideally support Apache's Combined Log Format,
even though we'd currently only receive Tor's privacy* log formats:
{{{
LogFormat "0.0.0.0 - %u %{[%d/%b/%Y:00:00:00 %z]}t \"%r\" %>s %b \"%{Referer}i\" \"-\" %{Age}o" privacy
LogFormat "0.0.0.1 - %u %{[%d/%b/%Y:00:00:00 %z]}t \"%r\" %>s %b \"%{Referer}i\" \"-\" %{Age}o" privacyssl
LogFormat "0.0.0.2 - %u %{[%d/%b/%Y:00:00:00 %z]}t \"%r\" %>s %b \"%{Referer}i\" \"-\" %{Age}o" privacyhs
}}}
And there's already the first contradiction: The `%{Age}o` part is not
contained in the Combined Log Format:
{{{
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined
}}}
Maybe we require lines to start with the Common Log Format and ignore any
further fields? Needs discussion.
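As input for that discussion, here's a rough sketch of what "start with the
Common Log Format and ignore any further fields" could mean in practice.
The regular expression and the name `parse_clf_prefix` are my own
illustration, not something we've agreed on, and real logs may contain
lines this pattern doesn't anticipate.

```python
import re

# Common Log Format prefix: %h %l %u %t "%r" %>s %b
# Anything after the size field (referer, user agent, Age, ...) is matched
# by "rest" and then dropped. Illustrative pattern, not an agreed spec.
CLF_PREFIX = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>(?:[^"\\]|\\.)*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?P<rest> .*)?$'
)


def parse_clf_prefix(line):
    """Return the Common Log Format fields of a line, or None if it doesn't match."""
    match = CLF_PREFIX.match(line.rstrip("\n"))
    if match is None:
        return None
    fields = match.groupdict()
    fields.pop("rest")  # trailing fields are deliberately ignored
    return fields
```

With this approach a privacy* line and a Combined Log Format line both
yield the same seven fields, and lines that don't match would have to be
handled separately (dropped or reported).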
> - meta-data is provided in paths and filenames
Yep.
> - ...
> * What will sanitized stored (on disk) logs look like?
> - cleaned log-lines, define exact format, give examples (as this might
>   deviate from the current python sanitation)
> - meta-data is provided in paths and filenames
> - should files be reassembled, i.e., only log lines of a given date in
>   a descriptor for that log date?
Yes! That's important! Otherwise we'd leak which lines for a given date
were logged before and which after rotating logs. That would narrow down
the time frame of a request to much less than 24 hours. We'll have to do
this.
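Here's a minimal sketch of that reassembly step: group lines by the date
in their own timestamp rather than by the file they arrived in. The names
`log_date` and `group_by_date` are made up for illustration, and the
timestamp format is assumed to match the `%t` output of the formats above.

```python
from collections import defaultdict
from datetime import datetime

# Assumed timestamp layout inside the brackets, e.g. 15/Aug/2017:00:00:00 +0000.
# Note that %b parsing depends on the locale (English month names expected).
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def log_date(line):
    """Return the date from the line's bracketed timestamp, or None."""
    start = line.find("[")
    if start == -1:
        return None
    end = line.find("]", start + 1)
    if end == -1:
        return None
    try:
        return datetime.strptime(line[start + 1:end], TIME_FORMAT).date()
    except ValueError:
        return None


def group_by_date(lines):
    """Collect lines from any number of input files into per-date lists."""
    by_date = defaultdict(list)
    for line in lines:
        day = log_date(line)
        if day is not None:  # lines without a parsable date are skipped here
            by_date[day].append(line)
    return dict(by_date)
```

Each per-date list would then become one descriptor, regardless of which
rotated file the lines came from.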
> - should storage (on disk) be in compressed files (opposed to storing
>   other descriptors uncompressed)?
Yes. Configurable by the application, but yes.
> - Should such log be stored (on disk) in reasonably sized chunks (once
>   a GB size is reached)?
No, compression should already reduce the size enough so that we'll never
run into such sizes. Never!
> - ...
>
> Please add more.
Looks like a good start! Will add more as more comes to mind.
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/23243#comment:2>