[metrics-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors
Tor Bug Tracker & Wiki
blackhole at torproject.org
Tue Sep 5 20:10:34 UTC 2017
#23243: write a spec for web-server-access log descriptors
-------------------------------------+------------------------------
Reporter: iwakeh | Owner: metrics-team
Type: enhancement | Status: needs_review
Priority: Medium | Milestone:
Component: Metrics/Metrics website | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-------------------------------------+------------------------------
Comment (by iwakeh):
The actual date (i.e., the system date) only matters for publishing the
logs. All other date checks are relative to the date on which the
(original) log was finalized. I introduced the term 'reference date' for
this.
The diff:
{{{
--- webstats-spec.3.txt
+++ webstats-spec.4.txt
@@ -33,7 +33,7 @@
Tor's webservers are configured to rotate logs at least once per day,
which does not necessarily happen at 00:00:00 UTC. As a result, log files
may contain requests from up to two UTC days and several log files may
contain requests that have been started on the same UTC day.
-All access log files written by Tor's webservers follow the naming
convention <hostname>.torproject.org-access.log-YYYYMMDD.
+All access log files written by Tor's webservers follow the naming
convention <hostname>.torproject.org-access.log-YYYYMMDD, where 'YYYYMMDD'
is the date of the rotation and finalization of the log file. This date
will be referred to as 'reference date' in the following sections.
# Sanitizing steps
@@ -41,16 +41,16 @@
## Discarding non-matching files
-As first safeguard against publishing log files that are too sensitive,
we discard all files not matching the naming convention for access logs.
This is to prevent, for example, error logs from slipping through.
+As first safeguard against publishing log files that are too sensitive,
we discard all files not matching the naming convention for access logs.
This is to prevent, for example, error logs from slipping through. In
addition, the log file's name is supposed to contain the reference date,
which is used to determine the validity of log lines. If the log file's
name doesn't end in a date string of the format 'YYYYMMDD' the entire file
is discarded.
## Discarding non-matching lines
-Log files are expected to contain exactly 1 request per line. We process
these files line by line and discard any lines not matching the following
criteria:
+Log files are expected to contain exactly 1 request per line. We process
these files line by line and discard any lines not matching the following
criteria:
- Lines begin with Apache's Common Log Format ("%h %l %u %t \"%r\" %>s
%b") or a compatible format like one of Tor's privacy formats. It is
acceptable if lines start with a format that is compatible to the Common
Log Format and continue with additional fields. Those additional fields
will later be discarded, but the line will not be discarded because of
them.
- The request IP address starts with "0.0.0.", followed by any number
between 0 and 255.
- - The time the request was received does not lie in the future.
- - The date the request was received, after converting the request time
to UTC, does not lie more than 1 day in the past. (Bulk imports of
archived logs are exempt from this requirement.)
+ - The time the request was received does not lie in the future of the
reference date.
+ - The date the request was received, after converting the request time
to UTC, does not lie more than 1 day in the past of the reference date.
- The request protocol is HTTP.
- The request method is either GET or HEAD.
- The final status of the request is neither 400 ("Bad Request") nor 404
("Not Found").
@@ -83,7 +83,7 @@
<virtual-host>/YYYY/MM/<virtual-host>-<physical-host>-access.log-
YYYYMMDD[.xz]
-Due to the fact that the date when a log file was rotated and the start
date of contained requests may not always overlap, we need to delay
publishing sanitized log files until the start date of requests in UTC
plus 2 days. After this delay, all log files containing requests from that
date are assumed to be processed. Sanitized log files are published and
not further modified in the future. (Again, bulk imports of archived logs
are exempt from this.)
+Due to the fact that the date when a log file was rotated and the start
date of contained requests may not always overlap, we need to delay
publishing sanitized log files until the start date of requests in UTC
plus 2 days. After this delay, all log files containing requests from that
date are assumed to be processed.
As last and certainly not least important sanitizing step, all rewritten
log lines are sorted alphabetically, so that request order cannot be
inferred from sanitized log files.
}}}
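To make the reference-date handling in the diff above concrete, here is a rough Python sketch. It is only an illustration, not the actual sanitizer code: the function names and the regular expression are made up, and I assume that "in the future of the reference date" is checked at UTC-day granularity, which the spec text does not state explicitly.

```python
import re
from datetime import date, datetime, timedelta, timezone
from typing import Optional

# Hypothetical pattern for '<hostname>.torproject.org-access.log-YYYYMMDD'.
LOG_NAME = re.compile(
    r"^(?P<host>[^/]+)\.torproject\.org-access\.log-(?P<date>\d{8})$")


def reference_date(file_name: str) -> Optional[date]:
    """Return the reference date from the file name, or None if the
    file does not match the naming convention and should be discarded."""
    match = LOG_NAME.match(file_name)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group("date"), "%Y%m%d").date()
    except ValueError:
        # Eight digits that do not form a valid calendar date.
        return None


def parse_clf_time(field: str) -> datetime:
    """Parse the %t field of the Common Log Format (without brackets),
    e.g. '05/Sep/2017:12:00:00 +0000'. Assumes English month names."""
    return datetime.strptime(field, "%d/%b/%Y:%H:%M:%S %z")


def request_date_is_valid(request_time: datetime, ref: date) -> bool:
    """Apply the two date criteria relative to the reference date:
    not in the future, and not more than 1 day in the past.
    Assumption: both are compared on the UTC date of the request."""
    request_day = request_time.astimezone(timezone.utc).date()
    return ref - timedelta(days=1) <= request_day <= ref


ref = reference_date("www.torproject.org-access.log-20170905")
if ref is not None:
    ok = request_date_is_valid(
        parse_clf_time("04/Sep/2017:23:59:59 +0000"), ref)
```

With this reading, a file named like an error log or ending in an invalid date yields no reference date and is dropped as a whole, while individual lines are kept only if their UTC request date is the reference date or the day before.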
Also, I don't see the need to state that the files won't be changed in
the future. That doesn't seem to belong in a spec. Besides, we might want
to re-sanitize these files if a privacy issue suddenly turns up in fields
that seem benign now (as with bridge descriptors, for example).
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/23243#comment:13>