[metrics-bugs] #23243 [Metrics/Website]: Write a specification for Tor web server logs
Tor Bug Tracker & Wiki
blackhole at torproject.org
Mon Oct 30 15:18:33 UTC 2017
#23243: Write a specification for Tor web server logs
-----------------------------+--------------------------------
 Reporter:  iwakeh           |          Owner:  metrics-team
     Type:  enhancement      |         Status:  needs_revision
 Priority:  Medium           |      Milestone:
Component:  Metrics/Website  |        Version:
 Severity:  Normal           |     Resolution:
 Keywords:  metrics-2017     |  Actual Points:
Parent ID:                   |         Points:
 Reviewer:                   |        Sponsor:
-----------------------------+--------------------------------
Comment (by karsten):
While reviewing our discussion above I discovered another weakness in our
specification: our naming convention for sanitized log files does not take
into account that host names may include dashes. For example, there are
virtual hosts like `cdn-fastly-backend.torproject.org` and physical hosts
like `oo-hetzner-03.torproject.org`, which we would combine to
`cdn-fastly-backend.torproject.org-oo-hetzner-03.torproject.org-access.log-20171030.xz`.
Where does the virtual host name end and where does the physical host name
begin?
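To make the ambiguity concrete, here is a minimal Python sketch that lists
every way such a name could be split into two host names. The function name
`candidate_splits`, the suffix pattern, and the "a host name contains a dot"
heuristic are assumptions made for illustration only; none of them come from
the specification.

```python
import re

# Assumed suffix: "-access.log-YYYYMMDD" with an optional ".xz".
SUFFIX = re.compile(r"-access\.log-(\d{8})(\.xz)?$")


def candidate_splits(file_name):
    """Return every (virtual_host, physical_host) pair the name could encode."""
    match = SUFFIX.search(file_name)
    if match is None:
        raise ValueError("unexpected file name: " + file_name)
    hosts = file_name[:match.start()]
    splits = []
    for index, char in enumerate(hosts):
        if char != "-":
            continue
        virtual, physical = hosts[:index], hosts[index + 1:]
        # Heuristic (an assumption): a plausible host name contains a dot.
        if "." in virtual and "." in physical:
            splits.append((virtual, physical))
    return splits


name = ("cdn-fastly-backend.torproject.org-oo-hetzner-03.torproject.org"
        "-access.log-20171030.xz")
for virtual, physical in candidate_splits(name):
    print(virtual, "|", physical)
```

Even with the dot heuristic, the example name yields three candidate splits,
so a parser cannot recover the two host names from the file name alone.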
We might consider changing the naming convention to something like
`<virtual-host>-access.log-<physical-host>-YYYYMMDD[.xz]`, but even for
that we might find host names producing file names that cannot be parsed
unambiguously. Maybe we'll have to return to putting physical host names
in parent directory names and only virtual host names in file names. Hmm.
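A rough sketch of how the parent-directory variant could be parsed, assuming
a layout of `<physical-host>/<virtual-host>-access.log-YYYYMMDD[.xz]`. That
layout, the function name `parse_log_path`, and the example path are
hypothetical and only serve to show that dashes in host names stop being a
problem once the two names live in separate path components.

```python
import re
from pathlib import PurePosixPath

# Assumed file name inside the physical host directory:
# <virtual-host>-access.log-YYYYMMDD[.xz]
NAME = re.compile(r"^(?P<virtual>.+)-access\.log-(?P<date>\d{8})(?:\.xz)?$")


def parse_log_path(path):
    """Return (physical_host, virtual_host, date) for an assumed layout."""
    parsed = PurePosixPath(path)
    match = NAME.match(parsed.name)
    if match is None:
        raise ValueError("unexpected file name: " + parsed.name)
    # The physical host is taken from the parent directory, not the file name.
    return parsed.parent.name, match.group("virtual"), match.group("date")


print(parse_log_path(
    "in/webstats/oo-hetzner-03.torproject.org/"
    "cdn-fastly-backend.torproject.org-access.log-20171030.xz"))
```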
But going back to the discussion above, I don't think we can make
assumptions that would allow us to implement your suggestions 1 and 2
above. Yet I don't see how suggestion 3 would be more error-prone. It's up
to us to design something that is robust, so we'll have to go through the
possible edge cases and be prepared to handle them.
And talking about assumptions, I feel like our one-log-per-day assumption
is unnecessarily strong. I do agree that the naming requirement only
permits one log file per physical host, virtual host, and date. But it
seems like it should be up to the web server operators to decide whether
to rotate logs only once per day or more often to keep log files small.
Here's an idea to reduce the number of edge cases: how about we simply
ignore the date in the input log file name and only rely on the date given
in each of the contained log lines?
As stated earlier, we could process everything in `in/webstats/` and write
everything to `out/` and `recent/` except the first and last encountered
UTC days. In theory, we could even drop the log rotation requirement
entirely.
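The idea of relying only on per-line dates can be sketched as follows. This
is an illustration under stated assumptions, not the actual implementation:
it assumes each log line carries an Apache-style timestamp such as
`[30/Oct/2017:00:00:00 +0000]`, and the names `utc_date` and `complete_days`
are made up for this example.

```python
import re
from collections import defaultdict
from datetime import datetime, timezone

# Assumed timestamp format: [DD/Mon/YYYY:HH:MM:SS +ZZZZ]
TIMESTAMP = re.compile(r"\[([^\]]+)\]")


def utc_date(line):
    """Return the UTC date of a log line, or None if it cannot be parsed."""
    match = TIMESTAMP.search(line)
    if match is None:
        return None
    try:
        # Note: %b expects English month abbreviations in the default locale.
        stamp = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
    except ValueError:
        return None
    return stamp.astimezone(timezone.utc).date()


def complete_days(lines):
    """Group lines by UTC date and drop the first and last encountered days,
    since those may be incomplete. File names are never consulted."""
    by_day = defaultdict(list)
    for line in lines:
        day = utc_date(line)
        if day is not None:
            by_day[day].append(line)
    if not by_day:
        return {}
    first, last = min(by_day), max(by_day)
    return {day: rows for day, rows in by_day.items()
            if day not in (first, last)}
```

With input that spans fewer than three UTC days, this sketch returns nothing,
which matches the rule of withholding the first and last encountered days.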
This approach should work just fine for bulk processing. And for running
several times per day we could keep a state file to avoid re-processing
input files that we already processed before, by storing file name,
last-modified time, and last contained UTC date. The worst case, if we lose
that state file, is that we'll read everything in `in/webstats/` once again.
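One possible shape for such a state file, sketched in Python. The JSON
format, the field names `mtime` and `last_date`, and all function names are
assumptions for illustration; the comment above only says which three pieces
of information would be stored.

```python
import json
import os


def load_state(state_path):
    """Load the state file; a missing or corrupt file means 'process all'."""
    try:
        with open(state_path, encoding="utf-8") as handle:
            return json.load(handle)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def needs_processing(state, path):
    """True if the file is new or its last-modified time has changed."""
    entry = state.get(path)
    return entry is None or entry["mtime"] != os.path.getmtime(path)


def record_processed(state, path, last_utc_date):
    """Remember file name, last-modified time, and last contained UTC date."""
    state[path] = {"mtime": os.path.getmtime(path),
                   "last_date": last_utc_date}


def save_state(state, state_path):
    """Write to a temporary file first so a crash cannot truncate the state."""
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(state, handle, indent=2)
    os.replace(tmp_path, state_path)
```

Losing the state file is harmless in this sketch: `load_state` then returns
an empty mapping and every input file is simply read again.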
Do you see any conceptual weaknesses in ignoring the date in input files?
Do you want to give this implementation a try? Otherwise I'd be willing to
write some proof-of-concept code to see whether this can work.
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/23243#comment:50>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online