[metrics-bugs] #23243 [Metrics/Website]: Write a specification for Tor web server logs
Tor Bug Tracker & Wiki
blackhole at torproject.org
Mon Oct 30 17:05:05 UTC 2017
#23243: Write a specification for Tor web server logs
-----------------------------+--------------------------------
Reporter: iwakeh | Owner: metrics-team
Type: enhancement | Status: needs_revision
Priority: Medium | Milestone:
Component: Metrics/Website | Version:
Severity: Normal | Resolution:
Keywords: metrics-2017 | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-----------------------------+--------------------------------
Comment (by iwakeh):
Replying to [comment:50 karsten]:
> While reviewing our discussion above I discovered another weakness in
our specification: our naming convention for sanitized log files does not
take into account that host names may include dashes. For example, there
are virtual hosts like `cdn-fastly-backend.torproject.org` and physical
hosts like `oo-hetzner-03.torproject.org`, which we would combine to
`cdn-fastly-backend.torproject.org-oo-hetzner-03.torproject.org-access.log-20171030.xz`.
Where does the virtual host name end and where
does the physical host name begin?
Yikes! This is really bad, but it is good that we became aware of it now.
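To make the ambiguity concrete, here is a minimal Python sketch that lists every way the example name could be split at a dash. The function name `candidate_splits` is made up for illustration and is not part of any existing code.

```python
def candidate_splits(file_name, suffix="-access.log-"):
    """Return every (virtual_host, physical_host) pair a dash-separated name allows."""
    # Everything before "-access.log-" is the combined host part.
    prefix, _, _date = file_name.partition(suffix)
    parts = prefix.split("-")
    # Any dash could be the boundary between the two host names.
    return [("-".join(parts[:i]), "-".join(parts[i:])) for i in range(1, len(parts))]


name = ("cdn-fastly-backend.torproject.org-"
        "oo-hetzner-03.torproject.org-access.log-20171030.xz")
for virtual, physical in candidate_splits(name):
    print(virtual, "|", physical)
```

The intended split is only one of several candidates, and nothing in the name itself tells a parser which one is right.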
>
> We might consider changing the naming convention to something like
`<virtual-host>-access.log-<physical-host>-YYYYMMDD[.xz]`, but even for
that we might find host names producing file names that cannot be parsed
unambiguously. Maybe we'll have to return to putting physical host names
in parent directory names and only virtual names in file names. Hmm.
How much influence do we have on the naming of input files?
Who decides about input file naming and structuring?
For the output files, we could use other separators, maybe the underscore?
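As a rough sketch of the underscore idea, assuming the specification would state that neither host name may contain an underscore, output names could be parsed like this. The pattern `<virtual-host>_<physical-host>_access.log_YYYYMMDD[.xz]` and the function name are hypothetical, not an agreed convention.

```python
import re

# Hypothetical pattern: <virtual-host>_<physical-host>_access.log_YYYYMMDD[.xz]
# Assumption: host names never contain underscores.
NAME_PATTERN = re.compile(
    r"^(?P<virtual>[^_]+)_(?P<physical>[^_]+)_access\.log_(?P<date>\d{8})(?:\.xz)?$"
)


def parse_output_name(file_name):
    """Return the name's components as a dict, or None if it does not match."""
    match = NAME_PATTERN.match(file_name)
    return match.groupdict() if match else None
```

With this separator, dashes inside either host name no longer matter, but the assumption about underscores would have to be written down in the specification.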
>
> But going back to the discussion above, I don't think we can make
assumptions that would allow us to implement your suggestions 1 and 2
above. Yet I don't see how suggestion 3 would be more error prone. It's up
to us to design something that is robust, so we'll have to go through the
possible edge cases and be prepared to handle them.
I just want to point out that many of the questions and assumptions I raised
are not based on my preferences, but only on the fact that implementation
and design need unambiguous information. Thus, a valid answer to 1 to
3 above is that the assumptions of 1 and 2 are not valid. That's
perfectly fine.
>
> And talking about assumptions, I feel like our one-log-per-day
assumption is unnecessarily strong. I do agree that the naming requirement
only permits one log file per physical host, virtual host, and date. But
it seems like it should be up to the web server operators to decide to
rotate logs only once per day or more often to keep log files small.
As above:
How much influence do we have on the structuring of logs?
Again, I pointed out above (comment:45) that the implementation could
accommodate this easily. But to start with, there has to be a valid
description of the log names to expect.
>
> Here's an idea to reduce the number of edge cases: how about we simply
ignore the date in the input log file name and only rely on the date given
in each of the contained log lines?
>
> As stated earlier, we could process everything in `in/webstats/` and
write everything to `out/` and `recent/` except the first and last
encountered UTC days. In theory, we could even drop the log rotation
requirement entirely.
I'm not really worried about the edge cases.
Initially, the date in the log name was used to ignore log lines that
actually belong to older logs. This was introduced because of the current
way of processing in the shell-python implementation.
A second requirement is that log files shouldn't change once published.
This makes accepting the last UTC date more difficult. We need to know
when a log file is ready to be published, and so far we have
used the log's reference date for that purpose.
Again, I don't have a preference, but the topic needs to be resolved before
implementation.
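For discussion, the proposal of relying only on the dates in the log lines and skipping the first and last encountered UTC days could look roughly like the following Python sketch. It assumes Apache-style lines with a `[DD/Mon/YYYY:...` timestamp that is already in UTC and English month abbreviations; the function names are invented for this example.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumption: each line carries an Apache-style timestamp like [30/Oct/2017:00:00:00 +0000].
DATE_IN_LINE = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}):")


def lines_by_day(lines):
    """Group log lines by the date found inside each line, ignoring file names."""
    days = defaultdict(list)
    for line in lines:
        match = DATE_IN_LINE.search(line)
        if match:
            day = datetime.strptime(match.group(1), "%d/%b/%Y").date()
            days[day].append(line)
    return days


def complete_days(days):
    """Drop the first and last encountered days, which may be incomplete."""
    ordered = sorted(days)
    return {day: days[day] for day in ordered[1:-1]}
```

Note that this only decides which days are treated as complete within one run. It does not by itself answer when a sanitized log is final, which is the open question above.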
>
> This approach should work just fine for bulk processing. And for running
several times per day we could keep a state file to avoid re-processing
input files that we already processed before, by storing file name,
last-modified time, and last contained UTC date. Worst case if we lose that
state file is that we'll read everything in `in/webstats/` once again.
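A state file of the kind described here might be sketched in Python as follows. The JSON layout, the key names `last_modified` and `last_utc_date`, and all function names are assumptions made for illustration only.

```python
import json
import os


def load_state(state_path):
    """Load the state file; a missing file just means everything is re-read."""
    try:
        with open(state_path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {}


def needs_processing(state, path):
    """True if the input file is new or was modified since it was recorded."""
    entry = state.get(path)
    return entry is None or entry["last_modified"] != os.path.getmtime(path)


def record_processed(state, path, last_utc_date):
    """Remember file name, last-modified time, and last contained UTC date."""
    state[path] = {
        "last_modified": os.path.getmtime(path),
        "last_utc_date": last_utc_date,
    }


def save_state(state, state_path):
    """Write the state back to disk as JSON."""
    with open(state_path, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
```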
>
> Do you see any conceptual weaknesses in ignoring the date in input
files? Do you want to give this implementation a try? Otherwise I'd be
willing to write some proof-of-concept code to see whether this can work.
How is the "when is a log ready for publication" question solved here?
Thinking about the process or introducing a performance measure such as
the state file is taking the third step before the first.
Or are you suggesting that we change the requirement of not altering
sanitized logs once they're published?
Conclusion:
The implementation is not difficult either way. Currently the
specification (on which the implementation needs to rely) is a moving
target. I think we should make sure that our specification is correct and
answers all questions. When there is a solid specification, the
implementation follows easily.
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/23243#comment:51>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online