[metrics-bugs] #23243 [Metrics/Website]: Write a specification for Tor web server logs
Tor Bug Tracker & Wiki
blackhole at torproject.org
Tue Oct 24 12:15:55 UTC 2017
#23243: Write a specification for Tor web server logs
-----------------------------+--------------------------------
Reporter: iwakeh | Owner: metrics-team
Type: enhancement | Status: needs_revision
Priority: Medium | Milestone:
Component: Metrics/Website | Version:
Severity: Normal | Resolution:
Keywords: metrics-2017 | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-----------------------------+--------------------------------
Comment (by iwakeh):
Replying to [comment:46 karsten]:
> I'm not sure if we can resolve these questions by hard thinking.
Well, we need to work on thoughtful decision making. There are not that
many questions above except yours:
> ... what happens if a web server rotates logs more often than once per
day? At least that's something that we write in the specification. I'm not
sure how this would work with file names, so maybe we in fact require that
logs are rotated exactly once per day, and we just didn't write that in
the specification yet. However, it seems rather restrictive to prescribe
exact log rotation intervals in order to sanitize logs subsequently. Maybe
we should be less restrictive here.
The current webstats code and the spec require one log per day. So, if
someone decides to rotate logs more often than that, the spec and code
will have to be adapted. Thus, it seems this question refers to a
hypothetical change (afaik). In comment:45 I point out that this is a
small issue for the implementation, based on the reasoning that rotated
logs usually add a number or a time or both to the log file name. Either
scheme is easy to accommodate.
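To illustrate why more frequent rotation should be a small change, here is
a minimal sketch that groups rotated files by day. The file name pattern
(`<name>-YYYYMMDD`, optionally followed by `.<n>` or `-<time>`) and the
names `ROTATED` and `group_by_day` are assumptions for illustration only;
they are not taken from the spec or from the webstats code.

```python
import re
from collections import defaultdict

# Hypothetical naming scheme: "<name>-YYYYMMDD", optionally followed by
# ".<number>" or "-<time>" when a log is rotated more than once per day.
ROTATED = re.compile(r"^(?P<base>.+)-(?P<day>\d{8})(?:[.-](?P<part>\d+))?$")


def group_by_day(file_names):
    """Map (base name, day) to the sorted list of rotated files for that day."""
    groups = defaultdict(list)
    for name in file_names:
        match = ROTATED.match(name)
        if match:  # names that do not fit the assumed scheme are skipped
            groups[(match.group("base"), match.group("day"))].append(name)
    return {key: sorted(names) for key, names in groups.items()}
```

With such a grouping, all parts of a day could be sanitized together and
still yield one sanitized log per day.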
> - Would it help to know the log and log rotation configuration used on
the various Tor web servers?
Unless you have reason to think that the current logging procedures are
going to change, or have already changed, from the one-log-per-day scheme,
this is not necessary.
> - Would it help to have access to the current host that sanitizes web
server logs?
I think there are no questions regarding the current process.
> - Does the existing code for sanitizing web server logs contain any
more hints on the input data?
We put all the information from the current code into the spec and the
current implementation suggestion. The old code also uses a 'cue' (as
mentioned in comment:45): `sanitize.py` returns the sanitized log file
name for the day before the processed log file, which is the cue for the
calling shell script that this file is now complete and can be published.
Both the old and the suggested new version of webstats need an outside
cue, and without this cue an input log day would not be published.
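The cue described above can be sketched roughly as follows. This is not
the code from `sanitize.py`; the function names and the file name format
are hypothetical and only show the "day before" relationship.

```python
from datetime import date, timedelta


def completed_day(processed_log_day: date) -> date:
    """Day whose sanitized log is considered complete once a log for
    processed_log_day has been processed (the day before)."""
    return processed_log_day - timedelta(days=1)


def sanitized_name(site: str, day: date) -> str:
    """Hypothetical sanitized log file name; the real format is defined
    by the spec, not by this sketch."""
    return f"{site}-access.log-{day:%Y%m%d}"


def cue_for(site: str, processed_log_day: date) -> str:
    """File name the caller may publish after processing processed_log_day."""
    return sanitized_name(site, completed_day(processed_log_day))
```

If no log for the following day ever arrives, `cue_for` is never called
for that day, which is exactly the case where a day would stay
unpublished.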
Now, focussing on the new implementation of the webstats module for
CollecTor, there are several ways of preventing log file loss:
1. Make sure by outside means that there is no day without a log (e.g. by
providing an empty file for that day using 'touch'). This would work
without additional implementation for CollecTor, and it works for bulk
imports as well as daily processing. As a result there will be a
sanitized log for each day offered by CollecTor, some of which might be
empty.
2. For bulk processing a property could signal CollecTor to use all logs
without insisting on an uninterrupted chain. This still requires outside
measures for making sure no log lines are lost and might result in days
without any logs, unless CollecTor creates empty ones.
3. Devise a mechanism that enables more automated processing of an
interrupted chain of logs. This seems error-prone and will result in many
edge cases.
I think 1. is the easiest in terms of operation, i.e., providing input
logs, and implementation (it's there already). In addition, the
uninterrupted chain of (possibly empty) sanitized logs is also easy to
verify and understand. An empty file could result from no log line being
valid or no log being available for that day.
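For illustration, option 1 could be done outside CollecTor along these
lines. This is only a sketch under assumptions: `missing_days`,
`touch_missing` and the `name_for` callback are hypothetical names, and
the actual input file naming has to follow whatever the web servers
produce.

```python
from datetime import timedelta
from pathlib import Path


def missing_days(present, first, last):
    """Return the days in [first, last] for which no input log exists."""
    present = set(present)
    span = range((last - first).days + 1)
    return [first + timedelta(days=n) for n in span
            if first + timedelta(days=n) not in present]


def touch_missing(log_dir, present, first, last, name_for):
    """Create an empty log file (like 'touch') for every missing day.

    name_for is a caller-supplied function mapping a date to a file name.
    Returns the paths of the files that were created for the gaps.
    """
    created = []
    for day in missing_days(present, first, last):
        path = Path(log_dir) / name_for(day)
        path.touch(exist_ok=True)  # never truncates an existing file
        created.append(path)
    return created
```

The same `missing_days` check can be run on the sanitized output to verify
that the chain of days is uninterrupted.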
So, in order to move forward, one of the above methods needs to be chosen
(or a new one made up).
The other question, about shorter log rotation intervals, is only relevant
if that is put into practice. If so, it should be a straightforward task
to adapt the code.
Hope this makes some sense. Is there anything else missing here?
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/23243#comment:47>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online