[metrics-bugs] #23243 [Metrics/Website]: Write a specification for Tor web server logs
Tor Bug Tracker & Wiki
blackhole at torproject.org
Thu Nov 2 11:02:01 UTC 2017
#23243: Write a specification for Tor web server logs
-----------------------------+--------------------------------
Reporter: iwakeh | Owner: metrics-team
Type: enhancement | Status: needs_revision
Priority: Medium | Milestone:
Component: Metrics/Website | Version:
Severity: Normal | Resolution:
Keywords: metrics-2017 | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-----------------------------+--------------------------------
Comment (by karsten):
A few thoughts:
 - I don't have a good answer to the question of putting virtual and
 physical host names into a file name. Underscores might work. I'm trying
 to solve the bigger issue first.
- We have little influence on the naming of input files. We can define
reasonable requirements that are ideally met by existing input files, and
we can reject future input files not matching these requirements.
- I don't imply that we should alter sanitized log files after
publication. We shouldn't. That would be pretty bad.
- The specification is a moving target, because it was ambiguous and the
implementation would have been fragile. It's a valid use case that a
physical host does not see a single request for a given virtual host for
days, and the implementation (and specification) did not cover that case.
But let's look at the idea to process all input files and produce output
files for all dates except the first and last UTC days. And let's ignore
performance considerations for now. Why would it not solve the "when is a
log ready for publication" question? Can you give an example?
Not sure if this is what you have in mind, but I think we ''cannot''
handle the case of log files "in the middle" being missing in one run and
being present in a subsequent run. For example, if we receive input files
with requests from the following dates:
- 2017-11-01 and 2017-11-02
- 2017-11-02 and 2017-11-03
- (gap)
- 2017-11-04 and 2017-11-05
- 2017-11-05 and 2017-11-06
We would produce output files for:
- (skip 2017-11-01, because first UTC date)
- 2017-11-02
- 2017-11-03
- 2017-11-04
- 2017-11-05
- (skip 2017-11-06, because last UTC date)
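The rule above can be sketched in a few lines of Python. This is only an
illustration, not the actual implementation: it assumes each input file has
already been reduced to the set of UTC dates of the requests it contains,
and the names `publishable_dates` and `day` are made up for this example.

```python
from datetime import date


def publishable_dates(files):
    """Return the sorted UTC dates for which output files would be written.

    `files` is an iterable of date collections, one per input file
    (an assumed, simplified stand-in for parsed log files).
    """
    # Collect every UTC date seen in any input file.
    seen = sorted({d for dates in files for d in dates})
    # Skip the first and last UTC date; either may still be incomplete.
    return seen[1:-1]


def day(n):
    """Helper for the example: a date in November 2017."""
    return date(2017, 11, n)


files = [
    {day(1), day(2)},
    {day(2), day(3)},
    # (gap)
    {day(4), day(5)},
    {day(5), day(6)},
]

print([d.isoformat() for d in publishable_dates(files)])
# ['2017-11-02', '2017-11-03', '2017-11-04', '2017-11-05']
```

Note that the gap does not change the result here, because 2017-11-03 and
2017-11-04 each still appear in one of the neighboring files.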
 Now, if we later find another file filling the gap, containing requests
 from the following dates:
- 2017-11-03 and 2017-11-04
We ''couldn't'' update the output files for 2017-11-03 and 2017-11-04
anymore! We would simply leave them unchanged, containing just the
requests we processed earlier.
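To make the limitation concrete, here is a rough sketch of two consecutive
runs. It assumes that output files are never rewritten once published; the
function `dates_to_write` and the `published` set are hypothetical names
used only for this illustration.

```python
from datetime import date


def dates_to_write(files, published):
    """Return the UTC dates that a run would newly write output for.

    Dates in `published` already have an output file and are left alone.
    """
    seen = sorted({d for dates in files for d in dates})
    # Skip the first and last UTC date, then skip anything already published.
    return [d for d in seen[1:-1] if d not in published]


def day(n):
    """Helper for the example: a date in November 2017."""
    return date(2017, 11, n)


first_run = [
    {day(1), day(2)},
    {day(2), day(3)},
    {day(4), day(5)},
    {day(5), day(6)},
]
published = set()
written = dates_to_write(first_run, published)
published.update(written)

# The late file covering 2017-11-03 and 2017-11-04 shows up in a later run.
second_run = first_run + [{day(3), day(4)}]
late = dates_to_write(second_run, published)
print(late)  # [] because both dates were already published
```

The second run writes nothing, so the requests in the late file never make
it into the output files for 2017-11-03 and 2017-11-04.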
But is this a bug we should be able to handle? It seems like a bug in the
log-copying script combined with bad timing. During normal operation and
in the bulk-import case this should not happen.
 Note that if you think that cutting off the first and last days is not
 enough, we could easily change that to cutting off the first two and last
 two days. Or the first one and the last two. Or the first three and last
 three. Whatever we think works best.
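As a sketch of how such a cutoff could be made configurable (the parameter
names `skip_first` and `skip_last` are assumptions for this example, not
part of any existing code):

```python
from datetime import date


def publishable_dates(seen_dates, skip_first=1, skip_last=1):
    """Drop `skip_first` leading and `skip_last` trailing UTC dates."""
    ordered = sorted(set(seen_dates))
    # Clamp the end index so large cutoffs yield an empty list
    # instead of wrapping around via a negative index.
    end = max(len(ordered) - skip_last, 0)
    return ordered[skip_first:end]


seen = [date(2017, 11, n) for n in range(1, 7)]

# First day and last two days cut off.
print([d.isoformat() for d in publishable_dates(seen, 1, 2)])
# ['2017-11-02', '2017-11-03', '2017-11-04']
```

Larger cutoffs delay publication by more days, but leave more time for
late input files to arrive.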
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/23243#comment:52>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online