[metrics-bugs] #23243 [Metrics/Website]: Write a specification for Tor web server logs
Tor Bug Tracker & Wiki
blackhole at torproject.org
Thu Nov 2 13:38:09 UTC 2017
#23243: Write a specification for Tor web server logs
-----------------------------+--------------------------------
Reporter: iwakeh | Owner: metrics-team
Type: enhancement | Status: needs_revision
Priority: Medium | Milestone:
Component: Metrics/Website | Version:
Severity: Normal | Resolution:
Keywords: metrics-2017 | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-----------------------------+--------------------------------
Comment (by iwakeh):
Replying to [comment:52 karsten]:
> A few thoughts:
> - I don't have a good answer to the question of putting virtual and
physical host name into a file name. Underscores might work. Trying to
solve the bigger issue first.
Yes, just using a placeholder for spec and implementation is fine here.
> - We have little influence on the naming of input files. We can define
reasonable requirements that are ideally met by existing input files, and
we can reject future input files not matching these requirements.
Ok, I see. From that it seems we should rely as little as possible on
the input file name and path; the two hostnames seem the absolute
minimum, as they cannot be inferred from the logs' contents.
So far, the reference date from the log name was used for deciding which
log line dates were valid. If there was no reference date, all log line
dates would be accepted from a single input log.
> - I don't imply that we should alter sanitized log files after
publication. We shouldn't. That would be pretty bad.
True.
> - The specification is a moving target, because it was ambiguous and
the implementation would have been fragile. It's a valid use case that a
physical host does not see a single request for a given virtual host for
days, and the implementation (and specification) did not cover that case.
No offense, I didn't mean 'moving target spec' as an accusation. It's
normal that requirements change and grow. Currently, it is important to
put emphasis on the spec, because that is where our open questions are.
And the implementation should cover exactly the spec, neither more nor less.
>
> But let's look at the idea to process all input files and produce output
files for all dates except the first and last UTC days. And let's ignore
performance considerations for now. Why would it not solve the "when is a
log ready for publication" question? Can you give an example?
The 'gap(s)' (as you call them below) would be the problem.
>
> Not sure if this is what you have in mind, but I think we ''cannot''
handle the case of log files "in the middle" being missing in one run and
being present in a subsequent run. For example, if we receive input files
with requests from the following dates:
>
> - 2017-11-01 and 2017-11-02
> - 2017-11-02 and 2017-11-03
> - (gap)
> - 2017-11-04 and 2017-11-05
> - 2017-11-05 and 2017-11-06
>
> We would produce output files for:
>
> - (skip 2017-11-01, because first UTC date)
> - 2017-11-02
> - 2017-11-03
> - 2017-11-04
> - 2017-11-05
> - (skip 2017-11-06, because last UTC date)
>
> Now, if we later find another file filling the gap with the following
contained request dates:
>
> - 2017-11-03 and 2017-11-04
>
> We ''couldn't'' update the output files for 2017-11-03 and 2017-11-04
anymore! We would simply leave them unchanged, containing just the
requests we processed earlier.
The only way to find out that 2017-11-04 is already there would be a
lookup in 'out' and 'recent'.
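To make the example above concrete, here is a minimal sketch of the selection rule as I understand it: collect all UTC dates seen in one run, drop the first and the last, and skip every date that was already published. The function name, the helper `nov`, and the representation of an input file as a set of dates are assumptions for illustration only, not part of the spec or the implementation.

```python
from datetime import date


def dates_to_publish(input_files, already_published):
    """Return the UTC dates to write in this run, in ascending order.

    input_files: list of sets, each holding the UTC dates found in one
    input file (hypothetical representation).
    already_published: set of dates that already exist in 'out'/'recent'.
    """
    seen = sorted(set().union(*input_files))
    candidates = seen[1:-1]  # skip the first and the last UTC date
    return [d for d in candidates if d not in already_published]


def nov(day):
    """Shorthand for a date in November 2017, as in the example above."""
    return date(2017, 11, day)


# First run: the file covering 2017-11-03 and 2017-11-04 is missing.
first_run = [{nov(1), nov(2)}, {nov(2), nov(3)},
             {nov(4), nov(5)}, {nov(5), nov(6)}]
published = set(dates_to_publish(first_run, set()))

# Later run: the gap file shows up, but its dates are already published.
second_run = first_run + [{nov(3), nov(4)}]
late = dates_to_publish(second_run, published)
```

In this sketch the first run publishes 2017-11-02 through 2017-11-05, and the later run publishes nothing, so the requests from the gap file are ignored.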
>
> But is this a bug we should be able to handle? It seems like a bug in
the log-copying script combined with bad timing. During normal operation
and in the bulk-import case this should not happen.
During a bulk import it might be harder to guarantee the correct order.
Hmm, but that should be manageable somehow ...
>
> Note that if you think that cutting off the first and last days is not
enough, we could easily change that to cutting off the first and last two
days. Or the first and the last two. Or first and last three. Whatever we
think works best.
That cut-off time could be kept variable and be adjusted later, true.
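A rough sketch of what a variable cut-off could look like, assuming the cut-off is counted in whole UTC days and applied symmetrically at both ends. The function name and the parameter `cutoff_days` are made up for illustration.

```python
from datetime import date, timedelta


def reference_interval(line_dates, cutoff_days=1):
    """Return (first, last) UTC dates that are still processed, or None.

    line_dates: all log line dates covered in one run.
    cutoff_days: number of days cut off at the beginning and at the end
    (hypothetical parameter; the spec does not fix a value yet).
    """
    if not line_dates:
        return None
    first = min(line_dates) + timedelta(days=cutoff_days)
    last = max(line_dates) - timedelta(days=cutoff_days)
    return (first, last) if first <= last else None


run_dates = [date(2017, 11, day) for day in range(1, 7)]
one_day = reference_interval(run_dates, cutoff_days=1)
two_days = reference_interval(run_dates, cutoff_days=2)
three_days = reference_interval(run_dates, cutoff_days=3)
```

With log lines from 2017-11-01 to 2017-11-06, a cut-off of one day leaves 2017-11-02 to 2017-11-05, two days leave 2017-11-03 to 2017-11-04, and three days leave nothing to process.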
Summary:
* Only hostnames are inferred from the logs' names and paths.
* The 'reference date' used in the current spec is dropped.
* The log line dates covered in a single run form the reference
interval; a certain number of days at its beginning and end are not
processed (the cut-off time).
* Sanitized files for dates that are already available in 'out' are
//not// created, and the corresponding log lines are ignored.
* Gaps in import logs cannot be filled in later.
* File provision for (bulk) imports has to ensure proper order.
* Use a placeholder for sanitized log file names (starting with
underscore, but easily changeable).
Does this seem solid?
Shall I amend the spec with these changes?
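For the placeholder in the last summary point, a small sketch of how the separator could be kept easily changeable. The layout (virtual host, physical host, date) and the example hostnames are purely hypothetical; the actual naming scheme is still open.

```python
from datetime import date


def sanitized_log_name(virtual_host, physical_host, day, sep="_"):
    """Build a sanitized log file name from both hostnames and the date.

    The component order and the date format are assumptions for
    illustration; only the separator being configurable matters here.
    """
    return sep.join((virtual_host, physical_host, day.strftime("%Y%m%d")))


name = sanitized_log_name("www.example.org", "host1.example.org",
                          date(2017, 11, 2))
```

Switching to another separator later would then only mean changing the default value of `sep`.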
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/23243#comment:53>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online