[metrics-bugs] #23243 [Metrics/Website]: Write a specification for Tor web server logs

Thu Nov 2 13:38:09 UTC 2017

#23243: Write a specification for Tor web server logs
-----------------------------+--------------------------------
 Reporter:  iwakeh           |          Owner:  metrics-team
     Type:  enhancement      |         Status:  needs_revision
 Priority:  Medium           |      Milestone:
Component:  Metrics/Website  |        Version:
 Severity:  Normal           |     Resolution:
 Keywords:  metrics-2017     |  Actual Points:
Parent ID:                   |         Points:
 Reviewer:                   |        Sponsor:
-----------------------------+--------------------------------

Comment (by iwakeh):

 Replying to [comment:52 karsten]:
 > A few thoughts:
 >  - I don't have a good answer to the question of putting virtual and
 physical host name into a file name. Underscores might work. Trying to
 solve the bigger issue first.

 Yes, just using a placeholder for spec and implementation is fine here.

 >  - We have little influence on the naming of input files. We can define
 reasonable requirements that are ideally met by existing input files, and
 we can reject future input files not matching these requirements.

 Ok, I see.  From that it seems we should draw as little as possible from
 the input file's name and path.  The two hostnames seem to be the absolute
 minimum, as they cannot be inferred from the logs' contents.
 So far, the reference date from the log name was used for deciding which
 log line dates were valid.  Without a reference date, all log line dates
 from a single input log would be accepted.
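
 As an illustration of the case without a reference date, here is a minimal
 Python sketch that derives the range of dates covered by a set of log
 lines.  It assumes Apache-style timestamps in square brackets that are
 already in UTC; the function names and the sample lines are made up for
 this example and are not taken from the spec or the implementation.

```python
import re
from datetime import date

# Assumption: timestamps look like [01/Nov/2017:00:00:00 +0000] and are in UTC.
TIMESTAMP = re.compile(r"\[(\d{2})/([A-Z][a-z]{2})/(\d{4}):")
MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}


def log_line_date(line):
    """Return the date found in a log line, or None if there is none."""
    match = TIMESTAMP.search(line)
    if match is None or match.group(2) not in MONTHS:
        return None
    day, month, year = match.groups()
    try:
        return date(int(year), MONTHS[month], int(day))
    except ValueError:
        # The pattern matched, but the values do not form a calendar date.
        return None


def covered_dates(lines):
    """Return (first, last) date over all lines, or None if no line has a date."""
    dates = [d for d in map(log_line_date, lines) if d is not None]
    return (min(dates), max(dates)) if dates else None


# Made-up sample lines, only for illustration.
sample = [
    '0.0.0.0 - - [02/Nov/2017:00:00:00 +0000] "GET / HTTP/1.1" 200 1234',
    '0.0.0.0 - - [01/Nov/2017:00:00:00 +0000] "GET / HTTP/1.1" 200 1234',
    'not a log line',
]
print(covered_dates(sample))
```

 Lines without a recognizable timestamp are simply skipped here; whether
 they should instead be rejected is a question for the spec.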

 >  - I don't imply that we should alter sanitized log files after
 publication. We shouldn't. That would be pretty bad.

 True.

 >  - The specification is a moving target, because it was ambiguous and
 the implementation would have been fragile. It's a valid use case that a
 physical host does not see a single request for a given virtual host for
 days, and the implementation (and specification) did not cover that case.

 No offense, I didn't mean the 'moving target spec' as an accusation.  It's
 normal that requirements change and extend.  Currently, it is important to
 put emphasis on the spec, because that is where our open questions are.
 And the implementation should cover exactly the spec, neither more nor
 less.

 >
 > But let's look at the idea to process all input files and produce output
 files for all dates except the first and last UTC days. And let's ignore
 performance considerations for now. Why would it not solve the "when is a
 log ready for publication" question? Can you give an example?

 The 'gap(s)' (as you name it below) would be the problem.

 >
 > Not sure if this is what you have in mind, but I think we ''cannot''
 handle the case of log files "in the middle" being missing in one run and
 being present in a subsequent run. For example, if we receive input files
 with requests from the following dates:
 >
 >  - 2017-11-01 and 2017-11-02
 >  - 2017-11-02 and 2017-11-03
 >  - (gap)
 >  - 2017-11-04 and 2017-11-05
 >  - 2017-11-05 and 2017-11-06
 >
 > We would produce output files for:
 >
 >  - (skip 2017-11-01, because first UTC date)
 >  - 2017-11-02
 >  - 2017-11-03
 >  - 2017-11-04
 >  - 2017-11-05
 >  - (skip 2017-11-06, because last UTC date)
 >
 > Now, if we later find another file filling the gap with the following
 contained request dates:
 >
 >  - 2017-11-03 and 2017-11-04
 >
 > We ''couldn't'' update the output files for 2017-11-03 and 2017-11-04
 anymore! We would simply leave them unchanged, containing just the
 requests we processed earlier.

 The only way to find out that 2017-11-04 is already there would be a
 lookup in 'out' and 'recent'.
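
 Such a lookup could be sketched in Python as follows.  The directory
 names 'out' and 'recent' are the ones mentioned above; the file naming
 scheme, the function names, and the host names are hypothetical
 placeholders for this example, since the final naming is still open.

```python
import tempfile
from datetime import date
from pathlib import Path


def sanitized_name(virtual_host, physical_host, day):
    # Hypothetical placeholder naming with underscores; not the final spec.
    return f"{virtual_host}_{physical_host}_access.log_{day:%Y%m%d}"


def already_published(base_dir, virtual_host, physical_host, day,
                      subdirs=("out", "recent")):
    """Return True if a sanitized file for that day exists below any subdir."""
    name = sanitized_name(virtual_host, physical_host, day)
    for sub in subdirs:
        root = Path(base_dir) / sub
        # Search recursively, as the directory layout below 'out' is unknown.
        if root.is_dir() and any(root.rglob(name)):
            return True
    return False


# Demo in a temporary directory with made-up host names.
with tempfile.TemporaryDirectory() as base:
    target = Path(base) / "out" / "2017" / "11"
    target.mkdir(parents=True)
    existing = sanitized_name("vhost.example", "phys.example",
                              date(2017, 11, 4))
    (target / existing).touch()
    found = already_published(base, "vhost.example", "phys.example",
                              date(2017, 11, 4))
    missing = already_published(base, "vhost.example", "phys.example",
                                date(2017, 11, 6))
    print(found, missing)
```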

 >
 > But is this a bug we should be able to handle? It seems like a bug in
 the log-copying script combined with bad timing. During normal operation
 and in the bulk-import case this should not happen.

 During a bulk import it might be harder to guarantee the correct order.
 Hmm, but that should be manageable somehow ...

 >
 > Note that if you think that cutting off the first and last days is not
 enough, we could easily change that to cutting off the first and last two
 days. Or the first and the last two. Or first and last three. Whatever we
 think works best.

 That cut-off time could be kept variable and be adjusted later, true.

 Summary:
 * Only hostnames are inferred from the logs' names and paths.
 * The 'reference date' used in the current spec is dropped.
 * The reference interval consists only of the log line dates covered in
 one run; a certain amount at its beginning and end is not processed (aka:
 cut-off time).
 * Sanitized files for dates that are already available in 'out' are
 //not// created, and the corresponding log lines are ignored.
 * Gaps in imported logs cannot be filled in later.
 * File provision for (bulk) imports has to ensure proper order.
 * Use a placeholder for sanitized log file names (starting with
 underscore, but easily changeable).
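
 The date-related rules above can be sketched in a few lines of Python.
 This is only a rough illustration under assumptions: the cut-off is
 counted in distinct dates seen in the run, and `dates_to_publish`,
 `published`, and `cutoff` are names invented for this example.

```python
from datetime import date


def dates_to_publish(run_dates, published, cutoff=1):
    """Return the dates of one run for which sanitized files would be written.

    run_dates: dates of all log lines seen in this run
    published: dates that already have a sanitized file in 'out'
    cutoff:    number of distinct dates dropped at each end (assumed unit)
    """
    days = sorted(set(run_dates))
    if len(days) <= 2 * cutoff:
        # The whole reference interval falls into the cut-off.
        return []
    # Drop the first and last `cutoff` dates of the reference interval.
    kept = days[cutoff:len(days) - cutoff]
    # Never recreate files for dates that were published before.
    return [day for day in kept if day not in published]


def nov(day):
    return date(2017, 11, day)


# Dates from the example in comment 52: four input files, one gap.
first_run = [nov(1), nov(2), nov(2), nov(3), nov(4), nov(5), nov(5), nov(6)]
published = set(dates_to_publish(first_run, set()))

# A file filling the gap shows up later.
late_file = [nov(3), nov(4)]
rerun = dates_to_publish(first_run + late_file, published)
print(sorted(published), rerun)
```

 With these dates the first run selects 2017-11-02 through 2017-11-05, and
 the later run selects nothing, which matches the rule that gaps cannot be
 filled in later.  Changing `cutoff` is all it takes to adjust the cut-off
 time.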

 Does this seem solid?
 Shall I amend the spec with these changes?

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/23243#comment:53>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online