[metrics-bugs] #23243 [Metrics/Website]: Write a specification for Tor web server logs

Tor Bug Tracker & Wiki blackhole at torproject.org
Thu Nov 2 11:02:01 UTC 2017


#23243: Write a specification for Tor web server logs
-----------------------------+--------------------------------
 Reporter:  iwakeh           |          Owner:  metrics-team
     Type:  enhancement      |         Status:  needs_revision
 Priority:  Medium           |      Milestone:
Component:  Metrics/Website  |        Version:
 Severity:  Normal           |     Resolution:
 Keywords:  metrics-2017     |  Actual Points:
Parent ID:                   |         Points:
 Reviewer:                   |        Sponsor:
-----------------------------+--------------------------------

Comment (by karsten):

 A few thoughts:
  - I don't have a good answer to the question of putting virtual and
 physical host name into a file name. Underscores might work. Trying to
 solve the bigger issue first.
  - We have little influence on the naming of input files. We can define
 reasonable requirements that are ideally met by existing input files, and
 we can reject future input files not matching these requirements.
  - I don't imply that we should alter sanitized log files after
 publication. We shouldn't. That would be pretty bad.
  - The specification is a moving target, because it was ambiguous and the
 implementation would have been fragile. It's a valid use case that a
 physical host does not see a single request for a given virtual host for
 days, and the implementation (and specification) did not cover that case.

 But let's look at the idea to process all input files and produce output
 files for all dates except the first and last UTC days. And let's ignore
 performance considerations for now. Why would it not solve the "when is a
 log ready for publication" question? Can you give an example?

 Not sure if this is what you have in mind, but I think we ''cannot''
 handle the case of log files "in the middle" being missing in one run and
 being present in a subsequent run. For example, if we receive input files
 with requests from the following dates:

  - 2017-11-01 and 2017-11-02
  - 2017-11-02 and 2017-11-03
  - (gap)
  - 2017-11-04 and 2017-11-05
  - 2017-11-05 and 2017-11-06

 We would produce output files for:

  - (skip 2017-11-01, because first UTC date)
  - 2017-11-02
  - 2017-11-03
  - 2017-11-04
  - 2017-11-05
  - (skip 2017-11-06, because last UTC date)
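
 As a rough sketch of that rule (not the actual implementation; the
 function name and the representation of an input file as a set of UTC
 request dates are assumptions made for illustration):

```python
from datetime import date

def publishable_dates(files):
    """Return the UTC dates for which output files would be produced.

    `files` is a hypothetical list with one set of request dates per
    input file. The first and last UTC dates seen across all files are
    skipped, because they might still be incomplete.
    """
    seen = sorted(set().union(*files))
    return seen[1:-1]

# The input files from the example above, including the gap.
example_files = [
    {date(2017, 11, 1), date(2017, 11, 2)},
    {date(2017, 11, 2), date(2017, 11, 3)},
    {date(2017, 11, 4), date(2017, 11, 5)},
    {date(2017, 11, 5), date(2017, 11, 6)},
]
example_output = publishable_dates(example_files)
```

 With these inputs, `example_output` contains 2017-11-02 through
 2017-11-05. The gap goes unnoticed, because both 2017-11-03 and
 2017-11-04 appear in at least one file.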

 Now, if we later find another file filling the gap with the following
 contained request dates:

  - 2017-11-03 and 2017-11-04

 We ''couldn't'' update the output files for 2017-11-03 and 2017-11-04
 anymore! We would simply leave them unchanged, containing just the
 requests we processed earlier.
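
 To illustrate the "never alter after publication" behavior, here is a
 minimal sketch. The names `publish_once`, `published` and `candidates`
 are hypothetical, and output files are modeled as plain lists of request
 lines:

```python
from datetime import date

def publish_once(published, candidates):
    """Add newly produced output files without touching existing ones.

    `published` maps a UTC date to the request lines already published;
    `candidates` maps a UTC date to the lines produced in the current run.
    Dates that were published before are left unchanged.
    """
    for day, lines in sorted(candidates.items()):
        if day in published:
            continue  # already published, so never rewritten
        published[day] = list(lines)
    return published

# First run: the file covering the gap is missing.
published = publish_once({}, {
    date(2017, 11, 3): ["request A"],
    date(2017, 11, 4): ["request B"],
})

# Later run: the gap file adds requests for both dates.
published = publish_once(published, {
    date(2017, 11, 3): ["request A", "request C"],
    date(2017, 11, 4): ["request B", "request D"],
})
```

 After the second run, the entries for 2017-11-03 and 2017-11-04 still
 contain only the requests from the first run.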

 But is this a bug we should be able to handle? It seems like a bug in the
 log-copying script combined with bad timing. During normal operation and
 in the bulk-import case this should not happen.

 Note that if you think that cutting off the first and last days is not
 enough, we could easily change that to cutting off the first two and last
 two days. Or the first day and the last two. Or the first and last three.
 Whatever we think works best.
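
 A sketch of how such a cutoff could be made adjustable. The parameter
 names `skip_first` and `skip_last` are made up here and not part of any
 existing code:

```python
from datetime import date

def trim_dates(dates, skip_first=1, skip_last=1):
    """Drop the given number of earliest and latest UTC dates."""
    ordered = sorted(set(dates))
    end = len(ordered) - skip_last
    if end <= skip_first:
        return []  # too few dates to publish anything
    return ordered[skip_first:end]

all_dates = [date(2017, 11, d) for d in range(1, 7)]
one_and_two = trim_dates(all_dates, skip_first=1, skip_last=2)
```

 Here `one_and_two` covers 2017-11-02 through 2017-11-04, that is, the
 first day and the last two days are cut off.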

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/23243#comment:52>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

