[metrics-bugs] #23243 [Metrics/Website]: Write a specification for Tor web server logs

Tor Bug Tracker & Wiki blackhole at torproject.org
Tue Oct 24 15:56:22 UTC 2017


#23243: Write a specification for Tor web server logs
-----------------------------+--------------------------------
 Reporter:  iwakeh           |          Owner:  metrics-team
     Type:  enhancement      |         Status:  needs_revision
 Priority:  Medium           |      Milestone:
Component:  Metrics/Website  |        Version:
 Severity:  Normal           |     Resolution:
 Keywords:  metrics-2017     |  Actual Points:
Parent ID:                   |         Points:
 Reviewer:                   |        Sponsor:
-----------------------------+--------------------------------

Comment (by iwakeh):

 Replying to [comment:48 karsten]:
 > Replying to [comment:47 iwakeh]:
 > > Replying to [comment:46 karsten]:
 > > > I'm not sure if we can resolve these questions by hard thinking.
 > >
 > > Well, we need to work on thoughtful decision making.
 >
 > Unfortunately, I'm not a good person these days to dive deep enough into
 this topic to make thoughtful decisions. Too many topics, too little time
 to do any of them well enough. That's why I hoped you'd just solve all
 problems here and I could then review the solution. :)

 There are always many solutions. So, you'd also need to be involved in
 choosing ;-)

 >
 > ...
 > > The current webstat code and the spec require a log per day.
 >
 > Well, no. The spec says "Tor's web servers are configured to rotate logs
 ''at least'' once per day". If we didn't mean that, let's phrase it
 differently. But how?

 The spec indirectly allows at most one log per day and virtual-physical
 host combination through its naming requirement
 `<physical-host>/<virtual-host>-access.log-YYYYMMDD[.gz]`.  Combining this
 with the at-least-once-per-day rule yields exactly one log per day.  It
 might be useful to state this explicitly.
 (If more logs for the same date are supplied, the first wins.)
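
 To make the naming rule and the first-wins behavior concrete, here is a
 minimal sketch in Python.  It is not the webstat code; the regular
 expression is only my reading of the naming pattern, and the function
 names `parse_log_name` and `first_wins` are made up for illustration.

```python
import re
from datetime import datetime

# Assumed reading of the spec's naming requirement:
# <physical-host>/<virtual-host>-access.log-YYYYMMDD[.gz]
LOG_NAME = re.compile(
    r"^(?P<physical>[^/]+)/(?P<virtual>[^/]+)-access\.log-(?P<day>\d{8})(?:\.gz)?$"
)


def parse_log_name(path):
    """Return (physical_host, virtual_host, date), or None if the name does not match."""
    match = LOG_NAME.match(path)
    if match is None:
        return None
    try:
        day = datetime.strptime(match.group("day"), "%Y%m%d").date()
    except ValueError:
        # Eight digits that do not form a calendar date.
        return None
    return match.group("physical"), match.group("virtual"), day


def first_wins(paths):
    """Keep only the first path seen for each (physical, virtual, date) key."""
    chosen = {}
    for path in paths:
        key = parse_log_name(path)
        if key is not None and key not in chosen:
            chosen[key] = path
    return chosen
```

 With this reading, a plain and a `.gz` file for the same host
 combination and date map to the same key, so only the one encountered
 first is kept.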

 >
 > And we should write down possible failure modes for the case that logs
 are rotated less often or more often.

 The naming requirements don't allow for more than one log per day.
 Logs that don't match the requirements are dropped (which can be logged as
 a warning, but that's part of the implementation, not part of the spec).

 >
 > In any case, we should warn in case we run into one of these cases,
 rather than silently continuing operation and simply producing
 fewer/smaller sanitized logs.

 There will be warnings in the logs if the naming of logs suddenly no
 longer complies with the spec's requirements.  And that would inevitably
 happen if logs are rotated more often than once per day.
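
 As an illustration of how such warnings could come about, here is a
 small sketch using Python's standard `logging` module.  The logger name,
 the function `split_by_name`, and the example file names for
 more-than-daily rotation are assumptions of mine, not taken from the
 spec or the implementation.

```python
import logging
import re

logger = logging.getLogger("webstats-sketch")  # hypothetical logger name

# Assumed reading of the naming requirement; anything else is rejected.
LOG_NAME = re.compile(r"^[^/]+/[^/]+-access\.log-\d{8}(?:\.gz)?$")


def split_by_name(paths):
    """Separate compliant names from non-compliant ones, warning about the latter."""
    accepted, rejected = [], []
    for path in paths:
        if LOG_NAME.match(path):
            accepted.append(path)
        else:
            logger.warning("Skipping log with unexpected name: %s", path)
            rejected.append(path)
    return accepted, rejected
```

 A name that carries an extra hour suffix or a numeric rotation suffix
 would not match the pattern and would end up in the rejected list with
 a warning, rather than being dropped silently.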

 >
 > > [...]
 > > 1. Make sure by outside means that there is no day without a log (e.g.
 by providing an empty file for that day using 'touch').  This would work
 without additional implementation for CollecTor and this works for bulk
 imports as well as daily processing.  As a result there will be a
 sanitized log for each day offered by CollecTor, some might be empty.
 >
 > I'd say we need to do something that doesn't require any upstream
 changes. In other words, whatever ends up in `in/webstats/` is what we
 should be able to work with. We shouldn't require upstream to touch files
 for us.
 >
 > > 2. For bulk processing a property could signal CollecTor to use all
 logs without insisting on an uninterrupted chain.  This still requires
 outside measures for making sure no log lines are lost and might result in
 days without any logs, unless CollecTor creates empty ones.
 > > 3. Think up a mechanism that enables more automated processing of an
 interrupted chain of logs.  This seems error-prone and will result in many
 edge cases.
 >
 > I don't know, maybe we can do something with system time or state files.
 Or we could process everything in `in/webstats/` and write everything to
 `out/` and `recent/` except the first and last encountered UTC days. Just
 some ideas.
 >
 > Again, I'm not deep enough into this to make a good decision. I just
 hope that whatever thing we'll build here is robust enough to either
 handle all of the cases or warns loudly whenever it runs into an
 unforeseen case.
 >
 > I'm very concerned about silently losing data. That's the worst thing
 that could happen to us, in particular given that we don't keep archives
 of the input data in this case.
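
 The idea quoted above of writing everything except the first and last
 encountered UTC days could be sketched roughly as follows.  This is
 only my interpretation (the outermost days might be covered only
 partially by the input), and the function name `complete_days` is
 hypothetical.

```python
from datetime import date


def complete_days(days_seen):
    """Return the encountered UTC days minus the earliest and the latest one."""
    ordered = sorted(set(days_seen))
    # Slicing drops the first and last element; with fewer than three
    # distinct days nothing remains.
    return ordered[1:-1]
```

 Under this sketch, input covering fewer than three distinct days would
 produce no output at all, which is one of the edge cases to decide on.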


 From the above I infer that you think we should opt for number 3, i.e.,
 automate as much as possible, warn whenever there is some unforeseen
 naming etc.
 Is this correct?

 If that's the case, my next steps would be, first, to provide a patch for
 the spec that makes the one-log-per-day requirement explicit and also
 discusses what happens if logs are missing; and second, based on that, to
 extend the implementation in #22428 to address the new requirements.
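
 For the missing-logs case, a check for gaps in the chain of days might
 look like the following sketch.  It only assumes that the dates have
 already been extracted from the file names; `missing_days` is a
 hypothetical helper, not part of #22428.

```python
from datetime import date, timedelta


def missing_days(days_seen):
    """Return the days between the earliest and latest seen day that have no log."""
    present = set(days_seen)
    if not present:
        return []
    current, last = min(present), max(present)
    gaps = []
    while current < last:
        current += timedelta(days=1)
        if current not in present:
            gaps.append(current)
    return gaps
```

 A non-empty result would be the trigger for a loud warning; whether to
 stop processing or continue in that case is what the spec patch would
 need to settle.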

 Does that sound like a plan?

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/23243#comment:49>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

