[metrics-bugs] #23243 [Metrics/Website]: Write a specification for Tor web server logs
Tor Bug Tracker & Wiki
blackhole at torproject.org
Tue Oct 24 14:09:18 UTC 2017
#23243: Write a specification for Tor web server logs
-----------------------------+--------------------------------
Reporter: iwakeh | Owner: metrics-team
Type: enhancement | Status: needs_revision
Priority: Medium | Milestone:
Component: Metrics/Website | Version:
Severity: Normal | Resolution:
Keywords: metrics-2017 | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-----------------------------+--------------------------------
Comment (by karsten):
Replying to [comment:47 iwakeh]:
> Replying to [comment:46 karsten]:
> > I'm not sure if we can resolve these questions by hard thinking.
>
> Well, we need to work on thoughtful decision making.
Unfortunately, I'm not the right person these days to dive deep enough
into this topic to make thoughtful decisions. Too many topics, too little
time to do any of them well enough. That's why I hoped you'd just solve all
problems here and I could then review the solution. :)
> There're not that many questions above except yours:
> > ... what happens if a web server rotates logs more often than once per
> > day? At least that's something that we write in the specification. I'm not
> > sure how this would work with file names, so maybe we in fact require that
> > logs are rotated exactly once per day, and we just didn't write that in
> > the specification yet. However, it seems rather restrictive to prescribe
> > exact log rotation intervals in order to sanitize logs subsequently. Maybe
> > we should be less restrictive here.
>
> The current webstat code and the spec require a log per day.
Well, no. The spec says "Tor's web servers are configured to rotate logs
''at least'' once per day". If we didn't mean that, let's phrase it
differently. But how?
And we should write down possible failure modes for the case that logs are
rotated less often or more often.
In any case, we should warn whenever we run into one of these cases, rather
than silently continuing operation and simply producing fewer or smaller
sanitized logs.
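As a minimal sketch of the kind of warning meant here, assuming we already
know the set of UTC days for which at least one request line was found in
the input: the function names below are hypothetical and not part of
CollecTor or the webstats code. The sketch covers only one failure mode, a
day with no log lines at all, and a server that was legitimately idle for a
day would trigger the same warning.

```python
import logging
from datetime import date, timedelta

logger = logging.getLogger("webstats-sketch")  # hypothetical logger name


def missing_days(days_seen):
    """Return the days between the first and last seen day that have no log lines."""
    present = set(days_seen)
    if not present:
        return []
    first, last = min(present), max(present)
    gaps = []
    day = first
    while day <= last:
        if day not in present:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


def warn_about_gaps(days_seen):
    """Log one warning per missing day instead of continuing silently."""
    gaps = missing_days(days_seen)
    for day in gaps:
        logger.warning("No log lines found for %s; input may be incomplete.",
                       day.isoformat())
    return gaps


if __name__ == "__main__":
    # Example input: two days are missing between Oct 20 and Oct 23.
    warn_about_gaps([date(2017, 10, 20), date(2017, 10, 23)])
```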
> [...]
> 1. Make sure by outside means that there is no day without a log (e.g.
> by providing an empty file for that day using 'touch'). This would work
> without additional implementation for CollecTor and this works for bulk
> imports as well as daily processing. As a result there will be a
> sanitized log for each day offered by CollecTor, some might be empty.
I'd say we need to do something that doesn't require any upstream changes.
In other words, whatever ends up in `in/webstats/` is what we should be
able to work with. We shouldn't require upstream to touch files for us.
> 2. For bulk processing a property could signal CollecTor to use all logs
> without insisting on an uninterrupted chain. This still requires outside
> measures for making sure no log lines are lost and might result in days
> without any logs, unless CollecTor creates empty ones.
> 3. Think out a mechanism that enables more automated processing of an
> interrupted chain of logs. This seems error-prone and will result in many
> edge cases.
I don't know, maybe we can do something with system time or state files.
Or we could process everything in `in/webstats/` and write everything to
`out/` and `recent/` except the first and last encountered UTC days. Just
some ideas.
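The second idea could look roughly like the sketch below. It assumes that
each input line has already been parsed into a timezone-aware timestamp
plus the raw line, and the function names are hypothetical. The first and
last encountered UTC days are held back because the input might not cover
them completely.

```python
from collections import defaultdict
from datetime import datetime, timezone


def group_by_utc_day(entries):
    """Group (aware_timestamp, line) pairs by the UTC day of the timestamp."""
    by_day = defaultdict(list)
    for timestamp, line in entries:
        by_day[timestamp.astimezone(timezone.utc).date()].append(line)
    return by_day


def complete_days(by_day):
    """Drop the first and last encountered UTC days and keep the rest."""
    days = sorted(by_day)
    return {day: by_day[day] for day in days[1:-1]}


if __name__ == "__main__":
    entries = [
        (datetime(2017, 10, 22, 23, 59, tzinfo=timezone.utc), "line a"),
        (datetime(2017, 10, 23, 0, 0, tzinfo=timezone.utc), "line b"),
        (datetime(2017, 10, 23, 12, 0, tzinfo=timezone.utc), "line c"),
        (datetime(2017, 10, 24, 0, 1, tzinfo=timezone.utc), "line d"),
    ]
    for day, lines in complete_days(group_by_utc_day(entries)).items():
        print(day.isoformat(), lines)
```

With fewer than three distinct days in the input, nothing would be written
at all, which is itself a case worth warning about.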
Again, I'm not deep enough into this to make a good decision. I just hope
that whatever we build here is robust enough to either handle all of the
cases or warn loudly whenever it runs into an unforeseen case.
I'm very concerned about silently losing data. That's the worst thing that
could happen to us, in particular given that we don't keep archives of the
input data in this case.
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/23243#comment:48>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online