[metrics-bugs] #23243 [Metrics/Website]: Write a specification for Tor web server logs
Tor Bug Tracker & Wiki
blackhole at torproject.org
Tue Oct 24 15:56:22 UTC 2017
#23243: Write a specification for Tor web server logs
-----------------------------+--------------------------------
Reporter: iwakeh | Owner: metrics-team
Type: enhancement | Status: needs_revision
Priority: Medium | Milestone:
Component: Metrics/Website | Version:
Severity: Normal | Resolution:
Keywords: metrics-2017 | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-----------------------------+--------------------------------
Comment (by iwakeh):
Replying to [comment:48 karsten]:
> Replying to [comment:47 iwakeh]:
> > Replying to [comment:46 karsten]:
> > > I'm not sure if we can resolve these questions by hard thinking.
> >
> > Well, we need to work on thoughtful decision making.
>
> Unfortunately, I'm not a good person these days to dive deep enough into
this topic to make thoughtful decisions. Too many topics, too little time
to do any of them well enough. That's why I hoped you'd just solve all
problems here and I could then review the solution. :)
There are always many solutions. So, you'd also need to be involved in
choosing ;-)
>
> ...
> > The current webstat code and the spec require a log per day.
>
> Well, no. The spec says "Tor's web servers are configured to rotate logs
''at least'' once per day". If we didn't mean that, let's phrase it
differently. But how?
The spec's naming requirement,
`<physical-host>/<virtual-host>-access.log-YYYYMMDD[.gz]`, indirectly
allows at most one log per day and virtual-physical host combination.
Combined with the at-least-once-per-day rule, this yields exactly one log
per day. It might be useful to state this explicitly.
(If more than one log is supplied for the same date, the first wins.)
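
As a minimal sketch of that rule (this is not the CollecTor code, which is
Java; the regular expression and the names `LOG_NAME`, `parse_log_path`
and `first_log_per_day` are assumptions made up for illustration), parsing
the path and keeping the first log per host combination and date could
look like this:

```python
import re
from datetime import date, datetime

# Assumed pattern for <physical-host>/<virtual-host>-access.log-YYYYMMDD[.gz];
# the spec wording, not this regex, is authoritative.
LOG_NAME = re.compile(
    r"^(?P<physical>[^/]+)/(?P<virtual>[^/]+)-access\.log-(?P<day>\d{8})(?:\.gz)?$"
)


def parse_log_path(path):
    """Return (physical_host, virtual_host, date), or None if the name does not match."""
    match = LOG_NAME.match(path)
    if match is None:
        return None
    try:
        day = datetime.strptime(match.group("day"), "%Y%m%d").date()
    except ValueError:
        # Eight digits that do not form a calendar date, e.g. 20171332.
        return None
    return match.group("physical"), match.group("virtual"), day


def first_log_per_day(paths):
    """Keep one path per (physical, virtual, date); the first one supplied wins."""
    chosen = {}
    for path in paths:
        key = parse_log_path(path)
        if key is not None and key not in chosen:
            chosen[key] = path
    return chosen
```

Here the compressed and the uncompressed file for the same date map to the
same key, so whichever is encountered first is the one that is kept.
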
>
> And we should write down possible failure modes for the case that logs
are rotated less often or more often.
The naming requirements don't allow for more than one log per day.
Logs that don't match the requirements are dropped (which can be logged as
a warning, but that's part of the implementation, not part of the spec).
>
> In any case, we should warn in case we run into one of these cases,
rather than silently continuing operation and simply producing
fewer/smaller sanitized logs.
There will be warnings in the logs if the naming of logs suddenly no
longer complies with the spec's requirements. And that would inevitably
happen if logs were rotated more often than once per day.
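
A rough sketch of that behaviour, assuming Python's standard `logging`
module; the logger name, the function `split_conforming` and the
simplified regular expression are hypothetical and only illustrate the
idea, they are not taken from the implementation:

```python
import logging
import re

logger = logging.getLogger("webstats-sketch")  # hypothetical logger name

# Assumed pattern for <physical-host>/<virtual-host>-access.log-YYYYMMDD[.gz].
LOG_NAME = re.compile(r"^[^/]+/[^/]+-access\.log-\d{8}(?:\.gz)?$")


def split_conforming(paths):
    """Separate conforming log paths from dropped ones, warning about each drop."""
    accepted, dropped = [], []
    for path in paths:
        if LOG_NAME.match(path):
            accepted.append(path)
        else:
            # Never skip silently: every dropped file leaves a warning behind.
            logger.warning("Dropping log with non-conforming name: %s", path)
            dropped.append(path)
    return accepted, dropped
```

A file named with an extra suffix after the date (a made-up example of
what more frequent rotation might produce) would end up in the dropped
list and trigger a warning.
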
>
> > [...]
> > 1. Make sure by outside means that there is no day without a log (e.g.
by providing an empty file for that day using 'touch'). This would work
without additional implementation for CollecTor and this works for bulk
imports as well as daily processing. As a result there will be a
sanitized log for each day offered by CollecTor, some might be empty.
>
> I'd say we need to do something that doesn't require any upstream
changes. In other words, whatever ends up in `in/webstats/` is what we
should be able to work with. We shouldn't require upstream to touch files
for us.
>
> > 2. For bulk processing a property could signal CollecTor to use all
logs without insisting on an uninterrupted chain. This still requires
outside measures for making sure no log lines are lost and might result in
days without any logs, unless CollecTor creates empty ones.
> > 3. Think out a mechanism that enables more automated processing of an
interrupted chain of logs. This seems error-prone and will result in many
edge cases.
>
> I don't know, maybe we can do something with system time or state files.
Or we could process everything in `in/webstats/` and write everything to
`out/` and `recent/` except the first and last encountered UTC days. Just
some ideas.
>
> Again, I'm not deep enough into this to make a good decision. I just
hope that whatever thing we'll build here is robust enough to either
handle all of the cases or warns loudly whenever it runs into an
unforeseen case.
>
> I'm very concerned about silently losing data. That's the worst thing
that could happen to us, in particular given that we don't keep archives
of the input data in this case.
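
The idea quoted above of writing everything except the first and last
encountered UTC days could be sketched as follows. This is only an
illustration under the assumption that the set of days seen in the input
is already known; the function name `publishable_days` is made up:

```python
from datetime import date


def publishable_days(encountered_days):
    """Return the sorted encountered days without the first and the last one."""
    days = sorted(set(encountered_days))
    # With two or fewer distinct days nothing is left to publish.
    return days[1:-1]
```
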
From the above I infer that you think we should opt for number 3, i.e.,
automate as much as possible, warn whenever there is some unforeseen
naming etc.
Is this correct?
If that's the case, my next steps would be: first, to provide a patch for
the spec that makes the one-log-per-day requirement explicit and also
discusses what happens if logs are missing; and second, based on that, to
extend the implementation in #22428 to address the new requirements.
Does that sound like a plan?
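
To make the missing-logs case a bit more concrete, here is a minimal
sketch of detecting an interrupted chain of days so that it can be warned
about. The function name `missing_days` is hypothetical, and what should
happen with the reported gaps is exactly what the spec patch would have to
define:

```python
from datetime import date, timedelta


def missing_days(available_days):
    """Return all days between the first and last available day that have no log."""
    days = set(available_days)
    if not days:
        return []
    current, last = min(days), max(days)
    gaps = []
    while current <= last:
        if current not in days:
            gaps.append(current)
        current += timedelta(days=1)
    return gaps
```
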
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/23243#comment:49>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online