[metrics-bugs] #22428 [Metrics/CollecTor]: Add webstats module
Tor Bug Tracker & Wiki
blackhole at torproject.org
Thu Jan 25 11:49:34 UTC 2018
#22428: Add webstats module
-------------------------------+---------------------------------
Reporter: iwakeh | Owner: iwakeh
Type: enhancement | Status: needs_revision
Priority: High | Milestone: CollecTor 1.5.0
Component: Metrics/CollecTor | Version:
Severity: Normal | Resolution:
Keywords: metrics-2018 | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-------------------------------+---------------------------------
Comment (by karsten):
Replying to [comment:51 iwakeh]:
> Replying to [comment:50 karsten]:
> > Alright, I made a few tweaks to your branch
[https://gitweb.torproject.org/karsten/metrics-db.git/log/?h=task-22428-4
in my task-22428-4 branch] with the latest commit being 1f1adec.
>
> Changes look fine, thanks! I will do further work based on this branch.
Great!
> > I also ran a first round of tests. Here's what I found:
> > - The `recent/` directory contains only 1 log file per
physical/virtual host from 3 days ago. The reason is that the last 2 days
are excluded as too recent. But the `recent/` directory is supposed to
contain the last 72 hours of data produced by CollecTor, not data
originally published within the last 72 hours. The goal is to give
CollecTor clients 72 hours to obtain newly provided data before they need
to fall back to archives. I think the other modules only look at last-
modified time of provided files and delete them after 72 hours. Maybe we
should do that here, too.
>
> This is related to comment:44, which quotes the relevant section of our
spec (and still waits for comments).
Huh, I did not see that comment. First thought is that we probably don't
need another config option as long as we clearly document how the
processing works. But I'll think more about this.
> The spec allows only earliest logs from two days before the current
date. Providing data for the last three days of collecting results would
mean to extend the logic to allow data from up to five days ago. Example
for the '''intended''' functionality:
> Log files from (omitting the year) 1/11, 1/12, 1/13, 1/14, 1/15,
1/16, 1/17 should result in having the logs of 1/13, 1/14, 1/15 in
'recent'.
>
> If yes I'll add a patch commit.
Well, no. 1/11 and 1/12 would be included in the `recent/` directory, too,
at least after processing the stated files in bulk. When processing them
in an ongoing way, we'd indeed only expect 1/13, 1/14, and 1/15 in
`recent/`.
Maybe it helps to ignore the 2-days logic entirely here and only look at
the last-modified time of sanitized files. This may produce a gigantic
`recent/` directory after the first bulk import and the 72 hours after
that, but after that everything should stabilize.
> > - The regular expression in metrics-lib's `WebServerAccessLogLine` is
too strict. It does not consider the following log line to be valid, but
it should: `0.0.0.0 - - [22/Jan/2018:00:00:00 +0000] "GET
/collector/archive HTTP/1.1" 301 -`. We should probably also compare the
regular expression to the Apache specification for similar cases where
we're being too strict.
>
> If such lines are contained and valid we should make sure it is also
reflected in the spec.
> I will provide a patch.
Sounds good. Maybe also see what the Apache log file specification says
here.
> > - There are subtle differences between files provided by
webstats.tp.o and the ones produced here. We should probably write a
simple (shell) script to convert files from webstats.tp.o to the format
we'd have produced. Things like cutting off columns at the end or cutting
off the `?` from parameters. We'd run that script once when copying over
files.
>
> I'm objecting against the script solution.
> The easiest way and another way of testing might be to simply import
these using CollecTor. Setting log level to debug will tell us what gets
dropped and provide another good test scenario with a useful result.
Ah, sure, that would work, too.
> > - I guess we'll need to extend create-tarballs script to produce
tarballs containing webstats files. I haven't looked, though.
>
> Oh, true.
>
> >
> > Do you want to look into these issues?
>
> Yep.
> Will you later import the old logs? I could also do that on corsicum or
on my machine and upload there.
I can do that.
> > Remaining changes after those above are:
> > - Run another round of tests with most or even all original log files
as input.
> ^^^^ karsten's task
Yep, I'll do that.
> - import the old logs (iwakeh or karsten see above)
And this.
> > - Clean up `CHANGELOG.md`.
> > - Release metrics-lib 2.2.0 and update `build.xml`.
> > - Update copyrights to 2018.
>
> I could rebase your most recent branch on master and then add these
changes?
I'm happy to do this. Please keep working on the current branch.
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/22428#comment:52>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
More information about the metrics-bugs
mailing list