[tor-bugs] #22428 [Metrics/CollecTor]: Add webstats module

Tor Bug Tracker & Wiki blackhole at torproject.org
Thu Jan 25 11:49:34 UTC 2018


#22428: Add webstats module
-------------------------------+---------------------------------
 Reporter:  iwakeh             |          Owner:  iwakeh
     Type:  enhancement        |         Status:  needs_revision
 Priority:  High               |      Milestone:  CollecTor 1.5.0
Component:  Metrics/CollecTor  |        Version:
 Severity:  Normal             |     Resolution:
 Keywords:  metrics-2018       |  Actual Points:
Parent ID:                     |         Points:
 Reviewer:                     |        Sponsor:
-------------------------------+---------------------------------

Comment (by karsten):

 Replying to [comment:51 iwakeh]:
 > Replying to [comment:50 karsten]:
 > > Alright, I made a few tweaks to your branch
 [https://gitweb.torproject.org/karsten/metrics-db.git/log/?h=task-22428-4
 in my task-22428-4 branch] with the latest commit being 1f1adec.
 >
 > Changes look fine, thanks!  I will do further work based on this branch.

 Great!

 > > I also ran a first round of tests. Here's what I found:
 > >  - The `recent/` directory contains only 1 log file per
 physical/virtual host from 3 days ago. The reason is that the last 2 days
 are excluded as too recent. But the `recent/` directory is supposed to
 contain the last 72 hours of data produced by CollecTor, not data
 originally published within the last 72 hours. The goal is to give
 CollecTor clients 72 hours to obtain newly provided data before they need
 to fall back to archives. I think the other modules only look at last-
 modified time of provided files and delete them after 72 hours. Maybe we
 should do that here, too.
 >
 > This is related to comment:44, which quotes the relevant section of our
 spec (and still waits for comments).

 Huh, I did not see that comment. First thought is that we probably don't
 need another config option as long as we clearly document how the
 processing works. But I'll think more about this.

 > The spec allows only earliest logs from two days before the current
 date.  Providing data for the last three days of collecting results would
 mean to extend the logic to allow data from up to five days ago.  Example
 for the '''intended''' functionality:
 >    Log files from (omitting the year) 1/11, 1/12, 1/13, 1/14, 1/15,
 1/16, 1/17 should result in having the logs of 1/13, 1/14, 1/15 in
 'recent'.
 >
 > If yes I'll add a patch commit.

 Well, no. 1/11 and 1/12 would be included in the `recent/` directory, too,
 at least after processing the stated files in bulk. When processing them
 in an ongoing way, we'd indeed only expect 1/13, 1/14, and 1/15 in
 `recent/`.

 Maybe it helps to ignore the 2-days logic entirely here and only look at
 the last-modified time of sanitized files. This may produce a gigantic
 `recent/` directory after the first bulk import and the 72 hours after
 that, but after that everything should stabilize.

 > >  - The regular expression in metrics-lib's `WebServerAccessLogLine` is
 too strict. It does not consider the following log line to be valid, but
 it should: `0.0.0.0 - - [22/Jan/2018:00:00:00 +0000] "GET
 /collector/archive HTTP/1.1" 301 -`. We should probably also compare the
 regular expression to the Apache specification for similar cases where
 we're being too strict.
 >
 > If such lines are contained and valid we should make sure it is also
 reflected in the spec.
 > I will provide a patch.

 Sounds good. Maybe also see what the Apache log file specification says
 here.

 > >  - There are subtle differences between files provided by
 webstats.tp.o and the ones produced here. We should probably write a
 simple (shell) script to convert files from webstats.tp.o to the format
 we'd have produced. Things like cutting off columns at the end or cutting
 off the `?` from parameters. We'd run that script once when copying over
 files.
 >
 > I'm objecting against the script solution.
 > The easiest way and another way of testing might be to simply import
 these using CollecTor.  Setting log level to debug will tell us what gets
 dropped and provide another good test scenario with a useful result.

 Ah, sure, that would work, too.

 > >  - I guess we'll need to extend create-tarballs script to produce
 tarballs containing webstats files. I haven't looked, though.
 >
 > Oh, true.
 >
 > >
 > > Do you want to look into these issues?
 >
 > Yep.
 > Will you later import the old logs?  I could also do that on corsicum or
 on my machine and upload there.

 I can do that.

 > > Remaining changes after those above are:
 > >  - Run another round of tests with most or even all original log files
 as input.
 >    ^^^^ karsten's task

 Yep, I'll do that.

 >   - import the old logs (iwakeh or karsten see above)

 And this.

 > >  - Clean up `CHANGELOG.md`.
 > >  - Release metrics-lib 2.2.0 and update `build.xml`.
 > >  - Update copyrights to 2018.
 >
 > I could rebase your most recent branch on master and then add these
 changes?

 I'm happy to do this. Please keep working on the current branch.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/22428#comment:52>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online


More information about the tor-bugs mailing list