[metrics-bugs] #22983 [Metrics/Library]: Add a Descriptor subinterface and implementation for Tor web server logs

Tor Bug Tracker & Wiki blackhole at torproject.org
Wed Jan 10 16:38:56 UTC 2018


#22983: Add a Descriptor subinterface and implementation for Tor web server logs
-----------------------------+-----------------------------------
 Reporter:  iwakeh           |          Owner:  metrics-team
     Type:  enhancement      |         Status:  needs_review
 Priority:  High             |      Milestone:  metrics-lib 2.2.0
Component:  Metrics/Library  |        Version:
 Severity:  Normal           |     Resolution:
 Keywords:  metrics-2018     |  Actual Points:
Parent ID:                   |         Points:
 Reviewer:                   |        Sponsor:
-----------------------------+-----------------------------------

Comment (by iwakeh):

 Replying to [comment:58 karsten]:
 > ...
 > > You only need to look at the latest four commits.  The others are
 reviewed already (comment:47).
 >
 > Oh. That information would have saved some time this morning. That also
 explains why I have just one remark on the earlier commits and a couple
 more on the first four!

 Hrmm, in my request for review I did mention:

    ... The additional commits should address all topics from the comments
 above starting at comment:47. ...

 Maybe, too subtle for 'please re-read the comments starting at comment:47
 unless you memorized them earlier' ;-)

 >
 > ...
 > There are two types of changes here:
 >
 >  1. Changes in the review process that require knowing the branch under
 review. Ideally, we'd use "fixup" or "squash" commits for all commits made
 during the review process. Alternatively, we can agree that commits will
 be squashed prior to merge, which works for me, too.
 >  2. Changes in the merge process that only require knowing master before
 the merge. Somebody who didn't follow the review process, including our
 future selfs, should see changes where we're not going forth and back
 throughout commit series to achieve something that we could do in one
 step.

 Sounds ok.

 >
 >  ...
 > Anyway, here's what I found in your branch:
 >
 >  - `AccessLogLineSanVal` contains an explicit mention of `"FTP"`. Does
 this mean we're considering FTP log lines to be valid? Why FTP in
 particular and no other protocols? Or should we take that out and restrict
 ourselves to HTTP? Not feeling strongly about this one.

 Please also refer to comments 50 to 52.  The main issue was to not only
 declare lines as valid that resemble sanitized lines, but be more general.
 Thus, allowing different protocols could be part of that, but I don't
 insist.
 Which log lines should be considered valid?  How much can log lines differ
 from sanitized log lines and still be considered valid?  The current code
 only sketches one possibility, e.g. allowing different protocols, various
 ip addresses, and other request types.  Before making any more code
 changes there should be at least a list of valid and invalid lines or a
 description of either state.

 >  - Further down below we're comparing `referenceDate` to `extractedDate`
 and discarding lines where the two don't match. Not sure if this is
 problematic, but I'm at least flagging it here as potential issue. If this
 is a non-issue, feel free to ignore this comment.

 The variable `extractedDate` refers to the date extracted from the log
 line and reference date is the log files date (see
 [https://metrics.torproject.org/web-server-logs.html#n-re-assembling-log-
 files spec section 4.3]).  As discarding is only performed during
 sanitization this is ok.  The lines would be considered valid.

 >  - A few lines further down we're formatting a date using
 `DateTimeFormatter.ofPattern()`. Is that expensive? Would we be able to
 create the formatter just once and re-use it here? Not trying to optimize
 prematurely, but trying to avoid shooting ourselves in the foot.

 Makes sense. (1)

 >  - Logging an error in case of catching a `Throwable` might become very
 noisy. After all, we're processing third-party input here, and we can
 typically proceed without parsing that line. The debug level seemed like
 the better choice here.

 Oh, that's an oversight. (2)

 >  - In `WebServerAccessLog`, what exactly is the log date? Is it the date
 of contained requests, is it the date when the file was written to disk,
 or what is it? This is still unclear in the docs.

 [https://gitweb.torproject.org/user/iwakeh/metrics-
 lib.git/tree/src/main/java/org/torproject/descriptor/WebServerAccessLog.java?h=task-22983-4&id=5dbbeb0ac38728f09718c80e295e8524d92d2e68#n19
 Javadoc] says:
 `Returns the date of the log as <code>LocalDate</code>.`
 and the [https://metrics.torproject.org/web-server-logs.html#n-re-
 assembling-log-files spec 4.3] says: `Rewritten log lines are re-assembled
 into sanitized log files based on physical host, virtual host, and request
 start date.`

 Where would a clarification be needed?

 >
 > When these remaining issues are clearer to me, do you mind me editing
 your branch and preparing one or more "fixup"/"squash" commits? There are
 still a few things that I'd like to document differently or where I'd like
 to make the new code more similar to existing code (in the good sense).
 You could still go through them and talk me out of making those changes.
 Sound okay?

 Sure, go ahead.  Would you also make the tiny adaptions from (1) and (2),
 or should I add these to the current branch first?

 >
 > Oh, and here's another remark for future reviews: let's try to reduce
 the time for reviewing and for revising a branch to a few days. Neither
 reviewing nor revising is fun, but it's even less fun when doing it
 several weeks later. I know this is a special case with the holiday break,
 and I'm not blaming you any more than I'm blaming myself. Just an idea to
 make reviews a little more fun. :)

 The holidays were not the problem here.  I think tasks like adding a new
 module especially those spanning two or more products should be treated in
 a more formal/structured approach in future.   There were several
 iterations that were not coding or (technical) design questions, but
 actually only concerned the specification (not only the one published on
 metrics.tp.o, but the definition of the wanted functionality for the
 webstats module itself).  Even now we still have an iteration on wanted
 functionality, i.e., which log lines should be considered valid.  These
 discussions for clarifying wanted functionality should happen before
 writing code.  The current approach produces code that is written,
 reviewed, tossed, and rewritten only because of not having a clear picture
 of the wanted functionality beforehand (also thinking of CollecTor's
 changes here).  Something we should have in mind when adding other modules
 or making major changes in future.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/22983#comment:59>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online


More information about the metrics-bugs mailing list