[metrics-bugs] #22983 [Metrics/Library]: Add a Descriptor subinterface and implementation for Tor web server logs
Tor Bug Tracker & Wiki
blackhole at torproject.org
Wed Jan 10 16:38:56 UTC 2018
#22983: Add a Descriptor subinterface and implementation for Tor web server logs
-----------------------------+-----------------------------------
Reporter: iwakeh | Owner: metrics-team
Type: enhancement | Status: needs_review
Priority: High | Milestone: metrics-lib 2.2.0
Component: Metrics/Library | Version:
Severity: Normal | Resolution:
Keywords: metrics-2018 | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-----------------------------+-----------------------------------
Comment (by iwakeh):
Replying to [comment:58 karsten]:
> ...
> > You only need to look at the latest four commits. The others are
reviewed already (comment:47).
>
> Oh. That information would have saved some time this morning. That also
explains why I have just one remark on the earlier commits and a couple
more on the first four!
Hrmm, in my request for review I did mention:
... The additional commits should address all topics from the comments
above starting at comment:47. ...
Maybe, too subtle for 'please re-read the comments starting at comment:47
unless you memorized them earlier' ;-)
>
> ...
> There are two types of changes here:
>
> 1. Changes in the review process that require knowing the branch under
review. Ideally, we'd use "fixup" or "squash" commits for all commits made
during the review process. Alternatively, we can agree that commits will
be squashed prior to merge, which works for me, too.
> 2. Changes in the merge process that only require knowing master before
the merge. Somebody who didn't follow the review process, including our
future selfs, should see changes where we're not going forth and back
throughout commit series to achieve something that we could do in one
step.
Sounds ok.
>
> ...
> Anyway, here's what I found in your branch:
>
> - `AccessLogLineSanVal` contains an explicit mention of `"FTP"`. Does
this mean we're considering FTP log lines to be valid? Why FTP in
particular and no other protocols? Or should we take that out and restrict
ourselves to HTTP? Not feeling strongly about this one.
Please also refer to comments 50 to 52. The main issue was to not only
declare lines as valid that resemble sanitized lines, but be more general.
Thus, allowing different protocols could be part of that, but I don't
insist.
Which log lines should be considered valid? How much can log lines differ
from sanitized log lines and still be considered valid? The current code
only sketches one possibility, e.g. allowing different protocols, various
ip addresses, and other request types. Before making any more code
changes there should be at least a list of valid and invalid lines or a
description of either state.
> - Further down below we're comparing `referenceDate` to `extractedDate`
and discarding lines where the two don't match. Not sure if this is
problematic, but I'm at least flagging it here as potential issue. If this
is a non-issue, feel free to ignore this comment.
The variable `extractedDate` refers to the date extracted from the log
line and reference date is the log files date (see
[https://metrics.torproject.org/web-server-logs.html#n-re-assembling-log-
files spec section 4.3]). As discarding is only performed during
sanitization this is ok. The lines would be considered valid.
> - A few lines further down we're formatting a date using
`DateTimeFormatter.ofPattern()`. Is that expensive? Would we be able to
create the formatter just once and re-use it here? Not trying to optimize
prematurely, but trying to avoid shooting ourselves in the foot.
Makes sense. (1)
> - Logging an error in case of catching a `Throwable` might become very
noisy. After all, we're processing third-party input here, and we can
typically proceed without parsing that line. The debug level seemed like
the better choice here.
Oh, that's an oversight. (2)
> - In `WebServerAccessLog`, what exactly is the log date? Is it the date
of contained requests, is it the date when the file was written to disk,
or what is it? This is still unclear in the docs.
[https://gitweb.torproject.org/user/iwakeh/metrics-
lib.git/tree/src/main/java/org/torproject/descriptor/WebServerAccessLog.java?h=task-22983-4&id=5dbbeb0ac38728f09718c80e295e8524d92d2e68#n19
Javadoc] says:
`Returns the date of the log as <code>LocalDate</code>.`
and the [https://metrics.torproject.org/web-server-logs.html#n-re-
assembling-log-files spec 4.3] says: `Rewritten log lines are re-assembled
into sanitized log files based on physical host, virtual host, and request
start date.`
Where would a clarification be needed?
>
> When these remaining issues are clearer to me, do you mind me editing
your branch and preparing one or more "fixup"/"squash" commits? There are
still a few things that I'd like to document differently or where I'd like
to make the new code more similar to existing code (in the good sense).
You could still go through them and talk me out of making those changes.
Sound okay?
Sure, go ahead. Would you also make the tiny adaptions from (1) and (2),
or should I add these to the current branch first?
>
> Oh, and here's another remark for future reviews: let's try to reduce
the time for reviewing and for revising a branch to a few days. Neither
reviewing nor revising is fun, but it's even less fun when doing it
several weeks later. I know this is a special case with the holiday break,
and I'm not blaming you any more than I'm blaming myself. Just an idea to
make reviews a little more fun. :)
The holidays were not the problem here. I think tasks like adding a new
module especially those spanning two or more products should be treated in
a more formal/structured approach in future. There were several
iterations that were not coding or (technical) design questions, but
actually only concerned the specification (not only the one published on
metrics.tp.o, but the definition of the wanted functionality for the
webstats module itself). Even now we still have an iteration on wanted
functionality, i.e., which log lines should be considered valid. These
discussions for clarifying wanted functionality should happen before
writing code. The current approach produces code that is written,
reviewed, tossed, and rewritten only because of not having a clear picture
of the wanted functionality beforehand (also thinking of CollecTor's
changes here). Something we should have in mind when adding other modules
or making major changes in future.
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/22983#comment:59>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
More information about the metrics-bugs
mailing list