[metrics-bugs] #22983 [Metrics/Library]: Add a Descriptor subinterface and implementation for Tor web server logs

Tor Bug Tracker & Wiki blackhole at torproject.org
Tue Oct 10 14:23:42 UTC 2017


#22983: Add a Descriptor subinterface and implementation for Tor web server logs
-----------------------------+-----------------------------------
 Reporter:  iwakeh           |          Owner:  metrics-team
     Type:  enhancement      |         Status:  needs_revision
 Priority:  Medium           |      Milestone:  metrics-lib 2.2.0
Component:  Metrics/Library  |        Version:
 Severity:  Normal           |     Resolution:
 Keywords:  metrics-2017     |  Actual Points:
Parent ID:                   |         Points:
 Reviewer:                   |        Sponsor:
-----------------------------+-----------------------------------

Comment (by iwakeh):

 Replying to [comment:44 karsten]:
 > Alright, here's what I found:

 Thanks for the detailed review!

 >
 > Commit c6c805c looks trivially correct. It comes a bit out of nowhere,
 so may you could add a remark to the commit message of future commits like
 this saying why you're making this change now. Like, did it violate a
 checkstyle rule? And was this the only place that was lacking a space? In
 any case, ready to merge this one.

 The misterious space results from a rule we don't have for spaces between
 cast operator and variable.  I made the usage in this file consistent by
 adding the space as [https://gitweb.torproject.org/user/iwakeh/metrics-
 lib.git/tree/src/main/java/org/torproject/descriptor/impl/KeyValueMap.java?h=task-22983-3&id=6b8de66d171dd85d8244a7d326ca436ce68e0d78#n57
 the earlier cast] did have one.  But, we should add a checkstyle rule to
 make it consistent everywhere; I don't have a preference with or without
 space is fine.

 >
 > Commit 3b426bf contains a couple of code style violations: the summary
 fragment should describe the package and not be a warning; the summary
 fragment is not supposed to contain markup like <h1>; there needs to be an
 empty line before the next <p>; and the line starting with
 "descriptors.</p>" is wrongly indented. I'm providing some tweaks in a
 fixup commit. Please review and tweak further as needed.

 These seem fine.  Thanks for fixing them and rewording the javadoc.

 >
 > Commit 7822a27 has several issues that require several fixup commits
 and/or discussion:
 >  - My commit e4d6e90 in my task-22983-4 branch contains a bunch of
 whitespace fixes and other trivial tweaks which I believe are mostly
 oversights. (Unless they are not, in which case see the next item.)

 I think the javadoc reformulations are fine.  Some tweaks are in my new
 branch.
 I also agree to the other few tweaks to the code or came up with something
 else, which I added in tiny fixup commits to make reviewing easier
 (hopefully)

 >  - My commit 664f540 is a mix of suggestions and tweaks. Some of the
 tweaks are an attempt to make your code fit better into existing code,
 which is either based on our style guide or on implicit conventions in
 existing code for which we don't have a written style guide yet. Please
 review that commit carefully. If you believe that some of these are non-
 issues, let's discuss that. I'm not saying that anything or anyone is
 wrong, I'm just trying to keep (heh, or make?) the overall code base as
 readable as possible and at the same time make future review processes
 even smoother.

 Ok, things should match the old code base, but at the same time evolve to
 the better, or be modernized.
 I admid that I have a knack for terse variable names (just two letters
 seem verbose sometimes ;-) but you're right in some cases, thanks for
 fixing.
 I'd say there are also very many non-issues here, which I know are
 different in the existing older code base, but seem to make the code less
 readable to me, when I started to become familiar with the old codebase.
 For example, there are almost line filling variable names (not in this
 patch) that differ only in one to five letters.  And there is the word
 'Descriptor' all over, which often feels like cluttering when reading the
 code for the first time.  The following is an example:
 (all following diffs are from the 664f540 commit)

 {{{
 -  public static List<Descriptor> parse(byte[] raw, File file)
 ...
 +  public static List<Descriptor> parse(byte[] rawDescriptorBytes,
 +      File descriptorFile) ...
 }}}

 Maybe, 'raw' alone is too terse, but 'rawBytes' seems fine whereas in
 'rawDescriptorBytes' the word part 'Descriptor' overwhelms the important
 information.  The method 'parse' receives raw bytes and tries to find a
 descriptor.

 Here some other examples from the current patch&fixup round (we could
 recycle them for the guide lines, so I try to be verbose):

 Renaming of isValid, here:
 {{{
 -    public boolean isValid(String line);
 +    public boolean validateLine(String line);
 }}}

 makes the code less readable.  For example:

 {{{
 -          -> null != line && !line.isEmpty() && !validator.isValid(line))
   ...
 +          -> null != line && !line.isEmpty() &&
 !validator.validateLine(line))
 }}}

 `validateLine` doesn't say that the result of the validation is returned
 (as a boolean).  In addition, `isValid(line)` is more 'readable' as it
 'translates' (e.g., read aloud) to "is valid line",  whereas
 `validateLine(line)` results in "validate line line", which even without
 the duplication of line doesn't hint what happens.  Similarly,
 `sanitizeLine(line)` vs. `sanitize(line)` (where I had `clean(line)`
 initially, but I don't mind the renaming) and `postProcessLines(lines)`
 vs. `postProcessing(lines)`.

 Why rename `logBytes` to `rawDescriptorBytes`?  `logBytes` seems fine in a
 log descriptor
 implementation.  If I read this code for a first time I would wonder if
 `rawDescriptorBytes`
 is inherited because of its generic name.

 (Instead of renaming 'extension' to 'fileExtension' I'd suggest
 'fileType', because that's what it is, i.e., not only an extension, which
 could be mistaken to be a string.)

 In general, a method name shouldn't contain the parameter (e.g., better
 `sanitize` than `sanitizeLine`).

 >  - I wonder if we should take out the `LogDescriptor` interface for
 several reasons: there is just one subinterface extending that, so YAGNI;
 the only fields/methods that it adds are the log date, and it's unclear if
 future logs will cover a single date or have entirely different
 requirements. Looks like premature interface hierarchy optimization to me.

 Logs are different from other descriptors: they are only kept compressed
 in the filesystem; they don't have annotations; etc. So, basing them on
 another interface and not directly on `Descriptor` is useful, because
 almost all `Descriptor` inherited methods have a different or no meaning
 for logs.
 Actually, I would suggest to even drop `getLogDateMillis()` and move
 `getLogDate()` to `WebServerAccessLog` (see the commit in my branch).
 This way it is a clear marker interface for logs and for future logs the
 methods in common with `WebServerAccessLog` could be added to
 `LogDescriptor`.

 >  - I'm unclear whether this new code will cover both original and
 sanitized web server logs. Possibly related to that question, why is
 validation separated from sanitization, and is validation performed before
 sanitization or not? Can you clarify?

 Some points about log handling:
 * Sanitized as well as original log lines have to be valid and have to
 comply to the same pattern.
 * metrics-lib cannot distinguish between sanitized and original logs.  An
 original or sanitized log can be read using metrics-lib as long as it
 complies to the line pattern defined in the specification.
 * metrics-lib won't attempt sanitization of log lines.
 * CollecTor will only supply sanitized logs and uses the sanitization
 functionality provided by metrics-lib internal functionality.  This
 happens by reading the log, which is automatically validated, and then
 calling 'sanitize()'.

 > When you make changes, please make them as fixup commits on top of my
 branch. Thanks!

 All additional commits are based on your new branch and contain small
 changes that correspond to the findings/explanations above.

 Please review seven new commits on
 [https://gitweb.torproject.org/user/iwakeh/metrics-
 lib.git/log/?h=task-22983-4 this branch], which is rebased on the current
 master.

 This turned into a huge comment; please try to split the reply into
 smaller comments, otherwise we loose track of all the different topics.
 Thanks!

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/22983#comment:46>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online


More information about the metrics-bugs mailing list