[metrics-bugs] #22983 [Metrics/Library]: Add a Descriptor subinterface and implementation for Tor web server logs
Tor Bug Tracker & Wiki
blackhole at torproject.org
Tue Oct 10 14:23:42 UTC 2017
#22983: Add a Descriptor subinterface and implementation for Tor web server logs
-----------------------------+-----------------------------------
Reporter: iwakeh | Owner: metrics-team
Type: enhancement | Status: needs_revision
Priority: Medium | Milestone: metrics-lib 2.2.0
Component: Metrics/Library | Version:
Severity: Normal | Resolution:
Keywords: metrics-2017 | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-----------------------------+-----------------------------------
Comment (by iwakeh):
Replying to [comment:44 karsten]:
> Alright, here's what I found:
Thanks for the detailed review!
>
> Commit c6c805c looks trivially correct. It comes a bit out of nowhere,
so may you could add a remark to the commit message of future commits like
this saying why you're making this change now. Like, did it violate a
checkstyle rule? And was this the only place that was lacking a space? In
any case, ready to merge this one.
The misterious space results from a rule we don't have for spaces between
cast operator and variable. I made the usage in this file consistent by
adding the space as [https://gitweb.torproject.org/user/iwakeh/metrics-
lib.git/tree/src/main/java/org/torproject/descriptor/impl/KeyValueMap.java?h=task-22983-3&id=6b8de66d171dd85d8244a7d326ca436ce68e0d78#n57
the earlier cast] did have one. But, we should add a checkstyle rule to
make it consistent everywhere; I don't have a preference with or without
space is fine.
>
> Commit 3b426bf contains a couple of code style violations: the summary
fragment should describe the package and not be a warning; the summary
fragment is not supposed to contain markup like <h1>; there needs to be an
empty line before the next <p>; and the line starting with
"descriptors.</p>" is wrongly indented. I'm providing some tweaks in a
fixup commit. Please review and tweak further as needed.
These seem fine. Thanks for fixing them and rewording the javadoc.
>
> Commit 7822a27 has several issues that require several fixup commits
and/or discussion:
> - My commit e4d6e90 in my task-22983-4 branch contains a bunch of
whitespace fixes and other trivial tweaks which I believe are mostly
oversights. (Unless they are not, in which case see the next item.)
I think the javadoc reformulations are fine. Some tweaks are in my new
branch.
I also agree to the other few tweaks to the code or came up with something
else, which I added in tiny fixup commits to make reviewing easier
(hopefully)
> - My commit 664f540 is a mix of suggestions and tweaks. Some of the
tweaks are an attempt to make your code fit better into existing code,
which is either based on our style guide or on implicit conventions in
existing code for which we don't have a written style guide yet. Please
review that commit carefully. If you believe that some of these are non-
issues, let's discuss that. I'm not saying that anything or anyone is
wrong, I'm just trying to keep (heh, or make?) the overall code base as
readable as possible and at the same time make future review processes
even smoother.
Ok, things should match the old code base, but at the same time evolve to
the better, or be modernized.
I admid that I have a knack for terse variable names (just two letters
seem verbose sometimes ;-) but you're right in some cases, thanks for
fixing.
I'd say there are also very many non-issues here, which I know are
different in the existing older code base, but seem to make the code less
readable to me, when I started to become familiar with the old codebase.
For example, there are almost line filling variable names (not in this
patch) that differ only in one to five letters. And there is the word
'Descriptor' all over, which often feels like cluttering when reading the
code for the first time. The following is an example:
(all following diffs are from the 664f540 commit)
{{{
- public static List<Descriptor> parse(byte[] raw, File file)
...
+ public static List<Descriptor> parse(byte[] rawDescriptorBytes,
+ File descriptorFile) ...
}}}
Maybe, 'raw' alone is too terse, but 'rawBytes' seems fine whereas in
'rawDescriptorBytes' the word part 'Descriptor' overwhelms the important
information. The method 'parse' receives raw bytes and tries to find a
descriptor.
Here some other examples from the current patch&fixup round (we could
recycle them for the guide lines, so I try to be verbose):
Renaming of isValid, here:
{{{
- public boolean isValid(String line);
+ public boolean validateLine(String line);
}}}
makes the code less readable. For example:
{{{
- -> null != line && !line.isEmpty() && !validator.isValid(line))
...
+ -> null != line && !line.isEmpty() &&
!validator.validateLine(line))
}}}
`validateLine` doesn't say that the result of the validation is returned
(as a boolean). In addition, `isValid(line)` is more 'readable' as it
'translates' (e.g., read aloud) to "is valid line", whereas
`validateLine(line)` results in "validate line line", which even without
the duplication of line doesn't hint what happens. Similarly,
`sanitizeLine(line)` vs. `sanitize(line)` (where I had `clean(line)`
initially, but I don't mind the renaming) and `postProcessLines(lines)`
vs. `postProcessing(lines)`.
Why rename `logBytes` to `rawDescriptorBytes`? `logBytes` seems fine in a
log descriptor
implementation. If I read this code for a first time I would wonder if
`rawDescriptorBytes`
is inherited because of its generic name.
(Instead of renaming 'extension' to 'fileExtension' I'd suggest
'fileType', because that's what it is, i.e., not only an extension, which
could be mistaken to be a string.)
In general, a method name shouldn't contain the parameter (e.g., better
`sanitize` than `sanitizeLine`).
> - I wonder if we should take out the `LogDescriptor` interface for
several reasons: there is just one subinterface extending that, so YAGNI;
the only fields/methods that it adds are the log date, and it's unclear if
future logs will cover a single date or have entirely different
requirements. Looks like premature interface hierarchy optimization to me.
Logs are different from other descriptors: they are only kept compressed
in the filesystem; they don't have annotations; etc. So, basing them on
another interface and not directly on `Descriptor` is useful, because
almost all `Descriptor` inherited methods have a different or no meaning
for logs.
Actually, I would suggest to even drop `getLogDateMillis()` and move
`getLogDate()` to `WebServerAccessLog` (see the commit in my branch).
This way it is a clear marker interface for logs and for future logs the
methods in common with `WebServerAccessLog` could be added to
`LogDescriptor`.
> - I'm unclear whether this new code will cover both original and
sanitized web server logs. Possibly related to that question, why is
validation separated from sanitization, and is validation performed before
sanitization or not? Can you clarify?
Some points about log handling:
* Sanitized as well as original log lines have to be valid and have to
comply to the same pattern.
* metrics-lib cannot distinguish between sanitized and original logs. An
original or sanitized log can be read using metrics-lib as long as it
complies to the line pattern defined in the specification.
* metrics-lib won't attempt sanitization of log lines.
* CollecTor will only supply sanitized logs and uses the sanitization
functionality provided by metrics-lib internal functionality. This
happens by reading the log, which is automatically validated, and then
calling 'sanitize()'.
> When you make changes, please make them as fixup commits on top of my
branch. Thanks!
All additional commits are based on your new branch and contain small
changes that correspond to the findings/explanations above.
Please review seven new commits on
[https://gitweb.torproject.org/user/iwakeh/metrics-
lib.git/log/?h=task-22983-4 this branch], which is rebased on the current
master.
This turned into a huge comment; please try to split the reply into
smaller comments, otherwise we loose track of all the different topics.
Thanks!
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/22983#comment:46>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
More information about the metrics-bugs
mailing list