[metrics-bugs] #22983 [Metrics/Library]: Add a Descriptor subinterface and implementation for Tor web server logs
Tor Bug Tracker & Wiki
blackhole at torproject.org
Wed Oct 18 16:03:16 UTC 2017
#22983: Add a Descriptor subinterface and implementation for Tor web server logs
-----------------------------+-----------------------------------
Reporter: iwakeh | Owner: metrics-team
Type: enhancement | Status: needs_revision
Priority: Medium | Milestone: metrics-lib 2.2.0
Component: Metrics/Library | Version:
Severity: Normal | Resolution:
Keywords: metrics-2017 | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-----------------------------+-----------------------------------
Comment (by karsten):
Heh, that's indeed a huge comment! And this one is going to be huge, too.
However, rather than splitting it into multiple smaller comment I'll try
to keep it as organized as I can by using the output of `git log --reverse
--pretty=oneline origin/master..iwakeh/task-22983-4` as section headers in
this ~~little essay~~comment.
==== 1232ff767375993f65a4e78d3a9383b23681925b Added a space.
This was commit c6c805c before your rebase to master.
In the future, please avoid rebasing the branch under review if you can,
because it changes all commit IDs and makes it harder to refer to commits
here. In most cases I'll be able to do the rebase as part of merging to
master just fine.
Other than that, the explanation in your previous comment makes sense to
me and this commit is good to go in.
==== 4434316a769ea6403d2872073bc76f10a2822cf2 Added package-info.
This was commit 3b426bf before your rebase to master.
We agree on this one plus the fixup commit below, so it's good to go in.
==== 04997b1d9b3a73d2724a2e9d39a144895ded460f Add new descriptor type for
web server access logs.
This was commit 7822a27 before your rebase to master.
It's the commit that most of the subsequent fixup and squash commits will
be squashed into. I'll comment on your feedback under the fixup commits
below.
==== 444b55709bef99b1f63a20e3595c1439aaa68775 Add tests for log descriptor
functionality.
This was commit 929e265 before your rebase to master.
This one still needs review. That review was so far blocking on my
question regarding validation vs. sanitization. I'll get to this in the
next round after that question is resolved. So, still '''needs review'''!
==== 4b3f27e93761b172dd3ee102f89c66059bd3da7a squash! Added package-info.
This was commit 1bb7a43 before your rebase to master.
You say that you agree with these tweaks, so I'll take that as a go.
==== be986cc88e66cb1df9772e65421462ca0570e190 fixup! Add new descriptor
type for web server access logs.
This was commit e4d6e90 before your rebase to master.
You say that you're fine with these changes and have more tweaks in your
branch. I'll assume this one is good to be merged then, and I'll comment
on your subsequent commits further down below.
==== 3de717268c3d27ad096cf669fc73a52f049acb59 fixup! Add new descriptor
type for web server access logs.
This was commit 664f540 before your rebase to master.
This commit brought up a few topics that we need to resolve before moving
on.
===== Variable names
Thanks for writing down your thinking about variable names in the given
detail. It helps a lot, not only for this ticket but also as a guide for
future tickets. Let's go through the examples:
- Leaving out the somewhat redundant "descriptor" from
`rawDescriptorBytes` and `descriptorFile` is fine with me. It's indeed
obvious what's meant here.
- I quite strongly disagree with `isValid(line)` as a name for a method
that takes a line, tries to validate it, and returns whether that was
successful. To me, `isX()` is the name for a getter, not the name for a
method that does something with a given parameter. If this were a `Line`
class with a method `isValid()` that returns whether the line is valid or
not, that would be something different. But that's not the case here. For
another example, consider `File.delete()` which deletes the file and
returns whether that was successful. We wouldn't argue about renaming that
method to `isDeleted()` just because it returns `true` or `false`. As a
general rule I'd say that the name of a method that performs something
should be the verb describing the action, whereas `is` is typically
reserved for getters. Ah, and in this case it's up to the documentation to
say what `validate(line)` returns, though that's relatively obvious.
- Leaving out "line" in `sanitizeLine(line)` and friends is okay, too.
- You're probably right about keeping `logBytes` rather than
`rawDescriptorBytes`.
===== `LogDescriptor` interface
I agree with keeping `LogDescriptor`. But let me explain my earlier
hesitation:
- I'm not a big fan of marker interfaces. :) However, it seems that we'll
soon add a `LogDescriptor.LogLine` subinterface, and that is (to me) a
good reason to keep the `LogDescriptor` interface.
- In the future we might create an `AnnotatedDescriptor` interface with a
`getAnnotations()` method, a `RawDescriptor` interface with a
`getRawDescriptorBytes()` and a `getRawDescriptorLength()` method, and a
`ParsedDescriptor` interface with a `getUnrecognizedLines()` method, and a
`ReadDescriptor` interface with a `getDescriptorFile()` method. And then
we'll turn `Descriptor` into a marker interface! :) And `LogDescriptor`
would only extend maybe `ParsedDescriptor` and `ReadDescriptor`. And we'd
create a `TorDescriptor` interface for all relay and bridge descriptors
and a `RelayDescriptor` and `BridgeDescriptor` for the various types of
relay and bridge descriptors. And so on. Not part of this ticket,
obviously. But it seems okay to start going in this direction now with a
`LogDescriptor` interface.
===== Validation vs. sanitization
I'm still confused what validation means in this context.
Is a line containing a POST request a valid line, or one that uses FTP as
protocol or that returns HTTP 404 as status code. It's okay that CollecTor
skips these lines as part of the sanitizing process. But that doesn't make
them invalid.
I'm also a bit uncleear if the separate validate and sanitize steps have a
negative impact on performance. In theory, it should be sufficient to
touch each line once. But I could be convinced that we're trading
performance for better design, if this is the case.
However, I'd really want us to be clear what it means for a line to be
valid!
==== e8a2d57b3fad2fdb94d98ba472314fd2a0fb0ada fixup! fixup! Add new
descriptor type for web server access logs.
This one is fine.
==== 3864d14c8491c6a76514a28c702c3da5d1eb5775 fixup! fixup! Add new
descriptor type for web server access logs.
This one, too.
==== 3e3dd56e317ef61296261291dc3ee29d42ebf3b8 Make code more readable by
renaming method names.
The commit message does not indicate this, so I'll ask: Is this commit and
are subsequent commits supposed to be squashed with 04997b1?
See my ~~rant~~remarks about `isValid` from above. That's something I'd
like us to change.
==== 5b20be449785b1e28fd10bcb814e697cb9de178d Rename rawDescriptorBytes to
logBytes.
Okay.
One thing we should do is document whether `private byte[] logBytes` might
be compressed or not. We have been discussing that many times now, and I'm
deeply confused already what we're doing there.
==== f5456c474f6a5a9683c6cf6ba3f791bc5d6278ec Avoid missleading name for
variable.
Sounds good.
==== 629ef152be1fd2f5a00d203b614fc01e946c518d Tweak pattern for logline
validation.
One question about the pattern: Does the `?:` in
`"^((?:\\d{1,3}\\.){3}\\d{1,3}) ..."` mean that we're now ignoring the
first 3 octets regardless of whether they're `0.0.0.` or not? That would
not follow the specification then where we claim that ''"If the remote
hostname starts with "0.0.0.", it is kept unchanged, otherwise it's
rewritten to "0.0.0.0"."'' Example: `1.2.3.4` would be rewritten to
`0.0.0.4` if we simply ignore the first three octets, rather than
`0.0.0.0`. In a way, I like that approach even better, but we'd have to
change the spec for that. And we should both be certain that this is what
we want to do.
==== e24dda11613f340da3fbd6f1a93ba07d857f0b16 Remove getLogDateMillis and
move getLogDate to WebServerAccessLog.
I wonder why there's no `getLogDateMillis()` anymore. But I can be
convinced that users should just extract millisecond from `LocalDate` if
they need them. Is that the plan? If so, green light.
==== 6a6e93a2fd8b3f3b912ce9a258f4b32069c18ef8 (iwakeh/task-22983-4) Set
dev version.
Let's revert this commit. It bumps to the wrong dev version (2.2.0-dev
rather than 2.1.1-dev), and it's something we should do as part of merging
to master. In fact, I bumped to 2.1.1-dev this morning. So, this commit
can go away, by reverting it, not by simply dropping it.
==== Closing remarks
Whee! This was hard work. And we're not done yet. But we're getting
closer. :) Thanks for working on this!
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/22983#comment:47>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
More information about the metrics-bugs
mailing list