[metrics-bugs] #22983 [Metrics/Library]: Add a Descriptor subinterface and implementation for Tor web server logs

Tor Bug Tracker & Wiki blackhole at torproject.org
Wed Oct 18 16:03:16 UTC 2017


#22983: Add a Descriptor subinterface and implementation for Tor web server logs
-----------------------------+-----------------------------------
 Reporter:  iwakeh           |          Owner:  metrics-team
     Type:  enhancement      |         Status:  needs_revision
 Priority:  Medium           |      Milestone:  metrics-lib 2.2.0
Component:  Metrics/Library  |        Version:
 Severity:  Normal           |     Resolution:
 Keywords:  metrics-2017     |  Actual Points:
Parent ID:                   |         Points:
 Reviewer:                   |        Sponsor:
-----------------------------+-----------------------------------

Comment (by karsten):

 Heh, that's indeed a huge comment! And this one is going to be huge, too.
 However, rather than splitting it into multiple smaller comment I'll try
 to keep it as organized as I can by using the output of `git log --reverse
 --pretty=oneline origin/master..iwakeh/task-22983-4` as section headers in
 this ~~little essay~~comment.

 ==== 1232ff767375993f65a4e78d3a9383b23681925b Added a space.

 This was commit c6c805c before your rebase to master.

 In the future, please avoid rebasing the branch under review if you can,
 because it changes all commit IDs and makes it harder to refer to commits
 here. In most cases I'll be able to do the rebase as part of merging to
 master just fine.

 Other than that, the explanation in your previous comment makes sense to
 me and this commit is good to go in.

 ==== 4434316a769ea6403d2872073bc76f10a2822cf2 Added package-info.

 This was commit 3b426bf before your rebase to master.

 We agree on this one plus the fixup commit below, so it's good to go in.

 ==== 04997b1d9b3a73d2724a2e9d39a144895ded460f Add new descriptor type for
 web server access logs.

 This was commit 7822a27 before your rebase to master.

 It's the commit that most of the subsequent fixup and squash commits will
 be squashed into. I'll comment on your feedback under the fixup commits
 below.

 ==== 444b55709bef99b1f63a20e3595c1439aaa68775 Add tests for log descriptor
 functionality.

 This was commit 929e265 before your rebase to master.

 This one still needs review. That review was so far blocking on my
 question regarding validation vs. sanitization. I'll get to this in the
 next round after that question is resolved. So, still '''needs review'''!

 ==== 4b3f27e93761b172dd3ee102f89c66059bd3da7a squash! Added package-info.

 This was commit 1bb7a43 before your rebase to master.

 You say that you agree with these tweaks, so I'll take that as a go.

 ==== be986cc88e66cb1df9772e65421462ca0570e190 fixup! Add new descriptor
 type for web server access logs.

 This was commit e4d6e90 before your rebase to master.

 You say that you're fine with these changes and have more tweaks in your
 branch. I'll assume this one is good to be merged then, and I'll comment
 on your subsequent commits further down below.

 ==== 3de717268c3d27ad096cf669fc73a52f049acb59 fixup! Add new descriptor
 type for web server access logs.

 This was commit 664f540 before your rebase to master.

 This commit brought up a few topics that we need to resolve before moving
 on.

 ===== Variable names

 Thanks for writing down your thinking about variable names in the given
 detail. It helps a lot, not only for this ticket but also as a guide for
 future tickets. Let's go through the examples:

  - Leaving out the somewhat redundant "descriptor" from
 `rawDescriptorBytes` and `descriptorFile` is fine with me. It's indeed
 obvious what's meant here.

  - I quite strongly disagree with `isValid(line)` as a name for a method
 that takes a line, tries to validate it, and returns whether that was
 successful. To me, `isX()` is the name for a getter, not the name for a
 method that does something with a given parameter. If this were a `Line`
 class with a method `isValid()` that returns whether the line is valid or
 not, that would be something different. But that's not the case here. For
 another example, consider `File.delete()` which deletes the file and
 returns whether that was successful. We wouldn't argue about renaming that
 method to `isDeleted()` just because it returns `true` or `false`. As a
 general rule I'd say that the name of a method that performs something
 should be the verb describing the action, whereas `is` is typically
 reserved for getters. Ah, and in this case it's up to the documentation to
 say what `validate(line)` returns, though that's relatively obvious.

  - Leaving out "line" in `sanitizeLine(line)` and friends is okay, too.

  - You're probably right about keeping `logBytes` rather than
 `rawDescriptorBytes`.

 ===== `LogDescriptor` interface

 I agree with keeping `LogDescriptor`. But let me explain my earlier
 hesitation:

  - I'm not a big fan of marker interfaces. :) However, it seems that we'll
 soon add a `LogDescriptor.LogLine` subinterface, and that is (to me) a
 good reason to keep the `LogDescriptor` interface.

  - In the future we might create an `AnnotatedDescriptor` interface with a
 `getAnnotations()` method, a `RawDescriptor` interface with a
 `getRawDescriptorBytes()` and a `getRawDescriptorLength()` method, and a
 `ParsedDescriptor` interface with a `getUnrecognizedLines()` method, and a
 `ReadDescriptor` interface with a `getDescriptorFile()` method. And then
 we'll turn `Descriptor` into a marker interface! :) And `LogDescriptor`
 would only extend maybe `ParsedDescriptor` and `ReadDescriptor`. And we'd
 create a `TorDescriptor` interface for all relay and bridge descriptors
 and a `RelayDescriptor` and `BridgeDescriptor` for the various types of
 relay and bridge descriptors. And so on. Not part of this ticket,
 obviously. But it seems okay to start going in this direction now with a
 `LogDescriptor` interface.

 ===== Validation vs. sanitization

 I'm still confused what validation means in this context.

 Is a line containing a POST request a valid line, or one that uses FTP as
 protocol or that returns HTTP 404 as status code. It's okay that CollecTor
 skips these lines as part of the sanitizing process. But that doesn't make
 them invalid.

 I'm also a bit uncleear if the separate validate and sanitize steps have a
 negative impact on performance. In theory, it should be sufficient to
 touch each line once. But I could be convinced that we're trading
 performance for better design, if this is the case.

 However, I'd really want us to be clear what it means for a line to be
 valid!

 ==== e8a2d57b3fad2fdb94d98ba472314fd2a0fb0ada fixup! fixup! Add new
 descriptor type for web server access logs.

 This one is fine.

 ==== 3864d14c8491c6a76514a28c702c3da5d1eb5775 fixup! fixup! Add new
 descriptor type for web server access logs.

 This one, too.

 ==== 3e3dd56e317ef61296261291dc3ee29d42ebf3b8 Make code more readable by
 renaming method names.

 The commit message does not indicate this, so I'll ask: Is this commit and
 are subsequent commits supposed to be squashed with 04997b1?

 See my ~~rant~~remarks about `isValid` from above. That's something I'd
 like us to change.

 ==== 5b20be449785b1e28fd10bcb814e697cb9de178d Rename rawDescriptorBytes to
 logBytes.

 Okay.

 One thing we should do is document whether `private byte[] logBytes` might
 be compressed or not. We have been discussing that many times now, and I'm
 deeply confused already what we're doing there.

 ==== f5456c474f6a5a9683c6cf6ba3f791bc5d6278ec Avoid missleading name for
 variable.

 Sounds good.

 ==== 629ef152be1fd2f5a00d203b614fc01e946c518d Tweak pattern for logline
 validation.

 One question about the pattern: Does the `?:` in
 `"^((?:\\d{1,3}\\.){3}\\d{1,3}) ..."` mean that we're now ignoring the
 first 3 octets regardless of whether they're `0.0.0.` or not? That would
 not follow the specification then where we claim that ''"If the remote
 hostname starts with "0.0.0.", it is kept unchanged, otherwise it's
 rewritten to "0.0.0.0"."'' Example: `1.2.3.4` would be rewritten to
 `0.0.0.4` if we simply ignore the first three octets, rather than
 `0.0.0.0`. In a way, I like that approach even better, but we'd have to
 change the spec for that. And we should both be certain that this is what
 we want to do.

 ==== e24dda11613f340da3fbd6f1a93ba07d857f0b16 Remove getLogDateMillis and
 move getLogDate to WebServerAccessLog.

 I wonder why there's no `getLogDateMillis()` anymore. But I can be
 convinced that users should just extract millisecond from `LocalDate` if
 they need them. Is that the plan? If so, green light.

 ==== 6a6e93a2fd8b3f3b912ce9a258f4b32069c18ef8 (iwakeh/task-22983-4) Set
 dev version.

 Let's revert this commit. It bumps to the wrong dev version (2.2.0-dev
 rather than 2.1.1-dev), and it's something we should do as part of merging
 to master. In fact, I bumped to 2.1.1-dev this morning. So, this commit
 can go away, by reverting it, not by simply dropping it.

 ==== Closing remarks

 Whee! This was hard work. And we're not done yet. But we're getting
 closer. :) Thanks for working on this!

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/22983#comment:47>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online


More information about the metrics-bugs mailing list