[metrics-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs
Tor Bug Tracker & Wiki
blackhole at torproject.org
Tue Aug 15 08:34:56 UTC 2017
#22983: add a descriptor interface and implementation for web-logs
---------------------------------+-----------------------------------
Reporter: iwakeh | Owner: metrics-team
Type: enhancement | Status: needs_revision
Priority: Medium | Milestone: metrics-lib 2.1.0
Component: Metrics/metrics-lib | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
---------------------------------+-----------------------------------
Changes (by karsten):
* status: needs_review => needs_revision
Comment:
Alright, I looked at the remaining commits (in
[https://gitweb.torproject.org/user/karsten/metrics-
lib.git/log/?h=task-22983-3 my task-22983-3 branch]). I'll start with the
major issues/questions in no specific order:
1. Why do we compress previously uncompressed log files? I see the point
of saving memory, but we'd be doing that by sacrificing CPU time. And if
we later change to leaving file contents on disk and only storing offsets
and lengths into files, it would be wasteful to store a compressed copy of
the file in memory just in case the application might need it later.
Ideally, we'd just store a reference to the `byte[]` in whatever
compression we're given, including uncompressed.
2. I already brought this up in
[https://trac.torproject.org/projects/tor/ticket/22983#comment:12 comment
12] above but didn't see a response: "Shouldn't `getRawDescriptorBytes()`
return the uncompressed bytes and a separate method `getCompressedBytes()`
return the compressed bytes? Thinking of being consistent with other
descriptors where `getRawDescriptorBytes()` returns uncompressed bytes,
too. Not sure about this one." Related to this, should
`getRawDescriptorLength()` return the length of the ''uncompressed'' byte
array? (This possibly requires uncompressing the file on first invocation
and storing the length in an attribute.) What do you think about changing
this for the sake of library consistency?
3. Why do sanitized log lines contain a trailing `-`, as in: `... 403 294
\"-\" \"-\" -\n`? I know that the
[https://gitweb.torproject.org/webstats.git/tree/src/sanitize.py Python
script] also added that trailing dash, so I'm asking if you think there's
a reason to keep that. The [https://httpd.apache.org/docs/2.4/logs.html
Combined Log Format] does not specify one. If you think it can go away we
should quickly check with Sebastian and then take it out.
4. From the tests it seems like `POST` requests are kept, too. However,
we should only keep `GET` and `HEAD` requests, just like the
[https://gitweb.torproject.org/webstats.git/tree/src/sanitize.py Python
script]. Likewise, `400` and `404` requests should be discarded. Maybe
check for other deviations from the script yourself. And in the next
review round we should compare the two sanitizers (Python and Java) using
some real logs. Or do you still have logs to run some tests yourself?
5. `LogDescriptorImpl` should not sort logs by default as part part of
the sanitizing step. That's a specific sanitizing technique for web server
logs. It might be that a future log format only requires removing certain
fields but not re-ordering log lines. Maybe there should be a second
method `cleanLines(List<String>)` in `InternalLogDescriptor.Sanitizer`,
and the existing method should be renamed to `cleanLine(String)`. The
default sanitizer should keep the order unchanged and simply return the
list it gets.
I also found a few minor issues where it might be easiest if I fix them
myself. I'll do that in the next review round as one or more suggested
commits, and if you agree with those changes, I'll squash them, and maybe
we can include them in the coding conventions afterwards.
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/22983#comment:22>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
More information about the metrics-bugs
mailing list