[metrics-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

Tor Bug Tracker & Wiki blackhole at torproject.org
Tue Aug 15 09:56:12 UTC 2017


#22983: add a descriptor interface and implementation for web-logs
---------------------------------+-----------------------------------
 Reporter:  iwakeh               |          Owner:  metrics-team
     Type:  enhancement          |         Status:  needs_revision
 Priority:  Medium               |      Milestone:  metrics-lib 2.1.0
Component:  Metrics/metrics-lib  |        Version:
 Severity:  Normal               |     Resolution:
 Keywords:                       |  Actual Points:
Parent ID:                       |         Points:
 Reviewer:                       |        Sponsor:
---------------------------------+-----------------------------------

Comment (by iwakeh):

 Replying to [comment:22 karsten]:
 > Alright, I looked at the remaining commits (in
 [https://gitweb.torproject.org/user/karsten/metrics-
 lib.git/log/?h=task-22983-3 ​my task-22983-3 branch]). I'll start with the
 major issues/questions in no specific order:
 >
 >  1. Why do we compress previously uncompressed log files? I see the
 point of saving memory, but we'd be doing that by sacrificing CPU time.
 And if we later change to leaving file contents on disk and only storing
 offsets and lengths into files, it would be wasteful to store a compressed
 copy of the file in memory just in case the application might need it
 later. Ideally, we'd just store a reference to the `byte[]` in whatever
 compression we're given, including uncompressed.
 >
 >  2. I already brought this up in
 [https://trac.torproject.org/projects/tor/ticket/22983#comment:12 comment
 12] above but didn't see a response: "Shouldn't `getRawDescriptorBytes()`
 return the uncompressed bytes and a separate method `getCompressedBytes()`
 return the compressed bytes? Thinking of being consistent with other
 descriptors where `getRawDescriptorBytes()` returns uncompressed bytes,
 too. Not sure about this one." Related to this, should
 `getRawDescriptorLength()` return the length of the ''uncompressed'' byte
 array? (This possibly requires uncompressing the file on first invocation
 and storing the length in an attribute.) What do you think about changing
 this for the sake of library consistency?


 Here and in 1. I noticed that I had some implicit assumptions about log
 descriptors that led to the chosen implementation.  Once #23243 is
 answered these concerns can be addressed in a better way.


 >
 >  3. Why do sanitized log lines contain a trailing `-`, as in: `... 403
 294 \"-\" \"-\" -\n`? I know that the
 [https://gitweb.torproject.org/webstats.git/tree/src/sanitize.py Python
 script] also added that trailing dash, so I'm asking if you think there's
 a reason to keep that. The [https://httpd.apache.org/docs/2.4/logs.html
 Combined Log Format] does not specify one. If you think it can go away we
 should quickly check with Sebastian and then take it out.

 Also a discussion for the spec ticket #23243.  But in general I don't see
 a need for the trailing dash, I only reproduced the log-lines from the
 python implementation.

 >
 >  4. From the tests it seems like `POST` requests are kept, too. However,
 we should only keep `GET` and `HEAD` requests, just like the
 [https://gitweb.torproject.org/webstats.git/tree/src/sanitize.py Python
 script]. Likewise, `400` and `404` requests should be discarded. Maybe
 check for other deviations from the script yourself. And in the next
 review round we should compare the two sanitizers (Python and Java) using
 some real logs. Or do you still have logs to run some tests yourself?

 I did run such tests and some of the test files are taken from the real
 vs. python-cleaned logs (the real ones without pi info).  In #23234 we
 should craft the input and target formats; once that is done change
 implementation and tests accordingly.

 >
 >  5. `LogDescriptorImpl` should not sort logs by default as part part of
 the sanitizing step. That's a specific sanitizing technique for web server
 logs. It might be that a future log format only requires removing certain
 fields but not re-ordering log lines. Maybe there should be a second
 method `cleanLines(List<String>)` in `InternalLogDescriptor.Sanitizer`,
 and the existing method should be renamed to `cleanLine(String)`. The
 default sanitizer should keep the order unchanged and simply return the
 list it gets.

 True, the re-ordering is web-server-access-log specific and should be
 moved.

 >
 > I also found a few minor issues where it might be easiest if I fix them
 myself. I'll do that in the next review round as one or more suggested
 commits, and if you agree with those changes, I'll squash them, and maybe
 we can include them in the coding conventions afterwards.

 Ok, if these are 'orthogonal' to the above topics, but please give #23234
 a higher priority.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/22983#comment:25>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online


More information about the metrics-bugs mailing list