[metrics-bugs] #22983 [Metrics/Library]: Add a Descriptor subinterface and implementation for Tor web server logs
Tor Bug Tracker & Wiki
blackhole at torproject.org
Tue Nov 21 14:47:17 UTC 2017
#22983: Add a Descriptor subinterface and implementation for Tor web server logs
-----------------------------+-----------------------------------
Reporter: iwakeh | Owner: metrics-team
Type: enhancement | Status: needs_revision
Priority: Medium | Milestone: metrics-lib 2.2.0
Component: Metrics/Library | Version:
Severity: Normal | Resolution:
Keywords: metrics-2017 | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-----------------------------+-----------------------------------
Comment (by iwakeh):
Thanks for hanging in and working through all!
I try to shorten my reply and simply leave out all points of agreement
withou any further comment. Feel free to raise them again if necessary.
Replying to [comment:47 karsten]:
> ===== Variable names
I copied the comments about variable names to ticket #24370, which is for
defining some naming guidelines all over metrics. Let's continue the
discussion there.
> ===== Validation vs. sanitization
> I'm still confused what validation means in this context.
Metrics-lib provides web log parsing of sanitized logs as available on
CollecTor. When parsing such a log file lines need to be validated, i.e.,
metrics-lib verifies that these are standard sanitized log lines.
Metrics-lib does not supply a general log parser.
Sanitization is only supplied as internal metrics-lib feature and used by
CollecTor to sanitize the logs to be published.
> Is a line containing a POST request a valid line, or one that uses FTP
as protocol or that returns HTTP 404 as status code. It's okay that
CollecTor skips these lines as part of the sanitizing process. But that
doesn't make them invalid.
From the point of a consumer that expects sanitized log lines the POST and
FTP lines are invalid.
>
> I'm also a bit uncleear if the separate validate and sanitize steps have
a negative impact on performance. In theory, it should be sufficient to
touch each line once. But I could be convinced that we're trading
performance for better design, if this is the case.
Touching the lines once is only a concern during sanitization. The
current code only operates once on each line. Simply parsing santized
logs only needs the validation part anyway.
>
> However, I'd really want us to be clear what it means for a line to be
valid!
Yes!
> ...
> One thing we should do is document whether `private byte[] logBytes`
might be compressed or not. We have been discussing that many times now,
and I'm deeply confused already what we're doing there.
Already addressed in the respective javadoc comments. All decompressed.
> ===== 629ef152be1fd2f5a00d203b614fc01e946c518d Tweak pattern for logline
validation.
> One question about the pattern: Does the `?:` in
`"^((?:\\d{1,3}\\.){3}\\d{1,3}) ..."` mean that we're now ignoring the
first 3 octets regardless of whether they're `0.0.0.` or not? That would
not follow the specification ...
I will re-check that with the new spec we have from #23243.
> ===== e24dda11613f340da3fbd6f1a93ba07d857f0b16 Remove getLogDateMillis
and move getLogDate to WebServerAccessLog.
>
> I wonder why there's no `getLogDateMillis()` anymore. But I can be
convinced that users should just extract millisecond from `LocalDate` if
they need them. Is that the plan? If so, green light.
Yep.
> ==== 6a6e93a2fd8b3f3b912ce9a258f4b32069c18ef8 (iwakeh/task-22983-4) Set
dev version.
> Let's revert this commit. ...
I reverted this, but instead would like to add the '2.1.1-dev' commit.
(Mainly for CollecTor #22428 implementation and testing.)
Next steps:
The latest changes in #23243 also makes a revision necessary. I'll use
the current branch here to add the changes.
So far the only open topic is your question about validation&sanitization.
The above answer might have resolved it a bit. Maybe, it'll also be more
clear in the upcoming commit(s) for the spec changes.
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/22983#comment:48>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
More information about the metrics-bugs
mailing list