[metrics-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs
Tor Bug Tracker & Wiki
blackhole at torproject.org
Fri Jul 21 12:39:27 UTC 2017
#22983: add a descriptor interface and implementation for web-logs
---------------------------------+------------------------------
Reporter: iwakeh | Owner: metrics-team
Type: enhancement | Status: new
Priority: Medium | Milestone:
Component: Metrics/metrics-lib | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
---------------------------------+------------------------------
Comment (by karsten):
Replying to [comment:5 iwakeh]:
> Replying to [comment:4 karsten]:
>
> Thanks for the valuable input!
>
> > Regarding the name, let's try to find something more descriptive. How
> > about `WebServerLog` or even `ApacheHttpServerAccessLog`? Otherwise
> > there's the risk of confusion with descriptor types added in the future,
> > like a log file written by BridgeDB containing client requests for bridge
> > addresses.
>
> I see an interface hierarchy here:
> `LogDescriptor` as parent for all logs (then we drop 'Descriptor' from the
> names), with `WebServerAccessLog` as the first extending interface. Later
> we can add other `*Log` interfaces like `BridgeDbClientLog` etc.
>
> So for now, I focus on the access-log integration and keep future
> extensions in mind for the design.
Makes sense.
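
A minimal sketch of the hierarchy described above. Only the names `LogDescriptor`, `WebServerAccessLog`, and `BridgeDbClientLog` come from this thread; the methods `getFileName()` and `getVirtualHost()` and the class `SimpleAccessLog` are assumptions made up for illustration, not part of metrics-lib.

```java
/** Parent interface for all log-type descriptors (name from this thread). */
interface LogDescriptor {
  /** Hypothetical getter: name of the file the log was read from. */
  String getFileName();
}

/** First extending interface, for web server access logs. */
interface WebServerAccessLog extends LogDescriptor {
  /** Hypothetical getter: virtual host the log was written for. */
  String getVirtualHost();
}

/** Possible later addition; left empty because nothing is specified yet. */
interface BridgeDbClientLog extends LogDescriptor {
}

/** Hypothetical implementation, only here to show how the types fit. */
class SimpleAccessLog implements WebServerAccessLog {

  private final String fileName;
  private final String virtualHost;

  SimpleAccessLog(String fileName, String virtualHost) {
    this.fileName = fileName;
    this.virtualHost = virtualHost;
  }

  @Override
  public String getFileName() {
    return this.fileName;
  }

  @Override
  public String getVirtualHost() {
    return this.virtualHost;
  }
}
```

With this shape, an application could accept any `LogDescriptor` and check with `instanceof` for the specific log type it knows how to handle.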
> > Regarding the suggested interface, I think there's a short term and a
> > long term part here.
> >
> > In the long term I think that it would be at least twice as useful if
> > we read the log contents and added methods to read these parsed contents.
> > It's true that this causes some development hassle. But that's why we do
> > it once in the library rather than rely on possibly more than one
> > application to get it right. And we can still include the raw descriptor
> > bytes by storing the compressed bytes and inflating them upon request.
>
> Yes, partially I have this in CollecTor anyway for sanitizing the logs.
> I'll add generally useful functionality to the metrics-lib code.
> Should we have a new package for the implementations like
> `org.torproject.descriptor.logs`? The log processing and content differ
> from usual descriptors quite a bit.
As long as we keep all types that are relevant for applications in
`org.torproject.descriptor`, I don't mind adding new subpackages.
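
The idea of storing the compressed bytes and inflating them upon request could look roughly like the following sketch. It assumes gzip compression and uses only `java.util.zip`; the class name `CompressedLogBytes` and its methods are hypothetical and not existing metrics-lib API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical holder that keeps only the gzip-compressed log bytes. */
class CompressedLogBytes {

  private final byte[] compressed;

  CompressedLogBytes(byte[] compressed) {
    this.compressed = compressed.clone();
  }

  /** Helper for building an instance from uncompressed bytes. */
  static CompressedLogBytes fromRawBytes(byte[] raw) {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(baos)) {
      gz.write(raw);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return new CompressedLogBytes(baos.toByteArray());
  }

  /** Inflates the stored bytes on every call instead of caching them. */
  byte[] getRawDescriptorBytes() {
    try (InputStream in = new GZIPInputStream(
        new ByteArrayInputStream(this.compressed))) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buffer = new byte[8192];
      int read;
      while ((read = in.read(buffer)) != -1) {
        out.write(buffer, 0, read);
      }
      return out.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The trade-off in this sketch is memory against CPU: only the compressed form stays on the heap, and each request for the raw bytes pays for one decompression.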
> > Some comments on the interface:
> > - Let's include a subtype `Request` or similar for each line
> > contained in the log file, and let's include a method `getRequests()` that
> > returns `Iterable<Request>`.
>
> There could be a parent interface `LogLine` that is extended by an
> appropriate interface for each log type, like a `Request` interface for
> access-logs.
> I'll think about it and definitely keep the design open for the addition,
> but would give it lower priority right now.
Long term sounds fine.
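
A rough sketch of how `LogLine`, `Request`, and `getRequests()` might fit together. The three names come from this thread; the method `getLine()` and the class `AccessLogSketch` are assumptions for illustration only, and the sketch does no field parsing at all.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Parent interface for a single line of any log type. */
interface LogLine {
  /** Hypothetical getter: the line as read, without its line ending. */
  String getLine();
}

/** Line type for access logs; field getters would be added here. */
interface Request extends LogLine {
}

/** Hypothetical access log that exposes its lines as requests. */
class AccessLogSketch {

  private final List<Request> requests = new ArrayList<>();

  AccessLogSketch(String contents) {
    for (String line : contents.split("\n")) {
      if (!line.isEmpty()) {
        // Request has a single abstract method, so a lambda suffices here.
        this.requests.add(() -> line);
      }
    }
  }

  /** Returns all requests in the order they appeared in the log. */
  Iterable<Request> getRequests() {
    return Collections.unmodifiableList(this.requests);
  }
}
```

Returning `Iterable<Request>` rather than a `List` would leave room for a later implementation that parses lines lazily while iterating.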
> > - Due to the fact that we cannot include a `@type` annotation with a
> > version number, `Request` should ideally include getters for all fields
> > contained in Apache's Combined Log Format.
> > - Ideally, `getLogDate()` would return the date in milliseconds since
> > the epoch to be conformant to the rest of metrics-lib, in which case it
> > would probably be called `getLogMillis()`.
>
> Fine, but we only have the date, not the time, here. Thus, msec signals a
> precision we don't offer.
> I don't feel strongly about that.
Me neither, I just think that it's easier to handle timestamps from
different data sources if they all use the same format.
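
To make the two points above concrete, here is a sketch that parses one line in Apache's Combined Log Format (`%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"`) and exposes timestamps as milliseconds since the epoch. The class `CombinedLogRequest`, all getter names, and the ISO date input of `startOfDayMillis` are assumptions, not metrics-lib API. The regular expression is deliberately simple and does not handle escaped quotes inside quoted fields.

```java
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parsed request line in Combined Log Format. */
final class CombinedLogRequest {

  private static final Pattern LINE = Pattern.compile(
      "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)"
      + " \"([^\"]*)\" \"([^\"]*)\"$");

  private static final DateTimeFormatter TIME = DateTimeFormatter.ofPattern(
      "dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

  private final String[] fields = new String[9];

  private CombinedLogRequest(Matcher matcher) {
    for (int i = 0; i < this.fields.length; i++) {
      this.fields[i] = matcher.group(i + 1);
    }
  }

  /** Returns the parsed request, or null if the line does not match. */
  static CombinedLogRequest parse(String line) {
    Matcher matcher = LINE.matcher(line);
    return matcher.matches() ? new CombinedLogRequest(matcher) : null;
  }

  String getRemoteHost() { return this.fields[0]; }
  String getRemoteLogname() { return this.fields[1]; }
  String getRemoteUser() { return this.fields[2]; }
  String getRequestLine() { return this.fields[4]; }
  int getStatus() { return Integer.parseInt(this.fields[5]); }
  String getResponseSize() { return this.fields[6]; } // "-" if no body sent
  String getReferer() { return this.fields[7]; }
  String getUserAgent() { return this.fields[8]; }

  /** Returns the request time in milliseconds since the epoch. */
  long getRequestMillis() {
    return OffsetDateTime.parse(this.fields[3], TIME)
        .toInstant().toEpochMilli();
  }

  /** Maps a date-only value like "2017-07-21" to 00:00:00 UTC in millis. */
  static long startOfDayMillis(String isoDate) {
    return LocalDate.parse(isoDate).atStartOfDay(ZoneOffset.UTC)
        .toInstant().toEpochMilli();
  }
}
```

For a log that only carries a date, a `getLogMillis()` built like `startOfDayMillis` would always return a multiple of 86,400,000, which keeps the format uniform across data sources without claiming sub-day precision.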
> > - I'm unclear what `getCompressionType()` returns. I think I'd expect
> > a `String` that is either `"gz"` or `"gz"`, but not a `byte[]`. Was that
> > intended?
>
> Correct, this should read `String getCompressionType()`, just a typo.
> Actually, it might turn into an enum.
>
> > - If we read and parse logs, we'll have to change
> > `getUnrecognizedLines()` to return any unrecognized lines.
>
> Yes, maybe with an upper limit in case a log got mangled?
Good idea. First 100?
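
One way the upper limit could work, sketched under the assumption that only the first 100 unrecognized lines are kept and the rest are merely counted. The class `UnrecognizedLineCollector` and `getTotalUnrecognized()` are hypothetical; only `getUnrecognizedLines()` is a name from this thread.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical collector that keeps only the first unrecognized lines. */
class UnrecognizedLineCollector {

  /** Upper limit floated in this thread; not a settled value. */
  static final int MAX_KEPT = 100;

  private final List<String> kept = new ArrayList<>();
  private long total = 0L;

  /** Records a line, storing it only while the limit is not reached. */
  void add(String line) {
    this.total++;
    if (this.kept.size() < MAX_KEPT) {
      this.kept.add(line);
    }
  }

  /** Returns at most MAX_KEPT lines, in the order they were seen. */
  List<String> getUnrecognizedLines() {
    return Collections.unmodifiableList(this.kept);
  }

  /** Returns how many lines were unrecognized in total. */
  long getTotalUnrecognized() {
    return this.total;
  }
}
```

Keeping the total alongside the capped list would let an application tell a log with a handful of odd lines apart from one that got mangled entirely.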
> > In the short term I can see how we might want to put the `Request`
> > part on hold and only return metadata and uncompressed raw descriptor
> > contents in this new descriptor type.
>
> Fine, as replied above.
>
> Do you have a rough estimate of the future log file sizes metrics-lib
> will have to deal with?
No idea. I think some of the Apache logs are pretty large in uncompressed
form. But other descriptors have grown a lot over time, too, like votes.
And when we recently pondered appending all votes collected in a single
CollecTor sync run, our original expectation of the size turned out to be
pretty useless. So, 20 times the size?
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/22983#comment:6>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online