[tor-bugs] #22428 [Metrics/CollecTor]: Add webstats module
Tor Bug Tracker & Wiki
blackhole at torproject.org
Mon Oct 23 16:46:19 UTC 2017
#22428: Add webstats module
-------------------------------+---------------------------------
 Reporter:  iwakeh             |          Owner:  iwakeh
     Type:  enhancement        |         Status:  needs_revision
 Priority:  High               |      Milestone:  CollecTor 1.5.0
Component:  Metrics/CollecTor  |        Version:
 Severity:  Normal             |     Resolution:
 Keywords:  metrics-2017       |  Actual Points:
Parent ID:                     |         Points:
 Reviewer:                     |        Sponsor:
-------------------------------+---------------------------------
Comment (by iwakeh):
Replying to [comment:36 karsten]:
> Alright, I finished an initial review of commit 086e904 in your
> task-22428-4 branch. I have several trivial or minor findings, but I'd
> like to postpone them until we have resolved one that I consider major:
Good, it's better to address the small stuff at the very end :-)
>
> I'm unclear whether the sibling approach is robust enough to cover all
> cases and edge cases. Maybe even worse, I'm unclear whether we'd notice if
> we'd be running into an uncovered edge case or if we'd silently not
> process and therefore lose data.
This is not a question about the sibling approach, but the general
question: when is the log for a certain day done?
I'll address this in more detail on #23243, because this is a
specification issue and this ticket here should only be concerned with
implementation, imo.
>
> For example, what happens if we sanitize logs from a server that
> receives ''very'' few requests, maybe only a few requests per week?
> Consider these original log files (where I scrubbed the virtual host
> name):
> - `scrubbed.torproject.org-access.log-20171001.gz` contains requests
>   from 2017-09-30 and 2017-10-01.
> - `scrubbed.torproject.org-access.log-20171002.gz` contains requests
>   from 2017-10-01 only.
> - `scrubbed.torproject.org-access.log-20171004.gz` contains requests
>   from 2017-10-03 only.
> - `scrubbed.torproject.org-access.log-20171006.gz` contains requests
>   from 2017-10-05 and 2017-10-06.
>
> Would the existing code produce logs for 2017-10-01, -03, -05, and -06
> with exactly the sanitized log lines from these original log files? (I
> didn't run it, I only read the code and am unclear about this.)
The result does not depend on the contents of an input log. The above
files would lead to a single sanitized log for 2017-10-01. The
implementation relies on having the sibling, which could be provided by a
simple `touch scrubbed.torproject.org-access.log-20171003` command. The
application needs an outside cue. I'm stating this here for completeness.
As there is more to this (including the below questions), let's move the
discussion to the spec ticket.
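
To make the rule described in this reply concrete, here is a minimal sketch. It is not the CollecTor implementation. It assumes that a day D counts as done exactly when input files dated D and D+1 both exist, regardless of their contents, which is one reading of the reply above. The names `LOG_NAME`, `file_dates`, and `days_with_sibling` are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Hypothetical pattern for names like "<vhost>-access.log-YYYYMMDD[.gz]".
LOG_NAME = re.compile(r"-access\.log-(\d{8})(?:\.gz)?$")


def file_dates(names):
    """Return the set of dates encoded in the given log file names."""
    dates = set()
    for name in names:
        match = LOG_NAME.search(name)
        if match:
            dates.add(datetime.strptime(match.group(1), "%Y%m%d").date())
    return dates


def days_with_sibling(names):
    """Return the sorted days D for which files dated D and D+1 both exist.

    Assumption: only file names matter, not file contents.
    """
    dates = file_dates(names)
    return sorted(d for d in dates if d + timedelta(days=1) in dates)


originals = [
    "scrubbed.torproject.org-access.log-20171001.gz",
    "scrubbed.torproject.org-access.log-20171002.gz",
    "scrubbed.torproject.org-access.log-20171004.gz",
    "scrubbed.torproject.org-access.log-20171006.gz",
]

# Only 2017-10-01 has a sibling among the original files.
print([d.isoformat() for d in days_with_sibling(originals)])

# An empty file created with `touch` for 20171003 adds two more days.
touched = originals + ["scrubbed.torproject.org-access.log-20171003"]
print([d.isoformat() for d in days_with_sibling(touched)])
```

Under this reading, the four original files yield only 2017-10-01, and adding the empty 20171003 file also yields 2017-10-02 and 2017-10-03, which is the outside cue mentioned above.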
>
> Here's another, related question: what happens if a web server rotates
> logs more often than once per day? At least that's something that we write
> in the specification. I'm not sure how this would work with file names, so
> maybe we in fact require that logs are rotated exactly once per day, and
> we just didn't write that in the specification yet. However, it seems
> rather restrictive to prescribe exact log rotation intervals in order to
> sanitize logs subsequently. Maybe we should be less restrictive here.
>
> Is there a way to make this approach more robust? And is there a way to
> ensure that we'll learn about any broken assumptions as early as possible?
I will move all these questions and possible answers to #23243.
>
> Ah, and do you mind doing another round of JavaDoc editing and variable
> renaming towards finding a middle ground between
> 2-characters-is-almost-verbose and
> 80-characters-can-fit-in-a-line-so-let-us-not-use-more-than-79? As a
> fixup/squash commit without rebasing, please. :) Thank you!
I'll take another look, no problem ;-)
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/22428#comment:37>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online