[metrics-bugs] #25196 [Metrics/Statistics]: Cut off recent dates from several CSV files (was: Cut off recent dates from hidserv.csv)
Tor Bug Tracker & Wiki
blackhole at torproject.org
Wed Mar 7 08:38:16 UTC 2018
#25196: Cut off recent dates from several CSV files
--------------------------------+------------------------------
Reporter: karsten | Owner: karsten
Type: defect | Status: needs_review
Priority: Medium | Milestone:
Component: Metrics/Statistics | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: iwakeh | Sponsor:
--------------------------------+------------------------------
Changes (by karsten):
* status: needs_revision => needs_review
Comment:
I set up a local metrics-web instance and modified it to run once per hour
and not cut off any dates at all. I'm
[https://trac.torproject.org/projects/tor/attachment/ticket/25196/cut-off-recent-dates-2018-03-06.pdf
attaching a PDF file] showing how statistics
for given dates (colors) change (y axis) over the UTC day of March 6 (x
axis). If a colored line changes much over the day, we cannot reasonably
include it yet and need to cut off that date. There's a trade-off of
holding back a statistic that is still changing too much vs. delaying a
statistic more than necessary and not being able to act on the data.
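As a rough illustration of that check (not part of metrics-web), the
following Python sketch flags dates whose value still moves too much over
the day. The function names, the 5% threshold, and the sample numbers are
all made up for illustration; the actual judgment above was made by
looking at the plots.

```python
def relative_spread(snapshots):
    """Spread (max - min) of one date's hourly values, relative to the last value."""
    if not snapshots or snapshots[-1] == 0:
        return float("inf")  # no usable data: treat as unstable
    return (max(snapshots) - min(snapshots)) / abs(snapshots[-1])


def dates_to_cut_off(hourly, threshold=0.05):
    """Return the dates whose statistic changed by more than `threshold` over the day.

    `hourly` maps a date string to the list of values that the hourly runs
    produced for that date (hypothetical input format).
    """
    return sorted(d for d, values in hourly.items()
                  if relative_spread(values) > threshold)


# Made-up sample values, not taken from the attached PDF.
hourly = {
    "2018-03-03": [1000.0, 1000.0, 1001.0],  # practically settled
    "2018-03-05": [700.0, 850.0, 990.0],     # still changing a lot
}
print(dates_to_cut_off(hourly))
```

A lower threshold holds statistics back longer; a higher one publishes
them sooner but risks visible later corrections.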
Here's what I think we should do for all current statistics files:
- `servers.csv`: We currently cut off 2 days (today = 2018-03-06 and the
day before = 2018-03-05), but it would be sufficient to cut off just 1 day
(today). The reason is that this file is based on consensuses and
referenced server descriptors, all of which are typically available at the
end of a day.
- `ipv6servers.csv`: Same as `servers.csv`, except that we don't cut off
anything yet, though I think we should, following the same rationale as
above.
- `advbwdist.csv`: Same as `servers.csv`, except that we already cut off
just 1 day, so there's no need to change anything here.
- `bandwidth.csv`: This file is based on statistics reported in extra-
info descriptors, and those might take more time to come in. We're also
not doing any estimates on the numbers we got so far, but we're simply
adding up what we have. So, if 5% of statistics are still missing, those
missing statistics will still change the end result by 5%. I suggest we
wait 3 days. We currently cut off 4, but I think 3 should be sufficient.
The better (long-term) solution would be to compensate for missing data by
extrapolating what we have, but we're not there yet.
- `connbidirect2.csv`: Same as for `bandwidth.csv`, except that we're
providing averages where missing descriptors don't affect the result as
much. Cutting off 2 days will be fine (today and yesterday).
- `clients.csv` and `userstats-combined.csv`: Same as for
`connbidirect2.csv`, except that we're being smarter about estimating
numbers from given reports. Cutting off 2 days will be enough (today and
yesterday).
- `hidserv.csv`: Same as `clients.csv` et al., except we're being quite
smart about extrapolating reported statistics, so that we might even cut
off just 1 day. But let's do 2 days as before to be on the safe side.
- `torperf-1.1.csv`: OnionPerf only provides completed days, so it
depends on when we get those files and whether we get all of them at once.
I'm less certain here, but I think we're doing okay by cutting off 2 days.
- `webstats.csv`: I don't have good data, because webstats.tp.o has been
down for a couple of days now. This might also change after switching to
CollecTor's webstats module. I'd say we don't touch this now and revisit
it after switching to CollecTor.
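The proposed cut-offs can be summarized in a small Python sketch. This is
only an illustration of the list above, not the code in the commit linked
below; the names `CUT_OFF_DAYS`, `last_included_date` and `cut_off_recent`
are hypothetical, and the `date` column name is an assumption about the
CSV layout. A cut-off of N days means today plus the N-1 days before it.

```python
from datetime import date, timedelta

# Proposed number of most recent days to cut off, counting today.
# webstats.csv is omitted because the proposal is not to touch it for now.
CUT_OFF_DAYS = {
    "servers.csv": 1,
    "ipv6servers.csv": 1,
    "advbwdist.csv": 1,
    "bandwidth.csv": 3,
    "connbidirect2.csv": 2,
    "clients.csv": 2,
    "userstats-combined.csv": 2,
    "hidserv.csv": 2,
    "torperf-1.1.csv": 2,
}


def last_included_date(filename, today):
    """Return the most recent date that would still be written to `filename`."""
    return today - timedelta(days=CUT_OFF_DAYS[filename])


def cut_off_recent(rows, filename, today):
    """Drop rows that are too recent for `filename`.

    `rows` is a list of dicts with an ISO-formatted "date" key
    (assumed column name).
    """
    limit = last_included_date(filename, today)
    return [r for r in rows if date.fromisoformat(r["date"]) <= limit]


today = date(2018, 3, 6)
rows = [{"date": "2018-03-0%d" % d} for d in range(1, 7)]
print([r["date"] for r in cut_off_recent(rows, "bandwidth.csv", today)])
```

With today = 2018-03-06, `servers.csv` would end at 2018-03-05 and
`bandwidth.csv` at 2018-03-03.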
Please review
[https://gitweb.torproject.org/karsten/metrics-web.git/commit/?h=task-25196&id=450d9f1edd880a7d6d46014af6bcc0e211630af7
commit 450d9f1 in my updated task-25196 branch]. If possible, I'd like to
make changes tomorrow (Thursday).
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/25196#comment:9>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online