[metrics-bugs] #31204 [Metrics/CollecTor]: Extend file objects in index.json to include descriptor types, publication times, and file digests
Tor Bug Tracker & Wiki
blackhole at torproject.org
Mon Aug 5 20:04:14 UTC 2019
#31204: Extend file objects in index.json to include descriptor types, publication
times, and file digests
-------------------------------+--------------------------
Reporter: karsten | Owner: karsten
Type: enhancement | Status: accepted
Priority: Medium | Milestone:
Component: Metrics/CollecTor | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-------------------------------+--------------------------
Changes (by karsten):
* owner: metrics-team => karsten
* status: new => accepted
Comment:
I started working on this today. I do have some code here that supports
running in the background using a thread pool, but I'll have to spend at
least another day or two on this before it's ready for review.
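
To illustrate the idea of indexing in the background with a thread pool, here is a minimal sketch. It is written in Python for brevity and is not the code mentioned above; the names `sha256_digest` and `index_in_background`, the chunk size, and the worker count are all assumptions made for illustration.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def sha256_digest(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large tarballs never sit fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_in_background(paths, max_workers=4):
    """Submit one digest task per file and return a {Path: Future} mapping.

    Hypothetical helper: the caller keeps serving requests and collects
    each result later via future.result().
    """
    executor = ThreadPoolExecutor(max_workers=max_workers)
    futures = {Path(p): executor.submit(sha256_digest, p) for p in paths}
    # Accept no further tasks; the tasks already submitted still run.
    executor.shutdown(wait=False)
    return futures
```

A caller would hold on to the returned futures and fill in the digest field of each file object once `future.result()` becomes available.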
A few observations from writing this code and testing it locally:
1. Reading tarballs to find out descriptor types and publication times is
really time-consuming. A test run with 643M of data took roughly 10
minutes on my laptop. For comparison, our archive is 95G in size, so about
150 times the size, which at the same rate would mean roughly 25 hours for
a full run. We might want to index the archive on an external machine that
is not the CollecTor host. And we need to be clear that the server will be
busy for 10-20 minutes after creating new tarballs every 2 to 3 days.
Neither of these is a major concern; I'm just stating it.
2. Interestingly, computing SHA-256 digests of tarballs only took about 5
seconds of these 10 minutes, so that's really, really cheap compared to
reading tarballs and extracting descriptor types and publication times.
3. I wonder how it will work out in practice that these new fields will
be blank for 10-20 minutes for newly created tarballs. In many cases,
newly created tarballs replace existing tarballs from a few days ago for
which these fields were available. One effect would be that the latest
published timestamp for a given descriptor type will flap between, say,
the middle of a month and the end of the previous month, only because the
tarball for the current month is replaced. Maybe we need to do something
more elaborate where we put newly created tarballs into a staging area,
parse them there, and then move them into place.
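
The staging idea in the third point could look roughly like the following sketch. It is Python, not CollecTor code, and `promote`, `parse`, and `index` are hypothetical names: `parse` stands in for the slow step that extracts descriptor types and publication times, and `index` stands in for whatever structure backs index.json. It assumes the staging area and the final location are on the same filesystem, so that the rename is atomic on POSIX systems.

```python
import os


def promote(staged_path, final_path, parse, index):
    """Parse a staged tarball, then move it into place and record its metadata.

    The old tarball and its old index entry stay visible during the slow
    parsing step, so the new fields are never blank for that file.
    """
    # Slow step: runs while the previous tarball is still being served.
    metadata = parse(staged_path)
    # Swap in the new tarball; os.replace overwrites an existing target.
    os.replace(staged_path, final_path)
    # Update the index entry only after the new file is in place.
    index[os.path.basename(final_path)] = metadata
    return metadata
```

Note that the file swap and the index update are still two separate steps in this sketch, so a reader could briefly see the new tarball with the old index entry; the window is much shorter than the 10-20 minutes of parsing, though.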
I'll think more about these issues (mainly the third one) and work more on
the code as time permits. Grabbing the ticket, because it doesn't really
make sense for somebody else to redo what I have done so far.
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/31204#comment:2>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online