[tor-bugs] #20228 [Metrics/CollecTor]: Append all votes with same valid-after time to a single file in `recent/`
Tor Bug Tracker & Wiki
blackhole at torproject.org
Wed Oct 5 12:19:25 UTC 2016
#20228: Append all votes with same valid-after time to a single file in `recent/`
-------------------------------+---------------------
Reporter: karsten | Owner:
Type: enhancement | Status: new
Priority: High | Milestone:
Component: Metrics/CollecTor | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-------------------------------+---------------------
Changes (by karsten):
* priority: Medium => High
Comment:
I'd like us to move forward here, ideally with descriptors grouped by
download time and both of us being fully convinced that it's the best way
forward. :)
So, let me give you some background on where the `recent/` folder comes
from.
A few years back, there was just the `archive/` folder with tarballs that
were updated every few days. All services like Tor Metrics, ExoneraTor,
and Onionoo were running on the same host as CollecTor and using
CollecTor's directory structure for importing new descriptors. This was
very convenient for running these services, but of course it was very
fragile and made it practically impossible for others to run similar
services. That's when I turned CollecTor into its own service.
The new CollecTor service had a local directory called `rsync/`, the
predecessor of `recent/`, which had just the newest files that other
services would download via `rsync` rather than http. The idea was to
provide the latest 72 hours of descriptors, so that services can miss
updates for up to 3 days (a weekend) without having to fall back to
importing tarballs from the `archive/` directory. This fixed the problem
of running all services on one machine, but it didn't allow others to run
services. We quickly learned that rsyncing thousands or even hundreds of
thousands of files did not scale, so we appended many small descriptors
into one file per CollecTor update run.
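As a rough illustration of that last step, here is a minimal sketch of appending many small descriptors into one file per update run. The function names and the file naming scheme are made up for this example and are not CollecTor's actual code; it also assumes each raw descriptor already ends with a newline.

```python
from collections import defaultdict
from datetime import datetime, timezone


def run_file_name(download_time: datetime) -> str:
    """Hypothetical naming scheme: one file per update run, named by download time."""
    return download_time.strftime("%Y-%m-%d-%H-%M-%S") + "-votes"


def append_by_run(descriptors):
    """Group (download_time, raw_descriptor) pairs into one blob per update run.

    Returns a dict mapping the (hypothetical) file name to the concatenated
    descriptors, in the order they were passed in.
    """
    files = defaultdict(list)
    for download_time, raw in descriptors:
        files[run_file_name(download_time)].append(raw)
    return {name: "".join(parts) for name, parts in files.items()}
```

With this layout, a service syncing the directory transfers one file per run instead of one file per descriptor.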
At some point we made that `rsync/` directory available via http as
`recent/` and taught Onionoo et al. to download descriptors from there
instead of relying on a local `rsync` command to magically fetch them.
This is when other services could first enter the game. It's also when
users started browsing the `recent/` directory to have an easy way to
download descriptors---but that was mostly coincidence and a nice side
effect.
Now we're considering changing the directory structure to make it even
more efficient for services to keep up to date. Merging votes into single
files reduces the `index.json*` size while keeping the service exactly as
useful for other services. What we'd make a bit more difficult is
accessibility for humans, because they could no longer locate a vote as
easily.
Also consider a feature request that people ask for every so often:
provide a search for raw descriptors. This is something that folks like
directory authority operators or others who debug the network would find
really useful. And these folks might be sad that votes are appended to
single files and stored by download time rather than valid-after time.
But it's again coincidence that votes are easily locatable by valid-after
time. On the other hand, if a user searches for something different, like
a relay fingerprint or IP address, they'll likely have to download the
latest few votes and search locally.
So, we might even go one step further and store ''all'' descriptors in the
`recent/` folder by download time. That would include consensuses, of
which there is usually only one per CollecTor update run. The upside would
be that it'd become more obvious that all files contain the download time,
not the published or valid-after time.
All in all, I'd like to consider the `recent/` folder as an update channel
for services rather than something that humans browse. I'm not going to
stop them from doing that, but I'm very hesitant to make the original use
case of that directory less useful by supporting this new use case. And
we would do that by forcing services to download multiple files containing
many descriptors they already know.
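To make the re-download concern concrete, here is a minimal sketch of how a service might decide which files to fetch. It assumes the service only sees a path and a last-modified value per file; this is a simplification for illustration, not the real `index.json` schema, and `files_to_fetch` is a made-up name.

```python
def files_to_fetch(index_entries, seen):
    """Return paths that are new or have changed since the last update.

    index_entries: iterable of (path, last_modified) pairs (assumed format).
    seen: dict mapping path -> last_modified value at the time of download.
    """
    return sorted(
        path for path, modified in index_entries if seen.get(path) != modified
    )
```

If files are grouped by valid-after time, a late vote appended to an existing file changes that file's last-modified value, so the whole file gets fetched again. If files are grouped by download time, a finished file never changes and only new files show up in the result.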
Somebody should go and write a descriptor database that takes CollecTor's
`recent/` folder as input and provides a search interface that returns raw
descriptors.
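A minimal sketch of what the core of such a search could look like, assuming that each descriptor in a concatenated file starts with an `@type ` annotation line. The marker is passed as a parameter because that assumption may not hold for every file, and both function names are hypothetical. A real database would index the descriptors rather than scan them.

```python
def split_descriptors(blob, marker="@type "):
    """Split a concatenated file into raw descriptors.

    Assumes every descriptor begins with a line starting with `marker`.
    """
    current = []
    for line in blob.splitlines(keepends=True):
        if line.startswith(marker) and current:
            yield "".join(current)
            current = []
        current.append(line)
    if current:
        yield "".join(current)


def search(blobs, needle):
    """Return all raw descriptors that contain `needle`,
    for example a relay fingerprint or an IP address."""
    return [d for blob in blobs for d in split_descriptors(blob) if needle in d]
```

Because the search works on descriptor contents, it doesn't matter whether the files are named by download time or by valid-after time.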
I hope this makes sense. Please let me know if it doesn't! And thanks
for reading this wall of text. ;)
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/20228#comment:5>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online