[metrics-bugs] #33502 [Metrics/CollecTor]: Do not let appended descriptor files grow too large
Tor Bug Tracker & Wiki
blackhole at torproject.org
Mon Mar 2 15:12:56 UTC 2020
#33502: Do not let appended descriptor files grow too large
-----------------------------------+----------------------
Reporter: karsten | Owner: karsten
Type: enhancement | Status: assigned
Priority: Medium | Milestone:
Component: Metrics/CollecTor | Version:
Severity: Normal | Keywords:
Actual Points: | Parent ID:
Points: | Reviewer:
Sponsor: |
-----------------------------------+----------------------
I revisited #20395 last week. The issue is that metrics-lib cannot handle
large descriptor files, because it first reads the entire file into memory
before splitting it into single descriptors and parsing them. While it
would be possible to parse large descriptor files after making some major
code changes (using `FileChannel` and doing lazy parsing), I don't think
that we have to do that. After all, we're writing these large descriptor
files ourselves in CollecTor, and it's up to us to stop doing that.

Going back in time, the original reason for concatenating multiple
descriptors into a single file was that rsyncing many tiny files from one
host to another host was just slow. So we appended server descriptors and
extra-info descriptors into a single file. This works well with server
descriptors or extra-info descriptors published within 1 hour or even 10
hours. It does not work that well anymore with all server descriptors or
extra-info descriptors synced from another CollecTor instance when
starting a new instance (#20335). It works even less well when importing
one or more monthly tarballs containing server descriptors or extra-info
descriptors (#27716).

My suggestion is that we define a configurable limit for appended
descriptor files of, say, 20 MiB. And when storing a descriptor, we check
whether appending a descriptor to an existing descriptor file would exceed
this limit and start a new descriptor file in that case.
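
A minimal sketch of that check, in Java because CollecTor is written in Java. Everything named here is hypothetical and not taken from CollecTor's code: the class `CappedDescriptorAppender`, the numbered file suffix, and the constructor parameters are assumptions made for illustration. The sketch also assumes that a single descriptor larger than the limit is simply written to a file of its own.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper (not part of CollecTor): appends descriptors to size-capped files. */
final class CappedDescriptorAppender {

  private final Path directory;
  private final String baseName;
  private final long maxFileBytes;  // the configurable limit, e.g. 20L * 1024L * 1024L
  private int fileIndex = 0;        // assumed naming scheme: baseName-0, baseName-1, ...

  CappedDescriptorAppender(Path directory, String baseName, long maxFileBytes) {
    if (maxFileBytes <= 0) {
      throw new IllegalArgumentException("maxFileBytes must be positive");
    }
    this.directory = directory;
    this.baseName = baseName;
    this.maxFileBytes = maxFileBytes;
  }

  /** Appends one descriptor and returns the file it was written to. */
  Path append(byte[] descriptorBytes) throws IOException {
    Files.createDirectories(directory);
    Path target = currentFile();
    // Move on to the next file while appending would exceed the limit.
    // An empty or missing file is always accepted, so a descriptor that
    // is larger than the limit still ends up in a file of its own.
    while (sizeOf(target) > 0
        && sizeOf(target) + descriptorBytes.length > maxFileBytes) {
      fileIndex++;
      target = currentFile();
    }
    Files.write(target, descriptorBytes,
        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    return target;
  }

  private Path currentFile() {
    return directory.resolve(baseName + "-" + fileIndex);
  }

  private static long sizeOf(Path file) throws IOException {
    return Files.exists(file) ? Files.size(file) : 0L;
  }
}
```

With a limit of 20 MiB, a caller would construct the appender once per descriptor type and call `append` for each descriptor; each resulting file then stays at or below the limit unless a single descriptor is larger than the limit by itself.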

There are some technical details to work out, but I think they can be
solved. I also don't expect this to require much code or any complex
code changes. The benefit would be that we could resolve #20395 and
#27716 by implementing this.

Thoughts on the general idea?
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/33502>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online