[tor-bugs] #20395 [Metrics/metrics-lib]: metrics-lib should be able to handle large descriptor files
Tor Bug Tracker & Wiki
blackhole at torproject.org
Wed May 10 20:28:59 UTC 2017
#20395: metrics-lib should be able to handle large descriptor files
---------------------------------+-----------------------------------
 Reporter:  iwakeh               |          Owner:  karsten
     Type:  defect               |         Status:  new
 Priority:  Medium               |      Milestone:  metrics-lib 2.0.0
Component:  Metrics/metrics-lib  |        Version:
 Severity:  Normal               |     Resolution:
 Keywords:                       |  Actual Points:
Parent ID:                       |         Points:
 Reviewer:                       |        Sponsor:
---------------------------------+-----------------------------------
Comment (by karsten):
Great ideas above! And I think we should implement them, because they
would clearly reduce memory consumption.
Going into more detail, your second assumption makes sense to me. I
didn't think of that before, but I agree that we can make that assumption.
However, your first assumption is unfortunately wrong. I just
concatenated all votes from May 1, 2017 into a single file with a size of
0.8G. I passed that to metrics-lib to read and parse, which consumed
4.1G of memory in total for parsed descriptors and the raw descriptor
bytes they contain. I then modified `DescriptorImpl` to avoid storing raw
descriptor bytes in memory, which brought memory consumption down to
3.3G. The difference is precisely the 0.8G of the original file. But the
3.3G still remains, and that number will grow with the number of
descriptors we put into a file. For example, 72 hours of votes from an
initial CollecTor sync, all concatenated into a single file, would consume
9.9G, plus a few G more while parsing. So, the suggestion above is
certainly an improvement, but it still does not scale.
But I could see us making these suggested improvements anyway, and they'll
help us going forward. Some thoughts:
- We could modify `Descriptor#getRawDescriptorBytes()` to use its file
reference and start and end positions to retrieve the bytes from disk and
return them to the caller, rather than requiring the user to do that
themselves. This would even make the change backward-compatible. (See the
first sketch after this list.)
- We should avoid calling that new `Descriptor#getRawDescriptorBytes()`
ourselves at all costs while parsing and instead pass the bytes around
directly. I'm mentioning this explicitly, because I found uses of that
method where we could have passed around these bytes as parameters
instead.
- We need to be careful to write the reading-files-in-chunks logic in a
way that detects descriptor starts and ends across chunk boundaries.
Think of tiny descriptors like microdescriptors.
- And we should avoid scanning chunks repeatedly when a descriptor covers
many, many such chunks. Think of huge descriptors like votes. (The second
sketch after this list tries to address both of these points.)
Once we're there, let's talk more about how to avoid keeping potentially
huge lists of parsed descriptors in memory. (One possible direction is
sketched below.)
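Just to make that future discussion slightly more concrete, one possible
direction (names made up) would be to hand out descriptors through an
`Iterator` that parses on demand, rather than returning a fully
materialized `List`:
{{{
import java.util.Iterator;

import org.torproject.descriptor.Descriptor;

/* Sketch only: a source that parses descriptors lazily, so that at most
 * one parsed descriptor needs to be held in memory at a time. */
public interface LazyDescriptorSource extends Iterable<Descriptor> {

  /* The returned iterator reads and parses the next descriptor only
   * when next() is called. */
  @Override
  Iterator<Descriptor> iterator();
}
}}}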
Do you want to start hacking on your suggestions above?
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/20395#comment:8>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online