[tor-bugs] #20395 [Metrics/Library]: Add capability to handle large descriptor files
Tor Bug Tracker & Wiki
blackhole at torproject.org
Fri Feb 9 11:33:18 UTC 2018
#20395: Add capability to handle large descriptor files
-----------------------------+------------------------------
Reporter: iwakeh | Owner: karsten
Type: defect | Status: needs_review
Priority: Medium | Milestone:
Component: Metrics/Library | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-----------------------------+------------------------------
Changes (by karsten):
* status: accepted => needs_review
Comment:
I started making some improvements here. Here's my train of thought:
1. Rather than reading the whole file into memory at the beginning, we
could read it in chunks and start parsing as soon as we have seen a full
descriptor. This sounds like a useful improvement, but it's actually very
limited, at least on its own. Reading the 70M descriptor file I used for
testing is really fast; it's the parsing that takes long. And as long as
we need the full descriptor file contents in memory anyway, reading in
chunks doesn't save anything. (See also 3. below.)
2. Rather than parsing all descriptors contained in a given file into a
list and then throwing all parsed descriptors into the
`BlockingIterator<Descriptor>`, we could just skip the list in the middle
and add each descriptor as soon as it is parsed. The effect is that the
time to first descriptor is greatly reduced, whereas the time to last
descriptor stays the same. I prepared a patch for this. The commit message
contains more details.
3. Rather than storing descriptor file contents in a `byte[]`, we could
go through the file, read descriptor by descriptor, and store a `File`
reference together with offset and length into the file. The effect would
be that we avoid keeping the raw descriptor file contents in memory at
all. We'd still keep parsed contents in memory. A possible downside is
that the file must not be deleted or moved away while the application
processes descriptors, which should be safe to require. Still, this is a
larger change than 2., and it requires 1. That's why I postponed this.
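To make idea 3. more concrete, here is a minimal sketch of what a `File`
reference with offset and length could look like. It is not code from the
patch or from metrics-lib: the class names `DescriptorIndexSketch` and
`DescriptorRef` are hypothetical, and it assumes that every descriptor
starts at a line beginning with a known start token (bytes before the
first token are ignored).

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class DescriptorIndexSketch {

  /** Hypothetical reference to one descriptor inside a file. */
  static final class DescriptorRef {
    final File file;
    final long offset;
    final int length;

    DescriptorRef(File file, long offset, int length) {
      this.file = file;
      this.offset = offset;
      this.length = length;
    }

    /** Reads only this descriptor's raw bytes from disk, on demand. */
    byte[] readRawBytes() throws IOException {
      byte[] raw = new byte[length];
      try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
        raf.seek(offset);
        raf.readFully(raw);
      }
      return raw;
    }
  }

  /**
   * Scans the file once and records where each descriptor starts and how
   * long it is. Assumption: a descriptor starts at a line that begins with
   * the given non-empty ASCII start token.
   */
  static List<DescriptorRef> index(File file, String startToken)
      throws IOException {
    byte[] token = startToken.getBytes(StandardCharsets.US_ASCII);
    List<Long> starts = new ArrayList<>();
    long pos = 0;                // offset of the byte currently examined
    int matched = 0;             // token bytes matched on the current line
    boolean lineCanMatch = true; // false once the line start has diverged
    try (InputStream in =
        new BufferedInputStream(new FileInputStream(file))) {
      int b;
      while ((b = in.read()) != -1) {
        if (b == '\n') {
          matched = 0;
          lineCanMatch = true;
        } else if (lineCanMatch) {
          if (b == token[matched]) {
            matched++;
            if (matched == token.length) {
              starts.add(pos - token.length + 1);
              lineCanMatch = false;
            }
          } else {
            lineCanMatch = false;
          }
        }
        pos++;
      }
    }
    long fileLength = pos;
    List<DescriptorRef> refs = new ArrayList<>();
    for (int i = 0; i < starts.size(); i++) {
      long start = starts.get(i);
      long end = i + 1 < starts.size() ? starts.get(i + 1) : fileLength;
      refs.add(new DescriptorRef(file, start, (int) (end - start)));
    }
    return refs;
  }
}
```

Only the small `DescriptorRef` objects stay in memory after indexing; raw
bytes are fetched per descriptor via `readRawBytes()`, which is why the
file must stay in place while descriptors are being processed.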
Please review
[https://gitweb.torproject.org/user/karsten/metrics-lib.git/commit/?h=task-20395&id=ef9406c148a477720cdca67c6a2891ecd850f912
commit ef9406c in my task-20395 branch].
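
For readers who want the gist of 2. without opening the commit, the
following is a minimal sketch of the difference between the two hand-off
strategies. It is not the code from the commit: it uses a plain
`java.util.concurrent.BlockingQueue<String>` as a stand-in for
`BlockingIterator<Descriptor>`, and the class `StreamingHandOffSketch`,
the `parse` method and the `END` marker are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

class StreamingHandOffSketch {

  /** Hypothetical end-of-input marker, compared by identity. */
  static final String END = new String("end-of-input");

  /** Stand-in for the expensive parsing step. */
  static String parse(String raw) {
    return raw.trim().toUpperCase();
  }

  /** Before: parse everything into a list, then hand over the list. */
  static void parseThenPublish(List<String> raws, BlockingQueue<String> out)
      throws InterruptedException {
    List<String> parsed = new ArrayList<>();
    for (String raw : raws) {
      parsed.add(parse(raw));
    }
    // The consumer sees nothing until all descriptors are parsed.
    for (String descriptor : parsed) {
      out.put(descriptor);
    }
    out.put(END);
  }

  /** After: hand over each descriptor as soon as it is parsed. */
  static void parseAndPublish(List<String> raws, BlockingQueue<String> out)
      throws InterruptedException {
    for (String raw : raws) {
      out.put(parse(raw));
    }
    out.put(END);
  }
}
```

With a consumer thread calling `take()` on the queue, the second variant
makes the first descriptor available after one `parse` call instead of
after all of them; the total parsing work, and so the time to last
descriptor, is unchanged.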
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/20395#comment:15>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online