[metrics-bugs] #21751 [Metrics/metrics-lib]: Use multiple threads to parse descriptors
Tor Bug Tracker & Wiki
blackhole at torproject.org
Wed Mar 15 15:50:51 UTC 2017
#21751: Use multiple threads to parse descriptors
-------------------------------------+--------------------------
Reporter: karsten | Owner: metrics-team
Type: enhancement | Status: new
Priority: Medium | Milestone:
Component: Metrics/metrics-lib | Version:
Severity: Normal | Keywords:
Actual Points: | Parent ID:
Points: | Reviewer:
Sponsor: |
-------------------------------------+--------------------------
The following idea came up when I looked a bit into #17831 to speed up
metrics-lib.
When we read and parse descriptors from disk, a single thread does both.
It's a daemon thread and not the application's main thread, so if the
application's thread is busy processing parsed descriptors, we're at least
using two threads. But we could parallelize even more by using separate
threads for reading and parsing, and even by using multiple threads for
reading and/or for parsing.
I'll leave the I/O part to #17831 and focus on the multi-threaded parsing
part here.
I wrote a little patch that measures time spent on reading tarball
contents in `DescriptorReaderImpl#readTarballs()` and then extended that
by moving descriptor parsing code to a separate class that implements
`Runnable` and that gets executed by an `ExecutorService`. I initialized
that executor with `Executors.newFixedThreadPool(n)` for `n = [2, 4, 8,
16, 32, 64]`. I also tried `n = 1`, but ran out of memory due to a major
issue in my simple patch: it reads ''all'' tarball contents into memory
when creating `Task` instances, even if they cannot be executed anytime
soon.
What we should do is block the reader thread when it realizes that the
executor is already full. I'm attaching my patch, but only to avoid
starting from zero the next time. It needs more work.
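
One possible way to block the reader when the executor is full is sketched below. This is not the attached patch and not metrics-lib code: the class name `BoundedParserPool`, its methods, and the `maxPending` parameter are hypothetical. It assumes a `Semaphore` is an acceptable way to cap how many parse tasks are queued or running at once.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: a fixed thread pool that blocks submitters when full. */
public class BoundedParserPool {

  private final ExecutorService executor;
  private final Semaphore permits;

  /** maxPending caps the number of tasks that are queued or running. */
  public BoundedParserPool(int threads, int maxPending) {
    this.executor = Executors.newFixedThreadPool(threads);
    this.permits = new Semaphore(maxPending);
  }

  /** Blocks the calling (reader) thread until a slot is free, then submits. */
  public void submit(Runnable parseTask) throws InterruptedException {
    permits.acquire();
    try {
      executor.execute(() -> {
        try {
          parseTask.run();
        } finally {
          permits.release();  // free the slot once parsing has finished
        }
      });
    } catch (RuntimeException e) {
      permits.release();  // task was rejected, so give the slot back
      throw e;
    }
  }

  /** Stops accepting tasks and waits for the submitted ones to finish. */
  public void shutdownAndWait() throws InterruptedException {
    executor.shutdown();
    executor.awaitTermination(1, TimeUnit.HOURS);
  }

  public static void main(String[] args) throws InterruptedException {
    BoundedParserPool pool = new BoundedParserPool(8, 16);
    AtomicInteger parsed = new AtomicInteger();
    for (int i = 0; i < 1000; i++) {
      byte[] contents = new byte[1024];  // placeholder for one tarball entry
      pool.submit(() -> parsed.addAndGet(contents.length > 0 ? 1 : 0));
    }
    pool.shutdownAndWait();
    System.out.println("parsed " + parsed.get() + " entries");
  }
}
```

In this sketch the reader still reads one entry before `submit` blocks, so at most `maxPending + 1` entries are held in memory at a time. A `ThreadPoolExecutor` with a bounded queue and `CallerRunsPolicy` would be another option, at the cost of the reader thread doing some of the parsing itself.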
|| '''separate parser threads''' || '''read `.tar` file (seconds)''' || '''parse `.tar` file (seconds)''' || '''read `.tar.xz` file (seconds)''' || '''parse `.tar.xz` file (seconds)''' ||
|| none (current code) || 35 || 159 || 9 || 162 ||
|| 2 || 36 || 42 || 8 || 126 ||
|| 4 || 41 || 13 || 7 || 96 ||
|| 8 || 42 || 11 || 6 || 35 ||
|| 16 || 41 || 11 || 10 || 28 ||
|| 32 || 45 || 13 || 7 || 34 ||
|| 64 || 41 || 13 || 6 || 38 ||
These results show that 4 threads speed up the parse time for `.tar` files
by a '''factor of 12''', after which there's no visible improvement, and 8
threads speed up the parse time for `.tar.xz` files by a '''factor of
4.6'''.
Just from these numbers I'd suggest using 8 threads by default and making
this number configurable for the application. But: needs more work.
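
For reference, the two factors can be recomputed from the parse times in the table above. The class and method names below are made up for illustration only.

```java
import java.util.Locale;

/** Hypothetical helper that recomputes the speed-up factors from the table. */
public class ParseSpeedup {

  /** Speed-up factor: baseline seconds divided by multi-threaded seconds. */
  public static double speedup(double baselineSeconds, double threadedSeconds) {
    return baselineSeconds / threadedSeconds;
  }

  public static void main(String[] args) {
    // Parse times in seconds, copied from the table above.
    System.out.println(String.format(Locale.ROOT,
        ".tar, none vs. 4 threads: %.1f", speedup(159, 13)));
    System.out.println(String.format(Locale.ROOT,
        ".tar.xz, none vs. 8 threads: %.1f", speedup(162, 35)));
  }
}
```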
My recommendation would be to look more into making parsing multi-threaded
and save #17831 for later. It seems like parsing is the lower-hanging
fruit.
Note that reading the same tarball in extracted form using the current
code took 271 seconds. In that case the lower-hanging fruit might be I/O
improvements, not multi-threaded parsing. But my hope is that not many
applications extract tarballs containing over 800,000 files and read them
using `DescriptorReader`, especially not if they could as well read the
tarball directly.
Suggestions welcome! Otherwise I might pick this up again and move it
forward whenever there's time.
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/21751>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online