[metrics-bugs] #22140 [Metrics/metrics-lib]: Store raw descriptor contents as UTF-8 encoded Strings rather than byte[]
Tor Bug Tracker & Wiki
blackhole at torproject.org
Wed May 3 14:06:52 UTC 2017
#22140: Store raw descriptor contents as UTF-8 encoded Strings rather than byte[]
-------------------------------------+--------------------------
Reporter: karsten | Owner: metrics-team
Type: defect | Status: new
Priority: Medium | Milestone:
Component: Metrics/metrics-lib | Version:
Severity: Normal | Keywords:
Actual Points: | Parent ID:
Points: | Reviewer:
Sponsor: |
-------------------------------------+--------------------------
When we read descriptors from disk, we store raw descriptor contents as
`byte[]` and return them in `Descriptor#getRawDescriptorBytes()`. We also
store partial raw descriptor contents and return them in
`DirSourceEntry#getDirSourceEntryBytes()` and
`NetworkStatusEntry#getStatusEntryBytes()`.
Storing `byte[]` can be useful when writing raw contents back to disk,
because we can be sure that contents are exactly the same as when we read
them from disk. Namely, we don't have to worry about character encoding.
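
To illustrate the encoding concern, here is a minimal Java sketch of what can happen when raw bytes are decoded as UTF-8 and encoded again. The class and method names are made up for this example and are not part of metrics-lib. It relies on the documented behavior of `new String(byte[], Charset)`, which replaces malformed input with the replacement character.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper, not part of metrics-lib. */
final class RawBytesRoundTrip {

  /** Decodes raw bytes as UTF-8 and encodes the resulting String again. */
  static byte[] roundTrip(byte[] raw) {
    String decoded = new String(raw, StandardCharsets.UTF_8);
    return decoded.getBytes(StandardCharsets.UTF_8);
  }

  /** Returns true if decoding and re-encoding reproduces the input exactly. */
  static boolean survivesRoundTrip(byte[] raw) {
    return Arrays.equals(raw, roundTrip(raw));
  }

  public static void main(String[] args) {
    // Pure ASCII content is unchanged by the round trip.
    byte[] ascii = "router example\n".getBytes(StandardCharsets.US_ASCII);
    System.out.println(survivesRoundTrip(ascii)); // prints true

    // 0xE9 is "e with acute" in ISO-8859-1 but not valid UTF-8 on its own,
    // so it is replaced with U+FFFD and the bytes no longer match.
    byte[] invalid = { 'x', (byte) 0xE9, '\n' };
    System.out.println(survivesRoundTrip(invalid)); // prints false
  }
}
```

So a `String` only reproduces the on-disk bytes when the input is valid UTF-8 to begin with, which is the property `byte[]` gives us for free.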
However, support for handling (large) `byte[]` content is limited. Today
I looked into ways to handle large descriptor files (#20395), and I found
that most libraries work best with character streams, not with byte
streams. And I only briefly considered implementing Knuth-Morris-Pratt
myself...
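
As a rough illustration of why character data is easier to work with: finding descriptor boundaries in a `String` needs nothing beyond `String#indexOf`, whereas `byte[]` has no built-in substring search. The sketch below is only an assumption about how such splitting could look. The class name, the method, and the choice of start keyword are hypothetical and not taken from metrics-lib.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of metrics-lib. */
final class DescriptorSplitter {

  /**
   * Splits file contents into parts, starting a new part at every line
   * that begins with the given keyword (other than the very first line).
   */
  static List<String> split(String contents, String startKeyword) {
    List<String> parts = new ArrayList<>();
    String delimiter = "\n" + startKeyword;
    int start = 0;
    while (start < contents.length()) {
      int next = contents.indexOf(delimiter, start);
      // Keep the newline with the preceding part; the keyword starts the next.
      int end = next < 0 ? contents.length() : next + 1;
      parts.add(contents.substring(start, end));
      start = end;
    }
    return parts;
  }

  public static void main(String[] args) {
    String file = "router a\nbandwidth 1\nrouter b\nbandwidth 2\n";
    for (String part : split(file, "router ")) {
      System.out.print("--- part ---\n" + part);
    }
  }
}
```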
So, I looked at the four main code bases using metrics-lib (CollecTor,
ExoneraTor, metrics-web, Onionoo) to see which of them use raw descriptor
bytes and how. After all, if we're not using them ourselves, we might as
well get rid of them. Here's what I found:
1. Onionoo's `DescriptorQueue` uses raw bytes to keep statistics on
processed bytes, which seems like something that would still work
reasonably well with character lengths.
2. '''CollecTor's `DescriptorPersistence` indeed uses raw descriptor
bytes to write descriptors obtained from another CollecTor instance to
disk. We'd have to change that.'''
3. CollecTor's `VotePersistence` uses raw descriptor bytes to calculate
the digest of votes, which is something we should implement in metrics-lib
directly (#20333).
4. ExoneraTor's `ExoneraTorDatabaseImporter` imports raw status entry
bytes into the database, but we know that those are just ASCII, so this
would work just as well with UTF-8 strings.
5. metrics-web's `RelayDescriptorDatabaseImporter` also imports raw
status entry bytes into the database, which works with strings for the
same reason as above.
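
Usages 1 and 3 from the list above could be sketched as follows when starting from a `String`. This is only an illustration with made-up names. In particular, the ticket does not say which digest algorithm or which portion of a vote is hashed, so SHA-1 over the whole string is a placeholder assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper, not part of metrics-lib, CollecTor, or Onionoo. */
final class StringBackedUsages {

  /** Statistics based on character count (what a String offers directly). */
  static int characterLength(String raw) {
    return raw.length();
  }

  /** Statistics based on the UTF-8 encoded size, for comparison. */
  static int utf8ByteLength(String raw) {
    return raw.getBytes(StandardCharsets.UTF_8).length;
  }

  /** Hex-encoded SHA-1 over the UTF-8 bytes (algorithm is an assumption). */
  static String sha1Hex(String raw) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-1");
      byte[] digest = md.digest(raw.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (byte b : digest) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // Every Java platform is required to support SHA-1.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String ascii = "r example\n";
    String nonAscii = "contact Zo\u00eb\n";
    // Equal for ASCII-only content, different once non-ASCII appears.
    System.out.println(characterLength(ascii) + " " + utf8ByteLength(ascii));
    System.out.println(characterLength(nonAscii) + " " + utf8ByteLength(nonAscii));
    System.out.println(sha1Hex(ascii));
  }
}
```

For ASCII-only content the two lengths are identical, so the statistics would not change. The digest matches one computed from the original bytes only as long as the `String` round-trips to exactly those bytes.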
I might have overlooked something.
But if not, CollecTor's `DescriptorPersistence` is the only place where we
really need `byte[]` rather than `String`. If we can change that, we can
switch from `Descriptor#getRawDescriptorBytes()` to
`Descriptor#getRawDescriptor()` and deprecate the former (and do the same
with the other two partial contents).
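
A minimal sketch of what that transition could look like, assuming the new method is backed by a `String` and the old one is kept as a deprecated bridge. The interface and class below are hypothetical stand-ins for illustration, not the actual metrics-lib `Descriptor` interface.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for the real Descriptor interface. */
interface RawDescriptorSketch {

  /** New accessor: raw descriptor contents as a String. */
  String getRawDescriptor();

  /**
   * Old accessor, kept for existing callers.
   *
   * @deprecated Use {@link #getRawDescriptor()} instead.
   */
  @Deprecated
  default byte[] getRawDescriptorBytes() {
    // Assumption: the String was decoded as UTF-8, so encode it the same way.
    return getRawDescriptor().getBytes(StandardCharsets.UTF_8);
  }
}

/** Trivial implementation used only to exercise the sketch. */
final class InMemoryDescriptor implements RawDescriptorSketch {

  private final String rawDescriptor;

  InMemoryDescriptor(String rawDescriptor) {
    this.rawDescriptor = rawDescriptor;
  }

  @Override
  public String getRawDescriptor() {
    return rawDescriptor;
  }
}
```

Callers such as the database importers could switch to `getRawDescriptor()` right away, while code that still needs bytes keeps compiling with a deprecation warning.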
And then we can resume #20395 with a much more complete toolbox.
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/22140>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online