[metrics-bugs] #22140 [Metrics/metrics-lib]: Store raw descriptor contents as UTF-8 encoded Strings rather than byte[]

Wed May 3 14:06:52 UTC 2017

#22140: Store raw descriptor contents as UTF-8 encoded Strings rather than byte[]
-------------------------------------+--------------------------
     Reporter:  karsten              |      Owner:  metrics-team
         Type:  defect               |     Status:  new
     Priority:  Medium               |  Milestone:
    Component:  Metrics/metrics-lib  |    Version:
     Severity:  Normal               |   Keywords:
Actual Points:                       |  Parent ID:
       Points:                       |   Reviewer:
      Sponsor:                       |
-------------------------------------+--------------------------
 When we're reading descriptors from disk we're storing raw descriptor
 contents as `byte[]` and returning them in
 `Descriptor#getRawDescriptorBytes()`.  Also, we're storing partial raw
 descriptor contents in `DirSourceEntry#getDirSourceEntryBytes()` and
 `NetworkStatusEntry#getStatusEntryBytes()`.

 Storing `byte[]` can be useful when writing raw contents back to disk,
 because we can be sure that contents are exactly the same as when we read
 them from disk.  Namely, we don't have to worry about character encoding.
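
 To illustrate the encoding concern, here is a minimal sketch (the class name
 `RoundTripSketch` and the sample lines are hypothetical, not taken from
 metrics-lib).  It shows that pure-ASCII content survives a decode/encode
 round trip through a UTF-8 `String`, whereas a byte sequence that is not
 valid UTF-8 does not, because Java substitutes U+FFFD for malformed input:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripSketch {

  /** Decodes the given bytes as UTF-8 and encodes the result as UTF-8 again. */
  public static byte[] roundTrip(byte[] raw) {
    String decoded = new String(raw, StandardCharsets.UTF_8);
    return decoded.getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // Hypothetical ASCII-only line, similar in shape to a descriptor line.
    byte[] ascii = "router example 127.0.0.1 9001 0 0\n"
        .getBytes(StandardCharsets.US_ASCII);
    // 0xE9 is a letter in ISO-8859-1 but not a valid UTF-8 sequence by itself.
    byte[] latin1 = new byte[] {'c', 'a', 'f', (byte) 0xE9, '\n'};

    System.out.println(Arrays.equals(ascii, roundTrip(ascii)));   // true
    System.out.println(Arrays.equals(latin1, roundTrip(latin1))); // false
  }
}
```

 So a switch to `String` is lossless only for input that is valid UTF-8,
 which includes all ASCII-only content.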

 However, library support for handling (large) `byte[]` content is limited.
 Today I looked into ways to handle large descriptor files (#20395), and I
 found that most libraries work best with character streams, not with byte
 streams.  And I only briefly considered implementing Knuth-Morris-Pratt
 myself to search in byte arrays...
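
 As a rough illustration of why strings are more convenient here, the
 following sketch splits a text containing several concatenated descriptors
 using nothing but `String#indexOf`, for which `byte[]` has no counterpart in
 the standard library.  The class name `SplitSketch`, the `router ` start
 token, and the sample input are assumptions for illustration only; this is
 not how metrics-lib splits files:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {

  /**
   * Splits text into chunks, starting a new chunk wherever a line begins
   * with the given token.  Each chunk keeps its trailing newline.
   */
  public static List<String> split(String text, String token) {
    List<String> parts = new ArrayList<>();
    String delimiter = "\n" + token;
    int start = 0;
    while (start < text.length()) {
      int next = text.indexOf(delimiter, start);
      // The newline before the token still belongs to the current chunk.
      int end = next < 0 ? text.length() : next + 1;
      parts.add(text.substring(start, end));
      start = end;
    }
    return parts;
  }

  public static void main(String[] args) {
    String text = "router a\nbandwidth 1\nrouter b\nbandwidth 2\n";
    for (String part : split(text, "router ")) {
      System.out.print("--- chunk ---\n" + part);
    }
  }
}
```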

 So, I looked at the four main code bases using metrics-lib (CollecTor,
 ExoneraTor, metrics-web, Onionoo) to see which of them use raw descriptor
 bytes and how.  After all, if we're not using them ourselves, we might as
 well get rid of them.  Here's what I found:
  1. Onionoo's `DescriptorQueue` uses raw bytes to keep statistics on
 processed bytes, which seems like something that would still work
 reasonably well with character lengths.
  2. '''CollecTor's `DescriptorPersistence` indeed uses raw descriptor
 bytes to write descriptors obtained from another CollecTor instance to
 disk.  We'd have to change that.'''
  3. CollecTor's `VotePersistence` uses raw descriptor bytes to calculate
 the digest of votes, which is something we should implement in metrics-lib
 directly (#20333).
  4. ExoneraTor's `ExoneraTorDatabaseImporter` imports raw status entry
 bytes into the database, but we know that those are just ASCII, so this
 would work just as well with UTF-8 strings.
  5. metrics-web's `RelayDescriptorDatabaseImporter` also imports raw
 status entry bytes into the database, which works with strings for the
 same reason as above.
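
 Items 1, 4, and 5 rest on the fact that, for ASCII-only content, the number
 of characters equals the number of UTF-8 bytes, and the two only diverge
 once non-ASCII characters appear.  A small sketch (class name and sample
 strings are hypothetical):

```java
import java.nio.charset.StandardCharsets;

public class LengthSketch {

  /** Returns the number of bytes the string occupies when encoded as UTF-8. */
  public static int utf8ByteLength(String s) {
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  public static void main(String[] args) {
    // ASCII only: character count and byte count agree.
    String ascii = "s Fast Running Stable Valid\n";
    System.out.println(ascii.length() == utf8ByteLength(ascii));

    // One non-ASCII character (u with diaeresis) takes two bytes in UTF-8,
    // so the byte count exceeds the character count.
    String nonAscii = "contact J\u00fcrgen\n";
    System.out.println(nonAscii.length() < utf8ByteLength(nonAscii));
  }
}
```

 Statistics based on character lengths would therefore be exact for ASCII
 input and a slight undercount of bytes otherwise.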

 I might have overlooked something.

 But if not, CollecTor's `DescriptorPersistence` is the only place where we
 really need `byte[]` rather than `String`.  If we can change that, we can
 switch from `Descriptor#getRawDescriptorBytes()` to
 `Descriptor#getRawDescriptor()` and deprecate the former (and do the same
 with the other two partial contents).
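
 A minimal sketch of what that transition could look like follows.  The
 nested `Descriptor` interface below is a hypothetical stand-in, not the real
 metrics-lib interface, and `StringBackedDescriptor` is an invented name; only
 the two method names come from this ticket.  The deprecated accessor is
 derived from the stored string, assuming UTF-8:

```java
import java.nio.charset.StandardCharsets;

public class DeprecationSketch {

  /** Hypothetical stand-in for the metrics-lib Descriptor interface. */
  public interface Descriptor {

    /** Proposed accessor: raw descriptor contents as a String. */
    String getRawDescriptor();

    /** Existing accessor, kept during a transition period. */
    @Deprecated
    byte[] getRawDescriptorBytes();
  }

  /** Hypothetical implementation that stores only the decoded string. */
  public static class StringBackedDescriptor implements Descriptor {

    private final String raw;

    public StringBackedDescriptor(byte[] bytesFromDisk) {
      // Assumption: contents are valid UTF-8 (true for ASCII-only input).
      this.raw = new String(bytesFromDisk, StandardCharsets.UTF_8);
    }

    @Override
    public String getRawDescriptor() {
      return this.raw;
    }

    @Override
    @Deprecated
    public byte[] getRawDescriptorBytes() {
      // Re-encode on demand instead of keeping a second copy in memory.
      return this.raw.getBytes(StandardCharsets.UTF_8);
    }
  }

  public static void main(String[] args) {
    byte[] fromDisk = "router example\n".getBytes(StandardCharsets.US_ASCII);
    Descriptor descriptor = new StringBackedDescriptor(fromDisk);
    System.out.print(descriptor.getRawDescriptor());
  }
}
```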

 And then we can resume #20395 with a much more complete toolbox.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/22140>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online