[metrics-bugs] #21932 [Metrics/metrics-lib]: Stop relying on the platform's default charset

Thu Apr 13 09:13:16 UTC 2017

#21932: Stop relying on the platform's default charset
-------------------------------------+--------------------------
     Reporter:  karsten              |      Owner:  metrics-team
         Type:  defect               |     Status:  new
     Priority:  Medium               |  Milestone:
    Component:  Metrics/metrics-lib  |    Version:
     Severity:  Normal               |   Keywords:
Actual Points:                       |  Parent ID:
       Points:                       |   Reviewer:
      Sponsor:                       |
-------------------------------------+--------------------------
 While looking into the encoding issue of different Onionoo instances
 producing different contact string encodings (#15813), I tracked down this
 issue to metrics-lib's `ServerDescriptorImpl.java` class and its usage of
 `new String(byte[])`.

 The issue is that this constructor decodes the bytes using "the platform's
 default charset".  It turns out that the main Onionoo instance uses
 `US-ASCII` as default charset (`Charset.defaultCharset()`) and the mirror
 uses `UTF-8`.  (Interestingly, the mirror only uses `UTF-8` for commands
 executed by cron and uses `US-ASCII` for commands executed directly by my
 user.  So the default would change depending on whether Onionoo's updater
 was started automatically after a reboot or manually by the user, which
 made debugging just a bit more challenging!)

 Long story short, we should not rely on the platform's default charset
 when converting bytes to strings or vice versa, but we should explicitly
 specify the charset we want!  We just need to pick one.
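
 To illustrate the difference, here is a minimal sketch that decodes the
 same bytes once with the platform default and once with explicitly named
 charsets.  The class name `DefaultCharsetDemo`, the helper `decode`, and
 the sample contact line are made up for this example and are not part of
 metrics-lib.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DefaultCharsetDemo {

  /** Decodes bytes with an explicitly named charset (hypothetical helper). */
  static String decode(byte[] bytes, Charset charset) {
    return new String(bytes, charset);
  }

  public static void main(String[] args) {
    // Made-up contact line with one non-ASCII character (u-umlaut),
    // which UTF-8 encodes as the two bytes 0xC3 0xBC.
    byte[] raw = "contact J\u00fcrgen".getBytes(StandardCharsets.UTF_8);

    // Result depends on the JVM's default charset, i.e. on the environment.
    System.out.println("default (" + Charset.defaultCharset() + "): "
        + new String(raw));

    // Results are the same on every platform.
    System.out.println("US-ASCII: " + decode(raw, StandardCharsets.US_ASCII));
    System.out.println("UTF-8:    " + decode(raw, StandardCharsets.UTF_8));
  }
}
```

 With `US-ASCII`, the two non-ASCII bytes cannot be decoded and come out as
 replacement characters, whereas `UTF-8` restores the original string.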

 Somewhat related, I ran an analysis of character encodings in relay server
 descriptors two weeks ago.  Here's what I found:

 {{{
 $ wget https://collector.torproject.org/archive/relay-descriptors/server-descriptors/server-descriptors-2017-02.tar.xz
 $ tar xf server-descriptors-2017-02.tar.xz
 $ find server-descriptors-2017-02 -type f -exec file --mime {} \; > mimes
 $ cut -d" " -f3 mimes | sort | uniq -c
      68 charset=iso-8859-1
  466900 charset=us-ascii
    1145 charset=utf-8
 }}}

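 I'd say let's just treat server descriptors as UTF-8 encoded.  US-ASCII is
 a subset of UTF-8, so this decodes all but the 68 descriptors that `file`
 reports as ISO-8859-1 correctly.  In this case, the following patch will
 resolve the issue for server descriptors: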

 {{{
 diff --git a/src/main/java/org/torproject/descriptor/impl/ServerDescriptorImpl.java b/src/main/java/org/torproject/descriptor/impl/ServerDescriptorImpl.java
 index 309cad4..2381378 100644
 --- a/src/main/java/org/torproject/descriptor/impl/ServerDescriptorImpl.java
 +++ b/src/main/java/org/torproject/descriptor/impl/ServerDescriptorImpl.java
 @@ -8,6 +8,7 @@ import org.torproject.descriptor.DescriptorParseException;
  import org.torproject.descriptor.ServerDescriptor;

  import java.io.UnsupportedEncodingException;
 +import java.nio.charset.StandardCharsets;
  import java.security.MessageDigest;
  import java.security.NoSuchAlgorithmException;
  import java.util.ArrayList;
 @@ -56,8 +57,8 @@ public abstract class ServerDescriptorImpl extends DescriptorImpl
    }

    private void parseDescriptorBytes() throws DescriptorParseException {
 -    Scanner scanner = new Scanner(new String(this.rawDescriptorBytes))
 -        .useDelimiter("\n");
 +    Scanner scanner = new Scanner(new String(this.rawDescriptorBytes,
 +        StandardCharsets.UTF_8)).useDelimiter("\n");
      String nextCrypto = "";
      List<String> cryptoLines = null;
      while (scanner.hasNext()) {
 }}}
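
 One consequence worth keeping in mind: `new String(byte[], Charset)` never
 fails on malformed input but substitutes the replacement character U+FFFD.
 The sketch below shows this for ISO-8859-1 bytes and contrasts it with a
 strict `CharsetDecoder` check.  The class and method names and the sample
 contact line are made up for illustration and are not part of metrics-lib.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class MalformedInputDemo {

  /** Lenient decoding as in the patch: malformed bytes become U+FFFD. */
  static String decodeLenient(byte[] bytes) {
    return new String(bytes, StandardCharsets.UTF_8);
  }

  /** Strict check: returns false if the bytes are not valid UTF-8. */
  static boolean isValidUtf8(byte[] bytes) {
    try {
      StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Made-up contact line; ISO-8859-1 encodes the u-umlaut as the single
    // byte 0xFC, which is not valid UTF-8.
    byte[] latin1 = "contact J\u00fcrgen".getBytes(StandardCharsets.ISO_8859_1);

    System.out.println("valid UTF-8? " + isValidUtf8(latin1));
    System.out.println("lenient:     " + decodeLenient(latin1));
  }
}
```

 Whether silently replacing such bytes is acceptable for the few
 ISO-8859-1 descriptors, or whether they should be detected and handled
 separately, is a design decision this sketch does not settle.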

 If this sounds like a reasonable plan, we should look into other places in
 the code where we use methods relying on the platform's default charset
 and explicitly specify a charset there, too.
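
 Typical candidates in the Java standard library are `String.getBytes()`
 and `new InputStreamReader(InputStream)`, both of which fall back to the
 default charset when none is given.  The following sketch shows the
 explicit variants; the class and method names are made up and do not refer
 to actual metrics-lib code.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetDemo {

  /** Encodes with an explicit charset instead of String.getBytes(). */
  static byte[] toBytes(String text) {
    return text.getBytes(StandardCharsets.UTF_8);
  }

  /** Reads the first line with an explicit charset instead of
   * new InputStreamReader(in), which would use the default charset. */
  static String readFirstLine(InputStream in) {
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(in, StandardCharsets.UTF_8))) {
      return reader.readLine();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // Made-up input: two lines, the first with a non-ASCII character.
    byte[] bytes = toBytes("contact J\u00fcrgen\nplatform Tor");
    System.out.println(readFirstLine(new ByteArrayInputStream(bytes)));
  }
}
```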

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/21932>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online