[metrics-bugs] #21932 [Metrics/metrics-lib]: Stop relying on the platform's default charset
Tor Bug Tracker & Wiki
blackhole at torproject.org
Thu Apr 13 09:13:16 UTC 2017
#21932: Stop relying on the platform's default charset
-------------------------------------+--------------------------
Reporter: karsten                    | Owner: metrics-team
Type: defect                         | Status: new
Priority: Medium                     | Milestone:
Component: Metrics/metrics-lib       | Version:
Severity: Normal                     | Keywords:
Actual Points:                       | Parent ID:
Points:                              | Reviewer:
Sponsor:                             |
-------------------------------------+--------------------------
While looking into the encoding issue of different Onionoo instances
producing different contact string encodings (#15813), I tracked this
issue down to metrics-lib's `ServerDescriptorImpl` class and its use of
`new String(byte[])`.
The issue is that this constructor uses "the platform's default
charset". It turns out that the main Onionoo instance uses `US-ASCII` as
its default charset (`Charset.defaultCharset()`) and the mirror uses
`UTF-8`. (Interestingly, the mirror only uses `UTF-8` for commands
executed by cron and uses `US-ASCII` for commands executed directly by my
user. The default would therefore change depending on whether Onionoo's
updater was started automatically after a reboot or manually by the user,
which made debugging just a bit more challenging!)
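To illustrate the effect, here is a minimal sketch that decodes the same bytes under both charsets seen on the two instances. The class name `DefaultCharsetDemo`, the helper `decode`, and the sample bytes are hypothetical and not taken from metrics-lib; passing the charset explicitly simulates what `new String(byte[])` would do under each platform default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of metrics-lib.
final class DefaultCharsetDemo {

  /** Decodes bytes with an explicitly chosen charset. */
  static String decode(byte[] bytes, Charset charset) {
    return new String(bytes, charset);
  }

  public static void main(String[] args) {
    // Made-up sample: "Jos" followed by U+00E9, which UTF-8 encodes as
    // the two bytes 0xC3 0xA9.
    byte[] contact = {'J', 'o', 's', (byte) 0xC3, (byte) 0xA9};

    // This value depends on the JVM and its environment.
    System.out.println("default charset: " + Charset.defaultCharset());

    // Under US-ASCII each non-ASCII byte becomes U+FFFD, so the result
    // has 5 chars; under UTF-8 the two bytes form one char, so it has 4.
    String ascii = decode(contact, StandardCharsets.US_ASCII);
    String utf8 = decode(contact, StandardCharsets.UTF_8);
    System.out.println("US-ASCII length: " + ascii.length());
    System.out.println("UTF-8 length:    " + utf8.length());
  }
}
```

The two results differ even though the input bytes are identical, which would be consistent with the two instances producing different contact strings.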
Long story short, we should not rely on the platform's default charset
when converting bytes to strings or vice versa, but we should explicitly
specify the charset we want! We just need to pick one.
Somewhat related, I ran an analysis of character encodings in relay server
descriptors two weeks ago. Here's what I found:
{{{
$ wget https://collector.torproject.org/archive/relay-descriptors/server-descriptors/server-descriptors-2017-02.tar.xz
$ tar xf server-descriptors-2017-02.tar.xz
$ find server-descriptors-2017-02 -type f -exec file --mime {} \; > mimes
$ cut -d" " -f3 mimes | sort | uniq -c
68 charset=iso-8859-1
466900 charset=us-ascii
1145 charset=utf-8
}}}
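A similar check could be done from Java with strict decoders. The following is only a rough sketch: the class `EncodingGuess` and its methods are hypothetical, and `file --mime` applies more heuristics than this does. Since every byte sequence decodes under ISO-8859-1, the sketch reports `other` rather than guessing a specific legacy charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of metrics-lib.
final class EncodingGuess {

  /** Returns true if the bytes decode under the charset without errors. */
  static boolean decodesCleanly(byte[] bytes, Charset charset) {
    try {
      // REPORT makes the decoder throw instead of substituting U+FFFD.
      charset.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  /** Returns the strictest of the tested charsets that fits the bytes. */
  static String guess(byte[] bytes) {
    if (decodesCleanly(bytes, StandardCharsets.US_ASCII)) {
      return "us-ascii";
    }
    if (decodesCleanly(bytes, StandardCharsets.UTF_8)) {
      return "utf-8";
    }
    return "other";
  }
}
```

Calling `EncodingGuess.guess(bytes)` on the raw bytes of each descriptor file and counting the results would give a breakdown comparable to the one above, though not necessarily identical numbers.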
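I'd say let's just pretend that server descriptors are UTF-8 encoded.
US-ASCII is a subset of UTF-8, so this decodes everything correctly except
the 68 descriptors that `file` reported as ISO-8859-1. In this case, the
following patch will resolve the issue for server descriptors: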
{{{
diff --git a/src/main/java/org/torproject/descriptor/impl/ServerDescriptorImpl.java b/src/main/java/org/torproject/descriptor/impl/ServerDescriptorImpl.java
index 309cad4..2381378 100644
--- a/src/main/java/org/torproject/descriptor/impl/ServerDescriptorImpl.java
+++ b/src/main/java/org/torproject/descriptor/impl/ServerDescriptorImpl.java
@@ -8,6 +8,7 @@ import org.torproject.descriptor.DescriptorParseException;
 import org.torproject.descriptor.ServerDescriptor;
 
 import java.io.UnsupportedEncodingException;
+import java.nio.charset.StandardCharsets;
 import java.security.MessageDigest;
 import java.security.NoSuchAlgorithmException;
 import java.util.ArrayList;
@@ -56,8 +57,8 @@ public abstract class ServerDescriptorImpl extends DescriptorImpl
   }
 
   private void parseDescriptorBytes() throws DescriptorParseException {
-    Scanner scanner = new Scanner(new String(this.rawDescriptorBytes))
-        .useDelimiter("\n");
+    Scanner scanner = new Scanner(new String(this.rawDescriptorBytes,
+        StandardCharsets.UTF_8)).useDelimiter("\n");
     String nextCrypto = "";
     List<String> cryptoLines = null;
     while (scanner.hasNext()) {
}}}
If this sounds like a reasonable plan, we should look into other places in
the code where we use methods relying on the platform's default charset
and explicitly specify a charset there, too.
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/21932>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online