[metrics-bugs] #21933 [Metrics/Onionoo]: Fix deserialization of UTF-8 characters in details statuses and documents
Tor Bug Tracker & Wiki
blackhole at torproject.org
Thu Apr 13 09:45:04 UTC 2017
#21933: Fix deserialization of UTF-8 characters in details statuses and documents
---------------------------------+--------------------------
Reporter: karsten | Owner: metrics-team
Type: defect | Status: new
Priority: Medium | Milestone:
Component: Metrics/Onionoo | Version:
Severity: Normal | Keywords:
Actual Points: | Parent ID:
Points: | Reviewer:
Sponsor: |
---------------------------------+--------------------------
While looking into the encoding issue of different Onionoo instances
producing different contact string encodings (#15813), I tracked down a
somewhat related issue with UTF-8 characters all being converted to `?`.
The issue is related to how we're avoiding to store UTF-8 characters in
details statuses and details documents and instead escaping those
characters. We're doing this correctly for the serialization part but
incorrectly for the deserialization part.
We have two choices here. We could either give up on the escaping part
and just store UTF-8 characters directly. Or we could fix the
deserialization part. I have a fix for the latter and ran out of time for
the former, but maybe the former would be the better fix. I'm including
my fix here anyway:
{{{
diff --git a/src/main/java/org/torproject/onionoo/docs/DocumentStore.java
b/src/main/java/org/torproject/onionoo/docs/DocumentStore.java
index 39d6271..b6b1c4c 100644
--- a/src/main/java/org/torproject/onionoo/docs/DocumentStore.java
+++ b/src/main/java/org/torproject/onionoo/docs/DocumentStore.java
@@ -496,22 +496,25 @@ public class DocumentStore {
if (!parse) {
return this.retrieveUnparsedDocumentFile(documentType,
documentString);
- } else if (documentType.equals(DetailsDocument.class)
- || documentType.equals(BandwidthDocument.class)
+ } else if (documentType.equals(BandwidthDocument.class)
|| documentType.equals(WeightsDocument.class)
|| documentType.equals(ClientsDocument.class)
|| documentType.equals(UptimeDocument.class)) {
return this.retrieveParsedDocumentFile(documentType,
documentString);
+ } else if (documentType.equals(DetailsStatus.class)
+ || documentType.equals(DetailsDocument.class)) {
+ if (documentType.equals(DetailsStatus.class)) {
+ documentString = "{" + documentString + "}";
+ }
+ documentString = StringUtils.replace(documentString, "\\u",
"\\\\u");
+ return this.retrieveParsedDocumentFile(documentType,
documentString);
} else if (documentType.equals(BandwidthStatus.class)
|| documentType.equals(WeightsStatus.class)
|| documentType.equals(ClientsStatus.class)
|| documentType.equals(UptimeStatus.class)
|| documentType.equals(UpdateStatus.class)) {
return this.retrieveParsedStatusFile(documentType, documentString);
- } else if (documentType.equals(DetailsStatus.class)) {
- return this.retrieveParsedDocumentFile(documentType, "{"
- + documentString + "}");
} else {
log.error("Parsing is not supported for type "
+ documentType.getName() + ".");
}}}
The main difference here is the `StringUtils.replace()` part. Without
this line (so, current master) we would pass a string containing `\uxxxx`
to `Gson.fromJson()` which would unescape it and turn it into the
corresponding UTF-8 character. So far so good. But when we would later
write this status or document back to disk, `DocumentStore#writeToFile`
will write these bytes to disk as `US-ASCII`, and that will replace all
UTF-8 characters with `?`.
The patch fixes this by replacing `\uxxxx` in the file content with
`\\uxxxx` which `Gson.fromJson()` will not consider an escaped UTF-8
character. We do have code in place that reverses this double-escaping,
see `DetailsStatus#unescapeJson`. So, the patch fixes the problem by
keeping things escaped until they are used.
Again, I think the cleaner fix would be to give up on escaping UTF-8
characters and just switch to UTF-8. The part that might make this a
little harder is that we'll have to make sure that this works correctly
for existing files. And if it does not, we'll need to special-case those
somehow. But maybe the patch above helps to come up with this cleaner
fix.
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/21933>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
More information about the metrics-bugs
mailing list