[tor-dev] Proposal 285: Directory documents should be standardized as UTF-8
Nick Mathewson
nickm at alum.mit.edu
Tue Jan 9 17:34:06 UTC 2018
Hi, Teor, and sorry for the long delay! You had a lot of good
questions on this proposal, and I didn't know how to answer them all.
So in hopes of making progress here, I'm taking wild guesses and
asking for help in making the wild guesses better :)
On Mon, Nov 13, 2017 at 5:28 PM, teor <teor2345 at gmail.com> wrote:
> On 14 Nov 2017, at 05:51, Nick Mathewson <nickm at torproject.org> wrote:
>
> Filename: 285-utf-8.txt
> Title: Directory documents should be standardized as UTF-8
> Author: Nick Mathewson
> Created: 13 November 2017
> Status: Open
>
> 1. Summary and motivation
>
> People frequently want to include non-ASCII text in their router
> descriptors. The Contact line is a favorite place to do this, but in
> principle the platform line would also be pretty logical.
>
> Unfortunately, there's no specified way to encode non-ASCII in our
> directory documents.
>
> Fortunately, almost everybody who does it, uses UTF-8 anyway.
>
>
> How many current descriptors will be rejected as non-UTF-8?
I think that when last I checked, the number was something like 3.
> As we move towards Rust support in Tor, we gain another motivation
> for standardizing on UTF-8, since Rust's native strings strongly prefer
> UTF-8.
>
> So, in this proposal, we describe a migration path to having all
> directory documents be fully UTF-8.
>
> 2. Proposal
>
> First, we should have Tor relays reject ContactInfo lines (and any
> other lines copied directly into router descriptors) that are not
> UTF-8.
>
>
> How do we define UTF-8?
I tried to do so as follows:
We define the allowable set of UTF-8 as:
* Encoding the codepoints U+0001 through U+10FFFF,
* but excluding the codepoints U+D800 through U+DFFF,
* each encoded with the shortest possible encoding,
* without any BOM.
Are there other restrictions we should make? If so, how should we phrase them?
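
To make the definition above concrete, here is a rough sketch of a checker, in Python for brevity (Tor itself would do this in C or Rust). The function name is hypothetical. It leans on Python's strict UTF-8 decoder, which already rejects overlong forms, surrogates, and values above U+10FFFF, and it assumes "without any BOM" means no U+FEFF at the start of the text; whether U+FEFF should also be banned elsewhere is an open question.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # UTF-8 encoding of U+FEFF


def is_allowed_utf8(data: bytes) -> bool:
    """Return True if data fits the restricted UTF-8 profile sketched above.

    Hypothetical helper, not part of any Tor codebase.
    """
    # Assumption: "no BOM" is read as "no leading U+FEFF".
    if data.startswith(UTF8_BOM):
        return False
    # The allowed range starts at U+0001, so NUL is excluded.
    if b"\x00" in data:
        return False
    try:
        # Strict decoding fails on overlong encodings, surrogates
        # (U+D800..U+DFFF), and codepoints above U+10FFFF.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

For example, `is_allowed_utf8(b"\xc0\x80")` is False because that is an overlong encoding of NUL, while the UTF-8 bytes of "Café" pass.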
[...]
> How do we carry forward existing ASCII restrictions into UTF-8?
I don't understand this question.
> We will need to update the directory spec to acknowledge that
> contact and platform lines may be parsed as UTF-8 or
> ASCII-including-arbitrary-bytes-except-NUL, and that they are
> terminated by single-byte newlines regardless.
Ack.
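
One possible reading of that, as a sketch only: a parser takes a line that has already been split off at a single-byte newline, tries UTF-8 first, and otherwise falls back to treating it as ASCII plus arbitrary bytes. The function name and the choice to show undecodable bytes as backslash escapes are assumptions for illustration, not anything the spec says.

```python
from typing import Tuple


def decode_contact_line(raw: bytes) -> Tuple[str, bool]:
    """Decode one contact/platform line; return (text, was_valid_utf8).

    Hypothetical helper. `raw` is assumed to be a single line with the
    terminating newline byte already removed.
    """
    if b"\x00" in raw or b"\n" in raw:
        raise ValueError("line contains NUL or an embedded newline")
    try:
        return raw.decode("utf-8"), True
    except UnicodeDecodeError:
        # Legacy case: keep ASCII as-is and render every other byte
        # as a \xNN escape so nothing is silently dropped.
        return raw.decode("ascii", errors="backslashreplace"), False
```

With this, the UTF-8 bytes of "Café" decode cleanly, while the Latin-1 bytes `b"Caf\xe9"` come back as the text `Caf\xe9` with the flag set to False.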
> How do we deal with format confusion attacks?
>
> UTF-8 has a few alternative whitespace characters. These could
> be used in an attack that confuses either humans viewing the file,
> or automated software:
>
> If a human uses a UTF-8 compatible viewer or editor, it likely shows
> Unicode newlines and ASCII newlines in an identical way. Similarly,
> it may show Unicode spaces and ASCII spaces in the same way.
> This may confuse the human reader.
Right. I don't see an obvious attack here, but we should keep it in mind.
Do you have a different suggestion of what to do here?
> Similarly, if automated software parses using a Unicode whitespace
> or newline character class, it will mis-parse directory documents.
> (Our Rust protover code looks for ASCII spaces, so it appears to
> be fine.)
>
> Note that we already have this issue with line feeds and carriage
> returns, which I thought we had solved by banning carriage returns
> in directory documents. But it appears we allow "any printing ASCII
> character". (We will have to edit this to include Unicode.)
Also let's consider all the nonprinting ASCII: it's already a
potential display problem if you're using a bad editor, or whatever.
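
A rough illustration of the mis-parsing concern quoted above, in Python for brevity. The document contents and function names are made up for the example. It compares splitting on the single byte 0x0A and the ASCII space against splitting with Python's Unicode-aware `str.splitlines()` and `str.split()`, which also treat characters such as U+2028 (LINE SEPARATOR) and U+00A0 (NO-BREAK SPACE) as separators.

```python
from typing import List


def lines_ascii(doc: bytes) -> List[bytes]:
    """Split on the single byte 0x0A only, dropping a trailing empty line."""
    lines = doc.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()
    return lines


def lines_unicode(doc: bytes) -> List[str]:
    """Split using Python's Unicode-aware notion of a line boundary."""
    return doc.decode("utf-8").splitlines()


# Made-up document: U+2028 hides inside the contact line,
# U+00A0 hides inside the platform line.
DOC = "contact Alice\u2028bandwidth 1 2 3\nplatform Tor\u00a0X\n".encode("utf-8")

ascii_view = lines_ascii(DOC)      # 2 lines
unicode_view = lines_unicode(DOC)  # 3 lines: "bandwidth ..." looks like its own line

# The same divergence shows up when splitting a line into tokens.
ascii_tokens = ascii_view[1].split(b" ")  # 2 tokens
unicode_tokens = unicode_view[2].split()  # 3 tokens
```

A parser written the second way would see a "bandwidth" line that a byte-oriented parser never sees, which is the kind of disagreement between parsers that we would want to rule out.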
> https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n218
>
> At the same time, we should have authorities reject any router
> descriptors or extrainfo documents that are not valid UTF-8.
> Simultaneously, we can have all Tor instances reject all
> non-directory-descriptor directory documents that are not UTF-8,
> since none should exist today.
>
>
> If we apply the existing restrictions in dir-spec, which require
> non-directory-descriptor directory documents to be ASCII, they will
> also be UTF-8.
>
> Isn't it confusing to say "UTF-8", when what we really mean is "ASCII"?
> Do we expect to migrate these to non-ASCII UTF-8 at some point?
I think having non-ASCII in extrainfos is a reasonable possibility.
I'm not so sure about the others, though reasons could come up in
the future.
My rationale for declaring everything to be UTF-8 was that it seemed
more reasonable to have a single set of rules for parsing everything
than to have different rules for different documents.
> Also, does "non-directory-descriptor directory documents" mean we
> can reject non-UTF-8 microdescriptors? I think we should.
I think so.
> Does the NS consensus contain any lines that are copied verbatim from
> descriptors?
I don't think so.
[...]
> should be rejected entirely: "reject-encrypted-non-utf-8". If that
> parameter is set to 1, then hidden service clients will not only
> warn, but reject the descriptors.
>
> Once the vast majority of clients are running versions that support
> the "reject-encrypted-non-utf-8" parameter, that parameter can be set
> to 1.
>
>
> We also can't reject bridge descriptors at the authority level.
> (Bridge clients download bridge descriptors directly from bridges.)
> Do we need bridge clients to also use this consensus parameter?
I added an extra section for this, basically saying "bridge clients
should do that too":
2.2. Bridge descriptors
Since clients download bridge descriptors directly from the bridges, they
also need a two-phase plan, like the one for hidden service descriptors
above. Here we take the same approach as in section 2.1, except using the
parameter "reject-bridge-descriptor-non-utf-8".
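
The two-phase behavior for both parameters could be sketched like this, in Python for brevity. The function name, the dictionary representation of consensus parameters, and the default of 0 when a parameter is absent are all assumptions for illustration; Python's built-in decoder stands in for whatever UTF-8 check we end up specifying.

```python
from typing import Dict, Tuple


def check_descriptor(body: bytes, params: Dict[str, int],
                     param_name: str) -> Tuple[bool, bool]:
    """Return (accept, warn) for a client-fetched descriptor.

    Hypothetical sketch: `params` maps consensus parameter names to
    integer values, and a missing parameter is assumed to mean 0.
    """
    try:
        body.decode("utf-8")  # stand-in for the real UTF-8 validity check
    except UnicodeDecodeError:
        # Phase 1 (parameter unset or 0): warn but still accept.
        # Phase 2 (parameter set to 1): warn and reject.
        reject = params.get(param_name, 0) == 1
        return (not reject, True)
    return (True, False)


# Hidden service clients and bridge clients would differ only in
# which parameter they consult.
HS_PARAM = "reject-encrypted-non-utf-8"
BRIDGE_PARAM = "reject-bridge-descriptor-non-utf-8"
```

So `check_descriptor(b"\xff", {}, BRIDGE_PARAM)` accepts with a warning, and the same call with `{BRIDGE_PARAM: 1}` rejects.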