[tor-dev] Proposal 285: Directory documents should be standardized as UTF-8

Nick Mathewson nickm at alum.mit.edu
Tue Jan 9 17:34:06 UTC 2018


Hi, Teor, and sorry for the long delay!  You had a lot of good
questions on this proposal, and I didn't know how to answer them all.
So in hopes of making progress here, I'm taking wild guesses and
asking for help in making the wild guesses better :)


On Mon, Nov 13, 2017 at 5:28 PM, teor <teor2345 at gmail.com> wrote:
> On 14 Nov 2017, at 05:51, Nick Mathewson <nickm at torproject.org> wrote:
>
> Filename: 285-utf-8.txt
> Title: Directory documents should be standardized as UTF-8
> Author: Nick Mathewson
> Created: 13 November 2017
> Status: Open
>
> 1. Summary and motivation
>
>    People frequently want to include non-ASCII text in their router
>    descriptors.  The Contact line is a favorite place to do this, but in
>    principle the platform line would also be pretty logical.
>
>    Unfortunately, there's no specified way to encode non-ASCII in our
>    directory documents.
>
>    Fortunately, almost everybody who does it uses UTF-8 anyway.
>
>
> How many current descriptors will be rejected as non-UTF-8?

I think that when last I checked, the number was something like 3.

>    As we move towards Rust support in Tor, we gain another motivation
>    for standardizing on UTF-8, since Rust's native strings strongly prefer
>    UTF-8.
>
>    So, in this proposal, we describe a migration path to having all
>    directory documents be fully UTF-8.
>
> 2. Proposal
>
>    First, we should have Tor relays reject ContactInfo lines (and any
>    other lines copied directly into router descriptors) that are not
>    UTF-8.
>
>
> How do we define UTF-8?

I tried to do so as follows:

   We define the allowable set of UTF-8 as:
        * Encoding the codepoints U+0001 through U+10FFFF,
        * but excluding the codepoints U+D800 through U+DFFF,
        * each encoded with the shortest possible encoding,
        * without any BOM.
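
To make the definition above concrete, here is a minimal sketch of a validator in Python. It is illustrative only, not Tor's implementation: the function name is hypothetical, and it assumes that "BOM" means a leading U+FEFF (whether U+FEFF should also be refused elsewhere in a document is left open).

```python
BOM = "\ufeff"


def is_allowed_utf8(data: bytes) -> bool:
    """Return True if data fits the restricted UTF-8 profile described above.

    Hypothetical helper for illustration; not part of Tor.
    """
    try:
        # Strict decoding already rejects non-shortest (overlong) encodings,
        # encoded surrogates U+D800..U+DFFF, and values above U+10FFFF.
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    if "\x00" in text:
        # U+0000 falls outside the allowed range, which starts at U+0001.
        return False
    if text.startswith(BOM):
        # Assumption: only a leading U+FEFF counts as a BOM.
        return False
    return True
```

For example, `is_allowed_utf8(b"\xc0\x80")` (an overlong encoding of NUL) and `is_allowed_utf8(b"\xed\xa0\x80")` (an encoded surrogate) are both refused, while ordinary ASCII and well-formed non-ASCII text pass.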

Are there other restrictions we should make?  If so, how should we phrase them?

[...]
> How do we carry forward existing ASCII restrictions into UTF-8?

I don't understand this question.

> We will need to update the directory spec to acknowledge that
> contact and platform lines may be parsed as UTF-8 or
> ASCII-including-arbitrary-bytes-except-NUL, and that they are
> terminated by single-byte newlines regardless.

Ack.

> How do we deal with format confusion attacks?
>
> UTF-8 has a few alternative whitespace characters. These could
> be used in an attack that confuses either humans viewing the file,
> or automated software:
>
> If a human uses a UTF-8 compatible viewer or editor, it likely shows
> Unicode newlines and ASCII newlines in an identical way. Similarly,
> it may show Unicode spaces and ASCII spaces in the same way.
> This may confuse the human reader.

Right.  I don't see an obvious attack here, but we should keep it in mind.

Do you have a different suggestion of what to do here?

> Similarly, if automated software parses using a Unicode whitespace
> or newline character class, it will mis-parse directory documents.
> (Our Rust protover code looks for ASCII spaces, so it appears to
> be fine.)
>
> Note that we already have this issue with line feeds and carriage
> returns, which I thought we had solved by banning carriage returns
> in directory documents. But it appears we allow "any printing ASCII
> character". (We will have to edit this to include Unicode.)

Also let's consider all the nonprinting ASCII: it's already a
potential display problem if you're using a bad editor, or whatever.
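
As a rough illustration of the mis-parsing concern, the Python sketch below compares a parser that uses Unicode-aware line and whitespace splitting with one that only recognizes LF and the ASCII space. The function names and the sample lines are hypothetical; the point is only that the two approaches disagree on the same bytes.

```python
def split_lines_unicode_aware(doc: bytes) -> list:
    # str.splitlines() also breaks on U+2028, U+2029, U+0085 and others.
    return doc.decode("utf-8").splitlines()


def split_lines_ascii_only(doc: bytes) -> list:
    # Break only on the single byte LF, dropping the final empty piece.
    lines = doc.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()
    return lines


def tokens_unicode_aware(line: str) -> list:
    # str.split() with no argument splits on any Unicode whitespace.
    return line.split()


def tokens_ascii_only(line: str) -> list:
    # Split only on the ASCII space character.
    return line.split(" ")


# One LF-terminated line that hides a U+2028 LINE SEPARATOR.
doc = "contact Alice\u2028fingerprint FAKE\n".encode("utf-8")
print(len(split_lines_unicode_aware(doc)), len(split_lines_ascii_only(doc)))

# One line that hides a U+2003 EM SPACE inside an argument.
line = "platform Tor\u20030.3.2"
print(tokens_unicode_aware(line), tokens_ascii_only(line))
```

The Unicode-aware splitter sees two lines and three tokens where the ASCII-only one sees one line and two tokens, which is the kind of disagreement a format confusion attack would rely on.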

> https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n218
>
>    At the same time, we should have authorities reject any router
>    descriptors or extrainfo documents that are not valid UTF-8.
>    Simultaneously, we can have all Tor instances reject all
>    non-directory-descriptor directory documents that are not UTF-8,
>    since none should exist today.
>
>
> If we apply the existing restrictions in dir-spec, which require
> non-directory-descriptor directory documents to be ASCII, they will
> also be UTF-8.
>
> Isn't it confusing to say "UTF-8", when what we really mean is "ASCII"?
> Do we expect to migrate these to non-ASCII UTF-8 at some point?

I think having non-ASCII in extrainfos is a reasonable possibility.
I'm not so sure about the others, though reasons to allow it could
come up in the future.

My rationale for declaring everything to be UTF-8 was that it seemed
more reasonable to have a single set of rules for parsing everything
than to have different rules for different documents.

> Also, does "non-directory-descriptor directory documents" mean we
> can reject non-UTF-8 microdescriptors? I think we should.

I think so.

> Does the NS consensus contain any lines that are copied verbatim from
> descriptors?

I don't think so.

[...]
>    should be rejected entirely: "reject-encrypted-non-utf-8".  If that
>    parameter is set to 1, then hidden service clients will not only
>    warn, but reject the descriptors.
>
>    Once the vast majority of clients are running versions that support
>    the "reject-encrypted-non-utf-8" parameter, that parameter can be set
>    to 1.
>
>
> We also can't reject bridge descriptors at the authority level.
> (Bridge clients download bridge descriptors directly from bridges.)
> Do we need bridge clients to also use this consensus parameter?

I added an extra section for this, basically saying "bridge clients
should do that too":

2.2. Bridge descriptors

   Since clients download bridge descriptors directly from the bridges, they
   also need a two-phase plan, as for hidden service descriptors above.  Here
   we take the same approach as in section 2.1 above, except using the
   parameter "reject-bridge-descriptor-non-utf-8".
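
A minimal Python sketch of how a bridge client might apply the two-phase plan follows. It is an assumption-laden illustration, not Tor code: the function name and the dict of consensus parameters are hypothetical, and treating an absent parameter as 0 is my guess at the default.

```python
import logging

log = logging.getLogger("bridge-client-sketch")

PARAM = "reject-bridge-descriptor-non-utf-8"


def accept_bridge_descriptor(desc: bytes, consensus_params: dict) -> bool:
    """Decide whether to keep a descriptor fetched directly from a bridge.

    Hypothetical helper; consensus_params maps parameter names to ints.
    """
    try:
        desc.decode("utf-8", errors="strict")
        return True  # valid UTF-8 is always accepted
    except UnicodeDecodeError:
        pass
    # Phase 1: always warn about non-UTF-8 descriptors.
    log.warning("bridge descriptor is not valid UTF-8")
    # Phase 2: reject only once the consensus parameter is set to 1.
    # Assumption: a missing parameter is treated the same as 0.
    return consensus_params.get(PARAM, 0) != 1
```

With the parameter unset or 0, a non-UTF-8 descriptor is still accepted with a warning; once the parameter is 1, the same descriptor is rejected.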

