[tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8"

Tue Feb 13 21:56:11 UTC 2018

> On 13 Feb 2018, at 21:55, Iain Learmonth <irl at torproject.org> wrote:
> 
> Hi,
> 
>> On 12/02/18 23:55, isis agora lovecruft wrote:
>> 1. What passes for "canonicalised" "utf-8" in C will be different to
>>    what passes for "canonicalised" "utf-8" in Rust.  In C, the
>>    following will not be allowed (whereas they are allowed in Rust):
>>        - NUL (0x00)
>>        - Byte Order Mark (0xFEFF)
> 
> Much of the metrics software is written in Java. Java strings allow for
> NUL to appear, but assume that there is no BOM. If a BOM appears, then
> this would be interpreted as data and, I assume, parsing would probably
> fail. Should the whole document be rejected if it contains a NUL or BOM,
> or should these values be stripped and then carry on parsing as if it
> never happened?

Directory authorities and bridge clients already reject descriptors that
contain NUL. (This is an artefact of the C implementation: NUL terminates
a C string, so the descriptor is seen as truncated and won't parse.)

We should specify rejection for BOM as well.
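
As a rough illustration of "reject, don't strip", here is a minimal Python
sketch. The function name decode_directory_document is hypothetical, and
rejecting a BOM anywhere in the document (rather than only at the start)
is my assumption, not something the proposal has settled:

```python
# UTF-8 encoding of U+FEFF (the byte order mark).
BOM = b"\xef\xbb\xbf"


def decode_directory_document(raw: bytes) -> str:
    """Decode raw bytes as strict UTF-8, rejecting NUL and BOM.

    Hypothetical helper: the whole document is rejected, nothing is
    stripped. Assumes a BOM anywhere in the document is an error.
    """
    if b"\x00" in raw:
        raise ValueError("document contains NUL")
    if BOM in raw:
        raise ValueError("document contains a byte order mark")
    # Strict decoding: invalid UTF-8 raises UnicodeDecodeError,
    # which is a subclass of ValueError.
    return raw.decode("utf-8")
```

A caller would treat any ValueError as "discard the document", which
matches what the C implementation already does for NUL.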

>> 2. Directory document keywords MUST be printable ASCII.
> 
> This can be validated. Should a single document keyword containing
> printable non-ASCII be enough to reject the document, or should a parser
> try to recover?

If parsers want to be consistent with the Tor implementation, they should
reject the whole document.
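
A parser-side check could look something like the sketch below. The names
are hypothetical, and I'm assuming "printable ASCII" means 0x21 to 0x7E
and that the keyword is the first space-delimited token on each line. A
real parser would also need to skip object bodies and handle the actual
document grammar:

```python
def is_printable_ascii(token: str) -> bool:
    """True if token is non-empty and every character is in 0x21..0x7E."""
    return bool(token) and all(0x21 <= ord(ch) <= 0x7E for ch in token)


def check_keywords(document: str) -> None:
    """Raise ValueError if any line starts with a non-printable-ASCII keyword.

    Simplification: every non-empty line is treated as a keyword line.
    Arguments after the keyword are not checked here, so they may still
    contain non-ASCII UTF-8.
    """
    for lineno, line in enumerate(document.split("\n"), start=1):
        if not line:
            continue
        keyword = line.split(" ", 1)[0]
        if not is_printable_ascii(keyword):
            raise ValueError(f"line {lineno}: keyword is not printable ASCII")
```

The check raises on the first bad keyword instead of trying to recover,
so the whole document is rejected.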

> I'd really like to see a section in the proposal about how parsers
> should react when they find something unexpected, otherwise all the
> parsers may end up doing different things.

+1

>> 3. This change may break some descriptor/consensus/document parsers.
>>    If you are the maintainer of a parser, you may want to start
>>    thinking about this now.
> 
> For the metrics tools there are some guidelines on this we can follow:
> https://docs.oracle.com/javase/tutorial/i18n/text/design.html. The other
> language would be Python (for stem), but Python developers have probably
> got a good understanding of unicode/str/bytes by now. (In Python 3: when
> using UTF-8, BOM will not be stripped and will be interpreted as data,
> and you can have a NUL in a str).
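
The Python 3 behaviour described above can be checked with a few lines.
This is only a demonstration using the standard "utf-8" and "utf-8-sig"
codecs, and the sample bytes are made up:

```python
# Made-up sample: a UTF-8 BOM, some text, and a trailing NUL.
raw = b"\xef\xbb\xbfrouter test\x00"

# The plain "utf-8" codec keeps the BOM as the character U+FEFF
# and keeps the NUL as an ordinary character in the str.
as_utf8 = raw.decode("utf-8")

# The "utf-8-sig" codec strips one leading BOM; the NUL is still kept.
as_utf8_sig = raw.decode("utf-8-sig")

print(as_utf8.startswith("\ufeff"))
print("\x00" in as_utf8_sig)
```

So a Python parser that wants to reject these has to check for them
explicitly; neither codec will raise an error on its own.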

There's also Python for txtorcon, and Rust for Tor's experimental
protover implementation.

And perhaps others:
https://stem.torproject.org/faq.html#are-there-any-other-controller-libraries
https://trac.torproject.org/projects/tor/wiki/doc/ListOfTorImplementations

T