[tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8"
teor
teor2345 at gmail.com
Tue Feb 13 21:56:11 UTC 2018
> On 13 Feb 2018, at 21:55, Iain Learmonth <irl at torproject.org> wrote:
>
> Hi,
>
>> On 12/02/18 23:55, isis agora lovecruft wrote:
>> 1. What passes for "canonicalised" "utf-8" in C will be different to
>> what passes for "canonicalised" "utf-8" in Rust. In C, the
>> following will not be allowed (whereas they are allowed in Rust):
>> - NUL (0x00)
>> - Byte Order Mark (0xFEFF)
>
> Much of the metrics software is written in Java. Java strings allow for
> NUL to appear, but assume that there is no BOM. If a BOM appears, then
> this would be interpreted as data and, I assume, parsing would probably
> fail. Should the whole document be rejected if it contains a NUL or BOM,
> or should these values be stripped and then carry on parsing as if it
> never happened?
Directory authorities and bridge clients already reject descriptors that
contain NUL. (This is an artefact of the C implementation: the descriptor
is seen as truncated, so it won't parse.)
We should specify rejection for BOM as well.
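As a rough illustration of what "reject" could look like in a parser, here is a minimal Python sketch. The function name check_document_bytes is hypothetical and not part of Tor, stem, or the metrics tools. It also assumes a BOM is rejected anywhere in the document, not only at the start, which the proposal would still need to decide.

```python
# Hypothetical helper for illustration only; not an existing Tor or stem API.
UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8


def check_document_bytes(raw: bytes) -> str:
    """Decode a directory document, rejecting NUL, BOM, and invalid UTF-8."""
    if b"\x00" in raw:
        raise ValueError("document contains NUL")
    # Assumption: a BOM is rejected wherever it appears, not just at offset 0.
    if UTF8_BOM in raw:
        raise ValueError("document contains a byte order mark")
    # Strict decoding raises UnicodeDecodeError (a ValueError subclass)
    # if the bytes are not valid UTF-8.
    return raw.decode("utf-8")
```

The whole document is rejected on the first problem found, rather than stripping the offending bytes and carrying on.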
>> 2. Directory document keywords MUST be printable ASCII.
>
> This can be validated. Should a single document keyword containing
> printable non-ASCII be enough to reject the document, or should a parser
> try to recover?
If parsers want to be consistent with the Tor implementation, they should
reject.
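A keyword check along these lines might look like the following Python sketch. It assumes the keyword is the first space-delimited token on each line and that "printable ASCII" means 0x21 to 0x7E; both are simplifications for illustration, and the function names are hypothetical.

```python
# Hypothetical helpers for illustration only; not an existing parser API.
def is_printable_ascii(token: str) -> bool:
    """True if token is non-empty and every character is in 0x21..0x7E."""
    return bool(token) and all(0x21 <= ord(ch) <= 0x7E for ch in token)


def check_keywords(document: str) -> None:
    """Reject the whole document if any line's keyword is not printable ASCII."""
    for lineno, line in enumerate(document.split("\n"), start=1):
        if not line:
            continue
        # Assumption: the keyword is the first space-delimited token.
        keyword = line.split(" ", 1)[0]
        if not is_printable_ascii(keyword):
            raise ValueError(f"line {lineno}: keyword is not printable ASCII")
```

Non-ASCII text after the keyword is left alone here; only the keyword itself causes rejection.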
> I'd really like to see a section in the proposal about how parsers
> should react when they find something unexpected, otherwise all the
> parsers may end up doing different things.
+1
>> 3. This change may break some descriptor/consensus/document parsers.
>> If you are the maintainer of a parser, you may want to start
>> thinking about this now.
>
> For the metrics tools there are some guidelines on this we can follow:
> https://docs.oracle.com/javase/tutorial/i18n/text/design.html. The other
> language would be Python (for stem), but Python developers have probably
> got a good understanding of unicode/str/bytes by now. (In Python 3: when
> using UTF-8, BOM will not be stripped and will be interpreted as data,
> and you can have a NUL in a str).
There is also Python for txtorcon, and Rust for Tor's experimental protover
implementation. And perhaps others:
https://stem.torproject.org/faq.html#are-there-any-other-controller-libraries
https://trac.torproject.org/projects/tor/wiki/doc/ListOfTorImplementations
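For parser maintainers working in Python 3, the behaviour Iain describes above can be checked with a few lines. The sample bytes are made up for illustration.

```python
# Made-up sample: a UTF-8 BOM, then text containing a NUL byte.
raw = b"\xef\xbb\xbfrouter\x00example"

# The plain "utf-8" codec keeps the BOM as the character U+FEFF.
text = raw.decode("utf-8")
assert text[0] == "\ufeff"

# A str can hold NUL without any truncation.
assert "\x00" in text

# The "utf-8-sig" codec silently strips a leading BOM instead.
stripped = raw.decode("utf-8-sig")
assert stripped[0] == "r"
```

So a Python parser that wants to reject a BOM has to check for it explicitly, and must avoid "utf-8-sig", which would hide it.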
T