[tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8"
Iain Learmonth
irl at torproject.org
Tue Feb 13 10:55:30 UTC 2018
Hi,
On 12/02/18 23:55, isis agora lovecruft wrote:
> 1. What passes for "canonicalised" "utf-8" in C will be different to
> what passes for "canonicalised" "utf-8" in Rust. In C, the
> following will not be allowed (whereas they are allowed in Rust):
> - NUL (0x00)
> - Byte Order Mark (0xFEFF)
Much of the metrics software is written in Java. Java strings allow
NUL to appear, but assume that there is no BOM. If a BOM appears, it
would be interpreted as data and, I assume, parsing would probably
fail. Should the whole document be rejected if it contains a NUL or BOM,
or should these values be stripped and parsing carry on as if nothing
had happened?
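
To make the two options concrete, here is a minimal sketch in Python
(chosen only for brevity; the same question applies to the Java code).
The names decode_strict, decode_lenient and DocumentRejected are
hypothetical and do not come from the proposal or from any existing
parser:

```python
BOM = "\ufeff"  # U+FEFF, encoded in UTF-8 as the bytes EF BB BF
NUL = "\x00"


class DocumentRejected(ValueError):
    """Raised by the strict policy when the document is not acceptable."""


def decode_strict(raw: bytes) -> str:
    """Option 1: reject the whole document on invalid UTF-8, NUL or BOM."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise DocumentRejected(f"not valid UTF-8: {exc}") from exc
    for forbidden, name in ((NUL, "NUL"), (BOM, "BOM")):
        if forbidden in text:
            raise DocumentRejected(f"document contains {name}")
    return text


def decode_lenient(raw: bytes) -> str:
    """Option 2: strip NUL and BOM, then carry on parsing."""
    text = raw.decode("utf-8")
    return text.replace(NUL, "").replace(BOM, "")
```

With the strict policy a single stray code point discards the document;
with the lenient policy two parsers can end up holding different bytes
for a document that carries the same signature, which is part of why I
think the proposal needs to pick one.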
> 2. Directory document keywords MUST be printable ASCII.
This can be validated. Should a single document keyword containing
printable non-ASCII be enough to reject the document, or should a parser
try to recover?
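
As a rough illustration of that validation, the sketch below assumes
that a keyword is the first space-delimited token of a line and that
"printable ASCII" means the range 0x21 to 0x7E. Both are assumptions
made for the example, the function names are hypothetical, and the
sketch treats every line alike (it does not skip over object blocks
such as signatures):

```python
def keyword_is_printable_ascii(keyword: str) -> bool:
    """True if the keyword is non-empty and every character is 0x21..0x7E."""
    return bool(keyword) and all(0x21 <= ord(ch) <= 0x7E for ch in keyword)


def find_bad_keywords(document: str) -> list:
    """Return (line number, keyword) pairs for keywords that fail the check."""
    bad = []
    for lineno, line in enumerate(document.split("\n"), start=1):
        if not line:
            continue  # ignore empty lines, including the one after a final newline
        keyword = line.split(" ", 1)[0]
        if not keyword_is_printable_ascii(keyword):
            bad.append((lineno, keyword))
    return bad
```

A parser that rejects would refuse the document as soon as this list is
non-empty; a parser that recovers would have to decide what to do with
each offending line, which is the behaviour that needs specifying.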
I'd really like to see a section in the proposal about how parsers
should react when they find something unexpected, otherwise all the
parsers may end up doing different things.
> 3. This change may break some descriptor/consensus/document parsers.
> If you are the maintainer of a parser, you may want to start
> thinking about this now.
For the metrics tools there are some guidelines on this that we can
follow: https://docs.oracle.com/javase/tutorial/i18n/text/design.html.
The other language would be Python (for stem), but Python developers
have probably got a good understanding of unicode/str/bytes by now. (In
Python 3, decoding with the "utf-8" codec does not strip a BOM, which is
then interpreted as data, and a str can contain NUL.)
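
The Python 3 behaviour can be checked with a few lines. The input bytes
are made up for the example; "utf-8-sig" is the standard-library codec
that does strip a leading BOM, shown here only for contrast:

```python
# Made-up input: a UTF-8 BOM, some text, and an embedded NUL.
raw = b"\xef\xbb\xbfrouter example\x00\n"

as_utf8 = raw.decode("utf-8")          # BOM is kept as U+FEFF
as_utf8_sig = raw.decode("utf-8-sig")  # leading BOM is stripped

starts_with_bom = as_utf8.startswith("\ufeff")
sig_strips_bom = as_utf8_sig.startswith("router")
nul_survives = "\x00" in as_utf8       # NUL is an ordinary character in a str
```

So a Python parser that decodes with plain "utf-8" will see the BOM as
the first character of the first keyword unless it checks for it.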
Thanks,
Iain.