[tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8"
teor
teor2345 at gmail.com
Tue Feb 13 00:03:54 UTC 2018
> On 13 Feb 2018, at 10:55, isis agora lovecruft <isis at torproject.org> wrote:
>
> A couple outcomes of this:
>
> 1. What passes for "canonicalised" "utf-8" in C will be different to
> what passes for "canonicalised" "utf-8" in Rust. In C, the
> following will not be allowed (whereas they are allowed in Rust):
> - NUL (0x00)
> - Byte Order Mark (0xFEFF)
I want to clarify this point:
The Byte Order Mark is the Unicode scalar value U+FEFF, encoded in UTF-8 as
the three bytes 0xEF 0xBB 0xBF.
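
To make the distinction between the scalar value and its encoded bytes concrete, here is a minimal Rust sketch. The function name `bom_utf8_bytes` is made up for illustration and is not part of Tor; it only uses the standard library's `char::encode_utf8`.

```rust
/// Hypothetical helper (not Tor code): encode the scalar value U+FEFF
/// as UTF-8 and return the resulting bytes.
pub fn bom_utf8_bytes() -> Vec<u8> {
    // A single scalar value needs at most 4 bytes in UTF-8.
    let mut buf = [0u8; 4];
    // encode_utf8 writes into buf and returns the encoded prefix as a &mut str.
    '\u{FEFF}'.encode_utf8(&mut buf).as_bytes().to_vec()
}
```

U+FEFF falls in the three-byte range of UTF-8, so the returned vector holds 0xEF, 0xBB, 0xBF rather than the two bytes 0xFE 0xFF.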
Tor's C and Rust implementations of UTF-8 must behave identically: they must
accept and reject exactly the same byte sequences.
When we write the C implementation, we must reject NUL for
compatibility with C string functions.
When we write the Rust implementation, we must reject NUL for
compatibility with the C implementation. (Rust already implements
UTF-8 strings that accept NUL, so this will require custom code).
When we write the C and Rust implementations, we must reject BOM
because it's unnecessary. Rejecting BOM is recommended by the
relevant standard. (Rust already implements UTF-8 strings that accept
BOM, so this will require custom code).
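
As a rough illustration of the "custom code" on the Rust side, here is a minimal sketch. The names `DirUtf8Error` and `validate_dir_utf8` are hypothetical and not taken from Tor or the proposal. It also assumes that U+FEFF is rejected anywhere in the input, not only at the start; the final rule may differ.

```rust
/// Hypothetical error type for this sketch (not part of Tor).
#[derive(Debug, PartialEq)]
pub enum DirUtf8Error {
    InvalidUtf8,
    ContainsNul,
    ContainsBom,
}

/// Hypothetical validator: accept only well-formed UTF-8 that contains
/// neither NUL nor U+FEFF.
pub fn validate_dir_utf8(bytes: &[u8]) -> Result<&str, DirUtf8Error> {
    // The standard library checks well-formedness, but it accepts
    // both NUL and U+FEFF, so those need extra checks below.
    let s = std::str::from_utf8(bytes).map_err(|_| DirUtf8Error::InvalidUtf8)?;

    // Reject NUL for compatibility with C string functions.
    if s.contains('\0') {
        return Err(DirUtf8Error::ContainsNul);
    }

    // Assumption: reject U+FEFF at any position, not just as a leading BOM.
    if s.contains('\u{FEFF}') {
        return Err(DirUtf8Error::ContainsBom);
    }

    Ok(s)
}
```

With this sketch, `validate_dir_utf8(b"a\0b")` returns `Err(DirUtf8Error::ContainsNul)`, while `std::str::from_utf8` alone would accept the same bytes. A C implementation would need to make the same three decisions in the same order to return matching results.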
T