[tor-dev] Drafting a proposal for mnemonic onion URLs

Wed Dec 21 05:51:24 UTC 2011

A couple issues:

1. There's no such thing as "bits per word" without a dictionary, or
better yet, a frequency-sorted dictionary. :-P (See Alex's comments in
https://plus.google.com/u/0/103112149634414554669/posts/FMDg2Vht8sb
for a good idea on this.)

For reference, the OED is ~250k words (~18 bits at most, without
considering frequency).

2. People cannot remember a random ordered set of even four rare
words. Consider for instance that memory competitions involve doing
*exactly that* and that untrained people generally can't retain even a
small number. What people do in competitions, and what I've proposed
in the draft, is remember *scenes* — e.g. something you can state in a
sentence.

Also, you can't use words that people don't know. "Squibbish" might as
well be "kaftorn". Merely being a word is not sufficient; it has to be
something with semantic value. "English-sounding" is useless except
for the very low bar of "can be spoken over a telephone".

This of course constrains the dictionary considerably, such that
assembling even 12 bits (4k) is not completely trivial. But remember,
the point of my proposal is *not* "can you output some random crap
that's word-ish" à la the horrible RFC2289. It's "can you give people
something they'll actually remember".

3. People are not going to remember whether it was, say, 48 or 49 or
84. Numbers are effectively useless for recall. People don't even
think in decimal, they think in very rough logarithmic approximations.
:-P

4. How would you 'keep generating words'? My proposal was that every
hash have a single, canonical phrasal mapping. You would not be able
to change it unless you change your hidden service descriptor. (You
could do that if you're willing to "move" your site, or are making a
new one.)

If you don't have canonical mapping, siteops can't tell people what
their phrasal descriptor is, which means you lose a majority of its
utility for security.

For that matter, how would you ship it without a dictionary? Are you
expecting that most users will already have a common dictionary file?

5. Not having it be in English means you have to support
internationalized domain names (nearly all other languages have
non-ASCII characters). AIUI that's just not technologically up to
speed yet.

I should emphasize that this is *not* a technical problem, it's a
problem of cognitive linguistics. Writing the code is relatively
trivial; it's just a simple parser / generator layer.

- Sai

On Wed, Dec 21, 2011 at 00:16, Nick Mathewson <nickm at alum.mit.edu> wrote:
> On Tue, Dec 20, 2011 at 11:20 PM, Sai <tor at saizai.com> wrote:
>> Howdy all.
>>
>> I would like to make a mnemonic translation method for onion URLs. Not
>> as in a full petname system with registration and so forth, just a
>> "four little words" style way of rendering and entering existing
>> 80-bit hashes in a way that is memorable to not just detect
>> partial-overlap fuzzing attacks but actually memorize and produce the
>> URL directly.
>>
>
> So, the last time I was talking about this (with a grad student whose
> name I have just asked for permission to cite), I came up with the
> following scheme.  Let BW be the number of bits per natural-language
> word; let BH be the number of bits per hidden service identity key
> hash.  Let H() be some hash function.  Represent the onion key
> identity hash as a sequence of BH/BW natural-language words, letting
> each word W represent the first BW bits of H(W).
>
> So if H() is SHA256, and BW is 20, and BH is 80, the hostname
>  bookbind.electrodynamism.unsurnamed.squibbish
> would represent the 80-bit sequence
>  Ht("bookbind") Ht("electrodynamism") H("unsurnamed") H("squibbish")
> where Ht(x) is the first 20 bits of SHA256(x).
>
> (In practice, 80 bits is feeling way too short these days; but that's
> another discussion.)
>
> The main advantages and disadvantages of this idea are:
>
>  + You don't need to ship a shared wordlist with the client software.
>  (That's a good thing, since big dictionaries get annoying fast.)
>  + You don't need to restrict yourself to any particular language; it
> is not English-centric.
>  - If bw is large enough, there will be sequences of bits that are
> not represented as the hash of any particular English word.  (e.g., if
> bw is 32, you are in trouble: no language has 4 billion words!)  You
> can work around this by having your search program generate new
> english-sounding sequences, by using a bigger wordlist or by
> generating keys until you find one whose fingerprint *is*
> representable with words you have.
>  - If you make BW big, you are going to get some pretty obscure words.
>  - Encodings are not unique.
>  + You can get a lot of bits per word.  If you're willing to do stuff
> like "every word in english" x "the numbers 00-99", you can get even
> more.
>  + It scales to more than 80 bits pretty easily.
>  - It doesn't make pretty sentences by default, though you can keep
> generating words till you find something you like the sound of.
>  - We'd need to watch out for words that look equivalent to a human,
> but aren't.
>  / You actually need to write a reasonably good search program.  This
> isn't too hard, and the database it generates isn't too big.
>
> There are other issues and other possible solutions too.
>
> yrs,
> --
> Nick
> _______________________________________________
> tor-dev mailing list
> tor-dev at lists.torproject.org
> https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev