Proposal 158: Clients download consensus + microdescriptors

Wed Jan 21 05:46:04 UTC 2009

On Sun, Jan 18, 2009 at 02:22:52PM -0500, Roger Dingledine wrote:
> Filename: 158-microdescriptors.txt
> Title: Clients download consensus + microdescriptors
> Version: $Revision: 18172 $
> Last-Modified: $Date: 2009-01-18 13:57:20 -0500 (Sun, 18 Jan 2009) $
> Author: Roger Dingledine
> Created: 17-Jan-2009
> Status: Open

Hi, Roger!  This needs even more discussion, but I think it's going in
the right direction.

> 
> 1. Overview
> 
>   This proposal replaces section 3.2 of proposal 141, which was
>   called "Fetching descriptors on demand". Rather than modifying the
>   circuit-building protocol to fetch a server descriptor inline at each
>   circuit extend, we instead put all of the information that clients need
>   either into the consensus itself, or into a new set of data about each
>   relay called a microdescriptor. The microdescriptor is a direct
>   transform from the relay descriptor, so relays don't even need to know
>   this is happening.
> 
>   Descriptor elements that are small and frequently changing should go
>   in the consensus itself, and descriptor elements that are small and
>   relatively static should go in the microdescriptor. If we ever end up
>   with descriptor elements that aren't small yet clients need to know
>   them, we'll need to resume considering some design like the one in
>   proposal 141.

This is a good breakdown, and clarifies our motivation decently well.

Does this mean that the ports should (assuming it's possible) get
moved into the microdescriptor?  I think exit policies are relatively
stable.

 [...] 
> 3. Design
> 
>   There are three pieces to the proposal. First, authorities will list in
>   their votes (and thus in the consensus) what relay descriptor elements
>   are included in the microdescriptor, and also list the expected hash
>   of microdescriptor for each relay. Second, directory mirrors will serve
>   microdescriptors. Third, clients will ask for them and cache them.
> 
> 3.1. Consensus changes
> 
>   V3 votes should include a new line:
>     microdescriptor-elements bar baz foo
>   listing each descriptor element (sorted alphabetically) that authority
>   included when it calculated its expected microdescriptor hashes.
> 
>   We also need to include the hash of each expected microdescriptor in
>   the routerstatus section. I suggest a new "m" line for each stanza,
>   with the base64 of the hash of the elements that the authority voted
>   for above.
> 
>   The consensus microdescriptor-elements and "m" lines are then computed
>   as described in Section 3.1.2 below.
> 
>   I believe that means we need a new consensus-method "6" that knows
>   how to compute the microdescriptor-elements and add "m" lines.

Right.  We'll allocate the actual number when we implement; 6 seems
likeliest.

 [...]
> 3.1.2. Computing consensus for microdescriptor-elements and "m" lines
 [...]
>   It would be nice to have a more foolproof way to agree on what
>   microdescriptor hash each authority should vote for, so we can avoid
>   missing "m" lines. Just switching to a new consensus-method each time
>   we change the set of microdescriptor-elements won't help though, since
>   each authority will still have to decide what hash to vote for before
>   knowing what consensus-method will be used.
> 
>   Here's one way we could do it. Each vote / consensus includes
>   the microdescriptor-elements that were used to compute the hashes,
>   and also a preferred-microdescriptor-elements set. If an authority
>   has a consensus from the previous period, then it should use the
>   consensus preferred-microdescriptor-elements when computing its votes
>   for microdescriptor-elements and the appropriate hashes in the upcoming
>   period. (If it has no previous consensus, then it just writes its
>   own preferences in both lines.)

Here's a way that recovers a little more gracefully from
desynchronization.  The vote could include two sets at most: the one
you would like to use, and the one that was used in the most recent
consensus you have.  You include m-lines for both.  If either set
wins, your m-lines influence the consensus.

(If your favorite set is the one that the last consensus lists, you
wouldn't include duplicate m-lines.)
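
To make the two-set idea concrete, here is a rough Python sketch of how
an authority might pick the element sets it emits m-lines for.  The
function name and the list-of-tuples return shape are made up for
illustration; nothing here is specified by the proposal.

```python
def element_sets_for_vote(preferred, last_consensus=None):
    """Return the element sets this authority should emit m-lines for.

    preferred:       elements this authority would like to use.
    last_consensus:  elements listed in the most recent consensus we
                     hold, or None if we have no previous consensus.
    (Hypothetical helper; names are illustrative only.)
    """
    # Normalize to sorted tuples so that ordering differences do not
    # make two identical sets look distinct.
    sets = [tuple(sorted(set(preferred)))]
    if last_consensus is not None:
        previous = tuple(sorted(set(last_consensus)))
        # If our favorite set is the one the last consensus used,
        # do not vote duplicate m-lines.
        if previous != sets[0]:
            sets.append(previous)
    return sets
```

At most two sets ever come back, and only one when the authority's
preference already matches the last consensus.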

> 3.2. Directory mirrors serve microdescriptors
> 
>   Directory mirrors should then read the microdescriptor-elements line
>   from the consensus, and learn how to answer requests. (Directory mirrors
>   continue to serve normal relay descriptors too, a) to serve old clients
>   and b) to be able to construct microdescriptors on the fly.)
> 

So let's talk a little bit about why we're doing it this way.

The reason for doing this particular design (call it "caches build
microdescriptors") is mainly b above, so that if more info needs to
get added to microdescriptors, the caches can just serve it, and they
don't need to be upgraded.

The alternative would be to have the authorities generate and serve
the microdescriptors themselves.  (Call this "authorities build
microdescriptors".)  This would give us greater freedom in what we put
in them and how we format them, but would require more stuff to get
downloaded and cached from the authorities by the mirrors.

Even though I initially advocated it, I am not sure that the approach
this proposal takes actually helps forward compatibility.  After all,
the only reason to add a new field to microdescriptors is because
clients are going to start using it.  So, clients are upgrading
anyway.  Authorities would in both cases need to be reconfigured, at
least, to vote for the new microdescriptor constructions.

So how are these approaches different in their outcomes?

Advantages for "Caches build microdescriptors":

   + Caches download less from the authorities. 
   + It is trivial to determine that the authorities have computed the
     microdescriptor for a descriptor correctly if you have both.
     (With "Authorities build microdescriptors", you need to know the
     algorithm that the authorities used, so it's easy, but not
     quite so trivial.)
   +??? We can change what goes into microdescriptors just by
     upgrading the authority configuration.  This is not so great as
     it might first seem.  Revisions to microdescriptor contents
     shouldn't happen lightly, after all, since making them bigger
     will defeat their purpose, and taking things out will make old
     clients stop working.  Since changing their contents would
     require a proposal and a client behavior shift anyway, is having
     the authorities upgrade to a new consensus-version such a big deal?

Advantages for "Authorities build microdescriptors":
   + We have more flexibility about what the microdescriptors can
     contain.  For instance, they can include the equivalent of the
     "p" lines in the current consensus format, even though those need
     to be calculated from exit policies, and are not simple copies.
     This is especially important if our goal is to shift stable info
     into the microdescriptors in order to keep consensuses small
     while making clients download descriptors less.

That's 3 advantages for "Caches build", and only 1 for "Authorities
build", but I think that the advantage of "authorities build" is much
bigger.  It lets us consider things like the exit-ports line, binary
packing of onion keys [not actually a win, but the next thing could
be], and so on.  What do you think?

(I think it was originally I who argued for a list of items to include
in the first place.  I may have been wrong and reaching for premature
generality.)

>   The microdescriptors with hashes <D1>,<D2>,<D3> should be available at:
>     http://<hostname>/tor/micro/d/<D1>+<D2>+<D3>.z

This implies that unless the mirror knows the microdescriptors for
every router in the last two or three consensuses, the client is out of
luck.  Thus, the mirror must have kept track of the fields listed for
microdescriptors in all the live consensuses.  So be it.
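
As a sketch of what that bookkeeping could look like on the mirror
side, the Python below builds a request URL and answers it from an
index keyed by digest.  It assumes the digests in the URL are hex
strings (so that "+" is safe as a separator), that the index has
already been filled from every live consensus, and that "example.com"
stands in for a real mirror; the function names are hypothetical.

```python
def micro_url(hostname, digests):
    """Build the fetch URL for the given microdescriptor digests."""
    return "http://%s/tor/micro/d/%s.z" % (hostname, "+".join(digests))


def lookup(index, path):
    """Answer a /tor/micro/d/ request from a digest -> microdescriptor map.

    Returns (found, missing): the microdescriptors we can serve, and the
    digests we know nothing about.  The index is assumed to cover all
    live consensuses, not just the newest one.
    """
    prefix, suffix = "/tor/micro/d/", ".z"
    if not (path.startswith(prefix) and path.endswith(suffix)):
        raise ValueError("not a microdescriptor request: %r" % path)
    wanted = path[len(prefix):-len(suffix)].split("+")
    found = [index[d] for d in wanted if d in index]
    missing = [d for d in wanted if d not in index]
    return found, missing
```

Any digest that lands in "missing" is exactly the out-of-luck case
described above.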

>   All the microdescriptors from the current consensus should also be
>   available at:
>     http://<hostname>/tor/micro/all.z
>   so a client that's bootstrapping doesn't need to send a 70KB URL just
>   to name every microdescriptor it's looking for.

Good idea.

>   The format of a microdescriptor is the header line
>   "microdescriptor-header"
>   followed by each element (keyword and body), alphabetically. There's
>   no need to mention what hash it's for, since it's self-identifying:
>   you can hash the elements to learn this.

We should mention that the header line is semantically important.  If
you see:
  microdescriptor-header foo bar
  foo X
then you know that the base descriptor has no bar element, whereas if
you see:
  microdescriptor-header foo
  foo X
then you know nothing about the bar element.
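
In Python terms, the distinction is a three-way answer rather than a
boolean.  This is only a sketch: it assumes the header line lists the
covered elements as space-separated arguments (as in the examples
above), and the function name and return strings are invented here.

```python
def element_status(header_line, body_keywords, element):
    """Classify an element as 'present', 'absent', or 'unknown'.

    header_line:    the "microdescriptor-header ..." line.
    body_keywords:  set of keywords that actually appear in the body.
    element:        the keyword we are asking about.
    """
    fields = header_line.split()
    if not fields or fields[0] != "microdescriptor-header":
        raise ValueError("not a microdescriptor header line")
    covered = set(fields[1:])
    if element not in covered:
        # The header never promised to carry this element, so its
        # absence from the body tells us nothing about the descriptor.
        return "unknown"
    return "present" if element in body_keywords else "absent"
```

With the two examples above, asking about "bar" gives "absent" for the
first header and "unknown" for the second.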

What are clients supposed to do, btw, if they find that the
microdescriptors that the authority lists do not contain some field
they regard as essential?  I assume the answer is, "This must never
happen.  Once a client version uses a field in microdescriptors, that
field must be present in microdescriptors until all client versions
requiring it are obsolete."  Yes?  Otherwise clients that want that
field need to fall back to descriptors.

>   (Do we need a footer line to show that it's over, or is the next
>   microdescriptor line or EOF enough of a hint? A footer line wouldn't
>   hurt much. Also, no fair voting for the microdescriptor-element
>   "microdescriptor-header".)

I don't see that a footer line is necessary.

>   The hash of the microdescriptor is simply the hash of the concatenated
>   elements -- not counting the header line or hypothetical footer line.
>   Unless you prefer that?

Just the elements is fine.
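
For concreteness, here is a minimal sketch of that computation.  The
proposal does not name the hash function or the exact encoding, so
SHA-1 and unpadded base64 are assumptions on my part, as is the
dict-of-element-text input format.

```python
import base64
import hashlib


def microdescriptor_digest(elements):
    """Hash the concatenated elements, in alphabetical keyword order.

    elements: dict mapping keyword -> full element text (keyword line
              plus any body, newline-terminated).  The header line is
              deliberately not part of the input.
    Assumes SHA-1 and base64 with the trailing '=' stripped.
    """
    body = "".join(elements[keyword] for keyword in sorted(elements))
    digest = hashlib.sha1(body.encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")
```

Because the header is excluded, a mirror or client can recompute the
digest from the elements alone and compare it against the "m" line.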

>   Is there a reasonable way to version these things? We could say that
>   the microdescriptor-header line can contain arguments which clients
>   must ignore if they don't understand them. Any better ways?

If we go with the authorities-build-microdescriptors idea, let's have
them numbered like the consensus version.

>   Directory mirrors should check to make sure that the microdescriptors
>   they're about to serve match the right hashes (either the hashes from
>   the fetch URL or the hashes from the consensus, respectively).
> 
>   We will probably want to consider some sort of smart data structure to
>   be able to quickly convert microdescriptor hashes into the appropriate
>   microdescriptor. Clients will want this anyway when they load their
>   microdescriptor cache and want to match it up with the consensus to
>   see what's missing.
>
> 3.3. Clients fetch them and cache them
> 
>   When a client gets a new consensus, it looks to see if there are any
>   microdescriptors it needs to learn. If it needs to learn more than
>   some threshold of the microdescriptors (half?), it requests 'all',
>   else it requests only the missing ones.

The client should estimate the typical compressed microdescriptor size
(CM).  Requesting another microdescriptor costs 41 bytes in the HTTP
request.   If the client wants N microdescriptors, and 41*N > CM, it
should request all.
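
One way to turn this into code is as a break-even check.  The sketch
below is my reading rather than a spec: it weighs the 41 bytes per
named hash against the compressed size of the microdescriptors we
would download without wanting them.  The n_total parameter (how many
microdescriptors "all" would return) is an added assumption; when
exactly one of them is unwanted, the test reduces to 41*N > CM.

```python
HASH_COST = 41  # bytes each extra hash adds to the request, per the text


def should_fetch_all(n_wanted, n_total, cm):
    """Decide between naming n_wanted hashes and fetching everything.

    n_wanted: microdescriptors we are missing (N).
    n_total:  microdescriptors "all" would return (assumed parameter).
    cm:       estimated compressed size of one microdescriptor (CM).
    """
    if n_wanted <= 0:
        return False  # nothing to fetch at all
    unwanted = n_total - n_wanted
    # Upstream bytes spent naming hashes vs. downstream bytes wasted
    # on microdescriptors we already have.
    return HASH_COST * n_wanted > cm * unwanted
```

A bootstrapping client (n_wanted == n_total) always fetches "all" under
this rule, since there is no wasted download to weigh against the URL.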

>   Clients maintain a cache of microdescriptors along with metadata like
>   when it was last referenced by a consensus. They keep a microdescriptor
>   until it hasn't been mentioned in any consensus for a week. Future
>   clients might cache them for longer or shorter times.
> 
> 3.3.1. Information leaks from clients
> 
>   If a client asks you for a set of microdescs, then you know she didn't
>   have them cached before. How much does that leak? What about when
>   we're all using our entry guards as directory guards, and we've seen
>   that user make a bunch of circuits already?
> 
>   Fetching "all" when you need at least half is a good first order fix,
>   but might not be all there is to it.
> 
>   Another future option would be to fetch some of the microdescriptors
>   anonymously (via a Tor circuit).

Are these leaks worse than leaks from descriptor downloading?  If so,
how?

> 4. Transition and deployment
> 
>   Phase one, the directory authorities should start voting on
>   microdescriptors and microdescriptor elements, and putting them in the
>   consensus. This should happen during the 0.2.1.x series, and should
>   be relatively easy to do.

As we discussed on IRC, I believe this should wait till 0.2.2.x.
Getting the authorities onto newer versions is comparatively easy, and
0.2.1.x is in feature freeze now.  If it's important to prototype it
earlier, I can try to do that in a non-merged branch.

>   Phase two, directory mirrors should learn how to serve them, and learn
>   how to read the consensus to find out what they should be serving. This
>   phase could be done either in 0.2.1.x or early in 0.2.2.x, depending
>   on how messy it turns out to be and how quickly we get around to it.
> 
>   Phase three, clients should start fetching and caching them instead
>   of normal descriptors. This should happen post 0.2.1.x.

yrs,
-- 
Nick Mathewson