[tor-dev] Remote descriptor fetching

Wed Jun 5 14:29:50 UTC 2013

On 6/1/13 9:18 PM, Damian Johnson wrote:
> Thanks Karsten, thanks Kostas! It's a little disturbing that moria1 is
> providing truncated responses but guess we'll dig into that more
> later.

Indeed, this would be pretty bad.  I'm not convinced that moria1
provides truncated responses though.  It could also be that it
compresses results for every new request and that compressed responses
randomly differ in size, but are still valid compressions of the same
input.  Kostas, do you want to look more into this and open a ticket if
this really turns out to be a bug?
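
One way to tell the two cases apart, sketched here under the assumption
that the directory serves plain zlib streams (the payload and the helper
name are made up for illustration): two valid compressions of the same
input can differ in size yet decompress to identical bytes, whereas a
truncated stream fails to decompress at all.

```python
import zlib

# Made-up stand-in for a descriptor document; not real consensus data.
payload = b"".join(b"made-up descriptor line %d\n" % i for i in range(2000))

# Two valid compressions of the same input; their sizes may differ.
fast = zlib.compress(payload, 1)
best = zlib.compress(payload, 9)


def is_complete(blob):
    """Hypothetical helper: True if blob is a full zlib stream."""
    try:
        zlib.decompress(blob)
        return True
    except zlib.error:
        # zlib rejects incomplete or truncated streams.
        return False


print(len(fast), len(best))
print(zlib.decompress(fast) == zlib.decompress(best))
print(is_complete(best[:len(best) // 2]))
```

So comparing the decompressed content of two responses, rather than
their sizes on the wire, should show whether anything is really missing.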

> Great points about needing a more flexible downloader. Here's another
> attempt, this time with a DescriptorDownloader class that's a bit
> similar to our present DescriptorReader...
> 
> https://trac.torproject.org/projects/tor/wiki/doc/stem#RemoteDescriptorFetching
> 
> It still feels a little clunky to me, and I'm not yet sure how best to
> handle the use case you mentioned concerning votes. Thoughts?
> 
> Feel free to edit the page (that's what wikis are there for!). -Damian

So, this isn't the super smart downloader that I had in mind, but maybe
there should still be some logic left in the application using this API.
I can imagine how both DocTor and metrics-db-R could use this API with
some modifications.  A few comments/suggestions:

- There could be two methods get/set_compression(compression) that
define whether to use compression.  Assuming we get it working.

- If possible, the downloader should support parallel downloads, with at
most one concurrent download per directory, while still being able to
ask multiple directories at the same time.  There could be two methods
get/set_max_parallel_downloads(max) with a default of 1.

- I'd want to set a global timeout for all things requested from the
directories, so a get/set_global_timeout(seconds) would be nice.  The
downloader could throw an exception when the global download timeout
elapses.  I need such a timeout for hourly running cronjobs to prevent
them from overlapping when things are really, really slow.

- Just to be sure, get/set_retries(tries) is meant for each endpoint, right?

- I don't like get_directory_mirrors() as much, because it does two
things: it makes a network request and parses the response.  I'd prefer
a method use_v2dirs_as_endpoints(consensus) that takes a consensus
document and uses the contained v2dirs as endpoints for future
downloads.  The documentation could suggest using this approach to move
some load off the directory authorities and onto directory mirrors.

- Related note: I always check whether the Dir port is non-zero to
decide whether a relay is a directory.  Not sure if that differs from
looking at the V2Dir flag.

- All methods starting at get_consensus() should be renamed to fetch_*
or query_* to make it clear that these are not getters but perform
actual network requests.

- All methods starting at get_consensus() could have an additional
parameter for the number of copies (from different directories) to
download.  The default would be 1.  But in some cases people might be
interested in having 2 or 3 copies of a descriptor to compare if there
are any differences, or to compare download times (more on this below).
Also, a special value of -1 could mean to download every requested
descriptor from every available directory.  That's what I'd do in DocTor
to download the consensus from all directory authorities.

- As for download times, is there a way to include download metadata in
the result of get_consensus() and friends?  I'd be interested in the
directory that a descriptor was downloaded from and in the download time
in milliseconds.  This is similar to how I'm interested in file metadata
in the descriptor reader, like file name or last modified time of the
file containing a descriptor.

- Can you add a fetch|query_votes(fingerprints) method to request vote
documents?
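
To make some of the suggestions above more concrete, here is a rough
sketch of how a global timeout, per-endpoint retries, a parallel
download limit, a copies parameter and download metadata could fit
together.  Everything in it is hypothetical: DownloaderSketch, Download,
GlobalTimeout, fetch_copies() and the pluggable fetch callable are names
I made up for illustration, not anything stem provides, and the actual
network code is deliberately left out.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait
from dataclasses import dataclass


@dataclass
class Download:
    resource: str
    directory: str   # endpoint this copy came from
    content: bytes
    millis: float    # download time in milliseconds


class GlobalTimeout(Exception):
    """Raised when the global download timeout elapses."""


class DownloaderSketch:
    def __init__(self, endpoints, fetch, max_parallel_downloads=1,
                 retries=0, global_timeout=None):
        self.endpoints = list(endpoints)   # assumed to be distinct directories
        self.fetch = fetch                 # callable(directory, resource) -> bytes
        self.max_parallel_downloads = max_parallel_downloads
        self.retries = retries             # extra attempts per endpoint
        # Assumption: the global clock starts when the downloader is created.
        self.deadline = (None if global_timeout is None
                         else time.monotonic() + global_timeout)

    def _fetch_one(self, directory, resource):
        last_error = None
        for _ in range(self.retries + 1):
            start = time.monotonic()
            try:
                content = self.fetch(directory, resource)
            except OSError as exc:
                last_error = exc
                continue
            millis = (time.monotonic() - start) * 1000.0
            return Download(resource, directory, content, millis)
        raise last_error

    def fetch_copies(self, resource, copies=1):
        # copies=-1 means "one copy from every available directory".
        targets = self.endpoints if copies == -1 else self.endpoints[:copies]
        pool = ThreadPoolExecutor(max_workers=self.max_parallel_downloads)
        try:
            # One task per directory, so no directory is asked twice at once.
            futures = [pool.submit(self._fetch_one, d, resource) for d in targets]
            remaining = (None if self.deadline is None
                         else max(0.0, self.deadline - time.monotonic()))
            done, pending = wait(futures, timeout=remaining)
            if pending:
                for future in pending:
                    future.cancel()   # only stops downloads that have not started
                raise GlobalTimeout("global download timeout elapsed")
            return [future.result() for future in futures]
        finally:
            pool.shutdown(wait=False)
```

With the directory authorities as endpoints, DocTor could then call
something like fetch_copies("consensus", copies=-1) and compare the
returned copies.  Note that this sketch cannot interrupt a download that
is already running when the timeout hits; a real implementation would
also need socket-level timeouts for that.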

All in all, looks like a fine API.  When can I use it? :D

Thanks!

Best,
Karsten