Anonymity-preserving collection of usage data of a hidden service authoritative directory

Roger Dingledine arma at mit.edu
Wed May 2 23:36:23 UTC 2007


On Wed, May 02, 2007 at 09:51:08PM +0200, Karsten Loesing wrote:
> > Here are early results from moria1:

And here are the more steady-state results, after we've been up a
few days:

publish-total-history 2007-05-02 22:25:27 (900 s) 451,314,296,356,400,328,279,285,331,307,246,375,459,253,346,218,460,280,344,260,353,306,351,309,441,228,391,460,397,276,293,352,398,331,375,412,462,328,362,259,435,308,436,278,354,408,331,375,382,346,352,417,351,503,290,411,367,495,354,371,405,450,444,331,377,396,379,285,417,601,373,321,366,339,369,355,403,392,318,315,446,318,419,365,293,468,343,382,428,434,367,349,367,301,379,334
publish-novel-history 2007-05-02 22:25:27 (900 s) 1,0,1,1,0,1,0,2,3,1,2,5,0,0,0,0,0,1,0,1,0,0,0,0,4,2,0,3,1,0,2,0,1,0,3,1,0,0,0,0,0,0,2,1,1,0,0,3,1,0,2,4,2,0,0,0,0,0,1,4,2,0,0,0,1,1,0,1,2,0,0,0,3,2,0,1,1,1,0,0,1,0,3,1,0,1,0,0,0,0,0,0,0,0,0,0
publish-top-5-percent-history 2007-05-02 22:25:27 (900 s) 83,59,60,64,74,56,52,60,68,56,47,58,83,42,55,45,91,64,65,56,65,74,71,68,87,58,75,92,76,56,56,83,99,75,80,77,99,71,81,53,97,74,87,64,91,73,87,79,71,75,80,101,76,98,66,109,86,96,92,92,89,98,96,92,77,101,76,72,91,116,81,65,63,77,68,79,88,86,51,70,84,57,95,67,64,90,72,79,85,79,83,70,73,64,81,64
publish-top-10-percent-history 2007-05-02 22:25:27 (900 s) 135,92,88,106,116,86,80,90,105,85,69,102,139,67,91,61,143,92,102,84,100,105,107,98,134,78,114,140,115,85,84,119,135,109,116,123,146,103,118,83,142,107,131,94,124,118,120,119,114,108,120,145,115,159,96,149,118,159,129,131,128,143,146,126,113,142,114,99,140,192,118,98,102,111,107,118,131,124,83,102,132,90,148,106,91,132,109,120,130,125,121,111,109,94,119,95
publish-top-20-percent-history 2007-05-02 22:25:27 (900 s) 210,136,120,167,175,139,116,129,147,119,96,170,214,101,143,89,226,121,148,119,144,146,153,144,200,102,178,218,174,122,124,169,175,161,170,188,217,155,168,121,210,148,200,135,159,185,153,180,173,156,168,211,162,246,133,202,161,244,177,182,186,209,214,170,164,199,173,138,198,303,174,145,156,149,162,175,187,183,127,146,207,137,212,168,124,212,159,178,202,198,172,167,167,136,174,153
fetch-total-history 2007-05-02 22:25:27 (900 s) 29,46,42,46,51,49,40,32,33,23,42,40,31,24,31,21,23,21,34,24,22,33,30,26,53,41,43,35,24,42,44,68,34,33,28,36,16,20,30,42,63,30,26,23,24,12,18,26,25,14,41,17,18,20,15,32,26,22,23,39,15,22,26,34,23,24,14,11,21,26,17,13,12,16,18,13,9,14,14,13,13,33,34,16,24,19,25,31,28,23,36,30,32,52,54,69
fetch-successful-history 2007-05-02 22:25:27 (900 s) 15,22,15,20,21,20,23,14,14,16,19,19,17,8,19,14,16,11,16,13,10,24,11,16,23,22,17,20,14,23,25,18,16,15,15,22,9,10,21,22,30,13,14,9,14,8,9,10,12,6,19,6,15,10,5,20,16,15,14,24,10,12,14,18,12,16,11,8,13,20,8,8,7,11,10,11,7,11,7,9,10,24,23,5,16,11,17,23,17,17,27,16,21,23,37,59
fetch-top-5-percent-history 2007-05-02 22:25:27 (900 s) 9,16,14,19,25,23,13,11,8,4,10,7,6,9,9,4,6,5,13,5,4,14,11,6,11,7,8,10,4,8,9,20,6,9,4,8,3,3,5,10,17,5,4,7,3,2,5,4,6,3,14,2,4,4,5,12,7,4,3,12,3,4,3,11,3,4,3,3,4,4,7,2,3,6,6,3,2,3,7,3,3,7,9,8,6,4,5,9,6,6,10,8,8,16,22,40
fetch-top-10-percent-history 2007-05-02 22:25:27 (900 s) 13,20,22,22,28,27,16,16,15,6,14,10,8,14,12,7,9,8,16,9,8,19,15,9,18,12,12,12,7,11,12,27,8,12,7,11,6,5,9,12,22,8,7,9,5,4,7,7,8,3,18,4,7,7,5,15,10,6,6,14,5,6,6,13,5,7,6,3,6,7,7,4,3,6,6,3,2,5,7,3,3,13,16,8,9,7,10,13,8,9,15,11,12,20,27,46
fetch-top-20-percent-history 2007-05-02 22:25:27 (900 s) 15,26,26,27,32,32,21,20,21,8,20,16,12,16,16,10,12,10,18,13,10,21,19,13,24,19,18,16,9,17,18,39,12,16,10,17,8,8,14,18,32,13,11,11,9,5,9,11,10,6,23,6,10,10,8,17,13,8,10,18,7,10,10,17,9,11,7,5,8,10,10,6,5,9,11,5,4,7,10,6,6,18,21,11,12,9,13,19,12,12,21,15,15,26,32,51
desc-total-history 2007-05-02 22:25:27 (900 s) 883,883,884,885,885,886,886,888,891,892,894,899,899,899,899,899,899,900,900,901,901,901,901,901,905,907,907,910,911,911,913,913,914,914,917,918,918,918,918,918,918,918,920,921,922,922,922,925,926,926,928,932,934,934,934,934,934,934,935,939,941,941,941,941,942,943,943,944,946,946,946,946,949,951,951,952,953,954,954,954,955,955,958,959,959,960,960,960,960,960,960,960,960,960,960,960

> > The novel descriptors, and failed fetches, are high at the beginning
> > because it had just started up so it didn't have any yet. Hard to
> > guess what steady-state will be.
> 
> Sure, the first 10 rows or so might result from restarting the directory
> node. But from there on it looks like it has stabilized. Hidden services
> publish their descriptors once an hour, don't they?

Yep -- every RendPostPeriod, which is 1 hour by default.

They also republish whenever they consider their descriptor to be
"dirty", which happens when they establish a new introduction point
(rend_service_intro_established()) or give up on and drop an introduction
point (rend_services_introduce()). This 'dirty' part is what I meant
when I was pondering if a few hidden services have unstable connections,
and thus change their intro points a lot.
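
(To make the two republish triggers concrete, here is a minimal sketch in
C. The struct, its fields, and should_republish() are hypothetical names
invented for illustration, not the ones in the Tor source; the only facts
taken from this thread are the one-hour default period and the "dirty"
flag.)

```c
#include <assert.h>
#include <time.h>

/* Default republish period: 1 hour, as stated above for RendPostPeriod. */
#define POST_PERIOD_DEFAULT (60*60)

/* Hypothetical per-service state; not the real Tor structure. */
typedef struct {
  int desc_is_dirty;   /* set when an intro point was added or dropped */
  time_t last_upload;  /* when the descriptor was last published */
} service_state_t;

/* Return 1 if the service should publish its descriptor now, else 0. */
int
should_republish(const service_state_t *s, time_t now, int post_period)
{
  assert(s);
  if (s->desc_is_dirty)
    return 1;                                 /* intro points changed */
  return (now - s->last_upload) >= post_period; /* periodic refresh */
}
```

A service whose intro circuits keep breaking would hit the first branch
over and over, which is the pattern the top-N-percent histories are meant
to expose.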

> (Well, that was easy
> to see even without looking into the spec by the decreasing number of
> novel publications after the first four intervals = 60 minutes.) So it's
> very unlikely that there will be many novel publications after the shown
> intervals.

Yep. They will be people creating a new hidden service, or people
turning on their Tor after it's been off for a while. As we see above,
there are at most a handful in each 15 minute period.
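
(If you want to check claims like this without eyeballing the lines, a
small parser is enough. This is only a sketch: summarize_history() and
history_summary_t are names I made up, and it assumes the counts follow
the first ") " on the line, as in the output above.)

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Summary of one "*-history" line. */
typedef struct {
  int n_intervals;  /* number of 15-minute intervals on the line */
  long total;       /* sum over all intervals */
  long max;         /* largest single-interval count */
} history_summary_t;

/* Parse the comma-separated counts at the end of a history line, e.g.
 * "publish-novel-history 2007-05-02 22:25:27 (900 s) 1,0,1,1".
 * Returns 0 on success, -1 if no ") " marker precedes the counts. */
int
summarize_history(const char *line, history_summary_t *out)
{
  const char *cp;
  char *end;
  assert(line && out);
  out->n_intervals = 0;
  out->total = 0;
  out->max = 0;
  cp = strstr(line, ") ");
  if (!cp)
    return -1;
  cp += 2;  /* skip past ") " to the first count */
  for (;;) {
    long v = strtol(cp, &end, 10);
    if (end == cp)
      break;  /* no digits left */
    out->n_intervals++;
    out->total += v;
    if (v > out->max)
      out->max = v;
    if (*end != ',')
      break;  /* last value on the line */
    cp = end + 1;
  }
  return 0;
}
```

Running it over publish-novel-history gives the per-interval maximum, and
dividing the fetch-successful-history total by the fetch-total-history
total gives a rough success rate.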

> The only thing that does not stabilize (yet) is the total number of
> descriptors. This should come from the fact that lease times for
> descriptors are very much higher than republication times (24 hours vs.
> 1 hour, right?).

#define REND_CACHE_MAX_AGE (2*24*60*60)
#define REND_CACHE_MAX_SKEW (24*60*60)

  cutoff = time(NULL) - REND_CACHE_MAX_AGE - REND_CACHE_MAX_SKEW;

So that actually appears to be 3 days.
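
(Spelled out as a sketch: the two #defines are the ones quoted above, but
the helper functions and the strict "<" comparison are my assumptions for
illustration, not a copy of the cache-cleaning code.)

```c
#include <time.h>

#define REND_CACHE_MAX_AGE (2*24*60*60)   /* 2 days, in seconds */
#define REND_CACHE_MAX_SKEW (24*60*60)    /* 1 day of allowed clock skew */

/* Hypothetical helper: oldest descriptor timestamp still kept at 'now'. */
time_t
rend_cache_cutoff(time_t now)
{
  /* 172800 + 86400 = 259200 seconds = 3 days before 'now'. */
  return now - REND_CACHE_MAX_AGE - REND_CACHE_MAX_SKEW;
}

/* Hypothetical helper: 1 if a descriptor with this timestamp would be
 * discarded, assuming a strict comparison against the cutoff. */
int
rend_desc_is_expired(time_t desc_timestamp, time_t now)
{
  return desc_timestamp < rend_cache_cutoff(now);
}
```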

But hey, at least we remove old ones sometime, rather than just collecting
them forever. :)

(Remember that this same logic is used by *clients* to discard old service
descriptors, and we have many fewer guarantees that their clocks are at
all correct. That's what the MAX_SKEW business is about.)

> Doesn't that mean that the increase in total
> descriptors from the fifth interval on only comes from descriptors that
> have not been refreshed and represent probably offline hidden services?
> That would mean that 145 (=803-658) or 18% of the descriptors in the
> last interval are useless (or even more if the total number of
> descriptors increases further, which is likely the case). Wouldn't it
> make more sense to synchronize publication intervals and lease times?
> Was that what you meant by "artifacts"? Why would a client expect that
> a hidden service with a 23-hour old descriptor is online if it knows
> that it should have republished every hour?

Well, if the client's clock is wrong by 23 hours, ...

But you're right, the servers storing the descriptors should be assumed
to have better clocks, and they could just dump old ones to save clients
the trouble.

Of course, the real reason hidden services republish every hour is
because the directory authorities don't store anything to disk and don't
share service descriptors among each other -- so every time we restart
a directory authority it forgets about all hidden services. This means
they need to republish frequently just in case an authority restarts.
If we made some way for service descriptors to survive a restart (e.g.
by storing them to disk, replicating them, or both), then it seems to
me we would reduce the need to republish dramatically.

> In a decentralized design I
> suggest cutting the lease time down to one hour (or maybe 1.5 hours).
> This saves resources for replicating descriptors in case of
> leaving/joining routers.

This is an interesting tradeoff. I'm not sure if it's better to demand
frequent "I'm still here" messages from the hidden services, so you can
quickly drop the ones that don't send one, or to be more flexible and
let them go long periods with the same intro points and never need to
send an update.

I guess if we want to get extra complex then somebody could try connecting
to the hidden service and only dump the descriptor if it's unreachable
-- but that probably doesn't play well with our authentication or
authorization tricks, nor with the valet node and related designs.

> > But the first thing to note is that
> > the total number of fetches is really not that high.
> 
> At least the number of fetches needs to be multiplied by five, because
> requests should be (more or less) equally distributed among directories.

Actually, three. Only "v1" directory authorities handle hidden service
stuff, and that's just moria1, moria2, and tor26 right now.

> Though these numbers still are not as high as I expected, it is very
> interesting to have some absolute numbers.

Yep. This number seems to represent the total count of people interacting
with a given hidden service, but remember that it doesn't represent
the total number of rendezvous attempts -- since clients cache the
descriptors.

Though note in connection_ap_handshake_rewrite_and_attach() that clients
try to refetch a newer descriptor if the one they have cached is more
than 15 minutes old. Are you following all the details so far? :)
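
(As a sketch of that staleness check: the constant name and the function
below are hypothetical, chosen for illustration, and only the 15-minute
threshold comes from the paragraph above.)

```c
#include <time.h>

/* Hypothetical name for the 15-minute threshold mentioned above. */
#define DESC_REFETCH_AGE (15*60)

/* Return 1 if a cached descriptor fetched at 'fetched_at' is more than
 * 15 minutes old at 'now', so the client would try to refetch it. */
int
cached_desc_is_stale(time_t fetched_at, time_t now)
{
  return (now - fetched_at) > DESC_REFETCH_AGE;
}
```

So a client that talks to the same service all day would still show up
in the fetch counts a few times per hour, rather than once.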

> > The second thing
> > to note is to start wondering why a few services publish so often --
> > is it because their intro circuits break often, maybe because they have
> > a poor network connection, so they feel the need to republish?
> 
> To be honest, I don't know yet if these numbers are really high or not.
> What is high and what is low? Does low mean that all services publish
> equally often, and high means that all services but one publish only
> once and the remaining service publishes all the other times?

Yep, that's (the extreme version of) the scenario I had in mind.

> I think I
> need to read a good statistics book to learn how to evaluate such data.
> When writing the spec, the percent-histories were just a goodie, and I
> wanted to implement something more complex than a counter in C to see if
> I have problems with the implementation stuff. ;) But you are right, if
> that number is (too) high, we should try to find out why.
> 
> > And the
> > third is to remember that hidden services suck right now, so we shouldn't
> > take the current usage pattern to be the requirements for future hidden
> > services. :)
> 
> Then my question is: Why do hidden services suck right now? Do you mean
> performance? Yes, that could be improved. In an earlier evaluation I
> found that connection establishment after having downloaded the
> descriptor takes 5.39 +- 12.4 seconds, i.e. with an acceptable mean, but
> a huge variance.

Right, it's this part. There is something that is making the rendezvous
itself be very slow. I'm not sure what it is. There's no need for it
to be as slow as it is. And I think it really reduces the set of people
who think hidden services are neat.

> Afterwards, message round-trip times were 2.32 +- 1.66
> seconds, i.e. acceptable after all.
> 
> Or are there other reasons why they suck? Unclear security properties?
> Too complicated setup? The need for Tor on the client side? What do you
> think?

Well, there are unclear security properties, but I don't think that
bothers most users or most people offering the hidden services. The
setup is really easy on the client side, which is the important part.
I am mainly thinking of the highly variable performance.

> Anyway, even if the current usage pattern does not really justify
> distributing storage of rendezvous service descriptors, future
> applications of hidden services might do so. Or the other way round, new
> applications that would not be reasonable in a centralized storage can
> be made possible in a decentralized one. That keeps me optimistic. :)

Right.

Also, scaling questions aside, there are other reasons to distribute
hidden service descriptors and improve their availability.

So what more data might we want to collect about current usage patterns?
Or is this enough to move on to the next steps, which are to think about an
ASCII format for descriptors (rather than the awful binary format I was
dumb enough to use back when we started), think about the implications
of letting strangers see and serve all the descriptors, and think about
a protocol for receiving, serving, and replicating descriptors?

Fun fun,
--Roger


