[or-cvs] r9993: Describe a simpler implementation for proposal 108, and note (in tor/trunk: . doc/spec/proposals)

Wed Apr 25 19:16:28 UTC 2007

On Fri, Apr 20, 2007 at 01:17:15PM -0400, nickm at seul.org wrote:
>    tor/trunk/doc/spec/proposals/108-mtbf-based-stability.txt
[snip]
> +Alternative:
> +
> +   "A router's Stability shall be defined as the sum of $alpha ^ d$ for every
> +   $d$ such that the router was not observed to be unavailable $d$ days ago."
> +
> +   This allows a simpler implementation: every day, we multiply yesterday's
> +   Stability by alpha, and if the router was running for all of today, we add
> +   1.

I don't think you mean quite that. For a server that just appeared,
there are an infinite number of previous days where it was not observed
to be unavailable. Do you mean 'was observed to be available'? And by
available, do we mean available for the entire day?
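
To make the difference concrete, here is a minimal sketch of the daily
update from the quoted text, next to the score a never-seen-down router
would get under the literal "not observed to be unavailable" wording
(counting today as d = 0, as the update rule implies). The function
names and the value alpha = 0.9 are assumptions for illustration only;
the proposal does not fix either.

```python
ALPHA = 0.9  # assumed decay factor; the proposal does not fix a value


def update_stability(yesterday, up_all_day, alpha=ALPHA):
    """One daily step: decay yesterday's score, add 1 for a fully-up day."""
    return alpha * yesterday + (1.0 if up_all_day else 0.0)


def stability_from_history(days, alpha=ALPHA):
    """Fold the daily update over a history of booleans, oldest day first."""
    score = 0.0
    for up in days:
        score = update_stability(score, up, alpha)
    return score


def never_seen_down_limit(alpha=ALPHA):
    """Sum of alpha**d over d = 0, 1, 2, ...: the score under the literal
    wording for a router that was never observed to be unavailable."""
    return 1.0 / (1.0 - alpha)
```

With the "observed to be available" reading, a router seen up for only
three days scores 1 + 0.9 + 0.81 = 2.71. Under the literal reading, a
router that appeared today would already sit at the limit
1/(1 - 0.9) = 10.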

What are some ways we can choose \alpha?

> +Limitations:
> +
> +   Authorities can have false positives and false negatives when trying to
> +   tell whether a router is up or down.  So long as these aren't terribly
> +   wrong, and so long as they aren't significantly biased, we should be able
> +   to use them to estimate stability pretty well.

I haven't seen any discussion about how the router's declared uptime fits
into this. If a router goes down and then comes up again in between
measurements, the proposed approach will treat it as being up the
whole time -- yet any connections through it will have been broken. One approach
to handling this would be to notice if the uptime decreases from one
descriptor to the next. This would indicate a self-declared downtime
for the router, and we can just factor that into the calculations.

I'm not sure how we should compute the length of the downtime though:
in some cases it will be just a split second as for a reboot or upgrade,
but in others maybe the computer, network, or Tor process went down
and then came back a long time later. I guess since our computations
are just rough approximations anyway, we can just assume a zero-length
downtime unless our active testing also noticed it.
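
A rough sketch of the uptime-decrease check, assuming we keep each
router's descriptors as (published, uptime) pairs in seconds, sorted by
publication time. That layout and the name find_restarts are
hypothetical, not taken from the Tor source.

```python
def find_restarts(descriptors):
    """Return the publication times of descriptors whose declared uptime
    is lower than in the previous descriptor.

    descriptors: list of (published, uptime) pairs in seconds, sorted by
    publication time (hypothetical layout). Each returned time stands for
    a self-declared restart, treated as a zero-length downtime.
    """
    restarts = []
    for (_, prev_uptime), (published, uptime) in zip(descriptors, descriptors[1:]):
        if uptime < prev_uptime:
            # Uptime went backwards, so the router restarted in between.
            restarts.append(published)
    return restarts
```

For example, find_restarts([(0, 1000), (3600, 4600), (7200, 120)])
reports one restart, attached to the descriptor published at 7200.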

Speaking of the active testing, here's what we do right now:

Every 10 seconds, we call dirserv_test_reachability(), and it tries making
connections to a different 1/128 of the router list. So a given router
gets tried every 1280 seconds, or a bit over 21 minutes. We declare a
router to be unreachable if it has not been successfully found reachable
within the past 45 minutes. So at least two testing periods have to go
by before a running router is considered to be no longer running.
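
The arithmetic above, as a small sketch. The constants come from the
numbers in this mail; the names and the is_running helper are
assumptions for illustration and do not mirror the actual dirserv code.

```python
TEST_INTERVAL = 10           # seconds between reachability test calls
NUM_SLICES = 128             # each call tests 1/128 of the router list
REACHABLE_TIMEOUT = 45 * 60  # seconds without a success before "unreachable"

CYCLE = TEST_INTERVAL * NUM_SLICES  # seconds between tests of one router


def is_running(last_success, now):
    """A router counts as running if it was found reachable recently."""
    return now - last_success <= REACHABLE_TIMEOUT


def failed_tests_before_down(cycle=CYCLE, timeout=REACHABLE_TIMEOUT):
    """Number of tests a router can fail, after going down right after a
    successful test, while it is still listed as running."""
    return timeout // cycle
```

Here CYCLE is 1280 seconds, and a router last seen at t = 0 fails its
tests at t = 1280 and t = 2560 before the 2700-second timeout marks it
as down.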

So our measurements won't be perfect, but I think this approach is a
much better one than just blindly believing the uptime entry in the
router descriptor.

What is our plan for storing (and publishing?) the observed uptime
periods for each router?

--Roger