[metrics-bugs] #32265 [Metrics/Exit Scanner]: MS: Format an exit list from a previous exit list and exitmap output

Wed Nov 20 13:42:44 UTC 2019

#32265: MS: Format an exit list from a previous exit list and exitmap output
----------------------------------+------------------------------
 Reporter:  irl                   |          Owner:  irl
     Type:  task                  |         Status:  needs_review
 Priority:  Medium                |      Milestone:
Component:  Metrics/Exit Scanner  |        Version:
 Severity:  Normal                |     Resolution:
 Keywords:                        |  Actual Points:
Parent ID:  #29654                |         Points:
 Reviewer:  karsten               |        Sponsor:
----------------------------------+------------------------------

Comment (by karsten):

 Replying to [comment:6 irl]:
 > Replying to [comment:5 karsten]:
 > > Glad to see that the rewrite is progressing so quickly!
 > >
 > > Couple remarks/questions:
 > >  - Why 48 hours and not 24 hours? Doesn't the current exit scanner
 > >    keep scan results for 24 hours? I might be wrong, though. Let's use
 > >    whatever the current scanner does.
 >
 > https://2019.www.torproject.org/tordnsel/exitlist-spec.txt
 >
 > It discards relays that were not seen in the last 48 hours in a
 > consensus.

 Okay, let's use 48 hours then.
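
 A minimal sketch of that 48-hour rule, assuming each exit list entry is
 held as a dict with a `last_status` datetime. The dict layout and the
 names `MAX_AGE` and `prune_stale` are hypothetical, not taken from the
 scanner or from Stem:

```python
from datetime import datetime, timedelta, timezone

# Retention window discussed above (assumption: compared against LastStatus).
MAX_AGE = timedelta(hours=48)


def prune_stale(entries, now):
    """Keep only entries whose last_status is at most MAX_AGE older than now.

    entries maps a relay fingerprint to a dict with a "last_status" datetime.
    """
    return {
        fingerprint: entry
        for fingerprint, entry in entries.items()
        if now - entry["last_status"] <= MAX_AGE
    }


if __name__ == "__main__":
    now = datetime(2019, 11, 20, 13, 0, tzinfo=timezone.utc)
    entries = {
        "A" * 40: {"last_status": now - timedelta(hours=47)},  # kept
        "B" * 40: {"last_status": now - timedelta(hours=49)},  # dropped
    }
    print(sorted(prune_stale(entries, now)))
```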

 > >  - Rather than downloading exit lists from CollecTor, wouldn't it be
 > >    sufficient to just read the latest exit list previously written by
 > >    this scanner? And if there's none, just assume that no previous
 > >    scans have happened. In theory, this should be all we need to learn.
 >
 > Probably, but this was a handy way to get test data and I wanted to try
 > out the new Stem functionality. It would be nice to have a method to
 > bootstrap a new scanner but this could just mean manually downloading
 > the latest exit list and putting it in the right place.

 Actually, I think it's harmful to download exit lists from CollecTor and
 merge them with the scanner's own measurements. We should instead merge
 new scan results with previous local results. Downloading from CollecTor
 is also yet another dependency that is not really needed. I'd say kill
 this code.
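
 To illustrate what merging with previous local results could look like,
 here is a rough sketch. It assumes both the previous results and the new
 scan are plain dicts mapping a fingerprint to `{exit_address: scan_time}`.
 That layout and the name `merge_results` are made up for this example, and
 an empty dict stands in for "no previous scans have happened":

```python
from datetime import datetime, timezone


def merge_results(previous, new_scan):
    """Merge new measurements into previous local results.

    Both arguments map fingerprint -> {exit_address: scan_time}. For an
    address seen in both, the more recent scan_time wins. Neither input
    is modified.
    """
    merged = {fp: dict(addresses) for fp, addresses in previous.items()}
    for fp, addresses in new_scan.items():
        known = merged.setdefault(fp, {})
        for address, scan_time in addresses.items():
            if address not in known or scan_time > known[address]:
                known[address] = scan_time
    return merged


if __name__ == "__main__":
    old = datetime(2019, 11, 20, 12, 0, tzinfo=timezone.utc)
    new = datetime(2019, 11, 20, 13, 0, tzinfo=timezone.utc)
    previous = {"A" * 40: {"192.0.2.1": old}}
    new_scan = {"A" * 40: {"192.0.2.1": new}, "B" * 40: {"192.0.2.2": new}}
    print(merge_results(previous, new_scan))
    # A first run without any previous exit list simply starts from {}.
    print(merge_results({}, new_scan) == new_scan)
```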

 > >  - It seems that `LastStatus` is only taken from exit lists downloaded
 > >    from CollecTor but never set by new measurements. We should make a
 > >    plan what to do with this field. Take it out? Populate it with
 > >    consensus valid-after times?
 >
 > Right, this is the tricky bit. Do you know if anything consumes the
 > LastStatus or Published timestamps? Ideally we could just drop these but
 > for now I'm synthesizing them from the timestamp of the last measurement
 > which could be close enough for the consumers.

 Well, the spec says what these fields are being used for: `Published` is
 used to skip relays that haven't published a new descriptor since the one
 in the current consensus, and `LastStatus` is used to know when to throw
 out relays from the list. This is all under the assumption that the
 scanner reads its previous exit list from disk before making measurements.

 My suggestion would be to use the consensus valid-after time as the
 `LastStatus` time. It's pretty much the same as the `published` time in a
 version 2 status, and it would work for this purpose.
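
 As a sketch of that suggestion: entries for relays listed in the current
 consensus get the valid-after time as `last_status`, and entries for relays
 that are not listed keep their old value so they can age out later. The
 function name, the dict layout, and passing the consensus as a set of
 fingerprints are all assumptions made for the example:

```python
from datetime import datetime, timezone


def refresh_last_status(entries, consensus_fingerprints, valid_after):
    """Return a copy of entries with last_status refreshed for listed relays.

    entries maps fingerprint -> dict with a "last_status" datetime.
    Relays missing from consensus_fingerprints keep their old last_status.
    """
    refreshed = {}
    for fingerprint, entry in entries.items():
        entry = dict(entry)  # copy, so the caller's data is left untouched
        if fingerprint in consensus_fingerprints:
            entry["last_status"] = valid_after
        refreshed[fingerprint] = entry
    return refreshed


if __name__ == "__main__":
    earlier = datetime(2019, 11, 19, 9, 0, tzinfo=timezone.utc)
    valid_after = datetime(2019, 11, 20, 13, 0, tzinfo=timezone.utc)
    entries = {
        "A" * 40: {"last_status": earlier},  # still in the consensus
        "B" * 40: {"last_status": earlier},  # no longer listed
    }
    result = refresh_last_status(entries, {"A" * 40}, valid_after)
    for fingerprint, entry in sorted(result.items()):
        print(fingerprint[:8], entry["last_status"].isoformat())
```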

 > >  - Does exitmap with the plugin use previous scans as input to decide
 > >    which relays to scan? I believe that it uses some logic to avoid
 > >    scanning relays too frequently. This has two effects: it doesn't
 > >    generate more load on the network and on single relays than
 > >    necessary, and it ensures that new relays are scanned sooner. As a
 > >    result, the new scanner could be run once or twice per hour, rather
 > >    than every 2 or 3 hours (at 45 minutes runtime).
 >
 > No. It scans the entire network every time. It does this asynchronously,
 > and doesn't try to prioritize anything. Just whichever circuits are
 > built first will be tested first. I was even thinking it could run
 > continuously. If exit relays cannot cope with two HTTP requests an hour,
 > perhaps they shouldn't be exit relays.

 Ideally, we would change as few variables at the same time as possible,
 so that we can compare the new results with the old ones. Changing the
 scheduling from "only scan relays with changed descriptors" to "scan all
 relays once per hour" seems like a major design change that we could make
 at a later time.
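
 For reference, the "only scan relays with changed descriptors" rule could
 be sketched roughly like this. It assumes we know the descriptor published
 time per fingerprint, both from the current consensus and from the
 `Published` lines of the previous exit list. The function name and the two
 dict arguments are hypothetical, and this is not how exitmap selects relays
 today:

```python
from datetime import datetime, timezone


def relays_to_scan(consensus_published, previous_published):
    """Return fingerprints that are new or have published a newer descriptor.

    Both arguments map fingerprint -> descriptor published time (datetime).
    Relays whose descriptor is unchanged since the previous scan are skipped.
    """
    return sorted(
        fingerprint
        for fingerprint, published in consensus_published.items()
        if fingerprint not in previous_published
        or published > previous_published[fingerprint]
    )


if __name__ == "__main__":
    t1 = datetime(2019, 11, 20, 6, 0, tzinfo=timezone.utc)
    t2 = datetime(2019, 11, 20, 12, 0, tzinfo=timezone.utc)
    previous = {"A" * 40: t1, "B" * 40: t1}
    consensus = {
        "A" * 40: t1,  # unchanged descriptor, skipped
        "B" * 40: t2,  # newer descriptor, scanned
        "C" * 40: t2,  # never scanned before, scanned
    }
    print([fp[:8] for fp in relays_to_scan(consensus, previous)])
```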

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/32265#comment:8>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online