[tor-bugs] #16151 [metrics-lib]: Add new descriptor source to fetch descriptors from CollecTor via https

Sat May 23 17:15:28 UTC 2015

#16151: Add new descriptor source to fetch descriptors from CollecTor via https
-----------------------------+--------------------------
     Reporter:  karsten      |      Owner:  karsten
         Type:  enhancement  |     Status:  needs_review
     Priority:  normal       |  Milestone:
    Component:  metrics-lib  |    Version:
   Resolution:               |   Keywords:
Actual Points:               |  Parent ID:
       Points:               |
-----------------------------+--------------------------

Comment (by karsten):

 Replying to [comment:4 iwakeh]:

 Thanks for the extensive review!

 Before I respond, I must confess that I'm currently thinking about making
 a major change to `DescriptorCollector`: the way it works right now is
 that it fetches descriptors, parses them, hands them over to you, and you
 do with them whatever you want.  But most applications would want to cache
 descriptors locally before processing them, and in order to do that, we'll
 need a `DescriptorWriter`.

 I started writing such a class and ran into three problems:
  1. We need to distinguish whether a `ServerDescriptor` was published by a
 relay or a bridge.  I think we can do that by looking at `router-
 signature` vs. `router-digest`, and I even wrote code for that, but that
 needs testing.  So, not a big deal.
  2. In order to write a `Descriptor` to disk, we sometimes need additional
 information.  For example, a `Microdescriptor` doesn't contain a
 publication time, so we'd have to provide that seperately if we want to
 write descriptors to separate directories per month.  Also,
 `BridgeNetworkStatus` doesn't contain the fingerprint of the bridge
 authority.  Solving these issues is possible, but not pretty.
  3. Here's the party killer: if we want to use `DescriptorCollector` to
 mirror CollecTor's descriptor archive tarballs, we'd instead receive all
 descriptors contained in tarballs and would then have to write them to new
 tarballs.  In that use case we don't want to parse descriptors, which
 would even slow us down a lot.

 So, I wonder if we should change `DescriptorCollector` to simply
 synchronize files from the CollecTor website to a local directory without
 parsing descriptors at all.  Then we can use `DescriptorReader` as usual
 to process them.  The only use case that this wouldn't cover is when we
 want to process CollecTor's descriptors on-the-fly without having them
 touch the disk, but that doesn't seem important enough to force the other
 use cases into this model.

 Thoughts?

 Let me also reply inline.

 > Is there a task already for switching to java 7 and adding some logging
 framework?
 > These two comprise my "ceterum censeo ..." :-)
 > In general, could this change be also used to softly introduce some
 logging into metrics-lib?

 Both are good ideas, but let's do them when all services are migrated away
 from the dying yatei host.

 > Now the code comments/questions:
 > (I hope I didn't miss any new test cases ...)
 > It might be helpful to add some tests for DescriptorCollectorImpl that
 explicit test for certain behavior.
 > These would be a very good source of documentation: e.g. take the
 comment
 > inside "putTogetherHistory". A test could explicitly verify the
 functionality
 > described in the comments.

 Agreed.  For now, I'd risk doing the testing by setting up the services
 locally with the new metrics-lib and seeing if they still work okay.  But
 yes, we should add tests.

 > DescriptorCollector interface:
 > Why not turn the comments into javadoc? They are there already and very
 > readable. Just adding the tags for return values and parameters is not a
 big deal.

 Sure, why not.  I'll fix that.  We might later want to go through the
 other classes and clean up docs there, too.  But there's no harm in
 starting to do this with newly added code.

 > Concerning DescriptorCollectorImpl:
 > Let's try to avoid the TODOs here from the beginning ;-)
 > For the various constants (COLLECTOR_BASE_URL, CONNECT_TIMEOUT_MILLIS,
 READ_TIMEOUT_MILLIS)
 > I would suggest using a system property (defaulting to the current
 values) as for the class
 > names in DescriptorSourceFactory using
 > {{{
 > onionoo.collector.url
 > onionoo.collector.timeout.read
 > onionoo.collector.timeout.write
 > }}}
 > as property names. Then it will be configurable immediately.
 > The retrieval of the 'int' properties needs just one extra method that
 returns
 > either the converted property value or logs an error and returns the
 default.
 > The URL could also be verified, if there is time to implement it now.

 The current way of configuring `DescriptorCollector` is by calling methods
 on it after it's created and before it starts collecting.  I'd rather want
 to make things configurable that way.  It's actually easy to do that, so
 we can change those TODOs into methods.

 I'd also want to have a more comprehensive plan for using properties in
 metrics-lib.  It's quite possible that using them is a better idea than
 the current way of configuring things.  But I'd want to have a high-level
 overview of that first.

 > The other TODO is the Apache's directory list parsing: the regex parsing
 > approach is straigthforward and ok, but I'd make it testable and
 definitly
 > log any problems. So a method 'addDirsFromListing(SortedMap<String,
 Long> list)'
 > with the code from lines 177-200 and the additional logging might be
 useful.

 Will look into that.  Making the code more testable sounds great.

 > Well, and the test for parsing directory string needs to be added.

 Agreed.

 > Tests for the property handling above would be helpful, too.

 (See above.)

 > What about adding the {{{@Override}}} annotation where appropriate?

 Sure, if it's missing that's an oversight.  I'll add that.

 So, I'm curious to hear what you think about my suggestion above.  Sorry
 for letting you review the current code and then suggesting to change it,
 but I didn't envision to run into all those problems.  Thanks again!

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/16151#comment:5>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online