[tor-bugs] #16151 [metrics-lib]: Add new descriptor source to fetch descriptors from CollecTor via https
Tor Bug Tracker & Wiki
blackhole at torproject.org
Sat May 23 17:15:28 UTC 2015
#16151: Add new descriptor source to fetch descriptors from CollecTor via https
-----------------------------+--------------------------
Reporter: karsten | Owner: karsten
Type: enhancement | Status: needs_review
Priority: normal | Milestone:
Component: metrics-lib | Version:
Resolution: | Keywords:
Actual Points: | Parent ID:
Points: |
-----------------------------+--------------------------
Comment (by karsten):
Replying to [comment:4 iwakeh]:
Thanks for the extensive review!
Before I respond, I must confess that I'm currently thinking about making
a major change to `DescriptorCollector`: the way it works right now is
that it fetches descriptors, parses them, hands them over to you, and you
do with them whatever you want. But most applications would want to cache
descriptors locally before processing them, and in order to do that, we'll
need a `DescriptorWriter`.
I started writing such a class and ran into three problems:
1. We need to distinguish whether a `ServerDescriptor` was published by a
relay or a bridge. I think we can do that by looking at `router-
signature` vs. `router-digest`, and I even wrote code for that, but that
needs testing. So, not a big deal.
2. In order to write a `Descriptor` to disk, we sometimes need additional
information. For example, a `Microdescriptor` doesn't contain a
publication time, so we'd have to provide that seperately if we want to
write descriptors to separate directories per month. Also,
`BridgeNetworkStatus` doesn't contain the fingerprint of the bridge
authority. Solving these issues is possible, but not pretty.
3. Here's the party killer: if we want to use `DescriptorCollector` to
mirror CollecTor's descriptor archive tarballs, we'd instead receive all
descriptors contained in tarballs and would then have to write them to new
tarballs. In that use case we don't want to parse descriptors, which
would even slow us down a lot.
So, I wonder if we should change `DescriptorCollector` to simply
synchronize files from the CollecTor website to a local directory without
parsing descriptors at all. Then we can use `DescriptorReader` as usual
to process them. The only use case that this wouldn't cover is when we
want to process CollecTor's descriptors on-the-fly without having them
touch the disk, but that doesn't seem important enough to force the other
use cases into this model.
Thoughts?
Let me also reply inline.
> Is there a task already for switching to java 7 and adding some logging
framework?
> These two comprise my "ceterum censeo ..." :-)
> In general, could this change be also used to softly introduce some
logging into metrics-lib?
Both are good ideas, but let's do them when all services are migrated away
from the dying yatei host.
> Now the code comments/questions:
> (I hope I didn't miss any new test cases ...)
> It might be helpful to add some tests for DescriptorCollectorImpl that
explicit test for certain behavior.
> These would be a very good source of documentation: e.g. take the
comment
> inside "putTogetherHistory". A test could explicitly verify the
functionality
> described in the comments.
Agreed. For now, I'd risk doing the testing by setting up the services
locally with the new metrics-lib and seeing if they still work okay. But
yes, we should add tests.
> DescriptorCollector interface:
> Why not turn the comments into javadoc? They are there already and very
> readable. Just adding the tags for return values and parameters is not a
big deal.
Sure, why not. I'll fix that. We might later want to go through the
other classes and clean up docs there, too. But there's no harm in
starting to do this with newly added code.
> Concerning DescriptorCollectorImpl:
> Let's try to avoid the TODOs here from the beginning ;-)
> For the various constants (COLLECTOR_BASE_URL, CONNECT_TIMEOUT_MILLIS,
READ_TIMEOUT_MILLIS)
> I would suggest using a system property (defaulting to the current
values) as for the class
> names in DescriptorSourceFactory using
> {{{
> onionoo.collector.url
> onionoo.collector.timeout.read
> onionoo.collector.timeout.write
> }}}
> as property names. Then it will be configurable immediately.
> The retrieval of the 'int' properties needs just one extra method that
returns
> either the converted property value or logs an error and returns the
default.
> The URL could also be verified, if there is time to implement it now.
The current way of configuring `DescriptorCollector` is by calling methods
on it after it's created and before it starts collecting. I'd rather want
to make things configurable that way. It's actually easy to do that, so
we can change those TODOs into methods.
I'd also want to have a more comprehensive plan for using properties in
metrics-lib. It's quite possible that using them is a better idea than
the current way of configuring things. But I'd want to have a high-level
overview of that first.
> The other TODO is the Apache's directory list parsing: the regex parsing
> approach is straigthforward and ok, but I'd make it testable and
definitly
> log any problems. So a method 'addDirsFromListing(SortedMap<String,
Long> list)'
> with the code from lines 177-200 and the additional logging might be
useful.
Will look into that. Making the code more testable sounds great.
> Well, and the test for parsing directory string needs to be added.
Agreed.
> Tests for the property handling above would be helpful, too.
(See above.)
> What about adding the {{{@Override}}} annotation where appropriate?
Sure, if it's missing that's an oversight. I'll add that.
So, I'm curious to hear what you think about my suggestion above. Sorry
for letting you review the current code and then suggesting to change it,
but I didn't envision to run into all those problems. Thanks again!
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/16151#comment:5>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
More information about the tor-bugs
mailing list