[tor-bugs] #6569 [Stem]: Stem network status document parsing
Tor Bug Tracker & Wiki
torproject-admin at torproject.org
Wed Aug 15 17:34:03 UTC 2012
#6569: Stem network status document parsing
--------------------+-------------------------------------------------------
Reporter: neena | Owner: neena
Type: task | Status: needs_review
Priority: normal | Milestone:
Component: Stem | Version:
Keywords: | Parent:
Points: | Actualpoints:
--------------------+-------------------------------------------------------
Changes (by neena):
* status: needs_revision => needs_review
Comment:
Replying to [comment:5 atagar]:
> > + desc =
stem.descriptor.networkstatus.NetworkStatusDocument(descriptor_file.read())
>
> Again this is frontloading the read rather than passing the file object
itself to the NetworkStatusDocument. This means that the memory usage is,
at a minimum, as large as the whole consensus file. If we process
individual consensus entries then that'll be reduced to the size of an
individual consensus entry.
>
> I realize that this is a large change but it's likely necessary to avoid
swapping on some systems.
>
Eek. fixed.
As it is right now, NetworkStatusDocument parses descriptor data.
stem.descriptor.networkstatus.parse_file does handle the router
descriptors individually. So, if ever there is a need to parse
NetworkStatusDocuments from a string, we can use the class directly.
> I probably would have done the same though, on reflection, this can
actually be reduced to...
>
> {{{
> if line.startswith("opt "):
> line = line[4:]
>
> keyword = line.split(" ", 1)[0].strip()
> }}}
>
> ... this is because split doesn't need the whitespace to have a first
element...
Fixed.
> > +try:
> > + from cStringIO import StringIO
> > +except:
> > + from StringIO import StringIO
>
> Out of curiosity when is cStringIO unavailable? Iirc we're using it
elsewhere too so if it, say, isn't available on python 2.5 then that would
be good to know.
From what I understand cStringIO is unavailable on 'some platforms'. Not
sure what those are.
> Duplicate assignment. Personally I think the first is more readable
though up to you. If you like the second better then please put the flag
list into a temporary variable first to avoid having a single enormous
line.
Fixed.
> Also, minor nitpick: when doing a multi-line block please place the last
parentheses at the same indentation as the block. Ie...
and fixed.
Replying to [comment:6 atagar]:
> > stem/descriptor/__init__.py
> > # Metrics descriptor handling. These contain a single descriptor per
file.
>
> This comment of mine is no longer accurate with this addition. Maybe
this would now be a little nicer as chained iterators...
It is accurate now. I modified stem.descriptor.networkstatus.parse_file to
return NetworkStatusDocument objects instead of iterating over the router
descriptors.
> Lets use _peek_line() here. We could probably replace this function
with...
done
> Missing a parameter and out of order. The opt handling in this method is
more complicated than it needs to be - like _peek_keyword() lets just
strip it off of the line variable if it's there.
ugh, fixed.
> > + elif line.startswith("opt"):
> > + # if this is something new we don't recognize
> > + # ignore it and go to the next line
> > + descriptor_file.readline()
> > + return _read_keyword_line(keyword, descriptor_file, optional)
>
> This doesn't look right. I think you meant "elif optional:", having this
start with 'opt' really doesn't mean anything.
If it starts with "opt ", it should mean it's something from the future
that the parser doesn't understand yet. We should be ignoring that and
using the next line.
> Also, when checking for an 'opt' prefix please do a startswith() check
for 'opt ' so you don't match against, say, our new "optometrist" keyword.
:P
Eek, yes, I thought I fixed that.
> These pydocs look really identical to me. How to these functions differ?
Can we get rid of _read_keyword_line_str()?
I added _read_keyword_line_str to handle lists of strings, because I
didn't want to use StringIO initially. We could get rid of it and convert
everything to use StringIO. But, I'd rather leave it as it is, and
eventually convert everything to use lists of strings internally.
> > +def _read_until_keywords(keywords, descriptor_file, inclusive =
False, ignore_first = False):
> > ...
> > + :param bool ignore_first: doesn't check if the first line read has
one of the given keywords
>
> This new parameter seems oddly specific. The problem that you're trying
to solve is that a router status entry starts with a 'r ' line and
continues until... well, that part isn't defined in the spec. At present
you're hardcoding our check to be for...
>
> > +_router_desc_end_kws = ["r", "bandwidth-weights", "directory-footer",
"directory-signature"]
>
> This strikes me as being both a bit hacky and vulnerable to breakage if
tor adds a new footer value. I'd suggest asking for clarification from
Nick - this looks like a spec bug to me.
It isn't exactly hacky. From the order in the dir-spec the first line
after the router descriptors will be
1. directory-footer if consensus method >= 9
2. bandwidth-weights if it's a consensus (at most for consensus, it says)
3. directory-signature otherwise
If tor adds a new footer value, it would start with the "opt" keyword.
This would be ignored because of " elif line.startswith("opt"):".
I also committed the microdescriptor flavoured consensus parsing code.
Should I create a seperate ticket for that? or will this do?
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/6569#comment:7>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
More information about the tor-bugs
mailing list