[tor-bugs] #6569 [Stem]: Stem network status document parsing

Tue Aug 14 16:37:19 UTC 2012

#6569: Stem network status document parsing
--------------------+-------------------------------------------------------
 Reporter:  neena   |          Owner:  neena         
     Type:  task    |         Status:  needs_revision
 Priority:  normal  |      Milestone:                
Component:  Stem    |        Version:                
 Keywords:          |         Parent:                
   Points:          |   Actualpoints:                
--------------------+-------------------------------------------------------
Changes (by atagar):

  * status:  needs_review => needs_revision

Comment:

 Bit more feedback from looking it over this morning. I'm gonna swap this
 back into needs_revision until these are addressed.

 > stem/descriptor/__init__.py
 > # Metrics descriptor handling. These contain a single descriptor per
 file.

 This comment of mine is no longer accurate with this addition. Maybe this
 would now be a little nicer as chained iterators...

 {{{
 def parse_file(path, descriptor_file):
   ...

   filename, file_parser = os.path.basename(path), None

   if filename == "cached-descriptors":
     file_parser = stem.descriptor.server_descriptor.parse_file
   elif filename == "cached-extrainfo":
     file_parser = stem.descriptor.extrainfo_descriptor.parse_file
   else:
     # check if this is a metrics descriptor
     metrics_header_match = re.match("^@type (\S+) (\d+).(\d+)$",
 first_line)

     if metrics_header_match:
       desc_type, major_version, minor_version =
 metrics_header_match.groups()
       file_parser = lambda f: _parse_metrics_file(desc_type,
 int(major_version), int(minor_version), f)

   if file_parser:
     for desc in file_parser(descriptor_file):
       desc._set_path(path)
       yield desc

     return
   else:
     raise TypeError("Unable to determine the descriptor's type. filename:
 '%s', first line: '%s'" % (filename, first_line))

 def _parse_metrics_file(descriptor_type, major_version, minor_version,
 descriptor_file):
   # Parses descriptor files from metrics, yielding individual descriptors.
 This
   # throws a TypeError if the descriptor_type or version isn't recognized.

   if descriptor_type == "server-descriptor" and major_version == 1:
     yield
 stem.descriptor.server_descriptor.RelayDescriptor(descriptor_file.read())
   elif descriptor_type == "bridge-server-descriptor" and major_version ==
 1:
     yield
 stem.descriptor.server_descriptor.BridgeDescriptor(descriptor_file.read())
   ... etc...
   elif desc_type in ("network-status-consensus-3", "network-status-
 vote-3") and major_version == 1:
     # TODO: this upfront read still makes me twitch
     desc =
 stem.descriptor.networkstatus.NetworkStatusDocument(descriptor_file.read())
     for desc in desc.router_descriptors:
       yield desc
   else:
     raise TypeError("Unrecognized metrics descriptor, type: %s, major
 version: %i, minor version: %i" % (descriptor_type, major_version,
 minor_version))
 }}}

 > +def _peek_keyword(descriptor_file):
 > ...
 > +  last_position = descriptor_file.tell()
 > +  line = descriptor_file.readline()
 > +  if not line: return None

 Lets use _peek_line() here. We could probably replace this function
 with...

 {{{
 def _peek_keyword(descriptor_file):
   ...
   line = _peek_line(descriptor_file)

   if line.startswith("opt "):
     line = line[4:]

   if line:
     return line.split(' ', 1)[0]
   else:
     return None
 }}}

 > +def _read_keyword_line(keyword, descriptor_file, validate = True,
 optional = False):
 > ...
 > +  :param str keyword: keyword the line must begin with
 > +  :param bool optional: if the current line must begin with the given
 keyword
 > +  :param bool validate: validation is enabled

 Missing a parameter and out of order. The opt handling in this method is
 more complicated than it needs to be - like _peek_keyword() lets just
 strip it off of the line variable if it's there.

 > +  elif line.startswith("opt"):
 > +    # if this is something new we don't recognize
 > +    # ignore it and go to the next line
 > +    descriptor_file.readline()
 > +    return _read_keyword_line(keyword, descriptor_file, optional)

 This doesn't look right. I think you meant "elif optional:", having this
 start with 'opt' really doesn't mean anything.

 Also, when checking for an 'opt' prefix please do a startswith() check for
 'opt ' so you don't match against, say, our new "optometrist" keyword. :P

 > +def _read_keyword_line(keyword, descriptor_file, validate = True,
 optional = False):
 > +  """
 > +  Returns the rest of the line if the first keyword matches the given
 keyword. If
 > +  it doesn't, a ValueError is raised if optional and validate are True,
 if
 > +  not, None is returned.
 > +
 > +  Respects the opt keyword and returns the next keyword if the first is
 "opt".
 >
 > ... vs...
 >
 > +def _read_keyword_line_str(keyword, lines, validate = True, optional =
 False):
 > +  """
 > +  Returns the rest of the line if the first keyword matches the given
 keyword. If
 > +  it doesn't, a ValueError is raised if optional and validate are True,
 if
 > +  not, None is returned.
 > +
 > +  Respects the opt keyword and returns the next keyword if the first is
 "opt".

 These pydocs look really identical to me. How to these functions differ?
 Can we get rid of _read_keyword_line_str()?

 > +def _read_until_keywords(keywords, descriptor_file, inclusive = False,
 ignore_first = False):
 > ...
 > +  :param bool ignore_first: doesn't check if the first line read has
 one of the given keywords

 This new parameter seems oddly specific. The problem that you're trying to
 solve is that a router status entry starts with a 'r ' line and continues
 until... well, that part isn't defined in the spec. At present you're
 hardcoding our check to be for...

 > +_router_desc_end_kws = ["r", "bandwidth-weights", "directory-footer",
 "directory-signature"]

 This strikes me as being both a bit hacky and vulnerable to breakage if
 tor adds a new footer value. I'd suggest asking for clarification from
 Nick - this looks like a spec bug to me.

-- 
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/6569#comment:6>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online