[tor-dev] An ANTLR 4 grammar for Tor bridge network statuses

Wed Nov 4 16:43:28 UTC 2015

On Wed, Nov 4, 2015 at 4:06 AM, Karsten Loesing <karsten at torproject.org> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hello developers,
>
> in the past few days I have been working on a grammar to parse Tor
> bridge network statuses and hopefully other Tor descriptors in the
> future.  It's working, for some definition of working, but some issues
> remain and I need some help.
>
> I just uploaded my sources, consisting only of the grammar with a fair
> amount of documentation:
>
> https://people.torproject.org/~karsten/volatile/BridgeNetworkStatus.g4

Nice work, Karsten!  I'm hoping we move towards some kind of
machine-readable grammar/schema for all our data formats, and that we
have our actual parsing/encoding code generated from it.

(When I did a survey of where all our crash/assertion bugs for the
last few years were, they seemed to have a higher-than-usual
concentration in our parsing code.)

One thing about this grammar in particular, though: It is over-strict.
It matches only the formats we use today, and not the formats we are
allowed to use in the future.  For one example, a flag on an 's' line
can be any non-space string - but this grammar will fail to parse
unrecognized flags.
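
To make the point concrete, here is a rough Python sketch of a lenient
's' line parser.  It is only an illustration and is not taken from
metrics-lib, Stem, or Zoossh; the function name parse_s_line and the
KNOWN_FLAGS set (a partial list of flags in use today) are assumptions
made up for this example.  Unrecognized flags are kept and reported,
rather than causing a parse failure.

```python
# Illustrative subset of flags in use today; hypothetical helper, not a real API.
KNOWN_FLAGS = {"Authority", "Exit", "Fast", "Guard", "HSDir",
               "Running", "Stable", "V2Dir", "Valid"}


def parse_s_line(line):
    """Parse an 's' line, accepting any non-space string as a flag.

    Returns (flags, unknown): all flags in order of appearance, and the
    subset of them that this parser does not recognize.
    """
    parts = line.split()
    if not parts or parts[0] != "s":
        raise ValueError("not an 's' line: %r" % line)
    flags = parts[1:]
    # Unknown flags are recorded, not rejected, so future flags still parse.
    unknown = [flag for flag in flags if flag not in KNOWN_FLAGS]
    return flags, unknown


flags, unknown = parse_s_line("s Fast Running FutureFlag Valid")
```

A caller could log the contents of unknown and carry on, which is the
behavior a grammar with a fixed list of flag tokens cannot express.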

On the other hand, while we specify the order of the r, s, w, p, and a
lines in a generated consensus, clients are required to parse the s, w,
p, and a lines in any order, but not to allow two s lines in a single
'r' entry.

I think that because of the free-ordering and multiplicity-restriction
rules for our data formats, a context-free grammar simply isn't going
to match our spec very well.
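
As a sketch of why this is easier to say in code than in a
context-free grammar, the following Python fragment accepts the lines
after an 'r' line in any order while rejecting a second 's' line.  The
names parse_entries and AT_MOST_ONCE are hypothetical, and the set
contains only "s" because that is the one restriction mentioned above;
other keywords would be added to it as the spec requires.

```python
# Keywords that may appear at most once per 'r' entry (assumption: only "s",
# the one restriction named in this message).
AT_MOST_ONCE = {"s"}


def parse_entries(lines):
    """Group status lines into entries, each started by an 'r' line.

    Lines following an 'r' line are accepted in any order; a repeated
    keyword from AT_MOST_ONCE within one entry raises ValueError.
    """
    entries = []
    current = None
    for line in lines:
        keyword, _, rest = line.partition(" ")
        if keyword == "r":
            current = {"r": rest, "items": {}}
            entries.append(current)
            continue
        if current is None:
            raise ValueError("line before first 'r' line: %r" % line)
        if keyword in AT_MOST_ONCE and keyword in current["items"]:
            raise ValueError("duplicate %r line in one entry" % keyword)
        # Unknown keywords are stored too, so future line types still parse.
        current["items"].setdefault(keyword, []).append(rest)
    return entries
```

Expressing the same thing as grammar rules would mean enumerating
every permitted ordering, or accepting any sequence in the grammar and
checking multiplicity in a separate pass like this one.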

> Quoting from that file to facilitate discussion here:
>
> There are multiple goals of having a grammar for Tor descriptors
> available on CollecTor:
>
> 1. Translate descriptors to JSON for statistical analysis: Some tools
> and databases require Tor descriptors in a standard format like JSON.
> This grammar and a parser generated from it can help make that
> translation as easy as possible, and also keep future maintenance as
> low as possible.
>
> 2. Provide a basis for descriptor-parsing libraries: As of late 2015,
> there are three libraries for parsing Tor descriptors: metrics-lib for
> Java, Stem for Python, and Zoossh for Go.  It would be beneficial to
> place as much knowledge about the descriptor format into a grammar
> shared by all those libraries and then generate parsers for different
> languages from that grammar.
>
> 3. Serve as documentation for the Tor directory protocol
> specification: Tor descriptors are already documented using a
> hand-written grammar, but that may contain slight inaccuracies because
> it's not verified.  This grammar could fix that by either detecting
> inaccuracies while trying to rewrite it to an executable grammar form
> or by replacing the grammar in the specification documentation with
> this executable grammar.
>
> Open issues and questions:
>
>  - Was it smart to explicitly include all those SP tokens in the
> rules, or should those be discarded right away by the lexer?  The main
> reason for keeping them was to stay as close to the specification as
> possible, but maybe that has downsides on the other goals.

IMO, once we have a grammar that is truly correct, that grammar should
_be_ the spec, and we should revise the main spec to reference the
grammar.

>  - If a bridge uses a nickname (or other token that's supposed to be a
> STRING) that is also a keyword like "r" or "published", things get
> confusing.  Try editing the input bridge network status and observe
> the result.  But those are perfectly valid nicknames, so what can we do?

Change the lexing rules so that keywords are only recognized as such
at position 0 on the line, outside of a base64 block?
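
A minimal Python sketch of that lexing rule, assuming a hand-written
line-oriented tokenizer rather than ANTLR lexer modes: the function
tokenize, the token kind names, and the KEYWORDS set are all invented
for this example, and base64 blocks are assumed to be delimited by
"-----BEGIN ...-----" and "-----END ...-----" lines.

```python
# Illustrative keyword set; a real lexer would list every keyword in the format.
KEYWORDS = {"r", "s", "w", "p", "a", "published"}


def tokenize(text):
    """Return (kind, value) tokens, one line at a time.

    A word is a KEYWORD only if it is the first word on its line and the
    line is outside a BEGIN/END block; everything else is a STRING.
    """
    tokens = []
    in_block = False
    for line in text.splitlines():
        if not line:
            continue
        if in_block:
            if line.startswith("-----END "):
                in_block = False
                tokens.append(("END", line))
            else:
                # Inside a block nothing is a keyword, even a bare "r".
                tokens.append(("BASE64", line))
            continue
        if line.startswith("-----BEGIN "):
            in_block = True
            tokens.append(("BEGIN", line))
            continue
        first, *rest = line.split(" ")
        tokens.append(("KEYWORD" if first in KEYWORDS else "STRING", first))
        tokens.extend(("STRING", word) for word in rest if word)
    return tokens
```

With this rule a bridge nicknamed "published" or "r" lexes as an
ordinary STRING, because it never appears at the start of a line.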

best wishes,
-- 
Nick