[tor-dev] [GSoC '13] Status report - Searchable metrics archive

Mon Jul 29 06:04:18 UTC 2013

Hi everyone,

Some clean-ish working code is finally available online [1] (the old PoC
code has been moved to [2]); I'll be adding more soon, but this part does
what it's supposed to do, i.e.:

  - archival data import (download, mapping to the ORM via Stem, skipping
data that has already been imported via Stem's persistence path, etc.);
what's left for this part is a nice and simple rsync+cron setup to
continuously download and import new data (via the Metrics archive's
'recent')
  - data models and Stem <-> ORM <-> database mapping for descriptors,
consensuses and network statuses contained in consensuses
  - models can be easily queried through SQLAlchemy's ORM; Karsten suggested
that an additional 'query layer' / internal API is not needed until there's
an actual need for it. (My plan had been to provide a query API abstracted
from the ORM, which is itself built on top of the database/SQL/Python
classes, and to build a backend on top of it, as a neat client of that API,
as it were. I had some simple and ugly PoCs, which are now pushed out of
the priority queue until needed, if ever.)
  - one example of how this querying (directly atop the ORM) works is
provided: a simple (very partial) Onionoo protocol implementation for
/summary and /details, including ?search, ?limit and ?offset. Querying
takes place over all NetworkStatuses. This is new in the sense that it uses
the ORM directly. If there is a need to formulate SQL queries more
directly, we'll do that as well.
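
To make the ?search, ?limit and ?offset handling concrete, here is a minimal
sketch of a /summary-style query. It is not the torsearch code: it uses
Python's standard-library sqlite3 with an in-memory table instead of
SQLAlchemy and Postgres, so that it runs on its own. The table name, the
relays, their fingerprints and the short output keys are all hypothetical,
with the keys only loosely modeled on Onionoo summary documents.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table with a few made-up network status rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE network_status "
        "(nickname TEXT, fingerprint TEXT, running INTEGER)"
    )
    # Hypothetical relays; the fingerprints are placeholders, not real ones.
    conn.executemany(
        "INSERT INTO network_status VALUES (?, ?, ?)",
        [
            ("relayalpha", "A" * 40, 1),
            ("relaybeta", "B" * 40, 0),
            ("gamma", "C" * 40, 1),
        ],
    )
    return conn


def summary(conn, search=None, limit=None, offset=0):
    """Return a summary-like document, filtered and paginated."""
    sql = "SELECT nickname, fingerprint, running FROM network_status"
    params = []
    if search:
        # Prefix match on nickname or fingerprint; values are always bound
        # as parameters, never interpolated into the SQL string.
        sql += " WHERE nickname LIKE ? OR fingerprint LIKE ?"
        params += [search + "%", search.upper() + "%"]
    # In SQLite, LIMIT -1 means "no limit".
    sql += " ORDER BY nickname LIMIT ? OFFSET ?"
    params += [-1 if limit is None else limit, offset]
    rows = conn.execute(sql, params).fetchall()
    return {"relays": [{"n": n, "f": f, "r": bool(r)} for n, f, r in rows]}


if __name__ == "__main__":
    db = build_demo_db()
    print(summary(db, search="relay", limit=1, offset=1))
```

With an ORM the same shape would be a filter followed by limit and offset
calls on a query object; the point here is only how the three parameters
compose.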

During the Tor developer meetings in Munich, Karsten and I talked over the
existing and proposed parts of the system. I will be focusing on
making sure the Onionoo-like backend (which is being extended) is stable
and efficient. I'm still looking into database optimization (with Karsten's
advice); an efficient backend for the majority of all archival data
available would be a great deliverable in itself, and hopefully we can
achieve at least that. I might do well to try and document the database
iterations and development, as a lot of thinking now resides in a kind of
'black box' of DB spec, which does not produce code.

The large Postgres datasets reside on a server I'm managing. I'm working on
exposing the Onionoo-like API for public queries, and am currently doing
some simple higher-level benchmarking (simulating multiple clients
requesting different data at once, etc.). I might need to move the datasets
to yet another server (again), but maybe not; it's easy to blame things on
limited CPU/memory resources. :)
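
The kind of higher-level benchmarking described above (several clients
requesting different data at once) could be sketched roughly as follows.
This is an illustration only, not the benchmark actually used: fake_backend
is a hypothetical stand-in for an HTTP request to the Onionoo-like API, and
the query strings are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_backend(query):
    """Hypothetical stand-in for one HTTP request to the backend."""
    time.sleep(0.01)  # pretend the server needs about 10 ms to answer
    return {"query": query, "relays": []}


def timed_call(fetch, query):
    """Run one request and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    fetch(query)
    return time.perf_counter() - start


def benchmark(fetch, queries, clients=4):
    """Issue all queries from `clients` concurrent workers."""
    with ThreadPoolExecutor(max_workers=clients) as pool:
        latencies = sorted(pool.map(lambda q: timed_call(fetch, q), queries))
    return {
        "requests": len(latencies),
        "median": latencies[len(latencies) // 2],
        "max": latencies[-1],
    }


if __name__ == "__main__":
    queries = ["summary?search=a", "details?limit=10", "summary?offset=50"] * 4
    print(benchmark(fake_backend, queries, clients=4))
```

Swapping fake_backend for a function that performs a real request would
turn this into a crude load test; comparing median and maximum latency
across different client counts gives a first hint of whether the database
or the machine is the bottleneck.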

Kostas.

[1]: https://github.com/wfn/torsearch
[2]: https://github.com/wfn/torsearch-poc