[tor-dev] GSoC 2013 / Tor project ideas - Searchable Tor descriptor archive - (pre-)proposal

Tue Apr 30 06:44:40 UTC 2013

On 4/30/13 2:17 AM, Kostas Jakeliunas wrote:
> Hello Karsten and everyone else :)

Hello Kostas,

> (TL;DR: would like to work on the searchable Tor descriptor archive project
> idea; considering drafting up a GSoC application)

Glad to see you're interested in the searchable descriptor archive
project, and thanks for writing this application draft.  Please see my
comments below.

> I'm a student & backend+frontend programmer from Lithuania who'd be very
> much interested in contributing to the Tor project via Google Summer of
> Code (well, ideally at least; the plan would be to volunteer some time to
> Tor in any case, but it's yet to happen, and GSoC is simply too awesome an
> opportunity not to try) -
> 
> The 'searchable Tor descriptor/metrics archive' project idea [1] would, I
> think, best fit in with my previous experience and general interests in
> terms of contributing to the Tor project. The searchable archive project
> idea in itself has a rather clear list of goals / generic constraints, and
> since I haven't contributed any code to the Tor project before, working
> with an existing general project idea (building a more concrete design
> proposal on top of it) probably makes most sense.
> 
> This particular project, I think, would match my previous Python backend
> programming experience: building backends to work with large datasets /
> databases -- crafting efficient ORMs and responsive APIs to interact with
> them. [2]
> 
> Applying the knowledge/skills learned to something which is ideologically
> close at heart and the purpose of which is very obvious to me sounds
> thrilling! (This year, as far as Python frameworks are concerned, I've been
> mostly exposed and have been working with Flask - have some (limited)
> experience with Django before that. As far as a proof-of-concept for the
> searchable archive is concerned, I'm considering trying some things out
> with Flask, since it allows me to do some quick prototyping.)

Flask sounds like a fine choice, especially if you think it's a good fit
and if you already have experience with it.  No need to use Django only
because it's mentioned in the project idea.

> I'd like to try and work out an implementation/design draft for what I
> could / would like to do (this is a preliminary email - I know I'm a bit
> late!) Ideally it (and a simple proof of concept search form ->
> browseable/clickable results / relay descriptor navigation page) would
> serve as the base for my GSoC application, but I have to be realistic about
> me being rather late to apply and not having participated in either Tor
> or GSoC before. I'd like to work out an application draft if possible,
> though. (Were I to get accepted, I would be able not to do any part-time
> work this summer, or would only need to take passive care of a couple of
> already running backends.)

Having some kind of proof of concept would be great!  It's fine if this
prototype focuses only on the data processing or querying part and
doesn't have a fancy user interface yet.  Ideally, you should try to
identify a few places in your overall application that are most likely
to fail, and write some code to identify problems there as early as
possible.

For example, if your database importer takes two hours to import an
hour's worth of new descriptors, that would be a problem.  In this case, try to
write a simple importer and see how you could speed up the import
operation.  For example, for importing into PostgreSQL using Java, I
tried different variants like turning off auto-commit in JDBC or writing
a COPY command to a file and importing it using psql -f.
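As a rough sketch of the COPY variant in Python (you mentioned Python
for the importer): the snippet below writes rows into a script that
psql -f can execute.  The table and column names are made up for
illustration and are not taken from any existing schema.  The escaping
follows PostgreSQL's COPY text format (tab-separated values, \N for
NULL, a line containing only \. to end the data).

```python
import io


def copy_escape(value):
    """Encode one value for PostgreSQL's COPY text format."""
    if value is None:
        return r"\N"  # COPY's default NULL marker
    text = str(value)
    # Escape the backslash first so the escapes added below are not doubled.
    return (text.replace("\\", "\\\\")
                .replace("\t", "\\t")
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def write_copy_script(out, table, columns, rows):
    """Write a 'COPY ... FROM stdin' script to the file-like object out."""
    out.write("COPY %s (%s) FROM stdin;\n" % (table, ", ".join(columns)))
    for row in rows:
        out.write("\t".join(copy_escape(v) for v in row) + "\n")
    out.write("\\.\n")  # end-of-data marker


# Hypothetical table and columns, for illustration only.
buf = io.StringIO()
write_copy_script(
    buf,
    "descriptor",
    ["fingerprint", "nickname", "platform"],
    [("AAAA1111", "relayone", "Tor on Linux"),
     ("BBBB2222", "relaytwo", None)],
)
script = buf.getvalue()
```

Writing script to a file and loading it with psql -f sends all rows in
a single statement, which is worth timing against row-by-row INSERTs.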

Or, if you're worried about performance of certain search queries, think
about ways to make those queries more efficient.  To give you a rough
idea of the problems there, take a look at the ExoneraTor database
schema, in particular the oraddress24 and oraddress48 columns:
https://gitweb.torproject.org/metrics-web.git/blob/HEAD:/db/exonerator.sql#l49
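The idea behind such columns, roughly: store a fixed-width key for the
address's network next to each row, so a search can hit an index with a
plain equality comparison instead of scanning address strings.  Here is
a minimal Python sketch of computing such a key.  It assumes the key is
the hex-encoded /24 (IPv4) or /48 (IPv6) prefix; please check the
schema linked above for the actual column definitions.  The function
name prefix_key is made up.

```python
import ipaddress


def prefix_key(address):
    """Return a hex key for the /24 (IPv4) or /48 (IPv6) network.

    Assumption: the key is simply the leading 3 or 6 bytes of the
    address, hex-encoded. The real schema may encode it differently.
    """
    ip = ipaddress.ip_address(address)
    keep = 3 if ip.version == 4 else 6  # 24 bits or 48 bits, in bytes
    return ip.packed[:keep].hex()


# Two addresses in the same /24 map to the same key.
same_network = prefix_key("192.0.2.77") == prefix_key("192.0.2.200")
```

A lookup would then compare the precomputed key column against
prefix_key(user_input) and only afterwards filter on the full address.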

Note that these two examples may not be that relevant in your case.
They are only meant to illustrate potential performance bottlenecks
and approaches to work around them.

For your GSoC application it'd be sufficient to describe what problems
you identified, what code you wrote to further look into them, and what
results you got.  Also make your code available somewhere to give us an
idea of your code style.  But it's not necessary to create a shiny test
web app somewhere.

> I've read into the Tor Metrics portal pages (esp. Data Formats), and am
> trying to get acquainted with the existing archiving solution (reading into
> the 'metrics-web' java source (under
> metrics-web/src/org/torproject/ernie/web) to see how the descriptor etc.
> archives are currently parsed / imported into Postgres and so on), to first
> and foremost be able to evaluate the scope of what I'd like to write.

Sounds good.  Also take a look at stem: https://stem.torproject.org/.
It even has a tutorial for descriptor parsing:
https://stem.torproject.org/tutorials/mirror_mirror_on_the_wall.html

> I will presently work on a more specific list of constraints for the
> searchable archive project idea. I can then try producing a GSoC
> application draft.
> 
> Just to get an idea of what kind of system I'd be building / working on -
> at the very least, we'd be looking into:
> 
>    - (re)building the archival / metrics data update system - the proposed
>    method in [1] was a simple rsync over ssh / etc. to keep the data in sync
>    with the descriptor data collection point. If possible, it would help if
>    the rsync could work with uncompressed archives - rsync is intelligent
>    enough not to need to send *that* much excess data - and diffing is more
>    efficient with uncompressed data.
>    A simple rsync script (can be run as a cron job) would work here.

The last three days of descriptors are available in uncompressed form.
See the rsync paragraph on https://metrics.torproject.org/data.html.

>    - a python script (probably to be run through cron) to import the
>    archives into DB. Can stat files to only need to import new/modified ones,
>    e.g. The good thing about such an approach is that the script could work as
>    a semi-standalone (would still need the DB / ORM design), therefore could
>    be used in conjunction with other, different tools - and it would be built
>    as an atomic target during the implementation process - I heard you guys
>    like modular project design proposals ;) who doesn't like them!
>    We already have metrics-utils/exonerator/exonerator.py (which works as a
>    semantically-aware descriptor archive grep tool) - some archive parsing
>    logic can be reused maybe - the more pertinent thing here would be to

Stem has the ability to skip files it parsed before.  No need to
implement that functionality yourself.

>    - build the ORM for storing all the archival data in DB. Postgres is
>    preferred and could work, especially since probably a large part of the
>    current ORM logic could be used here (I've taken a glance at the current
>    architecture, it makes good sense to me, but I haven't looked further,
>    neither have I done any benchmarking with the existing ORM (except for some
>    web-based relay search test queries which don't really count.))
> 
>    - it is very important to build an ORM which would scale well data-wise,
>    and would suit our queries well.
> 
>    - query logic and types - the idea would be to allow to do incremental
>    query-building - on the SQL level, WHERE clauses can be incrementally
>    attached to the higher-level ORM query object. For example, part of
>    fingerprint / relay name -> add a date interval -> additionally, add a
>    day-specific "on date" clause -> etc. ORM efficiency, benchmarking and
>    restrictions on query complexity - this may turn out to very much be a
>    nontrivial matter.
> 
>    - on the user interface level, user query input parsing - the idea would
>    be to allow for flexible user input; date interval / info not mandatory;
>    possibly, the user may provide additional filtering by specifying more
>    descriptor metadata. Would need to figure out the most accessible type of
>    input - a simple input field with info about the flexible syntax with
>    examples and a flexible parser, and/or optional additional fields - a
>    responsive UI (can expand the input form into additional fields/options and
>    try to avoid contradictory / nonsensical input) might prove effective here.
> 
>    - results page - would need to work out what kinds of relay metadata to
>    show, can click on parts of the data on the results page to further narrow
>    results - unless click on relay name / fingerprint / etc., in which case
>    would be taken to that specific relay's page (on a more general sense, this
>    would also be a kind of 'result narrowing.')
> 
>    - => generally create a browsable relay archive, formatting /
>    aggregating specific descriptor fields to provide more info about each
>    relay. <-- may make sense to start working on the particular constraint
>    list starting from this point.
> 
> 
> Hopefully Karsten can help me with the application, assuming my idea for
> the project makes sense. :)

Yes, this is really a fine start!  Happy to discuss more details here or
on the GSoC platform once you have submitted your application.
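
To make the incremental query-building idea from your list a bit more
concrete, here is a minimal sketch of how WHERE clauses could be
attached one at a time.  Everything in it is an assumption for
illustration: the RelayQuery class, the descriptor table and its
columns are invented, and it uses Python's built-in sqlite3 only so
that it runs without a database server.  A real implementation would
target PostgreSQL and possibly an ORM.

```python
import sqlite3


class RelayQuery:
    """Collects WHERE clauses step by step; values stay parameterized."""

    def __init__(self):
        self.clauses = []
        self.params = []

    def _add(self, clause, *params):
        self.clauses.append(clause)
        self.params.extend(params)
        return self  # returning self allows chaining

    def nickname_like(self, part):
        return self._add("nickname LIKE ?", "%" + part + "%")

    def fingerprint_prefix(self, prefix):
        return self._add("fingerprint LIKE ?", prefix.upper() + "%")

    def between(self, start, end):
        return self._add("published BETWEEN ? AND ?", start, end)

    def on_date(self, day):
        return self._add("date(published) = ?", day)

    def sql(self):
        where = " AND ".join(self.clauses) or "1 = 1"
        return ("SELECT fingerprint, nickname, published "
                "FROM descriptor WHERE %s ORDER BY published" % where)

    def run(self, conn):
        return conn.execute(self.sql(), self.params).fetchall()


# Tiny in-memory data set with made-up relays.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE descriptor "
             "(fingerprint TEXT, nickname TEXT, published TEXT)")
conn.executemany("INSERT INTO descriptor VALUES (?, ?, ?)", [
    ("AAAA1111", "relayone", "2013-04-29 12:00:00"),
    ("AAAA1111", "relayone", "2013-03-01 08:00:00"),
    ("BBBB2222", "relaytwo", "2013-04-29 09:00:00"),
])

# Part of a relay name -> add a date interval -> add an "on date" clause.
query = RelayQuery().nickname_like("one")
query.between("2013-04-01", "2013-04-30")
query.on_date("2013-04-29")
rows = query.run(conn)
```

Keeping the clause list separate from the parameter list means user
input never ends up in the SQL string itself, and it makes it easy to
log or benchmark the generated statement at each narrowing step.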

Best,
Karsten

> I will follow up with more details. Besides my email address <
> kostas at jakeliunas.com>, I can be reached on #tor-dev as 'wfn', or
> XMPP/Jabber via <phistopheles at jabber.org>.
> 
> Cheers to you all
> Kostas.
> 
> 
> [1] https://www.torproject.org/getinvolved/volunteer.html.en#metricsSearch
> 
> [2] My largest Python backend related project was this winter, building a
> redis (product likes/dislikes) + mysql (everything else) product
> recommendation solution to work with a large dataset of (product) metainfo
> (such as user votes on products), and creating APIs (on top of Flask) (API
> for the frontend CMS and for a mobile app) which include custom-per-user
> product recommendation feeds, etc. Large data (nothing close to the Tor
> descriptor/metrics archives, though!) + Interactive application logic
> architectures are of interest to me.
>