[tor-dev] GSoC 2013 / Tor project ideas - Searchable Tor descriptor archive - (pre-)proposal

Tue Apr 30 00:17:04 UTC 2013

Hello Karsten and everyone else :)

(TL;DR: would like to work on the searchable Tor descriptor archive project
idea; considering drafting up a GSoC application)

I'm a student & backend+frontend programmer from Lithuania who'd be very
much interested in contributing to the Tor project via Google Summer of
Code (well, ideally at least; the plan would be to volunteer some time to
Tor in any case, but it's yet to happen, and GSoC is simply too awesome an
opportunity not to try).

The 'searchable Tor descriptor/metrics archive' project idea [1] would, I
think, best fit in with my previous experience and general interests in
terms of contributing to the Tor project. The searchable archive project
idea in itself has a rather clear list of goals / generic constraints, and
since I haven't contributed any code to the Tor project before, working
with an existing general project idea (building a more concrete design
proposal on top of it) probably makes most sense.

This particular project, I think, would match my previous Python backend
programming experience: building backends to work with large datasets /
databases -- crafting efficient ORMs and responsive APIs to interact with
them. [2]

Applying the knowledge/skills learned to something which is ideologically
close at heart and the purpose of which is very obvious to me sounds
thrilling! (This year, as far as Python frameworks are concerned, I've been
working mostly with Flask, and I have some (limited) experience with Django
from before that. As far as a proof-of-concept for the
searchable archive is concerned, I'm considering trying some things out
with Flask, since it allows me to do some quick prototyping.)

I'd like to try and work out an implementation/design draft for what I
could / would like to do (this is a preliminary email - I know I'm a bit
late!). Ideally it (and a simple proof-of-concept search form ->
browseable/clickable results / relay descriptor navigation page) would
serve as the base for my GSoC application, but I have to be realistic about
being rather late to apply and not having participated in either Tor or
GSoC before. I'd like to work out an application draft if possible,
though. (Were I to get accepted, I would be able not to do any part-time
work this summer, or would only need to take passive care of a couple of
already running backends.)

I've read through the Tor Metrics portal pages (esp. Data Formats), and am
trying to get acquainted with the existing archiving solution (reading
the 'metrics-web' Java source (under
metrics-web/src/org/torproject/ernie/web) to see how the descriptor etc.
archives are currently parsed / imported into Postgres and so on), to first
and foremost be able to evaluate the scope of what I'd like to write.

I will presently work on a more specific list of constraints for the
searchable archive project idea. I can then try producing a GSoC
application draft.

Just to get an idea of what kind of system I'd be building / working on -
at the very least, we'd be looking into:

   - (re)building the archival / metrics data update system - the proposed
   method in [1] was a simple rsync over ssh / etc. to keep the data in sync
   with the descriptor data collection point. If possible, it would help if
   the rsync could work with uncompressed archives - rsync is intelligent
   enough not to need to send *that* much excess data - and diffing is more
   efficient with uncompressed data.
   A simple rsync script (can be run as a cron job) would work here.

   - a Python script (probably to be run through cron) to import the
   archives into the DB. It can e.g. stat the files so that only new/modified
   ones need to be imported. The good thing about such an approach is that
   the script could work semi-standalone (it would still need the DB / ORM
   design), and could therefore be used in conjunction with other, different
   tools - and it would be built as an atomic target during the
   implementation process - I heard you guys like modular project design
   proposals ;) who doesn't like them!
   We already have metrics-utils/exonerator/exonerator.py (which works as a
   semantically-aware descriptor archive grep tool) - some of its archive
   parsing logic can maybe be reused - but the more pertinent thing here
   would be to

   - build the ORM for storing all the archival data in the DB. Postgres is
   preferred and could work, especially since a large part of the current
   ORM logic could probably be reused here (I've taken a glance at the
   current architecture and it makes good sense to me, but I haven't looked
   further, nor have I done any benchmarking with the existing ORM (except
   for some web-based relay search test queries, which don't really count)).

   - it is very important to build an ORM which would scale well data-wise,
   and would suit our queries well.

   - query logic and types - the idea would be to allow incremental
   query-building - on the SQL level, WHERE clauses can be incrementally
   attached to the higher-level ORM query object. For example, part of
   fingerprint / relay name -> add a date interval -> additionally, add a
   day-specific "on date" clause -> etc. ORM efficiency, benchmarking and
   restrictions on query complexity - this may well turn out to be a
   nontrivial matter.

   - on the user interface level, user query input parsing - the idea would
   be to allow for flexible user input; date interval / info not mandatory;
   possibly, the user may provide additional filtering by specifying more
   descriptor metadata. Would need to figure out the most accessible type of
   input - a simple input field with info about the flexible syntax with
   examples and a flexible parser, and/or optional additional fields - a
   responsive UI (can expand the input form into additional fields/options and
   try to avoid contradictory / nonsensical input) might prove effective here.

   - results page - would need to work out what kinds of relay metadata to
   show; the user can click on parts of the data on the results page to
   further narrow the results - unless they click on a relay name /
   fingerprint / etc., in which case they would be taken to that specific
   relay's page (in a more general sense, this would also be a kind of
   'result narrowing').

   - => generally create a browsable relay archive, formatting /
   aggregating specific descriptor fields to provide more info about each
   relay. <-- may make sense to start working on the particular constraint
   list starting from this point.
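
To make the "stat files to only import new/modified ones" point from the
list above a bit more concrete, here's a rough sketch of how the import
script might pick its work. It is only an illustration under my own
assumptions: the function names and the idea of keeping a snapshot of
(mtime, size) per file are made up for this example and are not part of
any existing Tor tool.

```python
import os


def scan(archive_dir):
    """Map each file's path (relative to archive_dir) to (mtime, size)."""
    seen = {}
    for root, _dirs, files in os.walk(archive_dir):
        for name in files:
            path = os.path.join(root, name)
            st = os.stat(path)
            rel = os.path.relpath(path, archive_dir)
            seen[rel] = (st.st_mtime, st.st_size)
    return seen


def changed_files(archive_dir, previous):
    """Return (sorted new-or-modified paths, fresh snapshot).

    `previous` is the snapshot returned by the last run (hypothetical:
    an empty dict on the very first run).
    """
    current = scan(archive_dir)
    changed = sorted(p for p, sig in current.items()
                     if previous.get(p) != sig)
    return changed, current
```

The cron job would keep the returned snapshot somewhere between runs and
hand only the changed paths to the actual parsing / DB import code.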

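Similarly, a quick sketch of the incremental query-building idea (part of
fingerprint -> date interval -> "on date" clause). Again, this is just an
assumption-laden illustration: the `descriptor` table, its columns and the
`RelayQuery` class are invented for the example, and sqlite3 is used only
so that the snippet is self-contained (the real thing would presumably sit
on top of Postgres and a proper ORM).

```python
import sqlite3


class RelayQuery:
    """Accumulates WHERE clauses; nothing hits the DB until run()."""

    def __init__(self, clauses=(), params=()):
        self.clauses = tuple(clauses)
        self.params = tuple(params)

    def _with(self, clause, *params):
        # Return a new object so that partial queries can be reused.
        return RelayQuery(self.clauses + (clause,), self.params + params)

    def fingerprint_prefix(self, prefix):
        return self._with("fingerprint LIKE ?", prefix.upper() + "%")

    def between(self, start, end):
        return self._with("published >= ? AND published < ?", start, end)

    def on_date(self, day):
        return self._with("date(published) = ?", day)

    def sql(self):
        where = " AND ".join("(%s)" % c for c in self.clauses) or "1"
        return ("SELECT fingerprint, nickname, published FROM descriptor "
                "WHERE " + where + " ORDER BY published")

    def run(self, conn):
        # User input only ever travels as bound parameters.
        return conn.execute(self.sql(), self.params).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE descriptor "
                 "(fingerprint TEXT, nickname TEXT, published TEXT)")
    conn.executemany("INSERT INTO descriptor VALUES (?, ?, ?)", [
        ("A1B2C3", "relayone", "2013-04-01 12:00:00"),
        ("A1FFEE", "relaytwo", "2013-04-15 08:30:00"),
        ("0DDBA1", "other", "2013-04-15 09:00:00"),
    ])
    broad = RelayQuery().fingerprint_prefix("a1")
    broad = broad.between("2013-04-01", "2013-05-01")
    narrowed = broad.on_date("2013-04-15")  # one more clause on top
    return broad.run(conn), narrowed.run(conn)
```

Each step returns a new query object, so a results page could keep the
current query around and attach one more clause whenever the user clicks
on something to narrow the results.
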
Hopefully Karsten can help me with the application, assuming my idea for
the project makes sense. :)

I will follow up with more details. Besides my email address <
kostas at jakeliunas.com>, I can be reached on #tor-dev as 'wfn', or
XMPP/Jabber via <phistopheles at jabber.org>.

Cheers to you all
Kostas.

[1] https://www.torproject.org/getinvolved/volunteer.html.en#metricsSearch

[2] My largest Python backend related project was this winter, building a
redis (product likes/dislikes) + mysql (everything else) product
recommendation solution to work with a large dataset of (product) metainfo
(such as user votes on products), and creating APIs (on top of Flask) (API
for the frontend CMS and for a mobile app) which include custom-per-user
product recommendation feeds, etc. Large data (nothing close to the Tor
descriptor/metrics archives, though!) + Interactive application logic
architectures are of interest to me.