[tor-dev] Python ExoneraTor

Tue Jun 24 16:53:58 UTC 2014

Hi Karsten,

On Tue, Jun 17, 2014 at 10:13 AM, Karsten Loesing
<karsten at torproject.org> wrote:
> Hi Kostas,
>
> On 11/06/14 04:48, Kostas Jakeliunas wrote:
>> Hi all!
>>
>> On Mon, Jun 9, 2014 at 10:22 AM, Karsten Loesing <karsten at torproject.org> wrote:
>>> On 09/06/14 01:26, Damian Johnson wrote:
>>>> Oh, and another quick thought - you once mentioned that a descriptor
>>>> search service would make ExoneraTor obsolete, and in looking it over
>>>> I agree. The search functionality ExoneraTor provides is trivial. The
>>>> only reason it requires such a huge database is because it's storing a
>>>> copy of every descriptor ever made.
>>>>
>>>> I suspect the actual right solution isn't to rewrite ExoneraTor at
>>>> all, but rather develop a new service that can be queried for this
>>>> descriptor data. That would make for a *much* more worthwhile project.
>>>>
>>>> ExoneraTor? Nice to have. Descriptor archive service? Damn useful. :)
>>>
>>> I agree, that was the idea behind Kostas' GSoC project last year.  And I
>>> still think it's a good idea.  It's just not trivial to get right.
>>
>> Indeed, not trivial at all!
>>
>> I'll use this space to mention the running metrics archive backend
>> modulo ExoneraTor stuff / what could be sorta-relevant here.
>>
>> fwiw, the onionoo-like backend is still running at an obscure address:port:
>> http://ts.mkj.lt:5555/
>
> Would you want to put the summary you wrote here to that link?

I moved the whole setup to port 80 (uWSGI behind nginx as the reverse
proxy). "ts.mkj.lt:5555/some/request" now transparently perma-redirects
to "ts.mkj.lt/some/request", and I put a very short summary on the
index:

http://ts.mkj.lt/
(have you heard of this new edgy font, "Times New Roman"?)
Let me know if something is too confusing or reads funny, etc. I can
elaborate more in the beginning or after the examples, too.

> And would you want me to add a sentence or two about your service
> together with a link to the CollecTor page?
>
> https://collector.torproject.org/#references

Ok!

> What would I write?

Something like this, maybe?

The Searchable Metrics Archive backend allows users to search and
explore relay metrics data (consensuses and descriptors), present and
past. It covers the years 2008-now and provides an Onionoo-like API.

does that make sense?

>> TL;DR "what can I do with that" is: look at:
>>
>> https://github.com/wfn/torsearch/blob/master/docs/use_cases_examples.md
>>
>> In particular, regarding ExoneraTor-like queries (incl. arbitrary
>> subnet / part-of-ip lookups):
>>
>> https://github.com/wfn/torsearch/blob/master/docs/use_cases_examples.md#exonerator-type-relay-participation-lookup
>>
>> Not sure if it's worth discussing all the weaknesses of this archive
>> backend in this thread, but the short relevant version is that the
>> ExoneraTor-like functionality does mostly work, but I would need to
>> look into it to see how reliable the results are ("is this relay ip
>> address field really the one we should be using?", etc.)
>>
>> But what's nice is that it is possible to do arbitrary queries on all
>> consensuses since ~2008, with no date specified (if you don't want
>> to.) (Which is to say, "it's possible", not necessarily "this is the
>> right way to do the solution for the problems in this thread")
>>
>> So e.g. this is the ip address where moria runs, and we want to see
>> what relays have ever run on it:
>>
>> http://ts.mkj.lt:5555/details?search=128.31.0.34
>>
>> Take the fingerprint of the one that is currently running (moria1),
>> and look up its last 500 statuses (in a kind of condensed/summary
>> form): http://ts.mkj.lt:5555/statuses?lookup=9695DFC35FFEB861329B9F1AB04C46397020CE31&condensed=true
>>
>> "from", "to" date ranges can be specified as e.g. 2009, 2009-02,
>> 2009-02-10, 2009-02-10 02:00:00. limit/offset/parameters/etc.
>> specified here:
>> https://github.com/wfn/torsearch/blob/master/docs/onionoo_api.md
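For anyone who wants to script these lookups, here is a minimal Python
sketch that only builds the request URLs. The parameter names (search,
lookup, condensed, from, to) are taken from the examples above; the
helper name build_url is made up for illustration, the base URL assumes
the port 80 setup, and nothing here actually contacts the server.

```python
from urllib.parse import urlencode

BASE_URL = "http://ts.mkj.lt"  # assumption: the port 80 setup


def build_url(document, base_url=BASE_URL, **params):
    """Return a query URL for one document type, e.g. 'details' or 'statuses'."""
    # A trailing underscore lets callers pass reserved words such as from_.
    cleaned = {
        key.rstrip("_"): value
        for key, value in params.items()
        if value is not None
    }
    query = urlencode(cleaned)
    return f"{base_url}/{document}?{query}" if query else f"{base_url}/{document}"


# Which relays have ever run on this address?
print(build_url("details", search="128.31.0.34"))

# Condensed statuses of one relay, restricted to a date range.
print(build_url(
    "statuses",
    lookup="9695DFC35FFEB861329B9F1AB04C46397020CE31",
    condensed="true",
    from_="2009-02",
    to="2009-02-10 02:00:00",
))
```

Spaces and colons in a full timestamp get percent/plus-encoded by
urlencode, which is ordinary form encoding.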
>>
>> (Descriptors/digests aren't currently included (I think they used to),
>> but they can be, etc.)
>>
>> The point is probably mostly about "this is some evidence that it can be done."
>> ("But there are nuances, things are imperfect, time is needed, etc.")
>>
>> The question really is regarding the actual scope of this rewrite, I suppose.
>>
>> I'd probably agree with Karsten that just doing a port of the
>> ExoneraTor functionality as it currently is on
>> exonerator.torproject.org would be the safe bet. See how that goes,
>> venture into more exotic lands later on maybe, etc. (That doesn't mean
>> that I wouldn't be excited to put the current backend to good use,
>> and/or use the knowledge I gained to help you folks in some way!)
>>
>>>
>>> Regarding your comment about storing a copy of every descriptor ever
>>> made, I believe that users trust ExoneraTor's results more if they see
>>> the actual descriptors that lead to results.  Of course, I'm saying that
>>> without knowing what ExoneraTor users actually want.  But let's not drop
>>> descriptor copies from the database easily.
>>>
>>> And, heh, when you say that the search functionality ExoneraTor provides
>>> is trivial, a little part of me is dying.  It's the part that spent a
>>> few weeks on getting the search functionality fast enough for
>>> production.  That was not at all trivial.  The oraddress24, oraddress48,
>>> and exitaddress24 fields as well as the indexes are the result of me
>>> running lots and lots of sample queries and wondering about Postgres'
>>> EXPLAIN ANALYZE results.  Just saying that it's not going to be trivial
>>> to generalize the search functionality towards other fields than IP
>>> addresses and dates.
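To illustrate the idea behind such prefix columns as I understand it:
if the network prefix is stored next to the full address, a "same /24"
lookup becomes a plain equality match on an indexed column. A rough
Python sketch follows. It assumes the fields hold the hex-encoded
leading 3 bytes (IPv4, /24) or 6 bytes (IPv6, /48) of the address; the
exact encoding in exonerator.sql may differ, and address_prefix_hex is
a made-up name.

```python
import ipaddress


def address_prefix_hex(address):
    """Return the hex-encoded leading bytes of an address.

    3 bytes (a /24) for IPv4, 6 bytes (a /48) for IPv6.
    """
    ip = ipaddress.ip_address(address)
    prefix_bytes = 3 if ip.version == 4 else 6
    return ip.packed[:prefix_bytes].hex()


print(address_prefix_hex("128.31.0.34"))   # IPv4: first three octets
print(address_prefix_hex("2001:db8::1"))   # IPv6: first 48 bits
```

Two addresses in the same /24 then map to the same six-character
string, so the database never has to do range or pattern matching on
the address itself.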
>>
>> Hear hear, I can only imagine! These things, exonerator included, are
>> not easy to do in a way that would provide **consistently**
>> good/great performance.
>>
>> I spent some days of the last summer also looking at EXPLAIN ANALYZE
>> results (it was a great feeling to start to understand what they mean
>> and how I can make them better), but eventually things start making
>> sense. (And when they do, I also get that same feeling that NoSQL
>> stuff doesn't magically solve things.)
>>
>>>
>>> If others want to follow, here's the SQL code I'm talking about:
>>>
>>> https://gitweb.torproject.org/exonerator.git/blob/HEAD:/db/exonerator.sql
>>>
>>> So, I'm happy to talk about writing a searchable descriptor archive.  It
>>> could _start_ with ExoneraTor's functionality (minus the target address
>>> and port thing discussed in that other email), and then we could
>>> consider adding more searches.
>>
>> fwiw, imho this sounds like a sane plan to me. (Of course it could
>> also be possible to work on the onionoo-like archive backend (or fork
>> it, or smash it into parts and steal some of them, etc.), but I can see
>> why this might yield unclear deliverables, etc.) (So a short document
>> of "what is wanted" would help, yeah.)
>>
>>>
>>> Pretty sure that Kostas is reading this (in fact, I just cc'ed him), so
>>> let me make one remark about optimizing Postgres defaults: I wrote quite
>>> a few database queries in the past, and some of them perform horribly
>>> (relay search) whereas others perform really well (ExoneraTor).  I
>>> believe that the majority of performance gains can be achieved by
>>> designing good tables, indexes, and queries.  Only as a last resort
>>> should we consider optimizing the Postgres defaults.
>>
>> Ha, at this point I probably have a sort of "premature optimizer"
>> label in your mind, Karsten. :) (And I kinda deserved it by at one
>> point focusing on very-low-level postgres caching mechanisms last
>> summer, etc etc.)
>>
>> I've actually come to really appreciate good schema and query
>> design[1] and the wonders that they do. That being said, I'd actually
>> be curious to know how large the indexes of relay-search and current
>> exonerator are.[2] I (still) bet increasing postgres' shared_buffers
>> and effective_cache_size (totally normal practice!) might help! (Oh,
>> is this one of those vim-vs-emacs things? If it is, sorry.)
>
> I just deleted most of the database contents behind the relay-search
> service a few days ago.  But I might even have agreed there that some
> PostgreSQL tweaking would have helped.  It was a bad database design,
> mostly because it was built for a different purpose (data aggregation
> for metrics website), so it's a bad example.
>
> But let me give you some numbers on current ExoneraTor (manually deleted
> part of the output which we don't care about here):
>
> exonerator=> \dt+
>      Name      |  Size
> ---------------+--------
>  consensus     | 16 GB
>  descriptor    | 31 GB
>  exitlistentry | 558 MB
>  statusentry   | 50 GB
> (4 rows)
>
> exonerator=> \di+
>                   Name                   |     Table     |  Size
> -----------------------------------------+---------------+---------
>  consensus_pkey                          | consensus     | 1280 kB
>  descriptor_pkey                         | descriptor    | 1930 MB
>  exitlistentry_exitaddress24_scanneddate | exitlistentry | 82 MB
>  exitlistentry_exitaddress_scanneddate   | exitlistentry | 82 MB
>  exitlistentry_pkey                      | exitlistentry | 173 MB
>  statusentry_oraddress24_validafterdate  | statusentry   | 5470 MB
>  statusentry_oraddress48_validafterdate  | statusentry   | 4629 MB
>  statusentry_oraddress_validafterdate    | statusentry   | 5509 MB
>  statusentry_pkey                        | statusentry   | 10 GB
> (9 rows)

Looks nice! :) Thanks! (Just for fun: the largest index on my side is
"statusentry_substr_validafter_idx", a two-column index on a SUBSTR()
of the relay nickname and the consensus valid-after time (DESC); it's
currently at 7004 MB.)
Anyway, "these sizes make sense" is all I can think of right now!
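
For a rough feel of how the indexes compare to available memory, here
is a small Python sketch that adds up the index sizes from the \di+
output above. It assumes psql's units are powers of 1024 and ignores
that the printed values are already rounded (e.g. "10 GB"), so the
total is only approximate; parse_size is a made-up helper name.

```python
UNITS = {"bytes": 1, "kB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_size(text):
    """Convert a size string such as '5470 MB' into a number of bytes."""
    value, unit = text.split()
    return int(float(value) * UNITS[unit])


# Index sizes as printed in the \di+ output quoted above.
INDEX_SIZES = {
    "consensus_pkey": "1280 kB",
    "descriptor_pkey": "1930 MB",
    "exitlistentry_exitaddress24_scanneddate": "82 MB",
    "exitlistentry_exitaddress_scanneddate": "82 MB",
    "exitlistentry_pkey": "173 MB",
    "statusentry_oraddress24_validafterdate": "5470 MB",
    "statusentry_oraddress48_validafterdate": "4629 MB",
    "statusentry_oraddress_validafterdate": "5509 MB",
    "statusentry_pkey": "10 GB",
}

total_bytes = sum(parse_size(size) for size in INDEX_SIZES.values())
print(f"all indexes together: {total_bytes / 1024**3:.1f} GB")
```

That comes to roughly 27 GB of indexes, most of it on statusentry.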

> Happy to run some EXPLAIN ANALYZE queries for you if you tell me what to
> run.

okay, maybe I'll think of something some time, and if I do, I can
either open a ticket, or create a new email thread, unless this is
kind-of-ok for this thread.

(Regarding "what part of $something is in memory": I remember the
"disk read" (or was it "buffer read"?) figures in EXPLAIN ANALYZE being
useful. Also, postgres sometimes badly mis-estimates how much it will
have to read compared to how much it ends up reading (both numbers are
in the results iirc, but you probably know all that). In that case a
VACUUM ANALYZE, which refreshes the planner statistics, should help.)
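
To make the "estimated vs. actual" point concrete, here is a small
Python sketch that scans EXPLAIN ANALYZE text output and flags plan
nodes whose row estimate is far off. The sample plan and all its
numbers are invented for illustration (they are not from ExoneraTor),
and misestimates is a made-up helper. The "Buffers: shared hit=...
read=..." line shows up when running EXPLAIN (ANALYZE, BUFFERS).

```python
import re

# Matches the "(cost=... rows=N width=W) (actual time=... rows=M loops=L)"
# part of a plan node line in EXPLAIN ANALYZE text output.
NODE_RE = re.compile(
    r"\(cost=\d+\.\d+\.\.\d+\.\d+ rows=(?P<estimated>\d+) width=\d+\) "
    r"\(actual time=\d+\.\d+\.\.\d+\.\d+ rows=(?P<actual>[\d.]+) loops=\d+\)"
)


def misestimates(plan_text, factor=10.0):
    """Yield (line, estimated, actual) for nodes off by at least `factor`."""
    for line in plan_text.splitlines():
        match = NODE_RE.search(line)
        if not match:
            continue  # e.g. Buffers lines or timing summaries
        estimated = int(match["estimated"])
        actual = int(float(match["actual"]))
        # Clamp the denominator to 1 so zero-row nodes do not divide by zero.
        ratio = max(estimated, actual) / max(min(estimated, actual), 1)
        if ratio >= factor:
            yield line.strip(), estimated, actual


# Invented sample output, for illustration only.
SAMPLE_PLAN = """\
Index Scan using some_idx on statusentry  (cost=0.00..812.40 rows=200 width=64) (actual time=0.051..95.310 rows=48000 loops=1)
  Buffers: shared hit=120 read=3400
Seq Scan on consensus  (cost=0.00..35.50 rows=2550 width=8) (actual time=0.010..0.400 rows=2600 loops=1)
"""

for line, estimated, actual in misestimates(SAMPLE_PLAN):
    print(f"estimated {estimated} rows, got {actual}: {line[:40]}...")
```

Nodes flagged this way are the first place I'd look when a query is
unexpectedly slow.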

> If we're going to optimize the ExoneraTor database, should we move this
> discussion to a ticket?

Derailment with technicalities is always a looming danger I guess, but
at this point I'm not even sure what you and Damian (and possibly
others) are planning to do with the current ExoneraTor. I assume
current ExoneraTor performance is good as it currently stands, so this
part of the thread/thoughtspace can be closed for the time being as
far as I can see. (And I could open a ticket if I think of something
interesting to do regarding diagnosing/optimizing the ExoneraTor
database.)

I suppose there's still no consensus whether a python-exonerator
should aim to replicate current ExoneraTor's functionality (and, say,
use the current database), or whether it should do more(tm). (Happy to
participate in some form of discussion at the dev meeting, if my input
can be useful!)

best wishes
Kostas

> All the best,
> Karsten
>
>
>> But the point is that (1) there is no free lunch (to invoke a cliche),
>> and (2) postgresql can really do wonders and scale well when used right.
>>
>>>
>>> You realize that a searchable descriptor archive focuses much more on
>>> database optimization than the ExoneraTor rewrite from Java to Python
>>> (which would leave the database untouched)?
>>>
>>
>> "leaving database untouched" probably implies (very) significantly
>> less work, so it would be a nice/clear starting point. (caveat, i may
>> be lacking context, etc.)
>>
>>
>> [1]: also, fun things like "sometimes indexes won't be used because a
>> sequential read will be faster: if the parts of the indexes to be used
>> are scattered across the disk (not all of them are in memory),
>> random seek + read a bit into memory + repeat is slower than 'just
>> read a lot of contiguous data into memory'", etc etc.
>>
>> [2]: if you're feeling adventurous, you can run this on each of the
>> postgres databases, to see how large the indexes (among all other
>> things) are, and which parts of them are loaded into memory:
>> https://github.com/wfn/torsearch/blob/master/misc/buffercache.sql
>>
>> --
>>
>> Kostas.
>>
>> 0x0e5dce45 @ pgp.mit.edu
>>
>