[tor-dev] Python ExoneraTor

Wed Jun 25 13:09:35 UTC 2014

On 24/06/14 18:53, Kostas Jakeliunas wrote:
> Hi Karsten,

Hi Kostas,

> On Tue, Jun 17, 2014 at 10:13 AM, Karsten Loesing
> <karsten at torproject.org> wrote:
>> Hi Kostas,
>>
>> On 11/06/14 04:48, Kostas Jakeliunas wrote:
>>> Hi all!
>>>
>>> On Mon, Jun 9, 2014 at 10:22 AM, Karsten Loesing <karsten at torproject.org> wrote:
>>>> On 09/06/14 01:26, Damian Johnson wrote:
>>>>> Oh, and another quick thought - you once mentioned that a descriptor
>>>>> search service would make ExoneraTor obsolete, and in looking it over
>>>>> I agree. The search functionality ExoneraTor provides is trivial. The
>>>>> only reason it requires such a huge database is because it's storing a
>>>>> copy of every descriptor ever made.
>>>>>
>>>>> I suspect the actual right solution isn't to rewrite ExoneraTor at
>>>>> all, but rather develop a new service that can be queried for this
>>>>> descriptor data. That would make for a *much* more worthwhile project.
>>>>>
>>>>> ExoneraTor? Nice to have. Descriptor archive service? Damn useful. :)
>>>>
>>>> I agree, that was the idea behind Kostas' GSoC project last year.  And I
>>>> still think it's a good idea.  It's just not trivial to get right.
>>>
>>> Indeed, not trivial at all!
>>>
>>> I'll use this space to mention the running metrics archive backend
>>> modulo ExoneraTor stuff / what could be sorta-relevant here.
>>>
>>> fwiw, the onionoo-like backend is still running at an obscure address:port:
>>> http://ts.mkj.lt:5555/
>>
>> Would you want to put the summary you wrote here at that link?
> 
> Moved the whole setup to work on port 80 (via uWSGI, with nginx as the
> reverse proxy) ("ts.mkj.lt:5555/some/request" now transparently
> perma-redirects to "ts.mkj.lt/some/request"), and put a simple very
> short summary on the index:
> 
> http://ts.mkj.lt/
> (have you heard of this new edgy font, "Times New Roman"?)
> Let me know if something is too confusing or reads funny, etc. I can
> elaborate more in the beginning or after the examples, too.

Looks good to me!

>> And would you want me to add a sentence or two about your service
>> together with a link to the CollecTor page?
>>
>> https://collector.torproject.org/#references
> 
> Ok!
> 
>> What would I write?
> 
> something like? --
> 
> The Searchable Metrics Archive backend allows users to search and
> explore relay metrics data (consensuses and descriptors), present and
> past. It covers the years 2008-now and provides an Onionoo-like API.
> 
> does that make sense?

It does!  Tweaked a tiny bit and put online.

>>> TL;DR "what can I do with that" is: look at:
>>>
>>> https://github.com/wfn/torsearch/blob/master/docs/use_cases_examples.md
>>>
>>> In particular, regarding ExoneraTor-like queries (incl. arbitrary
>>> subnet / part-of-ip lookups):
>>>
>>> https://github.com/wfn/torsearch/blob/master/docs/use_cases_examples.md#exonerator-type-relay-participation-lookup
>>>
>>> Not sure if it's worth discussing all the weaknesses of this archive
>>> backend in this thread, but the short relevant version is that the
>>> ExoneraTor-like functionality does mostly work, but I would need to
>>> look into it to see how reliable the results are ("is this relay ip
>>> address field really the one we should be using?", etc.)
>>>
>>> But what's nice is that it is possible to do arbitrary queries on all
>>> consensuses since ~2008, with no date specified (if you don't want
>>> to.) (Which is to say, "it's possible", not necessarily "this is the
>>> right way to do the solution for the problems in this thread")
>>>
>>> So e.g. this is the ip address where moria runs, and we want to see
>>> what relays have ever run on it:
>>>
>>> http://ts.mkj.lt:5555/details?search=128.31.0.34
>>>
>>> Take the fingerprint of the one that is currently running (moria1),
>>> and look up its last 500 statuses (in a kind of condensed/summary
>>> form): http://ts.mkj.lt:5555/statuses?lookup=9695DFC35FFEB861329B9F1AB04C46397020CE31&condensed=true
>>>
>>> "from", "to" date ranges can be specified as e.g. 2009, 2009-02,
>>> 2009-02-10, 2009-02-10 02:00:00. limit/offset/parameters/etc.
>>> specified here:
>>> https://github.com/wfn/torsearch/blob/master/docs/onionoo_api.md
>>>
>>> (Descriptors/digests aren't currently included (I think they used to be),
>>> but they can be, etc.)
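
A minimal sketch of how the example queries above could be assembled from a script. The parameter names (search, lookup, condensed, from, to) are the ones quoted in this thread; the helper name build_query and the BASE_URL constant are made up for illustration, and the address is simply the one given above, which may have moved.

```python
from urllib.parse import urlencode

# Address quoted in this thread; treat it as an assumption, it may have moved.
BASE_URL = "http://ts.mkj.lt:5555"


def build_query(endpoint, **params):
    """Return a URL for an Onionoo-like endpoint such as 'details' or 'statuses'.

    build_query is a hypothetical helper, not part of torsearch.
    """
    # 'from' is a Python keyword, so callers pass from_ and the trailing
    # underscore is stripped here. Parameters set to None are skipped.
    cleaned = {
        key.rstrip("_"): value
        for key, value in params.items()
        if value is not None
    }
    return "{}/{}?{}".format(BASE_URL, endpoint, urlencode(cleaned))


if __name__ == "__main__":
    # All relays ever seen on one address.
    print(build_query("details", search="128.31.0.34"))
    # Condensed statuses for one fingerprint, restricted to a date range.
    print(build_query(
        "statuses",
        lookup="9695DFC35FFEB861329B9F1AB04C46397020CE31",
        condensed="true",
        from_="2009-02",
        to="2009-02-10",
    ))
```

This only builds the URLs; fetching and decoding the responses is left out because the response format is described in onionoo_api.md rather than in this thread.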
>>>
>>> The point is probably mostly about "this is some evidence that it can be done."
>>> ("But there are nuances, things are imperfect, time is needed, etc.")
>>>
>>> The question really is regarding the actual scope of this rewrite, I suppose.
>>>
>>> I'd probably agree with Karsten that just doing a port of the
>>> ExoneraTor functionality as it currently is on
>>> exonerator.torproject.org would be the safe bet. See how that goes,
>>> venture into more exotic lands later on maybe, etc. (That doesn't mean
>>> that I wouldn't be excited to put the current backend to good use,
>>> and/or use the knowledge I gained to help you folks in some way!)
>>>
>>>>
>>>> Regarding your comment about storing a copy of every descriptor ever
>>>> made, I believe that users trust ExoneraTor's results more if they see
>>>> the actual descriptors that lead to results.  Of course, I'm saying that
>>>> without knowing what ExoneraTor users actually want.  But let's not drop
>>>> descriptor copies from the database easily.
>>>>
>>>> And, heh, when you say that the search functionality ExoneraTor provides
>>>> is trivial, a little part of me is dying.  It's the part that spent a
>>>> few weeks on getting the search functionality fast enough for
>>>> production.  That was not at all trivial.  The oraddress24, oraddress48,
>>>> and exitaddress24 fields as well as the indexes are the result of me
>>>> running lots and lots of sample queries and wondering about Postgres'
>>>> EXPLAIN ANALYZE results.  Just saying that it's not going to be trivial
>>>> to generalize the search functionality towards other fields than IP
>>>> addresses and dates.
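
To illustrate the idea behind prefix columns like oraddress24 and oraddress48: a rough sketch, assuming they hold the hex-encoded first 3 bytes (/24) of an IPv4 address and the first 6 bytes (/48) of an IPv6 address. That encoding is an assumption based on the column names; exonerator.sql linked in this thread is the authority. The function name address_prefix_hex is made up.

```python
import ipaddress


def address_prefix_hex(address):
    """Return the hex-encoded /24 (IPv4) or /48 (IPv6) prefix of an address.

    Assumption: this mirrors what a precomputed prefix column would store,
    so a "same network" lookup becomes an equality match on an indexed
    column instead of a range scan over full addresses.
    """
    ip = ipaddress.ip_address(address)
    prefix_bytes = 3 if ip.version == 4 else 6
    return ip.packed[:prefix_bytes].hex()


if __name__ == "__main__":
    print(address_prefix_hex("128.31.0.34"))
    print(address_prefix_hex("2001:db8:1:2::1"))
```

Two addresses are in the same /24 (or /48) exactly when this function returns the same string for both, which is what makes a combined (prefix, date) index usable for neighbourhood searches.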
>>>
>>> Hear hear, I can only imagine! These things, and the ExoneraTor stuff,
>>> are not easy to do in a way that would provide **consistently**
>>> good/great performance.
>>>
>>> I also spent some days last summer looking at EXPLAIN ANALYZE
>>> results, and eventually things started making sense (it was a great
>>> feeling to start to understand what they mean and how I can make
>>> them better). (And when they do, I also get that same feeling that
>>> NoSQL stuff doesn't magically solve things.)
>>>
>>>>
>>>> If others want to follow, here's the SQL code I'm talking about:
>>>>
>>>> https://gitweb.torproject.org/exonerator.git/blob/HEAD:/db/exonerator.sql
>>>>
>>>> So, I'm happy to talk about writing a searchable descriptor archive.  It
>>>> could _start_ with ExoneraTor's functionality (minus the target address
>>>> and port thing discussed in that other email), and then we could
>>>> consider adding more searches.
>>>
>>> fwiw, imho this sounds like a sane plan to me. (Of course it could
>>> also be possible to work on the onionoo-like archive backend (or fork
>>> it, or smash it into parts and steal some of them, etc.), but I can
>>> see why this might yield unclear deliverables, etc.) (So a short
>>> document of "what is wanted" would help, yeah.)
>>>
>>>>
>>>> Pretty sure that Kostas is reading this (in fact, I just cc'ed him), so
>>>> let me make one remark about optimizing Postgres defaults: I wrote quite
>>>> a few database queries in the past, and some of them perform horribly
>>>> (relay search) whereas others perform really well (ExoneraTor).  I
>>>> believe that the majority of performance gains can be achieved by
>>>> designing good tables, indexes, and queries.  Only as a last resort
>>>> should we consider optimizing the Postgres defaults.
>>>
>>> Ha, at this point I probably have a sort of "premature optimizer"
>>> label in your mind, Karsten. :) (And I kinda deserved it by at one
>>> point focusing on very-low-level postgres caching mechanisms last
>>> summer, etc etc.)
>>>
>>> I've actually come to really appreciate good schema and query
>>> design[1] and the wonders that they do. That being said, I'd actually
>>> be curious to know how large the indexes of relay-search and current
>>> exonerator are.[2] I (still) bet increasing postgres' shared_buffers
>>> and effective_cache_size (totally normal practice!) might help! (Oh,
>>> is this one of those vim-vs-emacs things? If it is, sorry.)
>>
>> I just deleted most of the database contents behind the relay-search
>> service a few days ago.  But I might even have agreed there that some
>> PostgreSQL tweaking would have helped.  It was a bad database design,
>> mostly because it was built for a different purpose (data aggregation
>> for metrics website), so it's a bad example.
>>
>> But let me give you some numbers on current ExoneraTor (manually deleted
>> part of the output which we don't care about here):
>>
>> exonerator=> \dt+
>>      Name      |  Size
>> ---------------+--------
>>  consensus     | 16 GB
>>  descriptor    | 31 GB
>>  exitlistentry | 558 MB
>>  statusentry   | 50 GB
>> (4 rows)
>>
>> exonerator=> \di+
>>                   Name                   |     Table     |  Size
>> -----------------------------------------+---------------+---------
>>  consensus_pkey                          | consensus     | 1280 kB
>>  descriptor_pkey                         | descriptor    | 1930 MB
>>  exitlistentry_exitaddress24_scanneddate | exitlistentry | 82 MB
>>  exitlistentry_exitaddress_scanneddate   | exitlistentry | 82 MB
>>  exitlistentry_pkey                      | exitlistentry | 173 MB
>>  statusentry_oraddress24_validafterdate  | statusentry   | 5470 MB
>>  statusentry_oraddress48_validafterdate  | statusentry   | 4629 MB
>>  statusentry_oraddress_validafterdate    | statusentry   | 5509 MB
>>  statusentry_pkey                        | statusentry   | 10 GB
>> (9 rows)
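
For anyone wanting to compare these numbers against available memory, a small sketch that adds up the sizes listed above. It assumes psql's units are binary (1 GB = 1024 MB) and that the displayed values are rounded, so the totals are approximate. The helper names are made up.

```python
# Conversion factors to MB, assuming binary units as printed by psql.
UNITS_IN_MB = {"kB": 1 / 1024, "MB": 1, "GB": 1024}

# Sizes copied from the \dt+ and \di+ output above.
TABLE_SIZES = ["16 GB", "31 GB", "558 MB", "50 GB"]
INDEX_SIZES = [
    "1280 kB", "1930 MB", "82 MB", "82 MB", "173 MB",
    "5470 MB", "4629 MB", "5509 MB", "10 GB",
]


def size_in_mb(text):
    """Convert a string like '558 MB' to a number of megabytes."""
    number, unit = text.split()
    return float(number) * UNITS_IN_MB[unit]


def total_gb(sizes):
    """Sum a list of size strings and return the total in gigabytes."""
    return sum(size_in_mb(size) for size in sizes) / 1024


if __name__ == "__main__":
    print("tables:  {:.1f} GB".format(total_gb(TABLE_SIZES)))
    print("indexes: {:.1f} GB".format(total_gb(INDEX_SIZES)))
```

The index total is the more interesting figure when thinking about shared_buffers and effective_cache_size, since it hints at how much of the index data could stay cached.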
> 
> Looks nice! :) thanks! (just for fun, the largest index on my side is
> one "statusentry_substr_validafter_idx", which is an index on two
> columns (a (SUBSTR() of) relay nickname and the consensus valid after
> (DESC)), and it's currently at 7004 MB.)
> Anyway, "these sizes make sense" is all I can think of right now!

Good to hear.

>> Happy to run some EXPLAIN ANALYZE queries for you if you tell me what to
>> run.
> 
> okay, maybe I'll think of something some time, and if I do, I can
> either open a ticket, or create a new email thread, unless this is
> kind-of-ok for this thread.
> 
> (Regarding "what part of $something is in memory", I remember the
> "disk read" (or was it "buffer read") words in EXPLAIN ANALYZE being
> useful. Also, sometimes postgres really mis-assumes on how much it'll
> have to read, and how much it ends up reading (it's all there in the
> results iirc, but you probably know all that.) In which case a VACUUM
> should help (of course), etc.)
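
A rough sketch of pulling those two pieces of information out of plan text, assuming the plan was produced with EXPLAIN (ANALYZE, BUFFERS) in PostgreSQL's default text format. The sample plan is made up for illustration (it is not real ExoneraTor output), and the function name summarize_plan is hypothetical.

```python
import re

# Matches the estimate and the measured values on a plan node line.
NODE_RE = re.compile(
    r"\(cost=[\d.]+ rows=(\d+) width=\d+\) "
    r"\(actual time=[\d.]+ rows=([\d.]+) loops=\d+\)"
)
# Matches the shared-buffer counters, e.g. "Buffers: shared hit=12 read=30".
BUFFERS_RE = re.compile(r"Buffers: shared((?: \w+=\d+)+)")

# Made-up sample plan, for illustration only.
SAMPLE_PLAN = """\
Index Scan using some_index on some_table  (cost=0.57..8.59 rows=1 width=100) (actual time=0.050..1.200 rows=250 loops=1)
  Buffers: shared hit=12 read=30
"""


def summarize_plan(plan_text):
    """Return ([(estimated_rows, actual_rows), ...], buffer_counts_or_None).

    Only the first Buffers line is used: a node's counters include those
    of its children, so the top node already carries the totals.
    """
    estimates = []
    buffers = None
    for line in plan_text.splitlines():
        node = NODE_RE.search(line)
        if node:
            estimates.append((int(node.group(1)), float(node.group(2))))
        found = BUFFERS_RE.search(line)
        if found and buffers is None:
            counts = dict(pair.split("=") for pair in found.group(1).split())
            buffers = {
                "hit": int(counts.get("hit", 0)),    # pages found in cache
                "read": int(counts.get("read", 0)),  # pages read from disk/OS
            }
    return estimates, buffers


if __name__ == "__main__":
    print(summarize_plan(SAMPLE_PLAN))
```

A large gap between estimated and actual rows suggests stale planner statistics, and a high "read" count relative to "hit" suggests the relevant pages were not in shared buffers.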

Feel free to start a new thread or create a ticket for this.  To be
honest, I haven't run EXPLAIN ANALYZE on this database in quite a while.
I just assume everything works fine.

>> If we're going to optimize the ExoneraTor database, should we move this
>> discussion to a ticket?
> 
> Derailment with technicalities is always a looming danger I guess, but
> at this point I'm not even sure what you and Damian (and possibly
> others) are planning to do with the current ExoneraTor. I assume
> current ExoneraTor performance is good as it currently stands, so this
> part of the thread/thoughtspace can be closed for the time being as
> far as I can see. (And I could open a ticket if I think of something
> interesting to do regarding diagnosing/optimizing the ExoneraTor
> database.)
> 
> I suppose there's still no consensus whether a python-exonerator
> should aim to replicate current ExoneraTor's functionality (and, say,
> use the current database), or whether it should do more(tm). (Happy to
> participate in some form of discussion at the dev meeting, if my input
> can be useful!)

Damian won't be in Paris, AFAIK.  But sure, happy to discuss more next week.

All the best,
Karsten