[tor-dev] Python ExoneraTor

Kostas Jakeliunas kostas at jakeliunas.com
Wed Jun 11 03:02:26 UTC 2014


On Tue, Jun 10, 2014 at 10:38 AM, Karsten Loesing
<karsten at torproject.org> wrote:
> On 10/06/14 05:41, Damian Johnson wrote:
>>>> let me make one remark about optimizing Postgres defaults: I wrote quite
>>>> a few database queries in the past, and some of them perform horribly
>>>> (relay search) whereas others perform really well (ExoneraTor).  I
>>>> believe that the majority of performance gains can be achieved by
>>>> designing good tables, indexes, and queries.  Only as a last resort we
>>>> should consider optimizing the Postgres defaults.
>>>>
>>>> You realize that a searchable descriptor archives focuses much more on
>>>> database optimization than the ExoneraTor rewrite from Java to Python
>>>> (which would leave the database untouched)?
>>>
>>> Are other datastore models such as splunk or MongoDB useful?
>>> [splunk has a free yet proprietary limited binary... those having
>>> historical woes and takebacks, mentioned just for example here.]
>>
>> Earlier I mentioned the idea of Dynamo. Unless I'm mistaken this lends
>> itself pretty naturally to addresses as a hash key, and descriptor
>> dates as the range key. Lookups would then be O(log(n)) where n is the
>> total number of descriptors an address has published (... that is to
>> say very, very quick).
>>
>> This would be a fun project to give Boto a try. *sigh*... there really
>> should be more hours in the day...
>
> Quoting my reply to Damian to a similar question earlier in the thread:
>
>> I'm wary about moving to another database, especially NoSQL ones and/or cloud-based ones.  They don't magically make things faster, and Postgres is something I understand quite well by now. [...] Not saying that DymanoDB can't be the better choice, but switching the database is not a priority for me.
>
> If somebody wants to give, say, MongoDB a try, I'd be interested in
> seeing the performance comparison to the current Postgres schema.  When
> you do, please consider all three search_* functions that the current
> schema offers, including searches for other IPv4 addresses in the same
> /24 and other IPv6 addresses in the same /48.

Personally, the only NoSQL thing I've come across (and have had some
really good experiences with in the past) was Redis, which is a kind
of key-value store-in-memory, with some nice simple data structures
(like sets, and operations on sets. So if you can reduce your problem
to (e.g.) sets and set operations, Redis might be a good fit.)

(I think that isis is actually experimenting with Redis right now, to
do prop 226-bridgedb-database-improvements.txt)

If the things that you store in Redis can't be made to fit into
memory, you'll probably have a bad time.

So to generalize, if some relational data which needs to be searchable
can be made to fit into memory ("we can guarantee it wouldn't exceed x
GB [for t time]"), offloading that part onto some key-value (or some
more elaborate) system *might* make sense.

Also, I mixed up the link in footnote [2]. It should have linked to
this diagnostic postgres query:

https://github.com/wfn/torsearch/blob/master/misc/list_indexes_in_memory.sql

--

regards
Kostas


More information about the tor-dev mailing list