[tor-dev] [GSoC 2013] Status report - Searchable metrics archive
Kostas Jakeliunas
kostas at jakeliunas.com
Tue Aug 13 13:17:29 UTC 2013
On Tue, Aug 13, 2013 at 2:15 PM, Karsten Loesing <karsten at torproject.org> wrote:
>
> I suggest putting pg_prewarm on the future work list. I sense there's a
> lot of unused potential in stock PostgreSQL. Tweaking the database at
> this point has the word "premature optimization" written on it in big
> letters for me.
> Also, to be very clear here, a tool that requires custom tweaks to
> PostgreSQL has minimal chances of running on torproject.org machines in
> the future. The current plan is that we'll have a dedicated database
> machine operated by our sysadmins that not even the service operator
> will have shell access to.
Oh, understood then, OK, no extensions (at least) for now.
Apropos: given my current (limited) understanding, it might be difficult to
support, for example, nickname substring searches without a (supported,
official) extension. One such extension is pg_trgm [1], which lives in the
contrib/ directory in 9.1 and is just one make install away. But for now,
I'll assume this is not possible / we should avoid it.
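(For future reference, a substring search via pg_trgm might look roughly
like this; a sketch only, with the table/column names assumed from our
current schema:)

```sql
-- Sketch only (future work): trigram index for substring search.
-- Assumes a "descriptor" table with "nickname" and "fingerprint" columns.
CREATE EXTENSION pg_trgm;  -- 9.1+; the module lives in contrib/

CREATE INDEX descriptor_nickname_trgm_idx
    ON descriptor USING gin (lower(nickname) gin_trgm_ops);

-- As of 9.1, a trigram GIN index can serve LIKE queries directly:
SELECT nickname, fingerprint
  FROM descriptor
 WHERE lower(nickname) LIKE '%moria%';
```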
> So, why do you join descriptors and network statuses in the search
> process? At the Munich dev meeting I suggested joining the tables
> already in the import process. What do you think about that idea?
Yes, I made a half-hearted attempt to normalize the two tables some time
ago, for a small number of descriptors and status entries; I'll try this
scheme in full after I try something else first. (To see whether it scales,
I'll need to re-import a major part of the data, which I didn't do back
then.) The other idea is a third table of unique fingerprints (the
statusentry table currently holds ~170K unique fingerprints vs. ~67M rows
in total) and (non-unique) nicknames, for truly quick fingerprint lookup
and nickname search. I experimented with this as well, but again only on a
small subset of the data; I think I can do a better job now.
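(A rough sketch of what I mean by the third table; the schema details here
are tentative, not the final design:)

```sql
-- Tentative sketch: one row per unique fingerprint (~170K rows),
-- instead of repeating the fingerprint across ~67M statusentry rows.
CREATE TABLE fingerprint (
    id          serial PRIMARY KEY,
    fingerprint character(40) NOT NULL UNIQUE,  -- hex-encoded
    nickname    character varying(19)           -- latest seen; not unique
);

-- statusentry rows would then carry the small integer key instead:
-- ALTER TABLE statusentry ADD COLUMN fingerprint_id integer
--     REFERENCES fingerprint (id);
```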
It had seemed to me that the bottleneck was having to sort too large a
number of rows, but now I understand the 'explain analyze' output (if only
a bit) better: the 'Nested Loop' node, which is what executes the join in
the query discussed, is expensive and is part of the bottleneck, so to
speak. I'll look into that after properly benchmarking things with the
third table. (By the way, for future reference: we do have to test
different ideas on a substantial subset of the overall data, as performance
does not scale linearly.) :)
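(For anyone following along, the output I mean comes from prefixing the
query with EXPLAIN ANALYZE; the query below is a simplified stand-in for
the real join, with column names assumed:)

```sql
-- Simplified stand-in for the join query; column names are assumptions.
EXPLAIN ANALYZE
SELECT s.validafter, d.nickname
  FROM statusentry s
  JOIN descriptor d
    ON substr(lower(s.digest), 0, 12) = substr(lower(d.digest), 0, 12)
 WHERE lower(d.nickname) = 'moria1'
 ORDER BY s.validafter DESC
 LIMIT 50;
-- The plan shows e.g. "Nested Loop (actual time=...)" nodes, with
-- per-node row counts and timings, so the expensive step stands out.
```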
> > https://github.com/wfn/torsearch/blob/master/misc/nested_join.sql
> >
> > We use the following indexes while executing that query:
> >
> > * lower(nickname) on descriptor
> >
> > * (substr(fingerprint, 0, 12), substr(lower(digest), 0, 12)) on
> statusentry
>
> Using only the first 12 characters sounds like a fine approach to speed
> up things. But why 12? Why not 10 or 14? This is probably something
> you should annotate as parameter to find a good value for later in the
> process. (I'm not saying that 12 is a bad number. It's perfectly fine
> for now, but it might not be the best number.)
>
Yes, this is as unscientific as it gets. As of now we're using a raw SQL
query, but I'll be encapsulating it properly soon (so we can easily attach
different WHERE clauses, etc.), at which point I'll turn the prefix length
into a parameter. I did some tests, but nothing extensive; I just made sure
the indexes fit into memory whole, which was the main constraint. Will do
more tests.
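(Once it's a parameter, trying other prefix lengths is just a matter of
rebuilding the index, e.g.; column names as above. One detail worth noting:
with a start position of 0, Postgres' substr() returns length - 1
characters, so substr(x, 0, 12) is actually an 11-character prefix.)

```sql
-- Prefix length as a tunable parameter: e.g. benchmark a 10-character
-- prefix. substr(x, 1, n) gives a true n-character prefix, whereas
-- substr(x, 0, 12) as used above yields 11 characters.
CREATE INDEX statusentry_fp_prefix10_idx
    ON statusentry (substr(fingerprint, 1, 10),
                    substr(lower(digest), 1, 10));
-- Queries must use the exact same expressions for the index to apply.
```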
> Also, would it keep indexes smaller if you took something else than
> base16 encoding for fingerprints? What about base64? Or is there a
> binary type in PostgreSQL that works fine for indexes?
>
Re: the latter, it turns out bytea does work with B-trees (the default
index type in pgsql), so storing fingerprints as raw bytes is a good idea /
approach and would halve the size of the index entries; I'll look into it,
thanks! On the whole, though, as long as all the indexes fit within pgsql's
internal buffers, index size shouldn't be the problem, afaik. But if we're
aiming for a well-researched ORM/database design, I should look into it.
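(A sketch of the binary-index option, with column names assumed; the
fingerprint value below is a placeholder:)

```sql
-- Sketch: index the binary form of the hex fingerprint (20 bytes vs.
-- 40 hex characters); bytea works with the default B-tree index type.
CREATE INDEX statusentry_fp_bin_idx
    ON statusentry (decode(fingerprint, 'hex'));

-- Lookups must use the same expression for the index to be usable:
SELECT validafter
  FROM statusentry
 WHERE decode(fingerprint, 'hex')
       = decode('0123456789ABCDEF0123456789ABCDEF01234567', 'hex');
```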
> Do you have a list of searches you're planning to support?
These are the ones that should *really* be supported:
- ?search=nickname
- ?search=fingerprint
- ?lookup=fingerprint
- ?search=address [I've done some limited testing, but am currently not
focusing on this]
- ?running=<boolean>
- ?flag=flagname [every additional clause that narrows down the query
helps; the current db model supports all the flags that Stem does, and
each flag has its own column]
- ?first_seen_days=range
- ?last_seen_days=range
As per the plan, the db should also be able to return a list of status
entries / valid-after ranges (usable for {first,last}_seen_days) given some
fingerprint.
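(To illustrate how a couple of those parameters would translate into SQL;
a sketch only, and the exact column names here are assumptions:)

```sql
-- e.g. ?search=moria1&flag=Exit&last_seen_days=0-7 could become roughly:
SELECT fingerprint, nickname, validafter
  FROM statusentry
 WHERE lower(nickname) = 'moria1'                -- ?search=nickname
   AND is_exit = TRUE                            -- ?flag=Exit (a column per flag)
   AND validafter >= now() - interval '7 days'   -- ?last_seen_days=0-7
 ORDER BY validafter DESC;
```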
Thanks for your feedback and reply!
Kostas.
[1]: http://www.postgresql.org/docs/9.1/static/pgtrgm.html