[tor-bugs] #15844 [Onionoo]: Develop database schema to support Onionoo's search parameter efficiently
Tor Bug Tracker & Wiki
blackhole at torproject.org
Tue Apr 28 12:26:10 UTC 2015
#15844: Develop database schema to support Onionoo's search parameter efficiently
-------------------------+---------------------
Reporter: karsten | Owner:
Type: enhancement | Status: new
Priority: normal | Milestone:
Component: Onionoo | Version:
Keywords: | Actual Points:
Parent ID: | Points:
-------------------------+---------------------
The current way for handling incoming client requests is to load all
search-relevant data about relays and bridges into memory and handle
requests from there. This has two major downsides: it's difficult to
extend and it requires us to limit searches to relays and bridges that
have been running in the past seven days. We would like to overcome both.
After some experimenting with database schemas it seems that supporting
the `search` parameter efficiently will be most difficult. Here's what it
does (from https://onionoo.torproject.org/protocol.html):
''Return only (1) relays with the parameter value matching (part of a)
nickname, (possibly $-prefixed) beginning of a hex-encoded fingerprint,
beginning of a base64-encoded fingerprint without trailing equal signs, or
beginning of an IP address, (2) bridges with (part of a) nickname or
(possibly $-prefixed) beginning of a hashed hex-encoded fingerprint, and
(3) relays and/or bridges matching a given qualified search term. Searches
by relay IP address include all known addresses used for onion routing and
for exiting to the Internet. Searches for beginnings of IP addresses are
performed on textual representations of canonical IP address forms, so
that searches using CIDR notation or non-canonical forms will return empty
results. Searches are case-insensitive, except for base64-encoded
fingerprints. If multiple search terms are given, separated by spaces, the
intersection of all relays and bridges matching all search terms will be
returned. Complete hex-encoded fingerprints should always be hashed using
SHA-1, regardless of searching for a relay or a bridge, in order to not
accidentally leak non-hashed bridge fingerprints in the URL. [...]''
Before providing my experimental code (which I'd have to clean up anyway),
I'd want to keep this discussion as open as possible and only present my
general ideas how this could be implemented:
- In the following I'll assume that we're going to use PostgreSQL as
database. I already have some experience with it from other Tor-related
projects, and our sysadmins like it more than other SQL databases. If
NoSQL turns out to be superior for this use case based on some actual
performance evaluations, I'm happy to consider that.
- We'll have to support three comparison modes for the `search` paramter:
"starts with", "starts with ignore case", and "contains as substring".
- PostgreSQL does not support substring searches (`LIKE '%foo%'`) out of
the box, at least not efficiently, but there's a package called `pg_trgm`
that can "determine the similarity of text based on trigram matching".
It's contained in Debian's `postgresql-contrib` package, so it should be
available to us.
- Right now, search terms are supported starting at a minimum length of 1
character. I could imagine raising that to 3 characters if it has major
benefits to search efficiency. Though if it doesn't, let's keep
supporting searches for 1 or 2 characters.
- I briefly experimented with a normalized database schema with a
`servers` table containing one row per relay or bridge, an `addresses`
table with one or more addresses per server, and a `fingerprints` table
with original and hashed fingerprint per server. The performance was not
very promising, because searches would have to happen in all three tables.
Happy to try again if somebody has hints what I could have done wrong.
- I also considered (but did not test) a schema with a single `servers`
table that encodes all fields that are relevant for the `search` parameter
in a single string with format `"lower-case-nickname#base64-fingerprint
|lower-case-hex-fingerprint|lower-case-hashed-hex-fingerprint|lower-case-
address1|lower-case-address2"`. For example, Tonga would have the
combined string
`"tonga#SgzNLdx5lQg9c/XWZxAMilgx8W0|4a0ccd2ddc7995083d73f5d667100c8a5831f16d|e654ae16b76cf002bd26adaf060f8a9c5d333cc9|82.94.251.203"`,
and searches for `Tonga` would use the following condition: `WHERE search
LIKE '%tonga%#' OR search LIKE '%#Tonga%' OR search LIKE '%|tonga%'`.
- There may be variants of these two schemas that have advantages that I
didn't think of yet. Suggestions very welcome.
If we can find a good database schema for the `search` parameter,
implementing the other parameters should be relatively easy.
Here's Tonga's search data for a very first sample:
{{{
{
"t": true,
"f": "4A0CCD2DDC7995083D73F5D667100C8A5831F16D",
"n": "Tonga",
"ad": [
"82.94.251.203"
],
"cc": "nl",
"as": "AS3265",
"fs": "2007-10-27 12:00:00",
"ls": "2015-04-18 13:00:00",
"rf": [
"Authority",
"Fast",
"HSDir",
"Running",
"Stable",
"V2Dir",
"Valid"
],
"cw": 20,
"r": true,
"c": "4096/fd3428b4 lucky green <shamrock at cypherpunks.to>"
}
}}}
I also uploaded more
[https://people.torproject.org/~karsten/volatile/summary.xz sample search
data] in case that helps the discussion.
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/15844>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
More information about the tor-bugs
mailing list