[tor-bugs] #15844 [Onionoo]: Develop database schema to support Onionoo's search parameter efficiently
Tor Bug Tracker & Wiki
blackhole at torproject.org
Wed May 13 23:03:10 UTC 2015
#15844: Develop database schema to support Onionoo's search parameter efficiently
-----------------------------+-----------------
Reporter: karsten | Owner:
Type: enhancement | Status: new
Priority: normal | Milestone:
Component: Onionoo | Version:
Resolution: | Keywords:
Actual Points: | Parent ID:
Points: |
-----------------------------+-----------------
Comment (by leeroy):
There are a couple things to look into if you decide to test the
possibility of a search engine. Consider a search of nickname,
fingerprint, ip, and contact.
1. Onionoo file-based i/o represents a virtual document view. It's
usually delegated to the client to render it. A search engine makes the
document view more concrete on the Onionoo server. This can lead to
duplicate copies of data across multiple files. Which may be more
difficult, and resource-consuming, to update or modify. Compression may
also be harder to implement efficiently on a document-basis instead of on
a field-basis.
1. The documents must be created/updated in a separate atomic step from
searching/indexing/reading.
1. It's a challenge to define a normalization of the searchable content.
You need a normalization for nicknames, fingerprints, ip, and contact. In
the case of ip, a tokenization leads to the problem of recognizing which
octet is being searched. In the case of contact you need to be careful to
not eliminate stop-words, and also take into consideration the impact of
multiple languages (stemming).
1. To perform substring search, or even the equivalent prefix-search
within words requires using a dynamically created regex.
1. Full document indexing may induce undesired tradeoffs similar to the
week-long limit currently seen.
----
> If substring searches are just too hard to implement, let me suggest
another simplification: how about we sacrifice the `contact` parameter
together with substring searches in the `nickname` field?
>
It occurs to me !`contact` field doesn't even show up in summary
documents. So that field wouldn't apply to a request for summary data?
Eliminating substring searches on the server would definitely help with
those big requests. I don't think it would hurt terribly for small
requests though. Maybe if it was sufficiently limited in it's search range
it wouldn't be a concern? At least theoretically it can be as efficient as
Java--maybe even more if compiled.
----
> Example: a search for "u" would still return all relays called "Unnamed"
and "u", but not "default"
>
That's a great point. Why would a substring search for a single-character
be useful anway? If the search pattern is short it should be reasonable to
say only prefix-search is available, or there's some other limits imposed.
----
> If a request takes 100 ms on average, that's fine by me, but if some
requests take 60 s, then something's very wrong.
>
Is that just the server processing the query, or does that include the
possibility that a client has a slow connection? Absolutely, if a client
has to wait 60 s to get any data, something's definitely not working
properly. Instrumentation should be built in from the get go--to keep
track of events like this.
----
> I also worry about potential lack of scalability, either to thousands of
concurrent clients or to twice or three times as many entries in the
database.
>
Are there any such situations like this with the currently deployed code?
Besides the data in memory. Are there any particular circumstances that
induce worst-case performance? Does the current worst-case have any other
potential consequences? I'm just concerned that the scalability question
lacks any consideration of the rest of the system.
----
> I'd prefer a solution that has reasonable performance for most use
cases, rather than one that is highly optimized for one use case and that
might perform badly for use cases we didn't think.
>
Agreed. Designing around some well known use-cases leads to code which
relies to much on the optimizations. This can make additions more
difficult, due to code that needs rewriting, and a database which you're
reluctant to change. First the database should be clean, and efficient,
and the rest of the system should be checked for bottlenecks. Amdahl's law
isn't it?
----
> Of course it's no hard requirement that the database performs better
than the current in-memory search.
>
I think it's perfectly reasonable to expect the database transition should
result in a system that performs comparably to the currently deployed
code. That might be a good starting point. A data-set representative of
the current in-memory search only using a database would make for a good
comparison.
----
> How do we proceed?
>
Some thoughts about that:
* Maybe by coming up with a better normal form. One which starts with the
details-view of data. Summary data might be better represented as a
derivation. I noticed a couple problems with the one I posted above. It
includes a lot of information not seen in a summary document. The ASN and
Country fields might make more sense part of the IP table. It also doesn't
consider ip changes over time or other geolocation information.
* History objects and aggregate data need a database representation too.
PostgreNOSQL features and compression might come in handy here. These
fields are created, read by the client for rendering, but aren't really
searched as far as I can tell.
* Determine which data needs to be tracked over time. For example, flags,
and ip associated with the lifetime of a router determined by fingerprint.
Are there others? What about consensus weight, and position probabilities?
* Functions for importing new data, or even importing the initial data
set
* Functions to update last-modified response on receiving an update
notification from the database
I'm doing it again aren't I? The long response thing :D
----
> leeroy, did you want to write some database code and run some
performance measurements? If so,
[https://people.torproject.org/%7Ekarsten/volatile/summary.xz here's some
sample data] (the same that I mentioned above).
>
I do have that data. I guess I could get some more from CollecTor or
Onionoo. Which Java is deployed? 7? Besides that, do you have something in
particular in mind for code? A wishlist?
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/15844#comment:11>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
More information about the tor-bugs
mailing list