[tor-bugs] #15844 [Onionoo]: Develop database schema to support Onionoo's search parameter efficiently
Tor Bug Tracker & Wiki
blackhole at torproject.org
Thu May 14 15:10:27 UTC 2015
#15844: Develop database schema to support Onionoo's search parameter efficiently
-----------------------------+-----------------
Reporter: karsten | Owner:
Type: enhancement | Status: new
Priority: normal | Milestone:
Component: Onionoo | Version:
Resolution: | Keywords:
Actual Points: | Parent ID:
Points: |
-----------------------------+-----------------
Comment (by karsten):
Replying to [comment:11 leeroy]:
> There are a couple things to look into if you decide to test the
> possibility of a search engine. Consider a search of nickname,
> fingerprint, ip, and contact. [...]
It seems like switching to a search engine would be a pretty major change,
even compared to switching to a database. I'd say if somebody wants to
look into this, that's cool, but the database switch seems like the much
lower-hanging fruit for now.
> > If substring searches are just too hard to implement, let me suggest
> > another simplification: how about we sacrifice the `contact` parameter
> > together with substring searches in the `nickname` field?
>
> It occurs to me the `contact` field doesn't even show up in summary
> documents. So that field wouldn't apply to a request for summary data?
> Eliminating substring searches on the server ''would'' help with those
> big requests. I don't think it would hurt terribly for small requests
> though. Maybe if it were limited in its search range it wouldn't be a
> concern? At least theoretically it can be as efficient as Java--maybe
> even ''more'' if compiled.
You can use the `contact` parameter for any document type, even one that
doesn't contain the `contact` field. The way this works is that all
requests are answered using a node index that can handle all possible
parameters and that returns a list of fingerprints. In the next step, all
documents matching these fingerprints are written to the response. For
example, right now you can use the `contact` parameter when asking for
`weights` documents.
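For illustration, here's a minimal SQL sketch of that two-step flow,
assuming hypothetical `node_status` and `weights_document` tables (not the
actual Onionoo schema):
{{{
-- Step 1: the node index resolves the contact parameter (a substring
-- match) to a list of fingerprints.
SELECT fingerprint
  FROM node_status
 WHERE contact LIKE '%example.org%';

-- Step 2: the documents matching those fingerprints are fetched and
-- written to the response, here for a weights request.
SELECT document_body
  FROM weights_document
 WHERE fingerprint IN (SELECT fingerprint
                         FROM node_status
                        WHERE contact LIKE '%example.org%');
}}}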
> > Example: a search for "u" would still return all relays called
> > "Unnamed" and "u", but not "default"
>
> That's a great point. Why would a substring search for a single
> character be useful anyway? If the search pattern is short it should be
> reasonable to say only prefix-search is available. That, or there are
> some other limits imposed.
I'm not too worried about this simplification, because nicknames became
less important a while ago when the Named flag was discontinued. They are
a convenient way to refer to relays, but there is no guarantee that you'll
get the relay you're hoping to get.
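To make the difference concrete, here's a sketch of the two search types
against a hypothetical `node_status` table: the prefix search can use an
ordinary index, while the general substring search cannot, unless
something like a pg_trgm trigram index is added.
{{{
-- Prefix search: can use a B-tree index (text_pattern_ops makes the
-- index usable for LIKE with a constant prefix).
CREATE INDEX node_status_nickname_idx
    ON node_status (LOWER(nickname) text_pattern_ops);

SELECT fingerprint FROM node_status WHERE LOWER(nickname) LIKE 'u%';

-- Substring search: falls back to scanning all nicknames.
SELECT fingerprint FROM node_status WHERE LOWER(nickname) LIKE '%u%';
}}}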
> > If a request takes 100 ms on average, that's fine by me, but if some
> > requests take 60 s, then something's very wrong.
>
> Would that be just the server processing the query, or does that include
> the possibility that a client has a slow connection? Absolutely, if a
> client has to wait 60 s to get any data, something's definitely not
> working properly. Instrumentation should be built in from the start--to
> keep track of events like this.
Well, clients having slow connections should be treated separately. We
already have performance statistics in the deployed code:
{{{
Request statistics (2015-05-13 18:52:48, 3600 s):
Total processed requests: 285066
Most frequently requested resource: details (283763), summary (547), bandwidth (372)
Most frequently requested parameter combinations: [lookup, fields] (278974), [flag, type, running] (2531), [lookup] (2393)
Matching relays per request: .500<2, .900<2, .990<2048, .999<16384
Matching bridges per request: .500<1, .900<1, .990<1, .999<8192
Written characters per response: .500<256, .900<512, .990<2097152, .999<4194304
Milliseconds to handle request: .500<8, .900<16, .990<32, .999<64
Milliseconds to build response: .500<4, .900<4, .990<256, .999<512
}}}
If we switch request handling from the current in-memory approach to a
database approach, we should watch out for the second-to-last line there,
"Milliseconds to handle request". That's the time from receiving a
request to returning the list of fingerprints of documents to be returned.
You read that line like this: 50% of requests were handled in less than 8
milliseconds, ..., and 99.9% in less than 64 milliseconds.
Slow clients would show up in the next line, "Milliseconds to build
response", which includes reading documents from disk as well as writing
them to the response. (We could split up that line, but then we'd have to
buffer complete responses in memory, which might be bad.)
Happy to add new statistics there, if they help us investigate potential
bottlenecks.
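If it helps, roughly the same numbers could also be computed on the
database side. Here's a sketch, assuming PostgreSQL 9.4 or later and a
hypothetical `request_log` table with one row per request and a
`handle_millis` column:
{{{
-- Roughly the "Milliseconds to handle request" line, computed over the
-- last hour of logged requests.
SELECT percentile_disc(0.500) WITHIN GROUP (ORDER BY handle_millis) AS p500,
       percentile_disc(0.900) WITHIN GROUP (ORDER BY handle_millis) AS p900,
       percentile_disc(0.990) WITHIN GROUP (ORDER BY handle_millis) AS p990,
       percentile_disc(0.999) WITHIN GROUP (ORDER BY handle_millis) AS p999
  FROM request_log
 WHERE logged_at >= now() - interval '1 hour';
}}}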
> > I also worry about potential lack of scalability, either to thousands
> > of concurrent clients or to twice or three times as many entries in the
> > database.
>
> Are there any situations like this with the currently deployed code,
> besides the data in memory? Are there any circumstances that induce
> worst-case performance? Does the worst case have any other potential
> consequences? I'm just concerned that the scalability question lacks any
> consideration of the rest of the system.
There are indeed other parts of the system that would be affected by an
increase in concurrent clients or entries in the database. As I described
above, once the database returns a list of fingerprints, the corresponding
documents have to be looked up and written to the response. Right now, all
documents are read from disk, but we might want to put them into the
database as well. And yes, we would want to make sure that those parts of
the system scale as well.
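As a rough sketch of what putting those documents into the database could
look like (all names here are assumptions, and the JSONB type needs
PostgreSQL 9.4 or later):
{{{
CREATE TABLE document (
  fingerprint   CHAR(40)    NOT NULL,
  document_type VARCHAR(16) NOT NULL,  -- e.g. 'details', 'weights'
  last_modified TIMESTAMP   NOT NULL,
  body          JSONB       NOT NULL,  -- the document as served to clients
  PRIMARY KEY (fingerprint, document_type)
);

-- Second step of request handling: fetch the documents for the
-- fingerprints that the node index returned.
SELECT body
  FROM document
 WHERE document_type = 'details'
   AND fingerprint IN ('FINGERPRINT1', 'FINGERPRINT2');
}}}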
> > I'd prefer a solution that has reasonable performance for most use
> > cases, rather than one that is highly optimized for one use case and
> > that might perform badly for use cases we didn't think of.
>
> Agreed. Designing around some well-known use cases leads to code which
> relies too much on the optimizations. This can make additions more
> difficult, due to code that needs rewriting, and a database which you're
> reluctant to change. First the database should be clean and efficient,
> and the rest of the system should be checked for bottlenecks. Amdahl's
> law, isn't it?
Agreed.
> > Of course it's no hard requirement that the database performs better
> > than the current in-memory search.
>
> I think it's perfectly reasonable to expect the database transition
> should result in a system that performs comparably to the currently
> deployed code. That might be a good starting point. A data set
> representative of the current in-memory search, only using a database,
> would make for a good comparison.
That would be a fine comparison to start with, but even if the result is
that the database is slower by a factor of 1.5 or so, its other advantages
might outweigh that.
> > How do we proceed?
>
> Some thoughts about that:
>
> * Maybe by coming up with a better normal form. One which starts with
> the details-view of data. Summary data might be better represented as a
> derivation. I noticed a couple problems with the one I posted above. It
> includes a lot of information not seen in a summary document. The ASN and
> Country fields might make more sense as part of the IP table. It also
> doesn't consider ip changes over time or other geolocation information.
Looking at details documents is a fine start. But let's be careful with
adding new parameters that would make it harder to make the database fast
enough. One problem at a time.
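Purely to have something concrete to poke at, here's one possible shape of
such a normal form, with the ASN and country fields kept with the address
and summary data derived as a view. Every name here is an assumption for
illustration, not a proposal for the final schema:
{{{
CREATE TABLE relay (
  fingerprint CHAR(40) PRIMARY KEY,
  nickname    VARCHAR(19),
  running     BOOLEAN   NOT NULL,
  last_seen   TIMESTAMP NOT NULL
);

CREATE TABLE relay_address (
  fingerprint CHAR(40) NOT NULL REFERENCES relay (fingerprint),
  address     INET     NOT NULL,
  or_port     INTEGER,
  as_number   VARCHAR(10),  -- ASN kept with the address, as suggested
  country     CHAR(2),      -- same for the country code
  PRIMARY KEY (fingerprint, address)
);

-- Summary data as a derivation of the details-level tables.
CREATE VIEW relay_summary AS
SELECT r.fingerprint, r.nickname, r.running,
       array_agg(a.address) AS addresses
  FROM relay r
  LEFT JOIN relay_address a USING (fingerprint)
 GROUP BY r.fingerprint, r.nickname, r.running;
}}}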
> * History objects and aggregate data need a database representation
> too. PostgreSQL's NoSQL features and compression might come in handy
> here. These fields are created, read by the client for rendering, but
> aren't really searched as far as I can tell.
History documents are not searched, and that's perfectly fine.
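One way to represent them would be as opaque JSONB values in the same
hypothetical `document` table sketched above; PostgreSQL compresses large
values automatically (TOAST), and since history is never searched, no
index on the body is needed:
{{{
-- Sketch: a history document stored as an opaque blob; the JSON content
-- shown here is only an illustration of the general shape.
INSERT INTO document (fingerprint, document_type, last_modified, body)
VALUES ('FINGERPRINT1', 'bandwidth', now(),
        '{"write_history": {"1_month": {"factor": 1234.5}}}'::jsonb);
}}}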
> * Determine which data needs to be tracked over time. For example,
> flags, and ip associated with the lifetime of a router determined by
> fingerprint. Are there others? What about consensus weight, and position
> probabilities?
Let's be careful with adding historical data. There are over 7.5 years of
data in CollecTor, and adding anything of that to Onionoo might not scale
very well. Onionoo started out by giving out details about relays and
bridges that are currently running. Adding non-running relays and bridges
that have been running in the last week was just a convenience. Removing
that 1-week limitation would be even more convenient, but we need to be
very careful how to do that. For example, it might be possible to search
by any IP address assigned to a relay in the past. But allowing users to
look up which relays had the Stable flag on January 1, 2009 might be a
little too much.
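Restricting history to something like address assignments could stay
cheap. A sketch, again with made-up names:
{{{
CREATE TABLE relay_address_history (
  fingerprint CHAR(40)  NOT NULL,
  address     INET      NOT NULL,
  first_seen  TIMESTAMP NOT NULL,
  last_seen   TIMESTAMP NOT NULL,
  PRIMARY KEY (fingerprint, address, first_seen)
);

-- Find relays that ever announced a given address.
SELECT DISTINCT fingerprint
  FROM relay_address_history
 WHERE address = '198.51.100.7';
}}}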
> * Functions for importing new data, or even importing the initial data
> set
>
> * Functions to update last-modified response on receiving an update
> notification from the database
Yes, this code needs to be written. I just thought it might be easier to
start with writing some fine new SQL code to do performance experiments
and not mess with the Onionoo code base.
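For the import and notification side, something along these lines might do
as a first experiment, assuming the hypothetical `relay` table above (with
PostgreSQL 9.5 the upsert could use INSERT ... ON CONFLICT instead):
{{{
CREATE OR REPLACE FUNCTION import_relay(p_fingerprint CHAR(40),
                                        p_nickname    VARCHAR(19),
                                        p_running     BOOLEAN,
                                        p_last_seen   TIMESTAMP)
RETURNS VOID AS $$
BEGIN
  UPDATE relay
     SET nickname = p_nickname, running = p_running, last_seen = p_last_seen
   WHERE fingerprint = p_fingerprint;
  IF NOT FOUND THEN
    INSERT INTO relay (fingerprint, nickname, running, last_seen)
    VALUES (p_fingerprint, p_nickname, p_running, p_last_seen);
  END IF;
  -- Tell the request-handling side that its last-modified state is stale.
  NOTIFY onionoo_update;
END;
$$ LANGUAGE plpgsql;
}}}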
> I'm doing it again, aren't I? The long response thing :D
At least this one is easier to respond to. :)
> > leeroy, did you want to write some database code and run some
> > performance measurements? If so,
> > [https://people.torproject.org/%7Ekarsten/volatile/summary.xz here's
> > some sample data] (the same that I mentioned above).
>
> I do have that data. I guess I could get some more from CollecTor or
> Onionoo. Which Java is deployed? 7? Besides that, do you have something
> in particular in mind for code? A wishlist?
I'd say ignore CollecTor for the moment and only look at Onionoo. The
document I pasted above is the current input to the request-handling code.
You could also fetch the [https://onionoo.torproject.org/details latest
details documents] for additional input. But feel free to mention any
other data here first, and I might comment on possible scalability issues
from including those in the search.
The deployed Java version is indeed 7. I don't have anything particular
in mind for the code, but I'm happy to comment on patches. Release early,
release often.
Thanks for your contributions here!
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/15844#comment:12>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online