[tor-commits] [bridgedb/master] Separate backend database improvements into their own proposal.
isis at torproject.org
isis at torproject.org
Tue Feb 4 00:28:48 UTC 2014
commit 92ee780e102f3574f57f18191641a3ca1e9c0af1
Author: Isis Lovecruft <isis at torproject.org>
Date: Wed Oct 23 18:32:02 2013 +0000
Separate backend database improvements into their own proposal.
---
.../XXX-bridgedb-database-improvements.txt | 260 ++++++++++++++++++++
1 file changed, 260 insertions(+)
diff --git a/doc/proposals/XXX-bridgedb-database-improvements.txt b/doc/proposals/XXX-bridgedb-database-improvements.txt
new file mode 100644
index 0000000..2d25bd2
--- /dev/null
+++ b/doc/proposals/XXX-bridgedb-database-improvements.txt
@@ -0,0 +1,260 @@
+# -*- coding: utf-8 ; mode: org -*-
+
+Filename: XXX-bridgedb-database-improvements.txt
+Title: "Scalability and Stability Improvements to BridgeDB: Switching to a
+ Distributed Database System and RDBMS"
+Author: Isis Agora Lovecruft
+Created: 12 Oct 2013
+Related Proposals: XXX-social-bridge-distribution.txt
+Status: Open
+
+* I. Overview
+
+ BridgeDB is Tor's Bridge Distribution system, which currently has two major
+ Bridge Distribution mechanisms: the HTTPS Distributor and an Email
+ Distributor. [0]
+
+ BridgeDB is written largely in Twisted Python, and uses Python2's builtin
+ sqlite3 as its database backend. Unfortunately, this backend system is
+ already showing strain through increased times for queries, and sqlite's
+ memory usage is not up-to-par with modern, more efficient, NoSQL databases.
+
+ In order to better facilitate the implementation of newer, more complex
+ Bridge Distribution mechanisms, several improvements should be made to the
+ underlying database system of BridgeDB. Additionally, I propose that a
+ clear distinction in terms, as well as a modularisation of the codebase, be
+ drawn between the mechanisms for Bridge Distribution versus the backend
+ Bridge Database (BridgeDB) storage system.
+
+ This proposal covers the design and implementation of a scalable NoSQL â
+ Document-Based and Key-Value Relational â database backend for storing data
+ on Tor Bridge relays, in an efficient manner that is ammenable to
+ interfacing with the Twisted Python asynchronous networking code of current
+ and future Bridge Distribution mechanisms.
+
+* II. Terminology
+
+ BridgeDistributor := A program which decides when and how to hand out
+ information on a Tor Bridge relay, and to whom.
+
+ BridgeDB := The backend system of databases and object-relational mapping
+ servers, which interfaces with the BridgeDistributor in order
+ to hand out bridges to clients, and to obtain and process new,
+ incoming ``@type bridge-server-descriptors``,
+ ``@type bridge-networkstatus`` documents, and
+ ``@type bridge-extrainfo`` descriptors. [3]
+
+ BridgeFinder := A client-side program for an Onion Proxy (OP) which handles
+ interfacing with a BridgeDistributor in order to obtain new
+ Bridge relays for a client. A BridgeFinder also interfaces
+ with a local Tor Controller (such as TorButton or ARM) to
+ handle automatic, transparent Bridge configuration (no more
+ copy+pasting into a torrc) without being given any
+ additional privileges over the Tor process, [1] and relies
+ on the Tor Controller to interface with the user for
+ control input and displaying up-to-date information
+ regarding available Bridges, Pluggable Transport methods,
+ and potentially Invite Tickets and Credits (a cryptographic
+ currency without fiat value which is generated
+ automatically by clients whose Bridges remain largely
+ uncensored, and is used to purchase new Bridges), should a
+ Social Bridge Distributor be implemented. [2]
+
+* III. Databases
+** III.A. Scalability Requirements
+
+ Databases SHOULD be implemented in a manner which is ammenable to using a
+ distributed storage system; this is necessary because many potential
+ datatypes required by future BridgeDistributors MUST be stored permanently.
+ For example, in the designs for the Social Bridge Distributor, the list of
+ hash digests of spent Credits, and the list of hash digests of redeemed
+ Invite Tickets MUST be stored forever to prevent either from being replayed
+ â or double-spent â by a malicious user who wishes to block bridges faster.
+ Designing the BridgeDB backend system such that additional nodes may be
+ added in the future will allow the system to freely scale in relation to
+ the storage requirements of future BridgeDistributors.
+
+ Additionally, requiring that the implementation allow for distributed
+ database backends promotes modularisation the components of BridgeDB, such
+ that BridgeDistributors can be separated from the backend storage system,
+ BridgeDB, as all queries will be issued through a simplified, common API,
+ regardless of the number of nodes system, or the design of future
+ BridgeDistributors.
+
+*** 1. Distributed Database System
+
+ A distributed database system SHOULD be used for BridgeDB, in order to
+ scale resources as the number of Tor bridge users grows. This database
+ system, hereafter referred to as DDBS.
+
+ The DDBS MUST be capable of working within Twisted's asynchronous
+ framework. If possible, a Object-Relational Mapper (ORM) SHOULD be used to
+ abstract the database backend's structure and query syntax from the Twisted
+ Python classes which interact with it, so that the type of database may be
+ swapped out for another with less code refactoring.
+
+ The DDBM SHALL be used for persistent storage of complex data structures
+ such as the bridges, which MAY include additional information from both the
+ `@type bridge-server-descriptor`s and the `@type bridge-extra-info`
+ descriptors. [3]
+
+**** 1.a. Choice of DDBS
+
+ CouchDB is chosen for its simple HTTP API, ease of use, speed, and official
+ support for Twisted Python applications. [4] Additionally, its
+ document-based data model is very similar to the current archetecture of
+ tor's Directory Server/Mirror system, in that an HTTP API is used to
+ retrieve data stored within virtual directories. Internally, it uses JSON
+ to store data and JavaScript as its query language, both of which are
+ likely friendlier to various other components of the Tor Metrics
+ infrastructure which sanitise and analyse portions of the Bridge
+ descriptors. At the very least, friendlier than hardcoding raw SQL queries
+ as Python strings.
+
+**** 1.b. Data Structures which should be stored in a DDBS:
+
+ - RedactedDB - The Database of Blocked Bridges
+
+ The RedactedDB will hold entries of bridges which have been discovered to
+ be unreachable from BridgeDB network vantage point, or have been reported
+ unreachable by clients.
+
+ - BridgeDB - The Database of Bridges
+
+ BridgeDB holds information on available Bridges, obtained via bridge
+ descriptors and networkstatus documents from the BridgeAuthority. Because
+ a Bridge may have multiple `ORPort`s and multiple
+ `ServerTransportListenAddress`es, attaching additional data to each of
+ these addresses which MAY include the following information on a blocking
+ event:
+ - Geolocational country code of the reported blocking event
+ - Timestamp for when the blocking event was first reported
+ - The method used for discovery of the block
+ - an the believed mechanism which is causing the block
+ would quickly become unwieldy, the RedactedDB and BridgeDB SHOULD be kept
+ separate.
+
+ - User Credentials
+
+ For the Social BridgeDistributor, these are rather complex,
+ increasingly-growing, concatenations (or tuples) of several datatypes,
+ including Non-Interactive Proofs-of-Knowledge (NIPK) of Commitments to
+ k-TAA Blind Signatures, and NIPK of Commitments to a User's current
+ number of Credits and timestamps of requests for Invite Tickets.
+
+*** 2. Key-Value Relational Database Mapping Server
+
+ For simpler data structures which must be persistently stored, such as the
+ list of hashes of previously seen Invite Tickets, or the list of
+ previously spent Tokens, a Relational Database Mapping Server (RDBMS)
+ SHALL be used for optimisation of queries.
+
+ Redis and Memcached are two examples of RDBMS which are well tested and
+ are known to work well with Twisted. The major difference between the two
+ is that Memcaches are stored only within volatile memory, while Redis
+ additionally supports commands for transferring objects into persistent,
+ on-disk storage.
+
+ There are several support modules for interfacing with both Memcached and
+ Redis from Twisted Python, see Twisted's MemCacheProtocol class [5] [6] or
+ txyam [7] for Memcached, and txredis [8] or txredisapi [9] for
+ Redis. Additionally, numerous big name projects both use Redis as part of
+ their backend systems, and also provide helpful documentation on their own
+ experience of the process of switching over to the new systems. [17] For
+ non-Twisted Python Redis APIs, there is redis-py, which provides a
+ connection pool that could likely be interfaced with from Twisted Python
+ without too much difficultly. [10] [11]
+
+**** 2.a. Data Structures which should be stored in a RDBMS
+
+ Simple, mostly-flat datatypes, and data which must be frequently indexed
+ should be stored in a RDBMS, such as large lists of hashes, or arbitrary
+ strings with assigned point-values (i.e. the "Uniform Mapping" for the
+ current HTTPS BridgeDistributor).
+
+ For the Social BridgeDistributor, hash digests of the following datatypes
+ SHOULD be stored in the RDBMS, in order to prevent double-spending and
+ replay attacks:
+
+ - Invite Tickets
+
+ These are anonymous, unlinkable, unforgeable, and verifiable tokens
+ which are occasionally handed out to well-behaved Users by the Social
+ BridgeDistributor to permit new Users to be invited into the system.
+ When they are redeemed, the Social BridgeDistributor MUST store a hash
+ digest of their contents to prevent replayed Invite Tickets.
+
+ - Spent Credits
+
+ These are Credits which have already been redeemed for new Bridges.
+ The Social BridgeDistributor MUST also store a hash digest of Spent
+ Credits to prevent double-spending.
+
+*** 3. Bloom Filters and Other Database Optimisations
+
+ In order to further decrease the need for lookups in the backend
+ databases, Bloom Filters can used to eliminate extraneous
+ queries. However, this optimization would only be beneficial for string
+ lookups, i.e. querying for a User's Credential, and SHOULD NOT be used for
+ queries within any of the hash lists, i.e. the list of hashes of
+ previously seen Invite Tickets. [14]
+
+**** 3.a. Bloom Filters within Redis
+
+ It might be possible to use Redis' GETBIT and SETBIT commands to store a
+ Bloom Filter within a Redis cache system; [15] doing so would offload the
+ severe memory requirements of loading the Bloom Filter into memory in
+ Python when inserting new entries, reducing the time complexity from some
+ polynomial time complexity that is proportional to the integral of the
+ number of bridge users over the rate of change of bridge users over time,
+ to a time complexity of order O(1).
+
+**** 3.b. Expiration of Stale Data
+
+ Some types of data SHOULD be safe to expire, such as User Credentials
+ which have not been updated within a certain timeframe. This idea should
+ be further explored to assess the safety and potential drawbacks to
+ removing old data.
+
+ If there is data which SHOULD expire, the PEXPIREAT command provided by
+ Redis for the key datatype would allow the RDBMS itself to handle cleanup
+ of stale data automatically. [16]
+
+**** 4. Other potential uses of the improved Bridge database system
+
+ Redis provides mechanisms for evaluations to be made on data by calling
+ the sha1 for a serverside Lua script. [15] While not required in the
+ slightest, it is a rather neat feature, as it would allow Tor's Metrics
+ infrastructure to offload some of the computational overhead of gathering
+ data on Bridge usage to BridgeDB (as well as diminish the security
+ implications of storing Bridge descriptors).
+
+ Also, if Twisted's IProducer and IConsumer interfaces do not provide
+ needed interface functionality, or it is desired that other components of
+ the Tor software ecosystem be capable of scheduling jobs for BridgeDB,
+ there are well-tested mechanisms for using Redis as a message
+ queue/scheduling system. [16]
+
+* References
+
+[0]: https://bridges.torproject.org
+ mailto:bridges at bridges.torproject.org
+[1]: See proposals 199-bridgefinder-integration.txt at
+ https://gitweb.torproject.org/torspec.git/blob/HEAD:/proposals/199-bridgefinder-integration.txt
+[2]: See XXX-social-bridge-distribution.txt at
+ https://gitweb.torproject.org/user/isis/bridgedb.git/blob/refs/heads/feature/7520-social-dist-design:/doc/proposals/XXX-bridgedb-social-distribution.txt
+[3]: https://metrics.torproject.org/formats.html#descriptortypes
+[4]: https://github.com/couchbase/couchbase-python-client#twisted-api
+[5]: https://twistedmatrix.com/documents/current/api/twisted.protocols.memcache.MemCacheProtocol.html
+[6]: http://stackoverflow.com/a/5162203
+[7]: http://findingscience.com/twisted/python/memcache/2012/06/09/txyam:-yet-another-memcached-twisted-client.html
+[8]: https://pypi.python.org/pypi/txredis
+[9]: https://github.com/fiorix/txredisapi
+[10]: https://github.com/andymccurdy/redis-py/
+[11]: http://degizmo.com/2010/03/22/getting-started-redis-and-python/
+[12]: http://www.dr-josiah.com/2012/03/why-we-didnt-use-bloom-filter.html
+[13]: http://redis.io/topics/data-types §"Strings"
+[14]: http://redis.io/commands/pexpireat
+[15]: http://redis.io/commands/evalsha
+[16]: http://www.restmq.com/
+[17]: https://www.mediawiki.org/wiki/Redis
More information about the tor-commits
mailing list