[metrics-bugs] #32126 [Metrics/Ideas]: Add OONI's Vanilla Tor measurement data to Tor Metrics
Tor Bug Tracker & Wiki
blackhole at torproject.org
Thu Nov 21 15:06:32 UTC 2019
#32126: Add OONI's Vanilla Tor measurement data to Tor Metrics
---------------------------+------------------------------
Reporter: karsten | Owner: metrics-team
Type: enhancement | Status: new
Priority: Medium | Milestone:
Component: Metrics/Ideas | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
---------------------------+------------------------------
Comment (by hellais):
@irl pointed me to this ticket. The OONI-side ticket for things related to
this is here: https://github.com/ooni/pipeline/issues/13
> The next step after answering the questions above is to figure out how
> we'd get the data for these new graphs. Some thoughts:
> Maintaining our own copy of the OONI metadata database, like I did for
> this analysis, isn't feasible. We only need a small fraction of ~40G of
> this database which currently has a total size of 696G. Also, cloning this
> database took way too long for us to do it once per day.
I would like to better understand this point and which aspect of it is not
feasible.
If you set up the MetaDB following the instructions here:
https://github.com/ooni/sysadmin/blob/master/docs/metadb-sharing.md, it
will be configured as a read-only replica which will **automatically
sync** as soon as we write new changes to the OONI MetaDB.
That is to say, there is no need to do a clone once per day; you just
set it up once and then it will automatically sync every time.
Did you eventually manage to set it up? What are your thoughts about the
schema of the vanilla_tor tables? Are they adequate?
The MetaDB is the main way we are encouraging people to integrate OONI
data for batch analysis and we already have several users of it. I would
like to better understand the limitations and issues you are facing so we
can best address them.
> We might be able to maintain a copy of the .yaml files of vanilla_tor
> measurements only. We would sync these once or twice per day and serve
> them with CollecTor. We'd have to define our own database schema for
> importing and aggregating them. This is not a small project and not a
> small commitment.
I think this approach is highly sub-optimal, as it would require you to
duplicate the code we are already writing for parsing the OONI dataset.
> A while ago we were hoping to get a .csv file from OONI with just the
> data we need. For example, the .csv file behind the three graphs above is
> 150M large, though it could easily be reduced to 75M, uncompressed. Maybe
> we'd have to define precisely what data we want (the discussion above) and
> then write the database query for it. This would be the smallest project
> and commitment from our side; in other words, it would be most likely to
> happen soon.
This is also a possibility if you give us an idea of what queries you need
to run exactly.
We already have some private API endpoints to support the usage of this
data in OONI Explorer:
https://github.com/ooni/api/blob/master/measurements/api/private.py#L638,
though these are not really meant to be consumed externally and may be
subject to change.
The MetaDB sync is the option that would be preferable for us.
> A possible variant of the ideas above would be that we operate on a
> read-only copy of the metadata database where we can define views, run
> queries, and export results as .csv files.
This is also a possibility, though I would first like to better understand
the issues or limitations you are having in accessing the MetaDB.
Keep in mind we are currently facing a lot of challenges in scaling up the
MetaDB to support the increased usage of OONI Explorer and hence we are
trying, when possible, to move people away from using our instance and
toward setting up their own, especially for batch analysis needs.
If we do go this route we may set up a separate read-only replica just to
be used by external consumers of the data.
Once we have a better idea of your needs (an example query would be very
useful) and of whether the vanilla_tor table is adequate, we can see
whether we can also do some sort of CSV export of the data.
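To make "an example query" concrete, here is a rough sketch of the kind of
aggregate-and-export step being discussed. It is only an illustration under
stated assumptions: the table name vanilla_tor_sample and its columns
(measurement_date, probe_cc, success) are hypothetical and simplified, and
an in-memory SQLite database stands in for the PostgreSQL MetaDB replica so
that the example runs on its own. The real schema is in the files linked
below.

```python
import csv
import io
import sqlite3

# Hypothetical, simplified stand-in for the vanilla_tor data. The real
# MetaDB is PostgreSQL and its actual schema differs from this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vanilla_tor_sample ("
    " measurement_date TEXT, probe_cc TEXT, success INTEGER)"
)
conn.executemany(
    "INSERT INTO vanilla_tor_sample VALUES (?, ?, ?)",
    [
        ("2019-11-20", "IT", 1),
        ("2019-11-20", "IT", 0),
        ("2019-11-20", "DE", 1),
        ("2019-11-21", "IT", 1),
    ],
)


def export_daily_counts(connection):
    """Return per-day, per-country totals and successes as CSV text."""
    rows = connection.execute(
        "SELECT measurement_date, probe_cc,"
        " COUNT(*) AS total, SUM(success) AS successes"
        " FROM vanilla_tor_sample"
        " GROUP BY measurement_date, probe_cc"
        " ORDER BY measurement_date, probe_cc"
    ).fetchall()
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["date", "country", "total", "successes"])
    writer.writerows(rows)
    return out.getvalue()


csv_text = export_daily_counts(conn)
print(csv_text, end="")
```

A query of roughly this shape, written against the real tables, is the sort
of thing that could either run on your own replica or back a periodic CSV
export on our side.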
For information on the schema of the vanilla_tor table see:
https://github.com/ooni/pipeline/blob/master/af/shovel/centrifugation.py#L1605
https://github.com/ooni/pipeline/blob/master/af/oometa/006-vanilla-tor.install.sql#L7
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/32126#comment:4>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online