[metrics-bugs] #25383 [Metrics/Website]: Deprecate stats.html and stats/*.csv files
Tor Bug Tracker & Wiki
blackhole at torproject.org
Tue May 8 09:09:33 UTC 2018
#25383: Deprecate stats.html and stats/*.csv files
-----------------------------+------------------------------
Reporter: karsten | Owner: metrics-team
Type: enhancement | Status: new
Priority: Medium | Milestone:
Component: Metrics/Website | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-----------------------------+------------------------------
Comment (by karsten):
Replying to [comment:12 irl]:
> Are we strongly against the idea of providing two CSV files?
I thought about this a lot yesterday, and I think the answer
is: yes.
Providing two types of CSV files pretty much doubles our effort for adding
new aggregations or graphs as well as changing or removing parts. I'd
prefer the process for adding or improving graphs to get easier, not
harder.
Let's try to provide just one type of CSV file, assuming that we don't
break existing, valid use cases.
But let's find a way to stop providing our pre-aggregated statistics
files. They are not the best interface that we can provide. And they are
an interface that can become quite painful to maintain in the future.
> I'd like to see the current CSV that only contains the data used to
> produce the plot, and then additionally the full CSV pre-filtering that
> would contain all the data.
>
> This would work for the use case where you want to do your own
> processing on the data and would also work for the use case where someone
> wanted to produce plots using the same data that we have already filtered
> and processed.
>
> For the full CSV file, a header would probably be useful. It may also be
> useful to have an HTML page that contains a list of all the available CSV
> files but the specifications for those files could be documented in the
> headers of the CSVs. We wouldn't need to list the individual pre-filtered
> CSV files on that page.
Understood, I think.
Here's another suggestion:
4. We provide 1 CSV file per graph that is parameterized by default and
that can also be requested without any parameters. The link on the graph
page would contain the same parameters as the graph, so that the CSV file
content would be pretty close to what's shown in the graph. Except that
the file might contain a few more columns. But the header would explain
those columns. And the header would also say that it's possible to drop
parameters to get more data for different parameter combinations of this
graph.
Let's make this more concrete by adding sample data:
The CSV link on the current
[https://metrics.torproject.org/userstats-relay-country.html Relay users]
graph page would read (line break added for visibility):
{{{
https://metrics.torproject.org/userstats-relay-country.csv?
start=2018-02-07&end=2018-05-08&country=all&events=off
}}}
The first and last lines of that file would be:
{{{
#
# The Tor Project
#
# URL: https://metrics.torproject.org/userstats-relay-country.csv?start=2018-02-07&end=2018-05-08&country=all&events=off
#
# Insert some specification...
#
date,country,users,downturns,upturns,lower,upper
2018-02-07,,4071868,,,,
2018-02-08,,3815277,,,,
2018-02-09,,4000274,,,,
[...]
2018-05-03,,2296101,,,,
2018-05-04,,2341577,,,,
2018-05-05,,2229328,,,,
}}}
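A minimal sketch of how a consumer might read such a file, assuming every header line starts with `#` as in the sample above. The function name `read_metrics_csv` is made up for illustration, and the embedded text is just the first few sample lines from this comment, not real output.

```python
import csv
import io

# Sample text copied from the proposal above (first rows only).
SAMPLE = """\
#
# The Tor Project
#
# URL: https://metrics.torproject.org/userstats-relay-country.csv?start=2018-02-07&end=2018-05-08&country=all&events=off
#
# Insert some specification...
#
date,country,users,downturns,upturns,lower,upper
2018-02-07,,4071868,,,,
2018-02-08,,3815277,,,,
2018-02-09,,4000274,,,,
"""


def read_metrics_csv(text):
    """Split '#' header lines from the CSV body and parse the body into dicts."""
    header, body = [], []
    for line in text.splitlines():
        if line.startswith("#"):
            # Keep the header text without the leading '#'.
            header.append(line[1:].strip())
        else:
            body.append(line)
    rows = list(csv.DictReader(io.StringIO("\n".join(body))))
    return header, rows


header, rows = read_metrics_csv(SAMPLE)
print(rows[0]["date"], rows[0]["users"])
```

The point is only that a commented header does not get in the way of standard CSV tooling, as long as consumers skip the `#` lines first.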
Now, if someone's interested in data for all dates, a break-down by all
possible countries, and possible censorship events, they'd simply take out
all parameters and fetch the following file (link does not work yet):
{{{
https://metrics.torproject.org/userstats-relay-country.csv
}}}
The first and last lines would be:
{{{
#
# The Tor Project
#
# URL: https://metrics.torproject.org/userstats-relay-country.csv
#
# Insert some specification...
#
date,country,users,downturns,upturns,lower,upper
2011-03-06,a1,1443,,,,
2011-03-06,a2,424,,,,
2011-03-06,ae,8395,,,,
[...]
2018-05-06,zw,245,FALSE,FALSE,122,389
2018-05-06,,2220344,,,,
2018-05-06,??,25797,,,,
}}}
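To illustrate how the two requests relate, here is a rough sketch that builds both URLs and filters rows from the full file down to roughly what the graph link would return. The parameter semantics are assumptions read off the sample data: `country=all` is taken to mean the rows with an empty country field, and `start`/`end` are treated as inclusive dates. The `events` parameter is not modeled, and `csv_url` and `filter_rows` are hypothetical names.

```python
from urllib.parse import urlencode

BASE = "https://metrics.torproject.org/userstats-relay-country.csv"


def csv_url(**params):
    """Return the CSV URL; without parameters this is the full file."""
    return BASE + ("?" + urlencode(params) if params else "")


def filter_rows(rows, start, end, country="all"):
    """Keep rows in [start, end] for one country.

    Assumption: 'all' corresponds to rows with an empty country field.
    ISO dates compare correctly as plain strings.
    """
    wanted = "" if country == "all" else country
    return [
        r for r in rows
        if start <= r["date"] <= end and r["country"] == wanted
    ]


# A few rows taken from the sample of the full file above.
full_rows = [
    {"date": "2011-03-06", "country": "a1", "users": "1443"},
    {"date": "2018-05-06", "country": "zw", "users": "245"},
    {"date": "2018-05-06", "country": "", "users": "2220344"},
    {"date": "2018-05-06", "country": "??", "users": "25797"},
]

print(csv_url(start="2018-02-07", end="2018-05-08", country="all", events="off"))
print(csv_url())
print(filter_rows(full_rows, "2018-02-07", "2018-05-08"))
```

If the server applied filters this way, anyone who fetched the full file could reproduce every parameterized file locally, which is the main argument for providing just one CSV file per graph.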
For comparison, the current CSV file, ''that we wouldn't provide
anymore'', starts and ends with the following lines:
{{{
date,node,country,transport,version,lower,upper,clients,frac
2011-03-06,relay,a1,,,,,1443,11
2011-03-06,relay,a2,,,,,424,11
2011-03-06,relay,ad,,,,,70,11
[...]
2018-05-06,bridge,,scramblesuit,,,,16,63
2018-05-06,bridge,,snowflake,,,,3,63
2018-05-06,bridge,??,,,,,1135,63
}}}
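For anyone currently consuming the old file, here is a rough sketch of how its relay rows seem to line up with the proposed columns, judging only from the samples in this comment: `clients` becomes `users`, and `node`, `transport`, `version` and `frac` drop out. The old file has no `downturns`/`upturns` columns, so this sketch leaves them empty. `to_new_row` is a made-up helper for illustration, not a planned part of the website.

```python
import csv
import io

# A few lines in the old format, copied from the sample above.
OLD = """\
date,node,country,transport,version,lower,upper,clients,frac
2011-03-06,relay,a1,,,,,1443,11
2011-03-06,relay,a2,,,,,424,11
2018-05-06,bridge,,snowflake,,,,3,63
"""

NEW_COLUMNS = ["date", "country", "users", "downturns", "upturns", "lower", "upper"]


def to_new_row(old):
    """Map one old-format relay row onto the proposed column layout."""
    return {
        "date": old["date"],
        "country": old["country"],
        "users": old["clients"],
        "downturns": "",  # not present in the old file
        "upturns": "",    # not present in the old file
        "lower": old["lower"],
        "upper": old["upper"],
    }


# Bridge rows are skipped; they would live in the bridge users CSV files.
relay_rows = [
    to_new_row(r)
    for r in csv.DictReader(io.StringIO(OLD))
    if r["node"] == "relay"
]

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=NEW_COLUMNS, lineterminator="\n")
writer.writeheader()
writer.writerows(relay_rows)
print(out.getvalue())
```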
Note that the bridge user data would still be available on the various
bridge users graphs.
And we could discuss whether it makes sense to include the `frac` column
in the relay users CSV file or not. If we include it, it would be there in
the parameterized CSV file as well as the non-parameterized CSV file. I
guess this is a trade-off between usability ("less is more") and
usefulness ("more details can help").
Thoughts?
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/25383#comment:13>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online