[metrics-bugs] #29346 [Metrics/Website]: Document why our CSV files are in tidy/long format and how to process them
Tor Bug Tracker & Wiki
blackhole at torproject.org
Tue Feb 5 20:38:49 UTC 2019
#29346: Document why our CSV files are in tidy/long format and how to process them
---------------------------------+--------------------------
Reporter: karsten | Owner: metrics-team
Type: enhancement | Status: new
Priority: Medium | Milestone:
Component: Metrics/Website | Version:
Severity: Normal | Keywords:
Actual Points: | Parent ID:
Points: | Reviewer:
Sponsor: |
---------------------------------+--------------------------
This ticket is based on a discussion in Brussels.
The issue we talked about is that it can sometimes be difficult to import
our per-graph CSV files into applications like LibreOffice Calc or
services like CKAN and make charts out of them.
The reason is that we chose to use tidy/"long" data formats for our CSV
files. For example, the following lines are contained in the
relayflags.csv file:
{{{
date,flag,relays
2007-10-27,Exit,602
2007-10-27,Fast,1126
2007-10-27,Guard,244
2007-10-27,Running,1254
2007-10-27,Stable,586
2007-10-28,Exit,592
2007-10-28,Fast,1115
2007-10-28,Guard,293
2007-10-28,Running,1244
2007-10-28,Stable,578
[...]
}}}
However, many charting applications expect the data in the messy/"wide"
format, with one column per flag:
{{{
date,Exit,Fast,Guard,Running,Stable
2007-10-27,602,1126,244,1254,586
2007-10-28,592,1115,293,1244,578
[...]
}}}
We briefly discussed in Brussels changing our formats accordingly, to
please LibreOffice Calc et al. However, after giving this some more
thought, I'm opposed to this idea.
There are reasons why we picked the tidy format in the first place: it's
more flexible, because adding or removing a category (for example, a relay
flag) only adds or removes rows and never changes the set of columns. It's
also somewhat easier to handle with statistics tools/languages like R and
the very powerful tidyverse libraries. See also Hadley Wickham's Tidy Data
paper, which is a really good read on this topic:
https://www.jstatsoft.org/article/view/v059i10
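To illustrate the flexibility argument, here is a minimal Python sketch
(standard library only) that computes the mean number of relays per flag
from the sample rows above. The function name mean_relays_per_flag is
hypothetical and only used for illustration. Note that the code never names
a single flag, so it should keep working unchanged if a flag were added or
removed:

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the relayflags.csv excerpt in this ticket.
LONG_CSV = """\
date,flag,relays
2007-10-27,Exit,602
2007-10-27,Fast,1126
2007-10-27,Guard,244
2007-10-27,Running,1254
2007-10-27,Stable,586
2007-10-28,Exit,592
2007-10-28,Fast,1115
2007-10-28,Guard,293
2007-10-28,Running,1244
2007-10-28,Stable,578
"""


def mean_relays_per_flag(long_text):
    """Return {flag: mean relays} for tidy/long CSV text.

    Hypothetical helper: it relies only on the three column names
    (date, flag, relays), never on the concrete flag values.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in csv.DictReader(io.StringIO(long_text)):
        totals[row["flag"]] += int(row["relays"])
        counts[row["flag"]] += 1
    return {flag: totals[flag] / counts[flag] for flag in totals}


if __name__ == "__main__":
    for flag, mean in sorted(mean_relays_per_flag(LONG_CSV).items()):
        print(flag, mean)
```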
What can we do? I don't want to make the data harder to process for
anyone, and sometimes LibreOffice Calc or CKAN can be great tools to get a
first impression of a data set. We also cannot expect everyone to use R
or SPSS or MATLAB. But maybe we can solve this with better documentation
rather than by changing the way we're doing things.
The magic word here seems to be: '''pivot table'''. This random blog post
that I just found seems to be a good start for people wanting to wrangle
our tidy data into whatever they need for making charts:
https://blog.datawrapper.de/pivottables/
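For people who prefer a script over a spreadsheet, the same pivot can be
sketched in a few lines of Python. This is only a minimal sketch using the
standard library and the sample rows from above; the function name
pivot_long_to_wide and its parameters are made up for illustration and are
not part of any of our tools:

```python
import csv
import io

# Sample rows copied from the relayflags.csv excerpt in this ticket.
LONG_CSV = """\
date,flag,relays
2007-10-27,Exit,602
2007-10-27,Fast,1126
2007-10-27,Guard,244
2007-10-27,Running,1254
2007-10-27,Stable,586
2007-10-28,Exit,592
2007-10-28,Fast,1115
2007-10-28,Guard,293
2007-10-28,Running,1244
2007-10-28,Stable,578
"""


def pivot_long_to_wide(long_text, index="date", column="flag", value="relays"):
    """Pivot tidy/long CSV text into wide CSV text.

    Hypothetical helper: one output row per distinct `index` value and
    one output column per distinct `column` value.
    """
    rows = list(csv.DictReader(io.StringIO(long_text)))
    # Distinct flags become column headers; sort them for a stable layout.
    headers = sorted({row[column] for row in rows})
    # table[date][flag] = relays
    table = {}
    for row in rows:
        table.setdefault(row[index], {})[row[column]] = row[value]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([index] + headers)
    for key in sorted(table):
        # Leave a cell empty if a date has no row for some flag.
        writer.writerow([key] + [table[key].get(h, "") for h in headers])
    return out.getvalue()


if __name__ == "__main__":
    print(pivot_long_to_wide(LONG_CSV), end="")
```

With the sample rows, this prints the same wide table as shown earlier in
this ticket.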
And this random CKAN plugin that I did ''not'' try out could be a way to
teach CKAN how to use our tidy data formats for its preview
visualizations: https://github.com/routetopa/ckanext-pivottable
So, how about we document the reasons for choosing tidy data formats on
the Statistics page and link to a few tutorials for processing our data
with common charting tools? Ideally, we would add links rather than write
a lot of text of our own, though.
Does this sound plausible?
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/29346>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online