[metrics-bugs] #33258 [Metrics]: Add CSV file export of graphed data
Tor Bug Tracker & Wiki
blackhole at torproject.org
Fri Mar 20 22:39:27 UTC 2020
#33258: Add CSV file export of graphed data
-----------------------------------------+------------------------------
Reporter: karsten | Owner: metrics-team
Type: enhancement | Status: needs_review
Priority: Medium | Milestone:
Component: Metrics | Version:
Severity: Normal | Resolution:
Keywords: metrics-team-roadmap-2020Q1 | Actual Points:
Parent ID: #33327 | Points: 1
Reviewer: | Sponsor: Sponsor59
-----------------------------------------+------------------------------
Changes (by karsten):
* status: new => needs_review
Comment:
I started working on #33256 and #33258 in parallel and now have a patch
for review that implements #33258. This patch contains a rewrite of tgen
plots to use the pandas and seaborn libraries.
On the plus side, pandas makes it really easy to export graphed data to a
.csv file. It's also good practice in general to separate all data tidying
from visualization.
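As a minimal sketch of that export step (not the code from the attached patch): the column names and values below are made up for illustration, and only the two labels are borrowed from the commands at the end of this comment. The idea is that the same tidy frame that goes to the plotting code can be written out as CSV.

```python
import io

import pandas as pd

# Hypothetical measurements; the column names are illustrative assumptions,
# not the ones used by OnionPerf or by the attached patch.
records = [
    {"server": "2019-01-31-ab", "filesize_bytes": 51200, "time_to_last_byte": 1.42},
    {"server": "2019-01-31-ab", "filesize_bytes": 1048576, "time_to_last_byte": 4.87},
    {"server": "2019-01-30-nl", "filesize_bytes": 51200, "time_to_last_byte": 0.98},
]

# Tidy layout: one row per measurement, one column per variable.
data = pd.DataFrame.from_records(records)

# Export the frame that would also be handed to the plotting code.
# A StringIO buffer is used here so the sketch does not touch the disk;
# passing a file name to to_csv() would write a .csv file instead.
buffer = io.StringIO()
data.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Keeping the tidying step separate like this means the CSV and the plots are derived from exactly the same rows.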
On the minus side, I didn't want to update all the old PyLab code to use
pandas. Instead I switched that code to seaborn, which is much newer and
much higher-level. The code is now much shorter and easier to read, but it
changes the produced plots in a few ways that we need to discuss.
I attached
[https://trac.torproject.org/projects/tor/attachment/ticket/33258/tgen.onionperf.viz.2020-03-20_10%3A06%3A21.pdf
old] and
[https://trac.torproject.org/projects/tor/attachment/ticket/33258/tgen.onionperf.viz.2020-03-20_23%3A27%3A42.pdf
new] tgen plots as .pdf files. Changes are:
- ECDFs "Time to download first byte" and "Time to download last of {
51200, 1048576, 5242880 } bytes" remain mostly unchanged. One very minor
change is that lines now extend to (-Inf, 0) and (Inf, 1). As before,
these plots are set to focus on values up to the 99th percentile.
- Time plots "Time to download { first, last } of { 51200, 1048576,
5242880 } bytes over time" are roughly the same as the "mean time to
download [...]" plots. Noticeable differences are that the x axis uses
datetime values rather than "ticks" and that the plot has changed from
line to scatter plot. The rationale behind switching from lines to dots is
that measurements are mostly independent of each other. This fact is
better expressed by a single dot per measurement than by shorter or longer
lines whose length depends on how much subsequent measurements differed
and how much time passed between them.
- Box plots "Time to download last of { 51200, 1048576, 5242880 } bytes"
replace the "median time to download [...]" plots by giving more detail
than just the median. They do not, however, show maxima or even any
outliers at all, because extreme outliers can make it difficult to read
the median value.
- Bar plots "Mean time to download last of { 51200, 1048576, 5242880 }
bytes" replace the "mean time to download [...]" plots. It's questionable
whether these plots are still required now that the box plots exist.
- There are no equivalents for "max time to download [...]" plots,
because the maximum download time can also be obtained from time plots. If
having plots with download time maxima is for some reason important, they
could be re-added as bar plots.
- Count plots "Number of downloads of { 51200, 1048576, 5242880 } bytes
completed" replace their similarly named equivalents but are much more
readable.
- There are no equivalents for "number of { 51200, 1048576, 5242880 }
bytes completed, all clients over time". These time plots are basically
the same as the time plots showing download time, except that those have
useful y values and these don't.
- Count plots "Number of downloads failed" and time plot "Download
runtime until error" replace the various error graphs which didn't seem to
be as useful.
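To make the ECDF item above concrete, here is a small sketch in plain Python. It is not taken from the patch. The sample data is made up, and using a nearest-rank percentile is an assumption. It computes ECDF points over all measurements and then derives an x-axis limit at the 99th percentile, so that one extreme outlier does not stretch the plot.

```python
import math


def ecdf(values):
    """Return (x, y) pairs of the empirical CDF of values."""
    ordered = sorted(values)
    n = len(ordered)
    # At the i-th smallest value (1-based), a fraction i/n of the data is <= x.
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical download times in seconds: 199 ordinary values, one outlier.
times = [1.0 + 0.01 * i for i in range(199)] + [60.0]

# The ECDF is computed over all measurements, including the outlier.
points = ecdf(times)

# The 99th percentile only decides how much of the x axis is shown.
x_limit = percentile_nearest_rank(times, 99)
visible = [(x, y) for x, y in points if x <= x_limit]

print(len(points), len(visible), x_limit)
```

The y values are unaffected by the cutoff: the last visible point still sits at about 0.99 rather than 1.0, which shows that the outlier was counted but is simply out of view.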
Regarding the tor plots, it's unclear to me why we need them at all. I
attached the
[https://trac.torproject.org/projects/tor/attachment/ticket/33258/tor.onionperf.viz.2020-03-20_10%3A06%3A36.pdf
old] tor plots as a .pdf file for discussion here. I have not yet rewritten
this code, because maybe we can kill it right away. Some notes:
- The "60 second moving average [...]" graphs are currently broken. The x
axis is supposed to be the time in seconds, but it starts at unix time 0
or 1970-01-01. If you look veeeeery closely at the space to the right of
the legend you'll find the data points. However, I don't know how this
visualization can be useful for anything besides debugging a handful of
measurements.
- The "1 second throughput [...]" graphs would be more useful with a
higher data resolution than 1 KiB/s, which is the reason for those huge
steps. But even if the ECDFs were smoother, is this really something
we care about?
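The effect of the 1 KiB/s resolution on the ECDF steps can be illustrated with a short sketch. The throughput samples are made up, and flooring to whole KiB/s is an assumption about how the values are reduced. Many distinct byte-level values collapse into a handful of KiB-level values, so an ECDF over them can only have that many steps.

```python
# Hypothetical per-second throughput samples in bytes per second.
samples = [51200 + 137 * i for i in range(50)]


def to_kib_floor(bytes_per_second):
    """Reduce a throughput value to whole KiB/s (assumed: remainder dropped)."""
    return bytes_per_second // 1024


# Distinct x positions available to an ECDF at each resolution.
fine = sorted(set(samples))
coarse = sorted({to_kib_floor(s) for s in samples})

print(len(fine), "distinct values at byte resolution")
print(len(coarse), "distinct values at KiB resolution:", coarse)
```

With these inputs the 50 distinct byte-level samples fall into only 7 KiB-level values, which is the kind of coarse staircase described above.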
I attached my Git-formatted
[https://trac.torproject.org/projects/tor/attachment/ticket/33258/0001-Rewrite-tgen-plots-to-use-pandas-and-seaborn.patch
patch] for review; looks like I don't have an OnionPerf repository yet. But
maybe we can have a higher-level discussion of the items above before
diving deep into the code review.
Just in case somebody wants to reproduce these plots, here are the
commands I used:
{{{
python setup.py build
sudo python setup.py install
onionperf visualize \
  -d 2019-01-31.onionperf.analysis.json.xz 2019-01-31-ab \
  -d 2019-01-30.onionperf.analysis.json.xz 2019-01-30-nl
}}}
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/33258#comment:4>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online