[tor-bugs] #2687 [Torperf]: Update filter.R to parse Torperf's new .mergedata format
Tor Bug Tracker & Wiki
torproject-admin at torproject.org
Sun Apr 24 10:32:11 UTC 2011
#2687: Update filter.R to parse Torperf's new .mergedata format
-------------------------+--------------------------------------------------
 Reporter:  karsten      |        Owner:  karsten
     Type:  enhancement  |       Status:  needs_review
 Priority:  major        |    Milestone:
Component:  Torperf      |      Version:
 Keywords:               |       Parent:
   Points:  4            | Actualpoints:
-------------------------+--------------------------------------------------
Comment(by karsten):
Thanks for attaching your code. It's very interesting to read someone
else's R code. I learn a lot by doing so. :)
However, I'm afraid neither of our attempts is sufficient yet. Here are
a few comments:
- I'd rather avoid adding another dependency on the "stringr" package
unless we have to. I replaced the str_* functions with standard R
functions, e.g., unlist(strsplit()), which seemed to work.
- Writing to CSV doesn't work yet. I'd like to know if we can export the
parsed data easily.
- The major issue is that parsing takes far too long. I parsed 1 week of
50 KB downloads containing 4247 rows, 2013 of which are measurements.
Your script took 2:54 minutes for this task, or 1:47 minutes when using
the standard R functions instead of the stringr ones. My script takes 0:25
minutes for this task, which is still far too long. For comparison,
reading the output CSV file takes only 314 milliseconds. People will want
to parse months or even years of data coming from a dozen Torperf runs or
more. This shouldn't take hours, but minutes. So, we should aim for at
most a few seconds for the week of data. Plus, the script should scale
linearly with the amount of data, which I'm not sure is the case for our
attempts.
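To make the points above more concrete, here is a minimal sketch of a
vectorized parser that uses only base R. It is not either of our scripts.
It assumes that every input line consists of space-separated KEY=value
tokens; the key names in the sample lines and the function name
parse_kv_lines are made up for illustration. It splits all lines in one
strsplit() call, fills a character matrix in a single indexed assignment
instead of growing a data frame row by row, and then exports the result
with write.csv(). Note that lengths() needs R 3.2.0 or later.

```r
# Hypothetical sample input: space-separated KEY=value tokens per line.
# The key names are illustrative only, not the real .mergedata fields.
lines <- c(
  "START=1303000000.10 DATACOMPLETE=1303000002.35 READBYTES=51200",
  "START=1303000300.20 DATACOMPLETE=1303000303.05 READBYTES=51200 DIDTIMEOUT=0"
)

parse_kv_lines <- function(lines) {
  # Split every line into "key=value" tokens in one vectorized call.
  tokens <- strsplit(lines, " ", fixed = TRUE)
  # Remember which input line each token came from.
  row <- rep(seq_along(tokens), lengths(tokens))
  flat <- unlist(tokens)
  # Locate the first "=" in each token; tokens without one are dropped.
  eq <- regexpr("=", flat, fixed = TRUE)
  keep <- eq > 0
  key <- substr(flat[keep], 1, eq[keep] - 1)
  value <- substr(flat[keep], eq[keep] + 1, nchar(flat[keep]))
  row <- row[keep]
  # One row per input line, one column per distinct key.
  # Keys that are missing from a line stay NA.
  keys <- unique(key)
  m <- matrix(NA_character_, nrow = length(lines), ncol = length(keys),
              dimnames = list(NULL, keys))
  m[cbind(row, match(key, keys))] <- value
  as.data.frame(m, stringsAsFactors = FALSE)
}

parsed <- parse_kv_lines(lines)

# Export to CSV and read it back; read.csv() guesses the column types.
out <- tempfile(fileext = ".csv")
write.csv(parsed, out, row.names = FALSE)
reread <- read.csv(out, stringsAsFactors = FALSE)

# Rough timing on repeated sample data, to see how the run time grows.
big <- rep(lines, 2000)
print(system.time(parse_kv_lines(big)))
```

All values come back as character strings, so numeric columns would still
need as.numeric() before plotting, unless the data is read back from the
CSV file. Running the timing with different repeat counts should give a
first impression of whether the run time grows roughly linearly, but I
haven't measured this on real Torperf data.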
If we cannot find an efficient way to parse these files, let's take one
step back. What data formats are there that allow us to add or remove
columns easily and that can be parsed efficiently in R? CSV is fast, but
inflexible. Space-separated key-value pairs are flexible, but apparently
slow. What else is there? XML?
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/2687#comment:9>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online