[metrics-bugs] #29772 [Metrics/Website]: Plot nearly worst-case bandwidth when downloading from [public|onion] server

Thu Apr 25 14:01:23 UTC 2019

#29772: Plot nearly worst-case bandwidth when downloading from [public|onion]
server
-----------------------------+------------------------------
 Reporter:  karsten          |          Owner:  metrics-team
     Type:  enhancement      |         Status:  needs_review
 Priority:  Medium           |      Milestone:
Component:  Metrics/Website  |        Version:
 Severity:  Normal           |     Resolution:
 Keywords:  scalability      |  Actual Points:
Parent ID:                   |         Points:
 Reviewer:                   |        Sponsor:
-----------------------------+------------------------------
Changes (by karsten):

 * status:  needs_revision => needs_review

Comment:

 I should start this comment by saying that I'm not a statistician. If you
 doubt anything I'm saying below, please go re-read this first sentence! :)

 I agree with you that the bandwidth plot works better than the latency
 plot. We're excluding very few bandwidth values as outliers, compared to
 the number of latency values that we're throwing out.

 However, I don't think that a 4-day moving average would fix this. As you
 can see in the boxplots I posted here last week, medians and quartiles are
 relatively stable over the days, and those values are what we're using to
 decide whether another value is excluded as an outlier. After all, we have
 around 144 latency values per day and per server type (public/onion). So,
 even if we considered 4 days (or even more) at a time, our threshold for
 excluding values as outliers would not change much. Besides, implementing
 such a moving average wouldn't be trivial, with all the missing data that
 we have to handle.

 I think the issue is that the way we're excluding outliers is based on the
 assumption that our data is normally distributed. For bandwidth this works
 okay: the assumption is obviously not 100% correct, because there's no
 negative bandwidth, but it is apparently close enough. It doesn't work
 very well for latencies, because they follow some heavy-tailed
 distribution that we don't know, so not all the values we're excluding are
 really outliers.

 Another reason could be that we're looking at the smallest bandwidth
 values, which are at the ''head'' of the distribution, and at the largest
 latency values, which are the heavy ''tail''.
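
 To make the two points above a bit more concrete, here is a rough sketch
 in Python. It assumes that "excluding outliers" means the usual boxplot
 rule (values more than 1.5 times the interquartile range beyond the
 quartiles), which is my assumption for this example and not spelled out
 in this ticket. The function name boxplot_summary and both samples are
 made up for illustration; they are synthetic numbers, not our
 measurements.

```python
import random
import statistics


def boxplot_summary(values, k=1.5):
    """Return the five boxplot values plus counts of excluded points.

    Assumption: outliers are values more than k * IQR beyond the
    quartiles (Tukey's rule with k = 1.5). Hypothetical helper.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Whiskers end at the most extreme values that are still inside the fences.
    kept = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        "lower_whisker": min(kept),
        "q1": q1,
        "median": median,
        "q3": q3,
        "upper_whisker": max(kept),
        "low_outliers": sum(v < lower_fence for v in values),
        "high_outliers": sum(v > upper_fence for v in values),
    }


rng = random.Random(29772)  # fixed seed so the sketch is reproducible

# Synthetic stand-ins only: a roughly normal sample and a heavy-tailed one.
symmetric = [rng.gauss(100.0, 10.0) for _ in range(5000)]
heavy_tailed = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]

sym = boxplot_summary(symmetric)
heavy = boxplot_summary(heavy_tailed)
print("symmetric   low/high outliers:", sym["low_outliers"], sym["high_outliers"])
print("heavy tail  low/high outliers:", heavy["low_outliers"], heavy["high_outliers"])
```

 With a right-skewed sample like the second one, the rule should flag far
 more values at the high end than it does for the roughly normal sample,
 and hardly anything at the low end. That would match what we see when we
 look at the largest latencies as opposed to the smallest bandwidths.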

 However, my suggestion is to ignore all this and make the plots as you
 suggested earlier and as I plotted them last week. Two reasons:

  1. Boxplots are understood by many people, and if we say that we're
 plotting the five values from boxplots, people will understand what we're
 doing.

  2. We need a baseline, even if it's not 100% correct in a
 mathematical/statistical sense. If our way to exclude outliers is flawed,
 it will be flawed for past measurements as well as for future
 measurements, in the exact same way.

 Regarding your rocket analogy: it's certainly not just distance between
 relays that we're seeing here. We're also seeing overfull queues keeping
 received cells waiting for crypto and forwarding to the next relay. But
 this is fine: we want to know how long it takes to send something over the
 circuit and get back a response.

 So, my suggestion would be to move forward with what we have. What do you
 think?

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/29772#comment:7>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online