[metrics-bugs] #29787 [Metrics/Onionperf]: Enumerate possible failure cases and include failure information in .tpf output

Fri Apr 5 15:31:18 UTC 2019

#29787: Enumerate possible failure cases and include failure information in .tpf
output
-------------------------------+------------------------------
 Reporter:  karsten            |          Owner:  metrics-team
     Type:  enhancement        |         Status:  new
 Priority:  Medium             |      Milestone:
Component:  Metrics/Onionperf  |        Version:
 Severity:  Normal             |     Resolution:
 Keywords:                     |  Actual Points:
Parent ID:                     |         Points:
 Reviewer:                     |        Sponsor:
-------------------------------+------------------------------

Comment (by karsten):

 Hi acute!

 Your idea to extend the code that matches tgen logs and tor control port
 event logs sounds interesting. Is that going to replace OnionPerf's
 analysis.py? If yes, why don't you extend or replace that code rather than
 start a new code base?

 However, I wonder if we could start simpler here by looking at the
 tgen logs alone:

  1. For an initial classification of failure cases it might be sufficient
 to learn ''when'' a request fails and ''how''. Like, in which request
 phase does a request fail and how much time has elapsed up to that point?
 Maybe the tgen logs also tell us how a request failed, that is, whether
 the tor process sent an error, or tgen ran into a timeout or stallout (even
 though we set stallout high enough that stallouts currently don't happen),
 or hit a checksum error or something else. It would be good to know what
 fraction of requests succeeded and what fractions failed at the various
 request stages. This is all based on tgen information, which is the
 application point of view that treats tor as a black box.

  2. The next step after that, for me, would be to match tgen logs with tor
 control port event logs. I wonder why we'd be using the source port for
 this. Is that to handle potentially overlapping requests? Do we handle
 cases where a source port is re-used over the day, by including time? And
 what do we do if no corresponding source port is found in the other log
 file, or is that scenario unrealistic/impossible? In short, this sounds
 complicated and potentially error-prone. Maybe we could simplify this by
 doing the matching solely based on timing information? And do you think we
 could also match tor logs (not control port events) by using the same
 timing information? Assuming that there's anything interesting in these
 logs.
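
 A minimal sketch of matching solely on timing, to make the idea in step 2
 concrete. It assumes both logs have already been parsed into simple tuples;
 the function name, the tuple layouts, and the "slack" parameter are
 hypothetical and not taken from OnionPerf or tgen. Events that fall into
 more than one request's time window are reported separately, since
 overlapping requests are exactly the case where timing alone is not enough.

```python
def match_by_time(requests, events, slack=0.0):
    """Assign events to requests using timestamps only.

    requests: list of (request_id, start, end) tuples (hypothetical layout).
    events:   list of (timestamp, payload) tuples (hypothetical layout).
    slack:    seconds of tolerance added on both sides of each request window.

    Returns (matched, ambiguous, unmatched).
    """
    matched = {request_id: [] for request_id, _, _ in requests}
    ambiguous = []
    unmatched = []
    for timestamp, payload in events:
        # Collect every request whose time window contains this event.
        hits = [
            request_id
            for request_id, start, end in requests
            if start - slack <= timestamp <= end + slack
        ]
        if len(hits) == 1:
            matched[hits[0]].append((timestamp, payload))
        elif hits:
            # Overlapping requests: timing alone cannot decide.
            ambiguous.append((timestamp, payload, hits))
        else:
            # No request was active at that time.
            unmatched.append((timestamp, payload))
    return matched, ambiguous, unmatched
```

 If the "ambiguous" list stays empty on real data, matching on timing alone
 would be sufficient; if not, an additional key such as the source port
 would still be needed for those cases.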

 Sadly, the weekend is almost here and I likely won't be able to spend much
 time on this analysis over the weekend. But if I find time, I'll start by
 reading tgen logs and writing little helper tools to classify failure
 cases solely based on tgen logs. I'll share measurement identifiers of
 some sort for failure cases as I find them.
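
 Such a helper tool could be sketched roughly as follows. This assumes the
 tgen logs have already been parsed into one record per request; the record
 layout ("phase_times", "error"), the phase names, and the function names
 are hypothetical placeholders, not actual tgen or OnionPerf fields.

```python
from collections import Counter

# Hypothetical request phases in the order they are reached.
PHASES = ["connect", "request", "first_byte", "last_byte"]


def classify(record):
    """Return (outcome, last_phase, elapsed) for one parsed request record.

    record["phase_times"] maps a phase name to the seconds elapsed when
    that phase was reached; record["error"] is None on success or a short
    error label otherwise. Both fields are assumptions for illustration.
    """
    times = record["phase_times"]
    reached = [phase for phase in PHASES if times.get(phase) is not None]
    last_phase = reached[-1] if reached else None
    elapsed = times[last_phase] if last_phase else None
    outcome = record.get("error") or "success"
    return outcome, last_phase, elapsed


def failure_fractions(records):
    """Return the fraction of requests per (outcome, last_phase) pair."""
    counts = Counter(classify(record)[:2] for record in records)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}
```

 Feeding it all records of one day would give the fraction of requests that
 succeeded and the fractions that failed at each request stage, together
 with the kind of failure.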

 Thanks!

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/29787#comment:4>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online