[metrics-team] Monthly churn values per relay flag
Philipp Winter
phw at nymity.ch
Mon Feb 15 02:22:50 UTC 2016
On Thu, Feb 04, 2016 at 05:07:01PM +0100, Karsten Loesing wrote:
> - Regarding the mentioned outliers and suspected consensus hiccups, I
> wonder if those are caused by comparing non-adjacent consensuses. A
> random example could be the decline from 2007-11-10 to 2007-11-13.
I already account for that and don't calculate churn values when
consensuses are missing.
> Or is that an artifact of connecting two valid comparisons of two
> adjacent consensuses each with numerous missing values in between? In
> that case, would it make sense to add NAs to the graph data before
> plotting it, so that the line would end on 2007-11-10 and another line
> would start on 2007-11-13?
What do you mean by "numerous missing values in between"?
> - I'm yet unsure how churn rates are defined exactly, and I know I
> brought this up in previous discussions, Philipp, I just don't
> remember the latest definition. If I compare two adjacent consensuses
> C0 and C1, how are the numbers in NewRunning and GoneRunning
> calculated? If I were to define them, I think I'd say that NewRunning
> is the number of relays in C1 that were not listed in C0, divided by
> the total number of relays in C1, and GoneRunning is the number of
> relays in C0 that are not listed anymore in C1, divided by the total
> number of relays in C0. Note the different denominators. But I think
> if we use the same denominator, we can't guarantee that both values
> are in [0, 1]. Is this also the definition you used?
Yes, that's how I'm doing it now (after changing the definition, thanks
to your suggestions.)
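
For reference, here is a minimal sketch of that definition in Python. It
assumes each consensus is represented as a set of relay fingerprints; the
function name and that representation are only illustrative, and this is
not sybilhunter's actual code.

```python
def churn(c0, c1):
    """Return (new, gone) churn fractions for two adjacent consensuses.

    c0 and c1 are sets of relay fingerprints (an assumed representation).
    new  = relays in C1 that were not in C0, divided by |C1|
    gone = relays in C0 that are no longer in C1, divided by |C0|
    The different denominators keep both values in [0, 1].
    """
    new = len(c1 - c0) / len(c1) if c1 else 0.0
    gone = len(c0 - c1) / len(c0) if c0 else 0.0
    return new, gone


# Toy example: one relay leaves, two relays join.
c0 = {"A", "B", "C", "D"}
c1 = {"B", "C", "D", "E", "F"}
new, gone = churn(c0, c1)
```

Here NewRunning would be 2/5 and GoneRunning would be 1/4.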
> - Would it make sense to not only include one Date in your .csv file
> (I assume this is C1 in my definition above?), but the valid-after
> times of C0 and C1 that you're comparing, like DatePrevious and
> DateCurrent?
Yes, it's always C1. I could include C0's valid-after time if you think
it's helpful, but it's redundant because I only compare two adjacent
consensuses, so C0's valid-after is always C1's valid-after minus one hour.
> - Your .csv file uses a "wide" table format with lots of columns for
> the different flags. My experience is that this table format has
> disadvantages, because you need to know which columns exist and update
> any code using the .csv file when you add new columns. I find the
> "long" table format to be more flexible. In that format you'd add
> columns for flags that just contain a boolean or null/NA/empty string
> and one line per combination of flags. Here's an example of what the
> first lines could look like in the "long" table format:
>
> Date,Authority,BadExit,Exit,Fast,Guard,HSDir,Named,Running,Stable,Unnamed,V2Dir,Valid,New,Gone
> 2007-10-27T13:00:00Z,T,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,0.00000,0.00000
> 2007-10-27T13:00:00Z,NA,NA,T,NA,NA,NA,NA,NA,NA,NA,NA,NA,0.09551,0.02669
That's new to me, thanks for the tip. I opened a feature report for
this:
<https://github.com/NullHypothesis/sybilhunter/issues/6>
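
A rough sketch of what the conversion could look like, using only Python's
csv module. The wide column names (NewAuthority, GoneAuthority, NewExit,
GoneExit) are my assumption about the current layout, and the sketch only
handles single flags, not flag combinations. The values are the ones from
the example above.

```python
import csv
import io

# Assumed subset of flags; the real file has more columns.
FLAGS = ["Authority", "Exit"]


def wide_to_long(wide_rows, flags):
    """Turn one wide row per date into one long row per (date, flag)."""
    long_rows = []
    for row in wide_rows:
        for flag in flags:
            out = {"Date": row["Date"]}
            # Mark the flag this row describes with T, all others with NA.
            for f in flags:
                out[f] = "T" if f == flag else "NA"
            # Assumed wide column naming: "New<Flag>" and "Gone<Flag>".
            out["New"] = row["New" + flag]
            out["Gone"] = row["Gone" + flag]
            long_rows.append(out)
    return long_rows


wide_csv = (
    "Date,NewAuthority,GoneAuthority,NewExit,GoneExit\n"
    "2007-10-27T13:00:00Z,0.00000,0.00000,0.09551,0.02669\n"
)
long_rows = wide_to_long(list(csv.DictReader(io.StringIO(wide_csv))), FLAGS)

# Write the long table back out as CSV text.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["Date"] + FLAGS + ["New", "Gone"])
writer.writeheader()
writer.writerows(long_rows)
long_csv = buf.getvalue()
```

Adding a new flag then only means adding one more column and more rows;
code that filters on existing columns keeps working.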
> - I agree with Nusenu that absolute relay numbers might be
> interesting, too. Using the "long" table format they are relatively
> easy to add as new column NewAbs and GoneAbs. Whether you'd graph
> them or not is another question.
Yes, that's already implemented:
<https://nymity.ch/sybilhunting/churn-values/churn-all.csv>
> - Another potentially interesting metric would be fraction of
> consensus weight joining or leaving the network, like NewCW and
> GoneCW. Not sure how useful that would be for joining relays, because
> new relays typically have very small consensus weights, but it might
> be interesting to see when a large part of the network by consensus
> weight leaves.
That's a good idea. I opened a feature report for it:
<https://github.com/NullHypothesis/sybilhunter/issues/4>
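
One possible way to define it, by analogy with the relay-count definition
above: sum the consensus weight of joining relays over C1's total weight,
and the weight of leaving relays over C0's total weight. The denominators
and the dict representation (fingerprint -> consensus weight) are my
assumptions; nothing here is implemented yet.

```python
def cw_churn(c0, c1):
    """Return (new_cw, gone_cw) as fractions of consensus weight.

    c0 and c1 map relay fingerprints to consensus weights (assumed
    representation). New weight is measured against C1's total, gone
    weight against C0's total.
    """
    total0 = sum(c0.values())
    total1 = sum(c1.values())
    new_weight = sum(w for fp, w in c1.items() if fp not in c0)
    gone_weight = sum(w for fp, w in c0.items() if fp not in c1)
    new_cw = new_weight / total1 if total1 else 0.0
    gone_cw = gone_weight / total0 if total0 else 0.0
    return new_cw, gone_cw


# Toy example: a heavy relay leaves, a light one joins.
c0 = {"A": 600, "B": 400}
c1 = {"B": 400, "C": 100}
new_cw, gone_cw = cw_churn(c0, c1)
```

In this toy case NewCW is 100/500 and GoneCW is 600/1000, which matches the
expectation that joining relays carry little weight while a departing
high-weight relay stands out.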
> - Did you consider comparing consensuses with more than 1 hour in
> between them, for example 1 day, 1 week, or 1 month? That would
> remove daily/weekly/monthly patterns and might make it easier to
> observe changes. It would also reduce the data resolution in the
> graph, allowing you to plot more than just a month. I could imagine
> that a graph from 2007 to 2016 would be much more useful with a data
> resolution of 1 week or 1 month. Note how a data format with
> DatePrevious and DateCurrent would allow you to add that data to your
> .csv file. Maybe add another column Interval that you set to "1
> month" to make plotting easier.
Also a good idea. I added another feature report:
<https://github.com/NullHypothesis/sybilhunter/issues/5>
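
A sketch of how a configurable interval could work, assuming consensuses
are kept in a dict keyed by valid-after time with sets of fingerprints as
values (hypothetical names, not existing sybilhunter code). As with the
hourly comparison, a pair is skipped when the earlier consensus is missing.

```python
from datetime import datetime, timedelta


def churn_series(consensuses, interval=timedelta(hours=1)):
    """Compare each consensus with the one exactly `interval` earlier.

    consensuses maps valid-after datetimes to sets of relay fingerprints.
    Returns rows of (DatePrevious, DateCurrent, New, Gone); pairs whose
    earlier consensus is missing are skipped instead of bridging the gap.
    """
    rows = []
    for current in sorted(consensuses):
        previous = current - interval
        if previous not in consensuses:
            continue
        c0, c1 = consensuses[previous], consensuses[current]
        new = len(c1 - c0) / len(c1) if c1 else 0.0
        gone = len(c0 - c1) / len(c0) if c0 else 0.0
        rows.append((previous, current, new, gone))
    return rows


# Toy data with a gap: the 14:00 consensus is missing.
t0 = datetime(2007, 10, 27, 13, 0)
consensuses = {
    t0: {"A", "B"},
    t0 + timedelta(hours=1): {"B", "C"},
    t0 + timedelta(hours=3): {"C", "D"},
}
hourly = churn_series(consensuses)
three_hourly = churn_series(consensuses, timedelta(hours=3))
```

With DatePrevious and DateCurrent in each row, rows for different
intervals could share one .csv file, as suggested.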
Cheers,
Philipp