[metrics-team] How to handle double entries in ConverTor

Karsten Loesing karsten at torproject.org
Fri Jun 17 10:48:30 UTC 2016



Hi Thomas,

you could cite:

https://collector.torproject.org/#data-formats

All the best,
Karsten


On 17/06/16 11:39, tl wrote:
> Hi,
> 
> is there such a README for CollecTor data anywhere? I would just
> cite or refer to it.
> 
> Cheers, Thomas
> 
> 
>> On 17.06.2016, at 08:58, Karsten Loesing <karsten at torproject.org>
>> wrote:
>> 
>> Hi Thomas,
>> 
>> it sounds like a plausible design decision that your converter
>> does not deduplicate entries.  However, the consequence of that
>> cannot be that you require the input data to be free of
>> duplicates.
>> 
>> The consequence is that any consumer of your converted data must
>> take into account that there could be duplicates.  And that's a
>> reasonable requirement.  Just state it in the README that there
>> might be duplicate entries in the output, and you're done.
>> Though even if you didn't state that, consumers of your data
>> shouldn't make such an assumption anyway.
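
To illustrate that point: a consumer of the converted data can simply deduplicate on read. The sketch below assumes one JSON object per line and illustrative field names; it is not ConverTor's documented schema.

```python
import hashlib
import json

def unique_descriptors(lines):
    """Yield each descriptor record at most once, keyed by a digest
    of its (canonicalized) content.  One JSON object per line and the
    field names below are illustrative assumptions, not a documented
    ConverTor format."""
    seen = set()
    for line in lines:
        record = json.loads(line)
        digest = hashlib.sha1(
            json.dumps(record, sort_keys=True).encode()
        ).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield record

records = ['{"type": "relay", "nickname": "a"}',
           '{"type": "relay", "nickname": "a"}',  # duplicate entry
           '{"type": "relay", "nickname": "b"}']
print(len(list(unique_descriptors(records))))  # -> 2
```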
>> 
>> Stated differently, "be conservative in what you do, be liberal
>> in what you accept from others." (Robustness principle, 
>> https://en.wikipedia.org/wiki/Robustness_principle)
>> 
>> Note that I'm still planning to repackage those older tarballs
>> to minimize confusion like yours, from finding that tarballs
>> contain different files than expected.  You shouldn't have to
>> wait for that, though.
>> 
>> All the best, Karsten
>> 
>> 
>> On 16/06/16 23:42, tl wrote:
>>> Hi,
>>> 
>>> during the metrics team chat this afternoon we discussed
>>> briefly how ConverTor would or should handle double entries of
>>> descriptors. I thought about it a little more and think the
>>> following aspects are important:
>>> 
>>> * ConverTor only converts CollecTor tarballs into other
>>> formats: JSON, Parquet and Avro. It changes the contents of
>>> these archives as little as possible. If they contain double
>>> entries, the converted archives will do so too.
>>> 
>>> * Therefore the CollecTor tarballs had better be correct ;-)
>>> 
>>> * Doing otherwise wouldn’t be easy: ConverTor would have to
>>> keep all descriptors in memory before it could write them out
>>> in one flush.  Or ConverTor would have to know how to modify
>>> JSON/Parquet/Avro files in place, and it would have to keep an
>>> index of the entries.  This feels a bit like writing a
>>> database.  I quite possibly don’t fully understand what I’m
>>> talking about here, but I still highly doubt that this would
>>> be the right way to go.
>>> 
>>> * Analytics will work on the converted JSON/Parquet/Avro files 
>>> first. Importing the descriptors from JSON/Parquet/Avro into a 
>>> database is something that can be done to facilitate certain
>>> kinds of analytics but it’s not absolutely necessary. A lot of 
>>> aggregation and other tasks can be accomplished from the
>>> converted files alone and as far as I know this is how it’s
>>> usually done: load data from files into memory and compute new
>>> data from there (and write it to files again).
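
The load-compute-write pattern described above might look like this minimal sketch; the one-record-per-line layout and the "type" field are assumptions for illustration, not ConverTor's actual output schema:

```python
import json
from collections import Counter

def aggregate(lines):
    """Aggregate straight from converted files in one pass: count
    descriptors per type.  The 'type' field is an assumed,
    illustrative schema."""
    counts = Counter()
    for line in lines:
        counts[json.loads(line)["type"]] += 1
    return dict(counts)

sample = ['{"type": "server-descriptor"}',
          '{"type": "server-descriptor"}',
          '{"type": "extra-info"}']
print(aggregate(sample))  # {'server-descriptor': 2, 'extra-info': 1}
```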
>>> 
>>> * When importing descriptors into a database (HBase is
>>> planned) it should be easy to detect double entries and handle
>>> them appropriately.
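
One way a database import can make duplicates harmless is to key rows by a content digest, so inserting the same descriptor twice is a no-op; HBase puts to the same row key overwrite rather than duplicate, which gives this behavior naturally. The sketch below uses sqlite3 as a stand-in to show the idea:

```python
import hashlib
import sqlite3

# HBase is the planned store; sqlite3 stands in here to illustrate
# the idea: use the descriptor digest as the primary key, so a
# duplicate insert is simply ignored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE descriptors (digest TEXT PRIMARY KEY, body TEXT)")

def store(body: str) -> None:
    digest = hashlib.sha1(body.encode()).hexdigest()
    conn.execute("INSERT OR IGNORE INTO descriptors VALUES (?, ?)",
                 (digest, body))

store("descriptor-1")
store("descriptor-1")  # duplicate: ignored
store("descriptor-2")
count = conn.execute("SELECT COUNT(*) FROM descriptors").fetchone()[0]
print(count)  # -> 2
```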
>>> 
>>> * During aggregation from archive files it might be possible
>>> as well to detect double entries.
>>> 
>>> * I wonder though what the practical relevance of this
>>> question is.  I wouldn’t expect there to be a lot of double
>>> entries in CollecTor archives.  Am I wrong?
>>> 
>>> * So far I haven’t spent much time thinking about how a
>>> system must be built that is updated with new descriptors
>>> hourly or daily.  For a start I would just regenerate monthly
>>> tarballs and then re-convert them, since that doesn’t take
>>> very long.  Of course a database should be beneficial here,
>>> but I’ll have to play with this whole machinery a little more
>>> before I can say something that I’m confident about.
>>> 
>>> 
>>> Hope this helps to clear things up.  Of course patches are
>>> always welcome :-)  My focus though will be on getting the
>>> conversion bug-free (not adding new features) and starting to
>>> do some meaningful aggregation and analytics.
>>> 
>>> Cheers, oma
>>> 
>>> _______________________________________________
>>> metrics-team mailing list
>>> metrics-team at lists.torproject.org
>>> https://lists.torproject.org/cgi-bin/mailman/listinfo/metrics-team

