[metrics-team] How to handle double entries in ConverTor
Karsten Loesing
karsten at torproject.org
Fri Jun 17 06:58:37 UTC 2016
Hi Thomas,
it sounds like a plausible design decision that your converter does
not deduplicate entries. However, the consequence of that cannot be
that you require the input data to be free of duplicates.
The consequence is that any consumer of your converted data must take
into account that there could be duplicates. And that's a reasonable
requirement. Just state in the README that there might be duplicate
entries in the output, and you're done. Though even if you didn't
state that, consumers of your data shouldn't make such an assumption
anyway.
Stated differently, "be conservative in what you do, be liberal in
what you accept from others." (Robustness principle,
https://en.wikipedia.org/wiki/Robustness_principle)
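[Editorial aside: the consumer-side handling described above could look
like the following Python sketch. It is not part of ConverTor; the
function name and the idea of keying on a digest of each raw entry are
illustrative assumptions.]

```python
import hashlib
import json

def unique_entries(lines):
    """Yield each JSON-encoded descriptor entry at most once.

    Duplicates are detected by hashing the raw line, so converted
    archives may legitimately contain repeats without confusing
    downstream consumers.
    """
    seen = set()
    for line in lines:
        digest = hashlib.sha1(line.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield json.loads(line)

# Example: the second copy of an identical entry is dropped.
data = ['{"fingerprint": "A", "published": "2016-06-01"}',
        '{"fingerprint": "A", "published": "2016-06-01"}',
        '{"fingerprint": "B", "published": "2016-06-01"}']
print(len(list(unique_entries(data))))  # prints 2
```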
Note that I'm still planning to repackage those older tarballs to
minimize confusion like yours when you found that the tarballs
contain different files than you expected. You shouldn't have to wait
for that, though.
All the best,
Karsten
On 16/06/16 23:42, tl wrote:
> Hi,
>
> during the metrics team chat this afternoon we discussed briefly
> how ConverTor would or should handle double entries of descriptors.
> I thought about it a little more and think the following aspects
> are important:
>
> * ConverTor only converts CollecTor tarballs into other formats:
> JSON, Parquet and Avro. It changes the contents of these archives
> as little as possible. If they contain double entries, the
> converted archives will do so too.
>
> * Therefore the CollecTor tarballs had better be correct ;-)
>
> * Doing otherwise wouldn’t be easy: ConverTor would have to keep
> all descriptors in memory before it could write them out in one
> flush. Or ConverTor would have to know how to modify existing
> JSON/Parquet/Avro files, and it would have to keep an index of the
> entries. This feels a bit like writing a database. I quite possibly
> don’t fully understand what I’m talking about here, but I still
> highly doubt that this would be the right way to go.
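[Editorial aside: for a sense of what the "index of the entries"
mentioned above would minimally cost, here is a Python sketch that
stores only a fixed-size digest per descriptor rather than the
descriptor itself. The class and method names are hypothetical, not
ConverTor's API.]

```python
import hashlib

class DescriptorIndex:
    """Track which descriptors were already written, keeping only a
    20-byte SHA-1 digest per entry instead of the full contents."""

    def __init__(self):
        self._digests = set()

    def add(self, raw_descriptor: bytes) -> bool:
        """Return True if the descriptor is new, False if seen before."""
        digest = hashlib.sha1(raw_descriptor).digest()
        if digest in self._digests:
            return False
        self._digests.add(digest)
        return True

index = DescriptorIndex()
assert index.add(b"descriptor-1") is True
assert index.add(b"descriptor-1") is False  # duplicate detected
```

At roughly 20 bytes plus set overhead per entry, even millions of
descriptors would fit in memory, though that still doesn't make
rewriting already-written JSON/Parquet/Avro files any easier.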
>
> * Analytics will work on the converted JSON/Parquet/Avro files
> first. Importing the descriptors from JSON/Parquet/Avro into a
> database is something that can be done to facilitate certain kinds
> of analytics but it’s not absolutely necessary. A lot of
> aggregation and other tasks can be accomplished from the converted
> files alone and as far as I know this is how it’s usually done:
> load data from files into memory and compute new data from there
> (and write it to files again).
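[Editorial aside: the load-from-files-and-aggregate pattern described
above might look like this Python sketch. The "published" field name
is an assumption for illustration, not the real converted schema.]

```python
import json
from collections import Counter

def entries_per_day(lines):
    """Aggregate straight from converted JSON-lines data: count
    descriptor entries per publication day, no database involved."""
    counts = Counter()
    for line in lines:
        entry = json.loads(line)
        # "published" is an assumed field name for illustration.
        counts[entry["published"][:10]] += 1
    return counts

# Feed it lines read from a converted file, e.g.:
#   with open("relay-descriptors.json") as f:
#       print(entries_per_day(f))
sample = ['{"published": "2016-06-01 12:00:00"}',
          '{"published": "2016-06-01 13:00:00"}',
          '{"published": "2016-06-02 09:00:00"}']
print(entries_per_day(sample))
```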
>
> * When importing descriptors into a database (HBase is planned) it
> should be easy to detect double entries and handle them
> appropriately.
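[Editorial aside: one reason detection is easy on import is that a
key-value store like HBase addresses rows by key, so writing a
duplicate under the same row key is just an idempotent overwrite. A
toy Python sketch, with a dict standing in for the store and a
digest-based row key as an assumed design choice:]

```python
import hashlib

store = {}  # stand-in for a database table addressed by row key

def put_descriptor(raw: bytes) -> bool:
    """Use the descriptor digest as the row key; importing the same
    descriptor twice overwrites the same row, so duplicates in the
    converted archives do no harm here. Returns True if the entry
    was already present."""
    row_key = hashlib.sha1(raw).hexdigest()
    duplicate = row_key in store
    store[row_key] = raw
    return duplicate

put_descriptor(b"descriptor-1")
assert put_descriptor(b"descriptor-1") is True   # second copy detected
assert len(store) == 1                           # but stored only once
```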
>
> * During aggregation from archive files it might be possible as
> well to detect double entries.
>
> * I wonder, though, what the practical relevance of this question
> is. I wouldn’t expect there to be a lot of double entries in
> CollecTor archives. Am I wrong?
>
> * So far I haven’t spent much time thinking about how a system
> must be built that is updated with new descriptors hourly or
> daily. For a start I would just regenerate monthly tarballs and
> then re-convert them, since that doesn’t take very long. Of course
> a database would be beneficial here, but I’ll have to play with
> this whole machinery a little more before I can say something that
> I’m confident about.
>
>
> Hope this helps to clear things up. Of course patches are always
> welcome :-) My focus, though, will be on getting the conversion
> bug-free (but not adding new features) and starting to do some
> meaningful aggregation and analytics.
>
> Cheers,
> oma
>
> _______________________________________________
> metrics-team mailing list
> metrics-team at lists.torproject.org
> https://lists.torproject.org/cgi-bin/mailman/listinfo/metrics-team
>