[metrics-team] duplicates in collector tarballs?
Karsten Loesing
karsten at torproject.org
Wed Jun 15 13:07:47 UTC 2016
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hi Thomas,
my first guess is that you're looking at a different timestamp than
CollecTor for deciding which tarball a descriptor belongs in.
Unfortunately, "relays 2007-08 and 2007-09" is rather vague, because
relays published all kinds of descriptors in those two months, and I
can't really look at all those tarballs right now.
Can you list a tarball and a file contained in that tarball which you
think doesn't belong there?
All the best,
Karsten
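
To make the timestamp point above concrete, here is a minimal Python
sketch. It assumes a descriptor carries two timestamps, a publication
time and the time an archiver received it. Both values and the
`month_bucket` helper are hypothetical and are not taken from
CollecTor's code; this thread does not say which timestamp CollecTor
actually uses.

```python
from datetime import datetime, timezone


def month_bucket(ts: datetime) -> str:
    """Return the YYYY-MM bucket (in UTC) that a timestamp falls into."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m")


# Hypothetical descriptor published just before midnight at month end
# and received by an archiver a few minutes later (assumed values).
published = datetime(2007, 8, 31, 23, 58, 12, tzinfo=timezone.utc)
received = datetime(2007, 9, 1, 0, 3, 40, tzinfo=timezone.utc)

print(month_bucket(published))  # 2007-08
print(month_bucket(received))   # 2007-09
```

A converter that sorts by one timestamp while the tarballs were built
using the other would see such a descriptor as belonging to the
"wrong" month.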
On 14/06/16 11:33, tl wrote:
>
>> On 14.06.2016, at 11:27, tl <tl at rat.io> wrote:
>>
>>>
>>> On 14.06.2016, at 10:05, Karsten Loesing
>>> <karsten at torproject.org> wrote:
>>>
>>> Hi Thomas,
>>>
>>> can you give one or more examples?
>>
>> Unfortunately I didn’t keep a note of them. When I couldn’t convert
>> all descriptors of one type in one run (because I ran into memory
>> limits) I converted descriptors per year. Maybe in 20% of these
>> cases I got results like this:
>>
>> -rwxrwxrwx 1 t t    1978191 Jun 14 03:23 RelayVote_2015-12.parquet.snappy
>> -rwxrwxrwx 1 t t 2473316989 Jun 14 03:23 RelayVote_2016-01.parquet.snappy
>> -rwxrwxrwx 1 t t 2384211448 Jun 14 03:23 RelayVote_2016-02.parquet.snappy
>> -rwxrwxrwx 1 t t 2265386311 Jun 14 03:23 RelayVote_2016-03.parquet.snappy
>> -rwxrwxrwx 1 t t 2339112076 Jun 14 03:23 RelayVote_2016-04.parquet.snappy
>> -rwxrwxrwx 1 t t 2062026086 Jun 14 03:23 RelayVote_2016-05.parquet.snappy
>>
>> where I had only converted tarballs of 2016.
>>
>>
>> I had similar issues when I converted tarballs from another year
>> but I don’t remember for sure which type and which year. I think
>> (!) it was relays for 2012-08 and 2012-09, so it’s not only an
>> issue with year ends. It seems like my JSON converter handles this
>> issue differently than my Parquet converter. The JSON converter
>> didn’t run into memory issues and seems to be happy to append to
>> data already written to disk. The Parquet converter, otoh, often
>> (but not always :-/) keeps everything in memory and only in the
>> very last step writes everything to disk in one flush. Then
>> sometimes the results for one or two months remain completely
>> empty, and my current guess would be that in those cases there was
>> an overlap of descriptors in tarballs from different months and
>> the converter couldn’t decide which one to write out. The two
>> months mentioned above were such a case, and when I then
>> converted them separately I also got results for the months
>> 2012-07 and 2012-10. But again: I’m sure about neither the year
>> nor the type of descriptor. I would have to rerun conversions and
>> search for them. Should I?
>
> Ha, found them in the bash-history: relays 2007-08 and 2007-09
>
> c’t
>
>
>> Ciao Thomas
>>
>>
>>> All the best, Karsten
>>>
>>>
>>> On 13/06/16 22:19, tl wrote:
>>>> Hi,
>>>>
>>>> when testing some descriptor converter I stumbled across the
>>>> fact that descriptor tarballs for a given month sometimes
>>>> contain a few descriptors from the month before or after.
>>>> That introduces a problem that I might be able to overcome by
>>>> poking at the code, but before I try that I’d like to know:
>>>>
>>>> - If a descriptor tarball for, say, 2012-10 also contains
>>>>   descriptors from 2012-09, does that mean that the 2012-09
>>>>   descriptors contained in the 2012-10 tarball are not
>>>>   contained in the 2012-09 tarball? Or are they duplicates?
>>>> - And if they are not duplicates: would it be hard to repackage
>>>>   the tarballs? Tedious for sure, but hard? Or not good for
>>>>   other reasons?
>>>>
>>>> Cheers Thomas
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________ metrics-team
>>>> mailing list metrics-team at lists.torproject.org
>>>> https://lists.torproject.org/cgi-bin/mailman/listinfo/metrics-team
>>>>
-----BEGIN PGP SIGNATURE-----
Comment: GPGTools - http://gpgtools.org
iQEcBAEBAgAGBQJXYVMjAAoJEC3ESO/4X7XBDhQH/iAXIuhf8ghdcvLP8Uk/Nuv/
Xt53IRqvI+wV9zqonmPReXGDmOKDZA3v0n0+1d58+XXakUU+WFZq2yG3x0BebzVx
WcdpySxoT7jepKxz/Q0eo7nNyNnlrQv80lr2mh7URmkq83CZdlW+4/ZbXx5A6DBY
cLojNTUs30LHWDpv3+nk1qyT6DISStNw8bwK/FP/fDFiTmQDMqo+8wTEZF4a7k8v
1K10yOa9O0DMbYdI0Czb6DBiI3MqfCZP/6oPyi3gJR6IiDCWPijb5TDjuLTH9/5K
+BIblV7Ayx1bSsuCrpryJ/vt+pmIbWf4IDhQ+Z9m7JiUVMV59JwjxKrRvAcDiOc=
=DIRN
-----END PGP SIGNATURE-----
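
The question raised at the start of the thread, whether descriptors
that show up in two monthly tarballs are duplicates, could be checked
empirically along these lines. This is only a sketch: it assumes that
a duplicate descriptor is stored as a byte-identical file in both
tarballs, and the helper names `member_digests` and `shared_files` are
made up for this illustration.

```python
import hashlib
import tarfile


def member_digests(tar_path):
    """Map the SHA-256 of each regular file's bytes to its member name."""
    digests = {}
    # "r:*" lets tarfile detect the compression (e.g. xz) on its own.
    with tarfile.open(tar_path, "r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            data = tar.extractfile(member).read()
            digests[hashlib.sha256(data).hexdigest()] = member.name
    return digests


def shared_files(tar_a, tar_b):
    """Return (name_in_a, name_in_b) pairs for byte-identical files."""
    a = member_digests(tar_a)
    b = member_digests(tar_b)
    return sorted((a[d], b[d]) for d in a.keys() & b.keys())


# Usage, with placeholder file names for two adjacent monthly tarballs:
# for name_a, name_b in shared_files("month-a.tar.xz", "month-b.tar.xz"):
#     print(name_a, "==", name_b)
```

An empty result would suggest that the stray descriptors are not
duplicates, at least under the byte-identity assumption stated above.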