[metrics-team] duplicates in collector tarballs?
Karsten Loesing
karsten at torproject.org
Thu Jun 16 11:51:07 UTC 2016
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On 15/06/16 23:43, tl wrote:
>>
>> On 15.06.2016, at 15:07, Karsten Loesing <karsten at torproject.org>
>> wrote:
>>
>> Hi Thomas,
>>
>> my first guess is that you're looking at a different timestamp
>> than CollecTor for deciding which tarball a descriptor belongs
>> in.
>
> I’m using getPublishedMillis() in most cases, except:
>   Consensus - getValidAfterMillis()
>   Torperf - getStartMillis()
>   Tordnsel - getDownloadedMillis()
> What is CollecTor using?
Off the top of my head, those three are correct. Plus, votes would be
sorted by getValidAfterMillis(), too.
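
For illustration, here is a minimal sketch of how a timestamp could be turned into a tarball month. It assumes months are computed in UTC, as the attached Main.java does. The class MonthBucket and its method monthOf are hypothetical names, not part of metrics-lib or CollecTor. It also shows how a descriptor published shortly after midnight UTC on the 1st can end up in the previous month if another time zone is used by accident.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical helper for illustration; not part of metrics-lib. */
class MonthBucket {

  /** Formats milliseconds since the epoch as "yyyy-MM" in the given zone. */
  static String monthOf(long millis, String timeZoneId) {
    SimpleDateFormat format = new SimpleDateFormat("yyyy-MM", Locale.US);
    format.setTimeZone(TimeZone.getTimeZone(timeZoneId));
    return format.format(new Date(millis));
  }

  public static void main(String[] args) {
    // 2007-10-01 00:30:00 UTC, i.e. early in the morning of October 1.
    long millis = 1191198600000L;
    System.out.println(monthOf(millis, "UTC"));        // prints 2007-10
    System.out.println(monthOf(millis, "GMT-05:00"));  // prints 2007-09
  }
}
```

Whether a time zone mix-up plays any role in the case at hand is only a guess; the sketch merely shows that the month boundary depends on the zone.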
>> Unfortunately, "relays 2007-08 and 2007-09" is rather vague,
>> because relays published all kinds of descriptors in those two
>> months, and I can't really look at all those tarballs right now.
>
> Sorry, my bad. That’s the short name I used internally for
> server-descriptors-2007-09.tar.xz
> server-descriptors-2007-08.tar.xz
>
>
>> Can you list a tarball and a file contained in that tarball which
>> you think doesn't belong there?
>
> Converting server-descriptors-2007-09.tar.xz I get 3 results:
> Relay_2007-08.json, Relay_2007-09.json and Relay_2007-10.json. I’m
> attaching the latter:
>
>
>
>
>
> Both descriptors are from early in the morning of October 1.
Yep, you're right, that looks bad. I wrote a small Java application
to parse through all tarballs and tell me which of them contain
descriptors that don't belong there. I'm attaching the sources, FYI.
I'll create a ticket as soon as I have a better sense of what's going
wrong. But server-descriptors-2007-09.tar.xz does indeed look problematic.
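
A possible follow-up check for the original question, whether misplaced descriptors also appear in the tarball they belong in, could look like the sketch below. It assumes that some unique identifier per descriptor (a digest, for example) has already been collected per tarball. The class DuplicateCheck, its method findDuplicates and the identifiers in main are all made up for illustration.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper for illustration; not part of metrics-lib. */
class DuplicateCheck {

  /**
   * Takes descriptor identifiers grouped by tarball name and returns only
   * the identifiers found in more than one tarball, each mapped to the
   * sorted names of the tarballs containing it.
   */
  static Map<String, SortedSet<String>> findDuplicates(
      Map<String, ? extends Collection<String>> idsByTarball) {
    Map<String, SortedSet<String>> tarballsById = new HashMap<>();
    for (Map.Entry<String, ? extends Collection<String>> entry
        : idsByTarball.entrySet()) {
      for (String id : entry.getValue()) {
        SortedSet<String> tarballs = tarballsById.get(id);
        if (tarballs == null) {
          tarballs = new TreeSet<>();
          tarballsById.put(id, tarballs);
        }
        tarballs.add(entry.getKey());
      }
    }
    Map<String, SortedSet<String>> duplicates = new TreeMap<>();
    for (Map.Entry<String, SortedSet<String>> entry
        : tarballsById.entrySet()) {
      if (entry.getValue().size() > 1) {
        duplicates.put(entry.getKey(), entry.getValue());
      }
    }
    return duplicates;
  }

  public static void main(String[] args) {
    // Placeholder identifiers, not real descriptor digests.
    Map<String, Collection<String>> idsByTarball = new HashMap<>();
    idsByTarball.put("server-descriptors-2007-09.tar.xz",
        Arrays.asList("id-a", "id-b"));
    idsByTarball.put("server-descriptors-2007-10.tar.xz",
        Arrays.asList("id-b", "id-c"));
    System.out.println(findDuplicates(idsByTarball));
  }
}
```

An empty result would suggest that misplaced descriptors are stored only once, in the wrong tarball.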
> And I’m also wondering whether I shouldn’t just use the date of the
> tarball that contains the descriptors. I hadn’t expected any
> problems here, so I went for the (easily reachable) dates in the
> descriptors, but it seems safest to just reproduce CollecTor
> tarballs as faithfully as possible, no matter how the descriptors
> were allocated. Especially since the situation gets even more
> complex with Consensus, Torperf and Tordnsel. I just don’t know how
> exactly I could get hold of the name of the tarball that the
> descriptor is extracted from. It seems that
> metrics-lib.DescriptorReader doesn’t provide the name of the tarball
> it’s reading. Can you do something about that?
You should be able to learn that via DescriptorFile. See the Javadocs
there.
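
Once the tarball's file name is known, the month can be taken from the name itself. The following is a minimal sketch, assuming tarball names end in "-yyyy-MM.tar.xz" like server-descriptors-2007-09.tar.xz. The class TarballMonth and its method fromFileName are hypothetical names; how to obtain the file name from DescriptorFile is left to its Javadocs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of metrics-lib. */
class TarballMonth {

  /** Matches "-yyyy-MM" directly in front of ".tar". */
  private static final Pattern MONTH_PATTERN =
      Pattern.compile("-(\\d{4}-\\d{2})\\.tar");

  /** Returns the "yyyy-MM" part of a tarball file name, or null if none. */
  static String fromFileName(String tarballFilename) {
    Matcher matcher = MONTH_PATTERN.matcher(tarballFilename);
    return matcher.find() ? matcher.group(1) : null;
  }

  public static void main(String[] args) {
    // prints 2007-09
    System.out.println(fromFileName("server-descriptors-2007-09.tar.xz"));
  }
}
```

Compared to searching for "-20" as the attached Main.java does, the pattern does not depend on the century, and a file name without a month yields null.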
> Ciao Thomas
All the best,
Karsten
>> All the best, Karsten
>>
>>
>> On 14/06/16 11:33, tl wrote:
>>>
>>>> On 14.06.2016, at 11:27, tl <tl at rat.io> wrote:
>>>>
>>>>>
>>>>> On 14.06.2016, at 10:05, Karsten Loesing
>>>>> <karsten at torproject.org> wrote:
>>>>>
>>>>> Hi Thomas,
>>>>>
>>>>> can you give one or more examples?
>>>>
>>>> Unfortunately I didn’t keep note of them. When I couldn’t
>>>> convert all descriptors of one type in one run (because I ran
>>>> into memory limits) I converted descriptors per year. Maybe
>>>> in 20% of these cases I got results like this:
>>>>
>>>> -rwxrwxrwx 1 t t    1978191 Jun 14 03:23 RelayVote_2015-12.parquet.snappy
>>>> -rwxrwxrwx 1 t t 2473316989 Jun 14 03:23 RelayVote_2016-01.parquet.snappy
>>>> -rwxrwxrwx 1 t t 2384211448 Jun 14 03:23 RelayVote_2016-02.parquet.snappy
>>>> -rwxrwxrwx 1 t t 2265386311 Jun 14 03:23 RelayVote_2016-03.parquet.snappy
>>>> -rwxrwxrwx 1 t t 2339112076 Jun 14 03:23 RelayVote_2016-04.parquet.snappy
>>>> -rwxrwxrwx 1 t t 2062026086 Jun 14 03:23 RelayVote_2016-05.parquet.snappy
>>>>
>>>> where I had only converted tarballs of 2016.
>>>>
>>>>
>>>> I had similar issues when I converted tarballs from another
>>>> year, but I don’t remember for sure which type and which year.
>>>> I think (!) it was relays for 2012-08 and 2012-09, so it’s not
>>>> only an issue with year ends. It seems like my JSON
>>>> converter handles this issue differently than my Parquet
>>>> converter. The JSON converter didn’t run into memory issues
>>>> and seems to be happy to append to data already written to
>>>> disk. The Parquet converter otoh often (but not always :-/)
>>>> keeps everything in memory and only in the very last step
>>>> writes everything to disk in one flush. Then sometimes the
>>>> results for one or two months remain completely empty, and my
>>>> current guess would be that in those cases there was an
>>>> overlap of descriptors in tarballs from different months and
>>>> the converter couldn’t decide which one to write out. The
>>>> two months mentioned above were such a case, and when I then
>>>> converted them separately I also got results for the months
>>>> 2012-07 and 2012-10. But again: I’m neither sure about the
>>>> year nor the type of descriptor. I would have to rerun
>>>> conversions and search for them. Should I?
>>>
>>> Ha, found them in the bash-history: relays 2007-08 and 2007-09
>>>
>>> c’t
>>>
>>>
>>>> Ciao Thomas
>>>>
>>>>
>>>>> All the best, Karsten
>>>>>
>>>>>
>>>>> On 13/06/16 22:19, tl wrote:
>>>>>> Hi,
>>>>>>
>>>>>> when testing some descriptor converter I stumbled across
>>>>>> the fact that descriptor tarballs for a given month
>>>>>> sometimes contain a few descriptors from the month before
>>>>>> or after. That introduces a problem that I might be able
>>>>>> to overcome by poking at the code, but before I try that
>>>>>> I’d like to know:
>>>>>> - If a descriptor tarball for, say, 2012-10 also contains
>>>>>>   descriptors from 2012-09, does that mean that the 2012-09
>>>>>>   descriptors contained in the 2012-10 tarball are not
>>>>>>   contained in the 2012-09 tarball? Or are they duplicates?
>>>>>> - And if they are not duplicates: would it be hard to
>>>>>>   repackage the tarballs? Tedious for sure, but hard? Or not
>>>>>>   good for other reasons?
>>>>>>
>>>>>> Cheers Thomas
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> metrics-team mailing list
>>>>>> metrics-team at lists.torproject.org
>>>>>> https://lists.torproject.org/cgi-bin/mailman/listinfo/metrics-team
-----BEGIN PGP SIGNATURE-----
Comment: GPGTools - http://gpgtools.org
iQEcBAEBAgAGBQJXYpKqAAoJEC3ESO/4X7XBAFcH/2fTkmtl4GVimbl1QQT6vRsj
ziD0EHyQ68R3iuEpAtNpsV0G0ItEsn+RyPc+OdKCERFNVD+ulRQP8FJzjzH/IlR9
820eYZhBzs8rb7samdYhZvV6s9J1LT8/YqpHBWrV7DUzREt9iBJOqFLcYh0xNXcY
CFKOPyU9oJ2Iq2pn/+E3CKXsSAnuRM91QoVTKyQ2UtI0Lq4iTfPUScnXicUDB2Fw
zBIQwdvLNtzaCuZjrH+a+zostolZ3Wlw9D7emyZth3pS4eq6NcEwxY6LZ8e/Yxv4
xQYq+oTJ5wDKQy1Xh7xcslVQVikMtL09lvmenCczHmaHO3VVSqOT/N5/+54anXM=
=cvXb
-----END PGP SIGNATURE-----
-------------- next part --------------
package wrongmonth;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Iterator;
import java.util.Locale;
import java.util.TimeZone;
import org.torproject.descriptor.BridgeExtraInfoDescriptor;
import org.torproject.descriptor.BridgeNetworkStatus;
import org.torproject.descriptor.BridgePoolAssignment;
import org.torproject.descriptor.BridgeServerDescriptor;
import org.torproject.descriptor.Descriptor;
import org.torproject.descriptor.DescriptorFile;
import org.torproject.descriptor.DescriptorReader;
import org.torproject.descriptor.DescriptorSourceFactory;
import org.torproject.descriptor.ExitList;
import org.torproject.descriptor.RelayDirectory;
import org.torproject.descriptor.RelayExtraInfoDescriptor;
import org.torproject.descriptor.RelayNetworkStatus;
import org.torproject.descriptor.RelayNetworkStatusConsensus;
import org.torproject.descriptor.RelayNetworkStatusVote;
import org.torproject.descriptor.RelayServerDescriptor;
import org.torproject.descriptor.TorperfResult;
public class Main {

  public static void main(String[] args) throws IOException {
    File tarballsDirectory = new File(
        "/Users/karsten/backup/collector-backup");
    File logFile = new File("wrongmonth.log");
    new Main(tarballsDirectory, logFile).parseTarballsDirectory();
  }

  /* Opened on the log file, but findings are printed to stdout. */
  private BufferedWriter bw;

  private File tarballsDirectory;

  /* Formats timestamps as "yyyy-MM" in UTC. */
  DateFormat yearMonthFormat;

  public Main(File tarballsDirectory, File logFile) throws IOException {
    this.bw = new BufferedWriter(new FileWriter(logFile));
    this.tarballsDirectory = tarballsDirectory;
    this.yearMonthFormat = new SimpleDateFormat("yyyy-MM", Locale.US);
    this.yearMonthFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
  }
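
  /* Derives each tarball's month from its file name, then checks the
   * descriptors it contains. */
  private void parseTarballsDirectory() throws IOException {
    File[] tarballFiles = this.tarballsDirectory.listFiles();
    if (tarballFiles == null) {
      /* listFiles() returns null if the path is not a readable
       * directory. */
      System.out.printf("Cannot list files in '%s'.\n",
          this.tarballsDirectory);
      return;
    }
    for (File tarballFile : tarballFiles) {
      String tarballFilename = tarballFile.getName();
      if (!tarballFilename.contains("-20")) {
        System.out.printf("Cannot extract month from tarball "
            + "'%s'.\n", tarballFilename);
      } else {
        String tarballMonth = tarballFilename.substring(
            tarballFilename.indexOf("-20") + 1);
        tarballMonth = tarballMonth.substring(0, "yyyy-MM".length());
        System.out.printf("Processing tarball '%s'.\n",
            tarballFilename);
        this.parseDescriptors(tarballFile, tarballMonth);
      }
    }
  }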

  private void parseDescriptors(File tarballFile, String tarballMonth)
      throws IOException {
    DescriptorReader descriptorReader =
        DescriptorSourceFactory.createDescriptorReader();
    descriptorReader.addTarball(tarballFile);
    descriptorReader.setMaxDescriptorFilesInQueue(10);
    Iterator<DescriptorFile> descriptorFiles =
        descriptorReader.readDescriptors();
    while (descriptorFiles.hasNext()) {
      DescriptorFile descriptorFile = descriptorFiles.next();
      for (Descriptor descriptor : descriptorFile.getDescriptors()) {
        long publishedMillis = this.extractPublishedMonth(descriptor);
        String publishedMonth = this.yearMonthFormat.format(
            publishedMillis);
        if (!tarballMonth.equals(publishedMonth)) {
          System.out.printf("Tarball '%s' contains file "
              + "'%s' with a descriptor published at '%s'.\n",
              tarballFile.getName(), descriptorFile.getFileName(),
              publishedMonth);
        }
      }
    }
  }

  /* Returns the timestamp in milliseconds that decides which month a
   * descriptor belongs in, or -1 for unknown descriptor types. */
  private long extractPublishedMonth(Descriptor descriptor) {
    long publishedMillis = -1L;
    if (descriptor instanceof BridgeNetworkStatus) {
      publishedMillis = ((BridgeNetworkStatus) descriptor)
          .getPublishedMillis();
    } else if (descriptor instanceof BridgeServerDescriptor) {
      publishedMillis = ((BridgeServerDescriptor) descriptor)
          .getPublishedMillis();
    } else if (descriptor instanceof BridgeExtraInfoDescriptor) {
      publishedMillis = ((BridgeExtraInfoDescriptor) descriptor)
          .getPublishedMillis();
    } else if (descriptor instanceof BridgePoolAssignment) {
      publishedMillis = ((BridgePoolAssignment) descriptor)
          .getPublishedMillis();
    } else if (descriptor instanceof RelayNetworkStatusConsensus) {
      publishedMillis = ((RelayNetworkStatusConsensus) descriptor)
          .getValidAfterMillis();
    } else if (descriptor instanceof ExitList) {
      publishedMillis = ((ExitList) descriptor)
          .getDownloadedMillis();
    } else if (descriptor instanceof RelayExtraInfoDescriptor) {
      publishedMillis = ((RelayExtraInfoDescriptor) descriptor)
          .getPublishedMillis();
    } else if (descriptor instanceof RelayServerDescriptor) {
      publishedMillis = ((RelayServerDescriptor) descriptor)
          .getPublishedMillis();
    } else if (descriptor instanceof RelayNetworkStatus) {
      publishedMillis = ((RelayNetworkStatus) descriptor)
          .getPublishedMillis();
    } else if (descriptor instanceof RelayDirectory) {
      /* Cast to RelayDirectory; casting to BridgePoolAssignment here
       * would throw a ClassCastException. */
      publishedMillis = ((RelayDirectory) descriptor)
          .getPublishedMillis();
    } else if (descriptor instanceof TorperfResult) {
      publishedMillis = ((TorperfResult) descriptor)
          .getStartMillis();
    } else if (descriptor instanceof RelayNetworkStatusVote) {
      publishedMillis = ((RelayNetworkStatusVote) descriptor)
          .getValidAfterMillis();
    }
    return publishedMillis;
  }
}
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Main.java.sig
Type: application/octet-stream
Size: 287 bytes
Desc: not available
URL: <http://lists.torproject.org/pipermail/metrics-team/attachments/20160616/1c0c8f55/attachment.obj>