[tor-commits] [stem/master] Error when reading CollecTor file twice
atagar at torproject.org
Sat Oct 24 22:05:17 UTC 2020
commit 78ad7080d6172365ddfbd50742ea42fa814c0485
Author: Damian Johnson <atagar at torproject.org>
Date: Sat Oct 24 14:27:23 2020 -0700
Error when reading CollecTor file twice
When a CollecTor file is already within our cache, we validate its hash against
CollecTor's index. However, files we wrote to disk are decompressed by default,
so the hash of the cached copy naturally mismatches the indexed one.
https://github.com/torproject/stem/issues/76
Reproduced this issue with the following script...
import stem.descriptor.collector
collector = stem.descriptor.collector.get_instance()
desc_file = collector.files(descriptor_type = 'server-descriptor')[0] # pick any arbitrary file
print('Number of descriptors (first read): %s' % len(list(desc_file.read(directory = '/tmp/collector_cache'))))
print('Number of descriptors (second read): %s' % len(list(desc_file.read(directory = '/tmp/collector_cache'))))
Before...
% python demo.py
Number of descriptors (first read): 3112
Traceback (most recent call last):
File "scrap.py", line 8, in <module>
print('Number of descriptors (second read): %s' % len(list(desc_file.read(directory = '/tmp/collector_cache'))))
File "/home/atagar/Desktop/stem/stem/descriptor/collector.py", line 273, in read
path = self.download(directory, True, timeout, retries)
File "/home/atagar/Desktop/stem/stem/descriptor/collector.py", line 335, in download
raise OSError("%s already exists but mismatches CollecTor's checksum (expected: %s, actual: %s)" % (path, expected_hash, actual_hash))
OSError: /tmp/collector_cache/server-descriptors-2005-12.tar already exists but mismatches CollecTor's checksum (expected: bf700d8b6143e310219b2ce2810abd82f94bc295c7f08e9f1a88989562e33b2f, actual: 32a5ea8fd761e5967fbb8d399742f0da7cbb1c79c1539f2e58cad2e668462652)
After...
% python demo.py
Number of descriptors (first read): 3112
Number of descriptors (second read): 3112
We can solve this either by dropping the hash check or by caching compressed
archives. Initially I leaned toward the former to expedite cache reads, but on
reflection the latter is conceptually simpler. Essentially, is this a network
cache or a read cache? A network cache is safer in that if CollecTor replaces a
file (but keeps the same filename) this will catch the change.
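
To illustrate why the decompressed cache tripped the check, here is a minimal
standalone sketch. It is not stem's code: sha256_hex and cache_is_valid are
hypothetical helpers, the payload is made up, and it assumes the index hash is
a SHA-256 digest taken over the compressed archive.

```python
import hashlib
import lzma
import os
import tempfile


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of the given bytes (hypothetical helper)."""
    return hashlib.sha256(data).hexdigest()


def cache_is_valid(path: str, expected_hash: str) -> bool:
    """Check a cached file's bytes against the hash an index advertises."""
    with open(path, 'rb') as cached_file:
        return sha256_hex(cached_file.read()) == expected_hash


# Made-up payload standing in for an archive's plaintext content.
plaintext = b'router example 127.0.0.1 9001 0 0\n' * 100
archive = lzma.compress(plaintext)

# Assumption: the index hashes the compressed bytes that are served.
index_hash = sha256_hex(archive)

with tempfile.TemporaryDirectory() as cache_dir:
    compressed_path = os.path.join(cache_dir, 'example.tar.xz')
    decompressed_path = os.path.join(cache_dir, 'example.tar')

    # Network cache: store exactly what was downloaded.
    with open(compressed_path, 'wb') as output_file:
        output_file.write(archive)

    # Read cache: store the decompressed content instead.
    with open(decompressed_path, 'wb') as output_file:
        output_file.write(lzma.decompress(archive))

    compressed_ok = cache_is_valid(compressed_path, index_hash)
    decompressed_ok = cache_is_valid(decompressed_path, index_hash)

print('compressed cache matches index: %s' % compressed_ok)
print('decompressed cache matches index: %s' % decompressed_ok)
```

Only the compressed copy can be validated against the index, which is why
caching the archive as downloaded keeps the check meaningful.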
---
stem/descriptor/collector.py | 13 ++-----------
1 file changed, 2 insertions(+), 11 deletions(-)
diff --git a/stem/descriptor/collector.py b/stem/descriptor/collector.py
index 892ccc30..a84d9724 100644
--- a/stem/descriptor/collector.py
+++ b/stem/descriptor/collector.py
@@ -270,7 +270,7 @@ class File(object):
return
- path = self.download(directory, True, timeout, retries)
+ path = self.download(directory, timeout, retries)
# Archives can contain multiple descriptor types, so parsing everything and
# filtering to what we're after.
@@ -290,13 +290,12 @@ class File(object):
yield desc
- def download(self, directory: str, decompress: bool = True, timeout: Optional[int] = None, retries: Optional[int] = 3, overwrite: bool = False) -> str:
+ def download(self, directory: str, timeout: Optional[int] = None, retries: Optional[int] = 3, overwrite: bool = False) -> str:
"""
Downloads this file to the given location. If a file already exists this is
a no-op.
:param directory: destination to download into
- :param decompress: decompress written file
:param timeout: timeout when connection becomes idle, no timeout
applied if **None**
:param retries: maximum attempts to impose
@@ -311,12 +310,7 @@ class File(object):
"""
filename = self.path.split('/')[-1]
-
- if self.compression != Compression.PLAINTEXT and decompress:
- filename = filename.rsplit('.', 1)[0]
-
directory = os.path.expanduser(directory)
-
path = os.path.join(directory, filename)
if not os.path.exists(directory):
@@ -336,9 +330,6 @@ class File(object):
response = stem.util.connection.download(COLLECTOR_URL + self.path, timeout, retries)
- if decompress:
- response = self.compression.decompress(response)
-
with open(path, 'wb') as output_file:
output_file.write(response)