[tor-bugs] #6471 [Metrics Utilities]: Design file format and Python/Java library for multiple GeoIP or AS databases
Tor Bug Tracker & Wiki
blackhole at torproject.org
Wed Nov 7 15:53:49 UTC 2012
#6471: Design file format and Python/Java library for multiple GeoIP or AS
databases
-------------------------------+--------------------------------------------
 Reporter:  karsten            |          Owner:
     Type:  enhancement        |         Status:  needs_review
 Priority:  normal             |      Milestone:
Component:  Metrics Utilities  |        Version:
 Keywords:                     |         Parent:
   Points:                     |   Actualpoints:
-------------------------------+--------------------------------------------
Comment(by atagar):
> octets = address_string.split('.')
> return long(''.join(["%02X" % long(octet) for octet in octets]), 16)
Nice. :)
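For reference, here is a minimal runnable sketch of that conversion. The
function name address_to_number is my own placeholder, not something from
the patch, and I've swapped long() for int() so the same lines run on both
Python 2 and Python 3 (where long() no longer exists).

```python
def address_to_number(address_string):
    # Hypothetical helper name. Each octet becomes two hex digits, and
    # the joined eight-digit hex string is parsed as a single integer.
    octets = address_string.split('.')
    return int(''.join(["%02X" % int(octet) for octet in octets]), 16)


first = address_to_number('1.0.0.0')           # hex 01000000
last = address_to_number('255.255.255.255')    # hex FFFFFFFF
```

Note that this sketch does no validation, so it assumes a well-formed
dotted-quad IPv4 address.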
> Are you certain that the current code leads to reading the file into
> memory twice?
Pretty sure. What makes you think that it doesn't?
Calling 'file.readlines()' reads the whole file into a list of
newline-terminated strings. At this point you've read the whole file into
memory once. You then iterate over the entries and append data for each
one, so the contents are held a second time in the parsed structure while
the list of raw lines is still around.
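To make the difference concrete, here is a rough sketch of the two loading
styles. Everything in it is assumed for illustration: parse_entry, the
"start,end,code" line format, and the two made-up sample rows are
placeholders, not the actual database format or code from the ticket.

```python
import os
import tempfile


def parse_entry(line):
    # Hypothetical parser for an assumed "start,end,code" line format.
    start, end, code = line.strip().split(',')
    return (int(start), int(end), code)


def load_with_readlines(path):
    # readlines() builds a list of every line first, so the raw lines
    # and the parsed entries are both in memory at the same time.
    with open(path) as db_file:
        lines = db_file.readlines()
    return [parse_entry(line) for line in lines]


def load_lazily(path):
    # Iterating over the file object yields one line at a time, so only
    # the parsed entries accumulate.
    entries = []
    with open(path) as db_file:
        for line in db_file:
            entries.append(parse_entry(line))
    return entries


def write_sample_file():
    # Writes two made-up rows to a throwaway file so the sketch runs
    # as-is. The caller is responsible for deleting the file.
    handle, path = tempfile.mkstemp(suffix='.csv')
    with os.fdopen(handle, 'w') as sample:
        sample.write("16777216,16777471,XX\n16777472,16778239,YY\n")
    return path
```

Both functions return the same entries. The only difference is whether the
full list of raw lines exists alongside them while parsing.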
This is a bit similar to the difference between Python's range() and
xrange() functions. Calling range() gives you a list (which is iterable)
while xrange() gives you an iterator. Hence...
{{{
for i in range(1000000000):
  print i
}}}
... means making a list of a billion ints in memory then printing each
while...
{{{
for i in xrange(1000000000):
  print i
}}}
... has constant memory usage because xrange() provides the sequence on
demand.
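If you want to see the effect without printing a billion numbers, the
sketch below compares a materialized list against a generator. lazy_range
is a hypothetical stand-in for xrange() that I wrote so the example runs on
both Python 2 and Python 3. It is not how xrange() is actually implemented.

```python
import sys


def lazy_range(stop):
    # Hypothetical stand-in for xrange(): yields values on demand
    # instead of building the whole sequence up front.
    i = 0
    while i < stop:
        yield i
        i += 1


eager = list(lazy_range(100000))   # every value stored, like a list
lazy = lazy_range(100000)          # nothing computed yet

# getsizeof() counts only the container itself (the list's pointer
# array, not the ints it refers to), which is already enough to show
# the gap between the two.
eager_size = sys.getsizeof(eager)
lazy_size = sys.getsizeof(lazy)
```

The generator's size stays the same no matter how large the stop value is,
while the list grows with it.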
Personally I think that it's stupid that the python compiler isn't smart
enough to say "the developer's using range() or readlines() in a loop,
hence provide an iterator rather than a list", and maybe it does in newer
python versions. I wouldn't count on it though.
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/6471#comment:20>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online