[tor-commits] [tor/master] Add script to fix "A1" entries in geoip file.
nickm at torproject.org
nickm at torproject.org
Fri Dec 7 16:40:03 UTC 2012
commit 2bf195d0ce38dbc0ad25f10288f22ed352230296
Author: Karsten Loesing <karsten.loesing at gmx.net>
Date: Tue Nov 27 21:22:58 2012 -0500
Add script to fix "A1" entries in geoip file.
Fixes #6266.
---
changes/task-6266 | 7 ++
src/config/README.geoip | 90 +++++++++++++++++++++
src/config/deanonymind.py | 194 +++++++++++++++++++++++++++++++++++++++++++++
src/config/geoip-manual | 114 ++++++++++++++++++++++++++
4 files changed, 405 insertions(+), 0 deletions(-)
diff --git a/changes/task-6266 b/changes/task-6266
new file mode 100644
index 0000000..e7f0509
--- /dev/null
+++ b/changes/task-6266
@@ -0,0 +1,7 @@
+ o Minor features:
+ - Use a script to replace "A1" ("Anonymous Proxy") entries in our
+ geoip file with real country codes. This script fixes about 90% of
+ "A1" entries automatically and uses manual country code assignments
+ to fix the remaining 10%. See src/config/README.geoip for details.
+ Fixes #6266.
+
diff --git a/src/config/README.geoip b/src/config/README.geoip
new file mode 100644
index 0000000..8520501
--- /dev/null
+++ b/src/config/README.geoip
@@ -0,0 +1,90 @@
+README.geoip -- information on the IP-to-country-code file shipped with tor
+===========================================================================
+
+The IP-to-country-code file in src/config/geoip is based on MaxMind's
+GeoLite Country database with the following modifications:
+
+ - Those "A1" ("Anonymous Proxy") entries lying inbetween two entries with
+ the same country code are automatically changed to that country code.
+ These changes can be overriden by specifying a different country code
+ in src/config/geoip-manual.
+
+ - Other "A1" entries are replaced with country codes specified in
+ src/config/geoip-manual, or are left as is if there is no corresponding
+ entry in that file. Even non-"A1" entries can be modified by adding a
+ replacement entry to src/config/geoip-manual. Handle with care.
+
+
+1. Updating the geoip file from a MaxMind database file
+-------------------------------------------------------
+
+Download the most recent MaxMind GeoLite Country database:
+http://geolite.maxmind.com/download/geoip/database/GeoIPCountryCSV.zip
+
+Run `python deanonymind.py` in the local directory. Review the output to
+learn about applied automatic/manual changes and watch out for any
+warnings.
+
+Possibly edit geoip-manual to make more/fewer/different manual changes and
+re-run `python deanonymind.py`.
+
+When done, prepend the new geoip file with a comment like this:
+
+ # Last updated based on $DATE Maxmind GeoLite Country
+ # See README.geoip for details on the conversion.
+
+
+2. Verifying automatic and manual changes using diff
+----------------------------------------------------
+
+To unzip the original MaxMind file and look at the automatic changes, run:
+
+ unzip GeoIPCountryCSV.zip
+ diff -U1 GeoIPCountryWhois.csv AutomaticGeoIPCountryWhois.csv
+
+To look at subsequent manual changes, run:
+
+ diff -U1 AutomaticGeoIPCountryWhois.csv ManualGeoIPCountryWhois.csv
+
+To manually generate the geoip file and compare it to the automatically
+created one, run:
+
+ cut -d, -f3-5 < ManualGeoIPCountryWhois.csv | sed 's/"//g' > mygeoip
+ diff -U1 geoip mygeoip
+
+
+3. Verifying automatic and manual changes using blockfinder
+-----------------------------------------------------------
+
+Blockfinder is a powerful tool to handle multiple IP-to-country data
+sources. Blockfinder has a function to specify a country code and compare
+conflicting country code assignments in different data sources.
+
+We can use blockfinder to compare A1 entries in the original MaxMind file
+with the same or overlapping blocks in the file generated above and in the
+RIR delegation files:
+
+ git clone https://github.com/ioerror/blockfinder
+ cd blockfinder/
+ python blockfinder -i
+ python blockfinder -r ../GeoIPCountryWhois.csv
+ python blockfinder -r ../ManualGeoIPCountryWhois.csv
+ python blockfinder -p A1 > A1-comparison.txt
+
+The output marks conflicts between assignments using either '*' in case of
+two different opinions or '#' for three or more different opinions about
+the country code for a given block.
+
+The '*' conflicts are most likely harmless, because there will always be
+at least two opinions with the original MaxMind file saying A1 and the
+other two sources saying something more meaningful.
+
+However, watch out for '#' conflicts. In these cases, the original
+MaxMind file ("A1"), the updated MaxMind file (hopefully the correct
+country code), and the RIR delegation files (some other country code) all
+disagree.
+
+There are perfectly valid cases where the updated MaxMind file and the RIR
+delegation files don't agree. But each of those cases must be verified
+manually.
+
diff --git a/src/config/deanonymind.py b/src/config/deanonymind.py
new file mode 100755
index 0000000..c86dadc
--- /dev/null
+++ b/src/config/deanonymind.py
@@ -0,0 +1,194 @@
+#!/usr/bin/env python
+import optparse
+import os
+import sys
+import zipfile
+
+"""
+Take a MaxMind GeoLite Country database as input and replace A1 entries
+with the country code and name of the preceding entry iff the preceding
+(subsequent) entry ends (starts) directly before (after) the A1 entry and
+both preceding and subsequent entries contain the same country code.
+
+Then apply manual changes, either replacing A1 entries that could not be
+replaced automatically or overriding previously made automatic changes.
+"""
+
+def main():
+ options = parse_options()
+ assignments = read_file(options.in_maxmind)
+ assignments = apply_automatic_changes(assignments)
+ write_file(options.out_automatic, assignments)
+ manual_assignments = read_file(options.in_manual, must_exist=False)
+ assignments = apply_manual_changes(assignments, manual_assignments)
+ write_file(options.out_manual, assignments)
+ write_file(options.out_geoip, assignments, long_format=False)
+
+def parse_options():
+ parser = optparse.OptionParser()
+ parser.add_option('-i', action='store', dest='in_maxmind',
+ default='GeoIPCountryCSV.zip', metavar='FILE',
+ help='use the specified MaxMind GeoLite Country .zip or .csv '
+ 'file as input [default: %default]')
+ parser.add_option('-g', action='store', dest='in_manual',
+ default='geoip-manual', metavar='FILE',
+ help='use the specified .csv file for manual changes or to '
+ 'override automatic changes [default: %default]')
+ parser.add_option('-a', action='store', dest='out_automatic',
+ default="AutomaticGeoIPCountryWhois.csv", metavar='FILE',
+ help='write full input file plus automatic changes to the '
+ 'specified .csv file [default: %default]')
+ parser.add_option('-m', action='store', dest='out_manual',
+ default='ManualGeoIPCountryWhois.csv', metavar='FILE',
+ help='write full input file plus automatic and manual '
+ 'changes to the specified .csv file [default: %default]')
+ parser.add_option('-o', action='store', dest='out_geoip',
+ default='geoip', metavar='FILE',
+ help='write full input file plus automatic and manual '
+ 'changes to the specified .csv file that can be shipped '
+ 'with tor [default: %default]')
+ (options, args) = parser.parse_args()
+ return options
+
+def read_file(path, must_exist=True):
+ if not os.path.exists(path):
+ if must_exist:
+ print 'File %s does not exist. Exiting.' % (path, )
+ sys.exit(1)
+ else:
+ return
+ if path.endswith('.zip'):
+ zip_file = zipfile.ZipFile(path)
+ csv_content = zip_file.read('GeoIPCountryWhois.csv')
+ zip_file.close()
+ else:
+ csv_file = open(path)
+ csv_content = csv_file.read()
+ csv_file.close()
+ assignments = []
+ for line in csv_content.split('\n'):
+ stripped_line = line.strip()
+ if len(stripped_line) > 0 and not stripped_line.startswith('#'):
+ assignments.append(stripped_line)
+ return assignments
+
+def apply_automatic_changes(assignments):
+ print '\nApplying automatic changes...'
+ result_lines = []
+ prev_line = None
+ a1_lines = []
+ for line in assignments:
+ if '"A1"' in line:
+ a1_lines.append(line)
+ else:
+ if len(a1_lines) > 0:
+ new_a1_lines = process_a1_lines(prev_line, a1_lines, line)
+ for new_a1_line in new_a1_lines:
+ result_lines.append(new_a1_line)
+ a1_lines = []
+ result_lines.append(line)
+ prev_line = line
+ if len(a1_lines) > 0:
+ new_a1_lines = process_a1_lines(prev_line, a1_lines, None)
+ for new_a1_line in new_a1_lines:
+ result_lines.append(new_a1_line)
+ return result_lines
+
+def process_a1_lines(prev_line, a1_lines, next_line):
+ if not prev_line or not next_line:
+ return a1_lines # Can't merge first or last line in file.
+ if len(a1_lines) > 1:
+ return a1_lines # Can't merge more than 1 line at once.
+ a1_line = a1_lines[0].strip()
+ prev_entry = parse_line(prev_line)
+ a1_entry = parse_line(a1_line)
+ next_entry = parse_line(next_line)
+ touches_prev_entry = int(prev_entry['end_num']) + 1 == \
+ int(a1_entry['start_num'])
+ touches_next_entry = int(a1_entry['end_num']) + 1 == \
+ int(next_entry['start_num'])
+ same_country_code = prev_entry['country_code'] == \
+ next_entry['country_code']
+ if touches_prev_entry and touches_next_entry and same_country_code:
+ new_line = format_line_with_other_country(a1_entry, prev_entry)
+ print '-%s\n+%s' % (a1_line, new_line, )
+ return [new_line]
+ else:
+ return a1_lines
+
+def parse_line(line):
+ if not line:
+ return None
+ keys = ['start_str', 'end_str', 'start_num', 'end_num',
+ 'country_code', 'country_name']
+ stripped_line = line.replace('"', '').strip()
+ parts = stripped_line.split(',')
+ entry = dict((k, v) for k, v in zip(keys, parts))
+ return entry
+
+def format_line_with_other_country(original_entry, other_entry):
+ return '"%s","%s","%s","%s","%s","%s"' % (original_entry['start_str'],
+ original_entry['end_str'], original_entry['start_num'],
+ original_entry['end_num'], other_entry['country_code'],
+ other_entry['country_name'], )
+
+def apply_manual_changes(assignments, manual_assignments):
+ if not manual_assignments:
+ return assignments
+ print '\nApplying manual changes...'
+ manual_dict = {}
+ for line in manual_assignments:
+ start_num = parse_line(line)['start_num']
+ if start_num in manual_dict:
+ print ('Warning: duplicate start number in manual '
+ 'assignments:\n %s\n %s\nDiscarding first entry.' %
+ (manual_dict[start_num], line, ))
+ manual_dict[start_num] = line
+ result = []
+ for line in assignments:
+ entry = parse_line(line)
+ start_num = entry['start_num']
+ if start_num in manual_dict:
+ manual_line = manual_dict[start_num]
+ manual_entry = parse_line(manual_line)
+ if entry['start_str'] == manual_entry['start_str'] and \
+ entry['end_str'] == manual_entry['end_str'] and \
+ entry['end_num'] == manual_entry['end_num']:
+ if len(manual_entry['country_code']) != 2:
+ print '-%s' % (line, ) # only remove, don't replace
+ else:
+ new_line = format_line_with_other_country(entry,
+ manual_entry)
+ print '-%s\n+%s' % (line, new_line, )
+ result.append(new_line)
+ del manual_dict[start_num]
+ else:
+ print ('Warning: only partial match between '
+ 'original/automatically replaced assignment and '
+ 'manual assignment:\n %s\n %s\nNot applying '
+ 'manual change.' % (line, manual_line, ))
+ result.append(line)
+ else:
+ result.append(line)
+ if len(manual_dict) > 0:
+ print ('Warning: could not apply all manual assignments: %s' %
+ ('\n '.join(manual_dict.values())), )
+ return result
+
+def write_file(path, assignments, long_format=True):
+ if long_format:
+ output_lines = assignments
+ else:
+ output_lines = []
+ for long_line in assignments:
+ entry = parse_line(long_line)
+ short_line = "%s,%s,%s" % (entry['start_num'],
+ entry['end_num'], entry['country_code'], )
+ output_lines.append(short_line)
+ out_file = open(path, 'w')
+ out_file.write('\n'.join(output_lines))
+ out_file.close()
+
+if __name__ == '__main__':
+ main()
+
diff --git a/src/config/geoip-manual b/src/config/geoip-manual
new file mode 100644
index 0000000..3811c75
--- /dev/null
+++ b/src/config/geoip-manual
@@ -0,0 +1,114 @@
+# This file contains manual overrides of A1 entries (and possibly others)
+# in MaxMind's GeoLite Country database. Use deanonymind.py in the same
+# directory to process this file when producing a new geoip file. See
+# README.geoip in the same directory for details.
+
+# Remove MaxMind entry 0.116.0.0-0.119.255.255 which MaxMind says is AT,
+# but which is part of reserved range 0.0.0.0/8. -KL 2012-06-13
+"0.116.0.0","0.119.255.255","7602176","7864319","",""
+
+# NL, because previous MaxMind entry 31.171.128.0-31.171.133.255 is NL,
+# and RIR delegation files say 31.171.128.0-31.171.135.255 is NL.
+# -KL 2012-11-27
+"31.171.134.0","31.171.135.255","531334656","531335167","NL","Netherlands"
+
+# EU, because next MaxMind entry 37.139.64.1-37.139.64.9 is EU, because
+# RIR delegation files say 37.139.64.0-37.139.71.255 is EU, and because it
+# just makes more sense for the next entry to start at .0 and not .1.
+# -KL 2012-11-27
+"37.139.64.0","37.139.64.0","629882880","629882880","EU","Europe"
+
+# CH, because previous MaxMind entry 46.19.141.0-46.19.142.255 is CH, and
+# RIR delegation files say 46.19.136.0-46.19.143.255 is CH.
+# -KL 2012-11-27
+"46.19.143.0","46.19.143.255","773033728","773033983","CH","Switzerland"
+
+# GB, because next MaxMind entry 46.166.129.0-46.166.134.255 is GB, and
+# RIR delegation files say 46.166.128.0-46.166.191.255 is GB.
+# -KL 2012-11-27
+"46.166.128.0","46.166.128.255","782663680","782663935","GB","United Kingdom"
+
+# US, though could as well be CA. Previous MaxMind entry
+# 64.237.32.52-64.237.34.127 is US, next MaxMind entry
+# 64.237.34.144-64.237.34.151 is CA, and RIR delegation files say the
+# entire block 64.237.32.0-64.237.63.255 is US. -KL 2012-11-27
+"64.237.34.128","64.237.34.143","1089282688","1089282703","US","United States"
+
+# US, though could as well be UY. Previous MaxMind entry
+# 67.15.170.0-67.15.182.255 is US, next MaxMind entry
+# 67.15.183.128-67.15.183.159 is UY, and RIR delegation files say the
+# entire block 67.15.0.0-67.15.255.255 is US. -KL 2012-11-27
+"67.15.183.0","67.15.183.127","1125103360","1125103487","US","United States"
+
+# US, because next MaxMind entry 67.43.145.0-67.43.155.255 is US, and RIR
+# delegation files say 67.43.144.0-67.43.159.255 is US.
+# -KL 2012-11-27
+"67.43.144.0","67.43.144.255","1126928384","1126928639","US","United States"
+
+# US, because previous MaxMind entry 70.159.21.51-70.232.244.255 is US,
+# because next MaxMind entry 70.232.245.58-70.232.245.59 is A2 ("Satellite
+# Provider") which is a country information about as useless as A1, and
+# because RIR delegation files say 70.224.0.0-70.239.255.255 is US.
+# -KL 2012-11-27
+"70.232.245.0","70.232.245.57","1189672192","1189672249","US","United States"
+
+# US, because next MaxMind entry 70.232.246.0-70.240.141.255 is US,
+# because previous MaxMind entry 70.232.245.58-70.232.245.59 is A2
+# ("Satellite Provider") which is a country information about as useless
+# as A1, and because RIR delegation files say 70.224.0.0-70.239.255.255 is
+# US. -KL 2012-11-27
+"70.232.245.60","70.232.245.255","1189672252","1189672447","US","United States"
+
+# GB, despite neither previous (GE) nor next (LV) MaxMind entry being GB,
+# but because RIR delegation files agree with both previous and next
+# MaxMind entry and say GB for 91.228.0.0-91.228.3.255. -KL 2012-11-27
+"91.228.0.0","91.228.3.255","1541668864","1541669887","GB","United Kingdom"
+
+# GB, because next MaxMind entry 91.232.125.0-91.232.125.255 is GB, and
+# RIR delegation files say 91.232.124.0-91.232.125.255 is GB.
+# -KL 2012-11-27
+"91.232.124.0","91.232.124.255","1541962752","1541963007","GB","United Kingdom"
+
+# GB, despite neither previous (RU) nor next (PL) MaxMind entry being GB,
+# but because RIR delegation files agree with both previous and next
+# MaxMind entry and say GB for 91.238.214.0-91.238.215.255.
+# -KL 2012-11-27
+"91.238.214.0","91.238.215.255","1542379008","1542379519","GB","United Kingdom"
+
+# US, because next MaxMind entry 173.0.16.0-173.0.65.255 is US, and RIR
+# delegation files say 173.0.0.0-173.0.15.255 is US. -KL 2012-11-27
+"173.0.0.0","173.0.15.255","2902458368","2902462463","US","United States"
+
+# US, because next MaxMind entry 176.67.84.0-176.67.84.79 is US, and RIR
+# delegation files say 176.67.80.0-176.67.87.255 is US. -KL 2012-11-27
+"176.67.80.0","176.67.83.255","2957201408","2957202431","US","United States"
+
+# US, because previous MaxMind entry 176.67.84.192-176.67.85.255 is US,
+# and RIR delegation files say 176.67.80.0-176.67.87.255 is US.
+# -KL 2012-11-27
+"176.67.86.0","176.67.87.255","2957202944","2957203455","US","United States"
+
+# EU, despite neither previous (RU) nor next (UA) MaxMind entry being EU,
+# but because RIR delegation files agree with both previous and next
+# MaxMind entry and say EU for 193.200.150.0-193.200.150.255.
+# -KL 2012-11-27
+"193.200.150.0","193.200.150.255","3251148288","3251148543","EU","Europe"
+
+# US, because previous MaxMind entry 199.96.68.0-199.96.87.127 is US, and
+# RIR delegation files say 199.96.80.0-199.96.87.255 is US.
+# -KL 2012-11-27
+"199.96.87.128","199.96.87.255","3344979840","3344979967","US","United States"
+
+# US, because previous MaxMind entry 209.58.176.144-209.59.31.255 is US,
+# and RIR delegation files say 209.59.32.0-209.59.63.255 is US.
+# -KL 2012-11-27
+"209.59.32.0","209.59.63.255","3510312960","3510321151","US","United States"
+
+# FR, because previous MaxMind entry 217.15.166.0-217.15.166.255 is FR,
+# and RIR delegation files contain a block 217.15.160.0-217.15.175.255
+# which, however, is EU, not FR. But merging with next MaxMind entry
+# 217.15.176.0-217.15.191.255 which is KZ and which fully matches what
+# the RIR delegation files say seems unlikely to be correct.
+# -KL 2012-11-27
+"217.15.167.0","217.15.175.255","3641681664","3641683967","FR","France"
+
More information about the tor-commits
mailing list