[ooni-dev] ooni-pipeline using local time zone when parsing timestamps
David Fifield
david at bamsoftware.com
Thu Mar 17 06:04:34 UTC 2016
Parts of the ooni-pipeline are using datetime.fromtimestamp instead of
datetime.utcfromtimestamp. It means the timestamp will be parsed
differently according to the time zone in which the code is run:
https://github.com/TheTorProject/ooni-pipeline/blob/355ac1780f1f05eefb9ea3bf5b5c0148904e888c/pipeline/batch/daily_workflow.py#L521
entry['test_start_time'] = datetime.fromtimestamp(entry.pop('start_time',
0)).strftime("%Y-%m-%d %H:%M:%S")
https://github.com/TheTorProject/ooni-pipeline/blob/355ac1780f1f05eefb9ea3bf5b5c0148904e888c/pipeline/batch/daily_workflow.py#L583
start_time = datetime.fromtimestamp(header.get('start_time', 0))
Now, it appears that there's no actual harm done by this bug, because it
looks like the timestamps were read and written by someone in UTC+1, so
the conversion evens out in the final JSON. But it could go wrong if run
by someone in a different time zone.
I noticed this because when I process the new JSON files, I get
timestamps that are an hour different than I got parsing the YAML files.
To parse the JSON times, I use datetime.datetime.strptime(..., "%Y-%m-%d %H:%M:%S"),
which gives the correct answer because of the reason in the previous
paragraph. My old YAML times were off by an hour, because I used
datetime.datetime.utcfromtimestamp(...), parsing a UTC+1 timestamp as if
it were UTC.
Here's some related information I found while researching the above. It
looks like individual test results are *also* using local timestamps, or
have their clocks set wrong, or something. Because when I compare the
"test_start_time" field to the "Date" header (which is always supposed
to be UTC) in a response to an http_requests test, I get some widely
divergent values (I saw up to around 9 hours).
For example consider
https://measurements.ooni.torproject.org/2015-01-01/20150101T060029Z-BE-AS29182-http_requests-no_report_id-0.1.0-probe.json
It has a difference of about 6 hours:
{
"test_start_time":"2015-01-01 07:00:29",
"test_keys": {
"requests": [
{
"response": {
"headers": {
"Date": "Thu, 01 Jan 2015 01:00:35 GMT"
...
}
}
}
]
},
probe_cc: "BE"
}
probe_cc is "BE", Belgium, which is supposed to be UTC+1, so I don't
know where the 6-hour discrepancy is coming from. Note also that the
timestamp in the file name is not the same: 060029.
Here's a summary of the http_requests reports from 2015-01-01. Most of
them match, but some are off by 6 or 1 hour:
CC: CH test_start_time: 2015-01-01 03:00:28 Date: Thu, 01 Jan 2015 02:59:14 GMT +0
CC: BE test_start_time: 2015-01-01 07:00:29 Date: Thu, 01 Jan 2015 01:00:35 GMT +6
CC: FR test_start_time: 2015-01-01 09:41:08 Date: Thu, 01 Jan 2015 09:41:14 GMT +0
CC: FR test_start_time: 2015-01-01 10:47:20 Date: Thu, 01 Jan 2015 10:47:23 GMT +0
CC: NL test_start_time: 2015-01-01 12:42:16 Date: Thu, 01 Jan 2015 12:42:21 GMT +0
CC: NL test_start_time: 2015-01-01 15:41:32 Date: Thu, 01 Jan 2015 15:41:44 GMT +0
CC: CZ test_start_time: 2015-01-01 15:46:27 Date: Thu, 01 Jan 2015 15:46:34 GMT +0
CC: NL test_start_time: 2015-01-01 17:19:39 Date: Thu, 01 Jan 2015 17:19:43 GMT +0
CC: HK test_start_time: 2015-01-01 18:03:57 Date: Thu, 01 Jan 2015 17:04:10 GMT +1
CC: NL test_start_time: 2015-01-01 21:47:32 Date: Thu, 01 Jan 2015 21:47:37 GMT +0
I used this script:
import json
import sys
for line in sys.stdin:
doc = json.loads(line)
try:
probe_cc = doc["probe_cc"]
test_start_time = doc["test_start_time"]
for request in doc["test_keys"]["requests"]:
response_time = request["response"]["headers"]["Date"]
print "CC: %s test_start_time: %s Date: %s" % (probe_cc, test_start_time, response_time)
except KeyError:
pass
More information about the ooni-dev
mailing list