[tor-bugs] #30369 [Metrics/Library]: Fix regular expression in descriptor parser to correctly recognize bandwidth files
Tor Bug Tracker & Wiki
blackhole at torproject.org
Thu May 2 18:53:27 UTC 2019
#30369: Fix regular expression in descriptor parser to correctly recognize
bandwidth files
---------------------------------+----------------------
Reporter: karsten | Owner: karsten
Type: defect | Status: assigned
Priority: Medium | Milestone:
Component: Metrics/Library | Version:
Severity: Normal | Keywords:
Actual Points: | Parent ID:
Points: | Reviewer:
Sponsor: |
---------------------------------+----------------------
We're using a regular expression on the first 100 characters of a
descriptor to recognize bandwidth files. More specifically, if a
descriptor starts with ten digits followed by a newline, we parse it as a
bandwidth file. (This is ugly, but the legacy bandwidth file format
doesn't give us much of a choice.)
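This regular expression is broken. `String.matches()` only returns true
if the pattern matches the entire string, so we need a pattern that
matches all of the first 100 characters of a descriptor. Ours doesn't: it
only matches a string that consists of exactly ten digits and a newline.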
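The failure can be reproduced in isolation. The following is a minimal
sketch, not code from the library: the class name, the helper method and
the sample input are made up for illustration. It only shows that the old
pattern rejects any input that continues after the first newline.

```java
public class OldPatternDemo {

  /* The pattern currently used in DescriptorParserImpl. */
  static final String OLD_PATTERN = "^[0-9]{10}\\n";

  /* Hypothetical helper: applies the old check to the given text. */
  static boolean oldCheck(String firstLines) {
    return firstLines.matches(OLD_PATTERN);
  }

  public static void main(String[] args) {
    /* Only a timestamp line: the whole string matches, prints true. */
    System.out.println(oldCheck("1556823207\n"));
    /* Timestamp line followed by more content (made-up sample line):
     * the pattern does not cover the rest, so this prints false. */
    System.out.println(oldCheck("1556823207\nsample second line\n"));
  }
}
```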
Suggested fix:
{{{
diff --git a/src/main/java/org/torproject/descriptor/impl/DescriptorParserImpl.java b/src/main/java/org/torproject/descriptor/impl/DescriptorParserImpl.java
index 119fe09..08ac909 100644
--- a/src/main/java/org/torproject/descriptor/impl/DescriptorParserImpl.java
+++ b/src/main/java/org/torproject/descriptor/impl/DescriptorParserImpl.java
@@ -132,7 +132,7 @@ public class DescriptorParserImpl implements DescriptorParser {
           sourceFile);
     } else if (fileName.contains(LogDescriptorImpl.MARKER)) {
       return LogDescriptorImpl.parse(rawDescriptorBytes, sourceFile, fileName);
-    } else if (firstLines.matches("^[0-9]{10}\\n")) {
+    } else if (firstLines.matches("(?s)[0-9]{10}\\n.*")) {
       /* Identifying bandwidth files by a 10-digit timestamp in the first line
        * breaks with files generated before 2002 or after 2286 and when the next
        * descriptor identifier starts with just a timestamp in the first line
}}}
Explanation:
 - We don't need to start the pattern with `^`, because `String.matches()`
 has to match the whole string anyway.
 - The `(?s)` part enables dotall mode. From the Javadoc of
 `Pattern.DOTALL`: ''"In dotall mode, the expression . matches any
 character, including a line terminator. By default this expression does
 not match line terminators. Dotall mode can also be enabled via the
 embedded flag expression (?s). (The s is a mnemonic for "single-line"
 mode, which is what this is called in Perl.)"''
- We need to end the pattern with `.*` to match any characters following
the first newline, which also includes newlines due to the previously
enabled dotall mode.
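The points above can be checked with a small standalone sketch. This is
not the library code: the class name, the helper methods and the sample
inputs are assumptions made for illustration. It compares the suggested
pattern with a variant that lacks `(?s)`, to show why both the flag and
the trailing `.*` are needed.

```java
public class NewPatternDemo {

  /* Pattern from the suggested fix. */
  static final String NEW_PATTERN = "(?s)[0-9]{10}\\n.*";

  /* Same pattern without the dotall flag, for comparison only. */
  static final String NO_DOTALL_PATTERN = "[0-9]{10}\\n.*";

  /* Hypothetical helper mirroring the fixed check. */
  static boolean looksLikeBandwidthFile(String firstLines) {
    return firstLines.matches(NEW_PATTERN);
  }

  /* Hypothetical helper applying the variant without (?s). */
  static boolean matchesWithoutDotall(String firstLines) {
    return firstLines.matches(NO_DOTALL_PATTERN);
  }

  public static void main(String[] args) {
    /* Made-up multi-line input starting with a 10-digit timestamp. */
    String multiLine = "1556823207\nline two\nline three\n";

    /* With (?s), ".*" also consumes the later newlines: prints true. */
    System.out.println(looksLikeBandwidthFile(multiLine));

    /* Without (?s), ".*" stops at the next newline, so the whole
     * string is not matched: prints false. */
    System.out.println(matchesWithoutDotall(multiLine));

    /* Input that does not start with ten digits and a newline:
     * prints false. */
    System.out.println(looksLikeBandwidthFile("@type example 1.0\n"));
  }
}
```

Dropping `^` changes nothing in these cases, since `matches()` already
anchors the pattern at both ends of the input.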
I'll create a branch for this in a minute.
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/30369>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online