[tor-commits] [webstats/master] Document the new sanitizing code, but don't implement it yet.
runa at torproject.org
Fri Dec 30 16:26:09 UTC 2011
commit 52731f5544954594a7b6e6805dd1136539e17856
Author: Karsten Loesing <karsten.loesing at gmx.net>
Date: Fri Dec 30 12:08:58 2011 +0100
Document the new sanitizing code, but don't implement it yet.
---
src/org/torproject/webstats/Main.java | 109 +++++++++++++++++++++++++++++++--
1 files changed, 104 insertions(+), 5 deletions(-)
diff --git a/src/org/torproject/webstats/Main.java b/src/org/torproject/webstats/Main.java
index bcf5cf1..628e4a3 100644
--- a/src/org/torproject/webstats/Main.java
+++ b/src/org/torproject/webstats/Main.java
@@ -7,11 +7,110 @@ import java.util.regex.*;
import org.apache.commons.compress.compressors.gzip.*;
-/* Sanitize gz-compressed Apache web logs by removing all potentially
- * sensitive parts and makes sure we never sanitize a web log file
- * twice. Only consider lines containing HTTP GET requests. Append
- * sanitized lines to out/YYYY/MM/DD/<filename>access.log.
- * TODO Document how exactly the sanitizing process works. */
+/*
+ * Sanitize Apache web logs by removing all potentially sensitive parts.
+ *
+ * TODO Document what exactly is sanitized and how that is done.
+ *
+ * TODO Implement the following description.
+ *
+ * The main operation is to parse Apache web log files from the in/
+ * directory and write sanitized web log files to the out/ directory.
+ * Files in the in/ directory are assumed to never change and will be
+ * deleted after processing by this program. Files in the out/ directory
+ * are guaranteed to never change and may be deleted by a subsequently
+ * running program.
+ *
+ * This program uses a couple of state files to make sure that files in
+ * in/ are not parsed more than once and that files in out/ do not need to
+ * be changed:
+ * - state/lock prevents concurrent executions of this program.
+ * - state/in-history contains file names of previously read and deleted
+ * files in the in/ directory.
+ * - state/in-history.new is the file written in the current execution
+ * that will replace state/in-history during the execution.
+ * - state/execution/ contains new or updated output files parsed in the
+ * current execution.
+ * - state/out-history contains file names of previously written and
+ * possibly deleted files in the out/ directory.
+ * - state/out-history.new is the file written in the current execution
+ * that will replace state/out-history at the end of the execution.
+ * - state/full/ contains complete output files that may or may not be
+ * newer than files in the out/ directory.
+ * - state/diff/ contains new parts for files in the out/ directory which
+ * have been deleted.
+ *
+ * The steps taken by this program are as follows:
+ * 1. Check that state/lock does not exist, or exit immediately. Add a
+ * new state/lock file.
+ * 2. Read the contents from state/in-history and state/out-history and
+ * the directory listings of out/, state/diff/, and state/update/ to
+ * memory.
+ * 3. For each file in in/:
+ * a. Append the file name to state/in-history.new.
+ * b. Check that the file name is not contained in state/in-history.
+ * If it is, print out a warning and skip the file.
+ * c. Parse the file in chunks of 250,000 lines to reduce writes.
+ * d. When writing sanitized chunks to output files, for each output
+ * file, check in the following order if there is already such a
+ * file in
+ * i. state/execution/,
+ * ii. state/full/,
+ * iii. out/, or
+ * iv. state/diff/.
+ * If there's such a file, merge the newly sanitized lines with
+ * that file and write the sorted result to state/execution/.
+ * 4. Rename state/in-history to state/in-history.old and rename
+ * state/in-history.new to state/in-history. Delete
+ * state/in-history.old.
+ * 5. Delete files in in/ that have been parsed in this execution.
+ * 6. For each file in state/execution/:
+ * a. Check if there's a corresponding line in state/out-history. If
+ * so, check whether there is a file in state/full/ or out/. If
+ * so, move the file to state/full/. Otherwise move the file to
+ * state/diff/, overwriting the file there if one exists.
+ * b. If a. does not apply and the sanitized log is less than four (4)
+ * days old, move the file to state/full/.
+ * c. If b. does not apply, append a line to out-history.new and move
+ * the file to out/.
+ * 7. Rename state/out-history to state/out-history.old and rename
+ * state/out-history.new to state/out-history. Delete
+ * state/out-history.old.
+ * 8. Delete state/lock and exit.
+ *
+ * If the program is interrupted and leaves a lock file in state/lock, it
+ * requires an operator to fix the state/ directory and make it work
+ * again. IMPORTANT: DO NOT CHANGE ANYTHING IN THE state/ DIRECTORY
+ * UNLESS YOU'RE CERTAIN WHAT YOU'RE DOING! The following situations can
+ * happen. It may make sense to try a solution in a non-production
+ * setting first:
+ * A. The file state/in-history.new does not exist and there are no files
+ * in state/execution/. The process died before step 3. Delete
+ * state/lock and re-run the program.
+ * B. The file state/in-history.new exists and there are files in
+ * state/execution/. The process died during steps 3 or 4. Delete
+ * all files in state/execution/. If state/in-history does not exist,
+ * but state/in-history.old does exist, rename the latter to the
+ * former. Delete state/lock and re-run the program.
+ * C. The file state/in-history.new does not exist, but there are files
+ * in state/execution/. The process died after step 4. Run steps 5
+ * to 8 manually. Then re-run the program.
+ *
+ * Whenever logs are parsed that are 4 days old or older, there may
+ * already be output files in out/ that cannot be modified anymore. The
+ * operator may decide to manually overwrite files in out/ with the files
+ * in state/full/ or state/diff/. IMPORTANT: ONLY OVERWRITE FILES IN out/
+ * IF YOU'RE CERTAIN HOW TO FIX THE PROGRAM THAT PARSES ITS FILES. There
+ * are two possible situations:
+ * A. There is a file in state/full/. This file is newer than the file
+ * with the same name in out/ and contains everything from that file,
+ * too. It's okay to overwrite the file in out/ with the file in
+ * state/full/ and delete the file in state/full/.
+ * B. There is a file in state/diff/. The file in out/ didn't exist
+ * anymore when parsing more log lines for it. The file that was in
+ * out/ should be located and merged with the file in state/diff/.
+ * Afterwards, the file in state/diff/ should be deleted.
+ */
public class Main {
private static File historyFile = new File("hist");
private static File inputDirectory = new File("in");
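
Step 6 of the description above decides where each file in state/execution/ ends up. The commit only documents this step, so the following is a minimal sketch of that decision rather than the project's code. The class Step6Sketch, the enum Destination, and the method destinationFor are hypothetical names. It assumes that "a file in state/full/ or out/" means a file with the same name, and that a log's age is the number of whole days between the log's date and the current date.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

/* Hypothetical sketch of step 6; not part of the webstats code base. */
final class Step6Sketch {

  /* Directory that a file from state/execution/ is moved to. */
  enum Destination { FULL, DIFF, OUT }

  /* Logs younger than this many days are held back in state/full/. */
  static final long MIN_AGE_DAYS = 4;

  /*
   * inOutHistory: the file name is listed in state/out-history.
   * existsInFullOrOut: a file with the same name exists in state/full/
   *   or out/ (assumed reading of step 6a).
   * logDate, today: used to compute the age of the sanitized log
   *   (assumption: age in whole days).
   */
  static Destination destinationFor(boolean inOutHistory,
      boolean existsInFullOrOut, LocalDate logDate, LocalDate today) {
    if (inOutHistory) {
      /* Step 6a: the file was written to out/ before. */
      return existsInFullOrOut ? Destination.FULL : Destination.DIFF;
    }
    long ageDays = ChronoUnit.DAYS.between(logDate, today);
    if (ageDays < MIN_AGE_DAYS) {
      /* Step 6b: too recent, more lines may still arrive. */
      return Destination.FULL;
    }
    /* Step 6c: the caller would also append a line to out-history.new. */
    return Destination.OUT;
  }

  public static void main(String[] args) {
    LocalDate today = LocalDate.of(2011, 12, 30);
    System.out.println(destinationFor(false, false,
        today.minusDays(1), today));
    System.out.println(destinationFor(false, false,
        today.minusDays(7), today));
    System.out.println(destinationFor(true, false,
        today.minusDays(7), today));
  }
}
```

Keeping the decision in a function without file system access makes the three cases easy to check before any files are moved. The actual moves and the out-history.new update would be done by the caller.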