[tor-commits] [collector/master] Task-20234: the first description of CollecTor's client facing file structure.
karsten at torproject.org
Fri Sep 30 13:55:13 UTC 2016
commit 96a4006fd107426f2c6bc9d879d8ce18b2c2c109
Author: iwakeh <iwakeh at torproject.org>
Date: Thu Sep 29 20:15:45 2016 +0200
Task-20234: the first description of CollecTor's client facing file structure.
Version 0.9
---
src/main/resources/docs/PROTOCOL | 384 +++++++++++++++++++++++++++++++++++++++
1 file changed, 384 insertions(+)
diff --git a/src/main/resources/docs/PROTOCOL b/src/main/resources/docs/PROTOCOL
new file mode 100644
index 0000000..0949f3e
--- /dev/null
+++ b/src/main/resources/docs/PROTOCOL
@@ -0,0 +1,384 @@
+
+ Protocol of CollecTor's File Structure
+ DRAFT
+
+1.0 Introduction
+
+ This protocol describes the file structure of the data offered by
+ CollecTor mirrors. The description of the data contained in the
+ structure is not part of this protocol and can be found on the
+ main page of a CollecTor instance, e.g. https://collector.torproject.org
+
+ CollecTor's data resides in three top elements:
+
+ * archive folder
+ * index files
+ * recent folder
+
+ These can be located anywhere in the providing system's storage, persistent
+ or in-memory, as long as they can be accessed from the root folder of
+ the web interface, i.e.
+
+ https://collector.someinstance.org/archive
+ https://collector.someinstance.org/index.json*
+ https://collector.someinstance.org/recent
+
+ The fourth file system structure that is part of the protocol is the
+ output directory (in the following referred to as 'out'), which is not
+ directly visible to CollecTor's clients. This directory is the place
+ for internal storage and its file system structure is later contained
+ in the provided tar-balls with archived data.
+
+ The following sections describe the substructure of the storage locations.
+
+2.0 The 'archive' Folder
+
+ The 'archive' folder contains the Tor network history as compressed
+ tarballs.
+
+2.1 Immediate Subfolders of 'archive'
+
+ The top structure of 'archive' is comprised of
+
+ * bridge-descriptors
+ * bridge-pool-assignments
+ * exit-lists
+ * relay-descriptors
+ * torperf
+
+ The substructure of these folders differs depending on their content.
+
+2.2 'bridge-pool-assignments', 'exit-lists', and 'torperf' below 'archive'
+
+ 'bridge-pool-assignments', 'exit-lists', and 'torperf' directly contain the
+ compressed tar-balls named according to the following rules:
+
+ marker DASH year DASH month DOT TAR DOT compression-type
+
+ DASH is the "-" symbol.
+ DOT is the "." symbol.
+ TAR is the string "tar".
+
+ 'marker' is either 'bridge-pool-assignments', 'torperf', or 'exit-lists'.
+ 'year' and 'month' are derived from the published dates (bridge pool
+ assignments), measurement dates (torperf), or download dates (exit lists) of
+ the contained data.
+ The 'compression-type' is one element of "xz", "gz", "bz2", or "zip". The
+ current default compression type is set to "xz".
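+
+ The naming rule above can be sketched in Python as follows. This is a
+ non-normative illustration only: the names COMPRESSION_TYPES and
+ archive_tarball_name are hypothetical and not part of CollecTor, and the
+ zero-padding of 'year' and 'month' is an assumption, since the rule above
+ does not state the number of digits.

```python
# Hypothetical helper, not part of CollecTor. It builds a name following
# "marker DASH year DASH month DOT TAR DOT compression-type".
COMPRESSION_TYPES = ("xz", "gz", "bz2", "zip")


def archive_tarball_name(marker, year, month, compression="xz"):
    """Return the tarball name for one marker and one month."""
    if compression not in COMPRESSION_TYPES:
        raise ValueError("unknown compression type: %s" % compression)
    if not 1 <= month <= 12:
        raise ValueError("month out of range: %d" % month)
    # Assumption: four-digit year and two-digit, zero-padded month.
    return "%s-%04d-%02d.tar.%s" % (marker, year, month, compression)
```

+ For example, archive_tarball_name("torperf", 2016, 9) would yield
+ "torperf-2016-09.tar.xz" under these assumptions.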
+
+2.3 'bridge-descriptors' below 'archive'
+
+ 'bridge-descriptors' contains the following subdirectories:
+
+ * extra-infos
+ * server-descriptors
+ * statuses
+
+ These directories contain compressed tar-balls of the respective descriptors.
+ The compressed tar-balls are named in the following way:
+
+ BRIDGE DASH marker DASH year DASH month DOT TAR DOT compression-type
+
+ BRIDGE is the string "bridge".
+
+ 'marker' is one of 'extra-infos', 'server-descriptors', or 'statuses'.
+ 'year' and 'month' are derived from the published dates of the contained
+ data.
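+
+ A client might recognize such names with a regular expression. The
+ following Python sketch is an illustration under assumptions: the names
+ BRIDGE_TARBALL and parse_bridge_tarball_name are hypothetical, 'year' and
+ 'month' are assumed to have four and two digits, and the compression
+ types are assumed to be the same as listed in section 2.2.

```python
import re

# Hypothetical pattern for
# "BRIDGE DASH marker DASH year DASH month DOT TAR DOT compression-type".
# Assumption: four-digit year, two-digit month.
BRIDGE_TARBALL = re.compile(
    r"^bridge-(extra-infos|server-descriptors|statuses)"
    r"-(\d{4})-(\d{2})\.tar\.(xz|gz|bz2|zip)$")


def parse_bridge_tarball_name(name):
    """Return (marker, year, month, compression), or None if no match."""
    match = BRIDGE_TARBALL.match(name)
    if match is None:
        return None
    marker, year, month, compression = match.groups()
    return marker, int(year), int(month), compression
```

+ Names without the leading "bridge" string do not match and yield None.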
+
+2.4 'relay-descriptors' below 'archive'
+
+ 'relay-descriptors' contains the following substructure:
+
+ * consensuses
+ * extra-infos
+ * microdescs
+ * server-descriptors
+ * statuses
+ * tor
+ * votes
+ * certs.tar.xz
+
+ 'consensuses', 'extra-infos', 'microdescs', 'server-descriptors',
+ 'statuses', 'tor', and 'votes' contain compressed tar-balls of the
+ respective descriptors.
+ The compressed tar-balls are named in the following way:
+
+ marker DASH year DASH month DOT TAR DOT compression-type
+
+ 'marker' is one of 'consensuses', 'extra-infos', 'microdescs',
+ 'server-descriptors', 'statuses', 'tor', or 'votes'.
+ 'year' and 'month' are derived from different dates found in the data:
+
+ * for consensuses, microdescs, and votes from the valid-after dates,
+ * for extra-infos, server-descriptors, statuses, and tor from the
+ published dates.
+
+3.0 Index Files
+
+ The index.json file and its compressed versions of various types are
+ contained under the web-root of the main CollecTor instance.
+ Future versions of this protocol might move these files to a separate
+ 'index' folder.
+
+4.0 The 'recent' folder
+
+ 'recent' provides data from the last 72 hours in the following
+ subfolders:
+
+ * bridge-descriptors
+ * exit-lists
+ * relay-descriptors
+ * torperf
+
+4.1 'exit-lists' and 'torperf' below 'recent'
+
+4.1.1
+ 'exit-lists' and 'torperf' directly contain the data files.
+ The exit-list files are named
+
+ year DASH month DASH day DASH hour DASH minute DASH second
+
+ Where these are derived from the download time (exit lists) or measurement
+ time (torperf).
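+
+ As a non-normative sketch, the exit-list file name could be produced
+ from a timestamp as shown below. The function name recent_exit_list_name
+ is hypothetical, and zero-padded fields are an assumption; the time zone
+ is whatever the caller passes in, as it is not specified above.

```python
from datetime import datetime


def recent_exit_list_name(download_time):
    """Format "year DASH month DASH day DASH hour DASH minute DASH second"."""
    # Assumption: all fields are zero-padded to a fixed width.
    return download_time.strftime("%Y-%m-%d-%H-%M-%S")
```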
+
+4.1.2
+ 'torperf' contains files named
+
+ source DASH kilobytes DASH year DASH month DASH day DOT extension
+
+ Where 'source' is defined by the CollecTor instance and can
+ be mapped to a URL. 'year', 'month', and 'day' are taken from
+ the download date. 'extension' is 'tpf'.
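+
+ The following Python sketch shows one way a client might split such a
+ file name into its parts. It is an illustration only: the names
+ TORPERF_NAME and parse_torperf_name are hypothetical, the date fields are
+ assumed to be zero-padded, and 'source' is assumed to possibly contain
+ dashes itself, so the trailing fields are matched from the right.

```python
import re
from datetime import date

# Hypothetical pattern for
# "source DASH kilobytes DASH year DASH month DASH day DOT extension".
# The greedy first group lets 'source' contain dashes (an assumption).
TORPERF_NAME = re.compile(r"^(.+)-(\d+)-(\d{4})-(\d{2})-(\d{2})\.tpf$")


def parse_torperf_name(name):
    """Return (source, kilobytes, date), or None if the name does not match."""
    match = TORPERF_NAME.match(name)
    if match is None:
        return None
    source, kilobytes, year, month, day = match.groups()
    return source, int(kilobytes), date(int(year), int(month), int(day))
```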
+
+4.2 'bridge-descriptors' below 'recent'
+
+ 'bridge-descriptors' contains the following subdirectories:
+
+ * extra-infos
+ * server-descriptors
+ * statuses
+
+4.2.1
+ 'extra-infos' and 'server-descriptors' contain data files named
+ in the following way:
+
+ year DASH month DASH day DASH hour DASH minute DASH second
+ DASH marker
+
+ Where 'marker' is one of the strings "extra-infos" or "server-descriptors".
+ All time related values are derived from the download time.
+
+4.2.2
+ 'statuses' contains data files named
+
+ year month day DASH hour minute second DASH fingerprint
+
+ All time related values are derived from the published time and
+ 'fingerprint' is the fingerprint of the providing bridge
+ authority.
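+
+ Note that this pattern, unlike the others in section 4, has no DASH
+ between the individual date and time fields. A minimal Python sketch of
+ a parser follows; the function name parse_bridge_status_name is
+ hypothetical, and fixed-width, zero-padded fields are an assumption.

```python
from datetime import datetime


def parse_bridge_status_name(name):
    """Split "yearmonthday DASH hourminutesecond DASH fingerprint"."""
    # The fingerprint is everything after the last dash.
    timestamp, fingerprint = name.rsplit("-", 1)
    # Assumption: four-digit year, all other fields two digits.
    published = datetime.strptime(timestamp, "%Y%m%d-%H%M%S")
    return published, fingerprint
```

+ A name that does not follow the pattern raises a ValueError in this
+ sketch.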
+
+4.3 'relay-descriptors' below 'recent'
+
+ 'relay-descriptors' contains the following substructure:
+
+ * consensuses
+ * extra-infos
+ * microdescs
+ * server-descriptors
+ * votes
+
+4.3.1
+ 'consensuses' contains consensus documents named according to the pattern:
+
+ year DASH month DASH day DASH hour DASH minute DASH second
+ DASH CONSENSUS
+
+ Where CONSENSUS is the string "consensus" and all time related
+ values are derived from the valid-after dates.
+
+4.3.2
+ 'extra-infos' and 'server-descriptors' contain data files named like
+ those in the corresponding folders described in section 4.2.1.
+
+4.3.3
+ 'votes' contains files named
+
+ year DASH month DASH day DASH hour DASH minute DASH second
+ DASH VOTE DASH fingerprint DASH digest
+
+ Where VOTE is the string "vote" and all time related
+ values are derived from the valid-after dates. 'fingerprint'
+ is the fingerprint of the authority and 'digest' is the SHA-1
+ digest of the authority's medium-term signing key.
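+
+ The vote file name can be taken apart as sketched below in Python. The
+ function name parse_vote_name is hypothetical; the sketch assumes
+ zero-padded time fields and that neither 'fingerprint' nor 'digest'
+ contains a dash.

```python
from datetime import datetime


def parse_vote_name(name):
    """Return (valid_after, fingerprint, digest) for a vote file name."""
    # The literal "vote" separates the timestamp from the two identifiers.
    timestamp, _, rest = name.partition("-vote-")
    # Raises ValueError if "-vote-" is missing or 'rest' has extra dashes.
    fingerprint, digest = rest.split("-")
    valid_after = datetime.strptime(timestamp, "%Y-%m-%d-%H-%M-%S")
    return valid_after, fingerprint, digest
```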
+
+4.3.4
+ 'microdescs' contains the subfolders
+
+ * consensus-microdesc
+ * micro
+
+ 'consensus-microdesc' contains files named
+
+ year DASH month DASH day DASH hour DASH minute DASH second
+ DASH CONSENSUSMICRO
+
+ Where CONSENSUSMICRO is the string "consensus-microdesc" and
+ all time related values are derived from the valid-after dates.
+
+4.3.5
+ 'micro' serves files named
+
+ year DASH month DASH day DASH hour DASH minute DASH second
+ DASH MICRO
+
+ Where MICRO is the string "micro" and all time related values
+ are derived from the valid-after dates of the referencing microdesc
+ consensus.
+
+5.0 The Tar-ball's Directory Structure and the Internal Structure
+
+ The 'out' directory is the internal storage that is used for
+ preparing tar-balls. Thus, the following structures also occur, in part,
+ in the tars that are currently prepared per month.
+
+ The top level of subdirectories is
+
+ * bridge-descriptors
+ * exit-lists
+ * relay-descriptors
+ * torperf
+
+ (There used to be a fifth subdirectory 'bridge-pool-assignments', which was
+ removed when CollecTor stopped collecting those descriptors. However, its
+ structure can still be found in the tarballs.)
+
+5.1 'exit-lists' and 'torperf' Tars
+
+5.1.1
+ 'exit-lists' and 'torperf' tars are taken from the following
+ subdirectory structure
+
+ year SEP month SEP day
+
+ Where SEP is the path separator (usually *nix type "/") and
+ year, month, and day are derived from the download dates (exit lists) or
+ measurement dates (torperf).
+ Files inside the day directory level are named according to the
+ rules in 4.1.1 and 4.1.2.
+
+ Thus, monthly tars are named according to section 2.2 and contain
+ the following structure
+
+ marker DASH year DASH month SEP day
+
+ Where 'marker' stands for the strings "torperf" or "exit-list".
+
+5.1.2
+ The 'torperf' directory also contains summary files named
+
+ source DASH amount DOT extension
+
+ Where 'source' is the data source defined in the CollecTor instance,
+ amount is a number with appended "kb" or "mb", and 'extension' is
+ either "data" or "extradata".
+
+5.2 'bridge-descriptors' below 'out'
+
+ 'bridge-descriptors' contains the following subdirectories:
+
+ * extra-infos
+ * server-descriptors
+ * statuses
+
+5.2.1
+ 'extra-infos' and 'server-descriptors' have the following
+ subdirectory structure
+
+ year SEP month SEP first SEP second
+
+ Where 'year' and 'month' are derived from the published date.
+ 'first' and 'second' are the first and second symbol from the
+ router-digest, which also serves as the filename for the files
+ in the 'second' level directories.
+
+ Tars are named according to section 2.3 and have the following
+ substructure using the definitions from 2.3:
+
+ BRIDGE DASH marker DASH year DASH month SEP first SEP second
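+
+ The subdirectory rule of this section (year SEP month SEP first SEP
+ second, with the router-digest as the file name) might be computed as in
+ the following Python sketch. The function name out_descriptor_path is
+ hypothetical, "/" is assumed as SEP, and zero-padded 'year' and 'month'
+ are an assumption.

```python
import posixpath


def out_descriptor_path(year, month, router_digest):
    """Return "year SEP month SEP first SEP second SEP router-digest"."""
    if len(router_digest) < 2:
        raise ValueError("router digest too short")
    # 'first' and 'second' are the first two symbols of the digest.
    first, second = router_digest[0], router_digest[1]
    # Assumption: four-digit year, two-digit month, "/" as separator.
    return posixpath.join("%04d" % year, "%02d" % month,
                          first, second, router_digest)
```

+ For a digest starting with "ab", published in September 2016, the sketch
+ yields a path below "2016/09/a/b".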
+
+5.2.2
+ 'statuses' has a different substructure
+
+ year SEP month SEP day
+
+ Where 'year', 'month', and 'day' are derived from the published dates.
+ Files inside the 'day' directory level are named according to the
+ rules in 4.2.2.
+
+ The tars are named as in 2.3 with the substructure
+
+ BRIDGE DASH STATUSES DASH year DASH month SEP day
+
+ Where STATUSES is the string "statuses".
+
+5.3 'relay-descriptors' below 'out'
+
+ 'relay-descriptors' contains the following substructure:
+
+ * certs
+ * consensus
+ * extra-info
+ * microdesc
+ * server-descriptor
+ * vote
+
+5.3.1
+ 'certs' contains certificate files named
+
+ fingerprint DASH year DASH month DASH day DASH hour DASH minute DASH second
+
+ All time related values are derived from the key-published time and
+ 'fingerprint' is the fingerprint of the authority.
+
+5.3.2
+ 'consensus' and 'vote' contain the subdirectory structure
+
+ year SEP month SEP day
+
+ Where the time related values are taken from the valid-after dates.
+
+ 'extra-info' and 'server-descriptor' follow the naming and subdirectory
+ structure as described in 5.2.1. 'consensus' and 'vote' tars use the
+ substructure of 'statuses' as described in 5.2.2.
+
+5.3.3
+ 'microdesc' has a more complex subdirectory structure
+
+ year SEP month
+
+ Where the time related values are taken from the valid-after dates.
+ Inside the 'month' folders are two directories
+
+ * consensus-microdesc
+ * micro
+
+5.3.4
+ 'consensus-microdesc' contains 'day' subdirectories derived from the
+ valid-after dates and files named according to 4.3.4.
+
+ 'micro' follows the subdirectory structure of
+
+ first SEP second
+
+ Where 'first' and 'second' are the first and second symbol from the SHA-256
+ digest, which also serves as the filename for the files
+ in the 'second' level directories.
+
+ The monthly tars are named according to 2.4 and contain
+
+ MICRODESCS DASH year DASH month
+
+ and, below that, the subdirectories 'micro' and 'consensus-microdesc'.
+ MICRODESCS is the string "microdescs".
+
+