[metrics-bugs] #31204 [Metrics/CollecTor]: Extend file objects in index.json to include descriptor types, publication times, and file digests
Tor Bug Tracker & Wiki
blackhole at torproject.org
Fri Jul 19 07:06:05 UTC 2019
#31204: Extend file objects in index.json to include descriptor types, publication
times, and file digests
-----------------------------------+--------------------------
 Reporter:  karsten                |      Owner:  metrics-team
 Type:  enhancement                |     Status:  new
 Priority:  Medium                 |  Milestone:
 Component:  Metrics/CollecTor     |    Version:
 Severity:  Normal                 |   Keywords:
 Actual Points:                    |  Parent ID:
 Points:                           |   Reviewer:
 Sponsor:                          |
-----------------------------------+--------------------------
atagar suggested extending the file objects in CollecTor's `index.json` to
include descriptor types, publication times, and file digests.
As of now, file objects in the `index.json` file have the following
fields:
- `"path"`: Relative path of the file.
- `"size"`: Size of the file in bytes.
- `"last_modified"`: Time the file was last modified, formatted as
 `"YYYY-MM-DD HH:MM"` in the UTC timezone.
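For reference, here is a minimal sketch of how a `"last_modified"` value in
that pattern could be produced from the milliseconds that Java's
`File.lastModified()` returns. The class and method names are made up for
illustration and are not taken from CollecTor's code.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of CollecTor.
final class LastModifiedFormat {

  // "YYYY-MM-DD HH:MM" from the ticket, always rendered in UTC.
  private static final DateTimeFormatter FORMATTER =
      DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm").withZone(ZoneOffset.UTC);

  /** Formats milliseconds since the epoch, e.g. from File.lastModified(). */
  static String format(long lastModifiedMillis) {
    return FORMATTER.format(Instant.ofEpochMilli(lastModifiedMillis));
  }
}
```

Pinning the formatter to UTC keeps the output independent of the timezone
of the host that writes the index.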
The new fields could be defined as follows, though this is very much
subject to discussion on this ticket:
- `"types"`: List of descriptor types as found in `@type` annotations of
contained descriptors (optional).
- `"first_published"`: Earliest published timestamp (or similar) of
contained descriptors (optional).
- `"last_published"`: Latest published timestamp (or similar) of
contained descriptors (optional).
- `"sha256"`: SHA-256 digest of the file, encoded as base64 (optional).
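To make the proposal a bit more concrete, here is a rough sketch of how
`"sha256"` and `"types"` might be computed for a single file. It rests on
several assumptions that are still open on this ticket: the digest uses
standard base64 with padding, the file is an uncompressed text file, and a
type is everything following `@type ` on an annotation line (including the
version). The class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper, not part of CollecTor.
final class FileFieldsSketch {

  /** Returns the SHA-256 digest of the file, encoded as padded base64. */
  static String sha256Base64(Path file) throws IOException {
    MessageDigest digest;
    try {
      digest = MessageDigest.getInstance("SHA-256");
    } catch (NoSuchAlgorithmException e) {
      // Every Java platform is required to provide SHA-256.
      throw new IllegalStateException("SHA-256 not available", e);
    }
    // Stream the file in chunks so that large files are not held in memory.
    try (InputStream in = Files.newInputStream(file)) {
      byte[] buffer = new byte[8192];
      int read;
      while ((read = in.read(buffer)) != -1) {
        digest.update(buffer, 0, read);
      }
    }
    return Base64.getEncoder().encodeToString(digest.digest());
  }

  /** Collects the distinct values of all "@type" annotation lines. */
  static SortedSet<String> types(Path file) throws IOException {
    SortedSet<String> types = new TreeSet<>();
    // ISO-8859-1 maps every byte to a character, so decoding never fails.
    try (BufferedReader reader =
        Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
      String line;
      while ((line = reader.readLine()) != null) {
        if (line.startsWith("@type ")) {
          types.add(line.substring("@type ".length()).trim());
        }
      }
    }
    return types;
  }
}
```

Both methods read the whole file, which illustrates why these fields are
more expensive to produce than `"size"` or `"last_modified"`.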
All of these new fields seem reasonable to add, and I don't see why we
wouldn't want them. The index will get bigger, but that sounds
acceptable. The coding effort is non-zero, which we should acknowledge.
All in all, though, I don't see a blocker for doing this.
Implementation note: What all these new fields have in common is that
they're not plain file attributes that we can easily obtain from Java's
`File` class. We'll have to open and read files in order to obtain these
fields, and that's very time-consuming. I could see us doing this in a
background thread (or thread pool) started by CollecTor's
`CreateIndexJson.java`, with a state file of some sort to avoid
reprocessing files that haven't changed. Until this thread (pool) has
finished processing a file, the index would simply omit these new fields
(not the file itself!), which is why the fields are defined as optional
above.
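A very rough sketch of that idea follows, shown for a single expensive
field. Everything in it is an assumption for the sake of discussion: the
class name, the use of size plus last-modified time to detect changes, and
the in-memory map standing in for the state file. Persisting the map and
wiring this into `CreateIndexJson.java` are left out.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.function.Function;

// Hypothetical sketch, not part of CollecTor.
final class OptionalFieldCache {

  /** A computed value plus the file attributes it was computed for. */
  private static final class Entry {
    final long size;
    final long lastModified;
    final String value;

    Entry(long size, long lastModified, String value) {
      this.size = size;
      this.lastModified = lastModified;
      this.value = value;
    }
  }

  private final Map<String, Entry> entries = new ConcurrentHashMap<>();
  private final Set<String> pending = ConcurrentHashMap.newKeySet();
  private final Executor executor;
  private final Function<String, String> compute;

  /** compute maps a file path to the expensive field, e.g. its digest. */
  OptionalFieldCache(Executor executor, Function<String, String> compute) {
    this.executor = executor;
    this.compute = compute;
  }

  /**
   * Returns the cached value if it matches the given file attributes.
   * Otherwise schedules a computation and returns empty, in which case
   * the index writer would omit the optional field for now.
   */
  Optional<String> lookup(String path, long size, long lastModified) {
    Entry entry = entries.get(path);
    if (entry != null && entry.size == size
        && entry.lastModified == lastModified) {
      return Optional.of(entry.value);
    }
    // Schedule at most one computation per path at a time.
    if (pending.add(path)) {
      executor.execute(() -> {
        try {
          entries.put(path,
              new Entry(size, lastModified, compute.apply(path)));
        } finally {
          pending.remove(path);
        }
      });
    }
    return Optional.empty();
  }
}
```

The index writer would call `lookup` for every file on each run and write
the field only when a value is present, so a slow background computation
never delays the index itself.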
What else did I miss? atagar, please fill in any thoughts that I left out.
Once we agree on the spec here, this could be a fine little project for a
volunteer.
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/31204>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online