[tor-bugs] #29697 [Internal Services]: archive.tpo is soon running out of space
Tor Bug Tracker & Wiki
blackhole at torproject.org
Tue May 21 14:08:42 UTC 2019
#29697: archive.tpo is soon running out of space
-------------------------------+------------------------
Reporter: boklm | Owner: (none)
Type: defect | Status: new
Priority: Medium | Milestone:
Component: Internal Services | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
-------------------------------+------------------------
Comment (by anarcat):
TL;DR: possible paths:
1. Internet Archive (IA)
2. Software Heritage
3. commercial storage (e.g. Amazon Glacier)
4. host our own
5. spend more time deciding on archival policies
6. mix of the above
One way to manage stuff like this is to break it up into smaller pieces and
distribute it around. A typical way I manage those archives is with
git-annex, which allows for reliable tracking of N copies (say "3 redundant
copies") and supports *many* different "remotes", including Amazon
Glacier, Internet Archive (IA) and so on. It's what I used in the Brazil
archival project and it mostly worked. It's hard to use, unfortunately,
which may be a big blocker for adoption.
If git-annex is too complicated, we can talk to IA directly. I would
recommend, however, against using their web-based upload interface which,
even they acknowledge, is terrible and barely usable. I packaged the
[https://tracker.debian.org/pkg/python-internetarchive internetarchive]
python client in Debian to work around that problem and it works much
better.
Moving files to IA only shifts the problem, in my opinion: we would then
have only a single copy, held elsewhere. While we would no longer need to
manage that space, we also wouldn't manage backups and would never know if
they drop stuff on us (and they do, sometimes, either deliberately or by
mistake). I would propose that if stuff moves out of our "backed-up"
infrastructure, it should be stored in at least two administratively
distinct locations.
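As a rough illustration of the "two administratively distinct locations"
rule, here is a minimal Python sketch. The inventory format, the function
names and the example entries are all hypothetical; nothing like this exists
in our infrastructure today.

```python
def administrators(copies):
    """Return the set of distinct administrative entities holding a copy."""
    return {admin for _location, admin in copies}


def meets_policy(copies, minimum=2):
    """True if the copies span at least `minimum` distinct administrators."""
    return len(administrators(copies)) >= minimum


# Hypothetical inventory for one archived file: (location, administrator).
ia_only = [("archive.org item", "Internet Archive")]
ia_and_swh = ia_only + [("Software Heritage archive", "Software Heritage")]
two_glacier_vaults = [("Glacier vault A", "Amazon"), ("Glacier vault B", "Amazon")]

print(meets_policy(ia_only))
print(meets_policy(ia_and_swh))
print(meets_policy(two_glacier_vaults))
```

The point of counting administrators rather than copies is that two copies
held by the same organization can still disappear together.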
Another such location we could use, apart from commercial providers like
Amazon, is the [http://softwareheritage.org/ Software Heritage] project
([https://en.wikipedia.org/wiki/Software_Heritage WP]) which is *designed*
to store copies of source code and software artifacts of all breeds. It
might even have something for Tor already.
Otherwise, assuming we can solve this problem ourselves, I think this
question boils down to "How big of an archive do we actually need and how
fast does it grow?" With the limited Grafana history I had available a
week ago, I calculated that we dump roughly 10GB per week of new stuff on
there, but naturally the sample size is too small to take that number
seriously. To give you another metric: over the last 15 days, free space
has gone from 254GB to 207GB, eating a whopping 47GB, which clocks the
rate at ~3GB a day or ~22GB a week. When I looked at it a week ago, we had
220GB left, which gives us a rate of 13GB/week over the last week. So I
would estimate the burn rate is between 10 and 20GB/week, which gives us
about 10 to 20 weeks to act on this problem.
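The napkin math above can be reproduced with a short Python sketch. The
free-space readings are the ones quoted in this comment; treating "a week
ago" as exactly 7 days is an assumption, and the function names are made up
for illustration.

```python
def burn_rate_gb_per_week(free_then_gb, free_now_gb, days):
    """Average space consumed per week between two free-space readings."""
    return (free_then_gb - free_now_gb) * 7 / days


def weeks_left(free_now_gb, rate_gb_per_week):
    """Weeks until the disk is full at a constant burn rate."""
    return free_now_gb / rate_gb_per_week


# Readings quoted above: 254GB free 15 days ago, 220GB free a week ago
# (assumed to mean 7 days), 207GB free now.
two_week_rate = burn_rate_gb_per_week(254, 207, 15)
one_week_rate = burn_rate_gb_per_week(220, 207, 7)
print(f"15-day window: {two_week_rate:.1f} GB/week")
print(f"7-day window:  {one_week_rate:.1f} GB/week")

# Time left for the low and high ends of the estimated range.
for rate in (10, 20):
    print(f"at {rate} GB/week: {weeks_left(207, rate):.1f} weeks left")
```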
Assuming 10GB/week, this means we need ~500GB of *new* storage every year.
With our current setup, this translates into roughly 2x1TB of storage per
year because of RAID and backups.
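A minimal sketch of that projection follows. The RAID factor of 2
(mirroring) and the two hosts (the archive server plus one backup host) are
my reading of "because of RAID and backups", not measured values, and the
function name is hypothetical.

```python
WEEKS_PER_YEAR = 52


def yearly_raw_storage_gb(growth_gb_per_week, raid_factor=2, hosts=2):
    """Return (new data, raw disk per host, raw disk total) in GB per year.

    raid_factor=2 assumes mirrored disks; hosts=2 assumes the archive
    server plus one backup host. Both are assumptions.
    """
    new_data = growth_gb_per_week * WEEKS_PER_YEAR
    per_host = new_data * raid_factor
    return new_data, per_host, per_host * hosts


new_data, per_host, total = yearly_raw_storage_gb(10)
print(f"new data per year:  {new_data} GB")
print(f"raw disk per host:  {per_host} GB")
print(f"raw disk, all hosts: {total} GB")
```

Under those assumptions, 10GB/week works out to roughly 1TB of raw disk on
each of the two hosts per year, which matches the 2x1TB figure.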
So if we want this problem to go away for ~10 years (assuming current
rate, which is probably inaccurate, at best), we could throw hardware at
the problem and give Hetzner another ~200EUR/mth specifically for an
archival server. We might be able to save some costs by *not* backing up
the server and using IA/Software Heritage as a fallback, with git-annex as
well.
Fundamentally, this is a cost problem. Do you want us to spend time to
figure out a proper archival policy and cheap/free storage locations or
pay for an archival server?
In any case, I'd be happy to dig deeper into this to figure out the
various options beyond the above napkin calculations.
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/29697#comment:5>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online