Elsevier has a sustainable business because they force university libraries to pay big $$$. The big $$$ can be fed back into sales, lobbying politicians, and other activities that leave enough money that the executives behind it can get paid big $$$.
Academic databases might cost 1/1000 as much to run as commercial journals, but then there isn't any surplus they can use to get rents.
The Wayback Machine doesn't work right at all on one of my own COS preprints: https://web.archive.org/web/20200211103826/https://engrxiv.o...
Maybe I should contact ArchiveTeam...
The files can be downloaded directly by adding '/download' to the engrxiv URL for each preprint.
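A minimal sketch of that URL trick in Python. The only behavior assumed is the one described above (appending '/download' to a preprint's URL); the example URL and its identifier are made up for illustration.

```python
def download_url(preprint_url: str) -> str:
    """Build a direct-download URL by appending '/download'.

    Assumes the preprint URL carries no query string or fragment.
    """
    # Drop any trailing slash so the result never contains '//download'.
    return preprint_url.rstrip("/") + "/download"


if __name__ == "__main__":
    # Hypothetical preprint URL; 'abc12' is a made-up identifier.
    print(download_url("https://engrxiv.org/abc12/"))
```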
Hosting a few thousand PDF files that are probably only downloaded by a handful of people shouldn't cost thousands of dollars.
For an operation like that you are going to need at least one technical FTE, and you will be more comfortable with two (so somebody can go on vacation).
You probably also need one or two FTE for "customer service" functions (e.g. I can't upload file X, I am having trouble downloading file Y).
If you are getting organizations to subscribe to this you also have to run a sales organization, not just to get new customers, but simply to keep getting checks from the customers you have. You need at least one FTE for that, but sales organizations usually develop a hierarchy to the extent that you might have one senior FTE and two junior FTEs.
Then you need somebody to scramble for grants and interface with non-customers, so you get an FTE for administration.
So that gets you to 8 FTE and a wage bill upwards of $500,000 a year. If you had everyone working at full capacity it might be efficient, but if your sales efforts don't get you to full capacity this is a boondoggle.
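The headcount above can be tallied in a few lines of Python. This is only a sketch of the arithmetic: the per-function numbers take the upper end of each range given above, and the average cost per FTE is not stated anywhere in the thread; it is simply what a $500,000 wage bill implies when divided over 8 people.

```python
# Headcount per function, taking the upper end of each range above.
STAFF = {
    "technical": 2,
    "customer service": 2,
    "sales": 3,  # one senior plus two junior
    "administration": 1,
}

WAGE_BILL = 500_000  # dollars per year, the floor quoted above


def total_fte(staff: dict) -> int:
    """Sum the headcount across all functions."""
    return sum(staff.values())


def implied_cost_per_fte(wage_bill: float, staff: dict) -> float:
    """Average yearly cost per person implied by the wage bill (derived, not sourced)."""
    return wage_bill / total_fte(staff)


if __name__ == "__main__":
    print(total_fte(STAFF), "FTE")
    print(f"${implied_cost_per_fte(WAGE_BILL, STAFF):,.0f} per FTE per year")
```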
arXiv.org got started because Paul Ginsparg wasn't concerned about cost recovery at the beginning (no sales), did it as a labor of love, got some people to help him with it as a side project, so it cost at most 2 FTE to run.
Once it got to Cornell, it developed a cost structure in line with what I described, except that the sales and grant-getting functions were neglected, so the investment in people to make it sustainable in the beginning still wasn't enough.
Preprints cost (projected vs actuals): https://docs.google.com/spreadsheets/d/1V0vKrf50K667CqM3e4S2...
Org finances: https://cos.io/about/our-finances/
A few things stand out:
1. Preprints are pretty new. You're not just hosting PDFs on an S3 bucket in maintenance mode; you're also wrangling authors with very different workflows to use your platform. This means building tools for moderation or retraction, and long handholding to recruit 26 partner groups, some of which started as grassroots efforts without their own institutional history. Each group may have its own ideas about governance (each research field may do things in different ways).
2. In that light, the projected personnel costs are... not high. The spreadsheet claims that 22% of page traffic goes to preprints, but the original 2019 forecast called for a ~$7k budget on developers + QA, total. At market rates, that's... a small fraction of the annual cost of a single developer? (Their team page lists 10-15 devs on staff.)
3. Compared to the overall organizational finances, it suggests that if anything, some of the cost of running the service is being spread across their other offerings. The original vs modified forecasts for 2019 seem a bit, well, different; it's likely that the costs are still being worked out, and may be dependent on hard-to-predict growth.
It's also very notable that this hubbub seems to involve a relatively small amount of cash: the proposed funding model is a 60-40 split, with the service share divided among up to 26 groups. That says a lot about the role of building institutions to support preprints long term, and the need to help grassroots initiatives mature if we want to keep these services active.
"...this fee structure accounts for $87,976 in contributions by the preprint services toward maintenance costs (38% of total)"
Disclaimer: I don't speak for any of the groups involved in this process, and comments are based only on these public documents. There may well be other numbers or context.
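For a sense of scale, here is a rough Python sketch of what the quoted figures imply. Only the $87,976 and the 38% come from the quote; the total is back-calculated from them, and the even split across 26 services is purely an assumption for illustration (the actual fee structure may well be tiered).

```python
SERVICE_CONTRIBUTION = 87_976  # dollars, from the quoted fee structure
SERVICE_SHARE = 0.38           # "38% of total", from the same quote
NUM_SERVICES = 26              # partner groups mentioned above


def implied_total(contribution: float, share: float) -> float:
    """Back out the total maintenance cost from one party's share of it."""
    return contribution / share


def even_split(contribution: float, n: int) -> float:
    """Per-service amount IF the contribution were divided evenly (an assumption)."""
    return contribution / n


if __name__ == "__main__":
    total = implied_total(SERVICE_CONTRIBUTION, SERVICE_SHARE)
    per_service = even_split(SERVICE_CONTRIBUTION, NUM_SERVICES)
    print(f"implied total maintenance cost: ${total:,.0f}")
    print(f"even split per service:         ${per_service:,.2f}")
```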
I mean, I can guess from context that it’s a place to store scientific studies. But I don’t know its role or why it’s any different from throwing some files in a Dropbox folder or a static website.
1. Persistence, which obviously is hurt by them running out of money. Dropbox links apparently die eventually, static websites bitrot, etc. Things put on arXiv can be trusted enough to put links to them in papers, if needed.
2. Discoverability. Preprint archives send out digests (usually people subscribe to particular areas). It's a low-effort way to distribute them. Relatedly, by having things semi-centralized, it's easier to discover papers (they get indexed by Google Scholar, for example...).
Maybe other benefits? Not sure.
https://duckduckgo.com/?q=preprint+server+site%3Anature.com&... for example lists many previous publications by Nature on the topic, including https://www.nature.com/articles/nmeth.3831 https://www.nature.com/articles/s41467-017-00950-5 and https://www.nature.com/articles/d41586-019-00199-6 . (The first was not in the flagship journal "Nature" but a sister journal.)
In short, a preprint is a version of a scientific paper before it's been formally accepted by a journal. See https://en.wikipedia.org/wiki/Preprint .
The first real electronic preprint repository is https://en.wikipedia.org/wiki/ArXiv . That gives more background information on the topic.
abought describes some of the differences between a set-of-files and a preprint repository at https://news.ycombinator.com/item?id=22319686 .
For hosting 26 preprint servers: is that high, low, or about right?
If I had to host 26 highly available servers for a geographically distributed group in a commercial or academic space, with paid maintenance staff and some semblance of support... that's actually pretty darn cheap.
Let's say a paper is 1 MB. Downloading it 1,000 times costs 10 cents at AWS prices. 10,000 lifetime views would be a lot for an arXiv paper, and that is $1. Plus storing 1,000 papers at that rate is another 10 cents a month. Storing a million papers is $100 a month.
The thing people miss about AWS's charging for bandwidth is that it's an efficient way to lower their costs handling DMCA requests. You might get 20,000 downloads of a 2 GB camcorder capture of a movie that is still in theaters; the prospect of a $4000 bill to do that encourages people to use other platforms instead. AWS doesn't have their legal department running up big bills dealing with this, and they pass on to their customers the savings and the generally safer environment to do whatever it is you are doing w/o harassment.
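The back-of-envelope numbers above can be reproduced with a short Python sketch. The flat rate of $0.10 per GB is an assumption: it is the round figure the estimates above imply (for both transfer and monthly storage), not a quoted AWS price, and real storage pricing is billed differently from transfer.

```python
RATE_PER_GB = 0.10  # ASSUMED flat rate in dollars per GB, not an official AWS price

PAPER_GB = 0.001    # a 1 MB paper, using decimal units (1 GB = 1000 MB)
MOVIE_GB = 2.0      # the 2 GB camcorder capture


def transfer_cost(size_gb: float, downloads: int, rate: float = RATE_PER_GB) -> float:
    """Cost of serving a file of `size_gb` a total of `downloads` times."""
    return size_gb * downloads * rate


def monthly_storage_cost(size_gb: float, count: int, rate: float = RATE_PER_GB) -> float:
    """Monthly cost of keeping `count` files of `size_gb` each, at the same assumed rate."""
    return size_gb * count * rate


if __name__ == "__main__":
    print(f"1,000 downloads of a paper:  ${transfer_cost(PAPER_GB, 1_000):.2f}")
    print(f"10,000 downloads of a paper: ${transfer_cost(PAPER_GB, 10_000):.2f}")
    print(f"storing 1,000 papers:        ${monthly_storage_cost(PAPER_GB, 1_000):.2f}/month")
    print(f"storing 1,000,000 papers:    ${monthly_storage_cost(PAPER_GB, 1_000_000):.2f}/month")
    print(f"20,000 downloads of a movie: ${transfer_cost(MOVIE_GB, 20_000):.2f}")
```

Under this assumed rate the paper figures come out in cents and dollars while the movie case lands in the thousands, which is the asymmetry the comment is pointing at.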
This could indicate that many preprints are posted as standalone artifacts (e.g., not every preprint would include rich datasets alongside the PDF). In effect, people could be skipping the rest of the workflow and just using the platform for the preprint.
For bandwidth usage, they're estimating 20-25% of site traffic. Hosting costs are not a trivial expense, but they're not the main driver of the budget, and the bigger uncertainty seems to be on labor costs.