
Popular preprint servers face closure because of money troubles - covertlibrarian
https://www.nature.com/articles/d41586-020-00363-3
======
PaulHoule
Even arXiv has struggled at times.

Elsevier has a sustainable business because they force university libraries to
pay big $$$. The big $$$ can be fed back into sales, lobbying politicians, and
other activities that leave enough money that the executives behind it can get
paid big $$$.

Academic databases might cost 1/1000 as much to run as commercial journals,
but then there isn't any surplus they can use to get rents.

~~~
jessriedel
When was the last time the arXiv was actually at risk of not having enough
money?

~~~
PaulHoule
The time they laid me off in the mid-2000s.

------
leemailll
Sadly, this is a common theme for academic databases. Many databases run on
funding money, and once the funding is gone, so is the database.

~~~
btrettel
Academics should make their databases easily archivable then.

One thing I don't like about COS preprint sites is how much they rely on
JavaScript. I'm not sure the Wayback Machine can archive them correctly. It
might take a specialized bot.

The Wayback Machine doesn't work right at all on one of my own COS preprints:
[https://web.archive.org/web/20200211103826/https://engrxiv.o...](https://web.archive.org/web/20200211103826/https://engrxiv.org/35u7g/)

Maybe I should contact ArchiveTeam...

~~~
dhacks
I'm the director of engrXiv. It would be great to have a reliable archive of
the server contents separate from COS. It would be pretty easy to scrape all
of the files. There is a regularly updated CSV of all of the contents of
engrXiv here: [https://osf.io/ns9yr/](https://osf.io/ns9yr/)

The files can be downloaded directly by adding '/download' to the engrxiv URL
for each preprint.
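
A minimal sketch of the scrape described above, assuming the CSV has a column holding each preprint's engrxiv URL. The column name `preprint_url` and the sample row are hypothetical; check the real header of the CSV before relying on this. It only builds the download URLs and does not fetch anything.

```python
import csv
import io


def download_url(preprint_url):
    """Append '/download' to a preprint URL, tolerating a trailing slash."""
    return preprint_url.rstrip("/") + "/download"


def download_urls_from_csv(csv_text, url_column="preprint_url"):
    """Build a download URL for every row that has a value in url_column.

    url_column is a hypothetical column name, not the real CSV header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [download_url(row[url_column]) for row in reader if row.get(url_column)]


# Hypothetical one-row sample; the URL is the preprint linked earlier in the thread.
sample = "preprint_url\nhttps://engrxiv.org/35u7g/\n"
print(download_urls_from_csv(sample))
```

A real archiver would then fetch each URL politely (rate limited) and store the file next to its CSV row.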

------
Robotbeat
Sounds like there should be a permanent estate/endowment/foundation for these
sorts of things that can provide for their use indefinitely.

------
hannob
TBH the numbers seem exceptionally high to me. Is there something I'm missing?

Hosting a few thousand PDF files that are probably only downloaded by a
handful of people shouldn't cost thousands of dollars.

~~~
abought
I think the best way to understand the numbers would be to look at the
breakdown of costs. Two links that were shared from the twitter thread
discussing this might be useful:

Preprints cost (projected vs actuals):
[https://docs.google.com/spreadsheets/d/1V0vKrf50K667CqM3e4S2...](https://docs.google.com/spreadsheets/d/1V0vKrf50K667CqM3e4S2itlFTdSMAqKAz17gJKxjVjE/edit#gid=0)

Org finances: [https://cos.io/about/our-finances/](https://cos.io/about/our-finances/)

A few things stand out: 1\. Preprints are pretty new. You're not just hosting
PDFs on an S3 bucket in maintenance mode; you're also wrangling authors with
very different workflows to use your platform. This means building tools for
moderation or retraction, and long handholding to recruit 26 partner groups,
some of which started as grassroots efforts without their own institutional
history. Each group may have its own ideas about governance. (Each research
field may do things in different ways.)

2\. In that light, the projected personnel costs are.... not high. The
spreadsheet claims that 22% of page traffic goes to preprints, but the
original 2019 forecast called for ~$7k budget on developers + QA, total. At
market rates, that's... a small fraction of the annual cost of a single
developer? (their team page lists 10-15 devs on staff)

3\. Compared to the overall organizational finances, it suggests that if
anything, some of the cost of running the service is being spread across their
other offerings. The original vs modified forecasts for 2019 seem a bit, well,
different- it's likely that the costs are still being worked out, and may be
dependent on hard-to-predict growth.

It's also very notable that this hubbub seems to involve a relatively small
amount of cash: the proposed funding model is a 60-40 split, with the service
share divided among up to 26 groups. That says a lot about the role of
building institutions to support preprints long term, and the need to help
grassroots initiatives mature if we want to keep these services active.

[https://cos.io/our-products/osf-preprints/](https://cos.io/our-products/osf-preprints/)
"...this fee structure accounts for $87,976 in contributions by
the preprint services toward maintenance costs (38% of total)"
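
As a rough check on the quoted figures, here is a small sketch of the arithmetic. The even split across 26 groups is my assumption for illustration; the actual fee structure may weight groups differently.

```python
contribution = 87_976   # dollars, from the quoted COS page
share_of_total = 0.38   # "38% of total"
groups = 26             # "up to 26 groups"

# Total maintenance cost implied by the quoted contribution and share.
implied_total = contribution / share_of_total

# Assumption: an even split across all groups (not stated in the source).
per_group = contribution / groups

print(f"implied total maintenance cost: ${implied_total:,.0f}")
print(f"even split per group:           ${per_group:,.0f}")
```

The implied total lands close to the roughly $230,000 figure quoted from the article.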

Disclaimer: I don't speak for any of the groups involved in this process, and
comments are based only on these public documents. There may well be other
numbers or context.

------
melbourne_mat
The financial problems seem to be happening at regional preprints like
Indonesia or Africa. The people who run the service for them are surely using
western labour. The obvious answer is for the regional preprints to switch to
using local labour - ie run the service themselves. That would be a
significant saving.

------
hyperion2010
I have had my doubts about COS's technical foundations and strategic abilities
and this and a number of the comments here do nothing to reassure me,
especially in light of the glowing stories about dedication to long term
sustainability that I have been told by some of their representatives.

------
Chemistt
Looks like that money is appetizing for a Center for Open Science whose motto
is to promote openness and transparency. Are such costs really transparent?
There is no openness, much less transparency, here. What kind of maintenance
would justify such excessive costs? "... the servers are hosted online by the
non-profit Center for Open Science (COS)" A non-profit center that asks for
$200,000 per year? How much would it cost if it were for-profit, then? For
less than one hundred dollars per year ($100), you can have dedicated hosting
plans with unlimited space. Hosting 26 servers, or even up to 100 servers with
thousands of text files per year, on one structured platform (OSF) should NOT
cost such high fees, should it?

------
btrettel
The one preprint site that I use which is hosted by the COS seems to be doing
okay, fortunately:
[https://blog.engrxiv.org/financials/](https://blog.engrxiv.org/financials/)

------
jessriedel
I still don't understand the point of the regional repositories. How
does this "increase exposure" for their research, and if it did why would that
be valuable?

~~~
improbable22
If you found one, you get to brag about it. Give regional TEDx talks.
Schmooze with people closer to the money than usual. I presume that's the main
point.

------
dangus
I could be wrong here but I was frustrated at how the article never really
gave a summary of what a preprint repository is.

I mean, I can guess from context that it’s a place to store scientific
studies. But I don’t know its role or why it’s any different from throwing
some files in a Dropbox folder or a static website.

~~~
dbpatterson
There are (at least) several goals:

1\. persistence, which obviously is hurt by them running out of money. Dropbox
links apparently die eventually, static websites bitrot, etc. Things put on
arxiv can be trusted enough to put links to them in papers, if needed.

2\. Discoverability. Pre-print archives send out digests (usually people
subscribe to particular areas). It's a low-effort way to distribute them.
Related, by having things semi-centralized, it's easier to discover papers
(they get indexed by google scholar, for example...).

Maybe other benefits? Not sure.

------
cbare
Ha, imagine how much Nature Publishing Group are crying in their coffee while
they write that headline! Open science has the same free-rider problems as
open source, but also, the same network effects with huge benefits.

------
grizzles
I've always wanted to take a swing at re-institutionalizing science as a
betting game. It'd be such a good business model for a free preprint server.
Best predictor wins! Cheaters punished!

~~~
sgillen
Could you expand? What exactly do you mean? What would people be betting on?

~~~
grizzles
The outcome of an experiment and its reproducibility over n independent
repetitions.

------
remram
EarthArXiv's statement might bring some additional context:
[https://eartharxiv.github.io/cos.html](https://eartharxiv.github.io/cos.html)

------
travisporter
“... sustain its hosting service in the long term, which will cost about
US$230,000 in 2020”

For hosting 26 preprint servers. Is that high, low or about right?

~~~
driverdan
That's about $737/m/server. Seems high to me.
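
The arithmetic behind that figure, as a quick sketch. It simply divides the whole quoted budget by the server count and by 12 months, so it treats every dollar as a per-server cost, which is an assumption.

```python
annual_cost = 230_000   # dollars, the 2020 figure quoted from the article
servers = 26            # preprint servers hosted

# Whole budget spread evenly over servers and months (an assumption).
monthly_per_server = annual_cost / servers / 12

print(f"${monthly_per_server:,.0f} per server per month")
```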

~~~
mygo
This cost-per-server implies that all the money goes to the servers. There may
be humans who need to get paid too. Unless they’re claiming that that’s just
the technology cost?

------
api
Seems like a job for a peer to peer system if bandwidth is a major expense.

~~~
PaulHoule
I can't imagine that it is.

Let's say a paper is 1 MB, downloading it 1,000 times costs 10 cents at AWS
prices. 10,000 lifetime views would be a lot for an arXiv paper, and that is
$1. Plus storing 1,000 papers at that rate is another 10 cents a month.
Storing a million papers is $100 a month.

The thing people miss about AWS's charging for bandwidth is that it's an
efficient way to lower their costs handling DMCA requests. You might get
20,000 downloads of a 2 GB camcorder capture of a movie in the theaters -- the
prospect of a $4000 bill to do that encourages people to use other
platforms instead. AWS doesn't have their legal department running big bills
dealing with this and they pass on the savings and the generally safer
environment to do whatever it is you are doing w/o harassment to their
customers.
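
A back-of-the-envelope sketch of those numbers. Both rates are assumptions inferred from the figures above (about 10 cents per GB transferred and per GB stored each month), not current AWS list prices, and 1 GB is taken as 1,000 MB.

```python
# Assumed rates, implied by the figures in the comment above.
TRANSFER_USD_PER_GB = 0.10
STORAGE_USD_PER_GB_MONTH = 0.10


def transfer_cost(size_mb, downloads):
    """Dollar cost of serving a file of size_mb megabytes `downloads` times."""
    return size_mb * downloads / 1000 * TRANSFER_USD_PER_GB


def monthly_storage_cost(size_mb, count):
    """Dollar cost per month of storing `count` files of size_mb megabytes."""
    return size_mb * count / 1000 * STORAGE_USD_PER_GB_MONTH


print(f"1 MB paper, 1,000 downloads:    ${transfer_cost(1, 1_000):,.2f}")
print(f"1 MB paper, 10,000 downloads:   ${transfer_cost(1, 10_000):,.2f}")
print(f"2 GB video, 20,000 downloads:   ${transfer_cost(2_000, 20_000):,.2f}")
print(f"storing 1,000 papers / month:   ${monthly_storage_cost(1, 1_000):,.2f}")
print(f"storing 1,000,000 papers / mo.: ${monthly_storage_cost(1, 1_000_000):,.2f}")
```

Under these assumed rates the paper-sized numbers stay in cents and dollars, while the movie example reaches thousands.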

~~~
LyndsySimon
COS’s platform, the Open Science Framework, hosts large amounts of data in
addition to the preprint itself. It’s more of a “scientific workflow platform”
than a “preprint service”.

~~~
abought
Per the budget spreadsheet linked elsewhere, preprints seem to be 20+% of site
traffic, but only 0.4-1% of storage costs for the overall site.

This could indicate that many preprints are posted as standalone artifacts (eg
not every preprint would include rich datasets alongside the PDF). In effect,
people could be skipping the workflow and just using it for the preprint.

For bandwidth usage, they're estimating 20-25% of site traffic. Hosting costs
are not a trivial expense, but they're not the main driver of the budget, and
the bigger uncertainty seems to be on labor costs.

