Popular preprint servers face closure because of money troubles (nature.com)
49 points by covertlibrarian on Feb 13, 2020 | 35 comments

Even arXiv has struggled at times.

Elsevier has a sustainable business because they force university libraries to pay big $$$. The big $$$ can be fed back into sales, lobbying politicians, and other activities that leave enough money that the executives behind it can get paid big $$$.

Academic databases might cost 1/1000 as much to run as commercial journals, but then there isn't any surplus they can use to get rents.

It isn't clear from TFA, but COS might be going for that "surplus", themselves. When all their users were signing up, were they informed that what was free then would soon cost thousands of dollars?

When was the last time the arXiv was actually at risk of not having enough money?

The time they laid me off in the mid-2000's.

Sadly, this is a common theme for academic databases. Many databases were running on funding money, and once the funding is gone, so is the database.

Academics should make their databases easily archivable then.

One thing I don't like about COS preprint sites is how much they rely on JavaScript. I'm not sure the Wayback Machine can archive them correctly. It might take a specialized bot.

The Wayback Machine doesn't work right at all on one of my own COS preprints: https://web.archive.org/web/20200211103826/https://engrxiv.o...

Maybe I should contact ArchiveTeam...

I'm the director of engrXiv. It would be great to have a reliable archive of the server contents separate from COS. It would be pretty easy to scrape all of the files. There is a regularly updated CSV of all of the contents of engrXiv here: https://osf.io/ns9yr/

The files can be downloaded directly by adding '/download' to the engrxiv URL for each preprint.
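A minimal sketch of the scraping idea above, assuming the CSV has a column holding each preprint's engrXiv URL. The column name `url` and the sample row are hypothetical; check the real header of the CSV at https://osf.io/ns9yr/ before using this. It only builds the download links and does not fetch anything.

```python
import csv
import io


def download_url(preprint_url):
    """Append '/download' to a preprint URL, as described in the comment above."""
    return preprint_url.rstrip("/") + "/download"


def download_urls_from_csv(csv_text, url_column="url"):
    """Build download links from CSV text.

    `url_column` is a hypothetical column name; adjust it to the real header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [download_url(row[url_column]) for row in reader if row.get(url_column)]


# Hypothetical sample row, for illustration only.
sample_csv = "title,url\nExample preprint,https://engrxiv.org/abcde/\n"
links = download_urls_from_csv(sample_csv)
print(links)
```

An archiving bot could then walk `links` with a polite delay between requests.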

Sounds like there should be a permanent estate/endowment/foundation for these sorts of things that can provide for their use indefinitely.

TBH the numbers seem exceptionally high to me. Is there something I'm missing?

Hosting a few thousand PDF files that are probably only downloaded by a handful of people shouldn't cost thousands of dollars.

It's not that hard to host PDF files in the short term, but over the long term you have to do maintenance, and that adds up over time.

For an operation like that you are going to need at least one technical FTE, and you will be more comfortable with two. (Somebody can go on vacation.)

You probably also need one or two FTE for "customer service" functions (e.g. I can't upload file X, I am having trouble downloading file Y)

If you are getting organizations to subscribe to this you also have to run a sales organization, not just to get new customers, but simply to keep getting checks from the customers you have. You need at least one FTE for that, but sales organizations usually develop a hierarchy to the extent that you might have one senior FTE and two junior FTEs.

Then you need somebody to scramble for grants, interface with non-customers, so you get an FTE for administration.

So that gets you to 8 FTE and a wage bill upwards of $500,000 a year. If you had everyone working at full capacity it might be efficient, but if your sales efforts don't get you to full capacity this is a boondoggle.
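A rough tally of the headcount described above, as a sketch. The low/high ranges are my reading of the comment (for example, sales as one senior plus up to two junior FTEs), and the per-FTE figure simply divides the quoted $500,000 floor by the high headcount. It is not a real budget.

```python
# Headcount ranges (low, high) per function, read off the comment above.
fte_ranges = {
    "technical": (1, 2),
    "customer service": (1, 2),
    "sales": (1, 3),           # one senior plus up to two junior
    "administration": (1, 1),
}

low_total = sum(low for low, _ in fte_ranges.values())
high_total = sum(high for _, high in fte_ranges.values())

# "Upwards of $500,000 a year", taken here as a floor on the wage bill.
wage_bill_usd = 500_000
implied_cost_per_fte = wage_bill_usd / high_total

print(low_total, high_total, implied_cost_per_fte)
```

The high end of the ranges is what produces the 8 FTE figure; the low end would be half that.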

arXiv.org got started because Paul Ginsparg wasn't concerned about cost recovery at the beginning (no sales), did it as a labor of love, got some people to help him with it as a side project, so it cost at most 2 FTE to run.

Once it got to Cornell it developed a cost structure in line with what I described except the sales and grant-getting functions were neglected so the investment in people to make it sustainable in the beginning still wasn't enough.

I think the best way to understand the numbers would be to look at the breakdown of costs. Two links that were shared from the twitter thread discussing this might be useful:

Preprints cost (projected vs actuals): https://docs.google.com/spreadsheets/d/1V0vKrf50K667CqM3e4S2...

Org finances: https://cos.io/about/our-finances/

A few things stand out: 1. Preprints are pretty new. You're not just hosting PDFs on an S3 bucket in maintenance mode; you're also wrangling authors with very different workflows to use your platform. This means building tools for moderation or retraction, and long handholding to recruit 26 partner groups, some of which started as grassroots efforts without their own institutional history. Each group may have its own ideas about governance. (Each research field may do things in different ways.)

2. In that light, the projected personnel costs are.... not high. The spreadsheet claims that 22% of page traffic is going to preprints, but the original 2019 forecast called for ~$7k budget on developers + QA, total. At market rates, that's... a small fraction of the annual cost of a single developer? (Their team page lists 10-15 devs on staff.)

3. Compared to the overall organizational finances, it suggests that if anything, some of the cost of running the service is being spread across their other offerings. The original vs modified forecasts for 2019 seem a bit, well, different- it's likely that the costs are still being worked out, and may be dependent on hard-to-predict growth.

It's also very notable that this hubbub seems to involve a relatively small amount of cash: the proposed funding model is a 60-40 split, with the service share divided among up to 26 groups. That says a lot about the role of building institutions to support preprints long term, and the need to help grassroots initiatives mature if we want to keep these services active.

https://cos.io/our-products/osf-preprints/ "...this fee structure accounts for $87,976 in contributions by the preprint services toward maintenance costs (38% of total)"
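As a quick sanity check on that quote, here is a sketch that backs out the total maintenance cost implied by "$87,976 ... (38% of total)". It assumes the 38% is rounded and applies to the same total; both are assumptions on my part.

```python
# Figures taken from the quoted COS page.
services_contribution_usd = 87_976
services_share = 0.38  # "38% of total", assumed to be a rounded figure

# Total maintenance cost implied by the two numbers above.
implied_total_usd = services_contribution_usd / services_share

print(round(implied_total_usd))
```

That lands in the same ballpark as the roughly US$230,000 figure quoted from the article.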

Disclaimer: I don't speak for any of the groups involved in this process, and comments are based only on these public documents. There may well be other numbers or context.

Maybe move from AWS to good old VPS/small-business hosting?

The financial problems seem to be happening at regional preprints like Indonesia or Africa. The people who run the service for them are surely using western labour. The obvious answer is for the regional preprints to switch to using local labour - ie run the service themselves. That would be a significant saving.

I have had my doubts about COS's technical foundations and strategic abilities and this and a number of the comments here do nothing to reassure me, especially in light of the glowing stories about dedication to long term sustainability that I have been told by some of their representatives.

It looks like that money is tempting for a Center for Open Science whose motto is to promote openness and transparency. Are such costs really transparent? There is no openness here, much less transparency. What kind of maintenance would justify such excessive costs? "... the servers are hosted online by the non-profit Center for Open Science (COS)" A non-profit center that asks for $200,000 per year? How much would it cost if it were for-profit, then? For less than one hundred dollars ($100) per year, you can get dedicated hosting plans with unlimited space. Hosting 26 servers, or even up to 100 servers with thousands of text files per year, on one structured platform (OSF) should NOT cost such high fees, should it?

The one preprint site that I use which is hosted by the COS seems to be doing okay, fortunately: https://blog.engrxiv.org/financials/

I still don't understand the point of the regional repositories. How does this "increase exposure" for their research, and if it did, why would that be valuable?

If you found one, you get to brag about it. Give regional ted-X talks. Schmooze with people closer to the money than usual. I presume that's the main point.

I could be wrong here but I was frustrated at how the article never really gave a summary of what a preprint repository is.

I mean, I can guess from context that it’s a place to store scientific studies. But I don’t know its role or why it’s any different from throwing some files in a Dropbox folder or a static website.

There are (at least) several goals:

1. persistence, which obviously is hurt by them running out of money. Dropbox links apparently die eventually, static websites bitrot, etc. Things put on arxiv can be trusted enough to put links to them in papers, if needed.

2. Discoverability. Preprint archives send out digests (usually people subscribe to particular areas). It's a low-effort way to distribute them. Relatedly, by having things semi-centralized, it's easier to discover papers (they get indexed by Google Scholar, for example...).

Maybe other benefits? Not sure.

Preprint servers are well-known to nearly all readers of the journal Nature, so there wasn't much need to explain.

https://duckduckgo.com/?q=preprint+server+site%3Anature.com&... for example lists many previous publications by Nature on the topic, including https://www.nature.com/articles/nmeth.3831 https://www.nature.com/articles/s41467-017-00950-5 and https://www.nature.com/articles/d41586-019-00199-6 . (The first was not in the flagship journal "Nature" but a sister journal.)

In short, a preprint is a version of a scientific paper before it's been formally accepted by a journal. See https://en.wikipedia.org/wiki/Preprint .

The first real electronic preprint repository is https://en.wikipedia.org/wiki/ArXiv . That gives more background information on the topic.

abought describes some of the differences between a set-of-files and a preprint repository at https://news.ycombinator.com/item?id=22319686 .

Ha, imagine how much Nature Publishing Group are crying in their coffee while they write that headline! Open science has the same free-rider problems as open source, but also the same network effects with huge benefits.

I've always wanted to take a swing at re-institutionalizing science as a betting game. It'd be such a good business model for a free preprint server. Best predictor wins! Cheaters punished!

Could you expand? What exactly do you mean? What would people be betting on?

The outcome of an experiment and its reproducibility over n independent repetitions.

EarthArXiv's statement might bring some additional context: https://eartharxiv.github.io/cos.html

“... sustain its hosting service in the long term, which will cost about US$230,000 in 2020”

For hosting 26 preprint servers. Is that high, low, or about right?

If I was hosting 26 reasonably-available servers in my home office, that's quite high.

If I had to host 26 highly-available servers available to a geographically distributed group in a commercial or academic space with paid maintenance staff and some semblance of support.. that's actually pretty darn cheap.

That's about $737/m/server. Seems high to me.
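For reference, the per-server figure follows from the US$230,000 quoted above spread evenly over 26 servers and 12 months. A small sketch, assuming an even split, which the article does not claim:

```python
annual_cost_usd = 230_000   # figure quoted from the article
servers = 26                # preprint servers hosted by COS
months_per_year = 12

# Even split across servers and months (an assumption for illustration).
per_server_per_month = annual_cost_usd / servers / months_per_year

print(round(per_server_per_month, 2))
```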

This cost-per-server implies that all the money goes to the servers. There may be humans who need to get paid too. Unless they're claiming that that's just the technology cost?

Ridiculously high, but to be expected.

Seems like a job for a peer to peer system if bandwidth is a major expense.

I can't imagine that it is.

Let's say a paper is 1 MB, downloading it 1,000 times costs 10 cents at AWS prices. 10,000 lifetime views would be a lot for an arXiv paper, and that is $1. Plus storing 1,000 papers at that rate is another 10 cents a month. Storing a million papers is $100 a month.

The thing people miss about AWS's charging for bandwidth is that it's an efficient way to lower their costs handling DMCA requests. You might get 20,000 downloads of a 2 GB camcorder capture of a movie in the theaters; the prospect of a $4000 bill to do that encourages people to use other platforms instead. AWS doesn't have their legal department running big bills dealing with this, and they pass on the savings, and the generally safer environment to do whatever it is you are doing w/o harassment, to their customers.
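The back-of-envelope numbers in the two paragraphs above can be reproduced with a sketch like this. The flat $0.10 per GB rate for both transfer and monthly storage is the rate implied by the comment, not a quoted AWS price list, and 1 GB is taken as 1,000 MB.

```python
RATE_USD_PER_GB = 0.10  # assumed flat rate implied by the comment, not an official price
MB_PER_GB = 1000        # decimal units, matching the round numbers in the comment


def transfer_cost(size_mb, downloads, rate=RATE_USD_PER_GB):
    """Cost in USD of serving `downloads` copies of a file of `size_mb` megabytes."""
    return size_mb * downloads / MB_PER_GB * rate


def storage_cost_per_month(size_mb, count, rate=RATE_USD_PER_GB):
    """Monthly cost in USD of storing `count` files of `size_mb` megabytes each."""
    return size_mb * count / MB_PER_GB * rate


paper_mb = 1
views_1k = transfer_cost(paper_mb, 1_000)                   # about 10 cents
views_10k = transfer_cost(paper_mb, 10_000)                 # about $1
store_1k = storage_cost_per_month(paper_mb, 1_000)          # about 10 cents a month
store_1m = storage_cost_per_month(paper_mb, 1_000_000)      # about $100 a month
movie = transfer_cost(2 * MB_PER_GB, 20_000)                # about $4000

for label, cost in [("1k views", views_1k), ("10k views", views_10k),
                    ("store 1k", store_1k), ("store 1M", store_1m),
                    ("movie", movie)]:
    print(f"{label}: ${cost:,.2f}")
```

Swapping in a different `rate` shows how sensitive each figure is to the assumed price.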

COS’s platform, the Open Science Framework, hosts large amounts of data in addition to the preprint itself. It’s more of a “scientific workflow platform” than a “preprint service”.

Per the budget spreadsheet linked elsewhere, preprints seem to be 20+% of site traffic, but only 0.4-1% of storage costs for the overall site.

This could indicate that many preprints are posted as standalone artifacts (eg not every preprint would include rich datasets alongside the PDF). In effect, people could be skipping the workflow and just using it for the preprint.

For bandwidth usage, they're estimating 20-25% of site traffic. Hosting costs are not a trivial expense, but they're not the main driver of the budget, and the bigger uncertainty seems to be on labor costs.
