ACM costs vs. arxiv.org costs (twitter.com)
240 points by luu 4 months ago | 104 comments

I wrote this tweet. Wasn't sure anyone would read it - glad it's got noticed, because I do think it's an important issue!

I'd love to know where that money at IEEE and ACM is going. The annual reports don't make it at all clear, unfortunately. Obviously, there are for-profit publishers where the money is simply going to huge profit margins. But that's not the case for professional societies.

One thing I noticed when I had to sign up to ACM for a conference a few years ago was that I got harangued by sales-people from ACM for months afterwards, trying to get me on a call to have me buy more expensive memberships. It wasn't an automated system - it was an actual person, trying to get me onto an actual phone call with them. It occurred to me at the time that that must be very expensive, yet it's still a profitable thing for them to do - so there's clearly a lot of money changing hands...

I don't think this is a good sign. Perhaps the professional societies can openly publish a full breakdown of what they're spending money on?

Thank you for tweeting this.

I'm making my way through the IEEE annual report to which you linked[a], and am dumbfounded too. Not only does the IEEE spend $193.4M/year operating "periodicals and media," they also spend $103.5M/year on "membership and public imperatives" (!?) and $38.5M/year on "standards" (!?). Note that these figures do not include the amounts spent on conferences, which are at least tangible.

Moreover, it looks like the organization has $523.8M in investments at fair value (p. 43) -- an endowment greater than that of the vast majority of US colleges and universities. Am I imagining things, or does the IEEE look a lot like "a small hedge fund attached to a professional organization?"

To put that last figure in perspective, if income and gains from the IEEE's investment portfolio are, say, at least 5%/year, they could fund all of arXiv.org's annual expenses 20 times over in perpetuity without dipping into capital.
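A rough check of that arithmetic, as a minimal sketch. The 5% return is the assumption stated above; the arXiv figure of about $1.2M/year is also an assumption, taken from the roughly $100k/month mentioned elsewhere in this thread rather than from the IEEE report.

```python
# Back-of-the-envelope check: how many times over could investment income
# alone cover arXiv's annual expenses?

IEEE_INVESTMENTS_USD = 523.8e6   # investments at fair value, per the annual report (p. 43)
ASSUMED_RETURN = 0.05            # assumption: 5%/year in income and gains
ARXIV_ANNUAL_COST_USD = 1.2e6    # assumption: ~$100k/month, as quoted elsewhere in the thread


def coverage_multiple(principal: float, annual_return: float, annual_cost: float) -> float:
    """Return how many times the yearly investment income covers a yearly cost."""
    return principal * annual_return / annual_cost


if __name__ == "__main__":
    income = IEEE_INVESTMENTS_USD * ASSUMED_RETURN
    multiple = coverage_multiple(IEEE_INVESTMENTS_USD, ASSUMED_RETURN, ARXIV_ANNUAL_COST_USD)
    print(f"Annual income: ${income / 1e6:.1f}M, covering arXiv about {multiple:.0f} times over")
```

Under those assumptions the income comes to about $26M/year, so "20 times over" is in the right range.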

Something smells rotten indeed.

[a] https://www.ieee.org/content/dam/ieee-org/ieee/web/org/corpo...

Another asset they have (and monetize) is a vast library of old papers. For instance, if you want to read Kildall's 1973 classic "A unified approach to global program optimization" then ACM will gladly let you do so... after you pay them $15.00 for the 8 page paper. https://dl.acm.org/doi/10.1145/512927.512945

It's hard to see how this is compatible with their claim to be "Advancing Computing as a Science & Profession".

Another reason why something like Sci-Hub is a public good despite its negative press.

I just clicked it and downloaded the paper?

(I agree with you; it's just possibly a poor example.)

If you're on a university or corporate network it's possible there is an org-level subscription that gives you "free" access to everything.

Clicked at home and got the paywall. Pretty sure you're on a network with access.

Spooky! That makes sense.

Wouldn't have known because I've been (ummm...) "Acquiring" (libgen) them elsewhere

> and $38.5M/year on "standards" (!?)

Does "standards" mean things like these?

WiFi: https://en.m.wikipedia.org/wiki/IEEE_802.11

Floating point: https://en.m.wikipedia.org/wiki/IEEE_754

Those types of standards seem pretty important.

Now that you mention it, probably, yes.

Still, $38.5M/year?

Those are very expensive, although that only pushes the ripoff back one level, because why do they cost money at all?

because you have to run lots of meetings and pay technical editors to go over them, produce diagrams, etc?

i'm not arguing that any specific amount is appropriate, but "free" seems a little low.

I was secretary for IEEE 802.20. It is a volunteer position, as is the chair position, but some evil company usually has the "independent" chair on its payroll (I'm looking at you, QUALCOMM). There are no paid editors: draft standards are written by committee or by companies, as are change proposals (I have standardized many), and the chair usually merges the changes, for free. The only people paid at conferences are the registration people.

Wow, I didn't realize the standards stuff is also done by volunteers.

> The only people paid at conferences are the registration people.

...if by paid you mean "free labor from local organizers and their students, plus some $1K-$3K travel awards for a PhD student in exchange for 20 hours of front desk registration, with maybe one association employee overseeing everything".

TBF, the travel awards are actually a pretty reasonable rate (avg $2K for 1/2 week of labor) if you close your eyes and ignore that:

1. CS PhD students are under-paid by a factor of 10x

2. CS PhD students are probably the only tech workers who are expected to raise funds on their own for work-related travel multiple times per year

3. these organizations double-dip. They are essentially paying a bit over market rate for labor required to run their conference registration desks while claiming that this relatively small amount of $$$ gives them some moral justification for price-gouging everything else (see: any ACM/IEEE statement about open access policy, which inevitably mentions sponsoring student travel...)

4. ...and then still lean on conference organizers to find corporate sponsors to supplement those travel awards.

5. and those same students do a ton of other free labor in the form of writing and reviewing papers.

And so on.

Basically, any time I think "oh yeah, ACM/IEEE do that thing I didn't think about. Well, those people must be paid / must be receiving honest-to-goodness grants", it turns out: nope! All the work is done by volunteers, and any "grants" come with labor requirements.

TBF, none of that is totally unreasonable until you realize how much money these orgs are raking in. Where the hell does it all go!?

In the humanities, this is how all the journals work: professors and grad students do all the writing and editing for free for the reputation it gives them, then some asshole company charges $30 for a 5 page PDF from 1989.

Isn't the ACM run by the community for the community? Can we meet Pancake at a conference and ask her directly for clarification if the reports aren't detailed enough?

FYI to save people some Googling...

I thought "Pancake" was an auto-correct mistake, because I'd never heard that surname before. But it really is the ACM president's surname.

Around here in the emergency medicine field we have a Norma Pancake and a Dr Waffle.

This makes me way more giddy than would be appropriate if I were to meet them in person.

There's a lot to critique in publishing and associated costs, but this tweet is unfortunately factually wrong.

From the linked article, ACM's publication costs are $10.9M, not $33.7M.

One of the ACM's major publication initiatives over the last 3-5 years has been an overhaul of their publication templates and publication workflow, to ensure greater consistency in publication formatting, improve accessibility, and archive publications in more future-proof formats. There are also the ongoing costs of creating and indexing metadata (ACM tracks more metadata than arXiv, including resolved citations) and of preservation (ACM buys failsafe perpetual access services from Portico; arXiv has mirrors at other university libraries).

Should it cost $10.9M? I am not sure. Does it cost a lot more than what arXiv does? Yes.

For a costing exercise: the service ACM buys from Portico is archival and republication. If ACM goes insolvent, Portico flips on their archive and the content remains available. How would you price this service, knowing that when it is actually needed, it's because your customer can no longer pay bills, and you now need to take up their hosting (and all related costs) for approximately forever with no further revenue? I think a network of university libraries would be a more cost-effective way to provide this service, but it's the kind of thing that people working on publication and archival professionally think about, and that factors into the cost of professional archival-level publication.

(I cannot speak to IEEE.)

> their publication templates and publication workflow, to ensure greater consistency in publication formatting, improve accessibility, and archive publications in more future-proof formats

Publication workflow, formatting and accessibility? For every paper I’ve done I just send the ACM a final PDF produced myself from a LaTeX template that hasn’t changed in years. What’s the workflow for taking an already final PDF from authors and uploading it to a file server?

That workflow has changed in the last few years.

- Brand new templates (introduced about 5 years ago, the LaTeX template has had multiple updates per year since then)

- Workflow that makes use of the source (or possibly codes the source embeds in the PDF, but you have to provide LaTeX source to ACM these days)

- Papers now render in both PDF and HTML (and the HTML looks quite good), this started showing up within the last 1-2 years

- Papers are archived in an XML-based format (something called JATS, I do not know details) to facilitate rendering to PDF, HTML, ePub, and other formats not yet devised

That doesn't seem too impressive. It's essentially a workflow that a few universities could band together and replicate via an open source project relatively easily IMHO.

As an example, Pandoc can already handle 90% of this type of workflow by itself (converting LaTeX to various XML formats). This could be an open source project shared among a few universities, or developed by a single body like the ACM, and used across dozens of publications and fields. Even two or three full-time people working on this would cost much less than $1M per year.
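A minimal sketch of what such a conversion step might look like, assuming Pandoc is installed and on the PATH. The file names are hypothetical, and this says nothing about how the ACM's actual pipeline works; `jats` is one of Pandoc's output formats.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_command(source: Path, target: Path, output_format: str = "jats") -> list[str]:
    """Build the pandoc invocation that converts a LaTeX source file."""
    return [
        "pandoc",
        str(source),
        "--from", "latex",
        "--to", output_format,
        "--standalone",
        "--output", str(target),
    ]


def convert(source: Path, target: Path, output_format: str = "jats") -> None:
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    subprocess.run(pandoc_command(source, target, output_format), check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(pandoc_command(Path("paper.tex"), Path("paper.xml"))))
```

Real papers with custom macros and packages would likely need more than this one call, which is presumably where the remaining 10% goes.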

That sounds pretty counterproductive. So now authors, in addition to keeping up on their research, need to keep up on the updates to the ACM's LaTeX stylesheet? And there's every chance that the version that is formatted well with the ACM stylesheet when you initially submit will have formatting bugs six months later because the template got updated? And now you have a whole new toolchain to debug when the HTML version of your paper misaligns your tables? And maybe the HTML version that looks fine today will get mangled in 2028 after you retire and they update the CSS, as has happened with most of the New York Times articles?

It sounds like the ACM has a really different set of priorities than libraries and researchers do, one that values increasing headcount over guaranteeing permanence.

I'm not sure how it works at ACM, but often, it's people retyping the contents of your article into a JATS-XML template and adding additional metadata (authors, date of publication, perhaps who funded it, etc.), which is then used to generate several outputs (e.g. PDF, HTML, but also citation lists, etc.).

>The Journal Article Tag Suite (JATS) is an XML format used to describe scientific literature published online. It is a technical standard developed by the National Information Standards Organization (NISO) and approved by the American National Standards Institute with the code Z39.96-2012.


>LaTeXML is a free, public domain software, which converts LaTeX documents to XML, HTML, EPUB, JATS and TEI.


The wonderful thing about standards is that there are so many of them. And each one has variations.

> people retyping the contents of your article

Wow. Well I can imagine that’s expensive.

Thank you for the correction.

IEEE's $193m is where we should focus our attention, when it comes to this expense line.

I agree. I have no idea what IEEE is doing that costs that much. And while I don't take as hard a line against them as I do against Elsevier, I have never published with them and don't currently have any plans to change that.

I'm not sure how many articles are published a year in ACM [1], but the answer seems to be a few 10,000s. That's a per-article publishing cost of a few hundred dollars, which is not unrealistic to me.

[1] The ACM Digital Library claims 2.8 million published over 84 years, or about 33,000/year if divided equally over the years (which is laughably false). Some number of that quantity may include citations for keynotes or posters, which aren't really research papers, but I don't have a good handle on that rate.

Annual report 2019 gives some details - 34,000 full text articles were published in the DL. This will exclude non-archival content like keynotes, posters, etc if conference organisers provide correct metadata.
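Putting the figures from this subthread together, as a rough sketch. It assumes the $10.9M publication cost quoted upthread and the 34,000 articles/year from the annual report describe the same year and the same set of articles, which may not hold.

```python
# Per-article publishing cost and the naive yearly average from the footnote.

ACM_PUBLICATION_COST_USD = 10.9e6   # publication cost quoted upthread
ARTICLES_PER_YEAR = 34_000          # full-text articles per the 2019 annual report
DL_TOTAL_ARTICLES = 2.8e6           # Digital Library total claimed in the footnote
DL_YEARS = 84


def cost_per_article(total_cost: float, articles: int) -> float:
    """Return the average cost per published article."""
    return total_cost / articles


if __name__ == "__main__":
    print(f"Cost per article: ${cost_per_article(ACM_PUBLICATION_COST_USD, ARTICLES_PER_YEAR):.0f}")
    print(f"Naive yearly average: {DL_TOTAL_ARTICLES / DL_YEARS:,.0f} articles")
```

That works out to a little over $300 per article, consistent with the "few hundred dollars" estimate.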

One great thing about a $99/year ACM membership is that it includes full access to the O’Reilly service formerly known as Safari Books, normally $499/year. I have no idea what the volume pricing on Safari subscriptions is, so can only say there’s a possibility that ACM article sales subsidize Safari books; I do know that, as an individual who has to learn new things and look old things up constantly, the inclusion of O’Reilly/Safari makes an individual ACM subscription a fantastic deal.

Additionally, having seen someone organize an academic conference once, I do know that IEEE provides a conference with things like bank accounts, insurance, and purchasing departments that can meet the creditworthiness requirements of major hotels. It also ends up covering the shortfall if the conference winds up losing money.

I’m all for improving efficiencies where possible, and there are definitely some problems with these organizations, don’t get me wrong; but I did want to emphasize that both organizations are definitely providing real value to parts of the computing community.

Disclaimer: I haven’t read the link as twitter is (intentionally) inaccessible from my machine.

I'm curious why twitter is inaccessible from your machine.

From "intentionally" I would guess a social network DNS blocklist that he self-configured, although it could also be a workplace policy.

Yeah, self-configured DNS blocklist. Some of the reasons I’m not a fan of twitter apply to hacker news as well; for example, both can be distracting and contribute to procrastination.

Compared to hacker news, my main issues with twitter are a much lower signal-to-noise ratio, lack of prioritization over time windows—if I take a day/week/month away from hacker news, it’s easy to catch up on the top things I missed at https://hn.algolia.com —lack of depth in content, and pervasive tracking across other sites including through t.co.

Isn't this a typical case of Parkinson's law?

See e.g. https://en.wikipedia.org/wiki/Parkinson%27s_law (didn't find a free version of the book unfortunately).

There is even a fitting example in the book where Parkinson describes how the administration of the British Navy became bigger and bigger although there were fewer and fewer ships to manage. In the present case, one would have to assume analogously that with the introduction of the Internet and the advancing automation, considerable costs were eliminated.

So the question is: have ACM and the other mentioned organizations already reached the tertiary and last stage of INJELITITIS?

> have ACM and the other mentioned organizations already reached the tertiary and last stage of INJELITITIS?

Frankly, ACM membership and publication is a good deal compared to many other societies. For $200 annual dues, you get unlimited access to all journals.

Contrast to the American Chemical Society. For $175, you get:

1) Access to 50 articles for 48 hours

2) The right to purchase access to additional articles at $12 a pop for 48 hour access.

You want to look at the article again after 48 hours? Pay up again.

[0] https://www.acs.org/content/acs/en/membership/member-benefit...

Well, that was not the argument. I agree, there are much greedier organizations around. However, this does not answer the question why the operation of these organisations is so expensive compared to their services. Parkinson has a good explanation though ;-)

This post makes no sense. ArXiv is a site to which papers are posted. ACM and IEEE are technical societies with a range of publications professionally managed, peer reviewed, and edited. They serve different needs and have--surprise surprise--different costs.

> This post makes no sense. ArXiv is a site to which papers are posted. ACM and IEEE are technical societies with a range of publications professionally managed, peer reviewed, and edited

I have been a reviewer for many ACM, Elsevier, etc. conferences. The reviewers and editors don't get any money for their service. Regardless, "professional management" is not a sufficient argument for a 33x / 190x price difference. IEEE annually spends $92M on people costs, and I doubt a single $ of it goes to any of these peer reviewers or editors.

It's not just about professional management. I suspect IEEE and ACM have much more complicated infrastructure to handle submission, peer review, production, etc., which arXiv doesn't have. I'm not justifying the costs -- I wouldn't be able to do that unless I see the breakdown of costs, e.g., how much it goes to the society, how much it goes to post-production, etc. I would also assume that ACM and IEEE journals also do a bit of copy editing that goes beyond mere spell checks before publishing the articles. Although copy editing and post-production is generally outsourced, it is still expensive. All this adds to the cost.

We don't need to speculate how much it costs to run a world class journal. The cost above arxiv is 15$ per Submission:


IEEE charges 1700$+

Correction, "Our total costs probably average about $30 per accepted article." https://discreteanalysisjournal.com/post/40 And that seems to be as bare bones as one can do, as they don't proofread and Scholastica is only used for peer review costing 10$ per submission.

I was quoting the number for submitted articles from Gowers's blog; your number is for accepted articles. Also, proofreading at publishers basically doesn't happen any more. And even if it did, it certainly doesn't account for the difference in price, not even remotely close.

You can call it bare bones, but as a scientist, it is very hard to see what value IEEE adds above this "bare bones" approach.

> Also, proofreading at publishers basically doesn't happen any more.

I'm sorry, but which publishers are you talking about here? I've published with several publishers in my field (physics) including APS, AIP, AAS, EPS, Springer, etc., and almost all of them do moderate to extensive copy editing, even for journals that aren't exactly high-impact.

> it certainly doesn't account for the difference in price, not even remotely close.

This is of course true, and I agree. But starting an argument against conventional journals by comparing their costs of operation with that of a preprint repository is disingenuous. And that is where I take issue.

But it's really not disingenuous. They simply don't provide value for money. They are, in economic terms, rentiers. Arxiv isn't all of the cost, but the arxiv costs are order of magnitude accurate, that's what Gowers showed. In physics JHEP is a good example of a high class arxiv overlay journal that is extremely successful without APCs.

The IOP and DPG with NJP also charge half of what IEEE Access do, and they are among the journals that actually provide some copy editing. As far as I recall my publications with the APS did not have any substantial proof reading done.

I actually think proof reading is a really valuable service, and I would be happy to pay for it optionally or in a transparent fashion. But even with proof reading we don't get to 1000s of dollars.

My understanding from talking to editors is that for professional societies, journal income subsidises other activities. Which is fine, but I would like to see that transparently declared. "APC 150$, Contribution to other IEEE activities 1500$". Structurally it's also questionable that library budgets should finance professional societies, but that's really the least of my concerns.

> In physics JHEP is a good example of a high class arxiv overlay journal that is extremely successful without APCs.

I don't think JHEP is an overlay journal anymore. AFAIK, JHEP is now published by SISSA/Springer with funds from CERN/SCOAP. SCOAP also pays for most articles in Phys. Lett. B and some articles in Phys. Rev. D. But SCOAP probably doesn't pay for open access as much as individual authors would have to.

> As far as I recall my publications with the APS did not have any substantial proof reading done.

APS did do a moderate amount of copyediting when I published with them in 2015. They (and most publishers) also check papers for plagiarism, and it's my understanding that the third-party services that they use for this charge a hefty fee [1]. arXiv only compares submissions with existing preprints on arXiv and not other journals.

> My understanding from talking to editors is that for professional societies, journal income subsidises other activities. Which is fine, but I would like to see that transparently declared.

I totally agree with the sentiment that most publishers charge way more than they should for open-access options. Premium open-access-only journals like Phys. Rev. X and Nat. Comm. are also problematic since it discourages authors who have smaller budgets and grants from submitting their papers to these journals.

[1] https://www.ithenticate.com/products

That's an unfair comparison. Discrete Analysis is an arXiv-overlay journal, IEEE hosts its own articles. Plus, Discrete Analysis doesn't typeset or proofread the articles it publishes. Again, I'm not justifying the exorbitant prices that most journals charge to publish using an open access option, all I'm saying is that it's unfair to compare arXiv with an actual academic journal. A preprint repository is in no way equivalent to an academic journal.

Arxiv hosting costs were posted elsewhere and are minimal. Less than the 30$ per published article that DA requires on top. From memory: <10$ per year per published article.

So we have roughly 1660$ unaccounted for for IEEE. PRX costs 4000$.
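The subtraction behind that figure, as a small sketch using only numbers quoted in this subthread. The hosting cost is the "from memory" upper bound above, so treat the result as approximate.

```python
# What is left of the IEEE fee after subtracting the known per-article costs?

IEEE_FEE_USD = 1700        # fee quoted upthread ("1700$+")
OVERLAY_COST_USD = 30      # Discrete Analysis: about $30 per accepted article
ARXIV_HOSTING_USD = 10     # upper bound recalled above: <$10 per article per year


def unaccounted(fee: float, *known_costs: float) -> float:
    """Return the part of a fee not explained by the listed costs."""
    return fee - sum(known_costs)


if __name__ == "__main__":
    gap = unaccounted(IEEE_FEE_USD, OVERLAY_COST_USD, ARXIV_HOSTING_USD)
    print(f"Unaccounted for: ${gap} ({gap / IEEE_FEE_USD:.0%} of the fee)")
```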

> I suspect IEEE and ACM have much more complicated infrastructure to handle submission, peer review, production, etc.,

To the tune of a hundred million dollars? I can’t even remotely imagine how. If someone asked you to design a system to do all of those above things, would you actually come to the conclusion it’d cost a cool hundred mil yearly to operate?

I'd actually like to see a breakdown of IEEE costs. AFAIK, IEEE offers print-editions of several of its journals. I don't know who reads hard copies of journals these days, but I'm sure printing on actual physical paper would add to operating costs.

They don't pay for peer review. I'm not sure what you mean by "professionally managed" or "edited" exactly - or why that would cost over $100m.

(Disclaimer: I wrote the tweet. Although I didn't expect it to appear on HN...)

Thanks Jeremy for highlighting this issue.

A lot of scientific publishers have hijacked "Open Access" to charge high fees for the same publication as before and pocket more money. For example, "Springer Blood Cancer Journal" charges $4,580 as OA fees. I can't imagine how one can rationalize that cost.

It's quite clear how you can rationalize it:

* to advance your career you need to publish in certain specific journals

* 5 grand is a reasonable amount of money to advance your career


* it is reasonable to pay 5 grand to publish in such a journal

PLOS ONE’s fee is $1,595. A factor of 3 doesn’t seem inexplainable to me. That’s “Springer Nature Blood Cancer Journal”, so factors of 1½ in “better editing”, “older, less efficient systems” and “fewer publications/year, so lower efficiency of scale” could already do it.
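A quick sketch of how those factors compound, using the two fees quoted in this subthread. The three factors of 1.5 are the hypothetical ones named above, not measured values.

```python
import math

PLOS_ONE_FEE_USD = 1595
SPRINGER_FEE_USD = 4580   # Blood Cancer Journal OA fee quoted upthread


def implied_fee(base_fee: float, factors: list[float]) -> float:
    """Return the fee implied by applying multiplicative cost factors to a base fee."""
    return base_fee * math.prod(factors)


if __name__ == "__main__":
    factors = [1.5, 1.5, 1.5]   # editing, older systems, lower economy of scale
    print(f"Combined factor: {math.prod(factors)}")
    print(f"Implied fee: ${implied_fee(PLOS_ONE_FEE_USD, factors):,.0f}")
    print(f"Observed ratio: {SPRINGER_FEE_USD / PLOS_ONE_FEE_USD:.2f}")
```

Three factors of 1.5 multiply out to more than the observed ratio of just under 3, so the explanation holds only if each factor really is that large.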

Also, is Nature working on digitizing old content? That can be costly (I remember reading somewhere that it could involve a) finding a library that has a copy and b) flying there to photograph it) and that’s a cost PLOS won’t ever have.

> PLOS ONE’s fee is $1,595. A factor of 3 doesn’t seem inexplainable to me.

You need to first explain why PLOS ONE is an appropriate baseline. A world class Open Access journal costs roughly $10 per submission [0]. Most arguments I have seen using PLOS ONE as a baseline talk about the "non profit" part of PLOS. It has to be stressed here that "non profit" doesn't mean PLOS works on a "non profit" business model. It just means that they generate profit AND the profit isn't distributed to its members, directors or officers [1].

[0] https://gowers.wordpress.com/2016/03/01/discrete-analysis-la...

[1] https://www.law.cornell.edu/wex/non-profit_organizations

They probably mean administration of peer review. Someone has to find a suitable reviewer, which can be a somewhat time-consuming task [1], write to reviewers, and just do all the coordination.

[1] There are now tools that automate and speed up finding peer reviewers, though, like https://www.prophy.science/referee-finder

> Someone has to find a suitable reviewer, which can be a bit time-consuming task [1], write to reviewers, just do all the coordination.

All done by volunteers!

> Someone has to find a suitable reviewer, which can be a bit time-consuming task [1], write to reviewers, just do all the coordination.

This is done by the Program Chairs of the conference who are professors at universities. ACM doesn't do or pay anything for it.

> surprise surprise--different costs

The surprise is NOT that the costs are different. The surprise is the order of magnitude of the difference. The surprise is also the absolute value of the cost of IEEE given its apparent activity.

I have no idea if IEEE's massive spending is reasonable, given I have no insight into their org. I do agree the size of the costs relative to the activities is suspicious.

It's fine to think that the IEEE cost is too high for the value provided, but there's no reason to measure it with respect to a baseline defined by the arXiv. The arXiv is literally just an automated system pushing bits around (plus some unpaid volunteers who do light moderating).

Likewise, I might think the person employed at the post office gets paid too much relative to the value they provide, but it would be silly to compare their salary cost to the maintenance cost of a blue steel Post Office box on the corner. ("Their health insurance alone costs 1000x the annual price of blue paint!")

> publications professionally managed, peer reviewed, and edited

That's done by the community - ACM don't fund that. They just run the conferences (which are paid and ticketed so presumably fund themselves) and host the paper files.

Not to defend the price differential, but someone has to solicit reviewers and manage the review process. This is going to roughly track the number of submissions and isn't free.

> someone has to solicit reviewers and manage the review process

Lol that’s done by volunteer community members as well. It literally is free to the ACM (but the people are paid to do it by their institutions as part of their jobs.)

Your lol is misinformed. ACM and IEEE produce and print actual dead-tree journals. This means selecting a fixed number of articles which are consistently formatted by an editorial process. Additionally, [edited to qualify: some but not all] IEEE Associate Editors are paid positions. Arxiv is not doing this.

We can argue/debate about what is the value-add (and precise workflow difference) of the IEEE editorial process vs. arxiv, but there is a difference.

Like the comment you responded to, I'm not defending the precise differential. I'm a past IEEE member (and author, and reviewer) and found their membership fees excessive.

> Your lol is misinformed.

In every ACM review committee I’ve participated in, it’s just volunteer members of the community emailing each other and using open source software to coordinate reviews.

I think all of the ACM conferences I contribute to have stopped actually printing physical proceedings. The consistent formatting is by a LaTeX document class, with an automated system that checks for bad fonts, text going into the margin, etc. The PDF that ends up in the Digital Library is the one the author submits (possibly with a watermark or stamp applied). I don't know if there is a small amount of ACM-provided human labour employed somewhere, but mostly they just provide a small amount of automation.

$193 million would get you 1000 employees at $193,000 a head.

Even taking into account overheads like desk space and pensions, that's a very large number of very well paid employees for basic secretarial work like soliciting reviewers
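The same headcount estimate with overheads made explicit, as a minimal sketch. The 50% overhead rate is a purely hypothetical placeholder, not a figure from the IEEE report.

```python
# How many employees does a budget cover at a given salary and overhead rate?

IEEE_PERIODICALS_BUDGET_USD = 193e6   # the "$193 million" figure above


def headcount(budget: float, salary: float, overhead_rate: float = 0.0) -> float:
    """Return the number of employees a budget covers at a fully loaded cost."""
    return budget / (salary * (1 + overhead_rate))


if __name__ == "__main__":
    print(f"No overhead: {headcount(IEEE_PERIODICALS_BUDGET_USD, 193_000):.0f} employees")
    # Hypothetical 50% overhead for desk space, pensions and so on.
    print(f"50% overhead: {headcount(IEEE_PERIODICALS_BUDGET_USD, 193_000, 0.5):.0f} employees")
```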

IEEE at our college would send a significant amount of money for the student org. In addition the student fee likely did not even cover the basic costs of access provided by third parties, etc.

They also have conferences, etc. for students, where the student fees almost certainly don’t cover the costs.

Their expenses may be too high, but the comparison to arxiv is not helpful.

You can host a peer-reviewed selection of papers as an "Overlay" over an open-access archive such as arxiv.org, at trivial cost. Closed-access journals don't even fund their peer-reviewers generally, so it's not like the "overlay" journal would be offering a worse deal from that POV.

Indeed, it costs very little to run a good quality journal and there's a growing network of them owned and managed by the scholarly community itself: https://freejournals.org/

Usually when the budget is opaque there's a reason for it.

Money really seems to be wasted for the most part, and the digital library has always been an embarrassment. Years later, it still has no API, and last I checked the terms of service forbid you from accessing it with software! Completely braindead.

The only reason to join ACM besides accessing the crappy digital library is to save money on conference registration fees, which are outrageous because of what the venues usually charge and get away with.

Usenix seems (to me at least) to be better run and had open access policies years before ACM. But the IEEE is largely a cesspool as well, charging absurd fees for digital library access, and sponsoring a number of spamferences.

Actually I might stand corrected somewhat; if you want a Safari/O'Reilly subscription, ACM might be the way to do it?

However... I'd probably be in favor of unbundling this and other commercial services from ACM memberships.

If it's hosting costs the ACM is worried about, they could offer torrents as a free download option to lighten the load.

I imagine the vast majority of the digital library is included here:


80,700,000 articles as of today. These torrents contain every single article that's ever been accessed through sci-hub. Database dumps at http://gen.lib.rus.ec/dbdumps/ provide an index.

For those who are trying to follow the money, often organizations like these have expensive old-school benefits like defined-benefit pension plans.

Following the money is an excellent plan!

I haven't heard anything about defined-benefit pension plans, or their use in professional societies. Could you say a bit more? Or provide a link to learn more?

What are the costs here... 25M downloads per month doesn't seem like it should cost $100k per month. Must be total archive size I imagine?

This isn't insanely high given its traffic and that it's attached to an educational (not for profit) institution. They need developers, infrastructure, support, people to process the papers, etc...
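For scale, a quick sketch using the two figures quoted in the parent comment ($100k/month and 25M downloads/month), taken at face value.

```python
# Average cost per download implied by the figures in the parent comment.

MONTHLY_COST_USD = 100_000
MONTHLY_DOWNLOADS = 25_000_000


def cost_per_download(monthly_cost: float, monthly_downloads: int) -> float:
    """Return the average cost of serving one download."""
    return monthly_cost / monthly_downloads


if __name__ == "__main__":
    unit_cost = cost_per_download(MONTHLY_COST_USD, MONTHLY_DOWNLOADS)
    print(f"About ${unit_cost:.3f} per download, or {unit_cost * 100:.1f} cents")
```

That is under half a cent per download, with staff and processing folded in alongside bandwidth.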

Is arxiv.org mirrored on archive.org? The latter has its Wayback Machine of course, but that might not necessarily follow .ps and .pdf links.

Wayback Machine crawls files as well as pages.

They do provide bulk access [1], so I'm sure there will be backups. The numbers on that page are quite out of date, so I'm not sure it's current.

[1] https://arxiv.org/help/bulk_data_s3

ArXiv has historically been extremely hostile to crawlers, which is one of the reasons for its low costs.

Bah. If they average thousands of downloads per file, there's plenty of room in there for some crawlers.

And the total data set is less than a terabyte; seed a torrent somewhere for $20.

The user-pays S3 bucket is a good thing to have, but S3 is much more expensive than serving data needs to be.

They started in 1991, when a terabyte was an unimaginably huge quantity of data and it was common for anonymous FTP servers like xxx.lanl.gov to request that you not connect until after business hours to avoid interfering with the main purpose of the machines. When I joined the internet in 1992, our 7.5-MHz VAX had a 56-kbps frame-relay link to New Mexico Technet (TECNET on our DECNET), which I think may also have provided LANL's rather beefier internet connection. They started providing WWW access in 1993, before Apache added preforking to NCSA HTTPD, and in fact I think before NCSA HTTPD itself. This means that initially every new HTTP request involved forking a new child process from the HTTP server, which took a few hundred milliseconds. This is the context in which the arXiv's hostile stance toward spidering was established.

I agree that it would be an extremely valuable course of action to seed a series of torrents, since a single torrent wouldn't work; it would have to be replaced every time a new paper was uploaded, fragmenting the swarm enough to render it useless. Also, they could surely use Fastly and permit spidering.

> They started in 1991, when a terabyte was an unimaginably huge quantity of data

The collection was a lot smaller then. The whole thing has fit on one hard drive for a long time. And as far as interpreting my post as criticism goes, apply it to the last ten years only.

> This is the context in which the arXiv's hostile stance toward spidering was established.

They should have reconsidered it at some point.

> series of torrents

Sure. A torrent for each 500MB chunk they already collate, or yearly, or both.

arXiv has bulk dumps that anyone can fetch from S3. They aren't updated every day, or even every month, but they do exist. I've used them myself.

Something seems weird. At 25,000,000 downloads per month, and assuming an average PDF size of 1 MB, that is only about 776 GB of data served per day. I don't think that can justify the $1.3M/year figure.
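For anyone who wants to check that arithmetic, here is a rough back-of-envelope sketch. The 1 MB average PDF size and the 30-day month are assumptions, not measured values; the download count and budget are the figures quoted in this thread.

```python
# Back-of-envelope check of the traffic and cost figures quoted in the thread.
DOWNLOADS_PER_MONTH = 25_000_000   # figure quoted in the thread
ANNUAL_BUDGET_USD = 1_300_000      # figure quoted in the thread
AVG_PDF_BYTES = 1_000_000          # assumption: 1 MB (decimal) per PDF
DAYS_PER_MONTH = 30                # assumption: 30-day month


def daily_bytes(downloads_per_month, avg_bytes, days=DAYS_PER_MONTH):
    """Average number of bytes served per day."""
    return downloads_per_month * avg_bytes / days


def cost_per_download(annual_budget_usd, downloads_per_month):
    """Total budget divided by yearly downloads (covers all costs, not just bandwidth)."""
    return annual_budget_usd / (downloads_per_month * 12)


per_day = daily_bytes(DOWNLOADS_PER_MONTH, AVG_PDF_BYTES)
gib_per_day = per_day / 2**30               # binary gigabytes per day
mbit_per_s = per_day * 8 / 86_400 / 1e6     # sustained average bandwidth
usd_each = cost_per_download(ANNUAL_BUDGET_USD, DOWNLOADS_PER_MONTH)

print(f"{gib_per_day:.0f} GiB/day, {mbit_per_s:.0f} Mbit/s average, ${usd_each:.4f} per download")
```

Under these assumptions the average load is well under 100 Mbit/s, which suggests that raw bandwidth is a small part of the budget and that most of it goes elsewhere (staff, moderation, support).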

For the sake of the argument, let's assume the following:

- If you factor all the man-hours required to keep the site up and running (ignoring the setup costs), then you'd have about one engineer for a whole year.

- If we go for a simple S3-based download and accompanying infrastructure, it would be safe to assume that it would cost AT LEAST about $700k per year (AWS is not cheap).

I'd say it's about right.

Can anyone speak to whether ACM and/or IEEE memberships are worth it for a professional developer? I loved reading stuff like the History of Programming Languages papers back when I had student access in college.

Taking someone's money to fund research and then not giving them access to that research is theft. If that money is tax money, I'd call it treason.

The big bucks are in patents, not in The Proceedings of the Fifth Annual ACM Intergalactic Conference on Subsubsubsfied.

Doing something about https://en.wikipedia.org/wiki/Bayh%E2%80%93Dole_Act seems way more important than fixing publishing...

> Taking someone's money to fund research and then not giving them access to that research is theft.

That seems like hyperbole to me. I think it falls more into the category of "inequitable exchange" or "bad deal".

And except for simple formalities, arXiv isn't reviewed, typeset, or anything else that a normal journal usually entails (and typesetting costs... hands up, who is setting all the integral d's in \mathrm?). And yet people often cite arXiv blindly...

ACM and IEEE don’t review either, that’s all coordinated and done by volunteers, at least for every conference or journal I’ve reviewed for. Typesetting also seems irrelevant, I read a lot of Arxiv papers and have never found typesetting bad enough to matter.

There seem to be other things ACM/IEEE does like helping pay for conference stuff or somehow supporting student chapters. But the literal editorial role looks oversold.

\renewcommand{\d}{\mathrm d}

I prefer \newcommand{\intd}{\,\mathrm{d}} to separate it a little.
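For comparison, a minimal sketch of how the two macros above might be used side by side; the integrand is just a placeholder chosen for illustration.

```latex
\documentclass{article}
% Redefines \d (normally the text dot-under accent) as an upright d.
\renewcommand{\d}{\mathrm d}
% Alternative name that also inserts a thin space before the upright d.
\newcommand{\intd}{\,\mathrm{d}}
\begin{document}
\[
  \int_0^1 x^2 \d x
  \qquad\text{versus}\qquad
  \int_0^1 x^2 \intd x
\]
\end{document}
```

The second form leaves \d untouched, which matters if the document also uses the dot-under accent in running text.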

(arxiv.org is not archive.org if someone wants to fix the title. both are absurd creations that can't possibly exist, though.)

> both are absurd creations that can't possibly exist, though.

How so?

It was an endearing comment.

I intended that as a reference to an XKCD strip[0]. Folks often talk about public goods like scientific research, as if the "real stuff" couldn't possibly happen without private investment. ¯\_(ツ)_/¯

[0] https://xkcd.com/2102/


Fixed. Thanks!
