Great to see a large "Signatories to the Declaration of Support" and that it's hosted by the Internet Archive. Keep fighting the good fights my friends! :)
This doesn't look especially useful. You can search for small sentence fragments and get the paper title I guess? But you can already do that on Google Scholar.
You can't read the papers or see their citations or cited-bys. What's it for? Poking Elsevier?
I saw that article, which portrayed that statement as speculation by a non-expert, before even a post-mortem examination. I'm not saying something controversial happened, but I haven't found information on what did happen.
Hey this is an interesting idea. How would it be done though?
For example, I've noticed certain similarities across classical statistics, machine learning, and digital signal processing. But they all seem to come at it from different angles, and they often use different terminology. Sometimes a field will rediscover something discovered elsewhere a long time ago, e.g. the trapezium rule.
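To make that concrete, here is a minimal Python sketch (my own toy example, not taken from any of these fields' textbooks) computing the same integral two ways: as the numerical analyst's trapezium rule, and as the DSP practitioner's two-tap FIR filter followed by a scaled sum. Same algorithm, two vocabularies.

    # The trapezium rule, written in two dialects.
    import numpy as np

    x = np.linspace(0.0, np.pi, 1001)
    y = np.sin(x)                  # the integral over [0, pi] is exactly 2
    h = x[1] - x[0]                # uniform sample spacing

    # Numerical-analysis view: sum of trapezoid areas.
    quadrature = h * (y[0] / 2 + y[1:-1].sum() + y[-1] / 2)

    # DSP view: a 2-tap FIR filter averaging adjacent samples, then a sum.
    filtered = np.convolve(y, [0.5, 0.5], mode="valid")   # (y[i] + y[i+1]) / 2
    dsp = h * filtered.sum()

    print(quadrature, dsp)         # both print ~2.0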
That sounds super interesting! The results will depend on the kind of information this database provides about each paper. We don't yet know what processing each paper goes through or the quality of the "description" in the database. Assuming it is good enough, this would be a really interesting project to work on.
One of my colleagues, Richard Zanibbi, leads the Document and Pattern Recognition Lab[0] here at the Rochester Institute of Technology. They're doing a lot of interesting work on formula recognition and search.
I just finished looking into their projects. They are very promising. I don't know why they don't try to engage with the OSS community. They could get their models used in a way that helps many people and opens up many new uses for their work.
>"And around the same time that he heard about the Rameshwari judgment, he had come into possession (he won’t say how) of eight hard drives containing millions of journal articles from Sci-Hub [...] Malamud began to wonder whether he could legally use the Sci-Hub drives to benefit Indian students [...] Asked directly whether some of the text-mining depot’s articles come from Sci-Hub, he said he wouldn’t comment"
(It'd be nice if there were coverage from a source other than this scientific publisher, whose biases are obvious.)
There are torrents available. I'm unsure if I can link to the site directly without getting banned so you'll have to check the Gizmodo article for the link to the site containing the torrents. [0] It's missing some of the 2021 papers though, judging by the dates.
also, I want a version that lets me browse papers. Unlike people in academia or other research fields, I don't know the paper that I want to pirate yet.
from my lived experience, this request gets dismissed as absurd by many people, because their lived experience is always having feeds of papers and abstracts they can start from. or they point to a browser extension that doesn't do what I'm asking at all as good enough (there are some that automatically give you the sci-hub link for whatever paper you're looking at, which still leaves me at square one: which paper?).
Unpaywall is a browser extension you might find useful. It provides links to open access papers, with the links appearing on pubmed pages, publisher pages, etc. A significant proportion of papers are now open access so this tool is very often useful.
That's pretty crazy to think about, even if you consider overhead and multimedia assets like images often included in PDFs. I remember the old "the library of congress fits on this CDROM" analogies (which wasn't entirely true) but this takes it to a whole new level.
At some point, it seems like it will be far easier in research to skip the lit review and just do the work; then, if you later learn you did the same work someone else did, compare results for consistency. We may yet get through the hurdle of the reproducibility crisis thanks to the deluge of information. The underlying issue, though, is that because you couldn't find the related effort (which is why you independently replicated it), you may also never find all the duplicate efforts to compare against for consistency.
PDF is fairly inefficient compared to formats like DVI (and consider that so many papers are produced using TeX anyway, though figures may be in various image formats).
But 77TB? You could host the whole thing in a shoebox with ten 8TB flash drives, a 10-socket powered USB hub, and a Raspberry Pi.
Scihub has been a life saver for me once I started working on some of the more obscure areas of AI such as signal/time series processing. Anything off the beaten path is locked up behind paywalls, and I'm sorry I ain't paying $40 just to see if someone's paper sucks or not (which 95% of them do, in this particular niche, especially the ones shielded from scrutiny by paywalls).
95% of all papers I have read have sucked. Maybe they just weren’t what I was looking for but a lot of them I couldn’t believe got published as anything novel
Of the remaining five percent, at least in software you can be sure at least 80% (4% of the total) doesn't actually work when tested. It is beyond frustrating to deal with research, to the point that these days unless the algorithm is very well described or there is source code available I have to assume the researchers are just lying.
> It is beyond frustrating to deal with research, to the point that these days (...) I have to assume the researchers are just lying.
This is not something new, or even from this century. The Royal Society, which was founded in 1660 and is a landmark in the history of science, adopted the motto "take nobody's word for it".
Recent AI stuff from major labs on Arxiv is pretty good, but yeah, anything that's AI+some other field is usually pretty bad. It's usually written by someone in that other field who might be an expert in their own domain, but who knows very little about AI or even just numerical optimization in general. The fact that such "work" is accepted uncritically by publishers doesn't inspire a lot of confidence in the value they purportedly add. It's right on the surface: "awesome" results are easy to achieve in AI if you screw up your train/val split, or deliberately choose an extremely weak baseline.
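To illustrate that last point, here is a minimal sketch (a made-up toy, not any specific paper's pipeline) of how a botched train/val split manufactures an "awesome" result: duplicating rows before splitting leaks validation points into training, so even pure-noise labels score near-perfectly.

    # Train/val leakage in miniature: duplicate rows before splitting, and
    # a 1-nearest-neighbor model "solves" a dataset with random labels.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = rng.integers(0, 2, size=200)      # labels are pure noise

    # Wrong: duplicate the data (e.g. naive oversampling) BEFORE splitting.
    Xd, yd = np.repeat(X, 3, axis=0), np.repeat(y, 3)
    Xtr, Xva, ytr, yva = train_test_split(Xd, yd, random_state=0)
    print(KNeighborsClassifier(n_neighbors=1).fit(Xtr, ytr).score(Xva, yva))
    # close to 1.0: bogus, since val rows have exact duplicates in train

    # Right: split first; oversample (if needed) on the training set only.
    Xtr, Xva, ytr, yva = train_test_split(X, y, random_state=0)
    print(KNeighborsClassifier(n_neighbors=1).fit(Xtr, ytr).score(Xva, yva))
    # around 0.5: honest, the labels really are noise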
How about training a language model on ~~the SciHub texts~~ some large private repository of articles, that can auto-complete a paper from the title ...
How do people go about downloading/mirroring Sci-Hub? I assume you cannot start torrenting from your home in Texas without getting into trouble. Do you get a seedbox? Are you in a state like China or Russia?
Getting these >70 TB to a residential address does not seem easy.
If you are serious: you can download it to a seedbox, connect to it, and pull it down to your residential address over scp. The storage is a solved problem; not cheap, but then you are trying to store a significant chunk of all papers :)
Or if you already have a VPN provider, check their terms to see what their torrent policies are. Last I checked mine permitted such traffic in a few specific datacenters.
That solves the routing problem, but not the pipe size problem.
70TB over 1Gbps: 6+ days
70TB over 500Mbps: 12+ days
70TB over 100Mbps: 2+ months
70TB over 50Mbps: 4+ months
70TB over 25Mbps: 8+ months
70TB over 10Mbps: 21+ months
70TB over 5Mbps: 3.5+ years
70TB over 1Mbps: 17+ years
--
70TB via mail: equivalent to 100Mbps (international) to 1Gbps+ (local) depending on distance, but only cost-effective if an equivalent connection is unavailable, given the weight (and price) of sufficient HDDs, eg, 4x18TB = ~$1400, combined weight approx 2.5kg.
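Those figures are easy to reproduce; a quick back-of-the-envelope sketch in Python (assuming 1 TB = 10^12 bytes and perfectly sustained throughput, so real-world times will be somewhat worse):

    # Back-of-the-envelope transfer times for 70 TB at various line rates.
    SIZE_BITS = 70e12 * 8  # 70 TB as bits, assuming 1 TB = 1e12 bytes

    rates_mbps = [("1Gbps", 1000), ("500Mbps", 500), ("100Mbps", 100),
                  ("50Mbps", 50), ("25Mbps", 25), ("10Mbps", 10),
                  ("5Mbps", 5), ("1Mbps", 1)]

    for label, mbps in rates_mbps:
        days = SIZE_BITS / (mbps * 1e6) / 86_400   # seconds -> days
        print(f"{label:>8}: {days:8.1f} days (~{days / 365:.1f} years)")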
Does Amazon allow torrenting? Or could you have your seedbox push the data to S3 from $WHEREVER? If so, you could possibly use Snowball[1] to get the data to your residence.
Given the cloud ecosystem (and associated markets), I would be notably surprised if an average datacenter/colo's smart-hands service couldn't accept me mailing in some HDDs, attach them (possibly to multiple additionally rented servers), then later disconnect them and print a supplied prepaid shipping label to return them.
(I'd provide an appropriately oversized tip to ensure a near-unreasonable amount of bubble wrap is included in the return.)
I might be reading this wrong (I always have problems calculating AWS costs), but this seems VERY expensive. Much more than paying a 10Gbps internet plan for a year where I live.
Possibly, but is cost the only factor? What if you don't want to wait a year to get the data? Or don't want to saturate the 'net connection you're using with just this download? And what about error recovery, since no 'net link is going to be 100% stable for an entire year...
Not saying Snowball is the best idea by any means. But I can see how it might be worth considering.
We (scite.ai) have extracted 918M citation statements (three full sentences each) from 27M full-text articles (more than half of which are paywalled articles obtained through indexing agreements).
And, recently, have made these citation statements easily searchable to find expert analyses and opinions on nearly any topic extracted from the literature: https://scite.ai/search/citations
We can't release the full dataset as our licensing agreements with publishers restrict it. We do have an API though that can be used: https://api.scite.ai/docs
Nice work on Scite!
I'm not sure if this is a different use case than I have, but my searches are listing duplicates of the same papers many times. Is there a reason to not collapse duplicates into a single result?
Is this on Citation Statement search? I think it is probably because the citation context contains two or more citations in it. We look at citations per sentence so it is duplicated there. I can see how that is confusing though.
You would typically require some sort of filtering in an N-gram index to prevent the index from growing larger than the corpus. Maybe something like a tf-idf threshold, or some other heuristic.
Otherwise you get a lot of useless "of that", "in an", "as a"-type word salad that is of no use to anyone, while what's actually useful sounds more like N-grams out of Tamarian.
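A minimal sketch of such a filter, under made-up thresholds (a real corpus would want a much lower max document-frequency ratio, or proper tf-idf weighting): count each bigram's document frequency and keep only the informative middle band.

    # N-gram index pruning sketch: drop bigrams that are near-ubiquitous
    # ("of that", "in an") or one-off noise; keep the middle band.
    from collections import Counter
    from itertools import tee

    def bigrams(tokens):
        a, b = tee(tokens)
        next(b, None)
        return zip(a, b)

    def prune_index(docs, min_df=2, max_df_ratio=0.8):
        """docs: list of token lists. Returns the set of bigrams worth indexing."""
        df = Counter()                          # document frequency per bigram
        for tokens in docs:
            df.update(set(bigrams(tokens)))
        n = len(docs)
        return {g for g, c in df.items() if c >= min_df and c / n <= max_df_ratio}

    docs = [s.split() for s in [
        "the trapezium rule appears in an old numerical analysis text",
        "we rediscovered the trapezium rule in an ml paper",
        "the results were buried in an appendix",
        "signals processed in an fir filter pipeline",
    ]]
    print(prune_index(docs))   # keeps ('trapezium', 'rule'); drops ('in', 'an')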
Also crossed my mind. We'd still be left without illustrations, table organization, charts, etc., which in a considerable number of cases are of extreme relevance, though.
I was looking for a big and novel dataset to work on as a personal project. Can devote a couple months after work to it.
What are interesting things that could be done with this now as a non-scientist, but that could be useful to share too?
I would be interested in seeing how different branding terms evolve in the literature; e.g., "machine learning" vs "artificial intelligence" vs "neural net", or "surrogate model" vs "digital twin" vs "response surface". There are a number of terms of art with substantial overlap, but which term ends up being used depends on the audience, which often includes grant providers. I suspect the popularity of these terms evolves according to what appeals most to funding agencies.
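If the dump exposes publication years, a first pass could be as simple as counting, per year, the papers that mention each term. A hedged sketch, assuming you can iterate over (year, text) records (that record format is my assumption, not the dump's actual schema):

    # Track branding-term usage over time, given (year, full_text) records.
    from collections import defaultdict

    TERMS = ["machine learning", "artificial intelligence", "neural net",
             "surrogate model", "digital twin", "response surface"]

    def term_trends(records):
        """records: iterable of (year, text). Returns {term: {year: paper count}}."""
        trends = defaultdict(lambda: defaultdict(int))
        for year, text in records:
            lowered = text.lower()
            for term in TERMS:
                if term in lowered:
                    trends[term][year] += 1
        return trends

    sample = [(1995, "A response surface methodology for..."),
              (2021, "We fit a digital twin using machine learning...")]
    for term, by_year in term_trends(sample).items():
        print(term, dict(by_year))

You'd want to normalize by the total paper count per year, so the overall growth of the literature doesn't masquerade as a trend in any single term.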
An annoyingly broad answer, but: broadcast an opportunity in which you provide vetted access to the server/instance holding the data (eg, with standard user accounts, private home directories, and the data located read-only in a dir off of /). It would create a small storm of overhead ("please install ..." ad nauseam), but provide interesting opportunities for open-ended creativity.
Also, while unusual, you could potentially extract value by lightly imposing requests ;) on others to (where viable/a good fit) at least have a go at helping you with problem-solving/busywork-type steps in the research you're doing (however experimental/uncertain). Since everyone would be looking at the same dataset, this may bring interesting/unexpected ideas to the table (and maybe even shed light on potential opportunities for collaboration down the road). For individuals who are reasonably independent and self-directed but have no experience working with huge amounts of data, this would also provide a cool chance to play around in a real-world-scale environment without the failure risks attached to eg fulfilling^Wfiguring out business requirements etc.
(Now I'm reminded of https://tilde.club/, which operates a shared Linux box a (very) large bunch of people have collective access to. It's a stab in the dark (ie, the one reference I'm aware of), but maybe the admins there would have interesting insight about managing shared access Linux boxes.)
> I was looking for a big and novel dataset to work on as a personal project
As an aside, one interesting data-set that is out there, legally and freely available, which is decent sized (not as big as this, of course) is the United States Fire Administration's NFIRS[1] data.
Roughly speaking, NFIRS is a record of all fire calls run in the USA, including details like date, time, location, responding department, incident type, dollar value of damages, number of deaths, number of injuries, etc.
I say "roughly speaking" because strictly speaking participation in NFIRS is voluntary and not every department in the USA participates. If memory serves correctly, for example, I think FDNY does not. But as far as I know, a significant majority of the fire departments in the US do send their data to NFIRS and so it becomes a pretty interesting snapshot of fire activity across the entire country.
Edit: from the NFIRS page:
> The NFIRS database comprises about 75% of all reported fires that occur annually.
I wonder -- there's lots of machine learning designed for making inferences and there's some for language and transfer. How easy is it to train an oracle that you can ask questions? Or could you make an ontology? Seems like a wealth of scientific papers might be a neat thing to train on. Not to emit gibberish new papers, but to make something useful. Somehow.
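The cheapest version of that "oracle" isn't generative at all; it's retrieval: vectorize the question and return the closest passages from the corpus. A minimal sketch with scikit-learn TF-IDF (the three "abstracts" are stand-ins, not real papers):

    # A minimal "ask the literature" oracle: TF-IDF retrieval over abstracts.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    abstracts = [
        "The trapezium rule approximates definite integrals with linear pieces.",
        "We propose a neural network surrogate model for aerodynamic design.",
        "Citation sentiment analysis separates supporting and disputing cites.",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    doc_matrix = vectorizer.fit_transform(abstracts)

    def ask(question, k=1):
        q = vectorizer.transform([question])
        scores = cosine_similarity(q, doc_matrix)[0]
        return [abstracts[i] for i in scores.argsort()[::-1][:k]]

    print(ask("which rule approximates definite integrals?"))

Swap the TF-IDF vectors for sentence embeddings and bolt a generative model onto the retrieved passages and you get the modern retrieval-augmented version; building an ontology on top is a much bigger lift.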
So far as I understand it's a lot more than that. Their model has not only indexed but has also individually classified 900M+ citation statements by sentiment, the outputs of which produce a score to represent each paper's relative trustworthiness.
To your point though it's fun to think about generative applications too. I for one would appreciate a writing assistant trained on millions of scientific papers -- like OK I'll write that last-minute proposal for you but you better believe it's going to be chock-full of dispensable lexical arcana.
Reminder that these journal publishers have egregious and exploitative contracts with universities, and should not be commended for turning a blind eye to such databases.
The state university I attended publishes their subscription expenditures, link to the document here [1].
Besides the outrageous prices, the other thing that jumps off the page is how many of them have managed, through contract terms, to hide the amount of tax money they receive from a state institution.
Sci-Hub is, what, 100 TB? That is like INR 80,000, plus a server to hold the drives, say another 30,000, so in total around 110,000 or 120,000. That is around USD 1,500. If I get a windfall of sorts, this is the thing I am going to spend it on.
Donations? Eh, yeah, but I feel like Sci-Hub needs everyone's support.
I happened to check Amazon a few minutes ago while writing another comment and found 18TB drives seem to be hovering around US$350. 6*18=108TB (and US$2100), however, that does not account for redundancy.
I recently learned the fun way that ZFS performance (incl read-only) absolutely tanks if the pool is full (the box was unusable until I deleted the data), so to use ZFS you'd want to add in a few TB of spare capacity. https://wintelguy.com/zfs-calc.pl says 10 drives ($3500) should provide 100TB of space.
Alternatively you could use dm-integrity (which is very new and does not yet have optimized I/O routines) and either layer RAID on top of it or manually duplicate the data yourself, which would let you buy exactly the number of disks you need. You could possibly run Ceph with erasure coding instead, but I don't know how much free space Ceph wants to have to work with.
This also does not take into account a server, but this does not need to be too expensive, probably US$750 or potentially even less.
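Putting those numbers together, a small sketch of the drives-plus-headroom arithmetic (all figures are the rough ballpark numbers from this thread, not real quotes):

    # Rough cost/capacity arithmetic for hosting the ~77 TB corpus locally.
    import math

    CORPUS_TB = 77
    DRIVE_TB = 18
    DRIVE_USD = 350          # ballpark street price from this thread
    MAX_FILL = 0.80          # keep a ZFS pool well below full (see above)
    SERVER_USD = 750

    def plan(parity_drives):
        data = math.ceil(CORPUS_TB / (DRIVE_TB * MAX_FILL))
        total = data + parity_drives
        return total, total * DRIVE_USD + SERVER_USD

    for parity in (0, 1, 2):             # none, raidz1-ish, raidz2-ish
        drives, usd = plan(parity)
        print(f"{parity} parity drive(s): {drives} drives, ~US${usd}")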
I respond entirely out of curiosity; I also dream of backing up large datasets one day :D
my argument is just about redundancy: we do not need additional local redundancy because the data is out there. it would only be a problem if my copy or your copy somehow became the only copy. then we would be concerned.
this says INR 11,000.00, which is $137. raw 100TB would need around 18 of them, so ~$2,500.
i think there can be variations in disk sizes and amazon price fluctuations, but in cases like these i would have a business buy them, because that way the 18% input tax would pass to the business and, in turn, lower the final bill for me. so the drives would come to about $2,118. same for the server: a tax benefit of 18%.
Huh. I seriously need to graph the size vs price thing and figure out where the sweet spots are. I last did that sort of thing a few years ago, got completely out of the game. I just googled a bit and discovered 18TB drives are a thing and went and looked for those. lol
I guess you're right about not needing local redundancy... although this is very much at odds with the "disk is over 500GB, redundancy required!!11" beeper in my head :D
> And although free search engines such as Google Scholar have — with publishers’ agreement — indexed the text of paywalled literature, they only allow users to search with certain types of text queries, and restrict automated searching. That doesn’t allow large-scale computerized analysis using more specialized searches, Malamud says.
“We have seen some initiatives run into trouble, however, when the necessary rights have not been secured to enable their sustainability,”
I love the doublespeak that these corrupt publishers use; it really highlights how mafia-esque the whole situation is. "Pay your protection fee, or we won't be able to /protect/ you." Intellectual slavery and copyright are just more tools of oppression used to keep the wealthy in their positions of power.
We aren't slaves, we are partners. Nor are we all startup leaders.
We have ideas, we don't necessarily have the time, money or even ability to capitalize on them directly as a business.
We also have no need to come up with ideas which have direct application right now. We can come up with "here's a good idea (maybe)"... stuff, and it really doesn't matter so much if it doesn't pan out.
We still get the bonus, the credit, and the feeling that our idea got recorded and acknowledged.
It's making lots of small bets to get ahead, rather than go for broke on one big one.
Anyone can create a billion dollar company or win a Nobel prize while we're at it. All technically true but largely meaningless for lack of nuance and perspective.
If you're an independent inventor you'll need to hire a patent attorney for your patent to have a meaningful chance, which will run several grand. The patent office is overwhelmed and as a result tends to reject the first draft of an application so that only the most tenacious filers stick around. So you'll spend several grand more on each of probably 3-5 iterations of the application. $10k is a realistic estimate for attorneys' fees if the process goes well.
Then you'll need to monetize it. If you want to sell a product you'll deal with all the other IP holders and their "patent thickets" that intersect with trivial aspects of your product. To sell your product you'll need to license their IP, eroding the value of your own.
Finally, you'll have to defend your IP from infringement, which is a major expense, much larger than filing. If you threaten a company with infringement, they will make it their strategy to grind you down with legal expenses. If it goes to litigation, you should expect to spend at least several hundred grand.
So overall I would not say just "anyone" has meaningful access to IP. As with many legal/bureaucratic processes, your mileage will vary with how deep your pockets are.
> Anyone can create IP. It's not something only the wealthy can do.
No. First you need privileged access to humanity’s techno-scientific inheritance. New discoveries and their negative research are increasingly being monopolized through trade secret protection (which, unlike patents, means that some discoveries can be forever monopolized).
There are many instances where a small artist sampling commercial music has been slapped with a life-ending lawsuit, while commercial artists straight up ripping off small artists make chart hits and millions of dollars.
> And are you telling us we need privilege to write a song?
I wrote: “you need privileged access to humanity’s techno-scientific inheritance.”
So no, that’s not what I'm telling you.
For cultural media/products it's different. If the working class's need for affordable housing were respected, and we didn't have parasitic Wall Street housing-financializers like Blackstone sucking out our life energy, then songwriters wouldn't need to extract endless royalties from cultural products like songs to pay their housing rents. Instead they could live off live performances, or use some sort of societal reputational framework to measure their joy-bringing cultural contribution (e.g. the number of plays or similar metrics).
Rent extraction and monopolization of our techno-scientific inheritance have a grossly underexplored, compounding, cancer-like effect on the social organism. Today's property relations (the state letting individuals 'own' bits of science) are literally pushing us towards climate genocide.
I care about growing radical universal access to our techno-scientific inheritance because we live in an age where scientific discoveries, which are all individually part of a larger web of interconnected collective feedback loops revealed when painstakingly reverse engineering various phenomena on our planet, are increasingly being commoditized, with little public awareness of the destructiveness and insidiousness of that shift.
> “…today, a tiny minority of people and corporate interests across the world are accumulating vast wealth and power from rental income, not only from housing and land but from a range of other assets, natural and created. ‘Rentiers’ of all kinds are in unparalleled ascendancy
> Rentiers derive income from ownership, possession or control of assets that are scarce or artificially made scarce. Most familiar is rental income from land, property, mineral exploitation or financial investments, but other sources have grown too. They include the income lenders gain from debt interest; income from ownership of ‘intellectual property’ (such as [trade secrets], patents, copyright, brands and trademarks); capital gains on investments; ‘above normal’ company profits (when a firm has a dominant market position that allows it to charge high prices or dictate terms); income from government subsidies; and income of financial and other intermediaries derived from third-party transactions.”
> Rather than a “free market,” the neoliberal global economy praised as “free trade” is actually “a global framework of institutions and regulations that enable elites to maximise their rental income.”
How does this fit in the context of monetizable, economically valuable IP and the protection of individual intellectual freedom and personal benefit (or lack thereof) for the individual contributor who actually invented it?
Also, technically, as an employee, anything you create that even remotely fits into the employer's and the company's domain is transferred to them the moment you create it.
It fits in the context that you can sell a program that you've developed, but other people can't just go and sell it. I don't know why you are focused on inventions when the IP at issue in the article is scholarly articles.
IP in general is easy. IP in specific is anywhere between devilishly hard and trivial, and by its very nature it's monopolistic. You can't "just create your own" if the territory you want to build on is already occupied. By treating all IP as though it's close to impossible, and that we must societally worship the creators and their progeny, we create monumental inefficiencies.
> we must societally worship the creators and their progeny,
It's really hard to have a useful conversation when nonsense like this creeps in. There's no society that behaves as a group. Talk mechanisms, not anthropomorphised outcomes.
Fine: the mechanism is a copyright length of life+70, as a heritable asset. Law is a moral code that's enforced so that as a society we must do, or not do, certain things, so yes, the intent is absolutely that society behaves as a group in certain specific ways.
Again, what's the point in your last statement? You explain your nebulous statement and then sign off as though I'd said your explanation was nonsense.
It sounds as though even a superficial thinking through of mechanisms has stopped you talking about how we worship IP creators, which is great. It's just a mechanism to allow investment in invention to reap rewards before others can compete. You might think that's a bad idea; what's your better idea?
> It's just a mechanism to allow investment in invention to reap rewards before others can compete.
Except it's a lie that the real innovators are those in the private sector, and that they need to recoup their investment through the intellectual property regime. Nearly all discoveries are made through public sector funding. [1] Your worshipping of supposed individual inventors is part of the great man theory adopted by Silicon Valley [2] and which is spouted by Global North and American bourgeois media/stories/movies.
Research and development is a team sport. The lone genius myth will die.
Indeed. This is my new strategy at the place I work, where I can't get acceptance of new interesting "a few years out there" ideas for our product line, but I can file for IP for such ideas. I'm actually encouraged to. It's a great outlet for the creative if you've got the organization to help out.
Not to mention I can't hire lawyers, researchers, illustrators, and all the other things my company does for me. It would cost me so much more, both in dollars and time.
Yep! While yeah I am selling IP for some small bonus, it's way, way easier and a guaranteed return on my investment vs. creating a startup or trying to get someone to license a patent. I've done a few patents now at my current job, and it's literally about 4 hours of investment from me for each one to get the summary written up, talk to the lawyers, and then review their patent draft.
Also, the patent lawyers I worked with had so much domain experience (ML) that it was kinda scary.
Well it depends a lot on the patent, I guess. Some patents are definitely not worth a startup, so probably it's not a bad idea to get some bonus and be done with it.
However it's cool for big companies to show they have thousands of patents.
I still can't construe it as ethical or right to prevent access to scientific papers. There's an argument about universities pooling cash to fund publishing, but whatever is happening here is something else.
If you want to perform searches against the index: https://archive.org/search.php?query=creator%3A%22Public+Res...