If you want to perform searches against the index: https://archive.org/search.php?query=creator%3A%22Public+Res...
Great to see such a long list of "Signatories to the Declaration of Support", and that it's hosted by the Internet Archive. Keep fighting the good fight, my friends! :)
For example, I'm trying to find articles with "neuroscience" or "brain-computer interface" among the keywords.
You can't read the papers or see their citations or cited-bys. What's it for? Poking Elsevier?
>In_memoriam Shamnad Basheer ; Aaron Swartz
The same applied math turns up across domains - I would love to be able to see who else is using the same math I already know on their problems.
For example I've noticed certain similarities across classical statistics, Machine Learning and digital signal processing. But they all seem to come at it from different angles, and they often use different terminology. Sometimes a field will rediscover something discovered elsewhere a long time ago, eg the trapezium rule.
It's clearly Sci-Hub, as this 2019 article all-but-confirms:
>"And around the same time that he heard about the Rameshwari judgment, he had come into possession (he won’t say how) of eight hard drives containing millions of journal articles from Sci-Hub [...] Malamud began to wonder whether he could legally use the Sci-Hub drives to benefit Indian students [...] Asked directly whether some of the text-mining depot’s articles come from Sci-Hub, he said he wouldn’t comment"
(It'd be nice if there was a coverage source other than this scientific publisher, whose biases are obvious).
As has Science:
from my lived experience, this request has been seen as absurd and invalidated by many, because their lived experience is always having feeds of papers and abstracts they can start from. or they see a browser extension that doesn't do what I'm asking at all as good enough (there are some that automatically give you the scihub link to any paper you are looking at, which still puts me at square one: which paper?).
it's a pretty basic feature though.
there is a link there to 851 torrents, ~77TB of data (compressed zip files), and also an index (sql db dump)
That's pretty crazy to think about, even if you consider overhead and multimedia assets like images often included in PDFs. I remember the old "the library of congress fits on this CDROM" analogies (which weren't entirely true) but this takes it to a whole new level.
At some point, it seems that in research it will be far easier to skip the lit review and just do the work; then later, if you find you did the same work someone else did, compare results for consistency. We may yet get over the hurdle of the reproducibility crisis thanks to the deluge of information. The underlying issue, though, is that the same failure to find related work that led you to replicate the effort independently means you may also never find all the duplicate efforts to compare against.
PDF is fairly inefficient compared to formats like DVI (and consider that so many papers are produced using TeX anyway, though figures may be in various image formats.)
But 77TB? You could host the whole thing in a shoebox with ten 8TB flash drives, a 10-socket powered USB hub, and a Raspberry Pi.
Someone really needs to build Shoe-Hub.
This is not something new, or even from this century. The Royal Society, which was founded in 1660 and is a reference in the history of science, adopted the motto "take nobody's word for it".
Getting these > 70 TB to a residential address does not seem easy.
Or if you already have a VPN provider, check their terms to see what their torrent policies are. Last I checked mine permitted such traffic in a few specific datacenters.
70TB over 1Gbps: 6+ days
70TB over 500Mbps: 12+ days
70TB over 100Mbps: 2+ months
70TB over 50Mbps: 4+ months
70TB over 25Mbps: 8+ months
70TB over 10Mbps: 21+ months
70TB over 5Mbps: 3+ years
70TB over 1Mbps: 17+ years
70TB via mail: equivalent to 100Mbps (international) to 1Gbps+ (local) depending on distance, but only cost-effective if an equivalent connection is unavailable due to average weight (and price) of sufficient HDDs, eg, 3x18TB = ~$1200, combined weight approx 1kg.
(I'd provide an appropriately oversized tip to ensure a near-unreasonable amount of bubble wrap is included in the return.)
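The transfer times in the list above can be reproduced with a few lines of Python. This is only a rough sketch: it assumes decimal units (1TB = 10^12 bytes, 1Mbps = 10^6 bits per second), a link that stays saturated the whole time, and no protocol overhead. The function name `transfer_days` is made up for illustration.

```python
def transfer_days(size_tb: float, rate_mbps: float) -> float:
    """Days needed to move size_tb terabytes at rate_mbps megabits per second."""
    bits = size_tb * 1e12 * 8            # decimal terabytes -> bits
    seconds = bits / (rate_mbps * 1e6)   # megabits per second -> bits per second
    return seconds / 86_400              # seconds -> days


# Same link speeds as the list above; ideal-case numbers only.
for rate in (1000, 500, 100, 50, 25, 10, 5, 1):
    days = transfer_days(70, rate)
    print(f"70TB over {rate}Mbps: {days:,.1f} days ({days / 365.25:.2f} years)")
```

Real-world torrent swarms rarely saturate a link for months on end, so these are best-case figures.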
Not saying Snowball is the best idea by any means. But I can see how it might be worth considering.
We've taken this an extra step and have classified the citation statements as providing supporting or contrasting evidence (https://direct.mit.edu/qss/article/doi/10.1162/qss_a_00146/1...)
And, recently, have made these citation statements easily searchable to find expert analyses and opinions on nearly any topic extracted from the literature: https://scite.ai/search/citations
(PS: comparable initiative of citation datasets that are downloadable: S2ORC https://github.com/allenai/s2orc and Internet Archive Refcat https://blog.archive.org/2021/10/19/internet-archive-release...)
Otherwise you get a lot of useless "of that", "in an", "as a"-type word salad that is of no use to anyone, while what's actually useful sounds more like N-grams out of Tamarian.
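One possible way to keep that kind of function-word filler out of an n-gram listing is to drop every n-gram made up entirely of stopwords. A minimal sketch, assuming a tiny hand-picked stopword list and a naive tokenizer (both are placeholders, and `content_ngrams` is a made-up name):

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "as", "in", "is", "of", "that", "the", "to"}


def content_ngrams(text: str, n: int = 2) -> Counter:
    """Count n-grams, skipping those that consist only of stopwords."""
    # Lowercase word tokens; hyphenated terms such as "brain-computer" stay whole.
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(
        " ".join(gram)
        for gram in grams
        if not all(token in STOPWORDS for token in gram)
    )


sample = "As a result of that test, the brain-computer interface in an animal model"
for gram, count in content_ngrams(sample).most_common():
    print(count, gram)
```

This only removes n-grams that are pure filler; mixed ones like "a result" survive, so a stricter filter (or a frequency-based score) may be needed in practice.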
A bit like Linguee, but for scientists (see the bottom of the page): https://www.linguee.fr/francais-anglais/search?source=auto&q...
Also, while unusual, you could potentially extract value by lightly imposing requests ;) on others to (where viable/a good fit) at least have a go at helping you with problem-solving/busywork-type steps in the research you're doing (however experimental/uncertain). Since everyone would be looking at the same dataset, this may bring interesting/unexpected ideas to the table (and maybe even shed light on potential opportunities for collaboration down the road). For individuals who are reasonably independent and self-directed but have no experience working with huge amounts of data, this would also provide a cool chance to play around in a real-world-scale environment without the failure risks attached to eg fulfilling^Wfiguring out business requirements etc.
(Now I'm reminded of https://tilde.club/, which operates a shared Linux box a (very) large bunch of people have collective access to. It's a stab in the dark (ie, the one reference I'm aware of), but maybe the admins there would have interesting insight about managing shared access Linux boxes.)
As an aside, one interesting data-set that is out there, legally and freely available, which is decent sized (not as big as this, of course) is the United States Fire Administration's NFIRS data.
Roughly speaking, NFIRS is a record of all fire calls run in the USA, including details like date, time, location, responding department, incident type, dollar value of damages, number of deaths, number of injuries, etc.
I say "roughly speaking" because strictly speaking participation in NFIRS is voluntary and not every department in the USA participates. If memory serves correctly, for example, I think FDNY does not. But as far as I know, a significant majority of the fire departments in the US do send their data to NFIRS and so it becomes a pretty interesting snapshot of fire activity across the entire country.
Edit: from the NFIRS page:
> The NFIRS database comprises about 75% of all reported fires that occur annually.
We have extracted 918M citation statements from 27M full text articles.
To your point though it's fun to think about generative applications too. I for one would appreciate a writing assistant trained on millions of scientific papers -- like OK I'll write that last-minute proposal for you but you better believe it's going to be chock-full of dispensable lexical arcana.
The state university I attended publishes its subscription expenditures, link to the document here.
Besides the outrageous prices, the other thing that jumps off the pages is how many of them have managed to use contract terms to hide the amount of tax money they receive from a state institution.
Donations, eh yeah but I feel like scihub needs everyone's support.
I recently learned the fun way that ZFS performance (incl read-only) absolutely tanks if the pool is full (the box was unusable until I deleted the data), so to use ZFS you'd want to add in a few TB of spare capacity. https://wintelguy.com/zfs-calc.pl says 10 drives ($3500) should provide 100TB of space.
Alternatively you could use dm-integrity (which is very new and does not yet have optimized I/O routines) and either layer RAID on top of that or manually duplicate the data yourself, which would let you get exactly the number of disks you need. It's possible you could alternatively run Ceph with erasure coding, but I don't know if Ceph likes having some free space to work with.
This also does not take into account a server, but this does not need to be too expensive, probably US$750 or potentially even less.
I respond entirely out of curiosity; I also dream of backing up large datasets one day :D
this says INR 11,000.00 which is $137. a raw 100TB would need around 18 of them, so about $2500.
i think there can be variations of disk sizes and amazon fluctuations but in cases like these i would have a business buy them because that way input tax of 18% would pass on to the business and conversely to me for lowering the final bill. so, $2118 would be for these drives. same for the server. a tax benefit of 18%.
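The arithmetic behind those figures can be checked quickly. A small sketch, assuming the quoted price already includes the 18% tax, so the net price is the gross divided by 1.18 (that reading is what reproduces the $2118 figure; `net_of_input_tax` is a made-up helper name):

```python
def net_of_input_tax(gross: float, rate: float = 0.18) -> float:
    """Price with the tax component removed, assuming gross already includes it."""
    return gross / (1 + rate)


drives = 18
price_usd = 137                      # per-drive price quoted in the comment
raw_total = drives * price_usd       # 2466, which the comment rounds up to 2500
net = net_of_input_tax(2500)

print(f"raw total: ${raw_total}")
print(f"$2500 net of 18% tax: ${net:,.2f}")
```

If the listed price were tax-exclusive instead, the saving would be computed differently, so treat the divide-by-1.18 step as an assumption.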
as i said, i would love to have this
I guess you're right about not needing local redundancy... although this is very much at odds with the "disk is over 500GB, redundancy required!!11" beeper in my head :D
(And wow things do fluctuate quite a bit, I just poked around and found https://www.amazon.com/MDD-Ultrastar-HUS726060ALE614-Enterpr... (6TB) for US$99, but that's a DVR quality drive, only rated for short term storage.)
Also, TIL about the tax thing, that's really cool. TIL about the Indian GST.
> And although free search engines such as Google Scholar have — with publishers’ agreement — indexed the text of paywalled literature, they only allow users to search with certain types of text queries, and restrict automated searching. That doesn’t allow large-scale computerized analysis using more specialized searches, Malamud says.
I love the double-speak that these corrupt publishers use - it really highlights how mafia-esque the whole situation is. "Pay your protection fee, or we won't be able to /protect/ you". Intellectual slavery and copyright are just another tool of oppression used to keep the wealthy in their positions of power.
We have ideas, we don't necessarily have the time, money or even ability to capitalize on them directly as a business.
We also have no need to come up with ideas which have direct application right now. We can come up with "here's a good idea (maybe)"... stuff, and it really doesn't matter so much if it doesn't pan out.
We still get the bonus, the credit, and the feeling that our idea got recorded and acknowledged.
It's making lots of small bets to get ahead, rather than go for broke on one big one.
If you're an independent inventor you'll need to hire a patent attorney for your patent to have a meaningful chance, which will run several grand. The patent office is overwhelmed and as a result tends to reject the first draft of an application so that only the most tenacious filers stick around. So you'll spend several grand more on each of probably 3-5 iterations of the application. $10k is a realistic estimate for attorneys' fees if the process goes well.
Then you'll need to monetize it. If you want to sell a product you'll deal with all the other IP holders and their "patent thickets" that intersect with trivial aspects of your product. To sell your product you'll need to license their IP, eroding the value of your own.
Finally you'll have to defend your IP from infringement, which is a major expense---much larger than filing. If you threaten a company with infringement they will make it their strategy to grind you down with legal expenses. If it goes to litigation, you should expect to spend at least several hundred grand.
So overall I would not say just "anyone" has meaningful access to IP. As with many legal/bureaucratic processes, your mileage will vary with how deep your pockets are.
Here is a book written by two economists on how the current IP regime concentrates power and stifles innovation: http://www.dklevine.com/general/intellectual/againstfinal.ht...
No. First you need privileged access to humanity’s techno-scientific inheritance. New discoveries and their negative research are increasingly being monopolized through trade secret protection (which, unlike patents, means that some discoveries can be forever monopolized).
Yes I agree this isn’t the best source so if you have a good source that explains it better, please share?
And are you telling us we need privilege to write a song?
I wrote: “you need privileged access to humanity’s techno-scientific inheritance.”
So no, that’s not what I'm telling you.
For cultural media/products it’s different. If the working class need for affordable housing was respected, and we didn’t have parasitic Wall Street housing-financializers like Blackstone sucking out our life energy, then songwriters wouldn’t need to extract endless royalties from cultural products like songs to pay for their housing rents. Instead they could live off live performances or use some sort of societal reputational framework to measure their joy-bringing cultural contribution (e.g. by using the number of plays or similar metrics).
Rent extraction and monopolization of our techno-scientific inheritance has a grossly underexplored compounding cancer-like effect on the social organism. Today's property relations (the state letting individuals 'own' bits of science) are literally pushing us towards climate genocide.
I care about growing radical universal access to our techno-scientific inheritance because we live in an age where scientific discoveries, which are all individually part of a larger web of interconnected collective feedback loops revealed when painstakingly reverse engineering various phenomena on our planet, are increasingly being commoditized, with little public awareness of the destructiveness and insidiousness of that shift.
> “…today, a tiny minority of people and corporate interests across the world are accumulating vast wealth and power from rental income, not only from housing and land but from a range of other assets, natural and created. ‘Rentiers’ of all kinds are in unparalleled ascendancy
> Rentiers derive income from ownership, possession or control of assets that are scarce or artificially made scarce. Most familiar is rental income from land, property, mineral exploitation or financial investments, but other sources have grown too. They include the income lenders gain from debt interest; income from ownership of ‘intellectual property’ (such as [trade secrets], patents, copyright, brands and trademarks); capital gains on investments; ‘above normal’ company profits (when a firm has a dominant market position that allows it to charge high prices or dictate terms); income from government subsidies; and income of financial and other intermediaries derived from third-party transactions.”
> Rather than a “free market,” the neoliberal global economy praised as “free trade” is actually “a global framework of institutions and regulations that enable elites to maximise their rental income.”
The comment you just made is IP that you now own. You just created IP. You own the copyright on anything you write.
Also, technically, as an employee anything you create which even remotely fits into the realm of the employer and the company is transferred to them the moment you create it.
IP in general is easy. IP in specific is anywhere between devilishly hard and trivial, and by its very nature it's monopolistic. You can't "just create your own" if the territory you want to build on is already occupied. By treating all IP as though it's close to impossible, and that we must societally worship the creators and their progeny, we create monumental inefficiencies.
It's really hard to have a useful conversation when nonsense like this creeps in. There's no society that behaves as a group. Talk mechanisms, not anthropomorphised outcomes.
I'm sorry you think that's nonsense.
It sounds as though even a superficial thinking through of mechanisms has stopped you talking about how we worship IP creators, which is great. It's just a mechanism to allow investment in invention to reap rewards before others can compete. You might think that's a bad idea; what's your better idea?
Except it's a lie that the real innovators are those in the private sector, and that they need to recoup their investment through the intellectual property regime. Nearly all discoveries are made through public sector funding. Your worshipping of supposed individual inventors is part of the great man theory adopted by Silicon Valley and spouted by Global North and American bourgeois media/stories/movies.
Research and development is a team sport. The lone genius myth will die.
 Mariana Mazzucato, https://web.archive.org/web/20160204223931/https://nybooks.c...
Also, the patent lawyers I worked with had so much domain experience (ML) that it was kinda scary.
However it's cool for big companies to show they have thousands of patents.