At ContentMine we're doing something totally complementary to this. Some of the tools will overlap and we should be sharing what we're doing. For example, I've been working on a standardised declarative JSON-XPath scraper definition format and a subset of it for academic journal scraping. I've been building a library of ScraperJSON definitions for academic publisher sites, and I've converged on some formats that work for a majority of publishers with no modification (because they silently follow undocumented standards like the HighWire metadata). We've got a growing community of volunteers who will keep the definitions up to date for hundreds or thousands of journals. If you also use our scraper definitions for your metadata you'll get all the publishers for free.
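To make the idea concrete, here is a hedged sketch of what a declarative JSON scraper definition plus a tiny runner could look like. This is loosely in the spirit of ScraperJSON, but the field names (`elements`, `selector`, `attribute`) are illustrative assumptions, not the actual spec, and the runner uses the stdlib ElementTree's limited XPath on well-formed XHTML rather than a full HTML parser:

```python
# Illustrative sketch only: field names are assumptions, not the real
# ScraperJSON schema. Selectors use ElementTree's limited XPath subset.
import json
import xml.etree.ElementTree as ET

definition = json.loads("""
{
  "elements": {
    "title":   {"selector": ".//meta[@name='citation_title']",   "attribute": "content"},
    "authors": {"selector": ".//meta[@name='citation_author']",  "attribute": "content"},
    "pdf_url": {"selector": ".//meta[@name='citation_pdf_url']", "attribute": "content"}
  }
}
""")

def scrape(xhtml, definition):
    """Apply each selector in the definition to a parsed page."""
    root = ET.fromstring(xhtml)
    result = {}
    for name, spec in definition["elements"].items():
        hits = root.findall(spec["selector"])
        result[name] = [el.get(spec["attribute"]) for el in hits]
    return result

page = """<html><head>
<meta name="citation_title" content="A Paper" />
<meta name="citation_author" content="A. Author" />
<meta name="citation_pdf_url" content="https://example.com/paper.pdf" />
</head><body /></html>"""

print(scrape(page, definition))
```

The point of the declarative form is that adding a publisher means adding a JSON file, not writing code, which is what makes community maintenance of hundreds of definitions feasible.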
Our goal initially is to scrape the entire literature (we have TOCs for 23,000 journals) as it is published every day. We then use natural language and image processing tools to extract uncopyrightable facts from the full texts, and republish those facts in open streams. For example we can capture all phylogenetic trees, reverse engineer the newick format from images, and submit them to the Tree Of Life. Or we can find all new mentions of endangered species and submit updates to the IUCN Red List. There's a ton of other interesting stuff downstream (e.g. automatic fraud detection, data streams for any conceivable subject of interest in the scientific literature).
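For readers unfamiliar with it, the Newick format mentioned above encodes a phylogenetic tree as nested parentheses, e.g. `(A,(B,C));`. A minimal recursive parser (illustrative only; recovering the tree from a published image is of course far harder than parsing the string) looks like this:

```python
# Minimal Newick parser sketch: no branch lengths or internal-node labels,
# just the nesting structure, returned as nested Python lists.
def parse_newick(s):
    """Parse a simple Newick string into nested lists of leaf names."""
    s = s.rstrip(";").strip()
    pos = 0
    def node():
        nonlocal pos
        if s[pos] == "(":
            pos += 1          # consume '('
            children = [node()]
            while s[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1          # consume ')'
            return children
        start = pos
        while pos < len(s) and s[pos] not in ",()":
            pos += 1
        return s[start:pos]   # a leaf name
    return node()

print(parse_newick("(A,(B,C));"))  # -> ['A', ['B', 'C']]
```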
I have a question. Why are you saying you'll never do full texts? You could index all CC-BY and better full texts completely legally, and this would greatly expand the literature search power.
I realized you and your team (Peter et al.) have been working on a project in a similar space, and was hoping there would be some beneficial overlap. It looks like there is! The ScraperJSON definitions for publishers sound like exactly what Scholar Ninja needs. Let's chat soon :)
Thank you for the quick description of ContentMine; the more awareness about these projects, the better. I've talked to Peter Murray-Rust on a few occasions and have to say that what you are doing with ContentMine is phenomenal, and I wish you all the best. I hope you'll see that our projects are complementary rather than competitive.
About indexing full texts: We do generate keyword indexes based on full texts, but we do not add this full text to the network. I hope you can see what I'm saying: you can still search through full text, but you can't access full text, it doesn't exist on the network, only a keyword: [entry1, entry2,...] index exists. One improvement to Scholar Ninja for open access articles would be the ability to show snippets, like Google Scholar does; it's on the mental TODO.
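A minimal sketch of the kind of inverted index described above: only `keyword -> [document IDs]` mappings exist on the network, and the full text itself is discarded after tokenisation, so it is searchable but never retrievable. (Names and tokenisation here are illustrative, not Scholar Ninja's actual implementation.)

```python
# Illustrative inverted index: keywords map to document IDs (e.g. DOIs);
# the full text is never stored, only consulted once at indexing time.
from collections import defaultdict

index = defaultdict(set)

def add_document(doc_id, full_text):
    """Index a document's keywords, then let the text go out of scope."""
    for word in full_text.lower().split():
        index[word].add(doc_id)
    # full_text is not stored anywhere

def search(*keywords):
    """Return document IDs containing all of the given keywords."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*sets) if sets else set()

add_document("10.1371/journal.pmed.0010065", "cancer risk factors in adults")
add_document("10.7717/peerj.1", "signals of cancer in genomic data")

print(sorted(search("cancer")))  # both DOIs match, yet no text is recoverable
```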
My approach has been to use the Zotero translators, since their 200,000 users have been alright at responding to publisher site changes. Unfortunately they are trapped in the Firefox ecosystem until someone converts their translators to generic js. Then Zotero could be a downstream consumer of the same scrapers, but also maybe maintain them as well.
zotero's scrapers: https://github.com/zotero/translators which for example I use for an IRC bot, https://github.com/kanzure/paperbot
Are these yours or are there more somewhere? https://github.com/ContentMine/journal-scrapers
I dunno about a majority following HighWire... here's a corpus dump of what I've seen (just random debug data from paperbot): http://diyhpl.us/~bryan/papers2/paperbot/publisherhtml.zip
(Only 333 of the 1218 samples have "citation_pdf_url". But this collection is extremely biased towards things that I am reading, rather than a sample of the entire academic spectrum.)
The stuff in that repo is a proof of principle - we will be growing the collection massively before we demo in mid July.
Thanks for the corpus dump, taking a look now.
edit: I'm not suggesting a majority use HighWire, but that we can get by with far fewer definitions than there are publishers. If we include PRISM and Dublin Core along with some obvious sets of CSS class names, that will already get us pretty far.
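The fallback idea above can be sketched as trying HighWire meta tags first, then PRISM, then Dublin Core, so one generic definition covers many publishers. The tag names below are real metadata conventions; the fallback chain itself is my illustrative assumption, not a published algorithm:

```python
# Sketch of metadata fallback: HighWire -> PRISM -> Dublin Core.
from html.parser import HTMLParser

TITLE_KEYS = ["citation_title",        # HighWire
              "prism.title",           # PRISM
              "dc.title", "DC.title"]  # Dublin Core (case varies in the wild)

class MetaCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from a page."""
    def __init__(self):
        super().__init__()
        self.meta = {}
    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if "name" in d and "content" in d:
                self.meta.setdefault(d["name"], d["content"])

def extract_title(page_html):
    """Return the first title found, preferring HighWire over PRISM over DC."""
    parser = MetaCollector()
    parser.feed(page_html)
    for key in TITLE_KEYS:
        if key in parser.meta:
            return parser.meta[key]
    return None

page = '<html><head><meta name="DC.title" content="A Paper"></head></html>'
print(extract_title(page))  # no HighWire/PRISM tags, so it falls back to DC
```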
Speak of the devil - are you aware you can't install extensions from 3rd-party sources anymore at all? You can thank Google for this idiotic and completely self-interested move.
Whether or not I'll carry on with this project depends to a large degree upon how well it is received within the community and how many master hackers I can get to collaborate with. Though I have to say, it has been the most fun project to date, that's for sure.
With regards to extensions, yes, I agree that's silly, but it's their game, so ... I could still install the extension manually on my Chrome for Mac though. I think the limitation is Windows only? Anyway, I'll work on getting this ready for the Chrome web store, so as to remove any barriers that currently exist.
Before I launch on the store, I'd like to be reasonably sure that I'm not breaking any laws and that it works like it should. :)
So it's definitely still possible to use extensions on Windows without going through the store.
It's very similar to what I proposed as "the World Wide Molecular Matrix" (WWMM) about 10 years ago (http://en.wikipedia.org/wiki/World_Wide_Molecular_Matrix). P2P was an exciting development then and there was talk about browser/servers. The technology at the time was Napster-like.
WWMM was ahead of both the technology and the culture. It should work now, and I think Ninja will fly (if that's the right verb). I think we have to pick a field where there is a lot of interest (currently I am fixated on dinosaurs), where there is a lot of Open material, and where the people are likely to have excited minds.
We need a project that will start to be useful within a month, because the main advocacy will be showing that it's valuable. The competition is with centralised, for-profit services such as Mendeley. The huge advantage of Ninja is that it's distributed, which absolutely guarantees non-centrality. The challenges - not sure in what order - are apathy, and legal challenges (e.g. can it be represented as spyware - I know it's absurd, but the world is becoming absurd).
Love to talk at Berlin.
> Scraping Google is a bad idea, which is quite funny as Google itself is the mother of all scrapers, but I digress.
It's not really "funny"/ironic/etc -- Google put capital into scraping websites to build an index, and you're free to do the same, but you shouldn't expect Google to allow you to scrape their index for free.
EDIT: just saw this:
> Right now, PLOS, eLife, PeerJ and ScienceDirect are supported, so any paper you read from these publishers, while using the extension, will get indexed and added to the network automatically.
Yeah, they're not going to like that. You might want to consult a lawyer.
The fact that it's an unproven concept is exciting to me; that, and the fact that it's using very modern technologies to solve an existing problem. If nothing else, maybe this project can serve as a cool demo for the underlying tech, i.e. WebRTC. To my knowledge it is the only keyword-based search engine based entirely in the browser.
I agree with your remark about Google; I was trying to be witty, but often my humor fails me, as my family and friends will be eager to confirm :) The fact that Google doesn't have public APIs, even paid ones, for any of their search services leads me to believe the numbers just don't add up for them.
With regards to angering publishers: I really, really do not want to get on their bad side, and I can't see how this project could. Its mission is to help users discover content that publishers have - to help them find the right papers, which are still hosted by the publishers themselves. Scholar Ninja will never be used to circumvent paywalls or share paywalled fulltext papers; this is just not in anyone's interest. Scholar Ninja only indexes pages you read anyway, so it doesn't cause any additional load on servers, and it only contains keyword references to documents, e.g. "cancer": ["10.1371/journal.pmed.0010065", ...], which enables us to do keyword searching.
ScienceDirect is owned by Elsevier and is a different kettle of fish. One we all hope to boil in the very near future. However, the title, authors, etc. are not copyrightable and are explicitly free for indexing in the ToS. This is not crawling, only scraping from an already rendered page. They really can't complain.
You may want to have a look at http://en.wikipedia.org/wiki/Sui_generis_database_right
You can say "it is only an alphabetical list of names and titles", but you have to remember that the "sui generis" DB rights have been created to protect phone books, and almost every compilation of data is more complex than a phone book.
I'm not following the field closely at the moment, but I'm pretty sure PLOS at least has an OAI interface too.
With regards to indexing, it looks like we're going to partner with ContentMine (Peter, Richard et al.) to seed the index. Scholar Ninja does not, in essence, discriminate which content should be indexed and which should not, as long as it is science - it is only a matter of implementing rules (to extract authors, title, journal, date, etc.) for the documents/pages you would like indexed.
Edit: Looked it up. At first glance, it looks like there might be some licenses associated with harvesting this data. Will have to investigate further.
Before Scholar Ninja reaches version 1.0 maturity, though, we will be seeding the network with as many sources as we legally and technically can, with a strong focus on properly licensed open access content.