Hacker News new | past | comments | ask | show | jobs | submit login
Show HN: An open distributed search engine for science (juretriglav.si)
98 points by juretriglav on June 21, 2014 | 20 comments

Jure, your projects never cease to impress me. Really looking forward to talking in depth at OKfest. This idea is so close to what we've been doing that it's a real shame we didn't talk earlier, but the parts of what you're doing that are unique are also truly awesome.

At ContentMine we're doing something totally complementary to this. Some of the tools will overlap and we should be sharing what we're doing. For example, I've been working on a standardised declarative JSON-XPath scraper definition format and a subset of it for academic journal scraping. I've been building a library of ScraperJSON definitions for academic publisher sites, and I've converged on some formats that work for a majority of publishers with no modification (because they silently follow undocumented standards like the HighWire metadata). We've got a growing community of volunteers who will keep the definitions up to date for hundreds or thousands of journals. If you also use our scraper definitions for your metadata you'll get all the publishers for free.
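The ScraperJSON format itself isn't shown in this thread, so the definition shape, the field names, and the tiny regex "engine" below are all illustrative assumptions; a real implementation would evaluate XPath/CSS selectors against a DOM. Still, the idea of a declarative definition plus a generic engine might look roughly like this:

```javascript
// Hypothetical sketch only: the real ScraperJSON format is not shown in
// this thread. A declarative definition maps field names to rules, and a
// generic engine applies them to a page.
const definition = {
  elements: {
    title: { meta: "citation_title" }, // HighWire-style <meta> tag
    doi:   { meta: "citation_doi" }
  }
};

// Stand-in "engine": pull <meta name="..." content="..."> values out of
// raw HTML with a regex instead of a real DOM/XPath evaluator.
function extractMeta(html, name) {
  const m = html.match(
    new RegExp(`<meta\\s+name="${name}"\\s+content="([^"]*)"`, "i"));
  return m ? m[1] : null;
}

function scrape(html, def) {
  const result = {};
  for (const [field, rule] of Object.entries(def.elements)) {
    result[field] = extractMeta(html, rule.meta);
  }
  return result;
}

const html = '<meta name="citation_title" content="On P2P search">' +
             '<meta name="citation_doi" content="10.1371/example">';
console.log(scrape(html, definition));
// { title: 'On P2P search', doi: '10.1371/example' }
```

The point of the declarative style is that non-programmers can maintain the definition objects while the engine stays fixed.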

Our goal initially is to scrape the entire literature (we have TOCs for 23,000 journals) as it is published every day. We then use natural language and image processing tools to extract uncopyrightable facts from the full texts, and republish those facts in open streams. For example we can capture all phylogenetic trees, reverse engineer the newick format from images, and submit them to the Tree Of Life. Or we can find all new mentions of endangered species and submit updates to the IUCN Red List. There's a ton of other interesting stuff downstream (e.g. automatic fraud detection, data streams for any conceivable subject of interest in the scientific literature).
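As a concrete reference point for the phylogenetics example: Newick encodes a tree as nested parentheses, e.g. ((A,B),C);. This sketch serializes a simple nested structure into that notation; the node shape is an illustrative choice, not ContentMine code.

```javascript
// Newick encodes a phylogenetic tree as nested parentheses: "((A,B),C);".
// The node shape below ({ children: [...] } for internal nodes, plain
// strings for leaves) is made up for illustration.
function toNewick(node) {
  if (typeof node === "string") return node; // leaf: just the taxon name
  const inner = node.children.map(toNewick).join(",");
  return "(" + inner + ")" + (node.name || "");
}

const tree = { children: [{ children: ["A", "B"] }, "C"] };
console.log(toNewick(tree) + ";"); // ((A,B),C);
```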

I have a question. Why are you saying you'll never do full texts? You could index all CC-BY and better full texts completely legally, and this would greatly expand the literature search power.

Thanks Richard! Open Knowledge festival is going to be off the hook :)

I realized you and your team (Peter et al.) have been working on a project in a similar space, and was hoping there would be some beneficial overlap. It looks like there is! The ScraperJSON definitions for publishers sound like exactly what Scholar Ninja needs. Let's chat soon :)

Thank you for the quick description of ContentMine; the more awareness about these projects, the better. I've talked to Peter Murray-Rust on a few occasions and have to say that what you are doing with ContentMine is phenomenal, and I wish you all the best. I hope you'll see that our projects are complementary rather than competitive.

About indexing full texts: we do generate keyword indexes based on full texts, but we never add the full text itself to the network. To be clear: you can still search through full text, but you can't access it; the full text doesn't exist on the network, only a keyword: [entry1, entry2, ...] index does. One improvement to Scholar Ninja for open access articles would be the ability to show snippets, like Google Scholar does; it's on the mental TODO list.
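As a rough illustration of that snippet idea (the function name, window size, and ellipses are arbitrary assumptions, not Scholar Ninja code), extracting a Google-Scholar-style context window around a match could look like:

```javascript
// Illustrative sketch of showing snippets: return a small window of
// context around the first match of a query term. The 30-character
// radius and the ellipses are arbitrary choices.
function snippet(text, term, radius = 30) {
  const i = text.toLowerCase().indexOf(term.toLowerCase());
  if (i === -1) return null;
  const start = Math.max(0, i - radius);
  const end = Math.min(text.length, i + term.length + radius);
  return (start > 0 ? "…" : "") + text.slice(start, end) +
         (end < text.length ? "…" : "");
}

const body = "We report that the tumour suppressor gene is silenced in most samples.";
console.log(snippet(body, "suppressor")); // prints a short window around the term
```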

> I've been building a library of ScraperJSON definitions for academic publisher sites, and I've converged on some formats that work for a majority of publishers with no modification (because they silently follow undocumented standards like the HighWire metadata). We've got a growing community of volunteers who will keep the definitions up to date for hundreds or thousands of journals. If you also use our scraper definitions for your metadata you'll get all the publishers for free.

My approach has been to use the Zotero translators, since their 200,000 users have been alright at responding to publisher site changes. Unfortunately, they are trapped in the Firefox ecosystem until someone converts the translators to generic JS. Then Zotero could be a downstream consumer of the same scrapers, and perhaps help maintain them as well.

zotero's scrapers: https://github.com/zotero/translators which for example I use for an IRC bot, https://github.com/kanzure/paperbot

Are these yours or are there more somewhere? https://github.com/ContentMine/journal-scrapers

I dunno about a majority following HighWire.. here's a corpus dump of what I've seen (just random debug data from paperbot): http://diyhpl.us/~bryan/papers2/paperbot/publisherhtml.zip

(Only 333 of the 1218 samples have "citation_pdf_url". But this collection is extremely biased towards things that I am reading, rather than a sample of the entire academic spectrum.)

I started out with the Zotero translators, but they are really messy and not standardised. Our ultimate goal is to make it trivial for non-programmers to define and maintain journal scrapers. That was going to be extremely hard with the Zotero system. We started over by building a generic declarative scraping system. I also aim to get Zotero to eventually adopt our scraper system and collection.

The stuff in that repo is a proof of principle - we will be growing the collection massively before we demo in mid July.

Thanks for the corpus dump, taking a look now.

edit: I'm not suggesting a majority use HighWire, but that we can have far fewer definitions than publishers. If we include Prism and DC along with some obvious sets of CSS class names, that will already get us pretty far.
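To illustrate that fallback idea: the tag names below are real conventions (HighWire, PRISM, Dublin Core), but the regex lookup is a stand-in for a proper DOM query. One ordered list of fallbacks can cover publishers following any of the three schemes:

```javascript
// Try HighWire first, then PRISM, then Dublin Core, so a single
// definition covers many publishers. The regex is a stand-in for a real
// DOM query; only the tag names are real conventions.
const TITLE_TAGS = ["citation_title", "prism.title", "DC.title"];

function findTitle(html) {
  for (const name of TITLE_TAGS) {
    const m = html.match(
      new RegExp(`<meta\\s+name="${name}"\\s+content="([^"]*)"`, "i"));
    if (m) return m[1];
  }
  return null;
}

console.log(findTitle('<meta name="DC.title" content="A study of things">'));
// "A study of things"
```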

Thanks for the highly relevant links. It does make sense to try and outsource the maintenance of scraper rules, especially if there are projects focusing solely on that part.

I hope you carry on with this project. If there's any search engine that can beat Google (long into the future) it's a P2P one.

Speak of the devil: are you aware you can't install extensions from 3rd-party sources anymore at all? You can thank Google for this idiotic and completely self-interested move.

Thanks for the kind words!

Whether or not I'll carry on with this project depends to a large degree on how well it is received within the community and how many master hackers I can get to collaborate. Though I have to say, it has been my most fun project to date, that's for sure.

With regards to extensions, yes, I agree that's silly, but it's their game, so ... I could still install the extension manually on my Chrome for Mac though. I think the limitation is Windows only? Anyway, I'll work on getting this ready for the Chrome web store, so as to remove any barriers that currently exist.

Before I launch on the store, I'd like to be reasonably sure that I'm not breaking any laws and that it works like it should. :)

If you make your extension available as a plain zip file, people can download it, unzip it, and then use the "Load unpacked extension..." feature available if "Developer mode" is checked.

So it's definitely still possible to use extensions on Windows without going through the store.

This is really great and is fully complementary to our Content Mine (contentmine.org).

It's very similar to what I proposed as "the World Wide Molecular Matrix" (WWMM) about 10 years ago (http://en.wikipedia.org/wiki/World_Wide_Molecular_Matrix). P2P was an exciting development then, and there was talk of browser/servers. The technology at the time was Napster-like.

WWMM was ahead of both the technology and the culture. It should work now, and I think Ninja will fly (if that's the right verb). I think we have to pick a field where there is a lot of interest (currently I am fixated on dinosaurs), where there is a lot of Open material, and where the people are likely to have excited minds.

We need a project that will start to be useful within a month, because the main advocacy will be showing that it's valuable. The competition is with centralised, for-profit services such as Mendeley. The huge advantage of Ninja is that it's distributed, which absolutely guarantees non-centrality. The challenges, in no particular order, are apathy and legal attacks (e.g. can it be represented as spyware? I know it's absurd, but the world is becoming absurd).

Love to talk at Berlin.

It seems like nothing like this currently exists in a centralized, non-distributed way. Why add the complexity of a p2p network into an unproven concept? Is it purely to save on the cost of indexing and serving queries?

> Scraping Google is a bad idea, which is quite funny as Google itself is the mother of all scrapers, but I digress.

It's not really "funny"/ironic/etc -- Google put capital into scraping websites to build an index, and you're free to do the same, but you shouldn't expect Google to allow you to scrape their index for free.

EDIT: just saw this:

> Right now, PLOS, eLife, PeerJ and ScienceDirect are supported, so any paper you read from these publishers, while using the extension, will get indexed and added to the network automatically.

Yeah, they're not going to like that. You might want to consult a lawyer.

The P2P network approach here is important for two reasons: first, I do not have full-text access to journals, while researchers using this extension do; and second, I do not have the resources to run a centralized search engine, which, as you say, would be expensive because of index/server costs.

The fact that it's an unproven concept is exciting to me; that, and the fact that it's using very modern technologies to solve an existing problem. If nothing else, maybe this project can serve as a cool demo for the underlying tech, i.e. WebRTC. To my knowledge it is the only keyword-based search engine based entirely in the browser.

I agree with your remark about Google; I was trying to be witty, but my humor often fails me, as my family and friends will be eager to confirm :) The fact that Google doesn't offer public APIs, even paid ones, for any of its search services leads me to believe the numbers just don't add up for them.

With regards to angering publishers: I really, really do not want to get on their bad side, and I can't see how this project could. Its mission is to help users discover content the publishers have, and to help them find the right papers, which are still hosted by the publishers. Scholar Ninja will never be used to circumvent paywalls or share paywalled full-text papers; that is just not in anyone's interest. Scholar Ninja only indexes pages you read anyway, so it doesn't cause any additional load on servers, and it only contains keyword references to documents, e.g. "cancer": ["10.1371/journal.pmed.0010065", ...], which enables keyword searching.
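A minimal sketch of that index structure (none of this is actual Scholar Ninja code, and the tokenizer is deliberately naive): a map from keyword to a list of DOIs, with no full text stored anywhere.

```javascript
// Sketch of the keyword index described above: a map from word to a list
// of document identifiers (DOIs). Only references are kept, never text.
function addToIndex(index, doi, text) {
  for (const word of text.toLowerCase().match(/[a-z]+/g) || []) {
    if (!index[word]) index[word] = [];
    if (!index[word].includes(doi)) index[word].push(doi);
  }
  return index;
}

const index = {};
addToIndex(index, "10.1371/journal.pmed.0010065",
           "Cancer statistics and cancer trends");
console.log(index.cancer); // [ '10.1371/journal.pmed.0010065' ]
```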

Actually PLOS, eLife and PeerJ are all Open Access publishers and explicitly condone this kind of scraping of their sites. They want to promote reuse.

ScienceDirect is owned by Elsevier and is a different kettle of fish. One we all hope to boil in the very near future. However, the title, authors, etc. are not copyrightable and are explicitly free for indexing in the ToS. This is not crawling, only scraping from an already rendered page. They really can't complain.

> This is not crawling, only scraping from an already rendered page. They really can't complain.

You may want to have a look at http://en.wikipedia.org/wiki/Sui_generis_database_right

Thanks for pointing this out, but IANAL; could you briefly explain what this is about in the context of Scholar Ninja?

The problem is that the phrase "This is not crawling, only scraping" is to be taken with a grain of salt. If you find a page with a lot of information and you "just scrape it", you may still be subject to "sui generis" database rights, i.e. you are probably not allowed to reuse the data you just got.

You can say "it is only an alphabetical list of names and titles", but you have to remember that the "sui generis" DB rights have been created to protect phone books, and almost every compilation of data is more complex than a phone book.

Thanks for that explanation, makes sense now. I really don't want to upset anyone and would like Scholar Ninja to be perceived as an additional value to all involved parties. For publishers specifically, it surely must be in their interest that people are able to find their papers. I'm hoping that because Scholar Ninja is also framed as an open-source, non-profit initiative, I'll upset people even less.

Fingers crossed.

Why not index preprints, which are generally available via OAI harvesting?

I'm not following the field closely at the moment, but I'm pretty sure PLOS at least has an OAI interface too.
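For context, OAI-PMH harvesting is just HTTP with standard query parameters (verb, metadataPrefix, etc.). This sketch builds a ListRecords request URL; the endpoint shown is arXiv's public OAI interface, and the helper name and option shape are my own.

```javascript
// OAI-PMH is plain HTTP: a repository exposes an endpoint that answers
// requests like verb=ListRecords. The helper below only constructs the
// request URL; fetching and parsing the XML response is left out.
function oaiListRecords(endpoint, opts = {}) {
  const params = new URLSearchParams({
    verb: "ListRecords",
    metadataPrefix: opts.metadataPrefix || "oai_dc",
    ...(opts.from ? { from: opts.from } : {})
  });
  return `${endpoint}?${params}`;
}

console.log(oaiListRecords("http://export.arxiv.org/oai2", { from: "2014-06-01" }));
// http://export.arxiv.org/oai2?verb=ListRecords&metadataPrefix=oai_dc&from=2014-06-01
```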

I'm not really familiar with the term OAI harvesting, could you elaborate?

With regards to indexing, it looks like we're going to partner with ContentMine (Peter, Richard et al.) to seed the index. Scholar Ninja does not, in essence, discriminate which content should be indexed and which should not, as long as it is science - it is only a matter of implementing rules (to extract authors, title, journal, date, etc) for documents/pages you would like indexed: https://github.com/ScholarNinja/extension/blob/master/app/sc...

Edit: Looked it up. At first glance, it looks like there might be some licenses associated with harvesting this data. Will have to investigate further.

What about http://commoncrawl.org/? Why not use it?

It's very unlikely that commoncrawl.org has access to full-text papers, which mostly sit behind expensive library/university subscriptions.

Before Scholar Ninja reaches the maturity of version 1.0, though, we will be seeding the network with as many sources as we legally and technically can, with a strong focus on properly licensed open access content.
