
Show HN: An open distributed search engine for science - juretriglav
http://juretriglav.si/an-open-distributed-search-engine-for-science/
======
Blahah
Jure, your projects never cease to impress me. Really looking forward to
talking in depth at OKfest. This idea is so close to what we've been doing
that it's a real shame we didn't talk earlier, but the parts of what you're
doing that are unique are also truly awesome.

At ContentMine we're doing something totally complementary to this. Some of
the tools will overlap and we should be sharing what we're doing. For example,
I've been working on a standardised declarative JSON-XPath scraper definition
format and a subset of it for academic journal scraping. I've been building a
library of ScraperJSON definitions for academic publisher sites, and I've
converged on some formats that work for a majority of publishers with no
modification (because they silently follow undocumented standards like the
HighWire metadata). We've got a growing community of volunteers who will keep
the definitions up to date for hundreds or thousands of journals. If you also
use our scraper definitions for your metadata you'll get all the publishers
for free.
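
To give a flavour, here is a minimal sketch of what a definition looks like
(simplified for illustration; exact field names may differ from the current
draft of the format):

    {
      "url": "plosone\\.org",
      "elements": {
        "title": {
          "selector": "//meta[@name='citation_title']",
          "attribute": "content"
        },
        "fulltext_pdf": {
          "selector": "//meta[@name='citation_pdf_url']",
          "attribute": "content",
          "download": true
        }
      }
    }

Because so many publishers emit the same HighWire-style meta tags, a handful
of definitions like this covers a large share of journals.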

Our goal initially is to scrape the entire literature (we have TOCs for 23,000
journals) as it is published every day. We then use natural language and image
processing tools to extract uncopyrightable facts from the full texts, and
republish those facts in open streams. For example, we can capture all
phylogenetic trees, reverse-engineer the Newick format from images, and submit
them to the Tree of Life. Or we can find all new mentions of endangered
species and submit updates to the IUCN Red List. There's a ton of other
interesting stuff downstream (e.g. automatic fraud detection, data streams for
any conceivable subject of interest in the scientific literature).
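
(For anyone unfamiliar: Newick is the standard parenthesised text encoding of
a phylogenetic tree, optionally with branch lengths, so a recovered tree is
just a short string. An illustrative example:)

    ((Tyrannosaurus:4.2,Allosaurus:3.9):1.1,(Triceratops:6.3,Stegosaurus:5.8):0.7);

Recovering that string from a published figure is what reverse-engineering
the format from images amounts to.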

I have a question. Why are you saying you'll never do full texts? You could
index all CC-BY and better full texts completely legally, and this would
greatly expand the literature search power.

~~~
kanzure
> _I've been building a library of ScraperJSON definitions for academic
> publisher sites, and I've converged on some formats that work for a majority
> of publishers with no modification (because they silently follow
> undocumented standards like the HighWire metadata). We've got a growing
> community of volunteers who will keep the definitions up to date for
> hundreds or thousands of journals. If you also use our scraper definitions
> for your metadata you'll get all the publishers for free._

My approach has been to use the Zotero translators, since their 200,000 users
have been reasonably good at responding to publisher site changes.
Unfortunately the translators are trapped in the Firefox ecosystem until
someone converts them to generic js. Then Zotero could be a downstream
consumer of the same scrapers, and could maybe help maintain them as well.

Zotero's scrapers:
[https://github.com/zotero/translators](https://github.com/zotero/translators)
which for example I use for an IRC bot,
[https://github.com/kanzure/paperbot](https://github.com/kanzure/paperbot)

Are these yours or are there more somewhere?
[https://github.com/ContentMine/journal-scrapers](https://github.com/ContentMine/journal-scrapers)

I dunno about a majority following HighWire... here's a corpus dump of what
I've seen (just random debug data from paperbot):
[http://diyhpl.us/~bryan/papers2/paperbot/publisherhtml.zip](http://diyhpl.us/~bryan/papers2/paperbot/publisherhtml.zip)

(Only 333 of the 1218 samples have "citation_pdf_url". But this collection is
extremely biased towards things that I am reading, rather than a sample of the
entire academic spectrum.)
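
(For context: "citation_pdf_url" is one of the HighWire-style meta tags that
publishers embed in article pages, so checking a rendered page for it is a
one-liner. A quick sketch:)

    // The tag as it appears in publisher HTML:
    //   <meta name="citation_pdf_url" content="http://example.org/paper.pdf">
    // Reading it from an already-rendered page (e.g. a content script):
    var meta = document.querySelector('meta[name="citation_pdf_url"]');
    var pdfUrl = meta && meta.getAttribute('content');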

~~~
Blahah
I started out with the Zotero translators, but they are really messy and not
standardised. Our ultimate goal is to make it trivial for non-programmers to
define and maintain journal scrapers. That was going to be extremely hard with
the Zotero system. We started over by building a generic declarative scraping
system. I also aim to get Zotero to eventually adopt our scraper system and
collection.

The stuff in that repo is a proof of principle - we will be growing the
collection massively before we demo in mid July.

Thanks for the corpus dump, taking a look now.

edit: I'm not suggesting a majority use HighWire, but that we can have far
fewer definitions than publishers. If we include Prism and DC along with some
obvious sets of CSS class names, that will already get us pretty far.
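
Concretely, I mean a fallback along these lines (a sketch; the exact meta
names vary between publishers, and "prism.title" here is illustrative):

    // Try HighWire first, then PRISM, then Dublin Core.
    function firstMeta(doc, names) {
      for (var i = 0; i < names.length; i++) {
        var el = doc.querySelector('meta[name="' + names[i] + '"]');
        if (el) { return el.getAttribute('content'); }
      }
      return null;
    }

    var title = firstMeta(document, ['citation_title', 'prism.title', 'DC.title']);

One definition per metadata vocabulary, plus a few based on common CSS class
names, covers far more sites than one definition per publisher.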

------
higherpurpose
I hope you carry on with this project. If there's any search engine that can
beat Google (long into the future) it's a P2P one.

Speaking of the devil, are you aware you can't install extensions from
third-party sources anymore at all? You can thank Google for this idiotic and
completely self-interested move.

~~~
juretriglav
Thanks for the kind words!

Whether or not I'll carry on with this project depends to a large degree on
how well it's received by the community and how many master hackers I can get
to collaborate. Though I have to say, it has been my most fun project to
date, that's for sure.

With regards to extensions, yes, I agree that's silly, but it's their game, so
... I could still install the extension manually on my Chrome for Mac though.
I think the limitation is Windows only? Anyway, I'll work on getting this
ready for the Chrome web store, so as to remove any barriers that currently
exist.

Before I launch on the store, I'd like to be reasonably sure that I'm not
breaking any laws and that it works like it should. :)

~~~
gorhill
If you make your extension available as a plain zip file, people can download
it, unzip it, and then use the "Load unpacked extension..." feature available
if "Developer mode" is checked.

So it's definitely still possible to use extensions on Windows without going
through the store.

------
petermurrayrust
This is really great and is fully complementary to our ContentMine
(contentmine.org).

It's very similar to what I proposed as "the World Wide Molecular Matrix"
(WWMM) about 10 years ago
([http://en.wikipedia.org/wiki/World_Wide_Molecular_Matrix](http://en.wikipedia.org/wiki/World_Wide_Molecular_Matrix)).
P2P was an exciting development then and there was talk of browser/servers.
At the time, the technology was Napster-like.

WWMM was ahead of both the technology and the culture. It should work now and
I think Ninja will fly (if that's the right verb). I think we have to pick a
field where there is a lot of interest (currently I am fixated on dinosaurs),
where there is a lot of Open material, and where the people are likely to have
excited minds.

We need a project that will start to be useful within a month, because the
main advocacy will be showing that it's valuable. The competition is with
centralised for-profit services such as Mendeley. The huge advantage of Ninja
is that it's distributed, which absolutely guarantees non-centrality. The
challenges - not sure in what order - are apathy and legal challenges (e.g.
can it be represented as spyware? I know it's absurd, but the world is
becoming absurd).

Love to talk at Berlin.

------
yid
It seems like nothing like this currently exists in a centralized, non-
distributed way. Why add the complexity of a p2p network into an unproven
concept? Is it purely to save on the cost of indexing and serving queries?

> Scraping Google is a bad idea, which is quite funny as Google itself is the
> mother of all scrapers, but I digress.

It's not really "funny"/ironic/etc -- Google put capital into scraping
websites to build an _index_, and you're free to do the same, but you
shouldn't expect Google to allow you to scrape their _index_ for free.

EDIT: just saw this:

> Right now, PLOS, eLife, PeerJ and ScienceDirect are supported, so any paper
> you read from these publishers, while using the extension, will get indexed
> and added to the network automatically.

Yeah, they're not going to like that. You might want to consult a lawyer.

~~~
Blahah
Actually PLOS, eLife and PeerJ are all Open Access publishers and explicitly
condone this kind of scraping of their sites. They want to promote reuse.

ScienceDirect is owned by Elsevier and is a different kettle of fish. One we
all hope to boil in the very near future. However, the title, authors, etc.
are not copyrightable and are explicitly free for indexing in the ToS. This is
not crawling, only scraping from an already rendered page. They really can't
complain.

~~~
gioele
> This is not crawling, only scraping from an already rendered page. They
> really can't complain.

You may want to have a look at
[http://en.wikipedia.org/wiki/Sui_generis_database_right](http://en.wikipedia.org/wiki/Sui_generis_database_right)

~~~
juretriglav
Thanks for pointing this out, but IANAL; could you briefly explain what this
is about in the context of Scholar Ninja?

~~~
gioele
The problem is that the phrase "This is not crawling, only scraping" is to be
taken with a grain of salt. If you find a page with a lot of information and
you "just scrape it", you may still be subject to "sui generis" database
rights, i.e. you are probably not allowed to reuse the data you just got.

You can say "it is only an alphabetical list of names and titles", but you
have to remember that the "sui generis" DB rights have been created to protect
phone books, and almost every compilation of data is more complex than a phone
book.

~~~
juretriglav
Thanks for that explanation, it makes sense now. I really don't want to upset
anyone, and I would like Scholar Ninja to be perceived as adding value for
all involved parties. For publishers specifically, it surely must be in their
interest that people are able to find their papers. I'm also hoping that
because Scholar Ninja is framed as an open-source, non-profit initiative,
I'll upset people even less.

Fingers crossed.

------
nl
Why not index preprints, which are generally available via OAI harvesting?

I'm not following the field closely at the moment, but I'm pretty sure PLOS at
least has an OAI interface too.

~~~
juretriglav
I'm not really familiar with the term OAI harvesting, could you elaborate?

With regards to indexing, it looks like we're going to partner with
ContentMine (Peter, Richard et al.) to seed the index. Scholar Ninja does not,
in essence, discriminate which content should be indexed and which should not,
as long as it is science - it is only a matter of implementing rules (to
extract authors, title, journal, date, etc) for documents/pages you would like
indexed:
[https://github.com/ScholarNinja/extension/blob/master/app/sc...](https://github.com/ScholarNinja/extension/blob/master/app/scripts/extractor.js)
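
In spirit, a rule is just a function from a rendered document to structured
metadata. A simplified sketch (the real rules live in extractor.js; this
version assumes HighWire-style citation_* tags):

    function extract(doc) {
      function one(name) {
        var el = doc.querySelector('meta[name="' + name + '"]');
        return el && el.getAttribute('content');
      }
      function all(name) {
        return Array.prototype.map.call(
          doc.querySelectorAll('meta[name="' + name + '"]'),
          function (el) { return el.getAttribute('content'); });
      }
      // The object any rule needs to produce for a page to be indexable:
      return {
        title: one('citation_title'),
        authors: all('citation_author'),
        journal: one('citation_journal_title'),
        date: one('citation_date')
      };
    }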

Edit: Looked it up. At first glance, it looks like there might be some
licenses associated with harvesting this data. Will have to investigate
further.
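
For reference, harvesting seems to boil down to plain HTTP requests like this
(an untested sketch, using arXiv's public OAI-PMH endpoint as an example; it
would need to run server-side or from an extension background page to avoid
CORS):

    // List Dublin Core records published since a given date.
    var url = 'http://export.arxiv.org/oai2' +
              '?verb=ListRecords&metadataPrefix=oai_dc' +
              '&from=2014-06-01';

    fetch(url)
      .then(function (res) { return res.text(); })
      .then(function (xml) {
        // Responses are XML; each <record> carries dc:title, dc:creator, etc.
        console.log(xml.slice(0, 500));
      });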

------
hershel
What about [http://commoncrawl.org/](http://commoncrawl.org/)? Why not use it?

~~~
juretriglav
It's very unlikely that commoncrawl.org has access to full-text papers, since
that access is mostly gated behind expensive library/university subscriptions.

Before Scholar Ninja reaches version 1.0, though, we will be seeding the
network with as many sources as we legally and technically can, with a strong
focus on properly licensed open access content.

