

Show HN: A new take on academic search - mhluongo
http://scholr.ly/

======
kanzure
I seem to recall when Google Scholar came out that they had a set of example
queries. Maybe this was another search engine that did this. Anyway, a set of
sample queries with results that you're proud of would be nice to see on
scholr.ly, because apparently I can't figure out how to use it.

I tried some queries and the results were awful.

query: author:"george church" ... I see only one publication, titled
"registered trademark of AAAS. Synthetic Gene Networks That Count" which must
be wrong.

query: author:"george whitesides" ... again only one publication ("New
Editorial Board Chair"), and a coauthor named "Dropspots Microdevice"
<http://scholr.ly/person/6263228/dropspots-microdevice>

Naturally, his affiliation is listed as "for Lab on a Chip".

query: "Rapid casting of patterned vascular networks for perfusable engineered
three-dimensional tissues" (the title of a recent paper) returned a few
publication results, none of which were the publication in question. It was in
Nature Materials, so it's not exactly obscure.

query: doi:10.1038/nmat3357 ... no results found. Hmm.

Well, I don't know how to use this site. Here's what I really want:

* I want a public domain, reusable index of all papers everywhere. I want bibtex, xml, json, and all the other terrible formats for keeping track of those papers.

* I want working Zotero translators that can properly extract metadata from all science publishing sites. And I just want this data instead of having to fish it out of a search engine.

* I want to be told where I can find pdf files hosted on sites, so that I don't have to send emails begging for copies. Google Scholar does this pretty well.

* Actual data about which papers cite what, instead of trusting the "cited by" links/results in Google Scholar. Also, a "this paper cites..." feature so that I don't have to bother digging out the doi numbers from the bibliography myself.

* Supplementary documents are evil, but they exist and they should be immediately available next to or with any paper.

Edit:

* An index of which libraries at which universities are subscribed to which publishers. If anyone wants to contribute to this information, please go to your library's ezproxy service (ezproxy.lib.whoever.edu:2048/menu or sometimes on a different subdomain) and copy/paste to kanzure@gmail.com for aggregation. This is useful for understanding which universities are likely to have access to which papers. Each service listed usually has only a partial/incomplete subscription, and very rarely a full one, but getting a list of the exact access rights is even harder to squirrel out. Tracking down access to papers is a pain in the butt. ILL is silly and costs way too much ($0.30/query say whaaat).

* I also want an academic aggregator that has the balls to go after OCLC, WorldCat, etc. I wish Bill Gates would just buy the entire scientific publishing industry, but alas! I should emphasize this OCLC alternative would be open-access. Mendeley doesn't count, I can't even get their data out.

I'm definitely neglecting some other major issues...

~~~
mhluongo
Actually- we address some of that! So we don't quite provide a Zotero data
utopia (yet...), but

* each paper page has PDF links and BibTeX

* we extract citation data from PDFs so you can browse the citation graph immediately. We're working on stats right now, but everything we extract is publicly inspectable.

* we're all about open access- in fact, we are _only_ open access, and will do our best to contribute back to bettering academic communication.

* trust me, we have the balls. Still working on the rest.

~~~
jrochkind1
Yeah? Is there a freely available API for your service, which exposes PDF
links among other things? If so, I and many other university library
developers would be VERY interested in your service. If you'd like to get
university library software engineers' attention on your service (and
especially if you have a freely available API), please let us know; I'd
suggest you email some info to the code4lib listserv:
[http://www.lsoft.com/scripts/wl.exe?SL1=CODE4LIB&H=LISTS...](http://www.lsoft.com/scripts/wl.exe?SL1=CODE4LIB&H=LISTSERV.ND.EDU)

~~~
kanzure

        > freely available api for your service, which exposes PDF links
    

I configured an API like this just the other day (zotero's translation-
server). I wanted an IRC bot to fetch pdfs. I call him paperbot. The big
problem was figuring out how to extract the right links and metadata from each
publisher. But the server seems to work.
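For anyone wanting to wire up something similar, here's a minimal sketch of talking to a running translation-server instance. Assumptions: the server listens on its default port 1969, the `/web` endpoint accepts a plain-text URL body and returns Zotero-style item JSON (per the zotero/translation-server README), and `pick_pdf_link` is just an illustrative helper name:

```python
import json
import urllib.request


def fetch_items(page_url, server="http://localhost:1969"):
    """Ask a running translation-server to extract metadata from a page URL."""
    req = urllib.request.Request(
        server + "/web",
        data=page_url.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def pick_pdf_link(item):
    """Return the first PDF attachment URL from a translated item, if any."""
    for att in item.get("attachments", []):
        if att.get("mimeType") == "application/pdf":
            return att.get("url")
    return None
```

A bot like paperbot would then just call `fetch_items` on whatever URL gets pasted into the channel and hand back `pick_pdf_link` for each result.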

------
pataprogramming
Very fast and unfiltered initial reaction:

This site is very heavy. Literature reviews are a process of finding the tens
of papers you need out of thousands of candidates, and this site gives ten
results per slow-loading page. The results take up a lot of screen real
estate, and are not optimized for scanning. The whitespace seems intended to
make the site look pretty and modern, without any particular functional value.

The right column is annoying. The "Authors" heading is way off to the right,
making it hard to figure out what it's supposed to be. And all entries in the
search I did seemed to be labelled "related publications", so it's not
immediately clear why it would be headed "Authors". The author pages are
pretty slight when you get to them and, again, are not well-optimized for
quick visual extraction of information.

The paper page is terrible. Even on my 1920x1200 screen, a long paper name
takes up over half the page height. Useful targets (like obtaining PDFs and
bib info) are small and hard to find relative to the giant, useless title. And
why on earth would one need to click on a "see more" in order to see the full
list of citations? When you do click through, the sliding transition holds no
value and the list is filled with duplicates (e.g.,
[http://scholr.ly/paper/2887595/enhancing-search-
performance-...](http://scholr.ly/paper/2887595/enhancing-search-performance-
on-gnutella-like-p2p-systems-ieee-transactions-on-parallel-and-distributed-
systems-tpds-2006-yingwu-zhu-received-the-phd-degree-in-computer-science-
engineering-from-university-of-cincinnati-in-2005-he-received-the-
bs-a/citations)).

Google Scholar, despite the fact that you can't easily surf citations in both
directions, is very useful for hoovering up a large number of papers so their
relevance can be assessed. This site is not, and doesn't seem to provide any
particular new value in paper discovery. If there's something else going on
here, it isn't immediately obvious.

The academic search space has a lot of opportunity for improvement, but for me
the interface of this site just adds friction to an already painful process.

~~~
mhluongo
First, thanks for the honesty. We're far from where we want to be and I
appreciate the criticism. I'll address your criticisms in another comment, but
first I'd like to ask- what could we do to improve your academic search? Where
are you coming from, and what do you need fixed?

~~~
pataprogramming
I'm a CS grad student, by the way.

Looking for relevant papers involves sifting through a LOT of chaff. For
search results, I tend to want focused density in my results, and I want to do
as little work as possible to get it.

* Scannable

* Enough context to establish possible relevance

* An easy way to obtain the fulltext of the paper and a .bib entry

As far as scannability...

* I'd rather scroll than click.

* I'd rather not scroll than scroll.

The more info I can easily read on each screen, the better, and I want action
links with the search result itself. Clicks that go to other pages or sites
require leaving a trail of tabs open in the browser to avoid losing the search
context. So don't assume that if someone clicks on a link that they want it to
open in the same window. I want to whip through all the garbage as fast as
possible, and every click and animated expanding box makes that harder.

Part of the issue is that search results are only a small part of the paper-
finding process, and the poor quality of most results (as well as text buried
in PDFs) means that a lot of additional steps are required to assess
relevance. I've tried Zotero but don't like being trapped inside it, so have
developed my own workflow for capturing and assessing papers:

First, every paper I download gets a unique identifier that is easy to
recreate from the paper's metadata, so I can figure out what it is just from a
printed hardcopy. The code is similar to the one that Google Scholar used to
generate, slightly extended to improve uniqueness. It's not perfect, but I
think I've had only three collisions during the time I've been using it.
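An identifier like that is easy to roll yourself. Here's a sketch of one plausible scheme, not the commenter's actual code: first-author surname + year + first title word, as Google Scholar's keys do, extended with a short hash of the full title to cut down on collisions:

```python
import hashlib
import re


def paper_code(first_author_surname, year, title):
    """Build a citation key in the style of Google Scholar's
    (surname + year + first title word), extended with a short
    title digest for extra uniqueness."""
    words = re.findall(r"[a-z]+", title.lower())
    first_word = words[0] if words else "untitled"
    # Four hex chars of a title hash: enough to make collisions rare
    suffix = hashlib.sha1(title.lower().encode("utf-8")).hexdigest()[:4]
    return f"{first_author_surname.lower()}{year}{first_word}{suffix}"
```

The key is deterministic, so it can be regenerated from a printed hardcopy's metadata alone, which is the whole point.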

Second, the paper is saved as CODENAME.pdf in a papers directory, and possibly
symlinked to a project directory. I've got a greasemonkey script to
automatically route appropriate sites through my university's ezproxy, but the
slight differences between IEEE, ACM, and Springer are constantly annoying.
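The usual EZproxy trick is a plain URL rewrite: prefix the publisher URL with the library's login endpoint. A sketch of the rule in Python (the proxy hostname is a placeholder, and every institution's setup differs slightly, which is exactly the annoyance described):

```python
from urllib.parse import quote

# Placeholder: substitute your institution's actual EZproxy login URL
PROXY = "https://ezproxy.lib.example.edu/login?url="

# Publisher hosts worth routing through the proxy
PROXIED_HOSTS = ("ieeexplore.ieee.org", "dl.acm.org", "link.springer.com")


def via_ezproxy(url):
    """Route known publisher URLs through the proxy; leave others alone."""
    if any("//" + host in url for host in PROXIED_HOSTS):
        return PROXY + quote(url, safe="")
    return url
```

A greasemonkey script does the same thing in the browser, rewriting links on the fly as pages load.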

Third, a BibTeX entry (with the code as the identifier) is appended to a
master .bib file. Google Scholar's BibTeX entries are often incomplete, so
getting them from the publisher's site is much preferable. Bad entries still
creep in, and have to be cleaned up later if the paper ends up being used as a
reference.

Fourth, an entry for the paper is created in an appropriate .org file, keyed
with the code. Notes will later be transcribed, and keywords appended.

That's the trawling process. Later, I'll go back and actually sort through all
the papers pulled in, to determine whether or not they're really relevant, or
might be relevant to another project. This process can either be using a PDF
reader (which is painful) or using a large pile of actual hardcopy printouts
(which is painful). On Linux, I've yet to find a good way to annotate PDFs, so
hardcopy is actually the most useful. As each paper is assessed, I use
different colored highlighters to mark the most relevant bits, particularly
references that I want to chase (which, for example, get marked with red
highlighter). A quick assessment of the value of the paper is scrawled across
the front page, along with its code. If it's determined to be irrelevant, a
paper can be discarded at any point in the process.

Highlighted references are chased during another trawl. Each reference has to
be entered by hand into Google Scholar, since it doesn't let you surf the
reference chain directly. (MSR's fancy bits are Silverlight-based, so I've
never used it much.) At this point, I'll have the knowledge to guess whether
other works by the same author might be relevant, so I'll do author-specific
searches, or search for later papers by other authors that cite an interesting
one.

Good surveys are of particular interest, if they can be found, as they're
likely to have a high density of good references as well as to be cited by
other researchers working in the same area. Often, I'll want to chase down a
large proportion of the cited papers in a good survey. If particular
conferences or journals are found that are highly relevant, slogging through
the ToCs on the publisher's website is often another way to find useful
connections.

I prefer an assembly-line approach: I don't want to actually read papers while
trawling; I don't want to chase references while reading.

If I click on a paper title in the search results, the most important thing I
want to see on the next screen is not the paper title; it's everything else
about the paper that will let me figure out how much additional attention it's
worth to me. If I've deliberately looked up the paper, that's when I want to
surf a citation graph, or explore other works by the same author.

The process is very messy and only partially automatable. But, any new search
site would have to provide a lot of value relative to Google Scholar in order
to result in a real improvement to the overall workflow.

------
dude_abides
I'm surprised nobody mentioned MSR academic search in the comments yet. It is
light-years better than Google Scholar.

<http://academic.research.microsoft.com/>

Example author page:

[http://academic.research.microsoft.com/Author/1180211/hari-b...](http://academic.research.microsoft.com/Author/1180211/hari-
balakrishnan)

Example citation graph:

[http://academic.research.microsoft.com/VisualExplorer#118021...](http://academic.research.microsoft.com/VisualExplorer#1180211&citation)

Coauthor graph:

[http://academic.research.microsoft.com/VisualExplorer#118021...](http://academic.research.microsoft.com/VisualExplorer#1180211)

And finally, the most addictive feature, coauthor path:

[http://academic.research.microsoft.com/VisualExplorer#118021...](http://academic.research.microsoft.com/VisualExplorer#1180211&1112639)

~~~
mhluongo
I'm not as surprised. I'm not certain MSR markets the project at all, for all
the promise it has.

------
jrochkind1
The big challenge with 'scholarly search' is that what users want is to get to
fulltext, but most scholarly fulltext is behind paywalls.

Google Scholar and MSN Academic both have ways of trying to deal with this.
Interestingly, both often get you to a copy from the extensive number of "a
professor illegally posted a PDF on the public web even though the publisher
holds copyright and wouldn't allow that" copies. Google Scholar also will try
to let you register your academic affiliation as a 'preference' and use the
academic institution's infrastructure to get you to a post-authenticated
licensed copy.

scholr.ly... doesn't seem to do this all that well.

I'm surprised that dude_abides thinks MSN Academic is 'light years better than
Google Scholar', hasn't been my experience.

(I happen to work in the field we're talking about, being a software developer
for a university library)

~~~
mhluongo
We're only open-access right now, so there's no need to supply university
credentials. Just click on a result to expand it and get the full-text, or go
to the paper result page and click "full-text".

------
pseut
So, not trying to be a dick, but what does it do? Is it supposed to be a
better Google Scholar? The "about" page is pretty terse and all it tells me is
that your site is "lovingly crafted" which I'm sure is very wonderful, but
uninformative.

~~~
mhluongo
That's not "dick" at all- we need better messaging. We're a people-focused
academic search engine, currently for computer science. Instead of just
returning papers, we also deliver authors relevant to those papers. The idea
is that there's a bunch of value in the citation and co-authorship networks
that isn't exposed when results are simply flat PDFs.

~~~
pseut
Who is this aimed at? PhD students? Faculty? Businesses looking for
consultants? I'm not that far along in my career (assistant prof in econ) but
I know many of the people working in my area personally, so it's not clear how
much value I'd get out of a search engine like yours -- not that the
information isn't useful, but people working in the area have already
internalized a lot of it. For prospective grad students or other people
outside the field, that obviously wouldn't apply.

~~~
mhluongo
Right now, I think we'll be most useful to grad students. I think we'll also
be a good tool for lit review, less cohesive fields, or interdisciplinary
work. We've found your anecdote to be true time and again, though the effect
varies by field.

I spoke to two CS professors who had interesting views. The first told me he
"knew everyone", and that the tool wasn't worth his time- the second said that
he "thought he knew everyone", but found it useful regardless. He wanted his
students and peers to try it so that they could learn who the movers are in
his part of computer science, and take over the conference he'd been
organizing.

~~~
pseut
You may want to check out the RePEc sites for ideas, they're for econ, so you
may not have come across them. One main page is: <http://ideas.repec.org>
author listings are: <http://ideas.repec.org/i/eall.html>

Just another possible source of ideas.

------
milesokeefe
So far in using it, it appears to offer the same features as Google Scholar,
and I can't find much on the about page. How would you describe your
new take?

Also I would recommend adding a meta image of your icon on at least the index
page, because the only images available for preview when sharing on Facebook
are pictures of the developers.[1]

[1]<http://i.imgur.com/NljWy.png>

~~~
mhluongo
I think I answered this question on the other comment, but more concretely you
can

* browse the citation and co-authorship graphs without jumping between PDFs.

* find authors relevant to your searches- when we do our job right, serendipitously.

I think many of our future features are going to hew closely to "there's power
in people".

------
davekinkead
Here's some feedback from 5 mins of use. (My field is philosophy)

Searching for paper title 'Morality as a System of Hypothetical Imperatives'
doesn't return the paper or author.

Searching for keywords 'Democratic Authority' returns unrelated CS,
engineering, and risk management results but no philosophy or political
science results.

Searching for author wasn't effective - top hit for 'AJ Simmons' was 'Jennifer
Pealer M. A'

Lots of gobbledygook in the results > ⌊aj⌋xj + fj≤f0 > expand > > 2007 > > n�
j=1 ajxj + Substitute aj = ⌊aj ⌋ + fj, b = ⌊b ⌋ + f0. Then,

So all up, not very promising from my initial perspective and as a user, I see
no clear advantage over GS.

The thing that academic search seems to lack right now is the ability to
follow citations easily. In early-stage research, I would love to be able to
enter a keyword, e.g. 'Philosophical anarchism', and follow the citation trail
but limited to that keyword, e.g. show citing papers based on their impact AND
keyword match.
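What's described here is essentially a keyword-filtered walk over the citation graph. A toy sketch of the query semantics (the index structures are invented purely for illustration; no real engine exposes exactly this):

```python
def citing_with_keyword(paper_id, cites_index, keywords_index, keyword,
                        min_impact=0):
    """Return papers citing `paper_id` that match `keyword`, most-cited first.

    cites_index:    paper id -> list of citing-paper dicts
    keywords_index: paper id -> set of keywords
    """
    hits = [
        p for p in cites_index.get(paper_id, [])
        if keyword in keywords_index.get(p["id"], set())
        and p.get("citations", 0) >= min_impact
    ]
    return sorted(hits, key=lambda p: p.get("citations", 0), reverse=True)
```

Chaining this call (feed each hit back in as the next `paper_id`) would give exactly the keyword-limited citation trail being asked for.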

~~~
mhluongo
We're only indexing computer and information science related articles right
now. One of the interface's most obvious failures is communicating that- we'll
try to improve the messaging and make that more clear. Coverage of material in
philosophy is down the road.

What about following keywords in Scholrly doesn't satisfy what you need?

~~~
davekinkead
Compare these two searches from a mathematics & CS related area of philosophy:

[http://scholr.ly/results/?q=bayesian+epistemology&x=0...](http://scholr.ly/results/?q=bayesian+epistemology&x=0&y=0&page=0)
[http://scholar.google.com.au/scholar?q=bayesian+epistemology...](http://scholar.google.com.au/scholar?q=bayesian+epistemology&btnG=&hl=en&as_sdt=1%2C5)

Scholrly omits big names like Hartmann and Talbott. Most of the author section
is actually filled with non-people.

The publications section doesn't show me citation count or any reason why it
might be relevant such as keyword highlights or score. I'd have a very hard
time trusting these results that have such clear omissions.

'Causal Inference' at least includes Judea Pearl as a top hit but clicking the
link [http://scholr.ly/paper/1147771/the-mathematics-of-causal-
inf...](http://scholr.ly/paper/1147771/the-mathematics-of-causal-inference-in-
statistics) sends me to a page that is 80% white space and not very
informative.

Then compare
[http://scholar.google.com.au/citations?user=bAipNH8AAAAJ&...](http://scholar.google.com.au/citations?user=bAipNH8AAAAJ&hl=en&oi=sra)
with <http://scholr.ly/person/5080037/judea-pearl>

Why do I want to look at an empty page with a picture of an owl? (Pearl by the
way is a rockstar CS professor with 400+ papers and nearly 50k citations - you
list a single paper of his)

My biggest pain point of academic search is knowing where I should spend my
limited time researching. GS does an OK job of pointing me in the right
direction but scholrly wastes my time with numerous dead ends.

~~~
mhluongo
On search context- very good point. We're putting together a couple graphical
summaries to try for profiles, as well as exposing keywords on search results
soon. I think you're absolutely right about result trustworthiness and backing
up why something is relevant in a meaningful way.

I'd love to include results from philosophy and related fields, but we just
don't have data sources for them yet. Right now, I think we need to do a
better job hammering in that the data we've indexed is in the computer and
information science arena. We don't have the resources of all the other
players in the field, so I think a niche strategy is important. Are there any
particularly great repositories for bibliographic or full-text data in
philosophy?

On Judea Pearl- you're right, we messed up! Do consider, though, that he's
claimed his Google Scholar Citations profile. We've done all of our profile
inference automatically. Sometimes that leads to a bunch of author pages that
need to be merged, like in Pearl's case- we have all his papers, but didn't
know they were written by the same real-world person, so they're fragmented
([http://scholr.ly/results/?q=judea+pearl&x=0&y=0&...](http://scholr.ly/results/?q=judea+pearl&x=0&y=0&page=0)).
Other times, we have more complete profiles than Google, because many
academics don't claim their Google Scholar Citations profiles.

It's an ongoing concern, so we'll be spending a bunch of time on this problem
(called entity disambiguation in the lit) as well as upgrading venues,
publishers, etc to clickable first-class citizens before our next major
release.

Thanks so much for your feedback- hopefully I'll be able to post again soon
with improvements.

------
olympus
I went and did a couple searches and the publications search always cut out
after the first page. The list of authors would continue for multiple pages
though. Not sure if this is a bug or if I'm doing something wrong, but I
thought I'd point it out. FYI, my searches were just test searches to see how
much depth you had in my field, like "bootstrapping," "pattern recognition,"
and "classifier fusion."

~~~
mhluongo
Sorry! That's a regression, we'll take care of it- thank you!

