
A map of one million scientific papers from the arXiv - Amorymeltzer
http://paperscape.org
======
dstyrb
This is one of the things most scientists envy physicists for. I would
estimate 95% of those papers are copyrighted, and yet, since we have had this
structure for so long, no publisher tries to pursue us for sharing our work
with the public for free.

I don't know about other fields of physics, but in astro, most of the data is
free access as well. I personally work only with public data and I'm paid to
do it. A string attached to governmental funding from the EU or NSF is
usually a mandated free access database.

Sometimes I take for granted the fact that my morning ritual involves reading
every publication in my field from the day before, without license. And then I
download some free data, program in my free languages, write in my free LaTeX
editor, and then publish my work for free in a place anyone can read it. It's
utopian.

edit: two archives with a lot of different missions data for example:
[http://irsa.ipac.caltech.edu/frontpage/](http://irsa.ipac.caltech.edu/frontpage/)
[https://archive.stsci.edu](https://archive.stsci.edu)

------
FrankenPC
Side note: My dad (RIP. Princeton PhD high energy physics working at UCSD as a
professor/researcher.) lived in the high energy realm for decades. He worked
on every major particle accelerator known and some unknown.

True story. He had a hobby going in a public storage unit with a surplus
military linear accelerator. Smallish. About 30 feet long. Of course it
required huge amounts of power so he cut a hole in the unit and ran a line to
the nearest pole and siphoned off 480 volts from the mains. And the gamma radiation was
very dangerous so he hauled in several tons of lead destined for EPA long term
sequestering. We worked one summer building shielding walls and measuring the
operational radiation. After the unit was 'safely' running, we would take
various pieces of thrown away Lucite from the physics machine shop and turn
them into polished beam trees (Google it). We then gave them away for
Christmas gifts. What fun for a 10 year old kid!

~~~
jacquesm
That's a lovely story.

~~~
FrankenPC
Thanks. An addendum to his life: Apparently my dad led an early experiment at
Fermilab that discovered scaling violations which led to QCD. It was a while
until it was officially confirmed and published. He also worked with a
physicist named Masek at a Stanford SPEAR experiment which discovered a new
quark/anti-quark. Neither got recognition which is how the good old boys
network functions in basic research.

~~~
jacquesm
You should really do some digging and do a proper write up, it would make one
hell of a story and I'm sure your dad would approve. So many people working
hard without any recognition but that's no reason not to illuminate that a
bit.

~~~
FrankenPC
He's gone. And I only have the anecdotal info from colleagues he worked with.
MANY scientists are swept under the carpet due to the recognition power play
that occurs at the high institutional echelons. I can't prove that either. My
dad was a modest man and never extolled his past successes. So, I'll never
really know.

------
mrdrozdov
Amazing. This got me thinking about citation counts. The most cited paper in
Computer Science of all time is Vapnik's Statistical Learning Theory (1998)
with about 10k citations. The most cited paper of any kind of all time is
"Protein Measurement with the Folin Phenol Reagent" by Lowry et al. (1951),
with more than 300k citations. There's a big time gap here, but not big enough
to make up for a difference of more than 290k citations. I always thought that
CS was one of the more prolific paper-writing communities; clearly that's not
the case.

PS. I'm not sure which paper in the arXiv has the greatest number of
citations. I don't think either of these papers is there.

~~~
danharaj
It was my understanding that CS is more of a conference-going community rather
than a paper-writing community.

~~~
seccess
It is true that CS is more conference-oriented, however most top conferences
require that a paper be submitted, reviewed, and (if accepted) published in
the conference proceedings before you can present your work there. This does
vary by discipline though: algs/theory is more traditional journal oriented, I
believe.

------
kartikkumar
Wow awesome! Only a couple of weeks ago I tweeted about the idea of building a
genealogy tree by walking along a graph generated by arXiv. This is a really
neat visualization. Is the codebase open-source?

~~~
ethikal
Yes, see [https://github.com/paperscape](https://github.com/paperscape)

~~~
kartikkumar
Excellent, thanks!

------
kaeluka
Somewhere in this cloud is the paper that could change your life and you have
no idea which one it is.

~~~
iyn
Any ideas how 'content discovery' could work (or be improved) with
research papers? What is the current standard, just the
keywords/topics/authors or is there something else?

~~~
kaeluka
Content discovery does work using citations. How it can be meaningfully
improved, I don't know. Often, the missing piece will come from a completely
different discipline. I don't see how this gap could be bridged using only
citations, unfortunately.

~~~
iyn
That's the thing. I understand that citations are good enough when you know
what you're looking for (at least from my perspective), but imo there's no
good solution to finding seemingly unrelated paper/research that could be 'the
missing piece', hence the question about 'content discovery' :).

------
mdturnerphys
This seems to be quite hampered by only including references that are found
in the arXiv. Two of my papers from grad school are surrounded by papers that
have very little to do with them. My three papers that are on the arXiv are
very spread out, with the distance between two of them being ~90% of the map
height and the third somewhere in the middle. They are all in physics, but
very focused on experiment and apparatus. I think that physics theory (vs.
experiment) is over-represented on the arXiv and connections to theory papers
are much more influential on the map. It would be interesting to redo this
with a database like [http://adsabs.harvard.edu/](http://adsabs.harvard.edu/),
which doesn't depend on author self-selection.

------
mangeletti
Nice work.

I was thinking while navigating this that, if I were researching something
related to physics, etc., this would be much better than using a search
engine, because you might not know exactly what you want to look for until you
see it.

------
udev
This map is beautiful!

But what does (x,y) position mean? If two papers are close on the map are they
also close in some other aspect?

I mean, what gave this map this particular shape?

~~~
Buetol
This most probably uses a force-directed layout:
[https://en.wikipedia.org/wiki/Force-directed_graph_drawing](https://en.wikipedia.org/wiki/Force-directed_graph_drawing)

Basically, it's like having a spring between each node (paper) and letting the
equilibrium do the rest.

~~~
carlob
Yes. From their Facebook about page:

> In laying out the map, an N-body algorithm is run to determine positions
> based on references between the papers. There are two “forces” involved in
> the N-body calculation: each paper is repelled from all other papers using
> an anti-gravity inverse-distance force, and each paper is attracted to all
> of its references using a spring modelled by Hooke’s law.

However it must have taken them a while to converge for 10^6 particles.
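
For anyone curious, the two forces in that description can be sketched in a few lines of Python. This is only a toy illustration, not Paperscape's actual code: the function name `layout`, the constants, the zero rest length for the springs, and the plain all-pairs loop are my own assumptions. The all-pairs repulsion is O(n^2) per step, so it would be hopeless at 10^6 papers without some approximation.

```python
import math
import random


def layout(num_nodes, edges, steps=500, repulsion=1.0, spring_k=0.1,
           step_size=0.05, seed=0):
    """Toy force-directed layout; returns a list of [x, y] positions.

    All names and constants here are illustrative assumptions.
    """
    rng = random.Random(seed)
    pos = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
           for _ in range(num_nodes)]

    for _ in range(steps):
        force = [[0.0, 0.0] for _ in range(num_nodes)]

        # Repulsion between every pair, with magnitude repulsion / distance
        # (the "anti-gravity inverse-distance force").
        for i in range(num_nodes):
            for j in range(i + 1, num_nodes):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                d2 = dx * dx + dy * dy + 1e-9  # avoid division by zero
                fx = repulsion * dx / d2
                fy = repulsion * dy / d2
                force[i][0] += fx
                force[i][1] += fy
                force[j][0] -= fx
                force[j][1] -= fy

        # Attraction along each reference: Hooke's law, assuming a spring
        # with zero rest length.
        for a, b in edges:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            force[a][0] += spring_k * dx
            force[a][1] += spring_k * dy
            force[b][0] -= spring_k * dx
            force[b][1] -= spring_k * dy

        # Simple overdamped update: move each node along its net force.
        for i in range(num_nodes):
            pos[i][0] += step_size * force[i][0]
            pos[i][1] += step_size * force[i][1]

    return pos


if __name__ == "__main__":
    # Papers 0-2 reference each other; paper 3 is linked only to paper 2.
    positions = layout(4, [(0, 1), (1, 2), (0, 2), (2, 3)])
    for idx, (x, y) in enumerate(positions):
        print(idx, round(x, 2), round(y, 2))
```

In this sketch, two papers joined by a single reference settle where the forces balance, repulsion / d = spring_k * d, so at a distance of sqrt(repulsion / spring_k).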

~~~
Buetol
It's actually pretty fast now, could take max one day to get something like
that with
[https://github.com/anvaka/ngraph.offline.layout](https://github.com/anvaka/ngraph.offline.layout)

------
mech4bg
Previous discussion about this with some technical details from one of the
authors:
[https://news.ycombinator.com/item?id=6314730](https://news.ycombinator.com/item?id=6314730)

------
DrNuke
Papers are the tip of the iceberg if we consider applied science and
technology. Patents, actual products / services and, above all, money
generated are much more important imho.

------
visarga
So few CS papers compared to physics.

~~~
bnegreve
CS folks are not really used to uploading their papers to the arXiv. So this is
probably not a good indication of the number of papers published in each
field.

~~~
jwcrux
Any good resources where CS folks typically upload their papers? I've used the
Google/Twitter/etc. "published research" pages, but those are obviously
company-specific.

~~~
seccess
Many CS researchers publish in conferences, not journals, so they tend to be
pretty spread around. Each field usually has a major conference whose
proceedings are worth looking into when they roll around. Of course,
conference papers are behind paywalls, but you can usually find a free version
if you search the authors/paper title in Google Scholar. The system could be
better.

------
chlestakoff
I am curious about the 2D embedding method: what constitutes a vector in the
original "paper space" and how the 2D clusters were determined.

------
pravj
Interesting. I visit arXiv often and notice that most of the new papers are in
the 'astrophysics' and 'high-energy' field, and the map exactly resembles
that.

Can you please enlighten us about the technical details behind the scenes,
right from collecting the data to processing it?

I'm also working with a large graph entity and would love to read about your
process.

~~~
pravj
I see that the data is a little old. I found the GitHub organization for
'paperscape', a tool to visualize arXiv:

    https://github.com/paperscape

------
nekopa
How could we go about making a 3D version of this? I had a distinct feeling of
travelling a galaxy using this. It could be awesome to actually be sitting in
a 'spaceship' (knowledgeship?) and travelling the paths between these papers

~~~
jkldotio
Yes, both t-SNE and force-directed layouts can do 3D as well as 2D. The
following link goes to a "spaceship" force-directed visualisation of Python
GitHub projects; the same author has used his engine for other visualisations
too.

[https://github.com/anvaka/allpypi](https://github.com/anvaka/allpypi)

------
Luc
Very nice and potentially useful. Is there a way to click through to the
papers from there? I feel like it should be possible.

EDIT: Site is probably getting hammered, I just needed to wait a minute for
everything to load.

~~~
pc86
If you click on the individual paper there will be a popup in the upper right
that contains various links.

------
mrdrozdov
How can one tell how many citations a paper has?

EDIT: Clicking a paper and then "(citations)" will show you the one-level
graph of citations, and under the search bar you can see how many results
there were.

------
new_hackers
Wow very cool!! I was looking at the little dots around the edge of the
cluster and thought, "hmm I wonder what these are?". Then I realized I needed
to dust my monitor...

------
ecesena
Can you specify a search query in the url?

