A map of one million scientific papers from the arXiv (paperscape.org)
182 points by Amorymeltzer on Nov 24, 2015 | 55 comments



This is one of the things most scientists envy physicists for. I would estimate 95% of those papers are copyrighted, and yet, since we have had this structure for so long, no publisher tries to pursue us for sharing our work with the public for free.

I don't know about other fields of physics, but in astro, most of the data is free access as well. I personally work only with public data and I'm paid to do it. A string attached to governmental funding, whether European or from the NSF, is usually a mandated free-access database.

Sometimes I take for granted the fact that my morning ritual involves reading every publication in my field from the day before, without license. And then I download some free data, program in my free languages, write in my free latex editor, and then publish my work for free in a place anyone can read it. It's utopic.

edit: two archives with a lot of different missions data for example: http://irsa.ipac.caltech.edu/frontpage/ https://archive.stsci.edu


Side note: My dad (RIP. Princeton PhD high energy physics working at UCSD as a professor/researcher.) lived in the high energy realm for decades. He worked on every major particle accelerator known and some unknown.

True story. He had a hobby going in a public storage unit with a surplus military linear accelerator. Smallish. About 30 feet long. Of course it required huge amounts of power so he cut a hole in the unit and ran a line to the nearest pole and siphoned 480 mains volts. And the gamma radiation was very dangerous so he hauled in several tons of lead destined for EPA long term sequestering. We worked one summer building shielding walls and measuring the operational radiation. After the unit was 'safely' running, we would take various pieces of thrown away Lucite from the physics machine shop and turn them into polished beam trees (Google it). We then gave them away for Christmas gifts. What fun for a 10 year old kid!


That's a lovely story.


Thanks. An addendum to his life: Apparently my dad led an early experiment at Fermilab that discovered scaling violations, which led to QCD. It was a while until it was officially confirmed and published. He also worked with a physicist named Masek at a Stanford SPEAR experiment which discovered a new quark/anti-quark. Neither got recognition, which is how the good old boys network functions in basic research.


You should really do some digging and do a proper write up, it would make one hell of a story and I'm sure your dad would approve. So many people working hard without any recognition but that's no reason not to illuminate that a bit.


He's gone. And I only have the anecdotal info from colleagues he worked with. MANY scientists are swept under the carpet due to the recognition power play that occurs at the high institutional echelons. I can't prove that either. My dad was a modest man and never extolled his past successes. So, I'll never really know.


Looked up Masek. According to his obituary in Physics Today [1], his team at SPEAR "discovered new bound states of a charm and anti-charm quark." Important work, but not a new quark/anti-quark.

[1] http://scitation.aip.org/content/aip/magazine/physicstoday/n...


Amazing. This got me thinking about citation counts. The most cited paper in Computer Science of all time is Vapnik's Statistical Learning Theory (1998) with about 10k citations. The most cited paper of any kind of all time is Protein measurement with the folin phenol reagent by Lowry et al. (1951) with > 300k citations. There's a big time gap here, but not big enough to make up for a > 290k difference in citations. I always thought that CS was one of the more prolific paper writing communities, clearly not the case.

PS. I'm not sure which paper in the arXiv has the greatest number of citations. I don't think either of these papers are there.


That line of reasoning isn't convincing to me (though I have no data to confirm or deny it); CS could still be a more prolific paper-writing community, just one where papers compete more for citations. In CS, if I want to cite something, I have my choice of 10 papers from the same era from roughly the same group of people saying roughly the same thing (some might prefer to cite earlier papers, as the "original"; others might prefer later papers, as the ideas are more clarified). In other fields, there might be one "standard" paper to cite for an era/topic/group.


Couldn't you say the same about other fields? Although I understand where you are coming from. If I wanted to make a more accurate comparison it'd only be fair to examine the distributions of different fields as well as their top performers, but I still think that is too huge a gap to make up for anything besides some heavily skewed distributions.


Definitely! I don't have the breadth of experience or data to really judge how it compares from CS to other fields.


In CS it's customary to stop citing papers at some point. E.g., lots of papers are published about Turing machines without citing Turing.

Also, absolute limitations on page count are really common in CS, and the page counts tend to be pretty low. In other areas, journals might allow for more citations or citations might not count toward page count.

Also, arxiv is physics-biased.


> Also, arxiv is physics-biased.

Have there been any significant CS papers published in the last ~5 years that aren't on Arxiv?

The last thing I remember not being there was a set of papers on IBM Watson that were published in an IBM Systems Engineering Journal.

I have a feeling that some papers out of Microsoft tend not to end up there too, but I can't think of a specific example.


> Have there been any significant CS papers published in the last ~5 years that aren't on Arxiv?

Even if not, there might be insignificant CS papers not indexed by Arxiv which cite significant papers which are indexed ;) This makes the citation counts comparatively lower if most insignificant physics papers are in Arxiv.

That said, it doesn't surprise me much that worldwide there are still more people working in physics, biology or mathematics than in CS.


It was my understanding that CS is more of a conference-going community rather than a paper-writing community.


It is true that CS is more conference-oriented; however, most top conferences require that a paper be submitted, reviewed, and (if accepted) published in the conference proceedings before you can present your work there. This does vary by discipline though: algs/theory is more traditional journal-oriented, I believe.


Yeah, I'd say ACM proceedings are where most of the good papers are published.


Google scholar says closer to 50k citations. Where did you get your information from?


Thanks for the heads up.

I found 38k citations for The Nature of Statistical Learning Theory: https://scholar.google.com/scholar?cluster=86085598803682809...

And 25k for Statistical Learning Theory: https://scholar.google.com/scholar?cluster=86748554971781655...

My initial estimation of 10k was from a CiteSeer list that I didn't realize was limited to only documents in the CiteSeer database: http://citeseer.ist.psu.edu/stats/articles


Wow awesome! Only a couple of weeks ago I tweeted about the idea of building a genealogy tree by walking along a graph generated by arXiv. This is a really neat visualization. Is the codebase open-source?



Excellent, thanks!


Somewhere in this cloud is the paper that could change your life and you have no idea which one it is.


Any ideas how 'content discovery' could work (or be improved) with research papers? What is the current standard, just the keywords/topics/authors, or is there something else?


Content discovery does work using citations. How it can be meaningfully improved, I don't know. Often, the missing piece will come from a completely different discipline. I don't see how this gap could be bridged using only citations, unfortunately.


That's the thing. I understand that citations are good enough when you know what you're looking for (at least from my perspective), but imo there's no good solution to finding seemingly unrelated paper/research that could be 'the missing piece', hence the question about 'content discovery' :).


Forward and backward citations, every paper is a node in a graph of highly related content.
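To make the "every paper is a node" idea concrete, here is a minimal sketch, assuming the citation data is available as plain (citing, cited) pairs. The function names `build_citation_index` and `related` and the paper IDs are made up for illustration; this is not how any particular site does it.

```python
from collections import defaultdict


def build_citation_index(citations):
    """Build both directions of the citation graph.

    `citations` is an iterable of (citing, cited) pairs (assumed format).
    """
    references = defaultdict(set)  # backward: paper -> papers it cites
    cited_by = defaultdict(set)    # forward: paper -> papers that cite it
    for citing, cited in citations:
        references[citing].add(cited)
        cited_by[cited].add(citing)
    return references, cited_by


def related(paper, references, cited_by):
    """Papers one hop away from `paper`, in either direction."""
    backward = references.get(paper, set())
    forward = cited_by.get(paper, set())
    return (backward | forward) - {paper}


# Hypothetical paper IDs, purely for illustration.
citations = [("A", "B"), ("C", "B"), ("B", "D")]
references, cited_by = build_citation_index(citations)
print(sorted(related("B", references, cited_by)))
```

Walking this repeatedly (neighbours of neighbours) gives the wider neighbourhood, but it can only ever reach papers that are connected by citations, which is the limitation raised above.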


Survey papers


This seems to be quite hampered by only including references that are found in the arXiv. Two of my papers from grad school are surrounded by papers that have very little to do with them. My three papers that are on the arXiv are very spread out, with the distance between two of them being ~90% of the map height and the third somewhere in the middle. They are all in physics, but very focused on experiment and apparatus. I think that physics theory (vs. experiment) is over-represented on the arXiv and connections to theory papers are much more influential on the map. It would be interesting to redo this with a database like http://adsabs.harvard.edu/, which doesn't depend on author self-selection.
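A toy illustration of the effect, assuming citations are (citing, cited) pairs and that an index only keeps a link when both ends are in its own collection. The function name `citation_counts` and the paper IDs are hypothetical.

```python
def citation_counts(citations, indexed=None):
    """Count citations per paper.

    If `indexed` is given, only count links whose citing AND cited paper
    are both in that set (an assumption about how a closed index behaves).
    """
    counts = {}
    for citing, cited in citations:
        if indexed is not None and (citing not in indexed or cited not in indexed):
            continue
        counts[cited] = counts.get(cited, 0) + 1
    return counts


# "x" is cited by two indexed papers and one journal-only paper.
citations = [("p1", "x"), ("p2", "x"), ("j1", "x")]
print(citation_counts(citations))                              # all links
print(citation_counts(citations, indexed={"p1", "p2", "x"}))   # index-only links
```

The second count is lower, and any layout driven by those links never sees the connection to the journal-only paper.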


Nice work.

I was thinking while navigating this that, if I were researching something related to physics, etc., this would be much better than using a search engine, because you might not know exactly what you want to look for until you see it.


This map is beautiful!

But what does (x,y) position mean? If two papers are close on the map are they also close in some other aspect?

I mean, what gave this map this particular shape?


This was most probably laid out with a force-directed layout: https://en.wikipedia.org/wiki/Force-directed_graph_drawing

Basically, it's like having a spring between each node (paper) and letting the equilibrium do the rest.


Yes. From their facebook about page:

> In laying out the map, an N-body algorithm is run to determine positions based on references between the papers. There are two “forces” involved in the N-body calculation: each paper is repelled from all other papers using an anti-gravity inverse-distance force, and each paper is attracted to all of its references using a spring modelled by Hooke’s law.

However it must have taken them a while to converge for 10^6 particles.
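For anyone curious what that description amounts to, here is a rough sketch in plain Python. It is not the Paperscape code: the constants, the zero rest length for the springs, and the simple fixed-step update are all my assumptions, and the all-pairs repulsion loop is O(N²) per step, so it is only usable on tiny graphs.

```python
import math
import random


def layout(num_papers, references, steps=500, repulsion=1.0, spring=1.0,
           step_size=0.05, seed=0):
    """Toy 2D force-directed layout.

    `references` is a list of (citing, cited) index pairs. Every pair of
    papers repels with magnitude repulsion / distance; every reference pulls
    its two papers together with Hooke's law (zero rest length, assumed).
    """
    rng = random.Random(seed)
    pos = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(num_papers)]
    for _ in range(steps):
        force = [[0.0, 0.0] for _ in range(num_papers)]
        # Repulsion between all pairs (the expensive part).
        for i in range(num_papers):
            for j in range(i + 1, num_papers):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                dist = math.hypot(dx, dy) or 1e-9  # avoid division by zero
                magnitude = repulsion / dist
                fx, fy = magnitude * dx / dist, magnitude * dy / dist
                force[i][0] += fx
                force[i][1] += fy
                force[j][0] -= fx
                force[j][1] -= fy
        # Spring attraction along each reference.
        for a, b in references:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            force[a][0] += spring * dx
            force[a][1] += spring * dy
            force[b][0] -= spring * dx
            force[b][1] -= spring * dy
        # Move each paper a small step along its net force.
        for i in range(num_papers):
            pos[i][0] += step_size * force[i][0]
            pos[i][1] += step_size * force[i][1]
    return pos


positions = layout(2, [(0, 1)])
print(math.dist(positions[0], positions[1]))
```

With two linked papers the forces balance where repulsion / d equals spring * d, i.e. at d = sqrt(repulsion / spring), which is a handy sanity check. Papers with no reference between them just keep drifting apart in this toy version.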


It's actually pretty fast now, could take max one day to get something like that with https://github.com/anvaka/ngraph.offline.layout


It's based on citations. If you go to 'about' on their site, there's more information about what x, y, size, color, brightness encode in the visualization.


If this is the case, I would love to see a similar map for IEEE papers. The citations are free to view on IEEE website.


Previous discussion about this with some technical details from one of the authors: https://news.ycombinator.com/item?id=6314730


Papers are the tip of the iceberg if we consider applied science and technology. Patents, actual products / services and, above all, money generated are much more important imho.


So few CS papers compared to physics.


CS folks are not really used to uploading their papers to the arXiv. So this is probably not a good indication of the number of papers published in each field.


> CS folks are not really used to uploading their papers to the arXiv.

Maybe not to the same extent as physics people, but there is still a lot of CS on arXiv. More so in some subfields than others, but there's a pretty steady stream of CS papers showing up there. Enough that one person can't keep up with reading and digesting all of them as they appear.

That said, I don't disagree that there's a lot more physics than CS on arXiv. :-) I'm just not sure if that's because CS people don't upload to arXiv, or because CS people publish fewer papers in general, or "other".


Any good resources where CS folks typically upload their papers? I've used the Google/Twitter/etc. "published research" pages, but those are obviously company-specific.


Many CS researchers publish in conferences, not journals, so they tend to be pretty spread around. Each field usually has a major conference whose proceedings are worth looking into when they roll around. Of course, conference papers are behind paywalls, but you can usually find a free version if you search the authors/paper title in Google Scholar. The system could be better.


dl.acm.org won't have the paper uploaded if it wasn't published in an ACM journal or conference, but it has the metadata, including citations, for a huge number of papers published elsewhere.


IACR eprint for crypto, ECCC for complexity theory, though plenty of TCS people use arxiv too.


I am curious about the 2D embedding method: what constitutes a vector in the original "paper space" and how the 2D clusters were determined.


Interesting. I visit arXiv often and notice that most of the new papers are in the 'astrophysics' and 'high-energy' field, and the map exactly resembles that.

Can you please enlighten us about the technical details behind the scene, right from collecting the data to processing it.

I'm also working with a large graph entity and would love to read about your process.


I see that the data is a little old. I found the GitHub organization for 'paperscape', a tool to visualize arXiv.

    https://github.com/paperscape


How could we go about making a 3D version of this? I had a distinct feeling of travelling a galaxy using this. It could be awesome to actually be sitting in a 'spaceship' (knowledgeship?) and travelling the paths between these papers


Yes, both tSNE and force-directed layouts can do 3D as well as 2D. The following link goes to a "spaceship" force-directed visualisation of Python GitHub projects; the same author has used his engine for other visualisations too.

https://github.com/anvaka/allpypi


Very nice and potentially useful. Is there a way to click through to the papers from there? I feel like it should be possible.

EDIT: Site is probably getting hammered, I just needed to wait a minute for everything to load.


If you click on the individual paper there will be a popup in the upper right that contains various links.


How can one tell how many citations a paper has?

EDIT: Clicking a paper and then "(citations)" will show you the one-level graph of citations, and under the search bar you can see how many results there were.


Wow very cool!! I was looking at the little dots around the edge of the cluster and thought, "hmm I wonder what these are?". Then I realized I needed to dust my monitor...


Can you specify a search query in the url?



