
Arxiv Sanity Preserver - stared
http://www.arxiv-sanity.com/
======
smhx
As someone in the deep learning community, I've watched the number of new
papers get out of control. Most of the papers appear on arxiv and are of low
quality. This is particularly problematic right before conference deadlines.

Karpathy's Arxiv-sanity helps a lot to keep in touch with the latest and
greatest deep learning without having to spend all my time reading papers.

------
jchung
As someone who uses arxiv only very rarely, can someone please explain how
this preserves sanity?

------
georgeoliver
Not having anything to do with publishing papers myself, I thought the
100,000+ papers submitted in 2015 sounded like a lot, until I looked into it
[1] and saw there are likely more than a million academic papers published
every year.

What are all those papers like? I wonder if a large proportion are the
equivalent of peer-reviewed blog posts.

[1] [https://www.quora.com/How-many-academic-papers-are-published...](https://www.quora.com/How-many-academic-papers-are-published-each-year)

~~~
xamuel
I wonder how many research professors there are in the world. Basic googling
suggests there are on the order of 1,000,000 university faculty in the world,
so a million papers is roughly 1 paper per faculty member per year, which
doesn't seem unreasonable. Of course, how many of those faculty are serious
researchers, and how many are just instructors, is a harder question.

~~~
dalke
It's not only research professors who publish. Corporate research scientists
also publish, including those at Microsoft, Google, and Intel.

~~~
jrowley
Grad students publish too!

~~~
dalke
True! And undergrads. And private scholars. And ..

The easy way to answer this would be to use something like Web of Knowledge to
get a rough sense of the number of distinct authors.

------
stared
To get some sense of the arXiv submission growth:
[http://arxiv.org/help/stats/2015_by_area/index](http://arxiv.org/help/stats/2015_by_area/index)

------
gaur
> because things were seriously getting out of hand.

Whatever that means.

~~~
karpathy
Imagine waking up every morning with 50 new arxiv papers uploaded that night.
You panic and quickly scan through the papers - any of them could be very
related to your research, or scoop your latest idea, or have good ideas you
can use in your own work. Arxiv makes no attempts to filter these for you, so
it's up to you to carefully scan through this unlabeled list of paper titles.
You eventually find 3 papers that you have to read and put them on your list.
You manage to read 1 that day. Next day you wake up and 50 new papers are up.
You iterate for a few weeks and suddenly you have a toread list of 20 papers
and 100 new arxiv papers just came in that evening. That's what's currently
happening in research at least in deep learning (but I imagine more widely
too), especially around big conference deadlines, and that's what I label
"things seriously getting out of hand".

That's a first use case. The second way things are out of hand is that you
remember this paper from 3 years ago that was very related to this one, but
can't remember its name anymore. Here you can sort by similarity to any
paper, and usually these papers come up on top of the sorted list. This is
also useful for finding related work. Another use case is the peace of mind
that you did not somehow miss papers that you definitely should know about.

Google Scholar is supposed to have similar features: it emails you papers it
thinks you would be interested in and can in principle show similar papers. I
don't know what they do internally but these features are quite terrible and
low quality in my own experience compared to what I get here. More generally
the amount of innovation in Google Scholar over the last few years is sadly
either zero or negative (but overall I still get nightmares about what would
happen to academia if Google pulled a Google Reader with Scholar). For
arxiv-sanity, I compute tfidf vectors over bigrams from the full text of each
paper, rank papers by L2 distance between those vectors for similarity, and
train a personalized SVM per user for recommendations. The results are, at
least for me, significantly better.

~~~
jakub_h
What about using deep learning to properly classify arxiv papers about deep
learning (and other things, perhaps)? ;)

~~~
karpathy
The right tool for the job :) In this case I'm perfectly happy with SVMs over
tfidf bigrams and where that places you in the tradeoff space.

------
zenlikethat
I have been checking this site out lately and using it to download PDFs to
read on my phone later. I really like it! The options to see the most popular
papers and to search by field are really nice.

Thank you Andrej for putting this together and maintaining it.

------
conceit
Who curates the _top recent_ papers seen on first visit? And is personal
preference only accounted for in _recommended_?

------
stared
For talking about arXiv papers and recommending them to others, there is also:
[https://scirate.com/](https://scirate.com/)

~~~
da-bacon
I wrote the first version of Scirate exactly because the number of papers I
had to eyeball each day was so high. Now that I've left academia I find it
even more useful (so huge thanks to those who rewrote it from scratch)! If you
are in quantum computing it definitely helps your sanity.

------
0x54MUR41
That's an awesome website.

I hope it will support TLS for registration or login mechanisms.

------
lucidrains
Hey, if you need any help expanding the site, I'd be game to help out!

------
jeffjose
Looks like the site is down.

~~~
karpathy
Yeah sorry about that - I see cryptic errors in the server logs popping up at
random. I think it must be something to do with the scale of requests coming
in and breaking the site in some way I don't currently understand. I have near
zero experience with scaling web sites. If anyone who does is passionate about
meta research, you're very welcome to look through serve.py and help me out.

One of the problems I caught: error: [Errno 24] Too many open files from
tornado. Trying to fix. (Edit: OK, I made ulimit -n larger and I don't see
this error anymore, at least.)
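
For reference, the same limit that `ulimit -n` controls can be inspected and raised from inside a Python process with the standard `resource` module (Unix only). This is a hedged sketch, not what serve.py does: the helper name `raise_open_file_limit` and the target of 4096 are assumptions for illustration.

```python
import resource

def raise_open_file_limit(target):
    """Try to raise the soft RLIMIT_NOFILE to `target`, capped at the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already at least as high as requested
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        # Some platforms reject the request; keep the existing limit.
        return soft
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]

new_soft = raise_open_file_limit(4096)  # 4096 is an arbitrary example value
```

An unprivileged process can only raise its soft limit up to the hard limit, so a larger value still has to be granted outside the process (for example in the shell or service configuration that launches it).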

~~~
kuprel
Would be cool if you could upvote/downvote papers

~~~
karpathy
I thought about this quite a bit. I don't think I want downvoting. And I don't
want effectless upvoting. The way it is right now is that you can add a paper
to your library, and it will then feed into your personal SVM as a positive
example of papers you like to see more of. Adding a paper to your library
effectively counts as an "up vote", and that is what the "top" tab sorts by.
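
A rough sketch of the "library as positive examples" idea, using a tiny linear SVM trained by full-batch subgradient descent on the hinge loss. Everything here is an assumption for illustration: the solver (a stand-in for whatever SVM implementation the site uses), the 3-dimensional feature vectors (stand-ins for tfidf vectors), the paper names, and in particular the choice to label every paper outside the library as a negative example, which the comment above does not confirm.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize L2-regularized hinge loss; labels in y must be +1 or -1."""
    dim = len(X[0])
    n = len(X)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]  # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # only margin violators contribute to the subgradient
                for j in range(dim):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def score(w, b, x):
    """Signed distance-like score; higher means more like the library."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

# Hypothetical feature vectors for four known papers.
features = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.9, 0.1, 0.0],
    "paper_c": [0.0, 1.0, 0.0],
    "paper_d": [0.0, 0.9, 0.1],
}
library = {"paper_a", "paper_b"}  # papers the user saved

names = sorted(features)
X = [features[name] for name in names]
# Assumption: saved papers are +1, everything else is -1.
y = [1 if name in library else -1 for name in names]
w, b = train_linear_svm(X, y)

# Two unseen papers: one resembling the library, one not.
similar_score = score(w, b, [0.8, 0.1, 0.1])
other_score = score(w, b, [0.1, 0.8, 0.1])
```

Sorting unseen papers by `score` in descending order would then give a recommendation list in which `similar_score` ranks above `other_score`.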

~~~
Houshalter
Does that mean papers are "downvoted" by default, i.e. added to the negative
example list?

------
strahil
Ironically it says "The connection was reset".

