
Imagine waking up every morning with 50 new arxiv papers uploaded that night. You panic and quickly scan through the papers - any of them could be very related to your research, or scoop your latest idea, or have good ideas you can use in your own work. Arxiv makes no attempt to filter these for you, so it's up to you to carefully scan through this unlabeled list of paper titles. You eventually find 3 papers that you have to read and put them on your list. You manage to read 1 that day. The next day you wake up and 50 new papers are up. You iterate for a few weeks and suddenly you have a to-read list of 20 papers and 100 new arxiv papers just came in that evening. That's what's currently happening in research, at least in deep learning (but I imagine more widely too), especially around big conference deadlines, and that's what I label "things seriously getting out of hand".

That's a first use case. The second way things are out of hand is that you remember this paper from 3 years ago that was very related to this one, but can't remember its name anymore. Here you can sort by similarity to any paper, and usually these papers come up at the top of the sorted list. This is also useful for finding related work. Another use case is peace of mind: knowing that you somehow did not miss some papers that you definitely should know about.

Google Scholar is supposed to have similar features: it emails you papers it thinks you would be interested in and can in principle show similar papers. I don't know what they do internally, but these features are quite terrible and low quality in my own experience compared to what I get here. More generally, the amount of innovation in Google Scholar over the last few years is sadly either zero or negative (but overall I still get nightmares about what would happen to academia if Google pulled a Google Reader with Scholar). For arxiv-sanity it's tf-idf vectors of bigrams over the full text of each paper; I do L2 nearest-neighbor lookups for similarity ranking and train personalized SVMs per user for recommendations. The results are, at least for me, significantly better.
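For the curious, here is a minimal sketch of that kind of pipeline in scikit-learn. This is only an illustration of the general recipe (tf-idf over n-grams, nearest-neighbor similarity, a per-user linear SVM), not arxiv-sanity's actual code; the corpus, labels, and hyperparameters below are made up.

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.neighbors import NearestNeighbors
  from sklearn.svm import LinearSVC
  import numpy as np

  # Placeholder corpus: in practice, the full text of each arxiv paper.
  papers = ["full text of paper one ...",
            "full text of paper two ...",
            "full text of paper three ..."]

  # tf-idf over unigrams and bigrams; rows come out L2-normalized by default.
  vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=50000)
  X = vectorizer.fit_transform(papers)

  # "Similar papers": nearest neighbors in tf-idf space. With L2-normalized
  # rows, ranking by euclidean distance matches ranking by cosine similarity.
  nn = NearestNeighbors(n_neighbors=3).fit(X)
  distances, indices = nn.kneighbors(X[0])   # papers most similar to paper 0

  # Personalized recommendations: one linear SVM per user, with the user's
  # library as positives and everything else as negatives (toy labels here).
  y = np.zeros(len(papers))
  y[0] = 1
  svm = LinearSVC(C=0.1, class_weight="balanced").fit(X, y)
  ranking = np.argsort(-svm.decision_function(X))   # highest-scoring first

Presumably the real system refits the vectorizer as new papers arrive and retrains each user's SVM from their saved-papers library, but the core idea fits in roughly this much code.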



I have no real point here, only historical commentary.

I've been reading papers from the 1960s, which is when the term "information explosion" was coined. People then were struggling to stay current with the literature, and thought 'things were seriously getting out of hand.'

This was the start of abstracting services, like ISI, where you could even arrange for the results of a keyword search over all the new papers to be sent to you each week - a clear predecessor to personalized RSS feeds.

Going back even further to the immediate post-war era, the library systems of the time, which were structured around books and journals and organized by topic, couldn't keep up with the deluge of research reports which cut across multiple topics. The field of information retrieval, using first punched cards and then computers, started because the publication flow was 'seriously getting out of hand'.

Or for a specific example, after high T_c superconductors were discovered in 1986, there was a mad rush of interest as solid state physicists from around the world explored the new territory. A Google Scholar search for "high temperature superconductor" finds:

  1986 -   846 publications
  1987 - 2 600
  1988 - 3 900
  1989 - 4 780
  1990 - 4 870
  1991 - 5 250
That's 14 papers per day, any one of which might be "very related to your research, or scoop your latest idea, or have good ideas you can use in your own work."

Granted, 14 << 50, but that doesn't include papers about "high Tc" which don't use the whole phrase. Also, those are 14 peer-reviewed papers per day, so there has been some filtering, and experimental research in high Tc requires more equipment than deep learning does.

Think of my comment as a reminder that things have been out of hand for most of a century, and dealing with that deluge emotionally connects you to the headache that generations of researchers before you have had to suffer with. :)


I'm not sure how bad the problem is in other fields, though. I subscribe to a daily arXiv search alert covering physics.comp-ph and physics.flu-dyn, as well as any cross-posts to these. It averages ~20 titles and abstracts per day. I skim the titles, and if a title looks interesting I read the abstract, and if the abstract is interesting I open the full link in a background tab. This takes three minutes each morning, plus the time it takes to read any full papers whose abstracts I found interesting, which is typically 2-3 papers a week. By now I've learned to read papers quickly and save them under a consistent file-naming scheme for future reference.
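If anyone wants to set up a similar daily skim without relying on the email alert, here's a rough sketch against the public arXiv API using the feedparser library. The categories and result count just mirror the setup above and are easy to change; treat the query string as an approximation rather than gospel.

  import feedparser

  # Newest submissions in physics.comp-ph and physics.flu-dyn via the arXiv API.
  url = ("http://export.arxiv.org/api/query?"
         "search_query=cat:physics.comp-ph+OR+cat:physics.flu-dyn"
         "&sortBy=submittedDate&sortOrder=descending&max_results=20")

  feed = feedparser.parse(url)
  for entry in feed.entries:
      title = " ".join(entry.title.split())      # collapse line breaks
      abstract = " ".join(entry.summary.split())
      print(title)
      print(abstract[:300])
      print(entry.link)
      print("-" * 70)

From there it's a short step to filtering titles by keyword or piping the list into whatever to-read system you use.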


Beware research paralysis. There's a point at which you have to ignore the work others are doing so that you can make progress on your own.


As somebody from outside academia: what's the worst that can happen by not reading every paper related to your research? You accidentally end up replicating (part of) somebody else's results? That sounds useful in itself.


I'm not in the field, but have friends who are. From what I understand, replicated experiments are not as highly regarded and do not get published in the same nice journals, and getting published in high-ranking journals is very important if you want a career in research. Another thing you might miss: if somebody does a similar or the same experiment and comes to a negative result, reading that might save you a few days or months of unnecessary work.


> Another thing you might miss: if somebody does a similar or the same experiment and comes to a negative result, reading that might save you a few days or months of unnecessary work.

Given how low the reproduction rate in science is, I'm not sure that time would be wasted.


Yeah, that's basically the worst case. But it isn't useful in itself: most research isn't the kind where having someone do it again would be of any use.

For example, suppose you discover the structure of DNA; you try to publish, but find someone already published it last year. You've just wasted a lot of time.

I don't know what the solution is.


In computer science, that's going to be less helpful. In fields like psychology, sociology and even biology, the initial idea may not be too hard. The onerous part is designing the experiment, running the experiment, and performing the data analysis. The sorts of questions the experiment is trying to answer tend to be multivariate, which means it's easy to do any of those things wrong. Replication is key in sussing out whether any of that went wrong. For psychology and sociology, replicating research can take just as long as the original.

In the fields of computer science where you actually implement something, it's the design and implementation of that new artifact that takes up most of your time. The experiments are not nothing, but they tend to be the sort of thing you can script: run a bunch of programs, accumulate results, do data analysis. Even the data analysis tends to be scripted. If, at the last moment, you discover a small tweak that could improve your implementation, it can be trivial (in effort, not necessarily time) to re-run your experiments.

Now, the experimental evaluation is still important, and it is also easy to do wrong. But I also claim it's more deterministic. If an author is honest in describing their experiment, it's easier for reviewers to cry foul in computer science systems research than in, say, psychology. There are ways in computer science to design poor experiments that show bogus results, but it tends to be more obvious.

If, upon trying to publish, you discover that someone else had a similar idea and implemented something similar, you have replicated the hardest part. You spent a lot of time and effort designing this new thing that overcomes all of these challenges. If someone else already did that, you could have just skipped all of that and started on improving it right away. In computer science systems research, replicating someone's research may actually be much faster. Sometimes you can view their code directly, or you can implement their idea in another system. Re-implementing an idea in a new context can take a lot of engineering effort, but it can still be a lot less work than doing it the first time.

Now, what happens if that was a bogus technique, and you can't replicate the results? That's a publishable result, but it tends not to be the whole paper. You figure out a better way, and explicitly compare your new way to that old published way. Again, that's because in computer science systems research, you're not discovering fundamental properties of things. Instead, you're discovering better ways of doing things.

I do sometimes read computer science systems papers and think "Eh, I don't buy this result". That's usually not because they did anything wrong (although sometimes it is), but because I just think that what they are investigating does not matter. "Sure, I believe you figured out a reliable way to optimize a three wheeled car, but four wheels is still better."

Theoretical computer science is not impacted by this at all, as such papers usually contain no experiments. Their "result" tends to be a proof.


What about using deep learning to properly classify arxiv papers about deep learning (and other things, perhaps)? ;)


The right tool for the job :) In this case I'm perfectly happy with SVMs over tf-idf bigrams and where that places you in the tradeoff space.


Do you want Skynet? This sounds like the start of Skynet...


> Imagine waking up every morning with 50 new arxiv papers uploaded that night.

Others in this thread propose RSS and aggregation of abstracts as a solution. My proposal is to have reviews of literature, and then just read the reviews instead. This should save a bunch of time.



