
Imagine waking up every morning with 50 new arxiv papers uploaded that night. You panic and quickly scan through the papers - any of them could be very related to your research, or scoop your latest idea, or have good ideas you can use in your own work. Arxiv makes no attempt to filter these for you, so it's up to you to carefully scan through this unlabeled list of paper titles. You eventually find 3 papers that you have to read and put them on your list. You manage to read 1 that day. The next day you wake up and 50 new papers are up. You iterate for a few weeks and suddenly you have a to-read list of 20 papers and 100 new arxiv papers just came in that evening. That's what's currently happening in research, at least in deep learning (but I imagine more widely too), especially around big conference deadlines, and that's what I label "things seriously getting out of hand".

That's a first use case. The second way things are out of hand is that you remember this paper from 3 years ago that was very related to this one, but can't remember its name anymore. Here you can sort by similarity to any paper, and usually these papers come up at the top of the sorted list. This is also useful for finding related work. Another use case is peace of mind: knowing that you somehow did not miss some papers that you definitely should know about.

Google Scholar is supposed to have similar features: it emails you papers it thinks you would be interested in and can in principle show similar papers. I don't know what they do internally, but these features are quite terrible and low quality in my own experience compared to what I get here. More generally, the amount of innovation in Google Scholar over the last few years is sadly either zero or negative (but overall I still get nightmares about what would happen to academia if Google pulled a Google Reader with Scholar). For arxiv-sanity it's tf-idf vectors of bigrams over the full text of each paper; I do L2 nearest-neighbor lookups for similarity ranking and train personalized SVMs per user for recommendations. The results are, at least for me, significantly better.
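For the curious, here is a minimal sketch of that kind of pipeline in scikit-learn. This is only an illustration of the general recipe (tf-idf over n-grams, nearest-neighbor similarity, a per-user linear SVM), not arxiv-sanity's actual code; the corpus, labels, and hyperparameters below are made up.

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.neighbors import NearestNeighbors
  from sklearn.svm import LinearSVC
  import numpy as np

  # Placeholder corpus: in practice, the full text of each arxiv paper.
  papers = ["full text of paper one ...",
            "full text of paper two ...",
            "full text of paper three ..."]

  # tf-idf over unigrams and bigrams; rows come out L2-normalized by default.
  vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=50000)
  X = vectorizer.fit_transform(papers)

  # "Similar papers": nearest neighbors in tf-idf space. With L2-normalized
  # rows, ranking by euclidean distance matches ranking by cosine similarity.
  nn = NearestNeighbors(n_neighbors=3).fit(X)
  distances, indices = nn.kneighbors(X[0])   # papers most similar to paper 0

  # Personalized recommendations: one linear SVM per user, with the user's
  # library as positives and everything else as negatives (toy labels here).
  y = np.zeros(len(papers))
  y[0] = 1
  svm = LinearSVC(C=0.1, class_weight="balanced").fit(X, y)
  ranking = np.argsort(-svm.decision_function(X))   # highest-scoring first

Presumably the real system refits the vectorizer as new papers arrive and retrains each user's SVM from their saved-papers library, but the core idea fits in roughly this much code.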



I have no real point here, only historical commentary.

I've been reading papers from the 1960s, which is when the term "information explosion" was coined. People then were struggling to stay current with the literature, and thought 'things were seriously getting out of hand.'

This was the start of abstracting services, like ISI, where you could even arrange for the results of a keyword search over all the new papers to be sent to you each week - a clear predecessor to personalized RSS feeds.

Going back even further to the immediate post-war era, the library systems of the time, which were structured around books and journals and organized by topic, couldn't keep up with the deluge of research reports which cut across multiple topics. The field of information retrieval, using first punched cards and then computers, started because the publication flow was 'seriously getting out of hand'.

Or for a specific example, after high T_c superconductors were discovered in 1986, there was a mad rush of interest as solid state physicists from around the world explored the new territory. A Google Scholar search for "high temperature superconductor" finds:

  1986 -   846 publications
  1987 - 2 600
  1988 - 3 900
  1989 - 4 780
  1990 - 4 870
  1991 - 5 250
That's 14 papers per day, any one of which might be "very related to your research, or scoop your latest idea, or have good ideas you can use in your own work."

Granted, 14 << 50, but that doesn't include papers about "high Tc" which don't use the whole phrase. Also, those are 14 peer-reviewed papers per day, so there has been some filtering, and experimental research in high Tc requires more equipment than deep learning does.

Think of my comment as a reminder that things have been out of hand for most of a century, and dealing with that deluge emotionally connects you to the headache that generations of researchers before you have had to suffer with. :)


I'm not sure how bad the problem is in other fields, though. I subscribe to a daily arXiv search alert covering physics.comp-ph and physics.flu-dyn, as well as any cross-posts to these. It averages ~20 titles and abstracts per day. I skim the titles, and if a title looks interesting I read the abstract, and if the abstract is interesting I open the full link in a background tab. This takes three minutes each morning, plus the time it takes to read any full papers whose abstracts I found interesting, which is typically 2-3 papers a week. By now I've learned to read papers quickly and save them under a consistent file-naming scheme for future reference.
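If anyone wants to set up a similar daily skim without relying on the email alert, here's a rough sketch against the public arXiv API using the feedparser library. The categories and result count just mirror the setup above and are easy to change; treat the query string as an approximation rather than gospel.

  import feedparser

  # Newest submissions in physics.comp-ph and physics.flu-dyn via the arXiv API.
  url = ("http://export.arxiv.org/api/query?"
         "search_query=cat:physics.comp-ph+OR+cat:physics.flu-dyn"
         "&sortBy=submittedDate&sortOrder=descending&max_results=20")

  feed = feedparser.parse(url)
  for entry in feed.entries:
      title = " ".join(entry.title.split())      # collapse line breaks
      abstract = " ".join(entry.summary.split())
      print(title)
      print(abstract[:300])
      print(entry.link)
      print("-" * 70)

From there it's a short step to filtering titles by keyword or piping the list into whatever to-read system you use.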


Beware research paralysis. There's a point at which you have to ignore the work others are doing so that you can make progress on your own.


As somebody from outside academia: what's the worst that can happen by not reading every paper related to your research? You accidentally end up replicating (part of) somebody else's results? That sounds useful in itself.


I'm not in the field, but have friends who are. From what I understand, replicated experiments are not as highly regarded and do not get published in the same nice journals, and getting published in high-ranking journals is very important if you want a career in research. Another thing you might miss: if somebody does a similar or the same experiment and comes to a negative result, reading that might save you a few days or months of unnecessary work.


> Another thing you might miss: if somebody does a similar or the same experiment and comes to a negative result, reading that might save you a few days or months of unnecessary work.

Given how low the reproduction rate in science is, I'm not sure that time would be wasted.


Yeah, that's basically the worst case. But it isn't useful in itself: most research isn't the kind where having someone do it again would be of any use.

For example, suppose you discover the structure of DNA; you try to publish, but find someone already published it last year. You've just wasted a lot of time.

I don't know what the solution is.


In computer science, that's going to be less helpful. In fields like psychology, sociology and even biology, the initial idea may not be too hard. The onerous part is designing the experiment, running the experiment, and performing the data analysis. The sorts of questions the experiment is trying to answer tend to be multivariate, which means it's easy to do any of those things wrong. Replication is key in sussing out whether any of that went wrong. For psychology and sociology, replicating research can take just as long as the original.

In the fields of computer science where you actually implement something, it's the design and implementation of that new artifact that takes up most of your time. The experiments are not nothing, but they tend to be the sort of thing you can script: run a bunch of programs, accumulate results, do data analysis. Even the data analysis tends to be scripted. If, at the last moment, you discover a small tweak that could improve your implementation, it can be trivial (in effort, not necessarily time) to re-run your experiments.

Now, the experimental evaluation is still important, and it is also easy to do wrong. But I also claim it's more deterministic. If an author is honest in describing their experiment, it's easier for reviewers to cry foul in computer science systems research than in, say, psychology. There are ways in computer science to design poor experiments that show bogus results, but it tends to be more obvious.

If, upon trying to publish, you discover that someone else had a similar idea and implemented something similar, you have replicated the hardest part. You spent a lot of time and effort designing this new thing that overcomes all of these challenges. If someone else already did that, you could have just skipped all of that and started on improving it right away. In computer science systems research, replicating someone's research may actually be much faster. Sometimes you can view their code directly, or you can implement their idea in another system. Re-implementing an idea in a new context can take a lot of engineering effort, but it can still be a lot less work than doing it the first time.

Now, what happens if that was a bogus technique, and you can't replicate the results? That's a publishable result, but it tends not to be the whole paper. You figure out a better way, and explicitly compare your new way to that old published way. Again, that's because in computer science systems research, you're not discovering fundamental properties of things. Instead, you're discovering better ways of doing things.

I do sometimes read computer science systems papers and think "Eh, I don't buy this result". That's usually not because they did anything wrong (although sometimes it is), but because I just think that what they are investigating does not matter. "Sure, I believe you figured out a reliable way to optimize a three wheeled car, but four wheels is still better."

Theoretical computer science is not impacted by this at all, as such papers usually contain no experiments. Their "result" tends to be a proof.


What about using deep learning to properly classify arxiv papers about deep learning (and other things, perhaps)? ;)


The right tool for the job :) In this case I'm perfectly happy with SVMs over tf-idf bigrams and where that places you in the tradeoff space.


Do you want Skynet? This sounds like the start of Skynet...


> Imagine waking up every morning with 50 new arxiv papers uploaded that night.

Others in this thread propose RSS and aggregation of abstracts as a solution. My proposal is to have reviews of literature, and then just read the reviews instead. This should save a bunch of time.



