But, I have to ask, how do you get a feel for whether the content is actually correct, and not just quackery? The improvements from old models to new models are usually in the 1% range, and the models are complex. More often than not, the papers are also lacking code, implementation details, experiment procedures, etc.
Basically, I have no idea if the paper is reproducible, if the results are cherry picked from hundreds / thousands of runs, if the paper is just cleverly disguised BS with pumped up numbers to get grants, and so on.
As it is right now, I can only rely on expert assurance from those that peer review these papers - but even then, in the back of my mind, I'm wondering if they've had time to rigorously review a paper. The output of ML / AI papers these days is staggering, and the systems are so complex that I'd be impressed if a single postdoc or researcher had the time to reproduce results.
When reading the newest ML papers, I found it useful to not judge them, but instead use them as inspiration. Forget about the results. The paper may contain interesting ideas or viewpoints you didn't consider before, and those are probably much more valuable than the result table.
Personally, I'm bearish about most deep learning papers for this reason.
I'm not driven by a particular task/problem, so when I'm reading ML papers, it is primarily for new insights and ideas. Correspondingly, I prefer to read papers which have new perspectives on the problem (irrespective of whether they achieve SOTA performance). From what I've seen, most of the interesting (to me) ideas come from slightly adjacent fields. I care far more about interesting and elegant ideas, and use benchmarks just to sanity-check that a nice idea can also be made to work in practice.
As for the obsession with benchmark numbers, I can only quote the old line (usually credited to Andrew Lang, though often attributed to Mark Twain): “Most people use statistics like a drunk man uses a lamppost; more for support than illumination.”
Worry more about whether the result is generalizable. How sensitive is that incremental improvement to hyperparameter tuning, or to how the data is pre-processed, or to the specific problem domain? These days, the academic literature rarely seems to spend much time dwelling on these subjects, which is, at least in my opinion, sufficient reason for industry practitioners to shy away from the cutting edge.
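A quick way to build intuition for this: a minimal simulation (all numbers below are made up for illustration) of a "+1%" claim, where per-seed evaluation noise is on the same order as the reported gap. If the seed-to-seed spread rivals the improvement, a single lucky run can manufacture a SOTA table.

```python
import random
import statistics

def evaluate(base_accuracy, seed, noise=0.015):
    """Simulate one training/evaluation run with seed-dependent noise."""
    rng = random.Random(seed)
    return base_accuracy + rng.gauss(0, noise)

seeds = range(20)
model_a = [evaluate(0.90, s) for s in seeds]          # baseline
model_b = [evaluate(0.91, 100 + s) for s in seeds]    # claimed "+1%" method

gap = statistics.mean(model_b) - statistics.mean(model_a)
spread = statistics.stdev(model_a)
print(f"mean gap: {gap:.3f}, per-seed std dev: {spread:.3f}")
# When the spread is comparable to the gap, one cherry-picked run can
# easily show a 1% "win" that disappears under re-runs.
```

Papers that report a mean and deviation over multiple seeds at least let you do this comparison yourself; single-number result tables don't.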
I am less familiar with how this works out on classifiers, but I can say that this is the elephant in the room with topic modeling. Hyperparameter tuning and data cleaning are much more important than choice of algorithm. Perhaps even more importantly (at least if you're trying to understand different algorithms' relative merits), the method you choose for evaluating quality is critical: One setup will be clearly better if you are focused on the data's syntagmatic qualities, but perform terribly if you instead focus on the paradigmatic. And vice versa. In short, the question, "What algorithm is best?" is malformed and unanswerable.
There's an interesting paper from a while ago where it turned out that the vector space model that performed best when evaluated against the TOEFL synonym test was good old latent semantic analysis. I find that result noteworthy because it's one of the few papers that took a real live test that was designed for evaluating the skills of real live humans, and used it to evaluate a machine learning model. At the same time, that in no way implies that LSA is the best fit for your sentiment analysis pipeline.
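The mechanics of that kind of evaluation are simple to sketch: for each question, the model picks the answer choice whose vector is closest to the stem word's vector. A toy version (the words and vectors below are invented for illustration, not real LSA output):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Made-up 3-d embeddings standing in for a trained vector space model.
vectors = {
    "enormous": [0.9, 0.1, 0.2],
    "huge":     [0.8, 0.2, 0.1],   # the correct synonym
    "tiny":     [0.1, 0.9, 0.3],
    "angry":    [0.2, 0.3, 0.9],
}

def answer(stem, choices):
    """Pick the choice most similar to the stem, as on a synonym test."""
    return max(choices, key=lambda w: cosine(vectors[stem], vectors[w]))

picked = answer("enormous", ["huge", "tiny", "angry"])
print(picked)  # "huge"
```

Score the model by the fraction of questions it answers correctly, exactly as you would grade a human test-taker.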
What is important are the ideas in the paper, and how they translate to your context. If an idea is relevant, you can try it out - it may give you more than 1% on your data set :)
I know it's a bit elitist, but we all have only limited time in life. Also, these institutions are usually the only ones making actual breakthroughs, for money reasons: they brute-force many hyperparameters on their giant clusters, in a context where a single training run can cost $15k.
Also most papers which "look interesting" or "edgy" are usually a disappointment.
That also suggests those aren't breakthroughs. They are just someone getting 1% because their corporate sponsor spent $250k more than the other guys.
Look at the ideas, not the results. Is there something new and is it clearly expressed? If you can't answer that in five minutes, move on. Ideas transfer, results don't.
In particular, if an ML paper abstract states a percentage improvement over SOTA and then lists five existing techniques that were combined to get the result, you can just put it directly on the trash pile.
As further evidence: the entire concept of neural networks has existed since the '80s. It was running them on new hardware (GPUs) that made them so important. And "new hardware" is always expensive.
Maybe in 10 years every startup will have a massive cluster instead of a 4-GPU PC with current FLOPS capabilities (which would have been a luxury a few decades ago).
> But, I have to ask, how do you get a feel for whether the content is actually correct, and not just quackery?
If you're familiar with the particular subfield, you can spot problematic evaluation methods, how much they follow general best practices, whether they cite all relevant approaches or "curate" their tables by intentionally leaving out methods that outperform theirs, etc. You can "smell" whether something sounds plausible. If you're unfamiliar with the field, start with highly-regarded conferences like CVPR/NeurIPS/ICLR, especially orals.
Authors squeeze their methods to press out that 1% improvement, because it's difficult to get through peer review these days without state-of-the-art numbers. Many reviewers are themselves not very experienced, do not spend much time on each paper and give large weight to quantitative results.
So be aware that the primary target audience of papers is often not really the general reader, but the reviewers.
> Basically, I have no idea if the paper is reproducible, if the results are cherry picked from hundreds / thousands of runs, if the paper is just cleverly disguised BS with pumped up numbers to get grants, and so on.
If they've released their code, it can be a positive sign.
> As it is right now, I can only rely on expert assurance from those that peer review these papers - but even then, in the back of my mind, I'm wondering if they've had time to rigorously review a paper.
Depends on the venue. But don't treat peer review as verification or confirmation of truth. It's more like a spam filter: it just means that 2-4 reviewers (often PhD students) went through it, spending perhaps a few hours each, and found it worth presenting to the community.
Peer review is never about reproduction, in any science. The reviewers for a psychology journal will not recruit their own subjects and redo the experiment, for example.
There's definitely a good amount of trust, gut feelings, paper-gestalt, are-they-one-of-us and similar subjective effects at play when a paper gets accepted and the process is known to be noisy.
Here's my take on adapting it to reading code:
1. Read the README if available, and read the list of source files to get a sense of how the project is modularized. Identify the entry point, and identify the type of program from the main entry point: is it a server, a CLI, or a graphical app?
2. Run a call graph analysis tool if you have one, so you can study the call graph starting from the main entry point. Read just the function names and start making notes on how the execution works at various levels, e.g. does it read input then enter an infinite loop, does it wait on network packets, does it use an update/render loop, etc. Also note whether each function is trivial or non-trivial, based on a quick glance at the code.
3. Ignore the trivial ones, and read the non-trivial ones in detail. Make note of the algorithm, data structures, and dependencies.
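Step 2 doesn't need a heavyweight tool to get started. A minimal sketch with Python's stdlib `ast` module (the example source is invented; real tools handle methods, imports, and dynamic dispatch far better):

```python
import ast

# Hypothetical module to analyze, standing in for a real source file.
source = '''
def parse_args(): ...
def handle_request(): ...
def serve_forever():
    while True:
        handle_request()
def main():
    parse_args()
    serve_forever()
'''

def call_graph(src):
    """Map each function name to the plain-name functions it calls."""
    tree = ast.parse(src)
    graph = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            calls = [
                n.func.id
                for n in ast.walk(node)
                if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
            ]
            graph[node.name] = calls
    return graph

graph = call_graph(source)
print(graph["main"])  # ['parse_args', 'serve_forever']
```

Starting from `main`, you can walk the graph top-down and note which callees deserve a detailed read, which is exactly the triage in step 3.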
For example, I like to read about things like ML, scaling, filesystems, databases, algorithms, etc.
I do get a lot of input through HN, friends, YouTube, and blogs, but I'm not getting my papers from direct sources. I don't have anything like Nature lying around either.
For instance: USENIX ATC, USENIX Security, OSDI, SOSP, PLDI, ICSE, FSE, NDSS, ASPLOS, and CCS consistently have work I find interesting.
There are new entries every day. You may want to check it out.
I like that format. A short tidbit of abstract, as well as description tags.
That's like 100 papers you would need to skim every day :|.
There should be an arXiv Reddit mode :/
But seriously, if you want to know the differences look up the etymologies and then keep your eyes open for how each term is used in practice. That's all there is to it.
Not exactly something that illuminates the usage in scientific literature.