(1) Many academics aren't aware non-academics read their papers at all: we work with other academics, go to conferences with other academics, and on the rare occasions we hear from readers, it's from other academics. Big exception: in some fields academia and industry have much more interaction, biomedical research (the subject of the linked article) being one of them. Extracting knowledge from that literature has a large number of practical and economic implications.
(2) There seems to be a perception that published papers are a repository of established or state-of-the-art knowledge. Perhaps they were meant to be that way, and perhaps more of them should be. But for many journals in many fields, publications are a form of moderated discussion. Reconstructing the state of knowledge from snippets of conversation is always going to be hard.
What can help make the literature more accessible? Some of the forces are structural, some are due to current limitations of technology. But one thing that can help: if you find the results of a paper interesting and are able to track down the authors, write to them. People like hearing that their work is noticed, and they like talking to people about things they're interested in.
Another is to make constructive suggestions (or even pitch in to improve code where it's open source & available). Between teaching, advising, committee work, etc (not to mention family), most of us have to prioritize, and as much as I'd like to clean up old code for release in the hopes someone finds it useful, it isn't going to get my grad students out the door with a degree or a job -- I'm generally spending more time on their research problems these days than my own. But if I know there's interest / use I might prioritize time a little differently.
Review articles, sometimes called surveys.
I've always thought that new PhDs would be excellent authors for those, having digested lots of literature for their dissertation.
A well-written PhD or MSc thesis is often the best way into a new field, in my experience. If the committee is good on this aspect, they'll insist you've put in enough detail for someone to follow along in a mostly self-contained way.
The text changes to fit the audience, and the knowledge becomes more accepted (and or fundamental) further down the line.
Is this field specific? I have read survey articles in math and biology, and was told by some of my profs that they use these articles as an introduction to a new field.
A quick Google search seems to show these exist in CS (along with tutorial papers), physics, and chemistry but I'm having a little difficulty finding statistics survey papers (survey methods come up instead).
Is the problem that there aren't enough of them or they are behind paywalls?
Please suggest others if you find them.
Annual Reviews has a bunch of journals for surveys of various fields. Most of them are paywalled, but there's ways around that.
Publishing code and data associated with figures. A significant amount of the confusion about the literature comes from the difference between the documentation laid out in a paper and the reality of the actual implementation. Plus any peer review of the actual code used is checking the reality of the paper, rather than a description of what we think reality is.
Making this part of the publication process is vital, because there is such a high barrier to actually requesting this information later.
Full disclosure: your username is very easy to Google, and I should tell you we work on closely related topics. My opinion is shaped in part by several of the papers you've cited. I'm happy to disclose my identity and continue the conversation in private.
"(1) Many academics aren't aware non-academics read their papers at all"
What, then, is the point of applied mathematics? Please know that I don't mean that in a dismissive way. I think there are a few reasonable answers. Chief among them is the belief that exploratory research is important in its own right and does not need to have an immediate non-academic use, as embodied by this quote of Hadamard:
“Practical application is found by not looking for it, and one can say that the whole progress of civilization rests on that principle.”
I believe the above quote as far as it concerns mathematics. But how do we get from there to the applied part? I have an internal dissonance about this that goes deeper than just semantics. Before I started my PhD, I had some vague belief that after writing up some research with an algorithm in it, you'd put it on the arxiv, and from there someone might one day need something like that, code it up, and use it. If I could put in a basic working implementation that was even better.
All the evidence I've seen so far tells me this is not so. The truth is no one is going to take the time to code up your algorithm, because no one has dozens of hours to spend understanding your paper, developing an algorithm suitable for an industrial problem, often just to get improvement on a niche subset of cases. I've been wondering how to estimate the number of algorithms described on the arxiv that are ever implemented and used in a non-academic setting -- my bet is (outside of ML), less than 1%.
I've heard many times that sophisticated higher-order methods for PDEs (finite element / volume, Galerkin, ...) are used in aeronautics to determine the airflow over an airplane wing. I've found out from talking to people in the industry at companies like Bombardier that for the most part they do second-order finite differences like the rest of us. Why? Because you can code it up in an afternoon, whereas writing the more sophisticated methods can take weeks or months. As academics, we think that the theoretical work is the really hard part, and we neglect the human cost of writing and maintaining algorithms. We have it backwards: academics are (relatively) cheap; code (and changing code) is expensive. (Of course, I make these comments assuming a certain scale. We can come back to this.)
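To make the "afternoon's work" point concrete, here's a minimal sketch of the standard second-order central-difference stencil (the test function and step sizes are just an illustration, not anything from a real solver):

```python
import math

def second_derivative(f, x, h):
    # Second-order central difference stencil:
    # f''(x) ~ (f(x - h) - 2 f(x) + f(x + h)) / h^2
    return (f(x - h) - 2.0 * f(x) + f(x + h)) / (h * h)

# Exact answer: d^2/dx^2 sin(x) = -sin(x)
x = 1.0
for h in (0.1, 0.05, 0.025):
    err = abs(second_derivative(math.sin, x, h) - (-math.sin(x)))
    print(f"h = {h}: error = {err:.2e}")
# Halving h cuts the error by roughly 4x: second-order convergence.
```

That's the whole core of the method; the weeks-to-months cost of the fancier schemes is in meshing, basis functions, assembly, and maintenance, not in the arithmetic.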
I think the fundamental issue is that I know few applied mathematicians who start with a problem and seek out a solution. Most often, you finish your (applied) math PhD armed with some machinery. If you want to get a professorship and you've done well, you typically turn the crank of your particular machine better and faster than most. In applied math we can say our model is motivated by some problem in the sciences/economics/whatever, but in my experience that just lets us erect a straw person (create a problem) and tear it down (solve the problem) using the machinery that only we have mastered. Just because a problem is hard doesn't make it important.
What to do, then? How do you work on "consequential" problems?
To be pithy about it, I've found it useful to think in terms of $ rather than h-index. In many cases, a consequential problem is one that, if you solve, you can monetize. You could frame this as asking what kind of mathematics could enable new technologies. In my experience it is very difficult to write down a mathematical question that, if answered, can lead to new technology. But if you manage to find such a question -- and it is possible -- it can be a goldmine.
I have more to say -- especially about how mathematicians need to get a reality check on the importance of hardware and its relevance in stochastic algorithms research -- but this is long enough as it is, and I don't want to just be a crazy person rambling in the corner. I'd be very curious to hear your thoughts.
There are papers that are well-written and useful, but there are at least as many that are just drivel (I probably contributed to both kinds).
Unfortunately, the prevailing attitude is that outside people will not understand our stuff anyway, so we often make no effort to make papers understandable, or to publish data. (There is a lot of great outreach and science communication, but not so much for students or researchers from other fields who want to follow the technical details.)
Your results replicate, or they don't. Your calculations, equations, and models predict experiment. Or they don't.
Writing papers about it and getting the feedback of "peers" is nothing more than an old fashioned circle jerk for padding resumes, CVs, and persuading other people in that academic hierarchy that you deserve funding. It is a game that is divorced from actually learning, researching, understanding, measuring, and predicting the world.
Not saying incentives are perfectly aligned -- many citations are superficial ("this topic was studied before"), and papers count for a lot even if they're never cited, etc
In opensource the attitude is "See bug? Send a PR!"
Whereas academic papers are like publishing software into a blockchain (and not source but binaries, i.e. PDFs full of shortcuts): you don't want for people to easily find bugs and contribute fixes, so you handwave a lot so that no one can reproduce your exact thing.
The difficulty in such a thing would be that the journals and database companies are holding on to their exclusivity and profit motives with an iron fist, so unless you want to get sued into oblivion, you'd have to stick with open-source or accessible articles. That means specializing in disciplines that have moved far enough away from closed journals that the tool wouldn't have massive holes in it.
Also determining which new references and reviews have relevance (like if anybody can comment with new references, who goes through to check they're actually relevant or say what the person says they say?), preventing academics/administrators from gaming the system if it DOES get popular, etc. In open source, this is crowd-sourced, but for some academic fields the number of people who are qualified to speak on a matter is extremely small.
/academic librarian thoughts
It'd be viable for fields that don't use/rely on for-profit or closed journals, but I don't know if the money to run it would be there, especially since the odds of the big Schol Comm players suing is still there, because it'd be worth it to ruin the tool/effort before it can challenge them.
Building this would be my dream job, but hahaha no.
I think there was one pull request total?
The juice just didn't end up being worth the squeeze.
Seriously though, you're totally right. I got very dissatisfied with science when I realized that many people were effectively publishing unreproducible crap created by terrible code. Fortunately, more and more people are learning how to recognize the crap.
It doesn't have to be this way. Here's the process I use in my lab:
1. Every paper that makes a claim of any kind based on code contains a link to a public Git repo.
2. The paper contains the Git hash identifying the exact commit used to justify the claim. Copy-paste it from the paper and check out that exact commit.
3. The repo may have moved on with fixes and improvements, you can have those too.
If you are using version control already, it's not much work to do this. Of course, you have to be committed to making your code public.
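A minimal sketch of that round trip, with a throwaway local repo standing in for the public one (repo name and commit messages are invented for illustration):

```shell
# Create a stand-in for the lab's public repo; in practice you'd clone
# the URL printed in the paper instead.
git init -q demo && cd demo
git -c user.email=a@b.c -c user.name=demo commit -q --allow-empty -m "results as published"
PAPER_HASH=$(git rev-parse HEAD)   # this is the hash that goes into the paper
git -c user.email=a@b.c -c user.name=demo commit -q --allow-empty -m "post-publication fix"

# A reader pastes the hash from the paper to reproduce the exact state:
git checkout -q --detach "$PAPER_HASH"
git log --oneline -1

# ...or checks out the default branch for the later fixes and improvements.
```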
If you think the entire field of academia doesn't achieve any purpose, you may want to reconsider your position. Most likely, almost everything that you do today on a computer began as an academic paper. Yes, it came without code and data. Yet it was not unusable, and it achieved more than enough purpose.
The average comment on HN on academia comes from a mindset where everyone wants a product. The purpose of a paper is NOT to release software or a product, but to test an idea under some assumptions. That's what all research does at its core - formulate a hypothesis, design an experiment to test the hypothesis, and report the results and implications. Are all research papers perfect? No. Are all of them usable? No.
Your use case - sound synthesis for a specific instrument - may not be a scientific challenge. It is however an engineering challenge and hence, you found a better answer amongst hobbyists and tinkerers. Now, try looking for a vaccine for Covid - and guess where you'd find that answer? In decades of research on mRNA with repeated failures, papers that couldn't be replicated, unavailability of "code", and samples with verbal descriptions skipping crucial details.
"The thing is, I don’t care if something has a thousand retweets, what I care about is if it has two or three independent confirmations from economically dis-aligned actors. This is the same as academia, by the way, everybody’s optimizing citations. What you actually want to optimize is independent replication. That’s what true science is. It’s not peer review. It is physical tests."
Others have commented as well but I will reinforce: their output is basically unusable for you for the purpose you want to put it to.
Which is fair, but you should also recognize that you are not the audience of the papers and for good or for ill the system is not set up to help you with this.
I agree that these days the tooling makes it much easier to distribute code & data somehow to match up, but there is also a cost/incentive mismatch. Basically to do a decent release of what you are working on and worse, potentially support it, costs time but has no real career value (yet). Which means it's mostly only done by people who are philosophically convinced of its value.
I think this will change over time, at least in some areas, but it won't be quick.
What a joke.
Try to help humans think better first. If you succeed at that, you might be on the right track towards developing cold fusion, er, general AI.
You get a whole lot of points for discovering something, designing something, or a proof.
But there's a very large amount of people focused entirely on aims that are very, very distant from actually making human lives genuinely better.
Mostly because everyone quietly understands all the extraordinarily complicated mathematics is actually extraordinarily complicated.
Hence the ROI isn't worthwhile.
I can't speak to each and every person working on ML, but I thought I would share a fun use case I ran across the other day.
There is a business in some foreign country that is similar to Uber Eats: customer goes to an app, browses for food from various restaurants, orders, it gets delivered.
The business was using ML to help the restaurants: the restaurants upload a pic of the dishes, a title, and a description (usually all from an existing menu). The business would parse the description to guess at what was in the dish. Scan the picture to guess at the quantity of food (entrée, side, dessert, etc). Compare ingredients against publicly available nutrition info. Now the end consumer can do things like: search for gluten free, vegetarian, pork free, <300 calories, dessert, etc.
Almost all of this was "possible" before, but it would have required enormous effort from the restaurants inputting the data or customers reading each item. Now it is "easy", and it actually helps the end customers - and the restaurants.
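As an illustration of why this is now "easy", here's a hypothetical rule-based core for the dietary-tag part (keyword lists and the sample dish are invented; the real system described above used ML plus image analysis on top of something like this):

```python
# Toy dietary tagger: scan a menu description for keywords.
# All keyword sets are invented for illustration.
MEAT = {"pork", "bacon", "ham", "beef", "chicken"}
PORK = {"pork", "bacon", "ham"}
GLUTEN = {"bread", "pasta", "flour", "noodles", "bun"}

def dietary_tags(description):
    words = set(description.lower().replace(",", " ").split())
    tags = set()
    if not words & MEAT:
        tags.add("vegetarian-candidate")
    if not words & PORK:
        tags.add("pork-free")
    if not words & GLUTEN:
        tags.add("gluten-free-candidate")
    return tags

print(dietary_tags("grilled chicken with rice noodles"))  # {'pork-free'}
```

The "candidate" suffix is the honest part: keyword matching only proposes tags; a human (or a much better model) still has to confirm them.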
I'm an academic librarian, and they're completely different ways of working: When I do academic work, I (ideally) have to take my time and I'm not supposed to present my work until it's developed enough that I'm confident it presents a substantial improvement; I have to prove that it's worth a colleague's time to engage with by meeting certain requirements. Coding/developing, on the other hand, requires a lot more back and forth, a lot more "I don't know", and is more immediate in a way I find very satisfying.
I would LOVE to see more back and forth between engineers and academics in terms of ways of working; I think there's a lot of benefit to be gained there: Tech tends to not consider the future as much as they should, but the academics could really benefit from doing what you mentioned and improve the system they work in rather than accepting it.
One of the things I'm trying to do is get better at/learn some ML so I can play around with turning the things I learned in grad school into useful tools, but I'm a single journeyman dev doing this in my spare time, so the odds of anything actually useful coming out of it are small.
> similar: an app that pops up serendipitous connections between a corpus (previous writings, saved articles, bookmarks ...) and the active writing session or paragraph. The corpus, preferably your own, could be from folders, text files, blog archive, a Roam Research graph or a Notion/Evernote database.
"It likely wouldn't take much to"
Are worlds apart in this case, training and deploying models on that scale is a huge investment, even if you already had all the code and cleaned training data.
Medline, a searchable online directory of medical research papers, has existed for 50 years. The National Library of Medicine was for many years a leader in document search and retrieval before there was a web. In the 80's they were doing vector cosine document similarity, document clustering, and automated classification. They were also doing some great stuff like indexing papers based on proteins and gene sequences - so a paper which might be in a field completely different than yours might pop up if a similar protein or sequence was mentioned.
(Disclosure - I worked at the National Library of Medicine in the 90's)
That being said, in the past 30 years search and retrieval exploded, to say nothing of ML, but it's crazy to ignore the stuff which has come before, AND it's tough to compete with a national lab whose mandate is to basically give the stuff away.
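The vector-cosine document similarity mentioned above is simple enough to sketch in a few lines (raw term-count vectors here; real systems weighted terms, e.g. with tf-idf, and used controlled vocabularies):

```python
import math
from collections import Counter

def cosine(doc_a, doc_b):
    # Represent each document as a bag-of-words count vector,
    # then take the cosine of the angle between the two vectors.
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine("protein binding in yeast", "yeast protein interaction"))
```

Identical documents score 1.0, documents with no shared terms score 0.0, and everything else falls in between - which is all you need to rank "similar papers".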
I did think that the author did a good job of outlining (some of) the basic structural issues that make this a tough field to monetize, but even setting those aside, there's no substitute for actually knowing your users and what they need, and that's something the NLM is amazing at.
(Disclosure, my PhD was funded by an NLM training grant, some of my research is funded extramurally by the NLM, and I have a lot of NLM colleagues, so I'm maybe a little bit biased)
It also felt like a long apology/explanation for Emergent Ventures rather than a true deep analysis. Pretty strong (and often false) statements for only what seems like half a year of total, somewhat vague work.
The author is talking about extracting entity-level information from those article and building knowledge bases. Medline provides access to the raw data, but it does nothing like what the article describes (well - there's https://lhncbc.nlm.nih.gov/ii/tools/SemRep_SemMedDB_SKR.html , so not entirely true. But it's hardly a final answer to the problem or even close to being usable enough that it makes commercial solutions pointless)
In biomedical fields they dismiss more than half of papers out of hand when they do a Cochrane meta-analysis. It raises the question of why such papers (which aren't fit to extract knowledge from) are published or funded at all.
I got a PhD in theoretical physics and routinely found that something was wrong on page 23 of a 50-page calculation and nobody had published anything about it in 30 years. Possibly the whole body of work on string theory since 1980 [most of hep-th] is a pipe dream at best. Because young physicists have to spend the first third of their career in a squid-game fight for survival - not to fathom the secrets of the universe but to please their elders - we get situations like that absurd idea of Stephen Hawking's that information gets lost in a black hole. (E.g., if you believe that, you aren't even going to try quantum gravity.)
That is a big problem - good to recognize it as such.
I can tell because the article, though lengthy, never seems to state an explicit problem to be solved. Rather, various ways to apply technology to a field are discussed.
This is a recipe for failure. You need 3 things:
1. a problem to be solved
2. a customer who has that problem
3. money in the customer's pocket waiting to be transferred to yours
The article never even gets to (1).
Regarding (2), if academic groups are the target customer, you're going to have a bad time. They have little money and they tend to be all too happy to build something that sort-of replicates the commercial product you've created for them.
This leaves scientific for-profit companies. They have lots of problems (and these days money), but these problems tend to be quite difficult to discover and solve because of the extensive domain and industry knowledge required.
Unless there's some nuance I missed, I immensely disagree with this statement.
I'm currently in the biomedical literature review space, and I appreciate the detailed insights. I wonder if the author considered that literature review is used in a wide variety of domains outside pharma/drug discovery (where I perceived their efforts were focused). Regulatory monitoring/reporting, hospital guideline generation, etc.
This is a billion dollar industry, and I couldn't agree more that it's technologically underdeveloped. I do not agree that AI-based extraction is the solution, at least in the near-term. The formal methodologies used by reviewers/meta-analysts: search strategy generation, lit search, screening, extraction, critical appraisal, synthesis/statistical analysis, are IMO more nuanced than an AI can capture. They require human input or review. My business is betting on this premise :)
2. Superhuman accuracy at limited tasks is not good enough. For instance transcribing audio at 95% word-level accuracy would be good for a human but it means every other sentence is garbled. People communicate despite this because they ask questions. A useful text-to-structured information tool has to exert back pressure on bullshit, give feedback about what it understands and push the author to tell a story that makes sense and has adequate detail.
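The "95% means every other sentence is garbled" claim checks out as back-of-the-envelope arithmetic (assuming independent word errors, which is itself an approximation):

```python
# At 95% per-word accuracy, how often does a whole sentence come out clean?
p_word = 0.95
for n_words in (10, 15, 20):
    p_clean = p_word ** n_words
    print(f"{n_words}-word sentence fully correct: {p_clean:.0%}")
```

For a typical 15-word sentence that's about 46%, i.e. more than half of sentences contain at least one error - hence the need for the tool to ask questions rather than emit output silently.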
here's an actual medical paper that my first system, Moliere, was able to help discover:
It seems that one major bias that the author of the post's blog has is their heavy conflation of 'worth' with money.
Many of us probably realize that this is not true, as academia clearly shows. Furthermore, it's not that crazy that they had trouble making money off of software, as I wouldn't expect many startups at all to be able to get customers from software solutions alone.
They also seem to try to compare their success with that of essentially 'other biotech ml companies'; for which I would expect there to be quite a bit of tangible resources that these 'other companies' provide. For example, a startup looking to provide a service of detecting diseases or conditions from DNA methylation data would likely perform the sequencing required before doing an analysis (in order to have good control over experimental conditions).
The materials alone in that case could cost quite a bit - so charging a bit more for the analysis isn't that problematic, since the transaction of some currency is already required for covering material costs.
Anyway, as you mentioned - it seems it's important to recognize that these systems aren't necessarily meant to generate revenue for a startup, but rather are much more useful as a tool in academia.
The incubator introduced me to advisors with business experience in this field. And I got told in no uncertain terms what is the gist of this article: The value lies in the molecular and clinical data. In 2021 I would add digital pathology / imaging data.
I feel like you are trying to tell me something REALLY valuable, but I don't quite understand it. Can you please elaborate?
At least in the field of materials science, papers are by and large a record of research outputs. We made material X and it achieved performance Y - here are a bunch of measurements to prove that this is in fact what we made and that it truly achieved performance Y at relevant conditions, etc. In this sense, papers really function as an advertisement: look at what we achieved.
What papers do not do is rigorously document inputs, or provide a step-by-step guide to reproduce said results, for obvious reasons.
My current take on this topic is that it would be both feasible and valuable to build a knowledge extraction system to compile and compare outputs across a specified field. Think the big “chart of all verified solar cell efficiencies over time”, but generated automatically. This would at least immediately orient researchers to the distribution of state-of-the-art results, and help ensure that they don’t omit relevant references in their reviews.
But extracting and making sense of inputs (methods), or even “knowledge”? Forget about it.
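The "auto-generated chart" idea mostly reduces to a running maximum over extracted records. A toy sketch (the year/efficiency numbers are invented, not real solar-cell data):

```python
# Given extracted (year, reported_efficiency) records from many papers,
# compute the state-of-the-art frontier over time.
records = [(2015, 20.1), (2016, 19.4), (2017, 22.3), (2018, 21.0), (2019, 23.7)]

frontier = []
best_so_far = float("-inf")
for year, eff in sorted(records):
    best_so_far = max(best_so_far, eff)
    frontier.append((year, best_so_far))

print(frontier)
```

The aggregation is trivial; the hard part - as the comment says - is reliably extracting the (year, value) pairs and verifying them in the first place.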
Coming from CS, something I would really like to see though is a tool that would summarize a scientific area/domain.
Something that would kill literature reviews and/or would provide an overview of the hot topics/open questions in different areas.
(At least MIBs are rather harmless, at least in the long run, and can actually yield some benefit: innovative people are drawn to these types of industries and inevitably create cool things as a by-product of their work.)
I really like jobs that people think AI can do in theory but that it can't really do effectively in real life. Where do I get a part-time gig like that, if I think I'm capable of reviewing and summarizing non-STEM papers? Except for homework and assignments, of course.
And you'd have to know enough about the process and authors to know what makes papers relevant. The metadata matters as much as the data.
You have to test out both solutions. And since these jobs are treated as contracts, there is no significant commitment in choosing one over the other. We can't be certain one method is better than the other without trying both of them out, free of prejudice.
I, for one, am agnostic about either choice. AI is overhyped, yet it has spillover benefits as a marketing/sales point, while offshore human intelligence has a bad rep but could be effective if you have a proper documentation, supervision, and review framework.
The other matter is that an Indian who could review papers that well would also cost more than 6k/year and would not be easily replaceable, which eliminates the main benefit of outsourcing for a company trying to operate in such a way in 2021.
In 2030? I'd say the odds are if somebody in Hyderabad can do that then they can start their OWN company rather than bother with us at all. Honestly, given India's role in pharmaceutical manufacture, I'd be shocked if things like that don't start popping up.
I was happy to see a post that clearly states its purpose.
>>... nothing of it will go anywhere.
>>Don’t take that as a challenge. Take it as a red flag and run. Run towards better problems.
Wow, speaking of the value of negative results, that is hugely valuable! Could easily save person-decades of work & funds for more productive results.
The insights that the most relevant knowledge is not written into the publications (for a variety of reasons), that the little that is written down is of limited use to the target audience, and that even when it is useful it addresses only a small part of the workload (i.e., not a real pain point), are key to seeing that the entire category of projects to extract and encode such knowledge is doomed.
For medical papers there are MeSH terms.
Gene information is extracted by FlyBase/WormBase, etc.
It's time-consuming and expensive, and probably not perfect, but for certain types of papers it makes searching better.
The better use case for this is teaching, not creating knowledge bases that nobody will use.
- Is the information that makes science actually work mostly in images that the machines don't yet understand?
- Was the information paywalled or in private databases and inaccessible to this researcher?
- Are the papers mostly just advertisements for researchers to come gab with each other at conferences and doodle on cocktail napkins, and that's where all the "real science" happens?
- (From the comments) is the information needed to make sense of papers communicated privately or orally from PI's to postdocs and grad students, or within industrial research labs?
Something is missing from my mental picture here.
Don't real scientists mostly learn how to think about their fields by reading textbooks and papers? (This is a genuine question.) If so, isn't it likely that our tools just aren't advanced enough to learn like humans do? If not, what do humans use to learn how to think about science that's missing from textbooks and papers?
It's true that the so-called "folk knowledge", knowledge that exists in the community but no one bothers to publish in the form of papers, is a real problem, but at least in my field, it's by no means the majority of knowledge.
As someone from a peripheral university where you can't just drive a few miles and talk to the best in your field, I have successfully started in new subfields of study (getting to the level of writing research papers in top-tier venues) by reading the literature.
While this essay provides a very interesting point of view, I suspect it's heavily colored by the authors' failure to monetize the technology (which is related to the fact that people doing most of the grunt research work, who would benefit the most for this, are PhD students and postdocs who have no money to pay for it - in the text, the author hints at this). I wouldn't take it as an accurate description of academia.
The equivalent to those biomedical industry players are the big tech who develop closed source and push the edge in some area. They will publish but that does not mean you can replicate any of it.
Software is also fragmented, crippled by IP lawsuits, patent trolls and so on. This does inhibit ability of society to benefit from software since it depends on the private sector to sort things out. The PhDs go and build businesses to "make the science work" in that sense.
The ideal of detached pursuit of knowledge is not a complete fiction (despite the hyperbole), but it does remain an ideal that can only be approximated.
Industry is a different world, and indeed there are plenty of opaque industry papers that aren't replicable at all because much of the model is essentially a trade secret, and the paper is more an avenue for bragging than for developing new knowledge together with the rest of the community. To be honest, I would just outright disallow that kind of paper. But that's not a popular opinion, and taking into account that big tech companies sponsor our conferences and provide grants, I can't blame those who think otherwise.
Science is a profession like others. When you are earning your Ph.D. you learn to think about the field by reading papers and discussing with peers and colleagues, yes.
The intro of a well-structured research paper should follow this pattern:
- This is a really important topic and here is why.
- What is the current state of the art in this field? (this comes from reading 100-1000 publications on the topic and selecting the 5-10 most relevant to the next point). HOWEVER, the state of the art leaves this question unanswered.
- Here are some reasons why the idea in this paper can help answer that question (cite another 3-10 papers).
- Our hypothesis is that XXX can answer the important unanswered question (where X is derived from the prior section).
So, what I am getting at, a scientific publication is part of a conversation. When I'm citing the 5-10 papers to summarize the state of the art, I'm assuming the reader has read 50% of the 100-1000 papers which I also read, and knows where the 5-10 which I cite fit into that broader context.
So any paper, in isolation, only has a fraction of its meaning in the publication. The real information is the surrounding context.
Pro tip: if I'm reading a paper and want to understand it better, I also read one or two of the papers it cites, and one or two papers which cite it. Also, it can take a few times through before I start to understand what the author is trying to say.
Also, the mechanical process of effectively reading a paper is highly non-linear, and is a skill in and of itself. In a lot of ways, it is more akin to high-level pattern matching than it is to more "normal" reading. At least at my institution, it is something that we actually teach our students to do in formal ways (the obligatory "How to read a scientific paper" lecture during the first term or two) and then make them practice over and over again for years (journal clubs, etc.). The original author eventually figured this out, which is to their credit.
Yes, it is very broken.
But for most of the academics I have known and worked with, publications are taken quite seriously, and institutional knowledge is freely shared. There is an incentive to reduce the content in papers, but it is out of respect for the reader (a paper is not a textbook) and an honest attempt to limit the discussion to the core hypothesis of the work. You have 6 pages to
1) describe the content of 6*100 pages (the 100 other relevant papers on the topic),
2) present your addition to this body of knowledge,
3) discuss the insights your work brings, again referring to the content of 600 pages.
and those 600 pages you are summarizing are as information-dense as your work.
This feels a lot like the graduate students on the Academia StackExchange who are convinced the moment they present their idea it will get stolen, while every faculty member is like "I have a list of my own ideas that I don't have time to work on as long as my arm."
There should be journals/websites/blogs dedicated to trying to reexplain / implement papers.
I've been reviewing a few C++ papers (things proposed for C++23) lately. Many of them are over my head and all I can find are a few spelling errors. The ones I've understood took me 3 readings before I found some giant holes in the details (which I pointed out to the authors; the next revision corrected them). In one case I actually started implementing the feature myself, and only then did I realize there was something ambiguous in the paper (after talking with the author, we decided to document it as an implementor's choice, since it doesn't matter for the 99% use case, and the rest could go either way depending on hardware and data details, so it's better to allow choice).
The vast majority of papers I'm far too lazy to go into that level of detail on. I just assume the author is a smart person and so I trustingly let them go. It may well be that if I understood the paper I'd be horrified; ask me in 2043 when we have 20 years of hindsight...
I have to believe that peer review is the same - many reviewers are just reading and looking for something obvious but not really understanding details.
I told them that I had to write it 100 times and spend two years before I understood it.
So if they could pick it up in 3 readings over 3 days, they were doing pretty good.
For example, I'm qualified to review papers on educational programs for children. I should never be asked to do that.
This is the world we live in. A scientist is a person that needs to make a living and is subject to various constraints.
The reason that there is little money to be made is that society hasn't found a way to set up the scientific process in such a way that the constraints would value the increase in public domain knowledge higher than the incentives to hold some knowledge back.
Part of this may stem from leaving specialized knowledge to academia while letting only companies reap the monetary rewards of putting the knowledge to use. Society benefits only indirectly (better drugs, machines, etc.), while industry players would rather shield knowledge and adapt its representation to their own needs.
They learn by reading the literature, but also by communicating, and by an active process of testing their own understanding and resolving gaps and inconsistencies. Even when a self-taught genius like Ramanujan comes along, they benefit from being brought into the community.
The question of how one would determine the state of the art in a field has an answer, but at present it would be indistinguishable from training a scientist, rather than running a clever software tool that could synthesize from the literature.
An AlphaZero for science would instead be like the recent deepmind paper where the pattern matching capabilities and internal features of a neural network were used to navigate some domain's decision space of conjecture formation and testing.
If it's not there I won't use it! If you don't provide code with your paper, it had better have a really useful concept in it; otherwise, no citation. Which points back to the problem in the article: the most important information in basic research papers is "Hey, this concept works", as opposed to a rigorous test of exactly what makes the concept work and how to use it in other situations.
"Interesting but I don't know how to make sense of it. How can it be that "close to nothing of what makes science actually work is published as text on the web"?"
I'm not convinced this assertion is true. Difficult to parse by a non-expert? Sure. Often stored in pictures rather than text? Absolutely - indeed, a number of journals directly ask reviewers if the text and graphs are duplicative and consider this a negative. Harder to "disrupt" and monetize than many companies have expected it to be? Certainly.
"Is the information that makes science actually work mostly in images that the machines don't yet understand?"
In my mind, this is the most credible bit of the author's complaints. A lot of science in biomedicine is done in information dense graphics. The author picks especially hard to approach ones, but this is definitely a thing.
"Was the information paywalled or in private databases and inaccessible to this researcher?"
For an outsider without institutional access to journals, this can be a problem. More acutely, there is some lag between "What I'm currently working on" and "What's in the literature" simply because the literature is slow (and has gotten way slower during the pandemic).
"Are the papers mostly just advertisements for researchers to come gab with each other at conferences and doodle on cocktail napkins, and that's where all the "real science" happens?"
In biomedicine, papers are the product. Conferences tend to have a couple uses:
1) Previews of coming attractions - things I'm working on that are close enough to done to talk about, but not so close as to be making the rounds yet. These talks will often have less detail than a paper would, because we've all sat through a presentation that goes into a ton of implementation detail, and they're agonizing. Also I only have fifteen minutes.
2) Looking for postdocs - either from the hiring or seeking end.
3) Building professional networks - this is mostly so when someone comes to me with a problem, I know who might be working on say...causal inference with time-varying exposures...and can reach out to them. Usually to ask what papers I should read to get caught up. Or to bring them in on a paper/grant proposal.
4) Looking for problems other people are having that I can solve, and then reaching out.
"(From the comments) is the information needed to make sense of papers communicated privately or orally from PI's to postdocs and grad students, or within industrial research labs?"
Only insofar as my graduate students and postdocs have access to my time and expertise, and a job that is expressly meant to encourage understanding things. "I don't know, why don't you spend a couple weeks figuring out how they did that" is a perfectly good use of a graduate student's time, but something I find is rarely encouraged elsewhere.
"Don't real scientists mostly learn how to think about their fields by reading textbooks and papers? (This is a genuine question.) If so, isn't it likely that our tools just aren't advanced enough to learn like humans do? If not, what do humans use to learn how to think about science that's missing from textbooks and papers?"
One of the things that's likely missing, because those are all finished products, is the process. For example, I spent an hour chatting with a graduate student about three or four different ways they can approach their problem - what assumptions come with each one, tradeoffs, etc. But only the branch that actually got used is going to be published.
> Extracting, structuring or synthesizing "insights" from academic publications (papers) or building knowledge bases from a domain corpus of literature has negligible value in industry.
> Most knowledge necessary to make scientific progress is not online and not encoded.
> Close to nothing of what makes science actually work is published as text on the web
> The tech is not there to make fact checking work reliably, even in constrained domains.
> Accurately and programmatically transforming an entire piece of literature into a computer-interpretable, complete and actionable knowledge artifact remains a pipe dream.
It also points to existing old-school "biomedical knowledge bases, databases, ontologies that are updated regularly", with expert entry cutting through the noise in a way that NLP cannot.
Although I disagree with its conclusions, much of this jibes with my experience. From the perspective of research, modern NLP and transformers are appropriately hyped, but from the perspective of real-world application, they are over-hyped. Transformers have deeper understanding than anything prior; they can figure out patterns in their context with a flexibility that goes way beyond regurgitation.
They are also prone to hallucinating text, quoting misleading snippets, require lots of resources for inference and enjoy being confidently wrong at a rate that makes industrial use nearly unworkable. They're powerful but you should think hard about whether you actually need them. Most of the time their true advantage is not leveraged.
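One cheap guardrail against the "quoting misleading snippets" failure mode is verbatim grounding: discard any extracted span that does not literally occur in the source document. This is a minimal sketch of the idea, not anything from the article; the function name and the choice of case-insensitive matching are my own assumptions.

```python
def grounded(snippets, source):
    """Keep only extracted snippets that literally occur in the source text.

    Anything the model produced that is not a verbatim (case-insensitive)
    substring of the source is treated as a likely hallucination and dropped.
    """
    norm_source = source.casefold()
    return [s for s in snippets if s.casefold() in norm_source]


source = "Transformers can figure out patterns in their context."
extracted = ["figure out patterns", "cure all diseases"]
print(grounded(extracted, source))  # -> ['figure out patterns']
```

Exact-match grounding is crude (it rejects legitimate paraphrases), but it illustrates why extraction pipelines usually pair a model with some verification step rather than trusting the generated text outright.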
My disagreements are with its advice.
> For recommendations, the suggestion is "follow the best institutions and ~50 top individuals".
But this just creates a rich-get-richer effect and retards science, since most are reluctant to go against those with a lot of clout.
> Why purchase access to a 3rd party AI reading engine...when you can just hire hundreds of postdocs in Hyderabad to parse papers into JSON? (at a $6,000 yearly salary). Would you invest in automation if you have billions of disposable income and access to cheap labor? After talking with employees of huge companies like GSK, AZ and Medscape the answer is a clear no.
This reminds me of responses to questions of the sort: "Why didn't X (where X might be the Ottomans or the Chinese) get to the industrial revolution first?"
Article also warns against working on ideas such as "...semantic search, interoperable protocols and structured data, serendipitous discovery apps, knowledge organization."
A lot of such apps are solutions chasing after a problem, but they could work if designed to solve a specific real-world problem. On the other hand, an outsider trying to start a generalized VC-backed business targeting industry is bound to fail. In fact, this seems a major sticking point in the author's endeavor.
Industry is jaded and set in its ways; startups focus on summarization, recommendation, and retrieval, which are low value in the scientific enterprise; and academia is focused on automation, which turns out brittle. Still, this line of research is needed. Knowledge production is growing rapidly while humans are not getting any smarter. Specialization has meant increases in redundant information, loss of context, and a stall in theory production (hence "much less logic and deduction happening").
While the published literature is sorely lacking, humans can with effort extract and or triangulate value from it. Tooling needs to augment that process.
You're right that they "could work if designed to solve a specific real world problem", but against what baseline? The baseline could be spending that time on actual deep-tech projects rather than NLP meta-science.