The business of extracting knowledge from academic publications (markusstrasser.org)
285 points by kevin_hu on Dec 9, 2021 | 116 comments

I can confirm that in my current area of interest (how to synthesize a cello or saxophone sound), there are hundreds of academic papers published over decades, each of them says "our method sounds more realistic than others", but code and audio samples are never available, and verbal descriptions always skip crucial details. I have no doubt that academics have a ton of expertise, but their output in paper form is basically unusable, I'm not sure it achieves any purpose besides resume padding. Reading a forum of synth hobbyists is a hundred times more useful.

I'm an academic (applied math) and want to respond to this: academic papers are the way they are for lots of reasons, many of which (not so good) have been mentioned on HN. There are a couple that I do not see very often however:

(1) Many academics aren't aware non-academics read their papers at all: we work with other academics, go to conferences with other academics, and on the rare occasions we hear from readers, it's from other academics. Big exception: in some fields academia and industry have much more interaction, biomedical research (the subject of the linked article) being one of them. Extracting knowledge from that literature has a large number of practical and economic implications.

(2) There seems to be a perception that published papers are a repository of established or state-of-the-art knowledge. Perhaps they were meant to be that way, and perhaps more of them should be. But for many journals in many fields, publications are a form of moderated discussion. Reconstructing the state of knowledge from snippets of conversation is always going to be hard.

What can help make the literature more accessible? Some of the forces are structural, some are due to current limitations of technology. But one thing that can help is this: if you find the results of a paper interesting and are able to track down the authors, write them. People like hearing their work is noticed, and they like talking to people about things they're interested in.

Another is to make constructive suggestions (or even pitch in to improve code where it's open source & available). Between teaching, advising, committee work, etc (not to mention family), most of us have to prioritize, and as much as I'd like to clean up old code for release in the hopes someone finds it useful, it isn't going to get my grad students out the door with a degree or a job -- I'm generally spending more time on their research problems these days than my own. But if I know there's interest / use I might prioritize time a little differently.

> What can help make the literature more accessible?

Review articles, sometimes called surveys.

I've always thought that new PhDs would be excellent authors for those, having digested lots of literature for their dissertation.

> I've always thought that new PhDs

A well written PhD or MSc thesis is often the best way into a new field, ime. If the committee is good on this aspect they'll insist you've put enough detail in for someone to follow along mostly self contained.

Also, I believe there is a hierarchy that goes something like: academic papers -> review articles -> specialized books -> text books.

The text changes to fit the audience, and the knowledge becomes more accepted (and or fundamental) further down the line.

There's very little academic kudos for survey articles or other educational articles (as another commentator mentioned, PhD theses are often the best place to go for those types of summary). I'm not sure how one changes that. I remember thinking during my PhD how helpful it would be to have more survey type articles - especially for the subfields adjacent to one's own - you want to know if there are useful connections to your work but you don't want to delve through dozens of extremely dense papers.

> Review articles, sometimes called surveys.

Is this field specific? I have read survey articles in math and biology, and was told by some of my profs that they use these articles as an introduction to a new field.

A quick Google search seems to show these exist in CS (along with tutorial papers), physics, and chemistry but I'm having a little difficulty finding statistics survey papers (survey methods come up instead).

Is the problem that there aren't enough of them or they are behind paywalls?

For Stats, check out https://www.annualreviews.org/journal/statistics

Please suggest others if you find them.

Annual Reviews has a bunch of journals for surveys of various fields. Most of them are paywalled, but there's ways around that.

You hit the nail on the head. Will put some of that in the appendix of the post!

> What can help make the literature more accessible?

Publishing code and data associated with figures. A significant amount of the confusion about the literature comes from the difference between the documentation laid out in a paper and the reality of the actual implementation. Plus any peer review of the actual code used is checking the reality of the paper, rather than a description of what we think reality is.

Making this part of the publication process is vital, because there is such a high barrier to actually requesting this information later.

First, let me recognize the openness with which you're approaching the problem, and that you've acknowledged there is a problem. My academic background is also applied math, and I believe there is a missing viewpoint that I've been struggling to articulate to colleagues for several years. I'll take your thoughts as an opportunity to try and better understand this view / feeling. If it's alright with you, I'll limit the discussion to applied math, because it's what I know best and where I think the viewpoint is most pertinent.

Full disclosure: your username is very easy to Google, and I should tell you we work on closely related topics. My opinion is shaped in part by several of the papers you've cited. I'm happy to disclose my identity and continue the conversation in private.

You wrote: "(1) Many academics aren't aware non-academics read their papers at all"

What, then, is the point of applied mathematics? Please know that I don't mean that in a dismissive way. I think there are a few reasonable answers. Chief among them is the belief that exploratory research is important in its own right and does not need to have an immediate non-academic use, as embodied by this quote of Hadamard:

“Practical application is found by not looking for it, and one can say that the whole progress of civilization rests on that principle.”

I believe the above quote as far as it concerns mathematics. But how do we get from there to the applied part? I have an internal dissonance about this that goes deeper than just semantics. Before I started my PhD, I had some vague belief that after writing up some research with an algorithm in it, you'd put it on the arxiv, and from there someone might one day need something like that, code it up, and use it. If I could put in a basic working implementation that was even better.

All the evidence I've seen so far tells me this is not so. The truth is no one is going to take the time to code up your algorithm, because no one has dozens of hours to spend understanding your paper, developing an algorithm suitable for an industrial problem, often just to get improvement on a niche subset of cases. I've been wondering how to estimate the number of algorithms described on the arxiv that are ever implemented and used in a non-academic setting -- my bet is (outside of ML), less than 1%.

I've heard many times that sophisticated higher-order methods for PDEs (finite element / volume, Galerkin, ...) are used in aeronautics, to determine the wind shape over an airplane wing. I've found out from talking to people in the industry at companies like Bombardier that for the most part they do second order finite difference like the rest of us. Why? Because you can code it up in an afternoon, whereas writing the more sophisticated methods can take weeks or months. As academics, we think that the theoretical work is the really hard part, and we neglect the human cost of writing and maintaining algorithms. We have it backwards: academics are (relatively) cheap; code (and changing code) is expensive. (Of course, I make these comments assuming a certain scale. We can come back to this.)
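To make the "code it up in an afternoon" point concrete, here is a minimal sketch of a second-order central difference in plain Python. The test function (sin), the evaluation point, and the step sizes are illustrative assumptions, not anything taken from an industrial code. Halving the step should cut the error by roughly a factor of four, which is what "second order" means.

```python
import math

def second_derivative(f, x, h):
    """Second-order central difference approximation of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

def error_at(h, x=1.0):
    # For f = sin, the exact second derivative is -sin(x).
    return abs(second_derivative(math.sin, x, h) - (-math.sin(x)))

coarse = error_at(0.1)    # error with step h
fine = error_at(0.05)     # error with step h / 2
ratio = coarse / fine     # close to 4 for a second-order scheme

print(f"error(h=0.1)={coarse:.2e}  error(h=0.05)={fine:.2e}  ratio={ratio:.2f}")
```

The whole scheme is one line of arithmetic, which is a large part of why it survives in practice against methods that take weeks to implement.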

I think the fundamental issue is that I know few applied mathematicians who start with a problem and seek out a solution. Most often, you finish your (applied) math PhD armed with some machinery. If you want to get a professorship and you've done well, you typically turn the crank of your particular machine better and faster than most. In applied math we can say our model is motivated by some problem in the sciences/economics/whatever, but in my experience that just lets us erect a straw person (create a problem) and tear it down (solve the problem) using the machinery that only we have mastered. Just because a problem is hard doesn't make it important.

What to do, then? How do you work on "consequential" problems?

To be pithy about it, I've found it useful to think in terms of $ rather than h-index. In many cases, a consequential problem is one that, if you solve, you can monetize. You could frame this as asking what kind of mathematics could enable new technologies. In my experience it is very difficult to write down a mathematical question that, if answered, can lead to new technology. But if you manage to find such a question -- and it is possible -- it can be a goldmine.

I have more to say -- especially about how mathematicians need to get a reality check on the importance of hardware and its relevance in stochastic algorithms research -- but this is long enough as it is, and I don't want to just be a crazy person rambling in the corner. I'd be very curious to hear your thoughts.

I may be a bit cynical, but at least in my former field (experimental physics), the main purpose of papers seems to be to "lock in" a finished achievement. You do the actual research, pass internal reviews and peer review, and then publishing the paper is just to make it "official". Unfortunately, many papers are never expected to be read. The crucial information exists, but you usually get it from personal communication, internal wikis, or review articles. You just need the paper to copy a formula or graph, and to cite it in the end.

There are papers that are well-written and useful, but there are at least as many that are just drivel (I probably contributed to both kinds).

Unfortunately, the prevailing attitude is that outside people will not understand our stuff anyway, so we often make no effort to make papers understandable, or to publish data. (There is a lot of great outreach and science communication, but not so much for students or researchers from other fields who want to follow the technical details.)

I did my PhD in experimental physics, and I have to say that my realisation of this larger point, that papers are little more than resume padding to lock in an achievement, was a significant contributor towards destroying (and I use "destroying" seriously here) any faith or trust that peer review or publishing has anything to do with the scientific method at all.

Your results replicate, or they don't. Your calculations, equations, and models predict experiment. Or they don't.

Writing papers about it and getting the feedback of "peers" is nothing more than an old fashioned circle jerk for padding resumes, CVs, and persuading other people in that academic hierarchy that you deserve funding. It is a game that is divorced from actually learning, researching, understanding, measuring, and predicting the world.

In academia there is always a difference between the way results are advertised and what conclusions are drawn internally. This is more true in some fields than others; I'm most familiar with it in ML and physics. Part of your skill as a researcher is to understand, based on omissions, the datasets, etc., the quiet part that isn't said out loud. Depending on how you sell things, you can get a Nature / Science paper with confusing, inconsistent terminology and a hand-rolled C++ implementation, provided you are the first, while another method which might be 1000x faster will only make it into PRL (yes, I'm thinking of two specific papers, but won't say which).

There's probably space for a startup that properly archives the technical nature of findings.


Counterpoint: citations are a valuable currency in science. Arguably one of the best ways to earn citations is to do good work and write clear papers.

Not saying incentives are perfectly aligned -- many citations are superficial ("this topic was studied before"), and papers count for a lot even if they're never cited, etc

I'm an outsider but it seems to me the difference between academia and opensource/hobby forums is massive:

In opensource the attitude is "See bug? Send a PR!"

Whereas academic papers are like publishing software into a blockchain (and not source but binaries, i.e. PDFs full of shortcuts): you don't want for people to easily find bugs and contribute fixes, so you handwave a lot so that no one can reproduce your exact thing.

The biggest difference IMHO is when comparing to something like Wikipedia or Stackoverflow. I wish the fabric of scholarly communication similarly allowed for browsing reviews, updating papers, commenting with new references, etc.

I think this is a valuable idea. There are online archives that allow for paper updating for academics, like SSRN, but as a CONSUMER of academic literature, the land is pretty barren.

The difficulty in such a thing would be that the journals and database companies are holding on to their exclusivity and profit motives with an iron fist, so unless you want to get sued into oblivion, you'd have to stick with open source or accessible articles. That means you'd need to specialize in disciplines that have moved away from closed-source enough that the tool wouldn't have massive holes in it.

Also determining which new references and reviews have relevance (like if anybody can comment with new references, who goes through to check they're actually relevant or say what the person says they say?), preventing academics/administrators from gaming the system if it DOES get popular, etc. In open source, this is crowd-sourced, but for some academic fields the number of people who are qualified to speak on a matter is extremely small.

/academic librarian thoughts

Now THAT might be a realistic technical goal & business opportunity.

The legal costs make this a non-starter unless it's done by a giant company. Who would, in my opinion, ruin it, and the odds of enough academics complying with a big tech company are small imo.

It'd be viable for fields that don't use/rely on for-profit or closed journals, but I don't know if the money to run it would be there, especially since the odds of the big Schol Comm players suing is still there, because it'd be worth it to ruin the tool/effort before it can challenge them.

Building this would be my dream job, but hahaha no.

Generally, anyone writing a paper about something that could benefit from bugfixes would love to accept them, but doesn't have the time or resources to actually do so - unless there's another paper in it. If they have somehow managed to find enough personal time to have a hobby project, then they probably do accept bugfixes - and you should get them in before that person burns out.

It also doesn't happen enough to design for - I once presented a fairly open-source contributor friendly project at SciPy that I hoped would be compelling (it was about modeling the zombie epidemic), actively asked for help, had set up a couple open requests of varying levels of complexity.

I think there was one pull request total?

The juice just didn't end up being worth the squeeze.

Hey, that unusability of papers is a form of job security.

Seriously though, you're totally right. I got very dissatisfied with science when I realized that many people were effectively publishing unreproducible crap created by terrible code. Fortunately, more and more people are learning how to recognize the crap.

> code and audio samples are never available, and verbal descriptions always skip crucial details. I have no doubt that academics have a ton of expertise, but their output in paper form is basically unusable,

It doesn't have to be this way. Here's the process I use in my lab:

1. Every paper that makes a claim of any kind based on code contains a link to a public Git repo.

2. The paper contains the Git hash identifying the exact commit used to justify the claim. Copy-paste it from the paper into your checkout.

3. The repo may have moved on with fixes and improvements, you can have those too.

If you are using version control already, it's not much work to do this. Of course, you have to be committed to making your code public.
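A minimal sketch of how steps 1 and 2 could be automated, assuming Python and a `git` executable on the PATH. The helper names, the repository URL, and the commit value below are hypothetical placeholders for illustration, not the lab's actual tooling.

```python
import re
import subprocess

# Abbreviated or full hex commit ids (SHA-1 today, SHA-256 repos use 64 chars).
HASH_RE = re.compile(r"^[0-9a-f]{7,64}$")

def current_commit(repo_dir="."):
    """Return the hash of the checked-out commit (requires git on PATH)."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def provenance_note(repo_url, commit):
    """Format the sentence for the paper plus the commands a reader would run."""
    if not HASH_RE.match(commit):
        raise ValueError(f"not a git commit hash: {commit!r}")
    return (
        f"Results were produced with {repo_url} at commit {commit}.\n"
        f"To reproduce: git clone {repo_url} code && cd code && git checkout {commit}"
    )

REPO_URL = "https://example.org/lab/paper-code.git"   # hypothetical URL
COMMIT = "0123456789abcdef0123456789abcdef01234567"   # placeholder; in practice use current_commit()
print(provenance_note(REPO_URL, COMMIT))
```

Calling `current_commit()` at the moment the figures are generated, rather than typing the hash by hand, avoids the paper and the repo drifting apart.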

> I have no doubt that academics have a ton of expertise, but their output in paper form is basically unusable, I'm not sure it achieves any purpose besides resume padding.

If you think the entire field of academia doesn't achieve any purpose, you may want to reconsider your position. Most likely, almost everything that you do today on a computer was an academic paper. Yes, it was without code and data. Yet, it was not unusable and achieved more than enough purpose.

The average comment on HN on academia comes from a mindset where everyone wants a product. The purpose of a paper is NOT to release software or a product, but to test an idea under some assumptions. That's what all research does at its core: formulate a hypothesis, design an experiment to test the hypothesis, and report the results and implications. Are all research papers perfect? No. Are all of them usable? No.

Your use case - sound synthesis for a specific instrument - may not be a scientific challenge. It is however an engineering challenge and hence, you found a better answer amongst hobbyists and tinkerers. Now, try looking for a vaccine for Covid - and guess where you'd find that answer? In decades of research on mRNA with repeated failures, papers that couldn't be replicated, unavailability of "code" and samples with verbal descriptions skipping crucial details.

Balaji Srinivasan had a good take on this recently in his conversation with Tim Ferriss. I quote:

"The thing is, I don’t care if something has a thousand retweets, what I care about is if it has two or three independent confirmations from economically dis-aligned actors. This is the same as academia, by the way, everybody’s optimizing citations. What you actually want to optimize is independent replication. That’s what true science is. It’s not peer review. It is physical tests."

Yes and no. Literal replications are less valuable than people think - what you really want are independent tests of different parts of the causal network of the underlying model.

> but their output in paper form is basically unusable

Others have commented as well but I will reinforce: their output is basically unusable for you for the purpose you want to put it to.

Which is fair, but you should also recognize that you are not the audience of the papers and for good or for ill the system is not set up to help you with this.

I don't see how the system is very good for its intended audience either. If papers about making sound with code actually included sound and code, people in academia would be better off too. It would make it easier to search through literature, and to build on each other's work.

Papers are a conversation as much as anything else.

I agree that these days the tooling makes it much easier to distribute code & data somehow to match up, but there is also a cost/incentive mismatch. Basically to do a decent release of what you are working on and worse, potentially support it, costs time but has no real career value (yet). Which means it's mostly only done by people who are philosophically convinced of its value.

I think this will change over time, at least in some areas, but it won't be quick.

One thing that strikes me about most academic knowledge tools is that they seem to focus on parsing the current set of academic literature and producing supposedly interesting insights out of them (which quickly tends to snowball into wanting some kind of generalized model for knowledge as a whole). What I think is much more interesting is creating tools that help people create better academic writing in the first place (thinking tools if you will). This is however much more a UX problem than a pure engineering problem. That is why I think we see many more tools in the knowledge extraction space, as most academics thinking about these kinds of things probably have an engineering background. That, combined with the fact that we seemingly all want to throw machine learning at any problem we encounter.

To your point but even more general: the ML/AI space is far too focused on replacing people rather than helping people. There is a suffocating cultural conceit that we are on the verge of general AI and oh my gosh what will the humans do, we better institute universal basic income right away, etc.

What a joke.

Try to help humans think better first. If you succeed at that, you might be on the right track towards developing cold fusion, er, general AI.

Unfortunately you'll run into the rapid fact that in the ML/AI space you get almost zero points for building something.

You get a whole lot of points for discovering something, designing something, or a proof.

But there's a very large amount of people focused entirely on aims that are very, very distant from actually making human lives genuinely better.

Mostly because everyone quietly understands all the extraordinarily complicated mathematics is actually extraordinarily complicated.

Hence the ROI isn't worthwhile.

>But there's a very large amount of people focused entirely on aims that are very, very distant from actually making human lives genuinely better.

I can't speak to each and every person working on ML, but I thought I would share a fun use case I ran across the other day.

There is a business in some foreign country that is similar to Uber Eats: customer goes to an app, browses for food from various restaurants, orders, it gets delivered.

The business was using ML to help the restaurants: the restaurants upload a pic of the dishes, a title, and a description (usually all from an existing menu). The business would parse the description to guess at what was in the dish. Scan the picture to guess at the quantity of food (entrée, side, dessert, etc). Compare ingredients against publicly available nutrition info. Now the end consumer can do things like: search for gluten free, vegetarian, pork free, <300 calories, dessert, etc.

Almost all of this was "possible" before, but it would have required enormous effort from the restaurants inputting the data or customers reading each item. Now it is "easy", and it actually helps the end customers - and the restaurants.

Sounds like this would be an epic fail. I mean, just add a spoon of oil more, and your calorie guess is totally off. This is a clear cut case of SEEMINGLY helping. It certainly does not help the end customer. It might help the restaurant, as they don't care if the customer is receiving valid information, as long as they are buying.

Guessing allergen content sounds like a disaster waiting to happen.

I agree.

I'm an academic librarian, and they're completely different ways of working: When I do academic work, I (ideally) have to take my time and I'm not supposed to present my work until it's developed enough that I'm confident it presents a substantial improvement; I have to prove that it's worth a colleague's time to engage with by meeting certain requirements. Coding/developing, on the other hand, requires a lot more back and forth, a lot more "I don't know", and is more immediate in a way I find very satisfying.

I would LOVE to see more back and forth between engineers and academics in terms of ways of working; I think there's a lot of benefit to be gained there: Tech tends to not consider the future as much as they should, but the academics could really benefit from doing what you mentioned and improve the system they work in rather than accepting it.

One of the things I'm trying to do is get better at/learn some ML so I can play around with turning the things I learned in grad school into useful tools, but I'm a single journeyman dev doing this in my spare time so the odds of anything actually useful coming out of it are small.

I recall seeing a Show HN post a while back about a research focussed web browser that helps as a thinking tool:


It's amazing what you miss on HN when you skip a day

A sign-in/sign-up necessary just to see the browser in action? Hard pass.

This is entirely correct. Automation is a flash in the pan, making it easier for an expert to gather, process, and present information is much higher value. Unfortunately for the startup ecosystem, it has apparently become rote knowledge among investors that workflow = low-value, AI = 100x value, so the money is pushing entrepreneurs away from the more boring task of making tools for people to process knowledge while chasing fairy tales of turning millions of PDFs into an automatic gold mine...

Exactly! Assist me in my process of researching and writing, and use what I have already done in e.g. filing and classifying papers in EndNote. It's interesting that the author seemed to have a similar idea for a brief second but then tossed it away:

> similar: an app that pops up serendipitous connections between a corpus (previous writings, saved articles, bookmarks ...) and the active writing session or paragraph. The corpus, preferably your own, could be from folders, text files, blog archive, a Roam Research graph or a Notion/Evernote database.

As an ECR with English as a second language, the paid version of Grammarly has clarified my writing quite a bit. I think there's more unexplored value in this space.

It likely wouldn't take much to craft an "arXiv Copilot" out of GitHub Copilot.

"It's well understood how to" and "It likely wouldn't take much to" are worlds apart in this case: training and deploying models on that scale is a huge investment, even if you already had all the code and cleaned training data.

Can confirm: This is my main tech interest at the moment and if I consider how long it's going to take, I want to die.

It's interesting that OP did seemingly little research with respect to existing work in the field.


Medline, a searchable online directory of medical research papers, has existed for 50 years. The National Library of Medicine was for many years a leader in document search and retrieval, before there was a web. In the '80s they were doing vector cosine document similarity, document clustering, and automated classification. They were also doing some great stuff like indexing papers based on proteins and gene sequences, so a paper in a field completely different from yours might pop up if a similar protein or sequence was mentioned.
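For readers who haven't met vector cosine document similarity, a minimal sketch in Python follows. It uses raw bag-of-words counts; the toy documents and the query are invented for illustration, and this is not a description of how the NLM systems were actually built (a production system would add stemming, stop-word removal, and term weighting such as TF-IDF).

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term counts for one document."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy corpus (made up for this example).
docs = {
    "a": "kinase inhibitor binding in tumor cells",
    "b": "protein folding and kinase signaling",
    "c": "hospital staffing survey results",
}
query = term_vector("protein kinase inhibitor binding")

# Rank documents by similarity to the query, most similar first.
scores = {name: cosine_similarity(query, term_vector(text)) for name, text in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

Documents that share no terms with the query score exactly zero, which is why indexing on shared entities such as proteins or sequences can surface papers from unrelated fields.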

(Disclosure - I worked at the National Library of Medicine in the 90's)

That being said, in the past 30 years search and retrieval has exploded, to say nothing of ML, but it's crazy to ignore the stuff which has come before, AND it's tough to compete with a national lab whose mandate is to basically give the stuff away.

The best thing about the NLM's work in this space is how deeply it has been informed by the needs and workflows of the biomedical researchers, which is a perspective that has been sorely lacking in work coming from outsiders.

I did think that the author did a good job of outlining (some of) the basic structural issues that make this a tough field to monetize, but even setting those aside, there's no substitute for actually knowing your users and what they need, and that's something the NLM is amazing at.

(Disclosure, my PhD was funded by an NLM training grant, some of my research is funded extramurally by the NLM, and I have a lot of NLM colleagues, so I'm maybe a little bit biased)


It also felt like a long apology/explanation for Emergent Ventures rather than a true deep analysis. Pretty strong (and often false) statements for only what seems like half a year of total, somewhat vague work.

Medline is great on all levels but what does it have to do with the author's work/product/topic? It's a database of biomedical articles. It's searchable using usual text search ("concept A" AND "concept B"), and has some additional features like MeSH terms and metadata.

The author is talking about extracting entity-level information from those articles and building knowledge bases. Medline provides access to the raw data, but it does nothing like what the article describes (well - there's https://lhncbc.nlm.nih.gov/ii/tools/SemRep_SemMedDB_SKR.html , so not entirely true. But it's hardly a final answer to the problem or even close to being usable enough that it makes commercial solutions pointless)
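To show how thin the "usual text search" layer is compared with entity extraction, here is a minimal sketch that only builds a boolean query URL for the NCBI E-utilities `esearch` endpoint. The helper name `pubmed_search_url` and the example phrases are assumptions for illustration; the code constructs the URL and does not send any request.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_search_url(phrases, mesh_terms=(), retmax=20):
    """Build an esearch URL that ANDs quoted phrases and optional MeSH headings."""
    clauses = [f'"{p}"' for p in phrases]
    clauses += [f'"{m}"[MeSH Terms]' for m in mesh_terms]
    term = " AND ".join(clauses)
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)

url = pubmed_search_url(["concept A", "concept B"], mesh_terms=["Neoplasms"])
print(url)
# Decode the query string again to show the boolean expression being sent.
print(parse_qs(urlparse(url).query)["term"][0])
```

The response to such a query is a list of article identifiers, which is the raw material for a knowledge base rather than the knowledge base itself.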

When you count all the ways to be wrong, the median scientific paper is wrong.

In biomedical fields they dismiss more than half of papers out of hand when they do a Cochrane meta analysis. It begs the question of why such papers (which aren’t fit to extract knowledge from) are published or funded at all.

I got a PhD in theoretical physics and routinely found that something was wrong on page 23 of a 50 page calculation and nobody published anything about it in 30 years. Possibly the whole body of work on string theory since 1980 [most of hep-th] is a pipe dream at best. Because young physicists have to spend the first third of their career in a squid game fight for survival, not to fathom the secrets of the universe but to please their elders, we get situations like that absurd idea of Stephen Hawking that information gets lost in a black hole. (E.g. if you believe that you aren’t even going to try quantum gravity.)

> My biggest mistake was that I didn't have experience as a biotech researcher or postdoc working in a lab.

That is a big problem - good to recognize it as such.

I can tell because the article, though lengthy, never seems to state an explicit problem to be solved. Rather, various ways to apply technology to a field are discussed.

This is a recipe for failure. You need 3 things:

1. a problem to be solved

2. a customer who has that problem

3. money in the customer's pocket waiting to be transferred to yours

The article never even gets to (1).

Regarding (2), if academic groups are the target customer, you're going to have a bad time. They have little money and they tend to be all too happy to build something that sort-of replicates the commercial product you've created for them.

This leaves scientific for-profit companies. They have lots of problems (and these days money), but these problems tend to be quite difficult to discover and solve because of the extensive domain and industry knowledge required.

Yes. I wonder if the author had first worked on the patent side (I'd be interested to hear more about this idea). Perhaps working on patents first would be a path to get experience (and product-market-fit). From there, one could branch out into other domains (e.g. bio).

Patents were the more obvious go to but we didn't want to work with obfuscated patent descriptions for the next years

> Close to nothing of what makes science actually work is published as text on the web

Unless there's some nuance I missed, I immensely disagree with this statement.

I'm currently in the biomedical literature review space, and I appreciate the detailed insights. I wonder if the author considered that literature review is used in a wide variety of domains outside pharma/drug discovery (where I perceived their efforts were focused). Regulatory monitoring/reporting, hospital guideline generation, etc.

This is a billion dollar industry, and I couldn't agree more that it's technologically underdeveloped. I do not agree that AI-based extraction is the solution, at least in the near-term. The formal methodologies used by reviewers/meta-analysts: search strategy generation, lit search, screening, extraction, critical appraisal, synthesis/statistical analysis, are IMO more nuanced than an AI can capture. They require human input or review. My business is betting on this premise :)

1. The real value is in operational documents such as clinical notes, maintenance records, soldier and police notebooks, etc. This info is proprietary to an organization and its partners and is directly linked to how it produces and pays for value.

2. Superhuman accuracy at limited tasks is not good enough. For instance transcribing audio at 95% word-level accuracy would be good for a human but it means every other sentence is garbled. People communicate despite this because they ask questions. A useful text-to-structured information tool has to exert back pressure on bullshit, give feedback about what it understands and push the author to tell a story that makes sense and has adequate detail.
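To put a rough number on the "every other sentence is garbled" point in (2): a minimal sketch, assuming word errors are independent and that a typical sentence runs to about 14 words (both are simplifying assumptions, not measurements). Under those assumptions, the chance that a sentence contains at least one wrong word is 1 - accuracy^n.

```python
# Back-of-the-envelope check, not a model of any real transcription system.
# Assumptions: each word is wrong independently of the others, and the
# sentence lengths below are illustrative guesses.

def sentence_error_rate(word_accuracy: float, words_per_sentence: int) -> float:
    """Probability that a sentence contains at least one wrong word."""
    return 1.0 - word_accuracy ** words_per_sentence


if __name__ == "__main__":
    for n in (5, 14, 25):
        rate = sentence_error_rate(0.95, n)
        print(f"{n:>2} words per sentence: {rate:.0%} of sentences have an error")
```

At 95% word accuracy, roughly half of all 14-word sentences come out with at least one error, and longer sentences fare worse.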

For anyone interested, my whole PhD was in biomedical hypothesis generation! I think the most "serious" attempts at building these systems have been focused around providing assistance to scientists, and not just coming up with new ideas on their own.

here's an actual medical paper that my first system, Moliere, was able to help discover:


I also am working within this field in academia, and I must say that I have really enjoyed reading your work. I am focused a bit more on combining biological data sources with text knowledge graphs, but in my literature review of the field, I have found that we both have very similar aspirations for career path.

It seems that one major bias of the blog post's author is their heavy conflation of 'worth' with money.

Many of us probably realize that this is not true, as academia clearly shows. Furthermore, it's not that crazy that they had trouble making money off of software, as I wouldn't expect many startups at all to be able to get customers from software solutions alone. They also seem to try to compare their success with that of essentially 'other biotech ml companies', which I would expect to provide quite a bit in the way of tangible resources. For example, a startup looking to provide a service of detecting diseases or conditions from DNA methylation data would likely perform the sequencing required before doing an analysis (in order to have good control over experimental conditions). The materials alone in that case could cost quite a bit - so charging a bit more for the analysis isn't that problematic, since the transaction of some currency is already required for covering material costs.

Anyway, as you mentioned - it seems it's important to recognize that these systems aren't necessarily meant to generate revenue for a startup, but rather are much more useful as a tool in academia.

Is your PhD thesis online anywhere?

When I was in grad school, I joined a startup incubator and built a prototype which combined two of the tools mentioned in the article: "a query builder (by demonstration)" and "A paper recommender system", a simple companion which would help scientists not miss research relevant to them. This was 10 years ago, before Google Scholar had similar features.

The incubator introduced me to advisors with business experience in this field. And I got told in no uncertain terms what is the gist of this article: The value lies in the molecular and clinical data. In 2021 I would add digital pathology / imaging data.

>And I got told in no uncertain terms what is the gist of this article: The value lies in the molecular and clinical data. In 2021 I would add digital pathology / imaging data.

I feel like you are trying to tell me something REALLY valuable, but I don't quite understand it. Can you please elaborate?

My take: answering questions using clinical data > answering questions with papers

There is immense value in clinical data (all the information captured and siloed through EHR). Pharma companies pay for access to it to gather real-world evidence (RWE) on how, for example, their drug performs. Molecular information is increasingly valuable too for research, biomarker development, patient cohort identification etc. The imaging data and pathology data are valuable because they are typically expertly annotated and can be used to train computer-vision algorithms etc. to solve medical problems - like diagnosis.

Having invested quite a bit of my own time into various aspects of the scientific knowledge extraction morass, I’d say the author is largely on point, but there’s a significant, and potentially valuable distinction to be made between extracting research outputs and research inputs.

At least in the field of materials science, papers are by and large a record of research outputs. We made material X and it achieved performance Y - here are a bunch of measurements to prove that this is in fact what we made and that it truly achieved performance Y at relevant conditions, etc. In this sense, papers really function as an advertisement: look at what we achieved.

What papers do not do is rigorously document inputs, or provide a step-by-step guide to reproduce said results, for obvious reasons.

My current take on this topic is that it would be both feasible and valuable to build a knowledge extraction system to compile and compare outputs across a specified field. Think the big “chart of all verified solar cell efficiencies over time” [1], but generated automatically. This would at least immediately orient researchers to the distribution of state of the art results, and help ensure that they don’t omit relevant references in their reviews.

But extracting and making sense of inputs (methods), or even “knowledge”? Forget about it.

[1] https://www.nrel.gov/pv/cell-efficiency.html

Interesting insights, particularly on the business aspect. But I am not surprised by the outcome. As the author said, nobody wants to pay for what is proposed in academia. Everybody is already more or less struggling with funding, so nobody wants to add extra fat to their funding requests.

Coming from CS, something I would really like to see though is a tool that would summarize a scientific area/domain. Something that would kill literature reviews and/or would provide an overview of the hot topics/open questions in different areas.

Edit: corrections

Don't know much about this industry but yes, it feels like one of those industries that sprang up because one person with money said, "Hmm, sounds like a good idea," and then other people with money and FOMO joined in. When this happens past a certain level you get a miniature innovation bubble (MIB)!

(At least MIBs are rather harmless, at least in the long run, and can actually yield some benefit: innovative people are drawn to these types of industries and inevitably create cool things as a by-product of their work.)

Science is not about innovation. Science is about tiny little results that by themselves have no immediate benefit, but slowly improve our overall understanding, and eventually lead to an unexpected benefit. Science is not developed in order to solve a business problem - it is purely an advancement in overall knowledge of the world (the traditional aim of natural philosophy). In this sense, science is not compatible with business interests.

> Why purchase access to a 3rd party AI reading engine or a knowledge graph when you can just hire hundreds of postdocs in Hyderabad to parse papers into JSON? (at a $6,000 yearly salary)

I really like jobs that people think AI can do in theory but that it can't really do effectively irl. Where do I get a part-time gig like that if I think I am capable of reviewing and summarizing non-STEM papers? Except for homework and assignments, of course.

Yeah, you can't outsource that to Hyderabad. You'd need to know subject knowledge + very specific English and possibly other languages depending on the field (not saying Indians can't do this, but I've studied enough languages to know that doing high level/academic work in a non-native language is hell even when the language is pitched to students).

And you'd have to know enough about the process and authors to know what makes papers relevant. The metadata matters as much as the data.

All good points. But you do have to recognize the tradeoff. Has AI come so far that it could perform better than industry-specific human intelligence? You have to consider that maybe some Indian researchers could review the papers, since they would be doing that job as a part-time gig.

You have to test out both solutions. And as these jobs are treated as contracts, there is no significant commitment in choosing one over the other. We can't be certain that one method is better than the other without trying both of them out without prejudice.

I, for one, am agnostic about either choice. AI is overhyped, yet it has spillover benefits as a marketing and sales point; offshore human intelligence has a bad rep, but could be effective if you have a proper documentation, supervision and review framework.

Oh yeah, I was just thinking about the present. In five to ten years, once AI/ML/etc. trickle out of tech/theory spaces and start to be combined with subject expertise, I think we'll see really interesting things.

The other matter is that an Indian who could review papers that well would also cost more than 6k/year and would not be easily replaceable, which eliminates the main benefit of outsourcing for a company trying to operate in such a way in 2021.

In 2030? I'd say the odds are if somebody in Hyderabad can do that then they can start their OWN company rather than bother with us at all. Honestly, given India's role in pharmaceutical manufacture, I'd be shocked if things like that don't start popping up.

> This post is about the issues with semantic intelligence platforms that predominantly leverage the published academic literature.

I was happy to see a post that clearly states its purpose.

edit: misspelling

Hey, author here. Great discussion so far. Will update the post with some of the comments and critiques.

And the bottom line is:

>>... nothing of it will go anywhere.

>>Don’t take that as a challenge. Take it as a red flag and run. Run towards better problems.

Wow, speaking of the value of negative results, that is hugely valuable! Could easily save person-decades of work & funds for more productive results.

The insights that the most relevant knowledge is not written into the publications (for a variety of reasons), that the little that is written is of limited use to the target audience, and that even when it is useful it is a small part of the workload (i.e., not a real pain point), are key to seeing that the entire category of projects to extract & encode such knowledge is doomed.

I think if the author had listened to pretty much any post-doc/technician or senior researcher in the field who has had to review a number of publications they would have been told these things straight away.

A lot of papers are annotated by hand or with computer assistance.

For medical papers, MeSH terms: https://www.nlm.nih.gov/mesh/meshhome.html

Gene information is extracted by FlyBase / WormBase …

It’s time-consuming, expensive, and probably not perfect, but for certain types of papers it makes searching better.

If someone published a promising in vitro study, the mere fact that it was done and even the barest hint of a conclusion can dramatically impact one's decision to pursue that line of research. So I'm a bit skeptical that there's negligible value in biomedical research papers.

This paper touches one aspect of it, which is that the source material is bad, but it doesn’t even start on the fact that the tools aren’t good enough and that many of the fashionable ideas (Word embeddings) are dead ends.

The very idea that "word embeddings are dead ends" requires citations, evidence.

Any article on this topic should mention Tshitoyan et al.


Knowledge extraction is weird. Just because I extracted some knowledge doesn't mean that I now 'have' that knowledge.

The better use case for this is teaching, not creating knowledge bases that nobody will use.

I would argue "learning" - personal learning is a great topic.

Interesting but I don't know how to make sense of it. How can it be that "close to nothing of what makes science actually work is published as text on the web"?

- Is the information that makes science actually work mostly in images that the machines don't yet understand?

- Was the information paywalled or in private databases and inaccessible to this researcher?

- Are the papers mostly just advertisements for researchers to come gab with each other at conferences and doodle on cocktail napkins, and that's where all the "real science" happens?

- (From the comments) is the information needed to make sense of papers communicated privately or orally from PI's to postdocs and grad students, or within industrial research labs?

Something is missing from my mental picture here.

Don't real scientists mostly learn how to think about their fields by reading textbooks and papers? (This is a genuine question.) If so, isn't it likely that our tools just aren't advanced enough to learn like humans do? If not, what do humans use to learn how to think about science that's missing from textbooks and papers?

I can't speak for biomedicine, but speaking as an academic in CS the claim that "close to nothing of what makes science actually work is published as text on the web" looks like a huge hyperbole to me.

It's true that the so-called "folk knowledge", knowledge that exists in the community but no one bothers to publish in the form of papers, is a real problem, but at least in my field, it's by no means the majority of knowledge.

As someone from a peripheral university where you can't just drive a few miles and talk to the best in your field, I have successfully started in new subfields of study (getting to the level of writing research papers in top-tier venues) by reading the literature.

While this essay provides a very interesting point of view, I suspect it's heavily colored by the author's failure to monetize the technology (which is related to the fact that people doing most of the grunt research work, who would benefit the most from this, are PhD students and postdocs who have no money to pay for it - in the text, the author hints at this). I wouldn't take it as an accurate description of academia.

Also CS, my interpretation of "what makes science work" is a little different and I would argue that - despite a lot of foundations and techniques being shared in research papers - this field more than any other is constraining the free circulation and application of knowledge.

The equivalent to those biomedical industry players are the big tech who develop closed source and push the edge in some area. They will publish but that does not mean you can replicate any of it.

Software is also fragmented, crippled by IP lawsuits, patent trolls and so on. This does inhibit the ability of society to benefit from software, since it depends on the private sector to sort things out. The PhDs go and build businesses to "make the science work" in that sense.

The ideal of detached pursuit of knowledge is not a complete fiction (despite the hyperbole), but it does remain an ideal that can only be approximated.

As an academic, all my papers from the last 5 or so years have associated github repos where all the code is accessible under free licenses. Most of my peers in academia do the same. Documentation quality is admittedly quite hit-and-miss, because we aren't paid for that and we need to jump to the next paper, but all the code is there and everything can be replicated even if it takes some effort due to rushed code or suboptimal documentation.

Industry is a different world, and indeed there are plenty of opaque industry papers that aren't replicable at all because much of the model is essentially a trade secret, and the paper is more an avenue for bragging than for developing new knowledge together with the rest of the community. To be honest, I would just outright disallow that kind of paper. But that's not a popular opinion, and taking into account that big tech companies sponsor our conferences and provide grants, I can't blame those who think otherwise.

<disclaimer: former real scientist>

Science is a profession like others. When you are earning your Ph.D. you learn to think about the field by reading papers and discussing with peers and colleagues, yes.

The intro of a well-structured research paper should follow this pattern:

- This is a really important topic and here is why.

- What is the current state of the art in this field? (this comes from reading 100-1000 publications on the topic and selecting the 5-10 most relevant to the next point). HOWEVER, the state of the art leaves this question unanswered.

- Here are some reasons why the idea in this paper can help answer that question (cite another 3-10 papers).

- Our hypothesis is that XXX can answer the important unanswered question (where XXX is derived from the prior section).

So, what I am getting at, a scientific publication is part of a conversation. When I'm citing the 5-10 papers to summarize the state of the art, I'm assuming the reader has read 50% of the 100-1000 papers which I also read, and knows where the 5-10 which I cite fit into that broader context.

So any paper, in isolation, only has a fraction of its meaning in the publication. The real information is the surrounding context.

Pro tip: if I'm reading a paper and want to understand it better, I also read one or two of the papers it cites, and one or two papers which cite it. Also, it can take a few times through before I start to understand what the author is trying to say.

Exactly! Scientific papers are not meant to stand on their own -- they are pieces of a much larger jigsaw puzzle. In order to make heads or tails out of a paper, one really needs to have a sense of where the paper fits into its larger picture. Building up the necessary base of knowledge to develop that sense, both in terms of explicit knowledge and tacit knowledge, is part of what a PhD student is actually doing while they are working on their PhD, and is part of why the process takes as long as it does.

Also, the mechanical process of effectively reading a paper is highly non-linear, and is a skill in and of itself. In a lot of ways, it is more akin to high-level pattern matching than it is to more "normal" reading. At least at my institution, it is something that we actually teach our students to do in formal ways (the obligatory "How to read a scientific paper" lecture during the first term or two) and then make them practice over and over again for years (journal clubs, etc.). The original author eventually figured this out, which is to their credit.

As the article states, papers are mostly career advancement tools and scientists are incentivized to put the least amount of useful information into them they can get away with. Real scientists mostly learn from their instructors who possess all the jealously guarded institutional knowledge.

Yes, it is very broken.

Hard disagree. With a caveat -- I do acknowledge that for an important number of professional academics your statement may be true, and I have heard a former post-doc at ETH Zurich describe their papers as career points (so also a grain of truth at elite institutes).

But for most of the academics I have known and worked with, publications are taken quite seriously, and institutional knowledge is freely shared. There is an incentive to reduce the content in papers, but it is out of respect for the reader (a paper is not a textbook) and an honest attempt to limit the discussion to the core hypothesis of the work. You have 6 pages to 1) describe the content of 6*100 pages (the 100 other relevant papers on the topic), 2) present your addition to this body of knowledge, 3) discuss the insights your work brings, again referring to the content of 600 pages.

and those 600 pages you are summarizing are as information-dense as your work.

Also hard disagree. Everyone I know takes papers quite seriously, there's a major push to get more information into the literature (publish null effect estimates!), etc.

This feels a lot like the graduate students on the Academia StackExchange who are convinced the moment they present their idea it will get stolen, while every faculty member is like "I have a list of my own ideas that I don't have time to work on as long as my arm."

It is always difficult to try to understand and implement the theory explained in papers which seem fine on the surface, but when you look more closely, you find a bunch of mistakes, there are giant holes in the details, and you end up trying to redo the whole paper.

There should be journals/websites/blogs dedicated to trying to reexplain / implement papers.

Doing a good review is hard.

I've been reviewing a few C++ papers (things proposed for C++23) lately. Many of them are over my head and all I can find are a few spelling errors. The ones I've understood took me 3 readings before I found some giant holes in the details (which I pointed out to the authors; the next revision corrected them). In one case I actually started implementing the feature myself, and only then did I realize there was something ambiguous in the paper (after talking with the author we decided to document it as an implementor's choice, as it doesn't matter for the 99% use case, and the rest could go either way depending on hardware and data details, so better to allow choice).

The vast majority of papers I'm far too lazy to go into that level of detail on. I just assume the author is a smart person and so I trustingly let them go. It may well be that if I understood the paper I'd be horrified; ask me in 2043 when we have 20 years of hindsight...

I have to believe that peer review is the same - many reviewers are just reading and looking for something obvious but not really understanding details.

Someone once said that they enjoyed one of my papers, and that even though they thought the writing was very clear they still had to read it 3 times to understand it.

I told them that I had to write it 100 times and spend two years before I understood it.

So if they could pick it up in 3 readings over 3 days, they were doing pretty good.

SO difficult, especially given that just because you're in the 'same' field and technically qualified to do a peer review doesn't mean you actually understand what you're reading.

For example, I'm qualified to review papers on educational programs for children. I should never be asked to do that.

I mean honestly this is just total bullshit. There is plenty of value in academic papers. It's just that there is very little money to be made in developing tools such as those mentioned by the OP as there is very little money in academia.

I understood the criticism as directed at the value of papers as instruments of knowledge sharing. The argument is not that papers are completely useless in terms of knowledge sharing but that this pure purpose of dissemination is largely overshadowed by considerations of career, prestige, funding or any interest other than knowledge sharing.

This is the world we live in. A scientist is a person that needs to make a living and is subject to various constraints.

The reason that there is little money to be made is that society hasn't found a way to set up the scientific process in such a way that the constraints would value the increase in public domain knowledge higher than the incentives to hold some knowledge back.

Part of this may stem from leaving specialized knowledge to academia while letting only companies reap the monetary rewards of putting the knowledge to use. Society benefits only indirectly (better drugs, machines, etc) but industry players will rather shield knowledge and adapt its representation to their own needs.

Very little money, and more than that, very little unallocated money. I'm fortunate enough to have a decently well funded lab, but almost all of that funding is spoken for. Your Compelling New Product needs to be compelling enough for me to put it into grants anticipating it'll still be around in several years, often enough to get around most grants not being funded.

It's hard for me to even comprehend how this could be true, but it does sound familiar enough from credible sources that maybe it's right regardless of what makes sense to me.

The 'what makes science work' is stored in the scientists.

They learn by reading the literature, but also by communicating, and by an active process of testing their own understanding and resolving gaps and inconsistencies. Even when a self-taught genius like Ramanujan comes along, they benefit from being brought into the community.

The question of how one would determine the state of the art in a field has an answer, but at present it would be indistinguishable from training a scientist, rather than running a clever software tool that could synthesize from the literature.

Well that's an interesting idea, isn't it (even if completely impractical today)? Self-training AI robot scientist who not only reads the literature but actually chats with other scientists and tries to do science to improve its understanding. AlphaZero but for science.

AlphaZero cannot chat and interact outside moving pieces. For science, self-training would be too wasteful, intractable and impossible to boot, given there's no simulator.

An AlphaZero for science would instead be like the recent DeepMind paper where the pattern matching capabilities and internal features of a neural network were used to navigate some domain's decision space of conjecture formation and testing.

In my experience useful scientific knowledge is accumulated in people actively working. Documents (books, papers, guides, programs, talks, blog posts, etc) are communication tools, but are limited by the medium and the ability of the authors. People can consume documents and create analogies to their specific work, but from there it's the process of working that produces: experts, systems, tools. Sometimes those products are again documented.

Try: Paperswithcode.com

If it's not there I won't use it! If you don't provide code with your paper, it had better have a really useful concept in it, otherwise no citation. Which points to the problem in the article, where the most important information in basic research papers is: "Hey, this concept works" as opposed to a rigorous test of exactly what makes the concept work and how to use it in other situations.

Current real life scientist in biomedicine:

"Interesting but I don't know how to make sense of it. How can it be that "close to nothing of what makes science actually work is published as text on the web"?"

I'm not convinced this assertion is true. Difficult to parse by a non-expert? Sure. Often stored in pictures rather than text? Absolutely - indeed, a number of journals directly ask reviewers if the text and graphs are duplicative and consider this a negative. Harder to "disrupt" and monetize than many companies have expected it to be? Certainly.

"Is the information that makes science actually work mostly in images that the machines don't yet understand?"

In my mind, this is the most credible bit of the author's complaints. A lot of science in biomedicine is done in information dense graphics. The author picks especially hard to approach ones, but this is definitely a thing.

"Was the information paywalled or in private databases and inaccessible to this researcher?"

For an outsider without institutional access to journals, this can be a problem. More acutely, there is some lag between "What I'm currently working on" and "What's in the literature" simply because the literature is slow (and has gotten way slower during the pandemic).

"Are the papers mostly just advertisements for researchers to come gab with each other at conferences and doodle on cocktail napkins, and that's where all the "real science" happens?"

In biomedicine, papers are the product. Conferences tend to have a couple uses:

1) Previews of coming attractions - things I'm working on that are close enough to done to talk about, but not so close as to be making the rounds yet. These talks will often have less detail than a paper would, because we've all sat through a presentation that goes into a ton of implementation detail and they're agonizing. Also I only have fifteen minutes.

2) Looking for postdocs - either from the hiring or seeking end.

3) Building professional networks - this is mostly so when someone comes to me with a problem, I know who might be working on say...causal inference with time-varying exposures...and can reach out to them. Usually to ask what papers I should read to get caught up. Or to bring them in on a paper/grant proposal.

4) Looking for problems other people are having that I can solve, and then reaching out.

"(From the comments) is the information needed to make sense of papers communicated privately or orally from PI's to postdocs and grad students, or within industrial research labs?"

Only insofar as my graduate students and postdocs have access to my time and expertise, and a job that is expressly meant to encourage understanding things. "I don't know, why don't you spend a couple weeks figuring out how they did that" is a perfectly good use of a graduate student's time, but something I find is rarely encouraged elsewhere.

"Don't real scientists mostly learn how to think about their fields by reading textbooks and papers? (This is a genuine question.) If so, isn't it likely that our tools just aren't advanced enough to learn like humans do? If not, what do humans use to learn how to think about science that's missing from textbooks and papers?"

One of the things that's likely missing, because those are all finished products, is the process. For example, I spent an hour chatting with a graduate student about three or four different ways they can approach their problem - what assumptions come with each one, tradeoffs, etc. But only the branch that actually got used is going to be published.

Business? It is not a business - it is a racket...

The article's core claims are:

> Extracting, structuring or synthesizing "insights" from academic publications (papers) or building knowledge bases from a domain corpus of literature has negligible value in industry.

> Most knowledge necessary to make scientific progress is not online and not encoded.

> Close to nothing of what makes science actually work is published as text on the web

> The tech is not there to make fact checking work reliably, even in constrained domains.

> Accurately and programmatically transforming an entire piece of literature into a computer-interpretable, complete and actionable knowledge artifact remains a pipe dream.

It also points to existing old-school "biomedical knowledge bases, databases, ontologies that are updated regularly", with expert entry cutting through the noise in a way that NLP cannot.

Although I disagree with its conclusions, much of this jibes with my experience. From the perspective of research, modern NLP and transformers are appropriately hyped, but from the perspective of real-world application, they are over-hyped. Transformers have deeper understanding than anything prior; they can figure out patterns in their context with a flexibility that goes way beyond regurgitation.

They are also prone to hallucinating text and quoting misleading snippets, require lots of resources for inference, and enjoy being confidently wrong at a rate that makes industrial use nearly unworkable. They're powerful but you should think hard about whether you actually need them. Most of the time their true advantage is not leveraged.


My disagreements are with its advice.

> For recommendations, the suggestion is "follow the best institutions and ~50 top individuals".

But this just creates a rich-get-richer effect and holds science back, since most people are reluctant to go against those with a lot of clout.

> Why purchase access to a 3rd party AI reading engine...when you can just hire hundreds of postdocs in Hyderabad to parse papers into JSON? (at a $6,000 yearly salary). Would you invest in automation if you have billions of disposable income and access to cheap labor? After talking with employees of huge companies like GSK, AZ and Medscape the answer is a clear no.

This reminds me of responses to questions of the sort: "Why didn't X (where X might be Ottomans or Chinese) get to the industrial revolution first?".

The article also warns against working on ideas such as "...semantic search, interoperable protocols and structured data, serendipitous discovery apps, knowledge organization."

A lot of such apps are solutions chasing after a problem, but they could work if designed to solve a specific real-world problem. On the other hand, an outsider trying to start a generalized VC-backed business targeting industry is bound to fail. In fact, this seems to be a major sticking point in the author's endeavor.

Industry is jaded and set in its ways; startups focus on summarization, recommendations and retrieval, which are low value in the scientific enterprise; and academia is focused on automation, which turns out to be brittle. Still, this line of research is needed. Knowledge production is growing rapidly while humans are not getting any smarter. Specialization has meant increases in redundant information, loss of context and a stall in theory production (hence "much less logic and deduction happening").

While the published literature is sorely lacking, humans can, with effort, extract and/or triangulate value from it. Tooling needs to augment that process.

"follow the best institutions and ~50 top individuals" wasn't meant as a suggestion actually, just an observation of what most people do.

You're right that they "could work if designed to solve a specific real world problem", but against what baseline? The baseline could be spending that time on actual deep tech projects and not NLP meta-science.

But you're right; open source projects for extracting information (like PubTator) are valuable, but ontologies/KGs need ongoing expert (ML, AI, SWEs, information architects, labelers) work (unlike most of Wikipedia or GH), so it's tough to make something that doesn't suck in a distributed open source fashion.
