The business of extracting knowledge from academic publications (theseedsofscience.pub)
204 points by efface 6 months ago | 81 comments



Word of advice to all those who are chomping at the bit to disrupt pharma with AI.

A pharmaceutical company that heavily leverages computation is called a pharmaceutical company. All modern pharmaceutical companies heavily leverage computational tools - including some powered by deep learning.

Any company building a computational platform to accelerate drug discovery / development is not a pharmaceutical company. They are a software company that wants to sell software to pharmaceutical companies, which is a terrible business to be in.

The product the author is selling already exists for free. All of the big knowledge bases use automated ML tooling for curation and extraction. And the people who run them and QC them are world class experts in their domains. I mean, just take a look at uniprot.[0]

And the only types of pharma companies who would want a big knowledge graph would be the large ones with active programs across multiple therapeutic areas. Most of the small companies / start ups tend to be focused on getting one or two assets to market in a therapeutic domain. And the big companies possibly have their own ML teams doing literature extraction - because a team of 5-10 FTE ML / software engineers is like a rounding error in their R&D budget.

The other thing is that the most valuable knowledge is the stuff that is not in the literature. That’s why we do experiments.

[0] https://academic.oup.com/nar/article/51/D1/D523/6835362


Ingenuity's Pathway Analysis product has been the standard tool big pharma has used for 15+ years. Founded by folks from Stanford, they created their own knowledge-graph database a decade before such databases were popular, and have been paying PhDs to curate it with genetic findings from papers over the past few decades. Obviously their analysis also includes the publicly-available databases, and they have their own private genomic databases and expertise from helping many of the first-available genomic services go live.

Still, it's a tough business, and they were bought by Qiagen (to make an end-to-end solution, with moderate success, as the market is strangled by Illumina's control over NGS sequencing). And with exponential growth in information, their early relative advantage likely has waned.

Also note Veeva software, which provides infrastructure for pharma, is a public benefit company, legally devoted to its clients. That's how much power the customer has.

That said, there's something of a revolution in drug discovery as more is being done by small companies with ultra-focused expertise. If they get a viable candidate, a bigger company buys it to marshal it through approvals, and then it can be sold again to a marketing company. Focusing on these pop-up drug-discovery companies that evolve out of someone's PhD could be a good niche; I would expect pharma VCs to prefer funding companies using their favorite computing vendor, for reliability if not insight. Here computational discovery assistance would help a bit, but validating findings and science could be golden.


I think the big case study everybody gives is Schrödinger, which took 20 years to IPO.

The problem is that the market is just not that big. Assume that every single pharma company buys your product - like Schrödinger - where does your revenue top out?

You can beat that as a small pharma company with one or two assets that make it to market. So the question is: if the computational platform is so good, then why not just be a pharma company?


For a good computational platform that can pinpoint unique and valuable candidates, isn't there an option to license/consult in exchange for royalties?


No. Makes more sense to license the asset (molecule). And even then, big pharma companies won’t talk to you until your molecule is in phase 2.

If anybody with an AI pharma start up tells you they plan to make money by licensing their platform, then run the other way because it is a red flag.

If you want to see what a real "AI" pharma company looks like, check out Vertex Pharmaceuticals.


“If you’re so smart, then why aren’t you rich?”


I wouldn't be so quick to dismiss a niche for third party data analysis.

A very common feature across verticals is that in house data analysis is siloed.

While it's a difficult sell, getting multiple siloed data sources to agree to 3rd party analysis for shared gains can be extremely successful for everyone involved when pulled off.

So looking specifically at pharma, while analysis of published papers is meh, an often discussed component of research is the bias against publishing negative results.

So consider a hypothetical product: use ML across multiple firms' data on failed products and research to build a model that more quickly identifies dud research avenues. Such a product could only exist with the broadest dataset - i.e., as a 3rd-party product - and would deliver gains that could be quite profitable for some of the largest companies out there.
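A minimal sketch of what such a cross-firm "dud detector" might look like, assuming pooled, anonymized per-program features (everything here - the features, the data, the model choice - is hypothetical):

    # Hypothetical sketch: model trained on pooled, anonymized records of
    # failed vs. successful research programs from multiple firms.
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)

    # Stand-in for pooled industry data: one row per research program.
    # Imagined columns: target novelty, assay reproducibility, prior failures
    # against the same target class, preclinical effect size.
    X = rng.normal(size=(500, 4))
    y = rng.integers(0, 2, size=500)  # 1 = program turned out to be a dud

    clf = GradientBoostingClassifier()
    print(cross_val_score(clf, X, y, cv=5).mean())  # chance-level on random data

The hard part, of course, is not the model but getting the firms to pool the data in the first place.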

Also, I think anyone who has worked with larger corporations on in house tech knows that even if the individual efforts are quite large and sophisticated, the sheer amount of bureaucracy that goes into every single thing at a sizable firm can mean significant advantages for startups vs in-house efforts, particularly when related to fast moving fields.

I'd agree that "moving into a niche without knowing it extremely well" can be fraught with issues and that attempting to get buy in from B2B firms for a startup is a nightmare, but I'd disagree with "large company does X in house so creating a startup to do X is a bad idea."


I mean what you suggested already kind of exists in several forms.[0,1]

Third-party data analysis is a thing, but it is often bundled with conducting experiments by CROs.

See my other comment about market size. The problem is that you can make A LOT more money selling drugs than you can selling software to pharma companies. So if your software is any good, then you should use it to make drugs and just be a pharma company.

0. https://www.opentargets.org/

1. https://www.citeline.com/en/products-services/clinical/pharm...


What I suggested was simply an illustrative example off the top of my head. The fact that multiple similar companies already exist independently furthers my point.

As for your other point, I'd imagine there's a pretty big difference between the capabilities and infrastructure needed to bring a drug to market and those needed to analyze the data of people who bring drugs to market.


Thanks for commenting. I've been thinking a bit about taking some computational insights we have developed over years to bioinformatics.

During my Ph.D., my PI wanted to do some work on signalling pathway analysis. He was interested in knowing when and why the body sometimes starts assuming that the sick state is the right state and begins acting against drugs given to patients, which makes treatment very hard.

I don't know the status of that work, but one insight I got from him is that most (all?) pharma companies do not trust academic data. I didn't ask for reasons, but I'd assume it's because of the replication crisis. Knowing how data is gathered and published in the signalling domain, I wouldn't blame them for having low trust in academic data.


Depends on what you want to use the data for. In general, you should take a trust but verify approach to new results, methods, data, etc.


I did a stint at a large pharma, and it was not free software that was used to extract structured data from publications.


> I mean, just take a look at uniprot.[0]

This looks like one narrow, niche knowledge base. I am not an expert in this domain, but there are probably some other use cases not covered by existing offerings.

> literature extraction - because a team of 5-10 FTE ML / software engineers

I think this problem is so hard and open-ended that your 5-10 average-salary, non-star ML FTEs would likely produce very mediocre and probably unusable results.


Uniprot fills a pretty big niche - proteins. Your body is literally made out of proteins. They carry out most of the work and chemical reactions that amount to what we call life, and almost all drugs work by modulating some protein target in some way. UniprotIDs are the canonical identifiers that computational biologists use for proteins.
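For a sense of what "canonical identifier" means in practice, here's a minimal sketch that pulls one entry from UniProt's public REST API (the endpoint shape and JSON field names reflect the current API as I understand it, so treat those details as assumptions; P69905, human hemoglobin subunit alpha, is just an example accession):

    # Minimal sketch: fetch one UniProt entry by accession via the REST API.
    import requests

    acc = "P69905"  # example: human hemoglobin subunit alpha
    r = requests.get(f"https://rest.uniprot.org/uniprotkb/{acc}.json", timeout=30)
    r.raise_for_status()
    entry = r.json()

    print(entry["proteinDescription"]["recommendedName"]["fullName"]["value"])
    print(entry["sequence"]["value"][:60], "...")  # first 60 amino acids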

But they are just one knowledge base. There are others. Each with their own focus area. And they are all associated with prominent bioNLP / biomedical AI research labs and employ human SME curators.

> 5-10 average-salary, non-star ML FTEs would likely produce very mediocre and probably unusable results

LOL!


> But they are just one knowledge base. There are others.

So, is your claim that all topics/niches/verticals/steps in pharma development and production are covered by all those databases, with absolute quality and a perfect UI/workflow? I follow some studies about trial meta-analyses, and my impression is that the end result is mainly produced by the manual work of low-paid postdocs, which makes them hard to trust.

> LOL!

I actually claim non-trivial expertise in this area (fact extraction from non-trivial niche documents), and my observation is that even SOTA results (i.e., from top researchers) are hardly usable in real life, because this area is very hard. Could you support your "LOL!" with any references I could check?


I already gave you a real life example with the uniprot reference. Here is another flagship knowledge base that heavily leverages NLP extraction.[0] Here is another one that gets used in what seems like every network biology article.[1]

Meta-analyses? Automating meta-analyses is not a real need. They have their place, but it's a quaint cottage-industry type of thing - like custom haberdashery.

Also the most valuable knowledge is not in any publication. If you are reading about it in an article then you are already 2-3 years too late.

0. https://geneontology.org

1. https://string-db.org/


> that heavily leverages NLP extraction

I searched https://www.google.com/search?q=site%3Ageneontology.org+nlp+... and didn't see anything meaningful.

> Automating meta analyses is not a real need. They have their place, but it’s like a quaint cottage industry type thing - like custom haberdashery.

My opinion is that it should be a top-level tool in modern science, one that could easily sort out lots of BS and contradictions in reported results.


The parent is right; saying a database about proteins 'looks like one narrow niche knowledge base' in this context is just nonsense.


In which context? And why should your bold statements be trusted blindly?


TL;DR: I explain what proteins are and why they're important, show some biological "flowcharts", and end up with one "function definition" from the "source code" that makes you you, and has to do with you eating and breathing.

Long:

This is the bit that makes biology awesome to me, so excuse me for the small essay ;-)

Proteins are basically extremely advanced nanomachines which work together in larger systems to ultimately form a cell. Having a listing of all the proteins in a cell (the sum of the parts) is insufficient to grok the whole, but it's pretty darn important. Their abilities and limitations determine and constrain what a cell can do, and ultimately influence what organisms and ecosystems are and are not capable of. Which is a big chunk of the science of biology.

Uniprot lists many/all of the genes/proteins that have been decoded so far. It's a bit odd to call that a "niche" in the context of the field of biology.

I'm not going to discourage you though: Proteins and protein systems are pretty darn awesome!

Example:

(using KEGG rather than uniprot, since it's got graphical maps, which is handy to get an intuition)

For instance, if you want to know why you need to eat and why you need to breathe (flowchart for what your cells do with starch and oxygen):

* https://www.genome.jp/pathway/map00500 start by finding starch on this map (gets split into glucose)

* https://www.genome.jp/pathway/map00010 which gets broken down into 2* pyruvate

* https://www.genome.jp/pathway/map00020 which gets processed

* https://www.genome.jp/pathway/map00190 and ultimately "burned" with oxygen.

Each step is a 'chemical reaction catalyzed by proteins'[1] (in the rectangles). You can dig in deeper to find your actual source code: Say we click on a random step (in this case near the top of glycolysis on map 00010)

* https://www.genome.jp/entry/K01810+K06859+K13810+K15916+5.3....

At the bottom you can find the gene listed for Homo Sapiens (HSA)

* https://www.genome.jp/entry/hsa:2821

And this lists the amino-acid (AA) sequence for the protein, and the nucleotide (NT) sequence found in humans. Since this is highly preserved functionality, that's probably (almost) exactly the source code that you have in each of your cells.
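If you'd rather poke at this programmatically than click through maps, KEGG also exposes a plain-text REST API; a minimal sketch (hsa:2821 is the gene from the walkthrough above; the /get routes are KEGG's documented endpoints):

    # Minimal sketch: pull the KEGG entry and protein sequence for hsa:2821
    # (the glycolysis gene linked above) via KEGG's REST API.
    import requests

    entry = requests.get("https://rest.kegg.jp/get/hsa:2821", timeout=30)
    print(entry.text.splitlines()[0])  # ENTRY line for the gene

    aaseq = requests.get("https://rest.kegg.jp/get/hsa:2821/aaseq", timeout=30)
    print(aaseq.text[:200])  # FASTA-formatted amino-acid sequence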

KEGG is nice to get an overview of some of the pathways that are fully understood with the maps.

[1] calling it a "chemical reaction" is sort of underselling many proteins. Proteins can have moving parts and can work together. I prefer to think of them as sophisticated nanomachines.


> Uniprot lists many/all of the genes/proteins that have been decoded so far. It's a bit odd to call that a "niche" in the context of the field of biology.

I actually checked UniProt. Yes, it lists proteins (probably most of them), but the ontology is rather narrow; it has a few dozen properties. You can't, for example, query that DB with the question "give me diseases which can be attributed to broken pathways synthesizing protein X"; you would need to do a lot of manual work and check external databases of uncertain quality.

Another question is the quality of that dataset: why is it so obvious that all those millions of pathways for hundreds of thousands of proteins have been researched and described with 100% accuracy?


> You can't, for example, query that DB with the question "give me diseases which can be attributed to broken pathways synthesizing protein X"; you would need to do a lot of manual work and check external databases of uncertain quality.

nod

Uniprot is more useful if you're looking for the actual "bare metal" NT and AA sequences. Which is rather important in its own right, obviously: Sooner or later you DO need the actual sequences if you're going to do something with them in real life.

But uniprot doesn't -itself- give you an understanding of what that code is then doing.


> This looks like one narrow, niche knowledge base. I am not an expert in this domain, but there are probably some other use cases not covered by existing offerings.

UniProt is not niche. It contains curated information about all proteins across numerous domains.


Yes, proteins are a niche in the grand scheme of things.


The GP was talking about pharma. Proteins are not niche in pharma. Everything else may be niche in that domain, but proteins make up more than 95% of the targets of the pharmaceutical industry.


> The GP was talking about pharma.

The post is not about pharma specifically, but about general biomedical literature, including genes, diseases, symptoms, etc.

> proteins make up more than 95% of the targets of the pharmaceutical industry.

Do you have references to support this? A Google search says it is a $280B market out of a $1.5T total pharma market: https://www.alliedmarketresearch.com/protein-therapeutics-ma... https://www.statista.com/topics/1764/global-pharmaceutical-i...

Also, knowing how to research and produce protein drugs is likely an important step, but the end target is to cure diseases, so you need lots of additional data about diseases of all types, symptoms, pathways, trials, etc.


This isn't a "do you have references" situation. You're wasting folks time. Nearly all drugs target proteins, with a few that target DNA or RNA.


You can just ignore my comments and walk away. To me you're just another "internet expert" who I'm not sure why I should blindly trust.


Your sources are talking about the ratio of small molecule vs. large molecule drugs. Even if you're developing small molecule drugs you are likely targeting some aspect of protein signaling/gene expression.

People are being dismissive of your comments because to say that proteins are niche in the context of pharma is like saying advertising is niche in the context of Meta and Google.


> People are being dismissive of your comments because to say that proteins are niche in the context of pharma is like saying advertising is niche in the context of Meta and Google.

It's all about how you define the word "niche". For Google, the main revenue stream is supported by several pillars: search tech, infra tech, ads tech, ecosystem + network effects, human management. Remove one pillar and everything is destroyed, so one can say ads is one of the niches in their food chain. I suspect it's about the same with proteins.

> in the context of pharma

There is no pharma context. The post is about biomedical publications more broadly.


I'd say it's less like pillars and more like (emergence) layers, with proteins being a pretty important layer.

I'm now not entirely sure what your experience is with bio sciences. You're definitely coming at it from an odd angle though!


I didn't claim expertise; that's why I say "it looks like" and "I suspect".


Well, maybe find some time and dive in a bit and see what can be found?

You never know, maybe you'll end up contributing to our understanding of life, maybe (indirectly) even save a few lives!


I am working on a service which can potentially answer questions like the one in this comment: https://news.ycombinator.com/item?id=38109294

Life science is one of the potential applications, if there is interest and money.


So the pathway to synthesize every protein is +/- the same: That's gene transcription[1] and translation[2]. If that's broken, you're in big trouble!

But if you mean looking at metabolic pathways in general, where each protein catalyses a step in the pathway, that's definitely interesting. If a certain person has a flawed gene coding for protein X, that could indeed cause a problem.

To find valid answers, you might need to e.g. track nodes and states in a graph, to figure out all the consequences of a break. Not all types of storage systems/engines are equally good at that.

[1] https://en.wikipedia.org/wiki/Transcription_(biology)

[2] https://en.wikipedia.org/wiki/Translation_(biology)

edit: s/protein pathway/metabolic pathway/
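To make "track nodes and states in a graph" concrete, a minimal sketch with a toy pathway (toy data, not real biology; real pathways have redundant routes, so plain reachability overstates the damage):

    # Minimal sketch: propagate the consequences of one broken protein
    # through a toy pathway graph with a breadth-first traversal.
    from collections import deque

    # edges: node -> things that depend on it (hypothetical toy pathway)
    pathway = {
        "gene_X": ["protein_X"],
        "protein_X": ["metabolite_A"],
        "metabolite_A": ["metabolite_B", "metabolite_C"],
        "metabolite_B": ["phenotype_1"],
    }

    def downstream_of(broken, graph):
        """Everything that can no longer be produced if `broken` fails."""
        seen, queue = set(), deque([broken])
        while queue:
            for nxt in graph.get(queue.popleft(), []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    print(downstream_of("protein_X", pathway))
    # {'metabolite_A', 'metabolite_B', 'metabolite_C', 'phenotype_1'}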


> Not all types of storage systems/engines are equally good at that.

Yes, I built a system which traverses paths in graphs with 1B nodes and 10B links in 1h on an affordable server. But that's only one part of the puzzle.


Neat!


> Word of advice to all those who are chomping at the bit to disrupt pharma with AI.

Literally the first line in the comment that started this thread.


Sure, now let's read the post?


I read the post before I made any criticism of your comments. We're talking on a thread within the larger context of comments on the post. But more importantly, if you read the post, you will see there is a theme of industrial biochemistry (IE, pharma and biotech) running through it, because pharma/biotech is the primary consumer of these products, and the vast majority of the revenue stream.


> there is a theme of industrial biochemistry

That's one of the themes (and you are already working hard to stretch drug pharma into "biochemistry"). If you can't see other themes in his examples and screenshots, I think this discussion is not interesting to me.


The first Google search link you've provided is focused on proteins as an active ingredient (like an antibody), not the targets.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6314433/ Check out this article for example.


So, that link says they found 1.5k FDA-approved drugs which target proteins, while the FDA has a total of 19k drugs approved: https://www.fda.gov/media/115824/download#:~:text=FDA%20regu....


Hi there. Apologies for any confusion. I was clarifying the intent of the first comment in this thread. I will add some impressionistic comments below in case it helps your future service.

Even when a drug target is unknown, or remains mysterious, very high chances are that the target is a protein. DNA or RNA as targets are niche (DNA is often avoided as a target on purpose and it's hard to be specific to it without DNA-like material, and RNA is still hard to target effectively, though things are improving). There is not much else of use in the cells (lipids, sugars, cofactors, and metabolites, some examples of which have been targeted by a couple drugs each over the long history of trials, often unintentionally).

Small molecules are an excellent modality for an eventual approved therapy. They almost always (so far) target a protein. They are hard to design, but when they are done well they expose the target to something nature hasn't seen before in order to get a desired effect. Sometimes people don't care about the target itself (think recreational drugs, or phenotypic drug discovery), but the target typically remains a protein.

Proteins make up the machinery of the cell. You jam or modify them to achieve desired effects.


Also, and totally minor: there are nowhere close to 19k different approved small molecules. The same drug can be included in multiple products or formulations, bringing that number you mentioned to 19k marketed products. Each generic formulation of ibuprofen increases the latter count by 1. Counting all the marketed products of pure orange juice may add up to a large number, but it is still one ingredient.


Yes, I compare drugs with drugs, not molecules with drugs.


No. The analysis in the paper above referred to molecules, while the snippet from the FDA referred to products. More generally, other than marketing materials to doctors or patients or very rare exceptions talking about formulations etc., the scientific literature refers to the molecular entity as a drug, not the particular named/branded product. I hope that the tool you are building will not mix up such concepts.


The link you posted found "1,578 US FDA-approved drugs" (exact quote).


Yes that is the standard language in the field. It does not refer to the number of marketed products. Look up the number of novel drugs approved by the FDA per year. It only recently exceeded 40 per year, and the FDA has not existed for very long.


If we assume you are not making things up (of which I am not at all confident), then your link is useless in this discussion, because it doesn't give a sense of how many non-protein-targeting drugs the FDA has approved.


And an increasing number of drugs - in the "old days" it was almost entirely small molecules, but they are starting to peter out (both because the low-hanging fruit has already been plucked, and also because small molecules are usually a terrible way to modulate biological activity in specific ways).


This was a fascinating read. The well-meaning and hard-working author went through several iterations of trying to make a profit on 'academic-knowledge-graph-adjacent' products, but things ultimately fell through.

The article describes two separate things likely to appeal to HN readers. The first is that there is a lot of tacit knowledge not captured in scientific publications. The second is that the author and his team, despite best efforts, never found product-market fit.

To the first point: it wasn't until I got to graduate school that I realized that the scientific literature isn't exactly 'An accurate record of true facts.' It is instead the paper-trail of a slow-moving conversation among researchers, where old ideas are slowly jettisoned and new ideas are evaluated and tried on.

To the second point, a few reactions:

* If your primary market / audience is graduate students or post-docs, good luck! You can't sell to people who don't have money: grad students are paid around minimum wage; for postdocs it's slightly better. If I had to sell a science or research-adjacent product, I'd either sell to entire departments or colleges / campuses. This is likely a pretty protracted sales process and doesn't seem pleasant.

* I wonder if Law would be a better place to start. Either selling to law-firms directly (building tools for generating internal knowledge graphs when they are given 80k documents in the discovery process), or particularly for IP lawyers. IP lawyers have the money and expertise to have their skills augmented by AI-powered literature searches. I have to imagine it's not just patents they read when looking for prior art.

* I wonder if the author and his team tried to solve too broad of a problem: they never seem to have gotten hyper-specific, and built something from the bottom up.


Academic here.

To be honest, while the author's depiction of academic publishing is mostly not wrong, they make it sound much worse than it actually is. Folk knowledge is a thing, but papers do contain most of the valuable knowledge if you know how to read them.

I think 95% of this person's failure to monetize their product comes from trying to sell it to an audience that is just quite broke, and the rest is probably mostly post hoc rationalization. Not only are grad student and postdoc wages low, but in many countries (not the US) professors aren't well paid either (and buying software subscriptions from grant funds is often not allowed, or is difficult due to crazy bureaucracy).

As a full professor myself, I almost don't buy software for work. I suffer the torture of Microsoft Office, which my institution is subscribed to, I'm subscribed to Overleaf with grant money (for now, but I might be forced to cancel depending on how the funding goes) and I pay for ChatGPT out of pocket because trying to use grant money for that is bureaucratic hell. That's all. It would take a really transformative piece of software for me to subscribe to something else.


What do you use ChatGPT for? I know this has been discussed to death but never by someone outside of the mobile app writing business.


Quite a lot of things. A (probably non-exhaustive, off the top of my head) list of things where it saves me the most time:

- Bureaucracy. Writing silly boilerplate, e.g. data management plans or gender perspective statements in grant proposals.

- Cutting or expanding text (we routinely have lots of forms and submissions where you need to write a text in a given word or character range).

- Polite emails in English to people I don't know much (e.g. "Write a polite professional email reminding this person that the deadline for reviewing paper Y expired yesterday...")

- Brainstorming. "Give me 10 ideas about research direction in topic X". It won't give great ideas, but it's good to set the mind rolling.

- Routine scripts/code used in experiments and papers: write a Python script to make a box plot with such and such data, or take a file in this format and strip this unneeded content, etc. The typical kind of code that appears a lot in research: trivial to write but time-consuming, and ChatGPT does it in seconds (see the sketch after this list).

- Suggest titles (paper titles, grant proposal titles, etc.).

- Suggest ideas for exercises or exam questions (e.g. write an assignment that can be solved with the coin change algorithm but involves no coins or currency).

- How to do X in Excel (although the problem here is that my Excel is in Spanish - why, why did they decide to translate function names? - and it's not that good at that - but anyway, it's very useful).
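For the routine-scripts item above, the kind of trivial-but-time-consuming code in question looks roughly like this (hypothetical data):

    # The sort of throwaway plot script that eats time but needs no thought.
    import matplotlib.pyplot as plt
    import numpy as np

    rng = np.random.default_rng(42)
    results = {"baseline": rng.normal(1.0, 0.20, 30),
               "method A": rng.normal(1.3, 0.25, 30),
               "method B": rng.normal(1.1, 0.15, 30)}

    fig, ax = plt.subplots(figsize=(5, 3))
    ax.boxplot(list(results.values()))
    ax.set_xticks(range(1, len(results) + 1), list(results.keys()))
    ax.set_ylabel("score")
    fig.tight_layout()
    fig.savefig("boxplot.png", dpi=150)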

The productivity boost is very noticeable, well worth the cost, even if it hurts to pay out of pocket for a tool used at work.


Corporate filings. Business intelligence. There is value in those areas. Science is too esoteric. And boy there are a lot of papers that don’t really signify anything at all but they fill up some pages and add to somebody’s paper count.

It is a hoot that they sell access to individual scientific papers for $35, because if you think one will help you with some commercial problem you have, the odds are the real value is $0.00.


> Corporate filings. Business intelligence. There is value in those areas

also, lots of competition already


Competition is good, it means there is a market. There's plenty of empty niches with no competition exactly because there's no money there.


As for corporate filings, there are lots of strong products already; also, the data in those filings is rather limited.

"Business intelligence" is very broad term, maybe it is possible to find market fit there, since area is moving very fast, but hard to judge without seeing specifics.


I happen to run a corporate filings product[1] so I'm curious to know in what way you find the data in filings limited. There are financial statements (ex. balance sheet) & disclosures (ex. litigation) in 100+ page annual reports so our tool makes it easier to find them. We also do AI (ex. sentiment analysis) and diffs (ex. redline / blackline) which yield their own insights.

[1] https://last10k.com
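For what it's worth, the diff part is the easy bit to sketch: Python's standard-library difflib can produce a rough word-level redline of two disclosure excerpts (toy text; not a claim about how last10k actually does it):

    # Rough sketch of a filing "redline": word-level diff of two excerpts.
    import difflib

    prior = "The Company is subject to litigation arising in the ordinary course."
    current = "The Company is subject to material litigation arising in the ordinary course."

    a, b = prior.split(), current.split()
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if tag != "equal":
            print(tag, a[i1:i2], "->", b[j1:j2])
    # insert [] -> ['material']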


> so I'm curious to know in what way you find the data in filings limited.

To me, every filing has maybe 20 essential numbers which are interesting: balance sheet, income statement, and major sectors; everything else is generic boilerplate, and there are dozens of services which will already sell it for cheap.

Not sure what else you can sell to your clients...


I worked at a place where we developed information extraction systems that could be customized to the needs of particular customers. This was before transformers so the technology wasn't 100% ready.

Think of a global aircraft manufacturer turning maintenance documentation into a knowledge graph, a global clothing and shoes retailer building a model of what social media thinks about them, etc. I told other employees that our product could generate enough value for one customer that it would be worth it for one to buy us and... that's what happened.


I'm an engineer (not a scientist), so I'm not sure how much of this is applicable to other fields, but when I've been interested in the "current state of topic X", I have looked for recently published theses on the topic. A good thesis will have summarized the papers for me (gone through all the churn of papers and highlighted the key points).

I think this is kind of what the author attempted to build (i.e. something that spits out the Literature review portion of a thesis.)

I think that's probably why graduate students were excited - they are the people who have to write a thesis at the end of the day.


Law firms have about as much money as pharma, and the major legal research services (Westlaw et al) already have LLM based offerings.


On the medical side, there are knowledgebases that offer clinical decision support, like UpToDate (https://www.wolterskluwer.com/en/solutions/uptodate), that are kept up to date (pun intended) by specialists in their field. Every year or so, the articles are reviewed and updated with new information that has been integrated into practice. For a relatively small fee, a practitioner gets access to pretty much the latest evidence-based standard of care across any specialty. UpToDate is also a commercial product. With a claimed 2+ million subscribers at roughly $200-500/yr, there is clearly money out there for a well made product.

In regards to the article, parsing academic publications and spitting out a word cloud or k-nn graphs of topics isn't going to be useful to a professional. They've already built up a working model in their mind that they've honed over the years. They have years of experience filtering information, and the ones contributing to these knowledgebases have the expertise to curate that information for professionals, which is what's lacking from these NLP experiments.

I do think that ML and tools like SemanticScholar can be used to identify new literature that may affect knowledgebase articles and flag them for review. I'd be surprised if that doesn't already exist to some extent.
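A crude version of that flagging is easy to prototype against Semantic Scholar's public Graph API; a minimal sketch (the endpoint is real as far as I know, but the query, year filter, and fields are placeholder choices):

    # Crude sketch: surface recent papers matching a knowledgebase article's
    # topic via Semantic Scholar's Graph API. Query string is a placeholder.
    import requests

    resp = requests.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={"query": "anticoagulation atrial fibrillation guidelines",
                "year": "2023-",
                "fields": "title,year,citationCount",
                "limit": 10},
        timeout=30,
    )
    resp.raise_for_status()
    for paper in resp.json().get("data", []):
        print(paper["year"], paper["citationCount"], paper["title"])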


Well done discussion of the issues. This is correct in broad strokes. Knowledge is not a static consumable transmitted in journal articles. Science is a social dance, and it's the movement and process that's important. This kind of "knowledge extraction" would be like analyzing song lyrics to try to understand why people dance.



There is a lot of momentum in academic AI publishing to try to stop some of the worst of the BS that the author identifies. It's becoming normalized to publish code in demo form on a place like huggingface, which is massively improving reproducibility. Websites like paperswithcode have done something similar to improve things.

But neither of these are enough to combat the plethora of issues that the author identifies and it makes me so sad, particularly once you get into publishing and see how insane peer review is from every corner of it (paper author, reviewer, conference chair). Academia truly feels like a cartel at times.


Unfortunately, AI academic publishing is not representative of general academic publishing. Luckily, AI academics are heavily influenced by the open source culture of software engineering. It's going to be hard to see the same movement in areas like drug discovery or materials science.


In some areas you can upload the code to GitHub (but very few people do that anyway).

In others you need hardware like microscope X with lens Y and light Z, and cells of line W cultivated with nutrients V by graduate student U, who is the only one that can keep the cells happy. You can't just git clone & config & make it.


Big implications and ponderings here for the slew of RAG based applications that are about to hit the market.

I saw a blog post by a fellow from OpenAI earlier today which basically said "when you refer to a model like Bard, or Claude, or Llama -- you're moreso referring to the dataset, than you are the architecture."

He meant this at training time, but perhaps a similar observation holds in this context: the value of a retrieval/organization system is only as good as the average information quality of its target corpus.


"about" to hit the market? There's like 3000+ Co-Pilots already in the market.

But yes, more to come out next year obviously.


As a fifth-year genetics PhD student, I found myself agreeing with mostly everything here. I also think the logic applies to those of us in the field!

> Divergent Tasks are Hard to Evaluate and Reason About: By "divergent" I mean loosely defined tasks where it's unclear when they're done. That includes "mapping a domain", "gathering evidence", "due diligence" and generally anything without a clear outcome like "book a hotel", "find a barber", "run this assay", "order new vials"...

This is not just a problem for outsiders building a map; it's true on the inside too. I can't count how many projects where I was told to "find some kind of meaningful evidence" for something that is super tenuous.


Nice writeup, and a good explanation of why the academic literature is often less than helpful. There are several issues that aren't really mentioned however. For anyone looking into a specific field (such as the one mentioned in the article, metabolic engineering in yeast), you might want to consider:

1) Identification of all the major research groups working on the problem, including their physical locations and resources, lab standards and data standards, history of grants and proposals, etc. This will not be in the academic literature, and is fairly difficult and possibly expensive to acquire, and if you don't know what to look for, well, you need to hire experts with relevant experience. Think of it as due diligence (Theranos investors got burned because they didn't do this).

2) You have to aggregate a lot of papers to get a good picture of what each research group is up to. A single coherent project might generate a dozen papers scattered here and there, and importantly, they won't have published their failures in most cases, even though that information is just as valuable to an outsider attempting to replicate or advance their work as the successes are.

3) A common mistake is to neglect materials & methods and instead focus on results and discussion. A large fraction of the literature is based on poor methodology and so the results can't really be trusted, and fraud is remarkably widespread in academia, for various reasons from PhDs desperate to graduate to PIs who've made the practice their bread and butter for decades. Clearly written materials and methods sections that include all the information needed to replicate the work are an indication that it's fairly trustworthy. Deliberate obfuscation is a bad sign.


I agree with the OP but only partially. UniProt is a good counterexample of a database that has been built by extracting knowledge from publications and that is incredibly useful.

But it took decades of expert hand-curators going through piles of articles to get to the current state. Also, proteomics articles report very simple outcomes that are relatively easy to annotate. For instance, the subcellular localization of a given protein isoform.


Yes, and UniProt is not a business; it's run by a foundation funded by US/UK/Swiss universities.


> Close to nothing of what makes science actually work is published as text on the web

I appreciate the author didn't find a good product-market-fit. But the above claim makes no sense and isn't supported in the article.

The author doubles down on the idea later:

> All that is to say: discovering relevant literature, compiling evidence, finding mechanisms turns out to be a tiny percentage of actual, real life R&D

The author admits to not having worked in research, and this is one place that lack of experience shows. This is the kind of thing that takes years, or decades to develop an appreciation for.

Discovering relevant literature, efficiently and well, is a thing that can make or break you as a research scientist.

Over time I've sensed the devaluation of literature search skills in science, but I've also noticed that those who can't do good literature searches do bad research. They waste time, sometimes years, in re-discovery of old results. They commit resources to experiments that don't need to be done. They have objectively worse ideas because they don't actually understand the field they're working in. They can't see the holes in the literature that lead to opportunities.


There is a wealth of malaria data, painstakingly extracted and georeferenced from academic sources using Wellcome Trust funding [1].

I know this because I put it online before I left them in 2012 [2]; however, I'm unable to use their new data explorer to extract any usable information.

In the past we provided direct access to the malaria endemicity surveys and georeferenced anopheles occurrence records but I'm now struggling to find that for some reason.

[1] https://malariaatlas.org/ [2] https://pubmed.ncbi.nlm.nih.gov/23680401/


Skimming through the discussions here, I thought that perhaps we could change how we teach biology in high school: start by teaching stories of drug discovery, using practical examples that motivate concepts which can then be discussed in greater detail in later courses.


I'm curious: has anyone tried to do what the author did for a subject like physics or geology, mathematics or computer science? I for one would pay for a service that allowed me to discover interesting CS and math papers.


1. Define "interesting".

2. How much would you pay?



