Show HN: "data-to-paper" – autonomous stepwise LLM-driven research

QuadmasterXLII · 2024-05-12T17:11:20 1715533880

It's paper reviewing season for me and I think I got one of these submitted. Took a while reading it to realize that it wasn't just a stupid human writing it, there was literally no substance there to find. I can't share details because the confidentiality statement I signed as part of reviewing was pretty strict. However, going forward we are going to have to start deanonymizing and blacklisting the 'authors,' otherwise the ratio of time spent 'writing' vs reviewer time wasted will be crippling.

twobitshifter · 2024-05-12T17:39:57 1715535597

In one group that I am part of, we had a reviewer use AI on submissions, this scared the larger org and we now have a policy of no-AI reviews - however I think AI is closer to competently reviewing some elements of papers than it is editing them itself. For example, it’s the best spelling / grammar tool I’ve ever seen. Since many submissions are by non native English speakers, a limited AI review comment would make sense to me.

Overall, because of the happy to serve alignment of commercial AI it’s more likely to praise us than be critical, which would mean that OTS models may not fit in to the reviews of methods and conclusions.

disqard · 2024-05-13T05:00:53 1715576453

It is likely that soon, all authors will use AI to "author" their papers, and reviewers will use AI to "review" these papers.

Quite an interesting situation, where academia becomes intermediated via LLMs.

8organicbits · 2024-05-12T06:52:26 1715496746

> You are solely responsible for the entire content of created manuscripts including their rigour, quality, ethics and any other aspect. The process should be overseen and directed by a human-in-the-loop and created manuscripts should be carefully vetted by a domain expert. The process is NOT error-proof and human intervention is necessary to ensure accuracy and the quality of the results.

I'm happy to see this directly stated. Is there any guidance for domain experts on the types of mistakes an LLM will make? The process will be different from vetting a university student's paper so they are unlikely to know what to look out for. How often will a domain expert reject generated papers? Given the large vetting burden, does this save any time versus doing the research the traditional way? I'm honestly wary domain experts won't be used, careful review won't be performed, and believable AI slop will spread in academic channels that aren't ready to weed out these flawed papers. We're relying pretty heavily on personal ethics here, right?

sarusso · 2024-05-12T08:09:19 1715501359

The example paper does not mention what type of diabetes it is about - if type 1 or type 2 - and they have very different risk factors.

While it’s kind of clear form the context that it’s about type 2, I doubt a paper like this would pass a peer review without stating it explicitly, in particular with respect to the data set that could potentially include both. Rigor is essential in drawing scientific conclusions.

I guess this is a good example about the statistical nature of LLMs outputs (type 2 is the most common) and consequentially their limitations...

twobitshifter · 2024-05-12T17:35:03 1715535303

Can we feed llms peer reviews and add a reviewer stage to this? Multi agent system would likely catch the poor effort submissions. It could either just reject or provide feedback if the recommendation was to revise.

ttaallooss · 2024-05-12T15:03:11 1715526191

The hypothesis you have raised about the source of the implicit assumptions these models make is indeed an interesting and plausible one, in my opinion.

Biases in data will always exist, as this is the nature of our world. We need to think about them carefully and understand the challenges they introduce, especially when training large "foundational" models that encode a vast amount of data about the world. We should be particularly cautious when interpreting their outputs and when using them to draw any kind of scientific conclusions.

I think this is one of many reasons why we implemented the system with inherent human overseeing and strongly encourage people to provide input and feedback throughout the process.

robwwilliams · 2024-05-12T04:15:52 1715487352

Most interesting in the omics era. There is a huge gap between massive well structured data and granular use of these data to both develop and test ideas. For one particular family of mice we have about 15 million vectors of phenome data—all of it mappable as genetic loci.

A tool to smoothly catalyze “data to paper” or better yet “data to prevention or treatment” is what we need.

roykishony · 2024-05-12T04:26:16 1715487976

yes that's sounds like the type of data that will be fun to try out with data-to-paper! The repo is now open - you're welcome to give it a try. and happy to hear suggestions for improvements and development directions. data-to-treatment date-to-insights data-to-prevention data-to-???

startupsfail · 2024-05-12T17:00:26 1715533226

Evaluate quality of generated papers on 10-20 samples with peer review.

uniqueuid · 2024-05-12T07:40:07 1715499607

With all the positive comments here, I feel like someone should play the role of the downer.

First of all, it's inevitable that LLMs will be/are used in this way and it's great to see development and discussion in the open! That's really important.

Secondly, this will absolutely destroy some areas of science even more than they have already been.

Why? First, science as all of humankind is always a balance between benevolent and malevolent actors. Science already battles data forgery, p-hacking and replication issues. Giving researchers access to tools like this will mean that some conventional quality assurance processes will fail hard. Double-blind peer review will no longer work when there are 10:1 or 100:1 AI generated to high-quality submissions.

Second, doing analysis and writing a paper is one bottleneck of science, but epistemologically, it's not the important one. There are innumerable ways to analyze extant data and it's completely moot to do any analysis in this way. Simmons, Nelson and Simonsohn / Gelman et al. etc have shown: Given a dataset, (1) the findings you can get are practically always from very negative effects to very positive effects, depending on the setup of the analysis. So having one analysis is pointless, especially without theory. (2) even when you give really good labs the same data and question, almost nobody will get the same result (many labs experiment).

What does this tell us? There are a few parts of science that are extremely important and without them science is not only low-impact, it even has a harmful effect by creating costs for pruning and distilling findings. The really important part are causal analyses, and they practically always involve data collection. That's why sciences with strong experimental traditions fare a bit better - when you need to run a costly experiment yourself in order to publish a paper, this creates a strong incentive to think things through and do high-impact research.

So yeah, we've seen this coming and it must create a big backlash that prevents this kind of research from being published, even if vetted humans.

Source: am a scientist, am a journal editor.

oefrha · 2024-05-12T11:29:38 1715513378

Agreed as a former scientist (theoretical high energy physics). I’ve yet to meet one person in related fields who’s enthusiastic about giving paper mills a 2000% productivity boost while giving honest people a 20% boost at best, and by the looks of it, this kind of data-to-mindless-statistical-correlation agents will hit the already bullshit-laden, not very scientific fields the hardest. I’m not sure that future can be deterred though, the cat is already out of the bag.

YeGoblynQueenne · 2024-05-12T11:35:23 1715513723

I just hope that one day we find the jerk who put the poor animal in the bag in the first place.

Sorry, I just had to. Hottest day of the year in the UK today and warm weather causes me to lose inhibition.

escape_goat · 2024-05-12T17:44:33 1715535873

Generally speaking, I defer to your expertise point of view in the matter, and I agree that it will be far easier to generate meaningless research that passes the test of appearing meaningful to reviewers than it will be to generate meaningful research that passes the test of appearing meaningful to reviewers.

However, it is an open secret that this is already true, is the thing. Meaningful peer review is already confined to islands within a system that has devolved into generating content. The automation of the process doesn't represent a tipping point, and I don't think that the ethically disclosed production of 'research' by large language models is going to represent a significant part of the problem. The errors of the current system will be reduced to absurdity by the existent ethical norms.

pilgrim0 · 2024-05-12T15:21:34 1715527294

So, in the report, the statement "the power of AI to perform complete _end-to-end_ scientific research" is a blatant lie. Given that your comment seems to be the most reasonable one, and considering that I've seen, over and over, that it's always the domain experts who are the least enthusiastic about AI byproducts, I recalled a saying from the Shogun series:

"Why is it that only those who have never fought in a battle are so eager to be in one?"

uniqueuid · 2024-05-12T15:39:44 1715528384

Thanks, that's a nice quote.

With regard to the debate, I think it's good not to engage in too much black-and-white thinking. Science itself is a pretty muddy affair, and we still haven't grown beyond simplistic null hypothesis significance testing (NHST), even decades after its problematic implications became clear.

That's why it's so important to look at the macro implications: I.e. how does this shift costs? As another comment nicely put it, LLMs are empowering good science, but they are potentially empowering bad science at an order of magnitude more.

pilgrim0 · 2024-05-12T16:31:08 1715531468

Having a design background, I agree completely. To explain why design matters in this case, we simply need to look at ergonomic factors: literally the “economy of work.” That’s why I pointed out the "end to end" claim as a lie because it’s impossible to assert such things without thorough testing of the applications and continued analysis of its effects on the whole supply chain. Most of those AI byproducts will likely be laughable in the coming decades, similarly to the recurring weird-form-factor boom surrounding whatever device is in vogue. Refer to the video linked in [1] for good examples of weird PC input devices from the 2000s. It takes considerable time for the most viable form-factors to be established, and once that’s achieved, then the designs of the vast majority of products within a category converge to the most ergonomic (and economic) one. What bothers me most is not the advent of novelty and experiments, but the overconfidence and overpromises surrounding what are merely untested product hypotheses for most of AI applications. The negligible marginal cost of producing derivative work in software, fueled by the high availability of accessible tooling and lack of rigorous design and scientific training, is to blame. Never mind the hype cycle, which is natural and expected. In times like these, it is when we most need pragmatic skepticism. I wonder if AI developers at all care to do the bare minimum due diligence required to launch their products. Seems to be a rare thing in SWE in general.

[1] https://youtu.be/Sbtgc6mi44M?si=X2e0DSlxZjC7_YOf

cess11 · 2024-05-12T13:05:30 1715519130

What's scientific about this? The README.md isn't clear about the philosophy of science that this tool supposedly implements and applies.

Seems to me to be scientific in the same manner ELIZA is therapeutic.

ttaallooss · 2024-05-12T14:31:50 1715524310

I encourage you to look at the manuscript we have put on arXiv: https://arxiv.org/abs/2404.17605 and go through the thread on X: https://x.com/RoyKishony/status/1785319021329674593

We will be happy to explain and even correct ourselves, if needed, if approached in a civil, respectful manner.

cess11 · 2024-05-12T14:56:08 1715525768

Skimmed through much of it, I don't see anything explicit about which philosophy of science is applied. It seems more like automated information processing, similar to what quant finance and similar is up to.

Do you belong to some popperian philosophy? It can't be feyerabendian, since his thinking put virtue as foundational for science. Do you agree with the large journal publishers, that the essence of science is to increase their profits?

Not sure why you think you've earned my respect, and it would be very hard for me to violate your rights since we communicate by text alone.

missblit · 2024-05-12T15:34:30 1715528070

Hello,

Your example paper has omitted non-English characters in the names of anyone who gets cited. Look especially at citation [5], where a lot of the authors have European characters in their names which get omitted.

There is also possibly a missing × or ⋅ in "1.81 10^5" on page 3.

roykishony · 2024-05-12T16:21:45 1715530905

wow - thank you for the meticulous check - these are issues we should certainly fix!

alchemist1e9 · 2024-05-12T14:31:44 1715524304

This is a step forward! Forget the detractors and any negative comments this is a small peek into the future which will include automated research, automated engineering, all sorts of tangible ways to automate progress. Obviously the road will be bumpy, with many detractors and complaints.

Here is a small idea for taking it one step further in the future. Perhaps there could be an additional stage where once the initial data is analyzed and some candidate research ideas generate, a domain knowledge stage is incorporated. So Semantic Scholar API helps generate a set of reference papers currently, instead those papers could be downloaded in full, put into a local RAG, and then have agents read in detail each paper with the summary of the current data in context, effectively doing research, store it’s summaries and ideas in the same RAG, then combine all that context specific research into the material for the further development of the paper.

There is a link to awesome-agents and I’d be curious what their opinion is of various other agent frameworks, especially as I don’t think they actually used any.

For my proposed idea above I think txtai could provide a lot of the tools needed.

ttaallooss · 2024-05-12T14:45:53 1715525153

This is a super cool idea! We have considered implementing a variation of what you suggested, with the additional feature of linking each factual statement directly to the relevant lines in the literature. Imagine that in each scientific paper, you could click on any factual or semi-factual statement to be led to the exact source—not just the paper, but the specific relevant lines. From there, you could continue clicking to trace the origins of each fact or idea.

alchemist1e9 · 2024-05-12T15:48:45 1715528925

> This is a super cool idea!

Thank you. I’m honored you found it useful.

> From there, you could continue clicking to trace the origins of each fact or idea.

Exactly! I think you would like automated semantic knowledge graph building example in txtai.

Imagine how much could be done when price/token drops by another few orders of magnitude! I can envision a world with millions of research agents doing automated research on many thousands of data sets simultaneously and then pooling their research together for human scientists to study, interpret and review.

roykishony · 2024-05-12T15:00:52 1715526052

thanks! indeed currently we only provide the LLM with a short tldr created by Semantic Scholar for each paper. Reading the whole thing and extracting and connecting to specific findings and results will be amazing to do. Especially as it can start creating a network of logical links between statements in the vast scientific literature. txtai indeed looks extremely helpful for this.

alchemist1e9 · 2024-05-12T15:44:22 1715528662

Excellent! I’m glad my input was interesting.

txtai has some demos of automated semantic graph building that might be relevant.

I noticed you didn’t really use any existing agent frameworks, which I find very understandable as their value added can be questionable over DIY approaches. However txtai might fit better with your overall technology style and philosophy.

Has your team studied latest CoT, OPA, or research into Cognitive architectures?

roykishony · 2024-05-12T16:43:47 1715532227

thanks. will certainly look deeper into txtai. our project is now open and you are more than welcome to give a hand if you can! yes you are right - it is built completely from scratch. Does have some similarities to other agent packages, but we have some unique aspects especially in terms of tracing information flow between many steps and thereby creating the idea of "data-chained" manuscripts (that you can click each result and go back all the way to the specific code lines). also, we have a special code-running environment that catches many different types of common improper uses of imported statistical packages.

alchemist1e9 · 2024-05-12T20:28:06 1715545686

“data-chained” will be very valuable, especially for the system to evaluate itself and verify the work it’s performed.

this is obviously just my initial impression on a distracted Sunday but I’m very encouraged by your project and I will absolutely be following it and looking at your source code.

The detractors don’t understand LLMs and probably haven’t used them in the way you have and I have. They don’t understand that with CoT and OPA that they can be used to reason and think themselves.

I’ve used them for full automated script writing, performing the job of a software developer. I’ve also used them to create study guides and practice tests, and then grade those tests. When one implements first hand automated systems with agent frameworks using the APIs it gives a deeper understanding of their power over the basic chat usage most are familiar with.

The people arguing that your system can’t do real science are silly, as if the tedious process and logical thinking is something so complex and human that the LLMs can’t do it when used within a cognitive framework, of course they can!

Anyway I’m very exited by your project. I hope this summer to spend at least a week dedicated to setting it up and exploring potential integrations with txtai for use on private knowledge bases in addition to your public Scholarly published papers.

roykishony · 2024-05-12T16:46:47 1715532407

and yes we are implementing CoT and OPA - but surely there is ton of room for improvements!

jeffreygoesto · 2024-05-12T07:02:45 1715497365

But who wants to spend human time to read all that? To me if seems wet should train an AI to do it. Stanislaw Lem predicted that AI goes on such a tangent that we better not interact with it in his book https://en.m.wikipedia.org/wiki/Peace_on_Earth_(novel)

roykishony · 2024-05-12T14:12:41 1715523161

Thanks everyone for engagement and discussion. Following the range of comments, just a few thoughts:

1. Traceability, transparency and verifiability. I think the key question for me is not only whether AI can accelerate science, but rather how we can use AI to accelerate science while at the same time enhancing key scientific values, like transparency, traceability and verifiability.

More and more these days when I read scientific papers, published either at high impact journals or at more specialized journals, I find it so hard, and sometimes even frustratingly impossible, to understand and check what exactly was done to analyze the raw data and get to the key results, what was the specific chain of analysis steps, what parameters where used, etc, etc. The data is often not there or is poorly annotated, the analysis is explained poorly, the code is missing or is impossible to track, etc etc. At all, it became practically impossible to repeat and check the analysis and the results of many peer reviewed publications.

Why are papers so hard to follow and trace? Because writing clear and fully traceable and transparent papers is very hard, and we don’t have powerful tools for doing this, and it requires the scientific process itself (or at least the data analysis part) to be done in an organized and fully traceable way.

Our data-to-paper approach is designed to provide ways to use AI powerfully, not only to speed up science (by a lot!), but also at the same time to use AI to enhance transparency, traceability and verifiability. Data-to-paper sets a standard for traceability and verifiability which imo exceeds the current level of human created manuscripts. In particular:

1. “Data-Chaining": by tracing information flow through the research steps, data-to-paper creates what we call “data-chained” manuscripts, where results, methodology and data are programmatically linked. See this video (https://youtu.be/mHd7VOj7Q-g). You can also try click-tracing results in this example ms: https://raw.githubusercontent.com/rkishony/data-to-paper-sup...

See more about this and more examples in our preprint: https://arxiv.org/abs/2404.17605

2. Human in the loop. We are looking at different ways to create a co-piloted environment where human scientists can direct and oversee the process. We currently have a co-pilot app that allows users to follow the process, to set and change prompts and to provide review comments at the end of each steps (https://youtu.be/Nt_460MmM8k). Will be great to get feedback (and help!) on ways in which this could be enhanced.

3. P-value hacking. Data-to-paper is designed to raise an hypothesis (autonomously, or by user input) and then go through the research steps to test the hypothesis. If the hypothesis test is negative, it is perfectly fine and suitable to write a negative-result manuscript. In fact, in one of the tests that we have done we gave it data of a peer reviewed publication that reports a positive and a negative result and data-to-paper created manuscripts that correctly report both of these results.

So data-to-paper on its own is not doing multiple hypothesis searches. In fact it can help you realize just how many hypotheses you have actually tested (something very hard for human research even when done honestly). Can people ask data-to-paper to create 1000 papers and then read them all and choose only the single one in which a positive result is found? Yes - people can always cheat and science is built on trust, but it is not going to be particularly easier than any other of the many ways available for people to cheat if they want.

4. Final note: LLMs are here and are here to stay and are already used extensively in science doing (sadly sometimes undisclosed: https://retractionwatch.com/papers-and-peer-reviews-with-evi...). The new models of ChatGPT5, ChatGPT6, ... will likely write a whole manuscript for you in just a single prompt. So the question is not whether AI will go into science (it already does), but rather how to do so and use AI in ways that fosters, not jeopardizes, accountability, transparency, verifiability and other important scientific values. This is what we are trying to do with data-to-paper. We hope our project stimulates further discussions on how to harness AI in science while preserving and enhancing key scientific values.

uniqueuid · 2024-05-12T15:13:29 1715526809

Hi,

thanks for the honest and thoughtful discussion you are conducting here. Comments tend to be simplistic and it's great to see that you raise the bar by addressing criticism and questions in earnest!

That said, I think the fundamental problem of such tools is unsolvable: Out of all possible analytical designs, they create boring existing results at best, and wrong results (i.e. missing confounders, misunderstanding context ...) as the worst outcome. They also pollute science with harmful findings that lack meaning in the context of a field.

These issues have been well-known for about ten years and are explained excellently e.g in papers such as [1].

There is really one way to guard against bad science today, and that is true pre-registration. And that is something which LLMs fundamentally cannot do.

So while tools such as data-to-paper may be helpful, they can only be so in the context of pre-registered hypotheses where they follow a path pre-defined by humans before collecting data.

[1] http://www.stat.columbia.edu/~gelman/research/unpublished/p_...

roykishony · 2024-05-12T16:05:44 1715529944

Thanks much for these thoughtful comments and ideas.

I can’t but fully agree: pre-registered hypothesis is the only way to fully guard against bad science. This in essence is what the FDA is doing for clinical trials too. And btw lowering the traditional and outdated 0.05 cutoff is also critical imo.

Now, say we are in a utopian world where all science is pre-registered. Why can’t we imagine AI being part of the process that creates the hypotheses to be registered? And why can’t we imagine it also being part of the process that analyzes the data once it’s collected? And in fact, maybe it can even be part of the process that help collects the data itself?

To me, neither if we are in such a utopian world, nor in the far-from-utopian current scientific world, there is ultimately no fundamental tradeoff between using AI in science and adhering to fundamental scientific values. Our purpose with data-to-paper is to demonstrate and to provide tools to harness AI to speed up scientific discovery while enhancing the values of traceability and transparency and make our scientific output much more traceable and understandable and verifiable.

As of the question of novelty: indeed, research on existing public datasets which we have currently done cannot be too novel. Though scientists can also use data-to-paper with their own fascinating original data. It might help in some aspects of the analysis, certainly help them keep track of what they are doing and how to report it transparently. Ultimately I hope that such co-piloting deployment will allow us delegating more straight forward tasks to the AI and letting us human scientists to engage in higher level thinking and higher level conceptualization.

uniqueuid · 2024-05-12T17:21:59 1715534519

True, we seem to have a pretty similar perspective after all.

My concern is an ecological one within science, and your argument addresses the frontier of scientific methods.

I am sure both are compatible. One interesting question is what instruments are suitable to reduce negative externalities from bad actors. Pre-registration works, but is limited to few fields where the stakes are high. We will probably similarly see a staggered approach with more restrictive methods in some fields and less restrictive ones in others.

That said, there remain many problems to think about: E.g. what happens to meta-analyses if the majority of findings comes from the same mechanism? Will humans be able to resist the pull of easy AI suggestions and instead think hard where they should? Are there sensible mechanisms for enforcing transparency? Will these trends bring us back to a world in which trust was only based on prestige of known names?

Interesting times, certainly.

alchemist1e9 · 2024-05-12T16:04:14 1715529854

> That said, I think the fundamental problem of such tools is unsolvable: Out of all possible analytical designs, they create boring existing results at best, and wrong results (i.e. missing confounders, misunderstanding context ...) as the worst outcome. They also pollute science with harmful findings that lack meaning in the context of a field.

This doesn't seem correct to me at all. If new data is provided and the LLM is simply an advanced tool that applies known analysis techniques to the data, then why would they create “boring existing results”?

I don’t see why systems using an advanced methodology should not produce novel and new results when provided new data.

There is a lot of reactionary or even luddite responses to the direction we are headed with LLMs.

uniqueuid · 2024-05-12T17:14:37 1715534077

Sorry but I think we have very different perspectives here.

I assume you mean that LLMs can generate new insights in the sense of producing plausible results from new data or in the sense of producing plausible but previously unknown results from old data.

Both these things are definitely possible, but they are not necessarily (and in fact often not) good science.

Insights in science are not rare. There are trillions of plausible insights, and all can be backed by data. The real problem is the reverse: Finding a meaningful and useful finding in a sea of billion other ones.

LLMs learn from past data, and that means they will have more support for "boring", i.e. conventional hypotheses, which have precedent in training material. So I assume that while they can come up with novel hypotheses and results, these results will probably tend to conform to a (statistically defined) paradigm of past findings.

When they produce novel hypotheses or findings, it is unlikely that they will create genuinely meaningful AND true insights. Because if you randomly generate new ideas, almost all of them are wrong (see the papers I linked).

So in essence, LLMs should have a hard time doing real science, because real science is the complex task of finding unlikely, true, and interesting things.

alchemist1e9 · 2024-05-12T22:57:11 1715554631

Have you personally used LLMs within agent frameworks that apply CoT and OPA patterns or others from cognitive architecture theories?

I’d be surprised if you have used LLMs beyond the classic chat based linear interface that is commonly used and still have the opinions you do.

In my opinion, once you combine RAG and agent frameworks with raw observational input data they can absolutely do real reasoning, analysis, and create new insights that are meaningful and will be considered genuine new science. This project/group we are discussing have practically proven this with their replication examples. The reason this is possible is because the LLM is not just taught how to repeat information but it can actually reason and analyze at a human level and beyond when utilizing it’s capabilities within a well designed cognitive architecture using agents.

visarga · 2024-05-12T04:07:59 1715486879

You can train idea-to-paper models on tons of papers with code. There are many examples of paper impl on github.

roykishony · 2024-05-12T04:35:57 1715488557

yes - LLMs tuned based on data science publications will be great. need a dataset of papers with reliable and well-performed analysis. Notably though it works quite well even with the general purpose LLMs. The key was to break the complex process into smaller steps where results from upstream steps are used downstream. that also creates papers where every downstream result is programmatically linked to upstream data.

Cyphase · 2024-05-12T08:35:26 1715502926

@dang typo in title ("Show NH")

MaxBarraclough · 2024-05-12T16:07:35 1715530055

Perhaps they're just focusing on the New Hampshire readers.

YeGoblynQueenne · 2024-05-12T10:42:25 1715510545

Oh, cool. Now all those dodgy conferences and journals that fill my inbox with invitations to publish at their venues can stop bothering me and just generate the research they want themselves.

Eiim · 2024-05-12T16:43:22 1715532202

I'm working on my Master's in Statistics, so I feel I can comment on some of what's going on here (although there are others more experienced than me in the comments as well, and I generally agree with their assessments). I'm going to look only at the diabetes example paper for now, mostly because I have finals tomorrow. I find it to be the equivalent of a STA261 final project at our university, with some extra fluff and nicer formatting. It's certainly not close to something I could submit to a journal.

The whole paper is "we took an existing dataset and ran the simplest reasonable model (a logistics regression) on it". That's about 5-10 minutes in R (or Python, or SAS, or whatever else). It's a very well-understood process, and it's a good starting point to understand the data, but it can't be the only thing in your paper, this isn't the 80's anymore.

The overall style is verbose and flowery, typical of LLMs. Good research papers should be straightforward and to the point. There's also strange mixing of "we" and "I" throughout.

We learn in the introduction that interaction effects were tested. That's fine, I'd want to see it set up earlier why these interaction effects are posited to be interesting. It said earlier that "a comprehensive investigation considering a multitude of diabetes-influencing lifestyle factors concurrently in relation to obesity remains to befully considered", but quite frankly, I don't believe that. Diabetes is remarkably well-studied, especially in observational studies like this one, due to its prevalence. I haven't searched the literature but I really doubt that no similar analysis has been done. This is one of the hardest parts of a research paper, finding existing research and where its gaps are, and I don't think an LLM will be sufficiently capable of that any time soon.

There's a complete lack of EDA in the paper. I don't need much (the whole analysis of this paper could be part of the EDA for a proper paper), but some basic distributional statistics of the variables. How many responses in the dataset were diabetic? Is there a sex bias? What about age distribution? Are any values missing? These are really important for observational studies because if there's any issues they should be addressed in some way. As it is, it's basically saying "trust us, our data is perfect" which is a huge ask. It's really weird that a bunch of this is in the appendix (which is way too long to be included in the paper, would need to be supplementary materials, but that's fine) (and also it's poorly formatted) but not mentioned anywhere in the paper itself. When looking at the appendix, the main concern that I have is that only 14% of the dataset is diabetic. This means that models will be biased towards predicting non-diabetic (if you just predict non-diabetic all of the time, you're already 86% accurate!). It's not as big of an issue for logistic regression, or for observational modeling like this, but I would have preferred an adjustment related to this.

In the results, I'm disappointed by the over-reliance on p-values. This is something that the statistics field is trying to move away from, of a multitude of reasons, one of which is demonstrated quite nicely here: p-values are (almost) always miniscule with large n, and in this case n=253680 is very large. Standard errors and CIs have the same issue. The Z-value is the most useful measure of confidence here in my eyes. Effect sizes are typically the more interesting metric for such studies. On that note, I would have liked to see predictors normalized so that coefficients can be directly compared. BMI, for example, has a small coefficient, but that's likely just because it has a large range and variance.

It's claimed that the AIC shows improved fit for the second model, but the change is only ~0.5%, which isn't especially convincing. In fact, it could be much less, because we don't have enough significant figures to see how the rounding went down. p-value is basically meaningless as previously stated.

The methods section says almost nothing that isn't already stated at least once. I'd like to know something about the tools which were used in this section, which is completely lacking. I do want it highlight this quote: "Both models employed a method to adjust for all possible confounders inthe analysis." What??? All possible confounders? If you know what that means you know that that's BS. "A method"? What is your magic tool to remove all variance not reflected in the dataset, I need to know! I certainly don't see it reflected in the code.

The code itself seems fine, maybe a little over-complicated but that might be necessary for how it Interfaces with the LLM. The actual analysis is equivalent to 3 basic lines of R (read CSV, basic log reg with default parameters 1, basic log reg with default parameters 2).

This paper would probably get about a B+ in 261, but shouldn't pass a 400-level class. The analysis is very simple and unimpressive for a few reasons. For one, the questions asked of the dataset are very light. More interesting, for example, might have been to do variable selection on all interaction terms and find which are important. More models should have been compared. The dataset is also extremely simple and doesn't demand complex analysis. An experimental design, or messy data with errors and missing values, or something requiring multiple datasets, would be a more serious challenge. It's quite possible that one of the other papers addresses this though.

roykishony · 2024-05-12T17:47:15 1715536035

Thanks so much for these thorough comments.

You suggested some directions for more complex analysis that could be done on this data - I would be so curious to see what you get if you could take the time to try out running data-to-paper as a co-pilot on your own - you can then give it directions and feedback on where to go - will be fascinating to see where you take it!

We also must look ahead: complexity and novelty will rapidly increase as ChatGPT5, ChatGPT6 etc are rolled in. The key with data-to-paper is to build a platform that harnesses these tools in a structured way that creates transparent and well-traceable papers. Your ability to read and understand and follow all the analysis in these manuscripts so quickly speaks to your talent of course, but also to the way these papers are structured. Talking from experience, it is much harder to review human-created papers at such speed and accuracy...

As for your comments on “it's certainly not close to something I could submit to a journal” - please kindly look at the examples where we show reproducing peer reviewed publications (published in a completely reasonable Q1 journal, PLOS One). See this original paper by Saint-Fleur et al: https://journals.plos.org/plosone/article?id=10.1371/journal...

and here are 10 different independent data-to-paper runs in which we gave it the raw data and the research goal of the original publication and asked it to do the analysis reach conclusions and write the paper: https://github.com/rkishony/data-to-paper-supplementary/tree... (look up the 10 manuscripts designated “manuscriptC1.pdf” - “manuscriptC10.pdf”)

See our own analysis of these manuscripts and reliability in our arxiv preprint: https://arxiv.org/abs/2404.17605

Note that the original paper was published after the training horizon of the LLM that we used and also that we have programmatically removed the original paper from the result of the literature search that data-to-paper does so that it cannot see it in the search.

Thanks so much again and good luck for the exam tomorrow!

bjornsing · 2024-05-12T05:32:38 1715491958

> data-to-paper is a framework for systematically navigating the power of AI to perform complete end-to-end scientific research, starting from raw data and concluding with comprehensive, transparent, and human-verifiable scientific papers (example).

Even if this thing works I wouldn’t call it “end-to-end scientific research”. IMHO the most challenging and interesting part of scientific research is coming up with a hypothesis and designing an experiment to test it. Data analysis and paper writing is just a small part of the end-to-end process.

rlt · 2024-05-12T06:04:33 1715493873

The very next paragraph:

> Towards this goal, data-to-paper systematically guides interacting LLM and rule-based agents through the conventional scientific path, from annotated data, through creating research hypotheses, conducting literature search, writing and debugging data analysis code, interpreting the results, and ultimately the step-by-step writing of a complete research paper.

bjornsing · 2024-05-12T10:09:01 1715508541

> from annotated data, through creating research hypotheses

Then it’s all just wrong, automated p-hacking. You’re supposed to start with the hypothesis, not generate it from the data you’re about to publish.

YeGoblynQueenne · 2024-05-12T10:48:09 1715510889

More to the point you're supposed to start with an observation that your current theory can't explain. Then you make a hypothesis that tries to explain the observation and collect more observations to try and refute your hypothesis; if you're a good falsificationist, that is. That doesn't seem to be the process described above. Like you say it's just a pipeline from data to paper, great for writing papers, but not much for science.

But I guess these days in many fields of science and in popular parlance "data" has become synonymous with "observation" and "writing papers" with "research", so.