Sorry, I guess you were not being sarcastic. LLMs are good at vocabulary and syntax, but not good at content (because nothing in their architecture is designed for that). Since the kind of article we're trying to catch would read exactly the same if it were true, an LLM is not a good match for finding it.

Now, there might be algorithms that could help, for example by automatically checking for doctored photos or unattributed reuse of previously published images. These sorts of things would also not be an LLM's forte.

My apologies again; it's just that LLMs are the subject of so much hype nowadays that I genuinely thought you might be saying this in jest.




I think you misunderstood my proposal.

LLMs are good at producing embeddings, which are latent representations of the content in the text. For research papers, that content includes things like authorship, research directions, and citations to other papers.

When you fine-tune a model that generates such embeddings on a labeled dataset representing fraud (consisting of, say, thousands of samples), the resulting model will produce different embeddings, which can be clustered.
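
As a rough sketch of what that fine-tuning step could look like (assuming sentence-transformers, an off-the-shelf base embedding model, and a hypothetical list of labeled abstracts; none of this is from a real pipeline):

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    # Hypothetical labeled data: (abstract text, is_fraud) pairs, e.g. built
    # from retraction databases or known paper-mill output.
    labeled_abstracts = [
        ("abstract of a known-fraudulent paper", 1),
        ("abstract of another fraudulent paper", 1),
        ("abstract of a legitimate paper", 0),
        ("abstract of another legitimate paper", 0),
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")  # any base embedding model

    examples = [InputExample(texts=[text], label=label)
                for text, label in labeled_abstracts]
    loader = DataLoader(examples, shuffle=True, batch_size=16)

    # A batch triplet loss pulls same-label embeddings together and pushes
    # fraud / non-fraud apart, so the fraud signal ends up in the geometry.
    loss = losses.BatchAllTripletLoss(model=model)
    model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
    model.save("fraud-tuned-embedder")
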

The clusterings will differ between the model trained with the fraudulent information and the one trained without it.

Now, using this embedding model, you (may) have a way to discern what truly significant research looks like, versus research that has been tainted by regurgitation of untrustworthy data.
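
And a correspondingly rough sketch of the comparison step (again just illustrative; KMeans and the adjusted Rand index are stand-ins for whatever clustering and agreement measure you prefer, and the corpus here is a toy placeholder):

    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    abstracts = ["abstract text 1", "abstract text 2", "abstract text 3"]  # corpus to screen
    k = 2  # tiny here; thousands of papers would want a much larger k

    base = SentenceTransformer("all-MiniLM-L6-v2")
    tuned = SentenceTransformer("fraud-tuned-embedder")  # from the sketch above

    base_labels = KMeans(n_clusters=k, n_init=10).fit_predict(base.encode(abstracts))
    tuned_labels = KMeans(n_clusters=k, n_init=10).fit_predict(tuned.encode(abstracts))

    # Low agreement means the fraud-aware model is regrouping papers;
    # the papers that change clusters are the ones worth a closer look.
    print("clustering agreement (ARI):", adjusted_rand_score(base_labels, tuned_labels))
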
