Show HN: ArXiv-txt, LLM-friendly ArXiv papers

westurner · 2025-02-20T14:41:46 1740062506

If you train an LLM on only formally verified code, it should not be expected to generate formally verified code.

Similarly, if you train an LLM on only published ScholarlyArticles [or only their abstracts], it should not be expected to generate publishable or true text.

Traceability for Retraction would be necessary to prevent lossy feedback.

owalerys · 2025-02-26T20:03:41 1740600221

Really clean API design, I'm a fan!

lgas · 2025-02-20T10:05:45 1740045945

It just extracts the abstracts?

jerpint · 2025-02-20T11:59:08 1740052748

For now, yes - abstracts and other metadata.
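
For anyone curious what "abstracts and other metadata" as plain text could look like, here is a minimal sketch. It is not how ArXiv-txt is implemented (the thread doesn't say); it only assumes the Atom feed format that the public arXiv API returns. The function name `entry_to_text`, the output layout, and the sample entry are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Atom namespace used by arXiv API responses.
ATOM = "{http://www.w3.org/2005/Atom}"


def entry_to_text(atom_xml: str) -> str:
    """Turn the first <entry> of an Atom feed into a plain-text summary.

    Hypothetical helper: the field layout below is an illustration,
    not the format ArXiv-txt actually emits.
    """
    root = ET.fromstring(atom_xml)
    entry = root.find(f"{ATOM}entry")
    if entry is None:
        raise ValueError("no <entry> element found")

    def field(tag: str) -> str:
        # Collapse the line breaks and indentation found in feed text.
        return " ".join(entry.findtext(f"{ATOM}{tag}", default="").split())

    authors = [
        " ".join((a.findtext(f"{ATOM}name") or "").split())
        for a in entry.findall(f"{ATOM}author")
    ]
    lines = [
        f"Title: {field('title')}",
        f"Authors: {', '.join(authors)}",
        f"Published: {field('published')}",
        "",
        "Abstract:",
        field("summary"),
    ]
    return "\n".join(lines)


# Made-up sample entry (not a real paper), shaped like an arXiv API response.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>An Example Paper</title>
    <published>2025-01-01T00:00:00Z</published>
    <summary>A made-up abstract
      spanning two lines.</summary>
    <author><name>Ada Example</name></author>
    <author><name>Bob Placeholder</name></author>
  </entry>
</feed>"""

if __name__ == "__main__":
    print(entry_to_text(SAMPLE))
```

Keeping the authors on their own labelled line makes it easy to spot when a field goes missing from the raw text.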

rrekaf · 2025-02-20T21:12:36 1740085956

do you plan on adding descriptions of figures and tables?

jerpint · 2025-02-21T01:22:28 1740100948

Will probably focus on getting the text out of the papers first; figures might be a good next step after that.

sbpost · 2025-02-20T15:58:45 1740067125

The example you give doesn't seem to work - the raw txt does not have authors.

jerpint · 2025-02-21T01:23:33 1740101013

You're right - I hadn't noticed! I fixed it now, thanks for pointing it out.

jmartin2683 · 2025-02-20T14:38:39 1740062319

This would be awesome wrapped in an MCP server/tool call :)

jerpint · 2025-02-20T17:42:10 1740073330

Whoa - I haven't played with MCP yet - might be a good first project!

cchance · 2025-02-21T20:27:00 1740169620

Was super excited that it was going to be the actual papers. Kinda cool, but just abstracts doesn't go very far. Good luck getting the papers working, that's gonna be pretty cool once it works, and then you can feed it all into a vector DB XD