

defgeneric on Sept 3, 2024 | on: Web scraping with GPT-4o: powerful but expensive

I've tried this and found it doesn't make much difference. The idea was to preserve the document structure while reducing the token count, so you strip all styles, etc., until you're left with something like a bare structure of divs, then reduce that further. But I found no gain in output quality. It seems that whatever document structure is left after the reduction carries little semantic meaning that can't be conveyed by spaces or newlines. Even something like html2markdown doesn't perform much better. So in a sense the LLM is "too good", and all you really need to worry about is reducing the token count.
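
For illustration, here is a minimal sketch of the kind of reduction described above: drop styles, scripts, and all tags, and keep only text separated by newlines. It is not the commenter's actual code. The names `TextExtractor` and `html_to_text` and the tag sets are assumptions made for this example, and it uses only Python's standard `html.parser`.

```python
from html.parser import HTMLParser

# Assumption: contents of these tags never carry readable text.
SKIP_TAGS = {"script", "style", "noscript", "template"}
# Assumption: these tags are block-level, so they become line breaks.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "table", "ul", "ol", "section",
              "article", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collects visible text and turns block boundaries into newlines."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        # Attributes (style, class, id, ...) are ignored entirely.
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the page text with one non-empty line per block of content."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.parts)
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


sample = ('<div style="color:red"><h1>Title</h1>'
          '<style>.a{}</style><p>Hello <b>world</b></p></div>')
print(html_to_text(sample))
```

The only structure that survives is the line breaks, which matches the observation that spaces and newlines seem to be enough for the model.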

a_wild_dandan on Sept 3, 2024

I wonder if using nested markdown bullet points would help. You would preserve the information hierarchy, and LLMs are phenomenal with (and often output) markdown.
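
One way the nested-bullet idea could be tried is sketched below. This is a heuristic written for illustration, not a tested scraping tool: the names `BulletOutliner` and `html_to_bullets` are hypothetical, the tag sets are assumptions, and it assumes reasonably well-formed HTML with explicit closing tags. Each block of text becomes a bullet, indented under the nearest preceding block that sits at a shallower nesting depth.

```python
from html.parser import HTMLParser

# Assumption: these tags have no closing tag and never contain text.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Assumption: contents of these tags are not readable text.
SKIP_TAGS = {"script", "style"}
# Assumption: inline tags do not start a new bullet.
INLINE_TAGS = {"a", "b", "i", "em", "strong", "span", "code", "small"}


class BulletOutliner(HTMLParser):
    """Records (nesting depth, text) for each block of text in the page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0
        self.skip_depth = 0
        self.buffer = []   # text pieces of the block currently being read
        self.blocks = []   # finished (depth, text) pairs

    def flush(self):
        text = " ".join("".join(self.buffer).split())
        self.buffer = []
        if text:
            self.blocks.append((self.depth, text))

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or tag in INLINE_TAGS:
            return
        if tag in SKIP_TAGS:
            self.skip_depth += 1
            return
        self.flush()       # a new block starts, so close the current one
        self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or tag in INLINE_TAGS:
            return
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        self.flush()
        self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.buffer.append(data)


def html_to_bullets(html: str) -> str:
    """Render the page text as nested markdown bullets (two-space indent)."""
    parser = BulletOutliner()
    parser.feed(html)
    parser.close()
    parser.flush()
    out, stack = [], []    # stack holds raw depths of the open ancestors
    for depth, text in parser.blocks:
        while stack and stack[-1] >= depth:
            stack.pop()
        out.append("  " * len(stack) + "- " + text)
        stack.append(depth)
    return "\n".join(out)


sample = ("<div><h2>Plans</h2><ul><li>Basic<ul><li>Monthly</li></ul></li>"
          "<li>Pro</li></ul></div>")
print(html_to_bullets(sample))
```

Whether this hierarchy actually helps extraction quality is an open question in this thread. The sketch only makes the comparison against flat text easy to run.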

nickpsecurity on Sept 4, 2024

It’s interesting that it didn’t change the performance. It might still reduce cost when pages have a lot of formatting, since the stripping runs on a cheap CPU while every token sent to the model costs GPU time.
