Hacker News

This doesn't allow full text search easily, though.


PDF with an image on one page, then the plain text of the page flowed over following pages.


+ (text minus stop words)
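One way to read "text minus stop words" is as a small filter applied to the page text before it is stored next to the image. A minimal sketch, assuming a tiny hand-picked stop list and a hypothetical helper name (`strip_stop_words`); neither comes from the thread:

```python
import re

# Illustrative stop list only (an assumption); a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "on", "for", "with"}


def strip_stop_words(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)


compact = strip_stop_words("The plain text of the page")
print(compact)
```

The reduced text is smaller than the original but still matches on the distinctive words someone is likely to search for.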


OCR?


Why go from text to image and back to text? Seems wasteful and error-prone...


It's a hard problem to figure out what's readable text on a page and what isn't. Even Google has a hard time figuring that out. OCR works very well with screenshots, and its only cost is computation time. But the real reason is that just having timestamps, URLs, and screenshots is generally good enough. I usually remember roughly when it was, and some words in the URL, and don't need the heavyweight text search setup.
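The lookup described here (a rough date plus a few remembered URL words) needs nothing more than a filter over capture records. A minimal sketch, assuming a hypothetical `Capture` record and `find_captures` helper; the names, fields, and sample data are illustrative and not from any tool mentioned in the thread:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Capture:
    taken_at: datetime       # when the page was saved
    url: str                 # address of the saved page
    screenshot_path: str     # where the screenshot lives on disk


def find_captures(captures, around, window, url_words):
    """Return captures within `window` of `around` whose URL contains every word."""
    words = [w.lower() for w in url_words]
    hits = [
        c for c in captures
        if abs(c.taken_at - around) <= window
        and all(w in c.url.lower() for w in words)
    ]
    # Closest in time to the remembered date comes first.
    return sorted(hits, key=lambda c: abs(c.taken_at - around))


# Made-up sample data for illustration.
captures = [
    Capture(datetime(2024, 3, 2), "https://example.com/blog/sqlite-tips", "shots/1.png"),
    Capture(datetime(2024, 3, 20), "https://example.com/docs/sqlite-fts", "shots/2.png"),
    Capture(datetime(2024, 9, 1), "https://example.com/blog/sqlite-tips-2", "shots/3.png"),
]

# "Sometime around mid-March, and the URL had 'sqlite' in it."
results = find_captures(captures, datetime(2024, 3, 15), timedelta(days=14), ["sqlite"])
for c in results:
    print(c.taken_at.date(), c.url, c.screenshot_path)
```

A linear scan like this stays fast for a personal archive, which is why no index or full-text engine is involved in the sketch.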


It's just hard with the "read more" buttons.


Trying to parse the SPAs of today is just painful. Simpler to just render the page screenshot and OCR! Guaranteed to only index text that actually matters.



