This doesn't allow full text search easily, though.

rhizome · on Jan 20, 2021

PDF with an image on one page, then the plain text of the page flowed over following pages.

BrianOnHN · on Jan 20, 2021

+ (text minus stop words)
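A rough sketch of what "text minus stop words" could look like before the words are stored next to the image. The stop-word list and the `index_terms` helper are made up for illustration; a real list would be much longer.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption, not from the thread).
STOP_WORDS = {
    "a", "an", "and", "the", "of", "to", "in", "is",
    "it", "on", "for", "that", "this", "with",
}

def index_terms(page_text: str) -> list[str]:
    """Lowercase the text, split it into words, drop stop words and repeats."""
    words = re.findall(r"[a-z0-9]+", page_text.lower())
    seen = set()
    terms = []
    for word in words:
        if word in STOP_WORDS or word in seen:
            continue
        seen.add(word)
        terms.append(word)  # keep first-seen order
    return terms

if __name__ == "__main__":
    print(index_terms("The cat sat on the mat, and the cat slept."))
```

Keeping only the distinct non-stop words makes the stored text smaller, but it also means phrase search no longer works, only keyword lookup.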

ramraj07 · on Jan 20, 2021

Triv888 · on Jan 20, 2021

why go from text to image and back to text? seems wasteful and error-prone...

lazyjeff · on Jan 20, 2021

It's a hard problem to figure out what's readable text on a page and what isn't. Even Google has a hard time figuring that out. OCR works very well on screenshots and costs only computation time. But the real reason is that just having timestamps, URLs, and screenshots is generally good enough. I usually remember roughly when it was and some words in the URL, so I don't need a heavyweight text search setup.
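To make the "timestamps, URLs, and screenshots" lookup concrete, here is a minimal sketch. The `Capture` record and `find_captures` function are hypothetical names for illustration, not the actual tool; it only assumes each screenshot is logged with a date and a URL.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Capture:
    taken_on: date          # when the screenshot was taken
    url: str                # page that was captured
    screenshot_path: str    # where the image lives on disk

def find_captures(captures, start, end, url_words):
    """Return captures taken between start and end (inclusive)
    whose URL contains every given word, case-insensitively."""
    wanted = [w.lower() for w in url_words]
    return [
        c for c in captures
        if start <= c.taken_on <= end
        and all(w in c.url.lower() for w in wanted)
    ]

if __name__ == "__main__":
    log = [
        Capture(date(2021, 1, 5), "https://example.com/rust-async-guide", "a.png"),
        Capture(date(2021, 1, 18), "https://example.org/Rust/ownership", "b.png"),
        Capture(date(2020, 11, 2), "https://example.com/rust-macros", "c.png"),
    ]
    for hit in find_captures(log, date(2021, 1, 1), date(2021, 1, 31), ["rust"]):
        print(hit.taken_on, hit.url, hit.screenshot_path)
```

A date range plus a couple of URL words narrows things down to a handful of screenshots, which can then be checked by eye.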

Moru · on Jan 20, 2021

Just hard with the "read more" buttons.

ramraj07 · on Jan 20, 2021

Trying to parse the SPAs of today is just painful. Simpler to just render the page to a screenshot and OCR it! Guaranteed to only index text that actually matters.