
I want to do something similar. ArchiveBox seems to be the best solution for this sort of self-hosted, searchable web archive. It has multiple search back-ends and plugins to sync browser bookmarks (or even history).

I haven't finished getting it set up though, so take this recommendation with a hefty grain of salt.




How would something like this work in practice? Would you generate any tags or summaries per site when inserting it into the db?


ArchiveBox can extract text from HTML (and possibly PDFs too). I think it can be configured to extract subtitles from YouTube videos as well. That makes full-text search possible, so you could basically have your own offline, curated search engine.
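To make the full-text idea concrete, here is a minimal sketch of an inverted index over already-extracted page text. This is not how ArchiveBox does it (its search back-ends are separate tools); the names `tokenize`, `build_index`, `search` and the example URLs are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each token to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for token in tokenize(text):
            index[token].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every token in the query (AND search)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical archive contents: URL -> extracted plain text.
pages = {
    "https://example.com/a": "Self-hosted web archive with full text search",
    "https://example.com/b": "Notes on sourdough baking",
}
index = build_index(pages)
hits = search(index, "archive search")
```

A real setup would add ranking, stemming and on-disk storage, which is what a dedicated search back-end gives you.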


You could run a full-text search or search against an auto-generated summary. Or, if you want to be fancy, use semantic search like in retrieval-augmented generation (RAG).
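A rough sketch of the ranking step, assuming you already have one summary per page. Plain term-count vectors stand in for embeddings here, so this is only lexical similarity; a proper semantic search would swap `vectorize` for an embedding model. All names and the sample data are hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Stand-in for an embedding: a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank(query, summaries, top_k=3):
    """Return up to top_k URLs whose summaries best match the query."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(s)), url) for url, s in summaries.items()]
    scored.sort(reverse=True)
    return [url for score, url in scored[:top_k] if score > 0]


# Hypothetical summaries: URL -> auto-generated summary text.
summaries = {
    "https://example.com/a": "self-hosted web archive search",
    "https://example.com/b": "sourdough bread recipe",
}
best = rank("web archive", summaries)
```

The retrieval loop stays the same with real embeddings: vectorize the query, score it against every stored vector, return the top few.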



