I want to do something similar. ArchiveBox seems to be the best solution for this sort of self-hosted, searchable web archive. It has multiple search back-ends and plugins to sync browser bookmarks (or even history).
I haven't finished getting it set up though, so take this recommendation with a hefty grain of salt.
ArchiveBox can extract text from HTML (and possibly PDFs too). I think it can be configured to extract subtitles from YouTube videos as well, so it can do full-text search. Basically, you could have your own offline, curated search engine.
You could run a full-text search, or search against an auto-generated summary. Or, if you want to be fancy, use semantic search like in Retrieval-Augmented Generation (RAG).
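
To make the ranking idea concrete, here is a rough Python sketch of similarity search over extracted page text. It does not use ArchiveBox at all: the `ARCHIVE` dict, its URLs, and the `search` function are made up for illustration, and plain word-count vectors stand in for the embeddings a real semantic search would get from a model. Swapping `vectorize` for a call to an embedding model would be the "fancy" version.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for text extracted from archived pages (URL -> text).
ARCHIVE = {
    "https://example.com/sourdough": "How to bake sourdough bread at home with a starter",
    "https://example.com/zfs": "Notes on ZFS snapshots and backups for a home server",
    "https://example.com/bread-history": "A short history of bread and baking",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(text):
    """Word-count vector; an assumption standing in for a real embedding."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(query, docs, top_k=3):
    """Return up to top_k (score, url) pairs, best match first."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(text)), url) for url, text in docs.items()]
    # Drop pages that share no words with the query.
    scored = [(score, url) for score, url in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]


if __name__ == "__main__":
    for score, url in search("sourdough bread", ARCHIVE):
        print(f"{score:.3f}  {url}")
```

With word counts this is really just keyword matching with a ranking on top; the structure (vectorize everything, compare the query by cosine similarity, keep the top few) is the part that carries over to an embedding-based setup.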