Hacker News
Ask HN: App for collecting a large, searchable database of website snapshots?
5 points by DamnInteresting on Sept 24, 2020 | 1 comment
I do a lot of research and writing, and it is necessary for me to keep snapshots of the web-based sources I rely upon in my writings. The tool I have used historically is no longer viable, so I am seeking a replacement.

Requirements:

• Support for a very large collection of documents, including the HTML and assets (images, CSS, etc.)

• Full-text search

• Annotations (ideally in context)

• Saves original source URL

Nice to have:

• Data stored locally (not just cloud-based)

• Option to include linked pages in snapshot

• Support for static files such as PDFs

Anything to suggest or recommend? Thanks!




I have started hacking on this problem. It has been on my mind for years, but I began coding on it in earnest this past week.

When I want to save something, the system mints a UUID (for the capture, not the resource) and then copies the web page and its resources into a directory. I am using wget for now, but I suspect I'll need something better.

Then the system runs "readability" and writes RDF metadata to a Turtle file, which could be imported into a triple store or document store.
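For anyone curious, the capture step described above can be sketched in a few lines of Python. This is a minimal sketch under assumptions, not the actual implementation: it assumes wget is on PATH, the names (snapshot, turtle_metadata, ARCHIVE_ROOT) are hypothetical, and it emits Dublin Core Turtle by hand rather than going through a readability or RDF library.

```python
import subprocess
import uuid
from pathlib import Path

ARCHIVE_ROOT = Path("archive")  # hypothetical local storage root

def turtle_metadata(capture_id: str, source_url: str, title: str) -> str:
    """Emit minimal Turtle (RDF) metadata for one capture, keyed by the
    capture UUID and using Dublin Core terms for the source URL and title."""
    return (
        "@prefix dcterms: <http://purl.org/dc/terms/> .\n\n"
        f"<urn:uuid:{capture_id}>\n"
        f"    dcterms:source <{source_url}> ;\n"
        f'    dcterms:title "{title}" .\n'
    )

def snapshot(source_url: str, title: str = "") -> Path:
    """Mint a UUID for the capture (not the resource), mirror the page and
    its assets with wget, and write Turtle metadata alongside the files."""
    capture_id = str(uuid.uuid4())
    capture_dir = ARCHIVE_ROOT / capture_id
    capture_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["wget", "--page-requisites", "--convert-links",
         "--directory-prefix", str(capture_dir), source_url],
        check=False,  # wget exits non-zero when some assets fail; keep going
    )
    metadata = turtle_metadata(capture_id, source_url, title)
    (capture_dir / "metadata.ttl").write_text(metadata)
    return capture_dir
```

Keying each capture by its own UUID (rather than by URL) means the same URL can be snapshotted repeatedly without collisions, and the Turtle file keeps the link back to the original source.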

Send a message to the email in my profile and we can talk about it.



