I do a lot of research and writing, and I need to keep snapshots of the web-based sources I rely on. The tool I have used historically is no longer viable, so I am looking for a replacement.
Requirements:
• Support for a very large collection of documents, including the HTML and assets (images, CSS, etc.)
• Full-text search
• Annotations (ideally in context)
• Saves original source URL
Nice to have:
• Data stored locally (not just cloud-based)
• Option to include linked pages in snapshot
• Support for static files such as PDFs
Anything to suggest or recommend? Thanks!
When I want to save something, the system generates a UUID (for the capture, not the resource) and then copies the web page and its resources to a directory. I am using wget for now, but I suspect I'll need something better.
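Here is a minimal sketch of that capture step, not my actual code. The function names, the one-directory-per-capture layout, and the source-url.txt file are assumptions made up for illustration; the wget options are standard ones for fetching a page together with its assets.

```python
import subprocess
import uuid
from pathlib import Path


def plan_capture(url: str, root: Path) -> tuple[str, Path, list[str]]:
    """Allocate a fresh capture ID and build the wget command for it."""
    # The UUID identifies this capture, not the URL, so saving the
    # same page twice yields two independent snapshots.
    capture_id = str(uuid.uuid4())
    target = root / capture_id
    command = [
        "wget",
        "--page-requisites",   # also fetch images, CSS, scripts
        "--convert-links",     # rewrite links so the copy works offline
        "--adjust-extension",  # add .html/.css suffixes where missing
        "--span-hosts",        # allow assets served from other hosts
        "--directory-prefix", str(target),
        url,
    ]
    return capture_id, target, command


def capture(url: str, root: Path) -> Path:
    """Create the capture directory, record the source URL, run wget."""
    capture_id, target, command = plan_capture(url, root)
    target.mkdir(parents=True)
    # Hypothetical convention: keep the original URL next to the files.
    (target / "source-url.txt").write_text(url + "\n", encoding="utf-8")
    # wget can exit non-zero when a single asset fails to download,
    # so a non-zero status is not treated as fatal here.
    subprocess.run(command, check=False)
    return target
```

Calling capture("https://example.com/article", Path("captures")) would create captures/<uuid>/ with the page and its assets inside. Keeping the command construction separate from the subprocess call makes it easy to swap wget for another fetcher later.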
Then the system runs "readability" and writes RDF metadata to a Turtle file, which could be imported into a triple store or document store.
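To give an idea of the metadata step, here is a rough sketch that writes Turtle using only the Python standard library. The choice of Dublin Core terms, the urn:uuid subject, and the function names are assumptions for illustration, not the exact vocabulary my system uses. The title is assumed to come from the readability step, and the URL is assumed to already be a valid IRI.

```python
from datetime import datetime, timezone


def turtle_string(value: str) -> str:
    """Quote a Python string as a Turtle string literal."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def to_turtle(capture_id: str, url: str, title: str,
              captured_at: datetime) -> str:
    """Describe one capture as a small Turtle document."""
    # Assumption: url contains no characters that are illegal inside
    # <...> in Turtle (spaces, angle brackets, quotes, and so on).
    lines = [
        "@prefix dcterms: <http://purl.org/dc/terms/> .",
        "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .",
        "",
        f"<urn:uuid:{capture_id}>",
        f"    dcterms:source <{url}> ;",
        f"    dcterms:title {turtle_string(title)} ;",
        f'    dcterms:created "{captured_at.isoformat()}"^^xsd:dateTime .',
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_turtle(
        "123e4567-e89b-12d3-a456-426614174000",  # example ID only
        "https://example.com/article",
        "An example article",
        datetime.now(timezone.utc),
    ))
```

The resulting text can be saved as a .ttl file inside the capture directory and loaded into a triple store later.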
Send a message to the email in my profile and we can talk about it.