Very cool. This seems like such a "Duh. Users want this" feature. I wish it had been integrated into Firefox years ago. I bookmark some sites and then come back years later, only to find the URL has been hijacked or its parameters have changed.
And then when I return, I get a 404. Instead of bookmarking, I'd love to "capture" the current info, divs, and graphics.
You can do that with the Firefox extension "SingleFile" [1]. Its main purpose is to save a full web page as one single HTML file, with images etc. encoded into the HTML (no more messy HTML file + folder clutter), very neat. There is also a variant called SingleFileZ [2], an HTML/ZIP hybrid that produces smaller saved files.
In the settings you can configure it so that sites you bookmark are saved automatically, and you can also link the bookmark to the locally saved file if you want to.
That looks great! Firefox should have something like SingleFile built-in. Instead of just copying features from Chrome, they should lead the innovation and add useful features like this.
I've been using SingleFile for a couple of months now, to auto-save every page I visit, and it works well. It's exactly as simple as it needs to be. My only real complaints are that it can be slow on large pages and that it pollutes your download history (every auto-save shows up as a download), both of which I suspect are due to limitations in the extension API.
While it's far from an ideal solution, you can periodically submit your bookmarks to archive.org's "Save Page Now!" service. It's easy to semi-automate - here's how I use it with pinboard.in bookmark exports: https://pastebin.com/uUVE22RD
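For illustration, here's a minimal sketch of that idea in Python, assuming a plain-text file of URLs (the filename and the 5-second delay are just assumptions, not part of the pastebin script above); it simply hits the public Save Page Now URL for each bookmark.

```python
# Minimal sketch: submit a list of bookmarked URLs to archive.org's
# "Save Page Now" service. Assumes a plain-text file with one URL per
# line (e.g. exported from pinboard.in) -- the filename is made up here.
import time
import requests

with open("bookmarks.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    # Requesting https://web.archive.org/save/<url> asks the Wayback
    # Machine to capture that page.
    resp = requests.get(f"https://web.archive.org/save/{url}", timeout=60)
    print(url, resp.status_code)
    time.sleep(5)  # be polite; the service rate-limits aggressive clients
```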
This solution is on the user side, which is great because each person can get and manage saved pages for themselves.
But if we're looking for a developer side solution, then making pages that last an order of magnitude longer may be better for everyone in the long run, e.g. https://jeffhuang.com/designed_to_last/
I like the ability to use tags too -- I've got various product/tech ideas and it's nice collecting information with it and not having to worry about the pages going away or changing.
Does it capture your view of the page, though (i.e. use your own browser, with its cookie jar, to do the scrape)? I'd like to, for example, snapshot my Facebook feed.
I made an application[0] that does capture your view. It's screenshot-based. It works outside of a browser too - anything on your screen. All local too, so no privacy concerns :)
No it does not. That means it also doesn't work for any news sites that you have subscriptions to.
I'm using the Joplin web clipper pretty heavily for this purpose.
Afraid of losing the contents of a page, I used to save it as a PDF in the cloud. Far from ideal, but it happened way too often that I'd go back to a page and get a 404 error or similar.
If enough people did this, it would provide more pressure for websites not to just nuke their old URL schemes every time they switch from wordpress to Drupal...
I love this extension! I use it in place of bookmarks. I then run a cron job to move .html files from my downloads folder to a bookmarks folder. Then I generate thumbnails and an HTML index to easily browse my bookmarks. I feel so much more relaxed knowing I can save, with one click, any good information I find on the web. Eventually I want to add NLP keyword extraction and categorization, and an internal search feature.
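A rough sketch of what the move-and-index step of such a cron job could look like (the paths are assumptions, and the thumbnail/NLP parts are left out):

```python
# Minimal sketch of the "move saved pages and rebuild an index" step.
# Paths are assumptions; thumbnail generation and keyword extraction
# are omitted.
import html
import shutil
from pathlib import Path

DOWNLOADS = Path.home() / "Downloads"
BOOKMARKS = Path.home() / "bookmarks"
BOOKMARKS.mkdir(exist_ok=True)

# Move SingleFile output (.html) out of the downloads folder.
for page in DOWNLOADS.glob("*.html"):
    shutil.move(str(page), BOOKMARKS / page.name)

# Regenerate a simple index page linking every saved file.
links = "\n".join(
    f'<li><a href="{p.name}">{html.escape(p.stem)}</a></li>'
    for p in sorted(BOOKMARKS.glob("*.html"))
    if p.name != "index.html"
)
(BOOKMARKS / "index.html").write_text(f"<ul>\n{links}\n</ul>\n")
```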
The problem here is quite real; just because one has access to a remote resource now doesn't mean that access will remain, either in the short or long term.
I'm not crazy about using a 3rd party service for it though.
I've half-considered wiring up something with OBS to just record my web browser all day, but, besides the intense storage, indexing and searching is more than painful.
Just want to plug HTTrack [1] here as well. Not nearly as slick, but it's worked extremely well for me when Webrecorder couldn't do the job. Being usable through the command line also makes it useful for some projects WR can't do.
I came here to say the same thing. HTTrack is incredibly handy, especially when it comes to old pages. My personal use case is archiving small sites to make sure they don't go dark, for example webcomics. Having a single folder with all the comic image files is great.
I am the author of https://www.pagedash.com, which is a service that simply saves the entire page statically, and also makes an attempt at a dynamic save (allowing all initial JS to run). No network interaction is captured. Try it if you want something simpler. We've been around since 2017. See previous Show HN https://news.ycombinator.com/item?id=15653206
> Webrecorder creates an interactive copy of any web page that you browse, including content revealed by your interactions such as (...) clicking buttons, and so forth.
How can they guarantee this if the code may run on the server?
It's probably a replay of what you did. I'm trying to think how the internals would be implemented in JavaScript, but it's probably a case of recording changes to the DOM structure ("After 3 seconds, the password input field had the value 'hunter2'. After 5 seconds, a DIV appears with the text 'Incorrect password'", etc.).
Instead, this captures network traffic (bytes) into a .warc file. Then you can replay it later (from the website), or you can download the .warc file to your computer and play it back offline using the desktop app (the Webrecorder Player) - see the bottom of the page or [1].
The downside is that the desktop app is slow (5 seconds to load a page on my laptop). I've created a fast player app that loads pages instantly, but I haven't open sourced it yet ;)
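If you just want to poke at what ended up in one of those .warc files, a small sketch using warcio (the Webrecorder project's Python library) works; the filename here is only an example.

```python
# Sketch: list what got captured in a .warc file downloaded from
# Webrecorder, using the warcio library (pip install warcio).
# The filename is just an example.
from warcio.archiveiterator import ArchiveIterator

with open("my-capture.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            status = record.http_headers.get_statuscode()
            print(status, url)
```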
I am also just interpreting the copy on the OP, but I don't think it's just a replay -- I imagine they record all HTTP requests and responses, so can echo back the recorded response for any given JS query to server, when it's triggered.
The page linked says:
> Webrecorder takes a new approach to web archiving by capturing ("recording") network traffic and processes within the browser while you interact with a web page. Unlike conventional crawler-based web archiving methods, this allows even intricate websites, such as those with embedded media, complex Javascript, user-specific content and interactions, and other dynamic elements, to be captured and faithfully restaged.
and:
> Webrecorder creates an interactive copy of any web page that you browse, including content revealed by your interactions such as playing video and audio, scrolling, clicking buttons, and so forth.
I think they record whatever HTTP transactions you trigger, so they can be played back when triggered in the replay.
>I think they record whatever HTTP transactions you trigger, so they can be played back when triggered in the replay.
That could be a useful history for circumventing forced usability reductions when a site 'upgrades' but doesn't change the request structure and/or doesn't deny access/EOL a previous API version but just hides the old interface after a makeover.
For example: I would love to have a recording of sessions from the previous version of ESPN's html fantasy football interface with a record of request to see if I could roll my own so to speak.
(ESPN ruined the usability of the site by rewriting it in some Javascript framework hoping to annoy users into app conversions last year.)
Is there a reliable screenshot version of this I can install on my own Linux system? Some sort of headless browser, I imagine. I realize this interactive system is much more powerful but it's overkill for an application I have in mind, a best effort single screenshot is fine. I've seen various attempts at doing this over the years and none have really worked reliably.
I'd consider paying a service to do this but it seems like self-installable is better.
A couple of options: you can build a small JavaScript package with an HTTP server to expose your endpoint and Puppeteer to run and interface with headless Chrome, then either Dockerize it or run it as a Lambda / Google Cloud Function. There's also https://github.com/browserless/chrome/ which is both a commercial service and open source. Plenty of other OSS and commercial options as well; ultimately it comes down to your use case. Personally I've found reliably taking a screenshot at the right point in time challenging, since a page is never really "done" loading - there could always be a refresh or timeout that loads another URL, adds an iframe, or changes something else entirely, so there is no single "yes, we're done loading now" event to wait for. Source: I run the service at https://urlscan.io where we have the same problem ;)
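As one concrete, self-hostable variant of that idea (using Playwright for Python instead of the Puppeteer route described above), a minimal sketch might look like this; note that "networkidle" is only a heuristic, since, as said, there is no definitive "done loading" event.

```python
# Sketch of a self-hosted full-page screenshot using Playwright for
# Python (an alternative to the Puppeteer setup described above).
# "networkidle" is a heuristic, not a guarantee the page is finished.
from playwright.sync_api import sync_playwright

def screenshot(url: str, out_path: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle", timeout=30_000)
        page.screenshot(path=out_path, full_page=True)
        browser.close()

screenshot("https://example.com", "example.png")
```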
I think it can be smart-ish about this (scrolling for you, but maybe I'm thinking about an extension). I've never found anything that handles virtualised lists/tables though, like screenshotting an entire Slack thread
I feel like the "Make website browsable offline" feature of yesteryear has been neglected. Somehow people assume everyone always has internet connectivity... but I need to save websites, docs, etc., for long flights or camping trips without signal.
Safari has a Reading List feature that claims to cache the web page, but annoyingly it doesn't about 50% of the time. As with all Apple cloud services, there is just no way to explicitly sync.
Since when is the Safari Reading List a cloud service? It might be that I just have most of iCloud disabled and that makes the system work as expected, but for me it is definitely 100% local (at which point I would presume failures are due to the mechanism it uses to save things not working on all kinds of web pages: I honestly only use it for simple document sites, and so wouldn't know if it fails a lot on rich web app sites).
I think perhaps they're annoyed that their Reading List isn't syncing between their devices and Apple hasn't put a button that forces the sync to occur.
Yes, it is advertised as being able to make pages available offline. But if you save something to the Reading List on your iPhone or your Mac, there's no guarantee it's available on the other device, and often it even stops working on the device I added it to the Reading List on.
Until Firefox broke its extension API there was an extremely useful extension called ScrapBook that you could point at a website and have it save that site and all linked sites (with settings for how far to recurse and what URL patterns to accept).
FTR: I still use Firefox, as for me the alternatives are worse. As someone else mentioned, Pocket exists, and while the communication around it could hardly have been worse, it is still way better than Chrome IMO.
Oddly enough Mozilla gets crap for it, but they bought Pocket, which allows for offline reading. I think it's fine for Mozilla to include Pocket. We seem to be fine installing Chrome, which has much worse spyware. You can uninstall Pocket; you have to go through more effort to uninstall Google's OS from Chrome.
Exactly the reason I use Opera on Android devices. It does not save movies or odd file data, but it saves HTML + images perfectly and the web pages look like the originals. I use it a lot to save pages for offline viewing during flights. Now Vivaldi is on Android too and I hope it has the same feature.
Technically interesting, but why would I use this over one of the many full page screenshot or "Print to PDF" browser extensions? That is what I use when I want to archive something.
Web archivists are coming at the problem with different requirements, related to fidelity and the systematic mirroring of address structure along with payload content and control information.
By the sound of it, you likely wouldn't want to substantially change your personal filing system to have your page captures conform with those requirements.
I tried a few, but could not get good results with any of them. With print to PDF, the output always looked terrible. Full-page screenshots took ages to scroll through the page, hijacking my viewport. Those captures should happen in the background.
Yes, it would be nice if the Chrome extension API allowed full-page screen captures to happen instantly. Currently all extensions need to scroll up/down.
Webrecorder is quite good, but the necessity of using a separate program to record and to replay made this less desirable for me. I ended up making an app that integrated the two (conceptually, no technical relationship), which is rather outdated now but available here, should anyone be interested: https://github.com/CGamesPlay/chronicler
There's Chrome's history, saved pages in Pocket, bookmarks, possibly separate history on my mobile, etc. But the main issue is how I get those results back, in a seamless manner, when I want them.
How do I get them when I do a regular search using Google? Retrieval is the main problem.
I would love just a simple one-click print-current-page-to-PDF button that managed to capture the whole page, and did something intelligent about infinite scroll sites.
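There isn't a one-click button in the thread, but as a sketch of the "do something intelligent about infinite scroll" part (using Playwright for Python and its Chromium-only page.pdf(); the scroll loop and its limits are assumptions, a crude way to coax lazy content into loading before printing):

```python
# Sketch: capture a whole page, including lazily loaded content, to PDF.
# Uses Playwright's Chromium-only page.pdf(); the scroll loop is a crude
# heuristic for infinite-scroll pages, stopping once the height stops growing.
from playwright.sync_api import sync_playwright

def save_pdf(url: str, out_path: str, max_scrolls: int = 20) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        last_height = 0
        for _ in range(max_scrolls):
            page.mouse.wheel(0, 10_000)           # scroll down
            page.wait_for_timeout(1_000)          # let new content load
            height = page.evaluate("document.body.scrollHeight")
            if height == last_height:             # page stopped growing
                break
            last_height = height
        page.pdf(path=out_path)
        browser.close()

save_pdf("https://example.com", "page.pdf")
```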
It's an HTTP archive (WARC) recorder + indexer, not a screen capture / mp4 video recorder program. You can use it with your web browser; you don't need an app. You can self-host the instance as well.
Well, at least two. For one, it might be desired to have a copy of the data yourself, rather than relying on some hosted service. Also, webrecorder would allow you to store authenticated pages.
Feedback on the video: it looks machine-recorded.
Suggestion: during playback, smooth out the mouse movement. The videos will have a more polished feel and sell your product better.