Is there a reason why you'd want to save archival material in a proprietary format? Wouldn't it be better/easier to use wget with the `--warc-file` flag?
For this project I wanted a consistent file format for my entire collection.
I have a bunch of stuff I want to save which is behind paywalls/logins/clickthroughs that are tricky for wget to reach. I know I can hand wget a cookies file, but that's mildly fiddly. I save those pages as Safari webarchive files, and then they can drop in alongside the files I've collected programmatically. Then I can deal with all my saved pages as a homogeneous set, rather than having them split across two formats.
Plus I couldn't find anybody who'd done this, and it was fun :D
This is only for personal stuff where I know I'll be using Safari/macOS for the foreseeable future. I don't envisage using this for anything professional, or a shared archive -- you're right that a less proprietary format would be better in those contexts. I think I'm in a bit of a niche here.
(I'm honestly surprised this is on the front page; I didn't think anybody else would be that interested.)
2/ Proprietary format: it is, but before I started I did some experiments to see what's actually inside. It's a binary plist, and I can recover all the underlying HTML/CSS/JS files with Python, so I'm not totally hosed if Safari goes away.
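For anyone curious, the recovery step is short. A minimal sketch using Python's plistlib; the key names (WebMainResource, WebSubresources, WebResourceData, ...) are the ones I found by inspecting my own archives:

    import plistlib

    # Read the binary plist that backs a .webarchive file.
    with open("example.webarchive", "rb") as f:
        archive = plistlib.load(f, fmt=plistlib.FMT_BINARY)

    # The main resource holds the raw bytes of the original HTML.
    main = archive["WebMainResource"]
    print(main["WebResourceURL"], main["WebResourceMIMEType"])
    html = main["WebResourceData"]

    # Subresources (images, CSS, JS) follow the same dict shape.
    for res in archive.get("WebSubresources", []):
        print(res["WebResourceURL"], len(res["WebResourceData"]), "bytes")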
> I didn't think anybody else would be that interested.
'Save the webpage as I see it in my browser' remains a surprisingly annoying and fiddly problem, especially programmatically, so the niche is probably a little roomier than you might initially suspect.
I use it all the time to archive webpages, and I imagine it wouldn't be hard to throw together a script that uses Firefox's headless mode in combination with SingleFile to self-host a clone of the Wayback Machine.
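A rough sketch of what I mean, assuming the single-file CLI (the companion project to the extension) is on your PATH and accepts a URL plus an output path -- check its docs before trusting any of this:

    import pathlib
    import subprocess
    import time

    ARCHIVE_ROOT = pathlib.Path("archive")

    def snapshot(url: str) -> pathlib.Path:
        # One timestamped directory per capture, wayback-style.
        stamp = time.strftime("%Y%m%d%H%M%S")
        out = ARCHIVE_ROOT / stamp / "page.html"
        out.parent.mkdir(parents=True, exist_ok=True)
        # single-file drives a headless browser under the hood.
        subprocess.run(["single-file", url, str(out)], check=True)
        return out

    snapshot("https://example.com")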
Thanks, I've seen it; last time I tried it, it missed background images. But my point is that this is something browsers should support better, and kind of sort of do now, but even with that it's a hassle.
Thank all the JS/SPA developers who insist on putting JS all over the place. Wouldn't it be better to have everything in one .html file, with <script> and <style> just inlined? Then it's also just one file over the network. There must be a bundler that does that, no?
It seems JS developers just want their code to be as obfuscated and unarchivable as possible unless it's served via their web server.
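There are tools in that direction (SingleFile works on the rendered page; there are also inliner-type packages), and a toy version for a locally built site is only a few lines. A sketch assuming relative asset paths next to the HTML file:

    import pathlib
    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    def inline_assets(html_path: str) -> str:
        root = pathlib.Path(html_path).parent
        soup = BeautifulSoup(pathlib.Path(html_path).read_text(), "html.parser")

        # Fold external stylesheets into <style> tags.
        for link in soup.find_all("link", rel="stylesheet"):
            style = soup.new_tag("style")
            style.string = (root / link["href"]).read_text()
            link.replace_with(style)

        # Fold external scripts into inline <script> tags.
        for script in soup.find_all("script", src=True):
            inlined = soup.new_tag("script")
            inlined.string = (root / script["src"]).read_text()
            script.replace_with(inlined)

        return str(soup)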
These SPA bundles are on the order of megabytes, not kilobytes. You want your users, for their own sake and yours, to be able to cache as much as possible instead of delivering a unique megablob payload for every page they hit.
Good point on the cache. But things like putting a background image in CSS, so the user can't right-click to download the image, are just stupid. Why is CSS all of a sudden in control of image display? It just makes archiving pages harder.
> 'Save the webpage as I see it in my browser' remains a surprisingly annoying and fiddly problem
Is it really? I remember hacking around with JavaScript's XMLSerializer (I think) about 5 years ago, and it solved that for ~90% of the websites I tried to archive. It'd save the DOM as-is at the moment it was executed.
Internet Archive/ArchiveTeam also worked on that particular problem for a very long time, and are mostly successful as far as I can tell.
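For reference, that trick driven from outside the browser -- a sketch with Selenium and headless Firefox; the in-page version is just the one-liner inside execute_script:

    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
    try:
        driver.get("https://example.com")
        # Serialize the live DOM, after scripts have run.
        html = driver.execute_script(
            "return new XMLSerializer().serializeToString(document);"
        )
        with open("snapshot.html", "w", encoding="utf-8") as f:
            f.write(html)
    finally:
        driver.quit()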
90% feels like an overestimate to me, but even that would be quite poor; you wouldn't accept that rate for saving most other things. Another problem is highlighted in the piece: it's a hassle to ensure external tools handle session state and credentials. Dynamic content is poorly handled, and the default behaviours are miserable (a browser will run random JavaScript from the network, but not JavaScript you've saved, etc.).
There's a lot of interest in 'digital preservation', and perhaps one sign that it's very much early days for the field is that it's tricky to 'just save' the result of one of the most basic current computer interactions: looking at a web page.
But if you serialize the DOM as-is, you literally get what you see on the page when you archive it. Nothing about it is dynamic, and there are no sessions or credentials to handle. Granted, it's a static copy of a specific single page.
If you need more than that, then WARC is probably the best option. For my measly needs of just preserving exactly what I see, serializing the DOM and saving the result seems to do just fine.
Yes, you save something that's mildly better than print-page-to-PDF. But it still misses things, and the interactive stuff is very much part of 'exactly what I see'. Take a random article with an interactive graph, for instance, like this recent HN hit: https://ciechanow.ski/airfoil/
It's not that there aren't workarounds; it's that they're clunky, and 'you can't actually save the most common computery entity you deal with' is just a strange state of affairs we've somehow Stockholmed ourselves into.
> Internet Archive/ArchiveTeam also worked on that particular problem for a very long time, and are mostly successful as far as I can tell.
One category the archivers do poorly with is news articles where a pop-up renders on page load and then requires client-side JS execution to dismiss.
Sometimes it's easily circumvented by manual DOM manipulation, but that's hardly a bulletproof solution. And it feels automatable.
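One automatable version of that surgery, reusing the Selenium setup from the sketch above -- the selector list is pure guesswork and would need tuning per site:

    # Assumes `driver` is a Selenium WebDriver with the page loaded.
    KILL_OVERLAYS = """
    document.querySelectorAll(
      '[class*="modal"], [class*="overlay"], [id*="paywall"], [class*="consent"]'
    ).forEach(el => el.remove());
    document.body.style.overflow = 'auto';  // undo any scroll-locking
    """
    driver.execute_script(KILL_OVERLAYS)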
> What they really need is browser support, or at least an extension so a browser can open the files directly
That's probably the wrong thing. What browsers really need is a thin but standardized API that lets any third-party app the user has installed on their machine supply the content for various fetches/reads.
You'd open the WARC in Firefox or Safari or whatever, but Safari et al wouldn't have any special understanding of the format. It would know that your app does WARCs, though, and then knock on the door and say, "Please tell me the content I should be showing here; I'll defer to you for any further "requests" associated with the file/page loaded in this tab—just tell me the content I should use for those, too."
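Purely hypothetical, but the app side of such an API might look something like this for WARCs -- the warcio calls are real, the lookup() protocol is made up for illustration:

    from warcio.archiveiterator import ArchiveIterator  # pip install warcio

    class WarcProvider:
        """The handler a browser would defer to for a tab showing a WARC."""

        def __init__(self, path: str):
            self.records = {}
            with open(path, "rb") as stream:
                for record in ArchiveIterator(stream):
                    if record.rec_type == "response":
                        url = record.rec_headers.get_header("WARC-Target-URI")
                        mime = record.http_headers.get_header("Content-Type")
                        self.records[url] = (mime, record.content_stream().read())

        def lookup(self, url: str):
            # Browser: "please tell me the content I should show here" --
            # called for the initial load and every subresource "request".
            return self.records.get(url)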
One of the main use cases for an archived web page would be to share archives, and in that case I think you'd want them to be double-clickable with little fuss.
> Although Safari is only maintained by Apple, the Safari webarchive format can be read by non-Apple tools – it’s a binary property list that stores the raw bytes of the original files. I’m comfortable that I’ll be able to open these archives for a while, even if Safari unexpectedly goes away.
> Once I’d written the initial version of this script and put all the pieces together, I used it to create webarchives for 6000 or so bookmarks in my Pinboard account. It worked pretty well, and captured 85% of my bookmarks – the remaining 15% are broken due to link rot. I did a spot check of a few dozen archives that did get saved, and they all look good.
I was a tad confused by this part.
Did you (or how did you) verify that the headlessly saved web archives for thousands of bookmarks visually match the pages shown in the browser?
This is the biggest problem I've had with command-line archival tools: they save some version of the page, but it often differs substantially from what I actually see in my browser -- things like pop-up artifacts covering the page, or news articles full of ads that are otherwise blocked in my headed browser.
The SingleFile extension for Chrome works more completely and accurately than anything else I've come across so far, but it does still break weirdly sometimes too.
I would love to find a programmatic way to automate the visual verification, e.g., archiving a page with multiple different tools and visually diffing the rendered pages across tools with small margins of error. Maybe someone else has worked on this already.
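A sketch of the check I have in mind, assuming you can screenshot both the live page and the archived copy (e.g. with a headless browser); Pillow does the pixel diff:

    from PIL import Image, ImageChops  # pip install Pillow

    def diff_ratio(path_a: str, path_b: str) -> float:
        a = Image.open(path_a).convert("RGB")
        b = Image.open(path_b).convert("RGB")
        if a.size != b.size:
            b = b.resize(a.size)
        diff = ImageChops.difference(a, b)
        # Weighted pixel difference: 0.0 means identical images.
        total = sum((i % 256) * n for i, n in enumerate(diff.histogram()))
        return total / (255.0 * a.size[0] * a.size[1] * 3)

    if diff_ratio("live.png", "archived.png") > 0.02:  # threshold is a guess
        print("archive probably doesn't match the live page")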
WebScrapbook is also worth a look. I find that I like it slightly better than SingleFile for creating copies that are not packaged as single files. This lets me hard link identical asset files to save space.
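The hard-linking step is also easy to script after the fact -- a sketch that replaces byte-identical files under a directory with hard links to one canonical copy:

    import hashlib
    import os
    import pathlib

    def dedupe(root: str) -> None:
        seen: dict[str, pathlib.Path] = {}
        for path in pathlib.Path(root).rglob("*"):
            if not path.is_file():
                continue
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            if digest in seen:
                path.unlink()
                os.link(seen[digest], path)  # same inode, stored once
            else:
                seen[digest] = path

    dedupe("webscrapbook-data")  # hypothetical archive directory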
Hey hey, WebArchive. Haven't thought about that format in a long while. 11-ish years ago I was tasked with porting an existing .NET application to PHP, feature by feature. The application had the strange ability to export reports to WebArchive.
I could find no pre-existing WebArchive package for PHP so I set forth building my own.
sidenote: be careful when opening a webarchive from a third party, if that improbable opportunity ever materializes:
> In February 2013, a vulnerability with the webarchive format was discovered and reported by Joe Vennix, a Metasploit Project developer. The exploit allows an attacker to send a crafted webarchive to a user containing code to access cookies, local files, and other data. Apple's response to the report was that it will not fix the bug, most likely because it requires action on the users' part in opening the file.
I initially glossed over this believing it might be something trivial, but it really is a deeper XSS concern.
It feels weird to me to dismiss as wontfix a security issue that gives an archived page far greater access to browser data than it has when loaded from its original URL.
> Last updated at Tue, 16 Jan 2024 16:26:37 GMT
> tldr: For now, don't open .webarchive files, and check the Metasploit module, Apple Safari .webarchive File Format UXSS
> Safari's webarchive format saves all the resources in a web page - images, scripts, stylesheets - into a single file. A flaw exists in the security model behind webarchives that allows us to execute script in the context of any domain (a Universal Cross-site Scripting bug). In order to exploit this vulnerability, an attacker must somehow deliver the webarchive file to the victim and have the victim manually open it ^1 (e.g. through email or a forced download), after ignoring a potential "this content was downloaded from a webpage" warning message ^2.
Just look at the number of (relatively trivial) attack vectors identified by the author in this post:
> Attack Vector #1: Steal the user's cookies.
> Attack Vector #2: Steal CSRF tokens.
> Attack Vector #3: Steal local files.
> Attack Vector #4: Steal saved form passwords.
> Attack Vector #5: Store poisoned javascript in the user's cache.
This category of issue is present with many types of web archives. Sanitizing archives during capture or retrieval while maintaining fidelity is a really hard problem.
Mostly using Zotero for saving snapshots of webpages. Saving a webpage is getting more and more difficult as the web approaches the functionality of desktop applications.
I’m using AnyBox[1] for my personal archive. It can automatically create a PDF or WebArchive and also stores the output of reader mode for future reference. And it syncs via iCloud.
I was searching for something for web page preservation and also considered Safari webarchives, but decided they were a "no go" for me because of the proprietary format, which is basically vendor lock-in. Thus I ended up with a Chrome extension named SingleFile, which does a pretty decent job of saving the whole page (or part of it) as a single self-sufficient HTML file viewable in any browser. HTML files are also easily indexed by Spotlight or other search engines. The extension has no command line, though personally I don't need that.
The author considered the proprietary nature of the webarchive format, and determined that it was readable without Apple software, and that it wouldn't be too difficult to create a tool to view or transform webarchive files if Safari were to disappear: https://alexwlchan.net/til/2024/whats-inside-safari-webarchi...
Of course, without a working implementation, there could be hidden obstacles.
I'd love to have a viewer for this, or a converter to a standard single-page HTML archive that works in other browsers too. Is there some reason for Apple's proprietary format to exist over self-contained HTML? I have a bunch of Apple webarchives too (stored from iOS to Notion) and am worried there is no durable solution for opening these beyond "code it yourself".
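In the meantime, a crude converter is not much code. A sketch that inlines every subresource as a data: URI wherever its URL appears in the HTML -- naive string replacement, no CSS url() handling, but it shows the idea:

    import base64
    import plistlib

    def webarchive_to_html(path: str) -> str:
        with open(path, "rb") as f:
            archive = plistlib.load(f, fmt=plistlib.FMT_BINARY)

        main = archive["WebMainResource"]
        html = main["WebResourceData"].decode(
            main.get("WebResourceTextEncodingName") or "utf-8", errors="replace"
        )
        # Swap each subresource URL for an inline data: URI.
        for res in archive.get("WebSubresources", []):
            data_uri = "data:%s;base64,%s" % (
                res["WebResourceMIMEType"],
                base64.b64encode(res["WebResourceData"]).decode("ascii"),
            )
            html = html.replace(res["WebResourceURL"], data_uri)
        return html

    print(webarchive_to_html("saved.webarchive"))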
Tools like ArchiveBox and SingleFile are in the same ballpark, but SingleFile at least seems to eschew the Safari webarchive format. ArchiveBox may support Safari webarchives, but for some reason its docs omit it.
> I’d love to have a viewer for this or converter to a standard html single page archive that works in other browsers too
About 10 years ago I searched and found a webarchive-to-MHTML converter on someone's small site. I recall there was one caveat: something like it didn't include the date in the metadata of the output MHTML.
Sorry, I don't have the link on hand; it'd be on the HDD I pulled from my old MacBook.
Edit 2: pretty sure this is the original converter, since I noted a caveat that it outputs the timestamp of the conversion, not the time the file was originally saved (the script above handles this correctly, on the other hand): https://langui.net/webarchive-to-mht/
Unfortunately I don't think browsers handle opening MHTML well out of the box.
At least last time I checked, it only worked for local paths (file://) and only in some browsers. Otherwise it would either try to download the file or just show plain text.
I ended up using Chrome to dump to MHTML [1] and then reshuffling the content into individual files, rewriting the path references and fixing MIME types [2]. That gives a reasonably faithful static capture of a page that can be shared as a link.
The main issue with this is that you lose text reflowing, so it's more annoying to access on mobile. You also lose interactivity; I've seen links and menus implemented with JavaScript break.
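MHTML is just MIME (multipart/related), so the reshuffling step can lean on Python's email module -- a rough sketch, ignoring name collisions and reference rewriting:

    import email
    import email.policy
    import pathlib

    def unpack_mhtml(path: str, outdir: str) -> None:
        with open(path, "rb") as f:
            msg = email.message_from_binary_file(f, policy=email.policy.default)
        out = pathlib.Path(outdir)
        out.mkdir(exist_ok=True)
        for i, part in enumerate(msg.walk()):
            if part.is_multipart():
                continue
            # Each part carries its original URL in Content-Location.
            location = part.get("Content-Location", f"part-{i}")
            name = location.rstrip("/").split("/")[-1] or f"part-{i}"
            (out / name).write_bytes(part.get_payload(decode=True) or b"")

    unpack_mhtml("page.mhtml", "page-files")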
I save pages into PDF files. Low tech, but works since 2001.
I print with zero page margins, so in the viewer it looks like one continuous page. I found Firefox produced the smallest PDFs; Chrome embeds fonts and other stuff. I also use uBlock rules to hide some elements.
Pretty useful for archiving discussions on Reddit.
> Pretty useful for archiving discussions on Reddit.
What I do now is save the comments within the discussions (on HN, Reddit, Twitter, etc.) as text, which is indexable and searchable, with additional metadata that helps with filtering (author is the main one I use), while automatically archiving the URLs associated with them in full.[1]
For me, this is the best of both worlds - quick access via fault-tolerant search and filtering to the most interesting stuff while having a snapshot archive for the full context.
[1]: https://notado.app - I've been working on this for a few years now and have posted a lot in my HN comment history and technical blogs about how I have iterated on and evolved this workflow to the point where it is now
I get you, but I still find it sad there's so little trust left in the web stack that even a PDF is preferable. Technically, a PDF can contain anything (bitmaps, text/glyphs without semantic ordering, even JavaScript).
I often save things as PDF. The 'export as PDF' option in Safari creates a long PDF that I find much better for reference on screen than 'printing' to PDF.
But the big flaw in this, especially for saving programming-related pages, is that it loses the parts of scrollable content that aren't currently in view, e.g. the ends of lines in a code block.
> Pretty useful for archiving discussions on Reddit.
I use SingleFile for that; it saves pages as a single self-contained HTML file. That way you can still interact with collapse-comment buttons and outlinks.