
Show HN: CLI tool for saving web pages as a single file - flatroze
https://github.com/Y2Z/monolith
======
FreeHugs
One thing I always wonder when I see native software posted here:

How do you guys handle the security aspect of executing stuff like this on
your machines?

Skimming the repo it has about a thousand lines of code and a bunch of
dependencies with hundreds of sub-dependencies. Do you read all that code and
evaluate the reputation of all dependencies?

Do you execute it in a sandboxed environment?

Do you just hope for the best like in the good old times of the C64?

~~~
gambler
This is a good question. I think you can make it even better by generalizing
the problem. How on earth do developers hope to advance general computing
when simply _running programs_ isn't a solved problem? Most _software
engineers_ I know don't run Docker on their home PCs. What about people who
aren't in IT? Does anyone here even care? The general attitude I see is "plebs
don't need to run anything they can't get from an app store". It's a horrible
attitude.

I very much like this quote from Alan Kay:

 _" It doesn't matter what the computer can do, if it can't be learned by
billions of people."_

There is no good technical reason why modern operating systems can't work out
some scheme for sandboxing arbitrary programs by default. It is obviously
necessary. I imagine something like an "applications" folder where every
subfolder automatically becomes an isolated "container". It would have to be
designed with security as the primary concern, though, unlike current
container solutions.
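
Existing tools can already approximate this by hand today. Purely as an
illustration (paths and policy are made up), something like bubblewrap (bwrap)
can give a downloaded program a read-only view of the OS, hide the real home
directory, and expose only its own sandbox folder as writable:

    
    
      bwrap --ro-bind / / \
            --tmpfs "$HOME" \
            --bind "$HOME/apps/sandboxName" "$HOME/apps/sandboxName" \
            --unshare-pid --proc /proc \
            "$HOME/apps/sandboxName/downloaded-tool"
    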

~~~
derefr
But arbitrary programs are... arbitrary. Especially ones run by software
engineers, and _especially_ ones run by software engineers as part of a POSIX-
alike “utility bag” ecosystem.

Who’s to say that the user’s _intent_ by running the program they just
downloaded, isn’t to—say—overwrite a system folder? (Oh, wait, that’s exactly
what Homebrew does, with the user’s full intent behind it!)

There are tons of attempts to do what you’re talking about. Canonical’s
“snaps” are a good example. As well, every OS sandboxes _legacy_ apps by
default (because they’re already virtualizing them, and sandboxing something
in a virtualization layer is easy.)

But none of those solutions really work for the “neat FOSS hack script someone
wrote” workflow we’re talking about here, where you build programs from source
and run them for their intentional side-effects _on your system_.

You might suggest that there could be a _shared_ sandbox for all the POSIX-
like utilities to interoperate in. But what if you’re attempting to use those
utilities against your real documents? (For example, a bulk metadata auto-
tagging and auto-renaming utility, to get TV episodes from torrents loaded
into Plex correctly.) How do you draw the line of what such a program can
operate on? AFAICT, you just... can’t. Its whole purpose is to silently
automate some task. If it requires constant security prompting, the task isn’t
automated.

~~~
gambler
_> But what if you’re attempting to use those utilities against your real
documents?_

You copy or move documents inside the specific sandbox.

If you want a pipeline, you establish a chain of inbox/outbox folders.

Obviously, most of this should be done by the OS, not the user.

The workflow:

\- You click "download" in your browser.

\- When it's done and you click on your download, the OS asks how you want to
open it. Instead of an "execute" option you get a "run in a sandbox" option.

\- You type in the name of the sandbox, and the app gets copied to
/apps/sandboxName or something of that sort.

\- The system automatically creates /apps/sandboxName/inbox and
/apps/sandboxName/outbox.

\- To process a file in some way, you drop it into inbox dir.

For the command line, the only change would be switching from "executable
pulls arbitrary files" to "I push a specific file to the executable".

    
    
      zip -r squash.zip dir1

becomes

    
    
      | -r squash.zip | zip dir1 |
    

Start pipeline. Push squash.zip as an argument to zip, get the output. Zip
would be the container name.

~~~
derefr
Let me put it another way: how would you implement a dotfile management
framework (like any of these:
[https://dotfiles.github.io](https://dotfiles.github.io))? Programmers seem to
really like them, judging by how many of them there are. But the whole point
of them is to forcefully usurp the assets of literally every other program on
the system. They're user-level rootkits, in a sense.

Or, for a simpler, more obvious example: find(1), grep(1), etc. A set of
utilities that can all be asked the equivalent of "read literally every file
the VFS has access to and tell me whether they match an arbitrary-code-
execution predicate." Do you want to literally copy your entire hard disk into
the 'inbox' of these utilities, in order to get them to search it for you?
(And before you say "well, we can trust the base utilities that ship with the
OS to do more than arbitrary third-party utilities"—there's a whole
competition of grep(1) replacements, e.g. ag(1), rg(1), etc. Do you want to
make it impossible for people to innovate in this space?)

Or how about Nix, or GNU Stow, or, uh, _Git_? These utilities become useless
if they have their own sandbox. Does your git worktree live in Git's sandbox?
Vi's sandbox? The inability to make this distinction functional is why mobile
OSes only have fully-integrated IDEs!

Or how about shells themselves! (Or, equivalently, any scripting runtime, e.g.
Ruby, Python, etc.) Should people not be allowed to install these from third
parties?

Or, the most based example of all: make(1) [and its spiritual descendants],
and the GNU autotools built atop it. How does ./configure work if you can't
detect true properties of the target system, only of the sandbox you're in?

~~~
gambler
_> "Do you want to literally copy your entire hard disk into the 'inbox' of
these utilities, in order to get them to search it for you?"_

Well, let's think about the goal here. grep reads files and outputs lines from
those files. It needs full read access to everything you want to search. It
does not need write access outside of its sandbox. It does not need direct
access to network sockets, audio stuff and so on.

Is it unreasonable to create a readonly "view" of the filesystem inside grep's
folder? Is it unreasonable to have "files" representing network access, the
microphone, audio? It would have a visual representation in the file manager
without the need to create a custom UI. It could be manipulated by
drag-and-drop OR the command line. More importantly: it's easy for users to
understand. "This app lives in a box. You can put things in that box for the
app to use."

 _> Does your git worktree live in Git's sandbox?_

Yes? I mean, I currently have a folder called projects. All my git stuff is in
there anyway.

 _> Vi's sandbox?_

If you want multiple sandboxes to be able to operate on a directory, you
create "views" for that directory (readonly or read/write) in multiple
sandboxes. This shouldn't be some sort of mind-bending idea, considering Unix
has symlinks, hardlinks, and mounted filesystems of all sorts.
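
A readonly "view" like that can already be built today with a plain bind
mount; an illustrative sketch (directory names are made up):

    
    
      sudo mkdir -p /apps/grep/view
      sudo mount --bind ~/Documents /apps/grep/view
      sudo mount -o remount,ro,bind /apps/grep/view
    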

 _> Or how about shells themselves! (Or, equivalently, any scripting runtime,
e.g. Ruby, Python, etc.) Should people not be allowed to install these from
third parties?_

There is no reason why a Ruby executable should have unlimited access to the
entire file system. Especially if you're only using it for a specific purpose,
like serving a website.

What I'm describing here isn't some novel, mind-blowing idea. It's simply
dependency injection with a file-based user interface. Every single part of
this has been done in various operating systems or programming environments
more than once. It's just a matter of combining it all in a sensible way.

~~~
JetSpiegel
> It does not need write access outside of its sandbox.

You have SELinux for that, if you like bureaucracy and filing triplicate forms
to be able to run scripts with side effects.

------
mikaelmorvan
The main problem with your code is that you only handle simple Web 1.0 sites.

What about JavaScript execution? If you replay your capture, you have no idea
what you will see on a typical Web 2.0 website.

The only way I know to capture a web page properly is to "execute" it on a
browser.

Gildas, the guy behind SingleFile ([https://github.com/gildas-
lormeau/SingleFile](https://github.com/gildas-lormeau/SingleFile)) is well
aware of that and his approach really works every time.

Try on a Facebook post, a Tweet, ... It just works.

~~~
lucideer
The capture includes JS, so this should work for most JS-dependent sites, with
the exception of scripts loading other additional assets.

Tbh, often those are superfluous, or egregious examples of bad web dev, so it
seems a reasonable solution for most cases.

SingleFile is a different approach, but it's a lot more involved/less
convenient than a cli, and loading in something like WebDriver on the cli for
this would be overkill, unless you're doing very serious archival work.

~~~
mikaelmorvan
Superfluous, or egregious examples of bad web dev?? Do you know what Web 2.0
is? Do you know what React, Angular, and the other JS frameworks are?

When you create a modern webapp, a lot of data is retrieved from servers as
JSON and formatted in the browser with JavaScript. Sometimes even CSS is
generated on the browser side. What's more, on webapps where user login is
taken into account, the display is modified accordingly.

That's the web of 2019. The approach consisting of getting remote files and
launching them in a browser is really naive.

Speaking of SingleFile, it has a CLI version and can handle full Web 2.0
webapps without any problem. And of course, Web 1.0 pages work as well.

~~~
tinsx
I think that's exactly what that person means by superfluous and egregious
examples of bad web development; SPAs, javascript frameworks of that nature.
:p

~~~
mikaelmorvan
Yes, the debate between building a SPA with rich features or Web old pages
with good SEO is eternal :) We see more and more an hybrid approach that can
be called web 1.5 :)

------
Springtime
MHTML is pretty good for this already btw (not to take away from this neat
project though :)). Similarly stores assets as base64'd data URIs and saves it
as a single file. Can be enabled in Blink-based browsers using a settings flag
and previously in Firefox using addons (also in the past natively in Opera and
IE).

~~~
flatroze
Apparently everybody knew about MHTML but me Ü I'm going to look into that
format and see if I could enhance monolith to output proper MHTML, among other
additions and improvements. Thank you for the info!

~~~
masklinn
I don't know that it would be a very useful thing to do, at least in the short
term: there's a bunch of "web archive" formats out there, and the common
thread between them is that they're custom archive formats; you need special
clients or support for those formats:

* mhtml encodes the page as a multipart MIME message (using multipart/related), essentially an email (you're usually able to open them by replacing the .mht extension with .eml)

* WARC is its own thing with its own spec

* MAFF is a zipfile, not sure about the specifics

* webarchive is a binary plist, not sure about the specifics either

Your tool generates straight HTML which any browser should be able to open. It
probably has more limitations, but it doesn't require dedicated client /
viewer support.

Maybe once you've got all the fetching and extracting and linking nailed down
it would be a nice extension to add "output filters", but that seems more like
a secondary long-term goal, especially as those archive formats are usually
semi-proprietary and get dropped as fast as they get created (WARC might be
the most long-lived as it descends from the Internet Archive's ARC, is an ISO
standard and is recognised as a proper archival format by various national
libraries).

~~~
mftrhu
There isn't much to MAFF. Each MAFF file can contain more than one saved page.
Each page needs to be contained within its own folder (whose name is usually
the timestamp of when the page was saved, but it doesn't matter AFAICT). There
can be an `index.rdf` file in there, to specify metadata and which file to
open, but otherwise you should look for an `index.SOMETHING` file - usually
`index.html`.

E.g.

    
    
      test.maff
      `--  1566561512/
           |--  index.rdf
           |--  index.html
           `--  index_files/
                `--  ???
    

When I was messing around with archiving things locally I settled on MAFF,
because it's pretty much trivial to create and to use. Even if your browser
does not support it, you just need to unpack it to a tempdir and open the
index file.
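
For instance, since a MAFF file is just a zip archive (assuming a single saved
page, as in the example above):

    
    
      unzip test.maff -d /tmp/maff-tmp
      xdg-open /tmp/maff-tmp/*/index.html
    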

------
alpb
I think it would be way better to explain in the repository:

\- how do you handle images?

\- does it handle embedded videos?

\- does it handle JS? to what extent?

\- does it handle lazily loaded assets (e.g. images that load only when you
scroll down, or JS that loads 3 seconds after the page is loaded)?

In general, how does this work? The current readme doesn't do a decent job of
explaining what exactly the tool is. As far as I can tell, it probably just
takes a screenshot of the page, encodes it as base64 into the HTML, and shows
that.

~~~
quickthrower2
It can’t handle JS completely because we can’t predict a program’s behaviour
using static analysis. See the Halting Problem, for example.

~~~
kuzehanka
I saw a tool that handles JS to a limited extent by capturing and replaying
network requests to accommodate said JS. It records your session while you
interact with a site, and is then able to replay everything it captured.

This tool was able to capture three.js applications and other interactive
sites quite well.

~~~
bhl
Was it webrecorder [1]? I found this project a couple weeks back while looking
for web archiving tools.

[1] [https://webrecorder.io/](https://webrecorder.io/)

~~~
kuzehanka
Yep, that's the one! Thanks for reminding me of the name.

------
mrieck
If you only want a portion of a webpage I made a tool called SnipCSS for that:

[https://www.snipcss.com](https://www.snipcss.com)

The desktop version saves an HTML file, stylesheet and images/fonts locally,
and it only contains the HTML of the snippet with the CSS rules that apply to
the DOM subtree of the element you select.

I'm still working out bugs but it would be great if people try it out and let
me know how it goes.

~~~
sansnomme
There have been quite a few extensions in this space:

[https://stackoverflow.com/questions/10266334/add-on-to-
copy-...](https://stackoverflow.com/questions/10266334/add-on-to-copy-a-page-
element-with-styles)

[https://github.com/Dalimil/Web-Design-Pirate](https://github.com/Dalimil/Web-
Design-Pirate)

~~~
mrieck
I tried SnappySnippet before when looking into the idea - it didn't work well
for me and crashed often. I never saw DesignPirate, but just now I tried it
and it didn't output any CSS. I'm not sure, but it doesn't look like either of
these uses the chrome.debugger API to call DevTools API methods (you get a
warning in Chrome if you use that).

I'm hoping my tool will be better, good enough that people would be willing to
pay for it, but we'll just have to see.

------
jordwalke
I really like this concept, and I've been using an npm package called inliner
which does this too:
[https://www.npmjs.com/package/inliner](https://www.npmjs.com/package/inliner)

I'm glad there are more people taking a look at the use case, and I'd be
interested to see a list of similar solutions.

If you combine this with Chrome's headless mode, you can prerender many pages
that use JavaScript to perform the initial render, and then once you're done
send it to one of these tools that inlines all the resources as data URLs.

    
    
      /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome ./site/index.md.html --headless --dump-dom --virtual-time-budget=400
    

The result is that you get pages that load very fast and are a single HTML
file with all resources embedded. Allowing the page to prerender before
inlining will also allow you to more easily strip all the JavaScript in many
cases for pages that aren't highly interactive once rendered.
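
Putting the two together might look something like this (assuming the inliner
CLI from the npm package above accepts a local file as input):

    
    
      chrome="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
      "$chrome" --headless --dump-dom --virtual-time-budget=400 https://example.com > prerendered.html
      inliner prerendered.html > single.html
    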

~~~
flatroze
Thanks, I'll look into making it work with pipes or some other way to interact
with headless browsers.

~~~
ahub
Reach out when you do so, I've a similar use case here !

------
mehrdadn
This is awesome. One question though: how does it handle the same resource
(e.g. an image) appearing multiple times? Does it store multiple copies,
potentially blowing up the file size? If not, how does it link to them in a
single HTML file? Or if so, is there any way to get around it without using
MHTML (or have you considered using MHTML in that case)?

Also, side-question about Rust: how do I get rid of absolute file paths in the
executable to avoid information leakage? I feel like I partially figured this
out at some point, but I forget.

~~~
flatroze
Thank you! It's pretty straightforward: this program just retrieves assets and
converts them into data URLs (data:...), then replaces the original href/src
attribute value, so in the case of the same image being linked multiple times,
monolith will for sure bloat the output with the same base64 data, correct. I
haven't looked into MHTML, ashamed to admit it's the first time I'm hearing
about that format. I need to do some research, maybe I could improve monolith
to overcome issues related to file size, thank you for the tip!
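
For example, a rough shell illustration of the same idea (not how monolith
actually does it internally; assumes GNU coreutils and sed):

    
    
      url="data:image/png;base64,$(base64 -w0 logo.png)"
      # every occurrence of src="logo.png" gets its own inline copy of the data
      sed -i "s|src=\"logo.png\"|src=\"$url\"|g" page.html
    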

And about Rust: I think you're way ahead of me here as well, this is my first
Rust program. If you're talking about it embedding some debug info into the
binary which may include things like /home/mehrdadn then perhaps there's a
compiler option for cargo or a way to strip the binary after it's compiled.
¯\\_(ツ)_/¯ Sorry, that's the best I can tell at the moment.
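
Something along these lines might work, though I haven't verified it for
monolith: rustc has a --remap-path-prefix flag, and the binary can be stripped
after the build.

    
    
      RUSTFLAGS="--remap-path-prefix=$HOME=/build" cargo build --release
      strip target/release/monolith
    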

~~~
dspillett
One way to dedupe inline image resources, while still using HTML rather than
MHTML, could be to encode them in CSS once and transform the image element to
something with that class.

~~~
mehrdadn
That'd easily break Javascript though.

~~~
dspillett
Good point. I was thinking in the direction of something I'm tinkering with in
a similar area. There, getting a static snapshot of the current DOM or a
fragment is key (meaning scripts being stripped out is an intentional
feature). Tweaking the document contents for efficiency could significantly
impact a lot of the script work that may be present.

------
fit2rule
I've been printing to PDF for decades now, and nothing comes close to the ease
of use and versatility of 2 decades' worth of interesting web pages .. I have
pretty much every interesting article, including many from HN, from decades of
this habit.

Need to find all articles relating to 'widget'?

    
    
        $ ls -l ~/PDFArchive/ | grep -i widget
    

This has proven so valuable, time and again .. there is a great joy in not
having to maintain bookmarks, and in being able to copy the whole directory
content to other machines for processing/reference .. and then there's the
whole pdf->text situation, which has its thorns truly (some website content is
buried in masses of ad-noise), but also has huge advantage - there's a lot of
data to be mined from 50,000 PDF files ..
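
For searching inside the PDFs themselves rather than just the filenames,
something like this also works (assuming poppler's pdftotext is installed):

    
    
        for f in ~/PDFArchive/*.pdf; do
          pdftotext "$f" - | grep -qi widget && echo "$f"
        done
    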

Therefore, I'd quite like to know: what does monolith have to offer over this
method? I can imagine that it's useful to have all the scripting content
packaged up and bundled into a single .html file - but does it still work/run?
(This can be either a pro or a con in my opinion..)

~~~
mxuribe
It's funny: on rare occasions, I too have saved some content as a PDF - more
for archiving than for offline viewing... But I guess I never thought to scale
it to all/most of my bookmarks. It seems so obvious now after reading your
comment. However, my experience with PDFs has been negative. From file size to
the slow start-up of myriad PDF viewers, etc., it just seems like viewing
stuff as native HTML/text is better - at least in my experience. Further, my
preferred browser - Firefox - leaves much to be desired in this arena of
generating proper PDFs, and I end up switching to Chrome (bleh!) just to "PDF
something" I saw/read online. Then again, this function in Firefox is not
something I use that often, hence why I stick with FF and haven't gone back to
Chrome. However, going back to your approach... I wonder if I can use a tool -
like this monolith, or SingleFile, or even Puppeteer, etc. - to snapshot web
content, but save it as HTML instead of PDF. I would guess HTML content is
still grep-able (as you noted for your local PDF searches). Hmmm... a local
cache of my own offline bookmarks... Hmmm, interesting. Thank you for this
inspiration!!

~~~
gildas
I think SingleFile or SingleFileZ [1] would indeed solve your needs.

[1] [https://github.com/gildas-lormeau/SingleFileZ](https://github.com/gildas-
lormeau/SingleFileZ)

~~~
mxuribe
Yep, I'll look into this, thanks!

------
leshokunin
This would be a perfect fit for IPFS. I love the idea of having just one file
in a permanent link.

~~~
flatroze
This could also be an interesting alternative to PDF, especially with web
fonts embedded as data URLs.

~~~
turbinerneiter
I've been using this kind of standalone PDF produced from Markdown with pandoc
for a while, and the possibilities are insane.

Imagine a paper in the form of a single HTML file, which has (a subset of) the
data included, the graphs zoomable, the colors changeable (to accommodate
whatever vision problems you have) - maybe even the algorithm to play around
with!

Jupyter Notebooks already go in that direction, only without the single-file,
open-in-browser aspect, I think.

------
js8
I am using "Save Page WE" Firefox extension for this. Better at saving JS
content and less clutter than saving all the images and stuff.

~~~
Crinus
Same, Save Page WE is a great extension :-)

~~~
ausjke
SingleFileZ on Chrome is good too

------
sametmax
Good, but it won't work with heavy JS pages that use Ajax to load every single
piece of content.

The Firefox extension seems to do that:

[https://addons.mozilla.org/fr/firefox/addon/single-
file/](https://addons.mozilla.org/fr/firefox/addon/single-file/)

~~~
gildas
Unfortunately it's not written in Rust so it won't make the first page of HN.

------
gildas
Note that SingleFile can easily run on command line too, cf.
[https://github.com/gildas-
lormeau/SingleFile/tree/master/cli](https://github.com/gildas-
lormeau/SingleFile/tree/master/cli).

------
interfixus
Nice. I can see some automated uses for this. In ordinary browsing, I am
currently using a Firefox addon called SingleFile which works surprisingly
well. Stuffs everything into (surprise, surprise) one huge single file - HTML
with embedded data, so it's compatible everywhere.

~~~
flatroze
It sounds like a great add-on, I have to check it out to see what it does to
remote assets and how it works with asynchronously loaded assets.

------
mikekchar
With respect to the Unlicense, does anybody have any knowledge about how good
it is in countries which don't allow you to intentionally pass things into the
public domain (most countries that aren't the US)? How does it compare to CC0
in that respect?

~~~
flatroze
Which license would you recommend releasing this software under to reach broad
adoption with permissive terms, if not the Unlicense?

~~~
mikekchar
I honestly don't know. That's why the question :-) Is CC0 good for software?
It seems to be a bit more complete from a non-US view point, but I don't know
if there are lurking situations. Possibly MIT is better -- it's pretty darn
permissive. I'm really just soliciting opinions.

~~~
pgcj_poster
Yes, CC0 is the only Creative Commons License suitable for software. It's
endorsed by the Free Software Foundation [1], although not the Open Source
Initiative. I use it for everything that I don't want to copyleft.

[1] [https://www.gnu.org/licenses/license-
list.html#PublicDomain](https://www.gnu.org/licenses/license-
list.html#PublicDomain)

------
hendry
I imagined that
[https://www.w3.org/TR/widgets/](https://www.w3.org/TR/widgets/) would be the
open container format for saving a Web app to a single file.

------
cr0sh
This is interesting - I think any of us who save things off the internet have
made something like this (I usually save entire sites or large chunks, though
- so I have a different toolset - still, I also do single pages, so I might
try out this tool).

One thing I would propose adding - either behind a flag, or by default - have
it parse the path of the page and create the file with that name - that way
you can just "monolith {url}" and not have to worry about it.

I am also curious as to how it handles advertisements and google tracking and
such; some way to strip out just those scripts (and elements) could be handy.

------
makach
Ahh, to me it looks like it creates an amalgamation of the web page+contents.

How does this work on neverending webpages/forever scroll? How will it behave
if you need to authenticate before browsing the page?

~~~
flatroze
That's it in a nutshell!

It seems to work for basic pages quite well, I think that lazy load will work
for most pages as long as the JavaScript is embedded (no -j flag provided) and
the Internet connection is on. It saves what's there when the page is loaded,
the rest is a gamble since every website implements infinite scroll
differently.

Authentication is another tricky part -- it's different for every browser. I
will try to convert it into a web extension of sorts, so that pages could be
saved directly from the browser while the user is authenticated.

~~~
donatzsky
For authentication, you could add an option for passing http headers, as well
as accept Netscape-style cookie files.

Whenever I want to download a video, using YouTube-dl, from a site that
requires authentication, I first login using my browser and then exports the
cookies using an extension.
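
youtube-dl then takes the exported file via its --cookies flag, e.g.:

    
    
      youtube-dl --cookies cookies.txt "https://example.com/members-only/video"
    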

~~~
sah2ed
May I ask what extension you use for cookie exporting?

------
jplayer01
Ah, I've been thinking about making something like this. You beat me to it.
I've been using the SingleFile add-on until now. I'll definitely give this a
try.

~~~
gildas
FYI, SingleFile can run on command line too, cf. [https://github.com/gildas-
lormeau/SingleFile/tree/master/cli](https://github.com/gildas-
lormeau/SingleFile/tree/master/cli)

------
lucasverra
Super project! I've been pretty baffled by the difficulty of saving a webpage
in a proper format. I've tried PDF converters, the getPolaroid app, and of
course Firefox's screenshot feature for the full-scroll thing. Will try this
for saving purposes.

I am also interested in cloning/forking sites for modification purposes; I
will give you feedback on the results from my consulting gigs.

~~~
flatroze
Thank you for the kind words.

It will evolve into a reliable tool in a couple of weeks, and it should
eventually work for embedding everything, including things like web fonts and
url() references within CSS. If anything doesn't work, please open an issue; I
have plenty of time to work on it.

------
sankalp210691
This is pretty useful. It would be great to have the ability to convert the
HTML page to a PDF as well.

------
sergioisidoro
This sounds great, but the first thing I thought was how this would be a
perfect tool for making automated mass phishing scams.

If the output is realistic: take a massive list of sites, make a snapshot of
each page, replace the POST login URLs with the phisher's own, deploy these
individual HTML files, and spread the links through email.

I wonder how this project handles forms.

~~~
flatroze
Thank you for reminding me, I need to set action="" to be an absolute path
when the page is saved.

upd: Done, now forms get their action="/submit" converted into
action="https://website.com/submit" when the page is saved.

------
personjerry
Styling breaks on this site: [https://www.scientificamerican.com/article/the-
hunt-is-on-fo...](https://www.scientificamerican.com/article/the-hunt-is-on-
for-alpha-centauris-planets/)

~~~
flatroze
Thank you for the heads up, I'll test it and enhance to preserve styles
better.

------
fouc
Sweet idea! I would especially like to be able to capture videos and pictures
too.

I suspect for saving videos, a good approach would be some sort of proxy +
headless browser combination, where the proxy is responsible for saving a copy
of all the data the browser requests.

Thoughts?

~~~
flatroze
Thanks! Pictures should work, I'll check more tags first thing tomorrow when I
start working on improving it.

I use youtube-dl for youtube and other popular web services myself. Embedding
a video source as a data URL could in theory work, but it'd be quite a long
base64 line. Also, editing .html files with tens or hundreds of megabytes of
base64 in them would perhaps be less than convenient.

------
personjerry
`cargo install` installs 237 packages for this?! I don't think that's
acceptable.

~~~
Deukhoofd
Probably the reqwest crate. That thing alone uses like 30 crates, not
including the dependencies of those crates.

~~~
flatroze
The compile time is rather long as well, I'm looking into ways of reducing the
amount of dependencies.

------
ajxs
Very cool. Have you considered incorporating an option for following links
within the same domain to a certain depth? I remember using tools such as this
in the past to save all the content from certain websites.

~~~
flatroze
Thank you! I'll add it as an issue, since it could definitely be useful for
"archiving" certain resources more than 1 level deep. Do you remember the name
of that tool by any chance?

~~~
ajxs
I wish I could remember the exact tool, this was over a decade ago. If you
just do a quick internet search for such a tool you'll likely find whatever I
used, it certainly wasn't anything sophisticated. It was a Windows GUI tool
designed specifically for the task. Something makes me think that 'GetRight'
tool might have been able to do the same thing, but I can't seem to see the
feature on their website.

~~~
flatroze
Ah, I remember using something like that. I thought that tool was saving it
into one .html file, but data URLs didn't exist back then, so creating
directories alongside HTML files was the only option to "replicate" a web
resource; now I understand exactly what you were talking about. I'll do some
more digging around and implement that in the near future. I may need to make
all the requests async first to make sure that saving one resource to a decent
depth won't take too long.

------
tenken
How is this different from
[https://en.m.wikipedia.org/wiki/Web_ARChive](https://en.m.wikipedia.org/wiki/Web_ARChive)?

~~~
masklinn
It looks like it creates a normal HTML file (embedding assets as data URI) so
it should require no special client / support.

HTMLD, WARC, MHTML, MAFF and webarchive are all "container" formats which
bundle assets next to the HTML using various methods (resp. bundle, custom,
multipart MIME, zip and binary plist).

~~~
emerongi
The issue with this is that if the website requires some external API for
content, it might not work properly.

[https://webrecorder.io/](https://webrecorder.io/) solves that problem by
recording all interactions and then replaying them as needed.

> Webrecorder takes a new approach to web archiving by “recording” network
> traffic and processes within the browser while the user interacts with a web
> page. Unlike conventional crawl-based web archiving methods, this allows
> even intricate websites, such as those with embedded media, complex
> Javascript, user-specific content and interactions, and other dynamic
> elements, to be captured and faithfully restaged.

------
tannhaeuser
Well, you could do this for a long time already with MHTML, WARC, etc.
downloaders, including those available in browsers via "Save Page As", though
CSS imports aren't covered by older tools (are they by yours?). Anyway,
congrats on completing this as a Rust first-timer project, which certainly
speaks to the quality of the Rust ecosystem. For using this approach as an
offline browser, of course, the problem is that Ajax-heavy pages using
JavaScript to load content won't work, including every React and Vue site
created in the last five years (but you could make the point that those aren't
worth your attention as a reader anyway).

~~~
flatroze
CSS imports are covered by converting .css files into data URLs, later I will
parse those and embed resources found within stylesheets as well.

------
dtjohnnymonkey
Thank you for this. I’ve been looking for something that does this exact
thing. I don’t like any of the other HTML archiving formats.

------
dfee
If the output were a tar file, couldn’t we also say it was saving web pages as
a single file? Wouldn’t that also be easier?

~~~
flatroze
I think there's an issue with opening a tar file, e.g. if sent to someone who
needs to view the document but isn't techy.

It seems to me that having one file that any browser can easily open (and that
doesn't require an Internet connection to view) is a big advantage over having
a directory with assets alongside the .html file. It may be one of those
things that make life easier even though nobody really complains about the way
pages usually get saved. I hope more browsers add support for saving pages as
MHTML in the near future so that we won't need tools like this one.

------
ahub
I noticed there is a `-j` argument to remove javascript. A `-i` argument for
removing images would be great too.

~~~
flatroze
It is done, option -i in the latest version (2.0.3) now replaces all src="..."
attributes with src="<data URL for a transparent PNG pixel>" within IMG tags.
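
For example (assuming the redirect-to-file invocation from the README):

    
    
      monolith -i https://example.com/page > page.html
    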

------
ur-whale
Does not compile with some byzantine message about let in const funcs being
unstable.

~~~
flatroze
Could you please open an issue on GitHub providing the output that you get in
the terminal?

~~~
ur-whale
I have closed my GitHub account since the takeover occurred.

~~~
dspillett
The important part of what he said was "providing the output that you get in
the terminal". Simply stating "I got an error" and expecting the developer(s)
to use clairvoyance to glean further detail is far from a helpful way to
report a problem. Perhaps dropping the details in a pastebin site and linking
to that would be a possible alternative? Or just including the error message
here if it is short enough, though HN shouldn't really be used as a tech
support channel.

------
sbmthakur
Nice work! I am wondering if Puppeteer can also be used to accomplish the same
thing.

~~~
flatroze
It for sure would help with those SPA websites that get their DOM fully
generated by JS. A web extension that saves the current DOM tree as HTML would
perhaps do a better job, especially when it comes to resources which require
some web-based authentication.

------
dvcrn
I'm not so experienced but how does this compare to .webarchive?

~~~
flatroze
The idea is almost identical, yet saving as .webarchive is only supported by
Safari, and it's also not a plaintext format, hence can't be edited as easily.

------
Exuma
Saving this for later

~~~
dredmorbius
FYI: "favorite" is one way of doing that through HN.

Bookmarks, or downloads, externally.

~~~
skinnymuch
Favorites is limited to a certain amount on HN before you start losing the
oldest favorite.

~~~
dredmorbius
How many specifically?

I'm at 252 posts, presently, just checked. That seems to be a complete log.

------
nessunodoro
call me old fashioned, but I still use Ctrl+S

------
VvR-Ox
Very cool idea - thank you for this!

One question: how does it handle those cookie pop-ups, GDPR warnings, etc.?

~~~
flatroze
Oh, thank you kindly.

That's an interesting question. I think it depends on how the given modal is
implemented, but closing them should technically work (unless the page is
saved with JavaScript code removed [-j flag]). Those notifications can also be
removed from the saved file using any text editor, which should be pretty easy
if you know how to edit HTML code. I don't think removing them would violate
anything, since "this website" will no longer really be a website but rather a
local document at that point.

