One thing I always wonder when I see native software posted here:
How do you guys handle the security aspect of executing stuff like this on your machines?
Skimming the repo it has about a thousand lines of code and a bunch of dependencies with hundreds of sub-dependencies. Do you read all that code and evaluate the reputation of all dependencies?
Do you execute it in a sandboxed environment?
Do you just hope for the best like in the good old times of the C64?
That isolated the build process. A similar method isolates the execution of the built project:
$ cd target/release
$ sudo docker run --rm -w "$(pwd)" -v "$(pwd):$(pwd)" -u "$(id -u):$(id -g)" rust ./monolith https://www.grepular.com
Slightly related: I released a project last night for easily running node applications inside containers: https://gitlab.com/mikecardwell/safernode - Without even having node or npm installed on the host system, you can still run commands like "npm install" or "npm start" to run node applications safely isolated inside ephemeral containers.
I've heard this too, but as far as I know it's only because there are potential bugs in the container software that allow the malware to escape.
To me, this is kind of like saying you should just run stuff as root, because there might be a privilege escalation vulnerability which lets the code run as root anyway.
Correct me if I'm wrong.
My goal was to make things more secure, not completely secure.
Previously, dodgy libs could read (and add) ssh keys into ~/.ssh/, take over my NPM account by fetching ~/.npmrc, grab a copy of my ~/.bitcoin/wallet.dat, and add a keylogger into my ~/.bashrc
Now, at least it has to break out of docker first.
> To me, this is kind of like saying you should just run stuff as root, because there might be a privilege escalation vulnerability which lets the code run as root anyway.
But I never said it was preferable to run directly on the host. There are other choices.
> My goal was to make things more secure, not completely secure.
There is no such thing as completely secure. The argument against docker is more along the lines of "is it really as secure as people think it is?"
> I've heard this too, but as far as I know it's only because there are potential bugs in the container software that allow the malware to escape.
I'm not sure docker was designed for the purpose of secure isolation, so if it fails to securely isolate, I'm not sure it would count as a bug.
Linux relies on a concoction of properly-configured kernel subsystems to provide some level of isolation for containerized processes, and systems like LXD and Docker try to patch up the gaps.
The cgroups interfaces don't offer much in the way of security directly -- they're mainly about keeping groups of processes within certain resource-consumption quotas, and afaik they don't really attempt to address secure isolation at all.
LXD approaches this by adding a uid/gid translation layer, so that the uid/gid for anything within an unprivileged container will be offset by a specified value, e.g., calls with user ID 1000 in a container are made to present to the host as user ID 1000000. This comes with its own host of issues which LXD tries to hide.
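For the curious, those mappings live in /etc/subuid and /etc/subgid on the host; an illustrative example of the format (values here are just an example, not necessarily your distro's defaults):

$ cat /etc/subuid
root:1000000:65536

which reads as "container uids 0-65535 for root's containers get shifted onto host uids 1000000-1065535".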
The short answer is that if security is any type of priority for the system in question and you want to run containerized processes, you should use an OS that implements container security directly in the kernel, like FreeBSD with jails or illumos with Zones, instead of depending on getting exactly the right configuration between all the moving pieces in the Linux container stack.
This is a good question. I think you can make it even better by generalizing the problem. How on earth do developers hope to advance general computing when simply running programs isn't a solved problem? Most software engineers I know don't run docker on their home PCs. What about people who aren't in IT? Does anyone here even care? The general attitude I see is "plebs don't need to run anything they can't get from an app store". It's a horrible attitude.
I very much like this quote from Alan Kay:
"It doesn't matter what the computer can do, if it can't be learned by billions of people."
There is no good technical reason why modern operating systems can't work out some scheme for sandboxing arbitrary programs by default. It is obviously necessary. I imagine something like an "applications" folder where every subfolder automatically becomes an isolated "container". It would have to be designed with security as the primary concern, though, unlike current container solutions.
But arbitrary programs are... arbitrary. Especially ones run by software engineers, and especially ones run by software engineers as part of a POSIX-alike “utility bag” ecosystem.
Who’s to say that the user’s intent by running the program they just downloaded, isn’t to—say—overwrite a system folder? (Oh, wait, that’s exactly what Homebrew does, with the user’s full intent behind it!)
There are tons of attempts to do what you’re talking about. Canonical’s “snaps” are a good example. As well, every OS sandboxes legacy apps by default (because they’re already virtualizing them, and sandboxing something in a virtualization layer is easy.)
But none of those solutions really work for the “neat FOSS hack script someone wrote” workflow we’re talking about here, where you build programs from source and run them for their intentional side-effects on your system.
You might suggest that there could be a shared sandbox for all the POSIX-like utilities to interoperate in. But what if you’re attempting to use those utilities against your real documents? (For example, a bulk metadata auto-tagging and auto-renaming utility, to get TV episodes from torrents loaded into Plex correctly.) How do you draw the line of what such a program can operate on? AFAICT, you just... can’t. Its whole purpose is to silently automate some task. If it requires constant security prompting, the task isn’t automated.
Let me put it another way: how would you implement a dotfile management framework (like any of these: https://dotfiles.github.io)? Programmers seem to really like them, judging by how many of them there are. But the whole point of them is to forcefully usurp the assets of literally every other program on the system. They're user-level rootkits, in a sense.
Or, for a simpler, more obvious example: find(1), grep(1), etc. A set of utilities that can all be asked the equivalent of "read literally every file the VFS has access to and tell me whether they match an arbitrary-code-execution predicate." Do you want to literally copy your entire hard disk into the 'inbox' of these utilities, in order to get them to search it for you? (And before you say "well, we can trust the base utilities that ship with the OS to do more than arbitrary third-party utilities"—there's a whole competition of grep(1) replacements, e.g. ag(1), rg(1), etc. Do you want to make it impossible for people to innovate in this space?)
Or how about Nix, or GNU Stow, or, uh, Git? These utilities become useless if they have their own sandbox. Does your git worktree live in Git's sandbox? Vi's sandbox? The inability to make this distinction functional is why mobile OSes only have fully-integrated IDEs!
Or how about shells themselves! (Or, equivalently, any scripting runtime, e.g. Ruby, Python, etc.) Should people not be allowed to install these from third parties?
Or, the most based example of all: make(1) [and its spiritual descendants], and the GNU autotools built atop it. How does ./configure work if you can't detect true properties of the target system, only of the sandbox you're in?
>"Do you want to literally copy your entire hard disk into the 'inbox' of these utilities, in order to get them to search it for you?"
Well, let's think about the goal here. grep reads files and outputs lines from those files. It needs full read access to everything you want to search. It does not need write access outside of its sandbox. It does not need direct access to network sockets, audio stuff and so on.
Is it unreasonable to create a readonly "view" of the filesystem inside grep's folder? Is it unreasonable to have "files" representing network access, microphone, audio? It will have visual representation in file manager without the need to create custom UI. It could be manipulated by drag-and-drop OR command line. More importantly: it's easy for users to understand. "This app lives in a box. You can put things in that box for the app to use."
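On stock Linux you can already fake that kind of read-only view with bind mounts; a rough bubblewrap sketch (paths and pattern are just examples):

$ bwrap --ro-bind / / --tmpfs /home --ro-bind ~/Documents ~/Documents --unshare-net grep -ri "pattern" ~/Documents

The whole filesystem is remounted read-only, the rest of /home is hidden behind a tmpfs, only ~/Documents is exposed back in (still read-only), and the network is taken away.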
>Does your git worktree live in Git's sandbox?
Yes? I mean, I currently have a folder called projects. All my git stuff is in there anyway.
>Vi's sandbox?
If you want multiple sandboxes to be able to operate on a directory, you create "views" for that directory (readonly or read/write) in multiple sandboxes. This shouldn't be some sort of mind-bending idea, considering Unix has symlinks, hardlinks, and mounted filesystems of all sorts.
>Or how about shells themselves! (Or, equivalently, any scripting runtime, e.g. Ruby, Python, etc.) Should people not be allowed to install these from third parties?
There is no reason why a Ruby executable should have unlimited access to the entire file system. Especially if you're only using it for a specific purpose, like serving a website.
What I'm describing here isn't some novel, mind-blowing idea. It's simply dependency injection, with a file-based user interface. Every single part of this has been done in various operating systems or programming environments more than once. It's just a matter of combining it all in a sensible way.
Amusing that you should mention Canonical "snaps". I made a snap of monolith and contributed the yaml upstream. https://snapcraft.io/monolith - it's in the edge channel because upstream haven't done a stable release yet. It's a strictly confined application.
Homebrew is designed to take over your OS /usr/local directory. Not that there's much in there by default, but Homebrew's presence greatly changes the semantics of /usr/local, given that it's normally meant as a prefix for the local system administrator to install things as root into, whereas Homebrew re-assigns ownership of the whole directory structure to the user installing Homebrew (who, admittedly, is a member of the "staff" group, but still only one member on a potentially multiuser computer.)
Even with decent containers, making software that's good enough for everyone to use is considerably more difficult than making it for yourself and a few friends.
It's okay to do things that don't scale at all, and also okay to make a proof of concept and let someone else worry about scaling it up.
That said, you might want to look at Chromebooks and the Windows and Mac app stores to see what's going on with containerization beyond mobile. (Also, web browsers.)
> I imagine something like an "applications" folder where every subfolder automatically becomes an isolated "container"
Snap [1] goes pretty far in this direction. Apps are isolated against each other, and with AppArmor isolated from the system (at least on Ubuntu, your distro might vary). Android does much of the same.
A big problem is that most software exists to manipulate data on the user's machine, so isolating the software from the user's home folder is impractical. At the same time, this data is usually the most valuable thing on the entire computer. That makes it fundamentally very hard to design a system where you can trust arbitrary apps. Android tried to solve this with a "file open" dialog that's controlled by the OS, so that there's an easy way to give apps temporary access to single files, but that leads to weird UX.
Some people run ~everything in individual containers, with original commands being aliases for the docker (or whatever) equivalent.
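For example, something along these lines, in the same spirit as the docker run invocation upthread (using the official rust image just as an example):

$ alias cargo='sudo docker run --rm -it -w "$PWD" -v "$PWD:$PWD" -u "$(id -u):$(id -g)" rust cargo'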
I've shunned that assuming it'd be a big slow down, but I do keep meaning to at least try it, uh, after I knock it.
(No, containers, like anything else, aren't and never have been completely secure, but it'd take a more sophisticated - and certainly deliberately malicious - tool to do any damage to your system, or to files you didn't explicitly allow it access to.)
At least with containers, if something manages to do damage, it will have to do it through a security vulnerability - which will eventually get patched if caught. The current state of affairs is that malware can do damage without even needing to use a security vulnerability whatsoever. It's "working as designed", and will keep working forever.
I do all of my development in a VM, which allows me to take snapshots and have a portable GNU workstation that's decoupled from my desktop and hardware.
Meanwhile, my desktop remains clean and ready to play media in native environment with good hardware support.
I used to think it'd be slow, until I tried it. My computer is 8+ years old, and it works fine. I mostly do text work.
I used to use a VM (the motivation being to run Linux on my locked-down corporate-imaged Macbook) and it was usable, but not as fast as it could be if it had free rein over all CPU/RAM.
But, at least the way I was doing it, it's not adding any security as discussed here: since you're doing everything in the VM, anything in the VM has access to everything, just as if it were all on the host anyway.
It's a valid question. It seems to me users tend to trust things which have certain level of popularity and reputation associated with them.
I personally prefer to hope for the worst. This way when nothing happens I feel extra lucky, and if bad things do happen, I feel proud of being ready for it.
I don't.
In the age-old debate between security and convenience I sometimes want the latter and am willing to ease up on the former. To me, being prepared for the worst is not fun.
I use the reputation of the software creator to manage my risks.
I just clicked on the OP's user name first, which has the caption "Bad boy". Then I read your comment. Now I worry more about the security aspect than before.
People trust by default and only distrust when they spot a signal that believably correlates with maliciousness. Kind of like how people judge each other based on how they dress.
I’d have a different initial response. Somehow it triggered my brain to think more about security and be on alert. I’m any case this faded away quickly. The mind is beautiful :)
> Skimming the repo it has about a thousand lines of code and a bunch of dependencies with hundreds of sub-dependencies. Do you read all that code and evaluate the reputation of all dependencies?
I usually do a quick check to see if there's any red flag. Like URLs or Base64 blobs.
I also try to stay away from programs written in languages whose environment I don't know so I can check if any dependency stands out wrt what the program claims to do.
> Do you execute it in a sandboxed environment?
Either I run it on an untrusted server, or on my laptop as an unprivileged user (with no access to X/Wayland).
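Concretely, the unprivileged-user route can be as simple as a throwaway account (the account name here is made up):

$ sudo useradd --create-home untrusted
$ sudo -u untrusted -H ./monolith https://www.grepular.com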
> Do you read all that code and evaluate the reputation of all dependencies?
Why of course. I do this for every piece of software on my computer, from the device drivers to the OS, I review every patch to firefox & chrome as well. /s
Running someone else's software inherently means extending them trust. This objection is especially confusing on a piece of software where you can actually inspect all the source if you like (unlike e.g. device drivers and OS code unless you run Gnu/Linux + all Free drivers, as few but RMS do).
But on that last bit I disagree. Many, many systems run nothing but stock open source kernel drivers under Linux. I daresay home systems with closed drivers are more of an exception. All those VMs in the "cloud".
Before I installed this tool, I checked whether the author was a member of any well-known organization, and since they are not, I skimmed all the code for monolith and any of its dependencies that are not extremely popular, in this case mime-sniffer.
Largely agree, but it's hardly OP's fault. In my opinion this is one of the things we get from using a 30-year-old operating system, with almost no innovation in that field for so long.
That's more secure than running straight on the machine, but not completely secure; security was never a core goal of Linux containers.
If you want to run the software properly sandboxed, then since Linux doesn't have really engineered OS-level solutions (à la Solaris), that means a VM, e.g. running in QEMU, or biting the bullet and switching to Qubes.
I think that's an extremely uncharitable view on containers. There has been a massive amount of work put into securing containers for a variety of use cases using both layers available and by adding to the kernel.
> I think that's an extremely uncharitable view on containers.
It's an objective one.
> There has been a massive amount of work put into securing containers for a variety of use cases using both layers available and by adding to the kernel.
That doesn't change the fact that security was never the primary goal for containers, so secure containers were and are a bunch of tricks, kludges and prayers being built up in the hope that eventually all the holes in the model will be patched.
The "Making containers safer"[lwn][hn] talk was literally two days ago. Note how it says safer, not safe.
It is not objective by any stretch of the imagination.
I suppose when security enhancements are made to any other system to make them safer (i.e. everything in the realm of security), you apply the same logic? Subjective.
Can you please quote how containers are not built with security in mind? What would even be the point of user namespacing, network namespaces, filesystem namespaces, etc... if not security?
Look at how the cloud providers offer support for containers.
Do they ever offer to run your container in the same VM as those of other customers?
They never do this. For secure isolation, they only trust VM isolation. It seems unlikely that this will change.
> What would even be the point of user namespacing, network namespaces, filesystem namespaces, etc... if not security?
They're for installation/configuration/administration. They allow you to run multiple applications on one Linux VM, and to configure them independently, almost as if you were running multiple VMs (with the advantage of lower overheads - only one instance of the kernel).
Kubernetes puts this to good use, letting you treat application deployments as commodities across your cluster.
Containers do not offer secure isolation. They are by nature much leakier than the isolation VMs can offer. The Docker folks still treat isolation-failures as bugs, of course. (Well, ignoring things like the way 'uptime' gives the uptime of the underlying machine, and not of your container.)
I recently took a course on how cgroups and namespaces work and can be combined to create containers, and my impression is that the security is a huge kludge. For example, the capabilities are just a seemingly random assortment of different permissions, with a big dumping ground in the admin capability. It's hard to see how such a system can be reliably secured. Plus, it's all open source with a couple of core contributors. What's to stop some state agency inserting its code into the core? No way to review everything, and a suitably clever developer can place a backdoor somewhere in all of the millions of lines of code. So, I must agree that security is not really at the forefront of Linux or container technology.
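For what it's worth, the knobs you end up turning by hand when you try to tighten things look something like this (a sketch of a locked-down docker run; the image and command are placeholders):

$ sudo docker run --rm --cap-drop=ALL --security-opt=no-new-privileges --read-only --pids-limit 100 --memory 512m some-image some-command

Each of those flags papers over a different kernel mechanism (capabilities, no_new_privs, mount flags, the pids and memory cgroups), which is rather the point being made above.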
Docker itself could be called a huge kludge, at least compared to Solaris 'zones' and FreeBSD 'jails'.
They're similar to containers, but are supported directly by the kernel, whereas Docker has to pull together different kernel features to create its abstraction. [0]
> What's to stop some state agency inserting its code into the core? No way to review everything
1. This isn't a point about containers, it's a point about Free and Open Source software in general. Do you avoid all Open Source software when security matters?
2. I'm pretty sure the Linux kernel folks review everything, and I imagine the Docker folks do too
3. You're implicitly assuming that closed-source software is safe from government pressure. It is not.
Nothing is safe from government pressure. But, at least with local closed source we know it's going to just be our government pressure. Otherwise, it could be any actor, which may be less friendly towards us.
> with local closed source we know it's going to just be our government pressure
We don't. Companies that produce proprietary code are not immune from attacks on their repository, and are more vulnerable to, say, bribery. They're also more vulnerable to attacks on their distributed binaries - users do not have the option to compile from source, so you compromise every user this way.
Proprietary software is also far more likely to embed 'telemetry' spying, or to use sloppy security practices and rely on security-by-obscurity. Authors of Free and Open Source software know that they (generally at least [0]) cannot get away with this kind of thing.
It simply isn't true that proprietary software is more trustworthy than FOSS. If anything, the opposite appears to be true.
We may semi-trust our package and repo systems. This tool is readily available through AUR on my Arch machine, I see.
Or we may go the whole hog and actually have a peek through the source.
The AUR contains packages that aren't in Arch's official repos. Granted, tools like yaourt do make installing AUR packages nearly as easy as pacman, but anyone can upload anything to the AUR, so you are expected to vet the packages yourself (hence why tools like yaourt repeatedly prompt you to read the build scripts et al. before running them).
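In practice, vetting an AUR package looks roughly like this (using monolith as the example, since it's apparently packaged there):

$ git clone https://aur.archlinux.org/monolith.git
$ cd monolith
$ less PKGBUILD   # check what it actually downloads and builds
$ makepkg -si     # build and install once satisfied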
The capture includes JS, so this should work for most JS-dependent sites, with the exception of scripts loading other additional assets.
Tbh, often those are superfluous, or egregious examples of bad web dev, so it seems a reasonable solution for most cases.
SingleFile is a different approach, but it's a lot more involved/less convenient than a cli, and loading in something like WebDriver on the cli for this would be overkill, unless you're doing very serious archival work.
superfluous, or egregious examples of bad web dev??
Do you know what Web 2.0 is? Do you know what are React, Angular, and the other JS Frameworks?
When you create a modern webapp, a lot of data is retrieved from servers as JSON and formatted in the browser with JavaScript. Sometimes even the CSS is generated browser-side. Even more, on webapps where user login is taken into account, the display is modified accordingly.
That's the web of 2019. The approach consisting of getting remote files and launching them in a browser is really naive.
Speaking of SingleFile, it has a CLI version and can handle full Web 2.0 webapps without any problem. And of course, Web 1.0 webapps work as well.
With the exception of actual XHR requests (which should ideally be for dynamic resources, and as such somewhat outside the remit of saving a webpage), I was referring specifically to JS loading JS, etc. solutions. React, Angular do not recommend/advise you to do this. This isn't a requirement in Web 2.0 or Web 5.0 or anything else.
In terms of React at least, fetch requests are not a part of the framework in any way, and any present would typically be done in custom code in lifecycle methods. Even Redux is—by default—client-side only. Stores are in-memory; actions populating them would make fetch requests with React/Redux-independent logic.
Other JS frameworks are, typically, the same. And all of that is just considering dynamic XHR. Loading scripts is much less typical, and never required. The most common application of this I've seen is the GA snippet, which mainly does it to ensure the load is async without relying on developer implementation: it's 100% unnecessary to do it this way.
So yes, unless you're distributing a tracking snippet that you expect non-devs to be blindly pasting into their wordpress panels and still have it work efficiently, generally speaking use of this method is never necessary, and commonly a red flag for poor architecture.
I think that's exactly what that person means by superfluous and egregious examples of bad web development; SPAs, javascript frameworks of that nature. :p
Yes, the debate between building an SPA with rich features or plain old Web pages with good SEO is eternal :)
We increasingly see a hybrid approach that could be called Web 1.5 :)
Ajax will* still work fine with an internet connection as long as those ajax endpoints don't require cookies and don't linkrot.
* not 100% sure how the tool handles relative URLs embedded in source: if it's not clever enough, though, this is very fixable via PR (as in, it's not an architectural limitation)
The main problem with too many websites is they've become too much about technology and have left visitors, as well as the spirit and intent of the internet behind.
By default, it will capture everything that would be interpreted/displayed. However, you can also run it with Selenium (and Firefox) instead of Puppeteer, and install your favorite extensions to block all the unwanted resources (there's a switch for that).
MHTML is pretty good for this already btw (not to take away from this neat project though :)). Similarly stores assets as base64'd data URIs and saves it as a single file. Can be enabled in Blink-based browsers using a settings flag and previously in Firefox using addons (also in the past natively in Opera and IE).
Apparently everybody knew about MHTML but me Ü
I'm going to look into that format and see if I could enhance monolith to output proper MHTML, among other additions and improvements. Thank you for the info!
I don't know that it would be a very useful thing to do at least in the short term: there's a bunch of "web archive" formats out there and the common thread between them is that they're custom archive formats, you need special clients or support for those formats:
* mhtml encodes the page as a multipart MIME message (using multipart/related), essentially an email (you're usually able to open them by replacing the .mht with .eml)
* WARC is its own thing with its own spec
* MAFF is a zipfile, not sure about the specifics
* webarchive is a binary plist, not sure about the specifics either
Your tool generates straight HTML which any browser should be able to open. It probably has more limitations, but it doesn't require dedicated client / viewer support.
Maybe once you've got all the fetching and extracting and linking nailed down it would be a nice extension to add "output filters", but that seems more like a secondary long-term goal, especially as those archive formats are usually semi-proprietary and get dropped as fast as they get created (WARC might be the most long-lived as it descends from the Internet Archive's ARC, is an ISO standard and is recognised as a proper archival format by various national libraries).
There isn't much to MAFF. Each MAFF file can contain more than one saved page. Each page needs to be contained within its own folder (whose name is usually the timestamp of when the page was saved, but it doesn't matter AFAICT). There can be an `index.rdf` file in there, to specify metadata and which file to open, but otherwise you should look for an `index.SOMETHING` file - usually `index.html`.
When I was messing around with archiving things locally I settled on MAFF, because it's pretty much trivial to create and to use. Even if your browser does not support it, you just need to unpack it to a tempdir and open the index file.
I had never heard about MHTML either. Another use case could be embedding markdown source for the HTML in the document as well. This would allow single-file documents with figures that could be edited as light markup (with some tooling) and be viewed by anyone with a browser. This is something I have been dreaming about for years!
Tbh, I had arrived at the conclusion that MIME would be great, but it never struck me that someone had already made a "standard" out of MIME and HTML.
One issue with MHTML is that it does not seem to be currently supported by iframes. The use case I was working on was comparing search results from Google and DuckDuckGo by simply scraping and downloading to later embed. For that, I used a cli tool from an open source library [1]. MHTML seems like a nice format but I'm not sure if there's a library to convert them into stand-alone HTML files.
Edit: This question just came to mind. If MHTML saves images using base64, and base64 dataurl images have a limit size, how would you save extremely large photos? Take for example the cover image of this article https://story.californiasunday.com/gone-paradise-fire. When I saved the page in MHTML format, the re-rendered image showed up quite blurry compared to the original. Was the size limit the cause?
> MHTML seems like a nice format but I'm not sure if there's a library to convert them into stand-alone HTML files.
Many years ago I used a program that could convert them, but the name escapes me. A brief search shows a few results that appear to do a similar conversion; potentially they may be of use.
> This question just came to mind. If MHTML saves images using base64, and base64 dataurl images have a limit size, how would you save extremely large photos?
From what I've read there's no official limit for base64 encodings, though IE/Edge limit them to 4GB according to caniuse.com. I haven't personally encountered an image or GIF that was too large to be saved in an MHT (I have over 10k MHTML files saved, some image- and GIF-heavy ones up to 200MB each).
Also a sibling comment corrected me about the data URI use, MHTML uses a separate scheme but nevertheless still uses base64 for encoding.
For that example article you linked it seems likely to be the way the program you're using is handling the Javascript-loaded images. I saved it in Vivaldi (Blink-based engine) and the main image displayed at full res when opened locally while the other images didn't, while when saved with a pre-Quantum Firefox using the UnMHT addon it saved all the images at their fully loaded resolutions. Some MHTML saving implementations clearly have advantages over others it would seem.
> Similarly stores assets as base64'd data URIs and saves it as a single file.
Does it? IIRC it stores assets as MIME attachments, hence the "M": the result is not HTML (which this would I assume be), it's a multipart MIME message whose root is an HTML document.
edit: in fact when downloading mht files osx / safari misrecognises them as exported emails and appends the "eml" extension.
Always struck me as quite odd MHTML fell out of favor. Back in the day when I wanted to preserve a web page it was the logical choice since you could just click "save as archive".
Blame XMLHttpRequest, Flash, JS, and embedded video. It doesn't make sense to archive a document when the necessary interactive content elements will essentially fail when opened offline.
What would be cool is one package with the original HTML file, the fully rendered DOM written out as a second file+assets package, and a PDF just in case the first two get fucked.
You can first prerender the page with Chrome in headless mode (see my other comment), and then convert it into a single document using an inlining tool (such as the OP's). That way the JS will run and render the page (see my other comments here for an example).
It appears that (at least) Safari cannot open mhtml files. The benefit of a tool such as what the OP shared is that it can produce plain html pages that are openable by anyone.
(also, I tried mhtml in Chrome using the proper flag and it doesn't appear to store/inline/render static assets correctly).
I think it would be way better to explain in the repository:
- how do you handle images?
- does it handle embedded videos?
- does it handle JS? to what extent?
- does it handle lazily loaded assets (e.g. images that load only when you scroll down, or JS that loads 3 seconds after the page is loaded)
In general, how does this work? The current readme doesn't do a decent job explaining what the tool exactly is. For all I can tell, it probably just takes a screenshot of the page, encodes as base64 into the html and shows it.
I saw a tool that handles JS to a limited extent by capturing and replaying network requests to accommodate said JS. It records your session while you interact with a site, and is then able to replay everything it captured.
This tool was able to capture three.js applications and other interactive sites quite well.
Sure, but many websites load resources and such with JS - the initial request might not contain the content, but a few seconds on the page with JS running lets it fill everything in, etc.
SPAs often (but not always) do this. Content is loaded in via React components and such...
The desktop version saves an HTML file, stylesheet and images/fonts locally, and it only contains the HTML of the snippet with the CSS rules that apply to the DOM subtree of the element you select.
I'm still working out bugs but it would be great if people try it out and let me know how it goes.
I tried SnappySnippet before when looking into the idea - it didn't work well for me and crashed often.
I never saw DesignPirate, but just now I tried it and it didn't output any CSS. I'm not sure but it doesn't look like either of these use chrome.debugger API to call devtools api methods. (you get a warning in Chrome if you use that)
I'm hoping my tool will be better so it's good enough people would be willing to pay for it, but we'll just have to see.
I'm glad there's more people taking a look at the use case, and I'd be interested to see a list of similar solutions.
If you combine this with Chrome's headless mode, you can prerender many pages that use JavaScript to perform the initial render, and then once you're done send it to one of these tools that inlines all the resources as data URLs.
The result is that you get pages that load very fast and are a single HTML file with all resources embedded. Allowing the page to prerender before inlining will also allow you to more easily strip all the JavaScript in many cases for pages that aren't highly interactive once rendered.
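The prerender step itself is just something along these lines (a sketch; --dump-dom prints the post-JavaScript DOM as plain HTML):

$ google-chrome --headless --disable-gpu --dump-dom 'https://example.com/some-spa-page' > rendered.html

after which you point the inlining tool at rendered.html (or pipe it in, if the tool supports reading from stdin).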
This is awesome. One question though: how does it handle the same resource (e.g. image) appearing multiple times? Does it store multiple copies, potentially blowing up the file size? If not, how does it link to them in a single HTML file? If or if so, is there any way to get around it without using MHTML (or have you considered using MHTML in that case)?
Also, side-question about Rust: how do I get rid of absolute file paths in the executable to avoid information leakage? I feel like I partially figured this out at some point, but I forget.
Thank you! It's pretty straightforward: this program just retrieves assets and converts them into data URLs (data:...), then replaces the original href/src attribute value, so in the case of the same image being linked multiple times, monolith will for sure bloat the output with the same base64 data, correct. I haven't looked into MHTML, ashamed to admit it's the first time I'm hearing about that format. I need to do some research, maybe I could improve monolith to overcome issues related to file size, thank you for the tip!
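(For anyone unfamiliar with data URLs, the equivalent by hand is just base64-encoding the asset and prefixing the media type; a rough shell illustration, not monolith's actual code:)

$ printf 'data:image/png;base64,%s\n' "$(base64 -w0 logo.png)"

That whole string then takes the place of the original src/href value.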
And about Rust: I think you're way ahead of me here as well, this is my first Rust program. If you're talking about it embedding some debug info into the binary which may include things like /home/dataflow then perhaps there's a compiler option for cargo or a way to strip the binary after it's compiled. ¯\_(ツ)_/¯ Sorry, that's the best I can tell at the moment.
Okay thanks! That was a pretty quick reply :) Regarding MHTML, it's basically the MIME format emails are in (which are basically inherently self-contained HTML documents). Various browsers have had varying degrees of support for it over the years. Chrome recently made it harder to save in MHTML format; I don't know how long they will be able to read the format, so I can't guarantee that if you go in that direction it'll still be useful for a long time, but at the moment there is still some support for it.
One way to dedupe inline image resources while still using HTML rather than MHTML, could be to encode them in css once, and transform the image element to something with that class.
Good point. I was thinking in the direction of something I'm tinkering with in a similar area. There getting a static snapshot of the current DOM or fragment is key (meaning scripts being stripped out is an intentional feature). Tweaking the document contents for efficiency could significantly impact a lot of script work that may be present.
I've been printing to PDF for decades now, and nothing comes close to the ease of use and versatility of 2 decades worth of interesting web pages .. I have pretty much every interesting article, including many from HN, from decades of this habit.
Need to find all articles relating to 'widget'?
$ ls -l ~/PDFArchive/ | grep -i widget
This has proven so valuable, time and again .. there is a great joy in not having to maintain bookmarks, and in being able to copy the whole directory content to other machines for processing/reference .. and then there's the whole pdf->text situation, which has its thorns truly (some website content is buried in masses of ad-noise), but also has huge advantage - there's a lot of data to be mined from 50,000 PDF files ..
Therefore, I'd quite like to know, what does monolith have to offer over this method? I can imagine that its useful to have all the scripting content packaged up and bundled into a single .html file - but does it still work/run? (This can be either a pro or a con in my opinion..)
Having gone this route in part myself, advantages of HTML or other more-structured file formats, if there is appropriate metadata markup:
- Allow for recording source and author information (PDF ... doesn't always provide this).
- Allows for full-text search.
- Allows for editing out annoyances.
I'll frequently go from HTML to some simplified representation (e.g., Markdown), and then re-generate formats that are useful elsewhere: HTML, PDF, ePub, etc.
Dumping from HTML to Markdown frequently makes cruft-removal far simpler, and the principal content of most pages is text. In rare instances, images are useful, and even more rarely, any multimedia content (video, audio, programmatic content).
What's depressing is the number of sites which screw with even basic HTML. E.g., the NY Times rarely use HTML tables for tabular representation, and instead use a homebrew combination of custom markup, CSS, and JS to much the same effect. Pretty, in situ, but brittle and transports exceedingly poorly.
It's funny: on rare occasion, I too have saved some content as a PDF - more so for archiving than for offline viewing... But I guess I never thought to scale it to all/most of my bookmarks. It seems so obvious now after reading your comment. However, my experience with PDFs has been negative. From file size to slow loading in myriad PDF viewers, etc., it just seems like viewing stuff as native HTML/text is better - at least in my experience. Further, my preferred browser - Firefox - leaves much to be desired in this arena of generating proper PDFs, and I end up switching to Chrome (bleh!) just to "PDF something" that I saw/read online. Again, this function is not something I use that often, hence why I stick with FF and have not gone back to Chrome. Anyway, going back to your approach... I wonder if I can use a tool - like this monolith, or SingleFile, or even Puppeteer, etc. - to snapshot web content, but save it as HTML instead of PDF. I would guess HTML content is still grep-able (as you noted for your local PDF searches). Hmmm... a local cache of my own offline bookmarks... Hmmm, interesting. Thank you for this inspiration!!
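For the grep-ability part, presumably a plain recursive search over such an archive would do the trick (the directory name is hypothetical):

$ grep -r -i -l 'widget' ~/HTMLArchive/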
I've been using this kind of standalone PDF produced from Markdown with pandoc for a while, and the possibilities are insane.
Imagine a paper in the form of a single HTML file, which has (a subset of) the data included, the graphs zoomable, the colors changeable (to suit whatever vision problems you have) - maybe even the algorithm to play around with!
Jupyter Notebooks already go in that direction, only without the single-file, open-in-browser aspect, I think.
Nice. I can see some automated uses for this.
In ordinary browsing, am currently using a Firefox addon called SingleFile which works surprisingly well. Stuffs everything into (surprise, surprise) one huge single file - html with embedded data, so compatible everywhere.
With respect to the Unlicense, does anybody have any knowledge about how good it is in countries which don't allow you to intentionally pass things into the public domain (most countries that aren't the US)? How does it compare to CC0 in that respect?
I honestly don't know. That's why the question :-) Is CC0 good for software? It seems to be a bit more complete from a non-US view point, but I don't know if there are lurking situations. Possibly MIT is better -- it's pretty darn permissive. I'm really just soliciting opinions.
Yes, CC0 is the only Creative Commons License suitable for software. It's endorsed by the Free Software Foundation [1], although not the Open Source Initiative. I use it for everything that I don't want to copyleft.
This is interesting - I think any of us who save things off the internet have made something like this (I usually save entire sites or large chunks, though - so I have a different toolset - still, I also do single pages, so I might try out this tool).
One thing I would propose adding - either behind a flag, or by default - is to have it parse the URL of the page and name the output file accordingly, so that you can just "monolith {url}" and not have to worry about it.
I am also curious as to how it handles advertisements and google tracking and such; some way to strip out just those scripts (and elements) could be handy.
It seems to work quite well for basic pages. I think lazy load will work for most pages as long as the JavaScript is embedded (no -j flag provided) and the Internet connection is on. It saves what's there when the page is loaded; the rest is a gamble, since every website implements infinite scroll differently.
Authentication is another tricky part -- it's different for every browser. I will try to convert it into a web extension of sorts, so that pages could be saved directly from the browser while the user is authenticated.
For authentication, you could add an option for passing http headers, as well as accept Netscape-style cookie files.
Whenever I want to download a video using youtube-dl from a site that requires authentication, I first log in using my browser and then export the cookies using an extension.
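For reference, the youtube-dl side of that workflow is just (cookies.txt being whatever the browser extension exported, in Netscape format):

$ youtube-dl --cookies cookies.txt 'https://example.com/some-members-only-video'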
Ah, I've been thinking about making something like this. You beat me to it. I've been using the SingleFile add-on until now. I'll definitely give this a try.
Super project! I've been pretty baffled by the difficulty of saving a webpage in a proper format. I've tried PDF converters, the getPolaroid app, and of course Firefox's screenshot feature for capturing the entire scroll.
Will try this for saving purposes.
I am also interested in cloning/forking sites for modification purposes; I will give you feedback on the results from my consulting gigs.
It will evolve into a reliable tool in a couple of weeks, and it should eventually work for embedding everything, including things like web fonts and url() references within CSS. If anything doesn't work, please open an issue, I have plenty of time to work on it.
This sounds great, but the first thing I thought was how this would be a perfect tool to make automated mass phishing scams.
If the outcomes are realistic, take a massive list of sites, make a snapshot of each page, replace the POST login URLs with the phishers, deploy these individual HTML files, and spread the links through email.
Sweet idea! I would especially like to be able to capture videos and pictures too.
I suspect that for saving videos, a good approach would be some sort of proxy + headless browser combination, where the proxy is responsible for saving a copy of all the data the browser requests.
Thanks! Pictures should work, I'll check more tags first thing tomorrow when I start working on improving it.
I use youtube-dl for youtube and other popular web services myself. Embedding a video source as a data URL could in theory work, but it'd be quite a long base64 line. Also, editing .html files with tens or hundreds of megabytes of base64 in them would perhaps be less than convenient.
Very cool.
Have you considered incorporating an option for following links within the same domain to a certain depth? I remember using tools such as this in the past to save all the content from certain websites.
Thank you! I'll add it as an issue, since it could definitely be useful for "archiving" certain resources more than 1 level deep. Do you remember the name of that tool by any chance?
I wish I could remember the exact tool, this was over a decade ago. If you just do a quick internet search for such a tool you'll likely find whatever I used, it certainly wasn't anything sophisticated. It was a Windows GUI tool designed specifically for the task. Something makes me think that 'GetRight' tool might have been able to do the same thing, but I can't seem to see the feature on their website.
Ah, I remember using something like that. I thought that tool was saving it into one .html file, but data URLs didn't exist back then, so creating directories alongside the HTML files was the only option to "replicate" a web resource; now I understand exactly what you were talking about. I'll do some more digging around and implement that in the near future. I may need to make all the requests async first to make sure that saving one resource with decent depth won't take too long.
yes, "teleport pro" for win98, you could scrape a site and duplicate locally or scrape only for specific file type or size, had recursive link follow depth option and created several threads for the requests(sniff)
It looks like it creates a normal HTML file (embedding assets as data URI) so it should require no special client / support.
HTMLD, WARC, MHTML, MAFF and webarchive are all "container" formats which bundle assets next to the HTML using various methods (resp. bundle, custom, multipart MIME, zip and binary plist).
The issue with this is that if the website requires some external API for content, it might not work properly.
https://webrecorder.io/ solves that problem by recording all interactions and then replaying them as needed.
> Webrecorder takes a new approach to web archiving by “recording” network traffic and processes within the browser while the user interacts with a web page. Unlike conventional crawl-based web archiving methods, this allows even intricate websites, such as those with embedded media, complex Javascript, user-specific content and interactions, and other dynamic elements, to be captured and faithfully restaged.
Well, you could do that for a long time with MHTML, WARC, etc. downloaders, including those available in browsers via "Save Page as", though CSS imports aren't covered by older tools (are they by yours?). Anyway, congrats for completing this as a Rust first-timer project, which certainly speaks to the quality of the Rust ecosystem. For using this approach as an offline browser, of course, the problem is that Ajax-heavy pages using JavaScript to load content won't work, including every React and Vue site created in the last five years (but you could make the point those aren't worth your attention as a reader anyway).
I think there's an issue with opening a tar file, e.g. if sent to someone who needs to view the document but isn't techy.
It seems to me that having one file that any browser can easily open (and view without an Internet connection) is a big advantage over having a directory with assets alongside the .html file. It may be one of those things that make life easier even though nobody really complains about the way pages are usually saved. I hope more browsers add support for saving pages as MHTML in the near future so that we wouldn't need tools like this one.
It is done, option -i in the latest version (2.0.3) now replaces all src="..." attributes with src="<data URL for a transparent PNG pixel>" within IMG tags.
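E.g. (a sketch, assuming the generated page is redirected into a file):

$ monolith -i https://www.grepular.com > page-without-images.html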
The important part of what he said was "providing the output that you get in the terminal". Simply stating "I got an error" and expecting the developer(s) to use clairvoyance to glean further detail is far from a helpful way to report a problem. Perhaps dropping the details in a pastebin site and linking to that would be a possible alternative? Or just including the error message here if it is short enough, though HN shouldn't really be used as a tech support channel.
It for sure would help with those SPA websites that get their DOM fully generated by JS. A web extension that saves the current DOM tree as HTML would perhaps do a better job, especially when it comes to resources which require some web-based authentication.
The idea is almost identical, yet saving as .webarchive is only supported by Safari, and it's also not a plaintext format, hence can't be edited as easily.
That's an interesting question. I think it depends on how the given modal is implemented, but closing them should technically work (unless the page is saved with JavaScript code removed [-j flag]).
Those notifications can easily be removed from the saved file using any text editor, should be pretty easy if you know how to edit HTML code. I don't think removing it would violate anything since "this website" will no longer really be a website but rather a local document at that point.