Show HN: Crawl a modern website to a zip, serve the website from the zip (github.com/potahtml)
223 points by unlog 6 days ago | hide | past | favorite | 66 comments





Nice work!

Obligatory mention for RedBean, the server that you can package along with all assets (incl db, scripting and TLS support) into a single multi-platform binary.

https://redbean.dev/


Wow this is so cool. I like these types of single binary things. Even Pocketbase is like that where you can just compile your whole app into one Go binary and just rsync that over to your server and run it.

Go makes a binary for each target.

RedBean uses magic from @jart to run the same binary on multiple platforms.


Microsoft Internet Explorer (no, I'm not using it personally) had a file format called *.mht that could save an HTML page together with all the files referenced from it, like inline images. I believe you could not store more than one page in one *.mht file, though, so your work could be seen as an extension.

Although UNIX philosophy posits that it's good to have many small files, I like your idea for its contribution to reducing clutter (imagine running 'tree' in both scenarios) and also avoiding running out of inodes in some file systems (maybe less of a problem nowadays in general, not sure as I haven't generated millions of tiny files recently).


.mht is alive and well. It is a MIME wrapper on the files and is generated by Chrome, Opera, and Edge's save option "Webpage as single file", defaulting to an extension of .mhtml.
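Mechanically, an .mht/.mhtml file is just a multipart/related MIME message: the HTML as the root part, with the referenced resources attached and linked by Content-ID. A rough sketch using Python's stdlib email module (the page content, header values, and cid are all made up):

```python
# Minimal sketch of the MIME structure behind .mht/.mhtml files.
# All content here is illustrative, not from any real page.
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Snapshot of example.com"
msg["Snapshot-Content-Location"] = "https://example.com/"

# Root part: the HTML, referencing an image by Content-ID.
msg.set_content('<html><body><img src="cid:logo"></body></html>',
                subtype="html")

# add_related converts the message to multipart/related and
# attaches the referenced resource under the given cid.
msg.add_related(b"\x89PNG fake image bytes",
                maintype="image", subtype="png", cid="<logo>")

mhtml_bytes = msg.as_bytes()
```

Browsers that support the format resolve the `cid:` references against the attached parts when rendering.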

When I last looked Firefox didn't support it natively but it was a requested feature.


I use SingleFile on Firefox quite often for this purpose. https://addons.mozilla.org/en-US/firefox/addon/single-file/

> When I last looked Firefox didn't support it natively but it was a requested feature.

That sounds familiar, unfortunately


There are Firefox plug-ins that claim to support saving as mhtml; I have no experience with them.

I use it regularly. It works on static sites quite well, but subsites are not automatically saved, so not crawled.

Unfortunately it's not supported by Safari either.

Yes! You know, I was considering this the past couple of days; I was looking around for how to construct an `mhtml` file for serving all the files at the same time. Unrelated to this project, I had the use case of a client wanting to keep an offline version of one of my projects.

> Although UNIX philosophy posits that it's good to have many small files, I like your idea for its contribution to reducing clutter (imagine running 'tree' in both scenarios) and also avoiding running out of inodes in some file systems (maybe less of a problem nowadays in general, not sure as I haven't generated millions of tiny files recently).

Pretty rare for any website to have many files, as they optimize to have as few files as possible (fewer network requests, which could be slower than just shipping a big file). I have crawled the React docs as a test, and it's a zip file of 147 MB with 3,803 files (including external resources).

https://docs.solidjs.com/ is 12 MB (including external resources) with 646 files


Trying to use this for mirroring a documentation site. Disappointed that 1. it runs quite slowly, 2. it kept outputting error messages like "ProtocolError: Protocol error (Page.bringToFront): Not attached to an active page". Not sure of the reason.

If the URL is public you may post it here or in a GitHub issue, so I can take a look at what's wrong with it.

Couldn't reproduce it, but 'wget -m --page-requisites --convert-links <url>' did a good job for me. Never mind.

SingleFile extension is the modern equivalent these days.

I just opened an .mht file from 2000 on Edge/Mac the other day and it displayed just fine.

Wow! I never knew things like this existed! I always used wget (full below) but nowadays seemingly all sites are behind cloudflare so I need to pass a cookie too.

Glad to see easier methods!

  wget \
     --header "Cookie: <cf or other>" \
     --user-agent="<UA>" \
     --recursive \
     --level 5 \
     --no-clobber \
     --page-requisites \
     --adjust-extension \
     --span-hosts \
     --convert-links \
     --domains <example.com> \
     --no-parent \
         <example.com/sub>

I'm a big fan of modern JavaScript frameworks, but I don't fancy SSR, so I have been experimenting with crawling my own sites for uploading to hosts without having to do SSR. This is the result.

For a long crawling task, if it exits/breaks for any reason, does it save and resume on the next run?

The README says:

> Can resume if process exit, save checkpoint every 250 urls


Nice, better to make it a command-line option with a default value. 250 is too many for large files and slow connections.
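The checkpoint/resume behaviour being discussed can be sketched roughly like this (all names, the file format, and the interval are hypothetical, not the project's actual code):

```python
# Hypothetical sketch of checkpoint/resume for a crawler: persist the
# finished set and the pending frontier every N pages so an interrupted
# run can pick up where it left off.
import json
import os

CHECKPOINT = "checkpoint.json"

def load_state():
    """Restore a previous run's state, or start fresh from the root URL."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            s = json.load(f)
        return set(s["done"]), list(s["pending"])
    return set(), ["https://example.com/"]

def save_state(done, pending):
    """Write state atomically so a crash never leaves a half-written file."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"done": sorted(done), "pending": pending}, f)
    os.replace(tmp, CHECKPOINT)

def crawl(fetch, every=250):
    """fetch(url) returns the links found on that page."""
    done, pending = load_state()
    while pending:
        url = pending.pop()
        if url in done:
            continue
        for link in fetch(url):
            if link not in done:
                pending.append(link)
        done.add(url)
        if len(done) % every == 0:  # checkpoint every N finished URLs
            save_state(done, pending)
    save_state(done, pending)
    return done
```

Making `every` a flag is then a one-line change wherever the CLI arguments are parsed.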

My 5 cents:

- status codes 200-299 are all OK

- status codes 300-399 are redirects, and also can be OK eventually

- 403 in my experience occurs quite often, where it is not an error, but a suggestion that your user agent is not OK

- robots.txt should be scanned to check if any resource is prohibited, or if there are speed requirements. It is always better to be _nice_. I plan to add something like that to my own project, which is also missing it

- It would be interesting to generate hash from app, and update only if hash is different?
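The status-code bucketing in the bullets above could be sketched as a small helper (illustrative only, not any project's actual code; the 403 case follows the comment about misleading user-agent rejections):

```python
def classify(status: int) -> str:
    """Bucket an HTTP status code per the rules of thumb above."""
    if 200 <= status < 300:
        return "ok"            # all 2xx are success
    if 300 <= status < 400:
        return "redirect"      # may resolve to OK after following
    if status == 403:
        return "check-user-agent"  # often a UA block, not a real error
    return "error"
```

A crawler could warn on "error" but retry "check-user-agent" responses with a different user agent before giving up.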


Regarding status codes: I am displaying the list because on a JavaScript-driven application you mostly don't want codes other than 200 (besides media).

I thought about robots.txt, but as this is software that you are supposed to run against your own website I didn't consider it worthwhile. You have a point on speed requirements and prohibited resources (but it's not like skipping over them adds any security).

I haven't put much time/effort into an update step. Currently, it resumes if the process exited via checkpoints (it saves the current state every 250 URLs; if any are missing it can continue, else it is done).

Thanks, btw what's your project!? Share!


I agree with your points.

You might be interested in reddit webscraping thread https://www.reddit.com/r/webscraping/

My passion project is https://github.com/rumca-js/Django-link-archive

Currently I use only one thread for scraping, I do not require more. It gets the job done. Also I know too little to play more with python "celery" threads.

My project can be used for various things, depending on needs. Recently I am playing with using it as a 'search engine'. I am scraping the Internet to find cool stuff. Results are in https://github.com/rumca-js/Internet-Places-Database. Not all domains are interesting though.


> Status codes, I am displaying the list because mostly on a JavaScript driven application you don't want other codes than 200 (besides media).

What? Why? Regardless of the programming language used to generate content, the standard, well-known HTTP status codes should be returned as expected. If your JS-served site gives me a 200 code when it should be a 404, you're wrong.


I think you are misunderstanding: your application is expected to give mostly 200 codes. If you get a 404, then a link is broken or a page is misbehaving, which is exactly why that page URL is displayed on the console with a warning.

In many cases, 403 is really 404 on things like S3.

How is it different from HTTrack? And what about the media extension, which one is supported and which one isn’t? Sometimes when I download some sites with HTTrack, some files just get ignored because by default it looks only for default types, and you have to manually add them there.

Big fan of HTTrack! It reminds me of the old days and makes me sad about the current state of the web.

I am not sure if HTTrack has progressed beyond fetching resources; it's been a long time since I last used it. What my project does is spin up a real web browser (Chrome in headless mode, which means it's hidden) and let the JavaScript on that website execute, which means it will display/generate the fancy HTML that you can then save as-is into an index.html. It saves all kinds of files; it doesn't care about the extension or MIME type, it tries to save them all.


> It saves all kinds of files; it doesn't care about the extension or MIME type, it tries to save them all.

That’s awesome to know, I will give it a try. One website I remember trying to download had all sorts of animations with a .riv extension, and it didn’t work well with HTTrack. Will try it with this soon, thanks for sharing it!


Let me know how that goes, I am interested!

The libwebsockets server (https://libwebsockets.org) supports serving directly from zip archives. Furthermore, if a URL is mapped to a compressed archive member, and assuming the browser can accept gzip-compressed files (as most can), then the compressed data is copied from the archive over HTTP to the browser, without decompression or conversion by the server. The server does a little bit of header fiddling but otherwise sends the raw bytes to the browser, which automatically decompresses them.
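The "header fiddling" works because ZIP members are stored as raw DEFLATE streams, and the gzip format is just DEFLATE with a 10-byte header and an 8-byte trailer. A Python sketch of the idea (illustrative, not libwebsockets' actual code):

```python
# Demonstrates that a raw DEFLATE stream (as stored inside a ZIP) can be
# turned into a valid gzip body without recompressing: just add the gzip
# header and the CRC32/size trailer around the same bytes.
import gzip
import struct
import zlib

page = b"<html><body>hello</body></html>"

# wbits=-15 produces a raw DEFLATE stream, the same encoding ZIP uses.
co = zlib.compressobj(9, zlib.DEFLATED, -15)
raw_deflate = co.compress(page) + co.flush()

gzip_body = (
    # 10-byte gzip header: magic, CM=deflate, no flags, zero mtime,
    # default XFL, OS=unknown (0xff).
    b"\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\xff"
    + raw_deflate
    # Trailer: CRC32 and uncompressed size, little-endian.
    + struct.pack("<II", zlib.crc32(page), len(page) & 0xFFFFFFFF)
)
```

A server doing this sends `gzip_body` with `Content-Encoding: gzip`, and the browser recovers the original page; the expensive compression work was already done when the archive was built.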

I used to use MAFF (Mozilla Archive Format)[1] a lot back in the day. I was very upset when they ended the support[2].

I never dug deeper into whether I can unzip and decode the packing, but saving as a simple ZIP does somewhat guarantee future-proofing.

[1] https://en.wikipedia.org/wiki/Mozilla_Archive_Format

[2] https://support.mozilla.org/en-US/questions/1180271


I'm curious about this vs a .har file

In Chrome Devtools, network tab, last icon that looks like an arrow pointing into a dish (Export har file)

I guess a .har file has a ton more data though. I used it to extract data from sites that either intentionally or unintentionally make it hard to get data. For example, when signing up for an apartment, the apartment management site used pdf.js and provided no way to save the PDF. So I saved the .har file and extracted the PDF.
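Since HAR is plain JSON, that kind of extraction is a short script. A sketch (field names follow the HAR 1.2 layout; the helper itself and the paths are made up):

```python
# Hypothetical helper: pull a response body (e.g. a PDF) back out of a
# saved HAR file. Bodies may be base64-encoded, flagged via "encoding".
import base64
import json

def extract(har_path, url_substring, out_path):
    """Save the first response whose URL contains url_substring."""
    with open(har_path) as f:
        har = json.load(f)
    for entry in har["log"]["entries"]:
        if url_substring in entry["request"]["url"]:
            content = entry["response"]["content"]
            text = content.get("text", "")
            data = (base64.b64decode(text)
                    if content.get("encoding") == "base64"
                    else text.encode())
            with open(out_path, "wb") as out:
                out.write(data)
            return True
    return False
```

Usage would be something like `extract("session.har", "lease.pdf", "lease.pdf")` after exporting the HAR from devtools.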


IIUC HAR files contain an awful lot of data that you would not want to end up being stored in a web page archive:

- irrelevant http headers (including cache control) taking up too much space

- auth data / cookies, credentials, personal info that you don't want saved for privacy and security reasons, especially if you want to share your archive.

HAR is also not very efficient for this use case: it's a list of requests represented in JSON. A folder representation is far better, for storage efficiency as well as for reading the archive (with HAR you basically need to implement custom logic that can read the page by replaying the requests).


I like the approach here! Saving to a simple zip file is elegant. I worked on a similar idea years ago [0], but made the mistake of building it as a frontend. In retrospect, I would make this crawl using a headless browser and serve it via a web application, like you're doing.

I would love to see better support for SPAs, where we can't just start from a sitemap. If you're interested, you can check out some of the code from my old app for inspiration on how to crawl pages (it's Electron, so it shares a lot of interfaces with Puppeteer) [1].

[0] https://github.com/CGamesPlay/chronicler/tree/master [1] https://github.com/CGamesPlay/chronicler/blob/master/src/mai...


It tries to fetch a sitemap in case there's some missing link, but it starts from the root and crawls internal links. There's a new mode added this morning for SPAs with the option `--spa` that will write the original HTML instead of the generated/rendered one. That way some apps _will_ work better.

Understood that this is early times, are you considering a licence to release it under?

+1 can you add a license file with an MIT or BSD or whatever your preference is? (It's very cool. I'd love to help with this project, I'm guessing others would as well)

Sure, I forgot about that detail, what license do you suggest?

MIT and BSD seem to be by far the most common these days (I generally do MIT personally)

added

AGPL

Is the output similar to a web archive file (warc)?

That's something I haven't explored; sounds interesting. Right now, the zip file contains a mirror of the files found on the website when loaded in a browser. I ended up with a zip file by luck, as mirroring to the file system gives predictable problems with file/folder names.



What's the benefit of this approach?

One possible advantage I see is it creates a 1:1 correspondence between a website and a file.

If what I care about is the website (and that's usually going to be the case), then there's a single familiar box containing all the messy details. I don't have to see all the files I want to ignore.

That might not be a benefit for you; not having used it, it is only a theoretical benefit in an unlikely future for me.

But just from the title of the post, I had a very clear picture of the mechanism, and it was not obvious why I would want to start with a different mechanism (barring ordinary issues with open source projects).

But that's me and your mileage may vary.


That the page HTML is indexable by search engines without having to render on the server. Such as unzipping to a directory served by nginx. You may also use it for archiving purposes, or for having backups.

How does it work with single page apps? If the data is loaded from the server, does it save the page contents as full, or just the source of the page?

It saves the generated/rendered HTML, but I have just added a `spa` mode that will save the original HTML without modifications. This makes most simple web apps work.

I have also updated the local server to fetch missing resources from origin. For example, a webapp may load some JS modules only when you click buttons or links; when that happens and the requested file is not in the zip, it will fetch it from origin and update the zip. So you can mostly back up an SPA by crawling it first and then using it for a bit to fetch the missing resources/modules.
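The fallback described above might look roughly like this helper (names, the archive path, and the origin URL are illustrative, not the project's actual implementation):

```python
# Hypothetical "serve from zip, fetch missing from origin" helper:
# return the named member from the archive; on a miss, fetch it from
# the origin site and append it to the zip for next time.
import urllib.request
import zipfile

def read_or_fetch(archive, name, origin):
    # Mode "a" lets us both read existing members and append new ones.
    with zipfile.ZipFile(archive, "a") as zf:
        try:
            return zf.read(name)  # ZipFile.read raises KeyError on a miss
        except KeyError:
            body = urllib.request.urlopen(origin + "/" + name).read()
            zf.writestr(name, body)  # cache in the archive
            return body
```

A local server would call `read_or_fetch("site.zip", path, "https://example.com")` per request, so the archive fills in lazily as the app is used.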


Note: I've recently released a tool that can find all chunks of an SPA; it works for popular configurations (like webpack or ES modules):

https://github.com/zb3/getfrontend


Thanks for sharing!

So a modern chm (Microsoft Compiled HTML help file)

Seems like a very useful tool to impersonate websites. Useful to scammers. Why would someone crawl their own website?

Scammers don't need this to copy an existing website, and I could see plenty of legitimate uses. Maybe you're redoing the website but want to keep the previous site around somewhere, or you want an easy way to archive a site for future reference. Maybe you're tired of paying for some hosted CMS but you want to keep the content.

All the scenarios you described can be achieved by having access to the source code, assuming you own it.

Lots of things are possible with access to source code that are still easier when someone writes a tool for that scenario.

Crawling a build you already have isn't one of them

The website in question may be a dynamic website (e.g., WordPress, MediaWiki, or whatever other CMS or custom web app) and you either want a snapshot of it for backup, or you run it locally and want a static copy to host it elsewhere on something that only supports static files.

> Why would someone crawl their own website?

My main use case is the docs site https://pota.quack.uy/ , which Google cannot index properly. On https://www.google.com/search?q=site%3Apota.quack.uy you will see some titles/descriptions won't match what the content of the page is about. As the full site is rendered client side, via JavaScript, I can just crawl it myself and save the HTML output to actual files. Then, I can serve that content with nginx or any other web server without having to do the expensive thing of SSR via Node.js. Not to mention that being able to do SSR with modern JavaScript frameworks is not trivial, and requires engineering time.


I’m not quite understanding: you’re saying you deploy your site one way, then crawl it, then redeploy it via the zipfile you created? And why is SSR relevant to the discussion?

Modern websites execute JavaScript that renders DOM nodes that are displayed in the browser.

For example if you look at this site on the browser https://pota.quack.uy/ and do `curl https://pota.quack.uy/` do you see any of the text that is rendered in the browser as output of the curl command?

You don't, because curl doesn't execute JavaScript, and that text comes from JavaScript. One way to fix this problem is by having a Node.js instance running that does SSR, so when your curl command connects to the server, a Node instance executes JavaScript whose output is streamed/served to curl. (Node is running a web server.)

Another way, without having to execute JavaScript on the server, is to crawl it yourself, let's say on localhost (you do not even need to deploy), then upload the result to a web server that can serve the files.


I want to take down a full copy of a site hosted on Squarespace before moving off of it.

I have no access to source and can't even republish the site directly without violating Squarespace's copyright.

But having the old site frozen in amber will be great for the redesign.


I think you can also screenshot full length in Chrome-based browsers; do both desktop & mobile widths.

It would be a good backup for the backup, & your designer will thank you.



