
Missed the edit window, but here's the command I use. Newlines added here for clarity.

  wget-mirror() {
    wget --mirror --convert-links --adjust-extension --page-requisites \
    --no-parent --content-disposition --content-on-error \
    --header="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" \
    --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:129.0) Gecko/20100101 Firefox/129.0" \
    --restrict-file-names="windows,nocontrol" -e robots=off --no-check-certificate \
    --no-hsts --retry-connrefused --retry-on-host-error --reject-regex=".*\/\/\/.*" "$1"
  }

Some notes:

— This command hits servers as fast as possible. Not sorry. I have encountered a very small number of sites-I-care-to-mirror that have any sort of mitigation for this. The only site I'm IP banned from right now is http://elm-chan.org/ and that's just because I haven't cared to power-cycle my ISP box or bother with VPN. If you want to be a better neighbor than me, look into wget's `--wait`/`--waitretry`/`--random-wait`.
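If you do want that, and you also swap the trailing `"$1"` for `"$@"` so extra flags pass straight through to wget, a politer run is just a matter of tacking them onto the call. The delay values below are arbitrary placeholders, not something I've tuned:

  # Throttled run: ~1s between requests, jittered by --random-wait,
  # with up to 60s of backoff between retries of a failing URL.
  wget-mirror --wait=1 --random-wait --waitretry=60 https://www.example.com/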

— The only part of this I'm actively unhappy with is the fixed version number in my fake User-Agent string. I go in and increment it to whatever version's current every once in a while. I am tempted to try automating it with an additional call to `date` assuming a six-week major-version cadence.
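Something like this would probably get close enough (totally untested; BASE_VERSION/BASE_DATE are placeholder anchor values to replace with a release you've actually verified, and the fixed-cadence assumption will drift over time):

  # Sketch: guess the current Firefox major version from today's date,
  # assuming a fixed release cadence from a known anchor release.
  # Uses GNU date (-d); BSD/macOS date needs different flags.
  firefox-major() {
    local BASE_VERSION=129
    local BASE_DATE="2024-08-06"   # supposed release date of $BASE_VERSION -- verify
    local CADENCE_DAYS=42          # the six-week assumption
    local days=$(( ( $(date +%s) - $(date -d "$BASE_DATE" +%s) ) / 86400 ))
    echo $(( BASE_VERSION + days / CADENCE_DAYS ))
  }

and then splice it into the UA string:

  --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:$(firefox-major).0) Gecko/20100101 Firefox/$(firefox-major).0"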

— The `--reject-regex` is a hack to work around lots of CMSes I've encountered where it's possible to build up links with an infinite number of path separators, e.g. a `www.example.com///whatever` page containing a link to `www.example.com////whatever` containing a link to…

— I am using wget1 aka wget. There is a wget2 project, but last time I looked into it wget2 did not support something I needed. I don't remember what that something was lol

— I have avoided WARC because I usually prefer the ergonomics of having separate files, and because WARC seems more focused on use cases where one does multiple archives over time (as is the case for the Wayback Machine or a search engine), whereas my archiving style is more one-and-done. I don't tend to back up sites that are actively changing/maintained.

— However I do like to wrap my mirrored files in a store-only Zip archive when there are a great number of mostly-identical pages, like for web forums. I back up to a ZFS dataset with ZSTD compression, and the space savings can be quite substantial for certain sites. A TAR compresses just as well, but a `zip -0` will have a central directory that makes it much easier to browse later.
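Concretely, the wrap step is just something like this (from memory, untested):

  # -0 = store only (no DEFLATE), -r = recurse, -q = quiet.
  # This leaves compression to ZFS's zstd instead of baking
  # already-compressed DEFLATE streams into the archive.
  zip -0 -r -q preserve.mactech.com.store.zip preserve.mactech.com/
  # ...then delete the loose tree once the zip has been spot-checked.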

Here is an example of the file usage for http://preserve.mactech.com with separate files vs plain TAR vs DEFLATE Zip archive vs store-only Zip archive. These are all on the same ZSTD-compressed dataset and the DEFLATE example is here to show why one would want store-only when fs-level compression is enabled.

  982M    preserve.mactech.com.deflate.zip
  408M    preserve.mactech.com.store.zip
  410M    preserve.mactech.com.tar
  3.8G    preserve.mactech.com
Also I lied and don't have a full TiB yet ;)

  [lammy@popola#WWW] zfs list spinthedisc/Backups/WWW
  NAME                      USED  AVAIL     REFER  MOUNTPOINT
  spinthedisc/Backups/WWW   772G   299G      772G  /spinthedisc/Backups/WWW


  [lammy@popola#WWW] zfs get compression spinthedisc/Backups/WWW
  NAME                     PROPERTY     VALUE           SOURCE
  spinthedisc/Backups/WWW  compression  zstd            local



  [lammy@popola#WWW] ls 
  Academic                        DIY                             Medicine                        SA
  Animals                         Doujin                          Military                        Science
  Anime                           Electronics                     most_wanted.txt                 Space
  Appliance                       Fantasy                         Movies                          Sports
  Architecture                    Food                            Music                           Survivalism
  Art                             Games                           Personal                        Theology
  Books                           History                         Philosophy                      too_big_for_old_hdds.txt
  Business                        Hobby                           Photography                     Toys
  Cars                            Humor                           Politics                        Transportation
  Cartoons                        Kids                            Publications                    Travel
  Celebrity                       LGBT                            Radio                           Webcomics
  Communities                     Literature                      Railroad
  Computers                       Media                           README.txt


Some of this could stand to be re-organized. Since I've gotten more into it I've gotten better at anticipating an ideal directory depth/specificity at archive time instead of trying to come back to them later. For example, `DIY` (i.e. home improvement) should go into `Hobby`, which did not exist at the time; `SA` (SomethingAwful) should go into `Communities`, which also did not exist at the time; `Cars` into `Transportation`; etc.

`Personal` is the directory that's been hardest to sort, because personal sites are one of my fav things to back up but also one of the hardest things to organize when they reflect diverse interests. For now I've settled on a hybrid approach. If a site is geared toward one particular interest or subculture, it gets sorted into `Personal/<Interest>`, like `Academics`, `Authors`, `Artists`, `Goth` (loads of '90s goths had web pages for some reason). Sites reflecting The Style At The Time might get sorted into `1990s` for a blinking-construction-GIF Tripod/Angelfire site or `2000s` for an early blog. Sometimes I sort personal sites by generation, like `GenX` or `Boomer` (said in a loving way — Boomers did nothing wrong), when they reflect interests more typical of one particular generation.



Maybe save the log automatically? And then check and report unresolved errors, either at the end of the function or, better, in a separate one, so the log can be reinspected at any time.
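Roughly what I mean (untested, and the grep patterns are just guesses at the messages worth flagging):

  # Keep a timestamped transcript of each run, then surface anything that
  # still looks broken so the log can be re-checked later. Assumes the
  # wget-mirror function from upthread; wget logs to stderr, hence 2>&1.
  log="wget-mirror.$(date +%Y%m%d-%H%M%S).log"
  wget-mirror https://www.example.com/ 2>&1 | tee "$log"
  grep -nE "Unable to establish SSL connection|ERROR [0-9]+|failed:" "$log" \
    || echo "no obvious errors in $log"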

I have encountered "GnuTLS: The TLS connection was non-properly terminated. Unable to establish SSL connection." multiple times, and the retry options seem to be useless when that happens. Some searches suggest it could be related to TLS handshake fragmentation, but either way you'd expect wget to retry when the relevant options are set, and it doesn't. Retrying manually does download the missing URLs; otherwise mirroring jobs end up randomly incomplete.
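As a crude workaround, something like this loop seems plausible (untested sketch; it leans on the fact that re-running `--mirror` mostly skips files that are already present and unchanged):

  # Re-run the mirror until a pass finishes without the GnuTLS error,
  # giving up after five attempts. wget-mirror is the function upthread.
  for attempt in 1 2 3 4 5; do
    wget-mirror https://www.example.com/ 2>&1 | tee "run-$attempt.log"
    grep -q "Unable to establish SSL connection" "run-$attempt.log" || break
    echo "attempt $attempt still had TLS failures; retrying..."
  done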


It's weirdly specific, but I remember old versions of Go causing that error. The final packet (close_notify) used to close the connection was sent with the wrong error level.


This is great, thanks for sharing it with that additional context.


Wow. Only 772GB. Way under 1TB. Liar!!



