— This command hits servers as fast as possible. Not sorry. I have encountered a very small number of sites-I-care-to-mirror that have any sort of mitigation for this. The only site I'm IP banned from right now is http://elm-chan.org/ and that's just because I haven't cared to power-cycle my ISP box or bother with VPN. If you want to be a better neighbor than me, look into wget's `--wait`/`--waitretry`/`--random-wait`.
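If you do want to throttle, a sketch of the gentler variant looks something like this; the flag values are arbitrary examples and the rest of the invocation is just a placeholder, not my actual command:

```sh
# Politer mirroring: pause ~1 second between requests (randomly jittered by
# --random-wait) and back off up to 10 seconds between retries.
wget --mirror --page-requisites --convert-links \
     --wait=1 --random-wait --waitretry=10 \
     'https://www.example.com/'
```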
— The only part of this I'm actively unhappy with is the fixed version number in my fake User-Agent string. I go in and bump it to whatever version is current every once in a while. I am tempted to try automating it with an additional call to `date`, assuming a six-week major-version cadence.
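A sketch of what that `date` automation could look like; the anchor version/date pair below is a made-up placeholder you'd fill in yourself, the UA string is just the usual Firefox shape, and the six-week cadence is only an approximation of real release schedules:

```sh
# Guess a plausible browser major version from today's date, assuming a
# six-week release cadence counted from an anchor (version, date) pair.
# Both anchor values are placeholders, not real release data.
ANCHOR_VERSION=100
ANCHOR_DATE='2022-01-01'
WEEKS_SINCE=$(( ( $(date +%s) - $(date -d "$ANCHOR_DATE" +%s) ) / (7 * 24 * 60 * 60) ))
GUESS=$(( ANCHOR_VERSION + WEEKS_SINCE / 6 ))
UA="Mozilla/5.0 (X11; Linux x86_64; rv:${GUESS}.0) Gecko/20100101 Firefox/${GUESS}.0"
# Pass it along with: wget --user-agent="$UA" ...
echo "$UA"
```

(`date -d` is GNU-specific; BSD date spells it differently.)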
— The `--reject-regex` is a hack to work around a lot of CMSes I've encountered where it's possible to build up links with an infinite number of path separators, e.g. `www.example.com///whatever` containing a link to `www.example.com////whatever` containing a link to…
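For concreteness, a sketch of that kind of pattern; the URL is a placeholder and the regex is just the idea, not necessarily the exact one I use:

```sh
# Reject any URL containing a doubled slash after the scheme, which is what
# the infinite ///-nesting produces. The [^:] keeps the // in https:// itself
# from matching. Note this also drops legitimate URLs with // in the path.
wget --mirror --reject-regex '[^:]//' 'https://www.example.com/'
```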
— I am using wget1 aka wget. There is a wget2 project, but last time I looked into it wget2 did not support something I needed. I don't remember what that something was lol
— I have avoided WARC because I usually prefer the ergonomics of having separate files, and because WARC seems more focused on use cases where one does multiple archives over time (as is the case for the Wayback Machine or a search engine), whereas my archiving style is more one-and-done. I don't tend to back up sites that are actively changing/maintained.
— However I do like to wrap my mirrored files in a store-only Zip archive when there are a great number of mostly-identical pages, like for web forums. I back up to a ZFS dataset with ZSTD compression, and the space savings can be quite substantial for certain sites. A TAR compresses just as well, but a `zip -0` will have a central directory that makes it much easier to browse later.
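The wrapping step itself is nothing fancy; something like this, with an illustrative path:

```sh
# Store-only (no DEFLATE) Zip: let the filesystem's ZSTD do the compressing,
# while the Zip central directory keeps the contents listable without extraction.
zip -0 -r forums.example.com.zip forums.example.com/
# Browse it later:
unzip -l forums.example.com.zip | less
```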
Here is an example of the file usage for http://preserve.mactech.com with separate files vs plain TAR vs DEFLATE Zip archive vs store-only Zip archive. These are all on the same ZSTD-compressed dataset and the DEFLATE example is here to show why one would want store-only when fs-level compression is enabled.
```
[lammy@popola#WWW] zfs list spinthedisc/Backups/WWW
NAME                     USED  AVAIL  REFER  MOUNTPOINT
spinthedisc/Backups/WWW  772G  299G   772G   /spinthedisc/Backups/WWW
[lammy@popola#WWW] zfs get compression spinthedisc/Backups/WWW
NAME                     PROPERTY     VALUE  SOURCE
spinthedisc/Backups/WWW  compression  zstd   local
```
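If you want to reproduce that kind of comparison yourself, the gist is logical size vs. what ZFS actually stores after compression; the filenames below are illustrative, not the real archives:

```sh
# Apparent (logical) size vs. on-disk size after ZSTD, for each packaging
# of the same mirror. --apparent-size is GNU du.
du -sh --apparent-size preserve.mactech.com/ preserve.mactech.com.tar \
                       preserve.mactech.com-deflate.zip preserve.mactech.com-store.zip
du -sh                 preserve.mactech.com/ preserve.mactech.com.tar \
                       preserve.mactech.com-deflate.zip preserve.mactech.com-store.zip
```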
```
[lammy@popola#WWW] ls
Academic      DIY          Medicine         SA
Animals       Doujin       Military         Science
Anime         Electronics  most_wanted.txt  Space
Appliance     Fantasy      Movies           Sports
Architecture  Food         Music            Survivalism
Art           Games        Personal         Theology
Books         History      Philosophy       too_big_for_old_hdds.txt
Business      Hobby        Photography      Toys
Cars          Humor        Politics         Transportation
Cartoons      Kids         Publications     Travel
Celebrity     LGBT         Radio            Webcomics
Communities   Literature   Railroad
Computers     Media        README.txt
```
Also I lied and don't have a full TiB yet ;) Some of this could stand to be re-organized. Since I've gotten more into it, I've gotten better at anticipating an ideal directory depth/specificity at archive time instead of trying to come back to it later. For example, `DIY` (i.e. home improvement) should go into `Hobby`, which did not exist at the time; `SA` (SomethingAwful) should go into `Communities`, which also did not exist at the time; `Cars` into `Transportation`; etc.
`Personal` is the directory that's been hardest to sort, because personal sites are one of my fav things to back up but also one of the hardest to organize when they reflect diverse interests. For now I've settled on a hybrid approach. If a site is geared toward one particular interest or subculture, it gets sorted into `Personal/<Interest>`, like `Academics`, `Authors`, `Artists`, or `Goth` (loads of '90s goths had web pages for some reason). Sites reflecting The Style At The Time might get sorted into `1990s` for a blinking-construction-GIF Tripod/Angelfire site or `2000s` for an early blog. Sometimes I sort personal sites by generation, like `GenX` or `Boomer` (said in a loving way — Boomers did nothing wrong), when they reflect interests more typical of one particular generation.
Maybe save the log automatically? And then check and report unresolved errors, either at the end of the function or, better, in a separate one, so the log can be reinspected at any time.
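A minimal sketch of that idea, with the mirror command itself as a placeholder and grep patterns that match common wget failure messages:

```sh
# Save the full wget log next to the mirror, then scan it for failures in a
# separate function so the report can be regenerated from the log at any time.
LOG="wget-$(date +%Y%m%d-%H%M%S).log"
wget --mirror --page-requisites --convert-links -o "$LOG" 'https://www.example.com/'

report_errors() {
  grep -E 'ERROR [0-9]+|Unable to establish SSL connection|failed:' "$1"
}
report_errors "$LOG"
```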
I have encountered "GnuTLS: The TLS connection was non-properly terminated. Unable to establish SSL connection." multiple times, and the retry options seem to be useless when that happens. Some searches suggest it could be related to TLS handshake fragmentation, but regardless, wget ought to retry when the relevant options are set. Manually retrying does download the missing URLs; otherwise, mirroring jobs end up randomly incomplete.
It's weirdly specific, but I remember old versions of Go caused that error: the final packet (close_notify) sent to close the connection was set with the wrong alert level.
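One workaround, assuming the log from the run was kept: scrape the URLs that died with that error back out of it and hand them to a second wget pass. The grep context window here is a guess and may need tuning for your log format:

```sh
# Pull the URLs whose fetch ended in the SSL failure out of a saved log and
# retry just those, recreating the same host/path directory layout.
grep -B6 'Unable to establish SSL connection' mirror.log \
  | grep -oE 'https?://[^ ]+' | sort -u > retry.txt
wget --force-directories --timestamping --input-file=retry.txt
```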