This Page is Designed to Last (jeffhuang.com)
1321 points by tannhaeuser on Dec 19, 2019 | 446 comments

There's no reason why a web browser bookmark action doesn't automatically create a WARC (web archive) format.

Heck, with the cost of storage so low, recording every webpage you ever visit in searchable format is also very realistic. Imagine having the last 30 years of web browsing history saved on your local machine. This would especially be useful when in research mode and deep diving a topic.

[1] https://github.com/machawk1/warcreate

[2] https://github.com/machawk1/wail

[3] https://github.com/internetarchive/warcprox

EDIT: I forgot to mention https://github.com/webrecorder/webrecorder (the best general purpose web recorder application I have used during my previous research into archiving personal web usage)
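
To make the bookmark-to-archive idea concrete, here is a minimal sketch of what writing a single WARC record could look like, using only the Python standard library. It assumes the page body has already been fetched; the function names and the "on bookmark" hook are hypothetical, and the tools linked above handle far more (HTTP request/response records, compression, deduplication).

```python
import uuid
from datetime import datetime, timezone


def build_warc_resource_record(target_uri: str, payload: bytes,
                               content_type: str = "text/html") -> bytes:
    """Serialize one WARC/1.0 'resource' record for an already-fetched page."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    # Record layout: header lines, a blank line, the payload, then two CRLFs.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + payload + b"\r\n\r\n"


def append_bookmark(warc_path: str, target_uri: str, payload: bytes) -> None:
    """Hypothetical 'on bookmark' hook: append one record to a WARC file."""
    with open(warc_path, "ab") as fh:
        fh.write(build_warc_resource_record(target_uri, payload))
```

Because records are simply concatenated, appending one per bookmark keeps the file valid as it grows.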

This was what made me convert from bookmarking to clipping pages into Evernote around 6-7 years ago. I realized I had this huge archive of reference bookmarks that were almost useless because 1) I could rarely find what I was looking for, if I even remembered I'd bookmarked something in the first place, and 2) if I did, it was likely gone anyway. With Evernote I can full text search anything I've clipped in the past (and also add additional notes or keywords to ease in finding or add reference info).

Since starting with replacing bookmarks, I've moved other forms of reference info in there, and now have a whole GTD setup there as well, which is extremely handy since I can search in one place for reference info and personal tasks (past and future). Only downside is I'm dependent on Evernote, but hopefully it manages to stick around in some form for a good while, and if it ever doesn't, I expect I'll be able to migrate to something similar.

Shout out to https://joplinapp.org/

I was an Evernote user when I was on macOS. When I switched to Linux, a proper web clipper was something I really missed. I'm now on Joplin and it does everything I used to use Evernote for and then some.

It even has vim bindings now!

As far as longevity goes, I think they got their archive / backup format right - it's just a tarball with markdown in it.

No need of proprietary code and apps why not build it into browsers. I have seen Firefox and Chrome can download web pages. So it will be nicer if they can download the bookmarked pages and store in a local html, css, image folder. I think it's pretty easy to achieve.

Also people need to move away from those esoteric reactjs, angular, vuejs and plethora of CMS as API or static site generators relying on some js framework which won't last even 2-3 years. Use a static site generator which can generate a plain html, like static site generators built on pandoc, python docutils or similar.

Personally I like restructuredText as the preferred format for content as its a complete specification and plain text. So the only thing in this article I will change is that content can also be in rst format and then generate html from it. Markdown is not a specification as each site implements their own markdown directives unlike restructuredtext specification and most of the parsers and tooling are little different from each other.

> Personally I like restructuredText as the preferred format for content as its a complete specification and plain text.

I have used rst intensively on a project. A few years later, I would be hard pressed to write anything in it and would need to start with a Quick Start tutorial. With all its faults, Markdown is simple enough that it can be (and is) used anywhere, so there is no danger of me forgetting its syntax (even if it wasn't much simpler to start with).

So personally I would prefer md over rst anytime.

> Markdown is not a specification

Not by that name... https://commonmark.org/

It is still not a specification like restructuredText[1]. Also wikiMarkup (which really started this markdown) is different from GitHub markdown, which is different from other markdown editors. Also many sites use their own markdown versions.

If you are in the restructuredText world, there is one specification and all implementations adhere to it, be it pandoc, sphinx, pelican, or nikola. The beauty of it is that it has extension mechanisms which give each tool enough room to develop. But the markup can be parsed by any tool.

[1] https://docutils.sourceforge.io/docs/ref/rst/restructuredtex...

I don't know why markdown is so popular other than maybe "it was easy to get running" or "works for me".

It's better than "designed by a committee" standards, but it lacks elegance or maybe craftsmanship.

Because it's inherently appealing, close to what you wanted intuitively, and if you're only dealing with a single implementation of it, works fairly well.

You don't really get bit by its lack of a standard and extensibility until after you've bought in.

It's essentially designed by the opposite of a committee -- rather than including everything but the kitchen-sink, it contains support for almost no usecases except the one. Which is very appealing, when you only have the one usecase.

Well rst is better than markdown from day one. The only reason it became famous is thanks to wikimarkup.

So markdown needs to thank the popularity of Wikipedia for its success, as rst did not have any application like Wikipedia. But still rst is used widely enough with its killer Sphinx, readthedocs and now its kind of de-facto documentation writing markup in Python and many open source software world.

Because you can teach someone markdown in five minutes. And even if they don't know all the ins and outs, the basics are pretty foolproof (paragraphs, headings, bold and italic).

> No need of proprietary code and apps

Joplin is free and open source.

This... I just found a plugin for the static site generator Pelican that is 7 years old that still works. After running Pelican you get plain HTML that can be hosted anywhere. I like Netlify, but other options like GitHub pages are also great. The author recommends not putting on GitHub Pages because they haven't found a working business model and might not be here in the future. But... GitHub has been taken over by Microsoft which is most likely not going bankrupt soon and Microsoft loves their backward compatibility so I am confident they won't screw GitHub up too much.

You can say the same about geocities when it was acquired by yahoo. But it didn't last and then now it's happening with yahoo-groups. So I am not hopeful if GitHub becomes a liability microsoft will keep it.

Yahoo! and Microsoft have very different business models. One is inherently more sustainable (selling software and services).

> No need of proprietary code and apps

Joplin is open source, which is a big part of the sell to me. It definitely isn't the best of all possible note taking systems that could ever exist, but it's the best open source one I've found so far, and I don't have time to write a better one at the moment.

> why not build it into browsers. I have seen Firefox and Chrome can download web pages. So it will be nicer if they can download the bookmarked pages and store in a local html, css, image folder. I think it's pretty easy to achieve

This is solving a different problem though. WARC/MHT and other solutions can do this. Joplin is more of a note taking system that allows ingesting content from the web into one's own local notebook, which is relevant to what the GP post was talking about - Evernote.

However, it would seem that "the modern web" is the now popular standard. 10 years ago it might have been Flash or Java web applets or whatever. Now it's JS. I'm not convinced that JS is any better than what it has replaced. However, people keep paying developers to write them, so presumably someone likes them.

> Also people need to move away from those esoteric reactjs, angular, vuejs and plethora of CMS as API or static site generators relying on some js framework which won't last even 2-3 years. Use a static site generator which can generate a plain html, like static site generators built on pandoc, python docutils or similar.

Agreed, but that's also not a problem that Joplin, Evernote, or any other such tool is going to be able to solve. Unless you are complaining that Joplin is an Electron app? That's my biggest issue with it personally. It runs well enough, but is definitely the heaviest application I use regularly, which is a little sad for a note taking program. On the other hand, I haven't found a better open source replacement for _Evernote_. There are lots of other open source note-taking programs though.

> Personally I like restructuredText as the preferred format for content as its a complete specification and plain text. So the only thing in this article I will change is that content can also be in rst format and then generate html from it. Markdown is not a specification as each site implements their own markdown directives unlike restructuredtext specification and most of the parsers and tooling are little different from each other.

reST is indeed very nice. At one point, I kept my personal notes as a Sphinx wiki with everything stored in reST. I found this to be less ergonomic than Evernote/Joplin, although in principle it could do all the same things that Joplin can do, and then some.

> No need of proprietary code and apps why not build it into browsers.

Safari does this. Pages added to the reading list archive the content for offline reading.

Joplin is open source.

Thanks a lot for the recommendation! I have been a little annoyed with Evernote not having an app for Ubuntu, which I recently started using quite heavily. So this looks very interesting!

The developer behind it is doing some awesome stuff so I decided to sponsor him on GitHub.

> Only downside is I'm dependent on Evernote, but hopefully it manages to stick around in some form for a good while, and if it ever doesn't, I expect I'll be able to migrate to something similar.

I have used Evernote and OneNote, but have finally, after a long interim period, resorted to using only markdown.

I have a "Notes" root folder and organize section groups and sections in subfolders. VSCode (or Emacs), with some tweaks, shortcuts, and extensions, provides a good-enough markdown editing experience. Like an extension that allows you to paste in images, storing it in a resources folder in the note's current location (yes, I see small problems with this down the road when re-organizing, but nothing that can't be handled).

For Firefox, I use the markdown-clipper extension the few times I would like to copy a whole article, it works well enough. Or I copy/paste what I need; mostly, I take my own summarized notes.

For syncing, I use Nextcloud, which also makes the notes available both for reading and editing on Android and iOS (I use both).

Up until very recently, I used Joplin, which also uses markdown, but there were two things I could not live with: it does not store the markdown files with a readable filename, e.g., its title, and being tied to a specific editor.

If you are mostly clipping and not writing your own notes, I can imagine my setup won't work well, or be very efficient.

I want to use a format that has longevity, and storing in a format that I cannot grep is out of the question.

Thanks for pointing out markdown-clipper!



You bookmark in Pinboard or Delicious and ArchiveBox saves the page. Handy.

>if I even remembered I'd bookmarked something in the first place

I had recently participated in a discussion on the problem of forgetting bookmarks[1].

Copying my workflow from there,

1. If the entire content should be easily viewed, then store via pocket extension.

2. If a partial content should be easily viewed i.e. some snippet with link to entire source, then store in notes (Apple).

3. If the content seems useful in the future, but it is okay to forget it, then I store it in the browser bookmarks.

But my workflow doesn't address the problem raised by Mr. Jeff Huang; if the Pocket app or Notes disappears, so do my archives. I think a self-hosted archive as mentioned by the parent is the way to go, but I don't think it's a seamless solution for a common web browser user.


My solution for a small subset of the forgetting problem:

I frequently see something and want to try it out the next time I want to do something else. So I emulate User Agent strings and append lots of "like [common thing I search for a lot]" to the bookmark. When I start typing into the search bar for those other things I'll be reminded of the bookmark.

For example, since file.io is semi-deprecated I decided to try out 0x0.st . But I kept forgetting when I actually needed to transfer a file, so I made a bookmark titled "0x0.st Like file io".

As a side note, I have a similar bash function called mean2use that I use to define aliases that wrap a command and ask me if I'd like to do it another way instead or if I'm sure I want to use the command. I've found this is a nice way to retrain my habits.

That was useful, can you add this to the original needgap thread I linked?

Disclaimer: needgap is a problem validation platform I built.

I'm glad you mention Evernote. I also use it for this, and also for many other purposes.

It is true that it is proprietary software, but it is worth mentioning that all the content can be exported as an .enex file, which is XML.

So, the data can be easily exported.

>easily exported

Have you actually looked at such an xml: https://gist.github.com/evernotegists/6116886

Exported sure, it's all there. But importing that into your new favorite notes application is not going to be trivial, especially not for regular users.

That's why I've decided to stick mostly to regular files in a filesystem.

Presumably "regular users" will not be individually writing XML parsing code to convert the notes. The developers of their "new favorite notes application" will do it (and if they can't be bothered, maybe it shouldn't be your "new favorite notes application").

Joplin, for example, can import notes exported from Evernote. It's just a menu option that even regular users should have no trouble employing.

I store bookmarks (i.e. URLs) in a simple .txt file. My text editor lets me click on them to bring them up in a browser.

> Only downside is I'm dependent on Evernote

No special software nor database required.
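
A plain-text bookmark file is also easy to script against. The following is a small sketch under stated assumptions: the file layout (free-form notes with URLs anywhere in a line) and the helper names are invented for illustration, not something the commenter described.

```python
import re
import webbrowser

# Deliberately loose pattern: anything from http(s):// up to whitespace or a quote.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(text: str) -> list[str]:
    """Return every http(s) URL in the text, in order of appearance."""
    # Strip punctuation that usually belongs to the sentence, not the URL.
    return [m.group(0).rstrip(".,;:)") for m in URL_PATTERN.finditer(text)]


def find_bookmarks(text: str, keyword: str) -> list[str]:
    """Return URLs from lines that mention the keyword (case-insensitive)."""
    hits = []
    for line in text.splitlines():
        if keyword.lower() in line.lower():
            hits.extend(extract_urls(line))
    return hits


def open_bookmark(url: str) -> bool:
    """Open the URL in whatever browser is the system default."""
    return webbrowser.open(url)
```

Since notes and URLs share a line, searching for a keyword in the notes finds the link, which is roughly what grep over the file would give you.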

What benefit does that provide compared to regular browser bookmarks? It doesn't seem to address either of the issues I mentioned.

1. it's independent of the browser

2. it works with any browser

3. I can move it to any machine and use it

4. It is not transmitted to the browser vendor

5. Being a text file, it is under my control

6. I can back it up

7. I don't need some database manager to access it

8. I can add notes and anything else to the file

9. It's stupid simple

What happens when the website itself becomes unavailable?

For me, this is the problem that Evernote solves - it saves the entire content of the page, images, text, and clickable links.

The link stops working.

I find Evernote's search isn't that good, at least in the free version. Often trying to remember keywords and using Google is faster.

I know about DevonThink; I've read good recommendations. But it's iOS/Mac only.

Any Evernote alternative for Win/Android with great search ?

Another commenter suggested https://joplinapp.org/, it has a nice search feature and has apps for most platforms.

You just opened up a world for me I hadn't thought about!

So simple! Thank you!

You're welcome! If you're interested in getting into GTD in Evernote (which I highly recommend), I wrote a blog post a while back about my setup: https://www.tempestblog.com/2017/08/16/how-i-stay-organized-...

Nice article, but http://www.thesecretweapon.org/ isn't reachable anymore. The Page didn't Last...

Oh the irony. I'll update my link to an archive post or reproduce the important parts. Thanks for pointing that out!

Edit: doesn't appear to be down, they're just using a self-signed cert.

Hmm, you're right. I tried "continue with insecure certificate" yesterday, but maybe I was too impatient.

Is there a good end-to-end encrypted alternative to Evernote?

First I've heard of web clipping. Looks like OneNote has web-clipper extensions, too. This is so great.

>There's no reason why a web browser bookmark action doesn't automatically create a WARC (web archive) format.


And I still remember the modem days where I would download entire websites because the ISP charged by the hour, and I'd read them offline to save money.

I can't put my finger on it but this has a sort of Dickensian quality to me.

I think this says something kind of profound about information and capitalism and whatnot.

No, it hasn't. Technology just wasn't there back then which caused significant cost per time unit, which makes it only fair to charge per time unit.

Yeah. In the dialup days, layers 1 and 2 of a home internet connection were a long-running phone call between your own modem and a modem at the ISP. You paid via your phone bill, for the duration of the call.

It says scarce resources are pricier. Welcome to the real world!

Personal wayback machines should be standard computing kit. I have had one since around 2013. Very bare bones demo: https://bpaste.net/show/3FBH6 It does much more than that. file:// is supported, for example, so you can recursively import a folder tree, and re-export it later if you wanted to.

Or in some random script: "from iridb import iri_index" "data = iri_index.get('https://some/url')" I'm skipping lots, you can ref by hash, url, url+timestamp. It hands back a fh, you dont know if the data you are reffing even fits in memory. Extensive caching, all the iri/url quirks, punycode, PSL etc.

Some random pdf in ~/Downloads, "import doc.pdf" and dmenu pops up, you type a tag, hit enter and the pdf disappears into the hash tree, tagged, and you never need to remember where you put it. Later on you only need to remember part of the tag, and a tag is just a sequence of unicode words.

Chunks are on my github (jakeogh/uhashfs, it's heka-outdated dont use it yet), I'll be posting the full thing sometime soonish.

I actually this week asked the author of SingleFile if he could implement a save-on-bookmark feature for SingleFile, and he was amenable:



Nice. FYI there's also SingleFileZ

> SingleFileZ is a fork of SingleFile that allows you to save a webpage as a self-extracting HTML file. This HTML file is also a valid ZIP file which contains the resources (images, fonts, stylesheets and frames) of the saved page.
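
The "HTML file that is also a valid ZIP file" trick works because ZIP readers locate the archive from its end-of-file directory, so leading bytes are tolerated. The sketch below demonstrates only that general property on a synthetic file; it is an assumption-laden illustration and does not reproduce how SingleFileZ actually lays out its output.

```python
import io
import zipfile


def make_polyglot(html_prefix: bytes, resources: dict[str, bytes]) -> bytes:
    """Build a synthetic file: arbitrary HTML bytes followed by a ZIP archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in resources.items():
            zf.writestr(name, data)
    return html_prefix + buf.getvalue()


def list_resources(blob: bytes) -> list[str]:
    """List archive members; zipfile compensates for the prepended bytes."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return zf.namelist()


def read_resource(blob: bytes, name: str) -> bytes:
    """Extract one member from the polyglot file."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return zf.read(name)
```

The same mechanism is why a saved page in this style can be opened with an ordinary unzip tool.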


I'll implement the "save bookmark page" feature in both extensions :)

Whoa. I just installed SingleFileZ for FF and it is working great. Before I was using wget and that was clunky. This is working great since I can just toss a single file up on my server and we are good to go. Thanks for this!

oh, hello! It's funny how this subject has popped up again.

Hi! I think it confirms that there's a real interest in this feature.

Off topic, but could I ask how you knew your software was being talked about? Did you just happen by or have you some monitoring agent looking for mentions? Just curious

Sorry, I didn't see your question. I check the posts on HN very regularly. The title of the post made me think that people might have been talking about SingleFile. Sometimes, friends of mine tell me someone on the Internets is talking about SingleFile :). I also sometimes use the integrated search engine.

WorldBrain's Memex (https://addons.mozilla.org/en-US/firefox/addon/worldbrain/) has an option to perform a full-text index (not archive) of bookmarks, or pages you visited for 5 seconds (default) down to 1 second (no option to index all pages). It stores this stuff into a giant Local Storage (etc) database, which Firefox implements as a sqlite file.

https://www.gwern.net/Archiving-URLs describes extracting browser history to create an archive via a batch job.

Firefox actually purges history automatically. For instance, the oldest history I have on this browser right now is from January 2018. I found about this the hard way.

I noticed this behavior in Firefox too. So I started writing personal Python scripts to scrape FF's SQLite database where it stores all the browsing history information.
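
For anyone wanting to try the same thing, here is a rough sketch of such a script. It assumes the history lives in the `moz_places` and `moz_historyvisits` tables of `places.sqlite`, with `visit_date` stored as microseconds since the Unix epoch; verify that against your own profile before relying on it. It is best run on a copy of the file, since Firefox keeps the original locked while running.

```python
import sqlite3
from datetime import datetime, timezone


def read_history(db_path: str) -> list[tuple[str, str, str]]:
    """Return (ISO timestamp, URL, title) rows, newest first."""
    query = """
        SELECT v.visit_date, p.url, COALESCE(p.title, '')
        FROM moz_historyvisits AS v
        JOIN moz_places AS p ON p.id = v.place_id
        ORDER BY v.visit_date DESC
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(query).fetchall()
    finally:
        conn.close()

    history = []
    for visit_date, url, title in rows:
        # Assumed unit: microseconds since the Unix epoch.
        when = datetime.fromtimestamp(visit_date / 1_000_000, tz=timezone.utc)
        history.append((when.strftime("%Y-%m-%dT%H:%M:%SZ"), url, title))
    return history


def export_history(db_path: str, out_path: str) -> int:
    """Write the history as tab-separated text; return the number of rows."""
    rows = read_history(db_path)
    with open(out_path, "w", encoding="utf-8") as fh:
        for when, url, title in rows:
            fh.write(f"{when}\t{url}\t{title}\n")
    return len(rows)
```

Exporting to a plain text file on a schedule sidesteps the browser's expiry entirely, because the copy no longer depends on the profile.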

Safari does the same, even though I tell it to never clear browsing history.

I think Chrome(ium) does as well. Very annoying tbh.

Chrome was the first browser I encountered that deletes history without being instructed to.

It looks like Firefox has been doing it since 2010[1]. I wonder how long Chrome has been doing it, since launch, 2008? Here's a Chrome bug discussing it[2].

[1] https://web.archive.org/web/20151229082536/http://blog.bonar...

[2] https://bugs.chromium.org/p/chromium/issues/detail?id=500239

Mosaic had full text history search.

You can increase the retention period to centuries via about:config.

I have this problem. Some bits of history are gone except from old backups of profile directories and profiles where I've already set places.history.expiration.max_pages to some absurdly high number.

I need to do a handful of experiments to see exactly how this interacts with Sync, even though I've (foolishly) already synced the important profiles. I'd hope that the cloud copies of the places database just keeps growing, but in any case, I'd rather combine them all offline anyway.

Even if you set the setting, how can you be sure that it won't be reset on an upgrade or that you'll remember to set it if you need a new profile (perhaps your old one becomes buggy, crufty, corrupt, or all three)? I thought I had all my history retained until one day I couldn't find a website I knew I had visited years ago, and took a closer look at my history and was very unpleasantly surprised... What happened? I'll never know, but my suspicion is that Firefox reset the history retention setting at some point along the way. If you do any web dev, you know Firefox occasionally backstabs you and changes on updates. The only way to be sure over a decade-plus is to regularly export to a safe text file where the Mozilla devs can't mess with it. I can't undo my history loss, but I do know I have lost little history since.

I can't be sure. When I say 'combine them all offline', I mean using something like [1] which refuses to do anything for me because the Waterfox database version is a rather old Firefox version, and that seems to expect all the db's versions to be up-to-date and equal, which seems pointless. #include <sqlite3.h> was my next step-- only I don't walk very well, so that didn't happen "yet". Or I'm lazy, or distracted, or depressed, or something. When I recently got tired of realizing a thing was on the other machine, I bit the bullet and synced them, if only to see how well that worked.

Anyway, thanks for the guide.

[1] https://github.com/crazy-max/firefox-history-merger

I like the idea, but wanted to know how realistic it would be so I made a quick and dirty Python script to download all my bookmarks. If you want to make the same experiment, you can get it from here: https://gist.github.com/ksamuel/fb3af1345626cb66c474f38d8f03...

It requires Python 3.8 (just the stdlib) and wget.

I have 3633 bookmarks, for a total of 1.5 GB unzipped, 1.0 GB zipped (and we know we can get more from a better algorithm and from using links to files with the same checksum, e.g. for JS and CSS deps).

This seems acceptable IMO, especially since I used to consider myself a heavy bookmarker and I was stunned by how few I actually had and how little disk they occupied. Here are the types of the files:

  31396 text/plain
   3034 application/octet-stream
   1316 text/x-c++
   1123 text/x-po
    865 text/x-python
    384 text/html
    227 application/gzip
    218 inode/x-empty
    178 text/x-pascal
    113 image/png
     44 application/zlib
     29 text/x-c
     28 text/x-shellscript
     14 application/xml
     13 application/x-dosexec
     12 text/troff
      5 text/x-makefile
      4 text/x-asm
      3 application/zip
      2 image/jpeg
      2 image/gif
      2 application/x-elc
      1 text/x-ruby
      1 text/x-diff
      1 text/rtf
      1 image/x-xcf
      1 image/x-icon
      1 image/svg+xml
      1 application/x-shockwave-flash
      1 application/x-mach-binary
      1 application/x-executable
      1 application/x-dbf
      1 application/pdf

It should probably be opt-in though, like a double click on the "save as bookmark" icon to download the entire page, with the star turning a different color. Mobile phones, Chromebooks and Raspberry Pis may not want to use the space, not to mention that there is some bookmarked content you don't want your OS to index and show you previews of in every search.

But it would be fantastic: by doing this experiment I noticed that many bookmarks were 404 now, and I will never get their content back. Besides, searching bookmarks and referencing them is a huge pain.

So definitely something I wish mozilla would consider.
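
The checksum idea mentioned above (one stored copy for identical JS and CSS files) can be sketched roughly as follows. This is not taken from the linked gist; the function names are made up, and it assumes the archive lives on a single filesystem that supports hard links.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: str) -> dict[str, list[str]]:
    """Map checksum -> sorted paths, keeping only checksums seen more than once."""
    by_hash = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_hash[sha256_of(path)].append(path)
    return {h: sorted(p) for h, p in by_hash.items() if len(p) > 1}


def hardlink_duplicates(root: str) -> int:
    """Replace duplicates with hard links to the first copy; return bytes saved."""
    saved = 0
    for paths in find_duplicates(root).values():
        keep, *rest = paths
        for dup in rest:
            if os.path.samefile(keep, dup):
                continue  # already the same inode, nothing to reclaim
            saved += os.path.getsize(dup)
            os.remove(dup)
            os.link(keep, dup)
    return saved
```

Hard links keep every saved page self-contained on disk while storing shared dependencies only once.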

> definitely something I wish mozilla would consider

There used to be this neat little extension called Read It Later that let you do just that. Bookmark and save it so you could read it when you were offline or the page disappeared. Later they changed their name and much later Mozilla bought it and added it to Firefox without a way to opt out. It was renamed to Pocket.

Pocket is not integrated with your bookmarks. For offline consultation, you need a separate app. Of course this app is not available on Linux, where you have to get some community provided tools.

Bookmark integration would mean one software, with the same UI, on every platform, and only one listing for your whole archive system.

I’ve been building an application to do this, except for everything on your computer! It’s called APSE[0], short for A Personal Search Engine.

[0] https://apse.io

Having to pay $15/month ($180/yr!) to be able to search stuff on my own computer for years seems awfully expensive. I'd rather depend on some simple open-source piece of software that I can understand and maintain if necessary.

Yeah, the sheer idea of paying a subscription for software that is running on my computer to index local resources is crazy. This kind of software should be sold as a one-time purchase license.

Decades ago there was an amazing piece of software from lotus when I worked there called magellan. I remember the first time I saw someone search, and find results in text documents, spreadsheets and many other of the common formats of the day.

That was in 1989 and today I mostly search my computer using find and grep commands, since that's what just keeps working.

I should try adopting find and grep, but on Windows I'm currently using this and I'm very happy with it: https://www.voidtools.com/downloads/

I use Void Tools Everything to find files by name, and AstroGrep for finding information in them.

Yup, I could see paying $180 one time for something like this, but at $180 a year for a self-hosted product... that's just very steep.

Google used to have a native Mac extension like a launch bar. Command space ... Enter search all local files. It was really fast

macOS Spotlight became good enough


Well I used LaunchBar and then Quicksilver for many years. Spotlight has never been as nice and hackable as those.

You can ask Safari to do that by enabling 'Reading List: Save articles for offline reading automatically'. It's not WARC but it is an offline archive. The shortcut is cmd-shift-D which is almost the same as the bookmark one. It's also the only way I know of to get Safari to show you bookmarks in reverse chronological order. And it syncs to iOS devices.

This could be done in better and more specialized ways, one problem is browser extension APIs don't provide very good access to the browser's webpage-saving features.

This problem has been solved a long time ago if you use Pinboard.


Just pay the yearly subscription so pinboard can cache your bookmarks.

I do not see how using a web thing is a solution to web things going away.

Pinboard happens to be a web service run along the same principles as the article we're discussing.

The bus factor is high, but I suspect that Maciej has a plan that'll let us download our archive even if he does get grabbed by the mainland Chinese government let alone a forecasted going out of business action.

Then what do you propose the answer is? The blog post just proposes using “web things” differently

For archiving web pages? ArchiveBox is ok.

I've contemplated upgrading my Pinboard account many times. Finally bit the bullet.

Until pinboard goes OOB

Pinboard is a profitable online equivalent to a mom and pop shop. It’s sustainable and its founder isn’t chasing growth at all costs. It also has a cult following, so OOB is highly unlikely.

Does pinboard caching work with sites that are behind a login or paywall?

It does not. See FAQ here: https://pinboard.in/upgrade/

This is the advantage to Evernote. Since it’s a browser extension, it has access to anything you have access to.

The downside is, since it’s a browser extension, it has access to anything you have access to.

Agreed that there’s a tradeoff. I don’t think there’s really an alternative solution though.

I use the Zotero extension for this feature.


came here to say this. zotero also saves metadata as an extra. as long as it is used from within a browser.

In practice how is this different from MHTML? I think most browsers have built-in support for MHTML so it should be possible to build that part easily.

The state of mhtml support is fairly pathetic at the moment. Firefox broke mhtml compatibility with the quantum overhaul. Chrome's mht support had been a hit and miss over the years, sometimes removing the GUI option entirely and requiring one to manually launch the browser with a special tag to enable it. The only browser with a history of consistent mhtml support happens to be....Internet Explorer, followed by a bunch of even more obscure vendors that nobody really uses.

I am currently dealing with the problem of parsing large mht files (several megabytes and up). A regular web browser would hang and crash upon opening these files and most ready made tools I could find struggle with the number of embedded images. It's very much a neglected format with very little support in 2019.
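
Since MHTML is a MIME multipart document, one way to inspect such a file without a browser is Python's standard `email` package. The following is a minimal sketch, assuming the file is well-formed `multipart/related`; it loads the whole file into memory, so it is a starting point rather than a fix for very large archives.

```python
from email import policy
from email.parser import BytesParser


def list_mhtml_parts(raw: bytes) -> list[tuple[str, str, int]]:
    """Return (content type, Content-Location, decoded size) for each leaf part."""
    message = BytesParser(policy=policy.default).parsebytes(raw)
    parts = []
    for part in message.walk():
        if part.is_multipart():
            continue  # container node; only leaf parts carry resources
        # decode=True undoes base64 / quoted-printable transfer encoding.
        payload = part.get_payload(decode=True) or b""
        location = str(part.get("Content-Location", ""))
        parts.append((part.get_content_type(), location, len(payload)))
    return parts


def list_mhtml_file(path: str) -> list[tuple[str, str, int]]:
    """Convenience wrapper that reads the archive from disk."""
    with open(path, "rb") as fh:
        return list_mhtml_parts(fh.read())
```

Listing parts this way at least shows which embedded images dominate the file before deciding how to process them.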

According to the MHTML entry on Wikipedia, Chrome requires an extension, Firefox doesn't support it, and only Internet Explorer supports MHTML.

I mean, it's not any worse than WARC support…

Maybe MAFF's are best as they use compression instead of base64 encoding: https://en.wikipedia.org/wiki/Mozilla_Archive_Format
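
To put a rough number on the base64-versus-compression point: base64 turns every 3 bytes into 4 characters, so embedded resources grow by about a third, while a ZIP-style deflate usually shrinks text. A small sketch with a made-up, highly repetitive CSS payload (real pages will compress less dramatically):

```python
import base64
import zlib


def encoded_sizes(payload: bytes) -> dict[str, int]:
    """Compare raw size with base64 (MHTML-style) and deflate (ZIP-style)."""
    return {
        "raw": len(payload),
        "base64": len(base64.b64encode(payload)),
        "deflate": len(zlib.compress(payload, 9)),
    }


if __name__ == "__main__":
    # Illustrative payload only; not measured from any real archive.
    css = b"body { margin: 0; }\n" * 300
    for label, size in encoded_sizes(css).items():
        print(f"{label:8s} {size:6d} bytes")
```

Already-compressed resources such as JPEG or PNG images gain little from deflate, but they still avoid the base64 overhead.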

Or SingleFileZ files which can be viewed without installing any extension https://github.com/gildas-lormeau/SingleFileZ.

Edit: it can also auto-save pages, like SingleFile.

I used the excellent unMHT plugin for Firefox, but it got dropped some time ago, failed to meet "enhanced security requirements" :(

I still keep an old ESR with this plugin for archiving, and accessing MHTs.

This is what started me clipping everything to OneNote instead of bookmarking. Unfortunately, it becomes difficult to maintain, the formatting is off, things subtly break, pages clipped on mobile use different fonts for god knows what reason, some content is discarded silently because the clipper deems it's not part of the main article, I could go on.

It's better than nothing but it's also increasingly frustrating to deal with.

I've actually been saving every page I visit for a good two years now and it has barely made a dent in my NAS storage space. As usual though, I wrote a crappy extension and Python script to do that because I never bothered to look online. Thanks for introducing me to WarcProxy - I'll probably be making the switch very soon.

That would be even more useful if a search warrant is ever executed on my house.

Not useful to me personally, but useful to someone!

I always save pages instead of only bookmarking them.

Most websites I used to visit during the demoscene's heyday are now gone.

Years ago I used the old Firefox addon Shelve to automatically archive the vast majority of web pages I visited.


The main disadvantage was disk space. This is particularly true when some pages are 10 MB or larger. I would periodically prune the archive directory for this reason.

I stopped using Shelve when I started running out of disk space, and now I can't use Shelve because the addon is no longer supported. The author of Shelve has some suggestions for software with similar functionality:


It installs and loads in Waterfox, but of course it still hasn't been touched in 3.5 years.

I used Scrapbook in the past (which also still works) but I usually just save random things in ~/webpages/ since (apparently) 2011. The earliest is a copy of the landing page at bigthink.com. Of course now almost every link is broken, excluding social media buttons, About Us, Contact Us, RSS, Privacy, Terms of Use, Login, and the header logo pointing at the same page.

It would be nice if we had browsers that were actually user-agents that allow full pluggable customizability for all cookie, header, UI, request, and history behavior. Then, this would just be a plugin that anyone could install.

And then people install garbage extensions that break the browser and people think "Wow, this firefox browser is so buggy and slow" and switch to chrome. And then your extensions break with every single browser update because they are tampering with internal code.

Everyone is free to fork a browser and apply any changes they want. Allowing extensions to change anything at all essentially is the same as forking and merging your changes with upstream every update.

Somehow that doesn’t seem to happen with editors; we have a ton of them and they are very customizable.

I guess “with a reasonably stable hook api” was supposed to be implicit in my statement.

How is viewing the WARC after? Is it the same quality as archive.is/ or archive.org/ ?

It's a pain, like most of the WARC ecosystem. It's been several months since I dug in, so maybe it's had some spitshine the last little bit, but I usually end up using combinations of wpull, grab-site, and a smattering of other utilities to reliably capture a page/set of pages, and have had to make some quick hacks as well as manually merging in some PRs to get things to work with Python3. Once I have the WARC, I typically end up using warcat to extract the contents into a local directory and explore that way.

WARC as a format seems promising, but at least last I checked, open-source tooling to make it a pleasant and/or transparent experience is not really there, and worse, at least as of several months ago, doesn't really seem actively worked on. Definitely an area you'd expect to be further along.
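To give a feel for what is inside one of these files: an uncompressed WARC is a sequence of records, each with a version line, some headers, a blank line, and a body of exactly `Content-Length` bytes. A minimal sketch of walking those records with only the Python standard library follows. It is not how warcat or any other tool named above works; `iter_warc_records` and the sample record are hypothetical, and the sketch assumes well-formed input with no per-record gzip compression:

```python
import io


def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()
        # The body is exactly Content-Length bytes, whatever it contains.
        body = stream.read(int(headers["content-length"]))
        yield headers, body


# Hand-written sample record (hypothetical capture of example.com).
http_response = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<p>hello</p>"
record = b"".join([
    b"WARC/1.0\r\n",
    b"WARC-Type: response\r\n",
    b"WARC-Target-URI: http://example.com/\r\n",
    b"Content-Length: " + str(len(http_response)).encode("ascii") + b"\r\n",
    b"\r\n",
    http_response,
    b"\r\n\r\n",
])

for headers, body in iter_warc_records(io.BytesIO(record * 2)):
    print(headers["warc-type"], headers["warc-target-uri"], len(body))
```

For a `response` record the body is the raw HTTP response, so the page itself still has to be split off from the HTTP headers.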

Pretty good, depending on the tooling. I’m having good luck with https://webrecorder.io/ and their related open-source tools.

archive.org uses WARC if I remember correctly, and offers guides on creating and reading the format.

Safari used to do this automatically (and still does), but in a limited way. In the browsing history view (Command-Y), you can search visited pages by their content, and this is extremely useful. But there's no way(†) to tell Safari to display that saved content. If you revisit a URL in the history, Safari fetches it again, losing the original saved content.

(†): short of direct plist manipulation

Does it actually? I thought history was stored in a SQLite database and only kept the URL and page title.

This comment misses the point.

The point is to make a webpage that lasts. So people can link to it and get the page. That means making a maintainable webpage and a url that does not change.

It is great that you can archive every page you visit for yourself, but that is not the same as making a lasting web.

Let's make something that others can use too.

Better than a bookmark action would be a commandline option, similar to Firefox's -screenshot, which will work without starting X11. Something like -archive:warc

Does this also strip the megabytes of superfluous tracking JS? It's probably what'll be the bulk of the size on the modern web, and I don't feel any particular need to store it.

(I believe that for historical purposes, enough complaining about ads and tracking will survive that future historians can easily deduce the existence of this practice)

I use DEVONThink on my mac, which has web archiving, full-text search, and auto-categorization.

"Imagine having the last 30 years of web browsing history saved on your local machine."

I believe the name for that experience was/is "Microsoft Windows".

>Heck, with the cost of storage so low, recording every webpage you ever visit in searchable format is also very realistic.

I tend to do that, I also save a lot of scientific papers, ebooks and personal notes. I've found that doing so does not help me at all. The main problem I have is that when I need to look something up (an article, a book, a bit of info) I reach for google first, usually end up finding the answer and go to save it, only to find that I had already found the answer beforehand (and perhaps already made clarifying notes to go along with it) and then forgot about it.

This, and not dead links, is the fundamental problem with bookmarks for me. Not only bookmarks, it extends to my physical notes and pretty much everything I do. If I haven't actively worked on something for a couple of months, I forget all about it and when I come back to it I usually have to start from scratch until I (hopefully) refresh my memory. Some of it is also usually outdated information.

I think this is a big, unsolved problem and I'm not even sure how to go about starting to solve it. I can envision some form of AI-powered research assistant, but only in abstract terms. I can't envision how it would actually work to make my life better or easier. It would need to be something that would help blur the line between things I know and things that are on my computer somehow. If I think of my brain like it has RAM and cache, things I'm working on right now are in the cache and things I've worked on recently or work on a lot are in RAM, but what's for me lacking is a way to easily move knowledge from my brain-RAM to long term storage and then move that knowledge back into working memory faster than I can do so now. I'm not even talking about brain uploading or mind-machine interfaces, but just something that can remind me of things I already know but forgot about faster than I can do so by myself.

I am convinced that figuring out how to do this will lead to the next leap in technological development speed and efficiency. Not quite the singularity that transhumanists like to talk about, but a substantial advancement.

I have exactly the same problem.

What I've found is that I need to spend more time deciding what is important, and less time consuming frivolous information. That's hardly a technology problem.

For things I really don't want to forget, I'm using Anki [0], a Spaced Repetition System (SRS). Anki is supremely good at knowing when you're about to forget an item and prompting you to review it.

Spaced practice and retrieval practice, both of which are used in SRS, are two learning techniques for which there is ample evidence that they actually work [1].

You still need to decide what is worth remembering, but that's something technology can't help with, I think.

[0] https://apps.ankiweb.net/

[1] https://www.learningscientists.org/

Yes so much this.

There are a few issues to consider:

- Any comprehensive archive of your activity is itself going to be a tremendously "interesting" resource for others -- advertisers, law enforcement, business adversaries, and the like. Baking in strong crypto and privacy protections from the start would be exceedingly strongly advised.

- That's also an excellent reason to have this outside the browser, by default, or otherwise sandboxed.

- Back when I was foolish enough to think that making suggestions to Browser Monopoly #1 was remotely useful, I pointed out that the ability to search within the set of pages I currently have open or have visited would be immensely useful. It's (generally) a smaller set than the entire Web, and comprises a set of at least putatively known, familiar, and/or vetted references. I may as well have been writing in Linear A.

- Context of references matters a lot to me. A reason I have a huge number of tabs open, in Firefox, using Tree-Style Tabs, is that the arrangement and relationships between tabs (and windows) is itself significant information. This is of course entirely lost in traditional bookmarks.

- A classification language for categorising documents would be useful. I've been looking at various of these, including the Library of Congress Subject Headings. A way of automatically mapping 1-6 high-probability subjects to a given reference would be good, as well as, of course, tools for mapping between these.

- I've an increasing difference of opinion with the Internet Archive over both the utility and ultimately advisability of saving Web content in precisely the format originally published. Often this is fragile and idiosyncratic. Upconverting to a standardised representation -- say, a strictly semantic, minimal-complexity HTML5, Markdown, or LaTeX, is often superior. Both have their place.

On that last, I've been continuing to play with the suggestion a few days ago for a simplified Washington Post article scrubber, and now have a suite of simple scripts which read both WashPo articles and the homepage, fetching links from the homepage for local viewing. These tend to reduce the total page size to about 3-5% of the original, are easier to read than the source, and are much more robust.

I'm reading HN at the moment from w3m (which means I've got vim as my comment editor, yay!), and have found that passing the source to pandoc and regenerating HTML from that (scrubbing some elements) is actually much preferable, for the homepage. Discussion pages are ... more difficult to process, and the default view in w3m is unpleasant, though vaguely usable.

Upshot: saving a WARC strictly for archival purposes is probably useful, but generating useful formats as noted above would be generally preferable in addition.
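As a rough illustration of the "upconvert to minimal HTML" idea (this is not the commenter's Washington Post scripts or the pandoc pipeline, just a sketch under stated assumptions), the following uses Python's standard `html.parser` to drop scripts, styles and embeds and keep only a short whitelist of structural tags. The names `Scrubber`, `scrub`, `KEEP` and `DROP_WITH_CONTENT` and the tag lists are made up for this example:

```python
from html import escape
from html.parser import HTMLParser

# Tags kept in the output; every other tag is dropped but its text is kept.
KEEP = {"h1", "h2", "h3", "p", "ul", "ol", "li", "blockquote", "pre", "a"}
# Tags dropped together with everything inside them.
DROP_WITH_CONTENT = {"script", "style", "noscript", "iframe", "svg"}


class Scrubber(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in KEEP:
            href = dict(attrs).get("href") if tag == "a" else None
            if href:
                # Links keep their target; all other attributes are discarded.
                self.out.append(f'<a href="{escape(href, quote=True)}">')
            else:
                self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in KEEP:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(escape(data, quote=False))


def scrub(html_text):
    parser = Scrubber()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


page = (
    "<html><head><style>p{color:red}</style><script>track()</script></head>"
    '<body><div class="ad"><p>Hello <a href="/x?a=1&amp;b=2">link</a></p></div>'
    "</body></html>"
)
print(scrub(page))
```

Note that this does no main-content detection: text inside navigation or ad containers survives, only the markup around it is removed.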

With the increasing untenability of mainstream Web design and practices, a Rococo catastrophe of mainstream browsers, the emergence of lightweight and alternative browsers and user-agents (though many based on mainstream rendering engines), the tyranny of the minimum viable user attacking any level of online informational access beyond simple push-stream based consumption, and more, it seems that at the very least there's a strongly favourable environment for rethinking what the Web is and what access methods it should support. Peaks in technological complexity tend to lead to a recapitulation phase in which former, simpler, ideas are resurrected, returned to, and become the basis of further development.

I fundamentally agree with the principle -- that pages should be designed to survive a long time -- however the steps the author lays out I completely disagree with.

"The more libraries incorporated into the website, the more fragile it becomes" is just fundamentally untrue in a world where you're self-hosting all of your scripts.

"Prefer one page over several" is diametrically opposed to the hypertext model. Please don't do this.

"Stick with the 13 web safe fonts" assumes that operating systems won't change. There used to be 3 web safe fonts. Use whatever typography you want, so long as you self host the woff files.

"Eliminate the broken URL risk" by... signing up for two monitoring services? Why?

I think this list of suggestions does a great disservice to people who just want to be able to post their thoughts somewhere. There's an assumption here that you'll need to be technically capable in order to create a page "designed to last" and frankly that is not what the internet is about. Yes, Geocities went away. Yes, Twitter and Facebook and even HN will go away. But the answer sure as hell isn't "I teach my students to push websites to Heroku, and publish portfolios on Wix" because that is setting up technical gatekeeping that is completely unnecessary.

> "The more libraries incorporated into the website, the more fragile it becomes" is just fundamentally untrue in a world where you're self-hosting all of your scripts.

There are more problems though. Older library versions might be vulnerable to XSS attacks, or use features removed by browsers in the future for security reasons (eval?). Or you might want to change something involving how you use the API but the docs are long gone. Generally, libraries imply complexity and when it comes to reliability, complexity will always be your enemy.

Also, unless you're very diligent about semantic markup and separation of content, presentation, and interaction logic, the more complicated a site is the more difficult it is to port.

I have run into this problem trying to migrate very old web pages or blog posts off of SaaS sites that are shutting down or just decaying. It's not just that complicated sites make it difficult to extract the content in the first place; it's difficult to publish that content on another site in a high-fidelity, and sometimes even readable, way.

The hard part isn't keeping the old site (page) running (although that's not always easy either). The hard part is when you want to do something _else_ with that content -- more complicated means less (easily) flexible.

I didn't perceive the author to be doing any technical gatekeeping, quite the opposite. I feel like their article was targeted at people like me or others who use stuff like Hugo/Jekyll, or those who use free website builders or use large frameworks for simple websites.

I agree a couple of the points seem out of place (the monitoring service one made me laugh. Visiting my website is the first thing I do after uploading a new page), but the intent of this article I wholeheartedly agree with:

Reduce dependencies, use 'dumb' solutions, and do a little ritualistic upkeep of your website to keep it around for a decade or more. The things you propose are the norm and the reason nothing sticks around, IMO.

> (the monitoring service one made me laugh. visiting my website is the first thing I do after uploading a new page)

I think what you want is not just monitoring your internal links, but also external ones - if a page you linked to in your article starts 404-ing or otherwise changes significantly, it's something you'd likely want to know about. That said, just like preferring GoAccess over Google Analytics, it's something I'd like to have running locally somewhere (on my server, or even on my desktop), instead of having to sign up to some third-party service.
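A local checker for outbound links does not need much. Here is a minimal sketch using only the Python standard library; the function names (`external_links`, `http_status`, `broken_links`) and the example URLs are hypothetical, and treating any status of 400 or above as "broken" is an assumption. Some servers reject HEAD requests, so a real tool would probably retry with GET:

```python
import urllib.error
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def external_links(html_text, base_url):
    """Return http(s) links that point at a different host than base_url."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    own_host = urlparse(base_url).netloc
    return [
        url for url in collector.links
        if urlparse(url).scheme in ("http", "https")
        and urlparse(url).netloc != own_host
    ]


def http_status(url, timeout=10):
    """Return the HTTP status for url, or None if the request failed outright."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # e.g. 404 or 410
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout


def broken_links(html_text, base_url, fetch_status=http_status):
    """Return (url, status) for every external link that looks dead."""
    report = []
    for url in external_links(html_text, base_url):
        status = fetch_status(url)
        if status is None or status >= 400:
            report.append((url, status))
    return report
```

Run from cron against the generated HTML files, something like this would cover the 404 case; noticing that a page "changed significantly" would need stored copies or hashes to compare against, which this sketch does not attempt.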

> "Stick with the 13 web safe fonts" assumes that operating systems won't change. There used to be 3 web safe fonts. Use whatever typography you want, so long as you self host the woff files.

Indeed. 10 years ago, “font-family: Georgia, Serif” was guaranteed to work and look the same on pretty much all computers out there. Windows had all of the “web core” fonts (Georgia, Verdana, Trebuchet, Arial, even Comic Sans). Macintosh computers had all of the “web core” fonts. Even most Linux computers had them because it was legal to mirror, download, and install the files Microsoft distributed to make the fonts widely available.

In the last decade, Android has become a big player, and the above font stack with Georgia will look more like Bitstream Vera than it looks like Georgia on Android.

The only way to have a website have the same typography across computers and phones here in the soon-to-be 2020s is to supply the .woff files. Locally (because Google Webfonts might be offline some day). Either via base 64 in CSS or via multiple files; I prefer base 64 in CSS because sites are more responsive loading a single big text file than 4 or 5 webfont files. Not .woff2: Internet Explorer never got .woff2 support, and we can’t do try-woff2-then-woff CSS if using inline base64.

Even with very aggressive subsetting, and using the Zopfli TTF-to-WOFF converter to make the woff files as small as possible, this requires a single 116 kilobyte file to be loaded with my pages. But, it allows my entire website to look the same everywhere, and it allows my content to be viewed using 100% open source fonts.

Then again, for CJK (Asian scripts), webfonts become a good deal bigger; it takes about 10 megabytes for a good Chinese font. In that case, I don’t think it’s practical to include a .woff file; better to accept some variance in how the font will look from system to system.

Edit: In terms of having a 10-year website, my website has been online for over 22 years. The trick is to treat webpages as <header with pointers to CSS><main content with reasonably simple HTML><footer closing all of the tags opened in the header> and to use scripts which convert text into the fairly simple HTML my website uses for body content (the scripts can change, as long as the resulting HTML is reasonably constant). CSS makes it easy for me to tweak the look and fonts without having to change the HTML of every single page on my site, but as the site gets older, I am slowly decreasing how much I change how it looks.

I disagree with keeping fonts inline in the page. It means an additional 100kb per page at the very least, which adds up very quickly. Remember that most of the world still doesn't have broadband (including yourself if you're using roaming services abroad). It also means extremely redundant information is transmitted when people view more than one page on your site.

Using base64 fonts in your stylesheet isn't a big deal when you aggressively cache and compress your CSS.

Exactly. It’s a CSS (not HTML) file with all the inline fonts in that file, with a long cache time, so all of the website’s fonts are loaded once for site visitors.
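For anyone wanting to try the inline-font approach discussed here, generating the CSS is mostly a matter of base64-encoding the .woff bytes into a data URI inside an `@font-face` rule. A minimal sketch, assuming Python's standard `base64` module; the function name `font_face_rule`, the family name and the placeholder bytes are made up, and the `font/woff` media type is the one I would expect to be appropriate rather than something stated in this thread:

```python
import base64


def font_face_rule(family, woff_bytes, weight=400, style="normal"):
    """Build an @font-face rule with the font embedded as a base64 data URI."""
    encoded = base64.b64encode(woff_bytes).decode("ascii")
    return (
        "@font-face {\n"
        f'  font-family: "{family}";\n'
        f"  font-weight: {weight};\n"
        f"  font-style: {style};\n"
        f'  src: url("data:font/woff;base64,{encoded}") format("woff");\n'
        "}\n"
    )


# Placeholder bytes stand in for a real subsetted .woff file.
print(font_face_rule("Demo Serif", b"wOFF-placeholder"))
```

Base64 turns every 3 bytes into 4 characters, so the embedded font is about a third larger than the .woff file before the stylesheet is compressed and cached.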

  > "Prefer one page over several" is diametrically opposed to
  > the hypertext model.
No, it is not. No need to split the article into five pages when it can be on one. Unless you want to inflate your clicks, that is.

""Prefer one page over several" is diametrically opposed to the hypertext model. Please don't do this."

I think I agree with you here, in that much of the power of hypertext lies in the hierarchical "tree" model.

And yet, I think it has not been used properly up to this point ...

I hesitate to post this as this is not quite finished[1], but here goes - this is something called an "Iceberg Article":


... wherein the main article content is, as the article suggests, a single, self-contained page.

And yet ... that is just the "tip" - underneath is:

" ... at least one, but possibly many, supporting documents, resources and services. The minimum requirement is simply an expanded form of the tip (the "bummock"), complete with references, notes and links. Other resources and services that might lie under the surface are a wiki, a changelog, a software repository, additional supporting articles and reference pages and even a discussion forum."

[1] Neither wiki nor forum exist yet, but the bummock does...

I don't see how that's fundamentally different from a Wikipedia article, which is basically

(1) a single, self-contained page, (2) that is just the "tip", and (3) linked within it is all the stuff mentioned

That site has various opinions about the "tip" being uncluttered of links, etc, but that's just an opinion (and one I disagree with).

The thinking here is that the "tip" is <= a single page.

Wikipedia articles can be quite long (and justifiably so) - perhaps scrolling many pages.

The "tip" of an "Iceberg Article" is ".. a single page of writing ..."

Perhaps confusing because I don't mean a "single (web)page" I mean, an actual single page.

It's not entirely wrong, if the page is held static but the browser continues to be upgraded.

If you're worried about fonts changing out from under a site you should surely also be worried about bitrot in, say, jQuery.

Or not bitrot, but ever-changing browser APIs.

When was the last browser change that broke things like simple news websites????

The phrases "simple" and "news websites" don't combine well these days. Even the NPR website downloads 13.2 MB of content over 91 individual requests, and takes just over 3.6 seconds to load (6.5 to finish).

- CSS Stylesheets: 3

- Animated gifs: 1

- Individual JS files: 11 (around 2MB of JS decompressed (but not un-minimized))

- Asynchronous Requests: 14 (and counting)

And that's with uBlock Origin blocking 12 different ad requests.

That's not simple in any form. So, the possibility of something on this page breaking? High. There's a lot of surface area for things to break over time. And that's not counting what happens when the NPR's internal APIs change for those asynchronous requests.

I know NPR was just an example, but they do actually have a text-only version that I've found really useful: https://text.npr.org

If the site is being served over HTTP/2 then the 11 separate JS files is a good thing compared to a single 2MB JS file.


In my case it also has the added benefit of being able to cache JS for a long(er) period of time, with users only having to download maybe 0-30kb of JS when only 1 component is updated instead of invalidating the entire JS served (Way under 1MB however)

>Use whatever typography you want, so long as you self host the woff files.

or use Google Web fonts, and set the last option in your font-family to be "serif" or "sans-serif" to let an appropriate typeface be used if your third-party font is unreachable. That's the beauty of text, the content should still be readable even if your desired font is unavailable.

Google Web Fonts are not an "or", here. Fonts have disappeared from it, and there is no reason to not expect Google to, at some point in the future, go: "you know what, this costs too much without any substantial return." And now it's just another killedbygoogle.com product. Just like images, self-hosting woff/woff2 should be step 1.

Fonts disappearing is not a big issue that will ultimately render your page useless. If the font is gone, the look of the page is slightly affected, but the content of the page remains. It's honestly not a big deal at all.

In that case just use sans-serif or a web-safe font and avoid the third-party dependency.

Here as we enter the 2020s, there are no longer any web safe fonts. Those 1990s Core Fonts for the Web (Verdana, Georgia, Trebuchet, etc.) are no longer universal across all widely used platforms.

Yeah okay, but the initial suggestion (just specify "sans serif") still holds. Or really, if we're talking about a webpage to last, why do we even care about what font is being used? If you care enough about a font that the glyphs used are important for layout, then obviously you're going to need to include the font. If the specific look of the page is essential to the content conveyed, it seems likely to me you won't be using a standard font anyway.

For typical "the words matter more than how the words look" content...can someone explain to me why we care about including the font?

There is another thread in this discussion where we discuss this, pointing out that default fonts in browsers tend to be quite ugly.

See here: https://news.ycombinator.com/item?id=21841011

There are also layout issues caused when replacing a font with another font, unless the metrics are precisely duplicated. There’s a reason Red Hat paid a lot of money to have Liberation Sans with the exact same metrics as Arial, Liberation Serif have the same metrics as Times New Roman, and Liberation Mono have the same metrics as Courier New.

His "or" was to suggest that instead of only self-hosting the font file, you simply use a google one with a "fall-back" that happens to be a super-standard font that won't reasonably disappear from most OSes in the near future. That way, you get a reasonable "best of both".

Google web fonts were a great way to make my site slower. I don’t know if it’s the latency here in Australia or what, but (especially for developing locally) google web fonts were a big headache for having snappy webpages. I took the time one day to produce my own webfont files and self-host those, and the difference in site load speed is like night and day.

And that's where they are not banned. Many pages simply won't load at all in the PRC because someone thought a Google analytics tracker or a hosted library should load before the content (which then never does).

Actually Google Analytics works fine in PRC.

Google Fonts also isn’t blocked but I recall it being hit-and-miss in terms of responsiveness when I was working on a website that targeted Chinese audience a few years ago. However, I just tried resolving fonts.googleapis.com and fonts.gstatic.com on a Chinese server of mine, and they both resolve to a couple of Beijing IP addresses belonging to AS24424 Beijing Gu Xiang Information Technology Co. Ltd., so it’s probably very much usable now.

Not sure "working fine in PRC" is really something you can say about anything web related.

I do occasional web dev from within China and had to eliminate external references to get manageable page load times. At least from where I work pulling in practically anything from outside the Great Firewall will have a high probability of killing page load time. Anything hosted by Google in particular will often have you staring at your screen for 30 seconds.

Yes, any additional domain you request to has a non-negligible chance of killing the entire connection. To GP, noticing that one request works once from a server (not a home or mobile connection) really means nothing. Every ISP has different and constantly changing failure modes.

Or don’t specify any font at all and leave it up to the user’s preference. Why presume you know better than the user?

It would be great if web browsers had a way to actually indicate the user's preference of typeface, but what we've actually got is the browser's preference, and the browsers almost all have chosen really terrible default typefaces. It's fine to say "just use the default" for Mac users who get a decent default, but then the poor windows users have to suffer through some terrible serif.

The users who actually know how to change the default font also know how to use stylish.

When you go to a restaurant you let the chef prepare food for you.

Telling him to back off and let you cook because he can't know better than you (his user) would be absurd.

Same thing with design and typography. It requires skill and taste, and hopefully people will be delighted or simply consume the content for what it is, because the design/cooking just reveals that content in a convenient/useful shape.

Most people have the means to cook for themselves without going anywhere and do so at least the vast majority of the time. Even if you do go to a restaurant, they almost always have menus rather than just making one dish for everyone since some people have styles of cooking that they prefer or don't like. People rarely design their own font but rather pick from professionally designed fonts. Additionally, at restaurants people pay for food so incentives are aligned while on the web people generally don't pay for content and any design professional involved is likely an advertiser. I rarely read stuff on the web for a design experience but for the content. I suspect most people would be unhappy with a newspaper that changed fonts for every story or a book that changed fonts every chapter.

Personally, I've been setting my browser to use only DejaVu fonts with a 16pt minimum for years (maybe a decade now) and every time I briefly use a default browser profile I notice the fonts and think not just "this is bad" but "how can people live like this?". Even with the usually minor issues that often appear, setting my own fonts is a way better experience than not doing so. My default experience is much closer to Firefox reader mode than it is to what the page specifies in most cases.

IMO, font specification should be limited to serif, sans-serif, or monospace and let the user or browser set the actual font. Designers should not rely on exact sizes of fonts or use custom icon fonts.

Most fonts picked by designers suck. Plain and simple. I override fonts for most websites I frequent.

Can you elaborate on why/how they suck? Do you have example links, to set a common ground for the conversation?

I think most fonts that get your attention suck, the best ones are invisible and get you directly to the meaning of text, without getting in the way. So maybe there's a kind of bias (selection or sampling bias?) operating here?

I can’t speak for the parent poster, but, yes, back in the Myspace days, end users would do really tasteless CSS like Comic Sans or an italic font everywhere. Back then, I told my browser “I don’t care what font they tell you to use, just render it with Verdana”.

These days, people either use their social network’s unchangeable CSS, or they use a Wordpress theme with an attractive and perfectly readable font. Even Merriweather, which I personally don’t care for, is easy enough to read.

The only time I have seen a page use obnoxious fonts in the 2010s is when the LibreSSL webpage used Comic Sans as a joke to highlight that the project could use more money:


Edit: It may be the case that the parent poster likes using a delta hinted font, either Verdana or Georgia, on a low resolution monitor, and doesn’t like the blurry look of an anti-aliased font on a 75dpi screen.

> back in the Myspace days, end users would do really tasteless CSS like Comic Sans or an italic font everywhere.

Indeed, typography is a skill. Most designers should have it though, which is why I asked OP for more information.

> The only time I have seen a page use obnoxious fonts in the 2010s is when the LibreSSL webpage used Comic Sans as a joke to highlight that the project could use more money

Ah, the infamous Comic Sans. It's a shame because as a typeface on its own, in its category, it is pretty good. Sadly, it's misused all the time in contexts where it's not appropriate at all.

> It may be a case that the parent poster likes using a delta hinted font, either Verdana or Georgia, on a low resolution monitor, and doesn’t like the blurry look of an anti-aliased font on a 75dpi screen.

Without more details we cannot guess. You're right: a lot of things can go wrong and ruin a typeface, regardless of how the characters are designed. Anti-protip: a reliable way to make any font look like shit is to keep the character drawings as they are and mess up the tracking (letter-spacing) and kerning.

I think one of the reasons Comic Sans got such a bad rep is because it was one of the relatively few available fonts back in the pre-woff “web safe fonts” era of a decade ago. Microsoft should have given us a more general purpose font, such as a nice looking slab serif to fill the gap between the somewhat old-fashioned looking Georgia and the very stylized Trebuchet MS font.

Because they are not the single system default sans-serif and single system default sans-serif-monospace fonts that all websites MUST use, period, no discussion. As you put it:

> fonts that get your attention suck

If I can tell the difference between your font and the system default font, your font sucks; if I can't tell the difference, what's the damned point?

> the single system default sans-serif and single system default sans-serif-monospace fonts that all websites MUST use, period, no discussion.

The web standards allow a website to use any WOFF (or WOFF2) font they wish to use. Please see https://www.w3.org/TR/css-fonts-3/

The web standards are wrong. This shouldn't be surprising, since they also allow a website to use javascript and cookies.

Well, if it makes you feel any better, my website renders just fine on Lynx (no Javascript nor webfonts needed to render the page), complete with me putting section headings in '==Section heading name==', which is only visible in browsers without CSS. Browsers with modern CSS support see the section headings as a larger semibold sans-serif, to contrast with the serif font for body text. [1]

[1] There are some rendering issues with Dillo, which made the mistake of trying to support CSS without going all the way and making sure that http://acid2.acidtests.org renders a smiley face, but even here I made sure the site still can be read.

[2] Also, no cookies used on my website. No ads, no third party fonts, no third party javascript, no tracking cookies, nothing. The economic model is that my website helps me get consulting gigs.

[3] I do agree with the general gist of what you’re trying to say: HTML, Javascript, and CSS have become too complicated for anything but the most highly funded of web browsers to render correctly. Both Opera and Microsoft have given up with trying to make a modern standards compliant browser, because the standards are constantly updating.

> Well, if it makes you feel any better, my website renders just fine on Lynx

It doesn't; I only use lynx when someone tricks apt-get into updating part of my graphics stack (xorg, video drivers, window manager, etc) and research is needed to figure out how to forcibly downgrade it, and then only because I can't use a proper browser without a working graphics stack.

> the general gist of what you're trying to say: HTML, Javascript, and CSS have become too complicated for anything but the most highly funded of web browsers to render correctly.

This is subtly but critically wrong; I am saying that it is necessary that web browsers do not render websites 'correctly'. The correct behaviour is to actively refuse to let websites specify hideous fonts, snoop on user viewing activity, or execute arbitrary malware on the local machine.

> Browsers with modern CSS support see [...] the serif font for body text.

My point exactly.

"not getting your attention" and "can't tell the difference" are not the same thing.

Fair nitpick - "haven't noticed the difference yet" would be more accurate - but I don't see how that changes the argument; if I haven't noticed a difference, what's the point?

The trouble is that the defaults tend not to be the best fonts that are available, and very few users change them. I have changed them myself, but I don’t know of anyone else that has.

For myself, I wish that people would leave Arial, Verdana, Helvetica Neue, Helvetica, &c. out of their sans-serif stack, having only their one preferred font and sans-serif, or better still sans-serif alone; but as a developer I understand exactly why they do it all.

Unfortunately, I'm one of those developers :( My font stack is:

  font-family: system-ui, Helvetica, sans-serif;
for prose and

  font-family: ui-monospace, Menlo, monospace;
for monospaced text. The first being the user's preferred font, the second as a good (IMO?) default that I impose on them, and the third as a full fallback. I'm conflicted on whether this is the right balance between user choice and handling browsers that support nothing.

: "in a world where you're self-hosting all of your scripts."

Anything self-hosted is already fragile: it will go away when you don't continue to actively maintain it (paying for a domain, keeping a computer connected to the Internet etc.) or when you die.

I've outlived and outhosted all of the third party hosts I've used.

I guess you can last 10 years, which is apparently what "This Page is Designed to Last" aspires to, but what if we have greater ambitions? Like 100 years?

Then I think you need something like archive.org.

You can design and mark-up content that will still be useful and readable in 100 years. You might be able to preserve the presentation logic (CSS-style) for 100 years.

You probably won't be able to preserve the interaction design for 100 years (without a dedicated effort -- that's why they bury computers along with the software in time capsules).

But I think it is optimistic to think that _most_ SaaS hosts are going to archive content for 100 years. Preserving digital content is an _active_ process. It takes resources and requires deliberate effort.

Postscript: I'm trying to think of modern companies that would preserve content for 100 years, assuming they make it that far.

Facebook is the only significant current platform that I can even imagine preserving content for 100 years, but even that seems like a stretch. Historians might step in to archive it, but is there real value to Facebook to maintain and publish 50 year old comments on 2.5 billion unremarkable walls?

Twitter won't. Certainly Insta, SnapChat, WhatsApp etc. won't. Flickr probably could do it relatively easily but won't. YouTube maybe, but there's more to store. Something like GitHub maybe?

Repo hosts are quite vulnerable to storage abuse as well as simply accruing genuine old content.

I can see a deletion heuristic that considers both account activity and repo activity being deployed within the next 10 years.

However I expect another evolution in SCM in the same timeframe.

Right, look at SourceForge. There's a lot of broken links and/or references to no-longer accessible content in some of the older Apache.org projects too.

Also maybe cvs/svn/git repos generally don't contain content worth preserving for 100 years. There are some historically significant or interesting repos, but for the most part you'll have a bunch of unremarkable (and duplicated) code that may not have run then and certainly won't run now.

> a bunch of unremarkable (and duplicated) code that may not have run then and certainly won't run now

100 years is a long time, but I do run 20+ years old Common Lisp libraries and expect them to work without modification; I'd be really pissed if they disappeared from the Internet because someone thought that 5 years of inactivity means something doesn't work anymore.

If you have built something worth lasting 100 years, other people will help you ensure that it does. That reduces the concerns in this article considerably.

When I studied media science, one of the most lasting experiences I had was a talk with a lady from the Viennese film museum (one of the few film museums that store actual films instead of film props).

As a digital native I had never given it a thought, but she told me that there is a collective memory gap in films that have been shot or stored digitally. With stuff that had been stored on film, there was always some copy in some cellar and they could make a new working copy from whatever they found. With digital technology this became much harder and costlier for them, because it often means cobbling together the last working tape players and maintaining both the machines and the knowledge of how to maintain them. Add stuff on hard drives, in a hundred different codecs that won’t run on just any machine, and this combines into something she called the digital gap.

I had never thought about technology in that way. Nowadays this kind of robustness, archivability and futureproofing has become a factor that drives many of my decisions when it comes to file formats, software etc. This is one of the main reasons why I dislike relying solely on cloud based solutions for many applications. What if that fancy startup goes south? What happens to my data? Even if they allow me to get it in a readable format, couldn’t I just have avoided that by using something reliable from the start?

I grew to both understand and like the unix mantra of small independent units of organization — trying as hard as possible not to make software and other things into an interlinked ball of mud that falls apart once one of the parts stops working for one reason or another. Thinking about how your notes, texts, videos, pictures, software, tools etc. will look in a quasi post apocalyptic scenario can be a healthy exercise.

On this subject you can dive into the story of the missing "Doctor Who" TV serials.

Some master tapes were infamously reused to store other content. Besides, the whole archive problem comes from the reusable nature and scarcity of the chosen storage. I think I've read something about reusing paper as well in medieval time.


> I think I've read something about reusing paper as well in medieval time.

This mostly happened with parchment, not paper, but otherwise you are right. It is called a palimpsest.[1] Sometimes the writing under the writing can be reconstructed as happened with the oldest copy of Cicero's Republic.[2]

[1]: https://en.wikipedia.org/wiki/Palimpsest

[2]: https://www.historyofinformation.com/detail.php?entryid=3059

There is an old observation that I found striking at the time:

Newer methods of storing information tend to be progressively easier to write, and progressively less durable.

(The following is not really in chronological order)

You'll never look at stone tablets the same way again. As primitive as they are, their longevity can be amazing. Ancient emperors and tyrants knew what they were doing. Trajan's column from 113 AD is our main source on the Roman legionary's iconic equipment.

Cuneiform tablets were heavy and awkward, but they were 3D so there was no paint to worry about.

Parchment tends to be more durable than papyrus, and paper. Perhaps the best known among the Dead Sea Scrolls was made out of copper.

Iron Age culture artifacts are harder to find than Bronze Age ones, because bronze is more resistant to corrosion.

CDs, especially(?) from home burners, are reported to oxidize after several years. That may still be better than tapes, hard drives and other magnetic media (SSD?) which can be wiped by an EMP. The internet era information storage appears to come with an upkeep cost! Slack practically doesn't archive messages by default. Until Gmail, it was typical for email servers to delete old messages.

People get used to novelty and things being ephemeral. Capitalism supposedly requires low durability goods so people keep buying them, including tools and clothes. Houses are poorly built and break down pretty quickly.

I find it amazing people used to decorate their homes, tools, clothes with ornaments, engravings etc. You'd be a fool to do that today, you don't even know how long that thing is going to last.

Interesting analogy. I am having the opposite problem. I have a shoe box half filled with miniDV tapes. The camera is long gone. I would like to transfer these to a hard drive. Services that offer to digitize your tapes are just too expensive for me. Most of the tapes are probably just goofing around, and there is the issue of privacy. With my current camcorder I just plug the SD card into the computer and copy across.

Get an old miniDV player which can transmit the video over FireWire: https://www.quora.com/What-is-the-best-way-to-transfer-mini-...

Ask Whovians (https://en.wikipedia.org/wiki/Doctor_Who_missing_episodes).

Post internet, most content is globally replicated. Someone somewhere will find time and energy to make an Amiga simulator with exactly correct bugs, to run the program you want. The amount of content lost in proportion to the amount of content created must have gone down dramatically.

Sorry for cross interfering with this post timeline ;)

> With digital technology this became much much harder and costly for them, because it often means cobbling together the last working tape players

I think something might be getting lost in translation. Could she have meant “electronic” rather than “digital” (which to me suggests digital media such as DVD etc)

This whole anecdote makes more sense to me with this substitution.

She was referring to both. She said they have similar problems with CDs and even stuff on hard disks, because the video codecs used are often hard to get running without the right knowledge and resources, especially because some of the non-consumer codecs were often also proprietary and sort of made for specific platforms, but I don't know too much on that, so take this as speculation.

Yes, when you equate it with “both” it all makes more sense! It's just that digital, as I’m used to the term, excludes the analogue stuff like VHS.

Obligatory xkcd entitled "Digital Resource Lifespan" https://www.xkcd.com/1909/

The author says:

"that formerly beloved browser feature that seems to have lost the battle to 'address bar autocomplete'."

But at least in Firefox, if you type "*" then your search terms in the URL bar, it actually queries your bookmarks!

There are many such operators: you can search in your history ("^"), your tags ("+"), your tabs ("%"): https://support.mozilla.org/en-US/kb/address-bar-keyboard-sh...

My favorite is "?", which is not documented in this link. It forces search instead of resolving a domain name.

E.G: if I type "path.py", looking for the python lib with this name, Firefox will try to go to http://path.py, and will show me an error. I can just add " ?" at the end (with the space) and it will happily search.

It's a fantastic feature I wish more people knew about.

It's very well done as well, as you can use it without moving your hands from the keyboard: Ctrl + l gets you to the URL bar, but Ctrl + k gets you to the URL bar, clears it, inserts "? ", then lets you type :)

It's my latest FF illumination, the previous one was discovering that Ctrl + Shift + t was reopening the last closed tab.

Not sure you're aware of this one too... But, you might like the "Ctrl+Tab" shortcut as well. With it you can alternate between the last few active tabs, with thumbnails. Really handy.

I don’t think I’ve come across a single Firefox user that ever uses keyboard shortcuts that has left it that way—all have found the “Ctrl+Tab cycles through tabs in recently used order” preference and turned it off, so that it goes through tabs in order, like literally every other program I’ve ever encountered does with tabs. (Yes, Alt+Tab does MRU window switching, but that has never been the convention for Ctrl+Tab tab switching.)

Mind you, MRU switching is still useful behaviour; Vim has Ctrl+^ to switch to the alternate file which is much the same concept, and Vimperator et al. used to do the same (on platforms where Alt+number switched to the numbered tab, rather than Windows’ Ctrl+number), no idea whether equivalent extensions can do that any more. I have a Tree Style Tab extension that makes Shift+F2 do that, and it suits me.

If you keep Control+Tab set to cycle through tabs in recently used order, you can use Command-Shift-Left/Right or Control-PageUp/PageDown to cycle through tabs in tab-bar order instead.

Additionally, you don't need an extension to jump to a tab anymore. Command-[1-8] goes to that number tab in the current window, where 1 is the leftmost tab. Command-9 goes to the rightmost tab.

At least on linux, that shortcut does not use an MRU ordering; I see other replies mentioning the command key; is this behavior Mac specific?

Thank you so much for this! Another handy and often overlooked feature is the shortcuts for bookmarks. And with %s in the URL you can search/navigate pretty fast. Example: https://en.wikipedia.org/wiki/%s with the shortcut "w" could bring you the corresponding article if you type "w foobar"
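To make the %s substitution concrete, here is a rough Python sketch of what a keyword bookmark does. The KEYWORDS table and the expand_keyword name are made up for illustration, and the percent-encoding is my assumption; the browser's real implementation may encode the query differently.

```python
from typing import Dict, Optional
from urllib.parse import quote

# Hypothetical keyword table, mirroring the "w" example above.
KEYWORDS: Dict[str, str] = {
    "w": "https://en.wikipedia.org/wiki/%s",
}


def expand_keyword(typed: str, keywords: Dict[str, str] = KEYWORDS) -> Optional[str]:
    """Return the URL for 'keyword query', or None if no keyword matches."""
    keyword, _, query = typed.partition(" ")
    template = keywords.get(keyword)
    if template is None or not query:
        return None
    # Assumption: the query is percent-encoded before replacing %s.
    return template.replace("%s", quote(query, safe=""))
```

Typing "w foobar" would then map to the Wikipedia URL for "foobar", and anything without a known keyword falls through to normal address bar handling.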

I would actually shift this quite a bit to say if you’re designing your page to last 10 years, put it on the internet archive on day 1.

Invite them to crawl it, verify the crawl was successful, and even talk about that link on your page.
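For the "verify the crawl was successful" part, something like the following Python sketch could work. It assumes the Wayback Machine availability endpoint at archive.org/wayback/available and the JSON shape it has returned in the past (archived_snapshots, then closest); check the Internet Archive's current documentation before relying on it. The function names are my own.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Wayback Machine availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def closest_snapshot(payload: dict) -> Optional[str]:
    """Pull the closest snapshot URL out of an availability response."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def find_snapshot(page_url: str, timeout: float = 10.0) -> Optional[str]:
    """Ask the Wayback Machine whether it holds a snapshot of page_url."""
    query = urlencode({"url": page_url})
    with urlopen(f"{AVAILABILITY_ENDPOINT}?{query}", timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

If find_snapshot returns None for your page, the crawl has not happened (or has not been indexed yet), and the returned URL is what you would mention on the page itself.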

It removes the risk of domain hijacking, hosting platforms shuttering, and the author losing interest. P.s. The internet archive is doing excellent work. Support them.

And as you give a content donation, please also consider a monetary donation to keep the lights on at the Internet Archive: https://archive.org/donate/

I started a recurring donation through your link. Thanks for posting this.

Great idea!

Make sure that archive.org - the Internet Archive - catches your website in "The Wayback Machine". Catering to that is a pretty good strategy for archiving for at least the next couple of decades, considering that institute's staying power.

And on that note - consider donating to them.

The Internet Archive is a fantastic resource. And right now, they happen to match every donation two to one (so $5 becomes $5 + 2 * $5 = $15!)

How do they match a donation to themselves?

They currently have a deal with a donor who will donate $2 for every $1 the Archive gets in that time period.

Just sign such a deal with two donors and boom, feedback loop, exponential growth, infinite money!!

A shift to independent publishing is needed. I used to have sites that died because the upkeep became tiresome, and if I - a professional developer with almost 25 years experience of writing web applications - find it tiresome, can we blame people for wanting to use the big platforms?

I think using a static site generator might be OK. Common headers and footers help, and RSS might definitely be a good thing, but that seems to be dying.

One idea from this article I liked was "one page, over many". I don't think he meant have one single page on your website, but rather one per directory, and like he has with this article have one directory for a thought or essay or piece of something you want documenting, and just have an index.html in it.

I like this because I think the one thing that has killed off most personal websites is not the tech tool chain, but that "blogging" created an expectation of everybody becoming a constant content creator. The pressure to create content and for it to potentially "go viral" is one of several reasons I just tore down several sites over the years.

Around this time of year I take a break from work and think about my various side projects, and sometimes think about "starting a blog again". I often spend a few hours fiddling with Jekyll or Hugo, both good tools. Then I sit and think about the relentless pressure to add content to this "thing".

I like this idea instead though. No blogs. No constant "hot takes" or pressure to produce content all the time. Just build a slowly growing, curated, hand-rolled website.

I still think there might be a utility in having a static site build flow with a template function, but a simple enough design could be updated with nothing more than CSS.

Bit to think about, here... interesting.

I use a combination of asciidoc and hugo to generate my static website. It means that I can easily regenerate the website using whatever tool I want in the future or even just easily update the template for the existing site. If something happens to asciidoc, there are lots of converters that would allow me to move to another format or presumably some format in the future. Markdown and reStructuredText are also really good options.

I don't think there's any good solution to the dead link problem. For example there are 11 links in this article:

How many of these will still be alive in 10 years? How many times do you have to fix your page to make your page "last"?

I think the Stack Overflow guidelines have "solved" this problem in about the cleanest way currently possible: expect links to die, and include the relevant information in your answer.

If the link still works when it gets clicked on that's a bonus, but it shouldn't need to be available for the content you're reading to be understandable.

And directly quoting the information has been endangered by modern copyright directives.

Will you still be allowed to do that in ten years? Or will aggressive takedown policies have forced a shift?

> And directly quoting the information has been endangered by modern copyright directives.

I guess paraphrasing or summarising hasn’t been prohibited.

> will aggressive takedown policies have forced a shift?

Shifting to paraphrasing or to summarising doesn’t sound too bad.

That's not always possible without changing the subtleties and basic meaning, or even the intention, especially when you are discussing and questioning what the author wanted to say with their sentence or paragraph. This is particularly tricky if you write political commentary on your site. A politician's blog post or social media posts can't be quoted, the wording is often vague to begin with, and often said politician removes the post after backlash so you wouldn't be able to link it.

In the case of StackOverflow, if you can’t explain an answer in your own words, then you don’t really know the answer and no one should rely on your links to a supposedly authoritative source.

And there are also the HTTP 3xx redirect codes if content has been moved.

These work only if you move stuff around on the same website. If you switch domains you can't just ask the new domain owner to redirect requests to your new website.

When a website disappears completely, there is no web server to respond with an HTTP 3xx redirect.
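For the case where the old server is still yours, a permanent redirect (status 301 plus a Location header) is only a few lines. This is a minimal sketch using Python's standard http.server; the MOVED table, its example paths and the port are all hypothetical, and a real site would normally configure this in its web server instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, Optional

# Hypothetical mapping from old paths to their new locations.
MOVED: Dict[str, str] = {
    "/old-essay.html": "https://example.org/essays/designed-to-last/",
}


def lookup_redirect(path: str, moved: Dict[str, str] = MOVED) -> Optional[str]:
    """Return the new location for a moved path, or None if unknown."""
    return moved.get(path)


class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        target = lookup_redirect(self.path)
        if target is None:
            self.send_error(404, "Not Found")
            return
        # 301 tells clients and crawlers the move is permanent.
        self.send_response(301)
        self.send_header("Location", target)
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Run the redirect server until interrupted."""
    HTTPServer(("", port), RedirectHandler).serve_forever()
```

As the replies point out, this only helps while something still answers at the old address.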

Put a wayback machine link in parentheses/superscript after every link in the page?
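A quick Python sketch of that idea, assuming the usual web.archive.org/web/&lt;timestamp&gt;/&lt;url&gt; address pattern and using a regex over the HTML. The regex is a shortcut that only handles plain double-quoted absolute hrefs, and the function names are mine, so treat it as an illustration and not a robust rewriter.

```python
import re

# Assumed URL pattern for Wayback Machine captures.
WAYBACK_PREFIX = "https://web.archive.org/web/"

# Matches <a ... href="http(s)://...">...</a> with a double-quoted href.
_LINK = re.compile(
    r'(<a\s[^>]*?href="(https?://[^"]+)"[^>]*>.*?</a>)',
    re.IGNORECASE | re.DOTALL,
)


def wayback_url(url: str, timestamp: str = "") -> str:
    """Build a Wayback URL; an empty timestamp leaves the capture unspecified."""
    if timestamp:
        return f"{WAYBACK_PREFIX}{timestamp}/{url}"
    return f"{WAYBACK_PREFIX}{url}"


def annotate_links(page: str, timestamp: str = "") -> str:
    """Append an '(archived)' link after every external link in the page."""
    def add_archive_link(match: "re.Match[str]") -> str:
        original, target = match.group(1), match.group(2)
        if target.startswith(WAYBACK_PREFIX):
            return original  # already points at the archive
        return f'{original} (<a href="{wayback_url(target, timestamp)}">archived</a>)'

    return _LINK.sub(add_archive_link, page)
```

Passing the date you wrote the page as the timestamp (for example "20191219") should point readers at a capture close to what you actually linked to.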

Of course, you can't know how long the Wayback Machine itself will continue to exist, either.

I think we will simply have to assume it will continue. That is, if anything will continue, archive.org and similar projects whose primary goal is to preserve and prevail are easily the prime candidates.

and we don't know when it will be the end of the world so we should all stop breathing now

There's no great solution, but there are things you can do to help, for example:

- take an 'archive copy' of anything you link to so you can host it yourself if it goes away (copyright issues to consider, of course)

- automate a link-checking process so you at least know as soon as a target disappears

- only link to 'good' content (you can feed back results from previous step to approximate this on a domain-basis over time)

(although these things require a build process, which the article's author is against)
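The link-checking step in that list can be fairly small. Here is a sketch in Python using only the standard library; the names (extract_links, check_link, dead_links) are made up, it only looks at absolute http(s) links in anchor tags, and it uses HEAD requests, which some servers reject, so expect a few false alarms.

```python
from html.parser import HTMLParser
from typing import List, Optional
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


class LinkExtractor(HTMLParser):
    """Collect absolute http(s) hrefs from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.append(value)


def extract_links(page: str) -> List[str]:
    parser = LinkExtractor()
    parser.feed(page)
    return parser.links


def check_link(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status code, or None if the host is unreachable."""
    request = Request(url, method="HEAD", headers={"User-Agent": "link-check-sketch"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        return error.code  # server answered, but with an error status
    except (URLError, OSError):
        return None


def dead_links(page: str) -> List[str]:
    """List links that are unreachable or answer with a 4xx/5xx status."""
    dead = []
    for url in extract_links(page):
        status = check_link(url)
        if status is None or status >= 400:
            dead.append(url)
    return dead
```

Run from cron or by hand every few months, dead_links gives you the list of targets to replace with an archive copy.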

As many times as the article literally tells you will be necessary? One of the key points about links it makes is to use a link checker if you do link out.

Let's be honest with ourselves. The best way to make your content last for a long time is to host it on a platform that is free and very successful. For example, whatever photos I posted on Facebook 12 years ago? Still alive and kicking. The articles I've published on wordpress.com 7 years ago? Still in mint condition, with 0 maintenance required.

In comparison, the websites that I've built and hosted or deployed myself, have constantly required periodic work just to "keep the lights on". I went out of my way to make this as minimal and cheap as possible, but even then, it hasn't been nearly as simple as the content I've published on wordpress.

At some point, people's priorities change. Perhaps due to new additions to the family, medical circumstances, or even prolonged unemployment. And when that happens, even the smallest amount of upkeep, whether it is financial, technical or simply logistical, becomes something they have no interest in engaging with.

If we really want our content to last, not just for 10 years but for a generation, our best bet is to publish it on a platform like wordpress.com. One which requires literally zero maintenance, and where all tech infrastructure is completely abstracted away from you. I know this isn't going to be a popular idea with the HN crowd, and I do not blame anyone at all for wanting to keep control over their content. People are free to optimize along whatever dimensions they wish. But if I had to bet on longevity, I would bet every time on the wordpress article over the self-hosted one.

>Let's be honest with ourselves. The best way to make your content last for a long time is to host it on a platform that is free and very successful. For example, whatever photos I posted on Facebook 12 years ago? Still alive and kicking. The articles I've published on wordpress.com 7 years ago? Still in mint condition, with 0 maintenance required.

Your view of the timeline is too short. We're not talking about keeping something online for 7 years, but for 70. If I had followed your advice a few years ago, I would have deployed on Geocities. Do you know what happened to those websites?

The question is, is wordpress going to be around in 70 years? No one knows. But that static HTML page will still render fine, even if it is running in a backward compatibility mode on your neurolink interface.

> The question is, is wordpress going to be around in 70 years? No one knows. But that static HTML page will still render fine

The question isn't whether wordpress will be around in 70 years, but whether it will outlast your self-hosted website. Anything that is self-hosted requires significantly more financial/logistical maintenance, and what is the likelihood of someone continuing to do that for 70 years?

For me it's very easy because those domains are also tied to my email and all of my other hosted services (gitea, tt-rss, etc.) all use the same domain. So it's very easy to remember to keep them all alive and active. I've had domain names active far longer than Wordpress has existed.

Photos you posted on Facebook 12 years ago are generally inaccessible to the public; you need to be a Facebook user to see most stuff on Facebook after all.

Also, in most cases even you would have a very hard time accessing them, unless you somehow "pinned" them not to be far far down the scroller.

You can export your WordPress website to static HTML easily with the help of a free plugin.

The article addresses this point. This is the kind of hurdle that makes pages not last. The point is that yes, we can spend time every few years migrating and maintaining, but we shouldn't have to.

Clearly the only reasonable way is to store it in a Mainframe...in EBCDIC

It'll live forever...


Never underestimate the bandwidth of a station wagon full of punch cards.

If Jekyll died tomorrow, I still have the HTML to keep the website running, more or less. It's a build step in my pipeline but not one that abstracts it in such a way that I cannot use the final product as my archive. I'm not sure that a CMS could let me do the same.

Not to mention that you’d still be able to use Jekyll yourself for as long as you’d like, it being locally installable and open source.

>to host it on a platform that is free and very successful

Yes, from a technical POV, but what about deplatforming? I think it is a bigger risk to lose data than any framework/technology deprecation. I would definitely not rely on any platform keeping my data.

The issues outlined here are one of the reasons that I am moving as many of my workflows to org-mode as possible. Everything is text. Any fancy bits that you need can also be text, and then you tangle and publish to whatever fancy viewing tool comes along in the future.

I don't have a workflow for scraping and archiving snapshots of external links, but if someone hasn't already developed one for org I would be very surprised.

In another context I suggested to the hypothes.is team that they should automatically submit any annotated web page to the internet archive, so that there would always be a snapshot of the content that was annotated, not sure whether that came to fruition.

In yet another context I help maintain a persistent identifier system, and let me tell you, my hatred for the URI spec for its fundamental failure to function as a time invariant identifier system is hard to describe in a brief amount of time. The problem is particularly acute for scholarly work, where under absolutely no circumstances should people be using URIs or URLs to reference anything on the web at all. There must be some institution that maintains something like the URN layer. We aren't there yet, but maybe we are moving quickly enough that only one generation worth of work will vanish into the mists.

> The issues outlined here are one of the reasons that I am moving as many of my workflows to org-mode as possible. Everything is text.

That works for some, even most people. Unfortunately, the content I create will inevitably cite material in languages other than the main document language. That means that I have to heavily use HTML span lang="XX" tags to set the right language for those passages, so that (among other things) users with screenreaders will get the right output. As far as I know, org-mode lacks the ability to semantically mark up text in this way.

If it is for blocks of text then you could use #+BEGIN_VERSE in combination with #+ATTR_HTML, or possibly create a custom #+BEGIN_FRENCH block type. I suspect, though, that you are thinking about inline markup, in which case you have two options: one is to write a macro {{{lang(french,ju ne parle frances)}}}, and the other would be to hack the export-snippet functionality so you could write @@french:ju ne parle frances@@ and have it do the right thing when exporting to html. The macro is certainly easier, and if you know in advance what languages you need it shortens to {{{fr(ju ne parle frances)}}}, which is reasonably economical in terms of typing.

Maybe I'm dense, but I'm having trouble understanding what is so difficult about keeping content around. It seems like the issue of webpack and node and all the other things he mentions in the article aren't really problems with content per se. You can just publish your thoughts as a plain text file or markdown or whatever and you're good to go. I'm having a hard time thinking of types of content that are really tied to a specific presentation format which would require a complex scaffolding. A single static page with your thoughts is sufficient and should require no maintenance to keep around. I do agree though that even static site generators create workflows that get in the way. I'd love to see an extreeeemely minimal tool which lets you drop some files in a folder and then create an index page that links to those. You could argue that's what static site generators pretty much do, but they do seem to be more complex than that in practice. Remember deploying a web site with FTP? I have to say that was simpler for the average person than what we have today. I think that, in some ways, the complexity is what ends up pushing people towards FB, Medium, etc as publishing platforms.

"I'd love to see an extreeeemely minimal tool which lets you drop some files in a folder and then create an index page that links to those."

I use the tree command on BSD to do just that. It has the option of creating html output with a number of additional options.

An example: tree -P '*.txt' -FC -H http://baseHREF -T 'Your Title' -o index.html

Oh wow thanks for sharing. I use tree all the time, but had no idea you could do this. baseHREF should be the full root domain, e.g. example.com?

Yeah. And play around with it. It's quite flexible. Color, CSS and other goodies. Takes me about 2 seconds to update my entire site...
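If tree isn't installed, the same "drop files in a folder, get an index page" idea is a short script. This Python sketch is one way to do it; render_index and write_index are made-up names, the markup is deliberately bare, and it assumes the index sits next to the files it links to.

```python
import html
from pathlib import Path
from typing import Iterable
from urllib.parse import quote


def render_index(names: Iterable[str], title: str = "Index") -> str:
    """Render a bare HTML page with one link per file name."""
    items = "\n".join(
        # quote() makes the href URL-safe, html.escape() the visible text.
        f'<li><a href="{quote(name)}">{html.escape(name)}</a></li>'
        for name in names
    )
    safe_title = html.escape(title)
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n'
        f'<head><meta charset="utf-8"><title>{safe_title}</title></head>\n'
        "<body>\n"
        f"<h1>{safe_title}</h1>\n"
        f"<ul>\n{items}\n</ul>\n"
        "</body>\n"
        "</html>\n"
    )


def write_index(folder: str, pattern: str = "*.txt", title: str = "Index") -> Path:
    """Write index.html into folder, linking every file matching pattern."""
    root = Path(folder)
    names = sorted(p.name for p in root.glob(pattern) if p.is_file())
    out = root / "index.html"
    out.write_text(render_index(names, title), encoding="utf-8")
    return out
```

Calling write_index(".", title="Your Title") after adding a file regenerates the page, and the output is plain HTML you can upload over (S)FTP.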

> I'd love to see an extreeeemely minimal tool which lets you drop some files in a folder and then create an index page that links to those.

Don’t most web servers do this already?

Good point haha

It seems like in practice the biggest problem is "it got deleted", and everything else is about either preventing others from deleting your stuff or preventing yourself from deleting it out of laziness or frustration.

Deploying a web site with (S)FTP works as well as it ever did... and is just as obscure to non-technical people as it ever was. Ease of use means loss of control.

> Ease of use means loss of control.

It'd be a cool challenge to build something so simple that even a non-tech person could use which allows them to maintain control and ownership. Any good examples of tech in general that is highly approachable like this? Even things like WordPress are too complicated for most - maybe if not self-hosted it's not so difficult, but still falls short in terms of being complex and not just simple text or html (at the most)..

Onionshare has this feature, but the website is only accessible over Tor.

Oh cool, didn't know about that. Be cool to see something like this which is more approachable for non-tech people. I think the tor part of this, at least in its current state, is too much for the average person.
