There's no reason why a web browser bookmark action doesn't automatically create a WARC (web archive) format.
Heck, with the cost of storage so low, recording every webpage you ever visit in searchable format is also very realistic. Imagine having the last 30 years of web browsing history saved on your local machine. This would especially be useful when in research mode and deep diving a topic.
EDIT: I forgot to mention https://github.com/webrecorder/webrecorder (the best general purpose web recorder application I have used during my previous research into archiving personal web usage)
This was what made me convert from bookmarking to clipping pages into Evernote around 6-7 years ago. I realized I had this huge archive of reference bookmarks that were almost useless because 1) I could rarely find what I was looking for, if I even remembered I'd bookmarked something in the first place, and 2) if I did, it was likely gone anyway. With Evernote I can full text search anything I've clipped in the past (and also add additional notes or keywords to ease in finding or add reference info).
Since starting with replacing bookmarks, I've moved other forms of reference info in there, and now have a whole GTD setup there as well, which is extremely handy since I can search in one place for reference info and personal tasks (past and future). Only downside is I'm dependent on Evernote, but hopefully it manages to stick around in some form for a good while, and if it ever doesn't, I expect I'll be able to migrate to something similar.
I was an Evernote user when I was on macOS. When I switched to Linux, a proper web clipper was something I really missed. I'm now on Joplin and it does everything I used to use Evernote for and then some.
It even has vim bindings now!
As far as longevity goes, I think they got their archive / backup format right - it's just a tarball with markdown in it.
There is no need for proprietary code and apps: why not build it into browsers. I have seen Firefox and Chrome can download web pages. So it will be nicer if they can download the bookmarked pages and store in a local html, css, image folder. I think it's pretty easy to achieve.
Also people need to move away from those esoteric reactjs, angular, vuejs and plethora of CMS as API or static site generators relying on some js framework which won't last even 2-3 years. Use a static site generator which can generate a plain html, like static site generators built on pandoc, python docutils or similar.
Personally I like restructuredText as the preferred format for content as its a complete specification and plain text. So the only thing in this article I will change is that content can also be in rst format and then generate html from it. Markdown is not a specification as each site implements their own markdown directives unlike restructuredtext specification and most of the parsers and tooling are little different from each other.
> Personally I like restructuredText as the preferred format for content as its a complete specification and plain text.
I have used rst intensively on a project. A few years later, I would be hard pressed to write anything in it and would need to start with a Quick Start tutorial. With all its faults, Markdown is simple enough that it can be (and is) used anywhere, so there is no danger of me forgetting its syntax (even if it wasn't much simpler to start with).
It is still not a specification like restructuredText[1]. Also wikiMarkup (which really started this markdown) is different from GitHub markdown, which is different from other markdown editors. Also many sites use their own markdown versions.
If you are in the restructuredText world there is one specification and all implementations adhere to it, be it pandoc, sphinx, pelican, nikola. The beauty of it is that it has extension mechanisms which provide enough room for each tool to develop it. But the markup can be parsed by any tool.
Because it's inherently appealing, close to what you wanted intuitively, and if you're only dealing with a single implementation of it, works fairly well.
You don't really get bit by its lack of a standard and extensibility until after you've bought in.
It's essentially designed by the opposite of a committee -- rather than including everything but the kitchen-sink, it contains support for almost no usecases except the one. Which is very appealing, when you only have the one usecase.
Well rst is better than markdown from day one. The only reason it became famous is thanks to wikimarkup.
So markdown needs to thank the popularity of Wikipedia for its success, as rst did not have any application like Wikipedia. But still, rst is used widely enough thanks to its killer apps Sphinx and readthedocs, and it's now kind of the de facto markup for writing documentation in Python and much of the open source software world.
Because you can teach someone markdown in five minutes. And even if they don't know all the ins and outs, the basics are pretty foolproof (paragraphs, headings, bold and italic).
This... I just found a plugin for the static site generator Pelican that is 7 years old that still works.
After running Pelican you get plain HTML that can be hosted anywhere.
I like Netlify, but other options like GitHub pages are also great.
The author recommends not putting on GitHub Pages because they haven't found a working business model and might not be here in the future.
But... GitHub has been taken over by Microsoft which is most likely not going bankrupt soon and Microsoft loves their backward compatibility so I am confident they won't screw GitHub up too much.
You could have said the same about GeoCities when it was acquired by Yahoo. But it didn't last, and now the same thing is happening with Yahoo Groups. So I am not hopeful that Microsoft will keep GitHub if it becomes a liability.
Joplin is open source, which is a big part of the sell to me. It definitely isn't the best of all possible note taking systems that could ever exist, but it's the best open source one I've found so far, and I don't have time to write a better one at the moment.
> why not build it into browsers. I have seen Firefox and Chrome can download web pages. So it will be nicer if they can download the bookmarked pages and store in a local html, css, image folder. I think it's pretty easy to achieve
This is solving a different problem though. WARC/MHT and other solutions can do this. Joplin is more of a note taking system that allows ingesting content from the web into one's own local notebook, which is relevant to what the GP post was talking about - Evernote.
However, it would seem that "the modern web" is the now popular standard. 10 years ago it might have been Flash or Java web applets or whatever. Now it's JS. I'm not convinced that JS is any better than what it has replaced. However, people keep paying developers to write them, so presumably someone likes them.
> Also people need to move away from those esoteric reactjs, angular, vuejs and plethora of CMS as API or static site generators relying on some js framework which won't last even 2-3 years. Use a static site generator which can generate a plain html, like static site generators built on pandoc, python docutils or similar.
Agreed, but that's also not a problem that Joplin, Evernote, or any other such tool is going to be able to solve. Unless you are complaining that Joplin is an Electron app? That's my biggest issue with it personally. It runs well enough, but is definitely the heaviest application I use regularly, which is a little sad for a note taking program. On the other hand, I haven't found a better open source replacement for _Evernote_. There are lots of other open source note-taking programs though.
> Personally I like restructuredText as the preferred format for content as its a complete specification and plain text. So the only thing in this article I will change is that content can also be in rst format and then generate html from it. Markdown is not a specification as each site implements their own markdown directives unlike restructuredtext specification and most of the parsers and tooling are little different from each other.
reST is indeed very nice. At one point, I kept my personal notes as a Sphinx wiki with everything stored in reST. I found this to be less ergonomic than Evernote/Joplin, although in principle it could do all the same things that Joplin can do, and then some.
Thanks a lot for the recommendation! I have been a little annoyed with Evernote not having an app for Ubuntu, which I recently started using quite heavily. So this looks very interesting!
> Only downside is I'm dependent on Evernote, but hopefully it manages to stick around in some form for a good while, and if it ever doesn't, I expect I'll be able to migrate to something similar.
I have used Evernote and OneNote, but have finally, after a long interim period, resorted to using only markdown.
I have a "Notes" root folder and organize section groups and sections in subfolders. VSCode (or Emacs), with some tweaks, shortcuts, and extensions, provides a good-enough markdown editing experience. Like an extension that allows you to paste in images, storing it in a resources folder in the note's current location (yes, I see small problems with this down the road when re-organizing, but nothing that can't be handled).
For Firefox, I use the markdown-clipper extension the few times I would like to copy a whole article, it works well enough. Or I copy/paste what I need; mostly, I take my own summarized notes.
For syncing, I use Nextcloud, which also makes the notes available both for reading and editing on Android and iOS (I use both).
Up until very recently, I used Joplin, which also uses markdown, but there were two things I could not live with: it does not store the markdown files with a readable filename, e.g., its title, and being tied to a specific editor.
If you are mostly clipping and not writing your own notes, I can imagine my setup won't work well, or be very efficient.
I want to use a format that has longevity, and storing in a format that I cannot grep is out of the question.
>if I even remembered I'd bookmarked something in the first place
I had recently participated in a discussion on the problem of forgetting bookmarks[1].
Copying my workflow from there,
1. If the entire content should be easily viewed, then store via pocket extension.
2. If only part of the content should be easily viewed, i.e. some snippet with a link to the entire source, then store in notes (Apple).
3. If the content seems useful in the future, but it is okay to forget it, then I store it in the browser bookmarks.
But my workflow doesn't address the problem raised by Mr. Jeff Huang: if the Pocket app or Notes disappears, so do my archives. I think a self-hosted archive, as mentioned by the parent, is the way to go, but I don't think it's a seamless solution for a common web browser user.
My solution for a small subset of the forgetting problem:
I frequently see something and want to try it out the next time I want to do something else. So I emulate User Agent strings and append lots of "like [common thing I search for a lot]" to the bookmark. When I start typing into the search bar for those other things I'll be reminded of the bookmark.
For example, since file.io is semi-deprecated I decided to try out 0x0.st . But I kept forgetting when I actually needed to transfer a file, so I made a bookmark titled "0x0.st Like file io".
As a side note, I have a similar bash function called mean2use that I use to define aliases that wrap a command and ask me if I'd like to do it another way instead or if I'm sure I want to use the command. I've found this is a nice way to retrain my habits.
Exported sure, it's all there. But importing that into your new favorite notes application is not going to be trivial, especially not for regular users.
That's why I've decided to stick mostly to regular files in a filesystem.
Presumably "regular users" will not be individually writing XML parsing code to convert the notes. The developers of their "new favorite notes application" will do it (and if they can't be bothered, maybe it shouldn't be your "new favorite notes application").
Joplin, for example, can import notes exported from Evernote. It's just a menu option that even regular users should have no trouble employing.
>There's no reason why a web browser bookmark action doesn't automatically create a WARC (web archive) format.
Indeed.
And I still remember the modem days where I would download entire websites because the ISP charged by the hour, and I'd read them offline to save money.
Yeah. In the dialup days, layers 1 and 2 of a home internet connection were a long-running phone call between your own modem and a modem of an ISP. You paid via your phone bill, for the duration of the call.
Personal wayback machines should be standard computing kit. I have had one since around 2013. Very bare-bones demo: https://bpaste.net/show/3FBH6 (it does much more than that). file:// is supported, for example, so you can recursively import a folder tree and re-export it later if you want to.
Or in some random script: "from iridb import iri_index" "data = iri_index.get('https://some/url')". I'm skipping lots: you can ref by hash, url, or url+timestamp. It hands back a file handle, since you don't know if the data you are referencing even fits in memory. Extensive caching, all the iri/url quirks, punycode, PSL, etc.
Some random pdf in ~/Downloads, "import doc.pdf" and dmenu pops up, you type a tag, hit enter and the pdf disappears into the hash tree, tagged, and you never need to remember where you put it. Later on you only need to remember part of the tag, and a tag is just a sequence of unicode words.
Chunks are on my GitHub (jakeogh/uhashfs; it's heka-outdated, don't use it yet). I'll be posting the full thing sometime soonish.
> SingleFileZ is a fork of SingleFile that allows you to save a webpage as a self-extracting HTML file. This HTML file is also a valid ZIP file which contains the resources (images, fonts, stylesheets and frames) of the saved page.
Whoa. I just installed SingleFileZ for FF and it is working great. Before I was using wget and that was clunky. This is working great since I can just toss a single file up on my server and we are good to go. Thanks for this!
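For anyone wondering how one file can be both an HTML page and a ZIP archive: ZIP readers locate the archive from the end of the file, so arbitrary bytes may sit in front of it. The sketch below only illustrates that property with Python's standard `zipfile` module. It is not SingleFileZ's actual layout or code (the real extension does more so that the page also renders and unpacks itself), and `write_polyglot` and `list_resources` are hypothetical helper names.

```python
import zipfile


def write_polyglot(path, html, resources):
    """Write HTML text followed by a ZIP archive of `resources` (name -> bytes)."""
    with open(path, "wb") as fh:
        fh.write(html.encode("utf-8"))
    # Opening a non-ZIP file in mode "a" appends a new archive after the
    # existing bytes, which is what makes the result valid as both formats.
    with zipfile.ZipFile(path, "a") as zf:
        for name, data in resources.items():
            zf.writestr(name, data)


def list_resources(path):
    """Return the names of the files stored in the trailing ZIP archive."""
    with zipfile.ZipFile(path) as zf:
        return zf.namelist()
```

Any ordinary unzip tool should be able to open the resulting file as well, since the archive's offsets are recorded relative to the start of the whole file.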
Off topic, but could I ask how you knew your software was being talked about? Did you just happen by or have you some monitoring agent looking for mentions? Just curious
Sorry, I didn't see your question. I check the posts on HN very regularly. The title of the post made me think that people might have been talking about SingleFile.
Sometimes, friends of mine tell me someone on the Internets is talking about SingleFile :). I also sometimes use the integrated search engine.
WorldBrain's Memex (https://addons.mozilla.org/en-US/firefox/addon/worldbrain/) has an option to perform a full-text index (not archive) of bookmarks, or pages you visited for 5 seconds (default) down to 1 second (no option to index all pages). It stores this stuff into a giant Local Storage (etc) database, which Firefox implements as a sqlite file.
Firefox actually purges history automatically. For instance, the oldest history I have on this browser right now is from January 2018. I found out about this the hard way.
I noticed this behavior in Firefox too. So I started writing personal Python scripts to scrape FF's SQLite database where it stores all the browsing history information.
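A minimal sketch of what such a script can look like, using only the standard library. It assumes the history lives in `places.sqlite` inside the profile directory, with visits in `moz_historyvisits` and pages in `moz_places`, and that `visit_date` is in microseconds since the Unix epoch; check these names against your own Firefox version before relying on them. `read_history` is a hypothetical helper name. Firefox keeps the database locked while it is running, so it is safer to point this at a copy.

```python
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


def read_history(places_path):
    """Return (visit_time, url, title) tuples, oldest visit first."""
    # Open read-only so a mistake here cannot modify the profile database.
    uri = Path(places_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            """
            SELECT v.visit_date, p.url, p.title
            FROM moz_historyvisits AS v
            JOIN moz_places AS p ON p.id = v.place_id
            ORDER BY v.visit_date
            """
        ).fetchall()
    finally:
        conn.close()
    # Assumed unit: microseconds since the Unix epoch.
    return [
        (datetime.fromtimestamp(usec / 1_000_000, tz=timezone.utc), url, title)
        for usec, url, title in rows
    ]
```

Writing the returned rows to a plain text or CSV file on a schedule gives you a copy of the history that browser-side expiration cannot touch.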
It looks like Firefox has been doing it since 2010[1]. I wonder how long Chrome has been doing it, since launch, 2008? Here's a Chrome bug discussing it[2].
I have this problem. Some bits of history are gone except from old backups of profile directories and profiles where I've already set places.history.expiration.max_pages to some absurdly high number.
I need to do a handful of experiments to see exactly how this interacts with Sync, even though I've (foolishly) already synced the important profiles. I'd hope that the cloud copies of the places database just keep growing, but in any case, I'd rather combine them all offline anyway.
Even if you set the setting, how can you be sure that it won't be reset on an upgrade or that you'll remember to set it if you need a new profile (perhaps your old one becomes buggy, crufty, corrupt, or all three)? I thought I had all my history retained until one day I couldn't find a website I knew I had visited years ago, and took a closer look at my history and was very unpleasantly surprised... What happened? I'll never know, but my suspicion is that Firefox reset the history retention setting at some point along the way. If you do any web dev, you know Firefox occasionally backstabs you and changes on updates. The only way to be sure over a decade-plus is to regularly export to a safe text file where the Mozilla devs can't mess with it. I can't undo my history loss, but I do know I have lost little history since.
I can't be sure. When I say 'combine them all offline', I mean using something like [1] which refuses to do anything for me because the Waterfox database version is a rather old Firefox version, and that seems to expect all the db's versions to be up-to-date and equal, which seems pointless. #include <sqlite3.h> was my next step-- only I don't walk very well, so that didn't happen "yet". Or I'm lazy, or distracted, or depressed, or something. When I recently got tired of realizing a thing was on the other machine, I bit the bullet and synced them, if only to see how well that worked.
I like the idea, but wanted to know how realistic it would be so I made a quick and dirty Python script to download all my bookmarks. If you want to make the same experiment, you can get it from here: https://gist.github.com/ksamuel/fb3af1345626cb66c474f38d8f03...
It requires Python 3.8 (just the stdlib) and wget.
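This is not the gist itself, just a minimal sketch of the same idea under a few assumptions: the bookmarks have been exported to an HTML file (the format browsers offer under "export bookmarks"), `wget` is on the PATH, and one directory holds all downloads. The names `extract_urls`, `wget_command` and `archive_all` are hypothetical.

```python
import subprocess
from html.parser import HTMLParser


class BookmarkLinks(HTMLParser):
    """Collect http(s) links from an exported bookmarks HTML file."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names, so <A HREF=...> works.
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith(("http://", "https://")):
                self.urls.append(href)


def extract_urls(bookmarks_html):
    """Return every http(s) bookmark URL found in the exported HTML text."""
    parser = BookmarkLinks()
    parser.feed(bookmarks_html)
    return parser.urls


def wget_command(url, dest_dir):
    """Build a wget call that saves one page plus the assets it needs."""
    return [
        "wget",
        "--page-requisites",   # also fetch images, CSS and scripts
        "--convert-links",     # rewrite links so the copy works offline
        "--adjust-extension",  # add .html/.css suffixes where missing
        "--span-hosts",        # assets often live on other hosts (CDNs)
        "--timeout=30",
        "--tries=2",
        f"--directory-prefix={dest_dir}",
        url,
    ]


def archive_all(bookmarks_html, dest_dir):
    """Download every bookmark; dead links are expected, so keep going."""
    for url in extract_urls(bookmarks_html):
        subprocess.run(wget_command(url, dest_dir), check=False)
```

Keeping the command builder separate from the download loop makes it easy to print the commands first and see what would be fetched before spending bandwidth.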
I have 3633 bookmarks, for a total of 1.5 GB unzipped and 1.0 GB zipped (and we know we can get more from a better algorithm and from using links to files with the same checksum, like for JS and CSS deps).
This seems acceptable IMO, especially since I used to consider myself a heavy bookmarker and I was stunned by how few I actually had and how little disk they occupied. Here are the types of the files:
It should probably be opt-in though, like a double click on the "save as bookmark" icon to download the entire page, after which the star becomes a different color. Mobile phones, Chromebooks and Raspberry Pis may not want to spend the space, not to mention that there is some bookmarked content that you don't want your OS to index and show you previews of in every search.
But it would be fantastic: by doing this experiment I noticed that many bookmarks were 404 now, and I will never get their content back. Besides, searching bookmarks and referencing them is a huge pain.
So definitely something I wish mozilla would consider.
> definitely something I wish mozilla would consider
There used to be this neat little extension called Read It Later that let you do just that. Bookmark and save it so you could read it when you were offline or the page disappeared. Later they changed their name and much later Mozilla bought it and added it to Firefox without a way to opt out. It was renamed to Pocket.
Pocket is not integrated with your bookmarks. For offline consultation, you need a separate app. Of course this app is not available on Linux, where you have to get some community provided tools.
Bookmark integration would mean one software, with the same UI, on every platform, and only one listing for your whole archive system.
Having to pay $15/month ($180/yr!) to be able to search stuff on my own computer for years seems awfully expensive. I'd rather depend on some simple open-source piece of software that I can understand and maintain if necessary.
Yeah, the sheer idea of paying a subscription for software that is running on my computer to index local resources is crazy. This kind of software should be sold as a one-time license.
Decades ago, when I worked at Lotus, there was an amazing piece of software from them called Magellan. I remember the first time I saw someone search and find results in text documents, spreadsheets and many other of the common formats of the day.
That was in 1989 and today I mostly search my computer using find and grep commands, since that's what just keeps working.
You can ask Safari to do that by enabling 'Reading List: Save articles for offline reading automatically'. It's not WARC but it is an offline archive. The shortcut is cmd-shift-D which is almost the same as the bookmark one. It's also the only way I know of to get Safari to show you bookmarks in reverse chronological order. And it syncs to iOS devices.
This could be done in better and more specialized ways; one problem is that browser extension APIs don't provide very good access to the browser's webpage-saving features.
Pinboard happens to be a web service run along the same principles as the article we're discussing.
The bus-factor risk is high, but I suspect that Maciej has a plan that'll let us download our archives even if he does get grabbed by the mainland Chinese government, let alone in a foreseeable going-out-of-business scenario.
Pinboard is a profitable online equivalent to a mom and pop shop. It’s sustainable and its founder isn’t chasing growth at all costs. It also has a cult following, so OOB is highly unlikely.
In practice how is this different from MHTML? I think most browsers have built-in support for MHTML so it should be possible to build that part easily.
The state of MHTML support is fairly pathetic at the moment. Firefox broke MHTML compatibility with the Quantum overhaul. Chrome's MHT support has been hit and miss over the years, sometimes removing the GUI option entirely and requiring one to manually launch the browser with a special flag to enable it. The only browser with a history of consistent MHTML support happens to be... Internet Explorer, followed by a bunch of even more obscure vendors that nobody really uses.
I am currently dealing with the problem of parsing large mht files (several megabytes and up). A regular web browser would hang and crash upon opening these files and most ready made tools I could find struggle with the number of embedded images. It's very much a neglected format with very little support in 2019.
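Since an MHTML file is a MIME multipart message, one option that avoids the browser entirely is Python's standard `email` package. The sketch below is a starting point under that assumption, not a tested tool for files of the size described: it reads the whole file into memory, and `list_mhtml_parts` is a hypothetical helper name.

```python
from email import message_from_bytes


def list_mhtml_parts(raw):
    """Return (content_type, content_location, size_in_bytes) per embedded part."""
    message = message_from_bytes(raw)
    parts = []
    for part in message.walk():
        if part.is_multipart():
            continue  # the container itself carries no payload
        # decode=True undoes base64 / quoted-printable transfer encoding.
        payload = part.get_payload(decode=True) or b""
        parts.append(
            (part.get_content_type(), part.get("Content-Location"), len(payload))
        )
    return parts
```

Typical use would be `list_mhtml_parts(open("page.mht", "rb").read())`, where the file name is a placeholder; the decoded payloads can then be written out one file per part.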
This is what started me clipping everything to OneNote instead of bookmarking. Unfortunately, it becomes difficult to maintain, the formatting is off, things subtly break, pages clipped on mobile use different fonts for god knows what reason, some content is discarded silently because the clipper deems it's not part of the main article, I could go on.
It's better than nothing but it's also increasingly frustrating to deal with.
I've actually been saving every page I visit for a good two years now and it has barely made a dent in my NAS storage space. As usual though, I wrote a crappy extension and Python script to do that because I never bothered to look online. Thanks for introducing me to WarcProxy - I'll probably be making the switch very soon.
The main disadvantage was disk space. This is particularly true when some pages are 10 MB or larger. I would periodically prune the archive directory for this reason.
I stopped using Shelve when I started running out of disk space, and now I can't use Shelve because the addon is no longer supported. The author of Shelve has some suggestions for software with similar functionality:
It installs and loads in Waterfox, but of course it still hasn't been touched in 3.5 years.
I used Scrapbook in the past (which also still works) but I usually just save random things in ~/webpages/ since (apparently) 2011. The earliest is a copy of the landing page at bigthink.com. Of course now almost every link is broken, excluding social media buttons, About Us, Contact Us, RSS, Privacy, Terms of Use, Login, and the header logo pointing at the same page.
It would be nice if we had browsers that were actually user-agents that allow full pluggable customizability for all cookie, header, UI, request, and history behavior. Then, this would just be a plugin that anyone could install.
And then people install garbage extensions that break the browser and people think "Wow, this firefox browser is so buggy and slow" and switch to chrome. And then your extensions break with every single browser update because they are tampering with internal code.
Everyone is free to fork a browser and apply any changes they want. Allowing extensions to change anything at all essentially is the same as forking and merging your changes with upstream every update.
It's a pain, like most of the WARC ecosystem. It's been several months since I dug in, so maybe it's had some spitshine the last little bit, but I usually end up using combinations of wpull, grab-site, and a smattering of other utilities to reliably capture a page/set of pages, and have had to make some quick hacks as well as manually merging in some PRs to get things to work with Python3. Once I have the WARC, I typically end up using warcat to extract the contents into a local directory and explore that way.
WARC as a format seems promising, but at least last I checked, open-source tooling to make it a pleasant and/or transparent experience is not really there, and worse, at least as of several months ago, doesn't really seem actively worked on. Definitely an area you'd expect to be further along.
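For a sense of how little the container format itself demands, here is a minimal sketch that serializes a single WARC/1.0 `resource` record with only the standard library. It is an illustration of the record layout, not a replacement for the tools mentioned above: real crawlers usually write `response` records that wrap the full HTTP exchange, plus `warcinfo` and `request` records, and often gzip each record. `warc_resource_record` is a hypothetical helper name.

```python
import uuid
from datetime import datetime, timezone


def warc_resource_record(url, body, content_type="text/html"):
    """Serialize one WARC/1.0 'resource' record as bytes (`body` is bytes)."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Target-URI: {url}",
        f"WARC-Date: {now}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(body)}",
    ]
    # Header lines end in CRLF; a blank line separates them from the block.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Every record is terminated by two CRLF pairs after the content block.
    return head + body + b"\r\n\r\n"
```

Appending the returned bytes for each saved page to one file yields a concatenation of records, which is how a multi-record WARC file is laid out.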
Safari used to do this automatically (and still does), but in a limited way. In the browsing history view (Command-Y), you can search visited pages by their content, and this is extremely useful. But there's no way(†) to tell Safari to display that saved content. If you revisit a URL in the history, Safari fetches it again, losing the original saved content.
The point is to make a webpage that lasts. So people can link to it and get the page. That means making a maintainable webpage and a url that does not change.
It is great that you can archive every page you visit for yourself, but that is not the same as making a lasting web.
Better than a bookmark action would be a commandline option, similar to Firefox's -screenshot, which will work without starting X11. Something like -archive:warc
Does this also strip the megabytes of superfluous tracking JS? It's probably what'll be the bulk of the size on the modern web, and I don't feel any particular need to store it.
(I believe that for historical purposes, enough complaining about ads and tracking will survive that future historians can easily deduce the existence of this practice)
>Heck, with the cost of storage so low, recording every webpage you ever visit in searchable format is also very realistic.
I tend to do that, I also save a lot of scientific papers, ebooks and personal notes. I've found that doing so does not help me at all. The main problem I have is that when I need to look something up (an article, a book, a bit of info) I reach for google first, usually end up finding the answer and go to save it, only to find that I had already found the answer beforehand (and perhaps already made clarifying notes to go along with it) and then forgot about it.
This, and not dead links, is the fundamental problem with bookmarks for me. Not only bookmarks, it extends to my physical notes and pretty much everything I do. If I haven't actively worked on something for a couple of months, I forget all about it and when I come back to it I usually have to start from scratch until I (hopefully) refresh my memory. Some of it is also usually outdated information.
I think this is a big, unsolved problem and I'm not even sure how to go about starting to solve it. I can envision some form of AI-powered research assistant, but only in abstract terms. I can't envision how it would actually work to make my life better or easier. It would need to be something that would help blur the line between things I know and things that are on my computer somehow. If I think of my brain like it has RAM and cache, things I'm working on right now are in the cache and things I've worked on recently or work on a lot are in RAM, but what's for me lacking is a way to easily move knowledge from my brain-RAM to long term storage and then move that knowledge back into working memory faster than I can do so now. I'm not even talking about brain uploading or mind-machine interfaces, but just something that can remind me of things I already know but forgot about faster than I can do so by myself.
I am convinced that figuring out how to do this will lead to the next leap in technological development speed and efficiency. Not quite the singularity that transhumanists like to talk about, but a substantial advancement.
What I've found is that I need to spend more time deciding what is important, and less time consuming frivolous information. That's hardly a technology problem.
For things I really don't want to forget, I'm using Anki [0], a Spaced Repetition System (SRS). Anki is supremely good at knowing when you're about to forget an item and prompting you to review it.
Spaced practice and retrieval practice, both of which are used in SRS, are two learning techniques for which there is ample evidence that they actually work [1].
You still need to decide what is worth remembering, but that's something technology can't help with, I think.
- Any comprehensive archive of your activity is itself going to be a tremendously "interesting" resource for others -- advertisers, law enforcement, business adversaries, and the like. Baking in strong crypto and privacy protections from the start would be exceedingly strongly advised.
- That's also an excellent reason to have this outside the browser, by default, or otherwise sandboxed.
- Back when I was foolish enough to think that making suggestions to Browser Monopoly #1 was remotely useful, I pointed out that the ability to search within the set of pages I currently have open or have visited would be immensely useful. It's (generally) a smaller set than the entire Web, and comprises a set of at least putatively known, familiar, and/or vetted references. I may as well have been writing in Linear A.
- Context of references matters a lot to me. A reason I have a huge number of tabs open, in Firefox, using Tree-Style Tabs, is that the arrangement and relationships between tabs (and windows) is itself significant information. This is of course entirely lost in traditional bookmarks.
- A classification language for categorising documents would be useful. I've been looking at various of these, including the Library of Congress Subject Headings. A way of automatically mapping 1-6 high-probability subjects to a given reference would be good, as well as, of course, tools for mapping between these.
- I've an increasing difference of opinion with the Internet Archive over both the utility and ultimately advisability of saving Web content in precisely the format originally published. Often this is fragile and idiosyncratic. Upconverting to a standardised representation -- say, a strictly semantic, minimal-complexity HTML5, Markdown, or LaTeX, is often superior. Both have their place.
On that last, I've been continuing to play with the suggestion a few days ago for a simplified Washington Post article scrubber, and now have a suite of simple scripts which read both WashPo articles and the homepage, fetching links from the homepage for local viewing. These tend to reduce the total page size to about 3-5% of the original, are easier to read than the source, and are much more robust.
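The scripts themselves are not shown here, so the following is only a guess at the general shape of such a scrubber, written with Python's standard `html.parser`: keep a small allowlist of semantic tags, drop every attribute except link targets, and discard scripts, styles and frames along with their contents. The tag sets and the names `Scrubber` and `scrub` are assumptions made for illustration.

```python
from html import escape
from html.parser import HTMLParser

# Elements whose entire contents are discarded.
SKIP = {"script", "style", "iframe", "noscript"}
# Elements that survive, stripped of all attributes except href on links.
KEEP = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "a", "ul", "ol", "li",
        "blockquote", "pre", "em", "strong"}


class Scrubber(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a SKIP element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in KEEP:
            href = dict(attrs).get("href") if tag == "a" else None
            if href:
                self.out.append(f'<a href="{escape(href, quote=True)}">')
            else:
                self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in KEEP:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(escape(data, quote=False))


def scrub(html_text):
    """Return a minimal-HTML version of `html_text`."""
    scrubber = Scrubber()
    scrubber.feed(html_text)
    scrubber.close()
    return "".join(scrubber.out)
```

An allowlist is used rather than a blocklist so that markup the sketch has never heard of is dropped by default, which is what keeps the output small and stable.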
I'm reading HN at the moment from w3m (which means I've got vim as my comment editor, yay!), and have found that passing the source to pandoc and regenerating HTML from that (scrubbing some elements) is actually much preferable, for the homepage. Discussion pages are ... more difficult to process, and the default view in w3m is unpleasant, though vaguely usable.
Upshot: saving a WARC strictly for archival purposes is probably useful, but generating useful formats as noted above would be generally preferable in addition.
With the increasing untenability of mainstream Web design and practices, a Rococo catastrophe of mainstream browsers, the emergence of lightweight and alternative browsers and user-agents (though many based on mainstream rendering engines), the tyranny of the minimum viable user attacking any level of online informational access beyond simple push-stream based consumption, and more, it seems that at the very least there's a strongly favourable environment to rethinking what the Web is and what access methods it should support. Peaks in technological complexity tend to lead to a recapitulation phase in which former, simpler, ideas are resurrected, returned to, and become the basis of further development.
I fundamentally agree with the principle -- that pages should be designed to survive a long time -- however the steps the author lays out I completely disagree with.
"The more libraries incorporated into the website, the more fragile it becomes" is just fundamentally untrue in a world where you're self-hosting all of your scripts.
"Prefer one page over several" is diametrically opposed to the hypertext model. Please don't do this.
"Stick with the 13 web safe fonts" assumes that operating systems won't change. There used to be 3 web safe fonts. Use whatever typography you want, so long as you self host the woff files.
"Eliminate the broken URL risk" by... signing up for two monitoring services? Why?
I think this list of suggestions does a great disservice to people who just want to be able to post their thoughts somewhere. There's an assumption here that you'll need to be technically capable in order to create a page "designed to last" and frankly that is not what the internet is about. Yes, Geocities went away. Yes, Twitter and Facebook and even HN will go away. But the answer sure as hell isn't "I teach my students to push websites to Heroku, and publish portfolios on Wix" because that is setting up technical gatekeeping that is completely unnecessary.
> "The more libraries incorporated into the website, the more fragile it becomes" is just fundamentally untrue in a world where you're self-hosting all of your scripts.
There are more problems though. Older library versions might be vulnerable to XSS attacks, or use features removed by browsers in the future for security reasons (eval?). Or you might want to change something involving how you use the API but the docs are long gone. Generally, libraries imply complexity and when it comes to reliability, complexity will always be your enemy.
Also, unless you're very diligent about semantic markup and separation of content, presentation, and interaction logic, the more complicated a site is, the more difficult it is to port.
I have run into this problem trying to migrate very old web pages or blog posts off of SaaS sites that are shutting down or just decaying. It's not just that complicated sites make it difficult to extract the content in the first place; it's difficult to publish that content on another site in a high-fidelity, and sometimes even readable, way.
The hard part isn't keeping the old site (page) running (although that's not always easy either). The hard part is when you want to do something _else_ with that content -- more complicated means less (easily) flexible.
I didn't perceive the author to be doing any technical gatekeeping, quite the opposite. I feel like their article was targeted at people like me or others who use stuff like Hugo/Jekyll, or those who use free website builders or use large frameworks for simple websites.
I agree a couple of the points seem out of place (the monitoring service one made me laugh. visiting my website is the first thing I do after uploading a new page), but the intent of this article I wholeheartedly agree with:
Reduce dependencies, use 'dumb' solutions, and do a little ritualistic upkeep of your website to keep it around for a decade or more. The things you propose are the norm and the reason nothing sticks around, IMO.
> (the monitoring service one made me laugh. visiting my website is the first thing I do after uploading a new page)
I think what you want is not just monitoring your internal links, but also external ones - if a page you linked to in your article starts 404-ing or otherwise changes significantly, it's something you'd likely want to know about. That said, just like preferring GoAccess over Google Analytics, it's something I'd like to have running locally somewhere (on my server, or even on my desktop), instead of having to sign up to some third-party service.
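A locally run check along those lines could look roughly like this. It is a sketch with the standard library only; the names `split_links` and `link_status` are made up for illustration, and some servers reject HEAD requests, so a real tool would probably fall back to GET.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import urllib.error
import urllib.request


class LinkCollector(HTMLParser):
    """Collect absolute http(s) URLs from <a href> attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            url = urljoin(self.base_url, href)
            if urlparse(url).scheme in ("http", "https"):
                self.links.append(url)


def split_links(html, base_url):
    """Return (internal, external) link lists for a page served at base_url."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    internal = [u for u in collector.links if urlparse(u).netloc == host]
    external = [u for u in collector.links if urlparse(u).netloc != host]
    return internal, external


def link_status(url, timeout=10):
    """Return the HTTP status code, or None if the host was unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 for a page that has gone away
    except (urllib.error.URLError, OSError):
        return None
```

Running `link_status` over the external list from a cron job and mailing yourself anything that is not 200 would cover the 404 case; detecting pages that "change significantly" is a harder problem this does not attempt.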
> "Stick with the 13 web safe fonts" assumes that operating systems won't change. There used to be 3 web safe fonts. Use whatever typography you want, so long as you self host the woff files.
Indeed. 10 years ago, “font-family: Georgia, Serif” was guaranteed to work and look the same on pretty much all computers out there. Windows had all of the “web core” fonts (Georgia, Verdana, Trebuchet, Arial, even Comic Sans). Macintosh computers had all of the “web core” fonts. Even most Linux computers had them because it was legal to mirror, download, and install the files Microsoft distributed to make the fonts widely available.
In the last decade, Android has become a big player, and the above font stack with Georgia will look more like Bitstream Vera than it looks like Georgia on Android.
The only way to have a website have the same typography across computers and phones here in the soon-to-be 2020s is to supply the .woff files. Locally (because Google Webfonts might be offline some day). Either via base 64 in CSS or via multiple files; I prefer base 64 in CSS because sites are more responsive loading a single big text file than 4 or 5 webfont files. Not .woff2: Internet Explorer never got .woff2 support, and we can’t do try-woff2-then-woff CSS if using inline base64.
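For the base64-in-CSS approach, generating the rule is only a few lines. This is a minimal sketch, assuming you already have a subsetted .woff file on disk; the function name `font_face_rule` and the family name in the usage note are placeholders, not anything from my actual build.

```python
import base64


def font_face_rule(family, woff_bytes, weight=400, style="normal"):
    """Build an @font-face rule with the WOFF data embedded as a data: URI."""
    payload = base64.b64encode(woff_bytes).decode("ascii")
    return (
        "@font-face {\n"
        '  font-family: "%s";\n'
        "  font-weight: %s;\n"
        "  font-style: %s;\n"
        '  src: url("data:font/woff;base64,%s") format("woff");\n'
        "}\n"
    ) % (family, weight, style, payload)
```

Usage would be something like `font_face_rule("Example Serif", open("example.woff", "rb").read())`, with one rule per weight/style appended to the site's stylesheet.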
Even with very aggressive subsetting, and using the Zopfli TTF-to-WOFF converter to make the woff files as small as possible, this requires a single 116 kilobyte file to be loaded with my pages. But, it allows my entire website to look the same everywhere, and it allows my content to be viewed using 100% open source fonts.
Then again, for CJK (Asian scripts), webfonts become a good deal bigger; it takes about 10 megabytes for a good Chinese font. In that case, I don’t think it’s practical to include a .woff file; better to accept some variance in how the font will look from system to system.
Edit: In terms of having a 10-year website, my website has been online for over 22 years. The trick is to treat webpages as <header with pointers to CSS><main content with reasonably simple HTML><footer closing all of the tags opened in the header> and to use scripts which convert text into the fairly simple HTML my website uses for body content (the scripts can change, as long as the resulting HTML is reasonably constant). CSS makes it easy for me to tweak the look and fonts without having to change the HTML of every single page on my site, but as the site gets older, I am slowly decreasing how much I change how it looks.
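The header/body/footer split can be sketched in a few lines. This is not my actual script, just an illustration of the shape: the `HEADER` and `FOOTER` templates, the `/site.css` path, and the "blank line separates paragraphs" rule are all assumptions made for the example.

```python
from html import escape

# Header: points at the shared stylesheet and opens the tags the footer closes.
HEADER = """<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>{title}</title>
<link rel="stylesheet" href="/site.css">
</head>
<body>
<main>
"""

FOOTER = """</main>
</body>
</html>
"""


def text_to_body(text):
    """Turn plain text into simple <p> elements, one per blank-line-separated block."""
    html = []
    for block in text.split("\n\n"):
        words = block.split()  # collapses newlines and runs of spaces
        if words:
            html.append("<p>%s</p>\n" % escape(" ".join(words), quote=False))
    return "".join(html)


def build_page(title, text):
    """Assemble header + converted body + footer into one static page."""
    return HEADER.format(title=escape(title)) + text_to_body(text) + FOOTER
```

The point of the design is that only `text_to_body` knows about the source format, so it can be replaced later as long as the HTML it emits stays about the same.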
I disagree with keeping fonts inline in the page. It means an additional 100kb per page at the very least. Which adds up very quickly. Remember that most of the world still doesn't have broadband (including yourself if you're using roaming services abroad). It also means extremely redundant information is transmitted when people watch more than one page on your site.
Exactly. It’s a CSS (not HTML) file with all the inline fonts in that file, with a long cache time, so all of the website’s fonts are loaded once for site visitors.
... wherein the main article content is, as the article suggests, a single, self-contained page.
And yet ... that is just the "tip" - underneath is:
" ... at least one, but possibly many, supporting documents, resources and services. The minimum requirement is simply an expanded form of the tip (the "bummock"), complete with references, notes and links. Other resources and services that might lie under the surface are a wiki, a changelog, a software repository, additional supporting articles and reference pages and even a discussion forum."
[1] Neither wiki nor forum exist yet, but the bummock does...
The phrases "simple" and "news websites" don't combine well these days. Even the NPR website downloads 13.2 MB of content over 91 individual requests, and takes just over 3.6 seconds to load (6.5 to finish).
- CSS Stylesheets: 3
- Animated gifs: 1
- Individual JS files: 11 (around 2MB of JS decompressed (but not un-minimized))
- Asynchronous Requests: 14 (and counting)
And that's with uBlock Origin blocking 12 different ad requests.
That's not simple in any form. So, the possibility of something on this page breaking? High. There's a lot of surface area for things to break over time. And that's not counting what happens when the NPR's internal APIs change for those asynchronous requests.
If the site is being served over HTTP/2 then the 11 separate JS files is a good thing compared to a single 2MB JS file.
In my case it also has the added benefit of being able to cache JS for a long(er) period of time, with users only having to download maybe 0-30kb of JS when only 1 component is updated instead of invalidating the entire JS served (Way under 1MB however)
>Use whatever typography you want, so long as you self host the woff files.
or use Google Web fonts, and set the last option in your font-family to "serif" or "sans-serif" to let an appropriate typeface be used if your third-party font is unreachable. That's the beauty of text, the content should still be readable even if your desired font is unavailable.
Google Web Fonts are not an "or", here. Fonts have disappeared from it, and there is no reason to not expect Google to, at some point in the future, go: "you know what, this costs too much without any substantial return." And now it's just another killedbygoogle.com product. Just like images, self-hosting woff/woff2 should be step 1.
Fonts disappearing is not a big issue that will ultimately render your page useless. If the font is gone, the look of the page is slightly affected, but the content of the page remains. It's honestly not a big deal at all.
Here as we enter the 2020s, there are no longer any web safe fonts. Those 1990s Core Fonts for the Web (Verdana, Georgia, Trebuchet, etc.) are no longer universal across all widely used platforms.
Yeah okay, but the initial suggestion (just specify "sans serif") still holds. Or really, if we're talking about a webpage to last, why do we even care about what font is being used? If you care enough about a font that the glyphs used are important for layout, then obviously you're going to need to include the font. If the specific look of the page is essential to the content conveyed, it seems likely to me you won't be using a standard font anyway.
For typical "the words matter more than how the words look" content...can someone explain to me why we care about including the font?
There’s also layout issues caused when replacing a font with another font, unless the metrics are precisely duplicated. There’s a reason RedHat paid a lot of money to have Liberation Sans with the exact same metrics as Arial, Liberation Serif have the same metrics as Times New Roman, and Liberation Mono have the same metrics as Courier New.
His "or" was to suggest that instead of only self-hosting the font file, you simply use a google one with a "fall-back" that happens to be a super-standard font that won't reasonably disappear from most OSes in the near future. That way, you get a reasonable "best of both".
Google web fonts were a great way to make my site slower. I don’t know if it’s the latency here in Australia or what, but (especially for developing locally) google web fonts were a big headache for having snappy webpages. I took the time one day to produce my own webfont files and self-host those, and the difference in site load speed is like night and day.
And that's where they are not banned. Many pages simply won't load at all in the PRC because someone thought a Google analytics tracker or a hosted library should load before the content (which then never does).
Google Fonts also isn’t blocked but I recall it being hit-and-miss in terms of responsiveness when I was working on a website that targeted Chinese audience a few years ago. However, I just tried resolving fonts.googleapis.com and fonts.gstatic.com on a Chinese server of mine, and they both resolve to a couple of Beijing IP addresses belonging to AS24424 Beijing Gu Xiang Information Technology Co. Ltd., so it’s probably very much usable now.
Not sure "working fine in PRC" is really something you can say about anything web related.
I do occasional web dev from within China and had to eliminate external references to get manageable page load times. At least from where I work pulling in practically anything from outside the Great Firewall will have a high probability of killing page load time. Anything hosted by Google in particular will often have you staring at your screen for 30 seconds.
Yes, any additional domain you request to has a non-negligible chance of killing the entire connection. To GP, noticing that one request works once from a server (not a home or mobile connection) really means nothing. Every ISP has different and constantly changing failure modes.
It would be great if web browsers had a way to actually indicate the user's preference of typeface, but what we've actually got is the browser's preference, and the browsers almost all have chosen really terrible default typefaces. It's fine to say "just use the default" for Mac users who get a decent default, but then the poor windows users have to suffer through some terrible serif.
The users who actually know how to change the default font also know how to use stylish.
When you go to a restaurant you let the chef prepare food for you.
Telling him to back off and let you cook because he can't know better than you (his user) would be absurd.
Same thing with design and typography. It requires skill and taste, and hopefully people will be delighted or simply consume the content for what it is, because the design/cooking just reveals that content in a convenient/useful shape.
Most people have the means to cook for themselves without going anywhere and do so at least the vast majority of the time. Even if you do go to a restaurant, they almost always have menus rather than just making one dish for everyone since some people have styles of cooking that they prefer or don't like. People rarely design their own font but rather pick from professionally designed fonts. Additionally, at restaurants people pay for food so incentives are aligned while on the web people generally don't pay for content and any design professional involved is likely an advertiser. I rarely read stuff on the web for a design experience but for the content. I suspect most people would be unhappy with a newspaper that changed fonts for every story or a book that changed fonts every chapter.
Personally, I've been setting my browser to use only DejaVu fonts with a 16pt minimum for years (maybe a decade now) and every time I briefly use a default browser profile I notice the fonts and think not just "this is bad" but "how can people live like this?". Even with the usually minor issues that often appear, setting my own fonts is a way better experience than not doing so. My default experience is much closer to Firefox reader mode than it is to what the page specifies in most cases.
IMO, font specification should be limited to serif, sans-serif, or monospace and let the user or browser set the actual font. Designers should not rely on exact sizes of fonts or use custom icon fonts.
Can you elaborate on why/how they suck? Do you have example links, to set a common ground for the conversation?
I think most fonts that get your attention suck, the best ones are invisible and get you directly to the meaning of text, without getting in the way. So maybe there's a kind of bias (selection or sampling bias?) operating here?
I can’t speak for the parent poster, but, yes, back in the Myspace days, end users would do really tasteless CSS like Comic Sans or an italic font everywhere. Back then, I told my browser “I don’t care what font they tell you to use, just render it with Verdana”.
These days, people either use their social network’s unchangeable CSS, or they use a Wordpress theme with an attractive and perfectly readable font. Even Merriweather, which I personally don’t care for, is easy enough to read.
The only time I have seen a page use obnoxious fonts in the 2010s is when the LibreSSL webpage used Comic Sans as a joke to highlight that the project could use more money:
Edit It may be a case that the parent poster likes using a delta hinted font, either Verdana or Georgia, on a low resolution monitor, and doesn’t like the blurry look of an anti-aliased font on a 75dpi screen.
> back in the Myspace days, end users would do really tasteless CSS like Comic Sans or an italic font everywhere.
Indeed, typography is a skill. Most designers should have it though, which is why I asked OP for more information.
> The only time I have seen a page use obnoxious fonts in the 2010s is when the LibreSSL webpage used Comic Sans as a joke to highlight that the project could use more money
Ah, the infamous Comic Sans. It's a shame because as a typeface on its own, in its category, it is pretty good. Sadly, it's misused all the time in contexts where it's not appropriate at all.
> It may be a case that the parent poster likes using a delta hinted font, either Verdana or Georgia, on a low resolution monitor, and doesn’t like the blurry look of an anti-aliased font on a 75dpi screen.
Without more details we cannot guess. You're right: a lot of things can go wrong and ruin a typeface, regardless of how the characters are designed. Anti-protip: a reliable way to make any font look like shit is to keep the character drawings as they are and mess up the tracking (letter-spacing) and kerning.
I think one of the reasons Comic Sans got such a bad rep is because it was one of the relatively few available fonts back in the pre-woff “web safe fonts” era of a decade ago. Microsoft should have given us a more general purpose font, such as a nice looking slab serif to fill the gap between the somewhat old-fashioned looking Georgia and the very stylized Trebuchet MS font.
Because they are not the single system default sans-serif and single system default sans-serif-monospace fonts that all websites MUST use, period, no discussion. As you put it:
> fonts that get your attention suck
If I can tell the difference between your font and the system default font, your font sucks; if I can't tell the difference, what's the damned point?
Well, if it makes you feel any better, my website renders just fine on Lynx (no Javascript nor webfonts needed to render the page), complete with me putting section headings in '==Section heading name==', which is only visible in browsers without CSS. Browsers with modern CSS support see the section headings as a larger semibold sans-serif, to contrast with the serif font for body text. [1]
[1] There are some rendering issues with Dillo, which made the mistake of trying to support CSS without going all the way and making sure that http://acid2.acidtests.org renders a smiley face, but even here I made sure the site can still be read.
[2] Also, no cookies used on my website. No ads, no third party fonts, no third party javascript, no tracking cookies, nothing. The economic model is that my website helps me get consulting gigs.
[3] I do agree with the general gist of what you’re trying to say: HTML, Javascript, and CSS have become too complicated for anything but the most highly funded of web browsers to render correctly. Both Opera and Microsoft have given up with trying to make a modern standards compliant browser, because the standards are constantly updating.
> Well, if it makes you feel any better, my website renders just fine on Lynx
It doesn't; I only use lynx when someone tricks apt-get into updating part of my graphics stack (xorg, video drivers, window manager, etc) and research is needed to figure out how to forcibly downgrade it, and then only because I can't use a proper browser without a working graphics stack.
> the general gist of what you're trying to say: HTML, Javascript, and CSS have become too complicated for anything but the most highly funded of web browsers to render correctly.
This is subtly but critically wrong; I am saying that it is necessary that web browsers do not render websites 'correctly'. The correct behaviour is to actively refuse to let websites specify hideous fonts, snoop on user viewing activity, or execute arbitrary malware on the local machine.
> Browsers with modern CSS support see [...] the serif font for body text.
Fair nitpick - "haven't noticed the difference yet" would be more accurate - but I don't see how that changes the argument; if I haven't noticed a difference, what's the point?
The trouble is that the defaults tend not to be the best fonts that are available, and very few users change them. I have changed them myself, but I don’t know of anyone else that has.
For myself, I wish that people would leave Arial, Verdana, Helvetica Neue, Helvetica, &c. out of their sans-serif stack, having only their one preferred font and sans-serif, or better still sans-serif alone; but as a developer I understand exactly why they do it all.
Unfortunately, I'm one of those developers :( My font stack is:
font-family: system-ui, Helvetica, sans-serif;
for prose and
font-family: ui-monospace, Menlo, monospace;
for monospaced text. The first being the user's preferred font, the second as a good (IMO?) default that I impose on them, and the third as a full fallback. I'm conflicted on whether this is the right balance between user choice and handling browsers that support nothing.
> "in a world where you're self-hosting all of your scripts."
Anything self-hosted is already fragile: it will go away when you don't continue to actively maintain it (paying for a domain, keeping a computer connected to the Internet etc.) or when you die.
I guess you can last 10 years, which is apparently what "This Page is Designed to Last" aspires to, but what if we have greater ambitions? Like 100 years?
You can design and mark-up content that will still be useful and readable in 100 years. You might be able to preserve the presentation logic (CSS-style) for 100 years.
You probably won't be able to preserve the interaction design for 100 years (without a dedicated effort -- that's why they bury computers along with the software in time capsules).
But I think it is optimistic to think that _most_ SaaS hosts are going to archive content for 100 years. Preserving digital content is an _active_ process. It takes resources and requires deliberate effort.
Postscript: I'm trying to think of modern companies that would preserve content for 100 years, assuming they make it that far.
Facebook is the only significant current platform that I can even imagine preserving content for 100 years, but even that seems like a stretch. Historians might step in to archive it, but is there real value to Facebook to maintain and publish 50 year old comments on 2.5 billion unremarkable walls?
Twitter won't. Certainly Insta, SnapChat, WhatsApp etc. won't. Flickr probably could do it relatively easily but won't. YouTube maybe, but there's more to store. Something like GitHub maybe?
Right, look at SourceForge. There's a lot of broken links and/or references to no-longer accessible content in some of the older Apache.org projects too.
Also maybe cvs/svn/git repo generally don't contain content worth preserving for 100 years. There are some historically significant or interesting repos, but for the most part you'll have a bunch of unremarkable (and duplicated) code that may not have run then and certainly won't run now.
> a bunch of unremarkable (and duplicated) code that may not have run then and certainly won't run now
100 years is a long time, but I do run 20+ years old Common Lisp libraries and expect them to work without modification; I'd be really pissed if they disappeared from the Internet because someone thought that 5 years of inactivity means something doesn't work anymore.
If you have built something worth lasting 100 years, other people will help you ensure that it does. That reduces the concerns in this article considerably.
When I studied media science one of the most lasting experiences I had was a talk with a lady from the Viennese film museum (one of the few film museums that store actual films instead of film props).
As a digital native I never gave it a thought, but she told me that there is a collective memory gap in films that have been shot or stored digitally. With stuff that has been stored on film, there was always some copy in some cellar and they could make a new working copy from whatever they found. With digital technology this became much harder and costlier for them, because it often means cobbling together the last working tape players and maintaining both the machines and the knowledge of how to maintain them. Add to that material on hard drives in a hundred different codecs that won’t run on just any machine, and this combined into something she called the digital gap.
I had never thought about technology in that way. Nowadays this kind of robustness, archiveability and futureproofing has become a factor that drives many of my decisions when it comes to file formats, software etc. This is one of the main reasons why I dislike relying solely on cloud based solutions for many applications. What if that fancy startup goes south? What happens to my data? Even if they allow me to get it in a readable format, couldn’t I just have avoided that by using something reliable from the start?
I grew to both understand and like the unix mantra of small independent units of organization -- trying as hard as possible not to make software and other things into an interlinked ball of mud that falls apart once one of the parts stops working for one reason or another. Thinking about how your notes, texts, videos, pictures, software, tools etc. will look in a quasi post apocalyptic scenario can be a healthy exercise.
On this subject you can dive into the story of the missing "Doctor Who" TV serials.
Some of the master tapes were infamously reused to store other content. Besides, the whole archive problem comes from the reusable nature and scarcity of the chosen storage medium. I think I've read something about reusing paper as well in medieval times.
> I think I've read something about reusing paper as well in medieval times.
This mostly happened with parchment, not paper, but otherwise you are right. It is called a palimpsest.[1] Sometimes the writing under the writing can be reconstructed as happened with the oldest copy of Cicero's Republic.[2]
There is an old observation that I found striking at the time:
Newer methods of storing information tend to be progressively easier to write, and progressively less durable.
(The following is not really in chronological order)
You'll never look at stone tablets the same way again. As primitive as they are, their longevity can be amazing. Ancient emperors and tyrants knew what they were doing. Trajan's Column from 113 AD is our main source on the Roman legionary's iconic equipment.
Cuneiform tablets were heavy and awkward, but they were 3D so there was no paint to worry about.
Parchment tends to be more durable than papyrus, and paper. Perhaps the best known among the Dead Sea Scrolls was made out of copper.
Iron Age culture artifacts are harder to find than Bronze Age ones, because bronze is more resistant to corrosion.
CD's, especially(?) from home burners, are reported to oxidize after several years. That may still be better than tapes, hard drives and other magnetic media (SSD?) which can be wiped by an EMP pulse. The internet era information storage appears to come with an upkeep cost! Slack practically doesn't archive messages by default. Until Gmail, it was typical for email servers to delete old messages.
People get used to novelty and things being ephemeral. Capitalism supposedly requires low durability goods so people keep buying them, including tools and clothes. Houses are poorly built and break down pretty quickly.
I find it amazing people used to decorate their homes, tools, clothes with ornaments, engravings etc. You'd be a fool to do that today, you don't even know how long that thing is going to last.
Interesting analogy. I am having the opposite problem. I have a shoe box half filled with miniDV tapes. The camera is long gone. I would like to transfer these to a hard drive. Services that offer to digitize your tapes are just too expensive for me. Most of the tapes are probably just goofing around, and there is the issue of privacy. With my current camcorder I just plug the SD card into the computer and copy across.
Post internet, most content is globally replicated. Someone somewhere will find the time and energy to make an Amiga simulator with exactly the correct bugs, to run the program you want. The amount of content lost in proportion to the amount of content created must have gone down dramatically.
> With digital technology this became much much harder and costly for them, because it often means cobbling together the last working tape players
I think something might be getting lost in translation. Could she have meant “electronic” rather than “digital” (which to me suggests digital media such as DVD etc)
This whole anecdote makes more sense to me with this substitution.
She was referring to both. She said they have similar problems with CDs and even material on hard disks, because the video codecs used are often hard to get running without the right knowledge and resources, especially since some of the non-consumer codecs were proprietary and more or less made for specific platforms. I don't know too much about that, though, so take this as speculation.
My favorite is "?", which is not documented in this link. It forces search instead of resolving a domain name.
E.G: if I type "path.py", looking for the python lib with this name, Firefox will try to go to http://path.py, and will show me an error. I can just add " ?" at the end (with the space) and it will happily search.
It's a fantastic feature I wish more people knew about.
It's very well done as well, as you can use it without moving your hands from the keyboard: Ctrl + l gets you to the URL bar, but Ctrl + k gets you to the URL bar, clears it, inserts "? ", then lets you type :)
It's my latest FF illumination; the previous one was discovering that Ctrl + Shift + t reopens the last closed tab.
Not sure you're aware of this one too... But, you might like the "Ctrl+Tab" shortcut as well. With it you can alternate between the last few active tabs, with thumbnails. Really handy.
I don’t think I’ve come across a single Firefox user that ever uses keyboard shortcuts that has left it that way—all have found the “Ctrl+Tab cycles through tabs in recently used order” preference and turned it off, so that it goes through tabs in order, like literally every other program I’ve ever encountered does with tabs. (Yes, Alt+Tab does MRU window switching, but that has never been the convention for Ctrl+Tab tab switching.)
Mind you, MRU switching is still useful behaviour; Vim has Ctrl+^ to switch to the alternate file which is much the same concept, and Vimperator et al. used to do the same (on platforms where Alt+number switched to the numbered tab, rather than Windows’ Ctrl+number), no idea whether equivalent extensions can do that any more. I have a Tree Style Tab extension that makes Shift+F2 do that, and it suits me.
If you keep Control+Tab set to cycle through tabs in recently used order, you can use Command-Shift-Left/Right or Control-PageUp/PageDown to cycle through tabs in tab-bar order instead.
Additionally, you don't need an extension to jump to a tab anymore. Command-[1-8] goes to that number tab in the current window, where 1 is the leftmost tab. Command-9 goes to the rightmost tab.
Thank you so much for this! Another handy and often overlooked feature is the keyword shortcuts for bookmarks. And with %s in the URL you can search/navigate pretty fast. Example: https://en.wikipedia.org/wiki/%s with the shortcut "w" will bring you to the corresponding article if you type "w foobar"
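For anyone curious how the %s trick works, here is a rough sketch of the substitution in Python. This is not Firefox's actual code; the BOOKMARKS table, the expand() helper and the percent-encoding choice are my own assumptions for illustration.

```python
from typing import Dict, Optional
from urllib.parse import quote

# Hypothetical keyword table: keyword -> URL template containing %s.
BOOKMARKS: Dict[str, str] = {"w": "https://en.wikipedia.org/wiki/%s"}


def expand(typed: str, bookmarks: Dict[str, str] = BOOKMARKS) -> Optional[str]:
    """Turn 'keyword query' into a URL, or return None if the keyword is unknown."""
    keyword, _, query = typed.partition(" ")
    template = bookmarks.get(keyword)
    if template is None:
        return None
    # Percent-encode the query before dropping it into the template
    # (an assumption; a real browser may encode differently).
    return template.replace("%s", quote(query, safe=""))


print(expand("w foobar"))
```

Typing "w foobar" yields https://en.wikipedia.org/wiki/foobar, and an unknown keyword falls through to None so the caller can treat the input as a normal search.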
I would actually shift this quite a bit to say if you’re designing your page to last 10 years, put it on the internet archive on day 1.
Invite them to crawl it, verify the crawl was successful, and even talk about that link on your page.
It removes the risk of domain hijacking, hosting platforms shuttering, and the author losing interest. P.s. The internet archive is doing excellent work. Support them.
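If you want to script the "invite them to crawl it, verify the crawl" part, something like this could be a starting point. It only builds the URLs and parses a response; it makes no network calls. The endpoints are the Save Page Now and availability URLs as I understand them, so check the Internet Archive's own docs before relying on this. The helper names and the sample response are made up for illustration.

```python
import json
from typing import Optional
from urllib.parse import urlencode


def save_url(page: str) -> str:
    """URL that asks the Wayback Machine to capture `page` (Save Page Now)."""
    return "https://web.archive.org/save/" + page


def availability_url(page: str) -> str:
    """URL of the availability lookup for `page`."""
    return "https://archive.org/wayback/available?" + urlencode({"url": page})


def closest_snapshot(response_text: str) -> Optional[str]:
    """Pull the closest snapshot URL out of an availability response, if any."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


# Illustrative response body, not real data from the service.
SAMPLE = json.dumps({
    "url": "example.com",
    "archived_snapshots": {
        "closest": {
            "status": "200",
            "available": True,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
        }
    },
})

print(save_url("https://example.com/essay/"))
print(availability_url("example.com"))
print(closest_snapshot(SAMPLE))
```

An empty "archived_snapshots" object means no capture was found, which is the case worth alerting on.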
And as you give a content donation, please also consider a monetary donation to keep the lights on at the Internet Archive: https://archive.org/donate/
Make sure that archive.org - the Internet Archive - catches your website in "The Wayback Machine". Catering to that is a pretty good strategy for archiving for at least the next couple of decades, considering that institute's staying power.
A shift to independent publishing is needed. I used to have sites that died because the upkeep became tiresome, and if I - a professional developer with almost 25 years experience of writing web applications - find it tiresome, can we blame people for wanting to use the big platforms?
I think using a static site generator might be OK. Common headers and footers help, and RSS might definitely be a good thing, but that seems to be dying.
One idea from this article I liked was "one page, over many". I don't think he meant have one single page on your website, but rather one per directory, and like he has with this article have one directory for a thought or essay or piece of something you want documenting, and just have an index.html in it.
I like this because I think the one thing that has killed off most personal websites is not the tech tool chain, but that "blogging" created an expectation of everybody becoming a constant content creator. The pressure to create content and for it to potentially "go viral" is one of several reasons I just tore down several sites over the years.
Around this time of year I take a break from work and think about my various side projects, and sometimes think about "starting a blog again". I often spend a few hours fiddling with Jekyll or Hugo, both good tools. Then I sit and think about the relentless pressure to add content to this "thing".
I like this idea instead though. No blogs. No constant "hot takes" or pressure to produce content all the time. Just build a slowly growing, curated, hand-rolled website.
I still think there might be a utility in having a static site build flow with a template function, but a simple enough design could be updated with nothing more than CSS.
I use a combination of asciidoc and hugo to generate my static website. It means that I can easily regenerate the website using whatever tool I want in the future or even just easily update the template for the existing site. If something happens to asciidoc, there are lots of converters that would allow me to move to another format or presumably some format in the future. Markdown and reStructuredText are also really good options.
I think the Stack Overflow guidelines have "solved" this problem in about the cleanest way currently possible: expect links to die, and include the relevant information in your answer.
If the link still works when it gets clicked on that's a bonus, but it shouldn't need to be available for the content you're reading to be understandable.
That's not always possible without changing the subtleties and basic meaning, or even the intention, or without discussing and questioning what the author wanted to say with their sentence or paragraph. This is particularly tricky if you write political commentary on your site. A politician's blog post or social media posts can't be quoted faithfully when they are already vaguely stated, and often said politician removes the post after backlash, so you wouldn't be able to link to it either.
In the case of StackOverflow, if you can’t explain an answer in your own words, then you don’t really know the answer and no one should rely on your links to a supposedly authoritative source.
These work only if you move stuff around on the same website. If you switch domains you can't just ask the new domain owner to redirect requests to your new website.
I think we will simply have to assume it will continue. That is, if anything will continue, archive.org and similar projects whose primary goal is to preserve and prevail are easily the prime candidates.
As many times as the article literally tells you will be necessary? One of the key points about links it makes is to use a link checker if you do link out.
Let's be honest with ourselves. The best way to make your content last for a long time is to host it on a platform that is free and very successful. For example, whatever photos I posted on Facebook 12 years ago? Still alive and kicking. The articles I've published on wordpress.com 7 years ago? Still in mint condition, with 0 maintenance required.
In comparison, the websites that I've built and hosted or deployed myself, have constantly required periodic work just to "keep the lights on". I went out of my way to make this as minimal and cheap as possible, but even then, it hasn't been nearly as simple as the content I've published on wordpress.
At some point, people's priorities change. Perhaps due to new additions to the family, medical circumstances, or even prolonged unemployment. And when that happens, even the smallest amount of upkeep, whether it is financial, technical or simply logistical, becomes something they have no interest in engaging with.
If we really want our content to last, not just for 10 years but for a generation, our best bet is to publish it on a platform like wordpress.com. One which requires literally zero maintenance, and where all tech infrastructure is completely abstracted away from you. I know this isn't going to be a popular idea with the HN crowd, and I do not blame anyone at all for wanting to keep control over their content. People are free to optimize along whatever dimensions they wish. But if I had to bet on longevity, I would bet every time on the wordpress article over the self-hosted one.
>Let's be honest with ourselves. The best way to make your content last for a long time is to host it on a platform that is free and very successful. For example, whatever photos I posted on Facebook 12 years ago? Still alive and kicking. The articles I've published on wordpress.com 7 years ago? Still in mint condition, with 0 maintenance required.
Your view of the timeline is too short. We're not talking about keeping something online for 7 years, but for 70. If I had followed your advice a few years ago, I would have deployed on Geocities. Do you know what happened to those websites?
The question is, is wordpress going to be around in 70 years? No one knows. But that static HTML page will still render fine, even if it is running in a backward compatibility mode on your neurolink interface.
> The question is, is wordpress going to be around in 70 years? No one knows. But that static HTML page will still render fine
The question isn't whether wordpress will be around in 70 years, but whether it will outlast your self-hosted website. Anything that is self-hosted requires significantly more financial/logistical maintenance, and what is the likelihood of someone continuing to do that for 70 years?
For me it's very easy because those domains are also tied to my email and all of my other hosted services (gitea, tt-rss, etc.) all use the same domain. So it's very easy to remember to keep them all alive and active. I've had domain names active far longer than Wordpress has existed.
Photos you posted on Facebook 12 years ago are generally inaccessible to the public; you need to be a Facebook user to see most stuff on Facebook after all.
Also, in most cases even you would have a very hard time accessing them, unless you somehow "pinned" them not to be far far down the scroller.
The article addresses this point. This is the kind of hurdle that makes pages not last. The point is that yes, we can spend time every few years migrating and maintaining, but we shouldn't have to.
If Jekyll died tomorrow, I still have the HTML to keep the website running, more or less. It's a build step in my pipeline but not one that abstracts it in such a way that I cannot use the final product as my archive. I'm not sure that a CMS could let me do the same.
>to host it on a platform that is free and very successful
Yes, from a technical point of view, but what about deplatforming? I think it is a bigger risk to lose data than any framework/technology deprecation. I would definitely not rely on any platform keeping my data.
The issues outlined here are one of the reasons that I am moving as many of my workflows to org-mode as possible. Everything is text. Any fancy bits that you need can also be text, and then you tangle and publish to whatever fancy viewing tool comes along in the future.
I don't have a workflow for scraping and archiving snapshots of external links, but if someone hasn't already developed one for org I would be very surprised.
In another context I suggested to the hypothes.is team that they should automatically submit any annotated web page to the internet archive, so that there would always be a snapshot of the content that was annotated, not sure whether that came to fruition.
In yet another context I help maintain a persistent identifier system, and let me tell you, my hatred for the URI spec for its fundamental failure to function as a time invariant identifier system is hard to describe in a brief amount of time. The problem is particularly acute for scholarly work, where under absolutely no circumstances should people be using URIs or URLs to reference anything on the web at all. There must be some institution that maintains something like the URN layer. We aren't there yet, but maybe we are moving quickly enough that only one generation worth of work will vanish into the mists.
> The issues outlined here are one of the reasons that I am moving as many of my workflows to org-mode as possible. Everything is text.
That works for some, even most people. Unfortunately, the content I create will inevitably cite material in languages other than the main document language. That means that I have to heavily use HTML span lang="XX" tags to set the right language for those passages, so that (among other things) users with screenreaders will get the right output. As far as I know, org-mode lacks the ability to semantically mark up text in this way.
If it is for blocks of text then you could use #+BEGIN_VERSE in combination with #+ATTR_HTML, or possibly create a custom #+BEGIN_FRENCH block type, but I suspect that you are thinking about inline markup, in which case you have two options: one is to write a macro {{{lang(french,ju ne parle frances)}}} and the other would be to hack the export-snippet functionality so you could write @@french:ju ne parle frances@@ and have it do the right thing when exporting to html. The macro is certainly easier, and if you know in advance what languages you need it shortens to {{{fr(ju ne parle frances)}}}, which is reasonably economical in terms of typing.
Maybe I'm dense, but I'm having trouble understanding what is so difficult about keeping content around. It seems like the issue of webpack and node and all the other things he mentions on the article aren't really problems with content per se. You can just publish your thoughts as a plain text file or markdown or whatever and you're good to go. I'm having a hard time thinking of types of content that are really tied to a specific presentation format which would require a complex scaffolding. A single static page with your thoughts is sufficient and should require no maintenance to keep around. I do agree though that even static site generators create workflows that get in the way. I'd love to see an extreeeemely minimal tool which lets you drop some files in a folder and then create an index page that links to those. You could argue that's what static site generators pretty much do, but they do seem to be more complex than that in practice. Remember deploying a web site with FTP? I have to say that was simpler for the average person than what we have today. I think that, in some ways, the complexity is what ends up pushing people towards FB, Medium, etc as publishing platforms.
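The "drop some files in a folder and get an index page" tool really can be tiny. Here is a minimal sketch using only the Python standard library; build_index() and the file names in the demo are hypothetical, and it deliberately ignores subfolders, ordering by date, and styling.

```python
import html
import tempfile
from pathlib import Path
from urllib.parse import quote


def build_index(folder: Path, title: str = "Index") -> str:
    """Return an HTML page that links to every file in `folder` (except index.html)."""
    names = sorted(
        p.name for p in folder.iterdir() if p.is_file() and p.name != "index.html"
    )
    # Percent-encode the link target and HTML-escape the visible text.
    items = "\n".join(
        f'    <li><a href="{html.escape(quote(name))}">{html.escape(name)}</a></li>'
        for name in names
    )
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n'
        f'<head><meta charset="utf-8"><title>{html.escape(title)}</title></head>\n'
        "<body>\n"
        f"  <h1>{html.escape(title)}</h1>\n"
        "  <ul>\n"
        f"{items}\n"
        "  </ul>\n"
        "</body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    # Demo in a throwaway directory so nothing on disk is touched.
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        (folder / "thoughts.txt").write_text("plain text is fine", encoding="utf-8")
        (folder / "essay.html").write_text("<p>hello</p>", encoding="utf-8")
        (folder / "index.html").write_text(build_index(folder), encoding="utf-8")
        print((folder / "index.html").read_text(encoding="utf-8"))
```

Run it again whenever you add a file and upload the folder however you like, FTP included.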
It seems like in practice the biggest problem is "it got deleted", and everything else is about either preventing others from deleting your stuff or preventing yourself from deleting it out of laziness or frustration.
Deploying a web site with (S)FTP works as well as it ever did... and is just as obscure to non-technical people as it ever was. Ease of use means loss of control.
It'd be a cool challenge to build something so simple that even a non-tech person could use which allows them to maintain control and ownership. Any good examples of tech in general that is highly approachable like this? Even things like WordPress are too complicated for most - maybe if not self-hosted it's not so difficult, but still falls short in terms of being complex and not just simple text or html (at the most)..
Oh cool, didn't know about that. Be cool to see something like this which is more approachable for non-tech people. I think the tor part of this, at least in its current state, is too much for the average person.
All the points about using simple html don't do anything for website rot.
Sites I built with tables and a custom js display framework over a decade ago, before people started abusing floats for layout and before js frameworks happened, still display perfectly today.
Pages die because domains and hosting get abandoned, or because websites get upgraded without paying attention to the old link format.
If you want your pages to last, buy hosting that automatically charges your credit card, and use a company that encourages you to keep your cc info up to date (like Amazon).
Also, never revamp your sites; just make new ones in subfolders or on new (sub)domains. And if you absolutely need to upgrade an existing site, pay very close attention so that it accepts the old url format and directs the user to the correct content.
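To make the "accept the old url format" part concrete, here is a sketch of a lookup you could put in front of a revamped site. The URL schemes, the MOVED and LEGACY_IDS tables and resolve() are all hypothetical; in practice you would express the same mapping in your web server's redirect config.

```python
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlsplit

# Hypothetical old path -> new path table, kept forever once published.
MOVED = {"/blog/2009/05/hello.php": "/essays/hello/"}

# Hypothetical old query-string IDs, e.g. /index.php?p=42.
LEGACY_IDS = {"42": "/essays/hello/"}


def resolve(old_url: str) -> Tuple[int, Optional[str]]:
    """Return (status, location): 301 plus the new path, or 404 if unknown."""
    parts = urlsplit(old_url)
    if parts.path == "/index.php":
        post_id = parse_qs(parts.query).get("p", [None])[0]
        target = LEGACY_IDS.get(post_id)
    else:
        target = MOVED.get(parts.path)
    return (301, target) if target else (404, None)


print(resolve("/index.php?p=42"))
print(resolve("https://example.com/blog/2009/05/hello.php"))
print(resolve("/never-existed"))
```

The point is that the table only ever grows: every URL you have ever published stays in it.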
This is why I've been saving PDFs/HTMLs or even just taking screenshots of webpages I find especially meaningful or important to me... Archiving things as files this way can get kind of tedious and definitely feels primitive at times (we made the LHC but I'm here not expecting the same pages of cool interviews, designs, etc. to be up next month), but what can you do?
But it's nice to know you're not alone in wanting nice things to last :)
Yes, I wish I had started doing that years ago. I went back through all my old bookmarks that I had saved from the past 2 decades and a significant portion are dead now.
I've put a bit of thought of what'll happen to my website if I were to die. It's hosted on GitHub Pages right now, so at some point GitHub is going to disappear or stop offering the service. Even before that, I honestly don't know what happens to renewing payments–I guess my Google Domains payment will stop and somebody will squat on my domain soon after? archive.org might be the only thing keeping the information around…I hope I've done a good job of making it archivable; as a matter of policy, there's no JavaScript. There's a snapshot from earlier this year and it looks fine, so maybe it'll outlast me?
It's specifically for open source software, but I wonder if it can be spun out into personal archives or websites.
---
Some years ago, I had an idea for a thought experiment called the Eternal Bit - various angles on the practicality of preserving a bit state "forever".
This is nice, but that webfont bullet is a weird take. If you're hosting the font yourself and you're using a modern format, fonts don't add a lot of overhead, and if they happen to fail then the browser will take the next item down the stack. It's textbook progressive enhancement. Nothing about adding a webfont properly will prevent Web content from lasting and being maintained for 10 years.
"your focus should be about delivering the content to the user effectively and making the choice of font to be invisible, rather than stroking your design ego" - these aren't the only two options?
This definitely reads like something one of my professors would have written. Someone with a lot of good knowledge, but lacking some context for things you learn in the field.
I think this article is excellent, but one small nit: isn't it contradictory to say don't minimize HTML but do minimize SVG?
The justification in the HTML case is that "view source is good" and "it's compressed over the wire anyway". Don't those arguments apply equally (or nearly equally) well to SVG?
Programs such as Inkscape put a lot of extra information into SVG files that is not actually needed for them to render. File size reductions with SVG minifiers can be significant. For instance, the applications-internet.svg icon goes from 31K to 9K when compressing it with svgo.
I think the argument is that HTML can and should be human readable and editable, while you really need a tool to meaningfully create and edit SVGs anyway, so minifying isn’t a loss.
I've never looked into svg specifically but I would assume that even if it's xml, it's generally far from reasonably human readable/editable, in which case you're not losing the properties by minimizing. Unreadable -> more unreadable is much less significant than readable -> unreadable
That's why I hedged a bit. SVG is certainly less readable, but it can be human readable if not aggressively minimized.
I've both created and modified SVG documents by hand with a text editor, and also view-sourced SVGs to see how they work.
The only thing that is especially unreadable are the long strings of coordinates that make up paths, and even those can be manageable with comments. No one does complex paths by hand, but the rest is colors, shapes, transforms -- all things you can read and may want to modify.
HTML is a terrible authoring format. CSS is a terrible everything. If you want something to last, the thing that will last is the human-created source - probably markdown.
I'm not worried about my blog posts sitting in their git repository being lost. The Jekyll pipeline that adds a Javascript header/footer might go away, as might the Javascript that prettifies my raw posts, but the markdown is durable, and a future archivist could always regenerate a pretty version from the markdown - or even read the raw markdown.
Give browsers a good way to view markdown, give site creators a good way to link to canonical source for their pages, and then we'll have durable links.
No, I trust HTML and CSS to stand the test of time far more than Markdown. Do you know how many different variants of Markdown there are, how they affect the interpretation and appearance of the content, how they can break surprisingly much? And when it comes down to it, as soon as you want to do anything even remotely interesting in Markdown, you have to drop straight HTML in there, and pray that the Markdown engine will do the right thing, since the rules of how it should all work are insanely complex, and vary widely by engine, for all that there’s a general trend towards CommonMark which at least specifies the madness and folly. (I’m sad that Markdown won over reStructuredText, which was actually designed.)
HTML is a quite acceptable authoring format, one that readily lets you do interesting things if you desire—though making it possible can certainly be a footgun. CSS is a reasonable styling language.
> Do you know how many different variants of Markdown there are, how they affect the interpretation and appearance of the content, how they can break surprisingly much?
One plus a bunch of non-standard extensions. Do you know how many different variants of HTML there are, how rare it is for real-world HTML to actually conform to any of them, and how many different interpretations of that there are? (To say nothing of CSS, the standard so complex that it's never actually been implemented.)
> And when it comes down to it, as soon as you want to do anything even remotely interesting in Markdown, you have to drop straight HTML in there
Just say no. If your goal is writing something intended to last, you should be able to convey it in mostly-plain text.
Nope, definitely not one Markdown. Reddit uses three different engines with major mutual incompatibilities and mostly forbids all HTML (which frankly I deem enough to call it not real Markdown); Stack Overflow uses two with what used to be critical deficiencies and incompatibilities in the one used for comments, but I think they’re now mostly minor only; some things still use Markdown.pl which does many bizarre things; some, other weird engines with idiosyncrasies of their own; most more recent Markdown engines are only mostly CommonMark-compatible, regularly deviating in important places, and very often adding incompatible extensions. It’s a disaster at present, and I don’t expect it to get much better for at least a decade, if ever. (HTML got better with HTML5, but I don’t think Markdown is likely to unite so firmly, because people want more than non-HTML Markdown offers, and so will continue to extend it.)
HTML, though? Since the HTML5 spec about a decade ago, there has only been one HTML, with all browsers parsing and handling documents identically. CSS is similarly parsed and interpreted according to well-defined algorithms now. There are some visual rendering differences between browsers in how CSS is handled, but it is exceedingly rare for them to affect the content.
And for your complaints about CSS, it’s not intended as a single thing to implement in one piece—it’s deliberately designed as something that is extended over time. But if you write CSS that works in browsers now, then presuming you haven’t used vendor-prefixed stuff, it’s reasonable to expect it to work the same way indefinitely.
Imagine sticking a proxy between your browser and the internet that automatically archives all webpages you attempt to browse to, and then only lets you view the archive. How much of the internet would you be able to see?
This website needs to scale text better on mobile (on an iPhone) so it's not hard to read. Especially for a post advocating on using vanilla HTML and how nice and powerful it is.
I don't think I particularly disagree with any of the post, but I found it a little long-winded.
Catering to energy limited, heat dissipation limited, UI size and precision limited, and network limited (random rtt) smart phones is why the web has become as bad as it has.
I have to compliment Jeff for first telling me unpopular things I already believe then adding sensible things to it (except the link back to his page)
The idea ForEach person to write just one single web page is really wonderful. I'm going to have to deeply ponder that and make one.
While I really like the permanent web and losing centralized control over data is a price worth paying (.... no wait, I would pay money not to have this.) the winning method of making cars last 100 years is maintenance not build quality.
That said, here are some similar ideas of mine:
I often add a torrent magnet uri when I link to youtube. I seed those torrents myself and sometimes someone helps. If the content was good enough and yt deletes it for whatever [stooopid] reason there would be more seeds.
Here http://dr-lexus.go-here.nl I instead try to "sell" the idea that [aside from stuff breaking] people use and will use a combination of Adblock Plus, NoScript, RequestPolicy, Ghostery and JSOff to break your stuff.
You can put a really tiny bit of css inline and at least render the page properly if the css fails for whatever reason.
I really dislike how our websites merely provide just one location for everything so I wrote this:
I haven't tried it yet but it uses the BitTorrent protocol for video retrieval, so there is your magnet link? Of course it is self-hosted if you can't find an instance that'll take you, but it might be a good option.
Good rules, I follow all of them, making a pure HTML/CSS single page homepage of uncompressed HTML, except... it's not at all designed to last. I nuked my last site to make it, and I'll nuke this one one day when I wanna make something even cooler. Website as art installation plus a few links.
My content is spread about, on medium, on github, twitter, instagram, wherever. But this too I feel is mostly ephemeral. It's still not clear to me why that's a bad thing. I dislike hoarding physical objects. I'm not sure about digital, either, I suppose. So one alternative is to free yourself from the idea that it needs to be hoarded, that all your works need to be maximally legible and easy to find, etc. I do suppose if you're a professor your students should probably be able to find the syllabus from your website (which he has, though it's hosted on brown's site).
> It's still not clear to me why that's a bad thing.
Human advancement comes from the process of building knowledge on knowledge, using what our predecessors learned to move our starting point forward. (You're the predecessor in that sentence.)
I still don't get why not minify HTML...
When you open your browser's devtools, all the HTML elements are properly formatted and interactive.
If you mean "view-source:", there are tons of HTML beautifiers out there.
This is somewhat of a concern to me: I can imagine a distant future where one of the great mysteries would be the rise and fall of literacy, since almost no written communication is left and computers and the internet are long forgotten, obsolete technology.
"In the beginning of the 21st century people suddenly stopped reading and writing. In the short timespan of only a few decades almost all forms of written communication vanished..."
> But if not, maybe you are an embedded systems programmer or startup CTO or enterprise Java developer or chemistry PhD student, sure you could probably figure out how to set up some web server and toolchain, but will you keep this up year after year, decade after decade?
This describes me. Started on Drupal, now on Hugo, but I still have to retrain myself every time I need to update my site. I don't even understand my own Hugo template. I once set up a batch file to build and upload the web site. I upgraded my computer and couldn't get the batch file to work. It was something on my PC or AWS. Most likely my incompetence and lack of time to find the issue. I am going back to HTML and FTP. Funny thing is I learnt HTML in college and 20 years later I can jump into it without much fuss.
I'd like to point out SGML as a format for very long document storage and sustainable authoring. I've written up a tutorial for using SGML for preserving content [1] that I held at this year's ACM DocEng conference.
> Well, people may prefer to link to them since they have a promise of working in the future.
In a world where 95% of the google search results for <common problem> are forum threads, with everyone 'answering' the question by saying 'just google it, dumbass'[1], I don't think people - as in the common homo sapiens - care about the longevity of the content they link to as much as the author thinks.
Quality of the content (Does it have the information you want) is king. Longevity is a 'future me' problem, and 'future me' is incredibly shortsighted.
[1] Thanks, guys, how did you think I found this forum thread?
I wanted to make a pre MVP leads landing page this week. Researched new static site builders and templates. Chose one, ran into a build problem, then saw that the docs were not up to date... all in all half a day with no hard, ready to launch outcome.
Took a step back, what is the most minimal thing I needed.
Took html5 boilerplate, copy&pasted the index.html, put normalize.css inline.
Got rid of everything else. Minimal Html, an image, form mailer. Done and launched https://www.securrr.app on Netlify in less than an hour.
Overengineering is real. Choose a goal, use the simplest solution to get there.
I was expecting the solution to be mirror your generated pages on IPFS (https://ipfs.io), so they just don't go away at all (as long as someone has them pinned).
The proposed solution set seems extremely convoluted and doesn't actually solve the issue.
That’s quite the caveat, and speaking as somebody who has attempted it, you’ve introduced quite a bit of complexity. The whole point is that complexity militates against keeping it online. Keeping it simple, the author’s theory seems to go, is the single most effective way to make something likely to be able to be available long term.
None of what he wrote mitigates the case where he no longer maintains his site and stops paying for the server, which is the very reason all those links he mentioned were dead (defunct websites/hosts).
The article was about not having to maintain a site for non techies like us. WE can happily maintain a range of things, but a chemistry doctoral student will hit a hurdle and may give up.
Paying for a server - mentioned in the article about the provider changing access.
You might keep the content format mighty simple, but keep in mind that the hosting and delivery mechanism consisting of http/html are extremely demanding. In particular, your website's availability is fundamentally limited by the uptime of your host machine (which you are only leasing) and its HTTP server and network. If you forget to renew the lease on your host machine, or there is some mishap by the service provider, your website will vanish without a trace -- even if it happened to have a high rate of visitors.
It's changed a bit, fairly fundamentally (hiding points, web view, collapsible conversations, ...). The chan sites haven't; they're probably the closest interactive ones. Metafilter also hasn't, but it's pretty small.
> On my large screen, the lines are so long that it's difficult to read.
The text on the page adapts to the width of the browser window in which it is viewed. The result is that everyone can obtain the line width they prefer by adjusting the width of their local browser that is rendering the page.
If the lines are too long, just pop the page out into its own window, narrow the width of just that window until the lines are at a comfortable length for you, and you'll have lines that are just right for you. This page is an example of where you are in control of how the site renders and can easily adjust it to your own tastes.
FYI. The content looks real nice in "reader" mode (at least in Firefox).
I kind of think this is the best approach. Write your content with as minimal markup as possible. Let the user-agent render the page in my own styling.
I wish the web was a bit more like the gopher protocol, where the page markup was very minimal. Just give me the plain text (or close to it) and let my client render it the way I want to read it.
Markdown (or similar) would be an ideal representational format. This would push all the rendering decisions to the client.
Even better (IMO, because that’s what I do) is to provide your own markup that is reasonable, but let the underlying semantic HTML be there too so people can discard it and use their browser’s reader mode to view it too.
This approach of monitoring a single URL is useful, especially if you can use the free version of monitoring services. It doesn't, however, go far enough.
Does your site only consist of a single page? Are there no links between pages? Or links to external resources?
There are few things as off-putting to your visitor as presenting them with a link that ends in a 404 page.
Forgive the shameless plug, but this is the exact reason we built the Oh Dear [1] tool. We crawl the site, much like Google, to find those pages and present them to you so you can fix them.
It's not that expensive and it covers your _entire_ site, not just a single page that is designed to last. I hope your _entire_ site is designed to last, we like to help make that happen.
"I don't even know any web applications that have remained similarly functioning over 10 years"
I have a WordPress instance that has functioned flawlessly since 2007. I would call WordPress a web application (though I realize this is not the bespoke sense of the term that the OP has in mind).
I like the general principle here of trying to keep website design relatively simple and self-reliant, but I'm not sure I agree with the idea that "link rot" is some horrible problem we desperately need a solution for. Good articles will be lost in time, but I think that's OK. New good ones will be written too. It's OK for content to be ephemeral, we don't have to be so obsessed with FOMO and making sure every blogpost ever made is etched into diamond to survive for 100 generations. If you read some article you thought was great, it's OK to just be happy you read it rather than spamming it to the four corners of the earth to give it as many eyes as possible.
Indeed, it should use a "plain SVG" image instead. Some drawing programs, like Inkscape, use SVG as their save format and add a lot of internal tags/properties that are just junk for other programs. Generally they offer the option to save as "plain SVG" without all of these tags.
I would imagine it’s because one is markup of text, the other serialized image data. One potentially has meaning when viewed raw, one probably doesn’t.
HTML may contain style and script tags. Sometimes minification may break CSS or JS. However rare that is, if you want your archive to be reliable you don't want a step that may break something. If you intend to reduce storage space usage, HTML may be gzipped instead: this is "lossless", so it does not carry a risk of breaking anything.
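To make the "lossless" point concrete, here is a minimal sketch using Python's standard gzip module. The function names and the sample page are made up for illustration; the only claim is that a gzip round trip gives back exactly the bytes you put in, comments and whitespace included.

```python
import gzip


def compress_html(text: str) -> bytes:
    """Gzip an HTML document for storage; nothing in the markup is rewritten."""
    return gzip.compress(text.encode("utf-8"))


def decompress_html(blob: bytes) -> str:
    """Recover the original document, byte for byte."""
    return gzip.decompress(blob).decode("utf-8")


# Hypothetical sample page: repetitive markup, which gzip handles well.
page = "<!-- a comment that survives -->\n" + "<p>Hello, archive!</p>\n" * 200

blob = compress_html(page)
restored = decompress_html(blob)

print(len(page.encode("utf-8")), "bytes raw,", len(blob), "bytes gzipped")
print("identical after round trip:", restored == page)
```

Unlike a minifier, there is no parsing step here that could misread a script or style block, which is why it is the safer way to save space in an archive.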
A bit disappointed, mainly because I was expecting something a bit different. I was hoping to see some ideas that would make things last beyond the original author, say for 100 years. Of course that's hard to predict given tech will change, but what's the best-guess effort to make your page last as long as it possibly can into the future?
You need to address the problem of domain names... the author has a vanity domain name that needs to be paid for that will need to continue to be paid for, so may need to piggyback onto another domain
Hosting might be tricky
My best guess is MAYBE something like Github would solve both issues....
> the author has a vanity domain name that needs to be paid for that will need to continue to be paid for, so may need to piggyback onto another domain
This is a very good point. I host my blog with a vanity domain name[1]. The domain name was accidentally taken away from me[2] for a while due to which my blog became unreachable.
As a result of this experience, I now host a mirror of my blog using GitHub Pages[3]. Further, I have made sure that the entire blog can be downloaded[4] to local filesystem and viewed locally, i.e., all inter-linking is done using relative paths only.
I agree with all the points except for "Obsessively compress your images". The same argument as point 2 ("Don't minimize that HTML") applies: It's an extra step in the queue, and you lose some quality in the result.
Besides, bandwidth and storage will only get cheaper. Unless you have a site that contains thousands or more images, chances are you won't really notice the difference the compression makes.
Maybe for preservation we should even go in the other direction instead: Make every image a link to a full-size non-resized version of the image with optimum quality.
I disagree, image compression is still highly relevant and will stay relevant for a long time. It’s not about hosting costs and hosting bandwidth, it’s about user bandwidth.
I frequently travel between countries by train and bus and when outside cities it’s almost impossible to get any non-optimized page to load. And this doesn’t even only extend to the time spent inside the train. Depending on the country (all developed) and which phone operator my phone decides to switch to, I have slow mobile internet about half of the time. This is also one of the reasons why I prefer to read HN over other sites: HN loads, others don’t.
Most images on a modern website don’t really contribute anything substantial but are merely decorative. Image quality loss should therefore not be a problem for ~90% of images.
Compare the available mobile bandwidth and cost today with what we had just 5 years ago. We're talking about regular web pages here. Sure, you don't want to put full resolution images straight into your page. That makes no sense. But if you have a regular page with a couple of images, its size up to a certain size won't matter in most places in a few more years, just as it doesn't matter today if you live in a more densely populated area. I have a mobile phone contract where I get 10GB of traffic (which seems a lot to me but which is laughable if you compare it to other markets such as Kuwait where contracts with 1TB/month are normal) for less than 20€ per month and I only manage to use 3GB per month or so most of the time.
An experimental browser is to be considered something which will last a long time? Or would it need lots of maintenance? I can predict that in the next 5 years there will be some API change and users will have to migrate. And some users will not. If users get dropped, then it is not a perfect way to make pages last.
Yes it probably is great to get started for newbies, but the article was about things that last
Doesn’t seem like you know what you’re talking about. API changes? Sure, if you’re referring to the HTML language changing. The content does not depend on the browser being there. It’s just “dat” addresses. Dat is a public p2p protocol with good success stories, especially in the scientific community.
Does Google Web Fonts not do some tracking along the lines of analytics? I'd never integrate a Google service in my site and sell user info out for little/no benefit.
The point on not minimizing HTML doesn't make any sense to me. Who looks at HTML via View Source to understand a site, when the Elements panel is right there?
One of the issues highlighted is that the ecosystem moves fast and dependencies tend to break your setup. One thing that at least partially has mitigated this problem for me is using a stack that is mature and very careful about breaking changes. Clojure is a great example of that. My gut feeling tells me that if you had a repo in Clojure that created your static pages 5 years ago (when Java 8 was released) it will still work today.
I'm confused about the widely held opinion that it is paramount that all web content persist for infinity. That seems parallel to someone expecting my personal handwritten journal to be archived and available to all of future humanity. I'm not convinced that is a positive thing, nor is it a necessary thing. How egotistical do you need to be to think that your personal blog is worth persisting forever?
It's partially that for most other media, once you have found and cataloged a piece of media/data, it's safe. Books don't disappear off my shelves. Even things you haven't gotten specific copies of generally don't disappear entirely; they may get harder to find, but there's usually a copy /somewhere/. Things on the internet tend to disappear entirely unless they've been included in one of the few archiving efforts like archive.org.
I wish I had every single personal journal/notebook/whatever of every human who ever existed (copyright protections granted). Don't you? Socrates', the diary of a simple seaman on Columbus's 4th voyage, the diary of a lesbian peasant in 15th-century China, can you imagine that? How much better our understanding of the past would be, and how many jewels would be preserved. Sure, lots of junk, but as the saying goes: One Man's Junk is Another Man's Treasure.
I am all for archiving important web content, but imo most web content is not important and is impossible to archive anyway. Too many pages use dynamic content, which makes web archives useless. Also, having too much data just dilutes the useful stuff.
If you have useful content (the author should), store it in some other way. Put articles in any simple markup format like markdown in git, distribute videos using torrent and so on. For some types of content I couldn't think of a good solution yet. For example, JS games or applications should be in some self-contained single-file "web-executable" format. Images often need to be embedded, so torrent is not a good medium, but they don't belong in git either. So not really sure where to put them...
Also make sure to use standardized formats with proper structure. Browsers should remove quirk modes and flat out refuse to render anything which cannot be interpreted 100% unambiguously.
Webpages are always just a temporary distribution medium. They should never be the original storage place.
Sorry if all of that sounded a little bit rant-y, but I think the current web is in a very bad state.
There are still orders of magnitude more pages of HTML than Markdown, so if you're interested in archival, just store simple HTML. I'd say HTML has a good chance of being interpretable a thousand years from now.
I was thinking about this exact subject matter the other day. We invest so much time in our digital content but what happens with death? How long can we really expect our content to last?
Ultimately we are all dependent on the functionality of the internet and making a physical copy is the best way to make something last. However if we intend to not only have it last but be accessible, some of the suggestions are helpful but the most significant is: where can I host something indefinitely?
Will Netlify, Github Pages, AWS, etc all be around in 50 years? 100 years? 500 years? Heck, the internet hasn't even been around for all that long.
As I write this, my only thought is that you need a system of fallbacks. Frankly, this seems like a business opportunity in which the cost of the infrastructure is purchased based on some formula. DNS is configured programmatically. File locations are distributed and redundant. I'm not sure what the best approach is but one thing is certain: the page accessibility is the least of this author's issues. He just needs the site to be hosted...
Any advice on what to do if you want to embed videos?
Bandwidth can cost a lot (maybe I'm wrong?) and AFAIK Cloudflare doesn't cache videos.
I would stick with Vimeo, but they can delete your videos if over threshold after you stop paying. YouTube at this point is sadly a better bet if you want to maintain a low cost website for decades.
To last a long time, you need a lot of people hosting copies. Think about PDFs -- it's easy to host a local copy of a PDF, so popular PDFs tend to be widely available. But a site hosting a local copy of every webpage it links to would feel weird.
Wonderful write-up. In particular, I appreciated how actionable this was. As a backend developer that loves the web but isn't part of the whole frontend mania, I completely relate, yet still learned a couple things that I can apply easily.
The problem is clear but the solutions presented don't seem to help all that much.
This is why I hope for a future with content-based addressing, as used in bittorrent and IPFS.
Bittorrent has its focus set on files, and IPFS is becoming a bit convoluted but the core idea is very powerful.
Immutable content persists by "pinning". And even if the author no longer wants to maintain the resource, there are likely some "users" that will (eg. linked pages). And even if there is little interest from users there are still archiving services that might want to keep it. Persistent unless it's so vile or irrelevant that even archiving services ditch it.
If you need any particular ones not available on http://atdt.freeshell.org/k5/ let me know here and I can pull them out of an old archive for 2000-2007.
Good ol' localroger. I never knew I wanted to read singularity fiction that included zombie rape and forced incest until Metamorphosis of Prime Intellect.
This is also why if you ever want to make a platform or service that will rely on permanent links make sure you consider making configurable/trackable routing as part of a resource’s stored data record, not just whatever your application decides at the time. That way when you inevitably want to upgrade the system and change the routing you can always maintain the old ones because they aren’t purely defined by application implementation. This is the difference between crafting software to do something and crafting software to build something that manages the thing you want to do.
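A rough sketch of what "routing as part of the stored record" could look like, in Python. Everything here (the Resource class, its fields, the resolve function) is hypothetical and only meant to illustrate the idea: each record remembers every path it has ever been published under, so old links can be answered with a redirect instead of a 404.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Resource:
    """A stored record that carries its own routing history (hypothetical model)."""
    resource_id: int
    title: str
    canonical_path: str
    old_paths: List[str] = field(default_factory=list)

    def move_to(self, new_path: str) -> None:
        """Change the canonical path while remembering the previous one."""
        if new_path != self.canonical_path:
            self.old_paths.append(self.canonical_path)
            self.canonical_path = new_path


def resolve(resources: List[Resource], path: str) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, location) for a requested path."""
    for res in resources:
        if path == res.canonical_path:
            return 200, res.canonical_path
    for res in resources:
        if path in res.old_paths:
            # An old permalink still works: redirect permanently to the new home.
            return 301, res.canonical_path
    return 404, None


post = Resource(1, "Hello", "/blog/hello.html")
post.move_to("/posts/1/hello/")  # e.g. after a CMS migration
print(resolve([post], "/blog/hello.html"))  # old link redirects to the new path
```

The point of the design is that the redirect table is data that migrates with the content, not a side effect of whatever router the application happens to use this year.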
I love this! I was going to write something similar. The increasing prevalence of sites that shard their content over multiple pages to increase ad impressions seems beneficial to nobody; the consumer, producer, and advertiser all have degraded experiences. The problem is that the companies which are paying out for ads (Google) don't penalize this type of behavior, so it is financially beneficial to have super bloated websites which require 15 page refreshes to view all the content.
wikihow is a very good antipattern to follow. This site should be deindexed from google entirely.
pURLs are hard. I have one which is an X.509 enforced pointer to a PDF which is 'terms and conditions' and the pain to ensure the link never dies (it's embedded in binary objects which are long lived, and out in the world beyond our control) is non-zero. Change publishing model, change CMS, the pURL is at risk.
Publication models where you pay the cost for somebody to be the archival reference make some sense. URI pointing into a store. Buckets in Google? But if you decided to move on, could you get a retained 302 redirect to point to the new home?
> I think we've reached the point where html/css is more powerful, and nicer to use than ever before. Instead of starting with a giant template filled with .js includes, it's now okay to just write plain HTML from scratch again.
This is so true, I do it for all my personal webpages. Well I currently have Github Pages in between, but markdown converts to pretty neat vanilla HTML and I wrote the templates myself.
On another note, to his third point, I couldn't access index.20191213.html :)
The only pages I ever go back to are the ones where I hope, or need, to find NEW content. Forums, stock art, reddit, hacker news, facebook, twitter, github (for updates).
There is really only a small subset of Internet content that will ever need to stay the same - or even should stay the same. So, while slapping an HTML file on the web is great for, say, the synopsis of a movie, I think the majority of the web can be discarded when new and relevant content needs to take its place.
Maff, Mht,
it's the best way I found to save snapshots of websites.
if you want to save an entire site, there's PDF.
I keep them yearly in sub directories marked by year,
it's pretty convenient and you can rename each Maff
to be most relevant to a directory view, special pages
go below the year directory.
currently only waterfox still has this option, since
the snooty mozilla guys can't bring themselves to carry
on a 15 year WORKING method that wasn't Not Invented Here...
XML tools should be able to validate the HTML locally, as these pages have already been "rendered" ahead of time.
And it would be interesting to see some stats on the types of responses that are returned for the broken links. Are they returning HTTP 404s? 30xs? 200s, or 500s? Does their site even allow for 30x redirection?
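For what it's worth, once you have the status codes from a crawl, the stats themselves are only a few lines. This is a sketch with a made-up bucketing scheme (the bucket names are mine, not any standard), and it deliberately does no fetching; the list of codes is assumed to come from whatever link checker you already use.

```python
from collections import Counter
from typing import Iterable


def classify_status(code: int) -> str:
    """Bucket an HTTP status code for a link-rot report (bucket names are arbitrary)."""
    if 200 <= code < 300:
        return "ok"  # could still be a "soft 404" page served with a 200
    if 300 <= code < 400:
        return "redirect"
    if code in (404, 410):
        return "gone"
    if 400 <= code < 500:
        return "client_error"
    if 500 <= code < 600:
        return "server_error"
    return "other"


def summarize(codes: Iterable[int]) -> Counter:
    """Count how many checked links fall into each bucket."""
    return Counter(classify_status(code) for code in codes)


# Hypothetical crawl result: one status code per checked link.
print(summarize([200, 301, 404, 404, 500, 410]))
```

The 200 bucket is the one to be suspicious of, since a parked domain or a redesigned site often answers every old URL with a friendly page and a 200.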
But it makes sense to reduce infrastructure to serve static content, which are all points of failure anyway, or at least a maintenance burden.
Archive.org also supports adding custom pages to be saved.
I miss a feature in the Wayback Machine Chrome extension to automatically check every bookmark, and auto save it if it is not already saved:
https://chrome.google.com/webstore/detail/wayback-machine/fp...
All my projects have a README.md, some a README.org. These are simple uncluttered formats which everyone with the most basic of readers can read. A few years back I found my decades-old master's thesis, written in plain-text LaTeX. It compiled first time to a pretty PDF on a Raspberry Pi.
If you care about content, or anything really, keep it simple.
Is there a browser extension that can automatically add a site that you bookmark to the Wayback Machine and fetch the bookmarked site from the Wayback Machine if it 404's? I really hate it when I bookmark something and come back to it months or years later to find that the page doesn't exist any more so having something to automatically pull it from the web archive would be amazing
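I don't know of an extension that does exactly this, but the lookup half is easy to sketch in Python. This assumes the Wayback Machine's public availability endpoint (https://archive.org/wayback/available?url=...) and the JSON shape it returned when I last looked, with an "archived_snapshots" object holding a "closest" entry; treat both as assumptions and check the current docs before relying on them. The function names are my own.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Wayback Machine availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url: str) -> str:
    """Build the query URL asking whether page_url has an archived snapshot."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": page_url})


def closest_snapshot(response: dict) -> Optional[str]:
    """Pull the snapshot URL out of a decoded response, or None if there is none."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(page_url: str, timeout: float = 10.0) -> Optional[str]:
    """Ask the API over the network; only call this for a bookmark that 404s."""
    with urlopen(availability_url(page_url), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

A bookmark manager could call lookup() whenever a saved link stops resolving and offer the snapshot URL instead. Saving a page at bookmark time is a separate request that this sketch does not cover.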
Right after we launched, we were pleasantly surprised that a good portion of our users started using our automated visual site-mapping platform [1] to archive their sites (while redesigning newer versions of them) and other public sites of interest.
These points look like the old (timeless) guidelines I remember for creating a web page ... solid, good, and as plain as a saltine cracker. I love it and it stores well but convincing the world while we romp through the hype of technological progress is like following a mob through the streets telling them to pick up their garbage.
Jeff's article has similar sentiments and technical recommendations as Tim Berners-Lee's classic piece "Cool URIs don't change" (1998): https://www.w3.org/Provider/Style/URI
I do believe that, usually, you should make it possible to be archived. (In one of a few cases where it isn't, consider if HTTP(S) is even the correct protocol for what you are doing; sometimes it isn't.)
I agree that you can (almost always) write HTML without JavaScript code (CSS is not always needed either, but nevertheless it can be helpful). This improves portability too. If you do want to use JavaScript to generate static documents, consider using Node.js and have it output a plain HTML document, which will then be the hosted document, rather than hosting the JavaScript version. (This way, the code only has to run once.)
About fonts, I think usually the fonts are not an essential part of the document. (Still, you don't usually need to use so many, but often you do not need to specify the font at all, except you may still need to specify if it is monospace, bold, etc.) Also, you do not usually need so many pictures on your web page (except picture galleries).
I do think if you need a copy of a web page then you can copy it.
You can also use plain text documents (without HTML); it is what I often do.
"monospace" is a property of a given font but is not (usually) a variant of a single font family as is "bold" (font weight).
While bold text is often used for emphasis and setting it to "regular" doesn't change the meaning much, using a monospace font signals specific things (often, it's used to represent code snippets, but in some contexts it can mean "work in progress"). Changing that font to a proportional one strips a lot of meaning and readability from the text.
In that regard, font families are an essential part of the document, and not just their stylistic/emphasis variants (e.g. size and weight, which can still convey important meaning on their own. Think titles and headings.)
Even Glk, which has styles meant for meaning and use rather than formatting, nevertheless has a monospace style, since some things need to be displayed using a monospace font to be displayed correctly. However, you could just as well display all text using a monospace font, if you are using only one font, so you might not be able to "signal specific things" by the use of a monospace font.
For the case of headings, this can be done without specifying the font yourself; if you specify it is a heading, then the rendering software can decide how to display it. (HTML has <H1> and <H2> and so on for headings, and Glk has style_Header and style_Subheader.)
Honestly, in almost all places that monospaced fonts are used, they’re purely a stylistic choice. (The remainder is mostly ASCII art, or plain-text representations of tables—in which case alignment is important, not the font—or something where tabular figures are desirable for number alignment.) I don’t say that that makes it useless in any way, but honestly monospaced text in code editing is overrated. You can live without it easily, and will probably get used to it very quickly.
I agree about monospaced text in code editing. I was thinking about uses such as code excerpts in literature (class names, inline bash commands, etc). Monospaced fonts[1] play a major role in reading and understanding the content. This is more than a stylistic choice.
[1] actually the font doesn't need to be monospaced, but it must be very different than the font used for main copy.
While in some cases the font doesn't necessarily need to be monospaced, some documents might require that a monospaced font is used, so it should be monospaced anyway. Although, as you say, it should be different from the main copy; so if all of the text is monospaced, there should be some way to distinguish it, such as by selecting a different monospaced font, or displaying it with a different colour.
(One case where a monospace font is required is if you are automatically including text from another document which uses plain text format; there is no way to determine what format is needed, but monospace will always work.)
Regarding fonts, I think it's good practice to archive fonts together with your site's source, if the site relies on third party resources, like Google Fonts. (Switching the link target should be easy. Embedding a commented link to a stylesheet linking to these local resources along with the third party resource link may be of help.)
We need a common distributed storage protocol - kind of Bit Torrent for web pages. Dead simple in use and built into (every) browser transparently. Anytime a public page is visited/bookmarked then it is preserved in a swarm of caches until the last seeder dies.
Well, a new HTTP status code or header could be created for sites that are closing, then if a browser navigates to the website, it could prompt the user for action? e.g. update bookmark, archive, or email site admin (if on the hosting side).
Most sites close quite abruptly (domain didn’t renew, web server crashed, site was “modernized”, etc). Most link rot is due to the fact that people aren’t actively trying to keep things alive. An HTTP code won’t help with that.
I'm using wallabag (an open-source Pocket-like alternative) for bookmarks, and it saves a copy of the website in a reader-mode fashion, so text and images. So far I have around 2k links in there and it takes up no significant space.
Amen to that at least. Nothing bums me out more than someone using jquery just for a simple DOM selection, or having a simple layout and using bootstrap just because you know their columns.
HTML is worth minifying just to get rid of comments. I don't want my "TODO(mullens): " comments to be visible to end users if possible. And gzip doesn't get rid of comments at all.
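If comments are the only worry, you don't need a full minifier for that. Here is a deliberately naive sketch in Python that only removes comments and touches nothing else. The regex and function name are mine, and it is an assumption that your pages have no "<!--" inside script or style blocks and no conditional comments; if they do, use a real HTML parser instead.

```python
import re

# Matches an HTML comment, including ones that span several lines.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(html: str) -> str:
    """Remove HTML comments and leave every other byte of the page alone."""
    return COMMENT_RE.sub("", html)


page = "<p>Hi</p><!-- TODO(mullens): fix this -->\n<p>Bye</p>"
print(strip_comments(page))
```

Run as a publish step, this keeps the served HTML readable in View Source while the TODOs stay in your working copy.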
The solution to designing web pages that last will have huge overlap with the missing layer of services on top of free and open source software. If that ever gets kissed into existence...
With the exceptions of my root index.html generated by the tree command and the subdirectory listings handled by apache, my entire site is plain .txt files. Should be good for a while...
Kinda sounds like you don't link to any other content, though (I guess you can include links in plain text, but that's a bit of a pain for your readers) which means your site doesn't really fit the use case.
Yes, it's a dead end, a cul de sac that links to nowhere. And that's exactly what I intended. But if you see something that is of further interest, highlight it and search. That will almost always produce topical results...
The number of potential re-hosters is increased by releasing everything with a free license. Some won't violate copyright laws, e.g., Wikimedia Commons for individual media files.
For this exact reason I've been pushing to use static HTML on my personal blogs and I spend a lot of time optimizing images for both retina and 1x resolutions. Basic HTML & CSS are good now, and there is no reason fairly basic content can't be stored as static pages.
You can convert PHP or other dynamic pages to static fairly easily, it takes about half a day of scripting and a little time with configuring your web server. The pages load quicker and the load on your server is lower.
I've long considered the idea of a permanent hosting solution, where people pay to get their content up and I guarantee its longevity, but the business model on that is tough.
I find it an interesting social phenomenon that the finite nature of content on the Web is lamented, when content has been temporal for all of human history.
Carved stone is a bit more durable than that nice Myspace page was, however. Yes, it was not widespread and was difficult to set up, but millennia of duration is nice.
I think that the medium, be it a hosting site or your own domain, will fail. What we need is maybe: a naming scheme that can transfer to different people (replicated files on different servers indexed by content hash? Aka anonymous ftp mirrors ... ) and common media, i.e. readable by anyone, gracefully degrading.
For preservation, using self-hosted fonts is better than trusting the “Web-safe” fonts being available on the client (already not true e.g. on Android).
HTML was a bad idea that just kept sprouting cancerous growths on top of itself. Just like the author recommends not using non-web fonts, we honestly don't need much markup beyond "monospace", "em", "strong", "title 1-5", and "p". Everything else HTML does is a huge distraction.
In the past thirty years, there have only been a few actual kinds of content:
- Article
- Comment
- Reference manual
Nothing really deviates from this. If it does, then it's probably in the "application" domain and deserves all that javascript stuff.
I imagine a new semantic web inspired content type that bakes in permanent content URLs, author signatures, and supports sharing P2P and rich archival:
<article>
  <id>
    (Computed immutable content hash and signature. Subsequent revisions
    will invalidate this and require new content hashes. Links can be made
    to the old version.)
  </id>
  <authors>
    (metadata, homepage, public key)
  </authors>
  <contents>
    <title> Semantic Web was actually brilliant </title>
    <body>
      <h1>The Web was Embraced, Extended, and Extinguished</h1>
      <p>Lorem ipsum dolor sit amet...</p>
    </body>
  </contents>
</article>
It can be a really simple model. No divs or presentational CSS or anything. (Well, maybe some support for LaTeX and I18N.) The client decides how it looks, which is how it always should have been.
If every entity in the universe has an ID, they can link to each other in a distributed fashion. It won't matter if the original source URL goes down as long as someone somewhere has a cache.
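To show how little machinery the ID part needs, here is a minimal sketch in Python. The "sha256:" prefix, the ContentStore class, and its methods are all made up for illustration; signatures and author keys are left out entirely. The ID is just a hash of the content, so any cache anywhere can serve the bytes and the reader can check them without trusting the cache.

```python
import hashlib
from typing import Dict


def content_id(body: bytes) -> str:
    """Derive an immutable ID from the content itself (hypothetical format)."""
    return "sha256:" + hashlib.sha256(body).hexdigest()


class ContentStore:
    """A toy cache keyed by content hash; any peer could run one of these."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, body: bytes) -> str:
        """Store a blob and return the ID that others can link to."""
        cid = content_id(body)
        self._blobs[cid] = body
        return cid

    def get(self, cid: str) -> bytes:
        """Fetch a blob and verify it still matches the ID it was requested by."""
        body = self._blobs[cid]
        if content_id(body) != cid:
            raise ValueError("stored content does not match its ID")
        return body


store = ContentStore()
article_id = store.put(b"<article>Hey you! I disagree!</article>")
print(article_id)
print(store.get(article_id))
```

Editing even one byte gives a different ID, which is why a revision needs a new hash and why links to the old version keep pointing at exactly the old text.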
It's also probably better to binary encode these (protobuf or something). Images and media could be inlined instead of href'd.
Instead of building on the layer cake, we should take a moment to reflect on what we're trying to do. We have a lot of history and bad evolution that we can garbage collect and streamline.
If our objective is to publish and share content, a lightweight version of the semantic web is the way to go.
Oh, and check out my follow-up to the author:
<article>
  <id> (my own post's content hash) </id>
  <parent> (the author I'm responding to) </parent>
  <author> me </author>
  <contents>
    Hey you! I disagree!
  </contents>
</article>
Again, these can be distributed on a web. It can be p2p, federated, or live atop the classical WWW. It doesn't matter. The clients can richly use the semantic data model.
The anger for disappearing content, walled gardens, and AMP will eventually reach a boil. As technologies go, the pendulum is always swinging. This is the direction we'll eventually head back to.
It's one thing to dream up a minimalist markup language that ought to be enough for everybody, but it's a whole different problem to preserve the large body of existing content out there. Actually, SGML (from 1986, and on which HTML is based) has the capability to render/transform your mini markup language to HTML, and also render markdown and other custom Wiki syntax.
I agree, it is correct, we do not need much markup other than the five you specified, plus hyperlinks; those are good enough. Yes, the client should decide how it looks.
I'd include lists and tables for data presentation. Also, we may want some (additional) tags to manage footnotes and references (these were in SGMLguid, but didn't make it into HTML.)
You could just link to them I think, possibly with a hint to specify inlining; the client decides whether to obey that hint or to ignore that hint. (This would be done the same whether the picture is part of the document or is a separate file, I should think. It makes many considerations easier to work with.)
I guess, the Netscape Navigator 1.0 specs (tables and forms, but before frames and JS) are quite what we may want (with the possible exception of the font tag and, certainly, without the blink tag), augmented by some footnote/reference tags and a reduced set of CSS.
I had no intention of bringing up "blockchain", only that cryptographic hashes must play an important part in any archival scheme intended to last generations.
Otherwise you must trust that an archiver hasn't changed history. Why trust when you can verify?
FileCoin is a blockchain and is the incentivization layer on top of IPFS that allows users to rent hard drive space. It is intrinsically connected with IPFS.
The parent comment is still accurate. IPFS isn't a blockchain and shouldn't be dismissed for this use case, which is more or less what it was designed for.
Filecoin is just for incentives, but IPFS has been running for years without it.
Disappearing content is a blessing, not a curse. Let it all be replaced by new people doing the same things slightly differently, instead of constantly having to confront prior art.
This - so, so, much. I don't know where the assumption that everything is worth preserving comes from. To me, this is even part of the beauty of the internet: Things appear, then vanish again.
Often, the antique content is irreplaceable. If it disappears, then valuable and important reference material is lost forever. Consider, for example, if you are working with hardware or software created decades ago -- if information about such systems is lost, you're pretty much hosed.
As another example, I keep every line of code I write (that wasn't written for an employer) forever. I often pull up code that I've written decades ago to use in new projects.
The beauty of the internet is that the old and the new aren't mutually exclusive -- there's plenty of room for it all.
> *from my high school years, where I first tasted the god-like feeling of dominance over software*
This piece of writing killed the rest of this article for me. While his point is worth making, the way he went about making it just sounds very childish. Especially for a Professor.
This is, by and large, impossible. The hoops you have to jump through and the downsides you have to endure are just a death by 10,000 cuts. Try writing a tabulated container in raw HTML and CSS that flex sizes and behaves nicely with the browser back button. Partial page reloads, containers with native (no reload!) sorting: there are so many features in modern web design that are just downright impossible without JS.
I am hesitant to put words in the author's mouth, but I suspect he would agree with my stance here...
The solution is to not require things like partial page reloads, etc. This shouldn't be a huge hardship -- a properly designed modern site can degrade gracefully anyway, so users that don't have or allow things like JS can still use it, even if in a "degraded" form.
Heck, with the cost of storage so low, recording every webpage you ever visit in searchable format is also very realistic. Imagine having the last 30 years of web browsing history saved on your local machine. This would especially be useful when in research mode and deep diving a topic.
[1] https://github.com/machawk1/warcreate
[2] https://github.com/machawk1/wail
[3] https://github.com/internetarchive/warcprox
EDIT: I forgot to mention https://github.com/webrecorder/webrecorder (the best general purpose web recorder application I have used during my previous research into archiving personal web usage)