Bare HTML never goes out of style. My archive doesn't use any JavaScript at all; it's just HTML with the smallest amount of classless CSS I could get away with.
I love you, will you marry me?
Can you make the rest of the web like this? That would be awesome.
The bad news is market forces are such that Web 1.0 will never again be profitable. The good news is market forces are such that Web 1.0 is effectively unkillable.
... if not already profitable due to historical network effects, like https://craigslist.org, which was at ~$10M in annual revenue per employee, north of $500M/year total.
I agree, or at least simplicity is often given lip service while complexity is implemented because it’s easier.
> 3 Git repos, each submoduling and iterating upon the last
Not throwing stones, just curious — is that really the simplest approach? I would have thought one repo, with folders for source, raw articles, and website. You’ve obviously given it a lot of thought. Why this model?
I was going to call this out as something I super loved.
> The first repo archives the page as-is with a cronjob. The second repo turns the first repo into cleaned-up Markdown with an English translation with another cronjob. The third repo turns the cleaned-up Markdown into a Hugo website with a third cronjob.
This is a super awesome pattern, imo.
You never have two systems trying to commit onto the same head, so no clashing. If one process goes bad, you don't have to untangle interwoven history; your source history stays pure. It also makes event sourcing easy: when a repo changes, the next process starts, whereas if you conflate concerns together you have to start filtering down your edge triggering. In this case the author is using cron jobs, but this would be a viable medium-throughput, modest-latency event-sourced system with little change!
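For anyone curious, here is a minimal sketch of one such cron step in Python, assuming made-up repo paths and leaving the actual HTML-to-Markdown cleanup as a placeholder (this is my reconstruction of the pattern, not the author's code):

```python
#!/usr/bin/env python3
"""Hypothetical cron step: when the upstream repo's HEAD moves, run a
transform and commit the result into the downstream repo."""
import subprocess
from pathlib import Path

UPSTREAM = Path("selkouutiset-scrape")      # repo 1: raw archived pages (assumed path)
DOWNSTREAM = Path("selkouutiset-markdown")  # repo 2: cleaned-up Markdown (assumed path)
STATE = Path("last-processed-commit.txt")   # remembers the last upstream HEAD we handled

def git(repo: Path, *args: str) -> str:
    return subprocess.run(["git", "-C", str(repo), *args],
                          check=True, capture_output=True, text=True).stdout.strip()

def transform(src: Path, dst: Path) -> None:
    """Placeholder for the real cleanup / translation step."""

def main() -> None:
    git(UPSTREAM, "pull", "--ff-only")
    head = git(UPSTREAM, "rev-parse", "HEAD")
    last = STATE.read_text().strip() if STATE.exists() else ""
    if head == last:
        return  # nothing new upstream, so nothing to do downstream

    transform(UPSTREAM, DOWNSTREAM)
    if git(DOWNSTREAM, "status", "--porcelain"):
        # Recording the upstream hash keeps the provenance chain auditable.
        git(DOWNSTREAM, "add", "-A")
        git(DOWNSTREAM, "commit", "-m", f"Process upstream commit {head}")
    STATE.write_text(head)

if __name__ == "__main__":
    main()
```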
This, to me, is much simpler. There's little that appeals to me about conflating streams. Having the history show up as one continuous record within a single repo could be seen as an advantage, but even so I'd rather make a fourth repo that merges the streams after the fact.
Simplicity is in the eye of the beholder. Not conflating different concerns feels much simpler to me.
This mirrors how many big data / data warehousing pipelines work. Originally a lot of transformations tended to get pulled into the same flow as moving data around, but storage has gotten progressively cheaper and this is so much simpler and durable.
That's essentially why I gravitated to it, yeah. To me, Git submodules are a pretty generalized and elegant solution to the question of "how do I model a directory of files as an immutable data structure". If you know the commit hash of the submodule, you know exactly what is going to be in there. That kind of certainty is very helpful when trying to code up tricky data cleaning and the like.
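To see that certainty concretely: the parent repo stores the submodule as a single pinned commit hash, which you can read back mechanically. A small sketch (repo and submodule paths are made up):

```python
import subprocess

def pinned_submodule_commit(repo: str, submodule_path: str) -> str:
    """Return the exact commit the parent repo has pinned for a submodule.

    For a gitlink entry, `git ls-tree HEAD <path>` prints
    "160000 commit <sha>    <path>", so the third field is the pinned hash.
    """
    out = subprocess.run(
        ["git", "-C", repo, "ls-tree", "HEAD", submodule_path],
        check=True, capture_output=True, text=True,
    ).stdout.split()
    return out[2]

# Hypothetical layout: the Markdown repo submodules the raw scrape repo.
print(pinned_submodule_commit("selkouutiset-markdown", "selkouutiset-scrape"))
```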
My usual approach to submodules is to keep running commands until something sticks—probably not the most scientific approach.
If you're going to have multiple repos I find it cleaner and more convenient to use your language's packaging system; each project becomes just another dependency.
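For example, in the Python ecosystem (names below are made up; this is just an illustration of the idea, not anyone's actual setup), the downstream stage can pin the upstream stage as an ordinary Git dependency:

```toml
# pyproject.toml of a hypothetical downstream project
[project]
name = "archive-markdown"
version = "0.1.0"
dependencies = [
    # Pin the upstream stage to an exact tag or commit for reproducibility,
    # which plays the same role as a submodule's pinned hash.
    "archive-scrape @ git+https://example.com/you/archive-scrape.git@v1.2.0",
]
```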
I wish rolling your own packages weren't such busywork. It feels like a homework assignment even with generators. Lazy code is more fun to write, for sure, and faster to results.
Why is no-JS so appealing to some people? Honest question, I wanna understand.
Edit: from the replies, what I understood is that people who dislike JS think the web is a medium exclusively for reading? Apps should be desktop executables. Fair enough. It's a good thing this is a negligible % of internet users.
Page bloat is one reason, and tracking/privacy-invading JS is a pest, but for me at least the main problem is that most JS is used to break how people want to use the web.
The web is a document-centric medium. I load a URL; I want to look at the document presented at that page.
Things I likely don't want to deal with are banners, pop-overs, or other code grabbing my attention (if you don't show me a cookie banner, you've done the right thing: not used cookies that aren't strictly necessary, thanks!). I probably don't want infinite scroll, "funky" animations, or other stuff that tries to break the document medium.
I think the only exception is if I'm looking at a document that wants to use an interactive chart; that's fine. Form validation might be helpful, but do you really need anything beyond built-in HTML client-side validation?
It's not because of a religious opposition to JS, but because JavaScript-laden sites usually take a long time to load, require the latest hardware and browsers, and come with pop-up banners, third-party ads, and bloatware.
What we need is a notion of "low JS", which is really what we had before we somehow grew an industry of JS framework mania. JS with fallbacks is a very useful and pleasant thing; JS powering the entire experience is - in most cases - not.
A consequence of Rice's Theorem is that there can never be a useful, enforced distinction between the garbage pile we have right now and anything more permissive than "no JS": whether a page's scripts are "well-behaved" is a semantic property of the program, and those can't be checked mechanically.
Before WHATWG hijacked the web we had apps on the internet -- Flash, Java applets, etc. They had slightly more friction than ordinary webpages, which forced publishers to think hard: hey, do I really need an app, or can I do this with static HTML? Now, maybe Flash and Java applets aren't the best platforms for web apps, but the apps need to be kept separate from the documents, and there needs to be some small frictional cost imposed on publishers who choose the app route -- because otherwise all of them will choose it out of laziness.
In the olden times (e.g. 1999) you would write HTML by hand (notepad.exe), the same way people write Markdown today. It works as a single thing to move around; I don't need to bring a second thing with me, so reorganizing simple HTML files is easy. So if I was just going to write a one-off document, HTML with images and tables was the easiest option (see the very many professor websites, GeoCities, Angelfire, etc.). Once org-mode became the better version of this, because it could export plain HTML, I shifted to that personally, but if org-mode didn't exist I would probably still be writing HTML directly when it comes to typesetting tabular data, versus dealing with Word/Google Docs tables.
Automating HTML is also easy, since you can just write it to stdout directly, and if you are comfortable writing HTML by hand, it's a small uplift to automate the things you used to do manually.
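To make the stdout point concrete, a toy example (Python as the generator language, made-up data):

```python
#!/usr/bin/env python3
"""Emit a complete standalone HTML page with a table to stdout.
Redirect it to a file and you're done: python3 table.py > grades.html"""
from html import escape

rows = [("Mathematics", 5), ("Physics", 4), ("Finnish", 3)]  # made-up tabular data

print("<!DOCTYPE html>")
print("<html><head><meta charset='utf-8'><title>Grades</title></head><body>")
print("<table border='1'>")
print("<tr><th>Course</th><th>Grade</th></tr>")
for course, grade in rows:
    # escape() keeps hand-rolled HTML safe if the data ever contains < or &.
    print(f"<tr><td>{escape(course)}</td><td>{grade}</td></tr>")
print("</table>")
print("</body></html>")
```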
This is part of why jQuery was so popular: you could just add it to this vanilla HTML in a nice way. You didn't have to stand up JS scaffolding and a toolchain to get there. Even jQuery can move you away from WYSIWYG, though, because you no longer just render the DOM, you also need a JS runtime to evaluate scripts. In some contexts you have something that can build a DOM from HTML but no JS at all; it was only when Chromium embedding became popular and normalized that you got both the HTML and the JS together. For a long time you only got the HTML.
Finally, direct HTML will work as-is forever, but JavaScript is a moving target. Some terrible websites I wrote in raw HTML still work today after 20 years (admittedly the 1024x768 optimizations didn't age well), but many JS things I have written have not had that longevity, or require standing up a long-dead development framework to get going again.
Why send 100 MB when the content is a paragraph of text? It's lazy and wasteful. Static sites are amazing. Pages actually load as fast as you'd expect them to, given modern hardware and connections.
Re/ not hotlinking the images: Not gonna happen :) I'm not going to balloon the size of the repo, or add the overhead of a database, just so I can locally serve those images. I'd rather take them out entirely at that point.
Re/ linking back to the source: Great idea! I look forward to your PR. Start with https://github.com/hiAndrewQuinn/selkouutiset-scrape/ and see if you can figure out the logic behind how YLE.fi chooses its archive tags, maybe there's a pattern in there I was unable to see when I tried this myself.
- In the first step I "tag" URLs I like (in my RSS reader, HN, Twitter, PDFs, podcasts...). It is very quick; I usually spend about 2 hours per week filtering news.
- Later I collect the tagged URLs (from email, notes, and bookmarks) and feed them into a batch job.
- The batch job downloads the text form of each item and converts it into Markdown. HTML is converted with Reader mode, videos get transcripts with timestamps... (a rough sketch of this step follows below the list).
- This gets converted into a new Obsidian vault and synced with Syncthing across my devices.
- There is an AI job that does some annotation and linking with my existing notes.
- Over the next weeks (usually two) I slowly "digest" the content: read it and annotate it. In the end it becomes integrated into my Obsidian notes. I can copy & paste and write directly into the articles.
It is a very natural process for learning. I spend minimal time dealing with junk content, clickbait, ads... And it works for all types of content, including video and PDFs. And offline absolutely rocks!
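For anyone who wants to build something similar, here is my rough guess at the download-and-convert step, not the poster's actual code; it assumes the third-party requests, readability-lxml, and html2text packages and a made-up vault path:

```python
#!/usr/bin/env python3
"""Batch job sketch: turn a list of tagged URLs into Markdown notes."""
import re
from pathlib import Path

import requests                   # pip install requests
from readability import Document  # pip install readability-lxml
import html2text                  # pip install html2text

VAULT = Path("vault/inbox")       # made-up Obsidian vault folder
VAULT.mkdir(parents=True, exist_ok=True)

def slugify(title: str) -> str:
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-") or "untitled"

def archive(url: str) -> Path:
    html = requests.get(url, timeout=30).text
    doc = Document(html)                           # "Reader mode"-style extraction
    markdown = html2text.html2text(doc.summary())  # cleaned HTML -> Markdown
    note = VAULT / f"{slugify(doc.short_title())}.md"
    note.write_text(f"# {doc.short_title()}\n\nSource: {url}\n\n{markdown}")
    return note

if __name__ == "__main__":
    for line in Path("tagged-urls.txt").read_text().splitlines():
        if line.strip():
            print("archived:", archive(line.strip()))
```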
Do you mind elaborating on your process for ‘tagging’?
I've been looking for some kind of solution where I can send a URL to something which then scrapes it.
Currently my process is to just send it to myself on a self-hosted Zulip server (a Slack-like alternative), but I find it a bit cumbersome.
I've been considering an email-based system after reading a blog post from Stephen Wolfram that someone posted here recently, but I've been looking at what others have done as well.
It is not really automated; it's more like a few places I collect links from:
* Starred items in FreshRSS; I have a script that extracts them and resets the stars (a rough sketch follows below the list).
* An email folder in Thunderbird. On Android I use "Share as" and send emails to myself.
* A bookmark folder in Brave. This gets synced across all my devices.
* Paper notes with timestamps for YouTube videos and podcasts. I like to listen while running, and this is a simple, stupid method. I would like some sort of "clicker" where I just press a button to tag the timestamp and video, but I never got around to building that.
* Paper notes with quotes from books and other PDFs I read. I have to find the relevant URLs manually. This is what feeds new books into my system.
* Google Keep snippets. I never de-Googled myself, and this is a pretty nice app.
* Photos from my phone. I go through those periodically.
Every few days I go through all of that and compile it into a list of URLs. Some of it is automated (FreshRSS, Google Keep), but it is a manual job.
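Not the poster's script, but here is one way the FreshRSS extract-and-reset step could look, assuming the Fever-compatible API is enabled; the endpoint path, credential scheme, and field names below are my assumptions from the Fever API spec, so check them against your instance:

```python
#!/usr/bin/env python3
"""Pull starred item URLs out of FreshRSS via its Fever-compatible API,
then unstar them so the queue resets for next time."""
import hashlib

import requests  # pip install requests

ENDPOINT = "https://freshrss.example.com/api/fever.php"      # assumed instance URL
API_KEY = hashlib.md5(b"username:api_password").hexdigest()  # Fever auth scheme

def fever(params: dict) -> dict:
    resp = requests.post(ENDPOINT, params={"api": "", **params},
                         data={"api_key": API_KEY}, timeout=30)
    resp.raise_for_status()
    return resp.json()

saved_ids = fever({"saved_item_ids": ""})["saved_item_ids"]  # comma-separated string
if saved_ids:
    # Note: the Fever spec caps with_ids at 50 items, so batch if you star more.
    for item in fever({"items": "", "with_ids": saved_ids})["items"]:
        print(item["url"])
        # Reset the star so the next run starts from a clean slate.
        fever({"mark": "item", "as": "unsaved", "id": item["id"]})
```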
Downloading and processing with the LLM takes about two days (I use a CPU, roughly 1000 articles weekly). I run it on my laptop over the weekend. The LLM generates keywords and links them with my existing notes. Larger documents get summaries with links.
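The poster doesn't say which local-LLM stack they use, but as a sketch of the keyword step, here is what it could look like with a model served through Ollama (model name, prompt, and vault path are all made up):

```python
#!/usr/bin/env python3
"""Annotate each Markdown note with a few keywords using a local LLM."""
from pathlib import Path

import ollama  # pip install ollama; assumes an Ollama server running locally

VAULT = Path("vault/inbox")  # made-up vault folder from the earlier sketch
MODEL = "llama3.1"           # hypothetical locally pulled model

for note in VAULT.glob("*.md"):
    full_text = note.read_text()
    excerpt = full_text[:4000]  # keep prompts small; CPU inference is slow
    reply = ollama.chat(model=MODEL, messages=[{
        "role": "user",
        "content": "Give 5 comma-separated topic keywords for this article:\n\n" + excerpt,
    }])
    keywords = reply["message"]["content"].strip()
    # Append keywords as a simple footer line that Obsidian search can pick up.
    note.write_text(full_text + f"\n\nKeywords: {keywords}\n")
```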
Software consultant. I like to group topics (like a new Kafka release) together and read them all at once, rather than in a slow trickle over two weeks. I also like to have my sources directly linked with my notes. An offline archive is just the easiest way.
I started working on my own system after the Google Reader shutdown in 2013. And I really hate ads.