Ask HN: Is the past disappearing on the web?
569 points by dusted 3 months ago | 316 comments
I have the habit of looking at the date of things I consume online; it gives me a sense of relevance and context, both when I'm looking for things that are "from now" but, more importantly, when I'm looking for things in a temporal context: for instance, programming for an old compiler, or finding out how to do something with an old piece of hardware or electronics.

I feel like I'm encountering more and more sites and articles where I can't seem to find the date. Google will return irrelevant results from today rather than relevant results from 10 years ago.

I feel it's getting worse, is it just me?




> I feel like I'm encountering more and more sites and articles where I can't seem to find the date.

It seems to me that it's become standard practice for marketing-type blogs on corporate websites to remove the date from their posts. I think it's because (from personal experience) the company will go through a burst of "blog productivity", create a load of content, but then not touch it for years; they don't want that content to look out of date or their website to look stagnant.

Removing the date from their posts, or any other content, hides how old it is and therefore obscures how active they are at creating new content.

Most companies try to use their blogs to attract new customers, a new customer may visit their website once or twice and will never see the blog again, it's not important that they do. They don't want it to look stale.

As a counter example, an interesting thread from yesterday [0] was about how CloudFlare use their blog not as a marketing tool but for technical content and attracting employees. They use their blog very regularly, and so keep the dates on it, showing how fresh it is.

0: https://news.ycombinator.com/item?id=30070422


There’s a popular concept in the content industry of “evergreen content”, meaning posts that are always relevant or useful regardless of when they were originally posted. The idea is that articles will have a longer “shelf life” if they are not tied to a particular date or recent event.

Of course, producing true evergreen content takes more effort than just removing the publish date but that’s one easy way to fake it.


Originally posted: <date>

Most recent update: <date>

Edit: it’s only a matter of time till an article stops getting updated. Of course I doubt people in the “content industry” (tell me that’s not a real thing) care what happens to anyone else after they stop updating their ‘evergreen’ page.


A past employer almost went with this approach, but with a technical audience it's hard to cover all the edge cases with RSS. People get cranky when they get pinged because an article was updated superficially, but you also want people notified when the article has basically been completely rewritten.

Safer to just publish a new article when there's a sufficient amount of content to update. Link the articles together somehow. Maybe add a disclaimer to older (and not updated) articles that they might no longer be valid

Also solves the problem others have mentioned where they don't trust the date on articles. If you have a solid previous article to link to, you're more likely to build trust that the new one really is new


I don't see how this is a problem with RSS. RSS (and the other feed formats) have a concept of an entry ID. If users don't want to see updates, they can show only items with a new entry ID.

To get even more nuanced: generally, a superficial update shouldn't change the "updated" time, which gives another level of control.
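The entry-ID/updated distinction above can be sketched as a reader-side policy. This is a minimal sketch with made-up data shapes, not any particular feed reader's implementation:

```python
# Sketch of a feed-reader policy: surface an item only when it is genuinely
# new (unseen entry ID) or its "updated" timestamp was actually bumped.
def items_to_show(entries, seen):
    """entries: list of dicts with 'id' and 'updated' (ISO date strings).
    seen: dict mapping entry id -> the 'updated' value we last showed."""
    fresh = []
    for e in entries:
        prev = seen.get(e["id"])
        if prev is None:
            fresh.append(e)            # new entry ID: always show
        elif e["updated"] > prev:
            fresh.append(e)            # same ID, bumped timestamp: a real revision
        # same ID, same 'updated': superficial edit, skip silently
        seen[e["id"]] = e["updated"]
    return fresh

seen = {"a": "2022-01-01"}
entries = [{"id": "a", "updated": "2022-01-01"},
           {"id": "b", "updated": "2022-01-02"}]
print([e["id"] for e in items_to_show(entries, seen)])  # ['b']
```

The policy only works if publishers respect the semantics: keep the ID stable across edits and bump `updated` only for substantive changes.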


I see some articles which always have a "last updated" of today. You can refresh a day later and watch it change.


people in the “content industry” (tell me that’s not a real thing)

I think if there is a thing of any sort, there's an industry for it.

And if it sounds dumb to an engineer type, it's more likely there's a lot of it.


It’s too general though, isn’t it? It would be like the “doing stuff” industry being a thing. There might be people who use the term, but that doesn’t mean the term points to any actual phenomenon in the real world.


It's a huge industry. There are services to buy writers, buy content, trade content, link content, guest post so on and so forth.


That was my reaction; it’s scary, right? Like, need an article on safely installing a light switch? Call the content industry. Managing stored PII and passwords? Content industry. Medical information? Content industry.


I see that a lot these days with "most recent update" that I'm pretty sure is a lie. Or maybe they changed some tiny thing to make it not a technical lie.


Marketing is all about lying. Making up fake reviews, customer testimonials.

Exaggerating (which is lying) about the product.

Explaining how good it will make you feel or, better yet, showing photos of very happy people using the products.

Acting = lying, photoshopped photos = lying, fictitious scenarios of use = lying.

Lying is required to write a story (when you read a fiction book, you want the lie about reality, for it is fun to imagine.)

This is a lie all parties engage in willingly.

But marketing types have a sole purpose. To lie, but make it look like truth.

So they don't care one iota about really updating an article or not.


Movie with CGI such as Star Wars = lying?


Fiction is lies only when you want to pass them off as the truth:

"That's right -- and when you get to the human world, the Nothing will cling to you. You'll be like a contagious disease that makes humans blind, so they can no longer distinguish between reality and illusion. Do you know what you and your kind are called there?"

"No," Atreyu whispered.

"Lies!" Gmork barked

― Michael Ende, The Neverending Story


There's in the same dialog this warning about advertising and politics:

“When it comes to controlling human beings there is no better instrument than lies. Because, you see, humans live by beliefs. And beliefs can be manipulated. The power to manipulate beliefs is the only thing that counts

... Who knows what use they’ll make of you? Maybe you’ll help them to persuade people to buy things they don’t need, or hate things they know nothing about, or hold beliefs that make them easy to handle, or doubt the truths that might save them.”


Needs a revisit; that seems appropriate and deep.


It is! Ende was raised under the esoteric Anthroposophy philosophy, and I suspect much of his work is dedicated to debunking its nonsensical beliefs. The second half of The Neverending Story is a philosophical treatise in itself (it hurts the book as an action story, but gives you a lot to ponder).

His children's books certainly have a deep second reading as adults.


All fiction is a lie, by its very definition.

After all, it claims the unreal as real. Fiction, lies.

The difference with fiction in books and movies, made for entertainment, is that the listener of the lie knows it is a lie, and listens for entertainment.

It is not a lie, pretending to be true.


I would argue that a lie involves deliberate deception, not just untruth. Fiction, then, is not a lie, as by definition it is something imagined or invented, and therefore not created to deceive.


I think the word "deceive" was coined precisely to mark that difference in intent.


Yep, the whole thing is made up!


and then someone figures out how to put a `<?php echo date('l, F jS, Y'); ?>` into the "Most recent update:" line ... voila! Perennially up-to-date!


I first heard about "evergreen content" from the local news industry. To add examples of what it looks like (to demonstrate the extra effort required as you mentioned), here are a couple recent examples from The Guardian (because there's no paywall):

1. "Top 10 novels inspired by Greek myths" (not tied to a recent event): https://www.theguardian.com/books/2022/jan/26/top-10-novels-...

2. "How to make pea and ham soup - recipe": https://www.theguardian.com/food/2022/jan/26/how-to-make-pea...

These are unlikely to pull lots of traffic compared with a more timely piece (e.g. an interview with an author for a recent book release, in the first case; the second may be an exception as it's a regular column). However, the usefulness of evergreen content for a local print publication was to fill the print edition during a slow news week (not enough timely news to fill the pages).


The ultimate piece of evergreen content I’ve heard of is Peter Stark’s “Frozen Alive” for Outside Magazine.[0] It’s a story about what hypothermia does to you, told as a second-person narrative. It was first published 25 years ago, was a hit in 1997, and since at least 2016 it has been one of Outside Magazine’s most-read articles every year.[1][2]

[0] https://www.outsideonline.com/2152131/freezing-death

[1] https://niemanstoryboard.org/stories/peter-stark-and-as-free...

[2] https://www.npr.org/2019/01/31/690468853/why-peter-starks-fr...


> They don't want it to look stale.

But that can also be extremely counterproductive when the content you publish has a natural expiry date - and in many areas of expertise (pretty much anything other than pure marketing talk) things change over time. A potential client seeing obsolete information might rightfully presume that the company is out of the loop or just plain unprofessional. If you have a date on the page, that's far less likely to happen.

COVID recommendations and measures are one such (rather extreme) example, where many big players endangered their credibility because they failed to properly mark outdated content.


That's why The Guardian is a pretty great source for news: they add a banner on all old articles saying it's X years old, so beware, the information might be stale.


It also serves as its own form of memory hole: if readers can't work out the date from context after a few searches, they are likely to just give up.


I must say that I put more trust in blog posts that put a notice like "Updated: 10/2021" at the top. It communicates to me that the topic was important at some time in the past and that someone is still taking care of the content and updating it from time to time.

Stability and old content can be good. Not everything is being updated and not everything needs to be updated. I'm all for putting dates on pages and blog posts :)


When they're dishonest anyway, saying it was updated today or recently via a script is just as easy.

They just lie. I was trying to find out why VLC stopped working for me a few months ago [0] and landed on a terrible site that suggested a lot of cargo-cult driver updating and whatnot. A few "comments" thanked the author for the comprehensive (and totally fake) information.

This kind of crap seems to be winning. My other recent quest, finding a university site about consciousness, psychedelia, Emerson and 60s music that I used to visit 20 years ago, was also unsuccessful.

[0] https://code.videolan.org/videolan/vlc/-/issues/25976



On that note, Reddit seems to be faking their last modified date metadata for SEO purposes. I’ve seen many old threads (where all the comments are 5 years old with no edits) show up in my Google search results with today’s date below the link. This coincided with and is likely related to the removal of automatic thread archiving 4 months ago.


This also signals an end game, where the last vestiges of genuine, customer-facing improvement are over.

With nothing left, one delves into short term growth numbers.

Because of this, some indexes may de-prioritize reddit in date-based searches and freshness rankings. And some end users will click, get annoyed more often, and stop clicking on reddit links.

They clearly have nothing left, and no other idea how to improve reddit. They have signaled they are well past peak growth.


I understand you. I've noticed this kind of behavior on some review sites. They put the latest month in the title, but who knows when they actually updated it?


Just today I was reading a blog post about some framework (KubeFlow) not being “ready for prime time yet”, and even though the article had great technical details, the fact that it lacked a date made it so much less valuable: it’s highly relevant whether this conclusion was drawn last month or two years ago.

I understand why this happens, but part of me really wishes we could stop this. Maybe there is some archive.org extension that can show me “this page first appeared on $date”.


If you found it through google, or search it on google, you can see when it was cached.


One can even envision an extension which fetches the caching dates automatically and shows them right beside the search results.


Open developer tools and type `document.lastModified` in the console. Though this is painful to do for every web page, it's useful in these instances. Maybe someone with experience developing Chrome extensions can build one that shows this, plus other useful page info, with a click on the toolbar.


"last modified" only works for static content, and not pages that are rendered dynamically or where the header is set based on cache settings.

Even for static content, the date may be wrong as the output may have been generated multiple times since it was first created.
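When a server does send a Last-Modified header, the HTTP date it carries is easy to parse with Python's standard library; whether the value reflects the actual content age is a separate question, as noted above. A small sketch:

```python
from email.utils import parsedate_to_datetime

# Parse an HTTP-date as found in a Last-Modified header (RFC 7231 format).
# The header value here is a made-up example, not fetched from a real site.
header = "Wed, 26 Jan 2022 10:00:00 GMT"
last_modified = parsedate_to_datetime(header)
print(last_modified.year, last_modified.month, last_modified.day)  # 2022 1 26
```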


Thanks for the info


Wordpress-based blogs (and maybe others, too) often have a timestamp hidden in an HTML meta tag even if the date is otherwise hidden from regular view.
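As a sketch of digging that hidden timestamp out with only Python's standard library: the `article:published_time` Open Graph property is common on WordPress themes, but by no means guaranteed, so treat this as a best-effort heuristic.

```python
from html.parser import HTMLParser

class PublishedTimeFinder(HTMLParser):
    """Collects the content of <meta property="article:published_time" ...>."""
    def __init__(self):
        super().__init__()
        self.published = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("property") == "article:published_time":
            self.published = a.get("content")

# A made-up page head, standing in for fetched HTML.
page = '<head><meta property="article:published_time" content="2021-10-05T08:00:00+00:00"></head>'
finder = PublishedTimeFinder()
finder.feed(page)
print(finder.published)  # 2021-10-05T08:00:00+00:00
```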


Why would companies who are trying to deceive not go to the trouble of changing the Last-Modified header? At any rate, lots of sites don't set Last-Modified anyway, which makes it January 1, 1970, GMT.


I have the same feeling. Actually, I've seen it in real life. A company I consulted for a few years ago created lots of "evergreen content." Now, they had a reason to do so, but...

It would be cool if archive.org (or any others) had an API that made it easy and quick to look up "first seen" timestamp for any given URL.


Here is the API you are looking for: -- https://archive.org/wayback/available?url=example.com&timest... -- this will show the oldest timestamp of an archive that the Wayback Machine has. The trick is to set the timestamp to be /really/ old; it will then show the first snapshot it has.
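A minimal sketch of using that endpoint from Python. The request-building part runs offline here; the JSON below is a hand-written sample of the response shape, not a live result:

```python
import json
from urllib.parse import urlencode

# Build a request against the Wayback "available" endpoint; passing a very
# old timestamp makes the "closest" snapshot the earliest one on record.
def wayback_query(url, timestamp="19900101"):
    return "https://archive.org/wayback/available?" + urlencode(
        {"url": url, "timestamp": timestamp})

# Shape of the JSON the endpoint returns (hand-written sample, not live data):
sample = json.loads('''{"archived_snapshots": {"closest":
    {"available": true, "timestamp": "19960101000000",
     "url": "http://web.archive.org/web/19960101000000/http://example.com"}}}''')
first_seen = sample["archived_snapshots"]["closest"]["timestamp"]
print(first_seen[:4])  # 1996
```

Note the caveat: this gives you when the Wayback Machine first crawled the URL, which bounds the page's age from above but says nothing about edits since then.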


I don't think you can tell whether the text of the page has changed since then without looking manually, though. I'm not sure how you'd solve this technically; it's probably more than just checking whether the HTML bytes are identical, since small changes probably happen all the time to "the same article".

Or it might be good enough, because this kind of content is rarely updated at the same url?


I wish Wayback had an indicator for how much the page has changed from one capture to the next. A simple count of the diffs should give you a good idea.


That is irrelevant, at least for me. :)


Why are you interested in when the URL was first there, but not if it has entirely different content than when it was first there?


Because I'm only interested in when Wayback first saw the page. Nothing magical. :)


Very nice, and thanks for that! I'm unable to find usage rules, though.


BBC television programmes used to give the year in Roman numerals in the copyright notice, I think. It has been suggested that this was to make it harder for people to notice how old the programme was. The same technique wouldn't work so well on a web site, because you couldn't do what the BBC did and leave the Roman numeral on screen for only about 0.5 s, so that most people don't have time to decipher it. Also, the dates at the end of the last century were particularly hard to read in Roman numerals. Here's an example. Time yourself:

© MCMXCVIII
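For the impatient, the subtractive Roman-numeral rules are easy to mechanize; a small decoder sketch:

```python
# Decoding the BBC-style copyright year: subtractive Roman numeral rules.
VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(s):
    total = 0
    for i, ch in enumerate(s):
        v = VALUES[ch]
        # A smaller value before a larger one is subtracted (IV = 4, CM = 900).
        if i + 1 < len(s) and VALUES[s[i + 1]] > v:
            total -= v
        else:
            total += v
    return total

print(roman_to_int("MCMXCVIII"))  # 1998
```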


> It has been suggested that this was to make it harder for people to notice how old the programme was

This is tempting but I don't know how much credence to give it.

It's more to do with a style that was taken on and kept. My father still writes the month in a date using a Roman numeral, e.g. 25/XII/21.

The date format has been around for many years. There weren't that many BBC programmes made, even over the course of many years. Trying to convince viewers that a programme was not "old" would have been difficult anyway: it might be in black and white, and the quality certainly would have looked dated.


I think the convention started with movies.


I got it in seconds because I used your context of it being a date "at the end of the last century": see the VIII and immediately guess "1998".


Common in movies, before home media or even broadcast was A Thing, and in some other places. I don't think that's the explanation.


I'm not sure how true this urban legend is but it is not too hard once you get the hang of it

(And dates aside, how the program ages is more important)


Like a true millennial I just entered it into Google and was surprised at the result.


I was disappointed by [MCMXCVIII]; [MCMXCVIII in decimal] does better, but I really hoped for more like the unit conversion panel.


1998


Just to add an interesting example of a middle ground, the fly.io blog post [0] currently top of HN [1] has the published date "hidden" at the bottom of the article. Their content is technical content that can go out of date but is also useful marketing content. The post is from August 2021 but has been posted to HN today.

0: https://fly.io/blog/run-ordinary-rails-apps-globally/

1: https://news.ycombinator.com/item?id=30083764


FWIW, I started hating putting dates on most of my stuff because I simply got sick of people incessantly asking "is this still up to date?" because the date happened to be from even just a few months ago much less a year ago.


Why not add, alongside the date, a note telling people how long the information will remain valid? It doesn't have to be a date; it can be "indefinitely" or, in the case of software, "until the next major version". This works best when you also revisit old content occasionally and update that note. I think you drew the wrong conclusion from those inquiries.


> I simply got sick of people incessantly asking "is this still up to date?"

I don't know what or where 'your stuff' is, but isn't that a fair question for a lot of topics? Writing about data structures or political protests might stay valid for decades, but a lot of technical writing about languages, platforms, even products and companies can age very quickly, and unlike news or culture might be of less interest to people in the future.

Many people find technical writing while searching Google for a solution to a problem. I love the articles that state a date and the version of whatever platform they're relevant for (even better when someone adds "this was written for version X.Y; things have changed in version X.Z, see the doc at..." etc.).


And if you remove the date, people will still ask the same question. Inevitably.


It's a valid question. You don't have to respond, but at least if the date is there, users themselves have a fighting chance of looking up what's changed since then.

It's entirely possible a tutorial made for software 3 months ago or something might have outdated information if there was a new, backwards incompatible release, or that a news article might be missing information that was revealed only a few days ago, etc.


Yeah, that can be a problem, I used to put a Created and an Updated date, but now I just have a Last updated date.


I notice this sometimes in academic papers too. I would like to see a date, but one is usually not presented. Is this for a similar reason?


That seems surprising - many citation formats involve the date?


The problem is the downloaded pages of the papers themselves, which often completely lack proper bibliographic data (or even header/footer info other than a page number). Compare this with a page from a corporate tech report, where each page might have the title and document number somewhere.


They used to come in magazines, with the date on the front, and authorship metadata on the index.

Nowadays they come as individual papers, but the publishers kept the format... because, why change anything when you have guaranteed profits?


In academic content, the lack of a date is the most pernicious!


Depending on the source, it might not be published yet, or the PDF you grabbed might be a pre-print while the 'officially' published article is behind a journal paywall.

Generally, I take the publication dates of the cited works as representative of the age of the paper.


I've heard of some marketing things updating the date periodically without changing the contents in order to trick people and/or search engines into thinking it's new.


These days they'll change the title.

An article with the title "The ten best pencils you can buy [January 2022]" written in 2017.


  The ten best pencils you can buy {{ datetime.date.today().strftime('%B %Y') }}


Normal SEO advice is to update your dates and copy so Google thinks it's "fresh".


one strange example for me --

AWS has tons of training content they make freely available to their partners and wider community.

In many cases, though, there are no dates! (YouTube shows the date of upload, but AWS' own training sites lack such markers. We have to guess based on the copyright year in the slide footers.)

It is clear that they invest heavily in creating new training content. In fact they essentially repeat the same content multiple times in many live tech talks and partnercasts etc. So there is no dearth of new content. It is also well known that they release new features very frequently -- so knowing how recent the content is helps a lot. But they still do this -- seemingly deliberately.

They recently overhauled their whole digital learning portal: renamed it AWS Skill Builder, rebuilt it on the Docebo LMS/CMS, changed a lot of things, but didn't make any effort to add a published date to any of the courses.

Frustrating.


Their API docs are incorrect half the time because they rarely update them. Finding documentation on their dynamic parameter system that all APIs use dumps you out on one page with every possible parameter domain.


At Snyk (https://snyk.io) we're actually working on a new blog process to address this. Essentially, a problem almost every technical blog has is: when we publish articles, are they ephemeral, or are they evergreen?

If you treat blog posts as ephemeral, it means you'll write them once, ensure they're accurate, then leave them there forever. Unfortunately, with technology stuff, that rarely works. Technologies change, libraries break, facts now might be different in two years, etc.

One of the things we're currently working on is tagging all of our technical content so that once a year it pops up in a review board somewhere and someone reviews it for accuracy, updates it if necessary, etc.

This way, technical stuff will still be useful to readers (hopefully) a couple of years from now.
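The yearly-review idea could be as simple as a scheduled job over post metadata; a sketch with invented field names (this is not Snyk's actual system):

```python
from datetime import date, timedelta

# Flag posts whose last review is more than a year old, so they land on
# the review board's queue. Field names ('slug', 'last_reviewed') are made up.
def due_for_review(posts, today, max_age_days=365):
    cutoff = today - timedelta(days=max_age_days)
    return [p["slug"] for p in posts if p["last_reviewed"] < cutoff]

posts = [
    {"slug": "kubeflow-intro", "last_reviewed": date(2020, 3, 1)},
    {"slug": "npm-audit-guide", "last_reviewed": date(2021, 11, 15)},
]
print(due_for_review(posts, today=date(2022, 1, 27)))  # ['kubeflow-intro']
```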


That sounds more like something you would do on a wiki than a blog...


This is absolutely infuriating when it's meant to be informative, especially about something in a fast-changing landscape, like how-tos on Kubernetes, for instance.


> like how-tos on Kubernetes

Honestly, although I've occasionally derived benefit from these, I think I'm reaching a tipping point where the plethora of how-to articles, with their ads and newsletter pop-ups and everything, feels less productive than plain old documentation and taking the time to fundamentally understand the tech so that I no longer need the how-to. Mainly because I trust Kubernetes to keep their documentation up to date, but who knows how current a random how-to article is, never mind the marketing bloat they're usually polluted with.


I agree. The date itself is irrelevant if the content stands on its own (if it mentions the version being used, for example, or spells out its arguments explicitly). If bad SEO articles stop being read (independently of date), better content, with proper versions listed and real argumentation (like official documentation), could take their place.


A very easy solution, which I use when searching for something with an expiry date, is to look for the date first; if it's not there, I immediately leave the site without bothering to read anything. If this is regular practice and the domain is frequently at the top, after a few times I just stop visiting it altogether.


It's a sad state of affairs that marketing needs drive Google. I can hear the "duh" response in people's minds (yes, we know how Google gets its revenue), but this is the index to the world's data. The world's data portal. It has serious repercussions that it's tuned to suit the needs of online sales and marketing sensibilities.


> the company will go though a burst of "blog productivity"

One of the most insightful comments I ever read on HN pointed out that marketing folks are good at selling things, period, and that includes selling things internally. So when nascent companies are wondering why the product doesn't sell itself, the first thing they do is hire a Director of Marketing. That person promises deliverables from day 1, and what's more deliverable/visible than a blog? Then they leave -- in my experience, the typical marketing exec's tenure at a startup is about a year -- and no one else feels like putting in the work. Also, by that time, most people have seen that the blog never really drove engagement in the first place.


Increasingly I feel like this is one of those pieces of metadata that we should be moving out of the page.

I would suggest moving it into the browser (i.e. read a meta tag or header) but the obvious problem is that they’ll just be forged and it would almost immediately become pointless.

Search engines could help here. If Google were to provide a last cached date (or a date of the last significant change) in the search result that would be far more useful. They certainly have this information from crawling, and it would be difficult to forge as constant substantial changes to game the system would be both expensive for the author, and harmful to the page’s ranking.


> I feel like I'm encountering more and more sites and articles where I can't seem to find the date.

Moreover, since static pages are no longer the norm, you can't even retrieve the date the file was created or modified; the server always returns a timestamp from when the page was rendered for the browser.

Getting EXIF data from images might provide a clue, but the image creation/edits often do not correlate with the text...

If anyone knows how to extract such date info, it'd be helpful


I don't understand why this isn't an SEO penalty. I feel like early on Google actively looked for things like this when ranking sites.


SEO wise most blogs will expose metadata that shows publish date and last update date.


I've experienced some major frustration due to Azure's documentation doing exactly this. Though it may not be limited to Azure. There's just a date on every document and usually it's a month or two old. Many of the examples I've found via Google don't even compile or aren't relevant any more.

I hope this is a trend that can be reversed.


> It seems to me that its become standard practice on marketing type blogs for corporate websites to remove the date from their posts.

Worse still, I’ve encountered sites that automatically update their edit dates to be current as a way to optimize SEO. I’ve found articles with decades old information claiming to have been written mere hours or days prior.


Many news sites seem to set the content date to today so they always appear in searches, even though the content is years old.


I wonder how much of this is misguided SEO, thinking that if they leave the date off Google will assume it's fresh content and always surface it in search results, and not realize that Google knows exactly when they first crawled a page.


That is exactly it. People are using recency as a proxy for quality.


Even more true for open source software. If it isn't recently updated, people assume the project is dead.


To be fair, bit-rot is a real thing for software, because the world keeps changing. In most cases it takes effort and upkeep to keep everything working.

Annoying, for sure, but this is the reality of software and technology as the landscape continues to evolve over time.


Anecdote: One of the best friends of my oldest daughter recently moved to a new rented flat. When I saw the address for a playdate, I recognised it and said to my daughter, "I think you went to a birthday party at this address when you were 3-5 years old, when another of your friends must have lived here." To corroborate this, I looked up the calendar on my phone, where I'm pretty organised about putting dates and locations of events. At that point, however, I discovered that Google Calendar only keeps 2 years of history!

So modern technology is literally erasing our pasts. Not just calendar entries, but messaging systems (people used to keep handwritten letters for decades), and possibly even photos (if we're not careful about preserving them).

Edit: See my clarification of the 2 years in comment https://news.ycombinator.com/item?id=30084620 below. I still think the point remains - we do not own or value our digital data in the same way as physical objects, and there is a much heightened risk of that data disappearing as a result, either by the owners of the platforms the data is stored on archiving the data or by us not valuing it enough to preserve exports and backups through long periods of time.


> modern technology is literally erasing our pasts

It's not modern technology that's erasing our pasts, but cloud-based services owned by someone else. That's why I've always been keen on local software - for instance, I'm currently upgrading my stale desktop mindmap software (http://innovationgear.com/mind-mapping-software/) ;)


Modern technology erases it too. I am already seeing the drives I used in college to store my raw camera files and LR catalogs degrade to the point of the data being unreadable, as I try this month to move them to fresh storage. For some of my oldest photos, I only have whatever processed JPGs I happened to upload to a cloud provider that stuck around this long. The silver-based negatives from my film work of that time, though, are still fine.


Might be a long shot, but GRC's SpinRite has saved a couple of drives of mine. It won't work if there's a problem with the physical connection to the drive, but it can fix almost everything else (at least momentarily), and it actually grants a nice speed boost in the process for drives that have gotten too messy.


Aside from all the flashy, handwavy marketing "technospeak", what SpinRite actually does under the hood is re-read each bit from the disk thousands of times and see whether it gets more 0s or 1s.

It works well when a drive is past the point where its internal error correction code no longer works, but you just want to beat the right answer out of it with a baseball bat.

These days professional disk recovery by a lab costs under $1000, and labs recommend not using brute-force tools like SpinRite, because they can exacerbate the problem before the drive gets into their hands.
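The re-read-and-vote idea described above amounts to majority voting over repeated noisy reads; a toy model for illustration, not SpinRite's actual internals:

```python
# Toy model of majority-vote recovery: re-read each bit position many times
# and keep whichever value shows up most often across the read attempts.
def majority_vote(reads):
    """reads: list of equal-length bit strings from repeated read attempts."""
    recovered = []
    for bits in zip(*reads):               # iterate column by column
        ones = bits.count("1")
        recovered.append("1" if ones > len(bits) / 2 else "0")
    return "".join(recovered)

# Five noisy reads of the same four bits; the true value wins each column.
reads = ["1011", "1011", "1001", "1111", "1011"]
print(majority_vote(reads))  # 1011
```

This only helps when read errors are independent and relatively rare; a systematically failing head or connection defeats it, which matches the caveat above.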


I just checked, and this isn't true: I have Google Calendar entries from over 10 years ago. Where did you get the idea that they delete anything older than 2 years?


There is a setting in one's Google account to keep one's data forever or automatically delete it after a certain time period, perhaps they changed it?

(side note: the title of this post is a little sensational imo. the past has always "disappeared". Don't forget that you will die some day and eventually there will be no trace of your existence)


Where is this setting?


In the "Activity Controls"[0] section of one's google account, there's an option for "auto-delete".

Now that I'm looking for it, I don't see Calendar listed as one of the products controlled here, but it might be grouped under GMail or something.

[0] https://myactivity.google.com/activitycontrols


Uh, my gmail stops at ~2009 or so...

I lost a bunch of emails from my late mother, which I went to look for, and found out that Gmail had chopped off the history.


Yikes! Just checked and I have emails going back to 2000 (imported, account is from 2007) in my Gmail. It is a workspace account though.


> Not just calendar entries, but messaging systems

Logged in to my yahoo account that I haven't used in more than 10 years to look at some convos I've had with a good friend. There was nothing there. Yahoo forums are full of people that want to recover messages from their deceased loved ones, and won't be able to.

I find it fascinating that the only traces remaining of our lives will probably be archived on some NSA server somewhere... Flashbacks of that X-Files episode with the underground bunkers full of DNA.


A client of mine had his email account accessed by somebody in Nigeria (could see date and location of last login). That person then deleted everything after accessing the account.


That's why manual backing up of your chatlogs is such an amazing (and often cringey) thing. Just don't back them up anywhere online.

I've got conversations from 15+ years ago with girls I liked, friends I loved and everything in between. It's hard to watch, but if I ever feel nostalgic I can bring them up in a flick.

I'm not even 30 yet.


Not to start a war, but iCal has saved everything since I first started using an iPhone in 2013.

edit: Make that 2011


There used to be an old restriction where the (official) mobile apps would only sync 1 year into the past. I don't experience this anymore, so it appears to have been lifted. Maybe you're using some legacy client?


Interesting anecdote! It facilitates rewriting the past, influencing the present, and controlling the future. We need to be mindful of where we invest our records.


I've moved a bunch of times in my life. Each time, I move a box full of junk that's been sitting there since the last move. I never open it or go through it, and I always forget about it until the next move. Certainly there must be some attachment to the junk, but I never think about it until I move. I'm not sure that my digital junk is any different.



Google's search quality has taken a serious nosedive in the last couple of years - the last 6 months in particular.

I think they implemented some new form of search term widening which is far too strong, so the results you want are often buried among pages and pages of results for the general category of things that you searched for, rather than close matches for your keywords. Combined with the recency bias that other people have talked about, you end up with much less useful results for precise searching.

This coincides with a large increase in the number of surveys that my partner has been getting through the Google Rewards program that ask whether or not a recently used search term gave relevant results. Obviously that's just anecdotal, but it does feel like there are substantial changes in the algorithm, and not necessarily for the better.


This has been going on for a decade in my region, but it seems recently other people on HN have been hit too.

With that said, welcome to the club. I feel sorry for you all who have to go through this now; I've had a decade to adapt.


I've been seeing this too. So after reading your comment and the one above, I did a search to see what operators are still supported by Google and stumbled across this page: https://ahrefs.com/blog/google-advanced-search-operators/

It's nice to see that AND and OR still work, but from a few tests, it seems most of us should be prefacing our searches with "allintext:" (without the quotation marks). It works well to dramatically improve my search results. I think I'll write myself a Google search page that automatically prepends it to my searches.
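As a sketch of that idea, here's a hypothetical helper that prepends the operator to a standard Google query URL (the helper name and approach are mine, not any official API; only the well-known q parameter is assumed):

```python
from urllib.parse import quote_plus

def allintext_url(query: str) -> str:
    """Build a Google search URL with allintext: prepended, so every
    term must appear in the body text of the results."""
    return "https://www.google.com/search?q=" + quote_plus("allintext:" + query)
```

A tiny redirect page or bookmarklet could call something like this on whatever you type.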


From that article:

> “search term”: Force an exact-match search.

This is no longer working in my experience, at least for full phrases (for example, error messages). Same in DuckDuckGo.


Agreed. Quotation mark functionality in Google seems to be sadly dysfunctional.

I guess I buried the lede. I probably should have just focused on the allintext: operator in my comment. That's the operator that makes a huge difference in my searches and that I wanted to share.


I got the message loud and clear :-)

Now I just wonder if that is Kagi's magic trick

- and how long it will take before Google "repairs away" this feature too :-(


Wow!

I have been a really advanced Google user and somehow I have missed that particular one!

Edit:

1. Reading through it, I am fairly certain some of these have been broken too. define: in particular is one that I used to use but was sure was broken. Today it worked again though.

2. At least double quotes still don't work (or maybe they include text from links pointing to the site or something)

3. How did you find this? I've been looking for something like this recently :-)


I use duckduckstart.com and I don't recall adding a bang to the search, so it was probably in the startpage.com results when I searched for "Google search operators."


Most of these aren't working anymore... extremely frustrating


They are optimizing for a certain type of search. It seems to me that type of search favors recency and the broadest results possible.

While it seems to be getting worse to me, for most people this is probably what they want. Searching for "Javascript .splice(,) ES5" and "Vermont State Flag" are two pretty different types of searches.

It's actually not that hard to get the results you want for a more specific search if you learn the boolean operators and use them effectively.


Just the other day I was trying to find general information about how a cuckoo clock works, and I had to wade through multiple pages of companies selling cuckoo clocks, companies selling clocks, clock oil, watch sellers... I have uBlock Origin, but the "non-ad but still loaded with commercial results" experience was terrible.

Guess I need to add a blacklist extension so I can start paring down the BS responses.


There are recipes I've been cooking for 15 years that are on forum posts. I've searched for them all this time with google, and now they have disappeared.

I can find them by manually navigating around the archives of the forums, but to most modern search engines they do not exist.

We are losing a lot with the direction that these search engines are going in.


I've been messing with other search engines a lot: Kagi and Marginalia. Nothing too good or bad so far. I really want to be done with Google.

DDG seems like a clone of Google, so it's not much better.


> Google will return irrelevant results from today rather than relevant results from 10 years ago.

Tip: leave Google behind for now.

That site has been very useful over the last few years, but only in the same way as my very cheap electric saw: because I didn't have access to anything better.

For someone who has tried good tools like Festo, Milwaukee, Hitachi, or old Google, it is just a painful reminder of the past and how good life used to be.

It works but hasn't sparked joy for close to a decade.

After Kagi and Marginalia came into my life, my life improved significantly.

Note: I'm not saying Googlers are evil or dumb now but I will point out that engineers there have incentives stacked against them.


It’s very clear Google has censorship and biases. If you search for some subjects you will get vague unrelated results, but legitimate results from say Bing. An example of this is anything related to the sex industry.

I find this with coronavirus as well.


All search engines have some form of implicit bias. An unbiased search engine would be beyond useless at actually finding anything but extremely well specified queries. The trick is to tune the bias to favor results that are interesting and relevant.

This is also why having just one big search engine is a bad idea.


At the very least, though, I would expect my search engine to rank results higher if they contain the words I was looking for, without my having to wrap every single word in +"...". This seems to be something that few search engines are capable of. Is this very basic thing labeled as "advanced" now?


That's how most search algos work out of the box. Google used to work that way. I agree; it's getting worse. It's not obvious if this is being done purposely but the result is that Google is basically unusable for a lot of searches now.


Google's moved to vector search, which doesn't lend itself to keyword prioritization in the same way classic keyword search does. I think that's 90% of the problem.


So finally I have a name for the utter madness that hit Google a decade ago?

Is vector search the thing that makes it ignore my search query and search for something else?

Also, is this the problem Bing has?

And is it much, much cheaper, such that they chose to use something so utterly, ridiculously broken?

Edit: I'm obviously exaggerating heavily here but this has cost me so much time and frustration.

I agree with others: if pages exist that contain the exact matches, why not return them first?


Not bias, deliberate disappearance of specific categories of ideas. Independent blogs and forums not owned by large silicon valley companies have disappeared, for example. So have all discussions of current events, except for "mainstream" news publications.

Whether this is because of a short term greed motivation to maximize adsense yield or a larger conspiracy of global information control isn't clear yet. But it amounts to the same in practice.


Can you give examples of search queries, and blogs and forums that you would expect to show up from those queries?


Here’s an example on Google search:

‘site:4chan.org wuhan’ returns no results; ‘site:4chan.org wuhan institute of virology’ returns 1,710 results.


Examples?


I don't know what you're trying to tell us by comparing software tools with hardware tools. Quality hardware tools from the last century are still doing what they were designed for, and in a professional context they even work better than most current tools, as the old hardware was designed to last.


Point is last fall I had both

- inferior power tools (because I couldn't afford good ones last I bought)

- inferior search tools (because Google has broken itself)

The similarities are that I was stuck with bad tooling and when you know how big the difference is it hurts.

For people who never took advantage of how good Google used to be or who have never used good tools it probably wouldn't hurt as much.


What’s better than Google?



I just tried "python convert datetime to unix timestamp" on Marginalia and it came up with nothing. As for Kagi, I signed up for the beta and I'm on a waitlist.

Still nothing better than Google for me.
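Incidentally, the answer that query is hunting for is a stdlib one-liner in Python:

```python
from datetime import datetime, timezone

# datetime.timestamp() returns seconds since the Unix epoch; attach an
# explicit timezone to avoid surprises from naive local-time datetimes.
dt = datetime(2022, 1, 1, tzinfo=timezone.utc)
unix_ts = int(dt.timestamp())
print(unix_ts)  # 1640995200
```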


Read the how-to on Marginalia. Edit: it is right beneath the search box and says: [...] A concrete example: How do I cook steak? will probably not be helpful, Steak Recipe will give better results (just Steak is pretty good too).

After that try: python datetime timestamp

or: python convert time

or something similar. Edit: see also Marginalia's answer next to me.

As I wrote earlier: Marginalia is mostly for fun (although lately I have gotten better results for at least one simple technical query).

Edit: if a search engine respects your query you can always broaden your query. If a search engine ignores your query you cannot do anything.


Is my memory failing me, or is that much closer to how search used to work in the early 00s? I have vague memories of being flabbergasted by someone even trying something like "how to XYZ?" in a search engine.


Yeah, you used to have to think of specific, unique keywords that would appear in the page you were searching for. A lot of people couldn't break the habit of asking questions, so search engines just started accepting it and trying to optimize for it.


> A lot of people couldn't break the habit of asking questions, so search engines just started accepting it and trying to optimize for it.

Which would have been totally fine if they hadn't nerfed it for those of us who knew how to use it.


No, you are right.

This is the way anyone with a clue would approach search until Google broke it: search for words or phrases within the page.


I always search as such:

"perfect steak recipe"

"perfect grilled chicken"

and such... typically resulting in the best results.

"R53 mini cooper fuel filter"

reduce the words to the core of the question, and get good results.


I distinctly remember phrasing my searches as questions. I was young, and this was back pre-Google, so I was using Ask Jeeves. Naturally, if you're asking someone something you would phrase it as a question, so in my mind it made sense to ask it that way.


Wasn't that the whole premise of Ask Jeeves? And then much later, Wolfram Alpha awed with that same 'plain language' functionality?


I got access to the Kagi beta a week or so back, and have been daily-driving it as my default search engine ever since.

I've only ever "fallen back" to Google for a small handful of searches since then, and only out of curiosity for what it might look like in comparison to the Kagi results.

Although it's only been a small amount of time, I'm sticking with Kagi for now because the search results are at least on par with Google's, and significantly less encumbered by Google's SERP dark patterns.


Yeah it's not quite the sort of query it's good at. It's more geared toward discovery than answering questions. You can answer that particular question if you tune the query a bit though.

https://search.marginalia.nu/search?query=python+datetime+un...

Has this as the first result, though:

http://www.logophile.org/blog/2010/08/15/python-datetime-con...


You're right. It works better when I think "hard keyword search" than what I'm used to with DDG and Google. I tried "python super classmethod" and got here: https://legacy.python.org/dev/peps/pep-3135/ which perfectly answered my question.


Isn't it beautiful that something so amazing can run on a standard tower PC in a living room in Sweden?

Holding up against a HN hug of death last month, providing an absolutely refreshing search experience!


I just tried on Kagi, this was the result: https://imgur.com/a/Ds1YTf5

I've been using Kagi for a little while now, and honestly, it's been pretty great.


Kagi for serious stuff (work, health). I've no idea how, since they build mostly on Google and Bing, but something they do makes an enormous difference.

Marginalia for fun stuff.


> Our searching includes anonymized requests to traditional search indexes like Google and Bing as well as vertical sources like Wikipedia and DeepL or other APIs. We also have our own non-commercial index (Teclis), news index (TinyGem), and an AI for instant answers.

Kagi seems to use both Google and bing results and will cost a minimum of $10 a month. Seems like the worst of all worlds to me. It’s a good option I suppose for those who want to pay.


That's the magic part, it uses Bing and Google but manages to get good results!

I've recently seen queries answered on Kagi that returns utter rubbish when I paste the exact same queries into Google or DDG.


I don’t see what’s magical about taking results from other sites lol. What happens if bing and Google block them?

I’m skeptical of their privacy policy as well, given that you need an account and have to pay, it’s unlikely they don’t know your searches.


They are paying for API access, not scraping results. Why would they be blocked?


Where did you see that they’re paying for API access? Their FAQ says that they’re making anonymized requests but don’t go into more details than that.

That being said, even if they did pay if they have any traction at all they’d still be blocked. Why would their competitors help them?

In any case I doubt it will take off, people who are poor aren’t going to pay 10 bucks a month just to search when there are dozens of free competitors.

Happy to be proven wrong.


If they were just illegitimately scraping results, they could not provide any kind of consistency in quality, as they would be getting blocked on and off. And the small search engines like DDG, Startpage, Ecosia, etc. have been making these kinds of deals for more than a decade, with more users than Kagi is likely to ever have, without having their contracts terminated due to "too much traction".


Late but I have texted a number of times with the Kagi team and they do pay for API access.


>That's the magic part, it uses Bing and Google

That's the part I want to avoid. Build your own crawlers and stop relying on Google. Google already has enough control over the internet, I do not want a search engine that's just downstream of them.


Ok. I see. But I cannot do that.

I can however use a search engine that doesn't mock me and second guess me all the time and that tries to anonymize me.

But I agree very much with you.

If someone launches a search engine that works as well as Kagi and doesn't use either Google or Bing, I'll happily pay a premium for that on top of what I have already said I am willing to pay for Kagi.

Edit: In fact I already support a search engine with an independent index.


They are still a startup. If they can get the users/money, eventually depending on Google will be a risk/too expensive, and they'll start crawling themselves.


Define better first.

If by better you mean supporting a competitive market that doesn't strengthen the monopoly Google has on search and their results which, as years go by, are augmented for their benefit but not yours, then anything is better than Google.

If by better you mean increased chances that the first few results will contain the answer you need, then Google is still king. That being said, when it comes to tech, with most answers found in open git issues or stackoverflow, i find brave search sufficient. So much so that i haven't used any of Google's services in over a year.

The real question is, what are you willing to compromise? There's an in-between for the two extremes above and the choice is subjective.

An exciting alternative is https://neeva.com/ I'll happily pay money for a privacy oriented search engine as i do for email, cloud storage and NextCloud. We shouldn't expect things to be given to us for free. I work, they work, we can do a fair trade with some iou derivatives.


Neeva signups are bizarrely region locked. I get that there may be some local results for some kinds of searches, but that's easily handled with a banner in results for geo searches that states "local results aren't available in your area yet", rather than locking much of the world out of the service.


> If by better you mean increased chances that the first few results will contain the answer you need, then Google is still king.

Hasn't been true since at least December. Kagi is better now by almost an order of magnitude.

Of course my Google is not your Google. Still, I have experimented so much with Google over so many years - logged in, logged out, from multiple addresses - that I am confident in saying it hasn't been itself for a decade, and finally now we have a better alternative.


> Of course my Google is not your Google.

And that's the main problem with the new Google.


> neeva

A website that requires me to sign up to see what their premium account may cost ... this sounds pretty scammy to me.


Also not available in my region.

Straight from the playbook of big film ;-)


I was excited for Neeva at first. But then I saw the premium includes some kind of NFT trash attached to it. I'll take my money elsewhere. I signed up for the Kagi beta and am hoping it will be better.


A warning for those who remove dates from their content and care about invalidating bad patents and rejecting bad patent applications: if there's no date, a patent examiner likely can't use it as prior art.

I used to work as a patent examiner and I was disappointed when I found web content describing an element of a patent application I was working on, but there was no date that could be used to be certain the document was available before the priority date of the application.

You can use the Wayback Machine and similar archivers to get a date, but frequently the archivers didn't capture the page or didn't capture it in time in my experience (even if it likely was published in time, I can't establish that legally).

Before I quit, I spent some time saving a ton of webpages in one of the areas I was working on (water heaters) just because I wasn't sure how long I'd be at the USPTO and I could be certain of the date given that I myself archived the documents. It was a long-term investment, but could have been quite useful if (for example) a company tries to patent something they previously sold a long time ago and forgot about. The Wayback Machine often had spotty coverage of corporate webpages so I couldn't see all their products at a particular time.


Check the source HTML; sometimes CMSes put dates there.


I've certainly noticed Google ignoring results more than 3-4 years old when something barely related, but more recent, matches. I call it the "recency trap". Because of that, I've found myself more and more systematically setting the date range of the desired results (which isn't always useful, as many sites reply with misleading metadata).

To give a recent example (and given the current events): a number of Russian officials blamed the sinking of the Kursk on NATO (either on purpose or by accident), and I recall such statements from back then, but via Google it's been almost impossible to find a primary source. Most results were about the 2021 statements from a retired admiral who was involved back then and still insists on that; the relevant content from 2000/2001 was certainly tough to find.

Part of it is because this is 2000/2001 and many links rotted away, another part because the existing links usually don't respect basic SEO, and finally because Google, in my experience, very strongly prioritizes now/recent content.


At least Google still allows you to set a specific time window for your search results. Given their strong recency bias, this is often the only way to find older resources.
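That time window can also be encoded directly in the search URL; a sketch, assuming Google's long-standing but undocumented tbs parameter (which could change without notice):

```python
from urllib.parse import quote_plus

def dated_search_url(query: str, start: str, end: str) -> str:
    """Google search URL restricted to a custom date range.
    Dates are in M/D/YYYY form, carried in the tbs parameter."""
    tbs = f"cdr:1,cd_min:{start},cd_max:{end}"
    return ("https://www.google.com/search?q=" + quote_plus(query)
            + "&tbs=" + quote_plus(tbs))
```

Handy for bookmarking a "pre-2010 web only" search without clicking through the Tools menu each time.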


I'm glad the date range tool still exists, I use it now and then and it's generally good. What's bad is that it's not available on the mobile site. But what's ugly is how unreliable and cluttered the results are. For example:

"Dec 21, 2001 — House Democrats plan to vote Wednesday to impeach President Trump for his role in inciting the deadly Capitol attack as President-elect Joe Biden prepares ..."

https://www.google.com/search?q=president+Trump&biw=980&bih=... (be sure to switch to desktop mode if on mobile)


Google has a pretty big recency bias. From a content-producer perspective it makes sense: if you put up new content, you want to see traffic to that content. From a consumer perspective it's questionable at best. Given the Lindy effect, odds are the quality of old content is higher than average.

I also do kinda think we should be thinking more about what legacy we leave than we presently do. HTML has some serious problems with that regard, especially in terms of link rot, and especially now that we treat it as a way to build platforms. Archive.org is great and all, but is it enough? How will SPAs fare when the backend server is down in 30 years? How much value will be lost?


> HTML has some serious problems with that regard

I think the problem is not with HTML but with HTTP as a location-addressed protocol. For future-proofing and DDoS/censorship mitigation, content-addressed storage (DAT/IPFS/Torrent) makes sense. I would love to seed my favorite blogs if I were given the possibility to do so; in this sense, a web browser based on IPNS would be rad. Too bad the only one I know of is bundled with JS (instead of operating a paradigm shift) and produced by an adware company.

> How will SPAs fare when the backend server is down in 30 years? How much value will be lost?

You don't even need to wait 30 years for SPAs to be broken. By the next API update they may stop working, and subtle changes in browser sandboxing could kill them just as quick.


> For future-proofing, and DDOS/censorship mitigation, content-addressed storage (DAT/IPFS/Torrent) makes sense.

Well great, now you can't even change the content one tiny bit (not even definitively benign changes like fixing a typo or a bug in your CSS or whatever) without invalidating all links pointing to it.

Okay, so as long as anybody still has that version cached and is seeding it, at least the link still works (though over time that number might drop to zero and the link break after all). But now everybody who comes in gets the old version and might not even know that an update exists.

While I acknowledge that the silent updates possible with HTTP for an URL's contents can be a curse as much as they are a blessing, I'm not sure if "absolutely no updates ever" are the right answer, either.


Very good point, though you're missing a layer of indirection in your thinking. I should have made my point more explicit!

Usually you would subscribe to a cryptographic pointer (the publisher's public key) whose value is stored in a DHT. Then the pointer can be updated to point to new revisions of the content/website. IPNS is a famous implementation of that for the IPFS protocol. So as a client you can seed a specific revision of the content, or all of them, or just seed the latest version.
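That indirection can be sketched with a toy in-memory store (illustrative only; real IPFS/IPNS uses multihash CIDs and signed records in a DHT, not plain dicts):

```python
import hashlib

store: dict[str, bytes] = {}   # immutable, content-addressed blobs
names: dict[str, str] = {}     # mutable pointers, IPNS-style

def put(data: bytes) -> str:
    """Store a blob under the hash of its own bytes: same content,
    same address; any change to the content changes the address."""
    address = hashlib.sha256(data).hexdigest()
    store[address] = data
    return address

def publish(name: str, data: bytes) -> None:
    """Repoint a stable name at a new immutable revision."""
    names[name] = put(data)

def resolve(name: str) -> bytes:
    """Follow the pointer to the latest revision."""
    return store[names[name]]
```

Old revisions stay reachable by their content address, while subscribers of the name always get the latest one.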


When I think about Google search, I think of a search engine, not a content producer. Probably they're trying to become the latter while starting to suck at the former.


Even more frustrating than this are the sites which automatically append the current year to their article titles.

"Best CMS frameworks (2022)", for example, and yet the content is out of date.


A particularly egregious example of that are the product review sites - "Best wireless headphones (2022)" and most of the products listed are no longer available.


It drives me batty. It's another instance of solving the problem by adding "Reddit" to the end of the search string. Of course, this means Reddit will shortly be swamped with AI spam bots driving crap to the top of forums. Presumably that's already an issue.


It's not swarmed with bots yet, but I've found reddit to be less reliable in the last year with quite a few suspicious comments popping up for various products.

Reddit is still better but I worry in another year that will no longer be the case.


I find a lot of those 'review sites' are just lightweight b.s. marketing sites designed to generate product referral fees by providing a convenient link to Amazon or wherever else, rather than a real attempt at providing a good review.


The past is always disappearing. I know of no data recording device that we could use to store anything for more than a million years except DNA (for fun check out the Arch Mission Foundation that attempts to use DNA for backups of human knowledge) [0] or 5D nanostructured glass [1].

This post reminded me of a great Kurzgesagt video [2] that went briefly into how much of the past life on earth we have no information on and will never be able to know. Incidentally it took me a few seconds to find that video. Before the internet if I was trying to lookup a clip I had seen a month ago on TV I don't know even where I would have begun searching...

However I think we are getting increasingly better at preserving information and making it easy to access with tools like the internet archive, and cloud backups for your photos. This is despite the sheer quantity of data (such as the number of photos you take) growing at an exponential rate. Would you have been able to easily find instructions for a machine that was decades old before the internet?

So the past is disappearing but possibly at a decreasing rate.

0: https://en.wikipedia.org/wiki/Arch_Mission_Foundation

1: https://en.wikipedia.org/wiki/5D_optical_data_storage

2: https://www.youtube.com/watch?v=xaQJbozY_Is


I take a lot of notes. The links I place in them often 404 just a few years later. I'm grateful for wayback machine. But it doesn't capture everything. Sometimes I'm just left with a link. And some links don't reveal any information about the source (e.g. a youtube link gives no info about what a removed video was). I've started adding a little more plain text to my notes to help protect against this (e.g. source: "3m20s into Bob's Rails tutorial on Webpacker") just in case it disappears I'll have something to latch on to.


I pay for pinboard.in for this reason.


Hadn't come across this before. At first glance, I thought the quotes were satirical, but no, they're very real:

> "You don't really know that the cool project you signed up for is in a skyscraper in Silicon Valley, or like me: one dude in his underpants somewhere who has five windows open to terminal servers"

- Economist Magazine [1]

[1] https://www.economist.com/babbage/2011/04/04/stick-a-pin-in-...


It is developed and run(?) by our own dear idlewords.


It's not you; the search engines have become weaponised.

The most recent example: Sue Gray, a top British civil servant. Until a few months ago, her career controversies were visible in search results when searching for her. Since it was announced she would carry out an investigation into parties at 10 Downing Street, it's become impossible to see her career controversies in search results.

Eli Pariser also highlighted changes going on with Google back in 2011, as you can see from this talk, but I think that was just the start. https://www.ted.com/talks/eli_pariser_beware_online_filter_b...

IMO, the search engines have now got a lot worse in what they show in search results, as in the Sue Gray example above.

Society is becoming like Fahrenheit 451 https://en.wikipedia.org/wiki/Fahrenheit_451


I think there's a recency bias for Google search results, that cuts both ways. I searched "Eric Clapton" and "Neil Young" and both pages have a mix of expected results (Wikipedia, social media, official website, Spotify) and news/tabloid/opinion pieces about their recent covid takes. I've certainly had searches in the past where I couldn't find the Wikipedia link because there was too much noise about a recent controversy.


I would like to highlight that there are currently other search engine options. We can choose to use DuckDuckGo or even Bing, neither of which I have ever blatantly caught hiding things.


Is the instance of Sue Gray related at all to the strength of English libel law? When UK controversies get buried, that often seems to be the reason.


Sue Gray has done other reports, when I think of her I think of 50 shades of grey.


Yes, and even worse: I have a strong suspicion some websites (automatically) update the date in titles and other places.

Especially "best [product] in [year]" articles and lists; they somehow are always about the current year, even in early January, and even if they only cover outdated things.

The dirty tricks of SEO content seem to work quite well atm. It's probably pretty hard to determine whether something is relevant or not.


I am slowly starting to consider that on the internet NOTHING is relevant, and EVERYTHING is manipulated.


You are not wrong... Getting attention matters, it's lucrative


Some essays stand the test of time, though. Not everything is irrelevant, but I'd say 99% is. The bar for publishing is very very low.


I've definitely seen sites simply showing the current date at a place where you'd normally expect the article's original publishing date.


Also, YouTube videos sometimes don't have a posting date in their description, not sure why. And I hate when platforms use relative time like '1 year ago' instead of the posting date.


Could you share an example of a YouTube video that does not have an upload date?

I watch quite a lot of YouTube and very frequently look up the date that videos were uploaded. I've never found this information to be missing.


I stumbled across this recently and was puzzled.

If I find that link / another example I will share here. <placeholder>

But this definitely is a feature of YouTube where content creators have an option to hide the date. I googled and found the documentation for this feature.


https://support.google.com/youtube/thread/4006094/how-can-i-...:

> Q: How can I hide the date the video was published? (Posted 4/12/19)

> A: That is not an option that is available.*


Not sure if this is OP's case, but YouTube has the rare video with no metadata visible whatsoever (tested on the web player):

https://www.youtube.com/watch?v=Qg0_4EmHLCo


Woow, thanks very much for this piece of interestingness. What even is this?

Even ad rolls (if you dig out the video id) have some sort of interesting "noone will see this" title :)

And random apps on the Play Store also have video titles.

Adding the video to Watch Later by replaying the request using cURL seems to work, at which point I can add the video to a playlist to find it later (yay), and also click through to see the uploader: https://www.youtube.com/user/GameloftVideos

Except for the bit where it says "This channel is not available" using the YouTube design from a few years ago and displays absolutely nothing, the account is perfectly normal, nothing to see here.

Wow.

What even is this?? [Edit: just realized I forgot I said that above. I'm leaving it in :P]


This is very annoying. I write for an AI site, and have requested a change to the CMS, which currently replaces the publication date with the 'updated' date, leaving no visible evidence of the original date. If you open up the HTML, the original date is in there as a token, but the casual reader won't know, and Google is annoyingly unpredictable as to whether it will display the date in SERPs (it almost always knows the date, whether it shows it or not).

I have taken now to including 'Original publication date' as a codicil in my pieces.
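For what it's worth, those leftover tokens are often machine-readable. Here's a minimal sketch of digging them out with the standard library, assuming the CMS follows common conventions like the Open Graph `article:published_time` meta tag or HTML `<time datetime>` elements (which is by no means guaranteed; the exact attribute varies by CMS):

```python
from html.parser import HTMLParser

class DateHintParser(HTMLParser):
    """Collect machine-readable publication-date hints that pages often
    keep in markup even when no date is displayed: the Open Graph
    article:published_time meta tag and <time datetime="..."> elements."""
    def __init__(self):
        super().__init__()
        self.hints = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("property") == "article:published_time":
            self.hints.append(("meta", a.get("content")))
        elif tag == "time" and "datetime" in a:
            self.hints.append(("time", a["datetime"]))

# Illustrative page fragment, not a real site:
html = """<html><head>
<meta property="article:published_time" content="2016-03-01T09:00:00Z">
</head><body><time datetime="2016-03-01">March 1</time></body></html>"""

p = DateHintParser()
p.feed(html)
print(p.hints)  # -> [('meta', '2016-03-01T09:00:00Z'), ('time', '2016-03-01')]
```

Schema.org `datePublished` in JSON-LD blobs is another common hiding place, but that needs a JSON parse rather than a tag scan.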


Have you tried resizing the window? I see this sometimes too, and figured out that it usually happens when the other field values are too long. It seems the date simply has the lowest priority among all the buttons and the view count, and gets hidden.


One way to date a post is to copy the URL into https://archive.org/.

But I've shared similar frustrations, yes.
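The Wayback Machine also has an "availability" JSON API that can automate this lookup. A minimal sketch follows; note that passing a very early timestamp to fish for the oldest snapshot is a heuristic (the API returns the *closest* snapshot to the timestamp), not a documented guarantee:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def wayback_lookup_url(page_url, timestamp="1996"):
    """Build an Internet Archive availability-API query. Passing an early
    timestamp asks for the snapshot closest to it, i.e. roughly the oldest."""
    return ("https://archive.org/wayback/available?"
            + urlencode({"url": page_url, "timestamp": timestamp}))

def earliest_snapshot_date(api_response):
    """Pull the snapshot timestamp (YYYYMMDDhhmmss) out of an availability
    response dict; returns None if nothing is archived."""
    closest = api_response.get("archived_snapshots", {}).get("closest")
    return closest.get("timestamp") if closest else None

# Live use would be:
#   with urlopen(wayback_lookup_url("http://example.com/post")) as r:
#       data = json.load(r)
# The response has this shape (values below are illustrative, not real):
sample = {"archived_snapshots": {"closest": {
    "available": True,
    "timestamp": "20090101000000",
    "status": "200",
    "url": "http://web.archive.org/web/20090101000000/http://example.com/"}}}

print(earliest_snapshot_date(sample))  # -> 20090101000000
```

The first-archived date is only an upper bound on the publication date, of course; the page could be much older than the Archive's first crawl of it.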


A plugin that, either automatically or on request, checks sites for the oldest known archive date would be interesting, though I worry it'd both encourage blocking IA bots and tax their resources.


I remember checking caching headers to learn about content date at some point.
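That would be the `Last-Modified` response header. A small sketch of the idea, with the caveat that servers and CDNs often report deploy or cache time there rather than when the content was actually written:

```python
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen

def last_modified(url):
    """HEAD-request a URL and parse its Last-Modified header, if present.
    Caveat: servers and CDNs often report deploy or cache time here,
    not the date the content was authored."""
    req = Request(url, method="HEAD")
    with urlopen(req, timeout=10) as resp:
        header = resp.headers.get("Last-Modified")
    return parsedate_to_datetime(header) if header else None

# The header format itself (an RFC 7231 HTTP-date) parses offline:
dt = parsedate_to_datetime("Wed, 21 Oct 2015 07:28:00 GMT")
print(dt.year, dt.month)  # -> 2015 10
```

For static sites served straight off a filesystem this can be a surprisingly good signal; for anything behind a CMS or CDN, treat it as weak evidence at best.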

It's not just you. There are also sites being gone and content getting lost. I think by now pretty much all text content should simply never disappear anymore, yet the Web Archive seems to be the only one doing something right (Google's cache seemed to lose its persistence at some point, but I may be imagining it; that's like two data points).

I wish the Web Archive would skip videos and instead fetch more obscure websites, but I'm guessing that telling what's spam and what isn't is not easy.

A shared, distributed history cache of visited websites could be nice, but in the short time I spent thinking about it I couldn't figure out a way to make it work that would convince me to install it myself.


I've noticed a bias for recency that a lot of people share, or take for granted (sometimes myself included). I was surprised to see that this idea, that things only get better with time, goes back to Aristotle.

https://plato.stanford.edu/entries/progress/


I think Google prioritizes newer content because AFAIK modifying the date of a page to make it look like it's been published more recently is a common SEO trick.


TikTok made this more popular. Every other social media platform had a date on the post or the video. TikTok had none. It still doesn't. That kept videos fresh over time.


Yes, I discovered this back when Reddit was just a few years old. Before then I was able to search Google for a specific Reddit thread and find it. Now there is no chance. I've been screenshotting and downloading things that I like for years now.


Traditionally, a date has been a negative SEO signal, in that Google prefers to surface "new" things rather than "older" things. As a result, web sites will pick one of three strategies: remove all date indicators from the content; show the date while the content is "new" but remove it after the content has been up for more than a set period of time; or use a JS function to always render yesterday's date as the date of the material.

Google isn't about "reference" data (who clicks on ads when they are looking for a history of stories about topic X?), so the archival and reference function falls to meta services like Wikipedia, where a human curates the history and provides links back to that history.

Of course such links get very few visitors and often the place hosting the content will simply retire it rather than spend a couple of nanocents on leaving it up, and the result is link rot.

Yes, I am cynical about how Google is now an agent for "destroying the world's information" when at one point they were simply trying to organize it.


I often use before:YYYY-MM-DD and after:YYYY-MM-DD keywords in google search, and some websites mess with that data so that their content looks like it was created today when it obviously wasn't.


I’ve started noticing this when using Google to search for Reddit threads. Threads posted years ago now appear to have been posted in the last few days due to Reddit hacking Google relevancy metrics.

Probably Reddit’s doing, but it’s made finding older topics impossible and I partially blame Google for letting companies abuse their service in this way.


I think this started back at the height of Web 2.0, when Google began favouring more recent (often copy-and-paste) blog content over older content, regardless of prestige or authority. If I'm not mistaken, age and/or recent updates have now officially been part of the algorithm for quite a while.


Yes, the old, decentralized web is disappearing and being replaced by centralized big tech. I remember when Google search came out and it was amazing. Now it restricts anything "not approved": personal websites, GeoCities, Angelfire, MySpace. Then blogs and Facebook appeared for people to put their content there. Then Reddit came and ate the forums. Reddit never used to show up in search results, but many helpful forums would; now it is the opposite. Then Google ate up the bloggers. The entire point is to direct content to these big-tech-approved sites. Try wiby.me to find some of the old sites. There is also a movement to decentralize: take your stuff off GitHub, off Reddit, off Facebook, and onto your own site. Form webrings like in the good old days.


It's not just you - it's always Year Zero [1] on Google SERPs now. When you're looking for information on a topic or historical event, you're not going to easily find classical sources, contemporary primary sources, or bloggers - it's always content from the last two years, from what Google considers authoritative / trusted sources. Searching has become a skill again; you have to use your existing knowledge to drill down to associated terms and also search books, archives, social media, and so on, up to asking literal people.

[1] https://en.wikipedia.org/wiki/Year_Zero_(political_notion)


It's to a huge degree Google giving crap results, and not so much actual pages disappearing. The other day I searched for "f.position.vsub is not a function"[1] and after the first 3 results it starts giving completely irrelevant results, like the definition of a function according to mathematics (https://en.wikipedia.org/wiki/Function_(mathematics) ). Just wild garbage.

[1] https://www.google.com/search?q=f.position.vsub+is+not+a+fun...


It also reminds me of something Thomas Piketty said: prices of things used to be mentioned prominently in novels, but nowadays that doesn't happen unless pricing itself is a salient part of the plot. The reason given was inflation. Once inflation took hold, there was no accuracy or authenticity to the prices of objects, and they slowly disappeared from novels, movies, etc.

The same thing happens with dates: with billions of events generated and captured every second, the actual date/time of an event gets demoted to the level of the thousands of other attributes captured. So it will only be mentioned when the date itself is the point of the article.


The past does seem to disappear in Google: https://news.ycombinator.com/item?id=23977375 https://news.ycombinator.com/item?id=19604135 Maybe people are evolving to erase dates as a countermeasure?

I sometimes wonder how much of this is just a ratchet of things like banning spam: on a long enough time horizon, the survival rate of everything goes to zero?


Related anecdote: the 2008 flash crash of United Airlines stock [0]. A six-year-old article about United Airlines' 2002 bankruptcy was republished from the Chicago Tribune in The South Florida Sun-Sentinel (apparently without the date), then got picked up by Google News, then by an investment firm, then ultimately by Bloomberg, leading to a flash crash from $12 to $3.

0: https://www.wired.com/2008/09/six-year-old-st/


I think this is partly a second-order consequence of the reasonable expectation that, due to the rate of innovation and general churn in technology, computer-related information goes stale quickly and hence more recent sources should be prioritised. Having a date stamp on the page then becomes a liability (I for one am guilty of adding a current year number to my search terms for some queries to try to get more accurate results).

Another reason might be that SEO got really good several years ago so older content just can't compete.


I've noticed that too. For some explicit or implicit reason, designers are making UIs without the date. Of course, this leaves readers unable to orient themselves in a temporal context. I wonder if they are mindlessly influenced by the transient nature of social media design. Things being "conversation" oriented can gain from peer interaction, but for highly technical content, an informal UX will prevent serious engineering from _flowing_ (emerging smoothly), or at least make it harder than needed.


I think this is the effect of various companies trying to capture as much of our attention as possible because it makes sense for their bottom line. We are all victims of recency bias, so it makes sense for companies to prioritize more recent content. If it wasn't the case, the various social media apps wouldn't be as addictive as they are.

I am not aware of a social media app feed that puts quality above recency (not counting the plugins that enable that, like Twemex [1]). Instead they keep us in what David Perell calls "Never-Ending Now" [2]. We endlessly consume temporary, short-lived content and we are mostly blind to the past.

Google search is not social media, but I wouldn't be surprised if Google ranked more recent content higher, given how they have changed the YouTube algorithm.

1: https://chrome.google.com/webstore/detail/twemex-sidebar-for...

2: https://perell.com/essay/never-ending-now/


Google has some neat tools: click the "Tools" menu and select "Any time"; you can enter a custom range there: https://imgur.com/a/XQSMT9n

Not that I ever needed to use it, but it's there.

Google does know an indirect date for a page even if it's not written explicitly. The first crawl date should be saved, and if no better indicator exists, I'd assume they use that timestamp.


I have a feeling that OP has used this feature and that it's part of their complaint.

I've been using it for many years by now, and have essentially given up on it recently.

I'd often set it to past year/month and get articles, reviews, etc. that were clearly dated many years previous. I don't know if the sites are responsible for gaming the search, or Google themselves, but the result is the same. I have some doubt that PC hardware review sites are gaming dates on their old articles, although I accept it could be in their interest to do so.


I'm surprised that you never used it?

This is pretty much the only reason why I still sometimes use Google - because DuckDuckGo only added this feature recently.


I'm in the minority of the HN crowd who has very few issues with Google search quality (although I do see some spam in my native language). My account is getting to be adolescent-aged, so it might have learned a thing or two about how to serve me best.


But this isn't about search quality - it's about finding some web page (otherwise obscured by pages with similar keywords) that you know could only have been created at a specific time.


Count your blessings.

My account is around 15 years old and has gotten worse with time.


Like anything else, the date that Google considers something to be published at can be, and is, gamed.


I've made the same observation. It's pretty annoying and an example of how the web is in some cases becoming less useful. For Google in particular, more and more often I find myself skipping the first wave of results because they aren't very helpful; they're just crafted to show up in search results. I guess now that so many of us are already in the Google ecosystem, there's less incentive to cater to users.


There is something worse than content without a date: content with a fake date, changed to make it appear relevant when it isn't.

There is nothing more frustrating than typing "XXXX 2022", clicking a link, and seeing "XXXX-2020" in the URL.

People legit don't change their article but update the title to stay on top of the SEO game. Usually found on generic searches that drive big traffic. I freaking hate that so much.


That’s especially common in blogspam where they programmatically update the year to the current year when compiling a stolen best-of list.

