CNET is deleting old articles to try to improve its Google Search ranking (theverge.com)
801 points by mikece on Aug 9, 2023 | 574 comments



So google's shitty search now economically incentivizes sites to destroy information.

Can there be any doubt that Google destroyed the old internet by becoming a bad search engine? Could their exclusion of most of the web be considered punishment for sites being so old and stable that they don't rely on Google for ad revenue?


I'll just assume you neglected to read TFA, because if you had, you would have discovered that it links to an official Google source that states CNET shouldn't be doing this.[1]

[1] https://twitter.com/searchliaison/status/1689018769782476800


I could imagine CNET's SEO team got an average-rank goal instead of an absolute traffic goal. So by removing low-ranked old pages, the average position of their search results moves closer to the top even though total traffic sinks. I've seen stuff like this happen at my own company as well, where a team's KPIs are designed such that they'll ruin absolute numbers in order to achieve their relative KPI goals, like getting an increase in conversion rates by just cutting all low-conversion traffic.


In general, people often forget that if your target is a ratio, you can attack the numerator or the denominator. Often the latter is easier to manipulate.
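
As a toy illustration of attacking the denominator in the conversion-rate example above (made-up numbers, sketched in Python):

    # Toy numbers: a site with a lot of low-converting traffic.
    visits, conversions = 100_000, 1_000
    print(conversions / visits)   # 0.01 -> 1% conversion rate

    # "Optimize" by cutting 80,000 low-converting visits that produced 200 sales.
    visits, conversions = 20_000, 800
    print(conversions / visits)   # 0.04 -> 4% conversion rate

    # The relative KPI quadrupled while absolute conversions fell by 20%.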


Even if it's not a ratio. When any metric becomes a target it will be gamed.

My organization tracks how many tickets we have had open for 30 days or more. So my team started to close tickets after 30 days and let them reopen automatically.


Lower death rate in hospitals by sending sick people to hospice! https://www.nbcnews.com/health/health-care/doctors-say-hca-h...


Meanwhile that's not necessarily a bad outcome. In theory it makes the data better by focusing on deaths that might or might not have been preventable, rather than making every hospital look responsible for inevitable deaths.

Of course the actual behavior in the article is highly disturbing.


This is why KPIs or targets should NEVER be calculated values like averages or ratios. The team is then incentivized to do something hostile, such as not promoting the content as much so that the ratio is higher, as soon as they barely scrape past the impressions mark.


When deciding KPIs, Goodhart's law should always be kept in mind: when a measure becomes a target, it ceases to be a good measure.

It's really hard to not create perverse incentives with KPIs. Targets like "% of tickets closed within 72 hours" can wreck service quality if the team is under enough pressure or unscrupulous.


Sure they can, e.g. on-time delivery (or, even better, the share of shipments missing the promised delivery date) is a ratio. Or inventory turn rates, where you actually want people to attack the denominator.

Generally speaking, an easy solution is to attach another target to either the numerator or the denominator, a target that requires people to move that value in a certain direction. That might even be owned by a different team than the one with goals on the ratio.


> Sure they can, e.g. on-time delivery (or, even better, the share of shipments missing the promised delivery date) is a ratio. Or inventory turn rates, where you actually want people to attack the denominator.

These are good in that they’re directly aligned with business outcomes but you still need sensible judgement in the loop. For example, say there’s an ice storm or heat wave which affects delivery times for a large region – you need someone smart enough to recognize that and not robotically punish people for failing to hit a now-unrealistic goal, or you’re going to see things like people marking orders as canceled or faking deliveries to avoid penalties or losing bonuses.

One example I saw at a large old school vendor was having performance measured directly by units delivered, which might seem reasonable since it’s totally aligned with the company’s interests, except that they were hit by a delay on new CPUs and so most of their customers were waiting for the latest product. Some sales people were penalized and left, and the cagier ones played games having their best clients order the old stuff, never unpack it, and return it on the first day of the next quarter - they got the max internal discount for their troubles so that circus cost way more money than doing nothing would have, but that number was law and none of the senior managers were willing to provide nuance.


Selling something in one quarter, with the understanding that the customer returns it the next, is also clean cut accounting fraud.


Yeah, every part of this was a “don’t incentivize doing this”. I doubt anyone would ever be caught for that since there was nothing in writing but it was a complete farce of management. I heard those details over a beer with one of the people involved and he was basically wryly chuckling about how that vendor had good engineers and terrible management. They’re gone now so that caught up with them.


That can be gamed as well: you could either change the scope or cut corners and ship something of lower quality.


I mean, ideally you just have both an absolute and a calculated value to ensure both trend in the right direction.


This is exactly how Red Ventures runs their companies. Make that chart on the wall tv go up, get promotion.


That only says that Google discourages such actions, not that such actions are not beneficial to SEO ranking (which is equal to the aforementioned economic incentive in this case).


So whose word do we have to go on that this is beneficial, besides anonymous "SEO experts" and CNET leadership (those paragons of journalistic savvy)?

Perhaps what CNET really means is that they're deleting old low quality content with high bounce rates. After all, the best SEO is actually having the thing users want.


In my experience SEO experts are the most superstitious tech people I ever met. One guy wanted me to reorder HTTP header fields to match another site. He wanted our minified HTML to include a linebreak just after a certain meta element just because some other site had it. I got requests to match variable names in our minified JS just because Google's own minified JS had those names.


> In my experience SEO experts are the most superstitious tech people I ever met.

And some are the most data-driven people you'll ever meet. As with most people who claim to be experts, the trick is to determine whether the person you're evaluating is a legitimate professional or a cargo-culting wanna-be.


I’ve always felt there is a similarity to day traders or people who overanalyze stock fundamentals. There comes a time when data analysis becomes astrology…


> There comes a time when data analysis becomes astrology.

Excellent quote. It's counterintuitive but looking at what is most likely to happen according to the datasets presented can often miss the bigger picture.


This. It is often the scope and context that determine the logic. It is easy to build bubbles and stay comfy inside them. Without revealing much: I asked a data scientist, whose job is to figure out bids on keywords and essentially control how much $ is spent advertising something in a specific region, about negative criteria. As in, are you sure you wouldn't get this benefit even if you stopped spending the $? His response was "look at all this evidence that our spend caused this x% increase in traffic and y% more conversions", and that was 2 years ago. My follow-up question was: okay, now that the thing you advertised is popular, wouldn't it be the more organic choice in the market, and we can stop spending the $ there? His answer was: look at what happened when we stopped the advertising in this small region in Germany 1.5 years ago! My common-sense validation question still stands. I still believe he built a shiny good bubble 2 years ago, and refuses to reason with the wider context and second-degree effects.


The people who spend on marketing are not incentivised to spend less :)


> There comes a time when data analysis becomes astrology...

or just plain numerology


Leos are generally given the “heroic/action-y” tropes, so if you are, for example, trying to pick Major League Baseball players, astrology could help a bit.

Right for the wrong reasons is still right.


Right for the wrong reasons doesn't give confidence it's a sustainable skill. Getting right via randomness also fits into the same category.


My data driven climate model indicates that we could combat climate change by hiring more pirates.


Some of the most superstitious people I've ever met were also some of the most data-driven people I've ever met. Being data-driven doesn't exclude unconscious manipulation of the data selection or interpretation, so it doesn't automatically equate to "objective".


The data analysis I've seen most SEO experts do is similar to sitting at a highway, carefully timing the speed of each car, taking detailed notes of the cars appearance, returning to the car factory and saying that all cars need to be red because the data says red cars are faster.


One SEO expert who consulted for a bank I worked at wanted us to change our URLs from e.g. /products/savings-accounts/apply by reversing them to /apply/savings-accounts/products on the grounds that the most specific thing about the page must be as close to the domain name as possible, according to them. I actually went ahead and changed our CMS to implement this (because I was told to). I'm sure the SEO expert got paid a lot more than I did as a dev. A sad day in my career. I left the company not long after...
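
Purely for illustration, the change amounted to a segment reversal along these lines (hypothetical paths, sketched in Python rather than the actual CMS code):

    # Hypothetical sketch of the reversal: the most specific segment moves to the front.
    def reverse_path(path: str) -> str:
        segments = [s for s in path.strip("/").split("/") if s]
        return "/" + "/".join(reversed(segments))

    print(reverse_path("/products/savings-accounts/apply"))
    # -> /apply/savings-accounts/products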


Unfortunately though, this was likely good advice.

The Yandex source code leak revealed that keyword proximity to root domain is a ranking factor. Of course, there’s nearly a thousand factors and “randomize result” is also a factor, but still.

SEO is unfortunately a zero sum game so it makes otherwise silly activities become positive ROI.


But that's wrong... Do breadcrumbs get larger as you move away from the loaf? No!


It's just a URL rewrite rule for nginx proxy.


If you want all your canonical urls to be wrong and every navigation to include a redirect, sure.


Even if that measurably improves the ranking of your website, it still would be a bullshit job. It also invites side effects, especially on the web.


I think you're largely correct, but Google isn't one person, so there may be somewhat emergent patterns that work from an SEO standpoint without a solid answer to why. If I were an SEO customer I would ask for some proof, but that isn't the market they're targeting. There was an old saying in the tennis instruction business that a lot of lessons amounted to 'bend your knees, fifty please'. So lots of snake-oil salesmen, but some salesmen sell stuff that works.


That's a bit out there, but Google has mentioned in several different ways that pages and sites have thousands of derived features and attributes they feed into their various ML pipelines.

I assume Google is turning all the site's pages, js, inbound/outbound links, traffic patterns, etc...into large numbers of sometimes obscure datapoints like "does it have a favicon", "is it a unique favicon?", "do people scroll past the initial viewport?", "does it have this known uncommon attribute?".

Maybe those aren't the right guesses, but if a page has thousands of derived features and attributes, maybe they are on the list.

So, some SEOs take the idea that they can identify sites that Google clearly showers with traffic, and try to recreate as close a list of those features/attributes as they can for the site they are being paid to boost.

I agree it's an odd approach, but I also can't prove it's wrong.


Considering their job can be done by literally anyone, they have to differentiate somehow


>our minified HTML

Unreadable source code is a crime against humanity.


Is minified "code" still "source code"? I think I'd say the source is the original implementation pre-minification. I hate it too when working out how something is done on a site, but I'm wondering where we fall on that technicality. Is the output of a pre-processor still considered source code even if it's not machine code? These are not important questions but now I'm wondering.


Source code is what you write and read, but sometimes you write one thing and people can only read it after your pre processing. Why not enable pretty output?

Plus I suspect minifying HTML or JS is often cargo cult (for small sites who are frying the wrong fish) or compensating for page bloat


It doesn't compensate for bloat, but it reduces bytes sent over the wire, bytes cached in between, and bytes parsed in your browser for _very_ little cost.

You can always open dev tools in your browser and have an interactive, nicely formatted HTML tree there with a ton of inspection and manipulation features.


In my experience usually the bigger difference is made by not making it bloated in the first place... As well as progressive enhancement, nonblocking load, serving from a nearby geolocation, etc. I see projects minify all the things by default when it should be literally the last measure, with the least impact on TTI.


It does stuff like tree shaking as well; it's quite good. If your page is bloated, it makes it better. If your page is not bloated, it makes it better.


Tree-shaking is orthogonal to minification tho.


That's true.


and does an LLM care … it feels like minification doesn't stop one from explaining the code at all.


The minified HTML (and, god forbid, JavaShit) is the source from which the browser ultimately renders a page, so yes that is source code.


"The ISA bytecode is the source from which the processor ultimately executes a program, so yes that is source code."


I suppose the difference is that someone debugging at that level will be offered some sort of "dump" command or similar, whereas someone debugging in a browser is offered a "View Source" command. It's just a matter of convention and expectation.

If we wanted browsers to be fed code that for performance reasons isn't human-readable, web servers ought to serve something that's processed way more than just gzipped minification. It could be more like bytecode.


I find myself using View Source sometimes, too, but more often I just use devtools, which shows DOM as a readable tree even if source is minified.

I'm actually all for binary HTML – not just it's smaller, it can also be easier to parse, and makes more sense overall nowadays.


Let's be honest, a lot of non-minified JS code is barely legible either :)

For me I guess what I was getting at is that I consider source the stuff I'm working on - the minified output I won't touch, it's output. But it is input for someone else, and available as a View Source so that does muddy the waters, just like decompilers produce "source" that no sane human would want to work on.

I think semantically I would consider the original source code the "real" source if that makes sense. The source is wherever it all comes from. The rest is various types of output from further down the toolchain tree. I don't know if the official definition agrees with that though.


>If we wanted browsers to be fed code that for performance reasons isn't human-readable,

Worth keeping in mind that "performance" here refers to saving bandwidth costs as the host. Every single unnecessary whitespace or character is a byte that didn't need to be uploaded, hence minify and save on that bandwidth and thus $$$$.

The performance difference on the browser end between original and minified source code is negligible.


Last time I ran the numbers (which admittedly was quite a number of years ago now), the difference between minified and unminified code was negligible once you factored in compression because unminified code compresses better.

What really adds to the source code footprint is all of those trackers, adverts and, in a lot of cases, framework overhead.
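
For anyone who wants to re-run that kind of comparison, a rough sketch (a crude whitespace-collapsing stand-in for a real minifier, default gzip settings, hypothetical file name):

    # Rough sketch: compare raw, crudely "collapsed", and gzipped sizes for a file.
    # The regex collapse is NOT a real minifier (it can break string literals);
    # it's only here to give a ballpark for the whitespace savings.
    import gzip, re, pathlib

    raw = pathlib.Path("bundle.js").read_bytes()   # hypothetical local file
    mini = re.sub(rb"[ \t]+", b" ", re.sub(rb"\n\s*", b"\n", raw))

    for label, data in [("raw", raw), ("collapsed", mini)]:
        print(f"{label:10} {len(data):>10,} bytes   gzipped: {len(gzip.compress(data)):>10,} bytes")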


I was thinking transfer speed, although even then, the difference is probably negligible if compressing regardless.


The way I see it, if someone needs to minify their JavaShit (and HTML?! CSS?!) to improve user download times, that download time was horseshit to start with and they need to rebuild everything properly from the ground up.


> It could be more like bytecode.

Isn’t this essentially what WebAssembly is doing? I’ll admit I haven’t looked into it much, as I’m crap with C/++, though I’d like to try Rust. Having “near native” performance in a browser sounds nice, curious to see how far it’s come.


If you need to run it through a prettifier to even have a chance of understanding the code, is it still source code?

About the byte code: You mean wasm? (Guess that's what you're alluding to.)


If you need syntax highlighting and an easy way to navigate between files to understand a large code base, is it still source code?


Turtles all the way down


Nobody tell this guy about compilers.


Minifying HTML is basically just removing non-significant whitespace. Run it through a formatter and it will be readable.

If you dislike unreadable source code I would assume you would object to minifying JS, in which case you should ask people to include sourcemaps instead of objecting to minification.


So I guess you think compiled code is even worse, right?


I mean, isn't that precisely why open source advocates advocate for open source?

Not to mention, there is no need to "minify" HTML, CSS, or JavaShit for a browser to render a page, unlike compiled code, which is more or less a necessity for such things.


Minifying code for browsers greatly reduces the amount of bandwidth needed to serve web traffic. There's a good reason it's done.

By your logic, there's actually no reason to use compiled code at all, for almost anything above the kernel. We can just use Python to do everything, including run browsers, play video games, etc. Sure, it'll be dog-slow, but you seem to care more about reading the code than performance or any other consideration.


I already alluded[1] to the incentives for the host to minify their JavaShit, et al., and you would have a point if it wasn't for the fact that performance otherwise isn't significantly different between minified and full source code as far as the user would be concerned.

[1]: https://news.ycombinator.com/item?id=37072473


I'm not talking about the browser's performance, I'm talking about the network bandwidth. All that extra JS code in every HTTP GET adds up. For a large site serving countless users, it adds up to a lot of bandwidth.


Somebody mentioned negligible/deleterious impacts on bandwidth for minified code in that thread, but they seemed to have low certainty. If you happen to have evidence otherwise, it might be informative for them.


>JavaShit

Glad to see the diversity of HN readers apparently includes twelve year olds.

Anyway, you do realise plenty of languages have compilers with JS as a compilation target, right? How readable do you think that is?


>Glad to see the diversity of HN readers apparently includes twelve year olds.

The abuse of JavaScript does not deserve the respect of being called by a proper name.

>Anyway, you do realise plenty of languages have compilers with JS as a compilation target, right? How readable do you think that is?

If you're going to run "compiled" code through an interpreter anyway, is that really compiled code?


>In computing, a compiler is a computer program that translates computer code written in one programming language (the source language) into another language (the target language).


Well, no, but it can speed up loading by reducing transfer.


Is that still true with modern compression?


If someone releases only a minified version of their code, and licenses it as free as can be, is it open source?


According to the Open Source Definition of the OSI it's not:

> The program must include source code [...] The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor [...] are not allowed.


The popular licenses for which this is a design concern are careful to define source code to mean "preferred form of the work for making modifications" or similar.


It’s also a crime against zorgons from Planet Zomblaki.


Google actually describes an entirely plausible mechanism of action here at [1]: old content slows down site crawling, which can cause new content to not be refreshed as often.

Sure, one page doesn’t matter, but thousands will.

[1] https://twitter.com/searchliaison/status/1689068723657904129...


It says it doesn't affect ranking and their quote tweet is even more explicit

https://twitter.com/searchliaison/status/1689297947740295168


This is the actual quote from Google PR:

>Removing it might mean if you have a massive site that we’re better able to crawl other content on the site. But it doesn’t mean we go “oh, now the whole site is so much better” because of what happens with an individual page.

Parsing this carefully, to me it sounds worded to give the impression removing old pages won’t help the ranking of other pages without explicitly saying so. In other words, if it turns out that deleting old pages helps your ranking (indirectly, by making Google crawl your new pages faster), this tweet is truthful on a technicality.

In the context of negative attention where some of the blame for old content being removed is directed toward Google, there is a clear motive for a PR strategy that deflects in this way.


The tweet is also clearly saying that deleting old content will increase the average page rank of your articles in the first N hours after they are published. (Because the time to first crawl will decrease, and the page rank is effectively zero before the first crawl.)

CNet is big enough that I’d expect Google to ensure the crawler has fresh news articles from it, but that isn’t explicitly said anywhere.


And considering all the AI hype, one could have hoped that the leading search engine's crawler would be able to "smartly" detect new content based on a URL containing a timestamp.

Apparently not if this SEO trick is really a thing...

EDIT: sorry, my bad, it's actually the opposite. One could expect that a site like CNET would include a timestamp and a unique ID in their URLs in 2023. This seems to be the "unpermalink" of a recent CNET article.

Maybe the SEO expert could have started there...

https://www.cnet.com/tech/mobile/samsung-galaxy-z-flip-5-rev...


I did the tweet. It is clearly not saying anything about the "average page rank" of your articles because those words don't appear in the tweet at all. And PageRank isn't the only factor we use in ranking pages. And it's not related to "gosh, we could crawl your page in X hours therefore you get more PageRank."


It's not from Google PR. It's from me. I'm the public liaison for Google Search. I work for our search quality team, not for our PR team.

It's not worded in any way intended to be parsed. I mean, I guess people can do that if they want. But there's no hidden meaning I put in there.

Indexing and ranking are two different things.

Indexing is about gathering content. The internet is big, so we don't index all the pages on it. We try, but there's a lot. If you have a huge site, similarly, we might not get all your pages. Potentially, if you remove some, we might get more to index. Or maybe not, because we also try to index pages as they seem to need to be indexed. If you have an old page that doesn't seem to change much, we probably aren't running back every hour to it in order to index it again.

Ranking is separate from indexing. It's how well a page performs after being indexed, based on a variety of different signals we look at.

People who believe in removing "old" content aren't generally thinking that's going to make the "new" pages get indexed faster. They might think that maybe it means more of their pages overall from a site could get indexed, but that can include "old" pages they're successful with, too.

The key thing is, if you go to the CNET memo mentioned in the Gizmodo article, it says this:

"it sends a signal to Google that says CNET is fresh, relevant and worthy of being placed higher than our competitors in search results."

Maybe CNET thinks getting rid of older content does this, but it's not. It's not a thing. We're not looking at a site, counting up all the older pages and then somehow declaring the site overall as "old" and therefore all content within it can't rank as well as if we thought it was somehow a "fresh" site.

That's also the context of my response. You can see from the memo that it's not about "and maybe we can get more pages indexed." It's about ranking.


Suppose CNET published an article about LK99 a week ago, then they published another article an hour ago. If Google hasn’t indexed the new article yet, won’t CNET rank lower on a search for “LK99” because the only matching page is a week old?

If by pruning old content, CNET can get its new articles in the results faster, it seems this would get CNET higher rankings and more traffic. Google doesn’t need to have a ranking system directly measuring the average age of content on the site for the net effect of Google’s systems to produce that effect. “Indexing and ranking are two different things” is an important implementation detail, but CNET cares about the outcome, which is whether they can show up at the top of the results page.

>If you have a huge site, similarly, we might not get all your pages. Potentially, if you remove some, we might get more to index. Or maybe not, because we also try to index pages as they seem to need to be indexed.

The answer is phrased like a denial, but it’s all caveated by the uncertainty communicated here. Which, like in the quote from CNET, could determine whether Google effectively considers the articles they are publishing “fresh, relevant and worthy of being placed higher than our competitors in search results”.


You're asking about freshness, not oldness. IE: we have systems that are designed to show fresh content, relatively speaking -- matter of days. It's not the same as "this article is from 2005 so it's old don't show it." And it's also not what is generally being discussed in getting rid of "old" content. And also, especially for sites publishing a lot of fresh content, we get that really fast already. It's an essential part of how we gather news links, for example. And and and -- even with freshness, it's not "newest article ranks first" because we have systems that try to show the original "fresh" content or sometimes a slightly older piece is still more relevant. Here's a page that explains more ranking systems we have that deal with both original content and fresh content: https://developers.google.com/search/docs/appearance/ranking...


Dude, like who is google? The judicial system of the web?

No. Google has their own motivations here, they are a player not a rule maker.

Don’t trust SEOs, as no one actually knows what works, but certainly don’t think Google is telling you the absolute truth.


Ha, I actually totally agree with you, apparently my comment gave the wrong impression. I was just arguing with the GP's comment which was trying to (fruitlessly, as you point out) read tea leaves that aren't even there.


While CNET might not be the most reliable side, Google telling content owners to not play SEO games is also too biased to be taken at face value.

It reminds me of Apple's "don't run to the press" advice when hitting bugs or app review issues. While we'd assume Apple knows best, going against their advice totally works and is by far the most efficient action for anyone with enough reach.


Considering how much paid-for unimportant and unrelated drivel I now have to wade through every time I google to get what I am asking for, I doubt very much that whatever is optimal for search-engine ranking has anything to do with what users want.


Wrong, the best SEO is having what users want and withholding it long enough to get a high average session time.


And I suppose a corollary is: "claim to have what the users want, and have them spend long enough to figure out that you don't have it"?


See: every recipe site in existence.


> That only says that Google discourages such actions

Nope. It says that Google does not ding you for old content.

"Are you deleting content from your site because you somehow believe Google doesn't like "old" content? That's not a thing!"


Do the engineers at Google even know how the Google algorithm actually works? Better than SEO experts who spend their time meticulously tracking the way that the algorithm behaves under different circumstances?

My bet is that they don't. My bet is that there is so much old code, weird data edge cases and opaque machine-learning models driving the search results that Google's engineers have lost the ability to predict what the search results would be or should be in the majority of cases.

SEO experts might not have insider knowledge, but they observe in detail how the algorithm behaves, in a wide variety of circumstances, over extended periods of time. And if they say that deleting old content improves search ranking, I'm inclined to believe them over Google.

Maybe the people at Google can tell us what they want their system to do. But does it do what they want it to do anymore? My sense is that they've lost control.

I invite someone from Google to put me in my place and tell me how wrong I am about this.


Yeah,

Once upon a time, Matt Cutts would come on HN and give a fairly knowledgeable and authoritative explanation of how Google worked. But those days are gone, and I'd say so are the days of standing behind any articulated principle.


I work for Google and do come into HN occasionally. See my profile and my comments here. I'd come more often if it were easier to know when there's something Google Search-related happening. There's no good "monitor HN for X terms" thing I've found. But I do try to check, and sometimes people ping me.

In addition, if you want an explanation of how Google works, we have an entire web site for that: https://www.google.com/search/howsearchworks/


Google Alerts come to mind.


The engineers at Google do know how our algorithmic systems work because they write them. And the engineers I work with at Google looking at the article about this found it strange anyone believes this. It's not our advice. We don't somehow add up all the "old" pages on a site to decide a site is too "old" to rank. There's plenty of "old" content that ranks; plenty of sites that have "old" content that rank. If you or anyone wants our advice on what we do look for, this is a good starting page: https://developers.google.com/search/docs/fundamentals/creat...


>The engineers at Google do know how our algorithmic systems work because they write them.

So there's zero machine learning or statistical modeling based functionality in your search algorithms?


There is. Which is why I specifically talked only about writing for algorithmic systems. Machine learning systems are different, and not everyone fully understands how they work, only that they do and can be influenced.


It's really hard to get a deep or solid understanding of something if you lack insider knowledge. The search algorithm is not something most Googlers have access to, but I assume they observe what their algorithm does constantly and in a lot of detail to measure what their changes are doing.


"Are you deleting content from your site because you somehow believe Google doesn't like "old" content? That's not a thing!"

I guess that Googler never uses Google.

It's very hard to find anything on Google older than or more relevant than Taylor Swift's latest breakup.


I think in this context, saying that it's not a thing that google doesn't like old content just means that google doesn't penalize sites as a whole for including older pages, so deleting older pages won't help boost the site's ranking.

This is not the same as saying that it doesn't prioritize newer pages over older pages in the search results.

The way it's worded does sound like it could imply the latter thing, but that may have just been poor writing.


"poor writing" is the new "merely joking guys!"


That Googler here. I do use Google! And yeah, I get sometimes people want older content and we show fresher content. We have systems designed to show fresher content when it seems warranted. You can imagine a lot of people searching about Maui today (sadly) aren't wanting old pages but fresh content about the destruction there.

Our ranking system with freshness is explained more here: https://developers.google.com/search/docs/appearance/ranking...

But we do show older content, as well. I find often when people are frustrated they get newer content, it's because of that crossover where there's something fresh happening related to the query.

If you haven't tried, consider our before: and after: commands. I hope we'll finally get these out of beta status soon, but they work now. You can do something like before:2023 and we'd only show pages from before 2023 (to the best we can determine dates). They're explained more here: https://twitter.com/searchliaison/status/1115706765088182272


"Taylor Swift before:2010"

With archive search, the News section floats links like https://www.nytimes.com/2008/11/09/arts/music/09cara.html


Maybe not related to the age of the content but more content can definitely penalize you. I recently added a sitemap to my site, which increased the amount of indexed pages, but it caused a massive drop in search traffic (from 500 clicks/day to 10 clicks/day). I tried deleting the sitemap, but it didn't help unfortunately.


Ye. I am flabbergasted by people that are gaslighting people into not being "superstitious" about Google's ranking.


How many pages are we talking about here?


100K+. Mostly AI and user generated content. I guess the sudden increase in number of indexed pages prompted a human review or triggered an algorithm which flagged my site as AI generated? Not sure.


Just because someone says water isn't wet doesn't mean water isn't wet.


The contrived problem of trusting authority can be easily resolved by trusting authority


Claims made without evidence can be dismissed without evidence.


"The Party told you to reject the evidence of your eyes and ears. It was their final, most essential command."

-George Orwell, 1984


it seems incredibly short-sighted to assume that just because these actions might possibly give you a small bump in SEO right now, they won't have long-term consequences.

if CNET deletes all their old articles, they're making a situation where most links to CNET from other sites lead to error pages (or at least, pages with no relevant content on them) and even if that isn't currently a signal used by google, it could become one.


No doubt those links are redirected to the CNET homepage.


Isn’t mass redirecting 404s to the homepage problematic SEO-wise?


Technically, you're supposed to do a 410 or a 404, but when some of the pages being deleted have those extremely valuable old high-reputation backlinks, it's just wasteful, so I'd say it's better to redirect to the "next best page", like maybe a category page, or the homepage as the last resort. Why would it be problematic? Especially if you do a sweep and only redirect pages that have valuable backlinks.
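
A minimal sketch of that triage, assuming a hand-curated map of retired URLs to their "next best page" (hypothetical Flask app and paths, not anyone's real setup):

    # Hypothetical sketch: 301 retired URLs with valuable backlinks to a curated
    # "next best page"; return 410 Gone for the rest of the deleted sections.
    from flask import Flask, abort, redirect

    app = Flask(__name__)

    NEXT_BEST = {                                        # curated after a backlink sweep
        "/2009/old-router-roundup": "/topics/networking",
        "/2011/windows-7-tips": "/topics/windows",
    }
    REMOVED_PREFIXES = ("/2009/", "/2010/", "/2011/")    # known-deleted archives

    @app.route("/<path:page>")
    def retired(page):
        path = "/" + page
        if path in NEXT_BEST:
            return redirect(NEXT_BEST[path], code=301)   # keep the link equity
        if path.startswith(REMOVED_PREFIXES):
            abort(410)                                   # deliberately gone
        abort(404)                                       # never existed here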


I was only talking about mass redirecting 404s to the homepage, which I've heard is not great, I think what you're saying is fine -- but that sounds like more of a well thought out strategy.


Hi. So I'm the person at Google quoted in the article and also who shared about this myth here: https://twitter.com/searchliaison/status/1689018769782476800

It's not that we discourage it. It's not something we recommend at all. Not our guidance. Not something we've had a help page about saying "do this" or "don't do this" because it's just not something we've felt (until now) that people would somehow think they should do -- any more than "I'm going to delete all URLs with the letter Y in them because I think Google doesn't like the letter Y."

People are free to believe what they want, of course. But we really don't care if you have "old" pages on your site, and deleting content because you think it's "old" isn't likely to do anything for you.

Likely, this myth is fueled by people who update content on their site to make it more useful. For example, maybe you have a page about how to solve some common computer problem and a better solution comes along. Updating a page might make it more helpful and, in turn, it might perform better.

That's not the same as "delete because old" and "if you have a lot of old content on the site, the entire site is somehow seen as old and won't rank better."


Your recommendations are not magically a description of how your algorithm actually behaves. And when they contradict, people are going to follow the algorithm, not the recommendation.


Exactly this; it's no different than how they behave with YouTube. It seems deceptive at best.


Yeah, Google’s statement seems obviously wrong. They say they don’t tell people to delete old content, but then they say that old content does actually affect a site in terms of its average ranking and also what content gets indexed.


"They say that old content does actually affect a site in terms of it’s average ranking" -- We didn't say this. We said the exact opposite.


Sorry if I’m misconstruing what was said, but then it seems that what was said isn’t consistent with what actually happens.


What the Google algorithm encourage/discourage and what google blog or documentation encourage/discourage are COMPLETELY different things. Most people here are complaining about the former, and you keep responding about the latter.


No one has demonstrated that simply removing content that's "old" means we think a site is "fresh" and therefore should do better. There are people who perhaps updated older content reasonably to keep it up-to-date and find that making it more helpful that way can, in turn, do better in search. That's reasonable. And perhaps that's gotten confused with "remove old, rank better" which is a different thing. Hopefully, people may better understand the difference from some of this discussion.


I think you have misread the tweet. It says it does not work _and_ discourages the action.


Exactly. Google also discourages link building. But getting relevant links from authority sites 100% works.


This is another problem of the entire SEO industry. Websites trust these SEO consultants and growth hackers more than they trust information from Google itself. Somehow, it becomes widely accepted that the best information on Google ranking is from those third parties but not Google.


I'm not sure it is so cut and dried. Who is more likely to give you accurate information on how to game Google's ranking: Google themselves, or an SEO firm? I suspect that Google has far less incentive to provide good information on this than an SEO firm would.


Google will give you advice on how to not be penalized by Google. They won’t give you advice on how to game the system in your favor.

The more Google helps you get ahead, the more you end up dominating the search results. The more you dominate the results, the more people will start thinking to come straight to you. The more people come straight to you, the more people never use Google. The less people use Google, the less revenue Google generates.


> The more you dominate the results, the more people will start thinking to come straight to you.

This is a possible outcome but there are people that type in google.com and then the name of their preferred news site, their bank, etc, every day.

The site with the name they search dominates that search but they keep searching it.


I would like to know what dollar amount Google makes on people typing things like “Amazon” into google search and then clicking the first paid result to Amazon.


Search by itself provides a very small share of the profits. Google makes most of its money on the websites it surfaces via ads.


It’s the same on YouTube - the majority of the people who work there seem to have no idea how “the algorithm” actually works - yet they still produce all sorts of “advice” on how to make better videos.


I get this feeling the YT algorithm is sticky and, like, "chooses" who to promote in some self-reinforcing loop.


There’s an easy proof that those SEO consultants have a point: find a site that according to Google’s criteria will never rank, which has rocketed to the top of the search rankings in its niche within a couple months. That’s a regular thing and proves that there are ways to rank on Google that Google won’t advise.


Of course SEO consultants are trusted more than Google. They often ignore what Google says and bring good results for their clients.

Google has a vested interest in creating a good web experience. Consultants have an interest in making their clients money.

Link building is a classic example where good consultants deliver value. (There are more bad consultants than good ones, though.)


It could be premature to place fault with the SEO industry. Think about the incentives: Google puts articles out, but an SEO specialist might have empirical knowledge from working across a number of web properties. It's not that I wouldn't trust Google's articles, but specialists might have discovered undocumented methods for giving a boost.


They certainly want you to believe that.


The good ones will share the data/trends/case studies that would support the effectiveness of their methods.

But the vast majority are morons, grifters, and cargo culters.

The Google guidance is generally good and mildly informative but there’s a lot of depth that typically isn’t covered that the SEO industry basically has to black box test to find out.


> Websites trust these SEO consultants and growth hackers more than they trust information from Google itself.

That's because websites' goals and Google's goals are not aligned.

Websites want people to engage with their website, view ads, buy products, or do something else (e.g. buy a product, vote for a party). If old content does not or detracts from those goals, they and SEO experts say, it should go because it's dragging the rest down.

Google wants all the information and for people to watch their ads. Google likes the long tail; Google doesn't care if articles from the 90's are outdated because people looking at it (assuming the page runs Google ads) or searching for it (assuming they use Google) means impressions and therefore money for them.

Google favors quantity over quality, websites the other way around. To oversimplify and probably be incorrect.


Google actively lies on an infinite number of subjects. And SEO is a completely adversarial subject where Google has an interest in lying to prevent some behaviors. While consultants and "growth hackers" are very often selling snake oil, that doesn't make Google an entity you can trust either.


Hey, don't do that. That's bad. But if you keep doing it, you'll get better SEO. No, we won't do anything to prevent this from being a way to game SEO.

Words without action are useless.


"Google says you shouldn't do it" and "Google's search algorithm says that you should do it" can both be true at the same time. The official guidance telling you what to do doesn't track with what the search algorithm uses to decide search placement. Nobody's going to follow Google's written instructions if following the instructions results in a penalty and disobeying them results in a benefit.


If Google says one thing and rewards a different thing, guess which one will happen.


They say "Google doesn't like "old" content? That's not a thing!"

But who knows, really? They run things to extract features nobody outside of Google knows that are proxies for "content quality". Then run them through pipelines of lots of different not-really-coordinated ML algorithms.

Maybe some of those features aren't great for older pages? (broken links, out-of-spec html/js, missing images, references to things that don't exist, practices once allowed now discouraged...like <meta keywords>, etc). And I wouldn't be surprised if some part of overall site "reputation" in their eyes is some ratio of bad:good pages, or something along those lines.

I have my doubts that Google knows exactly what their search engines likes and doesn't like. They surely know which ads to put next to those maybe flawed results, though.


I don’t know man, I read it but I’ve learned to judge big tech talk purely by their actions and I don’t think there’s a lot of incentive built into their system that supports this statement.


A few tweets down they qualify this, saying that it might improve some things like indexing of the rest of the site:

https://twitter.com/searchliaison/status/1689068723657904129

My understanding is that if you have a very large site, removing pages can sometimes help because:

- There is an indexing "budget" for your site. Removing pages might make reindexing of the rest of the pages faster.

- Removing pages that are cannibalising each other might help the main page for the keywords to rank higher.

- Google is not very fond of "thin wide" content. Removing low quality pages can be helpful, especially if you don't have a lot of links to your site.

- Trimming the content of a website could make it easier for people and Google to understand what the site is about and help them find what they are looking for.


Google search ranking involves lots of neural networks nowadays.

There is no way the PR team making that tweet can say for sure that deleting old content doesn't improve rank. Nobody can say that for sure. The neural net is a black box, and its behaviour is hard to predict without just trying it and seeing.


Speaking from experience as someone who is paid for SEO optimization there's a list a mile long of things Google says "doesn't work" or you "shouldn't do" but in fact work very well and everyone is doing it.


I remember these kinds of sources, right from the inside, in the Matt Cutts era 15+ years ago, encouraging and advising so many things which were later proven to not be the case. I wouldn't take this at face value only because it was written by the official guide.


Google says so many things about SEO which are not true. There are some rules which are 100% true and some which they just hope their AI thinks they are true.


RTFA does not include reading through every linked source.


Never has your username been so accurate as to who you are.


There's an awful lot of SEO people on Twitter that claim to be connected to Google, and the article he links on the Google domain as a reference doesn't say anything on the topic that I can find. I'm reluctant to call that an official source.


Journalist here. Danny Sullivan works for Google, but spent nearly 20 years working outside of Google as a fellow journalist in the SEO space before he was hired by the company.

He was the guy who replaced Matt Cutts.


1st paragraph is correct, 2nd not quite - Matt Cutts was a distinguished engineer (looking after web spam at Google) who took on the role of the search spokesperson - it’s that role Danny took over as “search liaison”


No. But it's also complicated, as Matt did things beyond web spam. Matt worked within the search quality team, and he communicated a lot from search quality to the outside world about how Search works. After Matt left, someone else took over web spam. Meanwhile, I'd retired from journalism writing about search. Google approached me about starting what became a new role of "public liaison of search," which I've done for about six years now. I work within the search quality team, just as Matt did, and that type of two-way communication role he had, I do. In addition, we have an amazing Search Relations team that also works within search quality, and they focus specifically on providing guidance to site owners and creators (my remit is a bit broader than that, so I deal with more than just creator issues).


thanks, Ernie!


I'm the source. I officially work for Google. The account is verified by X. It's followed by the official Google account. It links to my personal account; my personal account links back to it. I'm quoted in the Gizmodo story that links to the tweet. I'm real! Though now perhaps I doubt my own existence....


He claims to work for Google on X, LinkedIn, and his own website. I am inclined to believe him because I think he would have received a cease and desist by now otherwise.


He claims to work for Google as search "liaison". He's a PR guy. His job is to make people think that Google's search system is designed to improve the internet, instead of it being designed to improve Google's accounting.


I actually work for our search quality team, and my job is to foster two-way communication between the search quality team and those outside Google. When issues come up outside Google, I try to explain what's happened to the best I can. I bring feedback into the search quality team and Google Search generally to help foster potential improvements we can make.


Yes. All this is saying that you do not write any code for the search algorithms. Do you know how to code? Do you have access to those repos internally? Do you read them regularly? Or are you only aware of what people tell you in meetings about it?

Your job is not to disseminate accurate information about how the algorithm works but rather to disseminate information that google has decided it wants people to know. Those are two extremely different things in this context.

I work on these kinds of vague "algorithm"-style products in my job, and I know that unless you are knee-deep in it day to day, you have zero understanding of what it ACTUALLY does, what it ACTUALLY rewards, what it ACTUALLY punishes, which can be very different from what you were hoping it would reward and punish when you built and trained it. Machine learning still does not have the kind of explanatory power to do any better than that.


No. I don't code. I'm not an engineer. That doesn't mean I can't communicate how Google Search works. And our systems do not calculate how much "old" content is on a site to determine if it is "fresh" enough to rank better. The engineers I work with reading about all this today find it strange anyone thinks this.


Probably not; anyone can claim to work for these companies with no repercussions, because is it a crime? Maybe if they're pricks that lower these companies' public opinion (libel), but even that requires a civil suit.

But lying on the internet isn't a crime. I work for Google on quantum AI solutions in adtech btw.


He’s been lying a long time, considering that he’s kept the lie up that he’s an expert on SEO for nearly 30 years at this point, and I’ve been following his work most of that time.


with a gold badge, half a million followers, as well as a wikipedia page that mentions he works at google?


Did you notice that nowadays a lot of websites have a lot of uninteresting drivel giving a "background" to whatever the thing was you were searching for before you get to read (hopefully) the thing you were searching for?

People discovered that Google measures not only how much time you stay on a webpage but also how much you scroll to define how interesting a website is. So now every crappy "tech tips" website that has an answer that fits in a short paragraph now makes you scroll two pages before you get the thing you actually wanted to read.


I've noticed something similar on youtube.

I search for "how to do X", and instead of just showing me how to do it, which might take 30 seconds, they put a ton of fluff and filler in to the video to make it last 5 minutes.

Typical video goes something like:

0 - Ads, if you're not using an ad blocker

1 - Intro graphics/animation

2 - "Hi, I'm ___, and in this video I'm going to show you how to do X"

3 - "Before I get in to that, I want to tell you about my channel and all the great things I do."

4 - "Like and subscribe."

5 - "Now let's get in to it..."

6 - "What is X?"

7 - "What's the history of X?"

8 - "Why X is so great."

9 - finally... "How to do X"

Fortunately you can skip around, but it's still a bunch of useless fluff and garbage content to get to the maybe 30 seconds of useful information.


What makes this worse is that there's an increasing trend for how-tos to be only available on video.

As someone who learns best by reading, I'm already at a disadvantage with video to begin with. To make it worse, instructional videos tend to omit a great deal of detail in the interest of time. Then when you add nonsense like you're pointing out, it makes the whole thing a frustrating and pointless activity.


The nonvideo content is still there but shitty search is prioritizing video. Obviously video ads pay better.


Notice how Google frequently offers YouTube recommendations at the top of things like mobile results, or those little expandable text drop-downs? My guess is it's because clicking that lets them serve a high-intent video ad at a higher CPM than a search text ad.


As someone who is Deaf, many of these videos are not accessible. They rely on shitty Google auto captions which aren't accurate at least 25% of the time.


It gets even better when you subscribe to YouTube Premium.

You get no random ad content which just cuts into the feed at will, which makes for a somewhat better experience. But there's the inevitable "NordVPN will guarantee your privacy", "<some service here which has no ads and was made by content creators so you don't have to look at ads if you subscribe but hey all our content is on YT but with ads and here is an ad>" ad.

There is no escape. I actually pay for YT premium and it's SO much better than being interrupted by ads for probiotic yoghurt or whatever. I know there are a couple of plugins out there which I have not tried (I think nosponsors is one of them) but I really don't think there is any escape from this stuff.


uBlock Origin and Sponsor Block provide a better experience than YouTube Premium.


... on desktop. My kingdom for Youtube ad blocking on my TV.


SmartTubeNext, if it's an Android smart TV.


That is explicitly encouraged by YouTube, if your video is at least 8 minutes (I think) you're allowed to add more ads.


I've noticed that any video that is 10:XX minutes long almost always is useless

They're probably reaching some YT threshold to have more ads show on it


I can only tolerate modern YouTube at 1.5 or 2x playback speed. Principally because speaking slower to stretch video length has become endemic.


Same. I think it's a mix of that and this "presenter voice" everyone thinks they have to use. My ADHD brain doesn't focus on it well because it's too slow, so it's useless to me. All my life I've been told that when presenting I should speak slowly and articulately, while the reality is that watching anyone speak that way drives me nuts.


What's great about writing is that readers can go at their own pace. When speaking, you have to optimize for your audience and you probably lose more people by being too fast vs. the people you lose by talking too slow. I have to say I appreciate YouTubers that go a million miles an hour (hi EEVBlog). As a native speaker of English, I can keep up. But you have to realize, most people in the world are not native speakers of English.

(The converse is; whenever I turn on a Hololive stream I'd say that I pick up 20% of what they're saying. If they talked slower, I would probably watch more than every 3 months. But, they rightfully don't feel the need to optimize for non-native speakers of Japanese.)


> What's great about writing is that readers can go at their own pace.

100%, and you can skim it so that your pace subconsciously varies depending on how relevant or complex that section is.


This is why I hate the trend of EVERYTHING being made into a video. Simple things that mean I have to watch 4-5min of video and have my eardrums blasted by some dubstep intro so some small quiet voice can say "Hi guys, have you ever wanted to do _x_ or _y_ more easily?" before finally just giving me the nugget of information I came for.

I wish more stuff were available in just text + screenshots..


Some of those people seem to be speaking so slow that it is excruciating to listen to them. When I find someone who speaks at a normal speed and I have to slow the video down, they usually have more interesting things to say.

That said, tinkering before and after youtube has been two different worlds. I really like having video to learn hands-on activities. I just wrapped up some mods to a Rancilio Silvia, and I noticed my workflow was videos, how-to guides and blog posts, broader electrical information documentation, part specific manuals / schematics, and my own past knowledge. I felt very efficient having been through the process before, and knowing when to lean on which resource. But the videos are by far the best resource to orient myself when first jumping in to the project, and thus save me a lot of time.


I mean, people are bad at editing. "I didn't have time to write a short letter, so I've written a long letter instead." I don't think it's a conspiracy.

I definitely write super long things when I consciously make the decision to not spend much time on something. Meanwhile, I've been working on a blog post for the better part of 2 years because it's too long, but doesn't cover everything I want to discuss. If you want people to retain the content, you have to pare it down to the essentials! This is hard work.


> I mean, people are bad at editing. "I didn't have time to write a short letter, so I've written a long letter instead." I don't think it's a conspiracy.

Making a long video isn't like writing a rambling letter. It takes work to make 10 minutes of talk out of a 1-minute subject. And mega-popular influencers do this, not just newbs who haven't learned how to edit properly yet.


"Tell me everything you know about Javascript in 1 minute." Figuring out what not to say is the hard part of that question. Rambling into the camera for an hour is easy.


But we're not talking about people taking 10 minutes to summarize a complex topic. We're talking about people taking 10 minutes to deliver 30 seconds of simple, well-delineated info.

This is something that happens a lot. I'll Google a narrow technical question that can be answered in three lines of text--there's literally nothing more of value to say about it--and all the top hits are 5+ minute videos. That doesn't happen by accident.


There's certainly a wide gamut of creators out there, and the handymen I've seen have videos like you mentioned. I imagine the complaints above are about the far more commercialized channels that do in fact model their videos after YT's algorithm.


It doesn't have to be a literal conspiracy. Why do you reject the possibility that people and organizations are reacting to very real and concrete financial incentives which clearly exist?


Certainly there are a lot of people that stretch their videos out to put in more ads, but not everyone with a long video is playing some metrics optimization game. They're just bad at editing.

I think the situation that people run into is something like "how do I install a faucet" and they are getting someone who does it for a living explaining it for the first time. Explaining it for the first time is what makes it tough to make a good video. Then there are other things like "top 10 AskReddit threads that I feel like stealing from this week" and those are too long because they are just trying to get as much ad revenue as possible. The original comment was about howtos specifically, and I think you are likely to run into a lot of one-off channels in those cases.


SponsorBlock is great for cleaning this crap up. Besides skipping sponsor segments, I have it set to autoskip unpaid/self promotion, interaction reminders, intermissions/intros, and endcards/credits, with filler tangents/jokes set to manual skip.


SponsorBlock is basically mandatory to watch YouTube now. I can’t even imagine what it would be like without Premium.


One thing I really wish sponsorblock would add is the ability to mask off some part of the screen with an option to mute. More and more channels are embedding on-screen ads, animations, and interaction "reminders" that are, at best, distracting.


> More and more channels are embedding on-screen ads, animations, and interaction "reminders" that are, at best, distracting.

Do you have an example of a video that does this? Seems like an interesting problem to solve.


uBlock Origin is my extension of choice for this. It makes it really easy to block those distractions, and there are quite a few pre-existing filters to choose from.


I think you missed what the poster is asking for. They want to block a portion of the video itself. For example when you watch the news on TV, there is constant scrolling text on the bottom of the screen with the latest headlines. They want to block stuff like that.


I believe you're right! I can't think of any extension that would be able to modify the picture of a stream itself in real-time. What came to my mind was the kind of 'picture-in-picture' video that some questionable news sites display as you scroll down an article, usually a distracting broadcast which is barely related to the news itself.


There’s a channel that explains how to pronounce words that is a particularly bad offender. They talk about the history of the word up front, but without ever actually saying the word. They only pronounce it in the last few seconds, right as the thumbnail overlay appears.


You forgot the sponsor stuff that's 20% of the video's length.


Yeah, I usually skip up to half of a typical video before they get to the point, sometimes more. People feel like just getting down to business is somehow wrong; they first need to tell the story of their life, how they came to the decision of making this video, and why I may want to watch it. Dude, I am already watching it, stop selling it and start doing it!


> Did you notice that nowadays a lot of websites have a lot of uninteresting drivel giving a "background" to whatever the thing was you were searching for before you get to read (hopefully) the thing you were searching for?

I know you came here looking for a recipe for boiled water, but first here's my thesis on the history and cultural significance of warm liquids.


It’s interesting that water is often boiled in metal pots. There are several kinds of metal pots. Aluminum, stainless steel, and copper are often used for pots to boil water in.

Water boils in pots with different metals because only temperature matters for boiling water. If the water is 100c at sea level, it will boil.


Finally, that SA writing skill we learned in school can be put to practice!


Also, you might want to look into beer as a replacement for water, or mineral oil


Tea made with mineral oil is awful.


Also, "jump to recipe" could be a simple anchor tag that truly skips the drivel. But for some reason it executes ridiculous JavaScript that animates the scroll just slowly enough to trigger every ad's intersection observer along the way.


I hate when you click a search result and it doesn't even have the keyword(s) you searched...


There should be a search engine penalty for loquacious copy.

I want the most succinct result possible when I search the web.


This has been going on for about a decade now. This alone has caused me to remove myself from Google's products and services. They have unilaterally made the internet worse.


How do they know how much you scroll? Does this mean you get penalized in search results if you don't use Google Analytics?


I mean, penalizing sites in their search platform for not using their advertising platform would be blatantly anticompetitive behavior, right?

Surely Google is too afraid of our vigorous pro-competition regulatory agencies and would never do such a thing.


And the cherry on top is that they also own the browser. It helps to thwart attempts to "scam" Google Analytics, and to track those poor a-holes that don't use it.


Hello from Google! I work for our search ranking team. Sadly, we can't control publishers who do things that we do not advise or recommend.

We have no guidance telling publishers to get rid of "old" content. That's not something we've said. I shared this week that it is not something we recommend: https://twitter.com/searchliaison/status/1689018769782476800

This also documents the many times over the years we've also pushed back on this myth: https://www.seroundtable.com/google-dont-delete-older-helpfu...


Are the employees at Google working on Search aware of how bad search results have become in the past year or two? Literally almost everyone I know, inside and outside of tech, has noticed a significant downgrade in quality from Google search results. And a lot of it is due to artificially inflated SEO techniques.


We've been diligently working to improve the results through things like our helpful content system, and that work is continuing. You can read about some of it in a recent post here (and it also describes the Perspectives feature that's live on mobile): https://blog.google/products/search/google-search-perspectiv...


It's great that you responded to the question. Is there a reason you didn't answer it, though?


"We've been diligently working to improve the results" was the response to the question of "Are the employees at Google working on Search aware of how bad search results have become in the past year or two?" I thought that was a clear response.

To be more explicit, yes, we're aware that there are complaints about the quality of search results. That's why we've been working in a variety of ways, as I indicated, to improve those.

We have continued to build our spam fighting systems, our core ranking systems, our systems to reward helpful content. We expanded our product reviews system to cover all types of reviews, as this explains: https://status.search.google.com/incidents/5XRfC46rorevFt8yN...

We regularly improve these systems, which we share about on this page: https://status.search.google.com/products/rGHU1u87FJnkP6W2Gw...

The work isn't stopping. You'll continue to see us revise these systems to address some of the concerns people have raised.


It was clearly a response, yes, but an answer is always better than a response. Thank you for answering!


I am of the opinion that it's just the internet becoming more spammy and unhelpful rather than Google search becoming bad. Every Tom and his mom seems to have a blog/website which they don't even write themselves. Most of the content on the internet is now for entertainment rather than purpose or knowledge. So I do wonder if it's just the state of the internet these days. As a layman, these days I just go directly to Wikipedia/Reddit/YouTube rather than searching on Google.


The Internet is becoming spammy and bad because of Google's rules for ranking. The fact that Google favors newer content and longer pages with filler text is why people are making the content lower quality.


> Are the employees at Google working on Search aware of how bad search results have become in the past year or two?

I would assume they didn't answer this because the answer is either "No" because echo chamber or "Yes" but they don't want to say that publicly.


Because politics, not solutions, drive big tech.


My strong impression is that in the last two years a couple of changes were rolled out to search that sent it straight into the sewer - search seemed to be tweaked to crassly, crudely put any product name above anything else in the results. But since then, it seems like quality has crept back up again. Simple product terms still get top billing, but more complicated searches aren't nerfed.

So it seems the search quality team exists but periodically gets locked in the closet by advertising.

I know you can't verify anything directly, but maybe we could set up a system of code for you to communicate what's really happening...


You are also talking to someone who is on the PR team. This term gets thrown out a lot, but in this case it is factually true: you are literally talking to a shill. I mean no disrespect to Danny, but you are not going to get an honest and straightforward answer out of him.

If you think I am exaggerating, try to prompt him to see if you can get him to acknowledge that Googles current systems incentivize SEO spam. See if he passes the Turing test.


Don't kick the messenger. It's already good that someone (allegedly) from a department related to the situation could give some input. No need to dump all your frustrations on them


You realize they have to combat an entire fleet of marketers and writers who are trying to leverage their algorithms?


Facebook doesn't have guidance telling content creators to publish conspiracy theories, but their policies are willfully optimized to promote it. Take responsibility for the results of your actions like an adult.


We don't have a policy or any guidance saying to remove old content. That said, we absolutely recognize a responsibility to help creators understand how to succeed and what not to do in terms of Google Search. That's why we publish lots of information about this (none of which says "old content is bad"). A good place to review the information we provide is our Search Essentials page: https://developers.google.com/search/docs/essentials


> Take responsibility for the results of your actions like an adult.

"your actions", give me a break. The parent commenter doesn't own Google, and you aren't forced to use the platform.

Are people on this site really convinced that an L3 Google engineer can flick the "Fix Google" switch on the search engine?


> Are people on this site really convinced that an L3 Google engineer can flick the "Fix Google" switch on the search engine?

No, it's just when someone speaks on behalf of the company with the terms "we," they are typically addressed with "you." That doesn't mean we think they're the CEO. Are you unfamiliar with this concept? I can send you an SEO guide on it.


You're missing the point; This guy has zero power over what Google does so publicly berating him is not going to accomplish anything.

And anyways, the sentence "Take responsibility for the results of your actions like an adult" actually does imply he has some personal responsibility here. It's not helpful to the discussion and it's rude.


If you choose to throw yourself on a public forum doing PR for a company doing dumb things and you also insult everyone's intelligence by lying to them, people are gonna be a little rude


"choose to throw yourself"

"a company doing dumb things"

"insult everyone's intelligence"

"lying to them"

That's a little hyperbolic, don't you think? Do you even hear yourself? I fully understand Google hate but directing it at one person who is literally just doing their job (and hasn't lied to anyone despite your allegation) is childish and counterproductive. Save that for Twitter.

Danny has been here since 2008. Your account was created in 2022.

And also, "people" aren't being rude, you are. Own your actions.


No, I'm not being hyperbolic. There is one reason for the SEO algorithm to reward longer articles, and that's ad revenue. To paint it as anything else is lying. And you opened up this conversation extremely rudely with "OMG are you so dumb you think he owns Google."

How long I've been here is irrelevant.


>you opened up this conversation extremely rudely

That wasn't me. Maybe pay closer attention?

>reward longer articles

The age of articles was being discussed, not article length. Maybe pay closer attention?

Actually... you know what, never mind.


The age of the articles was discussed in the original article, but when I was speaking to this engineer, I was talking about the length of articles which is the main criticism levied against Google SEO. I'm aware you didn't read any of it


I followed the thread just fine. You accused me of being rude (it was someone else) and also accused the other commenter of lying. Neither of which are true.

You did that, not me. It's you who seem to be having a problem with understanding the thread.


You said L3 so I was curious. I looked up the guy's LinkedIn [0] and honestly an L3 engineer would have a lot more context about Google's search. Danny, what do you even do?

[0] https://www.linkedin.com/in/dannysullivan/


Before Google, Danny Sullivan was a well respected search engine blogger/journalist. As far as I know, he isn't an engineer. There's no need to be rude.


I work for our search quality team, directly reporting to the head of that team, to help explain how search works to people outside Google and to bring concerns and feedback back into the team so we can look at ways to improve. I came to the position about six years ago after retiring from writing about search engines as a journalist, where I had been explaining how they work to people since 1996.


So you're PR


Yes, I believe you are correct.


That’s impressive! Congrats.


Making statements that you wish publishers wouldn't do various things, doesn't change the actual incentives that the real-world ranking algorithms create for them.

I mean, saying that you should design pages for people rather than the search engine clearly hasn't shut down the SEO industry.


This is the usual "if a hazard isn't labeled, it isn't a hazard" fallacy.


It doesn't matter if your guidance discourages it, your SEO algorithm is encouraging it. What you call "helpful" in your post is what is financially helpful to Google, not what's helpful to me.

There's no denying Google encourages long rambling nonsense over direct information


No one has demonstrated that getting rid of "old" content somehow makes the rest of the site "fresh" and therefore ranks better. What's likely the case is that some people have updated content to make it more useful -- more up-to-date -- and the content being more helpful might, in turn, perform better. That's a much different thing than "if you have a lot of old content, the entire site is somehow old." And if you read the CNET memo, you'll see there's confusion between these points.


But there's the rub, you're not making content more helpful. You're making it longer and more useless so we have to scroll down more so Google can rake in more ads. The fact that you're calling it more "helpful" is insidious. That's why garbage SEO sites are king on the internet right now. It's the same thing you guys do with Youtube, where you decreased monetization for videos under a certain length. Now every content creator is encouraged to artificially inflate the length of their video for more ads.

You're financially rewarding people for hiding information.


This is our guidance about how people should see themselves to create helpful content to succeed in Google Search: https://developers.google.com/search/docs/fundamentals/creat...

That includes self-assessment questions, including this:

"Are you writing to a particular word count because you've heard or read that Google has a preferred word count? (No, we don't.)"

That's not telling people to write longer. Our systems are not designed to reward that. And we'll keep working to improve them.


"Google is destroying the internet" is a good way to put it. Ad dollars are their only priority. I hope Google dies because of it.


What a fantasy. It does not show any sign of profit decrease. How would a company die with $279.8B revenue, steadily increasing yearly?


I think the theory of Google's death is that they are "killing the golden goose." The idea is that they are killing off all the independent websites on the internet. That is, all the sites besides Facebook/Instagram/Twitter/NetFlix/Reddit/etc. that people access directly (either through an app or a bookmark) and which (barring Reddit) block GoogleBot anyway.

These are all the sites (like CNET) that Google indexes which are the entire reason to use search. They are having their rankings steadily eroded by an ever-rising tide of SEO spam. If they start dying off en masse and if LLMs emerge as a viable alternative for looking up information, we may see Google Search die along with them.

As for why their revenues are still increasing? It's because all the SEO spam sites out there run Google Ads. This is how we close the loop on the "killing the golden goose" theory. Google uses legitimate sites to make their search engine a viable product and at the same time directs traffic away from those legitimate sites towards SEO spam to generate revenue. It's a transformation from symbiosis/mutualism to parasitism.

Edit: I forgot to mention the last, and darkest, part of the theory. Many of these SEO spam sites engage in large-scale piracy by scraping all their content off legitimate sites. By allowing their ads to run on these sites, Google is essentially acting as an accessory to large-scale, criminal, commercial copyright infringement.


It directs traffic away not to generate revenue but to generate revenue faster this quarter, in time for the report. They could make billions without liquefying the internet, but they would make those billions slowly.


> Google uses legitimate sites to make their search engine a viable product and at the same time directs traffic away from those legitimate sites towards SEO spam to generate revenue.

[Disclosure: Google Search SWE; opinions and thoughts are my own and do not represent those of my employer]

Why do you assume malicious intent?

The balance between search ranking (Google) and search optimization (third-party sites) is an adversarial, dynamic game played between two sides with inverse incentives, taking place on an economic field (i.e. limited resources). There is no perfect solution; there’s only an evolutionary act-react cycle.

Do you think content spammers spend more or less resources (people, time, money) than Google’s revenue? So then the problem becomes how do you win a battle with orders of magnitude less people, time, and money? Leverage, i.e., engineering. You try your best and watch the scoreboard.

Some people think Google is doing a great job; some think we couldn’t be any worse. The truth probably lies across a spectrum in the middle. So it goes with a globally consumed product.

Also, note, Ads and Search operate completely independently. There are no signals going from Ads to Search, or vice versa, to inform rankings; Search can't even touch a lot of the Ads data, and Ads can't touch Search data. Which makes your theory misinformed.


> Why do you assume malicious intent?

Not GP, but to me, admittedly a complete non-expert on search, there is so much low-hanging fruit if search result quality were anywhere on Google's radar that it is really difficult not to assume malicious intent.

Some examples:

- Why is Pinterest flooding the image results with absolute nonsense? How difficult would it be to derank a single domain that manages to totally screw Google's algorithm?

- Why is there no option for me to blacklist domains from the search results? Are there really challenges here that can't be practically solved with a couple of minutes of thinking?

- Does Google seriously claim they can't differentiate between Stack Overflow and the content-copying rip-off SEO spam sites?


> why there is no option for me to blacklist domains from the search result?

You might already be aware of this, but you can use uBlock Origin to filter Google search results.

Click on the extension --> Settings --> My Filters. Paste this at the bottom:

    google.*##.g:has(a[href*="pinterest.com"])

Every time I get misled into clicking on an AI aggregator site, my filter list grows...
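
The same pattern works for any domain you want gone; the domains below are just placeholders, swap in whatever keeps polluting your results:

    google.*##.g:has(a[href*="some-ai-aggregator.example"])
    google.*##.g:has(a[href*="content-farm-recipes.example"])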


The issues you pointed out might be due to a company policy of not manually manipulating search results and leaving it all to the algorithm. It can be argued that this leads them to improve their algorithm, although at this point I don't think any algorithm other than a good and big LLM/classifier-transformer can solve the ranking problem, and that is probably not economical or something. But OTOH they manually ban domains they deem to be not conformant to the views of the Party. (not CCP, 1984)


> Also, note, Ads and Search operate completely independently

That’s the mistake. They should be talking. Sites that engage in unethical SEO to game search rankings should be banned from Google’s ad platform. Why aren’t they? Because Google is profiting from the arrangement.


> Why do you assume malicious intent?

There doesn't have to be any malicious intent, just an endless chase for increased profit next quarter. SEO spam has more ads, thus generates more income for Google. Even if Ads and Search operate "completely independently", there must be a person in the corporate hierarchy who has control over both and could push the products to better synergize and make that KPI tick up.

Actually deranking sites which feature more than three Google Ads banners would improve search quality (mainly by making sites get to the point rather than padding a simple answer into an essay like an 8th grader at an exam) - but it would reduce Ads income so you cannot do it, no matter how independent you claim to be.


I think dismissing the relationship and impact adtech and search continue to have on web culture is an incredibly pointy-headed misstep. It's the sort of willful oversight that someone makes when their career relies on something being true.

Unless you have a clear view by leadership of what they desire the web should be and are willing to disclose it in detail, then there's not much to add by saying you work in Search.


When I enter a search query, that goes into Ads so that half the page can be relevant Ads instead of search results. That's a signal.


I've also kinda been wondering if Google has been ruining its search to bolster YouTube content.


> What a fantasy. It does not show any sign of profit decrease. How would a company die with $279.8B revenue, steadily increasing yearly?

At one time both buggy whips and Philco radios had hockey stick growth charts, too.

You must not be old enough to remember when people thought MySpace would always drive the internet.


> You must not be old enough to remember when people thought MySpace would always drive the internet.

You know there's a difference between "people thought" and dollars.

Some people think the earth is flat. Opinions can change very quickly - like 5 years ago, people thought Elon Musk was the hero of the internet.

Here are the stats on MySpace revenue: it generated $800 million in revenue during the 2008 fiscal year.


And MySpace generates about 10% of that now.

Google could be generating $27 billion in revenue in 2038 and be considered a massive failure compared to what it is now.

I fail to see the point you are trying to get at?


Their search results are declining rapidly in quality. "<SEARCH QUERY> reddit" is one of their most common searches. Their results are filled with SEO spam and bots now.

At some point a competitor will emerge. The tech crowd will notice it and begin to use it. Then it will go widespread.


I suspect that the revenue increases have more to do with the addition of new users in developing markets rather than actual value added. Once all potential users have been reached, Google will have to actually improve their product.


No such suspicion is necessary. Google's revenue from the United States has only increased as a share of its total revenue over the past decade. https://abc.xyz/assets/4c/c7/d619b323f5bba689be986d716a61/34...


I'm reminded of the 00's era joke that Microsoft could burn billions of dollars, pivot to becoming a vacuum cleaner manufacturer, and finally make something that doesn't suck.

I don't think Google dying would be good (lots of things would have to migrate infra suddenly), but the adtech being split off into something else would certainly be a welcome turn of events, IMO. I'm tired of seeing promising ideas killed because they only made 7-figure numbers in a spreadsheet where it'd have been viable on its own somewhere it wasn't a rounding error.


Before someone suggests a new search engine where the ranking algorithm is replaced with AI, I would like to propose a return to human-curated directories. Yahoo had one, and for a while, so did Google. It was pre-social-media and pre-wiki, so none of these directories were optimized to take advantage of crowdsourcing. Perhaps it's time to try again?

https://en.wikipedia.org/wiki/Google_Directory


> Before someone suggests a new search engine where the ranking algorithm is replaced with AI, I would like to propose a return to human-curated directories. Yahoo had one, and for a while, so did Google. It was pre-social-media and pre-wiki, so none of these directories were optimized to take advantage of crowdsourcing.

False. Google Directory (and many other major-name, mostly now defunct, web directories) were powered by data from DMOZ, which was crowdsourced (and kind of still is, through Curlie [0], though while some parts of the site show updates as recently as today, enough fairly core links are dead or without content that it's pretty obviously not a thriving operation). Also, it was not pre-wiki: WikiWikiWeb was created in 1995, DMOZ in 1998. It was pre-Wikipedia, but Wikipedia wasn't the first wiki.

[0] https://curlie.org/


Interestingly, a static snapshot of DMOZ is still out there: https://dmoztools.net/ | https://web.archive.org/web/20180126194656/http://dmoztools....


Actually, several static snapshots exist (a benefit of open licensing), despite the fact that attempts to fork and continue it have not been so successful. In addition to the one upthread, there are also:

https://dmoz-odp.org/

http://www.odp.org/homepage.php


One issue there was that human-curated directories are everywhere. Hacker News is one. Reddit is another. And during the Yahoo times, directories were made everywhere and all over the place. Which one is authoritative? There are too many of them out there.

That said, in NL a lot of people's home page for a long time was set to startpagina.nl, which was just that: a cool directory of websites that you could submit sites to. It seems to still exist, too.


I don't think we need any "AI" in the modern sense of that word. It would be an improvement to bring google back to its ~2010 status.

Not sure if the kagi folks are willing to share, but I get the impression that pagerank, tf-idf and a few heuristics on top would still get you pretty far. Add some moderation (known-bad sites that e.g. repost stackoverflow content) and you're already ahead of what google gives me.


I feel like they presume I'm a gullible person they need to protect who is just on the Internet for shopping and watching entertainment.

Wasn't the point of them tracking us so much to customize and cater our results? Why have they normalized everything to some focus group persona of Joe six-pack?

***

Let's try an experiment

Type in "Chicago ticket" which has at least 4 interpretations, a ticket to travel to Chicago, a citation received in Chicago, A ticket to see the musical Chicago and a ticket to see the rock band Chicago.

For me I get the Rock band, citation, baseball and mass transit ticket in that order.

I'm in Los Angeles, have never been to Chicago, don't watch sports, and don't listen to the rock band. Google should know this with my location, search and YouTube history but it apparently doesn't care. What do you get?


Or it knows way too much. An alternative explanation could also be:

It knows you're in LA and did not look up "Chicago flight", so you probably aren't looking for flights there.

Chicago musical isn't playing in LA so probably not the right kind.

Probably why most people get parking ticket listed higher. It would be interesting to see the results in a city where the band, team or musical has an event soon.


Google tracks you to "customize search results" and I even have a 100% google'd phone (Pixel) but when I'm searching for restaurants it still shows me stuff from portland oregon instead of portland maine. This despite literally having my "Home" marked in my google account as portland maine.


Are you ignoring the ads?

My ads are:

- Flights
- Chicago the Musical
- Flights
- More Flights

My search results are:

- Citation
- Citation payment plan
- News report on lawsuit regarding citations in Chicago
- Baseball

I also live very far from Chicago, and the only time I was there was for a connecting flight some time in the '90s.


I get four top results for paying a parking ticket in Chicago, a city I’ve never been to.


Same here. Never been to Chicago, live in Germany.

https://imgur.com/a/BXRnqIB


Going deep into personalization on Google.com (answering queries using your signed-in data across other Google properties) feels like high risk, low reward. In a post-GDPR, right-to-be-forgotten environment, they know they have targets on their backs. Is super deep personalization really worth getting slapped with megafines?

Where you'll see integrations like this used to be Assistant, but is now Bard. Both of which have lawyer-boggling EULAs and a brand they can sacrifice if need be.


Aren't they both doing the economically incentivized thing tho? Are you saying maybe some things should be beyond economic incentives?


Yes, some things should be beyond economic incentives. Destroying the historical record, for instance. We have plenty of precedent around that, now that we realised it's bad.


This is interesting, as I am actually doing the same thing with a site I have; I noticed my crawl budget has shrunk, especially this year, and fewer new articles are being indexed.

I suspect this is a long-term play for Google to phase out search and replace it with Bard. Think about it: all these articles are doing now is writing a verbose version of what Bard gives you directly, unless it's new human content.

Google has in essence stolen all their information by scraping it and storing it in a database for its LLM, and is offering this knowledge directly to users, so in a way, this is akin to Amazon selling its own private-label products.


An article about reduced quality was pretty popular on HN a few years ago, about how Google results look like ads. But I believe we have hit a new low recently. Perhaps that is true of the overall quality of publications on the net. The number of either approved news sites without significant content or outright click farms is immense, even for topics that should net results. A news site filter would already help a lot, but even then the search seems to react only to buzzwords. Sometimes even to terms you didn't search for at all but that are often associated with said buzzwords.


They could just noindex them.


Google still needs to crawl them to see the noindex tag. And when Google is crawling a lot of pages on your site, it'll be slow.


This is only an issue if you have millions of pages

Also, it will slow down the crawl frequency if you noindex it

So it's a non problem


>This is only an issue if you have millions of pages

You know, like a news website that's been on the internet since the 90s


> Also, it will slow down the crawl frequency if you noindex it

Eventually Google stops crawling noindexed pages.


Even if we pretend for a moment that your statement, that google's search is "shitty", is universally accepted as truth, you can't blame this one on Google.

People have been committing horrifying atrocities in the name of SEO for years. I've seen it firsthand. And it spectacularly backfired each time.

This can very probably be yet another one of such cases.


Can Google tell the difference between old relevant information and old irrelevant (or outdated) information? I'm not seeing any evidence of that. A search engine is not a subject matter expert of everything on the Internet, and it shouldn't be.


In all fairness, there is some old information I would love to see disappear eventually. Nothing is quite as frustrating as having a question when all the tutorials are for a version of the software that is 15 years old and behaves completely differently from the new one.


To be fair, the old internet wasn't killed. It just passed away. The curious voices that were prevalent back then are buried now.


I don’t see why Google gets the blame when spam is what forces Google to search the way it searches.


> Can there be any doubt

This is such terrible, low-quality, manipulative content. It does not belong on HN.


I stopped using search, I have switched 90% over to chatgpt.


This is some bullshit. It’s bad enough that a lot of sites with content going back 10-20 years have linkrot or have simply gone offline. But I am at a loss for words that they’re disappearing content on purpose just for SEO rankings.

If this is what online publishing has come to we have seriously screwed up.


In fairness to them. If one of their top ways of getting traffic is being marked as less relevant because of older articles, what do you want them to do? Just continue to lose money because Google can't rank them appropriately?


Or mayhap Google et al could realize and understand that having old articles in archive shouldn't penalize your ranking.


Who says this is actually improving their ranking? SEO is smoke and mirrors. I wouldn't be surprised if this whole exercise was actually counterproductive for them, and they just don't realize it.


Not always smoke. If you have enough resources, you can do research and test hypotheses. Also, sometimes there are leaks, like the leak of Yandex source code, which disclosed what factors were used for ranking.


Yeah, didn't Google's old PageRank penalize broken links?


Yep. Also, the days of the "Advanced Search" page have passed, but Google still has time range options under the "Tools" button near the top of the results page. If they're giving the user the option to filter results by time, then it's pretty goofy for the algorithm to de-rank a site in results where the default of "Any time" was selected in the query just because the site has old articles in the index.


I'm afraid that, as so often with any topic, the vast majority of users never or rarely use the time filter, and so everything gets optimized for the users too dumb to search properly.


In this case I'm not too keen on blaming users for this when the option is sorta buried in the current design. The Tools button isn't prominent, the current value of the time range option is completely hidden when it's on the default of "Any time", and the search query doesn't change if you alter the time range option. (And, to my knowledge, there's not a search query incantation for specifying the time range.)

If the current UI is a reflection of what PMs at Google want users to do, then they don't really care if the user uses, or even knows about, the option to filter by time. So I don't think it flies to point at the user and say, "you're holding it wrong".


Or perhaps more likely: Giving users a more prominent way to filter by time didn’t improve ad clicks and was thus A/B tested out of the product.

Just like it was A/B tested that the best background color to contrast the ad area against the organic results’ #ffffff is #fffffe.


You'd think that having a long history of content with traffic to said content over that time would be a key differentiator in ranking pages....

The issue here is very clearly with how Google et al are operating, effectively intentionally favoring blog spam over real content producers.


Google shows ads, not good content. Let's hope those are more aligned in the future.


Nope, nope, too busy with Bard and with making cookies more favorable to AdSense: https://www.cookiebot.com/en/google-third-party-cookies/

And if the content isn't shared somewhere (typically on a non-Google property), then is it even relevant anymore? All the Googlers who defined search relevance outside of freshness have left to other opportunities.


> Or mayhap Google et al could realize and understand that having old articles in archive shouldn't penalize your ranking.

I will not be surprised if in my lifetime it is not possible to Google anything by William Shakespeare.


I'd rather see them address it with a more complicated approach that preserves the old content. For example, they could move it to an unindexed search archive while they rework it to have more useful structures, associations and other taxonomy.

The approach to just move it to another domain and be done suggests it's not well enough researched for what an organization their size could manage.


> In fairness to them [...] what do you want them to do? Just continue to lose money

How come CNET is getting a free pass here to do something obnoxious because they're losing money, but Google is getting the stick for (maybe) doing something obnoxious to avoid losing money?

I think Google do a lot of bad stuff but let's at least assign blame proportionally.


Because Google is a wildly profitable ad tech behemoth with a history of screwing over news companies and CNET is probably just trying to survive.

Do you really not see the power dynamics at play here?


Move the articles to archived.cnet.com or cnetarchive.com instead of just deleting them


They can just set a noindex tag in the HTML head, or send it as an HTTP response header for older pages; then links still work.
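
For anyone curious, it's a one-liner either way; roughly (see Google's robots meta tag docs for the exact details):

    <!-- in the <head> of the archived article -->
    <meta name="robots" content="noindex">

or, as a response header:

    X-Robots-Tag: noindex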


It's my understanding that doing what you propose will still use up the crawl budget, because the bot will have to download the page and parse it to understand that it is no-indexed.


Then you could also block those pages in robots.txt, no? (You do need to do both though, as otherwise pages can be indexed based on links, without being crawled.)


Exactly. This should be solvable without actually deleting the pages. I assume they're only removing articles with near-zero backlinks, so a noindex,nofollow should generally be fine, but if crawl budget is an issue robots.txt and sitemap can help.


The real answer is that there's a non-zero cost to maintain these pages, and even more so if robots.txt entries and such have to be maintained for them as well. And if they have no monetary benefit, or even potentially a detriment, it makes more sense for them from a business perspective to just get rid of them. Unfortunately.


Simple. Split Google's ads part from the search part and the Chrome part.

Because it's clearly bullshit at this point.


How could search survive? The business value of search is in providing data for ads, is it not?


I assume GP means to split Search but keep search ads, called "AdWords", together with it. The third-party advertisement part of Google was/is called AdSense. The two were always run separately AFAIK.


By running ads. Probably Google Ads. What it's already doing.


I thought that the ads on search were a nice-to-have, but that the real value of search was the profile it allows Google to build about a person, which is then used by all ad systems used by Google.

Split that off, and search is all of a sudden producing a whole lot less value, since the profiling value cannot be meaningfully extracted by other companies.


Make a second archive website


Move the old articles to another domain or to non-indexed pages.


That's what the article says they're doing.


The article isn't super-clear about what's happening and, for most purposes, just dumping stuff on the Wayback Machine is probably not that different from deleting it even if the bits are still "somewhere." A few years back I copied any of my CNET stuff I cared about to my own site and, in general, that's a strategy I've followed with a number of sites as I don't expect anything to continue to be hosted or at least be findable indefinitely.


I kinda wish there were some way to store every page I ever visit automatically and index it locally for easy search. Then when I want to look up e.g. the guy who had the popular liquid oxygen fire website back in the 90s it would be easy. But I also fear it would be used against me somehow too.


The things people can do with your browser history of just URLs makes me happy that the actual content on said pages are NOT stored locally.

Otherwise, there is the save page option in your browser if you do want to keep a local copy.


Someone posted this to HN a few days ago

https://linkwarden.app/

It looks very appealing, but I haven’t had a chance to try it myself just yet.


You can pay for pinboard.in and it does something along the lines of what you're looking for.


A really awesome on-the-ground website for information about specifics in collectibles was allexperts.com (now defunct). They got bought by About.com, which then got bought by some "Thought Company"(?), and they straight up deleted all of the 10+ years of people asking common questions about collectibles and getting solid answers. Poof.


The real reason: hiding it from AI scrapers?


It’s on archive.org, is that really going to work?


something about bathwater and babies comes to mind


Underrated comment. This has to be part of it.


If it were truly just about Google rankings they would noindex the pages


The crtgaming community hated that a few years ago, when they deleted all the specs for old CRT monitors :(


Librarians regularly cull books. Magazines are cleared off the newsstands weekly or monthly.

Storing and indexing and maintaining old content isn't free, in either dollars or environmental footprint.


>Storing and indexing and maintaining old content isn't free, in either dollars or environmental footprint.

It's pretty close to free.


Why don't you go ahead and personally do it then? It's pretty close to free.


It's pretty close to free because they already pay for the infrastructure for the current content. The marginal cost is very low.


You can store the whole wikipedia on a single SD card these days. Storage space is not a problem.


Is this why Google search is really terrible now? The search results are nearly unusable today compared to 1-2 years ago


People have said that for years.


Is this related to privacy? I know I have my web activity set to delete after 18 months. Is it too small?


The only History that exists today is that which doesn’t end with a 404.

We might as well burn the libraries, they serve no modern purpose.


Is there any evidence this would even work?

Surely Google determines "fresh, relevant" content according to whatever has recently been published, which this doesn't change. If anything, doesn't Google consider sites with a long history of content with tons of inbound links as more authoritative and therefore higher-ranked?

This baffles me. It baffles me why this would be successful SEO -- and assuming that it actually isn't, it baffles me why CNET thinks it would be.


The theory I've heard is related to 'crawl budget'. Google is only going to devote a finite amount of time to indexing your site. If the number of articles on your site exceeds that time, some portion of your site won't be indexed. So by 'pruning' undesirable pages, you might boost attention on the articles you want indexed. No clue how this ends up working in practice.

Google's suggestion isn't to delete pages, but maybe to mark some pages with a noindex header.

https://developers.google.com/search/docs/crawling-indexing/...


But as that linked guide explains, that's only relevant for sites with e.g. over a million pages changing once a week.

That's for stuff like large e-commerce sites with constantly changing product info.

Google is clear that if your content doesn't change often (in the way that news articles don't), then crawl budget is irrelevant.


Google crawls the entire page, not just the subset of text that you, a human, recognize as the unchanged article.

It’s easy to change millions of pages once a week with on-load CMS features like content recommendations. Visit an old article and look at the related articles, most read, read this next, etc widgets around the page. They’ll be showing current content, which changes frequently even if the old article text itself does not.


I'm pretty sure Google is smart enough to recognize the main content of a page, and ignore things like widgets and navigation. That's Search Engine 101.


Yes, of course, but that analysis happens after the content has been visited by the bot. It’s still a visit, and still hits the “crawl budget.”


So they should stop doing this on pages that they are deleting now.


It’s possible they examined the server logs for requests from GoogleBot and found it wasting time on old content (this was not mentioned in the article but would be a very telling data point beyond just “engagement metrics”).

There’s some methodology to trying to direct Google crawls to certain sections of the site first - but typically Google already has a lot of your URLs indexed and it’s just refreshing from that list.


To determine whether content changes, Google has to spend budget as well, doesn't it? So it has to fetch that 20-year-old article.


> So it has to fetch that 20-years old article.

It doesn't have to fetch every article (statistical sampling can give confidence intervals), and it doesn't have to fetch the full article: doing a "HEAD /" instead of a "GET /" will save on bandwidth, and throwing in ETag / If-Modified-Since / whatever headers can get the status of an article (200 versus 304 response) without bothering with the full fetch.
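
A rough sketch of what that looks like from the crawler's side (Python's requests used purely for illustration, URL hypothetical):

    import requests  # assumption: a real crawler does its own HTTP; this just shows the mechanics

    url = "https://example.com/reviews/2003/some-old-article"  # hypothetical archived page

    # HEAD: headers only, no body transferred
    head = requests.head(url, allow_redirects=True)

    # Build a conditional GET from whatever validators the server exposed
    conditional = {}
    if "ETag" in head.headers:
        conditional["If-None-Match"] = head.headers["ETag"]
    if "Last-Modified" in head.headers:
        conditional["If-Modified-Since"] = head.headers["Last-Modified"]

    resp = requests.get(url, headers=conditional)
    if resp.status_code == 304:
        print("304 Not Modified: keep the cached copy, nothing to re-parse")
    else:
        print("content changed (or the server has no validators), re-index")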


There’s an obvious way this can be exploited. Bait and switch.


If the content is literally the same, the crawler should be able to use If-Modified-Since, right? It still has to make a HTTP request, but not parse or index anything.


If the content is dynamic (e.g. a list of popular articles in a sidebar has changed), then the page will be considered "updated".


This is not correct. It’s up to the server, controlled by the application, to send that or other headers, similar to sending a <title> tag. The headers take priority, and, as another person said, the crawler can do a HEAD request first and not bother with a GET request for the content.


> The theory I've heard is related to 'crawl budget'. Google is only going to devote a finite amount of time to indexing your site.

Once a site has been indexed once, should it really be crawled again? Perhaps Google should search for RSS/Atom feeds on sites and poll those regularly for updates: that way they don't waste time doing a full site scrape multiple times.

Old(er) articles, once crawled, don't really have to be babysat. If Google wants to double-check that an already-crawled site hasn't changed too much, they can do a statistical sampling of random links on it using ETag / If-Modified-Since / whatever.


The SiteMap, which was invented by Google and designed to give information to crawlers, already includes last-updated info.

No need to invent a new system based on RSS/Atom, there is already an actually existing and in-use system based on SiteMap.

So, what you suggest is already happening -- or at least, the system is already there for it to happen. It's possible Google does not trust the last modified info given by site owners enough, or for other reasons does not use your suggested approach, I can't say.

https://developers.google.com/search/docs/crawling-indexing/...
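
For illustration, a minimal entry of the kind those docs describe (URL and dates made up); whether Google actually trusts lastmod/changefreq from site owners is exactly the open question:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://example.com/reviews/2009/some-old-review</loc>
        <lastmod>2009-06-14</lastmod>
        <changefreq>never</changefreq>
      </url>
    </urlset>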


I can imagine a malicious actor changing an SEO-friendly page to something spammy and not SEO-friendly. Since ETag and Last-Modified are returned by the server, they can be manipulated.

Just a guess though.


This should be what sitemap.xml provides already.


Even if that rule were true, why wouldn’t everything in the say, top NNN internet sites get an exemption? It is the Internet’s most hit content, why would it not be exhaustively indexed?

Alternatively, other than ads, what is changing on a CNN article from 10 years ago? Why would that still be getting daily scans?


Probably bad technology detecting a change. Things like current news showing up beneath the article, which changes whenever a new article is added. I've seen this happen on quite a few large websites. It might be technologically easier to drop old articles than the amount of time to fix whatever they use to determine if a page has changed. You would think a site like CNET wouldn't have to deal with something like that, but sometimes these sites that have been around for a long time have some serious outdated tech.


That's a good point about the static nature of some pages. Is there any way to tell a crawler: crawl this page, but after this date don't crawl it again, and keep anything you previously crawled?


The ads are different.

I am tracking RSS feeds of many sites, and on some I get notifications for old articles because something irrelevant on the page changed.


CNET* not CNN. But everything you say is still true.


How does Wikipedia manage to remain indexed?


Google is paying Wikipedia through "Wikimedia Enterprise." If Wikipedia weren't able to sucker people into thinking that they're poverty-stricken, Google would probably prop it up like they do Firefox.


Google search still prefers to give me at least 2-3 blogspam pages before the Wikipedia page with exactly the same keywords in the title as my query.


If I were establishing a "crawl budget", it would be adjusted by value. If you're consistently serving up hits as I crawl, I'll keep crawling. If it's a hundred pages that will basically never be a first page result, maybe not.

Wikipedia had a long tail of low-value content, but even the low-value content tends to be among the highest value for its given focus. e.g., I don't know how many people search "Danish trade monopoly in Iceland", and the Wikipedia article on it isn't fantastic, but it's a pretty good start[0]. Good enough to serve up as the main snippet on Google.

[0] https://en.wikipedia.org/wiki/Danish_trade_monopoly_in_Icela...


Wikipedia’s strongest SEO weapon is how often wiki links get clicked on result pages, with no return.

They’re just truly useful pages, and that is reflected in how people interact with them.


Purely speculating, Wikipedia has a huge number of inbound links (likely many more than CNet or even than more popular sites) which crawler allocation might be proportionate to. Even if it only crawled pages that had a specific link from an external site, that would be enough for Google to get pretty good coverage of Wikipedia.


Very likely Google special-cases Wikipedia


Your site isn’t worthy of the same crawl budget as Wikipedia.


They could specify in the sitemap how often old articles change. Or set an indefinite caching header.


Google might not trust the sitemap because it sometimes is wrong.


It could be better to opt those articles out of the crawler. Unless that's more effort. If articles included the year and month in the URL prefix, I would disallow /201* instead.
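
Something like this, assuming that hypothetical /YYYY/MM/ URL scheme (robots.txt rules match by prefix, so you don't even need the wildcard):

    User-agent: *
    # blocks every path starting with /201, i.e. /2010/ through /2019/
    Disallow: /201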


In a major site redesign a couple years ago, we dropped 3/4 of our old URLs, and saw a big improvement in SEO metrics.

I know it doesn’t make sense and that Google says it is not necessary. But it clearly worked for us.

I think a fundamental truth about Google Search is that no one understands how it actually works anymore, including Google. They announce search algorithm updates with specific goals… and then silently roll out tweaks, more updates, etc. when the predicted effect doesn’t show up.

I think the idea that Google is in control and all the SEOs are just guessing, is wrong. I think it’s become a complex enough ML system that now all anyone can do is observe and adjust, including Google.


Which SEO metrics?


We ranked higher for more keywords and traffic from organic search went up.


I have noticed some articles (and not just "Best XXX of 202Y" articles) that seem to always update their "Updated on" date which Google unhelpfully picks up and shows in search results leading me to think the page is much more recent than it is.


I've been curious about how they are doing this. It seems to be an increasing trend and is making the query mostly useless.


    Updated on <?= date('m/d/Y') ?>
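
Less flippantly: the date Google shows is often pulled from the page's structured data, so a stale article can advertise a fresh dateModified with something like this (values obviously made up):

    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "Article",
      "headline": "Best Widgets for 2023",
      "datePublished": "2019-03-02",
      "dateModified": "2023-08-09"
    }
    </script>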


> It baffles me why this would be successful SEO -- and assuming that it actually isn't, it baffles me why CNET thinks it would be.

If the content deleted is garbage, why wouldn't it help? No clue on CNET's overall quality, but I don't have a favorable image of it. Just had a look at their main page and that did not do it any favors.


Because the remaining content is also garbage (probably even worse garbage gauging by the trends I observed before I started ignoring CNET altogether)


Several reasons why it works. First, the PageRank algorithm will give the other pages on the site a higher score, per the spec.

Second, there could be spam links pointing to old CNET articles that need to be wiped from CNET's site spam score.


Perhaps sites with a small ratio of new:total content would be downranked --- but I really don't think that makes sense because that's going to be the case for any long-established site.


Google also might be at fault for making images on the web lower quality. Several years ago, Google announced that page load speed would affect ranking. Google's tool, PageSpeed Insights, gave recommendations on improving load speed. But it also recommended lowering the quality of JPEG images to the level where artifacts become visible. So instead of proper manual testing (using eyes, not a mathematical formula) on a large set of images, some Google employee simply wrote down a recommended compression level out of their head, and this forced webmasters to worsen the quality of their images below any acceptable level.

So it doesn't matter how hard the photographer or illustrator worked to make a beautiful image; Google's robotic decision based on some lifeless mathematical formula crossed out their efforts.


That also explains why 2000s-era postage-stamp size photos are now back in vogue.

When it comes to Google, sufficiently-advanced malice is indistinguishable from incompetence.


Yes, this is made much worse by Google's "smartphone" crawler on pagespeed insights being an emulated very low end Moto G phone (10 years old?), downclocked to 25% of original (very slow) CPU speed, with a network that maxes out at 1mbit/sec or so, with 150ms latency added.

Makes it incredibly difficult to have nice imagery above the fold at least.


I've seen a couple news sources that are altering their publish dates to show near the top of news feeds. Google will announce "3 hours old", despite being weeks old.


Google should be massively down ranking sites that do this. Also if a site has a huge historical archive, that should be a positive indicator of site quality, not a negative one.


Reddit does this and it's very frustrating


Wow, I thought that was just me. The elation of googling for a niche Reddit topic and finding a result that looks very recent, only to reach that sad realization that the content is 11 years old and likely not relevant anymore.


Even worse when this happens, and you find that it's an old post you made. Happened to me more than once. I never realized this was intentional on reddit's part; I just assumed Google was broken.


I've been unable to find, via Google, a Reddit post I read yesterday, even though I remembered all of the words in the title. No idea what is going on there.


Yup, "Last Updated on xxx" but no obvious updates called out in the article


And definitely no "diff" to previous versions.

Even if updates are mostly minor corrections or batch updates of boilerplate, the capability exists to rewrite any part or all of a story when the only third-party record of past versions is a single cached snapshot on archive.is or possibly archive.org.

Archival navigation and visualization should be deeply integrated into a user-centric, privacy browser.


Yea I've noticed that as well. It's really annoying when you search for something new but it links to an article that claims to be a few months old when it's in fact years old.


I'm not certain they are altering publish dates; it is probably an error on Google's side. I've seen my own sites in the search results with wrong publishing dates that don't make any sense.


This is why I work at archive.org. Is it perfect? No. Does it have value to society? Absolutely.


I adore archive.org. I'm worried though it is becoming somewhat of a load bearing element of civilization, given the importance of shared and accurate history. We need redundancy.

~~I'm also worried about the deletion of old pages on archive because new owners of a domain update the robots.txt file to disallow it, which I've heard wipes the entire archive.org history of that domain. I hope that gets addressed.~~

Edit: this is no longer the case


Yeah wait, what? I hope that's incorrect. Updating robots.txt to disallow should only omit content from that point onward... It shouldn't be retroactive. What if there's a new owner of a respective domain, for example?


It was correct. And archive.org used this self inflicted problem to justify disregarding robots.txt.

https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...


Oh thank you for the update, that's great news! Agree with their position 100%.


> Agree with their position 100%.

Including the part where they blamed anyone but themselves for hiding old snapshots when robots.txt changed?


no I didn't see that in there, I just agree that robots.txt makes sense for web crawlers but not archival purposes


It isn't. I really wish that, instead of wiping DECADES of history, it only applied to content archived after the day the current owner registered the domain. I think that would be slightly more reasonable, but I imagine they simply don't have access to such data.


Simply requiring domain owners contact archive.org to remove old snapshots would be better than applying robots.txt retroactively.


The Internet Archive is not really known for deleting anything. In many different postings across the years, their founder and employees have mentioned items being taken "out" of the wayback machine, not "deleted" out of it. I don't think you have anything to worry about.


It's absolutely essential and irreplaceable for the web archive, and that's why I was pretty angry that you guys decided to pick a fight with the big publishers over "loaning" ebooks that could have gotten the whole site killed.

It was possibly a worthwhile fight for someone to have, but not for the site that hosts the Wayback Machine. Separation of concerns, my friends...


what is it like to work there? can you describe what archive does on a day to day basis? what do you need most apart from I guess donations?


I'd read a blog post about that!


You do great work.


Bad. Antithetical to both Google's original ideals and the early 'netizen goals.

Google's deteriorating performance shouldn't result in deleting valuable historical viewpoints, journalistic trends and research material just to raise your newly AI-generated sh1t to the top of the trash fire.


> Bad. Antithetical to both Google's original ideals and the early 'netizen goals.

At this point, we should all realize that every ideal and slogan spouted by tech companies is just marketing. We were just too young and naive to know any better. 'Don't be evil'. In hindsight, that should have set off alarm bells.


Yeah, truly. "Don't be evil"? Like, you want to be evil, or you tend to be evil, but just don't do it?


Imagine, if you will, a utopian world where a critical service such as finding anything is not dominated by one (1) entity but an actual number - such as ten (10). Sci-fi novels describe this hypothetical market structure as a "competitive market".

In this utopian arrangement, users of search services are more-or-less evenly distributed among different search providers, enjoying a variety of different takes on how to find stuff.

Search providers, continues the sci-fi imagination, keep innovating and differentiating themselves to keep an edge over competition and please their users.

Producers of content, on the other hand, cannot assume much about what the inventive and aggressively competitive group of search providers will accentuate to please their users. So they focus on... improving the quality of their content, which is what they do best anyway.

It's a win-win for users and content producers. Alas, search service providers have to actually work for their money. This is slightly discomforting to a few, but not the end of the world.

One cannot but admire the imagination of such authors. What bizarre universes they keep inventing.


Search is an inherently centralizing and monopolizing industry. The one with the biggest index can show the best results and ads, thus has the biggest budget, and the one with the biggest budget has the biggest index.


There are more assumptions than words in this sentence.


Yet reality seems eerily similar


Depends on what is in that index. If it's full of noise and spam then its results won't be very useful. Garbage in - garbage out.


There are search engines you can use right now that offer superior quality to Google. It's not a sci-fi fantasy.


Since everyone's reacting to the headline and hasn't read The Fine Article... Please let me call attention to the linked tweet from Google explicitly saying don't do this.

> Are you deleting content from your site because you somehow believe Google doesn't like "old" content? That's not a thing! Our guidance doesn't encourage this. Older content can still be helpful, too.

https://twitter.com/searchliaison/status/1689018769782476800


And? I really do not get how a tweet from Google contradicts the fact that their search engine is hot garbage that incentivises spam, centralised silos and "SEO tricks" like what CNET is doing.


I don't have any agenda, it just seems like important context. c|net is doing something stupid and apparently it isn't even going to accomplish their goal. Maybe this statement from Google will help convince other companies not to do the same thing.

FWIW, I'm as mad about Google quality going downhill as anyone. The problem of sites doing stupid things for SEO purposes goes back at least 20 years though.


SEO does not rely on, and often contradicts, statements like the one you quoted. No one who takes SEO seriously looks at what Google says. They look at how Google behaves.

Source: used to professionally offer SEO services.


Making helpful content requires time and money. Slashing staff and using AI generated content [1] is not the way to do this.

[1] https://www.businessinsider.com/cnet-slashes-10-staff-says-c...


SEO is a scam run by con artists. Google's worth as a search engine is its ability to rank pages by quality. SEO tries to fake quality or trick Google into ranking objectively bad sites higher.

Red Ventures is trying to make CNET worse, with this and the AI-written stories. Google should react by delisting all of CNET.


SEO won a long time ago. Google fights a hopeless war against legions of soulless locusts.

For all its resources it's incapable of improving the situation. Since it only understands proxies of quality and truth, and these things can be manufactured and industrialized, it is incapable of winning.

Blogspam and fraud consistently outrank the original, and changes to the rules frustrate legitimate sites more than SEO spam. If anything these frequent changes increase the demand for SEO, not make it harder. You just wind up with more.

Until they figure out some kinda digital pesticide for SEO spam, the situation will continue to get worse.


To a large extent yes, but I'd wager a lot of sites have been improved by people reading about ranking and SEO as a lot of it is just better incentives for good/bad behavior (e.g. punishing copy pasting, rewarding relevant keyword usage).


The quality of Google’s search results have been in steady decline for years. Google’s worth as a search engine is mostly the value it accrues to advertisers with pinpoint user targeting at this point.


Something is going very wrong here.

Information on the internet should be in whatever format best suits the topic. The format that best serves the users looking for that information.

And search engines should learn to interpret that information and its various formats, in order to best connect those searching for information with those providing it. Yes, the search engine should adapt to the information and its formats - not the other way round.

Instead we see "information" (or the AI-generated tripe replacing it) adapt its contents and format for search engines, in a bid to ultimately best serve advertisers. And search engines too adapt and change their algorithms to best serve advertisers.

As a result it becomes ever harder and harder for users to actually find the information they want in a format that works.

It's become so bad that it's now more practical to use advanced AI to extract the actual information and re-format it, rather than go looking for it yourself.

As a human, you no longer want to use the web, and search... you want to have a bot that does that for you... because ultimately the space has become pretty hostile to humans.


Funny timing, because Google said yesterday not to do this.

https://www.seroundtable.com/google-dont-delete-older-helpfu...


That reads like an attempt at a refutation that is actually a confirmation:

> The page itself isn’t likely to rank well. Removing it might mean if you have a massive site that we’re better able to crawl other content on the site. But it doesn’t mean we go “oh, now the whole site is so much better” because of what happens with an individual page.

> “Just because Google says that deleting content in isolation doesn’t provide any SEO benefit, this isn’t always true" Which ... isn't what we said. We said that if people are deleting content just because they think old content is somehow bad that's -- again -- not a thing.

To me that reads like it is possible to improve the ranking of new content by deleting old content. The only thing they are refuting is that the age of the deleted content is the reason for the improvement.


Which is a pretty crucial distinction, no? No one would get upset if CNET announced they were deleting clickbait and blogspam.

With articles such as "The Best Home Deals from Urban Outfitters' Fall Forward Sale" currently gracing their front page, I'm wondering how long HN commenters expect to need access to this content.


> https://www.seroundtable.com/google-dont-delete-older-helpfu...

I think older articles will eventually hold more weight in Google Searches. There will be a before OpenAI vs after OpenAI weighting.


At the same time, is there any reason to trust them over empirical evidence (if there is evidence)?

Incentives are incentivizing.


I don't get it. Why don't they just update the entries to disallow Googlebot from crawling those links? That way they'd be removed for Google but still accessible for others.


This is a great example of the harms of AI.

Google presumably used AI to rank pages (it hasn’t been just PageRank for a while).

The AI has noticed people don’t engage with older content, so it deranks older content. It also deranks websites with lots of older content.

So websites pull their older content, which is an important form of historical memory.

Even if the AI isn’t actually doing this, people assume it is.

Because AIs aren’t rules based, we have to guess what it’s doing.

And we guess it’s deranking old sites.


CNET got bought by vulture capital a couple years back and had already been replacing writers with crap AI before ChatGPT was a thing. This shouldn't be a surprise. Everything CNET-related has been a walking corpse for a while now.


I remember that CNET was a credible source 10 years ago, now it’s like a tabloid that reports 1 week old stories with clickbait titles and thin content.


So, let me get this right. CNET started using "generative AI" to write their articles. Google no doubt detected it and down ranked them to hell. CNET stopped the AI generation and they decided to delete their archives to improve their rankings?


Newer is not always better, especially when you're looking for information on old things, but I suspect there are vested interests who don't want us to remember how much better things were in the past, so they can continue to espouse their illusion of "progress", and this "cleaning up" of old information is contributing to that goal.

Archive.org deserves all the support it needs. If only the Wayback Machine was actually indexed and searchable too...


One of my favourite websites growing up: download.com


The day download.com became infested with adware was a sad day. I used to visit that site a lot to see what cool software I could install. Thank the heavens for Linux for teaching me proper package management.


If you're on Windows you can try Scoop https://scoop.sh/#/apps


I just went to http://download.com . On the frontpage they have demos for NFS Underground 2, Vice City and Age of Empires II. It's like I've entered a time warp. So many memories!


softseek.com was better, because everything had a screenshot. I would actually prefer this interface over modern app stores:

https://web.archive.org/web/19991013034959/http://softseek.c...


I was too concerned with which version of DirectX I had. I thought it was much more important than what it turned out to be.


Same. I was a kid in the country, so we didn't have internet. I used to ride my bike with my thumb drive to the library's computers to see what was new on download.com at least once a week. It was like visiting a candy store. It's a shame those days are over.


This is like the flea on the dog's tail, wagging the tail and dog. I don't know how many more levels of meta we can handle.

Also, if you just dump all that content on archive.org, you're kind of just reaching into archive.org's wallet, pulling out dollar bills, and giving them to Google, whose ostensible goal was to index and make available all the world's information. I feel like that's enough irony and internet for today.


CNET did this a while back, but it didn't seem SEO related then. They used to have tons of old tech specs. I remember them being the last source of specs for an obscure managed switch. Then the whole of that data just went away with no notice. Really great resource lost.


I believe that data might have been licensed (from Etilize?). They could have stopped paying for it and lost the rights to display it.


It's things like this that make me really want to see some of the alternative search engines succeed. Hopefully Google continues shooting itself in the foot enough to make people seek alternatives. FWIW I'm using Kagi and have been very happy with it. It's the only alternative I've used (including numerous failed attempts to switch to DDG) where the results have been good enough that I haven't developed a muscle memory to resort back to Google results. And I consider the amount I have to pay to use it reasonable for the value I get, but also an investment in a potential future that doesn't have us all beholden to Google.


I haven't visited a CNET page in years and didn't know they were even still relevant :p

The times I have visited CNET pages in the past was to find specific information. If such information happens to be in deleted articles, that would reduce my interactions with CNET in the future.

I think they should archive the old articles or even offload them to Wayback willingly, but it's possible some of the articles they're purging aren't worthwhile. If I write up an article about a cat playing on a scratching post, there's a good chance there's nothing unique or valuable about it and it doesn't need to sit around gathering bit rot :p


> I haven't visited a CNET page in years and didn't know they were even still relevant :p

I had the same feeling about the Verge.

It feels like going to Britannica online to read about what Encarta encyclopaedia was and why it came in something called a CD-ROM.

---

E: Awesome: https://www.britannica.com/topic/Encarta


> It feels like going to Britannica online

I’ve seen people use Britannica Online as ammunition in arguments on Wikipedia. “Britannica says/does X so Wikipedia should too”


From the article:

'Stories slated to be “deprecated” are archived using the Internet Archive’s Wayback Machine, and authors are alerted at least 10 days in advance, according to the memo.'


You can't (currently) search the Wayback Machine, so you can only find these old articles if you know what the URL once was.
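(Worth noting, though it's URL lookup rather than full-text search: the Wayback Machine's CDX API can at least enumerate captures under a URL prefix, which helps when you only remember part of a path. A rough sketch, with the prefix chosen just as an example:)

    import json
    import urllib.request

    # List Wayback captures under a URL prefix via the CDX API.
    query = (
        "https://web.archive.org/cdx/search/cdx"
        "?url=cnet.com/news/&matchType=prefix"
        "&output=json&collapse=urlkey&limit=10"
    )
    with urllib.request.urlopen(query) as resp:
        rows = json.load(resp)

    # First row is the field header (urlkey, timestamp, original, ...).
    for row in rows[1:]:
        print(row[1], row[2])  # capture timestamp and the original URL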


Offloading trash to an already strained organisation. Are they donating any help or is it just taking advantage of charity?


Lmao there's no winning with you people :p

Don't worry, Wayback is capable of telling CNET where to go if they were really concerned. If anything they should be thankful for the newly-generated interest that a major company would use them instead of some randoms cherry-picking sites for arguments.


yes, this greatly legitimizes the wayback archive.


I didn't know they were strained. Probably best to block them so they don't spend resources on unimportant things. You can find a few UA strings here https://udger.com/resources/ua-list/bot-detail?bot=archive.o... Ultimately, it looks like they do respect the `archive.org_bot` string so if you want to help the strained organization, probably best to block it as well.

You can also save them some space, it appears, by uploading a file on your website asking them not to retain it and then sending an email to them. See https://webmasters.stackexchange.com/a/128352

For now, I just blocked the bot, but when I have some time maybe I'll ask them to delete the data.

Hopefully, with a little community effort we can reduce the strain on them so they can spend their limited resources reasonably.


> I haven't visited a CNET page in years and didn't know they were even still relevant :p

I'm not sure that CNet is relevant. I see their headlines regularly because they're a panel on an aggregator I visit. Most of what they publish is on a level with bot-built and affiliate pages.

We just had a long run of "Best internet providers in [US city]" as if people who live there could choose more than one. Between those will be Get This Deal On HP InkJet Printers and Best Back To School VPN Deals 2023 that compares 10 different Kape offerings.

If there is a point to cnet's existence, I truly don't see it.


Article says that’s what they’re doing. It says exactly that.

Doesn’t really make things better, IMO, but they are at least doing that.


If your tool for finding pages incentivizes people to destroy some pages so others can be found, then something went horribly wrong.


“It became necessary to destroy the town to save it.”


They're also using LLMs to write articles.

I don't understand why they think I or anyone else won't just skip the part where they lard it up with ads and just talk to the robot directly.


Or just skip CNET entirely. They haven't been worthwhile for a very long time.


While it would be nice to believe that search engines solve the issue automagically, there are a lot of reasons why organizations want to reduce the amount of old/outdated/stale information that's served to searchers. I know where I work, we're constantly deleting older material (at least for certain types of content--which news and quasi-news pubs don't necessarily fall into).


What reasons would there be to delete old news articles?

My inner historian is screaming.


One reason is that if you keep all your old versions of product docs up, Google will randomly send people to the old version instead of the current one, and then customers get confused by the outdated info.

I'm sure there must be some way to fix this with META tags/etc, but it is often easier just to delete the old stuff than to change the META tags on 100s or 1000s of legacy doc pages.
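For what it's worth, the usual META-tag route is one of these two, dropped into the <head> of each legacy doc page (the URLs here are placeholders):

    <!-- Point the old version at its current equivalent... -->
    <link rel="canonical" href="https://docs.example.com/latest/install/">

    <!-- ...or, if there is no current equivalent, just keep it out of the index -->
    <meta name="robots" content="noindex">

which of course still has to be templated into hundreds or thousands of pages, so the point about effort stands.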


I'm decently fond of how ansible docs do it (for example) - consider ex. https://docs.ansible.com/ansible/2.9/modules/list_of_all_mod... which has a nice banner at the top saying this is an old version.


But if I have an old version of the product, then I may need those old docs.


And you're probably out of support and aren't a current paying customer so you're a lower priority than current customers that want current information. (Old information is often available on product support sites but it probably won't come up with a random search.)


When I worked at $MAJOR_VENDOR, we had an internal-only archive of product installation and documentation ISOs for old versions, discontinued products, etc. It didn't go back forever (although apparently some even older stuff was archived offline and could be retrieved on request), but it did contain product versions that had been out-of-support for over a decade. A lot of older material which had been removed from the public website was contained in it.

There was a defined process for customers to request access to it – they'd open a support ticket, tell us what they wanted, the support engineer would check the licensing system to confirm they were licensed for it, then transfer it to an externally accessible (S)FTP server for the customer to download, and provide them with the download link. There was a cron job which automatically deleted all files older than 30 days from the externally-accessible server.


And for product documentation for version n-2.x, it's easy enough to see the use case for why a customer might need it (e.g. they have some other product that's version-locked to it). But there is a ton of collateral, videos, etc. that companies produce that gets stale, musty, or just plain wrong, and if you leave it all out there it's just a mess that's now up to the customer to sort through: what's right, what's mostly right, and what's plain wrong.

So, while there may be some historical interest in how we were talking about, say, cloud computing in 2010, that's the kind of thing I keep in my personal files and generally wouldn't expect a company to keep searchable on its web site.


> But there is a ton of collateral, videos, etc. that companies produce that get stale, musty, or just plain wrong

That 2017 product roadmap slide saying "we'll deliver X version N+1 by 2020" gets a bit embarrassing when 2023 rolls around and it is still nowhere to be seen. Maybe the real answer is "X wasn't making enough money so the new version was cancelled", but you don't want to publicly announce that (what will the press make of it?), you just hope everyone forgets you ever promised it. You can't get rid of copies of the slide deck your sales people emailed to customers, or you presented at a conference, or in the Internet Archive Wayback Machine – but at least you can nuke it off your own website. Reduces the odds of being publicly embarrassed by the whole thing.


Or your 2011 presentation about cloud computing didn't mention containers. I'm not sure it's embarrassing any more than a zillion other forward looking crystal ball glimpses is embarrassing. But there's no real reason to keep on your site unless your intent is to provide a view into technology's twisty path or the various false starts every company makes with its products.


Companies are not, in general, in the business of being the archive of record. They're, among other things, in the business of providing their current customers with the information that's most applicable and correct for their current needs and not something relevant to a completely different version of software from 10 years ago.


Ok Captain Obvious. That isn't useful OR new information.


Deleting old articles eases the burden of managing them.

There are infrastructure costs and management costs.


The benefit they get from those articles is probably near zero but the maintenance cost is tangible.


If you use a robots.txt file to put your site's archives off limits to Google, does Google obey that? Your own search box can still search them.


They would probably have thought of that if they had anyone left who knew anything about the internet...


Google search is terrible these days. I find myself using other engines, even Yandex, to find things. It has been in decline now for at least the last 3-5 years.

All this SEO garbage may be responsible but I still think google could do a much better job.

Feels like Google is just over-monetizing its search at this point and the quality of the product is in free fall.


I saw someone mention Kagi, a premium search engine, on here recently and I've been loving it. Hadn't heard of Yandex, I'll give that one a try too.

Google definitely has an incentive to push content farms covered in Google ads to the top of its search engine, whereas a premium service's only incentive is to provide the best search engine possible.


Can we all please start building decentralized tools now.

Stop giving these crap companies power over you and your data. You don't know what they'll do with your data in the future (think George Orwell).

As we can see, these companies can out of the blue force other companies to delete their old articles. There goes freedom of speech, just to rank higher on a shi*y search engine that no one smart even uses (I use Brave Search). Then again, these companies kowtowing to Google mainly write for the brainwashed sheep. I personally don't use any of their privacy-invasive analytics on my blog.


They have no obligation to serve old stuff indefinitely. If you like an article, it's on you to save it, just like people in the old times would keep newspaper cutouts about their favorite band and VHS cassettes of interviews etc.

Yes, things are becoming ephemeral. Probably good. Been like that forever except for some naive time window of strange expectations in the 2000s decade. We shouldn't keep rolling ahead of us a growing ball of useless stuff. Shed the useless and keep the valuable. If nobody remembers otherwise, then probably it's not a thing of value.


The problem is that it's no longer possible to find it. Lots of early internet history is lost or hard to find - despite the fact that storage space is not an issue these days.


Maybe. I'm of two minds. Valuable things stand the test of time, they get retold and reused and remembered and kept alive. It's always been like that. On the other hand this could be construed as a defense of oral tradition above writing, but writing has had immense impact on our capabilities and technological progress historically, including the preservation of ancient texts. But I'm not sure we'd be so much better off if we had access to all the gossip news of the time of Plato. Not to mention that forgetting, retelling, rediscovering, discussing stuff anew allows for a kind of evolution and mutation that can regularize and robustify the knowledge.

So we need some preservation, but indiscriminate blind hoarding of all info isn't necessarily the best.


There can be many reasons for this from an SEO point of view. One of them can be to send Link Juice to the pages they want to rank better (maybe because those pages bring more advertising revenue, or, being new, don't have enough backlinks). So let's say you have old pages which have a good amount of backlinks but whose stale content is not bringing enough visitors to your site. You can delete those pages and permanently redirect them to new pages. This way the new page will get a Link Juice boost and it will perform better in search results.
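Purely to illustrate the mechanics, the permanent-redirect half might look like this in an nginx server block (the paths are made up):

    # Deleted article: 301 it to the page that should inherit its backlinks
    location = /2009/holiday-gadget-roundup {
        return 301 /best-gadgets-2023;
    }

Whether Google passes full link equity through such redirects is, of course, its own long-running SEO argument.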


Highly doubt that this is for SEO purposes; SERPs don't work like that. Smells like a bad management decision, and probably some huge investment in an AI tool that needs to be justified monetarily. They will probably nuke a huge portion of their traffic and lose a ton of backlinks.

If you genuinely strive to create great content, that piece of content will drive traffic, leads, sales, rankings, etc. for years.

Just by searching a random thing like "how to build an engine", the first article that comes up among the top results is from 1997.


The fact that they resorted to writing crap these days with AI is what is making them lose relevance, not that they have been in the business for a long time, and have thousands upon thousands of articles.

CNET used to, perhaps not be cool, but it certainly was a place you'd go get tech news from every once in a while. Just imagine they were huge enough to buy download.com, back then, even! For a while, they were essentially *the* place you'd go for downloads of various shareware stuff.

Now, it's a ghost town of a site with AI-written junk.


SEO optimization is actively destroying the archives of blogs out there. Pruning articles to rank better is rewarded. Removing knowledge to "play the game" is a viable path to making money.

The saving grace here? The existence of the Wayback Machine. A non-profit by the Internet Archive that is severely underfunded. If you ever needed a reason to donate, this is probably it. And even then, the survival of this information depends on a singular platform. Digital historians of the future will have a tough job.


Can't believe how many morons are at the helm and making decisions. This simple exercise should tell us how trustworthy CNET is with their technical knowledge LOL


Alternate title: "How the internet ate itself."

This seems antithetical to the internet approach of 'data wants to be free.'

Data most assuredly doesn't want to die. Why U kill data?


Recently, I did some research on the topics of "Windows Phone" and "Windows Mobile". I found a lot of interesting articles on Hacker News, thanks to everyone who contributed to the database.

However, many of the links are no longer working. Some websites' domains have expired, while others have chosen to remove the old articles. That made me feel so bad.


I have noticed this at some of the sites I used to write for. Hence, as soon as I put a new piece up, I archive it at archive.is, and include that reference in my site's list of my work. Periodically, I should go there and check each one, but there's a lot of material.

I did not know that this was a possible motivation as to why my more 'historic' work is disappearing, though.


Sites also just reorganize, change CMSs, simply go out of business (which can happen to archive.is as well). I save a lot of my own stuff but it's a bit hit or miss and I expect a lot of the material we lean on The Wayback Machine to save is probably pretty hard to actually discover.


True about archive.is probably being more at risk than the Wayback Machine. I just don't know when I'll ever get the 2-3 days necessary to additionally back the posts up at WM, because each save is very slow.


What’s the difference between archive.is and Internet Archive Wayback Machine?


It handles some of the more complex formatting better, and succeeds better in simplifying complex CSS and templates. Also better latency, usually.


Could CNET not have relocated these articles to a distinct directory and then used robots.txt to prevent search engines from indexing them? This move seems like an excuse to offload subpar content by placing the blame on external factors. It's sort of like saying, "we withdrew a significant amount of cash from the bank and torched it to make space for fresh funds."


Luckily it's just one website making a dumb decision to delete their content. I doubt that this will actually boost their rankings, and I hope others don't follow CNET's lead. (Hopefully CNET is wrong, and that deleting your sites old content doesn't actually help your search engine ranking on Google.)


The gamification of ranking and clickbait by encouraging digital amnesia and bitrot seems both absurd and evil. New content isn't always good or better when specific content was desired.

0. Donate to internet archive now

1. Website operators should aim to not hide content or download artifacts behind JS with embedded absolute URIs or authenticated APIs.


Let Google die already.


Before we crucify these guys do we know what articles they are deleting?

I've found deleting the worst-quality articles to be good for everyone. Maybe they are deleting the worst.


That's the reason why we have to archive articles we find valuable. Things are not on the net forever :'(


Does anybody actually browse CNET on purpose? I mean, the content has been of LLM-generated quality for a decade or more, and the only way any sane person ends up seeing it is by getting redirected through a link-shortener/link-obfuscator, surely. Surely?


So much for - "Internet doesn't forget". It does. A lot of stuff has been lost.


My first reaction was to think "Google is a shitty search engine if CNET can improve their search ranking by deleting information."

But then I thought about scenarios where this might be legitimate. If we assume:

    1. Google has some ability to assess the value of content that isn't 100% reliable at a per-article level, but in aggregate is accurate.
    2. Google therefore has the ability to judge CNET's content quality score in aggregate, and use that in search rankings for individual articles.
    3. CNET knows that they have articles that rate well on Google's quality score, and articles that rate poorly.
    4. CNET has reason to think they can generally distinguish "good" articles from "bad" articles.
THEN it would (could?) make sense to remove the "bad" articles in order to raise their aggregate quality score. It's kind of like the Laffer Curve of Google Content Quality. Overall traffic could go up if the "bad" articles are "bad" enough.
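A purely hypothetical toy of that aggregate effect (the per-page scores are invented; Google exposes nothing like this):

    pages = {"deep_review": 0.9, "spec_archive": 0.7, "ai_listicle": 0.2, "deal_blogspam": 0.1}
    site_quality = sum(pages.values()) / len(pages)        # 0.475
    pruned = {k: v for k, v in pages.items() if v >= 0.5}
    pruned_quality = sum(pruned.values()) / len(pruned)    # 0.8

If a site-level aggregate like that feeds back into per-article rankings, pruning the bottom of the distribution lifts everything that's left.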

That said, unless CNET was publishing absolutely useless garbage at some point, it would make a lot more sense to just de-index those articles, or even for Google to provide a way for a site to mark lower-quality articles, so Google understands that the overall quality score shouldn't be harmed by that article about the Top 5 Animals As Ranked By Hitler, but if someone actually Googles that, it's fine for CNET to have retained that article, and for Google to send them to it.

In short: Google advising sites not to do this when there are perfectly reasonable alternatives seems silly.


This seems like a win-win for CNET. They get publicity from the news and lots of typical outrage and Google-bashing comments. And if it doesn't improve the ranking as expected, they can always put back the old articles, quietly.


Smart to delete old content.

Very few people use it.

And it costs resources to keep up and crawl.

Better to prune content that doesn’t get a lot of traffic to keep your domain authority high.

Also, there's a 50k URL limit per sitemap.xml file.

Also… stop hoarding. Delete everything you can. As often as you can. Clinging to the past isn’t healthy.


Why not at least spin the old articles into a sub-brand? Something like CNET archive?


Terrible shame. CNET reviews were like going shopping in a technology wonderland when I was a kid. I wonder if they're going to erase James Kim's body of work as well. That would be a shame.


They’re simply wrong in their theory of the case here. My guess is they will look back on this experiment in a year and regret it. Having older, highly linked content on a site is every SEO’s dream.


Why do the articles have to be deleted? Is robots.txt or <meta name="robots" content="noindex,nofollow"> not sufficient? Deleting articles will break so many external links.


Thanks, Google, for making the 'net even more useless! & ahistoric!


I begin to think that Google is a disease for the Internet.


After I weeded my 20-year garden of personal content, I can confirm the blog was more relevant and a better user experience.

Not all content should be immortal.


Some things about the 1990s Internet will soon be found nowhere except in some people's memories. Big Tech is trying very hard to ensure that.


Wouldn't it be better to move them to a new domain? Worst case scenario they're kept forever.

Best case that new domain starts ranking well.


Is the fact that CNET deleting old articles improving its Google Search ranking? I'd say so, it's all over the news now.


Does anyone have a self hostable archiving/bookmark manager to recommend? I have been using Wallabag but I find it just okay


I feel like the internet is a bust at this point - I find it all quite depressing - which is kind of silly I know.


It's been quite a while, but CNET has never been the same since CBS acquired them in 2008. They drove the site to become SEO fodder, almost like a ZergNet for tech. And their most recent acquirer is diluting whatever integrity remains by pushing out a slew of "lifestyle" content.

It's a shame because CNET, surprisingly, has retained some really great editors that are as knowledgeable as they come in their respective domains. David Katzmaier and Brian Cooley come to mind.


Seems fine. Better delete bad content than destroy mid content by drowning it in an ocean of crap.


There are so many websites nowadays that should delete themselves altogether.


This smells like one content company throwing shade at another content company.


So, CNET decides to delete lots of useful information, just for better SEO. None of the other old tech news websites are trying to destroy their information for SEO, and they still get good rankings. This should be a good lesson to anyone else who is thinking of going this route.


Hopefully they'll get an artificial penalty for gaming the system.


That's not a behaviour we would like to encourage, right?


Half of the web is just there to reference itself and run ads.


Why wouldn't this just lower their domain authority?


Is it tho - or is it to ‘save the content from AI’


Now if only someone would go through and remove all the outdated tech support / troubleshooting content


In all seriousness, I thought cNet


IT : dementia :: finance : cancer


Why not make a GPT implementation rephrase every article once a week or so? Then it appears new from an SEO perspective.


This superstitious "prune old content" BS is commonly repeated by SEO know-nothings.


SEO is so fucked...


When did CNET become so thirsty?


That's why I moved to ChatGPT4 for most of my info queries. Google is badly compromised by SEO. Plus ChatGPT4 gives the answer right away, without having to go through multiple results and search within each page for the specific line I need.


Heck, sometimes that answer is even correct!


I have recently been forking off a subproject from a Git repo. After spending a lot of time messing around with it and getting into a lot of unforeseen trouble, I finally asked ChatGPT how to do it, and of course ChatGPT knew the correct answer all along. I felt like an idiot. Now I always ask ChatGPT first. These LLMs are way smarter than you would think.

GPT4 with WolframAlpha plugin even gave me enough information to implement Taylor polynomial approximation for Gaussian function (don't ask why I needed that), which would have otherwise taken me hours of studying if I could even solve it at all.

PS: GPT4 somehow knows even things that are really hard to find online. I recently needed the standard error, not of the mean but of the standard deviation. GPT4 not only understood my vague query but gave me a formula that is really hard to find online even if you already know the keywords. I know it's hard to find, because I went ahead and double-checked ChatGPT's answer via search.
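(For anyone curious, and worth double-checking against a stats text yourself: for normally distributed data the usual large-sample approximation is

    SE(s) \approx \frac{s}{\sqrt{2(n-1)}}

and the Maclaurin series behind a Taylor-polynomial approximation of the Gaussian is

    e^{-x^2/2} = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n}}{2^n \, n!}

truncated after however many terms the required accuracy allows.)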


So you implemented a polynomial approximation for a Gaussian function without understanding what you were doing (implying that if you wanted to do it yourself it would take hours of studying).

Good luck when you need to update and adjust it - this is the equivalent of copying/pasting a function from Stack Overflow.


I double-checked everything, but that's beside the point. I was replying to GGP's insinuation that ChatGPT is unreliable. In my experience, it's more likely to return correct results than the first page of search. Search results often resemble random rambling about tangentially related topics whereas ChatGPT gets its answer right on first try. ChatGPT understands me when I have only a vague idea of what I want whereas search engines tend to fail even when given exact keywords. ChatGPT is also way more likely to do things right than me except in my narrow area of expertise.


I use a tool for programming that's based on ChatGPT.

I find it most helpful when I am not sure how to phrase a query so that a direct search would find something. But I also found that in at least half the cases the answer is incomplete or even wrong.

The last one I remember explained in the text what functions or settings I could use, but the code example that it presented did not do what the text suggested. It really drove home the point that these are just haphazardly assembled responses that sometimes get things right by pure chance.

With questions like yours I would be very careful to verify that the solution is actually correct.


> don't ask why I needed that

But now I'm curious!


In the same way you can tell if a search result is "good", you can usually tell if what ChatGPT is telling you is truthful.

And you face the same problem when looking for something in a domain you are not an expert in - no way to tell if a web page is truthful and no way to tell if ChatGPT is right. ChatGPT just lets you make more mistakes more efficiently.

But for those cases where you kind of know the answer, ChatGPT is usually better than search.


Well, you say that flippantly, but if you ask it correctly, in most cases the answer is correct as well. You should obviously double check the solution, but that applies to anything, be it a Google search or a Wikipedia article.


Think of it like floating point logic.


A broken clock is right twice a day!


Right there with you. I've gotten so used to having it give me exactly the answer to my specific question that, when I must fall back to traditional search, it's noticeably unpleasant.


GPT always has an answer for you. Even if it's made up!


I tried using ChatGPT as a search engine, but it's too slow. With DuckDuckGo/Google you can just go to the domain, type, enter, and you have your answer. I haven't used it in a few months, but with ChatGPT you have to log in, pass various screenings, hope it's not down, and then finally get to where you can type.

With a regular search engine you've already found your answer by that time.


I have ChatGPT4 tab open most of the time, so it's always ready to go.


I think it has much more to do with Google gutting their own working algorithms than it does with SEO



