So Google's shitty search now economically incentivizes sites to destroy information.
Can there be any doubt that Google destroyed the old internet by becoming a bad search engine? Could their exclusion of most of the web be considered punishment for sites being so old and stable that they don't rely on Google for ad revenue?
I'll just assume you neglected to read TFA, because if you had, you would have discovered that it links to an official Google source that states CNET shouldn't be doing this.[1]
I could imagine CNET's SEO team got an average-rank goal instead of an absolute SEO traffic goal. So by removing low-ranked old pages, the average position of their search results moves closer to the top even though total traffic sinks. I've seen stuff like this happen at my own company as well, where a team's KPIs are designed such that they'll ruin absolute numbers in order to achieve their relative KPI goals, like getting an increase in conversion rates by just cutting all low-conversion traffic.
In general, people often forget that if your target is a ratio, you can attack the numerator or the denominator. Often the latter is the easier one to manipulate.
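To make the numerator/denominator point concrete, here's a toy sketch with made-up numbers (nothing here comes from CNET's or anyone's real data):

```python
# Toy illustration (made-up numbers): "improving" a conversion-rate KPI
# by shrinking the denominator instead of growing the numerator.

def conversion_rate(conversions: int, visits: int) -> float:
    return conversions / visits

# Baseline: broad traffic, including low-converting segments.
visits, conversions = 100_000, 2_000
print(f"before: {conversions} conversions, "
      f"rate = {conversion_rate(conversions, visits):.1%}")

# "Optimization": cut the low-converting 60% of traffic, which only
# contributed 400 conversions. The ratio KPI looks much better...
visits_after, conversions_after = 40_000, 1_600
print(f"after:  {conversions_after} conversions, "
      f"rate = {conversion_rate(conversions_after, visits_after):.1%}")

# ...even though absolute conversions fell from 2,000 to 1,600.
```

The rate jumps from 2.0% to 4.0% while the business actually loses conversions, which is exactly the kind of KPI gaming described above.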
Even if it's not a ratio. When any metric becomes a target it will be gamed.
My organization tracks how many tickets we have had open for 30 days or more. So my team started to close tickets after 30 days and let them reopen automatically.
Meanwhile that's not necessarily a bad outcome. In theory it makes the data better by focusing on deaths that might or might not have been preventable, rather than making every hospital look responsible for inevitable deaths.
Of course the actual behavior in the article is highly disturbing.
This is why KPIs or targets should NEVER be calculated values like averages or ratios. The team is then incentivized to do something hostile such as not promote the content as much so that the ratio is higher, as soon as they barely scrape past the impressions mark.
When deciding KPIs, Goodhart's law should always be kept in mind: when a measure becomes a target, it ceases to be a good measure.
It's really hard to not create perverse incentives with KPIs. Targets like "% of tickets closed within 72 hours" can wreck service quality if the team is under enough pressure or unscrupulous.
Sure they can, e.g. on-time delivery (or, even better, the share of shipments missing the promised delivery date) is a ratio. Or inventory turn rates, where you actually want people to attack the denominator.
Generally speaking, an easy solution is to attach another target to either the numerator or the denominator, a target that requires people to move that value in a certain direction. That might even be assigned to a different team than the one with goals on the ratio.
> Sure they can, e.g. on-time delivery (or, even better, the share of shipments missing the promised delivery date) is a ratio. Or inventory turn rates, where you actually want people to attack the denominator.
These are good in that they’re directly aligned with business outcomes but you still need sensible judgement in the loop. For example, say there’s an ice storm or heat wave which affects delivery times for a large region – you need someone smart enough to recognize that and not robotically punish people for failing to hit a now-unrealistic goal, or you’re going to see things like people marking orders as canceled or faking deliveries to avoid penalties or losing bonuses.
One example I saw at a large old school vendor was having performance measured directly by units delivered, which might seem reasonable since it’s totally aligned with the company’s interests, except that they were hit by a delay on new CPUs and so most of their customers were waiting for the latest product. Some sales people were penalized and left, and the cagier ones played games having their best clients order the old stuff, never unpack it, and return it on the first day of the next quarter - they got the max internal discount for their troubles so that circus cost way more money than doing nothing would have, but that number was law and none of the senior managers were willing to provide nuance.
Yeah, every part of this was a “don’t incentivize doing this”. I doubt anyone would ever be caught for that since there was nothing in writing but it was a complete farce of management. I heard those details over a beer with one of the people involved and he was basically wryly chuckling about how that vendor had good engineers and terrible management. They’re gone now so that caught up with them.
That only says that Google discourages such actions, not that such actions are not beneficial to SEO ranking (which is equal to the aforementioned economic incentive in this case).
So whose word do we have to go on that this is beneficial, besides anonymous "SEO experts" and CNET leadership (those paragons of journalistic savvy)?
Perhaps what CNET really means is that they're deleting old low quality content with high bounce rates. After all, the best SEO is actually having the thing users want.
In my experience SEO experts are the most superstitious tech people I ever met. One guy wanted me to reorder HTTP header fields to match another site. He wanted our minified HTML to include a linebreak just after a certain meta element just because some other site had it. I got requests to match variable names in minified JS just because Google's own minified JS had that name.
> In my experience SEO experts are the most superstitious tech people I ever met.
And some are the most data-driven people you'll ever meet. As with most people who claim to be experts, the trick is to determine whether the person you're evaluating is a legitimate professional or a cargo-culting wanna-be.
I’ve always felt there is a similarity to day traders or people who overanalyze stock fundamentals. There comes a time when data analysis becomes astrology…
> There comes a time when data analysis becomes astrology.
Excellent quote. It's counterintuitive but looking at what is most likely to happen according to the datasets presented can often miss the bigger picture.
This. It is often the scope and context that determine the logic. It is easy to build bubbles and stay comfy inside. Without revealing much: I asked a data scientist, whose job it is to figure out bids on keywords and essentially control how much $ is spent advertising something in a specific region, about negative criteria. As in, are you sure you wouldn't get this benefit even if you stopped spending the $? His response was "look at all this evidence that our spend caused this x% increase in traffic and y% more conversions" -- and that was 2 years ago. My follow-up question was: okay, now that the thing you advertised is popular, wouldn't it be the more organic choice in the market, and we can stop spending the $ there?
His answer was - look at what happened when we stopped the advertising in this small region in Germany 1.5 years ago!
My common sense validation question still stands. I still believe he built a shiny bubble 2 years ago, and refuses to reason about the wider context and second-order effects.
Leos are generally given the “heroic/action-y” tropes, so if you are, for example, trying to pick Major League Baseball players, astrology could help a bit.
Some of the most superstitious people I've ever met were also some of the most data-driven people I've ever met. Being data-driven doesn't exclude unconscious manipulation of the data selection or interpretation, so it doesn't automatically equate to "objective".
The data analysis I've seen most SEO experts do is like sitting by a highway, carefully timing the speed of each car, taking detailed notes on each car's appearance, then returning to the car factory and saying that all cars need to be red because the data says red cars are faster.
One SEO expert who consulted for a bank I worked at wanted us to change our URLs from e.g. /products/savings-accounts/apply by reversing them to /apply/savings-accounts/products on the grounds that the most specific thing about the page must be as close to the domain name as possible, according to them. I actually went ahead and changed our CMS to implement this (because I was told to). I'm sure the SEO expert got paid a lot more than I did as a dev. A sad day in my career. I left the company not long after...
Unfortunately though, this was likely good advice.
The Yandex source code leak revealed that keyword proximity to the root domain is a ranking factor. Of course, there's nearly a thousand factors and "randomize result" is also a factor, but still.
SEO is unfortunately a zero sum game so it makes otherwise silly activities become positive ROI.
I think you're largely correct, but Google isn't one person, so there may be somewhat emergent patterns that work from an SEO standpoint without a solid answer to "why". If I were an SEO customer I would ask for some proof, but that isn't the market they're targeting. There was an old saying in the tennis instruction business that there was a lot of 'bend your knees, fifty please'. So lots of snake-oil salesmen, but some salesmen sell stuff that works.
That's a bit out there, but Google has mentioned in several different ways that pages and sites have thousands of derived features and attributes they feed into their various ML pipelines.
I assume Google is turning all the site's pages, JS, inbound/outbound links, traffic patterns, etc., into large numbers of sometimes obscure datapoints like "does it have a favicon", "is it a unique favicon?", "do people scroll past the initial viewport?", "does it have this known uncommon attribute?".
Maybe those aren't the right guesses, but if a page has thousands of derived features and attributes, maybe they are on the list.
So, some SEOs take the idea that they can identify sites that Google clearly showers with traffic, and try to recreate as close a list of those features/attributes as they can for the site they are being paid to boost.
I agree it's an odd approach, but I also can't prove it's wrong.
Is minified "code" still "source code"? I think I'd say the source is the original implementation pre-minification. I hate it too when working out how something is done on a site, but I'm wondering where we fall on that technicality. Is the output of a pre-processor still considered source code even if it's not machine code? These are not important questions but now I'm wondering.
Source code is what you write and read, but sometimes you write one thing and people can only read it after your pre-processing. Why not enable pretty output?
Plus I suspect minifying HTML or JS is often cargo cult (for small sites who are frying the wrong fish) or compensating for page bloat
It doesn't compensate for bloat, but it reduces bytes sent over the wire, bytes cached in between and bytes parsed in your browser for _very_ little cost.
You can always open dev tools in your browser and have an interactive, nicely formatted HTML tree there with a ton of inspection and manipulation features.
In my experience the bigger difference is usually made by not making it bloated in the first place... as well as progressive enhancement, nonblocking load, serving from a nearby geolocation, etc. I see projects minify all the things by default, while it should literally be the last measure, with the least impact on TTI.
It does stuff like tree shaking as well; it's quite good. If your page is bloated, it makes it better. If your page is not bloated, it makes it better.
I suppose the difference is that someone debugging at that level will be offered some sort of "dump" command or similar, whereas someone debugging in a browser is offered a "View Source" command. It's just a matter of convention and expectation.
If we wanted browsers to be fed code that for performance reasons isn't human-readable, web servers ought to serve something that's processed way more than just gzipped minification. It could be more like bytecode.
Let's be honest, a lot of non-minified JS code is barely legible either :)
For me I guess what I was getting at is that I consider source the stuff I'm working on - the minified output I won't touch, it's output. But it is input for someone else, and available as a View Source so that does muddy the waters, just like decompilers produce "source" that no sane human would want to work on.
I think semantically I would consider the original source code the "real" source if that makes sense. The source is wherever it all comes from. The rest is various types of output from further down the toolchain tree. I don't know if the official definition agrees with that though.
>If we wanted browsers to be fed code that for performance reasons isn't human-readable,
Worth keeping in mind that "performance" here refers to saving bandwidth costs as the host. Every single unnecessary whitespace or character is a byte that didn't need to be uploaded, hence minify and save on that bandwidth and thus $$$$.
The performance difference on the browser end between original and minified source code is negligible.
Last time I ran the numbers (which admittedly was quite a number of years ago now), the difference between minified and unminified code was negligible once you factored in compression because unminified code compresses better.
What really adds to the source code footprint is all of those trackers, adverts and, in a lot of cases, framework overhead.
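If anyone wants to re-run that kind of comparison on their own bundle, a rough sketch along these lines works; the file names are placeholders, and results will vary by codebase and compressor:

```python
# Rough sketch for checking how much minification still saves once gzip
# is in play. File names are placeholders -- point them at your own bundle.
import gzip
from pathlib import Path

def sizes(path: str) -> tuple[int, int]:
    raw = Path(path).read_bytes()
    return len(raw), len(gzip.compress(raw, compresslevel=9))

for name in ("app.js", "app.min.js"):  # hypothetical files
    raw, gz = sizes(name)
    print(f"{name}: {raw:,} bytes raw, {gz:,} bytes gzipped")

# Compare the two gzipped numbers: that delta, not the raw byte counts,
# is roughly what crosses the wire on a typical gzip-enabled deployment.
```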
The way I see it, if someone needs to minify their JavaShit (and HTML?! CSS?!) to improve user download times, that download time was horseshit to start with and they need to rebuild everything properly from the ground up.
Isn't this essentially what WebAssembly is doing? I'll admit I haven't looked into it much, as I'm crap with C/++, though I'd like to try Rust. Having "near native" performance in a browser sounds nice, curious to see how far it's come.
Minifying HTML is basically just removing non-significant whitespace. Run it through a formatter and it will be readable.
If you dislike unreadable source code I would assume you would object to minifying JS, in which case you should ask people to include sourcemaps instead of objecting to minification.
I mean, isn't that precisely why open source advocates advocate for open source?
Not to mention, there is no need to "minify" HTML, CSS, or JavaShit for a browser to render a page, unlike compiled code, which is more or less a necessity for such things.
Minifying code for browsers greatly reduces the amount of bandwidth needed to serve web traffic. There's a good reason it's done.
By your logic, there's actually no reason to use compiled code at all, for almost anything above the kernel. We can just use Python to do everything, including run browsers, play video games, etc. Sure, it'll be dog-slow, but you seem to care more about reading the code than performance or any other consideration.
I already alluded[1] to the incentives for the host to minify their JavaShit, et al., and you would have a point if it wasn't for the fact that performance otherwise isn't significantly different between minified and full source code as far as the user would be concerned.
I'm not talking about the browser's performance, I'm talking about the network bandwidth. All that extra JS code in every HTTP GET adds up. For a large site serving countless users, it adds up to a lot of bandwidth.
Somebody mentioned negligible/deleterious impacts on bandwidth for minified code in that thread, but they seemed to have low certainty. If you happen to have evidence otherwise, it might be informative for them.
>In computing, a compiler is a computer program that translates computer code written in one programming language (the source language) into another language (the target language).
According to the Open Source Definition of the OSI it's not:
> The program must include source code [...] The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor [...] are not allowed.
The popular licenses for which this is a design concern are careful to define source code to mean "preferred form of the work for making modifications" or similar.
Google actually describes an entirely plausible mechanism of action here at [1]: old content slows down site crawling, which can cause new content to not be refreshed as often.
Sure, one page doesn’t matter, but thousands will.
>Removing it might mean if you have a massive site that we’re better able to crawl other content on the site. But it doesn’t mean we go “oh, now the whole site is so much better” because of what happens with an individual page.
Parsing this carefully, to me it sounds worded to give the impression removing old pages won’t help the ranking of other pages without explicitly saying so. In other words, if it turns out that deleting old pages helps your ranking (indirectly, by making Google crawl your new pages faster), this tweet is truthful on a technicality.
In the context of negative attention where some of the blame for old content being removed is directed toward Google, there is a clear motive for a PR strategy that deflects in this way.
The tweet is also clearly saying that deleting old content will increase the average page rank of your articles in the first N hours after they are published. (Because the time to first crawl will decrease, and the page rank is effectively zero before the first crawl).
CNet is big enough that I’d expect Google to ensure the crawler has fresh news articles from it, but that isn’t explicitly said anywhere.
And considering all the AI hype, one could have hoped that the leading search engine crawler would be able to "smartly" detect new contents based on a url containing a timestamp.
Apparently not if this SEO trick is really a thing...
EDIT: sorry, my bad, it's actually the opposite. One could expect that a site like CNET would include a timestamp and a unique ID in their URLs in 2023. This seems to be the "unpermalink" of a recent CNET article.
I did the tweet. It is clearly not saying anything about the "average page rank" of your articles because those words don't appear in the tweet at all. And PageRank isn't the only factor we use in ranking pages. And it's not related to "gosh, we could crawl your page in X hours therefore you get more PageRank."
It's not from Google PR. It's from me. I'm the public liaison for Google Search. I work for our search quality team, not for our PR team.
It's not worded in any way intended to be parsed. I mean, I guess people can do that if they want. But there's no hidden meaning I put in there.
Indexing and ranking are two different things.
Indexing is about gathering content. The internet is big, so we don't index all the pages on it. We try, but there's a lot. If you have a huge site, similarly, we might not get all your pages. Potentially, if you remove some, we might get more to index. Or maybe not, because we also try to index pages as they seem to need to be indexed. If you have an old page that doesn't seem to change much, we probably aren't running back every hour to it in order to index it again.
Ranking is separate from indexing. It's how well a page performs after being indexed, based on a variety of different signals we look at.
People who believe in removing "old" content generally aren't thinking that's going to make the "new" pages get indexed faster. They might think that maybe it means more of their pages overall from a site could get indexed, but that can include "old" pages they're successful with, too.
The key thing is, if you go to the CNET memo mentioned in the Gizmodo article, it says this:
"it sends a signal to Google that says CNET is fresh, relevant and worthy of being placed higher than our competitors in search results."
Maybe CNET thinks getting rid of older content does this, but it's not. It's not a thing. We're not looking at a site, counting up all the older pages and then somehow declaring the site overall as "old" and therefore all content within it can't rank as well as if we thought it was somehow a "fresh" site.
That's also the context of my response. You can see from the memo that it's not about "and maybe we can get more pages indexed." It's about ranking.
Suppose CNET published an article about LK99 a week ago, then they published another article an hour ago. If Google hasn’t indexed the new article yet, won’t CNET rank lower on a search for “LK99” because the only matching page is a week old?
If by pruning old content, CNET can get its new articles in the results faster, it seems this would get CNET higher rankings and more traffic. Google doesn’t need to have a ranking system directly measuring the average age of content on the site for the net effect of Google’s systems to produce that effect. “Indexing and ranking are two different things” is an important implementation detail, but CNET cares about the outcome, which is whether they can show up at the top of the results page.
>If you have a huge site, similarly, we might not get all your pages. Potentially, if you remove some, we might get more to index. Or maybe not, because we also try to index pages as they seem to need to be indexed.
The answer is phrased like a denial, but it’s all caveated by the uncertainty communicated here. Which, like in the quote from CNET, could determine whether Google effectively considers the articles they are publishing “fresh, relevant and worthy of being placed higher than our competitors in search results”.
You're asking about freshness, not oldness. I.e., we have systems that are designed to show fresh content, relatively speaking -- a matter of days. It's not the same as "this article is from 2005 so it's old, don't show it." And it's also not what is generally being discussed in getting rid of "old" content. And also, especially for sites publishing a lot of fresh content, we get that really fast already. It's an essential part of how we gather news links, for example. And and and -- even with freshness, it's not "newest article ranks first" because we have systems that try to show the original "fresh" content, or sometimes a slightly older piece is still more relevant. Here's a page that explains more ranking systems we have that deal with both original content and fresh content: https://developers.google.com/search/docs/appearance/ranking...
Ha, I actually totally agree with you, apparently my comment gave the wrong impression. I was just arguing with the GP's comment which was trying to (fruitlessly, as you point out) read tea leaves that aren't even there.
While CNET might not be the most reliable side, Google telling content owners to not play SEO games is also too biased to be taken at face value.
It reminds me of Apple's "don't run to the press" advice when hitting bugs or app review issues. While we'd assume Apple knows best, going against their advice totally works and is by far the most efficient action for anyone with enough reach.
Considering how much paid-for unimportant and unrelated drivel I now have to wade through every time I google to get what I am asking for, I doubt very much that whatever is optimal for search-engine ranking has anything to do with what users want.
Do the engineers at Google even know how the Google algorithm actually works? Better than SEO experts who spend their time meticulously tracking the way that the algorithm behaves under different circumstances?
My bet is that they don't. My bet is that there is so much old code, weird data edge cases and opaque machine-learning models driving the search results, Google's engineers have lost the ability to predict what the search results would be or should be in the majority of cases.
SEO experts might not have insider knowledge, but they observe in detail how the algorithm behaves, in a wide variety of circumstances, over extended periods of time. And if they say that deleting old content improves search ranking, I'm inclined to believe them over Google.
Maybe the people at Google can tell us what they want their system to do. But does it do what they want it to do anymore? My sense is that they've lost control.
I invite someone from Google to put me in my place and tell me how wrong I am about this.
Once upon a time, Matt Cutts would come on HN give a fairly knowledgeable and authoritative explanation of how Google worked. But those days are gone and I'd say so are days of standing behind any articulated principle.
I work for Google and do come into HN occasionally. See my profile and my comments here. I'd come more often if it were easier to know when there's something Google Search-related happening. There's no good "monitor HN for X terms" thing I've found. But I do try to check, and sometimes people ping me.
The engineers at Google do know how our algorithmic systems work because they write them. And the engineers I work with at Google looking at the article about this found it strange anyone believes this. It's not our advice. We don't somehow add up all the "old" pages on a site to decide a site is too "old" to rank. There's plenty of "old" content that ranks; plenty of sites that have "old" content that rank. If you or anyone wants our advice on what we do look for, this is a good starting page: https://developers.google.com/search/docs/fundamentals/creat...
There is. Which is why I specifically talked only about writing for algorithmic systems. Machine learning systems are different, and not everyone fully understands how they work, only that they do and can be influenced.
It's really hard to get a deep or solid understanding of something if you lack insider knowledge.
The search algorithm is not something most Googlers have access to, but I assume they observe what their algorithm does constantly, in a lot of detail, to measure what their changes are doing.
I think in this context, saying "it's not a thing that Google doesn't like old content" just means that Google doesn't penalize sites as a whole for including older pages, so deleting older pages won't help boost the site's ranking.
This is not the same as saying that it doesn't prioritize newer pages over older pages in the search results.
The way it's worded does sound like it could imply the latter thing, but that may have just been poor writing.
That Googler here. I do use Google! And yeah, I get sometimes people want older content and we show fresher content. We have systems designed to show fresher content when it seems warranted. You can imagine a lot of people searching about Maui today (sadly) aren't wanting old pages but fresh content about the destruction there.
But we do show older content, as well. I find often when people are frustrated they get newer content, it's because of that crossover where there's something fresh happening related to the query.
If you haven't tried, consider our before: and after: commands. I hope we'll finally get these out of beta status soon, but they work now. You can do something like before:2023 and we'd only show pages from before 2023 (to the best we can determine dates). They're explained more here:
https://twitter.com/searchliaison/status/1115706765088182272
Maybe not related to the age of the content but more content can definitely penalize you. I recently added a sitemap to my site, which increased the amount of indexed pages, but it caused a massive drop in search traffic (from 500 clicks/day to 10 clicks/day). I tried deleting the sitemap, but it didn't help unfortunately.
100K+. Mostly AI and user generated content. I guess the sudden increase in number of indexed pages prompted a human review or triggered an algorithm which flagged my site as AI generated? Not sure.
It seems incredibly short-sighted to assume that just because these actions might possibly give you a small bump in SEO right now, they won't have long-term consequences.
If CNET deletes all their old articles, they're creating a situation where most links to CNET from other sites lead to error pages (or at least, pages with no relevant content on them), and even if that isn't currently a signal used by Google, it could become one.
Technically, you're supposed to do a 410 or a 404, but when some of the pages being deleted have those extremely valuable old high-reputation backlinks, that's just wasteful, so I'd say it's better to redirect to the "next best page", like maybe a category or something, or the homepage as the last resort. Why would it be problematic? Especially if you do a sweep and only redirect pages that have valuable backlinks.
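As a sketch of what that redirect-or-410 sweep might look like server-side (a minimal Flask example with hypothetical slugs and category paths; the same idea applies to any server or CDN rule):

```python
# Minimal sketch (Flask, hypothetical routes): deleted articles either
# 301 to the "next best page" or return 410 when nothing related exists.
from flask import Flask, redirect, abort

app = Flask(__name__)

# Hypothetical mapping of removed article slugs to their closest
# surviving category page, built during the content-pruning sweep.
REMOVED = {
    "2009-best-netbooks": "/laptops/",
    "2011-flip-camera-review": "/cameras/",
}

@app.route("/article/<slug>")
def old_article(slug: str):
    target = REMOVED.get(slug)
    if target:
        return redirect(target, code=301)  # preserve backlink value
    abort(410)  # gone on purpose, nothing comparable to point at

if __name__ == "__main__":
    app.run()
```

The 301 keeps whatever value the old backlinks carry pointed at something relevant, while the 410 is the honest signal for pages with no worthwhile replacement.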
I was only talking about mass redirecting 404s to the homepage, which I've heard is not great, I think what you're saying is fine -- but that sounds like more of a well thought out strategy.
It's not that we discourage it. It's not something we recommend at all. Not our guidance. Not something we've had a help page about saying "do this" or "don't do this" because it's just not something we've felt (until now) that people would somehow think they should do -- any more than "I'm going to delete all URLs with the letter Y in them because I think Google doesn't like the letter Y."
People are free to believe what they want, of course. But we really don't care if you have "old" pages on your site, and deleting content because you think it's "old" isn't likely to do anything for you.
Likely, this myth is fueled by people who update content on their site to make it more useful. For example, maybe you have a page about how to solve some common computer problem and a better solution comes along. Updating a page might make it more helpful and, in turn, it might perform better.
That's not the same as "delete because old" and "if you have a lot of old content on the site, the entire site is somehow seen as old and won't rank better."
Your recommendations are not magically a description of how your algorithm actually behaves. And when they contradict, people are going to follow the algorithm, not the recommendation.
Yeah, Google's statement seems obviously wrong. They say they don't tell people to delete old content, but then they say that old content does actually affect a site in terms of its average ranking and also what content gets indexed.
What the Google algorithm encourages/discourages and what Google's blog or documentation encourages/discourages are COMPLETELY different things. Most people here are complaining about the former, and you keep responding about the latter.
No one has demonstrated that simply removing content that's "old" means we think a site is "fresh" and therefore should do better. There are people who perhaps updated older content reasonably to keep it up-to-date and find that making it more helpful that way can, in turn, do better in search. That's reasonable. And perhaps that's gotten confused with "remove old, rank better" which is a different thing. Hopefully, people may better understand the difference from some of this discussion.
This is another problem of the entire SEO industry. Websites trust these SEO consultants and growth hackers more than they trust information from Google itself. Somehow, it becomes widely accepted that the best information on Google ranking is from those third parties but not Google.
I'm not sure it is so cut and dried. Who is more likely to give you accurate information on how to game Google's ranking: Google themselves, or an SEO firm? I suspect that Google has far less incentive to provide good information on this than an SEO firm would.
Google will give you advice on how to not be penalized by Google. They won’t give you advice on how to game the system in your favor.
The more Google helps you get ahead, the more you end up dominating the search results. The more you dominate the results, the more people will start thinking to come straight to you. The more people come straight to you, the more people never use Google. The fewer people use Google, the less revenue Google generates.
I would like to know what dollar amount Google makes on people typing things like "Amazon" into Google search and then clicking the first paid result to Amazon.
It’s the same on YouTube - the majority of the people who work there seem to have no idea how “the algorithm” actually works - yet they still produce all sorts of “advice” on how to make better videos.
There's an easy proof that those SEO consultants have a point: find a site that, according to Google's criteria, should never rank, yet has rocketed to the top of the search rankings in its niche within a couple of months. That's a regular thing, and it proves that there are ways to rank on Google that Google won't advise.
It could be premature to place fault with the SEO industry. Think about the incentives: Google puts articles out, but an SEO specialist might have empirical knowledge from working for a variety of web properties. It's not that I wouldn't trust Google's articles, but specialists might have discovered undocumented methods for giving a boost.
The good ones will share the data/trends/case studies that would support the effectiveness of their methods.
But the vast majority are morons, grifters, and cargo culters.
The Google guidance is generally good and mildly informative but there’s a lot of depth that typically isn’t covered that the SEO industry basically has to black box test to find out.
> Websites trust these SEO consultants and growth hackers more than they trust information from Google itself.
That's because websites' goals and Google's goals are not aligned.
Websites want people to engage with their website, view ads, or do something else (e.g. buy a product, vote for a party). If old content does not serve those goals, or detracts from them, they and the SEO experts say it should go because it's dragging the rest down.
Google wants all the information and for people to watch their ads. Google likes the long tail; Google doesn't care if articles from the 90's are outdated because people looking at it (assuming the page runs Google ads) or searching for it (assuming they use Google) means impressions and therefore money for them.
Google favors quantity over quality, websites the other way around. To oversimplify and probably be incorrect.
Google actively lies on an infinite number of subjects. And SEO is a completely adversarial subject where Google has an interest in lying to prevent some behaviors. While consultants and "growth hackers" are very often selling snake oil, that doesn't make Google an entity you can trust either.
Hey, don't do that. That's bad. But if you keep doing it, you'll get better SEO. No, we won't do anything to prevent this from being a way to game SEO.
"Google says you shouldn't do it" and "Google's search algorithm says that you should do it" can both be true at the same time. The official guidance telling you what to do doesn't track with what the search algorithm uses to decide search placement. Nobody's going to follow Google's written instructions if following the instructions results in a penalty and disobeying them results in a benefit.
They say "Google doesn't like "old" content? That's not a thing!"
But who knows, really? They run things to extract features nobody outside of Google knows that are proxies for "content quality". Then run them through pipelines of lots of different not-really-coordinated ML algorithms.
Maybe some of those features aren't great for older pages? (broken links, out-of-spec html/js, missing images, references to things that don't exist, practices once allowed now discouraged...like <meta keywords>, etc). And I wouldn't be surprised if some part of overall site "reputation" in their eyes is some ratio of bad:good pages, or something along those lines.
I have my doubts that Google knows exactly what their search engine likes and doesn't like. They surely know which ads to put next to those maybe-flawed results, though.
I don’t know man, I read it but I’ve learned to judge big tech talk purely by their actions and I don’t think there’s a lot of incentive built into their system that supports this statement.
My understanding is that if you have a very large site, removing pages can sometimes help because:
- There is an indexing "budget" for your site. Removing pages might make reindexing of the rest of the pages faster (see the sitemap sketch after this list for the standard way sites hint at freshness).
- Removing pages that are cannibalising each other might help the main page for those keywords rank higher.
- Google is not very fond of "thin wide" content. Removing low quality pages can be helpful, especially if you don't have a lot of links to your site.
- Trimming the content of a website could make it easier for people and Google to understand what the site is about and help them find what they are looking for.
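On the indexing-budget point, the sitemaps.org protocol's <lastmod> field is the standard hint a site gives crawlers about which URLs changed recently, which is one of the few levers a site has over how its crawl budget gets spent. A minimal generator sketch, with hypothetical URLs:

```python
# Minimal sitemap generator (hypothetical URLs). <lastmod> is the
# sitemaps.org hint about which pages changed recently.
from xml.sax.saxutils import escape

PAGES = [  # (url, last-modified date), hypothetical
    ("https://example.com/reviews/new-phone", "2023-08-10"),
    ("https://example.com/reviews/old-phone", "2015-03-02"),
]

def sitemap(pages) -> str:
    entries = "\n".join(
        f"  <url><loc>{escape(url)}</loc><lastmod>{date}</lastmod></url>"
        for url, date in pages
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n</urlset>"
    )

print(sitemap(PAGES))
```

Whether trimming pages or tuning a sitemap actually changes how Google spends its crawl budget on a given site is, of course, exactly the kind of thing being argued about in this thread.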
Google search ranking involves lots of neural networks nowadays.
There is no way the PR team making that tweet can say for sure that deleting old content doesn't improve rank. Nobody can say that for sure. The neural net is a black box, and its behaviour is hard to predict without just trying it and seeing.
Speaking from experience as someone who is paid for SEO optimization: there's a list a mile long of things Google says "don't work" or that you "shouldn't do" but that in fact work very well, and everyone is doing them.
I remember these kinds of sources, right from the inside in the Matt Cutts era 15+ years ago, encouraging and advising so many things which were later proven not to be the case. I wouldn't take this at face value just because it was written by the official guide.
Google says so many things about SEO which are not true. There are some rules which are 100% true and some which they just hope their AI thinks they are true.
There's an awful lot of SEO people on Twitter that claim to be connected to Google, and the article he links on the Google domain as a reference doesn't say anything on the topic that I can find. I'm reluctant to call that an official source.
Journalist here. Danny Sullivan works for Google, but spent nearly 20 years working outside of Google as a fellow journalist in the SEO space before he was hired by the company.
1st paragraph is correct, 2nd not quite -- Matt Cutts was a distinguished engineer (looking after web spam at Google) who took on the role of the search spokesperson. It's that role Danny took over as "search liaison".
No. But it's also complicated, as Matt did things beyond web spam. Matt worked within the search quality team, and he communicated a lot from search quality to the outside world about how Search works. After Matt left, someone else took over web spam. Meanwhile, I'd retired from journalism writing about search. Google approached me about starting what became a new role of "public liaison of search," which I've done for about six years now. I work within the search quality team, just as Matt did, and that type of two-way communication role he had, I do. In addition, we have an amazing Search Relations team that also works within search quality, and they focus specifically on providing guidance to site owners and creators (my remit is a bit broader than that, so I deal with more than just creator issues).
I'm the source. I officially work for Google. The account is verified by X. It's followed by the official Google account. It links to my personal account; my personal account links back to it. I'm quoted in the Gizmodo story that links to the tweet. I'm real! Though now perhaps I doubt my own existence....
He claims to work for Google on X, LinkedIn, and his own website. I am inclined to believe him because I think he would have received a cease and desist by now otherwise.
He claims to work for Google as search "liaison". He's a PR guy. His job is to make people think that Google's search system is designed to improve the internet, instead of it being designed to improve Google's accounting.
I actually work for our search quality team, and my job is to foster two-way communication between the search quality team and those outside Google. When issues come up outside Google, I try to explain what's happened to the best I can. I bring feedback into the search quality team and Google Search generally to help foster potential improvements we can make.
Yes. All this is saying is that you do not write any code for the search algorithms. Do you know how to code? Do you have access to those repos internally? Do you read them regularly? Or are you only aware of what people tell you in meetings about it?
Your job is not to disseminate accurate information about how the algorithm works but rather to disseminate information that google has decided it wants people to know. Those are two extremely different things in this context.
I work on these kind of vague "algorithm" style products in my job, and I know that unless you are knee deep in it day to day, you have zero understanding of what it ACTUALLY does, what it ACTUALLY rewards, what it ACTUALLY punishes, which can be very different from what you were hoping it would reward and punish when you build and train it. Machine learning still does not have the kind of explanatory power to do any better than that.
No. I don't code. I'm not an engineer. That doesn't mean I can't communicate how Google Search works. And our systems do not calculate how much "old" content is on a site to determine if it is "fresh" enough to rank better. The engineers I work with reading about all this today find it strange anyone thinks this.
Probably not; anyone can claim to work for these companies with no repercussions, because is it a crime? Maybe if they're pricks that lower these companies' public opinion (libel), but even that requires a civil suit.
But lying on the internet isn't a crime. I work for Google on quantum AI solutions in adtech btw.
He’s been lying a long time, considering that he’s kept the lie up that he’s an expert on SEO for nearly 30 years at this point, and I’ve been following his work most of that time.
Did you notice that nowadays a lot of websites have a lot of uninteresting drivel giving a "background" to whatever the thing was you were searching for before you get to read (hopefully) the thing you were searching for?
People discovered that Google measures not only how much time you stay on a webpage but also how much you scroll to define how interesting a website is. So now every crappy "tech tips" website that has an answer that fits in a short paragraph now makes you scroll two pages before you get the thing you actually wanted to read.
I search for "how to do X", and instead of just showing me how to do it, which might take 30 seconds, they put a ton of fluff and filler in to the video to make it last 5 minutes.
Typical video goes something like:
0 - Ads, if you're not using an ad blocker
1 - Intro graphics/animation
2 - "Hi, I'm ___, and in this video I'm going to show you how to do X"
3 - "Before I get in to that, I want to tell you about my channel and all the great things I do."
4 - "Like and subscribe."
5 - "Now let's get in to it..."
6 - "What is X?"
7 - "What's the history of X?"
8 - "Why X is so great."
9 - finally... "How to do X"
Fortunately you can skip around, but it's still a bunch of useless fluff and garbage content to get to the maybe 30 seconds of useful information.
What makes this worse is that there's an increasing trend for how-tos to be only available on video.
As someone who learns best by reading, I'm already at a disadvantage with video to begin with. To make it worse, instructional videos tend to omit a great deal of detail in the interest of time. Then when you add nonsense like you're pointing out, it makes the whole thing a frustrating and pointless activity.
Notice how Google frequently offers YouTube recommendations at the top of things like mobile results, or those little expandable text drop-downs? My guess is it's because clicking that lets them serve a high-intent video ad at a higher CPM than a search text ad.
As someone who is Deaf, many of these videos are not accessible. They rely on shitty Google auto captions which aren't accurate at least 25% of the time.
It gets even better when you subscribe to YouTube Premium.
You get no random ad content which just cut into the feed at will, which makes for a somewhat better experience. But there's the inevitable "NordVPN will guarantee your privacy", "<some service here which has no ads and was made by content creators so you don't have to look at ads if you subscribe but hey all our content is on YT but with ads and here is an ad>" ad.
There is no escape. I actually pay for YT premium and it's SO much better than being interrupted by ads for probiotic yoghurt or whatever. I know there are a couple of plugins out there which I have not tried (I think nosponsors is one of them) but I really don't think there is any escape from this stuff.
Same. I think it's a mix of that and this "presenter voice" everyone thinks they have to use. My ADHD brain doesn't focus on it well because it's too slow, so it's useless to me, but all my life I've been told that when presenting I should speak slowly and articulately, while the reality is that watching anyone speak that way drives me nuts.
What's great about writing is that readers can go at their own pace. When speaking, you have to optimize for your audience and you probably lose more people by being too fast vs. the people you lose by talking too slow. I have to say I appreciate YouTubers that go a million miles an hour (hi EEVBlog). As a native speaker of English, I can keep up. But you have to realize, most people in the world are not native speakers of English.
(The converse is; whenever I turn on a Hololive stream I'd say that I pick up 20% of what they're saying. If they talked slower, I would probably watch more than every 3 months. But, they rightfully don't feel the need to optimize for non-native speakers of Japanese.)
This is why I hate the trend of EVERYTHING being made into a video. Simple things that mean I have to watch 4-5min of video and have my eardrums blasted by some dubstep intro so some small quiet voice can say "Hi guys, have you ever wanted to do _x_ or _y_ more easily?" before finally just giving me the nugget of information I came for.
I wish more stuff were available in just text + screenshots..
Some of those people seem to be speaking so slow that it is excruciating to listen to them. When I find someone who speaks at a normal speed and I have to slow the video down, they usually have more interesting things to say.
That said, tinkering before and after youtube has been two different worlds. I really like having video to learn hands-on activities. I just wrapped up some mods to a Rancilio Silvia, and I noticed my workflow was videos, how-to guides and blog posts, broader electrical information documentation, part specific manuals / schematics, and my own past knowledge. I felt very efficient having been through the process before, and knowing when to lean on which resource. But the videos are by far the best resource to orient myself when first jumping in to the project, and thus save me a lot of time.
I mean, people are bad at editing. "I didn't have time to write a short letter, so I've written a long letter instead." I don't think it's a conspiracy.
I definitely write super long things when I consciously make the decision to not spend much time on something. Meanwhile, I've been working on a blog post for the better part of 2 years because it's too long, but doesn't cover everything I want to discuss. If you want people to retain the content, you have to pare it down to the essentials! This is hard work.
> I mean, people are bad at editing. "I didn't have time to write a short letter, so I've written a long letter instead." I don't think it's a conspiracy.
Making a long video isn't like writing a rambling letter. It takes work to make 10 minutes of talk out of a 1-minute subject. And mega-popular influencers do this, not just newbs who haven't learned how to edit properly yet.
"Tell me everything you know about Javascript in 1 minute." Figuring out what not to say is the hard part of that question. Rambling into the camera for an hour is easy.
But we're not talking about people taking 10 minutes to summarize a complex topic. We're talking about people taking 10 minutes to deliver 30 seconds of simple, well-delineated info.
This is something that happens a lot. I'll Google a narrow technical question that can be answered in three lines of text--there's literally nothing more of value to say about it--and all the top hits are 5+ minute videos. That doesn't happen by accident.
There's certainly a wide gamut of creators out there, and the handymen I've seen have videos like you mentioned. I imagine the complaints above are about the far more commercialized channels that do in fact model their videos after YT's algorithm.
It doesn't have to be a literal conspiracy. Why do you reject the possibility that people and organizations are reacting to very real and concrete financial incentives which clearly exist?
Certainly there are a lot of people that stretch their videos out to put in more ads, but not everyone with a long video is playing some metrics optimization game. They're just bad at editing.
I think the situation that people run into is something like "how do I install a faucet" and they are getting someone who does it for a living explaining it for the first time. Explaining it for the first time is what makes it tough to make a good video. Then there are other things like "top 10 AskReddit threads that I feel like stealing from this week" and those are too long because they are just trying to get as much ad revenue as possible. The original comment was about howtos specifically, and I think you are likely to run into a lot of one-off channels in those cases.
Sponsorblock is great for cleaning this crap up. Besides skipping sponsor segments, I have it set to autoskip unpaid/self promotion, interaction reminders, intermissions/intros, and endcards/credits, with filler tangents/jokes set to manual skip.
One thing I really wish sponsorblock would add is the ability to mask off some part of the screen with an option to mute. More and more channels are embedding on-screen ads, animations, and interaction "reminders" that are, at best, distracting.
uBlock Origin is my extension of choice for this. It makes it really easy to block those distractions, and there are quite a few pre-existing filters to choose from.
I think you missed what the poster is asking for. They want to block a portion of the video itself. For example when you watch the news on TV, there is constant scrolling text on the bottom of the screen with the latest headlines. They want to block stuff like that.
I believe you're right! I can't think of any extension that would be able to modify the picture of a stream itself in real-time. What came to my mind was the kind of 'picture-in-picture' video that some questionable news sites display as you scroll down an article, usually a distracting broadcast which is barely related to the news itself.
There’s a channel that explains how to pronounce words that is a particularly bad offender. They talk about the history of the word up front, but without ever actually saying the word. They only pronounce it in the last few seconds, right as the thumbnail overlay appears.
Yeah, I usually skip up to half of a typical video until they get to the point, sometimes more. People feel like just getting down to business is somehow wrong; they need to first tell the story of their life and how they came to the decision of making this video and why I may want to watch it. Dude, I am already watching it, stop selling it and start doing it!
> Did you notice that nowadays a lot of websites have a lot of uninteresting drivel giving a "background" to whatever the thing was you were searching for before you get to read (hopefully) the thing you were searching for?
I know you came here looking for a recipe for boiled water, but first here's my thesis on the history and cultural significance of warm liquids.
It’s interesting that water is often boiled in metal pots. There are several kinds of metal pots. Aluminum, stainless steel, and copper are often used for pots to boil water in.
Water boils in pots with different metals because only temperature matters for boiling water. If the water is 100°C at sea level, it will boil.
Also, "jump to recipe" could be a simple anchor tag that truly skips the drivel. But for some reason it executes ridiculous JavaScript that animates the scroll just slowly enough to trigger every ad's intersection observer along the way.
This has been going on for about a decade now. This alone has caused me to remove myself from Google's products and services. They have unilaterally made the internet worse.
And the cherry on top is that they also own the browser. It helps to thwart attempts to "scam" Google Analytics and track those poor a-holes that don't use it.
Are the employees at Google working on Search aware of how bad search results have become in the past year or two? Literally almost everyone I know, inside and outside of tech, has noticed a significant downgrade in quality from Google search results. And a lot of it is due to artificially inflated SEO techniques.
We've been diligently working to improve the results through things like our helpful content system, and that work is continuing. You can read about some of it in a recent post here (and it also describes the Perspectives feature that's live on mobile): https://blog.google/products/search/google-search-perspectiv...
"We've been diligently working to improve the results" was the response to the question of "Are the employees at Google working on Search aware of how bad search results have become in the past year or two?" I thought that was a clear response.
To be more explicit, yes, we're aware that there are complaints about the quality of search results. That's why we've been working in a variety of ways, as I indicated, to improve those.
We have continued to build our spam fighting systems, our core ranking systems, our systems to reward helpful content. We expanded our product reviews system to cover all types of reviews, as this explains: https://status.search.google.com/incidents/5XRfC46rorevFt8yN...
I am of the opinion that it's just the internet becoming more spammy and unhelpful rather than Google search becoming bad. Every Tom and his mom seems to have a blog/website which they don't even write themselves. Most of the content on the internet is now for entertainment rather than purpose or knowledge. So, I do wonder if it's just the state of the internet these days. As a layman, these days I just go directly to Wikipedia/Reddit/YouTube rather than searching on Google.
The Internet is becoming spammy and bad because of Google's rules for ranking. The fact that Google favors newer content and longer pages with filler text is why people are making the content lower quality.
My strong impression is that in the last two years a couple of changes were rolled out to search that sent it straight into the sewer -- search seemed to be tweaked to crassly, crudely put any product name above anything else in the results. But since then, it seems like quality has crept back up again. Simple product terms still get top billing, but more complicated searches aren't nerfed.
So it seems the search quality team exists but gets locked in the closet by advertising periodically.
I know you can't verify anything directly but maybe we could set a system of code for you to communicate what's really happening...
You are also talking to someone who is on the PR team. This term gets thrown out a lot but in this case it is factually true, you are literally talking to a shill. I mean no disrespect to Danny but you are not going to get an honest and straightforward answer out of him.
If you think I am exaggerating, try to prompt him to see if you can get him to acknowledge that Google's current systems incentivize SEO spam. See if he passes the Turing test.
Don't kick the messenger. It's already good that someone (allegedly) from a department related to the situation could give some input. No need to dump all your frustrations on them
Facebook doesn't have guidance telling content creators to publish conspiracy theories, but their policies are willfully optimized to promote it. Take responsibility for the results of your actions like an adult.
We don't have a policy or any guidance saying to remove old content. That said, we absolutely recognize a responsibility to help creators understand how to succeed and what not to do in terms of Google Search. That's why we publish lots of information about this (none of which says "old content is bad"). A good place to review the information we provide is our Search Essentials page: https://developers.google.com/search/docs/essentials
> Are people on this site really convinced that an L3 Google engineer can flick the "Fix Google" switch on the search engine?
No, it's just when someone speaks on behalf of the company with the terms "we," they are typically addressed with "you." That doesn't mean we think they're the CEO. Are you unfamiliar with this concept? I can send you an SEO guide on it.
You're missing the point; this guy has zero power over what Google does, so publicly berating him is not going to accomplish anything.
And anyways, the sentence "Take responsibility for the results of your actions like an adult" actually does imply he has some personal responsibility here. It's not helpful to the discussion and it's rude.
If you choose to throw yourself onto a public forum doing PR for a company doing dumb things, and you also insult everyone's intelligence by lying to them, people are gonna be a little rude.
That's a little hyperbolic, don't you think? Do you even hear yourself? I fully understand Google hate but directing it at one person who is literally just doing their job (and hasn't lied to anyone despite your allegation) is childish and counterproductive. Save that for Twitter.
Danny has been here since 2008. Your account was created in 2022.
And also, "people" aren't being rude, you are. Own your actions.
No, I'm not being hyperbolic. There is one reason for the SEO algorithm to reward longer articles, and that's ad revenue. To paint it as anything else is lying. And you opened up this conversation extremely rudely with "OMG are you so dumb you think he owns Google."
The age of the articles was discussed in the original article, but when I was speaking to this engineer, I was talking about the length of articles, which is the main criticism levied against Google SEO. I'm aware you didn't read any of it.
I followed the thread just fine. You accused me of being rude (it was someone else) and also accused the other commenter of lying. Neither of which are true.
You did that, not me. It's you who seem to be having a problem with understanding the thread.
You said L3 so I was curious. I looked up the guy's LinkedIn [0] and honestly an L3 engineer would have a lot more context about Google's search. Danny, what do you even do?
Before Google, Danny Sullivan was a well respected search engine blogger/journalist. As far as I know, he isn't an engineer. There's no need to be rude.
I work for our search quality team, directly reporting to the head of that team, to help explain how search works to people outside Google and to bring concerns and feedback back into the team so we can look at ways to improve. I came to the position about six years ago after retiring from writing about search engines as a journalist, having explained how they work to people from 1996 onward.
Making statements that you wish publishers wouldn't do various things doesn't change the actual incentives that the real-world ranking algorithms create for them.
I mean, saying that you should design pages for people rather than the search engine clearly hasn't shut down the SEO industry.
It doesn't matter if your guidance discourages it, your SEO algorithm is encouraging it. What you call "helpful" in your post is what is financially helpful to Google, not what's helpful to me.
There's no denying Google encourages long rambling nonsense over direct information
No one has demonstrated that getting rid of "old" content somehow makes the rest of the site "fresh" and therefore ranks better. What's likely the case is that some people have updated content to make it more useful -- more up-to-date -- and the content being more helpful might, in turn, perform better. That's a much different thing than "if you have a lot of old content, the entire site is somehow old." And if you read the CNET memo, you'll see these points get confused.
But there's the rub, you're not making content more helpful. You're making it longer and more useless so we have to scroll down more so Google can rake in more ads. The fact that you're calling it more "helpful" is insidious. That's why garbage SEO sites are king on the internet right now. It's the same thing you guys do with Youtube, where you decreased monetization for videos under a certain length. Now every content creator is encouraged to artificially inflate the length of their video for more ads.
You're financially rewarding people for hiding information.
I think the theory of Google's death is that they are "killing the golden goose." The idea is that they are killing off all the independent websites on the internet. That is, all the sites besides Facebook/Instagram/Twitter/NetFlix/Reddit/etc. that people access directly (either through an app or a bookmark) and which (barring Reddit) block GoogleBot anyway.
These are all the sites (like CNET) that Google indexes which are the entire reason to use search. They are having their rankings steadily eroded by an ever-rising tide of SEO spam. If they start dying off en masse and if LLMs emerge as a viable alternative for looking up information, we may see Google Search die along with them.
As for why their revenues are still increasing? It's because all the SEO spam sites out there run Google Ads. This is how we close the loop on the "killing the golden goose" theory. Google uses legitimate sites to make their search engine a viable product and at the same time directs traffic away from those legitimate sites towards SEO spam to generate revenue. It's a transformation from symbiosis/mutualism to parasitism.
Edit: I forgot to mention the last, and darkest, part of the theory. Many of these SEO spam sites engage in large-scale piracy by scraping all their content off legitimate sites. By allowing their ads to run on these sites, Google is essentially acting as an accessory to large-scale, criminal, commercial copyright infringement.
Directs traffic away not to generate revenue but to generate revenue faster this quarter in time for the report. They could make billions without liquifying the internet but they would make billions slowly
> Google uses legitimate sites to make their search engine a viable product and at the same time directs traffic away from those legitimate sites towards SEO spam to generate revenue.
[Disclosure: Google Search SWE; opinions and thoughts are my own and do not represent those of my employer]
Why do you assume malicious intent?
The balance between search ranking (Google) and search optimization (third-party sites) is an adversarial, dynamic game played between two sides with inverse incentives, taking place on an economic field (i.e. limited resources). There is no perfect solution; there’s only an evolutionary act-react cycle.
Do you think content spammers spend more or less resources (people, time, money) than Google’s revenue? So then the problem becomes how do you win a battle with orders of magnitude less people, time, and money? Leverage, i.e., engineering. You try your best and watch the scoreboard.
Some people think Google is doing a great job; some think we couldn’t be any worse. The truth probably lies across a spectrum in the middle. So it goes with a globally consumed product.
Also, note, Ads and Search operate completely independently. There are no signals going from Ads to Search, or vice versa, to inform rankings; Search can't even touch a lot of the Ads data, and Ads can't touch Search data. Which makes your theory misinformed.
Not GP, but to me, admittedly a complete non-expert on search, there is so much low-hanging fruit that would have been picked if search result quality were anywhere on Google's radar that it is really difficult not to assume malicious intent.
Some examples:
- Why is Pinterest flooding the image results with absolute nonsense? How difficult would it be to derank a single domain that manages to game Google's algorithm so thoroughly?
- Why is there no option for me to blacklist domains from the search results? Are there really challenges there that couldn't be practically solved with a couple of minutes of thinking?
- Does Google seriously claim they can't differentiate between Stack Overflow and the content-copying rip-off SEO spam sites?
The issues you pointed out might be due to a company policy of not manually manipulating search results and leaving it all to the algorithm. It can be argued that this leads them to improve their algorithm, although at this point I don't think any algorithm other than a good and big LLM/classifier-transformer can solve the ranking problem, and that is probably not economical or something. But OTOH they manually ban domains they deem to be not conformant to the views of the Party. (not CCP, 1984)
> Also, note, Ads and Search operate completely independently
That’s the mistake. They should be talking. Sites that engage in unethical SEO to game search rankings should be banned from Google’s ad platform. Why aren’t they? Because Google is profiting from the arrangement.
There doesn't have to be any malicious intent, just an endless chase for increased profit next quarter. SEO spam has more ads, and thus generates more income for Google. Even if Ads and Search operate "completely independently", there must be a person in the corporate hierarchy who has control over both and could push the products to better synergize and make that KPI tick up.
Actually deranking sites which feature more than three Google Ads banners would improve search quality (mainly by making sites get to the point rather than padding a simple answer into an essay like an 8th grader at an exam) - but it would reduce Ads income so you cannot do it, no matter how independent you claim to be.
I think dismissing the relationship and impact adtech and search continue to have on web culture is an incredibly pointy-headed misstep. It's the sort of willful oversight that someone makes when their career relies on something being true.
Unless you have a clear view from leadership of what they want the web to be, and are willing to disclose it in detail, there's not much to add by saying you work in Search.
Their search results are declining rapidly in quality. "<SEARCH QUERY> reddit" is one of their most common searches. Their results are filled with SEO spam and bots now.
At some point a competitor will emerge. The tech crowd will notice it and begin to use it. Then it will go widespread.
I suspect that the revenue increases have more to do with the addition of new users in developing markets rather than actual value added. Once all potential users have been reached, Google will have to actually improve their product.
I'm reminded of the 00's era joke that Microsoft could burn billions of dollars, pivot to becoming a vacuum cleaner manufacturer, and finally make something that doesn't suck.
I don't think Google dying would be good (lots of things would have to migrate infra suddenly), but the adtech being split off into something else would certainly be a welcome turn of events, IMO. I'm tired of seeing promising ideas killed because they only made 7-figure numbers in a spreadsheet where it'd have been viable on its own somewhere it wasn't a rounding error.
Before someone suggests a new search engine where the ranking algorithm is replaced with AI, I would like to propose a return to human-curated directories. Yahoo had one, and for a while, so did Google. It was pre-social-media and pre-wiki, so none of these directories were optimized to take advantage of crowdsourcing. Perhaps it's time to try again?
> Before someone suggests a new search engine where the ranking algorithm is replaced with AI, I would like to propose a return to human-curated directories. Yahoo had one, and for a while, so did Google. It was pre-social-media and pre-wiki, so none of these directories were optimized to take advantage of crowdsourcing.
False. Google Directory (and many other major-name, mostly now defunct, web directories) was powered by data from DMOZ, which was crowdsourced (and kind of still is, through Curlie [0], though enough fairly core links are dead or without content that it's pretty obviously not a thriving operation, even if some parts of the website show updates as recently as today). Also, it was not pre-Wiki: WikiWikiWeb was created in 1995, DMOZ in 1998. It was pre-Wikipedia, but Wikipedia wasn't the first Wiki.
Actually, several static snapshots exist (a benefit of open licensing), despite the fact that attempts to fork and continue it have not been so successful. In addition to the one upthread, there are also:
One issue there was that human-curated directories are everywhere. Hacker News is one. Reddit is another. And during the Yahoo times, directories were made everywhere and all over the place. Which one is authoritative? There are too many of them out there.
That said, in NL a lot of people's home pages were for a long time set to startpagina.nl, which was just that: a cool directory of websites that you could submit sites to. It still seems to exist, too.
I don't think we need any "AI" in the modern sense of that word. It would be an improvement to bring google back to its ~2010 status.
Not sure if the kagi folks are willing to share, but I get the impression that pagerank, tf-idf and a few heuristics on top would still get you pretty far.
Add some moderation (known-bad sites that e.g. repost stackoverflow content) and you're already ahead of what google gives me.
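To make that concrete: below is a toy tf-idf scorer in Python (the documents and query are made up), just to illustrate how far a handful of lines gets you - and why the moderation layer matters, since pure term statistics happily reward the keyword-stuffed copycat.

    import math
    from collections import Counter

    # Hypothetical toy corpus: one genuine page, one keyword-stuffed copycat.
    docs = {
        "stackoverflow": "python list comprehension example with explanation",
        "seo-mirror": "python python list list comprehension comprehension buy vpn deal",
    }
    query = "python list comprehension".split()

    def tf_idf_score(query_terms, doc_text, corpus):
        tf = Counter(doc_text.split())               # raw term counts in this doc
        n_docs = len(corpus)
        score = 0.0
        for term in query_terms:
            df = sum(1 for text in corpus.values() if term in text.split())
            idf = math.log((1 + n_docs) / (1 + df)) + 1   # smoothed idf
            score += tf[term] * idf
        return score

    for name, text in docs.items():
        print(name, round(tf_idf_score(query, text, docs), 2))

The stuffed page wins on raw tf-idf, which is the point: the interesting work is in the link signals, heuristics, and blocklists layered on top, not in the base scoring.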
I feel like they presume I'm a gullible person they need to protect who is just on the Internet for shopping and watching entertainment.
Wasn't the point of them tracking us so much to customize and cater our results? Why have they normalized everything to some focus group persona of Joe six-pack?
***
Let's try an experiment
Type in "Chicago ticket" which has at least 4 interpretations, a ticket to travel to Chicago, a citation received in Chicago, A ticket to see the musical Chicago and a ticket to see the rock band Chicago.
For me I get the Rock band, citation, baseball and mass transit ticket in that order.
I'm in Los Angeles, have never been to Chicago, don't watch sports, and don't listen to the rock band. Google should know this with my location, search and YouTube history but it apparently doesn't care. What do you get?
Or it knows way too much. An alternative explanation could also be:
It knows you're in LA and did not look up "Chicago flight", so you probably aren't looking for flights there.
Chicago musical isn't playing in LA so probably not the right kind.
Probably why most people get parking ticket listed higher. It would be interesting to see the results in a city where the band, team or musical has an event soon.
Google tracks you to "customize search results" and I even have a 100% Google'd phone (Pixel), but when I'm searching for restaurants it still shows me stuff from Portland, Oregon instead of Portland, Maine. This despite literally having my "Home" marked in my Google account as Portland, Maine.
Going deep into personalization on Google.com (answering queries using your signed-in data across other Google properties) feels like high risk, low reward. In a post-GDPR, right-to-be-forgotten environment, they know they have targets on their backs. Is super-deep personalization really worth getting slapped with megafines?
Where you'll see integrations like this used to be Assistant, and is now Bard. Both of which have lawyer-boggling EULAs and a brand they can sacrifice if need be.
Yes, some things should be beyond economic incentives. Destroying the historical record, for instance. We have plenty of precedent around that, now that we realised it's bad.
This is interesting, as I am actually doing the same thing with a site I have: I noticed my crawl budget has shrunk, especially this year, and fewer new articles are being indexed.
I suspect this is a long-term play for Google to phase out search and replace it with Bard. Think about it: all these articles are doing now is writing a verbose version of what Bard gives you directly, unless it's new human content.
Google has in essence stolen all their information by scraping it and storing it in a database for its LLM, and is offering its knowledge of this directly to users, so in a way this is akin to Amazon selling its own private-label products.
An article about reduced quality was pretty popular on HN a few years ago, about how Google results look like ads. But I believe we have hit a new low recently. Perhaps that is true for the overall quality of publications on the net. The number of either approved news sites without significant content or outright click farms is immense, even for topics that should yield real results. A news site filter would already help a lot, but even then the search seems to react only to buzzwords - sometimes even to terms you didn't search for at all that are often associated with said buzzwords.
Even if we pretend for a moment that your statement, that google's search is "shitty", is universally accepted as truth, you can't blame this one on Google.
People have been committing horrifying atrocities in the name of SEO for years. I've seen it firsthand. And it spectacularly backfired each time.
This could very probably be yet another such case.
Can Google tell the difference between old relevant information and old irrelevant (or outdated) information? I'm not seeing any evidence of that. A search engine is not a subject matter expert of everything on the Internet, and it shouldn't be.
In all fairness, there is some old information I would love to see disappear eventually. Nothing is quite as frustrating as having a question and finding that all the tutorials are for a version of the software that is 15 years old and behaves completely differently from the new one.
This is some bullshit. It’s bad enough that a lot of sites with content going back 10-20 years have linkrot or have simply gone offline. But I am at a loss for words that they’re disappearing content on purpose just for SEO rankings.
If this is what online publishing has come to we have seriously screwed up.
In fairness to them. If one of their top ways of getting traffic is being marked as less relevant because of older articles, what do you want them to do? Just continue to lose money because Google can't rank them appropriately?
Who says this is actually improving their ranking? SEO is smoke and mirrors. I wouldn't be surprised if this whole exercise was actually counterproductive for them, and they just don't realize it.
Not always smoke. If you have enough resources, you can do research and test hypotheses. Also, sometimes there are leaks, like the leak of Yandex source code, which disclosed which factors were used for ranking.
Yep. Also, the days of the "Advanced Search" page have passed, but Google still has time range options under the "Tools" button near the top of the results page. If they're giving the user the option to filter results by time, then it's pretty goofy for the algorithm to de-rank a site in results where the default of "Any time" was selected in the query just because the site has old articles in the index.
I'm afraid that, as so often with any topic, the vast majority of users never or rarely use the time filter, and so everything gets optimized for the users too dumb to search properly.
In this case I'm not too keen on blaming users for this when the option is sorta buried in the current design. The Tools button isn't prominent, the current value of the time range option is completely hidden when it's on the default of "Any time", and the search query doesn't change if you alter the time range option. (And, to my knowledge, there's not a search query incantation for specifying the time range.)
If the current UI is a reflection of what PMs at Google want users to do, then they don't really care if the user uses, or even knows about, the option to filter by time. So I don't think it flies to point at the user and say, "you're holding it wrong".
And if the content isn't shared somewhere (typically on a non-Google property), then is it even relevant anymore? All the Googlers who defined search relevance outside of freshness have left for other opportunities.
I'd rather see them address it with a more complicated approach that preserves the old content. For example, they could move it to an unindexed search archive while they rework it to have more useful structures, associations and other taxonomy.
The approach to just move it to another domain and be done suggests it's not well enough researched for what an organization their size could manage.
> In fairness to them [...] what do you want them to do? Just continue to lose money
How come CNET is getting a free pass here to do something obnoxious because they're losing money, but Google is getting the stick for (maybe) doing something obnoxious to avoid losing money?
I think Google do a lot of bad stuff but let's at least assign blame proportionally.
It's my understanding that doing what you propose will still use up the crawl budget, because the bot will have to download the page and parse it to understand that it is no-indexed.
Then you could also block those pages in robots.txt, no? (You do need to do both though, as otherwise pages can be indexed based on links, without being crawled.)
Exactly. This should be solvable without actually deleting the pages. I assume they're only removing articles with near-zero backlinks, so a noindex,nofollow should generally be fine, but if crawl budget is an issue robots.txt and sitemap can help.
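As a rough sketch of that combination - assuming a Flask app and a hypothetical /archive/ URL prefix for the old articles - the noindex signal and the crawl-budget-saving Disallow would look something like this:

    from flask import Flask, request

    app = Flask(__name__)

    # Hypothetical: everything under /archive/ stays online for readers,
    # but gets a noindex,nofollow header for crawlers that do fetch it.
    @app.after_request
    def mark_archive_noindex(response):
        if request.path.startswith("/archive/"):
            response.headers["X-Robots-Tag"] = "noindex, nofollow"
        return response

    # Disallow is what saves crawl budget (see the caveat upthread about
    # blocked pages still being indexable from external links).
    @app.route("/robots.txt")
    def robots():
        body = "User-agent: *\nDisallow: /archive/\n"
        return body, 200, {"Content-Type": "text/plain"}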
The real answer is that there's a non-zero cost to maintain these pages, and even more so if robots.txt entries and such have to be maintained for them as well. And if they have no monetary benefit, or even potentially a detriment, it makes more sense for them from a business perspective to just get rid of them. Unfortunately.
I assume GP means to split Search but keep search ads, called "AdWords", together with it. The third-party advertisement part of Google was/is called AdSense. The two were always run separately AFAIK.
I thought that the ads on search were a nice-to-have, but that the real value of search was the profile it allows Google to build about a person, which is then used by all the ad systems Google operates.
Split that off, and search is all of a sudden producing a whole lot less value, since the profiling value cannot be meaningfully extracted by other companies.
The article isn't super-clear about what's happening and, for most purposes, just dumping stuff on the Wayback Machine is probably not that different from deleting it even if the bits are still "somewhere." A few years back I copied any of my CNET stuff I cared about to my own site and, in general, that's a strategy I've followed with a number of sites as I don't expect anything to continue to be hosted or at least be findable indefinitely.
I kinda wish there were some way to store every page I ever visit automatically and index it locally for easy search. Then when I want to look up e.g. the guy who had the popular liquid oxygen fire website back in the 90s it would be easy. But I also fear it would be used against me somehow too.
A really awesome on-the-ground website for information about specifics in collectibles was allexperts.com (now defunct). They got bought by about.com, which then got bought by a Thought Company(?), and they straight up deleted all of the 10+ years of people asking common questions about collectibles and getting solid answers. Poof.
Surely Google determines "fresh, relevant" content according to whatever has recently been published, which this doesn't change. If anything, doesn't Google consider sites with a long history of content with tons of inbound links as more authoritative and therefore higher-ranked?
This baffles me. It baffles me why this would be successful SEO -- and assuming that it actually isn't, it baffles me why CNET thinks it would be.
The theory I've heard is related to 'crawl budget'. Google is only going to devote a finite amount of time to indexing your site. If the number of articles on your site exceeds that time, some portion of your site won't be indexed. So by 'pruning' undesirable pages, you might boost attention on the articles you want indexed. No clue how this ends up working in practice.
Google's suggestion isn't to delete pages, but maybe to mark some pages with a noindex header.
Google crawls the entire page, not just the subset of text that you, a human, recognize as the unchanged article.
It’s easy to change millions of pages once a week with on-load CMS features like content recommendations. Visit an old article and look at the related articles, most read, read this next, etc widgets around the page. They’ll be showing current content, which changes frequently even if the old article text itself does not.
I'm pretty sure Google is smart enough to recognize the main content of a page, and ignore things like widgets and navigation. That's Search Engine 101.
It’s possible they examined the server logs for requests from GoogleBot and found it wasting time on old content (this was not mentioned in the article but would be a very telling data point beyond just “engagement metrics”).
There’s some methodology to trying to direct Google crawls to certain sections of the site first - but typically Google already has a lot of your URLs indexed and it’s just refreshing from that list.
It doesn't have to fetch every article (statistical sampling can give confidence intervals), and it doesn't have to fetch the full article: doing a "HEAD /" instead of a "GET /" will save on bandwidth, and throwing in ETag / If-Modified-Since / whatever headers can get the status of an article (200 versus 304 response) without bothering with the full fetch.
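For illustration, a conditional re-check along those lines might look like this in Python with the requests library (the URL is a placeholder, and it assumes the server actually emits ETag/Last-Modified):

    import requests

    url = "https://example.com/old-article"  # placeholder

    # Initial crawl: remember the validators the server handed back.
    first = requests.get(url, timeout=10)
    etag = first.headers.get("ETag")
    last_modified = first.headers.get("Last-Modified")

    # Later re-check: send the validators back. A 304 has no body, so there
    # is nothing to re-download, re-parse, or re-index.
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified

    recheck = requests.get(url, headers=headers, timeout=10)
    print(recheck.status_code)  # 304 if unchanged, 200 with a full body otherwise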
If the content is literally the same, the crawler should be able to use If-Modified-Since, right? It still has to make a HTTP request, but not parse or index anything.
This is not correct. It's up to the server, controlled by the application, to send that or other headers - similar to sending a <title> tag. The headers take priority, and, similar to what another person said, they will do a HEAD request first and not bother with a GET request for the content.
> The theory I've heard is related to 'crawl budget'. Google is only going to devote a finite amount of time to indexing your site.
Once a site has been indexed once, should it really be crawled again? Perhaps Google should search for RSS/Atom feeds on sites and poll those regularly for updates: that way they don't waste time doing a full site scrape multiple times.
Old(er) articles, once crawled, don't really have to be babysat. If Google wants to double-check that an already-crawled site hasn't changed too much, they can do a statistical sampling of random links on it using ETag / If-Modified-Since / whatever.
The SiteMap, which was invented by Google and designed to give information to crawlers, already includes last-updated info.
No need to invent a new system based on RSS/Atom, there is already an actually existing and in-use system based on SiteMap.
So, what you suggest is already happening -- or at least, the system is already there for it to happen. It's possible Google does not trust the last modified info given by site owners enough, or for other reasons does not use your suggested approach, I can't say.
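For reference, reading the <lastmod> hints out of a standard sitemap is trivial for a crawler; a minimal sketch (sitemap URL is a placeholder, namespace per sitemaps.org):

    import requests
    import xml.etree.ElementTree as ET

    SITEMAP_URL = "https://example.com/sitemap.xml"  # placeholder
    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    root = ET.fromstring(requests.get(SITEMAP_URL, timeout=10).content)
    for url_el in root.findall("sm:url", NS):
        loc = url_el.findtext("sm:loc", namespaces=NS)
        lastmod = url_el.findtext("sm:lastmod", default="(no lastmod)", namespaces=NS)
        print(lastmod, loc)

Whether a crawler should trust site-supplied lastmod values is, of course, exactly the problem raised below.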
I can imagine a malicious actor changing an SEO-friendly page to something spammy and not SEO-friendly. Since the ETag and Last-Modified validators are supplied by the server, they can be manipulated to hide the change.
Even if that rule were true, why wouldn't everything in, say, the top NNN internet sites get an exemption? It is the Internet's most-hit content; why would it not be exhaustively indexed?
Alternatively, other than ads, what is changing on a CNN article from 10 years ago? Why would that still be getting daily scans?
Probably bad technology detecting a change. Things like current news showing up beneath the article, which change whenever a new article is added. I've seen this happen on quite a few large websites. It might be easier to just drop old articles than to spend the time fixing whatever they use to determine whether a page has changed. You would think a site like CNET wouldn't have to deal with something like that, but sometimes these sites that have been around for a long time run some seriously outdated tech.
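For what it's worth, the usual mitigation is cheap: fingerprint only the extracted article body so sidebar and widget churn doesn't register as a change. A minimal sketch, assuming beautifulsoup4 is installed (the "article" selector is a made-up assumption; real sites need per-site extraction rules):

    import hashlib
    from bs4 import BeautifulSoup

    def article_fingerprint(html: str) -> str:
        # Hash only the main article text, so "read next" modules, ad slots,
        # and trending-story widgets changing doesn't look like new content.
        soup = BeautifulSoup(html, "html.parser")
        main = soup.select_one("article") or soup.body or soup  # site-specific selector
        text = " ".join(main.get_text(separator=" ").split())
        return hashlib.sha256(text.encode("utf-8")).hexdigest()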
That's a good point about the static nature of some pages. Is there any way to tell a crawler: crawl this page, but after this date, don't crawl it again - just keep anything you previously crawled?
Google is paying Wikipedia through "Wikimedia Enterprise." If Wikipedia weren't able to sucker people into thinking that they're poverty-stricken, Google would probably prop it up like they do Firefox.
If I were establishing a "crawl budget", it would be adjusted by value. If you're consistently serving up hits as I crawl, I'll keep crawling. If it's a hundred pages that will basically never be a first page result, maybe not.
Wikipedia had a long tail of low-value content, but even the low-value content tends to be among the highest value for its given focus. e.g., I don't know how many people search "Danish trade monopoly in Iceland", and the Wikipedia article on it isn't fantastic, but it's a pretty good start[0]. Good enough to serve up as the main snippet on Google.
Purely speculating, Wikipedia has a huge number of inbound links (likely many more than CNet or even than more popular sites) which crawler allocation might be proportionate to. Even if it only crawled pages that had a specific link from an external site, that would be enough for Google to get pretty good coverage of Wikipedia.
It could be better to opt those articles out of the crawler. Unless that's more effort. If articles included the year and month in the URL prefix, I would disallow /201* instead.
In a major site redesign a couple years ago, we dropped 3/4 of our old URLs, and saw a big improvement in SEO metrics.
I know it doesn’t make sense and that Google says it is not necessary. But it clearly worked for us.
I think a fundamental truth about Google Search is that no one understands how it actually works anymore, including Google. They announce search algorithm updates with specific goals… and then silently roll out tweaks, more updates, etc. when the predicted effect doesn’t show up.
I think the idea that Google is in control and all the SEOs are just guessing, is wrong. I think it’s become a complex enough ML system that now all anyone can do is observe and adjust, including Google.
I have noticed some articles (and not just "Best XXX of 202Y" articles) that seem to always update their "Updated on" date which Google unhelpfully picks up and shows in search results leading me to think the page is much more recent than it is.
> It baffles me why this would be successful SEO -- and assuming that it actually isn't, it baffles me why CNET thinks it would be.
If the content deleted is garbage, why wouldn't it help? No clue on CNET's overall quality, but I don't have a favorable image of it. Just had a look at their main page and that did not do it any favors.
Perhaps sites with a small ratio of new:total content would be downranked --- but I really don't think that makes sense because that's going to be the case for any long-established site.
Google also might be at fault for making images on the web lower quality. Several years ago, Google announced that page load speed would affect ranking. Google's tool, PageSpeed Insights, gave recommendations on improving load speed. But it also recommended lowering the quality of JPEG images to a level where artifacts become visible. So instead of proper manual testing (using eyes, not a mathematical formula) on a large set of images, some Google employee simply wrote down a recommended compression level off the top of their head, and this forced webmasters to degrade image quality below any acceptable level.
So it doesn't matter whether the photographer or illustrator worked hard to make a beautiful image; Google's robotic decision, based on some lifeless mathematical formula, crossed out their efforts.
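For anyone curious, the trade-off is easy to reproduce yourself; with Pillow (filenames are placeholders), re-encoding the same photo at a few quality settings shows how quickly the bytes drop off versus when artifacts start to show:

    import os
    from PIL import Image  # assumes Pillow is installed

    img = Image.open("photo.jpg").convert("RGB")  # placeholder input

    for quality in (85, 70, 50):        # compare file sizes, then judge by eye
        out = f"photo_q{quality}.jpg"
        img.save(out, "JPEG", quality=quality, optimize=True)
        print(quality, os.path.getsize(out), "bytes")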
Yes, this is made much worse by Google's "smartphone" crawler on PageSpeed Insights being an emulated, very low-end Moto G phone (10 years old?), downclocked to 25% of its original (already very slow) CPU speed, with a network that maxes out at around 1 Mbit/s and 150 ms of added latency.
Makes it incredibly difficult to have nice imagery above the fold at least.
I've seen a couple news sources that are altering their publish dates to show near the top of news feeds. Google will announce "3 hours old", despite being weeks old.
Google should be massively down ranking sites that do this. Also if a site has a huge historical archive, that should be a positive indicator of site quality, not a negative one.
Wow, I thought that was just me. The elation of googling for a niche Reddit topic and finding a seemingly very recent result, only to reach the sad realization that the content is 11 years old and likely not relevant anymore.
Even worse when this happens, and you find that it's an old post you made. Happened to me more than once. I never realized this was intentional on reddit's part; I just assumed Google was broken.
Even if updates are mostly minor corrections or batch updates of boilerplate, the capability exists to rewrite any part or all of a story if there's no way to see past versions from third parties beyond a single cache on archive.is or possibly archive.org.
Archival navigation and visualization should be deeply integrated into a user-centric, privacy browser.
Yea, I've noticed that as well. It's really annoying when you search for something new but it links to an article that says it's a few months old when it's in fact years old.
I'm not certain they are altering publish dates; it is probably an error on Google's part. I've seen my own sites in the search results with wrong publishing dates that don't make any sense.
I adore archive.org. I'm worried though it is becoming somewhat of a load bearing element of civilization, given the importance of shared and accurate history. We need redundancy.
~~I'm also worried about the deletion of old pages on archive because new owners of a domain update the robots.txt file to disallow it, which I've heard wipes the entire archive.org history of that domain. I hope that gets addressed.~~
Yeah wait, what? I hope that's incorrect. Updating robots.txt to disallow should only omit content from that point onward... It shouldn't be retroactive. What if there's a new owner of a respective domain, for example?
It isn't, unfortunately. I really wish that, instead of wiping DECADES of history, it only applied to content archived from the day of the domain's registration onward. I think that would be slightly more reasonable, but I imagine they simply don't have access to such data.
The Internet Archive is not really known for deleting anything. In many different postings across the years, their founder and employees have mentioned items being taken "out" of the wayback machine, not "deleted" out of it. I don't think you have anything to worry about.
It's absolutely essential and irreplaceable for the web archive, and that's why I was pretty angry that you guys decided to pick a fight with the big publishers over "loaning" ebooks that could have gotten the whole site killed.
It was possibly a worthwhile fight for someone to have, but not for the site that hosts the Wayback Machine. Separation of concerns, my friends...
Bad. Antithetical to both Google's original ideals and the early 'netizen goals.
Google's deteriorating performance shouldn't result in deleting valuable historical viewpoints, journalistic trends and research material just to raise your newly AI-generated sh1t to the top of the trash fire.
> Bad. Antithetical to both Google's original ideals and the early 'netizen goals.
At this point, we should all realize that every ideal and slogan spouted by tech companies is just marketing. We were just too young and naive to know any better. 'Don't be evil'. In hindsight, that should have set off alarm bells.
Imagine, if you will, a utopian world where a critical service such as finding anything is not dominated by one (1) entity but an actual number - such as ten (10). Sci-fi novels describe this hypothetical market structure as a "competitive market".
In this utopian arrangement, users of search services are more-or-less evenly distributed among different search providers, enjoying a variety of different takes on how to find stuff.
Search providers, continues the sci-fi imagination, keep innovating and differentiating themselves to keep an edge over competition and please their users.
Producers of content, on the other hand, cannot assume much about what the inventive and aggressively competitive group of search providers will accentuate to please their users. So they focus on... improving the quality of their content, which is what they do best anyway.
It's a win-win for users and content producers. Alas, search service providers have to actually work for their money. This is slightly discomforting to a few, but not the end of the world.
One cannot but admire the imagination of such authors. What bizarre universes they keep inventing.
Search is an inherently centralizing and monopolizing industry. The one with the biggest index can show the best results and ads, thus has the biggest budget, and the one with the biggest budget has the biggest index.
Since everyone's reacting to the headline and hasn't read The Fine Article... Please let me call attention to the linked tweet from Google explicitly saying don't do this.
> Are you deleting content from your site because you somehow believe Google doesn't like "old" content? That's not a thing! Our guidance doesn't encourage this. Older content can still be helpful, too.
And? I really do not get how a tweet from Google contradicts the fact that their search engine is hot garbage that incentivises spam, centralised silos and "SEO tricks" like what CNET is doing.
I don't have any agenda, it just seems like important context. c|net is doing something stupid and apparently it isn't even going to accomplish their goal. Maybe this statement from Google will help convince other companies not to do the same thing.
FWIW, I'm as mad about Google quality going downhill as anyone's. The problem of sites doing stupid things for SEO purposes goes back at least 20 years though.
SEO is not reliant on, and is often contradictory to, statements like the one you quoted. No one who takes SEO seriously looks at what Google says. They look at how Google behaves.
Source: used to professionally offer SEO services.
SEO is a scam run by con artists. Google's worth as a search engine is its ability to rank pages by quality. SEO tries to fake quality or trick Google into ranking objectively bad sites higher.
Red Ventures is trying to make CNET worse, with this and the AI-written stories. Google should react by delisting all of CNET.
SEO won a long time ago. Google fights a hopeless war against legions of soulless locusts.
For all its resources it’s incapable of improving the situation. Since it only understands proxies of quality and truth, and these things can be manufactured and industrialized, they are incapable of winning.
Blogspam and fraud consistently outrank the original, and changes to the rules frustrate legitimate sites more than SEO spam. If anything these frequent changes increase the demand for SEO, not make it harder. You just wind up with more.
Until they figure out some kinda digital pesticide for SEO spam, the situation will continue to get worse.
To a large extent yes, but I'd wager a lot of sites have been improved by people reading about ranking and SEO as a lot of it is just better incentives for good/bad behavior (e.g. punishing copy pasting, rewarding relevant keyword usage).
The quality of Google's search results has been in steady decline for years. Google's worth as a search engine is mostly the value it accrues to advertisers with pinpoint user targeting at this point.
Information on the internet should be in whatever format best suits the topic. The format that best serves the users looking for that information.
And search engines should learn to interpret that information in its various formats, in order to best connect those searching for information with those providing it. Yes, the search engine should adapt to the information and its formats - not the other way round.
Instead we see "information" (or the AI-generated tripe replacing it) adapt its contents and format for search engines, in a bid to ultimately best serve advertisers. And search engines too adapt and change their algorithms to best serve advertisers.
As a result it becomes ever harder and harder for users to actually find the information they want in a format that works.
It's become so bad that it's now more practical to use advanced AI to filter out the actual information and re-format it than to go looking for it yourself.
As a human, you no longer want to use the web, and search... you want to have a bot that does that for you... because ultimately the space has become pretty hostile to humans.
That reads like an attempt at a refutation that is actually a confirmation:
> The page itself isn’t likely to rank well. Removing it might mean if you have a massive site that we’re better able to crawl other content on the site. But it doesn’t mean we go “oh, now the whole site is so much better” because of what happens with an individual page.
> “Just because Google says that deleting content in isolation doesn’t provide any SEO benefit, this isn’t always true" Which ... isn't what we said. We said that if people are deleting content just because they think old content is somehow bad that's -- again -- not a thing.
To me that reads like it is possible to improve the ranking of new content by deleting old content. The only thing they are refuting is that the age of the deleted content is the reason for the improvement.
Which is a pretty crucial distinction, no? No one would get upset if CNET announced they were deleting clickbait and blogspam.
With articles such as "The Best Home Deals from Urban Outfitters' Fall Forward Sale" currently gracing their front page, I'm wondering how long HN commenters expect to need access to this content.
I don't get it. Why don't they just update the entries to disallow Googlebot from crawling those links? That way they'd be removed for Google but still accessible to others.
CNET got bought by vulture capital a couple of years back and had already been replacing writers with crap AI before ChatGPT was a thing. This shouldn't be a surprise. Everything CNET-related has been a walking corpse for a while now.
I remember that CNET was a credible source 10 years ago, now it’s like a tabloid that reports 1 week old stories with clickbait titles and thin content.
So, let me get this right. CNET started using "generative AI" to write their articles. Google no doubt detected it and down ranked them to hell. CNET stopped the AI generation and they decided to delete their archives to improve their rankings?
Newer is not always better, especially when you're looking for information on old things, but I suspect there are vested interests who don't want us to remember how much better things were in the past, so they can continue to espouse their illusion of "progress", and this "cleaning up" of old information is contributing to that goal.
Archive.org deserves all the support it needs. If only the Wayback Machine was actually indexed and searchable too...
The day download.com became infested with adware was a sad day. I used to visit that site a lot to see what cool software I could install. Thank the heavens for Linux for teaching me proper package management.
I just went to http://download.com . On the frontpage they have demos for NFS Underground 2, Vice City and Age of Empires II. It's like I've entered a time warp. So many memories!
Same. I was a kid in the country, so we didn't have internet. I used to ride my bike with my thumb drive to the library's computers to see what was new on download.com at least once a week. It was like visiting a candy store. It's a shame those days are over.
This is like the flea on the dog's tail, wagging the tail and dog. I don't know how many more levels of meta we can handle.
Also, if you just dump all that content on archive.org, you're kind of just reaching into archive.org's wallet, pulling out dollar bills, and giving them to Google, whose ostensible goal was to index and make available all the world's information. I feel like that's enough irony and internet for today.
CNET did this a while back, but it didn't seem SEO related then. They used to have tons of old tech specs. I remember them being the last source of specs for an obscure managed switch. Then the whole of that data just went away with no notice. Really great resource lost.
It's things like this that make me really want to see some of the alternative search engines succeed. Hopefully Google continues shooting itself in the foot enough to make people seek alternatives. FWIW I'm using Kagi and have been very happy with it. It's the only alternative I've used (including numerous failed attempts to switch to DDG) where the results have been good enough that I haven't developed a muscle memory to resort back to Google results. And I consider the amount I have to pay to use it reasonable for the value I get, but also an investment in a potential future that doesn't have us all beholden to Google.
I haven't visited a CNET page in years and didn't know they were even still relevant :p
The times I have visited CNET pages in the past was to find specific information. If such information happens to be in deleted articles, that would reduce my interactions with CNET in the future.
I think they should archive the old articles or even offload them to Wayback willingly, but it's possible some of the articles they're purging aren't worthwhile. If I write up an article about a cat playing on a scratching post, there's a good chance there's nothing unique or valuable about it and it doesn't need to sit around gathering bit rot :p
'Stories slated to be “deprecated” are archived using the Internet Archive’s Wayback Machine, and authors are alerted at least 10 days in advance, according to the memo.'
Don't worry, Wayback is capable of telling CNET where to go if they were really concerned. If anything they should be thankful for the newly-generated interest that a major company would use them instead of some randoms cherry-picking sites for arguments.
I didn't know they were strained. Probably best to block them so they don't spend resources on unimportant things. You can find a few UA strings here https://udger.com/resources/ua-list/bot-detail?bot=archive.o... Ultimately, it looks like they do respect the `archive.org_bot` string so if you want to help the strained organization, probably best to block it as well.
You can also save them some space, it appears, by uploading a file on your website asking them not to retain it and then sending an email to them. See https://webmasters.stackexchange.com/a/128352
For now, I just blocked the bot, but when I have some time maybe I'll ask them to delete the data.
Hopefully, with a little community effort we can reduce the strain on them so they can spend their limited resources reasonably.
> I haven't visited a CNET page in years and didn't know they were even still relevant :p
I'm not sure that CNet is relevant. I see their headlines regularly because they're a panel on an aggregator I visit. Most of what they publish is on a level with bot-built and affiliate pages.
We just had a long run of "Best internet providers in [US city]" as if people who live there could choose more than one. Between those will be Get This Deal On HP InkJet Printers and Best Back To School VPN Deals 2023 that compares 10 different Kape offerings.
If there is a point to cnet's existence, I truly don't see it.
While it would be nice to believe that search engines solve the issue automagically, there are a lot of reasons why organizations want to reduce the amount of old/outdated/stale information that's served to searchers. I know where I work, we're constantly deleting older material (at least for certain types of content--which news and quasi-news pubs don't necessarily fall into).
One reason is if you keep all your old versions of product docs up, Google will randomly send people to the old version instead of the current one, and then customers will get confused by the outdated info
I’m sure there must be some way to fix this with META tags/etc, but it is often easier just to delete the old stuff, than change the META tags on 100s or 1000s of legacy doc pages
And you're probably out of support and aren't a current paying customer so you're a lower priority than current customers that want current information. (Old information is often available on product support sites but it probably won't come up with a random search.)
When I worked at $MAJOR_VENDOR, we had an internal-only archive of product installation and documentation ISOs for old versions, discontinued products, etc. It didn't go back forever (although apparently some even older stuff was archived offline and could be retrieved on request), but it did contain product versions that had been out-of-support for over a decade. A lot of older material which had been removed from the public website was contained in it.
There was a defined process for customers to request access to it – they'd open a support ticket, tell us what they wanted, the support engineer would check the licensing system to confirm they were licensed for it, then transfer it to an externally accessible (S)FTP server for the customer to download, and provide them with the download link. There was a cron job which automatically deleted all files older than 30 days from the externally-accessible server.
And for product documentation for version n-2.x, it's easy enough to see the use case for why a customer might need it. (e.g. they have some other product that's version locked to it) But there is a ton of collateral, videos, etc. that companies produce that get stale, musty, or just plain wrong and if you leave them all out there it's just a mess that's now up to the customer to decide what's right, what's mostly right, and what's plain wrong.
So, while there may be some historical interest in how we were talking about, say, cloud computing in 2010, that's the kind of thing I keep in my personal files and generally wouldn't expect a company to keep searchable on its web site.
> But there is a ton of collateral, videos, etc. that companies produce that get stale, musty, or just plain wrong
That 2017 product roadmap slide saying "we'll deliver X version N+1 by 2020" gets a bit embarrassing when 2023 rolls around and it is still nowhere to be seen. Maybe the real answer is "X wasn't making enough money so the new version was cancelled", but you don't want to publicly announce that (what will the press make of it?), you just hope everyone forgets you ever promised it. You can't get rid of copies of the slide deck your sales people emailed to customers, or you presented at a conference, or in the Internet Archive Wayback Machine – but at least you can nuke it off your own website. Reduces the odds of being publicly embarrassed by the whole thing.
Or your 2011 presentation about cloud computing didn't mention containers. I'm not sure it's embarrassing any more than a zillion other forward looking crystal ball glimpses is embarrassing. But there's no real reason to keep on your site unless your intent is to provide a view into technology's twisty path or the various false starts every company makes with its products.
Companies are not, in general, in the business of being the archive of record. They're, among other things, in the business of providing their current customers with the information that's most applicable and correct for their current needs and not something relevant to a completely different version of software from 10 years ago.
Google Search is terrible these days. I find myself using other engines, even Yandex, to find things. It has been in decline for at least the last 3-5 years.
All this SEO garbage may be responsible but I still think google could do a much better job.
Feels like Google is just over-monetizing its search at this point, and the quality of the product is in free fall.
I saw someone mention kagi, a premium search engine, on here recently and I've been loving it. Hadn't heard of yandex, I'll give that one a try too.
Google definitely has an incentive to push content farms covered in Google ads to the top of its search engine, whereas a premium service's only incentive is to provide the best search engine possible.
Can we all please start building decentralized tools now.
Stop giving these crap companies power over you and your data. You don't know what they'll do with your data in the future (think George Orwell).
As we can see, these companies can, out of the blue, force other companies to delete their old articles. There goes freedom of speech, just to rank higher on a shi*y search engine that no one smart even uses (I use Brave Search). Then again, these companies kowtowing to Google mainly write for the brainwashed sheep. I personally don't use any of their privacy-invasive analytics on my blog.
They have no obligation to serve old stuff indefinitely. If you like an article, it's on you to save it, just like people in the old times would keep newspaper cutouts about their favorite band and VHS cassettes of interviews etc.
Yes, things are becoming ephemeral. Probably good. Been like that forever, except for some naive time window of strange expectations in the 2000s decade. We shouldn't keep rolling ahead of us a growing ball of useless stuff. Shed the useless and keep the valuable. If nobody remembers otherwise, then it's probably not a thing of value.
The problem is that it's no longer possible to find it. Lots of early internet history is lost or hard to find - despite the fact that storage space is not an issue these days.
Maybe. I'm of two minds. Valuable things stand the test of time, they get retold and reused and remembered and kept alive. It's always been like that. On the other hand this could be construed as a defense of oral tradition above writing, but writing has had immense impact on our capabilities and technological progress historically, including the preservation of ancient texts. But I'm not sure we'd be so much better off if we had access to all the gossip news of the time of Plato. Not to mention that forgetting, retelling, rediscovering, discussing stuff anew allows for a kind of evolution and mutation that can regularize and robustify the knowledge.
So we need some preservation, but indiscriminate blind hoarding of all info isn't necessarily the best.
There can be many reasons for this from an SEO point of view. One of them can be to send Link Juice to the pages they want to rank better (maybe because those pages bring more advertising revenue, or, being new, don't have enough backlinks). So let's say you have old pages which have a good number of backlinks, but the stale content is not bringing enough visitors to your site. You can delete those pages and permanently redirect them to new pages. This way the new page gets a Link Juice boost and performs better in search results.
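A minimal sketch of that consolidation pattern (the paths are invented, and it assumes a Flask app): retired URLs get a permanent redirect to the page you want to concentrate the links on.

    from flask import Flask, abort, redirect

    app = Flask(__name__)

    # Hypothetical mapping of retired article URLs to their replacements.
    RETIRED = {
        "/2009/old-review": "/reviews/current-roundup",
        "/2011/stale-howto": "/guides/updated-howto",
    }

    @app.route("/<path:old_path>")
    def maybe_redirect(old_path):
        target = RETIRED.get("/" + old_path)
        if target:
            return redirect(target, code=301)  # permanent redirect passes link equity
        abort(404)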
I highly doubt this is for SEO purposes; the SERPs don't work like that. It smells like a bad management decision, and probably some huge investment in an AI tool that needs to be justified monetarily. They will probably nuke a huge portion of their traffic and lose a ton of backlinks.
If you genuinely strive to create great content, that piece of content will drive traffic, leads, sales, and rankings for years.
Just by searching a random thing like "how to build an engine", the first article that comes up among the top results is from 1997.
The fact that they resorted to writing crap these days with AI is what is making them lose relevance, not that they have been in the business for a long time, and have thousands upon thousands of articles.
CNET used to, perhaps not be cool, but it certainly was a place you'd go get tech news from every once in a while. Just imagine they were huge enough to buy download.com, back then, even! For a while, they were essentially *the* place you'd go for downloads of various shareware stuff.
Now, it's a ghost town of a site with AI-written junk.
SEO optimization is actively destroying the archives of blogs out there. Pruning articles to rank better is rewarded. Removing knowledge to "play the game" is a viable path to making money.
The saving grace here? The existence of the Wayback Machine. A non-profit by the Internet Archive that is severely underfunded. If you ever needed a reason to donate, this is probably it. And even then, the survival of this information depends on a singular platform. Digital historians of the future will have a tough job.
Can't believe how many morons are at the helm making decisions. This simple exercise should tell us how trustworthy CNET is with their technical knowledge, LOL.
Recently, I did some research on the topics of "Windows Phone" and "Windows Mobile". I found a lot of interesting articles on Hacker News, thanks to everyone who contributed to the database.
However, many of the links are no longer working. Some websites' domains have expired, while others have chosen to remove the old articles. That made me feel so bad.
I have noticed this at some of the sites I used to write for. Hence, as soon as I put a new piece up, I archive it at archive.is, and include that reference in my site's list of my work. Periodically, I should go there and check each one, but there's a lot of material.
I did not know that this was a possible motivation as to why my more 'historic' work is disappearing, though.
Sites also just reorganize, change CMSs, simply go out of business (which can happen to archive.is as well). I save a lot of my own stuff but it's a bit hit or miss and I expect a lot of the material we lean on The Wayback Machine to save is probably pretty hard to actually discover.
True that archive.is is probably more at risk than the Wayback Machine. I just don't know when I'll ever get the 2-3 days necessary to additionally back the posts up at the Wayback Machine, because each save is very slow.
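For anyone with the same backlog: the Wayback Machine has a "Save Page Now" endpoint at https://web.archive.org/save/<url>, so the saves can at least be scripted and left running instead of clicked through by hand. A rough sketch; the URLs are placeholders and the 30-second pause is just a guess at being polite, not an official rate limit:

    # Push a list of my own article URLs through the Wayback Machine's
    # "Save Page Now" endpoint. Saves are slow, hence the pause.
    import time
    import urllib.request

    urls = [
        "https://example.com/my-article-1",   # placeholder URLs
        "https://example.com/my-article-2",
    ]

    for url in urls:
        req = urllib.request.Request(
            "https://web.archive.org/save/" + url,
            headers={"User-Agent": "personal-archiver/0.1"},
        )
        try:
            with urllib.request.urlopen(req, timeout=120) as resp:
                print(url, "->", resp.status)
        except Exception as e:
            print(url, "failed:", e)
        time.sleep(30)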
Could CNET not have relocated these articles to a distinct directory and then used robots.txt to prevent search engines from indexing them? This move seems like an excuse to offload subpar content by placing the blame on external factors. It's sort of like saying, "we withdrew a significant amount of cash from the bank and torched it to make space for fresh funds."
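For what it's worth, the robots.txt half of that is a couple of lines, assuming the old pieces were moved under a hypothetical /archive/ path:

    # robots.txt at the site root: keep crawlers out of the legacy section
    User-agent: *
    Disallow: /archive/

One caveat: robots.txt only blocks crawling, so URLs Google already knows about can linger in the index. A noindex meta tag on the pages themselves is the more reliable way to get them dropped, and it only works if those pages remain crawlable.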
Luckily it's just one website making a dumb decision to delete its content. I doubt this will actually boost their rankings, and I hope others don't follow CNET's lead. (Hopefully CNET is wrong, and deleting your site's old content doesn't actually help your search ranking on Google.)
The gamification of ranking and clickbait by encouraging digital amnesia and bitrot seems both absurd and evil. New content isn't always good or better when specific content was desired.
0. Donate to internet archive now
1. Website operators should aim to not hide content or download artifacts behind JS with embedded absolute URIs or authenticated APIs.
Does anybody actually browse CNET on purpose? I mean, the content has been the quality of LLM-generated for a decade or more and the only way any sane person ends up seeing it is if you get redirected through a link-shortener/link-obfuscator, surely. Surely?
My first reaction was to think "Google is a shitty search engine if CNET can improve their search ranking by deleting information."
But then I thought about scenarios where this might be legitimate. If we assume:
1. Google has some ability to assess the value of content that isn't 100% reliable at a per-article level, but in aggregate is accurate.
2. Google therefore has the ability to judge CNET's content quality score in aggregate, and use that in search rankings for individual articles.
3. CNET knows that they have articles that rate well on Google's quality score, and articles that rate poorly.
4. CNET has reason to think they can generally distinguish "good" articles from "bad" articles.
THEN it would (could?) make sense to remove the "bad" articles in order to raise their aggregate quality score. It's kind of like the Laffer Curve of Google Content Quality. Overall traffic could go up if the "bad" articles are "bad" enough.
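A toy illustration of that hypothesis, with invented per-article scores (nothing here reflects how Google actually computes anything):

    # Hypothetical per-article quality scores feeding a site-wide average.
    articles = {
        "deep-review": 0.9,
        "solid-howto": 0.8,
        "thin-listicle": 0.2,
        "outdated-news": 0.1,
    }

    def site_score(scores):
        return sum(scores.values()) / len(scores)

    print(site_score(articles))   # 0.5 with everything kept
    pruned = {k: v for k, v in articles.items() if v >= 0.5}
    print(site_score(pruned))     # 0.85 after deleting the weak half

Under those assumptions the aggregate jumps even though no individual article got better, which is the whole (alleged) appeal of pruning.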
That said, unless CNET was publishing absolutely useless garbage at some point, it would make a lot more sense to just de-index those articles, or even for Google to provide a way for a site to mark lower-quality articles, so Google understands that the overall quality score shouldn't be harmed by that article about the Top 5 Animals As Ranked By Hitler. And if someone actually Googles that, it's fine for CNET to have retained the article, and for Google to send them to it.
In short: Google advising sites not to do this when there are perfectly reasonable alternatives seems silly.
This seems like a win-win for CNET. They get publicity from the news and lots of typical outrage and Google-bashing comments. And if it doesn't improve the ranking as expected, they can always put back the old articles, quietly.
Terrible shame. CNET reviews were like going shopping in a technology wonderland when I was a kid. I wonder if they're going to erase James Kim's body of work as well. That would be a shame.
They’re simply wrong in their theory of the case here. My guess is they will look back on this experiment in a year and regret it. Having older, highly linked content on a site is every SEO’s dream.
Why do the articles have to be deleted? Is robots.txt or <meta name="robots" content="noindex,nofollow"> not sufficient? Deleting articles will break so many external links.
It's been quite a while, but CNET has never been the same since CBS acquired them in 2008. They drove the site to become SEO fodder, almost like a ZergNet for tech. And their most recent acquirer is diluting whatever integrity remains by pushing out a slew of "lifestyle" content.
It's a shame because CNET, surprisingly, has retained some really great editors that are as knowledgeable as they come in their respective domains. David Katzmaier and Brian Cooley come to mind.
So, CNET decides to delete lots of useful information, just for better SEO. None of the other old tech news websites are destroying their information for SEO, and they still get good rankings. This should be a good lesson to anyone else thinking of going this route.
That's why I moved to GPT-4 for most of my info queries. Google is badly compromised by SEO. Plus, GPT-4 gives the answer right away, without having to go through multiple results and search within each page for the specific line I need.
I was recently forking a subproject off of a Git repo. After spending a lot of time messing around with it and getting into a lot of unforeseen trouble, I finally asked ChatGPT how to do it, and of course ChatGPT knew the correct answer all along. I felt like an idiot. Now I always ask ChatGPT first. These LLMs are way smarter than you would think.
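For reference, the recipe usually given for splitting a subdirectory out into its own repo (not necessarily the answer ChatGPT produced) looks something like this, with all paths being placeholders:

    git subtree split --prefix=path/to/subproject -b subproject-only
    mkdir ../subproject && cd ../subproject && git init
    git pull ../original-repo subproject-only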
GPT4 with WolframAlpha plugin even gave me enough information to implement Taylor polynomial approximation for Gaussian function (don't ask why I needed that), which would have otherwise taken me hours of studying if I could even solve it at all.
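For the curious, the kind of approximation being described falls out of the Taylor series of exp(u) evaluated at u = -x^2/2. A quick sketch, my own reconstruction rather than the GPT-4/WolframAlpha output:

    # Taylor approximation of the Gaussian exp(-x**2 / 2) around 0:
    # sum over n of (-x**2 / 2)**n / n!
    import math

    def gaussian_taylor(x, terms=10):
        t = -x * x / 2.0
        return sum(t**n / math.factorial(n) for n in range(terms))

    # Near 0 the match is excellent; larger |x| needs more terms.
    print(gaussian_taylor(1.0), math.exp(-0.5))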
PS: GPT-4 somehow knows even things that are really hard to find online. I recently needed the standard error, not of the mean, but of the standard deviation. GPT-4 not only understood my vague query but gave me a formula that is genuinely hard to find online even when you already know the keywords. I know it's hard to find, because I went ahead and double-checked ChatGPT's answer via search.
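For anyone hunting for the same thing: the approximation usually quoted for roughly normal data is SE(s) ≈ s / sqrt(2(n - 1)). I'm stating that from memory, so here's a quick simulation to sanity-check it rather than take it on faith:

    # Compare the empirical spread of the sample standard deviation s
    # against the rule of thumb sigma / sqrt(2 * (n - 1)), with sigma = 1.
    import math
    import random
    import statistics

    n, trials = 30, 20000
    sds = [statistics.stdev([random.gauss(0, 1) for _ in range(n)])
           for _ in range(trials)]

    print(statistics.stdev(sds))        # empirical SE of s
    print(1 / math.sqrt(2 * (n - 1)))   # approximation, ~0.131 for n = 30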
So you implemented a polynomial approximation for a Gaussian function without understanding what you were doing (implying that if you wanted to do it yourself it would take hours of studying).
Good luck when you need to update and adjust it - this is the equivalent of copying/pasting a function from Stack Overflow.
I double-checked everything, but that's beside the point. I was replying to GGP's insinuation that ChatGPT is unreliable. In my experience, it's more likely to return correct results than the first page of search. Search results often resemble random rambling about tangentially related topics whereas ChatGPT gets its answer right on first try. ChatGPT understands me when I have only a vague idea of what I want whereas search engines tend to fail even when given exact keywords. ChatGPT is also way more likely to do things right than me except in my narrow area of expertise.
i use a tool for programming that's based on ChatGPT
i find it most helpful when i am not sure how to phrase a query so that direct search would find something. but i also found that in at least half the cases the answer is incomplete or even wrong.
the last one i remember explained in its text what functions or settings i could use, but the code example it presented did not do what the text suggested. it really drove home the point that these are just haphazardly assembled responses that sometimes get things right by pure chance.
with questions like yours i would be very careful to verify that the solution is actually correct.
In the same way you can tell if a search result is "good", you can usually tell if what ChatGPT is telling you is truthful.
And you face the same problem when looking for something in a domain you are not an expert in - no way to tell if a web page is truthful and no way to tell if ChatGPT is right. ChatGPT just lets you make more mistakes more efficiently.
But for those cases where you kind of know the answer, ChatGPT is usually better than search.
Well, you say that flippantly, but if you ask it correctly, in most cases the answer is correct as well. You should obviously double check the solution, but that applies to anything, be it a Google search or a Wikipedia article.
Right there with you. I've gotten so used to having it give me exactly the answer to my specific question that, when I must fall back to traditional search, it's noticeably unpleasant.
I tried using ChatGPT as a search engine, but it's too slow. With DuckDuckGo/Google you just go to the domain, type, hit enter, and you have your answer. I haven't used it in a few months, but with ChatGPT you have to log in, pass various screenings, hope it's not down, and then finally get to where you can type.
With a regular search engine you've already found your answer by that time.