Google is obviously on a mission to keep people on Google-owned properties. So they take what they crawl and find a way to present it to the end user without anyone needing to visit the site the data came from.
Airlines are a good example. If you search for the status of a particular flight, Google presents that flight status in a box. As an end user, that's great. However, that sort of search used to (most of the time) lead to a visit to the airline website.
The airline website could then offer things Google can't. Like "hey, we see you haven't checked in yet" or "TSA wait times are longer than usual" or "we have a more-legroom seat upgrade if you want it".
Google took those eyeballs away. Okay, fine, that's their choice. But they don't give anything back, which removes the actual source's incentive to do things better.
You see this recently with Wikipedia. Google's widgets have been reducing traffic to Wikipedia pretty dramatically. Enough so that Wikipedia is now pushing back with a product that the Googles of the world will have to pay for.
In short, I don't think the crawler is the problem. And I don't think Google will realize what the problem is until they start accidentally killing off large swaths of the actual sources of this content by taking the audience away.
Then it comes full circle to Google unfairly using their market position vis-a-vis data, search, and advertising. It's a win-win for Google: they let the data dictate which markets to enter, and then they can jack up advertising fees on customers/competitors on one hand and unfairly build their own service into search above both ads and organic results on the other.
If you use the same airline, they will make sure you get to your destination.
I have typically used this strategy when flying back to the US from the EU. Take easyJet or a similar low-cost airline from a random small EU city to a larger EU city like Paris, London, Frankfurt, etc., and book the return trip to the US from the larger city. I've also been forced to do this from some EU cities since there was no connecting partner with a US airline.
* UA or Lufthansa round trip (single carrier): $3K
* UA round trip SFO - Paris + Aeroflot round trip Paris - Moscow: $1K
No amount of search could reduce the gap. I went with the second option. The gap is even bigger if you have a route with multiple segments.
To anyone from airtreks, I love you so much!
Depending on the definition of "tight" each of us has. I remember having 40 minutes in Munich, and that is a BIG airport. Especially if you disembark on one side of the terminal and your flight is on the far/opposite end; that's 25-30 minutes of brisk walking. With 5,000 people in between, you could very well miss your flight. No discussion about stopping to get a coffee or a snack.. you'll miss your flight.
Keeping users from clicking through to organic results helps them generate more revenue.
Not sure who you talked to, but I've never heard that before. They all want to sell more direct and forgo GDS fees and/or other types of fees and commissions. I'd love to see a quote from an airline VP or above saying that they don't care about their distribution model, boosting direct sales percentages, etc.
Just like Amazon with Amazon Basics.
Monopolies themselves aren't illegal. To be found liable for an antitrust violation, a firm needs both to have a monopoly and to be using anticompetitive means to maintain that monopoly. The recent "textbook" example was Microsoft, which in the 90s used its dominant position to charge computer manufacturers for a Windows license for each computer sold, regardless of whether it had Windows installed or was a "bare" PC.
Depending on how you define the market, Google may not even have a monopoly. It's probably dominant enough in web search to count, but if you look at its advertising network it competes with Facebook and other ad networks. In the realm of travel planning (to pick an example from these comments), it's barely a blip.
Furthermore, Google can potentially argue it's not being anticompetitive: all businesses use their existing data to optimize new products, so Google could claim that not doing so would be an artificial straitjacket.
> In the realm of travel planning (to pick an example from these comments), it's barely a blip.
They used their monopoly in web search to gain non-negligible market share in an entirely unrelated industry. That's textbook anticompetitive behavior.
Google can argue whatever they want, but the argument that they're enabling other businesses is a bad one. It casts Google as a private regulator of the economy, which is exactly what antitrust laws are intended to deal with.
BTW, newspapers also make money from subscriptions and sales of copies, so your analogy is doubly wrong.
Enough people realizing Google is trapping and cannibalizing traffic to the other sites it feeds off of, and choosing to do anything EXCEPT touch Google properties, is THE ONLY way they'll be unseated.
With no clear legal path to stop a bully, the path has to be an ethical/habit one.
Not saying there's any easy way, just that this is it.
In fact Google had to make certain concessions in order for the ITA Software acquisition (the basis of Google Flights) to get regulatory approval.
IIRC, a "Chinese wall" between Google data and Google Flights... but like many regulations, it was likely written by Google lobbyists, aka the industry experts. Because at the end of the day Google Flights: 1. still has the built-in widget above organic results, and 2. still bids on its own ad spots, jacking up costs on the competition, which is ultimately passed on to consumers.
It's a tall order to convince the courts that Google's actions harm consumers, or are illegal: after all, being innovative in ways that may end up hurting the competition is a key feature of a capitalist society, and proving that a line has been crossed is really hard, by design.
Google has a monopoly on search ads and does enforce it, acting as a drain on the economy, since in many fields you only succeed if you spend on search ads.
If someone could convince the courts that this is correct, then I'm sure Google would lose. However, I bet dollars to donuts Google's counter-argument would be that the people doing the searching and quickly finding information are also consumers, and they outnumber advertisers and may be harmed by any proposed remediation in favor of advertisers.
Search isn't a single category. If you break it down, they aren't a monopoly. For example, half of PRODUCT SEARCHES begin on Amazon. It's probably hard to argue Google is a monopoly if the company they see as their main competitor has half the market share.
And also, isn't Google already the largest sponsor of Wikipedia? In 2019, Google donated $2M. In 2010, Google also donated $2M.
No, but they often ask for donations when you visit the site, which people won't see if they just see the in-line blurb from Wikipedia on the Google results page.
> In 2019, Google donated $2M. In 2010, Google also donated $2M.
$2M is a pittance compared to what I expect Google believes is the value of their Wikipedia blurbs. If Wikipedia could charge for use of this data (which another commenter claims they are working on doing), they could easily make orders of magnitude more money from Google.
Of course, my expectation is that Google would rather drop the Wikipedia blurbs entirely, or source the data elsewhere, than pay significantly more.
They may be able to charge for bandwidth (if you want to use a Wikipedia image, you can use Wikipedia's enterprise CDN instead of running your own), but their licensing allows me to rehost content as long as I follow the attribution & sublicensing terms.
Google has no problem operating their own CDNs, so I find it unlikely that Wikipedia will be able to monetize Google search results in such a manner as you described.
Disclaimer: I work for Google; opinions are my own.
Is there a Wikipedia crawling "welfare" program if I'm not a trillion dollar mega company?
So they're pretty clearly looking to monetize organizations which consume their data in a for-profit context.
You could, e.g., just cover operational costs and/or improve service quality with it.
$2M is probably Google's hourly profit. For that they get one of the biggest knowledge bases in the world. It's basically free as far as Google is concerned.
The instant Google becomes confident they can supplant Wikipedia, they will.
You don't have to guess, their numbers are public. In 2020 they made $40B in profit, so it takes them about 27 minutes to make $2M in profit.
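If you want to sanity-check that figure, a quick back-of-the-envelope calculation (the $40B annual profit is the number quoted above):

```python
# Rough check of the claim above: at $40B profit per year,
# how long does it take Google to make $2M?
annual_profit = 40e9                      # dollars per year
minutes_per_year = 365 * 24 * 60          # 525,600
profit_per_minute = annual_profit / minutes_per_year
print(2e6 / profit_per_minute)            # ~26.3 minutes, i.e. "about 27"
```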
That's an odd way of phrasing things. If Wikipedia were to take away free access to their data, Google wouldn't be dropping Wikipedia, Wikipedia would be dropping Google. This line of thinking "you took this when I was giving it away for free, but now I want to charge for it, so you are expected to keep paying for it" is incorrect.
Then, after two or five years, let it fester, then abandon it. Nobody gets promoted for keeping well-oiled machines running.
It was actually good for writing stuff when I tried it. Never brought in enough traffic. Killed.
Google isn't a sponsor of Mozilla, they're a customer. Do people think Google is "sponsoring" Apple with $1.5 billion a year too?
The cynic in me thinks the product is anti-trust insurance.
These are two very different companies with a very different relationship with Google. And very different influences on Google.
Google wants to be on iOS. It brings customers to Google. A lot of them. iOS is possibly more profitable to Google than Android even with all the payments Apple extracts from them.
Google needs Mozilla so that Google may pretend that there's competition in browser space and that they don't own standards committees. The latter already isn't really true, and Google increasingly doesn't care about the former.
$2 Million a year? Now I know why Googlers complained about having one less olive in their lunch salad.
How much does Google PROFIT from Wikipedia, and how much does Wikipedia lose in fundraising when Google fails to send users to the info provider?
And most of the value of Wikipedia is created by its unpaid users, not the Wikimedia Foundation.
Wikipedia visitors, edits, and revenue are all increasing, and the rate at which they're increasing is increasing, at least in the last few years. Is this a claim about the third derivative?
> Enough so that Wikipedia is now pushing back with a product that the Googles of the world will have to pay for.
The Wikimedia Enterprise thing seems like it has nothing to do with missing visitors; rather, companies ingesting raw Wikipedia edits are an opportunity to diversify revenue by offering paid structured APIs and service contracts. Kind of the traditional Red Hat approach to revenue in open source: https://meta.m.wikimedia.org/wiki/Wikimedia_Enterprise
Or see page views on English Wikipedia from 2016-current: https://stats.wikimedia.org/#/en.wikipedia.org/reading/total... Looks pretty flat, right? Does that seem normal?
As for Wikimedia Enterprise, you do have to read between the lines a bit. "The focus is on organizations that want to repurpose Wikimedia content in other contexts, providing data services at a large scale".
The flattening of users could also be due to a general internet-wide reduction in long-form (or even medium-form) non-fiction reading. How are page views for The New York Times?
Seems like it should be simple to A/B test, though. Obviously Google could do it themselves by randomly taking away the widget, but one could also see whether referrals from non-Google search engines (though they are themselves a tiny percentage) continue to increase while Google remains flat.
As for the NYT, is there a better proxy to compare to? There are no public pageview stats and they have a paywall.
It's a bit of a vicious cycle, but in general most websites are so chock-full of crap that not having to actually click into them is a relief!
I have no idea how these companies, which made quite a decent amount of money at least up until 2020, can have such utterly poor sites.
I once counted some 20+ redirects on a single request during this process heh..
If there was a way to split Google flights into a separate company and somehow ensure it wouldn't devolve into absolute trash like its competitors, that would be a good thing.
Shopping for flights is a surprisingly technically difficult thing to do well.
There are events associated with flight status that Google doesn't know. Like change fee waivers, cash comp awards to take a later or earlier flight, seat upgrades, etc.
Lest we forget, the money in that scenario is from butts in seats, not clicks on a website. The particular example is ill chosen, as Google is actually taking on a cost, taking nothing, and gifting the airline a better UI.
I don't think that's correct. In the old days you'd either call a travel agent or use an aggregator like Expedia.
Google muscles out intermediaries like Expedia, Yelp, and so on. It's not likely much better or worse for the end user or supplier. Just swapping one middleman for another.
Just for example, Expedia is probably 5% of Google's total revenue, and Google by and large doesn't like slim-margin services that can't be automated.
Travel is fairly high-touch and people-centric. It doesn't fit Google's "MO".
But... it's shitty that Google can play all sides of the market while holding people ransom for massive sums of money to pay-to-play on PPC where Google doesn't have to... I think that's where the problem shows.
In essence, you're advocating that eBay go away because Google could do it... they could... and eBay is technically just an intermediary, but do we want everything to be Google-fied?
Google bought up or destroyed the other aggregators. Remember the days of FatWallet, Priceline, PriceWatch, Shopzilla and such, when they used to focus on discounts/coupons/deals? Now they're moving more towards rewards/shopping/experience. It used to be that I could do PPC on PriceWatch and reach millions of shoppers at a reasonable rate, but now that Google destroyed them all, the PPC rate on "goods" is absurdly high, and not having an affordable market means only the Amazons and Walmarts can really afford to play...
It used to be you could niche out, but even then, that's getting harder.
I don't think I'm really advocating for it so much as I see it as a more or less neutral change.
That said, I'm pretty ambivalent about Google. Their size is a concern, but they also tend to be pretty low on the dark-pattern nonsense. eBay, to use an example you gave, screwed me out of some buyer protection because of poor UX and/or a bug (I never saw the option to claim my money after the seller didn't respond). In this specific instance Google ends the process by sending you to the airline to complete the booking. That, imho, is likely better than dealing with Expedia.
Google has a huge pay-to-play problem with PPC... I've worked for Expedia, so that's the only reason I know this :)
It's often the reason companies work with Expedia: they don't have the leverage Expedia Group does...
I see it as an unnatural change, btw... "Borg", if you will.
What makes you think they care? Killing off the sources of content might even be their goal. If they kill off sources of content, they'd be more than happy to create an easier-to-datamine replacement.
Hypothetically, if they killed off Wikipedia, they are best placed to use the actual Wikipedia content in a replacement, which they can use for more intrusive data mining.
Google sells eyeballs to advertisers; being the source of all content makes them more money from advertisers while making it cheaper to acquire each eyeball.
AFAIK, Wikipedia content is free to reuse.
The core problem is incentives: http://infolab.stanford.edu/~backrub/google.html "we believe the issue of advertising causes enough mixed incentives that it is crucial to have a competitive search engine that is transparent and in the academic realm."
In a nutshell, though, I don't see equal access for all crawlers changing anything. Maybe that's the first barrier they hit, but it isn't the biggest or hardest one by far. Bing has good crawler access, but shit market share.
There was a similar one where Google emphatically claimed a US quarter was worth five cents in a pretty and large snippet graphic.
And quite often a "People also ask" result is just some random guy's comment from Reddit.
(Search who invented running)
As for Wikipedia, the WMF is quite happy that most of their traffic is now served by Google. WMF is in the business of distributing knowledge, not in the eyeballs business. Serving traffic is just a cost for them. The main problem has been that the average cost for Wikipedia to serve a page has gone up, because many readers read it via Google, and more people who visit Wikipedia are logged-in authors, which costs them more to serve. I'm sure there's an easy solution to this problem (for example, beneficiaries of Wikipedia can donate compute facilities and services, or something along those lines).
They have relented in some ways, rolling out stuff in the widget like:
"The airline has issued a change fee waiver for this flight. See what options are available on American's website"
But obviously, that kind of stuff isn't shown on Google until quite some time after it exists on the source site. And the widget pushes the organic results below the fold unless you have a huge monitor.
As for Wikipedia, I was referring to this: https://news.ycombinator.com/item?id=26487993
"Airlines are, obviously, extremely pleased to disseminate this information"
In the same way that publishers love AMP, yes. They don't actually like it, but they are forced to make the best of it.
In this case, I want to be sent literally anywhere but aa.com.
If you back up a bit, the widget didn't use to tell you there was a change fee waiver when the flight was full, while aa.com did.
That's an actual, tangible benefit that a consumer might want, worth real money. You can also even often "bid" on a dollar amount to receive if you're willing to change flights. Google doesn't present that info today.
There are more examples. My perspective isn't that Google should lead you to aa.com, but I do feel it's a bit dishonest that the widget is so large it pushes aa.com below the fold. It doesn't need to be that large.
If I'm a passenger, there's plenty of ways for airlines to notify me. If I'm searching for a flight status online, it's because I'm picking someone up. If I want more information, I'll click through. I don't see how either me or the airline are hurt.
> You see this recently with Wikipedia. Google's widgets have been reducing traffic to Wikipedia pretty dramatically. Enough so that Wikipedia is now pushing back with a product that the Googles of the world will have to pay for.
Why is that even a problem for Wikipedia?
Yes, it's legal. But it does reduce visitor interactions for those sites. Reduced visitor interactions aren't good for websites... they take away incentives, reduce brand value, reduce revenue. Eventually, that is not great for consumers.
Ever been a middleman? Squeeze your suppliers enough, and you kill them. The next supplier will pre-emptively cut quality, features, etc, because they know you're going to try and squeeze them to death.
"If I'm searching for a flight status online, it's because I'm picking someone up"
That's one use case, it's not all of them. There's middle ground too, like the bones they throw Wikipedia in the form of links for more info.
Do you have a link to that product/service from Wikipedia?
Discussed not long ago: https://news.ycombinator.com/item?id=26484080
Except that, if you allow other companies to crawl and compete, you can take eyeballs away from Google (which may well then return eyeballs to Wikipedia, so long as the Google competitors don't also present scraped data).
But wouldn't this be a good thing? Since Wikipedia is a nonprofit aiming to provide knowledge, Google stealing & caching their content might help serve information to more people while reducing Wikipedia's server load, so IMO it might not be so bad for Wikipedia.
What product is that?
And if some of the sources somehow die? New sources will spring up. It doesn't matter.
Airlines don't each need their own webpage. They could all provide a standard API.
That's sort of how it works in the corporate booking tool world. It is decidedly not a better experience for end users, IMO.
There's quite a lot about each airline that is different, so any unified approach is a lowest common denominator. You'll notice things like loyalty points, for example, have more rich data on the airline's website. And that some fares are ONLY on the website. Or that seat maps have more useful detailed info, etc.
And that's all shopping/booking. Departure control, flight status, upgrade/downgrade, check-in, seat upgrades, standby, etc, are for the most part only on the airline's website.
It feels like Google has contributed a lot to websites being slow and bad, with e.g. ads, AMP, Angular, and probably more things for the other 25 letters of the alphabet.
Hehe, sure, nothing nefarious or greedy here... move along, move along, nothing to see...
There are exceptions, like an airport-called ground stop.
It's pretty difficult to come up with a non-computer analogy for how Google works now. Pick a different space, and the power imbalance is quite clear. If they wanted, they could destroy StackExchange very quickly with these widgets.
It definitely feels like one set of rules for them and a different set for everyone else.
I don't really see this as Google playing by different rules so much as economic incentives being aligned in Google's favor.
> Let’s take a look at the robots.txt for census.gov from October of 2018 as a specific example to see how robots.txt files typically work. This document is a good example of a common pattern. The first two lines of the file specify that you cannot crawl census.gov unless given explicit permission. The rest of the file specifies that Google, Microsoft, Yahoo and two other non-search engines are not allowed to crawl certain pages on census.gov, but are otherwise allowed to crawl whatever else they can find on the website. This tells us that there are two different classes of crawlers in the eyes of the operators of census.gov: those given wide access, and those that are totally denied.
> And, broadly speaking, when we examine the robots.txt files for many websites, we find two classes of crawlers. There is Google, Microsoft, and other major search engine providers who have a good level of access and then there is anyone besides the major crawlers or crawlers that have behaved badly in the past that are given much less access. Among the privileged, Google clearly stands out as the preferred crawler of choice. Google is typically given at least as much access as every other crawler, and sometimes significantly more access than any other crawler.
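To make that "two classes of crawlers" pattern concrete, here is a minimal Python sketch using the standard library's robots.txt parser. The rules below are a simplified stand-in for the 2018 census.gov file described above, not the literal file; the paths and the unprivileged crawler name are made up for illustration:

```python
# Minimal sketch of the "two classes of crawlers" pattern described above.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# The privileged crawler may fetch anything outside its few disallowed paths...
print(rp.can_fetch("Googlebot", "https://www.census.gov/data/"))       # True
# ...while an unlisted crawler is denied the entire site by the catch-all rule.
print(rp.can_fetch("SomeNewCrawler", "https://www.census.gov/data/"))  # False
```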
If you wanna go nuclear on people who do that, include an invisible link in your html and forbid access to that URL in your robots.txt, then block every IP who accesses that URL for X amount of time.
Don't do this if you actually rely on search engine traffic though. Google may get pissed and send you lots of angry mail like "There's a problem with your site".
Ah, but of course you would exclude Google's published crawler IPs from this restriction, because that is exactly what they want you to do.
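For what it's worth, a minimal sketch of the trap described above, assuming a Flask app. The /trap/ URL, the block duration, and the in-memory blocklist are all made up for illustration; per the comment just above, you would also exclude verified search-engine crawlers before blocking:

```python
# Hypothetical honeypot: an invisible link points at /trap/do-not-follow,
# robots.txt disallows it, and any client that fetches it anyway gets blocked.
import time
from flask import Flask, abort, request

app = Flask(__name__)
blocked = {}                   # ip -> unix time at which the block expires
BLOCK_SECONDS = 24 * 60 * 60   # the "X amount of time" from the comment above

@app.before_request
def reject_blocked_clients():
    expiry = blocked.get(request.remote_addr)
    if expiry and expiry > time.time():
        abort(403)

@app.route("/trap/do-not-follow")
def trap():
    # Only clients ignoring robots.txt should ever reach this URL.
    # (You'd want to skip verified Googlebot/Bingbot addresses here.)
    blocked[request.remote_addr] = time.time() + BLOCK_SECONDS
    abort(403)
```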
Blocking scrapers in robots.txt is more of a way of saying, "hey, you're doing it wrong."
It's also worth noting that the original article is out of date. The current robots.txt at census.gov is basically wide open.
I do not want the government deciding what purposes, e.g. non-commercial, serve the public good. The public gets to decide that. (Charging a license fee for commercial use is maybe OK, assuming supporting that use costs the government "too much".)
And I very much do not want the current situation, with the government anointing a handful of corporations (the farthest thing from the public possible) with access and denying everyone else, including all of the actual public.
And an example from the quickly approaching future: there will be Nationally Recognized Media Organizations who license "Fact-Checkers", through which posts to public-facing sites will have to be submitted for certification and correction.
The public's "decision" on things like this is made manifest by government policy, no?
It's possible to have government policies that the majority of people disagree with, that remain for complicated reasons related to apathy, lobbying, party ideology, or just because those issues get drowned out by more important debates.
Government is an extension of the will of the people, but the farther out that extension gets, the more divorced from the will of the people it's possible to be. That's not to say that businesses are immune from that effect either -- there are markets where the majority of people participating in them aren't happy with what the market is offering. All of these systems are abstractions, they're ways of trying to get closer to public will, and they're all imperfect. But government is particularly abstracted, especially because the US is not a direct democracy.
I'm personally of the opinion that this discussion is moot, because I think that people have a fundamental Right to Delegate, and I include web scraping public content under that right. But ignoring that, because not everyone agrees with me that delegation is right, allowing the government to unilaterally rule on who isn't allowed to access public information is still particularly susceptible to abuse above and beyond what the market is capable of.
If people have to 'scrape' that data from a public resource, I'd say they're presenting the data in the wrong way.
I may be wrong (this isn't my area), but I was under the impression that robots.txt was just an unofficial convention? I'm not saying people should ignore robots.txt, but are there legal ramifications if ignored? I'm not asking about techniques sites use to discourage crawlers/scrapers, I'm specifically wondering if robots.txt has any legal weight.
Looks like Google is trying to turn it into an RFC though.
Under this consortium, website owners would be allowed to either allow all crawlers (approved by the consortium) or none at all (that is, none that are in the consortium; you could still allow a specific researcher or similar to crawl your website on a case-by-case basis).
This consortium would be composed of the search engines (Google, MS, other industry members), as well as government-appointed individuals and relevant NGOs (Electronic Frontier Foundation, etc.?). There would be an approval process that simply requires your crawl to be ethical and to respect bandwidth usage. Violations of ethics or bandwidth limits could mean temporary or permanent suspension. The consortium could have some bargaining or regulatory measures to prevent website owners from ignoring those competitive and fairness provisions.
An industry-wide agreement not to compete for commercially valuable access to suppliers of data?
Composed of companies that are current (and in some cases perennial) focuses of antitrust attention?
I think there might be a problem with that plan.
Could you better describe your objections?
And I'm not making this up. This kind of stuff has happened to me many times.
But this sub-thread is about misbehaved crawlers.
The idea was that instead of a centralized search, people could have fat clients that individually query these apis and then aggregate the results on the client machine.
Essentially every query would be a what/where or what/who pair. This would focus the results.
I really think we need to reboot those core ideas.
We have a manual version today. There's quite a few large databases that the crawlers don't get.
The one place for everything approach has the same fundamental problems that were pointed out 30 years ago, they've just become obvious to everybody now.
So your best bet is setting a page rule to allow those URLs.
I don't think mass blocking is the right approach in general. IPs, even residential ones, are relatively easy and relatively cheap to get. At some point you're blocking too many normal users. Captchas are a strong weapon, but they too have a significant cost by annoying the users. Cloudflare could theoretically do invisible-invisible captchas by never even running any code on the client, but that would be wholesale tracking and would probably not fly in the EU.
The web scraping technology that I am aware of has reached its endgame already: unless you are prepared to authenticate every user/visitor to your website with a dollar sign, or lobby Congress to pass a bill to outlaw web scraping, you will not be able to stop web scraping in 2021 and beyond.
Due to aggressive NoScript and uBlock use, I, browsing the website as a human, keep getting hit by captchas, and my success rate is falling to a coin flip. If there's a script to automate that, I'm all ears.
I just no longer see it as possible to 1) put information on the web (private or public), 2) give access outside your organization (customers or visitors), and 3) expect your website will not be scraped.
ToS is NOT the law unfortunately.
Botnets and automated crawling are completely different things. This isn't about preventing service degradation (even if it gets presented that way). It's an attempt by content publishers to control who accesses their content and how.
Cloudflare is actively assisting their customers to do things I view as unethical. Worse, only Cloudflare (or someone in a similarly central position) is capable of doing those things in the first place.
There isn't even any information about what the "private forum access" you have to pay for involves, what kind of people are in it... or even what happens with the money.
For me, this sounds like a scam.
I mean, srsly. I am building a peer-to-peer network that tries to liberate the power of Google, specifically, and I would not even consider joining this club. And I am the best-case scenario for the proposed market fit.
To add, there are a lot of businesses that use the term "Knucklehead", so finding their business on secretary of state business searches might be impossible.
> Meet with legislators and regulators to present our findings as well as the mock legislation and regulations. We can’t expect that we can publish this website or a PDF and then sit back while governments just all start moving ahead on their own. Part of the process is meeting with legislators and regulators and taking the time helping them understand why regulating Google in this way is so important. Showing up and answering legislators’ questions is how we got cited in the Congressional Antitrust report and we intend to keep doing what’s worked so far.
If I actually wanted to restrict bots, it would be much easier to restrict googlebot because they actually follow the rules.
I don't disagree in principle that there should be an open index of the web, but for once I don't see Google as a bad actor here.
There were a surprising number of sites out there that explicitly blocked access to anyone but Google/Bing at the time.
We'd also get a dozen complaints or so a month from sites we'd crawled. Mostly upset about us using up their bandwidth, and telling us that only Google was allowed to crawl them (though having no robots.txt configured to say so).
If Google is taking traffic and reducing revenue, a company can deny them in robots.txt. Google will actually follow those rules, unlike most others that are supposedly in this second class.
The company did respect robots.txt, though it was initially a bit of a struggle to convince certain project managers to do so.
No. The internet is public. Publishers shouldn't get any say in who accesses their content or how they do it. As far as I'm concerned, the fact that they do is a bug.
Yes, the bad bots don't give a fuck, but even the non-malicious bots (Ahrefs, Moz, some university's search engine, etc.) don't bring any value to me as a site owner; they take up bandwidth and resources and fill up logs. If you can remove them with three lines in your robots.txt, that's less noise. Universities especially, in my opinion, often behave badly and are uncooperative when you point out their throttling does not work and they're hammering your server. Giving them a "Go Away, You Are Not Wanted Here" in robots.txt works for most, and the rest just get blocked.
Why can't you just rate-limit IPs that are "too active" for your server to handle?
And yes, that means Google gets special treatment.
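A rough sketch of what that per-IP rate limit might look like, with made-up thresholds; in practice this usually lives at the proxy/CDN layer rather than in application code, and the allowlist is where the special treatment mentioned above would go:

```python
# Hypothetical per-IP rate limiter: allow at most MAX_REQUESTS per WINDOW
# seconds, with an allowlist for crawlers you've chosen to trust.
import time
from collections import defaultdict, deque

WINDOW = 60           # seconds (made-up threshold)
MAX_REQUESTS = 120    # requests per window (made-up threshold)
TRUSTED_IPS = set()   # e.g. verified Googlebot addresses

hits = defaultdict(deque)  # ip -> timestamps of recent requests

def allow_request(ip: str) -> bool:
    if ip in TRUSTED_IPS:
        return True
    now = time.time()
    q = hits[ip]
    while q and q[0] < now - WINDOW:   # drop requests outside the window
        q.popleft()
    if len(q) >= MAX_REQUESTS:
        return False
    q.append(now)
    return True
```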
Think about the model for a site like stackoverflow. The longest of long tail questions on that site: what’s the actual lifecycle of that question?
- posted by a random user
- scraped by google, bing, et al
- visited by someone who clicked on a search result on google
- eventually, answered
- hopefully, reindexed by google, bing et al
- maybe never visited again because the answer now shows up on the google SERP
In the lifetime of that question how many times is it accessed by a human, compared to the number of times it’s requested and rerequested by an indexing bot?
What would be the impact on your site of three more bots as persistent as google bot? Why should you bother with their requests?
So yes, sites care about bot traffic, and they care about Google in particular.
Additional evidence here: https://knuckleheads.club/the-evidence-we-found-so-far/
They obviously don't care enough, then. Google says you should use rDNS to verify that Googlebot crawls are real. Cloudflare now does this automatically as well for customers with the WAF (Pro plan).
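For reference, the check Google documents is a forward-confirmed reverse DNS lookup. A rough Python sketch, with error handling kept minimal (a real deployment would cache results and handle hosts with multiple addresses):

```python
# Verify a claimed Googlebot request: 1) reverse-resolve the client IP,
# 2) check the hostname is under googlebot.com or google.com,
# 3) forward-resolve that hostname and confirm it maps back to the same IP.
import socket

def is_verified_googlebot(ip: str) -> bool:
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)       # reverse lookup
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        forward_ip = socket.gethostbyname(hostname)     # forward confirmation
        return forward_ip == ip
    except OSError:
        return False
```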
This was common even back in the mid-2000s:
As inflammatory as the headline of the page looks, they literally admit it's not google's fault in the smaller text lower down:
"This isn’t illegal and it isn’t Google’s fault, but"
Googlebot has a range of IP addresses that it publicly announces so websites can whitelist them.
Google says they do not do this:
"Google doesn't post a public list of IP addresses for website owners to allowlist."
If it was the only indexer allowed, and it was publically governed, then enforcing changes to regulation would be a lot more straightforward. Imagine if indexing public social media profiles was deemed unacceptable, and within days that content disappeared from all search engines.
I don't think it'll ever happen, but it's interesting to think about.
Hijacking scroll like this is one of the biggest turnoffs a website can have for me, up there with being plastered with ads and crap. It's ok imo in the context of doing some flashy branding stuff (think Google Pixel, Tesla splashes) but contentful pages shouldn't ever do this.
Which would put it under government regulation and be forever mired in politics over what was moral, immoral, ethical or unethical and all other kerfuffle. To an extent, it’s already that way, but that would make it worse than it is currently.
Similar to some of the concepts of "Linked Data", maybe - https://en.wikipedia.org/wiki/Linked_data.
The problem is getting to a standard, it would essentially need to be federated search so a standard would have to be established (de facto most likely).
Also, indexes and storage, distribution of processing load.. peer-to-peer search is already a thing, but it doesn't seem to be a core function of the Internet.
This is basically the same concept as making an "open" version of something that is "closed" in order to compete, I guess.
A public cache wouldn't be able to play cat-and-mouse games with potential opponents; indeed, it shouldn't. I suspect most of the games played require not explaining exactly what you're doing.
The Knuckleheads' Club are at the very least doing a great job of raising awareness of the potential barriers to entry for alternatives.
A mandated singular public cache has potential slippery slopes.
That may be, but it seems like everything has a slippery slope: if the wrong person gets into power, or if the public looks the other way out of complacence/ignorance/indifference, etc. It shouldn't stop us evaluating choices on their merits, and there is a lot of merit to entrusting "core infrastructure" type entities to the government, or at least having that option.
I am unconvinced by the "slippery slope" argument being deployed by default to any governmental attempt to combat tech monopolies.
"One index to rule them all" seems more fraught with difficulty than, "large cloud providers are unhappy that crawlers on the open web are crawling the open web".