Hacker News
Only Google is really allowed to crawl the web (knuckleheads.club)
957 points by skinkestek on March 26, 2021 | 346 comments

The bigger problem, to me, is not around crawling. It's the asymmetrical power Google has after crawling.

Google is obviously on a mission to keep people on Google owned properties. So, they take what they crawl and find a way to present that to the end user without anyone needing to visit the place that data came from.

Airlines are a good example. If you search for flight status for a particular flight, Google presents that flight status in a box. As an end user, that's great. However, that sort of search used to (most times) lead to a visit to the airline web site.

The airline web site could then present things Google can't do. Like "hey, we see you haven't checked in yet" or "TSA wait times are longer than usual" or "We have a more-legroom seat upgrade if you want it".

Google took those eyeballs away. Okay, fine, that's their choice. But they don't give anything back, which removes incentives from the actual source to do things better.

You see this recently with Wikipedia. Google's widgets have been reducing traffic to Wikipedia pretty dramatically. Enough so that Wikipedia is now pushing back with a product that the Googles of the world will have to pay for.

In short, I don't think the crawler is the problem. And I don't think Google will realize what the problem is until they start accidentally killing off large swaths of the actual sources of this content by taking the audience away.

They are not just taking away internet traffic, but in the flights example, they actually acquired an aggregate flight/travel company and so they are actually entering markets and competing with their own ad customers.

Then it comes full circle to Google unfairly using their market position vis-a-vis data, search and advertising. It’s a win-win: Google lets the data dictate which markets to enter and on one hand they can jack up advertising fees on customers/competitors and unfairly build their own service into search above both ads and organic results.

Be careful when using Google Flights, last time I checked they use significantly shorter margins between flights so it’s shorter trips but much riskier.

You can get screwed any time you book a connecting flight on two different airlines even if the times aren't tight. For instance if one is cancelled.

If you use the same airline they will make sure you get to the destination.

That's true, but it can save you a ton of money. You just have to be aware of the risks and plan accordingly.

I have typically used this strategy when flying back to the US from the EU. Take an EZJet or similar low cost airline from random small EU city to a larger EU city like Paris, London, Frankfurt, etc... and book the return trip to the US from the larger city. I've also been forced to do this from some EU cities since there was no connecting partner with a US airline.

The difference is mind-boggling in some cases. On one trip in 2019 I had the following coach fare choices for SFO - Moscow return trip tickets booked 3 weeks prior to departure.

* UA or Lufthansa round trip (single carrier) $3K

* UA round trip SFO - Paris + Aeroflot round trip Paris - Moscow: $1K

No amount of search could reduce the gap. I went with the second option. The gap is even bigger if you have a route with multiple segments.

https://www.airtreks.com/ will do this for you with a person. phenomenal service.

To anyone from airtreks, I love you so much!

Yeah this strategy is good, but you need to allow a long layover like 6 hours if you have to go through immigration and change airports for the connection which happens pretty often with ryanair and ezjet. It’s a big pain, but it does save money.

If you're booking each leg with a different carrier, I find it best to pay the little extra with kiwi.com and they give you a guarantee for the connection. I missed a connection twice and they always got me on the next flight to the destination for free.

In my ideal world the software ITA wrote for airlines, now owned by Google, would be in the hands of consumers, and the airlines could have adapted to shifts in demand, probably without the need for abrupt cessation of services and the human fatigue on industry employees caused when route optimisation analysis tempts executives with what I suspect are ultimately fictitious net present savings.

> even if the times aren't tight

Depending on the definition of "tight" each of us has. I remember having 40mins in Munich, and that is a BIG airport. Especially if you disembark on one side of the terminal and your flight is on the far/opposite end. That's 25-30mins brisk walking. With 5000 people in-between you could well miss your flight. No discussion about stopping to get a coffee or a snack.. you'll miss your flight.

It doesn't really matter if it's on the same airline, it just has to be in the same reservation. Usually, that is the same thing; however on international and hyper-local (the kind that end with a Cessna) flights you'd often have several airlines with codeshares, and you could buy two separate tickets on the same airline if you wanted.

Another approach that's interesting is "buy long, fly short". Sometimes buying A->B->C and getting off at B is cheaper than just buying A->B. But, airlines can cancel the A->B->C flight and replace it with an A->B and A->C, and place you on the A->C flight.

Can you elaborate on this? Do you mean shorter layovers?

It sounds like it - and third-party companies will often show you flights that involve different companies on the different legs - which can leave you in a pickle because technically each airline's job is to get you to the end of THEIR flight, not the entire journey.

And sometimes with a change of airport!

I remember when in Germany some budget airlines used to say they'd fly to "Frankfurt" (FRA) but actually flew to "Frankfurt-Hahn" (HHN) - 115km away. After arrival in HHN they put you on a bus to FRA that took about 2 hours.

Oh don’t worry, you have 15 on-paper minutes to go from A1 to A70 in Detroit... in January... and the shuttle is down.

Even before it gets to that point, they routinely display snippets off regular websites and show ads next to them.

Keeping users from clicking through to organic results helps them generate more revenue.

Here’s a thought: most companies don’t actually want to serve a ton of extra pages. For example, airlines just want to fly passengers. They don’t care who puts those butts in seats and they would fully acknowledge that they aren’t able to deliver a better flight search than Google can. I mean, sure, some small team of web developers at every airline is pissed, but the CEO needs butts in seats to keep the pilot and service unions off her back. Her own web dev team is the least of her problems.

"For example, airlines just want to fly passengers. They don’t care who puts those butts in seats"

Not sure who you talked to, but I've never heard that before. They all want to sell more direct and forego GDS fees and/or other types of fees and commissions. I'd love to see a quote from an airline VP or above that they don't care about their distribution model, boosting direct sales percentages, etc.

Sure, but if you aren’t controlling the experience of getting butts in seats, it’s harder to upsell and make even _more_ money from those butts.

the OP wasn't talking about google competing with airlines, but with other flight aggregating and search/booking services, by abusing their monopoly on web search.

> Google lets the data dictate which markets to enter and on one hand they can jack up advertising fees on customers/competitors and unfairly build their own service into search above both ads and organic results.

Just like Amazon with Amazon Basics.

Isn't the solution to disaffective acqui-hiring the restoration of the ability to IPO companies like ITA Software, so there's another way to reach financial security for talented programmer-entrepreneurs? For that matter, do the conditions for recreating the frequency of lower-level programmer millionaires (hardly a family home debt free today) like Microsoft created require the recreation also of senior executive abuse of option schemes? It seems to me that making it a reasonable chance of becoming at least financially secure for not irrational amounts of dedication and 90 weeks, and underwriting that with the greater robustness of larger companies and hence livable salaries, instead of trying to sustain the apparent startup free for all figuring to a common heat death of the advertising budgetary universe?

Aren't there anti trust laws to prevent this kind of thing?

Antitrust laws are hard to enforce in the United States.

Monopolies themselves aren't illegal. To be convicted of an antitrust violation, a firm needs to both have a monopoly and needs to be using anticompetitive means to maintain that monopoly. The recent "textbook" example was of Microsoft, which in the 90s used its dominant position to charge computer manufacturers for a Windows license for each computer sold, regardless of whether it had Windows installed or was a "bare" PC.

Depending on how you define the market, Google may not even have a monopoly. It's probably dominant enough in web search to count, but if you look at its advertising network it competes with Facebook and other ad networks. In the realm of travel planning (to pick an example from these comments), it's barely a blip.

Furthermore, Google can potentially argue it's not being anticompetitive: all businesses use their existing data to optimize new products, so Google could claim that it not doing so would be an artificial straightjacket.

It's not that hard, we're just out of practice due to the absurd Borkist economic theories we've been operating under for 40+ years. The laws are all there if the head of the DOJ antitrust division has the gumption to go reverse some bad precedents.

> In the realm of travel planning (to pick an example from these comments), it's barely a blip.

They used their monopoly in web search to gain non-negligible market share in an entirely unrelated industry. That's textbook anti-competitive behavior.

Google can argue whatever they want, but the argument that they're enabling other businesses is a bad one. It casts Google as a private regulator of the economy, which is exactly what antitrust laws are intended to deal with.

Is web search even a "market" independent of ads?


Where's the money?

That's like arguing that newspapers are not a market, because they make money from ads.

No, I was asking if web search was a market independent of ads.

BTW, newspapers also make money from subscriptions and sales of copies, so your analogy is doubly wrong.

In collecting and selling your data to 3rd parties.

Just tell people to stop using google. Go direct.

Upvoted - regardless how pointless some people might think this comment is, it really is the ONLY way that Google is going to drop out of its aggregate lead position.

Enough people realizing Google is trapping and cannibalizing traffic to the other sites it feeds off of, and choosing to do other things EXCEPT touching Google properties, is THE ONLY way they'll be unseated.

No clear legal path to stop a bully means it's an ethical / habit path.

Not saying there's any easy way, just that this is it.

I find those little snippets actually mostly worthless, maybe because I’ve seen enough of them taken out of context or basically using a snippet from someone who figured out SEO properly, meanwhile the correct information may be down a couple links or not there at all.

Laws are one thing and enforceability another.

In fact Google had to make certain concessions in order for the Google Flights acquisition to get regulatory approval.

IIRC a Chinese firewall between Google data and Google Flights...but like many regulations they were likely written by Google lobbyists aka the industry experts. Because at the end of the day Google flights: 1. Still has the built in widget above organic results and 2. They still bid on their own ad spots jacking up costs on competition which is ultimately passed on to consumers.

Anti-trust in the US tends not to hit the big tech players as much as it does other sectors. Also there is actually a debate in the judicial system about the extent of anti-trust laws themselves.

Chicago school basically published a bunch of position papers that made feudal corporations a legal entity that "aren't monopolies" because they give things away for free. Because the consumer isn't paying, it can't be bad.

The current anti-trust doctrine in the US has a goal of protecting consumers - not competition. What Google is doing is arguably great for consumers but awful to their competitors/other organizations. Technically, companies can simply block Google using robots.txt - but in reality that will lose them more money than the current partial disintermediation by Google is costing them - and Google knows this.
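
For reference, the robots.txt option mentioned above is mechanically trivial. Here is a minimal sketch using Python's standard `urllib.robotparser`, assuming a hypothetical site at example.com. `Googlebot` is the user-agent token Google's crawler identifies itself with; the rules, the URL, and the `can_crawl` helper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out Google's crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Allow: /
"""


def can_crawl(user_agent: str, url: str) -> bool:
    """Return True if the rules above permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com and the path are placeholders, not a real airline endpoint.
print(can_crawl("Googlebot", "https://example.com/flights/status"))
print(can_crawl("SomeOtherBot", "https://example.com/flights/status"))
```

The file is only a request that well-behaved crawlers honor, so the hard part is the lost search traffic, not the configuration.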

It's a tall order to convince the courts that Google's actions harm consumers, or are illegal: after all, being innovative in ways that may end up hurting the competition is a key feature of a capitalist society - proving that a line has been crossed is really hard, by design.

consumers are in this case the advertisers.

google has a monopoly on search ads and does enforce it, being a drain on the economy since in many fields you only succeed if you spend on search ads

> consumers are in this case the advertisers.

If someone could convince the courts that this is correct, then I'm sure Google would lose. However, I bet dollars to donuts Google's counter-argument would be that the people doing the searching and quickly finding information are also consumers, and they outnumber advertisers and may be harmed by any proposed remediation in favor of advertisers.

Google's answer to this at yesterday's hearing:

Search isn't a single category. If you break it down, they aren't a monopoly. For example, 1/2 of PRODUCT SEARCHES begin on Amazon. It's probably hard to argue Google as a monopoly if who they see as their main competitor has half the market share.

The US is so behind in identifying markets in technology which is what is leading to this dominance by a few companies and their resulting monopoly-like power. We had already figured out that you can be dominant in only a subset of a market. For example Disney was forced to sell off fox sports channels when it purchased fox because it already owned espn and would have dominated sports TV. That’s the thing, it wasn’t even just TV but a subset, sports TV. That identification is where we are behind. As of now, no one in the FTC knows what makes a market or why say YouTube and Facebook may both show large amounts of video content but are absolutely not competitors in video content space. It is because the functions are completely different. YouTube is barely a social network at all despite having users and comments and pages. Facebook is hardly a video platform at all because it isn’t profitable for users to focus on Facebook videos and make ad money.

Anecdotally, it's true for books too. Amazon is a great way to figure out which books are the best reviewed before deciding to get one (whether paper, kindle or 'other means').

That's so disingenuous there should be a new term for it.

gSplain or gWash

Which is shortsighted. If competition did not benefit consumers, there would be no need for it anyway.

Anti-trust above all recognizes competition benefits consumers. And so “unfairly competing” is prohibited, because it is bad for the market, thus bad for consumers.

Yes, but they lack enforcement.

That depends, would Google let us know?

not if they could avoid it

Wikipedia isn't monetized. Doesn't it benefit them if Google is serving their content for free and people are finding the information they want without having to hit Wikipedia??

And also, isn't Google the largest sponsor for Wikipedia already? In 2019 - Google donated $2M [1]. In 2010, Google also donated $2m [2].

[1] https://techcrunch.com/2019/01/22/google-org-donates-2-milli...

[2] https://en.wikipedia.org/wiki/Wikimedia_Foundation

> Wikipedia isn't monetized.

No, but they often ask for donations when you visit the site, which people won't see if they just see the in-line blurb from Wikipedia on the Google results page.

> In 2019 - Google donated $2M [1]. In 2010, Google also donated $2m [2].

$2M is a pittance compared to what I expect Google believes is the value of their Wikipedia blurbs. If Wikipedia could charge for use of this data (which another commenter claims they are working on doing), they could easily make orders of magnitude more money from Google.

Of course, my expectation is that Google would rather drop the Wikipedia blurbs entirely, or source the data elsewhere, than pay significantly more.

Unlikely that Wikipedia will be able to charge for content, seeing as all of their content is CC-BY-SA licensed. https://en.wikipedia.org/wiki/Wikipedia:Licensing_update

They may be able to charge for bandwidth (if you want to use a Wikipedia image, you can use Wikipedia's enterprise CDN instead of your own), but their licensing allows me to rehost content as long as I follow the attribution & sublicensing terms.

Google has no problem operating their own CDNs, so I find it unlikely that Wikipedia will be able to monetize Google search results in such a manner as you described.

Disclaimer: I work for Google; opinions are my own.

Fine, the content is free. But if your crawlers want access to the content, then pay! Simple as that.

Will it be a flat fee, so that I, a lowly one-man crawler developer will not be able to afford it? Will it be that only Google can afford it, thus making their monopoly position even stronger?

Is there a Wikipedia crawling "welfare" program if I'm not a trillion dollar mega company?

Actually, yes! The current API is free. The new Enterprise API is paid.

Sure! Apply to become a crawler. And if you meet certain criteria and your crawlers don’t exceed a quota then have at it. The key is not to make it technically challenging, but to erect a legal barrier.

Wikimedia recently announced Wikimedia Enterprise for "organizations that want to repurpose Wikimedia content in other contexts, providing data services at a large scale".

So they're pretty clearly looking to monetize organizations which consume their data in a for-profit context.

monetizing != for-profit

You could e.g. just cover operational cost and/or improve the service quality from it.

I think they may have meant "(organizations) (which consume their data in a for-profit context)."

Google was/is also the largest sponsor of Mozilla. This doesn't stop Google from sabotaging Mozilla.

2 mln is probably Google's hourly profit. For that they get one of the biggest knowledge bases in the world. It's basically free as far as Google is concerned.

The instant Google becomes confident they can supplant Wikipedia, they will.

> 2 mln is probably Google's hourly profit.

You don't have to guess, their numbers are public. In 2020 they made $40B in profit, so it takes them about 27 minutes to make $2M in profit.
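
A quick sanity check of that arithmetic, as a sketch: it assumes a flat $40B for the year (the rounded figure quoted above) and profit spread evenly over every minute of 2020. The variable names are just for illustration.

```python
# Rounded annual profit quoted above; an assumption, not an exact filing number.
ANNUAL_PROFIT_USD = 40e9
DONATION_USD = 2e6

# 2020 was a leap year, so 366 days.
MINUTES_PER_YEAR = 366 * 24 * 60

profit_per_minute = ANNUAL_PROFIT_USD / MINUTES_PER_YEAR
minutes_to_earn = DONATION_USD / profit_per_minute

print(f"${profit_per_minute:,.0f} per minute")
print(f"{minutes_to_earn:.1f} minutes to earn ${DONATION_USD:,.0f}")
```

With these inputs it comes out a little over 26 minutes, the same ballpark as the figure above and well under the "hourly" guess.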

NOT a sponsor of Mozilla. Google buys web traffic (as default search engine) for ~$300M and turns it into several times that $ in ad revenue.

Not sure why you're being downvoted; I completely agree with what you're saying (modulo questionable usage of "sponsor"). If Wikipedia were to try to charge for this use of their data, Google would likely make it a priority to drop the Wikipedia blurbs, either without replacement, or with data sourced elsewhere.

> Google would likely make it a priority to drop the Wikipedia blurbs, either without replacement, or with data sourced elsewhere.

That's an odd way of phrasing things. If Wikipedia were to take away free access to their data, Google wouldn't be dropping Wikipedia, Wikipedia would be dropping Google. This line of thinking "you took this when I was giving it away for free, but now I want to charge for it, so you are expected to keep paying for it" is incorrect.

Given the scale that google already operates at, I don't doubt that they would just take a copy of the content and rebrand it as a google service, complete with user contribution.

Then, after two or five years, let it fester then abandon it. Nobody gets promoted for keeping well oiled machines running.

Remember Knol? https://en.wikipedia.org/wiki/Knol?wprov=sfti1

It was actually good for writing stuff when I tried it. Never brought in enough traffic. Killed.

> Google was/is also the largest sponsor of Mozilla. This doesn't stop Google from sabotaging Mozilla.

Google isn't a sponsor of Mozilla, they're a customer. Do people think Google is "sponsoring" Apple with $1.5 billion a year too?

> they're a customer.

The cynic in me thinks the product is anti-trust insurance.

$1.5 billion a year? You're off by an order of magnitude; the number is thought to be over $10 billion a year.

Google being Apple's customer doesn't mean Google isn't sponsoring Mozilla.

These are two very different companies with a very different relationship with Google. And very different influences on Google.

Google wants to be on iOS. It brings customers to Google. A lot of them. iOS is possibly more profitable to Google than Android even with all the payments Apple extracts from them.

Google needs Mozilla so that Google may pretend that there's competition in browser space and that they don't own standards committees. The latter already isn't really true, and Google increasingly doesn't care about the former.

Well then they can't nag users to donate to Jimmy Wales' trust fund.

Couldn't you make a similar argument about for-profit uses of free/libre software? The software serves a useful purpose, who cares where it came from?

>>Google donated $2M [1]. In 2010, Google also donated $2m [2].

$2 Million a year? Now I know why Googlers complained about having one less olive in their lunch salad.

How much does Google PROFIT from Wikipedia and how much does Wikipedia lose in fundraising when Google fails to send users to the info provider?

Wikipedia is drowning in money so this whole line of discussion is weird.

And most of the value of wikipedia is created by its unpaid users, not Wikimedia foundation.

> You see this recently with Wikipedia. Google's widgets have been reducing traffic to Wikipedia pretty dramatically.

Wikipedia visitors, edits, and revenue are all increasing, and the rate that they're increasing is increasing, at least in the last few years. Is this a claim about the third derivative?

> Enough so that Wikipedia is now pushing back with a product that the Googles of the world will have to pay for.

The Wikimedia Enterprise thing seems like it has nothing to do with missing visitors and that companies ingesting raw Wikipedia edits are an opportunity for diversifying revenue by offering paid structured APIs and service contracts. Kind of the traditional RedHat approach to revenue in open source: https://meta.m.wikimedia.org/wiki/Wikimedia_Enterprise

See https://searchengineland.com/wikipedia-confirms-they-are-ste... from 2015. Google's widgets that present Wikipedia data do reduce visitors to Wikipedia.

Or see page views on English Wikipedia from 2016-current: https://stats.wikimedia.org/#/en.wikipedia.org/reading/total... Looks pretty flat, right? Does that seem normal?

As for Wikimedia Enterprise, you do have to read between the lines a bit. "The focus is on organizations that want to repurpose Wikimedia content in other contexts, providing data services at a large scale".

The first link doesn't seem quite conclusive (see the part at the bottom), and also doesn't give evidence that Google's widgets are to blame.

The flattening of users could also be due to a general internet-wide reduction in long-form (or even medium-form) non-fiction reading. How are page views for The New York Times?

Seems like it should be simple to A/B test, though. Obviously Google could do it themselves by randomly taking away the widget, but one could also see whether referrals from non-Google search engines (though they are themselves a tiny percentage) continue to increase while Google remains flat.

Edit: Removed bad "simple english graph", thanks. Though the regular english wikipedia traffic is flat from 2016-present.

As for NYT, is there a better proxy to compare to? There's no public pageview stats and they have a paywall.

That first graph is Simple English, not English, and is in millions, not billions. They also explicitly call out the methodology change in 2015...

I'm not sure I agree with this. I think airline websites are so garbage filled that they've driven people to use the simple alternative of the google flights checkout.

It's a bit of a vicious cycle, but in general most websites are so chock-full of crap that not having to click into them for real is a relief!

BA had some tracking request inline on the “payment processing” page which when blocked by my pihole prevents me from ever getting to the confirmation page, just have to refresh your email and wait for the best.

I have no idea how these companies, which make quite a decent amount of money at least up until 2020, can have such utterly poor sites.

I once counted some 20+ redirects on a single request during this process heh..

I don’t know what they’re doing but most every single sign on tool I’ve seen redirects 10-20 times during the sign on process (and then dumps you to the homepage to navigate your way back).

Probably to get first party cookies on a handful of domains

Yeah, the Google flights issue is difficult. On one hand, the business practice is problematic. On the other hand, Google flights is so much better than its competitors it's ridiculous.

If there was a way to split Google flights into a separate company and somehow ensure it wouldn't devolve into absolute trash like its competitors, that would be a good thing.

It was ITA and prior to Google buying them, did a pretty good business selling backend flight shopping services to aggregators and airlines.

Shopping for flights is a surprisingly technically difficult thing to do well.

I'm talking about flight status. Not Google Flights, shopping, or booking.

There are events associated with flight status that Google doesn't know. Like change fee waivers, cash comp awards to take a later or earlier flight, seat upgrades, etc.

It's not Google's prerogative to scrape a website and display its content, no matter how awful the website.

If 1 airline let me view information in a friendly fashion and the other didn't I would do business with the first.

Lest we forget the money in that scenario is from butts in seats not clicks on a website. The particular example is ill chosen as google is actually taking on a cost, taking nothing, and gifting the airline a better ui.

If you make an awful website that can be scraped it's a matter of when not if someone will take your data and give it to your consumers whether you're trying to upsell them or not...

>However, that sort of search used to (most times) lead to a visit to the airline web site.

I don't think that's correct. In the old days you'd either call a travel agent or use an aggregator like expedia.

Google muscles out intermediaries like Expedia, Yelp, and so on. It's not likely much better or worse for the end user or supplier. Just swapping one middleman for another.

I can't prove it was that way, but I spent a lot of time in the space. For a long time, the airline's site used to be the top organic result, and there was no widget. Similar for other travel related searches (not just airlines) over time. Google has been pushing down organic results in favor of ads and widgets for a long time...and slowly, one little thing at a time. Like no widgets -> small widget below first organic result -> move the widget up -> make it bigger -> etc.

I don't think google muscling out intermediaries like Expedia is a good thing.

Just for example, Expedia is probably 5% of Google's total revenue and Google doesn't like slim margin services by and large that can't be automated.

Travel is fairly high-touch - people centric. It doesn't fit Google's "MO".

But... it's shitty that google can play all sides of the markets while holding people ransom to mass sums of money to pay to play on PPC where google doesn't... i think that's where the problem shines.

In essence, you're advocating that eBay goes away because google could do it... they could.. and eBay is technically just an intermediary, but do we want everything to be googlefied?

Google bought up/destroyed other aggregators - remember the days of fatwallet, priceline, pricewatch, shopzilla and such when they used to focus on discounts/coupons/deals and now they're moving more towards rewards/shopping/experience - it used to be i could do PPC on pricewatch and reach millions of shoppers at a reasonable rate, but now that google destroyed them all, the PPC rate on "goods" is absurdly high and not having an affordable market means only the amazons and walmarts can really afford to play...

it used to be you could niche out, but even then, that's getting harder

>In essence, you're advocating that eBay goes away because google could do it... they could.. and eBay is technically just an intermediary, but do we want everything to be googlefied?

I don't think I'm really advocating for it as much as I see it as a more or less neutral change.

That said, I'm pretty ambivalent about Google. Their size is a concern, but they also tend to be pretty low on the dark pattern nonsense. eBay, to use an example you gave, screwed me out of some buyer protection because of poor UX and/or bug (I never saw the option to claim my money after the seller didn't respond). In this specific instance Google ends the process by sending you to the airline to complete the booking. That, imho, is likely better than dealing with Expedia.

Companies opt in to sites like Expedia and list their properties/flights/vacations on their marketplace and they pay a commission for those being booked. Expedia doesn't just crawl them and demand a royalty for sending them traffic...

Google has a huge pay 2 play problem with PPC... i've worked for Expedia so that's the only reason i know this :)

It's the reason companies work with Expedia many times because they don't have the leverage expedia group does...

i see it as unnatural change btw... "borg" if you will.

It's actually pretty different because another middleman can basically arise only if it's a big success in the iOS App Store because coming up in Google searches would be impossible and more or less the same in the Play Store. So, Google is not just yet another intermediary.

Only if Google stays around long term. I wouldn't be surprised if each free product on its graveyard took down a dozen competing products before it was killed off.

Then someone can start a competitor up again, right? Assuming there's actually a market for it.

Not every market is lucrative in the extreme and it can take a long time to recover from being "disrupted". I think it is also a common practice for larger shopping chains to dump prices when they open a new location in order to clear out the local competition, so the damage it causes is well understood to be long lasting.

In regards to airlines, Google and Amadeus have a partnership I believe. Amadeus is the main source of data for many of these airline websites. If Google gets the data from Amadeus directly and not these websites, they are just cutting out the middleman. I don't shed a tear for any of these middlemen (together with their Dark Pattern UX design).

Amadeus isn't a source of flight status. It is a source for (some) planned schedules and fares. Global distribution systems are a complex topic that's hard to sum up on HN. For flight status, Google is pulling from OAG and Flight Aware, and also from airline websites. Though they don't show airline sites as a source.

> And I don't think Google will realize what the problem is until they start accidentally killing off large swaths of the actual sources of this content by taking the audience away.

What makes you think they care? Killing off the sources of content might even be their goal. If they kill off sources of content, they'd be more than happy to create an easier-to-datamine replacement.

Hypothetically, if they killed off Wikipedia, they are best placed to use the actual Wikipedia content[1] in a replacement, which they can use for more intrusive data-mining.

Google sells eyeballs to advertisers; being the source of all content makes them more money from advertisers while making it cheaper to acquire each eyeball.

[1] AFAIK, wikipedia content is free to reuse.

I'm not suggesting it's illegal. There are a great many practices that are legal that I dislike.

You're wrong on a lot of facts here. Google Flights doesn't get its data just by crawling, they get it from Sabre, the FAA, Eurocontrol, etc. Airlines are, obviously, extremely pleased to disseminate this information. Google Flights "gives back" in the exact same way as any other travel outlet: they book passengers.

As for Wikipedia, the WMF is quite happy that most of their traffic is now served by Google. WMF is in the business of distributing knowledge, not in the eyeballs business. Serving traffic is just a cost for them. The main problem has been that the average cost for Wikipedia to serve a page has gone up, because many readers read it via Google, and more people who visit Wikipedia are logged-in authors, which costs them more to serve. I'm sure there's an easy solution to this problem (for example, beneficiaries of Wikipedia can donate compute facilities and services, or something along those lines).

They don't get individual flight status (what I was talking about) from Sabre or the FAA or Eurocontrol. I didn't get into fares and planned schedules and Google Flights, that's a different topic. I was talking about the big widget you get for queries on status for a particular flight, which is not Google Flights.

They have relented in some ways, rolling out stuff in the widget like: "The airline has issued a change fee waiver for this flight. See what options are available on American's website"

But obviously, that kind of stuff isn't shown on Google for quite some time after it exists on the source site. And the widget pushes the organics off the fold unless you have a huge monitor.

As for Wikipedia, I was referring to this: https://news.ycombinator.com/item?id=26487993

"Airlines are, obviously, extremely pleased to disseminate this information"

In the same way that publishers love AMP, yes. They don't actually like it, but they are forced to make the best of it.

Oh, status. I was thinking of schedules. Still, what is the point for the consumer of being directed to an airline's terrible status page? And are they even capable of being crawled? Looking at American's site (it was the most ghastly airline that sprang to mind) I don't see how a crawler would be able to deal with it, and indeed the Google snippet for AA flight status, on the aa.com result which is far down in the results page, just says "aa.com uses cookies" which is about what you'd expect.

In this case, I want to be sent literally anywhere but aa.com.

"what is the point for the consumer of being directed to an airline's terrible status page?"

One example...

If you back up a bit, the widget didn't use to tell you there was a change fee waiver when the flight was full, while aa.com did.

That's an actual, tangible benefit that a consumer might want, worth real money. You can also even often "bid" on a dollar amount to receive if you're willing to change flights. Google doesn't present that info today.

There are more examples. My perspective isn't that Google should lead you to aa.com, but I do feel it's a bit dishonest that the widget is so large it pushes aa.com below the fold. It doesn't need to be that large.

That's the result of the crawling, and it preventing competition. Google would much prefer that people complain about the details while ignoring the root cause.

I don't understand that. The crawling access is mostly the same as it ever was. Google's SERPs are not. A mutually beneficial search engine that respects its sources would still crawl the same. Google used to be that.

The core problem is incentives: http://infolab.stanford.edu/~backrub/google.html "we believe the issue of advertising causes enough mixed incentives that it is crucial to have a competitive search engine that is transparent and in the academic realm."

That's incorrect. Before the search oligopolies formed, new search engines could start up. There was Excite, HotBot, AltaVista, and more. Now they don't have access. Search these comments for census.gov.

There are companies that do pretty well in this space, like ahrefs, for example. They do resort to trickery, like proxy clients that look like home computers or cell phones. But, if a small entity like ahrefs can do it, anyone can do it.

In a nutshell, though, I don't see equal access for all crawlers changing anything. Maybe that's the first barrier they hit, but it isn't the biggest or hardest one by far. Bing has good crawler access, but shit market share.

I swear something like 50% of those digests are totally incorrect as well. It's amazing they have kept the feature because it has never had a very high signal-to-noise ratio. I never trust what's presented in these digests without double-checking the source page.

I remember when rich snippets (one type of those widgets) came out there were a lot of funny examples. One for a common query about cancer treatments that pulled data from a dodgy holistic site saying that "carrots cured most types of cancer" (or something like that).

There was a similar one where Google emphatically claimed a US quarter was worth five cents in a pretty and large snippet graphic.

The most memorable rich snippet humor I've seen is a horse breeder sharing a story of how her searches gave snippets with my little ponies as the preview image.

I recall in the last UK election Google got the infographic of party leaders about 60-70% wrong.

And quite often a "People also ask" refinement is just some random guy's comment from Reddit.

Have you heard the story of Thomas Running? It’s a story Google will tell you.

(Search who invented running)

> The airline web site could then present things Google can't do. Like "hey, we see you haven't checked in yet" or "TSA wait times are longer than usual" or "We have a more-legroom seat upgrade if you want it".

If I'm a passenger, there's plenty of ways for airlines to notify me. If I'm searching for a flight status online, it's because I'm picking someone up. If I want more information, I'll click through. I don't see how either me or the airline are hurt.

> You see this recently with Wikipedia. Google's widgets have been reducing traffic to Wikipedia pretty dramatically. Enough so that Wikipedia is now pushing back with a product that the Googles of the world will have to pay for.

Why is that even a problem for Wikipedia?

It's a progression. It's not a huge problem, by itself, for either. But Google shareholders want to continue the same YoY gains. The only cash cow is search, so they continue to take screen real estate that used to go to others, and take it for themselves. Whether that's more ads, or more widgets, or whatever.

Yes, it's legal. But it does reduce visitor interactions for those sites. Reduced visitor interaction isn't good for web sites... it takes away incentives, reduces brand value, reduces revenue. Eventually, that is not great for consumers.

Ever been a middleman? Squeeze your suppliers enough, and you kill them. The next supplier will pre-emptively cut quality, features, etc, because they know you're going to try and squeeze them to death.

"If I'm searching for a flight status online, it's because I'm picking someone up"

That's one use case, it's not all of them. There's middle ground too, like the bones they throw Wikipedia in the form of links for more info.

> You see this recently with Wikipedia. Google's widgets have been reducing traffic to Wikipedia pretty dramatically. Enough so that Wikipedia is now pushing back with a product that the Googles of the world will have to pay for.

Do you have a link of that product/service from Wikipedia?


> In short, I don't think the crawler is the problem.

Except that, allow other companies to crawl/compete, and you can take eyeballs away from Google (which may well then return eyeballs to Wikipedia so long as the Google competitors don't also present scraped data).

They're making it easier to search for flights and arrange a trip. It's UX, and it makes me not hate the airlines/travel process as much. And I end up buying the flight from the airline anyway, and in many cases doing the arranging on the airline site in the end once it's determined, so Google is giving that back. They're not taking stuff from the airlines; I mean, what ads and stuff are on the airline sites anyway, specifically during the search process? Where they are taking away is from the Expedias and other aggregation sites that offer a garbage/hodgepodge experience that drives people crazy.

You're talking about Google Flights, which is completely unrelated to flight status.

The way that the web has been fundamentally broken by Google and other companies is one of the reasons I am excited about an alternative protocol called Gemini. It doesn't replace the web entirely, but for basic things like exchanging information, it's great. https://gemini.circumlunar.space/

>Google's widgets have been reducing traffic to Wikipedia pretty dramatically.

But wouldn't this be a good thing? Since Wikipedia is a nonprofit aiming to provide knowledge, Google stealing & caching their content might help serve information to more people while reducing Wikipedia's server load, so IMO it might not be so bad for Wikipedia.

Large swaths of web are garbage. Wasting people's time and attention on visiting pointless sites for something presentable in a small box is obviously not economical.

And if some of the sources somehow die? New sources will spring up. It doesn't matter.

> You see this recently with Wikipedia. Google's widgets have been reducing traffic to Wikipedia pretty dramatically. Enough so that Wikipedia is now pushing back with a product that the Googles of the world will have to pay for.

What product is that?

I think the flight arrivals/departures is a bad example. A good example might be putting flights.google.com on the first page or even allow it to exist.

Not sure what you mean. Both for flight status as well as flight shopping, Google drops a huge widget at the top and pushes everything else down, below the visible fold.

Standardized interoperability enables overall progress.

Every airline doesn't need their own webpage. They could all provide a standard API.

"Every airline doesn't need their own webpage. They could all provide a standard API."

That's sort of how it works in the corporate booking tool world. It is decidedly not a better experience for end users, IMO.

There's quite a lot about each airline that is different, so any unified approach is a lowest common denominator. You'll notice things like loyalty points, for example, have more rich data on the airline's website. And that some fares are ONLY on the website. Or that seat maps have more useful detailed info, etc.

And that's all shopping/booking. Departure control, flight status, upgrade/downgrade, check-in, seat upgrades, standby, etc, are for the most part only on the airline's website.

The way to look at this from Google’s point of view is to realise that most websites are slow and bad[1], so if Google sent you there you would have a bad experience with a bad slow website trying to find the information you want. Google want to make it better for you.

[1] it feels like Google have contributed a lot to websites being slow and bad with eg ads, amp, angular, and probably more things for the other 25 letters of the alphabet.

> Google want to make it better for you.

Hehe, sure, nothing nefarious or greedy here... move along, move along, nothing to see...

I've noticed that sometimes Google had updated flight information before the displays at the airport.

For the most part individual airports own that infrastructure. So it's hard to generalize. For most types of notable flight status/time changes, however, airlines usually know first.

There are exceptions, like an airport-called ground stop.

Does the concierge of a hotel take anything away when he informs you that your flight has been delayed?

It's hard for me to make that an apt analogy. She's not well known as a portal to find websites, which is what Google had been for most of its existence.

It's pretty difficult to come up with a non-computer analogy for how Google works now. Pick a different space, and the power imbalance is quite clear. If they wanted, they could destroy StackExchange very quickly with these widgets.

Only if you are presuming they are going to make a question-answering widget, too, since the content on StackExchange doesn't materialize out of thin air.

Perhaps I am misunderstanding or over simplifying things but it always surprises me that there are legal cases brought against companies who scrape data when so many of Google's products are doing exactly this.

It definitely feels like one set of rules for them and a different set for everyone else.

I mean it's not that weird that a company would authorize major search engines scraping them but no one else.

I don't really see this as Google playing by different rules so much as economic incentives being aligned in Google's favor.

Google doesn't scrape anything that the site owner objects to.

https://knuckleheads.club/the-googlebot-monopoly/ has actual details.

> Let’s take a look at the robots.txt for census.gov from October of 2018 as a specific example to see how robots.txt files typically work. This document is a good example of a common pattern. The first two lines of the file specify that you cannot crawl census.gov unless given explicit permission. The rest of the file specifies that Google, Microsoft, Yahoo and two other non-search engines are not allowed to crawl certain pages on census.gov, but are otherwise allowed to crawl whatever else they can find on the website. This tells us that there are two different classes of crawlers in the eyes of the operators of census.gov: those given wide access, and those that are totally denied.

> And, broadly speaking, when we examine the robots.txt files for many websites, we find two classes of crawlers. There is Google, Microsoft, and other major search engine providers who have a good level of access and then there is anyone besides the major crawlers or crawlers that have behaved badly in the past that are given much less access. Among the privileged, Google clearly stands out as the preferred crawler of choice. Google is typically given at least as much access as every other crawler, and sometimes significantly more access than any other crawler.
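For anyone who wants to see the two-class pattern from the quote in action, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules below are a made-up illustration in the spirit of the description, not the actual census.gov file, and `FooCrawler` and `example.gov` are hypothetical names.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: deny everyone by default, then give two named
# crawlers wide access except for one path. NOT the real census.gov file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/

User-agent: bingbot
Disallow: /cgi-bin/
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if the rules above let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "bingbot", "FooCrawler"):
        print(agent, allowed(agent, "https://example.gov/data/table.html"))
```

The named crawlers fall through to their own, mostly permissive entry, while anything else hits the catch-all `Disallow: /`. That is the "wide access vs. totally denied" split the article describes.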

Broadly speaking, robots.txt files are often ignored. I used to run a fairly large job ad scraping organization, and we would be hired by companies (700 of the Fortune 1000 used us) to scrape the job ads from their career pages, and then post those jobs on job boards. 99 times out of 100, the robots file would disallow us from scraping. Since we were being paid by that company's HR team to scrape, we just ignored it, because getting it fixed would take six months and 22 meetings.

> Broadly speaking, robots.txt files are often ignored.

If you wanna go nuclear on people who do that, include an invisible link in your html and forbid access to that URL in your robots.txt, then block every IP who accesses that URL for X amount of time.

Don't do this if you actually rely on search engine traffic though. Google may get pissed and send you lots of angry mail like "There's a problem with your site".
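A rough sketch of that trap idea, assuming a single in-memory process: the trap path, the ban length, and the `TrapBlocklist` class are all hypothetical names for illustration, and a real setup would also need the invisible link in the HTML and a matching `Disallow` line in robots.txt.

```python
import time

TRAP_PATH = "/do-not-follow/"  # hypothetical URL, also listed under Disallow in robots.txt
BAN_SECONDS = 3600             # the "X amount of time"; an arbitrary choice here


class TrapBlocklist:
    """Temporarily bans any IP that requests the trap URL."""

    def __init__(self, ban_seconds=BAN_SECONDS, clock=time.monotonic):
        self.ban_seconds = ban_seconds
        self.clock = clock
        self.banned_until = {}  # ip -> time at which the ban expires

    def should_serve(self, ip, path):
        now = self.clock()
        # Refuse while an earlier ban is still active.
        if ip in self.banned_until and self.banned_until[ip] > now:
            return False
        # Well-behaved crawlers never request this path, so a hit means
        # the client ignored robots.txt.
        if path.startswith(TRAP_PATH):
            self.banned_until[ip] = now + self.ban_seconds
            return False
        return True
```

In a web server this check would sit in front of the request handler and be called with the remote address and request path.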

> Don't do this if you actually rely on search engine traffic though. Google may get pissed and send you lots of angry mail like "There's a problem with your site".

Ah, but of course you would exclude Google's published crawler IPs from this restriction, because that is exactly what they want you to do.

We would occasionally have a customer try doing that. AWS has lots of IP addresses :-).

Nice insight - use a different IP address for hidden links!

So from the website's point of view there is no difference between 'crawling' and 'scraping'. Census.gov I assume has a ton of very useful information which is in the public domain which a host of potential companies could monetize by regularly scraping census.gov. Census.gov's purpose to make this information available to people is served by google, yahoo and bing. On the other hand if I have a business which is based on that data, in fact I'm at cross purposes to them.

The census data is available for bulk download, mostly as CSV (for example [1]). Scraping census.gov is worse for both the Census Bureau (which might have to do an expensive database query for each page) and for the scraper (who has to parse the page).

Blocking scrapers in robots.txt is more of a way of saying, "hey, you're doing it wrong."

It's also worth noting that the original article is out of date. The current robots.txt at census.gov is basically wide-open [2].

[1] https://www.census.gov/programs-surveys/acs/data/data-via-ft...

[2] https://www.census.gov/robots.txt

Scrapers don't care about robots.txt. I have scraped multiple websites in a previous job and the robots.txt means nothing. Bigger sites might detect and block you but most don't.

I'm generally anti business. But I have to disagree. "The Public" that the government serves includes businesses. Businesses (ignoring corporate personhood bullshit) are owned and operated by people.

I do not want the government deciding "what purposes" (e.g. non-commercial) serve the public good. The public gets to decide that. (Charging a license for commercial use is maybe OK, assuming supporting that use costs government "too much".)

And I very much do not want the current situation, with the government anointing a handful of corporations (the farthest thing from the public possible) with access and denying everyone else, including all of the actual public.

A specific case where this favorite-picking by government enables corruption: https://en.wikipedia.org/wiki/Nationally_recognized_statisti...

And an example from the quickly-approaching future, when there will be Nationally Recognized Media Organizations who license "Fact-Checkers," through which posts to public-facing will have to be submitted for certification and correction.

Favorite-picking by the government is corruption by itself already.

> I do not want the government deciding "what purposes" e.g. non-commercial, serve the public good. The public gets to decide that.

The public's "decision" on things like this is made manifest by government policy, no?

In theory. In practice, is every single policy that our government upholds currently popular with the majority of people?

It's possible to have government policies that the majority of people disagree with, that remain for complicated reasons related to apathy, lobbying, party ideology, or just because those issues get drowned out by more important debates.

Government is an extension of the will of the people, but the farther out that extension gets, the more divorced from the will of the people it's possible to be. That's not to say that businesses are immune from that effect either -- there are markets where the majority of people participating in them aren't happy with what the market is offering. All of these systems are abstractions, they're ways of trying to get closer to public will, and they're all imperfect. But government is particularly abstracted, especially because the US is not a direct democracy.

I'm personally of the opinion that this discussion is moot, because I think that people have a fundamental Right to Delegate[0], and I include web scraping public content under that right. But ignoring that, because not everyone agrees with me that delegation is right, allowing the government to unilaterally rule on who isn't allowed to access public information is still particularly susceptible to abuse above and beyond what the market is capable of.

[0]: https://anewdigitalmanifesto.com/#right-to-delegate

In the case of Census.gov, they offer an API to get the data[0]. It's actually pretty nice. Stable, ton of data, fairly uniform data structure across the different products. Very high rate limits, considering most data only needs retrieved once a year. I think they understand the difference between crawling and scraping.

[0] https://www.census.gov/data/developers.html

But Google, Yahoo and Bing are also monetizing the data. Why are they allowed to provide “benefits” but “scrapers” are not? Why is it wrong to monetize public data?

Having data in the right format as a download or via an API would be the best way to go for public data.

If people have to 'scrape' that data from a public resource, I'd say they're presenting the data in the wrong way.

I used to run a fairly large job ad scraping operation. Our scraped data was used by many US state and federal job sites. "Scraping" is just using software to load a page and extracting content. "Crawling" is just load a page, find hyperlinks (hmm... a kind of content), and then crawling those links. Crawling is just a kind of scraping.
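To make the "crawling is just a kind of scraping" point concrete, here is a minimal sketch using only the Python standard library. The `fetch` callable is an assumption: it stands in for whatever actually loads a page (here it can be a plain dict lookup, so nothing touches the network), and the function names are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag: hyperlinks are just more content."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def scrape_links(base_url, html):
    """Scraping: take a loaded page and extract content (here, its links)."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def crawl(start_url, fetch, limit=100):
    """Crawling: scrape links from a page, then repeat on those links."""
    seen, queue = [], [start_url]
    while queue and len(seen) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.append(url)
        html = fetch(url)  # `fetch` returns the page HTML, or None if missing
        if html is None:
            continue
        for link in scrape_links(url, html):
            if link not in seen:
                queue.append(link)
    return seen
```

For example, `crawl("https://example.com/", pages.get)` with `pages` being a dict of URL to HTML walks the fake site breadth-first.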

Is it legal for a government entity to issue a robots.txt like that? Maybe the line between use and abuse hasn't been delineated as well as it needs to be.

Is failure to honor a robots.txt a crime? Or rather, would it be unlawful to spoof a user agent to access this publicly available data? After the LinkedIn [0] case it seems reasonable to think not.

[0]: https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-l...

Spoofing user-agents hasn't worked in a long time for anything but small operations because search engines publish specific IP ranges their scrapers use.
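Roughly, the check on the site's side looks like the sketch below, written with Python's standard `ipaddress` module. The ranges are placeholders taken from the RFC 5737 documentation blocks, not the ranges any search engine actually publishes, and `is_genuine_crawler` is a made-up helper name.

```python
import ipaddress

# Placeholder networks (RFC 5737 documentation blocks). These are NOT the real
# published ranges; a real deployment would load the search engines' own lists.
PUBLISHED_RANGES = {
    "Googlebot": [ipaddress.ip_network("192.0.2.0/24")],
    "bingbot": [ipaddress.ip_network("198.51.100.0/24")],
}


def is_genuine_crawler(user_agent, remote_ip):
    """True only if the claimed crawler name matches one of its known networks."""
    ip = ipaddress.ip_address(remote_ip)
    for name, networks in PUBLISHED_RANGES.items():
        if name.lower() in user_agent.lower():
            return any(ip in network for network in networks)
    return False  # not claiming to be a known crawler at all
```

A spoofed `Googlebot` user agent coming from an address outside the listed networks fails the check, which is why the user-agent string alone gets you nowhere.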

The CFAA is so broad and broadly interpreted that I would assume that failure to honor any site's robots.txt file may incur criminal liability if the U.S. government can claim American jurisdiction (e.g., because the site's owners are U.S. persons or a U.S. corporation, or because the site's servers are located in the U.S.).

> Is it legal for a government entity to issue a robots.txt like that?

I may be wrong (this isn't my area), but I was under the impression that robots.txt was just an unofficial convention? I'm not saying people should ignore robots.txt, but are there legal ramifications if ignored? I'm not asking about techniques sites use to discourage crawlers/scrapers, I'm specifically wondering if robots.txt has any legal weight.

Yep, it's defined at https://www.robotstxt.org/

Looks like Google is trying to turn it into an RFC though.

Perhaps there could be some kind of 'Crawler consortium'?

Under this consortium, website owners would be allowed to either allow all crawlers (approved by the consortium) or none at all (that is, none that is in the consortium, i.e. you could allow a specific researcher or something to crawl your website on a case-by-case basis).

This consortium would be composed of the search engines (Google, MS, other industry members), as well as government appointed individuals and relevant NGOs (electronic frontier foundation, etc?). There would be an approval process that simply requires your crawl to be ethical and respect bandwidth usage. Violations of ethics or bandwidth limits could imply temporary or permanent suspension. The consortium could have some bargain or regulatory measures to prevent website owners from ignoring those competitive and fairness provisions.

> Perhaps there could be some kind of 'Crawler consortium'?

An industry-wide agreement not to compete for commercially valuable access to suppliers of data?

Comprised of companies that are current (and in some cases perennial) focusses of antitrust attention?

I think there might be a problem with that plan.

I don't see the problem. If a bunch of non-google companies pooled resources to make a crawl, that would reduce market concentration, not increase it.

Well, yes, and a common solution to anti-trust cases, that I know of, is some kind of industry self-regulation. In this case I wouldn't trust the industry alone to self-regulate; hence, they should at least invite governments and civil society (NGOs and other organizations) to participate, in a minority but not insignificant position.

Could you better describe your objections?

Are there any actual repercussions for just ignoring robots.txt?

There is if you are doing it for work. For example, your company could get sued if you are found using that data and ignoring the ToS. If you are a public figure, you could get your name tarnished as doing something unethical or the media may call it "hacking". If you are rereleasing the data then you risk getting a takedown notice.

robots.txt is not a terms of service. Even if it was, it wouldn't be enforceable for a public website. You would need to prove that a web crawler is maliciously causing disruption to your service, and that is not easy.

All it takes is your company execs or lawyers to be afraid of a stern letter, and ask you to cancel your project. If you're violating their robots.txt, you're probably violating their terms of service that's hidden somewhere. And your company doesn't want to risk having to pay hundreds of thousands to fight a court case. There's also venues besides courts for them to attack you, like contacting the publishers or hosting platforms for your derivative works. It's a chilling effect.

And I'm not making this up. This kind of stuff has happened to me many times.

Your crawler's IP might get banned, eventually.

Sometimes website admins will also try to report your ips to the service provider as a source of attacks (even if not true).

Given how often I had misbehaving crawlers slow down servers in the early 2000s, I do not see how a crawler that disobeys robots.txt is not an attempted attack.

On a related note, Cloudflare just introduced "Super Bot Fight Mode" (https://blog.cloudflare.com/super-bot-fight-mode/) which is basically a whitelisting approach that will block any automated website crawling that doesn't originate from "good bots" (they cite Google & Paypal as examples of such bots). So basically everyone else is out of luck and will be tarpitted (i.e. connections will get slower and slower until pages won't load at all), presented with CAPTCHAs or outright blocked. In my opinion this will turn the part of the web that Cloudflare controls into a walled garden not unlike Twitter or Facebook: In theory the content is "public", but if you want to interact with it you have to do it on Cloudflare's terms. Quite sad really to see this happen to the web.

On the other hand, I do not want my site to go down thanks to a few bad 'crawlers' that fork() a thousand http requests every second and take down my site, forcing me to do manual blocking or pay for a bigger server/scale-out my infrastructure. Why should I have to serve them?

You can use the same rate-limiting for all crawlers, Google or not.

Googlebot is pretty careful and generally doesn’t cause these problems.

Right, then they shouldn't be affected by the rate-limiting, as long as it's reasonable. If it was applied evenly to all clients/crawlers, it'd at least allow the possibility for a respectful, well-designed crawler to compete.
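As a sketch of what "applied evenly" could mean: one token bucket per client, with identical parameters for everyone, Googlebot included. The class names, the rate, and the burst size are all made up for illustration, and this is in-memory only.

```python
import time


class TokenBucket:
    """Allows `rate` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerClientLimiter:
    """Gives every client (crawler or not) its own bucket with the same limits."""

    def __init__(self, rate=1.0, burst=5, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.buckets = {}  # client id -> TokenBucket

    def allow(self, client_id):
        bucket = self.buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.burst, self.clock)
            self.buckets[client_id] = bucket
        return bucket.allow()
```

A careful crawler never notices the limit, and an aggressive one gets throttled regardless of whose name is in its user agent.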

The problem is, if you own a website, it takes the same amount of resources to handle the crawl from Google and FooCrawler even if both are behaving, but I'm going to get a lot more ROI out of letting Google crawl, so I'm incentivized to block FooCrawler but not Google. In fact, the ROI from Google is so high I'm incentivized to devote extra resources just for them to crawl faster.

We know that. No one claims websites are doing this for no reason. It's explicitly written in the article.

But this sub-thread is about misbehaved crawlers.

Agree... this entire argument seems to think I give a rat's ass about 9000 different crawlers that give me literally zero benefit and only waste server resources. Most of those crawlers are for ad-soaked, piss-poor search engines. I'd rather just block them all and allow only crawlers that know how to behave.

In the early 90s there were various nascent systems for essentially public database interfaces for searching

The idea was that instead of a centralized search, people could have fat clients that individually query these APIs and then aggregate the results on the client machine.

Essentially every query would be a what/where or what/who pair. This would focus the results

I really think we need to reboot those core ideas.

We have a manual version today. There's quite a few large databases that the crawlers don't get.

The one place for everything approach has the same fundamental problems that were pointed out 30 years ago, they've just become obvious to everybody now.

I wonder what happens to RSS feeds in this situation. Programs I run that process RSS feeds will just fetch them over HTTP completely headlessly, so if there are any CAPTCHAs, I'm not going to see them.

I've found that Cloudflare isn't great at this. I even found cases where my site was failing to load to googlebot (a "good" bot that they probably have the IPs for) because they were serving a captcha instead of my CSS.

So your best bet is setting a page rule to allow those URLs.

In my experience, those either get detected(?) and let through (RSS can be aggressively cached after all) or you're out of luck and the website owner set up e.g. WordPress (which automatically included RSS URLs) but did not configure Cloudflare to let RSS through.

That will be interesting to see with regards to legal implications. If they (in the website operator's name) block access to e.g. privacy info pages to a normal user "by accident", that could be a compliance issue.

I don't think mass blocking is the right approach in general. IPs, even residential ones, are relatively easy and relatively cheap to get. At some point you're blocking too many normal users. Captchas are a strong weapon, but they too have a significant cost in annoying the users. Cloudflare could theoretically do invisible-invisible captchas by never even running any code on the client, but that would be wholesale tracking and would probably not fly in the EU.

Cloudflare is an agent for website owners. Nearly everything is configurable and the defaults are permissive.

How hard is it to ask Cloudflare to let you crawl?

It's not Cloudflare who is deciding it. It's the website owners who request things like "Super Bot Fight Mode". I never enable such things on my CF properties. Mostly it's people who manage websites with "valuable" content, e.g. shops with prices who desperately want to stop scraping by competitors.

I can say this will give a lot of businesses a false sense of security. It is already bypassable.

The web scraping technology that I am aware of has reached its end game already: unless you are prepared to authenticate every user/visitor to your website with a dollar sign, or lobby Congress to pass a bill to outlaw web scraping, you will not be able to stop web scraping in 2021 and beyond.

But what about captchas?

Due to aggressive no-script and uBlock use I, browsing the website as a human, keep getting hit by captchas and my success rate is falling to a coinflip. If there's a script to automate that I'm all ears.

100% doable. Like I said, this type of blanket throttling seems to be the new trend, but it's already defeated.

I just no longer see it possible to 1) put information on the web (private or public) 2) give access outside your organization (customers or visitors) 3) expect your website will not be scraped.

ToS is NOT the law unfortunately.

So, one more reason to hate Cloudflare and every single website that uses it.

Or maybe don’t “hate” folks who are just trying to put some content online and don’t want to deal with botnets taking down their work? You know, like what the internet was intended for.

> don’t want to deal with botnets taking down their work

Botnets and automated crawling are completely different things. This isn't about preventing service degradation (even if it gets presented that way). It's an attempt by content publishers to control who accesses their content and how.

Cloudflare is actively assisting their customers to do things I view as unethical. Worse, only Cloudflare (or someone in a similarly central position) is capable of doing those things in the first place.

Internet was certainly not intended for centralization. I hit Cloudflare captchas and error pages so often it's almost sickening. So many things are behind Cloudflare, things you least expect to be behind Cloudflare.

It's easy enough to bypass most Cloudflare “anti-bot” with an unusual refresh pattern or messing with a cookie. (It's easier to script this than solve the CAPTCHAs.)

Anyone malicious who is determined enough will just pay their way through any captcha. Yet for me, as a legitimate user, these "one more step" pages feel downright humiliating. At this point, if I see one, I either just nope out of it, or look for a saved copy on archive.org.

Can we take a moment to talk about this club's business model?

There's not even any information to see what the "private forum access" that you have to pay for is about, what kind of people are in it...or even to know about what happens with the money.

For me, this sounds like a scam.

I mean, no information about any company. No imprint. No privacy policy. No non-profit organization. And just a copy/paste wordpress instance.

I mean, srsly. I am building a peer-to-peer network that tries to liberate the power of google, specifically, and I would not even consider joining this club. And I am the best case scenario of the proposed market fit.

Not being set up as a 527 nonprofit[0] is the biggest red flag - no donation or membership money has to be spent for political purposes. They also use memberful for their membership/payment system, which doesn't require owning a business, so you might be paying out to the owner directly instead of to a business with its own bank account. Maybe the owner is looking at HN and can clarify.

To add, there are a lot of businesses that use the terms 'Knucklehead' so finding their business on secretary of state business searches might be impossible.

0: https://www.irs.gov/charities-non-profits/political-organiza...

They want you to pay them to "research" google's web crawling monopoly. It's really just a donation, but they don't frame it like that. Probably more credible than using a crowd funding website, because it sounds like they're pushing for actual legislation.

> Meet with legislators and regulators to present our findings as well as the mock legislation and regulations. We can’t expect that we can publish this website or a PDF and then sit back while governments just all start moving ahead on their own. Part of the process is meeting with legislators and regulators and taking the time helping them understand why regulating Google in this way is so important. Showing up and answering legislators’ questions is how we got cited in the Congressional Antitrust report and we intend to keep doing what’s worked so far.

I'd like to see some data on their claim that website operators are giving googlebot special privileges. As far as I can tell it would be a huge pain in the ass to block crawler bots from my servers, not that I've tried. I have some weird pages that tend to get crawlers caught in infinite loops, and I try to give them hints with robots.txt but most of the bots don't even respect robots.txt.

If I actually wanted to restrict bots, it would be much easier to restrict googlebot because they actually follow the rules.

I don't disagree in principle that there should be an open index of the web, but for once I don't see Google as a bad actor here.

A company I worked for ~7 years ago ran its own focused web crawler (fetching ~10-100m pages per month, targeting certain sections of the web).

There were a surprising number of sites out there that explicitly blocked access to anyone but Google/Bing at the time.

We'd also get a dozen complaints or so a month from sites we'd crawled. Mostly upset about us using up their bandwidth, and telling us that only Google was allowed to crawl them (though having no robots.txt configured to say so).

Isn't that the website owner's right though? I'm not sure I understand the problem here.

If Google is taking traffic and reducing revenue, a company can deny in robots.txt. Google will actually follow those rules - unlike most others that are supposedly in this 2nd class.

Yup, no problem here, was just making an observation about how common such blocking was (and about the fact that some people were upset at being crawled by someone other than Google, despite not blocking them).

The company did respect robots.txt, though it was initially a bit of a struggle to convince certain project managers to do so.

> Isn't that the website owner's right though?

No. The internet is public. Publishers shouldn't get any say in who accesses their content or how they do it. As far as I'm concerned, the fact that they do is a bug.

No, it's not. I can set up a login page and keep you out if I want. And I can do it however I want.

But your login page will be public and subject to being crawled.

My server, my rules.

I usually recommend setting only Google/Bing/Yandex/Baidu etc to Allow and everything else to Disallow.

Yes, the bad bots don't give a fuck, but even the non-malicious bots (ahrefs, moz, some university's search engine etc) don't bring any value to me as a site owner, take up bandwidth and resources and fill up logs. If you can remove them with three lines in your robots.txt, that's less noise. Universities especially do, in my opinion, often behave badly and are uncooperative when you point out their throttling does not work and they're hammering your server. Giving them a "Go Away, You Are Not Wanted Here" in a robots.txt works for most, and the rest just get blocked.
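
For illustration, here is a minimal sketch of the kind of allowlist robots.txt described above, checked with Python's standard `urllib.robotparser`. The user-agent tokens and the example.com URL are assumptions made for the example; each search engine documents its own crawler tokens, so check those before copying this. It only shows what a well-behaved crawler would conclude, since robots.txt enforces nothing on bots that ignore it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a few named crawlers may fetch everything,
# every other user agent is asked to stay away.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: bingbot
User-agent: Yandex
User-agent: Baiduspider
Allow: /

User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask the parser what a rule-following crawler would be allowed to do.
for agent in ("Googlebot", "bingbot", "AhrefsBot", "SomeUniversityBot"):
    allowed = parser.can_fetch(agent, "https://example.com/some/page")
    print(f"{agent}: {'allowed' if allowed else 'disallowed'}")
```

The named crawlers share one rule group, and the `*` group acts as the fallback for everyone else.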

> they're hammering your server

Why can't you just ratelimit IPs that are "too active" for your server to handle?
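
As a rough sketch of what per-IP rate limiting could look like, here is a small sliding-window limiter. The class name, the limits and the injectable clock are all assumptions made for this example, not anything a particular server or CDN ships. In practice this usually lives in the reverse proxy or load balancer rather than in application code.

```python
import time
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Sliding-window limiter: at most max_requests per window per IP."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip):
        """Return True if this request may proceed, False if it is throttled."""
        now = self.clock()
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A server would call `limiter.allow(client_ip)` once per request and answer with HTTP 429 when it returns False. Note that this keeps state for every IP it has ever seen, so a real deployment would also need to evict idle entries.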

From some I could, but why would I? If they're not adding value and they don't want to behave, I don't see a reason to spend money to adapt my systems to be "inclusive" towards their usage patterns.

In context, you're justifying blocking all automated traffic, even that which does behave, by pointing out that some of it doesn't. That attitude seems lazy at best, malicious at worst.


Now that's a really good point. I wonder why there isn't a standard protocol for signalling upstream that a particular connection is abusive and to please rate limit the path at the source on your behalf? It would certainly add complexity, but the current situation is hardly better.

When you operate commercial sites at scale, bots are a real thing you spend real engineering hours thinking about and troubleshooting and coding to solve for.

And yes, that means google gets special treatment.

Think about the model for a site like stackoverflow. The longest of long tail questions on that site: what’s the actual lifecycle of that question?

- posted by a random user
- scraped by google, bing, et al
- visited by someone who clicked on a search result on google
- eventually, answered
- hopefully, reindexed by google, bing et al
- maybe never visited again because the answer now shows up on the google SERP

In the lifetime of that question how many times is it accessed by a human, compared to the number of times it’s requested and rerequested by an indexing bot?

What would be the impact on your site of three more bots as persistent as google bot? Why should you bother with their requests?

So yes, sites care about bot traffic and they care about google in particular.

See figure I.4 on page 24 of this UK government report: https://assets.publishing.service.gov.uk/media/5efb1db6e90e0...

Additional evidence here: https://knuckleheads.club/the-evidence-we-found-so-far/

A lot of news websites restrict any crawler other than Google. And this does not happen only via robots.txt.

Indeed, years ago I had scripts to automatically fetch URLs from IRC and I quickly realized that if I didn't spoof the user agent of a proper web browser many websites would reject the query. Googlebot's UA worked just fine however.

> Googlebot's UA worked just fine however

They obviously don't care enough then - Google says you should use rdns to verify that googlebot crawls are real[0]. Cloudflare does this automatically now as well for customers with WAF (pro plan).

0: https://developers.google.com/search/docs/advanced/crawling/...

Spoofing your user-agent as googlebot is a common way to bypass paywalls, is (was?) a way to read Quora without creating an account, etc. Publishers obviously need to send their page/article to Google if they want it to be indexed but may not want to send the same page content to a normal user: https://www.256kilobytes.com/content/show/1934/spoofing-your...

This was common even back in the mid-2000s:



Google aren't the bad actor in the sense that they are actively doing something wrong, but they are definitely benefiting from the monopoly that they created and work on maintaining. If this continues then nobody will really ever be able to challenge them, which means possibly "better" products will fail to penetrate the market.

> but for once I don't see Google as a bad actor here.

As inflammatory as the headline of the page looks, they literally admit it's not google's fault in the smaller text lower down:

"This isn’t illegal and it isn’t Google’s fault, but"

LinkedIn profiles and Quora answers are accessible to Googlebot without signing in.

The studies and data to support their claim are in the first paragraph of the article you "read" before posting the question.

It's hilarious to think there exist people who think googlebot does not get special treatment from website operators. Here is an experiment you can do in a jiffy: write a script that crawls any major website and see how many URL fetches it takes before your IP gets blocked.

Googlebot has a range of IP addresses that it publicly announces so websites can whitelist them.

> Googlebot has a range of IP addresses that it publicly announces so websites can whitelist them.

Google says[1] they do not do this:

"Google doesn't post a public list of IP addresses for website owners to allowlist."


From that same page they recommend using a reverse DNS lookup (and then a forward DNS lookup on the returned domain) to validate that it is google bot. So the effect is the same for anyone trying to impersonate googlebot (unless they can attack the DNS resolution of the site they’re scraping I guess).
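
The reverse-then-forward DNS check described above can be sketched roughly like this. The function names are made up for the example, and the accepted hostname suffixes (googlebot.com and google.com) are my recollection of Google's guidance, so confirm them against the current documentation. The resolvers are passed in as parameters so the logic can be exercised without touching real DNS.

```python
import socket

# Assumed hostname suffixes for Google's crawlers; verify against Google's docs.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip):
    """Reverse DNS: IP address -> primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname):
    """Forward DNS: hostname -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_googlebot(ip, reverse=reverse_lookup, forward=forward_lookup):
    """Return True only if the reverse and forward lookups agree with each other."""
    try:
        hostname = reverse(ip)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    # Step 1: the reverse record must sit under a Google crawler domain.
    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False
    # Step 2: the hostname must resolve back to the IP that connected.
    try:
        return ip in forward(hostname)
    except OSError:
        return False
```

The forward step is what defeats impersonation: anyone who controls the reverse zone for their own IP range can publish a PTR record claiming to be googlebot.com, but they cannot make Google's forward zone point back at their address. This sketch only handles IPv4 on the forward side.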

I don't whitelist googlebot, but I don't block them either because their crawler is fairly slow and unobtrusive. Other crawlers seem determined to download the entire site in 60 seconds, and then download it again, and again, until they get banned.

I have never had that problem running screaming frog on big brand sites apart from one or two times.

I don't scrape a website often, but when I do, I'm using a user agent of a major browser.

Do any of them intersect with Google Cloud IP addresses? If so set up a VPN server on Google Cloud.

The idea of a public cache available to anyone who wishes to index it is ... kind of compelling.

If it was the only indexer allowed, and it was publicly governed, then enforcing changes to regulation would be a lot more straightforward. Imagine if indexing public social media profiles was deemed unacceptable, and within days that content disappeared from all search engines.

I don't think it'll ever happen, but it's interesting to think about.

Common Crawl is attempting to offer this as a non-profit: https://commoncrawl.org

o/t but what the hell are they doing to scroll on that page? I move my fingers a centimeter on my trackpad and the page is already scrolled all the way to the bottom.

Hijacking scroll like this is one of the biggest turnoffs a website can have for me, up there with being plastered with ads and crap. It's ok imo in the context of doing some flashy branding stuff (think Google Pixel, Tesla splashes) but contentful pages shouldn't ever do this.

Add *##+js(aeld, scroll) to your uBO filters. That will stop scroll JS for all websites.

> If it was the only indexer allowed, and it was publicly governed

Which would put it under government regulation and be forever mired in politics over what was moral, immoral, ethical or unethical and all other kerfuffle. To an extent, it’s already that way, but that would make it worse than it is currently.

Here's an idea... what if search became a peer-to-peer standardized protocol that is part of the stack to complement DNS? E.g. instead of using DNS as the primary entry point, you use a different protocol at that level to do "distributed search". DNS would still play a role too, but if "search" was a core protocol, the entry point for most people would be different.

Similar to some of the concepts of "Linked Data", maybe - https://en.wikipedia.org/wiki/Linked_data.

The problem is getting to a standard: it would essentially need to be federated search, so a standard would have to be established (de facto, most likely).

Also, indexes and storage, distribution of processing load... peer-to-peer search is already a thing, but it doesn't seem to be a core function of the Internet.

This is basically the same concept as making an "open" version of something that is "closed" in order to compete, I guess.

I'd have to look more but maybe running a cache isn't dead simple. I can imagine that the benefits of manipulating what's in the cache either adding or removing would be very high. Google and the others are private companies so they're not required to do everything in the public view.

A public cache wouldn't be able to (indeed shouldn't) play cat and mouse games with potential opponents. I suspect most of the games played require not explaining exactly what you're doing.

An alternative but similar idea: apply your own algorithms to a crawler/index. That's half the problem with these large platforms commanding the majority of eyeballs: you search the entire web for something and you get results back via a black box. Alternatives in general are most definitely a good thing.

Knuckleheads' Club at the very least are doing a great job of raising awareness of the potential barriers to entry for alternatives.

That would be a very cool use case for something like STORJ or IPFS.

Imagine if Donald Trump decided that indexing Joe Biden's campaign site was unacceptable.

A mandated singular public cache has potential slippery slopes.

>A mandated singular public cache has potential slippery slopes.

That may be, but it seems like everything has a slippery slope if the wrong person gets into power, or if the public looks the other way out of complacency, ignorance, indifference, etc. It shouldn't stop us evaluating choices on their merits, and there is a lot of merit to entrusting 'core infrastructure' type entities to the government, or at least having an option.

Imagine if Donald Trump decided to tax campaign donations to Joe Biden's campaign at 100%.

I am unconvinced by the "slippery slope" argument being deployed by default to any governmental attempt to combat tech monopolies.

This is an argument against centralization more than it is against government.

"One index to rule them all" seems more fraught with difficulty than, "large cloud providers are unhappy that crawlers on the open web are crawling the open web".

If the impact stopped at "large cloud providers" being unhappy, I think that you're correct. But I think we've seen considerable downstream "difficulty" for the rest of society from search essentially being consolidated into one private actor.

So outlaw web scraping entirely?
