With the rise of AI, web crawlers are suddenly controversial (theverge.com)
90 points by leephillips 4 months ago | 84 comments



> For decades, robots.txt governed the behavior of web crawlers. But as unscrupulous AI companies seek out more and more data, the basic social contract of the web is falling apart.

The basic social contract of the web fell apart long ago when almost everyone decided that Google was the only search engine worth serving and started aggressively blocking other crawlers.


Add to that the removal of public APIs and things like RSS feeds. I would much rather use an API than scrape you, and I will even pay a small fee, but if you don’t provide anything, then you’re getting scraped.


"Add to that the removal of public APIs and things like RSS feeds."

You mean like this public API

https://hacker-news.firebaseio.com/v0/item/39421253.json

Or this RSS feed

https://www.theverge.com/rss/index.xml
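
For what it's worth, here is a minimal sketch (Python standard library only, not any official client) of pulling that HN item; the URL is the one quoted above, and the script just prints whatever JSON the API returns:

  import json
  import urllib.request

  # Public Hacker News item endpoint quoted above.
  url = "https://hacker-news.firebaseio.com/v0/item/39421253.json"

  with urllib.request.urlopen(url, timeout=10) as resp:
      item = json.load(resp)

  # Pretty-print whatever fields the API returned for this item.
  print(json.dumps(item, indent=2))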


Another one. This one has a rate limit.

  curl -si40A "" https://old.reddit.com/new.rss > 1.json

  sed 10q 1.json

   HTTP/1.1 200 OK
   content-length: 40582
   content-type: application/atom+xml; charset=UTF-8
   x-ua-compatible: IE=edge
   x-frame-options: SAMEORIGIN
   x-ratelimit-remaining: 93
   x-ratelimit-used: 3
   x-ratelimit-reset: 96
   x-reddit-pod-ip: 10.104.158.170:80
   x-reddit-internal-ratelimit-rls-type: ip-standard
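
To illustrate those headers, a rough sketch (not anyone's production code, and assuming the feed still answers anonymous requests, which a reply below notes is no longer reliable) of a client honoring the x-ratelimit-* values before polling again:

  import time
  import urllib.request

  FEED = "https://old.reddit.com/new.rss"

  # Empty User-Agent mirrors the curl -A "" invocation above.
  req = urllib.request.Request(FEED, headers={"User-Agent": ""})
  with urllib.request.urlopen(req, timeout=10) as resp:
      body = resp.read()
      remaining = int(float(resp.headers.get("x-ratelimit-remaining", "1")))
      reset = int(float(resp.headers.get("x-ratelimit-reset", "0")))

  if remaining <= 0:
      # Quota exhausted: sleep until the window resets before the next request.
      time.sleep(reset)
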
In theory, APIs for retrieving public information suck because they are designed to be rate-limited or subject to quotas. Whereas, IME, public-facing websites are much less likely to set rate limits and enforce quotas.

I prefer to use the public-facing website instead of "Web APIs" and convert the HTML to SQLite, which I prefer over non-LD JSON. If the public-facing website changes its HTML, I change the code in the filter. This is rare. It is very simple for me to do.
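
Purely as an illustration of that workflow (the actual filter code isn't shown anywhere, so every name here is made up), a stdlib-only sketch that pulls a page, extracts link text with HTMLParser, and writes it into a local SQLite table; the parser subclass is the part you edit when the site's markup changes:

  import sqlite3
  import urllib.request
  from html.parser import HTMLParser

  class LinkFilter(HTMLParser):
      # Collects (href, text) pairs from anchor tags.
      def __init__(self):
          super().__init__()
          self.links = []
          self._href = None
          self._text = []

      def handle_starttag(self, tag, attrs):
          if tag == "a":
              self._href = dict(attrs).get("href")
              self._text = []

      def handle_data(self, data):
          if self._href is not None:
              self._text.append(data)

      def handle_endtag(self, tag):
          if tag == "a" and self._href is not None:
              self.links.append((self._href, "".join(self._text).strip()))
              self._href = None

  def page_to_sqlite(url, db_path="pages.db"):
      # Fetch the public-facing page and run it through the filter.
      html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
      parser = LinkFilter()
      parser.feed(html)
      # Store the extracted rows in SQLite.
      con = sqlite3.connect(db_path)
      con.execute("CREATE TABLE IF NOT EXISTS links (href TEXT, text TEXT)")
      con.executemany("INSERT INTO links VALUES (?, ?)", parser.links)
      con.commit()
      con.close()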


Reddit recently started blocking anonymous GETs to RSS feeds.

And if you use your cookies, they temp-block your account after x requests (something like x every 30 minutes).

And I didn't even care that there was no content on the RSS feed, I just wanted the title / notification so I don't have to use ... (I can't put anything in here as there is no alternative that allows me to track everything).


Are there services that have a copy of the Internet for a fee?

How did perplexity.ai crawl the internet for their AI?


I'm out of the loop on this. What has been happening? Do you get IP blocked? Rate limited?


You do, and since the rise of Cloudflare, you need their approval to create your search engine.


Unfortunately for them, AI is a double-edged sword. AI can trivially scrape content from a screencap. And there are thousands of residential proxies that can’t be defeated even by companies like CF.


If you have residential proxies, you don't need AI to scrape via video, you have the HTML already.


Companies spend thousands of engineering hours building elaborate mechanisms to obfuscate JS, protect APIs, apps, trigger captchas, and hide content, all of it completely powerless against the current gen of AI tools. If a human can read it, a bot can.


Those same companies often invest in accessibility for vision-impaired users. I'm not sure you need a screen capture to scrape content when the site is designed to be navigable with a screen reader.


Surprisingly, ChatGPT is not accessible via screen reader; their captcha is purely image-based.


Is OCR from 30 years ago now "AI"?


"With the rise of AI, photos of the exterior of your business are suddenly controversial"

Many revenue-based websites tried to have it both ways with web crawlers, wherein they wanted to block automated access or repeat visitors while letting first-time visitors get a free taste. Others have noted that basically Google gets a free pass for all the traffic it brings in, but everyone else has to respect robots declarations.

It seems like a no-brainer: if your web server is configured to reply to GET requests with a 200 status and some content, then they get to do pretty much whatever they want with it.

Don't want to give access to everyone? Stop sending your content for free and get them to agree to some contract and authorize/license their access to your stuff.


I don't think this is that cut and dried. It's perfectly reasonable to be fine with your content being linked to in an index like Google Search does, but not fine with the content being read by humans for free or used to train AI.

I'm sure the sites would rather that Google paid them in licensing too, but without changes to our laws that's not going to happen.


Is it also reasonable to serve the same content to regular visitors as you do to Google?

Is it OK to serve a full article to Google and put visitors behind a paywall?


Yes, because Google is only using that article to decide which searches should link to it.


Given two pages that both contain the information the user wants, they prefer the one that isn't behind a paywall, and so they prefer a search engine that puts those results first.

Giving the search engine the full content so it will rank higher is lying, because that information isn't actually there unless you pay, which most users aren't going to do.

In theory you could solve this with a requirement for the site to disclose that it's doing this, so the search engine can have a box that says "exclude paywalled content", but then everybody would check the box.


Google wants you to be happy with your search, and doesn't care at all about journalists. If they thought people would prefer paywalled content gone they could do that anytime.

I agree the option would be nice, but I think you are wrong that everyone wants paywalled articles excluded


> If they thought people would prefer paywalled content gone they could do that anytime.

Which is what they do by penalizing sites for showing something different to Googlebot than the user.

> I agree the option would be nice, but I think you are wrong that everyone wants paywalled articles excluded

What they want is for non-paywalled links that include the relevant information to be listed first, which is de facto equivalent to excluding paywalled links (because they'll be on page 75) in the vast majority of search results.


I want to research a topic, so I do a Google search about it. If 100% of the search results are readable to Google, so that they are indexed, but unreadable to me due to paywalls, the Google search is useless.

I'm not sure what the solution is, but paywalled articles in search results are bad. If they want to be indexed they should have to offer that same indexed content to anyone browsing the index.


I don’t like paywalls either, but in principle paid content is justified, and if one is willing to pay for relevant content, isn’t it better that Google allows one to find it? Maybe Google Search should have a switch “show only non-paywalled results” (paraphrased) (I’m sure they could figure out which content is paywalled if they wanted to), but personally I would probably still prefer seeing which sources exist even if they are paywalled.


I agree, it would be nice for Google to have an option to avoid paywalled articles, or to specify to it which accounts you pay for and allow only those.

Getting all journalism for free isn't sustainable, though.


While I believe in journalism, I'm pretty appalled at the state of modern journalism. There have been a few big fully televised court cases recently, e.g., Depp v. Heard, and I was stunned at how poor the media coverage of them was. As I was intrigued by the legal system, I watched tens of hours of raw footage of witnesses, lawyers and judges, and I was amazed at how watching the raw footage revealed how incredibly biased and superficial the journalism coverage was (on all sides). As this experience showed how untrustworthy newspapers can be, I'm really not aware of any newspaper/journalism (maybe Private Eye? Or Bellingcat?) that is worth reading, let alone paying for.

I guess it might be a catch-22: low quality -> low income -> low quality. But the sadness of such a dynamic does not make me want to pay for an inferior service.


If you're doing serious research you pay for the paywall. It's not unreadable to you, just like a coke isn't undrinkable to you because you have to pay for it.


No, I don't; I disable JavaScript and read what they served Google in the first place.

If I went to a public water fountain and found that someone had turned it into a coca cola dispensing machine, I wouldn't be happy and wouldn't pay to use it.

"Journalists" creating pay walls, using SEO tactics to push their articles into my search results, and then trying to extract rent don't deserve money.


You despise the people writing the content you want to read, at the same time that you are demanding to access their works for free. Do you also work for free for any stranger?


Where in the parent post did the poster say they "despised" the people writing the content?


Calling them "journalists" instead of journalists.


No, I don't want to read their content. I want to find an answer to my search query.

If the search results are full of paywalled articles that claim to have text relevant to my query, but won't show me the article because publishers are trying to extract money from me, the publishers of those articles have made my task harder and shouldn't be rewarded. This is a form of spam.


In this case I think your beef is with Google and not the paywalled sites. A newspaper is going to do whatever it takes to keep the lights on, and if that means forcing people to pay, so be it.

For Google, they have made a product decision about how to treat paywalled content. They don’t care. It hurts the user experience but the days when Google cared about improving their search experience are long gone.


> I want to find an answer to my search query.

And sometimes the answer is behind a paywall. It's not spam at all. On the contrary, spam is always free.


There is always a free source with the answer somewhere. The trouble comes when the free sources are pushed far down in the results by legacy brands.

When this happens, I will continue to pretend to be Google to access the content they are pushing. If publishers want to change this behavior, they could try not letting Google index it, so I don't need to see it in my search results.


Your argument boils down to "I want free stuff", as I see it. Okay, but why in the world should Google care about what you want in the search results then? You do not bring any value and will not bring any future value.

For other users, they see value in having paywalled results if they are the best results, because they do not have a block against paying for content.

If you for example search for a movie on Google, they'll show you paid options to watch it on streaming services or rent it from streaming services. That's good and what should be expected from a search engine.

Paying for stuff is how the world works. If a restaurant boasts about having the nicest steaks, you're not going to get a free steak just to be sure that it's good.

But I really think it is time for a better way to pay for content and articles instead of having to subscribe to each source.


I don't think you understand my argument as you are making a second food analogy (first coke, now steak). Please read my response to your first food analogy as it applies to both food analogies.


Information has always been paid for, whether it's news, books or magazines. If you expect for something to be free just because it's found on a search engine, I don't know where you got that from. I think my examples for music and movies that I've given in this thread are worth considering.

It's like if a friend of yours takes you to a nice Mexican place. Why would you expect to get a burrito al Pastor to eat for free, just because you eat for free when you visit relatives? Nobody said it would be free.


Really, a third food analogy? Is this satire?


It's a bait and switch: you just offered me free Cokes and then let me know that it's only after I sign up for a subscription service. No thanks!


It's your assumption that everything behind a google link ought to be 100% free (ad supported). Other people disagree, and Google does not advertise anywhere that their list is free content only.


> Google does not advertise anywhere that their list is free content only.

Google does advertise that they index based on the same content that's available to anyone viewing the page, and has policies against presenting a different version of the page to their crawler versus what you're showing to visitors.


It's splitting hairs at this point, but anyone visiting the page can view the same content as the crawler – if they pay.

Should Google also stop indexing Facebook, since Facebook puts a login wall for people to access their content? Should YouTube (ie Google) ban movie trailers, since it's just a tease for paywalled movies? The iTunes store let people listen to 30 seconds of a song before purchasing at the paywall. Was that wrong?


> Should Google also stop indexing Facebook, since Facebook puts a login wall for people to access their content?

Yes - I thought they already did? (I know LinkedIn edges around this by putting up a login wall only if you have a cookie showing that you'd logged in previously).

> Should YouTube (ie Google) ban movie trailers, since it's just a tease for paywalled movies? The iTunes store let people listen to 30 seconds of a song before purchasing at the paywall. Was that wrong?

A free sample of a paid thing is fine if everyone knows that's what it is. It's when you bait-and-switch by offering something that seems like it's free to start with that it's a problem. Like imagine showing a movie in the town square and then 10 minutes in you pause it and tell everyone they need to buy a ticket or leave.


> Yes - I thought they already did?

Just checked, Google still indexes Facebook and puts relevant results on top. If you're not logged in you can't continue.


No, it's not bait and switch. A bookstore has an index of the books it sells; that doesn't mean they're free. I expect a high-quality search engine to deliver paid results if they are the best results.

Should Google Maps remove businesses that charge for their products and services from their search results as well?


I wouldn't expect that at all; search engines search the content they have available and proffer it to you. That's the job.

If clicking on the thing does not get me the content I searched for (how am I even certain I'll get it when I pay you?), I would call that result bad.

If you want to charge for stuff, that's great, I recommend it, and if you want to give out a free sample or an index, that's great, but it should be the same for all comers.


I deeply, 100% disagree with you: if you want to show Google something else, that is a spam tactic that only ends in bad outcomes. If your content is good enough to pay for, it's good enough to hide from everyone.

Search results based on content other than what the page actually serves are misleading, a spam tactic, and bad.


There's a difference between "I allow people to access my data" and "I allow people to create products based on my data".


> Stop sending your content for free and get them to agree to some contract and authorize/license their access to your stuff.

Agreed, we need to wrap the public web with DRM immediately. We can't expect companies like OpenAI to waste time worrying about pesky legalities like copyright and content licensing.


I don't think there's a technical solution that would work here. This is a law & enforcement problem


It depends what your goal is.

If you're trying to prevent anyone from getting a copy of what's on your site, you're probably screwed, because technical solutions are hard and legal solutions are only going to be violated by The Internet since it isn't all in your jurisdiction.

If you're just trying to keep them from putting an excessive load on your servers, technical solutions are easy. Just give them an API or some other efficient way to receive the data.


That's what the above comment implies if you read it sarcastically; they might be in agreement with you!


whoops, duh you're right


You're not even sure this is a copyright issue in the US yet, let alone every other country on the planet. So yes, if you don't want it freely consumed, don't freely publish it.


> You're not even sure this is a copyright issue in the US yet…

We are sure that OpenAI is building its business on copyrighted content, under the defense that to do otherwise is "impossible."

The courts have not decided whether this is "fair use". However, the four factors make it pretty clear.


> The courts have not decided whether this is "fair use". However, the four factors make it pretty clear.

It's funny because you say it's "pretty clear" without saying which way you think it goes, which then makes it unclear what you think the result is.


What I think ultimately doesn't matter, but I'm an AI booster and am hopeful that there's a path for GPT's "trainers" (including software engineers) to have their rights legally recognized. Also, I don't want to take away anyone's joy of applying the factors to this use case and imagining how courts might decide.


Before that you'd have to establish that whatever the chatbots are doing is "use" in the sense that "fair use" means it. The act of reading a book or looking at a painting isn't protected by fair use.

If it is a use that falls under fair use, I'm thinking about the Google Books case, where an act that is far less transformative (digital archiving and search) was found to be fair use (https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....)

(It's going to be "who's got more money?", isn't it? Fun fun fun.)


> The act of reading a book or looking at a painting isn't protected by fair use.

This is a very strange way to look at it. If I encode Oppenheimer to MPEG-4, do I now own that IP? In both cases, I'm looking at the source and encoding it to a compressed representation.

A couple notes on "transformative": (1) It's just one of the four factors. (2) It means that you added new expression or meaning to it. By definition, an LLM can't create anything it wasn't trained on — it's pattern recognition, interpolation, and recombination.


When an LLM writes an essay about, say, Oppenheimer, the essay it spits out isn't, word for word, in the training data; it's created a new essay. Following that, asking it to write an essay about Doofenhimer, a fictional contemporary of Oppenheimer that I just invented: what was it trained on for that? To say it can't create anything when, given the depth of the training data, it seems capable of creating things, seems like an underestimation of its abilities, and moreover, of what it actually means to be creative. When I'm being creative, connections are made between completely unrelated things I've been exposed to at some point in my past, which I combine in creative ways to create new things. If it's all just pattern recognition, interpolation, and recombination, then either nothing new can ever be created by humans doing exactly that, or LLMs are also creating new things. Are Salvador Dali's paintings not creative because the idea of clocks and the concept of melting aren't new?

https://chat.openai.com/share/2b9a8245-2063-4722-81e8-20d10c...


Wouldn't that be worse? Then instead of there being some way to get the information (pay them) there's now no way to get it.


I'm in "I just want to watch the world burn" mode over this right now, to be honest. If all that's left on the open internet is SEO spam, and all the chatbots can train on is SEO spam, then so be it. We didn't deserve to have nice things.


Yea the world sucks, tech companies are making it worse, the economy is built to funnel money to whoever can suck the most value out of the commons for themselves, climate change will ruin everything, hug your favorite endangered species while you still can, and the internet is best at spreading lies

I don't see how arguing that it should be harder for journalists to get paid helps anything though


Accelerationism, obvs.

(Except I'm pretty sure that doesn't work either).


Perhaps this can accelerate DRM hacking if multiple billions of startup money are put behind breaking it to stealthily add to the data sets. All the more fun if it's done somewhere extradition-proof from the US.


> For decades, robots.txt governed the behavior of web crawlers.

It never governed anything, web crawlers were never under any obligation to follow robots.txt.
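
Concretely, robots.txt is advisory: a client has to opt in to checking it, and nothing enforces the answer. A minimal sketch with Python's stdlib parser (the crawler name is just an example):

  import urllib.robotparser

  rp = urllib.robotparser.RobotFileParser()
  rp.set_url("https://news.ycombinator.com/robots.txt")
  rp.read()

  # The file only expresses the site owner's wishes; a crawler that never runs
  # this check, or ignores the result, faces no technical barrier at all.
  print(rp.can_fetch("MyCrawler/1.0", "https://news.ycombinator.com/item?id=39421253"))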

This article seems like they took an existing controversy, rebranded it as something new, then blamed it on AI.


No, I think this is something new. It’s an existential problem for newspapers. Most newspapers have never thrived online but they got a stream of clicks and a few 100x stories every once in a while. But with language models and RAG, we may create a world where no one has to visit their site at all.

This really hit home for me in Hard Fork’s interview with the CEO of Perplexity AI [1]. The crux of the problem is, if Perplexity does what it strives to do, provide answers rather than links, no one will be clicking on links and generating revenue for newspapers anymore. I encourage you to watch him struggle with this question because this is really an irreconcilable problem.

[1] https://m.youtube.com/watch?v=AJuv_UwxELA


> they took an existing controversy, rebranded it as something new, then blamed it on <new thing>

Basically describes most of what passes as journalism these days.


> web crawlers were never under any obligation to follow robots.txt.

other than the fact that you could get successfully sued if you do not follow them, e.g., eBay v. Bidder's Edge in 2000


That was overturned, https://en.wikipedia.org/wiki/Intel_Corp._v._Hamidi and White Buffalo Ventures LLC v. University of Texas at Austin


The Internet Archive never follows robots.txt and they're still around


Drama. Crawlers have always been controversial.


> But as unscrupulous AI companies seek out more and more data

I'm not sure I'm ready to concede the fundamental value judgement being made here. At least I refuse to accept it as a given rather than the core issue to be decided.


Many crawlers have always ignored robots.txt; if you're monitoring any moderately visited site, you're bound to see random spikes of bots hammering your server no matter what text file or headers you set.


This. I stopped reading after the first few sentences. Whoever wrote that Verge article is clueless.

Data accessible. Data free. Period.


Maybe it was the same person that put together the Verge PC build guide (not the person presenting it, the person who wrote the script for it)


robots.txt is relevant and effective, as is my DNT header.


When did robots.txt get a legal status?

Or did it ever?


eBay v. Bidder's Edge (2000)


That was overturned, https://en.wikipedia.org/wiki/Intel_Corp._v._Hamidi and White Buffalo Ventures LLC v. University of Texas at Austin


Proxy companies are a big winner now


I don't get it. The crux of it all seems to be that Google isn't competing with the owners of the data it crawls using that very same data. The crawl part isn't as much of a controversy as the usage, is it? The mentioned eBay v. Bidder's Edge (2000) seems to be a dispute over usage.


I mean, Google is making AI models and Google is summarizing sites, so they may well be competing. The difference is Google has a near monopoly on search and ads.


The web comes in two versions. One of them has a basic social contract. Maybe.



