With the rise of AI, web crawlers are suddenly controversial (theverge.com)
90 points by leephillips 4 months ago | 84 comments



> For decades, robots.txt governed the behavior of web crawlers. But as unscrupulous AI companies seek out more and more data, the basic social contract of the web is falling apart.

The basic social contract of the web fell apart long ago when almost everyone decided that Google was the only search engine worth serving and started aggressively blocking other crawlers.


Add to that the removal of public APIs and things like RSS feeds. I would much rather use an API than scrape you, and I will even pay a small fee, but if you don’t provide anything, then you’re getting scraped.


"Add to that the removal of public APIs and things like RSS feeds."

You mean like this public API

https://hacker-news.firebaseio.com/v0/item/39421253.json

Or this RSS feed

https://www.theverge.com/rss/index.xml
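
For what it's worth, here is a minimal sketch (Python standard library only, not any official client) of pulling that HN item; the URL is the one quoted above, and the script just prints whatever JSON the API returns:

  import json
  import urllib.request

  # Public Hacker News item endpoint quoted above.
  url = "https://hacker-news.firebaseio.com/v0/item/39421253.json"

  with urllib.request.urlopen(url, timeout=10) as resp:
      item = json.load(resp)

  # Pretty-print whatever fields the API returned for this item.
  print(json.dumps(item, indent=2))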


Another one. This one has a rate limit.

  curl -si40A "" https://old.reddit.com/new.rss > 1.json

  sed 10q 1.json

   HTTP/1.1 200 OK
   content-length: 40582
   content-type: application/atom+xml; charset=UTF-8
   x-ua-compatible: IE=edge
   x-frame-options: SAMEORIGIN
   x-ratelimit-remaining: 93
   x-ratelimit-used: 3
   x-ratelimit-reset: 96
   x-reddit-pod-ip: 10.104.158.170:80
   x-reddit-internal-ratelimit-rls-type: ip-standard
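
To illustrate those headers, a rough sketch (not anyone's production code, and assuming the feed still answers anonymous requests, which a reply below notes is no longer reliable) of a client honoring the x-ratelimit-* values before polling again:

  import time
  import urllib.request

  FEED = "https://old.reddit.com/new.rss"

  # Empty User-Agent mirrors the curl -A "" invocation above.
  req = urllib.request.Request(FEED, headers={"User-Agent": ""})
  with urllib.request.urlopen(req, timeout=10) as resp:
      body = resp.read()
      remaining = int(float(resp.headers.get("x-ratelimit-remaining", "1")))
      reset = int(float(resp.headers.get("x-ratelimit-reset", "0")))

  if remaining <= 0:
      # Quota exhausted: sleep until the window resets before the next request.
      time.sleep(reset)
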
In theory, APIs for retrieving public information suck because they are designed to be rate-limited or subject to quotas. Whereas, IME, public-facing websites are much less likely to set rate limits and enforce quotas.

I prefer to use the public-facing website instead of "Web APIs" and convert the HTML to SQLite, which I prefer over non-LD JSON. If the public-facing website changes its HTML, I change the code in the filter. This is rare. It is very simple for me to do.
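
Purely as an illustration of that workflow (the actual filter code isn't shown anywhere, so every name here is made up), a stdlib-only sketch that pulls a page, extracts link text with HTMLParser, and writes it into a local SQLite table; the parser subclass is the part you edit when the site's markup changes:

  import sqlite3
  import urllib.request
  from html.parser import HTMLParser

  class LinkFilter(HTMLParser):
      # Collects (href, text) pairs from anchor tags.
      def __init__(self):
          super().__init__()
          self.links = []
          self._href = None
          self._text = []

      def handle_starttag(self, tag, attrs):
          if tag == "a":
              self._href = dict(attrs).get("href")
              self._text = []

      def handle_data(self, data):
          if self._href is not None:
              self._text.append(data)

      def handle_endtag(self, tag):
          if tag == "a" and self._href is not None:
              self.links.append((self._href, "".join(self._text).strip()))
              self._href = None

  def page_to_sqlite(url, db_path="pages.db"):
      # Fetch the public-facing page and run it through the filter.
      html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
      parser = LinkFilter()
      parser.feed(html)
      # Store the extracted rows in SQLite.
      con = sqlite3.connect(db_path)
      con.execute("CREATE TABLE IF NOT EXISTS links (href TEXT, text TEXT)")
      con.executemany("INSERT INTO links VALUES (?, ?)", parser.links)
      con.commit()
      con.close()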


Reddit recently started blocking anonymous GETs to RSS feeds.

And if you use your cookies, they temp-block your account after x requests (something like x every 30 minutes).

And I didn't even care that there was no content on the RSS feed, I just wanted the title / notification so I don't have to use ... (I can't put anything in here as there is no alternative that allows me to track everything).


Are there services that have a copy of the Internet for a fee?

How did perplexity.ai crawl the internet for their AI?


I'm out of the loop on this. What has been happening? Do you get IP blocked? Rate limited?


You do, and since the rise of Cloudflare, you need their approval to create your search engine.


Unfortunately for them, AI is a double-edged sword. AI can trivially scrape content from a screencap. And there are thousands of residential proxies that can’t be defeated even by companies like CF.


If you have residential proxies, you don't need AI to scrape via video, you have the HTML already.


Companies spend thousands of engineering hours building elaborate mechanisms to obfuscate JS, protect APIs, apps, trigger captchas, and hide content, all of it completely powerless against the current gen of AI tools. If a human can read it, a bot can.


Those same companies often invest in accessibility for vision-impaired users. I'm not sure you need a screen capture to scrape content when the site is designed to be navigable with a screen reader.


Surprisingly, ChatGPT is not accessible via screen reader; their captcha is purely image-based.


Is OCR from 30 years ago now "AI"?


"With the rise of AI, photos of the exterior of your business are suddenly controversial"

Many revenue-based websites tried to have it both ways with web crawlers, wherein they wanted to block automated access or repeat visitors while letting first-time visitors get a free taste. Others have noted that basically Google gets a free pass for all the traffic it brings in, but everyone else has to respect robots declarations.

It seems like a no-brainer: if your web server is configured to reply to GET requests with a 200 status and some content, then they get to do pretty much whatever they want with it.

Don't want to give access to everyone? Stop sending your content for free and get them to agree to some contract and authorize/license their access to your stuff.


I don't think this is that cut and dried. It's perfectly reasonable to be fine with your content being linked to in an index like Google Search does, but not fine with the content being read by humans for free or used to train AI.

I'm sure the sites would rather that Google paid them in licensing too, but without changes to our laws that's not going to happen.


Is it also reasonable to serve the same content to regular visitors as you do to Google?

Is it OK to serve a full article to Google and put visitors behind a paywall?


Yes, because Google is only using that article to decide which searches should link to it.


Given two pages that both contain the information the user wants, they prefer the one that isn't behind a paywall, and so they prefer a search engine that puts those results first.

Giving the search engine the full content so it will rank higher is lying, because that information isn't actually there unless you pay, which most users aren't going to do.

In theory you could solve this with a requirement for the site to disclose that it's doing this, so the search engine can have a box that says "exclude paywalled content", but then everybody would check the box.


Google wants you to be happy with your search, and doesn't care at all about journalists. If they thought people would prefer paywalled content gone they could do that anytime.

I agree the option would be nice, but I think you are wrong that everyone wants paywalled articles excluded


> If they thought people would prefer paywalled content gone they could do that anytime.

Which is what they do by penalizing sites for showing something different to Googlebot than the user.

> I agree the option would be nice, but I think you are wrong that everyone wants paywalled articles excluded

What they want is for non-paywalled links that include the relevant information to be listed first, which is de facto equivalent to excluding paywalled links (because they'll be on page 75) in the vast majority of search results.


I want to research a topic, so I do a Google search about it. If 100% of the search results are readable to Google, so that they are indexed, but unreadable to me due to paywalls, the Google search is useless.

I'm not sure what the solution is, but paywalled articles in search results are bad. If they want to be indexed they should have to offer that same indexed content to anyone browsing the index.


I don’t like paywalls either, but in principle paid content is justified, and if one is willing to pay for relevant content, isn’t it better that Google allows one to find it? Maybe Google Search should have a switch “show only non-paywalled results” (paraphrased) (I’m sure they could figure out which content is paywalled if they wanted to), but personally I would probably still prefer seeing which sources exist even if they are paywalled.


I agree, it would be nice for Google to have an option to avoid paywalled articles, or to specify to it which accounts you pay for and allow only those.

Getting all journalism for free isn't sustainable, though.


While I believe in journalism, I'm pretty appalled at the state of modern journalism. There have been a few big fully televised court cases recently, e.g., Depp v. Heard, and I was stunned at how poor the media coverage of them was. As I was intrigued by the legal system, I watched tens of hours of raw footage of witnesses, lawyers and judges, and I was amazed at how watching the raw footage revealed how incredibly biased and superficial the journalism coverage was (on all sides). As this experience showed how untrustworthy newspapers can be, I'm really not aware of any newspaper/journalism (maybe Private Eye? Or Bellingcat?) that is worth reading, let alone paying for.

I guess it might be a catch-22: low quality -> low income -> low quality. But the sadness of such a dynamic does not make me want to pay for an inferior service.


If you're doing serious research you pay for the paywall. It's not unreadable to you, just like a coke isn't undrinkable to you because you have to pay for it.


No, I don't; I disable JavaScript and read what they served Google in the first place.

If I went to a public water fountain and found that someone had turned it into a coca cola dispensing machine, I wouldn't be happy and wouldn't pay to use it.

"Journalists" creating pay walls, using SEO tactics to push their articles into my search results, and then trying to extract rent don't deserve money.


You despise the people writing the content you want to read, at the same time that you are demanding to access their works for free. Do you also work for free for any stranger?


Where in the parent post did the poster say they "despised" the people writing the content?


Calling them "journalists" instead of journalists.


No, I don't want to read their content. I want to find an answer to my search query.

If the search results are full of paywalled articles that claim to have text relevant to my query, but won't show me the article because publishers are trying to extract money from me, the publishers of those articles have made my task harder and shouldn't be rewarded. This is a form of spam.


In this case I think your beef is with Google and not the paywalled sites. A newspaper is going to do whatever it takes to keep the lights on, and if that means forcing people to pay, so be it.

For Google, they have made a product decision about how to treat paywalled content. They don’t care. It hurts the user experience but the days when Google cared about improving their search experience are long gone.


> I want to find an answer to my search query.

And sometimes the answer is behind a paywall. It's not spam at all. On the contrary, spam is always free.


There is always a free source with the answer somewhere. The trouble comes when the free sources are pushed far down in the results by legacy brands.

When this happens, I will continue to pretend to be Google to access the content they are pushing. If publishers want to change this behavior, they could try not letting Google index it, so I don't need to see it in my search results.


Your argument boils down to "I want free stuff", as I see it. Okay, but why in the world should Google care about what you want in the search results then? You do not bring any value and will not bring any future value.

For other users, they see value in having paywalled results if they are the best results, because they do not have a block against paying for content.

If you for example search for a movie on Google, they'll show you paid options to watch it on streaming services or rent it from streaming services. That's good and what should be expected from a search engine.

Paying for stuff is how the world works. If a restaurant boasts about having the nicest steaks, you're not going to get a free steak just to be sure that it's good.

But I really think it is time for a better way to pay for content and articles instead of having to subscribe to each source.


I don't think you understand my argument as you are making a second food analogy (first coke, now steak). Please read my response to your first food analogy as it applies to both food analogies.


Information has always been paid for, whether it's news, books or magazines. If you expect for something to be free just because it's found on a search engine, I don't know where you got that from. I think my examples for music and movies that I've given in this thread are worth considering.

It's like if a friend of yours takes you to a nice Mexican place. Why would you expect to get a burrito al Pastor to eat for free, just because you eat for free when you visit relatives? Nobody said it would be free.


Really, a third food analogy? Is this satire?


It's a bait and switch: you just offered me free Cokes and then let me know that it's only after I sign up for a subscription service. No thanks!


It's your assumption that everything behind a google link ought to be 100% free (ad supported). Other people disagree, and Google does not advertise anywhere that their list is free content only.


> Google does not advertise anywhere that their list is free content only.

Google does advertise that they index based on the same content that's available to anyone viewing the page, and has policies against presenting a different version of the page to their crawler versus what you're showing to visitors.


It's splitting hairs at this point, but anyone visiting the page can view the same content as the crawler – if they pay.

Should Google also stop indexing Facebook, since Facebook puts a login wall for people to access their content? Should YouTube (ie Google) ban movie trailers, since it's just a tease for paywalled movies? The iTunes store let people listen to 30 seconds of a song before purchasing at the paywall. Was that wrong?


> Should Google also stop indexing Facebook, since Facebook puts a login wall for people to access their content?

Yes - I thought they already did? (I know LinkedIn edges around this by putting up a login wall only if you have a cookie showing that you'd logged in previously).

> Should YouTube (ie Google) ban movie trailers, since it's just a tease for paywalled movies? The iTunes store let people listen to 30 seconds of a song before purchasing at the paywall. Was that wrong?

A free sample of a paid thing is fine if everyone knows that's what it is. It's when you bait-and-switch by offering something that seems like it's free to start with that it's a problem. Like imagine showing a movie in the town square and then 10 minutes in you pause it and tell everyone they need to buy a ticket or leave.


> Yes - I thought they already did?

Just checked, Google still indexes Facebook and puts relevant results on top. If you're not logged in you can't continue.


No, it's not bait and switch. A bookstore has an index of the books it sells; that doesn't mean they're free. I expect a high-quality search engine to deliver paid results if they are the best results.

Should Google Maps remove businesses that charge for their products and services from their search results as well?


I wouldn't expect that at all; search engines search the content they have available and proffer it to you. That's the job.

If clicking on the thing does not get me the content I searched for (how am I even certain I'll get it when I pay you?), I would call that result bad.

If you want to charge for stuff, that's great, I recommend it, and if you want to give out a free sample or an index, that's great, but it should be the same for all comers.


I deeply, 100% disagree with you: if you want to show Google something else, that is a spam tactic that only ends in bad outcomes. If your content is good enough to pay for, it's good enough to hide from everyone.

Search results based on content other than what the page actually serves are misleading, a spam tactic, and bad.


There's a difference between "I allow people to access my data" and "I allow people to create products based on my data".


> Stop sending your content for free and get them to agree to some contract and authorize/license their access to your stuff.

Agreed, we need to wrap the public web with DRM immediately. We can't expect companies like OpenAI to waste time worrying about pesky legalities like copyright and content licensing.


I don't think there's a technical solution that would work here. This is a law & enforcement problem


It depends what your goal is.

If you're trying to prevent anyone from getting a copy of what's on your site, you're probably screwed, because technical solutions are hard and legal solutions are only going to be violated by The Internet since it isn't all in your jurisdiction.

If you're just trying to keep them from putting an excessive load on your servers, technical solutions are easy. Just give them an API or some other efficient way to receive the data.


That's what the above comment implies if you read it sarcastically; they might be in agreement with you!


whoops, duh you're right


You're not even sure this is a copyright issue in the US yet, let alone every other country on the planet. So yes, if you don't want it freely consumed, don't freely publish it.


> You're not even sure this is a copyright issue in the US yet…

We are sure that OpenAI is building its business on copyrighted content, under the defense that to do otherwise is "impossible."

The courts have not decided whether this is "fair use". However, the four factors make it pretty clear.


> The courts have not decided whether this is "fair use". However, the four factors make it pretty clear.

It's funny because you say it's "pretty clear" without saying which way you think it goes, which then makes it unclear what you think the result is.


What I think ultimately doesn't matter, but I'm an AI booster and am hopeful that there's a path for GPT's "trainers" (including software engineers) to have their rights legally recognized. Also, I don't want to take away anyone's joy of applying the factors to this use case and imagining how courts might decide.


Before that you'd have to establish that whatever the chatbots are doing is "use" in the sense that "fair use" means it. The act of reading a book or looking at a painting isn't protected by fair use.

If it is a use that falls under fair use, I'm thinking about the Google Books case, where an act that is far less transformative (digital archiving and search) was found to be fair use (https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....)

(It's going to be "who's got more money?", isn't it? Fun fun fun.)


> The act of reading a book or looking at a painting isn't protected by fair use.

This is a very strange way to look at it. If I encode Oppenheimer to MPEG-4, do I now own that IP? In both cases, I'm looking at the source and encoding it to a compressed representation.

A couple notes on "transformative": (1) It's just one of the four factors. (2) It means that you added new expression or meaning to it. By definition, an LLM can't create anything it wasn't trained on — it's pattern recognition, interpolation, and recombination.


When an LLM writes an essay about, say, Oppenheimer, the essay it spits out isn't, word for word, in the training data; it's created a new essay. Following that, asking it to write an essay about Doofenhimer, a fictional contemporary of Oppenheimer that I just invented: what was it trained on for that? To say it can't create anything when, given the depth of the training data, it seems capable of creating things, seems like an underestimation of its abilities, and moreover, of what it actually means to be creative. When I'm being creative, connections are made between completely unrelated things I've been exposed to at some point in my past, which I combine in creative ways to create new things. If it's all just pattern recognition, interpolation, and recombination, then either nothing new can ever be created by humans doing exactly that, or LLMs are also creating new things. Are Salvador Dali's paintings not creative because the idea of clocks and the concept of melting aren't new?

https://chat.openai.com/share/2b9a8245-2063-4722-81e8-20d10c...


Wouldn't that be worse? Then instead of there being some way to get the information (pay them) there's now no way to get it.


I'm in "I just want to watch the world burn" mode over this right now, to be honest. If all that's left on the open internet is SEO spam, and all the chatbots can train on is SEO spam, then so be it. We didn't deserve to have nice things.


Yea the world sucks, tech companies are making it worse, the economy is built to funnel money to whoever can suck the most value out of the commons for themselves, climate change will ruin everything, hug your favorite endangered species while you still can, and the internet is best at spreading lies

I don't see how arguing that it should be harder for journalists to get paid helps anything though


Accelerationism, obvs.

(Except I'm pretty sure that doesn't work either).


Perhaps this can accelerate DRM hacking if multiple billions of startup money are put behind breaking it to stealthily add to the data sets. All the more fun if it's done somewhere extradition-proof from the US.


> For decades, robots.txt governed the behavior of web crawlers.

It never governed anything, web crawlers were never under any obligation to follow robots.txt.
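
Concretely, robots.txt is advisory: a client has to opt in to checking it, and nothing enforces the answer. A minimal sketch with Python's stdlib parser (the crawler name is just an example):

  import urllib.robotparser

  rp = urllib.robotparser.RobotFileParser()
  rp.set_url("https://news.ycombinator.com/robots.txt")
  rp.read()

  # The file only expresses the site owner's wishes; a crawler that never runs
  # this check, or ignores the result, faces no technical barrier at all.
  print(rp.can_fetch("MyCrawler/1.0", "https://news.ycombinator.com/item?id=39421253"))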

This article seems like they took an existing controversy, rebranded it as something new, then blamed it on AI.


No, I think this is something new. It’s an existential problem for newspapers. Most newspapers have never thrived online but they got a stream of clicks and a few 100x stories every once in a while. But with language models and RAG, we may create a world where no one has to visit their site at all.

This really hit home for me in Hard Fork’s interview with the CEO of Perplexity AI [1]. The crux of the problem is, if Perplexity does what it strives to do, provide answers rather than links, no one will be clicking on links and generating revenue for newspapers anymore. I encourage you to watch him struggle with this question because this is really an irreconcilable problem.

[1] https://m.youtube.com/watch?v=AJuv_UwxELA


> they took an existing controversy, rebranded it as something new, then blamed it on <new thing>

Basically describes most of what passes as journalism these days.


> web crawlers were never under any obligation to follow robots.txt.

other than the fact that you could get successfully sued if you do not follow them, e.g., eBay v. Bidder's Edge in 2000


That was overturned, https://en.wikipedia.org/wiki/Intel_Corp._v._Hamidi and White Buffalo Ventures LLC v. University of Texas at Austin


The Internet Archive never follows robots.txt and they're still around


Drama. Crawlers have always been controversial.


> But as unscrupulous AI companies seek out more and more data

I'm not sure I'm ready to concede the fundamental value judgement being made here. At least I refuse to accept it as a given rather than the core issue to be decided.


Many crawlers have always ignored robots.txt; if you're monitoring any moderately visited site, you're bound to see random spikes of bots hammering your server no matter what text file or headers you set.


This. I stopped reading after the first few sentences. Whoever wrote that Verge article is clueless.

Data accessible. Data free. Period.


Maybe it was the same person that put together the Verge PC build guide (not the person presenting it, the person who wrote the script for it)


robots.txt is relevant and effective, as is my DNT header.


When did robots.txt get a legal status?

Or did it ever?


eBay v. Bidder's Edge (2000)


That was overturned, https://en.wikipedia.org/wiki/Intel_Corp._v._Hamidi and White Buffalo Ventures LLC v. University of Texas at Austin


Proxy companies are a big winner now


I don't get it. The crux of it all seems to be that Google isn't competing with the owners of the data it crawls using that very same data. The crawl part isn't as much of a controversy as the usage, is it? The mentioned eBay v. Bidder's Edge (2000) seems to be a dispute over usage.


I mean, Google is making AI models and Google is summarizing sites, so they may well be competing. The difference is Google has a near monopoly on search and ads.


The web comes in two versions. One of them has a basic social contract. Maybe.



