AI crawlers need to be more respectful (readthedocs.com)
226 points by pneff 44 days ago | 118 comments



I'm curious if there's a point where a crawler is misconfigured so badly that it becomes a violation of the CFAA by nature of recklessness.

They say a single crawler downloaded 73TB of zipped HTML files in a month. That averages out to ~29 MB/s of traffic, every second, for an entire month.

Averaging 30 megabytes a second of traffic for a month is crossing into reckless territory. I don't think any sane engineer would call that normal or healthy for scraping a site like ReadTheDocs; Twitter/Facebook/LinkedIn/etc, sure, but not ReadTheDocs.
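
A quick back-of-envelope check in Python (a sketch; whether you get ~27 or ~30 MB/s depends on assuming decimal TB vs. TiB and a 30- vs. 31-day month — the 10 TB single-day peak is from the article quoted below):

    # Sanity check on the figures above; assumes decimal terabytes and 31 days.
    TB = 10**12

    month_avg_mb_s = 73 * TB / (31 * 24 * 3600) / 10**6
    print(f"73 TB/month ~= {month_avg_mb_s:.0f} MB/s sustained")   # ~27 MB/s

    peak_gbit_s = 10 * TB * 8 / (24 * 3600) / 10**9
    print(f"10 TB/day   ~= {peak_gbit_s:.2f} Gbit/s sustained")    # ~0.93 Gbit/s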

To me, this crosses into "recklessly negligent" territory, and I think should come with government fines for the company that did it. Scraping is totally fine to me, but it needs to be done either a) at a pace that will not impact the provider (read: slowly), or b) with some kind of prior agreement that the provider is accepting responsibility to provide enough capacity.

While I agree that putting content out into the public means it can be scraped, I don't think that necessarily implies scrapers can do their thing at whatever rate they want. As a provider, there's very little difference to me between getting DDoSed and getting scraped to death; both ruin the experience for users.


"One crawler downloaded 73 TB of zipped HTML files in May 2024, with almost 10 TB in a single day. This cost us over $5,000 in bandwidth charges, and we had to block the crawler."

Wow. That's some seriously disrespectful crawling.


10TB/day is, roughly, a single saturated 1Gbit link. In technical bandwidth terms, that is the square root of fuck all.

The crazy thing here is that the target site is content to pay ~$700 for the volume of traffic that you can move through a single teeny-tiny included-at-no-extra-charge cat5 Ethernet link in one single day. And apparently, they're going to continue doing so.


Yeah, that's an insane price for bandwidth; they need to move providers ASAP if those are the fees they're paying.


Hosting documentation shouldn't need that much bandwidth. It's text and zip files full of text. Without bots, that's a very small cost even if the bytes are relatively costly.


Respect to them for not naming names, that's the classy move, but I wouldn't blame them if they did.


Why is it "classy" to not name names when a business (which likely holds it self out there as reputable) behaves badly, especially when it behaves in a way that costs you money? Everyone is so vague and coy. These companies are being abusive and reckless. Name and shame!


In the article, they mention that they are working with the crawling company to be reimbursed for the download costs.

Naming and shaming the company while you're trying to work with them is a real good way to not get what you want.


Naming names isn't really required. The hosts have a $5,000 bandwidth fee, but so do the consumers. There are maybe 10 companies with the financial and compute resources to let a $5,000-per-month-per-website bug run rampant before taking the harvesting service offline.

Meta/Google/Whoever may benefit from economies of scale, so they're not seeing the full $5,000 on their side, but they're hitting tens of thousands of sites with that crawler.


You know you can hit the data rate they were complaining about by using a residential fiber connection, right? 10 TB per day is about 1 Gigabit continuous if I’m not mistaken. There are probably millions of people who could do this if they wanted to.


There are millions of people who could do that to an individual website. There are remarkably few organisations who could do that simultaneously across the top 100,000 or so sites on the internet, which is how readthedocs has encountered this issue.


I thought data egress was much more expensive than downloading it (ingress)?


It is 100% likely a cloud provider IP range.

They are a persistent source of spam email servers, scrapers, and bot probes.

The simple reason is that the operators quickly dump a host, and the next user is left wondering why their legitimate site instantly lands on spam ban lists.

It is the degenerative nature of cloud services... and unsurprisingly we end up often banning most parts of Digital Ocean, Amazon, Azure, Google, and Baidu.
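
A minimal sketch of one way to automate that kind of check, using AWS as the example since Amazon publishes its IP ranges as JSON; other providers publish similar (differently formatted) lists. This is illustrative, not the poster's actual setup:

    # Check whether a client IP falls inside a published cloud-provider range.
    import json
    import ipaddress
    from urllib.request import urlopen

    AWS_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

    def load_aws_networks():
        with urlopen(AWS_RANGES_URL) as resp:
            data = json.load(resp)
        return [ipaddress.ip_network(p["ip_prefix"]) for p in data["prefixes"]]

    def is_cloud_ip(ip, networks):
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in networks)

    networks = load_aws_networks()
    print(is_cloud_ip("203.0.113.7", networks))  # False: documentation range, not AWS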

Have a wonderful day, =3


It's the degenerative nature of assuming an IP corresponds to a user. They have not corresponded to users for over a decade. I once discovered I'm banned on my mobile phone connection from at least one app which doesn't know that CGNAT exists (a very poor assumption for mobile phone apps in particular). If you must block IPs, do it as a last resort, make it based on some observable behavior, quickly instated when that behavior occurs, and quickly uninstated when it does not.


Really depends on the use-case, but yeah the response happens in a proportional manner.

We also follow the tit-for-tat forgiveness policy to ensure old bans are given a second chance. Mostly, we want the nuisance to sometimes randomly work, as it wastes more of their time fixing bugs.

And note, if a server is compromised and persistently causing a problem... we won't hesitate to black hole an entire country along with the active Tor exit nodes and known proxies lists (the hidden feature in context cookies).

Have a great day friend, =3


Why are you ending all your messages with =3 ?



Read that before, read it again to make sure I didn't miss anything. The lack of clarity here is disappointing and raises more questions than it answers.


Don't worry about it friend =3


What I'm getting from this is that you hate having users almost as much as Reddit (which enshittified their website and banned all non-shit mobile apps and all search engines other than Google).


Imagine a world where people walk into your business with a mask over their face saying horribly abusive things... while pretending they are your neighbors... And poof... they automatically vanish along with their garbage content.

They may visit again, but are less likely to mess with the platform. Note, cons never buy anything... ever... it is against their temperament.

I find it interesting that several of the cons on YC are upset by someone else's administrative policies. Reddit should enforce these policies too, or at least drop a country or pirate flag icon beside nasty posts...

Have a great day, and don't fear the ban hammer friend =3


What does the term “cons” mean here?


In this context, a few users are scraping YC profile information to attempt various scam/wire-fraud schemes against community members.

1. masquerading as YC associated organizations

2. attempting to extract banking information with classic 419'er tactics

3. spamming member contact routes with several phone and email Grifts

These folks appear to be operating on a 4 month cycle, and redistributing target leads to other fraudsters.

Annoying, but hardly a problem limited to YC if that is your concern.

Best of luck, =3


$5,000 for 73 TiB seems excessive though? Some EU cloud providers would price it 10x cheaper.


Hetzner is $1 per TB and that's only if they decide your overall usage is excessive and needs to be billed for.


The more interesting thing to me is that the crawler didn't detect on its own that it had racked up 10 TB from one site in one day.

If I were designing a crawler, I'd keep at least some basic form of tracking, if only to check for people deliberately trolling me by delivering an infinite chain of garbage.


I've been running guthib.mattbasta.workers.dev for years, and in the past few months it's hit the free limit for requests every day. Some AI company is filling their corpus with exactly that: infinite garbage.


That's also $73 worth of bandwidth on another server host. Please stop using extremely expensive hosts and then blaming other people for the consequences of your decision.


At that scale, they could just ship some USB disks and prepaid return postage.


Try page rate-limiting (6 hits a minute is plenty for a human), and then pop up a captcha.

If they keep hitting the limit 4+ times within an hour, then get fail2ban to block the IP for 2 days.
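
A rough sketch of that policy in Python (in-memory only, with made-up names; a real deployment would do this in the web server plus fail2ban):

    import time
    from collections import defaultdict, deque

    HITS_PER_MINUTE = 6
    VIOLATIONS_PER_HOUR = 4
    BAN_SECONDS = 2 * 24 * 3600  # 2 days

    hits = defaultdict(deque)        # ip -> timestamps of recent requests
    violations = defaultdict(deque)  # ip -> timestamps of rate-limit trips
    banned_until = {}                # ip -> unix time the ban expires

    def check_request(ip, now=None):
        """Return 'ok', 'captcha', or 'banned' for this request."""
        now = now or time.time()
        if banned_until.get(ip, 0) > now:
            return "banned"

        window = hits[ip]
        window.append(now)
        while window and window[0] < now - 60:
            window.popleft()
        if len(window) <= HITS_PER_MINUTE:
            return "ok"

        trips = violations[ip]
        trips.append(now)
        while trips and trips[0] < now - 3600:
            trips.popleft()
        if len(trips) >= VIOLATIONS_PER_HOUR:
            banned_until[ip] = now + BAN_SECONDS
            return "banned"
        return "captcha"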

73TB is a fair amount to have on a cloud... usually at >30TiB firms must diversify with un-metered server racks, and CDN providers (traditional for large media files etc.)

Good luck =3


When the bots come from cloud services, the IPs are all over the place: it's much harder to do right these days.


Rate limiting firewalls and spider traps also work well...

There is page referral monitoring, context cookies, and dynamically created link chaff with depth-charge rules.

One can dance all day friend, but a black hole is coming for the entire IP block shortly. And unlike many people, some never remove a provider's leased block and route until it is re-sold. =3


I mean, blocking random IP ranges is your prerogative if you don't want to have customers. The scrapers will find ways around, while actual users will be unable to use your site. Residential proxies are something like $5 for 1000.


True, note domestic ISP IP ranges are published, and unless one deals internationally... don't bother serving people that will never buy anything from your firm anyways.

Domestic "Users" functioning as proxies will be tripping usage limits, and getting temporarily banned. Google does this by the way, try hammering their services and find out what happens.

Context cookies also immediately flag egregious multi-user routes, and if it is an ISP IP you know it's a problem user. If it is over 15 users an hour per IP, then you can be 100% sure it's a Tor proxy.

We ban over 243,000 IPs, and have seen zero impact on our bottom line.

Have a nice day, =)


Those of us running sites for public information rather than sales cannot make the simple cut-off that you do.

And again, none of this is simple. It has taken me a few weeks to establish a usage mechanism that does catch the worst feed pullers, but it still can hurt legit new users. That is an opportunity cost.


One must assume most user IP edge proxies are compromised hosts. If someone paid for that list they were almost certainly conned, as the black hats regularly publish that content on their forums. These folks want as many users as possible in order to hide their nuisance traffic origin in the traffic noise.

Allowing users known to have an active RAT or their "proxy friends" on a commercial site is not helping anyone.... especially the victims.

https://www.youtube.com/watch?v=aCbfMkh940Q

Worth studying the problem from time to time when you get bored of the antics.

These folks are generally uninterested in positively contributing to any community, but rather show up to cause trouble for fun and profit.

User API quotas are popular for a reason. =3


Exactly, I'm banned or captchaed from half of all web sites these days, because of the AI.


Try updating your web browser, as sites often flag outdated user agent strings hard-coded in many bots/spiders.

Have a nice day, =3


I am sure that bots/spiders use the very best and latest and least suspicious user agent strings.


Some bad actors do indeed use ancient UA strings.


So do some users. Thanks for banning me.


The captcha trigger events on many sites will often keep nagging/blocking people till they update.

Don't take this trend personally friend, as if we see a fake iPhone sporting bandwidth >1300 Mbps... then the host is getting permanently banned anyway.

Have a wonderful day, =3


6 hits a minute? Are you joking?

Click around a few times on any of your sites and it looks like I'll be banned?

Multi tasking? Opening multiple interesting links?

Like what.


Generally for sites:

1. Gets incrementally slower until firewall per-user rate-limiting tokens refill the bucket (chokes >6 MiB/min bandwidth use and enforces abnormal-traffic ban rules; see the token-bucket sketch after this list)

2. Pauses serving a page if you spider through 6+ pages a minute (chokes speculative downloading)

3. If you violate site usage rules 4+ times in the past hour, then you get a 2-day IP ban

4. If you trip a spider trap, then you get a 5-day ban

5. If you are issued more than 5 context cookies, then the IP will get spammed with a captcha on every page for 5 days

6. If you violate any number of additional signatures (shodan etc.), then you get your IP block and route permanently banned. There is only 1 exception to this rule, and we don't share that with anyone.

7. The site content navigation is programmatically generated in JavaScript

8. The legal notice is very real for some people
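
For item 1, a minimal token-bucket sketch (the class name and thresholds are illustrative, not the poster's actual code):

    import time

    class ByteBucket:
        def __init__(self, rate_bytes_per_s=6 * 1024 * 1024 / 60, burst=6 * 1024 * 1024):
            self.rate = rate_bytes_per_s   # ~6 MiB of budget earned per minute
            self.capacity = burst
            self.tokens = burst
            self.last = time.monotonic()

        def allow(self, response_bytes):
            """Spend tokens for a response; False means throttle or captcha."""
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if response_bytes <= self.tokens:
                self.tokens -= response_bytes
                return True
            return False

    bucket = ByteBucket()
    print(bucket.allow(2 * 1024 * 1024))  # True: well within the 6 MiB budget
    print(bucket.allow(5 * 1024 * 1024))  # False: bucket exhausted, choke this client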

Have a nice day friend, =)


Who is "we"?


Plural of the deployment team admins.

Don't worry about it friend =3


> 73TB is a fair amount to have on a cloud

From the article:

"This was a bug in their crawler that was causing it to download the same files over and over again."


It must be a CDN choice issue, as most sites limit per-IP daily file downloads, or have an account login with a quota limit.

People hitting the same files again sounds like a developer testing their code.


Not just AI: here is my current side-quest: https://www.earth.org.uk/RSS-efficiency.html

Over 99% of the bandwidth (and CPU) taken by the biggest podcast / music services simply on polling feeds is completely unnecessary. But of course pointing this out to them gets some sort of "oh this is normal, we don't care" response, because they are big enough to know that e.g. podcasters need them.


I run pinecast.com. If there were a leaderboard for HN users serving XML, I'd almost certainly be in the top five.

I don't disagree with your post. But: RSS downloads are at an all time low, and that's a bad thing.

They're at an all time low because Spotify and Apple both fetch feeds from centralized servers. 1000 subscribers no longer means 24000ish daily feed fetches, it means 48. With keep alive or H2, these services simply don't reconnect. The number of IPs that hit me from Apple, for instance, is probably only double digits.

Since Apple and Spotify both sit between me and the listeners, they eliminate the privacy that listeners would otherwise enjoy. It also forces podcasters to go to them to find out how many people are subscribed, which means lots of big databases instead of one database that I host for my customers.

Centralization of feed checking carries huge risks, in my opinion, especially as both Apple and Spotify make moves to also become the hosting providers.


Do you have a CDN between you and Apple / Spotify? Because if you do I think that Apple/Spotify are polling that CDN every few minutes and the CDN is having its bandwidth wasted invisibly, but presumably priced in.

Also I agree that the re-centralisation is a bad thing, mainly.

(I'd like to move to email to discuss this further, if possible: I have an arXiv paper to write!)


The data I'm giving you is based on logs from the CDN. Most feeds are checked by Apple and Spotify every hour, but usually it's less frequently rather than more: shows that haven't been published to in a year or more might see very infrequent feed checks.


Can you insert a podcast item that says "you are spamming our server - please stop"?


Not easily on a static server and not without the risk of annoying an actual real human listener!

If you look at the "Defences" section: https://www.earth.org.uk/RSS-efficiency.html#Hints you'll see there are some things that can be done, such as randomly rejecting a large fraction of requests that don't allow compression (gzip is madly effective on many feed files: it's rude for a client not to allow it). But all these measures take effort to set up, and don't stop the bad bots making the request hundreds of times too often. Just responding to each stupid request forces a flurry of packets, wakes up the CPU, and burns cycles...
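
For illustration only, a toy version of the "reject compression-refusing clients" defence (the probability and function name are made up; this is not the site's actual implementation):

    import random

    REJECT_PROBABILITY = 0.8  # illustrative; tune to taste

    def handle_feed_request(headers):
        """Return an HTTP status for a feed request, given its headers."""
        accept_encoding = headers.get("Accept-Encoding", "").lower()
        if "gzip" not in accept_encoding and random.random() < REJECT_PROBABILITY:
            return 429  # Too Many Requests; polite clients back off
        return 200

    print(handle_feed_request({"Accept-Encoding": "gzip, deflate"}))  # 200
    print(handle_feed_request({}))  # 429 most of the time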


HellPot: https://github.com/yunginnanet/HellPot

> Clients (hopefully bots) that disregard robots.txt and connect to your instance of HellPot will suffer eternal consequences. HellPot will send an infinite stream of data that is just close enough to being a real website that they might just stick around until their soul is ripped apart and they cease to exist.


This doesn't solve their bandwidth costs, which are their real problem with these bots.


My 100 mbps upload bandwidth at home is free (apart from the monthly 35€ payment). Useless bots will get stuck downloading from me instead of hogging readthedocs.


I think your ISP will cut you off long before they stop pulling content from you


I blocked Microsoft/OpenAI a few weeks ago for (semi) childish reasons. Seven months later, Bing still refuses to index my blog, despite scraping it daily. The AI scrapers and crawlers toggle on Cloudflare did the trick.


Not only that, even commoncrawl had issues (about a year ago) where AWS couldn't keep up with the demand for downloading the WARCs.

As someone who has written a lot of crawling infrastructure and managed large-scale crawling operations, I think respectful crawling is important.

That being said, it always seems like Google has had a massively unfair advantage for crawling, not only with budget but with brand name and perceived value. It sometimes felt hard to reach out to websites and ask them to allow our crawlers, and grey tactics were often used. And I'm always for a more open internet.

I think regular releases of content in a compressed format would go a long way, but there would always be a race for having the freshest content. What might be better is offering the content in a machine format, XML or JSON or even SOAP, which is usually better for what the sites doing the crawling want to achieve, cheaper for you to serve, and cheaper and less resource-intensive compared to crawling. (Have them "cache" locally by enforcing rate limiting and signup.)


> That being said, it always seems like Google has had a massively unfair advantage for crawling, not only with budget but with brand name and perceived value.

VCs and other startup culture evangelists are always challenging founders to figure out what their ‘unfair advantage’ is.

That’s the name of the game.


While the crawling is disrespectful, it seems RTD could find a cheaper host for their files. At my work we have a 10G business fiber line and serve >1PB per month for around $1,500. Takes 90% of the load off our cloud services. Took me just a couple weeks to set up everything.


If I understood correctly, they have a CDN that normally takes care of it, there were just some links that were not ported / covered by the CDN yet?


They normally aren't serving from webservices but from subsidized CDNs.


Having built an AI crawler myself for first party data collection:

1. I intentionally made sure my crawler was slow (I prefer batch processing workflows in general, and this also has the effect of not needing a machine gun crawler rate)

2. For data updates, I made sure to first do a HEAD request and only access the page if it has actually been changed. This is good for me (lower cost), the site owner, and the internet as a whole (minimizes redundant data transfer volume)
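
A minimal sketch of that check-before-fetch step (assuming the `requests` library and an in-memory cache; URL handling, retries, and robots.txt are left out):

    import requests

    seen = {}  # url -> (etag, last_modified) from the previous crawl

    def fetch_if_changed(url, timeout=10):
        head = requests.head(url, timeout=timeout, allow_redirects=True)
        etag = head.headers.get("ETag")
        last_modified = head.headers.get("Last-Modified")

        # Skip the full download if the validators match what we saw last time.
        if (etag, last_modified) != (None, None) and seen.get(url) == (etag, last_modified):
            return None

        resp = requests.get(url, timeout=timeout)
        seen[url] = (etag, last_modified)
        return resp.text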

Regarding individual site policies, I feel there’s often a “tragedy of the commons” dilemma for any market segment subject to aggregator dominance:

- individual sites often aggressively hide things like pricing information and explicitly disallow crawlers from accessing them

- humans end up having to access them: this results in a given site either not being included at all, or accessed once but never reaccessed, causing aggregator data to go stale

- aggregators often outrank individual sites due to better SEO and likely human preference of aggregators, because it saves them research time

- this results in the original site being put at a competitive disadvantage in SEO, since their product ends up not being listed, or listed with outdated/incorrect information

- that sequence of events leads to negative business outcomes, especially for smaller businesses who often already have a higher chance of failure

Therefore, I believe it’s important to have some sort of standard policy that is implemented and enforced at various levels: CDNs, ISPs, etc.

The policy should be carefully balanced to consider all these factors as well as having a baked in mechanism for low friction amendment based on future emergent effects.

This would result in a much better internet, one that has the property of GINI regulation, ensuring well-distributed outcomes that are optimized for global socioeconomic prosperity as a whole.

Curious to hear others’ perspectives about this idea and how one would even kick off such an ambitious effort.


Shouldn't all sites have some kind of bandwidth / cost limiting in place? Not to say that AI crawlers shouldn't be more careful, but there are always malicious actors on the internet; it seems foolish not to have some kind of defense in place.


The big three cloud providers (AWS/GCP/Azure) have collectively decided that you don't want to set a spending limit actually, so they simply don't let you.


The big three cloud providers are the most expensive by a factor of 10-100x, and shouldn't be used under any circumstances unless you really, really need specific features from them.


Isn't running a webserver on those kind of a silly idea for that reason?


It's harder to do right than you think. The first dynamic bandwidth (and concurrent connection) limiter that I wrote was partly to protect a site against Google!


They say this:

> We have IP-based rate limiting in place for many of our endpoints, however these crawlers are coming from a large number of IP addresses, so our rate limiting is not effective.

Do you have something else in mind? Just shut down the whole site after a certain limit?


The TikTok crawler fucked us up by taking product names (e-commerce) and recursively inserting them into the search bar via the results pages. Respect the game, but not respecting the robots.txt crawl delay is awful.


What amazes me is that none of this is surprising. All this behavior (not just what's described in the post) is on par with what these companies are doing, and have been doing, for decades... And yet there will be many people, including here on HN, who will just cheer these companies on because they spit out an "open-source model" or a 10-dollars-a-month subscription.


The situation didn't change when it was search index crawlers being called-out. At the end of the day, this sort of "abuse" is native to the world of the internet; like you said, it's decades old at this point.

HN will cheer on a lot of things that are counter-intuitive to their wellbeing; open-weight models doesn't feel like one of them. You can't protest AI (or search engines) because after long enough people can't do their job without them. The correct course of action is to name-and-shame, not write pithy engineering blogs begging people to stop. People won't stop.


> HN will cheer on a lot of things that are counter-intuitive to their wellbeing; open-weight models doesn't feel like one of them.

Except for the fact that they come from undisclosed sources from a company that does this: https://x.com/Tantacrul/status/1794863603964891567


And I'll be damned if Tanta isn't having his tweets used for training Elon's AI. The parochial circle regresses as it goes around, I'm done acknowledging the make-believe barriers we pretended the internet clung to.

You post it, others consume it. Same as it ever was.


Do you feel the same way about Google spidering for their commercial search engine?


I don't.

Just 3 AI spiders put more load on our servers than all search engine spiders and all human traffic combined.

Some numbers I have handy from before I blocked the bots:

ClaudeBot drove more requests through our Redmine in a month than it saw in the combined 5 years prior to ClaudeBot.

Bytespider accounted for 59% of the total traffic to our Git server.

Amazonbot accounted for 21% of the total traffic to our Git server.

Google has never even been close to breaking out of the single-digit-percentages of any metric.


Generally Googlebot is well behaved and efficient these days, though I have discovered that it is currently horribly broken around 429 / 503 response codes... and pays no attention to Retry-After either... Also Google-Podcast, which is meant to have been turned off!


Someone needs to start adding all these AI companies' homepages to the browser "malware" lists.


Google sent traffic to your website. You were at least getting something in return.


Emphasis on were, since they've made their search such utter shit in the quest for ad revenue that they're now going to have their own AI sum up your results (badly) instead to attempt to solve the problem they created.


Yeah, and when Google stopped sending traffic, people sued them, as they should.


Not originally since they just sent people to the site.

But these days where they just rip content from the site to give people as answers, completely depriving the site of traffic, yeah that seems basically just as bad as the AI bots.


Exploitation of the common man being a key ingredient of a product has never once inspired any actual consumer revolt. Fundamentally, no matter what they say, as long as they get their fleeting hit of dopamine from buying/using a thing, people just don't care.

As Squidward says: nobody gives a care for the fate of labor as long as they get their instant gratifications.


> "One crawler downloaded 73 TB of zipped HTML files in May 2024, with almost 10 TB in a single day. This cost us over $5,000 in bandwidth charges, and we had to block the crawler."

Invoice the abusers.

They're rolling in investor hype money, and they're obviously not spending it on competent developers if their bots behave like this, so there should be plenty left to cover costs.


It would be nice to have more of a peer-to-peer infrastructure (torrent inspired) for serving big resources.


Never going to happen due to misaligned incentives. In 2024 everyone wants to keep their data behind lock and key. The commons is gone. Just look at the Google/Reddit deal.


I suspect the AI companies are as careless with their training as they are with their web scraping.


Had a conversation with a firm that wanted a distributed scraper built, and they really did not care about site usage policies.

You would be fooling yourself if you think such a firm cares about robots.txt or page tags.

We warned them that they would eventually be sued, told them to contact the site owners for legal access to the data, and issued a hard pass on the project. They probably assumed that if the indexing process ran out of another jurisdiction, their domestic firm wouldn't be liable for theft of service or copyright infringement.

It was my understanding AI/ML does not change legal obligations in business, but the firm probably found someone to build that dubious project eventually...

Spider traps and rate-limiting are good options too. =3


Robots.txt doesn't create a legal obligation. It’s just a set of rules saying “if you don’t follow these rules to politely crawl our site, we’ll block you from crawling our site”.

Obviously “anything goes” in civil suits however - if someone is being absurdly egregious with their crawling there’s usually some exposure to one tort or another.


The posted site access/usage policy is legally enforceable in most jurisdictions as far as I know...

And Reddit has definitely become more proactive about scrapers. ;-)


Please review HiQ vs. LinkedIn - it hinged on the fact that HiQ hired crowdsourced workers (“turkers”) to create fake profiles through which to access LinkedIn’s platform (who had to agree to the ToS to create these accounts). The court found that hiQ expressly agreed to the user agreement when it created its corporate account on LinkedIn’s platform.

This doesn't apply if you don't ever agree to anything - which is the case if the information is not locked behind account creation.


This gets complex fast, as a click-army is not necessarily violating the EULA.

However, if they scraped the content using these account credentials, then it becomes a problem in a commercial context.

If I recall, only journalists and academics could argue Fair use at that point.

Anyway, I didn't touch the project mostly for copyright and trademark risk concerns.

Have a great day =3


Legally enforceable as in “you can block people who don’t follow the policy” or enforceable as in “you can sue them for money”?


If I recall it is considered theft-of-service if you bypass the posted site usage terms with an agent like a spider, and certainly a copyright violation for unauthorized content usage (especially in the context of a commercial venture.)

One may be sued, but not because you parsed robots.txt wrong =3


> it is considered theft-of-service if you bypass the posted site usage terms

My understanding is that this is not accurate.

HiQ v LinkedIn established that this is only the case if you actually agreed to the terms of service. Such "agreement" only happens if the information is walled behind an account creation process, e.g. Facebook, Inc. v. Power Ventures, Inc. If it's just scraping publicly available webpages, the only legal issue with scraping would be unreasonably or obviously negligent scraping practices which lead to degradation or denial-of-service. And obviously the line for that would have to be determined in civil court.

eBay v. Bidder's Edge (2000) is the last case that I could find which even considered violation of robots.txt as a very minor part of the judgement, but the findings were based far more on other things. Intel Corp. v. Hamidi also implicitly overruled the judgement in that ruling (though it was not related to robots.txt, which was really just a very minor point in the first place).


Hard to say. I seem to recall it was because some spider authors used session cookies to bypass the EULA (the page probe auto-clicks "I agree" to capture the session cookie), and faked user agent strings to spoof Googlebot to gain access to site content.

One thing is for certain: it's jurisdictional... and way too messy to be responsible for maintaining/hosting (the ambiguous copyright protection outside a research context looked way too risky). =3


AI/ML has created a race to suck all the data on the internet regardless of copyright or status and use it.

OpenAI introduced GPTBot in August 2023… they already took everything.


Unlikely, site-generator hosts are still happily providing a limitless supply of remixed well-structured nonsense, random images with noise, and valid links to popular sites.

In this case, they showed up to the data buffet long after it went rotten due to SEO.

Have a nice day =3


The words "AI" and "respectful" don't belong in the same sentence. The mere concept of generative AI is disrespectful.


The concept of generative AI is not disrespectful. You could have generative AI trained on licensed data - would that be disrespectful?


Why can't you just rate limit non-browser user agents very aggressively, if your primary audience is human?


We all know where this is headed. In 5 years, all useful content on the web will be behind a paywall.


Served as word vomit.


They will be once we legislate that respect. Not a second earlier.


Sites need to start suing crawler operators for bandwidth costs.


How is that supposed to work? Under which law will you force me to pay for using your public website, especially if I am not in your country? Just put up a captcha and block crawlers; you’re not going to get them to pay you.


Everything old is new again. I remember when someone at Google started aggressively crawling del.icio.us from a desktop machine and I ended up blocking all employees...


Have capitalists ever stopped just because their actions (that make them money) hurt others? Because the consequences of the damage they cause might in the end hurt them, too?

Hoping they just stop seems futile…


So much waste. Even worse than the digital currency rush.


Just 2 buggy crawlers doesn't seem like that many. Sure, they each had a large impact, but given that there are likely hundreds if not thousands of such crawlers out there, it's a rather small number. It seems that most crawlers are actually respectful.


I used to run a site with a huge number of pages that had high running costs but low revenue.

The only web crawler that did anything for me was Google, as Google sent an appreciable amount of traffic. Referrers from Bing were almost undetectable: the joke among my black hat SEO friends at the time was that you could rank for money keywords like "buy wow gold" and get 10 hits. Then there were the Chinese crawlers like Baidu that would crawl at 10x the rate of Google but send zero referrers. And then there were crawlers looking for copyrighted images that cost me money to accommodate even if they never sent me cease and desist letters.

As much as I hate the Google monopoly I couldn't afford having my site crawled like that without any benefit to me.

It's an awful situation for the long term though because it prevents new entrants. Right now I am thinking about a new search engine for a vertical where a huge number of products are available from different vendors and when you do find results from Google they are sold out at least 70% of the time. I hate to think it's going to get harder to make something.


How many foundation AI companies are there? 2 is a pretty big chunk of that pie.


I think there are a thousand wannabe companies all trying to suck up as much data as they can; not a sustainable situation in any way.


There was a paper about webcrawlers circa 2000 that pointed out that the vast majority of academics who ran webcrawlers never published a paper based on their work.


Sounds like all the kids who want to make a video-game and start with building a game engine.


What makes you think that only a dozen or so foundation AI companies are scraping?



