I run some forums, some of them are quite large. Recently the big increase in scraping by the search engines (Bing has had the greatest increase) caused me to question why.
It used to be that the cost of scraping came with the benefit of being search engine listed which drove traffic, but that feels less true than it used to (for a lot of reasons).
But now the cost of scraping no longer feels like it works in a website's favour.
Scraping and bots are for search engine listings, technology tests / experiments, advert / audience measurements, brand protection, IP tracking, copyright enforcement, screenshots for links on other websites (e.g. Facebook), Pinterest linkbacks, training of LLMs (my hypothesis on Bing's massive increase), spam, etc, etc.
With the search engine value lowered by less traffic, yet a solid community still growing via word of mouth... the rest of those things offer no value to me or the community. So I asked the community, what do you want to do here? Leave them all? Ban some? Ban all? Some midway thing?
Almost unanimously the community (who fund the costs by donations, and at least 30% of all traffic and costs were known to be associated with bots) chose to block every bot.
So that's what we've done.
We've blocked every major hosting and cloud ASN, or put up a challenge for the few known to be proxies (e.g. Google Data Saver). We've blocked hundreds of bot user agents, blocked requests with no Accept header where one should be present, and blocked TLS cipher suites that modern web browsers don't use. I looked at requests made by Python, Go, curl, Wget, etc... and blocked everything that obviously differed from a valid browser.
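As a rough illustration only, the Accept-header and user-agent side of those checks amounts to something like the following WSGI-style filter. The user-agent list is a small sample rather than the hundreds actually blocked, and the real rules live at the edge, not in application code.

    # Illustrative sketch only: block requests with no Accept header or with
    # an obvious non-browser user agent. Not the actual ruleset.
    BLOCKED_UA_FRAGMENTS = ("python-requests", "go-http-client", "curl", "wget", "scrapy")

    def looks_like_bot(environ):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if not ua or any(frag in ua for frag in BLOCKED_UA_FRAGMENTS):
            return True
        # Real browsers always send an Accept header on page loads; many scrapers don't.
        if "HTTP_ACCEPT" not in environ:
            return True
        return False

    def firewall_middleware(app):
        def wrapped(environ, start_response):
            if looks_like_bot(environ):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Forbidden"]
            return app(environ, start_response)
        return wrapped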
In the end we blocked about 40% of our traffic, and so far not a single real human has said (and it's a tight-knit but large community with lots of ways of contacting me) that they've had any issue at all.
We appear to have reduced our traffic and associated costs, with no loss to us at all.
About a year ago I noticed one of my websites going down about weekly for an hour or so. My website had one endpoint that was available in a few hundred thousand versions. That was meant for users; for the bots it was just a few thousand variants, set up in the sitemap, set up in robots.txt, and including the right meta tags. It was meant to update every few months.
But well, not with the Bing bot. It ignored my timeouts and queried hundreds of thousands of pages, identical as far as it was concerned, every single week. Not one connection, not two or three, but about 10 IPs hammering my servers at once. Not a second between requests, not even pausing when the server was going down, something even 'bad bots' usually do.
I assumed it was just any bot calling itself Bing. But no, it was their IP ranges.
I blocked nearly all of their IPs, which appears to be the only way to make sure it doesn't DDoS me again. Bing is like 1% of my traffic, not even worth the hassle.
Yeah, Bing has gone completely nuts the last six months or so. They'll happily send the equivalent of a small DDoS at sites hosted on shared hosting, knocking them completely offline for a while. Nuts.
Could have been in the last 6 months for me too. And yes, it's crazy; most of my sites have between 8 and 16 concurrent database connections available. In the real world this works for thousands of daily users, but for the Bing bot it's simply not enough.
Why aren't you caching your forum as static pages for users who aren't logged in at the very least? E.g., rebuild the cache every x time as a cron task, but even then, every page load shouldn't be incurring database overhead if someone isn't even logged in. Equally, you can force a rebuild of cache for x relevant pages when someone posts a new thread or comment.
Otherwise, if someone alt+clicks a bunch of a category's threads as they look interesting then you're going to have a bad time.
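A minimal sketch of that idea, with file paths and function names that are purely illustrative:

    # Sketch: serve pre-rendered HTML to logged-out visitors, regenerate on expiry,
    # and invalidate when someone posts. Paths and names are illustrative only.
    import os, time

    CACHE_DIR = "/var/cache/forum-pages"
    CACHE_TTL = 300  # seconds of acceptable staleness

    def _cache_file(path):
        return os.path.join(CACHE_DIR, path.strip("/").replace("/", "_") or "index")

    def cached_page(path, render, logged_in):
        if logged_in:
            return render()                      # personalised, hits the database
        cache_file = _cache_file(path)
        if os.path.exists(cache_file) and time.time() - os.path.getmtime(cache_file) < CACHE_TTL:
            with open(cache_file) as f:
                return f.read()                  # no database work at all
        html = render()
        with open(cache_file, "w") as f:
            f.write(html)
        return html

    def invalidate(path):
        """Call when someone posts a new thread or comment on that page."""
        cache_file = _cache_file(path)
        if os.path.exists(cache_file):
            os.remove(cache_file)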
I am not the forum guy and my content is not really static. It's cached, but for this specific sub page I need at least one small query. Only a few requests are rebuilding the page, but if every 10th is rebuilt and 9 other bots are hammering mostly cached pages I still have the same issue.
The bot was never meant to query the same pages thousands of times. These pages were identical from its point of view. There were already bot-specific rules programmed in.
This website is already heavily cached and optimized. Even though it only has 16 database connections, there is maybe one timeout every few months. Users usually don't open tabs much faster than the short requests take; only Bing does.
Really, the time that went into optimizing it makes me kinda sad when someone questions whether it was an effort vs. payoff thing. The bot ignored all the rules I gave it and barely brought any benefit for me in terms of traffic. There is no payoff here, only effort.
This happened to me every week or so, not entirely predictable, on a small website that supports only a handful of users. Being an enterprise site with some corporate backing, it's on quite a decent server, but it was no match. The DDoS would leave things in a semi-recovered state at best.
And you'll abandon those forums one day, and a month after, I'll search for some niche problem, someone will link to that forum, and there will be no archive(.org) page, no Google cached page, nothing at all to get the answer from.
I've only ever shuttered one forum, and that was at the behest of the community itself. A decade in and it had become a toxic place, everyone agreed that they wanted to give multiple new things a try and that the place in question should be deleted. Not archived, not available forever, but deleted and nothing kept of it. I obliged.
When the day comes that I shutter another I'll ask the active members at the time what they want to happen to their data. They may desire to leave it as a resource, they may want to delete it, if there's a clear majority in the decision I'll go with whatever they desire. I value the choice of those whose data it is, who contributed to creating it, over anyone else's hypothetical needs.
> When the day comes that I shutter another I'll ask the active members at the time what they want to happen to their data.
You might not have that chance; unless you have a co-admin with full access to everything, the reason for the forum to shut down might be because you're no longer there.
And? In this hypothetical situation where they “aren’t around”, I don’t think that people searching for answers to tech support issues are high on their list of concerns, and - probably - almost certainly those of the forum members.
"And" the availability of information is important.
Respectfully, on most forums, I don't care about the community, I care about the content, that's why I'm there, to have discourse and generate meaningful value in the form of knowledge. If someone passes, yes, that sucks, but that's life, we're all snuffing it at some point. However, the world carries on spinning, and that information should continue to be available, especially if the forum is for a niche and frequently generates useful information.
If a forum is becoming "toxic" then that sounds like a moderation problem.
It seems like these kinds of forums are not exactly your area of interest, if they're community focused. Information gets snuffed out all the time; with every death of a person we lose a large piece of information. But hoarding, and especially expecting others to hoard or assist hoarding, is not the correct approach to what is essentially a _you_ problem, so grab that terabyte disk and make a mirror yourself if you are so inclined. Nobody is or should be required to let corpo behemoths in for your convenience and to comply with your questionable opinions.
I'm part of several digital archivism projects. My personal disk array is 54TB of data. That's without even getting into 1PB+ of data on LTO carts.
Last time I checked, Archive.org et al weren't a "corpo behemoth", but consuming server resources is exactly what a normal user does.
Site owners should get with the times and serve up cached static pages to users who aren't logged in. Even then, they should be serving up cached static pages and rebuilding cache for relevant pages when someone posts new content when it comes to forums. Not being able to handle a few crawlers is an administration problem. Why should the community/public suffer for someone's inability to configure a server appropriately?
We live in primarily free societies where individuals have the right to decide upon their actions. Telling people that there is only one "correct" way of doing things is obnoxious and toxic and reflects your inability to see your opinion for what it is: an opinion.
My opinion is that "screw crawlers and scrapers" is a valid opinion. If I'm hosting a playground, it's my playground and my rules. If you want to play elsewhere, please do. If you want to preserve data, please do, but not at my expense.
Disagree with that? Feel free to, but don't think that you are somehow in the right, because if you take this shit to court, you will be laughed out of the door.
You have no inherent right to other people's data, regardless of how they shared it or the visibility of it at the time they shared it. You are not owed the sum of human knowledge.
If people wished for their content to be available to all forever they'd run a blog and pay to ensure it is available, and would proactively seek to get it archived.
People on forums aren't doing that, and the data of any given individual is a contextless collection of semi-random mumblings on different topics because without the fullness of a conversation involving others none of it makes sense.
It is within that context that a forum admin can decide what to do; they have been granted the right (by the T&Cs) to the collection of all the forum members' comments, which restores the context and gives meaning to the content. Every individual on the forums I operate can obtain their own data, but it would be meaningless by itself.
As the operator of the collection of content I get to determine what best to do with that, and sometimes that may be to delete it all. Sometimes that may be to seek to archive it. And on this occasion it is to treat this knowledge as having value to those already participating in the community, and to not share it beyond that.
Elsewhere you said this:
> Call it what most forums are: an ad-supported business. People generate content for the owner for free because they too derive value from the information that others share. The middleman is just a middleman
But the 300+ forums I run have no adverts, they are not a business, they are non-profit. Their value (if you want to measure everything in a capitalist way) is social, to help those in the community.
The purpose of the forums I run isn't to expand the sum of human knowledge, or to make myself personally wealthy off the back of the efforts of others; the purpose is to help be a remedy to adult loneliness by connecting people by their shared interests in geographically small areas, such that it builds relationships and forms bonds.
Yes there is a hell of a lot of expertise captured here around those interests... but no-one has any inherent right to it.
This tangent is in relation to my shuttering one forum.
That forum was around a music band in the UK, and the audience of the forum turned out to be younger than expected - university age. They were emotionally immature, over-shared online, slept with each other, had relationships and break-ups... all in public. The music forum did have lots of music info on it, but it was intertwined with a lot of very highly personal information posted at a time when a reasonable expectation of the internet was ephemerality.
It was totally right to protect the individuals' future selves from their past selves, and I would delete it again.
There are certainly downsides to hoarding data. At the very least, information takes up space. It also tends to suck up mental bandwidth: you have to keep organizing, de-duplicating, and migrating to newer formats. It's much easier to just delete it, just like it's much easier to throw out old ratty t-shirts. IMO, data hoarding is just as much of a mental disorder as hoarding physical stuff.
This idea that all information must be preserved for forever is also at odds with privacy. See, e.g., the right to be forgotten.
I think that the reason many people don't put much effort into archiving information is a cultural one. Most people simply haven't given much thought to the question of the fate of information or knowledge they happen to find, and the importance of preserving that knowledge for the health of society's discourse.
Why are forum admins beholden to archive their data in perpetuity in case someone wants free advice or knowledge?
Do you maintain a freely-available repository of all of your knowledge and experience, in case someone else wants to consult it one day?
While the openness of the (now-ending) early days of the internet was liberating and allowed knowledge sharing on an unprecedented scale, the downside is the huge devaluing of that knowledge and skills.
I do actually, but it is up to the person. The main reason for me to encourage it is that if knowledge is reserved for the high priests it will eventually be lost. How many civilizations have we built by now? No one knows! We don't have the records. The stuff people must have figured out. Of course many would pretend it wasn't a big deal, but all those deleted forums had plenty of insights to offer, practical ones and historically valuable ones.
The real value of knowledge doesn't change if you duplicate it or make it widely available. In the long term, blocking access and rent seeking doesn't create value, it destroys it. It seems useful for the individual who wants to pay their bills or for the one with insatiable greed, but in the end it will make us stupid.
For example: I would like a high quality UV-B lamp that isn't INSANELY expensive. They are pretty ordinary lamps but developing the coating is very expensive. The work has been done though, lots of times, over and over again. Most results are just bad.
About 35% of the US and about 1 billion people globally have a vitamin D deficiency, and 50% have an insufficiency, with symptoms like fatigue, not sleeping well, bone pain or achiness, depression or feelings of sadness, hair loss, muscle weakness, loss of appetite, getting sick more easily, etc.
Great loss of economic productivity or more opportunity for me? You decide!
If I contribute time in answering questions or solving problems, like with mailing lists still being available to view, something that I intentionally put into the public domain with the intent of helping people should remain available. Just because a forum exists as a business to someone doesn't mean that the content has no value to the general public. The forum itself has no value; only the content has value, which is what draws in the traffic to make money in the first place.
Call it what most forums are: an ad-supported business. People generate content for the owner for free because they too derive value from the information that others share. The middleman is just a middleman.
To not allow that content to be indexed/cached/archived/mirrored whilst making money off of it is pretty scummy in the long-term. There are tons of forums I used to visit whose information is now forever lost; that included a lot of very useful programs for niche bits of kit, which is now otherwise very expensive e-waste.
> Why are forum admins beholden to archive their data in perpetuity in case someone wants free advice or knowledge?
Because otherwise their work was wasted.
> Do you maintain a freely-available repository of all of your knowledge and experience, in case someone else wants to consult it one day?
I would if I could, I’ve already contributed what knowledge, bandwidth, and money I can to the Internet Archive. What about you?
> While the openness of the (now-ending) early days of the internet was liberating and allowed knowledge sharing on an unprecedented scale, the downside is the huge devaluing of that knowledge and skills.
I cannot even process how wrong this is. Objectively the preservation of knowledge and skills is a good thing, and you cannot devalue knowledge, which is itself priceless.
This argument really makes no sense. If I tell Bob how to fix his transmission down at the local diner, but nobody records the conversation, that wasn't wasted work. Bob fixed his transmission: mission accomplished.
So this data will not be lost forever? Also, do you mean that all data and all posts made by users should belong to admins only, and only admins should decide what to do with it?
They're not, but blocking all bots, also blocks others, that want to archive all that data forever, be it a private person using wget, or a service like archive.org.
Why would I? It's online, I know where to find it... until it's gone from there. Also, that would mean I'd have to archive it before I actually needed it archived. And archiving would have to be done manually. And after it's gone, the only proof of it existing is a text somewhere else saying that the solution to my problem is here -> LINK, and the link is dead, the data is gone. Not even on archive.org.
Have we really come to a phase of internet use where every time you see something, you have to manually save it, and on every post (even here or on Reddit, Facebook or wherever) a link is not good enough, but you have to copy-paste the whole block of text just to make it a bit future-proof?
And the perpetual tale of the forum post that says "this has been asked before, use the search" and the first search result is this person saying to use search
Bizarrely, I can't remember the last time DuckDuckGo (basically Bing) gave me a forum as a search result, though it used to regularly give me results from them. Maybe it's the admins blocking crawlers, but it feels more like a conscious decision.
I’ve been wondering how much of it comes down to optimizing for ad impressions. If you search, get a result, and it answers your question they sell one page of keyword ads. If you go back and forth a dozen times, they sell a dozen times as many impressions.
Given my usual behavior is either to check two pages and then add !g, or to check two pages and decide I don't need more info, I don't think that's a strong move.
I'm not saying it's smart, just that I could easily imagine someone chasing the wrong metric or trying to balance revenue against the likelihood that you'll stop using them. For example, in your scenario that's still twice as many impressions, so unless you make Google your primary, maybe that's a win.
I am definitely happy you asked your users what they thought and made your decision. But saying "no human complained" might not be a good metric if people use Google or whatever to discover your site or its info. People don't complain about things they don't know exist.
If you aren’t doing so already, I highly suggest working alongside the Internet Archive to preserve the information on your forums. One day they will close down, and your users will want to see their posts, refer to now broken bookmarks, and generally access the information.
How do you have costs that are directly attributable to scraping? Unless you are using a serverless platform that bills per request, or your pages are large enough that egress bandwidth gets expensive, I'm not convinced most sites would save much doing this.
I'm not really sure how you arrive at the conclusion that only serverless platforms result in costs. It's not just 40% of CPU or egress, it's 40% of database load, 40% of logging, 40% of APM/instrumentation.
40% is 40%. Maybe 40% of their cost isn't enough to warrant whatever time these efforts cost them, but for many people out there it will be.
Sounds like the original commenter had a reasonable case, but I just don’t think it’s likely to save anything for small sites on traditional stacks.
If you are running on e.g. EC2 and RDS instances, you're not saving anything by using 40% less of the CPU, unless you can actually downsize the instance as a result. Read-only traffic is also not that hard to scale out, but with forums etc. you can be stuck with some legacy systems for sure.
It's a multi-tenant platform (about 300 forums, with the biggest being around 250K visitors per month). The database is on a vertically scaled box that is now oversized since the traffic has reduced, but I was able to delete a few of the Linodes that were horizontally scaling the API and Web UI (the Web UI is just a client of the API, hence those could be saved too).
I've also noticed that my cache hit rate is extraordinary now, which I assume is because humans read recent stuff and bots read the long-tail of old stuff.
As someone who does targeted scraping of forums, I can say having a good open API and caching is probably the best way to decrease load.
If you use Cloudflare, turn off their anti-bot stuff. It is far more efficient to let them just serve bots from the cache than having scrapers use tricks to bypass them and go directly to your origin server.
I designed most of, and built a chunk of, the WAF and firewall stuff at Cloudflare. That includes wirefilter (a wireshark display filter inspired firewall), and coupled with Cloudflare using maxmind you get to block ASNs in addition to other characteristics of the request.
With that context, I used bgp.he.net to look up the big ones I know and then wrote the rules.
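For anyone not behind Cloudflare, roughly the same check can be approximated with the MaxMind GeoLite2 ASN database and the geoip2 Python package. A minimal sketch, where the ASN list is just a handful of well-known cloud networks for illustration rather than the actual rules described above:

    # Sketch: flag requests from hosting/cloud ASNs using a local GeoLite2 ASN database.
    import geoip2.database
    import geoip2.errors

    # A few well-known hosting/cloud ASNs (AWS, Azure, Google, DigitalOcean, OVH, Hetzner).
    BLOCKED_ASNS = {16509, 14618, 8075, 15169, 396982, 14061, 16276, 24940}

    reader = geoip2.database.Reader("GeoLite2-ASN.mmdb")

    def is_datacenter_ip(ip):
        try:
            asn = reader.asn(ip)
        except geoip2.errors.AddressNotFoundError:
            return False
        return asn.autonomous_system_number in BLOCKED_ASNS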
You can try out our free IP to Country ASN database[0]. You can just grep the IP addresses by looking up the ASN or AS domains. Then just extract the IP address range and you should be good to go. [1]
The paid databases come with AS type (hosting, ISP, business etc.) and we have a VPN detection database as well.
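A minimal sketch of that grep-and-extract step, assuming a CSV with start_ip / end_ip / asn columns; the column names are assumptions, so adjust them to the actual schema of the file you download:

    # Sketch: extract IP ranges for selected ASNs from a CSV-style IP-to-ASN database.
    import csv

    WANTED_ASNS = {"AS16509", "AS14618", "AS8075"}   # example hosting ASNs

    def ranges_for_asns(csv_path, wanted=WANTED_ASNS):
        ranges = []
        with open(csv_path, newline="") as f:
            for row in csv.DictReader(f):
                if row.get("asn") in wanted:
                    ranges.append((row["start_ip"], row["end_ip"]))
        return ranges

    # Feed the result into a firewall deny list, e.g.:
    # for start, end in ranges_for_asns("country_asn.csv"):
    #     print(start, end)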
While I perfectly understand, I'm a bit worried about my own web browser (Offpunk), which uses python-requests and is thus very often associated with being a bot.
The browser has the goal of being light and downloading only the text and pictures (no css, no js). So we have the same goal here.
Shameless plug: if you do not want to spend the time aggregating all datacenter IP addresses, you can use the IPDetective.io API to easily detect whether an IP address comes from a datacenter, VPN, proxy or botnet.
Anecdotally, I have drastically reduced my internet activity in recent years. So many of the websites are just noise with crappy information. The good answers are often found on Reddit in one or two clicks or directly asked on Discord, so there is no need to spend hours with Google anymore (the crappy algo doesn't help). I also refrained from posting or discussing anything on social media after some bad experiences with the users there. It takes effort to make good quality posts, but it rarely goes anywhere on those massive social media sites and often it may even get punished if it is against public opinion.
So I am tired and only make a few posts every few days now on HN. I am sure that while my activity has dropped, the bots are getting more active, so nothing is lost. Maybe some quality, and the traffic share looks different, but I don't know.
> So many of the websites are just noise with crappy information.
What I hate the most about today's internet is how search engines allow blatant scrapers to feature so high in search results. So many times I Google for something, find Stack Overflow as the main search hit, and right next to it there are a couple of sites that copied Stack Overflow's questions verbatim. Once I googled for FLOSS projects I had on GitHub and lo and behold there were half a dozen obscure sites that also claimed to host my project, with everything copied verbatim, from the git repo to the project descriptions.
It's better than one thing at least: "Internet scraping is the process of..."
You can already see where this is going. Sites with 6 pages of boilerplate that sound like a 6th grader padded an essay around a 2-word answer they've scraped from somewhere else. Worst of all, the 2 words of content aren't even all that accurate most of the time. At least sites that copy the answer verbatim still give you the answer!
My main gripe with Discord is the somewhat ephemeral nature of it, due to the search being horrible, as well as not publicly indexed nor easily accessible without an account + an invite to a specific server.
I think LLMs are useful because they’re effectively trained on Reddit. It’s for sure one of the most useful places to find good information and advice on the web.
I think a ton of Reddit is already bots. But usually they copy comments word-for-word.
Also, I think voting and moderation will upvote and downvote the AI-generated comments in such a way that they don't poison Reddit as a training data source.
Counter point: the whole internet can be distilled into an LLM and shared in a super condensed format.
Both of these things will happen (old web getting spammed, old web being distilled and crystalized), and the future will be weird and unpredictable to us now.
If you completely ignore the fact that many humans congregate on the internet to be social with other humans, then sure. There's a lot of opinions, art, ideas, jokes, and meaningful lifelong connections that happen because of the internet. In my mind that was its only real utility. Sure, search engines are good for research and shopping, but community, talking to people with life experience in an area of interest, has changed who I am as a person. Condensing the internet into a binary is effectively meaningless to me unless I end up on a remote island with enough battery power to look up edible plants and no internet connection...
> the whole internet can be distilled into an LLM and shared in a super condensed format
> the future will be weird and unpredictable to us now
I'm going to play devil's advocate for those people who always drop by saying an LLM is pretty much like a human and a human is pretty much like an LLM anyway, and say it would be no different from now.
Except bots aren't like humans at all. They have no life experience. It's basically a text interface to a dictionary. A company can pay 100000 bots to spam your favorite messaging board with half baked propaganda, hate and advertisements. A human doesn't have that bandwidth, motivation, or interest.
An internet saturated by bots is like reading reviews on Amazon without pictures. Pointless, intentionally misleading, and often confidently wrong.
Yep. I don't know how enthusiasts manage to say a human is the same as an AI and also say that everything will certainly change for sure because of AI; they should pick one...
You can get your Internet back alive if you simply sacrifice your retina scans and all other biometrics that prove your humanity. There may be a reason OpenAI and Worldcoin have the same founder: profit from ruining a nice thing, then profit from saving it (via bringing it back from the dead as a zombie).
I don't believe it. Unless someone is running bots that watch netflix/porn all day, there is no way that they are consuming half of all traffic. Half of all posts? Sure. Half of all webpage requests? Ok. But half of internet traffic?
With you. In 2021, 53% of traffic was from 3 video platforms, not counting countless other video providers and also not counting all other non-video content.
> Our data show in the first half of 2021 bandwidth traffic was dominated by streaming video, accounting for 53.72% of overall traffic, with YouTube, Netflix, and Facebook video in the top three.
> And looks like Sandvine is very credible as countries relies upon it to censor internet in their countries.
Sandvine's primary business is traffic shaping and bandwidth management for cellular networks, shipboard networks, and other places where you need to do intelligent QoS. They are the reason Google Maps still works well on your cellular connection while you pass a house using cellular broadband to download a torrent.
The fact that some people use it to censor is an abuse of technology, not the intention of it.
I thought it would be more. A human doesn't visit anywhere near 10 different webpages a second, a bot can. A human isn't uploading and downloading data 24/7, a bot is. A human doesn't make a new post a minute, every minute, a bot does.
Exactly. No way it's half of bandwidth, because video.
I'm guessing it's half of "connections", including DDoS attacks. But even then, I wonder how reliable their methodology is. Like are they including port scanning here, when a connection isn't even made?
Considering the bulk of data is now behind some form of paywall (i.e. you need a Netflix subscription to access Netflix data), I would be very surprised to see even Bing and Google crawlers combined consuming anywhere near 50% of traffic. They just don't have the access necessary to start pulling such numbers.
I don't quite understand how or why, but I was tangentially involved in a small-to-mid-size e-commerce website that only serves B2B customers in the US and, to a small extent, in Canada, and it absolutely got lots of hits from what I assume are Chinese search engines, to the point where we decided we absolutely needed Cloudflare rate limiting.
Now the code is a hot mess, sure, and I am partially to blame for that, but that is kind of beside the point. We don't need to serve any more than thousands of concurrent users, which the website can handle, but we have to basically ban Chinese traffic to stay online.
Maybe people at bigger companies already know this but it was a revelation to me how much it takes just to stay alive in production.
I'm no analyst, and I definitely don't know what they get out of crawling every single product detail page on our website multiple times a day. Nothing here changes that often. Maybe they have some bad/overzealous code? Are they looking to take over our servers to then attack others with our machines? If it is an attack, why use Chinese IP addresses? Why not use their bot farms? If it is a legitimate search engine, why not respect robots.txt?
It's true because advertisers and publishers measure "performance" (hence revenue/prices) based mostly on views & clicks. You would be amazed how many farms are out in the open yet hidden. And we're not talking about those stupid, easy-to-catch traffic boosters; there's an entire industry behind it.
I think there are bots that watch YouTube and Twitch to boost viewership. I think that could consume a lot of traffic. I have no data or research to back that up and I am no expert.
I have no idea either, but most mobile/residential proxy providers charge heavily for bandwidth [1]. I'd imagine that you can bot views on Twitch without streaming the video, since the iOS Twitch app offers audio-/chat-only modes, so just connecting to the chat (+ maybe some other obfuscated stuff) could be sufficient to bot views.
[1] $9.45/GB at brightdata.com/proxy-types/residential-proxies
I tend to agree unless everyone else is using 32GB favicons on the default websites like me. Streaming takes up a massive amount of bandwidth. That and P2P torrent sharing is only increasing with all the streaming services going back to the cable TV payment plan models.
There are a handful of bots that mirror/archive multimedia content that is anonymously accessible. There is no way those bots have the storage capacity to mirror even a single pass of all the anonymous content.
What I have seen increasing exponentially is port scanning but that takes up almost no bandwidth. Even the broken scanners that in effect look like an amateur DDoS only utilize about 15kb/s using dozens of CIDR blocks at the same time. That does not even remotely hold a candle to streaming.
The link below [1] is talking about Netflix as a percentage of internet downstream traffic and this is only Netflix. There are now hundreds of streaming providers and according to Sandvine streaming accounts for 65% of internet traffic. [2] This does not include torrents and other file sharing.
Here [3] are some fun stats. One of them backs up the submission but it isn't clear if they mean requests or bandwidth. Given that Netflix or streaming alone is 65% of the bandwidth that would lead me to believe the issue of this thread is a lack of clarity around bandwidth vs requests. The wording on all of these sites is too Wibbly Wobbly.
Every day, the internet generates more than 2,183,908 tons of CO2 emissions.
Internet traffic statistics show that 51.8% of all traffic is generated by bots, while humans account for only 48.2%. I think they mean requests, not bandwidth.
I'd believe half of all traffic to non-video hosts, however. We provide a lot of free content and it's staggering how many bad robots there are - not actual malice, just what you'd expect from lousy programmers who don't get the bill. Things like the same IP / User-Agent downloading the same file thousands of times just in case it changed in the last 5 seconds, or crawling millions of permutations of search parameters rather than using the site maps. Many weeks that's half of the total traffic, many tens of terabytes of HTML & JSON.
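For contrast, the polite version of that "has it changed?" pattern is a conditional request that reuses the validators from the previous response. A minimal sketch with python-requests, where the state handling is purely illustrative:

    # Sketch: re-fetch a URL only if it changed, using ETag / Last-Modified validators.
    import requests

    def fetch_if_changed(url, state):
        """state is a dict persisted between runs; returns the new body or None."""
        headers = {}
        if "etag" in state:
            headers["If-None-Match"] = state["etag"]
        if "last_modified" in state:
            headers["If-Modified-Since"] = state["last_modified"]
        resp = requests.get(url, headers=headers, timeout=30)
        if resp.status_code == 304:
            return None                                  # unchanged, nearly free for the server
        if "ETag" in resp.headers:
            state["etag"] = resp.headers["ETag"]
        if "Last-Modified" in resp.headers:
            state["last_modified"] = resp.headers["Last-Modified"]
        return resp.text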
As someone who once wrote a scraper to interface with a site that lists lawyers (So we can blacklist them in advance from being called, because they REALLY don't like getting contacted by accident), and another time a scraper to get a list of unemployed people (employee leasing): Bad bots really ruin it for those that actually try to make well behaved scrapers.
I once wrote a small utility for my team that ended up being the second or third largest API consumer in the company until someone yelled about it and I throttled back the polling significantly. I always think of The Sorcerer's Apprentice sequence from Fantasia - while we sleep, our automatons labor, potentially unceasingly.
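"Throttling back" usually amounts to something like the sketch below: poll on a generous interval, back off on errors, and add jitter. The intervals and structure are made up for illustration, not taken from any real utility.

    # Sketch: polite polling loop with exponential backoff on errors and jitter.
    import random
    import time
    import requests

    BASE_INTERVAL = 300        # seconds between polls when everything is healthy
    MAX_BACKOFF = 3600

    def poll_forever(url, handle):
        backoff = BASE_INTERVAL
        while True:
            try:
                resp = requests.get(url, timeout=30)
                resp.raise_for_status()
                handle(resp.json())
                backoff = BASE_INTERVAL                  # healthy: reset to the base interval
            except requests.RequestException:
                backoff = min(backoff * 2, MAX_BACKOFF)  # unhealthy: back off further
            time.sleep(backoff + random.uniform(0, 30))  # jitter avoids synchronised polls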
And I can certainly imagine bots fetching Netflix or Porn or whatever video for personal archival purposes (I use youtube-dl to protect a few videos I really like from the vicissitudes of Google myself).
Google for the UN Economist Network Report on the Attention Economy. There is a line there that says less than 1% of all content produced is actually consumed by people. And it's shrinking.
When cost to broadcast for anyone on the net falls to 0 everything turns to shit.
The other possibility here then is that legitimate web page views pale in comparison to both streaming and bot traffic, and bot traffic is orders of magnitude higher than other web activity excluding streaming.
> Of all internet traffic in 2022, 47.4% was automated traffic, also commonly referred to as bots. [...] Of that automated traffic, 30.2% were bad bots, a 2.5% increase from 27.7% in 2021
This is a bit misleading: according to the accompanying pie chart, 30.2% of all traffic was bad bots, not 30.2% of the 47.4%.
What is sorely lacking (from a quick skim of the PDF) is a detailed description of how the data was measured, what protocols it includes, what the error margins are, etc.
It’s worse than that as over half of all internet traffic is video streaming. Bots simply aren’t a significant fraction of Netflix, D+, HBO Max etc because there’s no point.
says vendor trying to sell anti-bot software to your managers. Let me guess, by "all internet traffic" they mean "HTTP requests going through our tool"?
While you are right, the amount of bot traffic is huge, from data scrapers, fake views (i.e. YT video SEO, video ads, etc), to game miners and social media fake accounts, which each scammer/spammer runs by the thousands. That's not including the legal stuff like automated systems (web crawlers, media content generators, etc).
Right. I went to their site and I can't even access the actual "report" they're talking about. But it's clear that they've accidentally mixed up the words internet and web.
The web is HTTP, so websites and videos. The internet is everything including the web, from SSH (remote server login) and RDP (remote desktop login) to torrenting.
Cloudflare (CDN with a much larger market share) claims that in the past four weeks, the percent of bots vs. humans is about ~29%. Almost 50% seems like a stretch.
TLDR 30% of http requests are bad bots, 17% are okay bots, and 53% human.
The PDF report isn't explicit, but since their assertion is "based on data collected from the company’s global network throughout 2022, which includes 6 trillion blocked bad bot requests", it means the "internet traffic" is measured in number of HTTP requests. As noted in other comments, results would have been very different with network bandwidth.
BTW, I haven't checked if the search bots' behaviour has recently changed, but I remember that most of them ignored the directives in robots.txt asking for a slower crawl. And I couldn't find a way to declare that the content almost never changed.
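For reference, the standard library can at least read those directives. A minimal sketch of a crawler honouring Crawl-delay (the user agent and URLs are placeholders), whether or not the big bots bother to:

    # Sketch: respect robots.txt rules and Crawl-delay before fetching a page.
    import time
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    delay = rp.crawl_delay("MyCrawler") or 10            # fall back to something polite
    if rp.can_fetch("MyCrawler", "https://example.com/some/page"):
        time.sleep(delay)
        # ... fetch the page here ...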
Google Read Aloud [1] bot absolutely hammers my website because I update the url using pagehistory and it opens a new request for each url update. It also ignores the meta tag that's meant to disable it :(
This is a bit confusing because isn’t most internet traffic video [0]? Streaming video requires something like 100x more sustained bandwidth than clicking around a website. So is this 47% of non video traffic or are bots consuming video? Something else?
Does this mean 47% of all online ad spend is worthless? How would one prove that it isn't? I've never seen an ad service offer cost per human click or cost per thousand human impressions
Automated ad fraud is why. It's the biggest problem you've never heard of.
The last time I heard it cited (which, to be fair, was a few years back) the actual number was 33%—for every three dollars spent on digital advertising, a dollar gets lost to fraud. Not too far off.
Roughly three years old I think. But if you're referring to the same ad fraud rings I'm thinking of: yeah, a few did get shut down (which iirc was a first) but there's plenty of new operations taking their place. Ad fraud is a relatively cheap way to make money if you can scale up your operation, and most perpetrators don't get caught or punished. Although that may change in the future.
> This website requires certain cookies to work and uses other cookies to help you have the best experience while on the site.
> By visiting this website, certain cookies have already been set, which you may delete and block. If you do not agree to the use of cookies, you should not navigate this website.
> Visit our privacy and cookie policy to learn more about the cookies we use and how we use your data.
I'm pretty sure GDPR says nothing about cookies that are needed for the site to work, such as session cookies when you're logged in, or cookies to hold settings you set. Am I wrong?
Declining on this form sends me to the / root page. Weird.
Obviously 100% of internet traffic is bots since human brains don't (yet) directly connect to the Internet. As far as what percent of that traffic does eventually enter ears or eyes, there's really no way to tell.