The brand recognition is worth something. I haven't been in the market for new headphones in a long time, but I still know the name Raycon from the bajillion sponsorships they do.
Likewise with NordVPN and Raid: Shadow Legends. Never used any of them, don't really intend to, but I do know the name.
I dislike it because it exposes content creators to pressures similar to traditional TV. There's a lot of content that doesn't get made because that content would be unsponsorable, or worse yet would make the creator in general unsponsorable. It's also created some strange and twisted linguistics to appease sponsors or YouTube's algorithm, like "unalive" or "PDF file" (as a stand-in for pedophile).
I guess it's the way of the world, but the introduction of heavy monetization has definitely influenced the kind of content YouTube carries.
I'd probably be OK if all the content that doesn't get made without sponsorship just didn't get made at all, and the people who work as content creators stopped doing so. There is an overabundance of new content; having 10x less would be perfectly fine. In pretty much every niche there are amateur enthusiasts who clearly (based on their viewer counts) are giving their time away, and their content is in many ways preferable and "more real" than the professionals' - so I'd be OK if all the professionals stopped and these awkward amateur enthusiasts were all that remained.
The same applies to the web and blogs; the ability to monetize them with ads (and I do remember the "old web" before that was the case) increased the amount of content but drowned out viewership for the true enthusiasts running things in their spare time, who IMHO were more valuable. I think that regime was better; again, losing 90% or 99% of the content wouldn't be bad in my mind - there would still be more than enough for anyone to ever "consume".
> You can make content without monetization in mind. But it's like giving your time away.
Sure, but then how is this any different from TV? E.g. I've seen a few videos dramatically overblowing the certainty of life on Mars lately, presumably for views. If I wanted half-truths based on lack of context, I could just flip on the news.
> Content which doesn't get made without sponsorship wouldn't get made even if sponsorships didn't exist.
Sponsorships raise the money invested into videos, which raises viewer expectations, suppressing the likelihood that unsponsored videos would ever be seen. You basically need sponsors for your videos to go anywhere these days because people expect professional editing/lighting/etc. The "I watched a Premiere tutorial and filmed on a cellphone" approach won't cut it anymore.
> People want to get rewarded for their work, you know. Do you also want your plumber to work for free?
I don’t want it to be work, I would prefer it was done by hobbyists. There are tons of thriving hobby communities full of people only getting personal satisfaction.
>You can make content without monetization in mind. But it's like giving your time away.
You're missing the point entirely: the content I refer to as more interesting is stuff people made for fun or on principle, not because of a financial incentive.
Imagine if people only commented on HN because they were expecting a paycheck for it.
I also follow the closely related addendum: I do not want standing admin access to your system, unless I need it often enough it really impacts my productivity. Doubly so if it's not hooked up to SSO. If the database gets breached, I don't want my name on the list of people who had the admin password.
Most big businesses are good about that, but I've helped a couple family members with their business' WordPress and just have standing access that I really don't want. They don't want to juggle activating/de-activating my account though, so /shrug.
Same all around for me. I have a couple of longstanding accounts on local businesses I help out, but it’s all via VPNs that send the owner an email when I connect. I also refuse to do any work unless they ask me in writing. Text is OK, and I screenshot it. “Why did you give such-and-such rights to that employee?” “I have it in writing where the owner asked me to, Your Honor.”
This has never come up before, but it’s easy enough to be diligent about it.
Also: I keep a little paper notebook where I log the work I do for everyone, and occasionally have someone else sign and date it. It’s basically a cheap blockchain IRL. “How do you know you did this before you stopped doing work for them?” “Because the owner signed and dated the logbook after I did the work but before they hired the new IT person.”
I’m suuuuuper nitpicky about diligence in all this, for the protection of everyone involved, and especially me.
To put a number on it, linear no-threshold (LNT) models predict ~130 deaths as a result of the radiation (and are known to over-estimate lethalities at low doses).
Around 50 people a year die while clearing snow in Japan, so it's ~ twice as dangerous as shoveling snow in worst-case predictions.
LNT is not known to over-estimate lethalities at low doses. The actual situation is that the predicted deaths at low doses occur at such a low rate that the signal cannot be detected above the noise. That doesn't mean the prediction was wrong, just that it cannot be verified. It's possible (as in, consistent with evidence) that LNT under-predicts deaths at low doses.
Even if LNT would under-predict it is still a rounding error in the big picture of the tsunami disaster.
And, let's put it straight: LNT is scaremongering fiction. People who live in Ramsar, Iran, are exposed to higher levels of background radiation than what is allowed for nuclear workers. Yet there are no elevated levels of cancer or birth defects, nor is there a shorter lifespan for people living there.
Epidemiology is a very blunt instrument. The data you mention there cannot be used to reach the conclusion you are trying to reach, since confounding effects cannot be excluded (and because the doses they receive can only be estimated, not actually measured). Yes, radiation biologists know all about those people and have judged that evidence as part of a larger picture.
The death rates might be a difference in units; the Forbes article is using deaths per trillion kWh, the other might be deaths per thousand/million kWh.
The difference in ranking might be down to how they model deaths from nuclear power accidents. One may be using the linear no threshold model, and the other may be using something else. We don't have an agreed upon model for how likely someone is to die as a result of exposure to X amount of radiation, which causes wide gaps in death estimates.
E.g. Chernobyl non-acute radiation death estimates range from 4,000 to 16,000, with some outliers claiming over 60,000. That's a wild swing depending on which model you use.
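For a sense of where numbers like these come from: LNT is just a straight multiplication of collective dose by a fixed risk coefficient, so the estimate scales linearly with whichever dose figure and coefficient you pick. A rough Python sketch, assuming the commonly cited ~5%-per-sievert fatal-cancer coefficient; the dose values below are simply back-calculated to reproduce the figures mentioned in this thread, not measured data:

    # Minimal sketch of the linear no-threshold (LNT) arithmetic.
    # Assumption: a nominal fatal-cancer risk of roughly 5% per person-sievert
    # (the commonly cited ICRP-style figure). The collective doses are
    # illustrative placeholders chosen to reproduce the thread's numbers.

    RISK_PER_SV = 0.05  # excess fatal cancers per person-sievert (assumed)

    def lnt_excess_deaths(collective_dose_person_sv: float) -> float:
        """LNT: expected excess deaths scale linearly with collective dose."""
        return collective_dose_person_sv * RISK_PER_SV

    for dose in (2_600, 80_000, 320_000):  # person-Sv, illustrative only
        print(f"{dose:>8,} person-Sv -> ~{lnt_excess_deaths(dose):,.0f} excess deaths")

The whole disagreement between models is about whether that multiplication is even valid at low individual doses, which is why the totals swing so widely.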
> I think if you create a task but don't await it (which is plausible in a server type scenario), it's not guaranteed to run because of garbage collection or something.
I think that use case doesn't work well in async, because async effectively creates a tree of Promises that resolve in order. A task that doesn't get await-ed is effectively outside its own tree of Promises because it may outlive the Promise it is a child of.
I think the solution would be something like Linux's zombie process reaping, and I can see how the devs prefer just not running those tasks to dealing with that mess.
If you write await someOtherAsyncFunction() inside myAsyncFunction(), the call to someOtherAsyncFunction will not spawn any kind of task or delegate to the event loop at all - it will just execute someOtherAsyncFunction() within the task and event loop iteration that myAsyncFunction() is already running in. This is a major difference from JS.
If you just did
someOtherAsyncFunction()
without await, this would be a fire-and-forget call in JS, but in Python, it doesn't do anything. The statement creates a coroutine object for the someOtherAsyncFunction() call, but doesn't actually execute the call and instead just throws the object away again.
I think this is what triggers the "coroutine is not awaited" warning: It's not complaining about fire-and-forget being bad style, it's warning that your code probably doesn't do what you think it does.
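A minimal sketch of both cases (plain asyncio; the function names are placeholders):

    import asyncio

    async def some_other_async_function():
        print("actually ran")

    async def my_async_function():
        # Bare call: builds a coroutine object and immediately discards it.
        # The body never executes; CPython emits
        # "RuntimeWarning: coroutine 'some_other_async_function' was never awaited".
        some_other_async_function()

        # Awaited call: runs the body inline, in the current task.
        # No new task is created.
        await some_other_async_function()

    asyncio.run(my_async_function())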
The same pitfall applies to running things concurrently. In JS, you'd call the async functions first, collect the promises, and await them afterwards, and both run concurrently.
In Python, the functions will run sequentially, at the await lines, not at the lines with the function calls.
To actually run things in parallel, you have to do
loop.create_task(asyncFunc())
or one of the related methods. The method will schedule a new task and return a future that you can await on, but don't have to. But that "await" would work completely differently from the previous awaits internally.
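A minimal sketch of the concurrent version, using asyncio.create_task (the modern spelling of loop.create_task); the function name and sleep durations are placeholders:

    import asyncio

    async def async_func(name: str, delay: float) -> str:
        await asyncio.sleep(delay)   # stand-in for real async work
        return f"{name} done"

    async def main():
        # create_task schedules both coroutines on the event loop right away.
        t1 = asyncio.create_task(async_func("a", 0.2))
        t2 = asyncio.create_task(async_func("b", 0.2))
        # The awaits only wait for completion, so total runtime is ~0.2s, not ~0.4s.
        print(await t1, await t2)

    asyncio.run(main())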
I think this is semantically the same thing, though I'm sure your terminology is more correct (not an expert here).
If you do `someOtherAsyncFunction()` without await and Python tried to execute it the same way as the `await` version, it would run in the same task and event loop iteration, but there'd be no guarantee it's done by the time the outer function is. So either the existing task/event loop iteration has to be kept alive, or the non-awaited work needs to be reaped by some other task/event loop iteration.
> loop.create_task(asyncFunc())
This sort of intuitively makes sense to me because you're creating a new "context" of sorts directly within the event loop. It's similar-ish to creating daemons as children of PID 1 rather than children of more-ephemeral random PIDs.
> but there's no guarantee that it's done by the time the outer function is.
As far as I understood it, calling an async function without await (or create_task()) does not run the function at all - there is no uncertainty involved.
Async functions work sort of like generators in that the () operator just creates a temporary object to store the parameters. The 'await' or create_task() are the things that actually execute the function - the first immediately runs it in the same task as the containing function, the second creates a new task and puts that in the event queue for later execution.
So
asyncFunc()
without anything else is a no-op. It creates the object for parameter storage ("coroutine object") and then throws it away, but never actually calls (or schedules) asyncFunc.
When queueing the function in a new task with create_task(), you're right - there is no guarantee the function will have finished, or even started, before the outer function completes. But the new task won't have any relationship to the task of the outer function at all, unless the outer function explicitly chooses to wait for the other task, using the Future object that was returned by create_task.
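A small sketch of that independence (names are placeholders): outer() returns while the task it spawned is still pending, and the task only completes cleanly here because main() explicitly awaits the Task object that create_task returned.

    import asyncio

    async def background_job():
        await asyncio.sleep(0.1)
        print("background_job finished")

    async def outer():
        task = asyncio.create_task(background_job())
        print("outer returning before background_job is done")
        return task   # outer's own task is finished; background_job keeps running

    async def main():
        task = await outer()
        # If nothing ever awaits (or gathers) the task, asyncio.run() will cancel
        # it at shutdown - and the asyncio docs warn the loop only keeps a weak
        # reference to tasks, which is the garbage-collection issue mentioned above.
        await task

    asyncio.run(main())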
> In any case, for me... >= 65% CPU load for >= 30m/day means it's at 100% effective utilization, and needs expansion relatively soon.
I think this depends on workload still because IO heavy apps hyperthread well and can push up to 100%. I think most of the apps I've worked on end up being IO bound because "waiting on SQL results" or the more generic "waiting on downstream results" is 90% of their runtime. They might spend more time reading those responses off the wire than they do actually processing anything.
There are definitely things that isn't true of though, and your metrics read about right to me.
I don't know what Intellij's AI integration is like, but my brief Claude Code experience is that it really chews through tokens. I think it's a combination of putting a lot of background info into the context, along with a lot of "planning" sort of queries that are fairly invisible to the end user but help with building that background for the ultimate query.
Aider felt similar when I tried it in architect mode; my prompt would be very short and then I'd chew through thousands of tokens while it planned and thought and found relevant code snippets and etc.
They're presumably not crawling the same page repeatedly, and caching the pages long enough to persist between crawls would require careful thinking and consultation with clients (e.g. if they want their blog posts to show up quickly, or an "on sale" banner or etc).
It'd probably be easier to come at it from the other side and throw more resources at the DB or clean it up. I can't imagine what's going on that it's spending a full second on DB queries, but I also don't really use WP.
It's been a few years since I last worked with WP, but the performance issue is because they store a ton of the data in a key-value store instead of tables with fixed columns.
This can result in a ton of individual row hits on your database, for what in any normal system is a single 0.1ms (often faster) DB request.
Any web scraper that is scraping SEQUENTIALLY at 1 r/s is actually a well-behaved and non-intrusive scraper. It's just that WP is in general ** for performance.
If you want to see what a bad scraper with barely any limits does with parallel requests, yeah, WP goes down without putting up any struggle. But everybody wanted to use WP, and now those ducks are coming home to roost when there is a bit more pressure.
Is that WP Core or a result of plugins? If you know offhand, I don't need to know bad enough to be worth digging in.
> Any web scraper that is scraping SEQUENTIALLY at 1 r/s is actually a well-behaved and non-intrusive scraper.
I think there's still room for improvement there, but I get what you mean. I think an "ideal" bot would base its QPS on response time and back off if it goes up, but it's also not unreasonable to say "any website should be able to handle 1 QPS without flopping over".
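Roughly the kind of thing I have in mind - a sketch where the slow-response threshold, the backoff factors, and the 1 req/s baseline are all arbitrary numbers, not anything standardized:

    import time
    import urllib.request

    def polite_crawl(urls, base_delay=1.0, slow_threshold=1.0, max_delay=30.0):
        """Yield (url, body) pairs, slowing down when the server slows down."""
        delay = base_delay
        for url in urls:
            start = time.monotonic()
            with urllib.request.urlopen(url, timeout=30) as resp:
                body = resp.read()
            elapsed = time.monotonic() - start

            # Response time creeping up? Back off exponentially.
            # Server healthy? Drift back toward the 1 req/s baseline.
            if elapsed > slow_threshold:
                delay = min(delay * 2, max_delay)
            else:
                delay = max(base_delay, delay * 0.9)

            yield url, body
            time.sleep(delay)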
> It's just that WP is in general * for performance.
WP gets a lot of hate, and much of it is deserved, but I genuinely don't think I could do much better with the constraint of supporting an often non-technical userbase with a plugin system that can do basically arbitrary things with varying qualities of developers.
> But everybody wanted to use WP, and now those ducks are coming home to roost when there is a bit more pressure.
This is actually an interesting question, I do wonder if WP users are over-represented in these complaints and if there's a potential solution there. If AI scrapers can be detected, you can serve them content that's cached for much longer because I doubt either party cares for temporally-sensitive content (like flash sales).
A combination of all of them... Take into account that it's been 8 years since I last worked in PHP and WordPress, so maybe things have improved, but I doubt it, as some issues are structural.
* PHP is a fire-and-forget programming language. Whenever you do a request, there is no persistence of data (unless you offload it to an external cache server). This results in re-rendering all the PHP code on every request.
* Then we have WP core, which is not exactly shy in its calls to the DB. The way they store data in a key/value system really hurts the performance. Remember what I said above about PHP... so if you have a design that is heavy on DB calls, the language needs to redo all of those calls on every request.
* Followed by... extensions that are, let's just say, not always optimally written. The plugins are often the main reason why you see so many leaked databases on the internet.
The issue with WP is that its design is like 25 years old. It gained most of its popularity because it was free and you were able to extend it with plugins. But it's that same plugin system that made it harder for the WP developers to really tackle the performance issues, as breaking a ton of plugins often results in losing market share.
The main reason WP has survived the increased web traffic is that PHP has increased in performance by a factor of 3x over the years, combined with server hardware itself getting faster and faster. It also helped that cache plugins exist for WP.
But now, as you have noticed, when you have a ton of passive or aggressive scrapers hitting WP websites, the cache plugins that have been the main protection layer keeping WP sites functional can't handle it. Scrapers hit every page, even pages that are unpopular/archived/... and normally never get cached. Because you're getting hit on those unpopular pages, the fundamental weakness of WP shows.
The only way you can even slightly deal with this type of behavior (beyond just blocking scrapers) is by increasing your database memory limits by a ton, so you're not doing constant swapping. Increase the caching of pages in your actual WP cache extensions, so more is held in memory. You're probably also looking at increasing the number of PHP instances your server can run, more DB ...
But that assumes you have control over your WP hosting environment. And the companies that often host 100,000 or millions of sites are not exactly motivated to throw tons of money at the problem. They prefer that you "upgrade" to more expensive packages that will only partially mitigate the issue.
In general, everybody is f___ed ... The amount of data scraping is only going to get worse.
Especially now that LLMs have tool usage - as in, they can search the internet for information themselves. This is going to result in tens of millions of requests from LLMs. Somebody searching for a cookie recipe may result in dozens of page hits in a second, where a normal user in the past first did a Google search (hitting Google's cache) and only then opened a page... not what they want, go back, try somewhere else. What may have been 10 requests over multiple sites across a 5-10 minute time frame is now going to be dozens of requests per second, in parallel.
LLMs are great search engines, but as the tech moves more to consumer-level hardware, you're going to see this only getting worse.
The solution is a fundamental rework of a lot of websites. One of the main reasons I switched out of PHP years ago, and eventually settled on Go, was that even at that time we were already hitting limits. It's one of the reasons Facebook made Hack (PHP with persistence and other optimizations). If you render complete pages every time, you're just giving away performance. If you can't cache data internally... you get the point.
> This is actually an interesting question, I do wonder if WP users are over-represented in these complaints and if there's a potential solution there. If AI scrapers can be detected, you can serve them content that's cached for much longer because I doubt either party cares for temporally-sensitive content (like flash sales).
The issue is not cached content; it's that they go for all the data in your database. They do not care if your articles are from 1999.
The only way you can solve this issue is by having API endpoints on every website, where scrapers can feed on your database data directly (so you avoid needing to render complete pages), AND where they can feed on /api/articles/latest-changed or something like that.
And that assumes this is standardized across the industry. Because if it's not, it's just easier for scrapers to go after all pages.
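Purely as a sketch of what that could look like, with the URL shape and JSON fields invented for illustration (no such standard exists today):

    import json
    import urllib.parse
    import urllib.request

    def fetch_changes(base_url: str, since_iso: str):
        """Ask a (hypothetical) standardized endpoint what changed since a timestamp."""
        query = urllib.parse.urlencode({"since": since_iso})
        url = f"{base_url}/api/articles/latest-changed?{query}"
        with urllib.request.urlopen(url, timeout=30) as resp:
            return json.load(resp)   # e.g. [{"url": ..., "modified": ...}, ...]

    # A scraper would then fetch only the returned URLs instead of
    # re-rendering every page on the site.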
FYI: I wrote my own scraper in Go; on a dual-core VPS that costs 3 euro a month, it can do 10,000 scrapes per second (we are talking direct scrapes, not going through a browser to deal with JS detection).
Now, do you want to guess the resource usage on your WP server if I let it run wild ;) You're probably going to spend 10 to 50x more money just to feed my scraper without me taking your website down.
Now, do I do 10,000 requests per second? No... Because 1 r/s per website is still 86,400 page hits per day. And because I combined this with actually looking up websites that had a "latest xxxx" page and caching that content, I knew I only needed to scrape X amount of new pages every 24h. So it took me a month or 3 for some big website scraping, and after that you do not even see me, as I am only doing page updates.
But that takes work! You need to design this for every website, and some websites do not have any good spot you can hook into for a low-resource "is there something new" check.
And I'm not even talking about websites that actively try to make scraping difficult (constantly changing tags, dynamic HTML blocks on each render, JS blocking, forced captchas), which ironically hurts them more, as it can result in full rescrapes of their sites.
So ironically, the easiest solution for less scrupulous scrapers is to simply throw resources at the issue. Why bother with "is there something new" effort on every website when you can just rescrape every page link you find using a dumb scraper, compare it against your local cache checksum, and then update your scraped page result? And then you get those over-aggressive scrapers that DDoS websites. Combine that with half of the internet being WP websites +lol+
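The checksum comparison itself is trivial - something like this (my actual scraper is in Go; this is just a Python sketch, with the storage details assumed):

    import hashlib

    def page_checksum(html: str) -> str:
        return hashlib.sha256(html.encode("utf-8")).hexdigest()

    def changed_since_last_crawl(url: str, html: str, cache: dict) -> bool:
        """Compare the freshly scraped page against the locally cached checksum."""
        digest = page_checksum(html)
        if cache.get(url) == digest:
            return False          # unchanged: nothing to update
        cache[url] = digest       # new or changed: store and reprocess
        return True

The cheap part is on the scraper's side; the website still paid full price to render the page that got checksummed.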
The amount of resources needed to scrape is so small, and the more you try to prevent scrapers, the more you're going to hinder your own customers / legit users.
And again, this is just me scraping some novel/manga websites for my own private usage / datahoarding. The big boys have access to complete IP blocks, can resort to using home IPs (as some sites detect whether you're coming from a datacenter-leased IP or a home ISP IP), and have way more resources available to them.
This has been way too long, but the only way to win against scrapers is a standardized way to do legit scraping. Ironically, we used to have this with RSS feeds years ago, but everybody gave up on them. When there is an easier endpoint for scrapers, a lot of them have less incentive to just scrape your every page. Will there be bad guys? Yep, but then it becomes easier to just target them until they also comply.
But the internet will need to change into something new for it to survive this new era... And I think standardized API endpoints will be that change. Or everybody needs to go behind login pages, but yeah, good luck with that, because even those are very easy to bypass with account-creation solutions.
Yeah, everybody is going to be f___ed, because small websites can forget about making money with advertising. The revenue model is going to change too. We already see this with Reddit selling their data directly to Google.
> The way they store data in a key/value system really hurts the performance
It doesn't, unless your site has a lot of post/product/whatever entries in the DB and your users search among them with multiple criteria at the same time. Only then does it cause many self-joins and create performance concerns. Otherwise the key-value setup is very fast when it comes to just pulling key+value pairs for a given post/piece of content.
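To make the self-join point concrete, here is a toy sqlite3 sketch of a postmeta-style key/value layout (schema and data are simplified stand-ins, not WordPress's actual tables):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE postmeta (post_id INTEGER, meta_key TEXT, meta_value TEXT);
        INSERT INTO postmeta VALUES
            (1, 'color', 'red'),  (1, 'size', 'L'),
            (2, 'color', 'red'),  (2, 'size', 'M'),
            (3, 'color', 'blue'), (3, 'size', 'L');
    """)

    # Searching on "color = red AND size = L" needs the table joined to
    # itself once per extra criterion; N filters means N-1 self-joins,
    # which is where the cost shows up.
    rows = db.execute("""
        SELECT m1.post_id
        FROM postmeta AS m1
        JOIN postmeta AS m2 ON m2.post_id = m1.post_id
        WHERE m1.meta_key = 'color' AND m1.meta_value = 'red'
          AND m2.meta_key = 'size'  AND m2.meta_value = 'L'
    """).fetchall()
    print(rows)  # [(1,)]

    # Pulling ALL key/value pairs for one post, by contrast, is a single cheap lookup:
    print(db.execute(
        "SELECT meta_key, meta_value FROM postmeta WHERE post_id = 1"
    ).fetchall())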
Today Wordpress is able to easily do 50 req/sec cached (locally) on $5/month hosting with PHP 8+. It can easily do 10 req/sec uncached for logged in users, with absolutely no form of caching. (though you would generally use an object cache, pushing it much higher).
White House is on Wordpress. NASA is on Wordpress. Techcrunch, CNN, Reuters and a lot more.
Just want to point out that your 50 req/sec cached means nothing when dealing with scrapers, which is the entire topic...
The issue is that scrapers hit so many pages that you can never cache everything.
If your website is a 5-page blog with no built-up archive of past posts, sure... scrapers are not going to hurt, because they keep hitting the cached pages and resetting the invalidation.
But for everybody else, getting hit on uncached pages results in heavy DB load and kills your performance.
Scrapers do not care about your top (cached) pages, especially aggressive ones that just rescrape non-stop.
> It doesn't, unless your site has a lot of post/product/whatever entries in the DB
Exactly what is being hit by scrapers...
> White House is on Wordpress. NASA is on Wordpress. Techcrunch, CNN, Reuters and a lot more.
Again, not the point. They can throw resources at the problem and cache tons of data with 512GB/1TB WordPress/DB servers, which effectively turns WP into a mostly static site.
It's everybody else that feels the burn (see the article, the previous poster, and others).
Do you understand the issue now? WP is not equipped to deal with this type of traffic, as it's not normal human traffic. WP is not designed to handle this; it barely handles normal traffic without a lot of resources thrown at it.
There is a reason the Reddit/Slashdot effect exists. Just a few thousand people going to a blog tends to make a lot of WP websites unresponsive. And that is with the ability to cache those pages!
Now imagine somebody like me letting a scraper loose on your WP website. I can scrape 10,000 pages/sec on a 4-bucks VPS. But each page I hit that is not in your cache will make your DB scream even more, because of how WP works. So what are you going to do with your 50 req/s cached, when my next 9,950 req/s hit all your non-cached pages?! You get the point?
And FYI: 10,000 r/s on your cached pages will also make your WP install unresponsive. Scraper resource usage vs WP is a fight nobody wins.
I’ve never been in a private subreddit, and the only public Discords I’ve been in are corporately managed with “community managers” and stuff.