These 5 categories are roughly equally split in aggregate traffic -- somewhere between 15% and 25% per category. You're right that certain kinds of sites, like e-commerce, are heavily weighted toward search -- but this is not broadly or necessarily true for the whole "content universe" of news, information, & entertainment sites, including blogs and so on.
Our data reveal all sorts of interesting patterns that go against mainstream assumptions about how people read/watch content online. For example, a measly 1% of traffic to content publishers comes from Twitter, even though Twitter certainly seems like it drives way more than 1% of the conversation, especially in certain categories of content. I wrote about that phenomenon here:
If you care to go deeper, one of our data analysts, Kelsey, did a nice deep dive on the different kinds of traffic sources that resonate with different content categories here:
If so, I can say that over time, we have improved our use of bot lists, though that's just an IP blocking thing. Non-human traffic detection is not presently a strong focus of the company, though people have asked us to invest there. The issue is that non-human traffic detection is a somewhat gnarly problem in its own right, with its own vendors (mostly cybersecurity vendors) trying to figure that problem out.
We do know we are missing some traffic due to ad/analytics blockers and Pi-hole-style DNS/VPN blockers, which is fine.
In that study, we found that 32% of visits to pages were "bad visits" (page session <15s). That's a pretty high number, but it would include not just bots but also humans queuing up tabs, Instapaper/Pocket saves, and so on.
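To make the "bad visit" definition concrete, here's a rough sketch of how a share like that could be computed from per-visit session lengths. This is not our actual pipeline: the function name `bad_visit_share` and the sample durations are made up for illustration, and only the 15-second cutoff comes from the study.

```python
from typing import Iterable

# Cutoff taken from the "<15s" definition above; everything else is hypothetical.
BAD_VISIT_THRESHOLD_SECONDS = 15.0


def bad_visit_share(
    session_seconds: Iterable[float],
    threshold: float = BAD_VISIT_THRESHOLD_SECONDS,
) -> float:
    """Return the fraction of visits whose page session is shorter than threshold."""
    durations = list(session_seconds)
    if not durations:
        return 0.0  # no visits, so no bad visits
    bad = sum(1 for seconds in durations if seconds < threshold)
    return bad / len(durations)


if __name__ == "__main__":
    # Made-up session lengths in seconds, purely illustrative.
    sample = [3.0, 12.5, 15.0, 48.0, 240.0]
    print(f"bad visit share: {bad_visit_share(sample):.0%}")
```

Note that a metric like this only looks at session length, so it can't tell a bot apart from a human who opened a tab and never read it.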
I'm just a bit concerned that the Russian malware dudes were repurposing their click fraud for astroturfing way back in 2015, and they had no problem just sitting and building dwell time instead of bouncing (https://www.trustwave.com/en-us/resources/blogs/spiderlabs-b...). I haven't been able to find anything indicating that US media companies have any kind of tracking to defend against, or even identify, a similar strategy being used to hit their article analytics and influence article production/placement. That worries me, especially now that it's known that a Russian information campaign against the US was going on at the time.