
Hmm, that's an interesting question, but I'm not sure I fully understand it. By invalid traffic, do you just mean non-human (bot) traffic?

If so, I can say that over time, we have improved our use of bot lists, though that's just an IP blocking thing. Non-human traffic detection is not presently a strong focus of the company, though people have asked us to invest there. The issue is that non-human traffic detection is a somewhat gnarly problem in its own right, with its own vendors (mostly cybersecurity vendors) trying to figure that problem out.

We do know we are missing some traffic due to ad/analytics blockers and Pi-hole-style DNS blockers, which is fine.

One way we have thought about guesstimating "noise/dirt" in the data is to use one of our universally measured metrics, engaged time. So, we could separate really short page sessions from the rest, under the assumption that if a page session is super short, it's either a mindless human click or a JavaScript-enabled bot crawl. I discussed this on our blog a while back when we did a data study on the subject:


In that study, we found that 32% of visits to pages were "bad visits" (page session <15s), a pretty high number, but that would include not just bots, but also humans queuing up tabs, Instapaper/Pocket saves, and so on.
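
To make the idea concrete, here is a rough sketch of that split. It only assumes that each page session carries an engaged-time value in seconds; the `PageSession` type, its field names, and the function names are hypothetical and not taken from any real analytics API. The 15-second cutoff is the one from the study.

```python
from dataclasses import dataclass

# Cutoff from the study mentioned above: sessions under 15s count as "bad visits".
SHORT_SESSION_SECONDS = 15.0


@dataclass
class PageSession:
    # Hypothetical record shape, for illustration only.
    url: str
    engaged_seconds: float


def split_by_engaged_time(sessions, threshold=SHORT_SESSION_SECONDS):
    """Return (short, rest): sessions below the threshold, and all others."""
    short = [s for s in sessions if s.engaged_seconds < threshold]
    rest = [s for s in sessions if s.engaged_seconds >= threshold]
    return short, rest


def short_session_share(sessions, threshold=SHORT_SESSION_SECONDS):
    """Fraction of sessions below the threshold (0.0 for an empty input)."""
    if not sessions:
        return 0.0
    short, _ = split_by_engaged_time(sessions, threshold)
    return len(short) / len(sessions)
```

Note that this only gives an upper bound on "noise": the short bucket mixes bots with real humans who bounced quickly, and a bot that deliberately idles on the page would land in the other bucket.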

Apologies for the terminology. By invalid traffic I'm referring to bots as well as click farms and other issues, as used in the Media Rating Council's definition (they divide it into general and sophisticated invalid traffic, both of which cover a lot of types of traffic, http://mediaratingcouncil.org/101515_IVT%20Addendum%20FINAL%...).

I'm just a bit concerned that the Russian malware dudes were re-purposing their click fraud for astroturfing way back in 2015 and they had no problem just sitting and building dwell time instead of bouncing (https://www.trustwave.com/en-us/resources/blogs/spiderlabs-b...). I haven't been able to find anything indicating that US media companies have any kind of tracking to defend against or even identify a similar strategy being used to hit their article analytics to influence article production/placement, especially when it's now known that a Russian information campaign against the US was going on at the time.

I'd love to have us do better here and you sound very knowledgeable on the subject. Willing to reach out to me by email? ~email redacted~
