
Hey, CTO of Parse.ly here. Might be surprising, but news sites get their traffic from 5 broad categories: (1) search engines (mostly Google in US); (2) social networks (Facebook, LinkedIn, Twitter, Pinterest, etc.); (3) editorial & recirculation (homepage promotion, article-to-article promotion); (4) direct, text, & email, which covers things like WhatsApp and manually shared email links; and (5) the long tail ("other"), which covers things like Google News, Flipboard, blogs, and site-to-site links.

These 5 categories are roughly equally split in aggregate traffic -- somewhere between 15% and 25% per category. You're right that certain kinds of sites, like e-commerce, are heavily weighted toward search -- but this is not broadly or necessarily true for the whole "content universe" of news, information, & entertainment sites, including blogs and so on.
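To make the five buckets concrete, here is a rough sketch of how a referrer could be sorted into them. This is not Parse.ly's actual classifier: the domain lists, the `classify_referrer` name, and the rule that an empty referrer counts as "direct" are all assumptions for illustration.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately tiny domain lists; a real classifier would need far more.
SEARCH = {"google.com", "bing.com", "duckduckgo.com"}
SOCIAL = {"facebook.com", "linkedin.com", "twitter.com", "t.co", "pinterest.com"}
# Hosts that would otherwise match a search domain but belong in the long tail.
OTHER_OVERRIDES = {"news.google.com"}


def _host(url: str) -> str:
    """Return the lowercase hostname without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def _matches(host: str, domains: set) -> bool:
    """True if host equals a listed domain or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def classify_referrer(referrer: str, site_host: str) -> str:
    """Map a referrer URL to one of: direct, internal, search, social, other."""
    if not referrer:
        return "direct"    # assumption: no referrer means direct/text/email
    host = _host(referrer)
    if _matches(host, {site_host}):
        return "internal"  # editorial & recirculation
    if host in OTHER_OVERRIDES:
        return "other"     # e.g. Google News sits in the long tail
    if _matches(host, SEARCH):
        return "search"
    if _matches(host, SOCIAL):
        return "social"
    return "other"
```

The override set is checked before the search list because a plain suffix match would otherwise file Google News under search, which is not where the breakdown above puts it.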

Our data reveal all sorts of interesting patterns that go against mainstream assumptions about how people read/watch content online. For example, a measly 1% of traffic to content publishers comes from Twitter, even though Twitter certainly seems like it drives way more than 1% of the conversation, especially in certain categories of content. I wrote about that phenomenon here:


If you care to go deeper, one of our data analysts, Kelsey, did a nice deep dive on the different kinds of traffic sources that resonate with different content categories here:


Thanks for posting here, pixelmonkey! I have a question: a while ago I asked the Parse.ly help/marketing people whether you guys were tracking rates of invalid traffic on articles, and they said you don't track it. Is that accurate? Or are there any estimates of how much noise/dirt there is in the data?

Hmm, that's an interesting question, but I'm not sure I fully understand it. By invalid traffic, do you just mean non-human (bot) traffic?

If so, I can say that over time, we have improved our use of bot lists, though that's just an IP blocking thing. Non-human traffic detection is not presently a strong focus of the company, though people have asked us to invest there. The issue is that non-human traffic detection is a somewhat gnarly problem in its own right, with its own vendors (mostly cybersecurity vendors) trying to figure that problem out.
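For readers wondering what "just an IP blocking thing" looks like in practice, a minimal sketch follows. The network list and the `hits` record shape are made up for illustration (the addresses come from ranges reserved for documentation), and this is not a description of Parse.ly's pipeline.

```python
import ipaddress

# Hypothetical bot list, using documentation-reserved address ranges.
BOT_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.17/32"),
]


def is_listed_bot(ip: str, networks=BOT_NETWORKS) -> bool:
    """Return True if the address falls inside any listed network."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False  # unparseable address: not on the list
    return any(addr in net for net in networks)


def drop_listed_bots(hits, networks=BOT_NETWORKS):
    """Keep only hits whose 'ip' field is not on the bot list."""
    return [h for h in hits if not is_listed_bot(h["ip"], networks)]
```

A list like this only catches traffic from addresses someone has already flagged, which is part of why it falls well short of real non-human traffic detection.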

We do know we are missing some traffic due to ad/analytics blockers and Pi-hole-style network-level blockers, which is fine.

One way we have thought about guesstimating "noise/dirt" in the data is to use one of our universally measured metrics, engaged time. So, we could separate really short page sessions from the rest, under the assumption that if a page session is super short, it's either a mindless human click or a JavaScript-enabled bot crawl. I discussed this on our blog a while back when we did a data study on the subject:


In that study, we found that 32% of visits to pages were "bad visits" (page session <15s), a pretty high number, but that would include not just bots, but also humans queuing up tabs, Instapaper/Pocket saves, and so on.
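The engaged-time split is simple enough to sketch. The 15-second cutoff is the one from the study; the `PageSession` type and its field names are hypothetical, and the real analysis may well have been done differently.

```python
from dataclasses import dataclass

SHORT_SESSION_SECONDS = 15  # cutoff quoted from the study


@dataclass
class PageSession:
    url: str                 # hypothetical field
    engaged_seconds: float   # hypothetical field: engaged time on the page


def split_by_engaged_time(sessions, threshold=SHORT_SESSION_SECONDS):
    """Return (short, rest): sessions under the threshold, and all others."""
    short = [s for s in sessions if s.engaged_seconds < threshold]
    rest = [s for s in sessions if s.engaged_seconds >= threshold]
    return short, rest


def short_session_share(sessions, threshold=SHORT_SESSION_SECONDS):
    """Fraction of sessions shorter than the threshold (0.0 if there are none)."""
    if not sessions:
        return 0.0
    short, _ = split_by_engaged_time(sessions, threshold)
    return len(short) / len(sessions)
```

Note that this only measures how many sessions are short. It cannot tell a bot from a human queuing up tabs, and a bot that sits on the page past the cutoff would land in the "rest" bucket.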

Apologies for the terminology - by invalid traffic I mean bots as well as click farms and other issues, following the Media Rating Council's definition (they divide it into general and sophisticated invalid traffic, each of which covers many types of traffic, http://mediaratingcouncil.org/101515_IVT%20Addendum%20FINAL%...).

I'm just a bit concerned that the Russian malware dudes were re-purposing their click fraud for astroturfing way back in 2015, and they had no problem just sitting and building dwell time instead of bouncing (https://www.trustwave.com/en-us/resources/blogs/spiderlabs-b...). I haven't been able to find anything indicating that US media companies have any kind of tracking to defend against, or even identify, a similar strategy being used to hit their article analytics in order to influence article production/placement. That's especially worrying now that it's known a Russian information campaign against the US was going on at the time.

I'd love to have us do better here and you sound very knowledgeable on the subject. Willing to reach out to me by email? ~email redacted~
