Do you mean it’s not easy to *define*? It shouldn’t be difficult to calculate an...

lotso · on Oct 26, 2017

Calculating these metrics at scale is not trivial.

Dylan16807 · on Oct 26, 2017

In real time, yes.

But the user database should already have backups, importing those backups into an analysis server should be easy, and running queries like that on an analysis server should be easy.

Counting messages, or users with X messages, etc. is also largely a function of whether your backup/restore system works. But this time you do it in chunks.
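As a rough illustration of the chunked approach, here is a minimal Python sketch. It assumes each chunk of a restored backup can be read as an iterable of user IDs, one per message; the function names and the data layout are hypothetical and not taken from any real system.

```python
from collections import Counter

def count_messages_in_chunks(chunks):
    """Accumulate per-user message counts one chunk at a time.

    `chunks` is assumed to be an iterable of iterables of user IDs,
    with one ID per message (a hypothetical layout).
    """
    totals = Counter()
    for chunk in chunks:
        # Only one chunk's messages are processed at a time;
        # the running totals hold one integer per user.
        totals.update(chunk)
    return totals

def users_with_at_least(totals, threshold):
    """Count users whose total message count is at least `threshold`."""
    return sum(1 for n in totals.values() if n >= threshold)

# Made-up example data: three chunks of message authors.
example_chunks = [
    ["alice", "bob", "alice"],
    ["bob", "carol"],
    ["alice"],
]
totals = count_messages_in_chunks(example_chunks)
heavy_users = users_with_at_least(totals, 2)
```

The running totals still need one counter per user, so this only bounds the memory used for messages, not for the user set.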

squarecog · on Oct 26, 2017

I helped build Twitter's data platform, 2010-2016.

There isn't an "analysis server" and analyzing user activity is not done on a "user database backup" at Twitter's scale, though indeed that's a common way that would be done for smaller businesses.

By the way, if by user db you literally mean the db with user accounts, that's not the right data source -- you want the user _activity_ db to count active users, and for high-scale applications, those are different things. Presumably user activity updates are orders of magnitude more frequent than user object updates. You don't want to thrash your user db by constantly updating some "last seen at" field. Put that stuff somewhere else.

That said, it's true that counting is simple: it's just a job for Hadoop / Spark / your distributed computing platform of choice. Filter, distinct, count. It's not even hard in real time if you have enough RAM or are OK with approximate counts with bounded error, thanks to Storm, Heron, Flink, etc.
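For illustration only, the "filter, distinct, count" shape looks roughly like this in plain, single-machine Python. The event schema, the event names, and the set of qualifying events are all assumptions made up for this sketch; a real job would run the same three steps on a distributed platform.

```python
from datetime import datetime

# Hypothetical activity events: (user_id, event_type, timestamp).
events = [
    ("alice", "tweet", datetime(2017, 10, 1)),
    ("bob", "login", datetime(2017, 10, 5)),
    ("alice", "like", datetime(2017, 10, 20)),
    ("carol", "tweet", datetime(2017, 8, 30)),
    ("dave", "sdk_ping", datetime(2017, 10, 10)),
]

# Assumed definition of "active": at least one of these events in the window.
QUALIFYING = frozenset({"tweet", "like", "login"})

def count_active_users(events, window_start, window_end, qualifying=QUALIFYING):
    """Filter events to the window and qualifying types, then count distinct users."""
    in_window = (
        user_id
        for user_id, event_type, ts in events
        if window_start <= ts < window_end and event_type in qualifying
    )
    # Building a set is the "distinct" step; len() is the "count" step.
    return len(set(in_window))

october = (datetime(2017, 10, 1), datetime(2017, 11, 1))
monthly_active = count_active_users(events, *october)

# Same data, looser definition: also treat "sdk_ping" as activity.
looser = count_active_users(events, *october, qualifying=QUALIFYING | {"sdk_ping"})
```

With this made-up data the first definition yields 2 active users and the looser one yields 3. The counting code is identical in both cases; only the definition of "active" changed.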

Defining what exactly constitutes an active user and catching edge cases such as this Digits thing is where things get tricky; the number of weird scenarios that cause under/overcount for what seem like reasonable and straightforward definitions would surprise you.

@baddox nailed it.

Dylan16807 · on Oct 26, 2017

Thanks. Note that I wasn't trying to guess at what Twitter does, just to provide a workflow that should be viable almost anywhere, in the absence of easier options. It's good to hear that the underlying idea of "calculating the metric isn't the hard part" is true.