
Rainbird: The Way Twitter Counts Tweets In Realtime (Soon To Be Open Sourced) - ssclafani
http://techcrunch.com/2011/02/04/twitter-rainbird/
======
simonw
I was at the presentation - it's a very smart system. They basically maintain
a whole bunch of counters for any particular thing that's being tracked. For
example, say someone clicks on a t.co link to blog.example.com/foo at 11:41am
on 1st Feb. Rainbird would increment counters for:

t.co click: com (all time)

t.co click: com.example (all time)

t.co click: com.example.blog (all time)

t.co click: com.example.blog /foo (all time)

t.co click: com (1st Feb 2011)

t.co click: com.example (1st Feb 2011)

t.co click: com.example.blog (1st Feb 2011)

t.co click: com.example.blog /foo (1st Feb 2011)

t.co click: com (11am-12 on 1st Feb)

t.co click: com.example (11am-12 on 1st Feb)

t.co click: com.example.blog (11am-12 on 1st Feb)

t.co click: com.example.blog /foo (11am-12 on 1st Feb)

t.co click: com (11:41-42 on 1st Feb)

t.co click: com.example (11:41-42 on 1st Feb)

t.co click: com.example.blog (11:41-42 on 1st Feb)

t.co click: com.example.blog /foo (11:41-42 on 1st Feb)

So that's 16 counters to track one link, but it means they can do fast,
denormalised queries in realtime to track how that link is performing.
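
As a rough illustration of the fan-out described above (not Rainbird's actual code), here is a minimal Python sketch that expands one click into the 16 counter names. The function name `counter_keys` and the key format are assumptions made for this example.

```python
from datetime import datetime
from urllib.parse import urlsplit


def counter_keys(event, url, when):
    """Return every counter name touched by one event (hypothetical key format)."""
    parts = urlsplit(url)
    # Reverse the host labels: blog.example.com -> ['com', 'example', 'blog']
    labels = parts.hostname.split(".")[::-1]
    # Hierarchy levels: com, com.example, com.example.blog, then host plus path.
    levels = [".".join(labels[:i]) for i in range(1, len(labels) + 1)]
    if parts.path and parts.path != "/":
        levels.append(levels[-1] + " " + parts.path)
    # Time buckets: all time, day, hour, minute.
    buckets = [
        "all time",
        when.strftime("%Y-%m-%d"),
        when.strftime("%Y-%m-%d %H:00"),
        when.strftime("%Y-%m-%d %H:%M"),
    ]
    return [f"{event}: {level} ({bucket})" for bucket in buckets for level in levels]


keys = counter_keys("t.co click", "http://blog.example.com/foo",
                    datetime(2011, 2, 1, 11, 41))
print(len(keys))  # 16
```

Each write costs 16 increments, but a read for any level and time bucket is a single key lookup.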

It's not just for t.co links - they can use it for internal server monitoring
tools, tweet counts, advertising metrics... pretty much anything that involves
counting at scale.

It's possible to build a similar system for much smaller scale applications
using atomic counters in Redis - I've been experimenting with something like
that for some of my own projects.
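
A minimal sketch of the Redis variant, assuming `client` is a redis-py `redis.Redis` instance and `keys` is the list of counter names for one event. The helper name `record_event` is made up for this example.

```python
def record_event(client, keys, amount=1):
    """Increment every counter for one event in a single round trip.

    `client` is assumed to be a redis-py client; INCRBY is atomic on the
    server, so concurrent writers do not lose updates.
    """
    pipe = client.pipeline()
    for key in keys:
        pipe.incrby(key, amount)
    # execute() sends the queued commands and returns the new counter values.
    return pipe.execute()
```

Usage would look something like `record_event(redis.Redis(), ["t.co click: com (all time)"])`, and reading a counter back is a plain `GET` on the same key.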

~~~
iampims
We use Redis for this kind of stat tracking at <http://www.formspring.me/>

At 10,000,000 responses per day, we're not that small.

Here's the source from our collector: <https://gist.github.com/794474>

------
jhugg
Shameless self-promotion, but important point nonetheless:

VoltDB is truly fantastic at this kind of workload. It's one of the first use
cases we had traction with. Also, VoltDB is open source today.

~~~
lsb
VoltDB looks pretty awesome, but I'm pretty concerned by its inability to join
a table to itself, to join more than six tables, or to aggregate or select
distinct on arbitrary values.
<http://community.voltdb.com/docs/ReleaseNotes/index>

~~~
jhugg
At VoltDB, we're working on improving our SQL support. Initially, we focused
on core SQL useful for OLTP. We've added functionality with every release and
we plan to continue in 2011.

For example, the March release of VoltDB will support more than 6 table joins
and has some improvements to aggregation and distinct code. It also has a much
more usable explain plan feature.

We feel that VoltDB offers one of the richer query interfaces of systems that
scale to its level, but we don't plan to sit still.

------
ajays
I'm missing the point, so I hope the HN community can enlighten me, but: why
do you need Cassandra and all that jazz, if you're just incrementing counters?

A decent memcached instance on modern hardware can easily push several hundred
thousand updates/sec. Couldn't you do the same with a pool of sharded memcached
servers? Compute the MD5 hash of each string you want to count (there can be 16
such strings, per the top-rated comment above), and just use that as the key.
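
To make the proposal concrete, here is a small sketch of the MD5-keyed sharding idea. The server addresses and the function name `shard_and_key` are hypothetical, and simple modulo sharding is an assumption chosen for brevity.

```python
import hashlib


def shard_and_key(counter_name, servers):
    """Map a counter name to (server, key) using its MD5 digest."""
    digest = hashlib.md5(counter_name.encode("utf-8")).hexdigest()
    # Modulo sharding: the same name always lands on the same server,
    # as long as the server list does not change.
    server = servers[int(digest, 16) % len(servers)]
    return server, digest


servers = ["mc1:11211", "mc2:11211", "mc3:11211"]  # hypothetical pool
server, key = shard_and_key("t.co click: com (all time)", servers)
```

The increment itself would then be a memcached `incr` on `key` against `server`. Note that modulo sharding remaps most keys whenever the pool size changes, which is one reason consistent hashing is often used instead.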

We've seen people push 750K QPS on InnoDB via HandlerSocket
<http://news.ycombinator.com/item?id=1886137>, so imagine what you could do
with a sharded pool of 20 InnoDB servers.

Again: if I'm missing something, I'd love to learn.

~~~
saurik
The day half your memcached data center loses power will be a sad sad day
indeed, and the story of 750k QPS on InnoDB was about read traffic, not
writes.

------
siculars
Watch this talk[0] from QCon SF, November 2010. Ryan King goes into a lot of
detail about many NoSQL implementations at Twitter. This is the first time I
had heard about distributed counting by way of Cassandra, used to implement
tweet counts specifically and to count other things in general. In particular,
Cassandra is pointed out as having tremendous write availability to enable
this sort of thing. He also mentions various specifics of how the feature is
coded and designed, and the different approaches they had to try before they
got to a final version.

[0] <http://www.infoq.com/presentations/NoSQL-at-Twitter-by-Ryan-King>

------
darwinGod
On another note, last year Twitter open sourced Gizzard:
<http://engineering.twitter.com/2010/04/introducing-gizzard-framework-for.html>
(ironically, nobody seems to have tweeted this article!). Out of natural
curiosity, I downloaded the repo and tried to understand what it was. Later,
the buzz surrounding it (on HN at least) didn't seem to last long, and I
forgot about Gizzard completely. Are there startups or big companies using
Gizzard, or Twitter's other open source stuff?

~~~
deepu_256
I played with Gizzard for one of our services, with Redis as the backend.
Twitter engineers on the #twinfra IRC channel helped me a lot with
understanding the Gizzard and FlockDB source code. We're not using it in
production yet, though.

Do you have any questions?

------
yarapavan
First announced at the Strata conference. Here is the link to the
presentation, Realtime Analytics at Twitter:

<http://assets.en.oreilly.com/1/event/55/Realtime%20Analytics%20at%20Twitter%20Presentation.pdf>

------
fudge
Heh, this is pretty much exactly what I've been toying with in my spare time
lately at <http://illustrend.com>, only with Erlang instead of Scala.

------
cagenut
I particularly liked the first line on slide 48

 _"We require operational reporting on all services"_

Feels like everyone learns that the hard way; Twitter just did it much more
visibly.

------
mmaunder
Twitter is clearly brilliant at creating a viral and useful product. I admire
everything they've achieved in terms of user adoption and usefulness.

They've proven without a doubt that technology is only one ingredient and it
doesn't have to work well for a web or mobile business to grow.

They're the last company I'll look to for technology to use in my business or
as an example on how to run operations. I have no interest in any of their
open sourced products or in any advice from them on how to run my data center.

~~~
trafficlight
On what basis? Because they've been through some rough patches where they
couldn't keep up with the traffic, you'll dismiss everything they've built?

~~~
moe
I agree with his general sentiment, I'd only phrase it differently.

The point is not that they've been "through some rough patches". The point is
that they failed for years to come up with a reliable implementation of a
solved problem: pub/sub messaging.

Twitter is not "large" by any means. Your telco, stock exchange and many other
companies have dealt with the same problem space for decades. Those
reliably dispatch orders of magnitude higher throughput under much more
complex routing conditions. Many of them operate under SLAs that mandate five
or even six nines of availability.

Sure, Twitter is (thankfully) not dispatching emergency calls, so their
requirements are lower. However, given their track record they're in no
position to give technology advice either.

50 million tweets/day[1] (roughly 580 per second on average) is not a serious
workload for a messaging system.

[1] <http://mashable.com/2010/02/22/twitter-50-million-tweets/>

------
vokoda
"twitter <3 open source"

try-hard or what...

~~~
Helianthus16
The very existence of that "insult" amuses me. I miss the days when cooler
than cool meant ice cold.

<meta>I posted this partly as an experiment, to examine the downvote clumping
effect I've noticed on HN, where some reasonable responses to highly downvoted
posts get downvoted by association.</meta>

~~~
dhotson
It's probably the "Don't feed the trolls" effect.. ;-)

------
snissn
Meh? Ultimately the back end for it is Cassandra, correct? So they're, what,
going to open source their implementation of doing stats in Cassandra? It's
great for people to open source code, but I think it's more a championing
of Cassandra than it is a 'code dump'.

Also in that vein, the formspring comment / gist link is very commendable and
awesome!

------
ck2
That's nice, but when are we going to get the deep search that goes back
more than a week, which they promised over a year ago?

