t.co click: com (all time)
t.co click: com.example (all time)
t.co click: com.example.blog (all time)
t.co click: com.example.blog /foo (all time)
t.co click: com (1st Feb 2011)
t.co click: com.example (1st Feb 2011)
t.co click: com.example.blog (1st Feb 2011)
t.co click: com.example.blog /foo (1st Feb 2011)
t.co click: com (11am-12 on 1st Feb)
t.co click: com.example (11am-12 on 1st Feb)
t.co click: com.example.blog (11am-12 on 1st Feb)
t.co click: com.example.blog /foo (11am-12 on 1st Feb)
t.co click: com (11:41-42 on 1st Feb)
t.co click: com.example (11:41-42 on 1st Feb)
t.co click: com.example.blog (11:41-42 on 1st Feb)
t.co click: com.example.blog /foo (11:41-42 on 1st Feb)
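The 16 counters above are just the cross product of four domain/path prefixes and four time granularities. A minimal sketch of that key expansion (the key format and function name are illustrative, not Rainbird's actual scheme):

```python
from datetime import datetime

def counter_keys(domain, path, when):
    """Expand one t.co click into hierarchical counter keys.

    domain: reversed-dotted domain, e.g. "com.example.blog"
    path:   optional URL path, e.g. "/foo"
    when:   datetime of the click
    """
    # Domain prefixes: com, com.example, com.example.blog, com.example.blog /foo
    parts = domain.split(".")
    prefixes = [".".join(parts[:i]) for i in range(1, len(parts) + 1)]
    if path:
        prefixes.append(f"{prefixes[-1]} {path}")

    # Time buckets: all time, day, hour, minute
    buckets = [
        "all",
        when.strftime("%Y-%m-%d"),
        when.strftime("%Y-%m-%d %H"),
        when.strftime("%Y-%m-%d %H:%M"),
    ]
    return [f"t.co click: {p} ({b})" for p in prefixes for b in buckets]

keys = counter_keys("com.example.blog", "/foo", datetime(2011, 2, 1, 11, 41))
assert len(keys) == 16  # 4 prefixes x 4 granularities
```

Each click then becomes 16 counter increments, which is exactly what makes the later drill-down queries cheap reads.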
So that's 16 counters to track one link, but it means they can do fast, denormalised queries in real time to track how that link is performing.
It's not just for t.co links - they can use it for internal server monitoring tools, tweet counts, advertising metrics... pretty much anything that involves counting at scale.
It's possible to build a similar system for much smaller scale applications using atomic counters in Redis - I've been experimenting with something like that for some of my own projects.
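For the small-scale Redis version, the core primitive is an atomic increment (Redis's INCR/INCRBY, usually batched in a pipeline). A minimal in-process stand-in showing the pattern — the dict here substitutes for a real Redis connection, and the class name is made up:

```python
from collections import defaultdict

class CounterStore:
    """In-process stand-in for Redis: incr() mirrors the semantics of
    Redis's atomic INCRBY. With real Redis you'd call r.incr(key) via
    redis-py, batching the 16 increments per event in a pipeline."""

    def __init__(self):
        self.counts = defaultdict(int)

    def incr(self, key, amount=1):
        # Atomic in Redis because the server is single-threaded per command;
        # here it's just a plain dict update.
        self.counts[key] += amount
        return self.counts[key]

store = CounterStore()
store.incr("t.co click: com (all)")
store.incr("t.co click: com (all)")
# counter now reads 2 for that key
```

The same hierarchical keys work unchanged; Redis just caps you at roughly one box's worth of write throughput and memory unless you shard.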
At 10,000,000 responses per day, we're not that small.
Here's the source from our collector: https://gist.github.com/794474
VoltDB is truly fantastic at this kind of workload. It's one of the first use cases we had traction with. Also, VoltDB is open source today.
For example, the March release of VoltDB will support more than 6 table joins and has some improvements to aggregation and distinct code. It also has a much more usable explain plan feature.
We feel that VoltDB offers one of the richer query interfaces of systems that scale to its level, but we don't plan to sit still.
- VoltDB isn't log-structured, so you only have to store the current state; how fast you can mutate that state isn't limited by how much RAM you have. We see use cases with utter firehoses of data that update just tens or hundreds of gigabytes of state.
- Beyond normalization, you can probably reduce the number of redundant counters, e.g. use SQL to count which URLs start with "amazon". This would be painful in many systems, but depending on the query, can often be done at scale in VoltDB.
- The byte overhead per counter is also likely much lower in an ACID/Relational store.
Finally, VoltDB is designed to migrate data to disk-based stores (such as Hadoop or an OLAP store) as memory fills up. This is a feature we're working very hard on and see as a big differentiator. It adds complexity if you need to query across stores, but you get a best-of-both-worlds feature set.
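The prefix-aggregation point — deriving "everything starting with amazon" at query time instead of maintaining a dedicated counter for it — is plain SQL. A sketch using sqlite3 as a stand-in (the schema is invented; VoltDB's dialect and partitioning differ in the details):

```python
import sqlite3

# Prefix aggregation in SQL: no redundant per-prefix counter needed,
# the count is computed from the base rows with LIKE.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (url TEXT, hits INTEGER)")
conn.executemany("INSERT INTO clicks VALUES (?, ?)", [
    ("amazon.com/dp/123", 40),
    ("amazon.co.uk/dp/456", 2),
    ("example.com/foo", 7),
])
(total,) = conn.execute(
    "SELECT SUM(hits) FROM clicks WHERE url LIKE 'amazon%'"
).fetchone()
assert total == 42  # both amazon rows, not example.com
```

The trade-off is read-time work versus write-time fan-out; pre-materialized counters win when the same rollups are read constantly.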
Is the database really entirely in-memory, though?
It also supports continuous and transactionally-isolated snapshotting to disk.
A decent memcached instance on modern hardware can easily push several hundred thousand updates/sec. Couldn't you do the same with a pool of sharded memcached servers? Compute the MD5 hash of the string you want to count (which can be one of the 16 strings from the top-rated comment above), and just use that as the key.
We've seen people push 750K QPS on InnoDB via HandlerSocket (http://news.ycombinator.com/item?id=1886137), so imagine what you could do with a sharded pool of 20 InnoDB servers.
Again: if I'm missing something, I'd love to learn.
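The sharding scheme proposed above is just hash-mod routing. A sketch (function name is mine; production memcached clients typically use consistent hashing instead of plain modulo so that adding a shard doesn't remap every key):

```python
import hashlib

def pick_shard(counter_key, num_shards):
    """Route a counter key to a shard in a memcached pool by MD5.
    Deterministic: the same key always lands on the same shard."""
    digest = hashlib.md5(counter_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

shard = pick_shard("t.co click: com.example.blog (all time)", 20)
assert 0 <= shard < 20
```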
Also, Cassandra reduces the operational complexity of having a logical store which spans multiple hosts. Your memcache example does not get persistence for free, and sharding mysql is something you have to do manually. The interface to a Cassandra cluster is the same regardless of how many nodes you are running.
Scaling writes is "hard". Incrementing counters is obviously write heavy, and cassandra aims to make it easier.
You have any questions?
"We require operational reporting on all services"
Feels like everyone learns that the hard way; Twitter just did it much more visibly.
They've proven without a doubt that technology is only one ingredient, and that it doesn't have to work well for a web or mobile business to grow.
They're the last company I'll look to for technology to use in my business or as an example on how to run operations. I have no interest in any of their open sourced products or in any advice from them on how to run my data center.
I'm still not sure, but it seems you're peeved that Twitter hasn't been perfect and not impressed with Rainbird. (Right?)
This is a bit unfair, I think? No one's really waiting for your advice on how to run your data center, whereas Twitter is dealing with the sort of volume that your data center probably couldn't begin to handle without finding some of the same solutions Twitter is now presenting.
Now we could possibly (depending on what you actually meant) get into a debate about Twitter's failures, but that's really not interesting, because your second paragraph is true: technological perfection is not enough by itself.
That doesn't mean Twitter doesn't have anything interesting to say about technological perfection.
But mostly I posted this in order to express the sheer confusion I felt reading your post. If you don't like Twitter, just say so.
The point is not that they've been "through some rough patches". The point is that they failed for years to come up with a reliable implementation of a solved problem: pub/sub messaging.
Twitter is not "large" by any means. Your telco, stock exchange and many other companies have dealt with the same problem space for decades. Those systems reliably dispatch orders of magnitude higher throughput under much more complex routing conditions. Many of them operate under SLAs that mandate five or even six nines of availability.
Sure, Twitter is (thankfully) not dispatching emergency calls, so their requirements are lower. However, given their track record, they're in no position to give technology advice either.
50 million tweets/day is not a serious workload for a messaging system.
try-hard or what...
<meta>I posted this partially to experimentally examine the downvote clumping effect I've noticed on hn, where some reasonable responses to highly downvoted posts get downvoted by association.</meta>
also in that vein, the formspring comment / gist link is very commendable and awesome!