DataSift Architecture: Realtime Datamining at 120,000 Tweets Per Second (highscalability.com)
60 points by aespinoza on Nov 29, 2011 | 10 comments


I find the architecture really interesting and useful to learn from. However, they are way too expensive. 1000 tweets is not very much data. I'm building a realtime app now and am easily processing tens of thousands of relevant tweets a day. While a service like DataSift could take a lot of heavy lifting off my hands, the cost just doesn't make up for it. It feels like their business model is currently focused on use cases requiring highly specific targeting, not on services that need high volumes of certain types of data. Shame, that.


@geuis: the cost for 1000 tweets is $0.10, and that's the Twitter license fee. You get $10 in free credits when you sign up, and you can have an On Demand plan starting from $10. That should make it very affordable for anyone wanting to play with the Twitter firehose; the barrier to entry has never been so low.
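To put the numbers from this exchange side by side, here is a minimal sketch of the arithmetic. It assumes only the $0.10 per 1000 tweets license fee and the $10 free credit quoted above; the 30,000 tweets/day volume is a hypothetical stand-in for "tens of thousands", and any charges beyond the license fee are not modeled.

```python
from decimal import Decimal

# Figures quoted in the comment above.
FEE_PER_1000 = Decimal("0.10")   # Twitter license fee per 1000 tweets, in USD
FREE_CREDIT = Decimal("10")      # sign-up credit, in USD

# Hypothetical volume, standing in for "tens of thousands of tweets a day".
ASSUMED_TWEETS_PER_DAY = 30_000


def license_cost(tweets: int) -> Decimal:
    """License fee in USD for a given number of tweets."""
    return FEE_PER_1000 * tweets / 1000


def tweets_covered(credit: Decimal) -> int:
    """How many tweets a given amount of credit pays for (license fee only)."""
    return int(credit / FEE_PER_1000 * 1000)


daily = license_cost(ASSUMED_TWEETS_PER_DAY)
monthly = daily * 30
covered = tweets_covered(FREE_CREDIT)
```

Under those assumptions the license fee comes to $3 a day (about $90 over 30 days), and the $10 credit covers 100,000 tweets.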


Where are you getting your tweet feed? My initial interest in DataSift was because they let you get a feed from the firehose. Twitter doesn't seem to let you do this.

All I want is tweets relevant to a particular subject, and in the early days I don't want to be paying hundreds of dollars for it... I've got keywords and phrases I can use to find them, if I just had access to an API that would let me. (Maybe Twitter offers this; I couldn't find it in the past.)


I'm using the streaming API with track filtering. So far it's working very well for my purposes. You are right that different use cases might need the firehose, and that's where services like Gnip and DataSift really come in handy. It's too bad that there's no middle ground.


If you are just looking for keyword streams, they're available for free: https://dev.twitter.com/docs/streaming-api/methods
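For anyone wondering what keyword ("track") filtering amounts to, here is a rough local sketch. It assumes the semantics as I understand them from the streaming docs: commas separate phrases and act as OR, spaces within a phrase act as AND, and matching ignores case and word order. The function names and the tokenizer are made up for illustration; this is not Twitter's implementation and it makes no API calls.

```python
import re


def parse_track(track: str) -> list[set[str]]:
    """Split a track string into phrases; each phrase is a set of lowercase terms."""
    return [set(p.lower().split()) for p in track.split(",") if p.strip()]


def matches(track: str, text: str) -> bool:
    """True if every term of at least one phrase appears in the text."""
    # Crude tokenizer (an assumption): words, plus leading # and @ kept attached.
    tokens = set(re.findall(r"[\w#@]+", text.lower()))
    return any(phrase <= tokens for phrase in parse_track(track))


track = "datasift,twitter firehose"
hit = matches(track, "The Twitter firehose is expensive")
miss = matches(track, "I like Twitter")
```

Here `hit` is true because both terms of the second phrase are present, while `miss` is false because "firehose" is absent and "datasift" never appears.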


We have recently seen a number of startups (including YC-funded ones, AFAIK) looking for ways to make Twitter data useful. But I don't remember hearing of any such successes.

While DataSift's realtime capabilities look really impressive, I'm afraid there aren't many use cases worth paying for data mined that way. Even DataSift's own list of possible use cases looks bleak.

In any case, DataSift should do fine applying their expertise to other data sources, which don't bear the same cost as Twitter's firehose.


Excellent article.

I worked on a project that I integrated with DataSift.

Lorenzo and everyone else I emailed with were very quick to respond, and the product performed as expected (aside from some hiccups that I can easily put down to growing pains).

Compared to Gnip (which we also integrated with), DataSift won hands down on both quality of product and customer relations.

However, I'm still skeptical of the usefulness of Twitter analytics/data mining.


There is a lot of value in mining data from social media, but only if you can turn it into simple, easy use cases. We are doing something like that, and we are very happy with the way it is shaping up.


[deleted]


I find it interesting, especially which tools/languages they use and how their architecture is designed to process such a large number of tweets.


Ugh, sorry, I finished reading right at "Information Sources", thinking that was the end of the article, and missed everything after that. I'll delete my previous comment, thanks.



