Big Data vs Intelligent Data (and what Startups can do with it)

equark · on Sept 5, 2012

Another key fact is that "big data" is actually not that common, especially when it gets to the analysis stage.

The median job size at Microsoft and Yahoo is only 15GB, and 90% of Hadoop jobs at Facebook are under 100GB. Clearly you want to be able to crunch large log files, but for day-to-day analysis the files are much smaller than that (cite: http://research.microsoft.com/pubs/163083/hotcbp12%20final.p...).

At Sense (http://www.senseplatform.com) most of the clients we work with are struggling not with the size of their data but with tricky modeling problems that don't fit into standard black boxes and with integrating analytics into actual production systems. Adopting something like Hadoop for these tasks is not very productive.

thebigpicture · on Sept 5, 2012

Thank you MS Research for a dose of sanity. "Big data" seems very potent as far as marketing buzzwords go. It plays on people's ignorance and the general sentiment of "too much information".

I'll be keeping this pdf in my "rebuttals to idiocy" folder.

There are some industries that certainly do have "big data" (Wikipedia has some definitions for "big data" that include size ranges, for whatever that's worth), but companies with "big data" do not seem to be the only targets of "big data" marketing. And from what I know about available solutions, if I really had a "big data" problem (e.g., 100 terabytes, not 100 gigabytes) then I would not be choosing Hadoop. I also would not choose SQL or "NoSQL". But that's just me. Some of the best solutions I've found have nearly zero marketing. Go figure.

noelwelsh · on Sept 5, 2012

Interesting paper (and makes me feel more justified in rejecting Hadoop). Do you have any blog posts / other material about the techniques you're using at Sense?

equark · on Sept 5, 2012

Unfortunately no, but you're welcome to email me at tristan@senseplatform.com.

dbecker · on Sept 5, 2012

I think many people are confused about what "big data" means.

I work for an analytics consulting company, and many of our clients want us to use Hadoop with their data. They've heard that Hadoop is the standard for big data, and they associate "big data" with machine learning.

But the data they want us to put in Hadoop is usually small enough to work with in RAM on my laptop.

thebigpicture · on Sept 5, 2012

It must be great to have such naive clients.

wookietrader · on Sept 5, 2012

From a data analyst's perspective, let's go through what he says.

First he states something along the lines of "More data does not always help." This is right from a theoretical perspective. But: it never hurts. This is also right from a theoretical perspective; it's a result from probability theory: additional observations will always lead to variance in your estimates that is lower or equal, never higher. There is no data like more data. There is no downside to more data.
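To make the variance claim concrete, here is a minimal sketch (not from the linked post). It assumes i.i.d. draws from a standard normal and a sample-mean estimator, in which case the variance of the estimate is sigma^2 / n. The function name, sample sizes, and trial count are illustrative choices only.

```python
import random
import statistics


def variance_of_sample_mean(n, trials=500, seed=0):
    """Empirical variance of the mean of n i.i.d. standard normal draws."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return statistics.pvariance(means)


# Compare the simulated variance with the theoretical value sigma^2 / n = 1 / n.
results = {n: variance_of_sample_mean(n) for n in (10, 100, 1000)}
for n, v in results.items():
    print(f"n={n:5d}  empirical={v:.5f}  theory={1.0 / n:.5f}")
```

Note that this only covers the statistical side of the argument under the i.i.d. assumption; it says nothing about what the extra observations cost to collect.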

I am not sure in what way (2) and (3) relate to big data. I'd even say that (3) is pro big data.

Then there is this term "intelligent data". Actually, I can't emphasize enough how badly chosen this term is. Intelligence is related to the quality of actions someone takes. Data does not take actions. It just "is". Data cannot be intelligent, just as a stone cannot be intelligent. He also thinks that data measurements should be repeatable. Guess what, in all interesting cases data measurements are not repeatable due to randomness in the source itself. One of the main challenges of data analysis is to still get robust results. He also thinks that data should be concise, i.e. that the data set at hand should be as small as possible while still leading to the same actions. This sounds like a chicken-and-egg problem. How would you be able to even assess this without trying it out?

noelwelsh · on Sept 5, 2012

You're neglecting that data costs money and time to collect and process. More data is more cost.

I am in agreement with the spirit of this post (I'm not interested in arguing whether intelligent data is a good term or not). Heck, I even blogged along similar lines just a few days ago: http://noelwelsh.com/streaming-algorithms/2012/08/29/lean-da... Here are a few problems with collecting everything:

- Big Data infrastructure like Hadoop is expensive and slow. It's very much not in the turn-on-a-dime spirit of startups.

- If you collect everything, the value per data item is low. This impacts the analyses you can profitably do. Compare the value that Klaviyo can deliver per data point vs, say, Mixpanel. (And then ask yourself why Mixpanel is moving into "People" analytics. My suggestion: because it's much more valuable.)

Disclaimer 1: My startup, Myna [http://mynaweb.com/], had a shout-out in the blog post.

Disclaimer 2: I'm a Klaviyo user, as of a few days ago.

washedup · on Sept 5, 2012

Today the cost of storing more data is negligible. There are alternatives to Hadoop. However, capturing new variables that exist in your marketplace definitely takes time and money.

edhallen · on Sept 5, 2012

I think you make a great point here about the confusion that often gets put out there between data and analysis - a confusion which I'd say is implicit in the term big data as well (and hence I ran with - caveat, I'm the author).

As far as the problem with the term "intelligent data" - I think what you say is exactly true if you do data analysis one time; however, the issue is that those of us running startups find ourselves doing analysis over and over - so intelligently selecting data (in a way that takes us less time altogether and leads to the same decisions) is a huge win. Read intelligent data as being data + intelligence - not a new type of data.

Likewise, asserting that more data is always better ignores how most companies are making decisions. At the end of the day, our analysis is completely meaningless without a new action. So a better analysis that doesn't get implemented is worth far less (nothing) than an analysis that gets implemented successfully and drives results.

washedup · on Sept 5, 2012

Good point. There is a cost/benefit trade-off in obtaining more data, which may turn out to be useless. However, technology is increasingly pushing us in a direction that allows us to capture more data at lower cost.

washedup · on Sept 5, 2012

Agreed. Intelligent Data seems like a very messy term. In Big Data we use many different processes to figure out what the "intelligent" relationships are, or what variables really express strong relationships. This is one of the biggest challenges in Big Data. A quote from Alex Pentland:

"the data scientists themselves don't have much of intuition either…and that is a problem. I saw an estimate recently that said 70 to 80 percent of the results that are found in the machine learning literature, which is a key Big Data scientific field, are probably wrong because the researchers didn't understand that they were overfitting the data."

http://www.edge.org/conversation/reinventing-society-in-the-...
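For readers unfamiliar with the overfitting problem the quote refers to, here is a minimal sketch (not from Pentland's talk). It assumes labels that are pure coin flips, so there is nothing to learn, and uses a 1-nearest-neighbour rule as a stand-in for any model flexible enough to memorize its training set. All names and sizes are illustrative.

```python
import random


def nearest_label(x, train):
    """Predict by copying the label of the closest training point (1-NN)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def accuracy(data, train):
    """Fraction of points in data whose label matches the 1-NN prediction."""
    return sum(nearest_label(x, train) == y for x, y in data) / len(data)


rng = random.Random(0)
# Labels are coin flips, independent of x: any apparent pattern is noise.
train = [(rng.random(), rng.randint(0, 1)) for _ in range(200)]
test = [(rng.random(), rng.randint(0, 1)) for _ in range(1000)]

train_acc = accuracy(train, train)  # each point is its own nearest neighbour
test_acc = accuracy(test, train)    # fresh data exposes the memorization
print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

The model looks perfect on the data it was fit to and does no better than guessing on held-out data, which is why a result reported without a held-out evaluation can be wrong without the researcher noticing.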

photorized · on Sept 5, 2012

Data can't be intelligent.