
The money of big data is in analytics - dendory
http://dendory.net/?b=5135940c
======
noelwelsh
When I see analytics in industry I normally see a whole lotta graphs, big
reports for the monthly board meeting, and so on. The way the industry is
moving is basically replicating what people used to do with "small data" on
the systems for handling "big data". Witness the surge of interest in SQL-on-
Hadoop type systems.

I believe a large amount of this is crap.

Only actions lead to improvements. The tighter your observe/act loop, the more
chances you have to win in any given timeframe. Computers can act a heck of a
lot faster than humans. Computers can also do a better job of balancing risk
than humans in many situations.

If I were a consultant I would definitely be selling my report-generating
ability. If I were a company that really wanted to win big, I would be
investing in automated decision-making systems.
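
To make "automated decision making" concrete: the simplest version of the
observe/act loop is a bandit that chooses a variant, observes a reward, and
updates itself on every request. The sketch below is just that toy idea in
Python (variant names and conversion rates are made up; it is not how any
particular product, Myna included, actually works):

    import random

    class EpsilonGreedy:
        """Toy observe/act loop: pick a variant, observe a reward, update."""
        def __init__(self, variants, epsilon=0.1):
            self.epsilon = epsilon
            self.counts = {v: 0 for v in variants}
            self.totals = {v: 0.0 for v in variants}

        def choose(self):
            # Explore occasionally, otherwise exploit the best-looking variant.
            if random.random() < self.epsilon:
                return random.choice(list(self.counts))
            return max(self.counts,
                       key=lambda v: self.totals[v] / (self.counts[v] or 1))

        def observe(self, variant, reward):
            self.counts[variant] += 1
            self.totals[variant] += reward

    bandit = EpsilonGreedy(["red-button", "green-button"])
    for _ in range(1000):
        v = bandit.choose()
        converted = random.random() < (0.12 if v == "green-button" else 0.10)  # made-up rates
        bandit.observe(v, 1.0 if converted else 0.0)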

(Ob plug: My startup, Myna <http://mynaweb.com>, is one small realisation of
this idea.)

~~~
mtrimpe
Sorry to take it off-topic, but I've never understood how you're supposed to
customize your pages based on a synchronous call to a third party API though.

From what I can see it adds quite a lot of latency and introduces a single
point of failure, which I'm sure isn't acceptable in most cases.

~~~
noelwelsh
I'd rather have this discussion over email: noel at mynaweb.com

If you're really concerned about latency, integrate on the server-side using
asynchronous calls. If that isn't an option:

\- Our main approach to managing latency right now is making Myna really
really fast (avg response time 2.5ms, 99% ~6ms).

\- If that isn't good enough, you can host your own Myna server(s).

\- We have various things in the works to improve on this in the near- and
medium-term future.
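
Coming back to the server-side option: here's a rough sketch (Python standard
library only; the endpoint URL is purely hypothetical, not a real Myna API) of
calling a remote testing service asynchronously with a hard timeout and a
default fallback, so the third-party call never sits on the page's critical
path:

    import json
    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    SUGGEST_URL = "https://experiments.example.com/suggest"  # hypothetical endpoint
    executor = ThreadPoolExecutor(max_workers=4)

    def fetch_variant():
        with urlopen(SUGGEST_URL, timeout=0.05) as resp:
            return json.load(resp)["variant"]

    def variant_for_request(default="control"):
        # Fire the call, but never wait more than ~50 ms for the answer.
        future = executor.submit(fetch_variant)
        try:
            return future.result(timeout=0.05)
        except Exception:  # timeout, network error, bad payload, ...
            return default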

Happy to discuss more on email.

------
apapli
Big data is all the rage, but as an enterprise biz dev guy I hear a lot of
talk about it and not a lot of real doing.

The article states accurately that people can store huge amounts of data. What
I don't think it explores sufficiently is how companies can make sense of it.

Big data gives companies the ability to find new insights into relationships
between data points they had no idea existed. From what I am seeing, however,
people are approaching the issue in the same (wrong) way - they are running the
same reports on larger data sets, or running new reports to confirm/disprove
areas where they suspect relationships will exist.

What big data needs is a generally available product (read: $$ thousands per
year, not hundreds of thousands or millions) that allows companies to send in
their data sets and have it proactively mine them for new trends across their
entirety.

I don't think this product exists yet - and if it does, please tell me and I'll
rapidly call them for a job - and this is exactly where the big money will be.

~~~
oscilloscope
The closest thing I can think of is OpenRefine. It doesn't directly find
trends, but it does do machine learning to help you clean up datasets. You can
also create a scatterplot matrix and do faceted search, which lets you quickly
explore data and relationships between dimensions. It's an open source program
(previously Google Refine).

<http://openrefine.org/>

For health care, finance, enterprises and government, data quality is a huge
issue. Many machine learning and statistical methods just don't work or
produce misleading results with messy data.

~~~
aidos
Ahhh, so that's where Google Refine went to. I was trying to find it recently.
Thanks for the link.

~~~
me_bx
You might also be interested in recline.js (<http://okfnlabs.org/recline/>), a
fork of Google Refine, initially created by @maxogden, now maintained by the
Open Knowledge Foundation (okfn).

How its data processing features differ from OpenRefine's is not clear to me,
though...

------
qxf2
>>"The problem of big data has been solved. We know how to gather data and
store it."

Nope. Far from it. We are still learning to gather data and store it well.
This is a complex problem. The author underestimates the difficulty of having a
large number of disparate people collecting data, and the variety of formats
that produces.

~~~
alexatkeplar
Exactly this. At SnowPlow (<https://github.com/snowplow/snowplow>) we would
love to spend more time downstream at the analysis phase (doing ML etc), but
we still have to spend a ton of time working upstream on collection, storage,
enrichment etc.

A lot of this work is defining, testing and documenting standard protocols,
data models etc (see [https://github.com/snowplow/snowplow/wiki/SnowPlow-
technical...](https://github.com/snowplow/snowplow/wiki/SnowPlow-technical-
documentation) if you're interested). And this is just for eventstream
analytics, working with our own data formats - ingesting and mapping third-
party formats (e.g. Omniture, MailChimp, MixPanel etc) is another lot of work
that needs doing... So a solved problem? Not so much.
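
For a feel of what "mapping third-party formats" involves, here's a hand-wavy
sketch (all field names invented; this is not SnowPlow's actual data model) of
normalising one email provider's webhook payload into a single canonical event
shape:

    from datetime import datetime, timezone

    def normalise_email_event(payload):
        # Map one (hypothetical) email provider's webhook into a canonical event.
        return {
            "event_type": "email_" + payload["type"],   # e.g. "email_open"
            "user_id": payload["msg"]["email"],
            "collector_tstamp": datetime.fromtimestamp(
                payload["fired_at"], tz=timezone.utc).isoformat(),
            "source": "email_provider_x",
        }

    raw = {"type": "open", "fired_at": 1356998400,
           "msg": {"email": "jane@example.com"}}
    print(normalise_email_event(raw))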

------
austenallred
It depends on how you define "analytics." The problem continues to be the
same: There's too much data, and we can't make sense of it. It seems obvious,
but this simple concept has huge ramifications.

Different approaches to making use of that data produce huge swings: look at
how other search engines tried to make sense of the web vs. how Google did it.
Look at how the dozens of analytics companies tried to make use of your web
analytics vs. how Omniture did it.

In a more modern example, look at how much content is generated every day on
social networks, or how much data healthcare facilities have resting in
different silos that is completely unusable to them. There's a lot more that
needs to be done.

------
capkutay
Although the headline states the obvious, this article touches on some good
points that companies have so far completely missed.

"But now that companies can receive and store this data, everything from logs,
to usage tracking, location coordinates, patterns and so on, the next step is
to make sense of all of this."

Shameless plug, but I'm working on those problems now and it is an extremely
rewarding area to be in. If you're interested in working in that area or
learning more about it feel free to ping me.

------
jamestc
>He says that data can be gathered showing how many people see a particular
painting or share it online, and thus reach conclusions on how successful an
artist is.

Wouldn't this just lead to art becoming a new form of spam when people try to
game this system? Not to mention that not all art is painting or can be easily
shared.

------
DrinkWater
I'm sorry, but isn't that obvious?

------
sgloutnikov
Thanks for this article! As someone who is just getting into the field out of
University, I love to absorb as much as possible on this topic and read
different points of view.

------
ErikAugust
"Data is like garbage. You'd better know what you are going to do with it
before you collect it." - Usually attributed to Mark Twain

For an article on big data, it did not provide much data.

------
graycat
Before you can sell me a wrench, you have to let me know what nuts it will
turn, and I have to have some such nuts to turn.

For big data, what nuts does it turn? I'm still waiting to hear just what nuts
people want turned, especially those for which 'big data' is essential.

It's easy enough to find cases where analysis has been stopped due to far too
little data or far too little ability to handle more data. A classic example is
R. Bellman's "curse of dimensionality", especially for his work in dynamic
programming -- i.e., best decision making over time under uncertainty (with a
flavor quite different from the uses of dynamic programming in some computer
science algorithms).

Broadly, for the curse of dimensionality, we can start with the set of real
numbers R, a positive integer n, and the real n-dimensional space R^n, that
is, just the set of all n-tuples of real numbers. Then as n starts to grow, it
takes 'big data' to start to 'fill', say, the n-dimensional 'cube' with each
side 100 units long, [0, 100]^n. So, if we want to describe something in such a
cube and want a lot of accuracy, then we can start with 1 MB of data and start
multiplying by factors of 1000 over and over. We can zip past a warehouse full
of 4 TB disk drives in a big hurry.
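
Rough arithmetic, taking the cell counts at face value: store one 8-byte value
per unit cell of [0, 100]^n and the space needed grows like 100^n.

    # Bytes needed for one 8-byte value per unit cell of [0, 100]^n.
    for n in (2, 5, 7, 10):
        cells = 100 ** n
        print(n, cells * 8 / 1e12, "TB")
    # n=2 -> ~80 KB; n=5 -> 80 GB; n=7 -> 800 TB; n=10 -> 800 exabytes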

Here is a general situation in 'analysis': We are looking for the value of
some variable Y. Since we don't know Y, we can say we are looking for the
value of a random variable. Or we can look for its distribution. For input,
maybe we have many pairs (x, y) where x is in R^n and y is in R. Then, maybe
we are told that in our case we have some X in R^n and want the corresponding
Y or its distribution.

Well, essentially there is one, just one, one to rule all the rest, way to
answer this. May I have the envelope, please? Yes, here it is (drum roll):
Simple, plain old cross-tabulation. Why? Because cross-tabulation is just the
discrete version of the joint distribution, and from the joint distribution the
classic Radon-Nikodym theorem (Rudin, 'Real and Complex Analysis') tells us
that the best answer we can get (in the non-linear least squares sense) is the
conditional distribution or its expectation, the conditional expectation
E[Y|X], which is the best non-linear least squares approximation of Y as a
function of X. Taking an average from a cross-tabulation is the discrete
approximation of this. For a good approximation for a lot of values of X, we
can suck up 'big data' and ask for many factors of thousands of times more. The
conditional distribution of Y given X is P(Y <= y|X) and is addressed
similarly. So, net, I agree that there can be some good uses for big data.
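
A minimal sketch of that last point: bin X, and within each bin average Y as
the estimate of E[Y|X]. It's one-dimensional here just to keep it short; the
hunger for data shows up when the binning runs over many dimensions at once.
The data and bin width are made up.

    import random
    from collections import defaultdict

    # Fake (x, y) pairs with y = 2x + noise.
    xs = [random.uniform(0, 10) for _ in range(100000)]
    pairs = [(x, 2 * x + random.gauss(0, 1)) for x in xs]

    # Cross-tabulate: bin x, average y within each bin ~ E[Y | X in bin].
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in pairs:
        b = int(x)                     # bin width 1
        sums[b] += y
        counts[b] += 1

    cond_mean = {b: sums[b] / counts[b] for b in sorted(sums)}
    print(cond_mean)                   # roughly {0: ~1, 1: ~3, 2: ~5, ...}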

Still, before proposing an answer and picking the tools, let's hear the real
questions. Okay?

Why? For one, as general as cross tabulation is, it commonly requires so much
data that even realistic versions of 'big data' are way too small. So,
typically we use methods other than just cross tabulation to make better use
of our limited TBs of data, and to select such methods we really need to hear
the question first.

Before I select a wrench, I want to look at the nut. Is this point too much to
ask in the discussion of 'big data'?

I will end with one more: Suppose we want to estimate E[Y] by taking an
average of n 'samples'. Under the usual assumptions, the standard deviation of
our estimate goes down like 1 over the square root of n. So, to get the
standard deviation 10 times smaller, we need n to be 100 times bigger. So,
roughly for each additional significant digit we want in the estimate, we need
another 100 times as much data. Once we start asking for more than, say, five
more significant digits, we are way up on a parabola in the amount of data we
need. Net, if we want really accurate estimates, then even big data has to
struggle. So, really, we accept the law of diminishing returns and just use
medium or small data.
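
The arithmetic as a sketch: with the standard error falling like 1/sqrt(n),
each extra significant digit multiplies the required sample size by 100.

    # Samples needed so the standard error shrinks by a factor of 10**extra_digits.
    def samples_needed(n0, extra_digits):
        return n0 * 100 ** extra_digits

    for d in range(1, 6):
        print(d, "extra digit(s):", samples_needed(1000, d), "samples")
    # 1 -> 1e5, 2 -> 1e7, 3 -> 1e9, 4 -> 1e11, 5 -> 1e13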

~~~
pseut
I'm not really sure about the main point of this comment, and I don't do 'big
data' even though I dropped the buzzword into a recent annual report, but I
thought the main focus was on "discovery" of reliable relationships in the
data.

The statistically interesting aspects come from a large number of variables,
not observations.

Edit ... And so I think the comment's focus on the curse* of dimensionality
and "small data" is misplaced.

* edit #2

~~~
rscale
I get concerned about usages of big data that concentrate on "'discovery' of
reliable relationships in the data."

The problem is that as we add more data, the chances of spurious relationships
increase dramatically, and the human brain is incredibly good at finding a
causal explanation for those relationships, even if none exists. This can
quickly turn Big Data into a noise-generating rabbit hole, leading us down
blind alleys and wasting our time.
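
A quick way to see this (a sketch using numpy; the sizes are arbitrary): feed
pure noise in, search over all pairs, and a "strong" correlation falls out
anyway.

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.standard_normal((300, 50))   # 300 variables, 50 observations, all noise

    corr = np.corrcoef(data)                # 300 x 300 pairwise correlations
    np.fill_diagonal(corr, 0.0)
    print("largest |r| found by searching:", np.abs(corr).max())
    # Typically comes out above 0.5 -- looks 'strong', means nothing.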

I love that it keeps getting easier to test our hypotheses, but a search that
begins without a logical and reasoned hypothesis is a dangerous beast.

~~~
pseut
I included the word _reliable_ for a reason.

That's one thing that's interesting about this stuff from a statistics
perspective: how you can draw conclusions that are reliable even after some
sort of search process. See, for example, the research by Joe Romano and
Michael Wolf (and their coauthors) on topics like the "family-wise error rate".
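
For the flavour without the papers: the bluntest family-wise error rate control
is plain Bonferroni, dividing the significance level by the number of
hypotheses searched over. The Romano-Wolf procedures are far less conservative
than this sketch, but the idea of paying for the search is the same.

    def bonferroni_keep(p_values, alpha=0.05):
        # Keep only hypotheses surviving the simplest FWER correction.
        threshold = alpha / len(p_values)
        return [i for i, p in enumerate(p_values) if p < threshold]

    # 1000 tests: p = 0.00004 survives, p = 0.003 (fine on its own) does not.
    print(bonferroni_keep([0.00004, 0.003] + [0.5] * 998))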

~~~
rscale
I also chose my words carefully.

Some people (like you) understand that reliability and validity aren't just "p
< 0.05", but that's far from universal understanding. I've seen intelligent
people accept and reject hypotheses with woefully inadequate evidence, and
I've also seen wild hypotheses built on the backs of strong but meaningless
correlations.

Dangerous beasts can be useful, but they must be treated with due care.

~~~
nignog41
Care to elaborate on how to be more sure of reliability and validity? Any
stories of inadequate examples or meaningless correlations?

Just trying to learn how to better read data

~~~
rscale
> Any stories of inadequate examples or meaningless correlations?

A customer's marketing group was tying visitor data to geodemographic data.
They put together a database with tons of variables, went searching, and found
a multiple regression with a Pearson coefficient of 0.8+, a low p, decided to
rewrite personas, and started devising new tactics based on the discovery.

Fortunately, they briefed the CEO and the CEO said that the dimensions in
question (I honestly don't remember what they were) didn't make intuitive
sense, and demanded more details before supporting such a major shift in
tactics. More research was done, and this time somebody remembered that this
was a product where the customers aren't the users, so they need to be treated
separately. And it turned out the original analysis (done without fancy
analytics) was very close to correct.

If the CEO hadn't been engaged during that meeting, they would've thrown away
good tactics on a simple mistake. The regression was "reliable" by most
statistical measures, but it was noise.

A similar example holds for validity, where I saw a team make wonderfully
accurate promotion response models, but they only measured to the first
"conversion" instead of measuring LTV. And after several months of the new
campaign, it turned out that the new customers had much higher churn, so they
weren't nearly as valuable as the original customers.

> Care to elaborate on how to be more sure of reliability and validity?

I'm not a statistician or an actuary. I'm a guy who took four stat classes
during undergrad. I know just enough to know that I don't know that much.

Disclaimer aside: my biggest rules of thumb are to make sure that you're
measuring the thing you want to measure (not a substitute), to make sure the
statistical methods you're using are appropriate for the data you're
collecting, and to make sure you understand the segmentation of your market.

~~~
pseut
So those are some pretty bad decisions coming from statistical analysis; I
wonder if you think that those people (the marketing group in particular)
would make good decisions generally? It seems like some people are hell-bent
on making bad decisions regardless of the tools available to them.

But, yeah, you hand some people a spreadsheet with numbers in it and their
critical thinking ability just evaporates.

As an aside, that's not what I meant by "reliable" earlier (and, to be really
specific, I agree that low p-values do not ensure reliability even w/out the
other problems introduced by that particular model search).

------
antoniuschan99
<https://www.youtube.com/watch?v=G1KObNG_Wnw>

