

Bradford Cross: Big Data Is Less About Size, And More About Freedom - jaf12duke
http://techcrunch.com/2010/03/16/big-data-freedom/

======
fnid2
I have used FlightCaster many times and every time my flight has been delayed,
FlightCaster has shown it as on time. I'm wondering if and when it actually
_works_.

Has anyone had a successful delay status notification from FlightCaster?

~~~
bradfordcross
What's up fnid2. We've definitely had some known issues over the past several
months at different levels in our system. We've worked out a lot of problems
and are on top of others as we speak.

That said, FlightCaster works now for a lot of cases, but it is inherently
probabilistic, so we do mispredict as well.

Keep using it, report issues to us when we are wrong, and we'll keep working
on the algorithm and making our data better.

~~~
fnid2
Hey Bradford, thanks for responding. Always good to get someone involved with
the actual system to talk to!

As an individual, I'm not concerned about the _probability_ that my flight is
delayed -- I want to know if it is delayed _for sure_. If I go to flightcaster
and it says it isn't delayed, I'm annoyed when I get to the airport and have
to wait 2 more hours. If it says it _is_ delayed and I wait, then I _could_
miss my flight and I will be not only annoyed, but very angry -- probably at
myself for trusting you.

Of course I could verify the flight status at the airline's website if you say
it's a high probability, but in that case, why use FlightCaster at all? I
should always just go to the more likely correct source first.

Probabilities might work for airlines or systems that are large enough to be
affected by an aggregate flight status or for planning large complicated trips
in advance, but if it fails to identify _my_ flights so often, I will simply
disregard it as a useful source of information -- as I suspect many
individuals will.

So, what is your plan? I've read that you may license the technology to bigger
firms or something like that, which sounds doable, but in my personal
experience, it simply isn't a tool for individual travelers or I don't "get
it." I also fly internationally, so all those flights aren't available.

What is a use case where this would be useful for an individual? Am I simply
not the right user for FlightCaster? Who is your target market? From the
marketing I read about it, it _seems_ targeted to individuals -- that's
usually what tc writes about anyway, so it's a bit confusing for me. I feel
like it could be so useful for me, but I keep getting not so great results.
Even last week I flew from Canada through Dallas. I couldn't get the first
leg, but then the second leg was 2 hours late and on time according to fc.

What if you could add something where if I check my flight and I _know_ the
status for sure, I can "correct" the status at FlightCaster! Crowd sourced
flight status! :)

~~~
bradfordcross
You raise a lot of really important issues here.

Let's take them one by one: data, algorithm, and user experience.

Data: We've done a lot with our data pipeline recently to ensure you have the
best available information in real time and we are able to recover as quickly
as possible if one of our data suppliers goes down or backs up temporarily. As
to the international issue, we will get there in time, but it is not so
simple. We want to get the US stronger first, and the international data
sources are sparser and not the same as those we use already.

Algorithm: We're working really hard on making our algorithm better and
integrating new data sources as new features to learn from. We do a lot of
very hard data transformation and feature extraction in order to get the
feature vectors our models learn from. We just underwent a major overhaul of
our Hadoop and Cascading infrastructure, and our research environment is
setting us up to be able to do a lot of cool stuff now. You can find the code
for our cascading lib here: <http://github.com/clj-sys/cascading-clojure>

User Experience: We wrestle with the kinds of issues you raise on a daily
basis. It is tricky to have a user facing probabilistic model. As you point
out, some folks look for certain outcomes, and don't do so well with
probabilities. Others love to geek out on as much data as possible and want an
even more fine-grained information dashboard. It is not easy to satisfy both
groups with a single experience, although we're exploring different user
experiences right now that try. As to the crowd sourcing, we've talked about
social media integration, and it does seem like it would help people interact
more to share the best current data with FlightCaster and our other users.
Again, it comes down to the priorities and focus required at startups - we can't
do it all at once.

~~~
corruption
The real question is: does flightcaster actually predict effectively? Would
you be open to publishing the statistical analysis of your model so far? I'd
certainly love to take a look at this. I don't want to see the algorithms you
are using, but it would be nice to compare your model using MDL or other
information criteria to other more simplistic models (e.g. simple geometric
distribution predictions), and analyse it using standard statistical model
diagnostics.
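
To make the kind of comparison I mean concrete, here is a rough sketch in
Python. Everything in it is an assumption for illustration: the delay buckets,
the function names, and the idea that the model under test outputs a
probability for each bucket. It fits a geometric baseline by maximum
likelihood and scores it with BIC, a criterion closely related to MDL. None of
this reflects how FlightCaster actually works or what data it has.

```python
import math

def fit_geometric(delays):
    """MLE of p for a geometric distribution on {0, 1, 2, ...}.

    `delays` are observed delays in discrete buckets (hypothetical unit,
    e.g. 30-minute blocks). The MLE is p = 1 / (1 + sample mean).
    """
    mean = sum(delays) / len(delays)
    return 1.0 / (1.0 + mean)

def geometric_log_likelihood(delays, p):
    """Log-likelihood under P(K = k) = (1 - p)^k * p."""
    total = 0.0
    for k in delays:
        # Skip the (1 - p)^k term when k == 0 so p == 1 does not hit log(0).
        if k:
            total += k * math.log1p(-p)
        total += math.log(p)
    return total

def model_log_likelihood(delays, predicted_pmfs):
    """Log-likelihood of a per-flight predictive model.

    predicted_pmfs[i] maps bucket -> predicted probability for flight i
    (a hypothetical output format for the model being evaluated).
    """
    return sum(math.log(pmf[k]) for k, pmf in zip(delays, predicted_pmfs))

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

# Made-up observations, only to show the calls fitting together.
observed = [0, 0, 1, 0, 3, 0, 0, 2, 0, 0]
p_hat = fit_geometric(observed)
baseline_bic = bic(geometric_log_likelihood(observed, p_hat), 1, len(observed))
```

A per-flight model would be scored the same way, using
model_log_likelihood and its own parameter count on held-out flights. If its
BIC does not beat the one-parameter baseline, the extra complexity is not
paying for itself.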

------
mark_l_watson
Great article - this is why I read HN.

I think that Bradford Cross is right on: as more data becomes available, the
challenge is not building 'cookie cutter' applications, but rather using some
combination of ML, text mining, statistics, etc. to create new ideas and
systems.

------
bravura
I am interested to read more about how people put together the pieces, such as
hadoop, hive, pig, and rabbitmq.

------
tigger2010
May be stating the obvious here, but why is a probabilistic prediction model
primarily aimed at consumers? Wouldn't it make more sense to apply to users
for whom probability is useful e.g., truck fleets to predict delays? oil
fields to show likely drill spots? etc.

------
nessence
Great article. There's much more to be said about "Freedom". So many
corporations have access to massive amounts of data and it will soon be easier
to regulate access; whether by control of its owner, or not.

