

$205,000,000 of Funding For Big Data Startups - swGooF
http://datascience101.wordpress.com/2012/02/28/funding-for-big-data-startups/

======
whenisayUH
This is definitely a hot area, but unfortunately, it is also becoming the
thing everyone wants to be attached to. And so the term is becoming
increasingly meaningless.

It's 2012's "location-based services" or "gamification" or "cloud" (wait,
that's still hot). That said, I suspect big data (at least as I think I
understand it) has more legs. But defining what it is matters; otherwise it
becomes yet another buzzword.

Are compete.com and quantcast big data? Is eBay, which analyzes terabytes of
user metadata, "big data"? Is SeatGeek big data? Is Twitter big data?

Just because you have a potentially large database of stuff doesn't mean you
are big data. Hopefully the term comes to mean something but right now, I fear
it does not.

~~~
larsberg
In systems research, it seems to mean something slightly more specific: how do
you architect systems that can cope with crazy amounts of generated data?
Being in the middle of academic job talk season, we hear people talking about
a petabyte of total data and issues with generating and dealing with terabytes
of data per day.

Some of the problems:

- It can take you longer to transfer the data off than your data acquisition
  source will let you store it there.
- Even if you could transfer the data off, now you have the problem of storing
  it on your site and distributing it intelligently among processing nodes.
- Even if you could solve both of those, the projected _power_ costs associated
  with that scale of data are infeasible.
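
To put rough numbers on the first problem, here is a back-of-the-envelope sketch in Python. The function names are hypothetical and the volumes and link speeds are made-up illustrations, not figures from any talk; it assumes decimal terabytes and a link running at full utilization, which is optimistic.

```python
SECONDS_PER_DAY = 86_400

def transfer_days(volume_tb: float, link_gbps: float) -> float:
    """Days needed to move volume_tb terabytes over a link_gbps link."""
    bits = volume_tb * 1e12 * 8
    return bits / (link_gbps * 1e9) / SECONDS_PER_DAY

def link_tb_per_day(link_gbps: float) -> float:
    """Terabytes the link can move in one day at full utilization."""
    bytes_per_day = link_gbps * 1e9 / 8 * SECONDS_PER_DAY
    return bytes_per_day / 1e12

def can_keep_up(generated_tb_per_day: float, link_gbps: float) -> bool:
    """True if the link drains data at least as fast as it is generated."""
    return link_tb_per_day(link_gbps) >= generated_tb_per_day

if __name__ == "__main__":
    # Assumed example: one petabyte (1000 TB) over a 10 Gbps link.
    print(f"{transfer_days(1000, 10):.1f} days")
    # Assumed example: 20 TB/day generated, 1 Gbps link to get it off site.
    print(can_keep_up(20, 1))
```

Under these assumptions a petabyte takes a bit over nine days on a 10 Gbps link, and a 1 Gbps link moves only about 10.8 TB per day, so a source generating 20 TB per day falls further behind every day until its local storage window runs out.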

Most of the talks I see and papers I've come across seem to be focused on
better scheduling and more experiment/gather-side filtering based on what you
are planning to do with the data. But take this with a grain of salt, as I'm a
compilers guy, so I just see the systems stuff secondhand and only know enough
to talk about the languages-related issues with people who work in this space
for real.

~~~
vonmoltke
Depends on how you want to frame the problem. I work on sensor systems where
we could only save off a small fraction of the raw data coming off the sensor.
All data processing must be done in real time, in-memory, with the system only
saving off (sending out) a reduced set of processed output products. That
problem isn't new, though; military, meteorological, seismic, and space
sensors have operated with that constraint for decades, since the advances
that allow us to collect data have consistently outstripped the advances that
allow us to record data.

Processing these problems requires a different mindset, and business and web
data flows are now reaching the point that these sensor applications have been
at for decades: they need real-time processing,
with a concept of perishable data and a deadline for processing that data into
some intermediate or final product that can be stored for later use.
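
A minimal Python sketch of the "perishable data" idea, purely as an illustration: each raw sample must be reduced to a small product before its deadline, otherwise it is discarded and never stored. The `Sample` type, the `reduce_stream` function, and the choice of a mean as the output product are all assumptions made for this example, not how any real sensor system does it.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

@dataclass
class Sample:
    arrived_at: float      # arrival time in seconds
    values: List[float]    # raw data, too large to keep

def reduce_stream(
    samples: Iterable[Sample],
    deadline_s: float,
    clock: Callable[[], float],
) -> Tuple[List[float], int]:
    """Reduce each sample to its mean if it is still within its deadline.

    Returns the stored products and the number of samples that expired.
    Raw values are never retained, only the reduced product.
    """
    products: List[float] = []
    dropped = 0
    for sample in samples:
        age = clock() - sample.arrived_at
        if age > deadline_s:
            dropped += 1          # perished: too late to be useful
            continue
        products.append(sum(sample.values) / len(sample.values))
    return products, dropped

if __name__ == "__main__":
    # A fake clock makes the example deterministic.
    ticks = iter([0.5, 5.0])
    stream = [Sample(0.0, [1.0, 2.0, 3.0]), Sample(1.0, [4.0, 6.0])]
    print(reduce_stream(stream, deadline_s=1.0, clock=lambda: next(ticks)))
```

The first sample is processed half a second after arrival and kept as a single number; the second is looked at four seconds after arrival, past its one-second deadline, and is dropped.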

Your second point is still very much a concern with this paradigm, though. The
applications I speak of usually have some degree of natural parallelism in the
sensor hardware, typically tied to the number of A/D channels coming off the
sensor. Despite that, there are still unsolved[1] computational mapping issues
with respect to breaking these processing tasks up beyond their natural
boundaries. These sensors, and many of the emerging analytics applications,
are not processing sets of independent jobs the way the MapReduce paradigm
envisioned them. The parallel threads need information from each other to
generate their output products, which complicates the division of labor and
the execution control.
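
A toy Python illustration of why that matters, not taken from any real sensor pipeline: a 3-point moving average over a signal split into chunks. If each chunk is processed independently, the values at chunk boundaries come out wrong; they only match the single-pass result when each worker also receives its neighbours' edge samples (a "halo"). The function names and the smoothing filter are assumptions chosen for the example.

```python
from typing import List

def smooth(xs: List[float]) -> List[float]:
    """3-point moving average; the ends reuse the nearest sample."""
    n = len(xs)
    return [
        (xs[max(i - 1, 0)] + xs[i] + xs[min(i + 1, n - 1)]) / 3
        for i in range(n)
    ]

def smooth_chunked(xs: List[float], chunk: int, halo: int) -> List[float]:
    """Smooth xs chunk by chunk, borrowing `halo` samples from each neighbour."""
    out: List[float] = []
    for start in range(0, len(xs), chunk):
        end = min(start + chunk, len(xs))
        lo = max(start - halo, 0)
        hi = min(end + halo, len(xs))
        piece = smooth(xs[lo:hi])
        # Keep only the part of the result that belongs to this chunk.
        out.extend(piece[start - lo : start - lo + (end - start)])
    return out

if __name__ == "__main__":
    signal = [1.0, 4.0, 9.0, 16.0, 25.0, 36.0]
    print(smooth_chunked(signal, chunk=3, halo=0) == smooth(signal))
    print(smooth_chunked(signal, chunk=3, halo=1) == smooth(signal))
```

With `halo=0` the chunks are independent jobs and the boundary values disagree with the single-pass result; with `halo=1` they agree. In a real MPI setup that borrowed data has to be exchanged between processes at run time, which is the coordination cost described above.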

From what I have seen thus far, few big data platforms address the real-time
or near-real-time use cases. The applications I work on currently use MPI
grids on a cluster for parallel processing, which to me is the original big
data platform. Not saying it's the best way to do it, but nothing I've seen
with the label "big data" can replace it.

[1] Unsolved in the sense that there is no one right answer or set of answers.
There are certainly application-specific ways to "make it work".

------
pedalpete
Unfortunately, I'm not getting a clear idea of what defines 'big data'.

When Facebook started, they wouldn't have been considered a 'big data'
company, would they? Same with Twitter. So how are they defining the startups
that suit these funds?

~~~
vlnul
what about "data science"?

~~~
swGooF
A few years ago, being a data company was not the "cool" thing. Back then, it
was cool to be a social site. Thus, I think Twitter and Facebook correctly
considered themselves social sites. I am guessing that in the companies' early
days, someone must have seen the value in all the data that was being
collected.

------
mmx
This news couldn't come at a better time. We've been building a side effect
database for almost 2 years using big data sets from several different
sources. It's functional and a few people have picked it up but we only
recently began talking about looking for outside funding. 2012 should be
interesting for us and our competitors.

------
jgmmo
Big Data is so hot right now. _in Mugatu voice_

------
opendomain
I created my own startup on BigData, but something personal came up. Please
contact me if you are interested in using the domain <http://NoSQL.com> to
apply for this venture capital. Or, put another way, I am looking for a
co-founder - I am very technical and also run my own businesses. </shameless
plug>

