
Most data isn’t “big,” and businesses are wasting money pretending it is - angelohuang
http://qz.com/81661/most-data-isnt-big-and-businesses-are-wasting-money-pretending-it-is/
======
kamaal
As someone who is currently dealing with this sort of thing, I can tell you
this article hits the nail on the head.

Most, heck something like 99.99%, of all the so-called big data I've dealt
with is something I wouldn't even classify as small data. I've seen data feeds
measured in KBs sent over to be handled as big data. It happens all the time.
A simple data problem that could easily be solved with a small db like sqlite
is routinely taken to 'the grid' these days. It reminds me of the XML days,
when everything had to be XML. I mean every damn thing; these days it's NoSQL
and Big Data.

People contort their schema design just so it can go into a NoSQL store, then
use something like Pig to generate data for it. The net result is that they
end up badly reinventing parts of SQL all over the place. If they only
understood a little SQL and why it exists, they could save themselves all that
pointless complexity. Besides, avoiding SQL where it's appropriate creates all
sorts of data problems in your system. You end up endlessly reinventing what
SQL already offers while bloating your code. You read through a big chunk of
code only to figure out that the person really just wanted something like a
nested select query, implemented here very badly.
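
To give a rough illustration (made-up table and column names, sqlite just to
keep the sketch self-contained), the kind of nested select that ends up being
rebuilt by hand in application code:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
        INSERT INTO orders VALUES (1, 10, 99.0), (2, 10, 15.0), (3, 11, 250.0);
    """)

    # Customers whose average order is above the overall average -- the sort of
    # thing that otherwise becomes two loops and a dict in application code.
    rows = conn.execute("""
        SELECT customer_id, AVG(total)
        FROM orders
        GROUP BY customer_id
        HAVING AVG(total) > (SELECT AVG(total) FROM orders)
    """).fetchall()
    print(rows)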

Besides, I find much of this big data thing a total sham. Back in the day we
would write Perl scripts to do all sorts of complex data processing (with SQL,
of course). Heck, I've run some very big analytics systems and automation
setups in Perl that did far more difficult things than what people do with
'Big Data tools' today.

In larger corporations this has become a fashion now. If you want to be known
as a great 'architect', all you need to do is bring in these pointless
complexities. Ensure the setup becomes so complicated it can't be explained
without a hundred pieces of jargon totally incomprehensible to anybody beyond
your cubicle. That is how you get promoted to architect these days.

~~~
rmrfrmrf
I _really_ don't understand why anyone has a problem with relational
databases. Once you take the time to understand how they work (by taking a
class or reading a book), they're really straightforward and make a lot of
sense. Not to mention they're really fast and quite reliable.

I get that a NoSQL-ish alternative makes sense for companies that have tons of
shards spanning the globe, but for the vast majority of people, a relational
database serves just fine.

~~~
maggit
I don't have a problem with relational databases. Or rather, I don't have a
problem with relational data modelling.

I _do_ have a problem with Oracle. Even the Oracle experts at my former job
could barely get Oracle to do something sensible, and running it on my own
computer was basically a death sentence for getting anything done.

I have a problem with MySQL, from a sysadmin perspective. When I had it
installed, MySQL was the package that would _always_ break on every update. No
upgrade was small enough that the data files would continue working.

(I don't have experience with Postgres, but SQLite seems more comfortable than
any of the mentioned alternatives)

I have a problem with schemas in my database. They require upfront work
modelling my data. I'd rather iterate. Also, nobody I've worked with seems to
put the schemas in place automatically; you have to run the special "initdb"
script, which isn't maintained, just to get things working.

I have a problem with SQL. It would be awesome if we had a standard query
language, but we don't. You can't apply the same SQL to different database
engines: first because it won't even compile, then because it will give
different results, and finally because it will have a completely different
performance profile.

All of this can be fixed by learning stuff, so I know better what I am doing.

But I already know CouchDB [1]. It took me little effort to learn, and it
makes a lot of sense to my mind. I can solve problems with it, and it has neat
properties with regards to master-master replication. So for me, CouchDB works
just fine, just like a relational database works just fine for you :)

So, from my perspective, it seems that using some SQL solution would be the
time-consuming option.

[1]: CouchDB can't be considered a "Big data" database for many cases. It is
slow. But it scales neatly :)

~~~
dasil003
This is all pretty sane, except the schema-less part. I just don't understand
why people get all hung up over schemas. Sure, migrations are a minor
inconvenience, but if you just add fields in an ad-hoc fashion over time, the
data becomes messy and it's hard to determine any invariants about a large
dataset. Sure, this can be avoided through careful code organization, but then
aren't you just reinventing schemas out-of-band?

~~~
gizzlon
I think this is due to how you work and what you build. If you plan it out,
focusing on the data structures you need, and then build it, schemas are fine
because you know up-front what you want.

In more ad-hoc, constantly changing, "oops, this didn't work because of
unknown factor X" type of projects, schemas are a pain. It sounds really nice
to have a data store that adapts when it's impossible to know up-front what
you need from your data structures.

~~~
yummyfajitas
_In more ad-hoc, constantly changing, "oops, this didn't work because of
unknown factor X" type of projects, schemas are a pain._

Such a pain:

    
    
        ALTER TABLE foo DROP COLUMN bar;
        ALTER TABLE foo ADD COLUMN baz varchar(64);
    

From what I've seen, a significant chunk of the desire to use schemaless NoSQL
hipster DBs is simply a desire to avoid learning SQL as if it were a real
programming language.

The only real use case I've ever seen for _schemaless_ DBs is data with lots
of ad-hoc columns added by multiple people (typically logging/metrics data).

~~~
VLM
"lots of ad-hoc columns added by multiple people"

Inevitably completely un-normalized junk data. Even worse, with no
documentation or procedure, anything ever added becomes permanent legacy that
can never be removed. Been there, lived it, hated it, won't allow it to happen
again.

~~~
yummyfajitas
So you won't allow logfiles to happen again?

~~~
VLM
LOL, no, not as a primary store of data, no never again.

Plain text logs are a great place to funnel all "unable to connect to
database" type of errors for dbas / sysadmins to ponder, however.

I've implemented quite a few systems where data changes are logged into a log
table, all fully rationalized, so various reports can be generated about
discrepancies, access rates, and stuff like that. This is also cool for end
users, who can see who made the last change, etc.

Trying to reverse engineer how some data got into a messed-up condition using
logs that can be JOINed to the actual data tables as necessary is pretty easy,
compared to writing a webserver log file parser that reads multiple files to
figure out who did what, when, to the data to leave it screwed up. You only
parse log files for data once before you decide to do that stuff relationally.
Debug time drops by a factor of 100.
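
To sketch what I mean (schema invented for the example), a change-log table
that JOINs straight back to the data it describes:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE widgets (id INTEGER PRIMARY KEY, name TEXT, price REAL);
        CREATE TABLE change_log (
            widget_id INTEGER REFERENCES widgets(id),
            changed_by TEXT, changed_at TEXT, old_price REAL, new_price REAL
        );
        INSERT INTO widgets VALUES (1, 'sprocket', -5.0);
        INSERT INTO change_log VALUES (1, 'alice', '2013-05-01', 10.0, 12.0),
                                      (1, 'bob',   '2013-05-02', 12.0, -5.0);
    """)

    # Who last touched the rows that ended up in a screwed-up state?
    print(conn.execute("""
        SELECT w.id, w.name, l.changed_by, l.changed_at, l.new_price
        FROM widgets w JOIN change_log l ON l.widget_id = w.id
        WHERE w.price < 0
        ORDER BY l.changed_at DESC LIMIT 1
    """).fetchall())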

~~~
DenisM
I'd like to pick your brain as that's the problem I'm facing right now - I
have a web site that is accessed by users, and I would like to get a
comprehensive picture of what they do. I already have a log table for logging
all changes (as you said - I can show it to the users themselves so they know
who changed what in a collaborative environment), but I struggle to define a
meaningful way to log read access - should I record every hit? Would that be
too much data to store and process later? Should I record some aggregates
instead?

~~~
seestheday
It sounds to me like you're trying to reinvent web analytics. Is there a
reason you need user level granularity or is aggregate data enough?

~~~
DenisM
My web site has separate "accounts" for multiple companies, each with multiple
users. I'd like three levels of analytics - for a given company (both me and
the company agent would like to see this), across all companies for all users
(I will see this), and rolled up within each company (i.e. a higher level of
activity, where it's companies being tracked, not individual users).

User-level data might be useful for tech support (although this is currently
working fine with text-based log files and a grep).

So I guess I am not sure... I might be content with web analytics... Each
company has its own URL in the site, like so:
<http://blah.com/company/1001/ViewData>,
<http://blah.com/company/1002/ViewData>, etc. Using e.g. Google Analytics I
could see data for one company easily, but can I see data across all companies
(how many users look at ViewData regardless of company id)? Can I give the
owner of company 1001 access to only their part of the analytics?

Another monkey wrench is the native iPad app - ideally the analytics would
track users across both native apps and the web site.

------
achompas
From the Berkeley paper on Facebook:

 _Nonetheless, large jobs are important too. Over 80% of the IO and over 90%
of cluster cycles are consumed by less than 10% of the largest jobs (7% of the
largest jobs in the Facebook cluster). These large jobs, in the clusters we
considered, are typically revenue-generating critical production jobs feeding
front-end applications._

So MR job characteristics might follow a power law distribution, and @mims is
focusing on one end of the tail. Sure, that's cool!

But then @mims also selectively quotes the TC article, which ends with an
excellent point that contradicts his thesis:

 _The big data fallacy may sound disappointing, but it is actually a strong
argument for why we need even bigger data. Because the amount of valuable
insights we can derive from big data is so very tiny, we need to collect even
more data and use more powerful analytics to increase our chance of finding
them._

I think @mims over-pursues the stupid Forbes/BI straw man here. As one would
expect with data, the story is complicated. Mom and pop stores don't need to
worry about Cloudera's latest offering, but companies working on the cutting
edge of analysis still absolutely need tools like Hadoop, Impala, and
Redshift.

~~~
scott_w
My feeling is that the title of the article rails against companies who think
they need Big Data when they don't.

------
mshron
I've maintained for a while now that the distinction isn't between "big" and
"small" data, but between coarse and fine data. Now that everything is done
through the web, previously common data sources (surveys, sales summaries,
etc) are being supplanted by microdata (web logs, click logs, etc). It does
take a different skill set to analyze noisy, machine-generated data than to
analyze clean, survey-like data; it's a skill set that is more biased towards
computational knowledge than classical experimental design, hence the shift in
emphasis.

~~~
BrianEatWorld
I like this distinction. I walked into a world of hurt when I was brought on
to look at application user data after years of working with international
trade data and national statistics. Even when it comes to formulating a
hypothesis and subsequent experiment, the approach is entirely different.

I will say that the article's distinction between small and big data is also
important, but that just comes down to processing power. I think the
distinction you make is far more important and knowing whether you need coarse
or fine data can help keep you out of the issues that are introduced moving
from small to big data.

------
ChuckMcM
Of course one could see it as "IT's revenge" after Scott McNealy so famously
said it was dead. There is a lot of power to be had by creating an interface
for the customer and then keeping everything behind that interface 'obscure'.
They have to have that interface to survive, and if they don't know what goes
on behind it they have no way of discerning outrageous costs from reasonable
ones. The current exemplar seems to be medical costs.

Back in the 60's there was this chamber of secrets called "the Machine Room"
which had the "Mainframe" and various and sundry high priests who went in and
out, and if you literally played your cards, as in punched cards, right you
could get a report on how sales or manufacturing was doing this month.

That got lost when everyone had a PC on their desk, and now some folks are
trying to reclaim it :-)

That said, the article is still poorly argued. The cost of data management _is_
fairly high. And generally a big chunk of that cost is the cost of specialists
who provide business 'continuance' which is code for "makes sure that you can
always get your data when you need it, and you can get the answers you need
from it in a timely and repeatable fashion." That hasn't changed at all, and
whether you have some youngster doing "IT" on the creaky Windows 2000 machine
running Back Office or you are using a SaaS company like Salesforce.com, data
management is and will continue to be a mission critical part of staying in
business.

------
dewitt
If I ever want to get rich, I'll set up shop convincing small businesses they
need to do things the way Google does, if only they want to remain
competitive.

Oracle has used exactly this business model to great success, and obscene
profit, for over 30 years.

~~~
SeoxyS
I know that's not what you mean, but I find it quite amusing that you describe
the business of Oracle (a 30-year-old company) as convincing people they need
to do things the same way Google (a 15-year-old company) does.

~~~
eurleif
He said they have the same business model. Business models are abstract.

~~~
joseph_cooney
dewitt said "Oracle has used exactly this business model to great success, and
obscene profit, for over 30 years". You claim that business models are
abstract. Perhaps, except in the case where the words "exact" are used, and a
specific company is named. I, like SeoxyS find the paradox of Oracle being
accused of helping businesses to be "me-too" copies of Google - a company half
of Oracle's age - amusing.

~~~
dvse
The model in question is "convince small and medium businesses that they need
to buy my software in order to do things the same way as large companies X and
Y and have a hope of remaining competitive". For Oracle, X and Y were banks,
retail, and logistics companies; for the new generation of "big data" vendors
it is Google and Facebook.

~~~
joseph_cooney
But the poster in question didn't say "X" and "Y", did they? This feels like
an exercise in pedantry now, but he really did say "exactly" and "Google".

~~~
coldtea
Ok, here's some "conversational language" insight:

He said: "Exactly this business model" -- that is, as it pertains to it's
essence.

NOT to be read as:

"Exactly this business model as it pertains to inconsequential details, like
which big company they should be imitating".

------
fiatmoney
There's an important distinction to be made between the storage layer and the
analysis layer. Something like HDFS can make sense as a storage layer once you
hit the > 10TB range even if your average dataset for analysis is reasonably
small (and it should be; 99% of the time you can get by with sampling down to
single-machine size). That doesn't mean you need to be setting up all your
analysis jobs to run via map-reduce; you can usually dump the dataset to a
dedicated machine and do it all in one go with sequential algorithms. As a
side benefit, you have access to algorithms that are really difficult to
express efficiently as map-reduce (eg, computations over ordered time series).

------
philip1209
I think that big data has made math sexy. What is actually happening in the
market is applied statistics and operations research being sold to small and
medium-sized businesses under the guise of "big data".

~~~
christopheraden
Statistics involves checking modeling assumptions. A lot of what I've seen
with the big data people is the repetition of algorithms to the exclusion of
understanding and checking modeling assumptions.

While it's nice that the big data craze is making statistics more popular in
the mainstream press, it is important that statistics does not become just an
application of numerical methods without consideration of underlying
assumptions. I stress this because this has been largely underappreciated in
my experience.

~~~
userbee
"Big data" also checks model assumptions, if only if by monitoring whether or
not acting on the information moves a business metric.

Statistics involves inference over prediction, but either one when done right
validates assumptions.

By the way big data will sit on your face for days.

~~~
christopheraden
I meant checking assumptions not just to see whether the use of the big data
moved a business metric, but also that the model makes sense from a
statistical perspective.

A lot of statistics in business does not bother to check modeling assumptions.
Models are chosen based on whether they've been used in the past and what the
team is familiar with.

I don't doubt that big data (as we call it now) will one day rule. Ronald
Fisher would keel over if he saw the size of the datasets we work with
nonchalantly on a daily basis. 150 data points (the size of the Iris data) is
laughable these days.

My reservation with big data is that the technologies are often unnecessary
for the size of the tasks being done. Other than a few data scientists working
on truly large projects, most of the big data talk I hear comes from people
who aren't fighting in the trenches (execs, marketing, journalists).

------
christopheraden
I am grateful to finally see this in an article. The "big data" craze is being
pushed in areas where it really doesn't make sense. We've been bit by the Big
Data bug where I'm at, but it's not coming from the statisticians. It's
usually the executives proposing a shift to big data.

People underestimate how much work it would be to shift an old server onto
modern technologies and tell the statisticians to use MapReduce and NoSQL
instead of SAS and SQL. If the Fortune 500 world has taken this long to catch
on to R, imagine how long it'll take to completely change the DBMS and
analysis software!

------
Tobani
Sure, if you're dealing with 1GB of data it probably isn't worth spinning up a
Hadoop cluster to run your analysis. However, if you already have Hadoop up
and running for something that genuinely requires it, that 1GB job might make
sense there. The data may already be in HDFS, and you already have the
infrastructure there to manage and monitor jobs.

The references to Facebook & Yahoo running small jobs on huge clusters may be
a little misleading. It may be simply the easiest place for them to deploy
those jobs consistently.

But yeah... "Big Data" is a total meaningless buzzard.

~~~
Twisol
"Buzzard" isn't an eggcorn I've ever heard before! Did you mean "buzz word"?

~~~
jackmaney
You have to take some of these colloquialisms with a grain assault.

~~~
vasyainv
I don't like to be kept dark and dry on this one

------
guylhem
For most data, it is in fact a waste of money.

Personally, I am loading the data I play with on a postgreSQL database on my
laptop (if you have a mac and want to do that quickly, you may want to check
out the link I just submitted
[http://en.blog.guylhem.net/post/50310070182/running-
postgres...](http://en.blog.guylhem.net/post/50310070182/running-postgresql-
on-mac-osx-mountain-lion-in-2) )

You can do crazy things with current hardware specs. Like loading all the
data the World Bank offers for download, indexing it, and using it for
regressions (I do). In 2013 you only need a laptop for that.
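
For anyone curious, the analysis side is nothing fancy; a rough sketch of the
idea (the file and column names here are assumptions, not the actual World
Bank layout):

    import pandas as pd
    import numpy as np

    # Assumed layout: one row per country-year with an indicator value.
    df = pd.read_csv("worldbank_gdp_per_capita.csv")  # columns: country, year, value
    india = df[df["country"] == "India"].dropna(subset=["value"])

    # Simple log-linear trend over time -- a laptop handles this instantly.
    slope, intercept = np.polyfit(india["year"], np.log(india["value"]), 1)
    print("approx. annual growth rate:", np.expm1(slope))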

Most data is not big. Big data is "big" like in a gold rush, where the ones
selling the tools are making the biggest profits.

EDIT: Thanks for the postgresapp.com link! It is a little bit different - here
I wanted to use the very same sources as Apple, without adding too much cruft
(like a UI to start/stop the daemon, as I had seen in other packages). I also
wanted to see for myself how hard it was to 'make it work' on OSX (quite easy
besides the missing AEP.make and the logfile error). It was basically an
experiment in recompiling from the sources on Apple's open source website,
while staying as close to the OSX spirit as possible (e.g. keeping the same
user group, using dscl, and using a launchdaemon to start the daemon
automatically during the boot sequence, as for Apache).

That being said, you're right, for most people postgresapp.com will be a
simpler and faster way to run a postgresql server :-)

~~~
Choronzon
SQLite is also an excellent option for a datastore on OSX. It's not nearly as
full-featured as Postgres, but no server application is required, and you get
an OS-independent file per db, which is extremely portable. SQLite
Professional is a relatively decent free GUI you can use as well.

~~~
xradionut
SQLite has limitations on the data types it supports. Most of this can be
worked around in application code, but it can be a pain when you have data
that needs to be accessible by more than one application.

------
scorpion032
As someone who has also been in the thick of some of the "big data" projects
in the industry recently, I have to agree with the article.

One of the terms I learnt at PyData Silicon Valley in March is "Medium Data".
Unless you are dealing with terabytes of RAM and exabytes of storage, Google
style, the overhead of having to maintain a cluster is something most
(intelligent) people try to avoid.

When you can't avoid hundreds of machines, the cluster is a necessity and you
design that way. But given where the Moore's law curve stands today, most
organisations really don't need that.

You can rent servers on Amazon with 250 gigs of RAM for a few dollars an hour.
They specifically call it the big data cluster. It is possible to analyse the
data fairly easily using tools like Pandas/Matplotlib and others in the
scientific Python ecosystem.

These tools have been used by scientists and industry for a really long time;
they just aren't really advertised that way.

For instance, here is some analysis I was doing recently of children's names
in the US since 1880, with 3 million records:
<http://nbviewer.ipython.org/53ec0c5a2fabcfebb358>. My Mac could handle it
without even breaking a sweat.
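
The analysis itself is nothing exotic; roughly something like this (the file
and column names are assumptions here):

    import pandas as pd

    # Assumes one CSV with one row per (year, name, sex) combination.
    names = pd.read_csv("names.csv", names=["year", "name", "sex", "births"])

    # Total births recorded per year, and the most popular name each year.
    per_year = names.groupby("year")["births"].sum()
    top_names = names.sort_values("births", ascending=False).groupby("year").head(1)

    print(per_year.tail())
    print(top_names[["year", "name", "births"]].tail())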

~~~
jnazario
i often tell people: if your solution to moving data necessarily involves
shipping contracts - as opposed to "we'll just upload it" or even "i'll just
burn it to a DVD" - you're not in big data. (this is akin to "if you don't
worry about power and cooling and instead worry about FLOPS, you're not in
super computing" from the 90s.)

last year i was talking about an implementation we did for some data and was
asked about our scale, "hundreds of terabytes" was my answer. for the people
we were talking to - people who know big data - that sufficed (although a bit
small on their scales, but it did require big data thinking and constructs to
get answers in a reasonable amount of time).

i hadn't realized how many people were wrongly moving to "big data" solutions
until i read these discussions around this article. color me surprised.

------
ctrager
Even if the data isn't big, there can be a benefit from the Hadoop
infrastructure. Say you have just 86,400 rows of data, but each row takes 1
second to process. That adds up to 24 hours of elapsed time, and waiting for
that run can be painful, especially if you are trying to experiment and
iterate. With HDFS/MapReduce you can distribute that work across N machines
and divide the elapsed time by N, speeding up the pace of iteration. I've
worked on a project that had exactly this challenge, before Hadoop was
available, and we had to invent our own crappy ways of distributing the data
to the N machines, monitoring them, and collecting the results. Hadoop HDFS
and MapReduce, with the Job Tracker, etc., would have been much better than
what we came up with.
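
For jobs of that size today you can often get the same divide-by-N effect on a
single box; a rough sketch, with a stand-in for the expensive per-row work:

    import time
    from multiprocessing import Pool

    def process_row(row):
        time.sleep(0.01)  # stand-in for the ~1 second of real work per row
        return row * 2

    if __name__ == "__main__":
        rows = range(86_400)
        with Pool(processes=16) as pool:  # elapsed time drops by roughly N workers
            results = pool.map(process_row, rows, chunksize=100)
        print(len(results))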

~~~
dk8996
Unless your problem is I/O bound (you can't get the data off the disks fast
enough) or network bound (transferring data to the worker nodes takes too
long), using Hadoop is the wrong choice. CPU-bound problems are better solved
with grid solutions that do a better job of scaling up (within a single node)
and scaling out to multiple machines. Taking a step back, you should always
ask yourself if this can be done on a single machine, taking advantage of
Moore's Law.

------
inthewoods
I've seen a fair number of startups that throw around how they are going to
make big money by utilizing the data they gather (called "big data" regardless
of size). It's all a bit of magical underpants thinking: we'll gather a bunch
of people/users, we can't figure out how to make money off of advertising or
charging them, so we'll talk about how the "big data" they produce will be
worth a fortune and people will pay to have access to it. I know some folks in
the HR SaaS space who think this is how they'll hit $100m. It's just comedy.

------
rdtsc
"We don't have big data. Our data is small, and could be easily stored in a
MySQL or even a flat file" said no dev team ever. Everyone is "just like
Google" so they need NoSQL, scaling, clouds and so on.

------
sytelus
As someone who runs jobs on giant clusters day in, day out, I just looked at
my last job. It did indeed have input data of ~100GB. However, the size of the
input data is misleading. The job does a lot of processing, generates ~5TB of
intermediate data, and took 800+ machine hours to complete. If I'd run that on
my desktop I would be waiting a month for it to finish. On the cluster it took
~4 hours.

I had to smile at the statement "Is more data always better? Hardly". There is
an old saying in the world of data scientists: there is no data like more
data. Yes, its value may be diminishing, but when your competitor is trying to
squeeze out a gain in the second decimal, you are probably better off
accepting more data.

So the moral of the story is, all of this really depends. People do get fired
for buying clusters. Modern cluster management software tracks several
utilization metrics, and someone, some day, is going to look at them and point
out what a bad decision it was.

------
xradionut
The reason the "big data" pimps can get away with this is that most of the
people that should know, (that aren't DB programmers, DBAs, true scientists or
engineers in the domain), don't know shit about data and generally too fscking
lazy to learn. So they buy into the latest wave of buzz words and hype.

------
pekk
It's precious to read through almost every post in this thread complaining
about 'big data' and saying that everyone can just use a normal relational
database or whatever. But 'big data' has brought markets for HN-type
entrepreneurs to exploit, and jobs and loads of prestige for HN-type engineers
- whom I have never known to be shy about bragging about how much data is in
their systems, without regard to whether that data is particularly meaningful.

------
lambersley
I'm getting to the age where things start coming back under new branding. I
remember in my childhood when my father would talk about bell-bottoms and how
trendy they once were. Then they came back and he was shocked.

I remember Doc Martens. They're back. I remember the gumby haircut. It's back.
I remember ripped jeans... also back.

Technology follows this cyclical trend as well, we just give it fancy names
like Big Data, Cloud and Anything-as-a-Service.

------
_pmf_
> Most data isn’t “big,” and businesses are wasting money pretending it is

Most business leaders are not rational, and we should stop pretending they
are.

------
ams6110
Most blogs don't need javascript, and publishers are pissing off their readers
by pretending they do.

------
wmf
Related discussion from three weeks ago:
<https://news.ycombinator.com/item?id=5602727>

------
eksith
In an effort to _Store All the Things_, a lot of companies have talked
themselves into a rhetorical corner of poorly fitting shoes. At this point,
when asked about Big Table and NoSQL, we've stopped wasting time and instead
demo their storage stack on a different framework until their eyes widen.

Then, when questions come in about how much engineering went into this "thing"
that does such a good job of keeping so much data secure, we say it's built on
Postgres.

As DevOps Borat says :
<https://twitter.com/DEVOPS_BORAT/status/313322958997295104>

How well you can utilize it and how quickly is just as important as what kind
of data you store in the first place.

------
jamesaguilar
Be wary about drawing conclusions from "most of the jobs were small." Most of
my jobs are small -- because I'm running experiments so I won't have to redo
the big one.

That said, I'm a huge proponent of running stuff simply at first. Few
businesses will ever grow to the point that they need more than a single large
database server and one or two backups. Don't waste your time prepping for
something you'll probably never need, especially when fixing the problem when
the time comes is only marginally more painful than doing it right in the
first place.

------
jwilliams
For me, "big" data is increasing the linkage between your data. It's not
simply more data, but much richer, less formal data relationships. It's taking
your sales data and linking it to your website clicks, linking that to the
weather (or whatever). Or you take something traditionally static and add a
temporal dimension.

This kind of deep linking you can't measure with straight megabytes. A few gig
doesn't seem that large, but if it's a complex graph with a complex hypothesis
- then, sure - that's big.

~~~
PeterisP
Well, that's the whole point - "A few gig doesn't seem that large, but if it's
a complex graph with a complex hypothesis"... then it's still not 'big data'.

It's maybe smart data, maybe detailed data, but definitely not big data - that
problem will have completely opposite needs and techniques than big data
analysis, and should not be mischaracterised as such.

------
pesenti
Most companies today are already using scaled up servers to host their medium
size warehouses (think Teradata or Exadata). That approach is very expensive
(> millions of dollars), only works well with well-defined data, and does not
scale well beyond a few TBs.

Hadoop is not just about running large jobs on very large data. Hadoop also
makes sense when trying to scale on commodity hardware or running ad hoc
queries (which can target a small amount of data) on medium to large data
sets.

~~~
flatfilefan
Expensive - yes, comparatively. Only works well with well-defined data - yes;
poorly defined data is hard to use in any statistical calculation too. Does
not scale well beyond a few TB - bullshit. It scales really well.

------
aswanson
Oracle writes shitty 'enterprise apps' (god I hate that phrase) that they sell
to big companies, because their salesmen/women wear great attire and are good
at mirroring dumb CEOs/CIOs, like the ones who run several companies I have
worked for. Will someone please end this nonsense? At what point do
usability/stability/utility become factors?

~~~
mattzito
This is a viewpoint that I hear a lot, mostly from people who are not in the
room when these grand enterprise implementation decisions are made. While it's
true that a good salesperson can make a difference in winning a deal vs.
another vendor, salespeople almost never convince a company that they _need_ a
big enterprise software platform. 95% of the time, the company has already
decided that the current way they do X is broken, and now the salesperson can
convince them that they have the solution to that.

The truth is that very often, X is broken inside an organization not because
of executive management, most of whom don't care what software packages get
used or who they buy from or anything else like that, but rather big software
companies get brought in because the technology/backoffice organization inside
the company is a disaster.

Accounting system doesn't properly allocate widget expenses to different cost
centers? Takes a week to update the homepage? No one knows where exactly
sensitive data is being stored?

That's all the technology organization's failure in one way or another. And
when things get bad enough, senior management says, "Okay, our homegrown
accounting system is just not doing the job for us anymore", and here comes
Oracle, happy to sell them their accounting system, which has all of the
features they could possibly want, and sure, it's expensive, but it works, as
opposed to the busted system they've got currently.

Of course, the next failure then, is that the people who will be running and
overseeing and architecting this solution are either the same people who
cocked up the accounting system in the first place, _or_ consultants who have
absolutely zero incentive to do anything other than maximize billable hours.

This means that instead of the organization saying, "We will adapt to off the
shelf software and change our processes to better align with the way the
software is designed to be used", they say, "Make your software work the way
_we_ do things".

Now we're off to the races, as various fiefdoms inside of the big company make
their pitch about what needs to be customized. Everything from the layout of
the screens to the workflow processes to the data model, everything has to be
matched to exactly the way the customer wants to do things.

Back at Oracle HQ, the RFEs have been flying in from not just that customer,
but the other 200 new customers being onboarded, and every one is basically a
demand for a way to modify this or that option - no one is saying, "We wish
there were fewer fields on this page".

So the customers demand more features, Oracle delivers them, and then the
customers promptly use those features to further complicate their platforms,
because they don't have the technical discipline to say, "No, we really don't
need to support different SKU revenue allocations based on currency, we'll
just do it by hand at the end of every quarter".

Looking at it a different way - how is making the software simpler going to
help Oracle win business? If anything, the more features the product has, the
more points they get on the RFP from the next big customer.

So everyone is to blame - Oracle makes money selling and implementing very
complex technology solutions because they're answering the demands of their
customers who depend on overly complex technical requirements because their
technology organizations are poorly run because they don't have any discipline
because senior management isn't technical enough to recognize where the
failure is.

tl;dr - enterprise software is not broken because of the sales people or upper
management, it's broken because the technology organizations are bad at their
jobs

~~~
rmrfrmrf
> _This means that instead of the organization saying, "We will adapt to off
> the shelf software and change our processes to better align with the way the
> software is designed to be used", they say, "Make your software work the way
> we do things"._

This must be a damned if you do, damned if you don't kind of situation,
because I work for a company that attempted to use an OOTB Oracle software
package and ended up getting roundly criticized by every part of the company,
both internally and externally.

~~~
mattzito
Yeah, it very much is a tough row to hoe on either side - and btw, even just
adapting to the OOTB package will still cost you a ton of money, and often in
ways you didn't expect:

I was loosely associated a few years back with a manufacturing company
migrating from their 20 year-old mainframe-based ERP solution to Oracle's ERP.
They really had the worst of both worlds, because not only did they have 20
year-old business processes that no one wanted to change, but the whole
interface for Oracle was so radically different from the "green screen" 3270
interface of the current system that you couldn't even make Oracle look
anything like that. It was doomed to be a complete mess.

But to the point I'd originally planned to make, they tested the system in
limited release, and then went live with it for one particular function, which
was generating and printing order cards or something like that. What no one
had thought of, and what didn't occur in testing because it wasn't a real
workload, was that the old system sent raw text to the printers at the various
factory sites, while Oracle (iirc) was generating postscript, complete with
logos and formatting, and sending that to the printers at the
factories... which it turns out were connected over 128kb/sec links that were
promptly swamped by the size of the files.

So the whole project had to be put on hold until all of the links between HQ
and the factories could be upgraded, which took months, and the feedback from
the userbase was, "What a piece of shit Oracle is, our 20 year old system can
print to the factories, why is it so hard for them to do that?!?!"

EDIT: looked back in my notes, 128kb/sec lines, not 512

------
zeckalpha
It's a buzzword, not a quantifiable thing.

~~~
christopheraden
The fact that so many people are calling things "big data" when the data is
not high volume lends credence to your statement. (The most popular definition
I've seen is the 5 V's; "big" seems to be a misnomer in this case, as only
volume could really be called a measure of "big".)

------
pfarrell
There are two ways to define BigData.

1. The accumulation, integration, and analysis of a larger number of data
sources.

2. A volume of data that presents challenges running analysis functions across
it, due to the limits of the tools available.

1 is fraught with the kind of statistical pitfalls mentioned in the posted
article. 2 describes a set of problems and boundaries that are time-sensitive:
what was BigData in 2006 (to, say, LiveJournal or Digg) may no longer hold. As
a data engineer, it's important to keep a skeptical eye on marketing and make
sure we're delivering valuable solutions that increase the bottom line for our
business, not just produce "ain't it cool" type correlations.

------
mercurialshark
Extrapolating relatively few truly random data points from massive datasets,
for analysis and modeling, is what "Big Data" is all about. This article would
have you think that working with clusters or snippets of impossibly ginormous
datasets is somehow less "Big", but that's sort of the point. Perhaps someone
should inform the author that having more data available doesn't translate
into working with more data.
------
BadassFractal
I wonder if both "responsive web design" and "big data" were just hoaxes that
we were all fed to sell more books and seminars.

------
emiliobumachar
> The “bigger” your data, the more false positives will turn up in it, when
> you’re looking for correlations

I think they are talking about the Texas Sharpshooter Fallacy:

<http://en.wikipedia.org/wiki/Texas_sharpshooter_fallacy>
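
A quick simulation shows the effect: with enough columns of pure noise, a few
percent of them will look "significantly" correlated with whatever you test
against (sketch):

    import numpy as np

    rng = np.random.default_rng(0)
    n_rows, n_cols = 1000, 5000

    target = rng.normal(size=n_rows)
    noise = rng.normal(size=(n_rows, n_cols))

    # Correlation of each pure-noise column with the target.
    corrs = np.array([np.corrcoef(target, noise[:, j])[0, 1] for j in range(n_cols)])

    # |r| > ~0.062 corresponds to p < 0.05 at n=1000, so roughly 5% of the
    # columns will clear it by chance alone.
    print((np.abs(corrs) > 0.062).sum(), "spurious 'significant' correlations")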

~~~
gizzlon
Nate Silver writes about this in his book - highly recommended.

<http://en.wikipedia.org/wiki/Nate_Silver#Book>

------
quaunaut
Said this before, and I still want to see it fixed: I can't stand to read the
page with the huge grey box on the right side, which disables scrolling unless
the content is moused over. Last time, though, I didn't include my environment
info.

Windows 7 Ultimate SP1, Chrome Version 26.0.1410.64 m

------
zenocon
I've been thinking the same thing as the premise of this article for a while
now. More often, I think people just write horrible code / poorly designed
systems that perform sluggishly and underwhelm... and then someone cues Mr.
Big Data as the silver bullet.

------
kaa2102
Clarifying the scope of a project or data collection & analysis effort is
paramount. You never want to attempt to boil the ocean. The key is to figure
out the data that matters most to your company or organization's strategy.

------
ksk
Does anyone know how much indexed data Google has for their search? (Not the
size of the database) I'd bet it wont be over a few hundred TB - Something
that can fit on most desks in the not too distant future.

------
ohwp
Also: a lot of people seem to equate the number of records with complexity.

------
cmccabe
This article is the equivalent of "horse drawn carriages are perfectly
adequate for most journeys, and much more pleasant and commodious to boot."
Good luck with that, buddy.

You're not going to know what correlations are important and which are not
until you study the data. Telling people to just collect the "important data"
is like telling someone who has lost his keys just to go back to where he left
them.

It's also more than a little insulting to FB and Yahoo to insist they are not
web scale. The problem of small jobs on MR clusters is real, but even with
small jobs, Hadoop turns out to be a lot more cost-effective than various
other proprietary solutions which are your only real enterprise alternative.
The problem of small MR jobs is being solved by things like Cloudera Impala,
which can run on top of raw HDFS to perform interactive queries.

~~~
Choronzon
The problem is that your ability to explore the data and the data volume are
inversely correlated. You are far more likely to find interesting things
exploring an in-memory dataset using something like ipython and pandas than
throwing Pig jobs at a few dozen TB of gunk. Big data is great if you know
exactly what you are looking for. If you get to a stage where you are trying
to explore a huge DB looking for relationships, you need to be very good at
machine learning and statistical analysis (spurious correlations ahoy!) to
come out significantly ahead. It's also an enormous time sink. In summation,
the bigger the data, the simpler the analysis you can throw at it efficiently.

~~~
ims
Very true. Wouldn't the typical approach to this involve probabilistic
methods, like taking large-ish (but not "Big") samples from your multi-TB data
and doing your EDA with those?

~~~
Choronzon
That would work very well if our random sample accurately reflected the
superset of data, which it almost always does, but you also want to consider
the following...

Imagine our data was 98% junk, with 2% of it consisting of sequential
patterns. We might be able to spot this on a graph relatively easily over the
whole dataset, but random sampling would greatly reduce the quality of this
information.

We can extend that to any ordering or periodicity in the data. If the data at
position n has a hidden dependency on the data at position n +/- 1, random
sampling will break us.

~~~
cma
Do random sampling plus n lines of surrounding context.
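
Roughly this (sketch; for a file too big for memory you'd sample byte offsets
instead of reading it all in):

    import random

    def sample_with_context(path, k=1000, context=2, seed=0):
        """Sample k random line indices, then keep each sampled line plus
        `context` lines on either side, preserving file order."""
        with open(path) as f:
            lines = f.readlines()
        random.seed(seed)
        centers = random.sample(range(len(lines)), min(k, len(lines)))
        keep = set()
        for c in centers:
            keep.update(range(max(0, c - context), min(len(lines), c + context + 1)))
        return [lines[i] for i in sorted(keep)]

    sample = sample_with_context("events.log", k=1000, context=2)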

------
coherentpony
Yeah. I'm a scientist that deals with huge datasets. _Huge_. I must admit that
I do cringe a little every time I see the words 'big data'.

Disclaimer: I haven't read the post. Only the title.

------
Tossrock
Sometimes, though, you really do have lots of data and need appropriate
solutions. At Quantcast, our cluster processes petabytes per day and our edge
datacenters handle hundreds of thousands of transactions per second. In fact,
we recently open sourced our file system (QFS[1]), a drop-in alternative to
HDFS that can up to double FS capacity on the same hardware. Although it's
certainly true that not every company (or even most) needs all that
horsepower, there are definitely some for whom it's the core of their
business.

[1]. <http://quantcast.github.io/qfs/>

~~~
zv
Thanks for the self-advertisement. But from what I understood, lots of
businesses treat a few gigabytes as big data. It's about the fashion of
calling yourself a "big data" user.

------
GigabyteCoin
What is the author of this article trying to say here?

> _it appears that for both Facebook and Yahoo, those same clusters are
> unnecessary for many of the tasks which they’re handed. In the case of
> Facebook, most of the jobs engineers ask their clusters to perform are in
> the “megabyte to gigabyte” range (pdf), which means they could easily be
> handled on a single computer—even a laptop._

That facebook or yahoo could be run from a laptop?

~~~
wmf
It's specifically talking about analytics jobs, not the user-facing stuff.

