Most data isn’t “big,” and businesses are wasting money pretending it is (qz.com)
394 points by angelohuang on May 12, 2013 | 156 comments

As someone who is currently dealing with these sorts of things, I can tell you this article hits the nail on the head.

Most, heck, something like 99.99% of all the so-called big data I've dealt with is something I wouldn't even classify as small data. I've seen data feeds measured in KBs sent over to be handled as 'big data'. It happens all the time. A simple data problem easily solved with a small embedded database like SQLite is routinely taken to 'the grid' these days. It reminds me of the XML days, when everything had to be XML. I mean every damn thing; these days it's NoSQL and big data.
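To make that concrete: a KB-scale "feed" fits comfortably in an embedded store. A minimal sketch, using Python's built-in sqlite3 with a hypothetical feed table and made-up rows:

```python
import sqlite3

# Load a small "data feed" into an embedded SQLite database -- no cluster needed.
conn = sqlite3.connect(":memory:")  # or a file like "feed.db"
conn.execute("CREATE TABLE feed (ts TEXT, item TEXT, qty INTEGER)")
rows = [("2013-05-12", "widget", 3), ("2013-05-12", "gadget", 5)]
conn.executemany("INSERT INTO feed VALUES (?, ?, ?)", rows)

# The kind of aggregation people reach for Pig/Hadoop to do:
total = conn.execute("SELECT SUM(qty) FROM feed").fetchone()[0]
print(total)  # 8
```

SQLite will happily handle this pattern up to many gigabytes on a single machine.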

People wrongly contort their schema design just so it fits into a NoSQL store, then use something like Pig to generate data for it. The net result is that they end up badly reinventing parts of SQL all over the place. If only they understood a little SQL and why it exists, they could save themselves all that pointless complexity. Besides, avoiding SQL where it's appropriate creates all sorts of data problems in your system. You will go on endlessly reinventing things SQL already offers while bloating your code. You will read a big chunk of code, only to figure out the author actually intended something like a nested select query, implemented very badly.

Besides, I find much of this big data thing a total sham. Back in the day we would write Perl scripts to do all sorts of complex data processing (with SQL, of course). Heck, I've run some very big analytics systems and automation setups in Perl, doing far more difficult things than people do with 'big data tools' today.

In larger corporations this has become fashionable. If you want to be known as a great 'architect', all you need to do is bring in these pointless complexities. Ensure the setup becomes so complicated it can't be explained without a hundred jargon terms totally incomprehensible to anybody beyond your cubicle. That is how you get promoted to architect these days.

I really don't understand why anyone has a problem with relational databases. Once you take the time to understand how they work (by taking a class or reading a book), it's really straightforward and makes a lot of sense. Not to mention it's really fast and quite reliable.

I get that a NoSQL-ish alternative makes sense for companies that have tons of shards spanning the globe, but for the vast majority of people, a relational database serves just fine.

I don't have a problem with relational databases. Or rather, I don't have a problem with relational data modelling.

I do have a problem with Oracle. Even the Oracle experts at my former job could barely get Oracle to do something sensible, and running it on my own computer was basically a death sentence for getting anything done.

I have a problem with MySQL, from a sysadmin perspective. When I had it installed, MySQL was the package that would always break on every update. No upgrade was small enough that the data files would continue working.

(I don't have experience with Postgres, but SQLite seems more comfortable than any of the mentioned alternatives)

I have a problem with schemas in my database. It requires upfront work with modelling my data. I'd rather iterate. Also, nobody I've worked with seems to put the schemas in automatically; you need to run the special "initdb" script that isn't maintained to make it start working.

I have a problem with SQL. It would be awesome if we had a standard query language, but we don't. You can't apply the same SQL to different database engines: mostly because it won't even compile, secondarily because it will give different results, and finally because it will have a completely different performance profile.

All of this can be fixed by learning stuff, so I know better what I am doing.

But I already know CouchDB [1]. It took me little effort to learn, and it makes a lot of sense to my mind. I can solve problems with it, and it has neat properties with regards to master-master replication. So for me, CouchDB works just fine, just like a relational database works just fine for you :)

So, from my perspective, it seems that using some SQL solution would be the time consuming option.

[1]: CouchDB can't be considered a "Big data" database for many cases. It is slow. But it scales neatly :)

This is all pretty sane, except the schema-less part. I just don't understand why people get all hung up over schemas. Sure, migrations are a minor inconvenience, but if you just add fields in an ad-hoc fashion over time, the data becomes messy and it's hard to determine any invariants about a large dataset. Sure, this can be avoided through careful code organization, but then aren't you just re-inventing schemas out-of-band?

Yeah, I have experimented with schemaless stores a bit (mostly with JSON fields in Postgres, for now), to avoid adding columns for various things to the database, and I am not impressed. Not only are initialization and migration hard, but I've come to realize that there's no downside in adding these to the schema itself.

Sure, the schema gets a bit bloated and dirty, but having undocumented fields in a dict whose existence is signified only by a line of code assigning something is not better.

Where schemaless stores are great is prototyping. Especially for more algorithmic code, where I don't really know what storage I'll be needing and the algorithm will only live there for a few days, schemas are just a burden. That's why I wrote Goatfish: https://github.com/stochastic-technologies/goatfish

I think this is due to how you work and what you build. If you plan it out, focus on the data structures you need and then build it, schemas are fine because you know up-front what you want.

In more ad-hoc, constantly changing, "oops, this didn't work because of unknown factor X" type projects, schemas are a pain. It sounds really nice to have a data store that adapts when it's impossible to know up-front what you need from your data structures.

In more ad-hoc, constantly changing, "oops, this didn't work because of unknown factor X" type projects, schemas are a pain.

Such a pain:

    ALTER TABLE foo ADD COLUMN baz varchar(64);
From what I've seen, a significant chunk of the desire to use schemaless NoSQL hipster DBs is simply a desire to avoid learning SQL as if it were a real programming language.

The only real use case I've ever seen for schemaless DBs is fields with lots of ad-hoc columns added by multiple people (typically logging/metrics data).

"lots of ad-hoc columns added by multiple people"

Inevitably completely un-normalized and junk data. Even worse with no documentation or procedure anything ever added becomes permanent legacy that can never be removed. Been there, lived it, hated it, won't allow it to happen again.

So you won't allow logfiles to happen again?

LOL, no, not as a primary store of data, no never again.

Plain text logs are a great place to funnel all "unable to connect to database" type of errors for dbas / sysadmins to ponder, however.

I've implemented quite a few systems where data changes are logged into a log table, all fully normalized, so various reports can be generated about discrepancies, access rates, and things like that. It's also nice for end users to see who made the last change, etc.

Trying to reverse engineer how some data got into a messed-up state using logs that can be JOINed to the actual data tables as necessary is pretty easy, compared to writing a webserver log file parser that reads multiple files to figure out who did what, and when, to the data that left it screwed up. You only parse log files for that once before deciding to do it relationally. Debug time drops by a factor of 100x.

I'd like to pick your brain, as that's the problem I'm facing right now. I have a web site that is accessed by users, and I would like to get a comprehensive picture of what they do. I already have a log table for recording all changes (as you said, I can show it to the users themselves so they know who changed what in a collaborative environment), but I'm struggling to define a meaningful way to log read access. Should I record every hit? Would that be too much data to store and process later? Should I record some aggregates instead?

Rather than aggregates, sampling.

Sounds like you're not interested in absolutely every event for security or data-integrity audits, and more interested in general workflow. Be careful when selecting samples to store, because what superficially looks random might not be (I dunno, a DB primary key, or a timestamp ending in :01?). If (one byte read from /dev/urandom) == 0x42, log it this time.

Also, it depends on whether you're unable to log everything because of sheer volume, or lack of a usable purpose, or something else. For example, if you've got the bandwidth to log everything but not to analyze it in anything near realtime, maybe log everything for precisely one random minute per hour. So this hour everything that happens at minute 42 gets logged, everything at minute 02 the next hour, whatever. Better mod 60 your random number.

You can also play games with hashes. IF you have something unique per transaction, and you have a really speedy hash, then a great random sampler is to hash that unique value and only log if the hash ends in 0x1234 or whatever.
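A toy sketch of that hash-based sampler, assuming a hypothetical per-transaction id string and an arbitrary 1-in-256 rate:

```python
import hashlib

def should_log(transaction_id: str) -> bool:
    """Deterministic ~1/256 sampler: hash something unique per transaction
    and log only when the last digest byte lands in one chosen bucket."""
    digest = hashlib.sha256(transaction_id.encode()).digest()
    return digest[-1] == 0x42

# Roughly 1 in 256 transactions get logged, and the same transaction id
# makes the same decision on every frontend -- no coordination needed.
sampled = sum(should_log(f"txn-{i}") for i in range(100_000))
print(sampled)  # roughly 100000/256, i.e. around 390
```

The nice property over `/dev/urandom` is determinism: every machine that sees the same transaction makes the same log/skip decision.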

If you have multiple frontends, and you REALLY trust your load balancer, then just log everything on only one host.

I've found that storing data is usually faster and easier than processing it. Your mileage may vary.

Another thing I've run into: it's really easy to fill a reporting table with indexes, making reads really fast while killing write performance. So make two tables, one index-less that accepts all raw data and one indexed to generate a specific set of reports, then periodically copy some sample out of the index-free log table into the heavily indexed report table.
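A minimal sketch of that two-table pattern, with hypothetical table and column names, again using sqlite3 for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Index-free table: cheap writes for the raw firehose.
conn.execute("CREATE TABLE raw_log (ts TEXT, user_id INT, action TEXT)")
# Indexed table: fast reads for a specific set of reports.
conn.execute("CREATE TABLE report_log (ts TEXT, user_id INT, action TEXT)")
conn.execute("CREATE INDEX idx_report_user ON report_log (user_id)")

conn.executemany("INSERT INTO raw_log VALUES (?, ?, ?)",
                 [("t1", 1, "read"), ("t2", 2, "write"), ("t3", 1, "read")])
# Periodically copy a sample out of the raw table into the indexed one.
conn.execute("INSERT INTO report_log SELECT * FROM raw_log WHERE action = 'read'")
n = conn.execute(
    "SELECT COUNT(*) FROM report_log WHERE user_id = 1").fetchone()[0]
print(n)  # 2
```

Writes only ever touch the unindexed table, so ingest stays fast; the index maintenance cost is paid in batch during the copy.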

It's kinda like the mentality of doing backups. Look at how sysadmins spend time optimizing a full backup tape dump, and sometimes just dump a delta from the last backup.

You're going to have to cooperate with operations to see what level of logging overloads the frontends. There's almost no way to tell other than trying it unless you've got an extensive testing system.

I've also seen logging "sharded" off onto other machines. Lets say you have 10 front ends connecting to 5 back ends, so FE#3 and FE#4 read from BE#2 or whatever. I would not have FE3 and FE4 log/write to the same place they're reading BE2. Have them write to BE3 or something, anything than the one they're reading from. Maybe even a dedicated logging box so writing logs can never, ever, interfere with reading.

Another strategy I've seen, which annoys the businessmen: assuming you're peak-load limited, shut off logging at peak hour. Or write a little thermostat cron job where, if some measure of system load or latency exceeds X%, logging shuts off until 60 minutes in the future or so. Presumably you have a test suite/load tester that figured out you can survive X latency or X system load or X kernel-level IO operations per minute; if you exceed that, all your frontends flip to non-logging mode until load drops beneath the threshold.

This is a better business plan because instead of explaining to "the man" that you don't feel like logging at their prime time, you can provide proven numbers: if they're willing to spend $X, they could get enough drive bandwidth (or whatever) that it would never go into log-limiting mode.

Try not to build an oscillator. Dampen it a bit. Like: the more you exceeded the threshold in the past, the longer logging is silenced in the future, so at least if it does oscillate it won't flap too fast.

One interesting strategy for sampling to see if it'll blow up is sample precisely one hour. Or one frontend machine. And just see what happens before slowly expanding.

It's worth noting that unless you're already running at the limit of modern technology, something that would have killed a 2003 server is probably not a serious issue for a 2013 server. What was once (even recently) cutting edge can now be pretty mundane. What killed a 5400 RPM drive might not make a new SSD blink.

(Whoops, edited to add: I forgot to mention that you need to confer with the decision makers about how many sig figs they need, and talk to a scientist/engineer/statistician about how much data you must collect to generate those sig figs. If the decision makers actually need 3 sig figs, then storing 9 sig figs' worth of data is a staggering financial waste, but claiming 3 sig figs when you really only stored 2 is almost worse. I ran into this problem one expensive time.)

It sounds to me like you're trying to reinvent web analytics. Is there a reason you need user level granularity or is aggregate data enough?

My web site has separate "accounts" for multiple companies, each with multiple users. I'd like three levels of analytics: for a given company (both I and the company's agent would like to see this), across all companies for all users (I will see this), and rolled up within each company (i.e. a higher-level view where companies are being tracked, not individual users).

User-level data might be useful for tech support (although this is currently working fine with text-based log files and a grep).

So I guess I am not sure... I might be content with web analytics. Each company has its own URL on the site, like so: http://blah.com/company/1001/ViewData, http://blah.com/company/1002/ViewData, etc. Using e.g. Google Analytics I could see data for one company easily, but can I see data across all companies (how many users look at ViewData regardless of company id)? Can I delegate to the owner of company 1001 the ability to see only their part of the analytics?

Another monkey wrench is the native iPad app - ideally the analytics would track users across both native apps and the web site.

> using logs that can be JOINed to actual data tables as necessary is pretty easy

Can you give me a concrete example of how you would use this?

If by concrete you mean business case example, it comes up a lot in "the last three people to modify this record were..."
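A minimal sketch of that query, assuming a hypothetical audit table with (record_id, changed_by, changed_at) columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE audit_log (record_id INT, changed_by TEXT, changed_at TEXT);
    INSERT INTO audit_log VALUES
        (7, 'alice', '2013-05-01'), (7, 'bob', '2013-05-03'),
        (7, 'carol', '2013-05-07'), (7, 'dave', '2013-05-09'),
        (8, 'erin', '2013-05-02');
""")
# "The last three people to modify this record were..."
who = [r[0] for r in conn.execute("""
    SELECT changed_by FROM audit_log
    WHERE record_id = 7
    ORDER BY changed_at DESC LIMIT 3
""")]
print(who)  # ['dave', 'carol', 'bob']
```

Because the audit rows live in the same database as the data, the same question is one JOIN away from any other table, rather than a log-file parsing project.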

Security audit trail type stuff "Why is this salesguy apparently manually by hand downloading the entire master customer list alphabetically?".

This isn't all doom-and-gloom stuff either. "You're spending three weeks collecting the city/state of every customer so a marketing intern can plot an artsy graph by hand of 10,000 customers using Photoshop? OMG no, watch what I can do with Google Maps/Earth and about 10 lines of Perl in about 15 minutes." Or at least I can run a SQL select that saves them about 50,000 mouse clicks in about 2 minutes of work. Most "suit" types don't get the concept of a database and see it as a big expensive Excel where any query more complicated than select * is best done by a peon by hand. I've caught people manually alphabetizing database data in Word, for example.

Another thing that comes up a lot in automation is treating a device differently WRT monitoring and alerting tools if a change was logged within the last 3 days. So your email alert for a monitored device contains a line like "the last config change was made X hours ago by ...". Most of the time when X=0 or X=1, the alert is because ... screwed up, and when it isn't, a short phone call to ... is problem isolation step #1.

This was all normal daily operations business use cases, aside from the usual theoretical data mining type stuff, like a user in A/B marketing test area "B" tends to update data table Q ten times more often than marketing test area "A" or whatever correlation seems reasonable (or not).

That sounds reasonable. I've been thinking about doing logging as a text blob on the affected object, but it hasn't seemed useful enough.

Using your approach, I guess it would be a table like

    create table logs (
        tableName varchar,
        oldValue text,
        newValue text,
        userID int,
        changedAt datetime  -- "when" is a reserved word, so renamed
    );
Or did I miss something?

I've worked on systems where changing a single column meant that we'd have to take four hours of downtime while MySQL slowly...did whatever it does.

Is this the old bug (or whatever) from many years ago where it was faster to drop the indexes, make your schema change, then add the indexes back? Not sure if that still applies anymore. The scenario was something like: if you change the length of a column such that the index needs to recalculate it, like truncating a CHAR(15) to a CHAR(10), or doing something weird with full-text indexes, it would loop through each row, recalc the row for the index, completely sort and insert the row into the index, and then repeat for the next row. So it scaled as if you were inserting each row one line at a time (which can be pretty slow with lots of rows and extensive indexes), but there's a sneakier way to do it.

Or, if by "change a column" you mean something like an "update table blah set x=x+1;" where the x column is part of an index, that used to really work the indexing system hard, one individual row at a time. I think that issue was optimized out a long time ago. I believe there was a sneaky way to optimize around it other than the index drop-and-create trick: do all 10 million increments as part of a transaction, so it would do all 10 million increments, close out the transaction, then recalculate the index. Now, there was something sneaky on top of the sneaky in that you couldn't do a transaction on one update, so you updated all the "prikey is even" rows and then all the "prikey is odd" rows, or something like that, as a two-part transaction. I didn't exactly do this last week, so I may misremember a detail...

> It sounds really nice to have a data store that adapts when it's impossible to know up-front what you need from your data structures.

This seems backwards to me. Relational databases are much better at ad-hoc querying of data, whereas NoSQL scales for narrower access patterns. The fact that you can dump arbitrary columns without a migration is a nice convenience, but in general it will be less queryable than it would be if you added the column in a SQL database.

So that you can store/retrieve data without knowing the schema.

Oh but you have to know the schema right? Yes, some other part of the application knows the schema, but this part doesn't have authority over the DB. Also, the schema may be data as well.

NoSQL reduces the work needed for that.

Perhaps, but can't you just organize your objects by having a table for each type and adding columns as needed? It doesn't sound like such a big deal.

Looks like fun

But why would I waste developer time doing that if I can just do db.table.insert(obj) in, for example, MongoDB (where obj is a JS object)?

Also, finding all objects with a field named 'field1' and value '3' is slower if you do that in a relational DB (and that's the simplest case)

Why on earth would that be slower in a relational db? That makes no sense, since indices in both nosql and relational dbs are variants of b-trees.

Well, it's slower because of a join that exists in SQL (the relationship between your 'field/value' table and the entry). Apart from that, as you said, the indexes are similar.
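The "field/value table" being described is the entity-attribute-value layout. A minimal sketch of the join it forces, with hypothetical table names, using sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entries (id INTEGER PRIMARY KEY, name TEXT);
    -- The generic "field/value" side table (entity-attribute-value):
    CREATE TABLE fields (entry_id INT, field TEXT, value TEXT);
    INSERT INTO entries VALUES (1, 'a'), (2, 'b');
    INSERT INTO fields VALUES (1, 'field1', '3'), (2, 'field1', '7');
""")
# Finding entries with field1 = 3 requires a join through the EAV table...
rows = conn.execute("""
    SELECT e.name FROM entries e
    JOIN fields f ON f.entry_id = e.id
    WHERE f.field = 'field1' AND f.value = '3'
""").fetchall()
print(rows)  # [('a',)]
```

...whereas a document store answers the equivalent find({field1: 3}) by reading one record. Note the slowdown is a property of the EAV layout, not of relational databases: a real field1 column with an index would be just as fast.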

The fun thing about NoSQL skeptics is how they think of only the current scenarios they work with, and they won't believe you until they get burned by it. So be it.

Well, because presumably SQL databases have features you don't get with NoSQL solutions.

> but if you just add fields in an ad-hoc fashion over time the data becomes messy and it's hard to determine any invariants about a large dataset

In systems large enough to be running multiple versions of an app at the same time, talking to the same database, you have to do exactly this--but you have to shoehorn it into relations (doing things like having a field where 99.999% of entries in a column are null, and gradually get computed as the row is accessed by the new version of the app.) NoSQL lets you just say what you mean--that in some versions of the app, the schema is X, in some versions it's Y, and the database isn't the arbiter of the schema.

No. NoSQL lets you say nothing about the schema, and so it becomes a problem for the application layer, above the DB, to handle. In fact, this is much what happens with most "solutions" NoSQL presents to SQL database problems: Let's not implement it, then it's not a problem.

What happens when you push problems up the stack? Do they get solved automatically? No? Will they get solved? Perhaps, if really needed. And, for the cherry on top: Will the solutions be similar to the ones SQL databases use? They will.

You see, when you are implementing atomic transactions, for example, you may get ahead if you have some information about the problem domain. However, for most cases, you are solving the same problem SQL databases solved decades ago. And you'll find the same solution. Just not as well implemented nor as well tested.

"Let's not implement it, then it's not a problem."

It's interesting that, culturally, more than a decade ago, when MySQL tried this strategy with transactions (namely, not having them until roughly the turn of the century), it was reviled, mostly by people who didn't know what transactions were and didn't need them, but were nonetheless very unhappy about MySQL not getting that checkbox checked.

Now it's culturally seen as a huge win to simply not implement something difficult.

I don't know if it's a decline of feature-list length as a fetish, or simple copycatting of others' behavior (perhaps in both situations), or some kind of pull-yourself-up-by-your-bootstraps romantic outlook on reimplementation, or the inevitable result of Homer Simpson meets the database, but whatever it is, it's an interesting major cultural change.

You wouldn't say that if you worked with mysql daily. I do, and every single day I long for the times when my stack used pgsql. Mysql is such an unfixable clusterfuck, with minimal speed advantages over real RDBMSs, that if you use it as the poster child of nosql's path you are effectively arguing against yourself.

I see it more as a move from "one size fits all" to more specialized tools. The term "nosql" is pretty useless, as it's way too general. Both your comments about "not implementing the hard stuff", and the GP's about schemas, only apply to some of the "nosql" projects.

Instead of looking at it as "aaawwm! NoSQL is attacking our beloved RDBMS", try looking at the different projects and what they bring to the table. Maybe some of them can be a useful addition to your systems.

Have to agree with the above. The problem with relational databases isn't the relational model, per se, but the complexity and cost of maintaining a relational database.

Typically, they require specialized database administrators whose primary job is to tune the database and keep it running.

Many businesses, even of moderate size, reach a point where they need to purchase expensive hardware (million-dollar RamSans and expensive servers) to optimize the performance of their database, because partitioning databases is challenging.

So the overhead of running an Oracle or SQL Server database is quite high.

There is huge room for improvement with these traditional database products. If someone made a good cloud database that supported the same feature set but with lower administration and maintenance costs then that might be a better option.

Now that is a reasonable concern. Keeping DB hardware happy is certainly an expensive undertaking. I think I'm more used to arguments like the one from Sauce Labs, where the VP whines, "What are schemas even for? They just make things hard to change for no reason. Sometimes you do need to enforce constraints on your data, but schemas go way too far," [1] and then goes on to say that his company is moving from CouchDB to using MySQL as a key-value store with serialized JSON (data integrity and performance be damned). I mean, really, the thought of converting millions of values in a table to objects just to run a home-grown MapReduce function on them, when you could just LEARN HOW TO USE MySQL, is pretty much the most insane thing I've ever heard, lol.

Do you have any experience with Amazon RDS? I haven't tried it; I guess my concern would be the same as any other AWS product--they tend to fail catastrophically from time to time. Then again, if you're doing cloud NoSQL through Amazon, you're going to run into the same issues (see: Reddit).


The problem of complexity has less to do with being relational and more to do with the data just being large and complex. Relational or not doesn't change that much. If anything, by not keeping it relational, you are much more likely to have a disorganized database that isn't normalized.

This whole "specialized database administrator" point just seems moot considering the equivalent for that are the so-called Big Data developers.

I grew up on rdbmses and think they are great. I've created numerous pieces of software backed by them. I have long preached the mantra that data is more important than the application. The data will almost always outlive the original application and become shared to others.

With that said, the above comes at a time/flexibility cost (not as bad as it used to be but still there) when building a product that isn't quite sure what it will be yet. In these cases a different data store can be beneficial since the app itself is key until traction is gained, if ever.

The time/flexibility problem can be solved (but it's hard). Take a look at DSL Platform (shameless plug) if you're interested in building on top of a database while keeping a very flexible model.

In some ways, it kind of is a sham. I think it's perpetuated by the blog/YouTube style of programming knowledge transference. Those mediums are fine, but there seems to be a rallying cry against learning anything about computing in anything other than bite-size pieces, and thus we get a lot of fad-driven movements and an overpopulation of redundant frameworks and libraries.

Yes, and then because companies start using NoSQL or whatever for problems they could have solved fine in MySQL, they start asking for NoSQL experts when they're hiring.

This makes devs think that they need NoSQL experience, and therefore they find ways to shoehorn NoSQL into whatever problems they are currently solving.

The modern equivalent of object oriented programming, perhaps. I remember similar comments about it in the early/mid 90s.

I think it's more perpetuated by the fact that executives and other corporate folk are easily hypnotized by numbers, however irrelevant they may be. "Interestingly, peak purchases for our product in March occurred Monday through Friday between 7am and 4:30pm, with 807 people looking at page x before purchasing and 806 people looking at page y. Through running our data against national averages, we've found that our peak purchasing times align and are roughly proportionate to population density across the country." You can watch in amazement as every MBA in the room's eyes gloss over; a dopey smile overcomes them slowly as the data intoxicates them.

Throw in a line graph and a heat map and you're basically on a fast track to a promotion without even saying a single useful thing.

Hanging off this great comment: anyone wanting to learn about relational databases, I strongly suggest reading Database in Depth by C.J. Date. It's around 200 pages; you will learn a lot about the relational model and database/query/index design in a short amount of time, with a little bit of logic thrown in there too.

Just need a good SQL reference? 'The Art of SQL' or 'SQL and Relational Theory'

Are you not entertained? Fine, choke on this : http://en.wikipedia.org/wiki/The_Third_Manifesto

And once you know the basics, I strongly recommend "SQL Anti-Patterns" for a good guide for what not to do and why not (and what mitigating circumstances might make an otherwise bad choice OK (or simply the only available option)). I read it while considering myself experienced and found it to be a useful refresher. The style/tone is light and well organised by task/objective, so I suspect everyone down to a beginner will find it well worth their time perusing.

"I read it while considering myself experienced and found it to be a useful refresher."

I'll be honest and say I read it and found it terribly embarrassing yet comforting. Remember that dumb thing I did back in '96? (insert red face) Yeah, I guess I'm not the only guy to learn that the hard way. That lack of deep experience is a significant danger of NoSQL designs. The folks doing that now don't even see the icebergs that relational folks successfully dodged decades ago. Much better to be nostalgic about the olden days of steam engine trains than to not even see the diesel-electric headlight at the end of the tunnel rushing toward you.

Thank you for the recommendation; just ordered a copy for my office.

Ha, I knew that looked familiar, Hugh Darwin is a lecturer where I was at university so half of the coursework for a databases module was in Tutorial D.

Funny to see another ex-Warwickian here. Yes, I remember that coursework. I also found it interesting that Hugh Darwin thought SQL was too lax and flexible. I wish I'd taken the chance to ask his opinions on NoSQL.

Sorry I missed the reference - mathnode == Hugh Darwin ?

mathnode != Hugh Darwen

They are some big boots to fill. I'm just some DBA in London.

Agreed. It is quite amazing what modern ANSI SQL has built into it. If one picks the right storage engine, an appropriate schema design and is rigorous about data quality, the flexibility and performance is incredible.

I've also seen cases where people who run on EC2 end up having to do a lot of extra work because of the bad IO/CPU. They end up with these solutions that are way overkill for such small websites.

From the Berkeley paper on Facebook:

Nonetheless, large jobs are important too. Over 80% of the IO and over 90% of cluster cycles are consumed by less than 10% of the largest jobs (7% of the largest jobs in the Facebook cluster). These large jobs, in the clusters we considered, are typically revenue-generating critical production jobs feeding front-end applications.

So MR job characteristics might follow a power law distribution, and @mims is focusing on one end of the tail. Sure, that's cool!

But then @mims also selectively quotes the TC article, which ends with an excellent point that contradicts his thesis:

The big data fallacy may sound disappointing, but it is actually a strong argument for why we need even bigger data. Because the amount of valuable insights we can derive from big data is so very tiny, we need to collect even more data and use more powerful analytics to increase our chance of finding them.

I think @mims over-pursues the stupid Forbes/BI straw man here. As one would expect with data, the story is complicated. Mom and pop stores don't need to worry about Cloudera's latest offering, but companies working on the cutting edge of analysis still absolutely need tools like Hadoop, Impala, and Redshift.

My feeling is that the title of the article rails against companies who think they need Big Data when they don't.

Even analysing, say, a small number of SERP results for 40k keywords per month quickly generates lots of data.

And if the proposal to analyse a quarter-million terms for kelly search had gone live, I would have been creating over a LOC (Library of Congress) of data for each run.

I've maintained for awhile now that the distinction isn't between "big" and "small" data, but between coarse and fine data. Now that everything is done through the web, previously common data sources (surveys, sales summaries, etc) are being supplanted by microdata (web logs, click logs, etc). It does take a different skill set to analyze noisy, machine-generated data than to analyze clean, survey-like data; it's a skill set that is more biased towards computational knowledge than classical experimental design, hence the shift in emphasis.

I like this distinction. I walked into a world of hurt when I was brought on to look at application user data after years of working with international trade data and national statistics. Even when it comes to formulating a hypothesis and subsequent experiment, the approach is entirely different.

I will say that the article's distinction between small and big data is also important, but that just comes down to processing power. I think the distinction you make is far more important and knowing whether you need coarse or fine data can help keep you out of the issues that are introduced moving from small to big data.

I like that distinction.

But I also don't really mind if Big Data is truly big, because it's clearly different data than what businesses are used to collecting and interpreting today.

I agree completely. I run a company that handles high levels of compute load for financial applications. I often describe what we do as "big compute," not big data, because the data is actually very small in size. OTOH, this tiny bit of data (real-time prices on some 1,000 assets) causes an ENORMOUS amount of computation. Often this distinction doesn't get picked up either, and people might mistakenly classify us as "big data."

I also like that distinction. To me "big" data isn't big until there is a lot of it...and there is a definite distinction between "How many bananas were sold Tuesday?" and "Was user's LED email indicator on when xyz happened?"

Of course one could see it as "IT's revenge" after Scott McNealy so famously said it was dead. There is a lot of power to be had by creating an interface for the customer and then keeping everything behind that interface 'obscure'. They have to have that interface to survive, and if they don't know what goes on behind it they have no way of discerning outrageous costs from reasonable ones. The current exemplar seems to be medical costs.

Back in the 60's there was this chamber of secrets called "the Machine Room", which housed the "Mainframe" and various and sundry high priests who went in and out, and if you literally played your cards (as in punched cards) right, you could get a report on how sales or manufacturing was doing this month.

That got lost when everyone had a PC on their desk, and now some folks are trying to reclaim it :-)

That said the article is still poorly argued. The cost of data management is fairly high. And generally a big chunk of that cost is the cost of specialists who provide business 'continuance' which is code for "makes sure that you can always get your data when you need it, and you can get the answers you need from it in a timely and repeatable fashion." That hasn't changed at all, and whether you have some youngster doing "IT" on the creaky Windows 2000 machine running Back Office or you are using a SaaS company like Salesforce.com, data management is and will continue to be a mission critical part of staying in business.

If I ever want to get rich, I'll set up shop convincing small businesses they need to do things the way Google does, if only they want to remain competitive.

Oracle has used exactly this business model to great success, and obscene profit, for over 30 years.

Believe it or not, there's a world outside Silicon Valley where not every company can hire a large team of engineers to create and maintain their data infrastructure to process sales, do CRM, keep track of manufacturing, etc. That's why companies like Oracle exist (and are very successful).

Edit: I noticed you meant small businesses. However, Oracle does this mainly for the large companies that don't excel at technology.

I hate the term "Big Data" but if it somehow puts Oracle down it can't be all bad.

They need to do it if the cost is affordable. And the fact that Big Data processing became cheap enough to do en masse is a driver.

Take a simple example: no small offline shop would refuse to count every customer's head turn (with direction, angle and frequency), calculate averages, and get a sales-floor attention heatmap broken down by day, hour, age and sex.

That is, if it cost them a $500 one-time fee for 10 cameras and $5/day for the cloud service, and were produced by pressing one big green button.

That's the Big Data driver: businesses are opening their eyes to the ability to analyze (cheaply!) a huge number of small (and even smaller) factors and make better decisions.

I know that's not what you mean, but I find it quite amusing that you describe Oracle (a 30 year old company)'s business as convincing people they need to do things the same way Google (a 15 year old company) does it.

Well, there is also SAP, whose business model is "look at what the big companies are doing; you small-time fella, you need the same thing to grow big as well".

He said they have the same business model. Business models are abstract.

dewitt said "Oracle has used exactly this business model to great success, and obscene profit, for over 30 years". You claim that business models are abstract. Perhaps, except in the case where the word "exactly" is used and a specific company is named. I, like SeoxyS, find the paradox of Oracle being accused of helping businesses to be "me-too" copies of Google, a company half of Oracle's age, amusing.

The model in question is "convince small and medium businesses that they need to buy my software in order to do things the same way as large companies X and Y and have a hope of remaining competitive". For Oracle X and Y were banks, retail and logistics companies, for the new generation of "big data" vendors it is Google and Facebook.

But the poster in question didn't say "X" and "Y", did they? This feels like an exercise in pedantry now, but he really did say "exactly" and "Google".

Ok, here's some "conversational language" insight:

He said: "Exactly this business model" -- that is, as it pertains to its essence.

NOT to be read as:

"Exactly this business model as it pertains to inconsequential details, like which big company they should be imitating".

To be pedantic, he said Oracle follows the same business model, not that Oracle follows Google.

But if Google has only existed for 15 years, how can the business model of "get money out of people by trying to help them do a Google me-too" have existed for 30?

You're putting us to sleep with this tiresome over-literal arguing. Be more interesting.

I guess as a software engineer I can be a little pedantic, literal and detail oriented, perhaps to my own detriment.

I would argue Oracle's model has been to suggest to businesses that they need to do things quite differently from the way Google does...

There's an important distinction to be made between the storage layer and the analysis layer. Something like HDFS can make sense as a storage layer once you hit the > 10TB range even if your average dataset for analysis is reasonably small (and it should be; 99% of the time you can get by with sampling down to single-machine size). That doesn't mean you need to be setting up all your analysis jobs to run via map-reduce; you can usually dump the dataset to a dedicated machine and do it all in one go with sequential algorithms. As a side benefit, you have access to algorithms that are really difficult to express efficiently as map-reduce (eg, computations over ordered time series).
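"Sampling down to single-machine size" can even be done in one streaming pass. A minimal sketch using reservoir sampling (the stream and sizes here are illustrative, not from any real system):

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Replace an existing element with probability k / (i + 1),
            # which keeps the sample uniform over everything seen so far.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

# e.g. sample 100 "rows" from a 1M-row stream without holding it all in memory
sample = reservoir_sample(range(1_000_000), 100, seed=42)
```

The point is that the sampler never needs more than `k` items in memory, so it works the same whether the stream comes from a local file or an HDFS `cat`.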

I think that big data has made math sexy, and selling applied statistics and operations research to small and medium-sized businesses under the guise of "big data" with the intention of providing applied mathematical tools is what is happening in the market.

Statistics involves checking modeling assumptions. A lot of what I've seen with the big data people is the repetition of algorithms to the exclusion of understanding and checking modeling assumptions.

While it's nice that the big data craze is making statistics more popular in the mainstream press, it is important that statistics does not become just an application of numerical methods without consideration of underlying assumptions. I stress this because this has been largely underappreciated in my experience.

This is why I am unconvinced about the prefab products that are currently available. No matter how much you "automate" things, the fact is that you need a human brain, and a decent and careful one at that, to do anything worthwhile. I don't think the majority of companies understand this.
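A minimal sketch of what "checking modeling assumptions" can look like in practice: fit a line to data that is secretly quadratic, and look at the residuals instead of just the fit (the data here is fabricated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fabricated data: y is actually quadratic in x, but we fit a line anyway
x = np.linspace(0, 10, 200)
y = 0.5 * x**2 + rng.normal(0, 1, size=x.size)

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Residuals from a correctly specified model should be patternless.
# Here the first and last thirds sit above the fitted line and the
# middle third below it -- structure that flags the misspecification,
# even though the R^2 of the straight-line fit looks respectable.
thirds = np.array_split(residuals, 3)
means = [t.mean() for t in thirds]
print(means)  # roughly [+, -, +] rather than all near zero
```

No automated pipeline will look at that residual pattern for you; a person has to think to ask for it.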

Your comment reminded me of this recent Krugman post which makes a similar point about economics.


There are a huge number of useful machine learning techniques that don't have checkable "modelling assumptions" per se, just good performance on given tasks (decision trees for instance are really difficult to think about in terms of underlying statistical properties). Heck, even most statistical models are demonstrably false for any given application, yet simultaneously very useful.

Reminds me of a quote I read somewhere about simulation models (specifically referring to Agent Based Modelling) saying that (heavily paraphrasing): "A lot of models are great random number generators"... or "garbage in, garbage out".

I suspect a lot of these people doing "big data models" are as you say, ignoring the importance of having solid assumptions.

Oh well, that's exactly part of what brought about the financial collapse: a bunch of kids get a formula (Black-Scholes) and believe blindly in its magic powers, so they apply it to everything. Fast forward several years and we've got what everybody knows.

"Big data" also checks model assumptions, if only if by monitoring whether or not acting on the information moves a business metric.

Statistics emphasizes inference over prediction, but either one, when done right, validates assumptions.


I meant checking assumptions not just to see whether the use of the big data moved a business metric, but also that the model makes sense from a statistical perspective.

A lot of statistics in business does not bother to check modeling assumptions. Models are chosen based on whether they've been used in the past and what the team is familiar with.

I don't doubt that big data (as we call it now) will one day rule. Ronald Fisher would keel over if he saw the size of datasets we work with nonchalantly on a daily basis. 50 data points (the size of the Iris data) is laughable these days.

My reservation with big data is that the technologies are often unnecessary for the size of the tasks being done. Other than a few data scientists working on truly large projects, most of the big data talk I hear comes from people who aren't fighting in the trenches (execs, marketing, journalists).

It's still amazing what businesses are able to accomplish with summing, counting, percentage of total, % change period over period, average, median, min, max.
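All of those fit in a few lines of plain SQL. A sketch against an in-memory SQLite database (the sales rows are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (day TEXT, region TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("Mon", "east", 100.0), ("Mon", "west", 50.0),
     ("Tue", "east", 200.0), ("Tue", "west", 150.0)],
)

# Sum, count, average, min, max per region, plus each region's share of the total
rows = con.execute("""
    SELECT region,
           SUM(amount)   AS total,
           COUNT(*)      AS n,
           AVG(amount)   AS avg_amount,
           MIN(amount)   AS lo,
           MAX(amount)   AS hi,
           100.0 * SUM(amount) / (SELECT SUM(amount) FROM sales) AS pct_of_total
    FROM sales
    GROUP BY region
    ORDER BY region
""").fetchall()
for r in rows:
    print(r)
```

Period-over-period change is one self-join (or window function) away; none of it needs a cluster.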

It's even more amazing how few businesses are able to compute those operations.

Add in bonuses based on those numbers and it's amazing any consistency exists in their calculation. Basically, in practice you're only allowed a consistent and analytically defensible system if no one's bonus depends on the process being obfuscated. This is why a lot of "big data" is oriented around generating new ideas and new numbers, rather than fixing existing systems and data...

This is exactly right. I'm a member of INFORMS (the operations research professional society), and I can report that a staggering amount of ink has been spilled over the last few years about how to capitalize on the recent "Analytics" and "Big Data" trends.

On the one hand, people are starting to realize that quantitative analysis can help their businesses (mind blowing, right?) -- on the other hand, so much of what you see about "analytics" and "big data" is nonsensical jargon. You have two camps within the OR world: people who want to ride this bandwagon all the way to the bank, and people who want to refocus on getting the message out about what OR really is.

The bandwagon-riders have succeeded to some extent. INFORMS created a monthly "Analytics" magazine[1], created an Analytics Certification[2] (their first professional certification), and so on.

The other camp has a legitimate concern that OR already has an "identity crisis" (operations research vs. management science vs. systems engineering vs. industrial engineering vs. applied math vs. applied statistics etc etc). INFORMS has spent millions trying to get business people to just be aware that it exists. The fear is that hitching our wagon to these trends will just be another blow to our profile when these fad words are replaced by the next big thing.

[1] http://analytics-magazine.org/ (you can get a good feel for the type of content in this publication just by reading the article titles...)

[2] https://www.informs.org/Build-Your-Career/Analytics-Certific...

Not a bad assessment of what seems to be going on.

I am grateful to finally see this in an article. The "big data" craze is being pushed in areas where it really doesn't make sense. We've been bitten by the Big Data bug where I'm at, but it's not coming from the statisticians. It's usually the executives proposing a shift to big data.

People underestimate how much work it would be to shift an old server onto modern technologies and tell the statisticians to use MapReduce and NoSQL instead of SAS and SQL. If the Fortune 500 world has taken this long to catch on to R, imagine how long it'll take to completely change the DBMS and analysis software!

Sure, if you're dealing with 1GB of data it probably isn't worth spinning up a Hadoop cluster to run your analysis. However, if you already have Hadoop up and running for something that genuinely requires it, that 1GB job might make sense there. The data may already be in HDFS, and you already have the infrastructure there to manage and monitor jobs.

The references to Facebook & Yahoo running small jobs on huge clusters may be a little misleading. It may be simply the easiest place for them to deploy those jobs consistently.

But yeah... "Big Data" is a totally meaningless buzzard.

Like that huge firetruck used to put out small fires. Cities only need them for big fires, but, if you gotta have one and keep it ready, it makes sense to deploy it every time.

But realistically 99% of fires are put out by an extinguisher or a pail of water.

"Buzzard" isn't an eggcorn I've ever heard before! Did you mean "buzz word"?

Perhaps he packs a lot of data in his carrion luggage.

You have to take some of these colloquialisms with a grain assault.

I don't like to be kept dark and dry on this one

For most data, it is in fact a waste of money.

Personally, I load the data I play with into a PostgreSQL database on my laptop (if you have a Mac and want to do that quickly, you may want to check out the link I just submitted: http://en.blog.guylhem.net/post/50310070182/running-postgres... )

You can do crazy things with current hardware specs, like loading all the data the World Bank offers for download, indexing it, and using it for regressions (I do). In 2013 you only need a laptop for that.
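As an illustration of how cheap this is, an ordinary least-squares regression over a million rows runs in well under a second on a laptop. The data below is fabricated (not the actual World Bank series), just to show the scale involved:

```python
import numpy as np

rng = np.random.default_rng(1)

# A million fake (income, life-expectancy-like) pairs -- about 16 MB of floats
x = rng.uniform(500, 50_000, size=1_000_000)
y = 50 + 5 * np.log(x) + rng.normal(0, 2, size=x.size)

# OLS fit of y against log(x); the whole thing is a single vectorized call
slope, intercept = np.polyfit(np.log(x), y, 1)
print(slope, intercept)  # recovers roughly 5 and 50
```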

Most data is not big. Big data is "big" like in a gold rush, where the ones selling the tools are making the biggest profits.

EDIT: Thanks for the postgresapp.com link! It is a little bit different: here I wanted to use the very same sources as Apple, without adding too much cruft (like a UI to start/stop the daemon, as I had seen in other packages). I also wanted to see for myself how hard it was to 'make it work' with OSX (quite easy, besides the missing AEP.make and the logfile error). It was basically an experiment in recompiling from the sources given by Apple's open-source website, while staying as close to the OSX spirit as possible (e.g. keeping the same user group, using dscl, using a LaunchDaemon to start the daemon automatically during the boot sequence, as for Apache).

That being said, you're right, for most people postgresapp.com will be a simpler and faster way to run a postgresql server :-)

Another option for running Postgres on OS X very quickly is Postgres.app: http://postgresapp.com

SQLite is also an excellent option for a datastore on OSX. It's not nearly as full-featured as Postgres, but no server application is required, and you get an OS-independent file per db which is extremely portable. SQLite Professional is a relatively decent free GUI you can use as well.

SQLite has limitations on the data types it supports. Most of this can be worked around by application code, but it can be a pain when you have data that needs to be accessible by more than one application.

Re: Postgresql on OSX, slightly off topic.

I think this is probably even quicker than the steps you provided (though you have to remember to start it manually, rely on them updating the build, etc) http://postgresapp.com/

As someone who has also been in the thick of some of the "big data" projects in the industry recently, I have to agree with the article.

One of the terms I learnt at PyData Silicon Valley in March is "Medium Data". Unless you are dealing with terabytes of RAM and exabytes of storage, Google-style, the overhead of having to maintain a cluster is something most (intelligent) people try to avoid.

When you can't avoid hundreds of machines, the cluster is a necessity and you design that way. But given where the Moore's Law curve stands today, most organisations really don't need that.

You can rent servers on Amazon with 250 gigs of RAM for a few dollars an hour; they specifically call it the big data cluster. It is possible to analyse the data using tools like Pandas/Matplotlib and others in the scientific Python ecosystem fairly easily.

These tools are being used by scientists and industry for a really long time, except they aren't really advertised that way.

For instance, here is some analysis I was doing recently of children's names in the US since 1880, with 3 million records: http://nbviewer.ipython.org/53ec0c5a2fabcfebb358. My Mac could handle it without even breaking a sweat.
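A sketch of that kind of analysis with Pandas, on a tiny fabricated stand-in for the names data (the column layout here is an assumption for illustration, not the actual file format):

```python
import pandas as pd

# Fabricated stand-in for the baby-names records (name, sex, births, year)
df = pd.DataFrame({
    "name":   ["Mary", "John", "Mary", "John", "Linda"],
    "sex":    ["F", "M", "F", "M", "F"],
    "births": [7065, 9655, 6919, 9532, 5000],
    "year":   [1880, 1880, 1881, 1881, 1881],
})

# Total births per year, and each name's share within its year
totals = df.groupby("year")["births"].sum()
df["share"] = df["births"] / df["year"].map(totals)
print(totals.to_dict())
print(df[["name", "year", "share"]])
```

The same two groupby/map lines work unchanged on the full 3M-row dataset; it all fits comfortably in a laptop's RAM.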

I often tell people: if your solution to moving data necessarily involves shipping contracts, as opposed to "we'll just upload it" or even "I'll just burn it to a DVD", you're not in big data. (This is akin to "if you don't worry about power and cooling and instead worry about FLOPS, you're not in supercomputing" from the 90s.)

Last year I was talking about an implementation we did for some data and was asked about our scale; "hundreds of terabytes" was my answer. For the people we were talking to, people who know big data, that sufficed (although a bit small on their scale, it did require big-data thinking and constructs to get answers in a reasonable amount of time).

I hadn't realized how many people were wrongly moving to "big data" solutions until I read the discussions around this article. Color me surprised.

Even if the data isn't big, there can be a benefit from the Hadoop infrastructure. Say you have just 86,400 rows of data but each row takes 1 second. That adds up to 24 hours of elapsed time, and waiting for that run can be painful, especially if you are trying to experiment, iterate. With HDFS/MapReduce you can distribute that work across N machines and divide the elapsed time by N, speeding up the pace of iteration. I've worked on a project that had exactly this challenge, before Hadoop was available, and so we had to invent our own crappy ways of distributing the data to the N machines, monitoring them, collecting the results. Hadoop HDFS and Map/Reduce, with Job Tracker, etc, would have been much better than what we came up with.
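A sketch of that pattern without Hadoop, using a simple worker pool (the per-row function is a trivial stand-in for the 1-second computation; for truly CPU-bound pure-Python work you would use `ProcessPoolExecutor` instead of threads):

```python
from concurrent.futures import ThreadPoolExecutor

def slow_row(row):
    # Stand-in for the expensive per-row computation
    return row * row

rows = range(100)  # imagine 86,400 of these

# Fan the rows out over N workers; elapsed time drops roughly by N.
# map() preserves input order, so results line up with rows.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(slow_row, rows))

print(results[:5])
```

Hadoop's real contribution for this use case is everything around this loop: distributing the data, restarting failed workers, and tracking job progress.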

Unless your problem is I/O bound (you can't get data off the disks fast enough, or it's network bound: transferring data to worker nodes takes too long), using Hadoop is the wrong choice. CPU-bound problems are better solved with grid solutions that do a better job of scaling up (within a single node) and scaling out to multiple machines. Taking a step back, you should always ask yourself if this can be done on a single machine, taking advantage of Moore's Law.

What kind of processing takes 1s per row? That's several billion instructions. And you can easily fit 86400 rows in memory, so disk seeks aren't an issue.

Decent RDBMS servers will parallelise where possible and use the server's 8 cores (or whatever) to optimise such a problem.

I've seen a fair number of startups that throw around how they are going to make big money by utilizing the data they gather (called "big data" regardless of size) - it's all a bit of magical underpants thinking: we'll gather a bunch of people/users, we can't figure out how to make money off of advertising or charging them, so then we'll talk about how the "big data" they produce will be worth a fortune and people will pay to have access to it. Know some folks in the HR SaaS space that think this is how they'll hit $100m. It's just comedy.

"We don't have big data. Our data is small, and could be easily stored in a MySQL or even a flat file" said no dev team ever. Everyone is "just like Google" so they need NoSQL, scaling, clouds and so on.

As someone who runs jobs on giant clusters day in, day out, I just looked at my last job. It indeed had input data of ~100GB. However, the size of the input data is misleading: the job does a lot of processing, generates ~5TB of intermediate data, and took 800+ machine-hours to complete. If I'd run that on my desktop I would have been waiting a month for it to finish. On the cluster it took ~4 hours.

I had to smile at the statement "Is more data always better? Hardly". There is an old saying in the world of data scientists: there is no data like more data. Yes, its value may be diminishing, but when your competitor is trying to squeeze out a gain in the second decimal, you are probably better off accepting more data.

So the moral of the story is: all of this really depends. People do get fired for buying clusters. Modern cluster-management software tracks several utilization metrics, and someone someday is going to look at them and point out how bad a decision it was.

The reason the "big data" pimps can get away with this is that most of the people who should know (those who aren't DB programmers, DBAs, true scientists, or engineers in the domain) don't know shit about data and are generally too fscking lazy to learn. So they buy into the latest wave of buzzwords and hype.

It's precious to read through almost every post in this thread complaining about 'big data' and saying that everyone can just use a normal relational database or whatever. But 'big data' has brought markets for HN-type entrepreneurs to exploit, and jobs and loads of prestige for HN-type engineers, who I have never noticed to be shy about bragging about how much data is in their systems, without regard to whether that data is particularly meaningful.

I'm getting to the age where things start coming back under new branding. I remember in my childhood when my father would talk about bell-bottoms and how trendy they once were. Then they came back and he was shocked.

I remember Doc Martens. They're back. I remember the gumby haircut. It's back. I remember ripped jeans... also back.

Technology follows this cyclical trend as well, we just give it fancy names like Big Data, Cloud and Anything-as-a-Service.

> Most data isn’t “big,” and businesses are wasting money pretending it is

Most business leaders are not rational, and we should stop pretending they are.

Most blogs don't need javascript, and publishers are pissing off their readers by pretending they do.

Related discussion from three weeks ago: https://news.ycombinator.com/item?id=5602727

In an effort to Store All the Things, a lot of companies have talked themselves into a rhetorical corner of poorly fitting shoes. At this point, we've stopped wasting time when asked about Big Table and NoSQL and instead demo their storage stack on a different framework until their eyes widen.

Then, when questions come up about how much engineering went into this "thing" that does such a good job of keeping data secure, and so much of it, we say it's built on Postgres.

As DevOps Borat says : https://twitter.com/DEVOPS_BORAT/status/313322958997295104

How well you can utilize it and how quickly is just as important as what kind of data you store in the first place.

Be wary about drawing conclusions from "most of the jobs were small." Most of my jobs are small -- because I'm running experiments so I won't have to redo the big one.

That said, I'm a huge proponent of running stuff simply at first. Few businesses will ever grow to the point that they need more than a single large database server and one or two backups. Don't waste your time prepping for something you'll probably never need, especially when fixing the problem when the time comes is only marginally more painful than doing it right in the first place.

For me, "big" data is increasing the linkage between your data. It's not simply more data, but much richer, less formal data relationships. It's taking your sales data and linking it to your website clicks, linking that to the weather (or whatever). Or you take something traditionally static and add a temporal dimension.

This kind of deep linking you can't measure with straight megabytes. A few gig doesn't seem that large, but if it's a complex graph with a complex hypothesis - then, sure - that's big.

Well, that's the whole point - "A few gig doesn't seem that large, but if it's a complex graph with a complex hypothesis"... then it's still not 'big data'.

It's maybe smart data, maybe detailed data, but definitely not big data - that problem will have completely opposite needs and techniques than big data analysis, and should not be mischaracterised as such.

I think that's the entire point behind this article. What you're talking about and what "big data" actually exists as, are two different things. What you're talking about is data refinement. As you said, linking things together and trying to look at your data in different contexts and dimensions than what you do normally. This can be done on a nearly any data set small or large. Big data gets into the realm of literally having so much data to process that data refinement becomes nearly impossible without a significant combination of thinkers, doers, machinery and money.

Most companies today are already using scaled up servers to host their medium size warehouses (think Teradata or Exadata). That approach is very expensive (> millions of dollars), only works well with well-defined data, and does not scale well beyond a few TBs.

Hadoop is not just about running large jobs on very large data. Hadoop also makes sense when trying to scale on commodity hardware or running ad hoc queries (which can target a small amount of data) on medium to large data sets.

Expensive - yes, comparatively. Only works well with well-defined data - yes, but poorly defined data is hard to use in any statistical calculation too. Does not scale well beyond a few TB - bullshit. It does scale really well.

Oracle writes shitty 'enterprise apps' (god I hate that phrase) that they sell to big companies because their salespeople wear great attire and are good at mirroring dumb CEOs/CIOs, like the ones that run several companies I have worked for. Will someone please end this nonsense? At what point do usability/stability/utility become factors?

This is a viewpoint that I hear a lot, mostly from people who are not in the room when these grand enterprise implementation decisions are made. While it's true that a good salesperson can make a difference in winning a deal vs. another vendor, salespeople almost never convince a company that they need a big enterprise software platform. 95% of the time, the company has already decided that the current way they do X is broken, and now the salesperson can convince them that they have the solution to that.

The truth is that very often, X is broken inside an organization not because of executive management, most of whom don't care what software packages get used or who they buy from or anything else like that, but rather big software companies get brought in because the technology/backoffice organization inside the company is a disaster.

Accounting system doesn't properly allocate widget expenses to different cost centers? Takes a week to update the homepage? No one knows where exactly sensitive data is being stored?

That's all the technology organization's failure in one way or another. And when things get bad enough, senior management says, "Okay, our homegrown accounting system is just not doing the job for us anymore", and here comes Oracle, happy to sell them their accounting system, which has all of the features they could possibly want, and sure, it's expensive, but it works, as opposed to the busted system they've got currently.

Of course, the next failure then, is that the people who will be running and overseeing and architecting this solution are either the same people who cocked up the accounting system in the first place, or consultants who have absolutely zero incentive to do anything other than maximize billable hours.

This means that instead of the organization saying, "We will adapt to off the shelf software and change our processes to better align with the way the software is designed to be used", they say, "Make your software work the way we do things".

Now we're off to the races, as various fiefdoms inside of the big company make their pitch about what needs to be customized. Everything from the layout of the screens to the workflow processes to the data model, everything has to be matched to exactly the way the customer wants to do things.

Back at Oracle HQ, the RFEs have been flying in from not just that customer, but the other 200 new customers being onboarded, and every one is basically a demand for a way to modify this or that option. No one is saying, "We wish there were fewer fields on this page".

So the customers demand more features, Oracle delivers them, and then the customers promptly use those features to further complicate their platforms, because they don't have the technical discipline to say, "No, we really don't need to support different SKU revenue allocations based on currency, we'll just do it by hand at the end of every quarter".

Looking at it a different way - how is making the software simpler going to help Oracle win business? If anything, the more features the product has, the more points they get on the RFP from the next big customer.

So everyone is to blame - Oracle makes money selling and implementing very complex technology solutions because they're answering the demands of their customers who depend on overly complex technical requirements because their technology organizations are poorly run because they don't have any discipline because senior management isn't technical enough to recognize where the failure is.

tl;dr - enterprise software is not broken because of the sales people or upper management, it's broken because the technology organizations are bad at their jobs

Great insight! So very true! I feel the same about people obsessing over the perfect "ToDo" app or "Project Management" app.

More often than not, the underlying issue is a lack of discipline and human-behavior (or in the Enterprise case, "organizational behavior") problems that we incorrectly label as "technology problems".

> This means that instead of the organization saying, "We will adapt to off the shelf software and change our processes to better align with the way the software is designed to be used", they say, "Make your software work the way we do things".

This must be a damned if you do, damned if you don't kind of situation, because I work for a company that attempted to use an OOTB Oracle software package and ended up getting roundly criticized by every part of the company, both internally and externally.

Yeah, it very much is a tough row to hoe on either side - and btw, even just adapting to the OOTB package will still cost you a ton of money, and often in ways you didn't expect:

I was loosely associated a few years back with a manufacturing company migrating from their 20 year-old mainframe-based ERP solution to Oracle's ERP. They really had the worst of both worlds, because not only did they have 20 year-old business processes that no one wanted to change, but the whole interface for Oracle was so radically different from the "green screen" 3270 interface of the current system that you couldn't even make Oracle look anything like that. It was doomed to be a complete mess.

But to the point I'd originally planned to make: they tested the system in limited release, and then went live with it for one particular function, which was generating and printing order cards or something like that. What no one had thought of, and what didn't occur in testing because it wasn't a real workload, was that the old system sent raw text to the printers at the various factory sites, while Oracle (IIRC) was generating PostScript, complete with logos and formatting, and sending that to the printers at the factories... which, it turns out, were connected over 128kb/sec links that were promptly swamped by the size of the files.

So the whole project had to be put on hold until all of the links between HQ and the factories could be upgraded, which took months, and the feedback from the userbase was, "What a piece of shit Oracle is, our 20 year old system can print to the factories, why is it so hard for them to do that?!?!"

EDIT: looked back in my notes, 128kb/sec lines, not 512

(My experience has been that) most large companies lack defined processes and business models in critical areas because their models are too complex to fully define in software -- outside organisations know crap all about these complexities, and internally the skill sets are too low to implement a complex system -- and to be honest, who wants to spend three years and millions of dollars coding up an application for a business model that has already changed in that time? Sometimes, simple CRUD apps are all that is needed.

Wow, that was beautiful. And dead on.

It's a buzzword, not a quantifiable thing.

The fact that so many people call things "big data" when the data is not high volume (the most popular definition I've seen is the 5 V's definition; "big" seems to be a misnomer in that case, since only volume could really be called a measure of "big") lends credence to your statement.

There are two ways to define BigData.

1. The accumulation, integration, and analysis of a large number of data sources.

2. A volume of data that presents challenges when running analysis functions across it, due to the limits of the available tools.

1 is fraught with the kind of statistical pitfalls mentioned in the posted article. 2 describes a set of problems and boundaries that are time-sensitive: what was BigData in 2006 (to, say, LiveJournal or Digg) may no longer hold. As a data engineer, it's important to keep a skeptical eye on marketing and make sure we're delivering valuable solutions that increase the bottom line for our business, not just produce "ain't it cool" correlations.

Extrapolating from relatively few truly random data points in massive datasets, for analysis and modeling, is what "Big Data" is all about. This article would have you think that working with clusters or snippets of impossibly ginormous datasets is somehow less "Big", but that's sorta the point. Perhaps someone should inform the author that having more data available doesn't translate into working with more data.

I wonder if both "responsive web design" and "big data" were just hoaxes that we were all fed to sell more books and seminars.

> The “bigger” your data, the more false positives will turn up in it, when you’re looking for correlations

I think they are talking about the Texas Sharpshooter Fallacy.


Nate Silver writes about this in his book.. highly recommended.


Said this before, and I still want to see it fixed: I can't stand reading the page with the huge grey box on the right side, which disables scrolling unless the mouse is over the content. Last time, I didn't have my environment info:

Windows 7 Ultimate SP1, Chrome Version 26.0.1410.64 m

I've been thinking the same thing as the article's premise for a while now. More often, I think people just write horrible code and poorly designed systems that perform sluggishly and underwhelm... and then someone cues Mr. Big Data as the silver bullet.

Clarifying the scope of a project or data collection & analysis effort is paramount. You never want to attempt to boil the ocean. The key is to figure out the data that matters most to your company or organization's strategy.

Does anyone know how much indexed data Google has for its search? (Not the size of the database.) I'd bet it won't be over a few hundred TB, something that could fit on most desks in the not-too-distant future.

Also: a lot of people seem to equate the number of records with complexity.

This article is the equivalent of "horse drawn carriages are perfectly adequate for most journeys, and much more pleasant and commodious to boot." Good luck with that, buddy.

You're not going to know what correlations are important and which are not until you study the data. Telling people to just collect the "important data" is like telling someone who has lost his keys just to go back to where he left them.

It's also more than a little insulting to FB and Yahoo to insist they are not web scale. The problem of small jobs on MR clusters is real, but even with small jobs, Hadoop turns out to be a lot more cost-effective than various other proprietary solutions which are your only real enterprise alternative. The problem of small MR jobs is being solved by things like Cloudera Impala, which can run on top of raw HDFS to perform interactive queries.

The point was that not everyone needs or has big data. That's hardly controversial. Even some instances where you think you have big data that you think needs to be handled in parallel by a cluster could easily be handled by a single server or even laptop. Again, nothing controversial.

The most important thing is knowing what data you have, how best to collect it, and what it can (and can't) tell you. Just because you find correlations doesn't mean that they are real. It takes people with real expertise to help here, and just running your data on a cluster isn't going to help you. In fact, it could even hurt.

I didn't see anything wrong with the article at all.

Telling people to just collect the "important data" is like telling someone who has lost his keys just to go back to where he left them.

He doesn't tell anyone to collect the "important data," and he doesn't insist FB or Yahoo are not web scale.

His concluding paragraph is relatively weak, but the main thesis -- most businesses can ignore the Forbes/BI crap and analyze their data sufficiently using normal tools -- is true and sound.

This is such a naive response. It is the sort of response that purports to show how naive its object of criticism is, yet it misses the point entirely. First of all, you misunderstand: the author is claiming that people are treating the terms "big data" and "analysis" synonymously, and that this is erroneous.

The problem is that your ability to explore the data and the data volume are inversely correlated. You are far more likely to find interesting things exploring an in-memory dataset with something like IPython and pandas than throwing Pig jobs at a few dozen TB of gunk. Big data is great if you know exactly what you are looking for. If you get to the stage where you are exploring a huge DB looking for relationships, you need to be very good at machine learning and statistical analysis (spurious correlations ahoy!) to come out significantly ahead. It's also an enormous time sink. In sum, the bigger the data, the simpler the analysis you can throw at it efficiently.
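For illustration, a minimal sketch of the kind of interactive round trip that's cheap in pandas but becomes a separate batch job in Pig. The dataset here is entirely hypothetical, just something small enough to fit in memory:

```python
import numpy as np
import pandas as pd

# Hypothetical in-memory dataset: 100k rows of region/revenue pairs.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "region": rng.choice(["NA", "EU", "APAC"], size=100_000),
    "revenue": rng.exponential(scale=100.0, size=100_000),
})

# One line to group, aggregate, and rank; in a Pig/MapReduce workflow
# each tweak of this would mean resubmitting a job and waiting.
summary = df.groupby("region")["revenue"].agg(["count", "mean", "sum"])
print(summary.sort_values("sum", ascending=False))
```

The point isn't the aggregation itself, it's the seconds-long feedback loop that makes open-ended exploration practical.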

Very true. Wouldn't the typical approach to this involve probabilistic methods, like taking large-ish (but not "Big") samples from your multi-TB data and doing your EDA on those?

That would work very well if our random sample accurately reflected the superset of data, which it almost always does, but you also want to consider the following...

Imagine our data is 98% junk, with the remaining 2% consisting of sequential patterns. We might be able to spot this relatively easily on a graph of the whole dataset, but random sampling would greatly degrade the quality of that information.

We can extend that to any ordering or periodicity in the data. If the datum at position n has a hidden dependency on the data at positions n±1, random sampling will break us.

Do random sampling plus n lines of surrounding context.
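One way to read that suggestion, sketched as a hypothetical helper (`sample_with_context` is not a real library call, just an illustration of keeping a window around each sampled point so n±1 dependencies survive):

```python
import random

def sample_with_context(lines, k, n):
    """Pick k random positions, then keep n lines of context on each
    side of every pick, deduplicated and in original order, so local
    ordering and n+/-1 dependencies are preserved within each window."""
    if not lines:
        return []
    centers = random.sample(range(len(lines)), min(k, len(lines)))
    keep = sorted({i for c in centers
                   for i in range(max(0, c - n), min(len(lines), c + n + 1))})
    return [lines[i] for i in keep]

# Toy usage: sample 10 windows of radius 2 from 1000 records.
data = [f"record-{i}" for i in range(1000)]
sample = sample_with_context(data, k=10, n=2)
```

Each window is at most 2n+1 records, so the sample stays small while any dependency between a record and its immediate neighbours remains visible.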

Yeah. I'm a scientist that deals with huge datasets. Huge. I must admit that I do cringe a little every time I see the words 'big data'.

Disclaimer: I haven't read the post. Only the title.

Sometimes though, you really do have lots of data and need appropriate solutions. At Quantcast, our cluster processes petabytes per day and our edge datacenters handle hundreds of thousands of transactions per second. In fact, we recently open-sourced our file system (QFS[1]), an alternative to HDFS, which can as much as double FS capacity on the same hardware. Although it's certainly true that not every company (or even not most) needs all that horsepower, there are definitely some for whom it's the core of their business.

[1]. http://quantcast.github.io/qfs/

Thanks for the self-advertisement. But from what I understood, lots of businesses treat a few gigabytes as big data. It's fashionable to call yourself a "big data" user.

What is the author of this article trying to say here?

>it appears that for both Facebook and Yahoo, those same clusters are unnecessary for many of the tasks which they’re handed. In the case of Facebook, most of the jobs engineers ask their clusters to perform are in the “megabyte to gigabyte” range (pdf), which means they could easily be handled on a single computer—even a laptop.

That Facebook or Yahoo could be run from a laptop?

It's specifically talking about analytics jobs, not the user-facing stuff.
