Most, heck something like 99.99%, of all so-called big data I've dealt with is something I wouldn't even classify as small data. I've seen data feeds measured in KBs sent over to be handled as big data. It happens all the time. A simple data problem that could easily be solved with a small DB solution like SQLite is routinely taken to 'grid' these days. It reminds me of the XML days, when everything had to be XML. I mean every damn thing; these days it's NoSQL and big data.
People wrongly do their schema design just so it can go into a NoSQL store, then use something like Pig to generate data for it. The net result is that they end up badly reinventing parts of SQL all over the place. If only they understood a little SQL and why it exists, they could save themselves all that pointless complexity. Besides, avoiding SQL where it's appropriate creates all sorts of data problems in your system. You'll go on endlessly reinventing ways of doing things SQL already offers while bloating your code. You'll read through a big chunk of code, only to figure out the person actually intended something like a nested select query, implemented very badly.
Besides, I find much of this big data thing a total sham. Back in the day we would write Perl scripts to do all sorts of complex data processing (with SQL, of course). Heck, I've run some very big analytics systems and automation setups in Perl doing far more difficult things than what people do with 'big data tools' today.
In larger corporations this has become fashion now. If you want to be known as a great 'architect', all you need to do is bring in these pointless complexities. Ensure the setup becomes so complicated it can't be explained without a hundred pieces of jargon totally incomprehensible to anybody beyond your cubicle. That is how you get promoted to architect these days.
I get that a NoSQL-ish alternative makes sense for companies that have tons of shards spanning the globe, but for the vast majority of people, a relational database serves just fine.
I do have a problem with Oracle. Even the Oracle experts at my former job could barely get Oracle to do something sensible, and running it on my own computer was basically a death sentence for getting anything done.
I have a problem with MySQL, from a sysadmin perspective. When I had it installed, MySQL was the package that would always break on every update. No upgrade was small enough that the data files would continue working.
(I don't have experience with Postgres, but SQLite seems more comfortable than any of the mentioned alternatives)
I have a problem with schemas in my database. It requires upfront work with modelling my data. I'd rather iterate. Also, nobody I've worked with seems to put the schemas in automatically; you need to run the special "initdb" script that isn't maintained to make it start working.
I have a problem with SQL. It would be awesome if we had a standard query language, but we don't. You can't apply the same SQL to different database engines mostly because it won't even compile, secondarily because it will give different results and finally because it will have a completely different performance profile.
All of this can be fixed by learning stuff, so I know better what I am doing.
But I already know CouchDB . It took me little effort to learn, and it makes a lot of sense to my mind. I can solve problems with it, and it has neat properties with regards to master-master replication. So for me, CouchDB works just fine, just like a relational database works just fine for you :)
So, from my perspective, it seems that using some SQL solution would be the time consuming option.
(CouchDB can't be considered a "big data" database for many cases. It is slow. But it scales neatly :)
Sure, the schema gets a bit bloated and dirty, but having undocumented fields in a dict whose existence is signified only by a line of code assigning something is not better.
Where schemaless stores are great is prototyping. Especially for more algorithmic code, where I don't really know what storage I'll be needing and the algorithm will only live there for a few days, schemas are just a burden. That's why I wrote Goatfish: https://github.com/stochastic-technologies/goatfish
In more ad-hoc, constantly changing, "oops, this didn't work because of unknown factor X" type projects, schemas are a pain. It sounds really nice to have a data store that adapts when it's impossible to know up front what you need from your data structures.
Such a pain:
ALTER TABLE foo DROP COLUMN bar;
ALTER TABLE foo ADD COLUMN baz varchar(64);
The only real use case I've ever seen for schemaless DBs is fields with lots of ad-hoc columns added by multiple people (typically logging/metrics data).
Inevitably completely un-normalized and junk data. Even worse with no documentation or procedure anything ever added becomes permanent legacy that can never be removed. Been there, lived it, hated it, won't allow it to happen again.
Plain text logs are a great place to funnel all "unable to connect to database" type of errors for dbas / sysadmins to ponder, however.
I've implemented quite a few systems where data changes are logged into a log table, all fully normalized, so various reports can be generated about discrepancies, access rates, and things like that. This is also handy for end users to see who made the last change, etc.
Trying to reverse engineer how some data got into a messed-up condition using logs that can be JOINed to the actual data tables as necessary is pretty easy, compared to writing a web-server log parser that reads multiple files to figure out who did what, and when, to the data to leave it screwed up. You only parse log files for data once before you decide to do that stuff relationally. Debug time drops by a factor of 100x.
Sounds like you're not interested in absolutely every event for security or data-integrity audits and more interested in general workflow. Be careful when selecting samples to store, because what superficially looks random might not be (I dunno, a DB primary key, or a timestamp ending in :01?). Read one byte from /dev/urandom; if it equals 0x42, log this time around.
Also, it depends on whether you're unable to log everything because of sheer volume, or lack of any usable purpose, or something else. For example, if you've got the bandwidth to log everything but not to analyze it anywhere near real time, maybe log everything for precisely one random minute per hour. So this hour everything that happens at minute 42 gets logged, everything at minute 02 the next hour, whatever. Better mod-60 your random number.
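That minute-of-hour scheme is only a few lines; here's a sketch in Python (the function names are made up for illustration):

```python
import random

def pick_sample_minute() -> int:
    """Choose, once per hour, which minute gets full logging."""
    # randrange(60) already yields 0-59 uniformly, so no mod-60 bias to worry about.
    return random.randrange(60)

def log_this_minute(minute_of_hour: int, sample_minute: int) -> bool:
    """Log everything only during the chosen minute of the hour."""
    return minute_of_hour == sample_minute
```

Re-pick the sample minute at the top of each hour so the sampled minute stays unpredictable.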
You can also play games with hashes: IF you have something unique per transaction and a really speedy hash, then a great random sampler is to hash that unique stuff and only log if the hash ends in 0x1234 or whatever.
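A minimal sketch of that hash trick, assuming each transaction carries some unique ID ("txn_id" is a hypothetical name):

```python
import hashlib

def sample_by_hash(txn_id: str, sample_bits: int = 8) -> bool:
    """Deterministically log roughly 1 in 2**sample_bits transactions.

    Hashing the unique transaction ID gives a stateless, speedy sampler
    that stays consistent across multiple frontends: every frontend
    makes the same keep/drop decision for the same transaction.
    """
    digest = hashlib.md5(txn_id.encode()).digest()
    low = int.from_bytes(digest[-2:], "big")
    return low % (1 << sample_bits) == 0
```

With sample_bits=8 about 1 in 256 transactions gets logged; adjust the bit count to tune the rate.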
If you have multiple frontends, and you REALLY trust your load balancer, then just log everything on only one host.
I've found that storing data is usually faster and easier than processing it. Your mileage may vary.
Another thing I've run into: it's really easy to fill a reporting table with indexes, making reads really fast while killing write performance. So make two tables, one index-less that accepts all raw data and one indexed to generate a specific set of reports, then periodically copy a sample out of the index-free log table into the heavily indexed report table.
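A toy version of that two-table setup, using SQLite in memory (table and column names invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Index-free table: cheap writes, accepts all raw events.
    CREATE TABLE raw_log (ts INTEGER, event TEXT);

    -- Heavily indexed copy used only for reporting.
    CREATE TABLE report_log (ts INTEGER, event TEXT);
    CREATE INDEX idx_report_ts ON report_log (ts);
    CREATE INDEX idx_report_event ON report_log (event);
""")

# Fast path: raw inserts never touch an index.
conn.executemany("INSERT INTO raw_log VALUES (?, ?)",
                 [(i, "event-%d" % (i % 3)) for i in range(1000)])

# Periodic job: copy a sample (here, 1 in 10) into the report table,
# paying the indexing cost in one batch instead of on every write.
conn.execute("INSERT INTO report_log SELECT ts, event FROM raw_log WHERE ts % 10 = 0")
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM report_log").fetchone()[0])  # prints 100
```

In production the "periodic job" would be a cron task doing the INSERT ... SELECT and then pruning or rotating the raw table.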
It's kind of like the mentality behind backups. Look at how sysadmins spend time optimizing: sometimes a full backup tape dump, sometimes just a delta from the last backup.
You're going to have to cooperate with operations to see what level of logging overloads the frontends. There's almost no way to tell other than trying it unless you've got an extensive testing system.
I've also seen logging "sharded" off onto other machines. Let's say you have 10 frontends connecting to 5 backends, so FE#3 and FE#4 read from BE#2 or whatever. I would not have FE3 and FE4 log/write to the same place they're reading from (BE2). Have them write to BE3 or something, anything but the one they're reading from. Maybe even a dedicated logging box, so writing logs can never, ever interfere with reading.
Another strategy I've seen, which annoys the businessmen, is to assume you're peak-load limited and shut off logging at peak hour. Or write a little thermostat cron job: if some measure of system load or latency exceeds X%, logging shuts off until 60 minutes in the future or something. Presumably you have a test suite/load tester that figured out you can survive X latency or X system load or X kernel-level IO operations per minute, so if you exceed it, all your FEs flip to non-logging mode until it drops beneath the threshold.

This is a better business plan because instead of explaining to "the man" that you don't feel like logging at their prime time, you can provide proven numbers: if they're willing to spend $X they can get enough drive bandwidth (or whatever) that it never goes into log-limiting mode.

Try not to build an oscillator. Dampen it a bit. For example, the more you exceed the threshold now, the longer logging stays silenced afterwards, so at least if it does oscillate it won't flap too fast.
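The thermostat-with-damping idea could be sketched like this (the threshold value and the way load gets measured are placeholders you'd wire up to your own monitoring):

```python
class LogThermostat:
    """Disable logging when load exceeds a threshold.

    The further past the threshold the load goes, the longer logging
    stays silenced, which dampens oscillation: a brief spike silences
    briefly, a big overshoot silences for longer.
    """

    def __init__(self, threshold: float, base_silence: float = 3600.0):
        self.threshold = threshold          # e.g. load average, latency, IO/min
        self.base_silence = base_silence    # seconds silenced at minimal overshoot
        self.silenced_until = 0.0

    def record_load(self, load: float, now: float) -> None:
        if load > self.threshold:
            # Overshoot percentage scales the silence window.
            overshoot = (load - self.threshold) / self.threshold
            silence = self.base_silence * (1.0 + overshoot)
            self.silenced_until = max(self.silenced_until, now + silence)

    def logging_enabled(self, now: float) -> bool:
        return now >= self.silenced_until

t = LogThermostat(threshold=0.8)
t.record_load(1.0, now=0.0)        # 25% over threshold -> 4500 s silence
print(t.logging_enabled(100.0))    # prints False (still silenced)
print(t.logging_enabled(5000.0))   # prints True (window has elapsed)
```

Each frontend would call record_load from its monitoring loop and check logging_enabled before writing a log entry.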
One interesting strategy for sampling to see if it'll blow up is to sample precisely one hour. Or one frontend machine. And just see what happens before slowly expanding.
It's worth noting that unless you're already running at the limit of modern technology, something that would have killed a 2003 server is probably not a serious issue for a 2013 server. What was once (even recently) cutting edge can now be pretty mundane. What killed a 5400 RPM drive might not make a new SSD blink.
(Whoops, edited to add: I forgot to mention that you need to confer with the decision makers about how many sig figs they need, and talk to a scientist/engineer/statistician about how much data to obtain to generate those sig figs. If the decision makers actually need 3 sig figs, then storing 9 sig figs' worth of data is a staggering financial waste, but claiming 3 sig figs when you really only stored 2 is almost worse. I ran into this problem one expensive time.)
User-level data might be useful for tech support (although this is currently working fine with text-based log files and a grep).
So I guess I am not sure... I might be content with web analytics... Each company has its own URL in the site, like so: http://blah.com/company/1001/ViewData, http://blah.com/company/1002/ViewData, etc. Using e.g. Google Analytics I could see data for one company easily, but can I see data across all companies (how many users look at ViewData regardless of company id)? Can I delegate the owner of company 1001 to see only the part of the analytics?
Another monkey wrench is the native iPad app - ideally the analytics would track users across both native apps and the web site.
Can you give me a concrete example of how you would use this?
Security audit trail type stuff "Why is this salesguy apparently manually by hand downloading the entire master customer list alphabetically?".
This isn't all doom and gloom stuff either... "You're spending three weeks collecting the city/state of every customer so a marketing intern can plot an artsy graph by hand of 10000 customers using photoshop? OMG no, watch what I can do with google maps/earth and about 10 lines of perl in about 15 minutes" Or at least I can run a sql select that saves them about 50000 mouse clicks in about 2 minutes of work. Most "suit" types don't get the concept of a database and see it as a big expensive Excel where any query more complicated than select * is best done by a peon by hand. I've caught people manually alphabetizing database data in Word for example.
Another thing that comes up a lot in automation is treating a device differently WRT monitoring and alerting tools if a change was logged within the last 3 days. So your email alert for a monitored device contains a line something like "the last config change was made X hours ago by ...". Most of the time when X=0 or X=1, the alert is because ... screwed up, and when it isn't, a short phone call to ... is problem isolation step #1.
These were all normal daily-operations business use cases, aside from the usual theoretical data-mining stuff, like a user in A/B marketing test area "B" tends to update data table Q ten times more often than one in area "A", or whatever correlation seems reasonable (or not).
Using your approach, I guess it would be a table like
create table logs (
    -- the columns below are my guess at a typical audit-log shape
    id          serial primary key,
    changed_at  timestamp not null default now(),
    changed_by  varchar(64) not null,
    table_name  varchar(64) not null,
    row_pk      integer not null,
    old_value   text,
    new_value   text
);
Or, if by "change a column" you mean something like an "update table blah set x=x+1;" where the x column was part of an index, that used to really work the indexing system hard, one individual row at a time. I think that issue was optimized out a long time ago. I believe there was a sneaky way to optimize around it other than the index drop-and-recreate trick: do all 10 million increments as part of one transaction, so it would apply all 10 million increments, close out the transaction, then recalculate the index. And there was something sneaky on top of the sneaky: you couldn't do a transaction on a single update, so you updated all the "prikey is even" rows and then all the "prikey is odd" rows, or something like that, as a two-part transaction. I didn't exactly do this last week, so I may misremember a detail...
This seems backwards to me. Relational databases are much better at ad-hoc querying of data, whereas NoSQL scales for narrower access patterns. The fact that you can dump arbitrary columns without a migration is a nice convenience, but in general it will be less queryable than it would be if you added the column in a SQL database.
Oh but you have to know the schema right? Yes, some other part of the application knows the schema, but this part doesn't have authority over the DB. Also, the schema may be data as well.
NoSQL reduces the work needed for that.
But why would I waste developer time doing that when I can just do db.table.insert(obj), in MongoDB for example (obj is a JS object)?
Also, finding all objects with a field named 'field1' and value '3' is slower if you do that in a relational DB (and that's the simplest case)
The fun thing about NoSQL skeptics is how they think of only the current scenarios they work with, and they won't believe you until they get burned by it. So be it.
In systems large enough to be running multiple versions of an app at the same time, talking to the same database, you have to do exactly this--but you have to shoehorn it into relations (doing things like having a field where 99.999% of entries in a column are null, and gradually get computed as the row is accessed by the new version of the app.) NoSQL lets you just say what you mean--that in some versions of the app, the schema is X, in some versions it's Y, and the database isn't the arbiter of the schema.
What happens when you push problems up the stack? Do they get solved automatically? No? Will they get solved? Perhaps, if really needed. And, for the cherry on top: Will the solutions be similar to the ones SQL databases use? They will.
You see, when you are implementing atomic transactions, for example, you may get ahead if you have some information about the problem domain. However, for most cases, you are solving the same problem SQL databases solved decades ago. And you'll find the same solution. Just not as well implemented nor as well tested.
It's interesting that, culturally, more than a decade ago when MySQL tried this strategy with transactions, namely not having them until roughly the turn of the century, it was reviled mostly by people who didn't know what transactions were and didn't need them, but who were nonetheless very unhappy about MySQL not getting that checkbox checked.
Now it's culturally seen as a huge win to simply not implement something difficult.
I don't know if it's a decline of feature-list length as a fetish, or just simple copycatting of others' behavior (perhaps in both situations), or some kind of pull-yourself-up-by-your-bootstraps romantic outlook on reimplementation, or the inevitable result of Homer Simpson meets the database, but whatever it is, it's an interesting major cultural change.
Instead of looking at it as "aaawwm! NoSQL is attacking our beloved RDBMS", try looking at the different projects and what they bring to the table. Maybe some of them can be a useful addition to your systems.
Typically, they require specialized database administrators whose primary job is to tune the database and keep it running.
Many businesses, even of moderate size, reach a point where they need to purchase expensive hardware (million-dollar RamSans and expensive servers) to optimize the performance of their database, because partitioning databases is challenging.
So the overhead of running an Oracle or SQL Server database is quite high.
There is huge room for improvement with these traditional database products. If someone made a good cloud database that supported the same feature set but with lower administration and maintenance costs then that might be a better option.
Do you have any experience with Amazon RDS? I haven't tried it; I guess my concern would be the same as any other AWS product--they tend to fail catastrophically from time to time. Then again, if you're doing cloud NoSQL through Amazon, you're going to run into the same issues (see: Reddit).
This whole "specialized database administrator" point just seems moot, considering the equivalent for that is the so-called Big Data developer.
With that said, the above comes at a time/flexibility cost (not as bad as it used to be but still there) when building a product that isn't quite sure what it will be yet. In these cases a different data store can be beneficial since the app itself is key until traction is gained, if ever.
This makes devs think that they need NoSQL experience, and therefore they will find ways to shoehorn NoSQL into whatever problems they are currently solving.
Throw in a line graph and a heat map and you're basically on a fast track to a promotion without even saying a single useful thing.
Just need a good SQL reference? 'The Art of SQL' or 'SQL and Relational Theory'
Are you not entertained? Fine, choke on this : http://en.wikipedia.org/wiki/The_Third_Manifesto
I'll be honest and say I read it and found it terribly embarrassing yet comforting. Remember that dumb thing I did back in '96? (insert red face) Yeah, I guess I'm not the only guy to learn that the hard way. That lack of deep experience is a significant danger of NoSQL designs. The folks doing that now don't even see the icebergs that relational folks successfully dodged decades ago. Much better off being nostalgic about the olden days of steam-engine trains than not even seeing the diesel-electric headlight at the end of the tunnel rushing toward you.
They are some big boots to fill. I'm just some DBA in London.
Nonetheless, large jobs are important too. Over 80% of the IO and over 90% of cluster cycles are consumed by less than 10% of the largest jobs (7% of the largest jobs in the Facebook cluster). These large jobs, in the clusters we considered, are typically revenue-generating critical production jobs feeding front-end applications.
So MR job characteristics might follow a power law distribution, and @mims is focusing on one end of the tail. Sure, that's cool!
But then @mims also selectively quotes the TC article, which ends with an excellent point that contradicts his thesis:
The big data fallacy may sound disappointing, but it is actually a strong argument for why we need even bigger data. Because the amount of valuable insights we can derive from big data is so very tiny, we need to collect even more data and use more powerful analytics to increase our chance of finding them.
I think @mims over-pursues the stupid Forbes/BI straw man here. As one would expect with data, the story is complicated. Mom and pop stores don't need to worry about Cloudera's latest offering, but companies working on the cutting edge of analysis still absolutely need tools like Hadoop, Impala, and Redshift.
And if the proposal to analyse a 1/4 million terms for kelly search had gone live, I would have been creating over a LOC (Library of Congress) of data for each run.
I will say that the article's distinction between small and big data is also important, but that just comes down to processing power. I think the distinction you make is far more important and knowing whether you need coarse or fine data can help keep you out of the issues that are introduced moving from small to big data.
But I also don't really mind if Big Data is truly big, because it's clearly different data than what businesses are used to collecting and interpreting today.
Back in the 60's there was this chamber of secrets called "the Machine Room" which had the "Mainframe" and various and sundry high priests who went in and out, and if you literally played your cards, as in punched cards, right you could get a report on how sales or manufacturing was doing this month.
That got lost when everyone had a PC on their desk, and now some folks are trying to reclaim it :-)
That said the article is still poorly argued. The cost of data management is fairly high. And generally a big chunk of that cost is the cost of specialists who provide business 'continuance' which is code for "makes sure that you can always get your data when you need it, and you can get the answers you need from it in a timely and repeatable fashion." That hasn't changed at all, and whether you have some youngster doing "IT" on the creaky Windows 2000 machine running Back Office or you are using a SaaS company like Salesforce.com, data management is and will continue to be a mission critical part of staying in business.
Oracle has used exactly this business model to great success, and obscene profit, for over 30 years.
Edit: I noticed you meant small businesses. However, Oracle does this mainly for the large companies that don't excel at technology.
Take a simple example. No small offline shop would refuse to count every customer's head turn (with direction, angle, and frequency), calculate averages, and get a sales-floor attention heatmap broken down by day, hour, age, and sex, if that cost them a $500 one-time fee for 10 cameras and $5/day for the cloud service, and were produced by pressing one big green key.
That's the Big Data driver, businesses are opening their eyes to the ability to analyze (cheap!) a huge number of small (and even smaller) factors and make better decisions.
He said: "Exactly this business model" -- that is, as it pertains to its essence.
NOT to be read as:
"Exactly this business model as it pertains to inconsequential details, like which big company they should be imitating".
While it's nice that the big data craze is making statistics more popular in the mainstream press, it is important that statistics does not become just an application of numerical methods without consideration of underlying assumptions. I stress this because this has been largely underappreciated in my experience.
I suspect a lot of these people doing "big data models" are as you say, ignoring the importance of having solid assumptions.
Oh well, that's in part exactly what brought about the financial collapse: a bunch of kids got a formula (Black-Scholes), believed blindly in its magic powers, and applied it to everything. Fast forward several years and we've got what everybody knows.
Statistics involves inference over prediction, but either one when done right validates assumptions.
A lot of statistics in business does not bother to check modeling assumptions. Models are chosen based on whether they've been used in the past and what the team is familiar with.
I don't doubt that big data (as we call it now) will one day rule. Ronald Fisher would keel over if he saw the size of datasets we work with nonchalantly on a daily basis. 50 data points (the size of the Iris data) is laughable these days.
My reservation with big data is that the technologies are often unnecessary for the size of the tasks being done. Other than a few data scientists working on truly large projects, most of the big data talk I hear comes from people who aren't fighting in the trenches (execs, marketing, journalists).
On the one hand, people are starting to realize that quantitative analysis can help their businesses (mind blowing, right?) -- on the other hand, so much of what you see about "analytics" and "big data" is nonsensical jargon. You have two camps within the OR world: people who want to ride this bandwagon all the way to the bank, and people who want to refocus on getting the message out about what OR really is.
The bandwagon-riders have succeeded to some extent. INFORMS created a monthly "Analytics" magazine, created an Analytics Certification (their first professional certification), and so on.
The other camp has a legitimate concern that OR already has an "identity crisis" (operations research vs. management science vs. systems engineering vs. industrial engineering vs. applied math vs. applied statistics etc etc). INFORMS has spent millions trying to get business people to just be aware that it exists. The fear is that hitching our wagon to these trends will just be another blow to our profile when these fad words are replaced by the next big thing.
 http://analytics-magazine.org/ (you can get a good feel for the type of content in this publication just by reading the article titles...)
People underestimate how much work it would be to shift an old server onto modern technologies and tell the statisticians to use MapReduce and NoSQL instead of SAS and SQL. If the Fortune 500 world has taken this long to catch on to R, imagine how long it'll take to completely change the DBMS and analysis software!
The references to Facebook & Yahoo running small jobs on huge clusters may be a little misleading. It may be simply the easiest place for them to deploy those jobs consistently.
But yeah... "Big Data" is a totally meaningless buzzword.
Personally, I load the data I play with into a PostgreSQL database on my laptop (if you have a Mac and want to do that quickly, you may want to check out the link I just submitted: http://en.blog.guylhem.net/post/50310070182/running-postgres... )
You can do crazy things with current hardware specs. Like loading all the data the World Bank offers for download, indexing it, and using it for regressions (I do). In 2013 you only need a laptop for that.
Most data is not big. Big data is "big" like in a gold rush, where the ones selling the tools are making the biggest profits.
EDIT: Thanks for the postgresapp.com link! It is a little bit different: here I wanted to use the very same sources as Apple, without adding too much cruft (like a UI to start/stop the daemon, as I had seen in other packages). I also wanted to see for myself how hard it was to make it work with OS X (quite easy, besides the missing AEP.make and the logfile error). It was basically an experiment in recompiling from the sources on Apple's open-source website while staying as close to the OS X spirit as possible (e.g., keeping the same user group, using dscl, using a LaunchDaemon to start the daemon automatically during the boot sequence, like for Apache).
That being said, you're right, for most people postgresapp.com will be a simpler and faster way to run a postgresql server :-)
I think this is probably even quicker than the steps you provided (though you have to remember to start it manually, rely on them updating the build, etc) http://postgresapp.com/
One of the terms I learned at PyData Silicon Valley in March is "medium data". Unless you are dealing with terabytes of RAM and exabytes of storage, Google-style, the overhead of maintaining a cluster is something most (intelligent) people try to avoid.
When you can't avoid hundreds of machines, the cluster is a necessity and you design that way. But given where the Moore's law curve stands today, most organisations really don't need that.
You can rent servers on Amazon with 250 GB of RAM for a few dollars an hour; they specifically call it the big data cluster. It is possible to analyse the data using tools like Pandas/Matplotlib and others in the scientific Python ecosystem fairly easily.
These tools are being used by scientists and industry for a really long time, except they aren't really advertised that way.
For instance, here is some analysis I was doing recently of children's names in the US going back to 1880, with 3 million records: http://nbviewer.ipython.org/53ec0c5a2fabcfebb358. My Mac could handle it without even breaking a sweat.
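Data at that scale fits comfortably in memory, which is the whole point. A stdlib-only sketch of the kind of aggregation involved (the rows here are synthetic stand-ins for the real name records, scaled down):

```python
from collections import defaultdict

# Synthetic (year, name, births) records standing in for the real data.
records = [(1880 + (i % 130), "Mary" if i % 2 else "John", 10)
           for i in range(300_000)]

# Group-by-year aggregation, the bread and butter of this kind of analysis.
births_per_year = defaultdict(int)
for year, _name, births in records:
    births_per_year[year] += births

busiest = max(births_per_year.items(), key=lambda kv: kv[1])
```

A few hundred thousand (or a few million) tuples like this aggregate in seconds on a laptop, and Pandas turns the same group-by into a one-liner.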
Last year I was talking about an implementation we did for some data and was asked about our scale; "hundreds of terabytes" was my answer. For the people we were talking to, people who know big data, that sufficed (although a bit small on their scale, it did require big-data thinking and constructs to get answers in a reasonable amount of time).
I hadn't realized how many people were wrongly moving to "big data" solutions until I read the discussions around this article. Color me surprised.
I had to smile at the statement "Is more data always better? Hardly". There's an old saying in the world of data scientists: there is no data like more data. Yes, its value may be diminishing, but when your competitor is trying to squeeze out a gain in the second decimal, you are probably better off accepting more data.
So the moral of the story is: all of this really depends. People do get fired for buying clusters. Modern cluster-management software tracks several utilization metrics, and someone, someday, is going to look at them and point out what a bad decision it was.
I remember Doc Martens. They're back.
I remember the gumby haircut. It's back.
I remember ripped jeans... also back.
Technology follows this cyclical trend as well; we just give it fancy names like Big Data, Cloud, and Anything-as-a-Service.
Most business leaders are not rational, and we should stop pretending they are.
Then, when questions come up about how much engineering went into this "thing" that does such a good job of keeping data secure, and so much of it, we say it's built on Postgres.
As DevOps Borat says:
How well you can utilize it and how quickly is just as important as what kind of data you store in the first place.
That said, I'm a huge proponent of running stuff simply at first. Few businesses will ever grow to the point that they need more than a single large database server and one or two backups. Don't waste your time prepping for something you'll probably never need, especially when fixing the problem when the time comes is only marginally more painful than doing it right in the first place.
This kind of deep linking you can't measure with straight megabytes. A few gig doesn't seem that large, but if it's a complex graph with a complex hypothesis - then, sure - that's big.
It's maybe smart data, maybe detailed data, but definitely not big data - that problem will have completely opposite needs and techniques than big data analysis, and should not be mischaracterised as such.
Hadoop is not just about running large jobs on very large data. Hadoop also makes sense when trying to scale on commodity hardware or running ad hoc queries (which can target a small amount of data) on medium to large data sets.
The truth is that very often, X is broken inside an organization not because of executive management, most of whom don't care what software packages get used or who they buy from or anything else like that, but rather big software companies get brought in because the technology/backoffice organization inside the company is a disaster.
Accounting system doesn't properly allocate widget expenses to different cost centers? Takes a week to update the homepage? No one knows where exactly sensitive data is being stored?
That's all the technology organization's failure in one way or another. And when things get bad enough, senior management says, "Okay, our homegrown accounting system is just not doing the job for us anymore", and here comes Oracle, happy to sell them their accounting system, which has all of the features they could possibly want, and sure, it's expensive, but it works, as opposed to the busted system they've got currently.
Of course, the next failure then, is that the people who will be running and overseeing and architecting this solution are either the same people who cocked up the accounting system in the first place, or consultants who have absolutely zero incentive to do anything other than maximize billable hours.
This means that instead of the organization saying, "We will adapt to off the shelf software and change our processes to better align with the way the software is designed to be used", they say, "Make your software work the way we do things".
Now we're off to the races, as various fiefdoms inside of the big company make their pitch about what needs to be customized. Everything from the layout of the screens to the workflow processes to the data model, everything has to be matched to exactly the way the customer wants to do things.
Back at Oracle HQ, the RFEs have been flying in from not just that customer, but the other 200 new customers being onboarded, and every one is basically a demand for a way to modify this or that option - no one is saying, "We wish there were fewer fields on this page."
So the customers demand more features, Oracle delivers them, and then the customers promptly use those features to further complicate their platforms, because they don't have the technical discipline to say, "No, we really don't need to support different SKU revenue allocations based on currency, we'll just do it by hand at the end of every quarter".
Looking at it a different way - how is making the software simpler going to help Oracle win business? If anything, the more features the product has, the more points they get on the RFP from the next big customer.
So everyone is to blame - Oracle makes money selling and implementing very complex technology solutions because they're answering the demands of their customers who depend on overly complex technical requirements because their technology organizations are poorly run because they don't have any discipline because senior management isn't technical enough to recognize where the failure is.
tl;dr - enterprise software is not broken because of the sales people or upper management, it's broken because the technology organizations are bad at their jobs
More often than not, the underlying issue is a lack of discipline and human-behavior (or, in the enterprise case, "organizational behavior") problems that we incorrectly label as "technology problems".
This must be a damned if you do, damned if you don't kind of situation, because I work for a company that attempted to use an OOTB Oracle software package and ended up getting roundly criticized by every part of the company, both internally and externally.
I was loosely associated a few years back with a manufacturing company migrating from their 20 year-old mainframe-based ERP solution to Oracle's ERP. They really had the worst of both worlds, because not only did they have 20 year-old business processes that no one wanted to change, but the whole interface for Oracle was so radically different from the "green screen" 3270 interface of the current system that you couldn't even make Oracle look anything like that. It was doomed to be a complete mess.
But to the point I'd originally planned to make, they tested the system in limited release, and then went live with it for one particular function, which was generating and printing order cards or something like that. What no one had thought of, and what didn't occur in testing because it wasn't a real workload, was that the old system sent raw text to the printers at the various factory sites, while Oracle (iirc) was generating postscript, complete with logos and formatting, and sending that to the printers at the factories... which, it turns out, were connected over 128kb/sec links that were promptly swamped by the size of the files.
So the whole project had to be put on hold until all of the links between HQ and the factories could be upgraded, which took months, and the feedback from the userbase was, "What a piece of shit Oracle is, our 20 year old system can print to the factories, why is it so hard for them to do that?!?!"
EDIT: looked back in my notes, 128kb/sec lines, not 512
1. The accumulation, integration and analysis of a larger number of data sources.
2. A volume of data that presents challenges running analysis functions across it, due to the limits of the tools available.
1 is fraught with the kind of statistical pitfalls that are mentioned in the posted article. 2 describes a set of problems and boundaries that are time sensitive. What was BigData in 2006 (to, say, LiveJournal or Digg) may no longer hold. As a data engineer, it's important to keep a skeptical eye on marketing and make sure we're delivering valuable solutions that increase the bottom line for our business, not just produce "ain't it cool" type correlations.
I think they are talking about the Texas Sharpshooter Fallacy
You're not going to know what correlations are important and which are not until you study the data. Telling people to just collect the "important data" is like telling someone who has lost his keys just to go back to where he left them.
It's also more than a little insulting to FB and Yahoo to insist they are not web scale. The problem of small jobs on MR clusters is real, but even with small jobs, Hadoop turns out to be a lot more cost-effective than various other proprietary solutions which are your only real enterprise alternative.
The problem of small MR jobs is being solved by things like Cloudera Impala, which can run on top of raw HDFS to perform interactive queries.
The most important thing is knowing what data you have, how best to collect it, and what it can (and can't) tell you. Just because you find correlations doesn't mean that they are real. It takes people with real expertise to help here, and just running your data on a cluster isn't going to help you. In fact, it could even hurt.
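The "found correlations aren't necessarily real" point is easy to demonstrate: test enough noise columns against a target and chance alone hands you "impressive" correlations. A minimal sketch, with all data randomly generated and a hand-rolled Pearson coefficient:

```python
# Sketch of the multiple-comparisons trap: with enough random columns,
# some will correlate with your target purely by chance.
import random

random.seed(42)

n_rows, n_cols = 100, 200
target = [random.gauss(0, 1) for _ in range(n_rows)]

def pearson(x, y):
    """Plain Pearson correlation coefficient, no libraries needed."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# 200 columns of pure noise, no real relationship to the target at all.
best = max(
    abs(pearson(target, [random.gauss(0, 1) for _ in range(n_rows)]))
    for _ in range(n_cols)
)
print(best)  # routinely above 0.2, from noise alone
```

Scale the column count into the thousands - easy on a cluster - and the strongest spurious correlation only gets more convincing. That's why expertise, not more hardware, is the scarce resource here.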
I didn't see anything wrong with the article at all.
He doesn't tell anyone to collect the "important data," and he doesn't insist FB or Yahoo are not web scale.
His concluding paragraph is relatively weak, but the main thesis -- most businesses can ignore the Forbes/BI crap and analyze their data sufficiently using normal tools -- is true and sound.
Imagine our data was 98% junk with 2% of the data consisting of sequential patterns. We may be able to spot this on a graph relatively easily over the whole dataset, but our random sampling would greatly reduce the quality of this information.
We can extend that to any ordering or periodicity in the data. If data at position n has a hidden dependency on data at position n+/-1, random sampling will break us.
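A minimal sketch of that failure mode, using a toy alternating signal whose entire structure lives in the lag-1 ordering:

```python
# Sketch: a stream where data[n] depends on data[n-1]. The lag-1
# autocorrelation is perfect in the full stream, but collapses once we
# random-sample and re-read the survivors in order.
import random

random.seed(0)

# Alternating +1/-1 signal: perfectly anti-correlated at lag 1.
data = [(-1) ** i for i in range(10_000)]

def lag1(xs):
    """Mean product of adjacent elements - crude lag-1 autocorrelation."""
    pairs = list(zip(xs, xs[1:]))
    return sum(a * b for a, b in pairs) / len(pairs)

print(lag1(data))  # -1.0: the pattern is unmistakable in order

# Keep a 5% random sample, preserving the original ordering of survivors.
sample = sorted(random.sample(range(len(data)), 500))
kept = [data[i] for i in sample]
print(lag1(kept))  # near zero: adjacency broken, pattern washed out
```

The sample is a perfectly fair summary of the values, and a perfectly useless summary of the sequence - which is exactly the distinction between this kind of problem and the aggregate statistics big-data sampling is built for.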
Disclaimer: I haven't read the post. Only the title.
>it appears that for both Facebook and Yahoo, those same clusters are unnecessary for many of the tasks which they’re handed. In the case of Facebook, most of the jobs engineers ask their clusters to perform are in the “megabyte to gigabyte” range (pdf), which means they could easily be handled on a single computer—even a laptop.
That Facebook or Yahoo could be run from a laptop?