Big data: are we making a big mistake? (ft.com)
199 points by pietro 335 days ago | comments



Another conclusion to draw from this article (which I really enjoyed, by the way) is that Big Data has been turned into one of the most abstract buzzwords ever. You thought "cloud" was bad? "Big Data" is far worse in its lack of specificity.

I can't count the number of times I'll be talking to some sales rep and they'll describe how they scan the data within whatever application they're demoing and "suggest" items using "big data techniques". In almost all cases they're talking about a few thousand or hundred thousand records, tops.

I've found that when non-hardcore techies talk about Big Data, what they really mean is "they have some data" vs before, when they had zero data.

From the article:

"Consultants urge the data-naive to wise up to the potential of big data. A recent report from the McKinsey Global Institute reckoned that the US healthcare system could save $300bn a year – $1,000 per American – through better integration and analysis of the data produced by everything from clinical trials to health insurance transactions to smart running shoes."

What these consultants mean is that by having just some data compared to the siloed data that is the norm in US healthcare, they could save a lot, and they're right. My previous company had a large data set (20+ million patients) and we'd find millions of dollars of savings opportunities for every hospital we implemented in, but that's because we had the data, not because we were running some kind of non-causal correlation analysis like the article references. It was just because we could actually run queries on a data set.

-----

Off Topic - how annoying is it that when you copy & paste from the FT, they preface your copy with the following text?

High quality global journalism requires investment. Please share this article with others using the link below, do not cut & paste the article. See our Ts&Cs and Copyright Policy for more detail. Email ftsales.support@ft.com to buy additional rights. http://www.ft.com/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabd...

-----


Less than a month ago I was at an event about Big Data in marketing. One of the speakers spoke about how they used user data to improve their client's brand experience. It was very effective and I agree they did something extraordinary.

I then asked what tools they used. He responded with a well-known relational database. I then asked the total size of their dataset, with a good idea of what the upper bounds would be. He responded "around 100 million events" since the product started, 6 months ago.

It's really sad because they may end up under fire despite the effectiveness of their work.

Big Data is a lot like teen sex.

-----


>Big Data is a lot like teen sex.

LOL.

-----


But, like "cloud" or "web 2.0" or any similar buzzword, there obviously is some substance to it, whether unspecific, un-novel, abused, or not. It just breaks into unsatisfying mush when you look at it too closely.

Web 2.0 was some sort of a shift over web 1.0: the line between publisher and consumer melted. Cloud is etherealizing computing and data. There was a thread a few days ago about the film Her. "Where is Samantha" (the AI) is a borderline nonsensical statement. It doesn't come up for a viewer. That's because people are used to cloud as an idea now. It doesn't really matter that servers, replication, dumb clients, remote data or whatever were invented a long time ago.

-----


I enjoyed your post, nemesisj. Within your field of longitudinal patient data, if I am correct in what you have written, your large datasets simply have a new name, parenthetically Big Data, and you could get what you need to save money without the newfangled algorithms. Within academic bioscience, I think there is great consensus on what Big Data is - I have not seen much argument at all; however, it is still very hard to define within that field. The best I can do, over this cup of coffee, is to state that there is a clear distinction between the study of a gene, up to a few pathways vs. computational analysis of multiple OMICS (genomics, metabolomics and proteomics) datasets. I know that definition is terribly lacking and I am fighting the urge to delete it for the sake of getting the post completed. Anyway, Big Data is clearly changing the academic biosciences through the funding trend. That is, grants with a computational focus, or sub focus, certainly seem to be doing comparably well. I mention this because today's academic funding trends influence the direction of tomorrow's startups as those being trained are disproportionally within the better funded labs, and draw upon their previous experience when forming companies. So, I personally believe this Big Data thing, however it is best defined overall, or within a given field, is in some way something new, and will continue to shape the startup sphere for years to come, especially in the areas of genomics, metabolomics and proteomics. This is my first post: ) Thanks!

-----


" This is my first post: )"

I really don't mean to be snarky, but please try to use paragraphs. It'll make it much more likely people will read your post.

-----


Out of curiosity, when does it effectively become "big data"?

I ask not to be snarky, but it might be the case that it's "big data" to someone else, but not necessarily to you. I figured it was a relative term for your industry/business, but the hacker crowd definitely seems to peg that amount in the millions of data points before calling it big data at all.

Seems fair, but I'd rather clarify.

-----


I usually follow DevOps Borat's definition [1]:

"Big Data is any thing which is crash Excel."

Many a true word spoken in jest.

[1] https://twitter.com/DEVOPS_BORAT/status/288698056470315008

-----


"Small Data is when is fit in RAM. Big Data is when is crash because is not fit in RAM."

https://twitter.com/DEVOPS_BORAT/status/299176203691098112

-----


This is very inaccurate/misleading IMHO. Big Data is something which does not fit in a regular machine for a given operation. You can sort billions of records on an iPhone, for example. You can grep for a string within a terabyte-sized file on a single personal computer, and I am not convinced you'd go faster with a distributed system (reading the file from cold storage will be the limiting factor). People claiming to do "big data" in these situations do not generally understand the underlying concepts.
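
To make the single-machine point concrete, here is a minimal Python sketch of a streaming search. The function names `count_matching_lines` and `grep_file` are made up for illustration, and it assumes the file is line-oriented text. It reads one line at a time, so memory use stays roughly constant no matter how large the file is; the run time is then dominated by how fast the disk can deliver bytes.

```python
def count_matching_lines(lines, needle):
    """Count lines containing `needle`; consumes any iterable of strings lazily."""
    return sum(1 for line in lines if needle in line)


def grep_file(path, needle):
    """Stream a text file from disk and count the lines that contain `needle`."""
    # Iterating over a file object yields one line at a time, so memory use is
    # bounded by the longest line rather than by the size of the file.
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return count_matching_lines(handle, needle)
```

Something like `grep_file("events.log", "ERROR")` (a hypothetical file name) would work the same way on a 1 MB file and a 1 TB file; only the wait changes.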

-----


With a distributed storage system you should be able to read said terabyte file using far more disk heads.

It would also be easier to engineer it so the terabyte file was entirely in RAM by distributing it across multiple machines (although single machines with a TB of RAM are no doubt continuing to become more common).

Sure, store it on a single tape or disk and distributing the computation won't help. You need distributed storage to properly leverage distributed computation for otherwise I/O bound processes.
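
A back-of-the-envelope sketch of that argument in Python. The 150 MB/s sequential read rate and the 100-disk stripe are assumptions picked for illustration, not measurements, and the model ignores network and coordination overhead entirely.

```python
def scan_seconds(total_bytes, bytes_per_second_per_disk, disks=1):
    """Idealised time to read `total_bytes` striped evenly across `disks`."""
    return total_bytes / (bytes_per_second_per_disk * disks)


TERABYTE = 10**12
DISK_RATE = 150 * 10**6  # assumed: 150 MB/s sequential read per disk

single = scan_seconds(TERABYTE, DISK_RATE)              # one disk
striped = scan_seconds(TERABYTE, DISK_RATE, disks=100)  # 100 disks in parallel

print(f"single disk: {single:,.0f} s, 100 disks: {striped:,.1f} s")
```

Under these assumptions the single disk needs close to two hours for one pass while the stripe needs about a minute, which is the whole case for distributing the storage and not just the computation.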

-----


You underestimate Excel! You can point Excel to a DB table and use pivot tables on top.

I know it can at least get up to several million, didn't have a chance to test it beyond that! :)

-----


Haha, perfect.

-----


I think Hadley Wickham has a decent description of big data in terms of the analytical process... to expand slightly on his description:

On normal data you can iteratively explore and visualise it, hitting return and seeing plots or model results instantaneously, or within a few seconds at most.

When you have time to grab a coffee after hitting return then you have bigger data.

If you carefully think through what you are about to ask the computer to do before pressing return then maybe you have big data.

I actually think this is a better description than just the size of files or data distributed across many computers, as an algorithm that just streams over a massive dataset, maybe in parallel, can be less challenging than one that has to hold a much smaller (e.g. many GB) dataset fully in memory.
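
As an illustration of the "just streams over it" case, here is a sketch of Welford's one-pass algorithm for the mean and sample variance. The function name is made up; the point is that it keeps three numbers of state regardless of how many values pass through, which is why this kind of statistic scales so much more easily than anything that needs the whole dataset in memory.

```python
def running_mean_variance(values):
    """One pass over `values`; returns (count, mean, sample variance)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    # The sample variance is undefined for fewer than two observations.
    variance = m2 / (count - 1) if count > 1 else float("nan")
    return count, mean, variance
```

`values` can be any iterable, including a generator that parses lines from a file far larger than RAM.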

-----


So my complicated algorithm that processes 200,000 data points is big data because it takes 1/2 hour to run, but someone else's petabyte algorithm that takes 1 second on a cluster is small to "bigger" data? I don't think this makes sense.

It's a measure of the size of the problem to be sure, but it is not a measure of the size of the data, or an indication of what techniques might be required to solve the problem.

-----


Yes, to a data analyst that is big data. If you are doing some MCMC and that is really what it takes on that size of data then you have a big data problem.

The more sophisticated a statistic, the more high dimensional the data, the more sampling required, or the more of the dataset that has to be held in memory at once - then the smaller your big data threshold will be.

It depends a lot on your point of view too. If I google something now it may bounce across lots of crazy server farms, but to me I don't feel like I'm doing big data. The person who built it all probably feels differently.

-----


I think defining "big data" as something that depends on the algorithm you are applying is not very useful. In that sense, almost anything is big data when you are trying to solve an NP-complete problem (nice course about algorithmic approaches to these problems: https://www.coursera.org/course/optimization).

The problem is, if you define "big data" as something that depends on the algorithm, then it makes no sense to include the word "data" in it. The expression "big data" as it is commonly used refers to flows of data so big that you need specialized approaches even when applying simple transformations to the data.

Since the dawn of computing we've always wanted to solve problems, with small or large amounts of data, that required complex algorithms. The usage of a new expression is justified by the fact that huge flows of data are now available to many companies (mainly because of the web), not because these companies are attempting to perform extremely complicated transformations to the data.

TL;DR: Big data means big volume/flow of data, and not (as you are defining it) using large Big-O complexity algorithms on some set of data. In fact, the size of big data precludes applying large Big-O complexity algorithms to said data.

-----


I appreciate the definition of Big Data as requiring >1 machine.

However... Small Data, that which traditional researchers handle, is normally much, much smaller than that: perhaps 10-10000 data points (and most often on the small end of that). An experienced researcher can essentially know everything about this data set, including its outliers and quirky points, and get a good sense of it by drawing out simple graphs.

There is clearly some disconnect between these two ideas: is that "Medium Data"?

I would accept a concept of "Big Data" as data that cannot easily be eyeballed to get a sense of what's going on, so 10000+ points would count (under some circumstances). Maybe the concept of "six sigma" is useful - enough data that you would reasonably expect a six sigma outlier.
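
For what it's worth, the six sigma idea can be turned into a number. This sketch assumes the data are normally distributed and reads "six sigma" in the plain statistical sense (six standard deviations from the mean, either side); the function names are made up.

```python
import math


def two_sided_tail(k):
    """P(|Z| > k) for a standard normal variable Z."""
    return math.erfc(k / math.sqrt(2))


def points_for_one_expected_outlier(k):
    """Sample size at which one observation beyond k sigma is expected."""
    return 1 / two_sided_tail(k)


print(two_sided_tail(6))                   # roughly 2e-9
print(points_for_one_expected_outlier(6))  # roughly 5e8
```

So under a normal model you would need on the order of half a billion points before a six sigma outlier is expected, whereas 10000 points corresponds to something closer to four sigma. Real data usually have fatter tails than this, so treat it as an upper bound on the sample size.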

Mathematically/statistically, the storage limit is not a particularly important milestone: the ideas and methods don't change once you reach this scale (except for potential parallelisation).

-----


I was just going to post the ">1 machine" definition when I saw your comment.

I think there are also at least two meanings of "Big Data". The more popular one is simply a trendy name for good old and boring "statistics", but with a twist that the data comes by way of the Internet, social media, all that.

The second one (and a little closer to my heart) is what ">1 machine" means from a developer/sysadmin perspective. This is where the hadoops, hives, cassandras, etc. come into play, and it's A LOT to learn, even for seasoned developers.

I think it's also a little intimidating for people who have become very comfortable with the typical rdbms stack. Parallel processing can be hard to understand, it's not something you can tinker with on your laptop over the weekend, and it's not surprising to hear all the "your big data thing is stupid" comments.

-----


I thought the definition in the article was actually really insightful: big data is when you start to behave as if you have N=all for a non-trivial sample.

-----


Big Data used to mean petabytes, i.e. above the limits of performant scale-up.

-----


When you are constrained to O(n) methods, you have big data.

-----


Some people are constrained to O(WTF) methods and have no idea about O. So everything is Big Data.

-----


I prefer this definition to others involving the size of memories, or number of computers, because it underscores the data rate instead of just its (instantaneous) volume.

-----


The typical definition is where standard data management approaches do not work due to high volume, velocity, and/or variety of data sources.

What are standard data management approaches? I don't know. Usually they mean single machine relational db's.

But the thing is that once you get to a certain point on these three you need specialized solutions. High volumes of transactional data with real-time reporting might be handled well by something like Postgres-XC, but that won't handle data of sufficient variety. High velocity data may be best handled with something like VoltDB, but it can't handle volume. Etc....

-----


I believe the accepted definition for big data is data you can't handle on a single machine and need a cluster to process. So Moore's law makes it a moving threshold.

-----


Fourteen years in IT, ten of them in BI... My definition is as follows: data pertaining to and generated by the source systems that govern various business processes are 'data' data (internal data owned by the business.)

Data pertaining (in whatever abstract sense) to the business, generated by systems outside of the business are 'big'(external data.)

Nothing to do with rowcounts directly IMHO.

-----


I think "big data" is a term characterized more by the analytical techniques you apply on them rather than the size of the data. Traditional inferential statistical techniques work on "small data", while newer Bayesian techniques work on "big data" - note that this does not imply that one cannot work on the other.

-----


Not necessarily. Try running basic descriptive statistics on terabyte scale data.

-----


When you can't read all the data.

~(From a math professor I worked with)

-----


Big Data is less about size and more of a characterization. Human generated data can never be big data - there's just not enough humans to make it all. Possibly with the exception of the biggest social networks.

Big Data is machine generated by systems. Typically it's logs, IoT etc.

-----


Not sure how you came to that conclusion. Humans generate voice data, which are then translated to a time series of frequencies for analysis.

Comment data on this site alone would be a pretty big task to analyze.

-----


Re: copying from FT, if you're using Firefox you can set dom.event.clipboardevents.enabled to false to get around that. Will probably break copying in some web apps.

-----


NoScript also solves the problem, since they need to run JavaScript code to disable copying. (Their site displays just fine without JavaScript.)

-----


Noam Chomsky had the best response to "big data": it's basically a nonsense concept (which I agree with) because "thinking is hard."

-----


Aiming to support the parent's claim (ElDiablo66, is this what you're referring to?), here's a quote from an article [1] about the Chomsky vs. Norvig debate a couple years ago:

Chomsky critiqued the field of AI for adopting an approach reminiscent of behaviorism, except in more modern, computationally sophisticated form. Chomsky argued that the field's heavy use of statistical techniques to pick regularities in masses of data is unlikely to yield the explanatory insight that science ought to offer. For Chomsky, the "new AI" -- focused on using statistical learning techniques to better mine and predict data -- is unlikely to yield general principles about the nature of intelligent beings or about cognition.

[1] http://www.theatlantic.com/technology/archive/2012/11/noam-c...

[2] HN thread on [1]: https://news.ycombinator.com/item?id=4729068

-----


Who said anything about thinking, and how do you know it's hard?

EDIT: I'm getting downvoted, but your statement is incredibly vague and I believe wrong. "Big Data" might be overused as a buzzword, but it's not a "nonsense concept". "Thinking is hard", I assume you are talking about strong AI, and it's not related to this at all. Saying it's "hard" adds nothing of value, and we don't even know if it's true (in the sense that when someone does figure it out, it might seem simple and obvious in retrospect.)

-----


Downvoted for mentioning downvoting but totally agree.

I see no connection between big data and AI. Like everything you can of course apply AI to it but I think step one is getting the analytic side down pat.

And also agree thinking may not be hard. It is hard to create a thinking machine (other than using DNA) but I don't necessarily think there is anything special to it actually thinking.

I'd be disappointed if Chomsky actually thought this way, would need context.

-----


>Saying it's "hard" adds nothing of value, and we don't even know if it's true (in the sense that when someone does figure it out, it might seem simple and obvious in retrospect.)

If it takes decades of hard working geniuses to figure it out, then even if "it seems simple and obvious in retrospect", it IS hard.

-----


They only add their preface if the text is greater or equal to 185 characters, it seems.

-----


I believe Tynt pioneered this copy/paste: http://www.tynt.com/product_copypaste

http://daringfireball.net/2010/05/tynt_copy_paste_jerks

-----


"But while big data promise much to scientists, entrepreneurs and governments, they are doomed to disappoint us if we ignore some very familiar statistical lessons.

“There are a lot of small data problems that occur in big data,” says Spiegelhalter. “They don’t disappear because you’ve got lots of the stuff. They get worse.”"

This should be the main learning point. Humans can be astonishingly bad at dealing with stats and biases, which can lead to erroneous decisions being made. If you want an example where such decisions by very smart people can have catastrophic consequences, look up the Challenger disaster [1].

I rarely see people stating their assumptions upfront, which doesn't help the problem (I guess it's not cool to admit potential weaknesses). The more people/companies that get into 'big data' (without adequate training) the more false positives we're going to see.

[1] http://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disast...

-----


This article reminds me of the argument [0] between Noam Chomsky [1] and Peter Norvig [2]. TL;DR (paraphrased with hyperbole) Chomsky claims the statistical AI of Norvig is a fancy sideshow that doesn't understand _why_ it is doing a thing. It just throws gigabytes of data at an ensemble and comes out with an answer.

[0] - http://www.theatlantic.com/technology/archive/2012/11/noam-c...

[1] - http://en.wikipedia.org/wiki/Noam_Chomsky

[2] - http://en.wikipedia.org/wiki/Peter_Norvig

----

Norvig's rebuttal: http://norvig.com/chomsky.html

-----


Also relevant to this discussion is Douglas Hofstadter's solitary pursuit of 'thinking machines', outlined recently in this Atlantic profile: http://www.theatlantic.com/magazine/archive/2013/11/the-man-...

This analogy is particularly illuminating,

"“The quest for ‘artificial flight’ succeeded when the Wright brothers and others stopped imitating birds and started … learning about aerodynamics,” Stuart Russell and Peter Norvig write in their leading textbook, Artificial Intelligence: A Modern Approach. AI started working when it ditched humans as a model, because it ditched them. That’s the thrust of the analogy: Airplanes don’t flap their wings; why should computers think?"

While the Norvig-Chomsky debate is about the philosophy of the science of AI, it has practical implications to practitioners who tend to apply statistical techniques as if they are popping a pill. Engineers applying statistical learning, etc. should understand the limitations of the techniques, as outlined by Chomsky in the debate. The outcome of the Chomsky-Norvig (or Hofstadter vs. everyone else in CS) debate is less important than the arguments put forth by both the groups.

-----


The problem with the analogical comparison between the tuples [birds, airplanes, flight] and [humans, AI-machines, intelligence] is that flight is a clear and unambiguous achievement, whereas intelligence is something we haven't fully defined and for which we humans are our only accepted model (and self-interrogation is an activity that can feel easy but in which we have found many subtle and obvious problems).

-----


I think they are both wrong. You need better models than just throwing lots of data at something simple which Norvig likes. But they are still statistical models at some level.

-----


Norvig's rebuttal is an excellent read.

-----


> a provocative essay published in Wired in 2008, “with enough data, the numbers speak for themselves”

I think that's indicative of Wired's breathless enthusiasm for technology, which turned me off buying the print version many years ago.

Scrape away some of the hyperbole and it is true that data driven management has made many companies more competitive and, if I dare mention the hobgoblin, efficient.

Hunches and ideas can only get you so far. It is important to visit the data gemba and do the genchi genbutsu.

http://en.wikipedia.org/wiki/Gemba

http://en.wikipedia.org/wiki/Gembutsu

-----


I have some Wired issues from the mid-90s in the bathroom and the tone is the same.

It seems pretty much everything they write about is supposed to change the world in a major paradigm shift.

-----


It delights in the techno-utopia envisaged by Nicholas Negroponte, personally I just can't be doing with it.

http://en.wikipedia.org/wiki/Nicholas_Negroponte

-----


I'm much more impressed when someone can squeeze information out of small data. W. S. Gosset was extracting tons of information from as little as two observations. I'm very grateful that my advisor guided my cohort to work with two-observation MLE in many contexts. This type of practice focuses the analyst on squeezing out as much information as possible. When applied to big data, this approach can be very useful. Big data comes with data wrangling challenges, but if you don't carefully squeeze out information, you'll be leaving tons and tons on the table.

-----


The misconceptions about big data are similar to those surrounding the word science.

Many people associate "science" with things: cells, microscopes, the inner workings of the body. But science isn't a set of things; it's a process, a method of thinking, that can be applied to any facet of life.

Big data is similar, in my opinion. It's not so much about the stuff —  the size or diversity of a company's datasets. It has more to do with the types of observations you're making and the statistical methods involved.

This distinction is important for two reasons:

1. If Big Data is recognized as a process rather than a circumstance, businesses will be more deliberate in deciding whether to use the methods. They will weigh the benefits of, say, MapReduce against other approaches.

2. The idea that "Big Data" techniques have everything to do with size is somewhat misleading. A comprehensive query of a 50,000 user dataset can be more computationally expensive than a simple operation on a 100,000-record dataset.
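
Point 2 is easy to check with arithmetic. The sketch below assumes "comprehensive query" means comparing every user with every other user, and "simple operation" means touching each record once; both readings, and the function names, are assumptions made for illustration.

```python
def pairwise_comparisons(n):
    """Number of distinct pairs among n items (an all-against-all query)."""
    return n * (n - 1) // 2


def single_pass_operations(n):
    """Work for an operation that touches each record exactly once."""
    return n


small_but_quadratic = pairwise_comparisons(50_000)   # 1,249,975,000
large_but_linear = single_pass_operations(100_000)   # 100,000

print(small_but_quadratic // large_but_linear)
```

The smaller dataset ends up needing over ten thousand times more work, so row count alone says little about how hard the job is.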

-----


It's the misconception that measurable observations equal the real distribution of the underlying events. Even professional data people often get that wrong, and it's not strictly limited to big data.

One of the most obvious examples was this one: A data set of all known meteorite landings [1] turns into "Every meteorite fall on earth mapped" [2], which looks like a world population map sprinkled with some deserts known for their meteorite hunter tourism. The actual distribution can be theoretically described as a curve falling towards the poles. [3]

While this example is pretty obvious, one could expect similar observation biases in other data sources. A danger lies where data analysts do not bother to investigate what their data actually represents and then go on to present their conclusions as if they were some kind of universal truth.

[1]http://visualizing.org/datasets/meteorite-landings

[2]http://www.theguardian.com/news/datablog/interactive/2013/fe...

[3]http://articles.adsabs.harvard.edu//full/1964Metic...2..271H...

previous discussion of this: https://news.ycombinator.com/item?id=5240782

-----


You have the same problem with historical global temperature data: weather stations tend to be in or near populated areas, which excludes oceans (70% of the earth's surface area) and huge, sparsely populated regions like the Arctic, the Antarctic, deserts, rain forests, remote mountain ranges like the Himalayas and Andes, etc.

-----


Agreed, the buzzword 'Big Data' has nothing to do with the actual size of a given dataset except that it is about gathering as much data (really metadata) as possible and finding novel ways to extract value from that data.

-----


I get the impression from looking at local "big data" events that the enterprise software crowd has tuned into big data.

I fear that now that SOAP and enterprise buses have gone their way, they are looking for a new buzzword to sell. More solutions looking for problems...

-----


I find it amusing that the article talks about big mistakes in polling data, when the clear winner of the last two US elections is one Nate Silver, who aggregated polls to get predictions so close to the actual results, one wonders why people actually vote anymore.

Now, just like with every other technological solution, we only learn about the limits of its use by overuse. There's plenty of people out there storing large amounts of data and getting no valuable conclusions out of it. But the fact that many people will fail doesn't mean the concept is not worth pursuing.

Chasing what is cool is a pretty dangerous impulse. The trick is to be able to tell when it can pay off, and to quickly learn when it will not, and cut your losses. Maybe you don't need big data, just like maybe your shiny cutting edge library might not be ready for production.

-----


Nate's approach is based on evaluating the quality of the various polls - which is the thrust of the FT article. In fact he actively weighted each of the polls & corrected for known biases.
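
A toy Python sketch of what "weighting polls and correcting for known biases" can look like. This is not Silver's actual model, and every number below is hypothetical; it only shows the shape of the calculation: subtract each pollster's estimated house effect, then take a quality-weighted average.

```python
def aggregate_polls(polls):
    """Weighted average of bias-corrected poll results.

    `polls` is a list of (reported_share, house_bias, weight) tuples, where
    house_bias is how far that pollster is believed to lean (in points) and
    weight reflects how much the poll is trusted.
    """
    total_weight = sum(weight for _, _, weight in polls)
    corrected = sum((share - bias) * weight for share, bias, weight in polls)
    return corrected / total_weight


# Hypothetical polls: (reported share %, assumed house bias, assumed weight)
polls = [
    (52.0, +1.0, 3.0),
    (49.0, -1.0, 1.0),
    (51.0, 0.0, 2.0),
]
print(round(aggregate_polls(polls), 2))
```

The interesting work is in estimating the biases and weights, which is exactly the "small data" discipline the FT article argues cannot be skipped.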

-----


But the article also talks about good polling data too. In effect, people have been making good and bad election predictions for decades.

-----


Great article. I think the brightest gem here is the Multiple comparisons problem:

http://en.wikipedia.org/wiki/Multiple_comparisons

-----


http://xkcd.com/882/

-----


If we aren't careful the singularity AI will believe in God, and not necessarily us.

-----


I didn't get it. Would you please elaborate?

-----


This is my favorite line and the one that damns so many "big data" efforts:

"They cared about correlation rather than causation."

Analytics are a tool to help find correlations and patterns so that humans can do the hard work of determining and testing for causation. Computers are doing their jobs; humans aren't.

-----


The “with enough data, the numbers speak for themselves” statement has several meanings.

In one sense, if you can observe real phenomena, you don't have to guess at what is happening. For businesses that collect troves of it, they may need statistics 'less' because the sample size may approach the population size.
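
The "sample size may approach the population size" point has a standard textbook form, the finite population correction for the standard error of a mean when sampling without replacement. A minimal sketch, with a made-up function name:

```python
import math


def standard_error(sample_sd, n, population=None):
    """Standard error of a sample mean.

    If `population` is given, apply the finite population correction
    sqrt((N - n) / (N - 1)) for sampling without replacement.
    """
    se = sample_sd / math.sqrt(n)
    if population is not None:
        se *= math.sqrt((population - n) / (population - 1))
    return se


print(standard_error(10, 100))         # no correction: sampling from a huge population
print(standard_error(10, 900, 1000))   # 90% of the population sampled
print(standard_error(10, 1000, 1000))  # whole population observed: 0.0
```

Sampling uncertainty does shrink to zero once n reaches N, but only for questions about that exact population, and it says nothing about measurement error or bias.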

But calculating basic (mean, standard deviation, etc.) statistics is hardly the most interesting part. Inferential statistics is often more useful: how does one variable affect another?

As the article points out, the "... the numbers speak for themselves” statement may also be interpreted as "traditional statistical methods (which you might call theory-driven) are less important as you get more data". I don't want to wade in the theory-driven vs. exploratory argument, because I think they both have their places. Both are important, and anyone who says that only one is important is half blind.

Here is my main point: data -- in the senses that many people care about; e.g. prediction, intuition, or causation -- does not speak for itself. The difficult task of thinking and reasoning about data is, by definition, driven by both the data and the reasoning. So I'm a big proponent of (1) making your model clear and (2) sharing your model along with your interpretations. (This is analogous to sharing your logic when you make a conclusion; hardly a controversial claim.)

-----


What executives say it does...

"Facebook’s mission is to give people the power to share and make the world more open and connected."

What it actually does... (that will be left to the reader.)

"Big Data" is often sold as one thing by Enterprise software folks. But what value the data, or processing of it actually has is usually much more dependent on the user and his context (like FB!) and usually doesn't fit as nicely onto a PPT slide.

Articles like this usually confuse the PR definition and the analyst definition.

-----


A few other comments have raised this point, but Big Data is basically the new Web 2.0. Aside from being a buzzword, as a term it's so nebulous that half of the articles about it don't really define what it is. When does "data" become "big data"?

-----


Conclusion: "Big Data" is a stupid buzzword and it makes me cringe every time I'm forced to say it to sell some new solution or frame something in a way someone who barely knows anything about computer science can understand.

It's nebulous. I've seen it applied to machine learning, data management, data transfer, etc. These are all things that existed long before the term, but bloggers just won't STFU about it. Businesses, systems, etc. generate data. If you don't analyze that data to test your hypotheses and theories, at the end of the day, you don't understand your own business and are relying on intuition for decision making.

-----


There is definitely value to big data, but isn't it also a form of legitimizing stereotypes, at least in some cases? I mean, the general premise of big data, is to glean conclusions and new knowledge of the world from billions of records. When humans are the source of the data that is being extracted and analyzed, are the conclusions not stereotypes of those individuals, unless the correlation is 100%? This might be ok, and even useful, when trying to optimize clicks on ads, but what about when the government uses it to make policy decisions?

-----


if i work for facebook and i want to figure out something about my users, isn't it safe to say N = All since the data i'm accessing is all user data from fb? it's easy to go wrong with big data, and although the article glossed over some fairly important things (assuming the people who work on these datasets are much dumber than they are in reality), they're right on about the idea that the scope and scale of what big data promises may be too grandiose for its capabilities

-----


Whilst, in the example you provide, it might be the case that "N = all", the cautionary tale offered in the article is that you always need to make sure you are asking the right question, and it is pretty easy to confuse yourself.

So you said "if i work for facebook and i want to figure out something about my users", and for whatever you were doing, looking at your existing user base might be the right thing to do. Perhaps, though, you actually want to know something about all your potential users, not just the users you happen to have right now. Whether or not your current user base offers a good model for your potential user base would then be a pretty important question, and one that almost certainly isn't answered by "big data".

I think that, as with most of statistics, the key point is "think about your problem", and that focusing on a set of solutions rather than the problems themselves can get in the way of that.

-----


Even if you have the full population in question and thereby avoid sampling issues, you still have a lot of pitfalls. For example if you just start correlating every variable against every other one and picking out ones that hit some test of statistical significances as "findings", you run into a range of familiar problems generally grouped under the pejorative term "data dredging".
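
The size of the data dredging problem is easy to put numbers on. This sketch assumes every test is independent and every null hypothesis is true (so any "finding" is a false positive), and uses the conventional 0.05 threshold; the function names are made up.

```python
def familywise_error_rate(tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** tests


def expected_false_positives(variables, alpha=0.05):
    """Expected spurious 'findings' when every pair of variables is tested."""
    pairs = variables * (variables - 1) // 2
    return pairs * alpha


print(round(familywise_error_rate(20), 2))  # about 0.64
print(expected_false_positives(50))         # about 61 from 1,225 pairwise tests
```

With only 50 unrelated variables you would expect around sixty "significant" correlations from noise alone, which is why corrections such as Bonferroni (dividing alpha by the number of tests) exist.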

-----


At first I thought so too. But it's actually easy to come up with cases where N != all. As a radical example, Facebook preserves the accounts of dead users.

-----


Any either-or discussion is doomed to fail. Saying that BigData is the end of theory is clearly nonsense.

BigData vs. Theory, Java vs. C++, Capitalism vs. Socialism, Industry vs. Nature, Good vs. Bad, etc.

BigData allows you to store a lot of data and provides a means to run some computation on that data. Not more, and not less.

-----


"Big data can tell us what’s wrong, not what’s right."

from: http://www.wired.com/2013/02/big-data-means-big-errors-peopl...

-----


I think this site is really related to this topic, even if it doesn't involve the term 'big data'. http://www.statisticsdonewrong.com/

-----


Reminds me of chartism vs http://en.wikipedia.org/wiki/Efficient-market_hypothesis

-----


I like to think that every object or living being in this world has properties and methods as in programming ... This is the source of data, small or big depending on actions or complexity

-----


Well I was considering a career as a Data Scientist having a strong interest in this sort of thing and as a poker player.

This just kills my vibe, man.

-----


We're making a big mistake with every big thing; that's the way we handle buzzwords.

-----


Excellent article.

New favorite phrases "data exhaust" and "digital exhaust".

-----


TL;DR. Too much 'data exhaust' can cause 'data vomit'.

-----


Nonsense. Google Flu was not "Big" data, they had only a few years' worth of data at best. Additionally, when combined with current CDC data, its predictions were better than models based on CDC data alone. And in all likelihood they can improve it with better methods.

-----



