How to Call B.S. On Big Data: A Practical Guide (newyorker.com)



I recently got entangled with some "big data" and "machine learning" B.S. in the form of the U.S. health care system.

CMS, the federal agency that administers Medicare, introduced a hospital quality ratings system last year. It is supposed to combine a variety of objective metrics into an easy-to-understand grade for hospitals.

However, the techniques they used are really bad. For example, a programming error makes the model give different results depending on how the data are sorted. Some measures get negative weights, meaning a hospital should do worse to get a better rating.

I wrote more about the technical failures here: https://sites.google.com/site/bbayles/index/cms-star-ratings

Another criticism: http://jktgfoundation.org/data/An_Analysis_of_the_Medicare_H...


This is very fascinating, so thank you for looking into this. I think you may have misunderstood the rating system. You said

>>>Let’s say I really have to produce a rating, though. What would I do? I would probably: - Find some experts and ask them to assign weights to my various measures on the basis of how much they contribute to quality

But isn't this exactly what the latent variable model (really a PCA) is doing? The only difference is that, rather than having experts pick 60 weights, one per measure, which would require 60 contentious decisions, the PCA does a form of dimensionality reduction so the experts need only pick weights for 7 components, which "unravel" into 60 weights. This sounds reasonable to me, assuming, of course, that the PCA components each have a meaningful interpretation and measure the degree of "good".
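
For the curious, here is roughly what that "unraveling" looks like in code. This is a minimal sketch with synthetic data and made-up expert weights, not CMS's actual model:

    # Minimal sketch: expert weights on a few PCA components "unravel" into
    # per-measure weights. Data and weights are synthetic, for illustration only.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(3000, 60))        # 3000 hospitals x 60 quality measures

    pca = PCA(n_components=7).fit(X)
    expert_weights = np.full(7, 1 / 7)     # experts weight 7 components, not 60 measures

    # pca.components_ is 7 x 60: each component is a linear combination of the measures,
    # so the implied per-measure weights are just a weighted sum of the loadings.
    measure_weights = expert_weights @ pca.components_    # shape (60,)

    # Nothing forces these implied weights to be positive, which is how a measure
    # can end up counting *against* a hospital, as the parent comment notes.
    scores = (X - pca.mean_) @ measure_weights
    print(measure_weights.min(), measure_weights.max(), scores[:3])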


I admire your charitable interpretation. I'd be fine with the rating system using automation to reduce the number of subjective decisions to be made - I thought the LVM approach was a good one when I first heard about it.

As the second article points out, even this is kind of crazy, and the implementors didn't seem to care at all about what the results were - for example, the imaging category is driven by a single measure related to abdominal CT scans.


>This sounds reasonable to me - assuming of course, the PCA components each have meaningful interpretations, and measure the degree of "good".

But they won't, unless a component happens to project near one of your original dimensions.

If the 7 components can only be understood as linear combinations of 60 factors, your experts still need to provide 60 scores.


The cynic in me suggests those errors are deliberate. Politics. Sorting the data should not change the results. Why would a measure of a good thing negatively influence the rating? I know... don't attribute to malice that which can be adequately explained by incompetence, and all that. But when it's government, politics has to be lurking.


I loved the final paragraph.

> Mind the Bullshit Asymmetry Principle, articulated by the Italian software developer Alberto Brandolini in 2013: the amount of energy needed to refute bullshit is an order of magnitude bigger than that needed to produce it. Or, as Jonathan Swift put it in 1710, “Falsehood flies, and truth comes limping after it.”


I've used the following quote on my profile since the beginning of time (of HN):

Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital. ~Aaron Levenstein

It's fundamentally the same idea, and it seems it could apply multiple times in this thread :p


one of the most valuable comments on here in a long time. thank you.


The UW class, Calling Bullshit in the Age of Big Data

http://callingbullshit.org/syllabus.html

They say video will become available.



This is an awesome wealth of videos! Check out the Data Visualization section for some Tufte-inspired lectures.

These guys really know their bullsh!t :)


Once you understand Simpson's Paradox and Anscombe's Quartet, you will simply never believe any statistics that anyone shows you, ever. In fact, you will probably never even believe your own calculations, and that's a good thing if it keeps you on your toes.
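
If you've never run into Simpson's Paradox, the classic kidney-stone numbers show it in a few lines of code (a self-contained sketch, nothing to do with the article's data):

    # Simpson's Paradox with the classic kidney-stone figures: treatment A wins
    # inside every subgroup yet loses in the pooled totals.
    groups = {
        "small stones": {"A": (81, 87),   "B": (234, 270)},
        "large stones": {"A": (192, 263), "B": (55, 80)},
    }

    totals = {"A": [0, 0], "B": [0, 0]}
    for name, g in groups.items():
        for t, (ok, n) in g.items():
            totals[t][0] += ok
            totals[t][1] += n
            print(f"{name:12s} {t}: {ok}/{n} = {ok / n:.0%}")

    for t, (ok, n) in totals.items():
        print(f"{'overall':12s} {t}: {ok}/{n} = {ok / n:.0%}")
    # A is better in both subgroups (93% vs 87%, 73% vs 69%),
    # yet B looks better overall (83% vs 78%). The aggregate hides the confounder.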


I only believe in statistics that I doctored myself


It is correct that you don't need a math degree to detect data b.s.

However, the suggested "red flag" method is only part of the solution. You learn to detect the red flags and you understand that, if red flags accumulate, you should start worrying about the truthfulness of what you are being presented with...

and then you hit a bullshitter or liar who, through skill or pure luck, starts validating things that you know are probably wrong but really, really hope to be true. And suddenly you yourself start to support the b.s. thesis because you really want it to be true. Then, what good do the red flags do you? You will actively try to invalidate them. They may stay objectively true, but in our poorly structured, limited way of thinking, they won't.

The true art of survival is to detect when something is pulling at your inner optimist. You need to learn to recognize when this guy wakes up. And when he wakes up, you need to assume that you are being cheated, that someone is actively trying to sell you something you wouldn't buy otherwise. Because if there is a person with malicious intent, all other ways of thinking will make him win. But if it's just coincidence, it doesn't hurt to protect yourself.

Defending yourself usually means ignoring your pride, ignoring logic, and simply making sure not to invest anything. Not your time, not your money; don't give signatures, don't stay on the phone call [1], drop the book.

So the true solution to b.s. is: learn to recognize when something pulls you in, and when you detect it, start defending against it by not giving in.

[1] i.e., https://www.reddit.com/r/personalfinance/comments/6ix0jy/irs...


> and then you hit a bullshitter or liar who, through skill or pure luck, starts validating things that you know are probably wrong but really, really hope to be true. And suddenly you yourself start to support the b.s. thesis because you really want it to be true.

There's a word for this kind of person: politician.


A friend of mine often expresses similar sentiments.

My normal reaction is to ask "what would you have instead of politicians?". He has problems forming a coherent answer.

What would your answer be?


The solution is ethical and moral people. Politicians or not, the greatest modern issue is that we don't value honesty and integrity as a society anymore. This means, across the board, we have this weird form of nuanced corruption where everyone lies without compunction just because it's accepted practice. This means we can't trust politicians, data analysts, data scientists (anyone who uses data to proclaim "the truth") or anyone who makes any claim whatsoever to "the truth."

Until, and unless, we as a people value integrity and honesty more highly, we should hold all "truth" to be suspect, and guilty until proven innocent.

I worked at a job where, as data scientists, we were discouraged from revealing what we found in the data, and compelled to produce from the data what our boss thought the answer should be. It was asinine. If you think you already know the answer, why consult the Oracle? Put your understanding into practice and find out yourself if you're right. Or, look at the data first with as little bias as possible, and build testable hypotheses from there. I think the tendency to take a hypothesis to the data is intended to reflect the scientific method, but that only works if you're open to whatever the answer may be. It doesn't work if you just keep pushing to slice the data in ever crazier ways trying to get it to validate your hypothesis.


Ethical and moral people fall prey to cognitive biases all the time.

It's not about honesty and integrity, if you (or others) don't ask [the right] questions.

We are biased by ideology, world view, experience (and history/genetics/family/friends/aesthetics), by long- and short-term (self-)interests, and so on. With money it's very common to spot the conflict of interest, the bias, when people give advice; with other things it's a lot harder.

That said, people by default are gullible, susceptible to persuasion, and so on. People are by nature social animals, easy to mislead, to influence, and so on. We are naive.

It takes training to spot the bullshit, even in our own thinking. (And we haven't even mentioned the psychopathologies that can also very seriously undermine our blossoming critical thinking by sheer force of emotions - anxiety/depression/impulsiveness/xenophobia/ethnophobia - or via a persistent insistence on fringe patterns - schizoid-type disorders, hallucinations, paranoia.)

> compelled to produce from the data what our boss thought the answer should be.

That's not asinine, that's misconduct. That's fraud.

So, all in all, the situation was never good. The Enlightenment never happened. It started to, but suddenly stopped.


Politicians are, by definition, shapeshifters. Or, at least, that's how it used to work.

Now they occupy narrowly defined silos where (the normally human) "flip flopping" (i.e., learning more and updating one's position) is considered a form of weakness. Imagine that. The one thing that enabled the species to survive and thrive is suddenly off limits.

What can go wrong?

Note: I'm not blaming politicians. It's a reflection of all of us.


The problem is we are voting for people while imagining ourselves voting for policies.


Yup. Or as I like to say: we elect the electable. Qualifications and capabilities are irrelevant. And then we wonder why they're so incompetent.

It's a sad state of affairs really.


> What would your answer be?

I would put "none of the above" on ballots, and if it wins, all the candidates are barred from politics or public office of any sort for life, and the election is re-run.


Might not give the outcome you expect. Spoiling of ballots is quite rare even in countries with compulsory voting.

UK

http://www.votenone.org.uk/spoilt-ballot-results-2015.html

1% spoiled ballot papers; one third didn't vote (fewer in 2017, as turnout was much higher).

Australia

http://www.bbc.co.uk/news/world-asia-23810381

Compulsory voting and 6% spoiled papers


But what I am proposing is a very different thing from "spoiling the ballot". Under my scheme there would have been no Trump vs Hillary nor Macron vs Le Pen, because the potential consequence of losing to "none of the above" and never being allowed to do politics again, forever, would act as a deterrent to sleazy candidates.


Er - not so sure but it would be fun to try.

UK is parliamentary, so you are voting for your local MP. People quite often vote for the actual member as well as (or despite) the party, so the sanctions could be tricky. There is also a constitutional principle about being able to stand in elections except under very narrowly defined conditions [1].

How about lotteries for local government - like jury service? Local government in UK is problematic as it is unpaid and often elections have very low turnout.

Not sure that I or the people of France would necessarily agree that M. Macron is a sleazy candidate. He absolutely represents something different, and you need to be careful designing systems where it becomes hard to change things. Podemos and SYRIZA were both extremely popular. They are now facing the reality of governing, hence winners and losers, and so the sheen is somewhat tarnished, but they still command a significant level of support.

[1] https://www.electoralcommission.org.uk/__data/assets/pdf_fil...


He was wildly unpopular for Macronomics - Le Pen was about the only candidate he could possibly have beaten, and Le People very much voted against her rather than for him.

I agree that this would be a constitutional change but the beauty of an unwritten constitution is that you can do that as circumstances change.


Good luck getting major changes to the Representation of the People act in the direction that you are suggesting through Parliament and the Lords.

PS: Macron's strange new party did rather well in the Assembly elections as well, remember.


But with very low turnout


For me this sentiment just means it's good to have a healthy skepticism of everything politicians say.


Yes. But nine out of ten times that data (if you will) comes pre-filtered by The Media. Pardon me, but most of them don't seem qualified to make me coffee. Yet they are the ones most responsible for crafting public opinion.

Perhaps the messenger should be (metaphorically) shot?


I suspect that public opinion is less craft-able than has been previously thought, or at least that is what I think is the takeaway from the UK EU referendum and the last couple of general elections (GE2015 and GE2017 in the jargon).

What 'qualifications' do you see as appropriate for election to (say) Parliament?


Perhaps. But my point is, The Media is VERY biased. And that does affect attention and opinion.

Take for example all this talk of Russia hacking the USA election. How many articles have you seen that ask: Wait. This happened under the previous administration. Why weren't they prepared? Why did the intelligence community seemingly fail again? But instead we get "...and today Trump tweeted..."

Or what about Comey? This is the same guy who used his position and power to try to influence the election (by dredging up HRC emails, etc.). Yet that's forgotten, and The Media speaks of him as completely trustworthy? Really? How's that?

I don't mean to bring up politics per se. But it does make it easier to point out that much (so-called) journalism is barely grade-school-level storytelling.

It's embarrassing. You can't have a true and honest democracy without a healthy and proper Fourth Estate.


Willingness to compromise is the qualification no one seems to value anymore. Yet isn't that the key to effective politics, in any country?

Second would be background/experience. For example, in the USA, we tend to elect lawyers. Then a science issue comes up and we flip out because they aren't science aware / science literate.

Well... duh. What did we think we were going to get? The point is, it's silly to vote for a cat and then complain that it doesn't bark. Yet that's what happens too often.


Calling BS on big data is really important, but this article is weak. The New Yorker should be doing better. Try Weapons of Math Destruction by Cathy O'Neil for a much more informed critique.

https://www.amazon.com/Weapons-Math-Destruction-Increases-In...


"The New Yorker should be doing better. "

Is that alternative you linked free reading on the New Yorker or some other site? Or are you saying people willing to pay for a better article/book/source/training can get it? That's almost always true.


I was saying that the New Yorker usually has very high-quality writing, but this article isn't up to that standard.


How to Lie with Statistics is a classic on this:

https://www.amazon.com/How-Lie-Statistics-Darrell-Huff/dp/03...


Sadly this book gets the description of statistical significance completely wrong. Not particularly surprising, given how unintuitive the reasoning behind p-values really is. In Jeffreys' words: "What the use of P implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred. This seems a remarkable procedure."


Also "how to lie with maps".


This is a pretty bad article.

Firstly, the course is called Calling BS in the age of Big Data. That's a big difference.

Secondly, Google fixed Flu Trends and that kind of undermines the article's whole thesis: http://www.nature.com/articles/srep12760

One might almost say I called bullshit on this article.


The article says that Google Flu Trends does worse than a "simple model of local temperatures". From a very quick, level-1 read [1], the paper you link to doesn't mention that simpler model. Instead, it compares Google flu to previous versions of itself.

I guess you can say that Google improved their flu model, but, "fixed"? Also, I don't see that the article's "thesis" is "undermined". Sorry about the scare quotes.

I mean that the article is taking small liberties to score a small point against Big Data™ (and who else best to score them against, other than Google?) but is that really enough to call bullshit on it?

I don't see anything misleading in the article is what I'm saying. So why "bullshit"?

___________

[1] Read the abstract and conclusions, eyeball a couple of tables, scan the rest, i.e. just enough to argue on the internet as if I know what I'm talking about.


> The article says that Google Flu Trends does worse than a "simple model of local temperatures".

Indeed, that is what the article says. It's bullshit though.

You are right, though that "Google improved their flu model". No model is ever "fixed" if that means 100% correct.

Flu trends worked well (much better than a "simple model of local temperatures") except in the 2009 flu season, when it missed the A/H1N1 pandemic. It was then modified, and these modifications seem to have caused it to estimate a pandemic in the 2012/13 season which didn't occur.

"simple model[s] of local temperatures" do work quite well as a baseline, but they don't pick up pandemics either. However, in that 2012/13 season it would have done better than flu trends. [1] is a good overview.

So this is a complicated topic. I have a research team working on this exact problem, and we'd love Google search data, because there is no doubt that it can and does work. But like all models it breaks down when something it hasn't seen before occurs.

My bigger problem is with the thesis of the article. I'd summarize my reading of that as "big data is BS", which is a more extreme form of their title "How to Call BS on Big Data".

But the course this is based on isn't that at all. It's about understanding how big data can be used to draw wrong conclusions, NOT that big data is BS in any way at all.

I think the course is a very important and useful thing. But what it is doing is dramatically different to what this article claims, and the way they use the implied authority of the course to support their "big data is BS" claim is what led me to say "bullshit".

[1] http://journals.plos.org/ploscompbiol/article?id=10.1371/jou...


Thanks for the clarification and it's good to hear you are speaking from experience with the kind of model being discussed (although I'd still like to know what that "simpler model" is exactly, or where it comes from anyway; but that's probably not for you -or even Google- to answer, since it's mentioned in the original article in the first place).

That said, I don't agree with you, in that I didn't read the article as saying that Big Data is BS by default. It's a short article and not terribly thorough but I didn't read a blanket condemnation in it.

Btw, I'm not sure why you and nickpsecurity are being downvoted to grey. I expected this strong disagreement to be reserved for personal attacks etc.


"No model is ever "fixed" if that means 100% correct."

One definition of broken for a proposed alternative to the status quo is if the alternative under-performs it. Kind of makes one ask why anyone would adopt it to begin with. There's a simple model using temperature that works pretty well. Google's solution is said to perform worse than that with more false positives. Google's isn't "fixed" or "working" until they show it outperforms the simple solution that works with similar error margin and cost.

In other words, it isn't good until people would want to give up the existing method to get the extra benefits or cost savings the new one brings.


> In other words, it isn't good until people would want to give up the existing method to get the extra benefits or cost savings the new one brings.

They do.

The current state-of-the-art methods used "in production" today all use Flu Trends data from Google[1], other forms of digital data[2], ensemble methods incorporating them all, or human-based "crowdsourced" forecasting[3].

> One definition of broken for a proposed alternative to the status quo is if the alternative under-performs it.

Which is not what happened here.

Here's the report into the 2014 CDC's Flu Forecasting competition: https://bmcinfectdis.biomedcentral.com/articles/10.1186/s128...

Note that there is no mention of the "simple mean temperature" model. That's because it isn't very useful. That model predicts flu increases in winter, and picks up minor variations because of weather patterns.

To simplify even further, you can average all the CDC flu data and use that as your prediction and on an average year you'll have a decently performing model.

This isn't useful as a forecast, because the people who need forecasts already know this.

Better models (e.g. SI, SEIR, Hawkes-process-based, etc.) can sometimes pick up epidemic or unusual conditions, but only after the conditions have changed. This is still useful, because there is a (best case) 2 week lag between ground conditions and CDC data being available.
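
For anyone unfamiliar with those, an SEIR model is just a handful of coupled ODEs. A minimal sketch (parameters are illustrative only, not fitted to any real season):

    # Minimal SEIR sketch. Parameters are made up, not fitted to real flu data.
    import numpy as np
    from scipy.integrate import solve_ivp

    def seir(t, y, beta=0.6, sigma=1/2.0, gamma=1/3.0):
        S, E, I, R = y
        N = S + E + I + R
        dS = -beta * S * I / N                 # susceptibles get exposed on contact
        dE = beta * S * I / N - sigma * E      # exposed become infectious after ~1/sigma days
        dI = sigma * E - gamma * I             # infectious recover after ~1/gamma days
        dR = gamma * I
        return [dS, dE, dI, dR]

    y0 = [9990, 0, 10, 0]                      # 10 infectious people in a town of 10,000
    sol = solve_ivp(seir, (0, 120), y0, t_eval=np.linspace(0, 120, 121))

    peak_day = sol.t[np.argmax(sol.y[2])]
    print(f"predicted peak around day {peak_day:.0f}, {sol.y[2].max():.0f} infectious at peak")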

Digital surveillance techniques (Flu Trends, Twitter data, etc) all push that data lag back.

This is incredibly useful for the people who need forecasts because it gives them lead time.

To understand this you need to consider the metrics. The most common metrics for flu forecasting are the "peak week" and the "number of people infected at peak". Sometimes the total number of people infected in a season is also reported.

Temperature-based models do really well on average at both these tasks, but they fail completely at picking the unusual seasons.

> Google's solution is said to perform worse than that with more false positives.

Google flu trends picked the 2009 epidemic season really well, but failed in the 2013 season (when it falsely picked an epidemic). On average that might make it worse than a temperature based model, but that is just bad selection of metrics.

It's like reporting average income when your sample has a billionaire: the metric is misleading.

If that isn't the perfect example of "bullshit" then I don't know what is.
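
The billionaire analogy in a couple of lines, for anyone who wants to see it (the numbers are made up):

    import statistics

    incomes = [45_000] * 99 + [1_000_000_000]     # 99 typical earners and one billionaire
    print(statistics.mean(incomes))               # 10,044,550: the "average" is absurd
    print(statistics.median(incomes))             # 45,000: the robust summary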

[1] http://www.healthmap.org/flutrends/about/

[2] http://delphi.midas.cs.cmu.edu/nowcast/about.html

[3] https://gcn.com/articles/2016/12/21/cdc-flu-predictions.aspx


" From a very quick, level-1 read [1], the paper you link to doesn't mention that simpler model. Instead, it compares Google flu to previous versions of itself."

This is actually a proven method of disinformation called false/incomplete comparison, which advertisers use to sell products. I'm not accusing Google of doing that so much as saying that scoring a new tech against a defective one to say something about the new tech in general should be dismissed by default, since it's a broken comparison of the same kind used in fraudulent advertising. A.k.a., it's bullshit.


Google Flu Trends is no longer published and hasn't been since 2015. Fixed? More like improved and then shuttered.

https://www.google.org/flutrends/about/


Yes. Not sure if you are agreeing or disagreeing with me since the claim was that it didn't work.


There was some discussion on this article ~20 days ago: https://news.ycombinator.com/item?id=14476474


Back in undergrad, one of my toxicology courses spent a full class on just this topic (calling bullshit on scientific studies). We went through 5 different papers from prestigious journals and identified the issues that make conclusions shaky.

Fascinating stuff. Now whenever I see some extraordinary claim in a paper I automatically assume it's wrong until proven otherwise. Sadly I'm often correct.


In my CS MSc program, my research group had a weekly paper reading group. This seems to be a common thing. What I'm not sure about is how common the general outcome was. We had one faculty member in the group who was excellent at identifying methodology errors and raising them as discussion points. I'd guess that for about half of the papers we read, he was able to find something to pick on. Sometimes small, sometimes enough to bring the whole paper into question. A great learning experience that helped me hone my bullshit detector.


Any resources for learning the Fermi estimation techniques listed there? Seems like a collection of complementary skills, each of which could be improved:

memorizing useful facts, selecting facts that lead to a meaningful estimate, the mental math to compute the final result

https://en.wikipedia.org/wiki/Fermi_problem


Mainly the list you stated.

Practicing basic arithmetic and judicious application of the distributive property (much like decomposing complicated problems into smaller subproblems) will take one very far in this sort of thing.

I was introduced to dimensional analysis in my high school physics class. We generated an expression off by just a constant for some property (which I don't recall) of a large scale dust cloud simply by identifying pertinent quantities (e.g. density, the classical gravitational constant) and resolving the powers each quantity must have in order to yield the correct units (corrected due to below comments; thanks) of the property. It made an impression on me, and I used the technique often as a guesstimate to "motivate" or provide a calibration for a solution to various problems all the way through grad school. It's not infallible, and can even be wildly misleading, but it's a fantastic tool.


Good point. Dimensional analysis is a great "space" to traverse to get answers. It's a great grounding for your thinking, in addition to helping you get to within the right order of magnitude.

I suppose I was wondering if there are any good drills, exercises or puzzles to help internalize these skills. Instead of a daily crossword, maybe there's a daily estimation puzzle somewhere.


However, dimensional analysis is for when you need to figure out the formula.

"Estimating" is when you have the formula and fill it with estimates to obtain a combined estimate.


Could you elaborate on this? By "resolving the powers" do you mean magnitude of interacting forces? And by correct dimensions do you mean spatial dimensions?


By dimensions he means units. So, taking into account the units of density, the gravitational constant (and any other pertinent quantities), and the units of the quantity you are calculating, you can derive an approximate formula just by looking at it and saying: okay, this unit needs ^2, this one needs ^-1, and this one needs ^-3 for the end units to work out.

https://en.wikipedia.org/wiki/Dimensional_analysis
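
You can even mechanize the "resolve the powers" step. Here's a small sketch that recovers what may well be the dust-cloud example above, the free-fall timescale t ~ 1/sqrt(G*rho), purely from the units (that's my guess at the property, not necessarily the one the parent had in mind):

    # Solve for exponents a, b such that G^a * rho^b has units of time.
    # Units are tracked as exponents of (length, mass, time).
    from sympy import symbols, solve

    a, b = symbols("a b")
    G      = (3, -1, -2)   # m^3 kg^-1 s^-2
    rho    = (-3, 1, 0)    # kg m^-3
    target = (0, 0, 1)     # plain seconds

    eqs = [a * G[i] + b * rho[i] - target[i] for i in range(3)]
    print(solve(eqs, (a, b)))   # {a: -1/2, b: -1/2}  =>  t ~ (G*rho)**(-1/2)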


I teach a Quantitative Methods course and in it, I have students read through the Guesstimation book by Weinstein and Adams (listed on that wiki page). There is also a second volume.

They do 11 blog entries each on modifying a question from each chapter and then computing it out.

I also have them watch TED talks and come up with guesstimation critiques of the talks, also in the form of 11 blog entries.

These are highly effective exercises. I recommend doing something similar. If you can, find some others who are interested in doing the same. Reading each other's questions and answers is very valuable in detecting mistakes and comparing your own technique to theirs.

Fundamentally, you just have to do it.

The facts you need to memorize are surprisingly minimal. The Guesstimation book lists some good suggested ones. Often you will find you already have some sense of a number. By taking reasonable boundaries on either end of the plausible range and then taking the geometric mean, you can get a decent approximation.

The mental math is also fairly minimal. It just requires two digit arithmetic for the most part along with being comfortable with powers of 10. There are books on mental math, but I think practice is sufficient for two digits: http://arithmetic.zetamac.com
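
The boundaries-plus-geometric-mean trick from above, spelled out (the example numbers are made up, Fermi-style):

    # Guesstimation: pick plausible lower and upper bounds, take the geometric mean.
    from math import sqrt

    lower, upper = 10, 1_000        # e.g. "surely more than 10, surely fewer than 1,000"
    estimate = sqrt(lower * upper)
    print(estimate)                 # 100: halfway in order-of-magnitude terms, not arithmetic terms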


Fantastic! Thanks!


The two books by Sanjoy Mahajan, freely available as pdfs from the publisher, teach precisely this subject: https://mitpress.mit.edu/authors/sanjoy-mahajan




In engineering school, we learned BOEC, useful in the time of slide rules. That is Back Of The Envelope Calculation.


"Calling B.S." seems to be a snark way of saying "Applying the scientific method".


There's a famous saying, I forget who said it (it's usually attributed to W. Edwards Deming): "In God we trust, all others must bring data."

This suggests "data," as a thing, is infallible. Or that data holds _The Truth_.

Problem is, as a data scientist, I've become very skeptical that either is true. Not that data is useless, but mostly that if there's unequivocal truth in data it will remain unfound because those searching for it operate under such profound bias that they will be incapable of either a) finding the truth, or b) recognizing it.

The better quote, which can be broadly applied to anything data-related, is: "All models are wrong, but some are still useful."

Usually, I look at data as presenting only one side of the story. And models as hopefully useful, if used with caution. The proof is always in the pudding: do actions derived from our understanding of the data yield results? If "yes", then our understanding of the data contains some difficult-to-quantify level of truth. Do our classification, clustering, and prediction techniques work? If "yes", then our models reflect some of the truth (never all of it).

In my six years since college, and going on three as a data scientist, I've become convinced that intentionally (or not) a great deal of analysis and modeling (including machine learning models) is fundamentally wrong. Sometimes because the practitioner, with the best of intentions, screwed up (all too easy to do), and often because the practitioner used the data to tell whatever story they wanted to. You can usually manipulate any given data set into giving the answer you, or your boss, or your boss's boss, thinks is the "right" answer. And even if you come to the data with the purest intentions you'll often find "the truth"--only to have application and time prove it wrong.

My assessment: data is slippery, and often like wrestling snakes. Or, it's the modern version of panning for gold. We can make ourselves, or the business, much richer when we find those rare nuggets within the data which prove, with application and time, to reflect some measure of truth. The proof is always in the pudding.


Only semi-related but nonetheless interesting:

Calling BS on people claiming to do big data. Ask them how big exactly their data is and then point out that it fits on one machine, so it's not "big". This is independent of their usage of Hadoop.


There's even a standard model for that kind of thing:

http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...
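
And the back-of-the-envelope check itself takes seconds (hypothetical numbers):

    # Back-of-the-envelope check for "is it really big?"
    rows = 500_000_000              # half a billion events
    bytes_per_row = 200             # a few dozen mostly-numeric columns
    size_gb = rows * bytes_per_row / 1e9
    print(f"{size_gb:.0f} GB")      # ~100 GB: fits on one NVMe drive, nearly fits in RAM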


Strange memories on this nervous night in Big Data. Has it been five years, six? It seems like a lifetime, the kind of peak that never comes again. Data science in the middle of '10 was a very special time and place to be a part of. But no explanation, no mix of words or histogram or memories could touch that sense of knowing that you were there and alive, in that corner of time in the world, whatever it meant.

There was madness in any direction, at any hour, you could strike sparks anywhere. There was a fantastic universal sense that whatever we were doing was right, that we were winning.

And that, I think, was the handle. That sense of inevitable victory over the forces of Old and Evil. Not in any mean or military sense, we didn't need that. Our energy would simply prevail. We had all the momentum, we were riding the crest of a high and beautiful wave.

So, now, less than five years later, you can go on a steep hill in Buzzwordville and look west. And with the right kind of eyes you can almost see the High Water Mark. That place where the wave finally broke and rolled back.


Guesstimates are fantastic; this is one of the first things I always wanted to instill in my students (physics).

That is, until you hit exponential events, where guesstimating the exponential part leads to catastrophes. It is worth keeping in mind that the more linear something is, the more useful the guesstimate will be.
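
A quick way to see why (illustrative numbers only):

    # A modest relative error is harmless on a linear quantity but catastrophic in an exponent.
    linear_guess, linear_true = 1.2 * 50, 50              # 20% off a linear estimate
    expo_guess, expo_true = 2 ** (1.2 * 50), 2 ** 50      # the same 20% error in an exponent

    print(linear_guess / linear_true)   # 1.2x off
    print(expo_guess / expo_true)       # 2**10 = 1024x off, three orders of magnitude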


I look forward to their piece on blockchain.


For an interesting philosophical take:

http://journals.sagepub.com/doi/pdf/10.1177/2053951716664747


A great book along these lines is called "The Halo Effect". More focused on business books than anything, but it's along the same lines.


Misleading clickbait title - the article delivers on "how to call BS on statistical claims" (and does that well), but it has nigh-on nothing to do with Big Data.


I think this is part of a trend on HN called "defining clickbait down." Clickbait is deliberately withholding information or writing a headline in such a way as to prompt a click to figure out what the article is about. "10 Celebrities You Never Knew Were Gay, #8 Will Surprise You!" And at least one of the celebrities is Ellen DeGeneres. "You Won't Believe What Happens When This Homeless Man Doesn't Have Money For Food!" Someone buys him lunch. That's clickbait. The title to this is just a title. It's not clickbait.


Clickbait is deceptively framing a piece (article, video, audio) through its description and/or graphics to deliberately create anticipation or a misunderstanding.

Mischaracterising a story about A to be about some topical interest B would fall into that category.

Among the more annoying forms (one I disagreed with dang on a couple of days ago) is using a deliberately vague title (the book by the author in this case was far more clearly described).

One I particularly dislike is using more words to avoid telling you what something is about than actually telling you would have required.

Any title telling me how I should feel or respond also gets an automatic flag: "Shocking..." "You will..." "You won't...", etc. Fuck you and stop telling me what to do.


Clickbait is whenever your title has less information than your article.


If the title contains all the information in the article, that's a pretty bad article.


This Headline Only Contains Partial Information, You Won't Believe What The Article Says


I'd add to your definition and include any time an author adds a buzzword that is irrelevant or not mentioned in the article. Then OP's claim that this is clickbait would be true: it's about statistics and not big data.

So "How to Call B.S. On Big Data: A Practical Guide" is clickbait, but "How to Call B.S. On Statistical Claims: A Practical Guide" is not.


I disagree with this definition, and even if I did agree I think this doesn't describe the article. As just one example:

> Beware of Big Data hubris. The Google Flu Trends project, which claimed, with much fanfare, to anticipate seasonal flu outbreaks by tracking user searches for flu-related terms, proved to be a less reliable predictor of outbreaks than a simple model of local temperatures. (One problem was that Google’s algorithm was hoodwinked by meaningless correlations—between flu outbreaks and high-school basketball seasons, for example, both of which occur in winter.) Like all data-based claims, if an algorithm’s abilities sound too good to be true, they probably are.

If that's not a claim about Big Data, then what is?


It seems a bit unfair to call the title clickbait when the article is about a course named "Calling Bullshit in the Age of Big Data".


Why? It provides a few examples of big data failures such as "Google Flu Trends" to support their hypothesis.


As with many other publications, the author of the story had nothing to do with the titling of it.


The underlying course has a video series on Youtube with a very very close proximity to the title: https://www.youtube.com/playlist?list=PLPnZfvKID1Sje5jWxt-4C...


I too thought it would be more specific. Nowhere does it mention sample selection bias, the scourge of most "big" datasets I've been offered. Quantity can't always compensate for lack of quality.


The syllabus of the course contains sections clearly regarding big data, and names them so.


Thanks


A BS article about calling BS on BS. Also, this comment is total BS.




