
Mathematician Says Big Data Is Causing a ‘Silent Financial Crisis’ - rbanffy
http://time.com/4471451/cathy-oneil-math-destruction/
======
ThePhysicist
There is a lot of interesting research on the societal impact and the
(potential) problems caused by the use of large-scale data analysis and
artificial intelligence, e.g. by Kate Crawford at MSFT Research.

Last year I gave a talk on this at 32C3 where I tried to elucidate some of the
problems that can pop up when using data analysis to automate things that were
previously done by humans:

[https://www.youtube.com/watch?v=iRY9IceaVig](https://www.youtube.com/watch?v=iRY9IceaVig)

Personally I see a lot of potential in data analysis & artificial
intelligence, but like all technologies, they pose significant risks as well.
What we really need, therefore (IMHO), is to teach ethics to people who work
with data, and to establish organizations and methods that ensure data
analysis isn't used to do harm (deliberately or not).

~~~
Noseshine

> What we really need therefore (IMHO) is to teach ethics to people

The issue with almost all organized teaching (as opposed to the teaching you
don't even know you are doing, e.g. to your children, every day) is that it
only targets conscious thinking. Do you already see the problem with teaching
"ethics" (and empathy)? It's the wrong brain area. People will ace the tests,
but how they actually act won't change. Another big reason why teaching
ethics (the way it is usually done) fails is that much of behavior is
environment-driven. So you can teach someone all you want; if you then place
them in a cut-throat, competitive, "result-driven" environment, you can watch
empathy and ethics quietly sneak out the back door.

~~~
tunap
Being deliberately obtuse goes a long way toward delaying a logical ending.
Flash back to 2004, when the same type of debate/affirmations were posited in
response to Dr Mandelbrot's apocalyptic warnings[0][1] of the imminent
subprime crisis.

Instead of learning their lesson, the players continue to 'print money out of
thin air'[2] and deny any fault for the last collapse. The rationale seems to
be that 2007 was an extreme culmination of events that wouldn't happen again
in millions of years. The problem is that in a universe of infinite variables,
an infinite number of combinations exist and could manifest any time a given
set of vulnerabilities and unanticipated outlier events present themselves
concurrently.

[0] The (Mis)Behavior Of Markets 2004

[http://www.goodreads.com/book/show/665134.The_Mis_Behavior_o...](http://www.goodreads.com/book/show/665134.The_Mis_Behavior_of_Markets)

[1] 2006 reminder

[http://www.ft.com/cms/s/2/5372968a-ba82-11da-980d-0000779e23...](http://www.ft.com/cms/s/2/5372968a-ba82-11da-980d-0000779e2340.html)

[2] The Quants

Great Monday-morning-QB analysis of what went wrong & why it will be
repeated (hint: greed + ability + hubris)

[http://www.goodreads.com/book/show/7495395-the-quants](http://www.goodreads.com/book/show/7495395-the-quants)

edit: fixed links

------
bko
> [algorithms] decide who gets access to credit and who pays higher insurance
> premiums, as well as who will receive online advertising for luxury handbags
> versus who’ll be targeted by predatory ads for for-profit universities.

As opposed to what? Subjective decisions made by humans based on biases and
personal preferences with no semblance of reason or fairness?

> for example, who lives in an area targeted by crime fighting algorithms that
> add more police to his neighborhood because of higher violent crime rates
> will necessarily be more likely to be targeted for any petty violation,
> which adds to a digital profile that could subsequently limit his credit,
> his job prospects, and so on.

So having more cops in a high-crime neighborhood is somehow a bad thing,
because the presence of more police actually confirms the fact that there is
more of a need for police in this neighborhood?

> this technology was actually siloing people into online gated communities
> where they no longer had to even acknowledge the existence of the poor,

So targeted advertising to the needs of people is bad because... I can't even
make sense of this one.

~~~
guelo
> the presence of more police actually confirms the fact that there is more of
> a need of police in this neighborhood?

If we were to enforce all laws we would need massively more police in every
neighborhood. When police are sent to flood a poor neighborhood they end up
ticketing poor residents for a bunch of "violations" that they would also find
if they went to the rich neighborhoods. But if they ever actually did issue
all those ticky-tack tickets in the rich neighborhoods residents would be up
in arms contacting their city council members and it would end quickly. In the
poor neighborhoods nobody ever hears about it and the residents that can least
afford it take the financial hits.

~~~
burgreblast
> they end up ticketing poor residents for a bunch of "violations" that they
> would also find if they went to the rich neighborhoods.

I'm not sure what data you have to support this. But check out violent crime
rates. Larceny-thefts. Burglary/property theft. Motor vehicle theft.
Aggravated assault. Robbery (muggings, stickups, etc.).

These very popular crimes aren't so popular in nicer neighborhoods. Rich
neighborhoods have their own problems, but let's not appeal to the spirit of
lawlessness (cops are bad, rules are bad, unfairly arrested, etc.)

------
Chris2048
> add more police to his neighborhood because of higher violent crime rates
> will necessarily be more likely to be targeted for any petty violation

He'll also be less likely to be the victim of violent crime in an area with a
proven record for violent crime. At the price of stricter enforcement of
'petty' violations, it seems like a good trade-off.

~~~
s_q_b
That's a solid theoretical argument. However, it is contradicted by the
empirical evidence.

What usually happens is that the police issue fines so frequently that they
come to be seen by the populace as an occupying force. In turn, the populace
withholds information from the police, and both violent crimes and petty
violations rise.

Police statistics make quantitative conclusions difficult, since they are so
often fudged for political purposes. They rise when the police need more
funding, and fall when they need political capital. Murder rates are generally
considered the only reasonably accurate figures.

On the ground, however, the reality is very clear.

~~~
ar0
I would suggest, however, that this is not an argument for less police but
instead an argument for fewer laws that are only infrequently and
inconsistently enforced. It is these laws that give police the opportunity to
"issue fines so frequently" and give affected populations (rightfully or not)
the impression that they are being singled out. (Laws against loitering or
the consumption of alcohol in public come to mind, and to a lesser degree
probably also the criminalization of marijuana consumption.)

~~~
dsfyu404ed
Right... but those laws only ever get revoked when they hit the rich with
them. And for every one you revoke, there are ten more wealthy/connected
people with sob stories that could have been prevented if X were illegal who
are contacting their elected representatives.

I agree with you in principle, but IMO society on Mars will still be having
this debate (I hope I'm wrong).

------
ramblenode
An interesting point touched on in the article is the extent to which feedback
from a predictive algorithm contributes to its later predictions being biased.
If police are targeting a specific neighborhood because it is flagged as being
a high crime area, then there is likely to be a higher arrest rate for
innocent people too, i.e. a higher false positive rate. Unless there is a good
way of differentiating false and true positives (which seems unlikely in a
justice system heavily focused on plea bargains), then a cycle of over-
policing may ensue. A good predictive algorithm would need to account for this
type of temporal autocorrelation.
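
A toy simulation of this loop (all numbers invented): two neighborhoods with
identical true crime rates, where patrols are reallocated in proportion to
recorded crime and recorded crime scales with patrol presence. Mildly
over-weighting apparent hot spots is enough to make the loop run away.

    import numpy as np

    # Two neighborhoods with equal underlying crime; B starts slightly
    # more policed. Allocation over-weights apparent hot spots
    # (exponent > 1), so the initial difference compounds every period.
    true_rate = np.array([10.0, 10.0])
    patrols = np.array([1.0, 1.2])
    for period in range(20):
        recorded = true_rate * patrols * 0.1     # recording tracks patrols
        weights = recorded ** 1.5                # over-weight "hot" areas
        patrols = 2.2 * weights / weights.sum()  # fixed total patrol budget
    print(patrols)  # ends near [0.0, 2.2]: B absorbs nearly the whole budget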

~~~
IncRnd
One could legitimately say that learning of patterns is the process of bias
creation.

------
cs702
On a related note, a Princeton professor and some colleagues just published a
blog post and a research paper documenting that "language contains human
biases, and so will machines trained on language corpora:"
[https://news.ycombinator.com/item?id=12356111](https://news.ycombinator.com/item?id=12356111)

------
dade_
A wide array of concerns, but I stopped reading when the article explains that
algorithms are going to keep people from being aware that poor people exist.
If the only awareness someone has of poverty is from Internet ads or any
other content, then they really don't know anything about it at all. This
sounds more like the people who protested computers in the '60s because they
were used for war.

~~~
sitkack
You should probably keep reading.

------
pzh
I've always taken it for granted that discrimination and inequality would be
the natural result of any optimization/ML algorithm, especially when
maximizing for profit or any other utilitarian goal. I wonder why people would
be surprised by this logical outcome given that it has been long known that
statistical discrimination is profitable, and companies rarely have incentives
to improve equality and diversity.

------
zeveb
> Unlike the WMDs that were never found in Iraq

They _were_ found: [http://www.military.com/daily-news/2015/04/03/army-seeks-to-...](http://www.military.com/daily-news/2015/04/03/army-seeks-to-identify-troops-veterans-exposed-to-chemical.html)

~~~
IncRnd
We knew that WMD were in Iraq, prior to finding any, because the WMD came from
outside Iraq.

------
anoplus
Big data has the potential to create either unprecedented equality or
unprecedented inequality. Transparency is a key word in this post (I often
Ctrl+F to see how many matches for "transparen" there are in posts like
this).

When it comes to software, one policy to make things more transparent is open-
sourcing. The question is opening what and to what degree. I feel strongly
that organizations today could and should be much more open and are instead
absurdly opaque.

It's our responsibility as a society to talk about transparency as key for
trust and equality.

~~~
chendies
Transparency of pricing models does not immediately result in better outcomes
for consumers. In some cases, it may actually result in _worse_ outcomes.

Consider an industry with few companies, like rare-metal mining. If they all
chose to use the same open-source pricing model, so that any change to the
model was reflected by every company, the optimal pricing model would be to
price at the monopoly level -- to raise the price to where it would be if
there were no competition and a single firm served the marketplace.
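
A worked toy example of the shared-model outcome (linear demand, all
parameters invented):

    # Linear demand q = a - b*p with marginal cost c.
    # Competition drives price toward marginal cost; a single shared
    # pricing model maximises joint profit (p - c) * (a - b*p), which
    # peaks at the monopoly price.
    a, b, c = 100.0, 1.0, 20.0

    competitive_price = c              # price competed down to marginal cost
    monopoly_price = (a / b + c) / 2   # from d/dp[(p - c)(a - b*p)] = 0

    print(competitive_price, monopoly_price)  # 20.0 vs 60.0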

In the case of lending, however, different firms have different levels of risk
acceptance. Many firms (big banks) avoid high-risk borrowers. A few firms
(pay-day lenders) are willing to lend to those borrowers, but only at a high
interest rate. Somewhere in the middle is a collection of scrupulous lending
opportunities.

Any regulation that would restrict the acceptable interest rates to a narrow
band would reduce not just the (unscrupulous) loan sharks but also the banks
lending to high-risk customers.

This would necessarily result in _loss of borrowing ability_ to those labeled
"high-risk".

Is this a desirable outcome? Maybe, maybe not. But it should certainly be
discussed.

~~~
anoplus
It's true that openness isn't free of side effects. I think a mining company
that has exclusive possession of a public resource should be rewarded for the
value it creates (mining), not for the possession itself (the natural
resources). A government should dedicate natural resources to public welfare,
which is ultimately achieved through openness. A resource-greedy mining
company would rather dispense with transparency in this case.

Regarding the borrower problem: this is the common fear that openness will
reveal one's weaknesses. I believe society will invent new alternatives to
support the new needs. Again, openness seems positive overall, despite the
new challenges it presents.

------
jchrisa
Disallowing the use of race as a signal in machine learning / ad targeting
could help intercultural awareness. Here's the last paragraph of the article:

"""

O’Neil also proposes updating existing civil-rights-oriented legislation to
make it clear that it encompasses computerized algorithms. One thing that her
book has already made quite clear – far from being coolly scientific, Big Data
comes with all the biases of its creators. It’s time to stop pretending that
people wielding the most numbers necessarily have the right answers.

"""

~~~
nl
_Disallowing the use of race as a signal in machine learning_

That sounds like a great idea, but it isn't as simple as it appears. Most
systems won't have "race" directly encoded as a feature, but that is
insufficient.

See slide 22 from [1], further in-depth discussion in [2], or just this
feature (which many systems could automatically discover):

    Feature6578 = Loc=EastOakland && Income<10k

I don't know what the solution to this is, but it's a pretty hard problem, and
even the best intentions are insufficient.
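
A minimal sketch (synthetic data, illustrative feature names) of why simply
dropping the race column is not enough; correlated proxies let a model
recover it:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Race is never given to the model as a feature, but the two
    # proxy features below are correlated with it by construction.
    rng = np.random.default_rng(0)
    n = 20_000
    race = rng.binomial(1, 0.5, n)
    location = rng.binomial(1, 0.2 + 0.6 * race, n)  # proxy feature 1
    income = rng.normal(50 - 15 * race, 10, n)       # proxy feature 2

    X = np.column_stack([location, income])
    clf = LogisticRegression().fit(X, race)
    print(clf.score(X, race))  # well above 0.5: race is recoverable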

Delip proposes in [1] to introduce a "fairness constraint" where the
probability of a favourable outcome in the majority class is the same as in a
protected class. This sounds sensible, but it makes optimising systems much
harder.
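
A minimal sketch of one such constraint applied as a post-processing step
(synthetic scores; demographic parity is assumed to be the chosen criterion):

    import numpy as np

    # Synthetic model scores for a majority and a protected group.
    rng = np.random.default_rng(0)
    scores_major = rng.normal(0.60, 0.15, 10_000)
    scores_prot = rng.normal(0.52, 0.15, 3_000)

    target_rate = 0.4  # same favourable-outcome rate for both groups

    # Per-group thresholds chosen so each group is approved at the same
    # rate, regardless of how the score distributions differ.
    thr_major = np.quantile(scores_major, 1 - target_rate)
    thr_prot = np.quantile(scores_prot, 1 - target_rate)

    print(np.mean(scores_major > thr_major))  # ~0.4
    print(np.mean(scores_prot > thr_prot))    # ~0.4

The cost is the one noted above: the constrained thresholds no longer
maximise raw predictive accuracy.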

[1] [http://deliprao.com/archives/129](http://deliprao.com/archives/129)

[2]
[https://www.chrisstucchio.com/blog/2016/alien_intelligences_...](https://www.chrisstucchio.com/blog/2016/alien_intelligences_and_discriminatory_algorithms.html)

~~~
mseebach
The problem is deciding which features are reasonable to include. Yes,
location and income can probably predict race to a certain precision, and if
you're building a race-detector-by-proxy, you'd want to include those. But
what if you're evaluating eligibility for a mortgage? Certainly location and
income are critical. A black person in East Oakland with an income below $10k
_is_ going to struggle with a mortgage, but so is a white person with those
characteristics (and a black person in Connecticut making $250k is going to do
fine and so is a white person). But if more black people than white
legitimately are in a situation where they might struggle to service a
mortgage, more black people than white would get flagged as such by an
objective algorithm. Does that make the algorithm racist? The slide seems to
imply so (although without speaker-notes it's difficult to ascertain).

The flip-side is that under a human-biased system, the first white person
might get the mortgage based on being white, and the second black person might
be denied. That would be reversed under an algorithmic system only considering
legitimate factors, and surely that's progress?

~~~
yummyfajitas
The problem is actually different.

Race is often very predictive. Blacks don't perform worse simply because they
have income < $10k. Blacks perform worse on many measures even if you hold all
that stuff equal. Here are a few of the many studies reproducing this:

[http://ftp.iza.org/dp8733.pdf](http://ftp.iza.org/dp8733.pdf)

[http://www.mindingthecampus.org/2010/09/the_underperformance...](http://www.mindingthecampus.org/2010/09/the_underperformance_problem/)

[https://randomcriticalanalysis.wordpress.com/2015/05/16/on-c...](https://randomcriticalanalysis.wordpress.com/2015/05/16/on-concentrated-poverty-and-its-effects-on-academic-outcomes/)

[https://randomcriticalanalysis.wordpress.com/2015/11/22/on-t...](https://randomcriticalanalysis.wordpress.com/2015/11/22/on-the-relationship-between-negative-home-owner-equity-and-racial-demographics/#my_analysis)

If you were right, then the machine learning algorithm would quickly determine
that race doesn't matter. Redundant encoding, direct encoding, etc would be
irrelevant - the algorithm would ignore it as noise.

The problem is that the algorithm is supposed to uncover hidden patterns that
predict loan default. But then there is one _specific_ hidden pattern that is
socially and legally taboo to detect, yet also highly predictive.

(Incidentally, if you can solve this problem in lending, it's easily a unicorn
startup.)

------
jokermatt999
I'm surprised the term "disparate impact" hasn't come up. I'm not too well
versed in anti-discrimination law, but it seems like this is the term for
exactly what they're trying to avoid.

[https://en.wikipedia.org/wiki/Disparate_impact](https://en.wikipedia.org/wiki/Disparate_impact)

------
carlmcqueen
Jer Thorp actually has a tool called Floodwatch[1] that keeps an eye on your
targeted ads and shares them at the meta level to try to hold advertisers to
the rules. Sadly, for it to work you have to browse the internet with no
ad-blockers.

[1] [https://floodwatch.o-c-r.org/](https://floodwatch.o-c-r.org/)

------
supergirl
I find it insulting when statements start with "Mathematician says", "Experts
say". Whoever does this thinks little of his readers. They count on that
prefix to make their story more convincing. Sadly, it probably works.

------
matteuan
"she shows how the algorithms [...] are based on exactly the sort of shallow
and volatile type of data sets that informed those faulty mortgage models in
the run up to 2008" so the fault is of the technology, right?

~~~
gaius
Well, there is a paradox here: if the models were faulty, then why was every
bank scrambling to offload the mortgages to someone else? If the models said
the mortgages were sound, why didn't the banks want to hold onto them as a
cash cow? Could it be that the data spoke but no one wanted to hear what it
said?

~~~
lucozade
Depends on which model you mean.

The banks were holding the higher tranches of the securitised mortgage pools.
While those pools had traditional levels of defaults, the tranches held their
expected value. As the defaults accelerated beyond historic levels, the
tranches plummeted in value. In both cases the same model said the instrument
had/didn't have value. As these instruments are fair-value accounted, the
fact that they were still bringing in cash didn't help any.
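
A toy two-tranche waterfall (all numbers invented) shows the mechanics: the
senior tranche holds its value while defaults stay near historic levels and
only collapses once they accelerate far beyond them.

    # Pool of 100 with a 10-unit equity tranche that absorbs losses
    # first; the senior tranche is only hit once equity is wiped out.
    def tranche_values(pool=100.0, equity=10.0, default_rate=0.05,
                       recovery=0.4):
        losses = pool * default_rate * (1 - recovery)
        equity_left = max(equity - losses, 0.0)
        senior_left = (pool - equity) - max(losses - equity, 0.0)
        return equity_left, senior_left

    for rate in (0.02, 0.05, 0.15, 0.30):
        eq, senior = tranche_values(default_rate=rate)
        print(f"defaults {rate:.0%}: equity {eq:5.1f}, senior {senior:5.1f}")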

However, as the mortgage market started to go south, the default models
underestimated the level of defaults, particularly in the equity release
market. This underestimation was endemic in the industry at the time, from the
banks to the regulators and ratings agencies.

There is, of course, a very strong argument that the ignorance was willful:
everyone was on quite a nice earner and reacted late to the evidence as it
started appearing.

So there wasn't really a paradox. The banks were holding tranches while the
expected defaults were within historic levels. Once it became clear that that
wasn't a good predictor anymore, they tried to limit the effect but it turned
out that so many participants were in the same boat that there weren't enough
buyers.

I have to say I'm not entirely sure what this has to do with ad analytics but
I'm not writing a book on the subject so maybe that's not too surprising.

------
tboyd47
Just like programs only do what they're coded to do, algorithms only learn
what's already in their dataset. It's up to data scientists and analysts to
scrub their datasets of potentially discriminatory data. If you're
contractually obligated to do otherwise, you should either find another job or
do the right thing anyway. If you choose the latter and someone takes action
against you, then you should take it to court.

I was in a situation like this in my own work and I regret not standing up for
the right thing now...

------
dgudkov
The article emphasizes the fact that data-driven decisions amplify not only
the efficiencies of modern society but also its inefficiencies, which is
probably true of any technology.

------
sitkack
If we train a mortgage engine on redlining [0] data, it will then redline.
Algorithms are not somehow "fair" just because there isn't directly a human in
the path. They are "as human as we are", but only by proxy and at much lower
fidelity.

Not to Godwin the thread... but big data has been used for immoral purposes
since, well, big data [1].

The even larger danger is that people who don't know better will simply
"train a model" and, having no overt ill will, will still create a biased
algorithm.

[0]
[http://www.encyclopedia.chicagohistory.org/pages/1050.html](http://www.encyclopedia.chicagohistory.org/pages/1050.html)

[1]
[https://en.wikipedia.org/wiki/IBM_and_the_Holocaust](https://en.wikipedia.org/wiki/IBM_and_the_Holocaust)

~~~
yummyfajitas
This is simply not true. If we train a mortgage engine on redlining data, it
will only reproduce the red line if it's predictive. I.e., the algorithm will
only reproduce the red line _if the red line is statistically valid_.

I discuss this in detail here, with numerical examples:
[https://www.chrisstucchio.com/blog/2016/alien_intelligences_...](https://www.chrisstucchio.com/blog/2016/alien_intelligences_and_discriminatory_algorithms.html)

The fact is that machine learning takes biased inputs and produces unbiased
outputs, and this is a pretty normal occurrence. If I tell you that my
algorithm to target advertisements treated `displayedInterest x isMobile` with
15% more weight than `displayedInterest x isDesktop` because it corrected for
bias induced by latency on mobile connections, you'd think nothing of it.

~~~
SamBam
> The fact is that machine learning takes biased inputs and produces unbiased
> outputs

That is not necessarily true, because it assumes that the biased inputs have
no influence on the outputs.

Suppose the training data is from a bank that has historically given higher-
interest loans to minorities. Those loans will default more often, and so the
algorithm could "correctly" conclude that minorities default more often.

Cause and effect are very hard for an algorithm to distinguish when it's the
clustered _input_ variables that are linked together for some opaque reason.
The fact is, both statements in this example would be true: "higher interest
is correlated with more defaults" and "minority borrowers are correlated with
defaults." The algorithm doesn't know one from the other.

Sure, more training data could help the algorithm distinguish between the two,
but there may not _be_ enough training data that shows the inverse -- low-
interest loans given to minorities -- if the historical data is biased enough.

~~~
yummyfajitas
_Suppose the training data is from a bank that has historically given higher-
interest loans to minorities. Those loans will default more often, and so the
algorithm could "correctly" conclude that minorities default more often._

But if the relationship is actually `defaultProb = a x interest_rate + b`,
then the algorithm will reflect that. I discuss this case explicitly in the
blog post I linked to in the section "What if black people don't perform as
well?", and explicitly give an example of a linear model NOT doing what you
claim it will do.
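
A sketch of that case on synthetic data (assuming defaults truly depend only
on the interest rate): the fitted model gives the group variable a
coefficient near zero even though group and defaults are correlated in the
raw data.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # True model: default probability depends only on the interest rate.
    # The minority group is charged higher rates on average, so group
    # membership and defaults are correlated in the raw data.
    rng = np.random.default_rng(0)
    n = 50_000
    minority = rng.binomial(1, 0.3, n)
    rate = 0.05 + 0.04 * minority + rng.normal(0, 0.02, n)
    default = rng.binomial(1, np.clip(0.1 + 2.0 * rate, 0, 1))

    X = np.column_stack([rate, minority])
    fit = LinearRegression().fit(X, default)
    print(fit.coef_)  # rate coef near 2.0, minority coef near 0.0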

 _The fact is, both statements in this example would be true: "higher interest
is correlated with more defaults" and "minority borrowers are correlated with
defaults." The algorithm doesn't know one from the other._

No, it wouldn't because you'd also have cases of white people with high
interest who defaulted. The model using race as a predictive variable would
fail to fit those data points. I explicitly discuss this case in my blog post.

You are correct that in some cases you don't have enough training data. But in
these cases, there is no particular reason for bias to have a particular sign.
Again, as I discuss in the blog post I linked, why would Captain Kirk be
biased against "black on left guy" vs "black on right guy"?

Insufficient training data causes _variance_, not _bias_, and could just as
easily result in bias _in favor_ of blacks. (In fact, I link to a number of
examples where it does.)

~~~
ramblenode
> Insufficient training data causes variance, not bias

Key point.

If one were to compare the two high-interest groups, a small sample for one
group would be reflected in a large confidence interval around its mean, a
greater overlap in the CIs of both group means, and less confidence that a
real difference exists.
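
A quick illustration of the sample-size effect on those intervals (counts
made up):

    import numpy as np

    # 95% normal-approximation CI for a proportion. Both groups show the
    # same observed rate, but the smaller sample's interval is several
    # times wider.
    def ci(successes, n):
        p = successes / n
        half = 1.96 * np.sqrt(p * (1 - p) / n)
        return p - half, p + half

    print(ci(150, 1000))  # large group: roughly (0.13, 0.17)
    print(ci(6, 40))      # small group: roughly (0.04, 0.26)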

------
vonnik
Let's blame the math. It couldn't possibly be the fault of businesses
targeting the poor.

------
holografix
I found this article problematic. First, the assertion that "big data comes
with all the biases of its creators". Isn't this exactly what big data (which
I believe is a term used here to reference some sort of predictive model or
machine learning) is supposed to avoid, i.e. making decisions based on
evidence and not bias?

Maybe it's the cynic in me but someone who worked in a hedge fund, started yet
another advertising tech company and now releases a book about the evils of
"big data" abusing the poor... guess I'm biased.

~~~
muraiki
Big data somehow being unbiased and impartial is a myth. The algorithms are
still a product of humans. One notable example is a Google algorithm for
labeling objects in photos that tagged black people as gorillas [1]. Now I
don't have particular insight into the development process behind this
algorithm, but this result may have occurred because the training data used
did not include enough photos of black people.

Another example (from an old version of Picasa -- I'm not trying to pick on
Google but I have firsthand experience of this) is when a friend of mine had
photos of two different Asian people in his collection, but Picasa always
recognized person B as being person A, presumably because it couldn't
distinguish features unique to the faces of Asian people.

[1] [http://www.cnet.com/news/google-apologizes-for-algorithm-mis...](http://www.cnet.com/news/google-apologizes-for-algorithm-mistakenly-calling-black-people-gorillas/)

~~~
yummyfajitas
What evidence do you have that one specific classification error resulting in
a tweetstorm was the result of bias at all? I've been mislabeled as a tank in
at least one image processing demo (admittedly, this was well before the era
of deep learning). Image classification errors happen.

Also, it's a bit strange that you suggest ML algos are racist against Asians.
You do know that Asians are wildly overrepresented among people building such
algos, right?

Consider the possibility that some classification problems are actually more
difficult than others. As one example, darker objects are harder to
distinguish than bright ones and there are very clear mathematical reasons for
that. That's true even in fields where questions of "racism" don't make sense,
e.g. in MRI (magnetic resonance imaging is non-optical, but contrast still
matters):
[https://www.chrisstucchio.com/pubs/wf_segment.pdf](https://www.chrisstucchio.com/pubs/wf_segment.pdf)

~~~
muraiki
I did not make a blanket statement that all ML algos are racist against
Asians. My point was to convey that if the training data do not accurately
represent the population, the algorithm will carry the biases of that
inaccurate representation: the algorithm is only as good as the data it is
trained upon (for supervised learning).

As a sibling comment stated, humans are deeply involved in all aspects of
designing and interpreting an algorithm's results. To the extent that we are
flawed, so will our algorithms be.

Another example I heard about recently at a conference was the analysis of
blighted homes. The underlying data was generated by human inspectors who, it
turned out, had given passes to violations in more affluent areas and issued
fines in less affluent areas. The fines compounded the problem of blight, as
residents could not afford to pay them, causing a feedback loop in which
blight increased in neighborhoods identified as blighted.

I am frankly confused as to what is so controversial about the statement that
algorithms created by humans are not free from human biases. I apologize if I
have only added confusion rather than a productive analysis.

~~~
yummyfajitas
_I did not make a blanket statement that all ML algos are racist against
Asians. My point was to convey that if the training data do not accurately
represent the population, the algorithm will carry the biases of that
inaccurate representation: the algorithm is only as good as the data it is
trained upon (for supervised learning)._

But that's not true. Inadequate training data will cause _variance_, not
_bias_. This variance can have any direction.

In a linear prediction scenario (i.e. any go/no-go decision process), which is
exactly what Cathy O'Neil is criticizing, the effect could just as easily be
"more loans for black people" as "fewer loans for black people".

In a multidimensional predictor, it directly decreases accuracy, but that's
all. "Black guy -> gorilla" is just as likely as "yummyfajitas -> tank".

Also, realistically, if we want to think about your black guy -> gorilla
example: if insufficient training data was the problem, then most likely it
was the number of gorillas that was too small. Google has vastly more black
humans in its training set than gorillas - a quick Google search suggests
there are only 100,000 gorillas in the world, most of whom are probably
unphotographed.

_As a sibling comment stated, humans are deeply involved in all aspects of
designing and interpreting an algorithm's results. To the extent that we are
flawed, so will our algorithms be._

Sure, algorithms are flawed. But this _does not imply that bias must be
present_ proportionally to ours. Insofar as we do good statistics, human bias
is mitigated and can even go away. Science works.

As a classical example, consider Morton's analysis of human skulls. Morton was
a super racist guy trying to prove how white people are superior. But he did
an unbiased analysis and accurately measured skull volume - filling a skull
with buckshot is a pretty unbiased process.

[http://journals.plos.org/plosbiology/article?id=10.1371/jour...](http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001071)

 _I am frankly confused as to what is so controversial about the statement
that algorithms created by humans are not free from human biases._

What's controversial is that this is the idea of "original sin" applied to
science. Most of the proposed mechanisms _are simply mathematically wrong_,
and many of the examples cited by proponents of this original-sin concept are
wrong as well.

~~~
thaw13579
Thanks for the interesting article. From my reading, though, it seems to
support the importance of both data handling and statistics in avoiding bias.
That is, a bias in the dataset (or in which subset is analyzed) would
subsequently bias the statistical results (Boxes 1 and 2).

Regarding big data and ML, I guess the question is then what kinds of biases
exist in data collection, something that seems dependent on the details and
history of each particular dataset.

------
lucozade
In other news, it has been discovered that a Time article was written in
English. A language that has previously been used for hate speech.

The only possible conclusion is that we should ban Time magazine. Or some such
nonsense.

------
yakult
The irony is that Time is partnered with advertisers to run these exact
algorithms on the readers of this very article. In fact, they have anti-
adblocking scripts to prevent circumvention, which unfortunately break reader
mode.

