
How Cambridge Analytica’s Facebook targeting model really worked - Dowwie
http://www.niemanlab.org/2018/03/this-is-how-cambridge-analyticas-facebook-targeting-model-really-worked-according-to-the-person-who-built-it/
======
etiam
Spoiler warning. Article punchline ahead.

"The whole point of a dimension reduction model is to mathematically represent
the data in simpler form. It’s as if Cambridge Analytica took a very high-
resolution photograph, resized it to be smaller, and then deleted the
original. The photo still exists — and as long as Cambridge Analytica’s models
exist, the data effectively does too."

That's an eloquent piece of explanation of a very important point. And apropos
the discussion about privacy legislation, it's also going to be a very
interesting point. Will the Cambridge Analyticas of the world be able to claim
they have held on to no personal data, when strictly speaking the raw data has
indeed been deleted after being used to create a derivative work that can for
all important purposes be used to recreate the original? Assuming I find out
I'm being profiled and demand to have my data removed, will society grant me
rights to have derivative forms removed or adjusted too? I'm somewhat
pessimistic: legal hairsplitting about matters like these will likely make
enforcement very difficult.

~~~
darawk
> when strictly speaking the raw data has indeed been deleted after being used
> to create a derivative work that can for all important purposes be used to
> recreate the original?

To be precise, you almost certainly cannot use this data to recreate anything
remotely resembling the original dataset. This type of dimensionality
reduction would throw away enormous volumes of data. There is no meaningful
sense in which you can reconstruct the data from it.

What they have done is distill some insights about people from this data. It's
arguable whether they should be allowed to keep those insights, but there's no
privacy risk there really.

It's honestly kind of disingenuous to describe dimensionality reduction in the
way that they do here. It _is_ like reducing the resolution of a photo, but
it'd best be described as reducing that resolution to, say, the 20 most
representative pixels. There's no real sense in which the photo still exists.
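
To make the "lossy" point concrete, here is a minimal sketch under stated assumptions. The like matrix is made up, the reduction keeps a single component, and a pure-Python power iteration stands in for a truncated SVD. None of this is Cambridge Analytica's actual model; it only shows that the reduced form does not round-trip to the original, and that how much survives depends on how many components are kept.

```python
import math

def rank1_approx(matrix, iterations=200):
    """Rank-1 approximation of `matrix` via power iteration (a stand-in for truncated SVD)."""
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols  # initial guess for the top right-singular vector
    for _ in range(iterations):
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # u = M v
        w = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # w = M^T u
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # One number per user: the projection of that user's row onto v.
    scores = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    approx = [[scores[i] * v[j] for j in range(cols)] for i in range(rows)]
    return scores, v, approx

def frobenius(matrix):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in matrix for x in row))

# Hypothetical data: 4 users x 5 pages, 1 = liked.
likes = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
]

scores, component, approx = rank1_approx(likes)
residual = [[likes[i][j] - approx[i][j] for j in range(5)] for i in range(4)]
error = frobenius(residual)
rounded = [[round(x) for x in row] for row in approx]

print("per-user scores:", [round(s, 3) for s in scores])
print("reconstruction error:", round(error, 3), "vs. original norm", round(frobenius(likes), 3))
print("rounded reconstruction equals original:", rounded == likes)
```

Keeping more components would shrink the error, which is why the disagreement in this thread is partly about how aggressive the reduction is.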

~~~
stochastic_monk
That's only accurate in the sense that because an LSTM's hidden layer is much
smaller in dimension than the data on which it is trained, there is less
information in it.

However, it concisely represents a manifold in a much larger dimensional space
and effectively captures most of the information in it.

It may be (and is) lossy, but don't underestimate the expressive power of a
deep neural network.

~~~
alexcnwy
You're throwing out buzzwords instead of addressing the response.

It's dimensionality reduction. You cannot recover the original object. It's
like using a shadow to reconstruct the face of the person casting the shadow.

Note this has nothing to do with the expressive power of a deep neural
network. You are by definition trying to throw away noisy aspects of the data
and generalize a lower dimensional manifold from a high dimensional space. If
it's not lossy, it won't generalize.

~~~
stochastic_monk
You're right that it's really just a form of dimensionality reduction. My
point was just that it's a more powerful form of dimensionality reduction than
PCA or NMDS.

[Edit: _and_ that the salient characteristics are likely contained in the
model.]

~~~
darawk
Precisely because it's more powerful, it doesn't encode the identifying
information of the original data. Something like PCA likely would retain
identifying characteristics (depending on how many low-rank vectors you drop).

~~~
stochastic_monk
Outside of the fact that they have identities for all of the people whose data
they acquired, yes, it would be harder to reconstruct individual people with
it than with PCA, because PCA's components are directly interpretable.

~~~
darawk
They claim to have deleted that data. If they haven't deleted the data, then
of course it's still an invasion of privacy. But the ML model really has
nothing to do with it.

~~~
etiam
I think the ML model has a lot to do with it in this case. One of the
arguments I expect to see is that "Oh, no! We removed _all_ the _data_. It's
gone. I mean, that was only a few hundred megabytes per person anyway, but we
just calculate a few thousand numbers from it and save them in our system, then
delete the data. That's less data per person than is needed to show a short
cute cat GIF. What harm could we possibly do with that?"

~~~
darawk
My point isn't that there is no harm here in them storing this model. It's
also not that the data in their model is worthless. It's specifically that the
way this article is talking about the issue is incorrect. The analogy they use
would lead you to draw false conclusions about what's going on, and how to
understand it.

There is a real issue here of whether or not they should be allowed to keep a
model trained from ill-gotten data. But the way I would think about it is: If
you steal a million dollars and invest it in the stock market, and make a 10%
return, what happens to that 10% return if you then return the original
million? That's a much better analogy for what's going on here. They stole an
asset, and made something from it, and it's unclear who owns that thing or
what to do with it.

------
lordnacho
Finally, I was waiting for someone to talk about the model itself. It makes
sense that SVD or something like it (PCA, co-occurrence, etc) would be used.

But I also wonder what exactly you are going to do with the predictions. What
exactly do you show to someone to make them more likely to go and vote if they
are inclined to vote your way, or make them stay at home otherwise? Is there
evidence that whatever you're showing actually works? Or do you try to change
people's minds? What do you do?

Knowing the state of things (in this case, people's voting inclinations) is
not the same as knowing what to do, i.e. having a strategy.

I don't know how effective it is, I'd like to learn more. But I smell the
possibility that these CA-type firms are simply selling snake oil to desperate
political activists.

~~~
tomrod
> What exactly do you show to someone to make them more likely to go and vote
> if they are inclined to vote your way, or make them stay at home otherwise?

Qualitatively: show things that get them angry.

Quantitatively: test and control pop splits.
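
A rough sketch of what a test/control split can look like in practice, assuming the measured outcome is a proxy that can be observed before election day (here a hypothetical "pledged to vote" click). The counts are invented for illustration, and the two-proportion z-test is one common choice, not something the thread or the article confirms was used.

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test; returns (lift, z, p_value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return p_a - p_b, z, p_value

# Hypothetical counts: the test group saw the ad, the control group did not.
test_conversions, test_size = 520, 10_000
control_conversions, control_size = 450, 10_000

lift, z, p_value = two_proportion_z(
    test_conversions, test_size, control_conversions, control_size
)
print(f"lift={lift:.4f}  z={z:.2f}  p={p_value:.4f}")
```

The split tells you whether the ad moved the proxy; whether the proxy tracks actual votes is a separate question.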

~~~
mattmanser
How do you test anything? There's only one vote, you can't iterate.

~~~
metaobject
Maybe with polling?

~~~
icelancer
Polling data is terrible, especially for polarizing candidates like Trump.
People simply lie in public about not voting for him, afraid of the backlash
they would receive.

------
emodendroket
I am really puzzled by the Cambridge Analytica scandal. It's not particularly
savory, but is there something happening here that wasn't basically already
known about how Facebook worked? By the protests of their own executive, the
system was working as designed, and at worst Cambridge Analytica misled them
about how they intended to use the data, right? There was no actual security
breach here, as far as I can understand it.

~~~
frozenlettuce
It became a "problem" because it helped Trump win.

~~~
vuln
This is exactly it. At least it stops the news from droning on and on about
Russia.

I thought Clinton spent large amounts of money on data and the Democrats
admitted the data was bad or at least that was their excuse. How much did CA
pay for this data? I still find it crazy that the Trump campaign spent 30% of
what Hillary did and still won. The Russians used $100k worth of ads to sway
the election. This stuff doesn't add up.

~~~
neuronexmachina
Do you have any sources for your claim about how the Clinton campaign acquired
FB data and how they used it? Was any of it acquired fraudulently and/or in
violation of FB's ToS, like CA's data was?

~~~
hueving
Do you have any sources for your claim that the parent poster claimed the
democrats purchased Facebook data?

~~~
neuronexmachina
This is a thread about how CA acquired and used data from Facebook, so I
assume the parent comment was trying to make an apples-to-apples comparison.
The alternative is that the poster was disingenuously trying to imply a false
equivalence.

------
tunesmith
Isn't the accuracy of the predictions kind of orthogonal to the fact that they
were basically lying in their attempts to change behavior?

Using lies to convince someone to do something is going to be more effective
than using truth, if that "something" is not in accordance with the truth.

The comparison with Netflix really breaks down there. You're not going to be
able to convince me that I liked Crash, so recommendations based off of that
aren't going to be very useful to me.

But if you reinforce my false belief that Obama and Soros are gonna use the
deep state to invoke Sharia Law on the 2nd amendment, then that might better
convince me to vote for so and so.

------
spitfire
When talk about CA first emerged on HN before the election some posters found
the original papers referred to. They were looking at pictures in the story
and zoomed in to find the titles.

I cannot find those posts for the life of me again. Not suggesting anything
nefarious here, I just can't find them. Does anyone have a link to those early
conversations, or copies of the papers?

I made copies earlier but deleted them before I put them into my papers
archive.

~~~
erikpukinskis
Here are some leads:

[https://news.ycombinator.com/item?id=14486365](https://news.ycombinator.com/item?id=14486365)

[https://news.ycombinator.com/item?id=14393991](https://news.ycombinator.com/item?id=14393991)

[https://news.ycombinator.com/item?id=14330547](https://news.ycombinator.com/item?id=14330547)

[https://news.ycombinator.com/item?id=14284502](https://news.ycombinator.com/item?id=14284502)

[https://news.ycombinator.com/item?id=13939814](https://news.ycombinator.com/item?id=13939814)

And the query:
[https://hn.algolia.com/?query=mercer&sort=byPopularity&prefi...](https://hn.algolia.com/?query=mercer&sort=byPopularity&prefix&page=1&dateRange=custom&type=comment&dateStart=1470528000&dateEnd=1506816000)

~~~
spitfire
No, none of those are it. I believe it was before the election that the article
came out. There was a picture of someone reading the paper CA was supposedly
based upon in the article. Someone zoomed in and found the paper.

I'll keep looking.

------
godelski
> has revealed that his method worked much like the one Netflix uses to
> recommend movies.

I'm not sure this is the model you want to emulate. The suggestions are
terrible and continually getting worse.

~~~
rjurney
It's a hard problem but Netflix's model represents the state of the art in
machine learning for recommendations. Still scared of the singularity? :)

~~~
nicoburns
Really? Because in terms of actual usefulness, I find YouTube's suggestions to
be better...

~~~
scoggs
I tend to agree with you. The only thing I dislike about it is when I happen
to open a link I'll randomly be sent or notice in an article that's something
unlike what I normally enjoy or something I close right away and then for the
next few days, or until I watch a bunch more of what I usually enjoy, the
entire suggestion list is only things to do with that one random link.

~~~
zrobotics
I don't think it's possible on a phone, but in a browser you can remove videos
from your watch history. Click the three dots next to the video for options,
and select not interested. Google actually seems to take these into account, I
had watched an Alex Jones video that was linked from elsewhere, and while it
is good to not have too much of a bubble there's just certain things I don't
need to see more of.

Or, just use a separate browser. I typically only use Opera for YouTube &
anything where I don't mind Google tracking me, but if I open a YouTube link
in Firefox I'm not logged in (and w/ Opera's VPN enabled it doesn't appear to
affect YouTube recommendations). I'm sure Google does correlate traffic
between the two to some extent, but this seems like the only useful way to use
Opera's integrated VPN.

------
cosmic_ape
Besides the data, it's interesting how the actual targeting was performed.

Does facebook provide an option to show a particular given ad to a particular
given user? Or is it possible to select a group of people with a given set of
likes? How fine-grained is facebook's audience selection mechanism for ads?

Or was the targeting performed by creating fake groups, befriending people?

------
daenz
Am I understanding this correctly? Facebook user data (likes/profile info) was
scraped to produce low-dimension feature vectors for users (similar to
word2vec). These feature vectors were then run through some ML model to
predict...what exactly? Targetability for effective political ads?
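
For what the second stage could look like, here is a heavily hedged sketch. It assumes a small set of users with known labels (say, survey respondents) and fits a plain logistic regression on their low-dimensional vectors, then scores an unlabeled user. The vectors, the labels, and the choice of logistic regression are all hypothetical; the thread does not establish what the real prediction target or model was.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_logistic(vectors, labels, lr=0.5, epochs=2000):
    """Batch gradient descent on the logistic loss; returns (weights, bias)."""
    dim, n = len(vectors[0]), len(vectors)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(vectors, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for k in range(dim):
                grad_w[k] += err * x[k]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(weights, bias, x):
    """Estimated probability that a user with vector `x` has the label."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Hypothetical 2-d user vectors (output of the dimension reduction step)
# with a hypothetical binary label known only for these six users.
labeled_vectors = [[0.9, 0.1], [0.8, 0.3], [0.7, 0.2],
                   [0.2, 0.9], [0.1, 0.7], [0.3, 0.8]]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = fit_logistic(labeled_vectors, labels)
score_a = predict(weights, bias, [0.85, 0.15])  # unlabeled user near the first group
score_b = predict(weights, bias, [0.15, 0.85])  # unlabeled user near the second group
print(round(score_a, 3), round(score_b, 3))
```

Whatever the real target was, the shape is the same: labels from a small group, vectors for everyone, and a model that extends the labels to the rest.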

~~~
emodendroket
It seems like the purpose was narrowly tailoring messages, which is something
political campaigns are really keen to do now (Obama's campaign was kind of a
trailblazer here, right?).

~~~
jacquesm
> Obama's campaign was kind of a trailblazer here, right?

It's a pretty big gap between _using_ and _abusing_ social media and as far as
I know Obama's campaign did not 'narrowly tailor messages'. They did target
broad groups using generic messages and they did quite effectively use social
media presence to build support.

But they did not - as far as I know, so please correct me if I'm wrong - go so
far as to single out individuals or really small groups with the express
intent of flipping their votes or targeting them with disinformation in order
to try to stop them from voting.

And Cambridge Analytica seems to have been doing just that if the currently
available information is to be believed.

~~~
emodendroket
[https://devumi.com/2017/12/social-media-case-study-how-barac...](https://devumi.com/2017/12/social-media-case-study-how-barack-obama-became-president/)

> The former president also hired Facebook co-founder Chris Hughes to help in
> developing his social media strategy. Obama furthered the use of Facebook
> for his 2012 re-election bid, utilizing it to encourage young people to cast
> their votes. His team developed a Facebook app that looked into supporters’
> friends list to find younger voters. The team then asked supporters to share
> online content with these voters. More than 600,000 supporters responded to
> the call, sending content to over 5 million contacts.

> During his presidency, Obama continued to use Facebook to reach out to the
> public. In 2016, he became the first president to go live on the site, just
> before his final State of the Union Address.

~~~
jacquesm
Yes, that pretty much confirms what I wrote above. Your point being?

Please read the article and compare what we know about Cambridge Analytica vs
what the Obama campaign did; it is like comparing snipers with someone setting
off fireworks.

~~~
emodendroket
Look, I don't think anyone can realistically doubt that Obama's campaign was
the first to effectively slice-and-dice the electorate and use social media to
target them. You're arguing against a much more expansive claim than I'm
making.

~~~
jacquesm
You used the word 'narrowly', and in the context of a post about Cambridge
Analytica that word has a pretty specific meaning.

------
soared
Very interesting article, but I wish it went one step further. Why does it
matter that Cambridge Analytica knew a user's Big Five or that they were an
old, uneducated Republican? How was this (inferred) data used?

I assume they wrote/created different ads for different sets of users... but
how many segments did they have? Did their graphic designer build 500
different ads, or was text/images dynamically inserted based on these
variables? How did they figure out which message would resonate with each
segment? How did they test something like this, with so many potential
variables? Was this knowledge used only on facebook, or across all digital
channels? Was it implemented in non-digital channels as well?

I'd kill to have access to their campaign setups.

------
katebrooks
And we can't blame the companies because we agreed to have our data monitored
when we installed the apps and granted permissions. Facebook and Google have
taken so much of our data and know more about our behavior than we ourselves
do. Even if I demand that my data be removed, I don't know if they will
'actually' delete it.

~~~
gleglegle
Who says we can't blame them?

------
albertTJames
Yeah yeah, they only tried SVD on a multimillion dollar dataset. Psy-ops,
Netflix, same deal, what is all the fuss about! Strange that a research
scientist is released from his NDA a couple of days after the Channel 4
documentary.

------
scottybowl
I'm excited for GDPR. The hard part is going to be getting the truth out of
these companies about the actual extent of the data they hold on us.

~~~
rossdavidh
That "hard part" is going to be more than just hard; I think the word you're
looking for there is "impossible". They don't have any way of knowing what
data Google, Facebook, Amazon, or any other company has. As this article does
a credible job of explaining, you have to understand a fair amount about
statistics (PCA) and machine learning to even recognize it if you were looking at
it, and they won't know where to look for it. They have no enforcement
mechanism in mind, and they passed a law anyway, which effectively means "you
can't admit to having this", which will mean that the more willing a company
is to lie, the bigger an advantage they will have over their competitors.

~~~
rmateus
Sure. But at least when someone actually blows the whistle on them (people who
"just" work for the company), there is a law that can be used to bring to
court the people who are legally accountable for the company.

------
vcdimension
This is the standard way of analysing this kind of data, and I'd be very
surprised if the Obama campaign didn't use the same or very similar methods
with the facebook data they obtained. The only difference is that Cambridge
Analytica managed to obtain much more data over a wider demographic.

------
willart4food
archived for future reference
[http://archive.is/dMIcN](http://archive.is/dMIcN)

