How Cambridge Analytica’s Facebook targeting model really worked (niemanlab.org)
427 points by Dowwie on March 30, 2018 | 195 comments



Spoiler warning. Article punchline ahead.

"The whole point of a dimension reduction model is to mathematically represent the data in simpler form. It’s as if Cambridge Analytica took a very high-resolution photograph, resized it to be smaller, and then deleted the original. The photo still exists — and as long as Cambridge Analytica’s models exist, the data effectively does too."

That's an eloquent explanation of a very important point. And apropos the discussion about privacy legislation, it's also going to be a very interesting point. Will the Cambridge Analyticas of the world be able to claim they have held on to no personal data, when strictly speaking the raw data has indeed been deleted after being used to create a derivative work that can for all important purposes be used to recreate the original? Assuming I find out I'm being profiled and demand to have my data removed, will society grant me rights to have derivative forms removed or adjusted too? I'm somewhat pessimistic: legal hairsplitting over matters like these will likely make enforcement very difficult.


> when strictly speaking the raw data has indeed been deleted after being used to create a derivative work that can for all important purposes be used to recreate the original?

To be precise, you almost certainly cannot use this data to recreate anything remotely resembling the original dataset. This type of dimensionality reduction would throw away enormous volumes of data. There is no meaningful sense in which you can reconstruct the data from it.

What they have done is distill some insights about people from this data. It's arguable whether they should be allowed to keep those insights, but there's no privacy risk there really.

It's honestly kind of disingenuous to describe dimensionality reduction in the way that they do here. It is like reducing the resolution of a photo, but it'd best be described as reducing that resolution to, say, the 20 most representative pixels. There's no real sense in which the photo still exists.
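To make the lossiness concrete, here's a minimal sketch (Python with scikit-learn, on synthetic stand-in data - not anything resembling CA's actual pipeline): squash 500 features down to 20 components and try to invert.

    # Minimal sketch of lossy dimensionality reduction on synthetic data.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 500))   # 1000 "users" x 500 "features"

    pca = PCA(n_components=20)         # keep only 20 components
    Z = pca.fit_transform(X)           # the "resized photo"
    X_hat = pca.inverse_transform(Z)   # best-effort reconstruction

    err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
    print(f"relative reconstruction error: {err:.2f}")  # close to 1.0 here

On data this noisy the reconstruction error is near total. Real behavioral data has more structure, but the point stands: the transform is many-to-one and not invertible.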


That's only accurate in the sense that an LSTM's hidden layer is much smaller in dimension than the data on which it is trained, so there is less information in it.

However, it concisely represents a manifold in a much larger dimensional space and effectively captures most of the information in it.

It may be (and is) lossy, but don't underestimate the expressive power of a deep neural network.


You're throwing out buzzwords instead of addressing the response.

It's dimensionality reduction. You cannot recover the original object. It's like using a shadow to reconstruct the face of the person casting the shadow.

Note this has nothing to do with the expressive power of a deep neural network. You are by definition trying to throw away noisy aspects of the data and generalize a lower dimensional manifold from a high dimensional space. If it's not lossy, it won't generalize.


You're right that it's really just a form of dimensionality reduction. My point was just that it's a more powerful form of dimensionality reduction than PCA or NMDS.

[Edit: and that the salient characteristics are likely contained in the model.]


Precisely because it's more powerful, it doesn't encode the identifying information of the original data. Something like PCA likely would retain identifying characteristics (depending on how many of the low-variance components you drop).


Outside of the fact that they have identities for all of the people whose data they acquired, yes, it would be harder to reconstruct individual people with it than with PCA, because PCA's components map directly onto the original data.


They claim to have deleted that data. If they haven't deleted the data, then of course it's still an invasion of privacy. But the ML model really has nothing to do with it.


I think the ML model has a lot to do with it in this case. One of the arguments I expect to see is: "Oh, no! We removed all the data. It's gone. I mean, it was only a few hundred megabytes per person anyway, but we just calculate a few thousand numbers from it and save them in our system, then delete the data. That's less data per person than is needed to show a short cute cat GIF. What harm could we possibly do with that?"


My point isn't that there is no harm here in them storing this model. It's also not that the data in their model is worthless. It's specifically that the way this article is talking about the issue is incorrect. The analogy they use would lead you to draw false conclusions about what's going on, and how to understand it.

There is a real issue here of whether or not they should be allowed to keep a model trained from ill-gotten data. But the way I would think about it is: If you steal a million dollars and invest it in the stock market, and make a 10% return, what happens to that 10% return if you then return the original million? That's a much better analogy for what's going on here. They stole an asset, and made something from it, and it's unclear who owns that thing or what to do with it.


The ML model might know more about me than I’m willing to admit about myself. I only find some — but not much — comfort in the proposition that it can’t conjure my PII.


Is this basically a choice between .mp3 and .ogg, png vs jpg vs gif?


It’s kind of comparable.

Regardless, I still think having the most relevant features already extracted is all they need to ask many of the questions they might want to. The point is that that’s still quite bad.


Right, I was just trying to confirm an analogy. It seems like this stuff is like a lossy codec for traits.


If you can still run Java applets, this is a nice intro: http://www.cs.mcgill.ca/~sqrt/dimr/dimreduction.html


> It's dimensionality reduction. You cannot recover the original object.

Makes me think of the Simulacrum[1]. "The map is not the territory."[2]

1. https://en.wikipedia.org/wiki/Simulacra_and_Simulation

2. https://en.wikipedia.org/wiki/Map%E2%80%93territory_relation


Your SSN is "dimensionality reduction" over your data. It's still your private data. Same for your race, sexual orientation, hobbies, etc.


I don't think you understand what dimensionality reduction means.

SSN is a lookup key into the raw data. Dimensionality reduction is by definition lossy, since it's used in scenarios where the number of rows of data (n) is far smaller than the number of features (m), i.e. n <<< m.


Only if the "true" data actually lives in a lower-dimensional manifold and the data can accurately encode it with low noise. I doubt anyone can tell who you will vote for depending on which cat videos you liked, no matter how magic your regressor.


I do think that the most significant components of a personality will likely be targetable with a relatively low nuclear norm.

And, for example, where someone falls on the exploration/exploitation spectrum, if you will (i.e., how strongly they respond to fear-based messaging), is probably quite predictable from a spectrum of likes.

Cat pictures may be less informative, but not all of these people clicked exclusively on feline fuzzy photos.


I'm not an expert on personality so won't disagree (except to say that I am a little sceptical of a static personality profile actually existing and I think people who always vote a certain way would be the easiest to regress and also the most useless to target). As I said in another post, it really depends on what part of your privacy you are trying to protect. It is also a mistake to think of anything on the internet as a private forum.


> the most significant components of a personality will likely be targetable with a relatively low nuclear norm.

Is this falsifiable? It reads like a tautology to me.


I think it is falsifiable. More precisely, the claim is that the most significant features for psychoanalytic purposes will be contained in the model after training even with low nuclear norm. It’s possible for the most salient features for this purpose to not be in the model. It was unclear the way I first said it.


Alternatively, "If I take a FLAC you own, make a 320kbps MP3 from it, and store it on my laptop, am I still in possession of any IP belonging to you?"


I think a better analogy might be "If I take a few hundred thousand MP3s and come up with a clever way to reduce each to a short representation of its genre, mood, tempo, etc that can be used to identify similar music, then throw away all the original MP3s, am I still in possession of the original music". The whole point is to turn the individual data into broad, general categorisations that are easier to handle because they contain much less information. Remember, they're using this for ad targeting, and the reason they're doing it is so they can target broad groups of people rather than having to manually go through and target ads at each individual one by one.


I like that analogy. I'll make it more tenuous with: "I took a copy of your album collection without your permission, ripped them to MP3, and played them so much everyone is sick of them. But you've still got all the original CDs you don't even use, so no problem, right?"

On this tangent, IP ownership for deep learning models is interesting - how do you prove (in court) that someone has or hasn't copied a model or stolen a training set? If you fed someone else's training set or model into your system, how easy is it to prove? Will we see the equivalent of map 'trap streets' in trained CNN models?

Which led me to: https://medium.com/@dtunkelang/the-end-of-intellectual-prope...


Except Facebook let them take a copy of the album collection, albeit for a different use case, but it was allowed nonetheless. That doesn't absolve CA in any way, but should make us wary of people we willingly give our "album collections" - they will use them to make money, and what they allow people do with them can easily be things we don't agree with, but didn't have the imagination to think of when we signed the EULA.


Where do neural nets come into this? The Kosinski-Stillwell-Graepel paper talks about using the reduced-dimensionality data with logistic regression.


Especially when the dataset probably isn't that high in entropy. Something like PCA can drop the dimensionality by significant amounts as long as the data has enough clear signals in it.


" cannot use this data to recreate anything remotely resembling the original dataset."

This by itself may be mostly true, perhaps - and many of the comments get into ways of playing with this dataset to make it better; I don't have experience with those methods. But what I have not seen anyone mention: even if you only have this dumbed-down dataset and the original is gone, you can still combine it with other data sets that are either public or previously created, and likely fine-tune;

dumbed-down set + public voter records + public arrest records + previous whatever records - sort, match, see what's left over.

You could pretty much recreate what you needed from the original. Maybe not 100%, but I would guess you could get really close.


Or "better" (for some value of better), join the data to something identifiable (e.g., the public records you listed) before developing models for everything you want to retain, then discard the original data once your deep enough net has effectively auto-encoded whatever you wanted to retain in an identifiable manner.


> distill some insights about people from this data

i'd argue that insight is the bit that's important, and the bit that's the privacy risk.


So if I compress a BMP by 100:1 using "lossy" techniques then it's not equivalent to the original? I'd say that depends on how recognizable the result of reconstruction is, and not on the amount of reduction. MPAA would be very unhappy with your argument.

To be more extreme, there are many compression/decompression methods that can perfectly reconstruct the original data with very high compression ratios. GIF/PNG can reproduce many images exactly. Surely those are derivative works?
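For the lossless case, the round-trip really is exact - a quick sanity check with zlib (the same DEFLATE family PNG uses):

    # Lossless compression round-trips exactly.
    import zlib

    data = b"the same bytes, exactly " * 100
    packed = zlib.compress(data)
    assert zlib.decompress(packed) == data  # perfect reconstruction
    print(len(packed), "<", len(data))      # much smaller, since the input is repetitive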


This compression analogy is really a bad one. A much better analogy would be: If you read 10 novels by a particular author, and are now able to recognize his/her style, because you have a compressed representation in your mind of the way they write.


PNG isn't a derivative work. It's an alternate encoding of the same work.


This type being ... something like PCA? It's up to the user how much to actually reduce the dimensionality.

The pixel analogy is bad, but to use it anyway -- you get to choose how many pixels you keep. You could keep literally all of them.


> It's arguable whether they should be allowed to keep those insights, but there's no privacy risk there really.

So if Google has distilled someone's emails over the years into "closeted homosexual with a deeply repressed leather fetish", that's not an invasion of their privacy as long as they throw away the source materials?


As long as they retain no data which could specifically identify the original person, yes. There is nothing wrong with building segmentation models as long as they aren't specific enough to identify a specific person.

My concern would be, how granular is too granular? What if we added "and live in zip code 12355 and is registered Green Party"? This now gets eerily specific, and might be sufficient to identify an individual.


Why would they ever discard that? Why would there be a granularity where ML suddenly stops working? Why would you even stop at one model per person, instead of one model per mood, or modes of thought at different stress points?


In fact they would desire that granularity most of all, so as to reconcile the past and future state psychographic profiles for an individual - then they could attempt to isolate the causation of a state change. Basically, they need to identify the moment an individual's profile reflects the change from Democrat to Republican or vice versa, or religious to atheist, etc.


Since the source data was deleted - according to current standards and policies - their hands are probably technically clean. But there may be another angle of attack.

In the US, you're not allowed to benefit directly from a crime you committed. For example, if you rob a bank, you can't buy your mother a car with the money and say "sorry, it's gone!" when the police come knocking.

With that line of reasoning and if there was a legal, privacy, or at least a TOS breach in collecting the data, the derivative machine learning models may be tainted also. Then again, it's likely impossible to prove exactly what data went into the model, so hard to establish which models might be tainted.


If they kept information like that, then yes that would be an invasion of privacy. But that sort of information is almost certainly not encoded in an ML model trained on 50 million people's data.


Let's say I take age and income of everyone in a city and train a regression model that predicts income from age. The model has slope and intercept that "encode" the information from all the people.

It would not be possible to make inferences about the income of any particular person from the slope and intercept, so it would be ok to share those values in, say, a journal article, even though disclosing income of a particular person would not be ok.
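A concrete version of that (all numbers synthetic):

    # Fit income ~ age on synthetic data; the model keeps two numbers.
    import numpy as np

    rng = np.random.default_rng(1)
    age = rng.uniform(20, 65, size=10_000)
    income = 1_200 * age + 15_000 + rng.normal(0, 10_000, size=10_000)

    slope, intercept = np.polyfit(age, income, deg=1)
    print(f"income ~= {slope:.0f} * age + {intercept:.0f}")
    # Two parameters now "encode" 10,000 people; no individual's
    # income can be read back out of them.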


How do you know what CA trained on, or what's possible? Do you have qualifications in ML?


I know what they trained on because it's been reported on. They got around 50 million people's FB profiles, and a smaller subset's (300k, I think) personality test results.

I use ML models every day in my work, and understand how they function. It is true that individuals' information is probabilistically encoded into the parameters of the model. However, if the model is any good, the information of the people it was trained on is encoded only a bit more strongly than that of the entire population.

There is sort of a privacy issue in the following sense: The models they've built have learned relationships between preferences and personalities that they wouldn't otherwise have been able to learn. But these relationships are abstract. They are not tethered to any particular, identifiable individual.

A reasonable argument can be made that those learned relationships are, in a sense, stolen property. And I think arguments along those lines are interesting things that we'll have to explore as this sort of thing becomes more common. But the idea that this model invades individuals privacy just isn't really true.


Is there a reason that people are only talking about the privacy angle?

People very much don't want these models to exist. They don't want a predictive model which can guess their affiliation from unrelated activity breadcrumbs.

That's why I assumed this whole issue has exploded recently.

Not the privacy, but the implications.


But if the resulting model doesn't contain information about individuals, how does this help targeting individuals for the campaign?

Edit: is it that the model is then applied to only strictly public data about the person? If so I guess the interesting question then becomes whether the model is definitely not anything near overfitting (i.e. containing enough information to match a person's public data directly since it was trained on it (amongst other data))? (I'm not an ML developer.)

Edit 2: also, going with your comparison with the "20 most representative pixels", it seems interesting then that 'this much' (although not exactly sure how much) information can be inferred from a public profile when just also knowing enough about the whole Facebook population. OK, so perhaps a human would be able to infer about as much, but doesn't scale, and that's why the model becomes valuable?


> But if the resulting model doesn't contain information about individuals, how does this help targeting individuals for the campaign?

I don't know exactly what they were modeling, but from the published reports, it sounds like they were trying to predict big 5 personality characteristics (conscientiousness, neuroticism, openness, extraversion, agreeableness) from FB profile data (e.g. likes, dislikes, bio, post content, etc.). So in that case, the model would contain weights that measure the strength of relationship between characteristics like "likes punk rock music" and "openness". That description really only literally applies to a linear model - but nonlinear models are, for these purposes, the same.
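As a toy illustration of that shape of model (my assumption of the rough form, with invented feature names and fabricated data - not CA's actual features or code):

    # Toy likes -> trait model; the stored artifact is per-feature weights.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(2)
    likes = rng.integers(0, 2, size=(5000, 3))  # columns: punk_rock, hiking, opera
    # Fabricated ground truth: "openness" loosely follows liking punk rock.
    openness = (0.8 * likes[:, 0] + rng.random(5000)) > 0.9

    model = LogisticRegression().fit(likes, openness)
    weights = dict(zip(["punk_rock", "hiking", "opera"], model.coef_[0].round(2)))
    print(weights)  # abstract like -> trait relationships, not anyone's profile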


> I know what they trained on because it's been reported on.

What reason do you have to think their data set consisted of only what has been reported?

How do you know anything about the models they used?


That's a ridiculous response. If they managed to infer this characteristic from emails, what they would keep is a tool which, given that set of emails again, would infer the same characteristics (and theoretically from a similar set of emails). They would by no means be allowed to keep the kind of information you described.

What is more relevant is a model which, given characteristics such as "closeted homosexual with a deeply repressed leather fetish", they would be able to infer other characteristics, such as support of particular political candidates, responsiveness towards targeted political or commercial ad campaigns, etc. That's what's relevant here.


> To be precise, you almost certainly cannot use this data to recreate anything remotely resembling the original dataset. This type of dimensionality reduction would throw away enormous volumes of data. There is no meaningful sense in which you can reconstruct the data from it.

First off, I think that's wrong. The idea is after all to keep the information that will result in the smallest error compared to the original on the dimensions one cares about. Within what the model emphasizes a reconstruction can be not only "remotely resembling the original dataset" but as closely resembling the original dataset as is possible with the capacity of the representation.

Next, I'm really not talking only about the particular method described in the post. It's definitely possible to choose to make a light enough reduction to preserve the aspects of the information one is interested in, and to optimize for recall rather than generalization. A more realistic context is going to be that some information about the affected individuals is still exposed or kept (maybe in a compact derived form), which would in many cases give excellent possibilities to restore information accurately enough that claims to have removed the data are effectively deceptive.

Even for cases where the models are in good faith created only to "distill some insights" I'm skeptical that they really are useless for recovering individual information. I'm by no means an expert in differential privacy but I do listen when it comes up, and a lot of what we see from that field seems to come down to being able to trade off the relation between keeping the data useful and how many pieces of additional information (or assumptions and brute force) are needed to break the integrity protections. With surprises that tend to be on the side of 'Oops. Turns out this clever trick can recover the originals easier than we thought.'

> It's honestly kind of disingenuous to describe dimensionality reduction in the way that they do here. It is like reducing the resolution of a photo, but it'd best be described as reducing that resolution to say, the 20 most representative pixels. There's no real sense in which the photo still exists.

In my honest opinion the original analogy does an excellent job of intuitively explaining that most of the informative aspects of the data are kept (we can still see just fine what's in the image) while irrelevant details are discarded, and that is probably what was intended.

If anything comes off as disingenuous in that context it's your representation that it's like a strong reduction in the pixel domain (where it does indeed destroy a lot of the information). What can be done is much more like running the picture through a high-performance Imagenet classifier and keeping the 20 (or 2048, or whatever's needed) most informative values at a level that corresponds strongly to the semantic content of the picture, and holding on to the model. We could probably generate images that people would have a hard time distinguishing from the original with that.


You're making lots of arguments by analogy here, and they're all just not correct. I'm not sure how better to explain it. Yes, it is theoretically possible to do a style of dimensionality reduction that would not destroy very much information. But nobody uses models like that to make predictions. The models people actually use to make predictions destroy enormous quantities of information, and reduce dimensionality in the extreme. It is not like compressing a JPEG. It is like looking at a photograph of a person and remembering that someone with brown hair was in it.


The analogy does not work here and is misleading. You cannot do much, if anything, with the 20 most representative pixels (if there is such a thing), but you can infer highly valuable characteristics about the person. Yes, you cannot recreate the original data, but what you end up with is potentially much worse (more sensitive/private) than the original data.


That's not really true, and is kind of a fundamental misunderstanding of how these things work.


Unless the data is completely random it's not crazy to say that the data can be reconstructed from a reduced version.

If you have a million points that largely fall on a line in 3-dimensional space and you project them into 2 dimensions, you can easily recover that lost dimension with losses relative to the deviation. And that loss may not even matter depending on the kinds of data and margins of error you're working with.
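Roughly like this (a toy version of the above, with made-up numbers):

    # Points near a line in 3-D, projected to 2-D; the dropped coordinate
    # is recoverable up to the deviation from the line.
    import numpy as np

    rng = np.random.default_rng(3)
    t = rng.uniform(-1, 1, size=100_000)
    pts = np.stack([t, 2 * t, -t], axis=1) + rng.normal(0, 0.01, (100_000, 3))

    xy, z = pts[:, :2], pts[:, 2]                  # keep two dims, drop one
    coef, *_ = np.linalg.lstsq(xy, z, rcond=None)  # learn z back from (x, y)
    rms = np.sqrt(np.mean((z - xy @ coef) ** 2))
    print(f"RMS error of recovered dimension: {rms:.3f}")  # ~ the noise level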


This is actually a nice illustration of the central problem with this argument: the more personally identifiable a piece of information is, the less recoverable it'll be, and vice-versa. If all of the points of data are on some n-dimensional line, then obviously all of them can easily be recovered, but knowing all those things about a person doesn't actually tell you any more about them than knowing just one of those things. Conversely, if the points of data are very random then it'll only require a handful of points to uniquely identify a person and find the entry in the original data set with all their other information, but dimensionality reduction will have to throw that data away - you simply won't be able to recover that information from the model. (We actually know from the literature on de-anonymization that a lot of data falls into the second category.)


Except that that toy example bears no resemblance to the actual situation.


How many dimensions were they working with and how much variance and correlation was there in the features? What's the margin of error for the end product?


I don't know precisely, but it's pretty obvious that there'd be no way to reconstruct personally identifying info from it.


> "will society grant me rights to have derivative forms removed or adjusted too? "

I am in favor of no. Imagine I build a gender classification model off public tweets, and then you later delete your twitter account and demand my model not be used because it was trained off 'your data'.

I am in the camp that, so long as the data isn't traceable back to you specifically, it's fine - don't put any information out there you are not OK with sticking around.


I guess the question is: what are you trying to protect? The model is fundamentally lossy, as it is a rank reduction method, so your original data is gone (i.e. no one would be able to accuse you of liking a particular controversial post, just say that you are likely to like that post). So it sort of has the differential privacy thing going on. It is another question as to whether such models should be built at all. I think the fidelity of the models will answer that in time: if they work really well it is scary; if they are poor models they will cease to be used. I suspect that it will be in the middle, and highly sensitive to the quality of the original data and the quality of the implementation, like all ML applications.


To take a completely different approach in terms of 'derivative works':

Say we have a bunch of profile images, and then describe them in text. "Blonde, caucasian, large nose, curls, receding hairline, strong jaw, big ears", or perhaps even more specific stuff like "has a mole on the left cheek at the same height as the right earlobe", "right nostril is larger than the left", and "dimple in the chin".

Based on a description like this, we could identify an individual in probably a short paragraph. Nonetheless, on the data side, this is a lot less information than is represented by the raw pixels.

When it comes to the topic of CA's tools and 'psychosocial' targeting, we can't separate out the broader context and the way in which one single term can encode tons of data ("looks like George Clooney with a bigger forehead"). I'd argue the same principle applies to political views and personality.


Based on that description I'd say it's closer to taking a RAW format and converting it into a JPG, which contains most of the (human visual) relevant data from a DCT while removing the remaining "noise"... which is a bit different from simply resizing (eg subsampling). What I would find a bit satisfying about this description is that you probably wouldn't get away with claiming that a JPG of an image was un-copyrighted and only a BMP (of the same resolution) was.

To be fair, it does depend on the amount of compression before it is not recognizable, but if you can still squint and see the Mona Lisa (when you also have her phone #)... have you not violated her privacy?
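For what it's worth, the DCT intuition in miniature (a 1-D toy with invented numbers, nothing like actual JPEG code):

    # Keep only low-frequency DCT coefficients, invert, compare.
    import numpy as np
    from scipy.fft import dct, idct

    rng = np.random.default_rng(4)
    signal = np.cos(np.linspace(0, 4 * np.pi, 64)) + 0.1 * rng.normal(size=64)

    coeffs = dct(signal, norm="ortho")
    coeffs[8:] = 0                       # throw away high-frequency "noise"
    approx = idct(coeffs, norm="ortho")

    rel_err = np.linalg.norm(signal - approx) / np.linalg.norm(signal)
    print(f"kept 8/64 coefficients, relative error {rel_err:.2f}")
    # The error is mostly just the discarded noise; the visible shape survives.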


If I tell you I like furry porn, is there a way for me to make sure you forget that? This has lots of implications, many of them placing the "blame" on me for telling you this.


I think that's a fair point, but I question if it's an apt comparison to the situation with some big players of the surveillance economy.

Let me make that a bit more convoluted for you:

Let's say, very hypothetically, that you, like most of the general population, suck at things technological and just switch off mentally when someone mentions phrases like "social graph" or "Javascript", and I'm a semi-psychopath who plans to make a fortune by scamming dumb fucks like you. I take on my mask of sanity and most endearing nerd T-shirt and appear at your door to give you a FREE robot servant. Except to keep it FREE the robot servant is going to pause what it's doing sometimes and whisper subliminal messages to you from my sponsors. But you're not afraid of stuff like that, are you? And all you have to do is sign my brick of legal documents in complexified Legalese, which of course you don't have the mental stamina to read through. "But, hey", you think, "people are nice and trustworthy, and if there was something really bad going on here it would be illegal and punished, and besides, what's the worst thing that could happen", and I get your signature and you get your FREE robot.

If you had bothered to learn complexified Legalese and do your reading you'd have noticed you also just approved that the robot spends its spare computation cycles surreptitiously watching you and getting to know you, and one of the many things it does is glance over your shoulder when you get your porn fix, and in collaboration with my team of highly trained robot masters it concludes not only that you're into furry porn but also which particular furries really push your buttons. We catalogue this away for future use. Years later, in spite of your technical ineptitude, but maybe let's say because you're really a good people person, you've risen to become the highly respected mayor of your town. Then a business partner of mine who used to work covert operations over at CI6 but has now switched to lucrative private contracting comes to me asking for the files collected from your robot for an influence gig he's taken on from Toxico, and I sell him the data for a suitably juicy sum for a man of your stature. Ex-CI6-guy studies the file with interest and uses it to select two skilled furries he can tell you'll be incapable of resisting and sends them to cross your path at a representation dinner. They very convincingly persuade you to have them over the next weekend your wife's away, and it's WILD(!).

And of course "your" robot is carefully documenting the whole thing, which I also sell to my partner for an additional cut of the profits from the Toxico job. While you still wrestle with your conscience about whether last weekend was really a Good Thing, someone appears at your office door to propose you use your influence to switch the city energy supply over to a 90-year contract on Toxico's patented owl-burning power plant (with levels of carcinogen emissions they'll never be able to get us for!). Incidentally, that someone at your office door has probably also been chosen according to your robot file to be the kind of person you'd have a harder-than-usual time saying no to for one reason or another, but in the end you still refuse, because despite some personal weaknesses you're a decent man who values and protects your town, its people, and its environmental surroundings. And then you're informed that someone may have videos you'd really rather not become public. Unless you agree to the Toxico proposal, and do so generously, your comfy little life may meet with a sudden and radical change of fortune.

So who's to "blame" in this scenario? I'm sure there's a point where ignorance should be illegal, but I think generally society looks with some leniency on getting deceived. And winding back to the early parts, what does this do for your moral rights to have me remove your data? It was right there on page 200 in crisp, clear letters that you allowed me to watch your porn surfing habits and share that data and derived works with selected partners. You agreed to this! We haven't done anything that we're not allowed to do according to the contract.

...


That’s strangely specific....

Hmmm


This analogy immediately reminds me that with enough high res / low res pairs we can rebuild a high resolution image from its low res version with fairly good results. Wonder if the same could be done here.


Finally, I was waiting for someone to talk about the model itself. It makes sense that SVD or something like it (PCA, co-occurrence, etc) would be used.

But I also wonder what exactly you are going to do with the predictions. What exactly do you show to someone to make them more likely to go and vote if they are inclined to vote your way, or make them stay at home otherwise? Is there evidence that whatever you're showing actually works? Or do you try to change people's minds? What do you do?

Knowing the state of things - in this case, people's voting inclinations - is not the same as knowing what to do, i.e. having a strategy.

I don't know how effective it is, I'd like to learn more. But I smell the possibility that these CA type firms are simply selling snakeoil to desperate political activists.


One example I can provide is of gun control topics.

If you understand someone's mentality on the subject you can decide if they see:

1) An ad with someone breaking into a home and the homeowner defending themselves with a firearm (sell insurance?)

2) A grandfather and grandson on a hunting trip (hunting supplies?)

3) Or maybe gun violence hotline with powerful images.

The people seeing these ads are under the assumption that everyone else sees them too, not that they're specifically targeted at their personality type. The ads affect whether you think other people understand your issue or not, and thus affect your motivation and attitude.

If you see an ad that fits your mindset, you think you're on the majority side. This was powerful in classic media, it's just as powerful now.


> The people seeing these ads are under the assumption that everyone else sees them, not that it's specifically targeted at their personality type.

How long will that be true? Do people make that assumption about search results?


Outside of the tech bubble, simply saying "yes" would be disingenuous. They're not even asking the question in the first place


I think retargeting has thoroughly blown up the idea that ads online are shown to everyone. My non-technical acquaintances are very aware of why certain products follow them around the internet in ads.


But do people assume the same thing about, say, Google search results? Promoted posts on Reddit? The ads (or natural posts) on Snapchat or Instagram?

I agree that it's pretty obvious you're being retargeted when ads for camping supplies start showing up three days after you search for them on Amazon. But the practice of "personalization" of results and ads is far larger and deeper, to a degree that most people never seem to think about.


Agreed. Although we all hate it, if my mom searches for hard sided luggage on amazon and ads for it follow her to all manner of other sites - that’s the best way for non-tech types to get some of the idea here.

The truth is WAY worse of course, but she immediately knows the ads she saw won’t show for me as well.


There's an occasional thing in UK politics where some public figure X has a go at another figure Y for having offensive ads on their website, without realising that, because of targeting, the ads are driven by their own search results.


I think most people do assume the same with search results. How many do you think assume that the ads on the TV they see could be different than what the neighbor is seeing when watching the same channel with the same cable company? I think a lot of people assume that others see the same news and the way people act you'd think they assume that others see the same things in their newsfeed/timeline / facebook thing - and wonder how others could have a different view.

Even when I explain how ads can be different, I don't think people really want to believe it, or understand it, and they certainly do not realize the power of these targeting abilities.


Before Netflix stopped showing the number of stars next to content, I used to sort of depend on it for choosing a movie. In fact, I sort of miss it now, and spend more time sifting through content undecided. That's because I am clueless about movies. I believe that there are people who are as unsure about electoral candidates (as in, at a given day, they don't favor one candidate above others) as I am about choosing movies. When push comes to shove (my wife's irritation quotient above threshold) in terms of making a decision, an advertisement that someone saw couple of days back can definitely assist in making a choice at the split second.


> I don't know how effective it is, I'd like to learn more. But I smell the possibility that these CA type firms are simply selling snakeoil to desperate political activists.

According to the article:

"The accuracy he claims suggests it works about as well as established voter-targeting methods based on demographics like race, age, and gender....the digital modeling Cambridge Analytica used was hardly the virtual crystal ball a few have claimed."

It's pretty clear that they were selling snakeoil. In fact, the use of CA wasn't particularly helpful to anyone [1]...hiring them was just a prerequisite for obtaining campaign contributions from the Mercer family, who had put up the money behind CA [2].

[1] http://www.businessinsider.com/cambridge-analytica-facebook-...

[2] https://twitter.com/kenvogel/status/975756418128187393


It is reasonable to assume that a marketing message written with the profile of the targeted person in mind works better than a generic message.

In Facebook campaigns you can use certain things, such as users' interests, to select who sees your message.

I'm not an expert on Facebook analytics, but I believe you can get pretty good stats on how your campaigns are working, how much promoted posts get shared etc.

This sounds like the holy grail for advertising. You get to write your message for certain profile and get quick feedback how it worked. Even if the system is not perfect, you would have an advantage compared to somebody else who is spending the same amount of money and not using similar targeting.

Maybe their model also allowed them to find social influencers with many followers. Being able to target these people and get them to share your message would be really good.

The article compares this to the effectiveness of traditional voter targeting methods. I'm not sure what the parameters used on those are, but maybe all of them are not available on FB, justifying the need for something else.


>I don't know how effective it is, I'd like to learn more.

Here is an interesting Ted Talk which discusses an FB experiment that details how effective minor UI changes can be on voter turnout (13:40)

https://www.ted.com/talks/zeynep_tufekci_we_re_building_a_dy...


> What exactly do you show to someone to make them more likely to go and vote if they are inclined to vote your way, or make them stay at home otherwise?

Qualitatively: show things that get them angry.

Quantitatively: test and control pop splits.


How do you test anything? There's only one vote, you can't iterate.


Geography, at a first pass, over multiple elections. This is how TV testing works. Pick "similar" geographies and run your marketing in one. If the effect is large enough, it pops out. Not quite a diff-in-diff, but a start.

Or don't look at votes, look at candidate likes and shares over time, especially as they shift.

The defined metric doesn't have to be "propensity for this individual to vote for a candidate." It can be "percentage delta over untreated markets compared to prior campaigns."
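Back-of-envelope, the metric might look something like this (all numbers invented):

    # "Percentage delta over untreated markets compared to prior campaigns."
    treated_share   = 0.47  # candidate's share in markets that saw the ads
    untreated_share = 0.44  # share in comparable markets that did not
    prior_gap       = 0.01  # historical gap between these market groups

    lift = (treated_share - untreated_share) - prior_gap
    print(f"estimated lift from the campaign: {lift:+.1%}")  # +2.0%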


Maybe with polling?


Data is terrible, especially for polarizing candidates like Trump. People simply lie in public about not voting for him, afraid of backlash that they will receive.


They were going to vote anyway; nothing can tell you otherwise?


> Quantitatively: test and control pop splits.

How do you actually do this? Presidential elections come once every 4 years.


And there's a big question mark over whether lessons learned (ie parameters) from one election are valid for the next.

What if all the sensitivities are dependent on the length of the candidates' hair? It seems the total hair length of the two candidates was a maximum at the last election. Another time you might be sampling more towards the middle.


Door to door canvassers these days carry devices that tell you what topics to bring up and what topics not to bring up at a certain address, even distinguishing between individuals at an address; some are told to demand a husband let them talk to the wife, for example.


I don't know about the specific campaigns that you are referring to, but in my experience a lot of the information used in campaigns I've been involved in comes from previous canvassing sessions. Political parties in most countries are involved at many levels where there are elections. Canvassing doesn't just take place for the big elections.

One year they will have been round and had a lengthy discussion with Mrs X, but Mr X slammed the door in their face another time. This was somewhat lower tech: the information was printed out and attached to a clipboard.

Most of the time this information is correct. It's more interesting when it's really incorrect. That said, some of the best sessions I've been involved in were where there was no information.


Exactly. This is the old-fashioned approach to campaign targeting that Cambridge Analytica was trying (and failing) to replace: just send a bunch of volunteers to talk to them about who they're voting for and why, then put that in your big database. One of the dirty not-so-secrets about CA is that according to the Trump campaign, they were abandoned completely in favour of that old-fashioned approach because they were worse. Similarly, if you've been paying attention, you might have noticed a few insider stories about how one of the Hillary Clinton campaign's big screw-ups was underestimating the importance of that data compared to modern big data tech and basically throwing a lot of it in the trash. This didn't get nearly as much coverage as the idea that Cambridge Analytica, Trump, and Facebook were conspiring to brainwash the population, probably because it was less juicy a narrative and kind of embarrassing to the Clinton campaign and the DNC.


> demand a husband let them talk to the wife, for example.

Demand? That seems like a great way to get arrested or shot for trespassing.


Shortly after the election, I read something saying that the actual ads were targeted soundbites at specific demographics likely to vote Democrat, run shortly before the election with the intention of suppressing voter turnout.


So: negative advertising aimed at the core constituency of the opposition's voters, speaking to their deep-seated concerns about their candidate.

I could imagine this working on Dem voters who are wavering on Hillary with leads like "she thinks the TPP is the gold standard" etc.


I am really puzzled by the Cambridge Analytica scandal. It's not particularly savory, but is there something happening here that wasn't basically already known about how Facebook worked? By the protests of their own executive, the system was working as designed, and at worst Cambridge Analytica misled them about how they intended to use the data, right? There was no actual security breach here, as far as I can understand it.


There doesn't have to be a security breach for it to be a very bad example of using data collected in one way for a completely different purpose. It violates the 'lawful basis for processing' part of privacy legislation.


Which US legislation?


That app collected data on many more than just US residents so more than just US legislation applies. This is one of those pesky little problems of doing stuff 'on the internet', especially when you start doing stuff that is purposefully or accidentally illegal.

Besides that they apparently also used similar trickery in their consultancy for the Brexit side.

https://www.theguardian.com/politics/2018/mar/26/pressure-gr...

This is far from over.

https://www.theguardian.com/commentisfree/2018/mar/23/plenty...


Cambridge Analytica are in the UK. Which is why the Information Commissioner got a warrant to search their offices.


No. I know at least one political consultancy that does similar work in Spain, although all I know is that they work with data, use micro-targeting, and are run by a sociologist.


Who said anything about a security breach? Most of the controversy has been about the company influencing elections using data scraped from people (and their friends) unaware of what the data was being used for.


The degree to which it influenced the election is questionable. Despite all the headlines, I haven't yet seen any convincing analysis of the impact of facebook on the election (I'm not sure how one would even go about doing so). So far it seems like it's just a convenient vehicle for people that dislike the outcome of the election to express indignation.


This is obviously unmeasurable - there isn't convincing analysis because there can't be convincing analysis, as you admit.

The fact that people were willing to spend an amount of money that breached electoral law in the UK, and presumably even more in the US, suggests that there was some reason for them to do so. This happened only because experts in this field believed it would influence the outcome of the election.

That's your evidence.



Can you point to the part of that article containing evidence of to what degree they affected the election?


The admission by the company executive.


That's interesting. How would he know the degree to which he influenced the election? Believing any claims to somehow fact rather than plain old self-promotion seems rather naive, or am I missing something?


Even if we presume those sentiments are completely sincere and disinterested, I don't know why we should believe he is an authority on US elections whose claims can simply be accepted at face value.


I would tend to agree.

It's often not hard to convince people of something they want to believe.


It became a "problem" because it helped Trump win.


This answer ignores some known details of the story. Cambridge Analytica didn't just buy targeted ads from Facebook. They used a sockpuppet to release a fake "take a personality profile" app, which then allowed them to gather tons of data against the Facebook terms of use.

The CEO of Cambridge Analytica has also been recorded telling a (fake) potential client that they routinely blackmail people using prostitutes and who knows what else.

So unless you can show that Clinton's campaign was doing the same things, your claim is a false equivalence.


I find it hard to believe that the American public is up in arms about someone violating Facebook's terms of use.


This is exactly it. At least it stops the news from droning on and on about Russia.

I thought Clinton spent large amounts of money on data, and the Democrats admitted the data was bad - or at least that was their excuse. How much did CA pay for this data? I still find it crazy that the Trump campaign spent 30% of what Hillary did and still won. The Russians used $100k worth of ads to sway the election. This stuff doesn't add up.


Yes, that's a good point. Russia, Cambridge Analytica... anything that allows people to feel like the Trump phenomenon is a nefarious foreign import rather than homegrown. I'm no fan of Trump, but I'm incredibly dismayed that all the Democrats have talked about since he was elected is "Russian meddling."


Do you have any sources for your claim about how the Clinton campaign acquired FB data and how they used it? Was any of it acquired fraudulently and/or in violation of FB's ToS, like CA's data was?


Do you have any sources for your claim that the parent poster claimed the democrats purchased Facebook data?


This is a thread about how CA acquired and used data from Facebook, so I assume the parent comment was trying to make an apples-to-apples comparison. The alternative is that the poster was disingenuously trying to imply a false equivalence.


Sigh... the technical legality of obtaining the data is not the point of contention. Do you think that is what this is about, whether CA "broke the law"?



It doesn't have to add up. Most people are too busy with their real lives to manually search out reliable details (what we get from the media is not reliably unbiased or true), and then read and understand them, so they believe what they see and hear repeated over and over on TV, radio, and newspapers: the American President is controlled by Vladimir Putin. Even most smart people don't seem to care about actual evidence.


We can freely infer things just from reading his own Twitter feed. Such as his silence on the Salisbury poisoning vs. his instant reaction to other UK terrorist incidents.


Oh please. His administration threw out 60 ambassadors in response to this very incident, and this is hardly the first instance of them acting against Russian interests. This paranoid Manchurian Candidate stuff needs to die already.

https://www.lrb.co.uk/v40/n01/jackson-lears/what-we-dont-tal...


Except he didn't. He told Russia to replace their 60 diplomats. That's pretty much, in relative terms, the equivalent of a threat to bitch-slap you.


https://www.nytimes.com/2018/03/26/world/europe/trump-russia...

> Trump and Western Allies Expel Scores of Russians in Sweeping Rebuke Over U.K. Poisoning

> WASHINGTON — President Trump ordered the expulsion of 60 Russians from the United States on Monday, adding to a growing cascade of similar actions taken by western allies in response to Russia’s alleged poisoning of a former Russian spy in Britain.

> [...]

> On March 15, the Trump administration imposed sanctions on a series of Russian organizations and individuals for interference in the 2016 presidential election and other “malicious cyberattacks,” its most significant action against Moscow until Monday.

> [...]

> Mr. Trump has said that, despite its denials, Russia was likely behind it. “It looks like it,” he told reporters in the Oval Office on March 15, adding that he had spoken with Prime Minister Theresa May of Britain.

You have to wonder how far Trump has to go before something he does is considered hostile to Russia. Does he have to nuke Saint Petersburg?



From this you infer that he is under the control of Putin? He is pro Russia no doubt, but I don't think that is what's being asserted by the media. I'd prefer they stick to facts, do you disagree?


He's "pro-Russia" in the sense that he seems to have some sort of admiration for Putin's tough-guy persona, but I can't see much other sense in which that's meaningfully true.


I agree, but unfortunately that minor inconvenience won't stop newspaper reporters from writing evidence free articles implying the contrary.


Isn't the accuracy of the predictions kind of orthogonal to the fact that they were basically lying in their attempts to change behavior?

Using lies to convince someone to do something is going to be more effective than using truth, if that "something" is not in accordance with the truth.

The comparison with netflix really breaks down there. You're not going to be able to convince me that I liked Crash, so recommendations based off of that aren't going to be very useful to me.

But if you reinforce my false belief that Obama and Soros are gonna use the deep state to invoke Sharia Law on the 2nd amendment, then that might better convince me to vote for so and so.


When talk about CA first emerged on HN before the election, some posters found the original papers referred to. They were looking at pictures in the story and zoomed in to find the titles.

I cannot find those posts for the life of me again. Not suggesting anything nefarious here, I just can't find them. Does anyone have a link to those early conversations or make copies of the papers?

I made copies earlier but deleted them before I put them into my papers archive.



No, none of those are it. I believe the article came out before the election. There was a picture in the article of someone reading the paper CA was supposedly based upon. Someone zoomed in and found the paper.

I'll keep looking.


> has revealed that his method worked much like the one Netflix uses to recommend movies.

I'm not sure this is the model you want to emulate. The suggestions are terrible and continually getting worse.


It sure seems that way. It's almost as if Netflix is giving in to pressure from content owners (which now includes themselves) to downplay or even weaken their suggestions.

First, they got rid of those wonderful ranked lists that made us love Netflix in the first place, replacing them with the much more opaque cover art carousel view. Then they started mixing in lower-ranked items into the carousel. Finally, they switched from the five-star rating to the thumbs up and down buttons, which can't possibly give them as much information about your opinion.


Currently I have zero trust in their rating system. Actually, zero is the wrong number, because if I see a high ranking I now expect to dislike the suggestion. This is opposed to me previously trusting the system a lot.


I've heard rumours that the internal backlash at Netflix against 5-star ratings began with Amy Schumer's most recent comedy special, which had thousands of 1-star reviews on Netflix. It was one of their most expensive comedy productions, and its release came not long before the switch to the vague thumbs up/down.

Note, the comedy special was similarly panned across the press and social media as being repetitive of her previous work, with extremely predictable punchlines, and the seemingly good parts were later exposed as highly derivative of other comedians' work.

But regardless of the reality/honesty of her ratings, it was apparently not good for Netflix's business to let users destroy content they produced on their own web site via user-generated ratings.

So the suits (heavily swayed by their production studio and Hollywood) were able to convince the product team to hurt the UX for the 90% in order to protect the popularity of the 10% of content they own.

The truth may be good for consumers in almost all situations. But sadly the interests of executives dealing closely with high-value B2B partners and investors tend to outweigh the interests of the average user (not to mention far outweighing those of power users).

So I guess we have to rely on 3rd party IMDb web extensions which inject into Netflix in order to get honest ratings.


By "heard rumours" do you mean "read about it on Breitbart?".

http://www.breitbart.com/big-hollywood/2017/03/18/netflix-sc...

Notice that they're careful to say that they made the switch "amid" the special, not because of it. Also as far as I can tell, they have no actual data on the fact, and they're the only "newspaper" reporting it.


Nope... it was a Reddit self-post (incl. people who put together the viral compilation video of her joke 'borrowing', breaking down why she's no longer as popular, and analyzing Netflix's timing and stated rationale).

I'm curious why you brought up Breitbart? I just googled it and found a ton of other (non-political/right-wing) sites which drew the same exact conclusion between Schumer and the ending of 5 star reviews. Are you trying to say her thousands of awful reviews were somehow political? Or that it's all just some "alt-right" right-wing conspiracy?

- https://movieweb.com/netflix-cancels-5-star-rating-system/

- http://ew.com/tv/2017/03/16/netflix-star-ratings/

- http://collider.com/netflix-rating-system-thumbs-up/

- http://screenertv.com/television/goodbye-stars-hello-thumbs-...

- https://www.washingtontimes.com/news/2017/mar/17/netflix-cha...

I saw Amy's show live, which she later filmed, and it was just awful. Those reviews were highly justified. And I used to be a big fan of hers before anyone knew who she was.

If Amy Schumer wasn't the original reason they started to ditch 5 stars, then it was most certainly a motivating factor in getting the change pushed out (conveniently right after her special was famously panned, which got widespread press before the switch to thumbs), as it most certainly also faced internal resistance, since they were basically abandoning a decade of 5-star reviews via user-generated content in favour of a vague thumbs up/down.

Amy provided a perfect example of the inconvenient and conflicting goals which Netflix has as a producer and content platform. The timing of the release would have been HIGHLY coincidental if it wasn't related in some fashion.


Outside of the reddit echo chamber, some people like Schumer, some don't, and most don't care.


Then I guess those 5-star ratings must have been some Trump-supporter conspiracy brought to you by Breitbart and Steve Bannon... not the fact that she went off a cliff talent-wise when she moved to LA. Which was more than apparent even to non-fans of hers. And I say that as a big fan of her earlier work and as someone totally indifferent to US politics.


It's a hard problem but Netflix's model represents the state of the art in machine learning for recommendations. Still scared of the singularity? :)


There's a problem with all ML-based content recommendation, whether it's advertising, movies, whatever: the people who make the content have strong opinions about what should get shown to whom and when. If those people have negotiating leverage over your organization, your business team will compromise your recommendations to placate them. That means that after you've built this beautiful model that minimizes whatever cost function you've chosen, a bevy of business rules and content-specific score adjustments will be overlaid. Over time, this shaggy, bad-assumption-laden system will dominate the user experience, and unless your management has a vested interest in maintaining the integrity of the recommendation system, your model will eventually be lost in a sea of human-generated noise. Not that I'm bitter about this or anything.
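
To make that concrete, here's a minimal, purely hypothetical sketch (not Netflix's actual system; all titles, rules, and numbers are invented) of how post-hoc business-rule overlays can bury whatever the model learned:

    # Hypothetical: model scores get buried under business-rule overlays.
    model_scores = {
        "Indie Gem": 0.92,          # what the trained model actually predicts
        "Own Production A": 0.55,
        "Partner Title B": 0.61,
        "Back-Catalog C": 0.78,
    }

    business_rules = [
        # (predicate over title, score adjustment, reason)
        (lambda t: t.startswith("Own Production"), +0.40, "promote first-party content"),
        (lambda t: t.startswith("Partner"), +0.20, "contractual placement"),
        (lambda t: t.startswith("Back-Catalog"), -0.10, "license expiring"),
    ]

    def final_score(title):
        score = model_scores[title]
        for predicate, adjustment, _reason in business_rules:
            if predicate(title):
                score += adjustment
        return score

    ranked = sorted(model_scores, key=final_score, reverse=True)
    print(ranked)
    # ['Own Production A', 'Indie Gem', 'Partner Title B', 'Back-Catalog C']

Note that the model's top pick no longer ranks first; with enough rules, the ordering the user sees has little to do with the cost function anyone optimized.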


Personally, I used to get substantially better results. It would suggest a movie I'd never seen before with a score above 90, I'd watch, and enjoy. Now I'm getting things like kids' shows suggested to me. I double-checked my history to make sure no one had watched one on my account, but they keep popping up with a high percentage. I also really like horror and get weird suggestions [1] for similar movies. After I watched Alien it suggested the Great British Baking Show. I literally have zero trust in the new ranking system. They push certain shows way too hard, and TV shows I'm watching show up on page 3 of "continue watching".

This is far from my previous stance of "Huh, movie I've never seen before and in a different language? It's 95% so yeah, I'll give it a go."

And before someone says anything, I do vote up and down quite frequently. But I can't help noticing that my suggestions were better when I had the more nuanced star system.

[1] https://i.imgur.com/MOt3XlL.png

[1.1] How I'd rate these. Cars 3: not really interested. Liked the first though. Stitches: idk, doesn't look appealing. Teeth: Classic cult film but yeah... Big Mouth: I have ZERO interest in watching this, please stop suggesting. I have downvoted this! Waterboy: I like it, but far from 95%. I'll give it like an 80.


The coverage of the Netflix ratings format switch has always been frustrating to me because what people are complaining about (you're not the only one; https://www.polygon.com/2017/4/7/15212718/netflix-rating-sys...) was predictable from the beginning.

I do research in this area, and it's fairly well established that when you go from something like five points to two points with ratings, you throw away tons of information. There are diminishing returns as the number of points grows, but as you go lower you lose information (see the back-of-the-envelope sketch below).

The "ratings don't matter because what you want is implicit signals from peoples' actual behavior" is also disingenous because the rating behavior is a behavior that's directly tied to the stimulus in question. Not saying that indirect behavioral correlates aren't useful, only that the rating is a very powerful, direct correlate that tends to be very specific. Going back to the topic of the thread, sure, all those Facebook likes are going to be useful in predicting how much you like a candidate, but you're sure as hell going to get a lot of information by just asking them "on a scale of 1 to 5, how much do you approve of X?"


I mean it is pretty obvious that the loss of information is going to make it harder to predict. My big problem is that the system suggests completely off base content. Not to mention overly pushing their own content.


Since they switched from 1-5 star ratings to like/dislike, it's gotten much worse.

I honestly thought the 1-5 ratings were not useful, since I either like or don't like movies; I am not interested in nuances. But it's not working out as I expected.


Really? Because in terms of actual usefulness, I find YouTube's suggestions to be better...


Spotify is king IMO. Their 'discover weekly' playlist turns up gems every time.

But that's because they dive into other users' playlists that contain the same music you play. Pretty simple (see the sketch below). I assume YouTube does something similar.

If Netflix had playlists or a 'want to watch' feature I bet their recommendations would improve.
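
If that co-occurrence intuition is roughly right (I don't know Spotify's actual pipeline; the tracks below are invented), the core idea is simple to sketch:

    from collections import Counter
    from itertools import combinations

    playlists = [
        ["Track A", "Track B", "Track C"],
        ["Track A", "Track B", "Track D"],
        ["Track B", "Track C", "Track E"],
    ]

    # Count how often each pair of tracks shares a playlist.
    co_counts = Counter()
    for pl in playlists:
        for a, b in combinations(sorted(set(pl)), 2):
            co_counts[frozenset((a, b))] += 1

    def recommend(seed, k=3):
        # Rank other tracks by co-occurrence with the seed track.
        scores = Counter()
        for pair, n in co_counts.items():
            if seed in pair:
                (other,) = pair - {seed}
                scores[other] += n
        return scores.most_common(k)

    print(recommend("Track A"))
    # [('Track B', 2), ('Track C', 1), ('Track D', 1)]

A real system would normalise for track popularity and playlist length, but the "people who playlist X also playlist Y" core is about this simple.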


I tend to agree with you. The only thing I dislike about it is when I open a link I've randomly been sent, or one I notice in an article, for something unlike what I normally enjoy (or something I close right away); then, for the next few days, or until I watch a bunch more of what I usually enjoy, the entire suggestion list is only things to do with that one random link.


I don't think it's possible on a phone, but in a browser you can remove videos from your watch history. Click the three dots next to the video for options and select "not interested". Google actually seems to take these into account. I had watched an Alex Jones video that was linked from elsewhere, and while it's good not to have too much of a bubble, there are just certain things I don't need to see more of.

Or, just use a separate browser. I typically only use Opera for YouTube and anything where I don't mind Google tracking me, but if I open a YouTube link in Firefox I'm not logged in (and with Opera's VPN enabled it doesn't appear to affect YouTube recommendations). I'm sure Google does correlate traffic between the two to some extent, but this seems like the only useful way to use Opera's integrated VPN.


YouTube does better than Netflix for me, but it still suggests stuff that I have already watched. Sometimes stuff I literally watched an hour ago. And I watched one Joe Rogan episode and it will not stop suggesting it to me, yet it fails to notify me about channels I watch every episode of, like 3Blue1Brown, Robert Miles, or Rare Earth.


There's a big confound with the catalog: Netflix's streaming catalog is much smaller than their DVD selection was, and it changes as licensing deals expire. No model can make up for that entirely.


I use the DVD one and I do find the suggestions are mostly pretty reasonable ones.


How hard can it be? Any film that I didn't watch all the way to the end ought to be a strong signal that I didn't enjoy it, not that I want to watch other movies just like it.


OTOH, that you even chose to watch a movie is a sign that you are interested in the genre. Just because I abruptly stopped watching “Star Trek IX” doesn't mean that I have abandoned the sci-fi genre, or even Star Trek.
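
Both signals can coexist. A toy sketch (my own invention, not anyone's real system) that treats abandonment as a title-level negative while still crediting the genre-level interest implied by pressing play:

    def implicit_feedback(completion_ratio):
        """Map watch-completion to a title score in [-1, 1]."""
        if completion_ratio < 0.25:   # bailed early: strong dislike
            return -1.0
        if completion_ratio > 0.90:   # finished: strong like
            return 1.0
        return (completion_ratio - 0.25) / 0.65 * 2 - 1

    def update_profile(profile, title, genre, completion_ratio):
        profile.setdefault("titles", {})[title] = implicit_feedback(completion_ratio)
        # Pressing play at all nudges genre affinity up, even if you bail.
        genres = profile.setdefault("genres", {})
        genres[genre] = genres.get(genre, 0.0) + 0.1

    profile = {}
    update_profile(profile, "Star Trek IX", "sci-fi", 0.15)
    print(profile)
    # {'titles': {'Star Trek IX': -1.0}, 'genres': {'sci-fi': 0.1}}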


I have a friend who shares my account since I stay with him when I am in the UK. It isn't worth the effort to have two profiles.

So, he likes horror movies and will watch any horror regardless of any signal that it's going to be poor. You can look at the viewing history and see he rarely goes beyond 5 minutes of watching any of them.

I watch Netflix regularly throughout the year, he watches in phases that last a week or two and then nothing at all for months at a time (for reasons that should seem obvious by now).

As a non-horror movie aficionado, I can say with certainty that only 2% of horror movies are ever worth watching and only 50% of these are any good. As a consequence, my personal viewing history includes almost no movies in this genre.

My favoured genre is drama and I normally watch all the way through.

You should be able to guess by now that I should rarely be recommended horror movies but, alas, Netflix thinks otherwise.

Btw, I also rate the movies I watch; my friend doesn't.


I’m confused — you honestly expect Netflix’s model to have figured out that your profile is actually 2 people based on what you believe to be regular and obvious cyclical patterns? I would have to imagine that this is a relative edge case for Netflix, and there is no obvious answer for what to do with someone who mostly likes drama but for some reason, goes on a periodic horror binge.

I assume that Netflix's model is built on the premise that profiles are, in fact, very easy to create. I have separate profiles for my parents, as well as a test profile to see what happens when a user only seems to like the “Human Centipede” trilogy.


2 people, but only one consistent regular user. And my recommendations are heavily biased towards the occasional user, the one who is also giving very strong negative feedback?

It doesn't bother me since I know what I like. There aren't so many good films that I'm not going to find them anyway.

Someone, or a group of people, is being paid for nothing, though. I don't know anyone who subscribes to Netflix because of their recommendation algorithm.


> as a test profile to see what happens when a user only seems to like the “Human Centipede” trilogy.

And? You can't just leave us hanging on that.


After creating the account and immediately giving a thumbs up to the trilogy, I've only occasionally logged in and feigned "interest" by clicking on the movies as if I'm about to watch them, or that I enjoy re-reading the synopses. The recommendations are all normal and not noticeably feces-related. But maybe I haven't yet met the threshold for the model to consider me a particularly engaged (or real) user.


I feel like watching 20 minutes of a movie and then downvoting should be a strong indicator. This does not appear to be the case. In fact, it seems to be more likely to appear as the first suggestion when I do this.


Yes, it's a hard problem, but no, Netflix recommendations are nowhere near state of the art. They were at the time of the Netflix Prize, but since then the field has advanced a lot, while Netflix has only dumbed down their recommendation models.


In my anecdotal experience, it seems like Netflix's own content gets weighed disproportionately high. This could be a result of the fact that first-party content never gets removed, so it builds up a more complex graph of recommendations, but personally it feels like they're trying way too hard to cram their content down my throat -- to a point of recommending shows that I would never even consider watching, like Jane the Virgin (not exactly my kind of humor). In the end it just feels like they've compromised their own recommendations to push their own product, which makes me mistrust all of their recommendations.


Are you saying all of Netflix suggestions are getting worse? How are you measuring that?

Or are you saying your netflix suggestions are getting worse?

Mine are pretty good and have held steady for a while, at least in terms of my own preferences, though I am also probably using it a little less as I've got Prime and Hulu now as well, so there are probably fewer times I'm randomly searching through Netflix and finding nothing.


I can only tell you what my personal experience is. I laid it out more in another comment under rjurney's reply.

I'll also add that I am more frequently searching for 15 minutes then switching to another service. I used to find a movie to watch in 5 minutes. I am a big movie person too, and will watch most things. But I am also more aware that if I watch one show that is just "meh", then I am going to be bombarded with shows of similar quality for the next few weeks.

Also, there is a fairly obvious pattern that shows up from the movies in "my list". Those do not seem to be weighted more heavily.


It is surprising how bad the recommendations can be sometimes though – I avoid watching comedy specials on Netflix because if I watch one, then suddenly I must want to watch every single comedian ever. It's reminiscent of buying something on amazon, then seeing tons of retargeting ads for the same thing. Oh well, like I said, in general, diversifying my streaming sources has helped a lot.


Besides the data, it's interesting how the actual targeting was performed.

Does Facebook provide an option to show a given ad to a given user? Or is it only possible to select a group of people with a given set of likes? How fine-grained is Facebook's audience-selection mechanism for ads?

Or was the targeting performed by creating fake groups, befriending people?


Am I understanding this correctly? Facebook user data (likes/profile info) was scraped to produce low-dimension feature vectors for users (similar to word2vec). These feature vectors were then run through some ML model to predict...what exactly? Targetability for effective political ads?


They used it to predict political affiliation of people that don't explicitly state a party preference.

The two parties already have a list of registered party members (and they can see who on Facebook explicitly states their party preference), for those members the main goal is higher turnout (they are the training data). The other voters they're interested in are unregistered (e.g. independent) voters that are likely to be on their side ideologically.

The core idea is very simple, they believe that if someone says they're independent, but their preferences/features (age, gender, location, likes, posts) are predicting moderate or high likelihood of $PARTY affiliation, then showing this person political ads may move them from 'maybe vote for $PARTY' category and get them in the 'definitely vote for $PARTY' category.

If you have continuous access to new Facebook data as you're serving ads, you can verify your ads are working on an individual basis by checking the $PARTY-affiliation 'score' predicted by your model before and after an ad (I want to stress that this can be done on an _individual basis_). The likely sequence of events is that they did A/B testing on different kinds of ads and found that fake inflammatory ads were most effective at moving this score in a very measurable way; the resulting media/political atmosphere is collateral damage (hopefully unintended).

Source: I am a data scientist / machine learning scientist, and this is how I would do it, and how it appears others did. I don't work on political data, but I have worked on personalized recommendations, which are similar.
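
For the curious, here's roughly what that pipeline could look like. This is my own sketch under the assumptions above (registered partisans as labels, likes as binary features); the data is random noise purely to show the workflow, and none of this is CA's actual code:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Rows = users, columns = binary "liked page X" features (invented).
    X_registered = rng.integers(0, 2, size=(1000, 50))    # known party members
    y_registered = rng.integers(0, 2, size=1000)          # 1 = $PARTY
    X_independents = rng.integers(0, 2, size=(200, 50))   # no stated preference

    model = LogisticRegression(max_iter=1000).fit(X_registered, y_registered)

    # Predicted probability of $PARTY affinity for each independent.
    scores = model.predict_proba(X_independents)[:, 1]

    # Target the persuadable: likely sympathisers who aren't locked in yet.
    targets = np.where((scores > 0.5) & (scores < 0.9))[0]

    # With fresh like data after an ad campaign, re-scoring the same rows
    # shows, per individual, whether the ads moved each $PARTY score.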


Did the app really grant them continuous access to user info? I thought it was a one-off thing - they get your data at the time of use (and your friends') and that's it.

Plus the approach you outlined would require the user to like/dislike things based on the ad they saw, so CA can observe a change in the predicted affiliation (they didn't have access to posts as far as I know). I don't think it would have that effect (even if the ad influences you, I doubt that it would make you go unlike Obama's page, for example). Not to mention that in all likelihood you shouldn't be able to verify that a particular ad was shown to a given individual.

I suspect it was a simpler use case - they would group users into segments, and then craft different ad strategies for each one (maybe based on other research or just expert opinion).


> I suspect it was a simpler use case - they would group users into segments, and then craft different ad strategies for each one (maybe based on other research or just expert opinion).

It is in this last process that things become individual-based: the A/B tests are done per individual, as a function of the specific strategy applied to him/her.


It seems like the purpose was narrowly tailoring messages, which is something political campaigns are really keen to do now (Obama's campaign was kind of a trailblazer here, right?).


> Obama's campaign was kind of a trailblazer here, right?

There's a pretty big gap between using and abusing social media, and as far as I know Obama's campaign did not 'narrowly tailor messages'. They did target broad groups using generic messages, and they did quite effectively use their social media presence to build support.

But they did not - as far as I know, so please correct me if I'm wrong - go so far as to single out individuals or really small groups with the express intent of flipping their votes or targeting them with disinformation in order to try to stop them from voting.

And Cambridge Analytica seems to have been doing just that if the currently available information is to be believed.


https://devumi.com/2017/12/social-media-case-study-how-barac...

> The former president also hired Facebook co-founder Chris Hughes to help in developing his social media strategy. Obama furthered the use of Facebook for his 2012 re-election bid, utilizing it to encourage young people to cast their votes. His team developed a Facebook app that looked into supporters’ friends list to find younger voters. The team then asked supporters to share online content with these voters. More than 600,000 supporters responded to the call, sending content to over 5 million contacts.

> During his presidency, Obama continued to use Facebook to reach out to the public. In 2016, he became the first president to go live on the site, just before his final State of the Union Address.


Yes, that pretty much confirms what I wrote above. Your point being?

Please read the article and compare what we know about Cambridge Analytica vs what the Obama campaign did; it is like comparing snipers with someone setting off fireworks.


Look, I don't think anyone can realistically doubt that Obama's campaign was the first to effectively slice-and-dice the electorate and use social media to target them. You're arguing against a much more expansive claim than I'm making.


You used the word 'narrowly', and in the context of a post about Cambridge Analytica that word has a pretty specific meaning.


No it was Hillary's campaign in 2008 that was big on "microtargeting".

i.e., moving beyond "soccer moms" or "defense dads" to "soccer moms with one kid and expensive tastes".


[ Deleted. Nothing I say on HN ever matters. Move along ]


"What sort of lies" is pretty hand-wavy when it comes to labeling training data for a model. Are you summarizing from a source? I'm interested in the technical details of what happened.


Very few "undecided" voters truly are; elections are won and lost by getting your supporters to go to the polls. So if you wanted to use scurrilous, fake news to help your candidate, you'd be better off sending stories that will get your supporters really fired up and eager to vote and get their friends to vote, not trying to persuade the practically nonexistent undecided demographic.


Are you able to specify the sources supporting these claims, namely that elections are not decided by "undecided" voters but rather by pushing your supporters to the polls?


I thought this was common enough knowledge not to need citations, but I think you will find these satisfactory.

http://www.stat.columbia.edu/~gelman/research/unpublished/sw...

https://www.politico.com/magazine/story/2014/01/independent-...

https://www.thenation.com/article/what-everyone-gets-wrong-a...

The last piece has a short summary of the salient point:

> In fact, according to an analysis of voting patterns conducted by Michigan State University political scientist Corwin Smidt, those who identify as independents today are more stable in their support for one or the other party than were “strong partisans” back in the 1970s. According to Dan Hopkins, a professor of government at the University of Pennsylvania, “independents who lean toward the Democrats are less likely to back GOP candidates than are weak Democrats.”

> While most independents vote like partisans, on average they’re slightly more likely to just stay home in November. “Typically independents are less active and less engaged in politics than are strong partisans,” says Smidt.

> [...]

> The conventional wisdom holds that the parties need independents to win general elections, but the reality is that they’re increasingly devoting their resources to getting their own voters—including their “closet partisans”—out to the polls rather than trying to sway the dwindling number of genuine swing voters. “We’ve seen a huge increase in technology and the ability to turn out the vote,” says Smidt. “So in terms of a cost-benefit analysis, the parties and candidates see that it’s much easier to turn out people who agree with them than it is to change someone’s mind. And then there’s also the question of how many of us are even open to changing our minds.”


Very interesting article, but I wish it went one step further. Why does it matter that Cambridge Analytica knew a user's Big Five or that they were an old, uneducated Republican? How was this (inferred) data used?

I assume they wrote/created different ads for different sets of users... but how many segments did they have? Did their graphic designer build 500 different ads, or was text/images dynamically inserted based on these variables? How did they figure out which message would resonate with each segment? How did they test something like this, with so many potential variables? Was this knowledge used only on facebook, or across all digital channels? Was it implemented in non-digital channels as well?

I'd kill to have access to their campaign set ups.


[flagged]


We've banned this account for crossing repeatedly into personal attack and ignoring our requests to stop breaking the site guidelines.

If you don't want to be banned, you're welcome to email hn@ycombinator.com and give us reason to believe that you'll follow the rules in the future.

https://news.ycombinator.com/newsguidelines.html


Oh, I forgot you were the one who called me mind rapist for working in advertising. I'm tempted to do an AMA here so people can hear the perspective of an insider (not to sway opinions necessarily, but to quell some misconceptions and understand how we got here/how things actually work)


And we can't blame the companies, because we agreed to have our data monitored when we installed the apps and granted permissions. Facebook and Google have taken so much of our data and know more about our behavior than we do ourselves. Even if I demand my data be removed, I don't know if they will 'actually' delete it.


Who says we can't blame them?


Yeah yeah, they only tried SVD on a multimillion-dollar dataset. Psy-ops, Netflix, same deal, what is all the fuss about! Strange that a research scientist was released from his NDA a couple of days after the Channel 4 documentary.
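
For anyone who hasn't seen it, "trying SVD" on a likes matrix looks something like this (toy random data; the point is only that the low-rank factors survive deletion of the raw matrix):

    import numpy as np

    rng = np.random.default_rng(1)
    likes = rng.integers(0, 2, size=(500, 300)).astype(float)  # users x pages

    U, s, Vt = np.linalg.svd(likes, full_matrices=False)

    k = 20                              # keep only the top-k components
    user_factors = U[:, :k] * s[:k]     # compact per-user representation

    # Delete `likes` and the raw 500x300 matrix is gone, but the
    # k-dimensional factors still encode its dominant structure.
    approx = user_factors @ Vt[:k, :]
    print(user_factors.shape, approx.shape)   # (500, 20) (500, 300)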


I'm excited for GDPR. The hard part is going to be getting the truth out of these companies about the actual extent of the data they hold on us.


The EU should do a unionwide ad campaign pointing out the fact that the much-maligned eurocrats have actually been working for years to fix this very difficult problem that's only now becoming apparent to the wider public. The timing couldn't be better as GDPR goes into effect just after the Facebook/Cambridge scandal.

Unfortunately all EU institutions are terrible at marketing. If they did an ad campaign, it would probably be a TV commercial showing Jean-Claude Juncker giving a speech with subtitles in 15 languages.


That "hard part" is going to be more than just hard; I think the word you're looking for there is "impossible". They don't have any way of knowing what data Google, Facebook, Amazon, or any other company has. As this article does a credible job of explaining, you have to understand a fair amount about statistics (PCA) and machine learning to even know if if you were looking at it, and they won't know where to look for it. They have no enforcement mechanism in mind, and they passed a law anyway, which effectively means "you can't admit to having this", which will mean that the more willing a company is to lie, the bigger an advantage they will have over their competitors.


Sure. But at least when a whistleblower (someone who "just" works for the company) comes forward, there is a law which can be used to call to court the people who are legally accountable for the company.


This is the standard way of analysing this kind of data, and I'd be very surprised if the Obama campaign didn't use the same or very similar methods with the Facebook data they obtained. The only difference is that Cambridge Analytica managed to obtain much more data over a wider demographic.


archived for future reference http://archive.is/dMIcN



