Mozilla blogger bought 1 million Facebook entries (full name, e-mail) for $5 (talkweb.eu)
182 points by dbcooper on Oct 23, 2012 | 106 comments

I suppose any 15-year-old with programming skills can do this. Create an app (kill zombies!) and then, when the popularity dies down, sell out your users' info to the highest bidder.

Kinda sensationalist at the end - "DO YOU STILL FEEL SECURE?" Uh yeah, I do. I get hundreds of spam emails a day that Google puts into a little folder for me to erase. Go ahead and email me, and join that exclusive folder right away.

Oh you want to send me a private message on Facebook? Facebook is kind enough to put messages from non-friends in a special folder too, and I never check that.

I feel secure.

Do you even need an app for this level of information? I feel like a crawler could look for this information for users with low privacy settings on their accounts by successively going through friends' friends' friends' ... about pages.

edit: refuted by Permit's comment below: http://news.ycombinator.com/item?id=4688893

Sure, the only "security" that "hiding" a Facebook user's information provides is to ... Facebook. They are secure from users switching services and emailing all their friends about it.

If the sender adds the required headers to his email, then it is very likely that Google or any other email service provider will not consider the email spam.

Regarding Facebook, I partially agree with you. Sometimes a message from a non-friend goes into the other folder, and sometimes I can see it in the regular mail folder.

Proper headers such as DKIM, SPF, etc. are used to make sure a spammer can't send mail pretending to be someone else.

All spam systems still look at the content of the message plus the reputation of the IP/domain when determining if a message should be marked as spam or not.

Don't tell the spammers making millions of dollars about that header trick.

Seriously though, I don't think a couple of headers will cause an email to become not spam. Google's filter is probably Bayesian, and considers the content of the message, the matching between the originating server and the reply-to address, the spam history of the originating server, any links in the message, how many times it sees the exact same message, how many people mark it as spam, etc.

So maybe it will get through to the first few hundred people, and then they'll block it.
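To make the "probably Bayesian" guess a bit more concrete, here is a toy naive Bayes scorer. It is only a sketch: the training messages are made up, the function names are hypothetical, and nothing here is claimed to be how Google's filter actually works. A real filter would also weigh the other signals listed above (server reputation, links, user reports).

```python
import math
from collections import Counter

def train(messages):
    """Count words per class from (text, is_spam) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for text, is_spam in messages:
        class_counts[is_spam] += 1
        word_counts[is_spam].update(text.lower().split())
    return word_counts, class_counts

def spam_log_odds(text, word_counts, class_counts):
    """Log-odds that text is spam; positive leans spam, negative leans ham.

    Assumes the training data contained at least one message of each class.
    """
    vocab = set(word_counts[True]) | set(word_counts[False])
    totals = {c: sum(word_counts[c].values()) for c in (True, False)}
    # Start from the prior odds of spam vs. ham.
    score = math.log(class_counts[True] / class_counts[False])
    for word in text.lower().split():
        # Laplace (add-one) smoothing so unseen words don't zero things out.
        p_spam = (word_counts[True][word] + 1) / (totals[True] + len(vocab))
        p_ham = (word_counts[False][word] + 1) / (totals[False] + len(vocab))
        score += math.log(p_spam / p_ham)
    return score

# Made-up training data, purely for illustration.
training = [
    ("cheap pills buy now", True),
    ("buy cheap watches now", True),
    ("lunch meeting at noon", False),
    ("notes from the meeting", False),
]
word_counts, class_counts = train(training)

print(spam_log_odds("buy cheap pills", word_counts, class_counts))
print(spam_log_odds("meeting notes", word_counts, class_counts))
```

With this toy data the first message scores above zero (leans spam) and the second below zero (leans ham). The point is just that content alone already moves the score, whatever headers the sender adds.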

Wow, five bucks.

As the saying goes, if you're not paying for the product, you're the product. The new twist here is that the product (i.e., your Facebook info) is now being sold in the open market for only $5.00/1,000,000, or $0.000005 per person.
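For what it's worth, the per-person figure checks out. A quick sketch of the arithmetic, using only the $5 price and the 1,000,000 entries from the headline (the variable names are just for illustration):

```python
from decimal import Decimal

# Figures taken from the headline: $5 for 1,000,000 entries.
price_usd = Decimal("5.00")
entries = 1_000_000

per_entry = price_usd / entries      # price of a single entry
per_thousand = per_entry * 1000      # a common unit for pricing lists

print(f"per entry: ${per_entry:.6f}")
print(f"per 1,000 entries: ${per_thousand:.3f}")
```

Decimal is used here so the tiny fractions stay exact instead of picking up floating-point noise.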

Prices normally go down only when supply exceeds demand, so the inescapable conclusion is that there's abundant oversupply of this product in the open market. Yikes!

Your implication is that Facebook themselves sold this data. The original article says that it was from apps that people installed from a 3rd party, not Facebook.

Doesn't matter. The company that got the data then sold it on for $5, which gives us an indication of how much they had to pay [1] to get the data from FB in the first place.

[1] I'm using "pay" loosely here. I have no idea what they had to give up or produce in order to get the data, but presumably FB received something of value.

I imagine this data came from the "allow this app to access XYZ of your information: YES/NO" thingy that pops up when authorizing apps on these social media platforms. The only payment to FB that I can think of would be in the form of marketing costs (or does FB have a developer membership cost like Apple?)

Most likely that's true. In this case the company dedicated time and resources to making an app, and Facebook is enjoying some of the fruit of those efforts. And the fact that this kind of data can sell for $5 gives us an indication of the value of those efforts. [1]

[1] Not necessarily a good indication, as this may be a last-ditch effort to convert _some_ value out of their app development efforts, and who knows to how many buyers the seller has sold this data.

>does FB have a developer membership cost like Apple?

No, it doesn't cost anything to create a Facebook app.

>Prices normally go down only when supply exceeds demand

I don't think that rule applies for digital goods where the cost of reproduction is zero. The supply is infinite.

Nitpick: Cost per unit of digital goods is low, but not zero. Servers, bandwidth, CC fees, sysadmins, etc.

Not a nitpick, because the cost to enter the data by the user is also not zero (thus supply is not infinite either). These are subtle but not trivial things to keep in mind when dealing with massive scale (a billion users, etc.).

Those are costs, but they're not marginal costs. They have no bearing on the cost of each copy, nothing to do with scale.

They're absolutely marginal costs, if you look at it the right way.

Takes more servers and fatter pipes to support 10,000 downloads a day rather than 500.

For something like a list of a million user names and email addresses? You put it on pastebin and set up a script to email out links to it when you get a PayPal payment confirmation email. The only cost is to acquire the data; once that is done, there is zero cost.

If you want to talk in totally abstract terms, digital goods in general tend to have marginal costs associated with them. In the context of this discussion, there is no supply and demand factor, there are no marginal costs, and there is no market force called scarcity.

Let N = (1000 items of unique information)
Let W = (2000 items of unique information)

2(N) does not yield W, regardless of cost to copy (N).

To get W, you will need to do something more. This will not be cost-less. That's the more general case.

That's not what marginal cost means to a supplier. The question isn't whether it costs more to acquire 2000 email addresses than it does to acquire 1000 email addresses, the question is whether it costs more to distribute to twenty buyers than it does to distribute to ten.

Thus, the cost of hosting is a marginal cost (probably zero in this world of pastebins and digital lockers). The fee taken by the payment processor is a marginal cost. The cost of finding twice as many emails is not.

No, a "supplier" has to pay for all of his raw materials costs. That includes inventory costs as well as distribution. Of course you can always restrict your timeframe and assume away this cost (inventory as already incurred), but this is not true in the general sense. In particular, if this is true by assumption, then there is a limited supply by deduction. If you increased your supply [of information bits, not duplicate bits], you would have to pay to incur inventory at that margin precisely. So you never have zero marginal cost and unlimited supply together; that makes no sense.

notatoad wrote:

>I don't think that rule applies for digital goods where the cost of reproduction is zero. The supply is infinite.

To sum up, "the cost of reproduction" is <not> the cost of "supply", unless the supply is assumed fixed. Thus the second sentence does not follow per se.

I don't think you are understanding. Of course there are big costs in acquiring more product to sell. The question is: Do you have to pay those costs for each customer, or can you pay them once and amortize the cost over many sales?

For example, Adobe Photoshop probably costs a lot to design. It has really high fixed costs, because you need to hire good developers and implement a bunch of advanced operations. However, once Adobe pays the fixed costs, the marginal cost of Photoshop is pretty minimal: packaging, printing a DVD, maybe some marketing. It still costs a lot because the fixed costs are so high, and there's not much competition.

Conversely, a plumber has relatively low fixed costs: a truck, some tools, and some training. But plumbers also cost a lot, and this is because they have really high marginal costs: they have to spend an hour at the house of each and every customer.

So I agree with you, there may be high costs in acquiring email addresses to sell. My point is that they are in no way marginal costs.
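The fixed vs. marginal distinction above can be sketched numerically. This is only an illustration: the dollar amounts are invented, and average_cost is a hypothetical helper, not anything from the thread.

```python
def average_cost(fixed_cost, marginal_cost, units_sold):
    """Cost per unit when a one-time fixed cost is spread over all sales."""
    if units_sold <= 0:
        raise ValueError("units_sold must be positive")
    return fixed_cost / units_sold + marginal_cost

# Hypothetical numbers, chosen only to show the shape of the curves.
software = dict(fixed_cost=1_000_000.0, marginal_cost=2.0)  # high fixed, low marginal
plumber = dict(fixed_cost=20_000.0, marginal_cost=80.0)     # low fixed, high marginal

for n in (100, 10_000, 1_000_000):
    print(
        n,
        average_cost(units_sold=n, **software),
        average_cost(units_sold=n, **plumber),
    )
```

As sales grow, the software's cost per unit falls toward its small marginal cost, while the plumber's cost per job flattens out near the marginal cost of each visit. A one-off list of email addresses behaves like the software case: the acquisition cost is paid once and amortized over every buyer.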

The costs are marginal at the point of periodicity.

example: reseller> pays Adobe every month/quarter
example: adobe> pays versioning costs every 24 months

Provided you shrink the window of analysis, you can say "already paid for inventory, just amortizing it". But in that case, you don't have unlimited supply, you just have whatever you paid for.

In the case of Adobe, despite having "unlimited copies" of CS5, they would (eventually) run out of supply of salable product if they did not version into CS6. So while it's trivially true they could make unlimited copies of CS5, it's not a great idea to perceive this as unlimited supply. The supply that matters is the part people are willing to pay for -- this is the marginal information content -- not the marginal bit content of what is delivered.

In some ways I don't think we're disagreeing, just focusing on different elements of the analysis. My larger point was exactly that -- keep in mind the broader elements that are considered as relevant by CxO.

The CEO of Adobe makes decisions, for example, about how often to incur the marginal cost of versioning the next Creative Suite, how rapidly and how much to budget, etc. The COO of Facebook looks at the marginal cost of data centers for the next 200 million users, etc., in part because s/he is looking at timeframes and scales which are not the same as at the level of a project team, etc.

Care to explain a bit more? On the face of it, I don't believe this can make sense. All costs scale, all costs are at some stage marginal. What are you proposing is the trigger or cause of incurrence?

I think the case they are describing is where the marginal cost is highly nonlinear and the price delta between two reasonable values is so negligible that the marginal cost isn't meaningful.

A 2MB file does not cost twice as much to email to someone as a 1MB file; you aren't going to switch to a different internet connection or email provider because your file is twice as big. The first 1 byte is very expensive and every subsequent byte has no observable marginal cost until 20 orders of magnitude later.

Appreciate your comment. I think I was unclear earlier in my post. I was not trying to talk about the supply of undifferentiated bits per se. Those are trivial to scale, in minor orders. The micro cost is in the acquisition of <unique> bit sets. In your example, it's not the cost to send or replicate the e-mail with a 2 MB attachment. It's the cost to acquire the contents of the 2 MB file, verify that they have any value, etc. [1]

You can't create more <valuable> data per se by making X copies of the same data (in the sense of it having value for marketing/analytics). That is just monetizing existing data. I.e., talking about the marginal cost to replicate a set, provided it was given to you for free, just assumes away a non-trivial part of the equation: getting the data.

At scales of 100m to 1 billion, that is not trivial or costless. Lastly, if you only have one set of data (say 5m users of data), that is a finite supply. You could have 2 sets (10m users). That's not the same as having 2 copies of 1 set (of 5 million). A customer might pay per user for a lead, but won't pay twice for two copies of the same info. Now, if somebody shows up with 100s of millions, it might impact supply/demand (depending on comparability/uniqueness). But those differences cannot be assumed away at zero cost, imho. Hope this makes more sense; I was not trying to argue just for the sake of it.


[1] That scales with data entry, etc. (if nothing else) at the origin (i.e., this is a FB user cost == per user). Even if it's non-cash, it's ~$0.85c per 15 minutes of time for a Western European ABC1, back of the envelope. And that scales linearly.

Thanks for the reply; I honestly am not sure about this situation in particular, I was mostly trying to explain how the statement about marginal costs can be true in any scenario.

I didn't actually intend to say that the cost here is mainly bandwidth or X for any X. My point is more that the marginal cost of anything can be so many orders of magnitude less than the non-marginal cost for certain ranges that it isn't worth considering. Data sets based on Facebook profiles almost certainly come from people approving shady apps, or from someone setting up a crawler in a way that is able to get a lot of information before being detected as a crawler. In either of those scenarios, the person who set it up effectively paid a flat upfront cost and ends up with X number of users, and there are no linear costs (no manual verification or paid data entry at any point). They cannot spend half as much time and get X/2 users or even maybe X/100 amount of information.

In practice they could spend more time getting more users to give access to their random app, but the marginal cost function is just insanely nonlinear; it's effectively 0 at some places and probably tends towards infinity an order of magnitude higher than that.

You're all looking at the supply side of things - there might also be very low amounts of demand.

The price is not low because of supply and demand, but because it is not a very useful list. When you pay a lot for a list it is because it has been curated and you are buying a list with X users having Y attribute[s]. Y='Facebook user' here, and that is hardly worth anything. The network is too large for it to be significant for any real marketing campaign. Essentially you have a list of 1,000,000 VERY loosely related users.

I was considering the possibility of selling these emails to a spam email list, but then realized that since so many accounts' email addresses were changed to FOO@facebook.com, the value of these 1MM Facebook accounts has diminished rather significantly in this regard.

A 'small' slice of your facebook info.

I don't see how the fact that he's a Mozilla blogger is relevant.

The file contains "just" full name, e-mail and URL. Thieves got the information thanks to their Facebook apps (no idea of its name), it could happen with any third-party app.

I'd imagine you could scrape a million people's publicly available info, too.

Have you ever tried scraping Facebook?

What should I expect?

They're pretty clever. When I started programming in 2009, I wrote a small scraper that would create accounts, friend people and steal their info if they accepted. (I never released it past my own friends list and never sold the data).

There were the obvious checks for CAPTCHAs when too much activity was detected, but other subtleties as well. If you looked at too many people's profiles, emails wouldn't be displayed as text, but as images. A person would be unlikely to notice as the pages looked identical, but dynamic changes like that make it harder to scrape some things. Introducing even rudimentary OCR requirements is enough to turn away a lot of programmers.

I'm not saying it's not possible to pull off. But Facebook has set it up so any money you might make this way will likely not be worth the development time required.

Glad you found our anti-scraping stuff to be neat! I work on the team that builds a lot of that technology at Facebook. Any interest in interning here sometime and helping us improve our systems even more?

You guys do a really great job.

To be perfectly honest, I've kind of fallen out of love with web development in the last year and have taken more of an interest in algorithmic trading. I appreciate the interest, though. :)

Soon we'll have very clever, slow going, open source Facebook scrapers, created for free just because we love a challenge.

You could friend people, get to know them, get their email, go to a party to meet their friends, friend them.... and eventually scrape the whole network, if you had a team working in parallel.

Exactly what I'm saying... most of these people probably couldn't care less whether their info is public or not.

Many people talk about caring about their security in an almost idealistic view; few actually care in application.

It gives context?

It associates Mozilla too, unfairly imho.

Here's what I think is the offer on gigbucks: http://gigbucks.com/Social-Marketing/26055/instantly-give-yo...

You don't have to show the URL. Those guys have to be banished, not advertised.

That's like saying "don't think of a camel". My immediate reaction to your comment was to click on the link - which I otherwise wouldn't have. This is not a criticism, just an observation.

Why is he called a Mozilla blogger? He doesn't work for the Mozilla Foundation or Mozilla Corporation as far as I can see (nor does he claim to). His LinkedIn states he associates with the "Mozilla community", but that hardly makes him an official representative of Mozilla, as "Mozilla blogger" implies.

His blog is syndicated to planet.mozilla.org. It's hardly a high quality blog.

I don't see this as a particularly big deal. Databases of email addresses have been available for cheap for a long time, as is evident from the amount of spam we all get. This is after all why spam blockers are so important.

How is that news? Many of these names can be probed for using the public FB API, without being logged on and without access_token:

e.g. try https://graph.facebook.com/1112112584 with curl or so ... (sorry random member from the published list)

Some spammers have probably been harvesting that API for a long time ...

One additional piece of data that I see in the list he purchased is the email address, which is probably what the spammers would value the most.

Because that list also has email addresses?

Try curling even 100 of these URLs and see what happens.

Also, this doesn't give you the email.

I think instead of having users agree to permissions on Facebook and Android apps, they should have to explicitly grant permissions to the app. Maybe by dragging an icon or something that represents "email address", "real name", and other concepts.

This would dramatically lower the conversion for developers, and unhappy developers make Facebook unhappy. Facebook could make it way more obvious if they wanted to, but they just try to balance how much they screw users vs how much they screw developers.

This is FB we're talking about. They're not interested in any privacy awareness, that would only decrease the "engagement" and thus makes no sense to them.

What's so surprising about this? You can get a list of valid emails and just run them through Facebook search, and you'll get names and profile URLs, both public information no matter what your security settings are. Nothing special about this.

Where do you get a list of 1 million valid emails?

Type "buy email lists" into Google and pick any ad. Even Salesforce appears to be selling 30 million e-mails.

That's what I'm saying...

You just crawl the web. I am not surprised either by this, I don't consider emails and first/last name as private data.

FB emails are probably a bit "fresher" than ones found by randomly crawling the web.

Anywhere, but that's not my point. The point is that it's the same as getting a list of email addresses. No one would be surprised if I published a list of emails, but this list is suddenly a security issue.

I suppose you'd get rate limited.

There are plenty of free proxy lists out there.

The email I used for Facebook is facebook@mydomain.com. I have gotten several credit score emails sent to it... This started happening after I had deactivated my account. I never used apps that much, though I did sign up for them occasionally.

I have never noticed this with any of my other unique emails, just the facebook one.

Is that domain publicly accessible? Could facebook@(any domain you know exists).com be a reasonable shot in the dark?

Yes, but it doesn't go anywhere except a blank page with my name on it.

The sheets are named "sayfa". That's a hint about who he bought the list from. Does anyone know which language that is?



Excuse my naïveté, but what could one use this list for? Aside from spam email blast?

The OED confirms naivety is also an English word that's been around since at least the 1700s. I am not implying naïveté is wrong in English, just that there is a simpler option should you so choose.

Autocorrect on my iPad went with the more pompous version :-P

You could use it for phishing. It might make users more likely to click through and provide security information if you display their correct full name and even Facebook URL.

You could start a Groupon clone right away (presuming you have the deals).

Wonder if you could just spam Groupon deals with your affiliate link included?

This is one of the reasons why I don't use FB apps. The privacy and security controls are far too loose and broad. It's never clear exactly how your data will be used.

One of my clients has a FB app, so I work with this stuff daily. It's so easy to build a full profile of an active user's life, their interests, their friends, their work and education history. Their geotagged photos and check-ins tell you exactly where they like to go. Pure gold for marketers / spammers.

The information is just too accessible and valuable for people to not abuse it.

And that's why I use the https://mypermissions.com plugin, so my email doesn't end up in places like this.

I'm sure much of this information is public on their profiles to begin with; with a simple web scraper, you can acquire information about millions of FB accounts and their respective email addresses.

Of course there's an oversupply, since they can give out an unlimited number of copies of this same million FB entries.

Big Data is great because it's super re-usable and can be purposed for anyone's specific need.

I actually did build an FB scraper about 2 years ago. If I'm remembering correctly, something like 1/2000 people have their e-mail address public. Name + URL is always available, but the e-mail is a bit more valuable.
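Taking the 1/2000 figure above at face value (it is a recollection, not a measurement), a quick back-of-the-envelope sketch shows why scraped public profiles are a poor source of e-mail addresses. The function names are just for illustration.

```python
from fractions import Fraction

# Rough rate recalled in the comment above; not a measured figure.
public_email_rate = Fraction(1, 2000)

def expected_emails(profiles_scraped):
    """Expected number of public e-mail addresses among scraped profiles."""
    return profiles_scraped * public_email_rate

def profiles_needed(target_emails):
    """Profiles to scrape, on average, to collect target_emails addresses."""
    return target_emails / public_email_rate

print(expected_emails(1_000_000))   # addresses expected from a million profiles
print(profiles_needed(1_000_000))   # profiles needed for a million addresses
```

At that rate a million scraped profiles would yield only about 500 addresses, and matching the million e-mails in the purchased list would take on the order of two billion profiles, which is more than the "billion users" figure mentioned earlier in the thread.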

I just checked: the Facebook people search page does not show a user's email address.


Wasn't there a massive dump of FB names paired with emails, released by lulzsec or some other 'Anonymous' organization last year? Could make this 'deal' quite useless.

I'm quite skeptical of just about every offer to login with Facebook, or install an 'app' within the site, whether it's Apple, Etsy or some other dishonest corporation. Just about everyone wants mainly to gather your information for some other purpose.

Unfortunately, as usual your average user thinks nothing of clicking through the permissions page on FB without reading or understanding what it says.

I laugh at the censor fail. See selected field.

I'm not seeing it. What is the censor fail in line 66?

The picture was edited in the meantime. Previously the formula field (displaying content of current cell) was not grayed.

That's hardly a deal, it is possible to scrape name, url, gender, locale and profile picture just by bruteforcing user ids:



You don't get the email but that would be really bad.

Just curious, is this illegal? What happens if the seller didn't disclose the method in which he aggregated the data or says I just stood in a shopping mall and asked people to voluntarily disclose these details? (Sure he'll have a hard time proving that) ....

Chances are it's in the app's privacy policy that they can share your data with "carefully selected third parties" or something similar.

Not allowed by Facebook developer TOS (which they never enforce).

I wonder what other data the seller has. It seems likely that you could get more valuable data from the users of these Facebook apps, since people tend to say yes when apps ask for permission to access data.

Looks like Facebook isn't happy about it: http://talkweb.eu/openweb/1842

Maybe should have blurred out the edit line, too.

BTW, where can I find this deal? Would love to see this list in its legitimate form if nothing else.

The url is in a comment above. -- http://news.ycombinator.com/item?id=4687676

Does anyone have any idea how the Facebook accounts were taken from the owners? Scary.

What do you mean by "taken"? I guess the Facebook application just used the privileges the users provided to it to compile this list.

I don't think the accounts were 'taken', I thought he just had some of their data ...

Dev makes an app; user uses the app by granting access to basic info, which every app asks for. Dev saves the details of each user in a spreadsheet and sells it.

Man, that's nothing. You should see what I am able to get for free. Although I did pay $15 for a list of 100m emails, most of which have phone numbers, names, addresses attached to them.

According to the link the data was collected through an app. This is not Facebook's fault, this is the developer being an idiot, because Facebook TOS are very clear when it comes to data; the developer is in big trouble if found.

And then the data provider sells the credit card information you gave them to buy this list in another market... Do YOU feel safe?

So what did he buy that he couldn't get free using Google search? Nothing?

There are phone books out there that list more information than that. Do you feel secure?

1 million Facebook entries isn't cool. u know what's cool?
