
Data Science of the Facebook World - mh_
http://blog.stephenwolfram.com/2013/04/data-science-of-the-facebook-world/
======
taliesinb
I did the analysis and worked with Stephen on the science side of it.

If anyone would like to ask questions about what we did, I'd be happy to
answer them.

There's still lots more interesting stuff to do, but it was enough for a blog
post. Suggest away if you think we missed something obvious!

~~~
susi22
The only visualization I didn't like was the chord diagram:
[http://blog.stephenwolfram.com/data/uploads/2013/04/chordplo...](http://blog.stephenwolfram.com/data/uploads/2013/04/chordplot-movers-by-state3.png)

Did you try this visualization?: <http://bl.ocks.org/mbostock/4062006>

~~~
taliesinb
Yes, I did.

We'll have to agree to disagree. I think the visualization you linked to is
much harder to read, because the visual weight accorded to each edge is a non-
linear (and somewhat arbitrary) function of the 'true' weight. It also doesn't
scale well with number of vertices.

Whereas with the chord diagram, your eye is naturally drawn to the big arrows,
and you can easily follow them. It's also bidirectional in a more
straightforward way.

------
kevinalexbrown
I found this interesting. What I would love to have seen, however, is a probe
into the dynamics. You did a nice abstraction over time as you measured
property X as age was varied. I would have loved to have seen the manner in
which topics and ideas spread over your network.

For instance: If an event occurred in New York, say, how long would it have
taken to spread to San Francisco? If there were no progression, topic times
would center around the same time. This would indicate that people were
getting their information from national, not local sources (e.g. the evening
news), then talking about it on facebook. On the other hand, if a local topic
was spread on facebook alone, we should see some sort of progression.

It's possible that this progression could take more interesting forms besides
geolocation, but that might require a more extensive network. A simple
experiment would work like this: A few thousand people who are not friends but
have a similar interest (say an interest in Elizabeth Warren) post
independently a video of her. This particular esoteric interest is unlikely to
be valued a priori by their friends, but perhaps they are compelled to repost
the information. What's the threshold of "esotericness" such that it won't "go
viral?" Is there a way to predict virality as a function of how popular it is
to begin with? Is there no actual progression across the network, but rather a
small bump in topic expression, until it is picked up by larger media sources
at which point the entire network is inundated with people reposting Elizabeth
Warren recaps from HuffPo et al?

The reason this is interesting is that it sheds light on the role of
social networks: are we fundamentally disposed toward central sources like the
NYTimes, or is facebook a fundamental _sharing_ mechanism? That is, do I post
on facebook just to have my views expressed, validated, and challenged, so
that they might change the world over a few years? Or do I post on facebook to
have my views _propagate_ across the world much more quickly?

Finally, a question: How did you estimate the power law? I know how difficult
it is to do this (e.g. not linear regression on a log-log scale). Did you
compare the power law fit to other, similar distributions, like lognormal?
Preferential attachment is indeed a beautiful theoretical result, because it
implies the existence of power law degree distributions. Unfortunately, many
networks are not as well represented by power laws as by alternative
distributions, which casts doubt on the preferential attachment hypothesis as
is. (Also, many sampling methods give rise to fictive power laws). That said,
a fat tail can still be interesting.

In any case, this is a beautiful piece of work.

~~~
taliesinb
Interesting points.

1\. Dynamics

You're right, that would be very interesting. The most obvious way we could
have done this is by looking at the spread of our app itself as people started
to use it. Unfortunately, we only started recording anonymized stats for the
second release, so we've somewhat missed the boat there.

To do it with links and general "memes" would be technically much harder,
because we'd have to periodically rescrape walls of all the donors to see time
evolution. It was somewhat out of scope of the blog post, given all the more
basic stuff we could do instead.

I'd be surprised if Facebook didn't already do an analysis of this when they
"cracked down" on app virality a while back.

Bit.ly's Hilary Mason might have looked at this question too, and I'm sure it
has been done to death with Twitter, though the demographic info is much
sparser there.

2\. This not being a scientific paper, we estimated it by drawing on the log-
log CDF. Barring the noise that "deparadoxing" the friend's friend count
distribution induces on the low end of the distribution, it was very linear
over two decades. We didn't think the exact number was all that interesting,
so we didn't spend any more effort than that. Facebook's anatomy paper
probably has a very accurate number.

I'd heard about the fictive power law stuff. What makes me even more skeptical
is that FB friends are probably a poor proxy for 'true' friends. You'd be
better off looking at number of friends as defined by some cross-commenting
threshold.

3\. Thanks! It was a lot of fun!

~~~
szhorvat
About the "fictive power law" thing: this is THE paper to read:
<http://arxiv.org/abs/0706.1062> (it's an easy read, explaining what the
maximum likelihood method is, etc.). Despite what they say, fitting the log-
log CDF usually gives pretty good results when done right (fitting the PDF
does not)
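
For reference, a minimal sketch of the maximum likelihood estimate for a continuous power law, alpha = 1 + n / sum(ln(x_i / x_min)), with standard error roughly (alpha - 1) / sqrt(n). The function names and the synthetic data are illustrative assumptions, x_min is taken as given rather than estimated, and this is not the method used for the blog post (which, per the comment above, was a fit to the log-log CDF).

```python
import math
import random


def powerlaw_mle_alpha(samples, x_min):
    """Estimate the exponent of a continuous power law p(x) ~ x^(-alpha).

    Uses only the tail x >= x_min and returns (alpha, standard_error).
    x_min is assumed known here; choosing it well is a separate problem.
    """
    tail = [x for x in samples if x >= x_min]
    if not tail:
        raise ValueError("no samples at or above x_min")
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum <= 0:
        raise ValueError("all tail samples equal x_min; alpha is undefined")
    alpha = 1.0 + len(tail) / log_sum
    std_err = (alpha - 1.0) / math.sqrt(len(tail))
    return alpha, std_err


def sample_power_law(n, alpha, x_min, rng):
    """Draw n synthetic values by inverse-transform sampling (for checking only)."""
    exponent = -1.0 / (alpha - 1.0)
    # rng.random() is in [0, 1), so 1 - u is in (0, 1] and the power is finite.
    return [x_min * (1.0 - rng.random()) ** exponent for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    data = sample_power_law(20000, alpha=2.5, x_min=1.0, rng=rng)
    alpha_hat, err = powerlaw_mle_alpha(data, x_min=1.0)
    print(f"alpha ~ {alpha_hat:.3f} +/- {err:.3f}")
```

Friend counts are integers, so for small x_min the discrete form of the estimator would be the more appropriate one; the continuous formula is only a reasonable approximation for larger x_min.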

~~~
taliesinb
Thanks! What do you know? Shalizi!

~~~
carlob
Also from Shalizi, if you want a TL;DR

[http://vserver1.cscs.lsa.umich.edu/~crshalizi/weblog/491.htm...](http://vserver1.cscs.lsa.umich.edu/~crshalizi/weblog/491.html)

------
greiskul
I wonder if the higher friends count of Brazilian users is caused by the
previous use of Orkut, where it was popular to try to have as many friends as
possible.

~~~
taliesinb
Ha! That is interesting! I didn't know about that... will tell Stephen.

------
geekam
How is it possible that Facebook, which owns the data, does not offer tools
like these, while others build them using its data?

~~~
programnature
The attitude of companies like Facebook, Google, Twitter is: if the product
isn't addictive or useful to billions of people, it's not worth doing.

Hence there are vastly more resources dedicated to assimilating e.g. photos and
games into their ecosystem, than into something computationally innovative.

This is IMHO a huge mistake, since they could instead be introducing simple
forms of programming that take you on a continuous curve from using the
product, to developing for it. There is a huge hunger in the masses for better
forms of programming.

This point will probably become obvious if the Wolfram Language is successful.

~~~
taliesinb
I'm sure Facebook's Data Science team does a lot of interesting things
internally. They do in fact have some interesting papers [0] and [1], though
obviously with more of an 'academic' feel than the blog post.

[0]: <http://arxiv.org/abs/1111.4503>

[1]: <http://arxiv.org/pdf/1201.4145> (edit)

Edit: they also have this FB page which has a steady stream of interesting
stuff: <https://www.facebook.com/data>

~~~
p3r1
Thank you so much. Edit: I don't work at facebook.

~~~
taliesinb
Are you on the Facebook datascience team? Your profile is pretty sparse.

------
austinl
I've been doing Facebook network visualization for a while now with Gephi.
Here are some of the graphs I came up with:
[http://visualizingpolitics.wordpress.com/2012/05/02/facebook...](http://visualizingpolitics.wordpress.com/2012/05/02/facebook-network-visualizations/)

~~~
taliesinb
Nice! Do you know if Gephi can do something similar to the summarization that
we did using cluster diagrams? The whole "ball of hair" problem doesn't have
any other real solution, I don't think (well, unless you use edge clustering,
but that doesn't help in-group connections).

~~~
austinl
I'm 90% certain that it can, but I haven't worked with it in a while. I
remember a setting in the display options that simply combined all the dots of
one color into one large group.

~~~
taliesinb
Cool! I wonder how one could combine the best of both worlds... what we're
really talking about here is a hierarchy of graph plots in which you can drill
down to each node = graph at a lower level.

------
sskates
Wow- this is awesome! It's really cool how people's friend distribution by age
is a convolution of their age and the age of the general facebook population.
It's also scary in a way to see a snapshot of how I'm likely to change in the
future with regards to my clusters of friends, my relationship status, and
what I'll talk about.

------
xk_id
The traditional way to plot the assortativity by age is using a scatter plot /
heatmap. This is similar to what they did for country homophily on p12 of the
Facebook anatomy paper. The result would be a plot with a prominent diagonal,
illustrating that "same attracts same".

That aside, imo, Facebook is an incredibly idiosyncratic "app", which makes
almost no sense. And yet, it gave us so many opportunities for interesting
discussions, like the insights in this blog post. Nice job.

~~~
taliesinb
Yeah, we tried a couple. Those heatmaps are, I think, quite hard to read,
because it is natural to want to take marginals, but you can't easily do that
visually.

I think this whole "octile plot" thing turned out quite nicely. It's in a
sense a way of 'slicing' the CDF into 8 even strips and projecting them onto a
single axis. It's quite intuitive to read, too. Facebook seems to use it too
for some of their papers.
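
A rough sketch of the numbers behind such a plot, assuming "8 even strips" means the seven cut points that split each group's distribution into eight equal-probability bands. The function name, the (key, value) input format, and the `min_size` cutoff are illustrative assumptions, not the code used for the post.

```python
from collections import defaultdict
from statistics import quantiles


def octiles_by_group(pairs, min_size=8):
    """Return {key: [7 octile cut points]} for an iterable of (key, value) pairs.

    Example use: key = a user's age, value = the age of one of their friends.
    Groups with fewer than min_size values are skipped as too noisy to slice.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {
        key: quantiles(values, n=8)  # 7 cut points: 12.5%, 25%, ..., 87.5%
        for key, values in sorted(groups.items())
        if len(values) >= min_size
    }
```

Plotting the seven cut points against the key, with the bands between them shaded, puts the median in the middle and lets you read the spread directly off one axis.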

------
photorized
One thing that bugs me is how comments are linked to "interest". There are
many topics that interest people (passive consumption), that do not
necessarily translate into engaging in a conversation with others publicly.

As a marketing term - sure, that would be a good indicator of interest. Since
this article is more scientific than marketing-oriented, I would clarify what
some of the metrics mean (or don't mean).

Excellent, fantastic visualizations though!

~~~
taliesinb
You're right. A more ambitious approach that might get a bit closer to
people's "real interests" would be to follow posted links and topic model the
contents of those links.

~~~
photorized
I am very interested in that angle - let's connect when you have time, twitter
@iTrendTV

------
pbnjay
How much of the "friends with zero friends" count is simply because that
information is blocked? If my friends "donated" their data, I would show as
having 0 friends if I've blocked that information to apps.

~~~
taliesinb
Actually, NONE of the people in our dataset had zero friends. The x-axis
starts at 1, not 0. The point is that resampling to remove the friendship
paradox shows that there are many more people with single-digit friends than
we expected.
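
To illustrate the idea (a sketch under stated assumptions, not the resampling actually used): a person with k friends appears in k friend lists, so people reached through donors' friend lists are over-represented in proportion to k. Assuming each observation is one such appearance, weighting it by 1/k undoes that bias. The function name and input format are hypothetical, and reweighting is swapped in here for resampling.

```python
from collections import Counter


def deparadox(observed_friend_counts):
    """Estimate the unbiased friend-count distribution from a size-biased sample.

    observed_friend_counts: friend counts (each >= 1) of people reached via
    friend lists, one entry per appearance. Returns {k: estimated fraction}.
    """
    weights = Counter()
    for k in observed_friend_counts:
        if k < 1:
            raise ValueError("friend counts must be at least 1")
        weights[k] += 1.0 / k  # undo the k-fold over-sampling
    total = sum(weights.values())
    return {k: w / total for k, w in sorted(weights.items())}
```

For example, if half of a population has 1 friend and half has 9, a friend-list sample sees them in a 1:9 ratio, and the 1/k weights bring the estimate back to an even split.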

~~~
pbnjay
Given mcintyre1994's comment, I think this still explains the same situation.
People with single-digit friends are simply people who have friends blocked to
apps but have multiple friends who've donated data.

~~~
taliesinb
No, we can tell when someone's privacy settings have given them a zero friend
count, and they're not included in that aggregation.

------
CurtMonash
Introducing "data science for Facebook" in 2013 is ... odd.

All the more so because Jeff Hammerbacher is often credited with coining the
term "data science", and he started doing it at -- that's right -- Facebook.

~~~
taliesinb
Well, the title is "Data Science of the Facebook World", there is no "for".
The data _comes from_ our "Personal Analytics for Facebook" product.

Didn't know Jeff Hammerbacher coined "data science" at Facebook -- that's
interesting!

------
brown9-2
Very nice looking graphs, but running "Wolfram Alpha Personal Analytics for
Facebook" for my own profile comes with a rather nerve-wracking warning:

 _Wolfram Connection would like to access your public profile, friend list,
email address, custom friends lists, News Feed, relationships, birthday,
status updates, checkins, education history, hometown, current city, photos,
religious and political views, videos, likes and your friends' relationships,
birthdays, education histories, hometowns, current cities, photos, religious
and political views and videos._

~~~
dusing
How do you think they are going to get all the data to analyze your facebook
account?

~~~
taliesinb
Yeah, that's data you _see_ in your report.

You have to opt in to being a data donor for us to store any of it.

Otherwise we just record basic anonymized statistics -- like number of
friends, sex, age, etc... and throw all the detailed stuff away. Our privacy
policy has more: <http://www.wolframalpha.com/fbfaqs.html>

We also encrypt with public keys like there's no tomorrow.

------
jonpeda
The Mathematica system makes some beautiful, informative graphs, and
presumably users can make those graphs with a minimum of fuss and bother. It's
technically very nice.

Yet, in the entire blog post, is there one insight that wasn't a priori
obvious? Maybe the bits about migration.

I don't see the "art and science" in this analysis, I see "stamp collecting"
(<http://en.wikiquote.org/wiki/Ernest_Rutherford>)

~~~
carlob
So Rutherford says: "All science is either physics or stamp collecting" and
then wins the Nobel prize for…

You got that right: chemistry!

~~~
rdouble
Chemistry is applied physics, after all.

~~~
carlob
I'd rather imagine him fuming in one corner: "stupid Nobel committee… they
should have kept their crappy stamp collecting prize…"

------
jonpeda
People _donate_ their data to support Wolfram's closed-source, paid-license,
for-profit program?

~~~
mcintyre1994
They said they'll use it to support their 'personal analytics' programme [0],
which is free via wolframalpha.com - I don't see how this data would help with
Mathematica or anything else they charge for?

[0] <http://www.wolframalpha.com/facebook/>

~~~
maxerickson
They are using it for marketing.

(I mean the post is what it is, I don't mean to whine about it)

