
Foursquare dataset free to download and analyze - rsobers
http://www-users.cs.umn.edu/~sarwat/foursquaredata/
======
sneak
Obligatory:

[https://en.wikipedia.org/wiki/AOL_search_data_leak](https://en.wikipedia.org/wiki/AOL_search_data_leak)

[http://techcrunch.com/2006/08/06/aol-proudly-releases-massiv...](http://techcrunch.com/2006/08/06/aol-proudly-releases-massive-amounts-of-user-search-data/)

[http://www.nytimes.com/2006/08/09/technology/09aol.html?page...](http://www.nytimes.com/2006/08/09/technology/09aol.html?pagewanted=all)

TL;DR: It's fairly easy to deanonymize datasets like this, provided they are
somewhat complete.

~~~
route66
The dataset comes with GPS locations of users and venues. With that data alone
you can retrieve individual addresses in not too densely populated areas.
Missing links will get caught in the social net.

We already learned that Warhol's 15 minutes of fame should read _15 megabyt_es,
but, to cut the "it's the user's choice to post that data" apologists short:
almost no-one I speak to understands the implications of all possible
interpretations, classifications and groupings that their online traces allow.

~~~
sneak
Furthermore, it's not the users' fault that 4sq isn't sufficiently rate
limiting or otherwise protecting this data. Why should arbitrary users be able
to see the social graph of others they're not friends with? Also, why should
people outside of your immediate one-hop social graph be able to see your
checkins?

Giving it to 4sq for data mining is different than giving it to UMN and/or the
whole internet for data mining and/or deanonymization.

------
danso
Jesus Christ. The bulk scraping in violation of the TOS is egregious enough,
but redistributing it with a mandate that the researchers get credit? For
what, scraping a generous public API?

~~~
Irishsteve
This is common practice. It serves two purposes: 1) it makes it easier to
find the dataset that was used for experiments, and 2) it improves the citation
count for the author, which is usually important in research.

~~~
mzs
That's true, but here is the paper from Microsoft Research, and it doesn't seem
to explain how those data files were generated:

[http://research.microsoft.com/pubs/156453/icde12_lars.pdf](http://research.microsoft.com/pubs/156453/icde12_lars.pdf)

------
nicholassmith
That doesn't look like Foursquare has handed that over. What's the legality of
scraping a service for their data in this way?

~~~
onedev
That doesn't look completely legal to me either.

~~~
ozh
I don't get why scraping publicly and freely accessible data would be illegal.
The redistribution under their own terms is another matter, though.

~~~
chimeracoder
> publicly and freely accessible data would be illegal.

Data is not "publicly and freely accessible" if accessing it requires you to
agree to separate terms of service for it that restrict your ability to access
and redistribute it.

(Whether or not one believes the data _should_ be freely and publicly accessible
is a separate matter, but given the above, it's hard to make the case that it
_is_).

Amusingly, this data _still_ isn't "freely accessible", because these people
have attached their own, separate terms to reusing and redistributing the
data.

~~~
mvanvoorden
If you have to agree to specific terms, but are able to access everything
without complying with these terms, this effectively means that the data is
publicly and freely accessible.

Rules by themselves don't restrict anything, enforcement does.

------
galapago
[http://webcache.googleusercontent.com/search?q=cache:hLI5FqD...](http://webcache.googleusercontent.com/search?q=cache:hLI5FqDixY8J:www-users.cs.umn.edu/~sarwat/foursquaredata/+&cd=1&hl=en&ct=clnk)

(the direct link is not working, but this confirms that it was freely available)

------
boothead
No mention of the data format. Is it JSON, CSV, or what? I know you can always
head -n the file but a little hint would be helpful!

~~~
nachi
It looks like an ASCII-formatted table. Pretty disappointing that it isn't
machine readable out of the box.

~~~
timthorn
What's not machine readable about an ASCII table? A fixed width table has its
own advantages over e.g. CSV - for instance, to read a specific field you can
reach it by offset rather than having to count delimiters.
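
To make the offset point concrete, here's a rough Python sketch. The column names and widths are made up for illustration (I'm not claiming this is the actual layout of the dataset files); it just contrasts slicing a field by offset with parsing delimiters:

```python
import csv
import io

# Hypothetical layout: (start, end) character offsets per column.
# These names and widths are assumptions, not the real dataset layout.
FIELDS = {"user_id": (0, 8), "venue_id": (8, 16), "rating": (16, 20)}


def read_fixed_field(line, name):
    """Fixed width: jump straight to the field by slicing at known offsets."""
    start, end = FIELDS[name]
    return line[start:end].strip()


def read_csv_field(line, index):
    """CSV: the whole line has to be split on delimiters to find field N."""
    return next(csv.reader(io.StringIO(line)))[index]


# The same made-up record in both formats.
fixed_line = f"{'42':>8}{'1001':>8}{'3':>4}"
csv_line = "42,1001,3"

print(read_fixed_field(fixed_line, "venue_id"))
print(read_csv_field(csv_line, 1))
```

The trade-off is that the fixed-width reader needs the layout up front, while the CSV reader only needs to know which column number to pick.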

~~~
cwmma
fixed width might have technical advantages but CSV has the advantage of being
able to be read by a lot of things out of the box.

~~~
timthorn
So does fixed width. Excel, MySQL, R, Perl... :)

------
interskh
> This data set contains 2153471 users, 1143092 venues, 1021970 check-ins,
> 27098490 social connections, and 2809581 ratings that users assigned to
> venues

The number of check-ins seems to be low compared to other numbers.
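
A quick back-of-the-envelope check using only the counts quoted above (the dictionary keys are my own labels, nothing from the dataset itself):

```python
# Counts copied from the dataset description quoted above.
counts = {
    "users": 2153471,
    "venues": 1143092,
    "check_ins": 1021970,
    "social_connections": 27098490,
    "ratings": 2809581,
}

# Per-user averages, to put the check-in count in perspective.
checkins_per_user = counts["check_ins"] / counts["users"]
connections_per_user = counts["social_connections"] / counts["users"]
ratings_per_user = counts["ratings"] / counts["users"]

print(f"check-ins per user:   {checkins_per_user:.2f}")
print(f"connections per user: {connections_per_user:.2f}")
print(f"ratings per user:     {ratings_per_user:.2f}")
```

That works out to fewer than one check-in per user on average, while there is more than one rating per user and roughly a dozen social connections per user.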

------
davidmat
Could anyone recommend some solid introductory material on data analysis/data
visualisation?

I'm thinking this data set seems like a fun way to fill a rainy weekend, going
for a dive into these worlds :)

~~~
benmanns
The Social Network Analysis course on Coursera started yesterday. [0]

0: [https://www.coursera.org/course/sna](https://www.coursera.org/course/sna)

------
m4tthumphrey
Looks like it's been removed. Damn.

Edit: Not removed, just inaccessible. 403.

~~~
annnnd
I'm quite sure it will surface somewhere. Looking forward to it. :)

EDIT: I would love to get my hands on this data... anyone? :)

~~~
mvanvoorden
I have it, but "The user may not redistribute the data without separate
permission."

But if you have any questions about it, I can try to answer them, my username
here also corresponds to a gmail account I use ;)

------
renownedmedia
Looks like the data is only up-to-date as of July 2012 (judging from the zip
compression times).

------
xntrk
sounded too good to be true. I guess we'll have to find it on bittorrent.

~~~
dotBen
filename was "umn_foursquare_datasets.zip" in case that helps

~~~
galapago
[http://ul.to/betfn1vh](http://ul.to/betfn1vh)

------
rajbala
The data set has been removed?

~~~
lgas
[http://archive.org/details/201309_foursquare_dataset_umn](http://archive.org/details/201309_foursquare_dataset_umn)

------
waynesonfire
why was this not posted as a torrent?

