Foursquare dataset free to download and analyze (umn.edu)
113 points by rsobers on Oct 8, 2013 | 35 comments

The dataset comes with GPS locations of users and venues. With that data alone you can retrieve individual addresses in not too densely populated areas. Missing links will get caught in the social net.

We already learned that Warhol's 15 minutes of fame should read 15 megabytes, but, to cut the "it's the user's choice to post that data" apologists short: almost no one I speak to understands the implications of all possible interpretations, classifications and groupings that their online traces allow.

Furthermore, it's not the users' fault that 4sq isn't sufficiently rate limiting or otherwise protecting this data. Why should arbitrary users be able to see the social graph of others they're not friends with? Also, why should people outside of your immediate one-hop social graph be able to see your checkins?

Giving it to 4sq for data mining is different than giving it to UMN and/or the whole internet for data mining and/or deanonymization.

Jesus Christ. The bulk scraping in violation of the TOS is egregious enough, but redistributing it with a mandate that the researchers get credit? For what, scraping a generous public API?

This is common practice. It serves two purposes: 1) it makes it easier to find the dataset that was used for experiments, and 2) it improves the citation count for the author, which is usually important in research.

That's true, but here is the paper from Microsoft Research and it seems to be lacking in how those data files were generated:


That doesn't look like Foursquare has handed that over. What's the legality of scraping a service for their data in this way?

I'm a Foursquare engineer. We have explanations of our API policies here: https://developer.foursquare.com/overview/community

We'll be contacting this researcher to ask where they got this data and whether it conforms to our policies.

Thanks for responding, I didn't realise Foursquare actually gave so much of their data away freely. Which makes it slightly more seedy that someone has scraped the rest and dumped it online.

That doesn't look completely legal to me either.

I don't get why scraping publicly and freely accessible data would be illegal. The redistribution under their own terms is another matter, though.

> publicly and freely accessible data would be illegal.

Data is not "publicly and freely accessible" if accessing it requires you to agree to separate terms of service for it that restrict your ability to access and redistribute it.

(Whether or not one believes the data should be freely and publicly accessible is a separate matter, but given the above, it's hard to make the case that it is).

Amusingly, this data still isn't "freely accessible", because these people have attached their own, separate terms to reusing and redistributing the data.

If you have to agree to specific terms, but are able to access everything without complying with these terms, this effectively means that the data is publicly and freely accessible.

Rules in themselves don't restrict anything; enforcement does.

then re-releasing it with distribution terms is probably not legal at all.


(the direct link is not working, but this confirms that it was freely available)

No mention of the data format. Is it JSON, CSV, or what? I know you can always head -n the file, but a little hint would be helpful!

It looks like an ASCII-formatted table. Pretty disappointing that it isn't machine readable out of the box.

What's not machine readable about an ASCII table? A fixed-width table has its own advantages over e.g. CSV: for instance, to read a specific field you can reach it by offset rather than having to count delimiters.
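
To make the offset-versus-delimiter point concrete, here is a minimal Python sketch. The column layout (id in characters 0-7, latitude in 8-19, longitude in 20-31) and the sample rows are made up for illustration and are not the actual layout of the UMN files.

```python
# Hypothetical fixed-width layout, NOT the real dataset's columns.
ID = slice(0, 8)
LAT = slice(8, 20)
LON = slice(20, 32)


def latitude_fixed(line: str) -> float:
    """Read the latitude directly by character offset."""
    return float(line[LAT])  # float() ignores the padding spaces


def latitude_delimited(line: str, sep: str = ",") -> float:
    """Read the latitude by splitting on the delimiter and counting fields."""
    return float(line.split(sep)[1])


# The same made-up record in both representations.
fixed_row = f"{42:<8}{44.9778:<12}{-93.265:<12}"
csv_row = "42,44.9778,-93.265"

print(latitude_fixed(fixed_row), latitude_delimited(csv_row))
```

The fixed-width reader never scans the rest of the line, but it breaks silently if a value overflows its column, which is the usual trade-off against a delimited format.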

Fixed width might have technical advantages, but CSV has the advantage of being readable by a lot of things out of the box.

So does fixed width. Excel, MySQL, R, Perl... :)

sed 's/[[:blank:]]//g' dataset.dat | sed 's/|/\t/g'
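
The sed one-liner strips every blank, including any spaces inside a value. A rough Python alternative that trims only the padding around each cell and writes real CSV might look like the sketch below. It assumes a pipe-delimited table with no outer pipes and a "-----+-----" rule under the header. That layout is a guess, and the sample rows are invented.

```python
import csv
import io
import re

# Lines made only of dashes, pluses and whitespace are treated as rules.
# This assumes a psql/MySQL-style dump, which is not confirmed by the source.
RULE = re.compile(r"^[-+\s]*$")


def table_to_csv(src, dst):
    """Copy a pipe-delimited ASCII table from src to dst as CSV."""
    writer = csv.writer(dst)
    for line in src:
        if RULE.match(line):
            continue  # skip blank lines and horizontal rules
        # Trim padding around each cell but keep spaces inside values.
        writer.writerow(cell.strip() for cell in line.split("|"))


# Invented sample data in the assumed layout.
sample = (
    "  id | latitude  | longitude \n"
    "-----+-----------+-----------\n"
    "   1 | 45.07246  | -93.2588  \n"
)
out = io.StringIO()
table_to_csv(io.StringIO(sample), out)
print(out.getvalue())
```

For the real files you would pass open file handles instead of the StringIO objects, opening the output with newline="" as the csv module documentation recommends.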

> This data set contains 2153471 users, 1143092 venues, 1021970 check-ins, 27098490 social connections, and 2809581 ratings that users assigned to venues

The number of check-ins seems to be low compared to other numbers.
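
A quick back-of-the-envelope check of the quoted totals, as a small Python sketch. The counts are copied from the description above; nothing here says how the dataset counts a social connection (once per pair or once per direction), so treat that ratio as rough.

```python
# Totals as quoted in the dataset description.
counts = {
    "users": 2153471,
    "venues": 1143092,
    "check-ins": 1021970,
    "social connections": 27098490,
    "ratings": 2809581,
}

# Normalise every other total by the number of users.
per_user = {
    name: total / counts["users"]
    for name, total in counts.items()
    if name != "users"
}

for name, ratio in per_user.items():
    print(f"{name} per user: {ratio:.2f}")
```

That works out to roughly 0.47 check-ins per user, against about 1.30 ratings and about 12.58 social connections per user, so on average fewer than half of the users have even one check-in in the set.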

Could anyone recommend some solid introductory material on data analysis/data visualisation?

I'm thinking this data set seems like a fun way to fill a rainy weekend, going for a dive into these worlds :)

The Social Network Analysis course on Coursera started yesterday. [0]

0: https://www.coursera.org/course/sna

Looks like it's been removed. Damn.

Edit: Not removed, just inaccessible. 403.

I'm quite sure it will surface somewhere. Looking forward to it. :)

EDIT: I would love to get my hands on this data... anyone? :)

I have it, but "The user may not redistribute the data without separate permission."

But if you have any questions about it, I can try to answer them, my username here also corresponds to a gmail account I use ;)

Looks like the data is only up-to-date as of July 2012 (judging from the zip compression times).

Sounded too good to be true. I guess we'll have to find it on BitTorrent.

filename was "umn_foursquare_datasets.zip" in case that helps

The data set has been removed?

Why was this not posted as a torrent?
