
Dataset of 13 Billion Clicks available for research - Anon84
http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset
======
nivla
Oh boy, this is going to end badly. Remember when AOL released a huge
anonymized dataset of their searches? People were still identified because
naturally people searched for their own names, the names of their friends and
families, local businesses, personal websites, etc.

This dataset is even worse since it includes both the referrer and the
destination.

Keep in mind that websites often put usernames in the URL.

E.g.: <http://www.facebook.com/Your.Name>

<http://www.reddit.com/user/USERNAME/>

<http://slashdot.org/~USERNAME>

<http://news.ycombinator.com/user?id=USERNAME>

So no matter how much you think you have it anonymized, a person's browsing
history could reveal a lot more than you think.
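
To make that concrete, here is a rough Python sketch of how little effort it takes to pull usernames out of URLs like the four above. The function name and the patterns are my own illustration, not anything taken from the dataset, and real sites have many more URL shapes than this handles.

```python
import re
from urllib.parse import urlparse, parse_qs


def username_from_url(url):
    """Return the username embedded in a URL, or None if no pattern matches.

    Hypothetical helper: the patterns only cover the four example URL
    shapes mentioned in the comment above.
    """
    parts = urlparse(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path

    if host == "reddit.com":
        m = re.match(r"^/user/([^/]+)/?$", path)
        return m.group(1) if m else None
    if host == "slashdot.org":
        m = re.match(r"^/~([^/]+)/?$", path)
        return m.group(1) if m else None
    if host == "news.ycombinator.com" and path == "/user":
        ids = parse_qs(parts.query).get("id")
        return ids[0] if ids else None
    if host == "facebook.com":
        # Deliberately naive: this also matches non-profile pages,
        # so a real attempt would need a list of reserved paths.
        m = re.match(r"^/([A-Za-z0-9.]+)/?$", path)
        return m.group(1) if m else None
    return None
```

Run that over the requested and referring URLs of a click log and you get a list of candidate account names without touching anything the anonymization step removed.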

~~~
politician
The data doesn't seem to contain enough information to cluster unrelated URLs
(e.g. by User Agent). Each record contains 1) a timestamp, 2) the requested
URL, 3) the referring URL, 4) a boolean classification of the user agent
(browser or bot), 5) a boolean flag for whether the request was generated
inside or outside IU.
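
As a rough sketch of that record layout in Python (the field names and types are my guesses for illustration, not the dataset's actual schema):

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ClickRecord:
    # Field names and types are assumptions, not the published format.
    timestamp: int         # assumed here to be seconds since the epoch
    requested_url: str
    referring_url: str
    is_browser: bool       # True = classified as browser, False = bot
    from_inside_iu: bool   # True = request originated inside IU


# The five fields above are everything a record carries: there is no
# IP address, session ID, or raw user-agent string to group requests by.
FIELD_NAMES = [f.name for f in fields(ClickRecord)]
```

Nothing in there works as a per-user key, which is why joining one person's clicks across unrelated sites is hard.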

Although you could reconstruct who's looking at Facebook from inside IU, it'd
be difficult to further associate them with some other website like Hacker
News. On the other hand, the timestamps do leak some information which could
potentially be used to identify patterns of activity.

~~~
kordless
You've seen that bit where the user-agent is fairly unique to a particular
user, right?

<https://www.eff.org/deeplinks/2010/01/tracking-by-user-agent>

~~~
politician
Yes, but they aren't sharing the user agent. They're just sharing a boolean
derived from the user agent which says whether they think it was a bot or a
browser.

------
IvyMike
I'm really surprised anyone would take the risk of releasing data like this,
even with their security protocols in place. It just doesn't seem worth it:

\- The potential upside is a few citations in research papers.

\- The potential downside is a widescale invasion of privacy of IU students
and staff, and a huge PR disaster.

------
weareconvo
This shit is going to be available on TPB before I can even click 'add
comment'.

------
triplesec
Marc Smith at Microsoft Research had a Usenet DB for research porpoises
created about 6 years ago or so, and provided it to any researchers who wanted
it. Although I didn't care about Usenet for my stuff, it was a good and useful
offering for various researchers, and I hope this newer DB also proves useful!
Thanks to Indiana for going to the trouble.

------
afhof
How did they collect this data without someone raising privacy flags?
Releasing this data is almost certainly a bad idea, since it will likely
reveal who the people are who made those requests. Anonymized data usually
isn't.

~~~
DanBC
> _Additionally, while the dataset has been approved by the Indiana University
> IRB for “non-human subjects research” (protocol 1110007144), it might
> potentially contain bits of stray personal data. Therefore we require that
> you follow these instructions to request the data. You will have to sign a
> data security agreement._

> _Data Transfer: If your request is approved, you will send a blank 3TB hard
> drive to the address below._

> _We will return the loaded drive to the address specified in your request
> within 10 business days of receiving it. The data on the drive will be
> encrypted using TrueCrypt. The password to decrypt the data will be emailed
> to address that you specified in your request. It will be your
> responsibility to install and configure TrueCrypt
> (<http://www.truecrypt.org/>) on the system where you will be accessing the
> data._

> _4.) Data Removal: When you have finished using the data, you are
> responsible for securely and permanently removing the data including the
> drive that was used to transfer the data. For more information about secure
> data removal please see:
> (<https://protect.iu.edu/cybersecurity/data/secureremoval>) ._

> _I have read and agree to abide by all University data security practice
> related to access to University confidential data. To the best of my
> ability, I will comply, keep secure, or return all information provided to
> me._

TL;DR: they're just crossing their fingers and hoping for the best.

~~~
kordless
Serious question here. What's the point of all this?

------
kmregan
If you are interested in this kind of data, it's worth noting that there are
some older but, in a sense, more manageable datasets at the Internet Traffic
Archive [1]. The data there can be downloaded and does not need to be
physically shipped through the post.

The largest dataset consists of 1.3 billion requests (for the 1998 World Cup
website).

[1] <http://ita.ee.lbl.gov/html/traces.html>

~~~
kordless
From a research standpoint this data set is much less interesting than a bunch
of students/faculty/bots/apps clicking and surfing their way around the whole
Internets.

------
berlinbrown
Can someone actually post the real data?

~~~
archgoon
This would be a direct violation of the Data Security Agreement
(<https://protect.iu.edu/system/files/Data-Security-Access-Agreement.pdf>)
that all persons requesting the data must agree to.
Furthermore, they seem to indicate that they will only be releasing the data
to researchers or large organizations.

That being said, feel free to request the data from them.

<http://carl.cs.indiana.edu/data/webtraffic/click-dataform.pdf>

~~~
berlinbrown
Let's get easy access in honor of Aaron's recent actions.

