

How To Get Experience Working With Large Datasets - m3mb3r
http://www.philwhln.com/?p=394

======
drblast
US Census data is multiple gigabytes, and well documented.

If you want to run a database through its paces beyond the point where all the
data fits in memory, that's a good place to start.
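
A minimal sketch of one way to start that kind of exercise, assuming you have
a CSV extract of the data on disk. The function name, table name, and batch
size are hypothetical, and SQLite is only a stand-in for whatever database you
actually want to stress. It streams the file in batches, so the whole file
never has to sit in memory at once:

```python
import csv
import itertools
import sqlite3


def load_csv_in_batches(csv_path, db_path, table="census", batch_size=10_000):
    """Stream a CSV file into SQLite, holding at most one batch in memory."""
    conn = sqlite3.connect(db_path)
    total = 0
    try:
        with open(csv_path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)  # first row is assumed to hold column names
            cols = ", ".join(f'"{name}"' for name in header)
            marks = ", ".join("?" for _ in header)
            conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
            insert = f'INSERT INTO "{table}" VALUES ({marks})'
            while True:
                # Pull the next batch_size rows from the reader, no more.
                batch = list(itertools.islice(reader, batch_size))
                if not batch:
                    break
                conn.executemany(insert, batch)
                conn.commit()
                total += len(batch)
    finally:
        conn.close()
    return total
```

Calling something like `load_csv_in_batches("extract.csv", "census.db")` (both
paths hypothetical) returns the number of rows loaded. The columns are left
untyped, so every value is stored as text until you add a real schema.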

~~~
lsb
Census data's definitely interesting. If you have the cash, you can rent a box
with 68GB of memory for $0.70/hr on EC2, as a spot instance.

------
wrath
One nice, free dataset you can play with is the BestBuy open data. You can
download the full BestBuy product catalog in JSON and XML formats from
<http://developer.bestbuy.com>. Simply register for a key and you'll have
access to the data.

------
andrewjshults
Along the same lines, NYC's Big Apps 2.0 competition is going on right now
(<http://nycbigapps.com/>). Not affiliated, but I went to NYTM last year where
they demoed the winners and there are some interesting (and impressively
large) datasets to play with. One of my favorites was the mobile app CabSense,
which crunched the TLC data to determine the best corners for catching a cab
depending on the time of day.

------
fmw
They might be relatively small, but <http://www.grouplens.org/node/12> has
some interesting datasets that can be used to experiment with recommendation
systems, e.g. book and movie reviews.

------
ashtophoenix
What a silly article. When it said "how to get experience working with large
datasets", I was expecting it to explain more about storage, scalability,
design, caching issues, etc. There are myriad ways to get (or generate) data to
play with...
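
For the "generate" route, a minimal sketch, assuming plain CSV output is good
enough. The function name and the column layout (id, user_id, amount) are made
up for illustration. It writes rows one at a time, so the output can be far
larger than memory:

```python
import csv
import random


def write_synthetic_rows(path, n_rows, seed=0):
    """Write n_rows of made-up (id, user_id, amount) records to a CSV file."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same file
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "user_id", "amount"])
        for i in range(n_rows):
            user_id = rng.randrange(1_000_000)
            amount = round(rng.uniform(1, 500), 2)
            writer.writerow([i, user_id, amount])
    return n_rows
```

Crank `n_rows` up until the file stops fitting in RAM. Uniformly random values
are unrealistic, though, so skewed distributions would be the next thing to
add.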

------
earl
What's with the recent fetishization of Big Data? I'm moving to Dziuba's camp
-- it's a developer dick size contest.

~~~
elai
An interviewer once asked about my experience with large data sets, high
traffic web sites and such, and what I thought about them. I got the
impression they were a bit shell-shocked by the workload of a high traffic
website. Many paying software positions and startup problems deal with large
data sets, as opposed to the small, easy ones you usually deal with in client
applications, many games, iPhone apps, etc.

