
How To Get Experience Working With Large Datasets - Anon84
http://www.bigfastblog.com/how-to-get-experience-working-with-large-datasets
======
retroafroman
Sometimes, when I want to play with/benchmark a new technology stack,
framework, library, etc., I run into the problem of not having a database to
test it with. As a result, I spend a non-trivial amount of time (I'm not
the best coder) banging out a script that fills a custom database with the
fields and datatypes I want. However, this strikes me as an example of
unnecessary 'yak shaving'. My first impulse is to make a web app that creates
custom databases, but luckily I googled around a bit and found this first
which already does that: <http://www.generatedata.com/#generator>
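
For anyone who'd rather skip the web app entirely, the throwaway script route is usually only a few lines. A minimal Python/SQLite sketch (the table schema, row count, and column choices here are made up for illustration):

```python
import random
import sqlite3
import string

def random_string(n=10):
    """Return a random lowercase string of length n."""
    return "".join(random.choices(string.ascii_lowercase, k=n))

def fill_test_db(path=":memory:", rows=10_000):
    """Create a throwaway table and fill it with random rows."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, email TEXT)"
    )
    # executemany accepts a generator, so this stays memory-friendly
    conn.executemany(
        "INSERT INTO users (name, age, email) VALUES (?, ?, ?)",
        (
            (random_string(), random.randint(18, 90), random_string() + "@example.com")
            for _ in range(rows)
        ),
    )
    conn.commit()
    return conn

conn = fill_test_db(rows=1000)
print(conn.execute("SELECT COUNT(*) FROM users").fetchone()[0])  # 1000
```

Swapping in a different driver (Postgres, MySQL, etc.) is mostly a matter of changing the connect call and placeholder style.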

~~~
rgarcia
Except generatedata has a 200 record limit. Luckily it's open source, so if
you can host a PHP app you can bump it up to whatever you like. I did this a
few weeks ago (bumped the limit up to 10000):

<https://github.com/rgarcia/generatedata> <http://smooth-frost-5744.herokuapp.com/>

------
gtani
There are a lot of other sites dedicated to datasets:

<http://aws.amazon.com/datasets>

<http://getthedata.org/>

<http://www.kdnuggets.com/datasets/index.html>

infochimps, datamarket.com,

reddit/r/opendata and datasets,

<http://thedatahub.org/dataset>,

NYT and Guardian,

UCI machine learning repository,

~~~
gtani
Kaggle.com,

<http://trec.nist.gov/data.html>

<https://sqlazureservices.com/browse/Data> (expired cert, but I'm reasonably
sure it's a legit MS site)

<http://www.aggdata.com/data>

------
zerop
Large datasets open to the public: [http://www.quora.com/Data/Where-can-I-get-large-datasets-ope...](http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public)

------
hogu
I think it's much more helpful to have an idea of what you want to get out of
your dataset. Generating random data can be useful for stress testing systems,
but without a clear goal you'll have no idea whether what you're exercising is
useful or not.

~~~
vonmoltke
Random data or mashups of public datasets are good for learning the mechanics
of specific processing frameworks, but you really need a clear objective
guiding the analysis to understand the concepts behind processing big data.

Random data is good for the _how_ and the _with what_ (to an extent), but not
for the _when_ and the _why_.

------
alexro
Can anybody share any "Aha!" moments coming from crunching big data? It would
be interesting and useful to know how you went from theory to actual
validation, especially if the result couldn't have been obtained any other
way. Thanks

------
padobson
Voter registration files are also freely available.

Here are CSVs for several million names, addresses and voting patterns for the
great state of Ohio.

[http://www2.sos.state.oh.us/pls/voter/f?p=111:1:322955311706...](http://www2.sos.state.oh.us/pls/voter/f?p=111:1:3229553117069219)

------
sid6376
Seems to be down. Here's the cached version:
[http://webcache.googleusercontent.com/search?q=cache:http://...](http://webcache.googleusercontent.com/search?q=cache:http://www.bigfastblog.com/how-to-get-experience-working-with-large-datasets)

~~~
philwhln
It's up again now and staying up. The original server it was on wasn't
prepared for the wrath of Hacker News.

~~~
zrgiu_
Ironic that "BigFastBlog" wasn't prepared for a few thousand (maybe tens of
thousands of) hits. It's not loading atm for me either.

------
besquared
Shameless plug, but if you want to work with big data you could just apply for
this position at Yammer. DO IT!

<https://www.yammer.com/job_description?jvi=ogZcWfw7,Job>

~~~
hessenwolf
Your link is dead.

~~~
draven
Are you using noscript? The link worked once I allowed jobvite & yammer.

------
algoshift
I have a potential project that would involve testing somewhere in the range
of 500 Cassandra nodes. What are the best tools to use in load testing such
installations? Is everything pretty much custom?

~~~
fleitz
Without more information it's very difficult to give advice.

If you're not load testing with your real actual application code and real
application load I wouldn't even bother testing. The numbers will be so
misleading that it's mostly pointless.

What does it matter if your Cassandra install can do 500,000 writes per second
if your real app exhibits lock contention issues that bring that number down
to 5,000 per second, or latency issues that bring it down to 50,000?

Since you should be performance testing with real application code and load
you'll need to add two things to your code:

1) Code to record the load (logs can work great for this)

2) Code to play back the load at a multiple

Then I'd add the parameters you want to tune for to your testing code and use
a genetic algorithm to tune the parameters for your cassandra install.

So, yes, real load testing always involves custom code. If you're just looking
for numbers to impress management then use whatever because it's not going to
correlate to anything.
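
A toy version of that genetic-algorithm tuning loop might look like this (Python; the parameter names and the fake fitness function are placeholders, since a real `evaluate` would replay recorded load against the cluster and measure throughput):

```python
import random

def genetic_tune(evaluate, bounds, pop_size=20, generations=30):
    """Tune integer parameters with a simple genetic algorithm.
    `evaluate(params)` returns a score (higher is better);
    `bounds` maps parameter name -> (lo, hi) inclusive."""
    names = list(bounds)
    def random_params():
        return {n: random.randint(*bounds[n]) for n in names}
    pop = [random_params() for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the top quarter as the breeding pool
        elite = sorted(pop, key=evaluate, reverse=True)[: pop_size // 4]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = random.sample(elite, 2)
            # Crossover: each parameter comes from one parent
            child = {n: random.choice((a[n], b[n])) for n in names}
            # Occasional mutation keeps the search exploring
            if random.random() < 0.3:
                n = random.choice(names)
                child[n] = random.randint(*bounds[n])
            children.append(child)
        pop = elite + children
    return max(pop, key=evaluate)

# Placeholder fitness: pretend throughput peaks at batch_size=64, threads=16
best = genetic_tune(
    lambda p: -abs(p["batch_size"] - 64) - abs(p["threads"] - 16),
    {"batch_size": (1, 256), "threads": (1, 64)},
)
```

Each `evaluate` call is expensive against a real cluster, so in practice you'd cache scores and keep the population small.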

------
faucet
Another possibility would be to work for companies that have big data already.

~~~
vonmoltke
That can be a chicken-and-egg problem, though, since many of them only want
people who already know how to work with big data at some level.

