How To Get Experience Working With Large Datasets (bigfastblog.com)
108 points by Anon84 on Feb 21, 2012 | 23 comments


Sometimes, when I want to play with or benchmark a new technology stack, framework, library, etc., I run into the problem of not having a database to test it with. As a result, I spend a non-trivial amount of time (I'm not the best coder) banging out a script that fills a custom database with the fields and datatypes I want. This strikes me as unnecessary 'yak shaving'. My first impulse was to build a web app that creates custom databases, but luckily I googled around a bit and found one that already does that: http://www.generatedata.com/#generator
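For context, the kind of throwaway filler script I mean looks roughly like this (a minimal Python/SQLite sketch; the table and column names are just examples):

    # Throwaway data-filler sketch: invent rows for whatever schema you're testing.
    import random
    import sqlite3
    import string

    conn = sqlite3.connect("testdata.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(id INTEGER PRIMARY KEY, name TEXT, age INTEGER, balance REAL)"
    )

    def random_name(n=8):
        return "".join(random.choice(string.ascii_lowercase) for _ in range(n))

    rows = [
        (random_name(), random.randint(18, 90), round(random.uniform(0, 10000), 2))
        for _ in range(100000)
    ]
    conn.executemany("INSERT INTO users (name, age, balance) VALUES (?, ?, ?)", rows)
    conn.commit()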


Except generatedata has a 200 record limit. Luckily it's open source, so if you can host a PHP app you can bump it up to whatever you like. I did this a few weeks ago (bumped the limit up to 10000):

https://github.com/rgarcia/generatedata

http://smooth-frost-5744.herokuapp.com/


Why even use a database then? Fill your memory with a bunch of objects until you need persistence... Even then a flat file works well until you expect multiple people to post at once.
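Something along those lines is only a few lines of code; a rough sketch (the field names are made up):

    # In-memory objects first; dump to a flat file only when persistence is needed.
    import csv
    import random

    records = [{"id": i, "score": random.random()} for i in range(1000000)]

    # A flat file is plenty until you have concurrent writers.
    with open("records.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "score"])
        writer.writeheader()
        writer.writerows(records)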


There are a lot of other sites dedicated to datasets:

http://aws.amazon.com/datasets

http://getthedata.org/

http://www.kdnuggets.com/datasets/index.html

Infochimps and datamarket.com

reddit's /r/opendata and /r/datasets

http://thedatahub.org/dataset

The NYT and Guardian

The UCI Machine Learning Repository

Kaggle.com

http://trec.nist.gov/data.html

https://sqlazureservices.com/browse/Data (expired cert, but I'm reasonably sure it's a legit MS site)

http://www.aggdata.com/data


Thanks for the resources.



I think it's much more helpful to have an idea of what you want to get out of your dataset. Generating random data can be useful for stress testing systems, but without a clear goal you'll have no idea whether what you're exercising is useful or not.


Random data or mashups of public datasets are good for learning the mechanics of specific processing frameworks, but you really need a clear objective guiding the analysis to understand the concepts behind processing big data.

Random data is good for the how and the with what (to an extent), but not for the when and the why.


Can anybody share any "Aha!" moments coming from crunching big data? It would be interesting and useful to know how you went from theory to actual validation, especially if the result couldn't have been obtained any other way. Thanks


Voter registration files are also freely available.

Here are CSVs for several million names, addresses and voting patterns for the great state of Ohio.

http://www2.sos.state.oh.us/pls/voter/f?p=111:1:322955311706...
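If you grab one of those files, you don't even need to load it all into memory; something like this streams it row by row (the filename and column name here are hypothetical, so check the actual file header):

    # Stream a multi-million-row CSV and tally rows per city without loading it all at once.
    import csv
    from collections import Counter

    counts = Counter()
    with open("ohio_voters.csv", newline="") as f:  # hypothetical filename
        for row in csv.DictReader(f):
            counts[row["RESIDENTIAL_CITY"]] += 1  # hypothetical column name

    print(counts.most_common(10))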


Seems to be down. Here's the cached version: http://webcache.googleusercontent.com/search?q=cache:http://...


It's up again now and staying up. The original server it was on wasn't prepared for the wrath of Hacker News.


Ironic that "BigFastBlog" wasn't prepared for a few thousand (maybe tens of thousands of) hits. It's not loading for me atm either.


Shameless plug, but if you want to work with big data you could just apply for this position at Yammer. DO IT!

https://www.yammer.com/job_description?jvi=ogZcWfw7,Job


Not for me just yet, as I want to finish my studies first, but I wonder what experience you expect from candidates.

My personal experience is that I have no problem understanding data mining and machine learning in theory, but translating that into practice on my own is quite another thing. Is the market open for would-be data scientists with very little experience?


Your link is dead.


Are you using noscript? The link worked once I allowed jobvite & yammer.


I have a potential project that would involve testing somewhere in the range of 500 Cassandra nodes. What are the best tools to use in load testing such installations? Is everything pretty much custom?


Without more information it's very difficult to give advice.

If you're not load testing with your actual application code and real application load, I wouldn't even bother testing. The numbers will be so misleading that it's mostly pointless.

What does it matter if your Cassandra install can do 500,000 writes per second if your real app exhibits lock contention issues that bring that number down to 5,000 per second, or latency issues that bring it down to 50,000?

Since you should be performance testing with real application code and load, you'll need to add two things to your code:

1) Code to record the load (logs can work great for this)

2) Code to play back the load at a multiple

Then I'd add the parameters you want to tune to your testing code and use a genetic algorithm to tune them for your Cassandra install.
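For what it's worth, the record/playback part can be as simple as this sketch (hypothetical log format and a placeholder send_request; swap in your real client calls):

    # Record real requests to a log, then replay them at a multiple of the original rate.
    import json
    import time

    LOG_PATH = "request_log.jsonl"

    def record_request(op, key, value):
        # Append each real production request to a replay log (one JSON object per line).
        with open(LOG_PATH, "a") as f:
            f.write(json.dumps({"ts": time.time(), "op": op, "key": key, "value": value}) + "\n")

    def send_request(op, key, value):
        # Placeholder: issue the same call your application would make against Cassandra.
        pass

    def replay(multiple=2):
        # Replay the recorded load with the original gaps compressed by `multiple`.
        with open(LOG_PATH) as f:
            entries = [json.loads(line) for line in f]
        if not entries:
            return
        start = entries[0]["ts"]
        replay_start = time.time()
        for e in entries:
            target = replay_start + (e["ts"] - start) / multiple
            time.sleep(max(0, target - time.time()))
            send_request(e["op"], e["key"], e["value"])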

So, yes, real load testing always involves custom code. If you're just looking for numbers to impress management, then use whatever, because it's not going to correlate to anything.


You may want to check out the Yahoo Cloud Serving Benchmark; it's a pretty standard load-testing tool for this kind of thing. https://github.com/brianfrankcooper/YCSB/wiki


Another possibility would be to work for companies that have big data already.


That can be a chicken-and-egg problem, though, since many of them only want people who already know how to work with big data at some level.



