How To Get Experience Working With Large Datasets (bigfastblog.com)
108 points by Anon84 on Feb 21, 2012 | 23 comments


Sometimes, when I want to play with or benchmark a new technology stack, framework, library, etc., I run into the problem of not having a database to test it with. As a result, I spend a non-trivial amount of time (I'm not the best coder) banging out a script that fills a custom database with the fields and datatypes I want. This strikes me as unnecessary 'yak shaving'. My first impulse was to build a web app that creates custom databases, but luckily I googled around a bit and found one that already does that: http://www.generatedata.com/#generator
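For context, the kind of throwaway filler script I mean looks roughly like this (a minimal Python/SQLite sketch; the table and column names are just examples):

    # Throwaway data-filler sketch: invent rows for whatever schema you're testing.
    import random
    import sqlite3
    import string

    conn = sqlite3.connect("testdata.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(id INTEGER PRIMARY KEY, name TEXT, age INTEGER, balance REAL)"
    )

    def random_name(n=8):
        return "".join(random.choice(string.ascii_lowercase) for _ in range(n))

    rows = [
        (random_name(), random.randint(18, 90), round(random.uniform(0, 10000), 2))
        for _ in range(100000)
    ]
    conn.executemany("INSERT INTO users (name, age, balance) VALUES (?, ?, ?)", rows)
    conn.commit()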


Except generatedata has a 200 record limit. Luckily it's open source, so if you can host a PHP app you can bump it up to whatever you like. I did this a few weeks ago (bumped the limit up to 10000):

https://github.com/rgarcia/generatedata

http://smooth-frost-5744.herokuapp.com/


Why even use a database then? Fill your memory with a bunch of objects until you need persistence... Even then a flat file works well until you expect multiple people to post at once.
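Something along those lines is only a few lines of code; a rough sketch (the field names are made up):

    # In-memory objects first; dump to a flat file only when persistence is needed.
    import csv
    import random

    records = [{"id": i, "score": random.random()} for i in range(1000000)]

    # A flat file is plenty until you have concurrent writers.
    with open("records.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "score"])
        writer.writeheader()
        writer.writerows(records)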


There are a lot of other sites dedicated to datasets:

http://aws.amazon.com/datasets

http://getthedata.org/

http://www.kdnuggets.com/datasets/index.html

Infochimps and datamarket.com

reddit's /r/opendata and /r/datasets

http://thedatahub.org/dataset

The NYT and Guardian

The UCI Machine Learning Repository

Kaggle.com

http://trec.nist.gov/data.html

https://sqlazureservices.com/browse/Data (expired cert, but I'm reasonably sure it's a legit MS site)

http://www.aggdata.com/data


Thanks for the resources.



I think it's much more helpful to have an idea of what you want to get out of your dataset. Generating random data can be useful for stress testing systems, but without a clear goal you'll have no idea whether what you're exercising is useful or not.


Random data or mashups of public datasets are good for learning the mechanics of specific processing frameworks, but you really need a clear objective guiding the analysis to understand the concepts behind processing big data.

Random data is good for the how and the with what (to an extent), but not for the when and the why.


Can anybody share any "Aha!" moments coming from crunching big data? It would be interesting and useful to know how you went from theory to actual validation, especially if the result couldn't have been obtained any other way. Thanks


Voter registration files are also freely available.

Here are CSVs for several million names, addresses and voting patterns for the great state of Ohio.

http://www2.sos.state.oh.us/pls/voter/f?p=111:1:322955311706...
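If you grab one of those files, you don't even need to load it all into memory; something like this streams it row by row (the filename and column name here are hypothetical, so check the actual file header):

    # Stream a multi-million-row CSV and tally rows per city without loading it all at once.
    import csv
    from collections import Counter

    counts = Counter()
    with open("ohio_voters.csv", newline="") as f:  # hypothetical filename
        for row in csv.DictReader(f):
            counts[row["RESIDENTIAL_CITY"]] += 1  # hypothetical column name

    print(counts.most_common(10))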


Seems to be down. Here's the cached version: http://webcache.googleusercontent.com/search?q=cache:http://...


It's up again now and staying up. The original server it was on wasn't prepared for the wrath of Hacker News.


Ironic that "BigFastBlog" wasn't prepared for a few thousand (maybe tens of thousands of) hits. It's not loading for me atm either.


Shameless plug, but if you want to work with big data you could just apply for this position at Yammer. DO IT!

https://www.yammer.com/job_description?jvi=ogZcWfw7,Job


Not for me just yet, as I want to finish my studies first, but I wonder what experience you expect from candidates.

My personal experience is that I have no problem understanding data mining and machine learning in theory, but translating that into practice on my own is quite another thing. Is the market open for would-be data scientists with very little experience?


Your link is dead.


Are you using noscript? The link worked once I allowed jobvite & yammer.


I have a potential project that would involve testing somewhere in the range of 500 Cassandra nodes. What are the best tools to use in load testing such installations? Is everything pretty much custom?


Without more information it's very difficult to give advice.

If you're not load testing with your actual application code and real application load, I wouldn't even bother testing. The numbers will be so misleading that it's mostly pointless.

What does it matter if your Cassandra install can do 500,000 writes per second if your real app exhibits lock contention issues that bring that number down to 5,000 per second, or latency issues that bring it down to 50,000?

Since you should be performance testing with real application code and load, you'll need to add two things to your code:

1) Code to record the load (logs can work great for this)

2) Code to play back the load at a multiple

Then I'd add the parameters you want to tune to your testing code and use a genetic algorithm to tune them for your Cassandra install.
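For what it's worth, the record/playback part can be as simple as this sketch (hypothetical log format and a placeholder send_request; swap in your real client calls):

    # Record real requests to a log, then replay them at a multiple of the original rate.
    import json
    import time

    LOG_PATH = "request_log.jsonl"

    def record_request(op, key, value):
        # Append each real production request to a replay log (one JSON object per line).
        with open(LOG_PATH, "a") as f:
            f.write(json.dumps({"ts": time.time(), "op": op, "key": key, "value": value}) + "\n")

    def send_request(op, key, value):
        # Placeholder: issue the same call your application would make against Cassandra.
        pass

    def replay(multiple=2):
        # Replay the recorded load with the original gaps compressed by `multiple`.
        with open(LOG_PATH) as f:
            entries = [json.loads(line) for line in f]
        if not entries:
            return
        start = entries[0]["ts"]
        replay_start = time.time()
        for e in entries:
            target = replay_start + (e["ts"] - start) / multiple
            time.sleep(max(0, target - time.time()))
            send_request(e["op"], e["key"], e["value"])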

So, yes, real load testing always involves custom code. If you're just looking for numbers to impress management, then use whatever, because it's not going to correlate to anything.


You may want to check out the Yahoo Cloud Serving Benchmark; it's a pretty standard load-testing tool for this kind of thing. https://github.com/brianfrankcooper/YCSB/wiki


Another possibility would be to work for companies that have big data already.


That can be a chicken-and-egg problem, though, since many of them only want people who already know how to work with big data at some level.



