Sometimes, when I want to play with or benchmark a new technology stack, framework, library, etc., I run into the problem of not having a database to test it with. As a result, I spend a non-trivial amount of time (I'm not the best coder) banging out a script that fills a custom database with the fields and datatypes I want. This strikes me as an example of unnecessary 'yak shaving'. My first impulse was to make a web app that creates custom databases, but luckily I googled around a bit and found this, which already does that: http://www.generatedata.com/#generator
Except generatedata has a 200-record limit. Luckily it's open source, so if you can host a PHP app you can bump it up to whatever you like. I did this a few weeks ago (bumped the limit up to 10000).
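For reference, here's roughly the kind of throwaway script I keep rewriting. It's just a minimal sketch assuming Python with the third-party faker package and SQLite; the table name and columns are placeholders, not anything taken from generatedata:

    # Fill a throwaway SQLite database with fake rows.
    # Requires the third-party "faker" package (pip install faker).
    import random
    import sqlite3

    from faker import Faker

    fake = Faker()
    conn = sqlite3.connect("testdata.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(id INTEGER PRIMARY KEY, name TEXT, email TEXT, age INTEGER)"
    )

    # 10,000 rows of made-up names, emails, and ages.
    rows = [(fake.name(), fake.email(), random.randint(18, 90)) for _ in range(10000)]
    conn.executemany("INSERT INTO users (name, email, age) VALUES (?, ?, ?)", rows)
    conn.commit()
    conn.close()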
Why even use a database then? Fill your memory with a bunch of objects until you need persistence... Even then a flat file works well until you expect multiple people to post at once.
I think it's much more helpful to have an idea of what you want to get out of your dataset. Generating random data can be useful for stress testing systems, but without a clear goal you'll have no idea whether what you're exercising is useful or not.
Random data or mashups of public datasets are good for learning the mechanics of specific processing frameworks, but you really need a clear objective guiding the analysis to understand the concepts behind processing big data.
Random data is good for the how and the with what (to an extent), but not for the when and the why.
Can anybody share any "Aha!" moments coming from crunching big data? It would be interesting and useful to know how you went from theory to actual validation, especially if the result couldn't have been obtained any other way. Thanks
Not for me just yet as I want to finish my studies first, but I wondered what experience you expect from candidates.
My personal experience is that I have no problem understanding data mining and machine learning in theory, but translating that into practice on my own is quite another thing. Is the market open for would-be data scientists with very little experience?
I have a potential project that would involve testing somewhere in the range of 500 Cassandra nodes. What are the best tools to use in load testing such installations? Is everything pretty much custom?
Without more information it's very difficult to give advice.
If you're not load testing with your actual application code and real application load, I wouldn't even bother testing. The numbers will be so misleading that it's mostly pointless.
What does it matter if your Cassandra install can do 500,000 writes per second if your real app exhibits lock contention issues that bring that number down to 5,000 per second, or latency issues that bring it down to 50,000?
Since you should be performance testing with real application code and load, you'll need to add two things to your code:
1) Code to record the load (logs can work great for this)
2) Code to play back the load at a multiple of the recorded rate (a rough sketch of both follows)
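To make that concrete, here's a rough sketch of what I mean by record and playback. It's Python with made-up names, and `execute` stands in for whatever actually issues the operation against your app or cluster:

    # Record each operation with a timestamp, then replay the log with the
    # original inter-arrival gaps divided by a speedup factor (the "multiple").
    import json
    import time

    def record(logfile, op, payload):
        # Append one operation to a newline-delimited JSON log.
        with open(logfile, "a") as f:
            f.write(json.dumps({"ts": time.time(), "op": op, "payload": payload}) + "\n")

    def playback(logfile, execute, speedup=10.0):
        # Replay logged operations, compressing the recorded gaps by `speedup`.
        with open(logfile) as f:
            entries = [json.loads(line) for line in f]
        for prev, cur in zip(entries, entries[1:]):
            execute(prev["op"], prev["payload"])
            time.sleep((cur["ts"] - prev["ts"]) / speedup)
        if entries:
            execute(entries[-1]["op"], entries[-1]["payload"])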
Then I'd add the parameters you want to tune to your testing code and use a genetic algorithm to find good values for your Cassandra install.
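Here's a toy sketch of the kind of tuning loop I'm describing. The parameter names and ranges are just examples, and run_benchmark() is a stand-in for "apply the settings, replay the recorded load, return ops/sec"; it isn't a real Cassandra API:

    # Toy genetic-algorithm tuner. PARAM_RANGES and run_benchmark() are placeholders.
    import random

    PARAM_RANGES = {
        "concurrent_writes": (16, 256),
        "memtable_flush_writers": (1, 8),
    }

    def random_individual():
        return {k: random.randint(lo, hi) for k, (lo, hi) in PARAM_RANGES.items()}

    def crossover(a, b):
        return {k: random.choice((a[k], b[k])) for k in PARAM_RANGES}

    def mutate(ind, rate=0.3):
        return {k: (random.randint(lo, hi) if random.random() < rate else ind[k])
                for k, (lo, hi) in PARAM_RANGES.items()}

    def run_benchmark(params):
        # Placeholder fitness: in real life, push the settings to the cluster,
        # replay the recorded load, and return the measured throughput.
        return sum(params.values())

    def tune(generations=10, pop_size=8, keep=4):
        pop = [random_individual() for _ in range(pop_size)]
        for _ in range(generations):
            parents = sorted(pop, key=run_benchmark, reverse=True)[:keep]
            children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                        for _ in range(pop_size - keep)]
            pop = parents + children
        return max(pop, key=run_benchmark)

    print(tune())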
So, yes, real load testing always involves custom code. If you're just looking for numbers to impress management, then use whatever, because it's not going to correlate to anything.