Of course, that still might not be the result you want.
Maybe we should have a different storage strategy if the data is too big? File storage? I just meant for it to be simple.
If you are going to use Redis for storage, you'll need to fine-tune it for the kind of processing you are doing (we have).
Why restrict yourself to sequential reducers when you can parallelize with partitioning and sorting?
We thought of parallel reducers and it does make a lot of sense. The reason they are sequential is to get a first release out so we can juggle ideas with people. If you care to contribute we'd love it. Even if you just create an issue.
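The parallel-reducer idea floated above can be sketched roughly like this. This is hypothetical, not r³'s actual API: mapper output is partitioned by key hash, and each partition is reduced independently. The workers here are threads for brevity; in a real deployment each partition would go to a separate reducer process or machine.

```python
# Hypothetical sketch of parallel reducers via partitioning (not r³'s API).
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def partition(mapped, n):
    """Route each (key, value) pair to one of n partitions by key hash."""
    parts = [defaultdict(list) for _ in range(n)]
    for key, value in mapped:
        parts[hash(key) % n][key].append(value)
    return parts

def reduce_partition(part):
    """Example reducer: sum the values emitted for each key."""
    return {key: sum(values) for key, values in part.items()}

mapped = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("b", 1)]
with ThreadPoolExecutor(max_workers=2) as pool:
    reduced = list(pool.map(reduce_partition, partition(mapped, 2)))

# Merging is safe because each key lives in exactly one partition.
merged = {key: count for part in reduced for key, count in part.items()}
# merged == {'a': 2, 'b': 2, 'c': 1}
```

Because partitioning guarantees a key never spans two partitions, the reducers need no coordination at all, which is what makes this phase embarrassingly parallel.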
r³ was designed from the ground up to adhere to HTTP. That means it's pretty easy to scale using our old and well-proven techniques: caching and load-balancing.
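To illustrate the caching half of that claim: because the endpoints speak plain HTTP, standard cache headers are all you need for any browser, reverse proxy, or CDN to reuse responses. The helper and header values below are illustrative, not r³'s actual ones.

```python
# Illustrative only: standard HTTP cache headers for a JSON response body.
import hashlib
from email.utils import formatdate

def cache_headers(body: bytes, max_age: int = 60) -> dict:
    """Build Cache-Control, ETag, and Date headers for a response body."""
    return {
        "Cache-Control": f"public, max-age={max_age}",
        # A strong ETag lets caches revalidate with If-None-Match -> 304.
        "ETag": '"%s"' % hashlib.sha1(body).hexdigest(),
        "Date": formatdate(usegmt=True),
    }

headers = cache_headers(b'{"result": 42}')
# Any intermediary that understands HTTP caching can now serve repeats
# of this response for 60 seconds without touching the backend.
```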
Here's a short version:
There's a collective ecosystem problem: fragmented applications, not-quite-right command-line utilities, web interfaces that look like they were designed in 1995, noisy log files people actually have to read constantly, and cross-coupled dependencies that make keeping a cluster live for production use a full-time job.
There's the programming problem that nobody actually writes Hadoop MapReduce code by hand because it's impossibly complicated. Everybody uses Hive, Pig, and half a dozen other tools that compile down to pre-templated Java classes (which costs you 5% to 30% of the performance you could get by hand).
It hasn't grown because it's so amazing, performant, and company-saving. It grows because people jumped on the bandwagon and then got stuck with a few hundred TB in HDFS. The lack of a competing project with equal mindshare and battle-testing doesn't foster any competition. It's the MySQL of distributed processing systems: it works (mostly), but it breaks (in a few dozen known ways), so people keep adding features and building on top of it.
There are all sorts of oddities and you can mostly work around them but it is...exhausting, and I spend a lot of time thinking "surely there must be a better way".
Have you heard of any other projects outside of Disco that are more performant than Hadoop when used for similar applications?
I worked on a contract for a large, very well-known social networking company a while back who refused to consider Hadoop because of this.
If you don't have that much data, MR on Redis is fine.
- 67 characters
brew install redis
- 31 characters
sudo make install
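For a sense of what "MR on Redis" can look like, here's a hypothetical word-count sketch (not r³'s actual API): the map step emits (word, 1) pairs into a Redis list per word, and the reduce step drains each list. To keep it runnable without a server, FakeRedis is a tiny in-memory stand-in for the three commands used; against a real server you would swap in `redis.Redis()` from redis-py unchanged.

```python
# Hypothetical MapReduce-on-Redis word count (not r³'s actual API).
from collections import defaultdict

class FakeRedis:
    """In-memory stand-in for the Redis commands this sketch uses."""
    def __init__(self):
        self.lists = defaultdict(list)
        self.sets = defaultdict(set)
    def rpush(self, key, value):
        self.lists[key].append(value)
    def sadd(self, key, member):
        self.sets[key].add(member)
    def smembers(self, key):
        return self.sets[key]
    def lrange(self, key, start, stop):
        return self.lists[key] if stop == -1 else self.lists[key][start:stop + 1]

r = FakeRedis()  # swap for redis.Redis() with a real server

def map_words(text):
    """Map step: emit (word, 1) into a per-word Redis list."""
    for word in text.split():
        r.rpush("mr:" + word, 1)
        r.sadd("mr:keys", word)   # track seen keys for the reduce step

def reduce_counts():
    """Reduce step: drain each word's list and sum the emitted values."""
    return {w: sum(r.lrange("mr:" + w, 0, -1)) for w in r.smembers("mr:keys")}

map_words("to be or not to be")
counts = reduce_counts()
# counts == {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The per-key list plus a set of seen keys is the simplest layout; a real setup would also namespace keys per job and expire them when the job finishes.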
We use Tornado for the stream (the task processor). Since each stream process handles one request at a time, only one user's task runs on it at once.
That said, the stream is just an HTTP application.
This means that you can scale it as easily as you would any web app.
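A minimal demonstration of that point, using the stdlib instead of Tornado so it runs anywhere: each stream process is an independent HTTP endpoint, so you can start N of them and round-robin requests across them like any web app. The handler body and the trivial round-robin loop are stand-ins, not r³'s code.

```python
# Sketch: N independent HTTP backends plus naive round-robin dispatch.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

class StreamHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # A real stream would run a map/reduce task here; we just report
        # which backend served the request.
        body = json.dumps({"served_by": self.server.server_port}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):  # keep the demo quiet
        pass

def start_backend():
    """Start one backend on an OS-assigned free port."""
    server = HTTPServer(("127.0.0.1", 0), StreamHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

backends = [start_backend() for _ in range(2)]

# "Load balancer": round-robin four requests across the two backends.
rotation = [backends[i % 2].server_port for i in range(4)]
served = [json.load(urlopen(f"http://127.0.0.1:{p}/"))["served_by"]
          for p in rotation]
# served matches the rotation: each request hit the backend it was sent to

for server in backends:
    server.shutdown()
```

In production the round-robin loop would be nginx, HAProxy, or any off-the-shelf load balancer, which is exactly the point: no MapReduce-specific scaling machinery is needed.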
Is there something like this for PHP?