

Yahoo Open Sources S4 Real-Time MapReduce Framework - atarashi
https://github.com/s4/core

======
rektide
The entire point is that it's not MapReduce; it's a stream processing system.
Try <http://wiki.s4.io/Manual/S4Overview#toc1> for a _what is S4_ intro.

 _The architecture resembles the Actors model, providing semantics of
encapsulation and location transparency, thus allowing applications to be
massively concurrent while exposing a simple programming interface to
application developers._

and

 _S4 is a general-purpose, distributed, scalable, partially fault-tolerant,
pluggable platform that allows programmers to easily develop applications for
processing continuous unbounded streams of data._

Go stream processing. Also see _Flume_ for related recent work.
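
To make the difference from batch MapReduce concrete, here is a minimal sketch of the keyed, event-at-a-time idea described in the quotes above. This is not the S4 API: `KeyedStreamSketch`, `CounterElement`, and `dispatch` are hypothetical names, and a real system would also spread the per-key elements across nodes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names come from S4.
final class KeyedStreamSketch {

    /** Holds state for exactly one key, much like an actor owns its own state. */
    static final class CounterElement {
        private long count;

        /** Handles a single event and returns the running count for this key. */
        long onEvent(String value) {
            return ++count;
        }
    }

    // One element per distinct key, created lazily as keys show up in the stream.
    private final Map<String, CounterElement> elements = new HashMap<>();

    /** Routes one event to the element that owns its key. */
    long dispatch(String key, String value) {
        return elements.computeIfAbsent(key, k -> new CounterElement()).onEvent(value);
    }
}
```

Each event updates state as soon as it arrives, so there is no point at which the input has to be complete, which is what an unbounded stream needs.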

~~~
DanielRibeiro
Seems very much like Percolator from Google:
<http://www.infoq.com/news/2010/10/google-percolator>

~~~
schumihan
I don't think so. Percolator does not support stream operators, and it
supports transactional semantics. A stream processing engine like S4 will
never handle the race conditions brought by multiple writers.

------
Groxx
That's not a readme! At best that's a weak INSTALL.

Anyone care to explain to me how this works? "Real-time MapReduce" sounds
almost like an oxymoron.

The project's site ( <http://s4.io/> ) isn't very helpful either. Full source:

    
    
      the S4 open source project. coming soon.

~~~
varworld
<http://wiki.s4.io/>

------
swah
It's kind of funny how almost no one is advised to write code in Java these
days, and on the other hand really cool infrastructure projects like this or
Cassandra are almost always written in Java.

I understand they have to perform better than the average webapp, but
still...funny.

~~~
beagle3
My recent experience with Java code is that efficiency-conscious Java code is
ridiculously under-performing compared to efficiency-conscious C code.

By virtue of Java's structure and culture (as codified in its libraries), you
have to spend approximately twice as much physical memory as you would in C.
E.g., I had a 100,000,000 record array that I needed to sort. That's great --
Java has an Arrays.sort() method. But I needed to know the _sorting
permutation_ as well. That requires an array of Integers, each taking >=16
bytes (so, 1.6GB of RAM), which slows everything to a crawl. So I wrote my own
heapsort routine which needed only 400MB of RAM. But it's 10 times slower than
the equivalent C -- I suspect because of array bounds checks that the compiler
can't throw away.
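
For reference, the sorting permutation can be kept in a primitive `int[]` (4 bytes per entry, so roughly 400MB for 100,000,000 records) instead of an `Integer[]`. A minimal sketch, assuming the keys are `long`s; `SortPermutation` and `sortingPermutation` are made-up names for illustration, and this is not the routine described above:

```java
// Illustrative sketch: heapsort over an index array, leaving the keys untouched.
final class SortPermutation {

    /** Returns perm such that keys[perm[0]] <= keys[perm[1]] <= ... */
    static int[] sortingPermutation(long[] keys) {
        int n = keys.length;
        int[] perm = new int[n];
        for (int i = 0; i < n; i++) {
            perm[i] = i; // start from the identity permutation
        }
        // Build a max-heap of indices, ordered by the keys they point at.
        for (int i = n / 2 - 1; i >= 0; i--) {
            siftDown(keys, perm, i, n);
        }
        // Repeatedly move the largest remaining index to the end of the heap.
        for (int end = n - 1; end > 0; end--) {
            swap(perm, 0, end);
            siftDown(keys, perm, 0, end);
        }
        return perm;
    }

    private static void siftDown(long[] keys, int[] perm, int root, int size) {
        while (true) {
            int child = 2 * root + 1;
            if (child >= size) {
                return; // root is a leaf
            }
            // Pick the child that points at the larger key.
            if (child + 1 < size && keys[perm[child + 1]] > keys[perm[child]]) {
                child++;
            }
            if (keys[perm[root]] >= keys[perm[child]]) {
                return; // heap property holds
            }
            swap(perm, root, child);
            root = child;
        }
    }

    private static void swap(int[] a, int i, int j) {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}
```

Every comparison goes through `keys[perm[i]]`, so each one costs two bounds-checked array accesses; that is one place where the JIT may or may not be able to remove the checks.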

Those infrastructure projects are "cool", but horribly done. I don't know
about Cassandra, but Hadoop makes jobs take between twice and ten times the
resources that they need. Sure, it scales to 4000 machines, but you'd only
need 1000 if it were properly written. And if you only have a 10-node cluster,
you could probably run it on one machine with comparable throughput.

Color me unimpressed with Java.

~~~
swah
I was actually thinking of Ruby/Python folks; perhaps there is a group that
will skip Java/C/C++ entirely.

Against C there is always the "Java makes it more difficult for you to mess
things up" argument...

~~~
beagle3
Yep. It's much harder to mess things up when you use regular expressions.

Between C, Python and Jython, I don't think there's a single reason left to
use Java.

------
anandkesari
@atarashi: S4 is a stream processing platform that has been developed at
Yahoo!. The website and GitHub repository are under development. Follow us on
Twitter: <http://twitter.com/s4project>

------
yarapavan
Website is live: <http://s4.io/>

