

S4 - m0th87
http://s4.io/

======
chr15
The Github repo has an example application:
[https://github.com/s4/examples/tree/master/twittertopiccount...](https://github.com/s4/examples/tree/master/twittertopiccount/src/main/java/io/s4/example/twittertopiccount/)

It's a twitter topic counter: "This application detects popular hashtags on
Twitter by listening to the Twitter gardenhose."

From <http://labs.yahoo.com/event/99> :

"S4 is a general-purpose, distributed, scalable, partially fault-tolerant,
pluggable platform that allows programmers to easily develop applications for
processing continuous unbounded streams of data. Keyed data events are routed
with affinity to Processing Elements (PEs), which consume the events and do
one or both of the following: (1) emit one or more events which may be
consumed by other PEs, (2) publish results. The architecture resembles the
Actors model [1], providing semantics of encapsulation and location
transparency, thus allowing applications to be massively concurrent while
exposing a simple programming interface to application developers."

Actor model: <http://en.wikipedia.org/wiki/Actor_model>
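
To make the "keyed events routed with affinity to PEs" part of that description concrete, here is a minimal single-process sketch in Python. It is not S4's actual API (S4 itself is Java); the names `CountPE`, `Router`, and `dispatch` are hypothetical and exist only for illustration. The idea shown is just that every event with the same key value reaches the same PE instance, which keeps per-key state and emits new events.

```python
class CountPE:
    """Hypothetical processing element: one instance exists per key value."""

    def __init__(self, key):
        self.key = key
        self.count = 0

    def process(self, event):
        # Update per-key state, then emit a derived event that a
        # downstream PE (or a publisher) could consume.
        self.count += 1
        return {"topic": self.key, "count": self.count}


class Router:
    """Hypothetical router: sends each event to the PE that owns its key."""

    def __init__(self, pe_factory, key_field):
        self.pe_factory = pe_factory
        self.key_field = key_field
        self.instances = {}  # key value -> PE instance

    def dispatch(self, event):
        key = event[self.key_field]
        pe = self.instances.get(key)
        if pe is None:
            # First event seen for this key: create its PE lazily.
            pe = self.instances[key] = self.pe_factory(key)
        return pe.process(event)


router = Router(CountPE, "topic")
stream = [{"topic": "#s4"}, {"topic": "#hadoop"}, {"topic": "#s4"}]
emitted = [router.dispatch(e) for e in stream]
```

In a distributed setting the dictionary lookup would presumably be replaced by something like hashing the key to pick a node, but that part is an assumption and is not shown here.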

~~~
jamii
The model is very similar to the one argued for in this paper:

<http://www.cidrdb.org/cidr2007/papers/cidr07p15.pdf>

That paper was a big inspiration when we were redesigning the betting exchange
at smarkets. It's a very well reasoned exposition of why this is the only
sensible architecture for large scale distributed systems.

~~~
pants
Does your version have "full" fault tolerance or only the "partial" fault
tolerance noted on the S4 site?

~~~
jamii
I guess it depends how you define 'full' fault tolerance. In the case of a
machine failure it would lose a couple of seconds' worth of transactions. In a
data-processing scenario that's not an issue - just re-run the data that hasn't
been processed yet. In an exchange the recovery delay is a bit more painful
and there is always the potential to permanently lose transactions.
Unfortunately there doesn't seem to be a way around that - we must have
consistency and performance, so machine failure is always going to cause some
interruption to availability.
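
A rough sketch of the "just re-run the data that hasn't been processed yet" case, in Python. This is an illustration under assumptions, not how smarkets or S4 do it: the `state` dict stands in for durable storage that survives the crash, `run` and `crash_at` are hypothetical names, and the result and the offset are assumed to be saved together.

```python
def run(log, state, crash_at=None):
    """Process log records starting from the saved offset.

    `state` is a stand-in for durable storage: {"offset": int, "total": int}.
    `crash_at` simulates a machine failure before that record is processed.
    """
    for offset in range(state["offset"], len(log)):
        if offset == crash_at:
            raise RuntimeError("simulated machine failure")
        # Assumed to be one atomic update in a real system.
        state["total"] += log[offset]
        state["offset"] = offset + 1
    return state["total"]


log = [5, 7, 11, 13]
state = {"offset": 0, "total": 0}

try:
    run(log, state, crash_at=2)  # fails after two records are processed
except RuntimeError:
    pass

total = run(log, state)  # restart: resumes at the saved offset, no double counting
```

This only works because the input log is still around to replay. In the exchange case, anything that was never written to the log before the failure cannot be recovered this way.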

------
SteveArmstrong
I was hoping to find a "for example, S4 can be used to" line in there, but I
didn't see it initially. I assume filtering the Twitter fire-hose of data
could be a common use?

~~~
LiveTheDream
Here are some examples: <https://github.com/s4/examples>

Sure enough, there is a Twitter example.

------
rchowe
The first thing I thought of was this:

<http://www.supersimplestorageservice.com/>

~~~
maximilian
I also thought something along the lines of, "wasn't this an April Fools'
joke...". I'm glad someone found it so I could see it again.

------
ajessup
Yahoo just released a paper that explains what S4 is and the rationale for its
development, and provides a detailed comparison with Hadoop (and map/reduce
frameworks in general).

<http://labs.yahoo.com/node/476>

------
barrkel
It sounds like a slightly more structured and distributed Unix shell pipeline;
but from looking at the twitter example, a lot more awkward to use, owing to
being structured around Java.

I imagine a composition language (DSL) wrapped around it could improve its
usability - especially ad-hoc experimentation - greatly; at least one better
than Spring IoC xml.

~~~
strlen
You can write clients that stream data in (much like Hadoop streaming),
however.

If you want to stick with the JVM, have you considered using Scala and the
cake pattern?

[http://jonasboner.com/2008/10/06/real-world-scala-
dependency...](http://jonasboner.com/2008/10/06/real-world-scala-dependency-
injection-di.html)

If you want to stick with Java and want to use IoC without the hell that is
Spring, I suggest Guice (which consists of a smaller, cleaner core and uses
annotations and DSLs in place of XML):

<http://code.google.com/p/google-guice/>

------
gfodor
This looks really great. If it delivers as advertised, it will be a very nice
replacement for certain classes of MapReduce jobs.

------
vijayr
This has some examples <https://github.com/s4/examples>

------
dnewcome
I wish I had heard about this a few months ago. I wanted to implement a way to
create and connect streaming web services. I hacked up something then called
webpipes (<https://github.com/dnewcome/webpipes>) using node.js. Unfortunately
I haven't looked at it again since I first put it up on github. S4 looks a ton
more advanced than anything I was envisioning, but I still think that
something simple done in one of the evented servers like node.js (the S4
implementation looks to be Java) would be useful.

~~~
requinot59
A good, simpler than S4, solution may be to use zeromq (PUB/SUB).

------
skullsplitter
In the twitter demo, I noticed this pathological-looking string concat
statement:

[https://github.com/s4/examples/blob/master/twittertopiccount...](https://github.com/s4/examples/blob/master/twittertopiccount/src/main/java/io/s4/example/twittertopiccount/Status.java#L110)

I'm too dim to figure out why it would be done this way (besides the fact that
it's an early proof of concept demo). Any idea?

~~~
mzl
I'm not sure why you think it looks pathological? If you want to construct the
data they want to construct, what would you do?

------
wrath
Does anyone have any real life examples of what this could be used for? I get
what it does, just not quite sure where it fits in.

For example, do I push data into S4, or does S4 poll for data? Is this like a
distributed task system, where I distribute my tasks evenly across multiple
servers seamlessly?

------
bryansum
This seems conceptually similar to <http://www.cascading.org/> (at least
looking at the code examples:
<http://www.cascading.org/1.1/userguide/html/ch02.html>).

~~~
anandkesari
S4 processes streams of data, one element at a time, as they arrive; outputs
are produced incrementally. MapReduce and its derivatives (Cascading, Pig,
etc) are batch-oriented.

Stream processing jobs can be massaged to fit into the MapReduce paradigm, but
S4 provides a more natural solution.
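
A small Python sketch of that difference, using hashtag counting as the job. The function names `batch_count` and `stream_count` are made up for illustration and have nothing to do with the S4 or Hadoop APIs. The batch version needs the whole input before it returns anything; the streaming version yields an updated result after every element.

```python
from collections import Counter


def batch_count(events):
    # Batch style: consume the complete data set, then produce one result.
    return Counter(events)


def stream_count(events):
    # Stream style: handle one element at a time and emit an
    # incremental result as soon as each element arrives.
    counts = Counter()
    for event in events:
        counts[event] += 1
        yield event, counts[event]


events = ["#s4", "#hadoop", "#s4"]
batch_result = batch_count(events)
stream_results = list(stream_count(events))
```

With an unbounded input, `batch_count` would never return, while `stream_count` keeps producing output for as long as events keep arriving.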

------
warstory
website sure looks a lot like <http://mootools.net>

~~~
Inviz
Even header colors on both sites are exactly the same - #c17878.

------
LiveTheDream
Nitpick - the highlighted/emphasized text is essentially indistinguishable
from hyperlinks.

------
nspiegelberg
So... this is basically scribe?
<http://en.wikipedia.org/wiki/Scribe_(log_server)>

~~~
lrm242
Well... No. Scribe is not general purpose.

------
sriramk
My first reaction is that this sounds similar to SQL Server Stream Insight (in
terms of processing continuous streams of data)

------
stupidsignup
Well, I would need a little bit more info than the "detailed information"
presented there.

------
ithkuil
Great hype, but it's still not clear what this is good for. Does anybody know
of any other use case besides the twitter topic count example?

------
sabat
_S4 is a general-purpose, distributed, scalable, partially fault-tolerant,
pluggable platform that allows programmers to easily develop applications for
processing continuous unbounded streams of data._

I'm sure this is cool and useful technology. At this moment, from the
marketing-speak, I have no idea what it does except that it has something to
do with volumes of streaming data. Whose data? Is it a service? (Maybe not,
since you can download it?) What could it do for me (in simple terms)? What's
a basic use case? Why do people assume that we can mind-read?

~~~
olalonde
Upvoted the story just so more people read your comment. This kind of
"marketing-speak" focused on technical details is far too widespread. The
project may be technical by nature but there's got to be a higher level way of
describing it.

~~~
sabat
It's frustrating, isn't it? S4 sounds like it does something cool and useful,
and I'd like to know what this is, just for reference at least. But I don't
have time to go and figure it out on my own; I'm too busy trying to stay
productive.

~~~
sanswork
If you're too busy to spend the time to read the docs to figure out if it's
useful for yourself, how will you ever find the time to learn the system and
implement it into your work processes?

~~~
bluesnowmonkey
Do you read the docs of every new software product that gets released, in
order to see if it's useful to you? No, there aren't enough hours in the day.

