S4 (s4.io)
350 points by m0th87 on Nov 4, 2010 | 59 comments

The Github repo has an example application: https://github.com/s4/examples/tree/master/twittertopiccount...

It's a twitter topic counter: "This application detects popular hashtags on Twitter by listening to the Twitter gardenhose."

From http://labs.yahoo.com/event/99 :

"S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. Keyed data events are routed with affinity to Processing Elements (PEs), which consume the events and do one or both of the following: (1) emit one or more events which may be consumed by other PEs, (2) publish results. The architecture resembles the Actors model [1], providing semantics of encapsulation and location transparency, thus allowing applications to be massively concurrent while exposing a simple programming interface to application developers."
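The keyed-routing idea in that description can be sketched in a few lines. This is purely illustrative (it is not S4's actual API, which is Java): events carrying a key are routed with affinity to a Processing Element, one PE instance per distinct key, and each PE either emits or publishes.

```python
# Illustrative sketch (not S4's real API): keyed events are routed with
# affinity to Processing Elements (PEs), which consume them and publish
# results. One PE instance per distinct key is a simplification.

class ProcessingElement:
    def __init__(self, key):
        self.key = key
        self.count = 0

    def process(self, event):
        # Consume the event; here we just publish a running count.
        self.count += 1
        return {"key": self.key, "count": self.count}

class Router:
    """Routes each event to the PE responsible for its key."""
    def __init__(self):
        self.pes = {}

    def route(self, event):
        key = event["key"]
        pe = self.pes.setdefault(key, ProcessingElement(key))
        return pe.process(event)

router = Router()
results = [router.route({"key": tag}) for tag in ["#s4", "#hn", "#s4"]]
print(results[-1])  # second "#s4" event arrives at the same PE
```

Because routing is keyed, all events for "#s4" land on the same PE, which is what lets per-key state stay local while the system as a whole runs massively concurrently.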

Actor model: http://en.wikipedia.org/wiki/Actor_model

I always thought it would be very interesting to try and build a language that had an Actor focus but used simple objects for basic stuff (a pure model seems a little odd - I guess C++ for actors). It just seems like a natural way to organize a big system. We talk about certain objects as doing things to other objects, and it would provide a simpler concurrency model.

Actors, Linda, tuples... hmm...

The model is very similar to the one argued for in this paper:


That paper was a big inspiration when we were redesigning the betting exchange at smarkets. It's a very well reasoned exposition of why this is the only sensible architecture for large scale distributed systems.

Does your version have "full" fault tolerance, or only the "partial" fault tolerance noted on the S4 site?

I guess it depends how you define "full" fault tolerance. In the case of a machine failure it would lose a couple of seconds' worth of transactions. In a data-processing scenario that's not an issue - just re-run the data that hasn't been processed yet. In an exchange the recovery delay is a bit more painful, and there is always the potential to permanently lose transactions. Unfortunately there doesn't seem to be a way around that - we must have consistency and performance, so machine failure is always going to cause some interruption to availability.

I was hoping to find a "for example, S4 can be used to" line in there, but I didn't see it initially. I assume filtering the Twitter fire-hose of data could be a common use?

Here are some examples: https://github.com/s4/examples

Sure enough, there is a Twitter example.

The first thing I thought of was this:


I also thought something along the lines, "wasn't this an April fools joke...". I'm glad someone found it so I could see it again.

Yahoo just released a paper explaining what S4 is and the rationale for its development, and providing a detailed comparison with Hadoop (and map/reduce frameworks in general).


It sounds like a slightly more structured and distributed Unix shell pipeline; but from looking at the twitter example, a lot more awkward to use, owing to being structured around Java.
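The shell-pipeline analogy is apt: stages consume a stream and pass results downstream one element at a time. A hypothetical sketch with Python generators (nothing to do with S4's actual Java API) makes the comparison concrete:

```python
# Stream stages composed like a shell pipeline: each stage consumes an
# iterator and yields to the next, the way processes share a pipe.
# Illustrative only; S4 itself distributes such stages across nodes.

def tokenize(lines):
    for line in lines:
        for word in line.split():
            yield word

def only_hashtags(words):
    for word in words:
        if word.startswith("#"):
            yield word

def counted(tags):
    counts = {}
    for tag in tags:
        counts[tag] = counts.get(tag, 0) + 1
        yield dict(counts)  # incremental snapshot after each event

stream = ["trying #s4 today", "#s4 looks neat", "back to #hadoop"]
snapshots = list(counted(only_hashtags(tokenize(stream))))
print(snapshots[-1])
```

The awkwardness the parent mentions is that expressing exactly this kind of composition in Java requires a class and configuration per stage, which is where a DSL could help.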

I imagine a composition language (DSL) wrapped around it could greatly improve its usability - especially for ad-hoc experimentation; at least one better than Spring IoC XML.

You can write clients that stream data in (much like Hadoop streaming), however.

If you want to stick with the JVM, have you considered using Scala and the cake pattern?


If you want to stick with Java and want to use IoC without the hell that is Spring, I suggest Guice (which consists of a smaller, cleaner core and uses annotations and DSLs in place of XML):


This looks really great. If it delivers as advertised, it will be a very nice replacement for certain classes of MapReduce jobs.

This has some examples https://github.com/s4/examples

I wish I had heard about this a few months ago. I wanted to implement a way to create and connect streaming web services. I hacked up something then called webpipes (https://github.com/dnewcome/webpipes) using node.js. Unfortunately I haven't looked at it again since I first put it up on github. S4 looks a ton more advanced than anything I was envisioning, but I still think that something simple done in one of the evented servers like node.js (the S4 implementation looks to be Java) would be useful.

A good, simpler than S4, solution may be to use zeromq (PUB/SUB).
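The PUB/SUB pattern the parent suggests can be shown with a minimal in-process stand-in. This is not zeromq itself - real zeromq would run each side in its own process over sockets (PUB/SUB socket types, with subscribers filtering on a message prefix) - but the filtering semantics mimicked here are the same:

```python
# A minimal in-process stand-in for zeromq-style PUB/SUB. Real zeromq
# uses PUB and SUB sockets across processes; SUB sockets filter
# messages by topic prefix, which is mimicked with startswith() here.

class Publisher:
    def __init__(self):
        self.subscribers = []

    def subscribe(self, prefix, callback):
        # Like a SUB socket subscribing to a topic prefix.
        self.subscribers.append((prefix, callback))

    def publish(self, message):
        # Deliver to every subscriber whose prefix matches.
        for prefix, callback in self.subscribers:
            if message.startswith(prefix):
                callback(message)

pub = Publisher()
seen = []
pub.subscribe("hashtag:", seen.append)
pub.publish("hashtag:#s4")
pub.publish("user:alice")   # filtered out by the prefix subscription
print(seen)
```

What you give up versus S4 is the keyed routing, fault tolerance, and cluster management; what you gain is a much smaller moving part.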

In the twitter demo, I noticed this pathological-looking string concat statement:


I'm too dim to figure out why it would be done this way (besides the fact that it's an early proof-of-concept demo). Any ideas?

I'm not sure why you think it looks pathological? If you want to construct the data they want to construct, what would you do?

This seems conceptually similar to http://www.cascading.org/ (at least looking the code examples: http://www.cascading.org/1.1/userguide/html/ch02.html).

S4 processes streams of data, one element at a time, as they arrive; outputs are produced incrementally. MapReduce and its derivatives (Cascading, Pig, etc) are batch-oriented.

Stream processing jobs can be massaged to fit into the MapReduce paradigm, but S4 provides a more natural solution.
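The batch-versus-stream distinction above can be made concrete with a toy comparison (illustrative only, not either framework's API): a batch job reads the whole dataset before producing anything, while a stream processor emits an updated result per element.

```python
# Contrast between batch and stream processing of the same events.
# A batch job produces one answer after reading everything; a stream
# processor produces an incremental answer as each event arrives.

def batch_count(events):
    counts = {}
    for e in events:
        counts[e] = counts.get(e, 0) + 1
    return counts  # available only after the entire input is read

def stream_count(events):
    counts = {}
    for e in events:
        counts[e] = counts.get(e, 0) + 1
        yield e, counts[e]  # an incremental output per event

events = ["#a", "#b", "#a"]
print(batch_count(events))
print(list(stream_count(events)))
```

On an unbounded input, `batch_count` never returns at all, while `stream_count` keeps yielding - which is exactly why massaging streams into MapReduce (by cutting them into time-windowed batches) is less natural.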

Does anyone have any real life examples of what this could be used for? I get what it does, just not quite sure where it fits in.

For example, do I push data into S4, does S4 poll for data. Is this like a distributed task system, where I distribute my tasks evenly across multiple servers seamlessly?

The website sure looks a lot like http://mootools.net

Even the header colors on both sites are exactly the same - #c17878.

Nitpick - the highlighted/emphasized text is essentially indistinguishable from hyperlinks.

So... this is basically scribe? http://en.wikipedia.org/wiki/Scribe_(log_server)

Well... No. Scribe is not general purpose.

My first reaction is that this sounds similar to SQL Server Stream Insight (in terms of processing continuous streams of data)

Well, I would need a little bit more info than the "detailed information" presented there.

Great hype, but it's still not clear what this is good for. Does anybody know about any other use case besides the twitter topic count example?

S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.

I'm sure this is cool and useful technology. At this moment, from the marketing-speak, I have no idea what it does except that it has something to do with volumes of streaming data. Whose data? Is it a service? (Maybe not, since you can download it?) What could it do for me (in simple terms)? What's a basic use case? Why do people assume that we can mind-read?

It's not marketing speak, it's research speak. I have worked on a similar project (and will work on it again in the future), and I know exactly what they mean by those things.

General purpose: in the same way C is a "general purpose" language. It can handle arbitrary problems.

Distributed: designed to be used across multiple compute nodes.

Scalable: they've made the effort to ensure that performance increases as they increase the number of compute nodes.

Partially fault-tolerant: node failure does not mean the results of the computation are lost. "Partially," I assume, implies they can't guarantee this completely.

Continuous unbounded streams of data: think sensors that are constantly sending more data. Or a stock market ticker. Or radio telescopes constantly monitoring the sky. Or a medical patient's various monitors.

The reason these terms don't resonate with you is that this type of application - this type of programming - is something you're not familiar with.

No, I think what sabat is saying is that the front-page documentation is geared towards the wrong audience and needs to give less general information and more concrete examples of usage.

I think if the following actually had examples of use, I'd try it out:


Currently it only has:


   1. Check out sources from git
   2. Create Eclipse configurations: mvn eclipse:eclipse
   3. Import project into Eclipse
   4. set variable M2_REPO to the local Maven repository: e.g. ~/.m2/repository
   5. Set up formatting:
          * Spaces for indentation
          * Tab width = 4
I completely understand that this isn't there yet, though. It is a new project.

It's open source, right? So anyone could go in and fix any problems they see, right?

My attempt at a more friendly description of S4 (based on my very limited understanding of it - please correct me if I'm wrong):

When processing large amounts of streaming data, you have to process the data as fast as it comes in or else you can't keep up. You probably wonder: "Why not simply delay the processing of new data while the old data is being processed?" The problem lies in the fact that we're dealing with a stream, which always brings in new data. Eventually, the virtual line-up of delayed data will occupy all available memory.

A solution to this is to dispatch the data across multiple computers, which can each independently process the data they receive and then send back the result of their processing to a central computer whose job is to put the results back together. Can't keep up with the stream? Simply add a new computer! That's more or less what S4 does.
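That dispatch-and-merge idea can be sketched directly (pure illustration; S4 handles routing and distribution for you, and real systems use consistent hashing rather than the toy hash below):

```python
# Sketch of dispatch-and-merge: hash each event's key to pick a worker,
# let workers count independently, then merge partial results centrally.

def worker_for(key, num_workers=3):
    # A stable toy hash so the same key always lands on the same worker.
    return sum(ord(c) for c in key) % num_workers

def dispatch(events, num_workers=3):
    partitions = [[] for _ in range(num_workers)]
    for e in events:
        partitions[worker_for(e, num_workers)].append(e)
    return partitions

def worker_count(partition):
    counts = {}
    for e in partition:
        counts[e] = counts.get(e, 0) + 1
    return counts

def merge(partials):
    total = {}
    for counts in partials:
        for k, v in counts.items():
            total[k] = total.get(k, 0) + v
    return total

events = ["#a", "#b", "#a", "#c"]
partials = [worker_count(p) for p in dispatch(events)]
print(merge(partials))
```

Because the hash is keyed, every occurrence of "#a" goes to the same worker, so each worker's partial count is already complete for its keys and adding workers only re-partitions the key space.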

Good, except stream processing does not imply that there needs to be a master node - it can truly be distributed. I don't know if S4 is set up like this, though.

It does. Processing is distributed across multiple nodes. See http://wiki.s4.io/Manual/S4Overview

Thank you for this. Yahoo! should copy and paste your comment onto the S4 homepage.

It's open source; anyone could do that, not just Yahoo!

Are you confusing open source with a wiki editable by everyone?

Open source just means that you can fork it and do whatever you want with your own version, e.g. fix it. The original owners don't have to accept your fix.

Which, incidentally, often is true for wikis as well.

One way of looking at it is that S4 is to map-reduce what Streambase or Coral8 are to SQL.

In the enterprise software world, this is what's called CEP - Complex Event Processing: http://en.wikipedia.org/wiki/Complex_event_processing

Yeah, I had gotten the impression it was distributed CEP.

"At Yahoo! Labs we design algorithms that are primarily driven by large scale applications for data mining and machine learning in a production environment. We show that the S4 design is surprisingly flexible and lends itself to run in large clusters built with commodity hardware."

Via http://labs.yahoo.com/event/99

I could've sworn there was a blurb there about where they are using it, and I recall it mentioning "real-time map-reduce jobs" for things like live bidding on ads; another use case would be stock market data.

You're right that it's marketing speak; this project has gotten too much attention on HN in the last few days, even when all the git repo had was an initial commit. It's too bad they don't have a proper explanation - maybe it's because they probably weren't expecting all this attention yet.

Let's be clear: S4 is not real-time MapReduce, it's a stream processing system. [1]

[1] http://twitter.com/s4project/status/29611855285

> Whose data?

Anyone's data that you can stuff into the system: "The drivers to read from and write to the platform can be implemented in any language making it possible to integrate with legacy data sources and systems."

> Is it a service?

No, it's a platform. You could turn it into a Platform-as-a-Service, like Amazon does with various technologies. "S4 is a ... platform"

> What could it do for me (in simple terms)? What's a basic use case?

My first thought would be real-time trending calculations. You have a massive, never-ending stream of data... how do you extract real-time insights from that?

> Why do people assume that we can mind-read?

Perhaps because after being immersed in a project for a long time, it's easy to forget what is obvious to you, but non-obvious to others.

Looks like it lets you take a large, continuously updating stream of data and distribute the processing of it across multiple nodes. Kinda like MapReduce.

As I understand it the processing model is similar to Yahoo pipes or good old unix pipes, except with easy support for parallel processing, distribution and fault tolerance.

So... It's like Unix pipes with GNU Parallel?

Just looked a little at the documentation. It seems like it's an engine for stream processing. Think of it as data mining up front: you figure out what information you want and collect it as the data comes in, instead of storing all the data and mining the accumulated pile for what you want.

Upvoted the story just so more people read your comment. This kind of "marketing-speak" focused on technical details is all too widespread. The project may be technical by nature, but there's got to be a higher-level way of describing it.

Maybe you should cultivate some fucking respect. Yahoo, which gave us such critical technologies as Hadoop and Pig, just gave us what appears to be a robust distributed stream processing thingamabob. They open sourced it first, and will quickly bring the docs and presentation up to speed. Instead of mouthing off, just say, 'thanks!' or nothing at all.

Could just be that we're not the audience yet; they see this as something they want to break out within businesses, so they have to communicate it to business types who want to know, practically and conceptually, how this new thing is better than what they have, but don't care about the actual code.

Surely it's aimed at technical people, able to understand technical language?

It's not marketing speak. It is a clear, simple, and short sentence that describes exactly what S4 is. It doesn't have irrelevant details ("we use technology XYZ for distributing our nodes"), nor very much marketing describing the value ("S4 solves all your problems!!!", although they do use the word "easily" once).

If the terms used are not clear to someone reading it, they are probably not the right person to implement a good distributed system anyway.

It's frustrating, isn't it? S4 sounds like it does something cool and useful, and I'd like to know what this is, just for reference at least. But I don't have time to go and figure it out on my own; I'm too busy trying to stay productive.

If you're too busy to spend the time reading the docs to figure out if it's useful for yourself, how will you ever find the time to learn the system and implement it into your work processes?

Do you read the docs of every new software product that gets released, in order to see if it's useful to you? No, there aren't enough hours in the day.

If this were relevant to you, you'd know what most of that means, particularly stream processing. Why exactly is Yahoo faulted for your lack of understanding of the technical vocabulary? I'm mostly annoyed that you think that using the usual vocabulary to discuss a problem means that people have to mind-read.
