"S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. Keyed data events are routed with affinity to Processing Elements (PEs), which consume the events and do one or both of the following: (1) emit one or more events which may be consumed by other PEs, (2) publish results. The architecture resembles the Actors model, providing semantics of encapsulation and location transparency, thus allowing applications to be massively concurrent while exposing a simple programming interface to application developers."
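To make the keyed-PE idea concrete, here is a toy sketch in Java (my own illustration of the concept, not the actual S4 API; the class and method names are made up). A PE receives all events for its key, updates private state, and can publish a result:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical Processing Element: counts occurrences of keyed events.
// In S4 terms, every event with the same key would be routed to the same
// PE instance, so the state here never needs locking.
public class WordCountPE {
    private final Map<String, Integer> counts = new HashMap<>();

    // Consume a keyed event; here the "event" is just a word.
    public void onEvent(String word) {
        counts.merge(word, 1, Integer::sum);
    }

    // Publish a result for a given key.
    public int getCount(String word) {
        return counts.getOrDefault(word, 0);
    }

    public static void main(String[] args) {
        WordCountPE pe = new WordCountPE();
        for (String w : new String[] { "s4", "stream", "s4" }) {
            pe.onEvent(w);
        }
        System.out.println(pe.getCount("s4")); // prints 2
    }
}
```

The actor-like part is that each PE owns its state and communicates only via events, which is what makes the whole thing massively concurrent without explicit locks.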
I always thought it would be very interesting to try and build a language that had an Actor focus but used simple objects for basic stuff (a pure model seems a little odd - I guess a C++ for actors). It just seems like a natural way to organize a big system. We talk about certain objects as doing things to other objects, and it would provide a simpler concurrency model.
That paper was a big inspiration when we were redesigning the betting exchange at Smarkets. It's a very well-reasoned exposition of why this is the only sensible architecture for large-scale distributed systems.
I guess it depends how you define 'full' fault tolerance. In the case of a machine failure it would lose a couple of seconds' worth of transactions. In a data-processing scenario that's not an issue - just re-run the data that hasn't been processed yet. In an exchange the recovery delay is a bit more painful, and there is always the potential to permanently lose transactions. Unfortunately there doesn't seem to be a way around that - we must have consistency and performance, so machine failure is always going to cause some interruption to availability.
I wish I had heard about this a few months ago. I wanted to implement a way to create and connect streaming web services. I hacked up something then called webpipes (https://github.com/dnewcome/webpipes) using node.js. Unfortunately I haven't looked at it again since I first put it up on github. S4 looks a ton more advanced than anything I was envisioning, but I still think that something simple done in one of the evented servers like node.js (the S4 implementation looks to be Java) would be useful.
S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.
I'm sure this is cool and useful technology. At this moment, from the marketing-speak, I have no idea what it does except that it has something to do with volumes of streaming data. Whose data? Is it a service? (Maybe not, since you can download it?) What could it do for me (in simple terms)? What's a basic use case? Why do people assume that we can mind-read?
It's not marketing speak, it's research speak. I have worked on a similar project (and will work on it again in the future), and I know exactly what they mean by those things.
General purpose: in the same way C is a "general purpose" language. It can handle arbitrary problems.
Distributed: designed to be used across multiple compute nodes.
Scalable: they've made the effort to ensure that performance increases as they increase the number of compute nodes.
Partially fault-tolerant: node failure does not mean the results of the computation are lost. "Partially," I assume, implies they can't guarantee this completely.
Continuous unbounded streams of data: think sensors that are constantly sending more data. Or a stock market ticker. Or radio telescopes constantly monitoring the sky. Or a medical patient's various monitors.
The reason these terms don't resonate with you is that this type of application - this type of programming - is something you're not familiar with.
1. Check out sources from git
2. Create Eclipse configurations: mvn eclipse:eclipse
3. Import project into Eclipse
4. Set variable M2_REPO to the local Maven repository: e.g. ~/.m2/repository
5. Set up formatting:
* Spaces for indentation
* Tab width = 4
I completely understand that this isn't there yet, though. It is a new project.
My attempt at a more friendly description of S4 (based on my very limited understanding of it - please correct me if I'm wrong):
When processing large amounts of streaming data, you have to process the data as fast as it comes in or else you can't keep up. You probably wonder: "Why not simply delay the processing of new data while the old data is being processed?" The problem lies in the fact that we're dealing with a stream which always brings in new data. Eventually, the virtual queue of delayed data will occupy all available memory.
A solution to this is to dispatch the data between multiple computers which can each independently process the data they receive, and then send the result of their processing back to a central computer whose job is to put the results back together. Can't keep up with the stream? Simply add a new computer! That's more or less what S4 does.
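The dispatching step above can be sketched in a few lines of Java (again, my own illustration rather than anything from S4; the class name is made up). Hashing the event's key picks a worker, so every event with the same key lands on the same machine, and adding a computer just widens the modulus:

```java
// Toy key-based router: maps a keyed event to one of N workers so each
// worker sees an independent slice of the stream.
public class KeyRouter {
    private final int numWorkers;

    public KeyRouter(int numWorkers) {
        this.numWorkers = numWorkers;
    }

    // floorMod keeps the result non-negative even for negative hash codes.
    public int workerFor(String key) {
        return Math.floorMod(key.hashCode(), numWorkers);
    }
}
```

In practice you'd want something like consistent hashing so that adding a worker doesn't reshuffle every key, but the basic idea is the same.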
"At Yahoo! Labs we design algorithms that are primarily driven by large scale applications for data mining and machine learning in a production environment. We show that the S4 design is surprisingly flexible and lends itself to run in large clusters built with commodity hardware."
I could've sworn there was a blurb there about where they are using it, and I recall it mentioning "real-time map-reduce jobs" for things like live bidding on ads; another use case would be stock market data.
You're right that it's marketing speak. This project has gotten too much attention on HN in the last few days, even when all the git repo had was an initial commit. It's too bad they don't have a proper explanation - probably because they weren't expecting all this attention yet.
> Whose data?
Anyone's data that you can stuff into the system:
"The drivers to read from and write to the platform can be implemented in any language making it possible to integrate with legacy data sources and systems."
> Is it a service?
No, it's a platform. You could turn it into a Platform-as-a-Service, like Amazon does with various technologies.
"S4 is a ... platform"
> What could it do for me (in simple terms)? What's a basic use case?
My first thought would be real-time trending calculations. You have a massive, never-ending stream of data... how do you extract real-time insights from that?
> Why do people assume that we can mind-read?
Perhaps because after being immersed in a project for a long time, it's easy to forget what is obvious to you, but non-obvious to others.
Just looked a little at the documentation. It seems like it's an engine for stream processing. Think of it as data mining up front: you figure out what information you want and collect it as the data comes in, instead of storing all the data and mining the accumulated pile for what you want.
Upvoted the story just so more people read your comment. This kind of "marketing-speak" focused on technical details is all too widespread. The project may be technical by nature, but there's got to be a higher-level way of describing it.
Maybe you should cultivate some fucking respect. Yahoo, which gave us such critical technologies as Hadoop and Pig, just gave us what appears to be a robust distributed stream processing thingamabob. They open sourced it first, and will quickly bring the docs and presentation up to speed. Instead of mouthing off, just say, 'thanks!' or nothing at all.
Could just be that we're not the audience yet. They see this as something they want to break out within businesses, so they have to communicate it to business types who want to know, practically and conceptually, how this new thing is better than what they have, but who don't care about the actual code.
It's not marketing speak. It is a clear, simple, and short sentence that describes exactly what S4 is. It doesn't have irrelevant details ("we use technology XYZ for distributing our nodes"), nor very much marketing describing the value ("S4 solves all your problems!!!", although they do use the word "easily" once).
If the terms used are not clear to someone reading it, they are probably not the right person to implement a good distributed system anyway.
It's frustrating, isn't it? S4 sounds like it does something cool and useful, and I'd like to know what this is, just for reference at least. But I don't have time to go and figure it out on my own; I'm too busy trying to stay productive.
If this were relevant to you, you'd know what most of that means, particularly stream processing. Why exactly is Yahoo faulted for your lack of understanding of the technical vocabulary? I'm mostly annoyed that you think that using the usual vocabulary to discuss a problem means that people have to mind-read.