It's a Twitter topic counter: "This application detects popular hashtags on Twitter by listening to the Twitter gardenhose."
From http://labs.yahoo.com/event/99 :
"S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. Keyed data events are routed with affinity to Processing Elements (PEs), which consume the events and do one or both of the following: (1) emit one or more events which may be consumed by other PEs, (2) publish results. The architecture resembles the Actors model , providing semantics of encapsulation and location transparency, thus allowing applications to be massively concurrent while exposing a simple programming interface to application developers."
Actor model: http://en.wikipedia.org/wiki/Actor_model
Actors, Linda, tuples... hmm...
That paper was a big inspiration when we were redesigning the betting exchange at Smarkets. It's a very well-reasoned exposition of why this is the only sensible architecture for large-scale distributed systems.
Sure enough, there is a Twitter example.
I imagine a composition language (DSL) wrapped around it could greatly improve its usability, especially for ad-hoc experimentation; at the very least, one better than Spring IoC XML.
If you want to stick with the JVM, have you considered using Scala and the cake pattern?
If you want to stick with Java and want to use IoC without the hell that is Spring, I suggest Guice (which consists of a smaller, cleaner core and uses annotations and DSLs in place of XML).
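For a taste of the difference, here's a minimal sketch of a Guice module. The EventSource/TwitterSource/TopicCounter types are made up for illustration; the Guice calls themselves are the real API.

    import com.google.inject.AbstractModule;
    import com.google.inject.Guice;
    import com.google.inject.Inject;
    import com.google.inject.Injector;

    // Made-up service types, just to have something to wire together.
    interface EventSource { String next(); }
    class TwitterSource implements EventSource {
        public String next() { return "#example"; }
    }

    class TopicCounter {
        private final EventSource source;
        @Inject TopicCounter(EventSource source) { this.source = source; }
    }

    class StreamModule extends AbstractModule {
        @Override protected void configure() {
            // The wiring Spring would put in XML is plain, type-checked Java:
            bind(EventSource.class).to(TwitterSource.class);
        }
    }

    class Main {
        public static void main(String[] args) {
            Injector injector = Guice.createInjector(new StreamModule());
            TopicCounter counter = injector.getInstance(TopicCounter.class);
        }
    }

Refactoring tools and the compiler both see the bindings, which is most of what you lose with XML.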
I'm too dim to figure out why it would be done this way (besides the fact that it's an early proof-of-concept demo). Any ideas?
Stream processing jobs can be massaged to fit into the MapReduce paradigm, but S4 provides a more natural solution.
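To see why, compare a hashtag count done both ways. This is just my own sketch of the contrast, not S4 code:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class CountBothWays {
        // MapReduce style: each run recomputes over everything collected so far.
        static Map<String, Long> batchCount(List<String> allTagsSoFar) {
            Map<String, Long> counts = new HashMap<>();
            for (String tag : allTagsSoFar)
                counts.merge(tag, 1L, Long::sum);
            return counts;
        }

        // Stream style: constant work per arriving event, state updated in place.
        static void onEvent(Map<String, Long> counts, String tag) {
            counts.merge(tag, 1L, Long::sum);
        }
    }

The batch version reruns over a growing dataset; the streaming version does a fixed amount of work per event, which is what keeps latency flat.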
For example, do I push data into S4, or does S4 poll for data?
Is this like a distributed task system, where I distribute my tasks evenly across multiple servers seamlessly?
I'm sure this is cool and useful technology. At this moment, from the marketing-speak, I have no idea what it does except that it has something to do with volumes of streaming data. Whose data? Is it a service? (Maybe not, since you can download it?) What could it do for me (in simple terms)? What's a basic use case? Why do people assume that we can mind-read?
General purpose: in the same way C is a "general purpose" language. It can handle arbitrary problems.
Distributed: designed to be used across multiple compute nodes.
Scalable: they've made the effort to ensure that performance increases as they increase the number of compute nodes.
Partially fault-tolerant: node failure does not mean the results of the computation are lost. "Partially," I assume, implies they can't guarantee this completely.
Continuous unbounded streams of data: think sensors that are constantly sending more data. Or a stock market ticker. Or radio telescopes constantly monitoring the sky. Or a medical patient's various monitors (see the sketch below).
The reason these terms don't resonate with you is that these types of applications, and this type of programming, are something you're not familiar with.
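To make "continuous unbounded" concrete, here's a tiny sketch (my own, nothing S4-specific). The consumer never reaches end-of-input, so it has to keep up with the producer forever:

    import java.util.concurrent.BlockingQueue;

    class SensorConsumer {
        // An unbounded stream has no end-of-input: the consumer runs forever,
        // handling each reading as it arrives, at whatever rate it arrives.
        static void consume(BlockingQueue<Double> readings) throws InterruptedException {
            while (true) {
                double reading = readings.take(); // blocks until the next reading
                // ... process immediately; there is no "done" state ...
            }
        }
    }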
I think if the following actually had examples of use, I'd try it out:
Currently it only has:
1. Check out sources from git
2. Create Eclipse configurations: mvn eclipse:eclipse
3. Import project into Eclipse
4. Set the variable M2_REPO to the local Maven repository, e.g. ~/.m2/repository
5. Set up formatting:
* Spaces for indentation
* Tab width = 4
When processing large amounts of streaming data, you have to process the data as fast as it comes in, or else you can't keep up. You might wonder: "Why not simply delay the processing of new data while the old data is being processed?" The problem lies in the fact that we're dealing with a stream, which always brings in new data. Eventually, the backlog of delayed data will occupy all available memory.
A solution to this is to dispatch the data among multiple computers, each of which can independently process the data it receives and then send its results to a central computer whose job is to put the results together. Can't keep up with the stream? Simply add a new computer! That's more or less what S4 does.
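Here's a sketch of that dispatch step, under the simplifying assumption of a fixed worker pool and hash-based routing (my own illustration, not S4 code):

    import java.util.List;

    interface Worker { void process(String key, Object event); }

    class Dispatcher {
        private final List<Worker> workers;
        Dispatcher(List<Worker> workers) { this.workers = workers; }

        void dispatch(String key, Object event) {
            // Same key -> same worker, so per-key state never leaves one machine.
            int i = Math.floorMod(key.hashCode(), workers.size());
            workers.get(i).process(key, event);
        }
    }

Routing by key hash is also what makes the "keyed affinity" in their description work: all events for a given hashtag land on the same worker.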
Open source just means that you can fork it and do whatever you want with your own version, e.g. fix it. The original owners don't have to accept your fix.
In the enterprise software world, this is what's called CEP - Complex Event Processing: http://en.wikipedia.org/wiki/Complex_event_processing
I could've sworn there was a blurb there about where they're using it; I recall something about "real-time map-reduce jobs" for things like live bidding on ads. Another use case would be stock market data.
You're right that it's marketing speak; this project has gotten too much attention on HN in the last few days, even when all the git repo had was an initial commit. It's too bad they don't have a proper explanation; maybe it's because they weren't expecting all this attention yet.
> Is it a service?
No, it's a platform. You could turn it into a Platform-as-a-Service, like Amazon does with various technologies.
"S4 is a ... platform"
> What could it do for me (in simple terms)? What's a basic use case?
My first thought would be real-time trending calculations. You have a massive, never-ending stream of data...how do you extract real-time insights from that?
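Here's a minimal sketch of that idea (my own illustration, not S4 code): count hashtags in one-minute buckets and report the current leader, dropping buckets that age out of the window:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.Map;

    class TrendingCounter {
        private final Deque<Map<String, Long>> buckets = new ArrayDeque<>();
        private final int windowSize; // how many one-minute buckets to keep

        TrendingCounter(int windowSize) {
            this.windowSize = windowSize;
            buckets.addFirst(new HashMap<>()); // the current bucket
        }

        // Call once per minute: open a fresh bucket, drop the oldest.
        void rotate() {
            buckets.addFirst(new HashMap<>());
            if (buckets.size() > windowSize) buckets.removeLast();
        }

        void onHashtag(String tag) {
            buckets.peekFirst().merge(tag, 1L, Long::sum);
        }

        // Current top hashtag across the whole window.
        String trending() {
            Map<String, Long> totals = new HashMap<>();
            for (Map<String, Long> bucket : buckets)
                bucket.forEach((tag, n) -> totals.merge(tag, n, Long::sum));
            return totals.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey)
                    .orElse(null);
        }
    }

Partition that across machines by hashtag and you have roughly the shape of the Twitter hashtag example mentioned above.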
> Why do people assume that we can mind-read?
Perhaps because after being immersed in a project for a long time, it's easy to forget what is obvious to you, but non-obvious to others.
If the terms used are not clear to someone reading it, they are probably not the right person to implement a good distributed system anyway.