

Implementing Real-Time Trending Topics in Storm - skadamat
http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/

======
scott_s
I work on IBM InfoSphere Streams
([https://www.ibmdw.net/streamsdev/](https://www.ibmdw.net/streamsdev/)), and
the language our system uses is SPL (Streams Processing Language). I was
curious to see what this problem would look like implemented in SPL, so I came
up with this:

    
    
      composite Main {
      graph
        @parallel(width=2)
        stream<rstring word> Words = Custom() {
          logic onProcess: {
            while (true) {
              // a "word" looks like "0", "1", ..., "99"
              submit({word=(rstring)((int32)(random() * 100))}, Words);
            }
          }
        }
    
        @parallel(width=3, partitionBy=[{port=Words, attributes=[word]}])
        stream<rstring word, int32 rolling_count> Counts = Aggregate(Words) {
          window Words: sliding, time(9), time(3), partitioned;
          param partitionBy: word;
          output Counts: word=Any(), rolling_count=Count();
        }
    
        stream<rstring word, int32 rolling_count> Sorted = Sort(Counts) {
          window Counts: sliding, time(2), time(2);
          param sortBy: rolling_count;
        }
    
        stream<list<rstring word, int32 rolling_count> hist> Rankings = Aggregate(Sorted) {
          window Sorted: tumbling, punctuation;
          output Rankings: hist=Collect();
        }
      }
    

I did not parallelize the reduce part. I'd need to do some more thinking on
how to do that, as I split it into two operators. I think it shows the flavor,
though.

Two caveats: One, I have not compiled and tested this, so errors may abound. I
was just curious what the "flavor" of it would look like in our system. Two,
the standard disclaimer: I speak in an unofficial capacity, and what I say does
not represent IBM or Streams.
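
For readers who don't know SPL, the idea behind the snippet above (count each word over a 9-second window that advances every 3 seconds, then rank the counts) can be sketched in plain Python. This is only an illustration under stated assumptions, not how Streams or Storm implement it: `SlidingWindowCounter`, `add`, and `rankings` are hypothetical names, timestamps are assumed to arrive in non-decreasing order, and the window is approximated with fixed 3-second buckets.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Counts words over the last `window_seconds`, in `slide_seconds` buckets.

    Hypothetical helper for illustration; assumes non-decreasing timestamps.
    """

    def __init__(self, window_seconds=9, slide_seconds=3):
        if window_seconds % slide_seconds != 0:
            raise ValueError("window must be a multiple of the slide")
        self.slide_seconds = slide_seconds
        self.num_buckets = window_seconds // slide_seconds
        self.buckets = deque()  # (bucket_index, Counter), oldest first

    def add(self, word, timestamp):
        """Record one occurrence of `word` seen at `timestamp` (seconds)."""
        index = int(timestamp // self.slide_seconds)
        if not self.buckets or self.buckets[-1][0] != index:
            self.buckets.append((index, Counter()))
        self.buckets[-1][1][word] += 1
        self._evict(index)

    def _evict(self, current_index):
        """Drop buckets that have fallen out of the window."""
        oldest_allowed = current_index - self.num_buckets + 1
        while self.buckets and self.buckets[0][0] < oldest_allowed:
            self.buckets.popleft()

    def rankings(self, now, top_n=3):
        """Return the top `top_n` (word, rolling_count) pairs as of `now`."""
        self._evict(int(now // self.slide_seconds))
        totals = Counter()
        for _, counts in self.buckets:
            totals.update(counts)
        # Highest count first; ties broken alphabetically for stable output.
        return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


counter = SlidingWindowCounter(window_seconds=9, slide_seconds=3)
for word, ts in [("a", 0), ("a", 1), ("b", 2), ("b", 4), ("b", 5), ("c", 7)]:
    counter.add(word, ts)

print(counter.rankings(now=8))   # all three buckets are still in the window
print(counter.rankings(now=10))  # the 0-3s bucket has been evicted
```

Keeping one counter per bucket means that expiring old data is just dropping the oldest bucket, at the cost of the window edge only being accurate to the bucket size.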

------
samspenc
Michael Noll has some really good introductory articles to Big Data, and I
highly recommend his posts for anyone starting in this field.

His post on writing Python streaming programs in Hadoop
([http://www.michael-noll.com/tutorials/writing-an-hadoop-mapr...](http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/))
got me started in this area two years ago, and that's what I'm still doing these
days! :)

------
dxbydt
I read this one a few months back and felt, like most programmers, that it was
an awesome article, but that I would forget all of it unless I rewrote
everything from scratch myself :) So here's my rewrite in Scala. It's not
real-time, but it takes only 4 minutes on my laptop for a typical weekly
dataset, so you can put it on a cron schedule and get trending topics with a 5
min delay. Of course, there's no Storm, no Hadoop, etc., just a plain old Scala
program anybody can run on their laptop. [http://bit.ly/J82szS](http://bit.ly/J82szS)

~~~
pinhead
Maybe check out Spark [1]. It primarily supports Scala and has great streaming
functionality that seems to beat Storm and others in terms of performance for
some workloads [2].

[1] [http://spark.incubator.apache.org/](http://spark.incubator.apache.org/)

[2]
[http://www.cs.berkeley.edu/~matei/papers/2013/sosp_spark_str...](http://www.cs.berkeley.edu/~matei/papers/2013/sosp_spark_streaming.pdf)

------
capkutay
Sorry for the shameless trolling, but if you are a fan of Storm and see
potential in the real-time side of big data, consider applying to work on the
Platform team at WebAction.

We're a well-funded (raised an $11m Series B from a $15 billion private equity
firm) real-time big data analytics company for the enterprise. We're based in
Palo Alto. If interested, feel free to shoot us a message!

jobs@webaction.com

------
adambard
Storm (and Trident) are just fantastic. If you're looking for a streaming
complement (or replacement!) for Hadoop, you can't go wrong here.

