
The world beyond batch: Streaming 101 - dpmehta02
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
======
alephnil
Interesting article, but the author often seems to get carried away. Like
below:

> Quite honestly, I’d take things a step further. I would argue that well-
> designed streaming systems actually provide a strict superset of batch
> functionality.

From a computer science perspective, at least, this is strictly wrong. For
problems like bin packing, it has been shown that there are cases where an
online (streaming) algorithm is provably worse than an offline (batch) one. In
the batch case you always have all the data available, while in the streaming
case you must make decisions before you have seen all the data, which can lead
to suboptimal results. From this perspective it is the other way around: batch
processing is a strict superset of streaming.
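To make that concrete, here's a small sketch of the classic online-vs-offline bin packing gap (the item sizes are made up for the example):

```python
def first_fit_online(items, capacity=1.0):
    """Online first-fit: place each item as it arrives,
    with no knowledge of the items still to come."""
    bins = []
    for item in items:
        for b in bins:
            if sum(b) + item <= capacity:
                b.append(item)
                break
        else:
            bins.append([item])  # no existing bin fits: open a new one
    return bins

# The streaming (online) first-fit packer needs 4 bins here...
items = [0.4, 0.4, 0.4, 0.6, 0.6, 0.6]
print(len(first_fit_online(items)))  # 4
# ...while a batch (offline) packer that sees all the items up front
# can pair each 0.4 with a 0.6 and use only 3 bins.
```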

The reason streaming is often preferred to batch processing is that you get
results sooner, without having to wait for all the data to arrive. Such early,
approximate answers are often much more valuable than the exact answer you get
in the end.

~~~
jdmichal
Batch processing is a subset of stream processing, assuming that the stream
has a finite set of data. That is, a batch job could be implemented by
collecting all the input from the stream and then executing the batch.

However, not all streams have finite data sets. This is a very important
distinction.
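That collect-then-execute reduction can be sketched in a few lines (the word count is just an illustrative batch job):

```python
def run_batch_over_stream(stream, batch_job):
    """Treat a batch job as a special case of stream processing:
    collect the (finite) stream in full, then run the job once.
    This only terminates if the stream itself terminates."""
    collected = list(stream)  # blocks until the stream is exhausted
    return batch_job(collected)

# A batch word count over a finite stream of lines:
lines = iter(["the world beyond batch", "streaming 101"])
total = run_batch_over_stream(lines,
                              lambda ls: sum(len(l.split()) for l in ls))
print(total)  # 6
```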

------
chollida1
I made a comment a few weeks ago here:
[https://news.ycombinator.com/item?id=10459992](https://news.ycombinator.com/item?id=10459992)

that made the point that hedge funds are now spending a lot of their
processing and research budgets on consuming many different types of streams
of data.

Complex Event Processing (CEP) has been a mainstay of algorithmic traders for
the past 8-10 years and has been a key tool for amalgamating these streams of
real-time data. The author didn't define CEP as it relates to stream
processing, but this Quora article does a good job of trying to:

[https://www.quora.com/How-is-stream-processing-and-complex-e...](https://www.quora.com/How-is-stream-processing-and-complex-event-processing-CEP-different)

So instead of the database model, where you store your data and then ask
questions about it, you ask the question first, push your streams of data
through the library, and have it call you back when the question you asked
becomes true.
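That ask-first model can be sketched as a tiny callback engine (the class and method names here are invented for illustration, not any real CEP product's API):

```python
class CEPEngine:
    """Minimal sketch of the ask-first model: queries are registered
    up front, and the engine invokes a callback whenever a pushed
    event makes a query true."""

    def __init__(self):
        self.queries = []  # list of (predicate, callback) pairs

    def register(self, predicate, callback):
        """Ask the question first: store it for later evaluation."""
        self.queries.append((predicate, callback))

    def push(self, event):
        """Push data through; fire callbacks for matching queries."""
        for predicate, callback in self.queries:
            if predicate(event):
                callback(event)

engine = CEPEngine()
hits = []
engine.register(lambda e: e["symbol"] == "GLD" and e["price"] > 100,
                hits.append)
engine.push({"symbol": "GLD", "price": 101})  # callback fires
engine.push({"symbol": "GLD", "price": 99})   # no match, nothing fires
print(hits)  # [{'symbol': 'GLD', 'price': 101}]
```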

For a practical example of how this is used in algorithmic trading: you might
have a stream of market data, a stream of Twitter sentiment, a stream for the
sentiment of a real-time news feed, streams following the futures market, and
a stream indicating whether there are any upcoming Fed reporting windows.

You might then create a streaming query saying: notify me when a gold stock
has made 4 new daily highs in a 5-minute window, gold futures are within 10%
of their daily high, no negative news sentiment for this stock has been seen
in the past 30 minutes, and there are no Fed reports due today.

If this query reports back that the event has occurred, you could then
translate that into an order to be sent to the market.
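A hedged sketch of how that compound condition might be tracked: the windows (5 minutes of daily highs, 30 minutes of news sentiment) and the 10% futures threshold come from the example above, but the class and method names are invented, and timestamps are plain seconds supplied by the caller so time can be stepped deterministically:

```python
from collections import deque

class GoldSignal:
    """Tracks the state of several streams and answers whether the
    compound streaming query is currently true."""

    def __init__(self):
        self.daily_highs = deque()  # timestamps of new daily highs
        self.neg_news = deque()     # timestamps of negative sentiment
        self.futures_near_high = False
        self.fed_report_today = False

    def on_new_daily_high(self, ts):
        self.daily_highs.append(ts)

    def on_negative_sentiment(self, ts):
        self.neg_news.append(ts)

    def on_futures_tick(self, price, daily_high):
        self.futures_near_high = price >= 0.9 * daily_high

    def check(self, now):
        # Expire events that have fallen out of their windows.
        while self.daily_highs and now - self.daily_highs[0] > 300:
            self.daily_highs.popleft()
        while self.neg_news and now - self.neg_news[0] > 1800:
            self.neg_news.popleft()
        return (len(self.daily_highs) >= 4
                and self.futures_near_high
                and not self.neg_news
                and not self.fed_report_today)

sig = GoldSignal()
for ts in (0, 60, 120, 180):      # 4 new highs within 5 minutes
    sig.on_new_daily_high(ts)
sig.on_futures_tick(95.0, 100.0)  # within 10% of the daily high
print(sig.check(200))  # True: hand off to the order logic
sig.on_negative_sentiment(210)
print(sig.check(220))  # False: negative news seen in the window
```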

You give up some performance to the CEP overhead, but it makes maintaining the
logic much easier than having each algo manually track the state of each of
those streams.

CEP engines also make your back testing and unit testing easier by allowing
you to step time forward in discrete intervals. So if you have an application
that needs to be notified 20 milliseconds after an event, you can step time
forward in a non-real-time manner (usually faster, when unit testing) to
verify that your callback is correctly invoked.
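That discrete time-stepping could look something like this (a minimal sketch, not any particular CEP product's API):

```python
class VirtualClock:
    """Minimal sketch of discrete time-stepping for tests: scheduled
    callbacks fire when the clock passes their due time, with no
    wall-clock waiting involved."""

    def __init__(self):
        self.now_ms = 0
        self.pending = []  # list of (due_ms, callback)

    def call_after(self, delay_ms, callback):
        self.pending.append((self.now_ms + delay_ms, callback))

    def advance(self, step_ms):
        self.now_ms += step_ms
        due = [p for p in self.pending if p[0] <= self.now_ms]
        self.pending = [p for p in self.pending if p[0] > self.now_ms]
        for _, callback in sorted(due, key=lambda p: p[0]):
            callback()

# Verify a "notify me 20 ms after the event" callback without
# actually waiting 20 ms:
clock = VirtualClock()
fired = []
clock.call_after(20, lambda: fired.append("notified"))
clock.advance(10)  # 10 ms: nothing yet
clock.advance(10)  # 20 ms: the callback fires
print(fired)  # ['notified']
```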

~~~
kbenson
Do you happen to have any good resources on best practices or things to
consider when working on complex event processing? After reading your comment,
it occurred to me that this is something I've been getting closer and closer
to at work, but I haven't quite reached this level yet (currently we allow
more targeted queries that run at regular intervals, but the eventual goal has
always been real-time notification). It's far too infrequent that I get a
chance to get ahead of a project and do some research, given the frequently
shifting focus and short time frames for work here, but I know this will be a
priority at some point in the not-too-distant future.

------
127001brewer
This might be too much of a newbie question, but how do these types of data
stores (such as "data lakes") get populated?

For example, and again I might be too uninformed here to ask the correct
question, do you use ETL-type tools to get and store data? Or is it usually
scripts that pull in (and process) data from various sources?

~~~
felixgallo
With ETL, you usually start out with handmade scripts, and then, depending on
how rich you are and/or how complex your situation gets, you possibly move up
to either an internal tool or a vendor tool.
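A "handmade script" at the start of that progression is often nothing more than extract-transform-load in a dozen lines; here's a sketch (the file layout and schema are invented for the example):

```python
import csv
import sqlite3

def etl(csv_path, db_conn):
    """Minimal handmade ETL sketch: extract rows from a CSV,
    transform them (normalize the name field), and load them
    into a SQLite table."""
    db_conn.execute(
        "CREATE TABLE IF NOT EXISTS events (name TEXT, value REAL)")
    with open(csv_path) as f:
        for row in csv.DictReader(f):
            db_conn.execute(
                "INSERT INTO events VALUES (?, ?)",
                (row["Name"].strip().lower(), float(row["Value"])))
    db_conn.commit()
```

Once dozens of such scripts accumulate, with overlapping schedules and dependencies between them, that's typically the point where teams move up to an internal framework or a vendor tool.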

