

Ask HN: How to build a stack for real-time analysis application? - zedzan

I am building a real-time data-analysis application where I need to
aggregate and analyze data in real time from myriad sources. I am trying
to figure out the best technology to use in order to build a scalable
stack for my solution.

My choices:

- Node.js: asynchronous capabilities and the rich JS ecosystem.

- Scala/Akka: many companies such as Walmart and LinkedIn use this
  approach.

- Go: strong concurrency support as a backend language.

- Integrate Ruby on Rails with Apache Storm (as Hadoop is not well suited
  to such cases).

Some would argue here that Go is a programming language, not a web
framework; Node is a JavaScript runtime, not a web framework; Scala is a
programming language; and Akka is an actor-model toolkit.

I am torn over which technology to use here. The potential of a
technology is the average of the overall factors (programming language,
web framework, ecosystem, community, ...), e.g. Go is powerful, but its
ecosystem is still not solid, etc.

I am not trying to compare languages, but the overall technologies as
stacks. Can anyone share their story of building similar applications,
the challenges they overcame, and the problems they faced along the way?
======
valarauca1
I do most of my concurrent realtime analysis in C or Java.

The biggest issue I run into with realtime analysis is communication
errors. (I work in flow-measurement systems, so you may never encounter
this.)

Having a value randomly flip from 50 to -30 for even one data point can
throw your standard deviation, averages, etc. straight out the window,
and potentially ruin a solid test.

What I find you want is a heavily threaded model. I normally just throw
threads at the scheduler and let it sort out the details.

Typically you want IO/error handling done in its own heavy thread for
each form of IO, or each source: Ethernet, DAQ card, etc. Luckily
Ethernet does most of this for you.

Next, your error-checked IO should get sorted into something (normally a
structure of some sort) and dumped into a generally read-only structure.

This structure is read by two threads: one logs it, and one processes it
further (before moving it to another thread to be logged).

This lets your processing be largely independent, with no IO slowdown,
which is handy when you're approaching ~20GB/hr+ of streaming data. Most
of what you receive, and generate internally, doesn't have to be logged,
unless you _really_ like buying hard drives.
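The pipeline described above can be sketched in Go, using goroutines and
channels in place of heavy threads: one IO goroutine per source publishes
immutable samples, and each sample is fanned out to two readers, a logger
and a processor. The type, function names, and sample values here are
made up for illustration:

```go
package main

import (
	"fmt"
	"sync"
)

// Sample is the generally read-only structure each IO goroutine
// publishes: once written it is never mutated, so readers need no locks.
type Sample struct {
	Source string
	Seq    int
	Value  float64
}

// runPipeline spins up one IO goroutine per source, fans every sample
// out to a logging callback and a processing loop, and returns the
// processor's running sum.
func runPipeline(sources []string, perSource int, logf func(Sample)) float64 {
	samples := make(chan Sample, 64)

	// One "heavy" IO goroutine per source: read, error-check, publish.
	var io sync.WaitGroup
	for _, src := range sources {
		io.Add(1)
		go func(src string) {
			defer io.Done()
			for i := 0; i < perSource; i++ {
				samples <- Sample{Source: src, Seq: i, Value: 50 + float64(i)}
			}
		}(src)
	}
	go func() { io.Wait(); close(samples) }()

	// Fan each sample out to the two reader threads.
	logCh := make(chan Sample, 64)
	procCh := make(chan Sample, 64)
	go func() {
		for s := range samples {
			logCh <- s
			procCh <- s
		}
		close(logCh)
		close(procCh)
	}()

	var logger sync.WaitGroup
	logger.Add(1)
	go func() { // logging thread
		defer logger.Done()
		for s := range logCh {
			logf(s)
		}
	}()

	sum := 0.0
	for s := range procCh { // processing runs on the calling goroutine
		sum += s.Value
	}
	logger.Wait()
	return sum
}

func main() {
	sum := runPipeline([]string{"ethernet", "daq"}, 3, func(s Sample) {
		fmt.Printf("log: %s #%d %.1f\n", s.Source, s.Seq, s.Value)
	})
	fmt.Printf("processed sum: %.1f\n", sum)
}
```

The bounded channels give you back-pressure for free: if the logger
falls behind, the fan-out blocks rather than buffering unboundedly.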

Also, by keeping these separate you can have multiple _processing_
threads. But when you do, be on guard: having 1 IO -> N processing -> 1
IO will result in data arriving at the to-be-logged IO out of the order
it was received.

You will likely want a catch-all thread sitting between processing and
the logging IO to sort the last ~500 items by timestamp, so they can be
logged in order.
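That catch-all stage can be sketched as a small reorder buffer: hold the
most recent items sorted by arrival stamp, and only release the oldest
once the window is full. A window of 3 stands in for the ~500 suggested
above, and an integer sequence number stands in for the timestamp; the
type names are made up:

```go
package main

import (
	"fmt"
	"sort"
)

// Record carries a sequence number assigned at ingest time
// (a stand-in for an arrival timestamp).
type Record struct {
	Seq   int
	Value float64
}

// Reorderer buffers up to capacity records, sorted by Seq, and releases
// the oldest once full, so mildly out-of-order input is logged in order.
type Reorderer struct {
	buf      []Record
	capacity int
}

func NewReorderer(capacity int) *Reorderer {
	return &Reorderer{capacity: capacity}
}

// Push inserts r in sorted position and, if the window is now over
// capacity, pops and returns the oldest record.
func (ro *Reorderer) Push(r Record) (Record, bool) {
	i := sort.Search(len(ro.buf), func(i int) bool { return ro.buf[i].Seq > r.Seq })
	ro.buf = append(ro.buf, Record{})
	copy(ro.buf[i+1:], ro.buf[i:])
	ro.buf[i] = r
	if len(ro.buf) > ro.capacity {
		out := ro.buf[0]
		ro.buf = ro.buf[1:]
		return out, true
	}
	return Record{}, false
}

// Flush drains whatever remains, in order, at end of stream.
func (ro *Reorderer) Flush() []Record {
	out := ro.buf
	ro.buf = nil
	return out
}

func main() {
	// Records arrive slightly out of order from N processing threads.
	arrivals := []Record{{Seq: 2}, {Seq: 1}, {Seq: 4}, {Seq: 3}, {Seq: 6}, {Seq: 5}}
	ro := NewReorderer(3)
	var logged []int
	for _, r := range arrivals {
		if out, ok := ro.Push(r); ok {
			logged = append(logged, out.Seq)
		}
	}
	for _, r := range ro.Flush() {
		logged = append(logged, r.Seq)
	}
	fmt.Println(logged) // in order despite the shuffled arrival
}
```

The window size is the key tuning knob: it must be at least as large as
the worst-case reordering your processing threads can introduce.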

TL;DR

1) Keep track of when each item arrived, and log items in that order.

2) More threads are better than one.

3) Make sure communication errors don't occur (if you're using TCP, the
OS does this for you).

