

Ask HN: How to effectively generate big data reports? - Skywing

My co-worker and I have been developing a tool in our free time to help us monitor our servers at work. We have to keep an eye on nearly 100 servers - most are virtual machines. We wrote a tool that fits our exact monitoring needs (it monitors a particular process and various attributes about it) and have been collecting data on a per-minute basis from all 100 servers - we have quite a bit of data collected thus far. We're storing all of this data in MongoDB at this point.

The second part of our goal is to be able to view reports via an intranet web app. These reports are going to include data from *all* 100 servers, or that's our goal anyway. We'd like to start off by simply sorting these servers by CPU usage (we're storing CPU usage for each server, every minute) and viewing a graph of CPU usage over time.

My first, uneducated plan of attack would be to run a report-generating routine every minute or so that does the big query and stores the resulting data set. The data set for CPU usage would just include all server IDs and their CPU usage each minute, leaving out everything else we collect. I'm not sure this scales very well, though.

Does anyone have any suggestions or pointers on how to plan out my scaling strategy for this? Thanks.
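For concreteness, the per-minute routine I have in mind would look something
like this (pymongo; the collection and field names here are made up for
illustration, not our actual schema):

    # Hypothetical sketch of the per-minute materialization plan: grab
    # the last minute of samples, keep only server_id and cpu, and
    # store the snapshot in a separate reporting collection.
    from datetime import datetime, timedelta
    from pymongo import MongoClient

    db = MongoClient().monitoring

    def materialize_cpu_snapshot():
        cutoff = datetime.utcnow() - timedelta(minutes=1)
        rows = list(db.metrics.find({"ts": {"$gte": cutoff}},
                                    {"_id": 0, "server_id": 1, "cpu": 1}))
        db.cpu_snapshots.insert_one({"ts": datetime.utcnow(),
                                     "servers": rows})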
======
latch
Have you tried doing this in real time yet, or near real time with maybe a
short-lived cache? I mean, 100 points doesn't seem like much. If you just want
the last load measurement for 100 servers, that seems like a pretty lightweight
map-reduce with an inline/streamed return value. You could cache the results in
your app server for ~10 seconds and have the front end refresh every 10 seconds
(giving you a maximum staleness of 20 seconds).
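For what it's worth, a minimal sketch of that read path - this assumes pymongo,
a document shape of {server_id, ts, cpu}, and uses the aggregation framework
standing in for map-reduce, since it does the same grouping inline:

    # Latest CPU reading per server, sorted busiest-first, with a
    # short-lived in-process cache. All names are assumptions.
    import time
    from pymongo import MongoClient, DESCENDING

    metrics = MongoClient().monitoring.metrics
    _cache = {"expires": 0.0, "rows": None}

    def latest_cpu_per_server(ttl=10):
        now = time.time()
        if _cache["rows"] is None or now > _cache["expires"]:
            pipeline = [
                {"$sort": {"ts": DESCENDING}},       # newest samples first
                {"$group": {"_id": "$server_id",     # one row per server
                            "cpu": {"$first": "$cpu"},
                            "ts": {"$first": "$ts"}}},
                {"$sort": {"cpu": DESCENDING}},      # busiest servers first
            ]
            _cache["rows"] = list(metrics.aggregate(pipeline))
            _cache["expires"] = now + ttl
        return _cache["rows"]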

Beyond that, for your scaling question, I am a fan of generating reportable
data asynchronously and keeping the UI stuff as a very simple select/find
(which seems to be where you want to start). Each report would be its own
transform - some of which might be more real-time than others. If your data is
really huge and a single report is limited by MongoDB's single-threaded map-
reduce, you'll need to look at the Mongo-Hadoop bridge.

Another solution is to simply store your data in a reportable format up front.
Of course, if you need to look at your data 50 different ways, that turns into
50 inserts instead of just 1.
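A hedged sketch of that write-time approach - each sample also does an upsert
into a pre-aggregated hourly document, so the report itself becomes a plain
find(). The schema here is an assumption, not a recommendation:

    # One rollup document per (server, hour): a running sum/count for
    # averages plus a per-minute breakdown for graphing.
    from datetime import datetime
    from pymongo import MongoClient

    rollups = MongoClient().monitoring.cpu_hourly

    def record(server_id, cpu, ts=None):
        ts = ts or datetime.utcnow()
        hour = ts.replace(minute=0, second=0, microsecond=0)
        rollups.update_one(
            {"server_id": server_id, "hour": hour},
            {"$inc": {"cpu_sum": cpu, "samples": 1},
             "$set": {"minute.%d" % ts.minute: cpu}},
            upsert=True)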

I think I'd try real time first, especially for your initial CPU load report,
then move to an asynchronous transformation, and when that no longer meets
your needs, deal with it then. I think it'll be a while before you get there.

You might be well served by asking this in the [active] MongoDB user group:
<https://groups.google.com/forum/#!forum/mongodb-user>

~~~
Skywing
When you suggest trying it in real time initially, are you saying to just go
ahead and generate a new report every time the page is requested? I might as
well try that, knowing that it's not going to be under heavy user load. That
might give me a good CPU load benchmark, as you said.

~~~
latch
Yes, though like I said, there's no reason you can't cache the results for a
short time (whatever works for you - I've seen as short as 5 seconds and as
long as 5 hours). That way, if 10 users happen to look at it at the same time
(common if you have an auto-refresh AJAX thing going), the cache will keep the
load low.

------
asharp
Make sure you generate everything you need in a single pass over the data:
take the last 100 data points (or whatever your window is), turn them into
your graph, then work on the next set of points, and so on.

The key to this is to make sure that everything is constant sized. Don't have
anything that has to run over every data point you collect.

As an example, you can take the last 5 minutes' worth of data points (as a
buffer) and create a single five-minute graph. You can then average those 5
minutes of points and move the result into a second buffer holding, say, an
hour's worth of points at 5-minute intervals, and so on. Each step has a
constant-sized buffer and will draw in constant time.
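A tiny sketch of those tiered, constant-size buffers using Python's
collections.deque (the interval choices are just illustrative):

    from collections import deque

    five_min = deque(maxlen=5)    # raw one-minute samples
    hourly = deque(maxlen=12)     # one averaged point per 5 minutes

    def record_minute(minute, cpu):
        five_min.append(cpu)
        if minute % 5 == 0 and len(five_min) == five_min.maxlen:
            # roll the full 5-minute buffer up into one hourly point
            hourly.append(sum(five_min) / len(five_min))

Redrawing either graph only ever touches maxlen points, no matter how long
the collector has been running.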

Also, when you view your report, only view pregenerated graphs/tables/etc. It
just makes things easier on you, and it stops you from having to care much
about how long your analysis takes to run.

~~~
prodigal_erik
It's worth noting that a sliding window of summarized monitoring samples is a
solved problem: <http://en.wikipedia.org/wiki/RRDtool>

------
ZackOfAllTrades
One way to do it: every XXX minutes, generate a file containing all the data
pulled from MongoDB for the past YYY minutes. Have that sit someplace
accessible within the intranet.

Use an XMLHttpRequest on the browser side to pull the data up, parse it into
arrays, and then throw it into a DataTable [0] or some custom Raphael.js [1]
stuff. Reload every XXX minutes.

It seems that if you just generate the data on the server side and do all the
processing on the client side, you can keep the load light on the server and
still do neat stuff with your data.

[0]<http://www.datatables.net/> [1]<http://raphaeljs.com/>
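A rough sketch of the server-side half under those assumptions - a cron job
writing the last YYY minutes of samples as JSON somewhere your web server
already serves (the path, collection, and field names are all invented):

    import json
    from datetime import datetime, timedelta
    from pymongo import MongoClient

    metrics = MongoClient().monitoring.metrics

    def dump_recent(path="/var/www/reports/cpu.json", minutes=60):
        since = datetime.utcnow() - timedelta(minutes=minutes)
        rows = metrics.find({"ts": {"$gte": since}},
                            {"_id": 0, "server_id": 1, "ts": 1, "cpu": 1})
        with open(path, "w") as f:
            json.dump([{"server_id": r["server_id"],
                        "ts": r["ts"].isoformat(),
                        "cpu": r["cpu"]} for r in rows], f)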

------
jerf
What's wrong with Cacti?

I ask this not as a veiled suggestion to "just use Cacti", but as a serious
question, to figure out what it is you are trying to do and specifically why
that's not good enough.

~~~
Skywing
We did set up Cacti, but we wanted to focus on tracking data particularly
about the CPU of each server, as well as one particular process running on it.
For that process, we wanted to track various stats like thread count, open
network connections, open files, etc. We also wanted to monitor some
environment metrics. The system we wrote is simply a Python app running under
a cron job, reporting this data to our database server(s). We're going to be
adding some notification functionality to it so that we will be notified if
things go awry, which they tend to do at least once a week. :(
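The collector is roughly in the spirit of this sketch (simplified, using
psutil; this is illustrative, not our actual code):

    # Run from cron once a minute: sample host CPU plus a few stats
    # about one named process, and ship the document to MongoDB.
    from datetime import datetime
    import psutil
    from pymongo import MongoClient

    metrics = MongoClient("mongodb://dbhost").monitoring.metrics

    def sample(process_name="ourserver"):
        doc = {"ts": datetime.utcnow(),
               "cpu": psutil.cpu_percent(interval=1)}
        for proc in psutil.process_iter(["name", "num_threads"]):
            if proc.info["name"] == process_name:
                doc["threads"] = proc.info["num_threads"]
                doc["open_files"] = len(proc.open_files())
                doc["connections"] = len(proc.connections())
                break
        metrics.insert_one(doc)

    if __name__ == "__main__":
        sample()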

I'm pleased with our data collection setup - it's working well. But obviously
the power is in the reporting, which is what we will be working on this
upcoming week. I'm just researching the best methods for doing this in a
scalable manner.

~~~
jerf
It's still somewhat unclear to me how Cacti is deficient. You know it can
track anything, right, not just what it ships with out of the box? I have a
chat-like server process that holds open TCP streams to clients indefinitely;
we wrote something to feed Cacti that number, and it tracks it like any other
metric.
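The feeder was just a script whose stdout Cacti graphs. Something like the
following would do it (illustrative only; it assumes psutil and Cacti's
script-based data input method, which reads a bare number from stdout):

    #!/usr/bin/env python
    # Print one value: established TCP connections held by a named
    # process. Cacti runs this on its polling interval and graphs it.
    import sys
    import psutil

    name = sys.argv[1] if len(sys.argv) > 1 else "chatserver"
    count = 0
    for proc in psutil.process_iter(["name"]):
        if proc.info["name"] == name:
            count += sum(1 for c in proc.connections(kind="tcp")
                         if c.status == psutil.CONN_ESTABLISHED)
    print(count)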

I don't know how the internals look, so I don't know whether it ends up being
useful for other uses or not.

The other thing you probably ought to specify is the frequency of the reports
and what you want them to report on. On a decently spec'd system, it just
doesn't seem like you're really talking about _that_ much data here; you
should probably try to be clearer about where your bottleneck is.

And you should probably head to some community that can pay attention to
something like this for a longer period of time, a forum or a newsgroup or
something.

