
How we built our Real-time Analytics Platform - jdorfman
http://blog.maxcdn.com/learned-stop-worrying-love-logs/
======
j_s
TL;DR: TokuMX (MongoDB upgrade) through Redis with Go on the back end and
Node.js for the API.

The biggest takeaway for me was that someone is selling an improved GPL2
MongoDB ('accelerated speed for inserting data as well as compression') for
$2500/server/year with hot backup and support.

~~~
mxpxrocks10
nice.

------
nodesocket
Awesome job. Curious, did you investigate Fluentd
([http://www.fluentd.org](http://www.fluentd.org))?

Also, it seems a bit odd to use MongoDB for write-optimized workloads, since
it essentially has a global write lock. Doesn't Cassandra perform better at
writes?

~~~
bconklin42
We did prototype for a while with Fluentd as well as Cassandra and a few other
technologies which we decided not to use in the end. Much of why we chose what
we did comes down to satisfying our requirements, and the "Minimal Hardware
with Redundancy And Scale" and "Minimal Software Layers" requirements came
into play with regard to your question. We found that most off-the-shelf
technologies are great at scaling, but at a cost that would have required us
to deploy a lot of hardware. We were able to reach the speed we needed by
writing our own Go client to ship the logs off to our cluster, and that also
gave us the customizability and control to make sure we weren't consuming
resources our servers needed during peak times to deliver content.

As for why we chose the database we did: while TokuMX is basically a drop-in
replacement for MongoDB, it has features that make it much more usable for our
situation. Specifically, it gives us a compression rate of over 80% without
negatively impacting our insert speed, as well as document-level locking. And
because TokuMX works as a drop-in replacement for MongoDB, we were able to use
the MongoDB driver for Go (mgo,
[http://labix.org/v2/mgo](http://labix.org/v2/mgo)), which is just a really
great tool. Additionally, we are now able to leverage the MongoDB aggregation
framework, which lets us build very helpful aggregations of the data very
quickly.
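To make that concrete, here's a minimal sketch of the kind of aggregation
query this enables via mgo. The `analytics`/`requests` names and the
`zone`/`bytes` fields are purely illustrative, not MaxCDN's actual schema:

```go
package main

import (
	"fmt"
	"log"

	"labix.org/v2/mgo"
	"labix.org/v2/mgo/bson"
)

func main() {
	// TokuMX speaks the MongoDB wire protocol, so mgo connects
	// to it unchanged.
	session, err := mgo.Dial("localhost:27017")
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Hypothetical collection of per-request log documents.
	c := session.DB("analytics").C("requests")

	// Aggregation pipeline: total bytes served per zone,
	// largest first.
	pipeline := []bson.M{
		{"$group": bson.M{
			"_id":   "$zone",
			"bytes": bson.M{"$sum": "$bytes"},
		}},
		{"$sort": bson.M{"bytes": -1}},
	}

	var results []bson.M
	if err := c.Pipe(pipeline).All(&results); err != nil {
		log.Fatal(err)
	}
	for _, r := range results {
		fmt.Printf("%v: %v bytes\n", r["_id"], r["bytes"])
	}
}
```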

~~~
kiyoto
>We were able to reach the speed we needed by writing our own Go client to
ship off the logs to our cluster

That's a fair argument: when written intelligently, Go has great performance.

The other reason is performance vs. flexibility. Any generic log collector
(Logstash, Fluentd, etc.) needs to support multiple data sources and outputs
and is probably less optimized for the specific situation at hand.

Just curious: what's the buffering strategy for your custom agent written in
Go? Does it support file-/memory-based buffering?

~~~
bconklin42
There isn't a whole lot of need for buffering, since we've got all the tweaks
in place that make our system as fast as it is, but we do have a couple of
levels of buffering in case of network issues, downtime before a replica set
member takes over as primary, and so on. The goal is to keep any isolated
failure from slowing down the processing that comes before that point in the
workflow.

Working from the end backwards, the first (or last) buffer is a Go buffered
channel running within the process on the router which feeds into the
database. These channels work as a sort of queue between concurrent
goroutines, or "workers", and have a set amount of allocated space in memory.
They are empty most of the time, but if pushing to the database fails they can
start to fill up, so as not to block the process before them in the workflow
until the system recovers. In front of those Go workers on each of the
"Router" servers is a Redis queue, which basically serves as a holding pen
between the CDN servers and the database cluster; it is the data's first stop
after leaving its originating servers.
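A rough sketch of that router-side pattern: a buffered channel sits between
the goroutine draining Redis and the workers inserting into the database. The
`popLogLine` and `insertBatch` helpers below are stand-ins (the real agent
isn't public), and the buffer size is arbitrary:

```go
package main

import (
	"log"
	"time"
)

// popLogLine stands in for a blocking BRPOP against the router's
// Redis queue (hypothetical; the real agent isn't public).
func popLogLine() string {
	time.Sleep(10 * time.Millisecond) // simulate Redis latency
	return `{"zone":"example","bytes":1234}`
}

// insertBatch stands in for an insert into TokuMX via mgo.
func insertBatch(lines []string) error {
	log.Printf("inserted %d documents", len(lines))
	return nil
}

func main() {
	// The buffered channel is the in-memory queue between the
	// Redis reader and the database workers. While inserts
	// succeed it stays near empty; if the database is down it
	// absorbs up to 10,000 lines before the reader blocks.
	queue := make(chan string, 10000)

	// Worker goroutines drain the channel and write to the
	// database, retrying failed inserts so data backs up into
	// the buffer instead of being dropped.
	for i := 0; i < 4; i++ {
		go func() {
			for line := range queue {
				for insertBatch([]string{line}) != nil {
					time.Sleep(time.Second)
				}
			}
		}()
	}

	// Feed the channel from Redis; a full channel blocks this
	// send, pushing back pressure toward the Redis queue and,
	// eventually, the CDN edge.
	for {
		queue <- popLogLine()
	}
}
```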

On those originating servers is the Go process which reads the logs and pushes
the data to Redis on the routers; these also use buffered channels. All these
layers of buffering work to prevent a block in the workflow during momentary
downtime while the system recovers. However, each layer of buffering does have
a limit, and if it fills up it will begin to block the process before it. In
the event of a major failure (such as all members of a replica set being down,
or the network between the CDN server and the router being down), the Go
process running on the CDN server will stop reading the logs at whatever
position it has reached when it can no longer feed lines into the channel, and
it will hold that position until the channel is drained from the other side
and space becomes available.
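A minimal sketch of that edge-side behavior, assuming the agent reads an
access log line by line (the log path and `pushToRedis` helper are
hypothetical): the blocking send on a full channel is what pauses the reader
until downstream drains.

```go
package main

import (
	"bufio"
	"log"
	"os"
	"time"
)

// pushToRedis stands in for an LPUSH to a router's Redis queue
// (hypothetical helper; a returned error would mean the router
// is unreachable).
func pushToRedis(line string) error {
	return nil
}

func main() {
	// Hypothetical log path; MaxCDN's actual layout isn't public.
	f, err := os.Open("/var/log/access.log")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Small buffered channel between the file reader and the
	// Redis pusher; when the router is down the pusher stalls,
	// the channel fills, and the reader blocks.
	lines := make(chan string, 1000)

	done := make(chan struct{})
	go func() {
		for line := range lines {
			// Retry until the push succeeds so no line is lost.
			for pushToRedis(line) != nil {
				time.Sleep(time.Second)
			}
		}
		close(done)
	}()

	// The reader only pulls the next line once the previous one
	// is accepted by the channel, so a full channel holds our
	// place in the log until downstream recovers.
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		lines <- scanner.Text()
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	close(lines)
	<-done // wait for the pusher to drain the buffer
}
```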

------
dueprocess
I stopped using MaxCDN since a website I took offline months earlier suddenly
achieved "bandwidth overage" according to them, and my credit card was
charged.

Recently, they advertised they now provide server logs to prove usage, but I'm
done with them.

~~~
mxpxrocks10
hi hi - Sorry that happened. If you have a sec, can you drop me the details?
chris at maxcdn com

