

Real-Time Log Collection with Fluentd and MongoDB - kzk_mover
http://blog.treasure-data.com/post/13766262632/real-time-log-collection-with-fluentd-and-mongodb

======
nl
Very nice & all, but... MongoDB?

MongoDB is very popular, but all the (limited) criticisms of it seem to
relate to insert performance once the dataset is too big to fit in RAM.

Normally the ease-of-development arguments make up for that, but log files are
one of those areas that have a tendency to expand quickly beyond any
expectations.

There is a reason why most companies are using HDFS and/or Cassandra for
structured log file storage.

~~~
kzk_mover
Fluentd's greatest advantage is that it's written in Ruby, so it's really easy
to write a plugin for any datastore.

This is the HDFS plugin (using Hoop, the HDFS REST gateway):

* <https://github.com/tagomoris/fluent-plugin-hoop>

A Cassandra plugin is also now under development:

* <https://github.com/tomitakazutaka/fluent-plugin-cassandra>

You can see the user contributed plugin list here.

* <https://github.com/tomitakazutaka/fluent-plugin-cassandra>
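
The setup the post describes can be sketched as a Fluentd config: tail the
Apache access log and route matching events to the MongoDB output plugin. The
paths, database, and collection names below are illustrative assumptions, not
taken from the article; check the fluent-plugin-mongo docs for your version.

```
# Tail the Apache access log and tag each event for the mongo output
<source>
  type tail
  path /var/log/apache2/access_log   # assumed log path
  format apache
  tag mongo.apache
</source>

# Write matching events into a MongoDB collection
<match mongo.**>
  type mongo
  host localhost
  port 27017
  database apache      # assumed database name
  collection access    # assumed collection name
</match>
```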

~~~
seany
If you were going from logs -> HDFS, Flume would be a much better choice imho

------
boredandroid
This is SUPER helpful! Just the other day I was wondering how someone like me
could get involved in the hard scalability problems I read so much about here
on the hackers news. But how to make my boring old highly cacheable read-only
web traffic into a major scalability problem? Then I read this blog entry, and
wow, now each log entry on my site turns into a random btree update in MongoDB
made while holding a global write lock. Thanks again hackers news, and thanks
again BIG DATA!

~~~
viraptor
Or think about it in a different way - instead of adding disk IO on the server
itself, you're offloading the log processing to another server which can delay
writes (you don't usually need an immediate sync for remote logging) and
gives you better log processing capabilities (semi-structured data).

If your workload cannot be handled this way - that's another thing. But how
did we get from "mongo is webscale" to "mongo cannot be used for anything at
all"? What happened to benchmarking and making serious decisions backed by
real data?

~~~
jallmann
Syslog works nicely over the network in a client-server configuration, and has
done so for ages.

~~~
viraptor
For write-only logging from stateless, single-machine-bound processes - yes.
For analytics, automated tracking of stateful sessions across many nodes,
preserving context, dumping binary fragments... no; at least for me it did not
always work.

~~~
jallmann
You can reconstruct almost any system flow with good logging. While that's not
always ideal (especially if you need to query the data), the more structured
your data gets, the less it is a simple log. When you increase the specificity
of your tools, they become less useful in the general case, Turing tarpit
notwithstanding.

------
bluesmoon
How does fluentd resume tailing the apache log if it crashes? Does it maintain
the current file position on disk? What if logs are rotated between a fluentd
crash and recovery?

I've had to solve this problem for Yahoo!'s performance team, and ended up
setting a very small log rotation timeout, and only parsing rotated logs.
There's a 5-30 minute delay in getting data out of logs (depending on how busy
the server is), but since we're batch processing anyway, it doesn't matter.

The added advantage is that you just maintain a list of files that you've
already parsed, so if the parser/collector crashes, it just looks at the list
and restarts where it left off. Smart key selection (i.e., something like IP or
userid + millisecond timestamp) is enough to ensure that if you do end up
reprocessing the same file (e.g., if a crash occurs mid-file), duplicate
records aren't inserted (use the equivalent of a bulk INSERT IGNORE for your
db).

This scales to billions of log entries a day.
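
A minimal sketch of that idempotent-reprocessing idea, using SQLite's INSERT
OR IGNORE as a stand-in for a bulk INSERT IGNORE (the table, key choice, and
file names are made up for illustration):

```python
import sqlite3

# In-memory DB standing in for the real store; a unique key on
# (ip, millis) makes re-inserting the same record a no-op.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE hits (
    ip TEXT, millis INTEGER, url TEXT,
    PRIMARY KEY (ip, millis))""")

processed = set()  # files already fully parsed (persist this for real use)

def parse_rotated_log(path, records):
    """Parse one rotated log file; safe to call again after a crash."""
    if path in processed:
        return 0  # already done, skip entirely
    # INSERT OR IGNORE: duplicates from a half-processed file are dropped
    db.executemany("INSERT OR IGNORE INTO hits VALUES (?, ?, ?)", records)
    db.commit()
    processed.add(path)
    return db.execute("SELECT COUNT(*) FROM hits").fetchone()[0]

records = [("10.0.0.1", 1700000000123, "/a"),
           ("10.0.0.2", 1700000000456, "/b")]
parse_rotated_log("access.log.1", records)
# Simulate a crash-and-retry: reprocessing inserts no duplicates
processed.discard("access.log.1")
total = parse_rotated_log("access.log.1", records)
print(total)  # 2
```

The (ip, millis) primary key plays the role of the "smart key" above: replayed
rows collide with existing ones and are silently skipped.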

------
ngokevin
I have a syslog-ng -> MongoDB project that I've been working on at my
university.

github.com/ngokevin/netshed

It is written in Python and currently parses out fields from several types of
logs (such as dhcpd). It is initially set up to read from named pipes (it has
a tail function as well). Each type of log is dumped to its own database, and
each date has its own collection. I have it set up with a master/slave
configuration to overcome the global write lock. It has functions to simulate
capped collections by day. It is paired with a Django frontend for querying
via PyMongo.

This version is several weeks old and I will push out a new one soon.

~~~
alexchamberlain
Have you got more details about overcoming the global write lock?

~~~
ngokevin
Oh sorry, when I say overcome the global write lock, I don't mean getting rid
of it, but simply allowing me to query a replicated slave database while the
master database is getting hundreds of writes a second... so the writes don't
block the reads.

------
kordless
I'd also suggest looking at both Logstash and Graylog2. They can both use
MongoDB as the storage engine for logs, and can also do field extraction.

~~~
ashish_0x90
FWIW, Graylog2 will be switching to an ElasticSearch backend from the existing
MongoDB one, citing performance constraints (and lack of FTS functionality)
specific to MongoDB. Find the entire comment here -
[http://groups.google.com/group/graylog2/browse_thread/thread...](http://groups.google.com/group/graylog2/browse_thread/thread/da6bf5d51ae34bad/0aeaea558efd9568?show_docid=0aeaea558efd9568)

This is something I am working on right now: a centralized logging system for
the production servers. Logs will get indexed in ElasticSearch (pretty awesome
project, imho!!), where I can run search queries against the indexes. I am
using Logstash for parsing and routing logs from the production servers to the
ElasticSearch instance.
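
That kind of pipeline can be sketched as a Logstash config along these lines.
The file paths and host are assumptions, and option names vary between
Logstash versions, so treat this as a shape rather than a drop-in config:

```
input {
  file {
    path => "/var/log/app/*.log"   # assumed log location
    type => "applog"
  }
}
filter {
  grok {
    # parse Apache-style lines into structured fields
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]    # assumed ElasticSearch instance
  }
}
```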

------
doryokujin
Great! If we use Fluentd and MongoDB, we can collect events in real time
without writing any code, only configuration. I am also thinking about a more
flexible aggregation system using them: "An Introduction to Fluent & MongoDB
Plugins" [http://www.slideshare.net/doryokujin/an-introduction-to-fluent-
mongodb-plugins](http://www.slideshare.net/doryokujin/an-introduction-to-fluent-
mongodb-plugins) . Please tell me if there are more powerful use-cases for
Fluentd & Mongo!

------
nodesocket
Do you have a parser for nginx and/or lighttpd? I would like to push logs from
these to MongoDB.

~~~
hoop
For lighttpd, try the following in your config to log in a format identical
to that of Apache's combined log format.

    accesslog.format = "%h %V %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\""

See the docs on ModAccessLog for more information:
<http://redmine.lighttpd.net/wiki/1/Docs:ModAccesslog>
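
For the nginx half of the question: nginx ships with a predefined "combined"
log format matching Apache's, so a sketch like the following (the log path is
an assumption) should produce Apache-compatible lines that the same parser can
handle:

```
# nginx's built-in "combined" format matches Apache's combined log format
access_log /var/log/nginx/access.log combined;
```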

