

Is anybody else loading their database by tailing log files? - petewarden

I recently had a brief chat with a couple of developers working on different data-heavy websites, who were both using an interesting pattern for filling their databases.

Their data-gathering components (pulling from external sources like crawlers and APIs) would append new data to the bottom of a log file. Another process sat doing something like a 'tail -f' on the same file, and parsed and added the updates to the database.

This seems like it might solve some problems for my case:

- Very easy to recreate the database if the schema changes or things blow up: just reread the log files

- Good history for debugging

What worries me is that it feels funky using files for IPC, and I can't find any examples of this being used elsewhere. So, is anyone else using this pattern, or have any references to it that I'm missing?
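Not from the thread, but here's a minimal Python sketch of the pattern being described: a `tail -f`-style follower for live loading, plus a full replay to rebuild the database from scratch. The table name (`items`) and the JSON-lines record shape are made-up assumptions for illustration.

```python
import json
import sqlite3
import tempfile
import time

def follow(path, poll_interval=0.5):
    """Yield lines appended to path, roughly like `tail -f`."""
    with open(path) as f:
        f.seek(0, 2)               # start at the current end of file
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                time.sleep(poll_interval)

def replay(log_path, db):
    """Rebuild the table from scratch by re-reading the whole log."""
    db.execute("CREATE TABLE IF NOT EXISTS items (source TEXT, payload TEXT)")
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            db.execute("INSERT INTO items VALUES (?, ?)",
                       (record["source"], json.dumps(record["data"])))
    db.commit()

# Demonstrate the replay path with a throwaway log file.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as log:
    log.write('{"source": "crawler", "data": {"url": "http://example.com"}}\n')
    log.write('{"source": "api", "data": {"id": 42}}\n')
    log_path = log.name

db = sqlite3.connect(":memory:")
replay(log_path, db)
row_count = db.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

The nice property is that `replay` is the whole recovery story: drop the table, change the schema, and re-run it over the log files.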
======
gstar
It's certainly an interesting approach! At low volume it could have the
benefits you describe, but it may get a bit old as the volume increases.

I'd engineer the crawler to talk to a persistent message queue, and load the
database from there. That gives you a lot of flexibility: you can move loads
around, instrument the queue, and you're not reinventing things, either.
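To make the suggestion concrete, here's a toy persistent queue backed by SQLite, standing in for a real broker like RabbitMQ or beanstalkd; the table layout and class name are invented for the sketch. The crawler calls `put`, the loader calls `get`.

```python
import json
import sqlite3

class SqliteQueue:
    """Toy durable queue backed by SQLite; a stand-in for a real
    message broker such as RabbitMQ or beanstalkd."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS q "
            "(id INTEGER PRIMARY KEY, body TEXT, done INTEGER DEFAULT 0)")
        self.db.commit()

    def put(self, message):
        """Producer side: the crawler appends a message."""
        self.db.execute("INSERT INTO q (body) VALUES (?)",
                        (json.dumps(message),))
        self.db.commit()

    def get(self):
        """Consumer side: the DB loader takes the oldest pending message."""
        row = self.db.execute(
            "SELECT id, body FROM q WHERE done = 0 "
            "ORDER BY id LIMIT 1").fetchone()
        if row is None:
            return None
        self.db.execute("UPDATE q SET done = 1 WHERE id = ?", (row[0],))
        self.db.commit()
        return json.loads(row[1])

q = SqliteQueue(":memory:")
q.put({"url": "http://example.com"})
q.put({"url": "http://example.org"})
first = q.get()
```

Because consumed messages are only marked `done` rather than deleted, the queue table doubles as the replayable history the original post wants from log files.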

------
brown9-2
_Very easy to recreate the database if the schema changes or things blow up,
just reread the log files_

This would be nasty though if your log files wrapped, the disk they were on
ran out of space, etc.

~~~
petewarden
True, that's part of what I'm trying to wrap my head around: the gotchas of
using files as the building blocks for this sort of thing.

My hazy mental picture is that I'll be creating new local log files regularly
and moving the older ones to S3 backups, but it feels like there are a lot of
edge cases there that I might not think of.
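That rotation step can be sketched like this, assuming size-based rotation and an `archive` callback standing in for an S3 upload (the function name and thresholds are made up). One real gotcha shows up immediately: on Unix, a writer that still holds the old file open keeps writing to the renamed file, so the appender has to reopen its log periodically.

```python
import os
import tempfile
import time

def rotate_if_needed(log_path, max_bytes, archive):
    """If log_path has grown past max_bytes, rename it with a timestamp
    suffix and hand the rotated file to `archive` (e.g. an S3 uploader).
    Caveat: a writer that still has the old file open keeps appending to
    the renamed file, so appenders must reopen the log path regularly."""
    if not os.path.exists(log_path) or os.path.getsize(log_path) < max_bytes:
        return None
    rotated = "%s.%d" % (log_path, int(time.time()))
    os.rename(log_path, rotated)
    archive(rotated)
    return rotated

# Demonstrate with a throwaway directory and a list standing in for S3.
archived = []
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "updates.log")
with open(path, "w") as f:
    f.write("x" * 1024)            # a "full" log file
rotated = rotate_if_needed(path, max_bytes=512, archive=archived.append)
```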

------
petewarden
As an update, I did some groundwork to see how this might work in PHP, by
creating a small example that tails the Apache error log:
[http://petewarden.typepad.com/searchbrowser/2009/09/how-
to-f...](http://petewarden.typepad.com/searchbrowser/2009/09/how-to-follow-
your-apache-error-logs-in-a-browser.html)

Still feels kinda sketchy...

------
fsniper
RDBMSs have logging mechanisms for exactly this kind of database rebuild. For
example, PostgreSQL has the WAL (write-ahead log), which can be used to
rebuild the db or for asynchronous replication. Likewise, MySQL has binary
logging.

------
skwiddor
cat data | tee log | data_processor

It's called "the Unix philosophy". Invented by Doug McIlroy, probably before
you were born.

