

Ask HN: need advice on how to write lots of data fast - geuis

I'm working on an analytics app that will be generating several HTTP requests for every page view, probably an average of 2-3. I'm mainly a front-end guy, and my knowledge of best practices for storing data is kind of lacking.

I know that writing to log files is hard-drive intensive. I also know that the preferred method is to write to a database, which I know how to do and can easily set up. But I'd like to stretch my boundaries a bit here and learn something new.

I had an idea the other day to use nginx as my webserver to handle the requests. Very fast & lightweight. It would dump any new data into a memcached instance, with a cron job running every couple of minutes to suck the memcached data into either a db or a flat file. Even I can think of several reasons why this might not be optimal, but I thought it would be a cool way to learn how to use memcached and get experience with building lightweight systems.

So does anyone have any thoughts on anything I've said, and maybe can make some recommendations?

I ran into a related project called memcachedb, which purports to use the memcached API calls but writes to a BerkeleyDB backend. I haven't looked into it much, but it does seem interesting.
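For what it's worth, the "cron drains the cache into a flat file" step might look something like this sketch. A plain dict stands in for the memcached client here (a real setup would use a library such as pylibmc or python-memcached, with the same get/delete shape), and the event keys and fields are made up for illustration:

```python
import json

# Stand-in for a memcached client; keys/values are hypothetical examples.
cache = {
    "hit:1": {"path": "/home", "ts": 1000},
    "hit:2": {"path": "/about", "ts": 1001},
}

def drain_to_log(cache, logfile):
    """Move every buffered event out of the cache and append it to a flat file."""
    drained = 0
    with open(logfile, "a") as f:          # append-only: no read-modify-write
        for key in list(cache):
            event = cache.pop(key)         # remove so the next cron run skips it
            f.write(json.dumps(event) + "\n")
            drained += 1
    return drained
```

One caveat this makes visible: anything sitting in memcached between cron runs is lost if the box dies, so it trades durability for cheap writes.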
======
cperciva
_I know that writing to log files is hard drive intensive. I also know that
the preferred method is to write to a database_

No. Databases are _far_ more disk intensive than logging; they use vastly
higher disk bandwidth, and depending on the database, often use lots of disk
seeks too.

~~~
vicaya
That's generally true for traditional DBs. A few new-generation DBs have options to buffer writes in memory, achieve durability via in-memory replication to other nodes, and only write to disk sequentially in large chunks. That can actually beat simple logging, where you'd have to sync after every write to avoid data loss upon hardware failure.

~~~
cperciva
If you're going to consider a database which buffers writes in memory, you
should also consider logging which buffers writes in memory.
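Memory-buffered logging is already in Python's stdlib, for instance. A minimal sketch (the filename and logger name are arbitrary examples):

```python
import logging
import logging.handlers

# Records accumulate in memory and are flushed to the target handler only
# when the buffer fills (or a severe record arrives), so each request costs
# a memory append rather than a disk write.
target = logging.FileHandler("hits.log")
buffered = logging.handlers.MemoryHandler(
    capacity=1000,                 # flush to disk after 1000 records...
    flushLevel=logging.ERROR,      # ...or immediately on anything serious
    target=target,
)

log = logging.getLogger("analytics")
log.addHandler(buffered)
log.setLevel(logging.INFO)
log.info("pageview /home")         # buffered in memory, not yet on disk
```

The same caveat applies as with the in-memory database: buffered records vanish if the process crashes before a flush.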

~~~
vicaya
Maybe my comment was not clear enough: buffering writes in memory on a single
node alone has no durability to speak of. You need either a DFS or distributed
DB that does replication of the buffers on different nodes to provide some
reasonable durability.

------
newhouseb
I think you're looking for scribe:

<http://developers.facebook.com/scribe/>

------
neodude
One of the questions you need to ask is: what are you going to do with all
that data? If all you need is to store the data and never read it again except
sequentially, then flat log files are probably the best way. But if you ever
need to, say, calculate aggregates or do any kind of visualization, a database
would probably be a better initial choice.

You'll want to keep in mind that whatever you choose should be pretty
flexible, because as you discover more about the analytics you're providing,
you'll need to slice and dice that data differently. You won't know what data
you need, or how it will answer the questions, until you know the questions.

------
fauigerzigerk
Put the data in a message queue asynchronously. A second process can take the
data from there and put it in a database or wherever it's easiest for you to
analyse.
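An in-process sketch of that producer/consumer split, using Python's queue module and SQLite (a production setup would put a real broker between the two processes, but the shape is the same; the table and field names are made up):

```python
import queue
import sqlite3
import threading

events = queue.Queue()  # request handlers just enqueue and return

def writer(db_path, stop):
    """Single consumer: drains the queue and writes rows into SQLite."""
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS hits (path TEXT, ts REAL)")
    # Keep draining until asked to stop AND the queue is empty.
    while not (stop.is_set() and events.empty()):
        try:
            path, ts = events.get(timeout=0.1)
        except queue.Empty:
            continue
        db.execute("INSERT INTO hits VALUES (?, ?)", (path, ts))
        db.commit()
    db.close()
```

The point is that the web-facing code never blocks on the database; only the consumer pays the disk cost, and it can batch however it likes.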

------
daleharvey
discodb looks great for analysing analytics data.

Appending data to a file is a pretty cheap operation; just make sure you
append properly and don't do something like read the whole file, append the
string, then write it all back. A lightweight server like mochiweb will
probably let you handle more traffic than you'll see for a while before you
have to build a proper logging queue.
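The append-vs-rewrite distinction in concrete terms (a Python sketch of the same idea; function names are just for illustration):

```python
# Cheap: cost proportional to one record. The OS just extends the file.
def log_append(path, line):
    with open(path, "a") as f:
        f.write(line + "\n")

# Expensive anti-pattern: cost proportional to the whole file, every time.
def log_rewrite(path, line):
    try:
        with open(path) as f:
            contents = f.read()        # reads everything logged so far
    except FileNotFoundError:
        contents = ""
    with open(path, "w") as f:
        f.write(contents + line + "\n")  # rewrites everything plus one line
```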

