

Ask HN: Best way to store web traffic logs? - Mamady

I am in the process of setting up a new website, and was hoping to get some advice on how others store web traffic log data.

At the moment I am thinking of just using Google Analytics and Mixpanel. I have no intention of saving Apache access log files, so I'm looking for a web-based service.

I was considering rolling out a custom db table just to log hits, but it just feels wrong.

What are you using? If you are storing data in a db, which db is it (e.g. Mongo, Postgres, Cassandra)?
======
sehrope
How you store your logs depends on your server configuration. Analytics
services like Google Analytics or Mixpanel will work for any type of config as
they're initiated by the client. They both also have a nice UI so you can see
live users, plot them on maps, etc.

If you want lower-level detail, such as each user's IP address, you'll need
something on the server side. I haven't used Mixpanel, but Google Analytics
doesn't give you raw IP addresses. Also, if a user has it blocked (e.g. by
Ghostery) then you don't see them in Google Analytics at all. To get around
this we also log all requests server side.

The two options I know of are to do it yourself (that's what we did, more
below) or to use something like Piwik ([http://piwik.org/](http://piwik.org/)).
The latter is kind of like your own Google Analytics that you run on your own
infrastructure.

For our public cloud app
([https://cloud.jackdb.com/](https://cloud.jackdb.com/)) we run all the
infrastructure, so we aggregate the server access logs from each nginx instance
and push them to an S3 bucket. It's pretty straightforward and _really_ cheap
(S3 costs peanuts and log data gzips well). Besides audit events (which do get
logged to a database and can be queried), any funky research is done with good
ol' awk/grep/sed.
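
The push itself is basically just gzip plus an upload, run from cron after the
log gets rotated. Roughly, in Python/boto3 terms (the bucket name and paths
here are made up for illustration, not our real ones):

    # Compress a rotated nginx access log and push it to an S3 logs bucket.
    import gzip
    import shutil
    from datetime import date

    import boto3

    log_path = "/var/log/nginx/access.log.1"   # rotated log from logrotate
    gz_path = f"/tmp/access-{date.today():%Y-%m-%d}.log.gz"

    # Access logs compress very well, so gzip before uploading.
    with open(log_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    s3 = boto3.client("s3")
    s3.upload_file(gz_path, "example-access-logs",
                   f"nginx/{date.today():%Y/%m/%d}/access.log.gz")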

Our public website ([http://www.jackdb.com](http://www.jackdb.com)) is hosted
on S3 so we don't even control the actual server. Instead we've got logging
enabled on the S3 bucket, with the logs delivered to another S3 bucket[1]. S3
creates files there with a 1-3 hour lag, covering all requests with full
details (IP, user agent, etc.). The only pain is that S3 creates _a lot_ of
files, so we've got a cron job that runs regularly to combine them into daily
files, gzip them, and put them in a different S3 bucket. Again, ad hoc research
is done via unix commands on either the latest log files or the archived files
(we keep a local copy in addition to the ones in S3).
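
The combining job is nothing fancy. A rough sketch of what it does, again in
Python/boto3 terms (bucket names, prefix, and date handling are illustrative):

    # Combine the many small S3 server-access-log objects for one day into a
    # single gzipped daily file and upload it to an archive bucket.
    import gzip

    import boto3

    LOG_BUCKET = "example-www-logs-raw"          # where S3 delivers access logs
    ARCHIVE_BUCKET = "example-www-logs-archive"  # combined daily files go here
    DAY = "2013-10-01"                           # normally derived from the cron schedule

    s3 = boto3.client("s3")
    out_path = f"/tmp/{DAY}.log.gz"

    with gzip.open(out_path, "wb") as out:
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=LOG_BUCKET, Prefix=DAY):
            for obj in page.get("Contents", []):
                body = s3.get_object(Bucket=LOG_BUCKET, Key=obj["Key"])["Body"]
                out.write(body.read())

    s3.upload_file(out_path, ARCHIVE_BUCKET, f"daily/{DAY}.log.gz")
    # After verifying the upload you could delete the small source objects.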

Regardless of how you get your logs onto S3, if you want to make the storage
costs roughly 10x cheaper in the long run (again, this will only matter once
you actually have a significant amount of data), you can push them from S3 to
Glacier. Even better, you can set up S3 to automatically transition data to
Glacier after X days[2]. Just remember that you can't access the files
directly from Glacier; it's just for cold storage.

[1]: [http://docs.aws.amazon.com/AmazonS3/latest/dev/ServerLogs.html](http://docs.aws.amazon.com/AmazonS3/latest/dev/ServerLogs.html)

[2]: [http://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html#intro-lifecycle-rules](http://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html#intro-lifecycle-rules)
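
For reference, the lifecycle rule in [2] can be set up through the console or
the API. A minimal boto3 sketch, assuming the archive bucket above and an
arbitrary 90-day threshold:

    # Transition archived log objects to Glacier 90 days after creation.
    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-www-logs-archive",
        LifecycleConfiguration={
            "Rules": [{
                "ID": "archive-old-logs-to-glacier",
                "Filter": {"Prefix": "daily/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }]
        },
    )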

~~~
Mamady
This sounds really interesting, but it feels very painful for reporting.
Trying to run reports on this type of data for a dashboard seems out of the
question.

I understand ad hoc research is done via the command line, but if you wanted to
have a dashboard that shows stats, how would you handle that? I assume you
would have to import the data into a db of some sort to run queries regularly?

~~~
sehrope
> This sounds really interesting, but it feels very painful for reporting.
> Trying to run reports on this type of data for a dashboard seems out of the
> question.

Yes, we've saved it mainly to look at later. It's the lowest level of detail,
so I figure we can mine it later. To do anything useful with it, it would have
to be processed, though I don't think it's _that_ much work.

> I understand ad hoc research is done via the command line, but if you wanted
> to have a dashboard that shows stats, how would you handle that?

We don't use those files for reporting. We have user stat reports (activity,
actions, etc.) generated and look at the numbers themselves, but nothing I'd
consider "pretty". All of those are generated from the audit trail data, so
it's already in a database. The ad hoc reporting from the command line is for
when I want to trace something specific in detail. I've usually filtered it
down to a small enough set that grep/awk is more than enough, and it only
takes a couple of seconds.

------
darkxanthos
If you have the cash and analytics aren't a real, distinct business advantage,
just go with Mixpanel.

If you decide to do it yourself, this is what I've done: create a small web
service that you can call to log data from the UI. Start with one server, and
if it consistently goes over 60-80% usage, add a second.

The server should log every call to the service in a large flat file (CSV is
easiest). The file should be named by date and time, down to the minute. As you
scale up servers, you just have a process pull down each file and aggregate
them server-side. Or just throw them into S3 and use Hive/EMR to report on the
data.
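
In Python/Flask terms (my setup was Ruby/Sinatra, and the endpoint, fields,
and paths here are made up), the idea looks roughly like this:

    # Tiny web service that appends each logging call to a per-minute CSV file.
    import csv
    import os
    from datetime import datetime, timezone

    from flask import Flask, request

    app = Flask(__name__)
    LOG_DIR = "/var/log/ui-events"

    @app.route("/log", methods=["POST"])
    def log_event():
        now = datetime.now(timezone.utc)
        # One file per minute, e.g. events-2013-10-01-14-05.csv
        path = os.path.join(LOG_DIR, now.strftime("events-%Y-%m-%d-%H-%M.csv"))
        event = request.get_json(force=True) or {}
        with open(path, "a", newline="") as f:
            csv.writer(f).writerow([
                now.isoformat(),
                request.remote_addr,
                event.get("name", ""),
                event.get("value", ""),
            ])
        return "", 204

A process can then sweep those per-minute files off each server and aggregate
them (or push them to S3 for Hive/EMR) on whatever schedule you like.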

It's a middle-class man's Mixpanel. I served tens of millions of logging
events a day with this solution. At the time the cost was somewhere around
$1,500 a month, I believe. I was running 6 servers on Ruby/Sinatra, though,
and never tried to optimize much.

EDIT: typo

~~~
Mamady
I guess the part I'm really keen to set up is the S3 + Hive/EMR part. But it
sounds like it will be a bit too expensive for up-to-date stats, and better to
just do batch processing to run reports.

------
taylorbuley
If you are planning to run at any sort of scale, I advise staying away from
logging to a database. Tying request throughput to database I/O like that
could really hurt.

------
rip747
Just use Google Analytics. It's free, and the reporting is extremely powerful.
Trying to roll your own solution is a total waste of time.

The only thing you should be doing with your logs is archiving them in case
of a security breach, so you can try to pinpoint how the attack happened.

Don't waste the space on your LAN for the log archives, either. Get an S3
account, zip them up, and store them on S3.

