

Ask HN: Scalable way of logging page request data from a PHP application? - ichilton

A web application I am developing (in PHP) requires the ability to log each page request.

Just like a normal access_log, it will store details like the URL requested, source IP address and date/time, but I also need it to store the user ID of the logged-in user (which is stored in a PHP session variable).

This data will then be queried to create site-wide or per-user analytics reports as required at a later date: things such as total visits/unique visits, page views in a certain time period, geo-locating the IP addresses and looking at locations, most active times of day, most active members, etc.

The obvious thing to do would be to run a MySQL INSERT on each page request, but if the application is receiving thousands of req/sec this will be a huge bottleneck on the database, so I am looking at alternative, scalable ways of doing this without big infrastructure requirements.

A few of the ideas I've had:

1) Find a way for Nginx (it's hosted on Nginx with PHP-FPM) to log the user_id from the session/application in the normal web server access_log, which could be parsed and loaded into a database periodically (nightly). This feels like a bit of a hack and would need doing on each web server as the system scales out.

2) Log each page request into Redis, which has high write speeds. The problem with this is the lack of ability to query the data at a later date.

3) Log each page request into either Memcache/Redis acting as a cache (or a message queue), from which it would be regularly extracted, inserted into MySQL and removed.

4) Would something like MongoDB, which has more query capability, be suitable?

I'm interested in how HN readers would approach this and whether anyone has experience of a similar application (or has come across anything similar online).

I'm also interested in thoughts on how the data could be suitably structured for storage in Memcache/Redis.

Thanks
======
latch
My $0.03

The first thing you need to answer is whether you'll do data processing on
select, on insert, or as a mix of both.

On select, you only insert raw data (like an access log). You can then pick
one of two ways to generate your reports and transform your data: do it in
real time, when a user requests the data, or do it as a batch process and run
the whole OLTP-to-OLAP transformation offline.
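The offline batch option can be sketched roughly like this. This is a Python sketch under assumed field names (`user_id`, `url`, `ts` are illustrative, not from the thread): raw hit rows, as you would read them out of MySQL or a parsed access log, get rolled up into per-user and per-URL page-view counts for the reporting side.

```python
# Sketch of an offline OLTP -> OLAP pass: raw hit rows are aggregated
# into summary counts that the reporting queries read directly.
from collections import defaultdict
from datetime import datetime

# Illustrative raw rows; in practice these come from the raw hits table.
raw_hits = [
    {"user_id": 1, "url": "/home",  "ts": datetime(2011, 5, 1, 9, 0)},
    {"user_id": 1, "url": "/about", "ts": datetime(2011, 5, 1, 9, 5)},
    {"user_id": 2, "url": "/home",  "ts": datetime(2011, 5, 1, 10, 0)},
]

def rollup(hits):
    """Aggregate raw hits into page views per user and per URL."""
    per_user = defaultdict(int)
    per_url = defaultdict(int)
    for hit in hits:
        per_user[hit["user_id"]] += 1
        per_url[hit["url"]] += 1
    return per_user, per_url

per_user, per_url = rollup(raw_hits)
```

Until the batch job runs again, reports built from `per_user`/`per_url` show stale (cached) data, which is exactly the trade-off described above.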

On insert you calculate everything as each hit comes in and fill in your
reporting structure directly.

Both the select with real-time report generation and the insert approach give
users access to real time data. Select with an offline transformation script
will provide cached data until the batch process is run again.

I'd guess that your users want data as close to real time as possible, with
some of it being more time-sensitive than the rest. This is where you adopt
some form of hybrid/mix.

As for actual implementation... First, can you afford to lose data? Don't
jump the gun and immediately say "no". A lot (but certainly not all) of
analysis on this type of data works on averages and general trends. If you
happen to lose a couple hundred rows out of a couple hundred thousand, it
doesn't really change the outcome much. Anyway, it's important to know this
because some storage engines support non-durable writes (writes to memory)
which are stupid fast, but might result in lost data (or you can possibly
write your own buffering logic in the app/web code to spin off another thread
and write every 500 hits as a bulk insert).
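That app-side buffering idea can be sketched in a few lines. This is a Python sketch (the OP's app is PHP, but the pattern carries over); `HitBuffer` and `flush_fn` are illustrative names, and `flush_fn` stands in for a real bulk INSERT:

```python
# Sketch of app-side write buffering: hits accumulate in memory and are
# flushed as one bulk write every batch_size hits. A crash loses at most
# batch_size - 1 buffered hits, which is the durability trade-off above.
class HitBuffer:
    def __init__(self, flush_fn, batch_size=500):
        self.flush_fn = flush_fn      # e.g. a function doing one bulk INSERT
        self.batch_size = batch_size
        self.buffer = []

    def record(self, hit):
        self.buffer.append(hit)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []

# Demo with a tiny batch size: 7 hits -> two flushes of 3, one hit pending.
flushed = []
buf = HitBuffer(flushed.append, batch_size=3)
for i in range(7):
    buf.record({"url": "/page", "n": i})
```

In a real deployment you would also flush on a timer and on shutdown, so a quiet period doesn't strand the tail of the buffer.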

Technologies. MongoDB has a couple things going for it. First, writes are
fast, and they can either be done to memory (stupid fast) or to disk (or to X
replicas). Even writes to disk are pretty damn fast (to X replicas will
largely depend on network latency). The other thing MongoDB has going for it
is MapReduce. Sadly it's single-threaded, but whether you are doing real-time
reporting or an offline OLTP->OLAP transformation, MapReduce is significantly
more powerful than OrderBy.

In chapter 6 of the (free) Little MongoDB Book
(<http://openmymind.net/mongodb.pdf>) I actually outlined the initial OLTP to
OLAP process we were using for a very modest reporting tool.

Since then, we've switched to Redis and do it all on insert to provide
real-time analytics. However, we are only tracking 3 very basic incrementing
statistics. You can see the code here:
[https://github.com/mogade/mogade-server/blob/master/app/models/stat.rb](https://github.com/mogade/mogade-server/blob/master/app/models/stat.rb)
(the `hit` method).
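The shape of that on-insert approach, applied to the OP's problem, looks something like this. A Python sketch with illustrative key names; a plain dict stands in for Redis here so it runs without a server, but each `+= 1` corresponds to a Redis INCR on the same key:

```python
# Sketch of "do it all on insert": every hit immediately increments a
# handful of counters, the way Redis INCR would, so reports are just reads.
from collections import defaultdict

stats = defaultdict(int)  # stand-in for Redis; keys mimic a Redis key scheme

def record_hit(user_id, url, day):
    stats[f"hits:{day}"] += 1                 # site-wide daily hits
    stats[f"hits:{day}:user:{user_id}"] += 1  # per-user daily hits
    stats[f"hits:{day}:url:{url}"] += 1       # per-URL daily hits

record_hit(42, "/home", "2011-05-01")
record_hit(42, "/about", "2011-05-01")
record_hit(7, "/home", "2011-05-01")
```

The catch is that you can only answer questions you anticipated when you chose the keys; anything else needs the raw data, which this approach throws away.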

And of course: why not just use Google Analytics?

 _EDIT:_

If your data processing needs are truly huge, you'll likely end up with
something like Hadoop (which I haven't had the opportunity/pleasure to need,
so I can only mention it in passing).

~~~
ichilton
Thanks - much to consider!

I can't use Google Analytics because the data needs to be tied to the user id
of the logged in user.

------
aonic
How much traffic are you really expecting on this project? Don't prematurely
optimize: most of the time a regular ol' MySQL table will do just fine, and
people worry about performance too early in the development process.

Stick it into MySQL; if it causes problems later, worry about it then (or
email me at the address in my profile).

~~~
ichilton
I'm probably looking at up to a few hundred req/sec.

Thanks for the advice.

------
stevencorona
Writing to MySQL on every page request is generally a bad idea unless you
aren't getting much traffic (in which case, it's the easiest solution).

The scalable way to build something like this would be to log your raw data
to a service like Scribe (open-sourced by Facebook) or rsyslogd, and then
aggregate it in a background process. The aggregated data can be stored in
MySQL or whatever datastore you prefer.

<http://scribeserver.svn.sourceforge.net/> <http://www.rsyslog.com/>

(Scribe is a bit more complicated to set up and hook into PHP, but is a much
better choice for a highly distributed/scalable system.)

~~~
petervandijck
<http://www.google.com/search?q=premature+optimization+is>

~~~
stevencorona
I agree completely. Most likely the OP doesn't need a solution this complex,
but that doesn't make the answer any less correct.

------
petervandijck
<http://www.google.com/analytics/> probably does 90% of what you need; add a
more user-focused analytics program (there are plenty) and you'll probably get
more than you need, executed better than you could yourself.

ps: the inability of many developers to go with a "good enough" solution, or
to question "requirements", always baffles me. What you think are requirements
are probably just rough guesses at what the client might want (I know; I've
been writing requirements for a living for 10 years). This sounds like
exactly such a case, although I may be wrong of course.

~~~
ichilton
The problem with Google Analytics is that I need the data to be tied to the
user_id of the logged in user.

------
chrismiller
If you do go the MySQL route be sure to check out INSERT DELAYED
<http://dev.mysql.com/doc/refman/5.5/en/insert-delayed.html>
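Related to this: whether or not you use INSERT DELAYED, batching many hits into one multi-row INSERT cuts the per-statement overhead. A Python sketch of the statement shape only; `bulk_insert_sql` is an illustrative helper, and real code should use parameterized queries rather than string building:

```python
# Sketch: collapse many buffered hits into a single multi-row INSERT
# instead of one statement per hit. String building is for illustration
# only; use your driver's parameter binding in production.
def bulk_insert_sql(table, rows):
    cols = sorted(rows[0])  # stable column order across rows
    values = ", ".join(
        "(" + ", ".join(repr(r[c]) for c in cols) + ")" for r in rows
    )
    return f"INSERT INTO {table} ({', '.join(cols)}) VALUES {values}"

sql = bulk_insert_sql("hits", [
    {"user_id": 1, "url": "/home"},
    {"user_id": 2, "url": "/about"},
])
```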

~~~
ichilton
Yes, I use this, but the problem is that under heavy traffic the delayed
inserts build up and still create a bottleneck.

------
ra
No problem with you asking this here, but you might find this question is
better suited to StackOverflow

~~~
ichilton
I tried that but didn't get any replies at all....

------
bemmu
How about putting log lines into memcached, then once a minute write them to
disk?

~~~
ichilton
This is something I considered, but I'm unsure of the best way to store the
entries so they are retrievable without knowing each key in advance.
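One common answer to the "which keys?" problem is to use a single known key holding a Redis list: every request does RPUSH onto it, and the drain job pulls everything with LRANGE and clears it, so nothing ever needs to enumerate per-hit keys. A Python sketch with a plain list standing in for Redis so it runs without a server; `log_hit`/`drain` are illustrative names:

```python
# Sketch of the single-key buffer pattern: all log lines go onto one
# Redis list (RPUSH "hits"); the periodic drain job reads and clears it
# (LRANGE + LTRIM in Redis), then bulk-inserts the batch into MySQL.
import json

hits_list = []  # stand-in for the Redis list stored at one key, e.g. "hits"

def log_hit(hit):
    hits_list.append(json.dumps(hit))  # RPUSH hits <json blob>

def drain():
    """Grab and clear the buffered hits (LRANGE 0 -1, then LTRIM/DEL)."""
    batch = [json.loads(line) for line in hits_list]
    hits_list.clear()
    return batch

log_hit({"user_id": 1, "url": "/home"})
log_hit({"user_id": 2, "url": "/about"})
batch = drain()
```

This doesn't map onto memcached well (it has no list type and no cheap way to enumerate keys), which is one reason Redis fits idea 3 better.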

