Hacker News
Why web log analyzers are better than JavaScript based analytics (datalandsoftware.com)
33 points by mindaugas on July 6, 2009 | hide | past | favorite | 30 comments

Note that Data Land Software (host of blog) sells an "interactive web log analyzer"

Uh, the author greatly underestimates the headache of filtering out bot traffic. It's bad enough that some of the fancier comment spam bots load javascript, but going through the server logs would be nuts. The "Contact Us" form would show as the most popular page, since it's constantly being assaulted by automated bot-net based attacks.

True, bots can really be a PITA. Still, with log analysis you can remove (at least some of) them, but with JS you can't add them. The same applies to non-page files.

Removing them from the logs sounds difficult. And adding them to JS-based stats doesn't sound very useful. I don't particularly care whether Google indexes my site at 10am or 11am so long as it gets indexed. And I certainly don't care about the comment spam botnets.

Of course the hour of a bot's visit is not important, but it could be important to see whether it comes every day or not. Not to mention hits to images and other files.
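For what it's worth, stripping the obvious crawlers out of a combined-format log isn't that hard; the real headache is the bots that lie about their user agent. A minimal sketch of the easy part (the bot substrings and the sample log lines here are assumptions, not an exhaustive list):

```python
import re

# Combined Log Format ends with: "referer" "user-agent"
UA_FIELD = re.compile(r'"([^"]*)"\s*$')  # last quoted field = user agent

BOT_HINTS = ("bot", "spider", "crawler", "slurp")  # assumed tell-tale substrings

def human_lines(log_lines):
    """Yield log lines whose user agent doesn't look like a known crawler."""
    for line in log_lines:
        m = UA_FIELD.search(line)
        ua = (m.group(1) if m else "").lower()
        if not any(hint in ua for hint in BOT_HINTS):
            yield line

log = [
    '1.2.3.4 - - [06/Jul/2009] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '5.6.7.8 - - [06/Jul/2009] "GET / HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
]
print(len(list(human_lines(log))))  # 1
```

Anything that fakes a browser user agent slips straight through this, which is exactly the grandparent's point about comment spam bots.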

Google's Webmaster Tools show how many pages the Googlebot grabs daily and how long requests take, so you can monitor that and rest easy at night.

I don't think anyone suggests (for serious websites) that log files should be abandoned or ignored. On the contrary, script based analytics - which do indeed offer significant value that's ignored in this article - should be considered supplemental or complementary to more traditional methods.

JS based:

  * Can record Java version, Flash version, other plugin info
  * Can record screen size, browser window size, color space
  * Can detect and record ad block presence

Both have their uses.

Enough sites use Google Analytics wrongly (by including it at the start of the page rather than at the end of the page) that I have used AdBlock+ to block it.

Edit, since someone has downvoted this comment without offering a contrary opinion:

My comment is an argument against trusting figures on ad-blocking obtained using externally hosted JavaScript. If you host the analytics JavaScript yourself, no one will be motivated to block you. And the reliability of statistics such as colour depth and browser window size is probably unaffected.

It is not "wrong" to put Google Analytics at the start of the page - in fact, Google _requires_ it for some GA features to work (like Event Tracking).

I use Adblock Plus to block Google Analytics, too. I do it because Google Analytics freezes scrolling on my old version of Firefox on my slow computer.

You can do those with a hybrid system. The JS loads a special URL with those values as URL parameters, and the log analyzer knows to look for that URL.
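The log-analyzer side of such a hybrid is just query-string parsing. A sketch of what that might look like (the `/beacon.gif` path and the parameter names are invented for illustration):

```python
from urllib.parse import urlparse, parse_qs

def beacon_params(request_path, beacon="/beacon.gif"):
    """Extract JS-collected values from a hybrid tracking URL, or None."""
    parsed = urlparse(request_path)
    if parsed.path != beacon:
        return None  # ordinary hit, not a tracking beacon
    # parse_qs returns lists of values; flatten to single values
    return {k: v[0] for k, v in parse_qs(parsed.query).items()}

print(beacon_params("/beacon.gif?sw=1280&sh=800&flash=10"))
# {'sw': '1280', 'sh': '800', 'flash': '10'}
```

The JS on the page would request that URL with the screen size, plugin versions, etc. filled in, and the analyzer records them alongside the normal log fields.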

6. Bots (spiders) are excluded from JavaScript based analytics

To me that is actually a benefit of JS based analytics programs. When I check Google Analytics in the morning I don't want to see how many search engine bots and scrapers hit my site the previous day. I want to know how many actual human beings used my site instead.

Also, and this is probably obvious since it's been pointed out that these people have a vested interest in log parsers, this article would better be titled as "10 Reasons Why Web Log Analyzers Should Be Used WITH JavaScript Based Analytics." I would argue most people serious about tracking traffic use both anyway but those that don't should see the benefits.

I'm the author of this article - thank you all for commenting. I don't intend to start a flame war - both methods have pros and cons, but in this "GA craziness" people tend to forget that log analysis even exists. Hence the article. :)

And yes, Dataland Software "sells" an interactive web log analyzer, but I can't really see how that's important?

Full disclosure. It's the same reason that, say, foreign policy columns written by someone in the current administration will say "So-and-so is Deputy Director of Such-and-such."

Asking the provider of a service to show reasons why their product is 'better' is going to give you a fairly biased story; it is good to declare such bias up front.

Personally I found most of their 'reasons' to be fairly contrived.

I think it would be possible to come up with a much more balanced point of view than the one given in the article when comparing tag based analysis and log based analysis.

Both have their uses, some of the reasons given hold water but most of them are pretty thin.

You could make an equally unbalanced list of 10 reasons why tag based analysis is 'better'. The important thing to notice is that tag based analysis is very convenient and can give you a bunch of information that would be fairly hard for a log file based analyzer to provide.

Log file analysis has its place though, especially when you need to dig in to locate a problem. That's when a log file analyzer is next to useless though; you are basically going to go hunting through the raw logs in order to find your evidence.

The article smacks of a business giving me reasons to buy their product in the light of free competition. (The other alternative to paid log analyzers, besides tag based analysis, is of course a FOSS implementation of such a log analyzer.)

From the first paragraph of the article: "Depending on your preferences and type of the website, you might find some or all of these arguments applicable or not. In any case, everyone should be at least aware of differences in order to make a right decision."

Sorry, I'm not in a mood for a flame war...

That's substantially different from declaring up front what your biases are. No part of that sentence indicates to me that you stand to gain from convincing me your points are valid.

Oh, come on... This article is on the company's website...

One of the great things about javascript based analytics is that the cached version of your page is just as good as someone grabbing it directly. You can set long cache times on all of your pages without worrying about people viewing your site without you knowing. This more than counteracts the handful of people who have javascript turned off.

This is also particularly important for sites on hosts like Heroku, which put an HTML cache in front of your site. If you serve pages that are cached, JavaScript logging is your only option.

Doesn't your cache have the ability to write log data?

Not Heroku's, and certainly not a cache in a client's browser or their network squid cache.

The reasons 1 by 1:

1) you don't need to edit HTML code to include scripts

The author asserts that you'd have to do this by hand if you had a lot of static HTML. This is incorrect (you could easily insert the code using some script), but it also doesn't matter much: most larger sites (if not all these days) are dynamically constructed, and adding a bit of .js is as easy as changing a footer.
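The "some script" for the static-HTML case really is trivial. A minimal sketch, assuming the tag belongs just before `</body>` (the snippet text and the `site/` docroot are placeholders, not anyone's real setup):

```python
import pathlib

SNIPPET = "<script src='/analytics.js'></script>"  # placeholder tracking tag

def tag_file(path):
    """Insert the snippet before </body> in one static HTML file."""
    html = path.read_text()
    if SNIPPET in html or "</body>" not in html:
        return False  # already tagged, or nothing to anchor on
    path.write_text(html.replace("</body>", SNIPPET + "\n</body>", 1))
    return True

docroot = pathlib.Path("site")  # hypothetical static docroot
if docroot.is_dir():
    for page in docroot.rglob("*.html"):
        tag_file(page)
```

The idempotency check matters: you want to be able to re-run it after adding new pages without double-tagging the old ones.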

2) scripts take additional time to load

This is true, but it only matters if you place your little bit of javascript in the wrong place on the page (say in the header). When positioned correctly it does not need to take more time to make the connection.

3) 'if website exists, log files exist too' (...)

This is really not always the case. Plenty of very high volume sites rely almost entirely on 3rd party analysis simply because storing and processing the logs becomes a major operation by itself.

4) 'server log files contain hits to all files, not just pages'

That's true, but for almost every practical purpose that I can think of, that is a very good reason to use a tag based analysis tool rather than to go through your logs. The embedding argument the author makes is fairly easily taken care of by some cookie magic and/or a referrer check.
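The referrer check amounts to comparing the Referer host against your own hostnames when a hit lands on an image or other embeddable file. A minimal sketch (the `example.com` hosts are assumptions):

```python
from urllib.parse import urlparse

MY_HOSTS = {"example.com", "www.example.com"}  # assumed site hostnames

def is_hotlinked(referer):
    """True if a hit on an image/file came from a page we don't host."""
    if not referer or referer == "-":
        return False  # no referer: direct hit, privacy settings, etc.
    host = urlparse(referer).hostname or ""
    return host not in MY_HOSTS

print(is_hotlinked("http://other-site.com/page.html"))  # True
print(is_hotlinked("http://example.com/gallery.html"))  # False
```

Treating a missing referer as "not hotlinked" is a judgment call; being strict there would also block legitimate privacy-conscious visitors.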

5) you can investigate and control bandwidth usage

Bot detection and blocking is a reason to spool your log files to a ramdisk and analyze them in real time; doing it the next day is totally pointless. Interactive log analysis (such as the product this company sells) can help there, but a simple 50-line script will do the same thing just as well and can run in the background instead of requiring 'interaction'.
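The core of that 50-line script is just tallying response bytes per client IP and flagging the heavy hitters. A sketch against combined-format log lines (the sample lines are made up):

```python
from collections import Counter

def bytes_by_ip(log_lines):
    """Tally response bytes per client IP from combined-format log lines."""
    usage = Counter()
    for line in log_lines:
        try:
            ip = line.split()[0]
            # the field after the quoted request is 'status bytes'
            size = line.split('"')[2].split()[1]
            usage[ip] += 0 if size == "-" else int(size)
        except (IndexError, ValueError):
            continue  # skip malformed lines
    return usage

log = [
    '1.2.3.4 - - [t] "GET /a HTTP/1.1" 200 1000 "-" "UA"',
    '1.2.3.4 - - [t] "GET /b HTTP/1.1" 200 500 "-" "UA"',
    '5.6.7.8 - - [t] "GET /a HTTP/1.1" 200 100 "-" "UA"',
]
print(bytes_by_ip(log).most_common(1))  # [('1.2.3.4', 1500)]
```

Pointed at a tail of the live log on a ramdisk, the same loop can feed a threshold check that drops a firewall rule, no interaction needed.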

6) see 5

7) log files record all traffic, even if javascript is disabled

yes, but trust me on this one, almost everybody has javascript enabled these days because more and more of the web stops working if you don't have it. The biggest source of missing traffic is not people that have javascript turned off but bots.

8) you can find out about hacker attacks

True, but your sysadmin probably has a whole bunch of tools looking at the regular logs already to monitor this. Basically, when all the 'regular' traffic is discarded from your logs, the remainder is bots and bad guys. A real attack (such as a DDoS) is actually going to work much better if you are writing log files, because you're going to be writing all that totally useless logging information to disk. Also, in my book a 'hacker' is going to go after ports other than port 80.

9) log files contain error information

This is very true and should not be taken lightly: your server should log errors, and you should poll those error logs periodically to make sure they're blank (or nearly so), in case you've got a problem on your site.
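That periodic poll can be a few lines in a cron job: read the error log, discard known noise, and alert if anything is left. A sketch (the noise list and log format are assumptions about a typical setup):

```python
import pathlib

NOISE = ("favicon.ico",)  # assumed ignorable error sources

def error_summary(error_log):
    """Return the non-noise lines from a server error log."""
    path = pathlib.Path(error_log)
    if not path.exists():
        return []  # no error log at all counts as clean
    return [
        line for line in path.read_text().splitlines()
        if line.strip() and not any(n in line for n in NOISE)
    ]
```

Wire the non-empty result into an email or pager and the "poll periodically" part takes care of itself.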

10) by using (a) log file analyzer, you don't give away your business data

Well, you're not exactly giving away your business data, but the point is well taken. For most sites, however, the trade-off between access to fairly detailed site statistics in real time for $0 and the 'giving away of business data' is clearly in favor of giving away that data.

Google and plenty of others of course have their own agenda on what they do with 'your' data, but as long as they don't get too evil with it it looks like the number of sites that analyse via tags is going to continue to expand.

RE #10: It actually may be quite beneficial to give away your traffic data if your site doesn't have a lot of inbound links. PageRank is only one small component of your ultimate placement in Google results; high traffic sites are obviously also ranked better.

Is that true? It sounds self-serving; after all, if high traffic sites are ranked higher they become even higher traffic sites, and so on...

I don't have any inside knowledge, but if you were Google, wouldn't you at least factor it into the equation?

We use Google Analytics. I noticed that Googlebot, but not other bots, has increased crawl frequency steadily with our recent traffic spike, but the number of visitors from search engines is still quite low at this point. This correlation hints that Google may be using traffic data to prioritize crawl rates, and it would seem a logical extension to prioritize search results.

7) log files record all traffic, even if javascript is disabled

You can have a tracking pixel in noscript tags as backup.

Log analysis is a major PITA, especially if you're operating a farm of web servers like we do. We use an epic shit ton of realtime stats (Woopra, Mint, GA) so we have most needs covered and have a real time view into what's going on.

We do rotate our logs up to S3, but haven't done anything with them thus far.

IIRC, a while ago there used to be an analysis system that you'd place in the appropriate location in your network, and it would sniff packets and piece together its own log files. I don't recall what the advantages were supposed to be... perhaps that you could get some information on speed/latency.

The advantage would be zero overhead on the machine serving up the data.
