

Show HN: Native Analytics (work in progress) - latitude
http://swapped.cc/native-analytics

======
latitude
I've been mulling over the idea of writing my own analytics package for a
while now, and finally decided to sit down and do it. NA is a simple web
analytics package that runs on a target web server, feeds directly from the
server's web traffic and requires no tracking JavaScript code to be placed on
the site's pages.

I'm early in the process, so any ideas and feedback are welcome.

------
MikeW
Building analytics from access logs is massively unreliable. I have a few
domains that answer HTTP traffic but only serve "Welcome to nginx". When I
tail the access logs I see thousands of requests from obvious bots, but also
traffic that seems to be pretending to be other kinds of traffic.

There are some bots that pretend to be phones browsing around, IE, Firefox.

Just remember that relying purely on access log analytics isn't something that
will return accurate results.

~~~
latitude
I wouldn't call it _massively_ unreliable from what I've seen so far.

For one, it is offset by the fact that the site it is used on is Ajax'ed, so it
is fairly simple to filter out those who poll just the container page and
don't follow up by querying any actual content. For two, I'm getting good at
detecting bots by simply filtering on User-Agent strings and IPs. So it is far
from the utter disaster that you describe.
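
A minimal sketch of the two filters described above, assuming hits are already available as (ip, user_agent, path) tuples. The marker list, the blocked address and the `/ajax/` content prefix are hypothetical placeholders, not details of NA itself.

```python
# Hypothetical markers and addresses; a real list would come from observed traffic.
BOT_UA_MARKERS = ("bot", "crawler", "spider", "curl")
BLOCKED_IPS = {"203.0.113.7"}


def is_probable_bot(ip, user_agent):
    """Flag a hit by IP, by an empty User-Agent, or by a known marker in it."""
    ua = user_agent.lower()
    return ip in BLOCKED_IPS or not ua or any(m in ua for m in BOT_UA_MARKERS)


def human_visitors(hits, content_prefix="/ajax/"):
    """Return the (ip, user_agent) pairs that requested actual content.

    Clients that only fetched the container page never appear in the result,
    because only paths under `content_prefix` (an assumed URL layout) count.
    """
    fetched_content = set()
    for ip, ua, path in hits:
        if is_probable_bot(ip, ua):
            continue
        if path.startswith(content_prefix):
            fetched_content.add((ip, ua))
    return fetched_content
```

Keying visitors on the (ip, user_agent) pair is a simplification; several people behind one NAT with the same browser would be counted once.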

------
jasonkester
I'm curious as to why you chose to stick code into your web application to
explicitly insert log entries into a database rather than using the built-in
logs that the server provides.

After all, the web logs that Apache spits out match your schema almost
identically. They're perfectly parseable and have the advantage of being
parseable at intervals (possibly from a different server) rather than forcing
an insert from every page load.

By doing things this way, you essentially remove the ability to scale the
application that's being logged. You're doing at least one database insert per
page load, thus destroying any gains you might have seen from caching the
content you're displaying. And of course, you can't actually cache anyway,
since you need to actually run the PHP script for the page in question so that
it will fire the database insert to register a hit.

So yeah, I'd suggest dropping that step and just having a worker swing by
every once in a while, pull up the logfile, see what's new and bulk insert it
into your database. Preferably with said database and said worker both living
on another machine, so that they don't bog down your production box.
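
A rough sketch of such a worker, assuming a plain-text access log and using SQLite purely for illustration. The `raw_hits` table and both function names are made up here; the caller is expected to persist the returned offset between runs.

```python
import os
import sqlite3


def read_new_lines(log_path, offset):
    """Return (lines, new_offset) for everything appended since `offset`."""
    if os.path.getsize(log_path) < offset:
        offset = 0  # the file was rotated or truncated, so start over
    with open(log_path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Consume complete lines only; a partial last line waits for the next run.
    end = data.rfind(b"\n") + 1
    lines = [raw.decode("utf-8", errors="replace")
             for raw in data[:end].splitlines()]
    return lines, offset + end


def bulk_insert(conn, lines):
    """Insert all new lines in one transaction instead of one per page load."""
    conn.execute("CREATE TABLE IF NOT EXISTS raw_hits (line TEXT)")
    with conn:
        conn.executemany("INSERT INTO raw_hits (line) VALUES (?)",
                         [(line,) for line in lines])
```

Run from cron or a loop, this keeps the database work off the request path, so cached pages can still be served without touching PHP.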

~~~
rachelbythebay
Parsing Apache logs is not the answer. Keeping things in a sensible format is
the beginning of the answer.

Parsing a "line like this" is "\"harder than it looks\"".

~~~
gst
Parsing Apache logs isn't hard. There's a well-defined spec for them. Of
course you have to keep track of stuff like escaping, but that's pretty
trivial.
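
To make the escaping point concrete, here is a hedged sketch of a parser for the common "combined" log layout. It assumes the server writes embedded quotes as `\"` and backslashes as `\\` inside quoted fields, and that the user field contains no spaces; a custom LogFormat would need a different pattern.

```python
import re

# A quoted field is any run of non-quote, non-backslash characters
# or backslash-escaped characters, so an embedded \" does not end it.
QUOTED = r'"((?:[^"\\]|\\.)*)"'

COMBINED = re.compile(
    r'(\S+) (\S+) (\S+) '       # host, ident, user
    r'\[([^\]]+)\] '            # timestamp
    + QUOTED + r' '             # request line
    r'(\d{3}) (\d+|-) '         # status, response size
    + QUOTED + r' ' + QUOTED    # referer, user agent
)

FIELDS = ("host", "ident", "user", "time", "request",
          "status", "size", "referer", "agent")


def unescape(field):
    """Undo only \\" and \\\\; other escapes such as \\xhh are left as-is."""
    return re.sub(r'\\(["\\])', r'\1', field)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = COMBINED.match(line)
    if m is None:
        return None
    rec = dict(zip(FIELDS, m.groups()))
    for key in ("request", "referer", "agent"):
        rec[key] = unescape(rec[key])
    return rec
```

A naive split on `"` would cut the request and agent fields short whenever a client sends a quote character, which is one of the places people tend to trip.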

~~~
rachelbythebay
No, it actually is difficult. I put this question forth to a bunch of people,
and lots of them tripped over something.

I'm beginning to think it's a decent "weed-out" question.

------
dawie
I am really interested in your project. I am working on a service where I
think analytics will be a big benefit to my customers. I am struggling to
choose between rolling my own and using a 3rd party service. Both have pros
and cons. I think in the end rolling my own will be the way to go, since I
want to show users how many people visited their site (and how many people
contacted them) in a very simple way. Basically just show them the two numbers.

PS: I am in Calgary (go flames go) :-)

~~~
dawie
Are you aware of this project: <http://www.openwebanalytics.com/>

~~~
latitude
I've seen it, but I haven't used it.

------
ivan_ah
Other open source JS-based web analytics packages include: <http://piwik.org/>
(it is pretty much GA on your server).

Are there any frameworks that combine server-side logging (Apache),
application logging (Django app log) and pull in data from Google Analytics?

Any recommendations for an open source "dashboard" app?

------
houshuang
Another advantage of "native" analytics is to capture downloads of all file
types. Google Analytics does not know about your PDF downloads (unless there
is something I don't know about), and for me that is quite important.

I hope you figure out how to block referrer spam. I use both GA and awstats as
provided by my webhost, and the numbers are wildly different. The referrer
list at awstats is almost useless since I get hundreds of hits from
Viagra-selling sites etc., despite the fact that I don't publish the referrer
list publicly anywhere.
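
One crude way to approach this is a keyword blocklist on the referrer's hostname, sketched below. The keywords and the `own_host` default are hypothetical, and a blocklist like this will miss spam from innocuous-looking domains.

```python
from urllib.parse import urlsplit

# Hypothetical keywords; a real list would be grown from the spam actually seen.
SPAM_KEYWORDS = ("viagra", "casino")


def is_spam_referrer(referrer, own_host="example.com"):
    """Return True if the referrer's hostname contains a blocked keyword."""
    host = (urlsplit(referrer).hostname or "").lower()
    if not host or host == own_host:
        return False  # empty, "-" or internal referrers are not spam
    return any(keyword in host for keyword in SPAM_KEYWORDS)
```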

~~~
ivan_ah
For anything other than page loads, you need "Events":

<http://code.google.com/apis/analytics/docs/tracking/eventTrackerGuide.html>

for example:

    
    
       <a onClick="_gaq.push(['_trackEvent', 'Vids', 'Play', 'Title']);">Play</a>

~~~
crdoconnor
This doesn't track when you right click and click download or open in new tab.

~~~
ivan_ah
Interesting. Do you know of a workaround for "open in new tab"? Would "save
link as" be counted?

I guess the server-side logs would come in handy in that case.

~~~
houshuang
Also doesn't track when people go directly to download your files, for example
finding PDFs through Google Scholar.
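
Server-side, these downloads are ordinary requests, so they can be counted straight from the log. A small sketch, assuming hits have already been reduced to (request_target, status) pairs; the extension list is just an example.

```python
import posixpath
from collections import Counter
from urllib.parse import urlsplit


def count_downloads(hits, extensions=(".pdf", ".zip")):
    """Count successful requests per file path for the given extensions."""
    counts = Counter()
    for target, status in hits:
        path = urlsplit(target).path  # drop any query string
        ext = posixpath.splitext(path)[1].lower()
        if status == 200 and ext in extensions:
            counts[path] += 1
    return counts
```

This counts right-click downloads, new-tab opens and direct hits from search results alike, since none of them depend on a script running in the browser.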

------
suneilp
What are the disadvantages of 3rd party services? Are you really giving away
secrets to 3rd party services?

Also, what's the realistic percentage of users who disable JavaScript
nowadays? Does AdBlock really block JavaScript? Does it really impact 3rd
party services that use JavaScript to record the visit, like say Google
Analytics?

~~~
huhtenberg
AdBlock subscriptions certainly come with comprehensive lists for blocking
analytics snippets. Even very obscure ones.

------
powertower
How is this different from the server-side analytics packages such as AWStats,
Webalizer, Open Web Analytics, Piwik?

~~~
latitude
tl;dr - reinventing the wheel is the only way to build a better wheel :)

Seriously though - I have used some of these in the past (settling however on
Mint, which is not open-source in the common sense), but none were perfect. I am a
programmer, not a sysadmin, and it is more interesting for me to write
something than to mess with package dependencies, default installation paths
and what not. Digging through someone else's code and trying to bend it to my
needs is not the best pastime either. On the other hand, writing analytics
from scratch gives a chance to build _exactly_ what I need, especially when it
comes to the reporting function and the UI. In fact, in the UI/UX department I
am pretty sure I can do better than existing open-source packages.

------
Nikkau
One big problem for me with server-side analytics: you can't use a CDN (like
CloudFlare).

Or nice caching.

Your setup needs requests to hit PHP, but on most setups most requests only
hit Varnish.

~~~
no_more_death
Yes. Right now we are looking at installing an affiliate plugin on a Magento
site with Varnish caching. Instead of inventing a new Magento/Varnish plugin,
we are planning to use a JavaScript-driven solution.

------
halayli
> Javascript dependency - not relying on the browser-side code to capture a
> page visit means that the NA can correctly account for clients using
> NoScript and AdBlock.

> Privacy - capturing the analytics data directly on the web server means not
> sharing it with other parties. For some people it is not an issue, but for
> others, me included, it is.

The requirements are very artificial IMO

~~~
latitude
To each his own, but do elaborate. I'm interested to hear why you think they
are very artificial.

------
DanielRibeiro
Looks cool. But what is the advantage of this over Mixpanel?

~~~
latitude
The main advantage is the lack of dependency on a 3rd party service.

