
Show HN: Serverless Analytics Built from Scratch - keydunov
https://statsbotco.github.io/cubejs-client/aws-web-analytics/
======
lucb1e
"serverless" is really the misnomer of the year.

~~~
mikejulietbravo
"Still on a server, but not your problem" just doesn't have the same ring to
it though

~~~
buster
"/cgi-bin/ with javascript instead of perl" doesn't make me want to buy it,
either.

------
gingerlime
Interesting and nicely presented!

I built a prototype of something very similar, but using Google BigQuery to
store and extract data[0] but never took it beyond the concept phase. I’m
still using and actively maintain an open source lambda-based A/B testing
severless framework however with similar (but simpler) architecture[1]

[0] [https://blog.gingerlime.com/2016/a-scalable-analytics-backen...](https://blog.gingerlime.com/2016/a-scalable-analytics-backend-with-google-bigquery-aws-lambda-and-kinesis/)

[1] [https://github.com/Alephbet/gimel](https://github.com/Alephbet/gimel)

------
rmccue
The blog post about it is probably a better link for HN:
[https://statsbot.co/blog/building-open-source-google-analyti...](https://statsbot.co/blog/building-open-source-google-analytics-from-scratch/)

------
soared
I mean, as a POC it's not bad, but Google Analytics is not the same as analyzing
server logs (contrary to what most people would suggest). Most of the value of
GA comes from session- and user-level metrics, which are 1000x more difficult
to implement than showing pageviews. Unless you are planning on building a
device graph that rivals Google's, you can't clone GA.
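To make the gap concrete: even a basic session count requires stateful grouping of hits, not just counting log lines. A toy sketch, assuming GA's default 30-minute inactivity timeout (the hit shape here is invented for illustration):

```javascript
// Toy sessionization (not GA's actual rules): group one user's hits into
// sessions, starting a new session after `gapMs` of inactivity.
function sessionize(hits, gapMs = 30 * 60 * 1000) {
  const sorted = [...hits].sort((a, b) => a.ts - b.ts);
  const sessions = [];
  for (const hit of sorted) {
    const current = sessions[sessions.length - 1];
    if (!current || hit.ts - current[current.length - 1].ts > gapMs) {
      sessions.push([hit]); // inactivity gap exceeded: new session
    } else {
      current.push(hit); // still within the same session
    }
  }
  return sessions;
}
```

And that is before attribution, cross-device stitching, or bot filtering enter the picture.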

~~~
asien
> google analytics is not the same as analyzing server logs

This is what most people don’t get about GA.

Google Analytics does the heavy lifting by removing incoherent, corrupted, or
malicious data.

Let's say I use Puppeteer: I can scrape this page a million times with
completely wrong headers like "Netscape 8.1". GA filters out this type of
malicious attempt. It will probably look at my IP address, figure out that the
traffic is actually coming from only one IP, and decide that "Netscape" is too
rare to be considered an actual browser, so it would probably ignore it.

All the other "free Google Analytics alternatives" that exist today don't have
this type of mechanism to prevent data corruption.

In general they just get an HTTP request and acknowledge it as a legitimate
visit.

Logging an HTTP request from a browser is not even a tenth of the work GA does
under the hood.
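A toy illustration of the kind of heuristic described above (this is not GA's actual algorithm; the threshold and hit shape are invented):

```javascript
// Drop hits whose user agent is too rare, relative to all traffic, to be a
// plausible real browser. A real system would combine many such signals
// (IP concentration, header coherence, timing) rather than one frequency cut.
function filterRareUserAgents(hits, minShare = 0.01) {
  const counts = {};
  for (const h of hits) {
    counts[h.userAgent] = (counts[h.userAgent] || 0) + 1;
  }
  return hits.filter((h) => counts[h.userAgent] / hits.length >= minShare);
}
```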

~~~
mayank
> Google Analytics does the heavy lift by removing incoherent , corrupted or
> malicious data insertion.

Unless it's referrer spam...that somehow still sticks around (at least last
time I checked, which was several months ago).

~~~
soared
If you hire an expert to set up your GA, referrer spam doesn't get through;
it's super easy to filter out beforehand.

------
code4tee
Build serverless app to track web stats. Get it featured on Hacker News and
use the flood of traffic to demo what was done. Very meta. Nice job.

------
cheriot
I'd be curious to see a cost estimate for some traffic level. I wonder if
there's a way to put the pixel in s3 and process the access logs more cheaply.

~~~
teej
I’ve seen folks put their pixel endpoint behind Fastly and process the access
log delivered in S3. A Fastly VCL can handle the same transform that this
Lambda is doing.

~~~
InGodsName
Is fastly free? Why would they use fastly and not s3?

~~~
teej
S3 access logs alone are not sufficient to replicate this pipeline. This pixel
is stateful (for the anonymous user ID) and S3 access logs don’t include
arbitrary headers, in this case the cookie with the user id. Fastly would let
you eliminate API gateway, Lambda, and both Kinesis steps.

API Gateway by itself is $3.50/million requests, which is 2-4x more expensive
than Fastly at $0.75-$1.60/million.

------
westoque
We should really stop using the word "serverless".

I would rather call them instead "zero config servers".

------
jimmychangas
I think you can use API Gateway as a proxy for Kinesis, removing the need for
Lambda.

~~~
k__
Seems to be:

[https://docs.aws.amazon.com/apigateway/latest/developerguide...](https://docs.aws.amazon.com/apigateway/latest/developerguide/integrating-api-with-aws-services-kinesis.html)
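Per those docs, the integration is driven by a request mapping template that turns the incoming request into a Kinesis `PutRecord` call. A rough sketch of what such a template could look like (the stream name and partition-key choice here are assumptions):

```json
{
  "StreamName": "web-analytics-events",
  "Data": "$util.base64Encode($input.body)",
  "PartitionKey": "$context.requestId"
}
```

The trade-off is that any per-request logic (like the cookie handling) then has to live somewhere else.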

------
teej
This sort of thing works until you have one person run a security scan on your
site, corrupting your user agents and event types.

------
manigandham
Side note: If you want to build your own mid-size event analytics data
pipeline, then I recommend looking at snowplow:
[https://github.com/snowplow/snowplow](https://github.com/snowplow/snowplow)

------
jimktrains2
Interesting. I once built a GA clone using App Engine, Cloud Dataflow, and
BigQuery. I guess that would count as serverless? Benchmarked it against the
official dumps to BigQuery too, and it was pretty spot on for every metric we
could look up!

~~~
InGodsName
BigQuery takes a minimum of 2-3 seconds for every query.

Google Analytics is much faster, responding in a few hundred milliseconds.

What did you use Dataflow for? How did you get data from the endpoints and
insert it into BigQuery? Using streaming inserts?

~~~
cosmie
> Google Analytics is much faster, responding in a few hundred milliseconds.

Are you referring to their reporting API, or their collection endpoint? The
collection endpoint is certainly fast to respond, but the actual reporting API
can be quite slow depending on what you're trying to get from it.

> What did you use dataflow for? How did you get data from end points and
> insert them into bigquery? Using streaming inserts?

I'm not the parent, but I've created setups like what was mentioned. It sounds
like they hosted the collection endpoint on AppEngine, then used DataFlow for
streaming the data into BigQuery. Potentially using a Pub/Sub topic to queue
up for DataFlow, since that has native integrations with DataFlow and even has
a template available to support it[1].

[1]
[https://cloud.google.com/dataflow/docs/guides/templates/prov...](https://cloud.google.com/dataflow/docs/guides/templates/provided-templates#cloudpubsubtobigquery)

------
sbussard
Endless loading... I think there's a bug

------
tyingq
Genuine question. What does this do that just inserting vanilla GA code in the
page doesn't? Trying to understand the "why".

~~~
teej
Some people don’t want to put a GA tag on their site because of concerns
around how Google uses the data. Also you can’t arbitrarily query GA data so
this gives you that capability.

------
graphememes
is PHP serverless :thinking:

------
InGodsName
Please explain what Cube.js is doing in this. I mean, what exactly does
Cube.js do?

~~~
pavel_tiunov
Thanks for the question! We should do a better job describing this. In short,
Cube.js:

1. Generates analytic SQL queries based on the Cube.js schema. These can be
simple, like calculating page views, or more advanced, like calculating
session metrics, attribution models, or funnels.

2. Caches SQL responses so as not to overwhelm the SQL backend with user
requests.

3. Pre-aggregates data to be able to query trillions of data points in a
matter of seconds.

4. Orchestrates SQL query execution: organizes dependencies between
pre-aggregations, queue priorities, and cache refreshes.

5. Provides a REST analytics API for end users.
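For readers unfamiliar with Cube.js, a minimal schema for a page-view cube might look like the sketch below; the table and column names are invented for illustration, not the project's actual schema:

```javascript
// Hypothetical Cube.js schema: one cube over the raw events table in Athena.
cube(`PageViews`, {
  sql: `SELECT * FROM aws_web_analytics.events`,

  measures: {
    count: {
      type: `count`, // total page views
    },
  },

  dimensions: {
    pagePath: {
      sql: `page_path`,
      type: `string`,
    },
    createdAt: {
      sql: `event_timestamp`,
      type: `time`,
    },
  },
});
```

Cube.js then compiles API queries against this schema into the aggregate SQL it runs on the backend.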

~~~
InGodsName
Why do you need to select all rows in your Cube.js section when you can
directly run a query in Athena and get back the aggregates you need?

Basically, you select all rows and then Cube.js does something with those
rows, when you could in fact directly run queries in Athena.

Am I missing something?

~~~
pavel_tiunov
It actually works exactly as you describe. We generate a SQL query to return
aggregates based on the SQL supplied in the Cube.js schema. We never fetch raw
data from the SQL backend. The architecture overview can probably help explain:
[https://github.com/statsbotco/cubejs-client#architecture](https://github.com/statsbotco/cubejs-client#architecture)

