
Giving meaning to 100B analytics events a day with Kafka, Dataflow and BigQuery - benjamindavy
https://medium.com/teads-engineering/give-meaning-to-100-billion-analytics-events-a-day-d6ba09aa8f44
======
jwilliams
I've been working with all these tools for a while now (and a bunch more along
the way).

It's not the orthodox cloud-thinking, but you're often best off processing at
the point of creation (or ingestion). Normalize the data as much as you can,
(probably) compress[1], and send as close to the target as possible.

If you grab the data, send elsewhere, transform <repeat>... it all gets slow
and expensive pretty quickly. Also a headache to manage failures.

This is especially true if you're using BigQuery. Stage your near-raw data
into BigQuery and then use its muscle as much as you can. A classic example
here might be de-duplicating data. A painful prospect for many distributed
systems, but pretty easy on the BigQuery side.
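
That de-duplication really is a one-liner on the query side. A minimal sketch of the pattern, with sqlite3 standing in for BigQuery and the table/column names invented:

```python
import sqlite3

# In BigQuery you'd typically dedupe with GROUP BY or
# ROW_NUMBER() OVER (PARTITION BY ...); sqlite3 stands in here
# so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT, ts INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("a", 1, "x"), ("a", 1, "x"),  # duplicate delivery of the same event
     ("b", 2, "y")],
)

# Keep one copy of each event_id (the earliest timestamp).
deduped = conn.execute(
    "SELECT event_id, MIN(ts) FROM events GROUP BY event_id ORDER BY event_id"
).fetchall()
print(deduped)  # [('a', 1), ('b', 2)]
```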

This is all especially true for time-series data. With BigQuery time
partitions you can keep the queries fast (and the costs reasonable).

It also limits the range of technologies and languages you need to wrangle.

1: Choosing your data format and compression approach can make a _huge_
difference.
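
To illustrate the footnote: a toy comparison of the same records as newline-delimited JSON versus a compact delimited encoding, with and without gzip (field names invented; exact numbers will vary):

```python
import gzip
import json

# The same 1000 events, encoded two ways (field names invented for the sketch).
events = [{"user_id": i, "event": "impression", "ts": 1520000000 + i}
          for i in range(1000)]

as_json = "\n".join(json.dumps(e) for e in events).encode()  # keys repeated per row
as_csv = "\n".join(f"{e['user_id']},impression,{e['ts']}" for e in events).encode()

sizes = {
    "json_raw": len(as_json),
    "json_gz": len(gzip.compress(as_json)),
    "csv_raw": len(as_csv),
    "csv_gz": len(gzip.compress(as_csv)),
}
print(sizes)  # both the compact encoding and compression cut the size substantially
```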

------
lima
I built a similar pipeline using Kafka and ClickHouse - it's amazing how easy
it is nowadays to ingest and analyze billions of events a day using standard
tools.

ClickHouse can even ingest directly from Kafka (courtesy of Cloudflare;
[http://github.com/vavrusa](http://github.com/vavrusa) contributed it).

~~~
SiempreViernes
Do you learn anything useful about humanity tho?

~~~
pcarolan
Fair comment. The title would imply there's a 'so what?' in the article, not
just a 'how'.

------
ryanworl
For anyone who wants to move this much data out of AWS and into another cloud
provider, Kinesis Streams does not charge for bandwidth out. If you do the
math, it works out to a very large savings over doing public internet data
transfers if you are moving terabytes per day. The compute costs to copy data
from Kafka to Kinesis should be minimal since you’re essentially just
operating a pipe and not doing much actual compute, and this can operate
easily on spot instances.
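
In case it helps, the copy pipe really is mostly batching logic. Kinesis's PutRecords call accepts at most 500 records and 5 MB per request, so the Kafka-to-Kinesis hop reduces to something like the sketch below (the real pipe would wrap a Kafka consumer and boto3's put_records around it):

```python
# PutRecords limits: at most 500 records and 5 MB per call (1 MB per record).
MAX_RECORDS = 500
MAX_BYTES = 5 * 1024 * 1024

def batch_for_kinesis(records):
    """Group raw byte records into batches that fit one PutRecords call."""
    batch, batch_bytes = [], 0
    for rec in records:
        if batch and (len(batch) >= MAX_RECORDS or batch_bytes + len(rec) > MAX_BYTES):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(rec)
        batch_bytes += len(rec)
    if batch:
        yield batch

# 1200 small records split into calls of 500, 500 and 200.
batches = list(batch_for_kinesis([b"x" * 100] * 1200))
print([len(b) for b in batches])  # [500, 500, 200]
```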

~~~
nh2
Can you elaborate on that? I haven't been able to conclude that from what it
says on the pricing page [1]:

> Data transfer is free. AWS does not charge for data transfer from your data
> producers to Amazon Kinesis Data Streams, or from Amazon Kinesis Data
> Streams to your Amazon Kinesis Applications.

That sounds like it's only free to Amazon Kinesis Applications (== inside
AWS).

And on [2] it says:

> If you use Amazon EC2 for running your Amazon Kinesis Applications, you will
> be charged for Amazon EC2 resources in addition to Amazon Kinesis Data
> Streams costs.

So that sounds like you will eventually pay the normal egress cost of EC2.

[1]: [https://aws.amazon.com/kinesis/data-streams/pricing/](https://aws.amazon.com/kinesis/data-streams/pricing/)

[2]: [https://aws.amazon.com/kinesis/data-streams/faqs/](https://aws.amazon.com/kinesis/data-streams/faqs/)

~~~
ryanworl
“If you use Amazon EC2 for running your Amazon Kinesis Applications” is the
key there. You don’t have to use Amazon EC2.

Amazon Kinesis Application does not imply EC2. They use on-prem examples in a
few Re:Invent presentations. It is just an HTTP API like most everything else
within Amazon. You will see other services spell out their bandwidth out
charges much more clearly.

------
maxnevermind
"In digital advertising, ..." Stopped reading right there. Maybe a great
article though. I started working on that thing called BigData not that long
time ago but now realized that 50% of the job is advertisement, not fan of it
at all, I want to like the end product or at least be neutral about it.

~~~
GrandNewbien
Could you elaborate? There's plenty of big data tasks and roles that aren't
even closely related to marketing.

~~~
beepbeepbeep1
Not my experience. I'd say the majority of big data is unethical in both
capture and use, trying to sell you something or sell someone else something
about you; only a small amount is for the general good.

I wasn't involved in the project, but a major telecoms company I was at was
prototyping scanning for mobile devices on their provided home routers. If
they didn't recognise a mobile as being on their network, they would then
target you with mobile deals. Devs saw the issues with this: essentially
scanning your home network from their routers and sending the data back to be
processed. Managers found Apple's MAC address rotation annoying and missed the
point of why Apple does it.

Unfortunately, short of quitting, the devs on the project don't have a say, as
they don't have power over the final decisions at these companies. It's easy
to say you would quit, but it's harder when you have a mortgage, a wife and
kids, for example.

~~~
manigandham
The vast majority of big data is used today to provide everything you depend
on, like email, healthcare, banking, transportation, government services, etc.

------
davidbrent
Very interesting read, but as more of an analyst, I kept waiting for the
‘meaning.’ Maybe I missed it, but for anyone else who may be considering
reading this, it is more about ‘giving business structure’ to analytic events.

Good article nonetheless.

------
throwaway66666
You'd be surprised how good modern hardware is. We are being hit with between
1.5 and 2 million requests per minute (steady traffic, no spikes except
increased usage on the weekends), and our analytics solution runs on just 2
main servers and costs $3k per month (total associated costs except human
labour, which is 1 engineer working on it part time now).

We talked with a famous analytics company and they gave us a quote of 1
million yearly to work with us. So how did we get it down from 1m to 35k?

We pretty much do the thing the article suggests (roll-ups section). We
compute data hourly. Alongside the hourly computation we also dump some extra
data that can be used to compute numbers for the day (e.g. daily unique user
ids from the hourly unique user ids), then from the day we get per week and
then per month. We also have fewer moving pieces (our stack is way more
traditional), and we manage our own hardware (key for keeping costs down).
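
The hourly-to-daily step is just a union over the per-hour unique-id sets (a HyperLogLog sketch does the same job approximately, with far less memory). A minimal sketch with invented data:

```python
# Per-hour unique user ids, dumped alongside the hourly rollups.
hourly_uniques = {
    0: {"u1", "u2"},
    1: {"u2", "u3"},
    2: {"u3", "u4", "u5"},
}

# Daily uniques fall out of the union; weeks and months roll up the same way.
daily_uniques = set().union(*hourly_uniques.values())
print(len(daily_uniques))  # 5, not 2 + 2 + 3: overlaps are handled for free
```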

When you get data back from the system you only hit the pre-computed cache;
no query touches the main system from the dashboards. We only allow queries
running in a 30 minute window to run on the live system, to ensure that no
crazy load builds up on top of it, and we use that mostly to catch anomalies
in the real time data. (Our parsing time is good too: between 10 seconds and
1 minute, compared to the 2-30 minutes the article gives.)

However, this is the "you are alive but you're not living" angst of analytics.
All the data is there, but you cannot freely prod it for answers and patterns.
If you want an answer about past data, you need to go through an overly
complex process of spinning up a new cluster and ingesting old backups,
multiple times, then waiting for a few days. It gets relatively expensive and
slow, and at times it will demoralize you and make you back off from getting
the answers you need. You could try keeping a smaller cluster that only gets a
percentage of the data (e.g. only 2%) for finding trends, drawing heatmaps
etc., and that one can run in realtime, but your CEO will say that's a stupid
idea to your face, and it's realtime-all or nothing.

You might say that's a situation you can live with given the absolutely
insane cost savings. But when the company goes on a nice retreat for only a
select elite few that easily costs 20k, or runs an over-the-top kitsch open
party/recruiting event that costs 80k, and you are dragged into a meeting on a
Monday morning and confronted with "why did the '3k per month, 3 billion
requests per day' system cost 6k this month?" (because we had multiple
clusters computing historical data in parallel for the past 6 months that you
asked for), you just get bitter that you didn't give the analytics company the
1 million they asked for and be done with it.

~~~
buremba
If you're pre-calculating the metrics and dropping the raw data from your
systems, you don't actually get the benefits of ad-hoc systems. You can't ask
new questions of your existing analytics data, and you need to do custom
development every time you need to see a new metric.

------
xstartup
Here is an example pipeline which supports loading data from an unbounded
source into BigQuery in batches using load jobs (avoiding BigQuery's streaming
insert cost).

See: [https://zero-master.github.io/posts/pub-sub-bigquery-beam/](https://zero-master.github.io/posts/pub-sub-bigquery-beam/)
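
The linked post uses Beam; independent of that, the core of the load-job approach is buffering rows and flushing on a size or age threshold instead of streaming each row. A hedged sketch (thresholds and names invented, and `flush_fn` stands in for kicking off an actual BigQuery load job):

```python
import time

class LoadJobBuffer:
    """Accumulate rows and flush them as batch load jobs, not streaming inserts.

    flush_fn would kick off a BigQuery load job in a real pipeline;
    the thresholds are invented for the sketch.
    """
    def __init__(self, flush_fn, max_rows=10_000, max_age_s=60, clock=time.monotonic):
        self.flush_fn, self.max_rows, self.max_age_s = flush_fn, max_rows, max_age_s
        self.clock = clock
        self.rows, self.started = [], None

    def add(self, row):
        if self.started is None:
            self.started = self.clock()
        self.rows.append(row)
        if len(self.rows) >= self.max_rows or self.clock() - self.started >= self.max_age_s:
            self.flush()

    def flush(self):
        if self.rows:
            self.flush_fn(self.rows)
            self.rows, self.started = [], None

# Fake clock so the sketch is deterministic.
now = [0.0]
flushed = []
buf = LoadJobBuffer(flushed.append, max_rows=3, max_age_s=60, clock=lambda: now[0])
for i in range(7):
    buf.add({"event": i})
print([len(b) for b in flushed])  # [3, 3]
buf.flush()  # drain the remainder at shutdown
print([len(b) for b in flushed])  # [3, 3, 1]
```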

~~~
AWebOfBrown
Are you the author? OT, but I'm amazed the author charged merely 100 euro for
implementing that solution for the subject startup, even if they're
cash-strapped. I'm not familiar with BigQuery, but I'm curious what a normal
rate for solving that issue would look like.

~~~
mentat
That's shockingly cheap for the value.

------
asavinov
> Giving meaning to ...

Ingesting such amounts of data is indeed a challenge. But the problems become
much more complicated if it is necessary to perform complex analysis during
data ingestion. Such analysis (not simply event pre-processing) can arise for
the following reasons:

* It is not physically possible to store this volume of events. For example, assume you collect them from devices and sensors

* It is necessary to make faster decisions, e.g., in mission critical applications

* It can be more efficient to do some analytics before storing data (as opposed to first storing data persistently and then loading it again for analysis)

Such analysis can be done by conventional tools like Spark Streaming (micro-
batch processing) or Kafka Streams (which works only with Kafka). One novel
approach is implemented in Bistro Streams [0] (I am the author). It is
intended for general-purpose data processing, including both batch and stream
analytics, but it radically differs from MapReduce, SQL and other set-oriented
data processing frameworks: it represents data via _functions_ and processes
data via _column operations_ rather than having only set operations.

[0] Bistro:
[https://github.com/asavinov/bistro](https://github.com/asavinov/bistro)

------
buremba
Sounds like they over-engineered the solution. If you have an ad-hoc use-case,
BigQuery is great, but it's quite expensive. If you just need to pre-calculate
the metrics using SQL, Athena / Prestodb / ClickHouse / Redshift Spectrum
might be much easier and more cost-efficient.

~~~
vgt
BigQuery PM here. I'd love to genuinely understand why you have that
impression.

BigQuery's on-demand model charges you EXACTLY for what you consume. Meaning,
your resource efficiency is 100% [0].

By contrast, typical "cluster pricing" technologies require you to pay for
100% of your cluster uptime. In private data centers, it's difficult to get
above 30% average efficiency.

BigQuery also takes care of all software, security, and hardware maintenance,
including reprocessing data in our storage system for maximum performance and
scaling your BigQuery "cluster" for you.[1]

BigQuery has a perpetual free tier of 10GB of data stored and 1TB of data
processed per month.

Finally, BigQuery is the only technology we're aware of whose logical storage
system doesn't charge you for loads: loading data neither compromises your
query capacity nor shows up on your bill.

[0] [https://cloud.google.com/blog/big-data/2016/02/visualizing-t...](https://cloud.google.com/blog/big-data/2016/02/visualizing-the-mechanics-of-on-demand-pricing-in-big-data-technologies)

[1] [https://cloud.google.com/blog/big-data/2016/08/google-bigque...](https://cloud.google.com/blog/big-data/2016/08/google-bigquery-continues-to-define-what-it-means-to-be-fully-managed)

~~~
slap_shot
The answer comes down to one line in your BigQuery Under the Hood article[0]:
"The answer is very simple — BigQuery has this much hardware (and much much
more) available to devote to your queries for seconds at a time. BigQuery is
powered by multiple data centers, each with hundreds of thousands of cores,
dozens of petabytes in storage capacity, and terabytes in networking
bandwidth. The numbers above — 300 disks, 3000 cores, and 300 Gigabits of
switching capacity — are small. Google can give you that much horsepower for
30 seconds because it has orders of magnitude more."

Most companies are fine letting their warehouse go underutilized, or letting
queries go without such enormous resources, if it means capping their monthly
data warehouse bill at a fixed number, say $9,000.

BigQuery is an awesome piece of technology, but most publishers, ecommerce,
and SaaS companies have teams of analysts, engineers, and business folks
pounding away at their warehouse all day. And it's fine if those queries
aren't as fast as BigQuery.

I run an analytics company and we load billions of events into our customers'
warehouses each day. Many have evaluated BigQuery and they all came back with
the same answer: too expensive. Most of them are big companies, but they spend
nowhere near the $40K they'd have to spend to cap their cost on BigQuery. And
with the advent of Spectrum, they're even less likely to jump ship now.

Since you're a PM, I'd be really interested to know if you guys are aware of
this issue and if you're doing anything to offer a solution that competes with
Redshift (fixed cost/resource). I ask this as someone who runs a ton of stuff
on GCP, but we've just never found a way to make BigQuery cost effective for
us.

[0] [https://cloud.google.com/blog/big-data/2016/01/bigquery-unde...](https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood)

~~~
vgt
Genuinely thanks for the info, this is super insightful.

I'll argue that BigQuery's per-query pricing charges you JUST for the
resources you use (well, data scanned), so it SHOULD be far less expensive
than a model that charges you for the luxury of having a cluster sit idle (and
often at only 30% utilization), correct?

Can you help me unpack this further? I think pay-per-query is ultra-efficient,
but difficult to predict. However, buy-a-cluster is easy to predict but
inefficient. Do you think that difficulty in planning for BigQuery spend
translates into perception of being too expensive (and potentially unbounded
spend?), or do you think BigQuery's pay-per-query is indeed too expensive?

Most existing BigQuery customers, even at large scale, do have the option to
go from pay-per-query to flat rate and back, and choose to stay on on-demand
because... well... it's much much more efficient :)

For example, if Netflix charged you a penny per minute of watchtime, you'd
have no idea if it's more expensive or less expensive, but you'd be assured
it's more efficient.

Feel free to ping me offline as well.

~~~
slap_shot
> ...BigQuery's per-query pricing charges you JUST for the resources you use
> (well, data scanned), so it SHOULD be far less expensive than a model that
> charges you for the luxury of having a cluster sit idle (and often at only
> 30% utilization), correct?

> For example, if Netflix charged you a penny per minute of watchtime, you'd
> have no idea if it's more expensive or less expensive, but you'd be assured
> it's more efficient.

The Netflix analogy is actually quite good. Using that, let's say that I have
a family of six, I'm billed $0.01 per minute of viewership, but I'm getting
the best quality/speed. Each person watches 40 minutes a day, for all 30 days,
for a grand total of $72. Far greater than the fixed cost of $10.99 (with,
say, significantly less quality, speed, and minutes per month of viewership).
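
Worked in cents to keep the arithmetic exact, that comparison is:

```python
# 6 viewers x 40 minutes/day x 30 days at 1 cent per minute, vs the flat plan.
minutes = 6 * 40 * 30             # 7200 minutes of viewing per month
metered_bill = minutes * 1 / 100  # cents -> dollars
flat_bill = 10.99
print(metered_bill, flat_bill)    # 72.0 10.99
```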

In the real world, the family of six is my team of data analysts, scientists,
and BI folks who are querying my database from 9-5 every weekday.

A customer of mine, a large NYC publisher, whom you have heard of and probably
read, evaluated BigQuery in 2017. They loaded BQ with the exact same data as
their Redshift cluster, pointed their Looker instance at it, and in just one
day blew through 1/4 of their typical Redshift budget. All the queries were
faster. Way faster than they needed to be, actually.

Going back to my original post above, the issue here is that just because
Google CAN throw these massive amounts of resources at my problem doesn't mean
I can afford to use that level of computation for each query, or would even
want to. In the Netflix example, I'm happy if Sally and Billy get lower
quality or limited time watching Netflix, as long as my bill stays at $10.99.

For most companies, their Redshift cluster is optimized to be able to handle
their peak workload WELL ENOUGH. That means that queries won't be as fast as
BigQuery - and that's totally fine. And it means that the cluster will be
underutilized for large portions of the night and weekends - again, totally
fine. They just need their usage capped at a predetermined cost, and have
their queries finishing in a reasonable amount of time.

I've posed this to Google employees before and I'm hit with "well, you can
limit how much each person can query a day." Except that isn't an acceptable
solution. I can't have analysts sitting around unable to query their database
because they've exceeded their daily limit. They'd rather just fire off their
Redshift query and, if it takes a little bit longer, so be it.

~~~
manigandham
To repeat my other comments, the issue is that BQ doesn't actually charge for
compute but for data scanned. You already pay for storage (uncompressed, but
that's a separate issue), so paying again on that metric doesn't make sense
for querying, since what you're really consuming is CPU cores, a time-based
resource. If I use 1000 cores for 10 seconds vs 10 minutes, it makes sense to
pay exactly that, regardless of how much data was scanned.

We've run into the same issue, where being curious with BigQuery actually
becomes problematic: users are perfectly fine waiting an extra minute to scan
10TBs, but are afraid of the $50 bill that comes with it, especially for every
little query that might be mistyped, or that is out of their hands when using
BI tools that run queries of their own.
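
For context, the $50-for-10TB figure corresponds to BigQuery's on-demand rate at the time, $5 per TB scanned:

```python
ON_DEMAND_PER_TB = 5.00  # USD per TB scanned (on-demand rate at the time)

def scan_cost(tb_scanned):
    """Back-of-envelope cost of an on-demand query scanning tb_scanned TB."""
    return tb_scanned * ON_DEMAND_PER_TB

print(scan_cost(10))  # 50.0: the bill for the 10 TB scan mentioned above
```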

------
hinkley
I see titles like this and my first thought is “more people need to watch Real
Genius or watch it again.”

“What do you think a phase conjugate tracking system is for, Kent?”

Great. You made a system to track a billion people a day. You’re murdering
privacy and then bragging about it. And bragging about it during a giant
shitstorm caused by Facebook. The fuck is wrong with you?

~~~
Arzh
I want to know how people use my website, fuck me right?

~~~
hinkley
A hundred billion events a day isn’t “knowing how people use your website”.
It’s mass scale surveillance. It’s stalking your customers.

~~~
manigandham
Is there some magic scale threshold?

~~~
hinkley
Are you asking permission to misbehave by trying to determine exactly how much
is too much?

Is there some magic scale threshold for excessive force? No. There is no fine
line, because you _shouldn't be anywhere near the line_.

100 billion. It's not 'lots' or millions, or a billion we're talking about
here. It's one hundred billion. That's 80 events per day on average for every
human they've ever seen (1.2 billion, in the about section). If they see half
of those on any given day that's 160 events per person, maybe 200. Per day.

Fine-grained tracking you do on your own site to determine why people leave
and whether they see your new content? I could see my boss asking for that. I
wouldn't be enthusiastic. I might make excuses that I was too busy doing other
things to help. I might even complain, but we probably aren't running that
forever anyway because it tells you less and less over time but still costs
the same.

But this is an ad network, not a usability study. We are currently busy
handwringing about ad networks, and I'm a little taken aback by the dissonance
here.

~~~
manigandham
Nice spin, but you made the statement as if there's a difference, so yes, I'm
asking you to define the limit. You can't say "anywhere near the line" and not
be able to tell me what the line is. Unless you have some understanding of
what is reasonable for the business metrics, it's rather useless for any
further discussion.

Those ads do have owners, who pay a lot of money both to distribute their
content and to see how people are interacting with it, so it's the same thing
as a website. And there are lots of tracking events to capture, so it's easy
to add up to billions, but as the article states, it's not all user events.
Only 10B come from users, with everything else being backend server logs.

> I wouldn't be enthusiastic. I might make excuses that I was too busy doing
> other things to help

Ok... so being dishonest to avoid doing your job is fine? If you are against
it, then don't do it. But what is the point you're trying to make here?

~~~
hinkley
Whatever lets you sleep at night, man.

> Ok... so being dishonest to avoid doing your job is fine?

You’d be a hoot as a guest speaker in an ethics class.

~~~
manigandham
You haven't answered any of the questions or provided any reasonable details.

What does ethics class have to do with this? I'm sure the question of whether
lying in your job to avoid a particular project is acceptable would be far
more interesting to study anyway; I'll be sure to bring it up the next time I
teach one.

