Hacker News new | past | comments | ask | show | jobs | submit login
Giving meaning to 100B analytics events a day with Kafka, Dataflow and BigQuery (medium.com/teads-engineering)
173 points by benjamindavy on April 18, 2018 | hide | past | favorite | 68 comments

I've been working with all these tools for a while now (and a bunch more along the way).

It's not the orthodox cloud-thinking, but you're often best off processing at the point of creation (or ingestion). Normalize the data as much as you can, (probably) compress[1], and send as close to the target as possible.

If you grab the data, send elsewhere, transform <repeat>... it all gets slow and expensive pretty quickly. Also a headache to manage failures.

This is especially true if you're using BigQuery. Stage your near-raw data into BigQuery and then use it's muscle as much as you can. A classic example here might be de-duplicating data. A painful prospect for many distributed systems, but pretty easy on the BigQuery side.

This is all especially true for time-series data. With BigQuery time partitions you can keep the queries fast (and the costs reasonable).

Also limits the range of technologies and languages you need to wrangle too.

1: Choosing your data format and compression approach can make a huge difference.

I built a similar pipeline using Kafka and ClickHouse - it's amazing how easy it is nowadays to ingest and analyze billions of events a day using standard tools.

ClickHouse can even ingest directly from Kafka (courtesy of Cloudflare - http://github.com/vavrusa contributed it).

Do you learn anything useful about humanity tho?

Fair comment. The title would imply there's a 'so what?' in the article, not just a 'how'.

Can you elaborate a bit about the servers you used (hardware, cluster size, etc.)?

For anyone who wants to move this much data out of AWS and into another cloud provider, Kinesis Streams does not charge for bandwidth out. If you do the math, it works out to a very large savings over doing public internet data transfers if you are moving terabytes per day. The compute costs to copy data from Kafka to Kinesis should be minimal since you’re essentially just operating a pipe and not doing much actual compute, and this can operate easily on spot instances.

Can you elaborate on that? I haven't been able to conclude that from what it says on the pricing page [1]:

> Data transfer is free. AWS does not charge for data transfer from your data producers to Amazon Kinesis Data Streams, or from Amazon Kinesis Data Streams to your Amazon Kinesis Applications.

That sounds like it's sonly free to Amazon Kinesis Applications (== inside AWS).

And on [2] it says:

> If you use Amazon EC2 for running your Amazon Kinesis Applications, you will be charged for Amazon EC2 resources in addition to Amazon Kinesis Data Streams costs.

So that sounds like you will eventually pay the normal egress cost of EC2.

[1]: https://aws.amazon.com/kinesis/data-streams/pricing/ [2]: https://aws.amazon.com/kinesis/data-streams/faqs/

“If you use Amazon EC2 for running your Amazon Kinesis Applications” is the key there. You don’t have to use Amazon EC2.

Amazon Kinesis Application does not imply EC2. They use on-prem examples in a few Re:Invent presentations. It is just an HTTP API like most everything else within Amazon. You will see other services spell out their bandwidth out charges much more clearly.

"In digital advertising, ..." Stopped reading right there. Maybe a great article though. I started working on that thing called BigData not that long time ago but now realized that 50% of the job is advertisement, not fan of it at all, I want to like the end product or at least be neutral about it.

I work for a digital advertising company, handling the terabytes of data we produce each day - and we're not using it to track people or build profiles, we're using it to monitor our system and record important events. Several thousand publishers monetise their sites through us, so I'm happily neutral about the industry.

Could you elaborate? There's plenty of big data tasks and roles that aren't even closely related to marketing.

Not my experince. I'd say the majority of big data is unethical both in capture and use trying to sell you something or sell someone else something on you, a small amount is for general good.

I wasn't involved on the project but a major Telecoms company i was at was prototyping scanning for mobile devices on their provided home routers. If they didn't recognise a mibile as being on their network they would then target you with mobile deals. Devs saw the issues with this, essentially scanning your home network from their routers and sending the data back to be processed. Managers found Apple's mac address rotation annoying and lost the point why Apple do it.

Unfortunately besides quitting the job devs on the project don't have a say as they don't have the power or say in the final decisions at these companies. It's easy to say you would quit but it's harder when you have a mortgage, wife and kids for example.

The vast majority of big data is used today to provide everything you depend on, like email, healthcare, banking, transportation, government services, etc.

Maybe you'd find Amino interesting. They use big data for healthcare savings.


Maybe it's different in the states, because a lot of tech giants are based there. But in Europe, I feel like, there are banks that inherently have a lot of data and they can use BigData stack for ETL and other stuff, and there is the rest who just want to get some data on users and sell them some products they don't really want, it's not necessary an advisement company by itself, it can be an analytical platform or just a department in some big e-store.

Very interesting read, but as more of an analyst, I kept waiting for the ‘meaning.’ Maybe I missed it, but for anyone else who may be considering reading this, it is more about ‘giving business structure’ to analytic events.

Good article none the less.

You 'd be surprised how good modern hardware is. We are being hit with between 1.5 and 2 million requests per minute (steady traffic no spikes except increased usage in the weekends), and our analytics solution runs on just 2 main servers and costs $3k per month (total associated costs except human labour which is 1 engineer working on it part time now).

We talked with a famous analytics company and they gave us a quote of 1 million yearly to work with us. So how did we get it down from 1m to 35k?

We pretty much do the thing the article suggests (roll-ups section). We compute data hourly. Alongside with the hourly computation we also dump some extra data that can be used to compute numbers for the day (eg unique user ids, from unique user ids per hour), then from the day we get per week and then per month. We also have less moving pieces (our stack is way more traditional), and we manage our own hardware (key for keeping costs down)

When you get data back from the system you only hit the pre-computed cache, no query touches the main system from the dashboards. We only allow queries running in a 30 minute window to run on the live system - to ensure that no crazy load is going to be built on top of it and we use that to mostly catch anomalies on the real time data. (our parsing time is good too, between 10 seconds and 1 minute compared to the 2-30 minutes the article gives).

However this is the "You are alive but you 're not living" angst of analytics. All the data is there, but you cannot freely prod it for answers and patterns. If you want to get an answers about past data, you need to go through an overly complex process of raising a new cluster and ingesting old backups, multiple times, then waiting for a few days. It get's relatively expensive, slow and at times will demoralize you and make you back off from getting the answers you need. You could try keeping a smaller cluster that only gets % percent of the data (eg only 2%) for finding trends, drawing heatmaps etc and that one can run in realtime but your CEO will say that's a stupid idea to your face and it's realtime-all or nothing.

You might say that's a situation you can live with provided the absolutely insane cost savings, but when the company goes at a nicer retreat for only a selective elite few that easily costs 20k, or runs an over the top kitsch open party party/recruiting that costs 80k, and you are being dragged into a meeting on a monday morning and confronted; Why did the "3k per month-3 billion requests per day" system cost 6k this month? (because we had multiple clusters in parallel computing historical data for the past 6 months that you asked for). You just get bitter you didn't give the analytics company the 1 million they asked for and be done with it.

If you're pre-calculating the metrics and dropping the raw data from your systems, you can't actually get the benefits of the ad-hoc systems. You can't ask a new question to your existing analytics data and you need to do custom development every time you need to see a new metric.

How much data size are you dealing with? For that price range, you can probably just get MemSQL or Clickhouse and handle real-time queries across all of the data.

For those bad at arithmetic that’s 33k rps

Here is an example Pipeline which supports loading data from an unbounded source to BigQuery in batches using load jobs (evading BigQuery's Streaming Insert cost)

See: https://zero-master.github.io/posts/pub-sub-bigquery-beam/

Are you the author? OT but I'm amazed the author charged merely 100 euro for implementing that solution for the subject startup, even if they're cash-strapped. I'm not familiar with BigQuery, but I'm curious what a normal rate for solving that issue would look like.

That's shockingly cheap for the value.

You can also skip pub/sub and/or use it to write files to cloud storage, then load from there by using a cloud function that will trigger the load job when a new object is created.

FYI dataflow's bigqueryio does not use stream inserts. Instead it batches the data into many load jobs

That is configurable, but for unbounded PCollections it appears the default is to use streaming inserts https://beam.apache.org/documentation/sdks/javadoc/2.4.0/org...

> Giving meaning to ...

Ingesting such amounts of data is a challenge indeed. But problems will become much more complicated if it is necessary to perform complex analysis during data ingestion. Such analysis (not simply event pre-processing) can arise because of the following reasons:

* It is physically not possible to store this amount of events. For example, assume you collect them from devices and sensors

* It is necessary to make faster decisions, e.g., in mission critical applications

* It can be more efficient to do some analytics before storing data (as opposed to first storing data persistently and then loading it again for analysis)

Such analysis can be done by conventional tools like Spark Streaming (micro batch processing) or Kafka Streams (works only with Kafka). One novel approach is implemented in Bistro Streams [0] (I am an author). It is intended for general-purpose data processing including both batch and stream analytics but it radically differs from MapReduce, SQL and other set-oriented data processing frameworks. It represents data via functions and processes data via column operations rather than having only set operations.

[0] Bistro: https://github.com/asavinov/bistro

Sounds like they over-engineered the solution. If you have ad-hoc use-case, BigQuery is great but it's quite expensive. If you just need to pre-calculate the metrics using SQL, Athena / Prestodb / Clickhouse / Redshift Spectrum might be much easier and cost-efficient.

BigQuery PM here. I'd love to genuinely understand why you have that impression.

BigQuery's on-demand model charges you EXACTLY for what you consume. Meaning, your resource efficiency is 100% [0].

By contrast, typical "cluster pricing" technologies require you to pay for 100% of your cluster uptime. In private data centers, it's difficult to get above 30% average efficiency.

BigQuery also takes care of all software, security, and hardware maintenance, including reprocessing data in our storage system for maximum performance and scaling your BigQuery "cluster" for you.[1]

BigQuery has a perpetual free tier of 10GB of data stored and 1TB of data processed per month.

Finally, BigQuery is the only technology we're aware of whose logical storage system doesn't charge you for loads - meaning we don't compromise your query capacity, nor do we bill you for loads.

[0] https://cloud.google.com/blog/big-data/2016/02/visualizing-t...

[1] https://cloud.google.com/blog/big-data/2016/08/google-bigque...

The answer comes down to one line in your BigQuery Under the Hood article[0]: "The answer is very simple — BigQuery has this much hardware (and much much more) available to devote to your queries for seconds at a time. BigQuery is powered by multiple data centers, each with hundreds of thousands of cores, dozens of petabytes in storage capacity, and terabytes in networking bandwidth. The numbers above — 300 disks, 3000 cores, and 300 Gigabits of switching capacity — are small. Google can give you that much horsepower for 30 seconds because it has orders of magnitude more."

Most companies are fine letting their warehouse go underutilized, or for queries to not be solved with such enormous resources, if it means capping their monthly data warehouse bill at a fixed number, say $9,000.

BigQuery is an awesome piece of technology, but most publishers, ecommerce, and saas companies have teams of anlayts, engineers, and business folks pounding away at their warehouse all day. And it's fine if those queries aren't as fast BigQuery.

I run an analytics companies and we load billions of events into our customers' warehouses each day. Many have evaluated BigQuery and they all came back with the same answer: too expensive. Most of them are big companies, but spending nowhere near the $40K they'd have to to cap their cost on BigQuery. And with the advent of Spectrum, they're even less likely to jump ship now.

Since you're a PM, I'd be really interested to know if you guys are aware of this issue and if you're doing anything to offer a solution that competes with Redshift (fixed cost/resource). I ask this as someone who runs a ton of stuff on GCP, but we've just never found a way to make BigQuery cost effective for us.


Genuinely thanks for the info, this is super insightful.

I'll argue that BigQuery's per-query pricing charges you JUST for the resources you use (well, data scaned), so it SHOULD be far less expensive than a model that charges you for the luxury of having a cluster sit idle (and often at only 30% utilization), correct?

Can you help me unpack this further? I think pay-per-query is ultra-efficient, but difficult to predict. However, buy-a-cluster is easy to predict but inefficient. Do you think that difficulty in planning for BigQuery spend translates into perception of being too expensive (and potentially unbounded spend?), or do you think BigQuery's pay-per-query is indeed too expensive?

Most existing BigQuery customers, even at large scale, do have the option to go from pay-per-query to flat rate and back, and choose to stay on on-demand because.. well.. its much much more efficient :)

For example, if Netflix charged you a penny per minute of watchtime, you'd have no idea if it's more expensive or less expensive, but you'd be assured it's more efficient.

Feel free to ping me offline as well.

For background, our customers use Redshift, and we are also a heavy Redshift user for our own back-end. The ones that have done a bake-off with BigQuery (and Snowflake too, btw) come all back with the conclusion "too expensive". And if you're the BigQuery PM, I'd love to talk to you because we're planning to build for BigQuery what we've built for Amazon Redshift. lars at intermix dot io

I think the difference here is in the use case, and I think it boils down to "ETL" vs. "ELT". If you do ETL, and your analyst team runs a few ad-hoc queries, then the pay-per-query approach makes a lot of sense. It would agree that it's "ultra-efficient".

If you do "ELT", where you ingest fairly raw data on a continuous basis, and then run complex transformations within your data warehouse, then the "buy a cluster" pricing will win. The data loads (we've seen load frequency up to every 2 minutes) require resources, and then so do the transformations.


side note:

When it comes to utilization, I always have to think of this great research paper by Andrew Odlyzko:


"Data networks are lightly utilized, and will stay that way"


Agreed that most warehouses sit at 30% utilization. The customers we work with have more utilization than 30% though. That's because they're continuously ingesting data, transforming it within Redshift, and then lots of ad-hoc queries by analyst teams and data services that feed other applications. If you have a business that runs on a two or more continents, then you don't even have downtime during US nighttime, as Europe, Asia, etc. are running. And so in those cases, we've seen that customers don't want any surprises and would rather pay for the cluster. Predictability becomes more important than paying the lowest amount possible per query.


I'm a co-founder at https://www.intermix.io - we provide monitoring for data infrastructure

> ...BigQuery's per-query pricing charges you JUST for the resources you use (well, data scaned), so it SHOULD be far less expensive than a model that charges you for the luxury of having a cluster sit idle (and often at only 30% utilization), correct?

> For example, if Netflix charged you a penny per minute of watchtime, you'd have no idea if it's more expensive or less expensive, but you'd be assured it's more efficient.

The Netflix analogy is actually quite good. Using that, let's say that I have a family of six, I'm billed $0.01 per minute of viewership, but I'm getting the best/quality speed. Each person watches 40 minutes a day, for all 30 days, for a grand total of $72. Far greater than the fixed cost of $10.99 (with, say, significantly less quality, speed, and minutes per month of viewership).

In the real world, the family of six is my team of data analysts, scientists, and BI folks who are querying my database from 9-5 every weekday.

A customer of mine, a large NYC publisher, who you have heard of and probably read, evaluated BigQuery in 2017. They loaded BQ with the exact same data as their Redshift cluster, pointed their Looker instance at it, and in just one day blew through 1/4 their typical Redshift budget. All the queries were faster. Way faster than they needed to be, actually.

Going back to my original post above, the issue here is that just because Google CAN throw these massive amounts of resources at my problem, doesn't mean I can afford to use that level computation for each query, or would even want to. In the Netflix example, I'm happy if Sally and Billy get less quality or limited time watching Netflix, as long as my bill stays at $10.99.

For most companies, their Redshift cluster is optimized to be able to handle their peak workload WELL ENOUGH. That means that queries won't be as fast as BigQuery - and that's totally fine. And it means that the cluster will be underutilized for large portions of the night and weekends - again, totally fine. They just need their usage capped at a predetermined cost, and have their queries finishing in a reasonable amount of time.

I've posed this Google employees before and I'm hit with "well, you can limit how much each person can query a day." Except that isn't a acceptable solution. I can't have analysts sitting around unable to query their database because they've exceeded their daily limit. They'd rather just fire off their Redshift query and if it takes a little bit longer, so be it.

To repeat my other comments, the issue is that BQ doesn't actually charge for compute but for data scanned. You already pay for storage (uncompressed but that's a separate issue) so paying again on that metric doesn't make sense for querying since you're using CPU cores, which is a time-based metric. If I use 1000 cores for 10 seconds vs 10 minutes, it makes sense to pay exactly that regardless of how much data was scanned.

We've run into the same issue where being curious with BigQuery actually becomes problematic as users are perfectly fine waiting an extra minute to scan 10TBs, but are afraid of the $50 bill that comes with that, especially for every little query that might be mistyped or out of their hands when using BI tools that run queries of their own.

Netflix's analogy is great, and I would add: being charged $0.01 per minute of viewership is so stressful you would consume less, or you would be stressed about what your parents would say about the bill.

Thanks, this is very valuable feedback, worth in gold. Please feel free to email me directly to unpack our future plans here.

The Netflix analogy also applies if they bill $0.001 per minute - the number I chose is not indicative of reality, I was just making a point. My goal was to demonstrate the fact that it's more efficient per-resource.

Isn’t “buy 2000 BQ slots for 40k a month” essentially just buying a cluster?

Yes, essentially.. a software-defined cluster (not physical). It's a parallel pricing model. To be clear, my previous post's logic is not calling for an exception for this model - you trade efficiency for predictability.

BigQuery charges you for the storage so it's not just the queries. It also charges for the streaming inserts which is also not that cheap for large volumes of data.

I might be biased since I'm running an analytics company and this is my core business but here's our alternative:

The query and storage layer should be separated because often the expensive one the is the query layer. We have API servers which ingest the data from SDKs, enrich/sanitize and send it to a commit log such as Kinesis and Kafka. One API server is capable of handling 10k r/q.

We have the Kinesis/Kafka consumers that consume the data in micro-batches, convert them to ORC files, store it on S3 and index the metadata in a single Mysql server. The throughput is around 15k r/q.

S3 or Cloud Storage is cheap, reliable and can be used for many other use-cases as well but you need high-memory nodes for the query layer if you're dealing large volumes of data.

We have a Presto cluster which spins up when you need to run ad-hoc queries, allows you to pre-materialize the data so that you can power your dashboards. If you're running expensive queries on raw data, you can spin up 10 Presto worker, execute the queries and then just shut down them when you don't need.

That way, we're able to access all the historical data, no extra cost for the data ingestion and storage and it's even better than serverless since it will be much cheaper. The system can be automatic based on the hour of the day (during your analysts working hours) the CPU and memory load so unfortunately "BigQuery's on-demand model" is not a killing feature anymore.

Thanks, can we continue this discussion? Super interesting to me.

One clarification. BigQuery has two methods of ingest. The Streaming API you mention does carry additional cost, but for the added benefit of having your data appear in BigQuery in real-time.

Batch load, however, is entirely free, as in - it doesn't use your "query capacity", and we don't charge for it. I am rather certain this is a very compelling offering. Batch loads encode your data, replicate it, secure it, convert it into our format, and fix any issues your files may have, like optimal file sizes, optimal number of files, optimal file groupings for your queries - lots of subtle things that either slow you down or cause you headaches otherwise. We also manage all the metadata for you (your MySQL instance). We also burn a good chunk of resources post-load re-materializing and optimizing your dataset. We also maintain upgrades, downtime, and so on.

It sounds like your use case is micro-batch, so perhaps you could benefit from our free batch loads?

It also sounds like you don't mind the extra operational overhead, and would rather operate your own stack, run your own upgrades, fix your own file issues, and so on. This is a personal preference, I agree. Customers who prefer our model love the ease of use and would rather focus their energy elsewhere. It is a stated goal of BigQuery to abstract away complexity and make BigQuery as easy to use as possible for everyone, and if we're failing somewhere, we certainly want to know :)

Finally, if I had to equate BigQuery's on-demand pricing, it gives you the ability to go from 0 cores to thousands and back to 0 in sub-second intervals, scoped down to individual query size. This is exactly what's happening under the hood, but all that is abstracted away behind a "run query" button.

Our system is also near-realtime. The default batch duration is 3 minutes, it's also configurable but it will also affect the background processes (shard compaction, data organization jobs etc.) so our customers usually prefer 5 minutes which is also near real-time for 10B events per month.

Please note that this is our core business. Of course, we maintain these services but we also try hard to push more work to cloud providers. For example, both AWS and GC offers managed Mysql servers, auto scale groups for nodes, object stores such as S3 and Cloud Storage so we actually maintain the software and help the customers to upgrade the software when they need.

I agree that BigQuery will save time if the users are not familiar with distributed systems and big-data but again, "ability to go from 0 cores to thousands" doesn't make sense to me because in practice I have never experienced such case. Since we need to start the instances it may take up to 2 - 3 minutes but this is often acceptable for data analysts.

Could you please elaborate the part "re-materializing and optimizing your dataset."? We do a number of optimizations for compacting ORC files, bucketing etc. but I would love to hear how BigQuery does post-processing.

As a note, I usually tend to simplify things but most of the BigQuery customers that I see usually do overengineering because of the cost optimization. For example in the article the author uses Redshift for the dashboard data the solution they use for moving data from BigQuery to Redshift also needs to be maintained and it's not that easy. If I'm going to adopt my whole system to the way how BigQuery works and push hard to save costs, then I expect it to be pretty cheap but 40K for reserved slots doesn't sound like cheap to me. We maintain similar size clusters for 20% of this price for the same data volume and it's much more flexible.

Have you looked at Snowflake Data? They have a similar pay-per-compute setup as your Presto cluster with automatic pause/resume to save any idle time. Good middleground between traditional Redshift and flexible BigQuery.

Yes, they have their own query engine but the idea is same; separate storage and query layer, scale the query layer on demand and charge for it. I like their approach but not everyone is willing to depend on a third party company for their company data.

How is that different than BigQuery then? GCP and AWS are also third party companies.

The motivation is rather different. If you ask Redshift users, the motivation is mostly that they're already using AWS and depend on their services. That's the case for GCP as well, I would love to hear from the Google side of the story but AWS customers don't usually use BigQuery for their analytics stack because as the author explained it's not that easy. That's why BigQuery is not getting enough attraction from the outside, even though it's a cool technology.

We all agree that Redshift is not scalable at some point and people switch to other technologies when they need to but that's not the point. Creating a Redshift cluster is dead-easy and getting the data in it is also not complex compared to other solutions if you're already an AWS user.

The case for Snowflake is different, they also use AWS but they manage your services. Although they did a smart move and store the data on their customers' S3 buckets in order to make them feel like they "own" the data but it's possible only because AWS lets them to that way.

I believe that AWS doesn't try to make Redshift cost-efficient for large volumes of data because they already have the advantage of vendor-lock and making money from their large enterprise customers processing billions of data points in Redshift. That's why there are many Redshift monitoring startups out there to save the cost for you.

On the other hand, AWS is smart enough to build a Snowflake-like solution when their Redshift users start to switch BigQuery and Snowflake and the companies such as Snowflake needs to be prepared when the day comes. Cloud is killing everyone including us.

Snowflake has their own storage unless you opt for the enterprise version to have it stored in your own account. They also use their own table format (like BigQuery) but you can always export it.

It's true that AWS has Redshift Spectrum (and Athena) to help with more scalable querying across S3, however I don't think that makes a big risk for another company that provide a focused offering on top. Snowflake is very well capitalized with close to $500M in investment and plenty of customers so I wouldn't worry about them going out of business.

I especially like them because they have the elastic computing style of BigQuery but charge for computing time rather than data scanned, which is a much more effective billing model than anything else out there.

What's the focused offering provided by them? Does their system more scalable, efficient or cheap or is there any technical limitations of the AWS products? Personally, I don't think $500M matters if their customers (I'm assuming most of their revenue comes from enterprise customers because that's how it works usually) churn in a few years.

A more focused data warehouse product that is priced on a better model and is more scalable and efficient and cheaper than any of the existing options. $500M means they'll be around longer than most of the startup customers that try them.

Anyway, at this point I'm just repeating myself so I suggest you actually try them if you care for a better model. It works for us at 600B rows of data that was too expensive with BigQuery and too slow and complicated with Redshift/Spectrum.

Sounds fair. I'm convinced to try out their service, it's good to see that they also added pricing to their website. Let's see how it's better compared to Presto as they have the similar use-case.

Around query pricing, my understanding is BigQuery charges by bytes scanned uncompressed. Redshift Spectrum/Athena charges by bytes scanned compressed. That makes Athena/Spectrum cheaper as well.

It's not strictly true - depends on the rate. The only thing you can say for sure - uncompressed analysis on Athena is too expensive

They have the same rate. BigQuery and Athena are $5/TB scanned. Unless you store everything in raw JSON text, Athena will be cheaper even with most basic compression.

I think that's the issue, sure some companies will overprovision but as you get closer to 100% utilization, BigQuery quickly becomes extremely expensive compared to the other options, unless you're able to afford the 40k/month fee to get to flat-rate.

Also Snowflake Data is another option that supports automatic provisioning and pausing resources which is even easier to manage than redshift. Changing bigquery pricing to be compressed data stored and scanned would go a long way towards making it more attractive for full-time usage.

> In private data centers, it's difficult to get above 30% average efficiency.

What exactly does that mean?

Typically servers are only under a reasonable workload for load 30% of the day. Resulting in that hardware being unused (wasted) the other 70%.

The downside to Athena/PrestoDB is you need to get the data into an optimal format/file size to get the best performance (e.g. Parquet or other columnar format)

This is fine for batch workloads where 'real time', i.e. latency of < 5 minutes, ideally < 1 minute, isn't much of a concern.

but adds complication if you need to make data available quickly, because you either adopt some hybrid approach where you have 'recent' data in database and everything else in S3 - meaning your query layer has added complexity.

I believe this is how BigQuery works with streaming inserts, where recent streamed data is actually stored in BigTable, and asynchronously copied into Capacitor over time (this might be outdated information)

So while BigQuery seems expensive, the other solutions have a lot of other costs that you need to factor in!

Agreed. The pricing is definitely problematic for constant use at smaller companies. We had a single query that read 200GB of data once an hour, and this ends up costing around 700/month. That's for 1 dashboard chart. Add up to 10 charts and we're looking at 7k/month just to run a single company report updated hourly.

BigQuery has high throughput but also very high latency and the shared tenancy means unpredictable query times, with the same SQL taking 10 seconds or 60. They are working on both but it'll be awhile before any changes. DML merge and DDL statements are now in beta though which resolved other big obstacles in automation.

It's a great system but at this point I'd only recommend it for occasional but heavy queries that need to scan massive datasets with complex joins or for some ETL uses. Perhaps if they charged for compressed storage and scans then it would be different but right now Snowflake Data is a much more usable system day to day.

I agree with this statement if you consider BigQuery's on-demand pricing, but at a given scale the $40,000/month flat rate becomes competitive.

I see titles like this and my first thought is “more people need to watch Real Genius or watch it again.”

“What do you think a phase conjugate tracking system is for, Kent?”

Great. You made a system to track a billion people a day. You’re murdering privacy and then bragging about it. And bragging about it during a giant shitstorm caused by Facebook. The fuck is wrong with you?

I want to know how people use my website, fuck me right?

A hundred billion events a day isn’t “knowing how people use your website”. It’s mass scale surveillance. It’s stalking your customers.

Is there some magic scale threshold?

Are you asking permission for misbehaving by trying to determine exactly how much is too much?

Is there some magic scale threshold for excessive force? No. There is no fine line because you shouldn't be anywhere near the line.

100 billion. It's not 'lots' or millions, or a billion we're talking about here. It's one hundred billion. That's 80 events per day on average for every human they've ever seen (1.2 billion, in the about section). If they see half of those on any given day that's 160 events per person, maybe 200. Per day.

Fine grained tracking you do on your own site to determine why people leave and whether they see your new content? I could see my boss asking for that. I wouldn't be enthusiastic. I might make excuses that I was too busy doing other things to help. I might even complain, but we probably aren't running that forever anyway because it tells you less and less over time but still costs the same.

But this is an ad network, not a usability study. We are currently busy handwringing about ad networks, and I'm a little taken aback by the dissonance here.

Nice spin, but you made the statement as if there's a difference so yes, I'm asking you to define the limit. You can't say "anywhere near the line" and not be able to tell me what the line is. Unless you have some understanding for what is reasonable for the business metrics, it's rather useless for any further discussion.

Those ads do have owners, paying a lot of money to both distribute and see how people are interacting with their content so it's the same thing as a website. And there are lots of tracking events to capture so it's easy to add up to billions, but as the article states it's not all user events. Only 10B come from users with everything else being backend server logs.

> I wouldn't be enthusiastic. I might make excuses that I was too busy doing other things to help

Ok... so being dishonest to avoid doing your job is fine? If you are against then don't do it, but what is the point you're trying to make here?

Whatever let’s you sleep at night, man.

    Ok... so being dishonest to avoid doing your job is fine?
You’d be a hoot as a guest speaker in an ethics class.

You haven't answered any of the questions or provided any reasonable details.

What does ethics class have to do with this? I'm sure the question of whether lying in your job to avoid doing a particular project would be far more interesting to study anyway, I'll be sure to bring it up the next time I teach one.

As soon as the IP packets leave your server, it’s not “your website” anymore.

Would you say as soon as the IP packets leave my device it's not "my data" anymore? It's simplified too far.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact