It's not the orthodox cloud thinking, but you're often best off processing at the point of creation (or ingestion): normalize the data as much as you can, (probably) compress, and send it as close to the target as possible.
If you grab the data, send it elsewhere, transform, <repeat>... it all gets slow and expensive pretty quickly. Failures also become a headache to manage.
This is especially true if you're using BigQuery. Stage your near-raw data into BigQuery and then use its muscle as much as you can. A classic example here is de-duplicating data: a painful prospect for many distributed systems, but pretty easy on the BigQuery side.
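A minimal sketch of that de-duplication, in plain Python for illustration (inside BigQuery this would typically be a `ROW_NUMBER() OVER (PARTITION BY ...)` query; the `event_id`/`ts` field names here are hypothetical):

```python
# Keep only the newest record per event_id -- the same effect as BigQuery's
# ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY ts DESC) = 1 pattern.
# Field names are hypothetical.
def dedupe(events):
    latest = {}
    for e in events:
        key = e["event_id"]
        if key not in latest or e["ts"] > latest[key]["ts"]:
            latest[key] = e
    return list(latest.values())

events = [
    {"event_id": "a", "ts": 1, "value": 10},
    {"event_id": "a", "ts": 2, "value": 11},  # duplicate: newer one wins
    {"event_id": "b", "ts": 1, "value": 20},
]
deduped = dedupe(events)
```

Doing this in the warehouse means the upstream pipeline can stay at-least-once and dumb, which is exactly the point.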
This is all especially true for time-series data. With BigQuery time partitions you can keep the queries fast (and the costs reasonable).
It also limits the range of technologies and languages you need to wrangle.
1: Choosing your data format and compression approach can make a huge difference.
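A quick illustration of the point (the exact numbers depend on your data; this just shows how well repetitive event payloads compress with stdlib gzip):

```python
import gzip
import json

# Hypothetical event batch: newline-delimited JSON with repetitive keys,
# which is typical for analytics events and compresses very well.
events = [{"user_id": i % 100, "event": "page_view", "ts": 1000 + i}
          for i in range(10_000)]
raw = "\n".join(json.dumps(e) for e in events).encode()
compressed = gzip.compress(raw)
ratio = len(raw) / len(compressed)  # well over 3x for data like this
```

A columnar format (ORC, Parquet) plus compression typically does even better, at the cost of batching before writing.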
ClickHouse can even ingest directly from Kafka (a feature contributed by Cloudflare's http://github.com/vavrusa).
> Data transfer is free. AWS does not charge for data transfer from your data producers to Amazon Kinesis Data Streams, or from Amazon Kinesis Data Streams to your Amazon Kinesis Applications.
That sounds like it's only free to Amazon Kinesis Applications (== inside AWS).
And on  it says:
> If you use Amazon EC2 for running your Amazon Kinesis Applications, you will be charged for Amazon EC2 resources in addition to Amazon Kinesis Data Streams costs.
So that sounds like you will eventually pay the normal egress cost of EC2.
An Amazon Kinesis Application does not imply EC2. They use on-prem examples in a few Re:Invent presentations. It is just an HTTP API, like almost everything else within Amazon. You will see other services spell out their bandwidth-out charges much more clearly.
I wasn't involved in the project, but a major telecoms company I was at was prototyping scanning for mobile devices on their provided home routers. If they didn't recognise a mobile as being on their network, they would then target you with mobile deals. Devs saw the issues with this: essentially scanning your home network from their routers and sending the data back to be processed. Managers found Apple's MAC address rotation annoying and missed the point of why Apple does it.
Unfortunately, besides quitting, the devs on the project don't have a say, as they have no power over the final decisions at these companies. It's easy to say you would quit, but it's harder when you have a mortgage, a wife, and kids, for example.
Good article nonetheless.
We talked with a famous analytics company and they gave us a quote of $1 million yearly to work with us. So how did we get it down from $1M to $35K?
We pretty much do the thing the article suggests (the roll-ups section). We compute data hourly. Alongside the hourly computation we also dump some extra data that can be used to compute numbers for the day (e.g., daily unique user IDs from the hourly unique user IDs); from the days we get per-week numbers, and then per-month. We also have fewer moving pieces (our stack is way more traditional), and we manage our own hardware (key for keeping costs down).
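A sketch of why dumping the extra data matters for unique counts (names hypothetical): daily uniques can't be summed from hourly unique *counts*, but they can be computed cheaply from the hourly unique *ID sets*:

```python
# Hourly sets of unique user IDs, dumped alongside the hourly computation.
hourly_uniques = {
    0: {"u1", "u2"},
    1: {"u2", "u3"},
    2: {"u1", "u4"},
}

# Naively summing the hourly counts over-counts users seen in several hours...
naive_sum = sum(len(s) for s in hourly_uniques.values())

# ...while the union of the ID sets gives the true daily unique count.
daily_count = len(set().union(*hourly_uniques.values()))
```

The same union trick rolls days into weeks and weeks into months, without ever re-scanning raw events.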
When you get data back from the system you only hit the pre-computed cache; no query from the dashboards touches the main system. We only allow queries over a 30-minute window to run on the live system, to ensure no crazy load gets built on top of it, and we use that mostly to catch anomalies in the real-time data. (Our parsing time is good too: between 10 seconds and 1 minute, compared to the 2-30 minutes the article gives.)
However, this is the "you are alive but you're not living" angst of analytics. All the data is there, but you cannot freely prod it for answers and patterns. If you want answers about past data, you need to go through an overly complex process of spinning up a new cluster and ingesting old backups, multiple times, then waiting for a few days. It gets relatively expensive and slow, and at times it will demoralize you and make you back off from getting the answers you need. You could try keeping a smaller cluster that only gets a small percentage of the data (e.g., only 2%) for finding trends, drawing heatmaps, etc., and that one can run in realtime, but your CEO will say that's a stupid idea to your face: it's realtime-everything or nothing.
You might say that's a situation you can live with, given the absolutely insane cost savings. But then the company goes on a nice retreat for a select elite few that easily costs 20K, or runs an over-the-top kitsch open party/recruiting event that costs 80K, and you get dragged into a meeting on a Monday morning and confronted: why did the "3K per month, 3 billion requests per day" system cost 6K this month? (Because we had multiple clusters computing historical data in parallel for the past 6 months, which you asked for.) You just get bitter you didn't give the analytics company the 1 million they asked for and be done with it.
Ingesting such amounts of data is indeed a challenge. But the problems become much more complicated if it is necessary to perform complex analysis during data ingestion. Such analysis (not simply event pre-processing) can arise for the following reasons:
* It is physically impossible to store this volume of events, for example when you collect them from devices and sensors
* It is necessary to make faster decisions, e.g., in mission critical applications
* It can be more efficient to do some analytics before storing data (as opposed to first storing data persistently and then loading it again for analysis)
Such analysis can be done by conventional tools like Spark Streaming (micro-batch processing) or Kafka Streams (works only with Kafka). One novel approach is implemented in Bistro Streams (I am the author). It is intended for general-purpose data processing, including both batch and stream analytics, but it radically differs from MapReduce, SQL, and other set-oriented data processing frameworks: it represents data via functions and processes data via column operations rather than having only set operations.
Bistro: https://github.com/asavinov/bistro
BigQuery's on-demand model charges you EXACTLY for what you consume. Meaning, your resource efficiency is 100%.
By contrast, typical "cluster pricing" technologies require you to pay for 100% of your cluster uptime. In private data centers, it's difficult to get above 30% average efficiency.
BigQuery also takes care of all software, security, and hardware maintenance, including reprocessing data in our storage system for maximum performance and scaling your BigQuery "cluster" for you.
BigQuery has a perpetual free tier of 10GB of data stored and 1TB of data processed per month.
Finally, BigQuery is the only technology we're aware of whose logical storage system doesn't charge you for loads - meaning we don't compromise your query capacity, nor do we bill you for loads.
Most companies are fine letting their warehouse go underutilized, or letting queries run without such enormous resources, if it means capping their monthly data warehouse bill at a fixed number, say $9,000.
BigQuery is an awesome piece of technology, but most publishers, ecommerce, and SaaS companies have teams of analysts, engineers, and business folks pounding away at their warehouse all day. And it's fine if those queries aren't as fast as BigQuery's.
I run an analytics company and we load billions of events into our customers' warehouses each day. Many have evaluated BigQuery and they all came back with the same answer: too expensive. Most of them are big companies, but they're spending nowhere near the $40K they'd have to spend to cap their cost on BigQuery. And with the advent of Spectrum, they're even less likely to jump ship now.
Since you're a PM, I'd be really interested to know if you guys are aware of this issue and if you're doing anything to offer a solution that competes with Redshift (fixed cost/resource). I ask this as someone who runs a ton of stuff on GCP, but we've just never found a way to make BigQuery cost effective for us.
I'll argue that BigQuery's per-query pricing charges you JUST for the resources you use (well, data scanned), so it SHOULD be far less expensive than a model that charges you for the luxury of having a cluster sit idle (and often at only 30% utilization), correct?
Can you help me unpack this further? I think pay-per-query is ultra-efficient, but difficult to predict. However, buy-a-cluster is easy to predict but inefficient. Do you think that difficulty in planning for BigQuery spend translates into perception of being too expensive (and potentially unbounded spend?), or do you think BigQuery's pay-per-query is indeed too expensive?
Most existing BigQuery customers, even at large scale, do have the option to go from pay-per-query to flat rate and back, and choose to stay on on-demand because... well... it's much, much more efficient :)
For example, if Netflix charged you a penny per minute of watchtime, you'd have no idea if it's more expensive or less expensive, but you'd be assured it's more efficient.
Feel free to ping me offline as well.
I think the difference here is in the use case, and I think it boils down to "ETL" vs. "ELT". If you do ETL, and your analyst team runs a few ad-hoc queries, then the pay-per-query approach makes a lot of sense. I would agree that it's "ultra-efficient".
If you do "ELT", where you ingest fairly raw data on a continuous basis, and then run complex transformations within your data warehouse, then the "buy a cluster" pricing will win. The data loads (we've seen load frequency up to every 2 minutes) require resources, and then so do the transformations.
When it comes to utilization, I always have to think of this great research paper by Andrew Odlyzko:
"Data networks are lightly utilized, and will stay that way"
Agreed that most warehouses sit at 30% utilization. The customers we work with have higher utilization than that, though, because they're continuously ingesting data, transforming it within Redshift, and then running lots of ad-hoc queries from analyst teams and data services that feed other applications. If your business runs on two or more continents, you don't even have downtime during US nighttime, as Europe, Asia, etc. are running. And so in those cases, we've seen that customers don't want any surprises and would rather pay for the cluster. Predictability becomes more important than paying the lowest amount possible per query.
I'm a co-founder at https://www.intermix.io - we provide monitoring for data infrastructure
> For example, if Netflix charged you a penny per minute of watchtime, you'd have no idea if it's more expensive or less expensive, but you'd be assured it's more efficient.
The Netflix analogy is actually quite good. Using it: say I have a family of six, I'm billed $0.01 per minute of viewership, and I'm getting the best quality and speed. Each person watches 40 minutes a day, every day for 30 days, for a grand total of $72. Far greater than the fixed cost of $10.99 (with, say, significantly less quality, speed, and minutes per month of viewership).
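The arithmetic behind that $72, spelled out (all numbers are the commenter's hypotheticals):

```python
# Metered billing: 6 viewers x 40 min/day x 30 days at $0.01/minute.
people = 6
minutes_per_day = 40
days = 30
per_minute_rate = 0.01

metered_bill = people * minutes_per_day * days * per_minute_rate  # $72.00
flat_bill = 10.99  # hypothetical flat-rate alternative
```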
In the real world, the family of six is my team of data analysts, scientists, and BI folks who are querying my database from 9-5 every weekday.
A customer of mine, a large NYC publisher, who you have heard of and probably read, evaluated BigQuery in 2017. They loaded BQ with the exact same data as their Redshift cluster, pointed their Looker instance at it, and in just one day blew through a quarter of their typical Redshift budget. All the queries were faster. Way faster than they needed to be, actually.
Going back to my original post above, the issue here is that just because Google CAN throw these massive amounts of resources at my problem doesn't mean I can afford to use that level of computation for each query, or would even want to. In the Netflix example, I'm happy if Sally and Billy get lower quality or limited time watching Netflix, as long as my bill stays at $10.99.
For most companies, their Redshift cluster is optimized to handle their peak workload WELL ENOUGH. That means queries won't be as fast as on BigQuery - and that's totally fine. And it means the cluster will be underutilized for large portions of nights and weekends - again, totally fine. They just need their usage capped at a predetermined cost, and their queries finishing in a reasonable amount of time.
I've posed this to Google employees before and I'm hit with "well, you can limit how much each person can query a day." Except that isn't an acceptable solution. I can't have analysts sitting around unable to query their database because they've exceeded their daily limit. They'd rather just fire off their Redshift query, and if it takes a little bit longer, so be it.
We've run into the same issue, where being curious with BigQuery actually becomes problematic: users are perfectly fine waiting an extra minute to scan 10TB, but are afraid of the $50 bill that comes with it - especially for every little query that might be mistyped, or that is out of their hands when using BI tools that run queries of their own.
The Netflix analogy also applies if they bill $0.001 per minute; the number I chose is not indicative of reality, I was just making a point. My goal was to demonstrate that it's more efficient per resource.
I might be biased since I'm running an analytics company and this is my core business but here's our alternative:
The query and storage layers should be separated, because often the expensive one is the query layer. We have API servers which ingest the data from SDKs, enrich/sanitize it, and send it to a commit log such as Kinesis or Kafka. One API server is capable of handling 10k r/q.
We have Kinesis/Kafka consumers that consume the data in micro-batches, convert it to ORC files, store them on S3, and index the metadata in a single MySQL server. The throughput is around 15k r/q.
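A minimal sketch of that micro-batching loop (all names hypothetical; a real sink would write the batch as an ORC file to S3, e.g. via pyarrow, and record its metadata in MySQL):

```python
# Buffer events from the stream and flush a batch once it reaches a size
# threshold; a real consumer would also flush on a time threshold.
BATCH_SIZE = 3

class MicroBatcher:
    def __init__(self, sink):
        self.buffer = []
        self.sink = sink  # callable persisting one batch (e.g. ORC -> S3)

    def add(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= BATCH_SIZE:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(list(self.buffer))
            self.buffer.clear()

batches = []
consumer = MicroBatcher(batches.append)
for i in range(7):
    consumer.add({"id": i})
consumer.flush()  # flush the trailing partial batch
```

Larger batches mean fewer, bigger ORC files (better for S3 and Presto) at the cost of higher end-to-end latency.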
S3 or Cloud Storage is cheap, reliable, and can be used for many other use-cases as well, but you need high-memory nodes for the query layer if you're dealing with large volumes of data.
We have a Presto cluster which spins up when you need to run ad-hoc queries and allows you to pre-materialize the data so that you can power your dashboards. If you're running expensive queries on raw data, you can spin up 10 Presto workers, execute the queries, and then just shut them down when you don't need them.
That way, we're able to access all the historical data, with no extra cost for data ingestion and storage, and it's even better than serverless since it's much cheaper. The system can scale automatically based on the hour of the day (your analysts' working hours) and the CPU and memory load, so unfortunately "BigQuery's on-demand model" is not a killer feature anymore.
One clarification. BigQuery has two methods of ingest. The Streaming API you mention does carry additional cost, but for the added benefit of having your data appear in BigQuery in real-time.
Batch load, however, is entirely free - as in, it doesn't use your "query capacity", and we don't charge for it. I am rather certain this is a very compelling offering. Batch loads encode your data, replicate it, secure it, convert it into our format, and fix any issues your files may have - optimal file sizes, optimal number of files, optimal file groupings for your queries - lots of subtle things that either slow you down or cause you headaches otherwise. We also manage all the metadata for you (your MySQL instance). We also burn a good chunk of resources post-load re-materializing and optimizing your dataset. And we handle upgrades, downtime, and so on.
It sounds like your use case is micro-batch, so perhaps you could benefit from our free batch loads?
It also sounds like you don't mind the extra operational overhead, and would rather operate your own stack, run your own upgrades, fix your own file issues, and so on. This is a personal preference, I agree. Customers who prefer our model love the ease of use and would rather focus their energy elsewhere. It is a stated goal of BigQuery to abstract away complexity and make BigQuery as easy to use as possible for everyone, and if we're failing somewhere, we certainly want to know :)
Finally, to characterize BigQuery's on-demand pricing: it gives you the ability to go from 0 cores to thousands and back to 0 in sub-second intervals, scoped down to individual query size. This is exactly what's happening under the hood, but all of that is abstracted away behind a "run query" button.
Please note that this is our core business. Of course we maintain these services, but we also try hard to push more work to cloud providers. For example, both AWS and GC offer managed MySQL servers, auto-scaling groups for nodes, and object stores such as S3 and Cloud Storage, so we actually maintain the software and help customers upgrade it when they need to.
I agree that BigQuery will save time if the users are not familiar with distributed systems and big data, but again, "the ability to go from 0 cores to thousands" doesn't make sense to me because in practice I have never experienced such a case. Since we need to start the instances, it may take up to 2-3 minutes, but this is often acceptable for data analysts.
Could you please elaborate on the part about "re-materializing and optimizing your dataset"? We do a number of optimizations such as compacting ORC files, bucketing, etc., but I would love to hear how BigQuery does post-processing.
As a note, I usually tend to simplify things, but most of the BigQuery customers that I see do over-engineering because of cost optimization. For example, in the article the author uses Redshift for the dashboard data; the solution they use for moving data from BigQuery to Redshift also needs to be maintained, and that's not easy. If I'm going to adapt my whole system to the way BigQuery works and push hard to save costs, then I expect it to be pretty cheap, but $40K for reserved slots doesn't sound cheap to me. We maintain similar-size clusters for 20% of this price for the same data volume, and it's much more flexible.
We all agree that Redshift stops scaling at some point and people switch to other technologies when they need to, but that's not the point. Creating a Redshift cluster is dead easy, and getting the data into it is also not complex compared to other solutions if you're already an AWS user.
The case for Snowflake is different: they also use AWS, but they manage your services. They did make a smart move by storing the data in their customers' S3 buckets in order to make them feel like they "own" the data, but that's possible only because AWS lets them do it that way.
I believe that AWS doesn't try to make Redshift cost-efficient for large volumes of data because they already have the advantage of vendor lock-in and are making money from their large enterprise customers processing billions of data points in Redshift. That's why there are so many Redshift monitoring startups out there to save the cost for you.
On the other hand, AWS is smart enough to build a Snowflake-like solution when their Redshift users start switching to BigQuery and Snowflake, and companies such as Snowflake need to be prepared for when that day comes. Cloud is killing everyone, including us.
It's true that AWS has Redshift Spectrum (and Athena) to help with more scalable querying across S3; however, I don't think that poses a big risk to another company providing a focused offering on top. Snowflake is very well capitalized, with close to $500M in investment and plenty of customers, so I wouldn't worry about them going out of business.
I especially like them because they have the elastic computing style of BigQuery but charge for compute time rather than data scanned, which is a much more effective billing model than anything else out there.
Anyway, at this point I'm just repeating myself so I suggest you actually try them if you care for a better model. It works for us at 600B rows of data that was too expensive with BigQuery and too slow and complicated with Redshift/Spectrum.
Also, Snowflake Data is another option that supports automatic provisioning and pausing of resources, which is even easier to manage than Redshift. Changing BigQuery pricing to be based on compressed data stored and scanned would go a long way towards making it more attractive for full-time usage.
What exactly does that mean?
This is fine for batch workloads where 'real time', i.e. latency of < 5 minutes, ideally < 1 minute, isn't much of a concern.
But it adds complication if you need to make data available quickly, because you end up adopting some hybrid approach where you have 'recent' data in a database and everything else in S3 - meaning your query layer has added complexity.
I believe this is how BigQuery works with streaming inserts, where recently streamed data is actually stored in Bigtable and asynchronously copied into Capacitor over time (this might be outdated information).
So while BigQuery seems expensive, the other solutions have a lot of other costs that you need to factor in!
BigQuery has high throughput but also very high latency, and the shared tenancy means unpredictable query times, with the same SQL taking 10 seconds or 60. They are working on both, but it'll be a while before any changes. DML MERGE and DDL statements are now in beta, though, which resolved other big obstacles in automation.
It's a great system but at this point I'd only recommend it for occasional but heavy queries that need to scan massive datasets with complex joins or for some ETL uses. Perhaps if they charged for compressed storage and scans then it would be different but right now Snowflake Data is a much more usable system day to day.
“What do you think a phase conjugate tracking system is for, Kent?”
Great. You made a system to track a billion people a day. You’re murdering privacy and then bragging about it. And bragging about it during a giant shitstorm caused by Facebook. The fuck is wrong with you?
Is there some magic scale threshold for excessive force? No. There is no fine line because you shouldn't be anywhere near the line.
100 billion. It's not 'lots', or millions, or even a billion we're talking about here. It's one hundred billion. That's over 80 events per day on average for every human they've ever seen (1.2 billion, per their about section). If they see half of those on any given day, that's 160 events per person, maybe 200. Per day.
Fine-grained tracking you do on your own site to determine why people leave and whether they see your new content? I could see my boss asking for that. I wouldn't be enthusiastic; I might make excuses that I was too busy doing other things to help. I might even complain. But we probably aren't running that forever anyway, because it tells you less and less over time but still costs the same.
But this is an ad network, not a usability study. We are currently busy handwringing about ad networks, and I'm a little taken aback by the dissonance here.
Those ads do have owners, paying a lot of money both to distribute their content and to see how people are interacting with it, so it's the same thing as a website. And there are lots of tracking events to capture, so it's easy to add up to billions - but as the article states, it's not all user events. Only 10B come from users, with everything else being backend server logs.
> I wouldn't be enthusiastic. I might make excuses that I was too busy doing other things to help
Ok... so being dishonest to avoid doing your job is fine? If you're against it, then don't do it - but what is the point you're trying to make here?
> Ok... so being dishonest to avoid doing your job is fine?
What does an ethics class have to do with this? I'm sure the question of whether it's acceptable to lie at your job to avoid doing a particular project would be far more interesting to study anyway; I'll be sure to bring it up the next time I teach one.