Another emerging architecture I see myself investigating is the Data Sewer Pattern, where huge loads of useless data are dumped onto millions of unsuspecting entities via social media.
So Lakehouse is not really an evolution of data warehouses, or at least not of newer ones like ClickHouse and Druid. SQL data warehouses are highly optimized for analytic query speed: think columnar storage, high compression, vectorized query execution, materialized views, etc. They also couple well with event streams. You can't get high performance without optimized storage and very tight integration of the parts.
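To make the columnar/vectorized point concrete, here's a toy illustration in Python. It's illustrative only, not a benchmark; a real engine adds compression, SIMD, code generation, and much more:

    # Toy illustration only: a row-at-a-time scan vs a vectorized scan
    # over a column, which is roughly the gap that columnar storage +
    # vectorized execution exploit.
    import time
    import numpy as np

    n = 2_000_000
    rows = [{"id": i, "value": float(i % 100)} for i in range(n)]  # "row store"
    col = np.array([r["value"] for r in rows])                     # "column store"

    t0 = time.perf_counter()
    total_rows = sum(r["value"] for r in rows)  # interpret each row
    t1 = time.perf_counter()
    total_col = col.sum()                       # one tight vectorized loop
    t2 = time.perf_counter()

    print(f"row-at-a-time: {t1 - t0:.3f}s  vectorized: {t2 - t1:.3f}s")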
I have massive respect for Ali and Matei but there's no way Lakehouse will replace this.
Edit: replaced "original" with "survey".
I don't think your argument holds here at all. It's a common misconception that high performance requires tight coupling of storage and query processing.
"Think columnar storage, high compression, vectorized query, materialized views, etc." All of those are possible in Lakehouse, and all but one (materialized views) are fully implemented on Databricks. And the remaining one isn't far away either (materialized views is really just incremental query processing + view selection, and neither problem has much to do with storage).
In fact the Lakehouse paper seems to be setting up a strawman. Here are three examples.
* The new low-latency SQL data warehouses are open source. They are not locking data in proprietary formats. We're not Snowflake.
* SQL data warehouses are already headed toward support for object storage for the same reason everyone else is: cost and durability for large datasets. Here's just one example of many: https://altinity.com/blog/tips-for-high-performance-clickhou...
* Not everyone cares about ML and data warehouse integration. From my experience working on ClickHouse, only a small percentage of users integrate ML. By contrast, 100% of our users care about efficient visualization and keeping data pipelines as short as possible, hence the benefit of a tightly integrated server.
I think there's actually a bifurcation of the market into low-latency use cases driven by event streams versus much larger datasets containing unstructured/semi-structured data stored in low-cost object storage. Lakehouse addresses the latter. SQL data warehouses are focused on the former. I don't see one "winning"--both markets are growing.
I was already thinking it would be great to get a lakehouse presentation. If you are interested, please submit a proposal!!
It used to mean "data lake extended to support data warehouse use cases".
So something like HDFS or S3 with Delta (from DBX) or Apache Iceberg storage formats, using Spark or Presto/Trino or something similar for compute. One unified platform built on scalable big data technologies that can do transactions, SQL MERGE, smart partitioning, and other bells and whistles.
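To make that concrete, here's a rough PySpark + Delta sketch of a transactional upsert (SQL MERGE) against object storage. The bucket, paths, and columns are made up, and it assumes the delta-spark package is installed:

    from pyspark.sql import SparkSession
    from delta import configure_spark_with_delta_pip
    from delta.tables import DeltaTable

    builder = (SparkSession.builder
               .appName("lakehouse-merge-sketch")
               .config("spark.sql.extensions",
                       "io.delta.sql.DeltaSparkSessionExtension")
               .config("spark.sql.catalog.spark_catalog",
                       "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    # New batch of records staged as plain Parquet.
    updates = spark.read.parquet("s3://my-bucket/staging/events/")

    # Transactional upsert straight into the lake: no separate warehouse.
    target = DeltaTable.forPath(spark, "s3://my-bucket/lake/events")
    (target.alias("t")
           .merge(updates.alias("s"), "t.event_id = s.event_id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())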
Then AWS decided to unveil "AWS LakeHouse", which meant you have both S3 and Redshift and use them at the same time - lake and warehouse next to each other.
This is not what lakehouse meant until then. It is also terrible design: having data in two places means you now have to implement access control, logging, auditing, data access and so on twice. You also have to sync data between the two stores, keep track of what lives where, and keep track of which copy is the single source of truth.
Truly idiotic design / marketing that could only have come from AWS. But since any larger company has an army of "enterprise architects" who went from "nobody was ever fired for recommending IBM" to "nobody was ever fired for recommending Oracle" to "nobody was ever fired for recommending AWS", and who will just internally enforce whatever bullshit the vendor pushes on them ... it is almost what "lakehouse" means nowadays.
AWS truly is the Oracle of 2020s. Fuck them.
(Rant over, sorry, got carried away)
I've understood and implemented it differently. With Spectrum (or PolyBase for SQL Server / Synapse), you can extend into the data lake. Copy over aggregated/curated data, or whatever you need for special use cases, into the warehouse; leave the structured, columnar data in the cheap storage. You pay per scan, but it is cheap (at least up to a point).
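Roughly, the pattern looks like this. A hedged sketch only: the cluster host, IAM role, bucket, and table names are all placeholders:

    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.example.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="admin", password="...")
    conn.autocommit = True  # Redshift external DDL can't run in a transaction
    cur = conn.cursor()

    # Map an external schema onto a Glue data catalog database.
    cur.execute("""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS lake
        FROM DATA CATALOG DATABASE 'lake_db'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
        CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """)

    # The external table is just a pointer at columnar files in cheap storage.
    cur.execute("""
        CREATE EXTERNAL TABLE lake.events (
            event_id bigint,
            device_id bigint,
            event_ts timestamp
        )
        STORED AS PARQUET
        LOCATION 's3://my-bucket/lake/events/';
    """)

    # Queries can join cold S3 data against hot internal tables; you pay
    # per byte scanned on the Spectrum side.
    cur.execute("""
        SELECT d.region, count(*)
        FROM lake.events e
        JOIN public.devices d ON d.device_id = e.device_id
        GROUP BY d.region;
    """)
    print(cur.fetchall())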
Also, Databricks took the Lakehouse moniker and sprinted with it. AWS was late to the game from what I saw (at least for marketing terminology adoption).
With RA3 Redshift, you pay S3 storage rates for internal data as well, so unless you use the S3 data with something else, I don't see much point in using Spectrum.
Still, something like Snowflake works much better. They actually seem to have a vision and not just an "us too!" like AWS.
Although I don't know how Redshift compression compares to something like gzipped Parquet. Maybe the data ends up taking more space, and thus more money.
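One quick, unscientific way to at least eyeball the Parquet side of that comparison locally is pyarrow. Synthetic data here; real tables compress very differently:

    import os
    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "id": np.arange(1_000_000),
        "value": np.random.randn(1_000_000),
        "category": np.random.choice(["a", "b", "c"], 1_000_000),
    })

    # Write the same table under different codecs and compare file sizes.
    for codec in ["none", "snappy", "gzip", "zstd"]:
        path = f"/tmp/events_{codec}.parquet"
        pq.write_table(table, path, compression=codec)
        print(codec, os.path.getsize(path) // 1024, "KiB")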
Agreed on that elasticity.
As an outsider to this whole movement, data lakes have always seemed like a FOMO product. Like they've heard about big data, but they don't have much data, so they just start piling up stuff until it's "big". Also they don't know what analysis they even want to do with it, so there's no structure.
We need to go a little bit deeper. I can sense that we are just a few steps away from circling all the way back around to fancy terminology for "PostgreSQL installed on a big server".
--edit: no idea why my post struck a wrong chord somewhere. Looks like the parent comment was not meant as a joke?
The biggest challenge I have is dealing with the fact that I now have an impressive number of "Python developers" doing whatever they can to solve their problems. I genuinely think we're improving the ability of our business to do analytics on top of our enterprise data, but sometimes I worry about the amount of technical debt we've just allowed to accrue.
Seriously though, I expect data people to understand the value of clear, descriptive naming. Communicate meaning, not marketing speak.
Handwaves swat away all technical questions. Salespeople, of course, turn to the execs and promise a magic bullet.
One month later "You're all porting your datastores to DATALAKE INTERNATIONAL SYSTEMS".
Databricks and Snowflake are both pulling it off, giving a combination of an RDBMS-like experience and a data lake experience. If it can be pulled off, it strips away a lot of the complexity in how big companies manage data.
All we need is a database that can do analytical queries (and ideally OLTP) and can scale. We don’t need lakes, ponds, swamps, lake houses, sparks, …
BigQuery got it right.
I don't get it... Looks to me like dbt is a Python SQL wrapper / big library that, among other things, includes an SQL generator / something like that -- but not "pure" SQL?
You can give it truly pure SQL in both models and scripts, and mix in Jinja if you need it for dynamic models. But I'd recommend at least using ref/source.
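For the curious, here's a toy illustration of what that mixing looks like: a dbt model is just SQL with Jinja in it, and ref is a Jinja function. The stub ref below is a stand-in for dbt's real one, which also records dependencies between models; here it only expands a name:

    from jinja2 import Template

    model_sql = """
    SELECT o.customer_id, sum(o.amount) AS total_spend
    FROM {{ ref('stg_orders') }} o
    GROUP BY o.customer_id
    """

    def ref(model_name):
        # dbt would resolve this to the right database/schema/alias
        return f"analytics.{model_name}"

    print(Template(model_sql).render(ref=ref))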
What does that mean?
datalakehouse: I have no idea.
data warehouse = Oracle, SAP, BigQuery, etc. (backed by a database, SQL interface)
data lakehouse = Spark, Presto, Databricks, Snowflake (a warehouse backed by a data lake)