What Is the Data Lakehouse Pattern? (timeflow.systems)
68 points by benjaminwootton 7 days ago | 49 comments





Lately we have moved to the 'Data Pond' pattern, where a data store is attached directly to each microservice for all CRUD operations. And of course we use the 'Data Well' pattern for deep data analysis.

Another emerging architecture I see myself investigating is the 'Data Sewer' pattern, where a huge load of useless data is dumped onto millions of unsuspecting entities via social media.


I'm sure you know that a crucial pattern during corporate restructuring is the Data Three Gorges Dam, where, in the course of construction, hundreds of perfectly productive products staffed by thousands of developers are wiped away.

You got me

It's been increasingly hard to differentiate legitimate tech architectures/patterns that help under certain scenarios from marketing shenanigans packaged to sell you more of the same stuff. IMO, it is just natural evolution for such "lake house" patterns to emerge to address the shortcomings of "data lakes" (another term past its prime in terms of the hype around it). But the overhyped term "lakehouse" itself is actually hurting adoption, because it makes it sound like just marketing talk.

Lakehouse seems like an evolution of Hadoop to add better SQL and transactions + reasonable performance on large datasets. ("Reasonable" = not dog slow like Hive.) Reading this article as well as the survey paper by Armbrust, Ghodsi et al. [0], you might easily forget that a large fraction of new data warehouse use cases get real-time data from event streams like Kafka, not S3 or HDFS. They also require stable response times of a few milliseconds for the more demanding use cases.

So Lakehouse is not really an evolution of data warehouses, or at least not of newer ones like ClickHouse and Druid. SQL data warehouses are highly optimized for analytic query speed. Think columnar storage, high compression, vectorized query, materialized views, etc. They also couple well with event streams. You can't get high performance without optimized storage and very tight integration of the parts.
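To make that coupling concrete, here's roughly what event stream integration looks like in ClickHouse, as a minimal sketch (the topic, table and column names are made up):

    -- Kafka engine table that consumes the event stream
    CREATE TABLE events_queue (
        ts      DateTime,
        user_id UInt64,
        event   String
    ) ENGINE = Kafka
    SETTINGS kafka_broker_list = 'kafka:9092',
             kafka_topic_list = 'events',
             kafka_group_name = 'warehouse_consumer',
             kafka_format = 'JSONEachRow';

    -- Columnar MergeTree table the data lands in
    CREATE TABLE events (
        ts      DateTime,
        user_id UInt64,
        event   String
    ) ENGINE = MergeTree
    ORDER BY (user_id, ts);

    -- Materialized view continuously moves rows from Kafka into the table
    CREATE MATERIALIZED VIEW events_mv TO events AS
    SELECT ts, user_id, event FROM events_queue;

The materialized view is what keeps the pipeline short: rows flow from Kafka straight into columnar storage without a separate ETL job.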

I have massive respect for Ali and Matei but there's no way Lakehouse will replace this.

[0] http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf

Edit: replaced "original" with "survey".


Co-author of the paper here.

I don't think your argument holds here at all. It's a common misconception to think that high performance requires tight coupling of storage and query processing.

"Think columnar storage, high compression, vectorized query, materialized views, etc." All of those are possible in Lakehouse, and all but one (materialized views) are fully implemented on Databricks. And the remaining one isn't far away either (materialized views is really just incremental query processing + view selection, and neither problem has much to do with storage).


Thanks for your comment, and sorry if I was unclear. I'm not arguing that storage and compute need to be directly coupled. However, storage does need to be very carefully optimized to match compute, especially when you are trying to read events and make them immediately available. ClickHouse, for example, has multiple formats for table parts in order to allow efficient buffering of rapidly arriving records. Using customized formats has allowed the project to evolve quickly.
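As an illustration of what I mean by multiple part formats: MergeTree picks a compact or wide layout per data part based on size thresholds. A minimal sketch (the table name and threshold are made up; the setting itself is real):

    -- Small, frequent inserts land as compact parts; bigger ones as wide (file-per-column) parts
    CREATE TABLE events_raw (
        ts      DateTime,
        payload String
    ) ENGINE = MergeTree
    ORDER BY ts
    SETTINGS min_bytes_for_wide_part = 10485760;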

In fact the Lakehouse paper seems to be setting up a strawman. Here are three examples.

* The new low-latency SQL data warehouses are open source. They are not locking data in proprietary formats. We're not Snowflake.

* SQL data warehouses are already headed toward support for object storage for the same reason everyone else is: costs and durability in large datasets. Here's just one sample of many: https://altinity.com/blog/tips-for-high-performance-clickhou...

* Not everyone cares about ML and data warehouse integration. From my experience working on ClickHouse, only a small percentage of users integrate ML. By contrast, 100% of our users care about efficient visualization and keeping data pipelines as short as possible, hence the benefit of a tightly integrated server.

I think there's actually a bifurcation of the market into low-latency use cases driven by event streams versus much larger datasets containing unstructured/semi-structured data stored in low-cost object storage. Lakehouse addresses the latter. SQL data warehouses are focused on the former. I don't see one "winning"--both markets are growing.


p.s., If anyone wants to argue this point we're doing a conference on open source analytics on November 2. It's called OSA Con and the CFP is here: https://altinity.com/osa-con-2021/. It's non-partisan and free. We love all open source projects. :)

I was already thinking it would be great to get a lakehouse presentation. If you are interested please submit a proposal!!


Maybe I'm missing something, but what I need (and what the data team at my company spectacularly fails to deliver) is fast access to post-ETL data as well as pre-ETL data. I need the raw data sometimes, I need the processed data sometimes. What I get instead is no access to the raw data, and broken/slow/uninsured access to the processed data. I don't know about other people but in general I think a decent data warehouse would make me happy and the aspiration to be a datalake is what causes the data team to make me unhappy - they fail to provide the basics because they've heard too many buzzwords.

We use a data vault architecture as the raw/semi-structured source for a 'traditional' data warehouse built on Snowflake. Data vault gives more advanced users access to the raw data and Snowflake gives us all the scalability we need in terms of data volume. Will of course depend on the data model in your data warehouse but works well for us.

"Data Lakehouse" is sadly term ruined by AWS.

It used to mean "data lake extended to support data warehouse use cases".

So something like HDFS or S3 with Delta (from DBX) or Apache Iceberg storage formats, utilizing Spark or Presto/Trino or something for compute. One unified platform built on scalable big data technologies, that can do transactions, SQL MERGE, smart partitioning and other bells and whistles.
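For example, with Delta on Spark that looks roughly like this (a sketch; the bucket, table and column names are invented):

    -- Partitioned Delta table living directly on object storage
    CREATE TABLE sales (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DECIMAL(10,2),
        order_date  DATE
    ) USING DELTA
    PARTITIONED BY (order_date)
    LOCATION 's3://my-bucket/lake/sales';

    -- Transactional upsert straight against the lake
    -- (staged_updates stands in for whatever table/view holds incoming changes)
    MERGE INTO sales AS t
    USING staged_updates AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *;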

Then AWS decided to unveil "AWS LakeHouse" which meant you have both S3 and Redshift and use both at the same time - lake and warehouse next to each other.

This is not what lakehouse meant until then. It is also terrible design - having data in two places means you now have to implement access control, logging, auditing, data access and so on twice. You also have to sync data between the two storages, keep track of what is where, and keep track of what is the single source of truth.

Truly idiotic design / marketing that could only have come from AWS. But since any larger company has an army of "enterprise architects" who went from "nobody was ever fired for recommending IBM" to "nobody was ever fired for recommending Oracle" to "nobody was ever fired for recommending AWS", and who will just internally enforce whatever bullshit the vendor pushes on them ... it is almost what "lakehouse" means nowadays.

AWS truly is the Oracle of 2020s. Fuck them.

(Rant over, sorry, got carried away)


> It is also terrible design - having data in two places means you now have to implement access control, logging, auditing, data access and so on twice.

I've understood and implemented it differently. With Spectrum (or Polybase for SQL Server / Synapse), you can extend into the data lake. Copy over aggregate/curated data or anything you need for special use cases. Leave the structured, columnar data in the cheap storage. You pay per scan, but it is cheap (at least up to a point).
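The Spectrum side of that is roughly this (a sketch; the schema, IAM role and table names are made up):

    -- External schema backed by the Glue Data Catalog
    CREATE EXTERNAL SCHEMA lake
    FROM DATA CATALOG
    DATABASE 'lake_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/spectrum-role';

    -- External table over Parquet files in S3
    CREATE EXTERNAL TABLE lake.clicks (
        ts      TIMESTAMP,
        user_id BIGINT,
        url     VARCHAR(2048)
    )
    STORED AS PARQUET
    LOCATION 's3://my-bucket/lake/clicks/';

    -- Join lake data with a local Redshift table (users here is hypothetical)
    SELECT u.plan, COUNT(*)
    FROM lake.clicks c
    JOIN users u ON u.user_id = c.user_id
    GROUP BY u.plan;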

Also, Databricks took the Lakehouse moniker and sprinted with it. AWS was late to the game from what I saw (at least for marketing terminology adoption).


You can do "lakehouse" just with Redshift, but in AWS pictures, you'll see Glue Jobs, Glue Elastic Views, Sagemaker, Aurora ... it's a huge mess.

With RA3 Redshift, you pay the S3 storage cost for internal data as well, so unless you use the S3 data with something else, I don't see much point in using Spectrum.

Still, something like Snowflake works much better. They actually seem to have vision and not just "us too!" like AWS.


You pay the storage cost in S3 as well; depending on tier, Snowflake will not necessarily be a cost saving on compute either. Redshift could really use some elasticity beyond a factor of 2, and some warm resume features.

Just to clarify, I meant that Redshift RA3 storage costs as much as S3, so you're not saving much by keeping things in S3 instead of in Redshift.

Although I don't know how Redshift compression compares to something like gzipped parquet. Maybe the data ends up taking more space and thus money.

Agreed on that elasticity.


I love that the comments are perfectly split between real analysis and buzzword mockery.

As an outsider to this whole movement, data lakes have always seemed like a FOMO product. Like they've heard about big data, but they don't have much data, so they just start piling up stuff until it's "big". Also they don't know what analysis they even want to do with it, so there's no structure.


Is this new lake house going to have its own pool too?

We need to go a little bit deeper. I can sense that we are just a few steps away from circling all the way back around to fancy terminology for "PostgreSQL installed on a big server".


You could go pretty far with lakes and islands https://en.wikipedia.org/wiki/Recursive_islands_and_lakes#Is...

Fantastic. I was concerned that this might be a thing.

Yes, there'll be an option to upgrade your "Data Swimming Pool" into a "Data Infinity Pool" overlooking your own "Data Lake", with this extra enterprise feature for only $30k/month! /s

--edit: no idea why my post struck a wrong chord somewhere. Looks like the parent comment was not meant as a joke?


Isn't the point of building stuff on top of blob stores that it's too much data to be housed in normal RDBMSs, and the performance expectations are vastly different (run this report that needs to return results in days/weeks, not ms), so you can go way, way cheaper and slower with the storage?

Partly. It’s also because data modelling and cleaning can cause data warehouse implementations to drag on for years.

The source perpetuates this motherflowing trend: Tensorflow, Airflow, MLflow, Metaflow, KubeFlow, Timeflow.

Costflow

What is that?

Very recently implemented a "lakehouse" in Azure with Databricks and ADLS. So far the enterprise is pleased with it. Our traditional IT EDW developers like it because they can use modern software development practices (source control, CI, unit tests, etc.) to build their ETL jobs. Our data analysts like it because they can get access quickly to semi-raw data (we transform it into the Delta format before they can access it). As an architect I enjoy it because of how many options I have for moving data around, not to mention how easily and quickly things can scale.
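The "semi-raw" layer amounts to something like this on the Databricks side (a rough sketch; the storage account, container and table names are invented):

    -- Land raw JSON from the landing zone as a Delta table analysts can query directly
    CREATE TABLE bronze_orders
    USING DELTA
    LOCATION 'abfss://lake@ourstorageacct.dfs.core.windows.net/bronze/orders'
    AS SELECT * FROM json.`abfss://landing@ourstorageacct.dfs.core.windows.net/orders/`;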

The biggest challenge I have is dealing with the fact that I now have an impressive amount of "python developers" doing whatever they can to solve their problems. I genuinely think we're improving the ability of our business to do analytics on top of our enterprise data, but sometimes I worry about the amount of technical debt we just allowed to accrue.


Data werelake - it transforms data into information every full moon.

Seriously though, I expect data people to understand the value of clear, descriptive naming. Communicate meaning, not marketing speak.


These aren’t data people. These are people building hype with bullshit terminology on bullshit tech.

This smells like a Big Data Lie. A software salesman enters the room and says his special software package can convert your flood of incoming data into a simple little database. CAP? Doesn't exist. B+Tree scaling issues? What's that?

Handwaves swat away all technical questions. The salespeople of course turn to the execs and promise a magic bullet.

One month later "You're all porting your datastores to DATALAKE INTERNATIONAL SYSTEMS".


Too buzzwordy for me.

As I mention at the end of the article, it's definitely an almost laughable buzzword. I suspect whoever created it felt almost awkward using it. I do however think there is concrete meaning behind it.

Databricks and Snowflake are both pulling it off, giving a combination of an RDBMS-like experience and a data lake experience. If it can be pulled off, it strips away a lot of complexity in how big companies manage data.


Unstructured predictive big data algorithm unsupervised deep ML internet of things mining

Add 'blockchain' in there and my cheque will be straight in the post

Perfectly captures what companies like c3.ai do!

Is it wrong I want to invest in your pitch?

Personally, I think they should invest in their pitch.

The term definitely makes me cringe but it describes a solid and useful design evolution.

The (I believe) original Lakehouse paper is here: http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf

Seems unlikely: Databricks used the term in a blog post in January 2020 [0], while the linked paper suggests it was published in January 2021. Edit: Given that Databricks was involved in this paper, perhaps this is the first.. although both Databricks and Snowflake (and to a lesser extent Google, which favours lake/warehouse convergence [1]) have been using the term for a number of years.

[0]: https://databricks.com/blog/2020/01/30/what-is-a-data-lakeho...

[1]: https://cloud.google.com/blog/products/data-analytics/data-l...


Same authors.

I manage data infrastructure. Recently, I found I don't know what to call what we do, as many of the names are marketing and embarrassing buzzwords.

All we need is a database that can do analytical queries (and ideally OLTP) and can scale. We don’t need lakes, ponds, swamps, lake houses, sparks, …

BigQuery got it right.


I think using a data warehouse as your data lake or lake house is optimal. Even for data that isn't relational. Storage is so cheap now and is decoupled from compute costs for several providers that I don't even give it a thought. You get a fast, scalable SQL interface which is still nice and useful for non-relational data. Then all, or most, of the transformations needed for analysis can be pure SQL using a tool like DBT. In my experience, it greatly simplifies the entire pipeline.

> pure SQL using a tool like DBT

I don't get it... Looks to me like DBT is a Python SQL wrapper / big library that among other things includes an SQL generator / something else like that -- but not "pure" SQL?


DBT has two main innovations. First, everything is a SELECT statement and DBT handles all the DDL for you. You can handle DDL yourself if you have a special case too. Second, the ref/source macros build a DAG of all your models so you don't have to think about build order. There are other innovations but those are the main ones.

You can give it truly pure SQL in both models and scripts, and mix in Jinja if you need it for dynamic models. But I'd recommend at least using ref/source.
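To make that concrete, a model is just a file containing a SELECT (a made-up example; the model and column names are invented):

    -- models/daily_revenue.sql
    -- dbt wraps this SELECT in the CREATE TABLE/VIEW DDL for you;
    -- ref() resolves the relation name and adds an edge to the DAG.
    SELECT
        order_date,
        SUM(amount) AS revenue
    FROM {{ ref('stg_orders') }}
    GROUP BY order_date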


Still not sure I got all that -- gotta look into it some more -- but maybe I know a little more now. Thanks!

"Lakes are typically stored on low cost Object Storage such as AWS S3 where the data is more open."

What does that mean?


data lake = Spark or Presto on top of S3.

data lakehouse: I have no idea.


data lake = HDFS, GCS, Azure blob, S3, any blob store (backed by blob store, file based)

data warehouse = Oracle, SAP, BigQuery etc (backed by database, SQL interface)

data lakehouse = Spark, Presto, Databricks, Snowflake (warehouse backed by data lake)


datalake + data warehouse = lakehouse


