
Databricks open-sources Delta Lake to make data lakes more reliable - solidangle
https://techcrunch.com/2019/04/24/databricks-open-sources-delta-lake-to-make-data-lakes-more-reliable/
======
georgewfraser
There's a lot of confusion around data lakes. One source of confusion is that
"data lake" versus "data warehouse" is often presented as a choice, where you
can have either:

1\. A data lake, where all data is stored in its native format (CSV, JSON,
...), in an object store (S3, GCS, ...), with the schema defined on read
(Hive, Presto, ...).

2\. A data warehouse, where all the data is organized into highly structured
tables (star schema) in a commercial database (Snowflake, Redshift, ...).

This is a false choice! Modern data warehouses, particularly Snowflake and
BigQuery, are fully capable of storing semi-structured data.

Furthermore, you do not need to curate your data into a star schema before
loading it. The ideal way to set up a modern data warehouse is to establish a
"staging" schema that matches the source, and then transform that data into a
star schema or data marts using SQL. In this scenario, your "data lake" and
"data warehouse" are just two different schemas within the same database.

There are still some scenarios where it makes sense to build a data lake in
addition to a data warehouse, primarily future-proofing. I wrote a blog post
where I tried to outline these scenarios:
[https://fivetran.com/blog/when-to-adopt-a-data-lake](https://fivetran.com/blog/when-to-adopt-a-data-lake)
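
As a rough illustration of the staging-then-transform setup described above, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a real warehouse. The table and column names (staging_orders, dim_customer, fact_orders) are made up for the example, and table-name prefixes stand in for separate schemas.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "Staging" table: mirrors the source system as-is.
    CREATE TABLE staging_orders (
        order_id INTEGER,
        customer_name TEXT,
        customer_country TEXT,
        amount REAL
    );
    INSERT INTO staging_orders VALUES
        (1, 'Ada',   'UK', 10.0),
        (2, 'Ada',   'UK', 5.5),
        (3, 'Linus', 'FI', 7.25);

    -- Star schema derived from staging with plain SQL.
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,
        name TEXT,
        country TEXT
    );
    INSERT INTO dim_customer (name, country)
        SELECT DISTINCT customer_name, customer_country FROM staging_orders;

    CREATE TABLE fact_orders AS
        SELECT s.order_id, d.customer_key, s.amount
        FROM staging_orders s
        JOIN dim_customer d
          ON d.name = s.customer_name AND d.country = s.customer_country;
""")

# Analysts query the star schema; the raw staging rows stay available.
totals = conn.execute("""
    SELECT d.country, SUM(f.amount)
    FROM fact_orders f
    JOIN dim_customer d USING (customer_key)
    GROUP BY d.country
    ORDER BY d.country
""").fetchall()
print(totals)
```

The point is only that the raw copy and the curated copy live side by side in one database, and the transformation between them is ordinary SQL.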

~~~
nitrogen
Has anyone written about privacy implications of data lakes and data
warehouses? The Extract in ETL is usually supposed to filter out private data,
but if instead all of the raw native data is dumped into a data lake, what
ensures that data is handled with the same care as the individual systems that
normally handle the data? What stops some random business analyst from running
individual or aggregated queries that would be contractually or legally
forbidden?

~~~
dikei
The solution is to divide your Data Lake into different zones with access
control, so that users can only access what they're allowed to. That said, it's
a lot of work to do this properly, so it's often neglected.
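
A very small sketch of the zone idea, assuming zones are top-level path prefixes and access is granted per role. The zone names, roles, and paths are hypothetical, and a real deployment would enforce this in the storage layer (for example with bucket or IAM policies) rather than in application code.

```python
# Hypothetical zones and the roles allowed to read each one.
ZONE_READERS = {
    "raw": {"data_engineer"},
    "cleaned": {"data_engineer", "analyst"},
    "aggregated": {"data_engineer", "analyst", "business_user"},
}

def zone_of(path: str) -> str:
    """Return the zone, taken to be the first path component."""
    return path.strip("/").split("/", 1)[0]

def can_read(role: str, path: str) -> bool:
    """Allow the read only if the role is listed for the path's zone."""
    return role in ZONE_READERS.get(zone_of(path), set())

print(can_read("analyst", "/raw/crm/customers.json"))
print(can_read("analyst", "/cleaned/crm/customers.parquet"))
```

Unknown zones are denied by default, which is the safer failure mode when a new prefix shows up before anyone has written a policy for it.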

------
ekzhu
We (data curation lab at Univ of Toronto) are doing research in data lake
discovery problems. One of the problems we are looking at is how to
efficiently discover joinable and unionable tables. For example, find all the
rental listings from various sources to create a master list (union); or find
tables such as rental listings and school districts that can be used to
augment each other (join). The technical challenges in finding joinable and
unionable tables in data lakes involve the following: (1) the data schema is
often inconsistent and poorly managed, so we can’t simply rely on that schema;
and (2) the scale of data lakes can be in the order of hundreds of thousands
of tables, making a content based search algorithm expensive. We came up with
some solutions that are based on data sketches with several published papers
[1,2,3]. The Python library “datasketch” was a byproduct of this work.

Many challenges remain though, and we would like to explore some of the more
pertinent ones. In fact, we are conducting a survey to understand the current
state of data lakes in industry and the challenges experienced. If you're
interested in learning more, see what we came up with here:
[https://www.surveymonkey.com/r/R7MYXSJ](https://www.surveymonkey.com/r/R7MYXSJ)
\- would love to see what the HN community thinks about the current state of
data lakes.

[1]
[http://www.vldb.org/pvldb/vol9/p1185-zhu.pdf](http://www.vldb.org/pvldb/vol9/p1185-zhu.pdf)
[2]
[http://www.vldb.org/pvldb/vol11/p813-nargesian.pdf](http://www.vldb.org/pvldb/vol11/p813-nargesian.pdf)
[3]
[http://www.cs.toronto.edu/~ekzhu/papers/josie.pdf](http://www.cs.toronto.edu/~ekzhu/papers/josie.pdf)
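
To give a feel for what a data sketch buys you here, below is a minimal pure-Python MinHash sketch that estimates the Jaccard overlap between two columns from fixed-size signatures, so the full column contents never have to be compared. This is not the datasketch API and not necessarily the exact measure or algorithm used in the cited papers; the helper names and the sample columns are assumptions made for the example.

```python
import hashlib

def _hash64(value: str, seed: int) -> int:
    """Seeded 64-bit hash; the seed is passed as the blake2b salt."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")

def minhash_signature(values, num_perm=128):
    """For each seeded hash function, keep the minimum hash over the column."""
    distinct = set(values)
    if not distinct:
        raise ValueError("cannot sketch an empty column")
    return [min(_hash64(v, seed) for v in distinct) for seed in range(num_perm)]

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

# Two hypothetical address columns sharing 200 of 400 distinct values.
listings_a = [f"{n} Main St" for n in range(0, 300)]
listings_b = [f"{n} Main St" for n in range(100, 400)]

estimate = estimate_jaccard(minhash_signature(listings_a),
                            minhash_signature(listings_b))
print(f"estimated Jaccard: {estimate:.2f} (true value is 0.50)")
```

Each column is reduced to 128 integers regardless of its size, which is what makes comparing hundreds of thousands of tables feasible. An index such as LSH is then needed to avoid comparing every pair of signatures.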

------
alexchamberlain
I don't really understand the concept of Data Lake, and wikipedia isn't
helping much... is it just a buzzword for a collection of data stores?

~~~
based2
[https://martinfowler.com/bliki/DataLake.html](https://martinfowler.com/bliki/DataLake.html)

~~~
MR4D
Awesome link!

The key line for me:

"The data lake stores _raw_ data, in whatever form the data source provides."

The emphasis on "raw" was his, not mine.

------
mobileexpert
Other efforts to improve Parquet datasets on cloud storage:

[https://github.com/apache/incubator-iceberg](https://github.com/apache/incubator-iceberg)

[https://github.com/apache/incubator-hudi](https://github.com/apache/incubator-hudi)

Happy to see Delta go open source.

------
MrPowers
They tried to keep it closed and sell it as a premium service, but looks like
they need help from the open source community to make the product better.
Great to see. Databricks has its roots in open source (the founder created
Spark) and it's great that they're still making a lot of open source code
rather than making everything private.

------
mmrezaie
What are the other alternatives for data lakes that can be used (both open
source and closed)?

~~~
zjaffee
Apache Iceberg is probably the closest product to what Databricks is open
sourcing, but none of these products covers everything that's needed for data
lake management.

What these products do is make it as easy to use decoupled storage and compute
for your analytics system as it would be to use a fully managed analytics
DBMS.

~~~
groodt
Yes, when I heard about Delta I thought the same. Would love to see a
comparison between Delta and Iceberg. I wonder if Ryan Blue is on HN.

------
tlrobinson
I appreciate the thought TechCrunch put into the image representing ACID-
compliant data lakes.

------
dikei
What I don't like about these ACID storage layers is they reduce compatibility
between different query engines. For example, Spark cannot read Hive ACID
tables natively and Hive cannot read Spark Delta tables either. Then there's
other tools such as Presto or Drill which can read neither.

When you use an ACID storage layer, you're kinda locked into one solution for
both ETL and query, which is not nice.

------
playing_colours
Recently I wanted to learn more about data lakes, and how to design and
maintain them.

There is a lot of information in articles, blogs, but I prefer books as a
solid source of structured and aggregated information.

Surprisingly, I found just a single proper book on the topic:
[https://www.amazon.com/Enterprise-Big-Data-Lake-Delivering/d...](https://www.amazon.com/Enterprise-Big-Data-Lake-Delivering/dp/1491931558)

------
iblaine
What's the difference between a Delta Lake and Change Data Capture? Seems like
in both cases you're creating a type 2 dimension against a source table.

~~~
atwebb
It is a different technology entirely. CDC is just the log of changes on a
relational table. Delta Lake appears to be providing more native
administrative capabilities to a data lake implementation (schemas,
transactions, cataloging).
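
To make the CDC half of this concrete, here is a minimal sketch of replaying a change log into type 2 dimension rows, where each row carries a validity interval. It says nothing about how Delta Lake works internally; the Change and DimRow shapes and the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Change:
    """One hypothetical CDC log entry."""
    ts: int                # commit timestamp
    key: str               # primary key of the source row
    value: Optional[str]   # new value; None means the row was deleted

@dataclass
class DimRow:
    """One version of a row in a type 2 dimension."""
    key: str
    value: str
    valid_from: int
    valid_to: Optional[int] = None  # None means this is the current version

def apply_cdc(log):
    """Replay the log in commit order, closing the old version on each change."""
    rows = []
    current = {}  # key -> currently open DimRow
    for change in sorted(log, key=lambda c: c.ts):
        open_row = current.pop(change.key, None)
        if open_row is not None:
            open_row.valid_to = change.ts
        if change.value is not None:
            new_row = DimRow(change.key, change.value, change.ts)
            rows.append(new_row)
            current[change.key] = new_row
    return rows

log = [
    Change(1, "a", "x"),
    Change(5, "a", "y"),
    Change(7, "b", "z"),
    Change(9, "a", None),  # delete
]
for row in apply_cdc(log):
    print(row)
```

The log itself is just the list of changes; the type 2 dimension is something you build from it. That is the distinction being drawn above between CDC as a feed and a storage layer that manages tables.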

------
huac
Are there other ways of implementing ACID transactions on Spark tables?

~~~
StreamBright
What do you mean by Spark tables? Generally speaking it is a bad idea to try to
combine ACID with data warehouses.

------
FridgeSeal
Sounds cool, but then I'd have to use Spark...

------
5874-4b22-a4e0
Cloud -> Data lake -> Data stream -> Data Ocean -> Cloud

