
Modern Data Lakes Overview - developer.sh
https://developer.sh/posts/delta-lake-and-iceberg
======
FridgeSeal
Having spent the last ~8 months at my work grappling with the consequences and
downsides of a Data lake, all I want to do is never deal with one again.

Nothing about it was superior or even on par with simply fixing the
shortcomings of our current OLAP database setup.

The data lake is not faster to write to; it’s definitely not faster to read
from. Querying using Athena/etc was slow, painful to use, broke exceedingly
often and would have resulted in us doing so much work stapling in schemas/etc
that we would have been net better off to just do things properly from the
start and use a database. The data lake also does not have better access
semantics and our implementation has resulted in some of my teammates
practically reinventing consistency from first principles. By hand. Except
worse.

Save yourself from this pain: find the right database and figure out how to
use it, don’t reinvent one from first principles.

~~~
threeseed
For smaller use cases data lakes probably don't make sense.

But data lakes have arisen from the enterprise where the centralised data
warehouse was the standard for the last few decades. They know how to use a
database. They know how to model and schema the data. And they know about all
of the problems it has. They didn't buy into the data lake concept because
it's trendy.

The fact is that for large enterprises, and for those with problematic data
sets (e.g. telemetry), databases simply don't scale. You will always have
priority workloads, e.g. reporting, during which users and non-priority ETL
jobs come second. And often Data Science use cases are banned altogether.

The reason data lakes make sense is that they offer effectively unlimited
scalability. You can have as many crazy ETL jobs, inexperienced users, and Data
Scientists as you like, all reading/writing at the same time with no impact.

Generally you want a hybrid model. Databases for SQL users and data lake for
everything else.

~~~
dikei
> The reason data lakes make sense is that they offer effectively unlimited
> scalability. You can have as many crazy ETL jobs, inexperienced users, and
> Data Scientists as you like, all reading/writing at the same time with no
> impact.

Eh, almost all Data Lakes cannot handle small files well. All it takes is for
someone to write 100 million tiny files into the Data Lake to make life
miserable for everyone else.

~~~
threeseed
So don't write small files?

Every time I've seen someone do this it was a mistake and quickly resolved.
Either you have way too many partitions in a Spark job or you are treating S3
like it's a queue. And if you really do need lots of delta records, then simply
run a compaction job.
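
For reference, a rough sketch of what such a compaction job can look like in
PySpark (paths and target file counts are made up; tune them for your data):

```python
from pyspark.sql import SparkSession

# Rough compaction sketch: read a partition full of tiny files and rewrite it
# as a handful of larger ones. Paths and numbers below are made up.
spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

partition_path = "s3://my-lake/events/dt=2020-02-01"

df = spark.read.parquet(partition_path)

# Aim for a few large files (~128-512MB each) instead of millions of tiny ones.
(df.repartition(8)
   .write
   .mode("overwrite")
   .parquet(partition_path + "_compacted"))

# Once verified, swap the compacted directory in place of the original
# (atomically if your table format / metastore supports it).
```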

~~~
dikei
Well, inexperienced users/data scientists tend not to care about what they
write out :)

Nevertheless, my point is that a Data Lake does not offer free unlimited
scalability. It takes a lot of effort and good engineering practice to make a
Data Lake run smoothly at scale.

~~~
derefr
Data scientists shouldn't be writing anything to the Data Lake. Data Lakes
store _raw_ datasets (sort of like Event Streaming databases store _raw_
events.) In academic terms, they store _primary-source_ data.

Once data has been through some transformations at the hands of a Data
Scientist, it's now a secondary source—a report, usually—and exists in a form
better suited to living in a Data Warehouse.

Data Lakes need a priesthood to guard their interface, like DBAs are for
DBMSes. The difference being that DBAs need to guard against misarchitected
read workloads, while the manager of a Data Lake doesn't need to worry about
that. They only need to worry about people putting the wrong things (=
secondary-source data) into the Data Lake in the first place.

In most Data Lakes I've seen, there are usually specific teams with write
privileges, where "putting $foo in the Data Lake" is their whole job:
researchers who write scrapers, data teams that buy datasets from partners and
dump them in, etc. Nobody else in the company _needs_ to write to the Data
Lake, because nobody else has raw data; if your data already lives in a
company RDBMS, you don't move it from there _into_ the Data Lake to process
it; you write your query to pull data from both.

An analogy: there is a city by a lake. The city has water treatment plants
which turn lakewater into drinking water and pump it into the city water
system. Let's say you want to do an analysis of the lake water, but you need
the water more dilute (i.e. with fewer impurities) than the lake water itself
is. What would you do: pump the city water supply _into_ the lake until the
whole lake is properly dilute? Or just take some lake water in a cup and pour
some water from your tap into the cup, and repeat?

------
cateye
Isn't it a paradox to store unlimited data so you can use it later for very
specific things, without having to define it first?

It sounds like common sense not to "limit the potential of intelligence by
enforcing schema on write", while in reality the same problem just shifts to
(or gets hidden in) the next steps.

For example: there are 10 data sources, each with 100TB of data. I aggregate
these into my shiny new data lake with a fast adapter. Just suck it all in
without worrying about schemas. So now I have 1PB of semi-unstructured data.

How do I find the fields X and Y when these are all named differently in 10
sources? Can I even find it without having business domain experts for each
data source? How do I keep things in sync when the structure of my data
sources changes (frequently)?

It seems like there is an underlying social/political problem that technology
can't really fix.

Reminds me of the quote: "There are only two hard things in Computer Science:
cache invalidation and naming things."

~~~
bradleyjg
> Reminds me the quote: "There are only two hard things in Computer Science:
> cache invalidation and naming things."

and off by one errors!

------
sologoub
Not a bad read, but it's written from the perspective of large, mature
operations. If your company is just starting out, the advice is actually in
there but not quite spelled out: use S3/GCS to store data (ideally in Parquet
format) and query it with Athena/BigQuery.
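
For what it's worth, a minimal sketch of that setup using the awswrangler
library (bucket, database and table names are made up):

```python
import awswrangler as wr
import pandas as pd

# Write a DataFrame to S3 as partitioned Parquet and register the table in the
# Glue catalog so Athena can query it. All names below are made up.
df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "amount": [9.99, 24.50, 5.00],
    "dt": ["2020-02-01", "2020-02-01", "2020-02-02"],
})

wr.s3.to_parquet(
    df=df,
    path="s3://my-startup-lake/orders/",
    dataset=True,
    partition_cols=["dt"],
    database="analytics",  # Glue database
    table="orders",
)

# Query it with Athena; the result comes back as a pandas DataFrame.
revenue = wr.athena.read_sql_query(
    "SELECT dt, SUM(amount) AS revenue FROM orders GROUP BY dt",
    database="analytics",
)
print(revenue)
```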

~~~
meritt
Fairly new to this topic and coming from a traditional RDBMS background. How
do you go about deciding how many rows/records to store per object? And how
does Athena/Bigquery know which objects to query? Do people use partitioning
methods (e.g. by time or customer ID etc) to reduce the need to scan the
entire corpus every time you run a query?

~~~
lmkg
From the Google side: In traditional BigQuery, the answers to all three
questions are related. You shard the files by partition key and put the key
into the file name. You can filter on the file name in the WHERE clause, and
the query will skip the filtered-out objects, but otherwise fully scan every
object it touches.

There is apparently now experimental support for using Hive partitions
natively. Never used it, literally found out two minutes ago.

The number of records per object is usually "all of them" (restricted by
partition keys). The main exception is live queries of compressed JSON or CSV
data, because BigQuery can't parallelize them. But generally you trust the
tool to handle workload distribution for you.

This works a little differently if you load the data into BigQuery instead of
doing queries against data that lives in Cloud Storage. You can use
partitioning and clustering columns to cut down on full-table scans.
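
To make the file-name filtering concrete, a rough Python sketch over Parquet
files in Cloud Storage (bucket, table and column names are made up):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Temporary external table over date-sharded Parquet files in GCS.
external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://my-bucket/events/*"]

job_config = bigquery.QueryJobConfig(
    table_definitions={"events": external_config}
)

# _FILE_NAME is a pseudo-column available on external tables; filtering on it
# lets the query skip objects whose names don't match the partition key.
sql = """
SELECT COUNT(*) AS n
FROM events
WHERE _FILE_NAME LIKE '%dt=2020-01-15%'
"""

for row in client.query(sql, job_config=job_config).result():
    print(row.n)
```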

~~~
sologoub
That’s basically how the GA export worked at my previous job - everything in a
session is nested. The upshot is basically what’s above - easy to filter, and
you don’t get partial data.

The catch is if you need to filter by a property of the session, you are
opening every session in range to check if it’s the one you want. That gets
expensive quickly and is a bit slow.

For data lakes, Parquet and Spark support fairly sane date partitioning.
Partitioning by anything else, such as a customer ID, is a question of whether
you need it - but remember, this is a data lake, not a source table for your
CEO's daily report. The purpose of the lake is to capture everything that you
sanely can.
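
For example, a minimal PySpark sketch of date-partitioned Parquet writes
(bucket and column names are made up):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("lake-ingest").getOrCreate()

# Raw events landed as JSON; add a date column and write Parquet partitioned
# by it. Paths and column names are made up.
events = spark.read.json("s3://my-landing-zone/events/2020-02-01/")

(events
    .withColumn("dt", F.to_date("event_timestamp"))
    .write
    .partitionBy("dt")        # one directory per day: .../dt=2020-02-01/
    .mode("append")
    .parquet("s3://my-lake/events/"))

# Readers filtering on dt only touch the matching directories.
spark.read.parquet("s3://my-lake/events/").filter("dt = '2020-02-01'").count()
```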

When you can’t store everything, usually due to cost, you then have to
aggregate and only keep the most valuable data. For example in AdTech, real-
time bidding usually involves a single ad request, hundreds of bid requests, a
few bid responses and the winning bid. Value here is inversely related to size
- bid requests without responses are useful for predicting whether you should
even ask next time, but the winning bid + the runner up tell you a lot about
the value of the ad request.

When structuring a warehouse for reporting/ad hoc querying, to me the flatter
the better - this uses the native capabilities of columnar stores and makes
analysis a lot faster. The downside: good luck keeping everything consistent
and up to date. Usually you end up just reprocessing everything each
day/hour/whatever the need is, and at a certain point you say no new updates
to rows older than X.

The cool thing about modern data warehouses is that they include interfaces to
talk to the data lakes, so your analysts don't have to jump between different
tool chains - for example Redshift Spectrum (which is basically Athena) and the
aforementioned BigQuery ability to use tables, streams and files from GCP.

It’s an incredibly productive time to be working with all this! Even 10 years
ago, you’d need a lot of budget and a team to just keep the lights on, today
it’s all compressed into these services and software.

------
georgewfraser
Modern data warehouses (Snowflake, BigQuery, and maybe Redshift RA3) have
incorporated all the key features of data lakes:

- The cost of storage is the same as S3.

- Storage and compute can be scaled independently.

- You can store multiple levels of curation in the same system: a normalized
schema that reflects the source, alongside a dimensional schema that has been
thoroughly ETL’d.

- Compute can be scaled horizontally to basically any level of parallelism
you desire.

Given these facts, it is unclear what rationale still exists for data lakes.
The only remaining major advantage of a data lake is that you aren’t subject
to as much vendor lock-in.

~~~
cjalmeida
Not being subject to vendor lock-in is huge in itself.

You can save plenty of money if you have the scale to move out of S3. That’s
important because you can usually trade CPU for storage by storing data in
multiple formats, optimized for different access patterns.

But mostly, the Hadoop ecosystem is very open. The tools are still maturing,
and it's easier to debug open-source tools than to deal with the generally
poor support in most managed solutions.

~~~
charlieflowers
Can you clarify what you mean by "if you have the scale to move out of S3"?

Why does it take scale to move out of S3? And I thought S3 was cheap, so how
would moving out save money?

~~~
cjalmeida
S3 is flexible and scalable, but it is not cheap. I'd be hard pressed to run
the numbers now, but at some point it's cheaper to just do storage yourself.

But to be fair, you'll go on-premise due to the computing or bandwidth costs
first. And you'll likely move data to the same datacenter to avoid expensive
transfer costs.

I've also had to work in places where you simply could not put your data in
the cloud due to regulatory reasons.

------
killjoywashere
So, let's say I have a DB of a million rows, anticipate having 100M rows of
archived data, then adding 5M rows per year; each of my rows has some metadata
and points to an image on the order of 10 gigapixels, in a bucket.

There is presently strong interest in associating this data with other DBs, of
which I am aware of about 80, with a total of probably 500-1000 tables, along
with some very old "nosql" b-tree datastores in MUMPS. There are new $10M+
projects coming online around the enterprise roughly every day.

Where would you start?

~~~
jiggawatts
That's a hilariously small amount of relational data that your phone could
probably handle with decent performance. I made larger databases than that
back in 2005 on a single commodity server. I wouldn't be surprised to see
PowerBI manipulating that in-memory on a desktop.

Microsoft SQL Server with Clustered ColumnStore tables would make practically
all queries fast on that, especially if most queries are only for subsets of
the data. PostgreSQL could probably handle that too, no sweat.

Also see "Your data fits in RAM":
[https://news.ycombinator.com/item?id=9581862](https://news.ycombinator.com/item?id=9581862)
which would mean that you could do in-memory analytics of your relational data
with SAP HANA or SQL Server if you _really_ needed that kind of performance:
[https://docs.microsoft.com/en-
us/archive/blogs/sqlserverstor...](https://docs.microsoft.com/en-
us/archive/blogs/sqlserverstorageengine/real-time-operational-analytics-
memory-optimized-table-and-columnstore-index)

You can spin up either SQL or HANA in the cloud or on Linux, so you don't even
need Windows. Both can be connected to just about any other database you can
name, often directly for cross-database queries. SQL 2019 is particularly good
at virtualizing external data: [https://docs.microsoft.com/en-
us/sql/relational-databases/po...](https://docs.microsoft.com/en-
us/sql/relational-databases/polybase/data-virtualization?view=sql-server-
ver15)

10 gigapixel images are a completely separate problem. If you need individual
images to be fast to view, you want some sort of hierarchical tiling like
Google Maps does. If you're processing them with machine vision or something,
then you want whatever makes the ML guys happy.

PS: I hope you're not working on DARPA's spy drone, because then please
disregard everything I said and delete your data for the good of humanity:
[https://www.extremetech.com/extreme/146909-darpa-shows-
off-1...](https://www.extremetech.com/extreme/146909-darpa-shows-
off-1-8-gigapixel-surveillance-drone-can-spot-a-terrorist-from-20000-feet)

~~~
killjoywashere
Parent responding: to be clear, I was not impressed by my own row count; if
anything, I was trying to make it clear this would not be a burden for a
traditional Postgres instance. I recall a Postgres user group I attended where
a guy had been working on handling a billion writes per second (consulting for
Cymer if I recall). My whole dataset is less than 1 second worth of that guy's
data. And since Cymer is in the photons business, I'm willing to bet they were
downsampling heavily.

My question is more the specific mix of problems: a DB, a ton of image data,
and other adjacent DBs that people want us to play with. How would you set
that up?

I work on cancer, so, definitely not spy drones.

~~~
dikei
For this amount of data, I would use good old Postgres, partition the data by
ingestion time, then just detach the old partitions when you need to archive
it.

For joining data from multiple databases, if the data is large, I would use
something like Presto ([https://prestosql.io/](https://prestosql.io/)) to join
and process the data. But that's partly because we already have Presto
clusters running.
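
Roughly, something like this (table and column names are made up):

```python
import psycopg2

conn = psycopg2.connect("dbname=imaging")
cur = conn.cursor()

# Parent table, range-partitioned by ingestion time.
cur.execute("""
    CREATE TABLE IF NOT EXISTS slides (
        id          bigint NOT NULL,
        ingested_at timestamptz NOT NULL,
        metadata    jsonb,
        image_uri   text
    ) PARTITION BY RANGE (ingested_at);
""")

# One partition per year; create new ones as data arrives.
for year in (2019, 2020):
    cur.execute(f"""
        CREATE TABLE IF NOT EXISTS slides_{year}
            PARTITION OF slides
            FOR VALUES FROM ('{year}-01-01') TO ('{year + 1}-01-01');
    """)

# Archiving: detach the old partition so it no longer participates in queries,
# then dump/move it to wherever cold storage lives.
cur.execute("ALTER TABLE slides DETACH PARTITION slides_2019;")

conn.commit()
cur.close()
conn.close()
```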

------
anshumania
For a use case where we ingest hundreds of millions of data points into Hadoop,
then run Spark ETL jobs to partition the data on HDFS itself, and then the next
day have several million updates to data points from the previous day(s): what
would be recommended on a Hadoop setup? HBase? Parquet with Hoodie (Hudi) to
deal with deltas? Or Iceberg? Or Hive 3?

~~~
billman
First mention of hoodie here. I'm surprised.

------
TikkiTaki
User experience with any open-source software, especially in the distributed
computing and storage domain, depends on the quality of your system
administrators and data engineering teams. If you can't connect software
produced by different companies properly, working with this zoo of software
will be painful. Most people who have used, for example, the AWS stack don't
want to return to open source, because the Amazon team tests the interactions
between their systems and gets the configuration right in ways that
inexperienced system administrators can't. Additionally, you shouldn't use
distributed storage systems if you can get by with horizontal sharding on SQL
storage systems.

------
somurzakov
Data lakes are a nightmare in terms of security. The Capital One breach
happened partly because they just poured all their data into the lake, as does
every other monkey in the data lake business. Role-based access control, zero
trust, principle of least privilege, service account management in a data
lake? Hahaha, nope, we don't do that here.

I will never trust a company that stores everything in one data lake; that's a
major data breach just waiting to happen.

------
nixpulvis
So where are we on Data Lakes vs NewSQL? [1]

[1]:
[https://en.wikipedia.org/wiki/NewSQL](https://en.wikipedia.org/wiki/NewSQL)

~~~
ozkatz
Most “NewSQL” databases are designed for OLTP use cases (i.e. many small
queries that do little aggregation). Data Lakes are optimized for OLAP (i.e.
running a smaller number of queries, but aggregating over large amounts of
data).

As an example, Athena would do a terrible job at finding a specific user by
its ID, while Spanner would behave just as poorly at calculating the
cumulative sales of all products for a given category, grouped by store
location (assuming many millions of rows representing sales).
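
For illustration, a rough PySpark sketch of that kind of scan-and-aggregate
query, which is what data lake engines are built for (paths and column names
are made up):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("olap-example").getOrCreate()

# Millions of sales rows stored as Parquet in the lake; path is made up.
sales = spark.read.parquet("s3://my-lake/sales/")

# Total sales per store location for one category.
(sales
    .filter(F.col("category") == "electronics")
    .groupBy("store_location")
    .agg(F.sum("sale_amount").alias("total_sales"))
    .orderBy(F.col("total_sales").desc())
    .show())
```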

Hope this analogy makes sense.

~~~
FridgeSeal
I think you're selling some of these "NewSQL" DBs short; TiDB/TiKV, for
example, appears (I haven't personally used it yet) to be capable of supporting
both OLTP and OLAP workloads, thanks to some clever engineering and data
structures behind the scenes.

~~~
manigandham
TiDB relies on Spark to do analysis, using their TiSpark integration package.
It's not built into the database but offers a smoother install than operating
a Spark cluster separately.

The only "newsql" database that truly does OLAP+OLTP (now called HTAP) well is
MemSQL, with its in-memory rowstores and disk-based columnstores.

~~~
ilovesoup
(I'm a dev of TiDB so I might be biased.) Yes and no. The yes part is that
TiDB still relies on TiSpark for large join queries, as well as for bridging
to the big-data world; TiDB itself cannot shuffle data like an MPP database
yet. On the other hand, TiDB without TiSpark is still comfortable with those
dimensional aggregation queries (which are typical analytical queries as
well). The no part is that TiDB now has a columnar engine (TiFlash) for
analytics that provides workload isolation. TiFlash can keep up to date
(latest and consistent data, to be more specific) with the row store in real
time on separate nodes via Raft. IMO, HTAP should be TP and AP at the same
time, instead of just "TP or AP, you choose one". In such cases, workload
interference is a real issue, especially when you are talking about
transactions for banking instead of streaming in logs. In that sense, very
few, if any, "newsql" systems have achieved what I consider true HTAP. For
more details:
[https://pingcap.com/blog/delivering-real-time-analytics-and-true-htap-by-combining-columnstore-and-rowstore/](https://pingcap.com/blog/delivering-real-time-analytics-and-true-htap-by-combining-columnstore-and-rowstore/)

You're welcome to try it in March with TiDB 3.1.

------
gigatexal
Anyone use Apache Iceberg with success?

