I'm going to regret asking this, but what the hell is a "security lake"? A colle...

shaeqahmed · on Aug 6, 2022

Its a data lake in which you store security logs. That includes Cloud/SaaS audit logs, network security logs (Zeek, Suricata, Snort), VPN/Firewall logs, and more.

remram · on Aug 6, 2022

But logs are structured and filtered by their relevance to security. In what way is that a "lake"?

Is "data lake" just the new plural of "dataset"?

aseipp · on Aug 6, 2022

People tend to call them "lakes" because, I think, they are "unfiltered" and contain raw data objects and blobs, originally from the source system, unmodified. In a normal data warehouse system you ETL things and the final "load" step stores them in the warehouse, and then you use that as your source of truth. Your data warehouse might be Redshift on Amazon. In the "Data lake" case you instead load everything into something like S3, and then everything uses S3 as the source of truth -- including your query engine, which in this case might be Athena (also on Amazon). I won't go into Redshift vs Athena but if you're familiar with them, this should make sense.

I'd say like 95% of the case I've seen people talking about these things, they basically mean: shove everything into S3 and use that as the canonical source of truth for your data systems, rather than some OLAP system; instead you build the OLAP system off S3.

More simply, I think of it like a term to describe a particular mindset concerning your ETL: always work on the source data. And source data is often messy and unstructured. It's a lot of potentially unstructured and underspecified bullshit. So S3 is pretty good storage for something like that versus datastores with performance/usability cliffs around things like cardinality, fields that come and go, etc...

One advantage of this design I can see is that S3 is very "commodified" by this point (lots of alternative offerings) and can be integrated with in nearly every pipeline, and your tools can be replaced more easily, perhaps. S3 is more predictable and "low level" in that regard than something like a database, with many more performance/availability considerations. Like in the example I gave, you could feasibly replace Athena with Trino for instance, without disturbing too much beyond that system. You just need to re-ingest data from S3 for a single system. While if you loaded and ETL'd all your data into a database like Redshift, you might be stuck with that forever even if you later decide it was a mistake. This isn't a hard truth (you might still be stuck with Athena) but just an example of when this might be more flexible.

As usual this isn't an absolute and there are things in-between. But this is generally the gist of it, I think. The "lake" naming is kind of weird but makes some amount of sense I think. It describes a mindset rather than any particular tech.

aliqot · on Aug 6, 2022

None of this really drives home why a new term was necessary. I'm still seeing "dataset".

Terretta · on Aug 6, 2022

Because dataset doesn't tell you if the data was ETL'd or ELT'd; data warehouse and data lake do tell you.

Now just wait till you hear someone reference "data lake-house" ...

remram · on Aug 7, 2022

A quick web search gave me:

> A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose.

Structured logs that have been filtered by their relevance to security really seem to fit the definition. If we must use newspeak, "log warehouse" then?

HyperSane · on Aug 7, 2022

Think of it as "raw dataset" where all the data lands first and only copies of it are modified.

remram · on Aug 7, 2022

Someone's processed data is someone else's raw input.

Here these are logs that were already filtered by their relevance to security and exported as structured data. Considering those "raw unstructured data" because you haven't personally done ETL on it seems wrong.

glitchcrab · on Aug 6, 2022

That's because it's just marketing buzzword bingo

warent · on Aug 6, 2022

data lake isn't a new term (relatively). I remember first hearing it when I worked at Google like 5 years ago, and the context was always referring to some enormous raw data store. Probably the term "lake" is supposed to evoke a sense of largeness and shapelessness. If you wanted to train a model, you would tap into a data lake which had up to petabytes.

aliqot · on Aug 6, 2022

the FAANG job promotion "game" leads to a lot of new terms being coined because everyone wants to be the guy who invented X. My generation isn't beyond reproach either, our thing was acronyms and clever initialisms.

rhyselsmore · on Aug 6, 2022

the (log) streams all run into the (data) lake

AndrewKemendo · on Aug 6, 2022

A single attack surface that makes it easier for adversaries to own your system

geodel · on Aug 6, 2022

From GitHub it looks like implementation of Random Buzzword Lake APIs in Javascript.

I am so excited, couldn't wait to see more!

wizwit999 · on Aug 7, 2022

Haha! We use typescript for CDK infrastructure automation/deployment. Most of our runtime code is in Rust, some Kotlin, and Python for user authored detections.