
If you have petabytes of data, I don't think this article is talking about your use case.


I think it is?

Or I guess, what data size do you think it's talking about? If you only have gigabytes of data, none of this matters, you can use anything pretty cheaply and easily. So is this article just for "terabytes" or does it go up to "hundreds of terabytes" but not "petabytes"?


Hmm, I suppose it's a bit challenging to say. I initially thought that it wasn't for the 80% smallest companies, and petabytes of data probably puts you in the top 20%. (Most businesses are small businesses, after all.)

However, I now realize that the biggest companies probably should manage their own data. If you're Google, why would you use Snowflake?

So I don't know if you are the target audience for this blog post. It's pretty ambiguous.


I guess I'll say what I think. I do think it is targeted at that smallest 80% of companies with some digital footprint, and also at most of the top 20%. Or more specifically, I think maybe it's targeted at like the 5th percentile to the 99th percentile. That bottom 5% probably just needs Excel, and that top 1% is probably writing or heavily modifying all their own tools.

But I'm not sure the advice is very good from the 5th percentile up to ... maybe that top 20%? A lot of the stuff in the article assumes the availability of sophisticated data architects and mature infrastructure groups that I really don't think the median company has.


I agree. Really seasoned data people are not common enough. Small companies need to buy services to lighten the load.

We both seem to have a sense of the size of companies at different percentiles. At what percentile would you put a company with petabytes of data?


Super hard to say, so ... 80th or 90th? With very low confidence.

But I do have very high confidence that the 99th percentile is much larger than petabytes (think: what's next after "exa"), and I believe that many companies these days crack into "peta" territory.

But as I saw another comment mention, I think another, probably more important, consideration besides size in bytes is cardinality and structure. So maybe this whole classification we're doing is kind of beside the point :)


Yeah, it's hard to say with any certainty. I agree that the far end of the curve probably looks nothing like the "neighborhood" a couple percent away, relatively speaking.

I also agree that the variety of data plays a big part in its complexity. If you have a few petabytes of data but it's really only a handful of tables, you can really hone in on the relationships. If it's a wide array of sources with many tables between them, then you have some nasty problems like entity resolution.
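
For anyone unfamiliar, here's a minimal sketch of the kind of matching entity resolution involves (pure Python stdlib; every source name, field, and threshold below is hypothetical):

  # Toy entity resolution across two hypothetical sources (all names made up).
  from difflib import SequenceMatcher

  crm = [
      {"id": "c1", "name": "Acme Corp.", "zip": "94105"},
      {"id": "c2", "name": "Globex LLC", "zip": "10001"},
  ]
  billing = [
      {"id": "b9", "name": "ACME Corporation", "zip": "94105"},
      {"id": "b7", "name": "Initech", "zip": "78701"},
  ]

  def norm(s):
      # Lowercase and strip punctuation so "Acme Corp." ~ "ACME Corporation".
      return "".join(ch for ch in s.lower() if ch.isalnum() or ch == " ").strip()

  # Block on zip so we only compare plausible pairs, then fuzzy-match names.
  matches = []
  for a in crm:
      for b in billing:
          if a["zip"] != b["zip"]:
              continue
          score = SequenceMatcher(None, norm(a["name"]), norm(b["name"])).ratio()
          if score > 0.6:
              matches.append((a["id"], b["id"], round(score, 2)))

  print(matches)  # [('c1', 'b9', 0.72)]

With two tables and one blocking key it's trivial; with dozens of sources, inconsistent keys, and no shared identifier it turns into the nasty problem you're describing.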

All happy data sets are alike; each unhappy data set is unhappy in its own way.


> All happy data sets are alike; each unhappy data set is unhappy in its own way.

Ha, gonna steal that for some doc I write someday :)


That's only fair: I stole it from Anna Karenina. :]

https://en.m.wikipedia.org/wiki/Anna_Karenina_principle#:~:t....


Ha, I know, I love that opener, despite it being super cliché to love it. Things are usually clichés for a good reason :)



