
Alright, I’ll finally bite. What do these companies do? Neither Snowflake’s front-facing website, nor the Wikipedia article, nor this post tells me why people pay all this money.

I know a bit about the effort involved in chucking around 100-petabyte datasets, and there are numerous niches a SaaS could fill there, but it’s very murky from the outside.




I was wondering the same thing. This sums it up pretty well, I guess:

> The best way to describe Snowflake is that it is a brute force method to run complex queries without creating indexes.

(https://news.ycombinator.com/item?id=32554072)
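
To make that concrete, here's a toy sketch (my own illustration, not Snowflake internals) of what "brute force without indexes" means: answer every query by scanning the relevant column in parallel chunks, leaning on raw throughput instead of precomputed lookup structures.

    from concurrent.futures import ProcessPoolExecutor

    def scan_chunk(chunk):
        # Full scan of one chunk of a single column: no B-tree, no hash
        # index, just a predicate applied to every value.
        return [i for i, v in chunk if v > 100]

    if __name__ == "__main__":
        # One logical column, split into chunks for independent workers.
        column = list(enumerate(range(1000)))
        chunks = [column[i:i + 250] for i in range(0, len(column), 250)]
        with ProcessPoolExecutor() as pool:
            hits = [row for part in pool.map(scan_chunk, chunks) for row in part]
        print(len(hits))  # 899 values satisfy v > 100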


Column stores on DFS are without a doubt tricky beasts. It’s a very rich field technically.

I guess I’m trying to get a read on whether their core competency / moat is distributed columnar query technology or sales/support/marketing.


Snowflake is slower and more expensive than its competitors. I'd say its moat is mostly that it's extremely easy to set up and start using without technical support. If you've just got a small team and no one wants to do data engineering, Snowflake makes that possible, or at least much easier. Most users are generally happy, and Snowflake has followed the cloud playbook of making it hard to switch off, so even when teams have scaled to the level where secondary indexes and dedicated data staff make sense, they're still happy with Snowflake.


But why not create indexes? I mean, I understand why sometimes you don't want an index. But building an entire warehouse around the idea of "no indexes", really?


My experience with "Big Data" is pretty dated, 5 years old at least. At that time, a good cutoff for "big data" might have been about a petabyte, plus or minus a factor of 10 depending on your gear. I imagine even 1PB is probably pretty mild by today's "big data" standards.

But once you're up in that "I can't even fit this in a 4-8U sled" territory (whatever that is in a given decade), you're probably doing some kind of map/reduce thing, so there's a strong incentive to have a column-major layout. If you can periodically sort by some important column, so much the better (you get a log2(n) binary search). But mostly you've got a bunch of mappers, which you work hard to place for locality relative to the DFS replicas where the disks live (maybe on the same machine, maybe under the same top-of-rack switch), zipping through different columns or column sets and producing eligible conceptual "rows" to feed into your shuffle/sort/reduce pipeline, which handles joins, sorts, and the like.
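
A minimal sketch of that layout trade-off (file contents and schema here are invented): each column lives in its own file, so a mapper reads only the columns a query touches, and if one column is kept sorted you get the log2(n) range lookup for free.

    import bisect

    # Pretend these were read from per-column files on the DFS.
    event_time = [10, 20, 30, 40, 50, 60]  # the periodically sorted column
    user_id = [7, 3, 7, 9, 3, 7]           # an unsorted payload column

    def users_in_time_range(lo, hi):
        # Binary search on the sorted column finds the row-id range...
        start = bisect.bisect_left(event_time, lo)
        end = bisect.bisect_right(event_time, hi)
        # ...then we read just the matching slice of the other column.
        return user_id[start:end]

    print(users_in_time_range(20, 50))  # [3, 7, 9, 3]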

I don't know how Google does it, but I think almost everyone else started with something like the Hadoop ecosystem, many of them with something like Hive/HQL to give a SQL-like way to express those jobs, especially ad-hoc queries (long-lived, rarely-changing overnight jobs might get optimized into some lower-level representation).

Around the time I was getting out of that game, Spark was starting to get really big, due to some combination of RAM getting really abundant and a general rethink of what was by then a pretty old cost model. I have no idea what people are doing now.
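
For what it's worth, a rough sketch of the Spark-era version of that Hive/HQL workflow (paths and column names here are made up): the scan-everything columnar model is the same, but intermediate results stay in RAM across stages instead of hitting disk between map/reduce rounds.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("adhoc").getOrCreate()

    # Parquet is column-major on disk, so this job reads only the
    # columns referenced in the query below.
    events = spark.read.parquet("hdfs:///warehouse/events")
    events.createOrReplaceTempView("events")

    # The SQL front end plans a distributed scan + shuffle, much like
    # Hive did, but keeps working sets in memory where they fit.
    top = spark.sql("""
        SELECT user_id, count(*) AS n
        FROM events
        WHERE flag_x > 10
        GROUP BY user_id
        ORDER BY n DESC
        LIMIT 10
    """)
    top.show()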

I'd love it if someone with up-to-date knowledge about how this stuff works these days chimed in.


It's all about the ethos of ease of use. Snowflake does a lot of smart work in the background so that you don't have the overhead of managing indexes. And not just indexes: there's just less human intervention required overall compared to something like Teradata or even a modern lakehouse.

That said, they've sort of reintroduced indexes with the Search Optimization Service, which is like an index across the whole table for fast lookups, but even that is automatically maintained on your behalf.
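
Concretely, turning it on is a single DDL statement per table; here's a hedged sketch through Snowflake's Python connector (connection parameters are placeholders):

    import snowflake.connector

    # Placeholder credentials; fill in your own account details.
    conn = snowflake.connector.connect(
        account="my_account",
        user="my_user",
        password="...",
    )
    cur = conn.cursor()
    # Enables the Search Optimization Service for one table; Snowflake
    # builds and maintains the lookup structures in the background.
    cur.execute("ALTER TABLE analytics.events ADD SEARCH OPTIMIZATION")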


These tend to be one-off analytical queries: you want every user with flag X > 10, joined against five other tables, each with similar filters. You don't know ahead of time what that query is (your analyst thought of it this morning), so you can't make indices ahead of time. And it'll never run again, so you don't need to take the performance hit of keeping an index. And someone has to decide which indices to keep, but app engineers aren't best utilized figuring out indices for analysts.

The indexing story is nice, but the bigger selling feature for me is that if you have many services, and each service's data is in the warehouse, you can join across all of them.
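
A tiny illustration of that kind of one-off, cross-service join (the tables and flag are invented); no index would earn its keep here, since the query runs exactly once:

    import pandas as pd

    # Imagine two different services dumped these into the warehouse.
    users = pd.DataFrame({"user_id": [1, 2, 3], "flag_x": [5, 12, 20]})
    orders = pd.DataFrame({"user_id": [2, 2, 3], "total": [10.0, 4.5, 7.0]})

    # This morning's one-off question: order totals for users with flag_x > 10.
    result = (users[users.flag_x > 10]
              .merge(orders, on="user_id")
              .groupby("user_id")["total"].sum())
    print(result)  # user 2 -> 14.5, user 3 -> 7.0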


Snowflake is a data warehouse in the cloud. In the past, companies would have spent a fortune on Oracle or Teradata licenses and a fortune on the on-prem hardware to run them on. Now they spend it on Snowflake and run it on AWS, etc. Same story as with any SaaS product: cheap and easy to get started, pay only for what you use, but over time the costs get big.




