Snowflake is the go-to data warehouse in my opinion. Redshift and BigQuery are fine, but Snowflake is head and shoulders above. Good community around it and good tools for it (dbt, though that works on other warehouses too). They have the mindshare in the data warehouse market.
There's so much they can do from a user experience perspective to make it even better. The integration with Numeracy was a trainwreck, but the fundamentals of the DB are there.
Interesting to see they lose so much money, but I bet their margins have to be so thin running on the cloud. I wonder if they'll ever have to go bare metal to make it work.
Working with it was fraught with issues. Performance was mediocre at best, it was horribly expensive, and the Python and JS client libraries had recurring issues with disconnecting and reconnecting. The advice we were given around scaling concurrent connections was bizarre at best. Teammates hit numerous issues where it was clear corners had been cut in handling certain Unicode edge cases. Their Snowpipe "streaming" implementation was...not good. The idea of having compute workers that "spun up and down" sounded good in theory, but in practice led to more bottlenecks and delays than anything else.
The AWS outage last year that prevented you from provisioning new instances essentially crippled our Snowflake DB.
I almost go out of my way to recommend people _not_ use it. I keep seeing it pop up, but mostly because they seem to be doing what MongoDB did in the early days: throwing marketing money at capturing mindshare rather than building an actually good product.
We changed to ClickHouse and the difference was literally night-and-day. The performance especially was far superior.
I don't believe they will succeed in the long run as an independent player _in_ the cloud.
They are always going to be less integrated and less infrastructure-cost-efficient than the native options (Redshift and BigQuery), without the R&D budgets and with incremental friction (sales) and risk (data privacy and cybersecurity).
AWS really should get around to buying them, like they should have bought Looker or Tableau or Mode or Fivetran or DBT, etc., etc.
Snowflake is wildly better than Redshift, no matter how you want to look at it -- integrations, cost, performance, etc.
Like, in a sane world I agree with you -- Redshift SHOULD have a crazy competitive advantage. But somehow they've been unable to execute on that goal for half a decade, and I don't see that changing quickly, given Snowflake's mindshare and growth.
Snowflake is better. Redshift has been really slow to execute. AWS is doing the world's worst job of articulating whatever vision they have for analytics. AWS's message is laser-focused on infrastructure folks and machine learning engineers (not analysts, not data scientists, not anyone else).
The higher you go up the stack, the slower and less meaningful AWS's solutions feel. There is a fantastic job opportunity out there for someone to reconcile AWS's data analytics offerings. They have so much upside.
I'm still not betting on Snowflake winning a direct competition with their primary supplier. For the enterprise and the highly regulated: Redshift is good enough, already there, and they don't NEED the efficiencies that Snowflake makes available.
Redshift is an on-premises piece of software (ParAccel's engine, which AWS licensed) that was converted into a cloud platform. Snowflake was built from day one as a cloud platform with excellent big data frameworks as its internal architecture. It's very hard for Redshift to rearchitect itself the way Snowflake was designed from the start, because they need to keep supporting existing instances while effectively creating an entirely new product.
You don't need to own the public cloud infrastructure to build a better product.
Example: you can play inside ball on storage infrastructure costs to get a 2x cost benefit at the expense of a lot of extra engineering. Better DBMS storage organization, which is available to any implementation, gets you 10x (or greater) improvement. Which would you rather have?
In fact, products like Redshift don't even really game the infrastructure prices. Costs to customers are comparable with Snowflake for equivalent resources as far as I can tell. They both charge what the market will bear.
Hi, what you're saying is cryptic to me but I would love to understand. Would you mind breaking it down for the financially literate but tech-handicapped person that I am? Thanks much!!
Sure! Sorry to be so obscure, it was not a good explanation. To take the above example, let's say you have a database with 1TB of tabular data in Amazon.
1. You start out storing it on Amazon gp2 Elastic Block Store, which is fast block storage available on the network. It costs about $0.10 US per month per GB, so that's $102.40 per month.
2. Data (sadly) has a habit of getting destroyed in accidents so we normally replicate to at least one other location. Let's say we just replicate once. You are now up to $204.80 per month.
Now we have a couple of ways of reducing costs.
1. We could make the block storage itself cheaper thanks to inside knowledge of how it works plus clever financial engineering. However, the _most_ that can get us is about 5x savings, because prices for similar classes of storage are not that different. The real discount is more like 2x if we want to make money and be reasonably speedy. You likely have to do engineering work--like implementing blended storage--for this latter approach, so it's not free. So, we're back to $102.40 per month.
2. Or, we could build a better database.
2a.) Let's first build a database that can store data in S3 object storage instead of block storage. Now our storage costs about $0.02 per GB per month. Plus S3 is replicated, so we can maybe just keep a single copy. We're down to $20.48 per month, but we had to rewrite the database to get it, because S3 behaves very differently from block storage and we have to build clever caches to work on it.
2b.) But wait! There's more. We could also arrange tabular data in columns rather than rows, which allows us to apply very efficient compression. Let's say the compression reduces size by 90% overall. We're now down to just $2.05 per month. Again, we had to rewrite the database, but we got a huge saving in return: roughly 100x versus the replicated block storage we started with.
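To make the arithmetic concrete, here's the same math as a tiny Python script (the prices are the rough figures used above, not exact AWS list prices):

    # Back-of-the-envelope version of the storage math above.
    GB = 1024                      # 1 TB of tabular data
    EBS_GP2 = 0.10                 # $/GB-month, gp2 block storage
    S3_STD = 0.02                  # $/GB-month, S3 object storage
    KEEP = 0.10                    # columnar compression keeps ~10% of raw size

    ebs_single = GB * EBS_GP2              # $102.40
    ebs_replicated = 2 * ebs_single        # $204.80 with one extra replica
    s3_single = GB * S3_STD                # $20.48; S3 replicates for you
    s3_columnar = s3_single * KEEP         # $2.05 after compression

    print(ebs_replicated / s3_columnar)    # ~100x cheaper overall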
The moral is that clever arrangement of data just about always beats financial shenanigans, usually by a wide margin. The primary reason Amazon has done well in data services like Redshift and Aurora is that they have been extremely smart about data services, not any inherent advantage as platform owners.
Snowflake is better than Redshift, but BigQuery has improved greatly in the last 2 years and filled in a lot of the gaps. I find Snowflake is the best at dealing with semi-structured/JSON data and handling interactive results on smaller datasets, while BQ is great at serverless scaling and very large computations.
"Our business benefits from powerful network effects. The Data Cloud will continue to grow as organizations move their siloed data from cloud-based repositories and on-premises data centers to the Data Cloud. The more customers adopt our platform, the more data can be exchanged with other Snowflake customers, partners, and data providers, enhancing the value of our platform for all users. We believe this network effect will help us drive our vision of the Data Cloud."
I fail to understand this network effect. Is there some conflation here? How does data sharing equate to a network effect? Something fundamentally isn't adding up. For this to be a network effect, sharing my data with 10 other customers should inherently enhance my own experience. How does that happen with Snowflake?
This is one hypothetical way they could capture this value:
1) Building a common platform where anyone can upload datasets, e.g. weather data, retail data, government data, other open data, or closed data (copyrighted, etc.). They gave the example of COVID case data in their S-1 doc.
2) Providing a mechanism for others to find data through a marketplace; some data free, some only via payment (with different monetisation models, e.g. per consumption, per month), and letting other customers consume it as and when needed. Note that, per their S-1 doc, data is never copied when shared, so the cost of sharing with a wide audience is limited (see the sketch after this list).
3) More data on the platform means more data is shareable in the 'marketplace' and more data used by everyone. This increases the value of the whole platform through network effects.
4) It also opens up alternative revenue streams, e.g. more revenue through storage (more data on the platform from different people), and revenue from shared data that is consumed (maybe).
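To make (2) concrete, here's a hedged sketch of what the provider side of that sharing could look like via the snowflake-connector-python client. The account, database, and table names are hypothetical; the SHARE DDL itself is standard Snowflake SQL:

    # Provider-side data sharing: grants access to data in place;
    # no copy is made (per the S-1). Names below are hypothetical.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="provider_acct", user="me", password="...")
    cur = conn.cursor()

    cur.execute("CREATE SHARE IF NOT EXISTS covid_share")
    cur.execute("GRANT USAGE ON DATABASE open_data TO SHARE covid_share")
    cur.execute("GRANT USAGE ON SCHEMA open_data.public TO SHARE covid_share")
    cur.execute("GRANT SELECT ON TABLE open_data.public.covid_cases TO SHARE covid_share")
    # Consumers query the shared table directly from their own account.
    cur.execute("ALTER SHARE covid_share ADD ACCOUNTS = consumer_acct")
    conn.close()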
I'm a little skeptical of this as well, but I think there is a path. At my previous company we would take in a lot of data from other companies and do analysis for them. If we'd had a really easy way to share the transformed and analyzed data after it was modelled in the warehouse, that really would have been great. The question is: are you going to get companies to create a Snowflake account just so they can access data this way? Maybe, if it's easy to export and do further analysis.
One of the barriers for Snowflake is that while it's better than what AWS offers, very few customers start out needing everything Snowflake does. They grow into that. So they stick with AWS, hoping that the features/capabilities there grow fast enough to keep up.
But, also very expensive. You can do queries on a spark cluster for tiny fractions of what they charge. But, snowflake makes things easy for the "decision makers" (who know SQL). So, all good.
Having run a medium size Spark cluster, I'm not sure I agree.
If you have 80-100% utilization for a month, perhaps, but the beauty of Snowflake is that you can spin up a 3XL warehouse for a few MINUTES to get answers fast, and then shut it down again and don't pay anything.
Saying "you could run it on self-managed Spark/Oracle/Hive/SQLite" is approximately the same argument as saying "I can run a web server cheaper myself than paying Amazon for an EC2 instance" -- there are cases where that is true, but there are many, many, cases where the "on demand capacity" is the bigger benefit.
> the beauty of Snowflake is that you can spin up a 3XL warehouse for a few MINUTES to get answers fast, and then shut it down again and don't pay anything
Is this why they're making a $350mn annual loss?
A million dollars a day loss would be a pretty big deal to me.
Wow, Sales + Marketing ($294M) alone exceeds their entire revenue ($265M), swinging them from healthy gross margins to a net margin around -130%. They are really trying to cram this product down people's throats, huh?
You pay a lot to acquire customers up front, but as long as your churn is net negative (i.e., customers end up spending more year over year) you end up coming out way ahead. Sales reps only get paid on a contract one time, based on the value of that initial subscription revenue. $0 commission when it renews next year, and the year after, and the year after...
(Small caveat that someone probably gets commission on follow-on expansion revenue, but not that initial subscription amount.)
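A toy model of that dynamic, with illustrative numbers (a 10% one-time commission against 150% net dollar retention):

    # One-time commission vs. compounding contract value. Numbers are made up.
    initial_arr = 100_000              # first-year subscription value
    commission = 0.10 * initial_arr    # paid once, on the initial deal only
    ndr = 1.50                         # net dollar retention per year

    arr, cumulative = initial_arr, 0.0
    for year in range(1, 6):
        cumulative += arr
        print(f"year {year}: ARR ${arr:,.0f}, cumulative ${cumulative:,.0f}")
        arr *= ndr
    # The $10k commission never recurs, while ARR compounds 50% a year.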
Based on the quality of support I think they will have high retention. We’re currently a relatively small customer and the support has been excellent. I submitted a bug and it was fixed on our production account in less than two months. This included a step through on our dev account. It would have taken me two months just to convince Oracle and Microsoft that they had a bug.
I had a couple of other minor issues and got very good response.
It feels very much like early MongoDB days, where they are attempting to capture mind/market share by aggressively marketing as opposed to competing on product quality and features.
We at Census (https://getcensus.com) are super excited by this S-1 filing. Before Snowflake (and BigQuery and Redshift), data was seen as something only the Fortune 500 could afford, by buying Hadoop clusters and throwing an army of scientists and engineers at them.
But Snowflake has really led the way in democratizing the data warehouse over the past few years and educating the market. You can start on a $50/month plan, and in our experience the pricing scales nicely with the value you get out of the data. Snowflake (and BigQuery) also made it a lot less scary to get started by having an easy way to ETL data from 3rd parties (Google Ads, Salesforce, prod DB, etc.) into your warehouse.
Thank you, Snowflake, for paving the path for startups like Census, Fivetran, DBT, Mode to help (data) engineers and analysts do more with their data
If I hear census, especially with a capital “C”, I don’t think a business or startup. Are you in the US? Do a lot of people express confusion? How will you trademark your name?
Now I know why our teams internally have been hammered by sales at Snowflake for the past 4 months. Like, relentless, to the point where I doubt we'd entertain their solution even if we had a need. Sorta like Datadog..
Pretty much everything I threw at both, ClickHouse did faster. I never benchmarked write speeds properly, but I do know CH is capable of high write performance.
General analytics queries for the likes of dashboards: CH latencies were on the order of <100ms, Snowflake about a second. Snowflake couldn't do geospatial queries when I had to use it, but I was getting responses from CH in about 40ms on a dataset of tens of millions of points.
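For a flavour of the kind of geospatial query I mean, here's a hedged sketch with the clickhouse-driver Python package; the table and columns are hypothetical, but greatCircleDistance() is a built-in ClickHouse function (it returns meters):

    # Count points within 5 km of central London.
    from clickhouse_driver import Client

    client = Client(host="localhost")
    rows = client.execute("""
        SELECT count()
        FROM points
        WHERE greatCircleDistance(lon, lat, -0.1276, 51.5072) < 5000
    """)
    print(rows[0][0])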
Compute/storage separation, instant shutdown/scale up/down (horizontally/vertically), multi-warehouse, semi-structured queries, change streams for tables and external tables, external tables (data lake), stored procedures, UDFs/UDTFs, cloud-agnostic (AWS, Azure, Google), data exchange/data sharing, CLI and drivers, external functions (remote inference engine invocation), Snowpipe (ingest files even when your warehouses are down), tasks (DAGs), I could go on...
Scaling Redshift up and down was a nightmare. Tracking files on ingestion was a nightmare. Turning semi-structured data into structured data was a nightmare. I could also go on...
I'm a very early customer, and a big fan of SF if you couldn't tell.
- Performance from tables not cached on the warehouse instance is awful. That's the price you pay for shutting a warehouse down.
- I wish it were cheaper. If you run queries against a warehouse 24/7, preventing auto-suspend, you'd better hope it's tiny. And even then, the cost might incentivize you to employ a different strategy entirely.
The other stuff you mention is them building a moat around your data lake. They're pretty good at that. I'd happily get locked into Snowflake for the moment. Redshift really does look like an amateur hour product in comparison to the tools you get with Snowflake.
Even then, as good as Snowflake is, our internal users went from complaining about the performance of Looker and Redshift to complaining about the performance of Tableau and Snowflake. I don't know if you can ever please anyone in this space...
Snowflake is the best all around DW product out there. It commonly gets compared to Redshift, but Amazon built Redshift on top of ParAccel's technology. Snowflake built its database from scratch. Most of the Snowflake founders have PhDs with an emphasis on distributed systems and I think you see that in the product.
I can't say enough good things about Snowflake, and I have plenty of criticism to throw at Hadoop, Redshift, Aster Data & Vertica.
Maybe I'm not really wise to the world of finance and what-have-you, but how many S-1s do I need to see in short succession before I start to ask: "What's going on, guys?"
Like, is this an indication that a lot of people are trying to exchange their companies for hard cash as quickly as possible? It kind of looks a lot like that. This is what, the 3rd or 4th one to hit the HN front page lately?
When one post is on HN's front page it's common for there to be a rush of follow-up posts. Usually we downweight these since there's a power-law dropoff in how interesting they are. Some of the moderation principles relating to this:
Business was at a standstill in 2008. It takes 8-10 years to go from embryo to IPO these days. Thus, a lot of companies got started in 2010-2012, and are now reaching IPO maturity.
I think this is a "ketchup bottle" effect rather than a sign of the end times.
(That being said, the end times may ALSO be upon us, but not because of this particular sign.)
We're near the top of a huge and unprecedented bubble in tech stocks. The top 5 stocks in the S&P 500 are all tech and make up 23% of the index, at all-time highs in the middle of a pandemic with the global economy stuttering. That's a greater concentration than in the year 2000. So it's a good time for tech stocks to IPO.
It's not supposed to sound attractive, but investors will largely care about the YoY growth of 121% for Q2.
They spend a ton of money on Sales & Marketing ($293,577k in the last fiscal year). I assume that's what's driving a lot of their growth, and it's a lever they can pull back on to increase profitability.
We're hitting some perf issues with Snowflake at work (not necessarily due to Snowflake itself but possibly more what we're trying to do with it: data warehouse storage needs but also a need for close to real-time analytical querying over that data). Has anyone here had any good/bad experiences with MemSQL?
Orthogonal question: what happened to you guys? At one point you were the hottest startup on the block that every ICPC competitive brogrammer wanted to get into, and then you just fell off the radar...
We make heavy use of views within Snowflake, have sensible cluster keying and, further downstream, also leverage things like Looker's PDTs and Elastic Search.
The issue is we have billions of rows and very varied analytical requirements, so there are quite a few "pathological" queries.
Change streams are something we're looking at (as well as MemSQL, Clickhouse etc.)
What are your performance requirements? I've found Snowflake to be fast enough for most workloads, but if you're talking in the sub-500ms range, then MemSQL's in-memory capabilities will help. Though I must warn you that the system is a pain to manage and their managed service is unproven.
Data Warehouse is usually a relational database designed for large OLAP analysis with features like column-oriented storage, vectorized processing, and distributed scale-out architecture. Since it's a database, the focus is on strong schemas and structured data, although all major systems also support JSON datatypes now.
Data Lake is usually object storage or another large storage pool holding raw files. These can be in different formats like JSON, Avro, or Parquet, with strong schemas or unstructured data. Processing can be done by engines like Spark, Presto, Drill, etc. that support less advanced SQL but more robust access across data files and storage locations. The point is to serve as a general dumping ground or "lake" for all the data and then manage it afterwards (including cleaning and moving important records to a data warehouse).
SQL Server is a single-node OLTP relational database but most database engines are fast enough now that you can do everything you need up to hundreds of millions of rows. Best SQL and feature support with full update capabilities. Some DBs like SQL Server have also added OLAP features like columnstore tables to further delay or eliminate the need for a data warehouse.
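To illustrate the "raw files in object storage, schema on read" pattern described above, a minimal PySpark sketch; the bucket path and column names are hypothetical, and reading s3a:// paths assumes the hadoop-aws jars are configured:

    # Query raw Parquet files in object storage in place with Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lake-query").getOrCreate()

    events = spark.read.parquet("s3a://my-lake/raw/events/")  # schema on read
    events.createOrReplaceTempView("events")
    spark.sql("""
        SELECT date(event_ts) AS day, count(*) AS n
        FROM events
        GROUP BY date(event_ts)
        ORDER BY day
    """).show()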
Mostly how much data there is, and how structured it is. Not really sure what the difference between data lake and warehouse is, but either of them will typically have less structured data and more of it than an SQL server. We're talking petabyte-scale. Sure, you can get 16TB drives, but it's still a stretch to put it all on a single machine. Data should ideally be stored as parquet or similar, but there's probably a lot of JSON out there. Couple it with something like Athena, and you can query in SQL. Spark for more complicated stuff.
data lake is where all your messy historical timestamped immutable data goes so it's not lost. data warehouse is where you make sense of it. and your old sql server is just the current snapshot.
A SQL instance is fast and meant for transactional systems, like a stock exchange or purchasing something. This is called schema-on-write.
DW is for analytics and reporting.
Data Lake is like many DWs together, plus other, often "garbage" data that "might" be useful for future analysis, ML and stuff. It's the unstructured graveyard of data (joking). Schema is defined on read.
My impression? Data warehouses are more fully featured than a data lake, whereas a data lake implies primarily the storage, with other systems querying it. Sql server is orthogonal in that you need neither if all your data fits in a single sql database (or alternatively, sql server is a small scale data warehouse).
Snowflake has put in more effort around security than I've seen from other data warehouses (which have offered me, e.g., AWS's SOC 3 rather than their own SOC 2 Type 2).
Just my experience. Glad to see them reaching for cash. They're effective at what they do.
Snowflake is by far the best data warehouse I have ever used. I would use it at any job where data warehousing was a keystone of our work. Really 10/10, not even close.
Dumb question: given the filing today, what is the earliest date it will be listed on the NYSE? Googling "time between S-1 and IPO" gives a bunch of wishy-washy answers.
With all the hype over the last few years, I thought they had half a billion in revenue; instead it's a paltry $265MM in 2020 (per their chart) and a loss of $365MM. In comparison, Teradata had $2B revenue in 2019 (market cap < $3B). Just another VC-fueled play. Wait a year after the IPO; the real value will become clear.
Remember the heyday of Cloudera, Hortonworks, and the big data hype cycle of the recent past. It is instructive to look at the current valuations of those one-time high-fliers (all these vendors sell to the very same end customers). I know cloud is the current hype, just like big data was 5 years ago. Further, all the primary cloud players (Google, Amazon, Microsoft) have their own cloud DBs. Very competitive market. It is one thing for the VCs and their friends to push it to friendly data centers; the market will eventually reveal the "real value". Probably worth $4B or less in a year (after the early and late VCs have cashed out).
They doubled revenue over the last year and expanded gross margin by 25% ... seems good. Looks like most of the incremental loss comes from expanded marketing costs - but their net revenue retention rate is between 150% and 200%, which is extremely strong. Unfortunately no reported cohort metrics.
Absolutely phenomenal product and company. I have a huge enterprise SaaS crush on these folks. Very solid, thoughtful team as well in my experience.
Their massive marketing spend is interesting. I suspect they perceive themselves to be the first (or at least strongest) mover in a once-in-a-generation land grab.
Will they be contributing back to FoundationDB (FDB)? I ask this because they've used FDB since 2014 to build the Snowflake metadata store, and they've supposedly advanced FDB in the process, and I assume some of those advances are generic and would benefit all FDB users. Now that they'll have the IPO cash, I hope they will contribute back. For example, I know FDB maintainers would love help with optimizing FDB for EKS and other managed K8s platforms on major clouds. Congrats on the IPO! I might buy a few shares :-)