- we now have closed source Amazon Aurora infrastructure that boasts performance gains that may never make it back upstream (who knows if it's just hardware or software or what behind the scenes here)
- we now have Amazon DocumentDB that is a closed source MongoDB-like scripting interface with Postgres under the hood
- lastly, with this news, looks like Microsoft is now doubling down on the same strategy to build out infrastructure and _possibly_ closed source "forked" wins on top of the beautiful open source world that is Postgres
Please, please, please let's be sure to upstream! I love the cloud, but when I go to "snapshot" and "restore" my PG DB I want a little transparency into how y'all are doing this. Same with DocumentDB; I'd love an article on how they are using JSONB indices at this supposed scale! Not trying to throw shade; just raising my eyebrows a little.
As we continue to grow GitLab, Citus is the leading option to scale our database out. I'm glad that this option will still be there tomorrow.
Yep, I love the fact that y'all went the extension route much like https://www.timescale.com/ and others.
(yes yes, I'm biased, I worked my ass off making that happen)
Moreover, using copyleft software doesn't mean that using it forces you to release code. There are specific interactions that trigger the sharing clause in, for example, the GPL, such as distribution, linking, and so on. There remain many, many uses that allow commercialization without running afoul of the copyleft nature of the GPL.
I am commenting because I have seen this sentiment repeated ad nauseam on here and, maybe that's not what you meant, but I felt the need to clarify. Moreover, if the code is not AGPL, most online uses do not run afoul, because the code products (say, executables) are not themselves being distributed. AGPL was formulated to close this loophole, but GPL code is free from this.
Now you can move to Azure without having to change your code.
You can’t get those types of features in a vendor neutral way.
Sorry, not buying it.
Anybody know anything about them contributing money rather than code?
"It couldn't be that hard. I could make a Twitter clone in a week by hiring some people from UpWork"
Did I mention point in time recovery or the architecture behind Serverless Aurora?
It's similar with Redshift although it's a much older codebase from the v8 branch with more customizations. The changes are very specific to their infrastructure and wouldn't help anyone else since it's not designed as an on-prem deployable product.
There's also no confirmation that DocumentDB runs on Postgres, and it's most likely a custom interface layer they wrote themselves. If you just want MongoDB on Postgres, there are already open source projects that do it.
Another commercial MPP database based on Postgres 8.x, GreenplumDB, was open-sourced a few years back. The changes are so extensive that there's little hope of catching up with the current Postgres codebase. Given the focus on OLAP and analytics over OLTP, there might not even be a strong motivation to catch up, either.
The biggest hurdle was that after 8.2, the on-disk storage format changed. The traditional way you upgrade PostgreSQL is to dump the data and re-import it.
This is basically a non-starter with MPP, for the simple reason that there is just too much data. Given the available engineering bandwidth, Greenplum for a long time didn't try to cross that bridge. When the decision was made to merge up to mainline, coming up with a safe format migration was the major obstacle.
Disclosure: I work for Pivotal, the main contributor to Greenplum, though in a different group.
The planner is basically entirely different. It was later extracted into a standalone module to build HAWQ. Since then there has been work to build a new shared query planner called GPORCA.
But yes, the changes are not viable to upstream without so many modifications that fundamentally change the database. Pluggable storage in mainline would be a good first step.
Here's the short story (and I know all of this because the guy who invented the core engine for ParAccel's MPP columnar tech, that is the foundation for Redshift, is one of our early advisors).
- ParAccel developed the tech for a columnar storage database. I believe it was called "matrix"
- Amazon AWS bought the source code from ParAccel, limited for use as a cloud service, i.e. they couldn't create another on-premise version that would compete with ParAccel
- ParAccel then sold to Actian, and a few years ago Actian shelved the product as clearly the on-premise world had lost to cloud warehouses.
The reason AWS bought the source code was time-to-market. It would have taken too long to build a product from scratch, and customers were asking for a cloud warehouse. Back then, ParAccel had by far the best and fastest MPP / columnar tech, plus it's very attractive since it's based on Postgres.
So Actian and Amazon AWS essentially had the same tech, just different distribution models. One is on-premise (Actian), the other one a managed cloud service (AWS). We all know who won.
There's a very interesting paper by the Amazon RDS team (where Redshift rolls up). It's not only about "faster, better, cheaper" - it really is about simplicity, and that's what Redshift delivered on.
Spin up a cluster in less than 5 minutes and get results within 15 min. Keep in mind, this was all in late 2012, so what appears "normal" today was pure magic back then.
But ever since the "fork", i.e. when AWS purchased a snapshot-in-time of the code base, the products have obviously diverged. There's some 8 years of development now in Amazon Redshift.
Redshift delivered on sales and marketing.
Amazon made their fortune on the backs of open source contributors many times, and this is just another one of those times.
It's one thing to take a database and fork it towards a specific narrow focus and runtime; it's entirely different to try to put those changes back and make the original database more capable in a general environment.
Now the Affero GPL prevents this, but Stallman has always been crystal clear that he sees the "service provider loophole" as an OK thing, not something evil.
Q: All right. Now, I've heard described what is called an Application Service Provider - an "ASP loophole"...
Richard Stallman: Well, I think that term is misleading. I don't think that there is a loophole in GPL version 2 concerning running modified versions on a server. However, there are people who would like to release programs that are free and that require server operators to make their modifications available. So that's what the Affero GPL is designed to do. And, so we're arranging for compatibility between GPL version 3 and the Affero GPL. So we're going to do the job that those developers want, but I don't think it's right to talk about it in terms of a loophole.
Q: Very well.
Richard Stallman: The main job of the GPL is to make sure that every user has freedom, and there's no loophole in that relating to ASPs in GPL version 2.
I just want to be cautious that the more we use services like Aurora, the more we're relying on our cloud providers to maintain stability with the core Postgres API/internals while they do some fanciness under the hood to optimize their hardware (if that makes sense).
It's a differentiator for some of their workloads: you don't have to hand your business over to a black box.
The HN community did a little bit of reverse engineering.
This argument cuts both ways.
If we're at the point where Amazon can just re-implement whatever project they want, more or less from scratch, I'm not sure there's any license that can save us. :(
The question is of game theory. MongoDB Inc. invested a lot into developing MongoDB, they figured out the right semantics for a lot of things, trade offs, UX/DX (user and dev experience), and so on. (Recently Mongo4 added transactions. Which is a very very very handy thing in any DB.) But MongoDB calculated that they will recoup their investment because they are in the best position to provide support and even to operate Mongo, all while keeping Mongo sufficiently open source (you can run it yourself on as big a cluster as you please, you can modify it for yourself and don't have to tell anyone - unless you're selling it as a service, pretty standard AGPL).
Now AWS took the API and some of that knowhow, and invested in creating something that's not open source at all. You can't learn anything from it. You're not vendor locked in, because the API is standardized, but other than that, it takes away a revenue stream from MongoDB Inc. (Sure, competition is good. DocumentDB-Mongo is probably cheaper than MongoDB Inc.'s Atlas.)
But the question is, will this result in less/slower/lower-quality development of MongoDB itself?
Big MongoDB clusters at enterprise companies are not likely to upgrade and evolve in place; they usually get replaced wholesale, but they would have provided the revenue for MongoDB Inc to continue R&D and to provide that next gen replacement. Now ... it'll likely be AWS something-something. Which will probably be closed source (like DocumentDB) and at best have an open API (like DocDB Mongo).
Is it fair? Is it Good for the People? Who knows, these are hard questions, but a lot of people feel that AWS doing this somehow robs the greater open source community, and it cements APIs, concentrates even more economic power, and so on.
Citus is an extension, not a fork.
So neither of these projects are doing Postgres a dis-service. Both are actually pretty heavily aligned with the continued success and maintenance of mainline open source Postgres.
I don't think this is an accurate analysis. For one, they had to make a lot of independent improvements to keep performance from regressing horribly after their changes, and a lot of those could be upstreamed. Similarly, they could help with the effort to make table storage pluggable, but they haven't, instead opting to just patch things out.
> Citus is an extension, not a fork.
Used to be a fork though.
> Both are actually pretty heavily aligned with the continued success and maintenance of mainline open source Postgres.
How is Amazon meaningfully involved in the maintenance of open source postgres?
Virtually all of the companies built on open source products in the past few years stopped centering their focus on being the best place to run said open source program, and instead hold back performance and feature improvements as proprietary rather than pushing them upstream.
The performance benefits of Aurora over Postgres are mostly because Amazon rewrote the storage engine to run on top of their infrastructure.
OpenJDK, for example, and the API/ABI (whatever you want to call it) copyright fight, and now MongoDB with DocumentDB, etc.
We’ve been living with abstractions over the underlying infrastructure for over 45 years.
And quite frankly, that end functionality is what customers are paying for, so that they don’t have to care about all the technical details and operational overhead. It's not like open-source Postgres is being halted by this. The Citus extension itself is open-source too.
To clarify, Amazon DocumentDB uses the Aurora storage engine, which is the same proprietary storage engine that is used by Aurora MySQL and Aurora PostgreSQL, and gives you multi-facility durability by writing 6 copies of your data across 3 facilities, with a 4 of 6 quorum before writes are acknowledged back to the client.
So, it's a bit inaccurate to say that DocumentDB has anything to do with Postgres.
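The 4-of-6 quorum described above is easy to model. Here's a minimal illustrative sketch (not Amazon's actual code; replica layout and names are assumptions for illustration) showing why a full facility can be lost without blocking writes:

```python
# Toy model of a 4-of-6 write quorum across 3 facilities, as described
# for the Aurora storage engine. Illustrative only, not AWS internals.
FACILITIES = 3
REPLICAS_PER_FACILITY = 2
WRITE_QUORUM = 4

def quorum_write(replica_acks):
    """Return True if enough replicas acknowledged the write.

    replica_acks: list of 6 booleans, one per replica.
    """
    assert len(replica_acks) == FACILITIES * REPLICAS_PER_FACILITY
    return sum(replica_acks) >= WRITE_QUORUM

# Losing one whole facility (2 replicas) still leaves a quorum:
print(quorum_write([True, True, True, True, False, False]))  # True
# Losing 3 replicas drops below quorum and the write can't be acked:
print(quorum_write([True, True, True, False, False, False]))  # False
```

The design point is that any 2-replica facility can vanish entirely while writes keep succeeding, since 4 acknowledging replicas always remain.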
Given Microsoft's change in operation over recent years there's also hope that they can continue their contributions into the future.
It's fascinating to see Microsoft leave behind the "embrace, extend, extinguish" narrative only to have Amazon adopt it, causing massive rifts and action within the database community. I am genuinely concerned about the future of open source software in this continued scenario.
An article with what I considered an outrageous headline ("Is Amazon 'strip mining' open source?") has only rung more true over time. Amazon is one of the largest companies on earth, selling products that they receive for free but never improve, attacking the primary open source provider, and then shifting toward their comparable proprietary closed offerings.
Hopefully new ways to "give back", such as equity contribution, can be one of the many paths forward needed to keep open source software healthy. Given how much innovation is unlocked by this, it'd be a crime to go back to the past era.
: From , "Jay Kreps, a creator of Kafka and co-founder and CEO of Confluent ... said Amazon has not contributed a single line of code to the Apache Kafka open-source software and is not reselling Confluent’s cloud tool."
They made a decent cloud business model out of it (no idea how successful but everyone I asked was happy with it).
I just hope Microsoft allows the tech to evolve as open source!
Current Microsoft sure will. They're good with open source stuff.
Considering the competitive database landscape, this is a compelling offering to add to any cloud portfolio. Congrats to the Citus team.
It feels like it was the first. If so, it means bad news for the Citus product, as it will most likely be ignored for a while. That would be really sad, as I don't know of any actively supported automated sharding solution for PgSQL other than Citus. There is PostgresXL, but there isn't much focus on making it community friendly.
Half the team would probably wander off to work for one of the other postgres-centered companies (and quite possibly continue to work on the open source Citus code).
When we first started building Citus, we began on the OLAP side. As time has gone on Citus has evolved to have full transactional support, first when targeting a single shard, and now fully distributed across your Citus database cluster.
Today, we find most who use the Citus database do so for either:
(OLTP) Fully transactional database powering their system of record or system of engagement (often multi-tenant)
(HTAP) For providing real-time insights directly to internal or external users across large amounts of data.
Similarly, I really appreciate MS-SQL for Linux on Docker, as it is a lot easier to set up for CI/CD and locally for dev and testing, and it is nearly transparent moving to Azure SQL or MS SQL Enterprise for hosted deployments. I'd much rather use PostgreSQL with PLv8 than MS-SQL though.
What's more worrying to me is if they try to do both - build out a Citus offering on Azure, and simultaneously try to keep high-reliability of their AWS Citus Cloud, which may be the most reliable option for some time. It's tough for any organization, no matter how much capital has been injected, to keep a laser-sharp eye on two inevitably-competing initiatives, each of which have their own performance and automation characteristics. I don't want the one person in the company who knows, say, cloud hard drive recovery patterns like the back of their hand and had previously been the EBS guru, to suddenly be pulled into the new Azure optimization project... and that's not something that capital injections can necessarily fix.
That said, this could accelerate their development timelines overall, and it guarantees stability for the product for quite a while. Overall I think this is good news! Citus is one of those things that you want to have in your back pocket when building any type of app on Postgres, and we certainly see it in our company as a long-term "escape hatch" when we're forced to make database-heavy design decisions at currently relatively-small scale. This deal keeps it alive and prospering!
The Googles of the world lose out on professional services, but Microsoft could still make a bundle of money by just consulting on the tools without even managing them. You might even make higher margins by not managing the service.
Think of them like a wedding event rental company: they are more than happy to rent you their own brand of tables, flatware, and silverware, but if you want another brand that's fine too, as long as you buy from them.
I expect they'll also try to port Citus Data functionality to the SQL Server platform.
Also, PostgreSQL has improved by large margins in the last five years. The software is more akin to a pyramid than a skyscraper. The foundation took a long time, but now that it's complete there is a strong base for growth.
It also lets teams own their DB infrastructure and play around with deployment patterns that'd make little sense on Oracle et al because of cost reasons.
It's not that Postgres is "killing" commercial databases, it's just that more and more people are, as you said, noticing that they don't need a commercial database for lots of use cases. And support and consultancy -- and even in-team skills -- for Postgres are often available.
SQL Server platform already has one of the most advanced optimizers and distributed planning with its use in Polybase, stretch tables, SQL MPP and Azure SQL DW.
However, I manage roughly 2000 instances of commercial databases. I'd say maybe a tenth could not be hosted on Postgres.
It gives postgres a huge disruption potential and the management, in all the big firms I know, is actively looking at it.
I'm sure PG can take on more of the standard RDBMS workloads today but I don't think that's really making a big dent on SQL Server as the bulk of their revenue comes from the serious enterprise scenarios.
I'd say postgres can take 90% of the revenues.
I've worked in 3 of the top 10 european banks. Most of the SQL Server instances do not even need partitioning, for example. Let alone always on, hekaton and so forth.
People mostly buy peace of mind. Until they are charged millions and start questioning the stupidly expensive bills, asking "do we need that?"
Really I have all the metrics to back this up: CPU usage/Availability requirements/data size... It is literally my job to collect those.
Also, proper window function support, great PostGIS, great JSON support, and open source ecosystem support are no small things, and they are huge cost savers.
If you mean the licensing and support from commercial distributions and vendors then, as you already recognize, these decisions will fall to which vendor they trust more. That will usually end up being Microsoft.
Enterprises are more than just banks, and vendor relationships matter. They're not fungible and are rarely based on price.
As I said, "If you can use Postgres then you didn't need a commercial system in the first place."
15 years ago Windows Server was far more advanced than any Linux distribution. What does the server OS market look like today?
Aster and Greenplum perform exceptionally for what they do. If SQL Server were better, that's exactly what companies would use. Commercial engines are definitely more advanced. But some Postgres derivatives can absolutely outperform enterprise platforms in some use cases, and vice-versa.
Microsoft doesn't regard PostgreSQL as a direct threat to SQL Server and it's happy to make money where it can so long as it doesn't perceive something to be a mortal threat.
Azure is happy to take your money to run SQL Server or Postgres, just like how Amazon has been running Aurora side-by-side with Oracle and SQL Server for years now.
Azure Database for PostgreSQL competes against RDS and to a lesser extent, Google's Cloud SQL for PostgreSQL.
They'll still sell MS SQL Server too. But sometimes PostgreSQL is a better fit for your stack, or your preference, and they want your money for them to provide that too.
MS could certainly also do good business selling Docker's friendly Enterprise orchestration tools (including their new Kubernetes based tool) which check all the Enterprise requirements for security, policy, identity management etc.
Also Cstore_fdw is rather obsolete and more of an experiment. It's a rough wrapper around ORC files and is missing many features, advancements and an execution engine to match the performance and usability of a real OLAP database.
If anything, that makes it more likely to be an acquihire.
Shouldn't sustainability be the primary goal instead of making big bucks temporarily?
e.g. something like the free "tiny turtle" plan here: https://www.elephantsql.com/plans.html, but that can auto-scale up to citus-scale things, without user intervention, as needed.
Why do people think this is good, when the MySQL acquisition by Oracle was bad? Both companies have an internal conflict of interest, already owning closed-source SQL servers, and both have a history of attacking open source communities.
Is this a move to defend SQL Server or expand Azure’s PG offering?
Knowing that Citus is available as an option if you need to scale makes Postgres that much more compelling of a default choice for data store.