In the latest world of Postgres:

- we now have closed source Amazon Aurora infrastructure that boasts performance gains that might never make it back upstream (who knows if it's just hardware or software or what behind the scenes here)

- we now have Amazon DocumentDB that is a closed source MongoDB-like scripting interface with Postgres under the hood

- lastly, with this news, looks like Microsoft is now doubling down on the same strategy to build out infrastructure and _possibly_ closed source "forked" wins on top of the beautiful open source world that is Postgres

Please, please, please let's be sure to upstream! I love the cloud but when I go to "snapshot" and "restore" my PG DB I want a little transparency how y'all are doing this. Same with DocumentDB; I'd love an article of how they are using JSONB indices at this supposed scale! Not trying to throw shade; just raising my eyebrows a little.
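
(For reference, the stock-Postgres version of "JSONB indices" is a GIN index over a JSONB column. A minimal sketch, with a made-up docs table, and no claim that this is what DocumentDB actually does:

    CREATE TABLE docs (id bigserial PRIMARY KEY, body jsonb);
    -- jsonb_path_ops only supports containment (@>) lookups, but is
    -- smaller and faster for them than the default jsonb_ops.
    CREATE INDEX docs_body_idx ON docs USING GIN (body jsonb_path_ops);
    SELECT id FROM docs WHERE body @> '{"status": "active"}';

How that holds up at their supposed scale is exactly what I'd want the article to cover.)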




Craig here from Citus. We're actually a bit different from past forks. Many years ago Citus itself was a fork, but about 3 years ago we became a pure extension[1]. This means we hook into lower level extension APIs[2] that exist within Postgres and are able to stay current with the latest Postgres versions. (There's a rough sketch of what that looks like below the links.)

[1]. https://www.citusdata.com/blog/2016/03/24/citus-unforks-goes...

[2]. https://www.citusdata.com/blog/2017/10/25/what-it-means-to-b...
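
For those who haven't seen the extension model in action, the user-facing side looks roughly like this; github_events and repo_id are made-up names, and citus has to be in shared_preload_libraries first:

    CREATE EXTENSION citus;
    -- Shard an ordinary Postgres table across the worker nodes:
    SELECT create_distributed_table('github_events', 'repo_id');

Everything goes through the regular extension and hook APIs, so what's running underneath is a stock Postgres binary.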


Congrats on the acquisition. I love that the complete extension is open source and will stay available: "And we will continue to actively participate in the Postgres community, working on the Citus open source extension as well as the other open source Postgres extensions you love.".

As we continue to grow GitLab, Citus is the leading option for scaling our database out. I'm glad that this option will still be there tomorrow.


As a very happy Citus customer, the extension being open source is very important. And at the same time, I hope I never have to manage my own clusters again and who better to manage it than the team that built it.


Holy wow! Thanks for the response!

Yep, I love the fact that y'all went the extension route much like https://www.timescale.com/ and others.


I think Citus was the first PG fork to "unfork"...

(yes yes, I'm biased, I worked my ass off making that happen)


user here: can confirm.


If the creators of Postgres wanted all improvements to be upstreamed, they wouldn’t have released under a permissive license. The ability to use Postgres commercially without exposing your entire codebase to copyleft risk is one of the reasons it’s used commercially in the first place.


This is a big assumption. There are many reasons to release something under a permissive license – not everyone who releases BSD-like code is actively choosing to deprioritize upstreaming. Rather, they are choosing a less restrictive license, which has advantages beyond avoiding copyleft.

Moreover, using copyleft software doesn't force you to release your code. Specific interactions trigger the sharing clause in, for example, the GPL, such as distribution, linking, and so on. There remain many, many uses that allow commercialization and do not run afoul of the copyleft nature of the GPL.

I am commenting because I have seen this sentiment repeated ad nauseam on here and, maybe that's not what you meant, but I felt the need to clarify. Moreover, if the code is not AGPL, most online uses do not run afoul of it, because the code products (say, executables) are not themselves being distributed. The AGPL was formulated to close this loophole, but plain GPL code is free of it.


And this is a benefit to prevent lock-in. Amazon’s OLAP database, Redshift, is protocol compliant with Postgres. Even if you won’t get the performance benefits of Redshift if you move to a standard Postgres, at least you don’t have to change your code.

Now you can move to Azure without having to change your code.


That doesn’t really prevent lock-in. You may not be locked in now, but that compatibility can end whenever Amazon wants it to end.


If that compatibility ends, it might as well be a new product. Every client that connects to Redshift uses a standard Postgres driver.


100% agree. I'm just wary of all these "mini optimizations" that all these cloud providers are about to start doing differently.


They don’t have a choice. Their infrastructure is different. For instance, Aurora integrates with IAM for authentication and has extensions to load and save to S3. Aurora writes to six different disks across three availability zones and read replicas are synchronous because they read from one of the disks that Aurora is already writing to for high availability.

You can’t get those types of features in a vendor neutral way.
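
To make that concrete: the S3 piece surfaces as an extension on the Aurora/RDS side. A rough sketch, with a made-up table and bucket, and the exact signature may differ by version:

    CREATE EXTENSION aws_s3 CASCADE;
    -- Bulk-load a CSV object from S3 straight into a table:
    SELECT aws_s3.table_import_from_s3(
        'events', '', '(format csv)',
        aws_commons.create_s3_uri('my-bucket', 'events.csv', 'us-east-1'));

None of that means anything on a box running outside AWS.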


The IAM stuff requires using a few hooks, at most adding minor modifications that could be upstreamed. Storage is obviously harder, but I and others are working on making that pluggable in core. Amazon's contributions to core, apart from a few smaller things like command-line tools: zero.

Sorry, not buying it.


While not offering commits, I was under the impression that Amazon had contributed some funding. But I just went off searching for that, and can't find any evidence of this either.

Anybody know anything about them contributing money rather than code?


They've provided AWS credits, yes.


https://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect

or

"It couldn't be that hard. I could make a Twitter clone in a week by hiring some people from UpWork"

Did I mention point in time recovery or the architecture behind Serverless Aurora?


Yea, right... I don't know anything about how hard this stuff is, I'm just a postgres developer & committer, currently working on making postgres' storage pluggable.
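
(For the curious: that work eventually shipped in Postgres 12 as "table access methods". A minimal sketch of the SQL surface, where heap2 is a made-up name that simply reuses the built-in heap handler; a vendor storage engine would supply its own handler function instead:

    CREATE ACCESS METHOD heap2 TYPE TABLE HANDLER heap_tableam_handler;
    CREATE TABLE events (id bigint, payload jsonb) USING heap2;

That's the seam where something like Aurora's storage could, in principle, plug in.)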


When I publish code using a permissive license I want people to contribute back under the same license. But I don't want to force them to.


The good news is that we'll have another reliable, growing, potentially profitable, PostgreSQL company up and running in no time.


Amazon Aurora doesn't have much to do with Postgres and is a custom storage subsystem used by many different database engines. Aurora Postgres is actually using Postgres code on top to handle queries, and eventually PG itself will get pluggable storage engines.

It's similar with Redshift although it's a much older codebase from the v8 branch with more customizations. The changes are very specific to their infrastructure and wouldn't help anyone else since it's not designed as an on-prem deployable product.

There's also no confirmation that DocumentDB runs on Postgres; it's most likely a custom interface layer they wrote themselves. If you just want MongoDB on Postgres, there are already open source projects that do it.


Redshift isn't even developed by Amazon — it's a commercial product called ParAccel, which they license (and modify, presumably).

Another commercial MPP database based on Postgres 8.x, GreenplumDB, was open-sourced a few years back. The changes are so extensive that there's little hope of catching up with the current Postgres codebase. Given the focus on OLAP and analytics over OLTP, there might not even be a strong motivation to catch up, either.


Worth noting that Greenplum is being moved forward. While progress initially seemed, from the outside, fairly slow going, later versions came more quickly, apparently largely because of fixing some technical debt. They're catching up to 9.4 in their development branch, I believe. For years they were on 8.2...


It is an explicit goal of the Greenplum team to merge up to the PostgreSQL mainline. It is a heroic effort to apply tens of thousands of patches, consolidate them with heavily forked and modified subsystems, while maintaining reliability and performance.

The biggest hurdle was that after 8.2, the on-disk storage format changed. The traditional way you upgrade PostgreSQL is to dump the data and re-import it.

This is basically a non-starter with MPP, for the simple reason that there is just too much data. Given the available engineering bandwidth, Greenplum for a long time didn't try to cross that bridge. When the decision was made to merge up to mainline, coming up with a safe format migration was the major obstacle.

Disclosure: I work for Pivotal, the main contributor to Greenplum, though in a different group.


Very interesting, thanks. I can't begin to imagine how you'd catch up a decade-old codebase patch by patch. During Greenplum's development, did the team try to avoid modifying core parts of Postgres so that keeping the codebase in sync would be easier? For example, I imagine that you wouldn't need to touch most of the storage engine (page/row management, indices etc.), but you'd have to modify the planner quite a bit.


I think that basically the disk format change was too far a bridge to cross. You can't tell customers that they should quadruple their cluster so that they can dump their data and reimport it to perform the upgrade.

The planner is basically entirely different. It was later extracted into a standalone module to build HAWQ[0]. Since then there has been work to build a new shared query planner called GPORCA[1].

[0] https://hawq.apache.org/

[1] http://engineering.pivotal.io/post/gporca-open-source/


The original codebase came from ParAccel which itself was later acquired by Actian, but Redshift is definitely owned and developed by Amazon.

But yes, the changes are not viable to upstream without so many modifications that fundamentally change the database. Pluggable storage in mainline would be a good first step.


Looks like Amazon actually bought the source code and forked it. This Quora thread has some replies by people from Actian: https://www.quora.com/Amazon-redshift-uses-Actians-ParaAccel....


I'm the guy who wrote the reply that pops up first when you read that Quora thread.

Here's the short story (and I know all of this because the guy who invented the core engine for ParAccel's MPP columnar tech, which is the foundation for Redshift, is one of our early advisors).

- ParAccel developed the tech for a columnar storage database. I believe it was called "matrix"

- Amazon AWS bought the source code from ParAccel, limited for use as a cloud service, i.e. they couldn't create another on-premise version that would compete with ParAccel

- ParAccel then sold to Actian, and a few years ago Actian shelved the product as clearly the on-premise world had lost to cloud warehouses.

The reason AWS bought the source code was time-to-market. It would have taken too long to build a product from scratch, and customers were asking for a cloud warehouse. Back then, ParAccel had by far the best and fastest MPP/columnar tech, and it was very attractive since it was based on Postgres.

So Actian and Amazon AWS essentially had the same tech, just different distribution models. One is on-premise (Actian), the other one a managed cloud service (AWS). We all know who won.

There's a very interesting paper by the Amazon RDS team (where Redshift rolls up). It's not only about "faster, better, cheaper" - it really is about simplicity, and that's what Redshift delivered on.

https://event.cwi.nl/lsde/papers/p1917-gupta.pdf

Spin up a cluster in less than 5 minutes and get results within 15 min. Keep in mind, this was all in late 2012, so what appears "normal" today was pure magic back then.

But ever since the "fork", i.e. when AWS purchased a snapshot in time of the code base, the products have obviously diverged. There are now some 8 years of development in Amazon Redshift.


In 2012, neither a column store nor spinning up a cluster was magic or state of the art.

Redshift delivered on sales and marketing.

Amazon made their fortune on the backs of open source contributors many times, and this is just another one of those times.



Yes, we already covered that. :-)


Redshift isn’t just a custom storage tier atop an older version of Postgres. It has an entirely different execution engine that compiles commands to an executable and farms them out to multiple data nodes.


Yes, I was trying to keep it simple but that detail only further supports the fact that these changes have no value to upstream.


That's a pretty strong claim. There's bound to be one Apache project (even a future one) that could utilize that code/knowhow.


What does Postgres have to do with Apache? The basics for building OLAP systems are already well known and Apache has several projects (Druid, Calcite, Arrow, Parquet, ORC, Drill) that are related to it.

It's one thing to take a database and fork it towards a specific narrow focus and runtime, it's entirely different to try and put those changes back and make the original database more capable in a general environment.


I replied to a few others further down the thread that had similar thoughts as these.


CitusData made tons of improvements to upstream postgresql, though. Can’t say that about Amazon.


OP is referring to cloud providers' habit of building cloud services on open source platforms without contributing back to the community.


Despite what people say, Stallman was a visionary in a sense with the GPL; we're seeing that today more than ever.


Stallman is totally ok with the "GPL loophole" that allows service providers not to give back their changes, since they aren't redistributing the software. If Postgres and all the Citus stuff were GPL, this wouldn't really change anything.

Now the Affero GPL prevents this, but Stallman has always been crystal clear that he sees the "service provider loophole" as an ok thing and not evil.


RMS is totally NOT okay with this, and has written extensively on the issue; he sees SaaS as similar to proprietary software[0]

[0] https://www.gnu.org/philosophy/who-does-that-server-really-s...


Ah then perhaps I misunderstood this interview[0].

""" Q: All right. Now, I've heard described what is called an Application Service Provider - an "ASP loophole"...

Richard Stallman: Well, I think that term is misleading. I don't think that there is a loophole in GPL version 2 concerning running modified versions on a server. However, there are people who would like to release programs that are free and that require server operators to make their modifications available. So that's what the Affero GPL is designed to do. And, so we're arranging for compatibility between GPL version 3 and the Affero GPL. So we're going to do the job that those developers want, but I don't think it's right to talk about it in terms of a loophole.

Q: Very well.

[7:50]

Richard Stallman: The main job of the GPL is to make sure that every user has freedom, and there's no loophole in that relating to ASPs in GPL version 2. """

[0] http://www.groklaw.net/articlebasic.php?story=20070403114157...


The article keepper posted was written in 2010, while the interview you linked to happened in 2007. Like any of us, RMS's beliefs and opinions evolve over time.


Love or hate Richard Stallman, he is unbelievably resolute in his points of view. They've rarely changed, even though GNU effectively "lost" and open source is generally seen as more business friendly. You've got to give the guy credit where it's due: he's preaching almost exactly the same thing today that he was preaching before I had ever used a computer.


I think you may be over-analyzing this. I think his point is simply that the GPL has no loophole in the sense of being intentionally designed to be worked around by SaaS providers; it was designed at a time, and primarily for desktop software, when this was not a common concern. It was later found to be a problem, hence the Affero GPL. He specifically says v3 is more compatible with the Affero GPL.


If these platforms didn’t want cloud providers using their products they shouldn’t have released their products as open source. They can’t have their cake and eat it too.


It's not about that, though. It's amazing that Postgres is taking off like this!

I just want to be cautious that the more we use services like Aurora, the more we're relying on our cloud providers to maintain stability with the core Postgres API/internals while they do some fanciness under the hood to optimize their hardware (if that makes sense).


But at least they open sourced their fork, designed for data warehousing, before this happened:

https://www.citusdata.com/product/community


Kudos to Azure for opening so much of what they do. Lots of Kubernetes work, including AKS-engine, which runs their k8s implementation. A machine learning toolkit. Media services (faceid etc) as a container. The whole Azure shebang runs on Service Fabric, which they've also open sourced.

It's a differentiator for some of their workloads: you don't have to hand your business over to a black box.


Aurora databases and DocumentDB share the same underlying reliable single-writer, many-reader block device for storage. That is all the magic. Not sure where you got the idea that DocumentDB has Postgres underneath it.


See this thread: https://news.ycombinator.com/item?id=18869755

The HN community did a little bit of reverse engineering.


I think they are wrong and Amazon is just sharing code with their Postgres layer.


That was more guessing than reverse engineering, no?


I get what you're saying, but BSD licenses are specifically designed to permit things not being sent upstream. I don't understand why people moan about companies complying with the license agreement.


Your argument is legal and my argument is moral =P


So, castrating independent economic activity and forcing people to be subservient based on other activities is 'moral'?

This argument cuts both ways.


your moral assertion isn’t an argument, because it contradicts the license chosen by the relevant people


This is what happens in a world devoid of the GPL, or where a large majority doesn't sponsor the work of upstream.


MongoDB was already under the AGPL; Amazon just replicated the API on top of their own storage engine (or an existing permissive-licensed storage engine? Who knows?).

If we're at the point where Amazon can just re-implement whatever project they want, more or less from scratch, I'm not sure there's any license that can save us. :(


Save us from... companies writing software? Amazon's new project is a completely different database that happens to share an interface. At what point do we acknowledge that it's their own work?


That's not the question.

The question is one of game theory. MongoDB Inc. invested a lot into developing MongoDB; they figured out the right semantics for a lot of things, trade-offs, UX/DX (user and dev experience), and so on. (Recently Mongo 4 added transactions, which is a very handy thing in any DB.) But MongoDB calculated that they would recoup their investment because they are in the best position to provide support and even to operate Mongo, all while keeping Mongo sufficiently open source (you can run it yourself on as big a cluster as you please, and you can modify it for yourself without telling anyone - unless you're selling it as a service; pretty standard AGPL).

Now AWS took the API and some of that know-how, and invested in creating something that's not open source at all. You can't learn anything from it. You're not vendor locked in, because the API is standardized, but other than that it takes away a revenue stream from MongoDB Inc. (Sure, competition is good. DocumentDB-Mongo is probably cheaper than MongoDB Inc.'s Atlas.)

But the question is: will this result in less, slower, or lower-quality development of MongoDB itself?

Big MongoDB clusters at enterprise companies are usually not likely to upgrade and evolve; they get replaced wholesale. But they would have provided the revenue for MongoDB Inc. to continue R&D and to build that next-gen replacement. Now it'll likely be AWS something-something, which will probably be closed source (like DocumentDB) and at best have an open API (like DocumentDB's Mongo API).

Is it fair? Is it Good for the People? Who knows, these are hard questions, but a lot of people feel that AWS doing this somehow robs the greater open source community, and it cements APIs, concentrates even more economic power, and so on.


Well said. Unfortunately even in software, it seems that might makes right.


Save independent software companies from having their lunch eaten by the behemoths after doing the legwork of proving product/market fit - no different from any industry dominated by a few large players. Amazon is not in the wrong for building a competing product, but it's good for the market if the scrappy underdogs have a few edges in their favor.


Well, reimplementing an API is essentially the Oracle/Google Java case, right?


Amazon has explained in their reinvent videos that Aurora is the storage layer of Postgres rewritten to be tightly coupled to their AWS infrastructure. So it is just regular Postgres (they upgrade to latest on a slightly slower cadence). And there’s no benefit to getting the Aurora layer upstream, no one else could use it anyway.

Citus is an extension, not a fork.

So neither of these projects are doing Postgres a dis-service. Both are actually pretty heavily aligned with the continued success and maintenance of mainline open source Postgres.


> Amazon has explained in their reinvent videos that Aurora is the storage layer of Postgres rewritten to be tightly coupled to their AWS infrastructure. So it is just regular Postgres (they upgrade to latest on a slightly slower cadence). And there’s no benefit to getting the Aurora layer upstream, no one else could use it anyway.

I don't think this is an accurate analysis. For one, they had to make a lot of independent improvements to not have performance regress horribly after their changes, and a lot of those could be upstreamed. Similarly, they could help with the effort to make table storage pluggable, but they've not, instead opting to just patch out things.

> Citus is an extension, not a fork.

Used to be a fork though.

> Both are actually pretty heavily aligned with the continued success and maintenance of mainline open source Postgres.

How is Amazon meaningfully involved in the maintenance of open source postgres?


This is the future and it's not just big companies doing it.

Virtually all of the companies built on open source products in the past few years have stopped focusing on being the best place to run said open source program, and instead hold back performance and feature improvements as proprietary rather than pushing them upstream.


> we now have closed source Amazon Aurora infrastructure that boasts performance gains that might never make it back upstream (who knows if it's just hardware or software or what behind the scenes here)

The performance benefits of Aurora over Postgres are mostly because Amazon rewrote the storage engine to run on top of their infrastructure.


All I'm saying is that it looks like Azure and Microsoft are about to do the same.


What good would it do for AWS to send its changes upstream? No one else could use them without the rest of AWS's infrastructure.


We're at a point in time where f(x) = y and we're starting to stop caring about the internals of "f" as long as the "determinism is equivalent" and that scares me.

OpenJDK for example and the API/ABI (whatever you want to call it) copyright and now MongoDB with DocumentDB, etc.


We’ve been at that point since the first PC compatibles came out in the mid 80s with clean room reverse engineered BIOS firmware.

We’ve been living with abstractions over the underlying infrastructure for over 45 years.


Do you know how your CPU works? Or all the networking hardware in the middle of you loading this page? There are thousands of layers in computing; it's impossible to be transparent about them all.

And quite frankly, that end functionality is what customers are paying for, so that they don’t have to care about all the technical details and operational overhead. It's not like open-source Postgres is being halted by this. The Citus extension itself is open-source too.


> - we now have Amazon DocumentDB that is a closed source MongoDB-like scripting interface with Postgres under the hood

To clarify, Amazon DocumentDB uses the Aurora storage engine, which is the same proprietary storage engine that is used by Aurora MySQL and Aurora PostgreSQL, and gives you multi-facility durability by writing 6 copies of your data across 3 facilities, with a 4 of 6 quorum before writes are acknowledged back to the client.

So, it's a bit inaccurate to say that DocumentDB has anything to do with Postgres.
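
(Spelling out the quorum arithmetic: with N = 6 copies, a write quorum of W = 4 and, per Amazon's published Aurora design, a read quorum of R = 3, reads always overlap the latest acknowledged write because W + R = 7 > N, and any two writes overlap because 2W = 8 > N. Losing an entire facility, i.e. 2 copies, still leaves 4 reachable, so writes keep succeeding.)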


There’s evidence to suggest that DocumentDB is actually running Aurora Postgres under the hood. https://news.ycombinator.com/item?id=18870397


I would argue Microsoft's strategy actually makes them more wedded and committed to ensuring the vitality of open source PostgreSQL than anything AWS is doing.



