Hacker News
Microsoft acquires Citus Data (YC S11) (microsoft.com)
707 points by whatok 59 days ago | 187 comments



In the latest world of Postgres:

- we now have closed source Amazon Aurora infrastructure that boasts performance gains that might never see it back upstream (who knows if it's just hardware or software or what behind the scenes here)

- we now have Amazon DocumentDB that is a closed source MongoDB-like scripting interface with Postgres under the hood

- lastly, with this news, looks like Microsoft is now doubling down on the same strategy to build out infrastructure and _possibly_ closed source "forked" wins on top of the beautiful open source world that is Postgres

Please, please, please let's be sure to upstream! I love the cloud but when I go to "snapshot" and "restore" my PG DB I want a little transparency into how y'all are doing this. Same with DocumentDB; I'd love an article on how they are using JSONB indices at this supposed scale! Not trying to throw shade; just raising my eyebrows a little.


Craig here from Citus. We're actually a bit different than past forks. Many years ago Citus itself was a fork, but about 3 years ago we became a pure extension[1]. This means we hook into lower level extension APIs[2] that exist within Postgres and are able to stay current with the latest Postgres versions.

[1]. https://www.citusdata.com/blog/2016/03/24/citus-unforks-goes...

[2]. https://www.citusdata.com/blog/2017/10/25/what-it-means-to-b...


Congrats on the acquisition. I love that the complete extension is open source and will stay available: "And we will continue to actively participate in the Postgres community, working on the Citus open source extension as well as the other open source Postgres extensions you love."

As we continue to grow GitLab, Citus is the leading option to scale our database out. I'm glad that this option will still be there tomorrow.


As a very happy Citus customer, the extension being open source is very important. And at the same time, I hope I never have to manage my own clusters again and who better to manage it than the team that built it.


Holy wow! Thanks for the response!

Yep, I love the fact that y'all went the extension route much like https://www.timescale.com/ and others.


I think Citus was the first PG fork to "unfork"...

(yes yes, I'm biased, I worked my ass off making that happen)


user here: can confirm.


If the creators of Postgres wanted all improvements to be upstreamed, they wouldn’t have released under a permissive license. The ability to use Postgres commercially without exposing your entire codebase to copyleft risk is one of the reasons it’s used commercially in the first place.


This is a big assumption. There are many reasons to release something as non-copyleft – not everyone that releases BSD-like is actively choosing to deprioritize upstreaming. Rather, they are choosing a license that is less restrictive which has other advantages beyond non-copyleft.

Moreover, using copyleft software doesn't mean that using it forces you to release code. There are specific interactions that trigger the sharing clause in, for example, the GPL, such as distribution, linking, and so on. There remain many, many uses that allow commercialization that do not run afoul of the copyleft nature of the GPL.

I am commenting because I have seen this sentiment repeated ad nauseam on here and, maybe that's not what you meant, but I felt the need to clarify. Moreover, if the code is not AGPL, most online uses do not run afoul, because the code product (say, executables) is not itself being distributed. AGPL was formulated to close this loophole, but GPL code is free from this.


And this is a benefit to prevent lock-in. Amazon’s OLAP database, Redshift, is protocol compliant with Postgres. Even if you won’t get the performance benefits of Redshift if you move to a standard Postgres, at least you don’t have to change your code.

Now you can move to Azure without having to change your code.


That doesn’t really prevent lock-in. You may not be locked in now, but that compatibility can end whenever Amazon wants it to end.


If that compatibility ends, it might as well be a new product. Every client that connects to Redshift uses a standard Postgres driver.


100% agree. I'm just wary of all these "mini optimizations" that all these cloud providers are about to start doing differently.


They don’t have a choice. Their infrastructure is different. For instance, Aurora integrates with IAM for authentication and has extensions to load and save to S3. Aurora writes to six different disks across three availability zones and read replicas are synchronous because they read from one of the disks that Aurora is already writing to for high availability.

You can’t get those types of features in a vendor neutral way.


The IAM stuff requires using a few hooks, at most adding minor modifications that could be upstreamed. Storage is obviously harder, but I and others are working on making that pluggable in core. Amazon's core contribution, apart from a few smaller things like commandline tools: 0.

Sorry, not buying it.


While not offering commits, I was under the impression that Amazon had contributed some funding. But I just went off searching for that, and can't find any evidence of this either.

Anybody know anything about them contributing money rather than code?


They've provided AWS credits, yes.


https://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect

or

"It couldn't be that hard. I could make a Twitter clone in a week by hiring some people from UpWork"

Did I mention point in time recovery or the architecture behind Serverless Aurora?


Yea, right... I don't know anything about how hard this stuff is, I'm just a postgres developer & committer, currently working on making postgres' storage pluggable.


When I publish code using a permissive license I want people to contribute back under the same license. But I don't want to force them to.


The good news is that we'll have another reliable, growing, potentially profitable, PostgreSQL company up and running in no time.


Amazon Aurora doesn't have much to do with Postgres and is a custom storage subsystem used by many different database engines. Aurora Postgres is actually using Postgres code on top to handle queries, and eventually PG itself will get pluggable storage engines.

It's similar with Redshift although it's a much older codebase from the v8 branch with more customizations. The changes are very specific to their infrastructure and wouldn't help anyone else since it's not designed as an on-prem deployable product.

There's also no confirmation that DocumentDB runs on Postgres and it's most likely a custom interface layer they wrote themselves. If you just want MongoDB on postgres then there are already open source projects that do it.


Redshift isn't even developed by Amazon — it's a commercial product called ParAccel, which they license (and modify, presumably).

Another commercial MPP database based on Postgres 8.x, GreenplumDB, was open-sourced a few years back. The changes are so extensive that there's little hope of catching up with the current Postgres codebase. Given the focus on OLAP and analytics over OLTP, there might not even be a strong motivation to catch up, either.


Worthwhile to note that Greenplum is being moved forward. While it initially seemed, from the outside, fairly slow-going, it seems that later versions were done more quickly. Apparently largely because of fixing some technical debt. They're catching up to 9.4 in their development branch, I believe. For years they were on 8.2...


It is an explicit goal of the Greenplum team to merge up to the PostgreSQL mainline. It is a heroic effort to apply tens of thousands of patches, consolidate them with heavily forked and modified subsystems, while maintaining reliability and performance.

The biggest hurdle was that after 8.2, the on-disk storage format changed. The traditional way you upgrade PostgreSQL is to dump the data and re-import it.

This is basically a non-starter with MPP, for the simple reason that there is just too much data. Given the available engineering bandwidth, Greenplum for a long time didn't try to cross that bridge. When the decision was made to merge up to mainline, coming up with a safe format migration was the major obstacle.

Disclosure: I work for Pivotal, the main contributor to Greenplum, though in a different group.


Very interesting, thanks. I can't begin to imagine how you'd catch up a decade-old codebase patch by patch. During Greenplum's development, did the team try to avoid modifying core parts of Postgres so that keeping the codebase in sync would be easier? For example, I imagine that you wouldn't need to touch most of the storage engine (page/row management, indices etc.), but you'd have to modify the planner quite a bit.


I think that basically the disk format change was too far a bridge to cross. You can't tell customers that they should quadruple their cluster so that they can dump their data and reimport it to perform the upgrade.

The planner is basically entirely different. It was later extracted into a standalone module to build HAWQ[0]. Since then there has been work to build a new shared query planner called GPORCA[1].

[0] https://hawq.apache.org/

[1] http://engineering.pivotal.io/post/gporca-open-source/


The original codebase came from ParAccel which itself was later acquired by Actian, but Redshift is definitely owned and developed by Amazon.

But yes, the changes are not viable to upstream without so many modifications that fundamentally change the database. Pluggable storage in mainline would be a good first step.


Looks like Amazon actually bought the source code and forked it. This Quora thread has some replies by people from Actian: https://www.quora.com/Amazon-redshift-uses-Actians-ParaAccel....


I'm the guy who wrote the reply that pops up first when you read that Quora thread.

Here's the short story (and I know all of this because the guy who invented the core engine for ParAccel's MPP columnar tech, that is the foundation for Redshift, is one of our early advisors).

- ParAccel developed the tech for a columnar storage database. I believe it was called "matrix"

- Amazon AWS bought the source code from ParAccel, limited for use as a cloud service, i.e. they couldn't create another on-premise version that would compete with ParAccel

- ParAccel then sold to Actian, and a few years ago Actian shelved the product as clearly the on-premise world had lost to cloud warehouses.

The reason AWS bought the source code was time-to-market. It would have taken too long to build a product from scratch, and customers were asking for a cloud warehouse. Back then, ParAccel had by far the best and fastest MPP / columnar tech, plus it's very attractive since it's based on Postgres.

So Actian and Amazon AWS essentially had the same tech, just different distribution models. One is on-premise (Actian), the other one a managed cloud service (AWS). We all know who won.

There's a very interesting paper by the Amazon RDS team (where Redshift rolls up). It's not only about "faster, better, cheaper" - it really is about simplicity and that's what Redshift delivered on.

https://event.cwi.nl/lsde/papers/p1917-gupta.pdf

Spin up a cluster in less than 5 minutes and get results within 15 min. Keep in mind, this was all in late 2012, so what appears "normal" today was pure magic back then.

But ever since the "fork", i.e. when AWS purchased a snapshot in time of the code base, the products have obviously diverged. There's some 8 years of development now in Amazon Redshift.


In 2012 neither a column store nor spinning up a cluster was magic or state of the art.

Redshift delivered on sales and marketing.

Amazon made their fortune on the backs of open source contributors many times, and this is just another one of those times.



Yes, we already covered that. :-)


Redshift isn’t just a custom storage tier atop an older version of Postgres. It has an entirely different execution engine that compiles commands to an executable and farms them out to multiple data nodes.
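
To make the "farms them out to multiple data nodes" part concrete, here is a minimal scatter/gather sketch in Python. It is not Redshift's engine (which, as noted above, compiles commands to an executable); the function names and the idea of passing each node's rows as a plain list are assumptions made purely for illustration. Each "node" computes a partial aggregate over its own slice, and a coordinator merges the partials.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum_and_count(rows):
    """Work done on one (hypothetical) data node: aggregate only local rows."""
    return sum(rows), len(rows)


def distributed_average(node_slices):
    """Coordinator: fan the partial aggregate out to every node, then merge."""
    workers = max(1, len(node_slices))  # ThreadPoolExecutor rejects 0 workers
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum_and_count, node_slices))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    # An average cannot be merged directly, so nodes return (sum, count) pairs.
    return total / count if count else None
```

The design point is that only small partial results cross the network; the raw rows stay on the node that stores them.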


Yes, I was trying to keep it simple but that detail only further supports the fact that these changes have no value to upstream.


That's a pretty strong claim. There's bound to be one Apache project (even a future one) that could utilize that code/knowhow.


What does Postgres have to do with Apache? The basics for building OLAP systems are already well known and Apache has several projects (Druid, Calcite, Arrow, Parquet, ORC, Drill) that are related to it.

It's one thing to take a database and fork it towards a specific narrow focus and runtime, it's entirely different to try and put those changes back and make the original database more capable in a general environment.


I replied to a few others further down the thread that had similar thoughts as these.


CitusData made tons of improvements to upstream postgresql, though. Can’t say that about Amazon.


OP is referring to the habit of cloud providers to invest in open source platforms to build cloud services but not contribute back to the community.


Despite what people say, Stallman was a visionary in a sense with the GPL, we're seeing that today more than ever.


Stallman is totally ok with the "GPL Loophole" that allows service providers to not give back their changes since they aren't re-distributing the software. If postgres and all citusdata stuff was GPL, this wouldn't change really anything.

Now the Affero GPL prevents this, but Stallman has always been crystal clear he sees the "service provider loophole" as an ok thing and not evil.


RMS is totally NOT okay with this, and has written extensively on the issue: he sees SaaS similarly to proprietary software[0]

[0]https://www.gnu.org/philosophy/who-does-that-server-really-s...


Ah then perhaps I misunderstood this interview[0].

""" Q: All right. Now, I've heard described what is called an Application Service Provider - an "ASP loophole"...

Richard Stallman: Well, I think that term is misleading. I don't think that there is a loophole in GPL version 2 concerning running modified versions on a server. However, there are people who would like to release programs that are free and that require server operators to make their modifications available. So that's what the Affero GPL is designed to do. And, so we're arranging for compatibility between GPL version 3 and the Affero GPL. So we're going to do the job that those developers want, but I don't think it's right to talk about it in terms of a loophole.

Q: Very well.

[7:50]

Richard Stallman: The main job of the GPL is to make sure that every user has freedom, and there's no loophole in that relating to ASPs in GPL version 2. """

[0] http://www.groklaw.net/articlebasic.php?story=20070403114157...


The article keepper posted was written in 2010, while the interview you linked to happened in 2007. Like any of us, RMS's beliefs and opinions evolve over time.


Love or hate Richard Stallman, he is unbelievably resolute in his points of view. They've rarely changed, even though GNU effectively "lost" and open source is generally seen as more business friendly. You've got to give the guy credit where it is due: he's preaching almost exactly the same thing today that he was before I had ever used a computer.


I think you may be over analyzing this. I think his point is simply that the GPL has no loophole in the sense of being intentionally designed to be worked around by SaaS providers; it was just designed at a time, and primarily for desktop software, where this was not a common concern. It was later found to be a problem, hence the Affero GPL. He specifically says v3 is more compatible with Affero GPL.


If these platforms didn’t want cloud providers using their products they shouldn’t have released their products as open source. They can’t have their cake and eat it too.


It's not about that, though. It's amazing that Postgres is taking off like this!

I just want to be cautious that the more we use services like Aurora, the more we're relying on our cloud providers to maintain stability with the core Postgres API/internals while they do some fanciness under the hood to optimize their hardware (if that makes sense).


But at least they open sourced their fork, designed for data warehousing, before this happened:

https://www.citusdata.com/product/community


Kudos to azure for opening so much of what they do. Lots of kubernetes work, including AKS-engine which runs their k8s implementation. Machine learning toolkit. Media services (faceid etc) as a container. The whole azure shabang runs on service fabric, which they've also open sourced.

It's a differentiator for some of their workloads: you don't have to hand your business over to a black box.


Aurora databases and DocumentDB share the same underlying reliable single-writer, many-reader block device for storage. That is all the magic. Not sure where you got the idea that DocumentDB has Postgres underneath it.


See this thread: https://news.ycombinator.com/item?id=18869755

The HN community did a little bit of reverse engineering.


I think they are wrong and Amazon is just sharing code with their Postgres layer.


That was more guessing than reverse engineering, no?


I get what you're saying, but BSD-licenses are specifically designed to facilitate things not being sent upstream. I don't understand why people moan about companies complying with the license agreement.


Your argument is legal and my argument is moral =P


So, castrating independent economic activity and forcing people to be subservient based on other activities is 'moral'?

this argument cuts both ways.


your moral assertion isn’t an argument, because it contradicts the license chosen by the relevant people


This is what happens in a world devoid of the GPL, or where a large majority doesn't sponsor the work of upstream.


MongoDB was already under the AGPL; Amazon just replicated the API on top of their own storage engine (or an existing permissive-licensed storage engine? Who knows?).

If we're at the point where Amazon can just re-implement whatever project they want, more or less from scratch, I'm not sure there's any license that can save us. :(


Save us from... companies writing software? Amazon's new project is a completely different database that happens to share an interface. At what point do we acknowledge that it's their own work?


That's not the question.

The question is of game theory. MongoDB Inc. invested a lot into developing MongoDB, they figured out the right semantics for a lot of things, trade offs, UX/DX (user and dev experience), and so on. (Recently Mongo4 added transactions. Which is a very very very handy thing in any DB.) But MongoDB calculated that they will recoup their investment because they are in the best position to provide support and even to operate Mongo, all while keeping Mongo sufficiently open source (you can run it yourself on as big a cluster as you please, you can modify it for yourself and don't have to tell anyone - unless you're selling it as a service, pretty standard AGPL).

Now AWS took the API and some of that knowhow, and invested into creating something that's not open source at all. You can't learn anything from it, you are not vendor locked in, because the API is standardized, but other than that it takes away a revenue stream from MongoDB Inc. (Sure, competition is good. DocumentDB-Mongo is probably cheaper than MongoDB Inc.'s Atlas.)

But the question is, will this result in less/slower/lower-quality development of MongoDB itself?

Usually big MongoDB clusters at enterprise companies are not likely to upgrade and evolve, they usually get replaced wholesale, but they would have provided the revenue for MongoDB Inc to continue R&D, and to allow them to provide that next gen replacement. Now ... it'll be likely AWS something-something. Which will be probably closed source (like DocumentDB) and at best it'll have an open API (like DocDB Mongo).

Is it fair? Is it Good for the People? Who knows, these are hard questions, but a lot of people feel that AWS doing this somehow robs the greater open source community, and it cements APIs, concentrates even more economic power, and so on.


Well said. Unfortunately even in software, it seems that might makes right.


Save independent software companies from having their lunch eaten by the behemoths after doing the legwork of proving product/market fit - no different from any industry dominated by a few large players. Amazon is not in the wrong for building a competing product, but it's good for the market if the scrappy underdogs have a few edges in their favor.


Well, reimplementing an API is essentially the Oracle/Google Java case, right?


Amazon has explained in their reinvent videos that Aurora is the storage layer of Postgres rewritten to be tightly coupled to their AWS infrastructure. So it is just regular Postgres (they upgrade to latest on a slightly slower cadence). And there’s no benefit to getting the Aurora layer upstream, no one else could use it anyway.

Citus is an extension, not a fork.

So neither of these projects are doing Postgres a dis-service. Both are actually pretty heavily aligned with the continued success and maintenance of mainline open source Postgres.


> Amazon has explained in their reinvent videos that Aurora is the storage layer of Postgres rewritten to be tightly coupled to their AWS infrastructure. So it is just regular Postgres (they upgrade to latest on a slightly slower cadence). And there’s no benefit to getting the Aurora layer upstream, no one else could use it anyway.

I don't think this is an accurate analysis. For one, they had to make a lot of independent improvements to not have performance regress horribly after their changes, and a lot of those could be upstreamed. Similarly, they could help with the effort to make table storage pluggable, but they've not, instead opting to just patch out things.

> Citus is an extension, not a fork.

Used to be a fork though.

> Both are actually pretty heavily aligned with the continued success and maintenance of mainline open source Postgres.

How is Amazon meaningfully involved in the maintenance of open source postgres?


This is the future and it's not just big companies doing it.

Virtually all of the companies that were built on open source products in the past few years have stopped focusing on being the best place to run said open source program and are instead holding back performance and feature improvements as proprietary rather than pushing them upstream.


> we now have closed source Amazon Aurora infrastructure that boasts performance gains that might never see it back upstream (who knows if it's just hardware or software or what behind the scenes here)

The performance benefits of Aurora over Postgres are mostly because Amazon rewrote the storage engine to run on top of their infrastructure.


All I'm saying is that it looks like Azure and Microsoft are about to do the same.


What good would it do for AWS to send its changes upstream? No one else could use it without the rest of AWS’s infrastructure.


We're at a point in time where f(x) = y and we're starting to stop caring about the internals of "f" as long as the "determinism is equivalent" and that scares me.

OpenJDK for example and the API/ABI (whatever you want to call it) copyright and now MongoDB with DocumentDB, etc.


We’ve been at that point since the first PC compatibles came out in the mid 80s with clean room reverse engineered BIOS firmware.

We’ve been living with abstractions over the underlying infrastructure for over 45 years.


Do you know how your CPU works? Or all the networking hardware in the middle of you loading this page? There are thousands of layers in computing, so many that it’s impossible to be transparent about them all.

And quite frankly, that end functionality is what customers are paying for, so that they don’t have to care about all the technical details and operational overhead. It's not like open-source Postgres is being halted by this. The Citus extension itself is open-source too.


> - we now have Amazon DocumentDB that is a closed source MongoDB-like scripting interface with Postgres under the hood

To clarify, Amazon DocumentDB uses the Aurora storage engine, which is the same proprietary storage engine that is used by Aurora MySQL and Aurora PostgreSQL, and gives you multi-facility durability by writing 6 copies of your data across 3 facilities, with a 4 of 6 quorum before writes are acknowledged back to the client.

So, it's a bit inaccurate to say that DocumentDB has anything to do with Postgres.
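
For anyone unfamiliar with the quorum idea, here is a rough Python sketch of "6 copies across 3 facilities, 4 of 6 before acknowledging". The class and function names are hypothetical and this says nothing about how Aurora is actually implemented; it only shows the counting rule described above.

```python
from dataclasses import dataclass
from typing import List

FACILITIES = 3
COPIES_PER_FACILITY = 2   # 3 facilities x 2 copies = 6 copies
WRITE_QUORUM = 4          # 4 of 6 acks required, per the description above


@dataclass
class StorageCopy:
    facility: int
    healthy: bool = True

    def persist(self, record: bytes) -> bool:
        # A real storage node would write to disk; this stub only reports health.
        return self.healthy


def make_copies() -> List[StorageCopy]:
    return [
        StorageCopy(facility=f)
        for f in range(FACILITIES)
        for _ in range(COPIES_PER_FACILITY)
    ]


def quorum_write(copies: List[StorageCopy], record: bytes) -> bool:
    """Acknowledge the write only if at least WRITE_QUORUM copies persisted it."""
    acks = sum(1 for copy in copies if copy.persist(record))
    return acks >= WRITE_QUORUM
```

With these numbers, losing an entire facility (2 copies) still leaves 4 acks, so writes keep succeeding; losing a facility plus one more copy leaves 3, so the write is not acknowledged.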


There’s evidence to suggest that DocumentDB is actually running Aurora Postgres under the hood. https://news.ycombinator.com/item?id=18870397


I would argue Microsoft's strategy actually makes them more wedded and committed to ensuring the vitality of open source PostgreSQL than anything AWS is doing.


The big news here: Citus Data donated 1% of their equity to non-profit PostgreSQL organizations[1] so this acquisition is a win for the community even in the darkest scenario of Citus Data disappearing into a canyon on the Microsoft campus.

Given Microsoft's change in operation over recent years there's also hope that they can continue their contributions into the future.

It's fascinating to see Microsoft leave behind the "embrace, extend, extinguish" narrative only to have Amazon adopt it, causing massive rifts and action within the database community[2][3]. I am genuinely concerned about the future of open source software in this continued scenario.

An article with what I considered an outrageous headline ("Is Amazon 'strip mining' open source?"[4]) has only rung more true over time. Amazon is one of the largest companies on earth, selling products that they receive for free but never improve[5], attacking the primary open source provider, and then shifting toward their comparable proprietary closed offerings.

Hopefully new ways to "give back", such as equity contribution, can be one of the many paths forward needed to keep open source software healthy. Given how much innovation is unlocked by this, it'd be a crime to go back to the past era.

[1]: https://www.citusdata.com/newsroom/press/citus-data-donates-...

[2]: https://www.cnbc.com/2018/11/30/aws-is-competing-with-its-cu...

[3]: https://techcrunch.com/2019/01/09/aws-gives-open-source-the-...

[4]: https://www.cbronline.com/analysis/aws-managed-kafka

[5]: From [2], "Jay Kreps, a creator of Kafka and co-founder and CEO of Confluent ... said Amazon has not contributed a single line of code to the Apache Kafka open-source software and is not reselling Confluent’s cloud tool."


Any clue what the base for that 1% is going to be? Didn’t see any mention of the total acquisition amount anywhere.



In case folks are interested here are the details from our founders on the Citus blog - https://www.citusdata.com/blog/2019/01/24/microsoft-acquires...


Well this is great news for the guys at Citus - they created something great as a Postgres add-on and a big chunk of it was open sourced.

They made a decent cloud business model out of it (no idea how successful but everyone I asked was happy with it).

I just hope Microsoft allow the tech to evolve as open source!


"I just hope Microsoft allow the tech to evolve as open source!"

Current Microsoft sure will. They're good with open source stuff.


What about future Microsoft? :)


Yes, agreed. Long may this continue.


Citus is already used by Microsoft itself internally, a recent example being the VeniceDB project to analyze Windows telemetry: https://www.youtube.com/watch?v=AeMaBwd90SI

Considering the competitive database landscape, this is a compelling offering to add to any cloud portfolio. Congrats to the Citus team.


I still can't get over the fact that Microsoft is using Postgres internally, if you had told me that 5 years ago I wouldn't have believed it. Did they go into why over MSSQL?


MSSQL currently does not have horizontal sharding capabilities like this, or easy UPSERT functionality.
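
A toy illustration of the two features mentioned, in Python rather than SQL. Everything here (the shard count, the class name, using a counter as the value) is made up for the example and is not Citus's API; it only sketches routing rows by a hash of a distribution key and "insert if absent, otherwise update" semantics.

```python
import hashlib

SHARD_COUNT = 4  # hypothetical; a real cluster would choose this per deployment


def shard_for(key: str) -> int:
    """Map a distribution key to a shard using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % SHARD_COUNT


class ShardedCounterTable:
    """Each shard is a dict standing in for a table on a separate node."""

    def __init__(self):
        self.shards = [dict() for _ in range(SHARD_COUNT)]

    def upsert(self, key: str, delta: int) -> None:
        # UPSERT semantics: insert the row if missing, otherwise update it.
        shard = self.shards[shard_for(key)]
        shard[key] = shard.get(key, 0) + delta

    def get(self, key: str):
        # A selective lookup touches exactly one shard.
        return self.shards[shard_for(key)].get(key)
```

Because the same key always hashes to the same shard, a point lookup or update never has to consult the other shards, which is what makes this layout attractive for update-heavy telemetry.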


I've dabbled with SQL Data Warehouse (via Azure); wouldn't this be the horizontal functionality? It has some limitations, but curious how it compares to Citus.


Yes but that's a different product designed with columnstores and still missing the upsert. This use-case was for lots of telemetry data that had high updates and very selective queries using indexes rather than large aggregations. Scale-out OLTP was better suited for these analytics than a normal OLAP system.


Thank you makes sense


The main question is: Did MS want an expert PgSQL team to work on Azure PostgreSQL (and maybe to create a proprietary competitor to Aurora)? Or did they acquire Citus for its product, to improve and market it further?

It feels like it was the first. If so, it means bad news for the Citus product as it will most likely be ignored for a while. That will be really sad, as I don't know any actively supported automated sharding solution for PgSQL other than Citus. There is PostgresXL[1], but there isn't much focus to make it community friendly.

[1]: https://www.postgres-xl.org/overview/


I don't think anyone should expect acquihiring an expert Postgres team to work on a proprietary product to work well, because the programmers' skills are eminently transferrable.

Half the team would probably wander off to work for one of the other postgres-centered companies (and quite possibly continue to work on the open source Citus code).


Fun fact: the team that built Citus Cloud began with 3 people that came over from Heroku after building its (proprietary) Postgres cloud service.


This is more of a competitor to Redshift than Aurora.


Citus improves the performance of OLAP query loads but it's not an analytics solution first. They say so themselves -

https://www.citusdata.com/blog/2018/06/07/what-is-citus-good...

---

When we first started building Citus, we began on the OLAP side. As time has gone on Citus has evolved to have full transactional support, first when targeting a single shard, and now fully distributed across your Citus database cluster.

Today, we find most who use the Citus database do so for either:

(OLTP) Fully transactional database powering their system of record or system of engagement (often multi-tenant)

(HTAP) For providing real-time insights directly to internal or external users across large amounts of data.


Great news for Citus, Microsoft, Postgres and for people using open source relational databases. This makes so much sense. (I know this comment might read naive to some but I’m genuinely excited right now)


I'm pretty excited as well... Especially if this means improvements to Azure's PostgreSQL options. DBaaS is one of the areas where cloud providers give a LOT of value, more so as long as the interfaces you use can be used internally/locally for development.

Similarly, I really appreciate MS-SQL for Linux on Docker as it is a lot easier to set up for CI/CD and locally for dev and testing, and is nearly transparent going to Azure SQL or MS SQL Enterprise for hosted deployments. I'd much rather use PostgreSQL with PLv8 than MS-SQL though.


I wonder how long it will be before they shut down their own Citus Cloud hosted offering, which is hosted on AWS. Seems obvious that will become part of Azure soon.


I doubt they'd disrupt their AWS operations right away - this certainly won't be the first time that a MSFT team/subsidiary has used AWS.

What's more worrying to me is if they try to do both - build out a Citus offering on Azure, and simultaneously try to keep up the high reliability of their AWS Citus Cloud, which may be the most reliable option for some time. It's tough for any organization, no matter how much capital has been injected, to keep a laser-sharp eye on two inevitably-competing initiatives, each of which has its own performance and automation characteristics. I don't want the one person in the company who knows, say, cloud hard drive recovery patterns like the back of their hand and had previously been the EBS guru, to suddenly be pulled into the new Azure optimization project... and that's not something that capital injections can necessarily fix.

That said, this could accelerate their development timelines overall, and it guarantees stability for the product for quite a while. Overall I think this is good news! Citus is one of those things that you want to have in your back pocket when building any type of app on Postgres, and we certainly see it in our company as a long-term "escape hatch" when we're forced to make database-heavy design decisions at currently relatively-small scale. This deal keeps it alive and prospering!


It's going to be really funny if Microsoft ends up using Open Source software to compete against its proprietary service-based competitors. Sort of like how GCP runs k8s... you can use the free tool, or you can use the managed service, and the community helps build the thing. In theory, you retain competitive advantage because you have the most expertise in the product.

The Googles of the world lose out on professional services, but Microsoft could still make a bundle of money by just consulting on the tools without even managing them. You might even make higher margins by not managing the service.


Congrats Citus team. Just please keep the blog alive! Craig's posts are some of my favorite Postgres reads.


So, a part of Microsoft will advocate for SQL server, and another part will develop for PostgreSQL? Isn't it weird? Why would Microsoft want this?


Because MS is more and more in the business of selling the operation of software as a service, instead of selling licenses for their customers to operate themselves.

Think of them like a wedding event rental company: they are more than happy to rent you their own brand of tables, flatware, and silverware, but if you want another brand that’s fine too, as long as you buy from them.


Microsoft is hedging their bets. PostgreSQL has the potential to disrupt the traditional relational database market, so if they're going to be disrupted then better to do it themselves.

I expect they'll also try to port Citus Data functionality to the SQL Server platform.


I doubt Microsoft is as worried about "disruption" as it is interested in expanding its reach. The number of developers who currently run Postgres (hell, even the number of customers Citus currently has) is far greater than the number likely to completely switch from SQL Server to Postgres.


Maybe MS is just so rich that they can afford to do stuff like that. But that sounds very bad for Citus. Right now they are PostgreSQL experts. With MS, they will be salesmen trying to sell extra bloat that will also work with SQL Server. It's a pity to lose such engineering effort in the PostgreSQL community.


Disrupt how? Haven't they been part of the traditional relational database market for years now? What's changing?


People outside open source are noticing.

Also PostgreSQL has improved by large margins in the last five years. The software is more akin to a pyramid than a skyscraper. The foundation took a long time, but now that it's complete there is a strong base for growth.


It's a little bit like Linux in the late 90s. Postgres is increasingly being used in financial services, for instance, because it's "good enough" -- many teams really don't need Oracle's or SQL Server's feature-set, and Postgres has enough 'interesting' features of its own.

It also lets teams own their DB infrastructure and play around with deployment patterns that'd make little sense on Oracle et al because of cost reasons.

It's not that Postgres is "killing" commercial databases, it's just that more and more people are, as you said, noticing that they don't need a commercial database for lots of use cases. And support and consultancy -- and even in-team skills -- for Postgres are often available.


Shifts in the database world happen at a glacial pace. They've been a part of that world for a long time yes, but they're steadily becoming more and more legit and acceptable as an alternative to Oracle, SQL server.


Postgres is a nice open-source platform but the commercial engines are far more advanced in many areas and are not going to be disrupted anytime soon. If you can use Postgres for your needs today then it's likely that any relational system would've worked and you didn't need a commercial system in the first place.

SQL Server platform already has one of the most advanced optimizers and distributed planning with its use in Polybase, stretch tables, SQL MPP and Azure SQL DW.


They are more advanced on many levels, and a bit behind on the dev friendliness side (postgres is king there), that's true.

However I manage roughly 2000 instances of commercial databases. I'd say maybe a 10th could not be hosted on postgres.

It gives postgres a huge disruption potential and the management, in all the big firms I know, is actively looking at it.


Yes, PG has many more usability features. Simple things like UPSERT make UX improvements if you don't need the advanced capabilities of MSSQL.

I'm sure PG can take on more of the standard RDBMS workloads today but I don't think that's really making a big dent on SQL Server as the bulk of their revenue comes from the serious enterprise scenarios.
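For anyone who hasn't used UPSERT, here is a minimal sketch of the idea. It runs against SQLite through Python's built-in sqlite3 module (SQLite 3.24 or later), only because that needs no server; SQLite's ON CONFLICT clause is modeled on the PostgreSQL syntax, so the statement should look the same in PG. The settings table and its columns are made up for illustration.

```python
import sqlite3

# In-memory database; the "settings" table and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE settings (key TEXT PRIMARY KEY, value TEXT NOT NULL)")


def save_setting(key, value):
    """Insert the row, or overwrite the value if the key already exists."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO settings (key, value) VALUES (?, ?) "
            "ON CONFLICT (key) DO UPDATE SET value = excluded.value",
            (key, value),
        )


save_setting("theme", "light")
save_setting("theme", "dark")  # same key: updates instead of failing

rows = conn.execute("SELECT key, value FROM settings").fetchall()
print(rows)
```

Without the ON CONFLICT clause the second call would fail on the primary key, and you would need a separate SELECT-then-INSERT-or-UPDATE round trip (or a MERGE statement) to get the same effect.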


> I don't think that's really making a big dent on SQL Server as the bulk of their revenue comes from the serious enterprise scenarios.

I'd say postgres can take 90% of the revenues. I've worked in 3 of the top 10 European banks. Most of the SQL Server instances do not even need partitioning, for example. Let alone Always On, Hekaton and so forth.

People mostly buy peace of mind. Until they are charged millions and start questioning the stupid expensive bills saying "do we need that?"

Really I have all the metrics to back this up: CPU usage/Availability requirements/data size... It is literally my job to collect those.

Also, proper window support, great PostGIS, great JSON support and open source ecosystem support are not trivial features, and they are huge cost savers.


Revenue for what? Postgres is free.

If you mean the licensing and support from commercial distributions and vendors then, as you already recognize, these decisions will fall to which vendor they trust more. That will usually end up being Microsoft.


Professional PostgreSQL support is not free, and the banks would most likely buy support from some company just like they currently buy support for SQL Server.


>>> "If you mean the licensing and support from commercial distributions and vendors then, as you already recognize, these decisions will fall to which vendor they trust more."

Enterprises are more than just banks, and vendor relationships matter. They're not fungible and are rarely based on price.


You do not trust a vendor who rips you off.


It's not the fault of the vendor if you decide to buy their product. If you didn't need the product features then you shouldn't buy it but enterprise deals are rarely about the absolute price.


Indeed. And as I said this reasoning might eat 90% of sql server's revenue.


I moved from SQL Server to Postgres 5 years ago, and I would argue the opposite, i.e. PG is way better. E.g. PG supports UTF8, CSV and jsonb. It has way more standard SQL features and support, such as window (analytical) and aggregate functions; string_agg is particularly powerful. There are more SQL join options: JOIN with a USING clause is very handy, and LATERAL join is standard SQL, very powerful and much easier to understand than SQL Server's obtuse version. Its user-defined functions are far more powerful, even with standard SQL, let alone the umpteen other languages you can use such as Python or JavaScript, and functions can be chained for incredible power. Many dev-friendly features make you much more productive, e.g. DROP SCHEMA and the ability to easily use powerful editors such as Sublime Text or VSCode. With recent parallel query improvements PG has mostly caught up on performance. Its only glaring weakness vs Oracle now is the lack of automatic incremental materialized view refresh. MS auto mat views had so many limitations they were almost pointless, last time I looked; PG mat views have no restrictions but are not that useful because they lack incremental refresh.


Yes PG has more usability features but it is not a match on performance at all and lacks the advanced features that enterprises want. I don't know why this is so shocking to hear. These commercial engines have billions of dollars in research and engineering, they're not just standing still and aren't obsolete because PG finally got parallel query (which is only workable in v11 and still years behind the others). There's a reason why all the enterprise PG distributions add so many other features and even basics like connection pooling, because that's what it takes to compete in the enterprise space with its vast and complex requirements.

As I said, "If you can use Postgres then you didn't need a commercial system in the first place."


At the end of the day, even if you find a better alternative, database engines are a bear to switch. Our database is still SQL Server, even though we've switched our development platform 3 times, and I'd love to be on Postgres.


Well that's exactly how technology market disruption works. The disruptive product sneaks into the low end of the market without any of the established competitors really noticing. Then the disruptive product gradually moves up market and eats everyone's lunch.

15 years ago Windows Server was far more advanced than any Linux distribution. What does the server OS market look like today?


Postgres also has a number of commercial derivatives, so you get the advantage of starting small (open source) and moving to a more optimized derivative when it becomes necessary.

Aster and Greenplum perform exceptionally for what they do. If SQL Server were better that's exactly what companies would use. Commercial engines are definitely more advanced. But some Postgres derivatives can absolutely outperform enterprise platforms in some use cases, and vice versa.


Azure gives you Linux and Windows as options for hosting, because money is money and engineers are not free.


For a similar business reason as to why Azure offers Linux, Microsoft develops for Android and iPhone and Oracle owns MySQL.

Microsoft doesn't regard PostgreSQL as a direct threat to SQL Server and it's happy to make money where it can so long as it doesn't perceive something to be a mortal threat.


That is what I am observing too.


If you could switch from Oracle, you would switch to MySQL, thus Oracle buys MySQL to keep control of the enterprise market. If you could switch from MS SQL you would probably switch to PostgreSQL, thus MS wants to control part of the PostgreSQL enterprise market. If you would switch from Facebook, you would probably go to Instagram. Whatever you choose, the money will end up in the same pocket.


Offering services is far more lucrative than a single product. That's the entire cloud computing business model.

Azure is happy to take your money to run SQL Server or Postgres, just like how Amazon has been running Aurora side-by-side with Oracle and SQL Server for years now.


Same reason why Microsoft wants libraries for languages like Ruby to talk to SQL Server (at the cost of ASP.net adoption - I had the pleasure of speaking to that team at RubyConf one year) or to run SQL Server on Linux (at the cost of Windows licenses). Expanding their breadth benefits the company overall, and they know there will be some who never run SQL Server or Windows, yet 2019 Microsoft wants to have solutions for those folks.


Why did Facebook buy Instagram? Look at all the various travel sites. They are all owned by Priceline.


Same thing as Oracle owning MySQL. Keep your friends close but enemies closer.


They shifted a lot of focus to Azure.

Azure Database for PostgreSQL competes against RDS and to a lesser extent, Google's Cloud SQL for PostgreSQL.

They'll still sell MS SQL Server too. But sometimes PostgreSQL is a better fit for your stack, or your preference, and they want your money for them to provide that too.


Hosted Postgres is already available on Azure, à la RDS.


for azure?


It is all about choice. As both an AWS and an Azure user, I can attest that Microsoft is doing a lot more with OSS and contributing more than AWS. We are both a SQL Server and PostgreSQL shop and are excited about this move by Microsoft and Citus.


A little off topic, but I wonder how long it will be before MS acquires Docker Inc. Seems like an even better fit for them now that they own GitHub. GitHub + Docker Hub on the developer engagement side and Docker Enterprise on the traditional enterprise side.


I'm wondering how much OCI and CRI-O have impacted Docker's value proposition. Docker Hub seems more and more like the real product, though I guess you could argue that the container runtime was never really a product in the first place.


Private repos on Docker Hub are definitely a product, especially if you provide a seamless path for moving from source code on GitHub to images on Docker Hub to containers deployed in the cloud or on premise.

MS could certainly also do good business selling Docker's friendly Enterprise orchestration tools (including their new Kubernetes based tool) which check all the Enterprise requirements for security, policy, identity management etc.


Docker the company is a lame duck, and docker the software is being rapidly supplanted by podman and buildah. There would be no point.


I haven't used Citus but once thought about Cstore_fdw. How much of this is about Cstore_fdw? I am curious because in data warehousing space my experience has been column store databases totally rule when it comes to speed on analytics. I know SQL Server has column store indexes but that requires you to create them whereas with genuine column store you get the performance boost by virtue of how data is stored.


Very little, I'm guessing; cstore_fdw is not remotely competitive with mature analytics DBMS.

see here: https://tech.marksblogg.com/benchmarks.html


Great link, thanks for sharing.


SQL Server indexes can be either clustered or non-clustered, which determines whether table data is stored by index order. If you have a clustered columnstore index then the table is actually physically stored in a column-oriented format. Combined with vectorized processing, an impressive query optimizer, and in-memory tables, MSSQL is one of the fastest OLAP systems available.

Also Cstore_fdw is rather obsolete and more of an experiment. It's a rough wrapper around ORC files and is missing many features, advancements and an execution engine to match the performance and usability of a real OLAP database.
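To make the row-store vs column-store point above concrete, here is a toy sketch in plain Python. It only illustrates the layout idea; it is not how SQL Server, cstore_fdw or any real engine is implemented, and the table contents and names are invented.

```python
# Row-oriented layout: every record keeps all of its fields together.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 200.0},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of per-column lists."""
    cols = {}
    for record in records:
        for name, value in record.items():
            cols.setdefault(name, []).append(value)
    return cols


# Column-oriented layout: each column's values sit next to each other.
columns = to_columns(rows)

# SUM(amount) on the row layout has to walk every full record...
total_from_rows = sum(record["amount"] for record in rows)

# ...while on the column layout it only reads the one column it needs.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
```

Real column stores add compression and vectorized execution on top of this layout, which is where most of the analytics speedup comes from.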


Any stats/articles on SQL server having an impressive query optimizer? I have personally found it almost entirely devoid of ability but maybe we just had awkward queries.


For data analytics I use ClickHouse instead of PostgreSQL. There is a PostgreSQL Foreign Data Wrapper (FDW) for the ClickHouse database, but I have never used it.


Does anyone know any details about the financials of the deal? Is this an acquihire or more?


There's no way this was just an acquihire: Citus represents some truly impressive computer science.


Impressive computer science does not at all correlate with commercial success.

If anything, that makes it more likely to be an acquihire.


Happy for the folks at Citus. I use Citus at work and it's amazing. Hope things stay the same after the acquisition.


My sentiments as well. Great team to work with and really like the product.


Speaking as a data professional and SQL addict, I was always impressed when I came across Citus Data posts. Good acquisition by Microsoft.


Why is an acquisition a win for the company? Seems to me like the big company is killing the small one and absorbing its soul(brand).

Shouldn't sustainability be the primary goal instead of making big bucks temporarily?


So, how much did YC make?


Maybe now they will actually add a free tier so people can sign up for this, develop their product using a free tier, and upgrade when they launch, as is the natural progression with most other cloud products. I think before there were some complexity and/or financial issues preventing this but with Microsoft's wallet it shouldn't be an issue.


There’s a community version.


That doesn't really help. What I want is a seamless service I can just spin up an SQL server on with some cheap or free plan and tiny capacity, and then have it horizontally scale based on usage without me ever having to do or change anything, potentially to the point where it's dealing with terabytes and terabytes of data, thousands of connections, a large monthly bill, etc. Google's Cloud Spanner was supposed to be this, but the minimum monthly fee is $90ish which makes it impractical for anyone who doesn't want to waste money while developing. Citus has traditionally turned up its nose on HN at users who don't already have a massive database, but you gotta start somewhere, and the product is a lot better if you can stay in the same DB ecosystem from 0 users to 1,000,000 users and beyond.

e.g. something like the free "tiny turtle" plan here: https://www.elephantsql.com/plans.html, but that can auto-scale up to citus-scale things, without user intervention, as needed.


Congratulations to the Citus Data team! I don't have anything significant to add, but I loved the free socks you gave out :)


Wonder if they will have Microsoft socks now :)


5 kinds and with a lot of work you'll be able to figure out which kind is cotton if they put the people in charge of .net naming in control of the sock buying division (which they probably should at that).


On that page it appears that the photo is a composite, mashed together from various sources. And not a good job either.


Yeah...looks like only the first, and maybe the second, guy was really on the stairwell. All the other people appear to be photoshopped in :-)


If Microsoft invests significantly in PL/SQL support and Oracle compatibility they can bleed Oracle big time.


Great news, another Turkish company acquisition. I wonder what the acquisition price was.


It’s a very unsettling future we’ve ended up in where I see that Citus has been purchased and I’m pleased it was by Microsoft.


What are the odds that Citus Enterprise is released as the community version, like GitHub's private repos, in the near future?


I wonder if the Citus folks will have any influence on the future of SQL Server. Maybe they’ll bring a plug-in system to it?


This is awesome, congrats. Any chance you all may change the license to something more permissive? :fingers crossed:


Congrats to folks at Citus -- they've matured quickly to reach this point in just a couple of years.


genuine question:

Why do people think this is good, but the MySQL acquisition by Oracle was bad? Both companies have an internal conflict of interest by already owning closed-source SQL servers, and both companies have a history of attacking open-source communities.


I wouldn't say it is "good", but two reasons why it probably isn't so bad: Citus is a company active in Postgres, it's not "the Postgres company" like MySQL AB was. Their business model seems like it fits to what Microsoft is doing now, so the open-source part is hopefully not in real danger.


That means PostgreSQL will hopefully get more development resources upstream.


Whaaaaaaaaaaaaaaaaaat!?

Is this a move to defend SQL Server or expand Azure’s PG offering?


I think SQL Server is a huge business for them already. This looks like it's for Azure and OSS, and expanding there.


Congrats to the Citus Team! Really a great group of people over there.


I hope this means tPostgres or pgtsql gets some Microsoft resources!


Next achievement is selling the operating system to Micros~1


Great news, their product is awesome and this will hopefully let more people use it.

Knowing that Citus is available as an option if you need to scale makes Postgres that much more compelling of a default choice for data store.


The photomontage is pathetically bad...


Azure Cloud Spanner time?


it is known as Azure CosmosDB


I mean, isn't CosmosDB more like Cloud Datastore/Megastore?


I think CosmosDB is just not too appealing with its layered multi-model approach compared to ArangoDB's native way of supporting multiple data models in one DB.


crap :(



