Hacker News new | past | comments | ask | show | jobs | submit login
Babelfish: SQL Server-to-Postgres Translation Layer (amazon.com)
288 points by ke4qqq on Dec 1, 2020 | hide | past | favorite | 113 comments

Don't hate me for it, but I'd like this for MySQL to postgres too. At least as a stepping stone.

Use case: some of my SQL syntax depends on MySQL but I realize I made a poor life choice and would rather have transactional DDL and a myriad of better features on postgres.

This is so awesome. Wish I found it earlier. Thanks for posting

Well, what were you looking for? We might need to improve our SEO...

For SQL server, AWS can save their customers money by cutting license cost (to MS). MySQL is already free so I don’t see how they can benefit from such a project.

They can go one better — they can fork the parsers out of both projects and run it against a common query planner/storage engine (aurora).

If you have several thousand stored procs and triggers then having to not rewrite all of that is actually likely over a million in savings.

I agree, but there might be a creative way to do it that increases customer happiness and still is a decent business.

>don’t see how they can benefit from such a project

I thought there were bad blood between Oracle and Amazon?

In that case you'd expect to see it for Oracle proper instead of MySQL first, no?

The Oracle wire protocol contains a (copyrighted) haiku, so no one can reverse-engineer the Oracle wire protocol without infringing on Oracle's copyright.

You might be able to get away with staying with MySQL or a variant. Years ago I did an "interview problem" to generate a report of some accounting data of a sample data set in MySQL. The details are foggy now, but I ended up doing it all using advanced SQL features that I thought only existed in PostgreSQL (you know, or Oracle), but whatever engine I ended up using had it as well. It was probably the latest Percona or Maria?

MySQL has worse guarantees in several cases, though. For example, using `SERIALIZABLE` isolation means your queries can deadlock when run in parallel.

Likewise, selecting a column not present in a GROUP BY leads to random values being returned.

> Likewise, selecting a column not present in a GROUP BY leads to random values being returned.

That hasn't been the default behavior for something like 5 years.

Beyond that the values aren't "random" as much as "one of the values from the set" which can be a useful feature in cares where you want any of the values and don't want the performance hit that comes with using an aggregate function particularly on very very large sets.

MySQL improved a lot in those places.

pgloader does this (and much much more) in a single command:


  pgloader \
    mysql://user:password@mysql:3306/database \
This is a gem of a tool. Surprised nobody has mentioned it. Supports SQLite and MS-SQL too.


Absolutely, absolute love pgloader! So painless and quite straightforward. Saved me lots of time migrating my production db. Highly, highly recommend if anyone wants to migrate from MySQL to PG!

However if the translation layer is designed for consistency, like Babelfish, it would have to replicate many of the MySQL “faults” too... e.g. it would need to have the same non-transactional DDL so it acted “correctly”!

And it would need to replicate the specialised data types like utf8mb3.

To use PostgreSQL features (such as JSON datatype) would presumably need a separate data connection and transaction, because if it isn’t implemented by MySQL, then the syntax wouldn’t be supported by Babelfish.

You would get some wins, but you don’t get a blend of the best of both.

It would be nice if it could aim for doing the sane thing over exact MySQL compatibility. But I agree it would be a minefield...

I would love that. There is so much mediocre but popular software out there that uses hard-coded MySQL queries instead of some abstraction layer.

Having to use MySQL/MariaDB after usually dealing with PostgreSQL always feels like such a downgrade.

Can't you do it with a foreign data wrapper or the equivalent in mysql? Maybe keep your mysql-specific queries as views in mysql and call them from pg?

Just spitballing, sorry, I love FDWs.

Something like this would solve a lot of my problems

I’d considered migrating MySQL to Postgres on a 200 table production app more than once. Couldn’t find any good tooling at the time so I just sucked it up and lived with my life choices.

Why not migrate to MariaDB? It's better than MySQL and has easy and great clustering contrary to Postgres

This is great. I suspect that I have exactly the right use case for this.

The two main issues with running SQL Server are (1) you have to license all the cores on a system and (2) the standard license only recognizes up to 64GB RAM. So I actually wound up buying a 3GHz single-socket system for around $10K to save $20K on the SQL license.

With this, I can move a couple of the big DBs to another system that has 32 Cores with 256GB RAM and the entire DB will fit in memory, put in 5GB ethernet, and gain a tremendous amount of performance.

But, more importantly, I can migrate the workload on a case-by-case basis. Human costs always dwarf my software and hardware costs.

(2) is not accurate. SQL Server Standard Edition supports up to 128 GB RAM


For the highest socket license.

I still remember the day that MS switch the SQL Server pricing from per-socket to per-core. Dark day, indeed.

> A commonly used datatype to store monetary values is the MONEY data type. In SQL Server, the MONEY data type’s behavior is fixed using four digits to the right of the decimal (e.g., $12.8123). However, in PostgreSQL, the MONEY data type is fixed using two digits to the right of the decimal.

> So, when the application tries to store a value of $12.8123, by example, PostgreSQL will round to $12.81. This subtle difference will result in a rounding error and break an application if not correctly addressed. To ensure correctness in Babelfish, we need to ensure such differences, small and large, are handled with absolute fidelity.

How are they going to solve this with just a query translation layer? Isn't information lost on save?

> How are they going to solve this with just a query translation layer?

Well, the translation layer isn't just a query (DQL) translation layer, its an SQL Translation layer including DDL, DML, etc. Since both Postgres MONEY and SQL Server MONEY are 8-byte, fixed-precision decimal types, with the only difference being the position of the implicit decimal, a translation layer can use one as the backing store for something that is logically treated as the other without data loss, though it will have to be aware of the difference when presenting data and also when doing conversions to other datatypes, doing math other than addition/subtraction, etc.

It would be even easier, I think, to just use, what, DECIMAL(19,4) in Postgres for SQL Server MONEY, with some special handling to have the right failure behavior at the edge of the slightly-narrower range of the SQL Server MONEY type.

One can also change the precision of psql's monetary type. From the docs:

> The fractional precision is determined by the database's lc_monetary setting.

My guess would be: by not using Postgres's money type.

I'd say the best solution here is if a really nice money data type was developed that is configurable as necessary to hopefully satisfy the MS SQL idiosyncrasies while also maybe supporting a bit richer superset of desirable features you'd want in a money type. You could even have things like temporal currency conversion type queries if you had the right feature set.

Presumably by not storing SQL Server's MONEYs in pg MONEYs, but CASTing to a pg MONEY if pg asks for it.

As someone who has been fighting with a SQL Server to Postgresql conversion this sounds AMAZING. Too bad it won't be available before my conversion is complete (and if it is, that's an even sadder proposition)

How are you doing your migration right now? Rewriting code by hand?

We got very far using DMS (Data Migration Service) and SCT (Schema Conversion Tool), especially once we put in some smart renaming conventions. It does like to make varchars everything it doesn't type-convert automatically. We also had to explicitly rename objects, mostly indexes - it like to truncate names, often overwriting previous entries when it truncates the name. Beyond that, there's some procedures and application code to rewrite (it's a Rails app built atop an older database that didn't follow standard Rails conventions) but I'd say automation and semi-automation (ie smart find-and-replace) got us 80% of the way.

Out of curiosity: besides cost, are there other significant reasons that drive your desire to switch?

Don't discount the cost, which is borderline astronomical; if you're on Enterprise, you're paying tens of thousands a year. For large installation, potentially six figures. And that's on-prem which is the cheapest way to do it more often than not.

A 2-core Enterprise license is nearly $14,000.

I guess it's linked to cost but the pure painfulness of just having licensing in the way of your infrastructure management is pretty annoying.

We literally have had this problem the last few months where we had a spike in load causing wide spread performance issues and the obvious answer was to give the server more cores but "we're not licensed for that" ... so everybody just suffered through it because temporarily giving the DB more cores for a few hours was just too painful / costly from a licensing point of view.

I came from Oracle, not SQL Server, but the largest benefit to me, is at a previous job, I ran Oracle Standard (that cost About $25k for our 2 core servers) and one of the critical limitations is you can't build an index without locking the table. The ability to do it 'online' was an enterprise only feature (that would have cost about $700k on our hardware), so we were unable to do any index changes on our tables, since it would take 10-12 hours to rebuild some key tables, and we couldn't block all our transactions for that long. SQL server also makes that an enterprise only feature.

PostgreSQL has the native ability to do that. There are lots of other benefits, but that is the first that comes to mind.

When they come out with Babelfish for Oracle, then stuff is going to get crazy. EnterpriseDB has some limits to its functionality, and its cost is pretty high.

liberatii.com has developed a similar tool like Babelfish for Oracle to SQL Server and PostgreSQL in case you're interested.

Plenty of reasons, but principally we want to bring our stack in line with the most common practices (ie, Rails/Postgresql) for simplicity's sake; this is crucial for our small team.

Postgres is functional by default, no need to throw asinine "with (nolock)" hints on every join.

This would be even greater news if it would not be vaporware "Babelfish for PostgreSQL will be available on Github in 2021." https://babelfish-for-postgresql.github.io/babelfish-for-pos...

I hope this means I can finally connect postgres to excel with the same ease I can connect SQL server.

What does “connect” mean in this context? Read data from SQL server into an excel spreadsheet?

Not the GP but yes I've seen a lot of folks go through the Data tab in Excel and connect to a SQL database to display data directly.

Excel can connect to PostgreSQL as well.

Yes, it can on Windows, both via psqlODBC and Npgsql, when you find out which .net runtime version your Excel version uses.

On Mac, it is more interesting; there's only ODBC, Microsoft doesn't support the same psqlODBC you can use in Windows and you have to purchase one of the supported commercial ODBC drivers.

Yes but not with the same ease :)

Exactly that. It's a nightmare to maintain but business types seem to demand it.

I used to work at big hospital that had a home-grown 'reporting system' built with Excel, data connections to an SQL Server database, and a series of pivot tables. Every morning from 8-9am the whole network would slow as people opened their spreadsheet and it basically did a `select * from a_rather_large_view` - tens of gigabytes flying around (this was in the mid 2000s.)

There was resistance after a data warehouse (BO) was introduced because pivot tables were so flexible and well-understood.

This is interesting because it will also help Sybase migrations. SQL Server is the "brand name" but there are still a lot of people stuck on Sybase.

Will it? MSSQL and Sybase diverged somewhere around 27 years ago; anything that Sybase and Microsoft did differently since then would likely be completely incompatible.

As someone who converted a 400'000+ LoC Sybase SQL codebase to Microsoft SQL Server about 5-6 years ago, they did diverged since around 2000, but not by a lot. We ended up with a codebase that could support both Sybase and MSSQL, where when differences occurred, we essentially used compiler directives, which wasn't that often.

I would guess at least 80% of common SQL and T-SQL from Sybase is still completely compatible with SQL Server.

As a side note, I had no idea Sybase still existed at all. Looks like it's now part of SAP's portfolio.

The two dialects are very different today

Sounds like they’re open sourcing it to get some help on it. I have to wonder if they’ve found it not worth the time to make it fully production ready.

> I have to wonder if they’ve found it not worth the time to make it fully production ready.

It sounds like it's not compatible with all T-SQL commands and data types. It would be tough to get full feature completeness: Microsoft is constantly adding new stuff in each version, like the ability to run Java stored procedures, R & Python in the database, etc. Open sourcing it would let companies expand it to get specific features coded that they need - that otherwise might not get attention from the general public.

Correct. The focus is on 100% correctness. As I wrote in the post: "Over its 35 years in existence, SQL Server has evolved to meet a wide array of use cases. When first made available on GitHub, Babelfish won’t be able to handle every use case, but will be able to tackle the most common application scenarios. Most importantly, Babelfish will meet the correctness objective. That is, if Babelfish doesn’t yet support specific SQL Server functionality, it will return an error to the application, rather than defaulting to PostgreSQL behavior. Why? Because, again, developers (and the enterprises for which they work) must be able to depend absolutely on the correctness of SQL Server compatibility."

At launch Babelfish will be able to handle - with 100% correctness - the main semantics you'd want. HOWEVER, as you said (and as mentioned in the post), there's a large surface area, and a "long tail" of functionality that needs expertise from us and others to cover. So to do this right, it needs a community. It will be great for many use cases right from the start, but to ensure it's great/perfect for your use case...well, that just might take your help. Please work with us on this.

Interesting. Now where's the Oracle-to-Postgres Translation project? ;-)

Their legal team might still be preparing for this one given Oracle's reputation.

SCOTUS hasn't yet ruled on Google LLC v. Oracle America Inc. One imagines that if Oracle prevails in some demonstrative manner they may very well believe their flavor of SQL is a software interface protected by copywrite.

So will all the other 'software interface' owners of the world.

So if Oracle wins, can AWS sue Oracle for Oracle having implemented AWS S3 API ?


AWS can probably sue most of the valley for that one.

IBM might have something to say about that, though.

Wouldn't that be ironic.

Apparently everyone keeps forgetting about ISO.

That's because everyone knows about the delta between ISO and real implementations.

Yet they still forget about pay to play with ISO standards.

I have used ora2pg with great success.

Some time ago there was big news about Amazon migrating away from Oracle. Related to this?

Announcing this first for SQL server might be a way to launch softly and learn.

isn't that what enterprisedb is selling ?

Yes but it isn't Open Source.

Amazon acquiring EDB and rolling some of the currently-proprietary bits like the Oracle compatibility layer into open source Postgres while increasing resources to developing them could be an interesting play.

Would you know if Babelfish will be released for RDS for PostgreSQL as well?

The only mentions so far are for Aurora for PostgreSQL, or self-hosted PostgreSQL.

Very good question, seems to be a strange decision. Maybe it doesn't meet what they deem up to the standards to include it in RDS due to the nature of how the backend is hosted (just a guess). Could be something as simple as the dependencies the extension requires are incompatible with what the RDS servers run. It would be very nice if the RDS team had some guidelines on what it takes to get an extension accepted for RDS with a good framework people can follow to ensure their extension complies with the security requirements, etc. Then have a way to submit an extension to be included and have it be reviewed and given guidance on any changes needed to comply (for a fee potentially).

At that point, thinking that SQL somehow is "portable" from a DB to another is a bit of a mirage. Way too many differences between systems, which does influence data modelling itself.

It feels like this is AWS fighting back with Microsoft after Microsoft raised fees to run SQL Server on AWS. Not only will AWS build this for themselves, but they'll contribute for anyone who wants it. That's a bigger blow to Microsoft than just facilitating migrations from MS SQL to pg in AWS; this impacts non-cloud-lifted database revenue.

It sounds like their goals are pretty lofty -

>Those are the mechanics, but developers need to be certain that Babelfish truly speaks SQL Server’s language in a dependable, predictable way. As such, the guiding principle for Babelfish is correctness, with no compromises. What do I mean by correctness? Namely, that applications designed to use SQL Server semantics will behave the same on PostgreSQL as they would on SQL Server.

I am assuming they just mean in regards to types and the like - to get the same performance characteristics out of the code would be bonkers.

> Sounds like they’re open sourcing it to get some help on it.

My take is that they are open sourcing it because they don't just want to use it to compete with Microsoft to host SQL Server workloads on a better basis than Microsoft's licensing policy has let them in the past, but also to just undercut Microsoft's SQL Server licensing revenue generally.

As a business strategy, sure, but given the way everyone I've heard from AWS talks about Microsoft and SQL Server licensing, I wouldn't be surprised if there was a good deal of spite involved, too.

One word: JEDI.

Well, that too, but the antipathy over SQL Server licensing, in particular, seems to have preceded the JEDI award.

i swear the fuck to god if one more piece of technology is called babelfish i am going to find out who is responsible and toilet paper their house.

I currently support the following products named "Integrity":

- Law firm management software - Document management software - A DVR appliance

Previously I supported HPE Integrity hardware. Certain names just seem way overused.

imagine how hilarious douglas adams would find it to try and google babelfish today lol.

Why not do this for Oracle? I've not found SQL Server to be too bad from the crazy Oracle stuff (light experience only - maybe bigger players have it worse?).

Oracle has peculiarities that you won't see in most other databases, which may make this a bit harder -- including such things as an oddball set of date/time types ("date" includes time to seconds), and treating zero length strings as SQL nulls --

  '' is null
evaluates to true in Oracle. It also has features like "fast refresh" on materialized views which might be tricky to emulate on other db engines. Nothing's impossible, but it's certainly more of a challenge.

Oh, immediate refresh of materialized views...its my biggest wish for postgres to have a seriously robust implementation.

I have some pretty complicated materialized views which the devs said would not work with their first implementation.

>> '' is null

This was a major pain for us especially in our java data-access layer.

It already exists: https://www.enterprisedb.com/products/edb-postgres-advanced-...

I'm not sure exactly how compatible - I've never had to use it - but EDB has been around for a while and is commercially backed. Plus they employ some core PostgreSQL contributors.

As a DBA, T-SQL is much more standard then PL/SQL. Its probably step 1 of the plan, with step 2 being Oracle.

That's completely subjective and depends on your familiarity.

Isn't SQL/PSM the standard? And PL/SQL older than them all? (Hence Postgres providing PL/pgSQL)

About 10 years back my company hired consultants, spent close to 6 months translating SQL queries/stored procs to be Oracle. The goal was to support both MSSQL Server & Oracle for the on-premise product. It was quite costly undertaking. Then few years later they just abandoned Oracle because of maintenance costs.

I wonder if anyone starting out today chooses Oracle as their relational database.

Yes, large organizations that aren't tech oriented will use Oracle every time over other solutions.

It's not a great solution for tech companies, but they do say no one ever got fired for choosing Oracle or IBM or whatever, mainly because you can always pay them to fix your shit and there will always be someone supporting those products.

Oracle is very powerful. As a developer I found it easy to work with. But the cost to run in production is eye-watering. IDK how their licensing works today but in the past it was free to use for evaluation or for developing a prototype. Once you were "gaining business value" i.e. using it even if only internally to develop a real product, you were supposed to be paying for it.

Oracle XE is free for production use, but it only makes sense if

a) you are already familiar with Oracle, and

b) you have a very small data set:

“Oracle Database 18c XE supports up to:

3 Pluggable Databases

2 CPUs for foreground processes

2GB of RAM (SGA and PGA combined)

12GB of user data on disk (irrespective of compression factor)”

> I wonder if anyone starting out today chooses Oracle as their relational database.

Your boss does.

I bet they are working on it.

Heres ms2pg https://edoceo.com/dev/ms2pg

A tool I made and used over a decade ago when migrating a bunch of stuff

Ancient muscle memory completes the URL thusly:


I tried to build a T-Sql-to-pgsql compiler to enable us to migrate our code but ran into some fundamental issues.

Sql Server allows you to have arbitrary statements/declarations embedded in your sql queries. It also doesn't require type information to be specified in many places.

How does this translator get around that?

For example, if I have this bit of unoptimized T-Sql:

   declare @m int
   select @m=[MeterID] from EnergyMeters where MeterLocation='/a/b/c';

   select sum([Value]) from EnergyData where [MeterID]=@m;

How would this get translated to pgsql? (Yes, you can combine this specific statement into a single query - this is a trivial example to highlight the point)

If you are just binding variables with early (before the last one that returns data) selects like that, just turning them into subqueries or factoring them out to CTEs works, which should be reasonably straightforward mechanically.

This sounds like it could be a massive competitive advantage for AWS over Azure.

It would be difficult for Microsoft to canibalize their Azure Sql sales by building a similar translation layer.

About every 3 years when I've tried to migrate and use some db migration tool it always seems to throw frustrating string/formatting errors, each time I've smacked my forehead and ended up just grabbing Ruby and ActiveRecord, it just always seems to work without any weird parsing errors.

1) Is there MongoDB-to-Postgres Translation layer?

2) Is there converter that can convert schema and transfer all data from:

2.1) MongoDB to SQLite?

2.2) MongoDB to PostgreSQL?

Any idea what language it’s written in?

"Babelfish is written in C, which is the same programming language used to develop PostgreSQL. Some parts of Babelfish are developed using procedural language in PL/pgSQL. Many test cases are written in PL/pgSQL and T-SQL."

Argh, I scanned the article multiple times and missed that section. Thank you

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact