
Showdown: MySQL 8 vs. PostgreSQL 10 - kenn
https://blog.dumper.io/showdown-mysql-8-vs-postgresql-10/
======
asah
Great to see MySQL adding this stuff! There are still a ton of reasons to choose
Postgres, and more and more Silicon Valley startups seem to be choosing pg - I
don't remember the last time I met a startup choosing MySQL.

DBMSs are giant complex pieces of software with a million features - it's
really hard to compare them. But if I had to sum it up, you can dump freaking
line noise into Postgres and then hide the nastiness and manage and query it
like a well-designed database. If you have a pretty database, good for you,
but today's app writers have gotten lazy with NoSQL record stores and their
databases resemble vomitoriums - and let's not talk about what people used to
do in the 80s and early 90s.

Without further ado, Postgres vomitorium cleanup features:

- user-defined functions/aggregates/window functions in your favorite language,
incl. JavaScript, which means you can write tricky business logic once and
run it where the data is, vs. pulling millions of records out of the
database. Language list: https://www.postgresql.org/docs/10/static/external-pl.html
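
A minimal sketch of what that looks like (hypothetical function; assumes the
plv8 extension is installed):

    -- business logic in JavaScript, running inside the database
    CREATE FUNCTION business_score(amount numeric, risk numeric)
    RETURNS numeric LANGUAGE plv8 AS $$
      return amount * (1 - risk);  // runs where the data is
    $$;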

- foreign data wrappers with hundreds of connectors AND a 5-minute toolkit
for authoring new wrappers in Python and other scripting languages.
https://wiki.postgresql.org/wiki/Foreign_data_wrappers
http://multicorn.org/
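
For the stock postgres_fdw flavor, the whole setup is a few statements
(hypothetical host, schema, and credentials; Multicorn-based wrappers plug in
the same way):

    CREATE EXTENSION postgres_fdw;
    CREATE SERVER remote_db FOREIGN DATA WRAPPER postgres_fdw
      OPTIONS (host 'remote-host', dbname 'otherdb');
    CREATE USER MAPPING FOR CURRENT_USER SERVER remote_db
      OPTIONS (user 'reader', password 'secret');
    -- pull in remote table definitions so you can query them like local tables
    CREATE SCHEMA remote_tables;
    IMPORT FOREIGN SCHEMA public FROM SERVER remote_db INTO remote_tables;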

- index goddamned anything. Postgres has the widest array of index types of
any open source database AND, if you need to, you can easily write a function
(in JavaScript or Python!!!) and create an index on the result of that
function call. Postgres even has a full range of partial indices and block
range indices, which make it practical to index massive and sparse datasets.
https://www.postgresql.org/docs/current/static/indexes-partial.html
https://www.postgresql.org/docs/current/static/indexes-expressional.html
https://www.postgresql.org/docs/current/static/brin-intro.html
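
For instance, a partial index stays tiny by covering only the sparse slice you
actually query (hypothetical table and columns):

    CREATE INDEX orders_pending_idx ON orders (created_at)
      WHERE status = 'pending';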

EXAMPLE: I was once handed a MySQL database of IoT signals where timestamps
were in seconds since the epoch and asked to report on this data without
changing the database. No biggie: 15 minutes to replicate this OLTP database
to a read-only Postgres replica (dumb schema mapping) - then a Postgres VIEW
to hide this nastiness and a function index on the timestamp column
(to_timestamp).

(from memory)

    CREATE VIEW foo AS
      SELECT *, to_timestamp(mysql_ts_col) AS ts_timestamp
      FROM replicated_mysql_table;

    CREATE INDEX foo_ts_inx
      ON replicated_mysql_table (to_timestamp(mysql_ts_col));

At another company, we had a load of JSON dumped on us and weren't sure how
we'd need to parse it. No biggie: I just created indices using function calls
that parsed the JSON.
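
A minimal sketch of that pattern, with Postgres's built-in jsonb operators
standing in for a custom parse function (hypothetical names):

    CREATE TABLE raw_events (id bigserial PRIMARY KEY, payload jsonb);
    -- ->> extracts a field as text; the index is built on that expression
    CREATE INDEX raw_events_user_idx ON raw_events ((payload ->> 'user_id'));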

- tons of native datatypes and extensible datatypes. In cases where you're
handed complex structures or "weird" data that doesn't behave the way most
programmers expect, you can define new datatypes, then create a library of
user-defined functions around them.
https://www.postgresql.org/docs/9.5/static/xtypes.html
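
As a toy sketch (the xtypes docs cover full base types in C; a composite type
plus a function over it is the five-minute version):

    CREATE TYPE complex AS (r double precision, i double precision);

    CREATE FUNCTION complex_mag(c complex) RETURNS double precision
      LANGUAGE sql IMMUTABLE
      AS $$ SELECT sqrt((c).r * (c).r + (c).i * (c).i) $$;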

- sampling. Postgres has native, low-level support for queries that sample
the data, which makes it super fast to explore data while preserving some
semblance of statistics.
https://blog.2ndquadrant.com/tablesample-in-postgresql-9-5-2/
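
For example (9.5+; SYSTEM samples whole pages, BERNOULLI samples individual
rows; hypothetical table):

    -- scan roughly 1% of the table instead of all of it
    SELECT avg(response_ms) FROM requests TABLESAMPLE SYSTEM (1);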

- EXPLAIN. The Postgres planner/optimizer is still the king at explaining why
your query is taking forever and what you can do about it. Admittedly, this
stuff quickly gets arcane, but you can post your EXPLAIN output to a forum and
guys like me will tell you how to override the JOIN order, update statistics,
etc. https://www.postgresql.org/docs/current/static/using-explain.html
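
E.g., run with ANALYZE and BUFFERS to get actual timings and I/O alongside
the plan (hypothetical query):

    EXPLAIN (ANALYZE, BUFFERS)
      SELECT * FROM orders WHERE customer_id = 42;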

Finally, no conversation about Postgres vs <x> is complete without mentioning
that ALMOST EVERY FEATURE IN POSTGRES WORKS WITH EVERY OTHER FEATURE, which
means you don't waste hours investing in something and then "oops", your
carefully written user-defined function can't be invoked in some obscure place.
With Postgres, you can assume everything just works and will keep working.
There are few pieces of software that can claim this.

~~~
meritt
> EXAMPLE: I was once handed a MySQL database of IoT signals where timestamps
> were in seconds since the epoch and asked to report on this data without
> changing the database

I'm not following why you moved the data into Postgres, other than to say you
did. Are you suggesting that, because you were restricted from making schema
changes to the MySQL instance, that's a reason why Postgres is superior?

~~~
cosmie
The reason for the migration was to be able to leverage the
Functional/Expression Index[1] capability of Postgres. Any type of aggregated
time series reporting queries were likely painful from a performance
perspective, since an index on the epoch timestamp would have limited
usefulness from an optimization standpoint.

So instead, he leveraged the expression index functionality of Postgres to
pre-materialize an index against the converted timestamp. He didn't touch the
table structure itself so it's transparent, but gets the performance benefits
of that index already existing.

MySQL doesn't support function-based indexes directly, although you can
achieve a similar result in newer versions of MySQL with an intermediate step.
You can create a Generated Column[2] first, and then build an index against
that. If you specify it as a virtual generated column, then it's essentially
the same as the above process: the column isn't physically stored, but you can
index it. That said, asah may still not have been able to do that if the
version of MySQL was too old, or if even that level of schema change was not
allowed.
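
A sketch of that workaround on MySQL 5.7+, with hypothetical table and column
names (generated columns are restricted to deterministic expressions, and
FROM_UNIXTIME's dependence on the session time zone is a caveat to check):

    ALTER TABLE iot_signals
      ADD COLUMN ts_datetime DATETIME
        GENERATED ALWAYS AS (FROM_UNIXTIME(epoch_ts)) VIRTUAL,
      ADD INDEX iot_signals_ts_idx (ts_datetime);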

[1] https://www.postgresql.org/docs/current/static/indexes-expressional.html

[2] https://dev.mysql.com/doc/refman/5.7/en/create-table-generated-columns.html

------
garyclarke27
Interesting, but obviously biased in favour of MySQL. I disagree with some of
the Postgres comments (I know Postgres very well, though not MySQL). Wrong:
not everyone uses sequential integer primary keys; they are completely useless
as a set constraint, so I use real-world keys. UPDATE is irrelevant to me, so
vacuum is not a problem: I use an append-only, i.e. immutable, database. The
upside (not mentioned) is that this architecture makes PostgreSQL rock solid;
it can stay up for years, unlike, for example, MS SQL Server, which requires
regular reboots and often fails to come back up cleanly. Postgres also
provides DDL transactions with rollback. I'm curious: does MySQL now have
check constraints, arrays, DROP SCHEMA with cascade (a huge timesaver), a V8
PL, lateral joins, range types, custom aggregates (so powerful), SQL function
inlining, parallel query, functional indexes, exclusion constraints (e.g. no
period overlap), NOTIFY/LISTEN, foreign data wrappers?
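
For one of those, exclusion constraints, the classic no-overlap sketch looks
like this (hypothetical table; btree_gist is a stock extension):

    CREATE EXTENSION IF NOT EXISTS btree_gist;

    CREATE TABLE room_booking (
      room   int,
      during tsrange,
      -- no two rows may share a room with overlapping time ranges
      EXCLUDE USING gist (room WITH =, during WITH &&)
    );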

------
mbell
The connection comments are a bit dubious. MySQL will use less memory for
1000 connections, but performance will still drop due to contention and
context switching. In both systems you want a small number of connections to
the actual database, something on the order of 1-2x CPU cores usually, and
something on top pooling client connections if you need a lot of them:
pgbouncer or the equivalent for MySQL.

~~~
nvivo
I had some real issues with MySQL handling more than 2000 connections. Once
past that limit, CPU usage increases exponentially with each additional
connection due to context switching. At 3000 connections, 80% of the CPU was
used for context switching and it became unusable. That was a 64-core/256 GB
RAM server.

~~~
stevenwoo
What did you do to address the mysql connection problem?

~~~
nvivo
I ended up reducing the number of clients. In my case I had a thousand
servers, and I was able to change the application structure and merge them
into a dozen big application servers. Now with connection pooling each server
has less than 100 connections.

As a more permanent solution for scaling, I'm moving out of mysql into
something more distributed.

~~~
kolzeq
Or you can move to MariaDB and set "thread_handling=pool-of-threads" to
enable the thread pool, which is exactly the solution for this.

(MySQL has that only in the "enterprise" version.)
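
I.e., in the server config:

    # my.cnf (MariaDB)
    [mysqld]
    thread_handling = pool-of-threads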

------
foxylion
The article contains a footnote about UUIDs as primary keys.

> UUID as a primary key is a terrible idea, by the way — cryptographic
> randomness is utterly designed to kill locality of reference, hence the
> performance penalty

Is there anyone who can go into a little bit more detail?

We planned to migrate our database to use UUIDs as primary keys. This will
allow creating new rows on clients knowing the new primary key before sending
them to the server (simplifying client and server code).

~~~
jerrysievert
uuids aren't guaranteed to be unique at generation; there is still a non-zero
chance of a collision. using one as a primary key generated by the database
helps mitigate that, as there will normally be a uniqueness constraint on the
index.

creating that uuid on the client likely will not accomplish what you're
hoping.

~~~
mbroshi
It's pretty unlikely to get a collision[1], and the client should be able to
handle the collision gracefully (regenerate the UUID and retry).

[1] https://www.quora.com/Has-there-ever-been-a-UUID-collision

~~~
jerrysievert
> It's pretty unlikely to get a collision

unlikely is not zero, which was why I commented. I'd hate to rely on
uniqueness of something that has a chance of not being unique.

~~~
openasocket
If you generate 1 billion UUIDv4s a second, it would take, on average, 85
years for you to produce a duplicate, and the resulting list of UUIDs would
take up ~45 exabytes. And keep in mind that even if inserting a row fails
because you've somehow managed to generate a duplicate UUID, it is trivial to
make a new UUID and retry. Since the database enforces the uniqueness
constraints of primary keys, I'm hard pressed to come up with a scenario in
which generating a duplicate UUID would actually do anything serious.

------
parvenu74
I like that PostgreSQL can have both relational tables and JSONB document
collections (NoSQL) in the same database. Use NoSQL where it makes sense and
relational tables for data that is inherently relational, and query and join
across both (or batch-process from one to the other). I find this very cool.

Of course, I wonder if it's too much cool: whether in trying to do everything
it's falling short in some significant and fundamental way.

~~~
heyoni
Wow, I had no idea it could store jsonb. Any obvious advantages to using
mongodb for nosql that I should consider? I’m way more comfortable with
Postgres and would rather stick to that

~~~
amunicio
Warning: I haven't used MongoDB in several years, so this info may be
outdated!

The main advantage of using JSONB in Postgres over MongoDB is that you can
create tables that mix regular fields (varchar, number, date, etc.) with
JSONB fields. Then you can join your table with other tables (no need for
map/reduce or other insane processing for a simple join).
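
A minimal sketch of that mix (hypothetical schema):

    CREATE TABLE users (
      id   bigserial PRIMARY KEY,
      name text
    );

    CREATE TABLE events (
      id      bigserial PRIMARY KEY,
      user_id bigint REFERENCES users (id),
      payload jsonb              -- free-form document data
    );

    -- relational join plus jsonb field extraction in one query
    SELECT u.name, e.payload ->> 'action' AS action
    FROM events e
    JOIN users u ON u.id = e.user_id
    WHERE e.payload @> '{"source": "mobile"}';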

~~~
heyoni
That actually sounds really cool. Honestly, even if Mongodb has all those
features, I would much rather do everything on postgres seeing as I'm a
hundred times more familiar with it than mongo.

------
hodgesrm
> [2] When I say Postgres is great for analytics, I mean it. In case you don’t
> know about TimescaleDB, it’s a wrapper on top of PostgreSQL that allows you
> to INSERT 1 million records per second, 100+ billion rows per server. Crazy
> stuff. No wonder why Amazon chose PostgreSQL as its base for Redshift.

Correction: Amazon chose ParAccel, which was a data warehouse forked from
PostgreSQL.

Many data warehouse products have followed this path due to licensing. MySQL
is GPLv2 which means you can't ship derivative works without releasing your
code. PostgreSQL has a permissive license similar to MIT/BSD. You can do
anything you want with the code. That's still a major consideration which the
article omitted.

(Cross-posted from another HN link to same article.)

~~~
chrisjc
Also, inserting directly into Redshift is strongly discouraged as it's
extremely non-performant.

> An anti-pattern is to insert data directly into Amazon Redshift, with single
> record inserts or the use of a multi-value INSERT statement, which allows up
> to 16 MB of data to be inserted at one time. These are leader node–based
> operations, and can create significant performance bottlenecks by maxing out
> the leader node network as data is distributed by the leader to the compute
> nodes.

https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-techniques-for-amazon-redshift/

------
saosebastiao
Referenced in the article:

https://wiki.postgresql.org/wiki/Future_of_storage

This reads like all my dreams come true.

------
sufehmi
What about upgrading to a newer version of PostgreSQL? Does that still
require migrating the whole database?

I evaluated PostgreSQL several times in the past, and cancelled once I found
out that upgrading to a new version required migrating the whole database. Our
databases are too big and our uptime requirements are too strict; we cannot
afford it.

~~~
manquer
You can use logical replication between the two versions and fail over to the
new version without any downtime. In pg 10+ it is built into core; for earlier
versions you can use pglogical or similar tools.
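
A minimal sketch of the built-in (10+) flavor, with hypothetical connection
details:

    -- on the old server (publisher)
    CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

    -- on the new server (subscriber); streams changes until you fail over
    CREATE SUBSCRIPTION upgrade_sub
      CONNECTION 'host=old-server dbname=mydb user=replicator'
      PUBLICATION upgrade_pub;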

~~~
sufehmi
Great idea - I'll try to see if this is doable in our situation. Thanks.

------
cat199
0 mention of language or datatype extensibility, which is huge.

------
cat199
this states all of the drawbacks of process vs. thread but none of the
benefits (resiliency/compartmentalization of errors, less need for lock
coordination and less risk of locking-related bottlenecks at scale, somewhat
better host OS CPU & IO utilization, etc.)

------
dspillett
_> With a clustered index, when you look up a record by the primary key_

To nit-pick: it might be the case in MySQL, but some DBs (SQL Server, for
instance) allow the clustering key to be something other than the primary key,
and for some analytical workloads this can be much more efficient.

------
dspillett
_> UUID as a primary key is a terrible idea, by the way_

Unless you use a v1 UUID, i.e. via NEWSEQUENTIALID() in SQL Server. IIRC
Postgres has an equivalent available as a standard module.

------
patrickg_zill
I am pretty sure that PG has had clustered indexes for a decade or more... ?
e.g. https://www.postgresql.org/docs/9.1/static/sql-cluster.html

Or is this term referring to a different feature/method than this?

One thing not mentioned: PL/SQL vs. whatever the MySQL equivalent is.

~~~
da_chicken
You're correct that PostgreSQL has had clustered indexes for quite a long
time. The only difference here is that PostgreSQL allows a HEAP table. That
said, you do have to re-cluster a PostgreSQL table, and that requires an
exclusive lock on the table, which is obviously not ideal for a massive table.
MySQL will always cluster on the PRIMARY KEY or (if there isn't one) the first
UNIQUE key[0], but as far as I can tell it always clusters on write. I don't
see any way to configure the padding of the clustered index like you can on
SQL Server, so I'm not sure how this is accomplished.
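
I.e., the Postgres version is an explicit, one-time physical reorder
(hypothetical names):

    -- rewrites the table in index order; takes an ACCESS EXCLUSIVE lock
    CLUSTER my_table USING my_table_pkey;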

MySQL only supports SQL in procedures[1]. There is no PL/pgSQL[2] equivalent,
and, except for custom UDFs written in C/C++, there's no support for external
procedural languages, either. PostgreSQL[3] supports PL/pgSQL, PL/Python,
PL/Tcl, and PL/Perl in base, plus there are external modules for PL/Java,
PL/Lua, PL/R, PL/sh, and PL/v8.
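
For reference, the shape of a trivial PL/pgSQL function (sketch):

    CREATE FUNCTION add_one(n integer) RETURNS integer
    LANGUAGE plpgsql AS $$
    BEGIN
      -- procedural control flow, variables, loops, etc. live here
      RETURN n + 1;
    END;
    $$;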

[0]: https://dev.mysql.com/doc/refman/5.7/en/innodb-index-types.html

[1]: https://dev.mysql.com/doc/refman/8.0/en/adding-functions.html

[2]: https://www.postgresql.org/docs/current/static/plpgsql-overview.html

[3]: https://www.postgresql.org/docs/current/static/xplang.html

