PostgreSQL is the world’s best database (2ndquadrant.com)
591 points by known on April 3, 2020 | 355 comments



For most of the projects where the DB really mattered, throughout my 10+ years as a freelancer, it came down to one thing the client really cared about: performance. Nothing else mattered, not license price, not bells and whistles, not hype. My clients wanted to have the data in front of their eyes the same second they clicked the button. And when you have a table with 100 million rows in it, and the application cannot load data from that table and show it to its users within that second, you have already lost the battle.

Out of my attempts, my clients' attempts, and my friends' attempts, only one DB was up to this task. Not Maria, not Access, not Oracle, not NoSQL, not MSSQL, not MySQL - all of them failed. The only one was PostgreSQL.

And before you start bashing me, please do this. Make a small application that shows a map, put 100 million points of interest on that map, stored in the table we're talking about, and now, as you scroll the map, take the middle of the view as the center of a circle and select, within a small radius, only those points of interest inside that radius. No more than a thousand points of interest, let's say. When you can do that within a second, you've got yourself a good database. For me, PGSQL was the only one capable of doing this reliably.
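
For reference, this is roughly the shape of the query in PostGIS (the table, column names, and coordinates below are made up, and PostGIS has to be installed with CREATE EXTENSION postgis):

    -- A GiST index on a geography column lets ST_DWithin answer
    -- "points within N meters of here" without scanning all 100M rows.
    CREATE TABLE poi (
      id   BIGSERIAL PRIMARY KEY,
      name TEXT,
      geom GEOGRAPHY(POINT, 4326) NOT NULL
    );
    CREATE INDEX poi_geom_idx ON poi USING GIST (geom);

    -- Points of interest within 5 km of the current map center, capped at 1000.
    SELECT id, name
    FROM poi
    WHERE ST_DWithin(geom, 'SRID=4326;POINT(-73.9857 40.7484)'::geography, 5000)
    LIMIT 1000;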


While I do love PostgreSQL (and PostGIS which is excellent at transforming, indexing and querying geographic content) - I feel like MSSQL should get a bit of a shout-out in relation to your comment on performance.

An experienced database developer with help from "SET STATISTICS IO ON"[1] and query plans[2] can achieve incredible MSSQL query optimization results.

PostgreSQL has good query plan output via the EXPLAIN[3] statement - but I haven't (yet?) seen PostgreSQL produce per-query IO statistics.

[1] - https://docs.microsoft.com/en-us/sql/t-sql/statements/set-st...

[2] - https://docs.microsoft.com/en-us/sql/relational-databases/pe...

[3] - https://www.postgresql.org/docs/current/sql-explain.html


Configuration parameter `track_io_timing = 'on'` will measure I/O time. And running `EXPLAIN (ANALYZE, BUFFERS)` will output per-plan-node buffer statistics and I/O time spent. On most modern systems IO timing has no measurable overhead and should be permanently enabled. Collecting buffer statistics is also relatively cheap and can be enabled for all queries. For example, the following configuration will produce a log entry with an explain plan and per-node I/O stats for every query above a threshold:

    shared_preload_libraries = 'auto_explain'
    auto_explain.log_min_duration = '5s'
    auto_explain.log_analyze = true
    auto_explain.log_buffers = true
    auto_explain.log_timing = false


You don't need to log every auto explain. Just enable track_io_timing and pg_stat_statements and you get per-query IO performance metrics much more cheaply.

Table F.21. pg_stat_statements Columns

https://www.postgresql.org/docs/11/pgstatstatements.html

blk_read_time (double precision): Total time the statement spent reading blocks, in milliseconds (if track_io_timing is enabled, otherwise zero)

blk_write_time (double precision): Total time the statement spent writing blocks, in milliseconds (if track_io_timing is enabled, otherwise zero)
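
As a sketch, the per-statement I/O time can then be pulled straight from the view (column names as documented for PG 11 above; track_io_timing must be on and pg_stat_statements loaded):

    -- Top 10 statements by time spent on block I/O.
    SELECT query,
           calls,
           total_time,
           blk_read_time,
           blk_write_time
    FROM pg_stat_statements
    ORDER BY blk_read_time + blk_write_time DESC
    LIMIT 10;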


Thanks for this - 'pg_stat_statements' definitely looks like a similar feature, in particular because it reflects the count of disk-related operations (not just I/O time).

It took some brief configuration but I've been able to try it out locally and will refer to it when doing PostgreSQL query performance tuning in future.


PostgreSQL has track_io_timing, and passing BUFFERS to EXPLAIN will then include IO timing.

SET track_io_timing = on;

EXPLAIN (ANALYZE, VERBOSE, BUFFERS) your-query;

Additionally, it will include things like time spent in triggers executed by the query.


At the end of the day, the PostgreSQL optimizer is just great, and will probably beat anything in a well-balanced comparison.

But MSSQL also has a fairly good optimizer (of the databases the GP listed, those are the only two good options), so with tuning you can get better results than an untuned Postgres. It is also possible (but unlikely) that you have a problem it happens to fit better.

As for Oracle, it is pretty great at doing `select stuff from table join table using (column)`. (I don't know what it does, but it does spend a lot of memory doing it.) So if that is all you do, it's the best option. But if you do anything else, that anything else will completely dominate the execution time.


For me, MySQL’s quirkiness and lesser adherence to the SQL standard kind of neuter any performance niftiness it might have. I can see using MySQL for something like logging but for production data, I prefer PostgreSQL. It’s much more predictable.


as others have said, EXPLAIN (ANALYZE, BUFFERS, VERBOSE) will give you IO stats within your query at a very granular level.


It depends a lot on the implementation strategy. Some databases have GIS features, allowing you to query all points within a radius directly. I know oracle can do this, but it’s a paid addon (spatial and graph) and I don’t know how well it scales.

On databases that have good scanning behavior you can use a geohash approach instead. Insert all points with a geohash as key, determine a geohash bounding box that roughly fits your circle, query the rows matching that geohash in their key, run a “within distance” filter on the results client-side. I used that approach in hbase and it could query really fast. I suspect it would perform well on pretty much any mainstream sql db, but admittedly I haven’t tried it.
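
A rough sketch of that approach in plain SQL (table name, geohash cell, and coordinates are made up; the final distance check can just as well live client-side, as described above):

    -- Precompute a geohash per point and index it. A prefix match on the
    -- cell covering the search area narrows the candidates (in PostgreSQL,
    -- text_pattern_ops lets LIKE 'prefix%' use the b-tree index), and the
    -- haversine formula then trims them to the real radius.
    CREATE TABLE places (
      id      BIGINT PRIMARY KEY,
      lat     DOUBLE PRECISION NOT NULL,
      lng     DOUBLE PRECISION NOT NULL,
      geohash TEXT NOT NULL
    );
    CREATE INDEX places_geohash_idx ON places (geohash text_pattern_ops);

    SELECT id, lat, lng
    FROM places
    WHERE geohash LIKE 'u4pru%'          -- bounding geohash cell
      AND 2 * 6371000 * asin(sqrt(
            power(sin(radians(lat - 57.649) / 2), 2)
            + cos(radians(57.649)) * cos(radians(lat))
            * power(sin(radians(lng - 10.407) / 2), 2)
          )) < 1000;                     -- within 1 km of (57.649, 10.407)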


For true geospatial data models that correctly implement most geospatial predicates and operators, your only practical options are PostGIS or Oracle Spatial. This is a very complex and nuanced piece of mathematics and software engineering that requires writing a massive amount of code; even most closed source databases don't do a real implementation due to the difficulty. Most developers are unaware of just how broken most half-assed implementations are; the databases do just enough to check a box but not enough to actually be useful if you are doing real geospatial work.

Geohashing has a lot of weaknesses and limitations and is not really recommended unless your use case is very simple and not primarily geospatial in nature. The modern version was created in the 1990s and implemented by big database companies like Oracle and IBM (based on work in the 1960s and '70s). They deprecated that indexing feature in their database engines a long time ago for good technical reasons, and those reasons apply to current open source implementations too.

Geospatial index performance is very sensitive to temporal and geometry distribution, and the performance variance can be extremely high. For non-trivial applications, the real world will eventually produce data that inadvertently executes a denial of service attack on naive geospatial databases.


Postgres can be combined with the PostGIS addon (which is free) to get a reasonable feature set on top of PG with GIS features.


Yeah, PostGIS is a must if you plan on doing any sort of geospatial work with Postgres. PostGIS is way ahead of most other databases when it comes to processing geographic (3D) types. For example, calculating a 100km buffer zone for 2D geometry types is easy, but only PostGIS can support buffer zone calculations on 3D geographic types. Because of the curvature of the earth, the math is quite involved, but it can make a big difference in accuracy.
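
As a small illustration (coordinates made up), the geography type makes that a one-liner, with the earth's curvature handled for you:

    -- A 100 km buffer computed on the geography type; the same call on a
    -- plain 2D geometry would treat the radius as planar units, not meters.
    SELECT ST_Buffer('SRID=4326;POINT(-73.9857 40.7484)'::geography, 100000);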


Oracle now includes Spatial in the basic license, with Graph as a paid option. Spatial solves a lot of GIS needs but lacks some features, in particular around 3D objects.

MSSQL has a GIS package included in the basic database. I'm not sure how featureful it is now, but the last time I looked (years ago) it was missing a lot.

MySQL also has a GIS package included at the database. It's missing a lot of things.

The top of the line is, as usual, Postgres. But honestly, you probably won't need the difference.


MySQL also has lots of GIS features, including R-tree spatial indexes.


Haha, commented at the same time and that's exactly the approach I've used!


I'm curious to hear why and how Maria failed you.

For a reddit-like social site, we have a Maria database with around 3 billion rows in total, with the largest table (votes) spanning about 1.5 billion rows.

We can select data within ~3ms for about 20k-40k concurrent visitors on the site. Our setup is a simple Master-Slave config, with the slave only as backup (not used for selects in production). Our 2 DB servers have 128gb of RAM and cost around 200 EUR/month each - so, nothing special.

Of course you have to design your schema with great care to get to this point, but I imagine PG to be as lost as any other database if a crucial index is missing.


This doesn't add up. 100M rows isn't a lot. All of those databases can do it, and commercial systems like MSSQL and Oracle have millions in R&D invested in performance and geospatial features.

It's not realistic that Postgres would outclass everything else like this.


Geoquery. That's a very difficult query to index efficiently: it requires learning about WGS84 to get the right answer even for computing the radius.


All of these databases have geospatial data types for efficient querying and indexing. It's been a standard feature for years.

Even if they didn't, there are plenty of algorithms based on lat/lon which use numeric/float data types with simple indexes.


> Even if they didn't, there are plenty of algorithms based on lat/lon which use numeric/float data types with simple indexes.

Not that trivial if you're not just dealing with points.


So you have more complicated SQL then, but it doesn't make a difference between DBs selecting from a 100M table with indexes.

If it ends up being a full table scan then PG only recently got query compilation and parallel execution so at best it would be even with, but not better than, MSSQL/Oracle that can do vectorized executions on columnstores.


You need different index structures. Yes, it matters.

Hard to believe that spatial queries over geometries more complex than points will meaningfully benefit from "vectorized executions on columnstores".


Do you have an example query? If you're not using geospatial features, then what indexes does PG offer that the others can't match (either directly with indexes or with better scan/query performance)?


> Do you have an example query?

I mean, just about any geospatial query? E.g. finding all polygons in a table that overlap with one specified in a query. https://postgis.net/docs/ST_Overlaps.html

> If you're not using geospatial features,

Why are you talking about not using geospatial features? I was just responding to your

> Even if they didn't, there are plenty of algorithms based on lat/lon which use numeric/float data types with simple indexes.

statement. Which I still think is not a meaningful alternative. And then subsequently to your statement about vectorized columnstores.

I'm not at all saying that only Postgres has geospatial indexing support (although PostGIS is pretty good) - that would be ridiculous.

> then what indexes does PG offer that the others can't match (either directly with indexes or with better scan/query performance)?

I do think that postgres' indexing support (gist, gin in particular) makes it easier to add support for indexing types of data that are harder to index in many other databases. But it's nothing that can't be beat.


The point is that without geospatial features, you can still use other strategies like lat/long or geohash and they will all have the same performance.

The logic (built-in geospatial or hand-written sql) is irrelevant to the performance comparison and GP's post that Postgres was the only one that could query quickly.

Do you disagree with that? If not then I'm not sure what your argument is.


This.


TBH, that should be easily doable for any database. I've built something like this with SQLite and on another occasion with Elasticsearch (where it's obviously much easier). If you leverage Geohash, and precalculate the geohash for every location, you don't even need any builtin geo-capabilities from the database.


SQLite now supports R*Trees for spatial queries.

https://sqlite.org/rtree.html
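
A minimal sketch of the R*Tree module (table and values made up): points are stored as degenerate boxes and queried with a lat/lng window, with any exact radius check done afterwards.

    CREATE VIRTUAL TABLE poi_rtree USING rtree(
      id,
      min_lat, max_lat,
      min_lng, max_lng
    );
    INSERT INTO poi_rtree VALUES (1, 40.7484, 40.7484, -73.9857, -73.9857);

    -- Everything inside a small window around the query point.
    SELECT id FROM poi_rtree
    WHERE min_lat >= 40.70 AND max_lat <= 40.80
      AND min_lng >= -74.05 AND max_lng <= -73.90;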


>And before you start bashing me, please do this. Make a small application that shows a map, put 100 million points of interest on that map, stored in the table we're talking about, and now, as you scroll the map, take the middle of the view as the center of a circle and select, within a small radius, only those points of interest inside that radius. No more than a thousand points of interest, let's say. When you can do that within a second, you've got yourself a good database. For me, PGSQL was the only one capable of doing this reliably.

This is a bad example. An SQL database is the wrong tool for the job if you care about performance.

You want a data structure like a range tree[1] or a k-d tree[2]. The results would be near-instantaneous.

100 million lat/lng pairs is something you can fit in RAM without any problem. And you can even do the geometry query on the client side for performance in milliseconds.

Other examples might need SQL, but here, this would be doing things the wrong way.

[1]https://en.wikipedia.org/wiki/Range_tree

[2]https://en.wikipedia.org/wiki/K-d_tree


While SQL will not necessarily give you the optimal solution, a btree index on a geohash will still give you O(log(n)) lookups on spatial data, which will be good enough for most use cases, and pretty much every database engine in existence can create an index on a string, which is all you need for that solution.


Sorry, it is not at all obvious to me how you can implement a query of the form "all locations within 100 miles of lat/lng" with a geohash (as in, "give me all stores within driving distance").

If you mean that nearby places have geohashes with large common prefixes, then the "edge" cases where this won't work include such little-known, unpopular cities as London, UK (through which the Greenwich meridian passes).

That's to say, East and West London geohashes have no common prefix. The Uber driver picking you up in Greenwich will drop you off in a geohash which starts with a different letter.

To point out the obvious, the surface of Earth is two-dimensional, and a database index is one-dimensional. A solution to the 2D query problem using a 1D index is mathematically impossible (if one existed, you would be able to construct a homeomorphism between a line and a plane, which does not exist).

Sure, geohashes can be used to compute the answer. But you need to know more information than just the letters in the geohash to find out which geohash partitions are adjacent to a given one. That extra information would be equivalent to building a space partitioning tree, though, and you still won't be able to use a range query to get it.

If there's an SQL range query-based approach with geohashes which compromises on correctness but is still practical, I'm all ears.

---

PS: this is a good overview of this problem:

https://dev.to/untilawesome/the-problem-of-nearness-part-1-g...

The solution there involves first producing a geohash cell covering. Of course, this means you have a BSP-like structure backed by a database.

---

PPS: the quick-and-dirty way using SQL would be keeping a table with lat / lng in different columns. It is not hard to define a lat/lng box around a point that has roughly equal sides, using basic trig.

You can filter locations outside the box with SQL, and then just loop over the results to check distances and fine-tune the answer (if needed). However, it will not be performant.
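
A sketch of that quick-and-dirty box (table name and coordinates are made up; one degree of latitude is ~111 km, and the longitude span is widened by 1/cos(latitude)):

    -- Candidates inside a ~5 km half-width box around (40.7484, -73.9857);
    -- the exact distance check then runs over the results in the application.
    SELECT id, lat, lng
    FROM locations
    WHERE lat BETWEEN 40.7484 - (5.0 / 111.0)
                  AND 40.7484 + (5.0 / 111.0)
      AND lng BETWEEN -73.9857 - (5.0 / 111.0) / cos(radians(40.7484))
                  AND -73.9857 + (5.0 / 111.0) / cos(radians(40.7484));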


Well, a client invited me to evaluate PostgreSQL. Strong MSSQL shop, thousands of databases, millions of customers, tens of thousands of internal users. A big pile of licensing money they were eager to save.

We investigated the options for LDAP and Kerberos integration, commercial and free. Turns out there is no decent way to do it and grant permissions based on LDAP groups. There wasn't even a half-decent way. The only options were way too hackish to consider. MS SQL still owns the house.


Why would you want LDAP integration for a database? Maybe that kind of messed up requirement is part of the problem?

One of the bigger issues with places that have made themselves dependent on MS solutions is that MS software completely permeates the place. Then when the pain of going full MS is too big, they are only looking for drop in replacements of existing parts. This will never go well. MS software never plays well with alternatives.

If you have let yourself slide too far down the slippery slope of MS Office integrations and MS Sharepoints and the like, you will have to pay the MS price. Or you have to be willing to chuck out the entire lot.


Because certain groups of users need the same sets of permissions on hundreds of databases and servers. And the members of those groups are constantly changing. In addition, this ties an identity across systems and facilitates SSO on the one hand and auditing on the other. There's more to it, but these are the most obvious.

This is a common, standard requirement in enterprise systems, and it seems quite sensible. If you have a better, more robust solution, there are billions in this market, in the most literal sense.

One more thing: although Active Directory is the standard in the enterprise, LDAP is an open protocol that existed before Microsoft was even dreaming of being a player in the enterprise field, and it has at least half a dozen open source implementations.


> Because certain groups of users need the same sets of permissions on hundreds of databases and servers. And the members of those groups are constantly changing. In addition, this ties an identity across systems and facilitates SSO on the one hand and auditing on the other. There's more to it, but these are the most obvious.

That is precisely what I meant with messed up requirements. Why would any of your SSO be relevant on the database level? A database user should be coupled to an application, not to a single user in AD. If you are letting constantly changing users do their own SQL requests on DB level, there is something rotten in the first place.

> This is a common, standard requirement in enterprise systems, and it seems quite sensible. If you have a better, more robust solution, there are billions in this market, in the most literal sense.

I have never come across this kind of setup in a normal enterprise environment. There, you have everything behind some kind of enterprise software. Data entry, auditing, etc.

However, you could have need for this if we are talking developers sharing a single database and you don't want to manage accounts and passwords separately.

You said the solutions you found for Postgres were too "hacky". I think if you have a setup where enterprise users who are not developers or DBAs need personal access on DB level, your setup is quite "hacky" to begin with.

> One more thing: although Active Directory is the standard in the enterprise, LDAP is an open protocol that existed before Microsoft was even dreaming of being a player in the enterprise field, and it has at least half a dozen open source implementations.

Microsoft has bent LDAP and Kerberos in its own way. Trying to use AD like an LDAP server is full of unpleasant surprises dealt to you by badly documented or outright undocumented "features", and useless error messages and logging. Believe me, it's a major nightmare to get anything integrated with AD.


There are tens of DBAs, hundreds of developers, hundreds of application users, those are the users in those groups.

I have done my fair share of integration with AD, including in some of the largest enterprises existing on this planet, where you have hundreds of groups of developers and admins. I read the RFCs and the documentation, in most cases, it is actually quite good and detailed, both on their LDAP and on their Kerberos implementation details.

> I have never come across this kind of setup in a normal enterprise environment. There, you have everything behind some kind of enterprise software. Data entry, auditing, etc.

Well, for you everything is behind the enterprise software, but I'm among those who need to implement it.


> There are tens of DBAs, hundreds of developers, hundreds of application users, those are the users in those groups.

Ah, so this is a special case with a shop that has developers and DBAs in their user base. Now that I can understand.

> I read the RFCs and the documentation, in most cases, it is actually quite good and detailed, both on their LDAP and on their Kerberos implementation details.

Then you know what a giant mess all of it is. I have done my share of LDAP/AD integration as well and found that LDAP is a very malleable tool that has been bent in many ways. There are many details to get right. MS use a certain subset of schemata and have added their own special-flavoured behaviour in the background. That means that what the AD admin sees is not exactly what an LDAP query might return. The same goes for Kerberos or any other authentication scheme, e.g. SAML. Getting all of this right from the point of view of a third-party application is not simple. That's why I'm never surprised that pretty much all of them are incomplete or incorrect in this respect.

Oracle offers a similar integration for AD. Do you happen to have any experience with it and if so, how does it compare to MS SQL?


I was never asked to give a hand with Oracle integration, hence I believe it is quite straightforward.

In my experience, third-party tools, e.g. ldapsearch, usually don't implement all AD options; if you use them for an LDAP query with AD features you will receive a legitimate reply, but it will not be human-readable because it was not parsed by the tool. However, the documentation from Microsoft on this front is actually surprisingly extensive, and if anyone felt like it they could implement the correct parsing. It is also quite safe to assume at this point that intentional changes in those areas are extremely unlikely in the future.

Reading and understanding the LDAP and Kerberos implementation is quite a task, and I can fully understand why no one sat to re-implement it all in open source.

Most of the time, even features that are implemented in open source libraries, GSSAPI for example, are not implemented in the software that uses them. And in this respect I find PostgreSQL extremely lacking.

In comparison, Apache lets you use LDAP groups for ACLs; Nginx has an unmaintained plugin stripped from the Apache base code, and it only implements support for LDAP users, not groups. PostgreSQL will only let you authenticate users that match users you have already defined in PostgreSQL.

I'm far from being a Microsoft fan in general, and I suffer whenever I have to use their OS. But AD is a robust piece of work and pretty much the only game in town. I feel that it is not in their focus right now, as it isn't part of their everything-Azure strategy, and that's a shame.


This is very typical in an enterprise shop where you have end users connecting directly to the database for reporting purposes. Being able to take advantage of your existing directory structure and security groups is huge.


Why do end users have direct access to a database? What do they do with it? Copy and paste from their Excel sheets into a GUI database client? Why is there no layer of software in between?


In enterprise setups, you have user bases measured in the tens of thousands.

There will be existing tooling around ACL management (typically AD). User Access Management groups, geo-separated local account manager, multiple help-desks to allow for disaster plans (pandemics, fires, localized grid failures).

Access can be as simple as:

- my application will use a system-account, and we'll manage access

- my application will pass user credentials (based on AD), these are the groups (based on AD) that have read/write/delete access; regional team leaders will request access using existing UAM, regional managers will authorize requests, your existing tooling will process those requests as you are the designated authority over those groups

Changing this requires either exceptional levels of observable returns on effort, or extreme will-power and political power.

During evaluation of MS SQL / PostgreSQL / whatever, fitting into existing UAM will be a requirement.


Because the world and especially the enterprise world is full of smart analysts who aren't programmers but who are pretty hot shots with SQL and Excel. They can answer a LOT of ad-hoc business queries just with those two tools and don't need expensive slow programmers to spend 3 months twiddling JIRA tickets when they could get the same answer in a few hours, when they have direct database access.


Exactly. Why should I have to rewrite a whole ACL infrastructure when I can store it all in one place, and it doesn't matter how you access the database, it's all enforced.


LDAP is amazingly convenient. Why not centralize the user/group structure and use it for db access as well?


Would enabling LDAP auth at the PAM level be too hackish for the purpose?


Not at all, if it were possible. But PostgreSQL only maps LDAP users to users that already exist in PostgreSQL, matching on the user name. And there is no way to map the groups. That's ridiculous.

The hackish implementations sync users from LDAP to PostgreSQL on a schedule. There are more than 3 different implementations of this madness.
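
For context, a typical pg_hba.conf LDAP entry looks something like this (server, network, and base DN are made up). It authenticates the connecting role against LDAP, but the role itself must already exist in PostgreSQL, and group membership is not mapped:

    # TYPE  DATABASE  USER  ADDRESS      METHOD
    host    all       all   10.0.0.0/8   ldap ldapserver=ldap.example.com ldapbasedn="dc=example,dc=com" ldapsearchattribute=uid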


I've done a lot of DBA'ing as a dev/sysadmin/SRE in the past 25 years. While performance matters greatly, there are two features that make me prefer PostgreSQL and that I would rate higher: stability and maintainability.


I have mainly used SQL Server, but some Oracle and Informix earlier. I think all of these have been rock solid. When working as a consultant you rarely hear anyone saying that the DB was unstable.


Heh, I agree with you minus one consulting case: over a span of 10 years, my client had built an unholy disaster of Oracle PL/SQL on top of their not-great schema. While I wouldn’t call it unstable (queries did always return), query performance would vary dramatically even with the same inputs on an isolated install. I spent more time than I should have trying to track it down, but ultimately gave up.

What was my task? Add XML export into a poorly done XML schema, and do it all in PL/SQL; the data that went into the XML was a mixture of normal queries and computed PL/SQL results.

That schema + stored procedure monstrosity sticks out in my mind as the most unholy disaster I've ever worked on.


Not sure what strategy you tried to retrieve the data but I did this kind of stuff years ago with MSSQL even before it had geospatial types and it was fast.

If you have LAT/LON in the table as normal indexed columns, you just query the bounding box of your radius so it can use the standard b-tree indexes, then post-filter to a circle, cutting off the corners. Of course you have to do some work to account for the curvature of the earth, but this is pretty standard geospatial stuff.

If you have a DB with geospatial types and indexes like MSSQL 2008+, Oracle, PG etc then this becomes trivial as they can do this directly.


This is quite a specific thing though.

If you want to do spatial things in the database, then PostGIS for PostgreSQL or Spatialite for SQLite are definitely your best options.

For general performance, PostgreSQL is consistently excellent, but it's not the best in all cases.


That might be selection bias: those who look for you are mostly concerned about performance. And performance is usually not "solved" -- it's often a treadmill where hardware and software performance is trying to outrun the data growth -- so people who need performance are constantly coming back for more (usually with a pile of cash).

A lot of users care mostly about accessibility, simplicity, integration with existing experience/tools (e.g. JSON), manageability, ecosystem, etc. Those users might be happy from the start; or if they aren't happy, they complain for a while until their problem is solved, and then go silent.


I think that clients also care about data consistency and safety too, they just don't know that they care about it until they don't have it.


MySQL can do this just fine with a geospatial index or, for that matter, with a geohash index[1] (though in the case of a geohash index, you will need a somewhat more complex query). In fact, any database at all that can index on a string (i.e. all of them) can do efficient location-based search.

Honestly, if it's taking your database more than a second to pull 1000 of 100M rows on a simple query, that means you need to figure out what's wrong with your indexes, not that your choice of database vendor is bad. It is true that postgres is a very nice database, but for basic tasks, honestly any database will be fine if you know how to use it effectively.

Demo in mysql:

    -- create a helper table for inserting our 100M points
    create table nums (
      id bigint(20) unsigned not null
    );
    insert into nums (id)
    values (0), (1), (2), (3), (4), (5), (6), (7), (8), (9);

    -- the table in which we will store our 100M points
    create table points (
      id bigint(20) unsigned NOT NULL AUTO_INCREMENT,
      latitude float NOT NULL,
      longitude float NOT NULL,
      latlng POINT NOT NULL,
      primary key (id),
      spatial key points_latlng_index (latlng)
    );

    -- insert the 100M points with latitude between 32 and 49
    -- and longitude between -120 and -75, which is roughly a
    -- rectangle covering the continental US, placing them
    -- randomly. Be patient, as it takes a while to create
    -- 100M rows.
    insert into points (latitude, longitude, latlng)
    select
      ll.latitude,
      ll.longitude,
      ST_GeomFromText(concat('POINT(',
        ll.latitude, ' ', ll.longitude,
      ')')) as latlng
    from (
      select
        32 + 17 * conv(left(sha1(concat('latitude', ns.n)), 8), 16, 10) / pow(2, 32) as latitude,
        -120 + 45 * conv(left(sha1(concat('longitude', ns.n)), 8), 16, 10) / pow(2, 32) as longitude
      from (
        select
          a.id * pow(10, 0) + b.id * pow(10, 1)
            + c.id * pow(10, 3) + d.id * pow(10, 4)
            + e.id * pow(10, 5) + f.id * pow(10, 6)
            + g.id * pow(10, 7) + h.id * pow(10, 8)
            as n
        from nums a join nums b join nums c join nums d
        join nums e join nums f
        join nums g join nums h
      ) as ns
    ) as ll;

    -- Generate a roughly circular boundary polygon 7km around
    -- the center of Las Vegas
    set @earth_radius_meters := 6371000,
        @lv_lat := 36.17,
        @lv_lng := -115.14,
        @search_radius_meters := 7000;
    set @boundary_polygon := (select
      ST_GeomFromText(concat('POLYGON((', group_concat(
        concat(boundary_lat, ' ', boundary_lng)
        order by id
        separator ', '), '))')
      ) as boundary_geom
    from (
      select
        nums.id,
        @lv_lat + @search_radius_meters * cos(nums.id * 2 * pi() / 9)
          / (@earth_radius_meters * 2 * pi() / 360) as boundary_lat,
        @lv_lng + @search_radius_meters * sin(nums.id * 2 * pi() / 9)
          / (@earth_radius_meters * 2 * pi() / 360) / cos(@lv_lat * 2 * pi() / 360) as boundary_lng
      from nums
    ) t);


    explain select count(*) from points where ST_Contains(@boundary_polygon, latlng)\G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: points
       partitions: NULL
             type: range
    possible_keys: points_latlng_index
              key: points_latlng_index
          key_len: 34
              ref: NULL
             rows: 987
         filtered: 100.00
            Extra: Using where
    1 row in set, 1 warning (0.01 sec)

    mysql> select count(*) from points where ST_Contains(@boundary_polygon, latlng)\G
    *************************** 1. row ***************************
    count(*): 1228
    1 row in set (0.01 sec)


We were on Oracle for 15 years, but the cost was just too high. We decided to move to Postgres. We thought it wouldn’t be as good as Oracle (you get what you pay for, right?), but it ended up being better. Performance is better. Documentation is better (by far). Adherence to SQL standard is better. Many SQL queries are simpler. Null/empty string is sane. Etc.

Now I don’t have experience with MySQL, SQLServer, etc., but between Oracle and Postgres, Postgres is definitely the best database.


My last employer was a large Oracle place, although everything was standard edition. I am so much happier now.

Everything that used to get a 'that is an enterprise edition feature' response is now baked in.

concurrent index creation is amazing.
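
For anyone who hasn't used it, it's a one-keyword change (table and column here are hypothetical) and the index is built without blocking writes to the table:

    -- Builds the index without taking a long exclusive lock;
    -- note it cannot run inside a transaction block.
    CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);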

The simplicity and speed of pgBarman vs RMAN is mind boggling. RMAN in Oracle Standard Edition can only do full backups.

Full read-only access to a standby? That is huge! (enterprise only for the big O)

Realtime replication (not log shipping), also enterprise only for the big O.

I'm a sysadmin, so that is what I notice right away, but our developers love it too, for many other reasons.


Oh, God. For everybody out there afraid of moving from Oracle to Postgres, I have one fundamental piece of information:

Postgres backups work.

You can simply back up your database. And you can simply restore your backup. And the result is a working database, without errors.


I think a big portion of this isn't a difference in quality, but rather complexity. Postgres is just easier to understand than Oracle. Oracle has insane amounts of settings and configurable parts. The end result is that people who admin Postgres servers tend to learn the levers, but many Oracle DBAs simply don't, or have environments where hand tuning is really hard to justify time-wise. I think part of Postgres' success is that it's often in environments where DBAs and devs work closely or are the same person.


I use PostgreSQL and MS SQL Server. I love Postgres. There are some things that SQL Server does differently that I would love to see supported in Postgres:

* Protocol: (a) no wire-level named parameter support; everything must be by index. (b) Binary vs text is not great, and the binary protocol details are mostly "see source". (c) No support for inline cancellation: to cancel a query, the client can't signal on the current TCP connection; it must open or use a different connection to kill the existing query. This is sometimes okay, but if you are behind a load balancer or pooler this can end up being a huge pain.

* Multiple Result Sets from a single query without hacks. While you can union together results with similar columns or use cursors, these are hacks and work only in specific cases. If I have two selects, return two result sets. Please.

* Protocol-level choice of language. Yes, you can program in any language, but submitting an arbitrary query in any language is a pain and requires creating an anonymous function, which is a further pain if your ad-hoc query has parameters. I would love a protocol-level flag that allows you to specify "SQL", "PL/pgSQL", "cool-lang-1" and have it execute the start of the text as that.

I do love that Postgres recently added procedures! Yay!


Good list.

Also true index-organized tables, aka real clustered indexes.

Oh, and real cross-connection query plan caching; prepared statements are only per-connection and must be explicitly used. There has been no need to use prepared statements in MSSQL since the 90's.


Another thing I'd add is a quasi-sorted uuid generator built-in. Real clustered indexes need approximately sorted UUIDs - which could be version 1 UUIDs with an artificial, per-database MAC address, or they could be something more esoteric such as Twitter's snowflake ID generator.

Using UUIDs for PKs is fine and dandy, but clustered indexes and type 4 UUIDs do not play well. Many MS SQL users discover this the hard way when their toy database suddenly has orders of magnitude more rows.


That's what NEWSEQUENTIALID() is for in MS SQL.
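
For reference, a T-SQL sketch (table name hypothetical): NEWSEQUENTIALID() is only allowed as a column default, which keeps inserts roughly append-only in the clustered index.

    CREATE TABLE events (
        id      uniqueidentifier NOT NULL
                CONSTRAINT df_events_id DEFAULT NEWSEQUENTIALID(),
        payload nvarchar(max) NULL,
        CONSTRAINT pk_events PRIMARY KEY CLUSTERED (id)
    );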


I'd love to see the MERGE statement from MSSQL in other databases. Wonderful tool for integrations.
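
Roughly, for anyone who hasn't seen it (T-SQL, with hypothetical table names):

    -- Synchronize dim_customer with a staging table in one statement:
    -- update matches, insert new rows, delete rows that disappeared.
    MERGE INTO dim_customer AS t
    USING staging_customer AS s
        ON t.customer_id = s.customer_id
    WHEN MATCHED THEN
        UPDATE SET name = s.name, email = s.email
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (customer_id, name, email) VALUES (s.customer_id, s.name, s.email)
    WHEN NOT MATCHED BY SOURCE THEN
        DELETE;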


Postgres has UPSERT (INSERT INTO ... ON CONFLICT DO ... )

I wish MSSQL had this.
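
For reference, a minimal sketch (hypothetical table, assuming a unique constraint on sku):

    -- Insert a row, or bump the quantity if the SKU already exists.
    INSERT INTO inventory (sku, qty)
    VALUES ('ABC-123', 5)
    ON CONFLICT (sku)
    DO UPDATE SET qty = inventory.qty + EXCLUDED.qty;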


But this doesn't deal with DELETEs of a true MERGE statement


No, it's not a replacement. It's just a very useful feature that I wish all DBs had.


I've had to personally avoid MERGE statements due to implementation problems in MSSQLServer. One system just updated to 2019, so maybe it is better now. I also have to lookup the syntax each time, but that might be just me.


In some projects the MERGE helped me a lot and in my previous job, all of our integration scripts were merge based. You just learn to avoid the pitfalls, but you're right that it does have some: https://www.mssqltips.com/sqlservertip/3074/use-caution-with...

I'm sure if postgres had it, it wouldn't have had most of those problems.


The MERGE statement always comes with a bunch of caveats from MSSQL experts; has that changed recently?


It does have a bunch, but all of them are implementation problems, AFAICT. The idea is great though.


That's an interesting list. Can you give more context about how you would like to use these features?


Sure. I have two roles: database driver/client maintainer and database based application developer and maintainer.

As a database Driver/Client developer:

* Having two protocols/data types that do the same thing (text / binary) isn't the end of the world, but it just adds confusion. It also adds complexity for any alternative server/proxy/pool implementation. Recommendation: better document binary and add an intent to remove text data types.

* Not having inline cancellation (in the same TCP/IP connection) means cancellation isn't supported by many drivers, and even when it is, there are many edge cases where it stops working. Each client implementation has to work around this.

As an application developer, I typically see three patterns emerge:

1. Put SQL on the SQL platform: stored functions/procs. The application only calls stored functions/procs.

2. Put SQL on the application platform: simple stored SQL text, query builders.

3. Make the application dumb: use simple CRUD queries and put the logic fully in the application.

I typically find myself in category (2), though I have no beef with (1). Benefits of (2) include: (a) easier to create dynamic search criteria and (b) use a single source of schema truth for application and database.

* Multiple Result Sets:

- (A) The screens I make often have a dynamic column set. To accomplish this I may return three result sets: row list, column list, and field list. This works for my screens and printed XLSX sheets that can auto-pivot the data, while all the data transfer columns can be statically known and pre-declared. This allows me to edit a pivoted table because each cell knows its origin row.

- (B) Any reasonable amount of work in SQL may be long-ish (300-1000+ lines of SQL). There are often many intermediate steps and temp tables. Without multiple result sets, it is difficult to efficiently return the data when there are often multiple arities and column sets that are all relevant for analysis or work. So if I take a fully application-centric view (more like application mindset (3)), you just call the database a lot. But if you are more database-server-centric ((1) or (2)), this can pose a real problem as the complexity of the business problem increases. (I'm aware of escape hatches, but I'm talking about straightforward development without resorting to alternative coding.)

- (C) Named parameters are extremely useful for building up query parameters in the application (development model (2)). You can specify a query where-clause snippet and the parameter name and have the system compose it for you, which is impossible with ordinal queries. A driver or shim can indeed use text replacement, but that (a) adds implementation cost, (b) server computation / allocation cost, and (c) mental overhead when you trace the query on the server. Furthermore, it composes poorly with stored functions and procs on the server (in my opinion). It is again not insurmountable, but it is another thing that adds friction. Lastly, when you have a query that takes over 30 or 50 parameters, you must use named parameters; ordinal positioning is too error prone at scale.

- (D) Protocol-level language selection. PostgreSQL always starts execution in a plain SQL context. If you always execute functions as PL/pgSQL it is just extra overhead. In addition, running ad-hoc PL/pgSQL with named parameters isn't the easiest thing. It is possible, just not easy. This feature plus named parameters would mean that by the time I'm writing my query, I know (a) that I have all my sent parameters available to me, bound to names, and (b) that the first character I type is in the language I want.

The combination of these features would make applications developed in model (2) go from rather hard to extremely easy. It would also make other development modes easier I would contend as well.


> Recommendation: better document binary and add an intent to remove text data types.

I would note that it's required in Postgres for custom types to implement in/out (text encoding), but not required to implement send/recv (binary encoding.) In case send/recv isn't implemented, the datum falls back to using in/out even over the binary protocol.

Given the number of third-party extensions, and the ease of creating your own custom types even outside an extension, I'd expect that deprecating the text protocol will never happen. It's, ultimately, the canonical wire-format for "data portability" in Postgres; during major-version pg_upgrades, even values in system-internal types like pg_lsn get text-encoded to be passed across.

Meanwhile, binary wire-encoding (as opposed to internal binary encoding within a datum) is just a performance feature. That's why it's not entirely specified. They want to be able to change the binary wire-encoding of types between major versions to get more performance, if they can. (Imagine e.g. numeric changing from a radix-10000 to a radix-255 binary wire-encoding in PG13. No reason it couldn't.)


> * Having two protocols/data types that do the same thing (text / binary) isn't the end of the world, but it just adds confusion. It also adds complexity for any alternative server/proxy/pool implementation. Recommendation: better document binary and add an intent to remove text data types.

I don't see us removing the textual transport, unfortunately. The cost of forcing all clients to deal with marshalling into the binary format seems prohibitive to me.

What's the server/proxy/pool concern? I don't see a problem there.

> Not having inline cancellation (in the same TCP/IP connection) means cancellation isn't supported by many drivers, and even when it is, there are many edge cases where it stops working. Each client implementation has to work around this.

Yea, it really isn't great. But it's far from clear how to do it inline in a robust manner. The client just sending the cancellation inline in the normal connection would basically mean the server-side would always have to eagerly read all the pending data from the client (and presumably spill to disk).

TCP urgent or such can address that to some degree - but not all that well.

> - (C) Named parameters

I'm a bit hesitant on that one, depending on what the precise proposal is.

Having to textually match query parameters for a prepared statement for each execution isn't great. Overhead should be added per-prepare, not per-execute.

If the proposal is that the client specifies, at prepare time, to send named parameters in a certain order at execution time, I'd not have a problem with it (whether useful enough to justify a change in protocol is a different question).

> A driver or shim can indeed use text replacement ... b) server computation / allocation cost

How come?

> - (D) Protocol-level language selection. PostgreSQL always starts execution in a plain SQL context. If you always execute functions as PL/pgSQL it is just extra overhead. In addition, running ad-hoc PL/pgSQL with named parameters isn't the easiest thing. It is possible, just not easy. This feature plus named parameters would mean that by the time I'm writing my query, I know (a) that I have all my sent parameters available to me, bound to names, and (b) that the first character I type is in the language I want.

I can't see this happening. For one, I have a hard time believing that the language dispatch is any sort of meaningful overhead (unless you mean for the human, while interactively typing?). But also, making connections have state where incoming data will be completely differently interpreted is a no-go imo. Makes error handling a lot more complicated, for not a whole lot of benefit.


> Yea, it really isn't great. But it's far from clear how to do it inline in a robust manner.

How would you feel about Postgres listening over QUIC instead of/in addition to TCP?

It seems to me that having multiple independently-advancing "flows" per socket, would fix both this problem, and enable clients to hold open fewer sockets generally (as they could keep their entire connection pool as connected flows on one QUIC socket.)

You'd need to do something fancy to route messages to backends in such a case, but not too fancy—it'd look like a one-deeper hierarchy of fork(2)s, where the forked socket acceptor becomes a mini-postmaster with backends for each of that socket's flows, not just spawning but also proxying messages to them.

As a bonus benefit, a QUIC connection could also async-push errors/notices spawned "during" a long-running command (e.g. a COPY) as their own new child-flows, tagged with the parent flow ID they originated from. Same for messages from LISTEN.


> How would you feel about Postgres listening over QUIC instead of/in addition to TCP?

At some point? Probably not a bad idea. Right now? Doubtful it's worth it - the stack doesn't seem ready yet.

> and enable clients to hold open fewer sockets generally (as they could keep their entire connection pool as connected flows on one QUIC socket.)

I don't think it'd be a meaningful step towards that.

We basically have to move to a threaded model to significantly improve the situation.

> You'd need to do something fancy to route messages to backends in such a case, but not too fancy—it'd look like a one-deeper hierarchy of fork(2)s, where the forked socket acceptor becomes a mini-postmaster with backends for each of that socket's flows, not just spawning but also proxying messages to them.

Not gonna happen. The added context switches would be a very significant increase in overhead. The added processes themselves aren't free either. Postmaster not being the immediate parent would complicate the state handling a good bit.

EDIT: expand #2


"better document binary and add an intent to remove text data types"

For all-binary, the client would have to know about all data types and how to represent them in the host language. But that seems clunky. Consider NUMERIC vs. float vs. int4 vs int8: should the client really know how to parse all of those from binary? It makes more sense to optimize a few common data types to be transferred as binary, and the rest would go through text. That also works better with the extensible type system, where the client driver will never know about all data types the user might want to use. And it works better for things like psql, which need a textual representation.

The main problem with binary is that the "optimize a few columns as binary" can't be done entirely in the driver. The driver knows which types it can parse, but it doesn't know what types a given query will return. The application programmer may know what types the query will return, in which case they can specify to return them in binary if they know which ones are supported by the client driver, but that's ugly (and in libpq, it only supports all-binary or all-text). Postgres could know the data types, but that means that the driver would need to first prepare the query (which is sometimes a good idea anyway, but other times the round trip isn't worth it).

"Not having an inline cancellation (in same TCP/IP connection)"

This is related to another problem, which is that while a query is executing it does not bother to touch the socket at all. That means that the client can disconnect and the query can keep running for a while, which is obviously useless. I tried fixing this at one point but there were a couple problems and I didn't follow through. Detecting client disconnect probably should be done though.

Supporting cancellation would be trickier than just looking for a client disconnect, because there's potentially a lot of data on the socket (pipelined queries), so it would need to read all the messages coming in looking for a cancellation, and would need to save it all somewhere in case there is no cancellation. I think this would screw up query pipelining, because new queries could be coming in faster than they are being executed, and that would lead to continuous memory growth (from all the saved socket data).

So it looks like out-of-band is the only way cancellation will really work, unless I'm missing something.

"Multiple Result Sets... you just call the database alot"

Do pipelined queries help at all here?

"Named parameters are extremely useful for building up query parameters"

+1. No argument there.

"Protocol level language selection"

I'm still trying to wrap my head around this idea. I think I understand what you are saying and it sounds cool. There are some weird implications I'm sure, but it sounds like it's worth exploring.


> The driver knows which types it can parse, but it doesn't know what types a given query will return. [...] Postgres could know the data types, but that means that the driver would need to first prepare the query (which is sometimes a good idea anyway, but other times the round trip isn't worth it).

We return the types of the result set separately even when not preparing. The harder part is doing it without adding roundtrips.

I've argued before that we should allow the client to specify which types it wants as binary, unless explicitly specified. IMO that's the only way to solve this incrementally from where we currently are.

> Do pipelined queries help at all here?

We really need to get the libpq support for pipelining merged :(


"I've argued before that we should allow the client to specify which types it wants as binary, unless explicitly specified. IMO that's the only way to solve this incrementally from where we currently are."

I don't understand the idea, can you clarify?


We should allow the client, on the protocol level, to add (and presumably remove) types to a server side list of types that can automatically be sent in binary if the result format is specified as 'auto'.

That way a driver can add handling for ~10 builtin types (bytea, int[48], float[48], timestamp, ...) that matter from a size/efficiency POV, without needing to handle all the other types. That'd provide a large fraction of the potential benefit of handling binary for all types.

The reason for the server side list is that otherwise I don't see how you could do it without an added roundtrip.


My first job in IT, I was 18, it was the early 2000s. I was a nerd but in professional IT I was essentially a blank slate.

I lucked up and ended up with a real hacker for a boss. As early as 2001-2002 he had saved entire businesses by migrating them from mysql to postgres.

He was a BSD guy and a postgres guy. He made me into a fanboy of both those technologies.

So out of sheer luck I've preferred Postgres for over 15 years, and it has never disappointed me.

Unlike BSD, which often disappointed me once I learned how much easier and mature Linux was to use.


In my experience, when people are able to "fix" performance issues by moving from SQL database A to SQL database B, it's because they understand database B and how to optimise for it, not because the database is better or worse. My company was "saved" by moving from Postgres to MySQL. The difference was the guy and his specific knowledge, not the database technology.


> As early as 2001-2002... from mysql to postgres.

MySQL was pretty bad in its first versions. I'm in the enterprise/ERP development world, and when MySQL started to show up on the radar and I looked at it, I couldn't believe what a joke it was.

MySQL got good around version 5? Now I can say all major DBs are good enough no matter what you choose (I still think PG is the best around), but MySQL was always the riskier proposition, the PHP of the DB engines...


Yeah I agree. In practice now that I've accumulated my own experience I feel that Galera clustering is much better than anything I used in Postgres.

So far I've only set up one master/slave cluster with pgpool-II in Postgres and it's definitely an OK setup. But I much prefer the usability and stability of Galera/MariaDB for clustering.


And now you know better how to structure your data, because you have done plenty of migrations and changes but couldn't change everything you wanted to, since the whole data access logic would have needed to be rewritten.


> Unlike BSD, which often disappointed me once I learned how much easier and mature Linux was to use.

I was with you right until that last sentence. I'm not going to offer a counterargument because your statement is extremely generalised and ripe for flamewars, but I will say it's not as clear cut as you stated.


> Unlike BSD, which often disappointed me once I learned how much easier and mature Linux was to use.

With the exception of package management (at least with Debian), I've found the opposite to be true. Its configuration has always been a bit of a mess, especially with regards to networking. Lately systemd has made things only more confusing. And that's before dealing with the myriad ways different distributions do things (which thankfully now is really only 2 different bases, Debian or Red Hat).

While linux "won" in the end and I use it professionally now, I'm glad I rarely have to interact with it directly anymore thanks to ansible or other orchestration tools.


I discovered PostgreSQL in a similar happenstance sort of way - working at a large telecom directly out of college in the late 90s, tasked with setting up a MySQL backed webapp to manage some internal processes. Fortunately, I was unable to compile MySQL on our HPUX workstations (issue with threads IIRC), PostgreSQL to the rescue. Been using it ever since.


> As early as 2001-2002 he had saved entire businesses by migrating them from mysql to postgres.

I'm extremely curious how this worked out.


Well I was 18-19 so to me these were myths I heard re-told.

But all I know is that they had been throwing hardware on a MySQL install to make it work better.

He migrated them to postgres and they got much better performance and could get away with less hardware than they had with mysql.

That was as much detail that I remember. Keep in mind I said by sheer luck I became a fanboy. Not by experience and competence. That came later.


This isn't hard to believe. MySQL basically started out as a key-value store pretending to be a relational database. If you wanted to do KV stuff and didn't care too much about data corruption (or knew the five zillion magic incantations to mostly avoid it), it was fast. But if you wanted joins or expected the query optimizer to do anything for you, you were SOL.


The dirty secret of database engine development is that all the things that are hardest to do in the database proper are super-easy to do in the application, given a few judicious architectural choices; but despite that, nobody can count on app writers to do them right. So, DB engine developers have to bust their humps to cover for users' incompetence.

A database for competent users can be amazingly simple and fast. But the Market Has Spoken, loud and clear: databases must absorb an indefinitely large amount of complexity to make the job of app writers a tiny bit easier.


I'd be curious to see some examples of what you mean!

I think there are definitely cases where the DB needs to compensate for bad architecture, but then even competent developers can profit from the flexibility of being able to add new features ad hoc to an existing DB schema/architecture, with the DB just magically bailing them out.


Interesting username, that's for sure!


I think that PostgreSQL itself is great. However, the developer client tools (pgAdmin) leave something to be desired. The old pgAdmin3 was fine, if lacking bells and whistles. The new pgAdmin4 however, was not very good last time I tried it [1]. It was implemented as a client/server web application and had frequent issues.

Can anyone recommend a good client for PostgreSQL?

[1] - I see that there have been some new releases in 2020 so I ought to check on them. The version I tried earlier was 4.11.


DBeaver: https://dbeaver.io/

Native cross-platform and works across dozens of databases with lots of features.

Another option is Jetbrains DataGrip: https://www.jetbrains.com/datagrip/


IMO Java is not native anywhere.

I prefer to use HeidiSQL running on Wine.


It's more native than an Electron app.


Thank you for this - my team and I have been looking for a replacement for pgAdmin for quite some time now and this fits the bill exactly. Two features I thought we were going to have to give up when moving from MySQL to PG were in-app ERD management and the ease of use of MySQL Workbench.


Dbeaver is the answer. It is good enough that I no longer miss pgadmin 3.

I've inadvertently become the pgadmin 3 "LTS" maintainer. A release of pgadmin 3 that was altered to support 10.x was previously provided by BigSQL. I forked it on GitHub to add TCP keepalive on client connections. At some point after that BigSQL removed the original repo. Apparently I was the only one that had forked it prior to removal, so now the few vestigial users in the world still using it are forking my fork. One has even patched it to work with 11.x, although they've offered no pull requests.


so do you consider your pgadmin fork maintained or not? where is it?


No. I'm making no effort to maintain it. If a pull request appeared I'd probably accept it. https://github.com/allentc/pgadmin3-lts

Amusingly it appears the previous maintainer, BigSQL, has since forked my repo.


I tried DataGrip many times but kept coming back to DBeaver. It's just easier IMHO.


DataGrip is a bit intimidating at first, but much more powerful.


What features does it have over DBeaver?


First thing that comes to mind is code completion, which is unmatched in any other tool in the space.


If you're on a mac, then Postico is pretty good (https://eggerapps.at/postico/).

Otherwise, TablePlus is cross-platform (and supports multiple DBs) https://tableplus.com/

Agreed that PgAdmin is awful.


I use Postico and agree it's pretty great overall. There's some odd behavior in the SQL editor that sometimes bites me when I'm pasting from my code editor (VSCode), where whitespace sometimes picks up some kind of invisible character or something that makes the query invalid. I have no idea how to reproduce it but it can really break my flow when I hit it. Other than that I've been pretty pleased with it.

If I didn't already have a Postico license I'd definitely be looking at TablePlus.


TablePlus is awesome on MacOS, but not so much on Windows.

I use TablePlus for quickly looking stuff up when I'm working on my MacBook, but I use DataGrip when I'm working on Linux.


I was on the fence about giving TablePlus a go. Why isn't it awesome on Windows? I'm very interested.


I've been using JetBrain's DataGrip. So it's not free, but if I weren't already grandfathered into a good JetBrains package, I'd probably be willing to subscribe to it à la carte and pay for it out of pocket. It's the only JB product I still use consistently after switching from WebStorm to vscode.



I use DBeaver: https://dbeaver.io

It's not specific to PostgreSQL, it's universal.


Azure Data Studio (don't be put off by Azure, it runs locally) has a decent extension for PostgreSQL.

https://github.com/microsoft/azuredatastudio#azure-data-stud...


HeidiSQL: https://www.heidisql.com/

Bonus — it also works for MySQL/MariaDB, MS SQL, and sqlite.


I use http://valentina-db.com, it also has a free version and supports other DBs, like SQLite and SQL Server.


I tried it for some weeks.

It crashes constantly. Went back to HeidiSQL.



If you're willing to pay, it's worth checking out Aqua Data Studio (https://www.aquafold.com/aquadatastudio), otherwise DataGrip is also fantastic.


I wonder if there is any client that has a halfway decent visualization of RLS policies. Last time I checked, pgAdmin didn't, and Postico, which I use day-to-day, doesn't (at all).


https://tableplus.com/

Supports many databases


I prefer using Adminer or Azure Data Studio (with pg extension)


pgAdmin4 is not that bad, I'm using it every day


I have tried to give pgAdmin4 a chance three times, and every time it was an exercise in frustration. It is unstable and awful from the UX perspective.

Since then I'm sticking with psql for administration and with DataGrip for more involved DB development, and that works very well.


Have you tried anything else? It's really not great.


This is advertising of course. But if I had to select an SQL DB postgres is my only choice.

Perhaps I don't know enough about databases and their differences.

Anyone have some pros and cons of others?

Like why would I pick MySQL, Microsoft, Oracle, Maria etc over Postgres?

Apart from support that you gotta pay for.


Having worked with dozens of different databases over lots of years, I can certainly say that Postgres is my go-to _default_ database over all the other relational ones.

In the same way that Golang claims it's the 90% language ( https://talks.golang.org/2014/gocon-tokyo.slide#1 ), I'd say Postgres is the perfect 90% database.

However, we're still using at least four other databases in production, and the reason is that because it does something of everything, there are lots of niches that it doesn't fill well. It's all about tradeoffs.

If you've got a Full Text Search problem, PG is "good enough", that is until you need to index more challenging languages like Korean or need more exotic scoring functions. For that, Elastic is better.
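For the "good enough" cases the built-in FTS is genuinely pleasant, though. A minimal sketch, assuming a hypothetical posts(title, body) table:

  -- expression GIN index so the tsvector match below is indexable
  CREATE INDEX posts_fts_idx ON posts USING gin (to_tsvector('english', body));

  SELECT title
  FROM posts
  WHERE to_tsvector('english', body) @@ plainto_tsquery('english', 'replication lag');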

You can do small scale analytics on PG, but at larger sizes, the transactional setup gets in the way and you're better off with BigQuery/Redshift/Snowflake.

You can scale vertically pretty big these days, but if you need linear scalability while still guaranteeing single-digit-ms access, you're better off with ScyllaDB.

However, there isn't a single project I wouldn't start on Postgres these days, that's how much I love it.


I wouldn't say PostgreSQL is the analogue of Go: Go's initial popular growth came from being 'good enough' while being arguably oversimplified, where simplicity masks problems under the pretence that YAGNI ('You Aren't Gonna Need It') always applies, which seems much closer to MySQL's history.

PG was driven by engineering correctness, by considering what DBAs 'Are Going to Need'. Sometimes that strictness worked against popular growth, but in the end it has worked out well, as many programmers figured out they also needed it. I'd say the programming language analogue would likely be Rust; let's see in 10 years where the 90% case lies.


Rust is too complicated for 90% of cases. 90% of cases are simpler with GC.


I'd argue that what you're saying is true, in its current state. High-level Rust libraries take longer to write, and so they don't exist yet. As soon as Rust gets more powerful tools that abstract away the work, I suppose it might alleviate a large part of the churn.


Rust is more complicated than some languages with GC.

However, if you are writing some code with multi-threading (who isn't these days), then you may eventually want the help Rust gives you there as far as correctness goes.


Multithreading is only needed in very few fields. Multiprocessing is way easier and works fine for most. And the majority of programmers still only need to write serial code.


And if not serial.. embarrassingly parallel


Could you expand on the issues with PostgreSQL full text search for Korean?

I'd be interested to know about the limits of the full text feature.


I'm interested in why you would choose Scylla over Cassandra. Does it have advantages/differences besides performance? I remember Scylla being approximately a C++ rewrite of Cassandra, but lost track of the community opinion of them.


ScyllaDB is far faster, more stable, and requires less config, maintenance and tuning. There's no reason to use standard Cassandra unless you need it for compatibility.


What issues are you having with Korean indexing? I work with Japanese data in PostgreSQL all the time and have never noticed any issues. Japanese is more complicated than Korean, in my opinion.


SQLite is also extremely useful, but it fills a different niche. (One of SQLite's slogans is "SQLite does not compete with client/server databases. SQLite competes with fopen().")


> SQLite competes with fopen()

... and wins, which surprised me when I found out.


Replication, failover and high availability are complicated matters. I believe there are other SQL databases that may be said to handle this better than PostgreSQL.

At the very least, CockroachDB is much simpler to set up and manage. Cannot vouch for its stability and complexity of abnormal emergency recoveries, though - haven't used it long enough, only had simplest outages (that it had handled flawlessly).


I've been following YugabyteDB ( https://www.yugabyte.com/ ) with interest for similar reasons. It's built on PostgreSQL so it supports more features than CockroachDB. They just forked PG on top of their clustered document store. And the newest version (2.1) is production ready (their words).

Just haven't had the reason to dive in yet, but it seems good. Their PR department just isn't as good as CockroachDB's.


>Replication, failover and high availability are complicated matters.

I wonder why Postgres hasn't improved in these areas and instead leaves them to third-party solutions. I mean, every year these few points are still listed as something that favours MySQL.


It might be partly because Postgres is largely developed by the providers of "3rd-party" solutions for features like this. You could see it as an open-core model of sorts.


Why are you saying they have not improved? That is flat out false - postgres provides a lot better tools for this now than 10 years ago. They still leave a lot of the plumbing to third parties, but they are now at least giving those third parties a lot better tools than they used to.


I think a more accurate way to put it is: why isn't there any default solution? But I guess the open-core model another comment pointed out makes sense.


Microsoft SQL Server is an obvious default choice if you're an all-Microsoft shop; it has all the Azure and Visual Studio integrations, etc. and comes with your MSDN.

Oracle is the choice of organisations with money to burn who don't believe that free products can be good. An Oracle installation nearly always comes with some consultants to write your application for you, and define your application for you, and write your invoices for you ...


I think the only reason I wouldn't choose Postgresql is if I need master-master replication.

It's not that you can't create a master-master setup using PostgreSQL, I believe 2ndQuadrant have an add-on that allows this. It just feels like it's messing with the fundamentals of the database at such a low level that I really only trust it if it's part of PostgreSQL itself.


Logical replication has been in PostgreSQL itself for a while now.

There are naturally caveats: sequences don't get replicated, so you need to configure each replica with a non-overlapping range for each sequence. And DDL statements are also not replicated, so you have to migrate database schemas by hand on each replica.
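A minimal sketch of the built-in pub/sub setup (table name and connection string are made up), with the caveats above still applying:

  -- on the publisher
  CREATE TABLE orders (id bigint PRIMARY KEY, total numeric);
  CREATE PUBLICATION orders_pub FOR TABLE orders;

  -- on the subscriber: DDL is not replicated, so create the same table by hand
  CREATE TABLE orders (id bigint PRIMARY KEY, total numeric);
  CREATE SUBSCRIPTION orders_sub
    CONNECTION 'host=primary dbname=shop user=repl'
    PUBLICATION orders_pub;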


Logical changeset extraction does not in itself provide a replication feature, from what I can see of the wiki [1]. 2ndQuadrant sells BDR which is built on top of it.

[1] https://wiki.postgresql.org/wiki/Replication,_Clustering,_an...


> Logical changeset extraction itself does not in itself provide a replication feature

It is however a building block for logical replication, and used by https://www.postgresql.org/docs/current/logical-replication....

> from what I can see of the wiki [1

I really wish we'd just shut off the wiki. There are random pages that someone starts maintaining and then abandons at some point. That leads to completely outdated content, as in this case.


One of Postgres's biggest pros (over other commonly used dbs in web dev) is Row Level Security. Controlling what data a user can access at the database rather than the application or as part of a query makes the application logic a lot simpler. That's a big win.
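To give a flavour, a minimal sketch assuming a hypothetical documents table and one database role per end user:

  CREATE TABLE documents (id bigint PRIMARY KEY, owner text NOT NULL, body text);
  ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

  -- every query on documents is silently filtered to the caller's own rows
  CREATE POLICY owner_only ON documents
    USING (owner = current_user);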


But that also means that:

- the code is not reusable outside of a database setting. So not cacheable.

- the code is not reusable across different storage layers. So not portable.

- the code may need updating if the schema changes, you can't abstract that away

- changing the logic means a db migration

- testing the code requires a DB

- tooling support to check that code is limited to SQL tooling, which is very weak, especially for code completion, refactoring and debugging.

That's a lot of constraints for just making the application logic a lot simpler.


the code is not reusable across different storage layers. So not portable.

In my 23-and-a-bit years of web development I've literally never changed the database engine on a project. Maybe that happens on other people's projects, but it's not something I consider important or even useful really. The notion that you can swap out your database for a different one without changing the application code to take advantage of the db you're moving to is ludicrous. Of course your database code isn't portable.

Websites using MySQL and MyISAM tables with raw SQL statements written as strings in PHP for the first decade of the web are one of the reasons why so many web developers still don't use things like transactions, stored procedures, views, etc. That's a bit of a tragedy. The web would be much better today if everyone had been using Postgres's features from the beginning.


If you write a project, you probably won't change the database. But you may add new data layer services such as a cache backend, a search engine, a secret server, a session store, a permission microservice, etc. They will all interact with permissions and change where you source the data from.

Then there is the case where you don't write a project, but a lib. Or the case where you extract such a lib from a project. In that case, you limit the use case of your lib to your database. Because if you don't change database during your project life time, the users of your lib may start a new project with a different database.

This is why the Django ORM allowed such a vibrant ecosystem: not just because it allows changing the database of one project (although that's nice to have and I've used it several times), but because it allows so-called "Django pluggable apps" to be database independent.


I have never seen anybody migrate out of PostgreSQL, but I HAVE seen people migrate out of Oracle and DB2, because they are so expensive it's worth the cost to migrate.


  I've literally never changed the database engine on a project
And some of us do it several times a day because we deploy to prod with Postgres but run testing and CI with SQLite.


Me too, until I found out how fast Postgres starts in Docker, and that for different languages there are now libraries helping you write tests using Docker containers.


Well, you're doing it wrong. It's easy and fast to run tests against real postgres (takes a fraction of a second to spin up the DB for your tests if you do it right).


I agree, but I also see the benefits from the other side, so I’ll play devil’s advocate:

- it is cached, in the database’s memory, where the cache can be invalidated automatically. It is better to cache views than data anyway.

- it is portable to every platform postgres runs, which in practice means it will run everywhere. Portability between databases is overrated because it rarely happens in practice.

- the access control logic evolves together with the schema, guaranteeing they have an exact correspondence. This is a good thing.

- integration tests should involve a live database

- have you looked at jetbrains datagrip?


> it is cached, in the database’s memory, where the cache can be invalidated automatically. It is better to cache views than data anyway.

The cache may not be for the data in the database, but for something else (task queue, calculation, user session, pre-rendering, etc). You effectively split your cache into several systems.

> it is portable to every platform postgres runs, which in practice means it will run everywhere. Portability between databases is overrated because it rarely happens in practice.

As I mentioned, for a project it doesn't matter. For a lib it does.

> the access control logic evolves together with the schema, guaranteeing they have an exact correspondence. This is a good thing.

Not always. Your access control logic could be evolving with your model abstraction layer, which frees you to make changes to the underlying implementation without having to change the control logic every time. It's also good for your unit tests, as they are not linked to something with a side effect.

> integration tests should involve a live database

Integration tests are slow. You can't run them as you develop.

> Have you looked at jetbrains datagrip?

It's fantastic, if you are willing to pay the price of it and force it on your entire team. If you are on an open source project, will you expect that from all your contributors?

High barrier of entry, with no modularity. Linters, formatters, debuggers, auto-importers, they all depend on that one graphical commercial product that is not integrated with your regular IDE and other tools.

Not to say it's not a good editor if you do write a lot of SQL, as JetBrains products are always worth their price.


The tooling is just fine for me, DataGrip or not: I wrote a ton of PostgreSQL triggers and utility functions in Javascript thanks to the excellent plv8 (free) extension. Development and unit tests for these functions is as easy as plain Javascript.


> Development and unit tests for these functions is as easy as plain Javascript.

Since the most popular products in the JS world are products to avoid writing plain JavaScript (webpack+babel, TypeScript, CoffeeScript, JSX, etc.), and plv8 supports none of that, nor does it support standard unit test frameworks like Jest, I'm not convinced.

To make things even worse, remote debugging is not supported anymore: https://github.com/plv8/plv8/issues/131#issuecomment-2377111...


Personally, this time around I found no need to even use a framework for unit testing a single JavaScript function.

If I had to do something that feels too complex, I'd write some interfaces in TypeScript to represent the tuples, then implement the function in TS, and compile it into plain JS in order to deploy it to plv8.


It also means you need a per-user database connection which isn't feasible in many applications as the recommended number of concurrent connections is typically quite low.


Not necessarily. You can SET ROLE at the beginning of a transaction and DISCARD afterward. This is what postgraphile and postgrest do, for example.


Oh cool, I didn't know that. Good to learn!


postgrest actually does a `SET LOCAL ROLE` (transaction scoped, so no need to DISCARD), but yes, that's the gist of it.
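Roughly like this (the role name is made up and must already exist):

  BEGIN;
  SET LOCAL ROLE app_user_42;  -- transaction scoped, reverts at COMMIT/ROLLBACK
  SELECT * FROM documents;     -- RLS policies now see current_user = app_user_42
  COMMIT;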


This absolutely isn't required; I think you might be getting confused with RBAC rather than RLS. Handy infosheet: https://learn.graphile.org/docs/PostgreSQL_Row_Level_Securit...


You can just change the user for every query.


Every time I have seen systems using DB security as the main security solution, they have gotten into a lot of problems later, and most of them changed.


MS SQL has row level security too.


The main reason I use MySQL over Postgres is storage engines. MySQL has storage engines with transparent compression, which allow me to keep an order of magnitude more data than using Postgres. These days its MyRocks, in the past it was TokuDB.

Historically Postgres has treated that as the job of the file system, which is basically a bad choice for DBs. MyRocks wipes the floor with it.

TimescaleDB is an interesting new Postgres storage choice. I'm evaluating it. But the free version doesn't do compression on-demand nor transparently. Anyway, it can only get better.


Hi @willvarfar: Just to avoid any confusion, TimescaleDB's compression is available in our community edition, which is fully free to use. Once you specify some compression configs, it's automated and transparent.

https://www.timescale.com/products/features https://docs.timescale.com/latest/using-timescaledb/compress...

(The "community" edition is available under our Timescale License. It's all source available and free-to-use, the license just places some restrictions if you are a cloud vendor offering DBaaS.)


The cool thing about TimescaleDB is that Timescale, not Postgres, controls the partitions, so who says they all have to be done the same way?

In my use cases I have got recent partitions that are upsert heavy, so row-based works best. But as the partitions age, column-based would be better. What every DB seems to force me to do is use the same underlying format for all partitions. What I’d like is a storage engine that automates everything; instead of me picking partition size, it picks on the fly and makes adjustments. Instead of me picking olap vs oltp, it picks and migrates partitions over time etc.


TimescaleDB mostly does exactly that: Recent data/partitions are maintained in row-based form, and after they reach some age, they are converted into more columnar compressed form.


Postgres with the timescaledb extension (community) along with ZFS compression works flawlessly for me. Different approach than MyRocks, but worth a try!


Have you tried using the timescaledb compression instead of zfs? And how does this compare to uncompressed and to native Postgres tables?

I think that if timescaledb can support time-based or mru-based compression and even row-store partitions then it would take things to the next level.


You've probably seen this, but reposting:

https://blog.timescale.com/blog/building-columnar-compressio...

Our own experience was that most users were getting 3-6x compression running TimescaleDB with zfs.

With our native compression -- which does columnar projections, where a type-specific compression algorithm is applied per column -- we see median compression rates across our users of 15x, while 25% of our users see more than 50x compression rates.


Do these satisfy the compression requirement?

https://stackoverflow.com/questions/1369864/does-postgresql-...


No. LSM/Rocksdb databases do "block compressions". Even better is dynamic dictionary compression: https://github.com/facebook/rocksdb/wiki/Dictionary-Compress...


No, that's a compressed row format, not a compressed block format.

MySQL's InnoDB, for example, supports per-row compression. It's completely ineffective. Block compression in TokuDB and MyRocks etc. is a completely different class.

MySQL's InnoDB also supports a kind of 'page compression' using the file system's sparse pages. It's also naff.

The closest Postgres can get is using ZFS with compression. It's a lot better than nothing.


Yeah, one of the suggestions on that page was a compressed filesystem.


SQL Server can also do storage compression. I assume Oracle can too. It used to be that the commercial offerings had the most advanced features first and the OSS systems copied them later.


I did use MySQL/MariaDB in production for a couple of years, and now I'm using Postgres.

PG sure seems to have better foundations and generally a more engineered approach to everything. This leads to some noteworthy outcomes in the tooling or the developer UX, for example: PG won't allow a union type (like String or Integer) in query arguments. PG's query argument handling is rather simple (at least the stuff provided with libpq); it only allows positional arguments.

Sometimes MySQL, by extending or bending SQL here and there, allows some queries to be written more tersely, like an update+join over multiple tables. Also, I actually enjoyed that MySQL had fewer advanced features, so it seemed easier to reason about performance and overall behaviour.


Well, (unfortunately) plenty of applications support only one DB backend. For FOSS software, MySQL/MariaDB support is very common, and even if those apps support pgSQL it might be that pgSQL support is not as robust and well-tested as mysql.

And in plenty of situations, sqlite is perfectly adequate, and saves you from the extra work of managing yet another service.

But yes, I agree if I'd start from a blank sheet and need a network-accessible SQL DB, pgSQL would be my first choice. Now if I had some very high end requirements, maybe one of those big commercial DB's would have an advantage.


Yahoo uses PostgreSQL. I think that's pretty high-end requirements and debunks the myth of "commercial software is better for high end requirements".


First of all, Yahoo historically used Oracle Enterprise and MySQL for RDBMS, roughly 50% each.

Large companies use some of everything, but I never saw PG in use there.



No, it wasn't legendary.

That's just one data warehouse application.

The one I administered, just as big, was Saturn. And it was MySQL.

Source: worked at Yahoo, saw way more MySQL than PG.


You just said you never saw Postgres at Yahoo. Yet Yahoo had possibly the world’s largest Postgres installation. I consider that legendary. I’ve certainly been aware of it for over a decade. It was also an early (?) columnar store which is now commonplace among OLAP databases. Seems like if it didn’t start a trend it was at least an early adopter. I’d say that’s legendary too. Even today the scale is impressive.

Now you say you did see some Postgres but there was more MySQL. Fine, you were there, I’ll take your word for it. But I can’t reconcile your own statements on this. Yahoo very clearly used Postgres.

You ran a 2PB MySQL install. Cool, I’d love to hear about that, truly. Do you have any written accounts or talks about that?


I believe the database described is Greenplum[0], which was a fork of PostgreSQL at 8.3, I think. It handles truly enormous datasets.

There's been an ongoing multi-year project to merge Greenplum up to the mainline so that it's no longer a hard fork.

Disclosure: I work for VMware, which sponsors development and sells commercial offerings of Greenplum.

[0] https://greenplum.org/


The article says Yahoo bought Mahat Technologies for their columnar version of Postgres. That sounds similar to Redshift or Greenplum but I think it is different. I can’t find a clear history of Greenplum’s origins or what happened to Mahat. Looks like Redshift came from ParAccel which was a separate project. From what I can find there were a lot of similar projects at the time.


We picked SQL server by Microsoft for one of our projects because it was easier to integrate it when we started (ASP.NET core 2.0) and it has very nice tools like SSMS, better than DBeaver. For a new project we picked Postgres. We might in the future use any of the two.

We also have other DB tech in the company: Teradata, Oracle, Redis, Cassandra.


I am quite surprised MySQL or MariaDB is not mentioned more. It seems like a standard to me. Almost every serious software supports it. Of course it doesn't have some extra features like GIS but for generic RDB it performs well. And there is also Galera cluster btw...


Oracle is currently moving in the direction of autonomous databases. It's no longer just about having great automation features but about the database being able to do things for you like create or drop an index in order to improve ETL performance, and it's all done autonomously.


SQL Server also has some parts of that. Will probably be expanded in the future.


Oracle has not managed the shift to cloud so if you look at where the puck is going, it's moving away from Oracle.


Oracle has their own cloud platform and a great database to go along with it, if anything history has shown us that Oracle is able to adapt their database offerings to cater to current enterprise needs and I think this time will be no different.


The database is not the cloud platform. There is a lot of gravity there but it's not enough.


The largest enterprise SaaS application in the world runs on Oracle.


Is that application built and sold by Oracle as well?


No


What app, out of curiosity? SFDC?


You guessed it.


What app, out of curiosity?


Postgres use always reminds me of this presentation: http://boringtechnology.club/

I self-admittedly love esoteric databases and storage engines to a fault. I'll try to shoehorn things like RocksDB into whatever personal project I'm working on.

At work however, the motto I spread to the teams I work on is "use Postgres until it hurts". And for many, many teams - Postgres will never hurt. I'm very happy for its continued existence because its been a solid workhorse on various projects I've worked on over the years.


What a great presentation. Thanks for the link!

BTW, here's an interesting observation: if you normalize a schema to the max, applying CRDT techniques for eventually-consistent multi-master concurrency is relatively simple, and you can do it using SQL. With PG you could have each instance publish a replication publication of an event log, and each instance could subscribe to a merged log published by any of N instances whose job is to merge those logs, and then each instance could use the merged log to apply local updates in a CRDT manner. If you normalize to the max, this works.

For example, instead of having an integer field to hold a count (e.g., of items in a warehouse of some item type) you can have that many rows in a related table to represent the things being counted. Now computing the count gets a bit expensive (you have to GROUP BY and count() aggregate), but on the other hand you get to do CRDT with a boring, off-the-shelf, well-understood technology, with the same trade-offs as you'd have using a new CRDT DB technology, but with all the benefits of the old, boring technology.
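A toy sketch of that counter-as-rows idea (names are made up; gen_random_uuid() needs the pgcrypto extension on older versions):

  -- instead of: UPDATE stock SET qty = qty + 1 WHERE item_id = 42;
  CREATE TABLE stock_units (
    unit_id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    item_id bigint NOT NULL
  );

  -- an "increment" is an insert, which merges trivially across replicas
  INSERT INTO stock_units (item_id) VALUES (42);

  -- the count is derived: a bit more expensive, but conflict-free
  SELECT item_id, count(*) FROM stock_units GROUP BY item_id;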


> I self-admittedly love esoteric databases and storage engines to a fault.

Any interesting notes and observations after using those esoteric tools?


By putting thought into design patterns and a storage engine that accents that choice, it's sometimes possible to eschew caching and distributed storage altogether - saving you from a whole host of complexity.

It can also be challenging to accurately assess a database's performance and correctness. I've been using C/Rust bindings when possible for the former, and Aphyr's Jepsen test results for the latter. Unfortunately there's no silver bullet on these topics - you pretty much have to test all your use cases.


Postgres is my go-to RDBMS, but I do have one serious complaint: Connections are too expensive.

This is a side-effect of the old-school one-process-per-connection architecture that Postgres uses. MySQL (ick) easily handles thousands of connections on small servers; with Postgres you will need a LOT of RAM to sustain the same, RAM that would be better spent on cache.

I've found (at least, for my current app) that the number of connections defines my scaling limit. I can't add more appserver instances or their combined connection pools will overflow the limit. And that limit is disappointingly small. It's not so painful that I want to reach for another RDBMS, but I'm bumping into this problem way too early in my scale curve.


It might be possible to write a small proxy in Rust that would run next to a PG instance, accept connections using `async`, and then forward queries using some limited-size connection pool.


... if you are very keen on writing a pgpool replacement.


These are active connections running transactions - unfortunately there's no substitute for just having another connection.


If there are a lot of connections actually running transactions, then I'd expect the DB to become overwhelmed by the number of transactions, and per-connection overhead is not going to be the problem.

The way I understood the problem is thousands of connections, most of which only run a query from time to time.

Judging by the other comments, it seems solutions like that are already available.


Most of my transactions are relatively long-running (minutes) and mostly idle, so the db can handle the transaction load.

There honestly doesn't seem to be a good solution to this problem. I have reorganized my app a bit to try to keep the transactions shorter (mostly checkpointing) but it's using architecture to solve a fundamentally technical problem. I wouldn't have this problem with MySQL.

Switching from processes to threads (or some other abstraction) per connection isn't likely to show up in Postgres anytime soon, so I guess I'm willing to live with this... but I'm not going to pretend that Postgres is without some serious downsides.


See PostgREST!


> with Postgres you will need a LOT of RAM to sustain the same

The per-connection memory overhead is not actually that high. If you configure huge_pages, it's on the order of ~1.5MB-2MB, even with a large shared_buffers setting.

Unfortunately many process monitoring tools (including top & ps) don't represent shared memory usage well. Most of the time each process is attributed not just the process local memory, but all the shared memory it touched. Even though it's obviously only used once across all processes.

In case of top/ps it's a bit easier to see when using huge pages, because they don't include huge pages in RSS (which is not necessarily great, but ...).

Edit: expand below

On halfway recent versions of linux /proc/$pid/smaps_rollup makes this a bit easier. It shows shared memory separately, both when not using huge pages, and when doing so. The helpful bit is that it has a number of 'Pss*' fields, which is basically the 'proportional' version of RSS. It divides repeatedly mapped memory by the number of processes attached to it.

Here's an example of smaps_rollup without using huge pages:

  cat /proc/1684346/smaps_rollup 
  56444bf26000-7fff2d936000 ---p 00000000 00:00 0                          [rollup]
  Rss:             1854392 kB
  Pss:              235614 kB
  Pss_Anon:           1420 kB
  Pss_File:            274 kB
  Pss_Shmem:        233919 kB
  Shared_Clean:      10700 kB
  Shared_Dirty:    1837560 kB
  Private_Clean:         0 kB
  Private_Dirty:      6132 kB
  Referenced:      1853428 kB
  Anonymous:          2664 kB
  LazyFree:              0 kB
  AnonHugePages:         0 kB
  ShmemPmdMapped:        0 kB
  FilePmdMapped:        0 kB
  Shared_Hugetlb:        0 kB
  Private_Hugetlb:       0 kB
  Swap:                  0 kB
  SwapPss:               0 kB
  Locked:                0 kB
You can see that RSS claims 1.8GB of memory. But the proportional amount of anonymous memory is only 1.4MB - even though "Anonymous" shows 2.6MB. That's largely due to that memory not being modified between postmaster and backends (and thus shared). Nearly all of the rest is shared memory that was touched by the process, Shared_Clean + Shared_Dirty.

With huge pages it's a bit easier:

  cat /proc/1684560/smaps_rollup 
  55e67544d000-7ffecebc9000 ---p 00000000 00:00 0                          [rollup]
  Rss:               13312 kB
  Pss:                1671 kB
  Pss_Anon:           1397 kB
  Pss_File:            274 kB
  Pss_Shmem:             0 kB
  Shared_Clean:      10656 kB
  Shared_Dirty:       1292 kB
  Private_Clean:         0 kB
  Private_Dirty:      1364 kB
  Referenced:        12312 kB
  Anonymous:          2656 kB
  LazyFree:              0 kB
  AnonHugePages:         0 kB
  ShmemPmdMapped:        0 kB
  FilePmdMapped:        0 kB
  Shared_Hugetlb:  1310720 kB
  Private_Hugetlb: 2379776 kB
  Swap:                  0 kB
  SwapPss:               0 kB
  Locked:                0 kB


I love Postgres as much as the next guy but there are three things that really annoy me:

Postgres is still laid out for 9-to-5 workloads, accumulating garbage during the day and running vacuum at night. Autovacuum just doesn't cut it in a 24/7 operation and it is by far what has caused the most problems in production.

No query hints and no plan to ever implement it. Making the planner better only takes you so far when everything depends on table statistics which can be easily skewed. pg_hint_plan or using CTEs is not extensive enough.

Partial indexes are not well supported in the query planner, thanks to no query hints I can't even force the planner to use them.
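(For anyone who hasn't seen it, the pg_hint_plan style of hint is a leading comment, e.g. with hypothetical table/index names:

  /*+ IndexScan(orders orders_pending_idx) */
  SELECT * FROM orders WHERE status = 'pending';

which works, but only for the plan nodes and cases the extension happens to cover.)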


Do you have a good Postgres DBA? Autovacuum can indeed ruin your day if you have a 24/7 DB that is delete- or update-heavy (the default settings are not very aggressive, and once it falls behind badly it will never catch up and you'll need to do some emergency manual vacuuming). But I can assure you it's possible to run 24/7 DBs fine without running into autovacuuming problems by making sure things are configured right. I've experienced both regular problems due to bad config and their complete disappearance.
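As a rough illustration (numbers are made up, tune to your churn), the per-table knobs are usually where that configuration starts:

  -- vacuum this hot table after ~1% of rows change instead of the default 20%
  ALTER TABLE events SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_cost_limit   = 2000
  );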


PostgreSQL and ZFS are a marriage made in heaven. Block sizes align with disk behaviour, and snapshots make great low-latency DB dumps trivial. Never regretted making the combo.
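The dataset settings usually suggested for this combo look something like the following (pool/dataset name is made up, and the right recordsize is debated):

  zfs create -o recordsize=8k -o compression=lz4 -o atime=off tank/pgdata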


FWIW, I've often called out my decision to use PostgreSQL and ZFS (on Linux) as my worst.

Memory usage was extremely inefficient. ARC had to be set to half of what it should've been because it varied so dramatically, so half the system memory was wasted. ZFS would occasionally exhaust system memory causing a block on allocations for 10+ minutes almost daily, no ssh or postgresql connections could be opened in the meantime. Sometimes a zfs kernel process would get stuck in a bad state and require restarting the server. Many days were wasted testing different raid schemes to work around dramatic space inefficiencies (2-8x IIRC) with the wrong combinations of using disk block sizes, zfs record sizes, postgresql block sizes, and zfs record compression. Because zfs records are atomic and larger than disk pages, writes have to be prefaced with reads for the other disk pages, adding lots of latency and random IOPS load. Bunch of other issues, I could go on.

I've since switched back to ext4 and hardware RAID. Median and average latency dropped an order of magnitude. 99th percentile latency dropped 2 orders of magnitude.

These databases are at high load. If they had low load, and I wasn't expecting to grow into high load, I'd consider ZFS since it does have a bunch of nice features.


I'm on a read-mostly DB. Also, I have an SSD-backed ARC and it's recent ZFS with a lot of memory (512+GB IIRC).


Same here, latest ZFS (on Linux), 512GB RAM, SSD backed, but write heavy.


And compression! I've gotten compression ratios above 3x on real production databases, which is the difference between "we need more disks by next month" and "this hardware will last the foreseeable future". Not to mention that it improves performance when lz4 can decompress faster than the disks can read.


I'm getting almost 4x on a 10TB production database! A little surprising the first time you see all the z_wr_iss processes flying :)


Can you please give us more info about your workload? Is it a very high-throughput DB, especially for inserts and updates? I searched for ZFS and PostgreSQL performance a couple of years ago on the Internet, and the popular opinion seems to be that databases on copy-on-write file systems don't give the best performance.


I am not a deep expert, but my impression is that Postgres MVCC is kind of a copy-on-write approach already (no in-place updates; you write a new row version for each update), so it should be a perfect marriage.


I'm on a read-mostly DB. Few clients, but constant load. More inserts than updates.


I manage Oracle and Postgresql instances and I agree with this, except for one small thing: Postgresql doesn't have packages.

Otherwise, it is much simpler to manage than Oracle (which really is a system turned inside out). The documentation is really good and straight to the point (Oracle can't help but turn everything into an ad - nvl uses the best null handling technology on the market and such).


I spent many years on Oracle. I do not miss smug Tom Kyte blog posts about the pointlessness of BOOLEANs. Nor do I miss endless hours Googling around Burleson Consulting pages because we couldn’t afford support.

I do kinda miss how easy it was to do partitioning but I’m on Redshift now so it doesn’t matter.


Oh, man, that shirtless guy. Thanks for putting that back into my mind!


While not having packages isn't a huge deal for new projects, it can be a pretty huge stumbling block for DBAs who want to migrate from Oracle to Postgres but need an easy way to port their PL/SQL code to PL/pgSQL. Most of the syntactic differences (like PL/pgSQL code needing to be delimited as a string) can be automated, or handed off to a junior programmer. Not being able to use Oracle's package system would require some major refactoring of how you organize your functions and procedures.


This talks a lot about Postgres (which is great ) but not about other databases so the title isn't very accurate. If you ignore cost it certainly doesn't explain why you'd work with Postgres rather than Oracle or SQLServer.

I've used Oracle for 20+ years, SQLServer for > 10 years, and Postgres from time to time and would say they are all good tools for a general use case.


Cost isn't an advantage for the company I work for because we licensed SQL server very cheap in big volume. If devs can work easier and faster with SQL server, then we end up saving money because time is money.


As mentioned, PostgreSQL's documentation is some of the best in the software world. With most software project documentation, one ends up scouring StackOverflow, mailing list archives, and GitHub issues trying to figure out how to do something. With Postgres, I can just go straight to the official documentation, and read it. It has yet to fail me.


While I disagree that it's the world's best, it is my default for all projects and general RDBMS work.

I used to be a MySQL user, but it felt like over time it was becoming hack on top of hack with each subsequent version. While PG lacked a number of features in the early 2010s, it has clearly caught up, with great care taken of its codebase.

There are scenarios that PG does not win and where it should not be used, but if we're talking about covering the most applications well, PG is it.


Out of curiosity, what RDBMS do you feel is the best?


Why should there be one best DB? It's a game of tradeoffs. Always was, always will be.


Sure... but if you HAD to choose one? Just kidding. I think the word "best" could easily be substituted with "your go-to DB"... or simply the best for <category of what you want to do>.


I used PostgreSQL as a student and during research, and had to switch to Microsoft SQL Server as my company has used that since forever.

For features I used on both, the average quality of the SQL Server implementation is between middling and pathetic (mostly the latter). I cannot recall a single instance where I thought "that's nice" or anything positive about how SQL Server does something. At best, it has been "this is not awful".

Things that sucked especially hard:

- Datetime API: In postgresql it is a joy to use and most features one would like to have to analyze data are directly there and work as they should. The analysis for my master's thesis was completely done in postgresql [1] (including outputting CSV files that could then be used by the LaTeX document) - without ever feeling constrained by the language or its features. Meanwhile, the way to store a Timespan from C# via EF is to convert it to a string and back or convert it to ticks and store it as long.

- The documentation: The pgsql documentation is the best of any project I have ever seen, the one for SQL Server is horrible and doesn't seem to have any redeeming qualities.

- Management: SSMS is a usability nightmare and still the recommended choice for doing anything. I assume there are better ways, but I haven't found them, and the ones I did find usually didn't work.

- The query language: In psql you can do things like

  select a < b as ab, b < c as bc , count(*) from t group by 1,2; \crosstabview
To get even half of that in SQL Server one would do

  select case when a < b then 1 else 0 end as ab, case when b < c then 1 else 0 end as bc, count(*) as count from t group by case when a < b then 1 else 0 end, case when b < c then 1 else 0 end
I think crosstab queries are unsupported (there are PIVOTs but I haven't yet looked into them a lot).

Maybe my use cases are a bit weird and I haven't encountered the better features of SQL Server yet. Would appreciate if someone could point out what I am doing wrong.

[1]: https://github.com/oerpli/MONitERO/blob/master/sql/queries/r...


Interesting how MSSQL has bad documentation. Judging by my experiences with docs for Azure and various Windows and dotnet APIs, Microsoft's documentation is usually pretty solid. Not that it's surprising, since proprietary RDBMSes tend to prefer "vendor lock-in" over "making the product pleasant to work with so the customer doesn't jump ship".


SQL Server has a ternary. So your example would be written like:

  select iif(a < b, 1, 0) as ab, iif(b < c, 1, 0) bc...
Unsure what crosstabview is as I've never used it.

I hate sql server tho. Basic things like indexable arrays in postgresql make it amazing.
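For example (hypothetical table), a GIN index on an array column makes containment queries indexable:

  CREATE INDEX posts_tags_idx ON posts USING gin (tags);

  SELECT * FROM posts WHERE tags @> ARRAY['postgres'];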


This makes it somewhat easier on the eyes but is still 80% too verbose.

Crosstabview converts a table from:

  |a x 0
  |a y 1
  |a z 2
  |b x 1
  |b z 5
to

  |  x y z
  |a 0 1 2
  |b 1   5


PostgreSQL is great if you:

a) are OK with using SQL (this is not obvious)

b) do not need a distributed database

I've spent a lot of time on looking at database solutions recently, reading through Jepsen reports and thinking about software architectures. The problem with PostgreSQL is that it is essentially a single-point-of-failure solution (yes, I do know about various replication scenarios and solutions, in fact I learned much more about them than I ever wanted to know). This is perfectly fine for many applications, but not nearly enough to call it "the world's best database".


You are the architect of the application. If the application requires a distributed database then it's your task to implement that in order to fulfill the client's requirement. PGSQL has nothing to do with your skills and/or capabilities to comprehend said situation. You can do a distributed database even with Access, or even with simple files.


This is why I'm always looking at the new cool databases on the block before going back to postgres.

The hard truth is that postgres is more annoying to operate than a lot of the modern alternatives when you need HA.


Can you just shortly elaborate on why PostgreSQL is not good for distributed scenarios for those of us who are/were under the impression that psql does have stable distribution features?


IMO it should be possible to build a CRDT-style, eventually-consistent, distributed DB using PG. Normalize to the max, and then normalize more, have instances publish replication for log schemas, have a layer of instances that publish merged logs, and have instances subscribe to and apply (CRDT-style) those merged logs.


You can use CRDTs on top of pretty much any database, and while they could simplify things, a general-purpose database won't be optimized for the specific constraints of CRDTs, so you won't be able to achieve certain levels of performance, availability, reliability and operational overhead possible with a well designed CRDT-based system (I'm talking an order of magnitude difference here).


So what do you suggest then?

Do you really care that much about the language? Shouldn't you care more about your data? If you are not OK with SQL there are abstractions available.


> Do you really care that much about the language? Shouldn't you care more about your data? If you are not OK with SQL there are abstractions available.

I do care about SQL being a text-based protocol with in-band commands, bringing lots of functionality that I never need (I do not use the fancy query features for exploratory analysis, I have applications with very specific data access patterns).

For my needs, I do not need a "query language" at all, just index lookups. And I would much rather not have to worry about escaping my data, because I have to squeeze everything into a simple text pipe.


So what distributed database solution have you found for your project?


There is no such thing as a "best database for everyone". I intend to migrate to FoundationDB in the main project I'm working on, but it is not necessarily the "best" option for everyone (see https://blogs.vmware.com/management/2020/03/changing-the-fou... for a good writeup).


For anyone using C#, please have a look at the excellent Marten library, which shows off the awesome functionality of Postgres as a document (i.e. NoSQL) + event store.

Postgres is not only for SQL, as I see multiple people referencing only the SQL functionality.

For those that work with events/projections, it supports JS migrations on events so you can update events in your store to the latest version.


PostgreSQL is also my go-to database for anything relational. My only issue with it, though, is the need to run VACUUM FULL (which locks tables for a significant amount of time) to free up disk space to the operating system.


Is PostgreSQL doing anything in the area of autonomous databases? Oracle seems to be doing a lot in this area right now and it seems to be promising.


That performance carries the day today is a huge improvement since 1999. Back then, there was also only one question that mattered, but the question was, "Is it Oracle?", just as the only question about network equipment was "Is it Cisco?".

It is hard to explain, at this remove, what horrible perversions were performed in order to be able to say "yes" to such questions, or to avoid the need to answer them.

One example was a router that, at opposite ends of a long-haul network, managed data flow that was utterly insensitive to the magnitudes of latency or drop rate. They sold hardly any. It turned out the algorithm could be run in user space over UDP, and that became a roaring business, for a different company, because you could deploy without getting network IT involved. Mostly IT didn't even want bribes. They just didn't want anything around that was unfamiliar.

Database products had to pretend to be an add-on to Oracle, and invisibly back up to, and restore from, an Oracle instance, because Oracle DBAs only ever wanted to back up or restore an Oracle instance. The Oracle part typically did absolutely nothing else, but Oracle collected their $500k (yes, really) regardless.


Bi-temporal tables support badly wanted.

https://mariadb.com/kb/en/temporal-data-tables/


Yes, though the SQL2011 temporal support was a dud compared to the state of the theoretical art. A strong whiff of vendor-vs-vendor nonsense.

Folks in the PostgreSQL community are aware of temporal tables:

https://www.pgcon.org/2019/schedule/events/1336.en.html

https://www.2qpgconf.com/wp-content/uploads/2016/05/Chad-Sla...


I get the Postgres love but as someone who’s been doing mysql for a long time I found postgres really hard to use. Like even setting it up on Mac was a pain in the ass. How do I create a new db? Why is pgcli not as friendly as mysqlcli? After 2 weeks of head bashing and lots of stack overflow reading I gave up and went back to good ol MySQL.

For ease of use for a new comer, mysql seems to have nicer tooling and out of box use. I’ve worked with mysql on sites with millions of daily active users and it hasn’t been a problem. Snappy 10ms queries for most things. I guess one really needs to learn ANALYZE and do the necessary index tuning.

I’m just saying, use the tool you’re most comfortable with, has good ecosystem and gets the job done. For some it’s PGSQL, for some it’s MySQL, for others it could be the paid ones. MySQl8 is pretty solid nowadays.

There is no universal “best”, just as there is no “best” car. All about trade offs.


> setting it up on Mac was a pain in the ass

What was your problem? All you need to call is:

  docker run --rm --name pg-docker -e POSTGRES_PASSWORD=docker -d -p 5432:5432 -v $HOME/docker/volumes/postgres:/var/lib/postgresql/data postgres:latest

...and you are good to go. In depth guide: https://hackernoon.com/dont-install-postgres-docker-pull-pos...
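Then connect either with a locally installed psql or through the container itself (assuming the container name above):

  docker exec -it pg-docker psql -U postgres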


Might be a bit... forcefully put, but I gotta say that PostgreSQL is one of my favorite pieces of software ever created. And I don't even like C all that much.


I'd like to add one more thing about PostgreSQL: the source code is arguably the cleanest code you'll ever see. It's second to none.


> So, you want NoSQL, Riak, REACT, Redis, Mongo, etc.? PostgreSQL does all that. Admittedly not with all the bells and whistles of all of the original products. For example, PostgreSQL doesn’t create new shards for you for any of those. That’s still a manual process.

Isn't this a fairly big issue? I would think the convenience of automated scaling is a primary reason to use these other tools.


While we're all here talking about databases I've got a question for the DBA/well versed software engineers.

What database or database design strategies should be used for an analytics dashboard that can be queried in real time, similar to Google Analytics?

The first real scaling issue I've come across (as a junior dev working on a side project) is a table with 38 million rows and rapidly growing.

Short of just adding indexes and more RAM to the problem, is there another way to design such a table to be effectively queried for Counts and Group Bys over a date range?

I imagine things can't really be pre-computed too easily due to the different filters?


I currently run analytic queries on a page view dataset of around 600 million rows on Postgres. You can start by looking into partitioning by date. Also take a look at the IO on your queries; IO was the biggest bottleneck for me. Make sure you have hardware that is not limiting it.
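A minimal sketch of declarative range partitioning for that kind of table (names are made up):

  CREATE TABLE page_views (
    viewed_at  timestamptz NOT NULL,
    url        text,
    visitor_id bigint
  ) PARTITION BY RANGE (viewed_at);

  CREATE TABLE page_views_2020_04 PARTITION OF page_views
    FOR VALUES FROM ('2020-04-01') TO ('2020-05-01');

  -- date-constrained counts/group-bys then only scan the matching partitions
  SELECT date_trunc('day', viewed_at) AS day, count(*)
  FROM page_views
  WHERE viewed_at >= '2020-04-01' AND viewed_at < '2020-05-01'
  GROUP BY 1;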


I'm not sure how much flexibility you have, but have you taken a look at Prometheus (for time series data) + Grafana (for dashboards)

I'm certainly not a DB expert, but this is where a similar need led me a few months ago.


You want an OLAP database, something like Clickhouse.


PostgreSQL is absolutely fantastic and, for most projects, unless there's an existing legacy database or a purely Microsoft environment, should be the default RDBMS.

My only complaints are:

1) No built-in or easy failover mechanism (and it will never have one because of their stated philosophy). I'll settle for a standard one, and yes, I know there are different requirements and things to optimize for and there are 3rd-party solutions (just not one that integrates easily).

2) Horizontal scalability. And yes, I know there are some solutions, that other DBs are more apt for this, and that I may be asking for the wrong thing.


PostgreSQL is a transaction-oriented row store. It is slow for analytics, and I mean slower by something like two orders of magnitude, if not more, relative to state-of-the-art systems like HyperDB or Actian Vector, on typical analytic workloads. You can extend it any which way you like, that won't change. (Ok, if you can extend it with essentially a different DBMS then that could be fast.)

Security, concurrency control - those are important for transactional work facing numerous users. Not for analytic processing speed.


I’m assuming anyone who has to make that decision already knows this but, while PostgreSQL is great to host a production database, it isn’t a great choice for an analytic database at scale, or to train or store your machine learning features. It works, but it’s not great at scale.

You can get away with having a scheduled pg_dump early on, some reports on that, while you figure out an ETL/Messaging process — but picking something that handles concurrent large-scale queries will matter fast.


> PostgreSQL is great to host a production database, it isn’t a great choice for an analytic database at scale

How do you define an "analytic" database? Time series data, or something else?


I’d say hosting data that you don’t need to run your live ergodic (non-ML) software service, but that you’d need for other services: informing the business, flagging operational issues, training ML models.

Some of those can be done with back-end tools — say, if your software also connects to your customer service center, and the managers of that center monitor their activity on a solution developed in-house, that solution isn't “analytical” but should rather be part of the main architecture.

The main distinction that I’d make is: would it be a problem for your service if that data wasn’t available for a second, a minute, an hour? Anything visible on your app or website? A second is probably stretching it. Monitoring logistics operations, say drivers at Uber? Up to a minute is probably fine. You want to retrain an ML model because you have a new idea, but the database is down for maintenance for an hour? You can go and grab coffee, or lunch — you are fine. Serving that same model for recommendations on an e-commerce website, that’s obviously not something that can take the same delay.


I actually think that part of the popularity of Postgres - and relational databases (RDBMS) in general - is that they are good enough for most analysis work.

Not specialist stuff, Postgres isn't really a competitor if your business is large or data-focused enough to be running OLAP (someone will probably link an OLAP extension now...).

But, most businesses just aren't that sophisticated in how they use their data.


Could you name some better alternatives for analytical db?


There are a lot of commercial solutions that are geared towards reporting, from old-school Oracle and Teradata to even SAS (would not recommend that one). Some options that came out recently, when “Big Data” was a phrase people said: Hive, Redshift, S3, Cassandra. New players that abstract a lot of the problems from the others and add machine-learning capabilities on top: Google BigQuery, Snowflake. There are non-tabular structures like the graph-based ones, e.g. Neo4j: I’ve yet to find a business where that’s a viable option for analytics for the first five years, but I like that those are bringing something new.

Most of those are great and work: they have connectors to whatever language or tool you want to use. The most promising tool you’ll need to handle most of the transformation is either AirFlow or, preferably, dbt. Those are independent from the database, so don’t worry too much about the features that vendors tout. One key thing: monitor all the queries going to that database and find the expensive ones, those with suspicious patterns, etc.
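
For the monitoring part, if the warehouse happens to be Postgres-compatible, a rough sketch with pg_stat_statements (column names as of PostgreSQL 12; just one way to do it):

  -- requires shared_preload_libraries = 'pg_stat_statements'
  CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

  -- the 20 most expensive statements by cumulative execution time
  SELECT query, calls, total_time, mean_time, rows
  FROM pg_stat_statements
  ORDER BY total_time DESC
  LIMIT 20;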

Spin-up, response time can matter, but they are rarely a problem for most “slow” analytic use cases; for instance, Google BigQuery takes 10 seconds no matter what you query and it’s fine. On the other hand, concurrency has been an issue for me more than anything: all the analysts and managers trying to update their dashboards on Monday at 10 am.

You rapidly get to a point where prices are high and negotiated, so you want to think about your likely usage in the next years before you step into that meeting. Key decision: by default, prefer the tech that is closest to the rest of your stack because ingress is the easiest factor to predict.

Looker will be mentioned: it’s on top of all that, downstream from AirFlow/dbt.


Shameless plug: You might be interested in TiDB (https://github.com/pingcap/tidb), an open-source distributed HTAP database.

- For OLTP workload and some ad-hoc queries, you can use TiDB + TiKV. One of our adopters has a production cluster with 300+ TB of data that can easily cope with the spike brought by COVID-19.

- For more complex queries, TiSpark + TiKV might work well; for heavier queries, we added a columnar store, TiFlash, see https://pingcap.com/blog/delivering-real-time-analytics-and-...


There are a lot of products that either fork or directly extend PostgreSQL for OLAP workloads. Greenplum[0], Materialize[1], Citus[2], PipelineDB[3] etc etc.

[0] https://greenplum.org/

[1] https://materialize.io/

[2] https://www.citusdata.com/ (edit: I said TimescaleDB, I was thinking of Citus)

[3] https://github.com/pipelinedb/pipelinedb

Disclosure: I work for VMware, which sponsors Greenplum development and sells commercial offerings.


We use Teradata for our data warehouse. We have an enormous quantity of data in there (We are a multinational wholesale chain). We however plan to move to either BigQuery or an inhouse solution in the future. But for now Teradata is our workhorse and quite good at what it does.


I wish there was a management layer that runs and operates high-availability Postgres databases on AWS/GCP/Azure... but is significantly cheaper.

AWS RDS makes it impossible to export snapshots to s3 (because the native snapshot storage is super expensive).

EC2 m5.xlarge w/ 5000 GB storage costs $700; RDS db.m5.xlarge w/ 5000 GB costs $1000.

Why is there no SaaS that lets me BYO my own EC2 and just runs it and gives me a dashboardy experience, ESPECIALLY for snapshots and restores?


Sounds like https://scalegrid.io/ could fit your use case


Check out ScaleGrid's BYOC plans. They are similar to RDS, but give you the option to host in your own AWS or Azure account. ScaleGrid supports MongoDB, Redis, MySQL and Postgres. https://scalegrid.io/pricing.html#section_pricing_byoc



GCP managed databases are more expensive than AWS.

They claim this is because (unlike AWS), they don't provision iops as "credits".

There is an opportunity to play in this price arbitrage space where you can take the cloud instances and manage a database on top of it.


Let’s say I have a table for reservations with two columns: (1) item ID and (2) reservation duration.

What I would like is if I have an item ID of 1 and a reservation duration of today 10AM to 11AM and I try to add another record of item ID 1 and duration 10:39AM to 11:39AM the insert should fail.

Apparently it is straightforward in db2 but not in postgresql?


Postgresql supports storing ranges of values, in particular ranges of timestamps.

I never used them, but the manual page shows an example of avoiding overlaps that looks a lot like your case:

https://www.postgresql.org/docs/12/rangetypes.html#RANGETYPE...


Something like this will do the job in PostgreSQL (the btree_gist extension is needed so the GiST index can handle the plain equality operator on item_id):

  CREATE EXTENSION IF NOT EXISTS btree_gist;

  CREATE TABLE reservations (
    item_id uuid,
    duration tsrange,
    EXCLUDE USING GIST (item_id WITH =, duration WITH &&)
  );
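
The overlapping insert from your example is then rejected (values hypothetical):

  INSERT INTO reservations VALUES
    ('00000000-0000-0000-0000-000000000001',
     tsrange('2020-04-03 10:00', '2020-04-03 11:00'));

  -- same item, 10:39 to 11:39 overlaps the existing reservation and fails with
  -- "conflicting key value violates exclusion constraint"
  INSERT INTO reservations VALUES
    ('00000000-0000-0000-0000-000000000001',
     tsrange('2020-04-03 10:39', '2020-04-03 11:39'));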


Can this be made to work if the `id` and `duration` are in different tables (with a joining key)?


PostgreSQL is a great database, but pgAdmin (or any client I have tried) is just painful compared to SSMS.

And while SSMS certainly has some pain points, I have yet to find a DB client tool that even comes close in terms of usability. For this reason alone I prefer MS SQL.


Never got SSMS working on my work laptop. Showed the splash screen and that was it. Weird


I hit that too: it was a known bug in SSMS and Microsoft released an update that fixed it.


Good to know, thanks!


Their JSON data type is atrocious, does anybody actually use it or is it a checkbox item?


You can join on arbitrary json fields, with indices. It is an extremely useful swiss army knife.

I've used jsonb for performing ETL in-database from REST APIs.

I know Concourse from 6.0 uses it for storing resource versions. They had an inputs selection algorithm that became stupidly faster because they can perform joins over JSON objects provided by 3rd-party extensions. Previously it was a nested loop in memory.

https://github.com/concourse/concourse/releases/tag/v6.0.0#3...


Isn't it just a string? Can you do nested queries? Secondary indexes? Really anything beyond just querying the key-value pairs?


It's a real type. You can do nested queries, follow paths into the structure etc. I've performed quite sophisticated joins over jsonb records.
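
A rough sketch of the kind of thing that works (schema and names are made up):

  CREATE TABLE events (
    id      bigserial PRIMARY KEY,
    payload jsonb
  );

  -- GIN index accelerates containment queries (@>) on the whole document
  CREATE INDEX ON events USING GIN (payload);

  -- follow paths into the structure
  SELECT payload -> 'user' ->> 'email'
  FROM events
  WHERE payload @> '{"type": "signup"}';

  -- join on a field extracted from the JSON
  SELECT u.name, e.payload ->> 'type'
  FROM events e
  JOIN users u ON u.id = (e.payload -> 'user' ->> 'id')::bigint;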


I use Postgres for everything. The one thing I wish they would do, though, is improve the error messages when a query fails. At the very least, if I know which position in the text is the beginning of an error, I can fix it.


> The one thing I wish they would do, though, is improve the error messages when a query fails.

Yea, we definitely need to do better there. But it's harder than one might think initially :(. My suspicion is that we (the PG developers) will have to write our own parser generator at some point... We don't really want to go to a manually written recursive descent parser, to a large part because the parser generator ensuring that the grammar doesn't have ambiguities is very helpful.

> At the very least, if I know which position in the text is the beginning of an error, I can fix it.

For many syntax errors and the like we actually do output positions. E.g.

  postgres[1429782][1]=# SELECT * FROM FROM blarg;
  ERROR:  42601: syntax error at or near "FROM"
  LINE 1: SELECT * FROM FROM blarg;
                        ^
(with a mono font the ^ should line up with the second FROM)

If you have a more concrete example of unhelpful errors that particularly bug you, it'd be helpful. In some cases it could be an easy change.


Fair enough. Without Sybase XI or whatever version they would now be up to, I can see where this could be true. But when MySQL added Views in v5.0.3 as I recall, it met my requirements for an RDBMS.


Idk, I'm very pro-PostgreSQL but MySQL/InnoDB used to have that one feature for easy replication via logical DB updates (sending UPDATES to replicas) whereas PostgreSQL support for this was considered lacking, using binary block updates instead if I'm remembering correctly. Also, MySQL, with replication, was considered easier to setup and manage by your average web shop. That may be a thing of the past, though.

Edit: also, best for what? SQLite is, in a way, the perfect DB if you don't have concurrent writes, and I've worked with vector databases (think bitmap indices on steroids) that totally outclass e.g. Oracle and PostgreSQL for analytics-heavy workloads.


That used to be the case, but it's not anymore. Postgresql has logical replication built-in since Postgresql 10 (released 2017-10-05). It's really simple to run these days and the docs are great [0]

[0] https://www.postgresql.org/docs/12/logical-replication.html
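
For anyone curious, a minimal sketch of what it looks like (names hypothetical; the publisher needs wal_level = logical and the tables must already exist on the subscriber):

  -- on the publisher
  CREATE PUBLICATION app_pub FOR TABLE orders, customers;

  -- on the subscriber
  CREATE SUBSCRIPTION app_sub
    CONNECTION 'host=primary.example.com dbname=app user=replicator'
    PUBLICATION app_pub;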


Any love for NoSQL here? Many have all the same capabilities as Postgres/major SQL tech but have a format that removes the need for an ORM.


It would be if it had real clustered indexes like MS SQL, i.e. tables as b-trees that are maintained with each update.


Good news is that the table access method API introduced in version 12 makes it possible to build b-tree based tables as extensions.
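
Until something like that exists, the closest built-in thing is CLUSTER, which is only a one-time physical reorder by an index rather than a maintained clustered index (names hypothetical):

  -- rewrite the table in index order; takes an ACCESS EXCLUSIVE lock
  CLUSTER orders USING orders_created_at_idx;

  -- later rows are not kept in order, so this has to be re-run periodically
  CLUSTER orders;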


For scaling I assume some cloud solution or very distributed solution is better than PostgreSQL. For example Azure SQL Database Hyperscale or CockroachDb.

https://redmondmag.com/articles/2019/07/03/microsoft-buildin...


Is there a Galera-equivalent for Postgres?

There's BDR, but it's closed source for recent Pg versions, AFAICT.


How do you manage your PostgreSQL deployments?


Of those several worlds? Is that a subtle joke? I don't get it ...


I'm a simple man, I see PGSQL in a HN post, I upvote, I don't even read the article


Ironically, the blog post shoots itself in the foot by starting with Postgres security.

Postgres has very poor security compared to MySQL, and in fact, I tell companies implementing compliance policies to shift to MySQL.

https://www.cvedetails.com/metasploit-modules/vendor-336/Pos...

The reasons are:

- Postgres' grant model is overly complex. I haven't seen anybody maintain the grants correctly in production for non-admin read-only users. By contrast, MySQL's grants are simple to use and simple to understand.

- Postgres' COPY FROM and COPY TO have been used to compromise the database by copying ssh keys to the server, amongst other things.

- Postgres' version of upsert allowed any command to be run without checking the permissions. So the vaunted "software engineering" behind Postgres is not that solid.

- Currently Postgres is subject to around a dozen metasploit vulnerabilities that any script-kiddy can execute.

The simple fact is, if you use Postgres, you almost certainly have a security compliance problem.

I could make the same arguments about replication, or online schema changes, multi-master writes, or any enterprise database feature.

Some constructive advice to the Postgres developers is to take a week and add grant commands to limit COPY FROM and COPY TO, and look at the metasploit options and see what can be done ASAP.

I'd appreciate if you're itching to write a hasty response that you actually check your facts first.

If you're thinking, "How is it possible that everybody else is wrong about Postgres being the best?", just remember the decade of Mongo fanboism on HN. I cringed during that era, too.

Source: MySQL and Postgres DBA.


Your statements are nothing but opinions. It's also dishonest to make them sound like facts.

If you care this much about security, you should have taken a look at CVE reports of each db. You'll find that historically MySql has almost double the number of vulnerabilities.


> - Postgres' version of upsert allowed any command to be run without checking the permissions. So the vaunted "software engineering" behind Postgres is not that solid.

That never was the case (see evidence in the initial commit [1]). Are you talking about CVE-2017-15099? Obviously annoying that we had that bug, but that's very far from what you claim.

Since you wrote "I'd appreciate if you're itching to write a hasty response that you actually check your facts first", did you actually follow your own advice?

> - Postgres' COPY FROM and COPY TO have been used to compromise the database by copying ssh keys to the server, amongst other things.

If, and only if, the user is a superuser. It's possible to do the same in just about any other client/server rdbms.

> Some constructive advice to the Postgres developers is to take a week and add grant commands to limit COPY FROM and COPY TO

You mean, like it has been the case for ~19 years? https://www.postgresql.org/docs/7.1/sql-copy.html "COPY naming a file is only allowed to database superusers, since it allows writing on any file that the backend has privileges to write on.".

> - Currently Postgres is subject to around a dozen metasploit vulnerabilities that any script-kiddy can execute.

That does not actually seem to be the case (what a surprise). And the modules that do exist are one for a (serious!) vulnerability from 2013, and others that aren't vulnerabilities because they require superuser rights.

[1] https://github.com/postgres/postgres/commit/168d5805e4c08bed...


Using inflammatory language like "The simple fact is, if you use Postgres, you almost certainly have a security compliance problem" or "If you're thinking, "How is it possible that everybody else is wrong about Postgres being the best?", just remember the decade of Mongo fanboism on HN. I cringed during that era, too" really doesn't help get the message through.


I've deployed, administered, and developed with PostgreSQL in a regulated, audited environment with external and internal compliance requirements (US healthcare; database contains protected health information). None of the security model issues you mention is anything like a realistic concern in that environment.

And I'd love to see your sources for all your "exploits".


> I haven't seen anybody maintain the grants correctly in production for non-admin read-only users.

Your opinion against mine, but in my last project I successfully did this. I didn't even have to think about it; the client asked me if I could introduce a role for pure-read-only access and it was done in a few minutes. It helped that I'd already set up all objects, roles, and grants for non-admin access.

OTOH, the privileges system in postgres has always worked when and how I've wanted it to work, unlike my experience with MySQL.
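
For reference, the kind of read-only role described above can be as short as this (names hypothetical):

  CREATE ROLE readonly NOLOGIN;
  GRANT CONNECT ON DATABASE app TO readonly;
  GRANT USAGE ON SCHEMA public TO readonly;
  GRANT SELECT ON ALL TABLES IN SCHEMA public TO readonly;

  -- also cover tables the granting role creates later
  ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO readonly;

  -- give individual login users the role
  GRANT readonly TO alice;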

----

> Postgres' COPY FROM and COPY TO have been used to compromise the database by copying ssh keys to the server, amongst other things.

> Some constructive advice to the Postgres developers is to take a week and add grant commands to limit COPY FROM and COPY TO

From the docs:

7.1 to 9.2: "COPY naming a file is only allowed to database superusers, since it allows writing on any file that the backend has privileges to write on." (Note that there was no support for PROGRAM in these versions, which was introduced in 9.3, and thus the docs changed to ...)

9.3 to 10: "COPY naming a file or command is only allowed to database superusers, since it allows reading or writing any file that the server has privileges to access." (and after this your request for grantable control was, well, granted ...)

11, 12: "COPY naming a file or command is only allowed to database superusers or users who are granted one of the default roles pg_read_server_files, pg_write_server_files, or pg_execute_server_program, since it allows reading or writing any file or running a program that the server has privileges to access."
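
Concretely, that grantable control from 11/12 is a one-liner per role (role name hypothetical):

  -- lets a non-superuser ETL role run COPY ... FROM '/path/to/file'
  GRANT pg_read_server_files TO etl_user;

  -- and, if actually needed, server-side COPY TO a file or COPY FROM/TO PROGRAM
  GRANT pg_write_server_files TO etl_user;
  GRANT pg_execute_server_program TO etl_user;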

And, of course, there's the classic combo of functions with SECURITY DEFINER and EXECUTE grants. For which, again, the doc has always advised: "For security reasons, it is best to use a fixed command string, or at least avoid passing any user input in it."

----

> Postgres' version of upsert allowed any command to be run without checking the permissions. So the vaunted "software engineering" behind Postgres is not that solid.

I'm afraid I couldn't find any info about this thing you mention; and frankly, it's a pretty bold claim. Could you elaborate?

Again, your opinion against mine (and many others'), but the "vaunted" engineering (and _design_) behind Postgres is solid on many fronts, from our experience of running it and using it in many contexts, including security.


If you grant untrusted users administrative access then you're going to get owned, regardless of your chosen DB engine.


Wouldn't a competent DBA take care of those issues? It is an honest question since I don't know much about database administration.


The grant model is a little outdated in some aspects, but with the right tools it is doable. On some level you just accept that you need to read the documentation and follow it. It is just another thing that you need to care about, in case you need it.

I can hardly imagine a case where you have access to the system and install dumps from untrusted sources. That is probably the world of untrusted PHP scripts from Russian forums with stolen software, a world full of a whole spectrum of pains. But I don't agree that it should be considered a weakness. You should never restore dumps from untrusted sources, ever. Such dumps can contain stored procedures with code in PL/Python that can do a lot of shady things. Is having the freedom to use Python a weakness or an advantage? In the world of script kiddies it is a weakness, but that is not a problem of Postgres.


> Postgres' COPY FROM and COPY TO have been used to compromise the database by copying ssh keys to the server, amongst other things

MySQL LOAD_FILE() and SELECT ... INTO OUTFILE have been used to compromise the database by copying SSH keys to the server, amongst other things.

> Currently Postgres is subject to around a dozen metasploit vulnerabilities that any script-kiddy can execute

Such as?


Does running Postgres on AWS RDS mitigate some of these problems?


I don't know about AWS but our Postgres instances are accessible only through internal network connections. I think we are safe.


shhhhhhh! Be quiet, and let natural selection run its course.


Measurably? I'm confused.


I get all the comparison, but Redis? I still use Redis as a cache on top of my PostgreSQL queries. PostgreSQL simply can't be as fast as a K/V like Redis.


> I still use Redis as a cache on top of my PostgreSQL queries

Why? Postgres has a "Redis cache" (in-memory query cache) built in already[1]. Your application layer doesn't have to worry about query caching at all.

1. https://www.postgresql.org/docs/current/runtime-config-resou...



While I have no doubt that Redis is faster for slinging blobs around, that blog post is not a great comparison of the technologies. Both the Django ORM and the Python psycopg2 driver are not performance-oriented tools.


This is so egregious a misrepresentation it's absurd.

Redis in its default configuration is in memory; PostgreSQL stores to disk.

I doubt they were running anything in parallel (Python).

If you tune Postgres by using a RAM disk and actually use parallelism, it blows Redis out of the water hands down.


Yes it can. In fact, it can be faster, since it isn't single-threaded. Learn about tablespaces and RAM disks.
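
A sketch of that setup, assuming a tmpfs mount at a hypothetical path (anything placed there is gone after a reboot, so it only suits cache-style data):

  -- tablespace backed by a RAM-backed filesystem
  CREATE TABLESPACE ramdisk LOCATION '/mnt/ramdisk/pgdata';

  -- UNLOGGED also skips WAL, the other big write cost for throwaway data
  CREATE UNLOGGED TABLE session_cache (
    key   text PRIMARY KEY,
    value jsonb
  ) TABLESPACE ramdisk;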


It's missing the part at the end that says "for general use". Plenty of NoSQL and even MySQL/MariaDB features beat psql in the right problem space.


Technically, PostgreSQL is a DBMS, not a database. I.e. “PostgreSQL is the worlds’ best DBMS”, but not “best database”. The problem is, if you call the DBMS a database, so what do you call the database (collection of data, tables, etc.)?


Is this really a problem?


When I knew nothing about databases and was trying to read about them on the web, it was very confusing trying to understand what the author was talking about when using the term database: is it the DBMS (software), or the database (data)? So, yeah, it's a problem, at least for beginners.


When I learned you cannot change the order of columns in PostgreSQL, it sounded like one of the smarter but slightly weird kids in a classroom.

On the other hand, MySQL, who's not as bright as the smarter kid, is easy to talk to and feels friendlier and is generally the more popular kid.

I do appreciate the strictness of PostgreSQL but if I see small weird stuff, I tend to pick the one that is easier to get along with (meaning, more resource found on the net.)

Also worth mentioning that MySQL is getting "brighter" since version 8.


I'll admit to not being an expert in databases, but I can't figure out why the order of columns in a database would be relevant, unless you're doing `select *` and indexing the resulting columns by number rather than name, which I've always found to be fragile to the point of being useless. What am I missing?


It doesn't matter for coding or functionality, but it is nice when you're manually eyeballing the results of 'SELECT * ...' statements or browsing tables with something like Navicat, or looking at schema dumps - to have your fields ordered in some way that makes sense to you, rather than strictly in the order they were created.

Sure, you can list the fields in the order you want in a SELECT statement, but that's tedious - it's handy to have something reasonable in the table definition.

There's a wiki page on the Postgres site:

https://wiki.postgresql.org/wiki/Alter_column_position

talking about workarounds and a plan from 2006 on how they might implement that feature.


Having migrated a postgres database to dynamodb, I would say that I'm never going to use postgres again.

The problem with all SQL databases is that they are too easy to query and use. You add all kinds of select queries, joins and foreign keys, and when traffic hits you scramble to make it scale. NoSQL is hard to design, but you can at least be sure that once traffic hits, you don't have to redesign the schema to make it scale.


> You add all kinds of select queries, joins and foreign keys, and when traffic hits you scramble to make it scale. NoSQL is hard to design, but you can at least be sure that once traffic hits, you don't have to redesign the schema to make it scale.

Surely this depends on how you set up your SQL database to begin with? I'm not familiar with NoSQL, so can you explain why "schemas" aren't necessary and scaling happens automatically?


You could design a SQL database to be denormalized from the start, but then you are losing many of the advantages of a SQL database.

I never said that schemas aren't necessary, just that using a strict NoSQL database forces you to think about scaling constraints early. This avoids (at least partially) a schema redesign later.


Ah yes! Optimize early, that's what my teachers always told me to do. /s


Code is easy to change. Data schemata much less so, particularly in traditional SQL databases.


You don't have joins in NoSQL, so you have to keep data denormalized. That means data isn't consistent, and you can and will have (at a high enough data volume) errors if you aren't extremely cautious about how you update and add new data. That might matter or not in a particular business case.


I really hope the GP doesn't work on anything handling financial transactions.


If speed is the only thing that matters, he can use /dev/null for writes and /dev/urandom for reads. That's also very scalable.


Some say that scaling is an edge case. What say you?


"In the age old argument about the imperitive vs. the declarative5 programming models, it occurs to me that declarative programming is just imperative programming in a thin disguise. Each declarative token in a database query language ultimately maps to one or several algorithms which apply the declaration in imperitive terms"

Oh, profound, profound!

It occurs to me the author's a n00b.

Anyway, I do MSSQL. Until recently CTEs in PG were an optimisation fence. That would have slaughtered performance for my work. Not that I'm knocking PG, 'best' depends upon the problem.


MSSQL people seem really proud of their CTEs. What is it that makes CTEs in MSSQL so great?

I wonder if that really disqualifies Postgres from the same task. Are MSSQL CTEs just the hammer for your proverbial nail? Can you use a different approach in Postgres to solve the problem by leveraging its strengths?

My day job is a lot of Redshift. We use CTEs and temp tables depending on what we need. It’s based on Postgres but not really comparable when talking about performance optimization.


Here's the last line of my post: "Not that I'm knocking PG, 'best' depends upon the problem." I did not want to get into a war with people emotionally involved in their favourite programming tool. The article said 'best'. By that measure PG was worse. The point is, it doesn't matter; it's about what problem you are trying to solve and at what price. In many cases PG is best for that - do you understand? I knew people would see it as an attack on PG. It's so childish and predictable. I solve business problems, not cheerlead for a particular RDBMS.

As for what made CTEs so great in MSSQL, I just explained that in my first post. Look for 'optimisation fence', and understand that a) PG has changed and b) it was not a personal attack.

> I wonder if that really disqualifies Postgres from the same task.

If you have 100GB of data being materialised as a temp table under the hood instead of having the predicates being pushed down, what do you think?

> Are MSSQL CTEs just the hammer for your proverbial nail?

Jesus, they're just a tool to get a job done. As for whether there is a way round it: probably, but it means more work for the programmer, and more stress for the SQL optimiser, which will at some point fail as complexity climbs.

> We use CTEs and temp tables depending on what we need

same.


It was an honest question, professional to professional. I did not interpret your post as a personal attack nor did I intend my post as an attack on you.

The last time I used MSSQL in anger I was an intern and Michael Jackson was alive. It’s been a while.

My question is: “what is it about your work that makes (or made) MSSQL a better choice than Postgres or something else, specifically because of CTE differences.”


Sorry. When you said "really proud" and "so great", it was perhaps an emotive choice of words.

In the end the semantics of CTEs should be much the same throughout. There may be small differences, and perhaps larger ones such as being able to update through a CTE, something like

  with x as (...)
  update x set ...
I don't know if PG supports this, and it wouldn't break my heart if it didn't. It would be easy to work around.
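
For reference, what PostgreSQL does offer is a data-modifying CTE, where the WITH part itself modifies data and feeds the outer statement through RETURNING (names hypothetical):

  WITH archived AS (
    DELETE FROM tasks
    WHERE done
    RETURNING id, payload
  )
  INSERT INTO tasks_archive (id, payload)
  SELECT id, payload FROM archived;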

Also IIRC PG has 'materialized' and 'recursive' keywords. No biggie.

So the semantics are substantially the same. Difference is, how the optimiser treats it. That means you will get the same results back but the time difference may be enormous. This explains it: https://paquier.xyz/postgresql-2/postgres-12-with-materializ...
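
For example, since version 12 the fence is spelled out and optional (names hypothetical):

  -- pre-12 behaviour: the CTE is materialised and acts as an optimisation fence
  WITH big AS MATERIALIZED (
    SELECT * FROM sales
  )
  SELECT * FROM big WHERE region = 'EU';

  -- 12+ default for a side-effect-free CTE referenced once: the region
  -- predicate can be pushed down into the CTE
  WITH big AS NOT MATERIALIZED (
    SELECT * FROM sales
  )
  SELECT * FROM big WHERE region = 'EU';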

In my previous work, the predicate pushdown (which is the posh term for the optimiser rewriting the query as the above link demonstrates) made CTEs usable. Without them, performance would have destroyed the company.

Can we get round it? Mostly, yes, but it would have been more work for us (meaning more human hours) which the DB optimiser should have saved us from.

The only unique thing CTEs bring to the party is recursion. Otherwise they are no more than a convenience, albeit a great one.

HTH

EDIT: MSSQL has a good optimiser when it works. I recently wrote a trivial CTE and had it take a minute to run, WTF? Looked at the query plan, something was very wrong there, pulled out the CTE part and put it into a temp table, ran in one second. 60x speedup. No idea why, was not impressed though.


You do realize you are the one getting emotional, because you are clinging to an engine that consistently underperforms in benchmarks across the board and costs you a fortune, right?

Hope the job security is worth being an expert in a bunch of defunct practices that nobody will want in 10 years.


> consistently underperforms in benchmarks across the board

Show me an mssql vs PG speed comparison on a recent industry-standard benchmark.

> and costs you a fortune right?

Are you witless enough to think I'm unaware of the wallet-raping, labyrinthine licensing of MSSQL?

> Hope the job security is worth...

Since you apparently can't understand, let me spell it out for you: I've recently installed PG to start learning it because, depending on the problem, it IS the best solution. I do hope you can stop feeling the need to protect your pet software package and understand that these are, in the end, just tools to solve problems.


> Show me mssql vs PG speed comparison on a recent industry standard benchmark.

Does the mssql license allow for this though? Last I heard, they did have a DeWitt clause.


TPC



