For most of the projects where the DB really mattered, across my 10+ years as a freelancer, it came down to the one thing the clients really cared about: performance. Nothing else mattered - not license price, not bells and whistles, not hype. My clients wanted the data in front of their eyes the same second they clicked the button. And when you have a table with 100 million rows in it, and an application can't load data from that table and show it to its users within that second, you have already lost the battle.
Out of my attempts, my clients' attempts, and my friends' attempts, only one DB was up to this task. Not Maria, not Access, not Oracle, not NoSQL, not MSSQL, not MySQL - all of them failed. The only one was PostgreSQL.
And before you start bashing me, please do this. Make a small application that shows a map, put 100 million points of interest on that map, stored in the table we are talking about, and now, as you scroll the map, take the middle of the view as the centre of your circle and, with a small radius, select only those points of interest inside that radius. No more than a thousand points of interest, let's say. When you can do that within a second, you've got yourself a good database. For me, PGSQL was the only one capable of doing this reliably.
While I do love PostgreSQL (and PostGIS which is excellent at transforming, indexing and querying geographic content) - I feel like MSSQL should get a bit of a shout-out in relation to your comment on performance.
An experienced database developer with help from "SET STATISTICS IO ON"[1] and query plans[2] can achieve incredible MSSQL query optimization results.
PostgreSQL has good query plan output via the EXPLAIN[2] statement - but I haven't (yet?) seen PostgreSQL produce per-query IO statistics.
The configuration parameter `track_io_timing = 'on'` will measure I/O time, and running `EXPLAIN (ANALYZE, BUFFERS)` will output per-plan-node buffer statistics and I/O time spent. On most modern systems I/O timing has no measurable overhead and should be permanently enabled. Collecting buffer statistics is also relatively cheap and can be enabled for all queries. For example, the following configuration will get a log entry with an explain plan and per-node I/O stats for every query above a threshold:
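A sketch of such a configuration, using the auto_explain module (the one-second threshold is only an illustrative value, and changing shared_preload_libraries requires a server restart):

    # postgresql.conf
    shared_preload_libraries = 'auto_explain'
    track_io_timing = on
    auto_explain.log_min_duration = '1s'  # only log queries slower than this
    auto_explain.log_analyze = on         # include actual row counts and timings
    auto_explain.log_buffers = on         # include per-node buffer and I/O statistics
    auto_explain.log_timing = on          # per-node timing (requires log_analyze)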
You don't need to log an auto_explain plan for every query. Just enable track_io_timing and pg_stat_statements and you get per-query IO performance metrics much more cheaply.
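For example (a rough sketch; pg_stat_statements must be in shared_preload_libraries and created as an extension, and the column names are those of PostgreSQL 12 - newer versions split total_time into plan and exec time):

    -- CREATE EXTENSION pg_stat_statements;
    SELECT query,
           calls,
           total_time,       -- total execution time in ms
           shared_blks_hit,
           shared_blks_read,
           blk_read_time,    -- ms spent reading blocks (needs track_io_timing)
           blk_write_time
    FROM pg_stat_statements
    ORDER BY blk_read_time DESC
    LIMIT 10;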
Thanks for this - 'pg_stat_statements' definitely looks like a similar feature, in particular because it reflects the count of disk-related operations (not just I/O time).
It took some brief configuration but I've been able to try it out locally and will refer to it when doing PostgreSQL query performance tuning in future.
At the end of the day, the PostgreSQL optimizer is just great, and will probably beat anything in a well-balanced comparison.
But MSSQL also has a fairly good optimizer (of the ones the GP listed, those are the only two good options), so you can tune it to do better than an untuned Postgres. It is also possible (but unlikely) that you have a problem where it fits better.
About Oracle: it is pretty great at doing `select stuff from table join table using (column)`. (I dunno what it does, but it does spend a lot of memory doing it.) So if that's all you do, it's the best option. But if you do anything else, that anything else will completely dominate the execution time.
For me, MySQL’s quirkiness and lesser adherence to the SQL standard kind of neuter any performance niftiness it might have. I can see using MySQL for something like logging but for production data, I prefer PostgreSQL. It’s much more predictable.
It depends a lot on the implementation strategy. Some databases have GIS features, allowing you to query all points within a radius directly. I know Oracle can do this, but it's a paid add-on (Spatial and Graph) and I don't know how well it scales.
On databases that have good scanning behavior you can use a geohash approach instead. Insert all points with a geohash as key, determine a geohash bounding box that roughly fits your circle, query the rows matching that geohash in their key, run a “within distance” filter on the results client-side. I used that approach in hbase and it could query really fast. I suspect it would perform well on pretty much any mainstream sql db, but admittedly I haven’t tried it.
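A rough sketch of that approach in SQL (table, column names and the example geohash prefix are made up; any engine that can range-scan a string key should behave similarly):

    -- store each point keyed by its precomputed geohash
    CREATE TABLE poi (
      geohash   varchar(12) NOT NULL,
      name      varchar(200) NOT NULL,
      latitude  double precision NOT NULL,
      longitude double precision NOT NULL,
      PRIMARY KEY (geohash, name)
    );

    -- pick a geohash prefix whose cell roughly covers the search circle
    -- (in practice you also query the neighbouring cells so the circle
    -- isn't cut off at a cell boundary) and range-scan on it
    SELECT name, latitude, longitude
    FROM poi
    WHERE geohash LIKE '9q8yy%';

    -- finally, apply the exact "within distance" filter to the small
    -- result set on the client side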
For true geospatial data models that correctly implement most geospatial predicates and operators, your only practical options are PostGIS or Oracle Spatial. This is a very complex and nuanced piece of mathematics and software engineering work that requires writing a massive amount of code; even most closed-source databases don't do a real implementation due to the difficulty. Most developers are unaware of just how broken most half-assed implementations are; the databases do just enough to check a box, but not enough to actually be useful if you are doing real geospatial work.
Geohashing has a lot of weaknesses and limitations, not really recommended unless your use case is very simple and not primarily geospatial in nature. The modern version was created in the 1990s and implemented by big database companies like Oracle and IBM (based on work in the 1960s and '70s). They deprecated that indexing feature in their database engines a long time ago for good technical reasons that apply to current open source implementations.
Geospatial index performance is very sensitive to temporal and geometry distribution, and the performance variance can be extremely high. For non-trivial applications, the real world will eventually produce data that inadvertently executes a denial of service attack on naive geospatial databases.
Yeah, PostGIS is a must if you plan on doing any sort of geospatial work with Postgres. PostGIS is way ahead of most other databases when it comes to processing geographic (3D) types. For example, calculating a 100 km buffer zone for 2D geometry types is easy, but only PostGIS can support buffer zone calculations on 3D geographic types. Because of the curvature of the earth, the math is quite involved, but it can make a big difference in accuracy.
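For example (a small sketch; the coordinates are arbitrary), a buffer computed on the PostGIS geography type does the spheroidal math for you, whereas a buffer on a planar geometry type would be badly off at this scale:

    -- 100 km buffer around a point, computed on the spheroid
    SELECT ST_Buffer(
             ST_GeogFromText('POINT(-0.1276 51.5072)'),  -- lon/lat of London
             100000                                      -- radius in metres
           );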
Oracle now has Spatial included in the base license, and Graph, which is paid. Spatial solves a lot of GIS needs, but lacks some features, in particular 3D objects.
MSSQL has a GIS package included in the base database. I'm not sure how feature-complete it is now, but last time I looked (years ago) it was missing a lot.
MySQL also has a GIS package included in the database. It's missing a lot of things.
The top of the line is, as usual, Postgres. But honestly, you probably won't need the difference.
For a reddit-like social site, we have a MariaDB database with around 3 billion rows in total, with the largest table (votes) spanning about 1.5 billion rows.
We can select data within ~3ms for about 20k-40k concurrent visitors on the site. Our setup is a simple master-slave config, with the slave only as backup (not used for selects in production). Our 2 DB servers have 128 GB of RAM and cost around 200 EUR/month each - so, nothing special.
Of course you have to design your schema with great care to get to this point, but I imagine PG to be as lost as any other database if a crucial index is missing.
This doesn't add up. 100M rows isn't a lot. All of those databases can do it, and commercial systems like MSSQL and Oracle have millions in R&D invested in performance and geospatial features.
It's not realistic that Postgres would outclass everything else like this.
So you have more complicated SQL then, but it doesn't make a difference between DBs selecting from a 100M-row table with indexes.
If it ends up being a full table scan, then PG only recently got query compilation and parallel execution, so at best it would be even with, but not better than, MSSQL/Oracle, which can do vectorized execution on columnstores.
Do you have an example query? If you're not using geospatial features, then what indexes does PG offer that the others can't match (either directly with indexes or with better scan/query performance)?
Why are you talking about not using geospatial features? I was just responding to your
> Even if they didn't, there are plenty of algorithms based on lat/lon which uses numeric/float data types with simple indexes.
statement. Which I still think is not a meaningful alternative. And then subsequently to your statement about vectorized columnstores.
I'm not at all saying that only postgres has geospatial indexing support (although PostGIS is pretty good) - that would be ridiculous.
> then what indexes does PG offer that the others can't match (either directly with indexes or with better scan/query performance)?
I do think that postgres' indexing support (gist, gin in particular) makes it easier to add support for indexing types of data that are harder to index in many other databases. But it's nothing that can't be beat.
The point is that without geospatial features, you can still use other strategies like lat/long or geohash and they will all have the same performance.
The logic (built-in geospatial or hand-written sql) is irrelevant to the performance comparison and GP's post that Postgres was the only one that could query quickly.
Do you disagree with that? If not then I'm not sure what your argument is.
TBH, that should be easily doable for any database. I've built something like this with SQLite and on another occasion with Elasticsearch (where it's obviously much easier). If you leverage geohashes and precalculate the geohash for every location, you don't even need any built-in geo capabilities from the database.
>And before you start bashing me, please do this. Make a small application that shows a map, put 100 million points of interest on that map, stored in the table we are talking about, and now, as you scroll the map, take the middle of the view as the centre of your circle and, with a small radius, select only those points of interest inside that radius. No more than a thousand points of interest, let's say. When you can do that within a second, you've got yourself a good database. For me, PGSQL was the only one capable of doing this reliably.
This is a bad example. An SQL database is a wrong tool for the job if you care about performance.
You want a data structure like a range tree[1] or a k-d tree[2]. The results would be near-instantaneous.
100 million lat/lng pairs is something you can fit in RAM without any problem. And you can even do the geometry query on the client side for performance in milliseconds.
Other examples might need SQL, but here, this would be doing things the wrong way.
While SQL will not necessarily give you the optimal solution, a btree index on a geohash will still give you O(log(n)) lookups on spatial data, which will be good enough for most use cases, and pretty much every database engine in existence can create an index on a string, which is all you need for that solution.
Sorry, it is not at all obvious to me how you can implement a query of the form "all locations within 100 miles of lat/lng" with a geohash (as in, "give me all stores within driving distance").
If you mean that nearby places have geohashes with large common prefixes, then the "edge" cases where this won't work include such little-known, unpopular cities as London, UK (through which the Greenwich meridian passes).
That's to say, East and West London geohashes have no common prefix. The Uber driver picking you up in Greenwich will drop you off in a geohash which starts with a different letter.
To point out the obvious, the surface of the Earth is two-dimensional, and a database index is one-dimensional. A solution to the 2D query problem using a single 1D index range is mathematically impossible (if one existed, you would be able to construct a homeomorphism between a line and a plane, which does not exist).
Sure, geohashes can be used to compute the answer. But you need to know more information than just the letters in the geohash to find out which geohash partitions are adjacent to a given one. That extra information would be equivalent to building a space partitioning tree, though, and you still won't be able to use a range query to get it.
If there's an SQL range query-based approach with geohashes which compromises on correctness but is still practical, I'm all ears.
The solution there involves first producing a geohash cell covering. Of course, this means you have a BSP-like structure backed by a database.
---
PPS: the quick-and-dirty way using SQL would be keeping a table with lat / lng in different columns. It is not hard to define a lat/lng box around a point that has roughly equal sides, using basic trig.
You can filter locations outside the box with SQL, and then just loop over the results to check distances and fine-tune the answer (if needed). However, it will not be performant.
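A minimal sketch of that quick-and-dirty box filter (table and column names are made up; 111.32 km per degree of latitude is the usual approximation):

    -- places(latitude, longitude) with plain b-tree indexes on both columns;
    -- centre 51.5072, -0.1276; radius 10 km
    SELECT *
    FROM places
    WHERE latitude  BETWEEN 51.5072 - 10 / 111.32
                        AND 51.5072 + 10 / 111.32
      AND longitude BETWEEN -0.1276 - 10 / (111.32 * COS(RADIANS(51.5072)))
                        AND -0.1276 + 10 / (111.32 * COS(RADIANS(51.5072)));

    -- then loop over the surviving rows in the application and keep only
    -- those within 10 km by a proper great-circle distance check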
Well, a client invited me to evaluate PostgreSQL. A strong MSSQL shop: thousands of databases, millions of customers, tens of thousands of internal users. A big pile of licensing money they were eager to save.
We investigated the options for LDAP and Kerberos integration, commercial and free. It turns out there is no decent way to do it and grant permissions based on LDAP groups. There wasn't even a half-decent way. The only options were way too hackish to consider. MS SQL still owns the house.
Why would you want LDAP integration for a database? Maybe that kind of messed up requirement is part of the problem?
One of the bigger issues with places that have made themselves dependent on MS solutions is that MS software completely permeates the place. Then, when the pain of going full MS gets too big, they only look for drop-in replacements of existing parts. This will never go well. MS software never plays well with alternatives.
If you have let yourself slide too far down the slippery slope of MS Office integrations and MS Sharepoints and the like, you will have to pay the MS price. Or you have to be willing to chuck out the entire lot.
Because certain groups of users need the same sets of permissions on hundreds of databases and servers, and the members of those groups are constantly changing. In addition, this ties an identity across systems, and facilitates SSO on the one hand and auditing on the other. There's more to it, but these are the most obvious.
This is a common, standard requirement in enterprise systems, and it seems quite sensible. If you have a better, more robust solution, there are billions in this market, in the most literal way.
One more thing: although Active Directory is the standard in the enterprise, LDAP is an open protocol that existed before Microsoft was dreaming of being a player in the enterprise field, and it has at least half a dozen open source implementations.
> Because certain groups of users need the same sets of permissions on hundreds of databases and servers, and the members of those groups are constantly changing. In addition, this ties an identity across systems, and facilitates SSO on the one hand and auditing on the other. There's more to it, but these are the most obvious.
That is precisely what I meant by messed-up requirements. Why would any of your SSO be relevant at the database level? A database user should be coupled to an application, not to a single user in AD. If you are letting constantly changing users run their own SQL queries at the DB level, there is something rotten in the first place.
> This is a common, standard requirement in enterprise systems, and it seems quite sensible. If you have a better, more robust solution, there are billions in this market, in the most literal way.
I have never come across this kind of setup in a normal enterprise environment. There, you have everything behind some kind of enterprise software. Data entry, auditing, etc.
However, you could have need for this if we are talking developers sharing a single database and you don't want to manage accounts and passwords separately.
You said the solutions you found for Postgres were too "hacky". I think if you have a setup where enterprise users who are not developers or DBAs need personal access on DB level, your setup is quite "hacky" to begin with.
> One more thing: although Active Directory is the standard in the enterprise, LDAP is an open protocol that existed before Microsoft was dreaming of being a player in the enterprise field, and it has at least half a dozen open source implementations.
Microsoft has bent LDAP and Kerberos in its own way. Trying to use AD like an LDAP server is full of unpleasant surprises dealt to you by badly documented or outright undocumented "features", and useless error messages and logging. Believe me, it's a major nightmare to get anything integrated with AD.
There are tens of DBAs, hundreds of developers, hundreds of application users, those are the users in those groups.
I have done my fair share of integration with AD, including in some of the largest enterprises existing on this planet, where you have hundreds of groups of developers and admins. I read the RFCs and the documentation, in most cases, it is actually quite good and detailed, both on their LDAP and on their Kerberos implementation details.
> I have never come across this kind of setup in a normal enterprise environment. There, you have everything behind some kind of enterprise software. Data entry, auditing, etc.
Well, for you everything is behind the enterprise software, but I'm among those who need to implement it.
> There are tens of DBAs, hundreds of developers, hundreds of application users, those are the users in those groups.
Ah, so this is a special case with a shop that has developers and DBAs in their user base. Now that I can understand.
> I read the RFCs and the documentation, in most cases, it is actually quite good and detailed, both on their LDAP and on their Kerberos implementation details.
Then you know what a giant mess all of it is. I have done my share of LDAP/AD integration as well and found that LDAP is a very malleable tool that has been bent in many ways. There are many details to get right. MS use a certain subset of schemata and have added their special flavoured behaviour in the background. That means what the AD admin sees is not exactly what an LDAP query might return. Similarly Kerberos or any other authentication scheme, e.g. SAML. Getting all of this right from the view of a third-party application is not simple. That's why I'm never surprised that pretty much all of them are incomplete or incorrect in this manner.
Oracle offers a similar integration for AD. Do you happen to have any experience with it and if so, how does it compare to MS SQL?
I was never asked to give a hand with Oracle integration, hence I believe it is quite straightforward.
In my experience, third-party tools, i.e. ldapsearch, usually don't implement all AD options; if you use them for an LDAP query with AD features you will receive a legitimate reply, but it will not be human-readable, as it was not parsed by the tool. However, the documentation from Microsoft on this front is actually surprisingly extensive, and if anyone felt like it they could implement the correct parsing. It is also quite safe to assume at this point that intentional future changes in those areas are extremely unlikely.
Reading and understanding the LDAP and Kerberos implementation is quite a task, and I can fully understand why no one sat to re-implement it all in open source.
Most of the time, even features that are implemented in open source libraries, for stuff like GSSAPI as an example, are not implemented in the software that uses them. And in this I find PostgreSQL is extremely lacking.
In comparison, Apache lets you use LDAP groups for ACLs; Nginx has an unmaintained plugin stripped from the Apache base code, and it only implements support for LDAP users, not groups. PostgreSQL will only let you authenticate users that match users you have already defined in PostgreSQL.
I'm far from being a Microsoft fan in general, and I suffer whenever I have to use their OS. But AD is a robust piece of work and pretty much the only game in town. I feel that it is not in their focus right now, as it isn't part of their everything-Azure strategy, and that's a shame.
This is very typical in an enterprise shop where you have end users connecting directly to the database for reporting purposes. Being able to take advantage of your existing directory structure and security groups is huge.
Why do end users have direct access to a database? What do they do with it? Copy and paste from their Excel sheets into a GUI database client? Why is there no layer of software in between?
Enterprise setups have user counts measured in the tens of thousands.
There will be existing tooling around ACL management (typically AD). User Access Management groups, geo-separated local account manager, multiple help-desks to allow for disaster plans (pandemics, fires, localized grid failures).
Access can be as simple as:
- my application will use a system-account, and we'll manage access
- my application will pass user credentials (based on AD), these are the groups (based on AD) that have read/write/delete access; regional team leaders will request access using existing UAM, regional managers will authorize requests, your existing tooling will process those requests as you are the designated authority over those groups
Changing this requires either exceptional levels of observable returns on effort, or extreme will-power and political power.
During evaluation of MS SQL / PostgreSQL / whatever, fitting into existing UAM will be a requirement.
Because the world and especially the enterprise world is full of smart analysts who aren't programmers but who are pretty hot shots with SQL and Excel. They can answer a LOT of ad-hoc business queries just with those two tools and don't need expensive slow programmers to spend 3 months twiddling JIRA tickets when they could get the same answer in a few hours, when they have direct database access.
Exactly. Why should I have to rewrite a whole ACL infrastructure when I can store it all in one place, and it doesn't matter how you access the database, it's all enforced.
Not at all, if it were possible. But PostgreSQL only maps users that already exist in PostgreSQL to LDAP users, matching on the user name. And there is no way to map the groups. That's ridiculous.
The hackish implementations sync users from LDAP to PostgreSQL on a schedule. There are more than 3 different implementations of this madness.
I've done a lot of DBA'ing as a dev/sysadmin/SRE in the past 25 years. While performance matters greatly, there are two features that make me prefer PostgreSQL and that I would rate higher: stability and maintainability.
I have mainly used SQL Server, but some Oracle and Informix earlier. I think all of these have been rock solid. When working as a consultant you rarely hear anyone saying that the db was unstable.
Heh, I agree with you minus one consulting case: over a span of 10 years, my client had built an unholy disaster of Oracle PL/SQL on top of their not-great schema. While I wouldn’t call it unstable (queries did always return), query performance would vary dramatically even with the same inputs on an isolated install. I spent more time than I should have trying to track it down, but ultimately gave up.
What was my task? Add XML export into a poorly done XML schema, and do it all in PL/SQL; the data that went into the XML was a mixture of normal queries and computed PL/SQL results.
That schema + stored procedure monstrosity sticks out in my mind as the most unholy disaster I've ever worked on.
Not sure what strategy you tried to retrieve the data but I did this kind of stuff years ago with MSSQL even before it had geospatial types and it was fast.
If you have LAT/LON in the table as normal indexed columns, you just query the bounding box of your radius so it can use the standard b-tree indexes, then post-filter to a circle, cutting off the corners. Of course you have to do some work to account for the curvature of the earth, but this is pretty standard geospatial stuff.
If you have a DB with geospatial types and indexes like MSSQL 2008+, Oracle, PG etc then this becomes trivial as they can do this directly.
That might be selection bias: those who look for you are mostly concerned about performance. And performance is usually not "solved" -- it's often a treadmill where hardware and software performance is trying to outrun the data growth -- so people who need performance are constantly coming back for more (usually with a pile of cash).
A lot of users care mostly about accessibility, simplicity, integration with existing experience/tools (e.g. JSON), manageability, ecosystem, etc. Those users might be happy from the start; or if they aren't happy, they complain for a while until their problem is solved, and then go silent.
Mysql can do this just fine with a geospatial index, or, for that matter, with a geohash index[1] (though in the case of a geohash index, you will need a somewhat more complex query). In fact, any database at all that can index on a string (i.e. all of them) can do efficient location-based search.
Honestly, if it's taking your database more than a second to pull 1000 of 100M rows on a simple query, that means you need to figure out what's wrong with your indexes, not that your choice of database vendor is bad. It is true that postgres is a very nice database, but for basic tasks, honestly any database will be fine if you know how to use it effectively.
Demo in mysql:
-- create a helper table for inserting our 100M points
create table nums (
id bigint(20) unsigned not null
);
insert into nums (id)
values (0), (1), (2), (3), (4), (5), (6), (7), (8), (9);
-- the table in which we will store our 100M points
create table points (
id bigint(20) unsigned NOT NULL AUTO_INCREMENT,
latitude float NOT NULL,
longitude float NOT NULL,
latlng POINT NOT NULL,
primary key (id),
spatial key points_latlng_index (latlng)
);
-- insert the 100M points with latitude between 32 and 49
-- and longitude between -120 and -75, which is roughly a
-- rectangle covering the continental US, placing them
-- randomly. Be patient, as it takes a while to create
-- 100M rows.
insert into points (latitude, longitude, latlng)
select
ll.latitude,
ll.longitude,
ST_GeomFromText(concat('POINT(',
ll.latitude, ' ', ll.longitude,
')')) as latlng
from (
select
32 + 17 * conv(left(sha1(concat('latitude', ns.n)), 8), 16, 10) / pow(2, 32) as latitude,
-120 + 45 * conv(left(sha1(concat('longitude', ns.n)), 8), 16, 10) / pow(2, 32) as longitude
from (
select
a.id * pow(10, 0) + b.id * pow(10, 1)
+ c.id * pow(10, 3) + d.id * pow(10, 4)
+ e.id * pow(10, 5) + f.id * pow(10, 6)
+ g.id * pow(10, 7) + h.id * pow(10, 8)
as n
from nums a join nums b join nums c join nums d
join nums e join nums f
join nums g join nums h
) as ns
) as ll;
-- Generate a roughly circular boundary polygon 7km around
-- the center of Las Vegas
set @earth_radius_meters := 6371000,
@lv_lat := 33.17,
@lv_lng := -115.14,
@search_radius_meters := 7000;
set @boundary_polygon := (select
ST_GeomFromText(concat('POLYGON((', group_concat(
concat(boundary_lat, ' ', boundary_lng)
order by id
separator ', '), '))')
) as boundary_geom
from (
select
nums.id,
@lv_lat + @search_radius_meters * cos(nums.id * 2 * pi() / 9)
/ (@earth_radius_meters * 2 * pi() / 360) as boundary_lat,
@lv_lng + @search_radius_meters * cos(@lv_lat * 2 * pi() / 360) * sin(nums.id * 2 * pi() / 9)
/ (@earth_radius_meters * 2 * pi() / 360) as boundary_lng
from nums
) t);
explain select count(*) from points where ST_Contains(@boundary_polygon, latlng)\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: points
partitions: NULL
type: range
possible_keys: points_latlng_index
key: points_latlng_index
key_len: 34
ref: NULL
rows: 987
filtered: 100.00
Extra: Using where
1 row in set, 1 warning (0.01 sec)
mysql> select count(*) from points where ST_Contains(@boundary_polygon, latlng)\G
*************************** 1. row ***************************
count(*): 1228
1 row in set (0.01 sec)
We were on Oracle for 15 years, but the cost was just too high. We decided to move to Postgres. We thought it wouldn’t be as good as Oracle (you get what you pay for, right?), but it ended up being better. Performance is better. Documentation is better (by far). Adherence to SQL standard is better. Many SQL queries are simpler. Null/empty string is sane. Etc.
Now I don’t have experience with MySQL, SQLServer, etc., but between Oracle and Postgres, Postgres is definitely the best database.
I think a big portion of this isn't a difference in quality, but rather complexity. Postgres is just easier to understand than Oracle. Oracle has insane amounts of settings and configurable parts. The end result is that people who admin Postgres servers tend to learn the levers, but many Oracle DBAs simply don't, or have environments where hand-tuning is really hard to justify time-wise. I think Postgres's success is that it's often in environments where DBAs and devs work closely or are the same person.
I use PostgreSQL and MS SQL Server. I love Postgres. There are some things that SQL Server does differently that I would love to see supported in Postgres:
* Protocol: (a) no wire-level named parameter support; everything must be by index. (b) Binary vs. text is not great, and the binary protocol details are mostly "see source". (c) No support for inline cancellation: to cancel a query the client can't signal on the current TCP connection; it must open or use a different connection to kill an existing query. This is sometimes okay, but if you are behind a load balancer or pooler this can end up being a huge pain.
* Multiple Result Sets from a single query without hacks. While you can union together results with similar columns or use cursors, these are hacks and work only in specific cases. If I have two selects, return two result sets. Please.
* Protocol-level choice of language. Yes, you can program in any language, but submitting an arbitrary query in any language is a pain and requires creating an anonymous function, which is a further pain if your ad-hoc query has parameters. I would love a protocol-level flag that allows you to specify "SQL", "PL/pgSQL", "cool-lang-1" and have it execute the start of the text as that.
Also, true index-organized tables, a.k.a. real clustered indexes.
Oh, and real cross-connection query plan caching; prepared statements are only per connection and must be explicitly used. There's been no need to use prepared statements in MSSQL since the '90s.
Another thing I'd add is a quasi-sorted uuid generator built-in. Real clustered indexes need approximately sorted UUIDs - which could be version 1 UUIDs with an artificial, per-database MAC address, or they could be something more esoteric such as Twitter's snowflake ID generator.
Using UUIDs for PKs is fine and dandy, but clustered indexes and type 4 UUIDs do not play well together. Many MS SQL users discover this the hard way when their toy database suddenly has orders of magnitude more rows.
I've had to personally avoid MERGE statements due to implementation problems in MS SQL Server. One system just updated to 2019, so maybe it is better now. I also have to look up the syntax each time, but that might just be me.
In some projects the MERGE helped me a lot and in my previous job, all of our integration scripts were merge based. You just learn to avoid the pitfalls, but you're right that it does have some: https://www.mssqltips.com/sqlservertip/3074/use-caution-with...
I'm sure if postgres had it, it wouldn't have had most of those problems.
Sure. I have two roles: database driver/client maintainer and database based application developer and maintainer.
As a database Driver/Client developer:
* Having two protocols/data types that do the same thing (text / binary) isn't the end of the world, but it just adds confusion. Also it adds complexity for a different server/proxy/pool implementation. Recommendation: better document binary and add an intent to remove text data types.
* Not having an inline cancellation (in same TCP/IP connection) means cancellation isn't supported by many drivers, and even when it is, there are many edge cases where it stops working. Each client implementation has to work around this.
As an application developer, I typically see three patterns emerge:
1. Put SQL on SQL platform: stored functions / procs. Application only calls stored functions / procs.
2. Put SQL on Application platform: simple stored SQL text, query builders.
3. Make the application dumb and use simple CRUD queries, putting the logic fully in the application.
I typically find myself in category (2), though I have no beef with (1). Benefits of (2) include: (a) easier to create dynamic search criteria and (b) use a single source of schema truth for application and database.
* Multiple Result Sets:
- (A) The screens I make often have a dynamic column set. To accomplish this I may return three result sets: row list, column list, and field list. This works for my screens and for printing XLSX sheets that can auto-pivot the data, while all the data transfer columns can be statically known and pre-declared. This allows me to edit a pivoted table because each cell knows its origin row.
- (B) Any reasonable amount of work in SQL may be long-ish (300-1000+ lines of SQL). There are often many intermediate steps and temp tables. Without multiple result sets, it is difficult to efficiently return the data when there are often multiple arities and column sets that are all relevant for analysis or work. So if I take a fully application-centric view (more like application mindset (3)), you just call the database a lot. But if you are more database-server-centric (1) or (2), this can pose a real problem as the complexity of the business problem increases. (I'm aware of escape hatches, but I'm talking about straightforward development without resorting to alternative coding.)
- (C) Named parameters are extremely useful for building up query parameters in the application (development model (2)). You can specify a query where-clause snippet and the parameter name, and have the system compose it for you; that is impossible with ordinal queries. A driver or shim can indeed use text replacement, but that (a) adds implementation cost, (b) server computation / allocation cost, and (c) mental overhead when you trace the query on the server. Furthermore, it composes poorly with stored functions and procs on the server (in my opinion). It is again not insurmountable, but it is another thing that adds friction. Lastly, when you have a query that takes over 30 or 50 parameters, you must use named parameters; ordinal positioning is too error-prone at that scale.
- (D) Protocol level language selection. PostgreSQL always starts execution in a plain SQL context. If you always execute functions as PL/pgSQL, that is just extra overhead. In addition, running ad-hoc PL/pgSQL with named parameters isn't the easiest thing. It is possible, just not easy. This feature plus named parameters would mean that by the time I'm writing my query, I know (a) that I have all my sent parameters available to me, bound to names, and (b) that the first character I type is in the language I want.
The combination of these features would take applications developed in model (2) from rather hard to extremely easy. I would contend it would make other development modes easier as well.
> Recommendation: better document binary and add an intent to remove text data types.
I would note that it's required in Postgres for custom types to implement in/out (text encoding), but not required to implement send/recv (binary encoding.) In case send/recv isn't implemented, the datum falls back to using in/out even over the binary protocol.
Given the number of third-party extensions, and the ease of creating your own custom types even outside an extension, I'd expect that deprecating the text protocol will never happen. It's, ultimately, the canonical wire-format for "data portability" in Postgres; during major-version pg_upgrades, even values in system-internal types like pg_lsn get text-encoded to be passed across.
Meanwhile, binary wire-encoding (as opposed to internal binary encoding within a datum) is just a performance feature. That's why it's not entirely specified. They want to be able to change the binary wire-encoding of types between major versions to get more performance, if they can. (Imagine e.g. numeric changing from a radix-10000 to a radix-255 binary wire-encoding in PG13. No reason it couldn't.)
> * Having two protocols/data types that do the same thing (text / binary) isn't the end of the world, but it just adds confusion. Also it adds complexity for a different server/proxy/pool implementation. Recommendation: better document binary and add an intent to remove text data types.
I don't see us removing the textual transport, unfortunately. The cost of forcing all clients to deal with marshalling into the binary format seems prohibitive to me.
What's the server/proxy/pool concern? I don't see a problem there.
> Not having an inline cancellation (in same TCP/IP connection) means cancellation isn't supported by many drivers, and even when it is, there are many edge cases where it stops working. Each client implementation has to work around this.
Yea, it really isn't great. But it's far from clear how to do it inline in a robust manner. The client just sending the cancellation inline in the normal connection would basically mean the server-side would always have to eagerly read all the pending data from the client (and presumably spill to disk).
TCP urgent or such can address that to some degree - but not all that well.
> - (C) Named parameters
I'm a bit hesitant on that one, depending on what the precise proposal is.
Having to textually match query parameters for a prepared statement on each execution isn't great. Overhead should be added per-prepare, not per-execute.
If the proposal is that the client specifies, at prepare time, to send named parameters in a certain order at execution time, I'd not have a problem with it (whether useful enough to justify a change in protocol is a different question).
> A driver or shim can indeed use text replacement ... b) server computation / allocation cost
How come?
> - (D) Protocol level language selection. PostgreSQL always starts execution in a plain SQL context. If you always execute functions as PL/pgSQL, that is just extra overhead. In addition, running ad-hoc PL/pgSQL with named parameters isn't the easiest thing. It is possible, just not easy. This feature plus named parameters would mean that by the time I'm writing my query, I know (a) that I have all my sent parameters available to me, bound to names, and (b) that the first character I type is in the language I want.
I can't see this happening. For one, I have a hard time believing that the language dispatch is any sort of meaningful overhead (unless you mean for the human, while interactively typing?). But also, making connections have state where incoming data will be completely differently interpreted is a no-go imo. Makes error handling a lot more complicated, for not a whole lot of benefit.
> Yea, it really isn't great. But it's far from clear how to do it inline in a robust manner.
How would you feel about Postgres listening over QUIC instead of/in addition to TCP?
It seems to me that having multiple independently-advancing "flows" per socket, would fix both this problem, and enable clients to hold open fewer sockets generally (as they could keep their entire connection pool as connected flows on one QUIC socket.)
You'd need to do something fancy to route messages to backends in such a case, but not too fancy—it'd look like a one-deeper hierarchy of fork(2)s, where the forked socket acceptor becomes a mini-postmaster with backends for each of that socket's flows, not just spawning but also proxying messages to them.
As a bonus benefit, a QUIC connection could also async-push errors/notices spawned "during" a long-running command (e.g. a COPY) as their own new child-flows, tagged with the parent flow ID they originated from. Same for messages from LISTEN.
> How would you feel about Postgres listening over QUIC instead of/in addition to TCP?
At some point? Probably not a bad idea. Right now? Doubtful it's worth it - the stack doesn't seem ready yet.
> and enable clients to hold open fewer sockets generally (as they could keep their entire connection pool as connected flows on one QUIC socket.)
I don't think it'd be a meaningful step towards that.
We basically have to move to a threaded model to significantly improve the situation.
> You'd need to do something fancy to route messages to backends in such a case, but not too fancy—it'd look like a one-deeper hierarchy of fork(2)s, where the forked socket acceptor becomes a mini-postmaster with backends for each of that socket's flows, not just spawning but also proxying messages to them.
Not gonna happen. The added context switches would be a very significant increase in overhead. The added processes themselves aren't free either. The postmaster not being the immediate parent would complicate the state handling a good bit.
"better document binary and add an intent to remove text data types"
For all-binary, the client would have to know about all data types and how to represent them in the host language. But that seems clunky. Consider NUMERIC vs. float vs. int4 vs int8: should the client really know how to parse all of those from binary? It makes more sense to optimize a few common data types to be transferred as binary, and the rest would go through text. That also works better with the extensible type system, where the client driver will never know about all data types the user might want to use. And it works better for things like psql, which need a textual representation.
The main problem with binary is that the "optimize a few columns as binary" can't be done entirely in the driver. The driver knows which types it can parse, but it doesn't know what types a given query will return. The application programmer may know what types the query will return, in which case they can specify to return them in binary if they know which ones are supported by the client driver, but that's ugly (and in libpq, it only supports all-binary or all-text). Postgres could know the data types, but that means that the driver would need to first prepare the query (which is sometimes a good idea anyway, but other times the round trip isn't worth it).
"Not having an inline cancellation (in same TCP/IP connection)"
This is related to another problem, which is that while a query is executing it does not bother to touch the socket at all. That means that the client can disconnect and the query can keep running for a while, which is obviously useless. I tried fixing this at one point but there were a couple problems and I didn't follow through. Detecting client disconnect probably should be done though.
Supporting cancellation would be trickier than just looking for a client disconnect, because there's potentially a lot of data on the socket (pipelined queries), so it would need to read all the messages coming in looking for a cancellation, and would need to save it all somewhere in case there is no cancellation. I think this would screw up query pipelining, because new queries could be coming in faster than they are being executed, and that would lead to continuous memory growth (from all the saved socket data).
So it looks like out-of-band is the only way cancellation will really work, unless I'm missing something.
"Multiple Result Sets... you just call the database alot"
Do pipelined queries help at all here?
"Named parameters are extremely useful for building up query parameters"
+1. No argument there.
"Protocol level language selection"
I'm still trying to wrap my head around this idea. I think I understand what you are saying and it sounds cool. There are some weird implications I'm sure, but it sounds like it's worth exploring.
> The driver knows which types it can parse, but it doesn't know what types a given query will return. [...] Postgres could know the data types, but that means that the driver would need to first prepare the query (which is sometimes a good idea anyway, but other times the round trip isn't worth it).
We return the types of the result set separately even when not preparing. The harder part is doing it without adding roundtrips.
I've argued before that we should allow the client to specify which types it wants as binary, unless explicitly specified. IMO that's the only way to solve this incrementally from where we currently are.
> Do pipelined queries help at all here?
We really need to get the libpq support for pipelining merged :(
"I've argued before that we should allow the client to specify which types it wants as binary, unless explicitly specified. IMO that's the only way to solve this incrementally from where we currently are."
We should allow the client, on the protocol level, to add (and presumably remove) types to a server side list of types that can automatically be sent in binary if the result format is specified as 'auto'.
That way a driver can add handling for ~10 builtin types (bytea, int[48], float[48], timestamp, ...) that matter from a size/efficiency POV, without needing to handle all the other types. That'd provide a large fraction of the potential benefit of handling binary for all types.
The reason for the server side list is that otherwise I don't see how you could do it without an added roundtrip.
In my experience, when people are able to "fix" performance issues by moving from SQL database A to SQL database B, it's because they understand database B and how to optimise for it, not because the database is better or worse. My company was "saved" by moving from Postgres to MySQL. The difference was the guy and his specific knowledge, not the database technology.
> As early as 2001-2002... from mysql to postgres.
MySQL was pretty bad in the first versions. I'm in the enterprise/ERP development world, and when MySQL started to show up on the radar, I couldn't believe what a joke it was.
MySQL got good around version 5? Now I can say all major DBs are good enough no matter what you choose (I still think PG is the best around), but MySQL was always the more risky proposition, the PHP of the DB engines...
Yeah I agree. In practice now that I've accumulated my own experience I feel that Galera clustering is much better than anything I used in Postgres.
So far I've only setup one master/slave cluster with pgpool-II in postgres and it's definitely an OK setup. But I much prefer the usability and stability of Galera/MariaDB for clustering.
And you now know better how to structure your data, because you have done plenty of migrations and changes but couldn't change everything you wanted to, because the whole data access logic would have needed to be rewritten.
> Unlike BSD, which often disappointed me once I learned how much easier and mature Linux was to use.
I was with you right until that last sentence. I'm not going to offer a counterargument, because your statement is extremely generalised and ripe for flamewars, but I will say it's not as clear cut as you stated.
> Unlike BSD, which often disappointed me once I learned how much easier and mature Linux was to use.
With the exception of package management (at least with Debian), I've found the opposite to be true. Its configuration has always been a bit of a mess, especially with regard to networking. Lately systemd has made things only more confusing. And that's before dealing with the myriad of ways different distributions do things (which thankfully now is really only 2 different bases, Debian or Red Hat).
While linux "won" in the end and I use it professionally now, I'm glad I rarely have to interact with it directly anymore thanks to ansible or other orchestration tools.
I discovered PostgreSQL in a similar happenstance sort of way - working at a large telecom directly out of college in the late 90s, tasked with setting up a MySQL backed webapp to manage some internal processes. Fortunately, I was unable to compile MySQL on our HPUX workstations (issue with threads IIRC), PostgreSQL to the rescue. Been using it ever since.
This isn't hard to believe. MySQL basically started out as a key-value store pretending to be a relational database. If you wanted to do KV stuff and didn't care too much about data corruption (or knew the five zillion magic incantations to mostly avoid it), it was fast. But if you wanted joins or expected the query optimizer to do anything for you, you were SOL.
The dirty secret of database engine development is that all the things that are hardest to do in the database proper are super-easy to do in the application, given a few judicious architectural choices; but despite that, nobody can count on app writers to do them right. So, DB engine developers have to bust their humps to cover for users' incompetence.
A database for competent users can be amazingly simple and fast. But the Market Has Spoken, loud and clear: databases must absorb an indefinitely large amount of complexity to make the job of app writers a tiny bit easier.
I'd be curious to see some examples of what you mean!
I think there are definitely cases where the DB needs to compensate for bad architecture, but even competent developers can profit from the flexibility of being able to add new features ad hoc to an existing DB schema/architecture, with the DB just magically bailing them out.
I think that PostgreSQL itself is great. However, the developer client tools (pgAdmin) leave something to be desired. The old pgAdmin3 was fine, if lacking bells and whistles. The new pgAdmin4 however, was not very good last time I tried it [1]. It was implemented as a client/server web application and had frequent issues.
Can anyone recommend a good client for PostgreSQL?
[1] - I see that there have been some new releases in 2020 so I ought to check on them. The version I tried earlier was 4.11.
Thank you for this - my team and I have been looking for a replacement for pgAdmin for quite some time now and this fits the bill exactly. Two features I thought we were going to have to give up when moving from MySQL to PG were in-app ERD management and the ease of use of MySQL Workbench.
Dbeaver is the answer. It is good enough that I no longer miss pgadmin 3.
I've inadvertently become the pgadmin 3 "LTS" maintainer. A release of pgadmin 3 that was altered to support 10.x was previously provided by BigSQL. I forked it on GitHub to add TCP keepalive on client connections. At some point after that BigSQL removed the original repo. Apparently I was the only one that had forked it prior to removal, so now the few vestigial users in the world still using it are forking my fork. One has even patched it to work with 11.x, although they've offered no pull requests.
I use Postico and agree it's pretty great overall. There's some odd behavior in the SQL editor that sometimes bites me when I'm pasting from my code editor (VSCode), where whitespace sometimes picks up some kind of invisible character or something that makes the query invalid. I have no idea how to reproduce it but it can really break my flow when I hit it. Other than that I've been pretty pleased with it.
If I didn't already have a Postico license I'd definitely be looking at TablePlus.
I've been using JetBrain's DataGrip. So it's not free, but if I weren't already grandfathered into a good JetBrains package, I'd probably be willing to subscribe to it à la carte and pay for it out of pocket. It's the only JB product I still use consistently after switching from WebStorm to vscode.
I wonder if there is any client that has a halfway decent visualization of RLS policies. Last time I checked, pgAdmin didn't, and Postico, which I use day-to-day, doesn't (at all).
I have tried to give pgAdmin4 a chance three times, and every time it was an exercise in frustration. It is unstable and awful from the UX perspective.
Since then I'm sticking with psql for administration and with DataGrip for more involved DB development, and that works very well.
Having worked with dozens of different databases over lots of years, I can certainly say that Postgres is my go-to _default_ database over all the other relational ones.
However, we're still using at least four other databases in production, and the reason is that because Postgres does a bit of everything, there are lots of niches that it doesn't fill well. It's all about tradeoffs.
If you've got a Full Text Search problem, PG is "good enough", that is until you need to index more challenging languages like Korean or need more exotic scoring functions. For that, Elastic is better.
You can do small scale analytics on PG, but at larger sizes, the transactional setup gets in the way and you're better off with BigQuery/Redshift/Snowflake.
You can scale vertically pretty big these days, but if you need linear scalability while still guaranteeing single-digit-ms access, you're better off with ScyllaDB.
However, there isn't a single project I wouldn't start on Postgres these days; that's how much I love it.
I wouldn't say PostgreSQL is the analogue of Go: Go's initial growth in popularity came from being 'good enough' while being arguably oversimplified, where simplicity masks problems, under the pretence that YANGNI always applies (the added N emphasizing the "Not" of "Aren't"). That seems much closer to MySQL's history.
PG was driven by engineering correctness, by considering what DBAs 'Are Going to Need'. Sometimes that strictness worked against popular growth, but in the end it has worked out well, as many programmers figured out they also needed it. I'd say the programming language analogue would likely be Rust; let's see in 10 years where the 90% case lies.
I'd argue that what you're saying is true, in its current state. High-level Rust libraries take longer to write, and so they don't exist yet. As soon as Rust gets more powerful tools that abstract away the work, I suppose it might alleviate a large part of the churn.
Rust is more complicated than some languages with GC.
However, if you are writing some code with multi-threading (who isn't these days), then you may eventually want the help Rust gives you there as far as correctness goes.
Multithreading is only needed in very few fields. Multiprocessing is way easier and works fine for most. And the majority of programmers still only need to write serial code.
I'm interested in why you would choose Scylla over Cassandra. Does it have advantages/differences besides performance? I remember Scylla being approximately a C++ rewrite of Cassandra, but lost track of the community opinion of them.
ScyllaDB is far faster, more stable, and requires less config, maintenance and tuning. There's no reason to use standard Cassandra unless you need it for compatibility.
What issues are you having with Korean indexing? I work with Japanese data in PostgreSQL all the time and have never noticed any issues. Japanese is more complicated than Korean, in my opinion.
SQLite is also extremely useful, but it fills a different niche. (One of SQLite's slogans is "SQLite does not compete with client/server databases. SQLite competes with fopen().")
Replication, failover and high availability are complicated matters. I believe there are other SQL databases that may be said to handle this better than PostgreSQL.
At the very least, CockroachDB is much simpler to set up and manage. Cannot vouch for its stability and complexity of abnormal emergency recoveries, though - haven't used it long enough, only had simplest outages (that it had handled flawlessly).
I've been following YugabyteDB ( https://www.yugabyte.com/ ) with interest for similar reasons. It's built on PostgreSQL, so it supports more features than CockroachDB. They just forked PG on top of their clustered document store. And the newest version (2.1) is production-ready (their words).
I just haven't had a reason to dive in yet, but it seems good. Their PR department just isn't as good as CockroachDB's.
>Replication, failover and high availability are complicated matters.
I wonder why Postgres hasn't improved in these areas, instead leaving it to third-party solutions. I mean, every year these few points are still listed as something that favours MySQL.
It might be partly because Postgres is largely developed by the providers of "3rd-party" solutions for features like this. You could see it as an open-core model of sorts.
Why are you saying they have not improved? That is flat out false - postgres provides a lot better tools for this now than 10 years ago. They still leave a lot of the plumbing to third parties, but they are now at least giving those third parties a lot better tools than they used to.
Microsoft SQL Server is an obvious default choice if you're an all-Microsoft shop; it has all the Azure and Visual Studio integrations, etc. and comes with your MSDN.
Oracle is the choice of organisations with money to burn who don't believe that free products can be good. An Oracle installation nearly always comes with some consultants to write your application for you, and define your application for you, and write your invoices for you ...
I think the only reason I wouldn't choose Postgresql is if I need master-master replication.
It's not that you can't create a master-master setup using PostgreSQL; I believe 2ndQuadrant has an add-on that allows this. It just feels like it's messing with the fundamentals of the database on such a low level that I'd really only trust it if it were part of PostgreSQL itself.
Logical replication has been in PostgreSQL itself for a while now.
There are naturally caveats: sequences don't get replicated, so you need to configure each replica with a non-overlapping range for each sequence. And DDL statements are also not replicated, so you have to migrate database schemas by hand on each replica.
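A minimal sketch of what that looks like with the built-in logical replication, plus the sequence workaround from the caveat above (the object and connection names are invented for illustration):

  -- On the publishing node:
  CREATE PUBLICATION orders_pub FOR TABLE orders;

  -- On the subscribing node (the same table definition must already exist there):
  CREATE SUBSCRIPTION orders_sub
      CONNECTION 'host=node-a.example.com dbname=shop user=replicator'
      PUBLICATION orders_pub;

  -- Sequences are not replicated, so give each node a non-overlapping id space,
  -- e.g. odd ids on node A and even ids on node B:
  ALTER SEQUENCE orders_id_seq RESTART WITH 1 INCREMENT BY 2;  -- node A
  -- ALTER SEQUENCE orders_id_seq RESTART WITH 2 INCREMENT BY 2;  -- node B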
Logical changeset extraction does not in itself provide a replication feature, from what I can see of the wiki [1]. 2ndQuadrant sells BDR, which is built on top of it.
I really wish we'd just shut off the wiki. There's random pages starting to be maintained by someone that then stops at some point. Leads to completely outdated content, as in this case.
One of Postgres's biggest pros (over other commonly used dbs in web dev) is Row Level Security. Controlling what data a user can access at the database rather than the application or as part of a query makes the application logic a lot simpler. That's a big win.
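For anyone who hasn't seen it, a minimal sketch of the feature (the table, column, and setting names are made up for illustration):

  -- Hypothetical multi-tenant table.
  ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

  -- The application identifies the user once per session:
  -- SET app.current_user_id = '42';

  -- Roles subject to RLS then only ever see their own rows,
  -- no matter what SQL the application sends.
  CREATE POLICY documents_owner ON documents
      USING (owner_id = current_setting('app.current_user_id')::int);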
The code is not reusable across different storage layers, so it's not portable.
In my 23-and-a-bit years of web development I've literally never changed the database engine on a project. Maybe that happens on other people's projects, but it's not something I consider important or even useful really. The notion that you can swap out your database for a different one without changing the application code to take advantage of the db you're moving to is ludicrous. Of course your database code isn't portable.
The fact that websites used MySQL with MyISAM tables and raw SQL statements written as strings in PHP for the first decade of the web is one of the reasons why so many web developers still don't use things like transactions, stored procedures, views, etc. That's a bit of a tragedy. The web would be much better today if everyone had been using Postgres's features from the beginning.
If you write a project, you probably won't change the database. But you may add new data-layer services such as a cache backend, a search engine, a secret server, a session store, a permission microservice, etc. They will all interact with permissions and change where you source the data from.
Then there is the case where you don't write a project, but a lib. Or the case where you extract such a lib from a project. In that case, you limit the use case of your lib to your database. Because even if you don't change databases during your project's lifetime, the users of your lib may start a new project with a different database.
This is why the Django ORM allowed such a vibrant ecosystem: not only because it allows changing the database of one project (although that's nice to have and I've used it several times), but also because it allows so-called "Django pluggable apps" to be database independent.
I have never seen anybody migrate out of PostgreSQL, but I HAVE seen people migrate out of Oracle and DB2, because they are so expensive it's worth the cost to migrate.
Me too, until I found how fast Postgres starts in Docker, and for different languages there are now libraries that help you write tests using Docker containers.
Well, you're doing it wrong. It's easy and fast to run tests against real postgres (takes a fraction of a second to spin up the DB for your tests if you do it right).
I agree, but I also see the benefits from the other side, so I’ll play devil’s advocate:
- it is cached, in the database’s memory, where the cache can be invalidated automatically. It is better to cache views than data anyway.
- it is portable to every platform postgres runs, which in practice means it will run everywhere. Portability between databases is overrated because it rarely happens in practice.
- the access control logic evolves together with the schema, guaranteeing they have an exact correspondence. This is a good thing.
- integration tests should involve a live database
> it is cached, in the database’s memory, where the cache can be invalidated automatically. It is better to cache views than data anyway.
The cache may not be for the data in the data base, but for something else (task queue, calculation, user session, pre-rendering, etc). You effectively split your cache into several systems.
> it is portable to every platform postgres runs, which in practice means it will run everywhere. Portability between databases is overrated because it rarely happens in practice.
As I mentioned, for a project it doesn't matter. For a lib it does.
> the access control logic evolves together with the schema, guaranteeing they have an exact correspondence. This is a good thing.
Not always. Your access control logic could be evolving with your model abstraction layer, which frees you to make changes to the underlying implementation without having to change the control logic every time. It's also a good for your unit tests, as they are not linked to something with a side effect.
> integration tests should involve a live database
Integration tests are slow. You can't run them constantly as you develop.
> Have you looked at jetbrains datagrip?
It's fantastic, if you are willing to pay the price for it and force it on your entire team. If you are on an open source project, will you expect that from all your contributors?
High barrier to entry, with no modularity. Linters, formatters, debuggers, auto-importers: they all depend on that one graphical commercial product that is not integrated with your regular IDE and other tools.
Not to say it's not a good editor if you do write a lot of SQL, as JetBrains products are always worth their price.
The tooling is just fine for me, DataGrip or not: I wrote a ton of PostgreSQL triggers and utility functions in Javascript thanks to the excellent plv8 (free) extension.
Development and unit tests for these functions is as easy as plain Javascript.
> Development and unit tests for these functions is as easy as plain Javascript.
Since the most popular products in the JS world are products that help you avoid writing JavaScript (webpack+babel, TypeScript, CoffeeScript, JSX, etc.), and plv8 supports none of that, nor standard unit test frameworks like Jest, I'm not convinced.
Personally, this time around I found no need to even use a framework for unit testing a single JavaScript function.
If I had to do something that feels too complex, I'd write some interfaces in TypeScript to represent the tuples, then implement the function in TS, and compile it into plain JS in order to deploy it to plv8.
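To make that concrete, here's a minimal sketch of the kind of plv8 function being discussed (the function itself is invented for illustration); the body is plain JavaScript, so the same logic can be unit-tested outside the database, or authored in TypeScript and compiled to plain JS before deployment:

  CREATE EXTENSION IF NOT EXISTS plv8;

  CREATE FUNCTION slugify(title text) RETURNS text AS $$
    // Plain JavaScript: lowercase, collapse non-alphanumerics to dashes, trim dashes.
    return title.toLowerCase().trim()
                .replace(/[^a-z0-9]+/g, '-')
                .replace(/^-+|-+$/g, '');
  $$ LANGUAGE plv8 IMMUTABLE;

  SELECT slugify('Hello, PostgreSQL World!');  -- hello-postgresql-world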
It also means you need a per-user database connection which isn't feasible in many applications as the recommended number of concurrent connections is typically quite low.
The main reason I use MySQL over Postgres is storage engines. MySQL has storage engines with transparent compression, which allow me to keep an order of magnitude more data than I could using Postgres. These days it's MyRocks; in the past it was TokuDB.
Historically Postgres has treated compression as the job of the file system, which is basically a bad choice for DBs. MyRocks wipes the floor with it.
TimescaleDB is an interesting new Postgres storage choice. I'm evaluating it. But the free version doesn't do compression on-demand nor transparently. Anyway, it can only get better.
Hi @willvarfar: Just to avoid any confusion, TimescaleDB's compression is available in our community edition, which is fully free to use. Once you specify some compression configs, it's automated and transparent.
(The "community" edition is available under our Timescale License. It's all source available and free-to-use, the license just places some restrictions if you are a cloud vendor offering DBaaS.)
The cool thing about timescaledb is that timescale not Postgres controls the partitions, so who says they all have to be done the same way?
In my use cases I have got recent partitions that are upsert heavy, so row-based works best. But as the partitions age, column-based would be better. What every DB seems to force me to do is use the same underlying format for all partitions. What I’d like is a storage engine that automates everything; instead of me picking partition size, it picks on the fly and makes adjustments. Instead of me picking olap vs oltp, it picks and migrates partitions over time etc.
TimescaleDB mostly does exactly that: Recent data/partitions are maintained in row-based form, and after they reach some age, they are converted into more columnar compressed form.
Our own experience was that most users were getting 3-6x compression running TimescaleDB with zfs.
With our native compression -- which does columnar projections, where a type-specific compression algorithm is applied per column -- we see median compression rates across our users of 15x, while 25% of our users see more than 50x compression rates.
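Roughly, enabling it looks like the following in recent TimescaleDB releases; treat this as a sketch, since the exact option and policy-function names have changed between versions and the hypertable name here is made up:

  -- Assumed hypertable 'metrics' segmented by device; names are illustrative.
  ALTER TABLE metrics SET (
      timescaledb.compress,
      timescaledb.compress_segmentby = 'device_id'
  );

  -- Compress chunks once they are older than a week.
  SELECT add_compression_policy('metrics', INTERVAL '7 days');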
No, that's a compressed row format, not a compressed block format.
MySQL's InnoDB, for example, supports per-row compression. It's completely ineffective. Block compression in TokuDB, MyRocks, etc. is a completely different class.
MySQL's InnoDB also supports a kind of 'page compression' using the file system's sparse pages. It's also naff.
The closest Postgres can get is using ZFS with compression. It's a lot better than nothing.
SQL Server can also do storage compression. I assume Oracle can too. It used to be that the commercial offerings had the most advanced features first and the OSS systems copied them later.
I did use MySQL/MariaDB in production for a couple of years, and now I'm using Postgres.
PG sure seems to have better foundations and generally a more engineered approach to everything. This leads to some noteworthy outcomes in the tooling or the developer UX, for example:
PG won't allow a union type (like String or Integer) in query arguments.
PG's query argument handling is rather simple (at least the stuff provided with libpq); it only allows positional arguments.
Sometimes MySQL, by extending or bending SQL here and there, allows some queries to be written more tersely, like an update+join over multiple tables.
Also I did actually enjoy that MySQL had less advanced features, so it seemed to be easier to reason about performance and overall behaviour.
Well, (unfortunately) plenty of applications support only one DB backend. For FOSS software, MySQL/MariaDB support is very common, and even if those apps support pgSQL it might be that pgSQL support is not as robust and well-tested as mysql.
And in plenty of situations, sqlite is perfectly adequate, and saves you from the extra work of managing yet another service.
But yes, I agree if I'd start from a blank sheet and need a network-accessible SQL DB, pgSQL would be my first choice. Now if I had some very high end requirements, maybe one of those big commercial DB's would have an advantage.
You just said you never saw Postgres at Yahoo. Yet Yahoo had possibly the world’s largest Postgres installation. I consider that legendary. I’ve certainly been aware of it for over a decade. It was also an early (?) columnar store which is now commonplace among OLAP databases. Seems like if it didn’t start a trend it was at least an early adopter. I’d say that’s legendary too. Even today the scale is impressive.
Now you say you did see some Postgres but there was more MySQL. Fine, you were there, I'll take your word for it. But I can't reconcile your own statements on this. Yahoo very clearly used Postgres.
You ran a 2PB MySQL install. Cool, I’d love to hear about that, truly. Do you have any written accounts or talks about that?
The article says Yahoo bought Mahat Technologies for their columnar version of Postgres. That sounds similar to Redshift or Greenplum but I think it is different. I can’t find a clear history of Greenplum’s origins or what happened to Mahat. Looks like Redshift came from ParAccel which was a separate project. From what I can find there were a lot of similar projects at the time.
We picked SQL server by Microsoft for one of our projects because it was easier to integrate it when we started (ASP.NET core 2.0) and it has very nice tools like SSMS, better than DBeaver. For a new project we picked Postgres. We might in the future use any of the two.
We also have other DB tech in the company: Teradata, Oracle, Redis, Cassandra.
I am quite surprised MySQL or MariaDB is not mentioned more. It seems like a standard to me. Almost every serious software supports it. Of course it doesn't have some extra features like GIS but for generic RDB it performs well. And there is also Galera cluster btw...
Oracle is currently moving in the direction of autonomous databases. It's no longer just about having great automation features but about the database being able to do things for you like create or drop an index in order to improve ETL performance, and it's all done autonomously.
Oracle has their own cloud platform and a great database to go along with it, if anything history has shown us that Oracle is able to adapt their database offerings to cater to current enterprise needs and I think this time will be no different.
I self-admittedly love esoteric databases and storage engines to a fault. I'll try to shoehorn things like RocksDB into whatever personal project I'm working on.
At work however, the motto I spread to the teams I work on is "use Postgres until it hurts". And for many, many teams - Postgres will never hurt. I'm very happy for its continued existence because its been a solid workhorse on various projects I've worked on over the years.
BTW, here's an interesting observation: if you normalize a schema to the max, applying CRDT techniques for eventually-consistent multi-master concurrency is relatively simple, and you can do it using SQL. With PG you could have each instance publish a replication publication of an event log, and each instance could subscribe to a merged log published by any of N instances whose job is to merge those logs, and then each instance could use the merged log to apply local updates in a CRDT manner. If you normalize to the max, this works.
For example, instead of having an integer field to hold a count (e.g., of items in a warehouse of some item type) you can have that many rows in a related table to represent the things being counted. Now computing the count gets a bit expensive (you have to GROUP BY and count() aggregate), but on the other hand you get to do CRDT with a boring, off-the-shelf, well-understood technology, with the same trade-offs as you'd have using a new CRDT DB technology, but with all the benefits of the old, boring technology.
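A toy illustration of the 'count as rows' idea (the schema is invented): each unit on hand is its own inserted row, so concurrent inserts from different nodes merge without conflicting over a counter, and the count is recomputed on read.

  CREATE EXTENSION IF NOT EXISTS pgcrypto;  -- for gen_random_uuid() on older versions

  -- One row per physical unit instead of a mutable count column.
  CREATE TABLE stock_units (
      unit_id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
      item_id int  NOT NULL,
      node    text NOT NULL  -- which instance's event log produced this row
  );

  -- The "count" is derived, so there is no in-place update to conflict over.
  SELECT item_id, count(*) AS on_hand
  FROM stock_units
  GROUP BY item_id;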
By putting thought into design patterns and a storage engine that complements those choices, it's sometimes possible to eschew caching and distributed storage altogether, saving you from a whole host of complexity.
It can also be challenging to accurately assess a database's performance and correctness. I've been using C/Rust bindings when possible for the former, and Aphyr's Jepsen test results for the latter. Unfortunately there's no silver bullet on these topics - you pretty much have to test all your use cases.
Postgres is my go-to RDBMS, but I do have one serious complaint: Connections are too expensive.
This is a side-effect of the old-school one-process-per-connection architecture that Postgres uses. MySQL (ick) easily handles thousands of connections on small servers; with Postgres you will need a LOT of RAM to sustain the same, RAM that would be better served as cache.
I've found (at least, for my current app) that the number of connections defines my scaling limit. I can't add more appserver instances or their combined connection pools will overflow the limit. And that limit is disappointingly small. It's not so painful that I want to reach for another RDBMS, but I'm bumping into this problem way too early in my scale curve.
It might be possible to write a small proxy in Rust that runs next to a PG instance, accepts connections using `async`, and then forwards queries through a limited-size connection pool.
If a lot of connections are actually running transactions, then I'd expect the DB to become overwhelmed by the number of transactions, and per-connection overhead is not going to be the problem.
As I understood the problem, it's thousands of connections, most of which only run a query from time to time.
Judging by the other comments, it seems solutions like that are already available.
Most of my transactions are relatively long-running (minutes) and mostly idle, so the db can handle the transaction load.
There honestly doesn't seem to be a good solution to this problem. I have reorganized my app a bit to try to keep the transactions shorter (mostly checkpointing) but it's using architecture to solve a fundamentally technical problem. I wouldn't have this problem with MySQL.
Switching from processes to threads (or some other abstraction) per connection isn't likely to show up in Postgres anytime soon, so I guess I'm willing to live with this... but I'm not going to pretend that Postgres is without some serious downsides.
> with Postgres you will need a LOT of RAM to sustain the same
The per-connection memory overhead is not actually that high. If you configure huge_pages, it's on the order of ~1.5MB-2MB, even with a large shared_buffers setting.
Unfortunately many process monitoring tools (including top & ps) don't represent shared memory usage well. Most of the time each process is attributed not just the process local memory, but all the shared memory it touched. Even though it's obviously only used once across all processes.
In case of top/ps it's a bit easier to see when using huge pages, because they don't include huge pages in RSS (which is not necessarily great, but ...).
Edit: expand below
On halfway recent versions of linux /proc/$pid/smaps_rollup makes this a bit easier. It shows shared memory separately, both when not using huge pages, and when doing so. The helpful bit is that it has a number of 'Pss*' fields, which is basically the 'proportional' version of RSS. It divides repeatedly mapped memory by the number of processes attached to it.
Here's an example of smaps_rollup without using huge pages:
You can see that RSS claims 1.8GB of memory. But the proportional amount of anonymous memory is only 1.4MB - even though "Anonymous" shows 2.6MB. That's largely due to that memory not being modified between postmaster and backends (and thus shared). Nearly all of the rest is shared memory that was touched by the process, Shared_Clean + Shared_Dirty.
I love Postgres as much as the next guy but there are three things that really annoy me:
Postgres is still laid out for 9-to-5 workloads, accumulating garbage during the day and running vacuum at night. Autovacuum just doesn't cut it in a 24/7 operation, and it is by far what has caused the most problems in production.
No query hints, and no plan to ever implement them. Making the planner better only takes you so far when everything depends on table statistics, which can easily be skewed. pg_hint_plan or using CTEs is not extensive enough.
Partial indexes are not well supported in the query planner; thanks to the lack of query hints, I can't even force the planner to use them.
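For readers who haven't hit this: the planner only picks a partial index when it can prove the query's WHERE clause implies the index predicate, which is exactly where hints are missed (the names below are made up):

  -- Index only the small, hot subset of a large table.
  CREATE INDEX orders_pending_idx ON orders (created_at) WHERE status = 'pending';

  -- Usable: the literal predicate matches the index predicate.
  SELECT * FROM orders
  WHERE status = 'pending' AND created_at > now() - interval '1 day';

  -- Often not usable: with a generic plan the planner cannot prove that the
  -- parameter equals 'pending', so the partial index is skipped.
  -- SELECT * FROM orders WHERE status = $1 AND created_at > $2;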
Do you have a good Postgres DBA? Autovacuum can indeed ruin your day if you have a 24/7 DB that is delete- or update-heavy (the default settings are not very aggressive, and once it falls badly behind it will never catch up and you'll need to do some emergency manual vacuuming). But I can assure you it's possible to run 24/7 DBs fine without running into autovacuuming problems by making sure things are configured right. I've experienced both regular problems due to bad config and their complete disappearance.
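As a concrete example of "configured right", the per-table storage parameters below make autovacuum much more aggressive on one hot table only (the table name and thresholds are illustrative, not recommendations):

  ALTER TABLE events SET (
      autovacuum_vacuum_scale_factor = 0.01,  -- vacuum after ~1% dead tuples instead of 20%
      autovacuum_vacuum_cost_delay   = 1,     -- let the worker run nearly unthrottled
      autovacuum_vacuum_cost_limit   = 2000
  );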
PostgreSQL and ZFS are a marriage made in heaven. Block sizes align with disk behaviour, and snapshots make great low-latency DB dumps trivial. Never regretted the combo.
FWIW, I've often called out my decision to use PostgreSQL and ZFS (on Linux) as my worst.
Memory usage was extremely inefficient. ARC had to be set to half of what it should've been because it varied so dramatically, so half the system memory was wasted. ZFS would occasionally exhaust system memory causing a block on allocations for 10+ minutes almost daily, no ssh or postgresql connections could be opened in the meantime. Sometimes a zfs kernel process would get stuck in a bad state and require restarting the server. Many days were wasted testing different raid schemes to work around dramatic space inefficiencies (2-8x IIRC) with the wrong combinations of using disk block sizes, zfs record sizes, postgresql block sizes, and zfs record compression. Because zfs records are atomic and larger than disk pages, writes have to be prefaced with reads for the other disk pages, adding lots of latency and random IOPS load. Bunch of other issues, I could go on.
I've since switched back to ext4 and hardware RAID. Median and average latency dropped an order of magnitude. 99th percentile latency dropped 2 orders of magnitude.
These databases are at high load. If they had low load, and I wasn't expecting to grow into high load, I'd consider ZFS since it does have a bunch of nice features.
And compression! I've gotten compression ratios above 3x on real production databases, which is the difference between "we need more disks by next month" and "this hardware will last the foreseeable future". Not to mention that it improves performance when lz4 can decompress faster than the disks can read.
Can you please give us more info about your workload? Is it a very high-throughput DB, especially for inserts and updates? I searched for ZFS and PostgreSQL performance a couple of years ago, and the popular opinion on the Internet seemed to be that databases on copy-on-write file systems don't give the best performance.
I am not a deep expert, but my impression is that Postgres MVCC is kind of a copy-on-write approach already (no in-place updates; each update writes a new row version), so it should be a perfect marriage.
I manage Oracle and Postgresql instances and I agree with this, except for one small thing: Postgresql doesn't have packages.
Otherwise, it is much simpler to manage than Oracle (which really is a system turned inside out). The documentation is really good and straight to the point (Oracle can't help but turn everything into an ad - nvl uses the best null handling technology on the market and such).
I spent many years on Oracle. I do not miss smug Tom Kyte blog posts about the pointlessness of BOOLEANs. Nor do I miss endless hours Googling around Burleson Consulting pages because we couldn’t afford support.
I do kinda miss how easy it was to do partitioning but I’m on Redshift now so it doesn’t matter.
While not having packages isn't a huge deal for new projects, it can be a pretty big stumbling block for DBAs who want to migrate from Oracle to Postgres but need an easy way to port their PL/SQL code to PL/pgSQL. Most of the syntactic differences (like PL/pgSQL code needing to be delimited as a string) can be automated or handed off to a junior programmer. Not being able to use Oracle's package system, however, would require some major refactoring of how you organize your functions and procedures.
This talks a lot about Postgres (which is great) but not much about other databases, so the title isn't very accurate. If you ignore cost, it certainly doesn't explain why you'd work with Postgres rather than Oracle or SQL Server.
I've used Oracle for 20+ years, SQLServer for > 10 years, and Postgres from time to time and would say they are all good tools for a general use case.
Cost isn't an advantage for the company I work for, because we licensed SQL Server very cheaply in big volume.
If devs can work easier and faster with SQL server, then we end up saving money because time is money.
As mentioned, PostgreSQL's documentation is some of the best in the software world. With most software project documentation, one ends up scouring StackOverflow, mailing list archives, and GitHub issues trying to figure out how to do something. With Postgres, I can just go straight to the official documentation, and read it. It has yet to fail me.
While I disagree that it's the world's best, it is my default RDBMS for all projects and general use.
I used to be on MySQL, but over time it felt like hack on top of hack with each subsequent version. While PG lacked a number of features in the early 2010s, it has clearly caught up, with great care taken over its codebase.
There are scenarios where PG does not win and should not be used, but for the broad range of applications it covers well, PG is it.
Sure... but if you HAD to choose one? Just kidding. I think the word "best" could easily be substituted with "your go-to DB"... or simply the best for <category of what you want to do>.
I used PostgreSQL as a student and during research, and had to switch to Microsoft SQL Server because my company has used it since forever.
For features I used on both, the average quality of the SQL Server implementation is between middling and pathetic (mostly the latter).
I cannot recall a single instance where I thought "that's nice" or anything positive about how SQL Server does something. At best, it has been "this is not awful".
Things that sucked especially hard:
- Datetime API: In postgresql it is a joy to use and most features one would like to have to analyze data are directly there and work as they should. The analysis for my master's thesis was completely done in postgresql [1] (including outputting CSV files that could then be used by the LaTeX document) - without ever feeling constrained by the language or its features. Meanwhile, the way to store a Timespan from C# via EF is to convert it to a string and back or convert it to ticks and store it as long.
- The documentation: The pgsql documentation is the best of any project I have ever seen, the one for SQL Server is horrible and doesn't seem to have any redeeming qualities.
- Management: SSMS is a usability nightmare and still the recommended choice for doing anything. I assume there are better ways, but I haven't found them, and the ones I did find usually didn't work.
- The query language: In psql you can do things like
select a < b as ab, b < c as bc , count(*) from t group by 1,2; \crosstabview
To get even half of that in SQL Server one would do
select case when a < b then 1 else 0 end as ab, case when b < c then 1 else 0 end as bc, count(*) as count from t group by case when a < b then 1 else 0 end, case when b < c then 1 else 0 end
I think crosstab queries are unsupported (there are PIVOTs but I haven't yet looked into them a lot).
Maybe my use cases are a bit weird and I haven't encountered the better features of SQL Server yet. Would appreciate if someone could point out what I am doing wrong.
Interesting that MSSQL has bad documentation. Judging by my experiences with the docs for Azure and various Windows and .NET APIs, Microsoft's documentation is usually pretty solid. Not that it's surprising, since proprietary RDBMSes tend to prefer "vendor lock-in" over "making the product pleasant to work with so the customer doesn't jump ship".
I've spent a lot of time on looking at database solutions recently, reading through Jepsen reports and thinking about software architectures. The problem with PostgreSQL is that it is essentially a single-point-of-failure solution (yes, I do know about various replication scenarios and solutions, in fact I learned much more about them than I ever wanted to know). This is perfectly fine for many applications, but not nearly enough to call it "the world's best database".
You are the architect of the application. If the application requires a distributed database, then it's your task to implement that in order to fulfill the client's requirement. PGSQL has nothing to do with your skills and/or capability to handle that situation. You can build a distributed database even with Access, or even with plain files.
Can you just shortly elaborate on why PostgreSQL is not good for distributed scenarios for those of us who are/were under the impression that psql does have stable distribution features?
IMO it should be possible to build a CRDT-style, eventually-consistent, distributed DB using PG. Normalize to the max, and then normalize more, have instances publish replication for log schemas, have a layer of instances that publish merged logs, and have instances subscribe to and apply (CRDT-style) those merged logs.
You can use CRDTs on top of pretty much any database and while they could simplify things, any database won't be optimized for specific constraints of CRDTs, so you won't be able to achieve certain levels of performance, availability, reliability and operational overhead possible with a well designed CRDT based system (I'm talking an order of magnitude difference here).
> Do you really care that much about the language? Shouldn't you care more about your data? If you are not OK with SQL there are abstractions available.
I do care about SQL being a text-based protocol with in-band commands, bringing lots of functionality that I never need (I do not use the fancy query features for exploratory analysis, I have applications with very specific data access patterns).
For my needs, I do not need a "query language" at all, just index lookups. And I would much rather not have to worry about escaping my data, because I have to squeeze everything into a simple text pipe.
There is no such thing as a "best database for everyone". I intend to migrate to FoundationDB in the main project I'm working on, but it is not necessarily the "best" option for everyone (see https://blogs.vmware.com/management/2020/03/changing-the-fou... for a good writeup).
For anyone using C#, please have a look at the excellent Marten library, which shows off the awesome functionality of Postgres as a document (e.g. NoSQL) + event store.
Postgres is not only for SQL, as I see multiple people referencing only the SQL functionality.
For those who work with events/projections, Postgres supports JS migrations on events, so you can update events in your store to the latest version.
PostgreSQL is also my go-to database for anything relational. My only issue with it, though, is the need to run VACUUM FULL (which locks tables for a significant amount of time) to return disk space to the operating system.
That performance carries the day today is a huge improvement since 1999. Back then, there was also only one question that mattered, but the question was, "Is it Oracle?", just as the only question about network equipment was "Is it Cisco?".
It is hard to explain, at this remove, what horrible perversions were performed in order to be able to say "yes" to such questions, or to avoid the need to answer them.
One example was a router that, at opposite ends of a long-haul network, managed data flow that was utterly insensitive to the magnitudes of latency or drop rate. They sold hardly any. It turned out the algorithm could be run in user space over UDP, and that became a roaring business, for a different company, because you could deploy without getting network IT involved. Mostly IT didn't even want bribes. They just didn't want anything around that was unfamiliar.
Database products had to pretend to be an add-on to Oracle, and invisibly back up to, and restore from, an Oracle instance, because Oracle DBAs only ever wanted to back up or restore an Oracle instance. The Oracle part typically did absolutely nothing else, but Oracle collected their $500k (yes, really) regardless.
I get the Postgres love but as someone who’s been doing mysql for a long time I found postgres really hard to use. Like even setting it up on Mac was a pain in the ass. How do I create a new db? Why is pgcli not as friendly as mysqlcli? After 2 weeks of head bashing and lots of stack overflow reading I gave up and went back to good ol MySQL.
For ease of use for a newcomer, MySQL seems to have nicer tooling and out-of-the-box usability. I've worked with MySQL on sites with millions of daily active users and it hasn't been a problem. Snappy 10ms queries for most things. I guess one really needs to learn ANALYZE and do the necessary index tuning.
I'm just saying: use the tool you're most comfortable with, that has a good ecosystem, and that gets the job done. For some it's PGSQL, for some it's MySQL, for others it could be the paid ones. MySQL 8 is pretty solid nowadays.
There is no universal “best”, just as there is no “best” car. All about trade offs.
Might be a bit... forcefully put, but I gotta say that PostgreSQL is one of my favorite pieces of software ever created. And I don't even like C all that much.
> So, you want NoSQL, Riak, REACT, Redis, Mongo, etc.? PostgreSQL does all that. Admittedly not with all the bells and whistles of all of the original products. For example, PostgreSQL doesn’t create new shards for you for any of those. That’s still a manual process.
Isn't this a fairly big issue? I would think the convenience of automated scaling is a primary reason to use these other tools.
While we're all here talking about databases, I've got a question for the DBAs/well-versed software engineers.
What database, or database design strategies, should be used for an analytics dashboard that can be queried in real time, similar to Google Analytics?
The first real scaling issue I've come across (as a junior dev working on a side project) is a table with 38 million rows and rapidly growing.
Short of just adding indexes and more RAM to the problem, is there another way to design such a table to be effectively queried for Counts and Group Bys over a date range?
I imagine things can't really be pre-computed too easily due to the different filters?
I currently run analytic queries on a page-view dataset of around ~600 million rows on Postgres. You can start by looking into partitioning by date. Also take a look at the IO of your queries; IO was the biggest bottleneck for me. Make sure you have hardware that is not limiting it.
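A minimal sketch of declarative range partitioning by date (PostgreSQL 10+; the table and column names are made up), which keeps counts and GROUP BYs over a date range touching only the relevant partitions:

  CREATE TABLE page_views (
      viewed_at  timestamptz NOT NULL,
      url        text        NOT NULL,
      visitor_id bigint
  ) PARTITION BY RANGE (viewed_at);

  CREATE TABLE page_views_2020_03 PARTITION OF page_views
      FOR VALUES FROM ('2020-03-01') TO ('2020-04-01');

  -- Partition pruning limits this scan to the March partition only.
  SELECT date_trunc('day', viewed_at) AS day, count(*)
  FROM page_views
  WHERE viewed_at >= '2020-03-01' AND viewed_at < '2020-04-01'
  GROUP BY 1;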
PostgreSQL is absolutely fantastic and, unless there's an existing legacy database or a purely Microsoft environment, should be the default RDBMS for most projects.
My only complaints are:
1) No built-in or easy failover mechanism (and it will never have one because of their stated philosophy). I'll settle for a standard one, and yes, I know there are different requirements and things to optimize for, and that there are 3rd-party solutions (just not one that integrates easily).
2) Horizontal scalability. And yes, I know of some solutions, that other DBs are more apt for this, and that I may be asking the wrong thing.
PostgreSQL is a transaction-oriented row-store. It is slow for analytics, and I mean by something like two orders of magnitude, if not more, relative to state-of-the-art systems like HyperDB or Actian Vector, on typical analytic workloads. You can extend it any which way you like, that won't change. (Ok, if you can extend it with essentially a different DBMS then that could be fast.)
Security, concurrency control - those are important for transactional work facing numerous users. Not for analytic processing speed.
I’m assuming anyone who has to make that decision already knows this but, while PostgreSQL is great to host a production database, it isn’t a great choice for an analytic database at scale, or to train or store your machine learning features. It works, but it’s not great at scale.
You can get away with having a scheduled pg_dump early on, some reports on that, while you figure out an ETL/Messaging process — but picking something that handles concurrent large-scale queries will matter fast.
I’d say hosting data that you don’t need to run your live ergodic (non-ML) software service, but that you’d need for other services: informing the business, flagging operational issues, training ML models.
Some of those can be done with back-end tools: say, if your software also connects to your customer service center, and the managers of that center monitor their activity on a solution developed in-house, that solution isn't "analytical" but should rather be part of the main architecture.
The main distinction that I’d make is: would it be a problem for your service if that data wasn’t available for a second, a minute, an hour? Anything visible on your app or website? A second is probably stretching it. Monitoring logistic operations, say drivers at Ubers? Up to a minute is probably fine. You want to retrain a ML model because you have a new idea, but database is down for maintenance? For an hour? You can go and grab coffee, or lunch — you are fine. Serving that same model for recommendations on a e-commerce website, that’s obviously not something that can take the same delay.
I actually think that part of the popularity of Postgres - and relational databases (RDBMS) in general - is that they are good enough for most analysis work.
Not specialist stuff, Postgres isn't really a competitor if your business is large or data-focused enough to be running OLAP (someone will probably link an OLAP extension now...).
But, most businesses just aren't that sophisticated in how they use their data.
There are a lot of commercial solutions geared towards reporting, from old-school Oracle and Teradata to even SAS (would not recommend that one). Some options that came out more recently, when "Big Data" was a phrase people said: Hive, Redshift, S3, Cassandra. Newer players abstract a lot of the problems of the others and add machine-learning capabilities on top: Google BigQuery, Snowflake. There are non-tabular structures like the graph-based ones, e.g. Neo4j: I've yet to find a business where that's a viable option for analytics in the first five years, but I like that they are bringing something new.
Most of those are great and work: they have connectors to whatever language or tool you want to use. The most promising tool you'll need to handle most of the transformation is either Airflow or, preferably, dbt. Those are independent from the database, so don't worry too much about the features that vendors tout. One key thing: monitor all the queries going to that database, and find the expensive ones, those with suspicious patterns, etc.
Spin-up and response times can matter, but they are rarely a problem for most "slow" analytic use cases; for instance, Google BigQuery takes 10 seconds no matter what you query and it's fine. On the other hand, concurrency has been an issue for me more than anything: all the analysts and managers trying to update their dashboards on Monday at 10 am.
You rapidly get to a point where prices are high and negotiated, so you want to think about your likely usage over the next few years before you step into that meeting. Key decision: by default, prefer the tech that is closest to the rest of your stack, because ingress is the easiest factor to predict.
Looker will be mentioned: it’s on top of all that, downstream from AirFlow/dbt.
Shameless plug: You might be interested in TiDB (https://github.com/pingcap/tidb), an open-source distributed HTAP database.
- For OLTP workloads and some ad-hoc queries, you can use TiDB + TiKV. One of our adopters has a production cluster with 300+ TB of data and can easily cope with the spike brought by COVID-19.
There are a lot of products that either fork or directly extend PostgreSQL for OLAP workloads. Greenplum[0], Materialize[1], Citus[2], PipelineDB[3] etc etc.
We use Teradata for our data warehouse. We have an enormous quantity of data in there (We are a multinational wholesale chain). We however plan to move to either BigQuery or an inhouse solution in the future. But for now Teradata is our workhorse and quite good at what it does.
Check out ScaleGrid's BYOC plans. They are similar to RDS, but give you the option to host in your own AWS or Azure account. ScaleGrid supports MongoDB, Redis, MySQL and Postgres. https://scalegrid.io/pricing.html#section_pricing_byoc
Let’s say I have a table for reservations: 1. Item ID 2. Reservation duration
What I would like is: if I have an item ID of 1 and a reservation duration of today 10AM to 11AM, and I try to add another record with item ID 1 and duration 10:39AM to 11:39AM, the insert should fail.
Apparently this is straightforward in DB2 but not in PostgreSQL?
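For what it's worth, PostgreSQL can enforce exactly this with an exclusion constraint over a range type; a sketch, with the table and column names assumed from the description above:

  CREATE EXTENSION IF NOT EXISTS btree_gist;  -- lets GiST handle the plain equality on item_id

  CREATE TABLE reservations (
      item_id int       NOT NULL,
      during  tstzrange NOT NULL,
      EXCLUDE USING gist (item_id WITH =, during WITH &&)
  );

  INSERT INTO reservations VALUES (1, '[2020-04-01 10:00,2020-04-01 11:00)');
  INSERT INTO reservations VALUES (1, '[2020-04-01 10:39,2020-04-01 11:39)');  -- fails: overlap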
PostgreSQL is a great database, but pgAdmin (or any client I have tried) is just painful compared to SSMS.
And while SSMS certainly has some pain points, I have yet to find a DB client tool that even comes close in terms of usability. For this reason alone I prefer MS SQL.
You can join on arbitrary json fields, with indices. It is an extremely useful swiss army knife.
I've used jsonb for performing ETL in-database from REST APIs.
I know Concourse from 6.0 uses it for storing resource versions. They had an inputs selection algorithm that became stupidly faster because they can perform joins over JSON objects provided by 3rd-party extensions. Previously it was a nested loop in memory.
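A small sketch of the pattern (the payload shape and names are invented): an expression index makes a field extracted from jsonb as joinable as a normal column, and a GIN index covers containment filters.

  -- Join a jsonb payload to a relational table on an embedded id.
  CREATE INDEX api_events_customer_idx
      ON api_events (((payload->>'customer_id')::bigint));

  SELECT c.name, e.payload->>'type' AS event_type
  FROM api_events e
  JOIN customers c ON c.id = (e.payload->>'customer_id')::bigint;

  -- Containment queries via a GIN index.
  CREATE INDEX api_events_payload_idx ON api_events USING gin (payload jsonb_path_ops);
  SELECT count(*) FROM api_events WHERE payload @> '{"type": "refund"}';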
I use Postgres for everything. The one thing I wish they would do, though, is improve the error messages when a query fails. At the very least, if I know which position in the text is the beginning of an error, I can fix it.
> The one thing I wish they would do, though, is improve the error messages when a query fails.
Yea, we definitely need to do better there. But it's harder than one might think initially :(. My suspicion is that we (the PG developers) will have to write our own parser generator at some point... We don't really want to go to a manually written recursive descent parser, to a large part because the parser generator ensuring that the grammar doesn't have ambiguities is very helpful.
> At the very least, if I know which position in the text is the beginning of an error, I can fix it.
For many syntax errors and the like we actually do output positions. E.g.
  postgres[1429782][1]=# SELECT * FROM FROM blarg;
  ERROR:  42601: syntax error at or near "FROM"
  LINE 1: SELECT * FROM FROM blarg;
                        ^
(with a mono font the ^ should line up with the second FROM)
If you have a more concrete example of unhelpful errors that particularly bug you, it'd be helpful. In some cases it could be an easy change.
Fair enough. Without Sybase XI or whatever version they would now be up to, I can see where this could be true. But when MySQL added Views in v5.0.3 as I recall, it met my requirements for an RDBMS.
Idk, I'm very pro-PostgreSQL, but MySQL/InnoDB used to have that one feature for easy replication via logical DB updates (sending UPDATEs to replicas), whereas PostgreSQL's support for this was considered lacking, using binary block updates instead, if I'm remembering correctly. Also, MySQL with replication was considered easier to set up and manage by your average web shop. That may be a thing of the past, though.
Edit: also, best for what? SQLite is, in a way, the perfect DB if you don't have concurrent writes, and I've worked with vector databases (think bitmap indices on steroids) that totally outclass e.g. Oracle and PostgreSQL for analytics-heavy workloads.
That used to be the case, but it's not anymore. PostgreSQL has had logical replication built in since PostgreSQL 10 (released 2017-10-05). It's really simple to run these days and the docs are great [0]
For scaling I assume some cloud solution or very distributed solution is better than PostgreSQL. For example Azure SQL Database Hyperscale or CockroachDb.
- Postgres' grant model is overly complex. I haven't seen anybody maintain the grants correctly in production for non-admin read-only users. By contrast, MySQL's grants are simple to use and simple to understand.
- Postgres' COPY FROM and COPY TO have been used to compromise the database by copying ssh keys to the server, amongst other things.
- Postgres' version of upsert allowed any command to be run without checking the permissions. So the vaunted "software engineering" behind Postgres is not that solid.
- Currently Postgres is subject to around a dozen metasploit vulnerabilities that any script-kiddy can execute.
The simple fact is, if you use Postgres, you almost certainly have a security compliance problem.
I could make the same arguments about replication, or online schema changes, multi-master writes, or any enterprise database feature.
Some constructive advice to the Postgres developers is to take a week and add grant commands to limit COPY FROM and COPY TO, and look at the metasploit options and see what can be done ASAP.
I'd appreciate if you're itching to write a hasty response that you actually check your facts first.
If you're thinking, "How is it possible that everybody else is wrong about Postgres being the best?", just remember the decade of Mongo fanboism on HN. I cringed during that era, too.
Your statements are nothing but opinions. It's also dishonest to make them sound like facts.
If you care this much about security, you should have taken a look at the CVE reports for each DB. You'll find that historically MySQL has almost double the number of vulnerabilities.
> - Postgres' version of upsert allowed any command to be run without checking the permissions. So the vaunted "software engineering" behind Postgres is not that solid.
That never was the case (see evidence in the initial commit [1]). Are you talking about CVE-2017-15099? It's obviously annoying that we had that bug, but that's very far from what you claim.
Since you write "I'd appreciate if you're itching to write a hasty response that you actually check your facts first." you actually follow up your own advice?
> - Postgres' COPY FROM and COPY TO have been used to compromise the database by copying ssh keys to the server, amongst other things.
If, and only if, the user is a superuser. It's possible to do the same in just about any other client/server rdbms.
> Some constructive advice to the Postgres developers is to take a week and add grant commands to limit COPY FROM and COPY TO
You mean, like it has been the case for ~19 years? https://www.postgresql.org/docs/7.1/sql-copy.html "COPY naming a file is only allowed to database superusers, since it allows writing on any file that the backend has privileges to write on.".
> - Currently Postgres is subject to around a dozen metasploit vulnerabilities that any script-kiddy can execute.
That does not actually seem to be the case (what a surprise). And the modules that do exist are one for a (serious!) vulnerability from 2013, and others that aren't vulnerabilities because they require superuser rights.
Using inflammatory language like "The simple fact is, if you use Postgres, you almost certainly have a security compliance problem" or "If you're thinking, "How is it possible that everybody else is wrong about Postgres being the best?", just remember the decade of Mongo fanboism on HN. I cringed during that era, too" really doesn't help getting the message through.
I've deployed, administered, and developed with PostgreSQL in a regulated, audited environment with external and internal compliance requirements (US healthcare; database contains protected health information). None of the security model issues you mention is anything like a realistic concern in that environment.
And I'd love to see your sources for all your "exploits".
> I haven't seen anybody maintain the grants correctly in production for non-admin read-only users.
Your opinion against mine, but in my last project I successfully did this. I didn't even have to think about it; the client asked me if I could introduce a role for pure-read-only access and it was done in a few minutes. It helped that I'd already set up all objects, roles, and grants for non-admin access.
OTOH, the privileges system in postgres has always worked when and how I've wanted it to work, unlike my experience with MySQL.
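For reference, the read-only role described above is only a handful of statements (the database and schema names here are assumed):

  CREATE ROLE reporting_ro NOLOGIN;
  GRANT CONNECT ON DATABASE app TO reporting_ro;
  GRANT USAGE ON SCHEMA public TO reporting_ro;
  GRANT SELECT ON ALL TABLES IN SCHEMA public TO reporting_ro;
  -- Cover tables created later by the owning role as well.
  ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO reporting_ro;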
----
> Postgres' COPY FROM and COPY TO have been used to compromise the database by copying ssh keys to the server, amongst other things.
> Some constructive advice to the Postgres developers is to take a week and add grant commands to limit COPY FROM and COPY TO
From the docs:
7.1 to 9.2: "COPY naming a file is only allowed to database superusers, since it allows writing on any file that the backend has privileges to write on." (Note that there was no support for PROGRAM in these versions, which was introduced in 9.3, and thus the docs changed to ...)
9.3 to 10: "COPY naming a file or command is only allowed to database superusers, since it allows reading or writing any file that the server has privileges to access." (and after this your request for grantable control was, well, granted ...)
11, 12: "COPY naming a file or command is only allowed to database superusers or users who are granted one of the default roles pg_read_server_files, pg_write_server_files, or pg_execute_server_program, since it allows reading or writing any file or running a program that the server has privileges to access."
And, of course, there's the classic combo of functions with SECURITY DEFINER and EXECUTE grants. For which, again, the doc has always advised: "For security reasons, it is best to use a fixed command string, or at least avoid passing any user input in it."
----
> Postgres' version of upsert allowed any command to be run without checking the permissions. So the vaunted "software engineering" behind Postgres is not that solid.
I'm afraid I couldn't find any info about this thing you mention; and frankly, it's a pretty bold claim. Could you elaborate?
Again, your opinion against mine (and many others'), but the "vaunted" engineering (and _design_) behind Postgres is solid on many fronts, from our experience of running it and using it in many contexts, including security.
The grant model is a little outdated in some respects, but with the right tools it is manageable. At some level you just accept that you need to read the documentation and follow it. It is simply another thing you need to care about, if you need it at all.
I can hardly imagine a case where you have access to the system and install dumps from untrusted sources. That's probably the world of untrusted PHP scripts from Russian forums full of stolen software, a world with a whole spectrum of pains. But I don't agree that this should be considered a weakness. You should never restore dumps from untrusted sources, ever. Such dumps can contain stored procedures with PL/Python code that can do a lot of shady things. Is the freedom to use Python a weakness or an advantage? In the world of script kiddies it's a weakness, but that is not Postgres's problem.
I get all the comparisons, but Redis? I still use Redis as a cache on top of my PostgreSQL queries. PostgreSQL simply can't be as fast as a K/V store like Redis.
> I still use Redis as a cache on top of my PostgreSQL queries
Why? Postgres has a "Redis cache" (in-memory query cache) built in already[1]. Your application layer doesn't have to worry about query caching at all.
While I have no doubt that Redis is faster for slinging blobs around, that blog post is not a great comparison of the technologies. Neither the Django ORM nor the Python psycopg2 driver is a performance-oriented tool.
Technically, PostgreSQL is a DBMS, not a database. I.e. "PostgreSQL is the world's best DBMS", but not "best database". The problem is, if you call the DBMS a database, what do you call the database (the collection of data, tables, etc.)?
When I knew nothing about databases and was trying to read about them on the web, it was very confusing trying to understand what an author meant by the term database: the DBMS (software), or the database (data). So, yeah, it's a problem, at least for beginners.
When I learned you cannot change the order of columns in PostgreSQL, it sounded like one of the smarter but slightly weird kids in a classroom.
On the other hand, MySQL, who isn't as bright as the smarter kid, is easy to talk to, feels friendlier, and is generally the more popular kid.
I do appreciate the strictness of PostgreSQL, but if I see small weird stuff, I tend to pick the one that is easier to get along with (meaning, more resources found on the net).
Also worth mentioning that MySQL has been getting 'brighter' since version 8.
I'll admit to not being an expert in databases, but I can't figure out why the order of columns in a database would be relevant, unless you're doing `select *` and indexing the resulting columns by number rather than name, which I've always found to be fragile to the point of being useless. What am I missing?
It doesn't matter for coding or functionality, but it is nice when you're manually eyeballing the results of 'SELECT * ...' statements or browsing tables with something like Navicat, or looking at schema dumps - to have your fields ordered in some way that makes sense to you, rather than strictly in the order they were created.
Sure, you can list the fields in the order you want in a SELECT statement, but that's tedious - it's handy to have something reasonable in the table definition.
Having migrated a postgres database to dynamodb, I would say that I'm never going to use postgres again.
The problem with all SQL databases is that they are too easy to query and use. You add all kinds of select queries, joins and foreign keys, and when traffic hits you scramble to make it scale. NoSQL is hard to design, but you can at least be sure that once traffic hits, you don't have to redesign the schema to make it scale.
> You add all kinds of select queries, joins and foreign keys, and when traffic hits you scramble to make it scale. NoSQL is hard to design, but you can at least be sure that once traffic hits, you don't have to redesign the schema to make it scale.
Surely this depends on how you set up your SQL database to begin with? I'm not familiar with NoSQL, so can you explain why "schemas" aren't necessary and scaling happens automatically?
You could design a SQL database to be denormalized from the start, but then you are losing many of the advantages of a SQL database.
I never said that schemas aren't necessary, just that using a strict NoSQL database forces you to think about scaling constraints early. This avoids (at least partially) a schema redesign later.
You don't have joins in NoSQL, so you have to keep data denormalized. That means data isn't consistent, and you can have and will have (at a high enough data volume) errors if you aren't extremely cautious about how you update and add new data. That might matter or not in a particular business case.
"In the age old argument about the imperitive vs. the declarative5 programming models, it occurs to me that declarative programming is just imperative programming in a thin disguise. Each declarative token in a database query language ultimately maps to one or several algorithms which apply the declaration in imperitive terms"
Oh, profound, profound!
It occurs to me the author's a n00b.
Anyway, I do MSSQL. Until recently CTEs in PG were an optimisation fence. That would have slaughtered performance for my work. Not that I'm knocking PG, 'best' depends upon the problem.
MSSQL people seem really proud of their CTEs. What is it that makes CTEs in MSSQL so great?
I wonder if that really disqualifies Postgres from the same task. Are MSSQL CTEs just the hammer for your proverbial nail? Can you use a different approach in Postgres to solve the problem by leveraging its strengths?
My day job is a lot of Redshift. We use CTEs and temp tables depending on what we need. It’s based on Postgres but not really comparable when talking about performance optimization.
Here's the last line of my post: "Not that I'm knocking PG, 'best' depends upon the problem." I did not want to get into a war with people emotionally invested in their favourite programming tool. The article said 'best'. By that measure, PG was worse. The point is that it doesn't matter; it's about what problem you are trying to solve and at what price. In many cases PG is best for that - do you understand? I knew people would see it as an attack on PG. It's so childish and predictable. I solve business problems; I don't cheerlead for a particular RDBMS.
As for what made CTEs so great in MSSQL, I just explained that in my first post. Look for 'optimisation fence', and understand that a) PG has changed and b) it was not a personal attack.
> I wonder if that really disqualifies Postgres from the same task.
If you have 100GB of data being materialised as a temp table under the hood instead of having the predicates being pushed down, what do you think?
> Are MSSQL CTEs just the hammer for your proverbial nail?
Jesus, they're just a tool to get a job done. As for whether there's a way round it: probably, but it means more work for the programmer, and more stress for the SQL optimiser, which will at some point fail as complexity climbs.
> We use CTEs and temp tables depending on what we need
It was an honest question, professional to professional. I did not interpret your post as a personal attack nor did I intend my post as an attack on you.
The last time I used MSSQL in anger I was an intern and Michael Jackson was alive. It’s been a while.
My question is: "What is it about your work that makes (or made) MSSQL a better choice than Postgres or something else, specifically because of CTE differences?"
Sorry. When you said "really proud" and "so great", it was perhaps an emotive choice of words.
In the end the semantics of CTEs should be much the same throughout. There may be small differences, and perhaps larger ones such as being able to update through a CTE, something like
    with x as (...)
    update x set ...
I don't know if PG supports this, and it wouldn't break my heart if it didn't. It would be easy to work around.
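For what it's worth, a rough sketch of both forms (table and column names invented, so take it as illustration rather than gospel): T-SQL lets the UPDATE target the CTE itself, while the usual Postgres workaround is a data-modifying statement that joins the CTE back to the base table.

    -- MSSQL (T-SQL): the UPDATE targets the CTE directly.
    WITH x AS (
        SELECT order_id, total
        FROM   orders
        WHERE  total < 0
    )
    UPDATE x SET total = 0;

    -- PostgreSQL: the CTE itself isn't updatable, but you can join it
    -- back to the base table in an UPDATE ... FROM.
    WITH x AS (
        SELECT order_id
        FROM   orders
        WHERE  total < 0
    )
    UPDATE orders o
    SET    total = 0
    FROM   x
    WHERE  o.order_id = x.order_id;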
Also IIRC PG has 'materialized' and 'recursive' keywords. No biggie.
So the semantics are substantially the same. Difference is, how the optimiser treats it. That means you will get the same results back but the time difference may be enormous. This explains it: https://paquier.xyz/postgresql-2/postgres-12-with-materializ...
In my previous work, predicate pushdown (the posh term for the optimiser rewriting the query, as the above link demonstrates) is what made CTEs usable. Without it, performance would have destroyed the company.
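Roughly, the difference looks like this (hypothetical events table, Postgres 12+ syntax):

    -- MATERIALIZED forces the fence: the CTE is evaluated in full,
    -- then filtered. Before Postgres 12 every CTE behaved this way,
    -- keyword or not.
    WITH big AS MATERIALIZED (
        SELECT * FROM events
    )
    SELECT * FROM big WHERE event_id = 42;

    -- NOT MATERIALIZED lets the planner fold the CTE into the outer
    -- query and push the predicate down, so an index on event_id can
    -- be used. Since Postgres 12 this is the default for non-recursive
    -- CTEs referenced only once.
    WITH big AS NOT MATERIALIZED (
        SELECT * FROM events
    )
    SELECT * FROM big WHERE event_id = 42;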
Could we have got round it? Mostly, yes, but it would have been more work for us (meaning more human hours) that the DB optimiser should have saved us from.
The only unique thing CTEs bring to the party is recursion. Otherwise they are no more than a convenience, albeit a great one.
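For completeness, the recursive form, here against a made-up employees(id, manager_id) table; in T-SQL you drop the RECURSIVE keyword but the shape is the same:

    -- Walk a management chain upwards from one employee.
    WITH RECURSIVE chain AS (
        SELECT id, manager_id, 1 AS depth
        FROM   employees
        WHERE  id = 42
        UNION ALL
        SELECT e.id, e.manager_id, c.depth + 1
        FROM   employees e
        JOIN   chain c ON e.id = c.manager_id
    )
    SELECT * FROM chain;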
HTH
EDIT: MSSQL has a good optimiser when it works. I recently wrote a trivial CTE and it took a minute to run - WTF? I looked at the query plan, something was very wrong there, pulled the CTE part out into a temp table, and it ran in one second. A 60x speedup. No idea why; I was not impressed.
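The rewrite was nothing clever - something along these lines (T-SQL sketch, names invented): materialise the CTE's body by hand and join against the temp table, which gives the optimiser a plan it can cope with.

    -- Materialise the troublesome CTE body into a temp table...
    SELECT order_id, SUM(total) AS order_total
    INTO   #agg
    FROM   order_lines
    GROUP  BY order_id;

    -- ...then query against the temp table instead of the CTE.
    SELECT o.customer_id, a.order_total
    FROM   orders o
    JOIN   #agg a ON a.order_id = o.order_id;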
You do realize you are the one getting emotional, because you are clinging to an engine that consistently underperforms in benchmarks across the board and costs you a fortune, right?
Hope the job security is worth being an expert in a bunch of defunct practices that nobody will want in 10 years.
> consistently underperforms in benchmarks across the board
Show me an MSSQL vs PG speed comparison on a recent industry-standard benchmark.
> and costs you a fortune right?
Are you witless enough to think I'm unaware of MSSQL's extortionate, labyrinthine licensing?
> Hope the job security is worth...
Since you apparently can't understand, let me spell it out for you: I've recently installed PG to start learning it because, depending on the problem, it IS the best solution. I do hope you can stop feeling the need to defend your pet software package and understand that these are, in the end, just tools to solve problems.