Why Use Postgres? (craigkerstiens.com)
568 points by timf 204 days ago | 225 comments



https://www.postgresql.org/docs/9.6/static/rangetypes.html

https://www.postgresql.org/docs/9.6/static/functions-range.h...

Holy shit. Why hasn't anyone talked about it sooner? I've seen literally dozens of tables with

    begin TIMESTAMP,
    end   TIMESTAMP,
and with hand-rolled validation against overlaps. And there is even a union operation! Seriously, my mind is blown.
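For comparison, here's a rough sketch of how the same thing could look with a range type (table and column names are made up for illustration):

```sql
-- Hypothetical: one range column instead of a begin/end pair
CREATE TABLE bookings (during tstzrange NOT NULL);

-- Overlap test via the && operator, no hand-rolled validation:
SELECT * FROM bookings
 WHERE during && tstzrange('2017-04-30', '2017-05-07');

-- And the union operation mentioned above:
SELECT tstzrange('2017-04-30', '2017-05-05')
     + tstzrange('2017-05-03', '2017-05-07');
```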

Rails even supports it!

http://edgeguides.rubyonrails.org/active_record_postgresql.h...


People don't know to look for what they don't know.

It's a bit of an issue for something offering unique or novel features, because there's the risk no one will ever find them.

I suspect most people also just go "we'll use a db" and pick between MySQL and Postgres (probably whichever comes to mind first for them), without ever thinking about it more than that.

I wonder whether an official "cool things you possibly don't know about (and when/why they're useful)" page could help? Or maybe a "designing a schema/model? here are some features that might make your life better" page could help spread the word? Postgres is lucky to have Craig as an advocate, and I love reading everything he writes, but maybe someone just getting exposed to Postgres doesn't know to look at Craig's stuff? (Although maybe Google rankings would make such a page appear early enough to do a great service...)

I was talking to another (very experienced) dev about dbs and said I'd always expect to go Postgres because of the extra features (and intentionally couple tightly to it). The response was "yeah, but how often do you actually use them?", which initially shocked me (the answer was "all the time"...), but now that I think about it, it's probably a pretty common perspective...


I'm a bit flattered. The real credit goes to all of the devs who contribute to Postgres, though I'm happy to have been involved in some way and contributed some outside perspective to many of them that are friends.

Of course I'd welcome more people reading my blog, but at the same time there are a lot of others introducing good Postgres content to the world. That was my biggest goal in starting to curate Postgres Weekly (http://www.postgresweekly.com), which focuses on Postgres for app devs. There's a ton in Postgres that, nine years into spending way too much time with it, I'm still discovering each week, so trying to build a better way for others to learn was key.

Sadly, the cool things aren't the most important part. Cool gets a lot of attention, but the really safe and stable things (transactional DDL, rich indexing, MVCC) don't get as much love as hstore, JSONB, or HyperLogLog, which is a shame because they're what really matters when you're running a real business that cares about data integrity.


>> cool things you possibly don't know about

A hugely useful feature of postgres, that nearly no ORM/ActiveRecord library makes use of, is the RETURNING clause of INSERT statements[1] (also supported on UPDATE and DELETE statements, though those are less useful).

A write statement with a RETURNING clause returns a result set just like SELECT, with one row per row inserted, updated, or deleted. In this way, you can request the values of columns whose data was dynamically determined by the Postgres server. This is especially useful for sequence (i.e. SERIAL, autoincrement) values, as well as other columns with default values (such as an insert timestamp, or a column whose value is based on a complex expression).

This functionality is amazing, yet so underused. Try doing a multi-row INSERT in MySQL with an auto-increment primary key and then finding out which IDs were assigned to your rows: you can't do it reliably, because there is no guarantee your auto-increment values will be sequential.

Example query, which will return the auto-increment sequence id and default insertion timestamp of the two inserted rows:

    INSERT INTO members (name) VALUES ('Foo'), ('Bar') RETURNING id, created_at;
[1] https://www.postgresql.org/docs/current/static/sql-insert.ht...
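For completeness, here's a sketch of the UPDATE and DELETE forms mentioned above, against the same hypothetical members table:

```sql
-- Capture the new values actually written by an UPDATE:
UPDATE members SET name = upper(name) WHERE id = 1
RETURNING id, name;

-- See exactly which rows a DELETE removed:
DELETE FROM members WHERE created_at < now() - interval '1 year'
RETURNING id, name;
```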


Some people don't want to use SQL. Or learn it. Or read documentation.

The anti ORM movement convinced only a few interested database users.


Many ORMs like ActiveRecord, the Django one and SQLAlchemy support it just fine, so it's probably just a matter of reading the docs :)


I find the Django docs a bit more digestible than the PG docs, personally.

It's because of the Django docs that I've learned about a few of those awesome fields types. https://docs.djangoproject.com/en/1.11/ref/contrib/postgres/...


Letting go of ORMs is so easy with good lightweight model binding libraries. e.g. Dapper (https://github.com/StackExchange/Dapper)


We use something like Dapper where I work. I think for some projects it's fine, but for others a full ORM can really pay dividends. Especially if you're still working out your problem domain, all the hand-crafted SQL can be a pain to update if you find you need to add just one more field, or figure out you need to split a table up because of a new unforeseen feature, etc.


Dapper is great for reads but it doesn't handle complex updates nearly as well, although I'd argue most of those are architectural/design issues in nature.


How do model binding libraries compare to ORMs w.r.t. performance, e.g. query optimization?


ORMs are doing more work and are naturally going to have more overhead. With Dapper, at least the way I used it, you write your own SQL queries, so there is nothing for it to optimize. I've even run into some nasty SQL Server parameter sniffing issues. But despite that I still prefer the style over ORMs.


MyBatis is great for Java as well.


> The anti ORM movement convinced only a few interested database users.

IMO this was because they were too dogmatic and ignored the things ORMs did better, focusing on performance that didn't matter or on issues that come up even when you don't use an ORM (n+1 in the architecture).

That said, I don't think anyone who hasn't learned SQL really understands their ORM either. It's one of the fundamentals any developer should know before they ever write production code.


A rather important library we use at work is built on Postgres range types. And we interact with it through an ORM (a co-worker of mine will be presenting how it works, and the open-source release of it, at PyCon this year).


ORMs are a tool. Arguments for and against them are tired and irrelevant. Use the tool if and where it helps.


And you can create constraints to ensure only unique ranges in a table!


Not just unique ranges (which is trivial), but also non-overlapping ranges!


Sorry, that's what I meant!


There goes a slab of my painfully thought out test cases.


Careful with this if you want to use it for partitioning data (this was at least still true in 9.6 in my experiments). You will never get constraint exclusion to kick in and will consequently end up scanning all the child tables in vain.

Hopefully the partitioning thing is soon fixed 'properly'.


Um, can you explain it like I am 5, why this is a Holy Shit thing?


Because everyone who has been constrained to other databases has done things like "SELECT start, end FROM reservations", then tried to write the WHERE clause to find overlapping ranges, and had bugs (this person reserved a room for longer than the expected time).

At the end of the day, they ended up selecting all the start and end dates and then building their own interval tree ( https://en.wikipedia.org/wiki/Interval_tree ) and doing overlap detection themselves.

If they were using a database with good stored procedure support and had the skills to write the stored procedure or function in the database, they might then fight with writing the overlap detection there... but they likely found it a profoundly unpleasant experience.

And then here's PostgreSQL. It has this natively. Hours, days, weeks of time spent writing, testing, debugging and otherwise fighting with this problem suddenly become something along the lines of "SELECT isempty(daterange('2017-04-30', '2017-05-07')) FROM reservations" or something to that effect.

It represents a significant time savings when you need it for things that are more interesting than reimplementing data structures that aren't in the standard libraries.
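The overlap test everyone keeps hand-rolling is small but easy to get wrong (the boundary conditions especially). A minimal sketch of it in Python, with made-up reservation dates:

```python
from datetime import date

def overlaps(a_start, a_end, b_start, b_end):
    # Two closed intervals overlap iff each one starts
    # no later than the other one ends.
    return a_start <= b_end and b_start <= a_end

# '[2017-05-05, 2017-05-08]' conflicts with '[2017-04-30, 2017-05-07]'
print(overlaps(date(2017, 4, 30), date(2017, 5, 7),
               date(2017, 5, 5), date(2017, 5, 8)))    # True

# '[2017-05-10, 2017-05-12]' does not
print(overlaps(date(2017, 4, 30), date(2017, 5, 7),
               date(2017, 5, 10), date(2017, 5, 12)))  # False
```

This is only the two-interval case; finding all conflicts in a table efficiently is where the interval tree (or Postgres's GiST index) comes in.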


Specifically, here's how that would be done in Postgres:

  create table reservation1 (duration tstzrange);
  select * from reservation1 where duration && '[2017-04-30, 2017-05-07]';
In a database without range types, it would look like this:

  create table reservation2 (start timestamptz, stop timestamptz, check (start <= stop));
  select * from reservation2 where '2017-04-30' <= stop and start <= '2017-05-07';
Personally, I don't think range types by themselves are anything special. However, combined with indexes and exclusion constraints, they're really great:

  create table reservation1 (duration tsrange, exclude using gist (duration with &&));
  insert into reservation1 values ('[2017-04-30, 2017-05-07]');  -- ok
  insert into reservation1 values ('[2017-05-10, 2017-05-12]');  -- ok
  insert into reservation1 values ('[2017-05-05, 2017-05-08]');  -- error!
Here's the difference indexes made when retrieving all the scheduling conflicts with a fixed interval (500k rows):

  No index (start, stop)        123ms
  Btree index (start, stop)     29ms
  GIST index (duration)         0.2ms
Quite a difference!
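For reference, the indexes behind those timings would be created roughly like this (a sketch against the tables defined above; the exclusion-constraint variant gets its GiST index implicitly):

```sql
-- Btree on the two-column variant:
CREATE INDEX ON reservation2 (start, stop);

-- GiST on the range column, which supports the && operator directly:
CREATE INDEX ON reservation1 USING gist (duration);
```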


I have known about these for quite some time, but my only issue is that not all drivers in a particular language support range types.


If your driver or ORM is insufficient, you can use views:

  create table reservation (room text, duration tstzrange);
  insert into reservation (room, duration) values
      ('A100', '[2017-04-30, 2017-05-07]'),
      ('A100', '[2017-05-10, 2017-05-12]');
  
  create view my_orm_sucks (room, start, stop) as
      select room, lower(duration), upper(duration) from reservation;
Let's pretend building A caught on fire, so we have to move to building FM:

  select * from my_orm_sucks;
   room |         start          |          stop          
  ------+------------------------+------------------------
   A100 | 2017-04-30 00:00:00-07 | 2017-05-07 00:00:00-07
   A100 | 2017-05-10 00:00:00-07 | 2017-05-12 00:00:00-07

  update my_orm_sucks set room = 'FM200' where room = 'A100';
The underlying table gets updated:

  select * from reservation;
   room  |                      duration                       
  -------+-----------------------------------------------------
   FM200 | ["2017-04-30 00:00:00-07","2017-05-07 00:00:00-07"]
   FM200 | ["2017-05-10 00:00:00-07","2017-05-12 00:00:00-07"]
Note that updates involving the start and stop columns won't work with a view like that. You can use the rules system to get around that:

https://www.postgresql.org/docs/current/static/sql-createrul...

https://www.postgresql.org/docs/current/static/rules-views.h...
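An untested sketch of what a rule for that view might look like (bound inclusivity is glossed over, and matching on the old values like this assumes rows are distinct):

```sql
CREATE RULE my_orm_sucks_update AS
    ON UPDATE TO my_orm_sucks DO INSTEAD
    UPDATE reservation
       SET room     = NEW.room,
           duration = tstzrange(NEW.start, NEW.stop)
     WHERE room     = OLD.room
       AND duration = tstzrange(OLD.start, OLD.stop);
```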

Or better yet, fix the driver or ORM!


Thank you for that idea with views, I had not thought about that perspective. It will come in handy if I have to use ranges in the future. I do think ranges are a better way to model things.


Which language(s)?


not a single driver in go supports them


Well, that seems to be a recurring issue with go.

Not nearly as mature libs as Java.


I am using Go, that is where I ran into the issue. But for the most part, I just stick to ANSI sql and I am ok.


Couldn't you just have a constraint based on more than one column? In fact, for many applications there are different types of resources (rooms, people, etc.) which you might want to be unique within anyway.


You would still need PostgreSQL's exclusion constraints to handle overlapping ranges, since unique constraints do not handle checking for overlap, only exact equality. The example below guarantees that the same room cannot be double-booked.

    CREATE TABLE reservations (
        room text,
        during tstzrange,
        EXCLUDE USING gist (room WITH =, during WITH &&)
    );
You could also still have two columns and use an expression.

    CREATE TABLE reservations (
        room text,
        start timestamptz,
        stop timestamptz,
        EXCLUDE USING gist (room WITH =, tstzrange(start, stop, '[]') WITH &&)
    );


EXCLUDE constraints require the range to be a single value. (You could EXCLUDE over an expression based on both columns but it's nicer to have one.)


What do you mean by nicer? Not all drivers support range. Why is it better than an exclusion over an expression with two columns? (with simpler data types that the driver certainly supports)


Far too brief.

I would appreciate a really long text that would in a convincing manner explain why Postgres Is so awesome.

I work in the industry, and all I see are Oracle and Sybase everywhere. The experts are zealots too, many not even having heard of Postgres, and not willing to believe a word I say about it.

I am already convinced of course, but the industry is not. Not finance, not trading, not telecom.


It sounds like you're not so much in "the" industry but a subset of the industry that doesn't consider free open source software to be an option.

If you're fighting the attitude that the only viable option is one that costs a ton of money instead of a decision based on technical merit, additional facts about Postgres won't help.

Sometimes the decision gets made on a golf course and not after reading very long texts.


>Sometimes the decision gets made on a golf course and not after reading very long texts.

This is why I swore to only write software at software companies some years ago. For other industries software is just another expense like buildings or supplies. The people buying it have no idea what they're looking at so they buy the shiny thing that everyone else is buying.

When I worked in other sectors I regularly saw executives attempt to assemble software as if it was a large building. Materials were gathered, agreements were made to purchase or rent heavy equipment, and the whole project was planned out with exacting detail. When all the permits and contracts were in place construction would begin. This was to be done in a measured amount of code using the chosen architecture, and delivered complete on a certain date. Any deviations were seen as workarounds for not following the original plans, and overruns the result of incompetence. The whole thing was just a shitshow of incompetence at the higher levels.


You're correct that non-software companies will simply outsource, but not quite correct that it must be closed source - software is simply something they don't have much expertise in and never will. The problems only show up when software becomes such an integral part of business operations that you really do need to do well at software or risk a lot of business. The reason it tends to be closed-source companies is that most of the relationships non-software companies have were formed back when open source was not a very viable foundation for one's software.

Demanding and exercising the use of a throat to choke is a massive, massive part of how at least American business works, due to prevailing leadership styles. It also shows how they tend to view their workers. "Blameless culture" is practiced by a tiny, tiny minority and only seems to have worked out well among socialist-ish boutique software companies; until we start seeing companies that practice traditional Taylorist and authoritarian leader-worship cultures fail very disproportionately, nothing will fundamentally change.


I think that is another reason why some companies outsource all aspects of IT, so they can blame someone. Who cares if it takes 10x the time compared to having the skillset in house when you can just point at the 3rd party and blame them. Doesn't matter if it is the same person selecting all the 3rd party providers who continually are horrible at their jobs, we are saving money!


I guess I already covered that with the culture of desiring throats to choke. Whether it's a vendor or your own employees, leaders that rule by fear like to find scapegoats to shift blame away from themselves and take credit for others' efforts. The people who know better are effectively kept away from any form of power, or under-resourced, so they'll always be too small to succeed.

I'm familiar with a number of very large deals done basically C-level to C-level, where the scope of the IT project has nothing to do with technology and is entirely about cost savings - literally "I will save you $n MM / yr in opex so you can get your bonuses" - and other vendors get shut out. Sometimes these deals work out; other times they don't and the executive is basically ousted. Companies with bad politics and enormous cronyism may have worked fine for decades, but they just may not be doing as well anymore - unless you're on Wall Street and make so much money it doesn't matter how it's done.


well, yeah, obviously.

Software is not the end-goal itself. The point is not to make (or use) amazingly elegant software. The point is to make money.

If a supplier says "I will provide the same service as you are currently getting and cost you $X less" then that's a no-brainer regardless of what service they're providing. It's got nothing to do with technology, and technology doesn't change the nature of that decision.

"Having a throat to choke" is also a matter of insurance. You can't insure against your own incompetence, but you can sue a supplier for not fulfilling the terms of their contract. Executives would much rather negotiate what they think is a tough contract with a supplier than manage a complex project themselves. To that mindset, the removal of risk (because if anything goes wrong they can sue the supplier) is a huge bonus.

It's a totally different mindset from those of us who actually make things.


The "nobody ever got fired for hiring IBM" mindset, I guess...


It's worse than that. You're not just choosing a reliable solution, or even a solution with a conservative reputation so that your butt is covered. You're choosing a solution which may actually be worse because when you fail to deliver for someone else, you have someone to scapegoat. If there is nobody else to blame, you might take the blame for having chosen open source with no specific party to blame (the hot potato stopped with you).


Is that not just plan driven development? (I.e. waterfall)


It's not about technical merit or cost - it's about service contracts.

I was a Sybase point person for years at a fortune 50 medical devices enterprise company (we had revenues of well over $1B annually on devices running Sybase dbs). There were dozens of bugs/issues I found that I pushed up the chain to an engineer and had special patches turned around within 48 hours, sometimes even hours.

Before I left that job I started playing around with Rails and mentioned MySQL and Postgres for a potential greenfield project. I was told it would be fine for internal use but no way, no how were they going to deploy any software without that kind of parachute based on economic leverage.


But you can pay for this level of service for PostgreSQL too, there are a bunch of companies which offer it, and I bet you can get it for a fraction of the price you would buy the same service for from Oracle or Microsoft. So this seems to me to mostly stem for a poor understanding of open source.


Some years ago - never mind how long precisely - I was one of the engineers at Sybase who would have produced that EBF patch for you. More recently I do the same sort of thing for clients running Postgresql. I have worked on the code of both db servers, been involved in bug hunting and fixing as well as support escalation for both, and interacted directly with users and developers of both. If my life depended on a medical device, with a choice of one running Sybase and one running Postgresql, I would choose Postgresql, in a heartbeat (so to speak).


FWIW, there are several firms providing such services for PostgreSQL too. At $previous_company, others and I, as PostgreSQL committers/contributors, provided escalation for exactly such cases. I know they, and others, still provide such services. (Not naming names right here, so this doesn't come across as an ad.)

Often enough you'll even get similar turn-around times from the community, if you can't (or don't want to) afford such a support contract.


> no way, no how were they going to deploy any software without that kind of parachute based on economic leverage.

Making software work at scale without that economic leverage could never work in theory. It only works in practice.


Those who made it work in practice fixed bugs in MySQL themselves and had to make it more performant themselves, which basically means a larger development team and more time. The parachute is meant to avoid that expense.

Oh, and those who did it were operating above Oracle's scale in terms of data size, making it a rational decision too. The calculation is still different for the majority of companies.


I've gotten better support out of the Postgres mailing list than I have out of any commercial contract.


I spent several years in securities and derivatives trading and the most frequently cited reason for avoiding open-source software I heard was that, in the event of a major foul up, there was no one to sue if you got sued yourself. It's not that difficult for an attorney to paint you as reckless for using "free" software.


I spent 13 years writing the core trading system for many of the well-known exchanges. We used open source wherever possible because the software tended to be more reliable. That said, clients usually got to pick the database, and we used Sybase a lot. I have been using Postgres every day for the last eight years since, and I really like it, but the planner is quite a bit worse than Oracle's or SQL Server's. The Postgres planner is still way, way better than MySQL's, which still has correlated subqueries explode into cartesian joins. MySQL is great as a data store, but it's more of a replacement for NoSQL than an advanced query engine.


MySQL's planner is predictably stupid; structure complex multi-table predicates as joins (nested if necessary) rather than subqueries and it's almost imperative. Postgres OTOH is very unpredictable; sometimes it does the right thing, and sometimes it does something amazingly asinine, where simply swapping a table between from vs join clause can result in 1000x speedup.

Specifically, I've seen pg take a query that looks like this:

  select ... from a join b join c join (select ...) d
where a has millions of rows and is an equijoin with d where d has 10 rows, and it decides to materialize a x b x c, only joining in d at the last step. But do it like this:

  select ... from (select ...) d join a join b join c
and it does the right thing! And analyze gets it right (i.e. the plan for the reordered joins is recognized as better) - never mind genetic optimization, it's lacking analytic optimization.

With the lack of hints, almost the only tool you have to control query plans effectively in postgres is parenthesized joins. Since it's more liable to rewrite the query, the language ends up being less imperative, and thus less predictable. And I like predictability in production.

SQL-level feature set is no comparison of course, pg wins easily.


There are settings for choosing between the exhaustive search planner and the genetic planner. The exhaustive planner is better, but can be slow for complex queries with a lot of paths. But, if your query is at all time consuming you probably want to increase geqo_threshold and geqo_effort as well as join_collapse_limit and from_collapse_limit.

I'd also suggest disabling nested loops entirely (enable_nestloop = off) if you are having problems with bad cardinality estimates resulting in nestloop plans that run 100 times when the planner estimated once.
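Those knobs are all plain session settings, so experimenting is cheap; e.g. (values are illustrative, not recommendations):

```sql
SET join_collapse_limit = 12;  -- let the exhaustive planner reorder more joins
SET from_collapse_limit = 12;
SET geqo_threshold = 14;       -- defer the switch to the genetic planner
SET geqo_effort = 7;           -- make geqo try harder when it does kick in
SET enable_nestloop = off;     -- last resort for bad cardinality estimates
```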


An interesting argument for the predictability of mysql. Great observation.

It is interesting to see how PostgreSQL will often choose a hash join, even with very up-to-date statistics and much better paths available.

SQLServer's planner does an amazing job of digging right into joins/sub-selects to constrain preliminary results for joins.

It's a very hard job and MS and Oracle obviously have had some of the best people on the world paid well to work on this.


Show me an enterprise DB license that offers you better indemnity/liability options than open source. (I negotiated them on behalf of huge clients for many years.)


It's not so much the license per se. It's that the setup is done by a third party that can be blamed when something goes wrong. I saw one contract that specifically said they were insured for $1 million in damages.

The company for the longest time wouldn't even touch basic firewall rules without having the firewall contractor implement it.


This is so true, it hurts. I call it The Triangle Of CIO Turnover.


In my experience with a couple of banks, it comes down to support. A lot of systems in banks are written for longevity, so they frown upon software that might be obsolete, or that people might stop working on, 5 years down the line. Paid software, they reason, can at least be expected to keep releasing new versions, while free software might not provide enough incentive for people to work on it continuously.

Many also think looking up issues on stackoverflow, google or blogs as unreliable. Then there are times when issues might be specific to installs or data, in which case sharing the logs/sample data (even masked ones) can be risky. They feel comfortable sharing logs/masked data with for example Oracle because they believe it to be safe and locked under Oracle's security guidelines.

The 2nd refrain I hear is - security. In case of a major security issue being revealed, there is a general sense that FOSS will be slower to react in releasing a "stable" patch. Comparatively paid software take it as a reputation risk and work towards quickly releasing a "stable" patch.

If people have to use FOSS, then they try and search for the paid support flavor. Recently we were looking at MQ software. When we zeroed in on RabbitMQ we were asked to deploy only the paid Pivotal version and not the free version because "support".

Sure, these things might not be completely true but for many higher ups paying for something somehow makes them sleep better at night than a "free" alternative.


This is wrong on so many levels, I suppose you mostly understand it, but here are the counter arguments:

> Written for longevity

OSS is much better at longevity than proprietary software. Even if the authors all die without a will, it is possible to fix the little bugs that prevent you from using the software on [NewTechnologyHere]. I've done it countless times with Java software; if anything, OSS is the guarantee that you own your future and that the system will survive as legacy.

> Use paid flavor

It's good, but what's better is joining the golf club of a principal maintainer. The key is paying him to fix the issue you're having quickly and merge it upstream.


I think there is more to this than that, though. Software companies usually run R&D at about 10% of revenue, so these support contracts are really returning only 10% of their cost. And the companies who are buying the support contracts are usually big. The money they spend on 10 developers' worth of support could pay for 100 developers once you add in license fees, etc.

Long, long ago when I worked at Nortel (a now defunct, but then huge telecommunications company), they used to pay millions of dollars a year to Cygnus to support a particular embedded version of GCC. This, despite the fact that Nortel had more than 10k programmers on staff including a compiler team!

I think the real reason these support contracts exist is because companies (even large ones) don't want to dilute their focus maintaining projects that are peripheral to their core business. It's not so much a technical problem, or a money problem -- it's a management problem. They can't scale out to handle every little thing.

I think OSS is a red herring in this conversation. Most companies just don't care about that. They don't want to support it themselves (even if they are big enough to do so), and they need to have confidence in the company that provides the support. Build that company (hint: you need to be sales heavy!) and you could sell Postgresql just as easily as any other database. Of course breaking into an entrenched area in Enterprise software is always going to be difficult, so I'm not sure how successful you would be with this particular product, but you get my point, I think.


There are several companies which sell PostgreSQL like that with EnterpriseDB and 2ndQuadrant being the two largest ones. It seems like these companies are at least semi-successful since they hire more people all the time. So I agree with your idea, that you just need to convince the enterprise customers that you are a reliable partner.


Most enterprises don't want to own coders to fix/hack OSS. They want a solution, roadmap and 24x7 support.


Sure, OSS has a longer lifecycle because a dev can look up the source code and fix it. But companies don't want to spend twice. In the case of a DB, for example, they would rather have a DBA manage the DB. They don't want to hire a developer and a DBA - that's how they view it. Sure, you can find someone who is good at both, but they are few and far between. It is much easier to have an Oracle DBA manage Postgres with paid support than to find a developer with enough programming under their belt to ensure they can take care of Postgres issues.

As hindsightbias puts it, they want solutions and 24x7 support.


And how often do securities firms actually sue Oracle, Microsoft, IBM, Sybase, and the other current and previous database giants?


Nobody ever got fired for buying IBM.

I think that is the better point, or a rewording of yours.


The state government of Queensland (Australia) has forbidden any further contracts going to IBM.

https://www.itnews.com.au/news/queenslands-ibm-ban-lives-on-...


Some of the money gets funneled back into salespeople; that's how enterprise software gets bought. The salespeople target the CTO or another single point of contact.


Regarding "The industry". Okay.

Let me rather call it, The Real World.

I work in a Fortune 500 company, a consultancy of 400.000 people, on projects for other Fortune 500 companies.

My experience is from the real world.


I guess it depends. We are migrating away from Oracle towards PostgreSQL. With Oracle you quickly need the enterprise edition, or flaky third-party software, for basic features like replication. We could hire several dedicated PostgreSQL DBAs for that money. And the licensing is just too idiotic, with no list price anywhere; it usually takes us a month to get a quote. Not to mention the client libraries and how cumbersome they are to deal with. Good riddance.


No one ever chooses Oracle or Sybase on technical merits.

That is, although those two products may indeed have some merit, that is not why they are chosen.


As somebody spending a large amount of my time working on postgres, and obviously being convinced PG is good: Of course people sometimes choose oracle on technical merits (I've not worked with modern sybase versions much, so I can't judge, although some sybase IQ numbers look interesting).

Golden Gate, RAC, Management Tools, stable plans, decent parallelism, decent partitioning, some columnar stuff - that's not just marketing fluff.

Oracle's politics / sales tactics and cost are one large argument against it; being able to influence feature development / add features yourself is another. I've seen these drive companies - including big financial ones - away from Oracle over time. Often it doesn't start with the business-critical stuff, but with some smaller project, and then grows over time.


What does PG need to improve, then? I see a lot of discussion of how Oracle is better at this or that but not much discussion on how PG will come to parity and how we'll know when it's finally good enough to use.


> What does PG need to improve, then?

There are some things (better replication out of the box, higher performance).

> but not much discussion on how PG will come to parity

That's because this subthread started with "No one ever chooses Oracle or Sybase on technical merits." - neither Postgres's strengths nor its needed/planned improvements are relevant to refuting that position.

> on how PG will come to parity and how we'll know when it's finally good enough to use.

Just because Oracle has some features that postgres doesn't match doesn't mean it's not good enough. There's a lot of features where postgres is further along than Oracle, too. For a good number of OLTPish workloads postgres is faster.

We're talking about large and complex products here - it's seldom the case that one project/product is going to be better than all others in all respects. Postgres has been good enough to use for a long time.

If you're interested in which areas postgres needs to improve, I'm happy to talk about that too.


I've used Oracle. It's awesome. Expensive, but awesome. If I have to pay, Postgres is OK. But Oracle is really, really good. If they were laptop operating systems, Postgres would be Debian and Oracle would be macOS.


Did you know that Oracle forbids "unauthorized benchmarks" for any users of its database products? It's curious that a database with such amazing performance characteristics would not allow third party benchmarks against other products.


[citation needed]


Here's a link to a site I use frequently for these sorts of things. I apologize for not including a citation, I figured most people were familiar with it or similar services. https://www.google.com/search?num=100&q=oracle+benchmarks+ag...

The first result is a blank copy of a license agreement that they presumably forgot was on their website.


Call me stupid all you want, but I was looking for a clause that prohibits running unauthorized benchmarks, and neither my nor your search results corroborate that.

You aren't allowed to publish benchmark results, a condition that is both upsetting and not at all unique among commercial databases: http://m.sqlmag.com/sql-server/devils-dewitt-clause


Search for DeWitt clause and you'll find plenty interesting history.

This unfortunately has become very common.


I was at an AWS Aurora talk and someone asked the presenter (VP of DB engines) if there is a benchmark comparing Aurora to Oracle and he said that they are unable to do it because of licensing agreements.


In this particular case I think we can do fine without.


I was doing advanced Oracle DBA work.

It is awesome in a way, but it had some awful bugs (I lost hours to installer bugs) and was loaded with dark patterns IMO (expensive features were one click away).

Edit: and can we stop pretending that OS X is better? It is different. Some people like that. Others have just as legitimate reasons to stay away.

I spent 3 years on a Mac and went from really enthusiastic to really disappointed. I still defend others right to prefer it though and hope you'll defend my choice as well.


Pretending that OS X (or macOS) is better?

Having been exposed to Linux on the desktop for more than a couple of decades, in addition to using OS X since 2003, my subjective opinion is that it's not only far more mature, but better in almost every conceivable way.

That goes for casual users to developers. The ecosystem from Apple is maybe not perfect, but I still dare to use the word fantastic.


I have a number of reasons to dislike it strongly [0].

But we are not supposed to argue over such things here.

I'm just asking that Mac people respect that I and others way prefer other OS-es like Linux, BSD or even Windows.

[0]: like 1.) not having consistent shortcuts for moving around with the keyboard 2.) with two monitors, the menu bar is very far away when you work on the other monitor 3.) one Chrome window would block the other, preventing me from finding the instructions on the wiki while a file-select box was open in the other.

Etc etc. This is before I start my rant about things more unrelated to their OS implementation, like a) putting fn in the bottom left corner b) not giving me any chance to fix it c) the fact that many programs I wanted to use were either unavailable or looked horrible.


Can you give a bullet-point list aimed at someone who is very experienced with Postgres but has never used Oracle? I'm curious about the corner cases and killer features that make you love Oracle so much.


I don't really like Oracle DB, but there's one feature that does stand apart from Postgres. Oracle has an internal scheduler that works much like an OS scheduler, allowing you to set priority and resource usage for each user or connection. This made Oracle the go-to database for anyone that needs multi-tenant support but wants to allow tenants to access their own database. If you allow database access, it's trivial to create a really slow query, and the resource limiters prevent one tenant from ruining performance for everybody.

The best example of this is Salesforce, which has its own proprietary SQL-like query language that's clearly just a front end generating raw SQL to feed their Oracle DBs. Without Oracle's per-tenant limits this would be far too risky because of idiots writing bad queries.

A better solution these days is to put each tenant in a Postgres container and let the OS control resource limits, but this wasn't an option until recently.
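For what it's worth, Postgres does expose a few per-role knobs, even if they're nowhere near Oracle's Resource Manager. A minimal sketch (the role name `tenant_a` is made up); note this only caps runaway queries, memory and connections - there's still no CPU/IO prioritization inside Postgres itself:

```sql
-- Cap how long any one statement from this tenant may run
ALTER ROLE tenant_a SET statement_timeout = '5s';

-- Limit per-sort/hash working memory for the tenant's sessions
ALTER ROLE tenant_a SET work_mem = '16MB';

-- Cap the number of concurrent connections for the tenant
ALTER ROLE tenant_a CONNECTION LIMIT 20;
```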


Their own language prevents any kind of expensive query. Besides that, they limit the amount of everything you can think of (queries, CPU time, memory, calls, query time).

Killing connections can be done in PostgreSQL too. The reason for Salesforce to be on Oracle is probably history.


Is that feature available in Microsoft SQL Server?


Yes, there are resource controls in MS SQL server as well.


Disclaimer: huge postgresql fanboy, use it whenever I get the chance

A few things I've found great in Oracle that aren't (AFAIK) available in PG:

    - Straight better/more reliable performance on average
    - More advanced parallel queries (obviously this is changing in PG right now)
    - Flashback queries
    - Better materialized views
    - Plan stability (maintains predictable query performance, rather than the nasty jumps you sometimes see when plans change)
    - Better clustering story (RAC is super expensive but pretty good)


One thing I remember was the automatic query advisor.

It was actually good and very easy to activate. If you did activate it, though, you could expect a sizable extra invoice after the next audit.


If you're going to make claims like these that will raise a few eyebrows, please give some details and reasons why you think Oracle is awesome. Otherwise, your comment is fairly useless.


It's not that I don't trust you but we are a technical community and I believe we all appreciate some more details...


Have you used 2017 OSX? Because comparing something to OSX doesn't make it sound "really really good".


And by that I assume you mean macOS 10.12.


Your last sentence isn't favorable for Oracle, in my opinion; the opposite in fact.


lol. Any meat to this? A specific feature missing in Postgres that makes it worth spending a few hundred thousand on Oracle?


Having worked with both Oracle and Postgres, there have been a few features in Postgres that I wish were built out more completely, but nothing insurmountable, and they have been improving those (e.g. partitioning). Postgres is incredibly reliable - I worked on a couple data warehouses in Oracle and found a bunch of defects, and so far none in similar systems in Postgres.


I work in the industry too, and I didn't even realize Sybase was still around. It's a big world out there, and there are little subcultures with different preferences. Bigger tech companies seem to prefer MySQL, old enterprise companies use Oracle, .NET people use SQL Server, and smaller Rails shops seem to use Postgres, in my experience. But it's interesting to hear a different perspective.



That's finance and trading. In web services, "Oracle" is a bad word. Like. Paying money? For a database system?! That sounds extremely alien to me


Not sure if you've actually worked in web services, but it's very common to pay for database systems.

Netflix, eBay, Apple, Sony (for PSN), Spotify are all Datastax customers paying for the commercial version of Cassandra.

Facebook, Foursquare, eHarmony, Buzzfeed, LinkedIn are listed as paying customers of MongoDB.


Yeah, sure, big companies pay for support/consulting! But Cassandra and Mongo are FOSS, you don't have to pay for the right to use them.


I have experience with Oracle, Postgres, MS SQL Server and DB2.

Postgres is in the same league as the relational enterprise SQL database engines. Postgres even offers Oracle PL/SQL-compatible syntax, so you can think of Postgres as the Linux of relational databases (Linux is a clone of Unix). If you need advanced SQL syntax, XML support, complex triggers, inlined procedural code, GIS, etc., look no further and choose Postgres. If you just want to hit your DB with thousands of connections from your web application frameworks and public APIs, or are thinking about easy clustering, it might be a good idea to add a caching layer in front like memcached/Redis, or read on... (the forking model doesn't scale that well)

And then there is a unique piece of database software with a common SQL dialect that supports dozens of storage engines. It's called MySQL, and it supports pluggable engines like InnoDB (the default; true web scale, very fast), MyISAM (old 1980s-style features, very fast), etc. MySQL can handle many concurrent connections, and InnoDB is really good; that's why it's a very good fit for web apps using basic SQL features, and why it's used by Google, Twitter, Facebook, etc. as a main production database.

And there are NoSQL databases with different kinds of features, like MongoDB/RethinkDB, Lucene (Solr/Elasticsearch), Hadoop (HBase, etc.), Cassandra, etc. They often have just a basic SQL-like query language, or none at all, but rather a custom API to interact with the datastore. These domain-specific solutions are often very fast for certain use cases. Some have limited index support, limited join support, limited transactions, etc., so it really depends. E.g. as a JSON datastore or for full-text search, they are ideal solutions.

We would need a database guide that highlights the common open source database engines, and provides a transition guide and compatibility matrix against the legacy binary-blob database engines - the real competitors to open source. People in open source communities often go fishing in other open source communities and try to convert them - instead, look no further than your corporate colleagues and enterprise fellows and try to convert them away from their rusty databases.


You seem mistaken about the real costs of the forking model. To solve the cost of opening new connections you just need a connection pooler. The need for caching is not really relevant here. PostgreSQL can easily handle hundreds of thousands of read queries.

The real costs are:

1. Need for a third-party connection pooler (built into many client libraries like JDBC, ActiveRecord or Sequel). Some poolers like pgbouncer lack support for prepared statements.

2. Memory usage can become an issue, especially if you have tens of thousands of stored procedures, since stored procedures are compiled and cached per connection. Also, working memory for sorting is per connection, which means PostgreSQL will use more memory than MySQL on some workloads.

3. Having to recompile all stored procedures on first use after reconnecting to the database can be an issue.
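For reference, a minimal transaction-pooling pgbouncer setup looks something like the sketch below (the database alias and file paths are made up); transaction pooling is exactly the mode where prepared statements stop working, since consecutive transactions from one client may land on different server backends:

```ini
[databases]
; hypothetical alias mapped to the real server
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; one server connection is shared per transaction
pool_mode = transaction
; accept many cheap client connections...
max_client_conn = 1000
; ...multiplexed onto a small number of server backends
default_pool_size = 20
```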


In finance, no one will trade on-prem Oracle/Sybase installs for open source databases, even if they were free (which they are not - you always have to pay for support). A couple of million $$ doesn't make a difference compared to the few hundred million $$ that would have to go into replatforming, tools, and skills.

In the cloud/managed space, Postgres is also hit and miss - still no good cross-region option, and frankly, other than AWS RDS, not many other managed Postgres services.

So, MySQL rules the cloud/managed databases, and oracle/sybase/mssql rule on prem.


My company does managed Postgres on AWS, Azure, Google Cloud Platform, DigitalOcean and UpCloud - https://aiven.io/postgresql. Compose offers managed Postgres (https://www.compose.com/postgresql), as does ElephantSQL (https://www.elephantsql.com/). Database Labs is one more provider (https://www.databaselabs.io/).

Google recently added beta version of Postgres to Cloud SQL as well (https://cloud.google.com/sql/docs/postgres/). Of course, AWS RDS and Heroku have been on the market for some time.

I'd claim managed Postgres market is in a pretty good shape.


Google Cloud has managed Postgres:

https://cloud.google.com/sql/docs/postgres/


They announced that just a few weeks ago. And what they state on the site you linked to proves the parent's point about managed Postgres being hit or miss:

"This is a Beta release of Cloud SQL for PostgreSQL. This product might be changed in backward-incompatible ways and is not subject to any SLA or deprecation policy."


Azure offers a managed version of mssql, which I'm quite happy with.


I really wonder where you work then. It sounds really weird.


You'll find similar attitudes in utilities, manufacturing, resource extraction, government.


To convince them, the guide would also need more words from the enterprise world, like OLAP and OLTP, XML, SOAP, etc.


great read, but may I also suggest Array?

being able to have a field like:

  ingredients VARCHAR[]
and index like:

  USING GIN (ingredients)
and using an operator like @>

  SELECT * FROM table WHERE ingredients @> ARRAY['mushrooms', 'sour cream']
gives you such amazing flexibility and speed, it's not even funny.

also, while I was sad to not see PLV8 up there with PostGIS (an amazing extension, btw), I was still happy to see it mentioned with such gusto.


That's mentioned in the write-up from five years ago of which this is an update.


Can you give me an example where an array would be more beneficial than having another table with {id, name}? I personally have never found a use for them.


simplicity, for one - write a quick query to retrieve all recipes that contain all of any number of ingredients.

  SELECT * FROM recipes r
   INNER JOIN ingredients i ON (r.id = i.recipe_id)
   WHERE i.ingredient IN ('mushrooms', 'sour cream')
load up some data and run explain analyze on both schemas/queries.


I could think of denormalization; e.g. you have an entity that can have 0..n tags and you want to speed up retrieval - I could see arrays being a faster way.


Quite a bit faster, actually, depending on the dataset.


Postgres seems to have become the go-to relational database ever since MySQL fell in the hands of Oracle. Can anyone speak to how its json tree compares to MongoDB's document store in practice?


I don't think postgres is the default over mysql.

I think WordPress and php guys default to MySQL.

I think python and Django guys end up defaulting to Postgres.

Rails and node.js I am less sure what the default is, maybe mongodb?

I don't think either camp has really changed since Oracle came along. A lot of momentum in different stacks.


The actual default for Rails is SQLite [1], though a lot of people (perhaps a majority) use PostgreSQL. You certainly can use MongoDB with Rails via Mongoid, though I'm not sure vast numbers of people do, comparatively.

For Node, well, it varies, but at least once upon a time, people used to talk about the "MEAN" stack (MongoDB, Express, Angular, Node) - MongoDB and Node are certainly often used together.

[1] Not MySQL - edited as per the below correction.


The default for Rails is sqlite and has been for quite a while now.


You're quite right. It's so long since I've used Rails with anything other than Postgres I knew that it couldn't be the default because of having to specify the configuration flag, but it escaped me that it would be SQLite. Thanks for the correction.


Well, I guess Django ships with SQLite but Postgres is the normal thing to flip on. Good to know Rails is the same...


Yeah, as others have noted, the default for Rails is definitely not Mongo. SQLite is the default and thus ends up getting used a lot in development. Production is a relatively even split between Postgres and MySQL, although I think it tilts slightly towards Postgres in part because that's the default on Heroku. (I have no data to back this up, just my sense of things.)


The momentum I've seen hasn't been MySQL -> Postgres but rather MySQL -> MariaDB.

Oracle's control over MySQL and the shift towards Open Core over FOSS fundamentals has pushed some companies to use the compatible MariaDB option over it. Most Linux distros use MariaDB by default now, meaning more LAMP stacks by default having MariaDB over Oracle's MySQL.

I'm certainly not saying MariaDB is overtaking MySQL; it's just the trend I've seen as a RHEL sysadmin.


I used jsonb as an ersatz EAV store for settings etc until my product iterations, or at least the data model underneath, began to stabilise. Great way to get started without getting bogged down in DBA or problematic migrations.

Now the product's data model is mostly settled I am flipping over to regular fields, array types, and higher normal forms.

Being able to do all this without changing DB backend has been a real boon, far less time bogged down in ops = far more time to create customer value.
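As a sketch of that flip (the table and key names here are hypothetical), promoting a jsonb setting to a real column and then dropping the key can be done in place:

```sql
-- Promote a jsonb key to a typed column
ALTER TABLE accounts ADD COLUMN notify_email boolean;
UPDATE accounts SET notify_email = (settings->>'notify_email')::boolean;

-- Remove the now-redundant key from the jsonb blob ("-" operator, 9.5+)
UPDATE accounts SET settings = settings - 'notify_email';
```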


I'm not seeing it, just an uptick in usage. I find most places that previously used MySQL just use MariaDB instead.


In my experience actually most people still use MySQL and not MariaDB.


If you go to MySQL's website, they will list page after page of users, most of which you've heard of. If you go to MariaDB's website, they list users. Most of them you've never heard of.


Agreed, it seems many people look at Maria as a way out if MySQL itself or its licensing ever turns on them, without wanting to support them in the current tense.


I have used its JSON features for real-world stuff and loved it.

I have stored procedures doing a bunch of stuff and returning a nicely formatted JSON object that makes the job of the application code a lot easier.

One thing I dislike is that the friction for initial development and prototyping is a little too high.

You have to have scripts lying around to re-create the functions when you change them during dev.

And logging/debugging/etc is harder and not as flexible.


How close does it come to document databases when it comes to changing the JSON? Is it easy to increment a field, for example? Or add a property to an object?


PG's JSON implementation is solid and supports nearly everything you'd want from a document database, with the added bonus of joins, queries, ACID, and tooling. However, the syntax can be awkward at times -

  # Find a row by JSON property value
  select * from person where details->>'name' ilike '%james%'

  # check for property existence
  select * from person where details ? 'is_hidden'

  # Override a single property in a JSON field -
  update person set details = details || '{"phone":"911"}'

  # Increment an existing numeric value (jsonb has no "+" operator, so cast)
  update person set details = jsonb_set(details, '{views}', to_jsonb((details->>'views')::int + 1))
More examples at https://www.postgresql.org/docs/current/static/functions-jso...


There are JSON and JSONB types that can store JSON, and one can make tables that are simply a PK and a JSON column. There are function/operator wrappers around these types to update a JSON field in place. The syntax is quite different from Mongo's, but you get a relational JSON store, which Mongo doesn't really have.

I used to use Mongo alongside PG as the JSON store, but now am back all on PG with this JSONB type.


In the dotnet world I'm helping contribute to a project called Marten, which is a document database built on top of PostgreSQL. (My contributions are low because real life is in the way lately.)

The json support is amazing. It's really incredible.


Also postgres was picked up as an option on AWS.


However, RDS for PostgreSQL is still more expensive than RDS for MySQL, and I don't know why. If someone can enlighten me on that, I'd appreciate it.


I rented some time on a vultr server recently and chose a prepackaged build which included a MySql install. Coming from a MS SQL background it felt positively medieval. I haven't migrated yet but from my research Postgres seems the closest competitor in the relational db space.

I considered MS SQL for Linux but the server alone required 3GB RAM...


PG is far better than MySQL - something to note if you come from a traditional but robust RDBMS/environment.


Do you have anything to back up this opinion?


The article this HN post links to already covers that (plus the others the original post points to)...

But because this is addressed to someone coming from a more traditional RDBMS (like Oracle, SQL Server, Sybase), I'll note that some or most of PG's features are not that different from theirs.

Also, PG was from the start focused on being robust, whereas MySQL was focused on being fast (at the cost of robustness). PG has been a better fit for more traditional workloads for years now.

And the careful, well-thought-out, solid development, feature implementation and release discipline is clearly very professional - a match for the commercial vendors.

PG is not just good for startups and enthusiasts; it is solid enough to be recommended to most companies with total confidence.


I've got a feeling that MySQL has improved a lot. I have nothing besides my experience as a user (not an admin) to back it up, but I haven't found anything that would make me switch from MySQL to PG. Note that I don't look down on PG. I've used it in the past and it is a solid piece. It's just the feeling that both of them covers most users' needs and, apart from some corner cases, they are more or less equal.


I think you'll like Postgres. At least, that's been my experience working with those three, that it is the most comfortable.


HyperLogLog sounds interesting, but looking at the GitHub page of that extension, it mentions that it has been tested with versions 9.0 through 9.3, and the last commit is from 2014. Is it just finished and doesn't need any updates to keep up with newer Postgres versions? Or is it more of an abandoned project?


Most parts of the PostgreSQL extension API are really stable so if it is a simple extension one should not be too worried about the project being inactive. The extensions I have written have not required any changes at all when upgrading PostgreSQL.


HyperLogLog is a pretty straightforward, simple and awesome algorithm - probably that's the reason. Take a look at the Redis implementation; it's very readable.


Is there a new way around the requirement to rebuild your entire replication topology after upgrading versions (say 9.4 to 9.5)? You get a new master ID when running the initdb step, and doing this throws everything else in the topology off. TIA


Most important reason to use PostgreSQL: constant-time recovery (recovery from a long-running transaction that rolls back is instantaneous).


From reading the documentation [0] I can't help but feel a bit underwhelmed by the feature set of a GiST index. Maybe I'm not looking in the right place, but which index provides near matches (fuzzy, prefix) as well as exact term matches?


Look up Postgres FTS.


Has anyone done a serious implementation on top of full-text search functionality in Postgres? I have a pretty large dataset that's currently in Postgres and I'm deciding between it and Elasticsearch.


PG 9.6 has phrase search now, which is nice. And multi-word synonyms beat most search servers. Lack of BM25 or TF-IDF ranking (see the `smlar` extension) is the main issue.
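The 9.6 phrase search mentioned above works via `phraseto_tsquery`, which requires the words to appear adjacent and in order (the `articles` table here is hypothetical):

```sql
-- Matches "sour cream" as a phrase, not documents that merely
-- contain both words somewhere
SELECT title
FROM articles
WHERE to_tsvector('english', body)
      @@ phraseto_tsquery('english', 'sour cream');
```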


Depends on your needs, but ES is a superior search engine in just about every way. PG may be good enough for your needs though. In my general view, if you want a more powerful and easy-to-use alternative to LIKE then PG search is great. If you need something more like Google then you should use ES.


Not just that. PostgreSQL also offers a clear advantage if you need to access both full text data and relational or geographical data in the same query.

On the other hand what you say is true, ES is much more flexible in what it can offer in full text search.


ES is very good at storing and querying non-text data including geo data although there are exceptions such as numeric. Both PG and ES have their strengths and weaknesses. ES is great at search. PG is great at joins and constraints. Everything else depends a lot on the specific use case.


Yes, both databases can handle all kinds of data and have support for a wide range of index types, but ES does not come anywhere close to PostGIS when it comes to geodata. PG is superior at search for geodata, while ES is superior at text search.


Interestingly, I just yesterday published a post about pglogical. http://thedumbtechguy.blogspot.com/2017/04/demystifying-pglo...

Postgres is an amazing piece of software.


No disclaimer in the article that the author works for Citus.


...except for the "About" page right up there on his personal blog.


Not sure if there should be any. The article is mostly about Postgres.


Apologies. Now updated for that.


Suggestion: range types link should lead directly to the docs. Some people really dislike PDFs and Postgres docs are awesome.


since you're reading HN... Thx!!

suggestion: under 'Much More' it would be useful to link to articles or docs about each of the features listed.


Will make sure to add more links when back at my machine


I'd love Postgres even more if they had native support for writing stored procs [functions] in JavaScript.


I guess you know about https://github.com/plv8/plv8 , but would like to see it built in?

Unfortunately, having something like V8 as a Postgres build dependency would be a pretty drastic increase in build requirements.

A number of distributions of Postgres (the Debian/Ubuntu packages for all versions at http://apt.postgresql.org/, the RHEL-based ones at https://yum.postgresql.org/) provide it in an easy manner. I'm not sure there's something as convenient for OS X and Windows, however :(


my bad.. just found out about plv8, and will experiment with it and jsonb on future projects.

[ pg installs I work with are invariably running on linux, so all good. ]


Even with the inclusion of JSONB, I think Postgres is still lagging behind MongoDB by enforcing a schema. After so many years doing web apps, I see very little reason to have to enforce the schema twice: once in the DB via migrations and once in the app itself via ORMs. Maybe I am missing something really obvious.

ps: I don't mind the downvoting. I am truly looking for answers.


I upvoted you, because that is a common sentiment, and it's better to give some reasons why the RDBMS way is better.

---

A schema-less storage design has some valid use cases. But even those leak an undeniable truth:

every schema-less design is already an enforced schema, just a more general one.

Even a KV store is an enforced schema. You have keys, and values, and specific operations on them.

Eventually, you will notice that the schema-"less" design is too constrained/too liberal. For the same reason you notice that using hash tables for everything is not so great: you need arrays, and objects, and lists, and records, etc.

And you'll notice that some operations start to repeat themselves. Would be nice if they were abstracted away, right?

And you'll notice later that providing reliable results is very hard, especially when things are out of order, multi-threaded, multi-process, etc. Would be nice if that were abstracted away, right?

And then you can do all that yourself, in your language of choice. And that's fine!

But later, you will need to use another language. Now you need to repeat all that effort AGAIN. Wouldn't it be nice to use something like all that micro-services stuff, isolate the hard part into one place that takes care of it, and freely mix and match tech as you see fit?

But then you do all this, and you MISS THE DAMN SIMPLICITY OF A KV STORE! How nice would it be if there existed a model - let's call it relational - that could model not just a KV store, but much, MUCH more!

TADA! You have reinvented an RDBMS!


If you think of your database as loosely constrained document storage - then yes, postgres is worse than mongo.

But if you think of your database as a single source of truth, then I almost can't imagine any reason to use mongo[1]. Without ACID transactions it is your problem to enforce all those pesky constraints across your whole codebase and data storage. And that work is HARD.

For instance, I once had to implement a simple scheduling app. Basically, if worker A was scheduled to work from 2017-04-30T18:00:00 to 2017-04-30T18:30:00, then there couldn't be any overlapping schedule item for that worker. I guess you can do something like a two-phase commit with mongo, but I used a tsrange[2] and an exclusion constraint[3], and got everything almost for free.

[1] Well, AFAIK mongo has better sharding. For now :)

[2] https://www.postgresql.org/docs/9.6/static/rangetypes.html

[3] https://www.postgresql.org/docs/9.0/static/ddl-constraints.h...
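The tsrange-plus-exclusion-constraint approach described above can be sketched like this (table and column names are made up; the `btree_gist` extension is needed so the equality on `worker_id` can live in the same GiST index as the range overlap check):

```sql
CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE shifts (
  worker_id int     NOT NULL,
  period    tsrange NOT NULL,
  -- No two rows for the same worker may have overlapping periods
  EXCLUDE USING gist (worker_id WITH =, period WITH &&)
);

INSERT INTO shifts VALUES
  (1, '[2017-04-30 18:00, 2017-04-30 18:30)');

-- An overlapping shift for worker 1 is rejected by the constraint:
-- INSERT INTO shifts VALUES
--   (1, '[2017-04-30 18:15, 2017-04-30 19:00)');
```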


The downvoting is because you can have tables with a single jsonb column and no schema - call them "collections" - and pretty much end up with a faster Mongo, up to having to fiddle a bit more with replication/sharding.

No one is making you enforce schemas with PostgreSQL if you work this way.

As for enforcement - there are things that are _impossible_ to enforce at the application level for atomicity reasons. For example, if you have a `users` table and can't have duplicate email addresses, you need the operation "check if a user with the given email exists; if it does, return it; if it doesn't, create it and return it". PostgreSQL can do this easily, which prevents duplicate entries - Mongo, not so easily (see the "two phase commit" docs for how to simulate transactions).
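The "find or create by email" operation described above maps onto `INSERT ... ON CONFLICT` (available since 9.5); the no-op `DO UPDATE` is a common trick so that `RETURNING` fires for the existing row too (table and column names are assumed for illustration):

```sql
CREATE TABLE users (
  id    serial PRIMARY KEY,
  email text   UNIQUE NOT NULL
);

-- Atomic get-or-create: returns the row whether it was
-- just inserted or already existed
INSERT INTO users (email)
VALUES ('someone@example.com')
ON CONFLICT (email) DO UPDATE SET email = EXCLUDED.email
RETURNING id, email;
```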


> check if a user with the said email exists, if it does - return it, if it doesn't create it and return it" ... Mongo - not so easily

Can't you easily do it with upsert in Mongo?


Yep. Put an unique index on the collection too.


I'm not sure that it's that simple in practice. MongoDB may fail to provide the same guarantees that Postgres UPSERT does, if not in theory then at least in practice. In other words, it seems like retrying within client code is necessary when the Wired Tiger storage engine happens to be in use: https://stackoverflow.com/questions/29305405/mongodb-impossi...


There's a wide variety of reasons to want the schema in the database, and schemas are so valuable that even our good friends at Google decided to add database-level schemas to their massively distributed data stores.

The strongest reason is that the database will complain loudly if you try to make a change that breaks existing constraints. An application, no matter how simple, will probably change, and you don't want to leave data out there that is unreadable and that breaks your application's expectations. The schema minimizes the chances of garbage historical data, which has been, in my experience, the plague of NoSQL-backed apps. Often you find people who end up writing ad hoc constraint checkers, creating a poor imitation of what a database that understands schemas does well.

Another big reason is that there's rarely just one application, and the applications could (and very often should) be written in different languages. A schema at that point becomes a service contract which just happens not to use JSON or Thrift - a far more complicated, sturdier one. You could still argue that databases are too complicated to expose very widely, but I'd much rather have all my data sitting in the very schema-heavy Redshift to do reporting than rely on application logic, or use most business insight tools out there.

Also, let's not forget that schemas let databases do a lot of complex operations rather cheaply, because they can have very optimized code of very specific shapes next to the data. In the cases where 'first, I will take 50K rows out of the db and into my application' is just not acceptable, you have to let the DB do the work. In those cases, the interface to use a windowing function in a SQL database is IMO far more convenient than having to go into the classic remote map reduce function in erlang.
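For instance, a per-customer running total is one windowing clause instead of shipping raw rows to the app (table and column names invented for illustration):

    SELECT customer_id,
           ordered_at,
           sum(amount) OVER (PARTITION BY customer_id
                             ORDER BY ordered_at) AS running_total
    FROM orders;

The aggregation runs next to the data, and only the finished result crosses the wire.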

I think there were plenty of gains from the NoSQL movement, especially as far as to start writing database where high availability and horizontally scalable loads are concerned. But outside of those realms, where few companies really are, I would not recommend a schemaless database today.


Good points all around, and I fully agree.

Just to elaborate a little on your third paragraph, this is incredibly important as an app matures. You might start out thinking you're only ever going to have one interface to a datastore, but the reality is that you always have at least two. Someone has the ability to go directly to the store itself and make changes. That's life. Someone at some point in the lifetime of your app or company is going to go in and have to fix things by hand. If you have no schema and no constraints in the datastore itself, the odds of an error making things worse are far, far greater.

We all know that's bad practice, but we all also know that sometimes you just have to do it for some reason. We also know that there will be times when someone is testing something on what they think is the local testing db but are actually connected to prod, and they do something experimental, and it screws things up. That's also a terrible thing to let happen, but we all know that it does.

The second thing I'd like to point out is that, much like sending email, any sufficiently useful application will, at some point, provide an API. And you're going to have to re-implement the backend validation again for that.

So you know from the outset that if you are working on something you expect to be successful, you're going to have 3 access points. You're going to have to do the same work 3 times in 3 different places, unless you just decide to take the risk and use a db that doesn't have schema or constraints--which is a big risk. Doesn't make sense to me.

Finally, I'd like to add to this good post that it's pretty easy to work in either direction with many ORMs. If you're comfortable with DBs, you can write out your migration scripts and use the ORM to generate models from it. If you prefer to work with an ORM, you can write the model and manage the migrations in the ORM. Either way, there's really not that much overhead compared to what you gain in safety and security and DRY.

I'm not a DRY nazi, and when I'm prototyping, I'll repeat code anywhere I suspect a later iteration will diverge from the other instances of that code, and then refactor when I get closer to production. But when you're talking about database structure, I don't see how that can diverge, so it makes sense to me that you should never repeat yourself there. That's just asking for trouble.

Do it once in the database, and do it right. Then all you have to do is trap database errors in your backend code and validate on the front end. That will be different for every interface, but it's a lot simpler and cleaner and safer than enforcing schemata and constraints in every codebase that accesses the data store.


The schema in the database can generate most of the ORM layer. If you're using a strongly typed language and ORM like Slick/Scala or C#/LINQ the compiler will actually prevent you from writing invalid queries or putting invalid data into the database. This will eliminate most runtime database errors. This can make development go really fast once you get the hang of it all.


Or you can do the reverse, have the ORM generate the database schema.


Quite true. But I think I'm gonna be sick :-)

(in general, I'm not a fan of Object-Foo-Mapping, but then I've used dynamic/runtime typed languages on and off since the 80s)


Yeah, that's what Rails is doing. However, if you are doing an app like this, you are mainly building logic around the db. If you are that DB-centric, why not have Postgres itself serve web requests? There is no consistency in the double approach of having a logic-rich app and a logic-rich db.


> why not having Postgres itself serve web requests?

Good question! Why not have all the logic in the database?

The answer is, you can actually do that. And it’s popular enough that someone founded a company on that concept, although they didn’t use Postgres, but their own self-built database.

You might have heard of it: https://firebase.google.com/ (They were acquired later, and are now Google’s top database offering).


Well, you have https://postgrest.com/


I've found that it's better to have queries be pretty simple and do any necessary complex aggregations or analysis in the app tier since it's better to let the database's CPU think about doing I/O and cache management rather than business logic. It's also easier to scale out the app tier as you can bring app tier instances up and down easily via AWS vs having to copy a lot of data around to bring up a database replica.


At the kind of scale you're talking about, I am pretty confident that 90% of people just won't need to do anything on that scale. For most people, business logic in the DB will serve their needs perfectly. And if some day they need to scale up to some level beyond a single DB, I'm also pretty confident that they'll have the resources to spend on a DBA consultant specialising in sharding or replication :-)


> ... why not having Postgres itself serve web requests?

At least two OSS backend stacks, PostgREST and PostGraphQL, are built on exactly this idea. And there's at least one commercial stack with the same idea: SlashDB.


You mean "building logic around the db" by using ORMs as opposed to SQL queries?


why not having Postgres itself serve web requests?

Primarily, security.


If by security you mean protection against DoS, then yes. If you mean users accessing private data, then you are wrong. Postgres (and all the other dbs) has the tools necessary to specify who can access what (one just has to use them).


I've always wondered about constraints, relationships, entity/data reusability, and things like calculations and aggregations - typical "SQL tasks". How does one handle these in a document-oriented schemaless database like MongoDB?


Shouldn't the web app itself enforce constraints? I don't think it's reasonable to enforce password strength, for example, at the DB level.


If your database knows a user's password, you are doing it wrong no matter what persistence layer you are using.


Why couldn't the DB do the hashing itself? Storing is the actual danger from a security perspective; plaintext passwords hitting your db or your web app are fine as long as you don't keep a log of them anywhere in the clear.


You'd want to distribute the load of your slow password-hashing functions across your application servers rather than centralizing it in your single database.


Postgres bcrypt is wicked fast. You need some crazy load before it makes sense to move hashing out of the DB.


If your PW hashing function is wicked fast, you need to increase the work factor. The whole idea of a password hash is to be comparatively slow, because that's one property that makes it harder to crack. Otherwise we could all go back to salted md5.


Please change to a higher cost factor. pgcrypto's default is a bit outdated and too low for modern computers.
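For anyone following along, pgcrypto lets you pick the bcrypt cost explicitly rather than taking the default (a sketch; the cost of 12 and the users/pw_hash names are just for illustration):

    CREATE EXTENSION IF NOT EXISTS pgcrypto;

    -- Hash with an explicit bcrypt cost factor of 12:
    UPDATE users SET pw_hash = crypt('s3cret', gen_salt('bf', 12));

    -- Verify: re-hashing with the stored hash as salt must reproduce it
    SELECT pw_hash = crypt('s3cret', pw_hash) AS ok FROM users;

Each +1 on the cost doubles the hashing time, so tune it to your hardware.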


You can have more than one database node as well.


Database constraints are intended for use in distributed systems problems (e.g. two people create accounts with the same email, buy the last item in the store, etc).

Whether you enforce password complexity in the database is a question of where your team is more comfortable working, and what the business requirements are.


Unique indexes in MongoDB have been a thing since forever.

You can enforce complex constraints like this in 3.2+:

    db.createCollection("contacts", {
        validator: { $or: [
            { phone: { $type: "string" } },
            { email: { $regex: /@mongodb\.com$/ } },
            { status: { $in: [ "Unknown", "Incomplete" ] } }
        ] }
    })
While still not having to define the rest of the schema that you don't want to. It feels more flexible.


More flexible than what? You can do the same in Postgres, just use a single JSON field (which lets you store any structure) and add constraints on the JSON key(s).
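A sketch of that in Postgres (9.4+ for jsonb; names invented), mirroring the Mongo validator above:

    CREATE TABLE contacts (
        id  serial PRIMARY KEY,
        doc jsonb NOT NULL,
        CHECK (
            jsonb_typeof(doc->'phone') = 'string'
            OR doc->>'email' LIKE '%@mongodb.com'
            OR doc->>'status' IN ('Unknown', 'Incomplete')
        )
    );

Everything else in doc stays schemaless; only the keys named in the CHECK are constrained.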


You can do things like this in Postgres as well - my point was that the database are designed to solve distributed systems problems, so your database of choice is a good place to solve those problems.

I couldn't make a universal statement about where the constraints you're talking about should go without knowing the business rules for the application.


What happens when your platform grows and a different app needs to use the same data? Enforce constraints in two places instead of one?


You still need your DB to be correctly sharded, plus the app constraints correctly distributed. It's still more complicated than enforcing them in only one place.


You are right, password strength is an example for an application layer constraint.

However, by constraints I was more talking about things like "these properties need to be unique", "this column here must reference that column over there", "this column's values must be in this set" and so on.

I know I could enforce these things in the application layer, but I don't think that's feasible.


Got you. Unique indexes are a thing in MongoDB as well, but I can see that's just another schema-like behavior and doesn't fit what I was saying earlier.


If your constraints are about more than one row (e.g. uniqueness, non-overlapping ranges, foreign keys), it's very hard to do so client-side in a concurrent setup - unless you're OK with very coarse locking.

Edit: Typo
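Non-overlapping ranges are the textbook case: an exclusion constraint enforces this server-side with no client locking at all (a sketch; btree_gist is needed for the plain equality column):

    CREATE EXTENSION IF NOT EXISTS btree_gist;

    CREATE TABLE reservations (
        room   int,
        during tstzrange,
        EXCLUDE USING gist (room WITH =, during WITH &&)
    );

Any two rows with the same room and overlapping during ranges are rejected atomically, just like a unique index rejects duplicates.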


Via corollary with Greenspun's tenth rule, or not at all.


Thanks, didn't know about that.


The database schema serves as a single source of truth documentation. It is valuable by itself and not replaced by application level enforcement.


Your tools are bad, so Postgres is bad? Use an ORM that handles migrations and queries.


If you are using something like Mongoid, you declare the schema only once, directly in your Rails app. How is having your schema buried in migrations - however clean they are - superior?


In Rails, your schema isn't buried in migrations. You seem to be lacking a whole bunch of knowledge about this subject in general, maybe tone it down with the hyperbole?


Here's your schema from Postgres:

    pg_dump -s your_db | less
Or you can explore it using `psql`:

    $ psql your_db
List schemas:

    your_db=# \dn
List tables in a schema:

    your_db=# \dt public.
Show table detailed info (constraints, indexes, comments, etc.):

    your_db=# \d+ your_table
List functions:

    your_db=# \df
Or you can install pgAdmin and browse through the entire database--tables, constraints, indexes, the whole shebang--in a single screen. At no point are the migrations burying the schema. That's just a fundamental misunderstanding of how to work with databases.


That's not what I'm saying, I'm saying use an ORM that does not rely on migrations to form the schema. I can't think of a single one that does.


I don't get it. Doesn't any good SQL ORM provide its own DSL for migrations?


Yes. Look, take for example the Django ORM. You specify the models in a python file. These are the central source of truth for what your database tables look like. If you want to add a column or change something in the schema you then generate the migrations from that.

The migrations don't drive your schema, the schema drives the migrations. Proper tools don't make you have a bunch of .SQL files that organise your database.


Can the Django ORM support migrations of stored procs, custom types, and other non-tabular entities?


Devil's advocate - how is it really better to have a bunch of autogenerated Django migration files?


Better than what?


I always wondered the opposite: what's the point of schemaless DB when you're going to have schema in the app? Also, joins.



