It's unfortunate that the web development world has been so MySQL (and PHP) centric, because it often constrains us to see solutions to problems only in terms of what these technologies can do.
I had heard much good about PostgreSQL in the past, but unfortunately it always came with the spectre of a costly adaptation period, which made me postpone the move. This time, though, I was tempted enough to give it a shot, and that same evening I jumped in.
It's been about a month now and things have been surprisingly smooth. The only tool I use to interact directly with the db is the psql client. Other than that, the symbiosis with Python is nearly perfect (I use SQLAlchemy with psycopg2). If I have one recommendation for current MySQL users who would like to try PostgreSQL, it's this: just give in. It's really not as hostile a transition as some seem to make it. Not everything will work as you'd expect from your MySQL background, but most issues are known, the tools are great, the community is great and the documentation is simply top-notch.
I take a break, pop onto HN, and the top comment of the top story explains how to do exactly that. Thanks!
You wouldn't happen to have come across any good tutorials on using PG schemas for this purpose, would you?
Also, do schemas provide enough separation of data when strict client/tenant confidentiality is a requirement?
- at the start of your request, grab a connection to the database and make sure that it's only accessible to that specific request (i.e. ensure thread-safety).
- start a transaction within that connection.
- assuming you'll have a http://subdomain.domain.tld/url scheme such as http://clienta.myapp.com/some/controller, to switch to schema 'clienta' you'll execute the query "SET search_path = clienta;"
- now you can execute other statements and transactions within that main transaction.
- at the end of the request, commit the main transaction, then reset and release the connection to make it available to other threads.
That's a broad description of the approach, but it should be enough to get you started.
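A minimal sketch of the schema-switching step described above. The helper names and the validation rule are my own inventions, not from any framework; the validation matters because the schema name goes into the SQL as an identifier, so it can't be passed as an ordinary bound parameter:

```python
import re

def schema_for_host(host):
    """Derive the tenant schema name from a Host header like
    'clienta.myapp.com'. The subdomain is validated so it is safe to
    interpolate into a SET search_path statement."""
    subdomain = host.split(".")[0].lower()
    if not re.fullmatch(r"[a-z][a-z0-9_]*", subdomain):
        raise ValueError("invalid tenant subdomain in host: %r" % host)
    return subdomain

def set_search_path_sql(schema):
    # Double-quote the identifier; schema has already been validated above.
    return 'SET search_path = "%s";' % schema
```

In a Flask app you'd call these from a before_request hook (grab the connection, execute the SET statement) and undo everything in teardown_request.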
As for the data segregation, see what this commenter had to say about it in his (her?) 3rd point http://news.ycombinator.com/item?id=1567089 .
I found this video that has some illustration near the end of the presentation. The presenter uses Rails. I use Flask and SQLAlchemy. Flask has some handy utilities (Flask.before_request(), Flask.teardown_request()) for this type of setting/unsetting of connection and schema. I imagine Django also has a pretty straightforward approach to this.
The overall PostgreSQL documentation is pretty good. Note that you usually want to prefix your table names, etc., with the schema name, so instead of "create table foo" you have "create table myschema.foo".
Note that roles and users cannot be assigned to schemas (or databases either) since they are cluster-global.
Unless you think you will have many hundreds or thousands of tenants (!= users) any time soon, I'd advise you to stick with one db per tenant.
The reasons for backing out of the use of schemas for multi-tenancy are for me:
* on a smaller scale (with regards to number of tenants, not users) it does not really offer any significant benefits in resource usage over one db per tenant, which makes the following reasons significant:
* lack of isolation - unless you use a separate database user with carefully set permissions for each customer, it is possible to access every schema from a single compromised application process. If you do actually use a separate db user for each customer, one of the benefits of using schemas as opposed to separate dbs disappears (you can no longer share db connections/db backends across requests from different end-users, so the number of concurrent database connections increases).
* complexity - multi-schema use is usually not supported by default in common web frameworks and related tools. I use Django and had to create a custom db driver (and leverage thread-local storage) to get any sort of transparent support for multi-schema use (and especially in this case you want transparency for your developers - otherwise risk of isolation failure increases to the same level as with the 'one db/tenant_id in table' approach - see previous point). Also backup/restore becomes more complicated. And Django's management tools (the manage.py script) had to be taught a new trick. And South, the schema migration tool had to be ... You get it I guess. It works - sure, but took quite a bit of work to get right and there is always the risk that future updates to any standard packages used break it again. I'd rather not have it.
* one schema per tenant also implies one application process for all (or at least multiple) tenants. This again makes it more complicated to keep things secure, and you'll run into some other issues as well depending on the requirements for your application.
One of those issues we've run into (not related to the database) is that we're in the position that we can definitively benefit from a 'one source code base for all' but still have to be able to customize our application per tenant with respect to the specific wording of certain aspects in the user interface and that in multiple languages. In the current setup (one application process for all) it meant that we could not use Django's native i18n machinery. Building an alternative (with proper version control in this case) was not as much fun as you may think and basically a mistake. In the setup we're migrating to we can leverage the native i18n features of Django (and git) for this job.
So where we're going now is:
* one virtualenv per tenant, with a dedicated Django project for each tenant (the Django project does not contain the actual application code, only settings and if the need arises customized .mo/.po files.)
* one application process per tenant (I use uwsgi with emperor mode for this)
* one database per tenant (hosted on a Postgresql master/slave pair)
The resulting setup is easier to maintain and reason about than the current setup, and my testing so far shows that resource-wise it does not make a significant difference. There is of course a bit more disk usage and a bit more memory usage. That is mostly because of the extra copies of the application code I need to keep around on disk and in memory - but that is altogether a relatively minor bit of the total disk and memory usage, so percentage-wise it does not really make a difference (at least for us - this is probably very situation-dependent).
Something I'd been worried about was the number of concurrent database connections. But after some testing that fear turned out to be unnecessary as even under very high load the number remained within manageable bounds.
The nice thing of Fabric is that it makes automating something hardly more work than doing something in a one-off fashion.
The distinction between shared data and isolated data isn't binary. Instead, it's more of a continuum, with many variations that are possible between the two extremes "Isolated - Separated DB - Separate Schema - Shared Schema - Shared"
There are advantages and drawbacks to choosing one strategy over another, but one should not necessarily aim for some Holy Grail. The article goes on to say:
Where your application should fall along this continuum depends on your business, architectural, and operational needs, and on customer considerations [...] all of these considerations are interrelated to some degree.
Regarding framework support, this was not my primary concern. We're building one app, so we might as well just build the whole thing from scratch. If we expected to reuse this code for multiple projects, I might have given it a second thought, but it's a single app with its own idiosyncrasies. I didn't want the constraints of following some "agreed upon" approach at the cost of having to bend the natural alignment of the architecture to stay faithful to what's expected of the application. Like I said, I could've chosen to do things from scratch, but I settled for what seems to me a much more beneficial alternative: a micro-framework and a specialized database library (Flask, SQLAlchemy). So far things have been running pretty smoothly.
Regarding complexity, I'll concede that I've had to write a couple of scripts to automate certain aspects of schema management, but honestly, it wasn't that bad (SQLAlchemy helps a great deal with this stuff). The alternative would be to create scripts to manage different application instances. So there's work either way. One of my main concerns was to keep development simple. The approach that you're considering was very appealing in that regard, as I would basically be building a single-tenant app and replicating it for each new tenant. But from my readings, I also learned that there are some performance and scalability disadvantages to running one application instance (and db) per tenant at the scale I'm aiming for. When I found out about PostgreSQL schemas, it offered a very good compromise.
I set up an app with the previously mentioned tools, spent some time reading on how to operate with schemas, tested a few queries, had some misses that taught me more on the peculiarities of this setup. But within a few hours of trial and error, everything worked as advertised. I never had to create a driver or anything like that. Upon connection start I just stored a connection on a thread-local object, started a transaction and switched schema. I then did what I had to do in my request and on the way out I released the connection. That was all on day 1. The code that manages the multi-schema access to the db is probably 5 to 10 short lines in 2 functions. The rest of the development is oblivious to the multi-tenancy aspect of the app.
Regarding security, I won't claim to know what you know on the matter, but the scenario in which an attacker gains access to other schemas through a compromised process seems a bit far-fetched to me. I'd have to read up a bit more on the topic to understand how that exploit might work. PostgreSQL has some interesting authentication protocols, and I imagine that in combination with the OS's own auth mechanisms there are ways to prevent this, or at least make it more difficult.
 Architecture Strategy For Catching The Long Tail http://msdn.microsoft.com/en-us/library/aa479069.aspx
 Multi-Tenant Data Architecture http://msdn.microsoft.com/en-us/library/aa479086.aspx
In all cases there is work to do of course - I've just found that the 'one db per tenant' model allowed us to re-use more existing infrastructure than the other models. For us that's a win. As always the devil is in the details - some of which do not become clear until after you've run into the corresponding devil.
With respect to security/isolation consider the following.
Case A: one schema per tenant, one db user shared across all tenants. Anyone with the ability to manipulate queries going to the database (e.g. a hacker via a sql injection vulnerability) can access all schemas and all tenants' data.
Case B: one db per tenant, one db user per tenant, one app process per tenant running under its own os user. In this case an attacker is far less likely to be able to access any data from tenants other than the one to whom the compromised process belongs.
Of course, if the processes of other tenants are also approachable by the attacker, it is very much possible that they will be compromised through the same vulnerability. However, in our situation that is not the case (well, I should know better: it will be very difficult for the attacker), hence we benefit from the extra isolation.
Your assumption that PostgreSQL's authentication or even OS-level authentication provides extra isolation is wrong - unless you use one db user per tenant and set permissions carefully. But as stated before, in that case you'd better double-check whether you still benefit as much from the schema-based multi-tenancy as you may think. And even then the level of separation is still lower.
What isolation level you need depends on your audience of course but from a security point of view I do not consider a schema based approach as much better than 'all tenants in the same table with a tenant_id field'.
If your environment is set up such that database connections are long-lived, please double-check that you're not using SQLAlchemy's default behavior of opening an explicit transaction (e.g., "BEGIN TRANSACTION") upon connection. (It may no longer be the default, but it was the last time I worked with a Django shop, about two years ago now.) That behavior interferes with VACUUM's ability to do its job, your tables can bloat linearly with how "write-hot" they are, and the performance of queries using those tables will reflect that.
Briefly, this is a consequence of Postgres' MVCC (multi-version concurrency control) architecture. Physically, an UPDATE statement is an atomic INSERT/DELETE operation; each version of every row is stored on disk, and Postgres keeps track of which versions of which rows are "visible" in the context of which transactions. Left unchecked, that's not a sustainable approach, in terms of both performance and resource consumption. You'll fill up your storage with "dead" tuples, and you'll have to trawl through disk page after disk page of them to find a single "live" one.
Enter VACUUM and later, the autovacuum daemon. They know which transactions are open, and which versions of rows were modified by which transactions. "Dead" tuples older than any open transaction are truly dead, and their disk footprint can be re-used by subsequent INSERT — or UPDATE — operations.
SQLAlchemy's (former?) default behavior to open an explicit transaction upon connection isn't ideally suited for use with MVCC-based databases (so if you're using InnoDB, you'll probably have some flavor of this same problem) — particularly if your environment has connections open for days at a time, which was the case at that former gig.
When I found this problem in their stack, tables with hundreds, or perhaps a few thousand rows had disk footprints in the gibibytes. They were among the hottest tables in the system and performance, consequently, sucked. The only thing that kept them alive was that the people they had working on this stuff before me had a weekly maintenance window (site outage) where they ran a VACUUM FULL and REINDEX. How no-one twigged to the fact that app performance followed a perfect sawtooth pattern with a period of exactly one week, lock-step in phase with the maintenance cycle, I'll never understand...
You can check whether your environment is configured this way with:
SELECT * FROM pg_stat_activity WHERE current_query = '<IDLE> in transaction' ORDER BY xact_start;
A caveat: this test is fairly specific, but somewhat less sensitive, with false negatives being possible. What you're really looking for is Postgres backends with an xact_start — current transaction start time — that is (potentially significantly) out of line with how long your normal database operations should take. Those usually show up as "<IDLE> in transaction" but you could very well happen to run the above query exactly when one of these offending connections is busy servicing a query, itself...
Just to get it out of the way, SQLAlchemy does not emit the "BEGIN" statement, nor does it call any kind of database function that directly emits "BEGIN", ever. Feel free to grep for it, start at version 0.1.0 and go all the way up to the 0.8 tip - you won't see it. It's not a default, it's nothing SQLAlchemy has any kind of option for - it doesn't do it.
SQLAlchemy, like virtually all Python libraries and frameworks that talk to relational databases, uses an API known as the Python database API, or DBAPI. With Postgresql, you're usually using a DBAPI implementation known as psycopg2. The DBAPI is organized in such a way that transactions are implicit. This means, when you first get a DBAPI connection, it's per specification required to be in a transaction, or at least it has to be as soon as you do something with that connection. The DBAPI has a `commit()` method as well as a `rollback()`, but has no begin() method. Why is that? Again, because a DBAPI connection is always in a transaction - there is nothing to BEGIN, it is already begun the moment you've connected or the moment the previous transaction has been closed via rollback or commit (or at least, the next time you emit a statement on that connection, it needs to be in a new transaction).
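The DBAPI's implicit-transaction contract is easy to see with the standard library's sqlite3 module, which is itself a DBAPI implementation (a sketch of the contract, not of psycopg2 specifically, though the same shape applies there):

```python
import sqlite3

# sqlite3 follows the DBAPI contract: there is no begin() method;
# a transaction is opened implicitly when you modify data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")

conn.execute("INSERT INTO t VALUES (1)")  # implicitly opens a transaction
conn.rollback()                           # ...which is why this can undo it

count = conn.execute("SELECT count(*) FROM t").fetchone()[0]
print(count)  # 0: the INSERT was rolled back even though we never issued BEGIN
```

The rollback works precisely because the adapter, not our code, opened the transaction.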
So when you use psycopg2 by itself, as soon as you connect and emit a statement, the abovementioned "BEGIN TRANSACTION" has been emitted (see the big pink box here: http://initd.org/psycopg/docs/usage.html#transactions-contro...). If you want to disable that behavior, you can do so by setting psycopg2's [autocommit](http://initd.org/psycopg/docs/connection.html#connection.aut...) flag. It's also easy enough to set this flag when you're using psycopg2 via SQLAlchemy, and in fact things will work just fine - unless you actually need some degree of transaction isolation and/or need ROLLBACK to actually work. So setting psycopg2's autocommit flag is really not something you'd normally want to do unless, as the docs say, you're emitting explicit VACUUM statements on your psycopg2 connection, which of course we don't have to do these days now that vacuuming is automatic (and manually running VACUUM is a database administration task in any case, which you'd be doing from your psql console).
So we've established that 1. SQLAlchemy has nothing to do with "BEGIN TRANSACTION" and 2. psycopg2 and all DBAPIs are required to maintain transactional scope by default when a connection is first procured. What about the supposed issues with VACUUM ?
To put it simply, VACUUM has no problems with connections being open. What you're thinking of here are locks, and locks only occur once you're in a transaction and have accessed some table rows, which are now subject to various isolation rules. If you open a bunch of connections, and access/update a bunch of table rows, you'll have a lot of locks on hand, and that will get in the way of autovacuuming and such. However, as soon as you roll back the transactions, those locks are gone. When you use a database library like SQLAlchemy, a handful of connections are kept open in a pool, but the transactions are not. When you check out a connection, do a few things with it, then return it to the pool, any remaining transactional state is rolled back. edit: this of course assumes your application was written correctly enough that it closes out session/connection resources at the end of an operation - no different than when using the DBAPI directly.
What I would gather was your experience was that you either were using an extremely old version of SQLAlchemy, like an early 0.3 version when we probably weren't calling rollback() correctly (this would be like pre-2008), or your application wasn't handling the scope of connections/sessions appropriately (edit: as django's ORM typically autocommits by default, this detail about SQLAlchemy's more explicit transactional nature may not have been apparent to your dev team; we offer autocommit as an option, though it is discouraged).
It goes without saying that psycopg2 is used in a tremendous number of high volume environments, without forcing the "autocommit" flag on. Postgresql's auto vacuuming works just fine regardless. SQLAlchemy is just a client of psycopg2.
I think you are correct, except about this: what he is thinking of is indeed, as he said, transactions. Locks in PostgreSQL do also interfere with VACUUM, but that is seldom the problem in practice, since locks are usually taken at the row level and reading rows does not require any row locks. Open transactions are the main culprit when it comes to vacuum problems, and it does not matter much whether they have taken locks or not.
Nonetheless, there is a problem. It's not open connections that interfere with vacuuming; it's open transactions. While a transaction is open, vacuuming can't reclaim dead tuples if those tuples were live at the start of the transaction. Those old row versions are still "visible" in the context of that transaction and can't be reclaimed - nor can the additional dead tuples created by further updates on those rows, because there is an open transaction that's older than they are.
Simply, if you connect to the db, open a transaction, and leave it open, then every subsequent update to every row in every table in that database will leave a dead tuple that can't be reclaimed until that transaction commits or is rolled back, because at the bottom of each chain of dead tuples, there's one that is still visible in the context of that open transaction.
That's the problem.
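The horizon effect described above can be modeled in a few lines (a toy simplification I made up for illustration; real PostgreSQL tracks this with transaction IDs and snapshots internally):

```python
def reclaimable(dead_tuples, open_xids):
    """Toy model of the vacuum horizon: a dead tuple can be reclaimed only
    if the transaction that deleted it (xmax) is older than every
    transaction still open. One long-lived open transaction pins the
    horizon, and every dead tuple created after it accumulates behind it."""
    horizon = min(open_xids) if open_xids else float("inf")
    return [t for t in dead_tuples if t["xmax"] < horizon]

dead = [{"id": 1, "xmax": 100}, {"id": 2, "xmax": 205}, {"id": 3, "xmax": 300}]
# An idle transaction opened at xid 200 pins everything deleted at or after it:
print([t["id"] for t in reclaimable(dead, open_xids={200, 310})])  # [1]
```

Once the long-lived transaction commits or rolls back, the horizon advances and all three tuples become reclaimable.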
Net, in my particular situation, and a couple of others I've encountered first- and second-hand, enabling auto-commit (and only explicitly declaring transactions when multi-statement transactional semantics were specifically needed, and which were promptly committed upon completion) fixed the problem. A 2000 row table with ~1600 byte rows took up a little over 400 (8KB) disk pages, and the autovacuum daemon was able to keep it at that size. Compare that to before, without autocommit: it would be that size after a VACUUM FULL, and weigh gibibytes within days, no matter how many autovacuum workers there were, how aggressively they were tuned, or how often you ran (regular) VACUUM manually. Those dead tuples were not going away until the abuse of transactional semantics that caused them to accumulate was addressed.
If there's anyone to "blame" here, it's the engineers who didn't understand the implications of what they were doing. (To be fair, though, they were also using the database as a work queue, so it's probably reasonable to suggest they liked doing things sub-optimally...) The high volume environments you allude to are probably going about things with more cognizance of the implications of implicitly transactional semantics. They aren't proof that there isn't a problem; they're proof that you can do this sort of thing without shooting yourself in the foot, if you know what you're doing.
Look, it's not that the DBAPI being implicitly transactional is a "bad" thing. It's not. Transactional semantics are awesome, and as someone who gets paid for keeping peoples databases (particularly PostgreSQL) happy, I'm emphatically for them. It just has consequences, particularly in the context of an MVCC-based RDBMS, and doubly so if your application is architected such that it leaves transactions, implicit or otherwise, open for extended periods.
Can you give any good reason why you need to leave transactions open for extended periods? In my experience, it only happens when the developer in question does not understand the semantics of the DBMS and therefore hasn't suitably designed their solution, in which case it's hardly the DBMS's fault when things go wrong.
I've rewritten plenty of code that "needed" a transaction to be open for a long time into code requiring a transaction for a few milliseconds (e.g. in a bulk import, parsing, caching lookups and reference data outside the transaction, building an XML document to supply into a SP as a parameter which only used a transaction for a single INSERT from the XML).
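That shape of rewrite can be sketched like so, using sqlite3 for illustration (the table, column names and CSV-ish input format are made up; it's a simplified stand-in for the XML-into-a-stored-procedure approach described above):

```python
import sqlite3

def bulk_import(conn, lines):
    """Keep the transaction short: parse and validate outside it, then
    hold a transaction only for the single bulk INSERT."""
    # Expensive work first, with no transaction open:
    rows = []
    for line in lines:
        name, qty = line.split(",")
        rows.append((name.strip(), int(qty)))
    # The transaction is open only for the milliseconds the insert takes:
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO items (name, qty) VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, qty INTEGER)")
bulk_import(conn, ["widgets, 3", "gadgets, 5"])
```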
Offhand, other than xid-level consistency for a backup as mentioned by the sibling post, no. This case was pretty clearly the engineers not understanding what they were doing, as evidenced by it being easier to turn off autocommit and add explicit BEGIN statements where they were needed than to add explicit ROLLBACKs everywhere they were needed.
That makes an implicitly transactional adapter seem a bit of a foot-gun to me, though. At a minimum, I think it places an unnecessary burden on engineers to have to "roll back" every time they even ask the db something. That's how memory leaks happen, too, and that's why we generally think garbage-collected languages are a Good Idea.
That said, my preferred backup strategy is to take a filesystem-level snapshot of the db volumes, mount that, and start a second Postgres instance against it. It'll "recover" from an "unclean shutdown", and then you can take your backup without incurring the penalty of holding open a transaction that long in the presence of concurrent writes.
You do pay some write-performance penalty for the snapshot while it's open, but upon releasing it, the disk pages are immediately gc-ed, and you're already taking your backup at a low-traffic point anyway, right? Disk pages in a Postgres cluster allocated because vacuuming was hindered by holding an xid for extended periods are merely autovacuumed, however; they're reclaimed for later use, not released.
In the .NET world, for all the database adapters I've used (MSSQL, SQLCE, Oracle, MDB), individual statements are implicitly wrapped in transactions when no explicit transaction has been specified - not sessions/connections.
I agree the "implicit transaction per session/connection" design is confusing and likely to result in bugs. While the standard .NET approach limits bugs, the downside is poorer performance when an unaware developer uses numerous implicit transactions for multiple statements where a single explicit transaction would suffice.
Experts in the area appear to recommend always explicitly managing transactions (e.g. http://ayende.com/blog/3775/nh-prof-alerts-use-of-implicit-t... ), and this is the approach I follow in my own work now.
I'm wondering where it's best to manage the transactions... would they ever be in raw SQL or stored procedures instead of .NET code?
That is how the postgresql API works as well. Most high level language modules built on top of it also work that way. Unfortunately python's DB API is not one of them, but you can just set a config option to "act like everyone expects".
I'm quick to complain about Oracle (especially the fucking cost-based optimizer and how it falls back to ill-suited hash joins for me frequently, leading me to add hints, rewrite SQL or run the tuning advisor), but for its faults, it does some things pretty well.
That being said, I'm wholeheartedly rooting for Postgres.
Billions of rows in performance critical apps is pretty edge case. I think a lot of people paying for Oracle could pretty easily migrate away.
Nearly every situation I've used Oracle under would have worked fine in any free DB (including SQLite3 in some silly instances) but I don't doubt there are situations out there where it excels.
Sounds like you're getting your money's worth out of Oracle - which is a good thing. For many others (especially those in the .NET world) they use very few of the high end SQL Server stuff.
The better way is to use WITH RECURSIVE CTE's. See http://ledgersmbdev.blogspot.com/2012/07/ctes-and-ledgersmb.... for how we use them in LedgerSMB for hierarchical data.
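A small illustration of WITH RECURSIVE, run here through sqlite3 for portability (the account table is a made-up stand-in for LedgerSMB's actual schema; the identical SQL works in PostgreSQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT);
    INSERT INTO account VALUES
        (1, NULL, 'Assets'),
        (2, 1,    'Cash'),
        (3, 1,    'Receivables'),
        (4, 3,    'Overdue');
""")

# Walk the hierarchy from the roots down, tracking depth as we recurse.
rows = conn.execute("""
    WITH RECURSIVE tree(id, name, depth) AS (
        SELECT id, name, 0 FROM account WHERE parent_id IS NULL
        UNION ALL
        SELECT a.id, a.name, t.depth + 1
        FROM account AS a JOIN tree AS t ON a.parent_id = t.id
    )
    SELECT name, depth FROM tree ORDER BY depth, id
""").fetchall()
print(rows)  # [('Assets', 0), ('Cash', 1), ('Receivables', 1), ('Overdue', 2)]
```

The recursive term joins each pass's results back against the table, so arbitrary-depth trees come back in a single query instead of one round-trip per level.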
I've also pushed Postgres to its limits and suspect you'd be able to make it work with that much data, though I wouldn't consider it a trivial exercise executing that transition on a production system. If you're seriously considering it though, feel free to reach out via the email in my profile.
The biggest issue with Oracle is that even experts don't really know how the optimiser does something. Tuning queries in Oracle is a bit of a black art. It really shouldn't be!
So why do Postgres advocates insist on dissing MySQL with false and misleading arguments? The usual target is some default settings, when obviously there are three kinds of MySQL users: the ones that actually have a reason to want less strictness, the ones that know how and when to change the settings, and the ones that just don't give a fuck.
You're not going to convince the latter until you are #1, and you're insulting the intelligence of the first two, whilst also making them question whether the arguments in favor of Postgres are actually true, since you're not being particularly honest in your criticism of MySQL.
And you're wondering why Postgres doesn't get any more love?
Most MySQL users are well aware that Postgres is technically a better DB. They mostly (right or wrong) feel MySQL serves their needs sufficiently, and are not particularly attracted to an open source community with such a douchy attitude.
If even Oracle doesn't scare people off sufficiently to jump ship and switch to a better, free and open database, you've got a serious image problem. Stop blaming it on other people's "stupidity".
Incidentally, I was not aware of issues around default settings with MySQL when I switched our infrastructure to Postgres - the primary reason we switched was transactional and much more efficient DDL changes. I was at a company previously that I felt was seriously hampered from innovating due to strong resistance against schema changes by the DBAs and want to do my best to prevent a similar culture where I work.
Before I decided to switch I benchmarked schema changes between Postgres and MySQL and found Postgres to be quite impressive, see the gist of results here: https://gist.github.com/1620133. If you (or anyone else) can point out why this benchmark is false or misleading, I'm all ears.
Another interesting thing about your benchmarks is that they throw serious doubt on the popular myth that count(*) is slow in PostgreSQL. And your benchmarks were made before PostgreSQL 9.2, which will add index-only scans.
Whew! Glad that was only a myth! I'll remind myself of that whenever my code takes > 10 seconds to return the count() from a table. "Just a myth - this isn't really happening". I'll try repeating that to myself while waiting for the results - probably should only take 10-15 repeats of the phrase before the row returns, right?
Whoah - just did a check on a real table with a whopping 2 million records - postgresql 9 just blazed through that in < 7 seconds to give me
select count(*) from student;
w00t! 7 seconds! I'm not even sure how that myth got started, let alone why people still believe that hogwash.
Are there DBMSs where this doesn't happen?
I just did a SELECT COUNT(*) on a table here in our QA environment; 20.5 seconds to count 42385875 records from one table and 68.8 seconds to count 191906711 records from another table (Oracle 10g 64-bit).
When operating on a table, my understanding is that pg will select a version and operate on that version. If other versions are being worked on in transactions - that's a different story. Why can't metadata about the number of rows be stored with the table version data, so that after every operation you'd know what the number of rows was at that moment in time?
In PostgreSQL every row has two numbers: the transaction ID it was inserted in and the transaction ID it was deleted in. An update is an insert plus a delete. When running a select in PostgreSQL you just traverse the table and for each row check these two numbers to know if you are allowed to see the row.
The details above are PostgreSQL specific but most other databases have the same problem with there being no way to know the exact count without actually counting the rows.
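Those two numbers (xmin and xmax in PostgreSQL's terms) drive a visibility check that can be sketched like this (a deliberate simplification I wrote for illustration; the real rules use snapshots, not a bare "committed before my xid" test):

```python
def visible(row, my_xid, committed):
    """Toy version of PostgreSQL's tuple-visibility test: each row version
    carries the xid that inserted it (xmin) and, if it was deleted or
    updated, the xid that deleted it (xmax)."""
    xmin, xmax = row["xmin"], row["xmax"]
    # The inserting transaction must have committed before ours started:
    if xmin > my_xid or xmin not in committed:
        return False
    # A deletion only hides the row if it committed before ours started:
    if xmax is not None and xmax in committed and xmax < my_xid:
        return False
    return True

# A row inserted by tx 5 and deleted by tx 9: a snapshot taken in tx 7
# still sees it, while a snapshot taken in tx 12 does not.
row = {"xmin": 5, "xmax": 9}
print(visible(row, my_xid=7, committed={5, 9}))   # True
print(visible(row, my_xid=12, committed={5, 9}))  # False
```

This is also why count(*) must visit every row: whether a given version counts depends on who is asking, so no single stored total can be right for all transactions at once.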
1. There is contention currently in PostgreSQL when writing the database journal (used for crash recovery and replication).
2. There are some optimizations done here, for example HOT, which avoids index updates.
If you had read his benchmark you would have seen COUNT(*) was slow in both databases (three times as slow in MySQL).
For a fast estimate you can read the planner's statistics instead: select reltuples from pg_class where relname = 'mytable';
You might want to check out the "Online DDL for InnoDB Tables" section of the manual, it should help with understanding the results. http://dev.mysql.com/doc/refman/5.6/en/innodb-online-ddl.htm...
for me it would be missing common table expressions, weak implementation of functions and SPs (recursion) and horrible .NET connector (connections dying, int can get returned as byte). And I'm not even mentioning the gazillions of advanced features that pgsql (and Oracle or SQL Server) has that mysql doesn't.
MySQL has a low barrier to entry with some reasonable tools that make it easy to get up and running with a minimum of fuss. For a large swathe of applications it does the job just fine. There's little point in trying to pretend otherwise. Where PostgreSQL really fits in is when the next step is needed, either because the app has outgrown what MySQL can offer or there are some use (edge) cases where PostgreSQL does a better job. Address these areas competently and the world will beat a path to your door.
Exactly. I mean seriously what do Facebook, Twitter, Yelp, LinkedIn, Flickr etc know about storing lots of data.
It seems like Oracle is more concerned about trying to keep mysql less well suited for major deployments, so it can upsell its flagship product. Or else their development process simply doesn't properly accommodate community assistance. Either way, as long as their resistance to community-aided development continues, mysql and oracle/mysql disparagement will continue, not because people hate Oracle, but because Oracle's management of mysql makes it less than ideal for those who simply want a good open source database.
Which is the point of the OP. Anyone who is open to looking at options and who wants a good open source database should look at PostgreSQL.
I doubt any of those example companies you listed even use stock Percona. Facebook for sure has extensive modifications beyond even the publicly available distributions like Percona and MariaDB.
Source: http://tinyurl.com/7adp6fl (PDF)
You think they are equally going to be okay with losing data? Or do you want to try again?
Let's add some more companies: 37Signals, DHL, Dropbox, Evernote, UPS, Kayak, LastMinute, Orbitz, Continental, Mint, Quora, Tumblr, Techcrunch, Slashdot, NYT, NBC, Reuters, Wotif, Zappos, Wikipedia, Youtube.
You still think any of these companies would tolerate loss of ANY data for ANY reason?
For example, taking the first company on your list that actually deals with real stuff in the real world rather than just running a website, DHL, here I see http://www.oracle.com/us/corporate/press/037809 - they're an Oracle shop. I could continue but I think I've proved my point.
So before you get all starry-eyed about these amazing companies and their amazing MySQL installations, understand that you are starting from a false premise. Oracle - for a cost - provides some powerful capabilities (e.g. stretch-RAC). Is it worth the money? Well, it comes back to what your data is worth. This incidentally is why some organizations cling to paper forms - it's cheaper to re-key the data than build out the infrastructure to do it all online...
Tho' if you think you need this, what you probably really need is DB/2 and Sysplex.
I wonder why this bug is still open today after 5 years.
Oracle on the other hand has a setting for everything. A good Oracle DBA can spend all his time fiddling with Oracle settings to get the best possible performance. On the other hand, a good Oracle DBA can spend all his time fiddling with settings to get that little bit of extra performance gain. PostgreSQL is simpler.
PostgreSQL on the other hand lets you write your stored procedures in any language. On the other hand, you can spend all your time writing and debugging stored procedures written in brainfuck.....( and in case you want to do that, download the handler at https://github.com/mikejs/pl-bf and go get started!)
There are a few other differences as well. In general, I think Oracle is better for some things, PostgreSQL for others, but they have more overlap than either does with MySQL.
1. It was the only game in town when the application was designed.
2. It was head-and-shoulders above everyone else. More, better, faster. Most people think you just throw SQL-99 at it, but then you're wasting your money. Look at their analytical languages, for example.
That said, even if you don't use PostgreSQL's whiz bang features, its stability, performance, and outright sanity with regards to handling data make it the right database to reach for in many cases. And, for smaller projects where you are willing to couple yourself with it, its additional data types, query features, and so on are awesome.
Further, every database, noSQL or otherwise, offers a different set of features and functionality. Why the heck wouldn't you take advantage of k-nearest-neighbors indices? Why would you want to roll your own full text search functionality? The alternative is to introduce whole other services to your infrastructure!
In closing: if you've got 'em, smoke 'em.
You put the code in your application so you can switch out the database, but why do you want to switch databases?
You can't switch out the database for one with more features, because then you're not using the lowest common denominator any more, and you can't switch back.
It can't be licensing costs, because postgresql licenses are free.
The only other reason I can think of is performance. But trying to avoid database features just because you might want to migrate to a less-featureful database later seems more likely to result in performance problems than prevent them.
I agree you don't want to go out on a limb with crazy features just because you can. But those features are there for a reason, and might drastically simplify portions of your application if you use them. And a lot of them are SQL standard, just not supported properly in all database systems.
Also, I'd like to point out that database migrations themselves are quite rare once they become established in an organization. You might be able to migrate one fairly simple application if you bend over backwards trying to use only the simplest features; but once a few applications are depending on it, it's just too expensive.
99% of the time, this is why you build DB agnostic apps.
- astonishing amount of things are done differently on these four dbms (identities/sequences and getting their last value, paging, DDL)
- each of these dbms has specific features which would improve the performance or help us develop stuff but we can't use them because they require different architecture (and we want to keep it the same)
Don't do software for multiple databases. It's just not worth the trouble.
That also applies to MySQL, because it's GPL.
That does not apply to Postgres, because you can just ship it with the app.
I exaggerated a little, there certainly are times when you just don't care what database system you use and would rather have simple migrations at the expense of owning more code. But I think these reasons are weighted much too highly, particularly in the context of postgres.
To be clear, I'm not really suggesting you avoid using fancy features in data stores like Redis or Solr. The relational database, with history as a guide, tends to be the first data store a system starts with, and ends up having everything thrown into it even if it is not the right tool for the job. Over time, chunks of data end up being moved out of the relational monster into other data stores. If your relational database choice is tightly coupled with your implementation, then you'll have a harder time moving things like your search engine into Solr, your queue into RabbitMQ, your counters into Redis, and so on. And yes, you'll also have a hard time moving to a different relational database, which I've also seen happen (on a massive, massive scale actually.)
If you stick to using an ORM that provides few leaks in its abstraction, write non-clever SQL, model things relationally using the standard types, generally speaking moving parts of a system out of a relational database into different data stores goes from being nearly impossible to being merely difficult.
Sometimes, rather than swapping out the entire data store to get access to a feature, if you cut the rope that is tying your hands behind your back you may find that your existing data store can already handle your particular sub-problem quite well, and in fact may do so better than the fancy new one that was tailored specifically for that purpose.
This happens particularly often with PostgreSQL. If you haven't used its GiST or GIN index types, you really should play around with them, and you should keep them in the same arsenal and pull them out for the same reasons you might decide to just switch everything out to Solr. A lot of the reasons people might want to use Redis are handled quite well by PostgreSQL's type extensibility.
In essence, if you are willing to throw everything out and switch to a different tool, it seems entirely inane to ignore that the tool you already have might actually already cover the features you need, especially so if the reason you are refusing to use that functionality is to make it easier for you to eventually jump ship to a tool where you no longer have your self-imposed handicap and are now willing to use the features offered.
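As a concrete sketch of the point above, PostgreSQL's built-in full-text search with a GIN index covers many of the cases people reach for Solr to solve (table and column names here are illustrative):

```sql
-- Index a tsvector expression with GIN so full-text queries
-- don't have to re-parse every document on each search.
create table docs (id serial primary key, body text);
create index docs_fts_idx on docs using gin (to_tsvector('english', body));

-- Match documents containing both terms; using the same expression
-- as the index lets the planner use it.
select id
from docs
where to_tsvector('english', body) @@ to_tsquery('english', 'postgres & replication');
```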
Lots of people come to the postgres community because of PostGIS, which is (among other things) a custom type distributed separately from postgres.
Using custom types is not bad, the mistake is thinking that making a new type is easy. For a non-trivial custom data type, you need to tie it into the indexing system (GiST+KNN, GIN, SP-GiST, BTree, Hash) as well as the optimizer (have some good stats functions) -- not to mention the basics like a good representation and a well-thought out set of functions and operators.
Custom data types are really one aspect of an extension, albeit a crucial one. So don't think about it as "hey, I'll make a new phone number type because it sounds cool", think about it like "we need better support for genomic data in postgres, let's sit down and plan a serious extension".
If you want a special type for a phone number, use a domain over text with a check constraint. And that's SQL standard, too.
Now, it's rare that you are likely to get that deep into phone numbers, but there might be cases where being able to decompose the data and do relevant searches on components could come in really handy.
The way I would look at doing that if I didn't want to go into C would be a complex type and a bunch of casts, functions, and operators.
BTW, I use custom types a lot for defining output tuples for stored procedures. It is relatively hard to get away from that.
That was part of my point -- if your entire business revolves around phone numbers, because you are a phone company or something, then maybe a custom data type makes sense. Then do it, and do it right.
But if you think of a phone number as "something to call" then all you need to do is display it to the right person at the right time, and ensure that it reasonably matches something you can type on a phone. And that sounds more like a domain over a text field with a constraint, to me.
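A minimal sketch of that suggestion (the regex is just a placeholder; a real constraint would depend on your numbering plan):

```sql
-- A domain is a named, constrained variant of an existing type.
create domain phone_number as text
    check (value ~ '^\+?[0-9 ()-]{7,20}$');

-- It can then be used anywhere a regular type can.
create table contact (
    id serial primary key,
    phone phone_number
);
```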
"I use custom types a lot for defining output tuples for stored procedures"
I don't really consider composite types "custom" types any more than an enum is a custom type. Maybe I misunderstood gfodor's statement.
Back in the 7.x line at one point I discovered that you could do something like:
create table foo (...);
create table bar (myfoo foo);
And this table would be write-only, and indeed the backend would crash when reading from it. That's the only really interesting bug I ever found in PostgreSQL. However, I made the case that it would be nice to be able to treat tuples as full-fledged data types, and as of 8.1 at least, that is supported.
So now I can:
create type foo as (...); -- works with create table too
create function foo_to_text .....;
create function text_to_foo ....;
create table bar (my_foo foo);
insert into bar (my_foo) values ('my.foo.as.text'::text);
So the point is that composite types can now be used to blur the lines quite a bit between traditional custom types and the roles that composite types are traditionally used for. You can go a long way prototyping what is useful with composite types, and then later do that right in C.
With that being said, if you're actually designing a complex database back-end for an application you will likely want to spend time becoming very familiar with the database of choice, learn the advantages and then exploit them. This could prove to be pretty difficult to just replace on a whim later on.
edit: Switching between an RDBMS and a "NoSQL" database would be much different though. Switching between one SQL server and another is one thing, but switching from a SQL server to a "NoSQL" server is going to depend on what kind of model you're changing to. For example, MongoDB & Riak are both "NoSQL" databases but they are both quite different fundamentally.
If you're doing something simple, sure, maybe. But beyond trivial things like DDL syntax and query syntax, there's query optimization (even when only going in via ORM, because schema design can affect this), tuning, backups, HA, monitoring, and a dozen other things.
You're already making a huge non-portable investment in using a complex tool like a database. In comparison to this, introducing dependence on its non-standard features is pretty small change, so you might as well stop worrying (considering how seldom people actually migrate), raise a glass to YAGNI, and learn to love your database.
Maybe in another five years you or some successor will end up cursing the day you made that decision, but if you could really use non-standard feature X and it's sitting right there in front of you, it's silly to shun it "just in case."
There are many applications which may not need this. In those cases, you are better off focusing on portability (only ANSI SQL features).
I use table inheritance as an actual sort of inheritance, to enforce a consistent interface over a set of relations. Yes, that could be a shortcut but it also allows for foreign keys to be against different relations on each of the partitions.
Yes, that could be done with partitioning in other ways, so maybe the short cut analogy works.
But others? I don't know. The ability to write a stored procedure as a quick Perl regular expression and then call it from a SQL query doesn't strike me as a short-cut. Nor does the extensible type system, or the ability to cast complex types as simple types with defined functions.
A lot of these are features which make PostgreSQL a development platform in a box, rather than a data store.
However, there are ways around this. You can, for example, use a composite key with something related to the specific table in it.
For example our note table (an abstract class if you will) has a note_class field which is set to different default values and with different check constraints on the child tables.
Also foreign keys are a bit of an annoyance in an inherited table environment.
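A hypothetical sketch of the pattern described (names are made up, not the actual schema):

```sql
-- "Abstract" parent: defines the common interface.
create table note (
    id serial,
    note_class text not null,
    body text
);

-- Child table pins note_class via a default plus a check constraint;
-- a redeclared column is merged with the inherited one.
create table customer_note (
    note_class text not null default 'customer'
        check (note_class = 'customer')
) inherits (note);
```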
PostgreSQL is an awesome RDBMS, but adoption will never eclipse MySQL until they have a scalable replication model that allows tiered replication, multi-master replication, writable slaves (very useful for reporting boxes), and the ability to purge data on a master without purging it on the slave. That, and the ability to reliably upgrade your binary without performing a massively time-consuming dump and reloading. Replication and upgradability is everything when you are in operations, which are things many (not all) developers do not consider.
"a scalable replication model that allows tiered replication"
They do in 9.2 (currently beta), it's called cascading replication.
The other replication features you mention are under active development by a team of reputed hackers.
"That, and the ability to reliably upgrade your binary without performing a massively time-consuming dump and reloading."
pg_upgrade has been available and you can upgrade from 8.3 to the latest without the dump/reload cycle. It has been a little rocky (by postgres standards) and some people have been hesitant, but it gets the job done and it's been improving a lot. A talk at a recent conference spoke quite highly of pg_upgrade despite running into some challenges (like a library versioning issue related to some perl functions they had and some multibyte characters I think). Not for the faint of heart, but if you really need this and don't mind reading a little (or hiring a consultant), you can make it work.
Every issue or every missing feature seems like a showstopper if you don't step back for a minute. No DBMS is perfect. Everyone has their list of "postgres is missing X,Y, and Z". The funny thing to me is that X, Y, and Z change with each release because the previous X and Y were added (often with greater flexibility than originally imagined), and there's a new alternative to Z.
Postgresql puts out a very high quality release every year with major features. Usually, it's a good balance of features requested by (potential?) users like you and new innovative features that move the database world ahead.
I don't mean this to be MySQL vs PostgreSQL, but MySQL has had these features for over 10 years and the kinks are already worked out. It's one of the main reasons for the success of the platform. My X,Y, and Z list has been unchanging for 5 years when I first started maintaining a medium sized Slony-I cluster that required a complete rebuild and several hours of downtime every few months. I suspect that's true of most operations-focused folks who have maintained PostgreSQL clusters. Devs love it, DBAs love it, but it is Operations that holds the keys to the kingdom.
No. Cascading replication basically means that you can replicate from slaves to other slaves, rather than always replicating from the master. That allows you to form hierarchies, which I thought was what you meant by "tiered".
The current built-in replication is a form of physical replication, meaning that data pages are (more or less) identical on the master and slave. What you want is logical replication, which is being actively worked on now as a part of core postgres.
In the meantime, there are external logical replication systems like Slony, Bucardo, and Londiste. Slony is deserving of a reputation for being complex, but it's being actively developed and can be used for everything you mention (based on the very high level descriptions that you gave). I understand that "not in core, and hard to use" are pretty damning for a lot of cases, but it really does get the job done. And hopefully there will be in-core solutions in later versions of postgres.
"I don't mean this to be MySQL vs PostgreSQL, but MySQL has had these features for over 10 years and the kinks are already worked out."
I'm glad to hear that it's working for you.
"My X,Y, and Z list has been unchanging for 5 years"
For future reference, all of these features will fall under the heading "logical replication". When you see that, if it doesn't have all of the features you need, then they are probably going to arrive soon.
Postgres has had external logical replication (e.g. Slony) this whole time, which I think has somewhat reduced the demand to invest huge amounts of work in core. It looks like it fell short for your use case, unfortunately; and hopefully the current logical replication in core project will not.
Slony, in a word, is awful. It's probably as good as it can be given the limitations of working within a trigger-based world, but in the 3 years I spent maintaining Slony clusters, I lost count of the number of times that I had to take a site offline (sometimes for hours) in order to rebuild from scratch because of a failed schema change. Many times, I was able to recover by careful manipulation of the sl_events table, but far too frequently, the error was unrecoverable. Slony is slow, unreliable, not tolerant of high latency connections, and fragile. It was always one of my greatest frustrations that the PostgreSQL community saw it as a solution given how completely unreliable it is. Perhaps things have changed in the 2 years since I last touched it, but I am skeptical.
Please try to understand that my few objections to PostgreSQL come from actual real world experience with both MySQL and PostgreSQL, not from reading blog posts. Both products have great features as well as unpolished turds...it's just a matter of deciding which color of poop you are willing to clean out of the diaper. :)
But working with Postgres was always a joy. Everything seemed to be designed in a logical, straightforward way and worked the first time without any hacks or workarounds. It always felt like there was so much untapped potential humming there behind the prompt.
Eventually my company switched to Oracle and DB hacking quickly became another one of those annoying things.
mysql> alter table test change column my_money my_money decimal(2,0);
Query OK, 2 rows affected (0.03 sec)
Records: 2 Duplicates: 0 Warnings: 0
mysql> insert into test values (4,'bar', 100);
ERROR 1264 (22003): Out of range value adjusted for column 'my_money' at row 1
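For context, getting that rejection presumably requires strict mode to have been enabled beforehand, which is not the default in MySQL 5.x:

```sql
-- With strict mode on, out-of-range values are rejected with an error
-- instead of being silently adjusted to fit the column.
SET SESSION sql_mode = 'STRICT_ALL_TABLES';
```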
* Your application then no longer works properly on a default install. Possibly in very subtle ways that you won't notice.
* Upstream will never make your settings the default. It would break too many applications, so you are on your own with those settings pretty much forever.
* Few users use non-default settings, so you are more likely to run into strange behaviors that are either bugs or can't be readily explained in a forum.
* Just imagine if you are trying to support multiple applications that each have their own settings they prefer. Especially if you want to integrate your data together at all, which is one of the main purposes of a database system.
2) My larger frustration with MySQL is I have run into cases of single transactions deadlocking against themselves. These always happen when the following is true:
* Executing an insert statement in the form of INSERT foo (bar) VALUES (1), (2), (3), (4);
* Only one connection/session active at a time (for example during a data migration to MySQL)
* Frequency goes up when more rows are inserted per statement
* Which inserts trigger the deadlocks are not reproducible
I believe this is an issue with race conditions and threads, perhaps a lock contention that isn't being handled properly between index and table writes or the like. I can reproduce it by inserting a couple million rows into a table, a few thousand at a time, but the statements where this occurs varies from one run to the next.
I have never seen braindead locking behavior on PostgreSQL.
It has happened to us too, though it was very easily fixed in our case due to the excellent deadlock reporting. And I can also confirm that the case for Joel (the original reporter) is also a real world case.
Link to talk about it http://www.pgcon.org/2012/schedule/events/483.en.html
Edit: Not sure if I would call it braindead though. Highly annoying, yes. Alvaro has spent a lot of time on solving this; it is much harder than it initially seems.
I think this is thread/mutex based because I could run the same series over and over and get different sets failing. So it seems to me the issue might be something environmental affecting timing.
Again the really braindead thing here was that it was only one statement, only one concurrent transaction at any point in time.
I mean it's fine if you only allow one app to write to your db, but once you have more than one, the fact that the option to ignore invalid data even exists ought to make one very nervous if the data stored is worth anything.
If none of that freaks you out, then I have one word for you: Oracle.
This to me smacks of FUD.
Oracle, evil and terrible as they are, would have a very hard time killing or otherwise harming MySQL due to the code being GPL'd and the vibrant MySQL community (Percona, Facebook, Twitter, Drizzle, MariaDB, SkySQL, etc) that fixes bugs, adds features, and directly competes with the work Oracle does.
At the same time - changing data values not stupid? We may differ on this point but I believe it's insanity. I'd love to see NULLs honored as well as constraints.
Finally, it's easy to characterize my comment as FUD but Oracle has a habit of using Open Source systems to its advantage. Why else would they buy MySQL? And the licensing is not as strong as you may think.
Either way - it's very argumentative and I'm happy to agree it's unsure at best.
To compete with Microsoft SQL Server and to further solidify their customer base (most Oracle customers already use MySQL).
Here's one word: upsell _HARD_
That was my experience trying to buy support/license for GlassFish. Some sales guy tries very hard to upsell us to use WebLogic (of course, otherwise he's not going to fill his quota or get a bonus).
There is NO evidence that MySQL has licensing problems; otherwise we wouldn't have the Facebook, Twitter and Percona forks. It's as simple as that really.
The short answer is: we needed to do something, and it's the best.
For such a statement I would rather take one of DB2, Oracle or SQL Server.
Amazon RDS started with MySQL, and only two years later Oracle support was added. SQL Server three years later.
In my area of work we also do a lot of open source, but most customers won't think twice about using proprietary database solutions, especially taking into consideration the types of deployments we do.
In my career in the enterprise world, I've mostly used Oracle, Informix, DB2, Microsoft SQL Server, Sybase SQL Server.
Of the only two projects where I've used MySQL so far, one was a prototype application to be shown at computer fairs. The other was an online survey.
A feature we miss is MySQL's ability to use different DB engines such as Archive or Memory. AFAIK there is no equivalent to these in Postgres. We put MySQL's Archive engine to good use before we migrated our app. Now we have to export data to gz files via script.
In general I don't think there is a strong reason for the majority of apps to switch to Postgres when MySQL works for most needs. (I should mention that pgsql has some nice locking functions that we found useful.)
Yes. Functions work within a transaction, which is usually a blessing. But there are cases where you want to do the transaction control in the procedure itself. This is a known request that is discussed as two different features: "stored procedures" (which is probably what you want) and "autonomous transactions".
EDIT: Navicat it looks like.
One comment: way back in the mists of time (the '90s), MySQL was known as the database that would eat your data. It had a number of data-gobbling bugs, as well as no transactions.
I'm genuinely surprised how far good marketing and ease of use will go to make a product successful. The postgres guys might learn a bit there. Ease of use will drive a lot of adoption.
What was that Mac 10.7-only, command-line but standalone database tool, designed for high-level analysis of data via queries?
I've been spending hours at a time looking for it, spamming here was a last resort. (Throwing "Postgres" in my searches was what was fouling them—turns out it was just spawned by Heroku's Postgres team...)
My favorite part of PostGIS 2, besides much of the cleaned up syntax, is the new Geography datatype. It has the spatial projection built in to it, so it automatically does great arc computations. Before we had to write more clunky SQL to achieve this. Was a significant barrier for my learning postgis, which has now been removed.
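For example (coordinates here are illustrative), with the geography type a plain ST_Distance already returns the great-arc distance in meters:

```sql
-- Two points given as lon/lat in WGS 84; the geography type does the
-- spherical math itself, no manual projection juggling required.
select st_distance(
    'SRID=4326;POINT(-73.98 40.76)'::geography,  -- New York
    'SRID=4326;POINT(-0.13 51.51)'::geography    -- London
);
```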
Even more disappointing, when the author proceeds from showing how stupid MySQL is (a random collection of choices which could plausibly be decided either way) to showing how much better Postgres is, he doesn't show anything useful. He just shows a random collection of features that may be useful in some use case he probably encountered in his practice, but no compelling picture of a superior product. Nobody changes DB servers because one has a "tomorrow" keyword and another does not.
And when performance comparison is done - no numbers at all are quoted except for initial data size. Just bare claims "it performed well" - how well? Are these results comparable to what I can expect on my data with my setup? How much manual work (mentioned in passing as 2 experienced DBAs were working on it - but how complex was what they did?) would it require to make it perform? How much better would it perform if I hired a very good DBA? These are very important questions to consider when you choose the DB for your project, but completely omitted in the article.
Looks quite nice, and there certainly is a need for more PostgreSQL GUIs. Some feedback from the minute I used it:
- You don't handle bytea columns very nicely. I'd expect those to be displayed in hex or so.
- Browsing a table is very slow on tables with large columns. I was testing on a table with a ~10KiB text column.
- I personally prefer a more compact interface, having an option for smaller buttons would be a good start. :-)
> qt-psql requirement wasn't very easy to satisfy on Arch
Whoops! My primary dev machine is Arch. Let me fire up a clean VM and try and reproduce the problem. Are you rolled to the latest?
> - You don't handle bytea columns very nicely. I'd expect those to be displayed in hex or so.
Unfortunately, the database types get mapped to Qt types in a pretty undesirable way (for this use case at least). For example, I do custom handling of timestamp types within reason. Let me take a look at bytea as well.
> - Browsing a table is very slow on tables with large columns. I was testing on a table with a ~10KiB text column.
I strongly suspect the Delegate class of Qt. This class handles the painting of cell data and it can be customized to a high degree. And I think even the default Delegate is pretty heavy leading to extreme stress when painting cells that carry a lot of data. Unfortunately, I don't see this as something that can be fixed quickly unless I start from very low level classes on Qt to display table data.
> - I personally prefer a more compact interface, having an option for smaller buttons would be a good start. :-)
Actually, I was thinking about having the scroll wheel adjust the icon size on toolbars. Let me look into that as well.
Thanks a lot for your input; I really appreciate it.
Last time I did this I used a fixed-width column when displaying text fields. Everything built-in was able to handle things smoothly when clipping to just the first __ characters. (Of course you can't see it all then, but I've found it useful enough... and can paste a cell elsewhere to see all of it.)
By the way, pgXplorer has a quick find feature that can highlight the searched table/view/function. Hopefully, you find that useful. It is a really useful feature especially when dealing with CJK table names.
I can't speak to horizontal scaling, as my little database with maybe 50 million rows never comes close to loading out its aging 4-core server, even when being hammered by lots of web visitors running complicated reports and internal clients performing all sorts of CRUD operations.
1. The commands are very user friendly. In MySQL you can issue commands like "show tables" and "show databases". The last time I used PostgreSQL, the commands were much more esoteric. Things like "\dt". It adds a good hill to the learning curve.
2. MySQL is everywhere. Basically anywhere you go on the 'net you can find people who know MySQL, quirks and all. It's very easy to get help, and it's the DB that most tutorials (for other things, like Rails, PHP, etc) use. MySQL is also the standard database you can find on any web host.
3. Replication was a big deal. MySQL has had replication built in for many years, and it's quite easy to use. At my last job we would have been quite happy to go to PostgreSQL for the performance, but we needed replication. At the time, there were solutions for PostgreSQL in outside projects such as Slony. Since we didn't have any PostgreSQL experience we didn't want to expend all the resources on testing and configuring everything. In the last two years or so PostgreSQL has gained an official in-tree replication solution. This makes understanding how to get replication up and running much easier.
4. MyISAM is small and fast. When machines were much slower, that could be a real benefit. Of course you sacrificed all sorts of consistency to get that.
At this point, I'd say PostgreSQL's biggest problem is mindshare. MySQL has been improving the whole time, and some of the worst warts are gone. It may not be the best tool for the job, but MySQL is just so common. It's trivial to find help, advice, employees familiar with it, programs that support it, etc.
I'd really like to use PostgreSQL in production so I could get more experience with it. It has some amazing capabilities. But our needs are relatively simple and we have lots of MySQL experience so there is no reason for us to look at switching right now.
"show tables;" is a server-side command. \dt is a command you give psql to say "ask the server what tables there are and list them" (\d stands for describe). If you use a graphical client, it will have to query the system catalogs itself, which you can also do by hand if you want.
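Roughly, what \dt asks the server can be approximated with a standard information_schema query (a sketch only; \dt's actual query goes through pg_catalog and respects the search_path):

```sql
-- Approximate equivalent of psql's \dt: list user tables,
-- skipping PostgreSQL's own system schemas.
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_type = 'BASE TABLE'
  AND table_schema NOT IN ('pg_catalog', 'information_schema');
```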
As for replication, the real challenge is that replication is not a one-size-fits-all thing. Slony, Bucardo, and the built-in streaming replication have different limitations. For example, you can't replicate from 9.0 to 9.1 with streaming replication, but you can with Slony, and I would expect you could with Bucardo too. So with the out-of-tree replication systems you can actually have a zero-downtime upgrade as long as you block writes during the upgrade of the master. With streaming replication, assuming pg_upgrade is supported for your upgrade, you only have a short downtime window, but it is downtime.
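For reference, a minimal streaming-replication setup on the 9.x series looks roughly like this (hostnames, the replication user, and the parameter values are all illustrative):

```
# postgresql.conf on the primary (9.0/9.1-era settings)
wal_level = hot_standby
max_wal_senders = 3
wal_keep_segments = 64

# recovery.conf on the standby
standby_mode = 'on'
primary_conninfo = 'host=primary.example.com user=replicator'
```

The standby is seeded from a base backup of the primary before starting; the docs cover the details.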
Streaming replication is also all-or-nothing, whereas Slony lets you replicate individual pieces of your database. The flip side is that with Slony you can also (accidentally) replicate only a piece of it.
So the differences give you a lot of flexibility there, but you need to have a clear idea of what you need and why you need it before choosing a replication solution. Slony is perfect for what Afilias uses it for, while streaming replication wouldn't meet their needs.
Also I don't think that PostgreSQL suffers from a lack of mindshare. Ever since I have been building apps on open source db's, PostgreSQL has been regarded as the go-to database for complex business apps. The sorts of things people do with the database are different. There is consequently a huge "dark community" if you will (in the sense that you don't see them).
This was driven home to me when I went to the Malaysian Government Open Source Software convention last year. There were two booths (Oracle and one other) which had the MySQL logo, and a few more that were offering MySQL services, while I counted at least 5 that were using the PostgreSQL logo. Most of those were reselling EnterpriseDB's PostgreSQLPlus, but when I looked at the level of interest and the number of government deployments I was asked about, it was clear which one people were usually choosing for complex work.
As for user-friendliness, MySQL used to be much more user-friendly than PostgreSQL but I don't think that's true anymore. The command-line tools are more full-featured and the in-app help is better on PostgreSQL. \? brings up a list of psql client-side commands and what they do for example, and \h [command] will give you a summary of the syntax of an SQL command. I use \h quite a bit when I am trying to remember the specific syntax of a new feature, or when I am doing something I rarely do (like ALTER TABLE). I think that since about PostgreSQL 7.3, PostgreSQL has been at least as easy to use as MySQL, and quite frankly far more robust.
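For example, a quick tour of psql's client-side help (the table name below is a placeholder):

```
\?              -- list all psql backslash commands
\h              -- list the SQL commands the server knows
\h ALTER TABLE  -- syntax summary for ALTER TABLE
\dt             -- list tables on the current search_path
\d some_table   -- describe a table's columns and indexes
```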
Finally, the PostgreSQL planner is awesome. I have not had to rewrite a query to get around planner limitations since 8.1.
I've been able to look at PostgreSQL a few times over the years, and you're right that the friendliness is getting better.
It's a very impressive system.
I guess if "popular" is a synonym for "demotic" (in the sense of progressing away from professional DBAs etc.) then MySQL is also more popular in that sense. But in terms of actual mindshare, it is ridiculously hard to compare them.
The best advice I have is that using a different database system, particularly ones as different as MySQL and PostgreSQL (which are essentially opposites) is like going to a foreign country. You'll find things you like and things that you don't like, and it will have a real impact on your quality of life, but it's hard to say in a few words the things that will really matter to you. You don't have time to live in every country in the world, but you owe it to yourself to at least visit a few that come highly recommended.
Whatever you do, don't try to make the new country into your old country. You will just be disappointed.
The same goes for database systems.
Also I expect that with 9.2 we will probably see a release of Postgres-XC which will kick the pants off any MySQL sharding solution out there.
Postgres-XC is a way to abstract database shards such that referential integrity etc. is enforced between shards. The best comparison to date would be to Teradata's clustering system. Better yet, Postgres-XC is just a set of patches on PostgreSQL, and is on-the-wire compatible, so you can swap it out when your database gets too big. This will also bring an ability to concurrently query different shards and aggregate the results without the application having to know anything about the sharding....
The fact that many large companies and organizations have successfully scaled a particular technology is a useful yardstick, and while it does little to answer the question of which scales better, it does, by the process of elimination, help answer the question of which cannot scale.
I don't know about others here, but I would welcome actual objective information on this topic rather than sarcasm and an appeal to popularity.
I had hoped that others here would enjoy the sarcasm in the context of the full-throated yet fact-free advocacy of the parent, the text of which somehow still remains legible on an off-white background.
Postgres was developed over a long period of time with careful attention to architecture, robustness, and extensibility. It also focused on both external and internal documentation and cleanliness. And, it followed the traditional database system model with a cost-based optimizer and everything else.
As a result, we see a very robust developer community. It's a real machine -- churning out high quality releases every year with a great mix of newsworthy features. Enterprises and startups alike are flocking to postgres for a huge variety of use cases.
Slow and steady wins the race (and it doesn't hurt to be 5 tons and have tusks) and isn't tired afterward. And in databases, it's good to play for the long term because that's how long the data will last.
I wouldn't trade that for some first-to-market popularity.
And undeniably, there is a shift happening, even in parts of the market that were a MySQL stronghold. Heroku and their customers use postgres for multi-tenancy. Instagram uses postgres for photo-sharing.
I think MySQL has been very complacent because they are the default for a lot of simple web apps, and that keeps their numbers high. They aren't really breaking into new markets -- postgres is winning the geospatial market big time, and is always coming out with new features to break into new markets (personally, I am trying to advance postgres into the temporal database space).
MySQL cluster is highly popular in telecom. http://www.mysql.com/customers/industry/?id=78
I've been made aware that there are other ways of dealing with user/perm management in pg, but from the perspective of people running shared hosting, the default tutorials feel extremely burdensome compared to managing everything in one database via SQL (no need for multiple user accounts on the system, just multiple user accounts on the database).
PostgreSQL has phppgadmin and pgAdmin which provide a nice GUI experience for role management.
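For what it's worth, the same management can also be done in plain SQL without touching OS accounts (the role, password, and database names here are made up):

```sql
-- One database role per hosting customer, managed entirely in SQL.
CREATE ROLE clienta LOGIN PASSWORD 'changeme';
CREATE DATABASE clienta_db OWNER clienta;
```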
Non-transactional migrations always seem to trip me up in MySQL; cleaning up the mess is a nightmare.
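By contrast, DDL in PostgreSQL runs inside transactions, so a half-finished migration can simply be rolled back (table, column, and index names below are illustrative):

```sql
BEGIN;
ALTER TABLE accounts ADD COLUMN balance numeric NOT NULL DEFAULT 0;
CREATE INDEX accounts_balance_idx ON accounts (balance);
-- If anything above fails, the transaction aborts and the schema
-- is left untouched; otherwise:
COMMIT;
```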
I've been wanting to pick up postgres but my only motivation has been curiosity. Any good resources/tutorials people here can recommend?
Also check out our new book which came out this month.
Its primary target audience is people coming from other relational databases.
Some great features of postgres which were not mentioned so far:
- index partitioning
IMHO, postgres developers and support organizations need to work with hosting companies to make postgres available alongside MySQL.