On my way to building a multi-tenant application, I went through a great many articles recommending various architecture strategies. I was looking for an approach to organizing the data for the app's various customers (multi-tenant). Most recommendations revolved around two solutions: one db per tenant, or one db for all tenants with a tenant_id in each table. Lucky me, I eventually stumbled upon a thread where someone mentioned PostgreSQL's schemas, which kinda give you the best of both worlds.
It's unfortunate that the web development world has been so very MySQL (and PHP) centric, because it often constrains us to see solutions to problems only through the capabilities of those technologies.
I had heard many good things about PostgreSQL in the past; unfortunately, it also came with the spectre of a costly adaptation period, which made me postpone the move. This time, though, I was tempted enough to give it a shot, and that same evening I jumped in.
It's been about a month now and things have been surprisingly smooth. The only tool I use to interact directly with the db is the psql client. Other than that, the symbiosis is nearly perfect with Python (I use SQLAlchemy with psycopg2). If I have a recommendation for current MySQL users who would like to try PostgreSQL: just give in. It's really not as hostile a transition as some make it out to be. Not everything will work as you'd expect from your MySQL background, but most issues are known, the tools are great, the community is great and the documentation is simply top-notch.
I found this video that has some illustration near the end of the presentation. The presenter uses Rails. I use Flask and SQLAlchemy. Flask has some handy utilities (Flask.before_request(), Flask.teardown_request()) for this type of setting/unsetting of connection and schema. I imagine Django also has a pretty straightforward approach to this.
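The before_request/teardown_request pattern mentioned above can be sketched without any framework at all. This is a rough, hedged illustration (not the poster's actual code): `FakeConnection` stands in for a real psycopg2 connection so the sketch runs without a server, and the two functions are the bodies you'd hang off Flask's hooks.

```python
import threading

# FakeConnection stands in for a real psycopg2 connection so this
# sketch runs without a server; it just records the SQL it is given.
class FakeConnection:
    def __init__(self):
        self.executed = []
    def execute(self, sql):
        self.executed.append(sql)
    def rollback(self):
        self.executed.append("ROLLBACK")

_local = threading.local()   # one connection per worker thread

def on_request_start(tenant_schema):
    """Body of a Flask @app.before_request hook: grab a connection and
    point search_path at the current tenant's schema."""
    # Identifiers can't be bound as query parameters, so validate the
    # schema name before interpolating it into the statement.
    if not tenant_schema.isidentifier():
        raise ValueError("bad schema name: %r" % tenant_schema)
    conn = FakeConnection()                    # really: pool.getconn()
    conn.execute("SET search_path TO %s, public" % tenant_schema)
    _local.conn = conn
    return conn

def on_request_end():
    """Body of a Flask @app.teardown_request hook: discard any open
    transaction state and release the connection."""
    conn = getattr(_local, "conn", None)
    if conn is not None:
        conn.rollback()                        # really also: pool.putconn(conn)
        _local.conn = None
```

These two hooks plus a thread-local are essentially all the multi-tenancy machinery the rest of the request code needs to know about.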
I actually implemented a multi-tenant application based on Postgres three years ago, and we have been using it in production ever since (for paying customers, that is). As such, I have some experience in this field, and I'm actually in the process of moving (back) to a model where each customer has its own database (and application process). For context: I have a relatively low number of tenants (tens), which in turn each have mid-hundreds to many thousands of active users (the largest tenant has just about 10k active users).
Unless you think you will have many hundreds or thousands of tenants (!= users) any time soon, I'd advise you to stick with one db per tenant.
The reasons for backing out of the use of schemas for multi-tenancy are for me:
* on a smaller scale (with regards to number of tenants, not users) it does not really offer any significant benefits in resource usage over one db per tenant, which makes the following reasons significant:
* lack of isolation - unless you use a separate database user with carefully set permissions for each customer, it is possible to access every schema from a single compromised application process. If you do use a separate db user for each customer, one of the benefits of using schemas as opposed to separate dbs disappears (you can no longer share db connections/db backends across requests from different end-users, so the number of concurrent database connections increases).
* complexity - multi-schema use is usually not supported by default in common web frameworks and related tools. I use Django and had to create a custom db driver (and leverage thread-local storage) to get any sort of transparent support for multi-schema use (and especially in this case you want transparency for your developers - otherwise the risk of isolation failure increases to the same level as with the 'one db/tenant_id in table' approach - see the previous point). Also, backup/restore becomes more complicated. And Django's management tools (the manage.py script) had to be taught a new trick. And South, the schema migration tool, had to be ... you get it, I guess. It works, sure, but it took quite a bit of work to get right, and there is always the risk that future updates to any standard packages used break it again. I'd rather not have it.
* one schema per tenant also implies one application process for all (or at least multiple) tenants. This again makes it more complicated to keep things secure, and you'll run into some other issues as well, depending on the requirements of your application.
One of those issues we've run into (not related to the database) is that we're in the position where we can definitely benefit from 'one source code base for all', but still have to be able to customize our application per tenant with respect to the specific wording of certain aspects of the user interface, and that in multiple languages. In the current setup (one application process for all), it meant that we could not use Django's native i18n machinery. Building an alternative (with proper version control in this case) was not as much fun as you may think, and basically a mistake. In the setup we're migrating to, we can leverage the native i18n features of Django (and git) for this job.
So where we're going now is:
* one virtualenv per tenant, with a dedicated Django project for each tenant (the Django project does not contain the actual application code, only settings and if the need arises customized .mo/.po files.)
* one application process per tenant (I use uwsgi with emperor mode for this)
* one database per tenant (hosted on a Postgresql master/slave pair)
The resulting setup is easier to maintain and reason about than the current one, and my testing so far shows that resource-wise it does not make a significant difference. There is of course a bit more disk usage and a bit more memory usage. That is mostly because of the extra copies of the application code I need to keep around on disk and in memory - but that is altogether a relatively minor part of the total disk and memory usage, so percentage-wise it does not really make a difference (at least for us - it's probably very situation dependent).
Something I'd been worried about was the number of concurrent database connections. But after some testing that fear turned out to be unnecessary as even under very high load the number remained within manageable bounds.
We actually do automate as much as possible (small team, lots to do). We have a Fabric script that sets up a complete new environment for a customer/tenant in one go. That includes creating and populating the virtualenv, creating a new database + database user, etc. Once the script finishes, the customer is basically able to log in. And that script is actually only about 200 lines of code or so.
The nice thing of Fabric is that it makes automating something hardly more work than doing something in a one-off fashion.
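The actual provisioning script isn't shown, but the shape of such a task might look roughly like the sketch below. Every path, command and settings-module name here is invented for illustration; the real script would differ.

```python
def provisioning_commands(tenant):
    """Illustrative list of shell commands a per-tenant Fabric task
    might run; all paths and names below are made up."""
    venv = "/srv/tenants/%s/venv" % tenant
    return [
        "virtualenv %s" % venv,
        "%s/bin/pip install -r /srv/app/requirements.txt" % venv,
        # one database user and one database per tenant
        "createuser --no-superuser --no-createdb %s" % tenant,
        "createdb --owner=%s %s" % (tenant, tenant),
        "%s/bin/python manage.py syncdb --settings=tenants.%s" % (venv, tenant),
    ]

# With Fabric 1.x this would be wired up roughly as:
#   from fabric.api import task, run
#   @task
#   def provision(tenant):
#       for cmd in provisioning_commands(tenant):
#           run(cmd)
```

Keeping the command list a pure function separate from the remote execution makes the task easy to test and to dry-run.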
During my research prior to starting development, I read two very interesting articles from the MSDN library. My takeaway from them was that there's a spectrum of possible architectures for multi-tenancy, and within that spectrum there isn't exactly any wrong or right decision. Quoting the second article:
The distinction between shared data and isolated data isn't binary. Instead, it's more of a continuum, with many variations possible between the two extremes: "Isolated - Separate DB - Separate Schema - Shared Schema - Shared"
There are advantages and drawbacks to choosing one strategy over another, but one should not necessarily aim for some Holy Grail. The article goes on to say:
Where your application should fall along this continuum depends on your business, architectural, and operational needs, and on customer considerations [...] all of these considerations are interrelated to some degree.
Regarding framework support, this was not my primary concern. We're building one app, so we might as well just build the whole thing from scratch. If we expected to reuse this code for multiple projects, I might have given it a second thought, but it's a single app with its own idiosyncrasies. I didn't want the constraints of following some "agreed upon" approach at the cost of having to bend the architecture's natural alignment to stay faithful to what's expected of the application. Like I said, I could've chosen to do things from scratch, but I settled for what seems to me a much more beneficial alternative: a micro-framework and a specialized database library (Flask, SQLAlchemy). So far things have been running pretty smoothly.
Regarding complexity, I'll grant that I've had to write a couple of scripts to automate certain aspects of schema management, but honestly, it wasn't that bad (SQLAlchemy helps a great deal with this stuff). The alternative would be to create scripts to manage different application instances, so there's work either way. One of my main concerns was to keep development simple. The approach you're considering was very appealing in that regard, as I would basically be building a single-tenant app and replicating it for each new tenant. But from my reading, I also learned that there are some performance and scalability disadvantages to running one application instance (and db) per tenant at the scale I'm aiming for. When I found out about PostgreSQL schemas, they offered a very good compromise.
I set up an app with the previously mentioned tools, spent some time reading on how to operate with schemas, tested a few queries, had some misses that taught me more on the peculiarities of this setup. But within a few hours of trial and error, everything worked as advertised. I never had to create a driver or anything like that. Upon connection start I just stored a connection on a thread-local object, started a transaction and switched schema. I then did what I had to do in my request and on the way out I released the connection. That was all on day 1. The code that manages the multi-schema access to the db is probably 5 to 10 short lines in 2 functions. The rest of the development is oblivious to the multi-tenancy aspect of the app.
Regarding security, I won't claim to know what you know on the matter, but the scenario in which an attacker gains access to other schemas through a compromised process seems a bit far-fetched to me. I'd have to read up a bit more on the topic to understand how that exploit might work. PostgreSQL has some interesting authentication protocols, and I imagine that in combination with the OS's own auth mechanisms there are ways to prevent it, or at least make things more difficult.
Indeed there is no right or wrong in this case, just different shades of grey. The scale you're aiming for is an important factor though (and I don't know what you're aiming for, so I provided ours for context). And perhaps even more important is the ratio between tenants and actual users.
In all cases there is work to do, of course - I've just found that the 'one db per tenant' model allowed us to re-use more existing infrastructure than the other models. For us that's a win. As always, the devil is in the details - some of which do not become clear until after you've run into the corresponding devil.
With respect to security/isolation consider the following.
Case A: one schema per tenant, one db user shared across all tenants. Anyone with the ability to manipulate queries going to the database (e.g. a hacker exploiting a SQL injection vulnerability) can access all schemas and all tenants' data.
Case B: one db per tenant, one db user per tenant, one app process per tenant running under its own OS user. In this case an attacker is far less likely to be able to access any data from tenants other than the one to whom the compromised process belongs.
Of course, if the processes of other tenants are also approachable by the attacker, it is very much possible that they will be compromised through the same vulnerability. However, in our situation that is not the case (well, I should know better: it will be very difficult for the attacker), hence we benefit from the extra isolation.
Your assumption that PostgreSQL's authentication, or even OS-level authentication, provides extra isolation is wrong - unless you use one db user per tenant and set permissions carefully. But as stated before, in that case you had better double-check whether you still benefit as much from schema-based multi-tenancy as you may think. And even then, the level of separation is still lower.
What isolation level you need depends on your audience, of course, but from a security point of view I do not consider a schema-based approach much better than 'all tenants in the same table with a tenant_id field'.
I mentioned something about this the last time a discussion involving Django and Postgres came up, but it bears repeating.
If your environment is set up such that database connections are long-lived, please double-check that you're not using SQLAlchemy's default behavior to open an explicit transaction (e.g., "BEGIN TRANSACTION") upon connection. (It may no longer be the default, but it was last I worked with a Django shop, ~two years ago, now.) That behavior interferes with VACUUM's ability to do its job, your tables can bloat linearly with how "write-hot" they are, and performance of queries using those tables will reflect that.
Briefly, this is a consequence of Postgres' MVCC (multi-version concurrency control) architecture. Physically, an UPDATE statement is an atomic INSERT/DELETE operation; each version of every row is stored on disk, and Postgres keeps track of which versions of which rows are "visible" in the context of which transactions. Obviously, that's not a sustainable approach, both in terms of performance, and resource consumption. You'll fill up your storage with "dead" tuples, and you'll have to trawl through disk page after disk page of them to find a single "live" one.
Enter VACUUM and later, the autovacuum daemon. They know which transactions are open, and which versions of rows were modified by which transactions. "Dead" tuples older than any open transaction are truly dead, and their disk footprint can be re-used by subsequent INSERT — or UPDATE — operations.
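That "dead tuples older than any open transaction are truly dead" rule can be sketched as a toy model. This grossly simplifies real Postgres visibility (which involves far more than an xmin/xmax comparison), but it captures the reasoning:

```python
# Toy model of MVCC reclamation: each row version records the
# transaction id (xid) that created it (xmin) and, once superseded or
# deleted, the xid that removed it (xmax). Grossly simplified.

def reclaimable(versions, oldest_open_xid):
    """Dead versions are truly dead only once no open transaction
    could still see them."""
    return [v for v in versions
            if v["xmax"] is not None and v["xmax"] < oldest_open_xid]

versions = [
    {"row": 1, "xmin": 100, "xmax": 105},   # superseded by UPDATE at xid 105
    {"row": 1, "xmin": 105, "xmax": 110},   # superseded again at xid 110
    {"row": 1, "xmin": 110, "xmax": None},  # live version
]

print(len(reclaimable(versions, 200)))  # nothing old is open -> 2
print(len(reclaimable(versions, 103)))  # a xact from xid 103 is open -> 0
```

The second call is the pathological case: one transaction left open since xid 103 pins every later dead version of every row it might still see.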
SQLAlchemy's (former?) default behavior to open an explicit transaction upon connection isn't ideally suited for use with MVCC-based databases (so if you're using InnoDB, you'll probably have some flavor of this same problem) — particularly if your environment has connections open for days at a time, which was the case at that former gig.
When I found this problem in their stack, tables with hundreds, or perhaps a few thousand rows had disk footprints in the gibibytes. They were among the hottest tables in the system and performance, consequently, sucked. The only thing that kept them alive was that the people they had working on this stuff before me had a weekly maintenance window (site outage) where they ran a VACUUM FULL and REINDEX. How no-one twigged to the fact that app performance followed a perfect sawtooth pattern with a period of exactly one week, lock-step in phase with the maintenance cycle, I'll never understand...
You can check whether your environment is configured this way with:
SELECT * FROM pg_stat_activity WHERE current_query = '<IDLE> in transaction' ORDER BY xact_start;
If any of those guys are older than ... not very old at all (seconds, perhaps minutes at the outside, but that's a function of how your stack works, and only you can really say what's abnormal) then, Houston, you have a problem.
A caveat: this test is fairly specific, but somewhat less sensitive, with false negatives being possible. What you're really looking for is Postgres backends with an xact_start — current transaction start time — that is (potentially significantly) out of line with how long your normal database operations should take. Those usually show up as "<IDLE> in transaction" but you could very well happen to run the above query exactly when one of these offending connections is busy servicing a query, itself...
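A more sensitive check goes by xact_start age rather than the query string, as described above. A sketch, with hard-coded sample rows standing in for real pg_stat_activity output (the five-minute threshold is an arbitrary placeholder; tune it to your workload):

```python
from datetime import datetime, timedelta

def stale_transactions(rows, now, max_age=timedelta(minutes=5)):
    """Flag backends whose transaction has been open longer than
    max_age, whether or not they currently read '<IDLE> in transaction'."""
    return [r for r in rows
            if r["xact_start"] is not None
            and now - r["xact_start"] > max_age]

now = datetime(2013, 1, 1, 12, 0)
rows = [
    {"pid": 11, "xact_start": now - timedelta(seconds=3)},  # normal
    {"pid": 12, "xact_start": now - timedelta(days=2)},     # Houston...
    {"pid": 13, "xact_start": None},                        # no open xact
]
print([r["pid"] for r in stale_transactions(rows, now)])    # -> [12]
```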
wow, where to begin with the factual errors in this post - most of this post is incorrect.
Just to get it out of the way: SQLAlchemy does not emit the "BEGIN" statement, nor does it call any kind of database function that directly emits "BEGIN", ever. Feel free to grep for it, starting at version 0.1.0 and going all the way up to the 0.8 tip - you won't see it. It's not a default, it's not something SQLAlchemy has any kind of option for - it doesn't do it.
SQLAlchemy, like virtually all Python libraries and frameworks that talk to relational databases, uses an API known as the Python database API, or DBAPI. With PostgreSQL, you're usually using a DBAPI implementation known as psycopg2. The DBAPI is organized in such a way that transactions are implicit. This means that when you first get a DBAPI connection, it's required per the specification to be in a transaction, or at least it has to be as soon as you do something with that connection. The DBAPI has a `commit()` method as well as a `rollback()`, but no begin() method. Why is that? Again, because a DBAPI connection is always in a transaction - there is nothing to BEGIN; it has already begun the moment you've connected or the moment the previous transaction has been closed via rollback or commit (or at least, the next time you emit a statement on that connection, it needs to be in a new transaction).
So when you use psycopg2 by itself, as soon as you connect and emit a statement, the above-mentioned "BEGIN TRANSACTION" has been emitted (see the big pink box here: http://initd.org/psycopg/docs/usage.html#transactions-contro...). If you want to disable that behavior, you can do so by setting psycopg2's [autocommit](http://initd.org/psycopg/docs/connection.html#connection.aut...) flag. It's also easy enough to set this flag when you're using psycopg2 via SQLAlchemy, and in fact things will work just fine - unless you actually need some degree of transaction isolation and/or need ROLLBACK to actually work. So setting psycopg2's autocommit flag is really not something you'd normally want to do unless, as the docs say, you're emitting explicit VACUUM statements on your psycopg2 connection - which of course we don't have to do these days now that vacuuming is automatic (and manually calling VACUUM is a database administration task in any case, which you'd be running from your psql console).
So we've established that 1. SQLAlchemy has nothing to do with "BEGIN TRANSACTION" and 2. psycopg2 and all DBAPIs are required to maintain transactional scope by default when a connection is first procured. What about the supposed issues with VACUUM ?
To put it simply, VACUUM has no problems with connections being open. What you're thinking of here are locks, and locks only occur once you're in a transaction and have accessed some table rows, which are now subject to various isolation rules. If you open a bunch of connections, and access/update a bunch of table rows, you'll have a lot of locks on hand, and that will get in the way of autovacuuming and such. However, as soon as you roll back the transactions, those locks are gone. When you use a database library like SQLAlchemy, a handful of connections are kept open in a pool, but the transactions are not. When you check out a connection, do a few things with it, then return it to the pool, any remaining transactional state is rolled back. edit: this of course assumes your application was written correctly enough that it closes out session/connection resources at the end of an operation - no different than when using the DBAPI directly.
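The claim about the pool can be illustrated with a toy stand-in. `FakeDBAPIConnection` is not real psycopg2; it just models the DBAPI rule that a statement implicitly opens a transaction, and `TinyPool` models the rollback-on-checkin behavior described above:

```python
class FakeDBAPIConnection:
    """Minimal stand-in for a DBAPI connection: a statement implicitly
    opens a transaction; commit/rollback close it."""
    def __init__(self):
        self.in_transaction = False
    def execute(self, sql):
        self.in_transaction = True   # implicit BEGIN on first statement
    def commit(self):
        self.in_transaction = False
    def rollback(self):
        self.in_transaction = False

class TinyPool:
    """Sketch of pool behavior: connections stay open across checkouts,
    but transactional state is rolled back when they're returned."""
    def __init__(self):
        self._free = [FakeDBAPIConnection()]
    def checkout(self):
        return self._free.pop()
    def checkin(self, conn):
        conn.rollback()              # what the pool does on return
        self._free.append(conn)

pool = TinyPool()
conn = pool.checkout()
conn.execute("SELECT 1")
assert conn.in_transaction           # implicitly in a transaction now
pool.checkin(conn)
assert not conn.in_transaction       # no long-lived transaction left behind
```

The connection itself stays open indefinitely; it's the transaction that ends at each checkin, which is exactly why pooled connections don't pin VACUUM.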
My guess is that you either were using an extremely old version of SQLAlchemy, like an early 0.3 version when we probably weren't calling rollback() correctly (this would be pre-2008), or your application wasn't handling the scope of connections/sessions appropriately (edit: as Django's ORM typically autocommits by default, this detail about SQLAlchemy's more explicit transactional nature may not have been apparent to your dev team; we offer autocommit as an option, though it is discouraged).
It goes without saying that psycopg2 is used in a tremendous number of high volume environments, without forcing the "autocommit" flag on. Postgresql's auto vacuuming works just fine regardless. SQLAlchemy is just a client of psycopg2.
> To put it simply, VACUUM has no problems with connections being open. What you're thinking of here are locks, and locks only occur once you're in a transaction and have accessed some table rows, which are now subject to various isolation rules.
I think you are correct except about this. What he is thinking about is indeed as he said transactions. Locks in PostgreSQL do also interfere with VACUUM but that is seldom the problem in practice since locks are usually taken at the row level and reading rows does not require any row locks. Open transactions are the main culprit when it comes to vacuum problems and it does not matter much if they have taken locks or not.
First, thank you very much for the clarifications and corrections. I wasn't on the engineering team that fixed the problem, and I'm not a Python guy; I just found the problem and explained the consequences of what was happening to the engineers. It seems I was mistaken in the particulars, for which I do apologize, both to you and everyone who's ever contributed to SQLAlchemy, and to anyone who was misinformed by my previous comment.
Nonetheless, there is a problem. It's not open connections that interfere with vacuuming; it's open transactions. While a transaction is open, vacuuming can't reclaim dead tuples, if those tuples were live at the start of the transaction. Those old row versions are still "visible" in the context of that transaction, and can't be reclaimed — and nor can the additional dead tuples created from further updates on those rows, because there is an open transaction that's older than they are.
Simply, if you connect to the db, open a transaction, and leave it open, then every subsequent update to every row in every table in that database will leave a dead tuple that can't be reclaimed until that transaction commits or is rolled back, because at the bottom of each chain of dead tuples, there's one that is still visible in the context of that open transaction.
That's the problem.
Net, in my particular situation, and a couple of others I've encountered first- and second-hand, enabling auto-commit (and only explicitly declaring transactions when multi-statement transactional semantics were specifically needed, and which were promptly committed upon completion) fixed the problem. A 2000 row table with ~1600 byte rows took up a little over 400 (8KB) disk pages, and the autovacuum daemon was able to keep it at that size. Compare that to before, without autocommit: it would be that size after a VACUUM FULL, and weigh gibibytes within days, no matter how many autovacuum workers there were, how aggressively they were tuned, or how often you ran (regular) VACUUM manually. Those dead tuples were not going away until the abuse of transactional semantics that caused them to accumulate was addressed.
If there's anyone to "blame" here, it's the engineers who didn't understand the implications of what they were doing. (To be fair, though, they were also using the database as a work queue, so it's probably reasonable to suggest they liked doing things sub-optimally...) The high volume environments you allude to are probably going about things with more cognizance of the implications of implicitly transactional semantics. They aren't proof that there isn't a problem; they're proof that you can do this sort of thing without shooting yourself in the foot, if you know what you're doing.
Look, it's not that the DBAPI being implicitly transactional is a "bad" thing. It's not. Transactional semantics are awesome, and as someone who gets paid for keeping people's databases (particularly PostgreSQL) happy, I'm emphatically for them. It just has consequences, particularly in the context of an MVCC-based RDBMS, and doubly so if your application is architected such that it leaves transactions, implicit or otherwise, open for extended periods.
> if your application is architected such that it leaves transactions, implicit or otherwise, open for extended periods.
Can you give any good reason why you need to leave transactions open for extended periods? In my experience, it only happens when the developer in question does not understand the semantics of the DBMS and therefore hasn't suitably designed their solution, in which case it's hardly the DBMS's fault when things go wrong.
I've rewritten plenty of code that "needed" a transaction to be open for a long time into code requiring a transaction for a few milliseconds (e.g. in a bulk import, parsing, caching lookups and reference data outside the transaction, building an XML document to supply into a SP as a parameter which only used a transaction for a single INSERT from the XML).
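That bulk-import restructuring can be sketched as follows. This is a hedged illustration, not the actual rewrite: `FakeConnection` stands in for a real DBAPI connection, and the table/column names are invented.

```python
# Do the expensive parsing/lookup work outside any transaction, and
# hold a transaction only for the final writes.
class FakeConnection:
    def __init__(self):
        self.log = []
    def execute(self, sql):
        self.log.append(sql)        # implicit BEGIN happens here in a real DBAPI
    def commit(self):
        self.log.append("COMMIT")

def bulk_import(conn, raw_lines):
    # 1. Parse and validate with no transaction open (pure Python).
    records = [line.strip().split(",") for line in raw_lines if line.strip()]
    # 2. A transaction is only open around the actual writes.
    for name, qty in records:
        conn.execute("INSERT INTO items (name, qty) VALUES (%s, %s)"
                     % (name, qty))  # real code would bind parameters
    conn.commit()                    # transaction open for milliseconds
    return len(records)

conn = FakeConnection()
n = bulk_import(conn, ["widget,3", "gadget,5", ""])
print(n, conn.log[-1])   # -> 2 COMMIT
```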
> Can you give any good reason why you need to leave transactions open for extended periods?
Offhand, other than xid-level consistency for a backup as mentioned by the sibling post, no. This case was pretty clearly the engineers not understanding what they were doing, as evidenced by it being easier to turn off autocommit and add explicit BEGIN statements where they were needed than to add explicit ROLLBACKs everywhere they were needed.
That makes an implicitly transactional adapter seem a bit of a foot-gun to me, though. At a minimum, I think it places an unnecessary burden on engineers to have to "roll back" every time they even ask the db something. That's how memory leaks happen, too, and that's why we generally think garbage-collected languages are a Good Idea.
 That said, my preferred backup strategy is to take a filesystem-level snapshot of the db volumes, mount that, and start a second Postgres instance against it. It'll "recover" from an "unclean shutdown", and then you can take your backup without incurring the penalty of holding open a transaction that long in the presence of concurrent writes.
You do pay some write-performance penalty for the snapshot while it's open, but upon releasing it, the disk pages are immediately gc-ed, and you're already taking your backup at a low-traffic point anyway, right? Disk pages in a Postgres cluster allocated because vacuuming was hindered by holding an xid for extended periods are merely autovacuumed, however; they're reclaimed for later use, not released.
I'm not familiar with PG but I am planning on using it in my current project, which is why I'm interested in understanding whether these are really DBMS issues.
In the .NET world, for all the database adapters I've used (MSSQL, SQLCE, Oracle, MDB), individual statements are implicitly wrapped in transactions when no explicit transaction has been specified - not sessions/connections.
I agree the "implicit transaction per session/connection" design is confusing and likely to result in bugs. While the standard .NET approach limits bugs, the downside is poorer performance when an unaware developer uses numerous implicit transactions for multiple statements where a single explicit transaction would suffice.
Can you point to any open source projects or other examples implementing this pattern? I think I understand where you're coming from on this but I have not seen many examples of this approach in the wild (except religiously using TransactionScope, eg. http://amnesia.codeplex.com/ ).
I'm wondering where it's best to manage the transactions... would they ever be in raw SQL or stored procedures instead of .NET code?
The PostgreSQL official C client library, libpq, (which most language specific wrappers are based on) works the same way as you say the .NET adapters do. Every statement is by default implicitly wrapped in a transaction until you run "BEGIN;".
>In the .NET world, for all the database adapters I've used (MSSQL, SQLCE, Oracle, MDB), individual statements are implicitly wrapped in transactions when no explicit transaction has been specified - not sessions/connections.
That is how the PostgreSQL API works as well. Most high-level language modules built on top of it also work that way. Unfortunately, Python's DB API is not one of them, but you can just set a config option to "act like everyone expects".
SQL Server has schemas which sound very similar to Postgres (i.e. logical groupings of tables within the same database, with different owners). And I think Oracle and DB2 do also have this feature. Maybe this is just an example of where MySQL is a bit behind, rather than something awesome with Postgres.
I'd love to move away from Oracle to Postgres, I really would. I'm trying to. But for massive amounts of data, the partitioning and some other features of Oracle just work better. The partitioning is a huge thing, especially for our data, which is partitioned by week and then organized according to a hierarchical triangular mesh with bitmapped indexes. This works so well for us (at 8 billion rows) it's silly. MySQL couldn't do this, and hacking this into Postgres was sloppy. Stored procedures in Oracle work pretty well too (especially because of shared memory), but I'm trying to eliminate them as much as possible to accommodate other people.
I'm quick to complain about Oracle (especially the fucking cost-based optimizer and how it frequently falls back to terrible hash joins for me, leading me to add hints, rewrite SQL or run the tuning advisor), but for its faults, it does some things pretty well.
That being said, I'm wholeheartedly rooting for Postgres.
There's a reason Oracle can charge an arse-load of money, and it's not because they're "evil": it's because their DB does some frigging incredible things when you know how to use it. I've worked with a 4 billion row star schema, partitioned by day then sub-partitioned for query optimization. It was OK to be "stale", so we inserted each day's data into an indexless table then swapped it in as a new partition over night.
Billions of rows in performance critical apps is pretty edge case. I think a lot of people paying for Oracle could pretty easily migrate away.
Yeah, there are a lot of other things we use it for. We have a lot of hierarchical data that uses the CONNECT BY statement, and I hate it because it has issues with scaling and bad execution plans, so I'm working on migrating all of that to a closure table instead.
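For readers unfamiliar with the closure-table idea: instead of storing only child-to-parent links, you materialize every (ancestor, descendant, depth) pair, so tree queries become plain joins. A small illustrative sketch (toy data, pure Python) of the rows such a table would hold:

```python
def closure_table(parent_of):
    """Given child -> parent links, build the set of
    (ancestor, descendant, depth) rows a closure table would store,
    including the depth-0 self rows."""
    rows = set()
    for node in set(parent_of) | set(parent_of.values()):
        anc, depth = node, 0
        while anc is not None:
            rows.add((anc, node, depth))
            anc = parent_of.get(anc)
            depth += 1
    return rows

# toy org chart: ceo -> vp -> engineer
tree = {"vp": "ceo", "engineer": "vp"}
rows = closure_table(tree)
print(("ceo", "engineer", 2) in rows)   # -> True
```

Fetching a whole subtree is then a single equality match on the ancestor column, with no recursive CONNECT BY-style traversal at query time.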
If you'd like to give Postgres a serious shot, I'd recommend taking a look at Citus: http://citusdata.com/. It's a distributed database forked from Postgres, and it can easily handle billions of rows in realtime for certain use-cases. We use it at MixRank to store and run analytics on the ads we crawl on the internet (a respectable amount of data), and it performs extremely well. (Full disclosure: I know the Citus team well.)
I've also pushed Postgres to its limits and suspect you'd be able to make it work with that much data, though I wouldn't consider it a trivial exercise executing that transition on a production system. If you're seriously considering it though, feel free to reach out via the email in my profile.
In PostgreSQL, tuning queries is generally quite easy once you have learned to read the EXPLAIN output. The EXPLAIN output makes it obvious most of the time why the planner chose the plan it did. This is also an area that has received improvements in every recent release - both the EXPLAIN output and the cost-based planner.
While your partitioning scheme should be implementable in PostgreSQL using table inheritance I do not doubt it is simpler to manage in Oracle. The PostgreSQL partitioning is lacking in usability and also a bit in features.
The primary target audience for hacker-to-hacker Postgres evangelism is MySQL users. Because let's face it, the choice for DBs like Oracle is usually made upstairs, and for very different reasons.
So why do Postgres advocates insist on dissing MySQL with false and misleading arguments? The usual target is some default settings, when obviously there are three kinds of MySQL users: the ones that actually have a reason to want less strictness, the ones that know how and when to change the settings, and the ones that just don't give a fuck.
You're not going to convince the last group until you are #1, and you're insulting the intelligence of the first two, whilst also making them question whether the arguments in favor of Postgres are actually true, since you're not being particularly honest in your criticism of MySQL.
And you're wondering why Postgres doesn't get any more love?
Most MySQL users are well aware that Postgres is technically a better DB. They mostly (right or wrong) feel MySQL serves their needs sufficiently, and are not particularly attracted to an open source community with such a douchy attitude.
If even Oracle doesn't scare people off sufficiently to jump ship and switch to a better, free and open database, you've got a serious image problem. Stop blaming it on other people's "stupidity".
I'm not sure where in the article the author insults the _users_ of MySQL as stupid, but rather insults some decisions made by MySQL as ill-advised.
Incidentally, I was not aware of issues around default settings with MySQL when I switched our infrastructure to Postgres - the primary reason we switched was transactional and much more efficient DDL changes. I was at a company previously that I felt was seriously hampered from innovating due to strong resistance against schema changes by the DBAs and want to do my best to prevent a similar culture where I work.
Before I decided to switch I benchmarked schema changes between Postgres and MySQL and found Postgres to be quite impressive, see the gist of results here: https://gist.github.com/1620133. If you (or anyone else) can point out why this benchmark is false or misleading, I'm all ears.
Your benchmarks especially demonstrate the fact that PostgreSQL tries to do as little work as possible when adding a column or dropping a column (or in recent versions of PostgreSQL when changing the type of a column). If you do not have a default value for the new column PostgreSQL only has to update the table definition which is almost instant.
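The difference is easy to see for yourself (table name hypothetical):

```sql
-- Adding a column with no default only touches the system catalog,
-- so it is near-instant regardless of table size:
ALTER TABLE big_table ADD COLUMN note text;

-- Adding a column with a default (in versions of this era) forces
-- a rewrite of every row, which is why the first form is so much
-- faster on large tables:
ALTER TABLE big_table ADD COLUMN flag boolean DEFAULT false;
```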
Another interesting part about your benchmarks are that they throw serious doubts on the popular myth that count(*) is slow in PostgreSQL. And your benchmarks were made before PostgreSQL 9.2 which will add index-only scans.
"... throw serious doubts on the popular myth that count(*) is slow in PostgreSQL."
Whew! Glad that was only a myth! I'll remind myself of that whenever my code takes > 10 seconds to return the count(*) from a table. "Just a myth - this isn't really happening". I'll try repeating that to myself while waiting for the results - probably should only take 10-15 repeats of the phrase before the row returns, right?
Whoah - just did a check on a real table with a whopping 2 million records - postgresql 9 just blazed through that in < 7 seconds to give me
select count(*) from student;
w00t! 7 seconds! I'm not even sure how that myth got started, let alone why people still believe that hogwash.
I just did a SELECT COUNT(*) on a table here in our QA environment; 20.5 seconds to count 42385875 records from one table and 68.8 seconds to count 191906711 records from another table (Oracle 10g 64-bit).
"The fact that multiple transactions can see different states of the data means that there can be no straightforward way for "COUNT(*)" to summarize data across the whole table; PostgreSQL must walk through all rows, in some sense."
When operating on a table, my understanding is that pg will select a version and operate on that version. If other versions are being worked on in transactions - that's a different story. Why can't metadata about the number of rows be assigned with the table version data, so after every operation, you'd know what the number of rows was at at that moment in time?
PostgreSQL does not track any versions at the table level, instead it tracks the versions for every row. This means two queries can modify different parts of the table concurrently without any lock contention.
In PostgreSQL every row has two numbers: the transaction ID it was inserted in and the transaction ID it was deleted in. An update is an insert plus a delete. When running a select in PostgreSQL you just traverse the table and for each row check these two numbers to know if you are allowed to see the row.
The details above are PostgreSQL specific but most other databases have the same problem with there being no way to know the exact count without actually counting the rows.
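You can see these per-row version numbers directly, since they are exposed as the hidden system columns xmin and xmax:

```sql
-- xmin = inserting transaction ID, xmax = deleting transaction ID
-- (0 while the row is live).
CREATE TABLE t (id int);
INSERT INTO t VALUES (1);
SELECT xmin, xmax, id FROM t;

-- After a DELETE inside an open transaction, other sessions still
-- see the row until that transaction commits -- which is why an
-- exact COUNT(*) must inspect every row's visibility.
```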
1. There is contention currently in PostgreSQL when writing the database journal (used for crash recovery and replication).
2. There are some optimizations done here. For example, HOT updates avoid index updates.
I was unclear: the part that is a myth is that counts are slow in _PostgreSQL_ specifically. COUNT(*) is actually slow in general on databases with high concurrency requirements, because you have to sacrifice something to make COUNT(*) fast.
If you had read his benchmark you would have seen COUNT(*) was slow in both databases (three times as slow in MySQL).
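When an exact count isn't required, the planner's own statistics give a cheap estimate (table name hypothetical; accuracy depends on how recently the table was ANALYZEd):

```sql
-- Near-instant estimated row count from the catalog, instead of
-- walking the whole table with SELECT count(*):
SELECT reltuples::bigint AS estimate
FROM pg_class
WHERE relname = 'student';
```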
For me it would be missing common table expressions, the weak implementation of functions and stored procedures (recursion), and the horrible .NET connector (connections dying, an int can get returned as a byte). And I'm not even mentioning the gazillions of advanced features that pgsql (and Oracle or SQL Server) has that MySQL doesn't.
Typical fanboy bluster. What's dishonest about the criticism? Be specific. What you really mean to say is that most MySQL users use it because it is there and take it as it has been configured for them. MySQL is the storage platform for people who do not know about databases and probably only use it because some blogging product requires them to.
You just made his point and rather succinctly too.
MySQL has a low barrier to entry with some reasonable tools that make it easy to get up and running with a minimum of fuss. For a large swathe of applications it does the job just fine. There's little point in trying to pretend otherwise. Where PostgreSQL really fits in is when the next step is needed, either because the app has outgrown what MySQL can offer or there are some use (edge) cases where PostgreSQL does a better job. Address these areas competently and the world will beat a path to your door.
Even when a company like Google or Facebook uses "mysql", it's not stock Oracle MySQL -- which is what people are going to download and use when they hear that MySQL is good enough because major companies use it. This is an unfortunate situation caused by most distributions still favoring stock mysql over percona, and Oracle refusing to incorporate patches even when there's plenty of evidence that people who know what they're doing want those patches.
It seems like Oracle is more concerned about trying to keep mysql less well suited for major deployments, so it can upsell its flagship product. Or else their development process simply doesn't properly accommodate community assistance. Either way, as long as their resistance to community-aided development continues, mysql and oracle/mysql disparagement will continue, not because people hate Oracle, but because Oracle's management of mysql makes it less than ideal for those who simply want a good open source database.
Which is the point of the OP. Anyone who is open to looking at options and who wants a good open source database should look at PostgreSQL.
I doubt any of those example companies you listed even use stock Percona. Facebook for sure has extensive modifications beyond even the publicly available distributions like Percona and MariaDB.
Up until very recently Twitter was still using stock MySQL 5.0.45 (which I believe is from 2007) nearly everywhere in production. The notion that MySQL requires a highly patched version with a world class development team to maintain is a myth. Facebook's patches are made public and are often incorporated into MariaDB and/or Percona forks. Oracle usually goes their own way on implementing changes from the community.
Not at all. Like I say above, right tool for the job. I don't doubt that all those companies use MySQL for something, indeed probably every large company in the world has at least one instance of it somewhere.
For example, taking the first company on your list that actually deals with real stuff in the real world rather than just running a website, DHL, here I see http://www.oracle.com/us/corporate/press/037809 - they're an Oracle shop. I could continue but I think I've proved my point.
PS the cost of not losing "ANY data for ANY reason" is effectively infinite. You make a call on what data loss would cost your business then you decide how likely each scenario is and that gives you your budget. Datacentre disruptions I have been personally involved with include workmen digging up fibre (several times actually), deliberate vandalism of cable (during May Day riots), fire in the generator room (a couple of times), large truck losing control and driving through the walls, flooding, and several more. No one of these scenarios was sufficient by itself to hurt us because we had mitigated them. But 2 happening together, then yeah, there is a significant possibility that some data loss would occur (e.g. replication to datacentre B is disrupted, then datacentre A is catastrophically lost). The cost of protecting against that however is prohibitive.
So before you get all starry-eyed about these amazing companies and their amazing MySQL installations, understand that you are starting from a false premise. Oracle - for a cost - provides some powerful capabilities (e.g. stretch-RAC). Is it worth the money? Well, it comes back to what your data is worth. This incidentally is why some organizations cling to paper forms - it's cheaper to re-key the data than build out the infrastructure to do it all online...
 Tho' if you think you need this, what you probably really need is DB/2 and Sysplex
I couldn't agree with you more. The statements made in this article seem pretty derisive towards MySQL, a system I like for its flexibility, easy integration, and power. Sure, it might not be the purest SQL implementation, it lets you bend the rules on data integrity (maybe not best practice, but I still like it), and there are some quirks, but calling MySQL stupid doesn't make me a Postgres fan.
Not just that. PostgreSQL is a lot more like Oracle than it is like MySQL. Both are development platforms in a box. Both are built for many applications to run off the same database (this is a use case that MySQL really sucks at btw-- MySQL is ideal for one app per database deployments). Both let you write stored procedures in some variation of PL/SQL and Java.
Oracle, on the other hand, has a setting for everything. A good Oracle DBA can spend all his time fiddling with settings to squeeze out that little bit of extra performance. PostgreSQL is simpler.
PostgreSQL on the other hand lets you write your stored procedures in any language. On the other hand, you can spend all your time writing and debugging stored procedures written in brainfuck.....( and in case you want to do that, download the handler at https://github.com/mikejs/pl-bf and go get started!)
There are a few other differences as well. In general, I think Oracle is better for some things, PostgreSQL for others, but they have more overlap than either does with MySQL.
And there is also a company dedicated to providing a migration path from Oracle to PostgreSQL: EnterpriseDB. I have not used their products, but they employ several of the core contributors to the PostgreSQL project, who have contributed many important features.
That's one reason. The reason why they need it for third-party applications is because that's what developers targeted, for two reasons:
1. It was the only game in town when the application was designed.
2. It was head-and-shoulders above everyone else. More, better, faster. Most people think you just throw SQL-99 at it, but then you're wasting your money. Look at their analytical languages, for example.
Of course, Oracle and Microsoft are also major vendors of applications that run on their database servers. In fact, the application revenue for both is probably much higher than database server revenue.
I'm a huge PostgreSQL fanboy but I think it's worth mentioning that it's usually not a good idea to use too many esoteric database features when building an app, since it couples your system with a particular database.
That said, even if you don't use PostgreSQL's whiz bang features, its stability, performance, and outright sanity with regards to handling data make it the right database to reach for in many cases. And, for smaller projects where you are willing to couple yourself with it, its additional data types, query features, and so on are awesome.
I absolutely disagree. Drop-in portability between databases is an operational myth for any production application anyway. Whether you're using Postgres, Riak, Mongo or Oracle, you're going to have to do a lot of work to change your database infrastructure.
Further, every database, noSQL or otherwise, offers a different set of features and functionality. Why the heck wouldn't you take advantage of k-nearest-neighbors indices? Why would you want to roll your own full text search functionality? The alternative is to introduce whole other services to your infrastructure!
It's not a matter of drop-in portability, it's a matter of reducing the complexity of migration as well as developer confusion. Often the case for using custom data types, for example, is quite weak, when considering the tradeoffs. They move complicated logic into the database, are unfamiliar to most developers, and end up needing to be reverse engineered if you want to move your data into different storage. I think they should be an option of last resort, or used in situations where the cost/benefit ratio is so skewed in their favor you'd be crazy to not take advantage. Generally speaking these are the rarest of cases.
"it's a matter of reducing the complexity of migration"
You put the code in your application so you can switch out the database, but why do you want to switch databases?
You can't switch out the database for one with more features, because then you're not using the lowest common denominator any more, and you can't switch back.
It can't be licensing costs, because postgresql licenses are free.
The only other reason I can think of is performance. But trying to avoid database features just because you might want to migrate to a less-featureful database later seems more likely to result in performance problems than prevent them.
I agree you don't want to go out on a limb with crazy features just because you can. But those features are there for a reason, and might drastically simplify portions of your application if you use them. And a lot of them are SQL standard, just not supported properly in all database systems.
Also, I'd like to point out that database migrations themselves are quite rare once they become established in an organization. You might be able to migrate one fairly simple application if you bend over backwards trying to use only the simplest features; but once a few applications are depending on it, it's just too expensive.
I am building a product that can work with SQL Server, MySQL, PgSQL and Oracle. The reason was if the company who buys the software already has the infrastructure and DBAs in that database, they can leverage it. If I were making the decision now I would definitely choose just one and stick with it, because
- an astonishing amount of things are done differently on these four DBMSs (identities/sequences and getting their last value, paging, DDL)
- each of these dbms has specific features which would improve the performance or help us develop stuff but we can't use them because they require different architecture (and we want to keep it the same)
Don't do software for multiple databases. It's just not worth the trouble.
Not to mention even the basic data types are differently named and work differently (text vs varchar vs nvarchar vs varchar2). It is especially hard when you have to work with date and time types/functions.
The main reason to do that is licensing. You either need to get a redistribution license, or the client needs to work out their own licensing deal -- and either way, you don't want to be beholden when negotiation time comes.
That also applies to MySQL, because it's GPL.
That does not apply to Postgres, because you can just ship it with the app.
I exaggerated a little, there certainly are times when you just don't care what database system you use and would rather have simple migrations at the expense of owning more code. But I think these reasons are weighted much too highly, particularly in the context of postgres.
I've seen it happen multiple times. The reasons vary, but it generally has to do with scaling up the system into more modularized components and using data stores better suited for particular sub-problems that the "big honking database" was used previously.
To be clear, I'm not really suggesting you avoid using fancy features in data stores like Redis or Solr. The relational database, with history as a guide, tends to be the first data store a system starts with, and ends up having everything thrown into it even if it is not the right tool for the job. Over time, chunks of data end up being moved out of the relational monster into other data stores. If your relational database choice is tightly coupled with your implementation, then you'll have a harder time moving things like your search engine into Solr, your queue into RabbitMQ, your counters into Redis, and so on. And yes, you'll also have a hard time moving to a different relational database, which I've also seen happen (on a massive, massive scale actually.)
If you stick to using an ORM that provides few leaks in its abstraction, write non-clever SQL, model things relationally using the standard types, generally speaking moving parts of a system out of a relational database into different data stores goes from being nearly impossible to being merely difficult.
> using data stores better suited for particular sub-problems
Sometimes, rather than swapping out the entire data store to get access to a feature, if you cut the rope that is tying your hands behind your back you may find that your existing data store can already handle your particular sub-problem quite well, and in fact may do so better than the fancy new one that was tailored specifically for that purpose.
This happens particularly often with PostgreSQL. If you haven't used its GiST or GIN index types, you really should play around with them, and you should keep them in the same arsenal and pull them out for the same reasons you might decide to just switch everything out to Solr. A lot of the reasons people might want to use Redis are handled quite well by PostgreSQL's type extensibility.
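As one concrete sketch of covering a "Solr-shaped" problem in place, here is built-in full text search backed by a GIN index (table and column names are made up):

```sql
-- Precompute a tsvector column and index it with GIN:
ALTER TABLE docs ADD COLUMN body_tsv tsvector;
UPDATE docs SET body_tsv = to_tsvector('english', body);
CREATE INDEX docs_body_idx ON docs USING gin (body_tsv);

-- Indexed full text query, no external search service required:
SELECT id
FROM docs
WHERE body_tsv @@ to_tsquery('english', 'postgres & replication');
```

A trigger to keep body_tsv current on writes would complete the setup.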
In essence, if you are willing to throw everything out and switch to a different tool, it seems entirely inane to ignore that the tool you already have might actually already cover the features you need, especially so if the reason you are refusing to use that functionality is to make it easier for you to eventually jump ship to a tool where you no longer have your self-imposed handicap and are now willing to use the features offered.
"Often the case for using custom data types, for example, is quite weak..."
Lots of people come to the postgres community because of PostGIS, which is (among other things) a custom type distributed separately from postgres.
Using custom types is not bad, the mistake is thinking that making a new type is easy. For a non-trivial custom data type, you need to tie it into the indexing system (GiST+KNN, GIN, SP-GiST, BTree, Hash) as well as the optimizer (have some good stats functions) -- not to mention the basics like a good representation and a well-thought out set of functions and operators.
Custom data types are really one aspect of an extension, albeit a crucial one. So don't think about it as "hey, I'll make a new phone number type because it sounds cool", think about it like "we need better support for genomic data in postgres, let's sit down and plan a serious extension".
If you want a special type for a phone number, use a domain over text with a check constraint. And that's SQL standard, too.
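A minimal sketch of that domain approach (the validation pattern is illustrative, not a serious phone-number grammar):

```sql
-- A domain is just text plus a reusable check constraint -- far
-- less work than a full custom type, and standard SQL:
CREATE DOMAIN phone_number AS text
    CHECK (VALUE ~ '^\+?[0-9 ()-]{7,20}$');

CREATE TABLE contacts (name text, phone phone_number);

-- An INSERT with a non-matching value now fails the domain's
-- check constraint instead of silently storing junk.
```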
The reason to use a phone number type if you want to use one is that you can, in theory, create functions against it. A good example of what might be possible with such a type can be gained by looking at the standard networking type.
Now, it's rare that you are likely to get that into phone numbers, but there might be cases where you could decompose the data and do relevant searches on components might come in really handy.
The way I would look at doing that if I didn't want to go into C would be a complex type and a bunch of casts, functions, and operators.
BTW, I use custom types a lot for defining output tuples for stored procedures; it is relatively hard to get away from that.
"Now, it's rare that you are likely to get that into phone numbers..."
That was part of my point -- if your entire business revolves around phone numbers, because you are a phone company or something, then maybe a custom data type makes sense. Then do it, and do it right.
But if you think of a phone number as "something to call" then all you need to do is display it to the right person at the right time, and ensure that it reasonably matches something you can type on a phone. And that sounds more like a domain over a text field with a constraint, to me.
"I use custom types a lot for defining output tuples for stored procedures"
I don't really consider composite types "custom" types any more than an enum is a custom type. Maybe I misunderstood gfodor's statement.
With the ability to define columns which store composite types, the line between a custom data type in C and a composite type plus casts is blurring.
Back in the 7.x line at one point I discovered that you could do something like:
create table foo (...);
create table bar (myfoo foo);
And this table would be write-only, and indeed the backend would crash when reading from it. That's the only really interesting bug I ever found in PostgreSQL. However, I made the case that it would be nice to be able to treat tuples as full-fledged data types, and by at least 8.1 that was supported.
So now I can:
create type foo as (...); -- works with create table too
create function foo_to_text .....;
create function text_to_foo ....;
create table bar (my_foo foo);
insert into bar (my_foo) values ('my.foo.as.text'::text);
So the point is that composite types can now be used to blur the lines quite a bit between traditional custom types and the roles that composite types are traditionally used for. You can go a long way prototyping what is useful with composite types, and then later do that right in C.
You don't need to create a custom type to return tuples anymore. (that kind of went out with 8.1). You can use OUT parameters or RETURNS TABLE syntax. RETURNS TABLE is ANSI compliant and supported since 8.3. SQL Server supports RETURNS TABLE as well.
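A sketch of the RETURNS TABLE form (schema and names are hypothetical):

```sql
-- No custom output type needed; the result columns are declared
-- inline. Works in PostgreSQL 8.3+ and is ANSI-compliant.
CREATE FUNCTION active_users(min_logins int)
RETURNS TABLE (user_id int, user_name text) AS $$
BEGIN
    RETURN QUERY
        SELECT u.id, u.name
        FROM users u
        WHERE u.logins >= min_logins;
END;
$$ LANGUAGE plpgsql;

-- Callable like a table:
SELECT * FROM active_users(5);
```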
And I, also, disagree. I'm using the Developer Express ORM, XPO, and it provides 'drop-in portability' between a range of databases.
I've successfully run my small-mid scale app on three different backends - MS SQL, MySQL and Postgres - simply by changing the connection string.
Yes, it's .NET / Windows only, but drop-in portability is definitely not a myth.
Depending on your application and how you're taking advantage of your database, you could drop-in replace a database. For example, people using some sort of abstraction on top of their database (an ORM) often switch between databases.
With that being said, if you're actually designing a complex database back-end for an application you will likely want to spend time becoming very familiar with the database of choice, learn the advantages and then exploit them. This could prove to be pretty difficult to just replace on a whim later on.
edit: Switching between an RDBMS and a "NoSQL" database would be much different though. Switching between one SQL server to another is one thing, but switching from a SQL server to a "NoSQL" server is going to be depend on what kind of model you're changing to. For example, MongoDB & Riak are both "NoSQL" databases but they are both quite different fundamentally.
>Depending on your application and how you're taking advantage of your database, you could drop-in replace a database.
If you're doing something simple, sure, maybe. But beyond trivial things like DDL syntax and query syntax, there's query optimization (even when only going in via ORM, because schema design can affect this), tuning, backups, HA, monitoring, and a dozen other things.
You're already making a huge non-portable investment in using a complex tool like a database. In comparison to this, introducing dependence on its non-standard features is pretty small change, so you might as well stop worrying (considering how seldom people actually migrate), raise a glass to YAGNI, and learn to love your database.
Maybe in another five years your or some successor will end up cursing the day you made that decision, but if you could really use non-standard feature X and it's sitting right there in front of you, it's silly to shun it "just in case."
Depends on what you are doing. the key thing is to decide how tightly coupled the system should be.
There are many applications which may not need this. In those cases, you are better off focusing on portability (only ANSI SQL features).
At least with PostgreSQL, most of the features are really shortcuts, rather than kludges masquerading as features. E.g. - extending a table as if it were a superclass, rather than creating a supplemental "joiner" table and creating a writeable view on top of the "super" table + the "sub" table.
I use table inheritance as an actual sort of inheritance, to enforce a consistent interface over a set of relations. Yes, that could be a shortcut but it also allows for foreign keys to be against different relations on each of the partitions.
Yes, that could be done with partitioning in other ways, so maybe the short cut analogy works.
But others? I don't know. The ability to write a stored procedure as a quick Perl regular expression and then call it from a SQL query doesn't strike me as a short-cut. Nor does the extensible type system, or the ability to cast complex types to simple types with defined functions.
A lot of these are features which make PostgreSQL a development platform in a box, rather than a data store.
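For instance, a one-line Perl regex wrapped as a SQL-callable function might look like this (assumes the plperl language is installed; the function name is made up):

```sql
-- Strip everything but digits from a string, via a Perl regex:
CREATE FUNCTION strip_nondigits(text) RETURNS text AS $$
    my ($s) = @_;
    $s =~ s/\D//g;   # remove all non-digit characters
    return $s;
$$ LANGUAGE plperl;

SELECT strip_nondigits('+1 (555) 123-4567');  -- 15551234567
```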
I just watched a 10 minute video whereby nearly all the issues pointed out can be solved with one line in my.cnf, or dynamically with SET GLOBAL sql_mode='TRADITIONAL'. Yeah, the default is no good. Learn your RDBMS and the problem goes away. Watching him point and click and move windows around also made it very difficult to follow.
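For reference, the strict mode fix referred to above (note the MySQL variable is sql_mode):

```sql
-- At runtime; affects connections opened after this statement:
SET GLOBAL sql_mode = 'TRADITIONAL';

-- Or persistently, as one line in my.cnf under [mysqld]:
--   sql_mode = TRADITIONAL
```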
PostgreSQL is an awesome RDBMS, but adoption will never eclipse MySQL until they have a scalable replication model that allows tiered replication, multi-master replication, writable slaves (very useful for reporting boxes), and the ability to purge data on a master without purging it on the slave. That, and the ability to reliably upgrade your binary without performing a massively time-consuming dump and reloading. Replication and upgradability is everything when you are in operations, which are things many (not all) developers do not consider.
Your concerns are constructive and well-placed. It happens that they are all either done or actively being improved, but that's not obvious unless you follow very closely, so keep 'em coming.
"a scalable replication model that allows tiered replication"
They do in 9.2 (currently beta), it's called cascading replication.
The other replication features you mention are under active development by a team of reputed hackers.
"That, and the ability to reliably upgrade your binary without performing a massively time-consuming dump and reloading."
pg_upgrade has been available and you can upgrade from 8.3 to the latest without the dump/reload cycle. It has been a little rocky (by postgres standards) and some people have been hesitant, but it gets the job done and it's been improving a lot. A talk at a recent conference spoke quite highly of pg_upgrade despite running into some challenges (like a library versioning issue related to some perl functions they had and some multibyte characters I think). Not for the faint of heart, but if you really need this and don't mind reading a little (or hiring a consultant), you can make it work.
Every issue or every missing feature seems like a showstopper if you don't step back for a minute. No DBMS is perfect. Everyone has their list of "postgres is missing X,Y, and Z". The funny thing to me is that X, Y, and Z change with each release because the previous X and Y were added (often with greater flexibility than originally imagined), and there's a new alternative to Z.
Postgresql puts out a very high quality release every year with major features. Usually, it's a good balance of features requested by (potential?) users like you and new innovative features that move the database world ahead.
As I mentioned, I do like PostgreSQL for its consistency and I want to like it even more. Cascading replication looks interesting - does it support the topology that I mention, where the master has a subset of the data from the slaves and the slaves can be written to? The documentation is difficult to decipher and lacks examples.
I don't mean this to be MySQL vs PostgreSQL, but MySQL has had these features for over 10 years and the kinks are already worked out. It's one of the main reasons for the success of the platform. My X,Y, and Z list has been unchanging for 5 years when I first started maintaining a medium sized Slony-I cluster that required a complete rebuild and several hours of downtime every few months. I suspect that's true of most operations-focused folks who have maintained PostgreSQL clusters. Devs love it, DBAs love it, but it is Operations that holds the keys to the kingdom.
"Cascading replication looks interesting - does it support the topology that I mention where the master has a subset of the [tables??] from the slaves and the slaves can be written to?"
No. Cascading replication basically means that you can replicate from slaves to other slaves, rather than always replicating from the master. That allows you to form hierarchies, which I thought was what you meant by "tiered".
The current built-in replication is a form of physical replication, meaning that data pages are (more or less) identical on the master and slave. What you want is logical replication, which is being actively worked on now as a part of core postgres.
In the meantime, there are external logical replication systems like Slony, Bucardo, and Londiste. Slony is deserving of a reputation for being complex, but it's being actively developed and can be used for everything you mention (based on the very high level descriptions that you gave). I understand that "not in core, and hard to use" are pretty damning for a lot of cases, but it really does get the job done. And hopefully there will be in-core solutions in later versions of postgres.
"I don't mean this to be MySQL vs PostgreSQL, but MySQL has had these features for over 10 years and the kinks are already worked out."
I'm glad to hear that it's working for you.
"My X,Y, and Z list has been unchanging for 5 years"
For future reference, all of these features will fall under the heading "logical replication". When you see that, if it doesn't have all of the features you need, then they are probably going to arrive soon.
Postgres has had external logical replication (e.g. Slony) this whole time, which I think has somewhat reduced the demand to invest huge amounts of work in core. It looks like it fell short for your use case, unfortunately; and hopefully the current logical replication in core project will not.
Thank you for the clarification about the PostgreSQL terminology. That will help me keep track of the progress. I am very pleased that the PostgreSQL community has finally started implementing native logical replication, which is really one of the biggest obstacles to widespread adoption.
Slony, in a word, is awful. It's probably as good as it can be given the limitations of working within a trigger-based world, but in the 3 years I spent maintaining Slony clusters, I lost count of the number of times that I had to take a site offline (sometimes for hours) in order to rebuild from scratch because of a failed schema change. Many times, I was able to recover by careful manipulation of the sl_events table, but far too frequently, the error was unrecoverable. Slony is slow, unreliable, not tolerant of high latency connections, and fragile. It was always one of my greatest frustrations that the PostgreSQL community saw it as a solution given how completely unreliable it is. Perhaps things have changed in the 2 years since I last touched it, but I am skeptical.
Please try to understand that my few objections to PostgreSQL come from actual real world experience with both MySQL and PostgreSQL, not from reading blog posts. Both products have great features as well as unpolished turds...it's just a matter of deciding which color of poop you are willing to clean out of the diaper. :)
Back when I was still doing web development, I had a big stack of things I had to deal with to actually get html on somebody's screen. Most of them were slightly annoying to deal with and I would often push them back to the one day a week I reserved to futz with annoying little details.
But working with Postgres was always a joy. Everything seemed to be designed in a logical, straightforward way and worked the first time without any hacks or workarounds. It always felt like there was so much untapped potential humming there behind the prompt.
Eventually my company switched to Oracle and DB hacking quickly became another one of those annoying things.
The complaint that MySQL is by default loosey-goosey with your data is valid, but it's an easy default to change. Here is what happens when you run some of the commands shown in that 'Why Not MySQL?' video on a sanely configured MySQL system with SQL_MODE set to TRADITIONAL. This mode also disallows things like dates with zeroes.
mysql> alter table test change column my_money my_money decimal(2,0);
Query OK, 2 rows affected (0.03 sec)
Records: 2 Duplicates: 0 Warnings: 0
mysql> insert into test values (4,'bar', 100);
ERROR 1264 (22003): Out of range value adjusted for column 'my_money' at row 1
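For reference, here is a minimal sketch of how that mode gets set (exact privilege requirements and available modes vary a bit by MySQL version):

```sql
-- Opting into strict behavior in MySQL (a sketch; details vary by version):
SET SESSION sql_mode = 'TRADITIONAL';  -- this connection only
SET GLOBAL  sql_mode = 'TRADITIONAL';  -- new connections (requires SUPER)
-- Or persistently, in my.cnf under [mysqld]:
--   sql_mode = TRADITIONAL
```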
Semantic changes in configuration files are problematic in MySQL for the same reasons they are problematic in PHP (the only other language that seems to do that):
* Your application then no longer works properly on a default install. Possibly in very subtle ways that you won't notice.
* Upstream will never make your settings the default. It would break too many applications, so you are on your own with those settings pretty much forever.
* Few users use non-default settings, so you are more likely to run into strange behaviors that are either bugs or can't be readily explained in a forum.
* Just imagine if you are trying to support multiple applications that each have their own settings they prefer. Especially if you want to integrate your data together at all, which is one of the main purposes of a database system.
1) can applications set the SQL_MODE themselves? Can an admin configure the server so applications cannot specify mode? If not, what good is it since it won't guarantee your data?
2) My larger frustration with MySQL is I have run into cases of single transactions deadlocking against themselves. These always happen when the following is true:
* Executing an insert statement in the form of INSERT foo (bar) VALUES (1), (2), (3), (4);
* Only one connection/session active at a time (for example during a data migration to MySQL)
* Frequency goes up when more rows are inserted per statement
* Which inserts trigger the deadlocks are not reproducible
I believe this is an issue with race conditions and threads, perhaps a lock contention that isn't being handled properly between index and table writes or the like. I can reproduce it by inserting a couple million rows into a table, a few thousand at a time, but the statements where this occurs varies from one run to the next.
I have never seen braindead locking behavior on PostgreSQL.
That certainly is odd behavior. However, reading through the whole email thread, it seems to be one complicated issue with two processes obtaining different locks on the same row. Probably not ideal, but maybe I need to recalibrate my definition of braindead, because in real-world examples that would be highly annoying.
Actually it is not fixed yet, the patch was pushed to 9.3 and Alvaro is still working on it.
It has happened to us too, though it was very easily fixed in our case due to the excellent deadlock reporting. And I can also confirm that the case for Joel (the original reporter) is also a real world case.
I don't think so. First, IDs were being assigned manually, so I don't think a table lock would be pulled out, and secondly, only one statement was executing at a time. Only one connection issuing sequential statements, and nobody else on the server.
I think that this is thread/mutex based because of the fact that I could try to run the same series over and over and get different sets failing. So it seems to me the issue might be something environmental affecting timing.
Again the really braindead thing here was that it was only one statement, only one concurrent transaction at any point in time.
So, what good is setting a sensible default as a DBA if the dev guys can override it?
I mean it's fine if you only allow one app to write to your db, but once you have more than one, the fact that the option to ignore invalid data even exists ought to make one very nervous if the data stored is worth anything.
Yes, it's a silly default, but I don't think it's the fatal flaw that you make it out to be. That's ok though, this is clearly an advocacy piece, so I don't expect nuance.
If none of that freaks you out, then I have one word for you: Oracle.
This to me smacks of FUD.
Oracle, evil and terrible as they are, would have a very hard time killing or otherwise harming MySQL due to the code being GPL'd and the vibrant MySQL community (Percona, Facebook, Twitter, Drizzle, MariaDB, SkySQL, etc) that fixes bugs, adds features, and directly competes with the work Oracle does.
I've been the victim of that flaw - I biffed a Rails migration and sent (what should have been) $200 transactions into the DB as $99. My local tests used SQLite, so I didn't catch it until we went live - true story. Very dumb on my part.
At the same time - silently changing data values is not stupid? We may differ on this point, but I believe it's insanity. I'd love to see NULLs honored, as well as constraints.
Finally, it's easy to characterize my comment as FUD but Oracle has a habit of using Open Source systems to its advantage. Why else would they buy MySQL? And the licensing is not as strong as you may think.
Either way - it's very argumentative and I'm happy to agree it's unsure at best.
I agree, it's insanity. I have no idea why it's still the default, there are probably a bunch of crappy apps out there that depend on this behavior. Anyway, sorry for hijacking your thread. I thought your video production and screencast techniques were top notch. As someone who dabbles in creating geeky youtube videos, I have to give you credit for a job well done.
One of the best features of Postgres is Tom Lane. We've been using Postgres since 2000 and following the (user-oriented) mailing lists closely and he's always been responsive, helpful, extremely competent and tactful - a rare mix indeed (unfortunately). He should be the primary role model for FOSS developers, where quirky characters dominate the field. Thanks, Tom!
My opinion is contrary to most of the comments here. About a year ago we switched a multi-tiered app from MySQL/MSSqlServer to Postgres/Postgres. While everything works fine, after months of using it I find writing programs for and with MySql or MSSqlServer seems easier.
One specific complaint I have is that there is no transaction control within pgsql stored procedures (functions). I had to kludge the db link module to get transactions to work.
Another related complaint is that the DB Link module has poor functionality when compared to MSSqlServer's DB Link. I brought this up yesterday at the Postgres booth at OSCON and they said they're aware of it. No news on when they'll get around to fixing it.
A feature we miss is MySql's ability to use different DB engines such as Archive or Memory. AFAIK there is no equivalent to these in Postgres. We put mysql's Archive engine to good use before we migrated our app. Now we have to export data to gz files via script.
In general I don't think there is a strong reason for the majority of apps to switch to Postgres when MySql works for most needs. (I should mention that pgsql has some nice locking functions that we found useful)
"One specific complaint I have is that there is no transaction control within pgsql stored procedures (functions). I had to kludge the db link module to get transactions to work."
Yes. Functions work within a transaction, which is usually a blessing. But there are cases where you want to do the transaction control in the procedure itself. This is a known request that is discussed as two different features: "stored procedures" (which is probably what you want) and "autonomous transactions".
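For the curious, here is a rough sketch of the dblink workaround mentioned above - the second connection commits independently of the calling transaction. The table and function names are made up for illustration, and this assumes the dblink extension is installed and a loopback connection is permitted:

```sql
-- Emulating an autonomous transaction via dblink (illustrative names).
-- The INSERT runs in a separate session, so it commits even if the
-- caller's transaction later rolls back.
CREATE EXTENSION IF NOT EXISTS dblink;

CREATE OR REPLACE FUNCTION log_autonomously(msg text) RETURNS void AS $$
BEGIN
  PERFORM dblink_exec('dbname=' || current_database(),
                      format('INSERT INTO audit_log(message) VALUES (%L)', msg));
END;
$$ LANGUAGE plpgsql;
```

The obvious downsides are the cost of opening a second connection per call and having to pass credentials for the loopback, which is part of why people keep asking for real stored procedures.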
As a long time PostgreSQL fan and user, it's nice to see it get well-deserved attention. It's kept our company of 3000 employees going strong for nearly a decade and has NEVER failed us or lost any of our data. (Knock on wood ;-) )
What tool is he using to do the postgres part of that video?
I've been looking for a decent postgres gui and can't find one.
EDIT: Navicat it looks like.
One comment: way back in the mists of time (the '90s), MySQL was known as the database that would eat your data. It had a number of data-gobbling bugs, as well as no transactions.
I'm genuinely surprised how far good marketing and ease of use will go to make a product successful. The postgres guys might learn a bit there. Ease of use will drive a lot of adoption.
I've been spending hours at a time looking for it, spamming here was a last resort. (Throwing "Postgres" in my searches was what was fouling them—turns out it was just spawned by Heroku's Postgres team...)
Interestingly, Ingres was the first database I ever used - running on Unix (v6 or v7?) on a PDP 11/34. Since then, I've ambled through Oracle, SQL Server, etc, but I'm currently using PostgreSQL on a project, and loving it. The native spatial types are really handy for what I'm implementing, and the speed and stability are great.
The recent release of PostGIS 2, the spatial extension of Postgres, has been a huge benefit to geo software. It has increased the ease and speed at which we can now work with large data sets and do spatial computation on them.
My favorite part of PostGIS 2, besides much of the cleaned up syntax, is the new Geography datatype. It has the spatial projection built in to it, so it automatically does great arc computations. Before we had to write more clunky SQL to achieve this. Was a significant barrier for my learning postgis, which has now been removed.
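To illustrate, this is the sort of query the geography type makes trivial (coordinates are rough, and this assumes PostGIS 2+):

```sql
-- Great-arc distance in meters, no manual projection juggling:
SELECT ST_Distance(
  'SRID=4326;POINT(-73.99 40.73)'::geography,  -- roughly New York
  'SRID=4326;POINT(-0.13 51.51)'::geography    -- roughly London
);
```

With the plain geometry type you would get a planar distance in degrees unless you first transformed to a suitable projection, which is exactly the clunky SQL the geography type spares you.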
I was reading this article with great interest, exactly up to the part where it claims MySQL is stupid. Maybe it's just me but I've had enough of articles that are written with the idea that the best way to promote your product is to bash your perceived competition and call it stupid, worthless, idiotic and broken. People, here's the news for you: people don't use your competition (or your beloved tool's competition) because they are stupid. They use it because of variety of reasons, but it almost never involves them being stupid, and calling their choice stupid is almost never a good way to get them to consider changing their preferences.
Even more disappointing, when the author proceeds from showing how stupid MySQL is (a random collection of choices which could plausibly be decided either way) to showing how much better Postgres is, he doesn't show anything useful. He just shows a random collection of features that may be useful in some use case he probably encountered in his practice, but no compelling picture of a superior product. Nobody changes DB server because one has a "tomorrow" keyword and another does not.
And when performance comparison is done - no numbers at all are quoted except for initial data size. Just bare claims "it performed well" - how well? Are these results comparable to what I can expect on my data with my setup? How much manual work (mentioned in passing as 2 experienced DBAs were working on it - but how complex was what they did?) would it require to make it perform? How much better would it perform if I hired a very good DBA? These are very important questions to consider when you choose the DB for your project, but completely omitted in the article.
> qt-psql requirement wasn't very easy to satisfy on Arch
Whoops! My primary dev machine is Arch. Let me fire up a clean VM and try and reproduce the problem. Are you rolled to the latest?
> - You don't handle bytea columns very nicely. I'd expect those to be displayed in hex or so.
Unfortunately, the database types get mapped to Qt types in a pretty undesirable way (for this use case at least). For example, I do custom handling of timestamp types within reason. Let me take a look at bytea as well.
> - Browsing a table is very slow on tables with large columns. I was testing on a table with a ~10KiB text column.
I strongly suspect Qt's Delegate class. This class handles the painting of cell data and can be customized to a high degree. I think even the default Delegate is pretty heavy, leading to extreme stress when painting cells that carry a lot of data. Unfortunately, I don't see this as something that can be fixed quickly unless I start from Qt's very low-level classes to display table data.
> - I personally prefer a more compact interface, having an option for smaller buttons would be a good start. :-)
Actually, I was thinking about having the scroll wheel adjust the icon size on toolbars. Let me look into that as well.
Thanks a lot for your input; I really appreciate it.
Oops, just actually tried to compile from git and it just worked. I was under the impression that the qt-psql library wasn't included by default, so I went looking for a package and only found something outdated on AUR. At which point I gave up, expecting that to be too old to still work. Sorry for the FUD, I feel stupid now. Keep up the good work!
> - Browsing a table is very slow on tables with large columns.
Last time I did this I used a fixed-width column when displaying text fields. Everything built-in was able to handle things smoothly when clipping to just the first __ characters. (Of course you can't see it all then, but I've found it useful enough... and can paste a cell elsewhere to see all of it.)
That looks great actually. I've recently switched to PG and use pgAdmin III, but find it too confusing with tables being hidden so deep in the tree (or maybe that's "pg way" of doing things with tables not being centric?)
Thanks a lot for your kind words. If there are any specific features you would like, feel free to email me or raise an issue on the project's GitHub page.
By the way, pgXplorer has a quick find feature that can highlight the searched table/view/function. Hopefully, you find that useful. It is a really useful feature especially when dealing with CJK table names.
What are the scaling differences between MySQL and PostgreSQL? That's the main reason we haven't shifted and we have a new project coming up that I've been interested to use PostgreSQL with as one our developers prefers it, but are we opening a whole new can of worms on that front?
For vertical scaling, PostgreSQL is generally seen to be much faster. That varies by workload, of course, and I'm sure you could tailor a benchmark that shows MySQL to be dramatically faster than PostgreSQL. But in every benchmark or sample set I've come across, PostgreSQL tends to scale linearly with the number of processor cores and handles ridiculously unlikely concurrency demands with ease. MySQL has a reputation, deserved or not, for being fast for single-connection, read-only workloads but not so quick when writes or multiple users are added in. Under load, PostgreSQL seems to be faster than the quick-but-dangerous MyISAM MySQL table type, and almost always faster than the slow-but-safer InnoDB table type.
I can't speak to horizontal scaling, as my little database with maybe 50 million rows never comes close to loading out its aging 4-core server, even when being hammered by lots of web visitors running complicated reports and internal clients performing all sorts of CRUD operations.
I have one customer who uses MySQL because they wrote their app around the query cache. The data is aggregated and then fed into PostgreSQL for real processing, and eventually reports generated and shipped back to MySQL....
I've never used PostgreSQL at my jobs outside of initial "could we switch" testing. I know of a couple of benefits that MySQL has/had over PostgreSQL.
1. The commands are very user friendly. In MySQL you can issue commands like "show tables" and "show databases". The last time I used PostgreSQL, the commands were much more esoteric. Things like "\dt". It adds a good hill to the learning curve.
2. MySQL is everywhere. Basically anywhere you go on the 'net you can find people who know MySQL, quirks and all. It's very easy to get help, and it's the DB that most tutorials (for other things, like Rails, PHP, etc) use. MySQL is also the standard database you can find on any web host.
3. Replication was a big deal. MySQL has had replication built in for many years, and it's quite easy to use. At my last job we would have been quite happy to go to PostgreSQL for the performance, but we needed replication. At the time, there were solutions for PostgreSQL in outside projects such as Slony. Since we didn't have any PostgreSQL experience we didn't want to expend all the resources on testing and configuring everything. In the last two years or so PostgreSQL has gained an official in-tree replication solution. This makes understanding how to get replication up and running much easier.
4. MyISAM is small and fast. When machines were much slower, that could be a real benefit. Of course you sacrificed all sorts of consistency to get that.
At this point, I'd say PostgreSQL's biggest problem is mindshare. MySQL has been improving the whole time, and some of the worst warts are gone. It may not be the best tool for the job, but MySQL is just so common. It's trivial to find help, advice, employees familiar with it, programs that support it, etc.
I'd really like to use PostgreSQL in production so I could get more experience with it. It has some amazing capabilities. But our needs are relatively simple and we have lots of MySQL experience so there is no reason for us to look at switching right now.
As for "show tables", there's a big difference here.
"show tables;" is a server-side command. \dt is a command you give psql to say "ask the server what tables there are and list them." \d stands for describe. If you use a graphical client, it will have to query the system catalogs itself which you can do if you want.
As for replication, the real challenge is that replication is not a one-size-fits-all thing. Slony, Bucardo, and the built-in streaming replication have different limitations. For example you can't replicate from 9.0 to 9.1 with streaming replication, but you can with Slony and I would expect you could with Bucardo too. So with the out-of-tree replication systems you can actually have a zero-downtime upgrade as long as you block writes during the upgrade of the master. With streaming replication, assuming pg_upgrade is supported for your upgrade, you only have a short downtime window, but it is downtime.
Streaming replication is also all or nothing. With Slony you can replicate just the pieces of your database you choose; on the other hand, that also means you can (accidentally) replicate only a piece of your database.
So the differences give you a lot of flexibility there, but you need to have a clear idea of what you need and why you need it before choosing a replication solution. Slony is perfect for what Afilias uses it for, while streaming replication wouldn't meet their needs.
Also I don't think that PostgreSQL suffers from a lack of mindshare. Ever since I have been building apps on open source db's, PostgreSQL has been regarded as the go-to database for complex business apps. The sorts of things people do with the database are different. There is consequently a huge "dark community" if you will (in the sense that you don't see them).
This was driven home to me when I went to the Malaysian Government Open Source Software convention last year. There were two booths (Oracle and one other) which had the MySQL logo, and a few more that were offering MySQL services, while I counted at least 5 that were using the PostgreSQL logo. Most of those were reselling EnterpriseDB's PostgreSQLPlus, but when I looked at the level of interest and the number of government deployments I was asked about, it was clear which one people were usually choosing for complex work.
As for user-friendliness, MySQL used to be much more user-friendly than PostgreSQL but I don't think that's true anymore. The command-line tools are more full-featured and the in-app help is better on PostgreSQL. \? brings up a list of psql client-side commands and what they do for example, and \h [command] will give you a summary of the syntax of an SQL command. I use \h quite a bit when I am trying to remember the specific syntax of a new feature, or when I am doing something I rarely do (like ALTER TABLE). I think that since about PostgreSQL 7.3, PostgreSQL has been at least as easy to use as MySQL, and quite frankly far more robust.
Finally, the PostgreSQL planner is awesome. I have not had to rewrite a query to get around planner limitations since 8.1.
I don't know how you would measure popularity. I think it is fair to say that MySQL has more visible mindshare.
I guess if "popular" is a synonym for "demotic" (in the sense of progressing away from professional dba's etc) then MySQL is also more popular in that sense too. But in terms of actual mindshare, it is rediculously hard to compare them.
"What are the scaling differences between MySQL and
The best advice I have is that using a different database system, particularly ones as different as MySQL and PostgreSQL (which are essentially opposites) is like going to a foreign country. You'll find things you like and things that you don't like, and it will have a real impact on your quality of life, but it's hard to say in a few words the things that will really matter to you. You don't have time to live in every country in the world, but you owe it to yourself to at least visit a few that come highly recommended.
Whatever you do, don't try to make the new country into your old country. You will just be disappointed.
Obviously it depends on what you are doing. However, if you are trying to keep everything in one consistent environment it does win out.
Also I expect that with 9.2 we will probably see a release of Postgres-XC which will kick the pants off any MySQL sharding solution out there.
Postgres-XC is a way to abstract database shards such that referential integrity etc. is enforced between shards. The best comparison to date would be to Teradata's clustering system. Better yet, Postgres-XC is just a set of patches on PostgreSQL, and is on-the-wire compatible, so you can swap it out when your database gets too big. This will also bring an ability to concurrently query different shards and aggregate the results without the application having to know anything about the sharding....
Definitely. When your application needs hundreds of read slaves in order to scale the load, PostgreSQL has always been everyone's first choice due to its mature and historically awesome replication system. That's why companies who need to scale big (YouTube, Facebook, Yahoo, LinkedIn, Wikipedia, Twitter etc) all have hundreds (if not thousands) of PostgreSQL machines in their infrastructure.
That is a comparison from 2006 using MySQL 5.0.20 (ancient!) and as such is completely irrelevant to any discussion on performance in 2012. Dramatic scalability improvements went into late 5.1.x releases, and then even more in 5.5.
The fact that everybody uses a particular technology does not in fact answer the question of which scales better or which cannot scale. I don't know about others here, but I would welcome actual objective information on this topic rather than sarcasm and an appeal to popularity.
"The fact that everybody uses a particular technology does not in fact answer the question of which scales better or which cannot scale."
The fact that many large companies and organizations have successfully scaled a particular technology is a useful yardstick, and while it does little to answer the question of which scales better, it does, by the process of elimination, help answer the question of which cannot scale.
"I don't know about others here, but I would welcome actual objective information on this topic rather than sarcasm and an appeal to popularity."
I had hoped that others here would enjoy the sarcasm in the context of the full-throated yet fact-free advocacy of the parent, the text of which somehow still remains legible on an off-white background.
The technology stacks for these companies, by and large, were selected a decade ago or more. I'm no more surprised they chose MySQL as their database as I am to see that they chose PHP or Java as their core language (in some cases).
I think the reason that MySQL became more popular than PostgreSQL is that it did one thing really well: it was a very fast file-access daemon. PostgreSQL tried to do too much, ended up not doing any of it very well, especially in the area of performance, and didn't become as popular. Maybe useful to remember this going forward.
Postgres was developed over a long period of time with careful attention to architecture, robustness, and extensibility. It also focused on both external and internal documentation and cleanliness. And, it followed the traditional database system model with a cost-based optimizer and everything else.
As a result, we see a very robust developer community. It's a real machine -- churning out high quality releases every year with a great mix of newsworthy features. Enterprises and startups alike are flocking to postgres for a huge variety of use cases.
Slow and steady wins the race (and it doesn't hurt to be 5 tons and have tusks) and isn't tired afterward. And in databases, it's good to play for the long term because that's how long the data will last.
I wouldn't trade that for some first-to-market popularity.
First of all, the successes of postgresql and mysql are not mutually exclusive. Postgres operates in a lot of markets that MySQL doesn't (as far as I know), like telecom and finance. Postgres gets entirely new users with a variety of use cases, as well as people from Oracle, SQL Server, MySQL, and even some converts who tried NoSQL and found it lacking in some way.
And undeniably, there is a shift happening, even in parts of the market that were a MySQL stronghold. Heroku and their customers use postgres for multi-tenancy. Instagram uses postgres for photo-sharing.
I think MySQL has been very complacent because they are the default for a lot of simple web apps, and that keeps their numbers high. They aren't really breaking into new markets -- postgres is winning the geospatial market big time, and is always coming out with new features to break into new markets (personally, I am trying to advance postgres into the temporal database space).
Something that tends to get overlooked in pg/mysql discussions is administration - more specifically, account management. All of this is done in the 'mysql' database via SQL directly in mysql ('insert into user', 'grant permission', etc). Most tutorials on PG I've seen over the years show these things being done from the command line, and assume a unix environment and an understanding of connecting from various account shells and such.
I've been made aware that there are other ways of dealing with user/perm management in pg, but from the perspective of people running shared hosting, the default tutorials feel extremely burdensome compared to managing everything in one database via SQL (no need for multiple user accounts on the system - just multiple user accounts on the database).
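For what it's worth, Postgres role and permission management is also plain SQL; the command-line tools (createuser and friends) are just thin wrappers around these statements. A sketch, with made-up role and database names:

```sql
-- Account management without leaving SQL (illustrative names):
CREATE ROLE tenant_user LOGIN PASSWORD 'secret';
GRANT CONNECT ON DATABASE myapp TO tenant_user;
GRANT SELECT, INSERT, UPDATE ON ALL TABLES IN SCHEMA public TO tenant_user;
```

So no OS-level accounts are required; the unix-centric tutorials are a documentation habit rather than a technical necessity.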
One thing I've always loved about SQL Server is the Adventure Works OLTP and data warehouse DB samples, they are absolutely priceless when it comes to experimenting and learning. Is there something like that for Postgres? Is Postgres even intended to be used as a data warehouse and, if so, are there any decent business intelligence tools that can work with a Postgres warehouse? I'm not looking for an open source Analysis Services/Business Intelligence Studio equivalent, just something that I can build some cubes with and run MDX queries against.
I had to use Postgres at a job for about 6 months and hated it due to the lack of a decent GUI. The "official" one has many bugs and quirks and takes quite a bit of getting used to, especially after coming from phpMyAdmin. So it's nice to see there is now an .app for it.
I believe one of the main reasons for mysql's popularity is that it is available by default with many hosting providers who provide PHP-based hosting for a very cheap price. These are low-to-medium volume sites developed by small teams. They just don't need or care about the benefits of using a better database like 'postgres'.
IMHO, postgres developers and support organizations need to work with hosting companies to make postgres available along with mysql.