[Edit: In migrations, specifically. Also! No mention of the MySQL dual-license JDBC driver malarky... Not that it's a success of Postgres' as much as it's a failure of MySQL's... ]
I'd never heard of Window Functions (not an Oracle user, and been on MySQL for a few years... left Postgres for replication and haven't made it back quite yet).. but I desperately want them. The idea of effectively aggregating over groups in a group by (which is how I understand these Window doohickeys) is something I come up on a few times a year and get really frustrated by.
I think the only killer feature I'm left wanting from some RDBMS at this point is managed-sequences... The idea that I have a related list of things that I want to keep a sequence column for, and would like the database to understand the concept and help me maintain it.
Go ahead and give it a shot. Once you get used to it, you cannot live without it.
Another killer feature in my opinion is the recursive query: http://www.postgresql.org/docs/current/static/queries-with.h...
I find it a pretty typical and profound source of frustration, and since the whole idea of RDBMS is to model relationships.. and relationships so often have integral sequence/rank.. it's always seemed to me like something that ought to've be considered 'in scope' during the design of SQL.
Another option would be to keep the chapter order in an array column on the Book table, since arguably the arrangement of chapters belongs to the book rather than the chapter.
PostgreSQL array documentation is here: http://www.postgresql.org/docs/9.1/static/arrays.html
That's a really excellent point, and array columns look like a really cool solution! I suppose the only bummer is that it introduces a way to get things out of sync (I could delete a Chapter row, but forget to remove it from the array in Book), but in general this is a much better approach than the status-quo solution.
Essentially if your API is single-table DML, then triggers are preferable. If multiple tables need to be manipulated at the same time to complete the full operation, then there should be a procedural API and triggers are likely unnecessary.
It boils down to two problems. You can't be sure you're the only transaction in progress, and you also can't "take back" a sequence number. If you could guarantee you were the only transaction, you could just use COUNT. But since you can't, you could have two transactions in progress that both get the same COUNT result and you wind up with two "Chapter 5"s. You can put a UNIQUE constraint on that table, and then one of your transactions fails that constraint. That way you retain integrity and you get (at least for a while) sequential numbering, but there is no in-database way to ensure that a sequence of numbers has no gaps. This is because to do so would be expensive but also not really relate to data integrity.
The purpose of PostgreSQL sequences is to reliably generate surrogate keys. That means every transaction in progress gets a separate ID from that sequence. They all have to be unique because they all might commit. If they rollback instead, it doesn't matter; those become gaps, but nobody is failing the UNIQUE constraint.
What are you going to do to seal the gap when you have two transactions in progress on the same sequence and one of them rolls back? Are you going to pause the other transaction and propagate some kind of decrement to the other transactions? What if I inserted that ID into another table or called a globally-visible stored procedure (sent an email, for example)? I'm not certain it would even be possible, but even if it were, it would clearly be extremely expensive, and therefore, probably not a feature you want on by default.
So I hope that explains why this is not "in scope." :)
I would like some RDBMS to give sharable transactions. What I mean by that is something like this.
1. When I begin a transaction, I can assign it a name.
2. If multiple connections to the database are in transactions with the same name, they can see the changes made by the others, as if none of them were in a transaction. Connections not party to the shared transaction do not see the changes until all of the connections in the shared transaction commit.
3. The connections in the shared transaction can do unnamed transactions, which isolate them from the others within the shared transaction.
The idea here is that sometimes I have some operation that I would like to do atomically, but that involves multiple processes acting on the database. Process A wants to make some changes, then pass responsibility on to process B (which may be on a separate machine) to make more changes, and then process C finishes up, and if any of them encounter an error I want to rollback to the state before A started.
I generally wouldn't defer that responsibility to the database though. I'd more often handle this as multiple database transactions that advance a state machine, at the end of which consequence-inducing-events are fired off.
The idea that process A tickles the database into some state expected to be understood (and only seen by) process B seems like too flimsy a contract to me.
Basically you encapsulate whatever you need to do with the database in objects (logic + parameters). And you describe for these objects not only the logic for updating the database (forward), but also for reverting the database to its previous state (rollback).
Or you could delegate all this to the database, which already has its own transaction protocol.
Some explanation of how it works: http://stackoverflow.com/questions/494550/how-does-transacti...
Usage to 'forget' automated test database changes: http://amnesia.codeplex.com
Excuse me if I didn't completely understand what you wanted but it seemed similar to SQL Server's sequence: http://msdn.microsoft.com/en-us/library/ff878091.aspx
- scalability. Logic and processing in the server is bad as it means you can only scale up and not scale out. Scale up is damn expensive. When you need that 64 core 96Gb machine to run it all on with 6 replicas will see what I mean.
- complexity. The DBMS is a black box which sucks up tight coupling almost instantly. Coupling is bad for maintenance and scalability. SQL always ends up with heavy coupling.
- lock in. Those features only work with postgres so you're stuck with it forever or induce a lot of cost to move away when a better solution comes along.
- schema. Maintaining schema and dependencies is hell on all but a small system.
These facts come from 20 years of using databases.
Stuff that will save you: Use your DBMS as a simple flat store akin to a document store. Use a CQRS based architecture and encapsulate ALL of your logic in code. If you have to go at least partially relational for god's sake use an ORM that supports several databases well. Make a provider independent app level backup/restore facility such as the one with JIRA or TeamCity. NEVER hire a DBA - they exist only to create walled gardens to protect their existence.
My current gig has cocked up on all areas and it's costing them over 40% of their turnover keeping it alive.
Happy databasing :)
Similarly, using an ORM doesn't help you. All that does is hide the details from you. It still uses SQL underneath, but limits what you can do with it.
Let's talk schemas. You have one whether your database understands it or not. Trying to pretend you don't have one doesn't make managing your schema any easier. It doesn't make building indicies on fields any easier.
These are hard problem and pretending they don't exist by rolling your own implementation or hiding them under an ORM won't make them go away.
It's also not about it being a hard problem (which it incidentally is but that's another story). It's about controlling the scale of the problem as time goes on and sufficiently abstracting it without adding complexity. That's not hiding it - that's making it a non issue. Permanently.
I am only advocating using a minimal subset of the functionality to aid migration in the future between engines, platforms, versions and storage paradigms.
An ORM (in our case NHibernate) is a clean abstraction over the top which allows us to be completely mobile between vendors and versions. There is no NIH in it.
The world changes pretty quick. Those who adapt quickly stay around.
An ORM does not replace proper database design. And letting the ORM do all your work for you w/r/t to mobility means accepting that you will not have good access to anything but the least common denominator among databases. Which is approximately the SQL-92 standard.
Another choice, which doesn't involve cutting off one's nose to spite one's face, is to couple the ORM to a public interface which is defined in terms of a specific set of views and stored procedures. Then you can very easily write schema which conforms to this interface whenever you need to switch to a new DBMS. The new schema is then free to take advantage of whatever features that DBMS provides in order to achieve good performance.
And the potential performance gains can be huge. There are plenty of examples out there of replacing 1-hour basic SQL + in-app processing procedures with 100ms common table expressions and the like. If you've got someone who knows what they're doing in charge of it, the latter solution will often take quite a bit less time to write and debug, too.
Ah, therein lies the rub. Spot on.
The last decade has seen a gradual weening off of reliance on the database stack for processing any form of "logic". This has been a evolutionary disaster. When companies like Sun pushed out application platforms such as Java, Microsoft followed suite to remain in the game. That's where the action was and that's where their revenues came from for the good part of 1998-2008 (still does). As a result, the newcomer programmers find it unfathomable to position any amount of processing in the hands of RDBMSs. These newlings choke at the idea of stored procedures or triggers - citing that these constructs are a step in the WRONG direction. Set-based SQL operations remain an anathema for a large majority of the new programming stock.
This is a pity because the side-effect of this attention diversion has been poorer data modeling skills and gross underutilisation of new and often improved RDBMS concepts & tools.
In the end, these new-school graduates of denormalisers and NoSQLers will have to contend with the one and only critical challenge that RDBMSs set out to fix: data correctness.
Performance & scalability mean shyte if you can't guarantee correctness. Bring to the table ACID transactions and MVCC with your new concepts of scalability and performance - then we talk ... kinda thing.
Believe me, we still need ACID/MVCC and by no means do I want to break the consistency part of CAP theorem if possible.
Also set operations break down under the following conditions:
1. when you have to aggregate a LOT of data together they don't scale. If it scaled well we wouldn't need OLAP solutions and could do everything with live, perfectly normalised data. That is just not the way it is.
2. They really don't work for temporal or complex non-uniformly structured data at all well.
You say you still need ACID, but you advocate manually writing joins on top of an ORM layer which does not give you ACID and introduces massive latency. Using an ORM on top of an a key value store doesn't absolve you of ACID or having joins. You still do joins and transactions, but you need to write them manually and test correctness manually. Both are very expensive to do correctly and running some unit tests is not sufficient to show correctness at this level. You are writing these things without recognizing what they are.
The whole NoSQL movement is fundamentally about trading correctness for performance, and usually read performance at that. There is nothing else new about NoSQL databases and ironically many even use a variant of SQL for queries! SQL is just a language, not an implementation.
As for aggregates, yes they are hard. By doing it yourself you are not making the problem less hard. Rather, you are possibly implementing solutions that RDBMS's can't yet do. I suspect that with a reasonable relational design you could probably do whatever aggregates you need to in a high performance manner while still taking advantage of the work done in an RDBMS.
It sounds like you don't understand the requirements of your application. If you need full ACID, your solution with nosql + ORM + manually written joins and queries is failing you. Unless, of course, if you are also writing your own global lock or using some form of lock-free algorithm.
NHibernate, SQLAlchemy, Hibernate are where you need to look and compare.
The ORM does a few joins where a performance gain is possible but we avoid them though as they are expensive on IO and transfer.
Most of our tables match the UI 1:1 and we populate these from the domain model when it is mutated or when we can afford it to be done (either directly or via message queues). Effectively they are materialized views which are updated when data changes or before it is required. Google CQRS.
I'm not suggesting NoSQL - just caution when using RDBMS features. I'm not an advocate of NoSQL.
In all those cases, the devil remains in the details. For example if you're just looking to add one or two columns to the result set, an INNER or OUTER JOIN usually reduces the amount of data that has to be shoved through the pipe. That's because you avoid transmitting the keys that are needed to join the data in-app. On the other hand, if you're joining a small number of larger values to a table with lots of rows, perhaps the join duplicates those values enough to really bulk up the result set. But it's effectively impossible to know whether that will be the case without measuring the performance for a couple different options.
Here's an alternative consideration: ORMs tend to encourage a style of coding wherein every column in the table is always grabbed all the time, regardless of how much of that information is actually needed for the task at hand. This is done in order to be able to generate full-fledged objects like the ORM wants to, and to reduce the amount of custom mapping that needs to be maintained. Now if you're looking for a needless waste of bandwidth, there's a needless waste of bandwidth.
If you're consistently having performance problems on joins, it's likely that the real culprit is poor configuration of keys and indexes. Or if you're not watching how the ORM chooses to join the data too closely, perhaps you've got some N+1 queries sneaking in there.
I take it you are aware of Brewer's CAP theorem? (http://www.julianbrowne.com/article/viewer/brewers-cap-theor...) If not, you should be.
Whoah. This is a type of sociopathic reasoning applied to data modeling. The premise is based on accepting compromise knowing fully the consequences of data corruption and yet failing to address the actual rider: data shall not be left in an inconsistent state. An argument turned on its head.
Much like asking the question “is it ethical to be deliberately inaccurate on your IRS returns” is answered with “this is not a question about ethics”.
Is it not a gigantic assumption to claim:
1. The compromise has “worked” for others (“most successful websites on the planet” [sic]) without providing any real evidence supporting this claim? What does “worked” mean? Is Twitter “ok” with providing suggestions for Followers which contain people already in the user’s follow list? What were the consequences of data inconsistencies in matters related to financial transactions? Was every requirement able to simulate a catch-all “place a back-order” solution? These are real questions that have to be answered before evangelising the compromise.
2. If it works for others (meaning: if you are willing to accept compromises about data consistency) this compromise ought to be brought to your garden-variety corporate environment? Gasp! As it is, volumes or scalability issues aside, corporate environments suffer from data inconsistencies, poor or no transactional control, ineffective referential integrity etc. There’s very little positive impact of encouraging this habit as an overt practise under the false generalised promise of “you will eventually need to scale to millions”.
They can keep their CAP theorem or BASE principles for the likes of social toys like Twitter and Facebook. When credit card companies and banks start accepting compromises on their transactions, I’ll start paying due attention.
Most of these large -- and particularly, global -- systems you speak of are eventually consistent also; they use the relational database as a potent backend member of a distributed system, able to check and deal with a great many constraint problems that can be localized.
If you look at NHibernate for example, if you do paging on SQL 2005+ it will use CTE yet on MySQL it will use LIMIT clauses underneath.
Some enterprise products have a really crappy schema with an ORM thrown over the top - that's where the pain is.
There's also no reason why you can't use multiple smaller databases. Use dblink. http://www.postgresql.org/docs/9.1/static/contrib-dblink-con...
I've never had a problem managing "schema and dependencies" on larger systems. Not sure what you are referring to there.
I'd rather have a slightly higher latency and be able to scale to 100,000 users.
Materialized views are not needed with CQRS architecture. ALL tables (or sets) are materialized views already. No joins are ever made.
dblink is ugly and if you ask me a problem, not a solution (latency, network io, memory).
I'm not sure what you classify as "larger" but I'm talking tens of Tb, 1500 tables and over 2 million LOC in the application.
Billions of records, 10,000 transactions per second. This was a couple of years ago.
Dunno if Skype is complicated enough for you.
serving hundreds of thousands of users
That's not much...
- Compelxity, lock in: The DBMS is like any other software component: If you're actually using it to its capabilities, then it's trivially obvious that it should be difficult to replace, because there are a lot of capabilities you have to find substitutes for. True, you could avoid being coupled to your dependencies by choosing to reinvent wheels whenever possible instead. That would obviously reduce coupling. But bloating your codebase is generally not the best way to make your code more maintainable.
- schema: Pretty much any programming task is hell for people who aren't skilled at it. If you find that you're not very good at using the standard tools of a carpenter, that does not mean that the tools are fundamentally misdesigned. Much more likely, it means that you should be hiring someone else to do your carpentry for you. It's worth pointing out here that programmers shouldn't need to touch the schema in a well-designed system: They should be touching views and stored procedures that abstract away the implementation details.
I view it as an optimization problem. Based upon what you're building, the aforementioned issues are more or less likely to occur. Optimize for what is likely.
If I'm building an official site for a popular product, I'd bet on all the above being true.
Some of what I build, even if 100% of people who could use the product would use it, will run just fine with a single database server that costs in the area of $8k. Planning for scale out is a complete waste of time here.
Most of what I build starts out simple. Most of what I build starts out with 0 users. Will it be popular, will it last, will it need to scale? We have no clue. For these, we keep it simple: modularize our code based upon features, use our datastores to their fullest within features, and keep in mind how we might scale things out (which may ultimately be incorrect) - if we have to. Usually we don't have to. I believe it to be a fair balance in this scenario.
Also, given your constraints, I don't quite understand how you could find a future datastore that is a better solution. How can you exploit better solutions and avoid lockin at the same time? If a better solution comes along in the future, it seems like you would be using features that would lock your datastore to that newer, better solution. However, you cannot do this because you need to avoid lockin.
For a fresh start (greenfield) it's probably fine.
Unfortunately for 90%+ of IT products, it's not a greenfield. That's where the problem comes.
Damn expensive, eh?
As it happens I bought a couple boxes that size just recently. Except with 256G RAM instead of 96G. They cost $8500 per shot. That's not even 2 months of salary for a single developer.
Don't underestimate Moore's Law.
How much does the storage subsystem cost?
You know the whole stack of fully redundant iSCSI/FC kit behind it?
15x 146G SAS spindles are included with the price.
Why, yes? We do run a FC fabric and various DAS boxes for various purposes, but not on the DB servers.
I'm not sure what you're arguing about here, the figures (labor vs hardware cost) are not even in the same ballpark.
As a rule of thumb: If a piece of gear can save your team 1 month of work then it usually pays for itself.
In the US I've heard good things about ThinkMate. E.g. this is close to our configuration but with a different chassis; http://www.thinkmate.com/System/RAX_QS5-4410
In general throw money at the problem first, resort to code last
What happens when someone has to do a SQL update manually? Referential integrity can save you from shooting yourself in the foot there. SQL injection? Much less likely if you're using stored procedures with parameters as your client interface.
You have lock-in all over the place regardless. Languages. Frameworks. ORMs. Operating systems. All are decisions that are painful to change when something "better" comes along. You're better off making those decisions and then leveraging them for all you can get. RDBMSs have been around longer than many languages and most modern frameworks, and are still a very good, probably the best, general purpose solution for data storage and management.
You don't do manual SQL updates. Why would you if you have an ORM? You update the domain and it does all of that stuff for you. I only care about the domain which is the stuff that makes the money - not writing SQL.
ORMs don't suffer from SQL injection as they use properly parameterised prepared statements.
Regarding lock in you are correct, but the RDBMS is the most politicised and expensive component which is why yo uneed the flexibility.
I am in no way saying "RDBMSs are shit - NoSQL forever!".
The DBA you refuse to hire would likely disagree.
Databases are for putting stuff in when you don't immediately need it. The DBA mindset pushes that the DB is the hammer for all nails.
You clearly are not in willing to reflect on the actual problems being presented to you. Instead you are clinging desperately to some sort of ideology that either you yourself came up with or someone sold you on.
Look at the basic problems. You are not solving them or making them go away by "abstracting them away." The reason is that these particular problems can't be abstracted away.
For example, you raise joins into the code that your programmers write instead of the RDMBS level. But you are still writing joins, they are just for loops on top of your ORM solution. Now you incur the cost of writing them manually each time and testing them, also dealing with ACID which almost surely you are not doing correctly. Or if you are, you are spending huge amounts of money to make it right. Even worse, your solution can't possible be as fast as the RDBMS solution since you need to go through all sorts of heavy abstraction.
You are not seeing the actual problems at all and not surprisingly demonizing the type of people who do understand them: DBAs (assuming we are talking about expensive professional DBAs).
The ideology came out of necessity and 20 years of watching people fall down the same holes.
Joins - loops on top of ORM? Doing ACID myself? Do you understand what a non-toy ORM does? Do you understand the separation between the data model and the domain model? Sounds like you don't.
Note the section on joins. Their example is simple, a mapping between users and addresses. The ORM here generates a join in SQL. You are not avoiding it. The only alternative to generating SQL with a join is to write the join yourself. Such a thing would query users and then either lookup the address via an index or do a nested scan over addresses if you don't have an index. There is no way around it. Conceptually, a join is a description of a result set that you want. If you generate it, you are doing a join in some way.
If you are doing full ACID with complex queries, then you still need to understand transactions and all the associated complexities. An ORM doesn't make any of that go away by using terminologies like sessions. Rather it limits you to some subset of SQL and schema design.
If you want to limit your use of SQL for portability reasons, you can still do that without an ORM. At least that way you are more aware of what is actually going on under the hood and can more easily identify problems.
Redis solves a certain set of problems for me, Memcached as well. ElasticSearch solves another set, and RDBMS yet another. Square pegs for square holes, round pegs for round holes.
I can understand wanting to avoid Oracle lockin.... But Postgres?
Well we could shift a deployment to Amazon RDS (Oracle or MySQL) quite happily in a couple of hours without having to rewrite everything because we used all the postgresql features...
On the other hand, I could name quite a few people who went the other way, like Urban Dictionary who run on our http://postgres.heroku.com service, too. (I don't normally name customers, but he recorded a video saying he liked us so I think it's fair game. http://success.heroku.com/urbandictionary )
Also contrary to the norm here, some applications last a long time. The one I work with daily is actually 20 years old this year. In that time a lot has changed and in the next 20 years a lot will as well. It makes sense for long-lived applications to plan well ahead and consider paradigm and architecture changes.
This platform started in C++ on OS/2 with Oracle running on VAX/VMS at the back end. It went to Java/J2EE with an Oracle back end in the early 00's. It's now SQL Server and .Net.
Compare to compilers being better at writing machine code...
When you have the tools you have the power.
Probably about the same as you on just SQL Server and half as much again on Oracle...
> One advantage of GiST is that it allows the development of custom data types with the appropriate access methods, by an expert in the domain of the data type, rather than a database expert.
> Traditionally, implementing a new index access method meant a lot of difficult work. It was necessary to understand the inner workings of the database, such as the lock manager and Write-Ahead Log. The GiST interface has a high level of abstraction, requiring the access method implementer only to implement the semantics of the data type being accessed. The GiST layer itself takes care of concurrency, logging and searching the tree structure.
> So if you index, say, an image collection with a PostgreSQL B-tree, you can only issue queries such as "is imagex equal to imagey", "is imagex less than imagey" and "is imagex greater than imagey". Depending on how you define "equals", "less than" and "greater than" in this context, this could be useful. However, by using a GiST based index, you could create ways to ask domain-specific questions, perhaps "find all images of horses" or "find all over-exposed images".
If you are already using PostgreSQL though have not known about this fact, I highly recommend you to learn about GiST. It is the most powerful feature of PostgreSQL as I know.
Let me pile on a big one: Stored procedures in multiple languages:
* PHP (though I could not get it to work on 9.1)
* I'm not fan of doing backups with pg_dump. I think it would be a lot more awesome if I could tell the database to back itself up in-band. Even better, I would like to be able to ask the database to schedule its own backups, instead of having to set up my own cron jobs (or equivalents). Commercial databases can do that.
* The sad state of libpq (the C API). Libpq is where PostgreSQL really shows its age. It's very cumbersome to use and can't do (or can't do well) a bunch of things the raw protocol is capable of. Since almost all PostgreSQL bindings are built on libpq, none of the bindings support these features either.
* It would be cool if it was possible to kill sessions in-band (other database servers can do that). With PostgreSQL, the only way to shut down misbehaving sessions is sshing in to the database server and doing a `kill` on the process of the session.
* The infamous COUNT(*) fiasco.
libpq knows how to do all this.
The reason for this is that to avoid inspecting the connection for new instructions during query processing, acting under the assumption that cancellations will be rare.
* libpq does basically suck, but go has a pretty great native implementation. suggestions are welcome for improvement, patches are even better.
* pg_cancel_backend() plus select-fu on the `pg_locks` table joined with `pg_stat_activity` = joy
* fast count(*) is possible via index-only scans in 9.2
I'd like to expand on this. It will be faster, but still involves an O(n) over the qualifying data, but it'll be a lot less data, particularly for tables with wider rows. To some extent this problem is fundamentally unable to reach a small O(1) without introducing more lock contention and expense that many people don't want to pay continuously and across every possible query or dimension that can exist (in the latter case, it is clearly impossible given the existence of arbitrary WHERE clauses), but I think it'd be nice if people could add a new physical construct whereby count(*) queries on certain dimensions can be made faster at the cost of lock contention or resource consumption, not entirely unlike indexes.
select n_live_tup from pg_stat_user_tables
where relname = 'your_table_name'
1. It's surprisingly difficult to find an experienced PostgreSQL DBA who wants to work in a corporate environment.
2. It's even more difficult to get a large corporation to pay a reasonable salary (why would we pay more for a PostgreSQL DBA, when we can just throw that money at Oracle/Microsoft and get a cheaper DBA with all the certs...) which might entice an experienced PostgreSQL DBA to join the organization.
To expand a bit, IMO - that scenario doesn't really exist for PostgreSQL DBAs, you are either experienced or inexperienced (with no weight-carrying certificate to complement credentials). Hence the corporate environment won't hire an entry-level DBA who has no certs to validate the hire, and they won't justify the seemingly added cost of an experienced PostgreSQL DBA.
heh..actually..it's the first thing on their product list: http://www.commandprompt.com/products/hot_standby_drbd/ ... they also do clusters, etc.
I've taken 5 years and several critical arguments to get a couple of MySQL instances in one of them to run some internal stuff. It's a start at least.
I like Postgres, however in my experience, where money is not a consideration, Oracles RAC with its (fairly) easy multi node DB clusters is often the competitive advantage that prevents Postgres from being used more widely in the enterprise - maybe that could be something that the Postgres team look at replicating.
Let me explain the DBA mindset: "when this falls over at 2am the day before <major, non-negotiable deadline>, who is responsible for fixing it?"
Now I am not saying Postgres is flaky; I am saying that the DBAs operate under a different set of constraints than developers perhaps do. One common scenario is that devs are incentivized on new features delivered and DBAs/SAs are incentivized on uptime.
The issue: I ran a SELECT with a WHERE against a column that is an "int(11)", but I meant to run it against another column that is "varchar(255)". Something like:
SELECT * FROM table WHERE wrong_column = "1abcdefghijk";
Somehow MySQL cast a "varchar(255)" to an "int(11)" with a value "1" without telling me. WTF? My result set was approximately 1M rows. I expected 1 row, so knew the result was wrong... But what if I had expected more than 1 row? Then the bizarre results from MySQL would have appeared to make sense.
For example since 9.1 Postgres support foreign data tables - other data sources is available for user as regular table; there are many drivers: MySQL, Oracle, external pg server, and even plain files.
Since 9.0 there is support for per-column triggers.
- "While replication is indeed very important, are users actually setting up replication each time with MySQL or is it to only have the option later?"
Replication is not an option. It is a must-have for any serious products, both in scaling and in operation.
- Windows functions: They are wonderful and I love them. But it is not an important factor at all.
- Flexible datatypes: True in certain scenarios. It allows to create certain types to map a business object with unusual requirements. Otherwise, the requirements have to be implemented in application logic. But data type of Array? IMHO, use of Array data type in relational database usually means there are some issues in data modeling and design. The performance is another issue. If you have to use array, consider NoSQL options.
- Functions. Although I am a database architect, I use database functions only for simple poking around. For any serious work, I prefer to write application code. It is easier to maintain, to test, and to extend. I don't want to have a big muddy ball. I prefer to have structured, well-decoupled application code.
- Custom Language. Same as above. Why not just write application code?
- Extensions: Geospatial is awesome. It is better than that in MySQL. But for many extensions, I prefer not to do the computation in database. Databases are usually the most expensive resources (in operation costs) and are usually the
performance bottle neck. It would be better to have the computation in app servers. It is easier to scale and it is cheaper.
- Custom Functions: see above.
- Common Table Expressions: Recursive queries are awesome. I t is nice to have for ad-hoc data inspection, but I wouldn't encourage its usage in application code. The risk to use it wrong vs the benefits are high if used in application code.
I disagree or would choose to qualify a number of the things you wrote, but I'll choose this one because I think it's the most interesting:
While window functions are seldom (but not never) used by common application workloads, but I find them pretty important for lowering the barrier of what one-off queries I can write (hence, more questions are economic to answer) to figure out some statistics that assist doing business or solving strange problems. The alternative is to be forced to make a greater number of decisions with less supporting data.
This can be applied to any labor-saving device when it comes to reporting, but I think window functions make the cut for me overall, and it's great they are there.
Its that simple. I can spend a lot of time setting up and monitoring a Postgres cluster, or I can spend about 5 minutes configuring a MultiA-Z RDS instance. However, the moment Amazon offers Postgres on RDS instances, I'll make plans to switch.
I'd love to hear from anyone who has managed to deploy the beta.
- CREATE DOMAIN
- ON UPDATE CASCADE
Oh, and yeah, a BOOLEAN datatype. That's one in the eye to Larry Ellison.
What would make it absolutely the bee's knees would be a decent data language on top of that lovely data engine. Something closed over composition, with a simple, orthogonal syntax. NoSQL in the truest form — not a blind rejection of Codd's algebra, but something LESS DUMB STUPID THAN THE WRETCHEDNESS THAT IS SQL.
Is sad that Postgres completely replaced QUEL with SQL, it was not the best language ever, but some competition and alternatives in this area would be good.
Probably the sanest alternative around today to SQL would be Date's D (not to be confused with the "C++ on steroids" D language), see: http://en.wikipedia.org/wiki/D_(data_language_specification) and http://www.thethirdmanifesto.com/ (evidence that database types should never be allowed to design websites :))
- using pg'a arrays is using SQL against its grain. If you need multiple phone records per user just create a user_data table with three columns (user, key, value)
- the number one reason to prefer PostgreSQL over MySQL is just correctness. Every second query I write in MySQL get a wrong answer.
- using pg arrays is awesome, and for many access patterns is more performant and equally normalized
- MySQL is mostly quite correct, though I disagree with some of the implicit casting of values to integers and some nuances of GROUP BY. That said, operationally speaking, I'd say MySQL replication is less good and can get out of sync in awful ways, there are fewer awesome features (I made some slides: http://love-your-database.heroku.com/), and the storage engines mean you have to choose between useful features and like, not losing your data, man.
I won't call them nuances. SQL tries hard to be very close to set theory, and not just for the sake of it. In set theory, you have no nuances, you have only one possible answer to a question, and MySQL is like a bad student how do not understand why, in maths, you should never affirm something that is not 100% sure and proven.
If you need to query specific phone numbers, then you should have separate columns whose names describe their roles (e.g. cell_phone_number) rather than the EAV monstrosity you proposed. :)
I would really like to know what is monstrous in the "Entity–attribute–value" model, I scanned the Wikipedia article without finding any obvious problems.
I had, however, many many problems with separate columns for each values: I have to handle tables with over 600 columns, for example "fixed_phone", "fixed_phone_modified", "fixed_phone_modified_by", "cell_phone", "cell_phone_modified", "cell_phone_modified_by", etc. With an EAV variant, I'll have "entity", "attribute", "value", "modified", "modified_by", "deleted", and it will be much cleaner, or prove me wrong.
The current (rather than historical) values should not be in generic EAV tables because putting data of different domains in the same column subverts both the relational model and the DBMS's indexing capabilities.
If the point is to label telephone numbers with a role, then a person_id, phone_number, phone_number_label table should be fine.
CiviCRM has an interesting scheme too: it lets you say that a number is a [Home|Work|Other|Whatever] [Mobile|Fixed|VoIP|Whatever] number.
I think applications overspecify things like cell_phone_number because they would like to query for all cell phone numbers, and because they're stuck in the obsolete mindset that 1NF can't include arrays.
As someone coming from a Maths background - instead of pure CS - I'd never heard of some of the features you highlighted. If you consider expanding the article, please could you put in a line or 2 explaining what a feature is?