In terms of scalability in the small, many of the MySQL 5.6 replication features skew this comparison in MySQL's direction: multi-threaded (optionally row-based) replication, global transaction IDs, and the like make it easier to improve availability, add read slaves, and host multiple databases/shards on a single machine (replication, as far as I know, is no longer per-server; it is now per-database).
There's also been work on tuning InnoDB for SSDs (which is a near-certain recommendation for your dataset -- it gives plenty of breathing room to forestall horizontal partitioning).
I'd also look very heavily into performance with large buffer caches, compression (a great way to reduce IOPS and squeeze the most out of SSD or memory space), and so on. I'm curious whether any of these were compared between MySQL and Postgres. As far as I understand, InnoDB's compression is somewhat more advanced than Postgres's, but running either on ZFS is an even better bet for this.
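For illustration, a couple of the SSD-oriented InnoDB knobs that landed around MySQL 5.6, plus table-level compression. The values and table name here are hypothetical examples, not recommendations -- tune against your own hardware and workload:

```sql
-- SSDs sustain far more IOPS than the spinning-disk default of 200:
SET GLOBAL innodb_io_capacity = 2000;
-- Neighbor-page flushing only helps rotational media:
SET GLOBAL innodb_flush_neighbors = 0;

-- Table-level compression (requires innodb_file_per_table and the
-- Barracuda file format in 5.6):
ALTER TABLE orders ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
```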
On the other hand, Postgres has a better query optimizer, supports more complex data and relation formats, and so on. However, most of these features aren't going to be used at scale (I am presuming you're talking about OLTP workloads).
Honestly this is a bit of an unusual decision -- I'd probably start a project with Postgres and avoid using MySQL until later, but choosing Postgres out of scalability reasons seems a bit odd. I'm curious to know why!
From my experience with PostgreSQL and MySQL, it is a bit like the difference between Python and PHP: one is a great work of craftsmanship that is reliable, predictable, coherent, and enjoyable to work with, and minimises the WTF/min rate (which is the best quality measure in software), while the other is a bunch of hacks knit together to make it work ASAP, and its WTF/min rate is sky-high.
To put it bluntly, after managing both for years, Postgres doesn't scare the shit out of me like MySQL does. I've had quite a few moments with MySQL doing stupid things that aren't intuitive or right, particularly in the backup/restore space. For example, backing up one schema and restoring it onto a later version doesn't always work.
Postgres has the same feel as *BSD, i.e. deterministic. It is well documented, does exactly what the manual says, and is devoid of surprises. It feels engineered, and because of this I can deliver reliably engineered solutions.
Basically I can sleep at night.
I really give less of a crap about scalability. I'd throw a bigger box at the problem or use heavy caching. I've built much larger e-commerce solutions, with orders of magnitude more hits/orders than SparkFun, on much smaller kit.
I have been woken up in the night multiple times to find that Postgres has arbitrarily decided to stop using an index, switched to a full table scan, and ground to a halt under the load.
Doing an ANALYZE sometimes fixes it. Increasing the statistics target for the table sometimes fixes it too. One time we had to add an extra column to an index to make the planner decide to use it, even though it shouldn't have needed it (and didn't a few hours before!).
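Both remedies are plain maintenance commands. A sketch, with hypothetical table and column names:

```sql
-- Refresh the planner's statistics for one table:
ANALYZE orders;

-- Raise the per-column statistics target (the Postgres default is 100)
-- so the planner works from a finer-grained histogram, then re-analyze:
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;
```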
The developers seem dead set against allowing you to override the query planner to add determinism.
I'm not saying MySQL is any better, as I have not used that in production before.
That's crazy. How would you rate a shell that just said nothing and copied nothing when asked to copy an important backup file somewhere? For me that is software hell: not even being told that something is wrong.
A vessel is in danger when the captain doesn't know her position, but in much greater danger when he doesn't even know that he doesn't know where she is.
That said, it's unfair to compare with MySQL's planner, which, although it exists, is so infantile that by comparison you could say MySQL has no planner at all. Example: invariant subselects are run against every row of the outer query instead of being stored in a temporary table -- think SELECT a FROM foo WHERE foo.b IN (SELECT b FROM bar). Another basic one: SELECT a FROM foo ORDER BY b LIMIT 5, where an index exists on b, actually sorts the entire table instead of walking the index and stopping when it hits the LIMIT.
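On MySQL versions before the 5.6 subquery-materialization work, the usual workaround for the first case was to rewrite the IN subselect as a derived-table join. A sketch using the same toy tables:

```sql
-- The form old MySQL re-executed once per outer row:
SELECT a FROM foo WHERE foo.b IN (SELECT b FROM bar);

-- Equivalent derived-table join that it evaluated only once:
SELECT foo.a
FROM foo
JOIN (SELECT DISTINCT b FROM bar) AS t ON foo.b = t.b;
```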
A query planner changing its mind can be a problem for most DBs; Postgres is not special in this regard.
I use MSSQL a lot in my day job and sometimes the query planner there surprises us by being fairly bright about this and other optimisations. Of course it sometimes gets things very wrong and you have to use index hints to push it in the right direction, but generally speaking if your queries are complex enough to hit that problem you might need to consider refactoring your design for other reasons too.
With Postgres, while it usually seems to do a better job, when it refuses to use your indexes it can be a massive pain to coax it into doing what you want it to, even when you know it's making a bad choice.
Postgres has a number of issues like that, where ideological purity has a negative impact on usability. I still mainly use Postgres these days, but I regularly swear at it over things I know would be far simpler in MySQL (replication used to be a huge one, though it is now finally getting there, 8 years after I set up my first "large" replicated MySQL setup).
The query planner evaluates multiple query plans and chooses the one with the lowest estimated cost. If that results in the wrong plan, you should try to correct the query planner's input (that is, the statistics and the server configuration, and possibly the tables and their indices) so that it can better evaluate the possible plans.
EXPLAIN is a great tool to debug query plans.
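For example (table and column names hypothetical), you can compare the planner's estimates with what actually happened:

```sql
-- EXPLAIN shows the chosen plan with estimated costs and row counts;
-- adding ANALYZE executes the query and reports the actual numbers:
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 42;

-- Large gaps between estimated and actual row counts usually mean
-- stale or too-coarse statistics -- the planner input mentioned above.
```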
You mean hacks like these?
* Why don't you think the optimizer is a consideration? That's a big deal, because it helps the database choose the right algorithms as your data changes. Continuing gracefully while the input data is changing sounds like scalability to me.
* Postgres is generally considered to scale to many concurrent connections and many cores. Useful for OLTP.
* Sometimes the right feature or extensibility hook can make a huge difference allowing you to do the work in the right place at the right time. For instance, LISTEN/NOTIFY makes it easier to build a caching layer that's properly invalidated -- not trivial to get right if the feature is missing, which might mean that you're not getting as much out of caching as you could be.
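A minimal sketch of that LISTEN/NOTIFY pattern, assuming a hypothetical products table and a cache process holding an open connection:

```sql
-- A trigger function that announces each product change on a channel,
-- passing the row id as the payload:
CREATE OR REPLACE FUNCTION notify_product_change() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify('product_changed', NEW.id::text);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER product_changed
AFTER UPDATE ON products
FOR EACH ROW EXECUTE PROCEDURE notify_product_change();

-- In the caching client's session; it then evicts the cache entry
-- named in each notification payload it receives:
LISTEN product_changed;
```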
Tuning for SSDs, compression, buffer caching, etc., are the antithesis of scalability. Those things are pure performance issues. Scalability is about a system that adapts as input data and processing resources change. (Your point about replication does concern scalability, of course.)
But MySQL scales trivially to many servers. Postgres replication is finally getting there, but it's been incredibly slow coming, while setting up huge MySQL installations with a dozen+ replicas, with selective sharding etc. has been effortless for about a decade.
InnoDB has a lot of really smart optimizations which make it much better than just zipping things up. Included are the modification log (so you only have to re-compress occasionally) and a dynamic scaling ability to keep compressed pages in memory rather than always decompressing. These optimizations are really only possible with an understanding of the data.
I would only consider ZFS for something like an append-only data warehouse type system.
I know that a lot of work has been done on InnoDB compression, but I didn't quite grasp the extent, thinking it wasn't far different from Postgres ( http://www.postgresql.org/docs/current/static/storage-toast.... ).
> See, MySQL and its forks, for all of the really huge
> and amazing stuff built on them in the last decade,
> have got some drawbacks. On this point, I risk triggering
> a lot of strong feelings from the sort of people who
> have strong feelings about databases, so for the time
> being, let's just say that it started to seem like we'd
> have a better time writing business software on a
> database that embodied a more rigorous approach to
> formal types and relationships. And because of years
> of our nerd friends telling us we should probably just
> use PostgreSQL, it kind of seemed like PostgreSQL might
> be that database, especially since we already knew we
> were going to use something with open code.
"As mentioned here before, SparkFun.com is really the tip of a much bigger iceberg called Sparkle. Sparkle is an internal website where we hang every tool we need. It does everything but the core accounting."
If you can share, what's the reason you folks went with this architecture?
Beyond DB choices, I'd be very concerned about a design that tied so many important business functions into the one tech. stack, let alone the same application.
I'd like to be able to do things like bring internal tools down for maintenance without hosing the public-facing CMS. Or change database technologies to suit whichever particular application I was optimising.
In its present state it's a fairly sane architecture for our purposes, but the only honest & concise answer to this question is that we didn't know what we were doing and it just kind of worked out that way.
Not coincidentally, this also describes the architecture of the business as a whole.
Sparkfun is in an odd place, they could call themselves a store, but then again so could Amazon -- it's not clear that selling technology is requisite to technology being a primary area of competence.
Also, wanting a single authoritative view of inventory and order data is NOT a big ball of mud. Nor is a single big DB with everything in it. The big ball of mud generally starts with the link editor, and with developers who think the only way to integrate functionality is to link it all into one big binary. For all we know they are planning to use RabbitMQ as their main integration tool and to build lots of small single-purpose apps that work together like an orchestra.
You do not know hell if you've never encountered a medium-sized organization, without much in-house tech talent, that has somehow ended up with 5 different applications with overlapping database contents and business functionality, completely different architectures and languages, and no ability to integrate with each other.
This is actually also the main reason I've come to dislike Rails: for complex businesses with a single small team or a single developer, it's hard to encapsulate different site functionalities as modular components of a larger system (more like 'apps' in Django), as compared to running multiple Rails sites.
This same need is also why many businesses (prepare to shoot yourself) try to do everything with Drupal.
The ability to meet a need quickly with something that can just plug in easily to an existing system far outweighs anything else in most non-giant non-massive-scale organizations.
This is part of why Rails is so successful despite not making it easy to modularize your site functionality: the components and libraries of the application overall are highly modular, and community package resources mean stuff can be built quickly even if it's hard to modularize within the Rails application itself.
The big bad ball of mud is more an outgrowth of businesses trying to move quickly and demanding their flavor of the week -- without being able to understand, plan, or manage the quality of their developers, the modularity of their code, or the development process -- than of complex single (but modular) systems being inherently bad.
The key with Rails is not to modularize the site, but to modularize at the service level. Give your Rails app a REST API and use it from the other apps.
E.g., instead of sharing authentication classes between sites, set up an auth app that supports OAuth, and use an OAuth module for Devise.
Says you. My experience has been that a sensibly separated range of simple, single-purpose applications with intuitive APIs is actually far easier to work with, especially given limited resources.
I might be missing something, though. Perhaps it's that other situations have concerns without lots of overlaps in business logic and data, or that there's an easier way than I've found to deploy multiple applications transparently so that users don't notice or have to deal with multiple applications.
I'm curious, though, how with a true multiple application architecture you deal with database duplication and fragmentation? This is always the single biggest issue in my experience.
(I also might not be talking about the same thing as you - if I can make a single-purpose application that handles back-end tasks and isn't user facing, doesn't need auth, and also doesn't need its own DB, such as for background processing of images, or generating compute-intensive reports, I'm all for it. But that wasn't really what I was talking about at all.)
This had a single purpose, but it was structured as several discrete apps. I'm going off memory here as I didn't work across all the apps, but it looked like this:
2. a web service to accept the contracts (Ruby, Sinatra)
3. a web service to pre-process the contracts and queue them for insertion into the legacy system (Ruby, Grape, Amazon RDS)
4. a service to de-queue the contracts and insert them into the Windows-based legacy system (.NET, SQL Server, IIS)
There were many, many advantages to this structure:
* each individual component was trivial to understand - in the order of hundreds of lines of functional code
* we could choose the technology appropriate to the job
* we only had to use Windows for the bit that interfaced with the legacy system
* the only technology choices that spanned the entire stack were HTTPS and UTF-8
* status of each individual component was available through HTTP requests; most made RSS feeds of their activity available for monitoring
* we could easily experiment with new technologies as they emerged, without changing or breaking All The Things
* we had a highly skilled dev team, with experience across a wide range of technologies and platforms
* 'the business' understood what we were doing, and why - in fact, the whole purpose of the project was to transform our previously monolithic architecture in this way
* log collection (using Splunk) allowed us to track the progress of individual contracts through the system, and alerted us when one stalled (e.g. was being rejected due to formatting issues)
Except for the last one, because of the ludicrous complexity introduced by interfacing with the legacy system. But all of it except for the legacy interop was easily grasped.
Not knocking .NET or C# here; both are pretty good these days. But the Windows ecosystem is just not as developer-friendly as *NIX.
> "When you have limited developer resources modular balls of mud (AKA big glob of Buckyballs - like the magnetic desk toy) with single login identity and shared well structured DB are far better and easier to maintain than a half dozen+ separate applications with differing architectural choices, languages, and abilities to integrate with each other."
Easier to maintain, yes, easier to grow and change and evolve, no.
With the massive system, it is harder to accumulate technical debt, but some apps from 1999 might still be doing their simple job just fine backed by Access.
Really, I'm at a bit of a loss as to why they switched, but congrats and good luck. Glad to see another MariaDB mention. Just found Galera myself. Wish WordPress supported Postgres.
Makes sense. I guess I've always assumed that while MySQL has gotchas, so too would database X. They come with the territory of databases, being such a large part of any app, yet separate from it and usually infinitely more complex than the apps that use them. To that end, I think it's why some people end up writing their own simpler data stores or caching middleware between databases/OLAP and front-ends. Such written-from-scratch solutions can be easier to debug when things are mission-critical.
From what I've seen with databases, the really complicated part is "setting them up correctly". If you start out with most defaults, something will eventually become a gotcha forcing a schema rewrite or a database transplant using an SQL dump. But that's just been my experience: with databases, expect to not get it 100% right the first time (unless it's not your first time with the technology, of course). And from that perspective, the technology used is less important, it will equally have "gotchas".
That all said, I too will be playing around more with Postgres to see how it compares to the InnoDB/XtraDB I've a hate/love relationship with. ;)
Kind of a moot point since they were using MariaDB anyway, but the MySQL geospatial functionality hasn't gotten much attention in a long time. I remember quite distinctly hearing at the MySQL BoF at OSCON 2011 that the original author of the MySQL geospatial stuff was gone from Oracle. I find it hard to believe that they haven't picked up anyone to continue development on that but when you compare features with PostGIS, there's almost no point in even trying to compete.
We do plan to store a bunch more data than we deal with now, but mostly we migrated because we're looking to build a better ball of mud, to borrow a perfectly legitimate analogy from elsewhere in the thread. What we'd like to scale somewhat cleanly is a business which benefits from us building less broken applications, and PostgreSQL is obviously a less broken foundation for the kind of software where it matters if you are correct about things.
If anyone would like a reason to let us know at length what incomparable buffoons we must be for any of writing PHP, rolling our own application framework, still owning physical hardware, or getting a little overexcited when the code doesn't break too badly, I will cop to all of those and more in public, but please don't read the linked post as a strong technical claim about Hard Problems in the Land of Webscalonia which demands & merits a detailed refutation. Posts on sparkfun.com don't tend to be written for a web nerd audience (we sell blinkenlights) and most of us survey a very different conceptual landscape on a day-to-day basis from the average HN reader.
(For clarity here, I write web stuff at SparkFun, and most of my time lately has been on this project. Author of linked piece is my boss.)
Edit: And my apologies if this reply to a rather simple question comes off a bit reactive; consider this my reply to the thread as a whole?
One way someone else put it, MySQL is great if you're using it as a datastore to build an API layer on top (with relational features being a huge plus as opposed to a K/V store). Postgres, on the other hand, effectively does what an RDBMS was originally meant to do -- it itself acts as that data access API.
Also, SparkFun is an awesome company -- the SparkFun Arduino kits on my desk do a lot more than blink lights (though they do that too). Please continue to use runtimes that may be less trendy but return responses quickly, and keep using physical hardware too.
> PostgreSQL is obviously a less broken foundation for the kind of software where it matters if you are correct about things.
> writing PHP, rolling our own application framework
In the same explanation? Seems like you still have some points where you can make serious gains in your technology stack that would have bought you more tangible benefits albeit not as much positive publicity.
It is your job, and it's not my place to tell you how to do it; but when making such a public announcement, 'less broken' wrapped in PHP and a custom framework isn't a very compelling message.
If we could just sort of rewrite all of our code in a nicer language than PHP (taking it as read that most of us can think of one or two of those) with a commensurate effort, and the idea somehow just hadn't occurred to me until I read your comment, I probably would have started the project over my morning coffee.
In the real world, we have a bunch of existing software, a business to run on it, and a well-developed understanding of the pitfalls of the language and environment. There's no conceivable justification for the amount of effort it would take to rebuild our stuff from scratch in one fell swoop. PHP is a pain in my ass, I yield to almost no one in my visceral contempt for PHP, but PHP is not my problem, and it hasn't been for a long time.
What we can do, with a reasonable investment of time and energy, is move more of our business concerns into a more rigorous data store, and explore building new features against the data we model well using other languages and tools. Right now we've got various bits of tooling written in (at least) Ruby, Go, C, Perl, R, Python, and Node.js. Maybe someday we'll really zone in on something and figure out that it's a good time to start slicing off chunks of application and rebuilding them in [better language x] until the PHP has evaporated. Maybe we'll decide that everything should live in [third party system x]. Maybe we won't bother with any of this, because PHP gets a little less heinous with each passing release, we already know what we're doing with it, and there are bigger fish to fry.
Reasoning by analogy is frequently suspect anyway, but in your picture, I'm not sure how databases are barn doors. They tend, at least for the kinds of things we're doing, to be much more fundamental to the framing of the barn. Or maybe they're the stuff stashed in the hayloft. I dunno. At any rate, I don't think it's super coherent to argue that all the things must be fixed at once in order to justify improving one of the things.
I don't disagree with anything you said here, especially this statement above, but it seriously undermines the blog post and makes it sound more like "PostgreSQL is the latest hawtness, so we're going with that and grandstand about it for publicity".
I guess that's my real problem. It makes the whole thing sound like rearchitecting for the sake of publicity, instead of to address an actual issue.
No worries about the tone or replying to the entire thread instead of just my question. I would also be frustrated with some of the comments in this thread.
> From the development perspective Postgres was a necessity. Not a slam dunk but objectively a better tool for our use case.
In what way?
> And if we have countless reasons, all good, but they’re nuanced and technical and tricky to impart to a non-technical audience, where does that leave us?
HN would be interested in those reasons, I know I am.
> The vast majority of the company probably still doesn’t get why we went through this painful contortion at the expense of doing, well, a lot of other things.
Neither do I, since none are given.
Though I also understand that the point of the post was about trust. It's great that there is that trust within SparkFun, I would love such an environment.
(I'm a postgres community member, so of course I'm glad it's working out for them, but I just don't like the analogy.)
I.e., MySQL is easier to get started with (well, was -- nowadays it's much of a muchness), and is easier to scale in some situations, but you trade away some other features and useful things.
Screws and nails are both useful, but I wouldn't just rely on one for every project :)
The introduction tutorial is concise but more than sufficient.
The reference guide is well written, laid-out and covers everything you need:
the database server, its configuration and tools, the SQL language, its operators, data types and grammar, the built-in procedural languages, the low-level client library, and all the database internals (system tables, data file formats, network protocol, query optimiser, indexes, etc).
The mailing lists also seem to carry a lot of good material.
MySQL will just leave you with a half-completed schema change which you'll have to fix by hand.
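That difference comes from Postgres having transactional DDL: a multi-statement migration either applies completely or not at all. A sketch with hypothetical table and column names:

```sql
BEGIN;
ALTER TABLE orders ADD COLUMN shipped_at timestamptz;
ALTER TABLE orders
  ADD CONSTRAINT shipped_after_created CHECK (shipped_at >= created_at);
COMMIT;

-- If either ALTER fails, the transaction rolls back and the schema is
-- untouched. MySQL implicitly commits each DDL statement, so a failure
-- mid-migration leaves the first change applied.
```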
Or, OK, there may be a reason: South used to not play so well with MySQL/MariaDB. This has improved over time, but it may still be a reason to use PostgreSQL.
(Or just don't bother and use SQLite, depending on your case it may be just fine)
Or even how much of that is down to limitations in the Python or Django DB drivers (two separate things).
We have a demo instance for version 7.0, but it's in Portuguese (though you can change the language). OpenERP S.A. offers a free trial.