Not to bikeshed, but what about Postgres makes the big scalability story any different than the MySQL story, unless you're talking about a commercial distributed RDBMS built on top of Postgres like EnterpriseDB?
In terms of scalability in the small, many of the MySQL 5.6 replication features skew this comparison in MySQL's direction: multi-threaded (optionally row-based) replication, global transaction ids, and the like make it easier to improve availability, add read-slaves, and host multiple databases/shards on a single machine (replication afaik is no longer per-server, it is now per-database).
There's also been work on tuning innodb for SSDs (which is a near certain recommendation for your dataset -- it gives plenty of breathing room to forestall horizontal partitioning).
I'd also look very heavily into performance with large buffer caches, compression (great way to reduce IOPS and squeeze the most out of SSD or memory space), etc... I am curious to see if any of these were compared between MySQL and Postgres. As far as I understand, innodb's compression is somewhat more advanced than Postgres, but running either on ZFS is an even better bet for this.
On the other hand, Postgres has a better query optimizer, supports more complex data and relation formats, and so on. However, most of these features aren't going to be used at scale (I am presuming you're talking about OLTP workloads).
Honestly this is a bit of an unusual decision -- I'd probably start a project with Postgres and avoid using MySQL until later, but choosing Postgres out of scalability reasons seems a bit odd. I'm curious to know why!
Maybe it was not only for scalability, but also for reliability or embracing the Do the Right Thing thing?
From my experience with PostgreSQL and MySQL, it is a bit like the difference between python and php, one is a great work of craftmanship that is reliable, predictible, coherent, enjoyable to work with and minimise the wtf/mn rate (which is the best quality measure in software), while the other is a bunch of hacks knit together to make it work asap and its wtf/mn is skyscraping.
To put it bluntly after managing both for years, Postgres doesn't scare the shit out of me like MySQL does. I've had quite a few moments with MySQL doing stupid things that aren't intuitive or right, particularly in the backup/restore space. For example backup one schema and restore onto later versions doesn't always work.
Postgres has the same feel as *BSD i.e. deterministic. It is well documented and does exactly what the manual says and is devoid of surprises. It feels engineered and I can provide reliably engineered solutions because of this.
Basically I can sleep at night.
I really give less of a crap about scalability. I'd throw a bigger box at the problem or use heavy caching. I've built much larger ecommerce solutions with orders of magnitude more hits/orders than sparkfun on much smaller kit
I have to say, my experience with Postgres is the opposite. Sometimes the query planner will suddenly decide to pick a bad execution plan.
I have been woken up in the night multiple times and found that the problem is that Postgres has arbitrarily decided to stop using an index and changed to do a full table scan and the database has ground to a halt under the load.
Doing an ANALYZE sometimes fixes it. Increasing the statistics target for the table in addition sometimes fixes it. One time we had to add an extra column to an index to make it decide to use it even though it shouldn't have needed it (and didn't a few hours before!).
The developers seem dead set against allowing you to override the query planner to add determinism.
I'm not saying MySQL is any better, as I have not used that in production before.
Well if you grep for MySQL gotchas you should find a bunch of them. One did bite me hard one day. I wanted to add a simple check constraint on a column, as I did often on PG. A quick glance at the doc, I send the ADD CONSTRAINT command, and, no news is good news, I thought everything was all set and good. But in fact the beast did just ignore my command! After careful check in the docs in some case the constraints are ignored because not implemented.
That's crazy. How would you rate a shell that just said nothing and copied nothing when asked to copy an important backup file somewhere? For me it is the hell of software, not even telling there is something wrong.
The vessel is in danger when the captain don't know her position, but in much greater danger when he is not even knowing he doesn't know where she is.
You have to run your ANALYZE commands... I've run into this issue on very volatile tables (lots of inserts/deletes). With these, the table profile will change and without statistics the query planner might go on the wrong path.
That said, it's unfair to compare with MySQL's planner, which, although it is there, is so infantile that by comparison you can say MySQL has no planner. Example: Invariant sub selects run on every row of the master query, instead of being stored in a temporary table. Think SELECT a FROM foo WHERE foo.b IN (SELECT b FROM bar). Or another basic one SELECT a FROM foo ORDER BY b LIMIT 5 where there exists an index on b actually sorts the entire table instead of using the index and jumping out when it hits the LIMIT.
My experience with mySQL (admittedly none of that experience is recent so things may have changed a lot since) suggested that it didn't have a planner at all and it just attacked things in the order of your FROM/JOIN clauses using indexes where obvious from the WHERE clauses and linking ON definitions: it wouldn't for instance see circumstances where it could hit the second table of a join first and so need to consider for less of the first larger table for instance, so reordering inner joins in a query could make a significant difference.
I use MSSQL a lot in my day job and sometimes the query planner there surprises us by being fairly bright about this and other optimisations. Of course it sometimes gets things very wrong and you have to use index hints to push it in the right direction, but generally speaking if your queries are complex enough to hit that problem you might need to consider refactoring your design for other reasons too.
But the difference with MySQL is that you can easily forcibly make MySQL use the indexes you want it to, as long as it is possible.
With Postgres, while it usually seems to do a better job, when it refuses to use your indexes it can be a massive pain to coax it into doing what you want it to, even when you know it's making a bad choice.
Postgres has a number of issues like that, where ideological purity has a negative impact on usability. It still mainly use Postgres these days, but I regularly swear at it over things I know would be far simple than MySQL for those reasons (replication used to be a huge one, though that is now finally getting there, 8 years after I set up my first "large" replicated MySQL setup)
Bad execution plans are almost always the result of a missing or incompatible index, bad table statistics (corrected by running ANALYZE) or bad server configuration (corrected by adjusting memory parameters and cost parameters).
The query planner evaluates multiple query plans, and chooses the one with the lowest cost based. If that results in the wrong plan, you should attempt to correct the query planner input (that is, the statistics and the server configuration, and possibly the tables and their indices) so that it can better evaluate the possible query plans.
This is perhaps one of the more succinct ways of summarizing our core reasons for the switch. There's a lot more to it than that, of course, but in a nutshell we saw the upcoming scaling of our data footprint and complexity and wanted to use (what we evaluated to be) the better tool for the job.
I don't know the answer in this case, but a few thoughts:
* Why don't you think the optimizer is a consideration? That's a big deal, because it helps the database choose the right algorithms as your data changes. Continuing gracefully while the input data is changing sounds like scalability to me.
* Postgres is generally considered to scale to many concurrent connections and many cores. Useful for OLTP.
* Sometimes the right feature or extensibility hook can make a huge difference allowing you to do the work in the right place at the right time. For instance, LISTEN/NOTIFY makes it easier to build a caching layer that's properly invalidated -- not trivial to get right if the feature is missing, which might mean that you're not getting as much out of caching as you could be.
Tuning for SSDs, compression, buffer caching, etc., are the antithesis of scalability. Those things are pure performance issues. Scalability is about a system that adapts as input data and processing resources change. (Your point about replication does concern scalability, of course.)
> * Postgres is generally considered to scale to many concurrent connections and many cores. Useful for OLTP.
But MySQL scales trivially to many servers. Postgres replication is finally getting there, but it's been incredibly slow coming, while setting up huge MySQL installations with a dozen+ replicas, with selective sharding etc. has been effortless for about a decade.
... while setting up huge MySQL installations with a
dozen+ replicas ...
If you're careful about your function calls, sure. Include a UUID() or SYSDATE() in your INSERT statements, and the replication goes to pieces as the function calls are run separately on each slave (with differing results).
Doing compression on the ZFS level is significantly worse than InnoDB compression.
InnoDB has a lot of really smart optimizations which make it much better than just zipping things up. Included are the modification log (so you only have to re-compress occasionally) and a dynamic scaling ability to keep compressed pages in memory rather than always decompressing. These optimizations are really only possible with an understanding of the data.
I would only consider ZFS for something like an append-only data warehouse type system.
I'm sure you would want the compression done by the database and not the filesystem, since there are many ways to do compression to fit specific applications, and the database knows what it's trying to do. I read a little bit of the MySQL docs regarding how it uses compression, and it sounded pretty different than general-purpose compression.
> See, MySQL and its forks, for all of the really huge
> and amazing stuff built on them in the last decade,
> have got some drawbacks. On this point, I risk triggering
> a lot of strong feelings from the sort of people who
> have strong feelings about databases, so for the time
> being, let's just say that it started to seem like we'd
> have a better time writing business software on a
> database that embodied a more rigorous approach to
> formal types and relationships. And because of years
> of our nerd friends telling us we should probably just
> use PostgreSQL, it kind of seemed like PostgreSQL might
> be that database, especially since we already knew we
> were going to use something with open code.
The phrase "...we'd have a better time writing business software on a database that embodied a more rigorous approach to formal types and relationships." is what nails it for me. This isn't a technical feature, per se, of PostgreSQL, it's more of an attitude that's embodied in the product.
"As mentioned here before, SparkFun.com is really the tip of a much bigger iceberg called Sparkle. Sparkle is an internal website where we hang every tool we need. It does everything but the core accounting."
If you can share, what's the reason you folks went with this architecture?
Beyond DB choices, I'd be very concerned about a design that tied so many important business functions into the one tech. stack, let alone the same application.
I'd like to be able to do things like bring internal tools down for maintenance without hosing the public-facing CMS. Or change database technologies to suit whichever particular application I was optimising.
I think you are reading too much into it. They never said that there would only ever be one PostgreSQL db with everything. As you point out there are advantages to separating certain functions into separate dbs, but for most businesses that go that route, it makes sense to use multiple instances of the same core tool.
Also, wanting to have a single authoritative view of inventory and order data is NOT a big ball of mud. Nor is a single big db with everything. The bid ball of mud generally starts with the link editor and developers who think that the only way to integrate functionality is to link it all into one big binary. For all we know they are planning to use RabbitMQ as their main integration tool, and build lots of small single purpose apps that work together like an orchestra.
This sounds counter-intuitive, but the best (from the technical angle) companies I worked at -- or know friends at -- are muddle the "internal enterprise tools" and "public facing applications" divide quite a bit, but (and here's a catch) when their perceived and actual primary area of competence is building applications that at least somewhat resemble the internal tools in question.
Sparkfun is in an odd place, they could call themselves a store, but then again so could Amazon -- it's not clear that selling technology is requisite to technology being a primary area of competence.
> If you can share, what's the reason you folks went with this architecture?
In its present state it's a fairly sane architecture for our purposes, but the only honest & concise answer to this question is that we didn't know what we were doing and it just kind of worked out that way.
Not coincidentally, this also describes the architecture of the business as a whole.
When you have limited developer resources modular balls of mud (AKA big glob of Buckyballs - like the magnetic desk toy) with single login identity and shared well structured DB are far better and easier to maintain than a half dozen+ separate applications with differing architectural choices, languages, and abilities to integrate with each other.
You do not know hell if you've never encountered a medium sized organization without much in house tech talent who've somehow ended up with 5 different application with database contents and business functionality overlap and completely different architectures and languages with no ability to integrate with each other.
This is actually also the main reason I've come to dislike Rails - for complex businesses with single small teams or single developers it's hard to encapsulate different site functionalities as modular components of a larger system (more like 'apps' in Django) as compared to running multiple rails sites.
This same need is also why many business (prepare to shoot yourself) try to do everything with Drupal.
The ability to meet a need quickly with something that can just plug in easily to an existing system far outweighs anything else in most non-giant non-massive-scale organizations.
This is part of why Rails is so successful despite not making it easy to modularize your site functionality - the components and libraries of the application overall are highly modular and community package resources make it so stuff can be built quickly even it's hard to modularize within the Rails application itself.
The big bad ball of mud is more an outgrowth of business trying to move quickly and demanding their flavor of the week without being able to understand, plan, or manage the quality of their developers, modularity of their code, or the development process, rather than complex single (but modular) systems being bad.
"When you have limited developer resources modular balls of mud (AKA big glob of Buckyballs - like the magnetic desk toy) with single login identity and shared well structured DB are far better and easier to maintain than a half dozen+ separate applications with differing architectural choices, languages, and abilities to integrate with each other."
Says you. My experience has been that a sensibly separated range of simple, single-purpose applications with intuitive APIs is actually far easier to work with, especially given limited resources.
I feel like you're not talking about single purpose applications that have human users, but rather ones that have other systems as clients. If I told our users that I had written nice APIs for the things they do I'd probably be out of a job. Likewise, every time I've had to store the same data multiple times in separate databases it's eventually become a headache.
I might be missing something, though. Perhaps it's that other situations have concerns without lots of overlaps in business logic and data, or that there's an easier way than I've found to deploy multiple applications transparently so that users don't notice or have to deal with multiple applications.
I'm curious, though, how with a true multiple application architecture you deal with database duplication and fragmentation? This is always the single biggest issue in my experience.
(I also might not be talking about the same thing as you - if I can make a single purpose application that handles back-end tasks and isn't user facing, doesn't need auth, and also doesn't need it's own DB, such as for background processing images, or generating compute intensive reports, I'm all for it. But that wasn't really what I was talking about at all.)
I understand your concerns, the problem is when they're all tied in to the same database, they aren't modular. Modular means I can change one app without being forced to change the others and that's not possible when all apps integrate through a database. An integration database is a massive global variable that kills all modularity.
> "When you have limited developer resources modular balls of mud (AKA big glob of Buckyballs - like the magnetic desk toy) with single login identity and shared well structured DB are far better and easier to maintain than a half dozen+ separate applications with differing architectural choices, languages, and abilities to integrate with each other."
Easier to maintain, yes, easier to grow and change and evolve, no.
Massive systems are alluring for architectural purists, but disparate systems are easier on the pocket book. Say, like at SP, you need OLTP abilities for some new account/tracking app. Well, too bad, you've got to upgrade the Dog Walking app as well, even though there is no business benefit to upgrading that ancillary app which does not need OLTP.
With the massive system, it is harder to accumulate technical debt, but some apps from 1999 might still be doing their simple job just fine backed by Access.
I presume you're recommending PostGIS because of the allusions to the inventory location problem. While better modelling of geo data would be nice it's kind of a different concern. For us inventory location isn't about where on Earth our inventory is, or even where geometrically it is relative to something else, it's about where in a hierarchy of named locations it is. For instance: we have 10 of widget X in the storefront stock room, row F, rack 17, shelf 3, bin 1. Or we have 22 of widget Y in the receiving room on the quarantine table. Stuff like that. When modelling that many semantic nested locations and all of the transfers of items between them the primary issue is scalability and speed, not so much modelling geographic data.
Thanks for the reply. I see in another comment thread you agreed with the sentiment, "From my experience with PostgreSQL and MySQL, it is a bit like the difference between python and php, one is a great work of craftmanship that is reliable, predictible, coherent, enjoyable to work with and minimise the wtf/mn rate (which is the best quality measure in software), while the other is a bunch of hacks knit together to make it work asap and its wtf/mn is skyscraping."
Makes sense. I guess I've always assumed that while MySQL has gotchas, so too would database X. They come with the territory of databases, being such a large part of any app, yet separate from it and usually infinitely more complex than the apps that use them. To that end, I think it's why some people end up writing their own simpler data stores or caching middleware between databases/OLAP and front-ends. Such written-from-scratch solutions can be easier to debug when things are mission-critical.
From what I've seen with databases, the really complicated part is "setting them up correctly". If you start out with most defaults, something will eventually become a gotcha forcing a schema rewrite or a database transplant using an SQL dump. But that's just been my experience: with databases, expect to not get it 100% right the first time (unless it's not your first time with the technology, of course). And from that perspective, the technology used is less important, it will equally have "gotchas".
That all said, I too will be playing around more with Postgres to see how it compares to the InnoDB/XtraDB I've a hate/love relationship with. ;)
I was thinking the same thing about wanting to go PostGIS. Though they could probably do just as well for their needs (presuming the case is in a single building) by using vanilla postgres' geometric functions.
Kind of a moot point since they were using MariaDB anyway, but the MySQL geospatial functionality hasn't gotten much attention in a long time. I remember quite distinctly hearing at the MySQL BoF at OSCON 2011 that the original author of the MySQL geospatial stuff was gone from Oracle. I find it hard to believe that they haven't picked up anyone to continue development on that but when you compare features with PostGIS, there's almost no point in even trying to compete.
Awright, so I didn't write the linked piece, but Chris isn't talking about scaling our web traffic by an order of magnitude. We're a small-to-mid-size web store & light manufacturer and we do decent business and very occasionally get a bunch of traffic as far as mid-sized web stores go. We're not a big service provider, and we're never going to be one in the sense where we have to worry about scale the way that people with Real Problems do.
We do plan to store a bunch more data than we deal with now, but mostly we migrated because we're looking to build a better ball of mud, to borrow a perfectly legitimate analogy from elsewhere in the thread. What we'd like to scale somewhat cleanly is a business which benefits from us building less broken applications, and PostgreSQL is obviously a less broken foundation for the kind of software where it matters if you are correct about things.
If anyone would like a reason to let us know at length what incomparable buffoons we must be for any of writing PHP, rolling our own application framework, still owning physical hardware, or getting a little overexcited when the code doesn't break too badly, I will cop to all of those and more in public, but please don't read the linked post as a strong technical claim about Hard Problems in the Land of Webscalonia which demands & merits a detailed refutation. Posts on sparkfun.com don't tend to be written for a web nerd audience (we sell blinkenlights) and most of us survey a very different conceptual landscape on a day-to-day basis from the average HN reader.
(For clarity here, I write web stuff at SparkFun, and most of my time lately has been on this project. Author of linked piece is my boss.)
Edit: And my apologies if this reply to a rather simple question comes off a bit reactive; consider this my reply to the thread as a whole?
> PostgreSQL is obviously a less broken foundation for the kind of software where it matters if you are correct about things.
> writing PHP, rolling our own application framework
In the same explanation? Seems like you still have some points where you can make serious gains in your technology stack that would have bought you more tangible benefits albeit not as much positive publicity.
In my honest opinion, if you choose to change to PostgreSQL because it's "less broken" but still use PHP as your front end, I feel completely justified in asking WTF. Sure, you're securing the barn doors nicely, but the rest of the barn might as well be made of tissue paper.
It is your job, and it's not my place to comment about how to do your job; but when making such a public announcement, 'less broken' when wrapped with PHP and custom frameworks isn't a very compelling message.
You can always ask WTF. What else is the internet for? I'll try to give you a face-value answer.
If we could just sort of rewrite all of our code in a nicer language than PHP (taking it as read that most of us can think of one or two of those) with a commensurate effort, and the idea somehow just hadn't occurred to me until I read your comment, I probably would have started the project over my morning coffee.
In the real world, we have a bunch of existing software, a business to run on it, and a well-developed understanding of the pitfalls of the language and environment. There's no conceivable justification for the amount of effort it would take to rebuild our stuff from scratch in one fell swoop. PHP is a pain in my ass, I yield to almost no one in my visceral contempt for PHP, but PHP is not my problem, and it hasn't been for a long time.
What we can do, with a reasonable investment of time and energy, is move more of our business concerns into a more rigorous data store, and explore building new features against the data we model well using other languages and tools. Right now we've got various bits of tooling written in (at least) Ruby, Go, C, Perl, R, Python, and Node.js. Maybe someday we'll really zone in on something and figure out that it's a good time to start slicing off chunks of application and rebuilding them in [better language x] until the PHP has evaporated. Maybe we'll decide that everything should live in [third party system x]. Maybe we won't bother with any of this, because PHP gets a little less heinous with each passing release, we already know what we're doing with it, and there are bigger fish to fry.
Reasoning by analogy is frequently suspect anyway, but in your picture, I'm not sure how databases are barn doors. They tend, at least for the kinds of things we're doing, to be much more fundamental to the framing of the barn. Or maybe they're the stuff stashed in the hayloft. I dunno. At any rate, I don't think it's super coherent to argue that all the things must be fixed at once in order to justify improving one of the things.
> At any rate, I don't think it's super coherent to argue that all the things must be fixed at once in order to justify improving one of the things.
I don't disagree with anything you said here, especially this statement above, but it seriously undermines the blog post and makes it sound more like "PostgreSQL is the latest hawtness, so we're going with that and grandstand about it for publicity".
I guess that's my real problem. It makes the whole thing sound like rearchitecting for the sake of publicity, instead of to address an actual issue.
Not reactive at all, well put response to mostly confused posts (including my own). I think there was mis-understanding on what was meant by scaling -- web community took it to mean something very different from what you meant. Honestly this befuddled me as well me as well, so I suspected as much -- I didn't think Sparkfun would need to go to horizontal scaling which I am quite confident you'll likely never need (unless you're doing something very wrong).
One way someone else put it, MySQL is great if you're using it as a datastore to build an API layer on top (with relational features being a huge plus as opposed to a K/V store). Postgres, on the other hand, effectively does what an RDBMS was originally meant to do -- it itself acts as that data access API.
Also SparkFun is an awesome company -- the SparkFun Arduino kits on my desk do a lot more than blink lights (though they do that too). Please continue to use run-times may be less trendy but which return responses quickly and using physical hardware too.
Thanks! You answered my question, and it makes sense. There have been times where I have changed underlying technologies because it provided a better foundation for the future even though there was nothing particularly wrong with the existing solution.
No worries about the tone or replying to the entire thread instead of just my question. I would also be frustrated with some of the comments in this thread.
I don't like the "right tool for the job" analogy. When you use a hammer to put in a nail, the hammer isn't a part of the building when you're done with it. But when you choose a DBMS, you're stuck with it in your architecture for a long time and it's a huge cost to migrate.
(I'm a postgres community member, so of course I'm glad it's working out for them, but I just don't like the analogy.)
I think its less hammer more nail vs screw. Both can hold wood to a wall and will be part of the building when you are done. For example a nail can splinter the wood which can cause a break down the line while a screw wouldn't encounter such an issue.
The introduction tutorial is concise but more than sufficient.
The reference guide is well written, laid-out and covers everything you need:
the database server, its configuration and tools, the SQL language, its operators, data types and grammar, the built-in procedural languages, the low-level client library, and all the database internals (system tables, data file formats, network protocol, query optimiser, indexes, etc).
As a hobby-projects-only developer, would there be any reason to switch from mariaBD to postgreSQL?
Scalability is no issue for me (I don't have to scale). And lately I hardly write any queries myself because of django..
Inventory location is one of many, many problems our ERP system is tasked with. And we've looked at a lot of software packages over the years, both proprietary and open. In terms of open software the options are actually pretty limited and have the potential to introduce more problems with integration and customization than they would purport to solve by bringing in new suites of features. What is less limited and truly abundant are examples of other companies in our same position that went full-bore into an off-the-shelf solution and ended up severely disrupting their business operations and aborting transitions.
The best introduction is probably Integrate You Logistic Processes with OpenERP, a free ebook written by OpenERP S.A. themselves. It's slightly outdated, since it was published in 2011 and meanwhile version 7.0 has been released, but the changes to the stock area where limited (except for the UI).
We have a demo instance for version 7.0 , but it's in Portuguese (though you can change the language). OpenERP S.A. offers a free trial.