I strongly believe there are huge opportunities for AI to come along and, given a logical description of the data and query, do a better job of physically organising and serving the data. I realise some database management systems do this to an extent, but it's pretty weak sauce in my experience.
You're not likely to succeed by handing AI "here's the query and schema" and having it somehow figure out what the best plan is. There are a lot of rules to observe for correctness, and you can't just try to execute arbitrary valid plans, because the initial query plans are going to be really bad. So you again need a model of what valid query plans are, and a cost model to evaluate how efficient a plan is, without having executed it.
I think there are plenty of opportunities for ML-type improvements, but they're in individual parts, less so as a wholesale query planner replacement. I think the most likely beneficiaries are around statistics collection and evaluation, incremental improvements to query plans, and some execution time -> plan time feedback loops.
Surely query optimizers do this already. What about that sounds like "magic"?
Computers can beat Lee Sedol at Go, it hardly seems like "magic" for a computer to come up with a query plan that is obvious to a trained human.
For example, OP mentioned the case of creating a temporary table with an index instead of emitting a lot of temporary files executing a subquery. Surely you could statically verify that such a transformation was legal, and it seems likely that a cost model could accurately predict that such a transformation would be cheaper. So why doesn't the query optimizer perform this transformation? Presumably because it doesn't yet explore this part of the search space of possible transformations.
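For illustration, the manual version of that transformation might look something like this (table and column names are made up, and whether it actually wins depends on the data):

-- instead of letting the subquery spill to temp files on every probe:
--   SELECT * FROM orders WHERE customer_id IN (SELECT id FROM customers WHERE region = 'EU');
-- materialise it once and index it:
CREATE TEMP TABLE eu_customers AS
  SELECT id FROM customers WHERE region = 'EU';
CREATE INDEX ON eu_customers (id);

SELECT o.* FROM orders o JOIN eu_customers c ON c.id = o.customer_id;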
EDIT: apparently this post is making some people mad. I'm not saying it's easy or obvious, but with ML being applied to so many problems very successfully, it seems very strange to say it's expecting "magic" to suggest that ML could improve on obviously sub-optimal query plans.
We have a cost function and we have a search space of valid transformations. Sounds like a lot of room to experiment with ML-based approaches to me.
> Surely query optimizers do this already.
Sure. But not in a form that's applicable for ML / AI.
> What about that sounds like "magic"?
It's naming a buzzword and expecting revolutionary results, without actually suggesting how $buzzword can be applied to achieve those results.
I'm asking for concrete proposals because I think it's an interesting field.
> Computers can beat Lee Sedol at Go, it hardly seems like "magic" for a computer to come up with a query plan that is obvious to a trained human.
Just because a technology worked for one field, doesn't mean it's trivially applicable for other fields (not that the original application for go was trivial in the least).
Assuming that there’s a single “AI” technology that’s used and widely applicable is deeply flawed, as is the idea that all of Google's computational power for such a peacock-oriented game can make its way into a query planner. It’s already the case that Postgres falls back to a genetic algorithm (GEQO) for join-order search, so it’s already got “AI” baked in.
I’m sure there is plenty of room for improvement with query planners, including new paradigms, but you can’t just throw such a generic buzzword at it and have that be intelligible. You might as well say “it just needs better algorithms.” Well, ya. It has zero substance.
I'm sorry that you think my suggestion has zero substance. If you think that the sub-second query plans that Postgres currently generates for ad-hoc workloads are adequate, I wish you luck. I personally think there are lots of missing heuristics that an ML approach could pick up, either in the general case, or over time in a specific database.
I'll be honest, for a lot of stuff I can live without any real intelligence. For my workloads, I'd be happy to wait for the query planner to gain full knowledge of the physical layout of the database before trying anything. My other main annoyance, as mentioned, is WHERE clause pushdowns. I think there is a simple, if not efficient, decision procedure for this stuff and I'm happy to wait for the planner to do the work.
FWIW, to me pushdown doesn't sound at all like a field where ML would be helpful...
In a particular case that annoyed me this week, casting an IN clause with a subquery to an array (on a query over a view with a GROUP BY) resulted in wildly better performance, despite setting the statistics target to the maximum value, indexing everything etc. I'm happy to demonstrate a trivial reproduction of this issue, and I'm sure I'm a crap DBA, but every day it's the same fight - SQL is _not_ a declarative language. It is incredibly sensitive to implementation details, in ways that are way less obvious than any other programming language.
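A trivialised sketch of the shape (view and table names are invented; the real case involves a GROUP BY view):

-- slow in my case: IN with a subquery
SELECT * FROM daily_totals WHERE id IN (SELECT id FROM wanted_ids);

-- wildly better: cast the subquery to an array first
SELECT * FROM daily_totals WHERE id = ANY (ARRAY(SELECT id FROM wanted_ids));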
Postgres could do a better job in all sorts of situations, but I personally think that finding all these edge cases algorithmically or analytically is futile. A platform built on a paradigm more closely aligned to the logical descriptions of data and queries could, over time, learn to perform better.
You have a cost function you want to minimize. That can't form the basis of a search among valid program transformations?
> Just because a technology worked for one field, doesn't mean it's trivially applicable for other fields
I didn't say "trivially." I just pushed back on the idea that it's "magic."
There are precedents for ML-like approaches to optimization for regular compilers. For example, superoptimizers.
Also, superoptimizers are stochastic search algorithms, which are only tangentially related to ML, which is again only tangentially related to AI. The relation between AI and superoptimization is close to nil.
I'm still learning what Postgres extensions are capable of. I've written a few but only to define functions & types. It seems like I've seen more "interesting" achievements though, like CitusDB seems able to rewrite queries. Can an extension even add syntax?
EDIT: This seems to be the important part: https://github.com/citusdata/citus/blob/master/src/backend/d...
I wonder what `planner_hook` is about. . . .
You can quite easily replace the planner, it's just a single hook. Building a replacement's the hard part ;)
> Can an extension even add syntax?
No, not really. There's no good way to achieve extensibility with a bison based parser. At least none that we PG devs know and "believe" in.
I personally think there are massive opportunities for data description languages and query languages that make expressing a single set of semantics simpler. SQL is supposedly a declarative language, but almost always requires you to understand _all_ the underlying implementation details of a database to get good performance.
Beyond that, I'd be more than happy with ML that enhanced, heuristically, some parts of the query optimiser. Hell, I'd be happy with a query optimiser that just went away and optimised, 24 hours a day, or at least didn't just spend 19ms planning a query that is then going to run for 8 hours.
Now I have no idea how much training data you would need or if a single, trained model would work for most use cases. But I think it could work in theory.
I wanted to perform some introspection on queries in-flight, and potentially re-write them, but this seemed like an awful chore with calcite. One, calcite didn't implement my dialect -- it looked non-trivial to add new dialects. Two, going from query to parse tree (with metadata) back to query doesn't seem to be something that's intended to be supported.
Calcite is really a replacement frontend with its own SQL dialect, where you can tweak the compiled output (SQL, CQL, etc.). It has its own warts and is not generally applicable to other projects. It works well within the Apache ecosystem but I don't envision much adoption outside of it (especially non-Java, like Postgres).
If you were writing a Tableau replacement, you might consider using Calcite to generate your queries. I don't see many other use cases.
There are some organizational quirks to how the code is written. It has been around for a long time, so it still has some design elements from working within older Java versions. Like many projects it is reliant on integration tests, likely more than would be desirable.
As far as the wider Apache ecosystem, I know there are projects that get by with pretty low code quality. Unfortunately there aren't many central voices at Apache enforcing specific policies around how codebases are managed, they are more focused on community development. I think they may be better off trying to take a closer look at codebases during incubation. Then again, there is no excess of resources waiting around to review code, and a lot of budding communities that want to take part.
That's a weird trend to notice, given that none of the same developers are working on "Apache projects" generally. The ASF isn't like e.g. Mozilla; it isn't a monolithic org with its own devs. It's a meta-bureaucracy that takes ownership of projects for legal (copyright + patent assignment) reasons, and then offers project-management charters to member projects, giving them process and structure for contribution without a canonical owner.
An ASF project is sort of like a "working group" (e.g. Khronos, WHATWG), except that it's usually one or two large corporations and a bunch of FOSS devs, rather than being exclusively a bunch of corporations.
On the other hand, if there is a trend, it might be because of the increasing reliance on the "open core" strategy for corporate FOSS contributors to make money.
My own experience is with Cloudant, who contribute to Apache CouchDB, but also have their own private fork of CouchDB running on their SaaS. Originally, Cloudant did what they liked in their own fork, and then tried to push the results upstream. The ASF didn't like this very much, since these contributions weren't planned in the open, so Cloudant increasingly tried instead to mirror its internal issues as ASF Bugzilla issues and get public-member sign-off before going forward on private solutions. Which is kind of a shadow play, since many of the founding members of the CouchDB ASF project either have always worked for Cloudant, or have since been hired by Cloudant, so it's Cloudant employees (in their roles as ASF project members) signing off on these Cloudant issues. But it still slows things down.
A good comparison that people might be familiar with, with the same dynamics, is WebKit. The WebKit project has its own separate development organization, but most of the developers happen to also work for Apple.
Previously, WebKit was Apple and Google, but even two corporate contributors were too big for the small pond. Which, to me, shows that they were each there expecting to dominate the decision-making process, rather than find consensus with the FOSS contributors; and having an equally-powerful player that they had to form consensus with was too much for them.
I'm curious how much query planners have to make tradeoffs between effective optimizations and not overloading the analysis phase.
In case of PostgreSQL: constantly. Pretty much every planner improvement has to argue that it's not likely to unduly increase plan time.
A serviceable approach would be to perform additional optimizations based on the query's cost - but that's hard to do without incurring a complete replan, because of the bottom up dynamic programming approach of a good chunk of the planning.
There’s a few statements which I never change, but run thousands of times a day, and which are extremely expensive.
Here's a concrete example: a view with a GROUP BY. Give it a concrete ID to filter by, and it will push it down to the underlying table and return quick results. I can then script a loop to do this to get the full dataset for a list of IDs. However, if I supply a `WHERE id IN (...)` or a JOIN, the query plan will be totally different and will take forever. This is dumb, and this is with up-to-date statistics etc. I'm happy to accept I'm probably not the target market, but just having the option to leave the query planner running for minutes instead of milliseconds would be great (if it indeed has work to _do_ in those minutes).
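Roughly this shape (the schema here is hypothetical, just to show what I mean):

CREATE VIEW totals AS
  SELECT id, SUM(amount) AS total FROM ledger GROUP BY id;

SELECT * FROM totals WHERE id = 42;                        -- filter pushed down, quick
SELECT * FROM totals WHERE id IN (SELECT id FROM id_list); -- totally different plan, takes forever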
ETA: some quick googling suggests no, since most tuning documentation is on the initial query or the indexing.
For example one of the expensive steps is exploring the possible join tree - deciding in what order the tables will be joined (and then using which join algorithm). That's inherently exponential, because for N tables there are N! orderings. By default PostgreSQL will only do exhaustive search for up to 8 tables, and then switch to GEQO (genetic algorithm), but you can increase join_collapse_limit (and possibly also from_collapse_limit) to increase the threshold.
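For illustration, the relevant settings can be raised per session before running a big join query (the values here are arbitrary):

SET join_collapse_limit = 12;  -- default 8
SET from_collapse_limit = 12;  -- default 8
SET geqo_threshold = 14;       -- default 12; below this, exhaustive search is used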
Another thing is you may make the statistics more detailed/accurate by increasing `default_statistics_target` - by default it's 100, so we have histograms with 100 buckets and up to 100 most common values, giving us ~1% resolution. But you can increase it up to 10000 to get more accurate estimates. It may help, but of course it makes ANALYZE and planning more expensive.
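For example, either per session (or in postgresql.conf), or per column if only specific columns need it (table/column names are placeholders):

SET default_statistics_target = 1000;   -- default 100, max 10000
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
ANALYZE orders;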
And so on ...
But in general, we don't want too many low-level knobs - we prefer to make the planner smarter in general.
What you can do fairly easily is to replace the whole planner, and tweak it in various ways - that's why we have the hook/callback system, after all.
I find "disable sequential scans entirely proves to me that Postgres doesn't care about performance for interactive use." in combination with not reporting the issue a bit contradictory. We care about stuff we get diagnosable reports for.
This could even be something that operates as a layer on top of Postgres, although it would probably need more precise data than what is currently available through stats tables, and since Postgres doesn't have query hints there's no way to modify query plans. It would also need the look at logs to see what the most frequent queries are, and possibly require a dedicated slave to "test run" queries on to gauge whether it has found an optimal solution (or maybe do it on the actual database during quiet hours).
AI would definitely be interesting, though.
This is the core idea behind Peloton, a research database from CMU: http://pelotondb.io. Andy Pavlo has a really interesting, and entertaining talk about it https://www.youtube.com/watch?v=mzMnyYdO8jk.
As a bonus, it could (like mod-security) also have a "production mode" where the DB consumes that hints file and applies the instructions in it as an overlay, rather than as a permanent patch to your schema; that way, you'd be able to re-train after changes and then swap out the hints in production by just swapping out the file and HUPping the DB server (which would hopefully do the least-time migration to turn the old indexes into the new indexes), rather than needing to scrub out all the old indexes before building new ones.
Although it might sound easy, this turns out to be a very challenging problem for many technical reasons.
A primary reason for it not getting traction in practice is also that database administrators don't like an automated tool messing with their setup and potentially nullifying all the tricks they might have played to improve performance. This was especially true in big DB2/Oracle deployments, but is increasingly less true, which has opened it up for innovation in the last few years.
But it's a ways off and can take many different forms.
This would make it easy to catch missing indexes without logging all queries and without the nightly batch updates which are expected to do full table scans.
It's using machine learning techniques to improve cardinality estimates.
Arbitrary ad hoc query support is hard, it feels like you want indices on everything but then data insertion is dog slow. Then even with perfect stats I’ve seen vertica and PostgreSQL order aspects of queries wrong, I know the pain you speak of. I’ve not convinced myself that it’s always solvable or obvious until you do it wrong though, I never spent much time explaining queries before executing them or explaining “fast” queries though.
“PostgreSQL 10 provides better support for parallelized queries by allowing more parts of the query execution process to be parallelized. Improvements include additional types of data scans that are parallelized as well as optimizations when the data is recombined, such as pre-sorting. These enhancements allow results to be returned more quickly.”
This sounds HUGE. I want to see detailed benchmarks and examples of the kind of workloads that are faster now.
Anyway, some more info http://rhaas.blogspot.hu/2017/03/parallel-query-v2.html
But I have always been curious: how does postgresql (or even mysql) stack up vs proprietary databases like Oracle and Microsoft sql server?
Be careful with that statement, Oracle’s license disallows users of the database from making benchmarks or any kind of performance comparisons (another reason to move to PGSQL)
EDIT: Source that refers to the ToS: https://stackoverflow.com/a/12116865
Now, I'm not going to discount Oracle entirely. From what I've learned from some truly hardcore DBAs, there are certain workloads and scenarios where Oracle is still unbeatable.[ß] But outside those special cases, postgres is the rational, correct choice.
ß: For example, if you have insane performance requirements on a single system, it has been explained to me that Oracle can be used with their own "fs" layer, which in turn runs directly on top of the block device layer. Essentially, a bare-bones file system designed for nothing but their own DB usage.
For example, APFS vs. HFS+ (filesystems are a kind of hierarchical database, so it's not that big a stretch). Multiple secondary sources, citing the same outdated benchmark from months ago, declare that APFS is slower than HFS+. Here's one from HN, with some back & forth arguments and further ad-hoc measurements. Yet nobody bothered running or even mentioned fio.
Or the "ZFS will eat your data if you don't use ECC memory" meme that refuses to die.
Maybe it's because, in order to argue that "X is bad" is wrong, it's hard not to state "X is bad" first, and biased / less astute audiences tend to stop paying attention after absorbing "X is bad".
Imagine if CPU makers prevented you from running PassMark.
Hell, imagine if a car's manufacturer warranty was voided if you took it to a dynamometer.
Why do we as a society allow absurd rent-seeking practices like these? What possible social good could come from preventing discussion about an enterprise software product?
Ferrari has the right to force you to give your car back if you do anything unauthorized with it, including any modification, or any unapproved testing. This was why Deadmau5 had to give back his Purrari (he modified the logo and theme to a nyancat theme).
I can see Ferrari not selling you any new cars because you did something they don't like, but I don't see how they have the right to force you to give your car back (which you paid for) if you modify it.
Some (many?) Ferrari models aren't for sale. Rather, they're indefinitely leased.
Just like with all your online services, or your software.
That sounds like sort of claim that should be backed up with some pretty significant evidence.
This is far from the first time they've done this, as subsequent google searches will attest. I'll leave this as an exercise to the reader.
> The lucky few selected to get an F50 would make a down payment of $240,000. Yes, that's right: a down payment of $240,000... on a car. After that, monthly payments were $5,600. Five thousand. Six hundred. Per month. For 24 months. And then, at the end of the lease, you owned the car -- assuming you could come up with the final payment of $150,000. Total payments over the 2-year span: $534,400. Only then could you resell your F50 and make money off the wild speculation.
I also suggest the Google query.
(Of course nobody in danger of getting involved in a court case with Oracle is going to be asking me.)
> You may not
> - disclose results of any program benchmark tests without our prior consent.
1) Oracle is really lacking in modern features/usability. Features were frozen in roughly 1999 and they are still pretty much the same (mostly). (You are STILL limited to 30 chars for a table name FFS). They do add new stuff from time to time but anything existing isn't modified. Works but not fun to work with.
2) MSSQL needs NOLOCK everywhere (I've seen codebases with this on EVERY query). The default locking really sucks. I'm sure a DBA can make the locking sane system-wide (see the sketch after this list) but I've never seen this on any of the DBs I've worked with. Also, SQL Manager is a PITA IMHO. Toad it is not. Doing almost all DB interactions via a 1GB, Windows-only install is a "bad idea".
3) MySQL is nice but will just silently eat your data from time to time. Auto-truncate is evil, as is its handling of missing enum values. These have both hit me multiple times in production. Note: Not sure if this is still the case since I avoid it now for this reason.
4) Postgres. Lots of nice features and easy to work with, but the optimizer will sometimes do something silly. Sometimes you have to cast your bound variable just to use an index. (id=? => id=?::NUMBER just because you do a setObject in jdbc)
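To illustrate (2), table and database names are made up; this is the per-query hint people sprinkle everywhere, versus the system-wide setting a DBA would typically flip instead:

SELECT OrderId, Total FROM dbo.Orders WITH (NOLOCK);     -- the hint on every query
ALTER DATABASE SalesDb SET READ_COMMITTED_SNAPSHOT ON;   -- saner default read behaviour, set once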
I think it's a horrible wart that you have to do that for every session, though. The default can't be changed.
NOLOCK is usually a bad idea.
In the latest release, 12.2, the limit has been increased to 128.
Of course I doubt I'll ever notice, since I doubt my current employer will ever upgrade that far. We have currently frozen our large Oracle database as a way to force a long-term migration off it.
Pretty much everything is moving to Postgres on AWS with a bit of other databases thrown in for spice.
Looks like I am forced to use MySQL (or some of its variants) in the near future. This thing about MySQL eating data is something I have read about occasionally. Is there any way to identify and beware of use cases where this could happen? Would there be any more thorough documentation of this issue anywhere?
See STRICT_ALL_TABLES / https://mariadb.com/kb/en/library/sql-mode/
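For example, something along these lines (MySQL/MariaDB; check the sql-mode docs for the full set of modes you actually want):

SET GLOBAL sql_mode = 'STRICT_ALL_TABLES';
-- or just for the current session:
SET SESSION sql_mode = 'STRICT_ALL_TABLES';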
Modern MySQL is extremely well suited for data that cannot be lost - as is, I'm sure, Postgres.
That said, if you're pre 5.6, I _strongly_ suggest upgrading to 5.6 or all the way to MariaDB 10. The performance, safety, stability, etc have skyrocketed in recent years.
Also watch out/avoid enums.
CREATE TABLE shirts ( ... size ENUM('x-small', 'small', 'medium', 'large', 'x-large'));
You have to specify all the values in the ALTER, so to add xx-small it is
ALTER TABLE shirts MODIFY size ENUM('xx-small', 'x-small', 'small', 'medium', 'large', 'x-large');
and then later, if you add xx-large and forgot about the xx-small, it becomes
ALTER TABLE shirts MODIFY size ENUM('x-small', 'small', 'medium', 'large', 'x-large', 'xx-large');
You just silently lost all the xx-small values, they have been promoted to different values that exist. (Unless this has been fixed as well). Migration scripts are the real issue as they don't know about any custom values that may have been added out of band.
1) SQL Server Management Studio (SSMS) - the default GUI is a decent free front-end that integrates well and lets you do advanced stuff like graphically live-profile queries (live execution plans), easily setup security, build indices, setup various things like linked servers, compression, etc. Although I'm a text-editor sort of person, I don't have SQL syntax memorized for infrequent tasks like creating indices so a GUI (or an IDE) can really help productivity in these instances.
Postgres's default GUI, pgAdmin is comparatively weak, and the good ones are third-party payware.
2) Columnar indices - MSSQL has a fairly good implementation called columnstore indices, which creates a derived column-oriented data structure which speeds up analytic queries quite a bit.
3) Speed - SQL Server is very performant and optimized, and doesn't choke on very large datasets. Postgres is decent, but on my datasets it doesn't seem to be very performant.
Also, MSSQL locking is a boon and a bane. It's not the best for environments with high contention, but it is ok for most analytic database use cases. On the other hand, Postgres' MVCC (and oh the vacuuming) can be annoying.
I wonder if rewriting pgAdmin in Electron might help.
Others recommend using psql from the command-line. Now I spend most of my time on the command-line, and psql is great for one-off stuff, but when you have to edit, test and profile complex queries, the cmd-line interface just doesn't cut it for rapid iteration.
I think this is a huge gap in the Postgres world. But I also think that DataGrip is very promising. I have a pretty high view of Jetbrains' tools.
Datagrip is on its way to becoming really good, though it still has some issues.
There's a whole list here:
(I use the free Apex Refactor with SSMS and it makes editing SQL a pleasure)
PostgreSQL also used to be (is?) single-threaded, which limited performance of a single query on multi-core machines -- I haven't looked into it to see if there has been any fundamental change in the architecture in the last 4-5 years.
From the submission:
"Improved Query Parallelism - Quickly conquer your analysis"
Query parallelism was introduced in 9.6 and expanded in 10.
So...it isn't single threaded then? I mean that is exactly how the most advanced competitors operate (Oracle, SQL Server) as well -- a given connection stays on one thread, with the advantages that confers, unless the planner decides to parallelize.
Postgres processes share memory for all kinds of things. Hash tables may be duplicated, but not due to any fundamental limitations.
The commercial databases are still faster since they have more advanced algorithms and optimizations, as well as better scale out options and tooling - but Postgres is quickly catching up and will be fine for the majority of scenarios. Postgres also has better general usability with robust JSON support, CSV handling, foreign data(base) access, lots of extensions and other features that help make it a powerful data platform.
Today the real difference will be for companies that have some combination of existing Oracle/Microsoft tools and services, advanced clustering needs, complex security requirements, or a dependency on the more advanced features like MSSQL's in-memory OLTP.
Based on that experience, I’d rank them in this order:
1. Postgres
2. SQL Server
3. Oracle
4. MySQL
Postgres often lags behind the others in features, but the dev team chooses their battles wisely and ends up with better implementations.
Postgres is a real beacon of light in the open source world. Solid community. Many projects claim the benefits of open source, but they are never fully realized. Also, because Postgres is not operated by a freemium model, you always have access to the latest and greatest features. The extensibility is fantastic and well-leveraged by the community. I’ve never experienced a case where Postgres tried to figure out what I was doing and decided to do the wrong thing. Postgres fails early and loudly when there’s a problem with what I’m asking it to do, which is exactly what I want it to do. I don’t ever want to have to second guess the integrity of my data.
I haven’t run explicit benchmarks between any of these databases. But when I do similar things across two different systems, I feel like they are generally on par. But like I said, I can’t prove that with any numbers. There are probably specific work profiles that people can come up with that would show better performance for one platform over the other. But I don’t think there’s a realistic difference in performance in general. Not one that’s big enough to push your decision.
The real moment of revelation though, is when you find out that you can run your preferred programming language inside of Postgres. When you actually get to the point that transformations are outside of what you want to do in SQL, and you can just write a Python function and have it execute inside your database instead of having to do I/O, process the data, and then push it back . . . it is life-changing.
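A minimal sketch of what that looks like, assuming the plpython3u extension is available on your server (function and example values are invented):

CREATE EXTENSION IF NOT EXISTS plpython3u;

CREATE FUNCTION clean_name(raw text) RETURNS text AS $$
# ordinary Python, executed inside the server process
return raw.strip().title()
$$ LANGUAGE plpython3u;

SELECT clean_name('  ada LOVELACE ');   -- 'Ada Lovelace'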
The only reason SQL Server isn’t tied for first place is because of the lack of extensibility and because it’s expensive to use in production. But it is rock solid, and has some nice things that Postgres doesn’t have, like hinting queries. Again, the Postgres community has discussed this, and it may never actually happen, but there are reasonable points as to why not. But it is really handy in SQL Server to be able to guide the query planner on the fly like that. SQL Server has also had solid pub-sub for a long time, though we’re getting that now with this version of Postgres.
I’m not a huge fan of Microsoft in general, but you absolutely have to give them props for their tooling. There is nothing even close to SSMS for any other database system. It is by far the gold standard for a visual interface to your data.
Obviously, if you’re throwing down money, you’re also getting a certain level of support for the product. I’m not convinced this should be a deciding factor between Postgres and SQL Server because, again, the Postgres community is amazing.
I should also point out that there’s a free version of SQL Server that will suffice for the needs of a great many people. Its features are limited (no pub/sub, and there’s a size limit on your total dataset), but it’s totally functional for a lot of use cases. Even though I use Postgres for everything in production, I will always keep a PC around to run SQL Server for one-off things that are just easier to do there.
Oracle is mostly fine. I was so new to everything when I was using it that I probably can’t speak that well to its strengths and weaknesses. Other people who have used it more recently can probably do it better than me. I just can’t for the life of me understand why anyone would pay their prices when SQL Server and Postgres exist, unless it’s for the support contract. And where I’m kind of meh about Microsoft, I’m actively against Oracle and Sun Microsystems. I’m pretty sure that Larry Ellison’s personal motto is, “Just go ahead and be evil.” But that’s kind of a tangent and not really all that relevant.
MySQL is a different animal. It has a different design philosophy than the others. It’s more like MongoDB in principle than the others are. Its main goal is to be friendly to the developer. And to entice you into upgrading to the paid tier.
Which is all fine. But one consequence of that is that it tries really hard to succeed under any circumstance, even if “success” means corrupting or losing your data. So it fails rarely, late, and quietly under certain conditions.
For that reason, I don’t think of it as even being in the same category as the other three. As in, it would never be an option for me, similar to MongoDB. I want my dev tools and programming languages to be focused on the developer. And I want my data store to be focused on my data. I think that this is a fundamental and deadly flaw with MySQL.
Different use cases have different requirement though, so your mileage will vary. I’m an incredible pedant about data integrity because the work I do requires it. There are legitimate cases where it just doesn’t matter all that much.
But in terms of feature parity and performance, they are all pretty close in general terms. Each will have specific cases that they really excel at and have been optimized for.
I would say that MySQL and Postgres are generally ok to use; I slightly prefer MySQL's tooling.
Oracle seems to be the most different one, but once I got used to it, it was ok.
I wasn't able to convince people to fix it but I ended up writing a Go utility to reliably import/export csv into sql server using Bulk Insert for my own use (and sanity). And it ended up being faster than other methods to boot.
SQL Server supports CSV with exactly the same semantics as Excel. Which is what people expect 99.9% of the time, because most CSV data goes to or from Excel in the real world.
If you’re doing DB-to-DB imports and exports, use a sane file format with a sane delimiter such as the Unicode INFORMATION SEPARATOR and RECORD SEPARATOR characters, which were inherited from ASCII.
“Valid CSV” is a dubious phrase, since the closest thing CSV has to a spec is an RFC that tried to map out the space of the wide variety of implementations then existing.
Anyhow, SQL Server is a database server; there are a wide variety of ETL tools that will export from the server to any common (or not, really) format you like, including just about any flavor of CSV/TSV you might be interested in.
"The bulk copy program utility (bcp) bulk copies data between an instance of Microsoft SQL Server and a data file in a user-specified format. The bcp utility can be used to import large numbers of new rows into SQL Server tables or to export data out of tables into data files."
Agreed. SQL Server Management Studio is fantastic and is one of the main reasons I enjoy working with MSSQL.
Unfortunately Postgres severely lacks in the tooling department. PgAdmin 3 used to be good, but PgAdmin 4 is simply horrendous. It makes me dread interacting with Postgres.
psql is great for inspecting schemas and running ad-hoc queries. SQL scripts can also be piped through it for cron SQL jobs etc.
pg_dump + psql + ssh to easily copy databases between hosts
pgbench for benchmarking.
Lots more: https://www.postgresql.org/docs/current/static/reference-cli...
Postage, a tool developed by a family of software devs, was gaining popularity but recently became unmaintained without explanation (afaik). 
I started using IntelliJ DataGrip on a trial basis and it's good, but I probably won't pay for it. Sick of paying monthly subscription fee for every little tool I need from JetBrains, especially when I put down a project and don't need that tool for another x months.
Used DBeaver briefly but it's so many clicks just to set up a primary key that I shelved it for now. Will probably come back to it when DataGrip trial is over.
Not a conventional management tool but pgModeler is a cool project IMO. Open-source, but they put a limit on the Windows binaries they distribute to try to get people to fund development. Can build from source yourself, install on Linux, or probably find free third-party builds elsewhere.
I think that most devs are just sticking with pgadmin3.
That said it seems there is little to gain except nice words for smaller open source efforts.
JetBrains' DataGrip is significantly better, and works with all the extant SQL implementations.
We can definitely debate the SQL implementations themselves, but DataGrip makes tooling a non-issue.
Does DataGrip let you do Kerberos authentication?
It allows you to manually specify every authentication parameter if you wish to, but the simple login page only supports password or client-cert auth.
I don't know what I don't know and am not familiar with situations in which the new functionality introduced here should be used.
TLDR: am database/postgres noob. Help?
They tell you the what, but rarely the why.
It is a fairly easy read, even for someone (like myself) who has very little knowledge of what's actually going on under the hood of most DBs.
- PgSQL mailing list thread discussing Uber's post: https://www.postgresql.org/message-id/579795DF.10502%40comma...
- Slides from "A PostgreSQL Response to Uber", a talk that provides the PgSQL-perspective counterarguments to the Uber post: http://thebuild.com/presentations/uber-perconalive-2017.pdf [PDF]
- Previous HN discussion on the above: https://news.ycombinator.com/item?id=14222721
- Slides from "Migrating Uber from MySQL to Postgres" (2013), in which Uber's prior migration _to_ PgSQL is discussed: https://www.yumpu.com/en/document/view/53683323/migrating-ub...
Always be mindful of how your database is tuned, you could be doing everything right but the query planner will seem to act insane regardless.
If you do have a specific problem then you could ask the community about that and they will help you determine whether there is new functionality that addresses it. Then go read about it in the manual, try it out, and see if it works for you. I find learning from solving my own problems easier than reading a manual cover to cover.
Still, I’d argue that it’s valuable to explore new things, at least at the level where you get an overview of which tool to use for which job.
See http://blog.aiven.io/2017/10/aiven-is-first-to-offer-postgre... for more information
Can anyone speak to how much closer this brings Postgres to being able to work like Spanner/CockroachDB? Partitioning is great but having to define my own partitions ahead of time isn’t a great solution for us folks who want to be able to spin up new databases on the fly (without the overhead of assigning new ranges for the tables on these new instances.)
Obviously CockroachDB has a long way to go before it works as well as a single instance Postgres DB. But how far does Postgres have before it works as well as a multi-instance Cockroach DB?
We've got a pretty large deployment in production, with 45TB of data across 1,280 cores and 4.8TB of RAM. Most of our tables have hundreds of billions of rows that are under write-heavy pressure. Citus is currently handling around 60k writes per second. Most of our SELECT queries run under 200ms, and in some cases those are fairly complex.
The current setup is a result of long tests done across many different database solutions. We never considered CockroachDB seriously for our production, but we did use or test MongoDB, Cassandra (including ScyllaDB), HBase (including Big Table) and Spanner.
We struggled with most of them, but we used HBase for a year before we moved to PG/Citus. Our expenses dropped by half once we did, as PG is just better suited for our setup. We tested Spanner fairly heavily as GCP folks tried to convince us to give it a chance. The performance however wasn't even comparable.
Citus has its own quirks and it's not perfect, so definitely do your research before you decide to use it. The best part however is that you get the best SQL engine underneath, with all the powerful features and tools that come with it. We for instance heavily utilize triggers and pub/sub directly from PG. A huge portion of our application logic is done on the DB level through UDFs. The part we like about Citus the most though is the team behind it. They are incredibly helpful and go beyond the scope of traditional support even when they don't need to.
I know that Heap Analytics is using jsonb to have schema-free setup on citus and I think it's working well for them.
Column-oriented stores normally have just as much schema as row-oriented stores. What they're faster at is scans where the full row tuple is not required to compute the result. Storing the data in columnar format rather than row also means much better compression: similar data is brought closer together.
Column stores are great for some tasks and have disadvantages in others. Redshift was, I believe, the first implementation on top of PG, so it's doable; however I'm not sure that PG itself is a good engine for it.
In our case, where we need to have access to multiple columns at once, column stores (including the ones that clump them as families/groups) proved to be slower and required more resources than the row-oriented stores.
I think this really goes case by case based on your needs. If we would benefit from a column-oriented DB we would not choose PG/Citus but something else (probably we would have stayed on HBase).
i.e. “switching” to MariaDB is probably just a sysadmin upgrading the software and no changes to your code or database queries (unless replication is involved) but pgsql will certainly require more involved changes to the software.
PostgreSQL also provides additional features like LISTEN/NOTIFY, Row Level Security, transactional DDL, table functions (like generate_series), numbered query parameters ($1 instead of ?), and many others.
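A couple of quick illustrations of those (the channel and date range are arbitrary):

-- table function / set-returning function:
SELECT day::date
FROM generate_series(timestamp '2017-10-01', timestamp '2017-10-07', interval '1 day') AS day;

-- LISTEN/NOTIFY:
LISTEN new_orders;               -- in one session
NOTIFY new_orders, 'order 42';   -- in another; the listener receives the payload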
In my opinion, the only reasons for choosing MySQL/MariaDB are if you absolutely need clustered indexes (also known as index organized tables — PostgreSQL uses heap organized tables) or if your architecture relies on specific MySQL replication features (for example using Vitess to shard your database).
But at this point, I think PostgreSQL should be the default choice for most people for new systems, and then move away if you have some real problem with it. The chance of regret is a lot lower with postgres.
Replication can be a bit of a pain to set up compared to anything in the MySQL family since the tooling to manage it isn't part of the core project (there are tools out there, like repmgr from 2ndQuadrant).
Similar story with backups, bring your own tooling - again, 2ndQuadrant has a great solution with barman, there's also WAL-E if you want to backup to S3 along with many others.
Uber certainly presented a valid pain-point with the way indexes are handled compared to the MySQL family: any update to an indexed field requires an update to all indexes on the table (compared to MySQL, which uses clustered indexes, so only specific indexes need to be updated). If you have a lot of indexes on your tables and update indexed values frequently, you're going to see an increase in disk I/O.
Someone else can probably come up with a more exhaustive list, but the first two are things I've personally been frustrated with - even with the tooling provided by 2ndQuadrant I still have to admit other solutions (namely Microsoft SQL Server) have better stories around replication and backup management, though the edge is in user-friendly tooling and not so much underlying technology.
On the other hand, PostgreSQL has a lot of great quality of life features for database developers. pl/pgsql is really great to work with when you need to do heavy lifting in the database; composite types, arrays and domains are extremely useful for complex models and general manipulation; full-fat JSON support can be extremely useful for a variety of reasons, as can the XML features; PostGIS is king when it comes to spatial data; and a whole hell of a lot more.
PostgreSQL is hands-down my favorite database because it focuses on making my life, when wearing the database developer hat, a lot nicer. With the DBA hat on it complicates things some compared to other products, but the tooling out there is at least decent so it's not a huge deal.
The biggest issue with replication in PostgreSQL is restoring a failed master, pg_rewind provides the low-level plumbing to do this and it's somewhat integrated into repmgr now - but it's far from being easy to use compared to something like SQL Server Availability Groups. Being the sole Linux sysadmin / PostgreSQL DBA in my organization means I have to take responsibility for this since the tools are complex enough I can't easily throw them over to our Windows admins or SQL Server DBA's in an emergency. This is partially an issue with staffing, but if the tooling was a little easier to understand and robust enough in fully-automatic operation I could just leave common troubleshooting steps in a runbook, but right now when replication breaks it REALLY breaks.
I have a single pg admin in my current project and he can throw most tasks to the Oracle (running on Linux) admins or even junior Linux admins.
While "complex" they are still fairly basic to command line admins.
Anything with fewer than 3 pipes or one regex should be simple to a command-line *nix person. Granted, that excludes including a complex AWK script or inline Perl execution on the command line instead of in a file like a normal sane person.
> Replication can be a bit of a pain to set up compared to anything in the MySQL
This is a very large understatement.
MariaDB compared to postgres is a difficult choice, both have a multitude of features. MariaDB 10 was a game changer for the mysql landscape.
Making postgres 10 as easy to scale as mariadb 10, or getting mariadb 10's query semantics to allow for the complexity that postgres 10 allows, requires tradeoffs either way in management tooling, development practices and understanding.
Neither of them are a silver bullet. They both require effort, like any RDBMS. But both are great choices.
Can you please elaborate in what respect?
Another thing to take into account is that upgrading between major versions (what looked like minor versions before the 10.x numbering change) is a bit tricky since IIRC the on-disk format can change. This means that upgrading from e.g. 10.x to 11.0 requires you to either take your cluster offline, or use something like pglogical. This is really my only complaint, but again it's not really a reason to _not_ use PostgreSQL.
When ACID support is required, you use InnoDB, and that has the same `SELECT COUNT(*)` performance as other database engines.
What workload do you have that you need exact count of rows in a big table? Because if the table is not big or if inexact count would suffice, there are solutions (e.g. HyperLogLog).
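For example, a cheap approximate row count can come straight from the planner's statistics (this assumes a reasonably recent ANALYZE; the table name is hypothetical):

SELECT reltuples::bigint AS approx_rows
FROM pg_class
WHERE relname = 'big_table';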
* <all the usual caveats about attempting to tune query performance>
As far as I know, that's true when you use the MyISAM storage engine (which is non transactional), but not when you use InnoDB (which has been the default for years now).
I've recently started working with PostgreSQL and stumbled upon a problem with their UPSERT implementation.
Basically, I want to store an id -> name mapping in a table, with the usual characteristics.
CREATE TABLE entities (id SERIAL PRIMARY KEY, name VARCHAR(10) UNIQUE NOT NULL);
It appears that's impossible to do with PostgreSQL. The closest I can get is this:
INSERT INTO entities (name) VALUES ('a'), ('b'), ('c')
ON CONFLICT (name) DO NOTHING
RETURNING id, name;
We like to mock Java, C++ and Go for being stuck in the past, but that's nothing compared to the state of SQL. Sure, there are custom extensions that each database provider implements, but they're often lacking in obvious ways. I really wish there was some significant progress, e.g. TypeSQL or CoffeeSQL or something.
INSERT INTO entities (name)
(VALUES ('a'), ('b'), ('c') EXCEPT SELECT name FROM entities);
SELECT id, name FROM entities WHERE name IN ('a', 'b', 'c');
Use WITH existing AS (SELECT id, name FROM entities WHERE name IN :data).
Then return a UNION of existing and (INSERT INTO entities (name) :data EXCEPT existing RETURNING id, name).
I've actually seen the specific misunderstanding many times on forums and on IRC. I wonder if it'd be possible to change the syntax to something like "RETURNING [ ALL | NEW ] * | output_expression [ [ AS ] output_name ]" where when empty "NEW" would be used and be equivalent to the current semantics where "ALL" would include rows that were not inserted on conflict. "ALL" would have no different meaning on "ON CONFLICT DO UPDATE" or when no conflict resolution was specified.
I guess that would be possible, but I would be very surprised if that was so... Also, I would also expect writes to rows to be atomic (i.e. you wouldn't see a row with half-written data, or a row that has an `id` but not (yet) a `name`) - again, that kind of behaviour would be possible, but very surprising to the point of the DB being unusable.
Race conditions are always surprising. Fortunately, we have a simple remedy for that: transactions. :)
Suppose you have another connection and it runs DELETE FROM t WHERE id = (SELECT MAX(id) FROM t). If that winds up in the middle, it could screw things up for you. Is it likely to happen? No, but again... the remedy is simple. Also, the scenario you describe seems simple enough, but what happens when you bulk insert a few thousand rows?
By the way, with Postgres, there really is no way to not have a transaction. If you don't issue BEGIN, it treats the statement as a small transaction. So you're getting the same overhead on a single statement, if your plan was to somehow avoid the overhead.
If this makes you question the database's usability, I have some bad news for you: all ACID compliant databases work this way. The "atomic" unit is the transaction, not the connection. The consistency guarantees pertain to the transaction, not the connection. This is not a weird thing about Postgres, this is how they all work (except, of course, that MySQL can't always roll back aspects of a transaction, like DDL).
most people do read (sometimes even in another thread / other connection) -> transaction -> write.
Also, the basic transaction isolation level in pg would still allow inconsistencies like phantom/non-repeatable reads and serialization anomalies.
Basically, yes, you need transactions to guard against dirty reads, but most often you don't care about dirty data. If you're not dealing with financial data or other data that needs to be handled with care (say you have a timeline with posts), you don't care if the last read was dirty, and the user probably doesn't care either; they will just refresh the page once they need more up-to-date data.
There are optimizations planned for the query planner that would optimize this into a single query in the next releases.
Some commercial databases already do that.
INSERT RETURNING only returns the value of the changed rows, but it does not return anything if you redirect writes with triggers or child tables, including partitioning
Actually, I find SQL pretty nice. It's strongly typed in non-MySQL engines. It's pretty expressive, and can be highly optimized without having to update queries. Analytical features found in non-MySQL databases, such as window functions, are also invaluable for many types of analysis.
The biggest issue I've ever had is that nesting queries when doing some analytical work can be a pain. WITH does handle this to some extent, but it doesn't seem quite as nice as something like Pig where results can "flow" through the query.
The other issue is transformations like pivot tables where you don't know the column names ahead of time. pg has extensions for that, but they aren't quite as nice as a native SQL solution would be.
WITH input_rows (name) AS (
  VALUES ('a'), ('b'), ('c')
),
insert_rows AS (
  INSERT INTO entities (name)
  SELECT name FROM input_rows AS ir
  ON CONFLICT DO NOTHING
  RETURNING id, name
)
-- returned when insert was successful
SELECT 'new' AS src, i.id, i.name
FROM insert_rows AS i
UNION ALL
-- returned when there's a conflict
SELECT 'existing' AS src, e.id, e.name
FROM entities AS e
JOIN input_rows USING (name);
Amazing job everyone who has worked on it.
I suspect many larger shops will wait till PGSQL 10.1 before going whole-hog on this feature, but exciting stuff nonetheless!