The recommended approach is to generate SQL that looks like:
SELECT * FROM users
WHERE id = $1
AND ($2 IS NULL OR username = $2)
AND ($3 IS NULL OR age > $3)
AND ($4 IS NULL OR age < $4)
It's worth noting that this approach carries significant dangers for execution performance: it creates a real chance that you'll get a query plan that doesn't match your actual query. See: https://use-the-index-luke.com/sql/where-clause/obfuscation/... for some related material.
Agree. With patterns like this you are leaning on your db server's CPU - among your most scarce resources - versus doing this work on the client on a relatively cheap app server. At query time your app server knows whether $2 or $3 or $4 is null and can elide those query args. It feels bad to use a fast language like Rust on your app servers and then have your perf suffer anyway because your single DB server is asked to contemplate all possibilities on queries like this, instead of doing such simple work on your plentiful and cheap app servers.
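To make that concrete, here is a minimal sketch of the client-side elision, assuming sqlx's QueryBuilder; the optional username/age parameters and table columns are invented for illustration, and the caller would then run the built query with build()/fetch as usual:
use sqlx::{Postgres, QueryBuilder};
// Build only the clauses that are actually needed; the DB never sees the NULL branches.
fn find_users(id: i64, username: Option<&str>, min_age: Option<i32>, max_age: Option<i32>) -> QueryBuilder<'_, Postgres> {
    let mut qb = QueryBuilder::new("SELECT * FROM users WHERE id = ");
    qb.push_bind(id);
    if let Some(name) = username {
        qb.push(" AND username = ").push_bind(name);
    }
    if let Some(min) = min_age {
        qb.push(" AND age > ").push_bind(min);
    }
    if let Some(max) = max_age {
        qb.push(" AND age < ").push_bind(max);
    }
    qb
}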
I've seen many devs extrapolate this thinking too far into sending only the most simple queries and doing all of the record filtering on the application end. This isn't what I think you're saying -- just piggybacking to try and explain further.
The key thing here is to understand that you want the minimal correct query for what you need, not to avoid "making the database work".
The given example is silly because there are additional parameters that must either be NULL or have a value before the query is sent to the DB.
You shouldn't send queries like:
SELECT * FROM users
WHERE id = 1234
AND (NULL IS NULL OR username = NULL)
AND (NULL IS NULL OR age > NULL)
AND (NULL IS NULL OR age < NULL)
But you should absolutely send:
SELECT * FROM users
WHERE id = 1234
AND age > 18
AND age < 35
While sub-optimal, your first example is probably fine to send and I'd expect to be simplified early during query planning at a negligible cost to the database server.
What you shouldn't send is queries like:
SELECT * FROM users
WHERE ($1 IS NULL OR id = $1)
AND ($2 IS NULL OR username = $2)
AND ($3 IS NULL OR age > $3)
AND ($4 IS NULL OR age < $4)
because now the database (probably) doesn't know the value of the parameters during planning and needs to consider all possibilities.
Interesting. I try to use prepared statements to avoid redoing the planning. But since the database schema is small compared to the data, the cost of query planning quickly becomes negligible compared to running a generic query that is inefficient.
While this is no excuse for sending sloppy queries to the database server, my rule of thumb with databases - as I was told by my elders - is "if it can be reasonably done in the database, it should be done by the database". Database engines are meant to be quite performant at what they do, possibly more so than your own code.
They have limited CPU and IO resources. They can only optimize within the bounds of the current table/index structure. And sometimes they make a bad optimization decision and need to be pushed to do the right thing.
Databases can, for example, sort things. However, if that thing being sorted isn't covered by an index then you are better off doing it in the application where you have a CPU that can do the n log n sort.
Short quips lead to lazy thinking. Learn what your database can and can't do fast and work with it. If something will be just as fast in the application as it would be in the database you should do it in the application.
I've seen the end result of the "do everything in the database" thinking and it has created some of the worst performance bottlenecks in my company. You can do almost everything in the database. That doesn't mean you should.
That’s an optimization, and does not mean that the general rule is not valid.
If that happens to be a bottleneck and you can do better, you should definitely do it in code locally. But these are two ifs that need to evaluate to true
Your bad query may not itself be a bottleneck, but it can still negatively impact the performance of everyone else.
Databases are highly susceptible to the noisy neighbor problem.
Databases aren't magic. If you can do something better or the same outside the database, you should. 1000 cpu cycles are better spent on the application than the database. You can easily add more application servers.
Your general rule is invalid because of this. It doesn't have to be a bottleneck before it can be a problem. It's a "general rule" I particularly hate because I do performance tuning at my company and have OFTEN seen this be a root cause of negative outcomes. Devs using these sorts of shorthands without understanding what their database is and isn't good at. It's right up there with "premature optimization" which gets parroted at me by jr devs that want to write an n^3 algorithm when an n algorithm exists if they'd just use a data structure besides `List`.
Don't shut your brain off when coding. Sometimes it is better to make the database do something, sometimes it isn't. When that is true is context and situation dependent.
> If that happens to be a bottleneck and you can do better, you should definitely do it in code locally. But these are two ifs that need to evaluate to true
If the OP said what you are saying, I'd probably agree. However, the above statement makes it clear that the OP is saying "put it in the database unless you can prove it doesn't belong there".
That is what I disagree with. There's a lot of reasonable things you can do with a database which aren't the best thing to do from both a system and performance perspective. It is, for example, reasonable to use the sort method on a database. It's also not something you should do without proper covering indexes. Especially if the application can reasonably do the same sort.
This is actually an incredible way of articulating something that's been on my mind for quite a while. Thank you for this, I will use this.
The received wisdom is, of course, to lean on the DB as much as possible and put all the business logic in SQL, because the DB is of course much more efficient. I myself have always been a big proponent of it.
But, as you rightly point out, you're using up one of your infrastructure's most scarce and hard-to-scale resources - the DB's CPU.
I think there are two different concerns here though:
The article recommends something that may lead to using the wrong query plans. In the "right" conditions, you will do full table scans of all your data for every query.
This is making the DB waste a lot of CPU (and IO).
Wasting resources like that is different from just where to do work that has to be done anyway!
I am a proponent of shifting logic toward the DB, because the work likely ends up there anyway, and having as much logic as possible in the DB usually reduces resource consumption for the DB as well.
The extreme example is that you want sum(numbers) -- it is so much faster to sum it in one roundtrip to the DB than to do a thousand roundtrips to fetch the numbers and sum them on the client. The latter is also far more work for the DB server's own resources.
My point is: Usually it is impossible to meaningfully shift CPU work to the client of the DB, because the client needs the data, so it will ask for it, and looking up the data is the most costly operation in the DB.
Sum is a good thing to do in the DB because it's low cost to the DB and reduces IO between the DB and app.
Sort can be (depending on indexes) a bad thing for a db because that's CPU time that needs to be burned.
Conditional logic is also (often) terrible for the db because it can break the optimizer in weird ways and is just as easily performed outside the db.
The right action to take is whatever optimizes db resources in the long run. That can sometimes mean shifting to the db, and sometimes it means shifting out of the db.
It's hard to think of situations where you don't want to do the sorting on the DB. If you're sorting small numbers of rows it's cheap enough that it doesn't matter, and if you're sorting large numbers of rows you should be using an index which makes it vastly more efficient than it could be in your app.
And if your conditional logic is breaking the optimizer then the solution is usually to write the query more correctly. I can't think of a single instance where I've ever found moving conditional logic out of a query to be meaningfully more performant. But maybe there's a specific example you have in mind?
> if you're sorting large numbers of rows you should be using an index
Perhaps, depends on what the table is doing and needs to be optimized for.
Indexes are not free, they have a write penalty as they need to be updated every time the data in the index is updated.
> I can't think of a single instance where I've ever found moving conditional logic out of a query to be meaningfully more performant. But maybe there's a specific example you have in mind?
Certainly.
In one of our applications we effectively represent the types as subtables with a root table for the parent type. There were roughly 10 different types with different columns per type.
One way the queries were written, which was slow, was that on insertion the client app would send in (effectively) a block of data with all columns for these types. In the database, conditional logic would pull the type id out of the input and decide, based on that type information, which subtable to insert into.
There's really no way to make this something the SQL optimizer can well consume.
The right solution was to instead break this up in the application and per type do the insertions directly into the table type in question. It simplified both sides of the code and ran faster.
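A hypothetical sketch of that shape of fix (the table, column, and type names are invented, not the actual schema from this anecdote): the application matches on the type and issues a plain insert into the right subtable, so each statement stays trivially plannable.
use sqlx::PgPool;
// Invented subtype payloads; each maps to its own subtable.
enum NewEntity {
    Person { root_id: i64, name: String },
    Company { root_id: i64, vat_number: String },
}
async fn insert_entity(pool: &PgPool, e: NewEntity) -> sqlx::Result<()> {
    // Branch in the app: each arm is a simple, planner-friendly statement.
    match e {
        NewEntity::Person { root_id, name } => {
            sqlx::query("INSERT INTO person_sub (root_id, name) VALUES ($1, $2)")
                .bind(root_id).bind(name).execute(pool).await?;
        }
        NewEntity::Company { root_id, vat_number } => {
            sqlx::query("INSERT INTO company_sub (root_id, vat_number) VALUES ($1, $2)")
                .bind(root_id).bind(vat_number).execute(pool).await?;
        }
    }
    Ok(())
}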
I agree that for a) inserts and b) single entity access -- in both these cases the backend can do a lot of the preparation. And your example is both a) and b). We are then in O(1) optimization territory.
If you are only processing a single entity -- the backend should tell the DB exactly what to do. And of course one shouldn't have if-statements in SQL; that is "doing it wrong".
But if you have a chunk of 10000 entities like that in your example, all of different types, then you will have to insert some subset of the data into all those tables (1000 into one table, 500 into another table, and so on). That logic is well suited to WHERE conditions without much overhead.
But yes, for inserts most of the logic can usually be shifted to the DB client, as that is where the data already resides. The problem I was talking about was meaningfully shifting work to the client for queries, where the client has no data to work with and must fetch it from the DB.
Let us take your example and turn it into "fetch 10000 such objects". Fetching the right rows for all of them at once using joins and where conditions (+temporary tables and multiple return sets in the same query roundtrip) is going to be more efficient for the DB than the backend first fetching the type, then branching on type, then fetching from another table and so on.
> Fetching the right rows for all of them at once using joins and where conditions (+temporary tables and multiple return sets in the same query roundtrip) is going to be more efficient for the DB than the backend first fetching the type, then branching on type, then fetching from another table and so on.
Nope, not if done correctly.
Now, this isn't to say there's not valid reasons to do it all at once in the DB, the chief among them being ACID requirements. However, from an efficiency standpoint both for the application and the DB the most efficient action is to first request from the parent table and then turn around and, in parallel, send out requests for the child tables of the various types as needed.
Assuming you have a connection pool, the overhead of doing multiple requests in parallel is small. The DB has less data to lookup. The DB has less temporary memory to store (which in our case was a problem). The response io is smaller (not a bunch of empty columns sent back) and both the DB and the downstream application are capable of querying against these tables in parallel.
There is a latency downside in needing to load up the parent table first; if the data size is large enough you could overcome that by making batch requests to the DB as the parent dataset comes back, say starting the parallel request to load a given type's data every 1k values of that type.
Splitting the request into these smaller, parallel requests also has systemic benefits for the DB: new writers are able to sneak in, which isn't possible when you try to do everything in one go (another issue we had).
The added benefit here is the optimizer in the DB is more likely to do the right thing for the subtable requests than it is for the temp table mess request. A simple fetch is far easier to optimize than a complex one.
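For illustration, a rough Rust/sqlx/tokio sketch of that access pattern (table names, type ids, and columns are all invented; the real system above is not necessarily Rust): one parent query, then concurrent per-subtable fetches keyed by the ids that came back.
use std::collections::HashMap;
use sqlx::PgPool;
async fn load_batch(pool: &PgPool, parent_ids: Vec<i64>) -> sqlx::Result<()> {
    // 1. One query against the parent table: just ids and type discriminators.
    let parents: Vec<(i64, i32)> =
        sqlx::query_as("SELECT id, type_id FROM parent WHERE id = ANY($1)")
            .bind(parent_ids)
            .fetch_all(pool)
            .await?;
    // 2. Group the ids per subtype on the app server.
    let mut by_type: HashMap<i32, Vec<i64>> = HashMap::new();
    for (id, type_id) in parents {
        by_type.entry(type_id).or_default().push(id);
    }
    // 3. Fire the per-subtable fetches concurrently; the pool gives each its own connection.
    let person_ids = by_type.remove(&1).unwrap_or_default();
    let company_ids = by_type.remove(&2).unwrap_or_default();
    let (people, companies) = tokio::try_join!(
        sqlx::query_as::<_, (i64, String)>("SELECT root_id, name FROM person_sub WHERE root_id = ANY($1)")
            .bind(person_ids)
            .fetch_all(pool),
        sqlx::query_as::<_, (i64, String)>("SELECT root_id, vat_number FROM company_sub WHERE root_id = ANY($1)")
            .bind(company_ids)
            .fetch_all(pool),
    )?;
    // ... merge `people` and `companies` back onto the parent rows as needed.
    let _ = (people, companies);
    Ok(())
}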
> However, from an efficiency standpoint both for the application and the DB the most efficient action is to first request from the parent table and then turn around and, in parallel, send out requests for the child tables of the various types as needed.
This is not true. It is generally considered an anti-pattern.
The fundamental reason it is not true is because it is generally orders of magnitude faster for the DB to do a join or subquery within the same query, rather than perform an entire query, output it, transfer potentially large amounts of data across the network, process that, do the whole thing in reverse, etc.
I don't know how you learned that queries should be split up like that, but it is generally horrendous from a performance standpoint. There is no "correct" way to do it that can compensate for the massive overhead. The correct way is to do it all in a single query with joins and subqueries (including possibly correlated subqueries) whenever possible.
Perhaps you learned this pattern from a services architecture, where it is correct because the pieces of data all sit in different services. But when all the data resides in the same database, splitting up queries is not generally something you want to do unless circumstances force you to.
> This is not true. It is generally considered an anti-pattern.
By who?
> it is generally orders of magnitude faster for the DB to do a join or subquery within the same query, rather than perform an entire query, output it, transfer potentially large amounts of data across the network, process that, do the whole thing in reverse, etc.
If you are talking about a straight relationship, then yes, this is true.
IE
select a from foo join bar on a=b join baz on b=c
However, when you start talking about the exact scenario I put forward, it becomes slower. Primarily because the single thread/connection handling the request sometimes has to store a non-trivial amount of data in memory while it is doing all the query processing. Especially with some conditional logic in there that makes it hard for the DB to return anything until after it's collected the entire dataset.
This becomes extra true when you start dealing with wide datasets that have a large amount of reuse throughout the dataset.
If Foo has a Bar, and there are 10 million Foo and 1000 Bar used throughout them, then it's faster, lighter on the network, and less data-intensive to load up Bar separately from Foo.
> I don't know how you learned that queries should be split up like that
Profiling, benchmarking, and common sense. Furthermore, if you look at how NoSQL Dbs operate, you'll notice they all came to exactly the same conclusion WRT performance.
> but it is generally horrendous from a performance standpoint.
Says you.
> There is no "correct" way to do it that can compensate for the massive overhead. The correct way is to do it all in a single query with joins and subqueries (including possibly correlated subqueries) whenever possible.
You are VASTLY overestimating the "massive overhead" of separate queries. Assuming your DB isn't in literally another country, it'll be ms at most for the round trip. For some applications dealing with small amount of data that may be a game changer, but for the apps I work on and their scenario that's peanuts compared to the actual data fetching time.
With our joining tables, we aren't sending back the full dataset. We are sending back the ids to load, which correspond to the subtypes to load. You can stuff a ton of 64-bit values into 1KB on the request. Further, with some RDBMSes there are ways to structure those 64-bit values so they correspond to the subtable's clustered index (which we do).
The only added overhead is resending the ids. Which, again, I'll point out is peanuts in a modern networking setup. I've benchmarked it, what actually takes the most time in terms of db interaction is authentication when a connection needs to be redone. Everything else pales in comparison.
> But when all the data resides in the same database, splitting up queries is not generally something you want to do unless circumstances force you to.
Correct, in the exact scenario I laid out circumstances force us to. These are not narrow tables, few types, or small datasets that we are dealing with.
But even if they were, there's legitimate reasons to consider this approach. For example, when the joined table is mostly static and highly cacheable then it'd make sense splitting it from a general join to store off in a cache store. That's sort of the whole reason memcache exists as a product.
You seem to be taking some kind of extreme example of a particular use case of yours and trying to make it a general rule. But it's not.
And none of this is "says me", it's standard practice, it's relational databases 101. And none of this is about NoSQL, it's about relational databases. NoSQL performance can be abysmal for trying to do things relational databases are meant for.
And the overhead is not about network latency, it's about all of the overhead involved in serializing, transferring, deserializing, and then doing it all over again in the other direction.
Your comment seems to boil down to:
> If Foo has a Bar, and there are 10 million foo and 1000 Bar used throughout them, then it's faster, less network, and less data intense to load up bar separately from foo.
I assume you're not retrieving 10 million Foo for the user, god forbid -- you're retrieving 20 or 50 or something user-friendly. Then you should join to Bar. It is slower and more overhead to load up Bar separate from Foo. It is an anti-pattern.
If you are getting results to the contrary, your query may not be written correctly -- e.g. you are joining 10 million rows of Foo to Bar in a subquery without a WHERE clause, and then only applying the WHERE at a higher level (in which case one solution is to move the WHERE clause into the subquery). Or your tables may not be architected suitably for the queries you need to perform, and you need to revisit your normalization strategy.
Again, there are super-super-complex queries where yes it becomes necessary to split them up. But that is not the "rule", it is not the starting point -- it is what you do only when you've exhausted all possible options of keeping it in the query. It is never a recommendation of how to use databases in a general sense, which is what you are suggesting.
> use case of yours and trying to make it a general rule
This is a fair critique. Definitely our system is a bit unique in what it works with and the amount of random data it needs to pull together.
> it's about all of the overhead involved in serializing, transferring, deserializing, and all the way back.
Serializing and deserializing are typically not a huge cost in DB communications. Most DB protocols have binary data transfer protocols which minimize the amount of effort on server or client side needed to transform the data into native language datatypes. It's not going to be a Json protocol.
Transfer can be a problem, though, if the dataset is large.
> I assume you're not retrieving 10 million Foo for the user, god forbid
In our most extreme cases, yeah we are actually pulling 10 million foo. Though a lot of our ETL backend is where these big data requests are happening as the upstream data is being processed. That's primarily where I end up working rather than the frontend service.
And I'll agree with you. If you are talking about requests which result in the order of 10 to 100 items then yes, it's faster to do that all within the database. It depends (which is what I've been trying to communicate throughout this thread).
> you are joining 10 million rows of Foo to Bar in a subquery without a WHERE clause
No, properly formed SQL. The issue is the mass of data being transferred and, as I mentioned earlier, the temporary memory being stored in the DB while it waits to transfer everything to the application.
Splitting things into the smaller and multiple queries ends up being faster for us because the DB doesn't end up storing as much temp data, nor does it end up serializing a bunch of `null` values which ultimately take up a significant chunk of the transfer.
Also, now that you are talking about query structure, you should recognize that there is no universal answer to the best/fastest way to structure a query. What's good for PostgreSQL might be bad for MSSQL.
> The received wisdom is, of course, to lean on the DB as much as possible, put all the business logic in SQL because of course the DB is much more efficient. I myself have always been a big proponent of it.
This is not received wisdom at all and the one edict I have when leading a project is no stored procedures for any OLTP functionality.
Stored procs make everything about your standard DevOps and SDLC process harder: branching, blue-green deployments, and rolling back deployments.
There's a different reason to lean on the DB: it's the final arbiter of your data.
Much harder to create bad/poisoned data if the DB has a constraint on it (primary, foreign, check, etc) than if you have to remember it in your application (and unless you know what serializable transactions are, you are likely doing it wrong).
Also you can't do indexes outside of the DB (well, you can try).
Replying to this whole sub-thread, not just this post specifically;
All SQL advice has to take _context_ into account. In SQL, perhaps more than anywhere else, context matters. There's lots of excellent SQL advice, but most of it is bound to a specific context, and in a different context it's bad advice.
Take for example the parent comment above; in their context the CPU of the database server is their constraining resource. I'm guessing the database is "close" to the app servers (i.e. low network latency, high bandwidth), and I'm also guessing the app developers "own" the database. In this context moving CPU to the app server makes complete sense. Client-side validation of data makes sense because they are the only client.
Of course if the context changes, then the advice has to change as well. If the network bandwidth to the server was constrained (cost, distance etc) then transporting the smallest amount of data becomes important. In this case it doesn't matter if the filter is more work for the server, the goal is the smallest result set.
And so it goes. Write-heavy systems prefer fewer indexes. Read-heavy systems prefer lots of indexes. Databases where the data client is untrusted need more validation, relation integrity, access control - databases with a trusted client need less of that.
In my career I've followed a lot of good SQL advice - advice that was good for my context. I've also broken a lot of SQL "rules" because those rules were not compatible, or were harmful, in my context.
So my advice is this - understand your own context. Understand where you are constrained, and where you have plenty. And tailor your patterns around those parameters.
I think the conventional wisdom is to lean on the DB to maintain data integrity, using constraints, foreign keys, etc.; not to use it to run actual business logic.
>It's worth noting that this approach has significant dangers for execution performance
This extra WHERE id=$1 clause makes it behave differently from the slow examples you cited from the Markus Winand blog. The query planner should notice that the id column is a selective index with high cardinality (and may even be the unique primary key). The query optimizer can filter on id to return a single row before the dynamic NULL checks of $2, $3, $4, avoiding a full-table scan.
The crucial difference is the id clause doesn't have an extra "OR id IS NULL" or "$1 IS NULL" -- like the blog examples.
Note that null is being used here in the "unknown value" sense in order to combine several possible queries into one query plan, which is a bad idea for a query you need to be performant: $1 can be null (which is possible in the original Stack Overflow question that this blog post is a reaction to), and if the parameters are passed in as bound arguments in a prepared statement (not uncommon), then the query plan won't necessarily take the passed parameters into account when the plan is decided.
You are right. If that’s the query you need to write, you’ll be ok.
That said, I don’t think I’ve ever had occasion to write a query quite like that. I’ve written
select * from blah where id in (1,2,3…) and condition
or
select * from blah where condition1 and condition2
but never a query quite like this. Do you know of use cases for it?
Given that most queries don't look like that, I think my criticism is reasonable. For most use cases, this query will have performance downsides, even if it doesn't for some very narrow use-cases.
Record selector to drive a data grid. Ex: Filter employees by location, or active/terminated, or salary/hourly, etc. and let the user choose one or many of these filters.
That sounds like the developer use case. Data scientists doing ETL and analyzing messy data with weird rules like the ones above are common (although the id is usually a contains/in to handle lists of rows that don’t fit any of the conditions but must be included).
I’ve had to do some weird things to clean up data from vendor databases in several industries.
It really depends on your query engine, this would be considered a "catch all query" in SQL Server, and you're going to have really bad parameter sniffing blowing you up, you do not want to do this usually.
I would expect the query plan for SQL Server to essentially return records matching `id` first (which again should be a situation where uniqueness comes into play) and then perform the rest of the execution on the subset that matches, which is hopefully a single row.
I leave allowances for `id` to be a stand-in for some other identity column that may represent a foreign key to another table. In that case I'd still expect SQL Server's query planner to execute it as: the initial set is the rows where said column matches the supplied number, then the remaining logic is applied to that subset. In fact I'd love to see where that isn't the case against a transactional DB.
If you include a clustered index/primary key on every query, you might be right, but in practice there's no benefit to include any other params if you have the pk, the catch all query is easy to modify into something that supports batches AND single values (which is where it really gets bad for parameter sniffing) and that's what I usually see in production causing problems.
I have hundreds of queries like this in production (Go + Postgres + pgx), and don't have issues leveraging the right indexes. Make sure when using prepared statements, you have custom query plans per query for this via `SET plan_cache_mode = force_custom_plan`.
These optimizations are trivial for Postgres to make at optimization/plan time, so there's no runtime hit. But as always, profile and double check. You're definitely right that assuming can get you in trouble.
I don't have experience with the other databases to speak to their quirks. But for my specific setup, I haven't had issues.
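That setup is Go + pgx, but for the Rust/sqlx readers of the article a rough equivalent (a sketch only: the connection string is a placeholder, though `plan_cache_mode` itself is a real PostgreSQL 12+ setting) is to apply the SET on every pooled connection:
use sqlx::postgres::{PgPool, PgPoolOptions};
use sqlx::Executor;
// Sketch: ask Postgres to plan prepared statements against the concrete parameter values.
async fn connect() -> sqlx::Result<PgPool> {
    PgPoolOptions::new()
        .after_connect(|conn, _meta| Box::pin(async move {
            conn.execute("SET plan_cache_mode = force_custom_plan").await?;
            Ok(())
        }))
        .connect("postgres://localhost/app_db")
        .await
}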
AFAIK, that's only a risk if you're using prepared statements.
If you're just running each query individually, the parser should be smart and ignore the Boolean clauses that can be skipped, and use the appropriate indexes each time.
But yes, if you're trying to optimize performance with prepared statements, you almost certainly do not want to follow this approach if any of columns are indexed (which of course you will often want them to be).
This may be a situation where you classify the parameters and create a handful of queries that include or exclude each class.
The same way we always have a distinct query for selecting by ID, you have one with just username, one with demographics, one with account age or activity ranges, and then C(3,2) combinations of categories.
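A minimal sketch of that enumeration (the parameter classes and SQL text here are invented): each combination gets its own fixed query text, so each gets its own plan.
// Hypothetical parameter classes for a user search.
enum UserFilter<'a> {
    ByUsername(&'a str),
    ByAgeRange { min: i32, max: i32 },
    ByUsernameAndAge { username: &'a str, min: i32, max: i32 },
}
// Each variant maps to its own static SQL string.
fn query_for(filter: &UserFilter) -> &'static str {
    match filter {
        UserFilter::ByUsername(_) => "SELECT * FROM users WHERE id = $1 AND username = $2",
        UserFilter::ByAgeRange { .. } => "SELECT * FROM users WHERE id = $1 AND age > $2 AND age < $3",
        UserFilter::ByUsernameAndAge { .. } => "SELECT * FROM users WHERE id = $1 AND username = $2 AND age > $3 AND age < $4",
    }
}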
At least for MSSQL: never do this (before learning about query plan caches). Or at least, if you do, add OPTION (RECOMPILE) to the query.
For each combination of parameters to search for you may want to use a different index.
But... the query plans are cached by query string lookup!
So it is imperative that your search string looks different for each query plan/index being used.
The code suggested here will pick a more or less random index (the one optimized for the parameters of the first execution) and stick with it for the remaining executions, leading to bad queries for combinations of non-null parameters that don't match the first query.
You could just add a comment inside the string that was different depending on what parameters are null, but that is no less complex than just generating the query.
PS: Of course there are situations where it fits, like if your strategy is to always use the same index to do the main scan and then filter away results from it based on postprocessing filters. Just make sure to understand this issue.
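For what it's worth, a tiny sketch of that marker-comment trick (purely illustrative; as noted above, at this point you may as well generate the real query):
// Give each parameter combination a distinct query text so each gets its own cached plan.
fn tag_query(base: &str, username: Option<&str>, min_age: Option<i32>) -> String {
    format!(
        "{base} /* variant:{}{} */",
        if username.is_some() { "u" } else { "-" },
        if min_age.is_some() { "a" } else { "-" },
    )
}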
I debugged an app a couple of years ago that from time to time brought the entire MSSQL server down. The server had to be physically restarted. Nobody could figure out for years what was going on; all the queries had been analyzed and checked for missing indexes, everything was profiled... except that the app generated queries like this, which did not go through the cached plan well.
All of these are simple, almost unrealistic queries. Show me how to handle optional joins in the filter.
> My naive-self in the past used to create a fancy custom deserializer function that transformed 11,22,33,44 from a String into a Vec<i64> and that is useless work that could have easily been handled by the database.
Great, now the database has no idea what the cardinality of the IN clause is and has to generate a sub-optimal plan, because it could be 1 or it could be 10000.
This article just gives me the impression that the Rust query builder has terrible DevEx.
Why is adding the clause and binding variables two calls and not one? The lack of variadic functions makes this clunky but you could limit people to 3 binds per clause and that would cover 95% of people. Formatting and cognitive load both crap out around 3 anyway.
While clunky to implement, variadic calls are mostly a solved problem in Rust by way of tuples and macro rules. It's ugly to implement as a library author (generate K different declarations using macro expansions, where K is the limit of the variadic expansion), but IMO that's secondary to the library user experience, which is fine with this approach.
Depends what you mean by optional join- if you mean you want the flexibility to build up a list of columns you actually want data from on the fly, you would probably have good results from doing left joins to the tables that contain those columns while ensuring you have a unique column (PK, Unique constraint) you're joining to. Your query engine should be smart enough to realize that it can avoid joining that table.
The logic is that you've ensured the query engine that it'll never return more than one row from the optional table so if you're not returning any actual columns from that table there's no need to think about it. Without the unique constraints the query engine has no idea how many rows may be returned (even if you aren't selecting the data) so it still needs to go through the work.
I don’t get the point of this article. Just reading the samples, I strongly dislike this query builder because it looks flaky and difficult to parse by eye. And the examples get worse and worse.
This isn’t an argument against query builders, that just seems like an argument to make your query builder easier to use and understand. I wouldn’t argue against programming languages by picking bad C++ libraries.
It seems to me the article shows the opposite. SQL builders are useful precisely so you don't have to write fragile raw SQL riddled with noisy filter patterns and repeated numbered placeholders that are easily broken during refactoring. With raw strings it's also impossible to compose queries from abstracted parts.
Yep, it's quite handy for complex reporting. The original motivation was to disentangle about 1.5k lines (SQL plus the corresponding Python code to assemble the queries) of almost identical ClickHouse queries. There were two big, nearly identical data tables, each with its own dictionary relations, and a bunch of reports with complex grouping and filtering -- around 20 report instances. Each report was a separate monstrous SQL statement.
Schema changes were quite frequent, and the reports started to rot to the point of showing wrong data.
After refactoring it became 200 lines and, thanks to the increased generality, it can serve more reports.
It's a small app with an attempt (and a goal) to model persistence without an ORM. I think it suits that quite well: it can be fully type hinted (with some brute force, though) and is somewhat less verbose in this particular context.
But I also see how it could become maintenance hell for medium/large-scale apps.
Yeah! With SQLAlchemy as a query builder I can easily find all usages of a particular column with Find Usages action in PyCharm. With strings it’s not as easy, and some columns are not really greppable (id, author_id, user_id, etc).
Also, I haven't seen how sqlbind handles optional JOINs, and especially optional LATERAL.
Whenever we do recursive, or really long queries (not often), we’re dropping to basically raw SQL with sqlalchemy.text(). It’s really a lot to keep it wholly in SQLAlchemy: write a complex query in SQL, convert to SQLA, and remember how all the moving parts should be connected.
Eventually people will have enough of Little Bobby Tables and url spoofing and then query engines won’t allow string concatenation at all.
The only alternative I know of is to make a query engine that exactly emulates the String Interpolation syntax of the host language and can detect string concatenation in the inputs.
But the problem with non-builders is always going to be GraphQL and advanced search boxes, where there are any of a couple dozen possible parameters and you either build one query that returns * for every unused clause or you have a factorial number of possible queries. If you don’t use a builder then Bobby always shows up. He even shows up sometimes with a builder.
Something I've run into a lot over the years is people not realising that (at least in MySQL) prepared statement placeholders can only be used for values, not identifiers like column names.
Because many query builders abstract away the creation of a prepared statement, people pass variables directly into column fields and introduce injection vulns.
Number one place I see this is data tables: you have some fancy table component where the user can control which columns to see and which to sort by. If you’re not checking these against a known good allow list, you’re gonna have a bad time.
Yes, it varies by database whether placeholders can be used for table names.
Personally I find table names sufficiently non-dynamic that an enum of accepted values & string concatenation works. Whenever I've wanted to template user input into table names, I've realised that I should probably refactor my schema or add some views.
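A minimal sketch of that allow-list approach (the column names are invented): untrusted input is mapped onto an enum, and only the enum's own strings are ever concatenated into SQL.
// Only these exact strings can reach the query text.
enum SortColumn {
    CreatedAt,
    Username,
}
impl SortColumn {
    // Map untrusted user input onto the allow list; anything else is rejected.
    fn parse(input: &str) -> Option<Self> {
        match input {
            "created_at" => Some(SortColumn::CreatedAt),
            "username" => Some(SortColumn::Username),
            _ => None,
        }
    }
    fn as_sql(&self) -> &'static str {
        match self {
            SortColumn::CreatedAt => "created_at",
            SortColumn::Username => "username",
        }
    }
}
fn order_clause(col: SortColumn) -> String {
    format!("ORDER BY {}", col.as_sql())
}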
Your post made me realize that custom string interpolation handlers from C# could probably be used to do injection-safe SQL querying (https://learn.microsoft.com/en-us/dotnet/csharp/advanced-top...). Makes me wonder whether scenarios like that were considered when they built that feature or if it's just a fun coincidence from making string interpolation extensible.
Interpolated string handlers are a newer API made primarily with customization and efficiency of interpolation in mind. They are more complicated to implement and use but enable zero-cost and/or zero-allocation use cases and performing custom behavior on appending literals or data. They are also used to implement the default string interpolation handler and a few auxiliary ones like Append handler for a string builder which writes directly into it, bypassing intermediate string construction.
It's quite a mouthful to say, but very useful for advanced scenarios. I haven't seen any other language provide anything comparable.
JavaScript added 'tagged templates' a while ago, which allow you to write a function that gets passed the 'parse tree' of a template string, so JS libs do a lot of 'interpolation magic' for SQL queries.
I mean, in C++ (17? 20? Whenever constexpr was introduced) it's totally possible to create a library that allows you to build a SQL query via the language's string concatenation libraries/etc., but only allows you to do it with static strings unless you use ~shenanigans. (C++ unfortunately always allows ~shenanigans...)
I guess you do wind up needing to potentially re-implement some basic things (or I guess more complex, if you want format string support too). But for basic string concatenation & interpolation, it's reasonable.
That's a pretty useful way to get basic string concatenation while also preventing it from creating opportunities for SQL injection.
For example, you have a class that requires a constexpr input & can be appended to/concatenated/etc.:
SqlStringPart(constexpr ...)
operator+(SqlStringPart ...)
(so on)
And you have a Query API that only takes SQL string expressions that are built out of compile time constants + parameters:
SqlQuery(SqlStringPart ..., Parameters ...);
This doesn't solve the problem mentioned in the article around pagination & memory usage, but at least it avoids letting someone run arbitrary SQL on your database.
I see a lot of pushback against this approach. And since it is something I've been experimenting with recently, this is pretty interesting stuff. Clearly it has issues with query planning getting messed up, which is not something I had been aware of, since the DB I've been experimenting with is still only in the tens of thousands of rows. But...
Using raw SQL files gives you:
1. Very difficult for devs to expose SQL injection vulnerabilities because you need to use parameters.
2. Having all available filtering dimensions on a query makes it very clear what the type of filtering is for that particular query.
3. Easy debugging where you can just throw your query into an SQL client and play around with the parameters.
4. Very clear what the total query footprint of you application is (e.g. files all neatly listed in a dir).
5. Super readable and editable.
6. Code for running the SQL is pretty much: here is my query, here are my params, execute.
7. Etc?
So the amount of good you can get out of this approach is very high IMO.
So, an open question to anybody who is more familiar with DBs (and Postgres in particular) than myself: is there a reliable way to address the issue with this approach to querying that you all are flagging as problematic here? Because beyond the query planning issues, raw SQL files (with no building/templating) just seem to me like such a better approach to developing a DB access layer.
This is the kind of anti-pattern that can work on toy or small projects but doesn't scale well to larger projects or groups.
> 1. Very difficult for devs to expose SQL injection vulnerabilities because you need to use parameters.
You should use parameters either way.
> 2. Having all available filtering dimensions on a query makes it very clear what the type of filtering is for that particular query.
Code is easier to document well than a SQL query
> 3. Easy debugging where you can just throw your query into an SQL client and play around with the parameters.
Query builders will give you a query you can do the same thing with.
> 4. Very clear what the total query footprint of you application is (e.g. files all neatly listed in a dir).
This seems like a design/organization choice that is separate from whether those files are query or code.
> 5. Super readable and editable.
Doesn't scale as a project grows, you end up with massive unwieldy queries or a bunch of duplicated code across a bunch of files.
> 6. Code for running the SQL is pretty much: here is my query, here are my params, execute.
It is pretty much the same with a query builder, in either case the 'execute' is calling a library where all the actual stuff happens.
If you know your project is gonna stay small with simple queries and your scope won't creep, raw SQL files might be the right choice, but they will create technical debt as the project grows. It's worth the time in the long run to get comfortable with a query builder.
Thanks for summing this up! I'm also in the thousands of rows space at the moment and that's probably why I've fallen in the query planning trap that many pointed out.
That's neither more nor less comprehensible than what I often see in python's built-in DSL within list comprehensions.
At least the SQL variant has the excuse of being designed back when language design was still in its infancy. The madness in Python's list comprehensions and the ad hoc DSL in ruby has no such excuse.
`push_bind` covers a good deal of the concerns for a query builder, while letting us think in SQL instead of translating.
That said, an ORM like ActiveRecord also handles joins across related tables, and helps avoid N+1 queries, while still writing consistent access to fields.
I find myself missing ActiveRecord frequently. I know SeaORM aims to address this space, but I don't think it's there yet.
These (avoid N+1, join across related tables) seem like problems that could be solved by writing the SQL by hand. Is it that much of a lift to treat the database like an API and just write a thin library/access layer around it?
ORMs seem like they are a good fit for dynamic queries, where the end user, not the programmer, is developing the models.
For established software where performance matters, hand-writing the SQL is reasonable.
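For the simple cases it is indeed not much of a lift; a sketch of the hand-written join (table and column names invented) that sidesteps the N+1 pattern in one round trip:
use sqlx::PgPool;
// One query with a join instead of one follow-up query per row.
async fn posts_with_authors(pool: &PgPool) -> sqlx::Result<Vec<(i64, String, String)>> {
    sqlx::query_as(
        "SELECT p.id, p.title, u.username
         FROM posts p
         JOIN users u ON u.id = p.author_id",
    )
    .fetch_all(pool)
    .await
}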
Hand-writing SQL for, say, a faceted filtering UI is a tedious job that takes most of a day in even fairly simple cases, or about 20 minutes with a decent ORM.
ActiveRecord (and related libraries like ActiveAdmin) are _amazing_ for rapid prototyping - eg if you don't even know whether you're going to end up keeping the faceted search.
> For established software where performance matters, hand-writing the SQL is reasonable.
These things aren't mutually exclusive though. Every ORM I know gives you an escape hatch to write whatever SQL you want. ORMs are great for 90% of things, and as a reviewer I don't need to scrutinize their queries too much. It's much easier for me to review an ORM builder query because I know it's going to do the correct joins on the correct columns. For example, in the ORM I use I'd rather see:
query()
    .where()
    .eq("parent", parent);
Instead of:
"select * from table join parent on parent.id = table.parent_id where parent.id = :parent"
Since you’re not returning anything from `parent`, it makes much more sense to use a semijoin, which is something ORMs usually bury in an obscure section of docs, if they support them at all.
SELECT * FROM `table` t
WHERE EXISTS (
  SELECT 1 FROM parent p
  WHERE p.id = t.parent_id
    AND p.id = :parent)
Or, you know, just eliminate the other table entirely (which the optimizer may well do) since p.id = t.parent_id = :parent
You’re completely missing the point while also completely making my point.
The ORM is going to do the correct thing here, while the SQL I quickly typed out will work, but does the inefficient thing and requires more manual review and back and forth in discussions.
I disagree. The point you're making is predicated on not understanding SQL. If you know an ORM well, and don't understand SQL, then of course it will be easier to review. I would however argue that if you don't understand SQL, then you can never truly understand an ORM, in that you can't know what the DB is capable of doing. You'll probably get lucky for things like `WHERE foo IN (...) --> WHERE EXISTS` translations that the DB's optimizer does for you, but you also probably won't even know that's happening, since in this scenario you don't understand SQL.
ORMs typically do an OK job at producing OK queries, in that they're unlikely to be the worst possible option, but are unlikely to be optimal. This is largely driven by schema decisions, which, if you don't understand SQL, are unlikely be optimal. The clunkiest, least-performant queries I've ever dealt with were always rooted in having a poorly-designed schema.
Well, query builders are composable. You can create a builder with a partial query and reuse it in many queries. With SQL strings, you either have to copy-paste the string or define SQL functions. It's a trade-off!
My last job had a strong "no query builders or ORMs" policy, but of course then we wanted composability, so we had in-house, half-baked implementations of both that were full of injection bugs and generated incorrect queries with miserable performance.
That's not to say there's never a place for "keep your queries as SQL files and parameterize them", just that I think your point is 100% valid: if you're unaware you're making tradeoffs, you'll at some point experience some downsides of your chosen system, and to alleviate those you might start building the system that would fit your use case better, totally unaware of the fact that you eschewed an existing, off the shelf system that would do what you want.
Good point, even though copying strings isn't hard. Figuring out where in the string to inject new parts isn't always as easy. You end up with `select_part`, `where_part`, etc.
Making identical updates to copies of the strings when a bug is discovered is hard though. People who act like it isn’t hard create most of the evidence that it is.
Tried to explain ORM composability at work (without praising ORMs like a fanatic); most didn't care, and someone said to pass context dicts for future openness... weird.
Having had the same argument at work in the past, I feel your pain. Trying to migrate away from massive SQL files and random strings here and there to a collection of interdependent composable SQL builders is apparently a tough sell.
From experience, this goes from many little piles of hell to sprawling piles of hell. Obviously the current situation isn't good, but the "collection of interdependent composable SQL builders" will turn into an insane system if you roll things forward 5 years. Everybody will yolo whatever they want into the collection, things in the collection will almost match a use case but not quite, and you'll get 90%-similar components.
Obviously that is just one person's experience. But I'd take a single crazy SQL file any day of the year, because its insanity is scoped to that file (hopefully).
But I'd agree the random strings are no good. Maybe refactoring them into an enum either in the code or in the DB would be a good step forward.
My experience is this sort of thing is cyclical. You start with one approach, you use it a lot, you start noticing the flaws, oh here's a brand new approach that solves all these flaws... and the cycle repeats.
I don't know if it's still true, but some databases used to be able to process prepared statements more efficiently. We ran into a bottleneck with Oracle 9i where it could only execute queries currently in the query cache. Someone fucked up our query builder so a bunch of concurrent requests weren't using the same prepared statement when they should have been.
SQL isn't composable. It would be great if it was, but it isn't. So we can use query builders or write our own, but we're going to have to compose queries at some point.
Common Table Expressions enable a lot of composability. Using them can look like you're asking the DB to repeat a lot of work, but decent query optimizers eliminate much of that.
When I want to make a complicated intermediate query that other queries can reference I create it as a view.
I will admit that future me hates this sometimes when I need to dismantle several layers of views to change something.
And some people hate having logic in the database. Personally I tolerate a little logic, but I understand those who don't like it. Anyway, common table expressions (WITH subquery AS (...)) are almost as composable as views, and have the benefit of being self-contained in a single query.
Common table expressions do exist, and they compose ~alright (with the caveats that you're limited to unique names and they're kinda clunky and most ORMs don't play nice with them).
SQL is terrible at allowing this sort of transformation.
One benefit of PRQL [disclaimer: maintainer] is that it's simple to add additional logic — just add a line filtering the result:
from users
derive [full_name = name || ' ' || surname]
filter id == 42 # conditionally added only if needed
filter username == param # again, only if the param is present
take 50
I never looked into PRQL; does the ordering matter? If not, that would be great; i.e., is this the same:
from users
take 50
filter id == 42 # conditionally added only if needed
filter username == param # again, only if the param is present
derive [full_name = name || ' ' || surname]
? That's more how I tend to think and write code, but in SQL I always jump around in the query since I don't work in the order SQL works.
I usually use Knex or EF or the like, where ordering doesn't matter; it's a joy. However, I prefer writing queries directly as it's easier.
That would `take 50` of all `users` records in some arbitrary order, and only afterwards filter that result by username and id? I hope that's the right answer and the PRQL authors are sane.
I use SQLAlchemy and just generate a pydantic model that specifies which fields are allowed and what kind of filtering or sorting is allowed on them. The bonus is that the resulting generated TypeScript client and the use of the same pydantic model on the endpoint basically make this a validation issue instead of a query-building issue.
I recently used SQLAlchemy for the first time and was delighted it has something I’ve always wanted in Laravel’s Eloquent: columns are referenced by identifiers on the models rather than plain strings.
Seeing `.where(Foo.id == Bar.foo_id)` was a little jarring coming from a language where `==` cannot be anything but a plain Boolean comparison, but it’s nice to know that if I make a typo or rename a field, it can be picked up statically before runtime.
It's definitely not Rust specific, or even that new. Certainly I was hearing things (e.g. SQLAlchemy's lower level API) referred to as query builders 15 years ago.
Using the OR approach can actually cause some headaches. It can cause SQL Server to make a suboptimal plan for other queries which have the same query text but, due to the parameters, behave completely differently.
Indeed. I was about to write to the author about it by mail, but noticed that he posted this link and will probably read these comments, so I might as well provide the feedback by joining this subthread: at least a few readers are not happy about our scrollbars being hidden.
I've hidden the scrollbar because I did not like the push-to-the-left it causes when going from certain pages to others. Am I breaking accessibility features for some?
The push-to-left doesn't matter. Most people will read one thing on the site so they're not navigating between pages anyway, most of the remainder aren't clicking through pages fast enough to even notice the jump, most of those that do notice it know how scrollbars work, and the remainder is you :)
I guarantee nobody will complain about the jump, but they will (did) complain about disabling basic browser functionality.
If the jump really bothers you, you can replace your rule with
html {
overflow-y: scroll
}
which should force scrollbars to appear on every page whether they're needed or not. But you don't need it.
Not exactly accessibility features, but a more general usability, even for people without disabilities. I personally noticed the lack of a scrollbar when I wanted to check how far through the article I am, but another common use case for those is to actually scroll through the document (particularly for skimming of larger documents, for which mouse wheel, space bar, or Page Down key are too slow, or in more rare situations when they are not easily available).
Unfortunately tinkering with visual presentation tends to conflict with the principle of least surprise, user settings, or even basic functionality.
The lack of expressiveness in query builders that the author refers to in their first post as a motivation for ditching them is an easily solvable problem. It seems like most ORMs have easily solvable design issues though, and I would definitely agree that you should ditch tools that get in your way. What I've been doing is sporadically working on an _experimental_ Golang ORM called Trance, which solved this by allowing parameterized SQL anywhere in the builder through the use of interfaces. e.g.,
The whole query builder is an optional abstraction built onto the ORM. It has many benefits, but will also never completely support every SQL variant in existence, as is true with all query builders. so I feel as though the only responsible way to build ORMs is with escape hatches, and this is one of them.
There was an essay a couple years ago that really convinced me to not use query builders, https://gajus.medium.com/stop-using-knex-js-and-earn-30-bf41... , and from that I switched to using Slonik (by the author of that blog post). There were some growing pains as the API was updated over the years, especially to support strong typing in the response, but now the API is quite stable and I love that project.
It's pretty much impossible not to end up with a lot of repeated spaghetti code, if you are doing anything beyond a really trivial single user app.
Even for simple stuff, like each user only having permission to see parts of the database, it's essential to have a systematic way of filtering that is composable.
I'm not a fan of ORMs and I actually like SQL, and yet I have been using the SQLAlchemy expression language (the low-level part of SQLAlchemy) for many, many years and wouldn't really go back to SQL strings.
With only 4 optional Params you can just have 15 queries. Heh
I remember back when everything was someone's idea and others would both compliment it and improve it. Now it is like things are unchangeable holy scripture. Just let `Null < 42 or Null > 42 or name = Null` all be true. What is the big deal? I can barely wrap my head around joins; the extra luggage really isn't welcome. Just have some ugly polyfills for a decade or so. All will be fine.
It seems to me a big part of the problem is that the "query builder" in TFA is little more than a string builder. In the .Net world I've used SqlKata [0] and been very pleased with it. It allows you to easily dynamically build and compose queries.
I completely disagree. I love .NET Entity Framework Core. It's possible to build queries in code with a SQL-like syntax and a lot of helpers. But it's also possible to provide raw SQL to the query builder. And the top notch feature: You can combine both methods into a single query.
Everything has its place though. Query builders and ORMs require some effort to keep in sync with the database schema. Sometimes it's worth the effort, sometimes not.
I don't know Rust well, is this what's known as a query builder in Rust? That's weird to me, because in other typed languages that I know, query builders are typically typesafe and don't just concatenate strings (see e.g. jOOQ for the JVM).
This is weird. When you say “query builder” I’m thinking of something associated with an ORM so it already knows the table specifics and you don’t have to initialize it with “SELECT * FROM table”.
Avoid pushing business logic into SQL. Broadly, if a solution is "be more sophisticated or nuanced", you're raising the bar for entry. That's one of the worst things to do for development time across a set of collaborators.
will drive anybody crazy. For a query builder though, you should write something table driven where for instance you have a hash that maps query names to either functions or objects
gets broken into "username" and "gt" functions, and the where_username function gets these as arguments in the operator and quantity fields. Easy-peasy, wins at code golf if that's what you're after. Your "field" can be a subselect statement if you want to ask questions like "how many pictures are in this photo gallery?"
This is the kind of code that Lisp wizards wrote in the 1980s, and there's no reason you can't write it now in the many languages which contain "Lisp, the good parts."
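A rough Rust rendition of that table-driven idea (the field names, operators, and dispatch shape are invented for illustration):
use std::collections::HashMap;
// Map a field name to a function that renders a WHERE fragment for a given operator.
type ClauseFn = fn(op: &str) -> Option<String>;
fn where_username(op: &str) -> Option<String> {
    match op {
        "eq" => Some("username = $1".to_string()),
        _ => None,
    }
}
fn where_age(op: &str) -> Option<String> {
    match op {
        "gt" => Some("age > $1".to_string()),
        "lt" => Some("age < $1".to_string()),
        _ => None,
    }
}
fn clause_table() -> HashMap<&'static str, ClauseFn> {
    HashMap::from([
        ("username", where_username as ClauseFn),
        ("age", where_age as ClauseFn),
    ])
}
// A key like "username_eq" splits into ("username", "eq") and is looked up in the table.
fn clause_for(key: &str) -> Option<String> {
    let (field, op) = key.split_once('_')?;
    let table = clause_table();
    let f = table.get(field)?;
    f(op)
}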
Even then, if you actually need a good query builder for Ruby that manages more complex cases than AR does, then Sequel is there and is so, so powerful.
Ah yes, the SQL injection cycle begins anew. A solved vulnerability for decades, only for the new generation of junior devs to ignore wisdom of the old generation again and introduce it anew.
Don't ever do this. Query builders exist to sanitize inputs in a failsafe manner. SQL has so many pitfalls that tools like sqlmap [1] exist for a reason. You will never be able to catch all encoding schemes in a regex approach to filter unsanitized input.
The examples in the blog can be exploited with a simple id set to "1 or 1=1;--", which is literally the very first web exploitation technique taught in highschool-level CTFs.
sqlx can mitigate a lot of problems at compile time, but sanitization is completely ignored in the post, and should at least be mentioned. If you recommend to juniors that they don't need a query builder, tell them at least why they existed in the first place.
Because type correctness does not imply branch correctness. SQL has side effects of interpretation, and any string/query builder that is not aware of grammatical implications should be avoided in my opinion.
I clearly don't understand something about implications. Could you please elaborate or give a link to read about it? What is branch correctness? How could it be exploited? How does sanitization prevent it?
sqlx looks like a usual builder; I don't see anything criminal about it.