SQL: The difference between WHERE and HAVING (sql-bits.com)
117 points by FedericoRazzoli on Dec 14, 2021 | hide | past | favorite | 132 comments



Nothing of essence in this post. I had a look at the other posts; they all look similar. I greatly dislike this new trend of "doc-spamming", where everyone makes a blog writing walls of text explaining trivial things, crowding out the actual docs from search results.

I swear to god whenever I search for something in the python standard library I have to scroll past a bunch of sites like this before I find the actual docs.


A much better resource, if you are new and want to understand the difference between WHERE and HAVING, is here, near the bottom:

https://www.postgresql.org/docs/14/tutorial-agg.html


Relevant snippet:

> It is important to understand the interaction between aggregates and SQL's WHERE and HAVING clauses. The fundamental difference between WHERE and HAVING is this: WHERE selects input rows before groups and aggregates are computed (thus, it controls which rows go into the aggregate computation), whereas HAVING selects group rows after groups and aggregates are computed. Thus, the WHERE clause must not contain aggregate functions; it makes no sense to try to use an aggregate to determine which rows will be inputs to the aggregates. On the other hand, the HAVING clause always contains aggregate functions. (Strictly speaking, you are allowed to write a HAVING clause that doesn't use aggregates, but it's seldom useful. The same condition could be used more efficiently at the WHERE stage.)
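The quoted rule can be checked end to end with Python's built-in sqlite3; the table and numbers below are invented for illustration.

```python
# Demonstrating WHERE (filters input rows before grouping) vs HAVING
# (filters group rows after aggregation) with a toy table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 5), ("north", 50), ("south", 30), ("south", 40)],
)

# WHERE removes rows BEFORE the aggregate is computed, so north's 5
# never enters the sum (order of result rows may vary).
where_rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales WHERE amount > 10 GROUP BY region"
).fetchall()

# HAVING removes whole groups AFTER the sums exist: north sums to 55
# and is dropped, south sums to 70 and is kept.
having_rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region HAVING SUM(amount) > 60"
).fetchall()
```

Running the two queries against the same data makes the "before vs after aggregation" distinction concrete: the WHERE variant yields ('north', 50) and ('south', 70), while the HAVING variant yields only ('south', 70).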

Thank you for linking this, it's very concise and actually helpful.


The recipe writing community is leaking into the developer community.


Coming soon to a dev recipes site near you: post after post starting with a four paragraph retelling of a trip to the countryside, the packing for which made the writer think of custom aggregation functions...


I see you would like a recipe for developers?

https://wikidiff.com/spam/spam

https://www.codegrepper.com/code-examples/python/frameworks/...

EDIT2: replaced statement of fact about WHERE/HAVING with this message.


I've been looking for a decent application where I can import recipes and add my own, keeping them private (because I didn't make them, but also allowing them to be released if I did invent them and want to share). A kind of Git for recipes, if you will. I've seen a few, but they always lack in one way or another: bloated Electron crap, small database or bogus content, or terrible UI.


A lot of those feel like they're automatically generated. Highly formulaic structure, repetitive, borderline nonsense parroting of the "common wisdom". If they're actually written by humans, I feel sorry for them.


Probably non-native English speakers, such as Indians (an example being "The Hacker News"...).


This is a problem with the search engines, not the person who writes the website. (And I question why you wouldn't use the rather well organized docs.python.org for looking things up in the standard lib.)

It may be time to admit that general purpose search engines are fundamentally broken and that the only useful way forward is subject-specific search that allows users to be in control of their results.


> It may be time to admit that general purpose search engines are fundamentally broken

I've been saying that since I first heard someone pitching SEO while eating lunch. Once you start gaming an index for visibility, Tragedy of the Commons goes into full swing, and the Yellow Pages ironically become a truer representation of actual local state. To this day, the memory of that...individual just going on and on, casually pitching what is no less than intentional pollution of the search structure of the collective knowledge of humanity, still causes a hot nugget of incandescent rage and disgust deep within my soul. Before now, the thought of forgiving them has never entered my mind, and in this minute it is quickly dismissed. I didn't know the person; they were themselves probably just trying to get by as best they knew how. But I just can't seem to let it go. Poisoning the well like that... It either requires the most appalling sense of shallow callousness, profound ignorance, or a complete disregard for the integrity of our species' collective informational output. It's the burning of the Library of Alexandria by way of salesmen vying for attention. As time goes on the disgust only becomes more pointed, as the practice is normalized and refined to the point that the way information is organized and made accessible on the Internet makes a bigger statement about what society really values than the indexed content itself does.

I'd normally say live and let live, but I'm getting increasingly bothered by the suffering and extra cognitive load foisted on the world by ads, marketing, etc..., and the lack of societal wherewithal to push back and say enough is enough.

Anyway, this rant brought to you by Libraries! Support your local branch today! Strike a blow for open access!


I don't think it's completely a problem with search engines. Blogs like this engage in some amount of SEO that may push a fairly low-quality entry up in the results, and that is a never-ending arms race with the search engines.

The bloggers themselves initiate the issue, and I don't see how subject-specific engines could completely solve that. Why would they be able to distinguish this article from a well-written entry on a SQL topic that is not just a restatement of fairly basic knowledge? Docs don't cover everything; there's space for articles like this that follow the same basic structure but with more useful information. And once people can infer a little bit about the search engine's algorithm from how it shows results, then we're back to the SEO arms race.


I agree it feels more and more that general purpose search engines are no longer the tool for looking for code documentation.

But I wouldn't go as far as to say that the person who writes the website is not the problem. The content in these sites is usually borderline spam.


100% agree that general purpose search engines are broken, the problem I see is that search functions on subject specific sites are often not indexed well (Google has had 20+ years fine tuning this after all).


> I greatly dislike this new trend of "doc-spaming" where everyone makes a blog writing walls of text explaining trivial things, crowding out the actual docs from search results.

The thing is, documentation is (usually) written very dense, with a lot of assumptions about knowledge. The PostgreSQL documentation on WHERE vs HAVING is decent (https://www.postgresql.org/docs/8.1/tutorial-agg.html), but the MySQL documentation (https://dev.mysql.com/doc/refman/8.0/en/select.html) is ... a lot to read and not very explanatory for someone who only rarely deals in SQL.

Documentation writing is an art and most projects don't have the resources to have actually working documentation, much less understandable documentation - which is part of the reason why w3schools, stackoverflow, quora and random SEO "blogs" get so much traffic.


To be fair, the actual Python docs are also pretty verbose when you’re looking for quick answers.


True, but the Python docs aren't also trying to tell you a life story at the same time.


I haven't written any SQL in many years so I found it an informative read. I'm glad HN surfaced it to me; it's right at the fringe of my SQL knowledge and grew the domain.

I think your problem is with search engines and not the writers of these articles, who probably don't consider you to be the target audience.


This is why I love devdocs.io. I think it's supported by freecodecamp.org


That's great! I had made up my mind to actually build this myself.


wow, this is a great resource!


I've had the opportunity to watch developers search for docs in different languages and python has surprisingly bad SEO for the official docs. Ruby and Golang do a lot better, and JS usually ends up at MDN or W3.

(This is not a comparison between the languages, just an oddity about the docs)


That's why I always search with a site:python.org filter.

I created special "search engines" on my browser, so I start typing ":p" and whatever I write next it automatically searches using the site:python.org filter.


> I swear to god whenever I search for something in the python standard library I have to scroll past a bunch of sites like this before i find the actual docs.

The standard library docs have their own search. Plus there is always help()


I swear to god I have the same problem as you do. These websites, which don't offer any additional value, have taken over all the search rankings.


It’s less awful than the “hey guys look at the thing I just learned!” blog posts that are so embarrassing to read.


There's no good reason for using two different keywords for filtering and it harms composability. But of course SQL isn't very composable anyhow.

Preferably, SQL would look more like:

  FROM Foo f
  WHERE f.value < 10
  JOIN Bar b on b.id = f.bar_id
  WHERE b.other_value > 20
  SELECT f.group, f.value
  GROUP BY f.group the_group AGGREGATING SUM(f.value) the_sum
  WHERE the_sum > 100
  ORDER BY the_group
And so on. FROM introduces a source of data which flows into the next line. SELECT does a projection, GROUP BY applies aggregation operators on groups and also does a projection. WHERE could be inserted pretty much anywhere.
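That pipeline reading can be mimicked with ordinary composable steps; a rough Python sketch of the same query shape, with invented toy data and shrunk thresholds (the sum filter uses 20 instead of 100 so the tiny dataset produces a result):

```python
# Each clause of the hypothetical query becomes one pipeline step over
# plain dicts; the point is that a WHERE (filter) can slot in anywhere.
from itertools import groupby

foo = [{"bar_id": 1, "group": "a", "value": 7},
       {"bar_id": 1, "group": "a", "value": 8},
       {"bar_id": 1, "group": "a", "value": 9},
       {"bar_id": 2, "group": "b", "value": 5},
       {"bar_id": 1, "group": "a", "value": 50}]
bar = [{"id": 1, "other_value": 25}, {"id": 2, "other_value": 5}]

rows = [f for f in foo if f["value"] < 10]                 # WHERE f.value < 10
rows = [f for f in rows                                    # JOIN Bar b ON b.id = f.bar_id
        for b in bar
        if b["id"] == f["bar_id"] and b["other_value"] > 20]  # WHERE b.other_value > 20
rows = [(f["group"], f["value"]) for f in rows]            # SELECT f.group, f.value
rows.sort()                                                # (pre-sort so groupby works)
groups = [(g, sum(v for _, v in vals))                     # GROUP BY ... SUM(f.value)
          for g, vals in groupby(rows, key=lambda r: r[0])]
groups = [(g, s) for g, s in groups if s > 20]             # WHERE the_sum > 20 (toy threshold)
groups.sort()                                              # ORDER BY the_group
```

Here the "b" rows with other_value 5 and the "a" row with value 50 fall out at their respective filter stages, leaving a single group ("a", 24).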


If you want to see an incredibly successful data manipulation system that works exactly as you describe, you should check out R's dplyr package immediately. To me it's the perfect composable SQL API, but hardly anyone knows about it outside of R/RStudio/tidyverse aficionados!

It also cross-compiles to SQL and can thus be used to build queries, although the result is not necessarily the most efficient. Someday I hope someone with expertise in compilation and query optimization will take a look at dplyr and figure out how to get it to generate optimally efficient SQL.

If we did that, and also ported dplyr to some other languages, programmers might never have to hand-write SQL again. (LINQ might be a good way too, but I know much less about it)


Kind of sad Hadley Wickham isn't as prolific as a paper writer anymore. These papers [1] were, as its nature, very consumable even for someone outside R community. I liked them a lot.

[1] https://scholar.google.com/citations?user=YA43PbsAAAAJ&hl=en...


Totally agree... but if I have to choose between him writing papers and spending more time perfecting APIs, there's no question. The man is one of the most brilliant interface designers ever to live.

Dplyr was already great in its first version, but Hadley and company have worked incredibly hard to make it much, much better. It's hard to think of another system that has improved so much from such an already high starting point.


That's fair point. I'm mostly on Python so I cannot complain.


If you like dplyr in R, you really should check out Linq. Dplyr as nice as it is doesn't remotely compare to how sensible and ergonomic Linq is.


I've heard! I don't have much occasion to use Linq as a data scientist, but will definitely keep an eye out if I ever write C#, etc.


Yup. To be blunt: SQL is a flawed old language that should've been replaced decades ago, but every time somebody tries to replace it they throw the baby out with the bathwater and throw out the relational theory too.

It was a fine language for its time. So was Pascal, and FORTRAN. We don't use those anymore generally for a variety of good reasons.

SQL ANSI nulls alone are a good-enough reason to toss it. Any language where "x = x" commonly returns an object that is treated as falsey is broken.

edit: I'm getting "You're posting too fast" so I can't continue this conversation, but no, being able to express "unknown" is not worth throwing out the reflexive property of equality. That is a bad trade-off and I will die on this hill.


> SQL ANSI nulls alone are a good-enough reason to toss it. Any language where "x = x" commonly returns an object that is treated as falsey is broken.

Nah. One of the most common uses of null is to communicate missing values in the sense of "this entity has this attribute but we don't know what its value is". This is a fundamental feature needed in data management.

So say you've got data on two balls A and B. You don't know what the color of either ball is. Is color(A) = color(B)? What's the best answer to this question?

Obviously it's "we don't know", and the value we use to communicate "we don't know" is null.

SQL nulls aren't broken. They're just a third option outside of True/False that needs its own handling, because reality is complicated.

PS: Contrary to popular opinion in certain coding circles, floating point arithmetic isn't broken either. It's a beautiful, delicate tool that works the way it does because it must in order to support its use cases (e.g. numerical linear algebra and representing continuous physical quantities). If it doesn't work exactly the way you want, you may need a couple guardrails or a different system like fixed point arithmetic.
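The "we don't know" answer can be observed directly in any SQL engine; a minimal check with Python's built-in sqlite3:

```python
# NULL = NULL evaluates to NULL ("unknown"), not to true or false,
# and a WHERE clause only keeps rows whose condition is true.
import sqlite3

conn = sqlite3.connect(":memory:")

# sqlite3 surfaces SQL NULL as Python's None.
unknown_eq = conn.execute("SELECT NULL = NULL").fetchone()[0]

# Because NULL is not true, the comparison filters the row out entirely.
kept = conn.execute("SELECT 1 WHERE NULL = NULL").fetchall()
```

The first query returns None rather than a boolean, and the second returns no rows at all, which is exactly the "third option outside of True/False" behavior described above.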


> One of the most common uses of null is to communicate missing values in the sense of "this entity has this attribute but we don't know what its value is"

The correct way to represent this, which has been widely used for several decades now, is sum types. If your value might be unknown, you need to explicitly represent this in the type of your value (e.g. Optional<Foo>) instead of relying on a hack that allows subverting the type system.


You're approaching this at too low a level. SQL is a 4th generation language, quite possibly the only successful 4GL. Unlike every other major language, it doesn't deal with "how" you get the answer. You see the "how" with commands like EXPLAIN. Instead it focuses on "what" you want and letting the engine figure out how to get there.

Optional<Foo> lends itself well to functional programming and prevents things like null pointer errors, but that's still at 3GL level, the EXPLAIN level. SQL doesn't throw null pointer exceptions. Ever. The query syntax is correct or it isn't. The same concerns in a programmatic language simply do not apply, which is why Optional<Foo> is a poor fit in this context.

Null in SQL is not the null in JavaScript or C++. Same name, but not the same animal, and I think this is what trips folks up. In your mind, replace SQL's null with a new keyword, UNKNOWN, and it makes more logical sense.

SELECT ... FROM ... WHERE x = UNKNOWN;

Even if x were UNKNOWN in a row, is it the same unknown? For example, if you see two girls, but you don't know their birthdays. Their birthdays are unknown.

WHERE Keasha.birthday = Cynthia.birthday

This partial statement would always be false no matter what until you know both of their actual birthdays. Just because something is unknown doesn't mean it's the same unknown.

Set theory, not a programmatic series of steps as found in most languages. And that's what SQL is: a DSL for set theory.

I believe it's advantageous to avoid nullable columns in relational schema design, but that's a separate issue from the concept of NULL in SQL as distinct from programmatic languages.


> I believe it's advantageous to avoid nullable columns in relational schema design, but that's a separate issue from the concept of NULL in SQL as distinct from programmatic languages.

Agreed, but they should be used where appropriate. My favorite example of a nullable column is an end date. Until an end date occurs, it's unknown - thus null. Representing the unknown end date with anything other than null is semantically incorrect, and in many cases can lead to errors.


Supporting sum types is not in any way incompatible with being a query or logic language.


I agree, and more sophisticated analytics data structures, like R's data frame, do have "typed nulls" (except in R they're called NAs).

I don't know, and would love to know, why databases have been so slow to adopt this.


You could argue any SQL value not marked as NOT NULL is an Optional<Foo>, with NULL being the None value. It's not really subverting the type system, it's just an unconventional choice.


Yes, but I can't think of any language where, given Optional<Foo> x and y, x.equals(y) would return null instead of either true, false, or an exception if y is null.

That's what makes SQL null-values distinct - the fact that expressions that return booleans can return NULLs.


In the relational model used by SQL, types are atomic. Sum types are inherently non-atomic. I'm sure you could graft them onto the language, the same way Postgres can have JSON type fields and all that, but I don't think you'd be able to really integrate them into SQL without fundamentally changing what kind of language SQL is.


The query algebra is completely compatible with sum types. I don't think "atomic" is describing anything central to the QA model.


If 3-value logic is such a good idea, why doesn't any modern general-purpose language use it?

And even if it were a good idea, leaving the user with no built-in operators, not even a simple COALESCE(bool) to properly manage the null complexity? That's unacceptable.

Given a SQL boolean expression `f(b)` that can return true, false, or null, how do I test if `f(b)` has returned NULL? How do I handle that NULL case?

ANSI SQL gives users a strange new paradigm for boolean algebra and then says "GLHF" and gives you zero tools to handle the situation, not even linguistic features that would let you roll your own boolean-based functions, since you can't write SVFs that take boolean parameters.

I mean this trashboat of a language doesn't even let you bind boolean to a variable! You have to deal with the PITA of converting from Booleans to bits because those are totally different things.


> If 3-value logic is such a good idea, why doesn't any modern general-purpose language use it?

Many languages have a null/undefined of some sort, but they are often simply coerced to false. For a language whose sole purpose is to manipulate data, though, the null value nearly always has meaning.

> Given a SQL boolean expression `f(b)` that can return true, false, or null, how do I test if `f(b)` has returned NULL? How do I handle that NULL case?

http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt

Section 8.6, the null predicate.

`where f(b) is null`
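The null predicate is easy to exercise with Python's sqlite3; the table and column below are invented, with the column standing in for the result of `f(b)`:

```python
# The IS NULL predicate (SQL-92 section 8.6) is the one test that
# treats NULL as a first-class answer instead of propagating it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (b INT)")  # b stands in for f(b)'s result
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (0,), (None,)])

# Picks out exactly the NULL case: one of the three rows.
null_count = conn.execute(
    "SELECT COUNT(*) FROM t WHERE b IS NULL"
).fetchone()[0]
```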

> ANSI SQL gives users a strange new paradigm for boolean algebra and then says "GLHF" and gives you zero tools to handle the situation.

It's also not a strange new paradigm, and is in fact built on foundational topics. Relational algebra and set theory are not new concepts.


The term 3-value logic pretty obviously implies the behavior of operators as they relate to NULL, not just the fact that data is nullable, of course.

AFAIK MySQL, Oracle, and SQL Server don't have the Boolean datatype as a variable/column type - only Bit, which lacks boolean semantics, and so Booleans are not considered data values, and therefore can't be used with IS NULL or COALESCE.


Because you can implement 3-valued logic using boolean logic in a general-purpose programming language if you need it, and most applications don't benefit from the extra nuance and complexity. But data management sure does, as I illustrated in my post with the color example.

Notice that databases, with their 3-valued logic, are written in a systems language like C/C++, a general-purpose programming language with boolean logic.

Some systems are best thought of using non-boolean logic. Quantum systems are another example. You can simulate just about anything with a Turing-complete language, but it doesn't mean that standard computation with boolean logic is the best way to think about every domain.


> Because you can implement 3-valued logic using boolean logic in a general-purpose programming language if you need it, and most applications don't benefit from the extra nuance and complexity.

Surely if it were beneficial you'd see more people doing it, but I have literally never encountered software outside of SQL that does it. I could probably implement it in C++ or C# in an afternoon using enums and operator overloading, at least for the boolean-boolean operators. A little longer to make comparison operators for the other basic types that would yield nulls when compared to null. But nobody does this. Why?
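For what it's worth, that afternoon project really is short; a minimal Kleene-style three-valued logic in Python, with the class and member names invented for this sketch:

```python
# Kleene's strong three-valued logic via an enum and operator overloading.
from enum import Enum

class TV(Enum):
    FALSE = 0
    UNKNOWN = 1
    TRUE = 2

    def __and__(self, other):   # AND takes the minimum truth value
        return TV(min(self.value, other.value))

    def __or__(self, other):    # OR takes the maximum truth value
        return TV(max(self.value, other.value))

    def __invert__(self):       # NOT swaps TRUE/FALSE, keeps UNKNOWN
        return TV(2 - self.value)

# SQL-like behavior: UNKNOWN propagates unless the other operand decides.
assert (TV.TRUE & TV.UNKNOWN) is TV.UNKNOWN
assert (TV.FALSE & TV.UNKNOWN) is TV.FALSE   # FALSE dominates AND
assert (TV.TRUE | TV.UNKNOWN) is TV.TRUE     # TRUE dominates OR
assert (~TV.UNKNOWN) is TV.UNKNOWN
```

The min/max encoding is the standard trick: ordering FALSE < UNKNOWN < TRUE makes the Kleene truth tables fall out of two one-liners.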


> Surely if it were beneficial you'd see more people doing it, but I have literally never encountered software outside of SQL that does it.

Do you see SQL as some weird little edge case of data management? 99.9% of all serious data management happens in a database server with a SQL interface. Maybe other languages don't implement 3-valued logic because they don't need it, because they don't need to do data management, because they'd be bad at data management.

It's not even true that languages outside SQL don't have 3-valued logic. They may not have it built into the base language, but analytics packages like Python/pandas and R certainly have some version of a null or NA, and nobody would question the value of it for a second. And databases are typically implemented using C/C++, so there's another example of someone using a general-purpose programming language to implement 3-valued logic (as part of the SQL implementation that is written in C/C++).

It's not like SQL is some weird separate language on a different planet as your favorite general purpose programming languages. It's implemented with those languages. So there's your 3-valued implementation right there. When the feature is valuable, it's implemented, and not before.


> Nah. One of the most common uses of null is to communicate missing values in the sense of "this entity has this attribute but we don't know what its value is". This is a fundamental feature needed in data management.

I rarely use nulls like that. And then only for things outside of SQL databases like lazy loading or caching. And even then I still prefer null == null to be true.

If nulls are for what you have written, then why are we calling them "null" - no value? Why not call them "unknown"? Even then, what is the benefit of introducing non-boolean logic to a relational query language? I have never seen anyone use any of three-valued logic's benefits when writing SQL. Every time I get an unknown logical value, it means I have made a mistake in my query, which would be better handled by raising an error.


> I rarely use nulls like that.

Well, what can I say? Other people need them, and I illustrated why.

> And even then I still prefer null == null to be true.

Some databases have alternate operators, like <=>, that make null == null true. And you can always coalesce null to a sentinel value if your dialect doesn't have this operator. Heck, if you're writing SQL using templates, you could have the template language replace a <=> b with coalesce(a, 99999999) = coalesce(b, 99999999) (or whatever other sentinel value you want to use).
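As a concrete example, SQLite spells its null-safe operator IS rather than <=>; a quick check of both the operator and the coalesce rewrite (the sentinel value is the arbitrary one from the comment above):

```python
# Two ways to make a null-to-null comparison come out true.
import sqlite3

conn = sqlite3.connect(":memory:")

# SQLite's null-safe equality is spelled IS (MySQL uses <=>).
null_safe = conn.execute("SELECT NULL IS NULL").fetchone()[0]

# The coalesce-to-sentinel rewrite: both NULLs collapse to the sentinel.
sentinel_eq = conn.execute(
    "SELECT coalesce(NULL, 99999999) = coalesce(NULL, 99999999)"
).fetchone()[0]
```

Both queries return 1 (true), unlike a plain `NULL = NULL`, which returns NULL.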


The problem, though, is that SQL standard treats NULL as unknown or as missing. One word for two meanings.

For example, AVG(col) and COUNT(col) treat it as a missing value (NULLs do not alter the average or the count), but 1 + NULL treats it as an unknown value (the result is unknown). The complete list of NULL-related inconsistencies would be long.

It would have been better to use different words for different semantics, or not to implement NULL at all. Also, in some cases it can lead to poor query performance.
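The two meanings are easy to demonstrate side by side with Python's sqlite3, using an invented toy table:

```python
# NULL as "missing" in aggregates vs NULL as "unknown" in arithmetic.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col INT)")
conn.executemany("INSERT INTO t VALUES (?)", [(2,), (4,), (None,)])

# Aggregates skip NULL: the average is over 2 and 4 only, and
# COUNT(col) counts 2 rows, not 3.
avg, cnt = conn.execute("SELECT AVG(col), COUNT(col) FROM t").fetchone()

# Scalar arithmetic propagates NULL: the result is unknown.
plus = conn.execute("SELECT 1 + NULL").fetchone()[0]
```

Here `avg` comes back 3.0 and `cnt` comes back 2, while `plus` comes back as None, illustrating the missing-versus-unknown split in one table.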


I emphatically reject the idea that having an imperfect null is worse than no null at all, but I understand frustrated programmers may have different beliefs.


Yup. The semantics are great for joins. Two missing bits of data do not make a found.


Then why doesn't it error out when faced with NULL?

I mean, we hear over and over again when learning SQL "NULL bool and FALSE are different things"...

But the SQL engine itself treats NULL and FALSE as equivalent all the time.

    DECLARE @x INT = NULL 
    SELECT 'foo' WHERE @x = 1
If three-value logic were truly embraced, that should be an error, because we don't know the state of @x. But SQL talks out of both sides of its mouth about Booleans. It freely just assumes that null is false, which we, as developers, are never ever ever allowed to do. There isn't even an operator included in ANSI SQL to let us convert null boolean values into false, because for some reason bits and bools are two different things and you can't just say

   COALESCE(x = y, false)
It's not just that three-value-logic is painful, it's that SQL makes working with them miserable because, at a linguistic level, it seems to politely forget that three-value-logic exists.


> There isn't even an operator included in ANSI SQL to let us convert null boolean values into false, because for some reason bits and bools are two different things

“bit” is not a standard SQL type at all; it's a SQL Server proprietary 1-bit numeric type. boolean is a standard SQL type, but SQL Server doesn't support it as a storage type or consistently elsewhere (it does exist, annoyingly, in some contexts, but is limited in where it or expressions producing it can appear, which—not any limitation of standard SQL—is the source of your problem.)

> and you can't just say COALESCE(x = y, false)

In something that supports standard SQL in the relevant parts, like Postgres, yes, you absolutely can.

In SQL Server, you have to do something clumsy like:

  CASE
    WHEN x=y THEN 1
    ELSE 0
  END
But that is “SQL Server is broken and barely supports booleans even as intermediate values” not “standard SQL is broken”.

There are plenty of valid criticisms of standard SQL around NULLs, but the particular brokenness of SQL Server’s handling of booleans is not one of them.


AFAIK MySQL and Oracle have the same flaws.


MySQL and Oracle also lack standard boolean support while allowing true boolean values as results in certain positions, AFAIK, yes, although some versions of MySQL use “Boolean” as an alias for tinyint(1) and even more support TRUE and FALSE (without quotes) as aliases for 1 and 0, which confuses the issue a bit.

SQLite has a similar problem on paper, but averts the practical consequences because it doesn't use booleans natively internally in the contexts where they apply, instead natively using 0 and 1 (with TRUE and FALSE as aliases) and converting to boolean only in the narrow set of boolean contexts. This means you can use COALESCE as described (and, generally, SQLite will behave as you would expect with real Boolean support, even though it doesn't actually have it.)

DB2, from what I can find, supports BOOLEAN from v11.1.

In any case, the problem here is not a standard SQL problem, even if it is a problem of several popular SQL implementations.


I don't know why OP described SQL NULLs as 'three-valued logic' because it's anything but.

SQL boolean operators can only result true or false, there isn't a third option.

NULL is a data value for all data types, which happens to return 'true' to the boolean operation 'is null', and 'false' to any other boolean operation. NULL = NULL or NULL = 1 don't return a third-option NULL, there is no such thing, they return a plain old 'false'.

(NULL = NULL returns false because otherwise it would make outer joins terrible)

As you pointed out, SQL does not implicitly convert boolean values (bits) to or from the result value of boolean operators. Bits are values and can be NULL. Boolean results are not values and can only be true or false.

Now, is this optimal? Hell no. (Sum types would probably be the way to go if SQL were designed 50 years later.) But it's consistent.


> SQL boolean operators can only result true or false, there isn't a third option.

What...? Go type this into your favorite DB and get back to me:

SELECT TRUE AND NULL


In your example, "TRUE AND NULL" is a value, it's not a boolean condition.

What is the difference? Try this:

SELECT CASE WHEN (TRUE AND NULL) THEN 'A' ELSE 'B' END

Now "TRUE AND NULL" is a boolean condition, and it will return false, and the boolean conditional will evaluate to 'B'. If SQL had three-way logic, some expressions inside the parentheses would make the whole conditional return NULL, but it will always return either 'A' or 'B'.
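The 'B' result is easy to reproduce with Python's sqlite3 (1 stands in for TRUE here, since older SQLite builds don't accept the TRUE literal):

```python
# A CASE condition that evaluates to NULL is "not true", so the
# ELSE branch is taken.
import sqlite3

conn = sqlite3.connect(":memory:")
result = conn.execute(
    "SELECT CASE WHEN (1 AND NULL) THEN 'A' ELSE 'B' END"
).fetchone()[0]
```

The query returns 'B': the inner expression is NULL, and the conditional never surfaces that NULL as a result of its own.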


Ah, I got it now. Yeah, it's definitely weird how nulls reproduce themselves in values (when appropriate), but they get treated as zeros or false in boolean conditions. Base R has more consistent behavior, e.g. if you index into a data frame with an NA, the NA will give you a row full of NAs.


It's almost as if the people designing SQL engines realized how furiously inconvenient null boolean semantics were and chose to ignore them wherever it suited them.

I just wish they'd ignored them in one more spot: equality checking.


Because then you couldn't do a join across non-mandatory foreign keys. Practically, making all NULL comparisons error out would make them almost entirely useless.


It may be flawed, it may be old, but it clearly continues to be sufficiently expressive and useful.

Every time someone tries to replace it, it eventually comes back again. Just look at the recent NoSQL and Big Data craze, and then at where we are today, with a myriad of SQL engines.


I think GP is correct when he says "every time somebody tries to replace it they throw the baby out with the bathwater and throw out the relational theory too".

However, I would add that SQL is a very strong local optimum: a "good" language that keeps the good parts would be insufficiently different from actual SQL to convince anybody to switch.

Also people forget that querying is only half of the equation, there's DDL too.


DDL is a bit curious, because it’s less a programming language and more of a set of configuration and operational commands. It just happens to look similar to DQL, but there’s nothing really pushing that beyond the fact that DDL can rely on DQL as a subexpression (eg create view as …). You could probably say the relationship between DQL and DDL is largely aesthetic

The only reason I can think of why you often can't strictly treat DQL and DDL as totally different languages is just the total lack of interest in consistency in SQL language design generally; eg update and delete steal parts of DML syntax haphazardly and differently. Compare the syntax for update/delete referring to another table:

   UPDATE table WHERE … JOIN … 
vs

   DELETE FROM table USING … 
The lack of complexity in DDL generally (it's very wide, but not very deep; there's not much meaningful composition to be discussed) and its lack of a real relationship to DQL is probably why no one discusses it much


I'd hang on to NULLs, but I'd propagate nullability in the type system. That could mean disallowing anything but non-nullable boolean as a filter predicate, forcing the use of COALESCE for clarity and explicitness.


By far the best option would be to add support for sum types and pattern matching. Besides being tremendously useful in its own right, it would also eliminate the need for nulls.


I don't think that's right; absent data is really common and a desirable feature, and ergonomically handling it warrants built-in support.

Outer joins inherently require a concept of absence of data, so you'd need a built-in Optional type which would be nullability in all but name. And you'd want lifting of operators on the inner type to the optional type for ergonomic reasons; doing a pattern match for every selection of bar.* where bar is left-joined would be extremely painful. I think you'd end up where you started on the null front.

Sum types generally, for columns etc., maybe (pun not intended), but I'm skeptical about complex types in a relational language - structure ought to be represented as relations where possible so that everything else composes.


It would be ergonomic - you'd just use a variant of the equals operator specialized to Option<A> that behaved in the way you wanted.


> SQL ANSI nulls alone are a good-enough reason to toss it. Any language where "x = x" commonly returns an object that is treated as falsey is broken.

Seems like an attempt to throw the baby out. 3-valued logic anyone?


If 3-value logic is such a good idea, why doesn't any modern general-purpose language use it?


> If 3-value logic is such a good idea, why doesn't any modern general-purpose language use it?

Most dynamically typed general-purpose languages use a large-number-of-values logic, they just use a different formulation than SQL's 3VL. Particularly, nulls tend to be equal to each other, and only propagate from basic boolean ops where false would, not whenever they are an input. (Though IEEE 754 NaNs, which most use and are another flavor of null-like value, are not equal to each other, and, in the case of quiet NaNs, propagate a lot like SQL nulls through arithmetic operations [though not consistently through other language features].)


Optional<Boolean>

As found in a few general purpose languages.

Hell, JavaScript has true, false, null, and undefined. 4-value logic!

The real world upon which we model our code has a "true", "false", and "can't tell yet; not enough info". As inconvenient as it feels psychologically, you can't deny the fundamental reality of the third, "can't tell yet", state.


Yes, those languages have nulls, but the nulls don't display the strange operator semantics of SQL ANSI nulls.

In JavaScript, null == null yields true. "Test" == null yields false. In SQL, those expressions yield null. SQL is afaik unique in its belief that algebra involving nulls should yield nulls instead of boolean values. That's what's meant by "three value logic", and that's what I find problematic. It violates several mathematical rules, like the reflexive property of equality.
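
This is easy to verify from Python's sqlite3 shell (SQLite follows the same ANSI NULL rules here); SQL NULL comes back to Python as None:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# In SQL, = involving NULL yields NULL (UNKNOWN), not a boolean.
null_eq_null = con.execute("SELECT NULL = NULL").fetchone()[0]
test_eq_null = con.execute("SELECT 'Test' = NULL").fetchone()[0]
print(null_eq_null, test_eq_null)  # None None
# IS NULL is the operator that actually tests for NULL and returns a boolean.
is_null = con.execute("SELECT NULL IS NULL").fetchone()[0]
print(is_null)  # 1
```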


SQL doesn't have this "belief". SQL has the behavior that is most useful for querying and joining when some columns have unset values, or foreign key relations that are unset.

I understand your position, but your idealism would be strictly worse in real life. Can you imagine if an optional foreign key relation immediately made all joins fail because one was unset? Or if a join between two optional values made all NULL values join together?

The SQL NULL behavior also allows UNIQUE nullable columns to have more than one instance that is NULL, so you can have uniqueness over a column that may also be unset.

If you're a purist or an idealist, SQL NULL might be a mess, but pragmatically, it's the only behavior that is sane for SQL as-is.
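
The UNIQUE point can be sketched with sqlite3 (SQLite, like Postgres by default, treats NULLs as distinct from one another for uniqueness purposes):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (code TEXT UNIQUE)")
con.execute("INSERT INTO t VALUES ('a')")
con.execute("INSERT INTO t VALUES (NULL)")
con.execute("INSERT INTO t VALUES (NULL)")  # fine: NULL is never equal to NULL
n = con.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(n)  # 3
try:
    con.execute("INSERT INTO t VALUES ('a')")  # a real duplicate still fails
except sqlite3.IntegrityError as e:
    print("duplicate rejected:", e)
```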


> SQL is afaik unique in its belief that algebra involving nulls should yield nulls instead of boolean values.

Really? The basic operations of boolean algebra are AND, OR, and NOT.

In JavaScript, what is:

1. null && true

2. false || null

3. !null

Well, one of them produces a boolean result, but... Actually, JS has basically an infinite valued logic, because of its truthiness/falsiness dichotomy and the rules around how the “boolean” operators preserve truthy or falsy inputs; only ! of the core boolean operators always returns a boolean which is why !! is equivalent to Boolean().
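
For comparison, Python's boolean operators show the exact same operand-preserving pattern as JavaScript's (this is a Python analogy, not something claimed in the thread): `and`/`or` return one of their operands rather than coercing to a bool, and only `not` always produces a real boolean.

```python
# Mirrors the three JS questions above, with the same answers:
a = None and True   # `and` short-circuits on the falsy None -> None
b = False or None   # `or` returns the second operand when the first is falsy -> None
c = not None        # `not` always produces an actual bool -> True
print(a, b, c)      # None None True
```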


Right, but js treats all values as either truthy or falsey. Only SQL does this weird 3rd-way approach where values can be neither true nor false (except in a WHERE or ON clause where NULL is treated as false).

Like, this is a language where "X OR NOT X" is not true if X = null

JS, for all its wonky truthiness rules, still satisfies the basic premise that "X || !X" yields something truthy.


SQL NULL and IEEE 754 NaN are very similar in weirdness of how they propagate, for very similar reasons (they are both hacks for encoding “not a value in the domain” in a value otherwise restricted to the domain; they aren't actual values but information about the absence of an acceptable value.)

Most other languages treat NULL as a distinct-but-normal value. IMO this isn't better or worse than SQL, or rather, the SQL approach is better for querying in SQL, whereas the approach common in general purpose programming languages is better more generally, but forces more special case handling.
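
The NaN parallel is easy to check in plain Python (IEEE 754 floats):

```python
import math

nan = float("nan")
# Like SQL NULL, NaN is not equal to itself...
print(nan == nan)           # False
# ...and it propagates through arithmetic instead of raising an error.
print(math.isnan(nan + 1))  # True
print(math.isnan(nan * 0))  # True
# Unlike SQL's three-valued logic, though, comparisons on NaN still
# return real booleans (False), not a third "unknown" value.
print(nan > 1, nan < 1)     # False False
```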


> Hell, JavaScript has true, false, null, and undefined. 4-value logic!

You have vastly underestimated the number of values in JS’s logic.


Most modern languages have option types.


> SQL ANSI nulls alone are a good-enough reason to toss it. Any language where "x = x" commonly returns an object that is treated as falsey is broken.

So, any language using IEEE 754 floats is broken?


> Any language where "x = x" commonly returns an object that is treated as falsey is broken.

Err, that will include any language that implements IEEE floats.


"Commonly" is the operative word here. I'd say that null values are far more common than NaN.


This is basically how LINQ (in .NET) does it. It is much more composable than SQL but still implements the relational algebra.


And LINQ is converted into SQL if you do LINQ queries against a database. The queries are sometimes far from optimal, but the developer experience is quite nice.


This is pretty close to how my Julia library [0] for composable construction of SQL queries works:

    From(foo) |>
    Where(Get.value .< 10) |>
    Join(From(bar) |> As(:bar), on = Get.bar.id .== Get.bar_id) |>
    Where(Get.bar.other_value .> 20) |>
    Select(Get.group, Get.value) |>
    Group(Get.group) |>
    Where(Agg.sum(Get.value) .> 100) |>
    Order(Get.group)
There is no HAVING and you can use any tabular operator in any order. Aggregates are also separated from grouping and can be used in any context after Group is applied.

[0] https://github.com/MechanicalRabbit/FunSQL.jl


I would go one step further and make the join just

   JOIN Bar b
if you've defined the foreign keys before.

The need to spell them out again and again is one of my personal SQL pet peeves.


Building https://tablam.org I learned a bit about the relational model and found that the whole HAVING/GROUP BY design is a plain bad hack.

GROUP BY is totally the opposite of what you'd expect. It's NOT "group by"!

This is group by:

    1:
       A
       B
    2:
       A
       C
If it were like this (as in the RM) then you can bet hacks like GraphQL would not have been necessary (and it would make databases more friendly to OO langs)


GROUP BY is two things, partitioning and aggregation to a single row within each partition. The relational model doesn't really have a model for partitioning on its own so they're fused, but if they were distinct concepts it might make e.g. analytic window functions more composable and less awkward.
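
Window functions are exactly that un-fusing of the two concepts. A sketch via sqlite3 (requires SQLite ≥ 3.25 for window function support; the data is invented): GROUP BY collapses each partition to one row, while `OVER (PARTITION BY …)` partitions without collapsing.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('east', 10), ('east', 20), ('west', 5);
""")
# GROUP BY fuses partitioning and aggregation: one output row per group.
grouped = con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(grouped)   # [('east', 30), ('west', 5)]
# A window function partitions without collapsing: every input row survives,
# each annotated with its partition's aggregate.
windowed = con.execute(
    "SELECT region, amount, SUM(amount) OVER (PARTITION BY region) "
    "FROM sales ORDER BY region, amount"
).fetchall()
print(windowed)  # [('east', 10, 30), ('east', 20, 30), ('west', 5, 5)]
```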


In one interview I had, I was asked, "What are you good at?". I said, "Pretty good with SQL stuff". This was _literally_ the question they asked. It was a good jumping off point for them to probe how much I knew. I like this article's explanation. The thing that trips me up on specific databases, is whether aliases assigned in the "select" portion are available in the "group by" sections.


They're not available in GROUP BY (barring some non-standard implementations). That's because SELECT is executed after FROM, WHERE, GROUP BY and HAVING. I've made it a habit to start composing my queries in the logical order first, and rearranging them in my editor afterwards.

I learned this (and most of my understanding of SQL) from https://blog.jooq.org/10-easy-steps-to-a-complete-understand... (point 2.), which was recommended by Hadley Wickham (of R tidyverse/dplyr fame).


I always hate that aliases thing — after seeing query builders do it I've taken to using column numbers (eg. GROUP BY 1, 2, 3) in my own code if I'm grouping on a lot of computed columns. More foolproof.

SQL is not a great language. Serviceable, but not great.


Foolproof til you restructure your select and forget to change the numbers. SQL: ya just can't win sometimes.


To be fair you will have the same problem if you ever change the name of the alias.


Are there databases where the alias defined in the select part of the query is available in the group by/having section?


Postgres can reference SELECTed aliases in ORDER BY/GROUP BY clauses, but _not_ in HAVING:

    test=# create table event (who text not null,
    test(#                     what text not null,
    test(#                     at timestamptz not null default now());
    CREATE TABLE
    test=# insert into event (who, what)
    test-#            values ('Bob', 'sent a message'),
    test-#                   ('Alice', 'received a message'),
    test-#                   ('Alice', 'sent a message receipt'),
    test-#                   ('Bob', 'received the message receipt'),
    test-#                   ('Bob', 'waited patiently for a response'),
    test-#                   ('Bob', 'worried whether his message had been well-received'),
    test-#                   ('Bob', 'paced'),
    test-#                   ('Alice', 'forgot to reply for two weeks'),
    test-#                   ('Bob', 'began to panic'),
    test-#                   ('Bob', 'fled the country for Moldova in a fit of panic'),
    test-#                   ('Alice', 'finally remembered to reply'),
    test-#                   ('Bob', 'never got it');
    INSERT 0 12
    test=#   select who as "The poor sod",
    test-#          count(what) as "Events"
    test-#     from event
    test-# group by "The poor sod"
    test-# order by "Events" desc;
     The poor sod | Events
    --------------+--------
     Bob          |      8
     Alice        |      4
    (2 rows)


Really varies by implementation.

Snowflake supports it - https://docs.snowflake.com/en/sql-reference/constructs/havin...

Redshift doesn’t: https://docs.aws.amazon.com/redshift/latest/dg/r_HAVING_clau...

Databricks doesn’t: https://docs.databricks.com/sql/language-manual/sql-ref-synt...

All of them do also support teradata’s QUALIFY clause which is like another layer over WHERE and HAVING that filters windowed data.


MySQL/MariaDB certainly permits their use in GROUP BY, HAVING and ORDER BY.


Huh, I always thought HAVING was the equivalent for GROUP BY that WHERE is for SELECT, i.e. that it happened during aggregation, just as WHERE happens during data retrieval from the table. But apparently you can use HAVING instead of WHERE and it would work the same even if you don't have a GROUP BY clause (and even though it would possibly be less efficient)? Seems a bit redundant.


It's not redundant. As you point out: selecting rows _before_ aggregation and _after_ are different beasts. You can't exclude rows from aggregation with HAVING.
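
The difference shows up immediately in a toy sqlite3 session (data invented for illustration): the same threshold placed before vs after aggregation produces different sums.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE payments (customer TEXT, amount INTEGER);
    INSERT INTO payments VALUES ('ann', 5), ('ann', 50), ('bob', 50);
""")
# WHERE filters rows BEFORE aggregation: ann's 5 never enters SUM().
pre = con.execute("""
    SELECT customer, SUM(amount) FROM payments
     WHERE amount >= 50 GROUP BY customer ORDER BY customer
""").fetchall()
print(pre)   # [('ann', 50), ('bob', 50)]
# HAVING filters groups AFTER aggregation: ann's 5 is included in her sum.
post = con.execute("""
    SELECT customer, SUM(amount) FROM payments
     GROUP BY customer HAVING SUM(amount) >= 50 ORDER BY customer
""").fetchall()
print(post)  # [('ann', 55), ('bob', 50)]
```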


> You can't exclude rows from aggregation with HAVING.

In theory¹ a DB that supports skip-scanning could use feedback from some HAVING clauses to short-circuit such a scan significantly; for instance, if HAVING COUNT(*)<4 is present and a fourth row for a grouping comes up, it can discard that group and skip to the next value. This would likely only be useful for indexes containing large numbers of the same few values though.

[1] I'm not aware of any engine that does this. It might be that the time cost of assessing whether the optimisation is worthwhile² far outweighs the sum benefit over time of the saving when it is actually found to be worthwhile.

[2] Similarly, writing+testing+maintaining the code to do so might not be good use of dev time compared to other improvements


I believe I sorta did something like this recently in differential reasoner. See this code here: <insert link to the code range in differential reasoner in the file where I use the count trace to throw out ineligible ones>. It's only _kinda_ like this, though.

I will post the link later, if necessary in a comment.


I was thinking more in terms of during than before or after. And that HAVING was part of the GROUP BY syntax, and could therefore not be used without it.


Think of it this way: there is always a GROUP BY, at least implicitly. With aggregates but no GROUP BY clause, all rows form one big group; and HAVING without aggregates behaves as if you had written GROUP BY rownum(), i.e. many single-row groups.


I don't think HAVING is strictly needed, but you need to use a sub-query.

   SELECT x ... HAVING y => SELECT * FROM (x ...) t WHERE y


Yeah, you are correct it's really just syntactic sugar for that. As another comment points out, WHERE and HAVING are just σ in relational algebra.


In that sense, HAVING and WHERE being different is basically a "free" subquery without allowing subqueries.

One inner σ (WHERE) and one outer σ (HAVING).

If subqueries are allowed, then you don't need that.


You are correct, HAVING is essentially syntax sugar. But to be fair, I'm glad it exists.


Sure, and I remember the version of mySQL that didn't have sub-queries. Having HAVING has your back, then.


You are right, though (some?) DBMSs allow using HAVING on columns that appear in the GROUP BY clause.


For me the aha moment was to understand the logical order of SQL operations: https://blog.jooq.org/a-beginners-guide-to-the-true-order-of.... Never had a problem since.


SELECT department_id, count(*) AS employees_no FROM employee WHERE gender = 'F' GROUP BY department_id HAVING employees_no < 10;

This query returns departments where the number of female employees is <10, right? Rather than listing female employees from departments that have <10 employees in total?


That sounds right to me. It creates counts of female employees in each department, and then only outputs the departments where that count is below 10.


It should return an error: the SELECT part is read last by the DB, so whatever you define in the SELECT part (employees_no) is not available to the HAVING clause.

  SELECT department_id, count(*) AS employees_no FROM employee WHERE gender = 'F' GROUP BY department_id HAVING count(*) < 10;
The logical order in which the DB processes the query is:

  FROM Employee
  WHERE gender = 'F'
  GROUP BY department_id
  HAVING count(*) < 10
  SELECT department_id, count(*) as employees_no
so you apply count(*) < 10 after you filtered by gender and after you grouped by department_id
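
That evaluation order is easy to watch with sqlite3 (invented data; note SQLite is in fact one of the lenient engines that would also accept the alias in HAVING, but the standard-safe count(*) form is used here):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE employee (department_id INTEGER, gender TEXT);
    INSERT INTO employee VALUES
        (1, 'F'), (1, 'F'), (1, 'M'),  -- dept 1: 2 women
        (2, 'M'), (2, 'M'),            -- dept 2: 0 women
        (3, 'F');                      -- dept 3: 1 woman
""")
# WHERE runs first (keeps only 'F' rows), then GROUP BY, then HAVING,
# then SELECT -- so dept 2 never even forms a group.
rows = con.execute("""
    SELECT department_id, COUNT(*) AS employees_no
      FROM employee
     WHERE gender = 'F'
  GROUP BY department_id
    HAVING COUNT(*) < 10
  ORDER BY department_id
""").fetchall()
print(rows)  # [(1, 2), (3, 1)]
```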


> should return an error

Except some databases allow for that syntax. The one we use, SQLAnywhere, does.


Depends on the db. Plenty make such aliases available for having & order by.


  select category, sum(amount) as total
  from mytable
  group by category
  HAVING total > 100
is equal to (and potentially more performant than)

  select category, total
  from (
   select category, sum(amount) as total
   from mytable
   group by category
  ) as mysubquery
  WHERE total > 100
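
The equivalence can be checked directly with sqlite3 (toy data; SQLite happens to accept the `total` alias in HAVING, as this example relies on):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE mytable (category TEXT, amount INTEGER);
    INSERT INTO mytable VALUES ('a', 60), ('a', 70), ('b', 30), ('c', 150);
""")
# HAVING form: filter the groups in place.
having = con.execute("""
    SELECT category, SUM(amount) AS total FROM mytable
  GROUP BY category HAVING total > 100 ORDER BY category
""").fetchall()
# Subquery form: aggregate first, then filter with a plain WHERE outside.
sub = con.execute("""
    SELECT category, total FROM (
        SELECT category, SUM(amount) AS total FROM mytable GROUP BY category
    ) AS mysubquery WHERE total > 100 ORDER BY category
""").fetchall()
print(having == sub, having)  # True [('a', 130), ('c', 150)]
```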


On my local Postgres 13 instance, the EXPLAINation of both of those queries looks identical – which is a trap, because the path of the query planner's understanding is narrow. It's quite easy to write a subquery which must be evaluated over and over again for each row of the parent query. Grouping (either inside or outside of the subquery) is usually a good way to trigger this pathological behaviour (which sucks, because grouping is often where you want to use a subquery!).

I usually find CTEs to be a relatively tolerable middle ground of readability and performance (although I don't run from HAVING, either):

    with category_totals as (select category,
                                    sum(amount) as total
                               from mytable
                           group by category)

    select category,
           total
      from category_totals
     where total > 100;
Either way, subqueries are a code smell for me.


My fully custom website (thousands of pages per minute) has an authentication query that runs on every page—and it has a subquery inside a subquery inside a subquery. That's four layers of SQL inception. And it executes in under ten milliseconds.

Subqueries are indeed a performance red flag when the query is returning more than a few rows AND the subquery references fields outside of the subquery. Or put more simply, a subquery that will need to be executed an unreasonable number of times per run.

But so long as you aren't doing anything like that, and your database engine has a competent query planner, there should be zero difference in performance. For this reason I don't consider subqueries to be a code smell. For me a query is "smelly" when it does things which aren't obvious to a human parsing the query with their brain.


The article does a good job of explaining the difference between WHERE and HAVING. The simplest resource I've found for this is Julia Evans' "SQL queries run in this order" [0], which points out for example that SELECTs are one of the last things to run in a query.

I've managed software teams and data engineering teams, and both teams get tripped up with even moderate SQL queries. To simplify, we encouraged teams to use a clearer subset of SQL. Most HAVING can be replaced with a WHERE inside a more readable and explicit subquery. Similarly, we got rid of most RIGHT JOINS.

[0] https://jvns.ca/blog/2019/10/03/sql-queries-don-t-start-with...


WHERE folks HAVING difficulty with this concept? Sorry, couldn't resist.

But seriously, this concept doesn't seem to require as much attention as this post gives it.


HAVING is after aggregation, WHERE is before.

/thread


Is there any book that has the exact specification of this dumpster fire called SQL?

With no other language I have faced so many gotchas with implicit conversions and unexpected behaviors.


I've found that most traditional developers have a rough time with SQL, because they're trying to use it like a procedural language, which it isn't - it's relational algebra/calculus based.

Cornell has a good explanation: https://www.cs.cornell.edu/projects/btr/bioinformaticsschool...


Exactly! It's a DSL for set theory. A true 4th-gen language. Perhaps the only truly successful 4GL.


https://www.iso.org/committee/45342/x/catalogue/p/1/u/0/w/0/...

If you want to take a peek at an older revision: http://web.cecs.pdx.edu/~len/sql1999.pdf (I'm guessing this will be another case of be careful what you wish for).


Surely you've used Javascript?


Are there any resources e.g books, blogs etc. that have a low-level description of the data structures/algos involved when SQL queries are run?


You might look at SQLToy which is a DB and blog I wrote to teach how SQL processing works internally: https://github.com/weinberg/SQLToy



