SQL interview questions are an interesting counterpoint to stereotypical programming interviews: while typical algo questions in SWE interviews tend to test what's taught in academic contexts but have little real-world application; the questions in SQL interviews are more practical outside the whiteboard.
A weakness of these types of SQL questions however is that it's near impossible for the interviewer to provide help/guidance; the questions are often know-it-or-don't questions (especially involving window functions w/ uncommon syntax). A data science interviewer years ago mocked me during a whiteboard test for not knowing the BETWEEN ROW syntax for a window function.
That said, as an IRL data scientist, the amount of times I've had to do a SQL self-join in the past few years can be counted on one hand. And the BETWEEN ROW syntax for a window function.
Given the likelihood that I would have to work with the person who was interviewing me, I would count it a blessing that they show me their true colors during the interview rather than after I hire on. And if I didn't get an offer, I'd consider it a bullet dodged.
I'll terminate this by being so pedantic that we cross over into non-pedanticism.
Per descriptive linguistics (https://en.wikipedia.org/wiki/Linguistic_description), the assertion that it's wrong to start a sentence with a conjunction is incorrect. Ditto for splitting infinitives, ending a sentence with a preposition, etc.
(Descriptive linguists basically define the rules of language to be how humans actually use the damn language. It turns out such a philosophy saves you from falling into a lot of traps)
Unfortunately merely noting 'descriptivism' as if it solves the issue of linguistic prescription doesn't help that much. We prescribe all the time, like it or not, language isn't a free for all at the individual level. It's also _descriptively_ true that many people regard certain rules as inviolable, which certainly means that prescription is, in some sense, completely reasonable: comply, or we'll refuse to accept you. How does 'I'll do whatever I damn want' really handle that? It can be true that it's 'valid' on one level, while being wrong on another.
I immediately dismiss as gauche and lacking credibility anyone who even hints at a prescriptivist rule, so my acceptance goes against prescriptivism. These rules make writing unclear, stolid, dull and uninteresting for no real benefit.
In my opinion, parroting prohibitions against split infinitives, ending with a preposition, "snuck is not a word", "'they' is only plural", etc etc not only shows a profound lack of education, this shows an inability to learn, since these so-called "rules" are widely and easily known to be made up whole cloth. To my mind, this is a person who absorbed Elements of Style (perhaps even second-hand from a misinformed high school teacher) and called it a day, believing they understand how English is properly written.
"I immediately dismiss as...lacking credibility anyone who even hints at a prescriptivist rule"
Isn't this a prescriptivist rule? Imagine a highly skilled author who recognizes a younger one that misuses a semicolon or has poor grammar; the older says to the younger, "only write with perfect grammar". By doing so, the younger learns more precisely the elements of the English language and the purpose of each device (semicolon, period, comma, etc.). After, the younger becomes a much better writer, and integrates that new knowledge with his or her existing writing. In this way, prescriptivist rules help inexperienced writers hone in on their weaknesses and draw focus to some facet they lack.
Any harsh prescription is kind of a red flag for me, because you have to judge yourself by that harshness (fall on your own sword, so to speak). Or, as what usually happens, hold some hypocritical double standard or semantic loophole which pardons yourself while incriminating others.
<< Imagine a highly skilled author who recognizes a younger one that misuses a semicolon or has poor grammar; the older says to the younger >>
I cannot imagine a highly skilled author who never employs split infinitives. Show me that person.
Regarding semicolons, here is Kurt Vonnegut's opinion on them: "Here is a lesson in creative writing. First rule: Do not use semicolons. They are transvestite hermaphrodites representing absolutely nothing. All they do is show you've been to college."
Is that good advice? Perhaps for some writers, it actually is; but it's not about the semicolon, really, but about stilted, bad writing.
<< Isn't this a prescriptivist rule? >>
I was responding to the previous comment "It's also _descriptively_ true that many people regard certain rules as inviolable, which certainly means that prescription is, in some sense, completely reasonable: comply, or we'll refuse to accept you." My direct answer is "It's the prescriptivists 'refusing to accept' who should worry about acceptance, since they are failing on so many levels it's hard to know where to start"
Do you have any good recommendation for improving writing skills?
I picked up elements of style few days ago. I couldn't parse it. I was only able to understood it somewhat through intuition and prior knowledge. The sentences were too confusing, out of the place for me with added jargon and story of the author that I didn't want to know about.
Never thought of this, but you are right. Usually it’s used when you didn’t know you were dodging the bullet at the time but more of a hindsight thing.
I wrote sql for years for a forex platform used by some of the top dogs (literal). Pretty hardcore sql and had to be very fast, never used BETWEEN ROW.
I use it all the time, both in ad hoc queries and in data prep for models. Most commonly just to look into a future window from the current event (i.e. did this pass in a game of soccer lead to a shot etc). It's quite rare that I'll specifically care about X rows preceding or following but it does happen that you care about a shorter period than the entire window. Entirely possible people are using graph databases for that sort of thing, but the window syntax is nice.
I have to use it quite often because the dialect I'm using defaults on BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW which I have to manually change to BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
> A data science interviewer years ago mocked me during a whiteboard test for not knowing the BETWEEN ROW syntax for a window function.
Sad. One needs a thick skin to interview, because the manner of rejection is far more detrimental than rejection itself. I have seen friends cry unable to believe their luck or themselves. Often, due to negative bias, it takes a lot of confidence away from an individual and they start to feel inadequate, doubly so, if they had prepared super hard and yet failed because they couldn't remember a trivia.
One of the best interviewing advice I got from a mentor was, interviews aren't a pissing-off or a dick-measuring contest but often are. It changed the way I approached the interviews as an interviewee. Also, a good thing that I seeked for and got this advice way before I started interviewing candidates.
One point I've made repeatedly here on HN is that technical "interviews" have morphed into a form of entrance exam with very little oversight.
If you read about entrance exams at older institutions and professions with entrance exams (the bar, medical and nursing boards, actuarial exams, and so forth), you'll find that a they are often considered among the more stressful events in a person's academic career. They are often (and should be) very rigorous, but I do think a certain unspoken bill of rights has emerged to protect the student as well was the people conducting the exams. For example, entrance exams should have a study path, a known body of knowledge that is getting tested. They should be graded consistently and fairly, by acknowledged experts in the field. There should be a way to file a grievance, and the evaluation metrics must be "transparent" - if not the specific deliberations, then at least the general approach.
Tech interview exams have none of this. They are conducted very, very capriciously, often by people who have limited skills and experience - even if they are experts in their field (which they often aren't), they may not have any idea how to evaluate a candidate.
One basic tenet here is that you don't "mock" a candidate. Seriously, wow.
Sure, that's a fair criticism. That said, you can build multi-step interview problems with SQL (I tried to convey one or two in this doc) such that interviewers can build up towards needing a more advanced window function instead of starting there.
I've used BETWEEN ROW maybe once or twice in my career in a professional setting. Self-joins more often, but as others have pointed out window functions are more efficient here for writing dashboard ETLs, etc.
Btw, are you minimaxir who wrote gpt-2-simple? I was looking at your tutorial a month ago while putting together a solution for the Kaggle COVID-19 NLP challenge!
First, from someone with 25 years of SQL development experience, let me say while it may be true these skills could be learned in a week, most are not learned at all even by regular practitioners.
And as with all such questions I'm judging your comfort with the concepts behind the skills - in this case set theory, conditional logic, data modeling, etc.
So it's more how you say it, not what you say it. And yes it's mostly negative signal - most interviews are just to weed out completely hopeless candidates.
> Why would you ask interview questions to quiz on skills that can be learned in a week?
I'm not an SQL expert, but it's very important to understand whether something is a known unknown or unknown unknown to a candidate. So, I imagine, if the candidate would say something along the lines of "oh, this looks like something for a window function - I don't know the syntax and the nitpicks, but I could google it", it'd be completely OK. However, if he wasn't even aware that such a tool existed, it means that he would never go and learn it in a week without outside guidance, and would choose a less-fitting tool to complete a task.
One thing I mention in the doc intro is that SQL questions are one part of the data analyst interview, not all of it. If this doc helps people get through the SQL questions so that they can spend more time being thoughtful with open-ended analytics questions, that's good thing (both for interviews and for work more generally)!
Theres no shame in knowing what it is/the theory behind a convept but not knowing the syntax, especially if the interview is not for a DBA. Id be honest and ask to look it up online or move on..
I don’t think it’s fair to say these examples are know-it-or-not.
Sure OP has presented some example solutions, but most of these solutions can be built up from the basics of tabular data manipulation. Furthermore, there are multiple steps involved that would allow a candidate to show their grasp of these fundamentals and receive help from an interviewer (that doesn’t trivialise the problem).
Then you’ve only encountered shitty interviewers. The point isn’t if you know some syntax, that can be easily taught / googled. The point is how you think about the problem.
In the case of percent change in MAU, you don’t need to know how to do a self join, exactly. But rather to identify that you would want to do a self join and on what conditions would be the key
Yea, I ask the employer/manager query problem sometimes. I'm not looking for the a right answer, but rather how they think through the problem and how they think about depth bounds. Occasionally I'll have a candidate who can throw out an oracle "connect by," but it is absolutely not what I'm looking for. (Even though that's probably the "most correct" answer).
I'd say it's unlikely you know about self joins but don't know at least one syntax for it just because of how we traditionally teach SQL to others. This is the know-it-or-not part: SQL is rarely taught without syntax and syntax is used to illustrate the theories. This is in contrast to regular programming where people just pick up pseudo code eventually.
If you had a pseudo code for sets, joins, etc, it'd probably be using more mathematical symbols.
I think it depends on your data / query patterns. Self Inner/Left/Right joins are one of my most common queries. I often need to chain joins in various ways to get the desired output.
I've never heard of BETWEEN ROW, but self-joins have been a pretty common thing for me, particularly when you're doing analysis on any kind of business-centric table that (rightly) has tons of columns -- e.g. a "users" or "products" or "items" or "orders" table.
> A weakness of these types of SQL questions however is that it's near impossible for the interviewer to provide help/guidance
I've given SQL interview questions quite a few times and like to present a sample dataset or schema, and state a problem statement or expected result. Then I collaborate with the candidate to come up with assumptions, and I give gentle guidance if they are stuck or going down the wrong path.
I'm always up-front with the candidate about not looking for perfect syntax, and I am more interested in their problem solving
and collaboration skills than the actual SQL they write.
People who ask ask questions about the data and schema usually do much better than those who jump right into the solution. Just like in the real world where really nice SQL or code doesn't matter if it solves the wrong problem.
I use self-joins and windows constantly. It's way easier to have our massive redshift cluster do those calculations in parallel than to try and do it in python on a VM
I do use window functions all the time; just not with BETWEEN ROWs. (to your other point, I work with BigQuery on the DB side, but R/dplyr is great at working with window functions as well in a much cleaner manner than Python/pandas).
That was incidentally the exact question I was asked. (When I asked during the interview "I know you need to use a SUM window function but I'm not sure how you constrain it to the past 7 days", the interviewer replied "you don't know the BETWEEN syntax? Everyone knows that." And then the interview ended.)
"In a real world situation I'd have googled it and jogged my memory in less than a minute. Since I suspect you were more interested in how I would approach solving X, can you help refresh my memory on the syntax and we can pretend I looked it up to get back to what's important here?"
This seems fishy - if your data has a whole day gap somewhere for whatever reason (e.g. I'm thinking sales data where certain emergencies might result in an unusual, unplanned 'zero day') then that query silently gives a wrong result, IMHO a trailing week total should always include a specific date boundary calculation to ensure that it's really a week.
For the transactional DB sure, an analytic DB / data warehouse / or even just table would have a 0 record for the no sales day. If there are gaps in daily data, at some point they need to be filled in with 0s. You can either do that at report building or query time using a cartesian join or you can fill the nulls with 0s during the data engineering pipeline. A robust BI process should do the later.
For example, if I want to plot this trailing calculation and it has gaps in the data, then the plot will be very wrong. The DE pipeline should fill in these gaps.
Two ways to address this:
- Join to a date dimension and coalesce nulls to 0
- use RANGE BETWEEN (only available in some DBs) on a date_difference column
I have only in the last week had occasion to use `ROWS BETWEEN` in a window function, and once I did I immediately thought of dozens of places I could replace moderately complex imperative/functional code with a simple query. It's definitely not common knowledge and the expectation that you would know it should be very specific to certain skillsets, but damn once you know it, it can be a powerful tool.
Going from doing most of my analytics work in MariaDB/MySQL to Python/Pandas, and R/Dplyr, I'm amazed by how much trickier the syntax and operations are in Pandas.
They are only indeterminate when the SQL engine is given an ambiguous ordering. If you specify the ordering without ambiguity, it's entirely deterministic.
This is idiotic. Why in the world would testing for rote memorization of something anyone can look up easily be a reasonable filter for talent and experience in a programming role?
A friend of mine did numerous interviews at a large company, hours out of his time and those of the interviewers, only to be caught up by some inane SQL question asked by a know-nothing after the entire process of interviews had been completed.
Why not ask about obscure regex expressions? Better yet, how about baseball scores? Hair bands from the 80s?
It's time for the valley to get real about how to judge the merit of applicants. The present state of affairs in tech recruiting is a joke.
"Idiotic" is a strong word. The question is: if candidate A answers these successfully, and candidate B does not, is there a reason to believe that candidate A is a better fit for your SQL data analyst position?
You may say no, that any competent person can read the book and learn this stuff. But in my hiring experience there are sadly oceans of coders who, say, reach for 100 lines of buggy slow Java code to solve something that could be done in 3 lines of SQL. Once you've hired that person they do not magically turn into an efficient or self-aware programmer.
> there are sadly oceans of coders who, say, reach for 100 lines of buggy slow Java code to solve something that could be done in 3 lines of SQL
Such questions don't test for that, they test for rote memorization. I don't learn things by heart I only use rarely, and certainly not to help you with a problem I'm not causing and that such memorization doesn't fix.
First of all, in your example, the DB server is doing the other 97 lines of work that java had to do; one approach uses data structures the other uses a SDE designed for data management and access. I think most employers would prefer a developer that can do both.
Second, the most important skill to DBA or Data Scientist role and to an employer is finding someone that understands the data domain well enough to point out obvious mistakes, bad approaches and thinking to managers and provide leadership, not just access, to information.
Third, "any competent person" is just belittling a skillset that, like programming, can take years or decades to truly understand.
I agree with most of what you are saying. But sometimes it's better to go with Java code instead of creating 3 line SQL commands.
Then again, this depends a lot on your architecture. For instance, if your database is running on expensive hardware with very limited CPU and memory quotas then often it's better to export intensive problems to the application. Also a lot of the more complicated SQL commands are implemented differently in all the dialects of MySQL, MSSQL, Oracle and DB2. Sticking to simpler queries lends itself to cross platform compatibility if you need that.
I think that when you run into someone who knows these in depth things about SQL you either have a smart technically aware person or a rote learner. I think the best bet is to ask them how they would use the different stuff and why. Rote learners will hit a brick wall or answer generically.
These examples are not obscure at all. If the job involves working with a lot of complex SQL, then these are some great questions to test the candidate's deep understanding of relational queries. It will be critical in their ability to maintain and write SQL quickly. Time and accuracy is of the essence in many such roles. However, the "if" is very important and often ignored. A web developer doesn't need to be interviewed for complex SQL.
That’s because evaluating talent is really hard, and this is the best proxy we have. The dirty truth is that promotion’s is usually even more arbitrary/political. I’ve seen extremely talented people be denied deserved promotions multiple times and people so bad I have no idea why they have a job get promotions. It’s more about who you know than what you bring to the table.
It isn't the best proxy though, because it's not even trying to simulate the conditions under which real work is done (e.g. in real work, most developers will often look up details, and it's no problem at all - what matters is how they think about solving problems, that they know what they don't know, and that they fit into the team). This is a proxy for 'is this an experienced person with a strong memory?', not 'is this a qualified candidate who would perform well in the position?'. It's quite possible to fit the latter and not the former. If you're the size of e.g. Google so your problem is more to weed out bad candidates quickly rather than catch all good candidates, then sure, go ahead (false negatives are acceptable to completely avoid false positives). If you are a smaller company, then your pool of applicants is quite limited and it's much more important that you don't pass over any qualified candidates that happen to fall into your net.
Why do you assume candidates can't look information up during interviews? Karat is probably the largest provider of technical interview as a service and they allow candidates to look things up while they're interviewing. The goal is to learn how the candidate works and what they can accomplish, not simply what they've memorized.
I interviewed quite a few times when I was at google, and in very few cases would I deduct score for something that could be looked up online. I always started the interview with saying that if they wanted to look something up they could ask me, and if I didn’t know we could make it up and pretend the API existed. I frequently discussed this with others and they all did the same.
The point is absolutely not to find some technicality and use it against the candidate.
While I agree that testing someone on concepts is more favorable than testing someone on tool specifics, I absolutely think there's merit to (tool) proficiency.
If there's one thing I've noticed among the "10x" coders, then it is that they know their environments and tools like their back pockets.
This isn't rote memorization at all. They're typical real-world problems to solve, and the point is to ensure the candidate can "think their way through" the problem to come up with a solution: not to come up with a solution using a specific function or keyword.
I can't blame someone for wanting to filter for SQL expertise. I don't know if these questions are a good way to do it. Maybe they'd be good bonus questions for when you're trying to different between two good enough applicants.
Thanks for the feedback! The first section is intentionally about self-joins since they get asked about in interviews, but other people have brought up that window functions are more efficient, so I'll add in those solutions as well.
Sorry you feel that way! Thankfully my employer felt differently :)
If someone claims to be a SQL expert but hasn't bothered to read a comprehensive book cover to cover, it probably isn't because they didn't have $50, especially now that we have Library Genesis.
I've never claimed to be an expert in anything, but if I read a book on SQL from cover to cover, I'd probably still not remember wtf BETWEEN ROWS was unless I was using it daily.
-- The guy that uses C and C++ all the time and often forgets the order of the arguments to common functions.
Hell, at least 50% of my day job involves hacking together SQL Server sprocs, most of which have thoroughly used (and abused) most or all of the concepts demonstrated in the article, and I take no shame in admitting that I'm frequently double-checking syntax or looking at how other people solved similar problems.
I spend more and more of my time writing relatively complex SQL, and I too frequently look stuff up. That's what good engineers do, like all the time. I seldom leave my editor when it's TypeScript, because intellisense does most of the looking up for me. But for SQL I hit the google frequently and I'm not ashamed to admit that at all.
I guess what I'm trying to say is that having a reference book won't help you if you don't know what to look for. Perhaps the interview problems can tell you who knows what to look for and who doesn't.
Also for Q1 - what if there is gap one or more months? It seems we should first generate a month series based on the min and max months, and left join to the series to account for months with zero activity.
Sure! Practically speaking, as a data analyst, I would probably notice a missing month when plotting the trend of MAU over time.
(You can make "but what if a month is missing?" a latter part of a multi-part interview question)
Generally I would assume that data engineers would have a month of no users set to zero or that I could ask them why that's not the case and note that for future reference.
> It's better to use windowing functions, I believe it should be faster than self-join.
Your statement is making the assumption you have completely dense data and you can simply offset a number of rows to get the desired denominator. Sparse data is a very common occurrence, and now your MoM/YoY/XoX queries are completely incorrect.
Good point. I often find the easiest approach is to join against a sequence to 'unsparsify' the data, then use the window function. I'd guess this is likely still faster than a self-join, unless the data is very sparse.
I think for a lot of people, SQL is a skill that doesn't stick.
You learn enough to do the queries you need for your project, they work then you forget about them as you work on the rest of your project. These skills are perishable. Left outer join? Yeah, I knew what that was some time ago, but not anymore
The days of dedicated SQL programers are mostly gone.
I disagree. I've been programming for over 20 years, and SQL has been one of the only constants used at every job. I'm surprised how many other programmers I run into who are don't know or are even scared of learning SQL. Which is a shame because data will almost always outlive the program(s).
I used to think this was true, back in the days when I used 'Group by' mindlessly and fiddled with the syntax dumbly until I got the right answer. Once you spend some time thinking about the concepts of it all you should do much better at remembering and if you totally forget the SQL syntax that doesn't really matter (Modulo the hideously annoying random differences between the databases themselves, but that's true of anything.)
I completely disagree. A good full-stack dev knows SQL pretty well: at least well enough to be able to work through the general questions in the document. Good SQL skills are essential, and the lack of good SQL developers I would attribute mostly to ORMs that abstract it away so mediocre developers can mostly ignore it (which makes the job a whole lot harder for people that actually DO know how a relational database works, and give it the respect it deserves)
I really don't believe that to be true. For our application(s) at work, the database is always the hardest part to scale and tehrefore also the biggest performance bottleneck, and the difference between good and bad SQL can be magnitudes of order of performance impact (on the query itself and locking up the DB for other queries).
Only in a "Every database has 2 TB of memory and 64 cores"-world is SQL (and database design) a negligible skill.
What I forget about SQL is the little inconsistencies between platforms. It's a standard with a dozen carve outs. Do I use TOP or LIMIT? What's the wildcard syntax?
I think these are great. But I think there should be some representation around locking / concurrency / deadlock topics. Those tend to be the hardest because you can’t clearly recreate the right/wrong answer in a local test environment. Speaking as a person who waited far too long in his career to fully appreciate these topics, I wish I had been pushed to learn them much earlier.
Behaviors around locking and concurrency are not standard. Not only do they vary widely across RDBMS implementations, they also vary depending on how you configure a specific RDBMS implementation in many cases.
Some databases use lock structures that automatically detect and resolve deadlocks, so from a user standpoint there are no deadlocks but deadlock resolution has visible side effects that are implementation defined.
Yeah but the functions and style used in these answers vary across implementations (date_trunc, etc.)
I highly recommend the SQL Cookbook for newer SQL dev as a quick reference comparison to see how often this is the case for even trivial problems in any RDBMS.
Also, for your language I wouldn't expect all mid tier SQL devs to be able to write a recursive CTE from memory (though its useful, you can just look it up again), but something like breaking apart a query plan on your platform of choice is way more important as that muscle allows you to tell if your CTE was crap or not.
I suspect more SQL is written to support analytics than online systems. Locking and concurrency are used in a specific type of application, namely oltp.
I learned how much when I joined a big tech company. The devs don’t write sql unless processing logs in the warehouse. But everyone from PM to support to data science and marketing all write sql.
Hey, HN! Since I couldn't find a good resource online for the self-join and window function SQL questions I've encountered over the years in interviews, I made my own study guide which I'm now sharing publicly as a Quip doc. Would love your feedback or thoughts!
Great resource! I have personally found the top Leetcode SQL questions to be hit-or-miss without curation (too easy, too hard, doesn't cover topics like window functions / self-joins like I would encounter in interviews, etc.) but you're right, it is a great, interactive interview resource
I enjoyed reading this. My feedback is that although a few of the problems had links to blog posts that provided some analysis of the problem and solution most of these did not. I think if you added a short analysis or explanation of what makes the problem tricky and the thinking behind the solution it would be very beneficial.
Yeah, it's because (1) I started this doc over a year ago when I was still at Salesforce (which owns/employees use Quip) and (2) AFAIK Google Docs still doesn't have native code block support :/
With the extreme popularity of pandas, I think a lot of Python programmers would be amazed how clean and easy-to-read SQL queries look like, compared to the (downright) mess that's being written to query pandas dataframes.
...how is it ugly in SQL? Don't SELECT those columns. Or, if you need to do multiple things with the dataset sans those columns, use a CTE of a query that doesn't SELECT those columns. Or a view.
It seems like SQL is only uglier if you're used to "SELECT *"ing, which is asking for trouble--in SQL or when doing the equivalent with other row/column data manipulation tools.
Agreed! Pandas is great but at this point instead of using logic to solve data structuring tasks I often find myself googling for an optimized built-in Pandas method to help me out. Leads to less elegant code -- not sure if it is less readable though.
For the second one, it seems most natural to reach for exists, or something (I have not tried this code..)
select
node
, (case when parent is null then 'Root'
when exists (
select * from tree c
where c.parent = node
) then 'Inner'
else 'Leaf'
end) "label"
from tree
EDIT: also, in the fourth, it seems like you'd want to partition the window function, who cares about order. Something like
sum(cash_flow) over (partition by date) "cumulative_cf"
A subquery in the from-clause is usually lighter than the select-clause. I just so happen to have a table with this structure and hundreds of thousands of rows. This way runs in half the time. It is surprising, in fact, that the difference isn't wider, because Postgres's explain-command says it costs 1/100 as much.
select
t.node,
case
when t.parent is null then 'Root'
when p.parent is null then 'Leaf'
else 'Inner'
end as label
from tree t
left join (select distinct parent from tree) p on t.node = p.parent
Great article. I can't help but feel like SQL is a poor choice for some of this stuff, though. More often than not, I find it much easier to pull the raw data into memory, and use a higher level language to do these sorts of queries. I am all for knowing the intricacies of SQL, as the cost for not can be very high, but I'm curious for your opinion here.
For small datasets, pretty much any approach will work. Once you hit hundreds of millions of records (which isn't even that big of a dataset), SQL still performs well on pretty modest hardware.
In my experience, once you're hitting hundreds of millions of records, implementation details of your database engine will start to matter. A database designed for transactional workloads like Postgres will start to choke on aggregate and window functions, often taking minutes to run instead of milliseconds. A columnar database like Redshift (which exposes a SQL interface) will breeze through it without a sweat.
Of course - I was comparing SQL to loading all the data in memory and then using one's programming language of choice to do the hard work. Even a poor schema in a row-based database will outperform what I described after a certain number of records.
The first example "month over month" will never result in a large dataset if you just do count(distinct user_id) ... group by month order by month. There's only max 12 values per year.
That query is much simpler, safer, faster, no self-join or windowing necessary, and you can properly handle missing months in your dataset in your higher-level language (which the provided solution doesn't do BTW).
And you're still getting the performance boost from the DB indexes for grouping and sorting.
These are interview questions first and foremost, so always going to be in an artificial context. But for real-world work I would use SQL plus the relevant amount of aggregation/joining to pull rows into memory, then do the remaining manipulation in the higher-level lang. "The relevant amount of aggregation/joining" is obviously a moveable judgement call based on experience, but one thing I have found is that the more complex yr SQL becomes the easier it is for cardinality errors to creep in, and of course in general you can't easily see the intermediate results, so dropping into a high-level language can also be safer and less error-prone.
That's fine for tiny amounts of data, but it's also how you end up with (for example) Ruby taking four minutes to do what a carefully-crafted SQL query can do in four seconds (yes, this stuff happens a lot in the wild)
Yah, somebody handed me a ruby script that looped over a query and was causing a whole bunch of other processes to fail because the result wasn't available in a timely way. My boss was impressed by my having reduced the run time from 22 minutes to 5 by pushing that join into the database.
I work in R regularly and do most of my more complex data manipulation in R whenever I can, but most of these examples are simple enough that I think it makes sense to keep them in SQL. The only ones that struck me as really weird to do in SQL were the cumulative cash flows and the "histograms"/binning one, though the ones requiring window functions might also be slightly easier in R than in SQL.
Also, a lot of these would need to be done in SQL in practice since the data wouldn't fit in memory. Any solution that requires loading the table [login (user_id, timestamp)] into memory probably won't scale very well!
Good q! +1 to the other replies on this comment. I have two points worth mentioning:
(1) The flavor of SQL I use at work supports macros, which are functions that can take in parameters like a function in R/Python might. So, the SQL is "turbo-charged" in that sense and some of the value-added of switching over to Python/R is diminished.
(2) Like I mentioned in the doc, I personally use the SQL in these practice problems for ETLs on dashboards showing trends. AFAIK, much easier/efficient to write metrics for daily ETLs in SQL than R or Python, especially if these are top-line metrics like MAU.
Besides the many responses about data size, the premise of your question is baffling to me. This is exactly the kind of stuff SQL is for, it's designed specifically for doing this kind of work. Even on a smaller dataset, the "higher level" languages you speak of will certainly be able to produce the same results, but their implementations will almost certainly be more complex, with more points of failure, a great deal less flexible (what if the dataset grows?), and a whole lot more verbose... and all you get for all of those downsides is a more familiar language.
No questions/examples featuring recursive CTE's? They tend to come up in anything involving queries over trees or graphs. They're also a relatively new feature where having some examples to show how they work may be quite helpful.
Took me some time to wrap my head around writing a recursive query. Also had to study the docs a while to get "merge into" to work right, for doing combined insert/update.
I flip back and forth between deep diving in (my case) SQL Server skills and .NET Manipulation. In the world I live, it makes the most sense to do set based manipulation in SQL and logical entity based logic in C#. I work in a unique enterprise niche that has about 4 options based one either java or .net. Sql knowledge definitely gives you a leg up for complex reporting, and there are cases where I love being able to debug super quickly when comparing inputs to outputs. However, when I run into a SQL script that is 5000+ lines long and have to debug it, I much prefer the .NET side of the fence. Should someone ever come up with a bridge that gives you .NET level of visibility into the active datasets in a SQL query I would pay them 4 figures without question...
I take it you've already looked into CLR stored procedures? I haven't used 'em in a whole lot of depth, but they do seem to at least give some of the tools for a "best of both worlds" approach.
Legacy code from Powerbuilder days. But it does things like calculate month over month inventory, Mark to Market value, Risk, or things related to Energy and commodities trading.
Very interesting. I never really "got" declarative languages, I remember a very long time ago I was working with Oracle and you could see the "execution plan" for your SQL queries. I kept wondering "why can't I build my queries directly with this?" - it seems so much simpler to my brain than SQL itself.
As I responded to another comment, I have to wonder if that's because you haven't spent the time to become proficient enough to appreciate them. I understand that a lot of programmers decry "magic" and want to get at the underlying steps and components that produce a given result, but literally all of us work at some level of abstraction above that because it would be impossible not to. None of us are directly transforming sand with electricity into computed values.
You're describing being more comfortable with an imperative set of statements than a declarative description of results. Assuming you write automated tests, think of the declarative approach as writing the description (rather than the implementation) of a test, except that the description is less freeform and must conform to a certain syntax and structure so that a machine can write the test for you.
It's "magic", but it's only so much magic because the scope of what it can do for you is limited and well defined.
Assuming you write automated tests, think of the declarative approach as writing the description (rather than the implementation) of a test
Except that you know/expect how exactly it will be executed and if it doesn’t, maybe test should not really pass. SQL in general is full of this “we write declarative queries expecting this exact imperative result and investigating if it’s not”. It’s much like an interview question: it may have many different answers, but you must provide the one that satisfies a grumpy plan builder guy. I know sql and use it when it’s shorter/more descriptive, but sometimes you just want to take them rows and do them things, without leaving a database process and its guarantees.
Not much db experience, but the fact that such powerful engines (i.e. acid, indexing, good ods, pooling, etc) are always hidden beyond some cryptic uppercase frontend with a weak execution layer always bothered me. Just give me that postgres-grade index scan directly in C, dammit. /rant (inb4 just write some parts as a sp in a language of choice)
> Except that you know/expect how exactly it will be executed
You're writing bad tests. Do yourself a favor and write tests that pass "it [satisfies some human expectation]". Stop writing tests that care about how it's implemented.
I would like to know what to do if it fails to deliver results on time on a big enough dataset. Then someone will say hey, don’t use COUNT DISTINCT? But why, tests showed no issues and I don’t care how it’s implemented. What to do with that loop?
You're hitting on the difference between unit tests and integration tests. Unit tests should generally not try to test load limits or timing, while integration tests should have the freedom to load millions of rows into a table and verify it doesn't take too long to query the table. Unit tests should be quick enough to run before every commit, while integration tests run on a shared server with enough resources to exercise limits. Most interesting projects need both unit and integration tests.
Yes, but the original discussion was declarative vs imperative, and tests were introduced only as an analogy by gp. I just showed that it doesn’t really apply, it was not a question on how to test in general.
You can still write your test to express user intent (`it completes in [time window a user expects]`), generate the load your representative of what your user would encounter, and test that. If your test looks for `COUNT DISTINCT`, you aren't verifying that the results are delivered in the time expected, you're testing something else entirely and you may have no idea what a user will experience.
You can. First, find or implement a library in your language of choice to store indexed data on disk. Then implement some very hairy logic to make sure your read and write operations are ACID. Then choose an algorithm for the query you'd like to write. And then write it.
I don't intend to be glib, but rather to illustrate some of the benefits of a database platform like modern RDBMSes. These things are all implemented and battle-tested for you.
If you would prefer a more fine-grained control over execution of queries, you could probably get very far by starting with the SQLite code base and working at a more primitive level through its library.
You can't write query plans directly because you would have to manually take into account a number of factors that the query planner considers for you automatically, such as size of the table, statistics on the distribution of values, whether there are indexes that could be used to speed the query, and so on. Some SQL dialects (like Microsoft's T-SQL) do give you some ability to influence the decisions the query planner makes, though, like forcing it to use specific indexes, or forcing it to use scans or seeks.
As a practical matter, when you can't control the optimizer or affect how the DBAs configure things, you break your huge query into multiple ones with temporary tables.
This is from the perspective of trying to get queries to run in 5 minutes instead of 30 minutes instead of hours or days or forever, not brief transactions measured in milliseconds. And it's not something I figured out on my own, but by paying attention to the guy who never talked but was consistently 10x faster in producing reports than anyone.
The thing you should not do, that I also saw people do, is use procedural PL/SQL or T-SQL to process things in a loop - that can be orders of magnitude slower.
And then comes Oracle and insists on applying equality filters first, because, duh, they are fast, and never take any of those other details into consideration.
Honestly, Postgres does it right - you can enforce your query plan to any level of detail you want.
Data changes, so the best query plan changes with it. SQL was built to handle this data-dependent environment.
Is one column composed of mostly one or two values? Then an index lookup on that column is not very optimal, and the database can use something else.
There is more than one type of join (INNER, LEFT OUTER, etc), and more than one join algorithm (neesed loop, merge, hash, etc). All these change based on the data. Even the join order can have a huge impact on query time, and needs to adapt based on the number of rows you'll pull from each table.
A lot of SQL queries are built from templates, or built by ORMs. Optimisations are critical to turn these templates queries into something efficient.
SQL can also be very expressive, a "NOT EXISTS (SELECT 1 FROM thing WHERE foobar)" could be more readable than doing a join and where clause.
Though interestingly, you get a much more declarative query style with nosql databases and key-value stores. So there are alternatives out there.
DISTINCT generally requires the results to be sorted which has O(n^2) worst performance so it can have a big performance hit on a query. It is best to make your database structure such that queries only return distinct data. E.g. by disallowing duplicates
If your sorting algorithm degrades to anywhere near O(n^2) in pathological cases, you're doing something wrong. And even if it's just a kind of timeout/operations-limit to detect pathological cases and just run an in-place mergesort instead. Tail latency/containing pathological data is quite important if there's any interactivity.
Nearly every time, it's a symptom of bad data normalization.
But every time, it interferes badly with any kind of locking (that's DBMS dependent, of course), and imposes a high performance penalty (on every DBMS).
In order to determine the distinct items, the items need to be deduplicated. Generally that's done in only two ways: a hash table that skips items already seen, or a sort followed by a scan that skips over duplicates. The hash table is O(1), but the sort is easier to make parallel without sharing mutable state and has more established algorithms to use when spilling to disk.
Bingo. I used to work with a guy who would see duplicate results and just throw a distinct on his query. I had to keep on him to fix his queries or explain why distinct was correct in this case. My default is that distinct is almost always not the solution.
In your example you must also have another table, 'sample' with all the samples. So yes, you would use an exists or in subquery with the table you suggested.
The first few answers are unidiomatic where I work. Analytical functions would be vastly preferable to self joins, especially in the case of the join with an inequality that is proposed to compute running totals, which I assume would have horrible performance on large datasets
RDBMS platforms without function indexes means that some of these queries will force a row-by-row execution over your entire table. Enjoy running SELECT ... FROM a INNER JOIN b WHERE DATEPART(a.date, month) = b.whatever on a table with 500 million rows in it.
As someone who used to use MSSQL extensively and is now forced into MySQL, trust me, do not be jealous of the MySQL people. 5.7 fixed some of the most egregious issues and 8.0 is adding some features, but the smoldering embers of the dumpster fire are still around.
I think I'd be able to do all of these in my daily work, probably not as efficiently, with minor references to syntax guides (I don't use window functions often enough).
In an interview, presumably my logic would, hopefully, shine through minor issues of syntax.
Where would that put me? Maybe "okay to decent" when dealing with "medium-hard" questions?
I would fail utterly at DBA management SQL and stored procedures, my responsibilities skew towards data analysis.
Yeah, I'd be something similar and also wouldn't consider myself a database expert. I think this question list is tilted towards "can you express this complicated abstract transformation in legal SQL" than the practical day to day considerations of building and working with databases.
Hey! MAU = Monthly Active Users. Sorry about that, I think this could be described as data analyst/analytics jargon. Good feedback to spell these out, thanks!
Sorry about that -- in real life you would just ask the interviewer what MAU meant if they didn't spell it out.
The purpose was to make the questions more realistic, since at least in my experience in data analyst interviews the questions are asked in the context of actual business or product situations ... like company leaders, PMs, or others wanting to understand trends in MAU.
Totally understandable, I think the frustration is more towards companies that try to make the questions a bit more obscure, definitely not your fault.
SQL is better than 99% of the nosql alternatives out there.
But one thing it falls apart are these time series data processing tasks.
It's because of its model of unordered sets (or multisets to be more precise, but still undordered). When you look at majority of those queries and other real life queries they always involve the dimension of time. Well - then that means we have a natural order of the data - why not use an array data structure instead of this unordered bag and throw the sort order out of the window.
SQL realized this and bolted window functions on top of the original relational model. But you still feel its inadequacy when trying to do simple things such as (x join x on x.t=x.t-1) or even the infamous asof joins where you do (x join y on x.t <= y.t).
In array databases with implicit time order both examples are a trivial linear zip operation on the the sorted tables.
In traditional set oriented SQL it results in horrible O(n^2) cross join with a filter or in the better case od equijoin in sort merge join which still has to do two expensive sorts on my massive time series tables and then perform the same trivival linear zip that the array processor would do. Which is also completely useless on unbounded streams.
Also many stackoverflow answers suggest to use binary blobs for time series data and process them locally in your favorite programming language - which points at wrong conceptual model of SQL engines.
Is SQL really so inadequate for real life data problems or have I been living under a rock and Kdb is no longer the only option to do these trivial time oriented tasks?
Most of these 'time series databases' are for processing of structured logs and metrics to be plugged into you favorite system for monitoring site latency.
Asof join is still an open issue in their issue tracker so it is not usable as a time series database.
If AS OF is a dealbreaker for you, so be it, but I think TimescaleDB is trying hard to be more than a "simple timeseries" aggregator like so many competitors. We're using it to store hundreds of millions of complex IIOT data packets, where a fully normalized packet adds dozens of rows across ~4 tables.
Having it be mostly plain-ol-postgres has been huge for designing and evolving the data model in a way that I'd hate to do without "real" SQL.
I'm sure your product is great for many customers in their domains.
The asof and other time based joins are a necessity for just slightly more complicated data processing than just a trivial aggregation at the tail of the stream which is majority of use cases in IOT and real time monitoring.
Without that I can pipe my data to /dev/null giving me even better write performance while also not being able to solve this common financial industry use case.
Sorry, didn't mean to imply I work there, I don't; my employer uses Timescale and I designed our schema.
Timescale has been quite performant for us fetching arbitrary past time windows, but I'll stop the cheerleading there as I think mfreed covered the details far better.
Just an observation that the parent post was the unordered "set" nature of tables, and not strictly about ASOF joins.
I should point out there is an alternative reason for this in TimescaleDB (and many other databases): to support loose-time-order writes (or even data backfill), where records don't always arrive in perfect timestamp order.
So some of these engineering decisions are to balance tradeoffs across insert vs. query rates.
In TimescaleDB, for example, we write data to time intervals in arrival order, not timestamp order, in order to achieve higher writes rates. On the other hand, we support the automated asynchronous rewriting of data to reorder data after it reaches a certain age, according to arbitrary columns.
This can be used to reorder data precisely in timestamp order. It's also highly useful and commonly used for composite reorderings, such that you can then reorder on (some_id, timestamp DESC), such that, within a time interval (chunk in TimescaleDB), data belonging to each some_id is collocated on disk, and within each id, it's sorted on disk again by timestamp. (Here, some_id might be a unique hostname, device_id, metric_id, stock ticker symbol, etc.)
These allow you to take much faster queries for SELECT * from table WHERE some_id = foo and time > X and time < Y.
We also take a similar approach when performing asynchronous columnar-style compression, where we "segment" together many records belonging to the same id into a single internal row, then order those records in array style within each row, sorted by some column (typically timestamp) before applying a type-specific compression algorithm to then. Again, all metrics then belonging to some_id are collocated, ordered by timestamp as an array on out-of-line pages so that only the requested columns need to be fetched from disk. And this gets hidden behind a SQL view that makes it look like you are still interacting with a standard row-based table.
Anyway, totally appreciate that we haven't prioritized ASOF JOINs, as our experience is that these typically come up in more specific financial use cases, rather than the vast majority of time-series use cases.
And we also find that our "gapfill/LOCF" functionality can be used to bucket-aggregate data with different ids for related behavior. That is, it's a data modeling choice to store different ticker symbols in different tables (and then ASOF JOIN), and not necessarily the only way to achieve the desired goals.
>After an interview in 2017 went poorly — mostly due to me floundering at the more difficult SQL questions they asked me
Could be you dodged a bullet. A company with advanced interview questions may have some ugly SQL code. For jobs that lean heavily on SQL, I expect candidates to know things like windowing & dynamic variables, in SQL & an ORM library. For SWE's, I feel basic SQL is fine.
Yes! Window functions and decent demonstration of datasets with varying granularity on top of the basic join/aggregation assessment is all that is needed in the SQL for interviews. And if you're an engineer, attention to redefining the query plan through appropriate use of temp tables. Anything else is a poor use of precious interview time for both parties, in my opinion
The problem with hard SQL problems is that they are often one of two camps:
1. use some underlying implementation detail of the particular RDBMS or proprietary extension to the standard
2. are essentially tricks, like the typical ridiculous interview problem designed to "make you think outside the box". Yes, you can do almost anything in SQL but often you should not.
I get the perspective here is data analysis where you probably need to know more SQL than the standard developer, but I still feel you should be testing for solid understanding of basics, understanding of the relational algebra concepts being used and awareness of some typical gotchas or bad smells. That's it. They'll be able to google the precise syntax for that tricky by date query when you're not guaranteed to have data for every month or whatever on-demand.
A bit off-topic, but this is another one of those sites that show nothing without JS, and looking at the source reveals something a little more amusing than usual: content that's full of nullness.
Most people have no idea what an optimization fence is and opt for CTEs because they yield "cleaner" queries, despite nearly always killing performance.
I've always found explicit temporary tables, where I can add indexes, are often a great solution for performance and readability.
Do you mean subselects in the SELECT clause or in the FROM? Usually subselects in the SELECT clause have been the source of dogslowness. Reorganizing them into the FROM clause sped up queries 1,000 times. And I thought CTEs were just a way of moving subselects from the midst of the FROM clause to the top of the statement, where they're easier to understand.
In the FROM clause -- although I understand future versions of pgsql will address this, the "WITH foo AS ... " approach can be horribly wasteful, materializing all the results, whereas "SELECT * FROM ( SELECT f(x) AS a FROM baz ) qq limit 1" lets the optimizer do its thing.
I mostly agree. I'd start with the simplest possible query and move from there. I enjoy SQL, but once I start querying large datasets or doing anything remotely complex, the 'right' solution is the one that is fastest. And fast always depends. Typically I have to write up a solution a few different ways, test and tweak.
At least on recent versions of Postgres, CTEs are often equal in performance to subqueries (if they're side-effect free and non-recursive they get inlined).
Out of curiosity: what is the use case for SQL window functions in application programming? Unlike most SQL, it doesn’t seem to reduce the order of the data coming back from the server, nor do anything especially faster than can be done on the client - and has the disadvantage of extra load on the database (which is harder to scale).
Is it only useful for ad hoc/analytical queries, or am I missing something?
It allows you to "group by/aggregate/etc" on a row by row basis rather than once over the entire table. So whle GROUP BY defines a set to aggregate over the entire table ( and hence you get 1 row back ), window functions creates a criteria by which each row gets a set ( aka window ) to aggregate over for that particular row. In addition, you can define a "frame" over that "window" for even more refined aggregation. I suppose window functions are used more in the OLAP space than OLTP, but it is a very useful part of SQL. In my opinion, windows function is the most impressive and important part of the language.
People can define, describe and provide examples but with SQL, it won't sink in until you try it out yourself and have the "Ah ha" moment. If you have a database server and some test data, try writing a few window function queries or try some online examples.
Why do you think it's not "especially faster" than doing it on the client?
The data is already sorted, partitioned, in memory .. adding a moving average calc or max value per category is certainly faster than fetching the set to disk, recreating it in an intermediate structure and then calculating the new value with all that partitioning, sorting, etc. to be redone client side.
You can certainly use them in app dev for things like:
* figuring out how many things are ahead of you in a queue
* snapshotting (what's changed since the last time you were here)
* comparing something to some other sample for outliers or inconsistencies
These may be quasi analytical still, but ultimately they can manifest as properties of an object model like any other property to be developed against.
I suppose it depends on the language. Normally I’m writing in either Java, C#, Rust or C - and in those I’ve never found that having the DB run a moving average to be any faster, provided the result comes back in the right order. Indeed, if I need to use another column in the result set, the deserialisation overhead means it’s normally slower. For reference I normally write financial time series processing code which is where window functions ostensibly could be quite useful.
However certainly if your app is written in Python or Ruby I can see there being a big difference.
The reason I mention analytics is more to do with the ease of scaling out the DB than the nature of the queries. In busy OLTP databases usually one wants to keep the work off the database because they’re difficult and expensive to scale.
I use them for cursor based pagination. You can use rank and a where clause to apply the limit. You can also compute max rank to determine whether there are more data to fetch after limit. Finally, if you do left joins where you may have multiple rows per objet, you use dense rank instead of rank
This is more like "we want to make sure you can use recursive CTE" questions. To add some variety to medium-"hard" SQL questions you could add some lateral joins, window functions (especially lag if you want to get creative) and compound logic statements in where clauses.
For creating a table of dates, what are our thoughts on:
select * from
(select adddate('1970-01-01',t4.i10000 + t3.i1000 + t2.i100 + t1.i10 + t0.i) selected_date from
(select 0 i union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t0,
(select 0 i union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t1,
(select 0 i union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t2,
(select 0 i union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t3,
(select 0 i union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t4) v
where selected_date between '2016-01-01' and now()
The WITH clause, otherwise known as Common Table Expressions, is in ANSI SQL99. Common table expressions are supported by all of the major databases: PostgreSQL, Microsoft, Oracle, MySQL, MariaDB, and SQLite. They are also in certain minor ones, like Teradata, DB2, Firebird, and HyperSQL. Recursive WITH clauses are especially useful.
Worth noting that MySQL only got CTE's in 8.0 while something like MSSQL has had them since ~2005. The reason this is important is that something like AWS Aurora only supports up to MySQL 5.7, and thus no CTEs. :/
Not anymore. Other implementations, like MariaDB and SQLite, have adopted common table expressions. They're basically syntactic sugar for subqueries in most implementations, but they can make some queries much more readable
some of the problems with SQL, is it was written to solve problems when hardware was expensive. BCF, n all the normal forms etc. when doing analytics you want a flat table that's it. & when working with a flat table for analytics they're other tools better for analysis than sql e.g pandas. or sql like language used by column databases. once you've a flat table, you no longer have to do joins etc.
I’ve done my fair share of complex sql queries and complex data migrations, asking about JOIN during a random interview is unfair unless documentation.
Every time I see an article such as this it reminds me how much I deeply abhor SQL. It is an ugly language, closer in feel to COBOL than something that can at times approach elegance, like Ruby or Scala. With languages like those, you can loo at your work after you are done and be proud of it beyond its purely functional aspect. SQL never elicits a response beyond “the task is finished and it does what I want”, typically with a “finally” in there somewhere.
Where you see ugliness I see beauty. I guess it is just what clicks for someone and what doesn’t. But together our strengths make for one really good programmer hence why I like working on teams.
Yeah, my experience is basically the opposite to the GP's as well: for expressing the projection of data I want to work with, there are very few syntaxes I find more elegant than an SQL query. (I have similar thoughts about CSS)
With SQL as an analysis / insights tool (as opposed to prod or dashboard use), answering one (nontrivial) question leads to follow-up questions that need a lot of query restructuring.
SQL isn't really code though, its more like a specification of what you want. Working out how to actually get what you want is then (usually) the engine's problem. And when you're describing precisely what you want from large datasets with lots of columns, its always going to look ugly.
I'd wager that if you re-wrote those snippets as Ruby or Scala operating against something tabular like CSV files, with all the joins and aggregates and so on done in code, it would look uglier.
* ok sometimes sql (thinking specifically of SELECT statements) gets code-y, e.g. with inline formulas. But generally its more on the specification side.
Have you considered that your distaste for SQL has prevented you from becoming proficient enough with it to appreciate it? I used to share your distaste, but as I've been faced with problems where SQL is not only the objectively best solution but also the other available solutions are either impossible (due to, for instance, memory constraints) or significantly harder to understand and maintain (due to SQL's expressiveness), I've come to learn how to use the tool more effectively and in turn I've come to deeply appreciate it.
Codebases I work in continue to use ORMs extensively, but when I need a query of any complexity I'm far more likely to start with raw SQL than to try to make the ORM do what I want. I'm far more likely to "let the database do the work" when I need to do any kind of data analysis or reconciliation. It is very good at what it does, and if you understand how it works it's very simple to do complex things, and quickly. For me, SQL often elicits a response of wonder, and even gratitude for its expressiveness and power.
I also suspect your feelings are rooted in syntactic aesthetics (there's little else in common between SQL and COBOL, for example). I can certainly agree that the syntax of SQL is not what I would choose (though I would not look to Ruby or Scala as examples of my preferences). But SQL is not just syntax, it's also a tool with incredibly expressive capabilities for viewing, analyzing and manipulating data effectively.
>SQL never elicits a response beyond “the task is finished and it does what I want”...
To you this is a bad thing, to me, this is exactly why I like SQL. I don't need to be in love with the elegance and majesty of a language I just need it to work and SQL typically just works and 9 times out of 10 it hooks up nicely with whatever high level language I'm using.
And for what it's worth, I've seen quite a few senior level SQL developers write some pythonic SQL queries that would put my regular Python programming to shame.
It can also play a devil’s advocate game with your data. Last time I wrote a 4-level join on complex conditions and ctes to find the latest effective non-deleted non-property-deactivated maybe-from-a-group-default-markupped product sale price it worked like charm, really. Until it broke two months later and started multiplicate rows in a frontend. It wasn’t me who designed that schema, if that matters. We already sold that part of business at that time and a poor guy who took that on a support had to repeat my hair pulling investigation into this madness again and how to coerce it into working, while production suffered badly. If a solution was imperative-y, it wouldn’t show such an effect naturally (loop over products, lookup a price by an algo, maybe presort few tables). I would make it like that hands down, but requirements were to use the database as is and no workaround was suitable. Now imagine this error in a financial analysis before an aggregation and no clear human-controlled sum checkpoints.
Don’t get me wrong, I see where SQL shines, but I also see where my (your, their, anyone’s) mind shines and where it will lose its traction and prediction abilities. People are fine with some level of declarativeness, but add some more and it turns into hardest puzzles where formulating it correctly is a hardly provable task itself. While that’s true for imperative code, it breaks along the way of your thinking, and not at randomly ‘optimized’ expressions. UB issues in C language group is essentially the same sort of trouble.
This is a dangerous attitude, because it leads to someone like yourself being assigned a task and doing it in a loop in the procedural language associated with your database, where it could be done in one query that is orders of magnitude faster.
And although it may be a matter of opinion, but I find a query that runs one or more orders of magnitude faster than another query to be a thing of beauty.
If it helps, you don't really have to type SQL key words in all capitals ;) I stopped years ago.
The ironic thing is that the way developers like to deal with data today is more like how they did in the early days of COBOL, to which SQL was an improvement.
The first computerized databases were navigational, which just means hierarchical objects. You "navigated" through the data to find the parts that were interesting for a given query, just like with JSON you might loop around through the properties. In 1973 Charles Bachman wrote a book called The Programmer as Navigator.
These data structures were insidious, because: (1) you wind up with duplicated data, vulnerable to getting out of sync with itself, and (2) complex queries can get slow. For example, imagine an array of Customer objects. Each has an Orders field, which is an array of Order objects. Each of those has, among other things, fields for the item name and description, and so on. With this structure, it's easy to fetch all the orders of a certain customer, but it's slow and complex for other queries, like the total number of orders for each item. For that, you might duplicate the data into a different structure. It was just like NoSQL, only there was no SQL at the time. It was PreSQL.
Programmers are immediately attracted to such data structures because they are amenable to the first few pages you have in mind to build. It's really easy to run those nested objects through a template and output HTML, and it's straightforward to take data from a form and save it as one of these objects. As your application grows and spirals, though, your original data structures become more and more cumbersome and less suited to the new pages you have to make.
This was a problem in the 1960s and 70s just as much as today, which is why E. F. Codd wrote his papers, most famously "A Relational Model of Data for Large Shared Data Banks." You might say, relational? Those old navigatorial objects sounded like they had lots of relationships. But it is a popular misconception that Relational here meant the relationships among tables (i.e., foreign keys). Dr. Codd was a mathematician, and he meant the mathematical term relation, which is essentially a grid of values, a table. So they were called relational databases not because you could relate one table to another but because they were databases made up of relations (tables).
Tabular data solves the speed problem in navigational data. But now fetching your data is even more tedious, if you have to navigate those tables by hand (Loop through Table A. If the value in cell 12 > 10, then save it to a temporary variable...). But in that very same paper, Codd also proposed a very high-level language for working with the tables. It wasn't called SQL. IBM came up with SQL specifically, after examining Codd's papers (who in fact was a researcher at IBM). Believe me, SQL was an improvement. Codd's original language, called Alpha, was mathematical hieroglyphics. The foundation was solid but the user friendliness was lacking. SQL was an attempt to have the same nature but resemble English instead of Mathematics.
But the two pillars, tabular data structures and a high-level query language, were introduced simultaneously and are both equally part of what makes SQL what it is. Which one would you like to remove?
Chesterton's Fence comes to mind when watching programmers meet SQL:
"In the matter of reforming things, as distinct from deforming them, there is one plain and simple principle; a principle which will probably be called a paradox. There exists in such a case a certain institution or law; let us say, for the sake of simplicity, a fence or gate erected across a road. The more modern type of reformer goes gaily up to it and says, 'I don't see the use of this; let us clear it away.' To which the more intelligent type of reformer will do well to answer: 'If you don't see the use of it, I certainly won't let you clear it away. Go away and think. Then, when you can come back and tell me that you do see the use of it, I may allow you to destroy it.'" --- https://en.wikipedia.org/wiki/Wikipedia:Chesterton%27s_fence
(The rest of your comment is very compelling, just have a nitpick.)
> If it helps, you don't really have to type SQL key words in all capitals ;) I stopped years ago.
I continue to urge my team to capitalize SQL keywords. This is because the vast majority of SQL queries in our codebases are embedded in another language, as a string. Syntax highlighting is not available (I know there are some tools for this in some environments for some host languages, but it's not widely available or even remotely a solved problem). Static analysis tools for this scenario are generally hard to come by. Every syntactical hint is a godsend for reading and comprehending these queries. I also encourage quoting every identifier even if it isn't strictly required, and extensively using whitespace to make a query's structure more apparent.
If I were writing and reading SQL under better circumstances, it's quite likely I would have different preferences.
Fair enough. The vast majority of SQL queries in my codebase likewise was embedded in another language as a string. Yet I instantly became happier when I started lowercasing everything. I also quote as few identifiers as possible. It's a matter of a clean look, which for me leads to clear thought. I will agree with you, though, that indentation is critical.
But I have taken it one step further and reaped a boon with a more advanced technique: I extract as much as I can of those multi-line SQL statements into database views and, if absolutely necessary, functions. I define these in .sql files, so first of all: syntax highlighting! The strings embedded in my code therefore become mostly one-liners:
select * from view where a = ?
etc.
I version-control the SQL files right along with the normal codebase. Migrations aren't complicated for me, the way that some people complain about them. Then again, I am super-comfortable writing raw SQL. Especially with Postgres, which lets you wrap BEGIN and ROLLBACK around definitions and redefinitions of tables, views, functions --- all DDL --- it's very safe to test and idempotent to run over and over.
This is something I've seriously considered, I just worry that one more layer of indirection creates one more opportunity for problems. (Also being maybe overly publicly blunt, I work with a number of juniors who don't have a great track record for following or even taking interest in why things are structured the way they are, and who also don't have great comfort with SQL.)
RE version control: yes, everything that interfaces with the database is in version control.
RE migrations: I also don't find them confounding, and I've mandated that my team write them all in raw SQL rather than reusing abstractions that were available in our environment, because what I discovered was that migrations were being altered unexpectedly over time where imported abstractions were changing. Everyone has benefited from this.
RE transactions: my only complaint is that (AFAIK, and would happily be corrected if I'm wrong), `BEGIN`, any operation, and `END` are separate statements. When I'm experimenting before testing a full migration, I'll often break down smaller problems and test them within a transaction in my preferred GUI (Postico). Which by default runs the statement where the cursor is placed, and only runs multiple statements if you (remember to) select them. I would love to be able to wrap an entire statement (or set of statements if need be) in a single `BEGIN ... RETURNING * ... ROLLBACK` statement.
The file, changes.sql in this case, might look like this:
begin;
alter table t1 add column c ...;
alter table t2 drop column d ...;
alter table t3 alter column e ...;
drop view if exists v;
create view v as
select ...
;
rollback;
I save the file in version control with the rollback statement, for safety. When I am ready to run it, I temporarily change "rollback" to "commit".
If I wanted to test out just some of the statements, I would comment out the others.
EDIT:
You can inspect the changes in flight by inserting a select-statement:
begin;
alter table t1 alter column c ...;
select *
from t1
where ...
;
rollback;
(Interesting, Hacker News's user interface puts no reply link below your comment. So I could not reply directly. I guess it allows only so many nested replies. This is probably an optimization for readability or to avoid flamewars).
That's great, but the statefulness of trying something is still a problem. It would still be nice to be able to wrap a whole change or set of changes with a guarantee that I'll be able to view the outcome without a state change in the database. This nicety is compounded by the fact that I'm often working with datasets that take a few hours to get set up before I can even evaluate. There are a ton of other guard rails I can set up to make that turnaround less shitty, but even cloning a backup database in the worst case is painful. There's no reason the expressiveness of SQL can't embrace complex expressions as a single statement to encapsulate a whole transaction in a single expression. CTEs already do most of that, they just don't have any transactional capability.
Edit: I also don't just use psql, because having normal editing controls that I'm used to in my operating system is muscle memory I leverage and also error prone if I try to use the wrong tool with the wrong muscle memory. I know how to move a caret around every single Mac app, but it works differently the moment I enter anything other than emacs in my terminal.
> You can inspect the changes in flight by inserting a select-statement
All true, but you still can't express a transaction as a single expression.
> Interesting, Hacker News's user interface puts no reply link below your comment. So I could not reply directly. I guess it allows only so many nested replies. This is probably an optimization for readability or to avoid flamewars
They just hide the reply link at a certain depth of threading. You can navigate to the parent comment (by parent link or datestamp depending) and reply.
If you think of SQL as specifying (without implementing) a pure function operating on a set—rather than imperative steps or statements operating on lists—you'll have a lot better luck reasoning about how to use it to operate on data. CTEs, window functions and joins can all be used to accomplish things you'd do in a loop in an imperative language. It really is a different paradigm, in the same way as logic, functional and object-oriented programming are different paradigms. But if you embrace the paradigm you will surely learn how to accomplish whatever you would in a paradigm more familiar to you.
The trouble is that any loop you might write may be looping over the wrong thing.
Probably the biggest single determinant of a SQL query's performance is the order of joins. The best join order is dependent on how many rows result after each join, which in turn depends on how selective the predicates are that can be applied to each table, and that depends on the data distribution. The database does this with statistics. That means it can change the join order - the nesting level of your respective loops - as the data distribution changes.
You can write loops in (the procedural extensions of) SQL, but IME it's very rare that you'd want to.
The bigger issue is the lack of performant abstractions - I copy-paste orders of magnitude more code in SQL than in any other language I've used. There are basically just views and functions, and both can carry substantial (read: multiple orders of magnitude) performance penalties compared to the copy-paste approach, especially when you try to nest or compose them. Materialized views rectify this somewhat but they come with various RDBMS-specific limitations and gotchas.
One of the best data tools I have ever used: DBT. The big concept is that everything is a SELECT and handles most DDL and DML under the hood. It also provides the ability to add scripting, such as loops, with Jinja. It is primarily meant for OLAP and ELT, but could be used for some OLTP too.
Recursive CTEs are great for expressing very poorly performing queries. They might give the right answer, but after you've given up waiting.
If you've got tree or graph structures, you're better off denormalizing paths or transitive closures, to reduce the amount of indirection required to answer questions.
A weakness of these types of SQL questions however is that it's near impossible for the interviewer to provide help/guidance; the questions are often know-it-or-don't questions (especially involving window functions w/ uncommon syntax). A data science interviewer years ago mocked me during a whiteboard test for not knowing the BETWEEN ROW syntax for a window function.
That said, as an IRL data scientist, the amount of times I've had to do a SQL self-join in the past few years can be counted on one hand. And the BETWEEN ROW syntax for a window function.