Medium-hard SQL interview questions (quip.com)
1230 points by thomzi12 on May 2, 2020 | 287 comments

SQL interview questions are an interesting counterpoint to stereotypical programming interviews: while the typical algo questions in SWE interviews tend to test material that's taught in academic contexts but has little real-world application, the questions in SQL interviews stay practical off the whiteboard.

A weakness of these types of SQL questions, however, is that it's near impossible for the interviewer to provide help/guidance; the questions are often know-it-or-don't questions (especially those involving window functions with uncommon syntax). A data science interviewer years ago mocked me during a whiteboard test for not knowing the BETWEEN ROW syntax for a window function.

That said, as an IRL data scientist, the number of times I've had to do a SQL self-join in the past few years can be counted on one hand. Ditto the BETWEEN ROW syntax for a window function.

Given the likelihood that I would have to work with the person who was interviewing me, I would count it a blessing that they show me their true colors during the interview rather than after I hire on. And if I didn't get an offer, I'd consider it a bullet dodged.

If you do get an offer, and choose not to take it, that's dodging a bullet.

If you don't get the offer there's no bullet to dodge, right?



Your pedantry tag has no opening



It wasn't a pedantry tag. And yours is differently misspelled!


Don't start a sentence with a conjunction.

I'll terminate this by being so pedantic that we cross over into non-pedanticism.

Per descriptive linguistics (https://en.wikipedia.org/wiki/Linguistic_description), the assertion that it's wrong to start a sentence with a conjunction is incorrect. Ditto for splitting infinitives, ending a sentence with a preposition, etc.

(Descriptive linguists basically define the rules of language to be how humans actually use the damn language. It turns out such a philosophy saves you from falling into a lot of traps)

Unfortunately, merely invoking 'descriptivism' as if it settles the issue of linguistic prescription doesn't help that much. We prescribe all the time; like it or not, language isn't a free-for-all at the individual level. It's also _descriptively_ true that many people regard certain rules as inviolable, which means that prescription is, in some sense, completely reasonable: comply, or we'll refuse to accept you. How does 'I'll do whatever I damn want' really handle that? It can be 'valid' on one level while being wrong on another.

I immediately dismiss as gauche and lacking credibility anyone who even hints at a prescriptivist rule, so by my own standard of acceptance, prescriptivism fails. These rules make writing unclear, stolid, dull, and uninteresting for no real benefit.

In my opinion, parroting prohibitions against split infinitives, ending with a preposition, "snuck is not a word", "'they' is only plural", etc., not only shows a profound lack of education but also an inability to learn, since these so-called "rules" are widely and easily known to be made up out of whole cloth. To my mind, this is a person who absorbed Elements of Style (perhaps even second-hand from a misinformed high school teacher) and called it a day, believing they understand how English is properly written.

Serious red flag for me.

"I immediately dismiss as...lacking credibility anyone who even hints at a prescriptivist rule"

Isn't this a prescriptivist rule? Imagine a highly skilled author who notices a younger one misusing a semicolon or writing with poor grammar; the older says to the younger, "only write with perfect grammar". By doing so, the younger learns more precisely the elements of the English language and the purpose of each device (semicolon, period, comma, etc.). Afterwards, the younger becomes a much better writer and integrates that new knowledge with his or her existing writing. In this way, prescriptivist rules help inexperienced writers home in on their weaknesses and draw focus to some facet they lack.

Any harsh prescription is kind of a red flag for me, because you have to judge yourself by that harshness (fall on your own sword, so to speak). Or, as what usually happens, hold some hypocritical double standard or semantic loophole which pardons yourself while incriminating others.

<< Imagine a highly skilled author who recognizes a younger one that misuses a semicolon or has poor grammar; the older says to the younger >>

I cannot imagine a highly skilled author who never employs split infinitives. Show me that person.

Regarding semicolons, here is Kurt Vonnegut's opinion on them: "Here is a lesson in creative writing. First rule: Do not use semicolons. They are transvestite hermaphrodites representing absolutely nothing. All they do is show you've been to college."

Is that good advice? Perhaps for some writers, it actually is; but it's not about the semicolon, really, but about stilted, bad writing.

<< Isn't this a prescriptivist rule? >>

I was responding to the previous comment "It's also _descriptively_ true that many people regard certain rules as inviolable, which certainly means that prescription is, in some sense, completely reasonable: comply, or we'll refuse to accept you." My direct answer is "It's the prescriptivists 'refusing to accept' who should worry about acceptance, since they are failing on so many levels it's hard to know where to start"

Agreed. Elements of Style is full of errors, some of which are so obviously wrong that even its own authors don't follow them.

Do you have any good recommendation for improving writing skills?

I picked up Elements of Style a few days ago. I couldn't parse it. I was only able to understand it somewhat through intuition and prior knowledge. The sentences were too confusing and out of place for me, with added jargon and stories about the author that I didn't want to know about.

How many pedants are descriptivists?


But sometimes it's OK to break the rules.

Not with -pedantic enabled, that’s kind of the purpose.

Unless you're already calling someone else out for a spelling mistake.

The most reliable way to dodge bullets is to not have them fired at you in the first place.

Unless you know a guy, you have to wade through the firing range to land a competitive job.

Sometimes you need to wade through the firing range as part of your job:


I've always preferred deflecting them with my katana :)

It depends on whether you consider "dodge" in the phrase to be active or passive.

Dodging taxes is active. Dodging the in-laws because you were in the shower when they dropped by is passive.

Both applications of the verb are in common usage and are therefore valid and obvious enough that you clearly understood the meaning. </pedantry>

"Dodged a bullet there" is often used in hindsight, where you find out something was terrible after you avoided it for a different reason.

Never thought of this, but you are right. Usually it’s used when you didn’t know you were dodging the bullet at the time but more of a hindsight thing.


I've been using SQL for 24 years and don't think I've ever used BETWEEN ROW before. Did read up on it now, though.

I wrote SQL for years for a forex platform used by some of the top dogs (literally). Pretty hardcore SQL that had to be very fast; never used BETWEEN ROW.

What breeds of dogs use forex platforms?

Fiat Retrievers

I use it all the time, both in ad hoc queries and in data prep for models. Most commonly just to look into a future window from the current event (i.e. did this pass in a game of soccer lead to a shot etc). It's quite rare that I'll specifically care about X rows preceding or following but it does happen that you care about a shorter period than the entire window. Entirely possible people are using graph databases for that sort of thing, but the window syntax is nice.

There are so many specialized aspects that you'll never use outside of an interview or an edge case when you can learn it on-demand.

I have to use it quite often because the dialect I'm using defaults to BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which I have to manually change to BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
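For anyone curious about that default-frame behavior, here's a minimal sketch (table name and data are made up; runs via Python's sqlite3 and assumes SQLite >= 3.25, which added window functions). With an ORDER BY and no explicit frame, SUM() is a running total; an explicit UNBOUNDED PRECEDING/FOLLOWING frame sums the whole partition on every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (day INTEGER, amount INTEGER);
    INSERT INTO t VALUES (1, 10), (2, 20), (3, 30);
""")
rows = conn.execute("""
    SELECT day,
           -- default frame: UNBOUNDED PRECEDING .. CURRENT ROW -> running total
           SUM(amount) OVER (ORDER BY day) AS running_total,
           -- explicit full-window frame -> grand total on every row
           SUM(amount) OVER (ORDER BY day
               ROWS BETWEEN UNBOUNDED PRECEDING
                        AND UNBOUNDED FOLLOWING) AS grand_total
    FROM t
    ORDER BY day
""").fetchall()
print(rows)  # [(1, 10, 60), (2, 30, 60), (3, 60, 60)]
```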

> A data science interviewer years ago mocked me during a whiteboard test for not knowing the BETWEEN ROW syntax for a window function.

Sad. One needs a thick skin to interview, because the manner of rejection can be far more detrimental than the rejection itself. I have seen friends cry, unable to believe their luck or themselves. Due to negativity bias, it often takes a lot of confidence away from an individual and they start to feel inadequate, doubly so if they had prepared super hard and yet failed because they couldn't remember a piece of trivia.

One of the best pieces of interviewing advice I got from a mentor was: interviews aren't supposed to be a pissing or dick-measuring contest, but often are. It changed the way I approached interviews as an interviewee. It's also a good thing that I sought and got this advice well before I started interviewing candidates myself.

"mocked me".

That's really reprehensible.

One point I've made repeatedly here on HN is that technical "interviews" have morphed into a form of entrance exam with very little oversight.

If you read about entrance exams at older institutions and professions (the bar, medical and nursing boards, actuarial exams, and so forth), you'll find that they are often considered among the more stressful events in a person's academic career. They are often (and should be) very rigorous, but I do think a certain unspoken bill of rights has emerged to protect the student as well as the people conducting the exams. For example, entrance exams should have a study path: a known body of knowledge that is being tested. They should be graded consistently and fairly, by acknowledged experts in the field. There should be a way to file a grievance, and the evaluation metrics must be "transparent" - if not the specific deliberations, then at least the general approach.

Tech interview exams have none of this. They are conducted very, very capriciously, often by people who have limited skills and experience - even if they are experts in their field (which they often aren't), they may not have any idea how to evaluate a candidate.

One basic tenet here is that you don't "mock" a candidate. Seriously, wow.

Sure, that's a fair criticism. That said, you can build multi-step interview problems with SQL (I tried to convey one or two in this doc) such that interviewers can build up towards needing a more advanced window function instead of starting there.

I've used BETWEEN ROW maybe once or twice in my career in a professional setting. Self-joins more often, but as others have pointed out window functions are more efficient here for writing dashboard ETLs, etc.

Btw, are you minimaxir who wrote gpt-2-simple? I was looking at your tutorial a month ago while putting together a solution for the Kaggle COVID-19 NLP challenge!

yep! :) Glad you made use of it!

Why would you ask interview questions to quiz on skills that can be learned in a week?

This is only useful if you tell candidates to study a book for a week first.

Otherwise you are filtering for narrow-minded memorizers, not smart people who can learn and solve problems.

First, from someone with 25 years of SQL development experience, let me say while it may be true these skills could be learned in a week, most are not learned at all even by regular practitioners.

And as with all such questions I'm judging your comfort with the concepts behind the skills - in this case set theory, conditional logic, data modeling, etc.

So it's more about how you say it than what you say. And yes, it's mostly negative signal - most interviews are just there to weed out completely hopeless candidates.

> Why would you ask interview questions to quiz on skills that can be learned in a week?

I'm not an SQL expert, but it's very important to understand whether something is a known unknown or unknown unknown to a candidate. So, I imagine, if the candidate would say something along the lines of "oh, this looks like something for a window function - I don't know the syntax and the nitpicks, but I could google it", it'd be completely OK. However, if he wasn't even aware that such a tool existed, it means that he would never go and learn it in a week without outside guidance, and would choose a less-fitting tool to complete a task.

One thing I mention in the doc intro is that SQL questions are one part of the data analyst interview, not all of it. If this doc helps people get through the SQL questions so that they can spend more time being thoughtful with open-ended analytics questions, that's a good thing (both for interviews and for work more generally)!

There's no shame in knowing what it is and the theory behind a concept but not knowing the syntax, especially if the interview is not for a DBA. I'd be honest and ask to look it up online, or move on.

I don’t think it’s fair to say these examples are know-it-or-not.

Sure OP has presented some example solutions, but most of these solutions can be built up from the basics of tabular data manipulation. Furthermore, there are multiple steps involved that would allow a candidate to show their grasp of these fundamentals and receive help from an interviewer (that doesn’t trivialise the problem).

Then you’ve only encountered shitty interviewers. The point isn’t if you know some syntax, that can be easily taught / googled. The point is how you think about the problem.

In the case of percent change in MAU, you don’t need to know how to do a self join, exactly. But rather to identify that you would want to do a self join and on what conditions would be the key
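To make that concrete, here's a minimal sketch of the self-join idea with a hypothetical mau(month, users) table (run via Python's sqlite3 for illustration): join each month's row to the previous month's row, and the percent change falls out of the two columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mau (month INTEGER, users INTEGER);
    INSERT INTO mau VALUES (1, 100), (2, 120), (3, 90);
""")
# Self-join: "cur" and "prev" are the same table, keyed on adjacent months.
rows = conn.execute("""
    SELECT cur.month,
           ROUND(100.0 * (cur.users - prev.users) / prev.users, 1) AS pct_change
    FROM mau AS cur
    JOIN mau AS prev ON prev.month = cur.month - 1
    ORDER BY cur.month
""").fetchall()
print(rows)  # [(2, 20.0), (3, -25.0)]
```

Month 1 drops out of the result because it has no prior month to join against, which is usually the behavior you want.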

Yea, I ask the employer/manager query problem sometimes. I'm not looking for the right answer, but rather how they think through the problem and how they think about depth bounds. Occasionally I'll have a candidate who can throw out an Oracle CONNECT BY, but it is absolutely not what I'm looking for (even though that's probably the "most correct" answer).

I'd say it's unlikely you know about self joins but don't know at least one syntax for it just because of how we traditionally teach SQL to others. This is the know-it-or-not part: SQL is rarely taught without syntax and syntax is used to illustrate the theories. This is in contrast to regular programming where people just pick up pseudo code eventually.

If you had a pseudo code for sets, joins, etc, it'd probably be using more mathematical symbols.

I think it depends on your data / query patterns. Self Inner/Left/Right joins are one of my most common queries. I often need to chain joins in various ways to get the desired output.

I've never heard of BETWEEN ROW, but self-joins have been a pretty common thing for me, particularly when you're doing analysis on any kind of business-centric table that (rightly) has tons of columns -- e.g. a "users" or "products" or "items" or "orders" table.

> A weakness of these types of SQL questions however is that it's near impossible for the interviewer to provide help/guidance

I've given SQL interview questions quite a few times and like to present a sample dataset or schema, and state a problem statement or expected result. Then I collaborate with the candidate to come up with assumptions, and I give gentle guidance if they are stuck or going down the wrong path.

I'm always up-front with the candidate about not looking for perfect syntax, and I am more interested in their problem solving and collaboration skills than the actual SQL they write.

People who ask questions about the data and schema usually do much better than those who jump right into the solution. Just like in the real world, where really nice SQL or code doesn't matter if it solves the wrong problem.

That's pretty funny. I've been doing data science for ten years or so and I've never heard of BETWEEN ROW before.

As if "data science" were a field where the full scope of SQL is covered. SQL was developed for the enterprise world.

I'd be either suspicious or very jealous of a data scientist who in any reasonably long career never had to touch SQL themselves.

I use self-joins and windows constantly. It's way easier to have our massive redshift cluster do those calculations in parallel than to try and do it in python on a VM

I do use window functions all the time; just not with BETWEEN ROWs. (to your other point, I work with BigQuery on the DB side, but R/dplyr is great at working with window functions as well in a much cleaner manner than Python/pandas).

I use between rows for rolling averages all the time.

SUM() OVER (PARTITION BY columns ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) gives you a trailing week total.
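A runnable sketch of that trailing-week pattern, on a hypothetical one-row-per-day sales table (via Python's sqlite3; assumes SQLite >= 3.25). One caveat worth noting: the frame counts 7 rows, not 7 calendar days, so it's only a true week if every day has a row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day INTEGER, amount INTEGER)")
# One unit sold per day, days 1..10 (no gaps in this toy data)
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(d, 1) for d in range(1, 11)])
rows = conn.execute("""
    SELECT day,
           SUM(amount) OVER (ORDER BY day
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS trailing_week
    FROM sales
    ORDER BY day
""").fetchall()
print(rows[0], rows[-1])  # (1, 1) (10, 7): frame ramps up, then holds 7 rows
```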

That was incidentally the exact question I was asked. (When I asked during the interview "I know you need to use a SUM window function but I'm not sure how you constrain it to the past 7 days", the interviewer replied "you don't know the BETWEEN syntax? Everyone knows that." And then the interview ended.)

"In a real world situation I'd have googled it and jogged my memory in less than a minute. Since I suspect you were more interested in how I would approach solving X, can you help refresh my memory on the syntax and we can pretend I looked it up to get back to what's important here?"

You dodged a bullet with that one.

Definitely not cool to use syntax as an interview question, but it is a very powerful tool so I wouldn't write it off because of one interview.

What a jerk. Sorry that happened to you.

sounds like he is some old guy who is afraid to lose his position and is gatekeeping so he can keep his current job...

This seems fishy - if your data has a whole day gap somewhere for whatever reason (e.g. I'm thinking sales data where certain emergencies might result in an unusual, unplanned 'zero day') then that query silently gives a wrong result, IMHO a trailing week total should always include a specific date boundary calculation to ensure that it's really a week.

For the transactional DB, sure; but an analytic DB / data warehouse / even just a reporting table would have a 0 record for the no-sales day. If there are gaps in daily data, at some point they need to be filled in with 0s. You can either do that at report-building or query time using a Cartesian join, or you can fill the nulls with 0s during the data engineering pipeline. A robust BI process should do the latter.

For example, if I want to plot this trailing calculation and it has gaps in the data, then the plot will be very wrong. The DE pipeline should fill in these gaps.

Two ways to address this:

- Join to a date dimension and coalesce nulls to 0
- Use RANGE BETWEEN (only available in some DBs) on a date_difference column
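Here's a sketch of the first option (via Python's sqlite3), using a recursive CTE as a stand-in for a real date dimension; the table and data are hypothetical. Missing days come back as explicit zeros, so a trailing-7-rows frame really does mean a trailing week.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (day INTEGER, amount INTEGER);
    INSERT INTO sales VALUES (1, 5), (2, 7), (4, 3);  -- day 3 is missing
""")
rows = conn.execute("""
    WITH RECURSIVE days(day) AS (
        SELECT 1 UNION ALL SELECT day + 1 FROM days WHERE day < 4
    )
    SELECT days.day, COALESCE(sales.amount, 0) AS amount
    FROM days
    LEFT JOIN sales ON sales.day = days.day
    ORDER BY days.day
""").fetchall()
print(rows)  # [(1, 5), (2, 7), (3, 0), (4, 3)]
```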

I have only in the last week had occasion to use `ROWS BETWEEN` in a window function, and once I did I immediately thought of dozens of places I could replace moderately complex imperative/functional code with a simple query. It's definitely not common knowledge and the expectation that you would know it should be very specific to certain skillsets, but damn once you know it, it can be a powerful tool.

Going from doing most of my analytics work in MariaDB/MySQL to Python/Pandas, and R/Dplyr, I'm amazed by how much trickier the syntax and operations are in Pandas.

Yea, when I do use R, I end up almost always doing the data manipulations in sqldf (creates a sqlite db on the fly).

But I do love doing vectorized operations in pandas/numpy. Writing a quick function then calling it against an entire numpy array is kind of fun.

The problem with SQL windowing functions (OVER ORDER BY, LAG/LEAD, RANK, etc) is that they are nondeterministic.

They are only indeterminate when the SQL engine is given an ambiguous ordering. If you specify the ordering without ambiguity, it's entirely deterministic.

Eg: indeterminate in -> indeterminate out

but determinate in -> determinate out.
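A small illustration of that determinate-in, determinate-out point (hypothetical events table, via Python's sqlite3): two rows tie on ts, so ORDER BY ts alone leaves ROW_NUMBER's assignment for those rows up to the engine, while adding the unique id as a tie-breaker pins the result down completely.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (id INTEGER, ts INTEGER);
    INSERT INTO events VALUES (1, 100), (2, 100), (3, 200);
""")
# ORDER BY ts alone is ambiguous for the two rows with ts = 100;
# "ORDER BY ts, id" fully determines the ordering.
rows = conn.execute("""
    SELECT id, ROW_NUMBER() OVER (ORDER BY ts, id) AS rn
    FROM events
    ORDER BY id
""").fetchall()
print(rows)  # [(1, 1), (2, 2), (3, 3)]
```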

This is idiotic. Why in the world would testing for rote memorization of something anyone can look up easily be a reasonable filter for talent and experience in a programming role?

A friend of mine did numerous interviews at a large company, hours out of his time and those of the interviewers, only to be caught up by some inane SQL question asked by a know-nothing after the entire process of interviews had been completed.

Why not ask about obscure regex expressions? Better yet, how about baseball scores? Hair bands from the 80s?

It's time for the valley to get real about how to judge the merit of applicants. The present state of affairs in tech recruiting is a joke.

"Idiotic" is a strong word. The question is: if candidate A answers these successfully, and candidate B does not, is there a reason to believe that candidate A is a better fit for your SQL data analyst position?

You may say no, that any competent person can read the book and learn this stuff. But in my hiring experience there are sadly oceans of coders who, say, reach for 100 lines of buggy slow Java code to solve something that could be done in 3 lines of SQL. Once you've hired that person they do not magically turn into an efficient or self-aware programmer.

> there are sadly oceans of coders who, say, reach for 100 lines of buggy slow Java code to solve something that could be done in 3 lines of SQL

Such questions don't test for that, they test for rote memorization. I don't learn things by heart I only use rarely, and certainly not to help you with a problem I'm not causing and that such memorization doesn't fix.

First of all, in your example, the DB server is doing the other 97 lines of work that the Java had to do; one approach uses hand-built data structures, the other uses a system designed for data management and access. I think most employers would prefer a developer who can do both.

Second, the most important skill for a DBA or Data Scientist role - and to an employer - is finding someone who understands the data domain well enough to point out obvious mistakes, bad approaches, and flawed thinking to managers, and to provide leadership, not just access, to information.

Third, "any competent person" is just belittling a skillset that, like programming, can take years or decades to truly understand.

I agree with most of what you are saying. But sometimes it's better to go with Java code instead of creating 3 line SQL commands.

Then again, this depends a lot on your architecture. For instance, if your database is running on expensive hardware with very limited CPU and memory quotas, then it's often better to offload intensive work to the application. Also, many of the more complicated SQL commands are implemented differently across the dialects of MySQL, MSSQL, Oracle, and DB2; sticking to simpler queries lends itself to cross-platform compatibility if you need that.

I think that when you run into someone who knows these in depth things about SQL you either have a smart technically aware person or a rote learner. I think the best bet is to ask them how they would use the different stuff and why. Rote learners will hit a brick wall or answer generically.

These examples are not obscure at all. If the job involves working with a lot of complex SQL, then these are some great questions to test the candidate's deep understanding of relational queries. It will be critical in their ability to maintain and write SQL quickly. Time and accuracy is of the essence in many such roles. However, the "if" is very important and often ignored. A web developer doesn't need to be interviewed for complex SQL.

That’s because evaluating talent is really hard, and this is the best proxy we have. The dirty truth is that promotion is usually even more arbitrary/political. I’ve seen extremely talented people be denied deserved promotions multiple times, and people so bad I have no idea why they have a job get promotions. It’s more about who you know than what you bring to the table.

It isn't the best proxy though, because it's not even trying to simulate the conditions under which real work is done: in real work, most developers will often look up details, and it's no problem at all. What matters is how they think about solving problems, that they know what they don't know, and that they fit into the team.

This is a proxy for "is this an experienced person with a strong memory?", not "is this a qualified candidate who would perform well in the position?". It's quite possible to be the latter and not the former.

If you're the size of, e.g., Google, your problem is more to weed out bad candidates quickly than to catch all good candidates, so sure, go ahead (false negatives are acceptable to completely avoid false positives). If you are a smaller company, your pool of applicants is quite limited and it's much more important that you don't pass over any qualified candidates that happen to fall into your net.

Why do you assume candidates can't look information up during interviews? Karat is probably the largest provider of technical interview as a service and they allow candidates to look things up while they're interviewing. The goal is to learn how the candidate works and what they can accomplish, not simply what they've memorized.

I interviewed quite a few times when I was at google, and in very few cases would I deduct score for something that could be looked up online. I always started the interview with saying that if they wanted to look something up they could ask me, and if I didn’t know we could make it up and pretend the API existed. I frequently discussed this with others and they all did the same.

The point is absolutely not to find some technicality and use it against the candidate.

While I agree that testing someone on concepts is more favorable than testing someone on tool specifics, I absolutely think there's merit to (tool) proficiency.

If there's one thing I've noticed among the "10x" coders, then it is that they know their environments and tools like their back pockets.

This isn't rote memorization at all. They're typical real-world problems to solve, and the point is to ensure the candidate can "think their way through" the problem to come up with a solution: not to come up with a solution using a specific function or keyword.

A good interview would allow you to read SQL documentation to solve this. If you knew it off the top of your head, you'd ace the interview.

If you can't solve the problems after looking it up, then perhaps you are not a SQL expert. I think these sort of questions are valuable.

I can't blame someone for wanting to filter for SQL expertise. I don't know if these questions are a good way to do it. Maybe they'd be good bonus questions for when you're trying to differentiate between two good-enough applicants.

Do you have suggestions on how to get real with this?

Checked just the first two answers:

1. MoM Percent Change

It's better to use windowing functions, I believe it should be faster than self-join.

2. The first solution seems to be wrong - it returns whether "a"'s parent is Root/Inner/Leaf, not "a" itself.

I'd instead add a "has_children" column to the table, and then it would be clear.

The second solution works, but without optimization it's one query per row due to the nested subquery - slow, but not mentioned.

Answer 4 is very very bad, you're doing O(N^2) JOIN. It's not just slow, it will fail on bigger data.

The question just screams for windowing functions, and cumsum is a canonical example for them.
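For concreteness, here's the cumulative-sum example both ways on a toy table (via Python's sqlite3; assumes window-function support): the self-join compares every row against every earlier row, which is the O(N^2) behavior being criticized, while the window form produces the same answer in a single ordered pass.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER, amount INTEGER);
    INSERT INTO t VALUES (1, 10), (2, 20), (3, 30);
""")
# O(N^2) self-join formulation of a cumulative sum...
selfjoin = conn.execute("""
    SELECT a.id, SUM(b.amount) AS cumsum
    FROM t a JOIN t b ON b.id <= a.id
    GROUP BY a.id
    ORDER BY a.id
""").fetchall()
# ...versus the window-function form, computed in one ordered pass
windowed = conn.execute("""
    SELECT id, SUM(amount) OVER (ORDER BY id) AS cumsum
    FROM t
    ORDER BY id
""").fetchall()
print(selfjoin == windowed, windowed)  # True [(1, 10), (2, 30), (3, 60)]
```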

Sorry post author, you'd fail my interview :)

Thanks for the feedback! The first section is intentionally about self-joins since they get asked about in interviews, but other people have brought up that window functions are more efficient, so I'll add in those solutions as well.

Sorry you feel that way! Thankfully my employer felt differently :)

If it runs fast enough, or the business can wait long enough, slow and ugly is acceptable.

That's exactly what I screen against in interviews. Slow SQL queries tend to work fine for some time, and as data grows, they start failing.

I'd want to hire a professional who knows the proper way of doing it, so that for them it would be as easy to make it right as to make it ugly.

Yes, in particular for data analysts using SQL to answer product/business questions in a one-off fashion

Unless the dataset grows - then poorly optimised queries slow down superlinearly, and things start falling over.

Would you really interview and hire for a job based on SQL tricks instead of spending $50 to give someone a reference book?

If someone claims to be a SQL expert but hasn't bothered to read a comprehensive book cover to cover, it probably isn't because they didn't have $50, especially now that we have Library Genesis.

I've never claimed to be an expert in anything, but if I read a book on SQL from cover to cover, I'd probably still not remember wtf BETWEEN ROWS was unless I was using it daily.

-- The guy that uses C and C++ all the time and often forgets the order of the arguments to common functions.

Hell, at least 50% of my day job involves hacking together SQL Server sprocs, most of which have thoroughly used (and abused) most or all of the concepts demonstrated in the article, and I take no shame in admitting that I'm frequently double-checking syntax or looking at how other people solved similar problems.

I spend more and more of my time writing relatively complex SQL, and I too frequently look stuff up. That's what good engineers do, like all the time. I seldom leave my editor when it's TypeScript, because intellisense does most of the looking up for me. But for SQL I hit the google frequently and I'm not ashamed to admit that at all.

I guess what I'm trying to say is that having a reference book won't help you if you don't know what to look for. Perhaps the interview problems can tell you who knows what to look for and who doesn't.

Also for Q1 - what if there is a gap of one or more months? It seems we should first generate a month series based on the min and max months, then left join to the series to account for months with zero activity.

Sure! Practically speaking, as a data analyst, I would probably notice a missing month when plotting the trend of MAU over time.

(You can make "but what if a month is missing?" a later part of a multi-part interview question)

Generally I would assume that data engineers would have a month of no users set to zero or that I could ask them why that's not the case and note that for future reference.

> It's better to use windowing functions, I believe it should be faster than self-join.

Your statement is making the assumption you have completely dense data and you can simply offset a number of rows to get the desired denominator. Sparse data is a very common occurrence, and now your MoM/YoY/XoX queries are completely incorrect.

Good point. I often find the easiest approach is to join against a sequence to 'unsparsify' the data, then use the window function. I'd guess this is likely still faster than a self-join, unless the data is very sparse.

Yep. Same here. I stopped reading after the second solution. I liked the questions though - just maybe the answers were too myopic.

I think for a lot of people, SQL is a skill that doesn't stick.

You learn enough to do the queries you need for your project; they work, then you forget about them as you work on the rest of the project. These skills are perishable. Left outer join? Yeah, I knew what that was some time ago, but not anymore.

The days of dedicated SQL programers are mostly gone.

I disagree. I've been programming for over 20 years, and SQL has been one of the only constants used at every job. I'm surprised how many other programmers I run into who don't know, or are even scared of learning, SQL. Which is a shame, because data will almost always outlive the program(s).

I used to think this was true, back in the days when I used 'Group by' mindlessly and fiddled with the syntax dumbly until I got the right answer. Once you spend some time thinking about the concepts of it all you should do much better at remembering and if you totally forget the SQL syntax that doesn't really matter (Modulo the hideously annoying random differences between the databases themselves, but that's true of anything.)

I completely disagree. A good full-stack dev knows SQL pretty well: at least well enough to be able to work through the general questions in the document. Good SQL skills are essential, and the lack of good SQL developers I would attribute mostly to ORMs that abstract it away so mediocre developers can mostly ignore it (which makes the job a whole lot harder for people that actually DO know how a relational database works, and give it the respect it deserves)

I really don't believe that to be true. For our application(s) at work, the database is always the hardest part to scale and therefore also the biggest performance bottleneck, and the difference between good and bad SQL can be orders of magnitude of performance impact (on the query itself and on locking up the DB for other queries).

Only in a "Every database has 2 TB of memory and 64 cores"-world is SQL (and database design) a negligible skill.

I'm sure data scientists and such probably use SQL at least somewhat regularly.

What I forget about SQL is the little inconsistencies between platforms. It's a standard with a dozen carve-outs. Do I use TOP or LIMIT? What's the wildcard syntax?

Many data analysts use exclusively SQL and R/Python

I think these are great. But I think there should be some representation around locking / concurrency / deadlock topics. Those tend to be the hardest because you can’t clearly recreate the right/wrong answer in a local test environment. Speaking as a person who waited far too long in his career to fully appreciate these topics, I wish I had been pushed to learn them much earlier.

Behaviors around locking and concurrency are not standard. Not only do they vary widely across RDBMS implementations, they also vary depending on how you configure a specific RDBMS implementation in many cases.

Some databases use lock structures that automatically detect and resolve deadlocks, so from a user standpoint there are no deadlocks but deadlock resolution has visible side effects that are implementation defined.

Yeah but the functions and style used in these answers vary across implementations (date_trunc, etc.)

I highly recommend the SQL Cookbook for newer SQL devs as a quick reference for comparing how often this is the case for even trivial problems across RDBMSes.

Also, for your language, I wouldn't expect all mid-tier SQL devs to be able to write a recursive CTE from memory (though it's useful, you can just look it up again), but something like breaking apart a query plan on your platform of choice is way more important, as that muscle allows you to tell if your CTE was crap or not.

I suspect more SQL is written to support analytics than online systems. Locking and concurrency are used in a specific type of application, namely OLTP.

I learned how much when I joined a big tech company. The devs don’t write sql unless processing logs in the warehouse. But everyone from PM to support to data science and marketing all write sql.

To your point, I personally write SQL for analytics and product/business purposes, not system monitoring

Hey, HN! Since I couldn't find a good resource online for the self-join and window function SQL questions I've encountered over the years in interviews, I made my own study guide which I'm now sharing publicly as a Quip doc. Would love your feedback or thoughts!

Leetcode has some SQL problems that I’ve found useful in interview prep: https://leetcode.com/problemset/database/

Great resource! I have personally found the top Leetcode SQL questions to be hit-or-miss without curation (too easy, too hard, doesn't cover topics like window functions / self-joins like I would encounter in interviews, etc.) but you're right, it is a great, interactive interview resource

I enjoyed reading this. My feedback is that although a few of the problems had links to blog posts that provided some analysis of the problem and solution most of these did not. I think if you added a short analysis or explanation of what makes the problem tricky and the thinking behind the solution it would be very beneficial.

For cumulative averages why not use window functions?

Good catch -- I can add that in as an alternate solution. Thanks!

Of course!

Now the thing to do is for me to recreate your list on github.

Yes -- making this a proper webpage with an embedded SQL editor like selectstarsql.com does would be a good next step

Are there any good ways to embed SQL in markdown? Like JupyterLab notebooks?

Yes, Azure Data Studio allows you to embed SQL in notebooks right out of the box: https://docs.microsoft.com/en-us/sql/azure-data-studio/noteb...

Thanks! I'll have to check that out.

Great resource. Thanks for sharing. Do you have any more resource recommendations besides the ones in the article? Maybe a book?

We try to surface the best SQL interview questions at Interview Query (https://www.interviewquery.com/)

Hello, how come you chose to use quip over Google docs? I'd expect you to choose the latter since you work for Google.

Yeah, it's because (1) I started this doc over a year ago when I was still at Salesforce (which owns Quip and whose employees use it) and (2) AFAIK Google Docs still doesn't have native code block support :/

I think SQL is a language to be known by every programmer. With the right query, you may solve a problem that may take 100 lines in other languages.

It is so useful and reliable, and it does not change every year.

With the extreme popularity of pandas, I think a lot of Python programmers would be amazed at how clean and easy to read SQL queries look, compared to the (downright) mess that's being written to query pandas dataframes.

Depends on the transformation. Dropping a couple of columns in a wide dataset is extraordinarily ugly in sql vs anything else, for example.

...how is it ugly in SQL? Don't SELECT those columns. Or, if you need to do multiple things with the dataset sans those columns, use a CTE of a query that doesn't SELECT those columns. Or a view.

It seems like SQL is only uglier if you're used to "SELECT *"ing, which is asking for trouble--in SQL or when doing the equivalent with other row/column data manipulation tools.

I don’t see what the issue is with using something like spark’s drop over using select.

Just don't pick them in the first place?

Though I'll grant you that extremely wide data is a lengthy experience to write SQL for.

Agreed! Pandas is great but at this point instead of using logic to solve data structuring tasks I often find myself googling for an optimized built-in Pandas method to help me out. Leads to less elegant code -- not sure if it is less readable though.

Well, R has packages that buy you similar functionality, e.g. dplyr.

Dplyr is IMO the best thing to come from R. The design is so much simpler than pandas and the operations mirror SQL operations so closely.

For the second one, it seems most natural to reach for exists, or something (I have not tried this code..)

  select node
       , (case when parent is null then 'Root'
               when exists (
                 select * from tree c
                 where c.parent = tree.node
               ) then 'Inner'
               else 'Leaf'
         end) "label"
    from tree
EDIT: also, in the fourth, the window function needs an ordering for the sum to actually be cumulative. Something like

  sum(cash_flow) over (order by date) "cumulative_cf"
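For reference, here's how the running total behaves in SQLite (an illustrative table; SUM only accumulates across rows when the window specifies an ordering, since the default frame is then unbounded-preceding-to-current-row):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cash_flows (date TEXT, cash_flow INT);
    INSERT INTO cash_flows VALUES
        ('2020-01-01', 100), ('2020-01-02', -50), ('2020-01-03', 30);
""")

# ORDER BY inside the window turns SUM into a running total.
rows = conn.execute("""
    SELECT date,
           SUM(cash_flow) OVER (ORDER BY date) AS cumulative_cf
    FROM cash_flows
""").fetchall()

print(rows)  # [('2020-01-01', 100), ('2020-01-02', 50), ('2020-01-03', 80)]
```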

A subquery in the from-clause is usually lighter than the select-clause. I just so happen to have a table with this structure and hundreds of thousands of rows. This way runs in half the time. It is surprising, in fact, that the difference isn't wider, because Postgres's explain-command says it costs 1/100 as much.

  select t.node
       , case when t.parent is null then 'Root'
              when p.parent is null then 'Leaf'
              else 'Inner'
         end as label
    from tree t
        left join (select distinct parent from tree) p on t.node = p.parent
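The left-join variant can be sanity-checked in SQLite against a tiny hypothetical tree (a node that never appears as anyone's parent comes back with a NULL from the join, hence 'Leaf'):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tree (node INT, parent INT);
    -- 1 is the root; 2 and 3 are its children; 4 is a child of 2.
    INSERT INTO tree VALUES (1, NULL), (2, 1), (3, 1), (4, 2);
""")

# Left-join against the distinct set of parents: a node that never
# appears as a parent gets no match, so p.parent is NULL -> 'Leaf'.
rows = conn.execute("""
    SELECT t.node,
           CASE WHEN t.parent IS NULL THEN 'Root'
                WHEN p.parent IS NULL THEN 'Leaf'
                ELSE 'Inner'
           END AS label
    FROM tree t
    LEFT JOIN (SELECT DISTINCT parent FROM tree) p ON t.node = p.parent
    ORDER BY t.node
""").fetchall()

print(rows)  # [(1, 'Root'), (2, 'Inner'), (3, 'Leaf'), (4, 'Leaf')]
```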

Great article. I can't help but feel like SQL is a poor choice for some of this stuff, though. More often than not, I find it much easier to pull the raw data into memory, and use a higher level language to do these sorts of queries. I am all for knowing the intricacies of SQL, as the cost for not can be very high, but I'm curious for your opinion here.

For small datasets, pretty much any approach will work. Once you hit hundreds of millions of records (which isn't even that big of a dataset), SQL still performs well on pretty modest hardware.

What do you mean by SQL?

In my experience, once you're hitting hundreds of millions of records, implementation details of your database engine will start to matter. A database designed for transactional workloads like Postgres will start to choke on aggregate and window functions, often taking minutes to run instead of milliseconds. A columnar database like Redshift (which exposes a SQL interface) will breeze through it without a sweat.

Of course - I was comparing SQL to loading all the data in memory and then using one's programming language of choice to do the hard work. Even a poor schema in a row-based database will outperform what I described after a certain number of records.

The first example "month over month" will never result in a large dataset if you just do count(distinct user_id) ... group by month order by month. There's only max 12 values per year.

That query is much simpler, safer, faster, no self-join or windowing necessary, and you can properly handle missing months in your dataset in your higher-level language (which the provided solution doesn't do BTW).

And you're still getting the performance boost from the DB indexes for grouping and sorting.
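The simple aggregation the parent describes might look like this (SQLite, with a hypothetical logins table; note the result set can never exceed one row per month):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE logins (user_id INT, ts TEXT);
    INSERT INTO logins VALUES
        (1, '2020-01-05'), (2, '2020-01-20'), (1, '2020-01-25'),
        (1, '2020-02-10'), (3, '2020-02-11');
""")

# One row per month at most: tiny result set, no self-join needed.
# Gap handling (missing months) is left to the calling language.
rows = conn.execute("""
    SELECT strftime('%Y-%m', ts) AS month,
           COUNT(DISTINCT user_id) AS mau
    FROM logins
    GROUP BY month
    ORDER BY month
""").fetchall()

print(rows)  # [('2020-01', 2), ('2020-02', 2)]
```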

These are interview questions first and foremost, so always going to be in an artificial context. But for real-world work I would use SQL plus the relevant amount of aggregation/joining to pull rows into memory, then do the remaining manipulation in the higher-level language. "The relevant amount of aggregation/joining" is obviously a moveable judgement call based on experience, but one thing I have found is that the more complex your SQL becomes, the easier it is for cardinality errors to creep in, and of course in general you can't easily see the intermediate results, so dropping into a high-level language can also be safer and less error-prone.

That's fine for tiny amounts of data, but it's also how you end up with (for example) Ruby taking four minutes to do what a carefully-crafted SQL query can do in four seconds (yes, this stuff happens a lot in the wild)

Yah, somebody handed me a ruby script that looped over a query and was causing a whole bunch of other processes to fail because the result wasn't available in a timely way. My boss was impressed by my having reduced the run time from 22 minutes to 5 by pushing that join into the database.

I work in R regularly and do most of my more complex data manipulation in R whenever I can, but most of these examples are simple enough that I think it makes sense to keep them in SQL. The only ones that struck me as really weird to do in SQL were the cumulative cash flows and the "histograms"/binning one, though the ones requiring window functions might also be slightly easier in R than in SQL.

Also, a lot of these would need to be done in SQL in practice since the data wouldn't fit in memory. Any solution that requires loading the table [login (user_id, timestamp)] into memory probably won't scale very well!

Good q! +1 to the other replies on this comment. I have two points worth mentioning:

(1) The flavor of SQL I use at work supports macros, which are functions that can take in parameters like a function in R/Python might. So, the SQL is "turbo-charged" in that sense and some of the value-added of switching over to Python/R is diminished.

(Big Query has UDFs, which seem similar: https://cloud.google.com/bigquery/docs/reference/standard-sq...)

(2) Like I mentioned in the doc, I personally use the SQL in these practice problems for ETLs on dashboards showing trends. AFAIK, much easier/efficient to write metrics for daily ETLs in SQL than R or Python, especially if these are top-line metrics like MAU.

Besides the many responses about data size, the premise of your question is baffling to me. This is exactly the kind of stuff SQL is for, it's designed specifically for doing this kind of work. Even on a smaller dataset, the "higher level" languages you speak of will certainly be able to produce the same results, but their implementations will almost certainly be more complex, with more points of failure, a great deal less flexible (what if the dataset grows?), and a whole lot more verbose... and all you get for all of those downsides is a more familiar language.

I used to think that too. This is a good talk about how smart SQL query engines are. It is an hour though: https://youtu.be/wTPGW1PNy_Y

No questions/examples featuring recursive CTE's? They tend to come up in anything involving queries over trees or graphs. They're also a relatively new feature where having some examples to show how they work may be quite helpful.
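For anyone who hasn't used one, a minimal recursive CTE walking a hypothetical tree table looks like this (runnable in SQLite; the anchor member seeds the recursion and the recursive member joins children onto the rows produced so far):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tree (node INT, parent INT);
    INSERT INTO tree VALUES (1, NULL), (2, 1), (3, 1), (4, 2);
""")

# Anchor member selects the root; the recursive member repeatedly
# joins children onto rows produced so far, tracking depth.
rows = conn.execute("""
    WITH RECURSIVE descendants(node, depth) AS (
        SELECT node, 0 FROM tree WHERE parent IS NULL
        UNION ALL
        SELECT t.node, d.depth + 1
        FROM tree t
        JOIN descendants d ON t.parent = d.node
    )
    SELECT node, depth FROM descendants ORDER BY node
""").fetchall()

print(rows)  # [(1, 0), (2, 1), (3, 1), (4, 2)]
```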

Took me some time to wrap my head around writing a recursive query. Also had to study the docs a while to get "merge into" to work right, for doing combined insert/update.

One of the first tasks I give associate DBAs is writing a recursive CTE to get a blocking chain. It’s an interesting and useful exercise.

I flip back and forth between deep diving into (in my case) SQL Server skills and .NET manipulation. In the world I live in, it makes the most sense to do set-based manipulation in SQL and logical entity-based logic in C#. I work in a unique enterprise niche that has about 4 options, based on either Java or .NET. SQL knowledge definitely gives you a leg up for complex reporting, and there are cases where I love being able to debug super quickly when comparing inputs to outputs. However, when I run into a SQL script that is 5000+ lines long and have to debug it, I much prefer the .NET side of the fence. Should someone ever come up with a bridge that gives you .NET-level visibility into the active datasets in a SQL query, I would pay them 4 figures without question...

I take it you've already looked into CLR stored procedures? I haven't used 'em in a whole lot of depth, but they do seem to at least give some of the tools for a "best of both worlds" approach.

5000 + line SQL scripts????

What would one of these do?

Legacy code from Powerbuilder days. But it does things like calculate month over month inventory, Mark to Market value, Risk, or things related to Energy and commodities trading.

The ones I've seen handle tons and tons of edge cases or do extra validation. The SQL programming languages are pretty verbose in terms of syntax.

Very interesting. I never really "got" declarative languages, I remember a very long time ago I was working with Oracle and you could see the "execution plan" for your SQL queries. I kept wondering "why can't I build my queries directly with this?" - it seems so much simpler to my brain than SQL itself.

> I never really "got" declarative languages

As I responded to another comment, I have to wonder if that's because you haven't spent the time to become proficient enough to appreciate them. I understand that a lot of programmers decry "magic" and want to get at the underlying steps and components that produce a given result, but literally all of us work at some level of abstraction above that because it would be impossible not to. None of us are directly transforming sand with electricity into computed values.

You're describing being more comfortable with an imperative set of statements than a declarative description of results. Assuming you write automated tests, think of the declarative approach as writing the description (rather than the implementation) of a test, except that the description is less freeform and must conform to a certain syntax and structure so that a machine can write the test for you.

It's "magic", but it's only so much magic because the scope of what it can do for you is limited and well defined.

Assuming you write automated tests, think of the declarative approach as writing the description (rather than the implementation) of a test

Except that you know/expect how exactly it will be executed and if it doesn’t, maybe test should not really pass. SQL in general is full of this “we write declarative queries expecting this exact imperative result and investigating if it’s not”. It’s much like an interview question: it may have many different answers, but you must provide the one that satisfies a grumpy plan builder guy. I know sql and use it when it’s shorter/more descriptive, but sometimes you just want to take them rows and do them things, without leaving a database process and its guarantees.

Not much db experience, but the fact that such powerful engines (i.e. acid, indexing, good ods, pooling, etc) are always hidden beyond some cryptic uppercase frontend with a weak execution layer always bothered me. Just give me that postgres-grade index scan directly in C, dammit. /rant (inb4 just write some parts as a sp in a language of choice)

> Except that you know/expect how exactly it will be executed

You're writing bad tests. Do yourself a favor and write tests that pass "it [satisfies some human expectation]". Stop writing tests that care about how it's implemented.

I would like to know what to do if it fails to deliver results on time on a big enough dataset. Then someone will say hey, don’t use COUNT DISTINCT? But why, tests showed no issues and I don’t care how it’s implemented. What to do with that loop?

You're hitting on the difference between unit tests and integration tests. Unit tests should generally not try to test load limits or timing, while integration tests should have the freedom to load millions of rows into a table and verify it doesn't take too long to query the table. Unit tests should be quick enough to run before every commit, while integration tests run on a shared server with enough resources to exercise limits. Most interesting projects need both unit and integration tests.

Yes, but the original discussion was declarative vs imperative, and tests were introduced only as an analogy by gp. I just showed that it doesn’t really apply, it was not a question on how to test in general.

You can still write your test to express user intent (`it completes in [time window a user expects]`), generate the load your representative of what your user would encounter, and test that. If your test looks for `COUNT DISTINCT`, you aren't verifying that the results are delivered in the time expected, you're testing something else entirely and you may have no idea what a user will experience.

You can. First, find or implement a library in your language of choice to store indexed data on disk. Then implement some very hairy logic to make sure your read and write operations are ACID. Then choose an algorithm for the query you'd like to write. And then write it.

I don't intend to be glib, but rather to illustrate some of the benefits of a database platform like modern RDBMSes. These things are all implemented and battle-tested for you.

If you would prefer a more fine-grained control over execution of queries, you could probably get very far by starting with the SQLite code base and working at a more primitive level through its library.

You can't write query plans directly because you would have to manually take into account a number of factors that the query planner considers for you automatically, such as size of the table, statistics on the distribution of values, whether there are indexes that could be used to speed the query, and so on. Some SQL dialects (like Microsoft's T-SQL) do give you some ability to influence the decisions the query planner makes, though, like forcing it to use specific indexes, or forcing it to use scans or seeks.

As a practical matter, when you can't control the optimizer or affect how the DBAs configure things, you break your huge query into multiple ones with temporary tables.

This is from the perspective of trying to get queries to run in 5 minutes instead of 30 minutes instead of hours or days or forever, not brief transactions measured in milliseconds. And it's not something I figured out on my own, but by paying attention to the guy who never talked but was consistently 10x faster in producing reports than anyone.

The thing you should not do, that I also saw people do, is use procedural PL/SQL or T-SQL to process things in a loop - that can be orders of magnitude slower.

And then comes Oracle and insists on applying equality filters first, because, duh, they are fast, and never take any of those other details into consideration.

Honestly, Postgres does it right - you can enforce your query plan to any level of detail you want.

How do you do that? Postgres doesn't even offer a way to enforce that specific indexes get used during a query as far as I know.

Does PostgreSQL have plan hints now? I thought they were opposed to them for fear they become unmaintainable and hard to read.

If you really want to, the pg_hint_plan extension can be used for this - though I would use it very sparingly, if at all:


(available on some cloud providers as well)

Data changes, so the best query plan changes with it. SQL was built to handle this data-dependent environment.

Is one column composed of mostly one or two values? Then an index lookup on that column is not very optimal, and the database can use something else.

There is more than one type of join (INNER, LEFT OUTER, etc.), and more than one join algorithm (nested loop, merge, hash, etc.). All these change based on the data. Even the join order can have a huge impact on query time, and needs to adapt based on the number of rows you'll pull from each table.

A lot of SQL queries are built from templates, or built by ORMs. Optimisations are critical to turn these templates queries into something efficient.

SQL can also be very expressive, a "NOT EXISTS (SELECT 1 FROM thing WHERE foobar)" could be more readable than doing a join and where clause.

Though interestingly, you get a much more declarative query style with nosql databases and key-value stores. So there are alternatives out there.

I always thought that I sucked at SQL, but if these are medium to hard then I'm actually not that bad.

SQL for Smarties is the book that has proven, without a doubt, I am bad at SQL. https://www.amazon.com/Joe-Celkos-SQL-Smarties-Programming/d...

If you want to get undeniably good with SQL, this is the book.

These are easy enough to do in your head.

Also I would do a lot of them differently. And I don't think the answer to #6 is valid (in T-SQL at least).

One problem with this article is the number of times the solution involves COUNT(DISTINCT).

One of the best SQL interview questions is "Explain what is wrong with DISTINCT and how to work around it".

What is wrong with DISTINCT?

DISTINCT generally requires the results to be sorted, which has O(n^2) worst-case performance, so it can be a big performance hit on a query. It is best to make your database structure such that queries only return distinct data, e.g. by disallowing duplicates.

If your sorting algorithm degrades to anywhere near O(n^2) in pathological cases, you're doing something wrong, even if the fix is just an operations limit that detects pathological cases and falls back to an in-place mergesort. Tail latency and containing pathological data are quite important if there's any interactivity.

Nearly every time, it's a symptom of bad data normalization.

But every time, it interferes badly with any kind of locking (that's DBMS dependent, of course), and imposes a high performance penalty (on every DBMS).

“Think before you DISTINCT”

In order to determine the distinct items, the items need to be deduplicated. Generally that's done in only two ways: a hash table that skips items already seen, or a sort followed by a scan that skips over duplicates. The hash table is O(1), but the sort is easier to make parallel without sharing mutable state and has more established algorithms to use when spilling to disk.

There is a third way: keep the data pre-sorted in the database (via an index).

It covers up bad queries, so you may not see an underlying data duplication problem.

Often better to group explicitly so you know what's actually going on.

> It covers up bad queries,

Bingo. I used to work with a guy who would see duplicate results and just throw a distinct on his query. I had to keep on him to fix his queries or explain why distinct was correct in this case. My default is that distinct is almost always not the solution.

seriously.. I'm building out some functionality using plpgsql and have used it. This is going to be haunting my dreams

there is a possible performance hit [1]. Also, it could mean that the data granularity has not been modeled well if there are duplicate rows.


Huh? If you have a table with attributeid, sampleid and value how would you count how many samples have a value in any attribute? Exists subquery?

In your example you must also have another table, 'sample' with all the samples. So yes, you would use an exists or in subquery with the table you suggested.
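A sketch of that EXISTS approach for the attribute/sample example above (hypothetical tables; the EXISTS subquery can stop at the first matching row per sample, so no dedup pass is needed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample (sample_id INT PRIMARY KEY);
    CREATE TABLE attribute_value (attribute_id INT, sample_id INT, value REAL);
    INSERT INTO sample VALUES (1), (2), (3);
    INSERT INTO attribute_value VALUES (10, 1, 0.5), (11, 1, 0.7), (10, 2, 0.9);
""")

# Count samples with a value in any attribute, without COUNT(DISTINCT):
# sample 3 has no attribute values, so only 1 and 2 are counted.
(count,) = conn.execute("""
    SELECT COUNT(*)
    FROM sample s
    WHERE EXISTS (
        SELECT 1 FROM attribute_value av
        WHERE av.sample_id = s.sample_id
    )
""").fetchone()

print(count)  # 2
```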

The first few answers are unidiomatic where I work. Analytical functions would be vastly preferable to self joins, especially in the case of the join with an inequality that is proposed to compute running totals, which I assume would have horrible performance on large datasets

RDBMS platforms without function indexes means that some of these queries will force a row-by-row execution over your entire table. Enjoy running SELECT ... FROM a INNER JOIN b WHERE DATEPART(a.date, month) = b.whatever on a table with 500 million rows in it.

Thankfully even MySQL has them, at least as of version 8

Yeah, I’m jealous of all the postgresql and MySQL people. SQL Server doesn’t have them and this kind of query would cause pain.

As someone who used to use MSSQL extensively and is now forced into MySQL, trust me, do not be jealous of the MySQL people. 5.7 fixed some of the most egregious issues and 8.0 is adding some features, but the smoldering embers of the dumpster fire are still around.

SQL Server can index a computed field, which is more or less the same thing.

I think I'd be able to do all of these in my daily work, probably not as efficiently, with minor references to syntax guides (I don't use window functions often enough).

In an interview, presumably my logic would, hopefully, shine through minor issues of syntax.

Where would that put me? Maybe "okay to decent" when dealing with "medium-hard" questions?

I would fail utterly at DBA management SQL and stored procedures, my responsibilities skew towards data analysis.

Yeah, I'd be something similar and also wouldn't consider myself a database expert. I think this question list is tilted towards "can you express this complicated abstract transformation in legal SQL" than the practical day to day considerations of building and working with databases.

For me the hardest part of the first question is understanding the acronyms. I think MoM is month on month. But MAU, no idea.

Hey! MAU = Monthly Active Users. Sorry about that, I think this could be described as data analyst/analytics jargon. Good feedback to spell these out, thanks!

I agree, it almost seems they were put there to add artificial noise to the problem. It doesn't seem like you're selecting for the right skills here.

Sorry about that -- in real life you would just ask the interviewer what MAU meant if they didn't spell it out.

The purpose was to make the questions more realistic, since at least in my experience in data analyst interviews the questions are asked in the context of actual business or product situations ... like company leaders, PMs, or others wanting to understand trends in MAU.

Totally understandable, I think the frustration is more towards companies that try to make the questions a bit more obscure, definitely not your fault.

Thanks for sharing, I found it very insightful.

Ah that makes sense. Thanks!

SQL is better than 99% of the nosql alternatives out there.

But one place it falls apart is these time-series data processing tasks.

It's because of its model of unordered sets (or multisets to be more precise, but still unordered). When you look at the majority of those queries and other real-life queries, they almost always involve the dimension of time. Well, that means we have a natural order to the data, so why not use an array data structure instead of an unordered bag that throws the sort order out the window?

SQL realized this and bolted window functions on top of the original relational model. But you still feel its inadequacy when trying to do simple things such as (x a join x b on a.t = b.t - 1) or even the infamous asof joins where you do (x join y on x.t <= y.t).

In array databases with implicit time order both examples are a trivial linear zip operation on the the sorted tables.

In traditional set-oriented SQL it results in a horrible O(n^2) cross join with a filter or, in the better case of an equijoin, in a sort-merge join, which still has to do two expensive sorts on my massive time-series tables and then perform the same trivial linear zip that the array processor would do. Which is also completely useless on unbounded streams.
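For what it's worth, a common SQL workaround for the asof join is a correlated scalar subquery that picks the latest row at or before each timestamp (a sketch in SQLite with illustrative quotes/trades tables; with an index on quotes(t) each lookup is a single descending seek, though it's still one probe per row rather than a linear zip):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE quotes (t INT, price REAL);
    CREATE TABLE trades (t INT, qty INT);
    INSERT INTO quotes VALUES (1, 100.0), (3, 101.0), (6, 99.5);
    INSERT INTO trades VALUES (2, 10), (4, 5), (7, 8);
""")

# For each trade, attach the most recent quote at or before the trade time.
rows = conn.execute("""
    SELECT tr.t, tr.qty,
           (SELECT q.price FROM quotes q
            WHERE q.t <= tr.t
            ORDER BY q.t DESC LIMIT 1) AS asof_price
    FROM trades tr
    ORDER BY tr.t
""").fetchall()

print(rows)  # [(2, 10, 100.0), (4, 5, 101.0), (7, 8, 99.5)]
```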

Also many stackoverflow answers suggest to use binary blobs for time series data and process them locally in your favorite programming language - which points at wrong conceptual model of SQL engines.

Is SQL really so inadequate for real life data problems or have I been living under a rock and Kdb is no longer the only option to do these trivial time oriented tasks?

Have you seen TimescaleDB? https://www.timescale.com/

I've seen it.

Most of these 'time series databases' are for processing of structured logs and metrics to be plugged into you favorite system for monitoring site latency.

Asof join is still an open issue in their issue tracker so it is not usable as a time series database.


If AS OF is a dealbreaker for you, so be it, but I think TimescaleDB is trying hard to be more than a "simple timeseries" aggregator like so many competitors. We're using it to store hundreds of millions of complex IIoT data packets, where a fully normalized packet adds dozens of rows across ~4 tables.

Having it be mostly plain-ol-postgres has been huge for designing and evolving the data model in a way that I'd hate to do without "real" SQL.

I'm sure your product is great for many customers in their domains.

The asof and other time-based joins are a necessity for anything just slightly more complicated than a trivial aggregation at the tail of the stream, which is the majority of use cases in IoT and real-time monitoring.

Without that I can pipe my data to /dev/null giving me even better write performance while also not being able to solve this common financial industry use case.

Sorry, didn't mean to imply I work there, I don't; my employer uses Timescale and I designed our schema.

Timescale has been quite performant for us fetching arbitrary past time windows, but I'll stop the cheerleading there as I think mfreed covered the details far better.

I'd certainly still like to see AS OF support!

(Timescale person here)

Just an observation that the parent post was the unordered "set" nature of tables, and not strictly about ASOF joins.

I should point out there is an alternative reason for this in TimescaleDB (and many other databases): to support loose-time-order writes (or even data backfill), where records don't always arrive in perfect timestamp order.

So some of these engineering decisions are to balance tradeoffs across insert vs. query rates.

In TimescaleDB, for example, we write data to time intervals in arrival order, not timestamp order, in order to achieve higher write rates. On the other hand, we support automated asynchronous rewriting of data to reorder it after it reaches a certain age, according to arbitrary columns.

This can be used to reorder data precisely in timestamp order. It's also highly useful, and commonly used, for composite reorderings: you can reorder on (some_id, timestamp DESC), so that within a time interval (a chunk in TimescaleDB) the data belonging to each some_id is collocated on disk, and within each id it's sorted on disk by timestamp. (Here, some_id might be a unique hostname, device_id, metric_id, stock ticker symbol, etc.)

This allows much faster queries of the form SELECT * FROM table WHERE some_id = foo AND time > X AND time < Y.

We also take a similar approach when performing asynchronous columnar-style compression, where we "segment" together many records belonging to the same id into a single internal row, then order those records array-style within each row, sorted by some column (typically timestamp), before applying a type-specific compression algorithm to them. Again, all metrics belonging to some_id are collocated, ordered by timestamp as an array on out-of-line pages, so that only the requested columns need to be fetched from disk. And this all gets hidden behind a SQL view that makes it look like you are still interacting with a standard row-based table.

Anyway, totally appreciate that we haven't prioritized ASOF JOINs, as our experience is that these typically come up in more specific financial use cases, rather than the vast majority of time-series use cases.

And we also find that our "gapfill/LOCF" functionality can be used to bucket-aggregate data with different ids for related behavior. That is, it's a data modeling choice to store different ticker symbols in different tables (and then ASOF JOIN), and not necessarily the only way to achieve the desired goals.

https://docs.timescale.com/latest/api#add_reorder_policy https://docs.timescale.com/latest/using-timescaledb/compress... https://docs.timescale.com/latest/api#time_bucket_gapfill

And thanks clatonjy for your use!

>After an interview in 2017 went poorly — mostly due to me floundering at the more difficult SQL questions they asked me

Could be you dodged a bullet. A company with advanced interview questions may have some ugly SQL code. For jobs that lean heavily on SQL, I expect candidates to know things like windowing & dynamic variables, in SQL & an ORM library. For SWE's, I feel basic SQL is fine.

I agree. Using advanced SQL as a hiring signal would be silly in that it just discards a lot of programming candidates for little benefit.

I'd be happy if they knew what EXPLAIN was, since that impacts production on a daily basis.

Also, the OP is basically "fighting the last battle" again. Most interviews don't filter out candidates based on SQL.

Source: DBA.
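For anyone who hasn't used it: EXPLAIN asks the planner to show how it would execute a query, which is how you catch missing-index table scans before they reach production. Postgres's EXPLAIN / EXPLAIN ANALYZE is the canonical version; even SQLite has an equivalent. A minimal sketch with a made-up `events` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INT, ts TEXT)")
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")

# Ask the planner how it would execute the query; with the index in place
# it reports an index search instead of a full table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = 42"
).fetchall()
print(plan[0][-1])  # e.g. "SEARCH events USING INDEX idx_events_user (user_id=?)"
```

Drop the index and re-run, and the plan flips to a full scan of `events`.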

Yes! Window functions and a decent demonstration of handling datasets with varying granularity, on top of the basic join/aggregation assessment, is all that is needed for SQL in interviews. And if you're an engineer, attention to redefining the query plan through appropriate use of temp tables. Anything else is a poor use of precious interview time for both parties, in my opinion.

This is neat, but it would be really helpful if these examples included some kind of sample output of what was expected.

Some of them have sample output but yes, it would be helpful to extend this to all examples. Good feedback!

Then you can expect your candidates to provide exactly that answer.

The problem with hard SQL problems is that they often fall into one of two camps:

1. use some underlying implementation detail of the particular RDBMS or proprietary extension to the standard

2. are essentially tricks, like the typical ridiculous interview problem designed to "make you think outside the box". Yes, you can do almost anything in SQL but often you should not.

I get that the perspective here is data analysis, where you probably need to know more SQL than the standard developer, but I still feel you should be testing for a solid understanding of the basics, understanding of the relational-algebra concepts being used, and awareness of some typical gotchas or bad smells. That's it. They'll be able to google the precise syntax for that tricky by-date query (when you're not guaranteed to have data for every month, or whatever) on demand.

A bit off-topic, but this is another one of those sites that show nothing without JS, and looking at the source reveals something a little more amusing than usual: content that's full of nullness.

At least they do it with a bit of flair though: the thin blue "loading" bar works without JS!

I'd ding the over-use of CTEs, when subselects are often more appropriate and better-performing. Kind of a "every problem a nail" thing going on here.

Most people have no idea what an optimization fence is and opt for CTEs because they yield "cleaner" queries, despite nearly always killing performance.

I've always found explicit temporary tables, where I can add indexes, are often a great solution for performance and readability.

Do you mean subselects in the SELECT clause or in the FROM? Usually subselects in the SELECT clause have been the source of dogslowness. Reorganizing them into the FROM clause sped up queries 1,000 times. And I thought CTEs were just a way of moving subselects from the midst of the FROM clause to the top of the statement, where they're easier to understand.

In the FROM clause -- although I understand future versions of pgsql will address this, the "WITH foo AS ... " approach can be horribly wasteful, materializing all the results, whereas "SELECT * FROM ( SELECT f(x) AS a FROM baz ) qq limit 1" lets the optimizer do its thing.

I mostly agree. I'd start with the simplest possible query and move from there. I enjoy SQL, but once I start querying large datasets or doing anything remotely complex, the 'right' solution is the one that is fastest. And fast always depends. Typically I have to write up a solution a few different ways, test and tweak.

At least on recent versions of Postgres, CTEs are often equal in performance to subqueries (if they're side-effect free and non-recursive they get inlined).
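For what it's worth, the rewrite between the two forms is purely syntactic when the engine inlines the CTE; whether it materializes instead is engine-specific (Postgres 12+ inlines side-effect-free, non-recursive CTEs unless you write `AS MATERIALIZED`). A sketch in SQLite via Python showing the two equivalent forms, with a made-up `orders` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO orders (amount) VALUES (?)", [(10,), (25,), (40,)])

# CTE form: the named subquery is hoisted to the top of the statement.
cte = conn.execute("""
    WITH big_orders AS (SELECT * FROM orders WHERE amount > 20)
    SELECT COUNT(*) FROM big_orders
""").fetchone()

# Equivalent inline subselect in the FROM clause.
sub = conn.execute("""
    SELECT COUNT(*) FROM (SELECT * FROM orders WHERE amount > 20) big_orders
""").fetchone()

print(cte, sub)  # (2,) (2,)
```

Same results either way; the performance question is entirely about what plan the optimizer is allowed to build.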

Out of curiosity: what is the use case for SQL window functions in application programming? Unlike most SQL, it doesn’t seem to reduce the order of the data coming back from the server, nor do anything especially faster than can be done on the client - and has the disadvantage of extra load on the database (which is harder to scale).

Is it only useful for ad hoc/analytical queries, or am I missing something?

It allows you to "group by/aggregate/etc." on a row-by-row basis rather than once over the entire table. So while GROUP BY defines a set to aggregate over the entire table (and hence you get 1 row back), a window function creates criteria by which each row gets a set (aka window) to aggregate over for that particular row. In addition, you can define a "frame" over that "window" for even more refined aggregation. I suppose window functions are used more in the OLAP space than OLTP, but they're a very useful part of SQL. In my opinion, window functions are the most impressive and important part of the language.

People can define, describe and provide examples but with SQL, it won't sink in until you try it out yourself and have the "Ah ha" moment. If you have a database server and some test data, try writing a few window function queries or try some online examples.
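The GROUP BY vs. window distinction described above, sketched in SQLite (3.25+ for window functions) via Python with a made-up `sales` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100), ("east", 300), ("west", 50)])

# GROUP BY collapses each region to a single aggregated row.
grouped = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# A window function keeps every row, attaching the per-region total to each.
windowed = conn.execute(
    "SELECT region, amount, SUM(amount) OVER (PARTITION BY region) "
    "FROM sales ORDER BY region, amount"
).fetchall()

print(grouped)   # [('east', 400), ('west', 50)]
print(windowed)  # [('east', 100, 400), ('east', 300, 400), ('west', 50, 50)]
```

Three rows in, three rows out for the window version; that's the whole "Ah ha" in one query.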

Why do you think it's not "especially faster" than doing it on the client?

The data is already sorted, partitioned, in memory... adding a moving-average calc or a max value per category is certainly faster than shipping the set to the client, recreating it in an intermediate structure, and then redoing all that partitioning, sorting, etc. to calculate the new value client-side.
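For the moving-average case, the frame clause does the work server-side. A sketch in SQLite (3.25+; the same syntax works in Postgres), with a made-up `ticks` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (ts INT, price REAL)")
conn.executemany("INSERT INTO ticks VALUES (?, ?)",
                 [(1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0)])

# 3-row trailing moving average: each row averages itself and the two
# preceding rows (fewer at the start, where the frame is still filling).
rows = conn.execute("""
    SELECT ts, price,
           AVG(price) OVER (ORDER BY ts
                            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS ma3
    FROM ticks
""").fetchall()
print(rows)  # [(1, 10.0, 10.0), (2, 20.0, 15.0), (3, 30.0, 20.0), (4, 40.0, 30.0)]
```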

You can certainly use them in app dev for things like:

* figuring out how many things are ahead of you in a queue

* snapshotting (what's changed since the last time you were here)

* comparing something to some other sample for outliers or inconsistencies

These may be quasi analytical still, but ultimately they can manifest as properties of an object model like any other property to be developed against.
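The queue-position case can be sketched like this (SQLite via Python, hypothetical `queue` table): ROW_NUMBER over arrival order gives each entry a position, and "things ahead of you" is just position minus one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queue (name TEXT, enqueued_at INT)")
conn.executemany("INSERT INTO queue VALUES (?, ?)",
                 [("alice", 100), ("bob", 105), ("carol", 110)])

# Number everyone by arrival time, then look up a single user's position.
pos = conn.execute("""
    SELECT name, position - 1 AS ahead_of_you FROM (
        SELECT name,
               ROW_NUMBER() OVER (ORDER BY enqueued_at) AS position
        FROM queue
    ) WHERE name = 'carol'
""").fetchone()
print(pos)  # ('carol', 2)
```

That value can then surface as an ordinary property on whatever object model the app exposes.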

I suppose it depends on the language. Normally I’m writing in either Java, C#, Rust or C - and in those I’ve never found that having the DB run a moving average to be any faster, provided the result comes back in the right order. Indeed, if I need to use another column in the result set, the deserialisation overhead means it’s normally slower. For reference I normally write financial time series processing code which is where window functions ostensibly could be quite useful.

However certainly if your app is written in Python or Ruby I can see there being a big difference.

The reason I mention analytics is more to do with the ease of scaling out the DB than the nature of the queries. In busy OLTP databases usually one wants to keep the work off the database because they’re difficult and expensive to scale.

I use them for cursor-based pagination. You can use rank and a WHERE clause to apply the limit. You can also compute the max rank to determine whether there is more data to fetch after the limit. Finally, if you do left joins where you may have multiple rows per object, use dense rank instead of rank.
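A sketch of that pattern (SQLite via Python, hypothetical `items` table): rank the rows, cut at the limit, and carry a window count alongside so one query tells you both the page contents and whether another page exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, score INT)")
conn.executemany("INSERT INTO items (score) VALUES (?)", [(5,), (9,), (7,), (3,)])

LIMIT = 2
# DENSE_RANK numbers the rows without gaps even if joins duplicate them;
# COUNT(*) OVER () rides along so we can tell if more rows remain.
page = conn.execute("""
    SELECT id, score, rnk, total FROM (
        SELECT id, score,
               DENSE_RANK() OVER (ORDER BY score DESC, id) AS rnk,
               COUNT(*) OVER () AS total
        FROM items
    ) WHERE rnk <= ?
    ORDER BY rnk
""", (LIMIT,)).fetchall()
has_more = page[0][3] > LIMIT
print(page, has_more)  # [(2, 9, 1, 4), (3, 7, 2, 4)] True
```

For a true cursor, the WHERE clause would filter on the last-seen rank from the previous page instead of starting at 1.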

Off topic but... anyone knows what they used for the rich text editor?

It uses React but I imagine there is some other library like ProseMirror here.

It's likely all/mostly custom. Quip is Salesforce's Word/Docs competitor. (It's pretty good, tbh.)

It rendered that doc pretty fast although editing is not enabled.

I've never used Office online but Google docs is terrible performance wise.

This is more like "we want to make sure you can use recursive CTE" questions. To add some variety to medium-"hard" SQL questions you could add some lateral joins, window functions (especially lag if you want to get creative) and compound logic statements in where clauses.

For creating a table of dates, what are our thoughts on:

select *
from (
  select adddate('1970-01-01', t4.i*10000 + t3.i*1000 + t2.i*100 + t1.i*10 + t0.i) selected_date
  from (select 0 i union select 1 union select 2 union select 3 union select 4
        union select 5 union select 6 union select 7 union select 8 union select 9) t0,
       (select 0 i union select 1 union select 2 union select 3 union select 4
        union select 5 union select 6 union select 7 union select 8 union select 9) t1,
       (select 0 i union select 1 union select 2 union select 3 union select 4
        union select 5 union select 6 union select 7 union select 8 union select 9) t2,
       (select 0 i union select 1 union select 2 union select 3 union select 4
        union select 5 union select 6 union select 7 union select 8 union select 9) t3,
       (select 0 i union select 1 union select 2 union select 3 union select 4
        union select 5 union select 6 union select 7 union select 8 union select 9) t4
) v
where selected_date between '2016-01-01' and now()

This works in MySQL / MariaDB.
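On engines with recursive CTEs, the digit-table trick above collapses into something much shorter. A sketch in SQLite via Python (the same shape works in Postgres and MySQL 8.0+, modulo date-arithmetic syntax):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Generate every date in a range by repeatedly adding one day
# until the upper bound is reached.
dates = conn.execute("""
    WITH RECURSIVE dates(d) AS (
        SELECT DATE('2016-01-01')
        UNION ALL
        SELECT DATE(d, '+1 day') FROM dates WHERE d < '2016-01-10'
    )
    SELECT d FROM dates
""").fetchall()
print(len(dates), dates[0], dates[-1])  # 10 ('2016-01-01',) ('2016-01-10',)
```

Widen the bounds and the recursion simply runs longer; no cross-joined digit tables required.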

Oh yeah that’s super readable

Worth noting that this isn't all ANSI-SQL... e.g. I'm pretty sure WITH is a Postgres thing?

CTEs are implemented by most (all?) major RDBMS platforms and were introduced in the SQL:1999 standard revision.

- SQL:1999 https://en.wikipedia.org/wiki/SQL:1999#Common_table_expressi...

- SQL standardization https://en.wikipedia.org/wiki/SQL#Interoperability_and_stand...

The WITH clause, otherwise known as Common Table Expressions, is in ANSI SQL99. Common table expressions are supported by all of the major databases: PostgreSQL, Microsoft, Oracle, MySQL, MariaDB, and SQLite. They are also in certain minor ones, like Teradata, DB2, Firebird, and HyperSQL. Recursive WITH clauses are especially useful.

Worth noting that MySQL only got CTE's in 8.0 while something like MSSQL has had them since ~2005. The reason this is important is that something like AWS Aurora only supports up to MySQL 5.7, and thus no CTEs. :/

Not anymore. Other implementations, like MariaDB and SQLite, have adopted common table expressions. They're basically syntactic sugar for subqueries in most implementations, but they can make some queries much more readable.

the WITH clause is now in SQL Server and Oracle

I don't work with Oracle, but SQL Server has had CTE's since 2005.

My first reaction was that it was a thing in Oracle when I started using it in 2007, and it appears that they introduced the feature in 2002.

Yup those questions are a stretch for me too! Love that though.

Where are the pivot and unpivot questions?

Hmm, haven't seen those a ton at work / in interviews! Would love to see some examples though if you have any.

Mode's SQL tutorial uses SUM(CASE ...) and CROSS JOIN to mimic pivots: https://mode.com/sql-tutorial/sql-pivot-table/

I think they're specific to Oracle and SQL Server. It's a pity because they can be very useful.

PostgreSQL can do it via the tablefunc extension and there's `\crosstabview` which is built into psql.
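The SUM(CASE ...) workaround from the Mode tutorial linked above works on every engine, since it's just conditional aggregation: one conditional sum per pivoted column. A sketch in SQLite via Python with a made-up `sales` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INT, quarter TEXT, amount INT)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(2019, 'Q1', 100), (2019, 'Q2', 200),
                  (2020, 'Q1', 150), (2020, 'Q2', 250)])

# Rows keyed by year, quarters spread across columns: each SUM(CASE ...)
# only counts the rows matching that column's quarter.
pivot = conn.execute("""
    SELECT year,
           SUM(CASE WHEN quarter = 'Q1' THEN amount ELSE 0 END) AS q1,
           SUM(CASE WHEN quarter = 'Q2' THEN amount ELSE 0 END) AS q2
    FROM sales
    GROUP BY year
    ORDER BY year
""").fetchall()
print(pivot)  # [(2019, 100, 200), (2020, 150, 250)]
```

The downside versus a real PIVOT is that the column list is fixed in the query text, so new quarters (or categories) mean editing the SQL.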


