A weakness of these types of SQL questions, however, is that it's near impossible for the interviewer to provide help/guidance; the questions are often know-it-or-don't questions (especially involving window functions w/ uncommon syntax). A data science interviewer years ago mocked me during a whiteboard test for not knowing the ROWS BETWEEN syntax for a window function.
That said, as an IRL data scientist, the number of times I've had to do a SQL self-join in the past few years can be counted on one hand. Same goes for the ROWS BETWEEN syntax for a window function.
If you don't get the offer there's no bullet to dodge, right?
Your pedantry tag has no opening
It wasn't a pedantry tag. And yours is differently misspelled!
Per descriptive linguistics (https://en.wikipedia.org/wiki/Linguistic_description), the assertion that it's wrong to start a sentence with a conjunction is incorrect. Ditto for splitting infinitives, ending a sentence with a preposition, etc.
(Descriptive linguists basically define the rules of language to be how humans actually use the damn language. It turns out such a philosophy saves you from falling into a lot of traps)
In my opinion, parroting prohibitions against split infinitives, ending with a preposition, "snuck is not a word", "'they' is only plural", etc. not only shows a profound lack of education but also an inability to learn, since these so-called "rules" are widely and easily known to be made up out of whole cloth. To my mind, this is a person who absorbed Elements of Style (perhaps even second-hand from a misinformed high school teacher) and called it a day, believing they understand how English is properly written.
Serious red flag for me.
Isn't this a prescriptivist rule? Imagine a highly skilled author who recognizes a younger one that misuses a semicolon or has poor grammar; the older says to the younger, "only write with perfect grammar". By doing so, the younger learns more precisely the elements of the English language and the purpose of each device (semicolon, period, comma, etc.). After, the younger becomes a much better writer, and integrates that new knowledge with his or her existing writing. In this way, prescriptivist rules help inexperienced writers hone in on their weaknesses and draw focus to some facet they lack.
Any harsh prescription is kind of a red flag for me, because you have to judge yourself by that harshness (fall on your own sword, so to speak). Or, as usually happens, you hold some hypocritical double standard or semantic loophole which pardons yourself while incriminating others.
I cannot imagine a highly skilled author who never employs split infinitives. Show me that person.
Regarding semicolons, here is Kurt Vonnegut's opinion on them: "Here is a lesson in creative writing. First rule: Do not use semicolons. They are transvestite hermaphrodites representing absolutely nothing. All they do is show you've been to college."
Is that good advice? Perhaps for some writers it actually is; but really it's not about the semicolon, it's about stilted, bad writing.
<< Isn't this a prescriptivist rule? >>
I was responding to the previous comment "It's also _descriptively_ true that many people regard certain rules as inviolable, which certainly means that prescription is, in some sense, completely reasonable: comply, or we'll refuse to accept you." My direct answer is "It's the prescriptivists 'refusing to accept' who should worry about acceptance, since they are failing on so many levels it's hard to know where to start"
I picked up Elements of Style a few days ago. I couldn't parse it. I was only able to understand it somewhat through intuition and prior knowledge. The sentences were too confusing and out of place for me, with added jargon and stories about the author that I didn't want to know about.
Dodging taxes is active.
Dodging the in-laws because you were in the shower when they dropped by is passive.
Both applications of the verb are in common usage and are therefore valid and obvious enough that you clearly understood the meaning. </pedantry>
Sad. One needs a thick skin to interview, because the manner of rejection is far more detrimental than the rejection itself. I have seen friends cry, unable to believe their luck or themselves. Often, due to negativity bias, it takes a lot of confidence away from an individual and they start to feel inadequate, doubly so if they had prepared super hard and yet failed because they couldn't remember a piece of trivia.
One of the best pieces of interviewing advice I got from a mentor was: interviews aren't supposed to be a pissing contest or a dick-measuring contest, but often are. It changed the way I approached interviews as an interviewee. Also, it's a good thing that I sought out and got this advice well before I started interviewing candidates.
That's really reprehensible.
One point I've made repeatedly here on HN is that technical "interviews" have morphed into a form of entrance exam with very little oversight.
If you read about entrance exams at older institutions and professions (the bar, medical and nursing boards, actuarial exams, and so forth), you'll find that they are often considered among the more stressful events in a person's academic career. They are often (and should be) very rigorous, but I do think a certain unspoken bill of rights has emerged to protect the student as well as the people conducting the exams. For example, entrance exams should have a study path, a known body of knowledge that is getting tested. They should be graded consistently and fairly, by acknowledged experts in the field. There should be a way to file a grievance, and the evaluation metrics must be "transparent" - if not the specific deliberations, then at least the general approach.
Tech interview exams have none of this. They are conducted very, very capriciously, often by people who have limited skills and experience - even if they are experts in their field (which they often aren't), they may not have any idea how to evaluate a candidate.
One basic tenet here is that you don't "mock" a candidate. Seriously, wow.
I've used ROWS BETWEEN maybe once or twice in my career in a professional setting. Self-joins more often, but as others have pointed out, window functions are more efficient here for writing dashboard ETLs, etc.
Btw, are you minimaxir who wrote gpt-2-simple? I was looking at your tutorial a month ago while putting together a solution for the Kaggle COVID-19 NLP challenge!
This is only useful if you tell candidates to study a book for a week first.
Otherwise you are filtering for narrow-minded memorizers, not smart people who can learn and solve problems.
And as with all such questions I'm judging your comfort with the concepts behind the skills - in this case set theory, conditional logic, data modeling, etc.
So it's more how you say it, not what you say. And yes, it's mostly negative signal - most interviews are just to weed out completely hopeless candidates.
I'm not an SQL expert, but it's very important to understand whether something is a known unknown or unknown unknown to a candidate. So, I imagine, if the candidate would say something along the lines of "oh, this looks like something for a window function - I don't know the syntax and the nitpicks, but I could google it", it'd be completely OK. However, if he wasn't even aware that such a tool existed, it means that he would never go and learn it in a week without outside guidance, and would choose a less-fitting tool to complete a task.
Sure OP has presented some example solutions, but most of these solutions can be built up from the basics of tabular data manipulation. Furthermore, there are multiple steps involved that would allow a candidate to show their grasp of these fundamentals and receive help from an interviewer (that doesn’t trivialise the problem).
In the case of percent change in MAU, you don't need to know exactly how to do a self-join, but rather to identify that you would want one and what the key join conditions would be; something like the sketch below.
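A minimal sketch of that kind of self-join (Postgres-flavored; the monthly_mau table and column names here are hypothetical):

  -- Hypothetical table: monthly_mau(month date, mau int)
  select cur.month
       , (cur.mau - prev.mau) * 100.0 / prev.mau as pct_change
    from monthly_mau cur
    join monthly_mau prev
      on prev.month = cur.month - interval '1 month'

The join condition on the offset month is the part worth being able to articulate.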
If you had a pseudo code for sets, joins, etc, it'd probably be using more mathematical symbols.
I've given SQL interview questions quite a few times and like to present a sample dataset or schema, and state a problem statement or expected result. Then I collaborate with the candidate to come up with assumptions, and I give gentle guidance if they are stuck or going down the wrong path.
I'm always up-front with the candidate about not looking for perfect syntax, and I am more interested in their problem solving and collaboration skills than the actual SQL they write.
People who ask questions about the data and schema usually do much better than those who jump right into the solution. Just like in the real world, where really nice SQL or code doesn't matter if it solves the wrong problem.
SUM(x) OVER (PARTITION BY columns ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) gives you a trailing week total.
For example, if I want to plot this trailing calculation and it has gaps in the data, then the plot will be very wrong. The DE pipeline should fill in these gaps.
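For instance, the fill can happen right in the query (a sketch assuming Postgres and a hypothetical daily_counts(day, n) table): densify against a date spine first, then apply the trailing window.

  select g.day::date as day
       , sum(coalesce(c.n, 0)) over (order by g.day
             rows between 6 preceding and current row) as trailing_week_total
    from generate_series(date '2016-01-01', current_date, interval '1 day') as g(day)
    left join daily_counts c on c.day = g.day::date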
But I do love doing vectorized operations in pandas/numpy. Writing a quick function then calling it against an entire numpy array is kind of fun.
Eg: indeterminate in -> indeterminate out
but determinate in -> determinate out.
A friend of mine did numerous interviews at a large company, hours out of his time and those of the interviewers, only to be caught up by some inane SQL question asked by a know-nothing after the entire process of interviews had been completed.
Why not ask about obscure regexes? Better yet, how about baseball scores? Hair bands from the 80s?
It's time for the valley to get real about how to judge the merit of applicants. The present state of affairs in tech recruiting is a joke.
You may say no, that any competent person can read the book and learn this stuff. But in my hiring experience there are sadly oceans of coders who, say, reach for 100 lines of buggy slow Java code to solve something that could be done in 3 lines of SQL. Once you've hired that person they do not magically turn into an efficient or self-aware programmer.
Such questions don't test for that; they test rote memorization. I don't learn by heart things I only use rarely, and certainly not to help you with a problem I'm not causing and that such memorization doesn't fix.
Second, the most important skill for a DBA or Data Scientist role, and to an employer, is finding someone who understands the data domain well enough to point out obvious mistakes, bad approaches, and bad thinking to managers, and to provide leadership, not just access, to information.
Third, "any competent person" is just belittling a skillset that, like programming, can take years or decades to truly understand.
Then again, this depends a lot on your architecture. For instance, if your database is running on expensive hardware with very limited CPU and memory quotas, then it's often better to export intensive problems to the application. Also, a lot of the more complicated SQL commands are implemented differently across the dialects of MySQL, MSSQL, Oracle, and DB2. Sticking to simpler queries lends itself to cross-platform compatibility if you need that.
I think that when you run into someone who knows these in-depth things about SQL, you either have a smart, technically aware person or a rote learner. I think the best bet is to ask them how they would use the different features and why. Rote learners will hit a brick wall or answer generically.
The point is absolutely not to find some technicality and use it against the candidate.
If there's one thing I've noticed among the "10x" coders, then it is that they know their environments and tools like their back pockets.
If you can't solve the problems after looking it up, then perhaps you are not a SQL expert. I think these sorts of questions are valuable.
1. MoM Percent Change
It's better to use window functions; I believe they should be faster than a self-join (see the sketch after these points).
2. It seems that the first solution is wrong -- it returns whether "a"'s parent is Root/Inner/Leaf, not "a" itself.
I'd instead add a "has_children" column to the table, and then it would be clear.
Second solution works, but without optimization it's one query per row due to the nested query -- slow, but not mentioned.
The question just screams for windowing functions, and cumsum is a canonical example for them.
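For reference, hedged sketches of both window-function points (table names are made up):

  -- MoM percent change with lag() instead of a self-join:
  select month
       , (mau - lag(mau) over (order by month)) * 100.0
         / lag(mau) over (order by month) as pct_change
    from monthly_mau;

  -- Cumulative sum as the canonical window-function example:
  select date, sum(cash_flow) over (order by date) as cumulative_cf
    from cash_flows;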
Sorry post author, you'd fail my interview :)
Sorry you feel that way! Thankfully my employer felt differently :)
I'd want to hire a professional who knows the right way of doing that, so that for them it would be as easy to do it right as to do it ugly.
-- The guy that uses C and C++ all the time and often forgets the order of the arguments to common functions.
(You can make "but what if a month is missing?" a later part of a multi-part interview question)
Generally I would assume that data engineers would set a month of no users to zero, or that I could ask them why that's not the case and note it for future reference.
Your statement is making the assumption you have completely dense data and you can simply offset a number of rows to get the desired denominator. Sparse data is a very common occurrence, and now your MoM/YoY/XoX queries are completely incorrect.
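One sparse-safe variant (a sketch, Postgres-flavored, with a hypothetical monthly_mau table) is to check that the previous row really is the adjacent month:

  select month
       , case when lag(month) over (order by month) = month - interval '1 month'
              then (mau - lag(mau) over (order by month)) * 100.0
                   / lag(mau) over (order by month)
         end as pct_change  -- null when the prior month is missing
    from monthly_mau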
You learn enough to do the queries you need for your project; they work, then you forget about them as you work on the rest of the project. These skills are perishable. Left outer join? Yeah, I knew what that was some time ago, but not anymore.
The days of dedicated SQL programmers are mostly gone.
Only in a "Every database has 2 TB of memory and 64 cores"-world is SQL (and database design) a negligible skill.
Some databases use lock structures that automatically detect and resolve deadlocks, so from a user standpoint there are no deadlocks but deadlock resolution has visible side effects that are implementation defined.
I highly recommend the SQL Cookbook for newer SQL devs as a quick reference comparison to see how often this is the case for even trivial problems in any RDBMS.
Also, for your language, I wouldn't expect all mid-tier SQL devs to be able to write a recursive CTE from memory (though it's useful, you can just look it up again), but something like breaking apart a query plan on your platform of choice is way more important, as that muscle allows you to tell whether your CTE was crap or not.
I learned how much when I joined a big tech company. The devs don't write SQL unless processing logs in the warehouse. But everyone from PM to support to data science and marketing writes SQL.
Now the thing to do is for me to recreate your list on github.
It is so useful and reliable, and it does not change every year.
It seems like SQL is only uglier if you're used to "SELECT *"ing, which is asking for trouble--in SQL or when doing the equivalent with other row/column data manipulation tools.
Though I'll grant you that writing SQL for extremely wide data is a lengthy experience.
  -- Labeling tree nodes with a correlated EXISTS:
  select node
       , case when parent is null then 'Root'
              when exists (select * from tree c where c.parent = node) then 'Inner'
              else 'Leaf'
         end as label
    from tree

  -- Cumulative cash flow as a running total (order by, not partition by):
  sum(cash_flow) over (order by date) as "cumulative_cf"

  -- The same tree labels via a join against the set of parents:
  select t.node
       , case when t.parent is null then 'Root'
              when p.parent is null then 'Leaf'
              else 'Inner'
         end as label
    from tree t
    left join (select distinct parent from tree) p on t.node = p.parent
In my experience, once you're hitting hundreds of millions of records, implementation details of your database engine will start to matter. A database designed for transactional workloads like Postgres will start to choke on aggregate and window functions, often taking minutes to run instead of milliseconds. A columnar database like Redshift (which exposes a SQL interface) will breeze through it without a sweat.
That query is much simpler, safer, faster, no self-join or windowing necessary, and you can properly handle missing months in your dataset in your higher-level language (which the provided solution doesn't do BTW).
And you're still getting the performance boost from the DB indexes for grouping and sorting.
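The query isn't quoted here, but presumably it's a plain aggregation along these lines (hypothetical names), with the MoM arithmetic and gap handling left to the host language:

  select date_trunc('month', ts) as month
       , count(distinct user_id) as mau
    from logins
   group by 1
   order by 1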
Also, a lot of these would need to be done in SQL in practice since the data wouldn't fit in memory. Any solution that requires loading the table [login (user_id, timestamp)] into memory probably won't scale very well!
(1) The flavor of SQL I use at work supports macros, which are functions that can take in parameters like a function in R/Python might. So, the SQL is "turbo-charged" in that sense and some of the value-added of switching over to Python/R is diminished.
(Big Query has UDFs, which seem similar: https://cloud.google.com/bigquery/docs/reference/standard-sq...)
(2) Like I mentioned in the doc, I personally use the SQL in these practice problems for ETLs on dashboards showing trends. AFAIK, it's much easier/more efficient to write metrics for daily ETLs in SQL than in R or Python, especially if these are top-line metrics like MAU.
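For concreteness, a minimal BigQuery-style UDF sketch (names and function body are illustrative, not from my employer's macro system):

  create temp function pct_change(cur float64, prev float64) as (
    safe_divide(cur - prev, prev) * 100
  );

  select month
       , pct_change(mau, lag(mau) over (order by month)) as mom_change
    from monthly_mau;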
What would one of these do?
As I responded to another comment, I have to wonder if that's because you haven't spent the time to become proficient enough to appreciate them. I understand that a lot of programmers decry "magic" and want to get at the underlying steps and components that produce a given result, but literally all of us work at some level of abstraction above that because it would be impossible not to. None of us are directly transforming sand with electricity into computed values.
You're describing being more comfortable with an imperative set of statements than a declarative description of results. Assuming you write automated tests, think of the declarative approach as writing the description (rather than the implementation) of a test, except that the description is less freeform and must conform to a certain syntax and structure so that a machine can write the test for you.
It's "magic", but it's only so much magic because the scope of what it can do for you is limited and well defined.
Except that you know/expect exactly how it will be executed, and if it doesn't work that way, maybe the test should not really pass. SQL in general is full of this "we write declarative queries expecting this exact imperative result and investigate when it's not". It's much like an interview question: it may have many different answers, but you must provide the one that satisfies a grumpy plan-builder guy. I know SQL and use it when it's shorter/more descriptive, but sometimes you just want to take them rows and do them things, without leaving the database process and its guarantees.
Not much db experience, but the fact that such powerful engines (i.e. ACID, indexing, good ods, pooling, etc.) are always hidden behind some cryptic uppercase frontend with a weak execution layer always bothered me. Just give me that postgres-grade index scan directly in C, dammit. /rant (inb4 just write some parts as a sp in a language of choice)
You're writing bad tests. Do yourself a favor and write tests that pass "it [satisfies some human expectation]". Stop writing tests that care about how it's implemented.
I don't intend to be glib, but rather to illustrate some of the benefits of a database platform like modern RDBMSes. These things are all implemented and battle-tested for you.
If you would prefer a more fine-grained control over execution of queries, you could probably get very far by starting with the SQLite code base and working at a more primitive level through its library.
This is from the perspective of trying to get queries to run in 5 minutes instead of 30 minutes instead of hours or days or forever, not brief transactions measured in milliseconds. And it's not something I figured out on my own, but by paying attention to the guy who never talked but was consistently 10x faster in producing reports than anyone.
The thing you should not do, that I also saw people do, is use procedural PL/SQL or T-SQL to process things in a loop - that can be orders of magnitude slower.
Honestly, Postgres does it right - you can enforce your query plan to any level of detail you want.
(available on some cloud providers as well)
Is one column composed of mostly one or two values? Then an index lookup on that column is not very optimal, and the database can use something else.
There is more than one type of join (INNER, LEFT OUTER, etc.), and more than one join algorithm (nested loop, merge, hash, etc.). All of these change based on the data. Even the join order can have a huge impact on query time, and needs to adapt based on the number of rows you'll pull from each table.
A lot of SQL queries are built from templates, or built by ORMs. Optimisations are critical to turn these templated queries into something efficient.
SQL can also be very expressive, a "NOT EXISTS (SELECT 1 FROM thing WHERE foobar)" could be more readable than doing a join and where clause.
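For example, the two common spellings of an anti-join ("users with no orders", over hypothetical tables); many planners treat them identically:

  select u.id
    from users u
   where not exists (select 1 from orders o where o.user_id = u.id);

  select u.id
    from users u
    left join orders o on o.user_id = u.id
   where o.user_id is null;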
Though interestingly, you get a much more declarative query style with nosql databases and key-value stores. So there are alternatives out there.
If you want to get undeniably good with SQL, this is the book.
Also I would do a lot of them differently. And I don't think the answer to #6 is valid (in T-SQL at least).
One of the best SQL interview questions is "Explain what is wrong with DISTINCT and how to work around it".
But every time, it interferes badly with any kind of locking (that's DBMS dependent, of course), and imposes a high performance penalty (on every DBMS).
Often better to group explicitly so you know what's actually going on.
Bingo. I used to work with a guy who would see duplicate results and just throw a distinct on his query. I had to keep on him to fix his queries or explain why distinct was correct in this case. My default is that distinct is almost always not the solution.
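A sketch of the failure mode I kept seeing, with hypothetical tables; a join fans out rows, DISTINCT papers over it, and an EXISTS asks the intended question instead:

  -- A user with three orders shows up three times; DISTINCT hides the fan-out:
  select distinct u.id, u.name
    from users u
    join orders o on o.user_id = u.id;

  -- Usually better: state the actual question.
  select u.id, u.name
    from users u
   where exists (select 1 from orders o where o.user_id = u.id);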
In an interview, presumably my logic would, hopefully, shine through minor issues of syntax.
Where would that put me? Maybe "okay to decent" when dealing with "medium-hard" questions?
I would fail utterly at DBA management SQL and stored procedures; my responsibilities skew towards data analysis.
The purpose was to make the questions more realistic, since at least in my experience in data analyst interviews the questions are asked in the context of actual business or product situations ... like company leaders, PMs, or others wanting to understand trends in MAU.
Thanks for sharing, I found it very insightful.
But one place it falls apart is these time-series data processing tasks.
It's because of its model of unordered sets (or multisets, to be more precise, but still unordered). When you look at the majority of those queries, and other real-life queries, they always involve the dimension of time. Well, that means we have a natural order to the data, so why not use an array data structure instead of this unordered bag that throws the sort order out of the window?
SQL realized this and bolted window functions on top of the original relational model. But you still feel its inadequacy when trying to do simple things such as (x join x on x.t=x.t-1) or even the infamous asof joins where you do (x join y on x.t <= y.t).
In array databases with implicit time order, both examples are a trivial linear zip operation on the sorted tables.
In traditional set-oriented SQL it results in a horrible O(n^2) cross join with a filter, or, in the better case of an equijoin, in a sort-merge join, which still has to do two expensive sorts on my massive time-series tables and then perform the same trivial linear zip that the array processor would do. Which is also completely useless on unbounded streams.
Also, many Stack Overflow answers suggest using binary blobs for time-series data and processing them locally in your favorite programming language - which points at the wrong conceptual model in SQL engines.
Is SQL really so inadequate for real life data problems or have I been living under a rock and Kdb is no longer the only option to do these trivial time oriented tasks?
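The closest stock-SQL workaround I know (a Postgres-flavored sketch, with x and y as above and a hypothetical val column) is a lateral top-1 subquery, which is still per-row probing rather than a linear zip:

  -- For each row of x, the latest y at or before x.t:
  select x.t, x.val, y1.val as asof_val
    from x
    left join lateral (
      select y.val
        from y
       where y.t <= x.t
       order by y.t desc
       limit 1
    ) y1 on true;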
Most of these 'time series databases' are for processing structured logs and metrics to be plugged into your favorite system for monitoring site latency.
Asof join is still an open issue in their issue tracker, so it is not usable as a time-series database.
Having it be mostly plain-ol-postgres has been huge for designing and evolving the data model in a way that I'd hate to do without "real" SQL.
The asof and other time-based joins are a necessity for anything slightly more complicated than a trivial aggregation at the tail of the stream, which is the majority of use cases in IoT and real-time monitoring.
Without that I can pipe my data to /dev/null giving me even better write performance while also not being able to solve this common financial industry use case.
Timescale has been quite performant for us fetching arbitrary past time windows, but I'll stop the cheerleading there as I think mfreed covered the details far better.
I'd certainly still like to see AS OF support!
Just an observation that the parent post was about the unordered "set" nature of tables, and not strictly about ASOF joins.
I should point out there is an alternative reason for this in TimescaleDB (and many other databases): to support loose-time-order writes (or even data backfill), where records don't always arrive in perfect timestamp order.
So some of these engineering decisions are to balance tradeoffs across insert vs. query rates.
In TimescaleDB, for example, we write data to time intervals in arrival order, not timestamp order, in order to achieve higher write rates. On the other hand, we support the automated asynchronous rewriting of data to reorder it after it reaches a certain age, according to arbitrary columns.
This can be used to reorder data precisely in timestamp order. It's also highly useful and commonly used for composite reorderings, e.g. on (some_id, timestamp DESC), so that, within a time interval (chunk in TimescaleDB), data belonging to each some_id is collocated on disk, and within each id it's sorted on disk again by timestamp. (Here, some_id might be a unique hostname, device_id, metric_id, stock ticker symbol, etc.)
This makes queries like SELECT * FROM table WHERE some_id = foo AND time > X AND time < Y much faster.
We also take a similar approach when performing asynchronous columnar-style compression, where we "segment" together many records belonging to the same id into a single internal row, then order those records array-style within each row, sorted by some column (typically timestamp), before applying a type-specific compression algorithm to them. Again, all metrics belonging to some_id are collocated, ordered by timestamp as an array on out-of-line pages, so that only the requested columns need to be fetched from disk. And this gets hidden behind a SQL view that makes it look like you are still interacting with a standard row-based table.
Anyway, totally appreciate that we haven't prioritized ASOF JOINs, as our experience is that these typically come up in more specific financial use cases, rather than the vast majority of time-series use cases.
And we also find that our "gapfill/LOCF" functionality can be used to bucket-aggregate data with different ids for related behavior. That is, it's a data modeling choice to store different ticker symbols in different tables (and then ASOF JOIN), and not necessarily the only way to achieve the desired goals.
And thanks clatonjy for your use!
Could be you dodged a bullet. A company with advanced interview questions may have some ugly SQL code. For jobs that lean heavily on SQL, I expect candidates to know things like windowing & dynamic variables, in SQL & an ORM library. For SWE's, I feel basic SQL is fine.
I'd be happy if they knew what EXPLAIN was, since that impacts production on a daily basis.
Also, the OP is basically "fighting the last battle" again. Most interviews don't filter out candidates based on SQL.
1. use some underlying implementation detail of the particular RDBMS or proprietary extension to the standard
2. are essentially tricks, like the typical ridiculous interview problem designed to "make you think outside the box". Yes, you can do almost anything in SQL but often you should not.
I get the perspective here is data analysis where you probably need to know more SQL than the standard developer, but I still feel you should be testing for solid understanding of basics, understanding of the relational algebra concepts being used and awareness of some typical gotchas or bad smells. That's it. They'll be able to google the precise syntax for that tricky by date query when you're not guaranteed to have data for every month or whatever on-demand.
I've always found explicit temporary tables, where I can add indexes, are often a great solution for performance and readability.
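For example (a Postgres-flavored sketch; the table and filter are hypothetical):

  -- Materialize an intermediate result, index it, and refresh planner stats:
  create temporary table recent_orders as
  select * from orders
   where created_at > now() - interval '30 days';

  create index on recent_orders (user_id);
  analyze recent_orders;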
Is it only useful for ad hoc/analytical queries, or am I missing something?
People can define, describe and provide examples but with SQL, it won't sink in until you try it out yourself and have the "Ah ha" moment. If you have a database server and some test data, try writing a few window function queries or try some online examples.
The data is already sorted, partitioned, and in memory; adding a moving average calc or a max value per category is certainly faster than fetching the whole set, recreating it in an intermediate structure, and then calculating the new value with all that partitioning, sorting, etc. redone client-side.
You can certainly use them in app dev for things like:
* figuring out how many things are ahead of you in a queue
* snapshotting (what's changed since the last time you were here)
* comparing something to some other sample for outliers or inconsistencies
These may be quasi analytical still, but ultimately they can manifest as properties of an object model like any other property to be developed against.
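A couple of hedged sketches along those lines, with made-up table names:

  -- How many things are ahead of you in a queue:
  select id
       , row_number() over (order by enqueued_at) as position
    from jobs
   where finished_at is null;

  -- Max value per category, without collapsing the detail rows:
  select category, item, price
       , max(price) over (partition by category) as category_max
    from items;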
However certainly if your app is written in Python or Ruby I can see there being a big difference.
The reason I mention analytics is more to do with the ease of scaling out the DB than the nature of the queries. In busy OLTP databases usually one wants to keep the work off the database because they’re difficult and expensive to scale.
It uses React but I imagine there is some other library like ProseMirror here.
I've never used Office online, but Google Docs is terrible performance-wise.
  select * from
    (select adddate('1970-01-01', t4.i*10000 + t3.i*1000 + t2.i*100 + t1.i*10 + t0.i) selected_date from
     (select 0 i union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t0,
     (select 0 i union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t1,
     (select 0 i union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t2,
     (select 0 i union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t3,
     (select 0 i union select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9) t4) v
  where selected_date between '2016-01-01' and now()
This works in MySQL / MariaDB.
- SQL:1999 https://en.wikipedia.org/wiki/SQL:1999#Common_table_expressi...
- SQL standardization https://en.wikipedia.org/wiki/SQL#Interoperability_and_stand...
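The standard-SQL route those links describe is a recursive CTE; a sketch for MySQL 8+ (MariaDB names the depth limit max_recursive_iterations instead):

  set session cte_max_recursion_depth = 10000;  -- default 1000 is fewer than the days since 2016

  with recursive dates (selected_date) as (
    select date('2016-01-01')
    union all
    select selected_date + interval 1 day from dates
     where selected_date < curdate()
  )
  select selected_date from dates;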
Where are the pivot and unpivot questions?
Mode's SQL tutorial uses SUM(CASE ...) and CROSS JOIN to mimic pivots: https://mode.com/sql-tutorial/sql-pivot-table/
PostgreSQL can do it via the tablefunc extension and there's `\crosstabview` which is built into psql.
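A minimal SUM(CASE) pivot in the Mode style, with hypothetical table and columns:

  -- Rows of (product_id, month, revenue) pivoted to one row per product:
  select product_id
       , sum(case when month = 1 then revenue else 0 end) as jan_revenue
       , sum(case when month = 2 then revenue else 0 end) as feb_revenue
       , sum(case when month = 3 then revenue else 0 end) as mar_revenue
    from monthly_revenue
   group by product_id;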