
Ask HN: Can you explain how a database index works in an interview? - mountaineer
Over the course of the past year or so I&#x27;ve been doing a fair number of interviews for web developers (full-stack and back-end) and have asked this question nearly every time. Across a wide range of experience levels, not one has been able to answer this question well. About half just honestly say they can&#x27;t, a quarter know a little bit, but can&#x27;t explain it well, and the other quarter come up with some wild response. It is not my intent for this to be a &quot;gotcha&quot; question, I ask because I was once asked it and the learning I did after that interview really stuck with me.<p>How would you explain how an index works?
======
brlewis
I think you might get better results with a slight change in wording: "What
can you tell me about how an index works?"

With the wording you used, some interviewees' minds will race to the limits of
what they know and then panic as they think about all the things beyond that.

A "what can you tell me" question is more likely to prompt them to start with
the basics and then expand more comfortably to the limits of their knowledge.

------
codegeek
I am not a database expert but as a developer, I of course work with databases
and SQL as needed. My answer would be:

"Index helps you locate information quicker than usual in a database. If
database is similar to a Book, then think of index as a Table Of Contents with
page numbers for specific topics. Imaging if I gave you a book with 1000 pages
and ask you to find a specific topic but there is no table of contents. You
will have to look through many pages before you find it. If it had a table of
contents, you could flip to that page much faster.

Will this answer be good enough for you ?

But then, I have no technical knowledge of how index is implemented within
various databases and that is where I will say "Sorry I can't tell you the
technical details".

EDIT: I will also add that it is difficult to list EVERY possible topic in a
book in the table of contents. It will then defeat the purpose of the TOC. So
you have to pick and choose the topics that you want to show with page
numbers. Same for database index. You cannot index everything.

~~~
dragonwriter
> If database is similar to a Book, then think of index as a Table Of Contents
> which page numbers for specific topics.

FWIW, books frequently also have indexes, which are a better analog to DB
indexes than tables of contents are.

~~~
codegeek
good point :).

------
smyatkin_maxim
Well...

\- On the 2nd year of education I knew that "index is something in database,
which helps to locate data faster. Like an index at the end of a book, it
helps to find the right page w/o whole scan of a book".

\- Later, probably at 4th or 5th year I got the idea how these are implemented
using B+-trees.

\- And even later have seen some alternative implementations.

But anyways, IMO even for a senior web developer it's enough to know that
index is SOME fast data structure on disc, which trades duplication and
(usually) slower writes for faster lookup. And optimizer sometimes will chose
index scan, sometimes won't. And to know some basic optimization techniques
(like throwing ORDER BY away when data is already sorted by index).

~~~
derekp7
I would also like developers to understand that having additional columns in
an index doesn't necessarily help -- if the first column is mostly unique
(such as an employee ID number) then having an additional column (on the same
index) of employee name doesn't get you much. You'd need a separate index that
begins with employee name. (Edit: an exception is if the entire index fits in
RAM, the DB can do an in-memory full index scan vs. a full table scan).

On the other hand, if the first column isn't unique, such as employee's city
or state, then the DB can use the next column as a "skip" index. So an index
on "state, name" will still usually improve the query time of queries against
"name" (but not as much improvement as if you had a separate index beginning
with "name").

~~~
btown
To that point, I think it's necessary for any engineer to understand at least
how tree structures work, so that this type of reasoning is intuitive. There's
a tree for the first column in the index, and at each of its leaves, there's a
tree for the second column. So if you have a really complicated tree on the
first column, subsequent columns won't work well, because you'd need to look
at many many subtrees. But if you only have a few leaves on that first tree,
then the system will just explore a subtree for each one, and you'll only have
a few of those to run through.

You don't need to know O notation or know what kind of trees they are (much
less how they're implemented) to understand that. But if you only think of a
database as an Excel table and not as trees sitting on top, you'll tend to
make inefficient design decisions. And at the rate that data structures are
moving into the frontend, arguably everyone in the stack should know how to
think in this way.

~~~
bjourne
Wouldn't it be better to have one tree containing compound keys rather than
subtrees? Often trees have a lot of overhead so a tree of trees could be
rather inefficient. It would also be hard to balance.

~~~
btown
Effectively this is the same; at a sufficient diversity of the first element
in the compound key, you'll have almost as many tree nodes as you would if you
considered them subtrees. It can be intuitively easier to think of them as
subtrees, though, and that was my point; it's not necessary to know all the
implementation details, just generally how the system finds data.

------
skookum
Are you asking this from a DBMS internals or an application perspective? Are
you posing it as a "one true answer" type question about a specific DBMS?

I'd expect any software engineer or CS grad to be able to make a reasonable
guess at how a secondary index might work internally.

I'd expect any experienced software engineer who has worked with databases to
be able to explain the application side aspects of "how an index works". e.g.:
What access patterns a given index would help with and which it would not help
with, what the downsides of having an index are, etc. Though even an
inexperienced engineer should be able to derive answers to these questions
once they've made a reasonable guess at the internals.

I think a lot of people are responding to this negatively due to viewing it as
a trivia question whereas it can actually be a "reason through what has to
happen to make something work" class of question.

------
jballanc
Items from a sorted list can be retrieved by a binary search in O(log(n))
time, so a database index is merely a list, sorted by the column being
indexed, that points back at the original record for each entry. Simple,
right?

So simple, in fact, that if someone tells you they _don 't_ know how a DB
index works, you could easily walk them through it and, in doing so, get a
much _better_ evaluation of how they think than merely how well they can
recite a bit of memorized text-book knowledge.

In fact, the only time I ever ask "Can you explain how X works?" type
questions in interviews is so that I can specifically _exclude_ asking an
interviewee how to walk through the derivation of a concept with which they
are already familiar. By that metric, I think your "how a DB index works"
question is a great interview question!

Your mistake is stopping when they say "no".

~~~
agentultra
I like your idea. At the Recurse Centre, I've heard, they have a rule that you
should never act surprised if someone _doesn 't_ know something and instead
see it as an opportunity to teach them something cool. I use this on my team
but I haven't seen it used in interviews before.

I would imagine you either know what data-structures to use for a given data
type or you do not... it's a fairly prescriptive question.

However if someone could reason by what they _do_ know about how it might work
given enough information then it'd be interesting to know what you could learn
from them by walking them through the problem.

First, obviously if they take a try at it then it shows the person can reason
about problems they haven't encountered by using concepts.

How much walking do you do for them and what does it tell you about their
problem solving capabilities?

~~~
jballanc
Personally, when conducting technical interviews, I try to keep the
interviewee right at the limits of their knowledge and ability. It's sort of
like standing blindfolded in a room and walking in different directions until
you hit a wall to get a sense of the size of the room (don't try this at
home). So this often involves starting overly complex, then walking back until
you find the interviewee's comfort zone. Let them demonstrate their knowledge
for a while (and hopefully gain back some confidence), then push them out to
the complexity zone again until they stop answering/start looking
uncomfortable again.

To give a concrete example of what that would look like with this question
(let's call "A" the interviewer and "B" the interviewee):

...(this post has been truncated for your scrolling enjoyment, full version
at:
[https://gist.github.com/jballanc/82cbe551b2b667a1ab72acabc71...](https://gist.github.com/jballanc/82cbe551b2b667a1ab72acabc7183d5d))...

------
chollida1
If you come from the perspective that an interview question is a chance to
learn how the candidate thinks then this is undoubtedly a bad question.

Secondly it doesn't really give any signal positive or negative wrt to full
stack developers.

I mean if someone does't know how an index works it doesn't mean they can't
use them effectively.

And if they can answer it, it also doesn't mean they can use them effectively.

Maybe turn the question on its head and ask....What possible info do you
expect to get out of a candidate when asking this question?

~~~
kthejoker2
Concur, any question which is rendered moot by reading a blog article or 2
isn't that great a question.

------
YZF
For one thing there is no single canonical way. Cassandra indexes (sstable
indexes + bloom filters) are quite different than e.g. SQL server indexes
(probably some variant of B-trees).

If I wasn't prepared I could probably say something about these schemes but
database indexes are kind of specialized knowledge since they combine some
attributed of in-memory data structures with the performance characteristics
and limitations of storage.

Let's turn the tables here, explain in detail how a B+ tree works and what's
the difference between that and a "standard" B-tree? What do you know about
sstables? LSMs?

------
user5994461
Do you know the difference between a list and a hash table? Hopefully yes.

When there is no index, the database is stored in a list. You look for
something, you have to go through every item.

When there is an index, the databases uses a hash table to reference the
locations of items. You look for something, you get the location instantly.

It's really simple.

There are more types of structures: trees, hash tables, bimap, arrays, etc...
with different performance characteristics.

------
WindyCityBrew
I live mostly in the front end world, but if I was asked this (without looking
it up) I'd say a DB builds a tree (balanced tree, binary tree maybe?) using
the columns requested. This allows for very fast lookups when the table is
queried for its indexed column(s), and blows away the worst case scenario of a
non-indexed query time wise, where the DB has to do a full table scan to find
out that a record doesn't exist. The trade-off is that the index must be
rebuilt on inserts (probably deletes and maybe updates too?) and the resulting
data structure stored, which is why you don't just index all the things.

------
andreyk
I am a sort of generalist software dev (I've been working back end for the
last few years), and think I could give it a try. As far as I understand, an
index is pretty much a sorted copy of entries of a column, which enables fast
lookup for that column (but slows down insertion and takes up extra space). I
would guess an index could also be something other than just a sorted list,
the important bit is that is an extra efficient lookup structure. Beyond that
though, I don't know anything. I think a back end dev, who ever interacts with
SQL, should know at least that much (assuming it's not totally wrong).

------
kbenson
My "out-of-my-ass" answer would be:

My layman's understanding is that indexes are separated data structures with
generally fixed size copies of one or more specific fields from the main
record sorted to allow efficient searching, that contain offsets into the main
record store. That is, it allows you to use that field to find records and
index into the main set of data. That said, this is how I have inferred it
works, and how I would initially attempt to create a database, given little or
no knowledge of the field and not allowed to research.

------
dudul
What are you trying to prove by asking this question to candidates? Does it
matter? Does not understanding what is happening under the hood prevent them
from using them correctly?

~~~
mountaineer
Good question, and one of the reasons I'm posting to gauge if this is really a
valid interview question. What I'm trying to understand in the interview is
how deep their knowledge goes in the stack, to your point, these are not DBA
jobs. It matters a little, to at least understand the trade-offs and how they
are used to improve performance of high volume sites.

~~~
lloeki
I'd say this is a very valid question, as with all interview questions, how it
is answered matters more than what is answered (unless the interviewee makes a
gross error to the point that it's fairly obvious they should not be hired).
Interview questions are first and foremost windows into a thought process, not
pop quiz. Even for a front end job, this may hint at how much someone has a
view of the global stack and/or general CS knowledge, or how they react
stepping out of their comfort zone.

Maybe then your question here is not so much about an eventual interviewee
answer but more about how to set your own expectations.

------
skewart
I think it's a perfectly reasonable question. I wouldn't expect a candidate to
know all the implementation details for any particular database, even one
they've worked with a lot. But I would expect them to have a general idea
about how an index typically works (e.g. "some form of binary search tree"),
or at least be able to reason from first principals given the problem it's
intended to solve. From a different angle, I think it would be perfectly
reasonable to ask someone what kind of data structures they would turn to if
they needed fast lookups in an unordered list.

No one question should ever disqualify a candidate, but I'd be a little
concerned that someone who can't even begin to reason through how an index
might work would treat them as magical black boxes that just make everything
faster. And that's certainly not true. For some kinds of data they don't speed
up lookup time at all. In some situations they speed up reads, but slow down
writes considerably. I'd expect someone to have enough foundational knowledge
so that when they're on the job they could do a bit of quick research and
decide if adding an index on any given column was a good idea or not.

------
ddebernardy
I can't imagine most web developers understand or care about or even need to
know the intricacies of how a B-Tree or the various types of inverted indexes
- or other indexes - work.

Much more interesting, IMO, and only for the certain type of dev that actually
hits the database, is whether they understand that there are different index
types, which to use when, and how this know-how translates to them being able
to anticipate what the query plan for this or that query will look like based
on the available indexes and some pointers on how the data is distributed in
disk pages.

If anything you want the candidate that, given a set of assumptions, will tell
you "a new index is needed here" or "a new index isn't useful for this query"
because reasons and is able to explain why in sensible terms.

If you're trying to test SQL skill level, take a cursory look at the Postgres
tag in StackOverflow and scan for performance-related questions. You'll find
plenty of concrete "exercise" types of interview questions that, for the most
part, will be straightforward for anyone with the relevant know-how to answer.

------
joshka
Here's as simple as an explanation as I could come up with in 30 seconds:

An index is a supplemental data structure that provides O(searchable) access
time to another data structure that would has O(scannable) access time given
particular search or filter parameters.

Everything beyond that is implementation detail.

~~~
kapauldo
Not O(log(searchable))?

~~~
joshka
This depends on your index function, and what operation you're optimizing for.
My notation here was a shortcut for <insert explanation of differences to
searching / filtering / sorting / etc. algorithm time complexity> where
O(search) < O(scan), but by how much is implementation specific.

------
pixeloution
I'm not sure I could give an in-depth answer, especially under pressure in an
interview. What I could do is explain the important bits at a high level:

    
    
        * how a B-Tree index is structured (and how that affects range queries, and wildcard queries)
        * how a query uses a single index per table, or can create a merge index
        * how a covering index increases performance
        * how composite indexes affect queries, why you don't need an index on both (col-a) and (col-a, col-b)
    

I don't think its unreasonable to expect a senior developer to know this
stuff. It affects your ability to write performant queries against larger (1M
- 1B rows) datasets.

------
mnm1
An index is a copy (or copies) of some or all of the searchable data in a
table to a data structure that's easy to search (binary tree, etc.). As long
as the interviewee knows it's a copy of the data used to speed up searching, I
don't see what more you're looking for (if you are looking for more). If you
want to get into more detail, you'll be looking at various tree structures and
search algorithms on those structures. Unless you're writing a DB, language,
or something else low-level, you won't have experience with those and they are
largely irrelevant for typical web development.

------
MichaelGG
It seems that a lot of developers simply don't know very simple basics that
could be learned in a couple weeks of reading an intro algorithms book. Even
without having studied databases, it should be within reason to think up that
a database is a collection of records, and you need to find those records
fast. Just like if you needed to find an object in an in-memory array quickly.

Maybe it's being asked in a weird way? Or they have an expectation that you
want them to describe exact DB internals? Any decent candidate should be able
to have a conversation about this.

------
rocqua
B-tree (balanced however you like), or a sorted list of elements, or a skip
list of sorted elements. You could also do a hash table of elements if there
aren't too many unique elements i.e. O(sqrt(totalElems)

Those are the traditional things I can come up with. I couldn't tell you how
to balance a B-tree, just that red-black trees are apparently great for it. I
couldn't tell you which database uses what, and I wouldn't dare claim my list
is complete.

I'd say the question is so open as to be unclear.

------
gmiller123456
Suppose someone answers the question very well, even to the point giving
details (possibly incorrect) that you were unaware of? What good does this do
you? Suppose their answer contradicts what you think you know? This is one of
those where you'd better be an expert in the field before dismissing their
answer as incorrect since databases have been around for a really long time
and there's a lot of correct answers.

------
vim_user
I was asked that question in my recent interview, and I told them that beyond
the basics of the intent for using the index and some basic commonalities,
it's hard to explain the details of how it works due to the fact:

1) All RDBMS vendors work differently. 2) It's an abstraction that's best
handled by the RDBMS implementation, and is bound to change over time as
versions change.

------
defined
I think this is a very interesting point.

How important is it to be something of a generalist, who could answer
questions like this at least in broad terms, as opposed to being purely a
specialist who has little knowledge outside of their domain of expertise?

Would it be significantly helpful to understand more about the things that
happen under the hood, as it were, of technologies that we use?

Would a developer who could at least _kind of_ answer this question show an
aptitude for broader thought and a deeper interest in technology, and be a
potentially more valuable hire as a consequence?

EDIT: To attempt to answer your question, any collection of structured data,
whether objects in a memory-based data structure, delimited text fields, or a
DBMS file, could be considered a database.

In many data structures, such as hash tables or primitive key/value stores,
there is only one key. If you want to find data based on a field that is not
the key, you either have to search sequentially through all the records,
finding matches on that field, or create an index on that field.

If the number of records is small and the storage medium is fast, a sequential
search may be adequate. If not, an index is needed.

Creating an index generally involves scanning all the records in the database
and extracting the field required for searching, together with the location of
the record within the database. The location would preferably be a direct
record number to avoid unnecessary indirection, but it could also be the
primary key of the database.

The list of key values and locations is put into a suitable lookup data
structure. This could be something as simple as a sorted list in memory, a
hash table, or a disk-based structure like a B-tree, B+-tree, or one of many
others.

In the most simplistic case, looking up a record using the index means
searchng the index for the matching record locator, then using that to
retrieve the actual record in a separate step.

Obviously this is a bit more complex for non-unique keys, but that's the
general idea.

Finally, the choice of index structure has tradeoffs, because once the index
is added, it must be maintained when records are added, deleted, or modified
in a way that affects the index. If the db has 100 million records, having to
add a new one to a simple sorted index and re-sort it could be a performance
disaster.

------
positr0n
I could give a general explanation about reducing the number of disk seeks by
using B-trees to store index values with pointers to the database row.

That being said I mostly agree with the other commenters that while a senior
developer should know the answer, it isn't disqualifying if they don't. And if
that's the case what is the point in asking it?

~~~
ovao
I think that explanation might actually be shooting yourself in the foot
_slightly_ , depending on the interviewer, because an index needn't
necessarily be a B-tree.

------
systems
what answer are you expecting?

if you give us the answer you are expecting, it will help us evaluate the
fairness of the question

------
flukus
Does this not get taught in universities? I only went to a crappy trade school
but they made us write our own "database", nothing fancy, just fixed record
entries with indexes.

------
platz
Feels like OP spent a significant amount of time investigating implementation
details and wants to validate/justify/perform that knowledge/time investment.

------
cbanek
Seems like a reasonable question. Here's my answer:

Let's say you have a database table with two columns. One is an ID, and the
other is a random integer. You want to be able to go through and find a
particular integer by value, such as SELECT id,randomint from t WHERE
randomint=4 (4 gotten by xkcd random dice roll, of course).

Now, if you had just a list of these integers, and they weren't sorted you'd
have to scan through them all one at a time. This is called a table scan, and
it can be slow.

Imagine if you had another list, that had the same values from the table, but
they were sorted in the order of the values, rather than table order. Now
you'd be able to do all the sorted array tricks, such as binary search,
starting from the beginning, starting from the end, etc. This means you can
find it faster. The index can also be smaller than the table, because you only
have to store the parts you are indexing. But of course, there are performance
tradeoffs, because you are increasing the size on disk, and it takes time to
modify the index when modifying the table. (This is why some DB upgrades drop
indexes then add them back after doing a lot of activity)

------
petraeus
as far as i understand it it sorts the data at the disk level like windows
defrag and puts the data 1st together and secondly ordered so the indexed data
can be quickly and easily found.would this be an acceptable answer?

------
Jemaclus
As a full-stack developer, I really dislike these kinds of questions for the
reasons outlined by others, just like I hate having to sort lists. In real-
life, I do some version of sort(foo) or foo.sort() and call it a day. I don't
need to know whether it's using Quicksort or Heapsort or whatever. The built-
in sorting algorithm for our chosen language is generally not going to be the
bottleneck of anything, and if it does wind up being a bottleneck, you look up
alternatives and implement them. Same thing goes for binary trees or inverted
indexes or graph searching or something. The real-world solution for those
things in the real world is almost always going to be to use some out-of-the-
box solution, like the built-in sort or Elasticsearch or Redis or something.
Sometimes it's useful to see if they understand how the underlying thing
works, but most of the time it doesn't.

I very much prefer application-level problems that have a hundred different
possible solutions. For example, for a US developer, I might ask the following
question: "Given the address 123 Main St, Boston, MA 00215, write an algorithm
that will break it apart into its components of number, street, city, state,
and ZIP." The reason I like this is that A) everyone who lives in the US can
identify those parts on sight, requiring no special domain knowledge, and B)
there are a hundred ways to do it.

I often get solutions like "I'll write a regex" (Okay, write one!) or "I'd use
some sort of machine learning" (Okay, write something!). Most people will
start off and say "Ok, I'll split on commas and wind up with [123, 'Main St',
'Boston', 'MA 00215'], and then..." and work there way from there.

So then I can throw in any number of wrinkles.

* What if there are no commas? (this breaks the regex solution)

* What if it's "San Francisco" instead of "Boston"? How might you decide you've gotten the whole city name and not just part of it? What about "Palm Beach" vs "West Palm Beach"?

* What if we have "123 St Francis Street"? Or "12A Main St"?

* What if it's "123 Main St West Palm Beach FL"? Is it "Main St West" or is it "West Palm Beach"? How might you tell?

And so on. The possible ways to write an address are endless, and there are no
correct answers right out of the box to parse the address. This gives me a
glimpse into the developer's mindset: how do they initially scope the problem?
do they existing solutions or do they reinvent the wheel all the problem? Do
they know how to write a regex? Are they bluster or not? Can they adapt when
you throw curveballs at them? Do they start thinking ahead of you and say "ok,
if Boston could be San Francisco, then this problem could also happen in the
street name..."? And so on.

This becomes much more of a dialogue and a discussion of the problem and
possible solutions than sitting there watching someone write code under
pressure and tut-tutting when they do something wrong.

(This is an actual issue I ran into on the job, so it's not just a
hypothetical one.)

My two cents. Good luck.

------
hobonumber1
I'm a full-stack web developer. I don't know how an index works, nor do I
really care. I know when to use it, and how it affects lookups.

If I was a database administrator, then it's a fair question. But if I was
asked this for a full-stack web engineering role, I'd consider it a little
unfair.

~~~
mountaineer
I'm posting here in part to get a feel for the "fairness" aspect, so
appreciate the feedback. I haven't eliminated anybody over this single
question, but do see many interviewees get frustrated with it which may hinder
their performance afterwards in the interview. If I were to dig in re: "when
to use it and how it affects lookups", that would be good discussion.

~~~
arockwell
If most of your candidates cannot answer the question, but you would still
hire them I would drop the question. Knowledge of when to use an index is more
valuable than the specific mechanics of how the index is implemented.

When I want to test database knowledge, I usually give the candidate a simple
database schema and ask them to write progressively harder SQL queries.
Database modeling questions are also good for teasing out the seniority of a
candidate as well.

~~~
flukus
Not all interview questions are about eliminating candidates, you want to rank
them as well. If all 5 candidates are roughly equal but one knows much more
about databases then they are the one that gets hired.

