Hacker News new | past | comments | ask | show | jobs | submit login
The Joy of Indexing (kylebanker.com)
141 points by DanielRibeiro on Sept 17, 2011 | hide | past | web | favorite | 11 comments

We commonly encounter users who believe that searching on two fields can be facilitated by creating two separate indexes on those fields. This example should give you some intuition about why this just isn’t possible.

Not really, as the example only allows a search on ingredient, since it explicitly says the name is not known at all, and hence could never be used for search.

Now let's say we know two things about the unknown recipe: it contains chicken and it's from the Italian kitchen.

Given these two indexes:

    1, 15, 582, 1032

    3, 15, 16, 82, 1032, 6821
I could definitely use both to speed up my search by looking only at pages that occur in both indexes (in this case, 15 and 1032). This can be done in O(# of indexes * index length) time if both indexes are in sorted order.

You don't even need to look at the entire index. See the point where chicken jumps from 15, to 582? We know we can then skip to at least 582 in all the other indexes we're looking at. (Which is doable as the indexes are going to be stored as some sort of tree we can retrieve a particular range out of) This is pretty much a merge join http://en.wikipedia.org/wiki/Sort-merge_join

Incidentally you can make indexes on Ingredient->Page/Cuisine->Page rather than maintaining Page indexes for specific ingredients/cuisines. So with just two indexes you can filter out any Cuisine/Ingredient combo (You can even do multiple ingredients! As it's an intersection though, multiple cuisines wouldn't be useful as few dishes are likely to fall under multiple cuisines, you could still do it though.)

That wasn't "intuitive" to me either. And what about the simpler case of multiple values of a single index, for example Chicken and Cashews. Any good retrieval system should be able to intersect the lists from the indexes. The existence of these compound indexes leads me to believe there is some limitation.

This was a very graphic and easy-to-understand explanation of indexes in general. Although I thought I understood indexes, it helped me understand secondary indexes better. Thanks for linking!

Did they ever write the promised second part where they wanted to present B-trees in a similar form? I think I understand B-trees, but it wouldn't be the first time I realise I didn't really understand something after all...

>It’s knowing how indexes work and having the intuition to create the best indexes for your queries and your data set. Lacking this intuition, your production database will eventually slow to a crawl, you’ll upgrade your hardware in vain, and when all else fails, you’ll blame both gods and men.

it isn't an intuition. It is knowledge about access paths your system uses in query plans and how to manage them. "Intuition" and "use [always] indexes" - these 2 perpetrated legends is what frequently the root cause of the pitiful performance of reasonably large DB systems (ie. ones that many times larger than the RAM of the DB [cluster]). People get charmed by the "magic" of B-tree and miss the iron-platter reality of random vs stream IO. You see a DBA or DB developer expressing unconditional disgust pointing toward the red line of "Full-Table Scan" in query plan he managed to display in the GUI tool and you immediately understand what the situation at this client site is :)

Again, don't get me wrong, i'n not saying that indexes shouldn't be used - that would be the same mistake as to say they should always be used. What i'm saying is that you need to know (not intuit) when and how and why to use or not to use specific access paths.

Can you give an example of what an access path is and how to manage it? This concept is new to me, and I think I understand what you're saying, but would like to make sure.

i'm not RTFM-ing you, it is just they explain it better :


How about know and intuit? They aren't mutually exclusive.

I think this article is about learning about both the art and the science of database design.

unfortunately, on practice the "intuition" too frequently is just a camouflage for "uneducated/uninformed guessing". Have you been to the "performance" meetings with customer's DBAs, developers/analysts, PMs, business function managers, integrator's PMs, engineers, vendor's PMs, developers/managers? A Niagara of "intuition" that one can drown in.

Any discussion about indexing needs a link to http://use-the-index-luke.com/

It's the best resource that I've seen with respect to databases and indexing.

Answer to the first quiz is clearly wrong; scanning an index - any index - is faster than scanning all 5k pages.

Registration is open for Startup School 2019. Classes start July 22nd.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact