Hacker News
A terrible schema from a clueless programmer (rachelbythebay.com)
842 points by zdw 72 days ago | 450 comments

Related to database indexes, but not the post: a busted database index brought down ticket sales for the 2008 Olympic Games.

This was the first time regular people could go buy tickets for events & they had been lining up overnight at Bank of China locations throughout the country. We were down for over a day before we called it off. Apparently this led to minor upheaval at several locations in Beijing & riot police were called in.

We were pretty puzzled as we had an index and had load tested extensively. We had Oracle support working directly with us & couldn't figure out why queries had started to become table scans.

The culprit? A point upgrade to DBD::Oracle (something like X.Y.3 to X.Y.4) introduced a subtle bug in character set handling. The index required using a particular Unicode character set, and we were specifying it, but when it was translated into the actual query, it wasn't exactly the right one, so the DB assumed it couldn't use the index. Then, when all the banks opened & a large portion of a very populous country tried to buy tickets at the same time, things just melted.

Not a fun day.

Can't imagine the pain to even discover this problem. It normally takes me a long time to be like, "maybe I didn't make a mistake, and I have a third party bug?". Third party bugs are always the most annoying to me, usually the longest to diagnose and then ultimately I have to just tell my boss I can't do anything to fix it, but hopefully I can find a way to avoid it.

Imagine having to diagnose a bug in the processor!


"More experienced programmers know very well that the bug is generally in their code: occasionally in third-party libraries; very rarely in system libraries; exceedingly rarely in the compiler; and never in the processor."

Fork and fix? I’ve personally been pleasantly surprised by how fast maintainers will merge a PR I submit.

If it causes an issue with our own systems it also gives me a very good justification for doing it.

In Oracle's proprietary system?

Oracle doesn't appear to maintain the DBD::Oracle Perl client module.

Sometimes I've had a pr merged. Sometimes I've had stubborn maintainers that think that their project should not follow common domain standards for /shrug reasons.

And then many other times I honestly don't have the knowledge to be able to contribute into large third party systems and it seems very rare to get an issue resolved without submitting a pr.

And lastly sometimes it's hard to judge if unexpected behavior is really a 'bug'. Similar to the last point, it can honestly be hard to tell the difference sometimes.

At my previous job, a particular database was designed using an ASCII string type for a particular field rather than Unicode. If you then queried with a string in Unicode format, the database decided that the comparison should be done in Unicode. The only way to do that was a table scan, converting all the ASCII fields to Unicode on the fly. It was found only in production.
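An analogous effect is easy to reproduce in SQLite (a sketch only; the original story was a different engine, and the table/column names here are made up). An index is only usable when its collation matches the query predicate's, so a mismatch silently degrades to a full scan:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (name TEXT)")
con.execute("CREATE INDEX idx_name ON t(name)")  # default BINARY collation

# Same collation as the index: the planner can use idx_name.
plan_ok = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE name = 'abc'"
).fetchall()

# Different collation on the predicate: the index no longer applies.
plan_scan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE name = 'abc' COLLATE NOCASE"
).fetchall()

print(plan_ok[0][3])    # a SEARCH using idx_name
print(plan_scan[0][3])  # a SCAN of t
```

The same shape of failure applies to charset/collation mismatches in Oracle, MySQL, and SQL Server: the data and index are fine, but the predicate no longer "matches" them.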

Given that you mention China in your story, did GB 18030 have anything to do with your problems?


That sounds like newbie developers wrote that and no DBAs were involved in checking the code; not a database design problem. You are throwing blame without knowing enough about the matter. If your string fits in a varchar, it makes absolutely no sense to change it to nvarchar because your query has unicode as input.

I reread the parent quote several times and cannot detect a hint of throwing blame at all.

Most, but not all of the team were young, and the DB schema was designed by themselves. I did not intend to throw blame. It is more the kind of situation where a sensible decision (at first glance) creates a footgun down the road.

What? The database made the call to compare the strings as Unicode. Are you accusing the DB developers of not doing their job?

Yes, databases don't decide, developers do.

The moral of this story, to me, is always lock in all dependencies to the exact specific versions tested for production release.

I love DBD::Oracle so much I've always arranged to have other people test oracle related DBIx::Class features for me.

Writing the pre-ANSI-JOIN-support SQL dialect code was pretty good fun though.

The one that always got us is much more mundane: deleting a row requires an index in every table with a FK to it, to avoid table scans.

Near as I can tell, we assume there is some bit of magic built into the foreign key concept that handles this for us, but that is not the case.

Yeah I’ve run into this a few times. In Postgres, you can’t add a foreign key referencing columns that aren’t indexed. I assume this is for performance reasons so inserts/updates are efficient, but there is no such check that the source columns are also indexed, so updates/deletes on the referenced table can be terribly slow.

Isn’t the source almost always an indexed primary key?

No. The target is normally a primary key but the source is not. The source is rarely even unique.

Foreign keys are naturally referencing primary keys, which have their own (unique) indexes by default. And there's some extra magic with CASCADE.

But it seems like I am missing your point: care to expand on it so I can learn about the gotcha as well? What are you attempting to do, and what's making the performance slow?

If I've understood correctly it's the foreign key column itself that isn't indexed by default. Deleting a record then requires a table scan to ensure that the constraint holds

Example: if you delete a record from the customers table, you want an index on the foreign key in the orders table to find and delete the corresponding entries. This also means the performance of the delete can have the unfortunate property where the small work of deleting a single row cascades into the work of deleting many rows.
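A minimal sketch of the gotcha in SQLite (table names follow the customers/orders example above and are otherwise made up). The lookup a parent-row DELETE implies is "find child rows by FK value", and without an index on the FK column that is a scan:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id)
    );
""")

# The lookup a DELETE on customers implies: find child rows by FK value.
plan_before = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 1"
).fetchall()
print(plan_before[0][3])  # a SCAN of orders: no index on the FK column

con.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")
plan_after = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 1"
).fetchall()
print(plan_after[0][3])  # a SEARCH using idx_orders_customer
```

The same applies in MySQL and (as noted above) Postgres: the referenced side gets an index for free via its primary key, the referencing side does not.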

Which is why many elect to just ban deletes.

That’s not the reason. The reason why people don’t delete is because nobody wants to be left with inconsistent data relations. Deleting a customer is more a matter of deleting their PII (or scrambling it) and leaving everything else intact.

Or in the case of Silicon Valley, leave everything intact with a disabled flag and then spam you for the next decade or so.

I don't buy that reason, because that inconsistency can be easily prevented with good schema hygiene (either ON DELETE NO ACTION or ON DELETE CASCADE). The problem is rather that in order to maintain consistent data relations, delete operations must be carefully designed and that part of the functional design is usually skipped because it's perceived as not important.
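The "schema hygiene" point can be shown concretely (a sketch in SQLite, reusing the customers/orders example; note SQLite requires the `foreign_keys` pragma to be switched on per connection). With ON DELETE CASCADE declared, deleting a parent cannot leave orphaned children behind:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # FK enforcement is off by default in SQLite
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id) ON DELETE CASCADE
    );
    INSERT INTO customers VALUES (1);
    INSERT INTO orders VALUES (10, 1), (11, 1);
""")

con.execute("DELETE FROM customers WHERE id = 1")
remaining = con.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(remaining)  # 0: the child rows were removed with the parent
```

With ON DELETE NO ACTION (or RESTRICT) instead, the same DELETE would fail outright rather than orphan the orders; either way the relations stay consistent.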

Which is more or less what the GP says as well, deletes are not implemented because doing it properly requires proper design.

Had a similar experience (albeit with far more minor consequences): an Oracle Portal installation which worked fine in Dev/Test but with degrading performance on the actual Prod server (which was also immensely more powerful than what we used to develop on).

Oracle was called in, "en masse". Keep also in mind that Oracle Portal was basically running on top of lots of dedicated tables and stored procedures and you could not access the data directly; you interacted with the contents only through views, aliases, etc.

It was indeed due to a smallish version bump. But we had a pretty good Team Leader who had made Oracle declare that the two different versions were absolutely compatible, allowing us to write code on the older version.

He made them state this in written form, before starting the development so it was later impossible to blame the developers for the problem.

I've never had to deal with a production-breaking issue resulting from this, but I have had to debug a collation-related issue in SQL Server that caused table scans to occur. Fortunately, the SQL profiler made it really obvious that a collation mismatch was the culprit.

Could you say more about why that was the culprit and not some problem with the load testing? For the Olympics, it seems like the most important test is "What happens when we launch?" That's the kind of thing I'd run on every commit.

an automated system level upgrade in the production env, exactly 10 seconds before launch.

load tests usually only happen on the staging env, not production; that is the policy.

In that case, it sounds like a really good reason to question why the staging environment is not enough like prod that load tests are valid. Or why a not-fully-tested system upgrade would happen 10 seconds before launch.

Why exactly would it be so bad to just put a suitable index on the table containing strings? The time complexity of the resulting search would be the same, so I assume there will be some constant factor slowdowns. Is it that indices over string fields are stored inefficiently on disk? (If so, can that not be fixed in the db engine directly?) Or is this fine today but wasn't fine 15 years ago?

I think the article describes the solution that was used and not necessarily the best solution. If they made the “obvious” mistake in the first case they mightn’t arrive at the best solution in the second. Adding the single index and using the four tables have the same worst case complexity for reads, O(log n), with different constants and massively beat scanning at O(n).

It may be that the second solution wasn’t the best, or that it was better for write performance or memory usage—this was 2002 after all. It may also be the case that the cardinality of some columns was low in such a way that the four table solution was better, but maybe getting the columns in the right order for a multi-column index could do that too.

Either way, I don’t think it really matters that much and doesn’t affect the point of the article.

Indexing existed 15 years ago. The article never mentions why indexing didn't solve this problem. Super weird take on the author's part...

Early InnoDB* had pretty strict limits on varchar indexes and was not the most efficient. I don't remember the details but it's entirely possible the single-table format Rachel described ran head on into those limitations.

Also remember indexes take space, and if you index all your text columns you'll balloon your DB size; and this was 2002, when that mattered a lot more even for text. Indexes also add write and compute burden for inserts/updates as now the DB engine has to compute and insert new index entries in addition to the row itself.

Finally, normalizing your data while using the file-per-table setting (also not the default back then) can additionally provide better write & mixed throughput due to the writes being spread across multiple file descriptors and not fighting over as many shared locks. (The locking semantics for InnoDB have also improved massively in the last 19 years.)

* Assuming the author used InnoDB and not MyISAM. The latter is a garbage engine that doesn't even provide basic crash/reboot durability, and was the default for years; using MyISAM was considered the #1 newbie MySQL mistake back then, and it happened all the time.

> indexes take space, and if you index all your text columns you'll balloon your DB … also add write and compute burden for inserts/updates as now the DB engine has to compute and insert new index entries in addition to the row itself.

Indexes didn’t go away with these extra tables, they live in the ‘id’ field of each one. They also probably had UNIQUE constraint on the ‘value’ field, spending time on what you describe in the second half of my citation.

I mean, that should have saved some space for non-unique strings, but all other machinery is still there. And there are 4 extra unique constraints (also an index) in addition to 4 primary keys. Unless these strings are long, the space savings may turn out to be pretty marginal.

Right, indexes don't "go away" as you normalize, but the amount of data indexed drastically shifts and four unique indexes are pretty much guaranteed to be smaller in size than one massive combinatorial index of (all, four, possible, values). Not just in the case where any duplicates exist, but just alone in terms of the database's overhead in things like field delimiters and page metadata. While the specifics will vary a lot with specific DB implementations, generally with varchars involved you can usually assume a worst-case in terms of field delimiting (and from there sometimes a worst-case in average page size and how that impacts B-Tree balancing).

In general, indexes alone can't save you from normalization problems. From certain points of view an index is itself another form of denormalization, and while it is a very handy form of denormalization, heavily relying on big composite indexes (and the "natural keys" they represent) makes normalization worse, and you have to keep such trade-offs in mind just as you would any other normalization planning.

It is theoretically possible a database could do something much smarter and semi-normalized with, for instance, Trie-based "search" indexes, but specifically in 2002, so far as I'm aware, in most of the databases at the time such indexes were never the default, didn't have great multi-column support, and would have been expensive to compute if you did turn them on. Even such "smart" indexes would likely still have suggested you normalize first.

To me it seems that spreading it over four tables would lead to a lot more potential read locks while the big combined table is waiting for a join on each of the others, and some process is trying to insert on the others and link them to the main. This is assuming they were using foreign keys and the main table 4-column index was unique.

Old best practice for InnoDB performance was actually to never use foreign keys because of these locking issues. Not sure if that's the case in 2021.

InnoDB uses row-level locking, and foreign keys are (usually) a great feature to ensure data integrity. But using multiple foreign keys from tables `a`,`b` as a composite index for table `x` can cause deadlock if both are being updated in rapid succession, because an update on `a` gets a lock on `x`, which needs a read lock on `b` which is waiting for the lock from `a` to be released. I try to never structure multiple foreign keys as a composite index.

This is fascinating, thanks for sharing.

> Also remember indexes take space

Indexes take space, except for the clustered index.

An important distinction.

If the point of this table is always selecting based on IP, m_from, m_to, then clustering on those columns in that order would make sense.

Of course, if it's expected that a lot of other query patterns will exist then it might make sense to only cluster on one of those columns and build indexes on the others.

But IP addresses are actually _numbers_, and you can also save them as bytes: 32 bits for IPv4, 128 bits for IPv6 addresses. (PostgreSQL has a native datatype 'inet' which has supported IPv6 since 7.4; if you're using Postgres and save an IP address in a character string, you're doing it wrong.) If you would save the IP as a number and put an index on it, all your queries would be blazing fast!
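In application code the conversion is a one-liner; a sketch using Python's stdlib ipaddress module (the specific addresses here are documentation examples, not from the article). An IPv4 address fits an unsigned 32-bit column; IPv6 needs 16 bytes, hence a byte/inet column rather than an integer:

```python
import ipaddress

v4 = ipaddress.ip_address("203.0.113.7")
v6 = ipaddress.ip_address("2001:db8::1")

print(int(v4))         # 3405803783: fits an unsigned 32-bit column
print(len(v4.packed))  # 4 bytes
print(len(v6.packed))  # 16 bytes: too big for any native integer column

# The numeric form round-trips losslessly.
assert ipaddress.ip_address(int(v4)) == v4
```

Numeric storage also gets you correct range queries (e.g. "all hits from this /24") essentially for free, which string storage cannot do.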

> using MyISAM was considered the #1 newbie MySQL mistake back then, and it happened all the time.

The author mentions that these events took place in 2002. At that time the site was likely still running MySQL 3 and InnoDB was very new and considered experimental. Sure MySQL 4.0 had shipped in 2002 and InnoDB was a first class citizen, but back then upgrades from one version of MySQL to another weren't trivial tasks. You also tended to wait until the .1 release before making that leap.

So in fairness for the folks who originally set up that database MyISAM was likely their only realistic option.

I looked after and managed a fleet of MySQL servers from back then and for many years afterwards and even then it wasn't until MySQL 5 that we fully put our trust in InnoDB.

Surely more than 15 years! 50 maybe?

I had the same initial reaction in reading the post. My assumption was that indexing would have sufficiently sped up the query speed problem.

However, normalizing the fields that contain repetitive data into separate tables could create significant space savings since the full text for each column would not need to stored for each row. Instead of 20-30 bytes per email address, a 4 byte (assuming 32-bit era) OID is stored in its place.

It's pretty easy to imagine how quickly the savings would add up, and that it would be very helpful in the era before SSDs or even 1TB HDDs existed.
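The savings are easy to ballpark. A back-of-envelope sketch with made-up but plausible numbers (row count, average address length, and distinct-value count are all assumptions, not figures from the article):

```python
rows = 10_000_000           # log rows in the big table
avg_email_len = 25          # bytes per address stored inline
distinct_emails = 50_000    # unique addresses actually seen
oid_size = 4                # 32-bit surrogate key

# Inline: every row carries the full string.
denormalized = rows * avg_email_len

# Normalized: every row carries a 4-byte id; each string stored once.
normalized = rows * oid_size + distinct_emails * avg_email_len

print(f"inline:     {denormalized / 1e6:.0f} MB")  # 250 MB
print(f"normalized: {normalized / 1e6:.0f} MB")    # 41 MB
```

With those assumptions the normalized layout is roughly 6x smaller for that one column alone, before indexes; multiply by four columns and it's easy to see why it mattered on 2002-era disks.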

Normalization is important for deduplication, not only to index and compare a few short numbers instead of a few long strings: those host names and email addresses are long and often repeated.

That's not what normalization is. You're thinking of deduplication or compression. This table is normalized as per usual database terminology.

But if it’s guaranteed to be 1:1, why? The two implementations (normalized and denormalized) in that case should be completely isomorphic, quirks of the DB engine aside.

Why what? I don't understand what you are asking.

Yeah, while this is not optimal (the IP should be converted to an integer, the time should be a timestamp), the table would be small (as old entries could be safely deleted). The only real issue is the lack of indices.

Also is helo field even needed?

IP should NOT be an integer. IPv6 does not fit in a database integer. MySQL has functions to convert to bytes.

Even better, use a database with a proper inet datatype. That way you get correctness, space efficiency, and ability to intelligently index.

In 2002 it definitely should have been an integer.

In 2021 I’d recommend just turning off the IPv6 allocation or deleting it from DNS like this site does.

> In 2021 I’d recommend just turning off the IPv6 allocation or deleting it from DNS like this site does.

This is the funniest IPv6 excuse I've heard on HN.


Thanks! I was really surprised to find it myself. I used HN in some sample code for analyzing how SSL tickets are reused and was surprised to see an IPv4 address come back.

Yes. A proper mail implementation won't randomize it, so it should be consistent. Depending on the design of the validation stages it might only need a case-normalized lookup to the string index table and that id as part of the final multi-column unique index, and in that last case only to convey to the database the expectations to optimize around (and require the data adhere to).

Let's say the article is a critique of some system by someone that is not much better at databases. As many people have also pointed out, a bad solution was replaced with a bad solution that works better for someone.

At the worst you could concatenate the strings and have a single primary key.. or hash the string concatenation.
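A sketch of the "hash the concatenation" idea (the column names follow the IP/helo/from/to fields discussed elsewhere in the thread; the separator choice is my own assumption). Hashing gives a fixed-width surrogate key, and a separator that cannot appear in the values avoids ambiguous joins like ("ab","c") vs ("a","bc"):

```python
import hashlib

def row_key(ip: str, helo: str, m_from: str, m_to: str) -> str:
    # NUL can't appear in these values, so the join is unambiguous.
    joined = "\x00".join([ip, helo, m_from, m_to])
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

k1 = row_key("10.0.0.1", "mail.example.com", "a@example.com", "b@example.com")
k2 = row_key("10.0.0.1", "mail.example.com", "a@example.com", "b@example.com")
print(k1 == k2)  # True: identical tuples always hash to the same key
print(len(k1))   # 64 hex chars, regardless of input lengths
```

The trade-off versus four small lookup tables: the hash key is compact and uniform, but you can no longer query by a single component without extra indexes.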

It was fine on -some- databases 20 years ago. mysql of that era, less often.

The ending is the most important part.

> Now, what do you suppose happened to that clueless programmer who didn't know anything about foreign key relationships?

> Well, that's easy. She just wrote this post for you. That's right, I was that clueless newbie who came up with a completely ridiculous abuse of a SQL database that was slow, bloated, and obviously wrong at a glance to anyone who had a clue.

> My point is: EVERYONE goes through this, particularly if operating in a vacuum with no mentorship, guidance, or reference points. Considering that we as an industry tend to chase off anyone who makes it to the age of 35, is it any surprise that we have a giant flock of people roaming around trying anything that'll work?

(It seems a lot of folks are getting nerd-sniped by the set-up and missing the moral of the story, eh?)

I don't know the actual numbers, but it's been pointed out that at any given time something like half of all programmers have been doing it less than five years, for decades now.

That, plus the strident ignorance of past art and practice, seem to me to bring on a lot of issues.

I don't think it's purely nerd-sniping. If your story is "at first you are bad, but then you get good", but your example is of a case where you did something fine but then replaced it with something worse, that rather undermines the story.

If even someone with Rachel's level of experience still doesn't know all the minutiae of database optimization, I think that just amplifies her point about the importance of mentoring novices.

I recall reading a rant on another site about someone so upset they had to deal with clueless noobs at work.

They then went on to list the errors that this brand new entry level employee had made when writing ... an authentication system ...

I was more than a little shocked when I realized they were serious and hadn't realized the issue was sending the new entry level guy to do that job alone.

I would expect most seniors to want a set of eyes across their work doing any form of authk or z. If only to share blame :p

> authk

I've always seen authn. Where'd you pick this usage up?

good question! I meant authn. I think I just subbed in the phonic from the ..cation part of the word.

You get my upvote both for having a friendly response, and because your answer is what my brain filled in and I’m glad I don’t need to learn a new shorthand.

You need to distinguish authentication from authorization. "auth9n" and "auth8n" might be a slight bit too obscure for the purpose.

What a completely nonresponsive "response". To the best of my knowledge, no one has ever even tried to use auth9n or auth8n. But that's not where the 'n' in authn comes from. authn and authk are both equally distinct from authz.

You appear to be unfamiliar with the extremely common, and problematic, abbreviation "i18n" for "internationalization".

For my part, I have not encountered "authn", "authz", or "authk" before. But prior discussion did not even mention the word "authorization", and that seemed worth bringing out in the open.

So, not nonresponsive, just responsive to things you weren't personally interested in.

Generally authentication is authn and authorization is authz

> but your example is of a case where you did something fine but then replaced it with something worse

The way I see it is that it was a case of something that wasn't working, then replaced with something that was working.

I don't know what your metrics about "good" or "bad" are but eventually they got a solution that covered their use cases and the solution was good enough for stakeholders and that is "good".

Chronology alone only progresses; progress sometimes regresses.

> It seems a lot of folks are getting nerd-sniped

that is an excellent term.

Old enough some people under 35 may not have encountered it… ;)

I feel like that phrase has been there forever, and I'm exactly 35. Will this happen more often as I grow older? Ugh. Feels weird. Maybe also a bit depressing.

It was weird seeing all the spongebob squarepants memes take over the internet when I was too old to ever grow up with that. I turned 28 in 1999 when that first aired. That was my "holy shit I'm so old" moment when that finally trickled up into my awareness as someone nearly turning 40 or so.

As a 48 year old who has no idea who an awful lot of modern celebrities are, I'm constantly wondering if it is because I'm just old now or if its because there are just way more celebrities these days due to the large parasocial celebrity class that didn't really exist when I was younger.

Maybe a bit of both.

> parasocial


There, FTFY.

For me it is Pokemon. I was a few years older than the demographic it targeted.

I could have jumped on it and played it but I figured just another toy craze, it'll be over before long.

Same age as you. I have a lot of friends in the Philippines. I swear the Facebook employment info of half the people in the Philippines says they work at the Krusty Krab. (And for most filipinos, the internet = facebook.) For some reason I never asked what that meant, and for many years I thought that was just some odd joke, and was vaguely puzzled about how widespread it is. Eventually I happened on the Spongebob connection!

Well, then, they are part of the 10,000


Only if they were born in the US

I'm 49, I don't know how I never saw that xkcd.

I just spent several minutes trying to solve that resistor problem (or a simplified version with adjacent nodes) before giving up and deciding I’ll look up other people’s analyses when I get the time. Definitely a good example.

The math would resemble xkcd's Lucky Ten Thousand: https://xkcd.com/1053/

With yes the added confounding factors of how often our industry seems to prefer to hire youth over experience, and subsequently how often experience "drains" to other fields/management roles/"dark matter".

> (It seems a lot of folks are getting nerd-sniped by the set-up and missing the moral of the story, eh?)

The narrative should lead to the conclusion. If I told you the story of the tortoise and the hare in which the hare gets shot by a hunter, then said the moral is "perseverance wins", you'd be rightfully confused.

Yep. Folks are getting lost in the weeds discussing indexing of database tables. That's _totally_ beside the point here.

The thing is, the first implementation was a perfectly fine "straight line" approach to solve the problem at hand. One table, a few columns, computers are pretty fast at searching for stuff... why not? In many scenarios, one would never see a problem with that schema.

Unfortunately, "operating in a vacuum with no mentorship, guidance, or reference points", is normal for many folks. She's talking about the trenches of SV, it's even worse outside of that where the only help you might get is smug smackdowns on stackoverflow (or worse, the DBA stackexchange) for daring to ask about such a "basic problem".

Yes and no and this post highlights a subtle issue with mentorship (which I think is important): Technology does not stand still. What was true in 2002, might not be true today. While adopting the naïve approach was detrimental back then, today databases recognise that this happens and provide easy workarounds to get you out of trouble that didn't exist back then.

I've experienced this just by switching languages. C# had many articles dedicated to how much better StringBuilder was compared to String.Concat and yet, other languages would do the right thing by default. I would give advice in a totally different language about a problem that the target language did not have.

As the song goes:

"Be careful whose advice you buy but be patient with those who supply it

Advice is a form of nostalgia, dispensing it is a way of fishing the past

From the disposal, wiping it off, painting over the ugly parts

And recycling it for more than it's worth"

That might be true if you just take blanket advice. The key is to find out why say StringBuilder is better than String.Concat. If you understand the implementation details and tradeoffs involved, this makes the knowledge much more applicable in the future. The core concepts in technology do not move nearly as fast as individual projects, libraries, frameworks, languages, etc...

You wrote: <<While adopting the naïve approach was detrimental back then, today databases recognise that this happens and provide easy workarounds to get you out of trouble that didn't exist back then.>>

Do you have a specific example that would help in the case described in Rachel's blog post?

I am still making (and finding) occasional indexing and 3NF mistakes. In my experience, it is always humans finding and fixing these issues.

But trust me on the sunscreen.

Totally off topic I know, but what is the concern with String.Concat ?

If you repeatedly use concat to build up a string, the amount of time grows quadratically. This is because the string is copied each time you concat. Note that the + operator on strings gets turned into a call to concat.


I was wondering about it in the context of "yet, other languages would do the right thing by default". Repeatedly concatenating to the same string (as opposed to concatenating an array of strings in one go) would be slow in any language I know of, unless you allocate a larger buffer up front, which is what StringBuilder does.

Some languages have mutable strings, but you would still need to allocate a sufficiently larger buffer if you want to add strings in a loop.

In languages that have immutable string semantics and can detect that they are the exclusive owner of a string, they can modify them in place. For reference counting languages like Python, they can easily determine they are the exclusive owner by checking for a reference count of 1. See:


Rust's ownership system ensures that there cannot exist mutable and immutable references to the same object at the same time. So if you have a mutable reference to a string it is ok to modify in place.

I have not confirmed either the Python example or the Rust example too deeply, since I'm replying to a 5 day old comment. But the general principle holds. GC languages like C# and Java don't have easy ways to check ownership so they always copy strings when mutating them. Maybe I'll write a blog post on this in the future.

I don't think StringBuilder is faster specifically because it allocates a large buffer. Pretty much every language with growable data structures can already grow any such structure pretty fast, including arrays, maps, and mutable strings. They already have a bunch of pre-set constants about stuff like how big to allocate for the initial empty one and how much more to allocate every time you overflow to balance speed and memory efficiency. It's faster because it's mutable and keeps adding new data to the existing buffer until it gets too large.

The cost is not the allocation per se, the cost is copying the bytes. When concatenating two strings, the bytes from both strings are copied into the new string. If you repeatedly concatenate, the bytes end up getting copied many times. E.g if you concatenate 100 strings by appending one by one, the bytes in the first string is copied 99 times, the bytes in the second string is copied 98 times and so on.

Using a StringBuilder, the strings are only copied once when copied into the buffer. If the buffer need to grow, the whole buffer is copied into a larger buffer, but if the buffer grows by say doubling the size, then this cost is amortized, so each string on average is still only copied twice at worst.

Faster yet is concatenating an array of strings in a single concat operation. This avoids the buffer resize issue, since the necessary buffer size is known up front. But this leads to the subtle issue where a single concat is faster than using a StringBuilder, while the cost for multiple appends grows quadratically for concat but only linearly for StringBuilder.
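The two approaches above can be sketched in Python, where `str.join` plays the role of the single concat over an array (a sketch of the complexity argument, not a benchmark; note CPython sometimes mutates the accumulator in place when it holds the only reference, as discussed elsewhere in this thread, but the language doesn't guarantee it):

```python
parts = [f"line {i}\n" for i in range(1000)]

# Repeated append: in the worst case, copies the accumulated string on
# every iteration, for quadratic total work across all appends.
slow = ""
for p in parts:
    slow += p

# Single join: total size is known up front, so each byte is copied once.
fast = "".join(parts)

print(slow == fast)  # True: same result, very different cost curves
```

The StringBuilder strategy sits between the two: it can't presize exactly, but doubling the buffer amortizes the copies down to linear.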

Repeatedly concatting a string is the fastest way I am aware of to build a longer one in JavaScript. (This has been deliberately optimized for.) I believe this might also be the case for PHP, or at least very fast. Perl might be pretty fast at this but I could be wrong.

I know JS's case somewhat well and the lesson there is not, as the comment above put it, that "it just does the right thing". JS doesn't use a "String Builder" under the hood for a lot of string + calls; instead, JS does subtly the "wrong" thing as an optimization: in the .NET CLR strings are always, always immutable (or they are not strings according to guarantees that the CLR makes). JS lives in a single language VM that makes no such guarantees, and often lives in strictly single threaded worlds where the VM doesn't have to guarantee the immutability of strings to anyone at the binary memory layout level, so most JS VMs are free to do the "wrong thing" (from the .NET CLR perspective) and just mutably alter strings under the hood as a deliberate optimization.

There's a lot of interesting conflation of "right" versus "wrong" in this thread. There's the C# "best practices" right-versus-wrong of knowing when to choose things like StringBuilder over string.Concat or string's + operator overloading (which does just call string.Concat under the hood). There's the question of whether that right/wrong applies to other languages (the answer is far more complex than just "right" and "wrong"). There's the .NET CLR's hard-line "no string is mutable" guarantee setting a somewhat hard right-versus-wrong, while other languages' VMs/interpreters need not make such guarantees to themselves or to others. None of these are really ever "right" versus "wrong"; almost all of them are trade-offs described as "best practices", and we take the word "best" too strongly, especially when we don't know/examine/explore the trade-off context it originated from (sometimes wrongly assuming that "best" means "universally the best").

I believe joining an array of strings is still faster in JavaScript, since the combined size can be known up front.

I agree, and would even go one step further and say the first implementation was a decent first pass. Sometimes the 'awful' implementation is good enough and your time is better spent on something else. However, this pattern can sometimes bite you in the long term: you end up with tons of little 'hacks' all over the place, and people become afraid to touch them, or worse, copy from them.

This also nicely shows one of the fun things with table scans: they look decent at first, then perf goes to crap. Then you get to learn something. In this case it looks like she used normalization to scrunch down the main table size (a win on the scan rate, as it would not hit the disk as much and row reads would be faster). It probably also helped just on the 'to' lookups. Think about how many people are in an org: even a large company has a finite number of people, but they get thousands of emails a week. That table is going to be massively smaller than keeping the value in every row, copied over and over. Just doing the 'to' clause alone would cut huge amounts out of the scan it was doing. That is even before considering an index.

The trick with most SQL instances is to do less work. Grab fewer rows when you can. Use smaller tables if you can. Throw unnecessary data out of your rows if you can. Reduce round trips to the disk if you can (which usually conflicts with the previous rules). Pick your data ordering well. See if you can make your lookups ints instead of other datatypes. Indexes are usually a good first tool to grab when speeding up a query, but they do have a cost, on insert/update/delete and on disk. Usually that cost is less than your lookup time, but not always. Stick all of that together and you can have a DB that is really performant.
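The "smaller tables, int lookups" advice can be sketched concretely. This is an illustration only: table and column names are invented, and SQLite stands in for whatever RDBMS is actually in play. One row per distinct address keeps the lookup table tiny, and joins compare small integers instead of long strings.

```python
import sqlite3

# Hypothetical schema: a tiny lookup table of distinct addresses plus a
# large log table that references it by integer id.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE recipients (id INTEGER PRIMARY KEY, addr TEXT UNIQUE);
    CREATE TABLE mail_log (id INTEGER PRIMARY KEY,
                           rcpt_id INTEGER REFERENCES recipients(id),
                           ts INTEGER);
""")

def rcpt_id(addr):
    # Insert the address only if unseen; the lookup table stays small
    # no matter how many log rows accumulate.
    db.execute("INSERT OR IGNORE INTO recipients(addr) VALUES (?)", (addr,))
    return db.execute("SELECT id FROM recipients WHERE addr = ?",
                      (addr,)).fetchone()[0]

for ts, addr in enumerate(["a@example.com", "b@example.com", "a@example.com"]):
    db.execute("INSERT INTO mail_log(rcpt_id, ts) VALUES (?, ?)",
               (rcpt_id(addr), ts))

# The join compares integers; the string appears once, in the small table.
n, = db.execute("""
    SELECT COUNT(*) FROM mail_log
    JOIN recipients ON recipients.id = mail_log.rcpt_id
    WHERE recipients.addr = 'a@example.com'
""").fetchone()
print(n)  # 2
```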

For me the fun one was a DB I worked in where they used a GUID as the primary key, and therefore as the FK into other tables, which was also the default ordering key on disk. Took me a while to make them understand why lookup perf was so bad. I know why they did it, and it made sense at the time, but long term it was holding things back, ballooning the data size, and crushing the lookup times.

How did I come by all of this? Lots of reading, stubbing my toe on things, and watching others do the same. Most of the time you do not get a 'mentor'. :( But I sure try to teach others.

I’d be a lot more sympathetic if the major RDBMSes didn’t have outstanding and thorough reference manuals, or if there weren’t a mountain of books on the subject that cover, among other things, the topic of indexing and its importance. MySQL’s manual, for example, has covered this subject from the very beginning: https://dev.mysql.com/doc/refman/5.6/en/mysql-indexes.html (I don’t have the 3.x manuals handy but it was there back then too).

It’s not clear from the article whether the author spent any time studying the problem before implementing it. If she failed to do so, it is both problematic and way more common than it ought to be.

“Ready, fire, aim”

But you know what they say: good judgment comes from experiences and experience comes from bad judgment.

Compounding the problem here is that the author now has much more experience and in a reflective blog post, still got the wrong answer.

IMO the better lesson to take away here would have been to take the time getting to know the technology before putting it into production instead of jumping head first into it. That would be bar raising advice. The current advice doesn’t advise caution; instead it perpetuates the status quo and gives “feel good” advice.

I couldn't disagree more with this comment. No amount of reading documentation teaches you how to build production systems. You progress much faster by getting your hands dirty, making mistakes, and learning from them.

The challenge of writing good software is not about knowing and focusing on the perfection every gory detail, it's about developing the judgement to focus on the details that actually matter. Junior engineers who follow your advice will be scared to make mistakes and their development will languish compared to those who dive in and learn from their mistakes. As one gains experience it's easy to become arrogant and dismissive of mistakes that junior engineers make, but this can be extremely poisonous to their development.

Respectfully, I believe you are disagreeing with an argument that I am not making. It's not an either-or scenario. Both access to and reliance on reference materials and the ability to safely experiment are part of the professional engineer's toolbox. I can say this with confidence because I was a junior engineer once - it's not like I'm speaking without experience.

Everyone - new and experienced developers alike - should have a healthy fear of causing pain to customers. Serving customers - indirectly or otherwise - is what makes businesses run. As a principal engineer today my number one focus is on the customer experience, and I try to instill this same focus in every mentee I have.

Does that mean that junior developers should live in constant fear and decision paralysis? Of course not.

That's where the mentor - the more experienced and seasoned engineer - comes in. The mentor is roughly analogous to the teacher. And the teaching process uses a combination of a mentor, reference materials, and the lab process. The reference materials provide the details; the lab process provides a safe environment in which to learn; and the mentor provides the boundaries to prevent the experiments from causing damage, feedback to the mentee, and wisdom to fill in the gaps between them all. If any of these are absent, the new learner's development will suffer.

Outstanding educational systems were built over the last 3 centuries in the Western tradition using this technique, and other societies not steeped in this tradition have sent their children to us (especially in the last half-century) to do better than they ever could at home. It is a testament to the quality of the approach.

So no, I'm not advocating that junior developers should do nothing until they have read all the books front to back. They'd never gain any experience at all or make essential mistakes they could learn from if they did that. But junior developers should not work in a vacuum - more senior developers who both provide guidance and refer them to reference materials to study and try again, in my experience, lead to stronger, more capable engineers in the long term. "Learning to learn" and "respect what came before" are skills as important as the engineering process itself.

> That's where the mentor [...] comes in.

The thing is, specifically, that the OP did NOT have a mentor. They had a serious problem to solve, pronto, and knew enough to take a crack at it. OK, the initial implementation was suboptimal. So what? It's totally normal, learn and try again. Repeat.

It would be nice if every workplace had an orderly hierarchy of talent where everyone takes care to nurture and support everyone else (especially new folks), where it's OK to ask questions and receive guidance even across organizational silos, and where there are guardrails to mitigate accidents and terrible mistakes. If you work in such an environment, you are lucky.

It is far more common to have sharp-elbow/blamestorm workplaces which pretend to value "accountability" but then don't lift a finger to support anyone who takes initiative. I suspect that at the time Rachelbythebay worked in exactly this kind of environment, where it simply wasn't possible for an OPS person to go see the resident database expert and ask for advice.

Right. Hence my concern about the lack of helpful advice other than “accept yourself”[1]: Neophyte engineers lacking guardrails and local mentors should fall back to what they learned in school: experiment, use your books, and ask questions from the community. Mentors can be found outside the immediate workplace, after all.

And when it comes to taking production risks, measure twice and cut once, just like a good carpenter.

[1] Of course you should accept yourself. But that alone won’t advance the profession or one’s career.

> measure twice and cut once

Being able to do that is a luxury that many do not enjoy in nose-to-the-grindstone workplaces. You make an estimate (or someone makes it for you), then you gotta deliver mentor or no mentor, and whether you know the finer points of database design/care-and-feeding or not.

There's something good to be said for taking action. Rachelbythebay did just fine. No one died, the company didn't suffer, it was just a problem, it got corrected later. Big whoop.

But it's also about constantly asking yourself "how can this be done better? What do I not know that could improve it?"

I learned probably 80% of what I know about DB optimization from reading the Percona blog and S.O. The other 20% came from splashing around and making mistakes, and testing tons of different things with EXPLAIN. That's basically learning in a vacuum.

I couldn't disagree more with this comment. No amount of getting your hands dirty and making mistakes teaches you how to learn from your mistakes, nor changes the impact of the mistakes. You also progress much faster by learning from other peoples' mistakes, this is why written language is so powerful.

Honestly, I think your comment is okay minus the fact that you're trying to highlight such hard disagreement with a sentiment that people should read the fucking manual. Really, you couldn't disagree MORE?

There is definitely value in RTFM. There are cases where making mistakes in production is not acceptable, and progressing by "making mistakes" is a mistake on its own. I don't think the case in this article sounds like one of those, but they do exist (e.g. financial systems (think about people losing money on crypto exchanges), healthcare, or, say, sending rockets into space). In many cases, making mistakes is fine in test systems, but completely, absolutely, catastrophically unacceptable in production. Although, I refuse to blame the junior engineer for such mistakes, I blame management that "sent a boy to do a man's job" (apologies for a dated idiom here).

(As an aside, overall, I disagree with many of the comments here nitpicking the way the author solved the problem, calling her "clueless", etc. I really don't care about that level of detail, and while I agree the solution does not seem ideal, it worked for them better than the previous solution.)

> No amount of getting your hands dirty and making mistakes teaches you how to learn from your mistakes...

Categorically wrong.

Mistakes, failure, and trial and error are very much a part of developing skills. If you're not making mistakes, you're also not taking enough risks and thus missing out on opportunities for growth.

Yes, mistakes are required to learn from them, but making mistakes does not imply that a person will learn from their mistakes.

You can't really search a problem if you don't suspect its existence. At this point you might feel you have everything you need to start the implementation, how do you guess?

… I appreciate her sentiment at the end there that we should try to change the status quo (which I think does suck), but I'm not sure how much power the average employee has over it. Most employers I've worked at seem loath to retain anyone past about 2 years. (E.g., currently I'm in the 95th percentile for tenure, after just over 2 years.) IME it takes about 6 months to really learn how a company's systems work to the point of proficiency (yes, six months), which means we're spending ~25% of the time "training", where "training" is usually someone trying something crazy, failing, and getting corrected, as opposed to some structured form of learning. Oftentimes the systems these engineers are stumbling over are the internal ones; they need more features, or refactoring to account for organic growth, but since they're not shiny new customer-facing features they'll get exactly 0 priority from PMs. The engineers who build experience working with these systems, and who actually know what needs to change and how without breaking the existing use cases … are about to leave the company, since nobody is retained beyond ~2 years.

I also find my own team is usually strapped for resources, normally people. (Usually politely phrased as "time".) Yes, one has to be wary of mythical-man-month'ing it, but at my last employer, for example, we had essentially 2 of us on a project that could have easily used at least one, if not two, more people. Repeat across every project, and that employer was understaffed by 50-100%, essentially.

Some company just went for S-1, and they were bleeding cash. But they weren't bleeding it into new ventures: they were bleeding it into marketing. Sure, that might win you one or two customers, but I think you'd make much stronger gains with new products, or less buggy products that don't drive the existing customers away.

Also there's an obsession with "NIT" — not invented there — that really ties my hands as an engineer. Like, everything has to be outsourced to some cloud vendor whose product only barely fits some of the needs, and whose "support" department appears to be unaware of what a computer is. I'm a SWE, let me do my thing once in a while? (Yes, where there's a good fit for an external product, by all means. But these days my job is nearly all support tickets, and like 3% actual engineering.)

What's funny is there seems to be a lot of debate among grizzled veterans here about whether her old solution is actually better than the new. We're getting into index types, engine implementations, normal forms, etc. And really, this is a relatively simple development problem that I can easily imagine on an interview test.

Now, imagine a new engineer, just getting started trying to make sense of it all, yet being met with harsh criticism and impatience.

Maybe that was exactly her point--to get everyone debating over such a relatively trivial problem to communicate the messages: Good engineering is hard. Everyone is learning. Opinions can vary. Show some humility and grace.

Two engineers I interact with on a regular basis:

1. Me from six months ago. Clueless about a lot of stuff, made some odd technical choices for obscure reasons, left a lot of messes for me to deal with.

2. Me six months from now. Anything I don't have time for now, I can leave for him. If I was being considerate I'd write better docs for him to work with, but I don't have the time for that either.

> tend to chase off anyone who makes it to the age of 35

That's definitely not true anymore, if it ever was. Senior/staff level engs with 20+ years of experience are sought after and paid a ton of money.

I was reading her article like "yeah, well… wow, she's being pretty tough on this programmer; it's not that unreasonable anyway!", until this passage.

I would not expect her to be like this from what I've read on her blog so I was surprised. Well written!

The way she kept referring to the programmer made me suspicious that that would be the ending. I personally really like that as a way of conveying the message of us all having growing pains and learning the hard way from some inefficient code.

I know I've also had these encounters (specifically in DBs), and I'm sure there are plenty more to come.

Seems like the definition of hacker to me.

> My point is: EVERYONE goes through this

A lot of the basic issues beginners go through can be mitigated by paying attention in class (and/or having formal education in the first place).

I’ll grant that there are other, similar, things that everyone will go through though.

My own was manually writing an ‘implode’ function in Javascript because I didn’t know ‘join’ existed.

While the normalized version is more compact and doesn't store redundant data, it _also_ needs an index on the four columns or it'll have to check every row in the table. A similar index added to the original denormalized table would have given comparable query performance.

The table schema isn't terrible, it's just not great. A good first-pass to be optimized when it's discovered to be overly large.
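The point about a similar index on the denormalized table can be made concrete. This is a sketch with invented names, using SQLite rather than 2002-era MySQL: a composite index over the four-column tuple turns the lookup into an index search rather than a table scan.

```python
import sqlite3

# Hypothetical denormalized greylist table plus a composite index over
# the whole (ip, helo, from, to) tuple.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE greylist
              (ip TEXT, helo TEXT, mail_from TEXT, rcpt_to TEXT,
               first_seen INTEGER)""")
db.execute("CREATE INDEX idx_tuple ON greylist(ip, helo, mail_from, rcpt_to)")

# EXPLAIN QUERY PLAN shows whether the planner uses the index.
plan = db.execute("""
    EXPLAIN QUERY PLAN
    SELECT first_seen FROM greylist
    WHERE ip = '' AND helo = 'foobar'
      AND mail_from = 'a@spam.example' AND rcpt_to = 'victim@example.com'
""").fetchall()
print(plan)  # detail column should mention "USING INDEX idx_tuple"
```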

> The table schema isn't terrible, it's just not great.

It depends - if this was the full use-case then maybe a single table is actually a pretty good solution.

That is not true. The sub tables guarantee that each value is unique in the entire column.

The main table did not have any such guarantee, so there will be a lot of indexing overhead and duplication in at least 3 of the 4 columns that became sub tables.

Moreover, indexing would be much faster and more efficient, as there are only unique values in the sub tables.

Why wouldn't it have such guarantees, if the indices in question are defined as UNIQUE?

Because they are not unique in the main table.

E.g.: if there are multiple emails from the same domain name, then the domain name column will have the same entry multiple times.

When the domain is moved to a sub table, the main table still has duplicate entries, in the form of foreign keys, while the domain table has a unique entry for each domain name.

Foreign key columns (i.e. the ones in the main table) are often indexed themselves, and those indices can also be UNIQUE.

Feels like you could just concatenate and hash the 4 values with MD5 and store the hash and time.

Edit: I guess concatenate with a delimiter if you're worried about false positives with the concat. But it does read like a cache of "I've seen this before". Doing it this way would be compact and indexed well. MD5 was fast in 2002, and you could just use a CRC instead if it weren't. I suppose you lose some operational visibility to what's going on.
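The hash-the-tuple idea above is a few lines of code. This is a sketch with invented field names; the delimiter matters because without one, ("ab", "c") and ("a", "bc") would concatenate to the same bytes.

```python
import hashlib

def tuple_key(ip, helo, mail_from, rcpt_to):
    # NUL can't appear in any of these fields, so it's a safe delimiter
    # that prevents "ab"+"c" from colliding with "a"+"bc".
    raw = "\x00".join([ip, helo, mail_from, rcpt_to])
    return hashlib.md5(raw.encode("utf-8")).hexdigest()

k = tuple_key("", "foobar", "a@spam.example", "victim@example.com")
print(len(k))  # 32 hex chars; store (key, first_seen) in one indexed table
```

A fixed-width 16-byte key indexes well regardless of how long the original strings were, at the cost of the operational visibility the edit mentions.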

Yup, we do this at work for similar purposes and it works a-ok. We also have some use cases that follow the original “problem” schema and they work fine with the correct indices involved.

My guess is that in 2002 there were some issues making those options unappealing to the Engineering team.

When we do this in the realm of huge traffic then we run the data through a log stream and it ends up in either a KV store, Parquet with SQL/Query layer on top of it, or hashed and rolled into a database (and all of the above of there are a lot of disparate consumers. Weee Data Lakes).

This is also the sort of thing I’d imagine Elastic would love you to use their search engine for.

Isn't that exactly what adding an index would do internally?

2002 MySQL had some limitations on indexing variable length string (varchar) columns. That's the gist of the linked story.

Depending on the RDBMS system, in most cases it does not.

For the 2021 version, you'd just generate a bloom filter/cuckoo hash from all the data gathered by your spamhaus database periodically. Make a separate one for each value in your tuple and your score would be the number of the sub-hashes that matched.

“Periodically” here would have to be quite frequent, as you have to rebuild the database before the next retry.

A couple times a day at most?

From the post, you'd have to be rebuilding it every 15 minutes:

> A real mail server which did SMTP properly would retry at some point, typically 15 minutes to an hour later. If it did retry and enough time had elapsed, we would allow it through.

Still better than locking the tables every time you receive a piece of mail.

I'm not super familiar with this stuff, but I believe you could then use a key-value store with automatic expiry like Redis for automatic pruning and faster lookups.
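The expiring key-value idea can be sketched in plain Python; a real deployment would just use Redis (`SET key value EX seconds`), but a lazy-expiry dict shows the mechanics. The clock is injectable here purely so the behavior is easy to demonstrate.

```python
import time

class TTLStore:
    """Minimal key-value store with per-key time-to-live, expired on read."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.data = {}  # key -> (value, expiry timestamp)

    def set(self, key, value, ttl):
        self.data[key] = (value, self.clock() + ttl)

    def get(self, key):
        entry = self.data.get(key)
        if entry is None:
            return None
        value, expiry = entry
        if self.clock() >= expiry:
            del self.data[key]  # lazy expiry: prune on access
            return None
        return value

# Greylist-style usage: remember a tuple for 15 minutes.
now = [0.0]
store = TTLStore(clock=lambda: now[0])
store.set("/foobar/a@spam.example", "seen", 900)
print(store.get("/foobar/a@spam.example"))  # seen
now[0] = 901.0
print(store.get("/foobar/a@spam.example"))  # None
```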

Redis and memcached didn't exist in that timeframe.

I'd say that pretty much everyone is a better DB programmer than I am, so I don't really go out of my way to design anything especially robust. I try to design stuff I'm not good at, in a way that makes it easy for someone that knows more than I do, to come in and toss out the work I did, to be replaced with good work.

There's lots of stuff that I'm bad at. I'm also good at a fair bit of stuff. I got that way, by being bad at it, making mistakes, asking "dumb" questions, and seeing how others did it.

> I try to design stuff I'm not good at, in a way that makes it easy for someone that knows more than I do, to come in and toss out the work I did, to be replaced with good work.

Sounds familiar. And sometimes that other someone is just a later version of yourself :)

I am very grateful when my past self thought to add comments for my future self.

People in the comments are getting (rightfully) outraged about the poor understanding of indexing, but I'm a little surprised that everyone here doesn't seem to understand normalization either. The original schema is perfectly normalized and is already in 3NF: none of the columns shown has a dependence on any of the other columns outside of the primary key (in other words, if you knew eg the values of the ip, helo, and from columns, you'd still have no information about the "to" column).

Normalization does not mean "the same string can only appear once". Mapping the string "address1@foo.bar" to a new value "id1" has no effect on the relations in your table. Now instead of "address1@foo.bar" in 20 different locations, you have "id1" in 20 different locations. There's been no actual "deduplication", but that's again not the point. Creating the extra tables has no impact on the normalization of the data.

Thank you. Was thinking that I was the clueless one... (probably a little) because that is not my understanding of what normalization is.

I thought perhaps swapping the strings to integers might make it easier to index, or perhaps it did indeed help with deduplication, in that the implementation didn't "compress" identical strings in a column, so the change saved space and perhaps helped performance. But both issues appear to be implementation issues with an unsophisticated MySQL circa 2000, rather than a fundamentally wrong schema.

I agreed with her comment at the end about not valuing experience, but proper database use should be taught to every developer at the undergraduate level, and somehow to the self-taught. Looking at CompTIA... they don't seem to have a db/sql test, only an "IT Fundamentals" cert which touches on it.

Came here to say this. Whatever may have been true of InnoDB 20 years ago, don't follow this article's advice for any modern relational database.

To add: Not sure about MySQL, but `varchar`/`text` in PostgreSQL for short strings like those in the article is very efficient. It basically just takes up space equaling the length of the string on disk, plus one byte [1].

[1] https://www.postgresql.org/docs/current/datatype-character.h...

>>> Mapping the string "address1@foo.bar" to a new value "id1" has no effect on the relations in your table.

How do you mean? If id1 is unique on table A, and table B has a foreign key dependency on A.id, then yeah you still have id1 in twenty locations but it's normalized in that altering the referenced table once will alter the joined value in all twenty cases.

This might not be important in the spam-graylisting use case, and very narrowly it might be 3NF as originally written, but it certainly wouldn't be if there were any other data attached to each email address, such as a weighting value.

>> Mapping the string "address1@foo.bar" to a new value "id1" has no effect on the relations in your table.

>How do you mean? If id1 is unique on table A, and table B has a foreign key dependency on A.id, then yeah you still have id1 in twenty locations but it's normalized in that altering the referenced table once will alter the joined value in all twenty cases.

Ok, but that's not what normalization means.

If you have a table as described in the article that looks like:

  foo  | bar  | baz
  -----+------+-----
  foo1 | bar1 | baz1
  foo2 | bar2 | baz2
  foo3 | bar3 | baz3
  foo4 | bar1 | baz2
  foo3 | bar1 | baz4

and then you say "ok, bar1 is now actually called id1", you now have a table

  foo  | bar  | baz
  -----+------+-----
  foo1 | id1  | baz1
  foo2 | bar2 | baz2
  foo3 | bar3 | baz3
  foo4 | id1  | baz2
  foo3 | id1  | baz4

you haven't actually changed anything about the relationships in this data. You've just renamed one of your values. This is really a form of compression, not normalization.

Normalization is fundamentally about the constraints on data and how they are codified. If you took the author's schema and added a new column for the country where the IP address is located (so the columns are now ip, helo, from, to, and country), then the table is no longer normalized, because there is an implicit relationship between ip and country: if an IP is located in the USA, every row with that IP must have USA as the country. If you know the IP for a row, you know its country. This is what 3NF is about. Here you'd be able to represent invalid data by inserting a row with that IP and a non-US country, and you normalize this schema by adding a new table mapping IP to country.

But none of that is what's going on in the article. The author described several fields that have no relationship at all between them--IP is assumed to be completely independent of helo, from address, and to address. And the second schema they propose is in no way "normalizing" anything. The four new tables don't establish any relationship between any of the data.

>This might not be important in the spam-graylisting use case, and very narrowly it might be 3NF as originally written, but it certainly wouldn't be if there were any other data attached to each email address, such as a weighting value.

It's not "very narrowly" 3NF. It's pretty clear-cut! A lot of commenters here are referring to the second schema as the "normalized" one and to me that betrays a fundamental misunderstanding of what the term even means. And sure, if you had a different schema that wasn't normalized, then it wouldn't be normalized, but that's not what's in the article.
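The hypothetical country column above can be sketched in a few lines of SQL. This is an illustration with invented names, using SQLite: the ip → country functional dependency is stated once in its own table, and the PRIMARY KEY guarantees one country per IP, so the contradictory rows the unnormalized version allows become unrepresentable.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Unnormalized: nothing stops two rows from disagreeing on an IP's country.
    CREATE TABLE attempts_bad (ip TEXT, helo TEXT, country TEXT);

    -- Normalized (3NF): the ip -> country dependency lives in one table,
    -- and the PRIMARY KEY allows exactly one country per IP.
    CREATE TABLE ip_country (ip TEXT PRIMARY KEY, country TEXT);
    CREATE TABLE attempts (ip TEXT REFERENCES ip_country(ip), helo TEXT);
""")
db.execute("INSERT INTO ip_country VALUES ('', 'US')")
db.execute("INSERT INTO attempts VALUES ('', 'foo')")
db.execute("INSERT INTO attempts VALUES ('', 'bar')")

# Every attempt row agrees on the country by construction.
rows = db.execute("""
    SELECT DISTINCT country FROM attempts JOIN ip_country USING (ip)
""").fetchall()
print(rows)  # [('US',)]
```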

Your point is well taken. In the bare description provided for the first schema in the article, the data is already normalized, and all that can be achieved is compression.

> This is really a form of compression, not normalization.

First of all, the index hashes of the emails depend on the email strings, hence the indexed original schema is not normalised.

Secondly, it would not effectively be compression unless there were in fact dependencies in the data. But we can make many fair assumptions about the statistical dependencies. For example, certain emails/ips occur together more often than others, and so on. In so far as our assumptions of these dependencies are correct, normalisation gives us all the usual benefits.

I think I’ve determined very long ago that the only normal form worth keeping in mind is the Boyce-Codd normal form.

While you may technically be correct, I think there’s few people that think of anything else when talking database normalization.

That said, I cannot quickly figure out if it’s actually true here.

Okay, this will be my last comment in this thread because I don't have any new way of presenting this and checking this is not a good use of my time.

This schema is also in Boyce-Codd normal form. It's normalized in every usual sense of the word. Trivially so, even. It's not a question of being "technically" correct. If you think the second schema is more normalized than the first one, you need to re-evaluate your mental model of what normalization means. That's all there is to it.

Plus depending on the amount of emails they get, that optimisation could be unnecessary. That database schema was perfectly fine in some cases.

This is very slightly not quite true, because the HELO string and the remote address should go hand-in-hand.

You might be right (I don't actually know what a HELO string is, I don't know anything about SMTP :). I was just going off how the author presented the data, as a tuple of four completely independent things.

Of course the main point still stands, that the two schemas are exactly as normalized as each other.

Edit: rereading the original post, the author mentions that "they forged...the HELO"--so perhaps there was indeed no relationship between HELO and IP here. But again, I don't know anything about SMTP, so this could be wrong.

I do know about smtp and you were right regardless, because the author was talking about 4 database fields, not smtp. The details of smtp are irrelevant.

Normalisation depends on the semantics of the data, and so the details of SMTP are very much relevant.


It's (somewhat) because the HELO is forged that there's no relationship between HELO and IP. The very first message of SMTP is "HELO <hostname>"; hostname can be either a unique identifier (server1.company.com, etc.) or a system-level identifier (mta.company.com for all of your company's outbound mail agents; bulk mailers might use bulk.client1.com when sending as client1, bulk.client2.com, etc.). But there is/was no authentication on what you send as HELO (now you can verify it against the TLS certificate, though many implementations don't do that at all, or well), so correlating based on the hostname in HELO was questionable at best. Thus, the combination of (HELO, IP) was treated as a single value, a tuple.

> But there is/was no authentication on what you send as HELO

Yep, and that explains the "foobar" rows - those should have resolved to the same IP, except because there's no authentication that blocks it you could put gibberish here and the SMTP server would accept it.

> so correlating based on the hostname in HELO was questionable at best

Eh, spambots from two different IPs could have both hardcoded "foobar" because of the lack of authentication, so I could see this working to filter legitimate/illegitimate emails from a compromised IP.

Right, useful as a signal in a bayesian filter most certainly, but there's no strong general rule.

Only if the SMTP client is following the RFC but being spammers they probably sent anything but the actual domain they sent from.

> Normalization does not mean "the same string can only appear once".

It can, if you want to easily update an email address, or easily remove all references to an email address because it is PII.

That's not what normalization normally means, no. See eg wikipedia https://en.wikipedia.org/wiki/Database_normalization

My piano teacher once told me that it's common to make a mistake, but it's disastrous to keep practicing without realizing that you have made one. What I look for in a developer is their ability to recognize a mistake and find ways to fix it.

The problem, I think is that most people, me included, don't really know what databases really do. There is a whole lot about optimizing procedural code, with a variety of tools, the dangers of premature optimization and the tradeoff with readability. Anyone with a computer science degree has heard about algorithmic complexity, caches, etc...

But databases are just magic. You try things out, usually involving a CREATE INDEX at some point, and sometime it gets faster, so you keep it.

Rachel, in her blog post, is a good example of that thought process. She used a "best practice", added an index (there is always an index) and it made her queries faster, cool. I don't blame her, it works, and it is a good reminder of the 3NF principle. But work like that on procedural code and I'm sure we will get plenty of reactions like "why no profiler?".

Many, many programmers write SQL, but very few seem to know about query plans and the way the underlying data structures work. It almost looks like secret knowledge of the DBA caste or something. I know it is all public knowledge of course, but it is rarely taught, and the little I know about it is all from personal curiosity.
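
Query plans are less magic than they look, and easy to poke at. A toy reconstruction with SQLite (table and column names are made up for illustration; the original system was MySQL, but the planner output tells the same story):

```python
import sqlite3

# Same equality query over a four-column "quad" table, planned
# without and then with a compound index.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tuples (ip TEXT, helo TEXT, mail_from TEXT, rcpt_to TEXT)"
)

query = ("SELECT * FROM tuples WHERE ip = ? AND helo = ? "
         "AND mail_from = ? AND rcpt_to = ?")
args = ("1.2.3.4", "example.com", "a@b.com", "c@d.com")

# No index yet: the planner's only option is a full table scan.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query, args).fetchall()

conn.execute(
    "CREATE INDEX idx_quad ON tuples (ip, helo, mail_from, rcpt_to)"
)

# Same query again, now answered by an index search instead of a scan.
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query, args).fetchall()

print(plan_before[-1][-1])  # plan detail mentioning a scan
print(plan_after[-1][-1])   # plan detail mentioning idx_quad
```

Running EXPLAIN before and after a schema change like this is exactly the kind of check that would have caught the table scans in the story.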

> But databases are just magic.


I hate how developers talk about a "database" as a monolithic concept. It's an abstract concept with countless implementations built off of competing philosophies of that abstract concept. SQL is only slightly more concrete, but there's as many variants and special quirks of SQL dialects out there as databases.

Database indexes are definitely one of the most approachable concepts of databases for computer science majors. If you have studied data structures and Big O notation you have all the concepts needed.

While I agree that how a query planner works is one of the most ‘magic’ aspects, I think the output from the query planner in most databases is very approachable as well to regular common programmers and will get you quite far in solving performance issues. We know what full scans are (searching through every element of an array), etc.

The challenge is usually discovering that the jargon used in your database really maps to something you do already have a concept about and then reading the database documentation…

This was great. Clever twist at the end.

I cringed a little because some of those mistakes look like stuff I would do even now. I have a ton of front/back end experience in a huge variety of languages and platforms, but I am NOT a database engineer. That’s a specific skillset I never picked up nor, admittedly, am I passionate about.

Sorry to go off on a tangent, but it also brings to mind that even so-called experts make mistakes. I watch a lot of live concert footage from famous bands from the 60s to the 90s, and as a musician myself, I spot a lot of mistakes even among the most renowned artists…with the sole exception of Rush. As far as I can tell, that band was flawless and miraculously sounded better live than on record!

Assuming this needs to be optimized for massive scale, just hash the values to a single indexed field.

And use something other than an RDBMS. Put the hash in Redis and expire the key; your code simply does an existence check for the hash. You could probably handle gmail with a big enough cluster.
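
The "hash the quad, expire the key" idea can be sketched with an in-memory stand-in for Redis (a dict of hash -> expiry time; the field names and the one-hour TTL are assumptions for illustration):

```python
import hashlib
import time

TTL_SECONDS = 3600  # how long a quad stays "seen"
seen = {}           # hash -> expiry timestamp, standing in for Redis SET ... EX

def quad_hash(ip, helo, mail_from, rcpt_to):
    # Collapse the four values into a single fixed-size key.
    raw = "\x00".join((ip, helo, mail_from, rcpt_to)).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

def check_and_record(ip, helo, mail_from, rcpt_to, now=None):
    """Return True if this quad was seen recently, recording it either way."""
    now = time.time() if now is None else now
    key = quad_hash(ip, helo, mail_from, rcpt_to)
    hit = seen.get(key, 0) > now
    seen[key] = now + TTL_SECONDS  # refresh the expiry on every sighting
    return hit
```

With Redis itself this is a single SETEX/EXISTS pair per incoming message; no table, no index, no scan.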

That super-normalized schema looks terrible.

I too recommend using Redis in 2002.

Fair enough, but I think the issue here is recognizing that the underlying business operation is a hit test on an expiring hash key. You could have used MySQL as a key/value store or looked for something more specialized.

“Hash it and store in BerkeleyDB” would have been a ridiculously easy solution at the time.

++ correct answer here

I think the 'terrible schema' thing is a secondary issue. The important take away for me was this:

> Considering that we as an industry tend to chase off anyone who makes it to the age of 35, is it any surprise that we have a giant flock of people roaming around trying anything that'll work?

Maybe I’ll change my mind in 5 years, but I have a hard time believing engineers over the age of 35 get chased off. What actually seems to be the case is that the field skews young because it’s rapidly growing, and older engineers choose to retire early because they can afford to.

Which results in the same problems but the root cause is different.

Unfortunately it is the penultimate sentence in a sizable post.

<pushes glasses up nose> Actually, the Correct Answer is a bloom filter.

(And, yes, we had math in the early 2000s.)

Snark aside, I'm frustrated for the author. Her completely-reasonable schema wasn't "terrible" (even in archaic MySQL)—it just needed an index.

There's always more than one way to do something. It's a folly of the less experienced to think that there's only One Correct Way, and it discourages teammates when there's a threat of labeling a solution as "terrible."

Not to say there aren't infinite terrible approaches to any given problem: but the way you guide someone to detect why a given solution may not be optimal, and how you iterate to something better, is how you grow your team.
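
For the curious, a minimal Bloom filter really is only a few lines (parameters here are illustrative, not tuned for this workload; note it can't expire entries, which the greylisting use case actually needs):

```python
import hashlib

class BloomFilter:
    # m bits, k hash positions per item, derived from sha256 with
    # different salts. False positives possible; false negatives not.
    def __init__(self, m=1 << 16, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

The whole filter for this problem would fit comfortably in memory on 2002 hardware, which is rather the point.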

To be fair, the “correct way” to do databases at that time was to use third normal form. Putting indices on string columns would have been considered a hack, much like using MySQL was.

3NF is largely orthogonal to indices on string columns, though. If you need to look something up by string fast, there's no avoiding an index of strings somewhere.

Plenty of databases at the time had pretty lame string indexing because it simply wasn't something people relied on if they wanted performance, and memory sayeth mysql's were both not particularly efficient and had some nasty technical limitations.

On the mysql she was using, breaking things out so it only needed to index ints was almost certainly a much better idea. On anything I'd be deploying to today, I'd start by throwing a compound index at the thing with the expectation that'd probably get me to Good Enough.

When has ever been putting indices on string columns "a hack"? If you're looking for that string, which seems to have been the case here in particular, you need that index.

We had a test database that contained 1 record. Nobody paid much attention since the focus was on the problem, not the database.

The database included several newly developed "stored procedures".

Time elapsed... and it was nearing the time to ship the code. So we tried to populate the database. But we could not. It turned out that the stored procedures would only allow a single record in the database.

Since a portion of the "business logic" depended on the stored procedures... well, things got "delayed" for quite a while... and we ended up having a major re-design of the back end.

Fun times.

Why do so many of us “forget” to populate the database with representative quantities of data until after we’ve baked in all of our poor design decisions?

Oops - missed that one. I guess there was more appetite for discussion.

> The rub is that instead of just being slow, it also cost a fair amount of money because this crazy vendor system charged by the row or somesuch. So, by scanning the whole table, they touched all of those rows, and oh hey, massive amounts of money just set ablaze!

Why _the hell_ is nobody mentioning that using a database that charges per row touched is absolute insanity? When has it become so normal that nobody mentions it?

You'll probably enjoy reading this article that was on the front page yesterday: https://briananglin.me/posts/spending-5k-to-learn-how-databa...

Thank you. I cannot say I enjoyed it, I am distressed now. Using a “planet scale” “serverless” database to store just 750k download records per day, and paying per DB row touched, and “thank you for letting us off the hook this time”. This reads like satire.

Nice. I was thinking of an earlier post involving cloud storage buckets with no index.

If the amount charged isn't proportional to the work done, the cloud provider would quickly go out of business.

My VPS provider gives me unlimited traffic, unlimited disk reads, and unlimited DB rows touched for fixed 30 dollars a month, regardless of how much CPU and RAM I keep loaded.

This makes me think that these things are at least 100x cheaper than AWS might want to make me believe.

On a single instance. The more instances you use, the more you will be charged. This service handles automatic scalability and resilience with multiple instances for you, allowing it to charge less than $30.

Sorry, no. The original schema was correct, and the new one is a mistake.

The reason is that the new schema adds a great deal of needless complexity, requires the overhead of foreign keys, and makes it a hassle to change things later.

It's better to stick with the original design and add a unique index with key prefix compression, which all major databases support these days. This means that the leading values get compressed out, and the resulting index will be no larger and no slower than the one with foreign keys.

If you include all of the keys in the index, then it will be a covering index and all queries will hit the index only, and not the heap table.
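
The covering-index part is easy to demonstrate (prefix compression is a storage detail the planner won't show; SQLite here as a stand-in, with made-up table and column names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tuples (ip TEXT, helo TEXT, mail_from TEXT, "
             "rcpt_to TEXT, first_seen INTEGER)")
# One unique index over all four key columns.
conn.execute(
    "CREATE UNIQUE INDEX idx_quad ON tuples (ip, helo, mail_from, rcpt_to)"
)

# The query reads only columns that are in the index, so the planner can
# answer it from the index alone, never touching the heap table.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT ip FROM tuples "
    "WHERE ip = ? AND helo = ? AND mail_from = ? AND rcpt_to = ?",
    ("1.2.3.4", "example.com", "a@b.com", "c@d.com"),
).fetchall()

print(plan[-1][-1])  # plan detail reporting a covering index
```

Whether 2002-era MySQL would have executed this as gracefully is a separate question, as others note below.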

One thing that worth taking into consideration is that this happened in 2002. When the databases were not in cloud, the ops was done by dba’s and key prefix compression thats omnipresent today was likely not that common or potentially not even implemented/available.

But I don’t think the point of the post is what’s the right/wrong way of doing it. The point, as mentioned by a few here, is that programmers make mistakes. Mistakes are costly, and will keep being costly if we in the tech industry continue to boot experienced engineers… the tacit knowledge those engineers have gained will not be passed on, and that means more people have to figure things out by themselves.

>One thing that worth taking into consideration is that this happened in 2002. When the databases were not in cloud, the ops was done by dba’s

Are you implying that isn't the case today? Thousands of (big) companies are still like that and will continue to be like that.

I write my own SQL, design tables and stuff, submit it for a review by someone 10x more qualified than myself, and at the end of the day I'll get a message back from a DBA saying "do this, it's better".

If you have one, more power to you. I'm currently making a good living as a freelance DBA/SRE for startups. With a team of 3-5 devs, some of which frontend and/or mobile devs, proper database knowledge is definitely thin on the ground in some places.

The question of it being in the cloud or not is highly orthogonal to schema design.

In 2002, you started seeing major increases in query run time with as little as 5 joins. Once you hit six you had to start thinking about rearchitecting your data or living with slow response times.

There was a lot of pressure to relax 3NF as being too academic and not practical.

Around then, I had a friend who was using a pattern of varchar primary keys so that queries that just needed the (unique) name and not the metadata could skip the join. We all acted like he was engaging in the Dark Arts.

> When the databases were not in cloud, the ops was done by dba’s and key prefix compression thats omnipresent today was likely not that common or potentially not even implemented/available.

To my understanding, for example Firebird/Interbase had automatic key prefix compression as far back as early 2000s at the very least. I don't believe you could even turn it off.

You don't really need compression. Rows only need to persist for about an hour. The table can't be more than a few MiB.

We can debate the Correct Implementation all day long. The fact of the matter is that adding any index to the original table, even the wrong index, would lead to a massive speedup. We can debate 2x or 5x speedups from compression or from choosing a different schema or a different index, but we get 10,000x from adding any index at all.

Just to make this fun.

Adding an index now increases the insert operation cost/time and adds additional storage.

If insert speed/volume is more important than reads keep the indexes away. Replicate and create an index on that copy.

> Adding an index now increases the insert operation cost/time

that insert is happening after you've checked the table to see if the record is present. so two operations whose times we care about are "select and accept email" and "select, tell the sender to come back in 20 minutes, and then insert".

the insert time effectively doesn't matter, unless you've decided to abandon discussion of the original table entirely without mentioning it.

I agree. In the "new" FK-based approach, you'll still need to scan indexes to match the IP and email addresses (now in their own tables) to find the FKs, then do one more scan to match the FK values. I would think this would be significantly slower than a single compound index scan, assuming index scans are O(log(n))

Key thing is to use EXPLAIN and benchmark whatever you do. Then the right path will reveal itself...

And even if this is somehow not an ideal use of the database, it's certainly far from "terrible", and to conclude that the programmer who designed it was "clueless" is insulting to that programmer. Even if that programmer was you, years ago (as in the blog post).

Moreover, unless you can prove with experimental data that the 3rd-normal-form version of the database performs significantly better or solves some other business problem, then I would argue that refactoring it is strictly worse.

There are good reasons not to use email addresses as primary or foreign keys, but those reasons are conceptual ("business logic") and not technical.

She doesn’t mention the write characteristics of the system but she implies that it was pretty write heavy.

In that case it’s not obvious to me that putting a key prefix index on every column is the correct thing to do, because that will get toilsome very quick in high write loads.

Given that she herself wrote the before and after systems 20 years ago and that the story was more about everyone having dumb mistakes when they are inexperienced perhaps we should assume the best about her second design?

It's not an index on every column. It's a single index across all columns.

Which will be updated every time there is a new combination.

Which is OK. It's not like the whole index has to be rebuilt. It's cheaper than O(log N).

I think it depends a lot on the data. If those values like IP, From, To, etc keep repeating, you save a lot of space by normalizing it as she did.

But strictly from a performance aspect, I agree it's a wash if both were done correctly.

Space is cheap now tho. Better to duplicate some data and avoid a bunch of joins than to worry about saving a few gb of space.

It’s not that easy: you need to consider the total size and cardinality of the fields potentially being denormalized, too. If, say, the JOINed values fit in memory and, especially, if the raw value is much larger than the key it might be the case that you’re incurring a table scan to avoid something which stays in memory or allows the query to be satisfied from a modest sized index. I/O isn’t as cheap if you’re using a SAN or if you have many concurrent queries.

This system was written in 2002.

Using space to avoid joins will not necessarily improve performance in an RDBMS -- it might even make it worse.

The inverse of your statement is also true. Denormalizing the database to avoid duplicating data will not necessarily improve performance in an RDBMS - it might even make it worse.

> all major databases do these days

Did everyone on HN miss that the database in question was whichever version of MySQL existed in 2002?

Which I was using in production at the time, and, yeah, what she ended up doing was a much better plan for that thing.

It actually doesn't matter. Even simple indexes would have made that schema work just fine, and MySQL could certainly index strings in 2002.

> Sorry, no. The original schema was correct, and the new one is a mistake.

Well, you also save a space by doing this (though presumably you only need to index the emails as IPs are already 128 bits).

But other than that, I'm also not sure why the original schema was bad.

If you were to build individual indexes on all four columns, you would essentially build the four id tables implicitly.

You can calculate the intersection of hits on all four indexes that match your query to get your result. This is linear in the number of hits across all four indexes in the worst case, but if you are careful about which index you look at first, you will probably be a lot more efficient, e.g. (from, ip, to, helo).

Even with a multi index on the new schema, how do you search faster than this?


> compute is cheap

There was a whole thread yesterday about how a dude found out that it isn't: https://briananglin.me/posts/spending-5k-to-learn-how-databa... (also mentioned in the RbtB post)

Depends on the provider. I've checked the pricing of 4 top relational database cloud vendors (Google, Amazon, Azure, IBM) and they all charge for the number of CPU cores and RAM you buy, not by the number of rows scanned like PlanetScale did in the post you refer to. They'll be more expensive than buying your own server but not nearly what PlanetScale charges for full table scan queries.

When do the rest of the folks figure it out?

To a degree, this is true though. An engineer's salary is a huge expense for a startup, it's straight up cheaper to spend more on cloud and have the engineer work on the product itself. Once you're bigger, you can optimize, of course.


If this is a case from real life that you have seen, you should definitely elaborate more on it. Otherwise, you are creating an extreme example that has no significance in the discussion, as I could create similarly ridiculous examples for the other side.

There are always costs and benefits to decisions. It seems that are you only looking at the costs and none of the benefits?


If you're suggesting that we should double the size of the hardware rather than add a single index, then I respectfully disagree.

But your point is well-taken. Hardware is cheap.

> The observation was that we could probably store the IP address, HELO string, FROM address and TO address in a table, and send back a 4xx "temporary failure" error the first time we saw that particular tuple (or "quad"). A real mail server which did SMTP properly would retry at some point, typically 15 minutes to an hour later. If it did retry and enough time had elapsed, we would allow it through.

I've run into this form of greylisting, it's quite annoying. My service sends one-time login links and authorization codes that expire in 15 minutes. If the email gets delayed, the user can just try again, right? Except I'm using AWS SES, so the next email may very well come from a different address and will get delayed again.
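
The greylisting rule quoted above boils down to a small state machine per quad (the window and expiry values here are illustrative; real implementations vary):

```python
RETRY_AFTER = 15 * 60     # sender must wait at least this long to pass
EXPIRE_AFTER = 24 * 3600  # forget quads that never came back

first_seen = {}  # quad tuple -> timestamp of first attempt

def smtp_answer(quad, now):
    t = first_seen.get(quad)
    if t is None or now - t > EXPIRE_AFTER:
        first_seen[quad] = now
        return "450 temporary failure, try again later"  # first sighting
    if now - t >= RETRY_AFTER:
        return "250 ok"  # a real mail server retried after the window
    return "450 temporary failure, try again later"  # retried too soon
```

The SES problem above is visible right in the `quad` key: if the retry arrives from a different source IP, it's a brand-new quad and gets 450'd all over again.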

You could have your codes expire after an hour or day instead?

That's a solution, but with tradeoffs:

* security, an attacker has more time to intercept and use the links and codes.

* UX, making the user wait 15+ minutes to do certain actions is quite terrible.

I've had a couple support requests about this. The common theme seems to be the customer is using Mimecast, and the fix is to add my sender address in a whitelist somewhere in their Mimecast configuration.

btw. ses now offers unique sending ips

I'd really love to be snarky here but I'll try to be polite: all those comments about the example situation are missing the whole point of the post. And it really worries me that a good chunk of tech workers just ignore the real meaning of something and nitpick about stupid implementation details. The post is about managing rookie errors, being empathetic, and also warning about the ageism that pervades the sector. TBH on this last point IDK the situation nowadays in Silicon Valley, but in Europe my limited experience is that ageism is not that bad; it's actually difficult to find seasoned developers.

Edit: typos

I don't buy it. If the only point was managing rookie errors, etc., the blog post shouldn't have been ~15 paragraphs of technical discussion and 2 paragraphs at the bottom of "my point is ..." You can't advocate that people just ignore 80% of the article because they're "not the point of the post."

I'm sure a good chunk of tech workers are worried that when an influential blogger writes something like this a couple hundred developers will Google 3NF and start opening PRs.

It's a narrative. A story. She told it in that way to lure people who get hooked on technical discussion, and then make the point that they probably would have missed or ignored. Without a narrative example, she could have just said "I made mistakes too, live and learn" but she chose to provide an example before presenting the thesis.

And hey, what's so bad about people learning about 3NF? Are you not supposed to know what that is until you're some mythical ninth level DBA?

Sure, I don't mind that the hook, and controversial posts often create the most educational discussion. I only take exception to the idea that we shouldn't debate it because it's "not the point." (specifically for technical advice; being over-pedantic is a thing).

I certainly don't mean to criticize the author or say that they shouldn't have posted the article. If articles needed to be 100% correct then no-one has any business writing them. And it's still open to debate whether the article is even wrong or misleading.

> Are you not supposed to know what that is until you're some mythical ninth level DBA?

I think the world might be a better place if people _did_ wait until they were ninth level DBAs :p

Jokes aside, I have no problem with people learning 3NF. I'm more concerned that people can be too quick to take this sort of rule-of-thumb advice to heart (myself included). And in my opinion, database optimization is too complex to fit well into simple rules like "normalize when your query looks like this." My only 'rules' for high-frequency/critical tables and queries is "test, profile, load test, and take nothing for granted." But I know just enough to know that I don't know enough about SQL databases to intuit performance.

If she told it to lure people who get hooked on technical discussion, then it’s not fair for you to complain about said people analyzing said technical discussion.

But if she told it to show people who get hooked on technical discussion that this often makes them miss a more important point, then they certainly proved her right.

The problem with her point is that it doesn't go anywhere. Mentorship is obviously good and ageism is obviously bad, and she doesn't have specific suggestions on how to solve any of the problems, so the hook is the most substantial thing to talk about.

The reason "the details are important" here are not because of the nitty gritty around what mistakes a "novice" programmer made.

They are important because the present incarnation of the author is making all the wrong diagnoses about the problems with the original implementation, despite doing it with an air of "Yes, younger me was so naive and inexperienced, and present me is savvy and wise".

She explained the problem, the first not working solution and the second working solution as they really happened in 2002 as an example. The real point is the last part of the post. And it was not "how to properly implement an sql based filter for open proxies in your MTA".

I get all that. But 2020 version of this person still does not understand the problem, and she is arguing that she does while oddly self-deprecating the inexperienced version of herself, who arguably had a better solution to begin with.

Yes, I think an alternate design was found that didn't hit as many MySQL limitations as the previous one. This improved performance and was a win. But the post-mortem diagnosis was lacking.

I think she inadvertently made a different point, which is that even experienced developers sometimes misunderstand the problem and make mistakes.

Or an even better argument: you don't need to actually understand the problem to fix it, often you accidentally fix the problem just by using a different approach.

I would argue instead that this comment thread is making the point that people forget that things that work now wouldn't've worked then and design decisions have to be made based on the database engine you're running in production.

MySQL could index strings in 2002. It would have worked fine.

"mysql could index strings" and "using a compound index over four varchar columns would've worked out well" are significantly different propositions.

Yes, we won't know because she didn't try it. And we know she didn't try it because the problem she described is table scans, not "indexed strings are somehow slow in MySQL circa 2002".

Or she did try it and it didn't work - or researched the question and figured, quite reasonably based on my experience of mysql in that era, it probably wouldn't work - and kept the post to only the original problem and the final approach to avoid having even more technical details in a post that wasn't really about those.

I agree that we don't know, but it seems a little unfair to her to treat every unknown as definitely being the most negative of the possibilities.

I really don't understand your general thrust here. MySql certainly had lots of issues in 2003, but being able to support multi-column indexes was not one of them. Her analysis is simply wrong - it is wrong now and was wrong then. Here is the doc from MySql v3.23:

7.4.2 Multiple-Column Indexes

MySQL can create composite indexes (that is, indexes on multiple columns). An index may consist of up to 16 columns. For certain data types, you can index a prefix of the column (see Section 7.4.1, “Column Indexes”). A multiple-column index can be considered a sorted array containing values that are created by concatenating the values of the indexed columns. MySQL uses multiple-column indexes in such a way that queries are fast when you specify a known quantity for the first column of the index in a WHERE clause, even if you do not specify values for the other columns.

As I said already: "mysql could index strings" and "using a compound index over four varchar columns would've worked out well" are significantly different propositions.

To be more verbose about it - there is an important difference between "can be created" and "will perform sufficiently well on whatever (likely scavenged) hardware was assigned to the internal IT system in question."

I wouldn't be surprised if the "server" for this system was something like a repurposed Pentium 233 desktop with a cheap spinning rust IDE drive in it, and depending on just how badly the spammers were kicking the shit out of the mail system in question that's going to be a fun time.

A composite index is literally just concatenating the fields and then indexing that value. This is not technology that was out of reach of MySQL in 2002 and there is no reason to presume it was so when TFA clearly described the problem as a complete lack of indexes.

They are not significantly different. It just means sorting using more than one column.
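
The mental model is just "sort the rows by the key columns as tuples, then binary-search." A sketch with a hypothetical two-column (ip, helo) index:

```python
import bisect

# The index is the key columns of each row, kept in sorted tuple order.
rows = [("9.9.9.9", "mx.spam.example"),
        ("1.2.3.4", "helo-b"),
        ("1.2.3.4", "helo-a")]
index = sorted(rows)  # ordered by ip first, then helo

def contains(key):
    # Point lookup on the full composite key: one binary search.
    i = bisect.bisect_left(index, key)
    return i < len(index) and index[i] == key

def with_ip(ip):
    # Prefix lookup on the leading column only; "\uffff" is a crude
    # sentinel assumed to sort after any real HELO string. This is why
    # column order in a composite index matters.
    lo = bisect.bisect_left(index, (ip,))
    hi = bisect.bisect_right(index, (ip, "\uffff"))
    return index[lo:hi]
```

A query that only constrains `helo` gets no help from this ordering, which is the one real gotcha of composite indexes.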

You know that, for performance, the new table would've needed a string index as well?

A single-column string index is a different beast to implement compared to a multi-column varchar index.

I’m happy the hn comments are nitpicking, I’m not particularly well versed in database schemas but while reading it I was going “???????” and it’s good to know I’m not going crazy

Yeah I was confused because the new design looked like an obvious anti-pattern to me, in contrast to adding an index to the original table. And then I was wondering whether I had a completely false understanding of what database normalization means (especially since this is the first time I’d heard of 3NF).

Just because the author was intending to get across a certain point, doesn't mean the implementation details aren't worth discussing too. I don't think many people here would disagree with the central point, so what's there to discuss about it? I think it's uncharitable to assume that everyone talking about the implementation details is "missing the point".

There's probably also some sense of people trying to paint the lesson of "sometimes rookie errors aren't actually errors, and your seniors will tell you you're doing something wrong, when in actuality their solution is worse than what you came up with."

Which I guess is saying the article's setup story makes the exact opposite point that it's trying to?

As a 48 year old who lives in San Diego (so not Silicon Valley but geographically much closer than Europe) and still works as a programmer and who still enjoys coding and continually fights any attempt to put me into any sort of management role...

I think ageism is a valid concern but its also not a black and white issue because while I know quite a few older folks like me that are still constantly keeping up to date I've also known some people my age or just a little bit older who basically stagnated into irrelevance as experts in now discarded technology who never really moved on, so the line between what is ageism and what is people just not bringing value to the table anymore can be a bit blurry.

I'm by no means trying to suggest ageism isn't an issue in tech, I do see echoes of it when I look around and if I suddenly lost the network of people I work with, have worked with in the past, etc, who understand I'm still very technically 'flexible' for an older person I'm pretty sure it would make it difficult to find a job in the practical sense of getting pre-filtered out often just based on the age thing. So my heart really goes out to people who do find themselves in situations where they are capable but can't find work due to age primarily.

But, uh, the issue is somewhat complicated and very case by case, I think.

The real meaning being that people learn and gain experience over time? Is this really something we need to read two pages of text to find out? I think people are justifiably miffed at having read two pages for such a trivial message. Not only that, but these "stupid implementation details" could seriously mislead someone who really is a clueless newbie.

I think you are leaping to conclusions here. It is possible that most people want to keep HN focused on technical conversations and use this as an opportunity to learn something (I certainly do), which is why you are seeing many more comments on the technical aspects.

In this blog post, senior engineer Rachel talks nonsense about normalization and promotes a bafflingly complicated solution as superior to a simple one, without identifying the actual fix, which appears to have happened accidentally.

In other words, with enough empathy and patience, a clueless rookie can grow into a clueless senior engineer!

Rachel usually makes more sense than that. That's why people are nitpicking implementation details.

Hah, pretty accurate and kinda funny, but could be nicer. I still make mistakes.

Meanwhile, Rachel posted an update. Apparently, the whole section about how "the system got reworked" doesn't describe the proper solution, but her own (Rookie Rachel's) attempt to fix it.

And in that context, everything makes sense.
