This was the first time regular people could go buy tickets for events & they had been lining up overnight at Bank of China locations throughout the country. We were down for over a day before we called it off. Apparently this led to minor upheaval at several locations in Beijing & riot police were called in.
We were pretty puzzled as we had an index and had load tested extensively. We had Oracle support working directly with us & couldn't figure out why queries had started to become table scans.
The culprit? A point upgrade to DBD::Oracle (something like X.Y.3 to X.Y.4) introduced a subtle bug in character-set handling. The index required using a particular Unicode character set, and we were specifying it, but when it was translated into the actual query, it wasn't exactly the right one, so the DB assumed it couldn't use the index. Then, when all the banks opened & a large portion of a very populous country tried to buy tickets at the same time, things just melted.
Not a fun day.
"More experienced programmers know very well that the bug is generally in their code: occasionally in third-party libraries; very rarely in system libraries; exceedingly rarely in the compiler; and never in the processor."
If it causes an issue with our own systems it also gives me a very good justification for doing it.
And then many other times I honestly don't have the knowledge to be able to contribute to large third-party systems, and it seems very rare to get an issue resolved without submitting a PR.
And lastly sometimes it's hard to judge if unexpected behavior is really a 'bug'. Similar to the last point, it can honestly be hard to tell the difference sometimes.
Given that you mention China in your story, did GB 18030 have anything to do with your problems?
Writing the pre-ANSI-JOIN-support SQL dialect code was pretty good fun though.
Near as I can tell, we assume there is some bit of magic built into the foreign key concept that handles this for us, but that is not the case.
But it seems like I am missing your point: care to expand on it so I can learn about the gotcha as well? What are you attempting to do, and what's making the performance slow?
Or in the case of Silicon Valley, leave everything intact with a disabled flag and then spam you for the next decade or so.
Which is more or less what the GP says as well, deletes are not implemented because doing it properly requires proper design.
Oracle was called in, "en masse". Keep also in mind that Oracle Portal was basically running on top of lots of dedicated tables and stored procedures, and you could not access the data directly; you interacted with the contents only through views, aliases, etc.
It was indeed due to a smallish version bump.
But we had a pretty good Team Leader who had made Oracle declare that the two different versions were absolutely compatible, allowing us to write code on the older version.
He made them state this in writing before development started, so it was later impossible to blame the developers for the problem.
Load tests usually only happen on the staging env, not production; that is the policy.
It may be that the second solution wasn’t the best, or that it was better for write performance or memory usage—this was 2002 after all. It may also be the case that the cardinality of some columns was low in such a way that the four-table solution was better, but maybe getting the columns in the right order for a multi-column index could do that too.
Either way, I don’t think it really matters that much and doesn’t affect the point of the article.
Also remember indexes take space, and if you index all your text columns you'll balloon your DB size; and this was 2002, when that mattered a lot more even for text. Indexes also add write and compute burden for inserts/updates as now the DB engine has to compute and insert new index entries in addition to the row itself.
Finally, normalizing your data while using the file-per-table setting (also not the default back then) can additionally provide better write & mixed throughput due to the writes being spread across multiple file descriptors and not fighting over as many shared locks. (The locking semantics for InnoDB have also improved massively in the last 19 years.)
* Assuming the author used InnoDB and not MyISAM. The latter is a garbage engine that doesn't even provide basic crash/reboot durability, and was the default for years; using MyISAM was considered the #1 newbie MySQL mistake back then, and it happened all the time.
Indexes didn’t go away with these extra tables; they live in the ‘id’ field of each one. They also probably had a UNIQUE constraint on the ‘value’ field, spending time on what you describe in the second half of my citation.
I mean, that should have saved some space for non-unique strings, but all the other machinery is still there. And there are 4 extra unique constraints (also an index) in addition to 4 primary keys. Unless these strings are long, the space savings may turn out to be pretty marginal.
In general, indexes alone can't save you from normalization problems. From certain points of view an index is itself another form of denormalization, and while it is a very handy form of denormalization, heavily relying on big composite indexes (and the "natural keys" they represent) makes normalization worse, and you have to keep such trade-offs in mind just as you would any other normalization planning.
It is theoretically possible a database could do something much smarter and semi-normalized with, for instance, Trie-based "search" indexes, but specifically in 2002, so far as I'm aware, in most of the databases at the time such indexes were never the default, didn't have great multi-column support, and would have been expensive to compute if you did turn them on. Even such "smart" indexes would likely still have suggested you normalize first.
Indexes take space, except for the clustered index.
An important distinction.
If the point of this table is always selecting based on IP, m_from, m_to, then clustering on those columns in that order would make sense.
Of course, if it's expected that a lot of other query patterns will exist then it might make sense to only cluster on one of those columns and build indexes on the others.
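One concrete way to see that clustering choice is SQLite's WITHOUT ROWID tables, which store rows physically ordered by the primary key. A minimal sketch (the ip/m_from/m_to column names come from the comment above; the table name and everything else are assumptions for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# A WITHOUT ROWID table is physically ordered by its primary key --
# effectively a clustered index on (ip, m_from, m_to).
con.execute("""
    CREATE TABLE greylist (
        ip     TEXT,
        m_from TEXT,
        m_to   TEXT,
        seen   INTEGER,
        PRIMARY KEY (ip, m_from, m_to)
    ) WITHOUT ROWID
""")
plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT seen FROM greylist WHERE ip = ? AND m_from = ? AND m_to = ?",
    ("192.0.2.1", "a@example.com", "b@example.com"),
).fetchall()
# The lookup is satisfied directly from the clustered primary key,
# with no separate secondary-index probe.
print(plan)
```

If most queries instead filtered on only one of the other columns, this layout would force scans, which is the trade-off the comment describes.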
The author mentions that these events took place in 2002. At that time the site was likely still running MySQL 3, and InnoDB was very new and considered experimental. Sure, MySQL 4.0 had shipped in 2002 and InnoDB was a first-class citizen, but back then upgrades from one version of MySQL to another weren't trivial tasks. You also tended to wait until the .1 release before making that leap.
So in fairness for the folks who originally set up that database MyISAM was likely their only realistic option.
I looked after and managed a fleet of MySQL servers from back then and for many years afterwards and even then it wasn't until MySQL 5 that we fully put our trust in InnoDB.
However, normalizing the fields that contain repetitive data into separate tables could create significant space savings, since the full text for each column would not need to be stored for each row. Instead of 20-30 bytes per email address, a 4-byte (assuming 32-bit era) OID is stored in its place.
It's pretty easy to imagine how quickly the savings would add up, and that it would be very helpful in the era before SSDs or even 1TB HDDs existed.
Also, is the helo field even needed?
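A rough back-of-the-envelope sketch of those savings (all figures below are illustrative assumptions, not numbers from the article):

```python
# Swapping repeated address strings for 4-byte surrogate ids:
rows = 10_000_000          # assumed log rows in the main table
avg_addr = 25              # assumed average email address length in bytes
unique_addrs = 50_000      # assumed distinct addresses
id_size = 4                # 32-bit surrogate key

denormalized = rows * avg_addr
# ids in the main table, plus one stored copy of each unique address
normalized = rows * id_size + unique_addrs * avg_addr

print(denormalized // 2**20, "MiB vs", normalized // 2**20, "MiB")
```

Even with these made-up numbers the per-column savings are several-fold, which mattered a lot on early-2000s disks.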
Even better, use a database with a proper inet datatype. That way you get correctness, space efficiency, and ability to intelligently index.
In 2021 I’d recommend just turning off the IPv6 allocation or deleting it from DNS like this site does.
This is the funniest IPv6 excuse I've heard on HN.
> Now, what do you suppose happened to that clueless programmer who didn't know anything about foreign key relationships?
> Well, that's easy. She just wrote this post for you. That's right, I was that clueless newbie who came up with a completely ridiculous abuse of a SQL database that was slow, bloated, and obviously wrong at a glance to anyone who had a clue.
> My point is: EVERYONE goes through this, particularly if operating in a vacuum with no mentorship, guidance, or reference points. Considering that we as an industry tend to chase off anyone who makes it to the age of 35, is it any surprise that we have a giant flock of people roaming around trying anything that'll work?
I don't know the actual numbers, but it's been pointed out that at any given time something like half of all programmers have been doing it less than five years, for decades now.
That, plus the strident ignorance of past art and practice, seem to me to bring on a lot of issues.
They then went on to list the errors that this brand new entry level employee had made when writing ... an authentication system ...
I was more than a little shocked when I realized they were serious and hadn't realized the issue was sending the new entry level guy to do that job alone.
I've always seen authn. Where'd you pick this usage up?
For my part, I have not encountered "authn", "authz", or "authk" before. But prior discussion did not even mention the word "authorization", and that seemed worth bringing out in the open.
So, not nonresponsive, just responsive to things you weren't personally interested in.
The way I see it is that it was a case of something that wasn't working, then replaced with something that was working.
I don't know what your metrics about "good" or "bad" are but eventually they got a solution that covered their use cases and the solution was good enough for stakeholders and that is "good".
that is an excellent term.
Maybe a bit of both.
I could have jumped on it and played it but I figured just another toy craze, it'll be over before long.
With yes the added confounding factors of how often our industry seems to prefer to hire youth over experience, and subsequently how often experience "drains" to other fields/management roles/"dark matter".
The narrative should lead to the conclusion. If I told you the story of the tortoise and the hare in which the hare gets shot by a hunter, then said the moral is "perseverance wins", you'd be rightfully confused.
The thing is, the first implementation was a perfectly fine "straight line" approach to solve the problem at hand. One table, a few columns, computers are pretty fast at searching for stuff... why not? In many scenarios, one would never see a problem with that schema.
Unfortunately, "operating in a vacuum with no mentorship, guidance, or reference points", is normal for many folks. She's talking about the trenches of SV, it's even worse outside of that where the only help you might get is smug smackdowns on stackoverflow (or worse, the DBA stackexchange) for daring to ask about such a "basic problem".
I've experienced this just by switching languages. C# had many articles dedicated to how much better StringBuilder was compared to String.Concat and yet, other languages would do the right thing by default. I would give advice in a totally different language about a problem that the target language did not have.
As the song goes:
"Be careful whose advice you buy but be patient with those who supply it
Advice is a form of nostalgia, dispensing it is a way of fishing the past
From the disposal, wiping it off, painting over the ugly parts
And recycling it for more than it's worth"
Do you have a specific example that would help in the case described in Rachel's blog post?
I am still making (and finding) occasional indexing and 3NF mistakes. In my experience, it is always humans finding and fixing these issues.
Some languages have mutable strings, but you would still need to allocate a sufficiently larger buffer if you want to add strings in a loop.
Rust's ownership system ensures that there cannot exist mutable and immutable references to the same object at the same time. So if you have a mutable reference to a string it is ok to modify in place.
I have not confirmed either the Python example or the Rust example too deeply, since I'm replying to a 5-day-old comment. But the general principle holds. GC languages like C# and Java don't have easy ways to check ownership, so they always copy strings when mutating them. Maybe I'll write a blog post on this in the future.
Using a StringBuilder, the strings are only copied once when copied into the buffer. If the buffer needs to grow, the whole buffer is copied into a larger buffer, but if the buffer grows by, say, doubling in size, then this cost is amortized, so each string on average is still only copied twice at worst.
Faster yet is concatenating an array of strings in a single concat operation. This avoids the buffer-resize issue, since the necessary buffer size is known up front. But this leads to the subtle issue where a single concat is faster than using a StringBuilder, while the cost of repeated appends grows quadratically for concat but only linearly for StringBuilder.
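A toy model of that amortization argument, counting bytes copied (a Python sketch for illustration, not C#'s actual StringBuilder internals):

```python
# Growable buffer with capacity doubling vs. naive repeated concat.
def build_with_doubling(chunks):
    buf = bytearray(16)
    length = 0
    copied = 0
    for c in chunks:
        data = c.encode()
        while length + len(data) > len(buf):
            buf = bytearray(buf) + bytearray(len(buf))  # grow 2x
            copied += length                            # old contents recopied
        buf[length:length + len(data)] = data
        length += len(data)
        copied += len(data)
    return bytes(buf[:length]).decode(), copied

def build_with_concat(chunks):
    s = ""
    copied = 0
    for c in chunks:
        copied += len(s) + len(c)   # each + copies everything so far
        s = s + c
    return s, copied

chunks = ["x" * 10] * 1000
_, doubling_cost = build_with_doubling(chunks)
_, concat_cost = build_with_concat(chunks)
print(doubling_cost, concat_cost)  # amortized-linear vs. quadratic
```

For 1000 ten-byte chunks, the doubling buffer copies a few tens of kilobytes while naive concat copies millions of bytes.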
There's a lot of interesting conflation of "right" versus "wrong" in this thread. There's the C# "best practices" "right" versus "wrong" of knowing when to choose things like StringBuilder over string.Concat or string's + operator overloading (which does just call string.Concat under the hood). There's the "does that right/wrong apply to other languages" question? (The answer is far more complex than just "right" and "wrong".) There's the VM guarantee of .NET CLR's hard line "no string is mutable" setting a somewhat hard "right" versus "wrong", and other languages' VMs/interpreters not needing to make such guarantees to themselves or to others. None of these are really ever "right" versus "wrong"; almost all of them are trade-offs described as "best practices", and we take the word "best" too strongly, especially when we don't know/examine/explore the trade-off context it originated from (and sometimes wrongly assume that "best" means "universally the best").
This also nicely shows one of the fun things with table scans. They look decent at first then perf goes crappy. Then you get to learn something. In this case it looks like she used normalization to scrunch out the main table size (win on the scan rate, as it would not be hitting the disk as much with faster row reads). It probably also helped just on the 'to' lookups. Think about how many people are in an org, even a large company has a finite number of people but they get thousands of emails a week. That table is going to be massively smaller than keeping it in every row copied over and over. Just doing the 'to' clause alone would dump out huge amounts of that scan it was doing. That is even before considering an index.
The trick with most SQL instances is to do less work. Grab fewer rows when you can. Use smaller tables if you can. Throw out unnecessary data from your rows if you can. Reduce round trips to the disk if you can (which usually conflicts with the previous rules). Pick your data ordering well. See if you can get your lookups to be ints instead of other datatypes. Indexes usually are a good first tool to grab when speeding up a query. But they do have a cost, on insert/update/delete and disk. Usually that cost is less than your lookup time, but not always. But you stick all of that together and you can have a DB that is really performant.
For me the fun one was one DB I worked in where they used a GUID as the primary key, and therefore the FK into other tables. Which also was the default ordering key on the disk. Took me a while to make them understand why the perf on lookup was so bad. I know why they did it and it made sense at the time to do it that way. But long term it was holding things back, ballooning the data size, and crushing the lookup times.
How did I come by all of this? Lots of reading and stubbing my toe on things and watching others do the same. Most of the time you do not get a 'mentor'. :( But I sure try to teach others.
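The scan-vs-index difference described above is easy to see with EXPLAIN; a minimal sqlite3 sketch (the table and column names are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mail_log (ip TEXT, m_to TEXT, seen INTEGER)")

def plan(sql):
    # EXPLAIN QUERY PLAN reports whether the engine scans or uses an index.
    return " ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM mail_log WHERE m_to = 'a@example.com'"
before = plan(query)                                  # full table scan
con.execute("CREATE INDEX idx_to ON mail_log (m_to)")
after = plan(query)                                   # index search
print(before)
print(after)
```

The same habit -- checking the plan before and after a change -- is how you catch queries that quietly degrade into scans.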
It’s not clear from the article whether the author spent any time studying the problem before implementing it. If she failed to do so, it is both problematic and way more common than it ought to be.
“Ready, fire, aim”
But you know what they say: good judgment comes from experience, and experience comes from bad judgment.
Compounding the problem here is that the author now has much more experience and in a reflective blog post, still got the wrong answer.
IMO the better lesson to take away here would have been to take the time getting to know the technology before putting it into production instead of jumping head first into it. That would be bar raising advice. The current advice doesn’t advise caution; instead it perpetuates the status quo and gives “feel good” advice.
The challenge of writing good software is not about knowing and perfecting every gory detail; it's about developing the judgement to focus on the details that actually matter. Junior engineers who follow your advice will be scared to make mistakes and their development will languish compared to those who dive in and learn from their mistakes. As one gains experience it's easy to become arrogant and dismissive of the mistakes that junior engineers make, but this can be extremely poisonous to their development.
Everyone - new and experienced developers alike - should have a healthy fear of causing pain to customers. Serving customers - indirectly or otherwise - is what makes businesses run. As a principal engineer today my number one focus is on the customer experience, and I try to instill this same focus in every mentee I have.
Does that mean that junior developers should live in constant fear and decision paralysis? Of course not.
That's where the mentor - the more experienced and seasoned engineer - comes in. The mentor is roughly analogous to the teacher. And the teaching process uses a combination of a mentor, reference materials, and the lab process. The reference materials provide the details; the lab process provides a safe environment in which to learn; and the mentor provides the boundaries to prevent the experiments from causing damage, feedback to the mentee, and wisdom to fill in the gaps between them all. If any of these are absent, the new learner's development will suffer.
Outstanding educational systems were built over the last 3 centuries in the Western tradition using this technique, and other societies not steeped in this tradition have sent their children to us (especially in the last half-century) to do better than they ever could at home. It is a testament to the quality of the approach.
So no, I'm not advocating that junior developers should do nothing until they have read all the books front to back. They'd never gain any experience at all or make essential mistakes they could learn from if they did that. But junior developers should not work in a vacuum - more senior developers who both provide guidance and refer them to reference materials to study and try again, in my experience, lead to stronger, more capable engineers in the long term. "Learning to learn" and "respect what came before" are skills as important as the engineering process itself.
The thing is, specifically, that the OP did NOT have a mentor. They had a serious problem to solve, pronto, and knew enough to take a crack at it. OK, the initial implementation was suboptimal. So what? It's totally normal, learn and try again. Repeat.
It would be nice if every workplace had an orderly hierarchy of talent where everyone takes care to nurture and support everyone else (especially new folks). And where it's OK to ask questions and receive guidance even across organizational silos, and where there are guardrails to mitigate accidents and terrible mistakes. If you work in such an environment, you are lucky.
It is far more common to have sharp-elbow/blamestorm workplaces which pretend to value "accountability" but then don't lift a finger to support anyone who takes initiative. I suspect that at the time Rachelbythebay worked in exactly this kind of environment, where it simply wasn't possible for an OPS person to go see the resident database expert and ask for advice.
And when it comes to taking production risks, measure twice and cut once, just like a good carpenter.
Of course you should accept yourself. But that alone won’t advance the profession or one’s career.
Being able to do that is a luxury that many do not enjoy in nose-to-the-grindstone workplaces. You make an estimate (or someone makes it for you), then you gotta deliver mentor or no mentor, and whether you know the finer points of database design/care-and-feeding or not.
There's something good to be said for taking action. Rachelbythebay did just fine. No one died, the company didn't suffer, it was just a problem, it got corrected later. Big whoop.
I learned probably 80% of what I know about DB optimization from reading the Percona blog and S.O. The other 20% came from splashing around and making mistakes, and testing tons of different things with EXPLAIN. That's basically learning in a vacuum.
Honestly, I think your comment is okay minus the fact that you're trying to highlight such hard disagreement with a sentiment that people should read the fucking manual. Really, you couldn't disagree MORE?
There is definitely value in RTFM. There are cases where making mistakes in production is not acceptable, and progressing by "making mistakes" is a mistake on its own. I don't think the case in this article sounds like one of those, but they do exist (e.g. financial systems (think about people losing money on crypto exchanges), healthcare, or, say, sending rockets into space). In many cases, making mistakes is fine in test systems, but completely, absolutely, catastrophically unacceptable in production. Although, I refuse to blame the junior engineer for such mistakes, I blame management that "sent a boy to do a man's job" (apologies for a dated idiom here).
(As an aside, overall, I disagree with many of the comments here nitpicking the way the author solved the problem, calling her "clueless", etc. I really don't care about that level of detail, and while I agree the solution does not seem ideal, it worked for them better than the previous solution.)
Mistakes, failure, and trial and error are very much a part of developing skills. If you're not making mistakes, you're also not taking enough risks and thus missing out on opportunities for growth.
I also find my own team is usually strapped for resources; normally, people. (Usually politely phrased as "time".) Yes, one has to be wary of mythical man-month'ing it, but at my last employer we had essentially 2 of us on a project that could have easily used at least one, if not two, more people. Repeat across every project and that employer was understaffed by 50-100%, essentially.
Some company just went for S-1, and they were bleeding cash. But they weren't bleeding it into new ventures: they were bleeding it into marketing. Sure, that might win you one or two customers, but I think you'd make much stronger gains with new products, or less buggy products that don't drive the existing customers away.
Also there's an obsession with "NIT" — not invented there — that really ties my hands as an engineer. Like, everything has to be outsourced to some cloud vendor whose product only fits some of the needs, and barely, and whose "support" department appears to be unaware of what a computer is. I'm a SWE, let me do my thing, once in a while? (Yes, where there's a good fit for an external product, yes, by all means. But these days my job is 100% support tickets, and like 3% actual engineering.)
Now, imagine a new engineer, just getting started trying to make sense of it all, yet being met with harsh criticism and impatience.
Maybe that was exactly her point--to get everyone debating over such a relatively trivial problem to communicate the messages: Good engineering is hard. Everyone is learning. Opinions can vary. Show some humility and grace.
1. Me from six months ago. Clueless about a lot of stuff, made some odd technical choices for obscure reasons, left a lot of messes for me to deal with.
2. Me six months from now. Anything I don't have time for now, I can leave for him. If I was being considerate I'd write better docs for him to work with, but I don't have the time for that either.
That's definitely not true anymore, if it ever was. Senior/staff level engs with 20+ years of experience are sought after and paid a ton of money.
I would not expect her to be like this from what I've read on her blog so I was surprised. Well written!
I know I've also had these encounters (specifically in DBs), and I'm sure there are plenty more to come.
A lot of the basic issues beginners go through can be mitigated by paying attention in class (and/or having formal education in the first place).
I’ll grant that there are other, similar, things that everyone will go through though.
The table schema isn't terrible, it's just not great. A good first-pass to be optimized when it's discovered to be overly large.
It depends - if this was the full use-case then maybe a single table is actually a pretty good solution.
while the main table did not have any such guarantees, so there will be a lot of indexing and duplication in at least 3 (of the 4) sub-tables
Moreover, indexing would be much faster and more efficient, as there are only unique values in the sub-tables.
For example: if there are multiple emails from the same domain, then the domain-name column will have the same entry multiple times.
When the domain is moved to a sub-table, the main table still has duplicate entries, in the form of foreign keys, while the domain table has a unique entry for each domain name.
Edit: I guess concatenate with a delimiter if you're worried about false positives with the concat. But it does read like a cache of "I've seen this before". Doing it this way would be compact and indexed well. MD5 was fast in 2002, and you could just use a CRC instead if it weren't. I suppose you lose some operational visibility to what's going on.
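A sketch of that hashed-tuple cache key, with a delimiter to avoid the concatenation false positives mentioned in the edit (the function name and delimiter choice are illustrative assumptions):

```python
import hashlib

def triplet_key(ip, helo, m_from, m_to):
    # The delimiter guards against false positives from bare concatenation,
    # e.g. ("foo", "bar...") vs. ("foobar", "...") hashing identically.
    raw = "\x00".join((ip, helo, m_from, m_to))
    return hashlib.md5(raw.encode()).hexdigest()

k1 = triplet_key("192.0.2.1", "foobar", "a@example.com", "b@example.com")
k2 = triplet_key("192.0.2.1", "foobar", "a@example.com", "b@example.com")
k3 = triplet_key("192.0.2.1", "foo", "bara@example.com", "b@example.com")
print(k1 == k2, k1 == k3)  # True False
```

One fixed-width 16-byte (32 hex chars) key indexes compactly, at the cost of losing the ability to query the fields individually.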
My guess is that in 2002 there were some issues making those options unappealing to the Engineering team.
When we do this in the realm of huge traffic, we run the data through a log stream and it ends up in either a KV store, Parquet with a SQL/query layer on top of it, or hashed and rolled into a database (and all of the above if there are a lot of disparate consumers. Weee, Data Lakes).
This is also the sort of thing I’d imagine Elastic would love you to use their search engine for.
> A real mail server which did SMTP properly would retry at some point, typically 15 minutes to an hour later. If it did retry and enough time had elapsed, we would allow it through.
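The retry behavior quoted there can be sketched as a tiny in-memory check (the 15-minute window comes from the quote; everything else is an illustrative assumption):

```python
import time

RETRY_WINDOW = 15 * 60  # seconds before a retry is accepted

first_seen = {}  # (ip, from, to) -> time of first delivery attempt

def check(triplet, now=None):
    now = time.time() if now is None else now
    if triplet not in first_seen:
        first_seen[triplet] = now
        return "defer"             # temporary SMTP failure on first sight
    if now - first_seen[triplet] >= RETRY_WINDOW:
        return "accept"            # a real server waited, then retried
    return "defer"                 # retried too quickly

t = ("192.0.2.1", "a@example.com", "b@example.com")
print(check(t, now=0))             # defer
print(check(t, now=60))            # defer: only a minute later
print(check(t, now=20 * 60))       # accept: retried after the window
```

Spam cannons that fire once and never retry simply never make it past the "defer" state.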
There's lots of stuff that I'm bad at. I'm also good at a fair bit of stuff. I got that way, by being bad at it, making mistakes, asking "dumb" questions, and seeing how others did it.
Sounds familiar. And sometimes that other someone is just a later version of yourself :)
I am very grateful when my past self thought to add comments for my future self.
Normalization does not mean "the same string can only appear once". Mapping the string "email@example.com" to a new value "id1" has no effect on the relations in your table. Now instead of "email@example.com" in 20 different locations, you have "id1" in 20 different locations. There's been no actual "deduplication", but that's again not the point. Creating the extra tables has no impact on the normalization of the data.
I thought perhaps swapping the strings to integers might make it easier to index, or perhaps it did indeed help with deduplication in that the implementation didn't "compress" identical strings in a column—saving space and perhaps helping performance.
But both issues appear to be implementation issues with an unsophisticated MySQL circa 2000, rather than a fundamentally wrong schema.
I agreed with her comment at the end about not valuing experience, but proper database design should be taught to every developer at the undergraduate level, and somehow to the self-taught. Looking at CompTIA... they don't seem to have a DB/SQL test, only an "IT Fundamentals" one, which touches on it.
To add: Not sure about MySQL, but `varchar`/`text` in PostgreSQL for short strings like those in the article is very efficient. It basically just takes up space equaling the length of the string on disk, plus one byte.
How do you mean? If id1 is unique on table A, and table B has a foreign key dependency on A.id, then yeah you still have id1 in twenty locations but it's normalized in that altering the referenced table once will alter the joined value in all twenty cases.
This might not be important in the spam-graylisting use case, and very narrowly it might be 3NF as originally written, but it certainly wouldn't be if there were any other data attached to each email address, such as a weighting value.
>How do you mean? If id1 is unique on table A, and table B has a foreign key dependency on A.id, then yeah you still have id1 in twenty locations but it's normalized in that altering the referenced table once will alter the joined value in all twenty cases.
Ok, but that's not what normalization means.
If you have a table as described in the article that looks like:
foo | bar | baz
foo1 | bar1 | baz1
foo2 | bar2 | baz2
foo3 | bar3 | baz3
foo4 | bar1 | baz2
foo3 | bar1 | baz4
and then you say "ok, bar1 is now actually called id1", you now have a table
foo1 | id1 | baz1
foo4 | id1 | baz2
foo3 | id1 | baz4
you haven't actually changed anything about the relationships in this data. You've just renamed one of your values. This is really a form of compression, not normalization.
Normalization is fundamentally about the constraints on data and how they are codified. If you took the author's schema and added a new column for the country where the ip address is located (so the columns are now ip, helo, from, to, and country), then the table is no longer normalized, because there is an implicit relationship between ip and country--if ip 18.104.22.168 is located in the USA, every row with 18.104.22.168 as the ip must have USA as the country. If you know the IP for a row, you know its country. This is what 3NF is about. Here you'd be able to represent invalid data by inserting a row with 18.104.22.168 and a non-US country, and you normalize this schema by adding a new table mapping IP to country.
But none of that is what's going on in the article. The author described several fields that have no relationship at all between them--IP is assumed to be completely independent of helo, from address, and to address. And the second schema they propose is in no way "normalizing" anything. The four new tables don't establish any relationship between any of the data.
>This might not be important in the spam-graylisting use case, and very narrowly it might be 3NF as originally written, but it certainly wouldn't be if there were any other data attached to each email address, such as a weighting value.
It's not "very narrowly" 3NF. It's pretty clear-cut! A lot of commenters here are referring to the second schema as the "normalized" one and to me that betrays a fundamental misunderstanding of what the term even means. And sure, if you had a different schema that wasn't normalized, then it wouldn't be normalized, but that's not what's in the article.
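The IP-to-country normalization described a few paragraphs up can be sketched in sqlite3 (all schema names here are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Move the ip -> country dependency into its own table, so a single ip
# can never carry two different countries.
con.execute("CREATE TABLE ip_country (ip TEXT PRIMARY KEY, country TEXT)")
con.execute("""
    CREATE TABLE mail_log (
        ip TEXT REFERENCES ip_country(ip),
        helo TEXT, m_from TEXT, m_to TEXT
    )
""")
con.execute("INSERT INTO ip_country VALUES ('192.0.2.1', 'USA')")
# A second country for the same ip is rejected outright, instead of
# silently creating an inconsistent row as in the denormalized schema.
try:
    con.execute("INSERT INTO ip_country VALUES ('192.0.2.1', 'FRA')")
    conflict = False
except sqlite3.IntegrityError:
    conflict = True
print(conflict)  # True
```

In the original four-column schema there is no such dependency between fields, which is exactly why that schema was already in 3NF.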
First of all, the index hashes of the emails depend on the email strings, hence the indexed original schema is not normalised.
Secondly, it would not effectively be compression unless there were in fact dependencies in the data. But we can make many fair assumptions about the statistical dependencies. For example, certain emails/ips occur together more often than others, and so on. In so far as our assumptions of these dependencies are correct, normalisation gives us all the usual benefits.
While you may technically be correct, I think there are few people who think of anything else when talking about database normalization.
That said, I cannot quickly figure out if it’s actually true here.
This schema is also in Boyce-Codd normal form. It's normalized in every usual sense of the word. Trivially so, even. It's not a question of being "technically" correct. If you think the second schema is more normalized than the first one, you need to re-evaluate your mental model of what normalization means. That's all there is to it.
Of course the main point still stands, that the two schemas are exactly as normalized as each other.
Edit: rereading the original post, the author mentions that "they forged...the HELO"--so perhaps there was indeed no relationship between HELO and IP here. But again, I don't know anything about SMTP, so this could be wrong.
Yep, and that explains the "foobar" rows - those should have resolved to the same IP, except that with no authentication to block it, you could put gibberish there and the SMTP server would accept it.
> so correlating based on the hostname in HELO was questionable at best
Eh, spambots from two different IPs could have both hardcoded "foobar" because of the lack of authentication, so I could see this working to filter legitimate/illegitimate emails from a compromised IP.
It can, if you want to easily update an email address, or easily remove all references to an email address because it is PII.
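A rough sketch of that benefit, again using SQLite and invented table names: once addresses live in their own table and the log only references them by id, renaming or erasing a PII value is a single-row update, no matter how many log rows point at it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical normalized layout: each address is stored once, the log references it by id.
con.execute("CREATE TABLE addresses (id INTEGER PRIMARY KEY, email TEXT UNIQUE)")
con.execute("CREATE TABLE graylist (from_id INTEGER REFERENCES addresses(id), seen TEXT)")
con.execute("INSERT INTO addresses (email) VALUES ('old@example.com')")
for day in ("mon", "tue", "wed"):
    con.execute("INSERT INTO graylist VALUES (1, ?)", (day,))

# Updating (or erasing) the address touches one row, not one per log entry.
con.execute("UPDATE addresses SET email = 'new@example.com' WHERE id = 1")
rows = con.execute("""SELECT DISTINCT a.email FROM graylist g
                      JOIN addresses a ON a.id = g.from_id""").fetchall()
```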
But databases are just magic. You try things out, usually involving a CREATE INDEX at some point, and sometimes it gets faster, so you keep it.
Rachel, in her blog post, is a good example of that thought process. She used a "best practice", added an index (there is always an index), and it made her queries faster, cool. I don't blame her; it works, and it is a good reminder of the 3NF principle. But work like that on procedural code and I'm sure we would get plenty of reactions like "why no profiler?".
Many, many programmers write SQL, but very few seem to know about query plans and the way the underlying data structures work. It almost looks like secret knowledge of the DBA caste or something. I know it is all public knowledge of course, but it is rarely taught, and the little I know about it is all personal curiosity.
I hate how developers talk about a "database" as a monolithic concept. It's an abstract concept with countless implementations built off of competing philosophies of that abstract concept. SQL is only slightly more concrete, but there's as many variants and special quirks of SQL dialects out there as databases.
While I agree that how a query planner works is one of the most ‘magic’ aspects, I think the output from the query planner in most databases is quite approachable for regular programmers and will get you quite far in solving performance issues. We know what full scans are (searching through every element of an array), etc.
The challenge is usually discovering that the jargon used in your database maps to something you already have a concept for, and then reading the database documentation…
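For instance, SQLite's `EXPLAIN QUERY PLAN` (other databases have an equivalent) spells the jargon out in a short human-readable line: "SCAN" is the full walk through every row, "SEARCH ... USING INDEX" is the fast lookup. A toy sketch, with invented table and index names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE graylist (ip TEXT, helo TEXT)")

def plan(sql):
    # The fourth column of each EXPLAIN QUERY PLAN row is the readable detail.
    return [row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql)]

before = plan("SELECT * FROM graylist WHERE ip = '192.0.2.1'")  # full scan
con.execute("CREATE INDEX idx_ip ON graylist (ip)")
after = plan("SELECT * FROM graylist WHERE ip = '192.0.2.1'")   # index search
```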
I cringed a little because some of those mistakes look like stuff I would do even now. I have a ton of front/back-end experience in a huge variety of languages and platforms, but I am NOT a database engineer. That’s a specific skillset I never picked up nor, admittedly, am I passionate about.
Sorry to go off on a tangent, but it also brings to mind that even so-called experts make mistakes. I watch a lot of live concert footage from famous bands from the 60s to the 90s, and as a musician myself, I spot a lot of mistakes even among the most renowned artists…with the sole exception of Rush. As far as I can tell, that band was flawless and miraculously sounded better live than on record!
And use something other than an RDBMS. Put the hash in Redis and expire the key; your code simply does an existence check for the hash. You could probably handle gmail with a big enough cluster.
That super-normalized schema looks terrible.
> Considering that we as an industry tend to chase off anyone who makes it to the age of 35, is it any surprise that we have a giant flock of people roaming around trying anything that'll work?
Which results in the same problems but the root cause is different.
(And, yes, we had math in the early 2000s.)
Snark aside, I'm frustrated for the author. Her completely-reasonable schema wasn't "terrible" (even in archaic MySQL)—it just needed an index.
There's always more than one way to do something. It's a folly of the less experienced to think that there's only One Correct Way, and it discourages teammates when there's a threat of labeling a solution as "terrible."
Not to say there aren't infinite terrible approaches to any given problem: but the way you guide someone to detect why a given solution may not be optimal, and how you iterate to something better, is how you grow your team.
On the mysql she was using, breaking things out so it only needed to index ints was almost certainly a much better idea. On anything I'd be deploying to today, I'd start by throwing a compound index at the thing with the expectation that'd probably get me to Good Enough.
The database included several newly developed "stored procedures".
Time elapsed... and it was nearing the time to ship the code. So we tried to populate the database. But we could not. It turned out that the stored procedures would only allow a single record in the database.
Since a portion of the "business logic" depended on the stored procedures... well, things got "delayed" for quite a while... and we ended up having a major re-design of the back end.
Why _the hell_ is nobody mentioning that using a database that charges per row touched is absolute insanity? When did it become so normal that nobody mentions it?
This makes me think that these things are at least 100x cheaper than AWS might want to make me believe.
The reason is that the new schema adds a great deal of needless complexity, requires the overhead of foreign keys, and makes it a hassle to change things later.
It's better to stick with the original design and add a unique index with key prefix compression, which all major databases support these days. This means that the leading values get compressed out, and the resulting index will be no larger and no slower than the one with foreign keys.
If you include all of the keys in the index, then it will be a covering index and all queries will hit the index only, and not the heap table.
But I don’t think the point of the post is what’s the right or wrong way of doing it. The point, as mentioned by a few here, is that programmers make mistakes. Those mistakes are costly, and will remain costly if, in the tech industry, we continue to boot experienced engineers… the tacit knowledge those engineers have gained will not be passed on, and this means more people have to figure things out by themselves.
Are you implying that isn't the case today? Thousands of (big) companies are still like that and will continue to be like that.
I write my own SQL, design tables and stuff, submit it for a review by someone 10x more qualified than myself, and at the end of the day I'll get a message back from a DBA saying "do this, it's better".
There was a lot of pressure to relax 3NF as being too academic and not practical.
Around then, I had a friend who was using a pattern of varchar primary keys so that queries that just needed the (unique) name and not the metadata could skip the join. We all acted like he was engaging in the Dark Arts.
To my understanding, for example Firebird/Interbase had automatic key prefix compression as far back as early 2000s at the very least. I don't believe you could even turn it off.
We can debate the Correct Implementation all day long. The fact of the matter is that adding any index to the original table, even the wrong index, would lead to a massive speedup. We can debate 2x or 5x speedups from compression or from choosing a different schema or a different index, but we get 10,000x from adding any index at all.
Adding an index now increases the insert operation cost/time and adds additional storage.
If insert speed/volume is more important than reads keep the indexes away. Replicate and create an index on that copy.
that insert is happening after you've checked the table to see if the record is present. so two operations whose times we care about are "select and accept email" and "select, tell the sender to come back in 20 minutes, and then insert".
the insert time effectively doesn't matter, unless you've decided to abandon discussion of the original table entirely without mentioning it.
Key thing is to use EXPLAIN and benchmark whatever you do. Then the right path will reveal itself...
Moreover, unless you can prove with experimental data that the 3rd-normal-form version of the database performs significantly better or solves some other business problem, then I would argue that refactoring it is strictly worse.
There are good reasons not to use email addresses as primary or foreign keys, but those reasons are conceptual ("business logic") and not technical.
In that case it’s not obvious to me that putting a key prefix index on every column is the correct thing to do, because that will get toilsome very quickly under high write loads.
Given that she herself wrote the before and after systems 20 years ago and that the story was more about everyone having dumb mistakes when they are inexperienced perhaps we should assume the best about her second design?
But strictly from a performance aspect, I agree it's a wash if both were done correctly.
Did everyone on HN miss that the database in question was whichever version of MySQL existed in 2002?
Well, you also save space by doing this (though presumably you only need to index the emails, as IPs are already 128 bits).
But other than that, I'm also not sure why the original schema was bad.
If you were to build individual indexes on all four columns, you would essentially build the four id tables implicitly.
You can calculate the intersection of hits on all four indexes that match your query to get your result. This is linear in the number of hits across all four indexes in the worst case, but if you are careful about which index you look at first, you will probably be a lot more efficient, e.g. (from, ip, to, helo).
Even with a multi index on the new schema, how do you search faster than this?
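The intersection scheme described above can be sketched in a few lines of plain Python; the rows and field names here are made up for illustration. Each field gets an inverted index mapping value -> set of row ids, and a lookup intersects the hit sets, smallest first:

```python
from collections import defaultdict

# Toy rows and four single-column inverted indexes, one per field.
rows = [
    ("192.0.2.1", "foobar", "a@x", "b@y"),
    ("192.0.2.1", "foobar", "a@x", "c@y"),
    ("198.51.100.7", "mail.example", "d@z", "b@y"),
]
fields = ("ip", "helo", "mail_from", "rcpt_to")
indexes = {f: defaultdict(set) for f in fields}
for rowid, row in enumerate(rows):
    for f, value in zip(fields, row):
        indexes[f][value].add(rowid)

def lookup(ip, helo, mail_from, rcpt_to):
    # Intersect the per-field hit sets, starting with the smallest one.
    hits = [indexes[f][v] for f, v in zip(fields, (ip, helo, mail_from, rcpt_to))]
    hits.sort(key=len)
    result = hits[0]
    for h in hits[1:]:
        result = result & h
    return result
```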
There was a whole thread yesterday about how a dude found out that it isn't: https://briananglin.me/posts/spending-5k-to-learn-how-databa... (also mentioned in the RbtB post)
There are always costs and benefits to decisions. It seems that you are only looking at the costs and none of the benefits?
But your point is well-taken. Hardware is cheap.
I've run into this form of greylisting, it's quite annoying. My service sends one-time login links and authorization codes that expire in 15 minutes. If the email gets delayed, the user can just try again, right? Except I'm using AWS SES, so the next email may very well come from a different address and will get delayed again.
* security, an attacker has more time to intercept and use the links and codes.
* UX, making the user wait 15+ minutes to do certain actions is quite terrible.
I've had a couple support requests about this. The common theme seems to be the customer is using Mimecast, and the fix is to add my sender address in a whitelist somewhere in their Mimecast configuration.
I'm sure a good chunk of tech workers are worried that when an influential blogger writes something like this a couple hundred developers will Google 3NF and start opening PRs.
And hey, what's so bad about people learning about 3NF? Are you not supposed to know what that is until you're some mythical ninth level DBA?
I certainly don't mean to criticize the author or say that they shouldn't have posted the article. If articles needed to be 100% correct then no-one has any business writing them. And it's still open to debate whether the article is even wrong or misleading.
> Are you not supposed to know what that is until you're some mythical ninth level DBA?
I think the world might be a better place if people _did_ wait until they were ninth level DBAs :p
Jokes aside, I have no problem with people learning 3NF. I'm more concerned that people can be too quick to take this sort of rule-of-thumb advice to heart (myself included). And in my opinion, database optimization is too complex to fit well into simple rules like "normalize when your query looks like this." My only 'rules' for high-frequency/critical tables and queries is "test, profile, load test, and take nothing for granted." But I know just enough to know that I don't know enough about SQL databases to intuit performance.
They are important because the present incarnation of the author is making all the wrong diagnoses about the problems with the original implementation, despite doing it with an air of "Yes, younger me was so naive and inexperienced, and present me is savvy and wise".
Or an even better argument: you don't need to actually understand the problem to fix it, often you accidentally fix the problem just by using a different approach.
I agree that we don't know, but it seems a little unfair to her to treat every unknown as definitely being the most negative of the possibilities.
> 7.4.2 Multiple-Column Indexes
> MySQL can create composite indexes (that is, indexes on multiple columns). An index may consist of up to 16 columns. For certain data types, you can index a prefix of the column (see Section 7.4.1).
> A multiple-column index can be considered a sorted array containing values that are created by concatenating the values of the indexed columns.
> MySQL uses multiple-column indexes in such a way that queries are fast when you specify a known quantity for the first column of the index in a WHERE clause, even if you do not specify values for the other columns.
To be more verbose about it - there is an important difference between "can be created" and "will perform sufficiently well on whatever (likely scavenged) hardware was assigned to the internal IT system in question."
I wouldn't be surprised if the "server" for this system was something like a repurposed Pentium 233 desktop with a cheap spinning rust IDE drive in it, and depending on just how badly the spammers were kicking the shit out of the mail system in question that's going to be a fun time.
Which I guess is saying the article's setup story makes the exact opposite point that it's trying to?
I think ageism is a valid concern, but it's also not a black-and-white issue. While I know quite a few older folks like me who are still constantly keeping up to date, I've also known some people my age or just a little bit older who basically stagnated into irrelevance as experts in now-discarded technology and never really moved on. So the line between what is ageism and what is people just not bringing value to the table anymore can be a bit blurry.
I'm by no means trying to suggest ageism isn't an issue in tech. I do see echoes of it when I look around, and if I suddenly lost the network of people I work with (and have worked with in the past) who understand I'm still very technically 'flexible' for an older person, I'm pretty sure it would be difficult to find a job - in the practical sense of getting pre-filtered out, often just based on age. So my heart really goes out to people who find themselves capable but unable to find work due primarily to age.
But, uh, the issue is somewhat complicated and very case by case, I think.
In other words, with enough empathy and patience, a clueless rookie can grow into a clueless senior engineer!
Rachel usually makes more sense than that. That's why people are nitpicking implementation details.
And in that context, everything makes sense.