An email falsehood surprised me recently: I thought a case-insensitive email address can be compared by using pseudocode `lower(x)`. But that's false.
An email system that guarantees case-insensitive email addresses can still fail during comparisons of lowercase-to-lowercase, due to international encodings, locales, I18N, L10N, etc.
Pseudocode:
string-compare-case-insensitive(x, y) => true // Right way
string-compare(lower(x), lower(y)) => false // Wrong way
It turns out this issue is called a "case folding non-deterministic" error, and is a broader issue with strings in general.
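A minimal sketch of the difference, in Python rather than Postgres, so it only illustrates the idea and is not the mail system's actual code. The helper names `naive_equal` and `folded_equal` are made up for this example; `str.casefold` is Python's built-in Unicode case folding.

```python
import unicodedata


def naive_equal(x: str, y: str) -> bool:
    """The 'wrong way': compare the lowercased strings."""
    return x.lower() == y.lower()


def folded_equal(x: str, y: str) -> bool:
    """Compare using Unicode case folding plus normalization."""
    def fold(s: str) -> str:
        # Normalize, fold, then normalize again, because folding can
        # produce text that is no longer in normalized form.
        return unicodedata.normalize(
            "NFD", unicodedata.normalize("NFD", s).casefold()
        )
    return fold(x) == fold(y)


# German sharp s: lower() leaves it alone, case folding maps it to "ss".
print(naive_equal("STRASSE", "straße"))   # False
print(folded_equal("STRASSE", "straße"))  # True
```

Whether a given database collation agrees with this in every case is a separate question; the point is only that lowercasing and case folding are different operations.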
> I thought a case-insensitive email address can be compared by using pseudocode lower(x)
You shouldn't be comparing the mailbox part of email addresses at all other than as literal bytestrings: you cannot know what equivalence rules the mailserver for that domain uses.
The domain part can be equivalence-tested using the normal rules for domains though, including case insensitivity, IDN translation and punycode resolution.
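A rough sketch of that split, assuming Python's built-in `idna` codec (which implements the older IDNA 2003 rules) is good enough for illustration. The function names are hypothetical.

```python
def split_address(address: str) -> tuple[str, str]:
    """Split at the last '@' into (local part, domain)."""
    local, _, domain = address.rpartition("@")
    return local, domain


def domain_key(domain: str) -> bytes:
    """Domain in ASCII (punycode) form, lowercased for comparison."""
    # The codec leaves pure-ASCII labels untouched, so lowercase afterwards.
    return domain.encode("idna").lower()


def same_address(a: str, b: str) -> bool:
    local_a, domain_a = split_address(a)
    local_b, domain_b = split_address(b)
    # Local part: literal byte comparison only, no case or Unicode tricks.
    same_local = local_a.encode("utf-8") == local_b.encode("utf-8")
    return same_local and domain_key(domain_a) == domain_key(domain_b)


print(same_address("user@BÜCHER.example", "user@xn--bcher-kva.example"))  # True
print(same_address("User@example.com", "user@example.com"))               # False
```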
> you cannot know what equivalence rules the mailserver for that domain uses.
You're right in general. In my specific post, I do know the equivalence rules, because I'm administering the mail system and working with its source code, and the documentation guarantees/requires that internally its email addresses are treated entirely as case-insensitive.
What I saw was source code comments about not using `lower(x)` nor Postgres module `citext`, and instead using Unicode case folding ICU and Postgres non-deterministic collations. In the end, what surprised me wasn't about email servers in general, it was about human languages with case folding.
This also depends a lot on why you are comparing those addresses. If you want for example to make sure that you don't easily allow the same person to register multiple accounts (say, to take advantage of a free trial period), then they are both wrong, since X@gmail.com, X.@gmail.com, X..@gmail.com etc. are all the same account and cost nothing to make.
However, if you just want to make sure this is the same user that signed in earlier, you get to choose what rules you want - it's their problem to some extent to remember what account name they gave you.
Found a mailing list ("marketing") service that allowed you to sign up with a +, but any 'unsubscribe' was a link to a URL with your email in it, with the + sign in it, and... the unsubscribe page could magically never find your address to unsubscribe you. I submitted this as a bug report to the service, and was brushed aside. At that time, I also had a couple friends who worked there, and tried to run it up the flagpole there, and I was told about half the folks there didn't understand the problem, and the other half didn't really care, because the + trick was seen as very niche. So I started reporting all the newsletters I got as 'spam' - well after I'd tried to unsubscribe.
Did you try to replace the + in the unsubscribe URL with %2B (the urlencoding of a +)? Because interpreting the URL on the server will almost certainly mean urldecoding the arguments which will turn + into a space.
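For illustration, here is how the decoding goes wrong with Python's standard `urllib.parse`. The `lists.example` URL and the `email` parameter name are invented for this sketch; the service's real URL format is not known.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def email_from_link(url: str) -> str:
    """Decode the (hypothetical) 'email' query parameter as a server would."""
    return parse_qs(urlsplit(url).query)["email"][0]


address = "user+news@example.com"

# Address pasted into the URL unescaped: '+' decodes to a space.
broken = "https://lists.example/unsubscribe?email=" + address

# Address escaped properly: '+' becomes %2B and survives decoding.
fixed = "https://lists.example/unsubscribe?" + urlencode({"email": address})

print(email_from_link(broken))  # user news@example.com
print(email_from_link(fixed))   # user+news@example.com
```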
Nope, because + used to mean space. So the plus symbol sometimes has to be left alone and sometimes replaced, since servers still support the legacy behavior alongside the newer RFC that makes + go away.
FWIR, yes. And it didn't work. They had some weird parsing going on to deal with some weird legacy stuff from years earlier, and... it made life more complicated than it needed to be.
Our website blocked it inadvertently, we were using ASP.NET Identity which has internal email validation which by default does not accept it. Just flipping a flag in the configuration is enough to allow it, but until someone reported the problem I wasn’t even aware it was blocked.
IME it's the latter, but not because they are lazy, but because email validation is hard. The lazy regex was in the late 90's early 00's when any email that didn't end in .com, .edu or .net failed!
I just wrote a post[1] about using the user+alias@example.com instead of user@example.com. Basically treat your "regular" email address as a spam target, and use user+alias for all wanted communications. The +alias can be a random string, making it harder for spammers to guess.
This works in gmail for both sending and receiving, although sending as you+alias@gmail.com requires messing around in the settings[2] and having to remember to select a non-default From: address in your outgoing mail.
Yes - other providers have their own strange ideas of what it means for two emails to be equal.
My point is that there is no standard that you can use to decide if two email addresses refer to the same inbox.
Each provider has their own rules for what emails they consider identical, and you need to either invent your own rules, or learn what theirs are, depending on why you are doing this.
Using Unicode aware string comparisons doesn't solve the issue any more than byte-wise comparison or ascii case-insensitive comparison.
You can't reliably say that a.b@ and ab@ are the same user except in a very small population (gmail.com) because you don't know how other domains will handle it.
Email is not a unique identifier which is how they're wanting to use it.
Which makes the problem even worse. Are you supposed to have comparison rules per provider then? Only for those that are large enough to bother in your particular use case?
Yes, if you want to know if two addresses refer to the same account, you need to keep up with the actual rules for each provider that you decide to care about.
But, if you're using the email as a username just for convenience, you get to decide what rules you have for username comparison. You can absolutely decide that A@gmail.com and a@gmail.com are different accounts for your service, even though emails you send to these accounts will likely reach the same inbox. You can also decide that they are the same account, but that haşim@gmail.com, HAŞIM@gmail.com and HAŞİM@gmail.com are all different accounts. It's entirely up to you to know what is important for your users.
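The per-provider approach could be sketched as a lookup table of rules. This is only an illustration: the Gmail rule below encodes what this thread says about dots and + suffixes, the table name is hypothetical, and any real list would have to be maintained by hand.

```python
def canonical_gmail(local: str) -> str:
    """Assumed Gmail rule: drop '+suffix', ignore dots, ignore case."""
    local = local.split("+", 1)[0]
    return local.replace(".", "").lower()


# Hypothetical rule table: only providers you decide to care about.
PROVIDER_RULES = {
    "gmail.com": canonical_gmail,
}


def canonical_address(address: str) -> str:
    local, _, domain = address.rpartition("@")
    domain = domain.lower()
    rule = PROVIDER_RULES.get(domain)
    if rule is not None:
        local = rule(local)
    # Unknown providers: leave the local part exactly as given.
    return f"{local}@{domain}"


print(canonical_address("John.Doe+shop@Gmail.com"))  # johndoe@gmail.com
print(canonical_address("John.Doe@example.com"))     # John.Doe@example.com
```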
It just means you cannot use email validation as a way to limit one account per person. A person could just have another email on another provider anyway.
It's part of the reason why you see many places ask for your phone number, not that I agree with that, and even this method has many flaws.
My favorite issue with phone number-based identity verification is that it doesn't account for multiple different countries well at all. In some regions, getting more than one 'legit' phone number (i.e. not on any of the VoIP blocklists) per person is nearly impossible, while in other regions, even in the Western world, it's very much possible to buy a lot of pay-as-you-go plans, keep the numbers from expiring, and use them to retain perhaps dozens of accounts at no to minimal cost, making it yet another measure that affects legitimate users disproportionately compared to it affecting illegitimate users.
I really wish there was something that was easy to get as a company but hard to get as an individual. As a B2B SaaS, I want to do the organization-level equivalent of KYC — "Know Your Corporate Buyer"? — but there's no such thing.
Instead, all anyone offers in this vein are identifiers that are really annoying to get and that nobody already has as a matter-of-course of registering a company; such that many real companies can't (or won't bother to) pass them. DUNS numbers are a common choice. Our company is five years old and doesn't have a DUNS number. It takes two weeks to get one. So how could I expect our customers to bother getting one just to try our product?
"can you set this value in a TXT field on your DNS" or another domain-level challenge-authentication mechanism would be one. Sure, anyone can buy a domain but in theory that is what Extended Verification is supposed to represent, so if you get a valid handshake from an EV domain then in theory you're talking with someone representing the organization that the EV certificate was issued for.
Of course in principle you can get an EV cert for any company as well, but now we're talking about determining whether a company is sufficiently well-known to accept, which is not something that can fully be solved deterministically since that's a human judgement, there will always be some grey areas. But, there are certainly a lot of companies that could probably be "automatically verified" in some fashion (ford motor company? reasonably well-known, and this is their domain...) given some sort of authoritative domain -> stock mapping, and that's not impossible to do if you trust the EV scheme.
Anyway, not perfect, but a challenge-based approach raises the bar a lot from "register domain with godaddy and sign up for let's encrypt" to "first you have to get an EV certificate..."
Of course, since such a mechanism is not widely used... not exactly going to find tons of official support for using it like that. But the mechanisms are there!
Yeah, but not usually easy enough that you can make hundreds/thousands of them quickly for a single purpose and then throw them away.
Also, as a slight tweak, I'd hope that this authentication scheme would only admit legal companies that are at least a week old. People who do these sorts of bulk-registration attacks tend not to be patient people, willing to wait around for their credentials to gain reputation before using them. They create them and then try to use them right away. Whereas no real company would be signing up for most B2B services in its first week of legal existence. (Google Workspace? Sure. Accounting software? Probably not.)
If it's targeted to X company for a specific reason, they'll wait however long they need. If it's generic and X company is just caught in the crossfire, then maybe.
>Whereas no real company would be signing up for most B2B services in its first week of legal existence. (Google Workspace? Sure. Accounting software? Probably not.)
When I've registered companies in the past, it's when I have sales. Before then, it's just an idea to see if we get users. But the company with the service can determine that.
You can, however, use email parsing + heuristics as a way to detect people who are trying to register accounts that look like they ever could be part of a set of "related" addresses — and just reject even the first member in that set. (I.e. just reject registrations with a + in the username part in the first place.)
(You don't have to be petty about it; no need to send them to a "you're an attacker" page or anything. Just redline the form-field, explain the problem clearly, and let them modify the address until it's less duplicable.)
Yes, this doesn't stop people from manually registering multiple times, since, as you say, people can have multiple addresses; or even email service from multiple providers. But it does stop low-effort automated bulk-registration attacks. And some services — those with any sort of free-tier especially — get a lot of those!
That + sign as a label to the same inbox is a Gmail thing.
It's a 100% valid character to use. Doesn't have to mean it's a label in another hostname. Goes back to what others have said. You either end up with a ton of rules, or go with the fact that email isn't the best solution to identify unique users.
> The local-part of the email address may be unquoted or may be enclosed in quotation marks.
>If unquoted, it may use any of these ASCII characters:
> - uppercase and lowercase Latin letters A to Z and a to z
> - digits 0 to 9
> - printable characters !#$%&'*+-/=?^_`{|}~
> - dot ., provided that it is not the first or last character and provided also that it does not appear consecutively (e.g., John..Doe@example.com is not allowed).[5]
> If quoted, it may contain Space, Horizontal Tab (HT), any ASCII graphic except Backslash and Quote and a quoted-pair consisting of a Backslash followed by HT, Space or any ASCII graphic; it may also be split between lines anywhere that HT or Space appears. In contrast to unquoted local-parts, the addresses ".John.Doe"@example.com, "John.Doe."@example.com and "John..Doe"@example.com are allowed.
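As a rough sketch, the unquoted rules quoted above can be written as a regular expression. It covers only the unquoted ASCII form and ignores quoted local parts and length limits; the names are made up for the example.

```python
import re

# Letters, digits, and the printable characters listed above.
ATEXT = "A-Za-z0-9" + re.escape("!#$%&'*+-/=?^_`{|}~")

# One or more runs of allowed characters separated by single dots,
# so no leading, trailing, or consecutive dots.
UNQUOTED_LOCAL = re.compile(rf"[{ATEXT}]+(?:\.[{ATEXT}]+)*")


def is_valid_unquoted_local(local: str) -> bool:
    return UNQUOTED_LOCAL.fullmatch(local) is not None


print(is_valid_unquoted_local("John.Doe"))   # True
print(is_valid_unquoted_local("user+tag"))   # True
print(is_valid_unquoted_local("John..Doe"))  # False
```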
I know that it's allowed; but — given that we're a B2B company serving highly-technically-literate customers — I don't think I want the business of anyone who thinks it'd be a good idea to use it even when it is allowed, given how it'd affect their own deliverability to imperfectly-implemented MTAs.
(It's a similar "you're really relying on perfect competence from the whole rest of the ecosystem to get you out of this" feeling as e.g. putting a space in your user name — and thus your home directory path — in Linux. Sure, the core GNU/BSD/etc tooling has been tested for that use-case — but are you really going to trust random tools and shell-scripts to handle argument tokenization perfectly? Or are you just going to ditch the space to be safe?)
The more technically literate a user is, the more they'll understand and figure out how things work, the nuances, and use those things.
You know that core tooling that works? It's because technically literate users tried it, found the issue, and fixed it.
Same way with me using é in my name. If something breaks on a website, email, etc, I create an issue and start emailing. My name doesn't have an e, it has an é.
you are supposed to not use email as a token of user uniqueness, because it's not.
the fundamental problem here is assuming that "emailA == emailB" could ever conceivably be used as a proxy of the equality "userA == userB". Unicode tarpits aside (that's a legitimate problem where even if you did everything exactly right, libraries or api calls etc might not) the idea of "do these mailboxes belong to the same user" (in the sense of stripping '+' suffixes from gmail boxes etc) cannot be answered period. You cannot affirmatively confirm this nor can you disprove the assertion either, without some additional information about the behavior of the receiving mail transport agent.
even if you understand the behavior of the receiving MTA, you cannot prove that they don't have an email account somewhere else, so that is not sufficient either.
email is not a key for user uniqueness, period. It's a channel for contacting a user. You don't mess with it, and you simply send an email and see if the session user can authenticate it. If you want identity verification that happens at another layer, not email.
"transforms" like lowercasing or mailbox-suffix-stripping are only window dressing patching around this fundamental issue. There is no way to affirmatively go from "email address" to "user identity" without more information from another layer. It's a communication channel, not an identity provider. You can (reasonably) use it as a medium to talk to the user while you verify them against something else (although like SMS it can of course be hijacked unless your IDP protocol accounts for this possibility) but "email verification" is not and can never be an IDP in itself. Gmail addresses are free.
> (in the sense of stripping '+' suffixes from gmail boxes etc)
I have a few accounts. Those + suffixes drive a bunch of rules on them to get me to take action. They forward to my primary email address if it's important.
Some of them get scraped and mailed in a weekly summary. Some of them send text messages alerting me.
When they mess with them, I'll probably just not see it unless I have to login for some other reason to that account.
The whole thing is a mess. I remember trying to make a Postgres email address column and wanted to make sure it can do comparisons either way, then found this stackoverflow post that shattered my expectations of a clean well understood problem: https://dba.stackexchange.com/questions/68266/what-is-the-be...
I think the part before the @ is actually case sensitive per RFC, but most mail servers will treat it as case insensitive. But I am not sure I am reading the RFC correctly, citation:
> Verbs and argument values (e.g., "TO:" or "to:" in the RCPT command and extension name keywords) are not case sensitive, with the sole exception in this specification of a mailbox local-part.
Which is to say, nobody of note is still allowing it; you can allow it if you like, for anyone who is still doing it; but if you don't, and someone sends you a message, and it bounces, then, well, they already know why.
Yes, there are very few rules in the standard that could create two binary-different local parts that resolve to the same value (mostly involving parentheses). But the mail server can add as many rules as it wants, including the very common setting of making completely different names resolve to the same mailbox.
Case insensitivity was a huge mistake in computing really. Most languages don't have cases and it's very non-trivial to convert between cases. Should have treated every char as completely unique.
Sure. At the user signup side, block emails that are too close in ways like case, but as a sender you should always treat them as unique emails.
So if my email is HeLLo@example.com because I want to be cute people will have to try 6 times before they finally get the right email address? Imagine telling that to someone in person. This kind of "weird" casing isn't that rare and doesn't require cute usernames: "DonaldDuck@example.com", "FreeBSD@example.com", "DrMcCoy@example.com", etc.
Languages that don't have case are not an issue; the situations where a lowercase <-> uppercase mapping is not simple are actually not that many. It's not trivial, but not all that complex either. The most annoying part is Turkish, Azeri, and Lithuanian where the rules differ a bit but the used language is often unknown. For the purpose of matching things ("is this email address known in our system?") it's actually not that hard, since you can just treat several characters as identical (displaying text correctly to users is harder, but that's not important here).
I see this attitude in various situations, often under "falsehoods programmers believe" articles, which goes something like "it's hard in a few rare cases, therefore we should not do it at all for the >99% cases where it's simple and unproblematic".
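A minimal sketch of "treat several characters as identical" for matching, assuming the only troublesome letters are the dotted and dotless i variants used in Turkish and Azeri. The `match_key` helper is hypothetical, and its output is meant for lookups only, never for display.

```python
import unicodedata

# Collapse all four i variants onto plain "i" before folding.
I_VARIANTS = str.maketrans({"İ": "i", "I": "i", "ı": "i"})


def match_key(s: str) -> str:
    """Key used only for 'is this already known?' lookups."""
    # NFC first, so that "I" followed by a combining dot becomes "İ".
    s = unicodedata.normalize("NFC", s)
    return s.translate(I_VARIANTS).casefold()


names = ["haşim", "HAŞIM", "HAŞİM", "haşım"]
print(len({match_key(n) for n in names}))  # 1
```

This deliberately over-matches: a Turkish speaker would call some of these different names, which is the trade-off being argued for above.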
It is fine to have multiple email addresses connected to a single inbox. Email providers already do normalization like this that is not baked into the spec. Gmail for example treats johndoe@gmail.com and john.doe@gmail.com the same.
Well, maybe Google should check their own implementation, because interesting things happen when accounts for both versions exist.
I have an account with the dot that was made back when Gmail was invite-only. A few years ago, someone created an account without the dot. Yes, I'm receiving their mail, and have no way to contact them, because everything I send out comes back to me.
Not only do people do this, it's actually extremely common, ask anyone with commonname@gmail.com. Someone here said their original email address has become unusable because it gets thousands of messages a day that are intended for other people. "That guy" might actually be multiple people all making the same mistake with your email address.
My coworker told me there's a complete stranger who, every time she emails her son, accidentally sends the email to him first. This has been going on for years, she makes the same mistake every time, she doesn't learn her son's actual email address, and she doesn't learn to press "reply."
I'm 100% sure the "non-dot-version" doesn't exist as a separate account.
AOL used to allow users to email other users without using the @aol.com extension. Back in those days I (prior to any capacity to negotiate a sensible ISP, for the record) had an email that matched a common subject line that was inundated by people who typed their subject in the To line and then wrote an email.
And that’s why I regularly receive other people’s mail in my gmail inbox, and why I have stopped using gmail for anything important (it’s right to assume that gmail is also sending my emails to other people).
Google’s gmail people aren’t really as smart as they think they are.
That's something I don't understand. I've always given my email as john.doe@gmail.com, and I sometimes receive emails - addressed to Another John Doe - sent to johndoe@gmail.com.
That Another John Doe never, ever had access to johndoe@gmail.com, they just gave a wrong address. That's not gmail's fault.
> That Another John Doe never, ever had access to johndoe@gmail.com, they just gave a wrong address. That's not gmail's fault.
This. My wife and I have two flavors of this.
Her address is firstmlast@gmail.com. People frequently forget the m initial, and somebody else owns firstlast@gmail.com. She's since started using dots (first.m.last) to mitigate the error.
My address is firstlast@gmail, where first and last are not globally common, but are fairly common in Scotland. Once a year or so, I receive email for somebody else that shares my name. I don't know his real email, but I've been "invited" on his family vacations 3-4 times now. Infrequent enough that I just respond "thanks for the invite, but I think you'll be disappointed when I arrive and not the Alistair you were expecting."
Or someone assumed (or just tried) that other email address.
I signed up for my university's email forwarding for alumni early on and got my first name as my email. For quite a while, I would get emails, including fairly sensitive ones, sent to me by not yet very email savvy people just assuming you could send an email to someone's first name and it would get to them.
Nah, it happens with mangled names that no bot would ever try to stuff, too. E.g. I own derefr@gmail; but I sometimes receive email from people trying to reach a man named "Derek" — who almost certainly owns the address derek.fr@gmail, but probably typoed it once as dere.fr@, and now his browser autocompletes that into registration forms for him.
This doesn't really make any sense. It's not just gmail that does this, dots are almost always ignored before the @.
Nobody else can register an email that is the same as yours but without a dot. So the only way you receive someone else's email is if they give the wrong address.
> It's not just gmail that does this, dots are almost always ignored before the @.
That's not my experience. Which non-gmail email software ignores dots before the @?
Thinking about this, I guess the sending MTA doesn't care about dots; it goes RCPT TO: <address.with.dots@example.com>. The receiving MTA then has to validate that address; it does that using some account database that isn't typically part of the MTA - it could be a unix account (no dots!), a database table, or an LDAP user. Finally it passes the mail off to a delivery agent, which hopefully relies on the same account database.
So the elision of dots appears to be a feature of certain account databases. So which account databases elide dots?
MTAs can be configured to apply additional transforms before looking up the account. For example, postfix's virtual table [0] can be used for this and on my server it does elide dots in the local part (along with everything else).
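To make the flow concrete, here is a toy model of the receiving side. It is not Postfix configuration or any real MTA's code; the account table, the paths, and the function names are all invented, and the dot-eliding transform is just one possible choice.

```python
# Hypothetical account database: lookup key -> mailbox location.
ACCOUNTS = {
    "addresswithdots": "/var/mail/addresswithdots",
}


def lookup_key(local_part: str) -> str:
    """Transform applied before the account lookup (an assumed rule)."""
    return local_part.replace(".", "").lower()


def resolve_mailbox(rcpt_to: str):
    """Map a RCPT TO argument to a mailbox, or None if unknown."""
    local, _, _domain = rcpt_to.strip("<>").rpartition("@")
    return ACCOUNTS.get(lookup_key(local))


print(resolve_mailbox("<address.with.dots@example.com>"))
print(resolve_mailbox("<nobody@example.com>"))  # None
```

The sender never sees any of this: it transmits the address as written, and the transform lives entirely in whatever the receiving side consults.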
> So if my email is HeLLo@example.com because I want to be cute people will have to try 6 times before they finally get the right email address? Imagine telling that to someone in person.
Yeah. That's fine.
If my email is LLLLLLLLLLLL@example.com because I want to be cute I have to tell people to type exactly the right number of L's. Do you think they should just be able to type a lot of L's and as long as it's somewhere near it counts as the same email address?
In a world where email addresses are always case sensitive, everyone will use lowercase (like they pretty much always already do anyway), and it'll be fine.
The only reason "LLL" and "lll" mean the same thing is because currently email addresses are (sometimes) case insensitive.
In a world where email addresses were "obviously" case sensitive, "LLL" mapping to "lll" would be just as crazy as "LLLLLLLLLL" mapping to "LLLLLLLLLLL".
They just seem similar, to humans. But they're different strings.
By maps to, they mean it counts as the same email address. 111@ and lll@ do not do that. The font has no impact on the email spec. However it can add extra confusion.
I doubt anyone is crazy enough to implement the email spec for comparing emails, to be fair. I would honestly be surprised if any publicly available mail agent or server supports that craziness.
> Case insensitivity was a huge mistake in computing really
Ah, youth. There was little choice! Sometimes you had only six bits for a character; sometimes your bytes could be from 1-36 bits wide, depending on what you wanted for your program, so you might have systems that mixed six-bit (only upper case) and early ascii (two cases) and so for matching you had to be case insensitive.
It’s easy to look back and say “those people were so stupid” but they weren’t.
That's a surefire way to increase customer support load, as users have mix case for their emails all the time. They might sign up on a phone, or login on a phone. They might just be hamfisted. Sure, it might be their problem, but they'll make it yours.
Some places even support the same password with inverted caps. So say, if your password was "passWORD", then "passWORD" and "PASSword" would work. If the first hash fails, they'll invert the case, re-hash, then check again.
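A sketch of that retry logic, assuming a salted PBKDF2 hash from Python's standard library. Which sites do this, and with which hash, is not something the thread specifies; the function names and iteration count are placeholders.

```python
import hashlib
import hmac


def hash_password(password: str, salt: bytes) -> bytes:
    # Iteration count is an arbitrary placeholder for the example.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


def verify(candidate: str, salt: bytes, stored_hash: bytes) -> bool:
    """Try the password as typed, then with every letter's case inverted."""
    for attempt in (candidate, candidate.swapcase()):
        if hmac.compare_digest(hash_password(attempt, salt), stored_hash):
            return True
    return False


salt = b"example-salt-16b"  # fixed here only to keep the example repeatable
stored = hash_password("passWORD", salt)

print(verify("passWORD", salt, stored))  # True
print(verify("PASSword", salt, stored))  # True (caps lock was on)
print(verify("password", salt, stored))  # False
```

Only the fully inverted variant is accepted, so this mostly forgives a stuck caps lock rather than arbitrary case mistakes.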
Computers have to adapt to people and not the other way around.
E-Mail wouldn't have been widely accepted with case sensitive addresses.
People expect that MyName@mail.com reaches the same person as myname@mail.com or Myname@Mail.COM just like letters reach their recipient irrespective of the upper and lower case of the recipient's name.
Computers are there to make the lives of the users easier, not the programmers.
I'm not sure this is true. In my experience a lot of people actually do think that it's case sensitive. Many times I've heard someone describe the capitals while verbally telling someone their address.
However it being insensitive has probably helped a lot of times where people make mistakes in explaining or copying those capitals.
As a sender, you should always treat email as case sensitive. As an email host/receiver, you can and probably should choose to be insensitive. But never assume any other host works like that.
Similar to how gmail ignores . in emails but other hosts do not.
> In my experience a lot of people actually do think that it's case sensitive.
I don't think they really do. They may think case matters somehow, and so may be careful to reproduce the exact case that they used before, but I don't think many people would expect JohnDoe@gmail.com and johnDoe@gmail.com to be two different email accounts.
In the general population, how many people do you think understand that username (including an email address) probably isn't case sensitive but that password almost certainly is?
> Case insensitivity was a huge mistake in computing really
original ASCII only had uppercase. When lowercase got added, gradually with newer systems and software, without case insensitivity you would have had massive incompatibilities which probably would have hampered or even arrested the introduction of lowercase in new systems
and all over again, when microcomputers first came out, they came out with uppercase only to cut complexity and cost on simple systems.
I don’t see how adding lowercase to ASCII could have resulted in massive backwards compatibility issues considering that uppercase-only ASCII existed for only a few weeks in 1963. Surely there was not widespread adoption of ASCII in the spring of 1963.
I'm not old enough to be an expert, just old enough to have used the leftovers (new computers were too expensive!): many many devices were uppercase only, card punches, ASR-33 teletype, the Telex system, lineprinters, "glass" teletypes, FORTRAN, COBOL, and now that I think of it, Morse code/telegraph had always been. There was a ton of infrastructure that was uppercase only. You may be right that it wasn't ASCII's fault. Perhaps the first version of ASCII made sure to encompass what had been, and then saner heads said "let's allow for future progress".
i'm not going to explore the entire history, but just looked this up. TL;DR example, the addition of lowercase characters represented a jump from 6 bits to 7 bits at the hardware level:
"A six-bit character code is a character encoding designed for use on computers with word lengths a multiple of 6. Six bits can only encode 64 distinct characters, so these codes generally include only the upper-case letters, the numerals, some punctuation characters, and sometimes control characters. The 7-track magnetic tape format was developed to store data in such codes, along with an additional parity bit."
Capital letters aren't a matter of font. There's a difference between the river phoenix, a magical bird which lives by the river River, and River Phoenix, the actor. It isn't a presentation-layer difference, it's an encoding-layer difference.
Then again, if we could start from scratch we'd probably just have a single global phonetic language without case and with a limited number of total chars.
Honestly non-phonetic glyphs are probably an easier lift.
Fun fact: the reason we pronounce "ph" like "f" is because the Greek letter was originally pronounced like p-h, at the time Romans began stealing words, but then the Greeks started pronouncing it like "f" and the Romans followed suit, but kept the old Latinization of "ph", because they'd already carved it into stone.
English spelling is largely phonetic... but it captures the phonetic spelling across dozens or hundreds of shifts in the spoken language. Unless you can stop people from changing how they speak, any phonetic spelling reboot is either going to suffer from the same problem, or words will constantly change how they are spelled to keep up with the spoken word.
if you view meme culture as a trend towards increased symbol density in linguistic communication due to their ability to convey emotions, overtones, implications, and other nuance ("shaka, when the walls fell") then the increased symbol space of chinese/japanese/korean characters looks interesting.
conversely it's certainly been an obvious disadvantage (posed a lot of problems and imposed a lot of awkward workarounds) for mechanical/electronic communication - now you have to enter the characters too, and you have to express that larger number of characters efficiently. In practice, a lot of electronic communication is just simplified to ASCII because that's the set that works universally. Someone used the example of Eszett (ß) being transliterated as "ss" in german, dunno if chinese uses anything similar, but it wouldn't surprise me, obviously Japanese has romaji too.
but at a human-interface level, fundamentally there is a limit to how many symbols people can absorb. Even with latin characters, people at best will sight-read whole words to increase symbol rate, but, the natural evolution is to use 1 character to represent 1 symbol/word, that's the highest possible rate at which humans can absorb symbols for a written system. And in turn you could in principle absorb a "word of symbols", which is a sentence, similar to how western readers can sight-read a word of our 1-character glyphs.
by "increasing the dimensionality" of the symbol, you increase the effective symbol rate, similar to how memes use subtext/etc to convey more nuance than a pure text can by itself.
I think anyone who deals with end users would disagree. It seems impossible to get users to abide by a specific casing. Things would break all the time.
> as a sender you should always treat them as unique emails
This is already how it works. Senders are not supposed to assume that local-parts are case-insensitive. (Some buggy implementations ignore this requirement and upper-case everything, but the serious implementations don’t).
The correct answer is "who cares" though. Languages which use cased alphabets.. use cased alphabets, you don't get to argue with it.
You also don't get to argue with the fusional position changes in Arabic, or the ligatures in Devanagari, or the places within a square the featural particles of Hangul must be printed in.
You are correct that it's not negotiable when supporting that language, but it is negotiable which languages and writing systems a given application supports.
I think you have correctly identified an implausible claim!
Of course, most languages aren't written at all ... or at least don't have a traditional written form that is sufficiently well established for someone to say that the "language" has case rather than a particular (proposed) way of writing it.
However, I rather suspect that the majority of languages in which books are published use some variant of the Latin alphabet and do, therefore, have case. (The only language I've heard of that uses the Latin alphabet without case is Lojban!)
On the other hand, if you weight languages by the number of (native) speakers, since about three quarters of the world's population lives in China, India, Pakistan, Bangladesh, Japan or Korea, probably it's true that most people don't use case in their main language.
It's just blown my mind that case might be a thing non-English speakers would need to learn to be able to read and write English. (Same for non-English, but that doesn't blow my mind in the same way.)
Yeah, we always say that the English alphabet has 26 letters, but there are actually 52 unique symbols you have to learn to read, or 104 if you also have to read/write cursive. Some of these symbols are very similar (if you learn 'o' you will definitely recognize 'O', and likely the cursive variants as well), while others are quite different ('g', 'G', and the cursive upper case G might as well be different letters altogether; the lower-case cursive does resemble 'g').
With joined-up letters (“cursive” in the USA I guess) different languages have different letterforms, and sometimes multiple systems.
For that matter typesetting rules vary by language as well: not just the obvious hyphenation rules but spacing as well. Just pick up a book in, say, French or Russian and you can tell at a glance (without even looking at the letters) that it’s not in English.
Right, 104 symbols would be the minimum if you do need to read/write cursive.
However, I don't agree with your point about typeset text. You're right that the styles differ, but if you have learned one style, and know the language of the text, you will not need any significant amount of time to read a different style of typesetting.
Russian of course normally uses the Cyrillic alphabet, not the Latin one, so obviously you do have to learn a whole new set of symbols to understand it even if you can read Latin symbols. And of course French uses slightly more letters/letter forms than English, with the cedilla and four accents (aigu, grave, circonflexe, and the very rare tréma).
Lots of accents when using Cyrillic to write non-Russian text.
I didn’t mean the typesetting differences made reading a different language in any way hard, merely pointing out that there are lots of different aspects to text in different languages even when the alphabets are basically the same.
And there are a fair number of (inconsistent) rules for casing. Proper nouns vs. common nouns. Camel case (or other non-standard capitalizations). Title case. "Standard" body copy.
Are they a majority of languages if counted? I guess it also matters if you count the number of languages or if you count the number of people writing them.
Let’s just use Greek and its descendants (Latin, Greek and Cyrillic alphabets) and Brahmic-derived writing (we said “alphabet” when I was a kid but now people say “abugida”). There are about 200 languages spoken in Europe, all of which use these alphabets. India has over a hundred “major” languages and about 1600 others, most of which use Brahmic scripts (the major exception, Urdu, uses a form of Arabic writing). So a big imbalance!
Oh, you want speakers? Merely counting people who read Hanzi + Arabic-alphabet readers + the Indian subcontinent gets you more than half the world’s population. And there are hundreds, maybe over a thousand writing systems.
When we sum up realistically, then world-wide the number of users of writing systems with case ("bicameral" scripts) is about equally balanced with the number of users of those without.
I think that's GP's point: tolower() looks like it works well to English speakers but it's subtly wrong and will fail unexpectedly for people with other locales.
Internationalization is hard. I think it's too much to expect software written for a specific market to handle all languages in the world.
For example, in Danish we have 3 letters (æøå) that are not common in the Latin alphabet. I can't go to Germany or Turkey and expect people to be able to write those letters when entering input into a local system.
Fun thing I've run into in a Germany-based but increasingly international company: German always spells out umlauts and eszetts when going to lower ASCII for email addresses ("Schäßler" -> "Schaessler"), but Hungarian does not. Not sure how Turkish ö and ü get fully lower-ASCII-ized there, but in Germany, they get spelled out "oe" and "ue" as if they were German ö and ü. This isn't as much of a corner case as one might think - there are a lot of people with Turkish names in Germany.
I’m always amused by these kinds of nonsensical usage for Turkish in Germany (ü->ue) but the thing that really trips people up is that Turkish has two letters that look like i, one with the tittle and one without — in both cases.
Germany recently adopted an uppercase ß (and got it into Unicode) to try to help with case roundtripping but I’ve never seen it in the wild. And let’s not get into obsolete German Fraktur ligatures like tz or ch which also had no upper case equivalents.
> Germany recently adopted an uppercase ß (and got it into Unicode)
The parenthetical part is true. It was an uphill battle, but not because of the consortium, but because of what the tropes wiki would describe as executive meddling.
The adoption is not recent, but about 110 years old. You have the wrong idea because of sloppy journalism.
> I’ve never seen it in the wild
I see it all the time. Maybe you are undercounting. Pay attention to non-standard letterforms on hand-written signs, and you also have to include print media where someone substituted lower-case ß in absence of a glyph in a font. This is a typographic mistake, but the intent is clear.
As the Unicode standard describes (e.g. the same section 5.18 mentioned above), case mapping depends on locale, so lowercasing the same string may give different results on different hosts. The truth of lowercase(x)==lowercase(y) is therefore not universal and depends on the host locale.
Indeed, fundamental to the problem is that most Unicode text doesn't actually carry the relevant locale information... (Of course, one probably wouldn't want to rely on sender-specified locales for email addresses when deciding address equality -- that would open one up to all sorts of potential weird scenarios, i.e. a nightmare for security).
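A minimal Python sketch of the gap between "lowercase both sides" and a real caseless comparison. It only uses the built-in `str.lower` and `str.casefold`; the sample words are my own illustration. Note that both methods are locale-independent in Python, so locale-specific rules such as the Turkish dotted and dotless i are not handled by either and would need a locale-aware library such as ICU.

```python
# Lowercasing and case folding are different operations.
a = "Straße"
b = "STRASSE"

# lower() keeps the German sharp s, so the two lowercased forms differ.
same_by_lower = a.lower() == b.lower()

# casefold() applies Unicode full case folding, which maps "ß" to "ss".
same_by_casefold = a.casefold() == b.casefold()

print(same_by_lower, same_by_casefold)
```

The first comparison fails and the second succeeds, even though a human reader would call the two spellings the same word in different case.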
Presumably the case-insensitive version is also doing unicode normalization as well, which is what a byte-level comparison of tolower versions would miss
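To illustrate the normalization point, here is a small Python sketch using the standard `unicodedata` module. The helper name `fold_key` is hypothetical, and the normalize/fold/normalize order is a simplification roughly along the lines of Unicode canonical caseless matching, not a claim about what any particular mail system or database does.

```python
import unicodedata

def fold_key(s: str) -> str:
    """Build a comparison key: decompose, case-fold, then recompose."""
    decomposed = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFC", decomposed.casefold())

composed = "Jos\u00e9"      # "é" stored as a single code point
decomposed = "JOSE\u0301"   # "E" followed by a combining acute accent

# Comparing lowercased strings compares code points, so these differ.
naive_equal = composed.lower() == decomposed.lower()

# Normalizing before and after folding makes the two spellings agree.
key_equal = fold_key(composed) == fold_key(decomposed)

print(naive_equal, key_equal)
```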
Judging by the number of times I've had this fight with developers, and seen other people argue it online, I'd suggest the biggest falsehood programmers believe is: Your random VPS can just send email from any address you like and expect it to be delivered.
Or any SaaS for that matter. It takes some effort to get email sending right. It is also a field like SEO where there is a large chance that you get lucky, and things work even when the setup is incorrect.
Just because you cannot set up a mail server correctly... I have installed hundreds of email servers (2022), from AWS to Hetzner to Vultr and so on. Yes, your random static IP can deliver reliable email IF you have:
And your IP was never on a blocklist before it was allocated to you. And your domain name doesn’t get blocked for being too algorithmic looking according to the spam ML model. And your domain is old enough not to get greylisted. And not too many of your users get reported as spam. And your domain isn’t in .ru or .cn. And your netblock isn’t accidentally put into those regions in maxmind.
>domain name doesn’t get blocked for being too algorithmic looking according to the spam ML model
Check your domain on your own SpamD/SpamAssassin, but I never had that problem.
>And your domain is old enough not to get greylisted
Everything is greylisted at first contact, email servers don't know how "old" a domain is.
>And not too many of your users get reported as spam.
Yes you can be a spam-bot too, that's why you check your outgoing emails for spam too and limit the recipients and mail frequency outside your domain.
>And your domain isn’t in .ru or .cn
That's another problem, i talk about technical stuff and not political ones, if your recipient don't want Chinese mails than this is your (or his) (non-)problem, that's a human problem.
>And your netblock isn’t accidentally put into those regions in maxmind.
ah yeah, because I just want to do trial and error until I maybe find an IP that isn't flagged. fwiw, several VPS providers will keep the same IP assigned to you if you just create a new VPS after deleting the old one.
> Everything is greylisted at first contact, email servers don't know how "old" a domain is.
just because your mail server doesn't know that it doesn't mean no mail server does. see for example https://spameatingmonkey.com/services, which provides reputation lists for recently registered domains.
right, because it's totally up to them to ensure some third party that provides PAID services has their data set correct? this is the responsibility of maxmind, not the VPS provider's.
from the earlier comment:
> Site-verification for gmail/microsoft
yeah, if you want to send emails, just go to the 5 biggest providers and submit to their "voluntary" programs to reduce the likelihood of getting marked as spam by them.
People have pointed out numerous factors to you that are beyond one's control - including your control - and your response continues to be to deflect and to rudely deride their abilities. What exactly are you trying to establish? Everyone is just pretending that email is difficult?
I don't think it's difficult, in the sense in which network security, or compiler design, or composing orchestral music are difficult. It can be complicated, because there are a lot of moving parts. And because your service is going to deal with many other services that are not under your control, and which you probably never heard of, there are submerged rocks that you can stub your toe on.
If you get your ducks lined up first (good domain, clean hosting etc.), then setting up a mailserver can be pretty much a cookie-cutter exercise. There are practical challenges, but no part of setting up a mailserver is technically hard.
How? It means I have a reasonable idea of the complexity involved. Sure, "too much" is a subjective term, but .. look around, there really aren't all that many people doing it any more, just like everyone bemoans the movement of blogs to social media.
I still have the domain, because it's also too much hassle to change an email address which you've used in a lot of places, it's just delegated to a small ISP to run the actual MTA.
> That's another problem, i talk about technical stuff and not political ones, if your recipient don't want Chinese mails than this is your (or his) (non-)problem, that's a human problem
The problem is that a recipient may want Chinese emails, but an operator of an anti-spam system may not know this and just block the whole country using GeoIP and TLD blocks as a poor man's anti-spam measure. Geo filters IMHO are overused, but end users often have no easy way to communicate their problems to whoever sets such filters. In a largish company a user who suspects that emails addressed to him are blocked for no good reason will have to raise a ticket with the company's IT, which in turn will have to raise a ticket with a vendor, and in a month, if the user is lucky, the problem will be resolved, by which time the sender will either have found another way to communicate or will have given up. And the user needs to know that someone has unsuccessfully tried to send him an email he wants to receive in the first place.
Yeah look, if your spam operator just blocks a whole IP block because he doesn't know better... change that system. If he thinks blocking a TLD is a good idea, fire him/it.
My point was that the fact anyone needs to do any of those things means a developer can't randomly put any address they want into PHP's mail() function (without telling anyone) and expect it to work.
There's no point accusing me of not being able to do things here - I've run my share of mail servers.
In terms of source vs. destination, they already said "from" so that's taken care of.
When it comes to "from any sender you like" vs. "from any address you like", I'd be more likely to interpret the former as talking about the username. Which would be the wrong interpretation. So I think your suggested wording would be unhelpful.
"Sure it can, if you configure it correctly." implies that doing it correctly is basically always possible.
If you are trying to imply that, it's reasonable to counter-argue that no, it's not always possible. It's not some easy "sure it can".
If you're not trying to imply that, then you're burying the lede and hiding the important part of the equation. The ability to do things correctly is not going to fix the problem.
> Your random VPS can just send email from any address you like and expect it to be delivered.
Your response backs up the parent's point, that a lot of developers don't realise that you have to do all those things to stand any chance of reliable email delivery.
Too many people think you can just slap a mail server on a VPS and expect it to be able to reliably deliver mail.
Oh boy. I just set up my company a few months back; unfortunately I never had to go into email too much, so I didn't set up DKIM/SPF/DMARC. Then I started receiving emails from my own domain, I panicked because spoooofingg (read with spooky voice) and set up all three. I was bombarded for an entire week with DMARC reports from all the major email services.
Fortunately I still only have a couple clients so no one was the wiser.
UPDATE: It ended up in a simple scare, nothing was affected and I'm not in any list that I know of, in case you're wondering
All true until it comes to outlook.com and your sent mail just vanishes without any error. It just accepts the mail but doesn't deliver. Not in Inbox, not in Junk.
Users always have immediate access to their mailboxes
I imagine lots of people do not have their email account attached to their phone. People may be on a shared computer (a library perhaps) and not readily have access to the password (if it's randomly generated and stored at home), or their mail provider may be blocked where they are (things like Hotmail etc. were blocked whilst I was in education).
I'd say there are lots of services that require you to validate your email address immediately after signing up, even where an email address is not required by the service itself. Having a grace period to verify your email in such circumstances is great, but I see it very infrequently.
> An email address like ^_^@example.com or +&#@example.com is invalid
My current employer autogenerated a company email address for me including the apostrophe in my last name. I couldn't believe that was a legal character, but I looked it up, and sure enough it is. Of course, plenty of other internal systems reacted the same way I did, and I frequently generate errors whenever I try to register myself with random services :p
I’ve seen a lot of systems, including corporate systems for internal use, reject apostrophes in email addresses (and sometimes even in other fields). Apparently the developers are too lazy to deal with strings properly and fear SQL injection attacks, and perhaps they don’t trust all the other systems they may interface with. So their escape hatch is to prevent these from being allowed.
(“Little Bobby Tables” from xkcd comes to my mind whenever I see these restrictions)
I wonder if this is a good indicator of a bad product/company to be a user of. If they're so uncertain about their tech stack that they have to prevent certain characters from being used in passwords/emails/etc, maybe it's not something you should trust?
Honestly, I would actually consider this best practice. There is absolutely no reason to go to the trouble of allowing special characters in emails and fighting every system you encounter.
Note that allowing Unicode Letter characters is a whole different topic, and in fact much less risky than allowing random punctuation. At least for the vast majority of people, this is much more important to personal identity than having your name spelled without a quote mark that will anyway confuse numerous systems where you may want access.
I wrote our onboarding system and had it strip apostrophes from names. Some people object, but they object more when random websites refuse to let them sign up.
A few years back I was asked to set up a mail server on an AWS server for some small non-profit organization. I am a software developer of 25 years with a lifelong habit of tinkering with OS installations and the like, so I thought "sure, how hard can it be?". Here is my warning for you all: do not enter this highway to hell unless you actually are a sysop who is specialized in setting up email servers.
Honestly, I've been building and running ISPs of every scale since the late '80s, and I'm a source code contributor to some widely used mail servers, so I am that very model of someone who others might suppose knows what they are doing, and still the prospect of setting up a reliable email service from scratch today would give me pause to say "are you sure an existing service can't be used"?
I run my own email using mail-in-a-box running on a 5 dollar Linode, works like a charm with almost no maintenance (the little maintenance I do is always requested automatically by the system itself, and I'm notified by email)
* Blocking sending to domains listed in [0] or similar is a useful way to prevent spam or sybil attacks with minimal impact on authentic users
I hate this. Motivated attackers can trivially circumvent it at minimum effort and cost while it further normalizes centralization and strengthens surveillance capitalism as the barrier to use unlinkable e-mail for different service providers for a normal person becomes untenable (curiously equally disposable domains from major providers are absent from most of these lists, supposedly precisely because it is disruptive). I'm ambivalent on even sharing the link for the risk of a dev reading this going "oh, neat!"...
RFC 822 and some email-related systems accept commas as valid and to mean multiple receivers. This can be dangerous if user-inputted strings aren't properly filtered. I recall a website that would accept "bob@example.com,admin@company.com" as a valid email, send the verification to both emails, and grant administrative privileges to the site once verified, since the email clearly ends in @company.com and belongs to the company!
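A rough Python sketch of the flaw in that anecdote and one way to avoid it. Both function names are hypothetical and `company.com` is simply the domain from the example above. The strict check is deliberately narrower than the RFC grammar (it also rejects quoted local parts that contain spaces or commas), which seems like an acceptable trade-off for a privilege decision.

```python
def naive_is_company(addr: str) -> bool:
    # The flawed check: only looks at how the string ends.
    return addr.endswith("@company.com")

def strict_is_company(addr: str, company_domain: str = "company.com") -> bool:
    """True only for a single address whose domain is exactly company_domain."""
    # Refuse anything a mail library might interpret as an address list.
    if any(ch in addr for ch in ",;<> \t\r\n"):
        return False
    local, sep, domain = addr.rpartition("@")
    if not sep or not local or not domain:
        return False
    # Domains are case-insensitive, so compare them in lowercase.
    return domain.lower() == company_domain.lower()

attack = "bob@example.com,admin@company.com"
print(naive_is_company(attack), strict_is_company(attack))  # True False
```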
I'm curious, I've always validated an email as containing only one '@' and kept it that simple. This validation would cause that input to be rejected. I would love to know if my assumption was right: can an email address only have one `@`?
What GP is saying is that yes, multiple @ could be a valid address. Your solution (rejecting them) sounds reasonable for sign-up checks I guess but you want to keep receiving incoming mail from such addresses for normal communication.
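As a small Python sketch of why counting `@` signs is not enough: a quoted local part may itself contain an `@`, so the separator is the last one in the string. The helper name `split_address` is hypothetical, and this does no further syntax checking.

```python
from typing import Tuple

def split_address(addr: str) -> Tuple[str, str]:
    """Split an address at the LAST '@' into (local part, domain)."""
    local, sep, domain = addr.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError("not an address: " + repr(addr))
    return local, domain

# A quoted local part that contains an '@' of its own.
odd = '"a@b"@example.com'
print(odd.count("@"), split_address(odd))
```

Splitting at the first `@` instead would produce a local part of `"a` and a domain of `b"@example.com`, which is wrong on both sides.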
I once scanned many millions of emails to see what kind of "strange" addresses people are using. Everyone uses "@one-or-more-label.tld"; the only ones that didn't were bad spam scripts.
For all practical purposes for public services, it's a "truth", with the biggest real-world exception being "user@localhost" and such for local email delivery (or email delivery inside your network).
It depends which email addresses you are talking about. Email works with local addresses and mail relays will cleanup any incoming destinations without a host or domain name to be routed to whoever handles local mailboxes.
I tend to use something like this rule, too. Except that I don't like having an IP in the domain part (it's valid, but screams spam), and a FQDN instead of a (possibly local/internal) hostname. So /.+@.+\..+/ it is for me (plus checking it's not an IP). Unless of course the application interacts with intranet hostnames, but imho those should be avoided these days anyway.
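For what it's worth, that rule could be sketched in Python like this, using the standard `re` and `ipaddress` modules. The function name `looks_deliverable` is made up, and this is only the loose sign-up filter described above, not an RFC-complete validator.

```python
import ipaddress
import re

# The loose pattern from the comment: something, '@', something, a dot, something.
LOOSE_ADDRESS = re.compile(r".+@.+\..+")

def looks_deliverable(addr: str) -> bool:
    """Loose sign-up check: must match the pattern and must not use an IP as domain."""
    if not LOOSE_ADDRESS.fullmatch(addr):
        return False
    # Take everything after the last '@' and drop literal brackets if present.
    domain = addr.rpartition("@")[2].strip("[]")
    try:
        ipaddress.ip_address(domain)
    except ValueError:
        return True   # not an IP address, so accept it
    return False      # bare or bracketed IP address, reject

for candidate in ["user@example.com", "user@192.0.2.1", "user@localhost"]:
    print(candidate, looks_deliverable(candidate))
```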
Imho it's more important to allow a user to fix a mistake: "We sent you a confirmation e-mail to $email_address. Spotted a mistake? Click here to change your email address."
No. For IPv4 the dotted syntax must be used and for IPv6, the colon syntax preceded by the string “IPv6:”. In either case the whole thing must be enclosed in square brackets.
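A short Python sketch of recognising those bracketed address literals with the standard `ipaddress` module. The function name `parse_domain_literal` is hypothetical, and it only covers the two forms just described.

```python
import ipaddress

def parse_domain_literal(domain: str):
    """Return an IPv4Address/IPv6Address for a bracketed literal, else None."""
    if not (domain.startswith("[") and domain.endswith("]")):
        return None
    inner = domain[1:-1]
    try:
        # IPv6 literals carry an "IPv6:" tag in front of the colon syntax.
        if inner.lower().startswith("ipv6:"):
            return ipaddress.IPv6Address(inner[5:])
        # Anything else inside the brackets must be dotted IPv4.
        return ipaddress.IPv4Address(inner)
    except ValueError:
        return None

print(parse_domain_literal("[192.0.2.1]"))
print(parse_domain_literal("[IPv6:2001:db8::1]"))
print(parse_domain_literal("192.0.2.1"))  # no brackets, so None
```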
I am certain. There must be something before the @.
The @ itself is optional (I believe the RFC disagrees, but in practice it is), but if it's there, there must be something after it too. So the only really required "field" is the one before the @.
ok, i admit the rfc may prescribe something like .+(@.+)?
i've just drawn the conclusion from the fact that the local part is interpreted solely by the receiving party. that's why case-conversion should be avoided, since the receiving party may be case sensitive. so thinking further, the local part can perhaps be empty, depending on the implementation.
Microsoft (at least used to) require account passwords to not include the part before @ in email addresses. My email address was a@(domain).net, and therefore I was prevented from using any password including the letter "a".
It's not about falsehoods I believe about emails, it is about knowing that emails will be used in a million different ways in a wide range of often legacy software, and I'd rather force users to use a normal email like user.name@domain.tld than debug some early '90s cow milker at 2am on a Saturday, standing knee deep in cow piss in the middle of Nebraska, just because some smartass has a backslash or emoji in their email address.
I don’t disagree on principle but I’m genuinely curious how you arrived at your counter-example. My probably not entirely correct mental stereotype would have your Nebraskan farmer using an email address like benjomcfamilyname13464@megacorp.whateverispbranding.weirdlyunfamiliardomain.definitely.com
I'm missing "if a person confirms they are in control of an email address it will always be theirs". I recently got bitten by it when an IT dept recycled email addresses, so a new hire got the email address of somebody who had left some time ago. They got some of that person's privileges. Oops.
> Any one email address refers to only one single person
This one hit me. My grandparents share a computer and one email address (just as they share one physical address and phone number), and you wouldn't believe how many services, including Google, fail this rather simple test.
And in case you think this is a weird one: until not that long ago, every way to contact people was to the house they stayed in. Letters typically had a name, but if you were married and had shared accounts, either person could need to read those letters.
Mildly interesting problem: law says email with financial information needs to be encrypted. Email goes from the Accounts Payable clerk to Verizon Accounts Receivable, but triggers the automatic encryption process. One needs to create a free login and read the email in the (secure) web portal. Verizon complains. Talk with the Verizon AR manager and he tells me "I have 40 people who access that mailbox; I am NOT going to create a username and password in your system and then share that with those 40 people. What happens next week, when one of them leaves?"
I wish that such articles explained why those are falsehoods. In this case, it's clear to me, but in many others, I could not understand why half of them were false.
That's unfortunate because those articles are very valuable to anyone building software.
Or they might have an email address but it not be .edu - in the UK they're .ac.uk for example. Used to annoy me when US sites would use email to validate my studenthood and then miss it anyway (and the service was supposed to be available here).
Oh that makes me think of another: Anyone with a university email address is still a student/faculty member!
> Or they might have an email address but it not be .edu - in the UK they're .ac.uk for example.
At least you have a common suffix. Around here, a student at the computer science department of the federal university of the Rio de Janeiro state (UFRJ) could have an email at the domain dcc.ufrj.br; yes, universities which got on the Internet early enough have their domains directly on the ccTLD.
No common suffix for universities in Germany either. And at my university, comp sci students didn't get just one email address but a second one administered by the comp sci department.
Actually, it does seem that they have a .edu now. I guess those aren't restricted to US-based institutions anymore?
Oh that's impossible to deal with then. Mine at least changed to @alumni.imperial.ac.uk (vs. @imperial.ac.uk) - so you could exclude me (now) if you really wanted to go all out.
You should interpret "programmers believe" not as literally thinking that this is true, but rather as "some systems are designed as if their programmers believed that" - and the latter is very definitely true; "Everyone has exactly one email address" is a very relevant falsehood, because there are systems with this baked in as a fundamental assumption.
I guess what I was getting at is, the earlier articles that this one riffs on were more along the lines of "don't forget to test these edge cases. If you don't, your system will break when people put in their unusual (but valid!) data".
To me, someone baking in "one email address per user" or "no numeric/symbols" feels different — they're not being caught out by tricky real-world data that they forgot to consider, they're just deliberately cutting a corner (e.g. our company naming policy is '[firstname][lastname]' so we're not going to bother supporting numeric input in the intranet input field).
Maybe not if you actually asked them, but you’d be baffled from how many systems are designed to require people to have, and use, exactly one e-mail address, ever. The programmers/designers of those systems did believe, implicitly, that everyone has exactly one e-mail address.
But that's just limiting the scope of development effort on a project to reduce time/cost. It doesn't mean that you believe anything you don't support is impossible to happen.
If an automotive engineer put a battery in an electric car that gave a capacity of 200mi trip, no-one would say "Engineers believe every road has a charging station at least every 200mi".
> You'd be hard pressed to find anyone who's at all competent with the internet who thinks that this is true, never mind a programmer.
On a related note, there are people without an email address who still use the web applications and smartphone apps that require accounts and/or notifications. In some (or most?) developing countries, people use phone numbers as the login identifier and may not have an email address (or not know that they have one and what to do with it).
I mean, it is. Either your email will be delivered successfully, or you get a message that it couldn't be delivered. If it disappears without a trace, then most likely a system administrator has manually deleted it.
Not even remotely true. Not even the first hop from your local MTA can be trusted in that regard, it may accept the message and just immediately bin it, it might accept it but queue it for further verification and not bother sending you a message back telling you this had happened, etc. Between your MTA and the receiving mailbox there could be several hops, any of which might silently send your message to /dev/null.
And that is without considering the same issues with the MUA at the other end (assuming your message is for human consumption and you aren't using SMTP to communicate between automated agents) having its own spam/junk/other filtering (though I suppose you could consider that latter part to not be email transport being unreliable, i.e. it depends on whether you consider successful transport to be "the user saw it" or "their mail server told their MUA it existed").
The only reliable bit of email transfer is the little bit you have full control over, the local MTA. Even then, if you are in a shared hosting environment where you don't control that yourself this could still silently reject your messages (a consideration you might need to make if publishing software others may self-host).
A well-behaved spam filter should either return 5xx/4xx during the SMTP transaction or send a bounce. Returning 250 and then dropping a message without a bounce is a very bad idea, but luckily it is not very common: no open-source MTA known to me does this by default, and some make it hard to configure this (mis)behavior even if an administrator wants it.
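To spell out what those reply classes mean for a sender, here is a tiny Python sketch. The function name `classify_reply` and the wording of the labels are my own; only the meaning of the 2xx/4xx/5xx classes comes from SMTP itself.

```python
def classify_reply(code: int) -> str:
    """Map an SMTP reply code to what the sending side should do next."""
    if 200 <= code < 300:
        # The receiver has taken responsibility for the message.
        return "accepted"
    if 400 <= code < 500:
        # Transient failure: keep the message queued and try again later.
        return "retry later"
    if 500 <= code < 600:
        # Permanent failure: stop trying and report back to the sender.
        return "bounce"
    return "other"

for code in (250, 451, 550):
    print(code, classify_reply(code))
```

The silent-drop case complained about in this thread is the one where the receiver answers 250 and then discards the message anyway, so nothing on the sending side can detect it.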
Unfortunately email is very centralized nowadays and companies known for bad behavior in general (and deleting emails without telling anyone in particular) now control too many mailboxes.
E-mail is a reliable transport. Claiming it is not is like claiming that TCP is not a reliable transport, just because extreme packet loss exists, and firewalls can play any games they like. Yes they can, but that does not make TCP an unreliable transport.
I suppose that SMTP can be considered a reliable transport if you trust all MTAs and MUAs to properly implement the specs. But email more generally is certainly not IMO.
Nothing is a reliable transport if you can't trust the operators to follow the specs.
In practice, E-Mail is quite reliable, even in situations where other systems drop messages. For example, you don't need to worry about dropped messages from any downtime of reasonable length (e.g. for system upgrades) because well-behaved MTAs will retry later - this is the entire basis for greylisting.
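A toy Python sketch of the greylisting idea mentioned here: temporarily reject the first attempt from an unknown (client, sender, recipient) triplet and accept a later retry. The class name, the 300 second delay and the in-memory dictionary are all assumptions for illustration; real implementations also expire entries and keep whitelists.

```python
import time

class Greylist:
    """Defer unknown (client IP, sender, recipient) triplets until they retry."""

    def __init__(self, min_delay: float = 300.0):
        self.min_delay = min_delay   # seconds a new triplet must wait
        self.first_seen = {}         # triplet -> time of first attempt

    def check(self, client_ip: str, sender: str, recipient: str, now=None) -> int:
        """Return the SMTP reply code to give for this delivery attempt."""
        now = time.time() if now is None else now
        key = (client_ip, sender, recipient)
        first = self.first_seen.setdefault(key, now)
        if now - first < self.min_delay:
            return 451   # temporary failure: a well-behaved MTA retries later
        return 250       # the retry came after the delay, accept

g = Greylist()
print(g.check("192.0.2.7", "a@example.org", "b@example.com", now=0))    # 451
print(g.check("192.0.2.7", "a@example.org", "b@example.com", now=600))  # 250
```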
..or you are sending to a Microsoft-hosted email like @outlook.com, @hotmail.com or @live.com and they have decided that your sending server is spammy. In that case they will silently drop your mail.
My experience with this has been more that they will silently drop email if you haven't been sending enough legitimate email for them to classify you as non-spammy. How do you bootstrap that? I have no idea.
IME problems with Microsoft have been entirely with them using netblock-based blacklists that you can't do anything to avoid ending up on unless you buy a whole /24. Thankfully, I don't really need to care about deliverability to MS so I can consider this their problem.
"An email address is the global standard to sign up for applications/services"
False in China, where the norm is to use their phone number. Doesn't mean they don't have an email address somewhere, but it's not how they sign up or sign in, typically.
Pretty much. But like almost everyone, you may spend 80% of your screen time in 1-3 apps whilst simultaneously having an additional 100 logins for lesser visited or even one-time usage interactions.
It seems like it mixes up things people believe and things that people do for ease of use/ease of life, like:
>Anyone with a .edu address is a student
>Anyone with a .edu address is a student or faculty
I don't think most people believe that, but it's an easy filter if you want to give rebates to students and they don't cost you too much, like Dropbox giving increased free quota to people who sign up with a .edu
Which I've had failed as a student in Australia, as we use .edu.au (not for Dropbox, but other services).
As you said though, it's a simple test, and if you don't think about it too much, it's too easy to just test that the email ends in .edu and move on to the next task.
Sure, if you go for international markets you have a lot more work to do. But in most countries you can't use the extension to verify anything. I had a .edu as a university student in Denmark, but I think my department was the only Danish educational institution that had that; the rest just used normal .dk domains.
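A small sketch of why the suffix test discussed above is brittle. The function names and the suffix tuple are hypothetical; the tuple only contains the two suffixes mentioned in this thread and is not a real registry of academic domains.

```python
def domain_of(address):
    """Return the lowercased part after the last '@'."""
    return address.rsplit("@", 1)[-1].lower()


def naive_is_edu(address):
    # The "easy filter": only matches the US-style .edu top-level domain.
    return address.lower().endswith(".edu")


# Hypothetical, deliberately incomplete list for illustration only.
ACADEMIC_SUFFIXES = (".edu", ".edu.au")


def has_academic_suffix(address, suffixes=ACADEMIC_SUFFIXES):
    domain = domain_of(address)
    return any(domain.endswith(suffix) for suffix in suffixes)
```

Even the extended list cannot recognise a university that uses a plain country-code domain such as .dk, so a suffix check can only ever be a rough heuristic.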
My university has lifetime email forwarding; so I use my .edu address as my main personal email address. (I tell people, "In 30 years, if you email that address, it should still get to me.") I once signed up for a SaaS team workflow thing with my personal email address, thinking about trying to use it w/ my family to try to work together on a project; and within a day or two got a call from someone from that company obviously hoping I was actually a decision-maker at that university. Sorry...
Lists like these would be better with some more explanations for the less obvious bullet points. For instance, when/why would an email have multiple From addresses?
Yeah, that one took me by surprise. I had no idea that multiple from addresses were possible. I wonder how many (popular today) email clients support that (for sending and/or for receiving)?
No provider has to allow its own users to use the full range of legal email addresses. But can you receive an email from someone with such an address and reply to them? That's the real test of whether it's valid (and not too buggy) from the perspective of Exchange Online.
• Bit of a personal axe but I wish I'd been aware: If somebody who you've been in correspondence with for years replies to one of your emails, then Gmail will not chuck that reply in the spam folder without notice (the followup reply from the same person didn't suffer that fate for some reason).
That it sometimes without discernible pattern does the same thing to a mailing list I repeatedly told it to mark as not spam is comparatively sane behavior I've come to accept over time. Keeps me on my toes I guess.
The amount of wrong emails I get to [myname]@[major email domain] is quite surprising. I don't know if someone with my name just doesn't know their email address, or the people who input it are careless.
Same! I have an email that contains a common female name (even though the actual email doesn't have anything to do with that name) and I constantly get emails of someone trying to setup accounts for various services /shrug
I didn't see "Every HTML enabled client is configured to show images and other remote content."
That one bites me constantly because I have remote content disabled and financial institutions use web beacons to verify that I'm reading their emails. If they think I'm not reading them, they start sending me paper snail mail again.
I had made an email graphing tool early in my career. The idea was to find instances where an account had sent to and received from an email from any address and put that in some funnel.
The tech worked, I created lovely graphs of conversations but it was far more rare to find hits than expected.
I came to find that almost nobody was sending and receiving emails from external domains. (I also discovered I did not want to know what people were doing with their work email.)
I eventually did figure out the issue. People use aliases for external communications to protect their inbox and create rules around it. Two or more email addresses per account is quite common.
This was a lot harder to solve. Not because I couldn’t create that mapping technically, but because I had to go out and collect everyone’s aliases through user input.
So I’ll add that email addresses correspond to unique accounts.
Usually limits like this come from someone defining a DB column for the value without specifying a length, and the DBMS default being taken. Someone (maybe the same someone) then comes along and adds validation to the input form which forces values to fit in this limited space.
This can vary by tool too. With SQL Server the default for an [N]VARCHAR value if no length is specified is 30 characters (this means CAST/CONVERT can unexpectedly truncate without error which sometimes causes interesting problems to debug), though if you are creating a table in some of the standard tools many of them default to 50.
Though scanning the RFCs to verify other comments in this thread I note that local-part has a 64-byte limit which I was not aware of (or once knew but have forgotten). And it is explicitly stated as 64 octets not 64 characters, so beware of the possibility of non-ASCII characters when validating (I suspect many regexes attempting to validate addresses will get this wrong or not enforce it at all).
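As a rough illustration of the octets-versus-characters point, the sketch below measures the local-part in UTF-8 bytes rather than in characters. The function names are made up for this example, and it assumes the address is carried as UTF-8 (as with internationalized addresses); it checks nothing else about validity.

```python
MAX_LOCAL_PART_OCTETS = 64  # the limit is stated in octets, not characters


def local_part_octets(address):
    """Return the size of the local-part in UTF-8 octets."""
    # Split at the LAST '@' so quoted local-parts containing '@' stay intact.
    local_part, sep, _domain = address.rpartition("@")
    if not sep:
        raise ValueError("address has no '@'")
    return len(local_part.encode("utf-8"))


def local_part_within_limit(address):
    return local_part_octets(address) <= MAX_LOCAL_PART_OCTETS
```

A regex such as `.{1,64}@` counts characters, so a 40-character local-part made of two-byte characters would pass it while actually being 80 octets long.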
Falsehoods programmers believe about phone numbers, especially American programmers, is probably its own separate article.
(Edited to add) This list is good, but is perhaps overly generous - it leaves off the most common irritant: programmers who believe that all phone numbers are exactly 10 digits, American-style, even if there is a country code dropdown.
One of the worst combinations of these falsehoods:
- Everyone has a usable phone number, a residential address that conforms to your validation criteria and can receive SMS - when transiting through an international airport and attempting to use the Wi-Fi.
Sometimes the phone-number input field won't let me type the last digit, unless I first remove the space. Sometimes it declares the number invalid if it doesn't have a space. The form never has hints as to what they consider a valid phone-number to look like.
Same here. If they ask for a full US address, I put 1 Sunset Blvd, Beverly Hills CA 90210 (not sure if that's actually a valid address, but in my experience plenty of web sites think it is).
Many businesses require you to "write FROM the email address you signed up with". At first I tried to argue, but now I'm just spoofing the From header and everyone is happy.
Encryption doesn't mean secure. ROT13 (or any Caesar Cipher) could technically be called an encryption method, but no sane person would consider it secure.
That’s silly. If your boss tells you, “make sure you use encrypted e-mail”, would you be able to get away with rot13? Used casually, the phrase “encrypted e-mail” means securely encrypted e-mail.
Since when has my boss ever been the arbiter of what is secure or not? My bosses happily turned on "secure links" in exchange so that now you can't see where the link in the email goes without clicking on it and following it.
Meanwhile we continue to have users who click on phishing emails, real and test.
Isn't there some sort of federated white list of mail domains?
Something where you can just declare a domain, pay a fee, gain reputation. Aren't those free or in the public domain?
Something with some sort of certificates?
Maybe it would also be a good solution to let users "add" an email contact or sub domain and only receive email from those, and treat unknown emails as untrustworthy.
I’m curious about this one. I’m presuming it doesn’t just mean email clients that support a processed version of MIME messages (e.g. a client that talks JMAP), but I’m not sure what else would be intended.
I am more and more convinced that there should be standards and implementations where email services publish what they accept.
Like a black/whitelist of regexes for email strings they just drop, or domains they accept from, or headers they drop, or that they only accept mails which are signed by key x, y, z.
With that, at least we could formalize the problem and services would know what to expect.
Email is extremely cheap compared with any other advertising method.
It is extremely effective, scales well, and gives much better results than all the known and tried ad networks with big budgets.
As for the topic: all of those assumptions hold for 90% of users.
Most normal, casual users have one email address, and everything else mentioned in the list applies to them.
Cases where something else is true are very rare.
The list is more about exceptions than real life.
Yet another article that can't tell the difference between a "falsehood" and a heuristic.
The end goal of most software is to weed out bogus email addresses, not to weed out email addresses that don't match the standard. "a@a.com" is a valid email address according to RFC 5322. But when a user provides such an address, you can be 99.9999% certain that it is neither the user's address, nor anyone else's.
Many programmers are very much aware that "mymail@[123.123.123.123]" is technically a valid email, but allowing such addresses invariably leads to spam and service abuse, for virtually no benefit. Restricting accepted addresses to "normal" ones is common sense, not a falsehood. The same is true for many of the other supposed mistakes pointed out in the article.
Restricting accepted addresses to "normal" ones is common sense, not a falsehood.
In the case of a@a.com, that's just someone not wanting to give an email address. You can block it with validation rules, but they'll just use some random but not real address like noemailforyou@gmail.com instead. The validation achieves nothing except annoying the user, really annoying anyone whose legitimate address is blocked as a false positive, and makes your email address database harder to clean up if you ever want to send an email to everyone.
Blocking emails because they're "not normal" is validation theatre. It doesn't stop anyone nefarious or who wants privacy, it doesn't stop spammers, and it does stop rare cases of people with weird email accounts.
The assumption that you can get away with a heuristic would be a falsehood.
Attempting to block any email will always be erroneous and pointless - heck, in this day and age, use of gmail.com or outlook.com is a bigger cause for suspicion than a "weird" address, as the big providers are usually the ones used for malicious activities in order to blend in. By trying to be "smart" with a heuristic, all you're doing is exclude a lot of real users.
If you want to guard against "a@a.com", the only sensible and bullet-proof solution is a simple email validation flow as indeed the only thing you can assume about a valid address is that the user should be able to read email sent to that address.
If worried about generating spam from such flow from malicious users, implement rate limits per target address and source IP.
Users also expect email validation at this point and will most likely provide a valid, routable email if they didn't make a typo. a@a.com rarely flies nowadays.
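A minimal sketch of the validation flow with rate limits per target address and per source IP, as described above. All class and parameter names are hypothetical, storage is in memory only, and `send_mail` stands in for whatever actually delivers the message.

```python
import secrets
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_events per key within the last window_seconds."""

    def __init__(self, max_events, window_seconds, clock=time.monotonic):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self.clock = clock
        self.events = defaultdict(deque)  # key -> timestamps of allowed events

    def allow(self, key):
        now = self.clock()
        timestamps = self.events[key]
        # Drop timestamps that have fallen out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_events:
            return False
        timestamps.append(now)
        return True


class VerificationFlow:
    """Send a one-time token to an address; the address counts as valid once confirmed."""

    def __init__(self, send_mail, per_address, per_ip):
        self.send_mail = send_mail      # callable(address, token), supplied by the caller
        self.per_address = per_address  # limiter keyed on target address
        self.per_ip = per_ip            # limiter keyed on source IP
        self.pending = {}               # token -> address awaiting confirmation

    def request(self, address, source_ip):
        # The IP limit is checked first, so an attempt that is later refused
        # by the per-address limit still counts against the source IP.
        if not self.per_ip.allow(source_ip):
            return False
        if not self.per_address.allow(address):
            return False
        token = secrets.token_urlsafe(32)
        self.pending[token] = address
        self.send_mail(address, token)
        return True

    def confirm(self, token):
        """Return the verified address, or None for an unknown or reused token."""
        return self.pending.pop(token, None)
```

Note that nothing here inspects the shape of the address: the only test is whether someone can read mail sent to it.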
> The assumption that you can get away with a heuristic would be a falsehood.
That depends on what you mean by "get away". Plenty of large-scale software systems implement some of the "falsehoods" mentioned in the article. Those systems still operate, so they seem to "get away" with it just fine.
The worldview underlying articles like this one, which I believe is itself a mistake, is that software must be able to accommodate 100% of cases encountered in the real world. But no system, past or present, does that. In the end, it's always users that end up adapting to the system instead. There are countless examples for this, such as people who have no last name filling in their name as both first and last names so they can apply for a passport which requires those fields to be filled.
When someone doesn't fill in the "last name" field on a passport application form, it's much more likely that they overlooked the field than that they actually don't have a last name. When someone provides "mymail@[123.123.123.123]" in an online signup form, it's much more likely that they are trying to do something fishy than this actually being their email address. And that's reason enough to reject such emails outright, without even bothering with the usual validation flow.
> The worldview underlying articles like this one, which I believe is itself a mistake, is that software must be able to accommodate 100% of cases encountered in the real world.
The mistake I believe you are making is removing users from the set you accommodate with absolutely no valid reason or gain.
Sure, there can be valid reasons to discriminate against users, but you better have a valid reason with no non-discriminating option available.
When it comes to email, not only is the discriminating approach completely broken, the non-discriminating approach is free of any problems, and will likely be implemented anyway!
> When someone provides "mymail@[123.123.123.123]" in an online signup form, it's much more likely that they are trying to do something fishy than this actually being their email address.
So people with addresses you do not consider normal must be doing fishy things?
In 2022, fishy things are done with gmail.com addresses as they go unnoticed, are easy to issue and are universally accepted and deliverable. People don't use addresses that raise eyebrows when they're trying to go unnoticed.
> When someone doesn't fill in the "last name" field on a passport application form, it's much more likely that they overlooked the field than that they actually don't have a last name.
Not a valid argument as we can test emails trivially but not test names without very specific registry accesses.
... But this is also a perfect example of an incorrect heuristic. Hopefully you do not consider it fair to exclude all members of societies that do not use last names as fair? That's far more grotesque than your email example after all, and fairly illegal under most anti-discrimination laws out there.
Instead, add a checkbox for "I do not have a last name" if you think that failing to enter your last name is a common enough error to bother those without.
> The mistake I believe you are making is removing users from the set you accommodate with absolutely no valid reason or gain.
That's a very strange objection. Virtually every online service excludes all customers that don't have a credit card. Many exclude all users that don't have a mobile phone or are unwilling to provide their phone number, even where the service has nothing to do with phones. Simplifying/excluding assumptions about users are ubiquitous on the web (and in society in general) today.
> Hopefully you do not consider it fair to exclude all members of societies that do not use last names as fair?
They're not being excluded, they just have to go through some extra steps, which is already true for many, many people for a near-infinite number of reasons. Those extra steps might involve workarounds like the one I mentioned.
> Instead, add a checkbox for "I do not have a last name"
This isn't feasible because there are hundreds of special cases like that. Forms (and software systems) would balloon in complexity and become utterly unmanageable in the "99% case" if every single possibility was catered to.
Just some examples: There are people who don't have a name at all (yes!). There are people who don't know their date of birth (fairly common actually). Should standard forms have checkboxes for all of these cases?
The map is not the territory. Expecting databases to perfectly model reality is an exercise in futility. It's far better in most cases to make the data fit the model (say, by filling in default or approximate values where the true value isn't known or available) than to relax constraints to the point where they become meaningless just because there is the odd entry that doesn't fit in, while the overwhelming majority of entries do.
> That's a very strange objection. Virtually every online service excludes all customers that don't have a credit card.
But they do not discriminate against who issued it. Valid? All good. Just like it with email.
> They're not being excluded, they just have to go through some extra steps
Disallowing an empty last name field is exclusion, not extra steps. Asking if it's correct is extra steps, which might be fair.
Disallowing what in your opinion is a "weird" email is exclusion, not extra steps. Allowing me to use it after sending me an email would be extra steps.
However, the name situation is just a validity or database issue, the email is unwarranted discrimination as all emails have the same format.
And that is exactly why the argument is not applicable: It is one thing to have a system not fit due to an ill-defined legal form or too many possible options (which is unacceptable on its own, but is hard to resolve), it is another entirely to decide to discriminate actively without cause, especially as it is extra work.
Your first argument was that an invalid email was a typo, which the validation flow sorts out, and which is needed anyway as a gmail address is no less likely to have a typo. The latter was that weird emails are likely fraudulent, which is just flat out false.
So with those out of the way, what reasons remain for going out of your way and writing additional code for deciding what email address is right and what is wrong?
> But they do not discriminate against who issued it. Valid? All good. Just like it with email.
Ahaha of course they do! Try paying online with a credit card issued by a bank from Uganda, or by a bank that allows customers to prepay their credit card bills, or with a "privacy.com" type virtual card, and you will very quickly see that payment processors most certainly do not treat all cards the same, even if they are valid.
And yes, that's "just like with email". Because the majority of online services will outright reject addresses from anonymous/"temp mail" providers, even though they are perfectly valid and can receive mail. Many will additionally disallow +suffix addresses, because they can be used to create multiple accounts from the same mailbox, and also make it easier to automatically get rid of spam originating from the service provider. Service providers want the user's primary, personal email address. Exclusion rules exist to increase the likelihood that this address is in fact what the user has entered. The standard validation flow is clearly not sufficient, else service providers wouldn't bother with all the additional complexity.
Wow, can you elaborate on this, or point me to a reference to learn more? People may not have a formal name recorded officially, but I find it *really* hard to believe they wouldn't have a name at all.
"Most Machiguenga do not have personal names. Members of the same band are identified by kin terminology, while members of a different band or tribe are referred to by their Spanish names."
There are also plenty of cultures where a person has multiple distinct names whose use depends on the context, cultures where one's "true name" is considered a secret that must be kept from others, and cultures where people routinely change their own name multiple times during their life just because they have decided they would now like to be called something else.
A counter argument on this is also to hold a standard across your software suite. The number of times I have found websites that allow `email+tag@example.com` on sign up but then promptly break at the backend is more than I can count on one hand.
While the argument that supporting all valid email addressing may not be the best idea, holding a uniform standard across your own systems is.
Agreed. And I don't really care if your email client doesn't understand MIME, as I don't care if you get it via UUCP, because you are not that special* and not worth the extra time. You can go yell at me or whatever, I can't justify the expense.
At this point, I also don't really care if your client can't read HTML MIME, again upgrade it or don't get the message.
* assuming I am not sending it to you specifically, in which case you are a friend and you are that important.
In respect of handling email addresses provided by customers (note I do not use what I consider has become almost a derogatory term "users") the U.K. Government Digital Service (GDS) guidelines on interface/experience design and implementation patterns is widely seen as the gold standard[3]. Here's what they have to say about accepting email addresses[0] including code examples:
When asking users for their email address, you must:
make it clear why you’re asking
make sure the field works for all of your users
help users to enter a valid email address
You may also need to check that users have access to the email account they give you.
and here [1] is the github issue tracker for discussing email address patterns accepted by government services.
In respect of people without both first and last names, I resemble those remarks! I've dealt with broken computer systems since the 1990s that assume first+last (or even first+middle+last !), or set an arbitrary minimum length, and break when meeting me ("Tj") !
Worst cases for this are web services that set arbitrary rules for the name on a credit card (which can be almost anything by the rules and guidelines and should be free-form) - I had this only yesterday with onlyfans.com not accepting my name as it appears on the card because their rules impose first " " last format.
The double-abuse of the customer then comes when the first-line support, when told precisely what the problem is, and "please let your web-devs know", gets the response "Use a different card". Turned out that onlyfans.com (or their provider) don't even ensure the name matches when actually doing the authorisation since I put something random in for first-name and it was authorised.
I've discussed this in detail with my various banks' technical teams over the years and they've confirmed it usually isn't their side doing a DECLINE; it's the requesting service applying overly-strict arbitrary rules before deciding to make the authorisation request.
Again, GDS has recommendations (and consider this is for government services) for accepting names in free-form, not split into fields[2], and says this:
Use single or multiple fields depending on your user’s needs. Not everyone’s name fits the first-name, last-name format. Using multiple name fields mean there’s more risk that a person’s name will not fit the format you’ve chosen and that it is entered incorrectly.
In my case in the U.K. the passport office telephoned me once, for my first digital passport in the 1990s, asking rather apologetically if I'd mind them putting X's in the first name since their computer system couldn't cope with it being empty. Whereas the U.K. Driver and Vehicle Licencing Agency (DVLA) has no problem at all and shows my legal name correctly on my driving licence.
Two annoying exceptions (not strictly government created/operated) are the internal (local) NHS registration system and some local authority (local council) electoral roll (voter registration) systems that do it badly - usually due to having bought in an external 'enterprise' application to handle it, or trying to interface many disparate systems.
Generally, over the last 25+ years, I've found government organisations are great at handling these corner cases but random private sector / out-sourced development is worst.
Getting traction to get things fixed is the hardest part - being treated as dumb (usually by first-line support and their 'managers') when I set out a clear case and rationale for the bug and how to fix it has to be amongst my least favourite voluntary community-spirited endeavours. The short-cut I apply there, now, is an email CC-ed to the organisation head (chair, CEO) and senior legal person.
There is an up-side to it though - I rarely if ever suffer any kind of spam or phishing and anyone trying identity theft will have to have much more determination than me to overcome all the obstacles :P
It can’t be reliable by design. All it takes is a dead server without backup in the middle between you and the recipient. A reliable protocol requires acknowledgements and retransmissions at every step. There’s no such thing in email.
It is reliable enough in real life with everything set up properly, that’s true. If mail going to spam can be counted as delivered.
Well that is wildly inaccurate. Who calls themselves an engineer and doesn't know how MX works? Sorry, but I have to disagree. I'm sure you can cite examples for each, but that isn't a reason to indict all programmers; the original post this calls back to was a much more universally misunderstood concept, as far as I can tell.
Looks like we need to add another falsehood to the list:
* all (or even most) engineers and other technical persons know and understand the details of mail exchange
I do, but I wouldn't expect all to have much understanding and I wouldn't expect most to know a lot of the finer detail. Heck, I think I've just learned that local-part has a 64 octet limit (or if I already knew it I'd forgotten).
Apart from myself and my boss who have had to learn about SMTP I would suggest none of the 50 or so devs I have worked with in 20 years know much at all about email/SMTP.
Could you please tell which RFC says this? According to RFC5321/5322 + can be used inside a local part but has no special meaning. Because a receiving MTA can interpret a local-part as it wants [1] some decided to treat a part after + as an extension, but it is not universal.
[1] RFC5321 "the local-part MUST be interpreted and assigned semantics only by the host specified in the domain part of the address"
"Consequently, and due to a long history of problems when intermediate hosts have attempted to optimize transport by modifying them, the local-part MUST be interpreted and assigned semantics only by the host specified in the domain part of the address."
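One conservative reading of that RFC text, sketched below: compare the local-part byte for byte (no stripping of +suffixes or dots, no lowercasing) and only treat the domain as case-insensitive. The function names are made up, and the sketch assumes an ASCII domain; IDN/punycode equivalence is not handled here.

```python
def split_address(address):
    """Split at the LAST '@' so a quoted local-part containing '@' stays intact."""
    local_part, sep, domain = address.rpartition("@")
    if not sep or not local_part or not domain:
        raise ValueError("not a local-part@domain address: %r" % (address,))
    return local_part, domain


def same_mailbox_conservative(a, b):
    """True only if the local-parts match exactly and the domains match ignoring case."""
    local_a, domain_a = split_address(a)
    local_b, domain_b = split_address(b)
    # Only the host named in the domain may assign meaning to the local-part,
    # so nothing is normalised on that side.
    return local_a == local_b and domain_a.lower() == domain_b.lower()
```

This will report some addresses as different even though a particular provider delivers them to the same mailbox; that is the price of not guessing another host's rules.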
Similar erroneous assumptions strip the dot from my gmail address in a mistaken belief that I will get it. All my non-dotted email goes direct to spam via a filter I wrote, because multiple-nines of spam strip the dot.
I've only had a single (government) organization refuse to leave the dot in my address, so I had to special case them.
An email system that guarantees case-insensitive email addresses can still fail during comparisons of lowercase-to-lowercase, due to international encodings, locales, I18N, L10N, etc.
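Pseudocode:
string-compare-case-insensitive(x, y) => true // Right way
string-compare(lower(x), lower(y)) => false // Wrong way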
It turns out this issue is called a "case folding non-deterministic" error, and is a broader issue with strings in general.
For more about "case folding" with Unicode: http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
For more about "non-deterministic" comparisons: https://www.postgresql.org/docs/current/collation.html#COLLA...
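A small Python illustration of the lowercase-versus-case-folding difference. It uses the standard library's `str.casefold()` and `unicodedata.normalize()` rather than ICU or Postgres collations, so it is only an analogy to the setup described above; the function names are made up, and locale-specific rules (such as Turkish dotted and dotless i) are not covered.

```python
import unicodedata


def naive_equal(x, y):
    # The "wrong way": lowercase both sides and compare.
    return x.lower() == y.lower()


def fold(s):
    # Canonical caseless form: decompose, case-fold, then decompose again.
    return unicodedata.normalize("NFD", unicodedata.normalize("NFD", s).casefold())


def folded_equal(x, y):
    # Closer to the "right way": compare Unicode case-folded, normalised strings.
    return fold(x) == fold(y)
```

For example, `naive_equal("Straße", "STRASSE")` is false because lowercasing keeps the ß, while `folded_equal("Straße", "STRASSE")` is true because case folding maps ß to "ss".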