Hacker News
The Correct Way to Validate Email Addresses (hackernoon.com)
556 points by amk_ 377 days ago | 386 comments



The number of websites that try to reject my email address with a + in it, ugh!

Surprisingly, the validation is often done 100% client-side anyway, and simply modifying the incorrect regex lets my email address through... If I wreaked havoc on your back-end, then it's your fault for sucking ;)


Even worse is rejecting my password because it has a + in it! Why do you as a business care what my random password generator spit out??

Scarier still is when it's a server-side response that rejects my password for its contents...


Caring what characters are in the password heavily implies that the site is not hashing the password in any way, and, scarier still, may just be storing it as plain text.

Why: because if they were (at least) hashing it, the output of the hash would be a binary string, in which case they would have to be 8-bit clean all the way through to the DB column where the hash output resides, and there would be no reason to care what characters were in the input.

Caring specifically about a + in a password also implies that their authentication might be set up internally as an HTTP endpoint with URL encoding of a "password=" form variable (because + is decoded as a space in URL form encoding, while % introduces the hex escapes).

Both, of course, imply a lack of proper secure design in their password handling.
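
That decoding mix-up is easy to demonstrate. A minimal Python sketch (the "password" field name is just for illustration):

```python
from urllib.parse import parse_qs, quote_plus

# In application/x-www-form-urlencoded data, "+" means a space and
# "%XX" is the hex escape, so a literal "+" must be sent as "%2B".
body = "password=" + quote_plus("p+q r")
print(body)  # password=p%2Bq+r

# A backend that decodes the form data correctly gets the original back:
decoded = parse_qs(body)["password"][0]
print(decoded)  # p+q r
```

A backend that skips the encoding step, or decodes twice, is exactly how a legitimate + gets silently turned into a space.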


> Caring what characters are in the password heavily implies that the site is not hashing the plaintext password in any way, and scarier still, may just be storing the plaintext password as plain text.

I don't think that is true at all.

I may very well want to enforce a few simple rules server-side, such as

1) No username in password

2) No email in password

3) No password from a list of the 100 most common passwords

All of which require me to look at the text for your password, none of which mean I am storing it in plaintext.
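
A sketch of those three rules (all names hypothetical); the point is that inspecting the plaintext transiently at set-time implies nothing about how it is stored afterwards:

```python
# Stand-in for a real top-100 common-passwords list.
COMMON_PASSWORDS = {"123456", "password", "qwerty"}

def validate(password: str, username: str, email: str) -> bool:
    p = password.lower()
    if username.lower() in p:   # rule 1: no username in password
        return False
    if email.lower() in p:      # rule 2: no email in password
        return False
    if p in COMMON_PASSWORDS:   # rule 3: not a well-known password
        return False
    return True
```

After validation passes, the plaintext can be handed straight to the hashing step and discarded.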


> I don't think that is true at all.

I've run into a couple of sites that reject passwords containing ', %, and other special characters that suggest there may be some truth to it.

If you're scrubbing input as if it's about to be inserted into a database via SQL, then there are really only two possibilities. Either a) you're running legacy code that still does the check and does blind escaping (which has its own set of implications), or b) you really, truly are storing passwords as plain text.


Or they're cargo-culting on decades of experience where special characters are verboten.


Hah, good point. Although I do have to wonder: when does it turn into filtering by force of habit, after all the legacy cruft has long been forgotten and is no longer maintained?


Because the other monkeys will chew you out if you start doing it differently all of a sudden. Nobody knows exactly why we're doing it the way we are doing it, but it is complex, and changing it might break something.

It's the five monkey experiment:

http://johnstepper.com/2013/10/26/the-five-monkeys-experimen...


It's not always that.

Let's use the "filtering chars from password" example above. You can't put certain special chars in the password field, and you want to change that so it does normal hashing, where special chars don't matter.

In a larger org, even changing a practice like that so that it "makes more sense" can have a big ripple effect.

You have to

* explain to someone else on the team who came up with the original process that it's flawed (and why)

* explain to other dept that they need to update their testing process (and why)

* get support dept to change their language/process

* change outbound messaging in all affected places (perhaps with code you can't touch, involving other teams)

* possibly have a flag that deals with 2 versions of data

Even if your change brings you in line with normal/safe practices, you may have to fight multiple inane battles and spend loads of time and political capital, and at the end of the day all you'll have gained is that a !"@+'$ is accepted in a password field. Most people will not grasp the bigger issue at play.


At a small business convincing colleagues isn't that hard. At a larger corporation, getting the rest of the team on board is the job of whoever is in charge of defining security policies and such. Not allowing characters present on all keyboards and input devices (!"@+'$) statistically increases the risk of people picking weaker passwords than necessary, and he/she will guide that change through the proper processes, for example to make sure that any client software interacting with the backend is aware of the upcoming change. Same as with any security issue (e.g., the deprecation of SSLv3 ciphers in favour of newer TLS versions).

If a business can't handle a change like this, something is really broken in the development pipeline. Granted, that describes a lot of medium sized companies…


Take that one 'request' - possibly initiated by a jr-mid level developer - and stack it up against the 500 other todo items in the pipeline. You can make those arguments about "statistically increases the risk of people picking weaker passwords" - but unless this increases a bottom line or comes out of someone else's budget, this sort of 'bug' is going to be really low down on the totem pole for all the reasons I mentioned, and a few others.

You can say "the process is broken", but it's also that same process that got people where they are, puts food on their table, pays for the lifestyle, and precious few people are ever willing to rock the boat at any company for anything.


Which gets back to some of the cargo cult thing too. It's not uncommon in a large enterprise to smack up against things like "Years ago we paid a highly trained Security Consultant a large amount of money to develop our Security Guidelines, who are you and where are your security credentials to tell us to do things differently?"

Even worse when that "Security Consultant" is still a retained coworker with a fiefdom to maintain by war at all costs, a wizened old greybeard whose seniority will always trump yours, and/or your boss.


Or they scrub ; from ALL arguments before considering what kind of argument it is instead of scrubbing only where that might be needed.


There's a related issue that I've seen in some banks, which prevents you from using passwords LONGER than a certain number of characters (usually 8 or 10). The only reason I can think this is happening is because they don't hash them and they need to fit in their database column. I took my money out of that bank the next day.


Could just as likely be some process that still needs to pass the password around in a fixed-length file format to some ancient backend process written in COBOL, living in the Cthulhian abyss deep in the bowels of the bank, that no one dares rewrite for fear that it would destroy the bank from inside.


All of those checks can easily be done client-side, though. Of course this means you can't guarantee that none of these rules are violated, but I suspect that the users capable of bypassing this aren't the ones you're concerned about anyway.


The server does have to have access to the plain text password in memory, you know. I don't see why you think it's worth sacrificing guarantees of password strength to uphold some kind of taboo.


No it doesn't. You can perform the first pass of salted hashing on the client-side.

This should not harm security, but it can improve it if someone on the datapath is logging requests but does not alter them.


So you just replaced the password with another secret, the hashed password. An attacker would now not need to gain access to the original password, but to the hashed password - which will be logged just like the password would be.

The only security benefit is that it offers a bit of protection for those who are reusing passwords, since it doesn't expose the plain text.


Most people reuse passwords, so that isn't a trivial benefit.


Isn't that what SSL does? And if the SSL between you and the destination is compromised, you don't even know if the hashing algorithm you asked the client to use is actually the one they used.


Yep. And importantly, SSL is securing more than just the password - if you just salt + hash client side, then anyone watching gets to do a replay on that value instead of the original.


> You can perform the first pass of salted hashing on the client-side.

Not really. JavaScript crypto is fundamentally broken: an attacker, malicious server or disgruntled employee can replace server-side JavaScript and remove the client-side hashing at any time. This is, notably, why Firefox Accounts are completely and totally insecure (and hence why Sync is unsuitable for storing any private data at all).


Can you reassure me about the worst horror I've seen?

That is: a listed-and-enforced ban on obscenities in passwords. Is there any world in which that company isn't butchering their security in some serious way? (It's Time Warner, so not a small-time deal here).


Or, they submit the password over HTTPS, validate server side, then hash and store in the DB?

There is no reason to assume rejecting of a + means they don't hash, and browsers escape the password for you in POST/GET etc. The fact that + is used for space really isn't relevant.

(That said, I think it's pretty stupid to "validate" passwords beyond checking for "12345", "password", a min length, and other public info. e.g. your password should not be your name or date of birth. Anything any password generator spits out, so long as it meets the min length criteria, should be accepted).


> Anything any password generator spits out, so long as it meets the min length criteria, should be accepted

It annoys me to no end when I have to hand-craft a password to stay within the silly rules of some service.

Be vocal about this! Keep complaining to the services you use that don't accept such valid passwords. Banks for example are notorious in insisting on short passwords and arcane limits on which characters to use (ING in the Netherlands limits you to 20 characters, and nobody there seems to be able to explain why).


With banks there is another problem: they often use partial passwords. They are password-manager unfriendly and often limited to 16 characters.


Société Générale uses the client ID and a 6 digit passnumber (sent via physical mail) with a digital numpad (you have to click) where each number is randomly placed.

So secure.

And they force a change every 3 months. As if a new 6 digit pass was more secure than the last 6 digit pass.

[https://en.wikipedia.org/wiki/Société_Générale]


Sadly at some places it's intentional so your password matches existing PIN systems or is "easy to remember" so you're not as easily locked out of your account. I wish in those cases there was a check box that says, "I know what I'm doing, leave me alone."

Speaking of annoying validations, my name has a hyphen in it but you'd be surprised how often that's rejected with the familiar, "Please enter a valid last name." Sigh.


My partner has a hyphenated first and last name. So many systems refuse to accept that, and even regular humans struggle with understanding it!


Had this bug in a system I was working on. Pushed a patch in April... and there is still an ongoing discussion about whether it should be merged or not.


Why is there a discussion? What do the people who don't want to take the patch say? "We don't care if we lose customers with unusual names"?


Because of code style (I do not mean spaces or tabs, it is beyond that). Different developers have different ideas of how a function should behave, where it literally has no effect on how the rest of the system works.


My last name starts with oo, which many humans struggle to understand. It almost always becomes do, or co.


As much as I agree that most artificial limits on passwords are silly, there's a reason for at least limiting to the latin1 charset - or maybe even a bit stricter. How sure are you that strings are always normalized the same way between the browser and the hashing step? Have you tested that you can log in from a browser which prefers CJK encodings, and with each of the UTFs? From all browsers? Maybe it works, maybe your framework deals with this without issues. But if not, do you really want to chase those mysteries in the future?

Btw, almost nobody is actually storing hashes as binary blobs. Pretty much every framework I know stores either base64, or direct hex encoding of the value.
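
A quick illustration of the storage options being discussed: the raw digest is binary (which is where the 8-bit-clean concern comes from), while hex or base64 are the text-safe encodings most frameworks actually store. SHA-256 is used here purely as a familiar example of a digest, not as a recommended password hash:

```python
import base64
import hashlib

digest = hashlib.sha256(b"hunter2").digest()  # raw binary, needs an 8-bit-clean column
print(len(digest))                            # 32
print(digest.hex())                           # 64 hex chars, safe for any text column
print(base64.b64encode(digest).decode())      # 44-char base64 form
```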


We reject Unicode control characters, surrogates, private use code points etc. Passwords are sent in a UTF-8 encoded JSON over TLS. They are normalized using NFC.
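
A sketch of that policy using Python's stdlib unicodedata; rejecting every "C" general category (control, format, surrogate, private-use, unassigned) is an approximation of the list above, not necessarily the exact rule:

```python
import unicodedata

def prepare(password: str) -> str:
    # Reject control, format, surrogate, private-use and unassigned code points
    if any(unicodedata.category(ch).startswith("C") for ch in password):
        raise ValueError("disallowed code point")
    # NFC, so "e" + combining acute and the precomposed é hash identically
    return unicodedata.normalize("NFC", password)
```

Without the NFC step, the same password typed on two keyboards can hash to two different values.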


For more fun, try putting a "--" (two consecutive hyphens) in many fields in the AWS console. Not allowed, I guess because they're passing it to shell commands somewhere along the line?!


Not likely. That's the comment start marker in many dialects of SQL.

For commands, shell metacharacters like `, |, &, and $() are much more important.


> Scarier still is when it's a server-side response that rejects my password for its contents...

Why? Assuming that you have established a secure connection with that server (i.e., HTTPS by means of TLS), then it is perfectly fine for the server to check, at the time you are setting the password, whether your password conforms to the established rules. And when the password turns out to be suitable, it is okay for the password to have been sent to the server as-is.

Now it goes without saying that as soon as you have picked a suitable new password, that the server will store only a (proper) hash — by using BCrypt for example. At no time is the plain text password stored anywhere, and any proper HTTP API will send the password via POST to prevent it from being logged in the server's access logs (which is where it could end up if sent as parameter via a GET request).
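
To illustrate the salt-then-hash step: the comment names BCrypt, which lives in a third-party package, so this sketch substitutes the stdlib's scrypt as the slow, salted hash; the cost parameters are illustrative, not a tuning recommendation:

```python
import hashlib
import hmac
import os

def hash_password(password: str) -> tuple[bytes, bytes]:
    salt = os.urandom(16)  # fresh random salt per password
    key = hashlib.scrypt(password.encode(), salt=salt, n=2**12, r=8, p=1)
    return salt, key       # only these are stored; the plaintext is discarded

def verify(password: str, salt: bytes, key: bytes) -> bool:
    candidate = hashlib.scrypt(password.encode(), salt=salt, n=2**12, r=8, p=1)
    return hmac.compare_digest(candidate, key)  # constant-time comparison
```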

There are plenty of services that screw up and try to apply nonsensical rules (such as limiting the length of the password to anything less than, say, 256 characters), but in general this is done to prevent weak passwords. You can't exclusively do this client-side, because in security terms, the client cannot be trusted to actually apply the validation and to generate a proper hash; the client can be bypassed. Of course the client can and should validate any input before the server gets to it, so ideally the server need not even come into it during validation, but the server has the final word.

As user though, you have the responsibility of not reusing passwords anywhere (and you don't, because you sensibly use a password generator as you mention). Don't assume that any service handles security well.


> At no time is the plain text password stored anywhere

Hopefully. When I see rules limiting passwords to 16 characters and disallowing SQL special characters, I'm having doubts.


And you should; you have no way of knowing if any service is practising proper security or not beyond what you can see (e.g., exclusive use of HTTPS with valid certificates), so assume the worst. It is best to see a password as a shared secret; something you share with a specific service. That means no reuse of passwords if you care about the consequences of that password getting out.


My bank limits passwords to 15 characters. The best thing? There is no validation, it just truncates. Have fun figuring out why you cannot log in anymore.


Here is a fun one. My credit union limits passwords to 32 characters, and since I use LastPass for my passwords I had it generate a 32-character password when registering at the website.

Next day I get around to downloading and setting up their mobile app. I log in and get a prompt: since it is my first time using the mobile app, they have sent me an email with a 4-digit code I need to enter before proceeding. The email with the code arrives quickly and I enter the code. I proceed to get an error message: "Error 400: This service is not available at this time." OK, they must be down, I'll try again later. The code is good for 24 hours.

Next morning I find time to try again and get the same error message, except this time it also informs me that my account has been locked. I call up customer service and, with scarily little information, get them to unlock my account. I explain what is happening with the app, and the CS rep puts me on hold. She comes back and tells me the IT folks thought it must be a problem with my email address. "Do you have a normal email address, like from Gmail, Yahoo or Hotmail?" I release a great sigh and give in to the stupidity I'm about to have to navigate. I provide a Gmail address. They change the email associated with my account on their end and tell me I should try again in 24 hours. I do, with the same results. I call back again, and the CS person puts me on hold while she reaches out to IT. This time whoever she talks to knows the issue right away.

Turns out that when you log in to the mobile app for the first time and submit the code, it is actually appended to the end of your password and submitted as your password. Which, if you have a password with more than 28 characters, means you are exceeding the 32-character password max, which causes them to return the informative "Error 400: This service is not available at this time." message.


Fun fact: PayPal (used to?) silently cut off passwords when signing up, but not when logging in. And the password rules are of course not shown on the login screen. So good luck remembering what length they cut your password down to when your password manager has stored the longer version.


And this from the same industry that pretends showing me a picture of a squirrel as a "secret image" meaningfully enhances my security.

It's not even proof against MITM attacks, which are the only thing it's supposed to prevent!


A previous place I worked did almost the same thing. There was an internal website they had built that everyone used for time entry. It authenticated via LDAP, so you didn't need a separate login for it. However, the password box on the page only permitted passwords of up to 10 characters, and it wouldn't notify you; it would just truncate whatever you typed in. So if you had a Windows password longer than 10 characters, you couldn't enter your time.


When I was doing my taxes 2 years ago, I got bitten by the H&R Block website doing that. It did not inspire trust.


Comcast has the same issue. I changed the password for a user on my account using a password manager; it accepted a 20-character password with no errors, but then I was unable to log in with that account. Changing to a 12-character password finally worked.


My bank limits passwords to ten chars. Ten!

Have you ever heard any reasoning behind why they do this? The "best" excuse I've heard is so that customers don't forget. As if they don't have a "Forgot password?" link right there.


When I forget my banks password, I have to go to the closest client centre. Ironically, the only time I've had to do that is when my password was truncated.


Because somewhere they have a decades old legacy system that can't handle more than 10 characters.


I remember Starbucks was doing it for their mobile app about two years ago, sacrificing security for a "better"/faster UX. It's one of those things that you think "No one would be dumb enough to do this", only to be surprised by the fact that a big player has been doing it for a while.


> Scarier still is when it's a server-side response that rejects my password for its contents...

A friend's project decided to disallow umlauts, combined characters like ´e (can't type the correct e with accent mark), the pipe symbol and a couple more in new passwords.

Not due to plaintext storage or so, but because of customer service issues - people were bugging support all the time because they were e.g. abroad and couldn't enter umlauts, or on a Mac and couldn't find the pipe symbol (it isn't written on the keyboard!)... of course, a quick pointer to the Character Map in Windows helped, but what about Mac users? At least it did take a huge load off the customer service guys.

I'm looking forward to the date when some clueless user will input a UTF8 emoji as a password, given that Android and iOS keyboards now include these on special keyboards...


> I'm looking forward to the date when some clueless user will input a UTF8 emoji as a password, given that Android and iOS keyboards now include these on special keyboards...

Looking forward? That happened a year ago and caused problems because the OS X Yosemite login screen had no way to input emoji:

http://apple.stackexchange.com/questions/202143/i-included-e...


> combined characters like ´e (can't type the correct e with accent mark)

You mean é ? :-)


Yes. OS X keyboard drives me nuts sometimes, Karabiner can only fix some bits of the weirdness.


Hold the "e" key down for a few seconds: it will have a little dialog above where you can choose it, depending on your keyboard settings of course.


It's worth playing around with the option key when pressing letters on OS X. You're able to enter a ton of common (western) characters with that alone, without ever opening the "special characters" (ugh) palette.

Also, the default configuration for a couple versions now (which I hate, but I know it benefits others) is that if you hold a commonly-accented (western) character, it'll suggest accents for it.


option-e and then a vowel will create {á é í ó ú}

option-u and then a vowel will create {ä ë ï ö ü ÿ}

option-n and then an 'n' will create ñ

Found this out while learning Spanish since holding down keys to select their alternate was way too slow.


For the record:

option-n and then 'a' will create ã

option-n and then 'o' will create õ


Thank you, this was driving me nuts.


On Windows[1], á é í ó ú are achieved by holding down Ctrl+Alt and then pressing the appropriate letter.

--

[1] Yes, I know you are OSX, but someone else reading this might not be.


There's a pipe symbol on my keyboard.

By pipe, you mean "|", right?


There is a key on my keyboard that produces a vertical bar (|), but the glyph on the key is a broken bar (¦). I think every keyboard I own is like that.

That could be confusing to someone who doesn't know better.


Even Linux tools like GNOME Archive Manager in Linux Mint 18 rejected my RAR password containing a $ as incorrect, even though it was the correct password for the RAR file I was trying to extract. I then used the command-line unrar utility with the exact same password, and it extracted successfully.

Now, why would you preemptively (and explicitly) throw out a candidate password string based on its characters?

At least make the validation consistent.

I also had trouble connecting to a wifi network with a WPA password containing an ampersand in Arch Linux, using the Arch Wiki's recommended command-line network manager utility (netctl, if my memory serves me correctly), and I tried all sorts of ways of escaping it and quoting it in the configuration file.

The Arch devs on IRC just told me to change my AP settings so that the wifi network doesn't use a password containing an ampersand character. Well, that would have been a good idea, but it wasn't my network, so I didn't have the ability to change its WPA password.


I'm guessing both are because they are calling command-line tools. One of my pet peeves with Linux is that many of these tools are only callable via text and don't expose an API for other programs.


"callable via text"? What does that even mean? The command line is an API, isn't it? If people don't manage to pass an ampersand to another program via the command line, that's really no different than people failing to pass an ampersand as a URI parameter to an HTTP resource: Failure to encode properly. There is absolutely nothing that prevents you from passing an ampersand (or any other characters) to a program via its command line.


> "callable via text"? What does that even mean?

It means you have to execute commands through a shell. A real API would be something you could include in your program, execute a method against and get a list of objects back. Instead, all these basic commands are replicated in every framework.

As far as encoding properly, you're preaching to the choir, but out in the real world there are still injection attacks everywhere.


> It means you have to execute commands through a shell.

Except you don't. There is no need to involve a shell.

> A real API would be something you could include in your program, execute a method against and get a list of objects back.

So, Web APIs are not APIs?

Also, you can execute methods against command line programs, method names usually start with a dash.


Word-splitting, escape character interpretation and so forth is done by the shell. The shell is only invoked if you use the system() call to run said tool.

Which you shouldn't. execve (and friends) will let you pass an explicit array of flags, and you should always use one of those functions if you're calling another program. No interpretation.
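
The execve-style call sketched in Python: subprocess.run with an argv list never involves a shell, so shell metacharacters in a password are just bytes in an argument (the password string here is made-up):

```python
import subprocess
import sys

# The argv array is passed directly to the child process - no shell,
# no word-splitting, no metacharacter interpretation.
password = 'p&|;$(rm -rf /)"\'!'
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", password],
    capture_output=True, text=True,
)
print(result.stdout.strip() == password)  # True
```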


On a Unix system, a console application is effectively just a vararg function taking a bunch of const char* arguments. There's no limitation on what characters you can pass as those arguments, so that doesn't sound like a valid excuse.

(There are some characters that are treated specially by shells, and require escaping - but you don't normally spawn child processes via a shell.)


But even PHP has a way to escape shell arguments safely
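
Python's counterpart, for reference, is shlex.quote, though passing an argv list and skipping the shell entirely remains the safer habit:

```python
import shlex

# Like PHP's escapeshellarg(): wrap an untrusted string so a POSIX
# shell would parse it back as one literal word.
arg = "pass'word; rm -rf /"
quoted = shlex.quote(arg)
print(shlex.split(quoted) == [arg])  # True
```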


Having a way and it being used everywhere are different things. That's why we still have SQL injection attacks.


Or when a password that is generated by my password manager is rejected with a message "Password should be 12 characters maximum". Why???


This very heavily implies that the database column which stores "passwords" is typed as "char(12)" and that the site is storing unhashed plaintext passwords in that column.

Why: because if they were (at least) hashing the password, then the output of the hash would be a fixed-size token unrelated to the length of the input plaintext password, and no such arbitrarily short limit would be necessary on the plaintext password itself.
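
This is easy to check: the digest length is a property of the hash function, not of the input (SHA-256 here is just a familiar example of a digest):

```python
import hashlib

# Same output length for a 1-byte input and a 10,000-byte input, so a
# hashed-password column never needs a tight length cap.
short = hashlib.sha256(b"a").hexdigest()
long_ = hashlib.sha256(b"x" * 10_000).hexdigest()
print(len(short), len(long_))  # 64 64
```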


Ha! If only! Probably more than 50% of the sites I visit whose maximum length my password manager exceeds give me an unrelated error message. Sometimes they tell me I haven't met the minimum length (100 chars, really?), sometimes they tell me that I've not met complexity requirements (I use upper/lower/numbers/special chars), etc.

It's as though the developer only ever thought of how people wouldn't meet their minimum requirements, wrapped it all in a case or switch, and let the default be whatever they hadn't checked in previous conditions... not thinking that anyone would exceed their requirements.

Very annoying and sloppy.


You default to 100 character passwords? Doesn't that make it extremely inconvenient on the rare occasions when you need to type a password out? I figure 14 characters is going to be effectively unbreakable, but still possible to manually copy in under a minute.


Yeah, basically this. In theory every time I need a password I can copy and paste.

In reality, sometimes I'm on someone else's computer, or something else comes up and I need to open my database on my phone and type it in by hand.

For instance, I can't imagine trying to enter a 100 character password with a PS3 controller to log myself back into Netflix...

If I'm already at "it's going to take 100 quintillion years to break this hash, even if they're only using MD5", then I really don't see any security benefit to using a longer password... But there's a definite loss of usability.


It would take less time than you think with MD5, though not because the input is truncated: the digest is only 128 bits, so a 100-char password can't add strength beyond that, and MD5 itself is badly broken with respect to collisions.


It depends on your needs. Clearly in my case: nope, not a problem. Otherwise I would have changed. And I think your use of the word "default" is appropriate; what I do most of the time is not what I do 100% of the time... there are exceptions.

In those cases where I might need to manually enter a passphrase and can't rely on the password manager functionality, I use a pass-sentence that is both long and includes some random characters thrown in. That can still get me to 100 characters pretty easily. But those cases are fairly rare for me. I'm also only very, very rarely using a device that I don't own (OK, I do access systems I don't own frequently, but using my device... ssh, etc.).

Most of the time my 100-character default doesn't give me any greater security than your 14 characters (assuming they're well-constructed or random passwords), and things like multi-factor auth are often much more important anyway.

There are some times it can be useful; sometimes I can remember a 14-character password just by looking at it, even if it's random and not mine. 100 characters I can't. Since password managers rely on things like cut and paste, rarely, but occasionally, I might accidentally paste a password somewhere I don't mean to: at 14 characters I might not notice, but at 100 I more likely will (or overrun the field and lose a piece of it anyway).

At the end of the day, I have a password manager and I'm going to turn it to 11 and let it run. My personal practices are not necessarily my recommendations in this area, but they suit my needs.

Either way, my point still stands: if you build software that takes passwords and you limit the length (or can't take longer passwords), provide meaningful feedback when your requirements are exceeded... it makes me wonder what else I might be able to do if you didn't expect me to exceed your password length and respond properly.


Depends on where you have to type it, but usually, if people have long "passwords", it's because they use passphrases, which tend to actually be easier to type on real keyboards, but which are long because of the low entropy per character.


That always makes me think there's probably a Password char(12) in the Users table somewhere.


Or they want to make passwords "easy to remember" and read somewhere in the early 2000s that form validation matters. If you can do it, why not? :/


Even 15, 16, 31, or 32 (all of which I've seen before) would make more sense. 12 is "odd".


I haven't tried to see if they've updated their requirements to be more secure, but Wells Fargo is 14 characters.


Unless they are trying to pack multiples of 2 at the row level. It's 12 so they can squeeze a char(2) in somewhere else.


Meet Time Warner Cable. Time Warner, which has an obscenity filter on their passwords. Enforced server-side!

Bets on how long before those folks make the news for losing a hundred million unhashed passwords?


That is ridiculous. Why would you even filter passwords? It's not like it is public, unless you are planning on making it so.


My guess is that when you call them on the phone, they ask for your password to validate your identity. Which means it's stored in plain text in their database so that customer service can verify what you said is correct. Maybe they don't want their employees to be cursed at by customers.

I can't think of a good way for a business that has an online interface and frequently handles phone calls from customers to validate that they're talking to the correct person. Asking for other personal information can be used by an attacker to compromise multiple accounts via social engineering: http://www.wired.com/2012/08/apple-amazon-mat-honan-hacking/


Someone else in this thread mentioned a company that has customer service type in your password to open your account. So that would be a non-plaintext reason to insist on non-obscene passwords. But it's still terrible, because why the hell is customer service typing in your password.

Pretty much all organizations that allow phone authentication seem to be at risk of social engineering attacks. The only ones that manage it send you something verifiable they can ask about, like a credit card, and people who really care, like the government, just send an actual human to your house.


I'd prefer to have an obscenity in my password if a customer service representative is seeing it. That would help communicate my frustration with their system. Saves me from having to voice that same obscenity, most likely.


It doesn't have to be stored in plain text to validate it is correct. The phone operator could enter it into an authentication form to verify it is correct.


Customer care clicks a button to create a random temporary token associated with that account and user has to log on and read that token.


Devil's Advocate: They could filter for obscenities before hashing.


The concern isn't that there's no way to do it, it's that there's no reason to do it unless the plaintext is going to matter again in the future.

The best guess I've seen is that they might be hashing, but also having people read passwords to customer service reps. That would justify caring about rude plaintext, but it's also a terrible system.


Heh, to be fair, we are talking about a company with millions of customers that is consistently rated as "most hated" by consumers. I'd be surprised if a decent percentage of attempted passwords weren't "fuck time warner"...


Hah. I can see some poor support rep getting a dozen calls a day with insulting passwords, actually - maybe it's about reducing call center turnover...


An extremely large company I was involved in building a system for had a requirement that you be able to read your password over the phone to a call center agent, resulting in requirements like case insensitivity and character limits.

Terrible security, sure, but at least it came from the desire for usability, not just basic encryption idiocy.


No, the system isn't secure.

The text of the password should not persist.


Who said the text persists? You could take it down over the phone and type it in to compare against an existing salted hash. Still not good security, but not necessarily stored in plaintext.


(For the record, yes this is how it worked.)


A former employer, that I will decline to mention by name, stored a hash of the password AND the plaintext in the database so it could be sent to people via email when they forgot it.

I tried to explain to my boss why this was such a terrible idea and he was not hearing any of it.


If I ever use the "forgot my password" functionality at a site and they mail out something that is probably my original password, I make a point of cancelling my account and sending them an e-mail explaining why I don't trust them any more.


And submitting the site to http://plaintextoffenders.com


Omg. That sounds like a public list of "Top poorly secured business sites to breach. --Reward guaranteed"


They have two classes of site that they don’t distinguish to be fair. One is the awful “You’ve forgotten your password. It’s: xxxx” where they’re just storing the password in plaintext in the database.

The other is still not great but it’s less bad: “Thankyou for registering. Your registered password is: xxxx”. In this case they’ve just passed the user details to an emailer before (potentially) encrypting and storing them.

Neither is good, but the first is awful.


Yeah. As a former employee, I will not recommend them to anyone who is looking for services in that particular arena.


scarier still is when they let you set it but fail to let you log in (generally happens more with length). i don't know what you're doing, but i know it's not right and it scares the hell out of me


I had that happen because of length at an online service we needed to use in high school. I believe my password had 9 characters. After requesting a password reset and experimenting I found that the registration form would allow you to enter a password of any length, but the login form would only accept up to 8 characters. What really shocked me was that I was the only one of my classmates that used a password longer than 8 characters.


Some places have been known to just truncate the password...


I see this all the time. They accept it and then when you go to log in, it fails. I usually try chopping it down to 32, 30, 24, 20, ... until I hit a match.


I've had that happen with Comcast. I think they fixed it but it was pretty annoying. The signup password input had no max length but the login form did.


At least you got a notification. One bank just stripped all of mine off and submitted like that. At least they were consistent about it, I didn't realize until I accidentally mistyped and still got in.

In disbelief I logged out, tried again but this time I intentionally didn't use any special characters. It worked....


Worse than that: I've encountered a few web sites which accept email addresses with '+' characters... and then tell me that my email address has a ' ' character in it. Every time I see this I think "there's got to be a multiple-form-decoding vulnerability here"...


I've got an account where they just plain stripped the + character. Since I happen to have used only alphanumeric characters after it, I am now registered with an email address I can't actually receive mail on - it goes to somebody else's inbox.

Account synchronization was involved - IIRC the initial address confirmation message got through.


I thought that though the + is valid, nothing after the + is used to differentiate the email address? I use local+Organization when I sign up for an email list so that I can easily filter, plus I can see if that email address gets shared around. So on places that reject the + I just use everything before it as the local part. Maybe I'm missing something. Edit: I guess not all email providers do it this way but here's a link to more info: https://www.cs.rutgers.edu/~watrous/plus-signs-in-email-addr...


As far as email is concerned, "+" is just a character as any other. You might as well configure your mail server to ignore everything from the first "g" in the localpart when figuring out which mailbox to deliver to. Some mailservers happen to do this with a "+", if it happens to occur in a localpart, but that is just what they happen to be doing. There is nothing technically wrong with an email address like "++foo@example.net" or "x+y+z@example.net" or "+-@example.net", and "a+b@example.net" is a different address than "a+c@example.net", unless the operator of example.net explicitly specifies otherwise.


So this is true on the email provider side, sometimes. Gmail (and a lot of others) route foo+bar@gmail to foo@gmail. It's not required, but it's common.

The problem here, though, is that the email sender just stripped the + and sent the content to foobar@gmail.com. That's just a totally different account, chosen without warning.
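To make the distinction concrete, here's a minimal sketch of that provider-side normalization (assumption: it mirrors Gmail's documented policy of ignoring the "+tag" and dots; nothing in the RFCs requires this, and applying it to any other domain is exactly the bug described above):

```python
# Sketch of Gmail-style "+" normalization. Provider policy only --
# for every other domain, "+" is an ordinary localpart character.
def normalize_gmail(address):
    local, _, domain = address.rpartition("@")
    if domain.lower() in ("gmail.com", "googlemail.com"):
        local = local.split("+", 1)[0]  # strip the subaddress tag
        local = local.replace(".", "")  # Gmail also ignores dots
        return f"{local.lower()}@{domain.lower()}"
    return address  # any other domain: leave the address alone
```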


Yeah, of course it will work, but then it's harder to filter their mail to an appropriate folder.


Had this problem with Virgin Atlantic. Booked a ticket, never got it.


Southwest (Airlines) does this... or did a few years ago.


...and they probably send the confirmation mail to two addresses, both wrong.


I recently registered [mylastname].email, thinking to switch over to firstname@lastname.email from my gmail address. Turns out a large percentage of forms don't accept new TLDs.


I co-founded the company that set up the ".name" TLD as part of the first batch of new TLDs back in 2001, and we had no end of problems because of that.

A lot of people either had hardcoded lists, or they checked the length and refused TLDs longer than 3 characters. We kept e-mailing people about it, and kept getting messages back from people who had "fixed" it by adding just us (we'd generally have pointed people to articles listing other new TLDs too, to make it clear to them that this wasn't just one TLD), or increasing the limit to 4 characters... We quickly gave up on trying to get these people to stop having useless checks and settled for just getting them to accept ours.

It's quite shocking people still haven't learnt even after the number of expansions since.


I recall doing something similar for .coop


It's because a lot of regex validations (including earlier versions of Angular) have a regex that ends in something like

\.[A-Za-z]{2,4}$ (this is the end of the regex in Angular 1.15, it also doesn't support brackets or quotes in the local part, poor showing from Google)

At the time when they were written, gTLDs didn't exist, so the longest a tld could be was 4 characters (.info, .mobi). It's pretty piss poor futureproofing to be honest.
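The failure mode is easy to demonstrate (the pattern below is paraphrased from memory of old Angular-style validators, not quoted verbatim):

```python
import re

# Restrictive tail many older validators used: 2-4 letter TLD only.
OLD_STYLE = re.compile(r"^[^@\s]+@[^@\s]+\.[A-Za-z]{2,4}$")

for addr in ("user@example.info", "user@example.museum", "ceo@lastname.email"):
    print(addr, "passes" if OLD_STYLE.match(addr) else "rejected")
```

.museum has been a valid TLD since 2001, yet this regex rejects it, along with every newer gTLD.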


Angular didn't even exist when the first expansion happened, so either they were flat out ignorant, or they copied an earlier regex without verifying it (.museum was part of the initial batch of new gTLDs in 2001)


Yes, the + is incredibly useful for tagging emails. When I register new web accounts, I always specify a new unique tag so that I can track down the source in case I receive spam. Furthermore, they help my mail server when filtering out junk mail.


These days I do this differently; I created a subdomain that forwards all email to my main account. hackernews@foo.example.com would forward to main@example.com, and I can just filter the prefixes. That way I can use the subdomain for my own unique addresses, without interfering or using up addresses on the parent domain.


I do as well. It keeps nice track of which addresses have been leaked (or guessed) by spammers, and makes mail filtering to the relevant labels easy on my side.

But it does get awkward quite a few times when having to interact with a human (customer services, hotel bookings, etc) via email or phone when they get a bit confused why their company name is my email alias....

Spelling a long alias over the phone letter by letter is especially tedious...

Also responding to emails either means I have to configure yet another sender alias, or mostly just send from my normal alias, which sometimes gets rejected or confuse whomever I interact with.

Also hate unsubscribe links that insist on sending the unsubscribe email from that alias (mailman etc.).


Doesn't solve all of it, but that's why I use random localparts instead. Sometimes people wonder whether the address is correct, but if I confirm that it's ok that it looks weird, people aren't overly confused and just accept it. I then have it all integrated with Mutt so it tags emails with a human-readable label and automatically selects the correct source address when replying.


>But it does get awkward quite a few times when having to interact with a human (customer services, hotel bookings, etc) via email or phone when they get a bit confused why their company name is my email alias....

I had that problem so many times that I whipped up a quick Rails app that generates new email addresses. Type in the company name, hit submit. It uses a random project name generator gem to create something like SteelyFishSauce@<mydomain>.com, displays it on the screen, and emails "<Company name> has been associated with SteelyFishSauce@<mydomain>.com" to the spamcatcher address.


Same here, same (minor) problems. The human interaction part is pretty much covered when I used yourorg@mail.my-name.tld for registration, but the more provocative variation yourorg@spam.my-name.tld occasionally raises some eyebrows.

The nasty part is replying with the main address as the sender. I'd really love to have email clients with reasonable support for this usage pattern for the platforms I use, but maybe then the pattern might become popular enough to lose some of its advantages.


I do this too. Unfortunately some spammers decided that they'd use <random>@<mydomain>.com as their reply address and I have to filter out all of their bounces.


This is probably a better approach if your domain provider supports wildcard DNS records. My old provider did not and I am very glad I switched.


I don't need wildcard DNS, I have a catchall on a single subdomain.


A lot of spammers will rip out the + tagging on a Gmail account. Probably works better if you have a domain in front of it via Google Apps.


I do not use Gmail but a self-hosted Postfix instance. I have configured an alias for tagged use only and configured Postfix to reject all emails to this alias without a tag. This means that currently any tag will be delivered but luckily, I am not receiving any spam to this address at all.

Originally, I wanted to create a Postfix filter based on an HMAC together with a browser extension which would simply let me generate new valid email addresses in the form of prefix.HMAC(secret, prefix)@example.com but I have never implemented it.
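For what it's worth, that filter idea is only a few lines; a sketch (the SECRET value and the 8-hex-char truncation are placeholder choices, not anything the parent specified):

```python
import hashlib
import hmac

SECRET = b"change-me"  # placeholder; keep the real one out of source control

def tagged_address(prefix, domain="example.com"):
    """Build prefix.HMAC(secret, prefix)@domain as described above."""
    mac = hmac.new(SECRET, prefix.encode(), hashlib.sha256).hexdigest()[:8]
    return f"{prefix}.{mac}@{domain}"

def accept(address):
    """Server-side filter: recompute the MAC and compare."""
    local = address.rpartition("@")[0]
    prefix, _, mac = local.rpartition(".")
    expected = hmac.new(SECRET, prefix.encode(), hashlib.sha256).hexdigest()[:8]
    return bool(prefix) and hmac.compare_digest(mac, expected)
```

A spammer who guesses random localparts gets rejected, while any address you generated yourself verifies offline, with no per-alias state to store.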


I agree that it's annoying.

But I have an acquaintance who is constantly bickering and venting about this issue. If he strongly suspects that a signup form will reject his plus address, he will enter it even more fervently to prove they are idiots.

I'm firmly on the side of "Too bad. Maybe you could simply move on with your life? Just use a dash if you need sub-mailboxes"


Is a dash the same as a "+"? According to some quick googling it depends on your provider. If his provider relies on the "+" for sub-addressing what good will a dash do him?

Why are you so firm on this position when it seems to be a legitimate problem for him?


Yes, it depends on the mail provider.

What is easier and better for your nerves? Moving to another mail provider or getting worked up for years?

Or if moving is not an option: how about you stop wasting time and nerves on it? Just accept that your mail address doesn't work everywhere. There is no possible future where you will force everyone on earth to stop mis-validating email addresses!

And I've never disputed that the problem is legitimate. I actually said so in my comment!


I still see sites rejecting ".info" domains. Here's looking at you, Salesforce guest registration in SF and dozens of other sites.


I think the best user experience in this case is to just prompt the user asking whether they have typed in the correct email address; if they say yes, just go ahead and submit the form.


I had trouble signing up for JetBlue because I had a . in my email address!!!


I've got a # in my address and some systems fail with it too. It's totally valid. For example, outlook.com won't allow sending email to addresses with # in them. Some services also reject short addresses for no reason, e.g. m@em.mm, which could be totally valid. You don't know until you check the MX records and ask the SMTP server whether the address is valid in RCPT TO:.


Can you name some popular websites that do this?

Speaking as someone who uses + addresses to filter stuff from mostly well-known websites, I have never seen this. I have seen this a few times on old, crusty, finance websites etc. but I hardly ever need to use a + address with them anyway. (It does make me wonder about how good their internal security is, though.)


BestBuy allowed me to sign up with a trailing "+bestbuy@gmail.com", but their unsubscribe interface rejects it as invalid, so I can't unsubscribe from their promos. I just filter/mark as spam and move on.


I've seen this several times as well. If it's a referral link to unsubscribe and is meant to include your email, double check it's okay. If your plus is missing insert "%2B" where it should be.

It's less common, but some websites ask you to type your email address once they realise they can't find you on record. If the "+" doesn't work in that field (server side rejected), then you can try %2B there as well.
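The reason %2B works: in form URL-encoding, a literal "+" decodes to a space, so a sloppy server sees a mangled address unless the "+" is percent-encoded. A quick illustration with the Python stdlib:

```python
from urllib.parse import quote, unquote_plus

addr = "user+tag@example.com"

# What a form-decoding server sees if the "+" was sent raw:
print(unquote_plus(addr))     # user tag@example.com
# What should be sent instead:
print(quote(addr, safe="@"))  # user%2Btag@example.com
```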


If they refuse it, I think it's generally because they _know_ you're filtering them, and don't want you to sign up with a single-use address. In which case I don't bother signing up because I know they'll be spamming.

Also, there's always https://mailhero.io/


As of a year or two ago, Garmin wouldn't accept email addresses with "+" signs in them when registering for some (but not all) of their services. I think they would let you register for the main Garmin Connect service, but not for their support forum or something. Leading to a nice situation where you couldn't get technical support via the forum for the same account you had created on Connect.

Interestingly, though, the support email address they had hanging around at the bottom of the login page actually routed to an engineer (somewhere), which was nice, although they did claim repeatedly that the "+" character wasn't valid in an email address, which is of course not true. I was floored just to get a response from my complaint, though, even though it didn't seem likely to lead to them fixing the problem.


You probably hang around tech sites that mostly have it together.

I find common culprits are service companies; gas, electricity, real estate websites, insurance companies etc.

EDIT: Oh, and warranties! Man warranty websites tend to suck. Samsung for instance won't let you register your product to an email with a +.


when registering my S3 years ago I realised that samsung wouldn't let you have the word "samsung" or "s3" as part of your email address.


IIRC, Chipotle's Chiptopia promotion this summer allows users to sign up with a + in their email address, but not log in or do a password reset.


Airlines, like klm.com. I could book using a + suffix, but not check in online.


I just silently accept your + and throw the insignificant bit away.


The one thing I systematically do in terms of email validation is to catch the common typos of the main providers. So things like gmail.con, hotmai.com, gmall.com and so on.

In 99% of those cases, it prevents someone from entering a wrong email.

We do not do email activation by forcing people to click a link in their email to validate that they received it, since that causes a drop in the funnel and reduces revenue (non-technical users tend not to come back when you ask them to go to their email to verify it). So in this case, correcting the typical typos is very important. In our case the information is not extremely private, so it's less of a problem to do this.

Having people type the email twice doesn't really prevent typos; people copy and paste. And if you disable paste, it becomes annoying to users, and you don't want to annoy users during the signup process (plus I hate websites that mess with paste, so I won't be a hypocrite and do it).

Lastly, I know that having an email with a local domain name with no TLD is valid but it'll never be valid in the context we are sending so supporting them just doesn't make sense.
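That typo-catching can be done generically with a fuzzy match against a short list of big providers; a sketch (the 0.8 cutoff and the domain list are arbitrary choices), which should suggest rather than reject, since someone may genuinely own gmall.com:

```python
import difflib

COMMON_DOMAINS = ["gmail.com", "hotmail.com", "yahoo.com", "outlook.com"]

def suggest_domain(address, cutoff=0.8):
    """Return a likely intended domain for a typo'd one, or None."""
    domain = address.rpartition("@")[2].lower()
    if domain in COMMON_DOMAINS:
        return None  # already a known-good domain
    matches = difflib.get_close_matches(domain, COMMON_DOMAINS, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

"Did you mean user@gmail.com?" with a yes/no prompt catches the typo without ever rejecting a legitimate, merely unusual, address.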


> The one thing I systematically do in terms of email validation is to catch the common typos of the main providers. So things like gmail.con, hotmai.com, gmall.com and so on.

That can be helpful, but I assume you suggest the user that they may have made a typo and not flat out reject the input?

> […] having an email with a local domain name with no TLD is valid but it'll never be valid in the context […]

Well, depending on how the fairly recent brand generic top-level domains work out, I would not rule out someone actually using ceo@cocacola at some point (until the CEO of Coca-Cola realises that legacy email validation regexes mean a suspiciously quieter than usual inbox).


Our email provider flat out refuses to send to those domains, so we reject the input for the domains that the email provider blocks.

> I would not rule out someone actually using ceo@cocacola at some point

Of course, this is subject to change if the practice changes. The aim is to make sure that we balance the percentage of emails that are correct against the number of potential rejections. If 0.00001% of users use a domain without a TLD and 99.9999% of emails without a TLD are invalid, then it makes sense to reject them. We're not trying to find the most perfect solution, just the most convenient solution for users.


How about rejecting all email addresses that contain the character sequence "jex"? 99.9999% of email addresses with "jex" in it are invalid, so it makes sense to reject it, doesn't it?


If a few hundred people a day messed up and typed jex instead of sex, then yes, it would make sense to reject it.


1. That wasn't the argument.

2. No, it still wouldn't make sense to reject existing email addresses if there is a method to figure out whether the email address you are being presented with actually exists instead of divining validity using some unreliable proxy. There is just absolutely no reason to ever reject an email address that you can successfully send emails to.


I mean, in this case there are two reasons. They're both bad reasons, but still.

1. "Our email provider won't send to them". That excuses OP's part in the thing, although now we need to ask why the email provider is being stupid.

2. "We don't do validation links, they cause too many lost users". I have serious problems with this, but from a pure-business standpoint they decided that rejecting valid emails loses fewer users than using account confirmation.

Number two is vaguely horrifying to me, but in terms of "new users gained" it probably works out.


Are they being stupid though? For example, hotnail.com is a parked domain. There's virtually no chance that a user actually has an email address there and sending an email to a wrong email address is bad for the email provider reputation...

I mean it's not like they block a huge amount of domain names but with their volumes, it makes sense to avoid sending emails that will never be received by their intended recipients anyway...

Could it inconvenience legitimate users? Yeah, there's a probability of that, but it's negligible and so far we've never had a complaint about it... On the other hand, we've had users telling us that it was good that our system caught their typo.


This is a fair point. I guess it's the sort of thing that I'd prefer to see fixed with a double-check prompt instead of a rejection, but I doubt it's causing many problems.

I was primarily thinking of short domains like "gmail" and "aol". For those, I can see a company called, say, "Gail" getting blocked from legitimately using "Gail.com". "Hotnail" seems a lot less risky.

Not a big problem, I just have an aesthetic objection to very high-friction things like blocking possibly-legal domains. As a double-check option I wouldn't object, and in fairness it's possible no one has ever been inconvenienced.


> 1. "Our email provider won't send to them". That excuses OP's part in the thing, although now we need to ask why the email provider is being stupid.

Nope, actually, it doesn't. If your reaction to noticing that some service that you are using is incompetent is to adopt the same incompetence, that doesn't excuse anything.

> 2. "We don't do validation links, they cause too many lost users". I have serious problems with this, but from a pure-business standpoint they decided that rejecting valid emails loses fewer users than using account confirmation.

> Number two is vaguely horrifying to me, but in terms of "new users gained" it probably works out.

Well, sure, it's as much a reason as "I don't like your nose!"

If they don't do verification emails, they might as well just not ask for an email address in the first place (or make it optional). Misguided "validation" doesn't help with most mistyped addresses anyhow.


I oversold 'excuses'. Let's say "means the original error lies elsewhere". I also wonder who the hell they're using - who ever heard of an email service that bans domains for being likely misspellings?

The "it's as much a reason" I disagree with. Validating common typos will catch more errors than false positives, so you do get more users through your funnel than if you abandon it. "I don't like your nose" is a strict loss, this causes corrections to get real emails. So they'll still miss most typos, but it's a net gain compared to not doing it.

Of course, again, I don't endorse any of this. Decide if you're ok with bad emails, follow through on that decision, use verification, and get a not-incompetent email service.


I like activation emails, because it shows the website cares about being able to email me. Then again, I'm a technical user.


I like them as well, but for a different reason - I have a way to find what email I used for particular website, if I used it, etc. Doesn't have to be the clicky linky mail, just a confirmation mail will do.

But if you include plain text password in it... ugh.


Oh, we do send a welcome email and actually include a link in the email to subscribe to the newsletter but we don't make it necessary to use the website.


In that case, I'd rather not have the site gather my email address at all. Just create an account with an arbitrary name (or no account at all) and let the user use the web site. As soon as an email address is connected to the account (or the user name itself is an email address), I'd rather have it verified.

But I can see how there may be other concerns (commercial or not) that interfere with this.


Users remember their emails, they don't always remember their nicknames especially if they had to select an alternative nickname because the one they usually chose was already used...

Honestly, it's ruthlessly pragmatic business reasons. Is it ideal? No but in the end, it's the most painless for most users...

And, also most users on that particular site have little computer experience, so it does affect how they react and how we work. If it were something targeting the HN crowd, I'd probably have enforced email validation and would see a much lower drop in conversion due to it because people on HN are used to that and do not have a problem with it.

So, validations and UI workflow have to be adapted to your audience and your business and that's the main point really.

In this particular case, we did experiments with enforcing email validation for a subset of new signups or even allowing a small number of signups to signup with a username instead of an email. So we do have data...

Most of the work I do for other customers doesn't touch those areas of their app, so I only have relevant on hand experience on that particular site but I think it's important to mitigate the recommendation of "Always validation emails" or "Always use regexp" and try to think of the best experience for the kind of users you're targeting.

Sometimes in HN, people talk in absolutes when things are instead very context-dependent.


> We do not do email activation by forcing people to click a link in their email to validate that they received it

Have you considered doing email validation without forcing users to comply? It's fairly common to require valid emails after X days, or in order to unlock N features. You get the advantages of having valid emails without the disadvantage of a drop in signups.


TL;DR the odds that the user entered an incorrect-but-valid address are way higher than that they entered one which will not actually be able to receive mail. Send a validation email.


Also relevant was, "If you have a well laid-out form with a label that says “email”, and the user enters an ‘@’ symbol somewhere, then it’s safe to say they understood that they were supposed to be entering an email address."

In other words, it does make sense to check that they entered an '@' symbol somewhere, since it shows that they understood it was an email field. Any 'validation' beyond that is useless.
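In code, that whole philosophy reduces to almost nothing; a sketch (the real validation being the confirmation email itself):

```python
def plausible_email(s):
    """Accept anything shaped like local@domain; nothing more.

    Deliberately permissive: quoting rules, multiple @s in quoted
    localparts etc. are the mail server's problem, not the form's.
    """
    local, sep, domain = s.rpartition("@")
    return bool(sep and local and domain)
```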


How about more than one @ sign? Does the email address spec exclude the possibility of more than one @ symbol?


It doesn't matter. If they have typed an @, they probably understand that it's an email field. Trying to validate to the spec beyond that is pointless, for reasons thoroughly covered in the article.


Nope, that's valid if one of them is quoted: "very.unusual.@.unusual.com"@example.com


It actually used to (maybe still does) signal routing information[1]. An example from the RFC is "@ONE,@TWO:JOE@THREE" [1]: https://tools.ietf.org/html/rfc821


The local-part may contain a quoted @: "@"@example.invalid is a valid email address.

@ signs are also permitted in comments: (ted@home)ted@example.invalid is also a valid email address.


For flip's sake, yes please! Close the loop, fer crying out loud.

I got a popular givenname.familyname@gmail.com address and I frequently get mail that's meant for other people who share my name. The vast majority of the time it's the individual themselves who signs up for a service or offering, but there's rarely any validation upfront.

The best emails are the ones who extend full trust to the email recipient over some account during that first email. Facebook, shame on you.


> The best emails are the ones who extend full trust to the email recipient over some account during that first email. [Random company,] shame on you.

Well what else would you have them do? Have people enter their address and send letters there to do a password reset?

A confirmation email before full trust is going to do little: a malicious person would just click that link, right?


Someone could have just created an account and accidentally put the wrong email address in. Therefore, putting a link that extends full trust in the welcome/confirmation email is a design error. The fix for this error is to require the newly created account holder to put in their password on this first login. While the account's in this state it shouldn't be possible to reset the password.


Address validation by sending an email should only be used if it is required for some reason to verify the user owns the email account. Otherwise, it's not a great UX.


I honestly can't think of a reason you'd ask a user for their email but not need to validate it.

For being able to do password resets later, permission to add to mailing list, avoiding sending private info to the wrong user, avoiding allowing someone to masquerading or impersonate someone they're not.. All should be validated.

If you're looking for a username as login identity and nothing more (and you don't have password reset functionality), then ask for a 'username', not email addresses.


"For being able to do password resets later" - don't need to validate.

Mailing list - reluctant OK

"avoiding sending private info to the wrong user" - how could this happen?

"avoiding allowing someone to masquerading or impersonate someone they're not" - how could this happen? Like if I signed up as tim@apple.com? What could realistically happen?


email addresses for usernames have the advantage of already being unique. None of this "gregmac already taken, try gregmac23, gregmc_595, or verbingnounXX instead?" nonsense


I agree, but in that case, you should be validating the e-mail address (which was my original point).

Failure to do so means:

- The user who actually owns that e-mail can't login

- The user who signs up can potentially impersonate the real user (depending on what your app does)

- The user who signs up can't reset their password

- The user who actually owns the e-mail can take ownership of the account (by resetting password)


Fair points.

One argument against emails-as-names is that some people change email addresses frequently - unless you have some way of accounting for that (and you definitely wouldn't use the username as the contact field), the username will eventually go stale for some users.


You should allow changing the e-mail address. Of course, changing it requires the same validation, but I'm not really sure why there'd be an argument against it (though I do note many sites/applications don't have a mechanism for this).

At the same time, as a user, you need to consider your own e-mail setup. You should not use your ISP-provided e-mail address whatsoever, there is absolutely no reason to tie your contact to an account you may not keep.

Likewise, you should not be using your school or work-provided addresses for something that you expect to use beyond going to school or having that job.

If you're changing your e-mail address frequently for personal reasons (eg, someone harassing you?), then you should actually consider creating an account you use only for authentication, and a separate one that is your public-facing persona.

Having separate addresses for your contact and sign-ups is actually not a bad idea anyway. It actually makes it a bit harder for someone to break into your account somewhere (because they don't necessarily know what address you used), and you get the freedom to change/abandon your public e-mail address if necessary.


Another argument against: some couples share email accounts but may want separate accounts on your service. Yes, this happens.


> email addresses for usernames have the advantage of already being unique

Not true. There are people who share an email address.


I mean unique as far as your end is concerned. Short of very invasive technology, there's no way to stop people from sharing a user token if they choose to do so.


Except, it's not a "user token", it's an address. It's as much a user token as a postal address or a telephone number is a "user token": not at all. It's a way to contact a person, not a way to identify a person. Just because you can enforce that only one person with a given postal address can create an account with your service, doesn't make it an inherent "user token".


Yep.

I designed a system once around the assumption of a 1:1 mapping between people and email addresses. I will never design another system that way.


Sure, but what if two humans share an email address, but both of them want to sign up for your game/service/whatever independently? This is a fairly common situation if you are (say) designing a game that might appeal to kids. Or old people.

If you block that, you're going to lose users. Either because they just can't sign up at all, or because there's too much unnecessary friction in requiring someone to go out and sign up for an email address just to register for your service.


I have family who share one email between two spouses. They might maintain separate personal email addresses; probably not, though.


Care to expand? As in having a first.last@example versus firstlast@example


Probably the most common example these days is spouses who don't email regularly.

Once upon a time, you only got one or two email addresses from your email provider (I'm looking at you, CompuServe and AOL), so at that time it was common for everyone in a family to have a single address. Eventually they added explicit multi-user features -- IIRC, AOL went as high as 5 emails per account before my family moved on.


As in bloggsfamily@localisp.net.au that both parents and all the younger sprogs share. Still common as anything.


One of the things I loved about the original Mad Max film was that they nicknamed their own toddler 'Sprog'...


I always assumed it was more a sanitization issue for security's sake. By allowing only a simple ("common") subset of email addresses, you can be agnostic about what email server is running and how it reacts to the wide variety of specially crafted email addresses.

With no validation other than sending the email, you have to know, for example, what the server would do with an email address that claims to be @localhost. Now it becomes a problem- or at least a question and concern- for the backend system. Whether the backend interprets root@localhost as valid and does exactly what it's told or rejects it due to some configuration- it has become a backend complication and a DoS attack vector.

A simple policy of only handling a subset- the common class of email addresses- is one of the things that allows us to have a simple mental model of what the MTA is supposed to do. The fact that it sometimes caught a typo (or didn't) is incidental. "Invalid email" wasn't meant to imply the email address doesn't fit the spec- it was meant to imply that a particular site or service has chosen not to accept email addresses like that.

Or at least that's what I assumed :-)


> I always assumed it was more a sanitization issue for security's sake.

Sanitization is at best idiotic, at worst it creates security problems. There is no such thing as "bad characters", there is only broken code that incorrectly encodes stuff. If you ever find yourself modifying user input "for security reasons" (or really, for any reason at all), you are doing it wrong. The only sane thing to do is to make sure that the semantics of every single character of your user's input is preserved in whatever data format you need to represent it in.


An email address isn't a document though, it's a routing command. I don't mean sanitization in the sense of inserting backslashes. I mean sanitization in the sense of "we don't allow people to set their email address to a mailbox on localhost at our mail server."


1. Sanitization generally means changing information. As in, "removing bad characters", that kind of stuff. That's different from validation, which should result in rejection of bad input, and which can be perfectly fine. However, more often than not, validation is implemented badly and rejects perfectly fine input, which is why validation shouldn't be employed more than necessary either.

2. Rejecting @localhost addresses doesn't really make a whole lot of sense. People could just enter the public IP address or hostname of the server, or add a DNS A record under their own domain that points to 127.0.0.1, or an MX record that points to localhost, or any number of other weird stuff that you could not possibly validate anyway (if only because it could be changed at any point later on). Just configure your mail server properly and then send the damn email, and if it does get sent to root@localhost, and possibly forwarded to the admin--so what? People obviously could just sign up using your admin's email address anyway, and not only at your site but at millions of sites out there; you won't be able to stop them. There is nothing particularly dangerous about receiving unsolicited signup emails or about sending emails to yourself.


> There is nothing particularly dangerous about receiving unsolicited signup emails or about sending emails to yourself.

Depends on what you do with them, in the latter case. There could be an amplification attack there.

Validating domain parts to a certain extent isn't a bad idea, at least as far as non-routable domain names and RFC1918 ranges go. I've seen this done (actually implemented some of it, in fact) at a past employer, who were basically looking to cover the 90% case in terms of not getting hosed by a trivial attack. It doesn't take much effort and it makes 4chan's life harder. What's not to like?


> Depends on what you do with them, in the latter case. There could be an amplification attack there.

Huh? How would that work?

> Validating domain parts to a certain extent isn't a bad idea, at least as far as non-routable domain names and RFC1918 ranges go.

What do you mean by "non-routable domain names" and what do you gain by checking for RFC1918 ranges?

> I've seen this done (actually implemented some of it, in fact) at a past employer, who were basically looking to cover the 90% case in terms of not getting hosed by a trivial attack.

Why did you prefer that approach over a robust solution?

My idea of a robust solution: Have one central outbound relay that's firewalled off from connecting to anywhere but the outside world, make all servers that need to send email use that relay as a smarthost (so they never connect to anything but that relay, regardless what the destination address is), use TLS and SMTP AUTH with credentials per client server to prevent abuse of the relay by third parties.
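For concreteness, a hedged sketch of what the client-server side of that relay setup might look like in Postfix terms (hostnames and file paths here are made up for illustration; the firewalling is configured separately):

```
# /etc/postfix/main.cf on each client server (hypothetical hostnames)
relayhost = [relay.internal.example]:587      # all outbound mail goes via the smarthost
smtp_tls_security_level = encrypt             # require TLS to the relay
smtp_sasl_auth_enable = yes                   # SMTP AUTH with per-server credentials
smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
smtp_sasl_security_options = noanonymous
```

The point of the brackets in relayhost is to skip MX lookups entirely, so the client server never decides routing based on the destination address.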

> What's not to like?

(a) that it's a lot easier to build a solution that's more robust, (b) it's extremely likely that your implementation is buggy, thus rejecting valid addresses, and (c) it's causing a maintenance burden (what happens when the first people drop IPv4 for their MXes? I'd pretty much bet that you don't check for AAAA records, so you'd probably suddenly start rejecting perfectly fine email addresses, thus making the transition to IPv6 unnecessarily harder, am I right?).


'Non-routable' as in a single label, or as in not resolvable. I don't think it is unreasonable to consider an address invalid when its domain part cannot be resolved. Checking for RFC1918 ranges means you don't try to send to another class of addresses that's never going to be received.

You would lose the bet. The product supported IPv6 from day one.

That is a robust, if somewhat complex, solution for a relatively small volume of mail. When you're sending ten million messages a day by the end of the first month, pushing everything into a single relay of any kind is asking for a lot of trouble.


> 'Non-routable' as in a single label, or as in not resolvable. I don't think it is unreasonable to consider an address invalid when its domain part cannot be resolved.

What exactly do you mean by "cannot be resolved"?

> Checking for RFC1918 ranges means you don't try to send to another class of addresses that's never going to be received.

But why check for it? Is that actually a common mistake people make? An attacker could change the address after you checked it, so it's not going to help against attackers, is it?

> You would lose the bet. The product supported IPv6 from day one.

Good for you! :-)

> That is a robust, if somewhat complex, solution for a relatively small volume of mail. When you're sending ten million messages a day by the end of the first month, pushing everything into a single relay of any kind is asking for a lot of trouble.

Complex? Certainly less so than implementing validation yourself.

As for scalability: Well, yeah, as described that's more the setup for a company that's operating various different services, none of which has a high volume of outbound email (which would be most, even the best startups don't have ten million signups per day and don't send much email otherwise, and even that should actually still be manageable with a single server).

But that's trivial to adapt without changing the general approach. First of all, obviously, you could just add more relay servers and have client servers select one randomly, that scales linearly. But if you really need to move massive amounts of email for one service, so that adding an additional relay hop for each email you send actually adds up to noticeable costs, you can still use the same approach: Just put the MTA onto the same machine(s) that the service is running on, into its own network namespace (assuming Linux, analogous technology exists on other platforms), and firewall it off there so it cannot connect to your internal network. Potentially you can even just add blackhole routes for your internal networks/RFC1918 ranges, so you would not even need a stateful packet filter (though currently you might still need it due to IPv4 address shortage).


"Cannot be resolved" means NXDOMAIN.

Why assume email addresses only get checked in one place, and not all?

Ten million a day was a milestone. I left that company over a year ago; it would astonish me to find that figure now exceeded by less than a factor of twenty. Granted these are mostly not signups. They are outgoing emails nonetheless, which makes the case germane despite that superficial distinction.

Your proposed solution sounds pretty expensive in ops resource, to no obviously greater benefit than the rather simple (well under one dev-day) option we chose. You seem to feel yours is strongly preferable, but I still don't understand why.


NXDOMAIN can be a temporary error. The SMTP queuing protocol is designed to be resilient against DNS failures, internet outages, routing problems, and temporary mail delivery issues.


> NXDOMAIN can be a temporary error.

Unless some DNS server is broken, it actually cannot. NXDOMAIN is an authoritative answer that tells you that the domain positively does not exist. Not to be confused with SERVFAIL, which you should get if the DNS resolver ran into a timeout or got an unintelligible response or whatever, NXDOMAIN should only occur if the authoritative nameserver of a parent zone explicitly says "I neither know this zone locally nor have a delegation for it".


> "Cannot be resolved" means NXDOMAIN.

OK, that at least shouldn't reject any valid addresses, so maybe ...

> Why assume email addresses only get checked in one place, and not all?

It's not an assumption, it's just a matter of simplicity and reliability.

> Ten million a day was a milestone. I left that company over a year ago; it would astonish me to find that figure now exceeded by less than a factor of twenty. Granted these are mostly not signups. They are outgoing emails nonetheless, which makes the case germane despite that superficial distinction.

Well, no clue how well common MTAs would cope with that, but 2500 mails per second could still be within the power of a single machine if it's designed for high performance. But regardless, I don't think that really matters: If you have a relatively low volume of emails, it's probably most efficient and secure to handle it all with one central outbound relay, if you need to send lots of emails, it obviously makes sense to distribute the load, but that shouldn't really otherwise change the strategy.

> Your proposed solution sounds pretty expensive in ops resource, to no obviously greater benefit than the rather simple (well under one dev-day) option we chose. You seem to feel yours is strongly preferable, but I still don't understand why.

What sounds expensive about it?

Whether there is any benefit to it: Well, depends on your goals and what exactly your solution actually does. I still don't understand why (or even how exactly) you do those RFC1918 checks, for example?! It seems like it's mostly a security measure? But then, it's not actually secure, it's essentially a race condition/TOCTTOU. Plus, it might even break valid email addresses.

Essentially, it's four factors why I think just delegating the validation of email addresses to the mail server is the best strategy:

1. Implementing your own checks risks introducing additional mistakes (which might lead to the rejection of valid addresses).

2. Implementing your own checks is additional work when you could just use the MTA which already knows how to do this (and which you have to install/configure/use anyway as soon as you want to actually use the email address), both for the initial implementation, and possibly for subsequent maintenance (if you just let your MTA do the work, only the MTA needs to be adapted to any changes in how emails get delivered, abstracting away the problem for any software that's supposed to be sending emails and isolating it from the lower layers).

3. My approach actually gives you the perfect result, in that it does not reject any valid addresses (assuming your MTA implements the RFCs correctly), and at the same time is perfectly secure against all possible abuses with weird addresses, which is impossible to achieve when you separate the check from the actual abuse scenario, and it's even kind of trivial to see that that is the case.

4. You have to have the infrastructure to deal with bounces anyhow, both because you ultimately cannot be sure an address actually exists unless you have successfully delivered an email to it, and because addresses that once existed might not exist anymore at a later time, so it's not like you can avoid that if only you validate your addresses better.

As for "ops resources": Assuming that any service that needs to send > 10 million mails per day will be deploying machines automatically anyway, what's so much more expensive about deploying the configuration of an additional network namespace? Writing that script certainly shouldn't be more than a day of work either, should it?


I may have erred in giving the impression that the application-level checks are the only line of defense here. They're not. The (bespoke) MTA underlying this product performs most if not all of these checks as well. I didn't really spend any time on that side of the business, so I might be wrong about that, but it would be something of a surprise. I do know our analytics needed to be able to cope usefully with an astonishing panoply of bogosity warnings that came back from the MTA, but I no longer recall exactly what they covered. And, in any case, it's nice when you can to tell the user "hey, this isn't deliverable" before it gets to the point of a bounce.

Checking whether an email address's domain-part is an RFC1918 IP is actually pretty easy. Split the address by '@'. The last piece is the domain part. Split it by '.'. If there are four pieces, all of which meaningfully cast to integers, treat it as an IP address. (Otherwise it's a domain name, which is fine as long as it has more than one part and isn't NXDOMAIN when the backend tries to resolve it.) If any part is negative or greater than 255, it's invalid. If it starts with [10] or [192, 168], or if it starts with [172] and the second part is between 16 and 31 inclusive, it's an RFC1918 address. Otherwise, it's fine.

Even with unit tests, that takes almost no time to write, and when your frontend and backend share a language as ours did, you can use the same logic both places. Node gives you a name resolver binding for free. I really can't imagine it being as quick to write, test, validate, and roll out a change to the MTA node manifest.
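The check described above translates pretty directly into code. A sketch in Python rather than the Node the parent used, following the same rules (split on '@', treat a dotted quad specially, range-check the private blocks):

```python
def is_rfc1918(domain: str) -> bool:
    """True if the domain part is a literal RFC1918 (private-range) IPv4 address."""
    pieces = domain.split(".")
    if len(pieces) != 4 or not all(p.isdigit() for p in pieces):
        return False  # not a dotted quad; treat it as a domain name instead
    octets = [int(p) for p in pieces]
    if any(o > 255 for o in octets):
        return False  # not a valid IPv4 literal at all
    return (
        octets[0] == 10                                  # 10.0.0.0/8
        or octets[:2] == [192, 168]                      # 192.168.0.0/16
        or (octets[0] == 172 and 16 <= octets[1] <= 31)  # 172.16.0.0/12
    )

def domain_part(address: str) -> str:
    # the last '@' separates the local part from the domain
    return address.rsplit("@", 1)[-1]
```

(A fuller version would also handle bracketed address literals like `user@[10.0.0.1]` and IPv6 literals, which this sketch ignores.)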


> Checking whether an email address's domain-part is an RFC1918 IP is actually pretty easy.

What I still don't understand: Why do that at all? What's the point of this? Security? Preventing user errors? What else?


Little of both.

We were pretty sure it would already be impossible, or nearly so, for a malicious user to probe our infrastructure this way, but when it's so simple to be even more sure, why not?

Similarly, we'd already observed a low but nonzero rate of users inadvertently providing such addresses - not during signup or onboarding so much, but in recipient lists they submitted. Since we used the same recipient checking code everywhere, why not cut that back to zero, too?


Then reject email addressed to localhost. It shouldn't matter how the email got there. I'd suggest that especially given DNS trickery involving setting up a low TTL then redirecting to 127.0.0.1, you're probably not preventing this from happening or you'd have to invalidate any unrecognised domain. Better to solve that problem at a different layer -- validate the email by sending a validation link if you must...


True, but that's my point- it's a backend issue, not a front-end "help the user" issue.


And their point is, it's a backend issue, the backend being the mail client/server that already completely handles the sending/receiving of emails. Either the activation link gets clicked or it doesn't. The click is the only correct validation, and yes, the whole process happens on the "backend".


I mostly agree but there are cases where some definition of sanitization is the only appropriate thing. For example, if you allow users to create content with a lightweight subset of HTML for the sake of formatting control and want to render that HTML in your page. And in such cases, the correct way to sanitize it is not via regexps but via a DOM parser that takes user input and builds a DOM and then emits rendered HTML according to a whitelist of available tags/attributes. So you might argue DOM parsing isn't sanitization and so still matches your assertion, however, in general it's common and not really inaccurate to call this sanitization.


Well, it depends ... ;-)

The important thing is to not change information. "Sanitization" as it is commonly used means doing something that (potentially) changes information. Which is in contrast to decoding/encoding/parsing/unparsing/translation/..., which, if done correctly, change representation, but not information.

So, to make it a useful distinction, I would call anything that potentially changes the semantics of the processed data "sanitization", and avoid using the term for anything else.

So, simply parsing a string with an HTML parser, possibly checking for acceptable elements, and then serializing back into some sort of canonical form that is semantically equivalent to the input, that's perfectly fine, and I wouldn't call that sanitization, but rather validation and canonicalization.

If you simply start dropping elements, though, that's probably a bad idea, just as simply dropping "<" characters is a bad idea, because those elements presumably bear some semantic meaning, just as a "<" in a message presumably bears some semantic meaning.

Now, it is not always obvious which level of abstraction to evaluate the semantics (and thus the preservation of semantics) at. So, it might be perfectly fine, for example, to remove or replace some elements where the semantics are known and you can show that, say, removing emphasis still generally preserves the meaning of a text.

But a whitelist approach where you simply remove everything that isn't on the whitelist usually is a bad idea. If you want to have a whitelist, use it for validation, and reject anything that's not acceptable, so the user can transform their input in such a way as to avoid any constructs you don't want, while still retaining the meaning of what they are trying to say.
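The validate-don't-strip idea can be sketched with Python's stdlib HTML parser (the whitelist here is hypothetical): instead of silently dropping unknown tags, record them so the whole input can be rejected with a useful error.

```python
from html.parser import HTMLParser

ALLOWED = {"p", "em", "strong", "a", "ul", "ol", "li"}  # made-up example whitelist

class WhitelistValidator(HTMLParser):
    """Validate, don't sanitize: collect tags outside the whitelist
    so the caller can reject the input whole instead of mutating it."""
    def __init__(self):
        super().__init__()
        self.rejected = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            self.rejected.append(tag)

    def handle_endtag(self, tag):
        if tag not in ALLOWED:
            self.rejected.append(tag)

def validate(markup: str) -> list:
    v = WhitelistValidator()
    v.feed(markup)
    v.close()
    return v.rejected  # empty list means the markup is acceptable as-is
```

The user then gets told which constructs to remove, and whatever they finally submit is reproduced with its meaning intact.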


I hear what you're saying and it represents an ideal. But there are circumstances where information really has to be removed. Perhaps because the user is no longer present and it was collected under circumstances that had more liberal validation. Or because you're handing information across a boundary of implementation ownership and can't trust the receiver to handle potentially dangerous information correctly. I agree that sanitization (in the sense of stripping bits out of a user data payload according to some security rules) shouldn't be the first tool in the toolbox, but I would really hesitate to say it's always the wrong thing to do.

Edit: here's a good example. I don't know if they still do this, but when I worked at Yahoo!, they used a modified version of PHP that applied a comprehensive sanitization process to all user inputs. As a frontend coder at Y!, all the information you pulled from request parameters, headers, etc, ran through this validation at the PHP level before your app code got to it. You can then literally splat this information into an html page raw, without any further treatment, and not expose the Y! property you work for to an XSS or other injection vectors. There were ways to obtain the raw input using explicit accessors when needed, and these workarounds were detectable by code monitoring tools and had to be reviewed and approved by security team(s). Overall this worked really quite well, in my opinion. Y! could hire junior frontend devs without deep knowledge of data encoding, security issues, etc etc and rest easy. I think the principle of safe-by-default, even if it means destruction of user input in some cases via aggressive sanitization, is a good principle to apply to a frontend framework.


re edit:

No, that's just a terrible idea. It might work quite well in the sense that it prevents server security holes. But it makes for terrible usability, and potentially even security problems for the user. The user expects that their input is reproduced correctly, and if it isn't, that can potentially have catastrophic consequences because it might result in silent corruption.

See also: http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_...

If you think that it should be possible to have developers who don't understand this problem and its solution, you might as well argue that it should be possible to have developers who don't know any arithmetic or who are illiterate.

If you want to have some help from the computer in avoiding injection attacks, the solution is a type system that ensures that you cannot accidentally use user input as HTML or SQL, for example, or possibly automates coercion when you need to insert pieces of one language into another.
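That type-system idea can be made concrete with a small sketch (names here are invented; real implementations of this pattern exist in many template engines as "markup-safe" string types): raw strings simply cannot be concatenated into a page without going through the escaping function.

```python
import html

class Html:
    """A string that is known to be safe to embed in an HTML page."""
    def __init__(self, markup: str):
        self.markup = markup

    def __add__(self, other: "Html") -> "Html":
        if not isinstance(other, Html):
            # raw str never silently becomes markup; escape() it first
            raise TypeError("can only concatenate Html with Html")
        return Html(self.markup + other.markup)

def escape(user_input: str) -> Html:
    """The only door from untrusted text into Html: escaping."""
    return Html(html.escape(user_input))

page = Html("<p>") + escape("<script>alert(1)</script>") + Html("</p>")
```

The developer doesn't need to remember to escape; the type checker (or a runtime TypeError) catches the mistake for them.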


And that's a great philosophy until your email gets rejected by some service that picked a different "common class of email addresses" than you did. This is precisely why we have written standards.


Back when SMTP servers still had remnants of UUCP etc.- where the address could actually contain characters that specify intermediate servers to route to, I would have argued that front-end sanitization was important as, for example, html sanitization from end-users from a security perspective.

However, the IETF made lots of progress simplifying things- to the point where, at the very least, the standard tells us specifically that we should leave it up to the destination host to interpret the local part of the email address-- that is, the host to the right of the @ should ideally be handed the thing on the left unmolested, even ignored by intermediate relay servers. Since that's what most people complain about, any validation to the left of the @ should become extinct.

But off the top of my head that still leaves the thing on the right of the @ (such as localhost), buffer overflows by allowing longer strings than the standard allows (those limits do exist), and the problem with multiple @ which the MTA may or may not handle well... Since I'm not a security expert I'm going to go out on a limb and assume that I'm missing a bunch of other things.

My point is not that the article is wrong, though- my point is that if he wanted to convince me to only validate by sending the email on any string, he should convince me that those security concerns are not an issue- not that it's not good at catching typos.


I think there is a reasonable middle-ground for validating the domain side of an email address. There are RFCs on all this stuff; it's not just a total free-for-all. The RFCs just aren't nearly as strict as a lot of badly-designed validation regexps are, presumably because most people are unaware of the diversity of acceptable email addresses.

The currently operative RFC is 5322, specifically section 3.4.1: https://tools.ietf.org/html/rfc5322#section-3.4.1

There are some basic rules that you could safely apply to an address, which would prevent some attacks (buffer overflows, etc.) while also not blocking any legitimate addresses. E.g. limiting the overall length to 255 characters, for instance, could be defensible practice. There are also well-defined rules for validating the domain portion, since it has to be a routable address by definition.

What nobody ought to be doing is looking too hard at the string to the left of the @ symbol, because it's designed purely as instructions to the recipient server. Nobody else needs to care about it; only the receiving mailserver needs to actually parse that part of the address, in order to put the message into the right mailbox. From what I've seen, the vast majority of false-positive validation failures occur because people are looking at the mailbox portion of an email address when they have no business doing so.
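In that spirit, a deliberately permissive sketch (my own summary of the above, not a spec-complete validator): bound the length, require one '@' split with something on each side, and leave the local part entirely alone.

```python
def plausible_address(address: str) -> bool:
    """Minimal checks that shouldn't reject any legitimate address."""
    if len(address) > 255:  # the defensible overall length bound discussed above
        return False
    # the last '@' separates local part from domain
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        return False
    # Deliberately no inspection of the local part: only the receiving
    # mailserver owns its semantics. Domain checks (DNS etc.) go elsewhere.
    return True
```

Anything past this (does the domain resolve, does the mailbox exist) belongs to DNS lookups and the confirmation email, not to string inspection.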


Your security model is garbage if you depend on controlling all apps that might send email to your mail server.


Hmm, sorry but I don't buy that the "correct way to validate" is not to validate the input.

Email addresses aren't a special enough case to be handled differently than any other user input, which we always validate to both sanitize and show client-side errors, if nothing else.

Sure, the complete regex is complex, but it is defined and is hardly unconquerable. Look at Django's `EmailValidator` implementation for example [0] that is mature and well tested [1].

The author has not convinced me that ignoring validation is the right choice when options with a scope so thorough exist.

[0]: https://github.com/django/django/blob/master/django/core/val...

[1]: https://github.com/django/django/blob/a9215b7c36bff232bcc941...


> Email addresses aren't a special enough case to be handled differently than any other user input, which we always validate to both sanitize and show client-side errors, if nothing else.

Yes, they aren't special, and you shouldn't validate any of the other stuff either, unless you actually need to understand the semantics. So much breakage happens because people think that they need to validate all kinds of stuff, it's not funny. People have their legal real names rejected because they're "not a valid name", people have their correct postal address rejected because it's "not a valid postal address" ... just forget it, if someone tells you their name, just believe them it's their name, if someone tells you their postal address, just believe them it's their postal address, if someone tells you their email address ...

Oh, and don't ever think about "sanitizing" stuff. Just don't. If you think you need it, you are doing something wrong. The solution to SQL injection is not to disallow people using "'" characters in messages, the solution to cross site scripting is not to prevent people from using "<" characters in comments, ... that's what encoding/escaping is for.
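A quick Python illustration of that last point: parameter binding and output escaping preserve every character of the input, with no injection, so there is nothing to "sanitize" away.

```python
import html
import sqlite3

user_input = "Robert'); DROP TABLE students;-- <b>"

# SQL: bind parameters instead of stripping quote characters
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (body TEXT)")
conn.execute("INSERT INTO messages (body) VALUES (?)", (user_input,))
(stored,) = conn.execute("SELECT body FROM messages").fetchone()
assert stored == user_input  # every character preserved, nothing executed

# HTML: escape on output instead of dropping '<' on input
rendered = html.escape(user_input)
```

The stored value round-trips exactly; the rendered value displays exactly what the user typed, just encoded for the HTML context.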


> Sure, the complete regex is complex, but it is defined and is hardly unconquerable.

If it is a regular expression, then it is not able to match all valid email addresses, because the grammar of email addresses is context-free, and regular expressions can only match regular grammars. It doesn't matter if it is defined or not: if it's a true regular expression, then it simply cannot validate email addresses.

(it may, of course, be a context-free expression masquerading as a regular expression)

I wonder if the django validator will choke on perfectly valid email addresses such as (this)"()<>[]:,;@\\\"!#$%&'-/=?^_`{}| ~.a"(is)@(valid)example.org(honest)

I suspect that it will, but of course I could be wrong.


> the grammar of email addresses is context-free

I don't think you're really correct about "email addresses" being context-free, or at least, citation, please?

When I look at a generic "email address" entry field on a random form on the Internet, say on the sign-up page for some hot new startup's service, I expect it to take what RFC 5322 §3.4.1[1] calls an `addr-spec`; specifically, I don't ever expect such fields to take the grammar of what that RFC calls an `address`. I don't think most people are going to think they can enter that, nor would most programmers even think to implement it. And I certainly wouldn't want to try explaining it to a PM…

If you accept that assumption, what about `addr-spec` isn't regular?

Also, using that assumption, your "perfectly valud email addresses such as …" would appear to not be valid, as it has unbalanced quotes. (In fact, even under the grammar of `address`, I'm not sure it's valid; it feels like it should be invalid for the same reason, but I've not rigorously checked this.)

[1]: https://tools.ietf.org/html/rfc5322#section-3.4.1


> I don't think you're really correct about "email addresses" being context-free, or at least, citation, please?

> When I look at a generic "email address" entry field on a random form on the Internet, say on the sign-up page for some hot new startup's service, I expect it to take what RFC 5322 §3.4.1[1] calls an `addr-spec`; specifically, I don't ever expect such fields to take the grammar of what that RFC calls an `address`.

Well, sure. Let's look at what RFC 5322 defines as an addr-spec[1]:

    addr-spec       =   local-part "@" domain
And how does it define a local-part?

    local-part      =   dot-atom / quoted-string / obs-local-part
Let's ignore quoted-string and obs-local-part for the moment. What is a dot-atom?

    dot-atom        =   [CFWS] dot-atom-text [CFWS]
And what is CFWS?

    CFWS            =   (1*([FWS] comment) [FWS]) / FWS
What's a comment?

    comment         =   "(" *([FWS] ccontent) [FWS] ")"
So far, all of this has been matchable with a regular expression. But what's a ccontent?

    ccontent        =   ctext / quoted-pair / comment
See that there? A comment is composed of a balanced pair of parentheses around, perhaps, another comment! Thus (this (is (a (heavily (commented (email \(address))))))foo@bar.example(some more (to prove (the point))) is a perfectly viable RFC5322 address!

Pair-balancing, of course, is impossible with regular expressions, since matching nested pairs requires push-down automata (which recognize context-free grammars) and cannot be done with finite-state machines (which recognize only regular languages).

QED.
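To make the nesting argument concrete, here is a minimal sketch (plain Python, not from the thread) of the bookkeeping a matcher needs for RFC 5322 comments; the depth counter is exactly the unbounded state a finite automaton cannot keep:

```python
def skip_comment(s, i):
    """Consume an RFC 5322 comment starting at s[i] == "(" and return
    the index just past its matching ")".  The depth counter is the
    unbounded state a true regular expression cannot track."""
    assert s[i] == "("
    depth = 0
    while i < len(s):
        c = s[i]
        if c == "\\":          # quoted-pair: the next char is literal
            i += 2
            continue
        if c == "(":
            depth += 1
        elif c == ")":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    raise ValueError("unbalanced comment")
```

On `(this (is (nested)))`, the counter reaches 3 and winds back down; no fixed number of regex states can do that for arbitrary depth.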

> Also, using that assumption, your "perfectly valud[sic] email addresses such as …" would appear to not be valid, as it has unbalanced quotes.

Nope, there are no unbalanced quotes in (this)"()<>[]:,;@\\\"!#$%&'-/=?^_`{}| ~.a"(is)@(valid)example.org(honest): the first quote balances with the third, while the second quote is one of a quoted pair \" (which is allowed within a quoted-string, which is allowed within a local-part). It's all allowed per the spec.

I'll admit that it's a bit surprising, but it's true. One simply cannot match a valid RFC5322 addr-spec with a regular expression. One can, of course, match it with something which pretends to be regular but isn't really (as I noted).


Oh, shoot, I missed that. I saw "FWS" (meaning "folding white space"), and assumed that didn't include comments since they have nothing to do with folding whitespace.

Well done.


No worries. The RFC is … complex.


I'm not sure if they are context-free, but talks about Parsing Expression Grammars, specifically Lua's LPeg, imply they might be. Conceptually, PEGs are to context-free grammars as regexes are to regular expressions.

You can find multiple short email validation snippets using Lua LPeg pretty easily, but this is from the Lua creator's talk about LPeg which includes a part about proper RFC822 validation and how complicated it is for regex, but can be concisely done with PEGs.

http://program-transformation.org/pub/WGLD/Austin2012/Robert...

Here is the entire PEG slide. (Looks CFG-ish to me.)

    address <- mailbox / group
    group <- phrase ":" mailboxes? ";"
    phrase <- word ("," word?)*
    mailboxes <- mailbox ("," mailbox?)*
    mailbox <- addr_spec / phrase route_addr
    route_addr <- "<" route? addr_spec ">"
    route <- ("@" domain) ("," ("@" domain)?)* ":"
    addr_spec <- local_part "@" domain
    local_part <- word ("." word)*
    domain <- sub_domain ("." sub_domain)*
    sub_domain <- domain_ref / domain_literal
    domain_ref <- atom
    domain_literal <- "[" ([^][] / "\" .)* "]"
    word <- atom / quoted_string
    atom <- [^] %c()<>@,;:\".[]+
    quoted_string <- '"' ([^"\%nl] / "\" .)* '"'


> ...@(valid)example.org(honest)

Are those parens really permitted in the host name? I'm looking at RFC 3696.

> 3. Restrictions on email addresses

> .... The syntax of the domain part corresponds to that in the previous section.

and

> 2. Restrictions on domain (DNS) names

> Any characters, or combination of bits (as octets), are permitted in DNS names. However, there is a preferred form that is required by most applications. This preferred form has been the only one permitted in the names of top-level domains, or TLDs. .... The LDH rule, as updated, provides that the labels (words or strings separated by periods) that make up a domain name must consist of only the ASCII [ASCII] alphabetic and numeric characters, plus the hyphen. No other symbols or punctuation characters are permitted, nor is blank space.

https://tools.ietf.org/html/rfc3696#section-2


> > ...@(valid)example.org(honest)

> Are those parens really permitted in the host name? I'm looking at RFC 3696.

RFC5322 states that an addr-spec consists of a local-part, followed by @, followed by a domain. It states that a domain may be a domain-literal; a domain-literal may begin and end with commented folding whitespace (CFWS). It's all in https://tools.ietf.org/html/rfc5322#section-3.4.1

It's only the dtext portion of the domain-literal which must consist of printable ASCII characters, not including [, ] or \.

Yes, foo@bar!baz!quux is a valid RFC5322 email address (decimal 33, i.e. !, is allowed per the dtext production of the spec), and so is foo@(comment (nested comment (escaped \(comment)))bar!baz!quux(another (comment (to prove) the point))

Whether one should ever actually use such an address is, of course, another matter entirely.


Thank you for that; this has been eyeopening.

The possibility of having a valid email address where the second-level domain component can't actually be registered as a domain, to the best of my understanding, eg "bar!.com", is interesting.


> The possibility of having a valid email address where the second-level domain component can't actually be registered as a domain, to the best of my understanding, eg "bar!.com", is interesting.

It is! Conceivably, it could be used to implement a new kind of domain-less mail system, e.g. foo@$megamail or something.


I wonder if the django validator will choke on perfectly valid email addresses such as (this)"()<>[]:,;@\\\"!#$%&'-/=?^_`{}| ~.a"(is)@(valid)example.org(honest)

If it doesn't, remind me to patch it to not accept that.

I have strong views on email validation, and they include telling people who use that kind of address to go register on someone else's site.


> If it doesn't, remind me to patch it to not accept that.

You'd be wrong to do so. The whole point of RFCs and Standards is to take things out of the realm of personal preference.

Also, I suspect that a validator which allowed reasonable addresses like jim(somesite)@foo.invalid (which is both a good use of comments and what the + hack emulates) or "Ted Smith"@bar.invalid or "work@home"@jobs.invalid or "William \"Bill\" Jones"@baz.invalid but disallowed unreasonable ones would also be able to solve the halting problem. How many perfectly reasonable allowed features must one use before it's considered abuse?

> I have strong views on email validation, and they include telling people who use that kind of address to go register on someone else's site.

An email address is, simply, an address which can be given to an Internet mail server in order for it to properly route an email. All of the addresses I gave are perfectly valid ways to do so. An address is a property of the addressee, not of the addressor.

In other words: it may be your site, but it's my address.


There are already things that are technically allowed by RFCs for various types of input but should be disallowed because they cause problems. So something being in an RFC doesn't create any kind of binding requirement, and very often the things that get dropped are things that, on contact with the real world, turned out to be bad ideas.

Many of the more arcane things that can technically be done in an email address seem to me to be in the "turned out to be bad ideas" bin, and I have no problem making them be de facto deprecated even if no RFC has yet caught up to that.

Also, no halting-problem issues at all; my preferred approach would be to disallow certain classes of characters.


> There are already things that are technically allowed by RFCs for various types of input but should be disallowed because they cause problems.

That may be true, but it's something else entirely for someone to drop them because the problems they cause … were enabled by him. An email address using characters you don't like is perfectly valid, and perfectly deliverable, right up until you refuse to deliver it. It doesn't cause any problems until you decide to make it cause a problem.

> Many of the more arcane things that can technically be done in an email address seem to me to be in the "turned out to be bad ideas" bin, and I have no problem making them be de facto deprecated even if no RFC has yet caught up to that.

> Also, no halting-problem issues at all; my preferred approach would be to disallow certain classes of characters

I really don't see why you think it's a bad idea to disallow more than alphanumeric (and plus? and ampersand?) characters in the local-part. Cui malo?


I would prefer to allow plus, dot, hyphen, underscore, and alphanumeric characters in the local part and probably not much of anything else.

I prefer this because I work in a world where email addresses are provided to me, and possibly stored and then retrieved and used and even displayed, as text, and the RFCs allow some nightmarish things when you consider the interaction of the syntax the RFCs permit and the set of characters which are sensitive to one or more of the non-MTA components of that chain.

Your original example, for instance, contains characters that require escaping or at least careful handling for multiple situations, and even goes so far as to contain things that will be interpreted as escape sequences in some contexts. I'm sure that Robert(';DROP TABLE users;)(<script type="text/javascript">alert("Bobby Tables")</script>)@not.a.hacker(honest) will be terribly disappointed to know that he needs to use a different address to register on my site. I'll send him over to you instead.


It is true that regular expressions in the CS sense can't parse context-free grammars. However, PCRE, which is what most programmers are talking about when they say "regex", can do so. So you're both kinda right, I guess. But you're being a bit pedantic.


> But you're being a bit pedantic.

Technically-correct is the best kind of correct:-)

But I do think it's important to note, since there really are differences between regular and 'regular' expressions.


> PCRE, which is what most programmers are talking about

I wish that were the case. http://www.regular-expressions.info/refunicode.html

(They used to have a much more useful and concise comparison table but I can't for the life of me find it.)


Interesting; I've never had to think about how regex interact with Unicode.

I guess what I really meant is that programmers are talking about PCRE in terms of power, not in terms of exact syntax. In particular, they have recursive patterns, which are sufficient to pull them up to context-free grammars.


It definitely can get pedantic in a topic like this, but when all of the interesting bits are in the nuances, I don't mind so much.


You are correct:

  >>> from django.core.validators import EmailValidator
  >>> EmailValidator()(""""()<>[]:,;@\\\"!#$%&'-/=?^_`{}| ~.a"(is)@(valid)example.org(honest)""")
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/usr/local/lib/python2.7/dist-packages/django/core/validators.py", line 203, in __call__
      raise ValidationError(self.message, code=self.code)
  django.core.exceptions.ValidationError


Well shoot, that's my email address! /s

Seriously, though, this is why regexp-based validation is Not Even Wrong. It's like showing up to a gunfight with a keen eye for detail.


1. Strip non alphanumerics except +

2. Remove everything between + and @

3. Do validation
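Taken at face value (and assuming the "@" and "." survive step 1, which the list glosses over), the normalization in step 2 might look like this hypothetical helper:

```python
import re


def normalize_plus_address(addr):
    """Drop a Gmail-style "+tag" from the local part, so duplicates
    such as amk+news@example.com and amk+shop@example.com compare
    equal.  Hypothetical helper: other providers use "-" for
    sub-addressing, or no sub-addressing at all."""
    local, sep, domain = addr.rpartition("@")
    if not sep:
        return addr                      # nothing to normalize
    return re.sub(r"\+.*", "", local) + "@" + domain
```

Whether to do this at all is debatable; it silently merges addresses the user may have deliberately kept distinct.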


By all means, validate. But err on the permissive side. The odds are pretty great that you're wrong, and you're preventing real email addresses (and almost certainly real users) by doing so.

I'd never looked closely at Django's email validation, but after looking at it I'm inclined to stop using it. Example reason:

> # max length for domain name labels is 63 characters per RFC 1034

There is a high likelihood that most of your users have shorter domain names, and also a high likelihood that most if not all registrars currently enforce this. But it also used to be highly likely that domains didn't begin with a numeral, and even though it was disallowed, 3m.com existed.

Another: neither the tests nor the patterns account for Unicode, which is valid for domain names irrespective of the local part, and which, even if not allowed in the local part by existing providers, almost certainly will be eventually.

Also, a thing I noticed in the Django tests that is just upsetting:

- name: `TEST_DATA`

- values: `(validator, value, expected)` (a validator of the data is not the data)

- unstructured groups of such that should almost certainly be structured


The 63 character limit on domain name labels in RFC 1034/1035 isn't arbitrary, it's a limitation in the way DNS packets are encoded. The upper two bits of the label length are used for compression of labels in DNS packets.

Without protocol changes, DNS cannot support labels longer than 63 bytes.
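A sketch of that wire format (hypothetical helper, stdlib only) makes the limit visible: the length prefix must stay below 0xC0, since the top two bits mark compression pointers, leaving only 6 bits (0 to 63) for a literal label length:

```python
def encode_dns_name(name):
    """Encode a domain name in DNS wire format: each label is prefixed
    with a length byte, and the name ends with a zero byte (the root).
    Length values 0x00-0x3F are literal lengths; 0xC0 and above are
    compression pointers, so a label can never exceed 63 bytes."""
    out = bytearray()
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not 0 < len(raw) <= 63:
            raise ValueError("bad label length: %r" % label)
        out.append(len(raw))
        out += raw
    out.append(0)  # root label terminates the name
    return bytes(out)
```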

However, I agree that validating it is unnecessary.


You can use a regex as a simple pre-check but you absolutely have to do more than that if you expect high-quality results.

Back in the 90s, we ran the customer rewards program mailing list for a mainstream business you've heard of. A [gnarly] regex took care of the gross failures but we still had double-digit percentage of invalid addresses and many spam reports because people mistyped their username, used their old ISP's email address which had been disconnected, etc.

The only approach which produced satisfactory results back then was to have a bit of (horrible) PHP code which did a full SMTP connection to deliver the welcome email. Even that wasn't enough to ensure you'd delivered it to the right person, however, so we had to track clicks on an activation link.

I would be surprised if the average internet user has gotten significantly more reliable over the last couple of decades, and that's borne out by the wide variety of misdirected legitimate email my shortname@gmail.com account receives.
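The full-SMTP-connection probe described above can be approximated with stdlib smtplib; this is a sketch, not production code, and `mail.example.org` is a placeholder HELO name. Note that many servers accept any RCPT and bounce later, so this is only a negative signal:

```python
import smtplib


def smtp_probe(address, mx_host, helo="mail.example.org", timeout=10):
    """Ask mx_host whether it would accept RCPT TO:<address>.

    Returns True (accepted), False (rejected), or None (inconclusive,
    e.g. the connection failed).  True means "maybe deliverable",
    not proof of delivery."""
    try:
        with smtplib.SMTP(mx_host, timeout=timeout) as smtp:
            smtp.helo(helo)
            smtp.mail("")                  # null sender, like a bounce
            code, _ = smtp.rcpt(address)
            return 200 <= code < 300
    except (smtplib.SMTPException, OSError):
        return None
```

Even a True result here still leaves the wrong-person problem, which is why the activation link remains necessary.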


Definitely, I think email validation links are important too. However, it's pretty senseless to let an obviously invalid email address pass all the way through to that layer (and potentially get billed for sending messages to invalid email addresses).


Obviously invalid can be a tough measure, though - many people who used domains other than .com/.net/.org have reported sites incorrectly denying their address. If I was still doing this, I'd start by doing some loose validation on the address and checking whether the domain resolved in DNS before returning a form validation error.
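A loose-check-plus-DNS flow along those lines might look like this sketch; stdlib `getaddrinfo` only proves the domain resolves at all, and a real MX lookup would need a third-party resolver such as dnspython:

```python
import socket


def loose_email_check(addr):
    """Deliberately permissive syntax check: a non-empty local part,
    one "@", and a non-empty domain.  Nothing more."""
    local, sep, domain = addr.rpartition("@")
    return bool(sep and local and domain)


def domain_resolves(domain):
    """Stdlib approximation: does the domain resolve to any address?
    A proper implementation would query MX records instead, which
    needs a third-party resolver such as dnspython."""
    try:
        socket.getaddrinfo(domain, None)
        return True
    except OSError:
        return False
```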

The other key step is having some contact form which doesn't require the same test so people can report bugs if you get the logic wrong.


The marketing or product team might prefer the "cost"(?) of sending a bad email here and there to the cost of losing a customer because the email is incorrectly rejected.


Great point. It's easy for me to get caught in the engineer's perspective.


1. There are no "obviously invalid email address[es]".

2. Getting billed for sending emails? WTF?


You haven't worked in email, I presume. Yes, it does cost money to send emails at a large scale.


I am not sure what you are trying to say. Yes, of course, it costs money. Just like serving websites costs money. And just as checking the syntax of an email address costs money (CPUs cost money and need power to run!). Now, how is that an argument for checking the syntax of an email address to avoid sending a single email and seeing whether it bounces?


Sending a lot of bounced emails is going to get your mailserver's reputation tanked, which is not an acceptable tradeoff for removing email validation.

Email validation is actually really important if you send even a medium amount of email. Testing by bouncing is not a very good idea, especially since there are a lot of ways to test the deliverability of the email before you try and send it.

That's not to say use a regex instead; you can do some simple checks on the domain etc., or you can use one of the many services to check for you (for a fee, of course).

It's understandable not to know this unless you've run a medium-to-large site, so here are some best practices to follow: https://documentation.mailgun.com/best_practices.html#email-...


> Sending a lot of bounced emails is going to get your mailservers reputation tanked, which is not an acceptable tradeoff for removing email validation.

?!?

OK, let's break this down: What could be reasons for an email to be bounced due to an invalid address?

1. Because the address is syntactically invalid: Your own email server will notice and reject/bounce the email. No one outside your own server will notice.

2. Because the domain doesn't exist or has no inbound mailserver: Your own email server will do the DNS lookup(s) and bounce the email. No one outside your own server will make any note of that (and you can't find out without doing DNS lookups anyway).

3. Because the localpart doesn't exist: You can't find that out without trying to actually send an email (and consequently risking a bounce).

So, what exactly was the point of email validation "without bounces" again?

The only thing you can actually do to avoid unnecessary bounces is to make sure your emails have a valid return path and to make sure that you somehow process all the bounces that arrive there to make sure that you don't continue sending emails to addresses that start sending bounces (i.e., that don't exist (any more)). If you send lots of mails, it's probably best to employ VERP for that.
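The VERP idea is just string encoding: fold the recipient into the envelope sender so each bounce self-identifies. A hypothetical helper (encoding conventions vary by MTA):

```python
def verp_return_path(bounce_user, bounce_domain, recipient):
    """Build a VERP envelope sender that encodes the recipient, so a
    bounce arriving at it identifies exactly which address failed.
    Hypothetical helper using the common local=domain convention;
    real MTAs vary, and unusual local parts would need escaping."""
    local, _, domain = recipient.rpartition("@")
    return "%s+%s=%s@%s" % (bounce_user, local, domain, bounce_domain)
```

A bounce for alice@example.com then lands at bounces+alice=example.com@your-domain, and the failed address can be parsed straight out of the RCPT, with no need to interpret the bounce body.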


> There are no "obviously invalid email address[es]".

An email address which doesn't validate against the applicable RFC is ipso facto obviously invalid.

Validating against the RFC is somewhat complex though.


> Validating against the RFC is somewhat complex though.

Hence, not obvious :-)

If there is one thing we can learn from all the misguided attempts at validating email addresses that we have seen over the years, it is that it is obviously not obvious which email addresses are syntactically invalid.


With respect to (1), the cases I had in mind as "obviously invalid" are addresses that don't contain all of the required components of an email address. For example, if the user submits their email as "foo" or "foo@bar" or "foo@bar." I should have provided examples, as the way I wrote it was definitely ambiguous.


Both "foo@bar" and "foo@bar." are valid email addresses. "bar" can be a host name or a top-level domain name. "bar." is a top-level domain name.

That said, "foo@bar." may be valid, but ICANN says it should not be functional. They prohibit dotless domain names (that is, top-level domain names may not have A, AAAA, or MX records).


How about foo@[1.2.3.4] ?

"foo", ok, I agree with the blogpost, if there isn't even an @ in there somewhere, maybe it's sensible to catch that and reject it outright. But that's probably about it.


Right. Wholeheartedly agree. Sanitize all input. If you _KNOW_ input you're receiving is invalid, throw it away.

Also whether you use RegExp on email or not totally depends on what you're doing with the address. Are you throwing that email into Salesforce? Great! Salesforce does RegExp validation on all email addresses and the field is required. You _HAVE_ to do it. Solution? Use the same validation that Salesforce does. That's what I do.

Also, the OP's math about how common invalid email addresses are is wrong. Do you know what the most common way of screwing up an email address is? Hitting tilde when you're tabbing away from the field. And that error is in the tenths or hundredths of a percent range (which is still significant). I have a dataset of 5MM email addresses typed in by jaded call center agents to prove it.


Regarding that test data, what is the reason for rejecting bare IP literals on the right-hand side? It seems pedantic to require the square brackets.


That part is confusing to me.

I ran a fresh install of Django to confirm and it behaves as you suggest.

Edit: Perhaps only for low digits. Hmm.

    >>> from django.core.validators import EmailValidator
    >>> e = EmailValidator()
    >>> e.__call__('foo@[1.2.3.4]')
    >>> e.__call__('foo@1.2.3.4')
    ValidationError: [u'Enter a valid email address.']
    >>> e.__call__('foo@0.0.0.0')
    ValidationError: [u'Enter a valid email address.']
    >>> e.__call__('foo@[0.0.0.0]')
    >>> e.__call__('foo@10.10.10.10')
    >>> e.__call__('foo@[10.10.10.10]')
    >>> e.__call__('foo@255.255.255.255')
    >>> e.__call__('foo@[255.255.255.255]')


Huh. "Well tested" indeed.


You shouldn't be accepting raw IPs for email anyway.
