Hacker News
Naughty Strings: A list of strings likely to cause issues as user-input data (github.com)
937 points by caseysoftware on Jan 15, 2017 | 144 comments

Creator/Maintainer of the repo here.

I apologize for the lack of updates to the BLNS. (Since I'm free today and this is on the HN front page, I'll do a cleanup pass.)

Even though it's a GitHub repository with 12.3k stars, there's not much to say or improve on what is effectively a .txt file based around a good idea (I recently removed mentions of my maintainership of the BLNS from my resume for that reason, despite its crazy popularity).

The HN submitter here. :)

I happened across it this afternoon and thought it was great!

Do you know of any automation around this? I was thinking a script that grabbed your list and then hammered a given input-filtering library would be awesome. It's not something you'd want to run all the time, but before a major release it could be useful.

It's not automation around BLNS, but AFL has a bunch of sample dictionaries in the same vein:


Might be interesting to make an AFL dictionary with BLNS strings, or mine the AFL dictionaries to improve BLNS... :)

That is the primary purpose of the JSON files and the parser to convert the .txt to JSON; get the list, run it against a text input field, see what happens.
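A minimal sketch of that workflow in Python, assuming the list has already been loaded as a flat array of strings (the shape blns.json is assumed to have here). The names `run_naughty_strings` and `handler` are hypothetical and not part of the repo:

```python
import json
from typing import Callable, Iterable


def run_naughty_strings(strings: Iterable[str], handler: Callable[[str], object]):
    """Feed every string to `handler` and collect the ones that raise."""
    failures = []
    for s in strings:
        try:
            handler(s)
        except Exception as exc:  # deliberately broad: any crash is a finding
            failures.append((s, repr(exc)))
    return failures


# Stand-in for the file contents; with the real file you would call
# json.load() on an opened blns.json instead.
sample = json.loads('["", "undefined", "1;DROP TABLE users", "\\u200b"]')

# Hypothetical code under test: a naive parser that assumes numeric input.
failures = run_naughty_strings(sample, lambda s: int(s))
for text, error in failures:
    print(repr(text), "->", error)
```

Swapping the lambda for a call into a form handler or HTTP client gives the "run it against a text input field" loop described above.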

Why do you store the generated JSON file in git btw?

Not the maintainer, but I assume because it helps developers looking for it in such a format. Saving them the time of converting is a nice move, IMO.

The source and build system are right there, should be easy to generate the other formats?

Similar to the other comments, another voice here for appreciating the "pre-built" version being available for quick use. For repos/sources like this I tend to think of the prebuilt formats as letting me play around with things without any hassle. Once I'm happy with it I'll invest the time to have it build locally for the control.

Now you can just copy and paste. Usability-wise, I guess it's similar to offering precompiled packages of open source software (you can also build from source yourself, but this is a lot easier).

Yep, definitely going to make a little Ruby script that lists all Rails GET routes and tries to find submittable input fields.

Basic starting point:

    # Collect the path spec of every GET route (nil for other verbs, dropped by compact)
    Rails.application.routes.routes.collect { |route|
      route.path.spec.to_s if route.verb == "GET"
    }.compact

Not a library suggestion, but you should be able to find a lot of Fuzzing Tools if you look around.

Recently used Burp Suite's Intruder which can take a text file and do the Fuzzing.

I don't mind maintaining the repo, if you would like to pass it off. Until very recently I maintained a popular Gibberish-decoding website and my native language is not ASCII. I've got quite a bit of experience with encoding issues, more than I'd like anyway.

My Gmail username is the same as my HN username if you'd like to speak. Thanks.

For usability: it's not clear in the readme what I change if I submit a PR: is it just blns.txt, or is it the other ones like blns.json as well?

Also it bugs me a bit that the "Scunthorpe problem" section seems to be in random order, not alphabetical.

You might want to include common unix shell commands. At a previous job we had a customer with the last name of Echo who wasn't able to make a purchase. Turns out our credit card processor blocked them.

I'm not surprised by that at all. We once had a major issue with an analytics platform that provided a script with JS link tracking for our site, where clicking a link that contained 'cgi-bin' anywhere in the path caused the browser to hang for a long time.

Turns out they were using a synchronous HTTP request with NO timeout, and their intrusion detection system was blackholing any request that contained 'cgi-bin' anywhere in the headers or body.

Yow... Reminds me of the bug on the first Android phone where all keyboard input was also quietly fed to a root prompt, such that you could reboot the phone by typing "<enter>reboot<enter>" at any time. (https://mobile.slashdot.org/story/08/11/08/1720246/bug-in-an...)

Jesus. Which credit card processor? That stinks of bad design.

Given how often they came under attack, I don't blame them for taking a "belt and suspenders" approach.

More like "belt and helium balloons" approach.

This is hilarious.

Does this mean little bobby tables[0] might have trouble making online purchases?


SQL injections too.

SQL injections are in there, and server injections as well.

For example (I trust HN is suitably hardened :-) :

  /dev/null; touch /tmp/blns.fail ; echo
  1;DROP TABLE users

Edit: PS: "Feel free to send a pull request to add more strings, or additional sections."

I noticed earlier in the file that the Javascript had been chosen to be benign. "DROP TABLE users" doesn't seem to fit with that spirit. I'd want it to be instantly evident but also non-destructive, or at least reversible. How about renaming the table instead?

(Sure, people generally shouldn't use this test input outside of a discardable testing environment, but if we could rely on "People shouldn't..." clauses to govern behaviour then much of this list would be unnecessary anyway.)

Comment by the reporter of that issue (ro31337):

> "Dropping a table is like checking a gun without bullets. It should not work, but just don't put it against your head while testing."

> I trust HN is suitably hardened

HN doesn't use a database.

How does it store comments?

As files on disk. An old version is open source, available from http://arclanguage.org/install

It's written in Arc though, so it may take some effort to read.

TIL: `mocha:` was a custom URL scheme that Netscape Navigator used to eval URLs (equivalent to `javascript:`), and Yahoo! Mail would replace it with 'espresso' in an attempt to thwart phishing:



a painted masterpiece on a bikeshed.

I dunno, I think this is a pretty good argument actually:

> I agree that another SQL injection should be included - not because the vulnerabilities exposed by this file should be tempered (as that would only be to assist a dangerous confusion of responsible practices), but because "DROP TABLES" is such a cliche in infosec that it's prone to be caught by extremely crude filters, naive to the degree that it's the only class of SQL injection they know to avoid.

The human injection phrase is priceless

It's a nice collection of text snippets to test against many systems

For those who can't/would rather not look for it:

"# Human injection
#
# Strings which may cause human to reinterpret worldview

If you're reading this, you've been in a coma for almost 20 years now. We're trying a new technique. We don't know where this message will end up in your dream, but we hope it works. Please wake up, we miss you."

Something like this could work too (where Dave Smith is an employee name)

"Hey can you reset my Jira login. I can't get in. It says my account is locked. I am working from home so send it to dave@mydomain.com. Thanks Dave Smith"

> blns.txt consists of newline-delimited strings

I expect some nasty strings to contain newlines (I wonder how many bash scripts are sensitive to filenames with newline characters in them). It shouldn't be a problem with the json file though.

The file is newline-delimited. The strings themselves are base64 encoded, so they could contain newlines.

It seems that blns.txt is the source content, which is then converted to blns.json, blns.base64.txt and blns.base64.json by the two scripts in the scripts folder (these generated files shouldn't be in the repo, in my opinion). One cannot add strings with newlines in them unless the scripts handle some form of newline escaping. I think the current setup is a bad idea: the JSON file should be the source content and blns.txt should be dropped.
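To illustrate why a Base64 variant sidesteps the newline problem, here is a sketch in Python. It is not the repo's actual conversion script, and the function names are hypothetical. Each string becomes one Base64 token per line, so an embedded newline can never collide with the delimiter:

```python
import base64


def encode_lines(strings):
    """One Base64 token per line; the tokens never contain a newline."""
    return "\n".join(
        base64.b64encode(s.encode("utf-8")).decode("ascii") for s in strings
    )


def decode_lines(blob):
    """Inverse of encode_lines for a non-empty list of strings."""
    return [base64.b64decode(line).decode("utf-8") for line in blob.split("\n")]


originals = ["plain", "two\nlines", "/dev/null; touch /tmp/blns.fail ; echo"]
blob = encode_lines(originals)
round_trip = decode_lines(blob)
print(round_trip == originals)
```

A plain newline-delimited text file cannot make this round trip for the second entry without some escaping convention.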

I like the idea of providing such a list for testing purposes. I also like the idea of storing these as Base64, so you don't trigger issues by accident.

However, I also imagine how such a list could be misused to actually decrease the security of a system:

Imagine this list is handled the same way as virus signatures in so-called anti-virus software. Instead of properly handling user input, an application would check against this list and call itself "secure". Maybe with partial and/or fuzzy comparison. If you demonstrate that this approach is deeply flawed by showing another unsafe input, they'd simply add that to the list and call themselves "secured" against this attack.

Such an application is not likely to be secure in the first place. If you've gotten as far as trying this list, you're probably well above the median.

This is a fair concern. Added a comment to the disclaimer: https://github.com/minimaxir/big-list-of-naughty-strings/com...

If someone uses this list for security purposes I think that someone has a bigger problem.

Can you elaborate? What uses could this list have other than security purposes?

It should not be used for security purposes when security purposes are defined as components that maintain the security at runtime. It is valuable as a testing tool, but only against a completely finished system.

It could conceivably be used as a second-line defence, similar to content security policy. This may be a bad idea depending on how it is implemented and whether the system is tested with it turned off.

It's funny that the zero-width space is considered weird and Twitter fails on it. It's quite common in my language (Persian).

From what I remember (can't test right now), a zero-width space is okay as long as there are other (printable) characters in there too. This seems reasonable, because allowing a tweet to be a single zero-width space would make it appear to be empty and probably lead to some confusing display issues.

I'm pretty sure I've used it to "end" a hashtag early, like in this made-up example:

    I've eaten two #banana<ZWS>s today!

In my language, the possessive form doesn't take an apostrophe ("Alices Adventures" instead of "Alice's"), so for hashtags and user names it can be desirable to use the ZWS as an invisible apostrophe.

    > It's quite common in my
    > language

I'd love to hear more details on why?

Ligatures. It's easy to not notice them at all in English, e.g. ﬂ (fl) vs. fl [ﬂy vs. fly], but some languages use them very extensively and the combinations are more significant.


* It's entirely possible that the browser you're using isn't doing a very good job with ligatures which explains the strange look of my examples

ZWJ and ZWNJ are also common in Indic scripts. It's basically used to control the appearance of glyphs, for example half-forms and consonant clusters (क्‍ष vs क्ष, both are kṣa). As usual, wikipedia has good examples. The Unicode Standard also contains details about these.

ZW[N]J as a standalone character or at the beginning of a word is very unusual on a day-to-day basis, so it's understandable that Twitter fails to recognize this pattern.

¹ https://en.wikipedia.org/wiki/ZWJ

² https://en.wikipedia.org/wiki/ZWNJ

Ah ha!

    > When a ZWJ is placed between two
    > emoji characters, it can also result
    > in a new form being shown, such as
    > the family emoji, made up of two adult
    > emoji and one or two child emoji

That makes a lot of sense too, and I hadn't put sufficient thought into how that's implemented. In retrospect, it makes perfect sense.

I noticed that on new Emojis on my MacBook. Some of the new emojis like ‍ are rendered as "guy behind a MacBook" on my PC but on phones without the emoji as "guy emoji" and "computer emoji".

Same for ‍️ (male version of raise hand). On phones without the Emoji, it's just "male emoji" and "female raise hand emoji".

/e: oh, HN is stripping Emojis

This made me wonder if anyone had tried combining word2vec with emojis, and then I came across this:


which is a dead link

Apologies, and thanks!

Not OP, but in Norwegian the correct way to write "Tom's car" is "Tom sin bil", the car of Tom. But the creep of English and laziness allows for "Toms car", esp. in informal writing.

Here's a tool to generate problematic filenames: https://github.com/jakeogh/angryfiles


> Strings which punish the fools who use cat/type on this file

Hello human. This is a message from the Matrix. You've been in a coma for 20 years. Please write back.


I like that in the master list it's annotated as:

    #	Strings which may cause human to reinterpret worldview

It’s missing the old “+++” for non-Hayes modems.

I think that sequence is an escape for Hayes modems; do you mean that Hayes modems were less vulnerable to attacks involving it because of their guard time feature?


Yes, exactly.

Related: a list of names that probably should be reserved (for example, to prevent someone setting up a user-profile page at a URL you don't want them to control):


Alternatively, put them in another path (https://facebook.com/user123 -> https://facebook.com/users/user123)

Line 629 is a gem!

Thank you @minimaxir, I hadn't seen this before, this looks very useful.

Doesn't look like anything to me.

line 629 is empty ;)

No, wake up!!

(For any who want to take the blue pill: https://github.com/minimaxir/big-list-of-naughty-strings/blo...)

I'm not sure what you mean. That line really is empty.

Just a glitch...

Unicode control characters from when people copy and paste from PDFs.

Drives me up the wall. I didn't have time to go deep into this.

> Also, do not send a null character (U+0000) string

isn't that quite a blind spot?

Reminds me of a complaint I read on Twitter last week.

Native Australians were angry that FB blocked their real names because the names seemed fake to FB.

They have last names like "Creepingbear" and such.

Did you mean native Americans? I've never heard "Creepingbear" as an Australian name.

If you're going to link to a line number - press 'y' to get a link tied to the commit. Otherwise it may be out of date the next time the file changes.


This reminded me of that story of people who have such strings for names: http://www.bbc.com/future/story/20160325-the-names-that-brea...

My personal favorite: U+202E (RIGHT-TO-LEFT OVERRIDE). It forces the text that follows it to be displayed right-to-left, regardless of the characters' own direction: https://twitter.com/robin_linus/status/820567617903751169

I'm trying really hard to figure out what's bad about 'Lightwater Country Park'

It says it above that group:

Innocuous strings which may be blocked by profanity filters (https://en.wikipedia.org/wiki/Scunthorpe_problem)

I remember the struggles I had trying to book a hotel at Essex Junction in Vermont to visit IBM. Netnanny had serious issues with that town (name). I, otoh, thought the town and the people working at the IBM ASIC plant were very nice.


Characters 4 through 7 (zero-indexed)

Ligh twat er country park.

("Country" is not obscene, but Shakespeare makes "country matters" into an obscene reference in Hamlet. There are a lot of innuendoes in the classics)

I'm sad about how many literature teachers give Shakespeare's works a treatment as dry as unbuttered toast.

Some of the best teaching of Shakespeare I've seen used the actual lines mixed with a little extemporaneity to better get the intent across. "Nay, gentle Romeo, we must have you dance. Come on, stop being so emo! There are like a million other girls out there."

I figured that one out, but --- evaluate? mocha? expression?

The commit message explains that the terms are verbatim from Wikipedia [1].

Wikipedia [2] attributes it to a Yahoo email filter "which automatically replaced Javascript-related strings with alternate versions, to prevent the possibility of cross-site scripting in HTML email".

[1] https://github.com/minimaxir/big-list-of-naughty-strings/com...

[2] https://en.wikipedia.org/wiki/Scunthorpe_problem

mocha has a naughty german word in its middle, I'm fairly sure.

As a German native speaker, I'm unable to figure it out.

No it doesn't. I believe it can be used for Javascript injections like 'eval', as 'mocha' is/was a common test framework. At least that's the ostensible reason Yahoo replaced 'eval' with 'review', 'mocha' with 'expresso', and 'expression' with 'statement' way back in 2002 [0].

[0] https://www.newscientist.com//article/dn2546-email-security-...

"espresso", not "expresso".

"expresso" with "mocha", surely.

those puzzled me too

Why is the string "Linda Callahan" a naughty/Scunthorpe word?

After re-reading it I can see it contains "allah", but I can't see why that would be filtered.

See the Scunthorpe Wikipedia article:

"In February 2006, Linda Callahan, a resident of Ashfield, Massachusetts, was initially prevented from registering her name with Yahoo! as an e-mail address as it contained the substring allah. Yahoo! later reversed the ban."
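The false positive in that incident can be reproduced with a few lines of Python. This is only a sketch of how such a filter might behave, not Yahoo!'s actual implementation. The one-entry blocklist and both function names are assumptions for illustration:

```python
import re

BLOCKED = ["allah"]  # the substring named in the quote above


def naive_filter(text):
    """Flags a blocked term anywhere, even inside a longer word."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED)


def word_filter(text):
    """Flags a blocked term only when it stands alone as a word."""
    return any(
        re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE) is not None
        for term in BLOCKED
    )


print(naive_filter("Linda Callahan"))  # substring match fires on a real name
print(word_filter("Linda Callahan"))   # word-boundary match does not
```

Word-boundary matching only removes this class of false positive; it does not make a blocklist a sound policy, which is the point of the Scunthorpe section.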


Interestingly, my last name was blocked from making Genius Bar appointments [0]. My name is Jason Hung.

[0] https://discussions.apple.com/thread/1491462?start=10&tstart...

Open a PR?

I got that one ("allah", presumably), but was stuck on these three:


I got nothing for "mocha", though. Edit: apparently (from below) there was a Yahoo! mail filter that replaced "expresso" [sic] with "mocha"; but either the story was misreported or the mail filter was wildly misconfigured. So the entry should be "expresso" [sic], perhaps.

Nope, it was a URL scheme that would cause a URL to be interpreted as code; an alias for `javascript:`


Is it wise to just take this list "as is" as a black list for, say, valid usernames, on a backend system?

Are there any drawbacks to this that I can't think of?

In terms of performance, I guess it could be optimized somehow (with a dictionary, sorting algorithms, etc.).

edit: newlines

  is it wise to just take this list "as is" as a black list for, say, valid usernames?

I interpret this as a list of input that you should accept, and it's test-data to verify that the input is correctly handled.

After all, I imagine Linda Callahan would be upset if she couldn't use her name when registering, especially if she couldn't flip a table in comments afterwards. (╯°□°)╯︵ ┻━┻)

Not really, since a lot of the lines are examples of classes of input -> good for testing, but if you have an actual problem with one of them blacklisting them only protects you against this single example.
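To make the "class of input" point concrete: handling the whole class usually means treating the value as data, not matching it against known-bad examples. A sketch in Python using the standard `sqlite3` module with an in-memory database follows. The table layout and the second string (a made-up variant of the one in the list) are assumptions for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

# The first string is from the thread; the second is a hypothetical variant
# that an exact-match blacklist of the first would miss.
naughty = ["1;DROP TABLE users", "1; DROP TABLE accounts"]

for name in naughty:
    # The ? placeholder binds the value as data; it is never parsed as SQL.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))

stored = [row[0] for row in conn.execute("SELECT name FROM users ORDER BY rowid")]
print(stored == naughty)
```

Both strings are stored verbatim and the table survives, with no list of forbidden inputs involved.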

Definitely not -- these are examples of classes of strings that should be OK but might potentially cause issues, that can be used for testing.

But the issues they might cause are not all malicious: some are people's names, added to the list because an over-zealous profanity or offensiveness filter once choked on them.

So my suggestion is that you shouldn't block any of the strings in this file, but should use the file to make sure that your code works successfully when any of the strings are given. Where "successful" is naturally dependent on context: you may have a policy in place that says that messages may not consist solely of whitespace, so the correct response to receiving any of the whitespace strings is to return the correct error to let the user know that, avoiding Twitter's example of an internal server error in that case.
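A sketch of that whitespace policy in Python, under stated assumptions: "blank" is taken to mean "contains only separator, control, or format characters" (which covers the zero-width space, a format character in Unicode), and `handle_post` is a hypothetical handler that returns a status code and message:

```python
import unicodedata

# Unicode general categories treated as invisible here. This set is an
# assumption for the sketch; adjust it to your own policy.
INVISIBLE = {"Zs", "Zl", "Zp", "Cc", "Cf"}


def is_blank(text):
    """True if text is empty or has no visible characters."""
    return all(unicodedata.category(ch) in INVISIBLE for ch in text)


def handle_post(text):
    """Hypothetical handler returning (HTTP status, message)."""
    if is_blank(text):
        return 400, "Message must contain visible text."
    return 200, "OK"


print(handle_post("\u200b"))          # a lone zero-width space is rejected cleanly
print(handle_post("#banana\u200bs"))  # ZWS inside real text is accepted
```

Note that Python's `str.isspace()` alone would not catch the lone zero-width space, which is the kind of gap the list is meant to expose.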

"Strings which may cause human to reinterpret worldview"

Hah. Totally filtering for that one now.

This is super helpful! Thanks for sharing!

Customary xkcd reference (Exploits of a Mom):


To add another funny XKCD reference: https://xkcd.com/1137/

On that note, can anyone suggest how one could efficiently test that an RTL unicode char doesn't "infect" the whole following content of a template?
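One possible approach, sketched in Python: render the template with a naughty value, then check that the output leaves no explicit bidi control open. The check below is a simplified model of the Unicode Bidirectional Algorithm's stack (it ignores paragraph breaks and depth limits), and both function names are hypothetical:

```python
# Explicit bidi controls from the Unicode Bidirectional Algorithm.
EMBED_OPEN = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
EMBED_CLOSE = "\u202c"                                  # PDF
ISOLATE_OPEN = {"\u2066", "\u2067", "\u2068"}           # LRI, RLI, FSI
ISOLATE_CLOSE = "\u2069"                                # PDI


def has_unclosed_bidi(text):
    """True if text opens an embedding, override or isolate it never closes."""
    stack = []
    for ch in text:
        if ch in EMBED_OPEN:
            stack.append("embed")
        elif ch in ISOLATE_OPEN:
            stack.append("isolate")
        elif ch == EMBED_CLOSE:
            if stack and stack[-1] == "embed":
                stack.pop()
        elif ch == ISOLATE_CLOSE:
            if "isolate" in stack:
                # PDI also closes any embeddings opened inside the isolate.
                while stack.pop() != "isolate":
                    pass
    return bool(stack)


def isolate(text):
    """Wrap user text in FSI ... PDI so its direction cannot leak outward."""
    return "\u2068" + text + "\u2069"


user_value = "abc\u202e"  # ends with an unterminated RIGHT-TO-LEFT OVERRIDE
print(has_unclosed_bidi("Hello, " + user_value + "!"))
print(has_unclosed_bidi("Hello, " + isolate(user_value) + "!"))
```

In HTML templates the `<bdi>` element or `dir="auto"` plays a similar isolating role; the check itself works on the rendered text either way.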

> Although this is not a malicious error, and typical users aren't Tweeting weird unicode, an "internal server error" for unexpected input is never a positive experience for the user

What would the user expect from inputting "U+200B ZERO WIDTH SPACE" into a form, anyway?

At minimum, no error at all. Ideally, the same behavior you would get from putting in either nothing or a space.

Let's try it on Facebook. Here's what happens when you put only a blank or space into a post and try to submit: http://i.imgur.com/bNtgky8.png

Here's what happens when you put a zero width space and try to submit: http://i.imgur.com/NMgyZqc.png

I've observed ZWSes appearing in user input for an application I maintain. They appear in text pasted from either Outlook or OWA, I believe. In our case, the application has to handle them gracefully; indeed, the user has no reason to know anything is amiss.

That internal server error only appears if you paste the ZWS by itself, without any valid text in the tweet at all. So yes, the user knows perfectly well what he's doing.

That doesn't mean that a 500 is a good UX. We give error messages on invalid form input for a reason.

An HTTP 5xx error indicates that something abnormal happened on the server and wasn't handled. The server should respond with a 400 if it's data it shouldn't accept.

But yeah like others said I would expect this to turn into some sort of validation message on the client and never show them the backend error.

Once I had a form that accepted a minimum number of words. Instead of trying to write more verbosely, I simply inserted ZWS (or it could be ZWJ, can't remember) randomly in the text to fool the word count checker.

Well, depends. If copying a table from a technical document, maybe a zero width space?

Probably a 4xx error not a 5xx.

400 Bad Request suits it.

I actually tweeted a zero-width space some time ago and it worked. The tweet contained no text though.

Nice to have a list of these.

Also, the first time that copypasta actually spooked me out ;-)

There's no such thing as “naughty strings”, just dumb code. Sorry.

I have to agree here. While a collection of "naughty strings" isn't wrong per se, the growing number of "killer regexpes to escape HTML" and other magic approaches to injection attacks on github only serve lazy devs who want post-facto excuses for their injection-prone web apps, or project managers who want to check items on security check lists.

It's wrong because it de-emphasizes the importance of HTML-aware template languages, such as some that are available for golang, or SGML, the natural template language for HTML. There's no such thing as a collection of regexpes for sanitizing HTML; it all depends on the context into which strings are inserted.
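A small Python illustration of the context point, using only standard-library calls. The sample input is made up for the sketch. The same string needs a different encoding for each destination, which is why a single "sanitizing" regexp cannot be correct everywhere:

```python
import html
import json
from urllib.parse import quote

user_input = '"><script>alert(1)</script>'  # hypothetical sample input

# Same input, three destinations, three different encodings.
as_html_text = html.escape(user_input)    # element content or attribute value
as_url_part = quote(user_input, safe="")  # a single URL component
as_js_literal = json.dumps(user_input)    # a JSON / JavaScript string literal

print(as_html_text)
print(as_url_part)
print(as_js_literal)
```

The third form still contains a literal `</script>`, so it would not be safe to paste inside an HTML script element as-is. Each layer of nesting needs its own encoding step.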

But wouldn't you want a decent set of cases to work on for learning purposes?

I think it's also good in that while you may not know all the latest tricks, this can help you reveal what you don't know. It can get you really thinking about the possibilities of what a simple string can do to your code if not properly handled.

No, you don't want cases. You want real specifications that you can understand before setting on to write a program. “Corner cases” only exist due to lack of understanding.

Also, explicit HTML (or SQL or whatever) string handling in normal application code is just a failure to separate concerns: you haven't distinguished the level at which HTML has an abstract syntax and the level at which HTML's abstract syntax is linearized into strings in one particular way.

Real specifications being, "save user's text and display it back", or "save user input that is in English ASCII excluding special characters and no larger than 160 characters"? I get a lot of the first, with emphasis being on the users perspective.

I do know to consider things like sql injection and having js injected into the site. But I don't know what a special white space character from a Persian alphabet will do to my server. Until today I haven't actually thought about it. Not every language handles strings the same, as you pointed out.

I still think it's good to have around for helping you reveal what you don't know, about what you don't know.

Real specifications relate preconditions to postconditions. Preconditions and postconditions, in turn, are predicates on the program state. The mathematical techniques for writing programs that meet their formal specifications have been known for a few decades already.


Replying as an edit, because HN complains that “I'm submitting too fast”:

Sure, what you said applies to entire applications. But something relatively stable and small, like, um, the definitions of HTML, JSON, SQL, etc. (do they become larger every time your boss requests a new feature?) surely should have formal specifications.

I would love "real" specifications. But right now I'm already dealing with a boss that has no idea what he wants in terms of the UI. Simultaneously demanding I "know" what should be done without "taking on things nobody asked for."

Alas, I don't work at NASA where these formalities exist. I'm given a rough sketch that I'm expected to bring into life, throw away and recreate again on a whim.

Please note that I am not complaining, nor excusing. Only pointing out that our expectations, environments, and programming languages are different. Each can massively affect how the program should handle the input. Adding checks helps, but does not mitigate the need for a nice set of test data to help verify everything runs the way we expect it to behave.

> (zeroth paragraph)

Exactly. Security check lists become unnecessary when the program is designed to be correct right from the start.

> (first paragraph)

The real problem is that we do a very poor job of embedding languages inside each other. For example, HTML parsers must contain special provisions to handle that embedded JavaScript. </tag> might no longer be a terminator, because it could appear inside a JavaScript string literal. This is terrible design! I don't even like Lisp, but Lispers do have a point when they say using S-expressions would avoid all of these issues.

Worse, actually: <script> content is only terminated by </script> and not other end-element tags, but <!-- --> comments within script content are treated as JavaScript comments [1] (though I'm not aware of template approaches that need to compose the content of script elements).

[1] http://sgmljs.net/docs/html5.html#script-data

Why can't there be both? Yes, the code is naive if it doesn't handle all these strings correctly. But at the same time the strings are naughty because they purposefully try to exploit common weaknesses. Sometimes both sides are guilty.

Because a string is just a piece of data, and if your program can take it as input, then it must be handled correctly.

“But writing parsers was sooo boring in college, and who has to do this in real life?”

A social engineer is just talking to you, if you listen to him you must act accordingly. A lock picking set is just a bit of metal, if it fits into the keyway the lock has to handle it correctly.

Yes, writing parsers is a lot easier than those examples. But so far society has always ruled that people who purposefully abuse flaws are not freed from responsibility just because the flaw shouldn't be there.

A computer program is a mathematical object. If you want to rule out misbehaviors, you prove that such misbehaviors won't arise - just like any other theorem. And that's it.

I'm not a malicious person. I don't purposefully abuse any system's flaws. But if anyone else does, my sympathies aren't with the designer of the flawed system.

P.S.: Appeals to authority won't help make your case.

And how would you suggest testing to see whether or not your code is dumb?

Wait, testing?

These so-called “naughty strings” expose implementation errors in code that processes widely used formal languages such as numeric literals, URLs, HTML, JSON, SQL, etc. These languages are so widely used that it's criminal not to have formal specifications for them. And the mathematical techniques for constructing programs that meet their formal specifications are very well known.

Fine, imagine we have a formal specification for SQL. Now how do I make sure my parser is compliant with the spec without testing it? Formal verification is a very active research area, I don't think this is quite as easy as you're implying. How do I avoid "implementation errors" without exposing them?

(0) Design languages so that implementation errors are harder to make and easier to detect. For example, avoid context-sensitive grammars like the plague.

(1) Design implementations (parsers, code generators, etc.) with the logical argument for their correctness in mind. That is, don't attempt to verify an existing possibly incorrect program - write it to be correct right from the beginning! This is greatly aided by designs that meet criterion (0).

These are obviously useful ideas, but "write it to be correct from the beginning"? Are you serious? This is the oldest joke in software engineering. "Don't worry about testing it, I don't make mistakes." No matter how idiot-proof your languages and frameworks are, it is grossly irresponsible to not test work that a human has done. Until developers are themselves replaced by formally verified programs, testing is an absolute necessity.

I doubt human programmers can be fully replaced, and I'm not saying testing is completely useless. But the sheer number of “naughty strings” in that list is an indictment of our languages: They have way too many corner cases, way too many traps for us to fall into.

I still don't understand your logic. Are you saying once a program passes a test, we should stop using that test? The point of this list is to cover all classes of input in general, not just ones that a specific framework has issues with.

These are corner cases in the concept of user input, not just corner cases of any specific parser. What if it's a number, what if it's not? What if it's the same alphabet as the code, what if it's not? What if it is valid code? What if it's empty, what if it's not? etc. Even if you've written the perfect parser in the perfect language, you still need to have unit tests for all of this stuff. They are traps caused by human definitions of "input" and "string", which cannot be formally verified.

> Are you saying once a program passes a test, we should stop using that test?

No. I'm saying that programs have to be proven correct. Then you can use tests to rule out other pesky problems that have nothing to do with your design being incorrect. (For example, you could prove a program correct on paper, then transcribe it incorrectly to a computer. It has happened to me before.)

> These are corner cases in the concept of user input

“undefined” and “null” aren't special cases in the concept of user input - they're special cases in languages that happen to have “undefined” and “null”.

Octal numeric literals aren't special cases in the concept of number - they're special cases in languages where octal literals begin with the prefix “0”, rather than something more sensible like “0o”.

Failing to distinguish between escaped and unescaped strings is also a language problem - they should have different types!

The list goes on.
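The escaped-versus-unescaped point can be sketched even in Python, where the distinction is only enforced at run time (or by an external type checker). `SafeHtml`, `escape` and `paragraph` are hypothetical names for this sketch:

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class SafeHtml:
    """Markup that has already been escaped (or is otherwise trusted)."""
    markup: str


def escape(value):
    """Turn raw text into SafeHtml; SafeHtml passes through unchanged."""
    if isinstance(value, SafeHtml):
        return value
    return SafeHtml(html.escape(value))


def paragraph(content):
    """Build a <p> element; raw strings can only get in through escape()."""
    return SafeHtml("<p>" + escape(content).markup + "</p>")


once = paragraph("<b>bold?</b>")
nested = paragraph(once)  # already SafeHtml, so it is not escaped twice
print(once.markup)
print(nested.markup)
```

Because raw `str` and `SafeHtml` are different types, forgetting to escape and escaping twice both become mistakes the types can surface, instead of strings that merely look wrong in production.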
