
Naughty Strings: A list of strings likely to cause issues as user-input data - caseysoftware
https://github.com/minimaxir/big-list-of-naughty-strings
======
minimaxir
Creator/Maintainer of the repo here.

I apologize for the lack of updates to the BLNS. (since I'm free today and
this is on the HN front page, I'll do a cleanup pass).

Even though it's a GitHub repository with 12.3k stars, there's not much to say
or improve on what is effectively a .txt file based around a good idea (I
recently removed mentions of my maintainership of the BLNS from my resume for
that reason, despite its crazy popularity).

~~~
caseysoftware
The HN submitter here. :)

I happened across it this afternoon and thought it was great!

Do you know of any automation around this? I was thinking a script that
grabbed your list and then hammered a given input filtering library would be
awesome. It's not something you'd want to run all the time, but before a
major release it could be useful.

~~~
minimaxir
That is the primary purpose of the JSON files and the parser to convert the
.txt to JSON; get the list, run it against a text input field, see what
happens.
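
A minimal sketch of that workflow in Python, assuming a local copy of `blns.json` that holds a JSON array of strings. The `sanitize` function is a hypothetical stand-in for whatever input handler is under test, and the file path is an assumption:

```python
import json
from pathlib import Path


def sanitize(text: str) -> str:
    """Hypothetical function under test; swap in the real input handler."""
    return text.strip()


def load_strings(path="blns.json"):
    """Load the list from a local copy of blns.json (path is an assumption)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def run_naughty_strings(strings, handler):
    """Feed every string to handler and collect the ones that raise."""
    failures = []
    for s in strings:
        try:
            handler(s)
        except Exception as exc:  # any crash counts as a finding
            failures.append((s, repr(exc)))
    return failures


if __name__ == "__main__":
    # A few inline samples so the sketch runs without the file.
    samples = ["", "undefined", "\u200b", "<script>alert(1)</script>"]
    for s, err in run_naughty_strings(samples, sanitize):
        print(f"{s!r} -> {err}")
```

Replacing `samples` with `load_strings()` runs the whole list. A crash is only the crudest signal; checking the handler's output for each string is the more useful follow-up.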

~~~
BuuQu9hu
Why do you store the generated JSON file in git btw?

~~~
lytedev
Not the maintainer, but I assume because it helps developers looking for it in
such a format. Saving them the time of converting is a nice move, IMO.

~~~
BuuQu9hu
The source and build system are right there, should be easy to generate the
other formats?

~~~
jaymzcampbell
Similar to the other comments, another voice here for appreciating the
"pre-built" version being available for quick use. For repos/sources like this I
tend to think of the prebuilt formats as letting me play around with things
without any hassle. Once I'm happy with it I'll invest the time to have it
build locally for the control.

------
chiph
You might want to include common unix shell commands. At a previous job we had
a customer with the last name of Echo who wasn't able to make a purchase.
Turns out our credit card processor blocked them.

~~~
Normal_gaussian
Jesus. Which credit card processor? That stinks of bad design.

~~~
chiph
Given how often they came under attack, I don't blame them for taking a "belt
and suspenders" approach.

~~~
paulddraper
More like "belt and helium balloons" approach.

------
bsimpson
TIL: `mocha:` was a custom URL scheme that Netscape Navigator used to eval URLs
(equivalent to `javascript:`), and Yahoo! Mail would replace it with
'espresso' to attempt to thwart phishing attempts:

[https://www.obscure.org/javascript/archives/msg01369.html](https://www.obscure.org/javascript/archives/msg01369.html)

[https://www.cnet.com/news/yahoo-mail-puts-words-in-your-
mout...](https://www.cnet.com/news/yahoo-mail-puts-words-in-your-mouth/)

------
thomasahle
This is a fun issue [https://github.com/minimaxir/big-list-of-naughty-
strings/iss...](https://github.com/minimaxir/big-list-of-naughty-
strings/issues/16)

~~~
remolueoend
a painted masterpiece on a bikeshed.

~~~
vanderZwan
I dunno, I think this is a pretty good argument actually:

> I agree that another SQL injection should be included - not because the
> vulnerabilities exposed by this file should be tempered (as that would only
> be to assist a dangerous confusion of responsible practices), but because
> "DROP TABLES" is such a cliche in infosec that it's prone to be caught by
> extremely crude filters, naive to the degree that it's the only class of SQL
> injection they know to avoid.
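
To illustrate the point about crude filters, here is a small Python sketch using the standard `sqlite3` module. The `crude_filter` function and the `users` table are made up for the example. It only shows that a keyword check passes an injection that a bound parameter would neutralize:

```python
import sqlite3


def crude_filter(value: str) -> bool:
    """Hypothetical naive filter: rejects only the cliche payload."""
    return "DROP TABLE" not in value.upper()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.execute("INSERT INTO users VALUES ('alice', 0)")
conn.execute("INSERT INTO users VALUES ('root', 1)")

payload = "' OR '1'='1"
passed_filter = crude_filter(payload)  # True: the filter lets this through

# Unsafe: string concatenation lets the payload rewrite the WHERE clause.
unsafe_sql = "SELECT name FROM users WHERE name = '" + payload + "'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: a bound parameter is treated as data, never as SQL.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (payload,)
).fetchall()

print(passed_filter, len(unsafe_rows), len(safe_rows))
```

The concatenated query returns every row, while the parameterized one returns none, even though the filter accepted the payload in both cases.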

------
raverbashing
The human injection phrase is priceless

It's a nice collection of text snippets to test against many systems

~~~
chronolitus
For those who can't/would rather not look for it:

"# Human injection # # Strings which may cause human to reinterpret worldview

If you're reading this, you've been in a coma for almost 20 years now. We're
trying a new technique. We don't know where this message will end up in your
dream, but we hope it works. Please wake up, we miss you."

------
leni536
> blns.txt consists of newline-delimited strings

I expect some nasty strings to contain newlines (I wonder how many bash
scripts are sensitive to filenames with newline characters in them). It
shouldn't be a problem with the json file though.
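
A quick Python check of why JSON sidesteps the problem: the standard `json` module escapes an embedded newline as `\n` inside the string literal, so the value survives a round trip. The sample string is made up for illustration:

```python
import json

# A string with an embedded newline, which a newline-delimited file cannot hold.
naughty = "first line\nsecond line"

encoded = json.dumps([naughty])   # the newline becomes the two characters \ and n
decoded = json.loads(encoded)[0]  # and is restored on load

print(encoded)
print(decoded == naughty)
```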

~~~
mtnygard
The file is newline-delimited. The strings themselves are base64 encoded, so
they could contain newlines.
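
The general idea of a newline-safe line format built on base64 can be sketched like this in Python. It demonstrates the encoding property only and makes no claim about how the repo's own scripts work; the sample strings are invented:

```python
import base64

strings = ["plain", "has\nnewline", "tab\tand\r\nCRLF"]

# b64encode never emits line breaks, so each encoded string fits on one line.
lines = [base64.b64encode(s.encode("utf-8")).decode("ascii") for s in strings]
blob = "\n".join(lines)

# Splitting on newlines and decoding recovers the originals exactly.
restored = [base64.b64decode(line).decode("utf-8") for line in blob.split("\n")]

print(restored == strings)
```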

~~~
leni536
It seems that blns.txt is the source content, then it's converted to
blns.json, blns.base64.txt and blns.base64.json with the two scripts in the
scripts folder (These resulting files shouldn't be in the repo in my opinion).
One cannot possibly add strings with newlines in them, unless with some
newline escaping that is handled in the scripts. It's a bad idea IMO and the
source content should be the json file and blns.txt should be dropped.

------
vog
I like the idea of providing such a list for testing purposes. I also like the
idea of storing these as Base64, so you don't trigger issues by accident.

However, I also imagine how such a list could be misused to actually decrease
the security of a system:

Imagine this list is handled the same way as virus signatures in so-called
anti-virus software. Instead of properly handling user input, an application
would check against this list and call itself "secure". Maybe with
partial and/or fuzzy comparison. If you demonstrate that this approach is
deeply flawed by showing another unsafe input, they'd simply add that to the
list and call themselves "secured" against this attack.

~~~
tomascot
If someone uses this list for security purposes I think that someone has a
bigger problem.

~~~
beefield
Can you elaborate? What other uses could this list have besides security
purposes?

~~~
kedean
It should not be used for security purposes, if "security purposes" means
components that enforce security at runtime. It is valuable as a
testing tool, but only against a completely finished system.

------
emilsedgh
It's funny that zero width space is considered weird and twitter fails on it.
It's quite common in my language (Persian).

~~~
peteretep

> It's quite common in my language
    

I'd love to hear more details on why?

~~~
satbyy
ZWJ and ZWNJ are also common in Indic scripts. It's basically used to control
the appearance of glyphs, for example half-forms and consonant clusters (क्‍ष
vs क्ष, both are kṣa). As usual, wikipedia has good examples. The Unicode
Standard also contains details about these.

ZW[N]J as a standalone character or at the beginning of a word is very unusual
on a day-to-day basis, so it's understandable that Twitter fails to recognize
this pattern.

¹ [https://en.wikipedia.org/wiki/ZWJ](https://en.wikipedia.org/wiki/ZWJ)

² [https://en.wikipedia.org/wiki/ZWNJ](https://en.wikipedia.org/wiki/ZWNJ)
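
For anyone who wants to poke at these characters, here is a small Python sketch using the standard `unicodedata` module. The two Devanagari strings are built from explicit code points on the assumption that the pair above differs only by a ZWJ after the virama:

```python
import unicodedata

ZWSP, ZWNJ, ZWJ = "\u200b", "\u200c", "\u200d"

for ch in (ZWSP, ZWNJ, ZWJ):
    # All three are invisible format characters (general category "Cf").
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# KA + VIRAMA + SSA, with and without a ZWJ after the virama.
with_zwj = "\u0915\u094d" + ZWJ + "\u0937"
without_zwj = "\u0915\u094d\u0937"

# The strings differ by one code point, so a plain comparison sees two values.
print(len(with_zwj), len(without_zwj), with_zwj == without_zwj)
```

Whether the two spellings should compare equal is an application decision. Stripping the joiner blindly would change how the text renders.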

~~~
peteretep
Ah ha!

    
    
> When a ZWJ is placed between two emoji characters, it can also result in a
> new form being shown, such as the family emoji, made up of two adult emoji
> and one or two child emoji
    

That makes a lot of sense too. I hadn't put much thought into how that's
implemented, but in retrospect it makes perfect sense.
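
The quoted behaviour is easy to see in Python: a family emoji of this kind is just individual emoji code points glued together with ZWJ. The particular man/woman/girl combination is chosen only as an example:

```python
import unicodedata

ZWJ = "\u200d"
MAN, WOMAN, GIRL = "\U0001F468", "\U0001F469", "\U0001F467"

# One visible glyph on supporting platforms, five code points underneath.
family = ZWJ.join([MAN, WOMAN, GIRL])

print(len(family))
print([unicodedata.name(ch) for ch in family])
```

Code that truncates or reverses such a string by code point can split the sequence and change what is displayed.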

~~~
vanderZwan
This made me wonder if anyone had tried combining word2vec with emojis, and
then I came across this:

[https://github.com/uclmr/emoji2ve](https://github.com/uclmr/emoji2ve)

~~~
peteretep
which is a dead link

~~~
satbyy
Correct link:
[https://github.com/uclmr/emoji2vec](https://github.com/uclmr/emoji2vec)

~~~
vanderZwan
Apologies, and thanks!

------
jakeogh
Here's a tool to generate problematic filenames:
[https://github.com/jakeogh/angryfiles](https://github.com/jakeogh/angryfiles)

------
solidsnack9000
[https://github.com/minimaxir/big-list-of-naughty-
strings/blo...](https://github.com/minimaxir/big-list-of-naughty-
strings/blob/master/blns.txt#L633)

> Strings which punish the fools who use cat/type on this file

------
Confiks
Hello human. This is a message from the Matrix. You've been in a coma for 20
years. Please write back.

[https://github.com/minimaxir/big-list-of-naughty-
strings/blo...](https://github.com/minimaxir/big-list-of-naughty-
strings/blob/8536c7903316763d7a6123e878c150fb97e6ea07/blns.txt#L629)

~~~
jjcm
I like that in the master list it's annotated as:

    
    
        #	Strings which may cause human to reinterpret worldview
    

[https://github.com/minimaxir/big-list-of-naughty-
strings/blo...](https://github.com/minimaxir/big-list-of-naughty-
strings/blob/master/blns.txt#L627)

------
teddyh
It’s missing the old “+++” for non-Hayes modems.

~~~
schoen
I think that sequence is an escape for Hayes modems; do you mean that Hayes
modems were less vulnerable to attacks involving it because of their guard
interval feature?

[https://en.wikipedia.org/wiki/Hayes_command_set#.2B.2B.2B](https://en.wikipedia.org/wiki/Hayes_command_set#.2B.2B.2B)

~~~
teddyh
Yes, exactly.

------
ubernostrum
Related: a list of names that probably should be reserved (for example, to
prevent someone setting up a user-profile page at a URL you don't want them to
control):

[https://ldpreload.com/blog/names-to-
reserve](https://ldpreload.com/blog/names-to-reserve)

~~~
chipperyman573
Alternatively, put them in another path
([https://facebook.com/user123](https://facebook.com/user123) ->
[https://facebook.com/users/user123](https://facebook.com/users/user123))
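
The two options can be sketched side by side in Python. The `RESERVED` set and both function names are hypothetical; the linked post has a much longer list of names:

```python
from urllib.parse import quote

# Hypothetical reserved set, for illustration only.
RESERVED = {"admin", "login", "api", "static", "users"}


def can_register_top_level(username: str) -> bool:
    """Option 1: profiles live at /<username>, so reserved names are refused."""
    return username.lower() not in RESERVED


def profile_path(username: str) -> str:
    """Option 2: profiles live under /users/, so no top-level route can collide."""
    return "/users/" + quote(username, safe="")


print(can_register_top_level("Admin"))
print(profile_path("admin"))
```

With the prefix approach the reserved list only matters for names that are confusing to humans, not for routing.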

------
ljoshua
Line 629 is a gem!

Thank you @minimaxir, I hadn't seen this before, this looks very useful.

~~~
bluesign
line 629 is empty ;)

~~~
ljoshua
No, wake up!!

(For any who want to take the blue pill: [https://github.com/minimaxir/big-
list-of-naughty-strings/blo...](https://github.com/minimaxir/big-list-of-
naughty-strings/blob/master/blns.txt#L629))

~~~
pluma
I'm not sure what you mean. That line really is empty.

~~~
jononor
Just a glitch...

------
zeristor
Unicode control characters from when people copy and paste from PDFs.

Drives me up the wall, I didn't have time to go deep into this.

------
the8472
> Also, do not send a null character (U+0000) string

isn't that quite a blind spot?

------
k__
Reminds me of a complaint I read on Twitter last week.

Native Australians were angry that FB blocked their real names, because they
seemed fake to them.

They have last names like "Creepingbear" and such.

~~~
harto
Did you mean native Americans? I've never heard "Creepingbear" as an
Australian name.

------
r-w
The list itself can be found here: [https://github.com/minimaxir/big-list-of-
naughty-strings/blo...](https://github.com/minimaxir/big-list-of-naughty-
strings/blob/master/naughtystrings/internal/resource.go#L530)

~~~
derimagia
If you're going to link to a line number - press 'y' to get a link tied to the
commit. Otherwise it may be out of date the next time the file changes.

([https://github.com/minimaxir/big-list-of-naughty-
strings/blo...](https://github.com/minimaxir/big-list-of-naughty-
strings/blob/8536c7903316763d7a6123e878c150fb97e6ea07/naughtystrings/internal/resource.go#L530))

------
air7
This reminded me of that story of people who have such strings for names:
[http://www.bbc.com/future/story/20160325-the-names-that-
brea...](http://www.bbc.com/future/story/20160325-the-names-that-break-
computer-systems)

------
Capira
My personal favorite: U+202E (RIGHT-TO-LEFT OVERRIDE). It forces the text that
follows it to be displayed right-to-left instead of left-to-right
[https://twitter.com/robin_linus/status/820567617903751169](https://twitter.com/robin_linus/status/820567617903751169)
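
A short Python sketch for spotting this character and its relatives in input, using the standard `unicodedata` module. The `find_bidi_controls` helper is a made-up name for this example:

```python
import unicodedata

RLO = "\u202e"
print(unicodedata.name(RLO))           # RIGHT-TO-LEFT OVERRIDE
print(unicodedata.bidirectional(RLO))  # RLO

# Bidi classes of the explicit embedding, override, and isolate controls.
BIDI_CONTROL_CLASSES = {
    "LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI",
}


def find_bidi_controls(text):
    """Return (index, code point) for every explicit bidi control in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES
    ]


print(find_bidi_controls("abc" + RLO + "def"))
```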

------
cwmma
I'm trying really hard to figure out what's bad about 'Lightwater Country
Park'

~~~
david-given
I figured that one out, but --- evaluate? mocha? expression?

~~~
cscheid
mocha has a naughty german word in its middle, I'm fairly sure.

~~~
guitarbill
No it doesn't. I believe it can be used for Javascript injections like 'eval'
as 'mocha' is/was a common test framework. At least that's the ostensible
reason Yahoo replaced 'eval' with 'review', 'mocha' with 'expresso', and
'expression' with 'statement' way back in 2002 [0].

[0] [https://www.newscientist.com//article/dn2546-email-
security-...](https://www.newscientist.com//article/dn2546-email-security-
filter-spawns-new-words)

~~~
jwilk
"espresso", not "expresso".

------
aroman
Why is the string "Linda Callahan" a naughty/Scunthorpe word?

~~~
ue_
After re-reading it I can see it contains "allah", but I can't see why that
would be filtered.

~~~
jasonjei
Interesting my last name was blocked from making Genius Bar appointments [0].
My name is Jason Hung.

[0]
[https://discussions.apple.com/thread/1491462?start=10&tstart...](https://discussions.apple.com/thread/1491462?start=10&tstart=0)

~~~
pavel_lishin
Open a PR?

------
frankmoodie
is it wise to just take this list "as is" as a black list for, say, valid
usernames, on a backend system ?

are there any drawbacks to this that i can't think of ?

in terms of performance - i guess it could be somehow optimized (with
dictionary and sorting algorithms etc etc)

edit: newlines

~~~
manarth

> is it wise to just take this list "as is" as a black list for, say, valid usernames?
    

I interpret this as a list of input that you _should_ accept, and it's test-
data to verify that the input is correctly handled.

After all, I imagine _Linda Callahan_ would be upset if she couldn't use her
name when registering, especially if she couldn't flip a table in comments
afterwards. (╯°□°）╯︵ ┻━┻)
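
Here is a small Python sketch of how a substring blacklist ends up rejecting real names. The banned list is hypothetical and chosen only to reproduce the two cases reported in this thread:

```python
# Hypothetical banned substrings, picked to match the names mentioned above.
BANNED_SUBSTRINGS = ["allah", "hung"]


def naive_filter_rejects(name: str) -> bool:
    """Reject a name if any banned substring appears anywhere inside it."""
    lowered = name.lower()
    return any(bad in lowered for bad in BANNED_SUBSTRINGS)


names = ["Linda Callahan", "Jason Hung", "Alice Smith"]
rejected = [n for n in names if naive_filter_rejects(n)]

print(rejected)
```

A test suite built on the list would assert the opposite: each of these inputs is accepted and stored unchanged.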

------
yellowapple
"Strings which may cause human to reinterpret worldview"

Hah. Totally filtering for that one now.

------
Tokkemon
This is super helpful! Thanks for sharing!

------
piyush_soni
Customary xkcd reference (Exploits of a Mom):

[https://xkcd.com/327/](https://xkcd.com/327/)

~~~
btschaegg
To add another funny XKCD reference:
[https://xkcd.com/1137/](https://xkcd.com/1137/)

On that note, can anyone suggest how one could efficiently test that an RTL
unicode char doesn't "infect" the whole following content of a template?
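
One possible approach, sketched in Python: render the template with a naughty value, then check that every bidi control in the output is closed again. This is a simplified counter, not an implementation of the Unicode Bidirectional Algorithm, and `unclosed_bidi` is a made-up helper name:

```python
# Controls that open an embedding or override, closed by PDF (U+202C).
EMBED_OPENERS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
# Controls that open an isolate, closed by PDI (U+2069).
ISOLATE_OPENERS = {"\u2066", "\u2067", "\u2068"}          # LRI, RLI, FSI
PDF = "\u202c"
PDI = "\u2069"


def unclosed_bidi(text: str) -> int:
    """Count bidi controls that are opened but never closed in text."""
    embeds = isolates = 0
    for ch in text:
        if ch in EMBED_OPENERS:
            embeds += 1
        elif ch in ISOLATE_OPENERS:
            isolates += 1
        elif ch == PDF and embeds:
            embeds -= 1
        elif ch == PDI and isolates:
            isolates -= 1
    return embeds + isolates


rendered = "<p>Hello, " + "evil\u202ename" + "!</p>"
print(unclosed_bidi(rendered))
```

A template test could assert that `unclosed_bidi(output) == 0` for every string in the list. Wrapping user values in FSI and PDI (U+2068 and U+2069) when rendering is one way to get there, though an unclosed override inside the value would still be counted by this check.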

------
akjainaj
> Although this is not a malicious error, and typical users aren't Tweeting
> weird unicode, an "internal server error" for unexpected input is never a
> positive experience for the user

What would the user expect from inputting "U+200B ZERO WIDTH SPACE" into a
form, anyway?

~~~
ttrmw
I've observed ZWSes appearing in user input for an application I maintain. It
appears in text pasted from either Outlook or OWA, I believe. In our case, it
is necessary that the application handle them gracefully - indeed, the user
has no reason to know anything is amiss.

~~~
akjainaj
That internal server error only appears if you paste the ZWS by itself,
without any valid text in the tweet at all. So yes, the user knows perfectly
well what he's doing.

~~~
rabidferret
That doesn't mean that a 500 is a good UX. We give error messages on invalid
form input for a reason.
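
As a rough illustration in Python, a validation step can treat invisible-only input as empty and answer with a message instead of an error page. The `INVISIBLE` set is an illustrative subset and `validate_post` is a hypothetical name:

```python
# Invisible characters ignored when deciding whether a post is empty
# (an illustrative subset, not a complete list).
INVISIBLE = {"\u200b", "\u200c", "\u200d", "\ufeff"}


def validate_post(text: str):
    """Return (ok, message) instead of letting odd input reach the backend."""
    visible = "".join(ch for ch in text if ch not in INVISIBLE).strip()
    if not visible:
        return False, "Your post appears to be empty."
    return True, ""


print(validate_post("\u200b"))
print(validate_post("hello\u200bworld"))
```

The check only decides emptiness. The original text, zero-width characters included, is left untouched for storage, which matters for the languages mentioned elsewhere in this thread.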

------
kahrkunne
Nice to have a list of these.

Also, the first time that copypasta actually spooked me out ;-)

------
catnaroek
There's no such thing as “naughty strings”, just dumb code. Sorry.

~~~
tannhaeuser
I have to agree here. While a collection of "naughty strings" isn't wrong per
se, the growing number of "killer regexpes to escape HTML" and other magic
approaches to injection attacks on github only serve lazy devs who want post-
facto excuses for their injection-prone web apps, or project managers who want
to check items on security check lists.

It's wrong because it de-emphasizes the importance of HTML-aware template
languages, such as some that are available for golang, or SGML, the natural
template language for HTML. There's no such thing as a collection of regexpes
for sanitizing HTML; it all depends on the context into which strings are
inserted.
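
A short Python example of that context dependence, using only standard-library functions. The sample value is invented, and the three contexts are a small selection:

```python
import html
import json
from urllib.parse import quote

value = '"><script>alert(1)</script>'

# HTML element content or a quoted attribute value.
as_html_text = html.escape(value)

# A URL path segment or query value.
as_url_part = quote(value, safe="")

# A JSON string literal. Valid JSON, but "</script>" survives unchanged,
# so this alone is not safe inside an inline <script> block.
as_json_literal = json.dumps(value)

print(as_html_text)
print(as_url_part)
print(as_json_literal)
```

Each output is right for its own context and wrong for the others. That is the job an HTML-aware template language does automatically.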

~~~
6DM
But wouldn't you want a decent set of cases to work on for learning purposes?

I think it's also good in that while you may not know all the latest tricks,
this can help you reveal what you don't know. It can get you really thinking
about the possibilities of what a simple string can do to your code if not
properly handled.

~~~
catnaroek
No, you don't want cases. You want _real specifications_ that you can
understand before setting on to write a program. “Corner cases” only exist due
to lack of understanding.

Also, explicit HTML (or SQL or whatever) string handling in normal application
code is just a failure to separate concerns: you haven't distinguished the
level at which HTML has an _abstract syntax_ and the level at which HTML's
abstract syntax is linearized into strings in one particular way.

~~~
6DM
Real specifications being, "save user's text and display it back", or "save
user input that is in English ASCII excluding special characters and no larger
than 160 characters"? I get a lot of the first, with emphasis being on the
user's perspective.

I do know to consider things like sql injection and having js injected into
the site. But I don't know what a special white space character from a Persian
alphabet will do to my server. Until today I haven't actually thought about
it. Not every language handles strings the same, as you pointed out.

I still think it's good to have around for helping you reveal what you don't
know, about what you don't know.

~~~
catnaroek
Real specifications relate preconditions to postconditions. Preconditions and
postconditions, in turn, are predicates on the program state. The mathematical
techniques for writing programs that meet their formal specifications have
been known for a few decades already.

---

Replying as an edit, because HN complains that “I'm submitting too fast”:

Sure, what you said applies to entire applications. But something relatively
stable and small, like, um, the definitions of HTML, JSON, SQL, etc. (do they
become larger every time your boss requests a new feature?) surely should have
formal specifications.

~~~
6DM
I would love "real" specifications. But right now I'm already dealing with a
boss that has no idea what he wants in terms of the UI. Simultaneously
demanding I "know" what should be done without "taking on things nobody asked
for."

Alas, I don't work at NASA where these formalities exist. I'm given a rough
sketch that I'm expected to bring into life, throw away and recreate again on
a whim.

Please note that I am not complaining, nor excusing. Only pointing out that
our expectations, environments, and programming languages are different. Each
can massively affect how the program should handle the input. Adding checks
helps, but does not remove the need for a nice set of test data to help
verify everything runs the way we expect it to behave.

