
Show HN: Big List of Naughty Strings for testing user-input data - minimaxir
https://github.com/minimaxir/big-list-of-naughty-strings
======
rspeer
Most of what I do involves the messy world of text, and I think this is a
great resource. I wish the software I depended on tested against it.

I can think of a few more cases that I've seen cause havoc:

\- U+FEFF in the middle of a string (people are used to seeing it at the
beginning of a string, because Microsoft, but elsewhere it may be more
surprising)

\- U+0 (it's encoded as the null byte!)

\- U+1B (the codepoint for "escape")

\- U+85 (Python's "codecs" module thinks this is a newline, while the "io"
module and the Python 3 standard library don't)

\- U+2028 and U+2029 (even weirder linebreaks that cause disagreement when
used in JSON literals)

\- A glyph with a million combining marks on it, but not in NFC order (do your
Unicode algorithms use insertion sort?)

\- The sequence U+100000 U+010000 (triggers a weird bug in Python 3.2 only)

\- "Forbidden" strings that are still encodable, such as U+FFFF, U+1FFFF, and
for some reason U+FDD0

People should also test what happens with isolated surrogate codepoints, such
as U+D800. But these can't properly be encoded in UTF-8, so I guess don't put
them in the BLNS. (If you put the fake UTF-8 for them in a file, the best
thing for a program to do would be to give up on reading the file.)
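
A few of these cases can be poked at directly in Python 3. This is only a rough sketch of what CPython's str, bytes, and json do with them; the helper names are made up for illustration, and other languages and libraries may behave differently.

```python
import json

# U+FEFF in the middle of a string: invisible when printed, but it is
# still a real code point, so equality with the "clean" string fails.
with_bom = "ab\ufeffcd"
bom_changes_equality = with_bom != "abcd"

# U+2028 in JSON: escaped by default, passed through raw on request.
escaped_json = json.dumps("x\u2028y")
raw_json = json.dumps("x\u2028y", ensure_ascii=False)


def encodes_to_utf8(s):
    """Hypothetical helper: True if the str can be encoded as UTF-8."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def decodes_from_utf8(b):
    """Hypothetical helper: True if the bytes are valid UTF-8."""
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "Forbidden" noncharacters are still encodable.
noncharacters_ok = all(
    encodes_to_utf8(c) for c in ("\uffff", "\U0001ffff", "\ufdd0")
)

# An isolated surrogate is not, and neither is the fake UTF-8 for it.
lone_surrogate_ok = encodes_to_utf8("\ud800")
fake_surrogate_bytes_ok = decodes_from_utf8(b"\xed\xa0\x80")
```

Here the decoder rejects the fake surrogate bytes outright, which is roughly the "give up" behaviour described above.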

~~~
gsnedders
Bi-directional text is probably another one. All the bidi control characters,
especially. Probably really all Unicode control characters in general.

~~~
rspeer
Sure, but there's already a lot of bidi text in the file.

~~~
gsnedders
Bah, I only saw mono-directional text. Looking closely, I only see one line
with bi-directional text, "הָיְתָהtestالصفحات التّحول"?

~~~
minimaxir
There are Bidi controller characters in the Trick Unicode. (Doesn't appear in
Github rendering, oddly)

------
jsat
" # Server Code Injection # # Strings which can cause user to run code on
server as a privileged user (c.f.
[https://news.ycombinator.com/item?id=7665153](https://news.ycombinator.com/item?id=7665153))

/dev/null; rm -rf /*; echo " That's a little aggressive for testing, no?

~~~
jleader
Some would argue that if you're testing on a system you can't recreate
easily/quickly, you're doing devops wrong.

~~~
pavel_lishin
And I'd agree, but this would be a pretty disproportionate punishment for the
crime of doing devops wrong :P

~~~
DonHopkins
It's two crimes: doing devops wrong, and having a huge security hole.

------
afandian
One fun (and very interesting) string is EICAR[0]. I worked for an antivirus
company once and we had the EICAR string for testing but couldn't check it
into source control because it triggered the AV software which we dogfooded...

Is it naughty to include it here?

    
    
        X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*
    
    

[0]
[https://en.wikipedia.org/wiki/EICAR_test_file](https://en.wikipedia.org/wiki/EICAR_test_file)

~~~
girvo
Aw, Sophos on OS X doesn't think it's a threat.

~~~
afandian
Without giving too much away, I was sufficiently surprised by that that I
downloaded the Sophos for Mac Home Edition. It does recognise it.

Here's what I get:
[http://i.imgur.com/JQzVsQf.png](http://i.imgur.com/JQzVsQf.png)

This was picked up by the on-access scanner and a manual scan. The Web
Protection doesn't complain about the text in a page (rightly or wrongly).

Are you using a centrally managed version (i.e. not Home Edition)?

~~~
girvo
Interestingly, I found what caused the false-negative. If I used Vim to create
the file, it was picked up. If I "echo ...EICAR > text.txt" it doesn't get
picked up, at least not immediately!

~~~
afandian
The on-access scanner intercepts requests to open files, and scans them. Echo
just writes to the file and closes it. It doesn't try to open it again once
the EICAR string is in there. I'm speculating here, but Vim probably writes
the file/buffer, flushes, and then tries to obtain a file handle to it. At
that point an on-access scan will occur, and it will find the EICAR string.

A scheduled scan would pick this up eventually.

------
efriese
Yeah, I would make the SQL injection and command injections test a little less
kinetic =). Using a simple SELECT test, like SELECT @@VERSION, would be a
little safer... Edit: Forgot to say thanks! This is a pretty cool list.

~~~
bryanlarsen
You want something that modifies data so that you can detect that the SQL
executed. But an INSERT would be much friendlier than a DROP TABLE. :)

~~~
efriese
Not necessarily. If you do a test with good SQL and a second test with SQL
injection and compare the responses, that can show SQL injection exists
without having to change the database. This won't work for all SQL injection
tests, but I would rather take this approach first.
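
A minimal sketch of that compare-the-responses idea, using an in-memory SQLite database. The `users` table, the two lookup functions, and the probe string are all assumptions made up for illustration; a real target would need probes shaped to its own queries.

```python
import sqlite3


def make_db():
    """Build a throwaway in-memory database with two known rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])
    return conn


def lookup_vulnerable(conn, name):
    # Deliberately unsafe: user input is concatenated into the SQL text.
    query = "SELECT name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def lookup_safe(conn, name):
    # Parameterized query: the input is always treated as a value.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()


def looks_injectable(lookup, conn):
    """Compare a benign request with a read-only injection probe."""
    baseline = lookup(conn, "alice")
    probe = lookup(conn, "alice' OR '1'='1")
    # If the probe returns more rows than the benign lookup, the quote
    # escaped the string literal and changed the query's meaning.
    return len(probe) > len(baseline)


def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

Nothing in the probe writes to the database, so the row count is the same before and after the check.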

------
tptacek
This is good. There are lots of lists like this; you might find additional
strings to add to it here:

[https://code.google.com/p/fuzzdb/](https://code.google.com/p/fuzzdb/)

Fuzz lists are to web pentesters what drain snakes are to plumbers.

~~~
janfry
Another good list that incorporates FuzzDB:
[https://github.com/danielmiessler/SecLists](https://github.com/danielmiessler/SecLists)

As other commenters noted, strings like DROP TABLES should be used with
caution!

------
simonw
It's not completely clear to me which encoding the blns.txt file uses. Since
this project is all about weird/evil bytestrings, the encoding of the file
itself is very important.

Using a newline as a delimiter in that file excludes newlines from being part
of the strings you are testing - but newlines are an important "naughty"
character to consider. Unfortunately the same is true of basically any other
common delimiter character.

Maybe base64-encoding the strings would be one way to solve for this? You
could use base64-encoded values in JSON, for example.
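
Something like the following is what I have in mind; it is a sketch only, and the function names are hypothetical rather than anything the project ships. It works on bytes, so entries that are not valid UTF-8 survive too.

```python
import base64
import json


def encode_entries(entries):
    """Serialize a list of byte strings as a JSON array of base64 text."""
    return json.dumps(
        [base64.b64encode(entry).decode("ascii") for entry in entries]
    )


def decode_entries(payload):
    """Invert encode_entries, returning the original byte strings."""
    return [base64.b64decode(item) for item in json.loads(payload)]
```

Because every entry is reduced to the base64 alphabet, newlines, NUL bytes, and delimiter characters inside an entry cannot collide with the file format.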

~~~
minimaxir
Fair question. Encoding is UTF-8. This is fine for time being since UTF-8 is
ubiquitous.

I had it set as UTF-16 for the two-byte characters when first writing it, but
that had caused issues. If there is a demand, a second list can be added.

------
adzicg
For anyone testing web sites, I built a Chrome extension that makes things
like this available in the right-click menu [1]. The code is on GitHub, so it
can be easily extended [2].

[1] - [https://chrome.google.com/webstore/detail/bug-magnet/efhedld...](https://chrome.google.com/webstore/detail/bug-magnet/efhedldbjahpgjcneebmbolkalbhckfi?hl=en)

[2] - [https://github.com/gojko/bugmagnet](https://github.com/gojko/bugmagnet)

------
acehyzer
If I put this into my company's tests, we'd end up with no users... I have a
lot of work ahead of me. :/

~~~
sanderjd
Yeah, the other exploit strings do innocuous stuff like putting up javascript
alerts or touching files, but the SQL injection ones aren't innocuous at all.
I wonder if there's something better to replace those with. Something like
`1'; CREATE TABLE blns ...--` would be more akin to what the shell exploits
do.
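
As a rough illustration with SQLite: the probe below is a hypothetical variant of that string, reshaped to fit the made-up INSERT statement it is injected into, and the `comments` table and function names are likewise assumptions.

```python
import sqlite3

# Hypothetical probe: closes the string literal and the VALUES list,
# creates a harmless canary table, then comments out the leftover SQL.
PROBE = "1'); CREATE TABLE blns (x); --"


def new_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE comments (body TEXT)")
    return conn


def save_comment_vulnerable(conn, body):
    # Deliberately unsafe: concatenation plus multi-statement execution.
    conn.executescript("INSERT INTO comments VALUES ('" + body + "');")


def save_comment_safe(conn, body):
    # Parameterized: the probe is stored as plain text.
    conn.execute("INSERT INTO comments VALUES (?)", (body,))


def canary_exists(conn):
    """True if the injected CREATE TABLE actually ran."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = 'blns'"
    ).fetchone()
    return row is not None
```

The canary table plays the same role as a touched file in the shell probes: it is evidence that the statement ran, without destroying anything.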

------
reitanqild
Does anyone know if anything similar exists for telephone numbers?

Edit: Found this two minutes later:
[https://github.com/googlei18n/libphonenumber](https://github.com/googlei18n/libphonenumber),
seems to be an official Google product and Apache licensed.

------
thomasfoster96
Unintentionally, this also shows that GitHub is going pretty well when it
comes to sanitising user inputs.

~~~
duncans
Thankfully they're not sanitising inputs but correctly encoding outputs.
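
To make the distinction concrete, here is a small sketch using Python's html.escape. The `render_comment` function is hypothetical and says nothing about how GitHub actually does it.

```python
import html


def render_comment(user_text):
    """Keep the input as-is in storage; encode it only when emitting HTML."""
    return "<p>" + html.escape(user_text) + "</p>"
```

The naughty string is stored unchanged, and only the HTML output carries the escaped form, so the same data could still be emitted correctly into JSON or plain text.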

------
orf
Looks interesting, but the Script Injection, SQL Injection and Server Code
Injection sections need a _lot_ more samples to be remotely useful.

~~~
minimaxir
I definitely agree; hence the open-sourceness. :)

I only added what was off the top of my head for those sections; this list
will consistently be updated.

~~~
vog
Wouldn't it make more sense to define building blocks and automatically
generate all sensible combinations? Otherwise I don't think this list can be
managed by hand, especially not in a volunteer project.
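
A rough sketch of what that could look like with itertools.product. The building blocks here are invented placeholders, not taken from the list, and deciding which combinations count as "sensible" is left open.

```python
from itertools import product

# Hypothetical building blocks: ways to break out of a context, payloads
# that do something harmless but observable, and trailing terminators.
PREFIXES = ["", "'", '"']
PAYLOADS = ["<script>alert(1)</script>", "$(touch /tmp/blns)"]
SUFFIXES = ["", "--", "\x00"]


def generate_strings(prefixes, payloads, suffixes):
    """Return every prefix + payload + suffix combination, in order."""
    return [
        prefix + payload + suffix
        for prefix, payload, suffix in product(prefixes, payloads, suffixes)
    ]


generated = generate_strings(PREFIXES, PAYLOADS, SUFFIXES)
```

The list grows multiplicatively with each block added, so the blocks themselves are what would need hand curation.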

------
siculars
Nice "in the beginning..." hebrew string:

בְּרֵאשִׁית, בָּרָא אֱלֹהִים, אֵת הַשָּׁמַיִם, וְאֵת הָאָרֶץ

~~~
gizmo686
Full context: this is the beginning of the Bible.

~~~
minimaxir
Yes, I'm lazy. :p

~~~
sam_goody
You'd do better with "הבה נרדה ונבלה שם שפתם אשר לא ישמעו איש שפת ראהו"
(Genesis 11:7) That's God saying he will make multiple languages to confuse
everyone...

------
itaibn
The list seems to be missing the simplest naughty string of all: The empty
string!

(Well, the text file has empty lines separating the comments and example
strings so it _technically_ includes the empty string, but it's not in the
JSON file.)

~~~
minimaxir
There is a pull request pending that fixes this.

~~~
DonHopkins
I also submitted a pull request with an infinitely long string, but it's still
pending...

------
jl6
Is the scope just well-formed strings or would you consider adding binary
nasties like null bytes, mal-encoded characters, or even just newlines on
their own?

What about XML billion laughs strings, or parser-busting very long runs of
parentheses?

~~~
eli
I've definitely seen NUL bytes in what's supposed to be a text string break
many tools.

------
hoprocker
Nice; sort of a programming complement to Shutterstock's _List of Dirty,
Naughty, Obscene, and Otherwise Bad Words_[0]. So helpful to have a bunch of
minds working on useful lists like this. Good to see that GitHub passes this
test!

[0] [https://github.com/shutterstock/List-of-Dirty-Naughty-Obscen...](https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words)

~~~
brohee
I doubt the value of this repository. The first naughty French word "allumé"
can't be considered naughty, dirty, or bad, like, at all. And many others are
not naughty under too many circumstances...

Except very few swear words, word filtering is pretty much useless.

~~~
joosters
Alternative source: list of reddit usernames...

------
joelcollinsdc
Great list. A few questions:

* How could this be used to test 'corrupt' characters? Isn't the process of saving the file itself as UTF-8 un-corrupt...the file?

* Is there some recommended way to group these into "strings that should pass validation" versus "strings that should fail"... or is that too application-specific?

------
pbnjay
If you really intend this for use in testing, I'd suggest making the
injections less nasty. I could easily see a junior dev slapping this in and
deleting some important stuff.

I'd also add more invalid UTF encodings and embedded null bytes, etc. The JSON
format would be preferable to plain text for that though.

~~~
ph0rque
Thankfully, there are no strings invoking Cthulhu :)

~~~
tsemple
lol! You must be referring to the ICFP contest 2015.
[http://icfpcontest.org/](http://icfpcontest.org/)

~~~
ph0rque
I was actually inspired by the concept that lovecraftian horrors can be
accessed and interacted with programmatically, prominently featured in
Stross's Atrocity Archives:
[https://en.wikipedia.org/wiki/The_Atrocity_Archives](https://en.wikipedia.org/wiki/The_Atrocity_Archives)

------
userbinator
/dev/urandom can also be used as a source of random and unusual input data, as
it contains by definition all 256 byte values and 65536 2-byte values, 16M
3-byte values, etc., and should eventually output every possible string.

~~~
RandomBK
> and should eventually output every possible string.

"Eventually" being the key word here. Fuzzing with purely random inputs will
take eons to actually reveal non-trivial bugs...
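
To put a rough number on "eons", here is a sketch that uses os.urandom as a stand-in for reading /dev/urandom directly. Both function names are made up for illustration.

```python
import os


def valid_utf8_fraction(length, trials):
    """Fraction of random byte strings of `length` that decode as UTF-8."""
    valid = 0
    for _ in range(trials):
        try:
            os.urandom(length).decode("utf-8")
            valid += 1
        except UnicodeDecodeError:
            pass
    return valid / trials


def expected_draws(target):
    """Expected number of uniform draws of len(target) random bytes
    needed to hit one specific byte string: 256 ** len(target)."""
    return 256 ** len(target)
```

Even a specific 8-byte string needs about 2**64 draws on average, and most longer random draws will not even be valid UTF-8, so they tend to exercise the decoder's error path rather than the logic behind it.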

------
x0
I absolutely love strange unicode strings. It's handy if you ever want to find
out what a server's running. One time, I put a bunch of emoji in a GET param
of a Google site, then got a big Java error page. I had no idea Google ran
Java.

Edit: Another one that tends to be fun is [] in the param, like
[http://example.com/?get[]=[]](http://example.com/?get\[\]=\[\]).

And you can put things inside, like
[http://example.com/?get['"%05<!]=[%FE%FF]](http://example.com/?get\['"%05<!\]=\[%FE%FF\])

------
nradov
For more great examples of "naughty" strings see the Twitter @glitchr_
account. [https://twitter.com/glitchr_](https://twitter.com/glitchr_)

------
ivanca
Complete AI is not the hardest problem in CS, parsing text is. Joking aside,
this reminded me of that CSS vulnerability that allowed attackers to read
people's mail: [http://scarybeastsecurity.blogspot.com/2009/12/generic-cross...](http://scarybeastsecurity.blogspot.com/2009/12/generic-cross-browser-cross-domain.html)

------
webo
I don't deal with user input validation, but any resources for reading about
handling various inputs like the ones in blns?

------
TallGuyShort
I don't recall exactly where this was, but I know I've worked with an API
before that sometimes dropped requests, and it was because some randomly
generated data included 'naughty text' like 'xxx', or profanity. I was
expecting a dataset intended to catch this problem...

------
ck2
OT but is there a way to see projects with the most stars on github?

This one seems to be skyrocketing.

Oh here we go, and lookie who is at the top:
[https://github.com/trending](https://github.com/trending)

~~~
tlrobinson
[https://github.com/stars?direction=desc&sort=stars](https://github.com/stars?direction=desc&sort=stars)

~~~
ck2
Well that is for your account, I meant overall for the system and the trending
report does that...

------
homakov
Should be 1 long string; then, if something fails, use bisection.
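
A minimal sketch of the bisection step. It assumes a single string is responsible on its own and that concatenation neither creates nor hides the failure; `fails` is a hypothetical predicate standing in for "feed this input to the system under test and see if it breaks".

```python
def find_culprit(strings, fails):
    """Return one string that makes `fails` true, or None if the
    concatenation of all strings passes."""
    if not fails("".join(strings)):
        return None
    lo, hi = 0, len(strings)
    # Invariant: the culprit is assumed to lie in strings[lo:hi].
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails("".join(strings[lo:mid])):
            hi = mid  # the left half still fails
        else:
            lo = mid  # so the culprit is assumed to be in the right half
    return strings[lo]
```

That takes roughly log2(n) extra test runs after the first failure, instead of one run per string.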

------
rectangletangle
This should be really handy for fuzz testing, nice work!

------
iopuy
Bookmark

