I can think of a few more cases that I've seen cause havoc:
- U+FEFF in the middle of a string (people are used to seeing it at the start of a string as a byte-order mark, thanks to Microsoft tooling, but elsewhere it may be more surprising)
- U+0000 (it's encoded as the null byte!)
- U+001B (the codepoint for "escape")
- U+0085 (Python's "codecs" module treats this as a newline, while the "io" module and the Python 3 standard library don't)
- U+2028 and U+2029 (even weirder line breaks; JSON allows them unescaped in strings but pre-ES2019 JavaScript string literals don't, which causes disagreement)
- A glyph with a million combining marks on it, but not in NFC order (do your Unicode algorithms use insertion sort?)
- The sequence U+100000 U+010000 (triggers a weird bug in Python 3.2 only)
- "Forbidden" strings that are still encodable, such as the noncharacters U+FFFF, U+1FFFF, and U+FDD0 (the whole range U+FDD0..U+FDEF is designated "noncharacter")
People should also test what happens with isolated surrogate codepoints, such as U+D800. But these can't properly be encoded in UTF-8, so I guess don't put them in the BLNS. (If you put the fake UTF-8 for them in a file, the best thing for a program to do would be to give up on reading the file.)
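A couple of the behaviors above are easy to demo. A minimal sketch against CPython 3, showing the U+0085 line-break disagreement (str.splitlines() treats it as a newline; text-mode file I/O doesn't) and the lone-surrogate encoding failure:

```python
# U+0085 (NEL): str.splitlines() treats it as a line break,
# while universal-newlines file I/O only splits on \n, \r, \r\n.
assert "foo\u0085bar".splitlines() == ["foo", "bar"]

# Lone surrogates such as U+D800 cannot be encoded as UTF-8;
# CPython's strict default refuses.
try:
    "\ud800".encode("utf-8")
except UnicodeEncodeError:
    print("lone surrogate rejected")
```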
/dev/null; rm -rf /*; echo
That's a little aggressive for testing, no?
1;DROP TABLE users
1'; DROP TABLE users--
Seems a bit hairy to have that in there, in case someone tries to run these tests against their prod environment.
So someone might test against prod without really knowing what these strings could do.
Is it naughty to include it here?
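For what it's worth, these strings are only dangerous to code that splices them into SQL; with parameterized queries they're inert. A minimal sketch with Python's built-in sqlite3 (the users table here is just for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

naughty = "1'; DROP TABLE users--"

# The value is passed as a bound parameter, never interpolated
# into the SQL text, so it is stored literally, not executed.
conn.execute("INSERT INTO users (name) VALUES (?)", (naughty,))

rows = conn.execute("SELECT name FROM users").fetchall()
print(rows)  # the naughty string is just data; the table survives
```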
Here's what I get: http://i.imgur.com/JQzVsQf.png
This was picked up by the on-access scanner and a manual scan. The Web Protection doesn't complain about the text in a page (rightly or wrongly).
Are you using a centrally managed version (i.e. not Home Edition)?
A scheduled scan would pick this up eventually.
Edit: Seriously, Microsoft?
Description: This program is dangerous and replicates by infecting other files.
Recommended action: Remove this software immediately.
>Anti-virus programmers set the EICAR string as a verified virus, similar to other identified signatures. A compliant virus scanner, when detecting the file, will respond in exactly the same manner as if it found a harmful virus. Not all virus scanners are compliant, and may not detect the file even when they are correctly configured.
Fuzz lists are to web pentesters what drain snakes are to plumbers.
As other commenters noted, strings like DROP TABLE should be used with caution!
Using a newline as a delimiter in that file excludes newlines from being part of the strings you are testing - but newlines are an important "naughty" character to consider. Unfortunately the same is true of basically any other common delimiter character.
Maybe base64-encoding the strings would be one way to solve this? You could use base64-encoded values in JSON, for example.
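A quick sketch of that idea: base64 sidesteps the delimiter problem entirely, since any byte sequence, newlines and nulls included, survives the round trip and the encoded form is safe in JSON or a newline-delimited file:

```python
import base64
import json

naughty = "line one\nline two\x00with a null"
encoded = base64.b64encode(naughty.encode("utf-8")).decode("ascii")

# The encoded value contains no newlines or quotes, so it can be
# embedded in JSON, CSV, or a line-delimited list without escaping.
blob = json.dumps({"strings": [encoded]})

decoded = base64.b64decode(json.loads(blob)["strings"][0]).decode("utf-8")
assert decoded == naughty
```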
I had it set as UTF-16 for the two-byte characters when first writing it, but that caused issues. If there is demand, a second list can be added.
 - https://chrome.google.com/webstore/detail/bug-magnet/efhedld...
 - https://github.com/gojko/bugmagnet
Edit: Found this two minutes later: https://github.com/googlei18n/libphonenumber, seems to be an official Google product and is Apache-licensed.
I only added what was off the top of my head for those sections; this list will be updated continually.
בְּרֵאשִׁית, בָּרָא אֱלֹהִים, אֵת הַשָּׁמַיִם, וְאֵת הָאָרֶץ ("In the beginning God created the heavens and the earth", Genesis 1:1; handy for testing right-to-left rendering)
(Well, the text file has empty lines separating the comments and example strings so it technically includes the empty string, but it's not in the JSON file.)
What about XML billion laughs strings, or parser-busting very long runs of parentheses?
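For anyone who hasn't seen it, the classic billion-laughs payload is only a few lines of XML; each entity level multiplies the previous one by ten, so the single &lol9; reference expands to about 10^9 "lol"s (roughly 3 GB) when a parser resolves entities naively. (Shown for illustration only; modern expat builds cap entity expansion, but don't feed this to a parser you care about.)

```xml
<?xml version="1.0"?>
<!DOCTYPE lolz [
  <!ENTITY lol "lol">
  <!ENTITY lol1 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
  <!ENTITY lol2 "&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;">
  <!-- lol3 through lol8 follow the same pattern -->
  <!ENTITY lol9 "&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;">
]>
<lolz>&lol9;</lolz>
```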
There are so many creative ways to get around swearing. Replace letters with numbers, drop consonants and vowels. And you almost always need to check for word boundaries otherwise somebody from Scunthorpe might be upset you banned them. And then there are cases where word boundaries aren't enough. Good luck ;-)
Except for a very few swear words, word filtering is pretty much useless.
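The word-boundary point can be sketched in a few lines, using "ass" as a stand-in banned word (the same shape as the Scunthorpe false positive). Naive substring matching flags innocent words; a \b word boundary fixes the easy cases, though as noted above it's nowhere near sufficient in general:

```python
import re

banned = "ass"

# Naive substring check: false positive on an innocent word.
assert banned in "classic"

# Word-boundary check: "classic" passes, real uses still match.
assert re.search(r"\bass\b", "classic") is None
assert re.search(r"\bass\b", "kick ass") is not None
```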
* How could this be used to test 'corrupt' characters? Doesn't the process of saving the file as UTF-8 itself un-corrupt the file?
* Is there some recommended way to group these into "strings that should pass validation" versus "strings that should fail"... or is that too application-specific?
I'd also add more invalid UTF encodings and embedded null bytes, etc. The JSON format would be preferable to plain text for that though.
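A few seeds for that category, as a sketch: byte sequences a strict UTF-8 decoder must reject (behavior checked against CPython's default strict error handling):

```python
bad_utf8 = [
    b"\xc0\x80",      # overlong encoding of U+0000
    b"\xed\xa0\x80",  # UTF-8-encoded surrogate U+D800
    b"\xff",          # 0xFF can never appear in UTF-8
    b"\xe2\x82",      # truncated three-byte sequence
]

for seq in bad_utf8:
    try:
        seq.decode("utf-8")
        print("accepted (lenient decoder?):", seq)
    except UnicodeDecodeError:
        pass  # expected: a strict decoder rejects all of these
```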
"Eventually" being the key word here. Fuzzing with purely random inputs will take eons to actually reveal non-trivial bugs...
Edit: Another one that tends to be fun is  in the param, like http://example.com/?get=.
And you can put things inside, like http://example.com/?get['"%05<!]=[%FE%FF]
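A sketch of how those %FE%FF-style payloads round-trip through percent-encoding, using Python's urllib.parse (the raw bytes here are just an example mix of control, markup, and high-byte characters):

```python
from urllib.parse import quote, unquote_to_bytes

raw = b'\x05<!"\xfe\xff'

# quote() accepts raw bytes and percent-encodes everything
# outside the unreserved set.
encoded = quote(raw)

# Decoding back to bytes recovers the original payload exactly.
assert unquote_to_bytes(encoded) == raw
print(encoded)
```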
This one seems to be skyrocketing.
Oh here we go, and lookie who is at the top: https://github.com/trending