
Big List of Naughty Strings - pmoriarty
https://github.com/minimaxir/big-list-of-naughty-strings
======
dang
Big lists of previous comments:

[https://news.ycombinator.com/item?id=13406119](https://news.ycombinator.com/item?id=13406119)

[https://news.ycombinator.com/item?id=10035008](https://news.ycombinator.com/item?id=10035008)

------
minimaxir
Repo maintainer here.

...can someone explain how the repo keeps resurfacing? I haven’t promoted it
in a _long_ time. (Looking at the repo traffic, it recently spiked on the 6th,
but nothing since then.)

~~~
sbr464
Tangentially related to the original project intent;

Is there a place where common things in the dev world like this are
accumulated? For example, a list of all countries or list of the US states,
for use with an HTML dropdown. I know there are various repos on Github that
maintain these types of lists, such as English stop words, profanity word
lists etc, but is there a service that accumulates these in a familiar,
structured api?

~~~
majewsky
Look at Wikipedia's lists of things. For your particular examples:

[https://en.wikipedia.org/wiki/List_of_sovereign_states](https://en.wikipedia.org/wiki/List_of_sovereign_states)

[https://en.wikipedia.org/wiki/U.S._state](https://en.wikipedia.org/wiki/U.S._state)

Some of them are quite meta, such as
[https://en.wikipedia.org/wiki/List_of_lists_of_lists](https://en.wikipedia.org/wiki/List_of_lists_of_lists)

For a more structured source, Wikidata aims to be that, but I cannot comment
on its completeness.

~~~
sbr464
Structured, maintained API though, not general knowledge. I personally see an
issue that someone has to accumulate their own stash of structured data for
common knowledge (random examples) like: countries, zip codes, valid HTML5
element names, css properties, hex colors, common naming
prefix/suffixes/professional titles, etc. A growing list of work repeated by
each dev team/company for really no reason. No complaint about this repo, at
all, just seeking if a solution exists.

~~~
thinkalone
There is Corpora:
[https://github.com/dariusk/corpora/tree/master/data](https://github.com/dariusk/corpora/tree/master/data)

~~~
sedatk
> Corpora is a collection of small files. It is not meant to be an exhaustive
> source of anything: a list of resources should contain somewhere in the
> vicinity of 1000 items.

------
stcredzero
There's this recurring problem with certain strings which crash the Messages
app in iOS, leaving a big hole in functionality and making iOS look pretty
bad. I've pointed out that this is inexcusable for any language where you have
exception handling. The standard reply to that on HN is to point out that you
want the process to die once it's gone into undefined behavior, then downvote
me.

This puts me in mind of interviews, where I point out to the candidate that
their update routine would go into an infinite loop if there was a 2 node
cycle in their data. So then they give me an if statement that detects only
the 2 node loop. I've even then asked what would happen if there was a 3 node
loop, and gotten a 2nd if statement for that as well.
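
For illustration, a minimal Python sketch of the general fix the interview question is fishing for: remember which nodes have already been visited, so a cycle of any length terminates. The function name and the dictionary-of-lists graph shape are assumptions made for this example.

```python
def visit_all(start, neighbours):
    """Visit every node reachable from `start` exactly once.

    `neighbours` maps a node to the nodes it points at (hypothetical shape).
    The `seen` set is what makes cycles of any length harmless.
    """
    seen = set()
    stack = [start]
    order = []
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already handled: this is where a cycle gets cut
        seen.add(node)
        order.append(node)
        stack.extend(neighbours.get(node, ()))
    return order


two_cycle = {"a": ["b"], "b": ["a"]}
three_cycle = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(visit_all("a", two_cycle))
print(visit_all("a", three_cycle))
```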

Apps which might crash due to processing untrusted data should be reading that
data from a queue. Then a 2nd process can monitor the 1st process, taking
problematic data off the queue if necessary. This way the 1st process can die,
but be restarted, and your smartphone OS doesn't have to look completely
broken due to a primary function just dying, requiring the user to reboot.
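
A rough Python sketch of that queue-plus-monitor idea, under stated assumptions: the worker is a throwaway child process that dies on one made-up trigger (`crash-me`), and `drain`, `WORKER_SOURCE` and `max_attempts` are names invented here. It says nothing about how iOS actually works.

```python
import subprocess
import sys
from collections import deque

# Hypothetical worker: it dies outright on one "naughty" input instead of
# handling the error, standing in for a parser or renderer that crashes.
WORKER_SOURCE = """
import os, sys
data = sys.stdin.buffer.read()
if b"crash-me" in data:
    os._exit(70)  # simulate an unrecoverable crash
sys.stdout.write("rendered %d bytes" % len(data))
"""


def drain(messages, max_attempts=2):
    """Feed queued messages to a worker process, one process per attempt.

    A message that kills the worker `max_attempts` times is moved to a
    quarantine list so the messages behind it still get delivered.
    """
    queue = deque(messages)
    attempts = {}
    delivered, quarantined = [], []
    while queue:
        msg = queue[0]  # peek: the message stays queued until it is handled
        result = subprocess.run(
            [sys.executable, "-c", WORKER_SOURCE],
            input=msg,
            capture_output=True,
        )
        if result.returncode == 0:
            delivered.append(queue.popleft())
            continue
        attempts[msg] = attempts.get(msg, 0) + 1
        if attempts[msg] >= max_attempts:
            quarantined.append(queue.popleft())
    return delivered, quarantined


if __name__ == "__main__":
    print(drain([b"hi", b"crash-me", b"bye"]))
```

The monitor only learns that the worker died while holding a given message, so it cannot tell a truly poisonous message from an unlucky one; the retry count is a crude guard against that.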

I hope someone tells me Apple has already done this. It's been something
approaching a decade, at this point.

~~~
charleslmunger
I have no knowledge of apple's decisions here, and no experience with iOS, and
you're generally correct that any sort of batch processing should be able to
recover from bad input, but I'll try to make an argument as to why this isn't
as simple as it sounds.

1\. Creating a graceful degradation path is complicated and expensive, from
a UX, product, and testing perspective - if you silently drop the message,
that's much worse than crashing; if you insert some sort of tombstone "this
message could not be parsed" into the UI, you have to figure out if that's
something users will actually understand. These code paths are hard to
exercise with normal integration or manual testing, and it's likely that over
time they may stop working correctly.

2\. Monitoring graceful degradation is more complicated than tracking crashes,
especially for unforeseen issues like what you're describing. There's a real
risk that if this periodically showed up with "failed to parse" messages, the
actual issue would have remained undiscovered by Apple a lot longer.

3\. Again, no familiarity with iOS, but on other mobile operating systems I've
worked with there's a significant memory overhead to using multiple processes
and IPCs this way. If your device is under memory pressure, the extra cost of
a pipeline like this introduces new failure modes. First, your other process
may be killed by the LMK, which the sender process could interpret as being a
failed message. Second, you may increase the amount of memory required to
receive a text message in the background - this can directly affect critical
high-memory use cases like taking a live video with the camera. There may just
not be enough room to do both at the same time.

4\. There's significant input processing that can't be done in another
process, or meaningfully isolated from that of other messages - the best
example being UI rendering of the text. If there's a magic string that causes
view measurement to fail, that's extremely difficult to attribute to any
specific piece of input - so adding this extra process to do validation won't
really help you, since the "validated" string will fail later on.

~~~
kayamon
Are you seriously making the argument that a program reading untrusted data
off a network should crash if it can’t parse the data?

Programs should never, EVER crash. If you can’t decide/handle a Unicode
codepoint, replace it with a question mark (or box etc) and carry on. And yes,
a big-name program like iMessage absolutely should have unit tests for this.
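
As a small illustration of the replace-and-carry-on approach, here is how Python's built-in decoder behaves with `errors="replace"`. The sample bytes are made up, and this covers only decoding, not rendering.

```python
# Bytes containing valid UTF-8 ("café") followed by two invalid bytes.
raw = b"caf\xc3\xa9 \xff\xfe ok"

# Strict decoding raises instead of returning partial text.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With errors="replace", each undecodable byte becomes U+FFFD and the rest
# of the text survives.
text = raw.decode("utf-8", errors="replace")
print(strict_failed, text)
```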

I can’t believe I’m having to explain this.

~~~
zoul
_Programs should never, EVER crash._

Oh yes they should. If, for example, the alternative is to continue in an
undefined state, potentially corrupting user data. Not all error conditions
are recoverable.

~~~
feanaro
The right question to ask yourself is _why_ this is the only alternative. It
almost never is, if the program is properly designed, especially for a user-
facing application.

~~~
zoul
As an example, the default reaction to accessing an invalid index in a Swift
array is trapping. How would you solve this “properly”? Returning an optional
(ie. Maybe) would be too cumbersome and wouldn’t lead to safer code in
practice.

------
robertkrossa
This string made me chuckle

[https://github.com/minimaxir/big-list-of-naughty-strings/blo...](https://github.com/minimaxir/big-list-of-naughty-strings/blob/4115c9deee71a7d732d4e50b814df33c4207789b/blns.txt#L672)

~~~
sideshowb
I enjoy this sort of humour as much as the next nerd, but just so y'all know,
someone close to me who has been through psychosis linked to that particular
delusion could have any one of a number of responses to that sentence - from
getting a little freaked out through to anxiety attack or in the worst case
full-on relapse into psychosis.

I wouldn't remove it from the list, as this sort of thing adds a bit of
character, fun and a sense of shared values to programming; what's life for if
you can't have a laugh? But if you're ever making jokes of that ilk and
someone in the room goes a bit ... quiet ... do gently check up on them.

~~~
AnIdiotOnTheNet
It isn't delusion, because it could be true, it's just irrelevant. For all you
know you're just a brain in a jar hooked up to simulated input, there's no way
to prove otherwise. We have no choice but to accept that our senses provide us
some approximation of reality.

~~~
rocqua
If the brain in a vat gets perfectly simulated input, then indeed, there is no
way to tell. Same thing if we live in a computer simulation.

However, if such a surrounding simulation 'errors', that can be noticed.
Moreover, there is the rather pressing issue of 'being turned off'. It's a
difficult (semantic?) discussion of whether you can notice your world being
deleted, but the thought of unpredictable all-encompassing extinction isn't
quite comforting.

A better reason not to care, is the idea that there is nothing you can do
about it. However, if we are being simulated, there is probably some observer
who might react to our actions. Moreover, if we are in a computer simulation
we might be able to trigger some 'bug'. So even that line of reasoning isn't
clear cut.

In the end, life does seem to be easier when you accept reality. Even more so
because any way we have to influence the 'outside' is so infeasible as to be
useless.

~~~
nradov
In theory some extremely high energy particle physics experiments ought to be
impossible to properly simulate, and so if those experiments produce anomalous
data then that could expose a "glitch in the matrix".

~~~
AnIdiotOnTheNet
How would you determine the difference between anomalous data as a result of
the simulation and anomalous data as an indication that your model of physics
is simply incomplete?

After all, you don't have data from a real reality to compare to.

~~~
Filligree
Indeed, there's no guarantee that the real world follows the same laws of
physics whatsoever. For all we know, it might allow hypercomputation.

------
lmcarreiro
I liked this one:

Ｔｈｅ ｑｕｉｃｋ ｂｒｏｗｎ ｆｏｘ ｊｕｍｐｓ ｏｖｅｒ ｔｈｅ ｌａｚｙ ｄｏｇ

𝐓𝐡𝐞 𝐪𝐮𝐢𝐜𝐤 𝐛𝐫𝐨𝐰𝐧 𝐟𝐨𝐱 𝐣𝐮𝐦𝐩𝐬 𝐨𝐯𝐞𝐫 𝐭𝐡𝐞 𝐥𝐚𝐳𝐲 𝐝𝐨𝐠

𝕿𝖍𝖊 𝖖𝖚𝖎𝖈𝖐 𝖇𝖗𝖔𝖜𝖓 𝖋𝖔𝖝 𝖏𝖚𝖒𝖕𝖘 𝖔𝖛𝖊𝖗 𝖙𝖍𝖊 𝖑𝖆𝖟𝖞 𝖉𝖔𝖌

𝑻𝒉𝒆 𝒒𝒖𝒊𝒄𝒌 𝒃𝒓𝒐𝒘𝒏 𝒇𝒐𝒙 𝒋𝒖𝒎𝒑𝒔 𝒐𝒗𝒆𝒓 𝒕𝒉𝒆 𝒍𝒂𝒛𝒚 𝒅𝒐𝒈

𝓣𝓱𝓮 𝓺𝓾𝓲𝓬𝓴 𝓫𝓻𝓸𝔀𝓷 𝓯𝓸𝔁 𝓳𝓾𝓶𝓹𝓼 𝓸𝓿𝓮𝓻 𝓽𝓱𝓮 𝓵𝓪𝔃𝔂 𝓭𝓸𝓰

𝕋𝕙𝕖 𝕢𝕦𝕚𝕔𝕜 𝕓𝕣𝕠𝕨𝕟 𝕗𝕠𝕩 𝕛𝕦𝕞𝕡𝕤 𝕠𝕧𝕖𝕣 𝕥𝕙𝕖 𝕝𝕒𝕫𝕪 𝕕𝕠𝕘

𝚃𝚑𝚎 𝚚𝚞𝚒𝚌𝚔 𝚋𝚛𝚘𝚠𝚗 𝚏𝚘𝚡 𝚓𝚞𝚖𝚙𝚜 𝚘𝚟𝚎𝚛 𝚝𝚑𝚎 𝚕𝚊𝚣𝚢 𝚍𝚘𝚐

⒯⒣⒠ ⒬⒰⒤⒞⒦ ⒝⒭⒪⒲⒩ ⒡⒪⒳ ⒥⒰⒨⒫⒮ ⒪⒱⒠⒭ ⒯⒣⒠ ⒧⒜⒵⒴ ⒟⒪⒢

~~~
zubi
So did I. But I'd appreciate if someone could explain how it works.

~~~
tialaramex
From the outset Unicode's goal (more so than ISO 10646 though now they're one
and the same) was to unify all existing character sets, so you'd only need
one.

Necessarily then, there should not be other sets that encode things you can't
in Unicode, since then you can't displace those with Unicode.

So, particularly in the early life of Unicode the goal was to collect stuff that
already exists and add it to Unicode. (These days we're finished with that and
most new work is on adding things that weren't previously in any character
set)

Two controversial things were done, at opposite ends of the spectrum, during
this period of consolidation:

What you're seeing here is adding copies of the entire Latin alphabet, but
with some particular property that Latin users would not really consider part
of the character, such as "bold" or "italic" but which _was_ preserved in some
character set being used somewhere. Without this choice, if we converted a
text file encoded in a way that distinguished bold and italic characters, we'd
lose that bold/italic and it might be significant. This would be like when
you get a black & white photocopy of a sheet that says

"Ignore any text below shown in red"

Um, but none of this text is red? Oh. Probably some of it was before it was
photocopied. Oops.

At the far end of the spectrum, a process called CJK unification took place in
which scholars of the languages using characters from the Han ("Chinese")
writing system decided that although say, a Japanese character set and a
Chinese character set both had a particular character, and the Chinese and
Japanese would not draw this character the same way, actually in some
linguistic sense it's the same character (and in many cases the visual
differences are quite small) and so Unicode should not encode both separately.

There's a coherent technical argument for why both these types of decisions
made sense, but they were nonetheless controversial.

You should not use weird characters like italic Latin letters in new
documents, but you also should not transform these characters without warning
when processing an existing document as you may lose important meaning.
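
To make the "transforming loses meaning" point concrete, a short Python sketch using the standard `unicodedata` module: NFKC normalization folds these compatibility characters back to plain Latin letters, which is handy for search but discards the styling. The sample string is stitched together from the examples above.

```python
import unicodedata

# Fullwidth, mathematical bold, bold Fraktur and parenthesized letters.
fancy = "Ｔｈｅ 𝐪𝐮𝐢𝐜𝐤 𝖇𝖗𝖔𝖜𝖓 ⒡⒪⒳"

# NFKC applies compatibility decompositions, so the styled letters collapse
# to their plain equivalents (and each parenthesized letter expands to
# three characters).
plain = unicodedata.normalize("NFKC", fancy)

print(plain)
print(fancy == plain)
```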

~~~
folbec
One of the reasons for these sets is mathematics: ℜ <> ℝ in a math text. (And
BTW the math symbols ℂℍℕℙℚℝ in the double-struck set are "out of sequence",
which can be a nasty surprise if you do naive incrementation.)

~~~
duskwuff
And ℤ. The reason these double-struck symbols are in a weird place
(U+2100-214f, separate from the rest in U+1d400-1d7ff) is because they all
have commonly used special meanings in mathematics -- they're used to
represent the sets of all numbers of various types. ℂ = complex numbers, ℍ =
quaternions, ℕ = natural numbers, ℚ = rational numbers, ℝ = real numbers, ℤ =
integers.
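
A quick Python check of the "naive incrementation" surprise, using the standard `unicodedata` module. The helper name is made up for this example; the base code point is MATHEMATICAL DOUBLE-STRUCK CAPITAL A.

```python
import unicodedata

BASE = 0x1D538  # MATHEMATICAL DOUBLE-STRUCK CAPITAL A


def naive_double_struck(letter):
    """Offset from the double-struck A, assuming the block is contiguous."""
    return chr(BASE + ord(letter) - ord("A"))


# Letters whose naive code point is an unassigned hole: their real
# double-struck forms live in the Letterlike Symbols block instead.
missing = "".join(
    letter
    for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    if unicodedata.name(naive_double_struck(letter), None) is None
)

print(missing)
print(unicodedata.name("\u211d"))  # the real double-struck R
```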

------
Uberphallus
+++ATH0 brings memories. In my early days of IRC, it used to wreak havoc to
write that in crowded channels. It blows my mind that modems processed that
string no matter the context.

~~~
eythian
Modems have no concept of context, and out of band signalling was optional.

It was _supposed_ to be +++[wait 0.5s]AT... but that pause was patented by
Hayes, so to avoid patent issues, many cloners didn't require it. At least,
that's the story I heard.
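
A toy Python model of that difference, not real modem firmware: `escape_detected` and the timestamped-chunk input format are invented for this sketch, and the 0.5 s guard value is simply the figure quoted above.

```python
GUARD_SECONDS = 0.5  # pause length as quoted in the comment; real modems vary


def escape_detected(events, guard=GUARD_SECONDS, require_pause=True):
    """Return True if the byte stream would drop the modem into command mode.

    `events` is a list of (arrival_time_seconds, bytes) chunks. With
    `require_pause`, "+++" only counts when nothing follows it for `guard`
    seconds; without it, any "+++" in the data counts.
    """
    stream = [(t, b) for t, chunk in events for b in chunk]
    for i in range(len(stream) - 2):
        if bytes(b for _, b in stream[i:i + 3]) != b"+++":
            continue
        if not require_pause:
            return True
        is_last = i + 3 == len(stream)
        if is_last or stream[i + 3][0] - stream[i + 2][0] >= guard:
            return True
    return False


# Someone types the string into a chat: it arrives in one burst.
irc_line = [(0.0, b"hello +++ATH0\r")]
# A terminal program deliberately escaping: "+++", a pause, then the command.
deliberate = [(0.0, b"+++"), (1.0, b"ATH0\r")]

print(escape_detected(irc_line, require_pause=False))
print(escape_detected(irc_line, require_pause=True))
print(escape_detected(deliberate, require_pause=True))
```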

~~~
xenadu02
It seems strange they didn't use DTR since RTS/CTS are usually used for
hardware flow control. I can see the point of DSR, which tells the computer
the modem exists... but there is no point in the modem looking for the
computer. In theory the PC could have asserted DTR whenever it wanted to
activate the control channel.

------
Wiretrip
In a similar vein, I always wondered how much trouble people born at midnight
on 01/01/1970 have with websites and software in general.
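
One hypothetical way such a birth date could go wrong, sketched in Python: code that stores the date as a Unix timestamp and uses a truthiness check to mean "missing". Both function names are made up, and nothing here is taken from a real site.

```python
from datetime import datetime, timezone


def describe_birth(ts):
    # Deliberately buggy: 0 is falsy, so midnight 1970-01-01 UTC is treated
    # the same as "no value".
    if not ts:
        return "unknown"
    return datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()


def describe_birth_fixed(ts):
    # Only None means "missing"; 0 is a perfectly valid timestamp.
    if ts is None:
        return "unknown"
    return datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()


print(describe_birth(0))
print(describe_birth_fixed(0))
```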

~~~
scbrg
I always enter that date into systems that ask for my birth date (unless
there's actually good reason for them to know it). I have never experienced
any problems.

------
Theodores
Implementing a solution where bad strings are not allowed is always fun. Had
to create a gig long list of warranty registration codes a while back where it
wasn't just the swear words that were naughty. There were also a few product
specific words as you can't have 'FAIL' or 'SNAP' as part of a warranty
registration code.

Other words that had to go included 'Jew'. Nothing 'naughty' about the word it
just didn't look good in the codes so that had to be added to the big, long
list of mostly 'naughty' words.

We also used a reduced dictionary so that codes could be given out over the
phone without people getting it wrong, so no '1, l, L' problem or '0, o, O'
problem.

I didn't find a handy library for our exact requirements and, as a
consequence, I would advise people to roll their own code for such
applications.
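
A minimal roll-your-own sketch in Python, assuming upper-case codes. The alphabet, the two-word block list and the function names are placeholders for this example, not the lists described above.

```python
import secrets

# Upper-case letters and digits with the look-alikes removed:
# no 0/O and no 1/I/L (dropping I as well is an extra assumption).
SAFE_ALPHABET = "ABCDEFGHJKMNPQRSTUVWXYZ23456789"

# Placeholder block list; a real one would be much longer.
BLOCKED = {"FAIL", "SNAP"}


def contains_blocked(code, blocked=BLOCKED):
    """True if any blocked word appears anywhere inside the code."""
    return any(word in code for word in blocked)


def new_code(length=10):
    """Draw random codes until one contains no blocked word."""
    while True:
        code = "".join(secrets.choice(SAFE_ALPHABET) for _ in range(length))
        if not contains_blocked(code):
            return code


print(new_code())
```

Removing letters from the alphabet already makes some blocked words impossible ("FAIL" cannot appear without I and L), but the substring check is kept so the block list stays independent of the alphabet.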

The fun part was making my non-technical manager in charge of the big, long
list of rude words. I think that his additions were the only contribution to
'code' that he ever made.

~~~
deaps
That's actually really interesting. Something you'd _never_ think about unless
you've actually worked on a similar application for a rather large system
where a lot of long codes had to be generated.

------
anonu
Found this in the list:

[https://en.wikipedia.org/wiki/Scunthorpe_problem](https://en.wikipedia.org/wiki/Scunthorpe_problem)

~~~
fredley
A clbuttic example.
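
For anyone who has not seen where "clbuttic" comes from, a tiny Python sketch of the substring filter that produces it, next to a word-boundary version. Both function names are made up, and the word-boundary variant is only less wrong, not a real solution to the Scunthorpe problem.

```python
import re


def naive_censor(text):
    # Replaces the substring wherever it occurs, even inside other words.
    return text.replace("ass", "butt")


def word_censor(text):
    # Only replaces the substring when it stands alone as a word.
    return re.sub(r"\bass\b", "butt", text)


print(naive_censor("a classic example"))
print(word_censor("a classic example"))
```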

~~~
ThePadawan
I'll never look at basements the same.

------
inspector14
"If you're reading this, you've been in a coma for almost 20 years now. We're
trying a new technique. We don't know where this message will end up in your
dream, but we hope it works. Please wake up, we miss you."

------
trash_panda
This is really useful for security testing, where unexpected input could have
security implications.

There is a similar project, which I think is better organized and has more
lists to play with:

[https://github.com/danielmiessler/SecLists](https://github.com/danielmiessler/SecLists)

------
leni536
It looks like the master file is blns.txt while the other files are
generated from this text file (correct me if I'm wrong). So the strings are
separated by newline characters and there is no way to escape in the strings.
So a nasty string that trips up blns.txt and the tools around it would be
"\n".

I think this is unfortunate since strings with newlines in them could
certainly trip up some bash scripts. Filenames on Linux can contain newlines
for example.
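
A short Python sketch of the limitation and of one way around it, Base64-encoding each entry before joining. The three sample strings are made up, and this is not a description of how the repo's tooling works.

```python
import base64

strings = ["plain", "two\nlines", "end"]

# Newline-delimited storage: the embedded newline splits one entry in two.
naive = "\n".join(strings).split("\n")

# Base64 output never contains a newline, so one entry per line is safe.
encoded = "\n".join(
    base64.b64encode(s.encode("utf-8")).decode("ascii") for s in strings
)
decoded = [
    base64.b64decode(line).decode("utf-8") for line in encoded.split("\n")
]

print(naive)
print(decoded == strings)
```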

------
SimpJee
I just had an issue today with an internal program; the error was "unclosed
parenthesis". A string with an unclosed parenthesis would give you this
error.

In our case the string that had the issue should have been machine generated
but ended up being assigned user input... Is that kind of string also helpful
in a list like this?

------
utrack
There's also a great tool to scrub bad symbols from your data:
[https://www.polydesmida.info/cookbook/gremlins.html](https://www.polydesmida.info/cookbook/gremlins.html)

------
nradov
The Twitter @glitchr_ account is also a great source of bizarre Unicode
strings for testing, especially with UIs.

[https://twitter.com/glitchr_](https://twitter.com/glitchr_)

------
shangxiao
I'm curious to know what it is about each of the strings that makes it
naughty. A comment next to each one describing what it's supposed to be
testing would be great.

~~~
Arnt
[https://github.com/minimaxir/big-list-of-naughty-strings/com...](https://github.com/minimaxir/big-list-of-naughty-strings/commit/85bc805f4f7375445839a7509a6d3df51a6ddb01)
is a fine commit message, don't you agree?

~~~
thomasfedb
And this is the language we're meant to reinvent apps with. Oh my.

~~~
beatgammit
There's a reason we're moving toward webassembly. This isn't it, but there's a
reason :)

------
tlrobinson
Am I missing something or do none of these strings have newlines in them
(because they're defined in a newline delimited file)?

EDIT: Yup, there's an issue for that:
[https://github.com/minimaxir/big-list-of-naughty-strings/iss...](https://github.com/minimaxir/big-list-of-naughty-strings/issues/138)
It's a little ironic BLNS can't handle certain strings :)

------
lifeformed
Shouldn't the SQL injection test be something other than `DROP TABLES`? I
mean, this is meant for testing for weakness, not exploiting it.
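
One way to try such strings without risking anything real, sketched in Python with the standard `sqlite3` module: run against a throwaway in-memory database and check that the payload is stored as plain data. The table, the helper name and the payload are made up for this example.

```python
import sqlite3


def store_name(conn, name):
    # Parameterized query: the driver passes `name` as data, never as SQL.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


# In-memory database, so even a successful injection destroys nothing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")

payload = "Robert'); DROP TABLE users;--"
store_name(conn, payload)

# If the table still exists and holds the payload verbatim, the input was
# treated as data.
rows = conn.execute("SELECT name FROM users").fetchall()
print(rows)
```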

------
jawns
> Strings which may cause human to reinterpret worldview

I laugh every time I see this one. Which is just about every year on HN.

------
toomanybeersies
I actually did use this for unit testing once at an old job.

I can't for the life of me remember why, or even what I was testing. I
honestly think it was more to amuse myself than for an actual practical
reason.

------
jcelerier
> medieval erection of parapets

lordy

------
k__
Once, I copied a back-space and I don't know how.

~~~
uryga
there's an ASCII code that means "backspace"

[https://en.m.wikipedia.org/wiki/Backspace#Computers](https://en.m.wikipedia.org/wiki/Backspace#Computers)

------
agentofoblivion
Why do strings like “dW5kZWZpbmVk” cause problems?

~~~
guessmyname

        $ echo -n 'dW5kZWZpbmVk' | base64 --decode
        undefined
    

Many JavaScript-based applications break when the user-input is the string
“undefined”.

All of the entries in this repository are encoded using Base64, for obvious
reasons.

If the reasons are not obvious, read the header of the repository here:

> _Also, do not send a null character (U+0000) string, as it changes the file_

> _format on GitHub to binary and renders it unreadable in pull requests._

To prevent unintentional breaks of GitHub’s UI, the author decided to encode
the strings.

