Big List of Naughty Strings (github.com/minimaxir)
780 points by pmoriarty on Nov 16, 2018 | 146 comments




Repo maintainer here.

...can someone explain how the repo keeps resurfacing? I haven’t promoted it in a long time. (Looking at the repo traffic, it recently spiked on the 6th, but nothing since then.)


Since you are here: thanks for making this. I recently used it to prove to a client that the API I delivered could take any content they cared to throw at it. They were especially impressed considering they were coming from a 35-year-old system that only allowed ASCII.

The BLNS allowed me to prove it, and I hooked it into our integration and fuzz tests, which managed to shake out a few bugs.
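For anyone wondering what hooking it into a test can look like, here is a minimal Python sketch. The sample strings are only a handful in the spirit of the list (a real run would load the repo's JSON file), and `store`/`fetch` are hypothetical stand-ins for whatever API is under test, not anything from the repo.

```python
import json

# A few strings in the spirit of the list; a real run would load the
# repo's JSON file instead of hard-coding samples.
SAMPLE_STRINGS = [
    "",
    "undefined",
    "1/0",
    "' OR 1=1 --",
    "<script>alert(1)</script>",
    "ｆｕｌｌｗｉｄｔｈ",
    "בְּרֵאשִׁית",
    "👾 🙇 💁",
]

_db = {}  # hypothetical in-memory stand-in for the system under test


def store(key: str, value: str) -> None:
    # Serialize through JSON to mimic a trip over an API boundary.
    _db[key] = json.dumps({"value": value})


def fetch(key: str) -> str:
    return json.loads(_db[key])["value"]


def roundtrip_failures(strings):
    """Return every string that does not come back unchanged."""
    failures = []
    for i, s in enumerate(strings):
        store(f"k{i}", s)
        if fetch(f"k{i}") != s:
            failures.append(s)
    return failures


print(roundtrip_failures(SAMPLE_STRINGS))
```

An empty list means every string survived the round trip; anything else is a candidate bug report.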


You're very welcome! :D


It was brought up during VMware's internal security conference called "MooseCon" earlier today during a talk on Unicode.

No idea if it's just coincidental resonance though.


Yes, a few people from VMware made a PR (which I just merged).


Tangentially related to the original project intent:

Is there a place where common things in the dev world like this are accumulated? For example, a list of all countries or a list of the US states, for use with an HTML dropdown. I know there are various repos on GitHub that maintain these types of lists, such as English stop words, profanity word lists, etc., but is there a service that accumulates these in a familiar, structured API?


Look at Wikipedia's lists of things. For your particular examples:

https://en.wikipedia.org/wiki/List_of_sovereign_states

https://en.wikipedia.org/wiki/U.S._state

Some of them are quite meta, such as https://en.wikipedia.org/wiki/List_of_lists_of_lists

For a more structured source, Wikidata aims to be that, but I cannot comment on its completeness.


Oftentimes, instead of the list of US States, you actually want the list of US States and Territories:

https://en.wikipedia.org/wiki/List_of_states_and_territories...

For example, when the intent is "list of places where the USPS ships" or "list of state-level political jurisdictions where US residents live".


Let’s move this tidbit to a structured API of common knowledge! Dewey decimal for data: not just a generic search engine for datasets in different formats (like the recent Google datasets site), but a familiar, go-to resource.


Um...why not https://www.wikidata.org/ ?


Have you ever used wikidata? It's kind of a shitshow.


Yes! Surprisingly often I am unable to complete a form because there is no option in the State field for Washington, DC.


Structured, maintained API though, not general knowledge. I personally see an issue in that everyone has to accumulate their own stash of structured data for common knowledge, like (random examples): countries, zip codes, valid HTML5 element names, CSS properties, hex colors, common naming prefixes/suffixes/professional titles, etc. A growing list of work repeated by each dev team/company for really no reason. No complaint about this repo at all, just asking whether a solution exists.



> Corpora is a collection of small files. It is not meant to be an exhaustive source of anything: a list of resources should contain somewhere in the vicinity of 1000 items.


Thanks, will check it out.


Here's a platform-specific example: https://github.com/SmileyChris/django-countries/


Wikidata has a SPARQL API though


I've used faker [0] for stuff like this. I think it was originally a Perl package, and it has similar packages in other languages as well. I've used the Python implementation and enjoy it, along with its localization feature.

It looks like 1.0 was just released as well :D

[0] https://github.com/joke2k/faker/releases


SecLists (https://github.com/danielmiessler/SecLists) contains a wealth of security-related lists of this sort, including a useful section containing the most common passwords.


That's a cool idea. I've seen individual packages for things like US states and HTTP status codes but I don't think I've ever seen them all packaged together.


Would it essentially be like a graph with multiple nested nodes with different strands of info?


Somebody needed to find strings to test his/her app with, saw your repo, found it interesting and posted it here.

About the repo: nice job, I've used it a lot when testing sites/apps I did, good job on providing different formats too so it's easy to automate testing!


I imagine because it's useful and has a fun name. So when someone stumbles across it, they post it. I've seen it on here a number of times and I still upvote it...because it's useful and has a fun name.


It's a pretty useful list. I do wonder how many people actually end up having to rebuild their databases after running a test!


It sounds like the VMware comment is the most likely explanation, but I thought I would share how I learned of the project just yesterday. There was a HN post yesterday about https://sr.ht/ and in looking at that I noticed the project used a blacklist of usernames that I thought was cool, so when I took a look at that project it had a link to this repo.


The current spike might be due to a recent post [1] on the programming subreddit.

[1] https://www.reddit.com/r/programming/comments/9xla2j/naughty...


That Reddit post was made after this HN submission hit the top.


Thanks for this list! And I appreciate that the RTL naughty string contains Hebrew for the first line of Genesis 1. :-)


I would guess because it's a useful tool that gets shared whenever people think about these issues. It's a nice reminder that not everything needs to be regularly updated or promoted to be useful. :)


It's the kind of thing that sticks in your mind when you think of weird things going wrong with string input or Unicode rendering.


Because you have created something great and novel.

I stumbled on it for the first time and already saved it for future testing. Thanks :)


It's a great testing resource. Way more concrete than 'well just test all the different strings'


Have you gotten any interesting or offensive pull requests?


There's this recurring problem with certain strings which crash the Messages app in iOS, leaving a big hole in functionality and making iOS look pretty bad. I've pointed out that this is inexcusable for any language where you have exception handling. The standard reply to that on HN is to point out that you want the process to die once it's gone into undefined behavior, then downvote me.

This puts me in mind of interviews, where I point out to the candidate that their update routine would go into an infinite loop if there was a 2 node cycle in their data. So then they give me an if statement that detects only the 2 node loop. I've even then asked what would happen if there was a 3 node loop, and gotten a 2nd if statement for that as well.
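What I'm hoping to see instead is something along these lines: a Python sketch where a `seen` set handles cycles of any length. The function and parameter names are made up for illustration.

```python
def update_all(start, successors, update):
    """Apply `update` to every node reachable from `start`, visiting each once.

    `successors` maps a node to the nodes it points at. The `seen` set is
    what makes this terminate on cycles of any length.
    """
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already handled, so a cycle cannot loop forever
        seen.add(node)
        update(node)
        stack.extend(successors.get(node, ()))
    return seen


# A 3-node cycle: a -> b -> c -> a
graph = {"a": ["b"], "b": ["c"], "c": ["a"]}
visited = []
update_all("a", graph, visited.append)
print(visited)
```

No special case for 2 nodes, 3 nodes, or any other loop size is needed.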

Apps which might crash due to processing untrusted data should be reading that data from a queue. Then a 2nd process can monitor the 1st process, taking problematic data off the queue if necessary. This way the 1st process can die, but be restarted, and your smartphone OS doesn't have to look completely broken due to a primary function just dying, requiring the user to reboot.

I hope someone tells me Apple has already done this. It's been something approaching a decade, at this point.
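A rough Python sketch of the queue idea, under loud assumptions: the "renderer" is a throwaway child process that dies on a made-up magic substring, and for simplicity it is spawned once per message rather than kept alive and restarted. It says nothing about how iOS actually works.

```python
import subprocess
import sys
from collections import deque

# Child program: reads one message on stdin and "renders" it. The crash on a
# magic substring is a stand-in for a real rendering bug.
WORKER_SRC = r"""
import os, sys
msg = sys.stdin.buffer.read().decode("utf-8")
if "CRASH" in msg:
    os._exit(70)  # die abruptly, as a trapped process would
sys.stdout.write(str(len(msg)))
"""


def drain(queue):
    """Process messages in order; quarantine any message that kills the worker."""
    delivered, quarantined = [], []
    while queue:
        msg = queue[0]  # peek: the message stays queued while it is handled
        result = subprocess.run(
            [sys.executable, "-c", WORKER_SRC],
            input=msg.encode("utf-8"),
            capture_output=True,
        )
        queue.popleft()
        if result.returncode == 0:
            delivered.append(msg)
        else:
            # The worker died; set the message aside instead of retrying it.
            quarantined.append(msg)
    return delivered, quarantined


delivered, quarantined = drain(deque(["hello", "CRASH me", "שלום"]))
print(delivered, quarantined)
```

The supervisor never inspects the message contents. It only observes that the worker died while handling the head of the queue, so it works for strings nobody has seen yet.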


I have no knowledge of Apple's decisions here, and no experience with iOS, and you're generally correct that any sort of batch processing should be able to recover from bad input, but I'll try to make an argument as to why this isn't as simple as it sounds.

1. Creating a graceful degradation path is complicated and expensive, from a UX, product, and testing perspective - if you silently drop the message, that's much worse than crashing; if you insert some sort of tombstone "this message could not be parsed" into the UI, you have to figure out if that's something users will actually understand. These code paths are hard to exercise with normal integration or manual testing, and it's likely that over time they may stop working correctly.

2. Monitoring graceful degradation is more complicated than tracking crashes, especially for unforeseen issues like what you're describing. There's a real risk that if this periodically showed up with "failed to parse" messages, the actual issue would have remained undiscovered by Apple a lot longer.

3. Again, no familiarity with iOS, but on other mobile operating systems I've worked with there's a significant memory overhead to using multiple processes and IPCs this way. If your device is under memory pressure, the extra cost of a pipeline like this introduces new failure modes. First, your other process may be killed by the LMK (low-memory killer), which the sender process could interpret as being a failed message. Second, you may increase the amount of memory required to receive a text message in the background - this can directly affect critical high-memory use cases like taking a live video with the camera. There may just not be enough room to do both at the same time.

4. There's significant input processing that can't be done in another process, or meaningfully isolated from that of other messages - the best example being UI rendering of the text. If there's a magic string that causes view measurement to fail, that's extremely difficult to attribute to any specific piece of input - so adding this extra process to do validation won't really help you, since the "validated" string will fail later on.


1. Is probably a lot more desirable than the smartphone OS looking like trash, requiring a reboot. iOS has suffered epidemics of people trolling by sending out messages that crash processes on others' phones.

2. Since this has happened several times at this point, the point is moot. Since this is familiar territory, why not log it and have the system tell HQ something weird happened?

3. May have been true in the past. Very doubtful at this point.

4. Irrelevant. The process can have the UI, crash, then be restarted by the other process. There is no validation done by the other process. The 2nd process only detects that the 1st process has crashed, then removes the offending datum from the queue.


> 2. Since this has happened several times at this point, the point is moot. Since this is familiar territory, why not log it and have the system tell HQ something weird happened?

This has real privacy considerations. I would rather Apple not simply get a message sent to me forwarded to them. Is the trade off worth it in this case? That's not so easy to pick for all ~1/2 billion or so users.


> I would rather Apple not simply get a message sent to me forwarded to them. Is the trade off worth it in this case? That's not so easy to pick for all ~1/2 billion or so users.

All Apple would need to get is the notification that there was a crash. It could be left to the user if the crashed message gets forwarded.


What sort of useful information would that even provide? It would just show that the system crashed because of a text. You'd still have to send some sort of contextual information about the text, which itself would carry non-metadata information about the text (ie, components of the text's contents).


> What sort of useful information would that even provide?

If it's a widespread attack, then the traffic analysis would still be useful. A hex dump of the offending data could be sent.


> (ie, components of the text's contents)

Which may then crash the prompt asking if it can send the crash...


Just send the metadata and the hex dump.


You could let the user decide with buttons on the error dialogue. Lots of crash reporters use this pattern.


> The 2nd process only detects that the 1st process has crashed, then removes the offending datum from the queue.

Thereby hiding away the problem, diminishing the urgency of a fix while strongly favouring its exploitation. Something that's just a DoS in an ASLR environment escalates to a remotely exploitable hole where the attacker can iterate until they get the correct offset.

Unless a process is expected to crash or terminate by design, automatic respawning is not a good security practice.


Are you seriously making the argument that a program reading untrusted data off a network should crash if it can’t parse the data?

Programs should never, EVER crash. If you can’t decode/handle a Unicode code point, replace it with a question mark (or box etc) and carry on. And yes, a big-name program like iMessage absolutely should have unit tests for this.
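As a small illustration of the "replace it and carry on" approach, here is a Python sketch using the standard `errors="replace"` decoding mode. The helper name `decode_lossy` is made up; the sketch only covers the decoding step and says nothing about layout or rendering.

```python
def decode_lossy(raw: bytes) -> str:
    # Invalid byte sequences become U+FFFD (the replacement character)
    # instead of raising UnicodeDecodeError.
    return raw.decode("utf-8", errors="replace")


text = decode_lossy(b"caf\xc3\xa9 \xff ok")
print(text)
```

The valid two-byte sequence decodes normally and the stray `0xFF` byte turns into a single replacement character.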

I can’t believe I’m having to explain this.


I suggest you research font formats and the code required to render arbitrary Unicode code points with case folding, reflow, ligatures, stacking, bidi, combining marks, etc.

TrueType and OpenType are Turing-complete all by themselves.

There are nearly 140,000 Unicode code points; exhaustively testing the combinations is computationally infeasible.

It is the definition of a decidedly non-trivial problem. Furthermore layout passes often involve hundreds or even many thousands of string measurements.

Every platform has had numerous bugs, including exploitable ones, in font and string handling.


> Programs should never, EVER crash.

Oh yes they should. If, for example, the alternative is to continue in an undefined state, potentially corrupting user data. Not all error conditions are recoverable.


The right question to ask yourself is why this is the only alternative. It almost never is, if the program is properly designed, especially for a user-facing application.


As an example, the default reaction to accessing an invalid index in a Swift array is trapping. How would you solve this “properly”? Returning an optional (i.e. Maybe) would be too cumbersome and wouldn’t lead to safer code in practice.


The problem isn't with parsing the data.

The problem is with triggering pathological cases in the complicated beast that is font rendering. And rendering text is something that's so extremely common in apps, and so crucial, that it needs to be done in-process in order to avoid unacceptable overhead.


I see where you're going - and I like the queue idea - but wouldn't it be better for that second function to just monitor the queue - and pull data off the queue that is problematic?

Of course 'monitoring' is not exactly the same as 'processing' the actual data - so you'd have to know exactly what to pull off. Which would be just as easy to add that to the original process and just delete the unwanted data as you pull it out of the queue anyway.

In either event, you seem fairly bright - I probably don't follow. The main thing is that I believe system processes do monitor those other processes and restart them if they crash - I think the problem comes in when data overflows or buffers spill out into other processes' memory areas. (I am not a programmer, although I do try to be for fun.)


> but wouldn't it be better for that second function to just monitor the queue - and pull data off the queue that is problematic?

The most universal way to detect the problem is to catch the exception from the crash. However, the process that does that will be in the realm of undefined behavior, so it should be crashed.

> Of course 'monitoring' is not exactly the same as 'processing' the actual data - so you'd have to know exactly what to pull off.

Your scheme is limited to all the known strings. My scheme can respond to anything in the future which might crash the 1st process.


Why not give up on Unicode and retreat back to reliably parsed simplified ASCII?



I enjoy this sort of humour as much as the next nerd, but just so y'all know, someone close to me who has been through psychosis linked to that particular delusion could have any one of a number of responses to that sentence - from getting a little freaked out through to anxiety attack or in the worst case full-on relapse into psychosis.

I wouldn't remove it from the list, as this sort of thing adds a bit of character, fun and a sense of shared values to programming; what's life for if you can't have a laugh? But if you're ever making jokes of that ilk and someone in the room goes a bit ... quiet ... do gently check up on them.


There are so many things in the world that can set someone off in some way, I don't think it makes sense to worry too much. Your advice about gently checking up on people who go weirdly quiet makes sense in any situation, not just with something like this.


True that.


As someone who has also been up close with psychosis, I think your worry is overkill. Such a sentence will push no one into psychosis by itself. And anyone with budding psychosis can be triggered by anything. I would be more worried about TV news than random jokes in a git repo.


Well, I have witnessed these ideas causing relapse into anxiety attack for someone with pre-existing issues. How unlikely is it in the general population? Probably pretty unlikely for sure, but I bet it's not an isolated case. See my longer post below.


A text by Plato was instrumental in starting a psychosis for one of my family members. My point is that anything can act as a trigger. Psychosis is unreasonable. Even without any stimuli, anxiety can just come on its own and be undirected (as I have witnessed).


Sorry to hear about your family member. I hope they're doing better now.

The fact that Plato played a part in your case shows that while anything can act as a trigger, deep philosophical ideas do seem to be a common theme (2 out of 2 on this thread).


While Plato was indeed the trigger, there was nearly no connection between the psychotic ideations and the context of the text. Watching a cloud formation could have done the same.


Someone else I know who struggled with mental health, when I first met him, casually assumed I could move clouds around with my eyes just like he could.


It isn't delusion, because it could be true, it's just irrelevant. For all you know you're just a brain in a jar hooked up to simulated input, there's no way to prove otherwise. We have no choice but to accept that our senses provide us some approximation of reality.


If the brain in a vat gets perfectly simulated input, then indeed, there is no way to tell. Same thing if we live in a computer simulation.

However, if such a surrounding simulation 'errors', that can be noticed. Moreover, there is the rather pressing issue of 'being turned off'. It's a difficult (semantic?) discussion whether you can notice your world being deleted, but the thought of unpredictable all-encompassing extinction isn't quite comforting.

A better reason not to care, is the idea that there is nothing you can do about it. However, if we are being simulated, there is probably some observer who might react to our actions. Moreover, if we are in a computer simulation we might be able to trigger some 'bug'. So even that line of reasoning isn't clear cut.

In the end, life does seem to be easier when you accept reality. Even more so because any way we have to influence the 'outside' is so infeasible as to be useless.


In theory some extremely high energy particle physics experiments ought to be impossible to properly simulate, and so if those experiments produce anomalous data then that could expose a "glitch in the matrix".


How would you determine the difference between anomalous data as a result of the simulation and anomalous data as an indication that your model of physics is simply incomplete?

After all, you don't have data from a real reality to compare to.


Indeed, there's no guarantee that the real world follows the same laws of physics whatsoever. For all we know, it might allow hypercomputation.


That's all fine and dandy, but I think that, as a human cohabitating with other humans, we should do our best to look out for each other and try not to trigger people into psychotic states. When someone near you is experiencing an anxiety attack, it's not very helpful to tell them what they're experiencing shouldn't cause them anxiety.


Username checks out, I wouldn't go into mental health work if I were you ;)

Is it coincidence that some of the world's most famous philosophers went mad?


> Is it coincidence that some of the world's most famous philosophers went mad?

Yes. Some of <insert anything> went mad, that doesn't prove causation at all.


There is of course a plausible mechanism involved though, which is one of the ways we deduce causation from correlation.


Your friend experienced psychosis from being convinced they'd been in a coma? Genuinely curious.


Probably he didn't experience psychosis from reading that string, but I got chills reading it... I could imagine someone prone to believing stuff like this getting a little bit anxious.


Long story. Her most immediate trigger was actually a traumatic experience of childbirth. Postpartum/puerperal psychosis affects mothers in approx 1 in 1000 births, or around 1 in 300 where an emergency C section was performed which was the case here; C section is of course pretty routine but in this case was complicated by an ileus. But there were many latent contributing factors to the psychosis as well: adverse reaction to an antibiotic (metronidazole - hallucinations can be a side effect), traumatic childhood, to name just a few.

The most acute episode of psychosis happened in hospital and involved some pretty random behaviour accompanied by total loss of response to input. This did not last more than 10 minutes or so, but on return from that state she continued to behave in a way that pretty much resembled someone on a bad LSD trip. This went on for about a week, during which she hardly slept and was never sure whether awake or not, and reported many types of delusional thinking, e.g. that reality was not as it seemed (think inception/matrix/comment in the OP here). Also paranoia that family, friends or strangers were staring or wanted to harm her or her child. Also occasional thoughts of self harm as a means of escape from this awful state of being.

This was all pretty much solved by Olanzapine which she still takes 1.5 years later although dosage is now tapering down to almost nothing. Also CBT. At some point the diagnosis was augmented by one of "O-only OCD" resulting in a second prescription for Sertraline. But the condition is very much under control; she is back to work part time, in a role that includes responsibility for children; her co workers know the score and are alert to possible issues but there haven't been any that she couldn't manage herself. Occasionally various thoughts disturb her but she has learned to handle them appropriately.

Mental health issues are often exacerbated by stigma, and this is in part because people (for understandable reasons) don't rush to share their experiences with others, which is why I've taken the time to write this post. I would recommend anyone to familiarize themselves with the material published by Mind and others, partly because it's interesting, partly because spotting the signs - and understanding the differences between e.g. psychosis, schizophrenia, bipolar, OCD, etc (even if the science behind them is far from well understood) - might help someone you know one day.

https://www.mind.org.uk/information-support/types-of-mental-....

Edit: interesting addendum, the matrix/inception fear had been there long before the psychosis. I remember years ago telling her I had lost The Game ( https://en.wikipedia.org/wiki/The_Game_(mind_game) ) which she hadn't encountered before. I explained that "the object of the game is not to remember the game" ... she took it to mean we were all in some sort of simulation and got kinda scared even then.


Crazy. Not the first time I hear about this. A friend of mine told me about someone's emergency C-section where they used ketamine to put her under; it sent her into some kind of psychosis that required quite a lot of psychological help to get over.


Great! Now I want my last words to be "I lost the game".


It gave me a kind of uncomfortable laugh, I had to check my faculties in the same way I might ask myself if I'm dreaming right now. I see what you're saying.


[flagged]


didn't know not caring about the mentally ill was a requirement for being fun at parties


Well, it's a lot more fun than having to call an ambulance for a sudden mental break


Yeah, what a weirdo. Caring about other people and shit. Despicable!


Well, when I read the title of the post, I thought, "there's no way I'm opening that up in my browser". But you managed to pique my curiosity enough that I went ahead and did it anyway. I hope nothing bad happens.


IMHO, the 'reading' gives it away. How would the doctors know it would appear in written form? Could be part of the technique of course. Still, I would change it to the more generic 'receive'.


This is how I would rationalize that point:

Clearly, the doctors did not intend for that specific message to appear specifically on my screen. Rather, their treatment will make my brain "fight back" against the delusion, and reality will start to creep in unexpectedly. Maybe someone will stop me on the street and tell me "you need to wake up". Maybe I'll hear a new song titled "you are in a coma". Maybe I'll find a hospital gown on my wardrobe. Or maybe a random string will appear on some repository.

Note: I also realize how much you can eff up someone's day by doing all of this. If you are one of those Youtube pranksters, don't do that.


Yes that's what gives it away ?!

Going majorly off topic, I have an internal voice; I don't have an internal text. I assume this is normal, and not diagnostic of being in a coma.


What if the "new technique" is to identify neural nets in the visual cortex that encode the cover of your favorite book, and then alter that visual memory so as to replace the title with this message?


I'd assume them to at least vary it a little bit. If they can printf some text into my comatose brain, they can also make it dynamic where it shouldn't be.


What would happen if you only thought you liked that book because it mentioned you?

I mean, do you think computer scientists named a sub collection of instructions after you, or do you think it's a hint to wake up? Wake up.... Wake up subroutine.


What if that is happening to all of us when reading news headlines?


See, if that happened then you'd have good reason to worry.


Makes sense. I thought I had entered the weirdest timeline around then, and it's only got worse!


Shit it's not working!



Haͭhͥa̾, yeͥs, ͣgͩreaͫt humour!


I liked this one:

The quick brown fox jumps over the lazy dog

𝐓𝐡𝐞 𝐪𝐮𝐢𝐜𝐤 𝐛𝐫𝐨𝐰𝐧 𝐟𝐨𝐱 𝐣𝐮𝐦𝐩𝐬 𝐨𝐯𝐞𝐫 𝐭𝐡𝐞 𝐥𝐚𝐳𝐲 𝐝𝐨𝐠

𝕿𝖍𝖊 𝖖𝖚𝖎𝖈𝖐 𝖇𝖗𝖔𝖜𝖓 𝖋𝖔𝖝 𝖏𝖚𝖒𝖕𝖘 𝖔𝖛𝖊𝖗 𝖙𝖍𝖊 𝖑𝖆𝖟𝖞 𝖉𝖔𝖌

𝑻𝒉𝒆 𝒒𝒖𝒊𝒄𝒌 𝒃𝒓𝒐𝒘𝒏 𝒇𝒐𝒙 𝒋𝒖𝒎𝒑𝒔 𝒐𝒗𝒆𝒓 𝒕𝒉𝒆 𝒍𝒂𝒛𝒚 𝒅𝒐𝒈

𝓣𝓱𝓮 𝓺𝓾𝓲𝓬𝓴 𝓫𝓻𝓸𝔀𝓷 𝓯𝓸𝔁 𝓳𝓾𝓶𝓹𝓼 𝓸𝓿𝓮𝓻 𝓽𝓱𝓮 𝓵𝓪𝔃𝔂 𝓭𝓸𝓰

𝕋𝕙𝕖 𝕢𝕦𝕚𝕔𝕜 𝕓𝕣𝕠𝕨𝕟 𝕗𝕠𝕩 𝕛𝕦𝕞𝕡𝕤 𝕠𝕧𝕖𝕣 𝕥𝕙𝕖 𝕝𝕒𝕫𝕪 𝕕𝕠𝕘

𝚃𝚑𝚎 𝚚𝚞𝚒𝚌𝚔 𝚋𝚛𝚘𝚠𝚗 𝚏𝚘𝚡 𝚓𝚞𝚖𝚙𝚜 𝚘𝚟𝚎𝚛 𝚝𝚑𝚎 𝚕𝚊𝚣𝚢 𝚍𝚘𝚐

⒯⒣⒠ ⒬⒰⒤⒞⒦ ⒝⒭⒪⒲⒩ ⒡⒪⒳ ⒥⒰⒨⒫⒮ ⒪⒱⒠⒭ ⒯⒣⒠ ⒧⒜⒵⒴ ⒟⒪⒢


Google understands almost all of them (it misses just the last one) and the first result is https://en.wikipedia.org/wiki/The_quick_brown_fox_jumps_over...

Bing doesn't understand any; if you search for one, the first result is the GitHub repo with these phrases.


So did I. But I'd appreciate if someone could explain how it works.


From the outset Unicode's goal (more so than ISO 10646 though now they're one and the same) was to unify all existing character sets, so you'd only need one.

Necessarily then, there should not be other sets that encode things you can't in Unicode, since then you can't displace those with Unicode.

So, particularly in the early life of Unicode, the goal was to collect stuff that already exists and add it to Unicode. (These days we're finished with that and most new work is on adding things that weren't previously in any character set.)

Two controversial things were done, at opposite ends of the spectrum, during this period of consolidation:

What you're seeing here is the addition of copies of the entire Latin alphabet, but with some particular property that Latin users would not really consider part of the character, such as "bold" or "italic", but which _was_ preserved in some character set being used somewhere. Without this choice, if we converted a text file encoded in a way that distinguished bold and italic characters, we'd lose that bold/italic and it might be significant. This would be like when you get a black & white photocopy of a sheet that says

"Ignore any text below shown in red"

Um, but none of this text is red? Oh. Probably some of it was before it was photocopied. Oops.

At the far end of the spectrum, a process called CJK unification took place in which scholars of the languages using characters from the Han ("Chinese") writing system decided that although say, a Japanese character set and a Chinese character set both had a particular character, and the Chinese and Japanese would not draw this character the same way, actually in some linguistic sense it's the same character (and in many cases the visual differences are quite small) and so Unicode should not encode both separately.

There's a coherent technical argument for why both these types of decisions made sense, but they were nonetheless controversial.

You should not use weird characters like italic Latin letters in new documents, but you also should not transform these characters without warning when processing an existing document as you may lose important meaning.


Thanks for the write-up.

Both had always bothered me deeply, but I'd never stopped to think that they're also essentially opposed in philosophy to each other. So now that I'm aware of that, I'm triply annoyed :S


One of the reasons for these sets is mathematics: ℜ <> ℝ in a math text. (And BTW the math symbols ℂℍℕℙℚℝ in the double-struck set are "out of sequence", which can be a nasty surprise if you do naive incrementation.)


And ℤ. The reason these double-struck symbols are in a weird place (U+2100-214f, separate from the rest in U+1d400-1d7ff) is because they all have commonly used special meanings in mathematics -- they're used to represent the sets of all numbers of various types. ℂ = complex numbers, ℍ = quaternions, ℕ = natural numbers, ℚ = rational numbers, ℝ = real numbers, ℤ = integers.
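To see the "out of sequence" surprise concretely, here is a short Python sketch using the standard `unicodedata` module. The helper `naive_double_struck` is made up for illustration; it just offsets an ASCII capital from U+1D538 (double-struck capital A).

```python
import unicodedata


def naive_double_struck(letter: str) -> str:
    """Offset an ASCII capital into the double-struck run starting at U+1D538."""
    return chr(0x1D538 + ord(letter) - ord("A"))


for letter in "ABCR":
    ch = naive_double_struck(letter)
    # name() returns the supplied default (None) for unassigned code points.
    print(letter, f"U+{ord(ch):04X}", unicodedata.name(ch, None))

# The assigned double-struck C lives in the Letterlike Symbols block instead.
print(unicodedata.name("\u2102"))
```

A and B land on named characters, while the slots where C and R "should" be are unassigned holes, because those letters were already encoded in the U+2100 block.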


There are three slightly different things going on.

The first line, The quick brown fox, originates with east-Asian character based terminals, on which ideographic characters occupied twice the space of alphabetic characters, and there was also a desire to have Latin characters that were also double width. See https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms

The middle lines are included as mathematical symbols. The justification is that 𝑖 is a mathematical symbol that has its own independent meaning, which only coincidentally looks like italicized i. (I think this is silly, and naturally leads to a bloody mess as people misuse these symbols as letters, and in this case there is no backwards-compatibility excuse.) https://en.wikipedia.org/wiki/Mathematical_Alphanumeric_Symb...

The final line, like the first, is apparently present for compatibility with pre-Unicode east-Asian character sets. https://en.wikipedia.org/wiki/Enclosed_Alphanumerics


For some reason Unicode includes a few characters in a different "font".


Interesting. This is the first thing I thought, but when I fed "lazy" into Google, it happily accepted and displayed the results, so I thought there might be something else. But the text editors that I tried indeed don't match the characters when I search for them.


google is using unicode equivalence[1] to remap back to "standard" latin characters. this is important because, e.g., professional type-setting software may replace two adjacent "f" characters with a double-F ligature "ff" depending on kerning. without unicode equivalence, google would fail on a lot of copy-paste queries.

[1] https://en.wikipedia.org/wiki/Unicode_equivalence
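A quick way to try compatibility normalization yourself is Python's standard `unicodedata.normalize`. Whether Google uses NFKC specifically is an assumption on my part; the sketch only shows what that one normalization form does to a few of the strings from this thread.

```python
import unicodedata

samples = {
    "bold": "𝐥𝐚𝐳𝐲",
    "fraktur": "𝖑𝖆𝖟𝖞",
    "double-struck": "𝕝𝕒𝕫𝕪",
    "parenthesized": "⒧⒜⒵⒴",
    "ligature": "\ufb00",  # LATIN SMALL LIGATURE FF
}

for label, s in samples.items():
    # NFKC folds compatibility characters to their plain equivalents.
    print(label, unicodedata.normalize("NFKC", s))
```

The styled alphabets fold to plain "lazy" and the ligature to "ff", but the parenthesized letters fold to "(l)(a)(z)(y)", parentheses included, which would not match a plain-text search for "lazy".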


They are just different Unicode characters in the same font.


+++ATH0 brings back memories. In my early days of IRC, writing that in crowded channels used to wreak havoc. It blows my mind that modems processed that string no matter the context.


Modems have no concept of context, and out of band signalling was optional.

It was _supposed_ to be +++[wait 0.5s]AT... but that pause was patented by Hayes, so to avoid patent issues, many cloners didn't require it. At least, that's the story I heard.
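The difference the pause makes can be shown with a toy simulation. This is a sketch under stated assumptions, not real modem firmware: `escape_detected` and `stream` are invented names, the input is a list of (timestamp, character) pairs, and the 1.0 second guard value is only illustrative.

```python
from typing import List, Tuple

Event = Tuple[float, str]  # (arrival time in seconds, character)


def escape_detected(events: List[Event], guard: float) -> bool:
    """True if '+++' arrives with at least `guard` seconds of silence
    before the first '+' and after the last '+'.

    guard=0 models a modem that skips the pause check entirely.
    """
    for i in range(len(events) - 2):
        t0, c0 = events[i]
        _, c1 = events[i + 1]
        t2, c2 = events[i + 2]
        if (c0, c1, c2) != ("+", "+", "+"):
            continue
        quiet_before = i == 0 or t0 - events[i - 1][0] >= guard
        quiet_after = i + 3 >= len(events) or events[i + 3][0] - t2 >= guard
        if quiet_before and quiet_after:
            return True
    return False


def stream(text: str, start: float = 0.0, step: float = 0.001) -> List[Event]:
    """Characters arriving back to back, `step` seconds apart."""
    return [(start + k * step, ch) for k, ch in enumerate(text)]


# Someone typing the string into a chat: no pauses around '+++'.
chat = stream("hi +++ATH0\r")
# A deliberate escape: data, pause, '+++', pause, then the command.
deliberate = stream("data") + stream("+++", start=2.0) + stream("ATH0\r", start=4.0)

print(escape_detected(chat, guard=1.0))        # pause required
print(escape_detected(chat, guard=0.0))        # pause check skipped
print(escape_detected(deliberate, guard=1.0))
```

With the guard enforced, the chat line passes through as data; with the guard set to zero, the same line is treated as an escape, which is the IRC prank described above.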


It seems strange they didn't use DTR since RTS/CTS are usually used for hardware flow control. I can see the point of DSR, which tells the computer the modem exists... but there is no point in the modem looking for the computer. In theory the PC could have asserted DTR whenever it wanted to activate the control channel.


What did it do?



In a similar vein, I always wondered how much trouble people born at midnight on 01/01/1970 have with websites and software in general.
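The kind of bug being imagined here would look something like the sketch below, where a Unix timestamp of 0 is mistaken for "no value". Both function names are hypothetical, and this is speculation about how such a bug could arise, not a report of any real site doing it.

```python
from datetime import datetime, timezone
from typing import Optional


def buggy_has_birthdate(ts: Optional[int]) -> bool:
    """Buggy: a truthiness check treats timestamp 0 like a missing value."""
    return bool(ts)


def has_birthdate(ts: Optional[int]) -> bool:
    """Safer: only None means 'not provided'."""
    return ts is not None


# Midnight UTC on 1970-01-01 is timestamp 0.
epoch_midnight = int(datetime(1970, 1, 1, tzinfo=timezone.utc).timestamp())

print(epoch_midnight)
print(buggy_has_birthdate(epoch_midnight))
print(has_birthdate(epoch_midnight))
```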


I always enter that date into systems that ask for my birth date (unless there's actually good reason for them to know it). I have never experienced any problems.


Never have I been asked for the time of day I was born (and I don't know the answer).


Probably as often as they’re asked for the time of day when they were born. So likely never.


Implementing a solution where bad strings are not allowed is always fun. Had to create a gig long list of warranty registration codes a while back where it wasn't just the swear words that were naughty. There were also a few product specific words as you can't have 'FAIL' or 'SNAP' as part of a warranty registration code.

Other words that had to go included 'Jew'. There is nothing 'naughty' about the word; it just didn't look good in the codes, so it had to be added to the big, long list of mostly 'naughty' words.

We also used a reduced dictionary so that codes could be given out over the phone without people getting it wrong, so no '1, l, L' problem or '0, o, O' problem.

I didn't find a handy library for our exact requirements and, as a consequence, I would advise people to roll their own code for such applications.

The fun part was making my non-technical manager in charge of the big, long list of rude words. I think that his additions were the only contribution to 'code' that he ever made.
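For anyone rolling their own, the general shape is roughly the following. This is a sketch, not the code from that project: the alphabet, the two-word blocklist, and the function names are all assumptions chosen for illustration.

```python
import secrets

# Assumed alphabet: uppercase letters and digits minus look-alikes (0/O, 1/I/L).
ALPHABET = "ABCDEFGHJKMNPQRSTUVWXYZ23456789"

# Tiny stand-in blocklist; a real one would be long and maintained by hand.
# 'FAIL' cannot be formed from this alphabet (no I or L) but stays listed
# so the check remains correct if the alphabet is ever widened.
BLOCKLIST = {"FAIL", "SNAP"}


def is_clean(code: str) -> bool:
    """True if no blocklisted word appears anywhere in the code."""
    return not any(word in code for word in BLOCKLIST)


def generate_code(length: int = 12) -> str:
    """Draw random codes until one passes the blocklist check."""
    while True:
        code = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if is_clean(code):
            return code


print(generate_code())
```

Rejection sampling keeps the generator simple: bad codes are rare, so the loop almost always exits on the first draw.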


That's actually really interesting. Something you'd never think about unless you've actually worked on a similar application for a rather large system where a lot of long codes had to be generated.


I had a project like that once. I decided to use a chunk of a UUID in the hope of not getting inappropriate words in the codes. Of course, as soon as we started testing, we promptly got codes with DEAD, FAD, etc. in them.
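The reason is that hex digits include A through F, which is enough to spell quite a few words. A small sketch of a check for that; `HEX_WORDS` is a stand-in list and `hex_words_in` is a made-up helper:

```python
import uuid

# Words spellable from hex digits alone; a short stand-in list for illustration.
HEX_WORDS = ("DEAD", "FAD", "BAD", "BEEF", "FACE")


def hex_words_in(chunk: str) -> list:
    """Return the listed words that occur in an uppercase view of chunk."""
    upper = chunk.upper()
    return [w for w in HEX_WORDS if w in upper]


chunk = uuid.uuid4().hex[:8]  # random, so the result varies from run to run
print(chunk, hex_words_in(chunk))
print(hex_words_in("00dead11"))
```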



A clbuttic example.
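For anyone who hasn't seen where that word comes from, it is what a blind substring replacement produces. A minimal sketch, with hypothetical function names, comparing it against a whole-word match:

```python
import re


def naive_filter(text: str) -> str:
    """Blind substring replacement: the source of 'clbuttic'."""
    return text.replace("ass", "butt")


def word_filter(text: str) -> str:
    """Replace only whole-word matches by anchoring on regex word boundaries."""
    return re.sub(r"\bass\b", "butt", text)


print(naive_filter("a classic example"))
print(word_filter("a classic example"))
```

Word boundaries fix this particular case, though they are not a complete answer to filtering in general.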


I'll never look at basements the same.


"If you're reading this, you've been in a coma for almost 20 years now. We're trying a new technique. We don't know where this message will end up in your dream, but we hope it works. Please wake up, we miss you."


This is really useful for security testing, where unexpected input could have security implications.

There is a similar project, which I think is better organized and has more lists to play with:

https://github.com/danielmiessler/SecLists


It looks like the master file is blns.txt, while the other files are generated from this text file (correct me if I'm wrong). So the strings are separated by newline characters and there is no way to escape anything in the strings. So a nasty string that trips up blns.txt and the tools around it would be "\n".

I think this is unfortunate since strings with newlines in them could certainly trip up some bash scripts. Filenames on Linux can contain newlines for example.
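One way around this, sketched below, is to keep the one-entry-per-line layout but store each entry as a JSON string literal, so an embedded newline is written as the two characters `\n`. This is only an illustration of the idea; `dump_lines` and `load_lines` are made-up names and nothing here describes how the repo actually generates its files.

```python
import json
from typing import List


def dump_lines(strings: List[str]) -> str:
    """Serialize one string per line; JSON escapes embedded newlines."""
    return "\n".join(json.dumps(s, ensure_ascii=False) for s in strings)


def load_lines(blob: str) -> List[str]:
    """Inverse of dump_lines.

    Splits on '\\n' only: str.splitlines() would also break on characters
    such as U+2028, which json.dumps leaves unescaped here.
    """
    return [json.loads(line) for line in blob.split("\n") if line]


naughty = ["line one\nline two", "", "tab\there", "plain"]
blob = dump_lines(naughty)
print(blob)
print(load_lines(blob) == naughty)
```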


I just had an issue today with an internal program; the error was "unclosed parenthesis". A string with an unclosed parenthesis would give you this error.

In our case the string that had the issue should have been machine generated but ended up being assigned user input... Is that kind of string also helpful in a list like this?
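One common way this kind of error shows up is user input being compiled as a regular expression. That is an assumption; the comment above doesn't say what the program was doing. A sketch in Python, with `find_literal` as a hypothetical helper:

```python
import re


def find_literal(needle: str, haystack: str) -> bool:
    """Search for needle as literal text by escaping regex metacharacters."""
    return re.search(re.escape(needle), haystack) is not None


user_input = "price (USD"  # unbalanced parenthesis

try:
    re.search(user_input, "price (USD) 10")
except re.error as exc:
    print("unescaped pattern failed:", exc)

print(find_literal(user_input, "price (USD) 10"))
```

If that is the failure mode, a lone "(" is a perfectly reasonable entry for a list like this.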


There's also a great tool to scrub bad symbols from your data: https://www.polydesmida.info/cookbook/gremlins.html


The Twitter @glitchr_ account is also a great source of bizarre Unicode strings for testing, especially with UIs.

https://twitter.com/glitchr_


I'm curious to know what it is about each of the strings that makes it naughty. A comment next to each one describing what it's supposed to be testing would be great.


https://github.com/minimaxir/big-list-of-naughty-strings/com... is a fine commit message, don't you agree?


And this is the language we're meant to reinvent apps with. Oh my.


There's a reason we're moving toward webassembly. This isn't it, but there's a reason :)



it's in the code/comments


Am I missing something or do none of these strings have newlines in them (because they're defined in a newline delimited file)?

EDIT: Yup, there's an issue for that https://github.com/minimaxir/big-list-of-naughty-strings/iss... It's a little ironic BLNS can't handle certain strings :)


Shouldn't the SQL injection test be something other than `DROP TABLES`? I mean, this is meant for testing for weakness, not exploiting it.
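One way to test with a destructive-looking payload without risking anything is to run it against a throwaway in-memory database. A minimal sketch with Python's `sqlite3`; the table, the payload string, and the variable names are invented for illustration and are not taken from the list:

```python
import sqlite3

# Hypothetical payload in the style of the classic injection joke.
NAUGHTY = "Robert'); DROP TABLE users; --"

conn = sqlite3.connect(":memory:")  # throwaway database, nothing to lose
conn.execute("CREATE TABLE users (name TEXT)")

# Placeholder binding: the driver passes the value as data, never as SQL.
conn.execute("INSERT INTO users (name) VALUES (?)", (NAUGHTY,))

stored = conn.execute("SELECT name FROM users").fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

print(stored == NAUGHTY, row_count)
```

If the table survives and the string round-trips unchanged, the code under test handled it as data.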


> Strings which may cause human to reinterpret worldview

I laugh every time I see this one. Which is just about every year on HN.


I actually did use this for unit testing once at an old job.

I can't for the life of me remember why, or even what I was testing. I honestly think it was more to amuse myself than for an actual practical reason.


> medieval erection of parapets

lordy


Once, I copied a back-space and I don't know how.


there's an ASCII code that means "backspace"

https://en.m.wikipedia.org/wiki/Backspace#Computers
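In Python that character is `"\b"`, code 8. The sketch below models a destructive backspace the way a line editor might apply it; real terminals often just move the cursor left, so treat `apply_backspaces` as an illustrative helper rather than a description of any particular program.

```python
BACKSPACE = "\b"  # ASCII code 8 (0x08)


def apply_backspaces(text: str) -> str:
    """Each backspace removes the previous kept character, if there is one."""
    out = []
    for ch in text:
        if ch == BACKSPACE:
            if out:
                out.pop()
        else:
            out.append(ch)
    return "".join(out)


print(ord(BACKSPACE))
print(apply_backspaces("abc\b\bd"))
```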


Why do strings like “dW5kZWZpbmVk” cause problems?


    $ echo -n 'dW5kZWZpbmVk' | base64 --decode
    undefined

Many JavaScript-based applications break when the user input is the string “undefined”.

All of the entries in this repository are encoded using Base64, for obvious reasons.

If the reasons are not obvious, read the header of the repository here:

> Also, do not send a null character (U+0000) string, as it changes the file format on GitHub to binary and renders it unreadable in pull requests.

To prevent unintentional breaks of GitHub’s UI, the author decided to encode the strings.
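The same decoding in Python, as a small sketch; `decode_entry` and `encode_entry` are hypothetical helper names:

```python
import base64


def decode_entry(encoded: str) -> str:
    """Decode one Base64 entry into the UTF-8 string it represents."""
    return base64.b64decode(encoded).decode("utf-8")


def encode_entry(text: str) -> str:
    """Encode a string as Base64 so it survives as plain ASCII text."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


print(decode_entry("dW5kZWZpbmVk"))
# A null character becomes harmless ASCII once encoded.
print(encode_entry("\x00"))
```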


That doesn't look like anything to me...


That's the base64 version of the file.



