...can someone explain how the repo keeps resurfacing? I haven’t promoted it in a long time. (Looking at the repo traffic, it recently spiked on the 6th, but nothing since then.)
Since you are here. Thanks for making this. I recently used it to prove to a client that the api I delivered could take any content they cared to throw at it. They were especially impressed considering they were coming from a 35 year old system that only allowed ASCII.
The BLNS allowed me to prove it and I hooked it into our integration and fuzz tests which managed to shake out a few bugs.
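For anyone wanting to do the same, here is a minimal sketch of feeding a blns.txt-style list into a round-trip check. It assumes the text format where blank lines and lines starting with `#` are comments; `load_naughty_strings`, `check_round_trip` and `ascii_only` are hypothetical names made up for this example, not part of the repo.

```python
def load_naughty_strings(text):
    """Parse blns.txt-style text: one string per line, skipping blank
    lines and comment lines that start with '#'."""
    # split("\n") on purpose: str.splitlines() would also split on other
    # Unicode line separators that may appear inside a test string.
    return [line for line in text.split("\n")
            if line and not line.startswith("#")]


def check_round_trip(handler, strings):
    """Return the strings that `handler` does not hand back unchanged."""
    failures = []
    for s in strings:
        try:
            if handler(s) != s:
                failures.append(s)
        except Exception:
            failures.append(s)
    return failures


def ascii_only(s):
    """Hypothetical stand-in for a legacy system that only stores ASCII."""
    return s.encode("ascii", errors="replace").decode("ascii")


sample = "#\tReserved Strings\n\nundefined\nnull\n\n#\tUnicode\n\n\u03a9\u2248\u00e7\u221a\u222b\n"
strings = load_naughty_strings(sample)
failures = check_round_trip(ascii_only, strings)
```

In a real test suite the handler would be a call into your own API (store the string, read it back), and the list would come from the actual file rather than the inline sample.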
Tangentially related to the original project intent;
Is there a place where common things in the dev world like this are accumulated? For example, a list of all countries or list of the US states, for use with an HTML dropdown. I know there are various repos on Github that maintain these types of lists, such as English stop words, profanity word lists etc, but is there a service that accumulates these in a familiar, structured api?
Let’s move this tidbit to a structured API of common knowledge! Dewey decimal for data, not just a generic search engine for datasets in different formats (like the recent Google datasets site), but a familiar, go-to resource.
Structured, maintained API though, not general knowledge. I personally see an issue that someone has to accumulate their own stash of structured data for common knowledge (random examples) like: countries, zip codes, valid HTML5 element names, css properties, hex colors, common naming prefix/suffixes/professional titles, etc. A growing list of work repeated by each dev team/company for really no reason. No complaint about this repo, at all, just seeking if a solution exists.
> Corpora is a collection of small files. It is not meant to be an exhaustive source of anything: a list of resources should contain somewhere in the vicinity of 1000 items.
I've used faker [0] for stuff like this. I think it was originally a Perl package, and it has similar packages in other languages as well. I've used the Python implementation and enjoy it, along with its localization feature.
SecLists (https://github.com/danielmiessler/SecLists) contains a wealth of security-related lists of this sort, including a useful section containing the most common passwords.
That's a cool idea. I've seen individual packages for things like US states and HTTP status codes but I don't think I've ever seen them all packaged together.
Somebody needed to find strings to test his/her app with, saw your repo, found it interesting and posted it here.
About the repo: nice job, I've used it a lot when testing sites/apps I did, good job on providing different formats too so it's easy to automate testing!
I imagine because it's useful and has a fun name. So when someone stumbles across it, they post it. I've seen it on here a number of times and I still upvote it...because it's useful and has a fun name.
It sounds like the VMWare comment is most likely, but I thought I would share how I learned of the project just yesterday. There was a HN post yesterday about https://sr.ht/ and in looking at that I noticed the project used a blacklist of usernames that I thought was cool, so when I took a look at that project it had a link to this repo.
I would guess because it's a useful tool that gets shared whenever people think about these issues. It's a nice reminder that not everything needs to be regularly updated or promoted to be useful. :)
There's this recurring problem with certain strings which crash the Messages app in iOS, leaving a big hole in functionality and making iOS look pretty bad. I've pointed out that this is inexcusable for any language where you have exception handling. The standard reply to that on HN is to point out that you want the process to die once it's gone into undefined behavior, then downvote me.
This puts me in mind of interviews, where I point out to the candidate that their update routine would go into an infinite loop if there was a 2 node cycle in their data. So then they give me an if statement that detects only the 2 node loop. I've even then asked what would happen if there was a 3 node loop, and gotten a 2nd if statement for that as well.
Apps which might crash due to processing untrusted data should be reading that data from a queue. Then a 2nd process can monitor the 1st process, taking problematic data off the queue if necessary. This way the 1st process can die, but be restarted, and your smartphone OS doesn't have to look completely broken due to a primary function just dying, requiring the user to reboot.
I hope someone tells me Apple has already done this. It's been something approaching a decade, at this point.
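A rough sketch of the queue-plus-monitor idea described above, under loud assumptions: the worker is a throwaway child process, the "crash" is simulated with a made-up trigger string, and a fresh worker is started per message to keep the example short. `WORKER` and `drain` are hypothetical names. It says nothing about how iOS actually works.

```python
import subprocess
import sys
from collections import deque

# Hypothetical worker: "renders" one message read from stdin and simulates a
# hard crash (no exception handling, no cleanup) on a made-up trigger string.
WORKER = r"""
import os, sys
msg = sys.stdin.buffer.read().decode("utf-8")
if "CRASH" in msg:
    os._exit(70)  # stand-in for a segfault or trap
sys.stdout.write(str(len(msg)))
"""


def drain(queue):
    """Process every queued message; set aside any message whose worker dies.

    The supervisor never parses the message itself. It only looks at the
    worker's exit status, so it also catches crashes nobody anticipated.
    """
    delivered, quarantined = [], []
    while queue:
        msg = queue[0]  # peek: the message stays queued while the worker runs
        result = subprocess.run(
            [sys.executable, "-c", WORKER],
            input=msg.encode("utf-8"),
            capture_output=True,
        )
        queue.popleft()
        if result.returncode == 0:
            delivered.append(msg)
        else:
            quarantined.append(msg)  # offending datum removed from the queue
    return delivered, quarantined


delivered, quarantined = drain(deque(["hello", "xxCRASHxx", "world"]))
```

A long-lived worker that is restarted only after a crash would be closer to the description, at the cost of more bookkeeping about which message was in flight when it died.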
I have no knowledge of apple's decisions here, and no experience with iOS, and you're generally correct that any sort of batch processing should be able to recover from bad input, but I'll try to make an argument as to why this isn't as simple as it sounds.
1. Creating a graceful degradation path is complicated and expensive, from a UX, product, and testing perspective - if you silently drop the message, that's much worse than crashing; if you insert some sort of tombstone "this message could not be parsed" into the UI, you have to figure out if that's something users will actually understand. These code paths are hard to exercise with normal integration or manual testing, and it's likely that over time they may stop working correctly.
2. Monitoring graceful degradation is more complicated than tracking crashes, especially for unforeseen issues like what you're describing. There's a real risk that if this periodically showed up with "failed to parse" messages, the actual issue would have remained undiscovered by Apple a lot longer.
3. Again, no familiarity with iOS, but on other mobile operating systems I've worked with there's a significant memory overhead to using multiple processes and IPCs this way. If your device is under memory pressure, the extra cost of a pipeline like this introduces new failure modes. First, your other process may be killed by the LMK, which the sender process could interpret as being a failed message. Second, you may increase the amount of memory required to receive a text message in the background - this can directly affect critical high-memory use cases like taking a live video with the camera. There may just not be enough room to do both at the same time.
4. There's significant input processing that can't be done in another process, or meaningfully isolated from that of other messages - the best example being UI rendering of the text. If there's a magic string that causes view measurement to fail, that's extremely difficult to attribute to any specific piece of input - so adding this extra process to do validation won't really help you, since the "validated" string will fail later on.
1. Is probably a lot more desirable than the smartphone OS looking like trash, requiring a reboot. iOS has suffered epidemics of people trolling by sending out messages that crash processes on others' phones.
2. Since this has happened several times at this point, the point is moot. Since this is familiar territory, why not log it and have the system tell HQ something weird happened?
3. May have been true in the past. Very doubtful at this point.
4. Irrelevant. The process can have the UI, crash, then be restarted by the other process. There is no validation done by the other process. The 2nd process only detects that the 1st process has crashed, then removes the offending datum from the queue.
> 2. Since this has happened several times at this point, the point is moot. Since this is familiar territory, why not log it and have the system tell HQ something weird happened?
This has real privacy considerations. I would rather Apple not simply get a message sent to me forwarded to them. Is the trade off worth it in this case? That's not so easy to pick for all ~1/2 billion or so users.
> I would rather Apple not simply get a message sent to me forwarded to them. Is the trade off worth it in this case? That's not so easy to pick for all ~1/2 billion or so users.
All Apple would need to get is the notification that there was a crash. It could be left to the user if the crashed message gets forwarded.
What sort of useful information would that even provide? It would just show that the system crashed because of a text. You'd still have to send some sort of contextual information about the text, which itself would carry non-metadata information about the text (ie, components of the text's contents).
> The 2nd process only detects that the 1st process has crashed, then removes the offending datum from the queue.
Thereby hiding away the problem, diminishing the urgency of a fix while strongly favouring its exploitation. Something that's just a DoS in an ASLR environment escalates to a remotely exploitable hole where the attacker can iterate until they get the correct offset.
Unless a process is expected to crash or terminate by design, automatic respawning is not a good security practice.
Are you seriously making the argument that a program reading untrusted data off a network should crash if it can’t parse the data?
Programs should never, EVER crash. If you can’t decide/handle a Unicode codepoint, replace it with a question mark (or box etc) and carry on. And yes, a big-name program like iMessage absolutely should have unit tests for this.
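For the decoding step alone, the "replace it and carry on" approach looks roughly like this in Python. This is only a sketch of lenient decoding with the standard `errors="replace"` handler; `decode_untrusted` is a hypothetical helper name, and it does not address layout or rendering of the decoded text.

```python
def decode_untrusted(raw: bytes) -> str:
    """Decode UTF-8 from the network without raising on malformed input.

    Malformed byte sequences become U+FFFD (the replacement character)
    instead of triggering a UnicodeDecodeError.
    """
    return raw.decode("utf-8", errors="replace")


good = decode_untrusted(b"caf\xc3\xa9")   # valid UTF-8 passes through
bad = decode_untrusted(b"abc\xffdef")     # 0xFF can never occur in UTF-8
```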
I suggest you research font formats and the code required to render arbitrary unicode code points with case folding, reflow, ligatures, stacking, bidi, combining marks, etc.
TrueType and OpenType are turing-complete all by themselves.
There are nearly 140,000 assigned Unicode characters; exhaustively testing the combinations is computationally infeasible.
It is the definition of a decidedly non-trivial problem. Furthermore layout passes often involve hundreds or even many thousands of string measurements.
Every platform has had numerous bugs, including exploitable ones, in font and string handling.
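To put rough numbers on the combinatorial claim above, here is a back-of-the-envelope calculation. The 140,000 figure is the round number from the comment, and the rate of one billion tests per second is an assumption picked purely for illustration.

```python
ALPHABET_SIZE = 140_000  # round figure quoted above, not an exact count


def sequences(length, alphabet=ALPHABET_SIZE):
    """Number of distinct sequences of `length` characters."""
    return alphabet ** length


def years_to_test(length, tests_per_second=1e9):
    """Years needed to try every sequence at an assumed test rate."""
    seconds = sequences(length) / tests_per_second
    return seconds / (365 * 24 * 3600)


pairs = sequences(2)
years_for_four = years_to_test(4)
```

Even at that optimistic rate, covering every four-character sequence takes thousands of years, and real rendering bugs can depend on longer runs plus font, locale and layout state.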
Oh yes they should. If, for example, the alternative is to continue in an undefined state, potentially corrupting user data. Not all error conditions are recoverable.
The right question to ask yourself is why this is the only alternative. It almost never is, if the program is properly designed, especially for a user-facing application.
As an example, the default reaction to accessing an invalid index in a Swift array is trapping. How would you solve this “properly”? Returning an optional (ie. Maybe) would be too cumbersome and wouldn’t lead to safer code in practice.
The problem is with triggering pathological cases in the complicated beast that is font rendering. And rendering text is something that's so extremely common in apps, and so crucial, that it needs to be done in-process in order to avoid unacceptable overhead.
I see where you're going - and I like the queue idea - but wouldn't it be better for that second function to just monitor the queue - and pull data off the queue that is problematic?
Of course 'monitoring' is not exactly the same as 'processing' the actual data - so you'd have to know exactly what to pull off. Which would be just as easy to add that to the original process and just delete the unwanted data as you pull it out of the queue anyway.
In either event, you seem fairly bright - I probably don't follow. The main thing is that I believe system processes do monitor those other processes and restart them if they crash - I think the problem comes in when data overflows or buffers spill out into other processes memory areas. ( I am not a programmer, although I do try to be for fun ).
> but wouldn't it be better for that second function to just monitor the queue - and pull data off the queue that is problematic?
The most universal way to detect the problem is to catch the exception from the crash. However, the process that does that will be in the realm of undefined behavior, so it should be crashed.
> Of course 'monitoring' is not exactly the same as 'processing' the actual data - so you'd have to know exactly what to pull off.
Your scheme is limited to all the known strings. My scheme can respond to anything in the future which might crash the 1st process.
I enjoy this sort of humour as much as the next nerd, but just so y'all know, someone close to me who has been through psychosis linked to that particular delusion could have any one of a number of responses to that sentence - from getting a little freaked out through to anxiety attack or in the worst case full-on relapse into psychosis.
I wouldn't remove it from the list, as this sort of thing adds a bit of character, fun and a sense of shared values to programming; what's life for if you can't have a laugh? But if you're ever making jokes of that ilk and someone in the room goes a bit ... quiet ... do gently check up on them.
There are so many things in the world that can set someone off in some way, I don't think it makes sense to worry too much. Your advice about gently checking up on people who go weirdly quiet makes sense in any situation, not just with something like this.
As someone who has also been up close with psychosis, I think your worry is overkill. Such a sentence will push no one into psychosis by itself. And anyone with budding psychosis can be triggered by anything. I would be more worried about TV news than random jokes in a git repo.
Well, I have witnessed these ideas causing relapse into anxiety attack for someone with pre-existing issues. How unlikely is it in the general population? Probably pretty unlikely for sure, but I bet it's not an isolated case. See my longer post below.
A text by Plato was instrumental in starting a psychosis for one of my family members. My point is that anything can act as a trigger. Psychosis is unreasonable. Even without any stimuli, anxiety can just come on its own and be undirected (as I have witnessed).
Sorry to hear about your family member. I hope they're doing better now.
The fact that Plato played a part in your case shows that while anything can act as a trigger, deep philosophical ideas do seem to be a common theme (2 out of 2 on this thread).
While Plato was indeed the trigger, there was nearly no connection between the psychotic ideations and the context of the text. Watching a cloud formation could have done the same.
Someone else I know who struggled with mental health, when I first met him, casually assumed I could move clouds around with my eyes just like he could.
It isn't delusion, because it could be true, it's just irrelevant. For all you know you're just a brain in a jar hooked up to simulated input, there's no way to prove otherwise. We have no choice but to accept that our senses provide us some approximation of reality.
If the brain in a vat gets perfectly simulated input, then indeed, there is no way to tell. Same thing if we live in a computer simulation.
However, if such a surrounding simulation 'errors', that can be noticed. Moreover, there is the rather pressing issue of 'being turned off'. It's a difficult (semantic?) discussion whether you can notice your world being deleted, but the thought of unpredictable all-encompassing extinction isn't quite comforting.
A better reason not to care, is the idea that there is nothing you can do about it. However, if we are being simulated, there is probably some observer who might react to our actions. Moreover, if we are in a computer simulation we might be able to trigger some 'bug'. So even that line of reasoning isn't clear cut.
In the end, life does seem to be easier when you accept reality. Even more so because any way we have to influence the 'outside' is so infeasible as to be useless.
In theory some extremely high energy particle physics experiments ought to be impossible to properly simulate, and so if those experiments produce anomalous data then that could expose a "glitch in the matrix".
How would you determine the difference between anomalous data as a result of the simulation and anomalous data as an indication that your model of physics is simply incomplete?
After all, you don't have data from a real reality to compare to.
That's all fine and dandy, but I think that, as a human cohabitating with other humans, we should do our best to look out for each other and try not to trigger people into psychotic states. When someone near you is experiencing an anxiety attack, it's not very helpful to tell them what they're experiencing shouldn't cause them anxiety.
Probably he didn't experience psychosis from reading that string, but I got chills reading it... I could imagine someone prone to believing stuff like this getting a little bit anxious.
Long story. Her most immediate trigger was actually a traumatic experience of childbirth. Postpartum/puerperal psychosis affects mothers in approx 1 in 1000 births, or around 1 in 300 where an emergency C section was performed which was the case here; C section is of course pretty routine but in this case was complicated by an ileus. But there were many latent contributing factors to the psychosis as well: adverse reaction to an antibiotic (metronidazole - hallucinations can be a side effect), traumatic childhood, to name just a few.
The most acute episode of psychosis happened in hospital and involved some pretty random behaviour accompanied by total loss of response to input. This did not last more than 10 minutes or so, but on return from that state she continued to behave in a way that pretty much resembled someone on a bad LSD trip. This went on for about a week, during which she hardly slept and was never sure whether awake or not, and reported many types of delusional thinking, e.g. that reality was not as it seemed (think inception/matrix/comment in the OP here). Also paranoia that family, friends or strangers were staring or wanted to harm her or her child. Also occasional thoughts of self harm as a means of escape from this awful state of being.
This was all pretty much solved by Olanzapine which she still takes 1.5 years later although dosage is now tapering down to almost nothing. Also CBT. At some point the diagnosis was augmented by one of "O-only OCD" resulting in a second prescription for Sertraline. But the condition is very much under control; she is back to work part time, in a role that includes responsibility for children; her co workers know the score and are alert to possible issues but there haven't been any that she couldn't manage herself. Occasionally various thoughts disturb her but she has learned to handle them appropriately.
Mental health issues are often exacerbated by stigma, and this is in part because people (for understandable reasons) don't rush to share their experiences with others, which is why I've taken the time to write this post. I would recommend anyone to familiarize themselves with the material published by Mind and others, partly because it's interesting, partly because spotting the signs - and understanding the differences between e.g. psychosis, schizophrenia, bipolar, OCD, etc (even if the science behind them is far from well understood) - might help someone you know one day.
Edit: interesting addendum, the matrix/inception fear had been there long before the psychosis. I remember years ago telling her I had lost The Game ( https://en.wikipedia.org/wiki/The_Game_(mind_game) ) which she hadn't encountered before. I explained that "the object of the game is not to remember the game" ... she took it to mean we were all in some sort of simulation and got kinda scared even then.
Crazy. Not the first time I've heard about this. A friend of mine told me about someone's emergency C-section where they used ketamine to put her under, which sent her into some kind of psychosis that required quite a lot of psychological help to get over.
It gave me a kind of uncomfortable laugh, I had to check my faculties in the same way I might ask myself if I'm dreaming right now. I see what you're saying.
Well, when I read the title of the post, I thought, "there's no way I'm opening that up in my browser". But you managed to pique my curiosity enough that I went ahead and did it anyway. I hope nothing bad happens.
IMHO, the 'reading' gives it away. How would the doctors know it would appear in written form? Could be part of the technique of course. Still, I would change it to the more generic 'receive'.
Clearly, the doctors did not intend for that specific message to appear specifically on my screen. Rather, their treatment will make my brain "fight back" against the delusion, and reality will start to creep in unexpectedly. Maybe someone will stop me on the street and tell me "you need to wake up". Maybe I'll hear a new song titled "you are in a coma". Maybe I'll find a hospital gown on my wardrobe. Or maybe a random string will appear on some repository.
Note: I also realize how much you can eff up someone's day by doing all of this. If you are one of those Youtube pranksters, don't do that.
What if the "new technique" is to identify neural nets in the visual cortex that encode the cover of your favorite book, and then altering that visual memory so as to replace the title with this message.
I'd assume them to at least vary it a little bit. If they can printf some text into my comatose brain, they can also make it dynamic where it shouldn't be.
What would happen if you only thought you liked that book because it mentioned you.
I mean, do you think computer scientists named a sub-collection of instructions after you, or do you think it's a hint to wake up? Wake up.... Wake up subroutine.
From the outset Unicode's goal (more so than ISO 10646 though now they're one and the same) was to unify all existing character sets, so you'd only need one.
Necessarily then, there should not be other sets that encode things you can't in Unicode, since then you can't displace those with Unicode.
So, particularly in the early life of Unicode the goal was collect stuff that already exists and add it to Unicode. (These days we're finished with that and most new work is on adding things that weren't previously in any character set)
Two controversial things were done, at opposite ends of the spectrum, during this period of consolidation:
What you're seeing here is adding copies of the entire Latin alphabet, but with some particular property that Latin users would not really consider part of the character, such as "bold" or "italic" but which _was_ preserved in some character set being used somewhere. Without this choice, if we converted a text file encoded in a way that distinguished bold and italic characters, we'd lose that bold/ italic and it might be significant. This would be like when you get a black & white photocopy of a sheet that says
"Ignore any text below shown in red"
Um, but none of this text is red? Oh. Probably some of it was before it was photocopied. Oops.
At the far end of the spectrum, a process called CJK unification took place in which scholars of the languages using characters from the Han ("Chinese") writing system decided that although say, a Japanese character set and a Chinese character set both had a particular character, and the Chinese and Japanese would not draw this character the same way, actually in some linguistic sense it's the same character (and in many cases the visual differences are quite small) and so Unicode should not encode both separately.
There's a coherent technical argument for why both these types of decisions made sense, but they were nonetheless controversial.
You should not use weird characters like italic Latin letters in new documents, but you also should not transform these characters without warning when processing an existing document as you may lose important meaning.
Both had always bothered me deeply, but I'd never stopped to think that they're also essentially opposed in philosophy to each other. So now that I'm aware of that, I'm triply annoyed :S
One of the reasons for these sets is mathematics: ℜ <> ℝ in a math text. (And BTW the math symbols ℂℍℕℙℚℝ in the double-struck set are "out of sequence", which can be a nasty surprise if you do naive incrementation.)
And ℤ. The reason these double-struck symbols are in a weird place (U+2100-214f, separate from the rest in U+1d400-1d7ff) is because they all have commonly used special meanings in mathematics -- they're used to represent the sets of all numbers of various types. ℂ = complex numbers, ℍ = quaternions, ℕ = natural numbers, ℚ = rational numbers, ℝ = real numbers, ℤ = integers.
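The "out of sequence" surprise is easy to reproduce with Python's `unicodedata` module. A small sketch: `double_struck` is a hypothetical helper written for this example. It starts from U+1D538 (MATHEMATICAL DOUBLE-STRUCK CAPITAL A), tries naive incrementation, and falls back to a name lookup when it lands on an unassigned code point.

```python
import unicodedata

BASE = 0x1D538  # MATHEMATICAL DOUBLE-STRUCK CAPITAL A


def naive(letter):
    """Naive incrementation from the base code point."""
    return chr(BASE + ord(letter) - ord("A"))


def double_struck(letter):
    """Double-struck form of an ASCII capital, handling the holes."""
    candidate = naive(letter)
    if unicodedata.name(candidate, None) is not None:
        return candidate
    # The slot is unassigned: the symbol lives in the Letterlike Symbols block.
    return unicodedata.lookup("DOUBLE-STRUCK CAPITAL " + letter)


# Letters for which naive incrementation lands on an unassigned code point.
holes = "".join(
    c for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    if unicodedata.name(naive(c), None) is None
)
```

The holes line up with the letters mentioned above: C, H, N, P, Q, R and Z.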
There are three slightly different things going on.
The first line, The quick brown fox, originates with east-Asian character based terminals, on which ideographic characters occupied twice the space of alphabetic characters, and there was also a desire to have latin characters that were also double width. See https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms
The middle lines are included as mathematical symbols. The justification is that 𝑖 is a mathematical symbol that has its own independent meaning, which only coincidentally looks like italicized i. (I think this is silly, and naturally leads to a bloody mess as people misuse these symbols as letters, and in this case there is no backwards-compatibility excuse.) https://en.wikipedia.org/wiki/Mathematical_Alphanumeric_Symb...
Interesting. This is the first thing I thought, but when I fed "lazy" into google, it happily accepted and displayed the results, so I thought there might be something else. But the text editors that I tried indeed don't match the characters when I search them.
google is using unicode equivalence[1] to remap back to "standard" latin characters. this is important because, e.g., professional type-setting software may replace two adjacent "f" characters with a double-F ligature "ﬀ" depending on kerning. without unicode equivalence, google would fail on a lot of copy-paste queries.
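The same remapping can be tried locally with compatibility normalization (NFKC) from Python's standard library. This is only a sketch of the general mechanism; which normalization a given search engine actually applies is not something the code below can tell you. `search_key` is a hypothetical helper name.

```python
import unicodedata


def search_key(text):
    """Fold compatibility variants to their plain forms, then lowercase."""
    return unicodedata.normalize("NFKC", text).casefold()


fullwidth = "\uff4c\uff41\uff5a\uff59"                     # fullwidth "lazy"
math_italic = "\U0001d459\U0001d44e\U0001d467\U0001d466"   # mathematical italic "lazy"
ligature = "o\ufb00er"                                     # "offer" with the ff ligature

keys = [search_key(s) for s in (fullwidth, math_italic, ligature)]
```

A plain substring search, like the one in most text editors, compares code points directly and therefore does not match these variants against "lazy".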
+++ATH0 brings back memories. In my early days of IRC, writing that in crowded channels used to wreak havoc. It blows my mind that modems processed that string no matter the context.
Modems have no concept of context, and out of band signalling was optional.
It was _supposed_ to be +++[wait 0.5s]AT... but that pause was patented by Hayes, so to avoid patent issues, many cloners didn't require it. At least, that's the story I heard.
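A toy model of the difference, with big caveats: real modems work byte by byte with hardware timers, while this sketch works on timestamped chunks, and the one-second guard value is an assumption chosen for illustration. Both function names are hypothetical.

```python
GUARD = 1.0  # seconds of required silence; assumed value for illustration


def escape_without_guard(events):
    """Clone behaviour as described: '+++' anywhere in the data counts."""
    stream = b"".join(data for _, data in events)
    return b"+++" in stream


def escape_with_guard(events, guard=GUARD):
    """Escape only if '+++' arrives alone, with silence before and after.

    `events` is a list of (timestamp_seconds, bytes) in arrival order.
    The end of the list is treated as silence.
    """
    for i, (t, data) in enumerate(events):
        if data != b"+++":
            continue
        quiet_before = i == 0 or t - events[i - 1][0] >= guard
        quiet_after = i == len(events) - 1 or events[i + 1][0] - t >= guard
        if quiet_before and quiet_after:
            return True
    return False


# Someone typing the string into an IRC channel: it arrives inside payload.
irc_line = [(0.0, b"PRIVMSG #chan :+++ATH0\r\n")]
# A deliberate escape: pause, '+++', pause, then the command.
deliberate = [(0.0, b"data"), (2.0, b"+++"), (3.5, b"ATH0\r")]
```

With the guard time, the IRC line is just data; without it, the same bytes hang up the connection.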
It seems strange they didn't use DTR since RTS/CTS are usually used for hardware flow control. I can see the point of DSR, which tells the computer the modem exists... but there is no point in the modem looking for the computer. In theory the PC could have asserted DTR whenever it wanted to activate the control channel.
I always enter that date into systems that ask for my birth date (unless there's actually good reason for them to know it). I have never experienced any problems.
Implementing a solution where bad strings are not allowed is always fun. Had to create a gig long list of warranty registration codes a while back where it wasn't just the swear words that were naughty. There were also a few product specific words as you can't have 'FAIL' or 'SNAP' as part of a warranty registration code.
Other words that had to go included 'Jew'. Nothing 'naughty' about the word it just didn't look good in the codes so that had to be added to the big, long list of mostly 'naughty' words.
We also used a reduced dictionary so that codes could be given out over the phone without people getting it wrong, so no '1, l, L' problem or '0, o, O' problem.
I didn't find a handy library for our exact requirements and, as a consequence, I would advise people to roll their own code for such applications.
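For what it's worth, rolling your own does not take much. A minimal sketch, assuming upper-case codes, an alphabet with the look-alike characters (0/O and 1/I/L) removed, and a tiny placeholder ban list; a real list would be far longer and maintained by a human. All names here are hypothetical.

```python
import secrets

# Digits and capitals without the look-alikes 0/O and 1/I/L.
ALPHABET = "23456789ABCDEFGHJKMNPQRSTUVWXYZ"

# Placeholder ban list for illustration only.
BANNED = ("SNAP", "DEAD", "FAD")


def is_clean(code, banned=BANNED):
    """True if no banned word appears anywhere inside the code."""
    return not any(word in code for word in banned)


def make_code(length=10, banned=BANNED):
    """Draw random codes until one passes the ban list."""
    while True:
        code = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if is_clean(code, banned):
            return code


code = make_code()
```

Rejection sampling is fine here because banned substrings are rare in random codes; if many codes are needed, uniqueness has to be tracked separately.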
The fun part was making my non-technical manager in charge of the big, long list of rude words. I think that his additions were the only contribution to 'code' that he ever made.
That's actually really interesting. Something you'd never think about unless you've actually worked on a similar application for a rather large system where a lot of long codes had to be generated.
I had a project like that once. I decided to use a chunk of a uuid in hopes of not getting inappropriate words in the codes. Of course, as soon as we start testing, we promptly get stuff with DEAD, FAD, etc in it.
"If you're reading this, you've been in a coma for almost 20 years now. We're trying a new technique. We don't know where this message will end up in your dream, but we hope it works. Please wake up, we miss you."
It looks like the master file is blns.txt while the other files are generated from this text file (correct me if I'm wrong). So the strings are separated by newline characters and there is no way to escape in the strings. So a nasty string that trips up blns.txt and the tools around it would be "\n".
I think this is unfortunate since strings with newlines in them could certainly trip up some bash scripts. Filenames on Linux can contain newlines for example.
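The limitation is easy to demonstrate. The sketch below compares a newline-delimited round trip with a JSON round trip for a list that contains a string with an embedded newline; the two-item list is made up for the example.

```python
import json

strings = ["plain", "two\nlines"]

# Newline-delimited: the embedded newline is indistinguishable from a separator.
as_lines = "\n".join(strings).split("\n")

# JSON escapes the newline inside the string, so the list survives intact.
as_json = json.loads(json.dumps(strings))
```

After the line-based round trip there are three strings instead of two, while the JSON round trip returns the original list.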
I just had an issue today with an internal program; the error was "unclosed parenthesis". A string with an unclosed parenthesis would give you this error.
In our case the string that had the issue should have been machine generated but ended up being assigned user input... Is that kind of string also helpful in a list like this?
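One common way this class of bug shows up is user input ending up where a regular expression is expected. That is an assumption for the sake of an example; the comment above does not say what the program did with the string. A sketch in Python, where `find_literal` is a hypothetical helper:

```python
import re


def compiles(pattern):
    """True if `pattern` is a valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def find_literal(needle, haystack):
    """Search for user-supplied text literally by escaping it first."""
    return re.search(re.escape(needle), haystack) is not None


raw_ok = compiles("(")                 # a lone "(" is not a valid pattern
escaped_ok = find_literal("(", "f(x)")
```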
I'm curious to know what it is about each of the strings that makes it naughty. A comment next to each one describing what it's supposed to be testing would be great.
https://news.ycombinator.com/item?id=13406119
https://news.ycombinator.com/item?id=10035008