Fun story about one of the devices mentioned there that I worked on. We used to store the saved wifi creds in a file named exactly what the SSID was.
Some user managed to break things, and with their permission we gathered detailed wifi logs and found they were connected to an SSID that was an ASCII depiction of the equation: [redacted] plus [redacted] equals [redacted]. The issue was the forward slashes, presumably there to add [redacted]. Must have been an awkward customer service follow-up when we told them to change their SSID while they waited for an update.
Really they should have fixed the software instead of telling the user to change it. It's a perfectly valid SSID.
And really, using raw environment-derived data directly on the filesystem?? What if the SSID had been "/etc/passwd" or something similar and it wrote to that?
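A minimal sketch of the obvious fix (the helper name and config directory are made up); the point is just to derive the filename from a hex encoding of the SSID bytes rather than the raw string:

    import os

    def creds_path(ssid: bytes, base: str = "/var/lib/wifi") -> str:
        # hex output is filesystem-safe: no slashes, no NULs, no ".." tricks
        return os.path.join(base, ssid.hex() + ".conf")

    print(creds_path(b"/etc/passwd"))
    # /var/lib/wifi/2f6574632f706173737764.conf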
I used to have something like this as my SSID:
ʕ•̫͡•ʕ̫͡ʕ•͓͡•ʔ-̫͡-ʕ•̫͡•ʔ̫͡ʔ-̫͡-ʔ (Not this particular one as it was too long though!) Many nice examples at: https://1lineart.kulaone.com/#/
It was fun but some OSes didn't show it correctly, in particular Windows. It would just show it in HEX. And more annoyingly, some devices refused to connect to it at all, especially IoT crap like those WiFi power sockets.
So eventually I gave up.
PS: Something with more vertical stuff would also be really fun; some of these can spill across multiple lines of unrelated content! Unfortunately most OSes block this from happening now. Example:
The 802.11 standards have always allowed up to 32 bytes, which can be filled with any data; it does not have to be in any particular encoding. In 802.11-2012 there is a separate tag, SSIDEncoding, which can be used to specify whether these bytes are UTF-8 or "unspecified". If the UTF-8 option is set, the SSID should be interpreted as UTF-8.
It is not clear in this case if the router sets this flag or not. Either way there is no stipulation in the spec about how the UTF-8 characters should be displayed so many of these options are potentially valid.
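For illustration, the SSID element carried in beacons is just element ID 0, a length byte, and up to 32 raw payload bytes. A minimal python3 sketch of extracting them (the function name is made up):

    def parse_ssid_element(ie: bytes) -> bytes:
        # 802.11 information element layout: [ID][length][payload ...]
        eid, length = ie[0], ie[1]
        assert eid == 0 and length <= 32   # SSID element, capped at 32 bytes
        return ie[2:2 + length]            # raw bytes, encoding unspecified

    print(parse_ssid_element(b"\x00\x05hello"))   # b'hello'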
The bytestring was truncated to 32 bytes, in the middle of a multi-byte UTF-8 sequence.
This means the resulting truncated string is no longer valid UTF-8.
So my guess is that most devices decide "if it's not valid UTF-8, it must be $LEGACY_ENCODING".
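A quick python3 illustration of that failure mode, using a made-up SSID of an `a` followed by twenty combining acute accents:

    raw = ("a" + "\u0301" * 20).encode("utf-8")   # 41 bytes
    trunc = raw[:32]                              # router-style truncation
    try:
        trunc.decode("utf-8")
    except UnicodeDecodeError as e:
        print(e)   # "unexpected end of data": cut mid-sequence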
Unicode offers two ways forward when you can't decode what you have. One alternative is an exception: you just fail because you weren't able to decode something.
The other is to emit U+FFFD, the Unicode Replacement Character, for any code unit that won't decode, and then carry on decoding.
For humans U+FFFD makes it obvious something is wrong, it's typically visualised as a black diamond with a white question mark. And for a machine it shouldn't match parsing rules, it isn't an alphanumeric, it isn't any of the common separator or spacing characters, so it's unlikely to be of use in an attack.
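In python3 terms the two alternatives look like this, using an arbitrary invalid byte string:

    bad = b"abc\xcd"                          # ends mid-sequence
    # alternative 1: fail outright
    try:
        bad.decode("utf-8")                   # raises UnicodeDecodeError
    except UnicodeDecodeError as e:
        print(e)
    # alternative 2: substitute U+FFFD and carry on
    print(bad.decode("utf-8", "replace"))     # 'abc\ufffd'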
That is a reasonable approach if you know that what you are decoding is supposed to be UTF-8.
If you don't know the text encoding because there is no information to indicate it (or you don't trust that information to be correct) then you will have to guess and "decode as UTF-8 for valid UTF-8, use some legacy encoding otherwise" is a common approach (used e.g. by many text editors).
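That heuristic is only a few lines in python3; `mac_roman` here is just an example legacy fallback (real software usually picks one based on locale):

    def decode_guess(data: bytes) -> str:
        try:
            return data.decode("utf-8")       # valid UTF-8 wins
        except UnicodeDecodeError:
            # legacy single-byte fallback; mac_roman maps all 256 byte
            # values, so this can never fail
            return data.decode("mac_roman")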
Huh, I'm surprised emojis aren't more popular for SSIDs... can't wait until this knowledge spreads and we get a vomit of color when we open the "Wireless Networks" menu.
OTOH for most people the SSID is "Linksys 4FBD" or similar...
> OTOH for most people the SSID is "Linksys 4FBD" or similar...
And to think that one of the major reasons for the random string after <Vendor name> (apart from sparing non-technical people in apartment blocks some confusion) is so that rainbow tables precomputed for the most common SSIDs can't be used against large swathes of the routers you would encounter.
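For context, WPA2-PSK derives the key by running the passphrase through PBKDF2 with the SSID as the salt, which is why a table precomputed for one SSID is useless against another. A python3 illustration with made-up credentials:

    import hashlib

    # PSK = PBKDF2-HMAC-SHA1(passphrase, ssid, 4096 iterations, 32 bytes)
    psk = hashlib.pbkdf2_hmac("sha1", b"hunter22", b"Linksys 4FBD", 4096, 32)
    print(psk.hex())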
> can't wait until this knowledge spreads more and we'd have a vomit of color when we open the "Wireless Networks" menu.
You're limited to 32 bytes, which limits the spew somewhat. Emoji run up to 4 bytes each in UTF-8, so you can in theory get a sequence of 8 of them in a row if you want. Should encourage a little bit of creativity to fit within those lines...
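A quick python3 sanity check of that arithmetic (any 4-byte emoji works; the unicorn is arbitrary):

    ssid = "\U0001F984" * 8                  # eight 4-byte emoji
    print(len(ssid.encode("utf-8")))         # 32: exactly at the limit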
I don't even want to know if any system would process things like bell characters or the right-to-left override characters...
Unrelated note: I had to file a bug last month because OpenWrt's web interface kept accepting more than that, and wireless wouldn't come back up when you tried. JavaScript length checks are weird.
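A plausible culprit: JavaScript's `String.length` counts UTF-16 code units, which is neither characters nor UTF-8 bytes. The check has to be on the encoded byte length, as in this python3 sketch:

    def ssid_ok(ssid: str) -> bool:
        return len(ssid.encode("utf-8")) <= 32   # count bytes, not characters

    s = "\U0001F984" * 10              # 10 characters...
    print(len(s), ssid_ok(s))          # 10 False -- 40 bytes on the wire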
I work on some ecommerce sites. I've had to cancel orders because the order exports can't handle emoji in the fields. I can't wait until baby names have actual emoji in them. I bet some idiot has already tried it.
> Both the s8 and the Firestick are rendering the result in what I deem as the correct way with it showing the name just with some of the vertical characters cutoff.
At least one is doing a poor job, though, because the diacritics look nothing alike…
> After asking around on the Apple discord server someone said it might be using the Mac OS Roman character set. It turns out it was, which is strange, because iOS uses UTF-8 internally and not Mac OS Roman, as that was phased out with the release of Mac OS X.
I would guess that some part of IOKit is passing a C or C++ string to CoreFoundation using an inappropriate function or using the “system encoding”. I can’t remember off the top of my head, but Mac OS Roman might also be encoding 0 (kCFStringEncodingMacRoman). In any case there’s certainly a conversion going on there with a poor default or some sort of strange compatibility story.
(I’m actually curious if there is “supposed” to be an encoding for this. Perhaps Mac OS Roman is just as correct and more convenient?)
The first Apple AirPort routers predate Mac OS X, so it wouldn’t be crazy for the initial Mac OS X implementation to fall back to Mac OS Roman as backcompat with routers configured from Mac OS 8.6/9. And then they never changed it, since for 99% of users the UTF-8 auto-detect works fine...
My Canon printer won’t join my SSID containing an emoji, helpfully throws generic E36 (or something like that). All Apple devices show and connect to the SSID just fine.
On my Firefox it looks like four "a"s, with a sort of tower over the first "a" that ends in a frowny face with an accent over it. Is this[1] what you're seeing and describing differently? Or are we having different things displayed by Firefox?
On my computer I see three different representations: In the text on Hacker News, I see the stuff on top of the first "a", in the tab title, it is on top of the second "a", and in the window title, it doesn't render the SSID string (although the rest of the title is displayed).
Very cool. It's pretty interesting to see the various failure modes. Some seem straightforward (e.g., the font is missing the glyphs) while others seem to be parsing limitations.
As an aside, this finally convinced me to explore using additional SSIDs in creative ways with emojis.
For most of the Western world, if you take the set of all commonly used characters in the language(s) that are widely recognized in each country and form their intersection, you'll have at least the Arabic numerals and plain A-Z.
If SSIDs were restricted to just those characters, it would be fine in the Western World. But of course there is more to the world than the West.
Question: do most or all non-Western languages also have small subsets of characters that would be fine to restrict SSIDs to? For instance, Wikipedia tells me that Persian is written with a 32-letter alphabet, and Arabic with 28.
I'd expect that for every alphabet-based language, there is a similar base set of characters you could reasonably limit SSIDs to, and so avoid all the problems you get with allowing full Unicode.
How about the languages with logographic or otherwise large writing systems, such as Chinese and Japanese? (Korean's Hangul is actually a small alphabet, and modern Vietnamese is written in a Latin-based script.) Do they have reasonable (albeit probably very large) subsets SSIDs could be limited to that would avoid all the weird stuff that can happen in Unicode but still allow most reasonable names to be used?
Don't forget that some of these are right-to-left (e.g. Hebrew, Arabic). Early email software would expect such text to be stored in visual order (each word's letters reversed) so that simple left-to-right rendering could be used. Unicode's bidirectional handling solves this (and many other issues) quite nicely.
I tested this out of curiosity, and all the iPhones I could find in my household rendered it correctly as UTF-8 with only 12 octets [0]. This was replicated on an iPhone 7, SE and XR, all running iOS 13.5.1. So it may well be that the issue was fixed in the 6s or 7.
This is a really good post that shines some light on how the insanity of encodings still isn't fixed today, since so many operating systems still don't completely use Unicode everywhere.
Some of the reasoning behind why the characters are displayed like that is slightly incorrect, though, so here are some corrections:
I'm going to supply each example here with some python3 code to reproduce it, using the following definition:
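    # the 32 truncated bytes quoted below
    data = b"a\xcc\xb6\xcc\x81\xcc\x93\xcc\xbf\xcc\x88\xcc\x9b\xcc\x9b\xcd\x90\xcd\x98\xcd\x86\xcc\x90\xcd\x9d\xcc\x87\xcc\x92\xcc\x91\xcd"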
> My router just cut the name down to 32 octets though to stay compliant
> This was what was being sent according to iw
> `a\xcc\xb6\xcc\x81\xcc\x93\xcc\xbf\xcc\x88\xcc\x9b\xcc\x9b\xcd\x90\xcd\x98\xcd\x86\xcc\x90\xcd\x9d\xcc\x87\xcc\x92\xcc\x91\xcd`
If you look at this closely, the last byte in this sequence is `\xcd`, which is the start of an incomplete UTF-8 sequence. It's missing the final `\x84` that the router cut off (along with the three additional `a` characters).
> with the raw hex being
> `97ccb6cc81cc93ccbfcc88cc9bcc9bcd90cd98cd86cc90cd9dcc87cc92cc91cd`
Small mistake: the hex of `a` is `61`, not `97` (97 is its decimal value), but otherwise correct.
These two devices (the s8 and the Firestick) render the result of UTF-8 decoding while ignoring invalid bytes (in python3: `data.decode('utf-8', 'ignore')`)
> iPhone 6 running iOS 13.5.1
> Apple TV Second Generation
Completely correct. This is definitely Mac OS Roman (in python3: `data.decode('mac_roman')`)
> Windows 10 Pro 10.0.19041
This one is incorrect again:
Windows is interpreting the characters in the "Windows Codepage 1252" (also known as "Western") encoding and ignoring invalid characters (in python3: `data.decode('cp1252', 'ignore')`)
Decoding every byte separately as UTF-8 would fail (a byte that can be a continuation of a UTF-8 sequence is never a valid start byte).
Interpreting every byte as a Unicode code point (i.e., Latin-1) would give something very similar, but not exactly the same: the bytes Windows decodes as a quote, a caret-y thing, an angle-bracket-y thing, a tilde, a dagger, a double dagger, and a single quote fall into a control character block at the start of the Unicode "Latin-1 Supplement" block (`\x80` to `\x9f`).
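You can see both points in python3:

    print(b"\xcc".decode("utf-8", "ignore"))   # '' -- a lone continuation byte
    print(b"\x86".decode("cp1252"))            # '†' -- a printable dagger
    print(b"\x86".decode("latin-1"))           # '\x86' -- an invisible C1 control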
> Chromebook running ChromeOS 83.0.4103.97
Correct.
The Chromebook seems to have rendered the ASCII a, but replaced all other 31 characters with question marks.
> Kindle Paperwhite running Firmware 5.10.2
> Vizio M55-C2 TV
Also correct.
Those two devices seem to opt to display hex instead of falling back to question marks as the Chromebook does.
I hope this comment gave some useful insight into why these devices decoded it this way :)