
Most of what I do involves the messy world of text, and I think this is a great resource. I wish the software I depended on tested against it.

I can think of a few more cases that I've seen cause havoc:

- U+FEFF in the middle of a string (people are used to seeing it at the beginning of a string, because Microsoft, but elsewhere it may be more surprising)

- U+0 (it's encoded as the null byte!)

- U+1B (the codepoint for "escape")

- U+85 (Python's "codecs" module thinks this is a newline, while the "io" module and the Python 3 standard library don't)

- U+2028 and U+2029 (even weirder linebreaks that cause disagreement when used in JSON literals)

- A glyph with a million combining marks on it, but not in NFC order (do your Unicode algorithms use insertion sort?)

- The sequence U+100000 U+010000 (triggers a weird bug in Python 3.2 only)

- "Forbidden" strings that are still encodable, such as U+FFFF, U+1FFFF, and for some reason U+FDD0
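For anyone who wants to see the line-break disagreement firsthand, here is a rough Python 3 sketch. It only compares `str.splitlines()` with `io.StringIO` and looks at how `json.dumps` treats U+2028; the helper name `line_counts` and the sample strings are made up for illustration, and other libraries may well behave differently.

```python
import io
import json

# Hypothetical sample strings: one "unusual" line break between two letters.
samples = {
    "U+0085": "a\u0085b",
    "U+2028": "a\u2028b",
    "U+2029": "a\u2029b",
}

def line_counts(text):
    """Return (lines seen by str.splitlines, lines seen by io.StringIO)."""
    by_splitlines = len(text.splitlines())
    by_io = len(io.StringIO(text).readlines())
    return by_splitlines, by_io

for name, text in samples.items():
    print(name, line_counts(text))

# json.dumps escapes non-ASCII by default. With ensure_ascii=False the raw
# U+2028 goes into the output, which is valid JSON but was not accepted
# inside string literals by JavaScript engines before ES2019.
escaped = json.dumps("a\u2028b")
raw = json.dumps("a\u2028b", ensure_ascii=False)
```

If the two counts differ for a sample, two parts of the same program can disagree about where a line ends.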

People should also test what happens with isolated surrogate codepoints, such as U+D800. But these can't properly be encoded in UTF-8, so I guess don't put them in the BLNS. (If you put the fake UTF-8 for them in a file, the best thing for a program to do would be to give up on reading the file.)
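A minimal Python 3 sketch of the surrogate point, assuming CPython's built-in UTF-8 codec. The helpers `try_encode` and `try_decode` are hypothetical names. The `surrogatepass` error handler is used here only to manufacture the "fake UTF-8" bytes for testing.

```python
lone = "\ud800"  # an isolated high surrogate

def try_encode(text, errors="strict"):
    """Encode to UTF-8, returning None if the codec refuses."""
    try:
        return text.encode("utf-8", errors=errors)
    except UnicodeEncodeError:
        return None

def try_decode(data):
    """Decode UTF-8 strictly, returning None if the codec gives up."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None

strict_result = try_encode(lone)               # strict encoding refuses
fake_utf8 = try_encode(lone, "surrogatepass")  # forces out the 3-byte form
roundtrip = try_decode(fake_utf8)              # strict decoding gives up
```

Here the strict decoder rejecting the bytes is the "give up on reading the file" behavior described above.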

Isolated UTF-16 surrogate code points definitely crash Unity when it tries to display them. (Seen when I pasted some emoji in a text box in TIS-100 and tried to backspace.)

BOMs have already caught me off guard at the start of strings.
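As a small illustration in Python 3 (a sketch, not a recommendation): the `utf-8-sig` codec drops a BOM only at the very start of the data, so a U+FEFF in the middle of a string survives decoding. The byte string below is a made-up example.

```python
import codecs

# Made-up input: a BOM at the start and another one in the middle.
data = codecs.BOM_UTF8 + b"a" + codecs.BOM_UTF8 + b"b"

plain = data.decode("utf-8")      # keeps both U+FEFF characters
sig = data.decode("utf-8-sig")    # strips only the leading one

# Removing every U+FEFF is a blunt fix and only an assumption about what
# the caller wants; U+FEFF can also be a deliberate zero-width no-break space.
stripped = sig.replace("\ufeff", "")
```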

Bi-directional text is probably another one. All the bidi control characters, especially. Probably really all Unicode control characters in general.
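One way to make these visible when debugging, sketched in Python 3. The code point set below is my own hand-written list of the bidi marks, embeddings, overrides, and isolates, and `reveal_bidi_controls` is a hypothetical helper name; treat both as assumptions rather than an exhaustive definition.

```python
# Assumed list: ALM, LRM, RLM, the embedding/override block U+202A..U+202E,
# and the isolate block U+2066..U+2069.
BIDI_CONTROLS = {
    0x061C, 0x200E, 0x200F,
    *range(0x202A, 0x202F),
    *range(0x2066, 0x206A),
}

def reveal_bidi_controls(text):
    """Replace each bidi control character with a visible <U+XXXX> tag."""
    return "".join(
        f"<U+{ord(ch):04X}>" if ord(ch) in BIDI_CONTROLS else ch
        for ch in text
    )

print(reveal_bidi_controls("abc\u202edef\u202c"))
```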

Sure, but there's already a lot of bidi text in the file.

Bah, I only saw mono-directional text. Looking closely, I see only one line with bi-directional text: "הָיְתָהtestالصفحات التّحول"?

There are bidi control characters in the "Trick Unicode" section. (They don't show up in GitHub's rendering, oddly.)

The range U+FDD0..U+FDEF is reserved for internal use by applications.

It's supposed to be reserved for applications. In practice... you may see them in the wild anyway. So it's important that they show up in test vectors!
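A quick Python 3 sketch for spotting these in test data, assuming the usual definition of noncharacters: U+FDD0..U+FDEF plus the last two code points of every plane. The function name `is_noncharacter` is made up for this example.

```python
def is_noncharacter(cp):
    """True if code point cp is a Unicode noncharacter (assumed definition)."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    in_fdd0_block = 0xFDD0 <= cp <= 0xFDEF
    # The low 16 bits are FFFE or FFFF: the last two code points of a plane.
    ends_plane = (cp & 0xFFFE) == 0xFFFE
    return in_fdd0_block or ends_plane

# Noncharacters still encode and decode like any other code point,
# which is why they can turn up in real data.
for cp in (0xFFFF, 0x1FFFF, 0xFDD0):
    encoded = chr(cp).encode("utf-8")
    print(f"U+{cp:04X}", is_noncharacter(cp), encoded)
```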

This. Hidden white space has ruined my day before!
