
Personally no, but you have to handle 0 correctly (well, mainly, consistently), otherwise you end up with an attack vector: one part of the application stops at the 0 (e.g. string equality), and a later step assumes the conclusion of the previous part but continues past the 0, reading bytes that should have been considered earlier (but weren't).


Exactly, and moving from ASCII to UTF-8 means you get to keep that consistency: 0x00 means 'End of string' in ASCII and it means 'End of string' in UTF-8. No change. Never a miscommunication. No possibility of old software getting confused on this issue. Any code which had its last buffer overrun flushed out in 1983 is still free of buffer overruns in 2013.

And, if you really need to represent codepoint 0 in strings, you can use Java's Modified UTF-8, where codepoint 0 is represented by the byte sequence 0xC0, 0x80. (This isn't valid UTF-8 because in straight UTF-8, every codepoint must be represented by its shortest possible representation.)

http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8


> it means 'End of string' in UTF-8

No it doesn't, unless you are saying that one should treat it like that. But null termination is as dangerous[1,2] with UTF-8 as it is with ASCII and should be avoided as much as possible anyway. Also, ASCII doesn't mandate that \0 is end-of-string; that's just a "convention" from C.

(Did you notice that my original comment actually included the exact modified UTF-8 link you provided?)

[1]: http://cwe.mitre.org/data/definitions/170.html

[2]: http://projects.webappsec.org/w/page/13246949/Null%20Byte%20...



