
Python as of 3.3 (CPython, per PEP 393) chooses an internal representation on a per-string basis. The encoding will be latin-1, UCS-2, or UCS-4, and the choice is made based on the widest code point in the string: Python picks the narrowest of the three that can represent that code point in a single unit.
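A quick way to see this for yourself is the sketch below. It assumes CPython 3.3+, where sys.getsizeof reports the string object's full size; bytes_per_code_point is a name I made up for illustration, not anything in the standard library. Comparing n and n+1 copies of the same character cancels out the fixed object header and leaves just the per-code-point width.

```python
import sys


def bytes_per_code_point(ch: str, n: int = 1000) -> int:
    """Estimate how many bytes CPython stores per code point for `ch`.

    Hypothetical helper: the size difference between n+1 and n copies of
    the same character cancels the fixed per-object header.
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


samples = {
    "ASCII 'a' (U+0061)": "a",
    "latin-1 'e acute' (U+00E9)": "\u00e9",
    "BMP euro sign (U+20AC)": "\u20ac",
    "astral emoji (U+1F600)": "\U0001F600",
}

for label, ch in samples.items():
    print(label, "->", bytes_per_code_point(ch), "byte(s) per code point")
```

On other implementations (PyPy, for example) sys.getsizeof may not be supported for strings, so treat this as a CPython-only probe.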

This does mean that a string which contains, say, some English text and a single emoji will "blow up" into UCS-4, but the overhead isn't that severe; most such strings are not especially large. It also means that strings containing only code points <= U+00FF are smaller in memory on Python 3.3+ than previously, since prior to 3.3 they used at least two bytes per code point (two on narrow builds, four on wide builds) and now use only one.
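To put a rough number on the "blow up", here is a small sketch, again assuming CPython 3.3+ and sys.getsizeof. The sample text is arbitrary, and exact sizes vary by version and platform, so no specific byte counts are claimed.

```python
import sys

english = "hello world " * 100        # 1200 ASCII code points
with_emoji = english + "\U0001F600"   # same text plus one emoji

ascii_size = sys.getsizeof(english)
emoji_size = sys.getsizeof(with_emoji)

# One code point above U+FFFF forces every code point in the string
# to be stored in four bytes instead of one.
print("ASCII only:     ", ascii_size, "bytes")
print("plus one emoji: ", emoji_size, "bytes")
print("ratio:          ", round(emoji_size / ascii_size, 2))
```

The ratio approaches 4x as the string grows, since the fixed header matters less and less.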
