

Dive into Python 3: Strings - arthurk
http://diveintopython3.org/strings.html

======
tumult
This is actually one of the best intros to Unicode (and string encoding in
general) I've seen. If the rest of the book ends up being of this quality,
I'll be pretty pumped.

~~~
yan
This especially: "In Python 3, all strings are sequences of Unicode
characters. There is no such thing as a Python string encoded in UTF-8, or a
Python string encoded as CP-1252. "Is this string UTF-8?" is an invalid
question. UTF-8 is a way of encoding characters as a sequence of bytes. If you
want to take a string and turn it into a sequence of bytes in a particular
character encoding, Python 3 can help you with that. If you want to take a
sequence of bytes and turn it into a string, Python 3 can help you with that
too. Bytes are not characters; bytes are bytes. Characters are an abstraction.
A string is a sequence of those abstractions."

Couldn't have said it better myself.
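
A minimal Python 3 sketch of that round trip (the string and the encoding
names here are just examples):

    >>> s = 'caf\u00e9'         # a str: a sequence of Unicode characters
    >>> b = s.encode('utf-8')   # characters -> bytes
    >>> b
    b'caf\xc3\xa9'
    >>> b.decode('utf-8')       # bytes -> characters
    'café'
    >>> b.decode('cp1252')      # same bytes, wrong encoding, wrong characters
    'cafÃ©'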

~~~
cammil
"In Python 3, all strings are sequences of Unicode characters"

What does that mean exactly? Everything seemed really well explained until
then. The line above lost me.

EDIT: I think I'm confused about the difference between UTF-32 and Unicode.
Is there one?

~~~
inklesspen
Yes. Unicode is a standard that maps characters to code points. A character
is a symbol used in a human language or other written communication. A code
point is a number, conventionally written as "U+<hexadecimal value of the
number>" -- for example, the capital letter "A" is U+0041.

There are many different ways to actually store these sequences of code
points in a computer. UTF-32 is one of those ways. It takes each code point's
number, converts it to a 32-bit binary integer, and then splits that number
into four 8-bit bytes. As the book says, there are problems with space usage
-- in ordinary English text, all the code points will be U+007F (decimal 127)
or less, which leads to a lot of zero bytes taking up space. In addition to
the waste of space, zero bytes can cause problems in C, since they're the
symbol for the end of a string. So people invented other 'encodings' to
convert Unicode code points into bytes: UTF-16, UTF-8, etc.
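
For what it's worth, you can see the tradeoff directly in Python 3
('utf-32-be' is just the big-endian variant, which skips the byte-order
mark):

    >>> 'Hi'.encode('utf-32-be')   # four bytes per code point, mostly zeros
    b'\x00\x00\x00H\x00\x00\x00i'
    >>> 'Hi'.encode('utf-8')       # one byte per code point for ASCII text
    b'Hi'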

------
akirk
I love how you can hover your mouse over certain lines of code and it
highlights a paragraph below that explains something about it, and vice versa.

~~~
MarkPilgrim
Thanks for noticing!

Fun fact: I originally coded that "by hand," i.e. manipulating the DOM in pure
JavaScript. Then I decided to rewrite it in jQuery, which I had heard about
but never used. Then I realized that I would never voluntarily write
JavaScript without jQuery, ever again.

~~~
tvon
Nice work, this is probably the nicest looking technical material I've ever
seen (but then I'm a sucker for nice clean typography).

------
denimboy

       "On the other end of the spectrum, languages like Chinese, Japanese, and Korean have thousands of characters."
    

This is not exactly true. Chinese has an iconic lexicon where each glyph is a
single word. Cantonese and Mandarin speakers both use the same lexicon, but
have different pronunciations. There are several lexicons (pinyin, big5,
ancient) and thousands of glyphs in each.

Japanese has three lexicons: hiragana, katakana, and kanji. Kanji is the
oldest, adapted from Chinese, and its glyphs are iconic. Hiragana and
katakana were developed in Japan and are phonetic. Together they form most of
what you see as Japanese text today. I think katakana is used more for
foreign, non-Chinese words. There is also romaji, which is essentially
English letters. Anyway, apart from kanji, which is Chinese, there are fewer
than 100 hiragana and katakana glyphs.

Korean is even simpler. They too borrowed from the Chinese and occasionally
still use some Chinese glyphs, but the official lexicon is Hangul. Hangul is
phonetic and has 24(?) basic glyphs. Some glyphs can be combined into compound
glyphs called double consonants and double vowels, making about 40 glyphs
total. A Korean word can be written by breaking it down into syllables,
combining glyphs to form a syllable super-glyph, then putting those together
to form a word.
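
Unicode models this composition directly, for what it's worth: conjoining
jamo can be normalized into a single precomposed syllable code point. A quick
Python 3 sketch (the particular jamo here are just an example):

    >>> import unicodedata
    >>> jamo = '\u1112\u1161\u11ab'    # HIEUH + A + NIEUN: three code points
    >>> unicodedata.normalize('NFC', jamo)
    '한'
    >>> ord(unicodedata.normalize('NFC', jamo))   # one syllable block, U+D55C
    54620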

The explanation of Unicode and Python 3 is great. I just wanted to clear up
the misconception stated in the first paragraph. Nothing more to see here...

~~~
chairface
I am having trouble seeing what you're claiming. Are you saying that there are
not thousands of characters in Japanese and Korean because they borrowed from
the Chinese? If so, I'd say your correction is misplaced. These characters
must still be taken into account for a charset to be used for these languages.

In any case, I found many of your comments to be irrelevant to the question of
how many characters must be used in a language. For instance, the difference
between Cantonese and Mandarin pronunciation doesn't have anything to do with
this issue. Nor does Chinese origin.

edit: I just spent a little time researching Korean (which I know much less
well than Chinese or Japanese), and now I better understand what you were
saying
about it. However, it seems to me that each "super-glyph" as you call them
counts as a character, as far as any charset is concerned. The fact that they
can be broken up into constituent glyphs is irrelevant.

~~~
bobbyi
Really? I thought the idea of Han Unification was that the duplicated
characters between the CJK languages all map to the same Unicode code points.

~~~
chairface
Yes that's true, but even so, you can't fit all those characters into 8 bits,
which is basically the point of that first section. Also, I don't see how
mapping to the same codepoint would mean that Chinese has thousands of
characters, while Japanese does not. They just share many of those thousands
in common.
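
For example, in Python 3 you can check where one shared ideograph sits:

    >>> ord('\u4e2d')   # U+4E2D, a Han-unified ideograph used in both languages
    20013

That's well past anything an 8-bit charset can index, unified or not.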

------
mace
Here's another great explanation of Unicode from PyCon '08:
<http://farmdev.com/talks/unicode/>

It is really helpful in understanding Unicode handling in Python pre-3.0.

------
brisance
Mark... thank you for doing this and making it available on the web. I will
buy a hardcopy of it when it's out.

