
Strings – Dive Into Python 3 - morphics
http://getpython3.com/diveintopython3/strings.html
======
emiljbs
Apparently, all I know about strings is correct.

~~~
Roboprog
And, he didn't even touch on the horror that is EBCDIC. Once you've had to
touch that, the idea of "code point" for a character is something you can't
ignore, hoping that things just work "most of the time" -- ASCII A != EBCDIC
A.
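
A minimal sketch of the ASCII vs. EBCDIC point, assuming Python 3 and its bundled `cp037` codec (one of several EBCDIC variants; which code page a given system uses is an assumption here):

```python
# The same character "A" maps to different byte values in ASCII and EBCDIC.
ascii_a = "A".encode("ascii")
ebcdic_a = "A".encode("cp037")  # cp037 is just one EBCDIC code page (assumed)

print(ascii_a.hex(), ebcdic_a.hex())

# The bytes differ, but decoding each with its own codec gives the same code point.
assert ascii_a != ebcdic_a
assert ascii_a.decode("ascii") == ebcdic_a.decode("cp037") == "A"
```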

~~~
sluu99
Are there still (many) people using EBCDIC?

~~~
Roboprog
Yes. High volume printing is often done using IBM's AFP/MODCA print language,
which typically has the text in IBM's EBCDIC encoding.

(disclosure: I once worked at a company that made tools to port code and data
from IBM minicomputers to Unix & MS platforms, and also at the largest
[format,] print & mail shop in the US)

------
sluu99
TL;DR: think of a string as a tuple of numbers. Some are bytes, some are integers.
If you want to transform those numbers into a particular encoding (e.g. UTF-8,
CP-1252), then that's a different story.

EDIT: I know the article went all out about character abstraction; that's why I
said "some are bytes, some are integers".

~~~
morpher
This is entirely the wrong take-away message from this article. The point is
that strings are _not_ sequences of numbers but, rather, sequences of
_characters_. Characters are abstracted from the underlying byte
representation, which is unimportant when dealing with strings.

For situations where a concrete byte representation is needed, you can get one
by encoding the string.
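
As a rough illustration, assuming Python 3, the same string yields different byte sequences depending on the encoding chosen (the sample word is arbitrary):

```python
text = "café"

# Encoding turns the abstract string into concrete bytes.
utf8_bytes = text.encode("utf-8")      # "é" takes two bytes in UTF-8
cp1252_bytes = text.encode("cp1252")   # "é" takes one byte in CP-1252

print(len(text), len(utf8_bytes), len(cp1252_bytes))

# Decoding with the matching codec recovers the original string.
assert utf8_bytes.decode("utf-8") == cp1252_bytes.decode("cp1252") == text
```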

~~~
BorgHunter
Even this definition can get hairy, though. What is a character? Is 'á' one
character or two? Most human beings would say one, but in actuality I formed
it with an 'a' (U+0061) and a combining acute accent (U+0301): Two separate
code points. But you can also get the same result with 'á' (U+00E1); this is
not true of all combining character combinations.
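
A small sketch of the two spellings, assuming Python 3 and the standard `unicodedata` module. The strings compare unequal until they are normalized to the same form, and a sequence with no precomposed equivalent (the `d` plus U+034A pair is used here as an example) stays at two code points even under NFC:

```python
import unicodedata

decomposed = "a\u0301"   # 'a' + COMBINING ACUTE ACCENT
precomposed = "\u00e1"   # single precomposed code point

# Visually identical, but different code point sequences.
assert decomposed != precomposed

# NFC composes where a precomposed form exists; NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
assert nfc == precomposed and nfd == decomposed

# No precomposed form exists for this pair, so NFC leaves it alone.
no_precomposed = unicodedata.normalize("NFC", "d\u034a")
print(len(nfc), len(nfd), len(no_precomposed))
```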

In the past, I've had to deal with horrible mashups of fixed-byte-length
columns in flat text files with UTF-8 bolted onto it. In Java, no less. Trying
to figure out how to deal with all the edge cases (how do you truncate a
string when the boundary is between a "normal" character and a combining
character?) was an endless parade of the bizarre. Strings are _hard_,
fundamentally.
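
One possible way to approach the truncation question, sketched in Python rather than Java. The helper name `truncate_utf8` is hypothetical, and treating "a base plus any following code points with a nonzero combining class" as one unit is only an approximation of real grapheme clusters (it does not handle ZWJ sequences, Hangul jamo, and similar cases):

```python
import unicodedata

def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of text whose UTF-8 encoding fits in
    max_bytes, without separating a base from its combining marks."""
    pieces = []
    used = 0
    i = 0
    n = len(text)
    while i < n:
        # Extend the unit over any combining marks that follow the base.
        j = i + 1
        while j < n and unicodedata.combining(text[j]):
            j += 1
        unit = text[i:j]
        size = len(unit.encode("utf-8"))
        if used + size > max_bytes:
            break  # the whole unit does not fit, so drop it entirely
        pieces.append(unit)
        used += size
        i = j
    return "".join(pieces)

sample = "cafe\u0301s"  # 'e' followed by a combining acute accent
print(truncate_utf8(sample, 4), truncate_utf8(sample, 6))
```

With a 4-byte budget the accented letter is dropped as a whole instead of being cut to a bare "e".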

~~~
Flimm
Yes. It's much better to think of strings as a sequence of Unicode code
points.

~~~
ygra
As your parent already noted, thinking of it as a sequence of code points goes
wrong when you need to truncate a string in between a base and a combining
character.

~~~
Flimm
Not true. Take this string:

d͊

It is composed of two code points: U+0064 and U+034A. The second code point is
a combining character. The two code points together form one glyph. The term
"character" is confusing because people use different definitions for it, so I
avoid using it; the term "Unicode code point", by contrast, is very clear.

A Python 3 string is a sequence of code points. The above string is
represented like this:

      >>> print("d\u034A")
      d͊
      >>> len("d\u034A")
      2

Truncating between the base and combining code points works as expected:

      >>> "d\u034a"[0]
      'd'
      >>> "d\u034a"[1]
      '͊'
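
For anyone who wants to detect such a split rather than just perform it, here is a hedged sketch using the standard `unicodedata` module. The helper name `is_safe_cut` is made up for illustration, and a nonzero canonical combining class is only a rough stand-in for "belongs to the previous glyph":

```python
import unicodedata

s = "d\u034a"

# Code point, general category, and canonical combining class for each element.
info = [(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.combining(ch))
        for ch in s]
print(info)

def is_safe_cut(text: str, n: int) -> bool:
    """True if text[:n] does not end right before a combining mark."""
    return n >= len(text) or unicodedata.combining(text[n]) == 0

print(is_safe_cut(s, 1), is_safe_cut(s, 2))
```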

~~~
ygra
Except it _doesn't_ work as expected because users generally expect graphemes
to stay as they are instead of losing random diacritics.

~~~
Flimm
By users, do you mean Python 3 programmers?

------
moreati
A Friday challenge: In Python when is u'ß'.upper() equal to u'SS'?

I discovered one case today, there may be others. Answer:
<https://twitter.com/moreati/status/332910618858364928>

~~~
safod
Actually, this is a bug. Unicode codepoint U+1E9E is LATIN CAPITAL LETTER
SHARP S and should be the result of u"ß".upper(). This is especially so
because otherwise u"Maße".upper() (Maße means measures) returns "MASSE", which
could be confused with u"Masse".upper() (Masse means mass). In such cases,
where confusion is possible and no uppercase ß is available, the German
dictionary Duden actually suggests using SZ instead. Therefore,
u"Maße".upper() would have to return "MASZE". However, since the Python string
processing routines can hardly carry a dictionary around just to check whether
there is a similar word that would have the same uppercase spelling, this is
obviously not feasible. U+1E9E would be the way to go.

~~~
ygra
According to Unicode, ß gets converted to SS in uppercase. This is by
definition and doesn't change (stability policies, as far as I recall). Even
in German you'll never see ß capitalised as SZ (except when I do it, but I'm a
very, very small minority – and now I'm more likely to use ẞ).
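
A quick check of this behaviour, assuming Python 3.3 or later (earlier versions, including Python 2's `unicode` type, behave differently):

```python
# Uppercasing ß follows the Unicode special-casing rule and yields "SS".
sharp_s_upper = "ß".upper()
masse = "Maße".upper()

# The ambiguity mentioned above: both words collapse to the same uppercase form.
assert masse == "Masse".upper()

# The capital sharp s (U+1E9E) exists and lowercases to ß,
# but upper() does not produce it.
capital_sharp_s = "\u1e9e"
assert capital_sharp_s.lower() == "ß"
assert sharp_s_upper != capital_sharp_s

# casefold() is intended for caseless comparison and maps ß to "ss".
folded = "ß".casefold()
print(sharp_s_upper, masse, folded)
```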

------
gruseom
The epigraph to that chapter is brilliant.

