
Swift String’s ABI and UTF-8 - ingve
https://forums.swift.org/t/string-s-abi-and-utf-8/17676
======
josephg
This is really great to hear. A language's internal string representation
really shows its age:

- C sort of assumes all strings are ASCII.

- Java, C#, Obj-C, JavaScript and Python 2 were all written when it was
assumed Unicode would never have more than 65536 characters. They all use
2-byte encodings, which has become the worst of all worlds: ASCII text is
twice the size it needs to be, you still have to handle the mismatch between
array length and codepoint length, and errors are harder to find because very
few common characters actually cross the 2-byte boundary - although the
popularity of emoji has changed this (see the sketch after this list).

- Go, Rust and Zig all return to single-byte encodings. They all use UTF-8
internally, and have different functions for interacting with the string as a
list of bytes and as a list of Unicode codepoints. (In Go's case that's
because Rob Pike was behind both Go and UTF-8.)
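
A minimal sketch of that 2-byte-boundary hazard, shown with Swift's string
views (the string "a💣b" is just hypothetical example data; U+1F4A3 sits
outside the Basic Multilingual Plane):

    let s = "a💣b"
    print(s.utf16.count)          // 4 - what Java's length() / JS's .length report
    print(s.unicodeScalars.count) // 3 codepoints
    print(s.count)                // 3 grapheme clusters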

It's delightful to see Swift stepping away from Obj-C's mistakes and improving
things internally. Although in Obj-C's defence, NSString has always hidden its
internal representation, which did a lot of work to protect developers from
the footguns.

One of the big ironies of all this is how well C has stood the test of time. C
strings interoperate beautifully with UTF-8 - all that's missing is some libc
helper functions to count and iterate through UTF-8 codepoints. strlen /
strncpy / strcmp / etc. all keep working when dealing with UTF-8; the only
change is that you need to supply lengths in bytes, not characters.
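
To make the bytes-versus-characters point concrete, here is a minimal sketch
using Swift's UTF-8 view; the byte count is exactly what C's strlen() would
report for the same NUL-terminated data:

    let s = "naïve"
    print(s.count)      // 5 characters
    print(s.utf8.count) // 6 bytes - "ï" takes 2 bytes in UTF-8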

~~~
josteink
> One of the big ironies of all this is how well C has stood the test of time.
> C strings interoperate beautifully with UTF-8

I wouldn't say that.

I'd rather claim that UTF-8 was engineered to be ASCII pass-through
compatible precisely because of the lack of Unicode support in most C
codebases out there.

I mean... if you compare Unicode support on the platforms and OSes where it
was clearly thought through (Windows, C#, Java) with the platforms that took
the more naive approach (Linux, C, PHP, etc.), you will see a very clear
picture of which side has the most Unicode bugs and encoding errors.

And I say that as a Linux guy.

C is terrible for text processing; UTF-8 was designed the way it was because
of that.

Trying to paint C as the best choice here because someone found a working
solution that could also be applied to terrible C code is seeing things
backwards.

~~~
slavik81
Windows is the platform where I see most of my Unicode problems. Here's an
example of one of them:

hello.py:

        print("こ")

Ubuntu:

        $ python3 --version
        Python 3.5.2

        $ python3 hello.py
        こ

        $ python3 hello.py > out.txt
        $ cat out.txt
        こ

Windows:

        C:\Users\u\tmp>python --version
        Python 3.7.1

        C:\Users\u\tmp>python hello.py
        ?

        C:\Users\u\tmp>python hello.py > out.txt
        Traceback (most recent call last):
          File "hello.py", line 1, in <module>
            print("?")
          File "C:\Users\u\AppData\Local\Programs\Python\Python37-32\lib\encodings\
        cp1252.py", line 19, in encode
            return codecs.charmap_encode(input,self.errors,encoding_table)[0]
        UnicodeEncodeError: 'charmap' codec can't encode character '\u3053' in position
        0: character maps to <undefined>

~~~
quietbritishjim
This is because Python 3.5 and earlier used the old non-Unicode Windows API to
access the console. This was fixed in Python 3.6:

[https://docs.python.org/3/whatsnew/3.6.html#whatsnew36-pep52...](https://docs.python.org/3/whatsnew/3.6.html#whatsnew36-pep528)

~~~
jai_
But he's using Python 3.7?

------
iknowstuff
Swift's String implementation is just porn to me. There were a few misguided
attempts in Swift 1 through 3, but the final design is truly marvelous, and it
points programmers who have no idea about encodings towards the right
solutions: counting grapheme clusters correctly by default, avoiding copies
thanks to views, and not allowing direct string[subscript] access without
deliberately stepping down into the UTF-8 or UTF-16 code unit views.
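
A minimal sketch of the behaviour being praised (Swift 4+ semantics; the flag
string is just example data):

    let flag = "🇺🇸"                  // two regional-indicator scalars,
                                     // one grapheme cluster
    print(flag.count)                // 1 Character (grapheme cluster)
    print(flag.unicodeScalars.count) // 2 codepoints
    print(flag.utf16.count)          // 4 UTF-16 code units
    print(flag.utf8.count)           // 8 UTF-8 bytes
    let c = flag[flag.startIndex]    // explicit String.Index required;
                                     // flag[0] does not compile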

The fact that Swift is aiming for ABI stability, something neither C++ nor
Rust has (which is why we all rely on the C ABI for FFI), is very interesting.

~~~
eridius
Unfortunately, since Swift's String implementation operates on grapheme
clusters by default, a lot of parsing code people write is actually subtly
broken in the presence of combining characters. As a trivial example, let's
say the input is a comma-delimited string (say, a line of CSV without quotes).
The obvious way to split this is

      let fields = line.split(separator: ",")

But given the input "foo,\u{301}bar" (which looks like foo,́bar), this won't
split correctly and you'll end up with a single field that contains a comma.
The correct way to split this is

      let fields = line.unicodeScalars.split(separator: ",").map(Substring.init)

This will get you the correct 2 fields (at the cost of an intermediate array,
as there is no lazy split).

~~~
Twisell
Have you opened a ticket about that? This is more due to the split
implementation than to the format itself. Room for improvement.

~~~
eridius
This is not an issue with the split implementation. Split is behaving
correctly (according to the defined semantics of String). You'd get the exact
same issue if you said str.index(of: ",") instead; in both cases, the "," is a
Character, not a UnicodeScalar, and "," != ",\u{301}".
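
A short sketch of that comparison, in case it isn't obvious why the match
fails (the foo/bar input is the example from above):

      let plain = Character(",")
      let combined = Character(",\u{301}")     // comma + combining acute accent
      print(plain == combined)                 // false - different grapheme clusters
      print("foo,\u{301}bar".contains(plain))  // false - no bare "," Character exists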

------
ComputerGuru
Wait. A programming language written this side of 2010 stored strings in a
bastardized sometimes-ASCII, sometimes-UTF-16 encoding?

 _blink_

Why on earth not just use UTF-8 from the start? Surely no micro-optimization
made possible by that choice could be worth such a convoluted design?

~~~
josephg
There's a huge performance benefit to using a complex string type like this in
user-facing applications. The reason is that most strings are really small -
like, a few bytes small. When strings are smaller than pointers, allocating
them on the heap is silly and inefficient.

So NSString uses a bunch of wacky encodings internally to pack very small
strings into the 8-byte pointer that would otherwise point to an object on
the heap.

I suspect micro-optimizations like this have done a lot of the work in making
iOS outperform Android clock-for-clock, although I suppose that's less of a
big deal right now given how fast Apple's A-series chips are as well.

~~~
saagarjha
> So NSString uses a bunch of wacky encodings internally to pack very small
> strings into the 8 byte pointer that would otherwise be used to point to an
> object on the heap.

Nitpick: these pointers are odd, so they can't point to a valid object on the
heap (malloc is 16-byte aligned on macOS). So it makes sense to reuse them by
tagging them and packing a string (or number, or date) into the remaining bits
to save on an allocation.
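
A toy sketch of the tagging idea (this is not NSString's actual bit layout,
which is private and has changed between OS releases; the layout below is
purely hypothetical): because genuine heap pointers are 16-byte aligned,
their low bit is always 0, so an odd word can carry inline data instead.

    // Hypothetical layout: bit 0 = tag, bits 1-3 = length, higher bits = bytes.
    func tagSmallString(_ s: String) -> UInt? {
        let bytes = Array(s.utf8)
        guard bytes.count <= 7 else { return nil } // must fit in one 64-bit word
        var word: UInt = 1                         // set the tag bit
        word |= UInt(bytes.count) << 1             // stash the byte count
        for (i, b) in bytes.enumerated() {
            word |= UInt(b) << (8 + 8 * i)         // pack bytes above the header
        }
        return word
    }

    func untagSmallString(_ word: UInt) -> String? {
        guard word & 1 == 1 else { return nil }    // even: a real object pointer
        let count = Int((word >> 1) & 0b111)
        let bytes = (0..<count).map { UInt8((word >> (8 + 8 * $0)) & 0xFF) }
        return String(decoding: bytes, as: UTF8.self)
    }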

