
Why Is Swift's String API So Hard? - chmaynard
https://www.mikeash.com/pyblog/friday-qa-2015-11-06-why-is-swifts-string-api-so-hard.html
======
zeveb
> Human-readable text, file paths, SQL statements, and others are all
> conceptually different, and this should be represented as different types at
> the language level. I think that having different conceptual kinds of
> strings be distinct types would eliminate a lot of bugs. I'm not aware of
> any language or standard library that does this, though.

The Common Lisp standard includes pathname objects, with an implementation-
defined (since different platforms have different conventions) mapping from
strings to path objects. It doesn't specifically address the type-safety
issue, though, since just about every standard function which accepts a
pathname will also accept a namestring.

There exist libraries such as CLSQL which enable representation of SQL queries
as objects.

Common Lisp: doing it right for so long that most people haven't even heard of
it!

Also:

> When measuring a string to ensure that it fits in a 140-character tweet, you
> want to go by unicode code points.

Really? Doesn't one _actually_ want 140 graphemes? Regardless of what one
wants, does Twitter enforce 140 code points, or 140 bytes?

It seems to me that graphemes = characters, and that the fact that some
characters are made up of multiple codepoints is as important as the fact that
some code points are made up of multiple bytes.
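
To make the three layers concrete, here is a minimal sketch in current Swift (5.x syntax, which postdates the Swift 2 spelling used elsewhere in this thread). The sample string is an assumption chosen for illustration: a single visible "é" built from two code points.

```swift
// One visible character: "e" followed by U+0301 COMBINING ACUTE ACCENT.
let accented = "e\u{301}"

let graphemes  = accented.count                 // user-perceived characters
let codePoints = accented.unicodeScalars.count  // Unicode code points
let utf16Units = accented.utf16.count           // UTF-16 code units
let utf8Bytes  = accented.utf8.count            // UTF-8 bytes

print(graphemes, codePoints, utf16Units, utf8Bytes)  // prints "1 2 2 3"
```

Which of these a 140-"character" limit should use is exactly the open question; the sketch only shows that the answers differ.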

~~~
grey-area
Isn't this quite common? Most languages allow you to represent those strings
as different types and parse them to/from a plain string representation.

From Ruby: String, Pathname, Url, ActiveRecord::Relation etc or from Go:
strings, database/sql stmt + orm types built on top, net/url etc though file
paths are just strings.

~~~
mozumder
Yep. He basically just described "classes" in the object-oriented world.

~~~
tokenrove
I don't think it has much to do with classes. Python, for example, has no
pathname/string distinction that I know of.

Common Lisp is in the object-oriented world, and indeed some of the things
mentioned are implemented as classes. I think the point was more about the
standard library's attempt to do The Right Thing and how library implementors
tend to follow that.

(Also, for better or for worse, CL's pathnames contain information like host,
device, and version, that aren't so relevant on the major OSes in use these
days; but try dealing with VMS filenames with Ruby's Pathname...)

~~~
lispm
Device and even host are quite common in the Windows world:

[https://msdn.microsoft.com/en-us/library/windows/desktop/aa3...](https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247\(v=vs.85\).aspx)

Volume = Device

Or Microsoft's UNC path:

[https://en.wikipedia.org/wiki/Path_(computing)#Uniform_Namin...](https://en.wikipedia.org/wiki/Path_\(computing\)#Uniform_Naming_Convention)

\\\ComputerName\SharedFolder\Resource -> \\\host\device\path

------
chvid
This article reads like the Stockholm syndrome for programmers - come on - any
modern language needs to handle this stuff in a sane manner.

Recently I wanted to skip the first character of a string and then convert it
to a float.

Here is the beauty:

(text.substringFromIndex(text.startIndex.advancedBy(1)) as NSString).floatValue

(Notice the start index "concept" and then the "cast" into the other string
implementation (NSString) to get the to float conversion method.) In any
normal language it would be something like:

Float.parse(text.substring(1))

String handling in Swift is half-finished at best.

~~~
mikeash
Let's at least write your example in a sane manner:

    
    
        Float(String(s.characters.dropFirst(1)))
    

Is that so bad? The need to convert back to String is slightly annoying, but
overall this is not bad.

And of course there is the inevitable question: when you say "skip the first
character," do you mean the first grapheme cluster, the first code point, the
first UTF-16 code unit, or the first UTF-8 code unit?
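
A minimal sketch of why the choice matters, written in current Swift (5.x) rather than the Swift 2 syntax above. The input string is an assumption picked to make the difference visible: its first "character" is an "e" plus a combining accent.

```swift
// "é" written as "e" + U+0301 COMBINING ACUTE ACCENT, followed by a number.
let text = "e\u{301}1.5"

// Drop the first grapheme cluster (the unit String itself iterates over).
let byCharacter = String(text.dropFirst())

// Drop only the first code point; the combining accent is left behind.
let byScalar = String(String.UnicodeScalarView(text.unicodeScalars.dropFirst()))

print(Float(byCharacter) as Any)  // the remaining "1.5" parses
print(Float(byScalar) as Any)     // the stray accent makes parsing fail
```

The same pattern applies to the `utf16` and `utf8` views; each one answers a different version of "skip the first character."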

You say that "any modern language" needs to handle this stuff in a sane
manner? Which modern languages handle all of those possibilities sanely?

I agree that the String API has some holes in it. Which, you know, is why I
say things like "The String API does have some holes in it...." But in terms
of how it's designed, it's the only one I've seen that gets it right.

~~~
fauigerzigerk
That's pretty clean. It is also rather inefficient. I think dropFirst copies
the sequence and then the result is copied again to create a String and then
finally it gets parsed into a Float. No problem for one Float, but parsing a
largish CSV file like that may be a different matter.

At least I think that's what's happening. I'm not 100% sure.

~~~
mikeash
Nope, it's all just pointer games. I just tested it here and inspected the
guts in the debugger. My initial string containing "123456789" contains a
pointer internally to 0x000000010006d326 which contains that data. The result
of .characters contains that same pointer. The result of .dropFirst(1)
contains 0x000000010006d327 and the result of turning it back into a string
still contains 0x000000010006d327.

That's the great thing about abstract APIs and immutable data types, you
rarely need to actually _copy_ anything.

~~~
fauigerzigerk
Maybe it's a short string optimization in this case, because the docs say that
dropFirst is O(n)
[https://developer.apple.com/library/prerelease/ios/documenta...](https://developer.apple.com/library/prerelease/ios/documentation/Swift/Reference/Swift_CollectionType_Protocol/index.html)

Or they have optimized it after writing the docs. In any event, there's a lot
more going on than what is necessary to parse a float. But I guess that's OK
if the goal is to have a clean and consistent high level API.

~~~
mikeash
In that context, n is the number of items being dropped, not the length of the
collection.

~~~
fauigerzigerk
I wonder what makes it so slow then. I'm looking forward to the day when we
can look at the source code :)

~~~
mikeash
Where are you getting that it's slow?

~~~
fauigerzigerk
In my own little ad-hoc micro benchmark it is. That's why I suspected that
there was copying going on, but there could be so many other reasons for it
that it was probably a pointless exercise in the first place.

[Edit] And I mean relatively slow (4x the C++ code), not abysmally slow. Not
a big deal.

~~~
mikeash
Gotcha. Do you have optimizations on? If so, I'd guess it's just the overhead
of the extra indirection, plus dropFirst does have to examine that first
little bit of data to determine the grapheme cluster boundaries.

~~~
fauigerzigerk
Yes I've done a release build.

------
CydeWeys
Maybe it's just my first-world-privilege at work, but, honestly, why would I
want to put up with a language that makes all of my day-to-day coding tasks so
much more difficult than it is in other programming languages? It seems like
I'm losing more than I'm gaining. 99% of what I do is in English. Everything
that's not in English is going through an i18n library to convert labels and
such. It doesn't seem like I'm gaining much of anything, but I sure am losing
a lot. No default indexOf, length, and substring methods? Ouuuch.

~~~
Analemma_
Exactly. I understand where pilif and Lx1oG-AWb6h_ZG0 are coming from when
they say that indexOf, length, and substring actually get a little ill-defined
when Unicode is involved, but I can't see the logical leap from that to it
being a good idea to exclude them entirely. The motto of a good API/SDK is
"Make the easy case easy, and the hard case possible"; Swift seems to fail the
first condition. Surely there are default assumptions for these methods that
work most of the time, with the option for more complicated functionality _if
you need it_.

Maybe all this complexity is necessary when you're writing huge i18n'd apps,
but the point is that if it's too much bother to even do Hello World, I'm
never going to _get_ to the huge i18n'ed apps.

~~~
anon1385
The problem with strings is that there is very little that is easy. The lesson
is to avoid using strings as much as possible, although sadly fashions in
programming seem to have been heading in the other direction for a long time.

I don't think hiding difficult operations behind an easy facade that sometimes
breaks is a good idea. It makes people think that working with strings is
easy. It results in lots of people writing broken or even insecure code[1].

The idea that you only have to deal with unicode strings if you are writing an
i18n'd app is wrong. File names are unicode. The content of the pasteboard is
unicode. The results of calling a web API are unicode. Text the user types
into a text field is unicode. Unless your app doesn't handle outside text data
at all then it's going to encounter non-ascii text. And if you aren't handling
text from outside the app then you probably won't have much need for string
operations anyway.

[1] do a github search for 'UTF8String' and 'length' to see how much code
using NSString is passing the wrong length value into C apis.
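
A minimal sketch of that bug class in current Swift (5.x). NSString's `length` counts UTF-16 code units, so the sketch uses `utf16.count` as a stand-in for it; the `receive` function is a hypothetical placeholder for a C API that takes a UTF-8 buffer and a byte count.

```swift
let name = "caf\u{E9}"  // "café" with a precomposed é

let utf16Length = name.utf16.count  // 4: what a UTF-16 based length reports
let utf8Length  = name.utf8.count   // 5: bytes a UTF-8 consumer needs

// Hypothetical stand-in for a C API that reads `length` bytes of UTF-8.
func receive(_ bytes: [UInt8], length: Int) -> String {
    return String(decoding: bytes.prefix(length), as: UTF8.self)
}

let bytes = Array(name.utf8)
print(receive(bytes, length: utf8Length))   // the full string
print(receive(bytes, length: utf16Length))  // cut off in the middle of "é"
```

For pure ASCII the two lengths agree, which is why this kind of mistake tends to survive testing.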

------
tigeba
When I was working on one of our SDKs (pre Swift 2.0), I found it rather
maddening that the Swift string class had native support for emoji, but
developers were left to create their own implementation of very basic features
like indexOf, length, subString, etc.

~~~
pilif
... until you realise these "very basic" features are not very basic at all
and pretty much depend on your current use-case. What length do you mean? Byte
length? Character length? Code Point length? In case of indexOf, how do you
handle surrogate pairs? does indexOf('ä') only find 'ä' (LATIN SMALL LETTER A
WITH DIAERESIS) or also 'ä' (LATIN SMALL LETTER A followed by COMBINING
DIAERESIS)?

The good thing about the Swift API is that it gives you all the building
blocks needed to actually having a chance at getting this right. Many other
languages sweep those things under a rug and you're screwed or you'll have a
_much_ , _much_ harder job to get it right if you need to.
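
As a rough illustration of the 'ä' question, here is a sketch in current Swift (5.x). It assumes the two spellings named above, written with explicit escapes so they cannot be normalized away in transit.

```swift
let precomposed = "\u{E4}"    // U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
let decomposed  = "a\u{308}"  // "a" + U+0308 COMBINING DIAERESIS

// String comparison uses canonical equivalence, so these compare equal...
print(precomposed == decomposed)  // prints "true"

// ...even though the underlying code points differ.
print(precomposed.unicodeScalars.count, decomposed.unicodeScalars.count)  // prints "1 2"

// Searching by Character finds either spelling.
let word = "B\u{E4}r"
let needle: Character = "a\u{308}"
print(word.firstIndex(of: needle) != nil)  // prints "true"

// Searching the code-point view for a plain "a" finds nothing.
let plainA: Unicode.Scalar = "a"
print(word.unicodeScalars.contains(plainA))  // prints "false"
```

Both answers are available; the caller picks the view that matches the question being asked.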

~~~
cookiecaper
I think there's a pretty strong generalized use case for things like length
(count the number of characters in a string) and indexOf (return the position
of a character in a string). No reason to punish the 99% and force them to
write their own implementations of these very conventional string operations
to accommodate the 1% that means something not normally meant.

~~~
anon1385
Based on Github searches I've done in the past a lot of code calling length on
NSString was broken. Those people probably thought they were in the 99% who
just needed the convenient 'normal' length. People not realising that length
isn't giving them what they want is one of the things the Swift API aims to
solve.

Giving the wrong choice for most situations (and number of UTF-16 code units
isn't what people want most of the time) the easy convenient name is just a
recipe for broken code. It's bad API design, or at least unfortunate
historical accident, and it's good to improve things when there is an
opportunity to do so.

~~~
cookiecaper
It's not an improvement to exclude functions that you know everyone wants from
the API. It's just going to result in _more_ mess, because we all know most
people are going to search for "swift string length function" and copy and
paste direct from Stack Overflow without more than skimming it. In practice,
it just makes the behavior messy and non-standard, ultimately meaning Swift
applications are more difficult and/or annoying to debug and maintain.

The _correct_ way to solve this would be to design the API so that the
distinction between what you're getting and what you want is clear, and the
place to go to get what you actually want is also clear. Including nothing is
an admission that they couldn't do this.

~~~
anon1385
>It's not an improvement to exclude functions that you know everyone wants
from the API.

Which functions have they excluded? You mentioned counting characters, but you
haven't actually specified which definition of characters you want it to use.

>The correct way to solve this would be to design the API so that the
distinction between what you're getting and what you want is clear, and the
place to go to get what you actually want is also clear

This is what they have done.

~~~
cookiecaper
>Which functions have they excluded? You mentioned counting characters, but
you haven't actually specified which definition of characters you want it to
use.

The typical human definition of characters. Humans usually don't think about
bits or bytes and they shouldn't have to. A character is a single independent
glyph, a separate unit that would be taught to a human whilst learning to
write, regardless of the internal representation in the computer.

If I need something else, I should ask for something else.

Hypothetical API calls that may address this:

"String".length()

"String".lengthInBytes()

"String".lengthInCodePoints()

This would provide standard implementations for these common functions
(circumventing the issue of copying a random chunk of code from SO and all its
attendant problems), make it obvious that there is a difference to anyone
browsing the docs and/or using autocomplete, and make it easy to select and
use the one you actually want in a particular situation. This type of
discoverability is an important component in a usable API.

~~~
mikeash
How is your suggestion any better than the _actual_ Swift code for these?

    
    
        "String".characters.count
        "String".utf8.count
        "String".unicodeScalars.count
    

Looks pretty much the same to me in concept and typing difficulty.

~~~
brazzledazzle
I think it has less to do with typing difficulty and more to do with
annoyance. Sometimes what people like is less about technical differences and
more about previously built mental models that reduce cognitive overhead. In
other words: It lets them simply start playing and experimenting without
having to relearn things they take for granted.

If most people (let's say 99% for argument's sake) expect "string".length to
refer to character count and only 1% need something like "string".utf8.count
why not just accommodate them and make the language more accessible? This
reminds me a bit of UX design. I've seen people time and again make the
mistake over the years of designing their UX in a vacuum. What they produced
wasn't bad or hard to use, it just broke expectations.

All that said, sometimes breaking expectations is exactly what you should do
because someone has to for things to change and move in the right direction.
And sometimes when you do that you're taking one for the team, so to speak. Of
course I don't know shit, but if I were looking at making a language popular
I'd weigh doing what's "right" against my desire to increase the popularity
carefully.

~~~
mikeash
Apple is in an interesting position here because of their massive clout. They
could have introduced a language that looks like COBOL crossed with APL and it
still would have been immensely popular just because they have such a huge
base of fanatical developers. I think you're right that this sort of thing
could pose trouble for adoption of a normal language, but Apple doesn't need
to worry about it so much. I'm hoping that this will allow them to push the
Right Thing even though some developers don't like it.

------
fauigerzigerk
I think the Swift String API has a lot going for it conceptually.

But I have a big problem with knowing so little about the performance and
memory usage characteristics of the functions I'm calling and the strings I'm
keeping in data structures.

The problem is exacerbated by the fact that the API is extremely incomplete. A
ton of things can only be done by resorting to NSString functions and that, I
think, requires the String being copied into a UTF-16 representation
underneath. Or does it? I don't know. That's exactly the problem.

The idea of putting grapheme clusters at the center of the string universe is
great when it comes to text that users see and manipulate. It's excellent for
writing word processors.

But it is not convenient for analysing large amounts of text data or for
parsing semi structured text formats where we have a lot of fixed length stuff
that could be conveniently accessed using numeric slice indexes.

I think Swift's String class would work very well as a view on top of a UTF-8
buffer.

~~~
mikeash
Is a String class the right type to use for parsing text formats with lots of
fixed length stuff? Maybe Array<UInt8> would be more suitable.
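
A minimal sketch of that byte-array approach in current Swift (5.x). The record layout and the `intField` helper are hypothetical, made up for illustration: an 8-digit date followed by free text.

```swift
// A fixed-width record: 4-digit year, 2-digit month, 2-digit day, then free text.
let record = "20151106Why is Swift's String API so hard?"
let bytes = Array(record.utf8)

// Hypothetical helper: decode a byte range and parse it as an integer.
func intField(_ bytes: [UInt8], _ range: Range<Int>) -> Int? {
    return Int(String(decoding: bytes[range], as: UTF8.self))
}

let year  = intField(bytes, 0..<4)
let month = intField(bytes, 4..<6)
let day   = intField(bytes, 6..<8)

// The free-text tail goes back to String for Unicode-aware handling.
let title = String(decoding: bytes[8...], as: UTF8.self)
```

The numeric slicing is only safe here because the fixed-width fields are known to be ASCII; slicing at an arbitrary byte offset inside the free text could split a multi-byte character.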

~~~
fauigerzigerk
There's a lot of middle ground between parsing a purely structured format and
UI oriented string manipulation. Most of what I have done in my life involved
semi-structured data with both natural language text and structured parts with
fixed lengths. It's extremely inconvenient to have no string functions
available when working with byte arrays.

I actually like the C approach a lot in principle, because it doesn't make you
choose one or the other.

~~~
mikeash
Maybe it would be best to have a lot of these things we think of as "string"
functions available on arrays, like sub-sequence matching, replacement,
trimming, even something like regex. Then strings can be "all that, plus
unicode."

~~~
fauigerzigerk
Yes, I think that's a good idea.

------
patsplat
From the OP:

> (Incidentally, I think that representing all these different concepts as a
> single string type is a mistake. Human-readable text, file paths, SQL
> statements, and others are all conceptually different, and this should be
> represented as different types at the language level. I think that having
> different conceptual kinds of strings be distinct types would eliminate a
> lot of bugs. I'm not aware of any language or standard library that does
> this, though.)

There are many sub types which can be constructed from strings. Most languages
treat at least the following as special types that can be constructed from/to
strings:

* numeric types (integer, float, etc)
* regular expressions
* file paths
* dates
* XML
* JSON

The real absence is SQL. It is shocking how such an apparently old standard
has no stdlib parsing support.

Until one realizes there is no such thing as "Standard SQL". Any sufficiently
visible "SQL Parser" project eventually will be buried in a sea of vendor-
specific edge cases.

------
cballard
> I think that having different conceptual kinds of strings be distinct types
> would eliminate a lot of bugs. I'm not aware of any language or standard
> library that does this, though.

Foundation seems like the obvious one? NSURL vs. NSString?

~~~
steveklabnik
In the Rust standard library, we have String, str, OsString, Path, and
PathBuf, off the top of my head, with external crates implementing things like
URLs, ropes, and others.

That said, I completely feel what this blog post is saying: strings are
_hard_, especially if you're not just doing ASCII.

~~~
x5n1
why are they so hard?

~~~
Gankro
There are two aspects of complexity with strings:

1) 40 years of crazy encodings and languages.

2) human languages are wildly diverse and basically any assumption you wish to
apply is broken.

For 1, any system that wants to deal with the outside world needs to deal
with: operating system encodings (arbitrary bytes on unix, malformed UCS2 on
windows), C representation (null-terminated strings), systems that only work
with ASCII, systems that only work with utf8, systems that work with arbitrary
encodings/languages (HTML). This is arguably unnecessary complexity that
exists because of short-sighted decisions in the past.

2 is the necessary complexity; the fact that languages are really complicated.

There are thousands of symbols in writing. Do you try to encode these symbols
in a monolithic manner, or in a compositional way? For historical reasons, you
can often do both! ë can be a single character, or e with an accent modifier.
How do you handle string searching in such a model? Do you match `noel` with
`noël`? What's the length of noël? 4 characters? 5 characters? bytes?
graphemes? codepoints? Can you correctly reverse noël (do it wrong and you can
get leön)?
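
A small sketch of the length and reversal questions in current Swift (5.x), assuming the decomposed spelling of noël ("e" followed by a combining diaeresis):

```swift
let noel = "noe\u{308}l"  // "noël" with U+0308 COMBINING DIAERESIS

// Three different answers to "what's the length?"
print(noel.count, noel.unicodeScalars.count, noel.utf8.count)  // prints "4 5 6"

// Reversing Characters keeps each grapheme cluster intact.
let byCharacter = String(noel.reversed())

// Reversing code points detaches the diaeresis from its "e".
let byScalar = String(String.UnicodeScalarView(noel.unicodeScalars.reversed()))

print(byCharacter == byScalar)  // prints "false"
```

In the code-point reversal the combining mark ends up attached to whichever letter now precedes it, which is how the accent migrates.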

Different letters which have similar/identical representations but different
semantics/origins! Is Ε "capital e" or "capital ε"? How do you upper-case or
lower-case these letters? Do you expect to_upper(to_lower(char)) to roundtrip
(it won't)? Do you expect capitalization to be doable in-place (it's not)? Do
you expect capitalization to be region-specific (it is)?
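
One well-known instance of the round-trip and in-place problems is the German sharp s. A minimal sketch in current Swift (5.x), using only the locale-independent standard library methods:

```swift
let sharpS = "\u{DF}"               // "ß", LATIN SMALL LETTER SHARP S
let upper = sharpS.uppercased()     // "SS": one character became two
let roundTrip = upper.lowercased()  // "ss": not the original "ß"

print(sharpS.count, upper.count)    // prints "1 2"
print(roundTrip == sharpS)          // prints "false"
```

Region-specific rules (the Turkish dotted and dotless i, for example) are not covered by these two methods and need a locale-aware API instead.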

Are any of these operations even _coherent_ in a language like Japanese? Why
are you trying to do them?

God help you if you want to _display_ this text. Are you ready to handle
right-to-left text? Are you assuming that your font is monospace (hey there
terminal and text editors)? C̢̫a̘̺̯n ̘̜̦̹y̷̫̼̘̩o̶͉u̗̩̻̞
̻ẹ͡v̴̤͎̹e̶̫̠̤̭̺̤̞n̛̞̹̣̩̲͉̮ ̜͖̪͔̖d̤e̘̯ͅa̺l̟̀ ͚̗̣w̭i̸͇̠̥̣̜̥t̸h̸̻̮̼̙̹
̗̺̱̣̰̱̙z̟a̺͜l̠̦̖̟̰͍g҉̜͖͓̫ơ̩̹̰͕?̹̳̼̯̘̺̟

~~~
mikeash
I think my favorite part of this article is how my Zalgo example doesn't
render even remotely correctly in any browser I've tried it in.

------
ex3ndr
The main problem with Swift strings is actually naming. You don't have a simple
method

replace(src:String, dst:String)

but you have weird

stringByReplacingOccurrencesOfString(src, withString: dest, options: NSStringCompareOptions(), range: nil)

Why not have a simple convenience method?

~~~
alexbock
Aside from the fact that they carried that style of naming over from
Objective-C, the one benefit I see is that it makes it obvious that it returns
the result in a new string rather than modifying the object itself. In C++ you
would need to look at the signature to tell which way a method named replace
worked, e.g.

void A::replace(const A& source, const A& dest);

vs

A A::replace(const A& source, const A& dest) const;

That said, I think you can get this same benefit without such a verbose name.
Perhaps something like "withReplacement"?
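
For what it's worth, Swift's later API naming guidelines settled on short verb/participle pairs for this (the standard library's `sort()` and `sorted()` are the usual example), with the `mutating` keyword marking the in-place variant. A minimal sketch in current Swift (5.x); the `Greeting` type and its methods are hypothetical, invented for illustration:

```swift
// Hypothetical value type used only for illustration.
struct Greeting {
    var text: String

    // Imperative verb: modifies the receiver in place.
    mutating func shout() {
        text = text.uppercased()
    }

    // Participle form: returns a modified copy and leaves the receiver alone.
    func shouted() -> Greeting {
        var copy = self
        copy.shout()
        return copy
    }
}

var greeting = Greeting(text: "hello")
let loud = greeting.shouted()  // greeting.text is still "hello"
let before = greeting.text
greeting.shout()               // greeting.text is now "HELLO"
```

Calling `shout()` on a value declared with `let` is rejected at compile time, so the distinction is enforced as well as named.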

~~~
cballard
A solution to this would be to disallow mutable value types. Then, all methods
would _have_ to return a new instance.

The compiler implementation could use mutability behind the scenes for
efficiency, while the language exclusively allowed immutable values.

------
mpweiher
> Human-readable text, file paths, SQL statements, and others are all
> conceptually different, and this should be represented as different types at
> the language level.

Objective-Smalltalk[1] has Polymorphic Identifiers[2][3], which are URIs used
as identifiers in the language. So that handles file paths and web addresses.
SQL is not solved directly, but XPath can be encoded in the URI and there are
mappings from relational DB APIs to URIs[4][5][6].

I hope we can get rid of identifiers encoded as strings once and for all.

[1] [http://objective.st/URIs](http://objective.st/URIs)

[2]
[http://dl.acm.org/citation.cfm?id=2508169](http://dl.acm.org/citation.cfm?id=2508169)

[3] [https://www.hpi.uni-potsdam.de/hirschfeld/publications/media...](https://www.hpi.uni-potsdam.de/hirschfeld/publications/media/WeiherHirschfeld_2013_PolymorphicIdentifiersUniformResourceAccessInObjectiveSmalltalk_AcmDL.pdf)

[4] [http://blog.dreamfactory.com/add-a-rest-api-to-any-sql-db-in...](http://blog.dreamfactory.com/add-a-rest-api-to-any-sql-db-in-minutes)

[5]
[http://restsql.org/doc/Overview.html](http://restsql.org/doc/Overview.html)

[6] [http://www.slashdb.com](http://www.slashdb.com)

------
consto
>Human-readable text, file paths, SQL statements, and others are all
conceptually different, and this should be represented as different types at
the language level.

String, Path/URL Objects, Prepared Statements.
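
A minimal sketch of what "different types at the language level" buys, in current Swift (5.x). `FilePath`, `SQLStatement`, and `describe` are hypothetical names invented here, not taken from any library:

```swift
// Hypothetical wrapper types; each holds a String but is not interchangeable with one.
struct FilePath { let raw: String }
struct SQLStatement { let raw: String }

// Hypothetical function that only accepts SQL, never an arbitrary string.
func describe(_ statement: SQLStatement) -> String {
    return "would run: \(statement.raw)"
}

let query = SQLStatement(raw: "SELECT 1")
let path = FilePath(raw: "/tmp/data.csv")

print(describe(query))  // prints "would run: SELECT 1"
// describe(path)        // does not compile: FilePath is not SQLStatement
// describe("SELECT 1")  // does not compile: String is not SQLStatement
```

The wrappers add no runtime behavior; the point is only that mixing up the kinds of string becomes a compile-time error.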

------
chii
I think this is apropos:
[http://www.joelonsoftware.com/articles/Unicode.html](http://www.joelonsoftware.com/articles/Unicode.html)

------
dilap
While you can technically store whatever you want in there, the assumption and
convention is that the bytes in a string in Go will be UTF-8.

E.g.:

    
    
       for i, x := range someString // iterates unicode code points
    

Works well in practice. (There are the unicode and unicode/norm packages to do
more complex unicode operations.)

~~~
ridiculous_fish
I don't think it works well for text manipulation. For example, in Go, how
would you truncate a string and add ellipsis, without dropping an accent or
otherwise splitting a grapheme cluster?
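
For contrast, a sketch of the same task in current Swift (5.x), where `String` is a collection of grapheme clusters. The `truncated` helper is hypothetical, and the sample uses a decomposed "ë" so that a cut by code point would land between the letter and its accent:

```swift
// Hypothetical helper: keep at most `limit` user-perceived characters,
// appending U+2026 HORIZONTAL ELLIPSIS when anything was cut.
func truncated(_ text: String, to limit: Int) -> String {
    guard text.count > limit else { return text }
    return String(text.prefix(limit)) + "\u{2026}"
}

let name = "noe\u{308}l"  // "noël" with a combining diaeresis
let short = truncated(name, to: 3)
print(short)  // "noë" followed by an ellipsis, accent intact
```

Because `prefix` counts Characters, the combining mark travels with its base letter instead of being dropped.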

------
coldtea
>I think that having different conceptual kinds of strings be distinct types
would eliminate a lot of bugs. I'm not aware of any language or standard
library that does this, though.)

Rebol does that AFAIK.

