Why Is Swift's String API So Hard? (mikeash.com)
97 points by chmaynard on Nov 6, 2015 | 114 comments



> Human-readable text, file paths, SQL statements, and others are all conceptually different, and this should be represented as different types at the language level. I think that having different conceptual kinds of strings be distinct types would eliminate a lot of bugs. I'm not aware of any language or standard library that does this, though.

The Common Lisp standard includes pathname objects, with an implementation-defined (since different platforms have different conventions) mapping from strings to path objects. It doesn't specifically address the type-safety issue, though, since just about every standard function which accepts a pathname will also accept a namestring.

There exist libraries such as CLSQL which enable representation of SQL queries as objects.

Common Lisp: doing it right for so long that most people haven't even heard of it!

Also:

> When measuring a string to ensure that it fits in a 140-character tweet, you want to go by unicode code points.

Really? Doesn't one actually want 140 graphemes? Regardless of what one wants, does Twitter enforce 140 code points, or 140 bytes?

It seems to me that graphemes = characters, and that the fact that some characters are made up of multiple codepoints is as important as the fact that some code points are made up of multiple bytes.
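
To put numbers on that, a quick sketch in Swift 2-era playground syntax (a decomposed é):

    let e = "e\u{0301}"       // LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
    e.characters.count        // 1 grapheme cluster ("character")
    e.unicodeScalars.count    // 2 code points
    e.utf8.count              // 3 bytes: the combining accent alone takes 2 in UTF-8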


> Regardless of what one wants, does Twitter enforce 140 code points, or 140 bytes?

Number of code points after normalisation: https://dev.twitter.com/overview/api/counting-characters


Isn't this quite common? Most languages allow you to represent those strings as different types and parse them to/from a plain string representation.

From Ruby: String, Pathname, URI, ActiveRecord::Relation, etc.; or from Go: strings, database/sql statements plus ORM types built on top, net/url, etc., though file paths are just strings.


It is, even Java does that.

Where it gets interesting is advanced trickery, as performed by Haskell. You can have a String type that takes care of escaping/unescaping stuff that comes from tainted sources in a webapp, eliminating a whole class of security issues. That's just one example I know, but I'm not a Haskell guy; there are surely more.
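
You can sketch that idea in Swift with wrapper types, too; all names here are made up for illustration:

    import Foundation

    struct Tainted { let raw: String }     // came from the user; unsafe to emit as-is
    struct Escaped { let html: String }    // safe to splice into a page

    func escapeHTML(input: Tainted) -> Escaped {
        // Minimal escaping, purely for the sketch.
        let s = input.raw
            .stringByReplacingOccurrencesOfString("&", withString: "&amp;")
            .stringByReplacingOccurrencesOfString("<", withString: "&lt;")
        return Escaped(html: s)
    }

    func render(fragment: Escaped) {}      // only Escaped values can reach the page

The compiler then stops a Tainted value from reaching render() without going through escapeHTML() first.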


Yep. He basically just described "classes" in the object-oriented world.


I don't think it has much to do with classes. Python, for example, has no pathname/string distinction that I know of.

Common Lisp is in the object-oriented world, and indeed some of the things mentioned are implemented as classes. I think the point was more about the standard library's attempt to do The Right Thing and how library implementors tend to follow that.

(Also, for better or for worse, CL's pathnames contain information like host, device, and version, that aren't so relevant on the major OSes in use these days; but try dealing with VMS filenames with Ruby's Pathname...)


Device and even host are quite common in the Windows world:

https://msdn.microsoft.com/en-us/library/windows/desktop/aa3...

Volume = Device

Or Microsoft's UNC path:

https://en.wikipedia.org/wiki/Path_(computing)#Uniform_Namin...

\\ComputerName\SharedFolder\Resource -> \\host\device\path


> implementation-defined (since different platforms have different conventions) mapping from strings to path objects.

Unfortunately, this translates to a problem. We don't have that many path representations any more, but Common Lisp implementations for the same platform (or platform family, like POSIX) do the translation differently.

If you want to work with CL pathnames, but expose paths as strings to the user, and use multiple CL implementations, you have to write your own path translation which behaves the same way.

For instance, what is

  a/b/c/
? Is that a pathname with directory a b c, and an empty name? Or directory a b and name c?

Do you map suffixes to types? How about compounded suffixes like .tar.gz? Is that a "tar.gz" type or a "gz" type?


Are there any filesystems that support empty names? If not, clearly the second is correct.


Ideally, you'd provide interfaces for both interpretations.


Two interpretations across N features: (expt 2 N) interface combinations. :)


That's really only a problem if you're building interfaces that can't be composed.


I wonder if part of the reason for this is that Common Lisp dates back to a time when filesystem path structures were a lot more variable.

These days, you either have Unix-style or Windows-style paths, and that's basically it.

But go back 30 years and there was quite a bit more variety that would make an abstraction layer all but a requirement if you wanted to define a portable language spec.


Indeed.

"The dominant file systems at the time the design was done were TOPS-10, TENEX, TOPS-20, VAX VMS, AT&T Unix, MIT Multics, MIT ITS, not to mention a bunch of mainframe OS's. Some were uppercase only, some mixed, some were case-sensitive but case-translating (like CL). Some had dirs as files, some not. Some had quote chars for funny file chars, some not. Some had wildcards, some didn't. Some had :up in relative pathnames, some didn't. Some had namable root dirs, some didn't. There were file systems with no directories, file systems with non-hierarchical directories, file systems with no file types, file systems with no versions, file systems with no devices, and so on. People often critique the design now because Unix and Mac and DOS are so comparatively similar (compared to some of those others)..."

-- Kent Pitman in comp.lang.lisp, https://groups.google.com/forum/#!original/comp.lang.lisp/Pl...


I remember reading a LISP book from the 80's that gently bragged about how all the legacy C code would have to be rewritten after upcoming operating systems moved beyond identifying files by a tree of folders, but that your LISP code would be future proof.


This strikes me as one of those instances where the cost of the abstraction exceeds the savings. There are some things which are simply core to your code. Attempting to abstract those things is going to create a lot of pain when, realistically, those things changing would require component rewrites anyway.

If you're working with files, you've probably made many implicit assumptions about how the file system works and how it is represented. So you are either going to have a leaky abstraction that works, or a non-leaky abstraction that has those assumptions built in and will break when porting to a system without those assumptions anyway.


I'm actually not sure what Twitter is doing, now. Putting Emoji into the box decreases the remaining character count by one for each Emoji. But putting Zalgo into the box decreases the remaining character count by several for each grapheme cluster. My head hurts.


Twitter actually counts codepoints, but after normalisation, so it doesn't matter which of the two ways you represent é: https://dev.twitter.com/overview/api/counting-characters


Right, stupid me, emoji are still a single code point. Put a flag in the box, and the count decreases by two.
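
The arithmetic is easy to check in Swift 2-era playground syntax:

    let flag = "\u{1F1FA}\u{1F1F8}"   // REGIONAL INDICATOR SYMBOLS U + S
    flag.characters.count             // 1 grapheme cluster
    flag.unicodeScalars.count         // 2 code points, which is what Twitter counts
    flag.utf16.count                  // 4 UTF-16 code units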


Well, it depends on the emoji. Some use combining characters, e.g. a red heart is the black heart character plus some invisible joiner IIRC, and some of the family emoji are composed of their individual components (mom/dad/girl/boy) with joiners.


Yes, you're right. In a fit of silliness I was testing only with the ones that are a single code point, though.


Twitter has a limit of 140 characters so that a tweet fits in a regular SMS (the remaining 20 characters are used for the username).

From https://en.wikipedia.org/wiki/Short_Message_Service

> Short messages can be encoded using a variety of alphabets: the default GSM 7-bit alphabet, the 8-bit data alphabet, and the 16-bit UCS-2 alphabet. Depending on which alphabet the subscriber has configured in the handset, this leads to the maximum individual short message sizes of 160 7-bit characters, 140 8-bit characters, or 70 16-bit characters.


That's not true. Firstly, a single tweet doesn't necessarily fit into a single SMS (as the Twitter limit is per character, whereas an SMS is limited per byte).

Secondly, an SMS is limited to 140 bytes, which is 160 7-bit 'characters', 140 8-bit 'characters', or 70 16-bit 'characters' (in UCS-2, normally).


It was true. Twitter heavily emphasized SMS back when it was new.

The 140-thingy limit hasn't been about SMS for a while now, and has just transitioned into a weird artificial limitation that's become part of what Twitter is. But that is how it started out.


That might be the origin, but that is not the current implementation. For example, URLs only ever count as 23 letters, because they always get shortened using Twitter's own t.co shortener. BUT they still display in their original form, so unless they are de-referencing the shortened URL on the fly, they have to be storing both what the user typed and a pre-processed form of the text. That's a lot more than 140 ASCII characters.


What's really fun is that URLs always count as 23 characters even if they're shorter. For example, I had a tweet about Apple's Keynote.app fail to send the other day, because Twitter thought "Keynote.app" was a URL and "shortened" it such that my tweet exceeded the length limit afterwards. Fun times.


Yeah, it's not good. I think Twitter would do well to not hurt users with things like that. Use the restraint of UI to keep people's posts small. It's difficult to type a screed into a single-line textbox. But get rid of any actual character limit altogether.


This article reads like Stockholm syndrome for programmers - come on - any modern language needs to handle this stuff in a sane manner.

Recently I wanted to skip the first character of a string and then convert it to a float.

Here is the beauty:

    (text.substringFromIndex(text.startIndex.advancedBy(1)) as NSString).floatValue

(Notice the start index "concept" and then the "cast" into the other string implementation (NSString) to get at the float conversion method.) In any normal language it would be something like:

    Float.parse(text.substring(1))

String handling in Swift is half-finished at best.


Let's at least write your example in a sane manner:

    Float(String(s.characters.dropFirst(1)))
Is that so bad? The need to convert back to String is slightly annoying, but overall this is not bad.

And of course there is the inevitable question: when you say "skip the first character," do you mean the first grapheme cluster, the first code point, the first UTF-16 code unit, or the first UTF-8 code unit?

You say that "any modern language" needs to handle this stuff in a sane manner? Which modern languages handle all of those possibilities sanely?

I agree that the String API has some holes in it. Which, you know, is why I say things like "The String API does have some holes in it...." But in terms of how it's designed, it's the only one I've seen that gets it right.


That's pretty clean. It is also rather inefficient. I think dropFirst copies the sequence and then the result is copied again to create a String and then finally it gets parsed into a Float. No problem for one Float, but parsing a largish CSV file like that may be a different matter.

At least I think that's what's happening. I'm not 100% sure.


Nope, it's all just pointer games. I just tested it here and inspected the guts in the debugger. My initial string containing "123456789" contains a pointer internally to 0x000000010006d326 which contains that data. The result of .characters contains that same pointer. The result of .dropFirst(1) contains 0x000000010006d327 and the result of turning it back into a string still contains 0x000000010006d327.

That's the great thing about abstract APIs and immutable data types, you rarely need to actually copy anything.


Maybe it's a short string optimization in this case, because the docs say that dropFirst is O(n) https://developer.apple.com/library/prerelease/ios/documenta...

Or they have optimized it after writing the docs. In any event, there's a lot more going on than what is necessary to parse a float. But I guess that's OK if the goal is to have a clean and consistent high level API.


In that context, n is the number of items being dropped, not the length of the collection.


I wonder what makes it so slow then. I'm looking forward to the day when we can look at the source code :)


Where are you getting that it's slow?


In my own little ad-hoc micro benchmark it is. That's why I suspected that there was copying going on, but there could be so many other reasons for it that it was probably a pointless exercise in the first place.

[Edit] And I mean relatively slow (4x the C++ code), not abysmally slow. Not a big deal.


Gotcha. Do you have optimizations on? If so, I'd guess it's just the overhead of the extra indirection, plus dropFirst does have to examine that first little bit of data to determine the grapheme cluster boundaries.


Yes I've done a release build.


Thank you for that. Who says you can't learn anything by ranting on HN ... :-)

As for the inevitable question - if I write text.substring(1) in JavaScript - why are people not asking the same question?


I don't know why people aren't asking the same question for JavaScript. Certainly, that call is going to cause trouble sometimes. Just try it on the string "<smiley emoji> I love it." (HN is unhappy with emoji, apparently. Surprise! Stuff that doesn't fit into UTF-16 often breaks all sorts of things.) The results are rather unexpected. But it works often enough that I assume people just don't really think about it much.

That's the trouble with string manipulation: it's usually easy to make a 99% solution and hard to make a 100% solution. Swift is attempting to prod you into the 100% solution by making the 99% solution more difficult, and the 100% solution maybe a bit easier.


Do you want to proceed forward one byte, "character", grapheme, or codepoint?


I get the idea behind very explicit method names, but seriously, Python-like (or whoever invented it) slices are the most straightforward and succinct I've seen so far, even without the [concrete:syntax].


Maybe it's just my first-world-privilege at work, but, honestly, why would I want to put up with a language that makes all of my day-to-day coding tasks so much more difficult than they are in other programming languages? It seems like I'm losing more than I'm gaining. 99% of what I do is in English. Everything that's not in English is going through an i18n library to convert labels and such. It doesn't seem like I'm gaining much of anything, but I sure am losing a lot. No default indexOf, length, and substring methods? Ouuuch.


Exactly. I understand where pilif and Lx1oG-AWb6h_ZG0 are coming from when they say that indexOf, length, and substring actually get a little ill-defined when Unicode is involved, but I can't see the logical leap from that to it being a good idea to exclude them entirely. The motto of a good API/SDK is "Make the easy case easy, and the hard case possible"; Swift seems to fail the first condition. Surely there are default assumptions for these methods that work most of the time, with the option for more complicated functionality if you need it.

Maybe all this complexity is necessary when you're writing huge i18n'd apps, but the point is that if it's too much bother to even do Hello World, I'm never going to get to the huge i18n'ed apps.


The problem with strings is that there is very little that is easy. The lesson is to avoid using strings as much as possible, although sadly fashions in programming seem to have been heading in the other direction for a long time.

I don't think hiding difficult operations behind an easy facade that sometimes breaks is a good idea. It makes people think that working with strings is easy. It results in lots of people writing broken or even insecure code[1].

The idea that you only have to deal with unicode strings if you are writing an i18n'd app is wrong. File names are unicode. The content of the pasteboard is unicode. The results of calling a web API are unicode. Text the user types into a text field is unicode. Unless your app doesn't handle outside text data at all then it's going to encounter non-ascii text. And if you aren't handling text from outside the app then you probably won't have much need for string operations anyway.

[1] do a github search for 'UTF8String' and 'length' to see how much code using NSString is passing the wrong length value into C APIs.
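
The shape of that bug, sketched in Swift 2-era syntax (someCApi is hypothetical):

    import Foundation

    let s: NSString = "héllo"
    // Wrong: length counts UTF-16 code units (5 here), but the UTF-8 buffer is 6 bytes.
    someCApi(s.UTF8String, s.length)
    // Right: ask for the byte count in the encoding actually being passed.
    someCApi(s.UTF8String, s.lengthOfBytesUsingEncoding(NSUTF8StringEncoding))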


Because you're making mistakes in those other languages? Because the world is not just English anymore, and "going through an i18n library" is all well and good until you need to accept literally any form of user input, which is kind of the point of programming.

> No default indexOf, length, and substring methods?

This makes me think you didn't read the post. There is no lack of those methods. There isn't a one-size-fits-all version, because the problem space isn't generalizable.


Because it is harder to do the wrong thing. Same as the arguments for static typing and defensive coding. An i18n library will not save you if you are fiddling around with indexOf and substring.

I think what we actually want is a separate string type for ASCII-only strings.


Do you never deal with user input?

indexOf, length, substring, and all the rest are still in there. They just require you to pick which representation you want to work on first. This is often just a matter of typing .characters after the string variable.
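
E.g., a sketch in Swift 2-era syntax:

    let s = "naïve"
    s.characters.count           // 5
    s.characters.indexOf("ï")    // an index into the grapheme-cluster view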


In Europe we do localize our applications, no such luck with 99% English.


When I was working on one of our SDKs (pre Swift 2.0), I found it rather maddening that the Swift string class had native support for emoji, but developers were left to create their own implementation of very basic features like indexOf, length, subString, etc.


... until you realise these "very basic" features are not very basic at all and pretty much depend on your current use-case. What length do you mean? Byte length? Character length? Code Point length? In the case of indexOf, how do you handle surrogate pairs? Does indexOf('ä') only find 'ä' (LATIN SMALL LETTER A WITH DIAERESIS) or also 'ä' (LATIN SMALL LETTER A followed by COMBINING DIAERESIS)?

The good thing about the Swift API is that it gives you all the building blocks needed to actually have a chance at getting this right. Many other languages sweep those things under a rug, and you're screwed, or you'll have a much, much harder job to get it right if you need to.


>... until you realise these "very basic" features are not very basic at all and pretty much depend on your current use-case. What length do you mean? Byte length? Character length? Code Point length?

Still VERY basic, and all should be provided by the standard library.

("basic" as: very basic and frequent needs. They are of course quite complicated to write. Which is even more of a reason to have them written for developers in the standard library).


I think there's a pretty strong generalized use case for things like length (count the number of characters in a string) and indexOf (return the position of a character in a string). No reason to punish the 99% and force them to write their own implementations of these very conventional string operations to accommodate the 1% who mean something out of the ordinary.


Once you decide what a "character" means, these operations are easy. You want the count of grapheme clusters? string.characters.count. You want the index of a particular UTF-16 code unit? string.utf16.indexOf(codeUnit). If you need that index in the form of the number of code units from the start of the string, string.utf16.startIndex.distanceTo(index).


But the definition of "character" is not generalized, therefore the use case cannot be generalized.


Based on Github searches I've done in the past a lot of code calling length on NSString was broken. Those people probably thought they were in the 99% who just needed the convenient 'normal' length. People not realising that length isn't giving them what they want is one of the things the Swift API aims to solve.

Giving the easy, convenient name to the wrong choice for most situations (and the number of UTF-16 code units isn't what people want most of the time) is just a recipe for broken code. It's bad API design, or at least an unfortunate historical accident, and it's good to improve things when there is an opportunity to do so.


It's not an improvement to exclude functions that you know everyone wants from the API. It's just going to result in more mess, because we all know most people are going to search for "swift string length function" and copy and paste direct from Stack Overflow without more than skimming it. In practice, it just makes the behavior messy and non-standard, ultimately meaning Swift applications are more difficult and/or annoying to debug and maintain.

The correct way to solve this would be to design the API so that the distinction between what you're getting and what you want is clear, and the place to go to get what you actually want is also clear. Including nothing is an admission that they couldn't do this.


>It's not an improvement to exclude functions that you know everyone wants from the API.

Which functions have they excluded? You mentioned counting characters, but you haven't actually specified which definition of characters you want it to use.

>The correct way to solve this would be to design the API so that the distinction between what you're getting and what you want is clear, and the place to go to get what you actually want is also clear

This is what they have done.


>Which functions have they excluded? You mentioned counting characters, but you haven't actually specified which definition of characters you want it to use.

The typical human definition of characters. Humans usually don't think about bits or bytes and they shouldn't have to. A character is a single independent glyph, a separate unit that would be taught to a human whilst learning to write, regardless of the internal representation in the computer.

If I need something else, I should ask for something else.

Hypothetical API calls that may address this:

"String".length()

"String".lengthInBytes()

"String".lengthInCodePoints()

This would provide standard implementations for these common functions (circumventing the issue of copying a random chunk of code from SO and all its attendant problems), make it obvious that there is a difference to anyone browsing the docs and/or using autocomplete, and make it easy to select and use the one you actually want in a particular situation. This type of discoverability is an important component in a usable API.


How is your suggestion any better than the actual Swift code for these?

    "String".characters.count
    "String".utf8.count
    "String".unicodeScalars.count
Looks pretty much the same to me in concept and typing difficulty.


I think it has less to do with typing difficulty and more to do with annoyance. Sometimes what people like is less about technical differences and more about previously built mental models that reduce cognitive overhead. In other words: it lets them simply start playing and experimenting without having to relearn things they take for granted.

If most people (let's say 99% for argument's sake) expect "string".length to refer to character count and only 1% need something like "string".utf8.count, why not just accommodate them and make the language more accessible? This reminds me a bit of UX design. I've seen people time and again make the mistake over the years of designing their UX in a vacuum. What they produced wasn't bad or hard to use, it just broke expectations.

All that said, sometimes breaking expectations is exactly what you should do because someone has to for things to change and move in the right direction. And sometimes when you do that you're taking one for the team, so to speak. Of course I don't know shit, but if I were looking at making a language popular I'd weigh doing what's "right" against my desire to increase the popularity carefully.


Apple is in an interesting position here because of their massive clout. They could have introduced a language that looks like COBOL crossed with APL and it still would have been immensely popular just because they have such a huge base of fanatical developers. I think you're right that this sort of thing could pose trouble for adoption of a normal language, but Apple doesn't need to worry about it so much. I'm hoping that this will allow them to push the Right Thing even though some developers don't like it.


Discoverability. Most people are going to try out "String".length first because that's what most other languages have. Whilst typing, if their IDE supports autocomplete, they'll see that 3 types of "String".length are available.

Most people are not going to read through the entire list of methods of String and thereby discover .characters, .utf8, and .unicodeScalars when they want the character count of a string.

I acknowledge that it may not be difficult to find this information with a quick Google search, but the more that can be done without having to switch to the browser or docs to look something up, the better. Injecting these so that they appear more frequently on the most likely UX path is better.


If anything I like it more because it makes it very clear what you are getting back. Whereas with "length" I would be heading to the docs to check what that means.


>Those people probably thought they were in the 99% who just needed the convenient 'normal' length.

And they probably indeed were.


... This entire article is about how those operations aren't basic at all, and will cause bugs if you're not careful.


I think the Swift String API has a lot going for it conceptually.

But I have a big problem with knowing so little about the performance and memory usage characteristics of the functions I'm calling and the strings I'm keeping in data structures.

The problem is exacerbated by the fact that the API is extremely incomplete. A ton of things can only be done by resorting to NSString functions, and that, I think, requires the String to be copied into a UTF-16 representation underneath. Or does it? I don't know. That's exactly the problem.

The idea of putting grapheme clusters at the center of the string universe is great when it comes to text that users see and manipulate. It's excellent for writing word processors.

But it is not convenient for analysing large amounts of text data or for parsing semi-structured text formats where we have a lot of fixed-length stuff that could be conveniently accessed using numeric slice indexes.

I think Swift's String class would work very well as a view on top of a UTF-8 buffer.


Is a String class the right type to use for parsing text formats with lots of fixed-length stuff? Maybe Array<UInt8> would be more suitable.


There's a lot of middle ground between parsing a purely structured format and UI oriented string manipulation. Most of what I have done in my life involved semi-structured data with both natural language text and structured parts with fixed lengths. It's extremely inconvenient to have no string functions available when working with byte arrays.

I actually like the C approach a lot in principle, because it doesn't make you choose one or the other.


Maybe it would be best to have a lot of these things we think of as "string" functions available on arrays, like sub-sequence matching, replacement, trimming, even something like regex. Then strings can be "all that, plus unicode."


Yes, I think that's a good idea.


From the OP:

> (Incidentally, I think that representing all these different concepts as a single string type is a mistake. Human-readable text, file paths, SQL statements, and others are all conceptually different, and this should be represented as different types at the language level. I think that having different conceptual kinds of strings be distinct types would eliminate a lot of bugs. I'm not aware of any language or standard library that does this, though.)

There are many subtypes which can be constructed from strings. Most languages treat at least the following as special types that can be constructed from/to strings:

* numeric types (integer, float, etc.)
* regular expressions
* file paths
* dates
* XML
* JSON

The real absence is SQL. It is shocking how such an apparently old standard has no stdlib parsing support.

Until one realizes there is no such thing as "Standard SQL". Any sufficiently visible "SQL Parser" project eventually will be buried in a sea of vendor-specific edge cases.


> I think that having different conceptual kinds of strings be distinct types would eliminate a lot of bugs. I'm not aware of any language or standard library that does this, though.

Foundation seems like the obvious one? NSURL vs. NSString?


Typing strings to solve bugs is the epitome of an anti-pattern.

If you want type, yank the data out of the string representation and into an actual data type!

"The string is a stark data structure and everywhere it is passed there is much duplication of process. It is a perfect vehicle for hiding information." -- Alan Perlis, Epigram #34.


>, yank the data out of the string representation

Well, the very act of "yanking" can be helped by more well-defined behavior policies of "typed strings". It is unavoidable that the programmer will have to confront a contiguous piece of memory (string) with no boundaries between characters, whether it's a block of bytes on disk, a network stream, or the clipboard buffer. It can be helpful to differentiate the various types of strings with a formal type system before the yanking takes place.

The Alan Perlis epigram is wise but I don't think it applies here. I think he's talking about abusing strings as an ad hoc data format by stuffing them with multi-value data and then turning right around and parsing them. For example, concatenating name+":"+id+":"+amount into a string with colons like "John:389432:$5.99", passing the "complex" string around, and then later parsing out the colons with string.split(). Basically, the programmer is making an in-memory CSV (comma-separated values) format instead of using a C struct {} or passing 3 separate formal parameters.
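
That anti-pattern, sketched in Swift next to the alternative:

    let record = "John" + ":" + "389432" + ":" + "$5.99"          // the in-memory "csv"
    let fields = record.characters.split(":").map(String.init)    // ["John", "389432", "$5.99"]
    // versus passing a real compound type around in the first place:
    struct Payment { let name: String; let id: Int; let amount: String }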


> ...before the yanking takes place.

I take it that you are advocating for typed strings in addition to fully-formed types for things like URLs and SQL, but I think all that will lead to is a lot of gratuitous casting of strings just so they can be passed to the fully-formed types. Furthermore, few of these casts could be statically checked for validity.


>, but I think all that will lead to is a lot of gratuitous casting of strings

But if programmers are not doing memcpy() of bytes directly to the stack frame, they are already doing a roundabout version of "casting".

If a programmer has to "yank" from a string to put it into a more organized data type, the "yank" has to be done within the framework of semantics. If one "yanks", "slices", "substr"s, etc., characters 3 to 15, the exact behavior depends on whether the contiguous bytes are ASCII, UTF8, UTF16, or a binary blob (which means NUL(0) is a valid byte in the middle of the block of bytes).

That "meaning overlayed onto those bytes to manipulate it correctly" has to decided somewhere. It is either implicit or explicit.

Likewise, if one pulls 4 bytes from disk or network, the programmer has to assign meaning to them and say they are the 4 bytes of an IEEE floating point (float) or the 4 bytes of an integer (int). The programmer may later want to do a bitshift (<<), which is not applicable to a float.


I think the first part of your argument works equally well for strings generally as for typed strings, and so I don't think it shows the latter are better than the former.

>That "meaning overlayed onto those bytes to manipulate it correctly" has to decided somewhere.

Sure, but I don't think it has to be done twice. A string type serves the purpose of abstracting the stringiness of a byte sequence, and an URL type serves the purpose of abstracting the URL-iness of strings, and to my mind, adding something to abstract the URL-iness of a byte sequence is a gratuitous redundancy that conflicts with the principle of separating concerns.


>I think the first part of your argument works equally well for strings generally as for typed strings,

But NameAscii.CharacterIndex(3) has a different meaning than NameUtf16.CharacterIndex(3). The byte index could be 3 or 6+n depending on semantics.

>Sure, but I don't think it has to be done twice. [...]abstracting the URL-iness of strings,

There's no redundancy. If I have a config file that has user-specified URLs in it, it is more likely that the programmer will know whether that file on disk is ASCII or UTF-16. The programmer can know that before the compiler or runtime can know it and assign a formal type to the incoming string. This is independent knowledge from a URL data type. A URL data type isn't going to have member functions to pull data directly off of the disk. That's a separate concern, as you say.

>, adding something to abstract the URL-iness of a byte sequence is a gratuitous redundancy that conflicts with the principle of separating concerns.

We are envisioning different things. I thought of it as:

  - bytes -> typed string -> URL
I think you're thinking of:

  -        -- typed string
  - bytes /
  -       \-- URL
... and yes that would be redundant for a URL data type to understand all kinds of raw bytes.


What I am actually thinking of is this:

  bytes -> string
                   -> URL
                   -> SQL
                   ...
Once you type strings, you get multiplicities. Completing your diagram, we get

  bytes -> string
        -> URL-string -> URL
        -> SQL-string -> SQL
        ...
or

  bytes -> string
                  -> URL-string -> URL
                  -> SQL-string -> SQL
                  ...
There's the redundancy. What are the typed strings doing for us?

Edit: I don't think these diagrams are formally correct - I don't think of URL as a subtype of string, even though it has a string representation and URL objects can be constructed from strings - but I think they make my point clear.


(LATE EDIT: I now see that your responses were influenced by cballard's original examples of NSURL and NSString but I wasn't talking about those.)

I was thinking of ASCII/Unicode string types and not the URLString vs SQLString types. Yes, a URLString would be redundant. In C#, the URI data type contains a Utf16 string as a data member but it is not another type of string.

We were talking about different things.

The ASCII/Unicode string data types would be somewhat analogous to int32/int64/float32/float64/Money data types.

The way C# does it is that everything is UTF-16 and there is no separate ASCIIString data type for pure ASCII strings. They also don't have pure UTF-8 strings. Instead, the UTF-16 string class has member get()/encode() functions that interpret it as ASCII or UTF-8. The ASCII/UTF-8 data state is transitory. This architectural choice is fine for most apps, but if you have 1 gigabyte of ASCII data, it's very memory inefficient to double the memory footprint to 2 GB for UTF-16 just to slice & dice strings. If you use byte[] arrays to keep everything as raw ASCII to conserve memory, you don't get any convenient string-like slice & dice functions. The "string type" is lost in an opaque byte[] array.


> We were talking about different things.

You started by disagreeing with kazinator's objection to typed strings, and for several posts after that you explicitly defended the concept of typed strings. Now you are trying to rewrite the discussion as if it were about the internal representation of strings, which it has never been.


>You started by disagreeing with kazinator's objection to typed strings,

You misunderstood our subthread. Kazinator was talking about "fat" strings. I was pointing out that there can be "string types" that help detect defects. We had 2 different ideas of "string types". I can't say we disagreed so much as clarified what each of us was talking about.

I also don't care for your accusatory tone, as if I'm deceiving people. My examples of string types have been consistent from the very beginning. I was never talking about NSURL/UrlString but you did. I simply didn't pick up on that divergence, which was huge.

Also, an internal memory representation has ramifications on external semantics. That's not a rewrite. That's been my point all along (see my earlier clipboard example). A sequence of bytes on disk, or a network port, or memory always requires correct semantics to process.


Kazinator has the last word on what he meant, but his original post, which is short, clear, and to-the-point, contains no such qualifications.


He qualified it with the Alan Perlis epigram:

"The string is a stark data structure and everywhere it is passed there is much duplication of process. It is a perfect vehicle for hiding information."

My "string type" examples to help uncover defects in parsing & yanking were not relevant to Alan Perlis criticisms. The "hiding information" of strings is an orthogonal issue.


Given that Kazinator's post is in reply to one that specifically gives NSString vs. NSURL as an example, and you were making a point that you say does not apply to that example, then the onus was on you to make it clear that you were going off on a tangent.

>My "string type" examples to help uncover defects in parsing & yanking were not relevant to Alan Perlis criticisms.

Did you mean to write that Alan Perlis' criticisms were not relevant to your examples? Anyway, this should perhaps have brought it to your attention that you were going off on a tangent, given that Kazinator chose to qualify his point with this quote.


Yes, kazinator replied to a post that happened to have NSURL but he made a very generalized sounding piece of advice: "Typing strings to solve bugs is the epitome of an anti-pattern."

It is not unreasonable to interpret that sentence as some kind of "universal" advice. If he meant it in specific circumstances, he might have said, "solve that kind of bug" instead of "solve bugs". (It's not even clear if kazinator felt that NSURL is a bad example of string-but-not-really-a-string. NSURL is an object with well-defined structure and not a pure string. It therefore doesn't fit the criticism of the Alan Perlis epigram)

If I went off on a tangent, then I felt the door was opened by kazinator's (seemingly) universal statement which expanded the scope of discussion. Whether kazinator meant it universally is irrelevant because that's how I read it and that point in time is long gone. All my subsequent replies flowed from that (mis)interpretation. In any case, writing universal replies to posts with specific examples is not uncommon on HN or any forum for that matter. It's typical ebb & flow of discussions.

Lastly, I looked at your history of posts and you're just a generally argumentative type of communicator. The mods have asked people to read posts with a charitable interpretation but you don't seem to follow that spirit. I'm fine with debate but I'll ignore your posts with needless hostility. Regards,


I have to admit that there is some truth there, and I apologize for implying that you were deliberately misrepresenting the issues under discussion. I wasn't giving much thought to your arguments because they seemed to be fundamentally missing the point, and you were probably thinking the same thing.

As it happens, I was considering the general case, and the examples of URLs and SQL came from Mike Ash's article, where he was considering the interpretation of text as information. My interpretation of cballard's post is that he regards an URL to be primarily a string, and therefore that an NSURL is, in some way, a typed string. I didn't think it was a good example, but as I was making a general point, I did not see that as mattering - I could make my point with Mike Ash's general examples instead.

My interpretation of Mike Ash's original statement is that it would be useful to have different string types where strings have different interpretations, and I disagree - the semantics of a type do not depend on its textual representation (if that were so, it would not be possible to translate those semantics between languages.) If I now understand your position correctly, it is that it is useful to have different string types because there are several different ways to interpret the underlying bit sequence as a valid string. I am not convinced that is the best solution to the problem created by us having incompatible bytes-as-text representations, but I agree that it is not wrong to consider it as an exception to the general rule.


>Now you are trying to rewrite the discussion

Take it from someone that used to be very uncharitable in my interpretations of other's words and actions and was subsequently very argumentative and hostile: this is something you need to extract from yourself.

The sooner you take that step the better because it takes a long time to retrain yourself out of bad habits. I'm still working on it and every time I fail I wish I'd started sooner.

And even if you think I'm wrong, look at this way: You're not going to convince the person you're doing this to of your accusation and the other people reading it are going to ignore your argument and will just think you're being a hostile asshole. At that point you may as well be talking to yourself.


> That "meaning overlayed onto those bytes to manipulate it correctly" has to decided somewhere.

The correct place is at the earliest possible point: the input into the system. Parse the textual cruft into a data structure as early in the data processing pipeline as possible. (And treat any deviation from this, even for the sake of optimization, cautiously.)


>Parse the textual cruft into a data structure as early in the data processing pipeline as possible.

Totally agree. But whether we "parse" or "yank", that processing requires semantics. Data types can help with correct semantics to interpret the textual cruft the way the programmer intended.

The other issue, of strings acting as ad hoc compound data formats, is orthogonal.


In the Rust standard library, we have String, str, OsString, Path, and PathBuf, off the top of my head, with external crates implementing things like URLs, ropes, and others.

That said, I completely feel what this blog post is saying: strings are _hard_, especially if you're not just doing ASCII.


std provides String (utf8), OsString (bytes/wtf8), PathBuf (bytes/wtf8), CString (null-terminated).

Each has their own unsized view: str, OsStr, Path, and CStr.

We also provide AsciiExt for those times where you really truly believe you want to be working with a String as Ascii.

That said, we generally try to make it as ergonomic as possible to pass a plain str where a Path/OsStr is expected. This is because utf8 is a subset of wtf8, so it's always fine to convert in that direction blindly (and it's really nice to just be like `File::open("foo.txt")` when hacking something together). This is why so many interfaces are riddled with something like `P: AsRef<Path>`. The differentiation largely exists for the other direction, IMO. Paths and OsStrs aren't guaranteed to be valid UTF8, and shouldn't be provided where a proper utf8 string is expected.

Path is just a convenience wrapper over OsStr that understands the platform's separator conventions and provides convenient utilities.


Don't forget we kinda sorta have byte characters and strings a la `b'x'` and `b"foo"`, which are really just simpler ways of expressing byte slices. Unfortunately they lack string specific methods until we get specialization (fingers crossed).


why are they so hard?


There are two aspects of complexity with strings:

1) 40 years of crazy encodings and languages.

2) human languages are wildly diverse and basically any assumption you wish to apply is broken.

For 1, any system that wants to deal with the outside world needs to deal with: operating system encodings (arbitrary bytes on unix, malformed UCS2 on windows), C representation (null-terminated strings), systems that only work with ASCII, systems that only work with utf8, systems that work with arbitrary encodings/languages (HTML). This is arguably unnecessary complexity that exists because of short-sighted decisions in the past.

2 is the necessary complexity; the fact that languages are really complicated.

There are thousands of symbols in writing. Do you try to encode these symbols in a monolithic manner, or in a compositional way? For historical reasons, you can often do both! ë can be a single character, or e with an accent modifier. How do you handle string searching in such a model? Do you match `noel` with `noël`? What's the length of noël? 4 characters? 5 characters? bytes? graphemes? codepoints? Can you correctly reverse noël (do it wrong and you can get leön)?
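
For what it's worth, Swift's grapheme-first model is one answer to these (a sketch in Swift 2-era playground syntax):

    let noel = "noe\u{0301}l"             // decomposed: e + COMBINING ACUTE ACCENT
    noel.characters.count                 // 4 graphemes
    noel.unicodeScalars.count             // 5 code points
    noel == "noël"                        // true: comparison is canonical, not byte-wise
    String(noel.characters.reverse())     // "lëon", accent still attached to the e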

Different letters which have similar/identical representations but different semantics/origins! Is Ε "capital e" or "capital ε"? How do you upper-case or lower-case these letters? Do you expect to_upper(to_lower(char)) to roundtrip (it won't)? Do you expect capitalization to be doable in-place (it's not)? Do you expect capitalization to be region-specific (it is)?
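
The region-specific part is observable through Foundation (a sketch; Turkish dots its capital I):

    import Foundation

    "i".uppercaseString                                               // "I"
    "i".uppercaseStringWithLocale(NSLocale(localeIdentifier: "tr"))   // "İ"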

Are any of these operations even coherent in a language like Japanese? Why are you trying to do them?

God help you if you want to display this text. Are you ready to handle right-to-left text? Are you assuming that your font is monospace (hey there terminal and text editors)? C̢̫a̘̺̯n ̘̜̦̹y̷̫̼̘̩o̶͉u̗̩̻̞ ̻ẹ͡v̴̤͎̹e̶̫̠̤̭̺̤̞n̛̞̹̣̩̲͉̮ ̜͖̪͔̖d̤e̘̯ͅa̺l̟̀ ͚̗̣w̭i̸͇̠̥̣̜̥t̸h̸̻̮̼̙̹ ̗̺̱̣̰̱̙z̟a̺͜l̠̦̖̟̰͍g҉̜͖͓̫ơ̩̹̰͕?̹̳̼̯̘̺̟


I think my favorite part of this article is how my Zalgo example doesn't render even remotely correctly in any browser I've tried it in.


Because we spent a lot of time optimizing every assumption for the ascii-only case, and letting go of that simplification is distressing, I guess.


Yes. And many languages had ASCII-only stuff for a long time, so "real" APIs seem much, much more complex. Of course, that complexity was always there, but us English speakers could mostly just ignore it...


Sort of. One spot of trouble is that NSString paths still exist and show up all over the place, even though NSURL is slowly replacing them. Another spot of trouble is that NSURL ends up in a similar situation, because sometimes NSURL means "an arbitrary URL" but often it means "a local file URL and nothing else." NSURL can hold arbitrary URLs with arbitrary schemas, but there are a ton of APIs in the frameworks which take NSURLs but only if they're file: URLs. It would be really nice to have a NSFileRef class or something like that which means, really, this is a reference to a local file, promise.


Yeah, I feel like all of the NSString-based APIs should have been deprecated by now?


/Regexp?/


The main problem with Swift strings is actually naming. You don't have a simple method

    replace(src: String, dst: String)

but you have the weird

    stringByReplacingOccurrencesOfString(src, withString: dest, options: NSStringCompareOptions(), range: nil)

Why not have a simple convenience method?
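
For what it's worth, a two-line extension buys you the convenience method; the underlying NSString call really does exist without the options:/range: arguments:

    import Foundation

    extension String {
        func replace(src: String, _ dst: String) -> String {
            return stringByReplacingOccurrencesOfString(src, withString: dst)
        }
    }

    "hello world".replace("world", "Swift")   // "hello Swift"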


Aside from the fact that they carried that style of naming over from Objective-C, the one benefit I see is that it makes it obvious that it returns the result in a new string rather than modifying the object itself. In C++ you would need to look at the signature to tell which way a method named replace worked, e.g.

    void A::replace(const A& source, const A& dest);

vs

    A A::replace(const A& source, const A& dest) const;

That said, I think you can get this same benefit without such a verbose name. Perhaps something like "withReplacement"?


A solution to this would be to disallow mutable value types. Then, all methods would _have_ to return a new instance.

The compiler implementation could use mutability behind the scenes for efficiency, while the language exclusively allowed immutable values.


That is not a Swift method, it's an Objective-C method on NSString, which Swift String can be silently converted to.


Not anymore; those methods are implemented in an extension on `String` and don't always involve bridging back.


They're still only available if you import Foundation. They're just NSString methods with wrappers. The real problem is just that Swift itself doesn't have methods for locating a string in another string, or replacing strings in another string.

This is why I was careful to say that Swift's API is the best in terms of its fundamental design. It still has a lot of missing functionality compared to other languages right now.


> Human-readable text, file paths, SQL statements, and others are all conceptually different, and this should be represented as different types at the language level.

Objective-Smalltalk[1] has Polymorphic Identifiers[2][3], which are URIs used as identifiers in the language. So that handles file paths and web addresses. SQL is not solved directly, but XPath can be encoded in the URI and there are mappings from relational DB APIs to URIs[4][5][6].

I hope we can get rid of identifiers encoded as strings once and for all.

[1] http://objective.st/URIs

[2] http://dl.acm.org/citation.cfm?id=2508169

[3] https://www.hpi.uni-potsdam.de/hirschfeld/publications/media...

[4] http://blog.dreamfactory.com/add-a-rest-api-to-any-sql-db-in...

[5] http://restsql.org/doc/Overview.html

[6] http://www.slashdb.com


>Human-readable text, file paths, SQL statements, and others are all conceptually different, and this should be represented as different types at the language level.

String, Path/URL Objects, Prepared Statements.



While you can technically store whatever you want in there, the assumption and convention is that the bytes in a Go string are UTF-8.

E.g.:

   for i, x := range someString // iterates unicode code points
Works well in practice. (There are the unicode and unicode/norm packages to do more complex unicode operations.)


I don't think it works well for text manipulation. For example, in Go, how would you truncate a string and add ellipsis, without dropping an accent or otherwise splitting a grapheme cluster?
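
In Swift, by contrast, the grapheme-cluster view makes that short (a sketch in Swift 2-era syntax):

    func truncate(s: String, _ n: Int) -> String {
        guard s.characters.count > n else { return s }
        return String(s.characters.prefix(n)) + "…"
    }

    truncate("noe\u{0301}l", 3)   // "noë…": the combining accent stays with its e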


>I think that having different conceptual kinds of strings be distinct types would eliminate a lot of bugs. I'm not aware of any language or standard library that does this, though.)

Rebol does that AFAIK.



