

String types are fine. How about your code? - skrebbel
http://eteeselink.wordpress.com/2013/11/27/string-types-are-fine

======
tel
Haskell does this somewhat nicely by shunning the built-in String type and
having ByteString and Text types. All three (and others) can be created using
string literals, though that can be dangerous. Text is a UTF-16 encoded,
ICU-backed human-text monster type which handles upcasing ligatures and even
more complex collation and the sort (which is, btw, how you solve the phone
book issue, and it's just one C library away).

ByteString is a series of bytes that just may happen to be ok to print as
human text for debugging. The system makes it hard for you to treat it
otherwise by moving the "Char8-assuming" functions to different modules and
packages which must be explicitly imported and carry warnings.

You convert between them using functions in the Data.Text.Encoding module,
some of which may fail, like "decodeUtf8'" and "encodeLatin1". There's also a
slew of normalizing functions.
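
For readers who do not write Haskell, here is a rough Python analogue of the same bytes/text split. It is only a sketch: `decode_utf8` is a made-up helper that loosely mirrors the Either-style result of `decodeUtf8'`, not a port of the Haskell API.

```python
def decode_utf8(raw: bytes):
    """Return (text, None) on success or (None, error) on failure.

    Hypothetical helper: it loosely mirrors an Either-style result
    instead of letting the exception escape.
    """
    try:
        return raw.decode("utf-8"), None
    except UnicodeDecodeError as err:
        return None, err


# Valid UTF-8 bytes decode cleanly.
text, err = decode_utf8("Łódź".encode("utf-8"))

# 0xFF can never appear in UTF-8, so this decode fails.
bad_text, bad_err = decode_utf8(b"\xff\xfe not utf-8")

# Encoding can fail too: Latin-1 has no code for "Ł" (U+0141).
try:
    "Łódź".encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
```

The point is the same as in the Haskell version: the conversion is an explicit step, and the failure case is something the caller has to handle.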

I really encourage anyone interested in this problem to peruse the Text and
Text.ICU documentation.

[http://hackage.haskell.org/package/text](http://hackage.haskell.org/package/text)

[http://hackage.haskell.org/package/text-0.11.3.1/docs/Data-T...](http://hackage.haskell.org/package/text-0.11.3.1/docs/Data-Text-Encoding.html)

[http://hackage.haskell.org/package/text-icu](http://hackage.haskell.org/package/text-icu)

[http://hackage.haskell.org/package/text-icu-0.6.3.7/docs/Dat...](http://hackage.haskell.org/package/text-icu-0.6.3.7/docs/Data-Text-ICU-Normalize.html)

[http://hackage.haskell.org/package/text-icu-0.6.3.7/docs/Dat...](http://hackage.haskell.org/package/text-icu-0.6.3.7/docs/Data-Text-ICU-Collate.html)

------
RyanZAG
Misses the point, I think? There are plenty of times you must alter human
text. For example, a word processor with a find/replace function must be able
to replace the correct text and not move extra symbols around while doing so.
There are thousands of similar examples.

~~~
rbehrends
Yes, but functionality that is about dealing with text for human consumption
is properly the function of a library, not the core language, or even a single
type in the standard library.

Human language and culture is too complex, fuzzy, and variable to hardwire its
rules into a programming language specification. Character boundaries and
character transformations are really only the beginning of it.

Consider finding the end of an English word. Is an apostrophe at the end an
actual apostrophe denoting the possessive case (of a plural word) or meant as
a closing single quote (even where there are different symbols, they are often
mixed up when doing data entry)? How do you pluralize words? You have to
consider exceptions to the usual rules (e.g., a dictionary), words that only
exist in the plural, languages that don't have the concept of plural words,
etc.

~~~
informatimago
Nonetheless, that's what Wolfram is trying to do with his "programming"
language. At least it's at the same level. Provide human level rules and
knowledge embodied in a computer programming language.

------
zamalek
Finally someone who at least touches on one of my massive pet peeves:
something I call suck typing (string-duck typing). Have a DateTime? ToString
it. Have XML? IndexOf (this one _really_ gets to me). Got some binary data?
Base64 it, store in XML (use IndexOf and Substring to access it) and then put
that XML in a binary field in the database. I am a strong advocate that suck
typing should be a fireable offence.

While I think Unicode support in languages could be better, there is a lot of
truth in this article around its core subject.

~~~
arethuza
You've seen code that uses basic string manipulation (IndexOf, Substring) to
get values from XML? That's nice - although I guess it probably seems natural
to those who insist on creating XML by string concatenation...
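
A small Python sketch of why the IndexOf/Substring approach goes wrong. The document and element names are invented for illustration; `str.find` stands in for IndexOf and slicing for Substring.

```python
import xml.etree.ElementTree as ET

# Made-up document: a commented-out <id> precedes the real one,
# and the real one carries an attribute.
doc = '<order><!-- <id>0</id> --><id note="a&lt;b">42</id></order>'

# Naive "IndexOf/Substring" extraction: it matches the first literal
# "<id>", which here sits inside the comment.
start = doc.find("<id>") + len("<id>")
naive = doc[start:doc.find("</id>", start)]

# Parser-based extraction: the comment is skipped and the attribute
# on the real element does not get in the way.
parsed = ET.fromstring(doc).findtext("id")
```

Here `naive` ends up as "0" while `parsed` is "42". The string version would also never find `<id note="...">` on its own, since it only looks for the exact text `<id>`.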

~~~
mbq
Imagine that there is a performance bottleneck there and substring is, like,
500x faster than engaging a proper XML parser; what would you do?

Obviously it is likely an effect of a global design flaw, but such things are
very hard to fix.

~~~
zamalek
Find a different XML parser or adopt a different way of reading the XML (DOM
vs. SAX; or just a different library that performs better). I see where you
are coming from though. The problem with XML is that it is used to solve
problems that it shouldn't be solving - it's a great technology when used
correctly (XMPP is a great example of how XML can make other transfer formats
look like a dress rehearsal). In most cases it is, as you said, a "global
design flaw". A good indicator that you are abusing XML is if you are not
using xmlns attributes and do not have multiple namespaces (because in that
case JSON is simpler, faster and makes more sense).

~~~
lmm
What's the advantage of XMPP? I find the one time when I don't hate xml is
when it doesn't have namespaces or schemata, as then it's just a slightly more
verbose JSON.

~~~
zamalek
Not much is that special in terms of the XEPs (extension protocols) that they
have defined. When you innovate with it though, man you really see the power
of correct XML.

You can slap custom elements pretty much anywhere you want, as long as you
have your own namespace (and it's recommended you only place them under
<message> or <iq> elements). Say you have some proprietary technology in a
client application, with XMPP you can throw an element under the <message>
that your client can recognise and act on. For everyone else provide a
hyperlink within the <body> element and serve up a web page for them. If they
are using your client "bam!" instant added functionality - but if they are on
device X which you do not support they are not left out in the cold.
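
A rough sketch of that fallback pattern in Python. The `urn:example:whiteboard` namespace, the `<sketch>` element and the `render` function are all made up for illustration; only `jabber:client`, `<message>` and `<body>` come from XMPP itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical stanza: a standard <body> plus a custom element that
# lives in its own (invented) namespace.
STANZA = """
<message xmlns="jabber:client" to="friend@example.org" type="chat">
  <body>See my sketch: https://example.org/sketch/123</body>
  <sketch xmlns="urn:example:whiteboard" id="123"/>
</message>
""".strip()

NS = {"c": "jabber:client", "wb": "urn:example:whiteboard"}


def render(stanza_xml: str, supports_whiteboard: bool) -> str:
    """Return what a client would show for this message."""
    msg = ET.fromstring(stanza_xml)
    sketch = msg.find("wb:sketch", NS)
    if supports_whiteboard and sketch is not None:
        # Our own client recognises the extension and acts on it.
        return f"[open sketch {sketch.get('id')} in the built-in viewer]"
    # Everyone else falls back to the plain body with the hyperlink.
    return msg.findtext("c:body", default="", namespaces=NS)
```

Because the custom element is namespaced, a client that has never heard of it can ignore it without any risk of confusing it with a standard element.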

------
bjourne
Talk about throwing out the baby with the bathwater..

The original article was wrong because it proposed replacing strings with
arrays of code points. Clearly, that doesn't work.

This article is wrong because adding more string types just shuffles the
problem around. There is nothing "machine consumable" about strings encoded in
a certain anglo-centric character encoding! Just don't even think that
thought. "abc" is absolutely not more "machine consumable" than "東京". You
don't "hash prose" by transliterating text into ascii characters.

It's not impossible to fix the existing string types. Principle of least
surprise holds. In cases when it doesn't, the locale is the tie-breaker. E.g.
"Scheiße".upper(locale.DE) may be different from "Scheiße".upper(locale.RU).

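To make the "locale as tie-breaker" idea concrete, here is a minimal Python sketch. Python's built-in `str.upper()` takes no locale, so the `lang` parameter and the Turkish special case below are assumptions added for illustration, not an existing API.

```python
def upper(text: str, lang: str = "") -> str:
    """Uppercase text; `lang` is a hypothetical tie-breaker parameter."""
    if lang in ("tr", "az"):
        # Turkish and Azerbaijani uppercase dotted "i" to dotted
        # capital "İ" (U+0130) instead of plain "I".
        text = text.replace("i", "\u0130")
    # The default mapping already expands "ß" to "SS".
    return text.upper()


german = upper("Scheiße", "de")     # length grows from 7 to 8
turkish = upper("istanbul", "tr")
default = upper("istanbul")
```

A real implementation would lean on ICU or a similar library rather than hand-written special cases; this only shows that the same input can legitimately have different uppercase forms.
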
~~~
lmm
Please tell me how to correctly implement search (ideally in scala), such that
"Lodz" will match "Łódź". Heck, making "noël" match "noël" is nontrivial. I
don't think existing string types are adequate for this (e.g.
java.lang.String.equals() gives exceedingly surprising results).
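
To show what is involved, here is a sketch in Python rather than Scala, using only the standard `unicodedata` module. `search_key` and the `EXTRA_FOLDS` table are invented names, and the table is deliberately incomplete.

```python
import unicodedata

# "Ł"/"ł" have no Unicode decomposition, so stripping accents alone
# does not turn them into "l". This hand-made table is an assumption
# and covers only that one letter.
EXTRA_FOLDS = {ord("\u0142"): "l"}


def search_key(text: str, extra=EXTRA_FOLDS) -> str:
    """Build a loose comparison key: decompose, drop combining marks,
    case-fold, then apply extra per-letter folds."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return stripped.casefold().translate(extra)


composed = "no\u00ebl"      # "ë" as a single code point
decomposed = "noe\u0308l"   # "e" followed by a combining diaeresis

same_raw = composed == decomposed                          # False
same_key = search_key(composed) == search_key(decomposed)  # True
```

With the extra table, `search_key("Łódź")` equals `search_key("Lodz")`; without it the key is "łodz", which is exactly the kind of surprise being complained about. Whether "Łódź" should match "Lodz" at all is a product decision, not something normalization settles.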

~~~
bjourne
You are looking for a distance metric on strings. If d(s1, s2) = 0, where s1
and s2 are strings, then they are equivalent.

"Łódź" may be the same as "Lodz" to you in the same way that "komputor" may be
the same as "computer" to a Russian speaker. To someone else the names
"Anderson" and "Andersson" are equivalent. Now you see the problem -- exact
matching is futile and you should use fuzzy matching instead, like normalized
Levenshtein distance, and rank the results based on similarity.

Even that is not enough if you want to support non-Latin alphabets because
they have different ideas about what a character is but it should get you
started.
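
A minimal Python sketch of that suggestion: plain Levenshtein distance, scaled by the longer string's length, then used to rank candidates. The function names are made up, and this works on code points, so the caveat about what counts as a character still applies.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """1.0 for identical strings, approaching 0.0 as they diverge."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def rank(query: str, candidates: list) -> list:
    """Sort candidates from most to least similar to the query."""
    return sorted(candidates, key=lambda c: similarity(query, c), reverse=True)


results = rank("Anderson", ["Smith", "Henderson", "Andersson"])
```

"Andersson" comes out on top here because it is one edit away from "Anderson", while "komputor" and "computer" are two edits apart.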

~~~
lmm
I shouldn't have to implement this from scratch though. It's not like this
problem is unique to my program; programming languages should have some
support for solving this kind of problem in their standard libraries (or a
readily available library)

------
lmm
The conclusion doesn't really match the title.

> If there is any takeaway from this entire discussion, it may be that there
> is a need for multiple string types in strongly-typed languages

Yes, yes there is. Until we get that, string types are _not_ fine.

~~~
skrebbel
Good point!

It is my opinion, however, that string types _are_ fine, just not perfect. I
should have maybe made that clearer.

~~~
zokier
Yeah, I was a bit confused when your conclusions seemed to match the original
"strings are broken" post's conclusions.

The whole problem is that current string types enable broken unsafe behavior
on Unicode ("human only" in your parlance) strings. Current string types are
broken because they do not enforce the requirement that string operations are
done only on plain ASCII ("machine only") strings.

~~~
ipedrazas
what if I have to create strings with non ASCII chars?

Calling ASCII "machine only" is totally wrong. I agree that encoding/decoding
is a pain, but we have different string encodings for a reason.

------
randallsquared
> Similarly, why would you ever need to take a substring of someone’s name

In my experience, you need to do this any time you're displaying someone's
name, a place name, an article title, or whatever. Often the display area just
isn't that wide, and shifting around other content may not be an option. You
need to display enough to let people know what's there, but eventually it
needs to be cut off.
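
A hedged Python sketch of cutting a name off without splitting an accented letter. `truncate` is a made-up helper; it only guards against combining marks and does not handle emoji sequences, Hangul jamo or other multi-code-point clusters, for which a proper grapheme segmentation library would be needed.

```python
import unicodedata


def truncate(text: str, limit: int, ellipsis: str = "\u2026") -> str:
    """Shorten text to at most `limit` code points, ellipsis included."""
    if len(text) <= limit:
        return text
    cut = max(limit - len(ellipsis), 0)
    # If the cut would land just before a combining mark, back up so the
    # base letter is not shown without its accent.
    while cut > 0 and unicodedata.combining(text[cut]):
        cut -= 1
    return text[:cut] + ellipsis


name = "noe\u0308l"          # "e" plus a combining diaeresis
short = truncate(name, 4)    # backs up past the "e" as well
untouched = truncate("short", 10)
```

A naive `name[:3]` would display "noe", silently turning "ë" into "e".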

~~~
stefan_kendall
You need a designer if you ever plan to internationalize. What you're doing
won't work - you'll make nonsense phrases.

------
brudgers
The problems arise not from strings themselves: they are well understood
computationally, and most languages provide sound functions for working with
them. The problems
arise when strings are held responsible for the side effects of other
functions which manipulate the data within strings.

In particular, problems arise when some function renders a sequence of
characters contained in a string [fundamental string operations such as
concatenation tend not to be problematic]. These problems are due to the
transition from the mathematical certainty of strings to the heuristics of
text rendering. The compromises required to map the semantics of human writing
systems onto strings via Unicode contribute to this problem: glyphs are not
necessarily ordered sequentially, nor can they always be resolved without
context.

------
aufreak3
The post makes a good point about the distinction between two types of text -
symbols and human text. Sadly, this is made very complex in our world through
the introduction of the markup (and markdown) languages which mix text
targeting both machines and humans. As long as you're having to work with
embedded structuring or formatting codes, woe betide you. You'll have to deal
with script injection, sql injection and what not.
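
A short Python sketch of keeping human text and machine text apart at the two boundaries mentioned above. The table and values are invented; the point is that the human-supplied strings are passed as data (query parameters, escaped output) instead of being concatenated into SQL or HTML.

```python
import html
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (author TEXT, body TEXT)")

# Human text that happens to look like machine text.
author = "Robert'); DROP TABLE comments;--"
body = "<script>alert('hi')</script>"

# Placeholders keep the values out of the SQL grammar entirely.
conn.execute(
    "INSERT INTO comments (author, body) VALUES (?, ?)", (author, body)
)

stored_author, stored_body = conn.execute(
    "SELECT author, body FROM comments"
).fetchone()

# Escape at the point where the text is embedded in markup.
page_fragment = f"<p>{html.escape(stored_body)}</p>"
```

The stored values round-trip unchanged, and the rendered fragment contains `&lt;script&gt;` rather than a live tag.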

------
tambourine_man
I disagree with the premise of the article. Text is for humans only. If you
are connecting computers with no human interaction, binary is much more
efficient.

------
burstmode
Great, so now that he explained to us how to use strings for human readable
text, the last missing bit is a blogpost telling the author how to choose the
colors for letters & background. What's the sense in optimizing string usage
when half of the audience can't read the lightgray text on slightly-lighter-
gray background ?

~~~
skrebbel
Hey, thanks for the feedback. I just picked a random Wordpress.com theme to
get started fast (I started blogging 3 days ago). I'll put more effort into
picking (or even _gasp_ making) a better theme!

------
quarterto
_If you got this far, you’ll probably want to hire me as a consultant._

Nope.

~~~
skrebbel
Damn. I'll live, I guess. Thanks for reading till the end anyway!

~~~
tommo123
I say leave it. Don't reduce your chance of getting paid work just because
some smug clown on the internet wanted to score a pseudo-zinger by yelling at
his television screen.

If anything, change 'probably' to 'might'. Confidence is good, but some people
could take 'probably' as an imputation on their ability to comprehend the
subject matter. I'd always aim at "That was just the beginning. Whip out your
chequebook and then you'll REALLY see what I have to offer" rather than "You
sound like you're in trouble."

~~~
skrebbel
Hah :-)

Thanks for the support, but if I make an arrogant remark, I have to expect to
get snarky responses, right :-)

The line was supposed to be taken as a joke, in reference to people signing
off their blog posts with "If you read this far, you should probably follow me
on Twitter" and the like.

Thanks for your feedback though! I hadn't thought about how people could
interpret it as a slight insult to their understanding of the subject matter,
so next time I make an arrogant joke I'll try and take stuff like that into
account.

