
Joe Armstrong: "In my opinion Erlang is brilliant at handling text" - drewr
http://groups.google.com/group/erlang-programming/msg/a5c0e9585d148294
======
ellyagg
Erlang now has a full-featured standard regex module called re that handles
utf8 strings and binaries. It now has a standard module for handling unicode.

Yes, regexes aren't part of the language syntax, but they aren't in, say,
Python, either. Having to use a library for regexes is not going to be
the difference maker in productivity for your app.

Besides, while I love regex, I still use them as a last resort. You're going
to want to use a real parser for reliable structured text processing.

Generally, erlang programmers keep strings in binaries, which are compact.
Most modules for handling string type tasks allow you to do this, e.g., the re
module understands a unicode_binary type.

In 6 months of programming erlang professionally, in a domain dominated by
scripting languages like python and ruby, I've certainly never been tempted to
bolt over string handling issues. Erlang's flexible distribution, concurrency,
and reliability model is just too compelling.

To get competitive performance in a reasonable amount of programmer time for
concurrent applications in other languages you're limited to the subset of
tasks that, say, twisted or tornado makes easy. The program I'm writing now
couldn't have been done with either of them.

Frankly, if it's the choice between built-in support for regex and built-in
primitives for distribution, concurrency, and fault tolerance, there's no
question in my mind which is more important.

------
amix
He compares Erlang's string handling with C... And C sucks at string handling.
He should compare Erlang to all the modern languages where strings are a first
class citizen. Comparing Erlang's string handling to Java, Python or Ruby
might change his mind.

~~~
shiro
I think the main (and almost the only) reason to have a distinct string type is
performance, not ease of programming. Many string operations are
useful as general list operations as well, including regular expression
matching. (Note: Some people emphasize the importance of O(1) access to a
string by index, but using an integer index is also a performance hack. If search
operations can return some way to point to the substring, you don't need
integer indexes.)

Another minor reason is display; people prefer reading a sequence of
characters in string syntax. If you have a statically typed language it is
easy to display a list of chars in string syntax instead of list syntax. For a
dynamically typed language with heterogeneous lists, it can be a performance
penalty to check at runtime whether a list consists entirely of characters.
So, in a sense, it is also about performance. (Note: Having a
syntax for strings has nothing to do with having a distinct type for strings.
The string syntax can be just syntactic sugar.)

But performance is important, of course. One thing very common with strings
(lists of characters) but not very common with general lists is concatenation. To
be precise, lazy language programmers use list concatenation without guilt,
but eager language programmers tend to avoid it since it may cause unnecessary
copying of lists. So for eager evaluation languages, it makes sense to
have a string type that internally has a very cheap concatenation operation
(e.g. using a tree representation).
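
As a rough illustration of the tree idea, here is a minimal Python sketch of a
rope-like string. The names `Leaf`, `Concat`, `concat` and `flatten` are
hypothetical, invented for this example; no particular language or library is
claimed to implement strings this way.

```python
from dataclasses import dataclass
from typing import Iterator, Union


@dataclass(frozen=True)
class Leaf:
    """A flat piece of text."""
    text: str


@dataclass(frozen=True)
class Concat:
    """Two ropes joined without copying either one."""
    left: "Rope"
    right: "Rope"
    length: int  # cached total length of both sides


Rope = Union[Leaf, Concat]


def rope_len(r: Rope) -> int:
    """Length in characters, read from the node rather than recomputed."""
    return len(r.text) if isinstance(r, Leaf) else r.length


def concat(a: Rope, b: Rope) -> Rope:
    """Cheap concatenation: allocates one node, copies no characters."""
    return Concat(a, b, rope_len(a) + rope_len(b))


def pieces(r: Rope) -> Iterator[str]:
    """Yield the leaf texts left to right, using an explicit stack."""
    stack = [r]
    while stack:
        node = stack.pop()
        if isinstance(node, Leaf):
            yield node.text
        else:
            # Push right first so that left is visited first.
            stack.append(node.right)
            stack.append(node.left)


def flatten(r: Rope) -> str:
    """Pay the copying cost once, only when a flat string is needed."""
    return "".join(pieces(r))
```

The trade-off in this sketch is that indexing and traversal get more expensive
while concatenation becomes a constant-time node allocation.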

~~~
amix
Performance and _ease of use_ are the main reasons why strings should be first
class citizens. Strings are one of the most used data structures and most of
today's popular languages have very good support for them, encodings of them
and manipulation of them. C does not have that good support for them. Ignoring
strings and labeling them as "a list of integers" is a step backward, since a
lot of the data we have today is textual and will continue to be textual in
the future.

~~~
shiro
Having dynamic typing makes the discussion complicated, so let's assume we
have a cheap way to know the type of a given object.

In one world, you get an object of type [Char] ---meaning a list of
characters--- and you can apply all sorts of list operations on it, and all
sorts of operations specialized to [Char]. You can add a type alias String to
the type [Char]. In the source file you can write "string" and it is read as a
list of six characters. On the output the same list is printed as "string".

In another world, you get an object of type String, which is a distinct type
from a list of characters. Type String has all sorts of useful operations. But
if you want to apply a generic list algorithm, you have to either duplicate
the code, or coerce the string into a list. Conversely, if you have a list of
characters and want to pass it to a string library function, say, a regexp
matcher, you have to coerce it to a string.

Which is easier?

As nostrademons commented, one way is to implement a generic interface so that
you can write a generic algorithm on top of both lists of characters and
Strings, but that's actually the same thing I'm saying. By "list" I mean some
data structure on which you can peek at the head and the tail, and to which you
can add an element in front of the existing ones. I don't care how it is
represented---if the runtime or the compiler can find out the list only
contains ASCII characters, it can freely store the entire list in an octet
array. In a sense, by "list" I mean "data structures that implement the list
interface".

Now, suppose you have such a smart runtime/compiler. Suppose you can have
specialized functions on [Char], apart from generic lists. Do you still think
having a distinct string type is for ease of use?

In reality we don't have a sufficiently smart runtime/compiler, so we
compromise. That's the distinct string type.

~~~
amix
I would prefer if strings were treated as a list of characters and NOT as a
list of integers (as they are in Erlang)...! I.e. your reasoning does not
really apply to Erlang.

~~~
shiro
Oh, I thought you were arguing with my proposition: Distinct string type is
for performance and not for ease of programming.

I agree that conflating [Char] and [Integer] is not good. That's a different
story.

~~~
amix
I disagree with your proposition. Languages should have a string type and not
only for performance, but also for ease of programming, since strings are an
important and special data type. I don't really care how this string type is
implemented - if strings are an array of characters (like in Java) or a list
of characters (like in Haskell). What I think is important is that a language
treats strings as first class citizens - and not just for performance, but
also for ease of use. And currently, strings aren't a first class citizen in
Erlang (due to Erlang's limited support for custom datatypes).

~~~
shiro
Let's forget Erlang. We've agreed that conflating [Integer] and string is bad.

I can't find the reasoning to back up your claim in your posts; if I missed it,
could you point it out? I think I explained a few points showing that a language
does not need a _distinct_ string type, except for performance reasons.

Note that I've never said that strings shouldn't be a first class citizen. A
list of characters _is_ a first class citizen. You can have a rich string
library on top of lists of characters, plus generic list operators work on
them. So, why do you want a string type _disjoint_ from lists?

~~~
jimbokun
It seems the key is having a good "Char" type, so that Unicode bytes turn into
something matching the intuition of what a character should be. So the bytes
for "a umlaut" or whatever become a single Char in the list. So the Char type
is responsible for storing and representing characters at the right level of
granularity, and then string operations can all be implemented as list
operations.

Is there a case where this still breaks down?
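
One way to probe the granularity question is to compare bytes, code points and
normalization forms for the same visible character. The sketch below uses
Python only as a stand-in: its `str` is a sequence of code points, which is
assumed here to approximate a list of "Char". It is an illustration, not a
statement about how Erlang represents text.

```python
import unicodedata

# "a umlaut" as a single precomposed code point (U+00E4).
composed = "\u00e4"

# The same visible character as 'a' followed by a combining diaeresis (U+0308).
decomposed = unicodedata.normalize("NFD", composed)

utf8_bytes = composed.encode("utf-8")

# Counts at three levels of granularity.
byte_count = len(utf8_bytes)        # UTF-8 bytes of the precomposed form
composed_count = len(composed)      # code points, precomposed
decomposed_count = len(decomposed)  # code points, decomposed

# The two forms render the same but compare unequal until normalized.
equal_raw = composed == decomposed
equal_normalized = unicodedata.normalize("NFC", decomposed) == composed
```

Under these assumptions, a Char-per-code-point list handles the byte problem,
but combining sequences can still make one visible character span several
list elements unless the text is normalized first.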

------
bugs
I don't know very much about Erlang, but I can understand why it has the stigma
that it doesn't handle text well: almost every introduction I have seen
says this is the case because it was created by telecommunication
companies.

However, my familiarity is limited, and I cannot comment on whether this is
true. But comparing Erlang to C for text-based operations is
probably silly if the other person has an option such as, say, Perl available to
them.

~~~
mahmud
String processing is just one of those things you can't offload to a second
process. Having Perl in the same box doesn't mean you have the luxury to open
some IPC channel and pass your texts to Perl. String processing is one of
those types of problems that just benefit from a little thought and 10 seconds
of planning.

C itself is bogged down by its own horrible string representation, the nul
termination, where most operations need to traverse the string up to the
terminator before they know where it ends. This causes all sorts of horrible
buffering tricks, conditional tests on every character, and other unpleasant
things. Pascal had length-prefixed strings from day one, and DJB created
Netstrings to make network programming easier on people. See
<http://c2.com/cgi/wiki?LeasedString>
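
For illustration, a minimal Python sketch of the length-prefixed idea using the
netstring layout (decimal byte length, a colon, the payload, a trailing comma).
The function names `encode_netstring` and `decode_netstring` are made up for
this example and do not come from any library.

```python
from typing import Tuple


def encode_netstring(payload: bytes) -> bytes:
    """Prefix the payload with its decimal length and ':' and append ','."""
    return str(len(payload)).encode("ascii") + b":" + payload + b","


def decode_netstring(buf: bytes) -> Tuple[bytes, bytes]:
    """Return (payload, remaining bytes) for one netstring at the start of buf."""
    head, sep, tail = buf.partition(b":")
    if not sep or not head.isdigit():
        raise ValueError("missing or malformed length prefix")
    n = int(head)
    # The length is known up front, so no scan for a terminator is needed
    # and the payload may freely contain NUL bytes.
    if len(tail) < n + 1 or tail[n:n + 1] != b",":
        raise ValueError("truncated payload or missing trailing comma")
    return tail[:n], tail[n + 1:]
```

Because the reader learns the length before touching the payload, it can
allocate or slice exactly once instead of testing every byte for a terminator.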

------
cloudhead
Even though he has a good point: strings are lists, and erlang is good at
lists — erlang doesn't have native regexps like perl, js, or ruby, nor is the
support for them that good.

~~~
rjurney
This is true - looked at making a very concurrent webcrawler in Erlang, and
the regex bit was painful.

~~~
babo
That changed as of 12Bx; regexps are now enjoyable, but still not as rich as Perl's.

~~~
tlack
Maybe someone should write a parse transform for dealing with them?

------
dtf
How does Erlang fare with Unicode in practice?

~~~
zacharypinter
If I understand it correctly, Erlang fares well here because it stores each
character as a 32-bit integer in a list. The implication of that approach is
more memory overhead for each character, but that allows you to treat strings
as a list of characters and gives you all the benefits of the standard list-
manipulation libraries.

~~~
naz
> Erlang fares well here because it stores each character as a 32-bit integer
> in a list

Which is silly because if you make a list of 32 bit integers that have ASCII
equivalents then Erlang assumes it is a string.

~~~
silentbicycle
The main difference is the way it's displayed by default in the shell. Strings
are handled with list operations, and Erlang is _great_ for list operations.

