
I agree, although the arguments in that article aren't the strongest. In particular:

The main reason given for 'String' being slow is that it's immutable and hence leads to many duplicate Strings being created. In fact, the main reason Strings are slow is that their linked-list structure has a lot of overhead, and may be scattered around in memory. For example, the String "abc" will be represented something like this:

    +---+---+   +---+---+   +---+---+
    | * | *---->| * | *---->| * | *---->[]
    +-|-+---+   +-|-+---+   +-|-+---+   
      |           |           |
      V           V           V
     'a'         'b'         'c'
Those arrows are pointers, which can point to locations arbitrarily far away. From this we can see that just storing a (fully evaluated) String is expensive, since each character requires a pair of pointers (in addition to the unique characters themselves). Processing the contents of this String is also slow, since we need to dereference two pointers for each character (one to get the character, one to get the tail of the String). Since the data may not be contiguous in memory, this can make poor use of the CPU caches, which might otherwise speed up these dereferences.
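To make this concrete: Haskell's String really is just a linked list of Char (the Prelude defines `type String = [Char]`), so "abc" is sugar for a chain of cons cells, and walking it means chasing one cell at a time:

    -- "abc" is sugar for the cons-cell chain 'a' : 'b' : 'c' : []
    abc :: String
    abc = 'a' : 'b' : 'c' : []

    -- Each step dereferences one cell to reach the head and the tail
    firstChar :: String -> Maybe Char
    firstChar []      = Nothing
    firstChar (c : _) = Just c

    main :: IO ()
    main = do
      print (abc == "abc")   -- True
      print (firstChar abc)  -- Just 'a'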

Compare this to a C-style string, which would look like this in memory:

    | a | b | c | 0 |
This "packed" representation requires no long-distance dereferencing, the memory is contiguous so it will make good use of the cache, and we can process the characters directly without having to chase pointers.

That article describes the ByteString type as a "list of Word8 objects", but that's a bit misleading. Firstly we often say "list" to mean a chain-of-pointers like the first diagram above, e.g. that's exactly what Haskell's built-in list type is, and Haskell's String is such a list. ByteStrings store their Word8 objects more like the second diagram, so it would be better to call them an "array of Word8 objects". Secondly, ByteStrings use a clever three-layered structure to speed up common operations:

- The raw Word8 characters are stored in arrays, like the C-style string above (but without the NUL byte at the end)

- These arrays are pointed to by values called "chunks". These are "fat pointers", i.e. they also store a length and offset value.

- A ByteString itself is a list of chunks (like the first String diagram, but instead of values like 'a', 'b' and 'c' the values are chunks).

This makes ByteString really fast, for three reasons:

- Some operations can be performed by just fiddling at the list-of-chunks level. For example, to append X and Y, the result is just a list of X's chunks followed by Y's chunks; no need to touch the underlying chunks or arrays.

- Some operations can be performed by just fiddling the length and/or offset of a chunk. For example, incrementing a chunk's offset will chop characters off the start (they're still stored in memory, but they will be ignored); decrementing a chunk's length will chop characters off the end.

- The same underlying Word8 arrays can be pointed to by many different chunks in many different ByteStrings; i.e. we rarely need to copy the actual characters around.
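A highly simplified sketch of that three-layer structure (hypothetical names; the real bytestring library stores bytes behind a ForeignPtr with unsafe indexing, not a Haskell list):

    import Data.Word (Word8)

    -- A chunk is a "fat pointer": an offset and length into a shared array.
    -- (Sketch only: [Word8] stands in for the raw byte array.)
    data Chunk = Chunk
      { chunkArray  :: [Word8]
      , chunkOffset :: Int
      , chunkLength :: Int
      }

    -- A (lazy) ByteString is a list of chunks.
    type ByteStringSketch = [Chunk]

    -- Appending never touches the underlying arrays:
    append :: ByteStringSketch -> ByteStringSketch -> ByteStringSketch
    append = (++)

    -- Dropping characters just bumps the offset and shrinks the length:
    dropBytes :: Int -> Chunk -> Chunk
    dropBytes n (Chunk arr off len) =
      Chunk arr (off + min n len) (max 0 (len - n))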

As an example, let's say we read the ByteString "[\"foo\", \"bar\"]" (without the backslashes), parse it as JSON to get a list of ByteStrings ["foo", "bar"], then concatenate that list to get the single ByteString "foobar". Here's how it might look in memory:

                         BS--+---+   BS--+---+
    "foobar":            | * | *---->| * | *---->[]
                         +-|-+---+   +-|-+---+
                           |           |
                      +----+           +----------------+
                      |                                 |
                      |  List+---+      List+---+       |
    ["foo", "bar"]:   |  | * | *------->| * | *---->[]  |
                      |  +-|-+---+      +-|-+---+       |
                      |    |              |             |
                      |    |              |             |
                      |    |              |             |
                      |    V              V             |
                      | BS--+---+       BS--+---+       |
    "foo" and "bar":  | | * | *---->[]  | * | *---->[]  |
                      | +-|-+---+       +-|-+---+       |
                      |   |               |             |
                      |   |               +---+         |
                      |   |                   |         |
                      V   V                   V         V
                     Chunk---+---+           Chunk---+---+
    (chunks)         | 3 | 2 | * |           | 3 | 9 | * |
                     +---+---+-|-+           +---+---+-|-+
                               |                       |
           BS--+---+           |                       |
    Input: | * | *---->[]      |                       |
           +-|-+---+           |                       |
             |                 |                       |
             V                 |                       |
             Chunk+---+---+    |                       |
    (chunk)  | 14 | 0 | * |    |                       |
             +----+---+-|-+    |                       |
                        |      |                       |
                        |      |                       |
                        |      |                       |
                        V      V                       V
    (array) | [ | " | f | o | o | " | , |   | " | b | a | r | " | ] |
(Note that the exact position where arrows "land" doesn't matter; they're pointing to the entire data structure they "hit")

Here we can see that there's only one copy of the underlying text; that the substrings 'foo' and 'bar' are simply chunks with length 3, offset by appropriate amounts; and that the resulting 'foobar' ByteString is just a list of these two chunks.
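We can reproduce the slicing part of that example with the real bytestring API: `take` and `drop` on a strict ByteString just adjust the offset and length into the same underlying buffer, so no characters are copied (the exact offsets below are worked out by hand from the input string):

    import qualified Data.ByteString.Char8 as B

    main :: IO ()
    main = do
      let input = B.pack "[\"foo\", \"bar\"]"
          foo   = B.take 3 (B.drop 2 input)  -- a slice sharing input's buffer
          bar   = B.take 3 (B.drop 9 input)  -- likewise
      B.putStrLn (B.concat [foo, bar])       -- prints "foobar"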

This approach of "store things once, then do all processing with indices" can be very fast (even faster than C in some cases https://chrisdone.com/posts/fast-haskell-c-parsing-xml ). Whilst we can obviously write this sort of algorithm in any language, re-using arrays and chunks like this relies on them being immutable, which is more idiomatic in Haskell. In particular:

- Languages which tend to use mutation aren't well suited to this, since mutating one value can have unforeseen consequences for other values which re-use the same components. Copying is safer and more predictable in the face of mutation.

- Languages which favour some built-in interface rather than high-level functions may need to copy values in order to comply with these interfaces. In particular, C's array syntax will work for arrays and chunks, but not for the overall ByteString structure (a list of chunks).

- If we want NUL-terminated arrays, we'll need to make copies when truncating strings, to avoid the NUL byte overwriting the truncated part.

- Re-using values can make it hard to keep track of which parts can be freed. Garbage collection (and other approaches, like linear types, borrow checking, etc.) can make this easier.

The difference between lazy and strict ByteStrings is just whether the overall list-of-chunks is lazy (chunks generated on-demand, useful for streaming) or strict (chunks are generated up-front, closer to using one big array, hence potentially faster and more predictable).
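The list-of-chunks view is visible in the lazy API; e.g. `fromChunks`/`toChunks` convert between a lazy ByteString and its list of strict chunks:

    import qualified Data.ByteString      as S
    import qualified Data.ByteString.Lazy as L

    main :: IO ()
    main = do
      -- A lazy ByteString is (conceptually) a list of strict chunks:
      let lazy = L.fromChunks [S.pack [104, 105], S.pack [33]]  -- "hi" and "!"
      print (L.length lazy)             -- 3
      print (length (L.toChunks lazy))  -- 2 chunks, produced on demand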

The Text type is similar to a ByteString, but its contents are interpreted as Unicode text in a fixed internal encoding (UTF-16 in older versions of the text library; UTF-8 since text-2.0), rather than as raw bytes.

The article you link also talks about fusion, claiming that's why Text (and hence ByteString) is faster, or avoids intermediate allocations. Fusion is great, but care must be taken if we're going to rely on it; e.g. very minor changes to an expression (say, to make debugging easier) can stop things from fusing, greatly changing the compiled code's speed and memory usage.

On the subject of fusion, it's worth noting that Haskell's built-in list type is also subject to fusion, often resulting in zero allocation (e.g. generating a list of characters, then counting them, might compile down to a single loop with a single counter variable, and no Chars/Lists/Strings in sight!). Again, it can be tricky to ensure that this happens. One approach is to test for it with something like https://hackage.haskell.org/package/inspection-testing
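For instance, a pipeline like the following is the sort of thing GHC's list fusion can compile down to a single counting loop, with no intermediate list allocated; whether fusion actually fires is sensitive to small changes, which is exactly what inspection-testing helps check:

    -- enumFromTo is a "good producer", filter a transformer, and length a
    -- "good consumer", so this whole pipeline can fuse into one loop.
    countEvens :: Int -> Int
    countEvens n = length (filter even [1 .. n])

    main :: IO ()
    main = print (countEvens 10)  -- 5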

Sounds a bit like the Rope data structure from 1995: https://en.wikipedia.org/wiki/Rope_(data_structure)

Although instead of a list of chunks, a rope is an immutable binary tree of chunks, so more editing operations are efficient, e.g. insertion or deletion in the middle of the rope.

Abseil library contains an open source C++ implementation of a similar structure called a Cord: https://github.com/abseil/abseil-cpp/blob/master/absl/string...

I believe both Firefox and Chrome now use Ropes as their string implementation due to those rather desirable editing operations.

Does this approach avoid the issue of substrings preventing the strings they refer to from being garbage collected (e.g. as in Java's substring() before JDK 7)?

AFAIK ByteString suffers this problem: substrings will prevent their whole array from being GCed.

By default, data read from files, stdin, etc. is split into 64k chunks (e.g. see functions like 'hGetN' in http://hackage.haskell.org/package/bytestring- ), so only the 64k chunks containing the parts we want will be kept.

There is a function 'copy' which will explicitly return a copy of only the character range that's used, so the original can be GCed ( https://hackage.haskell.org/package/bytestring- )
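A sketch of that pattern (the filename is hypothetical; `copy` is the real Data.ByteString function):

    import qualified Data.ByteString.Char8 as B

    main :: IO ()
    main = do
      big <- B.readFile "large-input.txt"  -- hypothetical large file
      let slice = B.take 10 big  -- shares big's buffer, keeping it all alive
          owned = B.copy slice   -- a fresh 10-byte copy; big can now be GCed
      B.putStrLn owned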

Thanks for the clarification. BTW I agree with tome - your comment above would make a great blog post.

> In fact, the main reason Strings are slow is that their linked-list structure has a lot of overhead, and may be scattered around in memory.

This is almost entirely a myth in the context of garbage-collected languages. Most popular GC'd languages use a copying, compacting garbage collector, which follows the pointers when copying and lays the list out in one place. Furthermore, if the compiler were to mark the list as a contiguous memory region of the same type (data/data pointer plus one pointer to the next link), it could jump directly to the Nth element with the same performance as a real array (assuming you do bounds checking). The only significant difference is that, with long lists, the doubled size means fewer elements fit in cache, but that's not a huge issue for most lists or arrays. If cache behaviour really were that much of an issue for most programs, then B-trees would be a much, much larger issue, as they tend to hold more data for longer and necessarily can't be laid out in access order.

I believe another factor is intuition carried over from dynamic languages. With a dynamic language like Lisp, list items generally won't carry their own types. Instead, the list will hold pointers to boxed values, where the box contains the type info. Implementing in a language like Haskell, with the requirement of static types, should (in theory anyway) decrease the need for boxing and radically improve performance (I believe Common Lisp type hints also reduce the need to box everything).

It is also possible to have a linked-list pattern, but not a linked-list implementation. JavaScript is a good example here. You can add/remove at both ends, and even splice to add/remove in the middle too. For years, everyone implemented all these lists as linked lists. Today, they mostly use arrays with fill pointers. Adding/removing at the beginning or middle takes a performance hit, but the much more common push to the end does not (until you exceed the array size). You can add a start-offset pointer to an array to decrease the penalty for pushing at the beginning too.

You're right that there are higher-order effects at play, but I was careful to hedge my bets (e.g. the sentence you quoted says "may be").

Besides a GC re-arranging/compacting things, the elephant in the room for Haskell is laziness, which introduces another level of indirection. I tactically avoided this by limiting what I said to "fully evaluated" strings.

Lazy values start out as "thunks" (effectively zero-argument closures); if their value is needed (e.g. if we're branching on it), the thunk is evaluated to "weak head normal form" (i.e. until we reach a constructor we can branch on); the thunk in memory is replaced by this evaluated version, so it won't need to be re-evaluated if it's accessed again.
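A small illustration of thunks and weak head normal form:

    main :: IO ()
    main = do
      -- Laziness: the pair's components stay as unevaluated thunks until
      -- demanded, so the undefined is never forced.
      let pair = (1 + 1, undefined :: Int)
      print (fst pair)  -- 2; the thunk in snd is never evaluated

      -- seq forces its first argument to weak head normal form, i.e. just
      -- far enough to reach the outermost (:) constructor; the elements
      -- 1+1 and 2+2 remain thunks.
      let xs = [1 + 1, 2 + 2] :: [Int]
      xs `seq` putStrLn "forced to WHNF"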

There are a few ways this can complicate things: thunks can prevent values from being garbage collected; layers upon layers of thunks can cause "space leaks" (where the order we evaluate things is correct, but inefficient; not to be confused with "memory leaks", where memory that's no longer needed is never freed); evaluating in order of demand can jump around inside different values, causing their allocations to be interleaved, and hence fragmenting their memory usage (until a compacting GC is able to defragment them).

As an extra complication, Haskell compilers like GHC will perform "strictness analysis" to find values whose evaluation is inevitable, and hence can be evaluated immediately rather than making a thunk. Strictness annotations can also be added to the code.
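Strictness annotations look like this (strict fields are standard; bang patterns need an extension):

    {-# LANGUAGE BangPatterns #-}

    -- A strict field: the Int is forced when the constructor is applied,
    -- so no thunk is stored inside the value.
    data Counter = Counter !Int

    -- A bang pattern: forces the accumulator on every iteration, avoiding
    -- the chain of (acc + x) thunks that a lazy foldl would build up.
    sumStrict :: [Int] -> Int
    sumStrict = go 0
      where
        go !acc []       = acc
        go !acc (x : xs) = go (acc + x) xs

    main :: IO ()
    main = print (sumStrict [1, 2, 3])  -- 6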

As far as types and boxing go, GHC erases types at compile time, so values aren't tagged with their types at runtime. However, most values will have several levels of nested structure, due to the way their types are defined, which amounts to much the same thing. For example, in GHC the default 'Int' type is a wrapper around a machine integer, whose type is 'Int#', e.g.

    data Int = I# Int#
In other words, an 'Int' value like '42' is not a tagged pair like '(IntType, 42)', as in Lisp or Python; but it does still involve a wrapper (the 'I#' constructor) pointing at a machine word, like 'I# 42#'. For this reason, the #-suffixed types are known as "unboxed".

We can write our code in terms of unboxed types but, since they're always strict, it's less flexible than using the normal "boxed" types (e.g. many common library functions won't work for unboxed types). Hence we usually rely on the compiler to unbox values for us.
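GHC does expose the unboxed machinery directly via the MagicHash extension; e.g. we can build an Int by hand from its unboxed payload:

    {-# LANGUAGE MagicHash #-}
    import GHC.Exts (Int (I#), Int#, (+#))

    -- Add two unboxed machine ints, then apply the I# constructor to get
    -- an ordinary (boxed) Int.
    addUnboxed :: Int# -> Int# -> Int
    addUnboxed x y = I# (x +# y)

    main :: IO ()
    main = print (addUnboxed 40# 2#)  -- 42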

In practice, the biggest effect on space and time usage will be whether or not fusion is triggered. After that, I'd guess it's the difference between ByteString's fiddling-with-indices/chunks versus String's single-character-at-a-time rearrangement of pointers (even if they're reasonably contiguous).

Regarding the pattern/implementation distinction, I also agree. I was originally going to say something about "view patterns", which we can use to pattern-match on ByteStrings as if they were linked-lists. AFAIK Clojure does this too, using arrays of a reasonable length to implement its cons-based lists.

One tricky aspect of that is how we treat polymorphism. One of the only redeeming features of Haskell's default String type is that we can use it with any generic list functions we like; if we want to change the underlying representation of String, we would need to change the way all of our linked lists work, which might be less ideal for large element types. There are ways to avoid this, but they introduce complications; e.g. writing against "any Traversable, Foldable Functor", rather than just lists. That's more generic, but can also be more confusing (especially for newcomers)!
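For example, a function written against Foldable works unchanged on String (i.e. [Char]) as well as on other containers:

    import Data.Foldable (foldl')

    -- Counts elements satisfying a predicate, for any Foldable container.
    countMatching :: Foldable t => (a -> Bool) -> t a -> Int
    countMatching p = foldl' (\n x -> if p x then n + 1 else n) 0

    main :: IO ()
    main = do
      print (countMatching (== 'o') "foobar")  -- 2
      print (countMatching even [1, 2, 3, 4])  -- 2
      print (countMatching (> 0) (Just 5))     -- 1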

This is a very nice write up, maybe worthy of becoming a blog post!

Great write up! It deserves to be put in a blog post of its own!

Heh, I tend to use HN comments as inspiration for writing blog posts. This one's now at http://chriswarbo.net/blog/2020-06-08-haskell_strings.html
