
As a guy who casually (definitely not in-depth) evaluated working with Haskell about 2 years ago, I'd say the most relatable part of the article for me is: complexity in tooling and unnecessary tension when working with libraries. Also the String situation (several types of them).

A lot of the other points could be addressed, but the community's seeming unwillingness to tackle everyday productivity sends me the message: "Keep out!" :(

As a comparison, Elixir's Mix and Rust's Cargo are amazingly ergonomic build systems and provide all the tooling you need to work with the languages and your projects.

And hey, don't kill me. As I said, I reviewed it casually for several days while comparing it with Rust and OCaml. They aren't the easiest to start with either, but Rust's tooling is excellent, OCaml's is also excellent but not immediately obvious (their docs have improved a lot), and Haskell's is... kind of there, as the article points out.



The tooling situation has rapidly improved in my opinion, nowadays ghcide [0] (a language server implementation) is really great and easy to set up.

I really do hope that not too many people feel like the message is "keep out". In recent years the community has generally strived to be welcoming to beginners and share the excitement of learning Haskell, without the arrogance.

The aforementioned tooling improvements were only possible because a bunch of excellent people did exactly what you mention: tackle everyday productivity and pleasantness by improving IDE integration deep into the stack, all the way to the information the compiler makes available to the tooling.

If you want to get a fresh impression, ZuriHac [1] is happening (remotely) this weekend, and there is an online class for absolute beginners held by the excellent Julie Moronuki (of haskellbook.com fame).

Disclaimer: I'm very biased, because I'm a co-organizer, but I think it will be great fun!

[0]: https://github.com/digital-asset/ghcide

[1]: https://zurihac.com


>> ghcide .. is really great and easy to set up

Honestly, I tried to set up ghcide because I would love to have a good IDE for Haskell, but stopped because the setup procedure is not easy at all. Or maybe I'm getting something completely wrong?

On the link you are giving, there are the following steps:

- clone a git repo

- run cabal install

- test that ghcide is able to load your code (!) Quote: "If it doesn't work, see the hie-bios manual to get it working."

- finally integrate it in the ide that you are currently using. I am using IntelliJ IDE with the Haskell plugin and apparently I am out of luck because it is not listed.


> finally integrate it in the ide that you are currently using. I am using IntelliJ IDE with the Haskell plugin and apparently I am out of luck because it is not listed.

Well, is it an IDE that supports LSP? If so, it should just work. If not, then ghcide won't work with it. Providing a language server is the whole point of what it does!


There is an LSP plugin for IntelliJ but it is not very stable. But that was not my point: maybe I'm spoiled because I'm coming from the JVM universe, but I don't agree that ghcide installation is easy. For me, "easy" means running an installer or invoking homebrew, then continuing to be productive.


Yes, fair enough. There's a lot of scope for making it easier :)


> OCaml's is also excellent but not immediately obvious (their docs have improved a lot)

Interesting! The last time I tried OPAM, it actually seemed more frustrating than Cabal! Maybe it's improved? Back then it would install packages globally by default, and avoiding that was a confusing process, when project-local installs should be the default behavior.

Cargo (while not perfect) has really nice defaults out of the box - i.e. it generates a lockfile by default, scopes packages locally to individual projects, and lets you use more than one version of the same package in the same project.

Really hoping OCaml and Haskell can improve their package management story - they are getting there, but it still holds me back from really using them on a daily basis.


A bit off topic, but is Haskell viable for production? Or is it just really intended as an experiment? Is it worthwhile to jump into Haskell now, or perhaps one should look into Idris or Agda? Any companies using Haskell where it has proven a distinct advantage?

Ocaml and SML (the former being sadly quite unpopular these days) are an order of magnitude simpler than Haskell. They are easy to master. I have been quite productive with Scala which, while being a member of the ML family, is much more intricate due to OO + FP on top of great Java interop and many Haskell-like features (laziness, monads, lenses...).

Haskell doesn't feel the same. I have learned some Haskell over the years. Since I come from a pure FP education (started with SML) and I have lots of type theory background, everything feels familiar, yet I've found it really hard to be productive or to justify the mental overhead of laziness. I have developed some small projects in Haskell, but I haven't found it a productive alternative to the languages above.


My company writes and supports a lot of custom Haskell software for our parent, an ISP up in Alaska.

What we gain is a high level of confidence in our business logic rules due to strict typing, fewer overall tests, and, by avoiding a few small gotchas, uncrashable runtimes.

Not to say it comes without problems. Poor IDE integration (I personally don’t care), questionable documentation at times, and a steep rewrite-your-brain learning curve.

Some posts I have written.

https://www.alasconnect.com/2018/10/02/introducing-haskell-c...

https://www.alasconnect.com/2018/10/04/productive-haskell-en...

http://blog.bojo.wtf/2020/04/15/is-haskell-a-bad-choice/


I'm the CTO at Mercury (https://mercury.com/) and we have 100% of our backend written in Haskell. It's gone really well so far.

Answering another commenter's question, I would say the "secret weapons" are:

1. Hiring: Haskell is a very in-demand language by very good engineers. For a startup, it's absolutely amazing for recruiting and I can't overstate how important recruiting good people is for a startup.

2. Correctness: Haskell helps you ensure your code doesn't have bugs. For a bank this is fairly important. Some aspects of Haskell that are great for this: algebraic data types ("discriminated unions" in some languages) model business logic quite well, purity makes a function much more of a known quantity, libraries like Persistent ensure type-safe SQL queries, and in general the community cares about correctness.
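
For illustration, here's the kind of thing I mean by ADTs modeling business logic. A hypothetical sketch, not our actual code ('Transfer' and its fields are made up):

    {-# LANGUAGE LambdaCase #-}

    -- Hypothetical domain type. Illegal states are unrepresentable: a
    -- transfer can't be both settled and rejected, and a settled one
    -- must carry its amount.
    data Transfer
      = Pending
      | Settled Int       -- amount in cents
      | Rejected String   -- reason

    -- With -Wincomplete-patterns, GHC warns if a constructor is added
    -- and a case expression like this isn't updated.
    describe :: Transfer -> String
    describe = \case
      Pending         -> "in flight"
      Settled cents   -> "settled for " ++ show cents ++ " cents"
      Rejected reason -> "rejected: " ++ reason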

I haven't used the other languages you mentioned, so I can't really speak to a comparison with them. Re: laziness, it's almost never something I think about in practice.


> 1. Hiring: Haskell is a very in-demand language by very good engineers. For a startup, it's absolutely amazing for recruiting and I can't overstate how important recruiting good people is for a startup.

Although for engineers, supply-and-demand favoring employers typically means it disfavors employees: you may effectively be taking a pay cut to use Haskell compared to more popular languages.


Sure, but the fact that some candidates are willing to do so because they're sufficiently excited about the language means you can have higher confidence in your hires, since they A. know what Haskell is, B. want to learn it, and C. are willing to take a pay cut to get paid to learn/use it. Very high signal to noise, I'd wager.

If you're a Java shop, you better be paying higher than market rate if you want to be able to attract good talent, but then you still have to figure out who they are amidst all the mediocre/poor devs, who are interested because of the higher comp. Very low signal to noise.

It seems like a less popular language, that has real business value, helps achieve the thing every company that isn't "hire fast, fire fast" tries to do with their hiring policy. It does mean you can't easily hire people who already know the language, so you have to consider the ramp up time, but I don't think that's nearly as painful as many hiring managers seem to think it is.


Well yeah, it seems you're reiterating that it's great for employers, not for employees. I don't disagree.

Of course if Haskell somehow takes off and all the FANG companies want to hire as many Haskell engineers as they can, then the situation would probably reverse.


I'm more calling out that it's mutually beneficial; the employees choosing that tradeoff are still choosing that tradeoff. The other tradeoff, working in a higher-compensated position in a language that they hate, isn't objectively "better", just different.


The labor market is not zero-sum.

If Haskell is just a signal that selects for more competent people, companies will be happy to hire people whose competence they would have to invest a lot to match otherwise, and developers will be happy to get a larger salary than they would be able to get otherwise. (But yes, this would disfavor people with other kinds of signal.)

If it selects for high productivity people, or if it helps people be more productive, companies will be happy from getting more output than they would otherwise, and developers will still get more money.

The stereotype is that Haskell jobs pay less, and developers are happy to work with a better language. That still doesn't disfavor any party, but I don't think anybody has any evidence that this is the case. Notice that the GP is talking about people's competence, not salary.


You really think so? Take, for example, kdb+/q. Not exactly widely used, but engineers who specialize in it are making top dollar; it's not unusual for total comp to be around a million dollars, or sometimes a lot more, if you're the principal architect of a new system.


You're just saying that kdb has the opposite imbalance as Haskell (relative to mainstream languages).

If supply and demand for kdb favours employees, then presumably it disfavors employers, who would rather have a bigger pool of kdb programmers so they could pay less.

That doesn't contradict what I said about Haskell vs mainstream languages. I didn't say all niche languages are like Haskell.


To answer your questions literally without much nuance:

Yes, Haskell is viable for production. There are caveats, as with any other language. Haskell has more caveats than most other languages you would be familiar with, but it's nonetheless perfectly suitable in the right environment.

No, don't use Agda or Idris in production, they're totally unfit. Agda is not intended to be. Idris may be but it will take 5 or 10 years.

The only members of the ML family that I'm aware of are Ocaml and the flavours of Standard ML. Only Ocaml is fit for production.


> The only members of the ML family that I'm aware of are Ocaml and the flavours of Standard ML. Only Ocaml is fit for production.

F# is also a member of the ML family and is totally fit for production.


F# is the language I want to like, and would like to use in prod, but it constantly feels like the ignored step-child.

The REPL is great in theory, but trying to import packages is a nightmare every time I try.

Dependencies exist, but they all seem to target a wide and inconsistent range of the ecosystem. Want to target dotnetcore 3.1? Good luck have fun: everything you find useful is targeting some combination of standard/framework/core/whatever confusing variant Microsoft decides to come up with this week.

If I somehow convinced my co-workers to adopt a language that wasn't C#, I'd point them at Rust or Haskell instead.


Most of those package issues also apply to C#, particularly as it stabilizes towards .NET 5. However, it isn't too bad once you get it, and for most standard use cases it seems to work with the dotnet CLI out of the box. Many people seem to be productive with C#, so I'm not sure how big of an issue it is beyond the small initial learning curve, and many package managers have that (IMO Maven is harder, yet many people use it in the Java space, and it's not too hard once you learn it).

REPLs often have problems with package management in a number of languages, as they often have a different build chain from compiled-style apps, which get the chance to resolve packages. When doing .NET Core in the past, I know Paket can generate scripts for the REPL to import packages (I've seen csx/fsx scripts for each package) - but another comment alludes to a more supported way in the future, which is good.

Doesn't seem like you have a problem with the language per se; rather the package management story for .NET Core.


I totally agree with your points but the situation seems to be improving. With https://devblogs.microsoft.com/dotnet/announcing-f-5-preview... you can import nuget packages in fsx scripts. The dependency situation also seems to grow saner every month.

Nevertheless I would still wait another few months for things to stabilize with .NET 5.


I think that F# is ok for production. I used it at a previous job without too many pain-points, especially in comparison to Haskell. I'm not a huge fan of MSBuild, but it works relatively well, and of course you have access to the massive .NET library and the entire ecosystem of C# libraries out there, and there's even some tooling to make the C# stuff play nicely with the F# constructs.


I totally forgot F# was part of the ML family, but can't edit my comment now! Thanks to everyone who replied to correct me.


There's ReasonML, which compiles to JS and seems intended for production


Reason is a syntax layer on top of OCaml, so it shares OCaml's production-readiness. The tooling that Reason uses to compile to JS (Bucklescript) was originally designed for OCaml, and Reason supports native targeting via the plain OCaml compiler backend as well. It's a pretty pleasant ecosystem at the moment.


In the same space we also find Fable, which hooks into Babel to compile F# to JS. We're using it in production and it's been surprisingly good. It's very nice to be able to share the definitions of data structures between the C# backend and JS (via F#) front-end.


ReasonML serves no purpose whatsoever. The syntax is worse than OCaml's (though, more regular) and the editor tooling is really awful. OCaml compiles to JS just fine.


Idris 2 is a big step towards a production grade FP language


> A bit off topic, but is Haskell viable for production? Or is it just really intended as an experiment?

It's viable for production and there are businesses which use it, notably fintechs. Also Facebook for some things (Facebook employs Simon Marlow [1]). And I know this is an irrelevant anecdote, but I have a friend who has been working in several startups which use Haskell for the last few years. Like Paul Graham has argued of Lisp, Haskell really is a "secret weapon" and people who use it for production tend to love it.

EDIT: a prior version of this post claimed Facebook employed Philip Wadler, which is incorrect. Sorry! I got my wires crossed and confused Philip Wadler with Simon Marlow!

[1] https://www.linkedin.com/in/simonmarlow/


> Haskell really is a "secret weapon" and people who use it for production tend to love it

That's exactly my question. I'd be really interested in hearing about domains where Haskell is a secret weapon and how it compares to ML family languages and dependently-typed ones.


I did a bunch of professional Haskell work in a prior life[1]. Most of its 'secret weapon' status springs from the way Haskell lets you control side effects.

We had a fantastic unit testing harness[2]. With effect tracking, you can arrange for your harness to put precise fences around nondeterministic code. You don't need to rely on experience and discipline to ensure that tests are reliable or fast. The type system does it for you.
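
The post at [2] describes the real harness; as a minimal sketch of the idea, using mtl-style type classes ('MonadClock' is a made-up name for illustration, not the harness's actual API):

    {-# LANGUAGE FlexibleInstances #-}

    import Control.Monad.State (State, evalState, get)
    import System.CPUTime (getCPUTime)

    -- Code polymorphic in 'm' can only perform the effects its
    -- constraints expose -- that's the "fence".
    class Monad m => MonadClock m where
      currentTime :: m Integer

    -- Production instance: real, nondeterministic time.
    instance MonadClock IO where
      currentTime = getCPUTime

    -- Test instance: the "clock" is deterministic state, no IO at all.
    instance MonadClock (State Integer) where
      currentTime = get

    stamp :: MonadClock m => String -> m String
    stamp msg = do
      t <- currentTime
      pure (show t ++ ": " ++ msg)

    -- Fast and perfectly reproducible: evaluates to "42: hello".
    testStamp :: String
    testStamp = evalState (stamp "hello") 42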

Effect tracking also makes concurrent code easier to write. You have a great deal of control over how and when threads can be created, and what kinds of logic is permitted on those threads.

GHC's IO manager offers green threads and async IO, so rather than a select loop, you just fork threads and perform "blocking" IO calls. You get about the same efficiency as a select loop this way.
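
A toy sketch of what that looks like:

    import Control.Concurrent (forkIO, threadDelay)
    import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
    import Control.Monad (replicateM, replicateM_)

    -- Ten thousand green threads, each making a "blocking" call.
    -- threadDelay (or a socket read) parks only that green thread in
    -- the IO manager; no OS thread sits idle waiting on it.
    main :: IO ()
    main = do
      done <- newEmptyMVar
      replicateM_ 10000 $ forkIO $ do
        threadDelay 100000            -- "blocks" for 0.1s
        putMVar done ()
      replicateM 10000 (takeMVar done) >>= print . length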

Lastly, something we learned to appreciate over the years is that Haskell is easier to refactor. When you combine the rigid type system, effect tracking system, and ease of writing unit tests, it becomes very easy to refactor old Haskell. It doesn't really matter how bad the old code was or whether the original author is around to help you at all.

If you intend for your product to evolve continuously forever, this is truly a secret weapon.

    [1] https://andyfriesen.com/2014/03/25/what-its-like-to-use-haskell.html
    [2] https://andyfriesen.com/2015/06/17/testable-io-in-haskell.html


Of all the things I like in Haskell (having only used it as a hobbyist), refactoring tops the list. It's a lot less scary to refactor when the compiler tells you so much about which parts you missed.



I'll let people who actually use it -- instead of toy with it, like myself -- answer you. I know many haskellers read HN and have answered similar questions in the past.

Like I said, I do know Haskell is used at fintechs (source: a friend whose job is precisely that).


Facebook employs Wadler? Since when?


Gah! I meant to say Simon Marlow [1], co-developer of the Glasgow Haskell Compiler (GHC) [2]. Sorry for the confusion, I'll edit my other post now.

  [1] https://www.linkedin.com/in/simonmarlow
  [2] https://en.wikipedia.org/wiki/Simon_Marlow


I've got a REST API in production linking to postgres and rabbitmq and I'm very happy with the performance and maintainability. The ecosystem is a bit more bare than with other languages like node, python or java but in exchange the paradigm and type-system lets you build safe and well isolated code. In my opinion it's also quite easy to read despite being terse.

Despite the complaints the build tooling is good, and there are decent plugins for IDE capabilities although they're a little bit resource hungry.

My biggest problem is the lack of component libraries, for example I wanted to use SMTP a little while ago but couldn't find a library compatible with the latest GHC so ended up writing a python script connected to rabbitmq.

For business logic involving timezones, currencies, API calls, calculations etc. you're pretty well covered though.


I’ve used it on multiple projects for clients over the last 5 years. These projects were large, made it to production, and were successful. I love it. Once you clear the initial learning curve it feels very productive, and gives you the feeling of always having something more to learn and master.


What domains?


Hasura (one of my favourite technologies) is built with Haskell!


> Really hoping OCaml and Haskell can improve their package management story - they are getting there, but it still holds me back from really using them on a daily basis.

Cabal 3.x with Nix-style builds covers all the packaging needs I've ever had in the Haskell ecosystem [1]. And there's a new wave of tooling based on incremental Nix builds that you can begin using today [2]

[1] https://cabal.readthedocs.io/en/latest/nix-local-build-overv...

[2] https://github.com/nmattia/snack


Yeah, it's really great to see the progress there. However, afaik, it still doesn't freeze packages by default, or let you have multiple versions of the same package in a dependency tree [1]. The former can be worked around, but it's annoying that it's not the default. The latter is more frustrating however!

[1] https://news.ycombinator.com/item?id=23454711


Stack does this, if I understand what you're saying. Stack+nix has worked quite well for me for a few projects.


Yep. Cabal 3.x is great, better than anything else I've seen for any other language.

As the article says, it just needs to correct the default behavior (deprecating the old behavior is on the roadmap; interestingly, the documentation doesn't even say what the standard is right now) and make some usability improvements (just better docs would go a long way).


By default, opam installs everything into the current "switch". Typically you have one switch per compiler, but you can create one per project. You can also create a "local" switch directly inside your project directory: https://opam.ocaml.org/blog/opam-local-switches/

Another option is to use the duniverse tool (https://github.com/ocamllabs/duniverse). That downloads all dependencies to the project directory, and then builds them all together as one giant project. That makes it very easy to make changes across multiple libraries at once, while keeping fast incremental builds.

And a final option is to build projects with Docker, which allows you to have separate versions of non-OCaml packages too.


Ahh cool - had some similar questions here: https://news.ycombinator.com/item?id=23460980 - mainly, how much manual switching do you have to do? Or is it seamless, depending on what project directory you're in? I think I tried the local switches in the past and got really confused when switching projects and everything broke, thinking 'wasn't all this meant to prevent this?'.


Packages are not installed globally in opam by default nowadays. `opam switch` "enables the user to have several installations on disk, each with their own prefix, set of installed packages, compiler version, etc". https://opam.ocaml.org/doc/Usage.html#opam-switch


Oh that's nice to hear! Some questions:

Do you have to run this command manually, and does it mutate the shell state? That's one thing that frustrated me with opam in the past as well. I couldn't just jump into a directory and build a thing, then switch to another project directory - there was a lot of manual switching and unswitching of packages, if I recall correctly?

Is it possible to install multiple tools globally using opam that use disjoint library versions? Like, I might want to install Abella and Coq side-by-side, but they might have conflicting version requirements. I think I was super excited about opam 2, then tried installing one thing, only to have it break again when I installed something else.

Is it possible to have multiple versions of the same library in the same project, or does the constraint solver need to find a single solution for each library? [1]

[1] https://news.ycombinator.com/item?id=23454917


Sorry I missed your reply. I create opam switches to have different versions of the compiler, or to allow libraries that conflict to live on different branches.

1. Dune is a relatively new build system for Ocaml. You just write a configuration file, and dune will handle switches automatically. "It also supports multi-context builds, such as building against several opam roots/switches simultaneously. This helps maintaining packages across several versions of OCaml and gives cross-compilation for free." See https://opam.ocaml.org/packages/dune/

2. Yes. You can have multiple versions of a library in different switches.

3. I am not sure.


I got told some months ago to look for these docs: https://news.ycombinator.com/item?id=22395595

It seems all we want can be made to work but it does require some tinkering and being aware of the tools' options.


Maybe you tried it before v2? Imho opam now is very usable, lets you be productive for 95% of tasks after reading a 5min introduction, and generally gets out of your way.


It is in your interest to have multiple string-like datatypes in a lazy language, especially when some of them are not real strings, but rather streams of binary data.

https://mmhaskell.com/blog/2017/5/15/untangling-haskells-str...


I agree, although the arguments in that article aren't the strongest. In particular:

The main reason given for 'String' being slow is that it's immutable and hence leads to many duplicate Strings being created. In fact, the main reason Strings are slow is that their linked-list structure has a lot of overhead, and may be scattered around in memory. For example, the String "abc" will be represented something like this:

    +---+---+   +---+---+   +---+---+
    | * | *---->| * | *---->| * | *---->[]
    +-|-+---+   +-|-+---+   +-|-+---+   
      |           |           |
      V           V           V
     'a'         'b'         'c'
Those arrows are pointers, which can point to locations arbitrarily far away. From this we can see that just storing a (fully evaluated) String is expensive, since each character requires a pair of pointers (in addition to the unique characters themselves). Processing the contents of this String is also slow, since we need to dereference two pointers for each character (one to get the character, one to get the tail of the String). Since the data may not be contiguous in memory, this can make poor use of the CPU caches, which might otherwise speed up these dereferences.

Compare this to a C-style string, which would look like this in memory:

    +---+---+---+---+
    | a | b | c | 0 |
    +---+---+---+---+
This "packed" representation requires no long-distance dereferencing, the memory is contiguous so it will make good use of the cache, and we can process the characters directly without having to chase pointers.

That article describes the ByteString type as a "list of Word8 objects", but that's a bit misleading. Firstly we often say "list" to mean a chain-of-pointers like the first diagram above, e.g. that's exactly what Haskell's built-in list type is, and Haskell's String is such a list. ByteStrings store their Word8 objects more like the second diagram, so it would be better to call them an "array of Word8 objects". Secondly, ByteStrings use a clever three-layered structure to speed up common operations:

- The raw Word8 characters are stored in arrays, like the C-style string above (but without the NUL byte at the end)

- These arrays are pointed to by values called "chunks". These are "fat pointers", i.e. they also store a length and offset value.

- A ByteString itself is a list of chunks (like the first String diagram, but instead of values like 'a', 'b' and 'c' the values are chunks).

This makes ByteString really fast, for three reasons:

- Some operations can be performed by just fiddling at the list-of-chunks level. For example, to append X and Y, the result is just a list of X's chunks followed by Y's chunks; no need to touch the underlying chunks or arrays.

- Some operations can be performed by just fiddling the length and/or offset of a chunk. For example, incrementing a chunk's offset will chop characters off the start (they're still stored in memory, but they will be ignored); decrementing a chunk's length will chop characters off the end.

- The same underlying Word8 arrays can be pointed to by many different chunks in many different ByteStrings; i.e. we rarely need to copy the actual characters around.

As an example, let's say we read the ByteString "[\"foo\", \"bar\"]" (without the backslashes), we parse it as JSON to get a list of ByteStrings ["foo", "bar"], then we concatenate that list to get the single ByteString "foobar", here's how it might look in memory:

                         BS--+---+   BS--+---+
    "foobar":            | * | *---->| * | *---->[]
                         +-|-+---+   +-|-+---+
                           |           |
                      +----+           +----------------+
                      |                                 |
                      |  List+---+      List+---+       |
    ["foo", "bar"]:   |  | * | *------->| * | *---->[]  |
                      |  +-|-+---+      +-|-+---+       |
                      |    |              |             |
                      |    |              |             |
                      |    |              |             |
                      |    V              V             |
                      | BS--+---+       BS--+---+       |
    "foo" and "bar":  | | * | *---->[]  | * | *---->[]  |
                      | +-|-+---+       +-|-+---+       |
                      |   |               |             |
                      |   |               +---+         |
                      |   |                   |         |
                      V   V                   V         V
                     Chunk---+---+           Chunk---+---+
    (chunks)         | 3 | 2 | * |           | 3 | 9 | * |
                     +---+---+-|-+           +---+---+-|-+
                               |                       |
           BS--+---+           |                       |
    Input: | * | *---->[]      |                       |
           +-|-+---+           |                       |
             |                 |                       |
             V                 |                       |
             Chunk+---+---+    |                       |
    (chunk)  | 14 | 0 | * |    |                       |
             +----+---+-|-+    |                       |
                        |      |                       |
                        |      |                       |
                        |      |                       |
                        V      V                       V
            Array---+---+---+---+---+---+---+---+---+---+---+---+---+
    (array) | [ | " | f | o | o | " | , |   | " | b | a | r | " | ] |
            +---+---+---+---+---+---+---+---+---+---+---+---+---+---+
    
(Note that the exact position where arrows "land" doesn't matter; they're pointing to the entire datastructure they "hit")

Here we can see that there's only one copy of the underlying text; that the substrings 'foo' and 'bar' are simply chunks with length 3, offset by appropriate amounts; and that the resulting 'foobar' ByteString is just a list of these two chunks.
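
As a runnable sketch of the same idea, using the lazy ByteString API ('take' and 'drop' share the underlying arrays rather than copying, and 'append' only concatenates chunk lists):

    import qualified Data.ByteString.Lazy.Char8 as BL

    main :: IO ()
    main = do
      let input = BL.pack "[\"foo\", \"bar\"]"
          -- no bytes are copied here, just (offset, length) fiddling
          foo = BL.take 3 (BL.drop 2 input)
          bar = BL.take 3 (BL.drop 9 input)
      -- builds a two-chunk list sharing input's underlying array
      BL.putStrLn (BL.append foo bar)   -- prints: foobar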

This approach of "store things once, then do all processing with indices" can be very fast (even faster than C in some cases https://chrisdone.com/posts/fast-haskell-c-parsing-xml ). Whilst we can obviously write this sort of algorithm in any language, re-using arrays and chunks like this relies on them being immutable, which is more idiomatic in Haskell. In particular:

- Languages which tend to use mutation aren't well suited to this, since mutating one value can have unforseen consequences to those which are re-using the same components. Copying is safer and more predictable in the face of mutation.

- Languages which favour some built-in interface rather than high-level functions may need to copy values in order to comply with these interfaces. In particular, C's array syntax will work for arrays and chunks, but not for the overall ByteString structure (a list of chunks).

- If we want NUL-terminated arrays, we'll need to make copies when truncating strings, to avoid the NUL byte overwriting the truncated part.

- Re-using values can make it hard to keep track of which parts can be freed. Garbage collection (and other approaches, like linear types, borrow checking, etc.) can make this easier.

The difference between lazy and strict ByteStrings is just whether the overall list-of-chunks is lazy (chunks generated on-demand, useful for streaming) or strict (chunks are generated up-front, closer to using one big array, hence potentially faster and more predictable).
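
For example, something like this should run in roughly constant memory with the lazy variant, since chunks are read on demand and can be GCed once consumed ("big.log" standing in for any large file):

    import qualified Data.ByteString.Lazy as BL

    main :: IO ()
    main = BL.readFile "big.log" >>= print . BL.length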

The Text type is just a ByteString paired with a particular encoding method, e.g. 'UTF-8'.

The article you link also talks about fusion, claiming that's why Text (and hence ByteString) is faster, or avoids intermediate allocations. Fusion is great, but care must be taken if we're going to rely on it; e.g. very minor changes to an expression (say, to make debugging easier) can stop things from fusing, greatly changing the compiled code's speed and memory usage.

On the subject of fusion, it's worth noting that Haskell's built-in list type is also subject to fusion, often resulting in zero allocation (e.g. generating a list of characters, then counting them, might compile down to a single loop with a single counter variable, and no Chars/Lists/Strings in sight!). Again, it can be tricky to ensure that this happens. One approach is to test for it with something like https://hackage.haskell.org/package/inspection-testing


Sounds a bit like the Rope data structure from 1995: https://en.wikipedia.org/wiki/Rope_(data_structure)

Although instead of a list of chunks, a rope is an immutable binary tree of chunks, so more editing operations are efficient, eg insertion or deletion in the middle of the rope.

Abseil library contains an open source C++ implementation of a similar structure called a Cord: https://github.com/abseil/abseil-cpp/blob/master/absl/string...


I believe both Firefox and Chrome now use Ropes as their string implementation due to those rather desirable editing operations.


Does this approach avoid the issue of substrings preventing the strings they refer to from being garbage collected (e.g. as in Java's substring() before JDK 7)?


AFAIK ByteString suffers this problem: substrings will prevent their whole array from being GCed.

By default, data read from files, stdin, etc. is split into 64k chunks (e.g. see functions like 'hGetN' in http://hackage.haskell.org/package/bytestring-0.9.2.1/docs/s... ), so only the 64k chunks containing the parts we want will be kept.

There is a function 'copy' which will explicitly return a copy of only the character range that's used, so the original can be GCed ( https://hackage.haskell.org/package/bytestring-0.10.10.0/doc... )
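
For instance (a small sketch; 'trim' is my own helper name):

    import qualified Data.ByteString as BS

    -- Keeping 'BS.take 10 big' directly would pin big's whole 64k
    -- chunk in memory; copying the slice lets that chunk be GCed.
    trim :: BS.ByteString -> BS.ByteString
    trim big = BS.copy (BS.take 10 big)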


Thanks for the clarification. BTW I agree with tome - your comment above would make a great blog post.



> In fact, the main reason Strings are slow is that their linked-list structure has a lot of overhead, and may be scattered around in memory.

This is almost entirely a myth in the context of Garbage Collected languages. Most popular GC'd languages use a copying and compacting garbage collector. They should follow the pointers when copying and put the list together in one place. Furthermore, if the compiler were to mark the list as a contiguous memory region of the same type (data/data pointer and one pointer to the next link), it could jump directly to the Nth element with the same performance as a real array (assuming you do bounds checking). The only significant difference is when you get long lists and, due to the doubled size, you can't fit them in a cache line, but that's not a huge issue for most lists or arrays. If cache lines are actually that much of an issue for most programs, then b-trees are a much, much larger issue as they tend to hold more data for longer and necessarily can't be lined up in access order.

I believe another factor is too much dynamic experience. With a dynamic language like lisp, list items generally won't be carrying their own types. Instead, the list will be pointers to a boxed value where the box contains type info. Implementing in a language like Haskell with the requirement of static types should (in theory anyway) decrease the need for unboxing and radically improve performance (I believe common lisp type hints also reduce the need to box everything).

It is also possible to have a linked list pattern, but not a linked list implementation. Javascript is a good example here. You can add/remove at both ends and even splice to add/remove in the middle too. For years, everyone implemented all the lists as linked lists. Today, they mostly use arrays with fill pointers. Adding/removing at the beginning and middle takes a performance hit, but the much more common pushing to the end does not (until you exceed the array size). You can add a start pointer offset to an array and decrease penalties for pushing to the beginning too.


You're right that there are higher-order effects at play, but I was careful to hedge my bets (e.g. the sentence you quoted says "may be").

Besides a GC re-arranging/compacting things, the elephant in the room for Haskell is laziness, which introduces another level of indirection. I tactfully avoided this by limiting what I said to "fully evaluated" strings.

Lazy values start out as "thunks" (effectively zero-argument closures); if their value is needed (e.g. if we're branching on it), the thunk is evaluated to "weak head normal form" (i.e. until we reach a constructor we can branch on); the thunk in memory is replaced by this evaluated version, so it won't need to be re-evaluated if it's accessed again.

There are a few ways this can complicate things: thunks can prevent values from being garbage collected; layers upon layers of thunks can cause "space leaks" (where the order we evaluate things is correct, but inefficient; not to be confused with "memory leaks", which is where incorrect memory usage prevents it ever being freed); evaluating in order of demand can jump around inside different values, causing their allocations to be interleaved, and hence fragmenting their memory usage (until a compacting GC is able to defragment them).

As an extra complication, Haskell compilers like GHC will perform "strictness analysis" to find values whose evaluation is inevitable, and hence can be evaluated immediately rather than making a thunk. Strictness annotations can also be added to the code.
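
The classic demonstration of both points is a lazy left fold (a sketch; at -O, strictness analysis usually rescues the lazy version, which is why such leaks can appear or vanish with optimisation flags):

    {-# LANGUAGE BangPatterns #-}

    -- Builds a chain of (...((0+1)+2)+...) thunks: memory use grows
    -- with the list until the result is finally demanded.
    sumLazy :: [Int] -> Int
    sumLazy = go 0
      where go acc (x:xs) = go (acc + x) xs
            go acc []     = acc

    -- The bang forces the accumulator at each step, so it stays a
    -- single evaluated Int: constant memory.
    sumStrict :: [Int] -> Int
    sumStrict = go 0
      where go !acc (x:xs) = go (acc + x) xs
            go !acc []     = acc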

As far as types and boxing goes, GHC erases types at compile time, so values aren't technically boxed. However, most values will have several levels of nested structure, due to the way their types are defined, which amounts to the same thing. For example, in GHC the default 'Int' type is a wrapper around a machine integer, whose type is 'Int#', e.g.

    data Int = I# Int#
In other words an 'Int' value like '42' is not a boxed pair like '(IntType, 42)', like in Lisp or Python; instead it's just a bare, raw 'Int' value, but that value will still involve a wrapper pointing at a machine word, like 'I# 42#'. For this reason, the #-suffixed types are known as "unboxed".

We can write our code in terms of unboxed types but, since they're always strict, it's less flexible than using the normal "boxed" types (e.g. many common library functions won't work for unboxed types). Hence we usually rely on the compiler to unbox values for us.

In practice, the biggest effect on space and time usage will be whether or not fusion is triggered. After that, I'd guess it's the difference between ByteString's fiddling-with-indices/chunks versus String's single-character-at-a-time rearrangement of pointers (even if they're reasonably contiguous).

Regarding the pattern/implementation distinction, I also agree. I was originally going to say something about "view patterns", which we can use to pattern-match on ByteStrings as if they were linked-lists. AFAIK Clojure does this too, using arrays of a reasonable length to implement its cons-based lists.

One tricky aspect of that is how we treat polymorphism. One of the only redeeming features of Haskell's default String type is that we can use it with any generic list functions we like; if we want to change the underlying representation of String, we would need to change the way all of our linked lists work, which might be less ideal for large element types. There are ways to avoid this, but they introduce complications; e.g. writing against "any Traversable, Foldable Functor", rather than just lists. That's more generic, but can also be more confusing (especially for newcomers)!


This is a very nice write up, maybe worthy of becoming a blog post!


Great write up! It deserves to be put in a blog post of its own!


Heh, I tend to use HN comments as inspiration for writing blog posts. This one's now at http://chriswarbo.net/blog/2020-06-08-haskell_strings.html


Sure, same as Erlang has "binaries" (which can be C strings or literally any binary data as you said) and Elixir built on top of that to give us UTF-8 strings. But, you know, only these two. No more.

I quite dislike when languages start choking me with analysis paralysis. There have to exist sensible defaults!


Erlang is not lazy by default. If it were, you'd want at least two more types to be generally available in its standard library.


> (several types of them).

I find this criticism very bizarre. C++ and C have several different string types, as does python. This comes down to the fact that what programmers think are strings are often not really and vice versa.

To be clear, a 'string' is a sequence of characters or glyphs meant for human consumption or that are produced by a human as input. A string is not a sequence of bytes. C++ and C get this really wrong, and we pretend char* or std::string is actually a string, but they're not -- they're sequences of bytes. The best string in C is wchar_t* and the best string in C++ is std::wstring or even better a string from the ICU library. std::string completely glosses over the various encodings of real strings.

Python gets the string situation much better. str is for strings, and bytes is for bytes. str can contain actual human characters, and bytes can't (unless you encode a str to bytes properly).

From this perspective, Haskell really has only one string type: Text. ByteString is actually just a sequence of bytes, and it's an accident of history that Roman characters fit into bytes.

String is an iterator over human readable characters, akin to str in python or std::wstring::iterator in C++.
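
To illustrate where the boundary sits in Haskell (a minimal sketch using the text and bytestring packages; 'shout' is a made-up helper):

    import qualified Data.ByteString as BS
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE

    -- Bytes in, text out: decodeUtf8 is the one place an encoding is
    -- chosen, and the one place a decoding error can occur.
    shout :: BS.ByteString -> T.Text
    shout raw = T.toUpper (TE.decodeUtf8 raw)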


You might find this a bizarre criticism, but in Java, Ruby, Elixir, and Rust, strings are all Unicode, with escape hatches for manually handling byte arrays (that could be C strings).

In that regard I still find it extremely puzzling that some languages / runtimes still struggle with only giving you two choices: byte arrays and UTF-8 strings. I really don't know what's so hard about it. As mentioned in another comment in this [now pretty big] sub-thread, this is one of the things I found off-putting in OCaml which is otherwise an excellent language.

As for Haskell, it has been about 2 years since I last looked at it. I might be spoiled by the languages I enumerated but I just found the docs [at the time] lacked proper nuance to give you a way to immediately make a choice. Dunno. It's not only the strings though; Haskell's community focuses so much on certain aspects that basic ergonomics get neglected.

I am willing to retry getting into it one day. Just shared my general impressions which are very likely outdated.


> You might find this a bizarre criticism, but in Java, Ruby, Elixir, and Rust, strings are all Unicode, with escape hatches for manually handling byte arrays (that could be C strings).

In Java, there is a 'String' class that is used for strings most of the time, but there is also char[], which would be confusing to many C programmers. It's only familiarity and ubiquity that make this choice 'obvious' for most devs

> In that regard I still find it extremely puzzling that some languages / runtimes still struggle with only giving you two choices: byte arrays and UTF-8 strings.

To be clear, Haskell's strings and Text are not UTF-8. They are containers that can hold any unicode character, but the particular way in which they're stored is not really guaranteed. I believe Text is UTF-16 internally, and String, being based on Char, uses full UTF-32.


> In Java, there is a 'String' class that is used for strings most of the time

Nearly all the time. The only time you're likely to see a char[] is in high-performance code that needs to mutate strings in a space-efficient manner.

I would also mention CharSequence, an interface implemented by several string like classes.


> You might find this a bizarre criticism, but in Java, Ruby, Elixir, and Rust, strings are all Unicode, with escape hatches for manually handling byte arrays (that could be C strings).

Java's strings have noticeable misbehaviour for astral characters, and since the language is built around a single hardcoded string type, almost all Java programs inherit this. Ruby had to have a major migration to fix its string support. Rust is much, much younger than Haskell (so it hasn't had to go through a migration yet) and is famous for having at least 6 different string types.


Astral characters?


Unicode characters that lie outside the Basic Multilingual Plane.


Another great comment on this thread which could be a blog post of its own :)


> complexity in tooling and unnecessary tension when working with libraries

Yep. I’m currently feeling this. I’m thankful I’m not a complete beginner and thankful for the Haskellers that blog their opinionated experiences. But this is such a roundabout way of learning.

It’s also almost unfair to compare things to Rust’s Cargo (and Elixir’s Mix). They set the bar pretty high IMO.

Haskell would be more widely used if the ecosystem was more cohesive (perhaps a bit more centralized).

I’ve been enjoying Haskell enough that I’d like to help improve things.


> Haskell would be more widely used if the ecosystem was more cohesive (perhaps a bit more centralized).

> I’ve been enjoying Haskell enough that I’d like to help improve things.

I am hatching an idea along these lines. How do I get in touch with you? Alternatively you can email me at the address here if you like: http://web.jaguarpaw.co.uk/~tom/contact/


Gonna write anything publicly about whatever you're working on? If I'm correct, you've already made Opaleye, which is really great; I'd be interested to hear if you've got a new Haskell project coming up.


Thanks, glad you like Opaleye! I would like to start a project to improve programmer experience in Haskell. It's not ready to go public yet but if you want to know more then feel free to email me: http://web.jaguarpaw.co.uk/~tom/contact



