How to split a string into an array in Bash? (2012) (stackoverflow.com)
59 points by tetris11 on July 28, 2021 | 50 comments



The suite of tools you're typically working with in bash (i.e. bash built-ins and basic stuff that comes on most linux systems) doesn't like arrays very much. It's generally more productive to split things into multiple lines and/or insert delimiters so that you can play to the strengths of the tools you have at your disposal.

Everything is basically either a formatted input or a formatted output at the end of the day. Trying to shoehorn things into data "structures" that enable efficient access rarely makes sense, because of all the processing you need to do to get the data there in the first place; it winds up being more efficient to just deal in formatted streams of data, even if that means some ugly nested loops and function calls.


This! I was doing some work with a colleague and he was showing off some bash knowledge. I know a bit, but my experience is limited since my last job was at a Windows shop.

We did a bunch of stuff with “xargs”. He said it was the most misunderstood command. One of the cool things was the ability to parallelize the work with a simple -P flag.

You do a find, make sure to use the -print0 flag, pipe it into xargs with the -0 option, then run a command over all the files. If you add -P it magically becomes parallel.
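
For example, a sketch of that pattern (the directory and file pattern are made up for illustration):

  # gzip every .log file under /var/tmp, four jobs at a time;
  # -print0 and -0 keep filenames with spaces or quotes intact
  find /var/tmp -type f -name '*.log' -print0 | xargs -0 -P 4 gzip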


I really hate that there are no standard flags, so you have to remember that find uses -print0 while xargs uses -0 and something else might use -null.

I spend half my time trying to find the right arguments, and the whole time wishing I had just bitten the bullet at the start and used a real programming language.

And god forbid a filename has a quote character or a space in it. Then you have to think 5x as hard about what you're doing.


xargs and awk are two commands I really regret using as little as I do.


I know awk (well, gawk) pretty well and I still regret not having just learned perl.


While I agree that arrays are usually not the thing you want in bash, in the end this feels a bit like having to adapt the job to the tool instead of being able to select the right tool for the job. Both principles have their place, of course, but having to tell people that something as common as an array is not really available in this widely used environment is like telling a carpenter to avoid nails and glue altogether because this particular workshop only uses screws and a non-electric screwdriver. Perhaps not the most appropriate metaphor, but you get the point: mistakes will be made and time will be wasted. Because 'formatted stream of data' really means 'any tool can output a differently, arbitrarily formatted stream, so if you need a field it is up to you to figure out how to get it, every single time, because no one can remember this for all possible tools anyway'.


Agreed! I love bash and use it all the time, but if some solution gets to the point of needing actual data structures, that's when you know it's time to move to a real programming language.


Does anyone get the "Accept all cookies" stackexchange modal popup every time? And if you go into Customize settings, you're faced with a dark pattern: the "Accept all cookies" button appears AGAIN, in the spot where you'd expect "Confirm my choices" to be.

Honestly I'm flabbergasted that stackoverflow is willing to engage in such unethical, immoral behavior on their very first customer-facing interaction.


I get it all the time when I don't use blockers, but what I've seen when I choose to customize is a set of toggles where the additional cookies (I don't recall the details, but they're two of a total of three) are turned off. The size of the dialog on desktop is a huge annoyance.

When I use a browser with blockers, the experience is better.


I also feel like that, but I'm pretty sure it's just because we get a fresh one every time we visit a different stackexchange subdomain.


My complaint is more that they're willing to "deface" their website with a dark pattern like that.

The frequency just serves as a helpful reminder of this significant taint on their ethics.

It makes me wonder about the people that work at stackexchange.


I honestly don't think it's in bad faith, it does make some sense to treat subdomains separately for cookies.



Each of StackExchange's sub-sites has its own cookie consent management. That said, once I make my choices initially on each one, I haven't had any issues with them popping up again (I'm also signed in, which could impact things).


Yes, me too. I also wondered if it was a persistent attempt to get me to accept all, or my tracking blocker interfering.


I am at peace with my cookies.


It is this kind of thing that makes me want to avoid Bash every time. Something that should be trivial has no general consensus about how it should be done and, on top of that, many pitfalls.


Or it is a sign that if something is this complex, perhaps you shouldn't be doing it in the language to begin with. AKA, a code smell.

Bash is tiny, flexible, and highly effective at what it was designed to do: shell scripting, which is effectively user OS commandline automation.

It would be akin to me complaining that my microwave doesn't actually prepare a dinner for me, it just heats things up.


Bash is indeed very limited in some respects (and confusing/quirky in others, etc.).

Regarding the multiple ways though, I think it's a matter of good judgment. Depending on the input, one may choose different tools - one should not forget that Bash is a glue language, intended to make tools work together.


This whole class of problems, where there's no general consensus about how to achieve something, makes it incredibly difficult to train juniors who still have a one-correct-answer mindset. It makes me wonder how any of us got past it in the Java world, where there were no fewer than three commonly-accepted ways just to represent time.

Declarative languages, like SQL or (dare I say) Terraform, seem to be good teaching tools in that regard.


I meant that trivial problems should have more consensus, but I find that even the smallest of them turn out to be controversial when you ask about them.


> Something that should be trivial has no general consensus about how it should be done and, not only that, many pitfalls.

Oh, boy. Let me tell you about CSS.


Oh yes, I know the classic "center something" issue from before flexbox came along.


Looks like the correct answer is to call a python script (or similar). I'm impressed by the bash knowledge of the replier, but yikes. I stand by my rule to abort from bash as soon as anything less trivial than an if statement is required.


I made this mistake over the last few weeks.

Needed to build a custom R Shiny server with CI/CD, typical build and deploy. I start writing the initial install script which is in Bash.

I didn't take a step back when I started writing the CI/CD or the customised tooling to automate the deployment.

Now I'm a thousand lines deep across two pipelines and I'm equal parts impressed and terrified of the bash monster I've created.


Sometimes it's nice to have a pure bash script if you don't know your environment.

If the script is simple enough, it might be more worthwhile to learn and write it as a bash script than to test whether Python is installed, whether it's the right version, etc.
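
As a rough sketch of the kind of environment check a bash-only script avoids (the version cutoff is arbitrary):

  # is a suitable python3 even available here?
  if command -v python3 >/dev/null 2>&1 &&
     python3 -c 'import sys; sys.exit(sys.version_info < (3, 8))'; then
    echo "suitable python3 found"
  else
    echo "no suitable python3; staying in pure bash" >&2
  fi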


And that's why I switch to a sane language like python once my shell scripts get longer than a few lines.

But I would love a shell with a sane language. At the moment I am using zsh + oh-my-zsh because I cannot let go of the autocompletion I get. I tried some other shells like oil shell, ion shell, nushell and elvish, but sadly the completion is just not there yet. The only shell with maybe even better completions I came across is fish, but I don't love the language; while it seems better than sh/bash to me, I'd much rather have something more similar to ion shell with stronger typing.

Thinking about command completion, this seems analogous to the problem editors solved with the language server protocol. Is there something like a command completion server?


The downside to Python is people start screeching if you use subprocess to call executables, and dammit, sometimes I don't want to find and implement the API in Python - I just want to run the program that already does that for me.

My personal flip point is error handling. If errors aren't important, shell. If they are, shell with `set -e`. If errors are important but also shouldn't immediately kill the script from one failure, Python.
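
A minimal illustration of that middle option:

  #!/bin/bash
  set -e                # exit as soon as any command fails
  false                 # fails with status 1, so the script stops here
  echo "never reached"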


Who is screeching about it? Why care? I use subprocess where it makes sense. That's good enough for me.


> But I would love a shell with a sane language

Doesn't everyone :) Piping typed objects instead of text (notably Powershell) solves quite a few general bash issues; unfortunately Powershell isn't exactly a sane language.


The long rebuttal from bgoldst fails to answer the question of how to solve this "in bash" when he introduces additional commands like tr(1) and sed(1). You should avoid using additional programs to perform actions where bash builtins can do the job. The extra overhead and impact on runtime of context switching to load in a new program is non-trivial if you have to loop over it thousands of times. It's better to normalize the string data for use with 'read' using builtin string substitution.

  $ string="Los Angeles, London, Belfast, New York"
  $ IFS="," read -r -a array <<< "${string/, /,}"
  
  $ echo ${array[0]}
  Los Angeles
  
  $ echo ${array[1]}
  London
..etc.

Don't have the free time today to read the rest of it unfortunately.


> The extra overhead and impact on runtime of context switching to load in a new program is non-trivial if you have to loop over it thousands of times

The specification is: "speed does not matter".

The long answer addresses this solution:

  $ string="Los Angeles, London, Belfast, New York"
  $ IFS="," read -r -a array <<< "${string/, /,}"
  
  $ echo ${array[0]}
  Los Angeles
  
  $ echo ${array[1]}
  London
as "not very generic" in point #3, which is correct. Bash simply doesn't support generic splitting by itself (things go downhill quickly once, for example, newlines are introduced, and so on), and if precision/flexibility are priority over speed, then it's better to use standard linux tools.


If you have newlines present then process the data a line at a time, as you would if reading from a file. This is nowhere near as difficult or cumbersome as you're making out.
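
For instance, a sketch combining line-at-a-time reading with the splitting from upthread ("cities.txt" is a hypothetical input file):

  # the outer read walks the lines; the inner read splits each one,
  # so newlines in the data are no longer a problem
  while IFS= read -r line; do
    IFS="," read -r -a fields <<< "${line//, /,}"
    printf '%s\n' "${fields[@]}"
  done < cities.txt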


One certainly can, but the increase in complexity shows that Bash stops being the most effective tool when performing tasks it isn't designed for (and compared to a full-blown programming language, there are many such tasks).


The pure bash solution has overhead too. If you need to split 1000 strings it will create, write, and read 1000 temp files. Depending on your hardware and file system, that's more expensive than creating 1000 or 2000 processes.

I would worry about making it correct before making it fast, the former being a big challenge!

Shells Use Temp Files to Implement Here Documents: http://www.oilshell.org/blog/2016/10/18.html

(Oil doesn't do this; it creates a process for here docs without touching disk. In theory this could be eliminated for here docs less than PIPE_BUF, which is probably a lot of them)
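
On Linux you can check what backs a here string yourself (behavior depends on the bash version; 5.1 started using pipes for here documents smaller than the pipe capacity):

  $ readlink /proc/self/fd/0 <<< "hello"
  # prints something like "/tmp/sh-thd.k2rTscQihD (deleted)" on older bash,
  # or "pipe:[421337]" on bash >= 5.1 for small here strings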


It's the read/readarray builtin that's creating the tmpfiles and that's not great, but the string substitution doesn't. My point was there's no need to call out to another program to do something that bash is capable of doing itself.


No, it's the here doc, including here strings. See the blog post, which doesn't use read or readarray.


Understood. I did some quick tests myself and see what you mean.

Here's a version that doesn't use "here string" and so doesn't create temporary files.

  #!/bin/bash

  shopt -s lastpipe    # run the last stage of a pipeline in the current shell

  string="Los Angeles, London, Belfast, New York"
  # note: echo's trailing newline ends up in the last element; printf '%s' avoids that
  echo "${string//, /,}" | readarray -d, -t arrayA
  echo ${arrayA[0]}
  echo ${arrayA[1]}
Also, the lastpipe option runs the readarray in the context of the current process (it only takes effect when job control is off, as it is in scripts).


For me, "I need an array" is a clear sign that the script should be done with Awk, Perl, Python, etc.


Sadly shell is not very good at string manipulation :-/ I would pipe the string to something like this and read it back into an array of lines (readarray):

    python3 -c 'import sys; print("\n".join(sys.stdin.read().split(", ")))'
Or you can use sed and use read -d $'\x01':

    echo -n "$mystr" | sed $'s/, /\x01/g'
That will handle newlines but not the 0x01 byte.

I think really a shell should have the ability to iterate over bytes/code points reasonably efficiently to do arbitrary string processing. Python isn't great at this either, since it creates a lot of 1 byte string objects.


This is straightforward in the fish shell:

    set array (string split ", " $string)

Afterwards you can also use the fact that arrays and array elements act far more predictably in fish (no implicit splitting on whitespace, for example).


fish is great because it isn't POSIX-compliant. fish is also terrible because it isn't POSIX-compliant.


The comma part reminds me of a not-so-well-known feature that goes back to BSD's csh: brace expansion.

  $ echo {one,two,three}{A,B,C}{0,1,2,3,4}
  oneA0 oneA1 oneA2 oneA3 oneA4 oneB0 oneB1 oneB2 oneB3 oneB4 oneC0 oneC1 oneC2 oneC3 oneC4 twoA0 twoA1 twoA2 twoA3 twoA4 twoB0 twoB1 twoB2 twoB3 twoB4 twoC0 twoC1 twoC2 twoC3 twoC4 threeA0 threeA1 threeA2 threeA3 threeA4 threeB0 threeB1 threeB2 threeB3 threeB4 threeC0 threeC1 threeC2 threeC3 threeC4


It also works with `..` to generate ranges, e.g.

  echo {one,two,three}{A..C}{0..4}


Do any POSIX-like shells do arrays nicely?

Fish has "string split":

    $ string split ", " "Paris, France, Europe"
    Paris
    France
    Europe
Murex has "jsplit":

    » echo "Paris, France, Europe" -> jsplit ", "
    [
        "Paris",
        "France",
        "Europe"
    ] 
But neither are really Bash compatible. Is zsh any better?


I wouldn't - you know - use this, but with zsh…

     $ print -l ${(s:, :):-Paris, France, Europe}
     Paris
     France
     Europe
I won't try to answer the "is zsh any better?" question with this response either ;)


zsh has so-called parameter expansion flags that do this. For example

  string="Paris, France, Europe"
  arr=(${(s:, :)string})
  for x in $arr; do echo "<$x>"; done

  <Paris>
  <France>
  <Europe>


From one of the answers:

> set -f

And why does `set` use a flag with no argument after it? It's simple, just follow the rules:

Single dash for single letter flags

Double dash for full word flags

No dash for flag arguments

No dash for some non-flag arguments from ancient commands

Sometimes the dash is a "minus" prefix and has a matching "plus" prefix for a non-flag argument (see the sketch after this list)

Gang them all together sometimes

Git uses lots of subcommands and flags, often in surprising combinations

Not always can you gang together single letter flags

? Can you always gang together single letter flags? I'm not really sure TBH

Never gang together full word flags.

The short mnemonic "SDNNSGGN?N" is instructive here. The trouble you'll have remembering that mnemonic will remind you how much of a pain bash is.
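
For the minus/plus rule, `set` itself is the canonical example:

  set -f      # "-f" disables pathname expansion (globbing)
  echo *      # prints a literal * now
  set +f      # the "+" form of the same flag turns expansion back on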


At this point just write a small program that does what you want and invoke it from bash


Shells aren't designed for general programming, they are designed for quick and dirty tasks. They are excellent for that purpose, not because the shell is useful, but because of the large ecosystem of tools that makes the shell useful.

Virtually any problem of "how do I do X in bash?" is solved trivially by Awk. If you don't know Awk, then any number of tools cobbled together in an inefficient, janky Bash function will get the job done so you can get on with your day.
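
For instance, the thread's title question is a one-liner there (a sketch, using the example string from elsewhere in the thread):

  # split() fills array a and returns the field count
  awk 'BEGIN { n = split("Paris, France, Europe", a, ", "); for (i = 1; i <= n; i++) print a[i] }'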

If you need something robust, use an actual programming language, as "robust" things already need significant investment in development, testing, and maintenance.



