
Why you shouldn't parse the output of ls - dgellow
http://mywiki.wooledge.org/ParsingLs
======
ScottBurson
This is the big thing that disappointed and frustrated me after I had spent a
bunch of time hacking on Lisp Machines and then switched to Unix: in the Unix
world, everything but _everything_ is character strings. On the LispM, when
you called 'directory', you'd get back a list of pathname objects. All the
system interfaces were like that; it was hardly ever necessary to parse
anything -- and when you did have to, it would be in s-expression format, so
all you'd have to do is call 'read' on it.

In contrast, Unix is a Babel of different syntaxes. Every basic command like
'ls' has its own output syntax; every configuration file is in a different
syntax. (Command line parsing isn't standardized either, but that train wreck
deserves another conversation.)

In the case of the LispM all this was achieved by running the entire OS and
all apps in a single address space; this obviously made passing objects
between apps trivial, but at the price of a complete absence of security. Such
a design would be a non-starter today. However, what you _could_ do today
would be to specify a standard system-wide serialization format, and give all
the basic system commands an option to generate it. S-expressions would work
great, but if you can't stand them, okay, use JSON. (Don't even think about
using XML.)

The result would be, instead of just piping text strings from one app to
another, you could, in effect, pipe _objects_. It's a far more powerful
paradigm and would save you all this parsing pain.

~~~
bobbyi_settv
> everything but everything is character strings

Actually it's worse: they're byte streams. They don't have to be decodable in
any encoding, can contain weird control characters, etc.

------
bradleyland
The utility 'find' has a nice parameter you can use if you need to parse its
output.

    
    
        find ./ -type f -print0
    

Using the '-print0' option will output a NUL-terminated list. Since Linux
filenames can't contain NUL bytes, you can reliably parse the output.

~~~
ams6110
Indeed, xargs (often combined with 'find') has a -0 option indicating that the
input is null-terminated. So you often see:

    
    
      find ./ -type f -print0 | xargs -0 ...
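
When the per-file work needs to stay inside the script rather than go through xargs, the same NUL-delimited stream can be read in a bash loop. This is only a rough sketch and assumes bash (for `read -d ''` and arrays) and a find that supports -print0; `list_files` and `demo_dir` are made-up names for illustration:

```shell
#!/usr/bin/env bash
# Sketch: collect NUL-delimited paths from find into a bash array.
# list_files is a hypothetical helper name, not something from the thread.
list_files() {
    local dir=$1 path
    files=()
    # IFS= keeps leading/trailing whitespace intact, -r keeps backslashes
    # literal, and -d '' makes read stop at each NUL byte, not at a newline.
    while IFS= read -r -d '' path; do
        files+=("$path")
    done < <(find "$dir" -type f -print0)
}

# Demo in a scratch directory with deliberately awkward names.
demo_dir=$(mktemp -d)
touch "$demo_dir/plain" "$demo_dir/with space" "$demo_dir/with"$'\n'"newline"
list_files "$demo_dir"
echo "found ${#files[@]} files"
rm -rf "$demo_dir"
```

The process substitution keeps the loop in the current shell, so the array is still populated after the loop ends; piping find into `while` would run the loop in a subshell and lose it.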

------
AndrewDucker
All of this makes me happy that I use PowerShell, where my output isn't some
text I need to carefully parse (avoiding edge cases), but a list of objects,
each of which has properties for me to interrogate.

~~~
ams6110
Of course bash is not the only shell, nor is it the only approach for
scripting on Linux/Unix. See e.g. perl, python, etc.

------
slashdotaccount
Some of these mistakes are detected by ShellCheck:

[http://www.shellcheck.net/](http://www.shellcheck.net/)

------
ams6110
It's good to be aware of these pitfalls, but in practice they often don't
arise. If you're parsing log files, or any other system-generated files with
sane filenames (no spaces, or other odd characters) you won't have an issue.
Still, I normally would never attempt to parse 'ls' for this sort of thing.
The preferred approach in a shell script is to use the shell's globbing
capabilities (as in the example given):

    
    
      for f in *; do
          [[ -e $f ]] || continue
          ...
      done
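
To illustrate why the `-e` test is in that loop: with bash's default settings (nullglob and failglob both off), a glob that matches nothing is left as the literal pattern, so the body would otherwise run once on a file that doesn't exist. A small sketch, assuming bash and using a made-up scratch directory `demo_dir`:

```shell
#!/usr/bin/env bash
# Sketch: what the [[ -e $f ]] guard protects against in an empty directory.
# demo_dir, unguarded and guarded are hypothetical names for this demo.
demo_dir=$(mktemp -d)

unguarded=0
guarded=0
for f in "$demo_dir"/*; do
    # Nothing matches, so f is the literal string ".../*" on this one pass.
    unguarded=$((unguarded + 1))
    [[ -e $f ]] || continue
    # Only reached for entries that really exist.
    guarded=$((guarded + 1))
done

echo "unguarded=$unguarded guarded=$guarded"
rm -rf "$demo_dir"
```

Setting `shopt -s nullglob` is another way to get the same effect, at the cost of changing glob behaviour for the rest of the script.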

------
alayne
I think if you're trying to use the shell for something other than some basic
program launch glue, you are doing it wrong.

In C, readdir returns a perfectly usable struct dirent * with no parsing
issues to worry about.

Python also provides a usable Unix layer for automation.

~~~
todd8
Every so often, I'll find myself frustrated with some bash script. For me,
once a shell script gets to that point, it's best to rewrite it in Python.
Translating a script into Python is almost fun, and I find the result much
more maintainable. BTW, the original submission is great. I've written shell
scripts over the years that have made this very mistake! For example doing
something like:

    
    
       ls pj* | wc -l
    

This normally returns the number of pj* files, but it will miscount for
pathological file names (for example, names containing newlines), as the
submission points out.
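
If you'd rather stay in the shell for a count like this, one option is to let the glob fill an array and take its length. A rough sketch, assuming bash; `count_matches` and `demo_dir` are names invented for the example:

```shell
#!/usr/bin/env bash
# Sketch: count glob matches without going through ls | wc -l.
# count_matches is a hypothetical helper, not an existing command.
count_matches() {
    local dir=$1 prefix=$2
    (
        # Subshell, so nullglob and the cd don't leak into the caller.
        shopt -s nullglob          # no match expands to nothing, not "pj*"
        cd "$dir" || exit 1
        matches=("$prefix"*)       # one array element per matching name
        echo "${#matches[@]}"
    )
}

# Demo with three files, one of which has a newline in its name.
demo_dir=$(mktemp -d)
touch "$demo_dir/pj1" "$demo_dir/pj 2" "$demo_dir/pj"$'\n'"3"

glob_count=$(count_matches "$demo_dir" pj)
ls_count=$(cd "$demo_dir" && ls pj* | wc -l)   # counts lines, not files

echo "glob: $glob_count"
echo "ls | wc -l: $ls_count"
rm -rf "$demo_dir"
```

The array length counts names, so the glob version reports 3 here, while the line count from `ls | wc -l` typically comes out higher because of the embedded newline.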

------
simula67
Globbing does not work well if you also care about hidden files:

    [simula67@hades test_bash]$ touch .hidden
    [simula67@hades test_bash]$ touch not_hidden
    [simula67@hades test_bash]$ find . -type f
    ./not_hidden
    ./.hidden
    [simula67@hades test_bash]$ ls -al
    total 8
    drwxr-xr-x 2 simula67 simula67 4096 Jul 7 00:32 .
    drwx------ 40 simula67 simula67 4096 Jul 7 00:31 ..
    -rw-r--r-- 1 simula67 simula67 0 Jul 7 00:33 .hidden
    -rw-r--r-- 1 simula67 simula67 0 Jul 7 00:33 not_hidden
    [simula67@hades test_bash]$ for f in *; do echo $f; done
    not_hidden

You have to use shopt -s dotglob:

    [simula67@hades test_bash]$ shopt -s dotglob
    [simula67@hades test_bash]$ for f in *; do echo $f; done
    .hidden
    not_hidden

------
schrodingersCat
This is one of those "Required Reading" posts for *nix users. Thank you
for this!

------
lemcoe9
[http://webcache.googleusercontent.com/search?q=cache:GuOdGlv...](http://webcache.googleusercontent.com/search?q=cache:GuOdGlvv1SYJ:mywiki.wooledge.org/ParsingLs+&cd=1&hl=en&ct=clnk&gl=us)

------
ayrx
This unix.stackexchange post[0] is relevant as well.

[0]:
[http://unix.stackexchange.com/q/128985/24124](http://unix.stackexchange.com/q/128985/24124)

~~~
jl6
That question/argument is a fine case study in some form of bug in the human
mind, the name of which I know not.

