
Writing a Unix Shell – Part II - dhanush
https://indradhanush.github.io/blog/writing-a-unix-shell-part-2/
======
kevindong
If you want to learn more about writing a shell from an undergraduate
coursework perspective (and far closer to how bash does things), this is a
chapter entirely on writing your own shell:
[https://www.cs.purdue.edu/homes/grr/SystemsProgrammingBook/B...](https://www.cs.purdue.edu/homes/grr/SystemsProgrammingBook/Book/Chapter5-WritingYourOwnShell.pdf)

That PDF covers the basics of making your own shell (i.e. splitting input into
tokens aka using a lexer, parsing the resulting tokens aka using yacc [a token
parser], I/O redirection, piping, executing commands, wildcarding, interrupts,
environmental variables, history, subshell, revising what you've typed without
having to retype it, etc.).

Every undergraduate CS major at Purdue University is required to do the
infamous shell lab (essentially, recreate csh). This project really taught me
how shells work. Before doing this project, I could do the minimum in shells,
but now I'm fairly competent at it.

~~~
userbinator
I would not recommend using a lexer/parser generator for writing a shell,
especially if you want to support things like backquote substitution and here-
docs, since parsing and evaluating are interleaved. A recursive-descent parser
is more flexible and suitable to the task.

~~~
chubot
Yes, this great article is all about how the maintainer of Bash regrets that
it uses a parser generator (yacc):

[http://www.aosabook.org/en/bash.html](http://www.aosabook.org/en/bash.html)

I've mentioned this here before, but I was able to parse bash almost entirely
up front, without interleaving parsing and execution. The first half of my
blog [1] is about this.

To make a long story short, I use four interleaved parsers, and they ask the
lexer to change state at the appropriate points. It's three separate recursive
descent parsers, and then a Pratt parser for C-style arithmetic expressions.
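
For anyone who hasn't seen one, a Pratt parser for arithmetic can be quite small. Here is a minimal sketch in C. It is not the code from my shell: `Parser`, `binding_power`, and `eval_arith` are names made up for illustration, it only handles integers, parentheses, unary minus, and `+ - * / %`, and it evaluates while parsing instead of building a tree.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical parser state: a cursor into the expression text. */
typedef struct { const char *p; } Parser;

static void skip_spaces(Parser *ps) {
    while (isspace((unsigned char)*ps->p)) ps->p++;
}

/* Binding power of a binary operator; 0 means "not a binary operator". */
static int binding_power(char op) {
    switch (op) {
    case '+': case '-': return 10;
    case '*': case '/': case '%': return 20;
    default: return 0;
    }
}

static long parse_expr(Parser *ps, int min_bp);

/* Prefix position: a parenthesized expression, unary minus, or a number. */
static long parse_prefix(Parser *ps) {
    skip_spaces(ps);
    if (*ps->p == '(') {
        ps->p++;
        long v = parse_expr(ps, 0);
        skip_spaces(ps);
        if (*ps->p == ')') ps->p++;
        return v;
    }
    if (*ps->p == '-') {
        ps->p++;
        return -parse_expr(ps, 30);   /* unary minus binds tighter than '*' */
    }
    char *end;
    long v = strtol(ps->p, &end, 10); /* yields 0 if no digits are present */
    ps->p = end;
    return v;
}

/* Core Pratt loop: consume operators that bind tighter than min_bp. */
static long parse_expr(Parser *ps, int min_bp) {
    long left = parse_prefix(ps);
    for (;;) {
        skip_spaces(ps);
        char op = *ps->p;
        int bp = binding_power(op);
        if (bp == 0 || bp <= min_bp) return left;
        ps->p++;
        long right = parse_expr(ps, bp);  /* same bp: left-associative */
        switch (op) {
        case '+': left += right; break;
        case '-': left -= right; break;
        case '*': left *= right; break;
        /* Sketch only: division by zero yields 0 instead of an error. */
        case '/': left = right ? left / right : 0; break;
        case '%': left = right ? left % right : 0; break;
        }
    }
}

/* Evaluate a C-style integer expression such as "1 + 2 * 3". */
long eval_arith(const char *text) {
    assert(text != NULL);
    Parser ps = { text };
    return parse_expr(&ps, 0);
}
```

Calling `eval_arith("1 + 2 * 3")` walks the string once, left to right. The whole precedence table lives in `binding_power`, which is why this style is convenient for the many operators of C-style arithmetic.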

It works very nicely, and surprisingly the algorithm is efficient, requiring
only two tokens of lookahead:
[http://www.oilshell.org/blog/2016/11/17.html](http://www.oilshell.org/blog/2016/11/17.html)

Aside from lookahead, the lexer reads the text exactly once, not 2, 3, or 4
times.

There are two things you can't parse up front that I know of:

- Associative array syntax, but this is bash 4.0-specific:
[http://www.oilshell.org/blog/2016/10/20.html](http://www.oilshell.org/blog/2016/10/20.html)

- A crazy instance of runtime parsing of arithmetic expressions inside
strings, AFTER variable substitution:
[https://github.com/oilshell/oil/issues/3](https://github.com/oilshell/oil/issues/3)
(all shells I tested implement this, not just bash)

Also there is one issue that would require arbitrary lookahead:

- Bash does arbitrary lookahead to distinguish $((1+2)) and $((echo hi)), the
former being arithmetic, and the latter being a subshell inside a command sub,
but it's not required by POSIX:
[http://www.oilshell.org/blog/2016/11/18.html](http://www.oilshell.org/blog/2016/11/18.html)

In bash, brace substitution is really metaprogramming which can be done at
parse time. You can manipulate program fragments, e.g. a{b,$((i++)),c,d}e, and
it doesn't rely on any program input.

In ksh, brace substitution is done AFTER variable substitution, so it's
another level of runtime parsing.

Globbing is done AFTER variable substitution in all shells.

But yes, lex and yacc are totally unsuitable for parsing shell. It's
unbelievably awkward to express, and results in more code, because the parser
has to be used for interactive input (the $PS2 problem), and it also should be
used for command completion, e.g. completing something like 'echo $(ls
/b<TAB>...' .

It also forces you into parsing at runtime, as far as I can tell. The yylex()
interface involves a lot of globals and the generated parsers probably don't
compose as I would like.

[1] [http://www.oilshell.org/blog/](http://www.oilshell.org/blog/)

------
pacaro
Thanks for continuing your series.

This is a great illustration of how non-trivial a production quality shell is.

Parsing input is "tricky"

Each builtin needs comprehensive error handling

~~~
kevindong
IRL, shells don't do string manipulation (well, technically everything becomes
string manipulation at some point, but in this context not in the normal sense
of the term). Shells generally use a lexer to split inputs up into tokens
(generally using regex) [0] and then make sense of the inputs using a parser
(the most famous of which is called yacc [1]).

[0]: I was going to link to Bash's lex file here, but they appear to do
something funky which would require a non-trivial amount of time to find,
understand, and write here. So, you'll just have to take my word on this. I
give you wikipedia as a substitute:
[https://en.wikipedia.org/wiki/Lexical_analysis](https://en.wikipedia.org/wiki/Lexical_analysis)

[1]:
[https://git.savannah.gnu.org/cgit/bash.git/tree/parse.y](https://git.savannah.gnu.org/cgit/bash.git/tree/parse.y)

~~~
chubot
The lexer for bash is inside that file, parse.y -- see yylex(), which calls
read_token(). It doesn't use lex; it's written by hand.

I'm not sure what you mean that shells don't do string manipulation. Almost
ALL they do is string manipulation.

That's true for the shell interpreter, which has to make sense of the input
program, and for user programs, which are processing argv strings like file
system paths, and stdin.

There are actually a handful of different parsers inside bash, which I mention
here:
[http://www.oilshell.org/blog/2016/10/26.html](http://www.oilshell.org/blog/2016/10/26.html)

Brace substitution is another little parser as well. And globbing, and regex,
both of which need their own parsers. (bash has its own glob parser, but some
shells use libc's glob implementation). bash is really 4-7 sublanguages in
one.

The annoying thing about shell is that it makes it impossible NOT to do string
manipulation in your program, because there is all this implicit stuff like
word splitting.

~~~
pacaro
One of my takeaways from TFA was along the lines of…

"Hmmm… he's using strtok, that's not how a real shell would work. What would a
minimal shell, without scripting, pipes, redirects etc. do? Just correctly
parsing legal file paths (which TFA needs to correctly implement 'cd') is well
out of scope of a small article like this."

~~~
chubot
Right, a real shell obviously can't use strtok. If you're leaving out pipes,
redirects, and any control flow, then separating a shell string into words for
the argv[] array is fairly similar to lexing a C-escaped string (e.g. in C,
Java, Python, JavaScript).

You have backslashes, single quotes, and double quotes basically.
Traditionally this is done with a switch statement in a loop in C.
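
To make the switch-in-a-loop idea concrete, here is a rough sketch. It assumes nothing but whitespace, backslashes, and the two quote characters are special, and `split_words` is a made-up name, not something from TFA or from any real shell.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Split `line` into words in place, honoring backslashes and quotes.
 * Stores up to max_words pointers in argv and returns the word count,
 * or -1 on an unterminated quote. All names are illustrative only. */
int split_words(char *line, char **argv, int max_words) {
    assert(line != NULL && argv != NULL);
    enum { BETWEEN, UNQUOTED, SINGLE, DOUBLE } state = BETWEEN;
    int argc = 0;
    char *out = line;   /* write cursor; it never passes the read cursor */

    for (char *in = line; *in != '\0'; in++) {
        char c = *in;
        switch (state) {
        case BETWEEN:
            if (isspace((unsigned char)c)) break;
            if (argc == max_words) return argc;   /* no room: stop early */
            argv[argc++] = out;
            state = UNQUOTED;
            /* fall through: c is the first character of the new word */
        case UNQUOTED:
            if (isspace((unsigned char)c)) { *out++ = '\0'; state = BETWEEN; }
            else if (c == '\'') state = SINGLE;
            else if (c == '"') state = DOUBLE;
            else if (c == '\\' && in[1] != '\0') *out++ = *++in;
            else *out++ = c;
            break;
        case SINGLE:    /* everything is literal until the next ' */
            if (c == '\'') state = UNQUOTED;
            else *out++ = c;
            break;
        case DOUBLE:    /* this sketch lets backslash escape only " and \ */
            if (c == '"') state = UNQUOTED;
            else if (c == '\\' && (in[1] == '"' || in[1] == '\\')) *out++ = *++in;
            else *out++ = c;
            break;
        }
    }
    if (state == SINGLE || state == DOUBLE) return -1;
    *out = '\0';        /* terminate the last word */
    return argc;
}
```

Because quotes are removed while copying, the output is never longer than the input, so the words can be written back into the same buffer and handed to exec as argv[].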

But that is not a good approach for a real shell. Even inside double quotes
you can have a fully recursive program, like:

    
    
        $ echo "hi ${v1:-A${v2:-X${v3}Y}B}"
        hi AXYB
    

Once you have recursion then you need some kind of parser, not just a lexer.
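
As a rough illustration of why, here is a tiny recursive-descent expander in C that handles only `${name}` and `${name:-word}`, where `word` may itself contain more expansions. It is a sketch under assumptions: `getenv()` stands in for the shell's variable table, and `Out`, `put`, `expand_until`, `expand_braced`, and `expand_word` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical output buffer: fixed capacity, silently truncates. */
typedef struct { char *buf; size_t len, cap; } Out;

static void put(Out *o, const char *s, size_t n) {
    for (size_t i = 0; i < n && o->len + 1 < o->cap; i++)
        o->buf[o->len++] = s[i];
    o->buf[o->len] = '\0';
}

static const char *expand_until(const char *p, char stop, Out *o, int emit);

/* p points just past "${". Returns the position after the closing '}'.
 * When emit is 0 the text is still parsed, but nothing is written. */
static const char *expand_braced(const char *p, Out *o, int emit) {
    char name[64];
    size_t n = 0;
    while (*p != '\0' && *p != '}' && *p != ':') {
        if (n < sizeof name - 1) name[n++] = *p;
        p++;
    }
    name[n] = '\0';

    const char *val = getenv(name);          /* stand-in for shell variables */
    int have = (val != NULL && *val != '\0');
    if (have && emit) put(o, val, strlen(val));

    if (p[0] == ':' && p[1] == '-') {
        /* Recurse: the default is itself a word that may contain ${...}.
         * It is only emitted when the variable is unset or empty. */
        p = expand_until(p + 2, '}', o, emit && !have);
    }
    return *p == '}' ? p + 1 : p;
}

/* Copy text up to `stop` (or end of string), expanding every "${...}".
 * Returns a pointer to the stop character without consuming it. */
static const char *expand_until(const char *p, char stop, Out *o, int emit) {
    while (*p != '\0' && *p != stop) {
        if (p[0] == '$' && p[1] == '{') {
            p = expand_braced(p + 2, o, emit);
        } else {
            if (emit) put(o, p, 1);
            p++;
        }
    }
    return p;
}

/* Expand `word` into buf, which holds cap bytes including the NUL. */
void expand_word(const char *word, char *buf, size_t cap) {
    assert(word != NULL && buf != NULL && cap > 0);
    Out o = { buf, 0, cap };
    buf[0] = '\0';
    expand_until(word, '\0', &o, 1);
}
```

The two functions call each other, so the nesting depth of the input becomes the depth of the C call stack. That is exactly the part a flat switch-in-a-loop lexer can't express.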

~~~
laumars
I've been mocked on HN for saying this before but Bash and other shells of
its ilk are programming languages in their own right. I mean sure you're
dependent on the suite of tools in $PATH to do anything useful, but that's not
that much different to the standard libraries that make modern languages so
powerful.

~~~
Spakman
I have a hard time seeing what there is to mock about your opinion of
shells. I absolutely consider them languages - better at some things, worse at
others.

------
tyingq
Still missing the check for fork returning -1, which would make your waitpid()
hang forever, as -1 waits on all p̶i̶d̶s̶,̶ ̶e̶v̶e̶n̶ ̶i̶n̶i̶t̶.̶ children.

Edit: less problematic than my original wrong guess, but still bad for a
shell, which eventually supports multiple concurrent children.
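
For reference, the check I mean looks roughly like this. It's only a sketch: `run_command` is a hypothetical helper, not the function from the article, and it assumes the shell waits for exactly one foreground child.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] with its arguments and return the exit status,
 * or -1 if the child could not be created, waited for, or was killed.
 * run_command is a hypothetical helper name. */
int run_command(char *const argv[]) {
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid == -1) {            /* fork failed: there is no child to wait for */
        perror("fork");
        return -1;
    }
    if (pid == 0) {             /* child */
        execvp(argv[0], argv);
        perror(argv[0]);        /* only reached if exec failed */
        _exit(127);
    }

    int status;
    /* Wait for this specific child; retry if a signal interrupts the wait. */
    while (waitpid(pid, &status, 0) == -1) {
        if (errno != EINTR) {
            perror("waitpid");
            return -1;
        }
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Bailing out before waitpid() is the important part: once the -1 from fork() is handled, the pid passed to waitpid() is always a real child.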

~~~
lkurusa
Wrong, waitpid(-1, ...) waits on all _child_ processes of the current process,
and would return -1 with errno set to ECHILD if there are no children.

The point about missing error checking still stands though!
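
If you want to see the ECHILD behaviour for yourself, a small sketch like this shows it. `reap_all_children` is a made-up helper name, and the code assumes blocking waits, i.e. no WNOHANG.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Reap every remaining child of this process and return how many
 * were reaped. reap_all_children is a hypothetical helper name. */
int reap_all_children(void) {
    int reaped = 0;
    for (;;) {
        pid_t pid = waitpid(-1, NULL, 0);   /* -1: any child of the caller */
        if (pid > 0) {
            reaped++;
            continue;
        }
        if (errno == EINTR) continue;       /* interrupted: try again */
        assert(errno == ECHILD);            /* no children left to wait for */
        return reaped;
    }
}
```

In a process with no children the very first waitpid() call returns -1 with errno set to ECHILD, so the function returns 0 immediately rather than blocking.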

~~~
tyingq
Ah, yes, any child...thanks for the catch.

