https://github.com/oilshell/oil (osh/ and core/ directories):
Right now the goal is to clone bash but have better parse time and runtime errors. I hope to make an initial release this summer. But the project is much larger (see the blog if interested).
Your blog post about "vectorized, point-free and imperative style" was so beautiful that it brought tears to my eyes. Thank you!
Yeah Python wasn't the original plan, but it helped me focus on correctness. I was able to iteratively reverse engineer bash and other shells, while keeping the code relatively clean.
I had hoped to rewrite it in native code, but it's a huge amount of effort, so the first release will be Python.
Existence and correctness are both higher priorities than performance :)
I disguised Python by bundling a slice of the Python interpreter, which will allow me to rewrite it over time without user-visible packaging/deployment changes. I'm also able to fork the Python language because I have the "OPy" compiler.
One thing that should make it faster is using integer offsets everywhere instead of dictionary lookups, and that should be possible without much code change. It is written in a fairly "static" style, although with some important metaprogramming.
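A rough illustration of the idea (hypothetical, not Oil's actual code): field names compile down to fixed integer offsets, so field access is a list index instead of a hash lookup.

```python
# Hypothetical sketch, NOT Oil's actual code: the same record stored
# two ways. Dict-based access hashes the key on every lookup; slot-based
# access uses an integer offset fixed ahead of time.

# Dict-based record: every field access is a dictionary lookup.
token_dict = {'kind': 'Lit_Chars', 'val': 'echo', 'pos': 0}

# Slot-based record: field names have been "compiled" to offsets.
KIND, VAL, POS = 0, 1, 2
token_slots = ['Lit_Chars', 'echo', 0]

assert token_dict['val'] == token_slots[VAL]
```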
EDIT: I also hope to resume blogging more regularly once the code is released. I have a big pile of ideas/notes. But getting it out there in any primitive form is the most important thing right now.
Much of the work in writing a shell is parsing -- I would estimate that parsing is 60% of the work, whereas it might only be 10-20% of the work of a compiler. This is both because interpreters are smaller than compilers (fewer transformations), and because shell is harder to parse than most programming languages.
But as you can see from my blog, it requires a few different parsing techniques, and there are some non-obvious choices I made that I think made it easier. (e.g. Lexer modes aka lexical state are a huge win for reducing complexity.)
The shell uses surprisingly few system calls -- fork/exec/wait, open/close/read/write, pipe/dup/fcntl, and that's almost it. These are well worth studying. File descriptors are non-obvious and essential.
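To show how little is involved, here is a minimal sketch of the core loop in Python (whose os module wraps these calls almost directly): fork a child, exec the program there, wait in the parent.

```python
import os

# Minimal sketch of how a shell runs one external command:
# fork a child, exec the program in the child, wait in the parent.
def run(argv):
    pid = os.fork()
    if pid == 0:                      # child process
        try:
            os.execvp(argv[0], argv)  # replaces the child's image on success
        except OSError:
            os._exit(127)             # conventional "command not found"
    _, status = os.waitpid(pid, 0)    # parent (the shell) waits
    return os.WEXITSTATUS(status)

run(['echo', 'hello from the child'])
```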
I never took the OS class that most people take, which shows you how C and Unix work. This thread lists a few, and also check out xv6 in the parent thread:
I learned a lot about system calls by using strace on many programs over the years. I think Python helps a lot because it is interpreted and has relatively direct bindings to Unix system calls.
FWIW, although I learned interesting things about parsing, I don't think there is that much "hard" in the computer science sense about shell. It's mainly an exercise in software engineering -- how do you keep your program from degrading into an enormous mass of poorly-debugged if statements? (I would classify bash in that category, unfortunately)
As far as data structures, you should be able to write a shell without anything fancy at all. In fact some shells are basically just tons of linked lists. Linked lists make memory management easier in C, although I'm using a more modern, high-level style.
I also spent a significant amount of time reading bash, dash, and mksh code for this project. (and to a lesser degree zsh).
Happy to answer any other questions.
As mentioned, I'm deferring to GNU readline for interactive features. I'd be interested in what problems you had with PTYs and what module you used?
Are PTYs standardized by POSIX? It feels like you should be able to write a shell against only POSIX APIs (and ANSI C).
I'd also be interested to see your shell. There are a bunch of alternative shells and POSIX shell implementations here:
EDIT: I saw the sibling comment linking to https://github.com/lmorg/murex. Taking a look! A few months ago I started a thread with the authors of elvish, oh, mash, and NGS shells to exchange ideas. (elvish and oh are also written in Go.) I didn't know about your shell or I would have included you!
The problem I had was initially not even realising that many command-line tools check whether they're outputting to a PTY, and that this affects their behaviour. I'm sure this is all stuff you're already familiar with, but a few examples I noticed: grep wouldn't highlight its match, ls wouldn't use its multi-column view, and apt-get wouldn't give you many (any?) of its interactive options. I also wanted tools like vi and top to function the same in my shell as they would in Bash, but they couldn't without me assigning a PTY. (I also needed to put the terminal into "raw mode" to pass through keyboard-generated signals and to disable echoing from stdin -- but thankfully that was very easy in Go.)
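The check the tools do is trivial, which is easy to see from Python:

```python
import sys

# Many command-line tools branch on whether stdout is a terminal (a PTY)
# and change behavior accordingly: grep's coloring, ls's columns,
# apt-get's prompts. The check itself is just isatty() on the fd.
mode = "terminal" if sys.stdout.isatty() else "pipe or file"
print("stdout is connected to a:", mode)
```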
I've played around a little with GNU readline. It's a really nice tool, but I had a few cross-compilation issues in Go when porting to other platforms (e.g. FreeBSD). So I used a pure-Go package instead, which isn't without its own bugs, but at least it keeps my shell fully portable.
> A few months ago I started a thread with the authors of elvish, oh, mash, and NGS shells to exchange ideas. (elvish and oh are also written in Go.) I didn't know about your shell or I would have included you!
Thank you. My shell is only about 3 months old though (possibly less as I only named it 2 months ago) so likely wouldn't even have existed when you started your thread. :)
If you know C at the level of K&R (especially the last few chapters on system calls), you can learn what you need to write a shell from a book like Stevens' Advanced Programming in the Unix Environment and/or Bryant and O'Hallaron's Computer Systems: A Programmer's Perspective.
I use metaprogramming in several places to make the code shorter, so 11K sounds about right for Python vs. C. I think it will be 15K with all the bash features filled out, excluding interactive parts.
I use GNU readline and not my own code, so that substantial part isn't counted.
The osh/osh.asdl file gives a concise overview of what features are represented and implemented.
It's a good book, give it a read.
It also shows how to set up pipes - using the pipe(), dup() etc. system calls. There are some subtleties involved in that. And that stuff is used for the shell he creates.
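One of those subtleties, sketched in Python: every process must close the pipe ends it doesn't use, or the reader never sees EOF and the pipeline hangs.

```python
import os

# Sketch of how a shell wires `producer | consumer` with pipe()/dup2().
# The subtlety: each process must close the descriptors it doesn't use,
# otherwise the read end never sees EOF and the pipeline hangs.
def pipeline():
    r, w = os.pipe()

    if os.fork() == 0:             # left side: echo writes into the pipe
        os.dup2(w, 1)              # stdout -> pipe's write end
        os.close(r); os.close(w)   # close the original descriptors
        try:
            os.execvp('echo', ['echo', 'hello pipe'])
        finally:
            os._exit(127)          # only reached if exec failed

    if os.fork() == 0:             # right side: tr reads from the pipe
        os.dup2(r, 0)              # stdin <- pipe's read end
        os.close(r); os.close(w)
        try:
            os.execvp('tr', ['tr', 'a-z', 'A-Z'])
        finally:
            os._exit(127)

    os.close(r); os.close(w)       # parent closes both ends, then reaps
    return [os.wait()[1] for _ in range(2)]

statuses = pipeline()              # tr prints the upcased line
```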
The idea is that the shell should be able to optimize pipelines. Pipeline here meaning a chain of commands piped into one-another.
So if you have a pipeline like
grep ^2017-05-29 /var/log/somefile | grep -v 'INFO|WARN' | tail -n5 | cut -f1 -f3
Now you might say that that sounds like it goes against the Unix philosophy, but actually it doesn't need to. If all of the core utilities were implemented in such a way that their logic could be extracted without duplication of code then the shell can still be doing "one thing and one thing well".
Another idea I have is to make these core utilities pipe objects instead of text, like on Windows. I am very fond of bash but I think one thing that Microsoft seems to have done right is to have the idea of being able to work on objects in PowerShell. But I don't want PowerShell. I want a mostly Unix shell, except, as I said, with objects.
I just wish I had more time :(
tail is a little different because it needs to find the end of the file but the greps certainly don't wait for each other to finish.
As for your "objects" point. The shell I'm working on (discussed elsewhere in this thread) uses JSON to pass objects around functions written for that shell while still falling back to standard flat text files when piping out to traditional UNIX/GNU/whatever tools. Though I still need to make the whole thing more intuitive.
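Roughly, the idea looks like this in Python (function names here are made up for illustration, not my shell's actual API):

```python
import json

# Sketch of the dual-output idea: shell-native functions exchange
# structured JSON, while traditional tools get flat text. Names are
# hypothetical, for illustration only.
def emit(records):
    print(json.dumps(records))          # structured, for shell-aware consumers

def to_flat(records):
    # flat whitespace-separated text, for traditional Unix tools
    return "\n".join(" ".join(str(v) for v in r.values()) for r in records)

records = [{"name": "tank", "used": "1.2T"}]
emit(records)
print(to_flat(records))
```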
I am aware of that. Still, performing multiple operations on the data within a single loop and with less copying would be an optimization.
Shell scripts are surprisingly efficient at pipelined work where you're just dealing with streaming data (as your example mostly does). It's the more complex logic they fall significantly short on.
The weakest area and why I kept saying "most" of that pipeline would be the use of tail part way through the chain (as others have mentioned). But that could be optimised within Bash to something like:
tail -r somefile | grep 'expression' | egrep 'expression' | head -n number | cut ...
Your "objects" point is definitely relevant though. For example, how many times have we seen shell scripts break because they don't consider spaces in filenames?
Talking about annoyances in Bash, I'd also throw in exception handling as a major problem for traditional shells. Having Bash fail in expected ways and handle those failures cleanly is a huge pain in the arse.
Those two points (plus wanting cleaner syntax for iteration) are what drove me to write my own $SHELL, so it does sound like I'm addressing some of your annoyances, albeit not all of them. It's still a long way from being usable in production environments, but I'm happy to accept pull requests if you ever feel like contributing to something that's working towards at least some of your goals :)
Almost always you'll lose more time on I/O than on pipe processing.
I don't have the citation but I believe they mentioned that the Chapel language also does this. Rather than materializing intermediate arrays, it just does everything element-wise in one pass.
I think this optimization may make sense in some cases that arise in practice, but probably not very many. There are a lot of other things that need to be fixed with the shell first.
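To make the fusion idea concrete, here's a sketch in Python terms: the staged version materializes a full intermediate list per stage (like a pipeline writing whole streams between processes), while the fused version pushes each element through every step in one pass.

```python
# Sketch of loop fusion: avoid materializing intermediate results.
data = list(range(10))

# Staged: a full intermediate list is built between the two steps.
doubled = [x * 2 for x in data]
small = [y for y in doubled if y < 10]

# Fused: each element flows through both steps before the next one,
# using a lazy generator, so no intermediate list exists.
fused = [y for y in (x * 2 for x in data) if y < 10]

assert small == fused == [0, 2, 4, 6, 8]
```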
Did you mean cut -d' ' -f1,3?
Assuming delimiter is a space the above example can be reduced to:
' /var/log/somefile \
Linux: see busybox ("multicall binary")
BSD: see binaries on install media ("crunched binary")
sed and tail are both compiled into busybox
re: bsd compiling in tail with crunchgen is easy
As for "objects", the k language can mmap the output of UNIX commands and run parallel computations on those results as "data" not "text". It can be faster than I/O.
Using sed or awk is an option, yes, but I am so used to the standard utilities that I would rather keep using them.
Also I'm not sure if your sed script does what I intended for it to do.
1. Take every line that starts with 2017-05-29.
2. Out of the lines we have, remove any that contain INFO or WARN.
3. Take the five last of all of those lines.
4. Take the first and third fields of all of those lines.
Let's create a sample file
cat > /tmp/somefile <<EOF
2017-05-28T08:30+0200 nobody ERR: Foobar failed to xyzzy.
2017-05-29T13:01+0200 nobody INFO: Garply initiated by grault scheduler.
2017-05-29T13:37+0200 nobody DEBUG: Garply exited with 0.
2017-05-29T14:12+0200 nobody WARN: Plugh quux corge.
2017-05-29T14:55+0200 nobody ERR: PLUGH QUUX CORGE!
2017-05-30T00:17+0200 nobody ERR: Failed to retrieve baz needed for thud.
EOF
grep ^2017-05-29 /tmp/somefile | egrep -v 'INFO|WARN' | tail -n5 | cut -f1 -f3
2017-05-29T13:37+0200 DEBUG: Garply exited with 0.
2017-05-29T14:55+0200 ERR: PLUGH QUUX CORGE!
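For what it's worth, the four steps can be fused into a single pass over the file; this Python sketch is roughly what a pipeline-optimizing shell might compile the pipeline to (fields are split on whitespace here, a simplification of cut's tab-delimited behaviour):

```python
from collections import deque

# The whole pipeline fused into one pass over the input lines.
def last_err_lines(lines, date='2017-05-29', n=5):
    kept = deque(maxlen=n)                        # step 3: keep last n survivors
    for line in lines:
        if not line.startswith(date):             # step 1: date filter
            continue
        if 'INFO' in line or 'WARN' in line:      # step 2: drop INFO/WARN lines
            continue
        kept.append(line)
    # step 4: first and third whitespace-separated fields
    return [(f[0], f[2]) for f in (l.split() for l in kept)]

sample = [
    "2017-05-28T08:30+0200 nobody ERR: Foobar failed to xyzzy.",
    "2017-05-29T13:37+0200 nobody DEBUG: Garply exited with 0.",
    "2017-05-29T14:12+0200 nobody WARN: Plugh quux corge.",
    "2017-05-29T14:55+0200 nobody ERR: PLUGH QUUX CORGE!",
]
print(last_err_lines(sample))
```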
' /tmp/somefile \
2017-05-29T13:37+0200 nobody DEBUG: exited
2017-05-29T14:55+0200 nobody ERR: QUUX
Your use of the word "INFO" as a placeholder for the first two occurrences of the space character threw me off quite a bit when reading your script, since the word "INFO" occurs in the file itself. It does make sense to use a word we know can no longer be present, since we've already removed all lines containing it. However, while a neat trick, those kinds of strange hacks are exactly what have brought me to believe that having objects (like Microsoft PowerShell has) instead of pure text would be beneficial in Unix as well.
As for your comments on compiling into a single program and running that: I don't think you understood what I meant, or rather, I don't think that busybox and those crunched binaries you mentioned perform the optimization I am talking about, do they?
Using the k language you mention in that fashion seems more like a hack and will require a lot of work each time. I would rather rewrite all core commands of my system so that they produced true objects, or actually, rather than objects just structured binary data. I don't need the output to have methods you can call.
One of the main things I want from structured binary data is to be able to select the columns of data by name instead of by index and without having a mess of some commands using tab for delimiter and others space and so on.
So instead of
zfs list -H | cut -f1,3,4
zfs list | cut name used avail
Also all commands that output tabular data must have a "header" command that will show the column headers. So to see the headers that the 'list' subcommand of the zfs command will output, I would say
zfs header list
NAME USED AVAIL REFER MOUNTPOINT
So if I type
zfs list | cut
Naturally, "header" will be a reserved word.
All commands will understand how to work with tabular data.
When you use grep you will either specify which column to use, or you can tell it to look across all columns with *
The shell will only expand * to file names when the * is positionally last in the argument list of a command since all commands will take the list of files as the last ones in their list of arguments.
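A sketch in Python of how the name-based column selection could work (the command and column names are from the zfs example above; the tool itself is imagined):

```python
# Hypothetical name-aware `cut`: commands emit a header row, and
# downstream tools select columns by name instead of by index,
# regardless of which whitespace delimits the fields.
def cut_by_name(lines, *names):
    header = lines[0].split()
    idx = [header.index(n) for n in names]   # resolve names to offsets once
    for line in lines[1:]:
        fields = line.split()
        yield tuple(fields[i] for i in idx)

table = [
    "NAME USED AVAIL REFER MOUNTPOINT",
    "tank 1.2T 800G  96K  /tank",
]
print(list(cut_by_name(table, "NAME", "USED", "AVAIL")))
```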
All of this being said, I appreciate all replies, including yours.
2017-05-29T13:37+0200 DEBUG: Garply exited with 0.
2017-05-29T14:55+0200 ERR: PLUGH QUUX CORGE!
Ctrl-V then tab. Or if using GNU sed just type \t.
We can reject this simplicity as a "trick" and demand something more complex.
But that suggests that the goal is not to solve the problem, it is to satisfy someone's desire for having some underlying complexity that moves the solution out of the realm of "trivial".
Even if we don't assume we have such astounding transformations, I think you're vastly overestimating the power of multithreading in this example. There's very little for each core to do, especially the final step (which is getting a whopping 5 lines to process).
All that being said, I'm not sure that, if I wanted highly parallel text processing OR deeply clever compiler-style optimizations, I'd start from "existing Unix utilities and their associated cruft".
Starting from the Unix utilities is simply because I'm so used to them.
I can assure you, however, that anyone who was under the impression that my system was actually Unix would be in for a nasty surprise :->
My system would be, shall we say, not POSIX compliant.
My system would take the things I like about Unix: being command-line centric, having utilities focused on doing small individual tasks well, and using pipes to glue them together seamlessly (though, as mentioned, each pipeline is compiled to an optimized program when run). The filesystem would remain essential and hierarchical.
I would separate Operating System binary data, Operating System config files, Application binary data, Application config files, User config files, User downloaded data and User created data STRICTLY!
Also like I mentioned, to have structured binary data instead of text. This would lead to commands which are named like in POSIX but which operate a bit differently with regards to the arguments they take.
Probably a lot of other things as well that I've been thinking about but which escape me at the moment.
But yeah, all of that remains a dream because it would involve more work than I have time for :(
Only if `/var/log/somefile` is a regular file and isn't opened for writing by any other process. Or your "smart enough compiler" doesn't care about doing the same thing.
Can't you change that to cut -f1,3 ?
(I read this article quickly, so apologies if I missed it.)
And then there's the fun of dealing with stdout and stderr at the same time, in a single thread.
One more fun thing to do with fork is to learn how to pass ownership of the child process to pid 1, which is what the nohup command does. A parent process when it exits, requires that all child processes exit as well.
fork() cannot fail with EMFILE or ENFILE
> It is the responsibility of the forking code to close file descriptors it doesn't need.
FD_CLOEXEC can cause exec() to close file descriptors automatically.
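Setting the flag is a one-liner with fcntl; here's a small Python demonstration (note that modern Pythons already set close-on-exec on newly opened descriptors by default):

```python
import fcntl, os

# FD_CLOEXEC demonstrated directly: a descriptor carrying this flag is
# closed automatically when the process exec()s, so the forking code
# doesn't have to hunt it down by hand.
fd = os.open('/dev/null', os.O_RDONLY)
flags = fcntl.fcntl(fd, fcntl.F_GETFD)
fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
cloexec = bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)
os.close(fd)
print("close-on-exec set:", cloexec)
```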
> A parent process when it exits, requires that all child processes exit as well.
This is not true
This is true. But if you believe you're only closing files or sockets in the child process, then it's very easy to run into this issue.
> > A parent process when it exits, requires that all child processes exit as well.
> This is not true
Sorry, you're right I was thinking of the SIGHUP to the child, it doesn't actually require an exit.
This isn't a big deal per se but is indicative that the author has a limited understanding of the subject.
Which itself isn't a big deal either, but the article would be improved by the author stating up front that they don't know the material very well and they too are learning as they go along.
Thank you for pointing these out. I've corrected my assumption and put up a disclaimer on the post.
If you want a detailed explanation with examples you'd have to paste several chapters from a book on Unix programming in the comment thread, although in this case there's probably already a StackOverflow thread that summarizes the problems with this post's subject.
Forks are too sharp for children, if you like that pun.
It's basically the third chapter of every C course that every CS student takes. And not very well done.
Diving in and trying it out is a good way to learn. Sharing your results can be helpful for others, even if you haven't learned very much yet; it can be a good starting point, and others can correct your misunderstandings and comment on areas you might investigate further.
I'm always happy to see people who have enough intellectual humility to show their incomplete work as they go, so that others can benefit from the learning process.
I've been a professional dev for a decade now, but failed out of college due to mental health issues (depression). I know a ton about stuff higher in the stack, but I've never spent much time writing C and mucking around with the OS itself.
This project is interesting enough to me to be worth the time to read it, and it's given me the idea of writing my own shell as a means of learning Nim.