
Text isn't the universal interface in Unix, byte streams are. You can quite happily send non-textual control characters and such around in Unix, or pipe data containing NULLs from one process to another. 'Text' is a very seductive abstraction, but it's one of the most brutal to work with once you start interacting with the real world and have to give up on ascii and deal with encodings and unicode and so on.

Putting commands and data inline is a recipe for disaster and a million command injection exploits. The Unix philosophy has broken the minds of generations of programmers. It leads them to doing things like concatenating strings to build SQL queries or doing IPC with ad-hoc regex-parsed protocols or using a couple of magical characters to indicate that the contents of a variable should be parsed and executed instead of just stored. Take a read of some of the earlier threads on HN about Shellshock, and you will find numerous people blaming Apache for not "escaping" the data it was putting in a shell variable. As if it even could.
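A minimal sketch of the failure mode (the variable name is made up): once data lands inside a string that a shell will re-parse, it is code; keep it out of that string and it stays data.

  name='x; echo pwned'   # "data" that arrived from somewhere untrusted
  sh -c "ls $name"       # the shell re-parses the string: the echo runs as a command
  ls -- "$name"          # passed as a single argument, it can only ever be a file name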

Even Unix nerds have at least partially internalised the dangerousness of the paradigm -- "don't parse the output of ls" and so on. The fact that the Unix paradigm (passing everything as strings with magical characters and escape sequences) is broken for the most fundamental computing tasks like working with file names ought to be a damning indictment of the paradigm. Sadly people merely parrot the rote-learned lesson "don't parse ls because file names can't be trusted", without thinking about all the other untrusted data they expose to unix shells all the time.
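A minimal illustration of the ls problem, in an otherwise empty directory:

  $ touch $'one\ntwo'   # one file, with a newline in its name ($'...' is bash's ANSI-C quoting)
  $ ls | wc -l          # the line-oriented text interface miscounts it
  2
  $ set -- *; echo $#   # the glob passes real arguments, nothing gets parsed
  1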

Just this week Yahoo got exploited. At first people thought it was Shellshock, but no, it was just a routine command injection vulnerability in their log processing shell scripts. A problem blighting just about every non-trivial shell script ever written.

The usual reply is "don't use shells with untrusted data". But auditing where any particular bit of data came from can be just about impossible once it has been across several systems through programmes written in a variety of languages, stored on a file system, read back and so on. The only sane solution is to never use shell scripts.

Like the C memory and integer model makes writing secure C code borderline impossible, the Unix "single pipe of bytes that defaults to being commands" paradigm makes writing secure shell scripts borderline impossible.

Unix needs to be taken out back and shot.




> Putting commands and data inline is a recipe for disaster and a million command injection exploits. The Unix philosophy has broken the minds of generations of programmers. It leads them to doing things like concatenating strings to build SQL queries or doing IPC with ad-hoc regex-parsed protocols or using a couple of magical characters to indicate that the contents of a variable should be parsed and executed instead of just stored.

Yes! And then to compensate, they have to "sanitize" untrusted input to their systems. I had a meeting yesterday with a developer and a project manager at an organization that wants to work with my company to integrate one of our products with one of theirs. I mentioned the possibility of submitting some data in JSON format to a web API on their end, and the project manager asked about the risk of code injection attacks, by which he apparently meant SQL injection. I had to assure him, based on my knowledge of their tech stack (Node.js, CouchDB, and naturally, JSON) that code injection wouldn't be an issue. My point is that the common abuses of strings by Unix and web developers have led to well-known and widely feared security vulnerabilities which just don't exist in software that's built on a foundation of properly structured data.
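For instance (a rough sketch, assuming the jq tool and a made-up variable name), even from a shell you can have the structured document built for you, so untrusted text can only ever become a JSON value, never JSON syntax:

  untrusted='"]}; drop everything'
  jq -n --arg name "$untrusted" '{user: $name}'   # the quote is escaped in the output, never interpreted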

See also this classic by Glyph Lefkowitz:

https://glyph.twistedmatrix.com/2008/06/data-in-garbage-out....


I would not be half as capable of troubleshooting in my work as I am if I couldn't parse random byte strings on the command line. That alone is why I use Unix for embedded development. Powerful things are often dangerous. For instance, consider nuclear power, motor vehicles, rocket engines, medicine, plastics, et al.

So how do you provide the same level of capability as C and UNIX but without mixing data and commands? Is there some alternative paradigm that is just as powerful but safe?


Watch the Mathematica-backed StrangeLoop 2014 keynote by Stephen Wolfram. That's one way to think about it. There are others like how things work on the Lisp Machine (and somewhat similarly in CLIM).


> Take a read of some of the earlier threads on HN about Shellshock, and you will find numerous people blaming Apache for not "escaping" the data it was putting in a shell variable.

It seems to me that this is the result of Cargo Cult Programming. People know that SQL strings and user input need to be 'escaped,' so they just think "Obviously this needs to be escaped too! It's user input!" Yet they don't realize that they are trying to put a square peg in a round hole. They just know that pegs go through holes, so they keep banging away at it.

Also, it's always amazed me that there was never some sort of 'standard' way to shell-escape things, even though the shell has been around for ages. Why can't I generate a shell string in the same way that I generate a SQL string? E.g.:

  sprintf("mv %t %t", src, dest);
Where "%t" is a special token that shell-escapes the input (e.g. "My File Name.txt" => "My\ File\ Name.txt"). Instead it's something where people continue to use ad-hoc, incomplete, of 'implemented everywhere' solutions to this.

Note: I don't generate SQL strings with sprintf(), but it's a close approximation of:

  execute('select * from table where id = ?', id);
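(For what it's worth, bash's printf builtin gets partway there: its %q format shell-quotes its argument for reuse, though it's a bash extension rather than any kind of standard.)

  $ printf 'mv %q %q\n' 'My File Name.txt' 'dest dir'
  mv My\ File\ Name.txt dest\ dir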


SQL injection is not avoided by escaping arguments, but by never mixing the command and user supplied arguments in the first place. The equivalent to your example would be

  execlp("mv", "mv", src, dest, NULL);
which does not rely on the shell to try to untangle your arguments from a single string.

Edit: Fixed, thanks!


> SQL injection is not avoided by escaping arguments, but by never mixing the command and user supplied arguments in the first place.

People see:

  execute("select * from table where id = ?", id);
as for the most part like:

  execute(sprintf("select * from table where id = '%s'", escape(id)));
Where `escape()` is written by "smarter people" and makes sure that `id` isn't a string like this:

  0'; delete from user;'
(e.g. turning it into `0''; delete from user;''`). I realize that this isn't what actually happens, but the general idea is that you are sanitizing your inputs.


> Where `escape()` is written by "smarter people"

Can we stop with this, please? I'm sure it's not your intention, and it's just the way this is always phrased, but it's casual contempt, and we deserve to treat each other, and to be treated, better.

"escape() is written by someone who spent the large amount of time analysing all the issues, testing, taking and incorporating feedback so the rest of us, who are both smart and competent, don't have to duplicate the work.


Similarly, no amount of escaping would protect you from Shellshock.

P.S. Doesn't execlp() require a NULL at the end of the parameter list?


Agree that text is being abused in Unix all the time. The problem with passing everything around as text is that you cannot reason about anything, because everything is of the same type. One big advantage of object-based systems is that they can catch type errors and notify you of the problem. Text pipelines will simply break because one of your implicit assumptions didn't hold.


"Like the C memory and integer model makes writing secure C code borderline impossible, the Unix "single pipe of bytes that defaults to being commands" paradigm makes writing secure shell scripts borderline impossible. Unix needs to be taken out back and shot."

What alternative do you advocate/propose ?

Genuinely curious ...


For the interactive general purpose data munging and quick execution of simple commands that the shell is best at, I really don't know what a better system would look like. It seems like a really hard problem. Anything purely text-based ends up being fairly cumbersome to use for simple commands if it has to use real data structures (consider having to type (["a", "b"]) instead of a b to pass arguments with JSON-style syntax or whatever). At least that was my experience of trying to write a very simple shell. There are a hell of a lot of people a hell of a lot smarter than me though.

It seems to me that a lot of shell scripts could be ported to other languages. Does DHCP on Linux need to use a shell script instead of python or something like that? The benefits of the shell grammar and semantics which are designed to make interactive use easy seem more like hindrances in a lot of those kinds of use cases. I assume it's largely done to make it easier for sysadmins to customise things. If I was a sysadmin I'd much rather learn python (and feel like I actually understood it) than the crazy byzantine grammar of bash. Maybe that's why I'm not a sysadmin.

This paper by Rob Pike might also be of interest: http://doc.cat-v.org/bell_labs/structural_regexps/se.pdf

>The current UNIX® text processing tools are weakened by the built-in concept of a line.


The reason the shell is used everywhere is because it's guaranteed to be installed (although DHCP used Bash explicitly), it's much faster to start and run a simple script than Python, and its syntax is the command line that everybody should already be familiar with.


I've hardly used it myself, but apparently Microsoft's PowerShell pipes typed objects between processes: http://technet.microsoft.com/en-us/library/dd347728.aspx


Actually, not between processes - all PowerShell commands run in the same address space as the shell, and must be implemented in .NET. I don't think you can easily write an external process which takes a PowerShell object directly as input.


Thanks for the correction, that explains so much--I was picturing a lot of DCOM craziness.


so in one corner: the ability to compose small, comprehensible functions with data streams, delivering staggering riches to the world.

in the other corner: the complaint that sometimes, people can put carriage returns in their filenames and screw up the output of `ls -1`. With no suggestion of what would replace composed functional streams.

Nice try Lennart.


> in the other corner: the complaint that sometimes, people can put carriage returns in their filenames and screw up the output of `ls -1`. With no suggestion of what would replace composed functional streams.

How about functions that operate on data structures such as lists and maps? These can include generic functions that can slice and dice data structures regardless of what type of data those structures contain.

> Nice try Lennart.

Is Mr. Poettering's name now an epithet to be thrown at anyone who opposes the Holy Unix Way? Or do you just know something I don't about the identity of the commenter to whom you were replying?


The unix dataset already works on list and map data structures.

lists: ls -1 | wc -l

maps: ps -ef | grep 'tobekilled' | grep -v grep | awk '{ print $2 }' | xargs kill

would unix be better if it were

cwd.files.count

and

processes.filter{name = 'tobekilled'}.map(kill)

?

maybe for the newbie who paradoxically already understands functional and object-oriented programming, I'll grant.

But the 'pipe' ("this is a bucket brigade") has always been pretty easy for people to grasp.

The Lennart bit was a joke. The OP's post was so histrionically overwrought that I responded sardonically. Apologies if you were offended.


I'm honestly not sure if your post is meant to be satire or not.

>lists: ls -1 | wc -l

The computer has taken a real array (probably an array in C), joined all the items together into one big string using magical characters as dividers, and then split it again on those magic characters to try and reconstruct the metadata that it threw away. I think the problem is pretty obvious and well known.

>would unix be better if it were cwd.files.count

Well, at least that is going to give you the correct result. Correctness seems like it should be pretty important, no?

Are you really arguing for shells being easier to learn using an example of a complicated command with 4 pipes, 2 different quoted strings, several single letter arguments, and that requires implicit knowledge about the structure of the output from several commands? Compared to a much shorter, simpler, type safe, and self documenting bit of code?

Also, yes my original post was hyperbolic. That was because I was responding to a histrionically overwrought post claiming that unix is perfect.


So your argument is that ls is not efficient enough in its implementation, but that you would suggest replacing all of that with an object model that implements the equivalent of Ruby's enumerable. Got it.

Shell is easy to learn because people innately understand "and then do this with it". You can start with ls, get to ls -1, then think, I want to count these, and get to wc -l.

Yeah, there are exceptions -- e.g., files with newlines in their name -- and yeah, the interface could use some cleanup. But pedagogically, I can assure you that teaching people shell is easier than teaching them map-reduce.


>So your argument is that ls is not efficient enough in its implementation

My complaint wasn't about the efficiency of ls, it was the fact that valuable information that is required for correctness is thrown away to achieve compatibility with the unix 'stream of text' interface and the attempt to recover that information leads to incorrect results. The paradigm is just fundamentally broken.

>you would suggest replacing all of that with an object model that implements the equivalent of Ruby's enumerable. Got it.

I don't even like Ruby at all (ironically, I find its grammar far too complex and shell-like to be able to parse in my head), so I've no idea what you are talking about. You seem to be assuming that everybody who dislikes the shell must be some strawman hipster.

>there are exceptions -- e.g., files with newlines in their name

I honestly find it extremely bewildering that any programmer would see that as being acceptable. It's not just that it fails to give the correct result, it fails silently. Silent data corruption is surely just about the worst class of bug.

>But pedagogically, I can assure you that teaching people shell is easier than teaching them map-reduce.

I presume you mean the functional ideas of map, reduce, and filter, not MapReduce ( https://en.wikipedia.org/wiki/Map-reduce ). The latter is irrelevant to the discussion.

My experience is the exact opposite. I found understanding the concepts of map and filter trivial. If you can understand a loop you can understand them. Reduce/fold isn't hard to understand either, although a bit trickier to make use of. Your example didn't use reduce anyway. Map and filter are typically much easier to use and reason about than a for loop in C, or a chain of commands in a shell script.

The shell is an absolute nightmare to learn. I have tried to learn to use it numerous times over the last decade or so, and I have always forgotten it the next time I come to do anything in the shell. The amount of knowledge you need to actually do anything is huge (the awk language, obscure and terse command names, complex regexes, memorising a bunch of command flags, memorising the output format of commands - usually a format designed for displaying to users rather than machine parsing, the shell's ridiculously complex grammar, how to escape things, etc.). Your example illustrates that. It would have taken me 20 mins at least to put together that line of code you gave. Also, it's not like you escape having to understand concepts like map and filter. If you don't understand them (not necessarily by name) then you won't be able to write the line of unix commands you gave.

>Shell is easy to learn because people innately understand "and then do this with it".

People might find the concept of piping data easy to understand (I'm not convinced they do to be honest), but that alone won't do them much good because as your examples showed, you always need to run a bunch of complex and obscurely named commands, regex, or awk on the data to make the next command able to understand it.


>> there are exceptions -- e.g., files with newlines in their name

> I honestly find it extremely bewildering that any programmer would see that as being acceptable. It's not just that it fails to give the correct result, it fails silently. Silent data corruption is surely just about the worst class of bug.

Yes! We should strive to build our software on solid, non-leaky abstractions as much as possible, so that exceptions like a filename with an odd character in it just don't exist. Until we reach that point, computers will continue to frustrate their users for no good reason.


I didn't see any argument about performance. I only saw an argument about correctness.


> Well, at least that is going to give you the correct result. Correctness seems like it should be pretty important, no?

The devil is in the details. In OP's example, is "files" a field of "cwd," and "count" a field (or getter) of "files?" Is "filter" a method of "processes" and "map" a method of the resulting (implicit) list returned by "filter"?

If the answer to any of these is "yes," then you will find yourself needing to implement these fields and methods (and probably others) for each OS object. The "filter" in "processes" necessarily has a different implementation from "filter" in "files", since despite having the same external interface, they both operate in different contexts on different internal state (i.e. a process object is not a file object).

Contrast this with the UNIX approach, where the "filter" and "map" implementations (i.e. grep, awk, sed, tr) exist independent of OS-level objects (i.e. processes, files, semaphores, sockets, etc.) and their state, allowing them to be developed and improved independently of one another and the data they operate on.

You want there to be some notion of type safety and structure in IPC. This can already be achieved: simply rewrite your programs to communicate via ASN.1, JSON, or protobufs, or some other common structured marshalling format. You can have ls send wc an array of strings, instead of having wc guess which bytes are strings by assuming that whitespace delimits them.
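(A rough sketch of that, assuming jq: the producer hands over one JSON array of names, and the consumer counts entries without ever guessing at delimiters.)

  jq -n '$ARGS.positional' --args * | jq 'length'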

However, upon doing this, you will find that you will only be able to use wc with programs that speak wc's protocol. Now if you're lucky, you can convince everyone who wants to send data to wc to use the new protocol. If you're unlucky, you'll instead end up with a bunch of programs that only implement it partially or with bugs. If you're really unlucky, there will also be competing versions of the wc protocol. Moreover, what about programs that wc will need to pipe data to? wc will need to know how to communicate with all of them as well.

My point is, if we go the route of imposing structure and types on IPC, the only thing you'll have to show for it are O(N^2) IPC protocols for N programs, which they all must implement perfectly (!!) to get type safety. Want to write a new program? Now you also have to write O(N) additional IPC protocols so it can communicate with the others.

Maybe you can additionally mandate that each program speaks the same IPC protocol (i.e. there is One True Record for piping data in, and One True Record for piping data out). But, if this IPC protocol encompasses every possible use-case, how is it any different than a byte stream?


> ps -ef | grep 'tobekilled' | grep -v grep | awk '{ print $2 }' | xargs kill

versus

> processes.filter{name = 'tobekilled'}.map(kill)

Yes, actually, I would like the second one better (although I know it's only pseudo-code). To me, the most egregious problem with the Unix way of doing this is exemplified by the "grep -v grep" part. The first grep command didn't specify precisely what you wanted, i.e. the subset of processes whose executable name (or argv[0]) is "tobekilled". It can't, because ps produces lines of text intended primarily to be displayed on a terminal and read by a human, and grep merely searches those lines of text for a substring. It's all a very lossy process. So you had to add a second grep to work around the fact that the first grep also matched a particular occurrence of "tobekilled" in a command-line argument other than argv[0]. But what if someone were running "vim tobekilled.txt" at the same time? These sorts of workarounds are to Unix as epicycles were to Ptolemaic astronomy -- evidence that the foundation is flawed.

> maybe for the newbie who paradoxically already understands functional and object-oriented programming, I'll grant.

I think it would be easier to teach the fundamentals of functional programming -- not the crazy academic type-theory stuff, but basics like map and filter -- than all the intricacies and gotchas of combining Unix tools like grep, sed, cut, xargs, awk, and so on. You do realize that, in addition to the shell language itself, you casually dropped in a whole second language (Awk) in your second example, right?

> The Lennart bit was a joke. The OP's post was so histrionically overwrought that I responded sardonically. Apologies if you were offended.

It bothers me that you used Lennart's name as a synonym for "anti-Unix", as if that's his most salient characteristic (if it's even true of Lennart). How do you think he would feel about that "joke" if he read it? Especially on top of all the other vitriol he's received?


Don't get me wrong, tics like 'grep -v grep' are suboptimal, and indeed that's why people invented killall, and xargs -0, and so on. But in practice, there exist zero systems without workarounds and grotesqueries, and the unix philosophy, rather than any competing philosophy, is the entire reason why you're able to use any of the networks or computing devices you use today.
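For the curious, the -print0/-0 convention mentioned above just swaps the ambiguous newline delimiter for a NUL byte, which can't appear in a file name. A typical use:

  find . -name '*.log' -print0 | xargs -0 rm --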

Can we do better? Sure! One could imagine a reformation of the unix philosophy to center around strongly typed streams that could totally work after you put about 20 years of effort into replacing the existing tools. All too often, though, the reformers are people who don't understand the philosophy and want to try a weird collection of ad-hockery instead.


I don't have access to powershell on linux but I think the syntax would be

(ls).Count

(where ls is an alias for http://ss64.com/ps/get-childitem.html which returns .net FileInfo objects, it can also return info for registry paths)

and

ps | ?{ $_.Name -eq 'tobekilled' } | %{ $_.Kill() }

(ps an alias for http://ss64.com/ps/get-process.html returns .net Process objects)

or the equivalent of pkill

Stop-Process -Name 'tobekilled'


> would unix be better if it were cwd.files.count

How about

    file:. count 
?


I agree with you to some degree. It's incredibly easy to write insecure software if it ever touches a shell.



