
The arguments in the article indeed don't have much to say about Unix Philosophy per se - they're just a list of various fuckups and idiocies Unix accumulated for some reasons or others. As for Unix Philosophy, the point (3) in your summary is something that's a) dumb, and b) back when it was created, they already had better solutions.

Passing text streams around is a horrible idea because now each program has to have its own, half-assed shotgun parser and generator, and you have to glue programs together with your own, user-provided, half-assed shotgun parsers, i.e. calls to awk, sed, etc.

Think of it this way: if, per Unix Philosophy (points (1) and (2) of your summary), programs are kind of like function calls, and your OS is kind of like the running image, then (3) makes you programming with a dynamic, completely untyped language which forces each function to accept and return a single parameter that's just a string blob. No other data structures allowed.

I kind of understand how it is people got used to it and don't see a problem anymore (Stockholm syndrome). What shocked me was learning that back before UNIX they already knew how to do it better, but UNIX just ignored it.




> "The arguments in the article indeed don't have much to say about Unix Philosophy per se - they're just a list of various fuckups and idiocies Unix accumulated for some reasons or others."

Right. The title should have reflected that, something like "Various idiocies Unix has accumulated to this day". But since the article mentions Unix Philosophy, my point is that it should have criticised the philosophy and not the practice.

> "Passing text streams around is a horrible idea because now each program has to have its own, half-assed shotgun parser and generator, and you have to glue programs together with your own, user-provided, half-assed shotgun parsers, i.e. calls to awk, sed, etc."

But this has actually proved to be very useful as it provided a standard medium of communication between programs that is both human readable and computer understandable. And ahead of its time since it automatically takes advantage of multiprocessor systems, without having to rewrite the individual components to be multi-threaded.

> "(3) makes you programming with a dynamic, completely untyped language which forces each function to accept and return a single parameter that's just a string blob. No other data structures allowed."

That may be a performance downside in some cases, but the benefit of having a standard universally-agreeable input and output format is the time it saves Unix operators who can quickly pipe programs together. That saves more total human time than gained from potential performance benefits.


> And ahead of its time

It wasn't ahead of its time. By the time Unix was created, people were already aware of the benefits of structured data.

> it automatically takes advantage of multiprocessor systems, without having to rewrite the individual components to be multi-threaded.

That's orthogonal to the issue. The simple solution to Unix problems would be to put a standard parser for JSON/SEXP/whatever into libc or OS libraries and have people use it for stdin/stdout communication. This can still take advantage of multiprocessor systems and whatnot, with an added benefit of program authors not having to each write their own buggy parser anymore.
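
To make that concrete, here is a minimal sketch in Python of what "one shared parser for stdin/stdout" could look like. It assumes one JSON object per line as the wire format; that choice, and the names `write_records` and `read_records`, are mine for illustration and not an existing OS facility.

```python
import io
import json

def write_records(records, stream):
    """Emit one JSON object per line."""
    for record in records:
        # json.dumps escapes embedded newlines, so one record stays on one line.
        stream.write(json.dumps(record) + "\n")

def read_records(stream):
    """Parse the stream back into dicts, one per non-empty line."""
    return [json.loads(line) for line in stream if line.strip()]

# Made-up records; the first name contains a newline and spaces.
records = [
    {"name": "odd\nname with spaces.txt", "size": 4},
    {"name": ".gitignore", "size": 4},
]
pipe = io.StringIO()  # stands in for one program's stdout feeding another's stdin
write_records(records, pipe)
pipe.seek(0)
decoded = read_records(pipe)
```

Neither side had to invent its own quoting rules, and the awkward file name comes out the other end unchanged.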

> but the benefit of having a standard universally-agreeable input and output format is the time it saves Unix operators who can quickly pipe programs together. That saves more total human time than gained from potential performance benefits.

I'd say it's exactly the opposite. Unstructured text is not a universally-agreeable format. In fact, it's non-agreeable, since anyone can output anything however they like (and they do), and as a user you're forced to transform data from one program into another via more ad-hoc parsers, usually written in the form of sed, awk or Perl invocations. You lose time doing that, each of those parsing steps introduces vulnerabilities, and the whole thing will eventually fall apart anyway because of the million things that can fuck up the output of Unix commands, including things like your system distribution and your locale settings.

As an example of what I'm talking about, imagine that your "ls" invocation would return a list of named rows in some structured format, instead of an ASCII table. E.g.

  ((:columns :type :permissions :no-links :owner :group :size :modification-time :name)
   (:data
    (:directory 775 8 temporal temporal 4096 1488506415 ".git")
    (:file 664 1 temporal temporal 4 1488506415 ".gitignore")
      ...
    (:file 755 1 temporal temporal 69337136 1488506415 "hju")))

With such a format you could trivially issue commands like:

  ls | filter ':modification-time < 1 month ago' | cp --to '/home/otheruser/oldfiles/'
  find :name LIKE ".git%" | select (:name :permissions) | format-list > git_perms_audit.log

Hell, you could display the usual Unix "ls -la" table for the user trivially too, but you wouldn't have to parse it manually.
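
For illustration only, the two pipelines above might boil down to something like this once the rows arrive as structured records. This is a sketch in Python; `filter_rows`, `select`, the dict keys, the fixed `now` value and the mtime given to "hju" are all made up for the example.

```python
MONTH = 30 * 24 * 3600  # a rough "1 month", in seconds

def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]

def select(rows, columns):
    """Project each row down to the named columns."""
    return [{column: row[column] for column in columns} for row in rows]

# Hypothetical records, loosely mirroring the listing above.
rows = [
    {"type": "directory", "permissions": "775", "mtime": 1488506415, "name": ".git"},
    {"type": "file", "permissions": "664", "mtime": 1488506415, "name": ".gitignore"},
    {"type": "file", "permissions": "755", "mtime": 1496000000, "name": "hju"},
]
now = 1496500000  # fixed "current time" so the example is reproducible

# ls | filter ':modification-time < 1 month ago'
old = filter_rows(rows, lambda row: row["mtime"] < now - MONTH)

# find :name LIKE ".git%" | select (:name :permissions)
git_perms = select(
    filter_rows(rows, lambda row: row["name"].startswith(".git")),
    ["name", "permissions"],
)
```

No step has to guess where a column starts or ends, because the field boundaries travel with the data.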

BTW. This is exactly what PowerShell does (except it sends .NET objects), which is why it's awesome.


There are no problems where you see them.

Most text formats are trivial to parse and space-separated or character-separated is the way to go. It really doesn't help if you enclose shit in parens. (Parens are sometimes a good way to encode trees, though).

    > (:columns :type :permissions :no-links :owner :group :size :modification-time :name)

That format doesn't solve any of the problems you mention. The problem is that it's hard to agree what data should be inside, not how you encode it.

    > ls | filter ':modification-time < 1 month ago' | cp --to '/home/otheruser/oldfiles/'
    find -mtime -30 | xargs cp -t /home/otheruser/oldfiles

    > find :name LIKE ".git%" | select (:name :permissions) | format-list > git_perms_audit.log
    find -name '.git*' -printf '%m %f\n' > git_perms_audit.log

Use 0-separated if you care that technically filenames can be anything (except / and NUL). Or say "crap in, crap out". Or assert that it's not crap before processing it.
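
As a rough sketch of the 0-separated variant on the consuming side (in Python; `parse_nul_separated` and the sample bytes are invented for the example, and the bytes only imitate what `find -print0` would emit):

```python
def parse_nul_separated(data: bytes) -> list[bytes]:
    """Split NUL-terminated names, e.g. the output of `find -print0`."""
    parts = data.split(b"\0")
    # Every name is NUL-terminated, so the last split element is empty.
    if parts and parts[-1] == b"":
        parts.pop()
    return parts

# Sample input: three names, one of which contains a newline.
sample = b"./plain.txt\0./with space.txt\0./with\nnewline.txt\0"

names = parse_nul_separated(sample)

# The same names joined by newlines instead are split into too many pieces.
naive = sample.replace(b"\0", b"\n").decode().splitlines()
```

Here `names` has three entries while `naive` has four, which is the whole argument for NUL over newline as the separator.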

> Hell, you could display the usual Unix "ls -la" table for the user trivially too, but you wouldn't have to parse it manually.

You don't parse "ls -la". You just don't.

> BTW. This is exactly what PowerShell does (except it sends .NET objects), which is why it's awesome.

PowerShell is an abomination, and because it encourages coupling of interacting programs it will never be as successful as the Unix model. There will never be the same variety of interacting programs for very practical reasons.


> But this has actually proved to be very useful as it provided a standard medium of communication between programs that is both human readable and computer understandable. And ahead of its time since it automatically takes advantage of multiprocessor systems, without having to rewrite the individual components to be multi-threaded.

Except it is completely unusable for network applications because the error handling model is broken (exit status? stderr? signals? good luck figuring out which process errored out in a long pipe chain) and it is almost impossible to get the parsing, escaping, interpolation, and command line arguments right. People very quickly discovered that CGI Perl with system/backticks was a very insecure and fragile way to write web applications and moved to the AOLServer model of a single process that loads libraries.


It's true that error handling with (shell) pipes is not possible in a clean way in general. In shell, the best you can do is probably "set -o pipefail", but not every shell supports that. Concurrency with IO on both sides is really hard to get right even in theory.
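
Outside the shell you can at least see every stage's exit status. A minimal sketch with Python's subprocess module, where `run_pipeline` is a made-up helper and the two stages are stand-in Python one-liners rather than real tools:

```python
import subprocess
import sys

def run_pipeline(commands):
    """Run argv lists as a pipeline; return (final stdout, list of exit codes)."""
    procs = []
    prev_stdout = None
    for argv in commands:
        proc = subprocess.Popen(argv, stdin=prev_stdout, stdout=subprocess.PIPE)
        if prev_stdout is not None:
            # Drop our copy so the upstream stage notices if its reader exits.
            prev_stdout.close()
        prev_stdout = proc.stdout
        procs.append(proc)
    output = procs[-1].communicate()[0]
    return output, [proc.wait() for proc in procs]

# Stage 1 prints a line and then fails with status 3; stage 2 upper-cases stdin.
failing = [sys.executable, "-c", "import sys; print('hello'); sys.exit(3)"]
upper = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]

output, codes = run_pipeline([failing, upper])
```

`codes` tells you which stage failed, even though the last stage exited cleanly. It still says nothing about why, so this only addresses part of the complaint.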

Text representation is a good idea regardless of whether you pipe or not.


> (3) makes you programming with a dynamic, completely untyped language which forces each function to accept and return a single parameter that's just a string blob. No other data structures allowed.

I think this is great. There's slightly more principled ways to do it, but having to convert everything to one single format at the end of the day keeps you humble.

Let's go back to the previous decade's Hacker News:

http://wiki.c2.com/?AlternateHardAndSoftLayers


It's not converting to "one single format", it's converting to "any and all possible formats", because with unstructured text, you're literally throwing away the structure and semantics inherent in the data, instead relying on users to glue things together with ad-hoc parsers.


Please stop spreading bs. It's not throwing away structure. Piping text doesn't even preclude sexps. It's just that they are seldom needed. Simpler encodings like space-separated are sufficient for many use cases, and better for interoperation.

It's misguided and inefficient to encode everything in the same way. Would you prefer to have your JPG or MP4 encoded in sexps?

And I say that as someone who is working on a serialization format for relational databases.


> Please stop spreading bs.

This breaks the HN guidelines: https://news.ycombinator.com/newsguidelines.html. Please edit out such bits. This would be a fine comment without it.


Piping isn't the culprit. It's what you pipe that is, and Unix Philosophy says "pipe whatever the hell you want, and let the users upstream sort it out".

It's not about encoding everything in exactly the same way. It's about providing the basic, shared protocol for representing structure. With typical Unix tools, you don't have "simpler encodings", you have no encoding at all. Each tool outputs whatever its particular author felt like (and this changes between Unix systems), each tool parses things in whatever way its author felt like, and as a user your job is to keep transforming unstructured blobs of text to glue them together.


[flagged]


> Name a well-thought out text file format that can't correctly be parsed e.g. by a Python one-liner with basic string operations. And please don't include: JSON, XML, YAML, sexps, because it's not possible, at least not without a library.

Well, because this library should be a part of the OS API.

A set of concrete cases where existing practice is bad is Unix itself (and its descendants). Think of every time a script breaks, does something unexpected, or introduces a security vulnerability, because every program has to contain its own, half-assed parser for textual data.


> because every program has to contain its own, half-assed parser for textual data.

As I said, name me a format that I can't parse correctly as a Python one-liner.

I work as a systems administrator and my scripts (mostly shell, python) don't break. I'm not kidding you.

Of course when writing shell scripts (which I think you imply) I need to know how to write non-broken shell scripts (in a clean and straightforward way) and I will freely admit that it's not easy to learn how to do it. Partly because shell has some flaws, but more because of the insane amount of broken scripts and tutorials out there.

But it's not even about defending shell. We are talking about text representation.

> Well, because this library should be a part of the OS API.

You are free to postulate that, but it won't make it less work. By the way "OS API" is ridiculous. These libraries have to be implemented for every language (and they have been, for most popular languages).


> As I said, name me a format that I can't parse correctly as a Python one-liner.

mboxo? [1] It is a popular text format that cannot be unambiguously parsed.

More generally, most Unix tools' output is also not able to be unambiguously parsed. For example, use gcc to compile a file, then collect the warnings? The regex "^.+:\d+:\d+: warning.*" will be right most of the time, but there's no 'correct' way to parse gcc output (the mapping from diagnostics to output text is not injective, so it can't be reliably inverted).
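
A small Python sketch of the ambiguity, assuming a simplified "file:line:col: warning: message" layout (real gcc output has more to it, and `render` is just a stand-in for the compiler's formatter):

```python
import re

WARNING_RE = re.compile(r"^(.+):(\d+):(\d+): warning: (.*)$")

def render(path, line, col, message):
    """Format a diagnostic in the simplified layout assumed above."""
    return f"{path}:{line}:{col}: warning: {message}"

# Two different diagnostics...
a = ("a.c", 1, 2, "foo:3:4: warning: bar")
b = ("a.c:1:2: warning: foo", 3, 4, "bar")

# ...that produce the very same line of text.
same = render(*a) == render(*b)

# The greedy regex picks one reading; nothing in the text says which was meant.
parsed = WARNING_RE.match(render(*a)).groups()
```

Here the regex recovers `b` even though `a` was rendered. A non-greedy `.+?` would recover `a` instead, and would then be wrong for a file that really is named like `b`.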

There are various ways to work around the problem: mboxrd format uses an escape sequence to work around the earlier problem mentioned with mboxo. `ls -l --dired` (GNU) will allow you to parse ls by appending filename byte offsets to the output. `wc --libxo xml` (FreeBSD) will give the output in XML, which is unambiguous as well. multipart/form-data (RFC 2388) is used to embed binary data in a text format, by using a boundary byte sequence which doesn't appear in the data.
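
To show what the mboxrd escape buys you, here is a sketch of the quoting rule as I understand it from the page linked below: on write, any body line matching `^>*From ` gets one more ">" prepended, and on read one ">" is stripped from lines matching `^>+From `. The function names are mine.

```python
import re

NEEDS_QUOTING = re.compile(r"^>*From ")
IS_QUOTED = re.compile(r"^>+From ")

def mboxrd_escape(body: str) -> str:
    """Prefix '>' to every line that could be mistaken for a message separator."""
    return "\n".join(
        ">" + line if NEEDS_QUOTING.match(line) else line
        for line in body.split("\n")
    )

def mboxrd_unescape(body: str) -> str:
    """Strip exactly one '>' from every quoted From line."""
    return "\n".join(
        line[1:] if IS_QUOTED.match(line) else line
        for line in body.split("\n")
    )

body = "Hi\nFrom here on it gets tricky\n>From an older quote\nFromage is fine"
escaped = mboxrd_escape(body)
restored = mboxrd_unescape(escaped)
```

Because already-quoted lines gain a level too, the round trip is lossless. That is exactly what mboxo, which only quotes bare "From " lines, cannot guarantee.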

Binary formats present their own set of issues, but "accidentally unparseable" is more common in text-based formats (or ad-hoc text output).

[1] https://jdebp.eu/FGA/mail-mbox-formats.html


Thanks!

It's true that filenames with whitespace or newlines are bad for interoperability ("make" is another example). There are three simple options: escaping filenames, making filenames NUL-terminated, or declaring such filenames invalid. The latter way seems to have won for practical reasons, and it's a pity that "safe filenames" were never standardized (but C-identifier plus extension should be safe everywhere).

Mbox is definitely broken (for example, body lines that start with "From " are changed to ">From "). I don't think it is ambiguous today (all software I know interprets "From " at the beginning of a line as a new mail), but it clearly was not much designed at all. It still has some precious properties, which is why it's still in use today. For example, appending a new email (on a mail server) is very fast. Crude interactive text search also works very well in practice, although automation can't really be done without a library.

Email is complex data (not line- or record-oriented), so various storage formats achieving various tradeoffs are absolutely justified.

> Binary formats present their own set of issues, but "accidentally unparseable" is more common in text-based formats.

It's true, especially with formats from the 70s where the maxim was "be liberal in what you accept", and where some file formats weren't really designed at all.

On the other hand, "accidentally unextendable" (for example, fixed-width integers) and "accidental data loss" are much more common in binary formats.


> As I said, name me a format that I can't parse correctly as a Python one-liner.

Sorry, I misread that in your previous comment as "name me a format that I can parse correctly with a Python one-liner, without special libraries".

Anyway, the original article contains numerous examples of the issues I'm talking about; scroll to "Let’s begin with a teaser" and read from there. The point being, it's very difficult to correctly parse output in the general case, because unstructured text doesn't reliably tell you where various data items begin and end. Most people thus won't bother with ensuring their ad-hoc parsing is correct.

> By the way "OS API" is ridiculous. These libraries have to be implemented for every language (and they have been, for most popular languages).

Sure, each language has to implement its own bindings to the OS. My point is that there should be a structured format defined as standard on the system level, so that all CLI programs could use the same parser and generator instead of each rolling their own.


> Let’s begin with a teaser. How can we recursively find all the files with \ name in a folder foo? The correct answer is: find foo -name '\\\\'

He doesn't know shell quoting (or has problems with the blogging software). It's '\\' and there is nothing wrong with that (-name accepts a pattern, not a fixed string)

> How to touch all files in foo (and its subfolders)? At first glance, we could do it like this: find foo | while read A; do touch $A; done

No.

     find foo -exec touch {} \;
     #or
     find foo -print0 | xargs -0 touch

These examples only prove that the author is not proficient at shell.
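
For comparison, the same job with no shell word-splitting involved at all. This is a sketch in Python; `touch_all` is a made-up helper and the file names are test data chosen to be awkward.

```python
import os
import tempfile

def touch_all(root):
    """Update timestamps of root and everything below it; return the count."""
    count = 1
    os.utime(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            # Names are passed through as-is: no splitting, no globbing.
            os.utime(os.path.join(dirpath, name))
            count += 1
    return count

with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub dir"))
    for relative in ["two words", "-starts-with-dash", os.path.join("sub dir", "x y")]:
        open(os.path.join(tmp, relative), "w").close()
    touched = touch_all(tmp)
```

Note that `os.utime` follows symlinks by default, so a dangling symlink would raise here. The sketch does not try to handle that case.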

And we are not talking about shell (which does have flaws) but text representation. You still haven't provided the text format I asked for.

> To argue for the OP, consider the case of passwd being parsed on every system call. That is simply sub-optimal.

As you know, there are various encoding schemes, but they are mostly character-separated (space, newline, NUL, colon, whatever) or record-oriented (two separator levels, often newline and space/colon/comma).

In most places, only identifiers are allowed ("and" ("there" ("is") ("no" "fucking" "point") "in" ("wrapping" "everything" "in" ("quotes" "and" "parens"))). Just write things like this, and parsing won't be any harder than splitting on whitespace. Was that so hard?


> Author is not proficient at shell

> Partly because shell has some flaws, but more because of the insane amount of broken scripts and tutorials out there.

So what are you saying then? Basically, "git gud"? I am struggling to find your exact argument here. I wonder if you keep saying "it's not broken, you're just using it wrong", or "you must be proficient and if you're not, it's nobody's fault", or what exactly?

The main argument here is IMO that unstructured text which can be parsed with space/tab delimiters in mind is NOT good enough. You say it is. I disagree; I've had numerous cases in my career where any random dev never takes that into account and just throws almost-native-English files into a Linux VM expecting a 1970s system tool to be able to parse it and make sense of it.

Their fault? Absolutely and definitely. But it's the job of the tech to slap you on the wrist if you are not obeying standards. Computers are not AI and they need protocols / standards. Are there standards in piping things between processes in UNIX/Linux? No.

Then what's the point of technology at all, I ask.


I clearly said I'm not defending shell. Even when the author is responsible himself for wanting to put a fixed string where a pattern is expected.

But this is about text formats. Text is simple. It's only the overengineering farts who think they have to wrap everything in three levels of parens. It doesn't make a difference.

> Their fault? Absolutely and definitely. But it's the job of the tech to slap you on the wrist if you are not obeying standards. Computers are not AI and they need protocols / standards. Are there standards in piping things between processes in UNIX/Linux? No.

I just don't get why people keep thinking just because it's "text" it's somehow not standardized (enough), or why putting things in parens would help.

Please, stop with this vague FUD. Give an actual example.

> Are there standards in piping things between processes in UNIX/Linux? No.

That's called separation of concerns. That the kernel doesn't care doesn't mean that the endpoints don't care.


> Text is simple.

Sigh. I am not here to argue with your out-of-context sweeping generalizations. So I won't.

BTW, do you have a particular gripe with S-expressions / LISP? You ranted twice about parens in your comment towards me.

And no -- me, the OP, and several others in this thread will definitely not stop with this "vague FUD", "bs", "trolling" -- all your quotes from other comments -- simply because it's something we struggle with regularly.

We all have day jobs. When we stumble upon a piping problem -- be it being unable to find an erroring process easily and quickly (sometimes not at all), or unable to understand an exit code, or having to actually look up signal values, or stumbling upon a bug in an older version we're stuck with -- we try our best to get the problem out of the way and move on. Most non-tech-savvy bosses would react extremely badly if you told them you're spending hours or days on a problem they perceive as one small piece of the glue you're using to put a painting together, and especially when they find out that you're not even at the part where you must hang the painting on the wall (example: deployment). And that's a fact of the daily life of many devs. You can call that a vague FUD if you wish. <shrugs>

So forgive all of us working folk who don't keep Star Trek-like exact logs on every problem we ever bump into. /s

The negative impressions build up with time. You can try calling for a 100% scientific method on this, but I'd bet my neck that if I knew every single minute of your life, I'd catch you with your pants down on many occasions where you didn't keep a detailed record of everything that has ever frustrated you. Can you deny this? If not, then I don't understand why you are holding on to a strictly scientific approach to things people bump into daily but can never excuse spending huge amounts of time on in front of their bosses. Peace?

TL;DR:

Since we have jobs and we must go on about it relatively quickly, most of us never spend the effort to write down every single instance where the UNIX shell semantics have made our lives harder but we managed to pull through via a workaround and just went on about our business minutes or hours later.


Again, you have ignored that this discussion is not about shell (which I know, including its few flaws, and can easily deal with, but am in no way trying to describe as easy to learn given that there are so many broken scripts and tutorials out there. It's hard to just learn the quoting rules for once, and browse through "bash pitfalls" once, simply because people don't know where to look for good resources. And I have freely admitted it was hard for me as well. Nevertheless I seriously recommend learning it rigorously because it has tremendous practical benefits).

This discussion is about text representations. Why do you keep claiming that text formats are broken when you can't give a single example?

> BTW, do you have a particular gripe with S-expressions / LISP? You ranted twice about parens in your comment towards me.

I will rant again until people stop making stupid claims.

I actually like LISP as a programming language. There is just zero benefit from writing record- or even word-oriented data in a (random) free-form syntax that is meant for representing trees. If I wanted I could parse /etc/passwd format like this:

  from collections import namedtuple
  struct_passwd = namedtuple("passwd", "pw_name pw_passwd pw_uid pw_gid pw_gecos pw_dir pw_shell")
  passwd = [struct_passwd(*line.rstrip('\n').split(':')) for line in open('/etc/passwd')]

That's it. It works. There, I even made a nice datatype for you. And there's already more integrity checking in these two lines compared to a json.parse() or similar.
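
To spell out the integrity check being claimed: a namedtuple refuses a line with the wrong number of fields. A small sketch on in-memory sample lines rather than the real /etc/passwd (`parse_passwd` and the sample data are made up):

```python
from collections import namedtuple

Passwd = namedtuple("Passwd", "pw_name pw_passwd pw_uid pw_gid pw_gecos pw_dir pw_shell")

def parse_passwd(text):
    """Parse colon-separated passwd-style lines into named records."""
    return [Passwd(*line.split(":")) for line in text.splitlines() if line]

good = "root:x:0:0:root:/root:/bin/sh\n"
bad = "root:x:0:0:root:/root\n"  # one field short

entries = parse_passwd(good)

try:
    parse_passwd(bad)
    rejected = False
except TypeError:
    # namedtuple construction fails when the field count does not match.
    rejected = True
```

That covers the field count only. Every value is still a string (pw_uid is "0", not 0), so any further validation is up to the caller.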

It works so nicely that I'm even making a text format for such databases with strong schema support that can still quite easily be used with generic text tools (git, grep, sed, awk, diff...). http://jstimpfle.de/projects/wsl/main.html

> So forgive all of us working folk who don't keep Star Trek-like exact logs on every problem we ever bump into. /s

Never asked for that. Give a single reasonable example why text file practice is bad, to get any credibility. It can't be that hard.

> ... And that's a fact of the daily life of many devs. You can call that a vague FUD if you wish. <shrugs>

Well, it's a bit less vague now that you have actually described it a little better. But there is no connection to text representations. Sorry, you replied to the wrong thread.


> There exists a shared protocol. It's called "explain it". But that's typically not even needed, the user can just look at the data and figure it out.

This is the root cause of 99% of all parse errors and security holes in the world.

If you just "look" at the output of ls in some arbitrary directory, there is nothing there telling you that a file name can contain a newline that will mess up the output. Write your parser with this assumption and it's broken. (See OP)

If I had a penny for every csv-"parser" I've seen that is just data=input.split(','), I would be a rich man now. Because the developer, when looking at their data, had no comma in any cell. Doesn't mean the customer doesn't have one.
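
A quick Python illustration of that failure mode, using a made-up row and the standard library's csv module for comparison:

```python
import csv

# One row, three cells: a name containing a comma, a number, a quoted remark.
line = '"Smith, John",42,"said ""hi"""'

naive = line.split(",")
proper = next(csv.reader([line]))
```

The split version yields four fragments with stray quote characters in them. `csv.reader` yields the three cells the customer actually typed.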


I'm pretty sure most security errors come from implementations of complex binary formats. (Okay, there is the web world and I hear people still haven't learnt to escape their SQL queries).

ls is only for human consumption. I said this elsewhere in this thread.

CSV is utterly broken (at least it was RFC'ed at some point, but the escaping rules are still shit. We have known for decades how to do it better).


I call "No True Scotsman"


I don't understand (sorry). Could you explain?


"All things like A are in category X. Except this long list I wrote, but they aren't really A, because I need my syllogism to work."


You missed my cynicism. I was opposed to these formats in the first place.


Which seems at odds with the thesis, at least as far as I can figure it out.


Look again. For example I wrote "space-separated" in multiple places.



