
Because for some bizarre reason, "cut" doesn't ship with any decent column selection logic that is the equivalent of awk's $1, $2, etc., even in 2020.

That's like 90% of my use of awk right there. I don't know of any easier equivalent of "awk '{ print $2 }'" for what it does.

Posted partially so the Internet Correction Squad can froth at the mouth and set me straight, because I'd love to be shown to be wrong here.




> I don't know of any easier equivalent of "awk '{ print $2 }'" for what it does.

Does `cut -f2` not work? My complaint with cut is that you can't reorder columns (e.g. `cut -f3,2` )

Awk is really great for general text munging, more than just column extraction, highly recommend picking up some of the basics

Edit to agree with the commenters below me: If the file isn't delimited with a single character, then cut alone won't cut it. You need to use awk or preprocess with sed in that case. Sorry, didn't realize that's what the parent comment might be getting at.
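To make the reordering complaint concrete, here is a minimal sketch. The helper names `swap_cut` and `swap_awk` are made up for this example; the input is assumed to be tab-separated. cut emits the selected fields in input order however the list is written, while awk prints them in the order you name them.

```shell
# cut ignores the order of the field list: -f3,2 behaves like -f2,3.
swap_cut() { cut -f3,2; }

# awk prints fields in whatever order they are named.
swap_awk() { awk -F'\t' -v OFS='\t' '{ print $3, $2 }'; }

printf 'a\tb\tc\n' | swap_cut   # b<TAB>c
printf 'a\tb\tc\n' | swap_awk   # c<TAB>b
```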


It does not. Compare:

    $ echo '   1     2   3' | cut -f2
       1     2   3
    $ echo '   1     2   3' | cut -f2 -d' '
    
    $ echo '   1     2   3' | awk '{print $2}'
    2
"-f [...] Output fields are separated by a single occurrence of the field delimiter character."


    echo ' 1 2 3' | tr -s ' ' | cut -b 2- | cut -d' ' -f2

or

    echo ' 1 2 3' | tr -s ' ' '\t' | cut -b 2- | cut -f2


oh yes, totally agree. if the data isn't delimited by a single character, then you definitely need awk or sed+cut


Also, the field separator (FS) can be a regular expression.

    FS = "[0-9]"
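A small sketch of the regex separator in use, assuming input where digits act as the delimiters. The function names are hypothetical; both forms should behave the same.

```shell
# Split on any single digit, using the -F command-line form.
split_on_digit() { awk -F'[0-9]' '{ print $2 }'; }

# Same thing, setting FS in a BEGIN block as in the snippet above.
split_on_digit_begin() { awk 'BEGIN { FS = "[0-9]" } { print $2 }'; }

echo 'foo1bar2baz' | split_on_digit         # bar
echo 'foo1bar2baz' | split_on_digit_begin   # bar
```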


IIRC, there is an invocation of cut that basically does what I want, but every time I try, I read the manual page for 3 or 4 minutes, craft a dozen non-functional command lines, then type "awk '{ print $6 }'" and move on.


> IIRC, there is an invocation of cut that basically does what I want

I don't think there is, because cut separates fields strictly on one instance of the delimiter. Which sometimes works out, but usually doesn't.

Most of the time, you have to process the input through sed or tr in order to make it suitable for cut.

The most frustrating and asinine part of cut is its behaviour when it has a single field: it keeps printing the input as-is instead of just going off and selecting nothing, or printing a warning, or anything which would bloody well hint let alone tell you something's wrong.

Just try it: `ls -l | cut -f 1` and `ls -l | cut -f 13,25-67` show exactly the same thing, which is `ls -l`.

cut is a personal hell of mine, every time I try to use it I waste my time and end up frustrated. And now I'm realising that really cut is the one utility which should get rewritten with a working UI. exa and fd and their friends are cool, but I'd guess none of them has wasted as much time as cut.
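For what it's worth, the pass-through behaviour described above can be switched off: POSIX cut has a `-s` flag that suppresses lines containing no delimiter. It is off by default, which is exactly the complaint. A minimal sketch:

```shell
# No tab in the input: cut passes the line through untouched.
echo 'no tabs here' | cut -f1        # no tabs here

# With -s, lines lacking the delimiter are dropped instead.
echo 'no tabs here' | cut -s -f1     # (no output)
```

That still is not a warning, but at least an empty result hints that the delimiter is wrong.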


Perfect example of how "do one thing and do it well" is a lie.


> Does `cut -f2` not work?

Most utilities don't use a tab character as separator, and that's what cut splits on by default. It can't cut on whitespace in general, which is what's actually useful, and what awk does.

The only way to get cut to work is to add a tr in between, which is a waste of time when awk just does the right thing out of the box.


> which is a waste of time when awk just does the right thing out of the box.

Agree in general. Only exception I'd make to this is when you're selecting a range of columns, as someone else mentioned elsewhere in the thread. I typically find (for example) `| sed -e 's/ \+/\t/g' | cut -f 1,3-10,14-17` to be both easier to type and easier to debug than typing out all the columns explicitly in an awk statement.


Instead of piping to sed, I would simply use

  | tr -s ' ' '\t'
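A sketch of the tr-then-cut pipeline with one caveat worth knowing. The wrapper name `cols` is hypothetical, and the input is assumed to contain no literal tabs of its own.

```shell
# Squeeze runs of spaces into single tabs, then let cut pick ranges.
cols() { tr -s ' ' '\t' | cut -f "$1"; }

echo 'a   b  c     d e' | cols 1,3-4      # a<TAB>c<TAB>d

# Caveat: leading spaces become one leading tab, so field 1 is empty
# and everything shifts by one compared with awk's numbering.
echo '   a  b' | cols 2                   # a
```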


"Does `cut -f2` not work?"

As others have pointed out, no. It should! (Said the guy sitting comfortably in front of his supercomputer cluster in 2020. No, I don't do HPC or anything; everything's a supercomputer by the standards of the time cut was written.) But it doesn't. Going out on a limb, it's just too old. Cut comes from a world of fixed-length fields. Arguably it's not really a "unix" tool in that sense.

"highly recommend picking up some of the basics"

I have, that's the other 10%. I've done non-trivial things with it... well... non-trivial by "what I've typed on the shell" standards, not non-trivial by "this is a program" standards.


Not if the columns are separated by variable number of spaces. By default, the delimiter is 1 tab. You can change it to 1 space, but not more and not a variable number.

In my experience, most column based output uses variable number of spaces for alignment purposes. Tabs can work for alignment, but they break when you need more than 8 spaces for alignment.


The Internet Correction Squad would like to remind you that 1) they are different programs that do different things, 2) if they changed over time, they wouldn't be portable, 3) if all you use awk for is '{print $2}', that is perfectly fine.

You can submit a new feature request/patch to GNU coreutils' cut, but they'll probably just tell you to use awk.

Edit: Nevermind, it's already a rejected feature request: https://lists.gnu.org/archive/html/bug-coreutils/2009-09/msg... (from https://www.gnu.org/software/coreutils/rejected_requests.htm...)


One of the bad things about having an ultra-stable core of GNU utils is that they've largely ossified over time. Even truly useful improvements can often no longer get in.

It's a sharp and not-entirely-welcome change from the 80s and 90s.

Here's another that would be great but will never be added: I want bash's "wait" to take an integer argument, causing it to wait until only that number (at most) of background processes are still running. That would make it almost trivial to write small shell scripts that could easily utilize my available CPU cores.


> I don't know of any easier equivalent of "awk '{ print $2 }'" for what it does.

I'm not sure if you're referring specifically to cut, but Perl has something similar and approximately as terse:

> echo 'a b c' | perl -lane 'print $F[1]'

Also, Perl can slice arrays, which is something that I really miss in Awk.


PERL is bloatware by comparison and less likely to be installed on distros than AWK. (e.g, embedded or slim distros. that's why you rarely see nonstandard /bin execs in shell scripts).


Perl used to be part of most distros, but I think favor shifted to Python a few years ago.

I wouldn't call it bloat, but yes, it is much bigger. At the time you had C (really fast, but cumbersome) and Awk/Bash (good prototyping tools, but not good for large codebases). Perl was the perfect answer: something that is fairly fast, relatively easy to develop in, and easier to write full-sized codebases in.


Larry Wall referred to the old dichotomy of the “manipulexity” of C for reaching into low-level system details versus the “whipuptitude” of shell, awk, sed, and friends. He targeted Perl for the sweet spot in the unfilled gap between them.


Thanks for explaining it better than me!


The awk on an embedded system is most likely a non-mainstream awk implementation with fewer or different features.


Proof? Or are you just guessing?


Can confirm. GNU awk is GPLv3, which means it can't be legally included on any system that prevents a user from modifying the installed firmware. This is a result of GPLv3's "Installation Instructions" requirement.

Every commercial embedded Linux product that I've seen uses Busybox (or maybe Toybox) to provide the coreutils. If awk is available on a system like that, it's almost certainly Busybox awk.

And Busybox awk is fine for a lot of things. But it's definitely different than GNU awk, and it's not 100% compatible in all cases.


That rule only applies if the manufacturer has the power to install modified versions after sale. If the embedded Linux product is unmodifiable with no update capability then you do not need to provide Installation Instructions under GPLv3.

The point of the license condition is that once a device has been sold the new owner should have as much control as the developer to change that specific device. If no one can update it then the condition is fulfilled.


Thanks. I forgot GPL is the touch of death in many cases due to how it infects entire codebases.

I can't edit my OP but I'm already downvoted so that will suffice.


Specifically GPLv3 is the sticking point - not the GPL in general. GPLv2 is a great license, and I use it for a lot of tools that I write. That's the license that the Linux kernel uses.

GPLv3 (which was written in 2007) has much tougher restrictions. It's the license for most of the GNU packages now, and GPLv3 packages are impractical to include in any firmware that also comes with secret sauce. So most of us in the embedded space have ditched the GNU tools in our production firmware (even if they're still used to _compile_ that firmware).


That's not an entirely accurate understanding of the GPLv3 "anti-tivoisation" restrictions. The restrictions boil down to "if you distribute hardware that has GPLv3 code on it, you must provide the source code (as with GPLv2) and a way for users to replace the GPLv3 code -- but only if it is possible for the vendor to do such a replacement". There's no requirement to relicense other code to GPLv3 -- if there were then GPLv3 wouldn't be an OSI-approved license.

It should be noted that GPLv2 actually had a (much weaker) version of the same restriction:

> For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. [emphasis added]

(Scripts in this context doesn't mean "shell scripts" necessarily, but more like instructions -- like the script of a play.)

So it's not really a surprise that (when the above clause was found to not solve the problem of unmodifiable GPL code) the restrictions were expanded. The GPLv3 also has a bunch of other improvements (such as better ways of resolving accidental infringement, and a patents clause to make it compatible with Apache-2.0).


I do appreciate the intention behind GPLv3. And it does have a lot of good updates over GPLv2.

The reason why I said it's impractical to include GPLv3 code in a system that also has secret sauce (maybe a special control loop or some audio plugins) is more about sauce protection.

If somebody has access to replace GPLv3 components with their own versions, then they effectively have the ability to read from the filesystem, and to control what's in it (at least partially).

So if I had parts that I wanted to keep secret and/or unmodifiable (maybe downloaded items from an app store), I'd have to find some way to separate out anything that's GPLv3 (and also probably constrain the GPLv3 binaries with cgroups to protect against the binaries being replaced with something nefarious). Or I'd have to avoid GPLv3 code in my product. Not because it requires me to release non-GPL code, but more because it requires me to provide write access to the filesystem.

And I guess that maybe GPLv3 is working as intended there. Not my place to judge if the license restrictions are good or bad. But it does mean that GPLv3 code can't easily be shipped on products that also have files which the developer wants to keep as a trade secret (or files that are pirateable). With the end result that most GNU packages become off-limits to a lot of embedded systems developers.


I will post the code fragment if I can find it (this was 10 years ago). I had a tiny awk script on an embedded system (busybox) to construct MAC addresses. There was some basic arithmetic involved and I couldn't quite figure out how to do it with a busybox shell script. The awk script didn't work at all on my Linux desktop.


Even assuming the odd "bloatware" characterization, this is irrelevant. From the article's point of view of "simple tasks", bloat or not doesn't matter; what matters is the language syntax and features used to accomplish a task (and I'd add consistency across platforms).

Regarding slim/embedded distros, it depends on the use cases, and the definition of "slim". It's hard to make broad statements on their prevalence, and regardless, I've never stated that one should use Perl instead and/or that it's "better"; only stated that the option it gives is a valid one.


Do you have data on the relative sizes of the Perl and awk install bases?

Perl is the language, and perl is the implementation. Spelling it with ALL CAPS announces that someone knows little about the language.


Unfortunately, for large files perl is significantly faster than awk. I was working on some very large files doing some splitting, and perl was over an order of magnitude faster.


A tool that is stable, well supported, has outstanding documentation, thoroughly tested, won’t capriciously break your code, and outperforms the rest of the pack is not the unfortunate case.


To be more clear, the unfortunately part was due to the title of this article. I think awk is great, but if you know perl well enough it can easily replace it and be much more versatile


> PERL is bloatware

never heard that before


There is a first time for everything, and it's true: Perl is mega bloatware, especially when compared to AWK.


That is closest, yes. I'd say clearly a couple more things to remember, but if I can get it into my fingers will be just as fluid. Awk's invocation has its own issues around needing to escape the things the shell thinks it owns, too, not that it's at all unique about that.


If you're golfing, there's also

    echo a b c | perl -pale '$_=$F[1]'


Perl stole array slicing from AWK's split() function... which slices arrays.


I define aliases c1, c2, c3, c4, etc. in my .bashrc as "awk '{print $1}'" etc.

But it's nice to have awk for the slightly more complicated cases, up until it's easier to use Python or another language.
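A rough sketch of how such shortcuts could be generated in one go. This is an assumption about the shape of the setup, not the actual .bashrc: it uses shell functions rather than aliases so that it also works inside scripts, and stops at c9.

```shell
# Define c1..c9; cN prints the Nth whitespace-separated field.
for n in 1 2 3 4 5 6 7 8 9; do
    # \$ keeps a literal dollar sign for awk; $n is filled in by the shell.
    eval "c$n() { awk '{ print \$$n }'; }"
done

echo 'alpha beta gamma' | c2   # beta
```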


I don't suppose your dotfiles are available anywhere. I'm just wondering what other useful things I can steal ;)


I don't have my dotfile here (on my phone) but here's some ideas from things I've aliased that I use a lot:

cdd : like cd but can take a file path

- : cd - (like 1-level undo for cd)

.. : cd ..

... : cd ../.. (and so on up to '.......')

nd : like mkdir but enter the dir too

udd : go up (cd ..) and then rmdir

xd : search for arg in parent dirs, then go as deep as possible on the other branch (like super-lazy cd, actually calls a python script).

ai : edit alias file and then source it

Also I set a bindkey (F5 I think) that types out | awk '{print}' followed by the left key twice so that the cursor is in the correct spot to start awk'in ;D

    # Bind F5 to type out awk boilerplate and position the cursor
    bindkey -c -s "^[[[E" " | awk '{print }'^[[D^[[D"

Edit: better formatting (and at work now so pasted the bindkey)


Sorry, not much else going on in my dot file except stuff that is peculiar to my current environment.


awk '{print $1}' can also be written as awk '$0=$1'
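One caveat with the terse form: the pattern is an assignment whose value is then tested for truth, so a line whose first field is 0 (or an empty line) is silently dropped. A small sketch with made-up function names:

```shell
first_terse() { awk '$0=$1'; }          # assignment used as the pattern
first_plain() { awk '{ print $1 }'; }   # unconditional print

printf '0 zero\n7 seven\n' | first_terse   # prints only: 7
printf '0 zero\n7 seven\n' | first_plain   # prints: 0, then 7
```

Fine for golfing, but the longer form is the safer default.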


And awk doesn't offer cut's column range selection ;)


Absolutely. Everything comes with costs & benefits. But I'm not sure I've, in my entire 23-year professional programming career, ever encountered a fixed-width text format in the wild. I've used cut even so for places where by coincidence the first couple of columns happened to be the same size, but that's really a hack.

Obviously, other people have different experiences, which is why I qualify it so. (I only narrowly missed it at the beginning, but I started in webdev, and we never quite took direct feeds from the mainframes.) But I don't think it's too crazy to say UNIX shell, to the extent it can be pinned down at all, works most natively with variable-length line-by-line content.


Some topics where you will most definitely come across fixed-width formats:

- processing older system printouts that are saved to a text file

- banking industry formats for payment files and statements

- industrial machine control

and my favourite.......

- source code.

My first intro to awk was using it to process COBOL code to find redundant copy libs, consolidate code, and generally clean up code (very, very crude linting). And it was brilliant. Fast, logical, readable, reliable: it was everything I needed.

It is also an eminently suitable language for teaching programming because it introduces the basic concept of SETUP - LOOP - END, which is exactly the same as one will find in most business systems. You find it in Arduino sketches; hell, you even find it in a browser, which is basically just a whole universe of stuff sitting atop a very fast loop that looks for events.

AWK fan for sure - my hierarchy of languages these days would be the command line where there is a specific command doing all I need, AWK for anything needing to slice and dice records that don't contain blobs, Python for complete programs, and Python+Nuitka or Lazarus or C# when I need speed and/or cross-platform support.
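The SETUP - LOOP - END shape mentioned above maps directly onto awk's BEGIN block, per-line block, and END block. A minimal sketch; the function name and the two-column input are assumptions for illustration.

```shell
# SETUP: BEGIN runs once before any input is read.
# LOOP:  the middle block runs once per input line.
# END:   END runs once after the last line.
sum_second_column() {
    awk '
        BEGIN { total = 0 }
              { total += $2 }
        END   { print "lines:", NR, "total:", total }
    '
}

printf 'a 1\nb 2\nc 3\n' | sum_second_column   # lines: 3 total: 6
```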


> But I'm not sure I've, in my entire 23-year professional programming career, ever encountered a fixed-width text format in the wild.

SAP comes to mind. I think it does support various different formats, but for one reason or another fixed-width seemed to be some kind of default value (that's what I usually got when I asked for an SAP feed at least, but that was years ago).


I can confirm that they are not a common problem.

Admittedly I have encountered fixed-width text formats in the wild. But the last such occasion was about 15 years ago. (It was for interacting with a credit card processor to issue reward cards.)


Within my first year of professional development, I encountered several fixed-width files I needed to read and write. I suppose exposure depends a lot on the specific industry.


Also big mainframe users (banks, insurance) often send fixed width data to us.


Several scientific data formats in my industry have fixed-width columns that trace back to the era of punch cards.


I'm an expert on neither AWK nor cut, but AWK allows you to select a field, or the entire line, and then substrings within those.

Select characters 3-6 (inclusive) in the second field:

$ echo Testing LengthyString | awk '{print substr($2,3,4)}'

> ngth

If you want to select columns from the entire line, then:

$ echo Testing LengthyString | awk '{print substr($0,3,4)}'

> stin

Is that what you meant?
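For comparison, selecting columns of characters from the whole line is the one place where cut and awk line up neatly. A quick sketch, assuming plain ASCII input:

```shell
# Characters 3-6 of each line, two ways.
echo 'Testing LengthyString' | cut -c3-6                         # stin
echo 'Testing LengthyString' | awk '{ print substr($0, 3, 4) }'  # stin
```

Note that cut takes a start and end position, while substr takes a start and a length.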


I don't think so. I think they're referring to cut's ability to select an arbitrary range of columns, e.g. `cut -f 2-7` to select the 2nd through 7th columns, while awk requires you to explicitly enumerate all desired columns, i.e. `awk '{print $2, $3, $4, $5, $6, $7}'`.


> "while awk requires you to explicitly enumerate all desired columns"

Or you can use loops, e.g.,

  echo $(seq 100) | awk '{ for (i = 2; i <= 7; i++) { print $i; }; }'
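Note that the loop above prints each field on its own line. A variant that keeps the selected fields on one line, the way cut does, might look like this (the function name `fields_2_to_7` is just for illustration, and the range is hard-coded):

```shell
fields_2_to_7() {
    awk '{
        # Separate fields with OFS; end the record with ORS (a newline).
        for (i = 2; i <= 7; i++)
            printf "%s%s", $i, (i < 7 ? OFS : ORS)
    }'
}

echo 1 2 3 4 5 6 7 8 9 10 | fields_2_to_7   # 2 3 4 5 6 7
```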


It's more characters than the manual way. There should just be a built-in function.


What cauthon said: cut lets you select a range of fields with a simple option and awk doesn't. For example, for a CSV line:

  cut -d, -f10-30

(selects fields 10 through 30)

Not saying this can't be written in awk with more code, but we were talking about field selection ergonomics.


OK, understood. And yes, in my experience AWK is less good at that, cut would definitely be the right tool.

It doesn't detract from the point at hand - which is perfectly valid - but it's worth noting that there's a confusion here with regard to the terminology: "fields" vs "columns". I thought they were referring to "columns of characters" whereas the added explanations[0] are about "columns of fields". That makes a difference.

But as I say, yes, I agree that to select a range of columns of fields, especially several fields from each line, is definitely better with cut.

[0] https://news.ycombinator.com/item?id=22109551


I wrapped awk in a thin shell script, f, so that you can say “f N” to get “awk '{print $N}'” and “f N-M” (where either of the end points are optional) to do what cut does, except it operates on all blanks.

Repo has a few other shortcuts, too:

https://github.com/kseistrup/filters
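For readers who just want the idea, here is a rough sketch of what such a wrapper could look like. To be clear, this is not the code from the repo above, only a guess at the described behaviour written as a POSIX shell function; the variable names are made up.

```shell
# f N     -> print field N
# f N-M   -> print fields N through M
# f N-    -> print fields N through the last one
# f -M    -> print fields 1 through M
f() {
    case $1 in
        *-*) lo=${1%%-*}; hi=${1#*-} ;;
        *)   lo=$1;       hi=$1      ;;
    esac
    # An empty lo means "from field 1"; hi=0 means "through the last field".
    awk -v lo="${lo:-1}" -v hi="${hi:-0}" '
        BEGIN { lo += 0; hi += 0 }
        {
            last = (hi == 0 || hi > NF) ? NF : hi
            for (i = lo; i <= last; i++)
                printf "%s%s", $i, (i < last ? OFS : "")
            print ""
        }'
}

echo 'a b c d e' | f 2-4   # b c d
```

Because awk does the splitting, runs of blanks and leading whitespace are handled without any tr or sed preprocessing.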


The reason isn't that bizarre; it's a POSIX utility and must conform to the specification, which you can read here:

https://pubs.opengroup.org/onlinepubs/9699919799/utilities/c...


Surely we can infer that jerf would just say that it's bizarre for the POSIX specification to not include any easy way to get columns.


Does POSIX forbid extensions? That would be terrible.


Well... no, it doesn't, obviously, we can take a quick look at gawk to confirm this.

But, just as we can't say "awk offers the -b flag to process characters as bytes", we can't really say that cut offers any extensions not defined in the standard.

An implementation could, sure. I'd prefer that it didn't, writing conformant shell scripts is hard enough.


Before standards happen, creative developers are free to use their imagination and come up with useful features. Then someone makes up a standard, and from then on progress is halted. If people aren't making extensions that might one day make it into the next revision, the only way you grow new features and functionality is through design-by-committee. I think it is ridiculous.

Tools should improve, and standards should eventually catch up & pick up the good parts.

People who need to work with legacy systems can just opt not to use those extensions, but one day they too will benefit. Others benefit immediately.


I find that, for these kinds of utilities, all the extra add-ons tend to cost me more in the form of, "Whoops, we tried using this on $DIFFERENT_UNIX and it broke," than they ever save me.

When I'm looking to do something more complicated, I'd rather look to tools from outside the POSIX standard. The good ones give me less fuss, because they can typically be installed on any POSIX-compliant OS, which is NBD, whereas extensions to POSIX tools tend to create actual portability hassles.



I'm quite sure we just linked to the same document; were you meaning to address the grandparent?



