
If you use GNU grep on text files, use the -a (--text) option - rurban
https://utcc.utoronto.ca/~cks/space/blog/unix/GNUGrepForceText
======
gpvos
I just looked through the GNU grep history to see when it suddenly started
being able to decide halfway through a file that it is binary after all; this
is since 16 September 2014, so fairly recently. Before that, it just checked
the first few kilobytes to decide, and didn't change its opinion afterwards.
To me, this is a very nonintuitive change.
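
A minimal sketch of the behaviour described above, assuming GNU grep; the file contents are made up for illustration. A NUL byte late in an otherwise plain log is enough for grep to treat the file as binary, and `-a` forces text handling:

```shell
# Build a small "log" whose last line carries a stray NUL byte.
log=$(mktemp)
printf 'error: disk full\nerror: retry\n' > "$log"
printf 'error: after \000 byte\n' >> "$log"

# Without -a, GNU grep may stop printing lines and only report that a binary
# file matches (exact wording and output stream depend on the grep version).
grep error "$log" || true

# With -a (--text), every matching line is counted and printed as text.
text_matches=$(grep -a -c error "$log")
echo "matches with -a: $text_matches"
rm -f "$log"
```

The count uses `-c` so the result does not depend on how a given grep version words its binary-file message.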

~~~
gpvos
Why is it that in recent years, with this and the more recent ls quoting
fiasco, maintainers of longstanding UNIX utilities suddenly got the urge to
fix what isn't broken?

~~~
chubot
What's the 'ls quoting fiasco'?

Actually I recently found that coreutils and ls behave fairly well with funny
filenames:

Here is an invalid UTF-8 byte followed by a valid UTF-8 sequence:

    $ x=$'\xce\xce\xbc'
    $ touch "$x"

You can list it:

    $ ls
    ?μ

And here 'ls' does better than other tools that display filenames. It shows
the invalid byte and then keeps decoding with error recovery:

    $ ls --escape
    \316μ

However GNU stat (which I think is also in coreutils) does something similar,
but weirdly messed up:

    $ stat *
    File: ''$'\316''μ'

(it looks like it's outputting a valid shell string, except with extra quotes)

\-----

Most command line tools are not aware of stuff like this. For example you can
touch "x$ANSI_TERMINAL_CODES" and if you do "bash x??" or "python x??", then
your terminal will change color because of the escape codes printed back to
the terminal.

I just changed Oil to use a well-defined format I called QSN (quoted string
notation):

[http://www.oilshell.org/blog/2020/04/release-0.8.pre4.html#t...](http://www.oilshell.org/blog/2020/04/release-0.8.pre4.html#the-highlights)

It adapts Rust's string literal syntax to express arbitrary byte strings
precisely and losslessly. (JSON can't express arbitrary byte strings.)

The QSN encoder does UTF-8 _decoding_ with a specific error recovery
mechanism. So it's basically like what ls and stat do, but it's more precise.

(If anyone is interested in QSN, please contact me. I think it's more
generally useful in a lot of places. It's something we already do but it's
precise like JSON.)

~~~
_jal
They broke it in 2016.

[https://www.gnu.org/software/coreutils/quotes.html](https://www.gnu.org/software/coreutils/quotes.html)

At least with GNU, you can recompile your own, non-broken version, which is
the only saving grace of these stupid, trendy changes.

~~~
chubot
The old behavior was the bad one; the new behavior is good. And that link
explains why very well.

I guess I should have figured that oblique references to "ls quoting fiasco"
are shorthand for "I don't understand what's wrong but I'm angry about it..."

(On the other hand I would say the grep -a issue is bad both before and after
because either way it relies on autodetection. The fundamental issue there is
that there is too much variance in encodings, which isn't easy to fix. Luckily
UTF-8 is growing in popularity, and it doesn't have this issue because it
doesn't require metadata for extremely common operations like "find ascii
substring".)

~~~
wruza
”The old behavior was the bad one; the new behavior is good”

If you’re a human, yes. If you’re a script, it breaks you in half. If you’re a
script that has to run on various versions, then maybe it’s time to fix yourself
and use find. You’re a sophisticated script after all, not one of these who
require a human with a debugger. Modern culture may not appreciate that little
‘compat’ thing, but it is essential if you want something to continue to work
and not just stop and wait for someone’s educated guesses. Good software
doesn’t point fingers at you, it just works. I remember how recently I wanted
to check network interfaces on some machine and commanded ‘ifconfig’. Now it’s
called ‘ip a’, and there is no ifconfig. I can guess the reason – ifconfig was
bad and ip is good. There is also an eternal “FAT” label issue in unetbootin
app, which resurrects every time Apple changes its fdisk output format (in
every release, as it seems). The workaround is to run it with a cli option – the
very thing that unetbootin was created to skip. This is what makes systems
so much fun. Without all these cool things, we would just sit there and cry
over our uselessness.

ed: I read below that ls does that in interactive mode only, maybe it’s not
that bad then.

~~~
int_19h
You can't just expect compatibility about random things - that's why we have
formal contracts: standards, specifications, documentation. A human-readable
output of any app, in particular, should never be assumed to be stable, or to
have a specific format (even if observations imply it), unless its docs
specifically say otherwise.

------
linsomniac
I mostly run into this on searching my environment: "set | grep whatever", now
needs an "-a", possibly because of escape codes added to the environment a
decade ago.

Maybe the fix would be to only activate the "detect binary files" code if
stdout isatty?

Because it is a nice feature when I do a big grep to find something among my
home directory or the entire filesystem. It is certainly annoying to get
binary garbage in my terminal. Or maybe the binary detection could get
smarter, maybe making the determination on a match-by-match basis ("This line
I'm about to output is a kilobyte and half of it is non-printable", say).

Though, ack-grep doesn't seem to avoid putting binary garbage on my terminal,
so maybe reasonable to switch to something that isn't so clever? Most of my
terminal grepping is done with ack these days, so I'd probably be happy with
gnu-grep disabling this cleverness.
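
The "only when stdout is a tty" idea above can be approximated today with a small wrapper. This is only a sketch: `tgrep` is a hypothetical name, and the sample input with a NUL byte is an assumption about what trips the detection.

```shell
# Hypothetical wrapper: keep grep's binary-file detection for interactive use,
# but force text mode (-a) when stdout is a pipe or a file.
tgrep() {
    if [ -t 1 ]; then
        command grep "$@"
    else
        command grep -a "$@"
    fi
}

# Captured usage: stdout is not a terminal here, so -a is applied.
piped=$(printf 'PATH=/bin\nPS1=\033[0m\000$\n' | tgrep -c 'PS1')
echo "lines matched when piped: $piped"
```

Interactive greps over a home directory keep the protection against binary garbage, while pipelines such as `set | tgrep whatever` get the plain text behaviour.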

~~~
downerending
Usually you want this feature no matter where the output is going. Adding "-a"
sucks, but it's not obvious how else this could work (and still be backward-
compatible).

IIRC the grep heuristic only considers a short prefix of the file. If the
garbage comes later, you lose. Unfortunately, this makes things seem a bit
unpredictable.

~~~
gumby
> IIRC the grep heuristic only considers a short prefix of the file. If the
> garbage comes later, you lose. Unfortunately, this makes things seem a bit
> unpredictable.

This was changed about five years ago to just keep looking. Which makes things
a bit unpredictable in a different way.

------
arendtio
Does someone know if using grep on a binary file is somehow defined by POSIX?

At a glance, I couldn't find a reference on the grep page:

[https://pubs.opengroup.org/onlinepubs/9699919799/utilities/g...](https://pubs.opengroup.org/onlinepubs/9699919799/utilities/grep.html#top)

~~~
jwilk
It isn't. The page says:

> _The input files shall be text files._

"Text file" is defined in
[https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...](https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_403)
:

> _A file that contains characters organized into zero or more lines. The
> lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in
> length, including the <newline> character._

~~~
arendtio
Thank you :-)

Interesting, especially the part about the LINE_MAX. Even though it kinda
makes sense, I would never have thought that having a very long line makes a
file a non-text file when all characters are 'normal' characters.

~~~
bawolff
I'd kind of like that. Grepping in a directory also containing minified js
(with ten thousand character long lines) is a pain.

------
arendtio
Actually, I never considered log files untrusted input, but as this example
shows, it would be wise to do so.

~~~
JoachimSchipper
FWIW, attacks like Javascript or SQL injection via logfiles are hardly
unknown. Log files are plenty scary. ;-)

~~~
hinkley
Someone did a POC of a CSRF against the admin interface for Cisco routers.
They sent garbage packets that sufficiently confused the 'hex editor' display
view in the admin web pages in such a way that it made a request to another
page, changing permissions.

------
king_phil
I see this and suddenly it clicks! That is exactly why I couldn't import an
SQL dump that I tried importing for days now, that is filtered with grep. Wow.

And I was wondering all the time why mysql reported this strange error "SQL
error in Binary file" when the .sql file was clearly a text file...

------
viklove
I started using ripgrep a few years ago and haven't looked back. It's way
faster, automatically excludes .gitignored files, and just has a bunch of
common sense functionality.

[https://blog.burntsushi.net/ripgrep/](https://blog.burntsushi.net/ripgrep/)

------
zajio1am
This 'feature' is especially irritating when one uses grep on some text files
with legacy (non UTF-8) encoding, but has locale with UTF-8 encoding. The grep
decides that regular text file is binary just because there are byte sequences
that are not valid UTF-8 sequences.

~~~
_ZeD_
How... How can grep possibly work on a strange non-utf8 encoding if you don't
tell it which one?

~~~
twic
If you're grepping for ASCII strings, then the UTF-8 pattern will match the
Latin-1 file.

~~~
_ZeD_
how can you expect to find ASCII strings in the parent's "text files with
legacy (non UTF-8) encoding"?

~~~
twic
Because some of them, like Latin-1, which i mentioned in the comment to which
you replied, are supersets of ASCII.
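
A small sketch of that point, assuming a grep that supports `-a`; the sample text is invented. The byte 0xE9 is "é" in Latin-1 but not a valid sequence on its own in UTF-8, yet an ASCII pattern still matches byte for byte:

```shell
# A Latin-1 encoded line: octal 351 is 0xE9, "é" in Latin-1.
f=$(mktemp)
printf 'caf\351 menu\n' > "$f"

# "menu" is encoded identically in ASCII, Latin-1 and UTF-8, so a byte-wise
# search finds it. -a keeps a UTF-8 locale from declaring the file binary
# because of the stray 0xE9 byte.
latin1_hits=$(grep -a -c 'menu' "$f")
echo "hits: $latin1_hits"
rm -f "$f"
```

Searching for the non-ASCII part ("é" itself) is a different matter, since the pattern and the file would then disagree on the encoding.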

------
kwoff
Also precede the file list by `--`. A very confusing thing can happen if a
file happens to begin with a dash... (You can intersperse options like `-e
pattern` among file names, if for some reason you wanted to do that.)
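
A quick sketch of the failure mode, using a scratch directory and made-up file names:

```shell
# Work in a scratch directory containing a file literally named "-v".
dir=$(mktemp -d)
cd "$dir" || exit 1
printf 'needle\n' > ./-v
printf 'needle\nhay\n' > notes.txt

# "grep needle *" would expand to "grep needle -v notes.txt", silently turning
# the file name into the invert-match option. "--" ends option parsing, so
# both files are searched as files.
dash_hits=$(grep needle -- * | grep -c needle)
echo "matching lines: $dash_hits"
cd - >/dev/null && rm -rf "$dir"
```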

~~~
iforgotpassword
But that's not grep specific and generally a good idea, especially in scripts
that get the file names from their command line, some input file or god knows
where.

------
jindraj
Have you thought about the GREP_OPTIONS environment variable?
[https://www.gnu.org/software/grep/manual/grep.html#Environme...](https://www.gnu.org/software/grep/manual/grep.html#Environment-Variables)
You can define it at the beginning of the script.
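
The linked manual section marks GREP_OPTIONS as obsolescent and suggests an alias or script instead, so a script-local wrapper function is one alternative. A minimal sketch, with invented sample input:

```shell
# Instead of exporting GREP_OPTIONS, shadow grep with a function for the rest
# of this script. "command" bypasses the function so it does not call itself.
grep() {
    command grep -a "$@"
}

# The NUL byte would normally make grep treat this input as binary.
wrapped=$(printf 'INSERT INTO t VALUES (1);\000\n' | grep -c 'INSERT')
echo "matched lines: $wrapped"
```

Unlike an environment variable, the function is not inherited by child processes, so other programs that call grep are unaffected.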

~~~
yellowapple
> As this causes problems when writing portable scripts, this feature will be
> removed in a future release of grep, and grep warns if it is used. Please
> use an alias or script instead.

------
spenrose
Love ack for source code and text docs:
[https://beyondgrep.com](https://beyondgrep.com)

~~~
Skunkleton
I used ack for a while, switched to ag (I don’t remember why? FOTM maybe?),
and finally ended up with ripgrep. If you haven’t tried ripgrep you definitely
should. It has almost completely replaced gnu grep for me.

~~~
petepete
I did the same but stayed with ag. For me, like with its contemporary find
replacement fd, it's the UX that provides the most benefit.

The speed benefit isn't really a huge factor; working how I'd expect, omitting
ignored files, being able to specify file extensions to search, plus simple
editor integration. Amazing.

~~~
Skunkleton
I keep meaning to replace find, but never get around to it. I will check out
fd. Ty.

------
chaps
Also works well on non-text files, similar to `strings`! I don't think it
works as well, but can still be useful for quick checks.

~~~
lizknope
I normally do:

    strings file | grep search_pattern

~~~
cesarb
But remember, always use "strings -a"!
[http://lcamtuf.blogspot.com/2014/10/psa-dont-run-strings-on-...](http://lcamtuf.blogspot.com/2014/10/psa-dont-run-strings-on-untrusted-files.html)

~~~
steerablesafe
It's the default since 2014:

[https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=c...](https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=commit;h=7fac9594c41ab180979bdf5927ff7f7e1d13a9e9)

edit: I would say that if you are doing forensics on an untrusted binary and
you are not using a dedicated VM for it then you are not careful enough.
objdump, nm are still attack vectors, not to mention debuggers and
disassemblers.

------
battery_cowboy
Oh man, I've had this issue before and I just chose to nuke the logs and try
again, thinking they were corrupted!

------
eu
I normally use -I to skip binary files
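
For comparison with `-a`, a minimal sketch of `-I` on a scratch directory (file names and contents are made up):

```shell
dir=$(mktemp -d)
printf 'TODO: fix parser\n' > "$dir/notes.txt"
printf 'TODO\000\001\002\n' > "$dir/blob.bin"

# -r recurses, -l lists matching file names, and -I treats binary files as if
# they contained no match, so only the text file is reported.
listed=$(grep -rIl 'TODO' "$dir")
echo "$listed"
rm -rf "$dir"
```

Note that `-I` inherits the same detection heuristic discussed in this thread: a log file with one stray NUL byte would be skipped entirely.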

------
nerdponx
Why is this even a thing?

This is a serious anti-feature as far as I can see.. can someone clarify
otherwise for me?

~~~
_ZeD_
Try `cat /usr/bin/*` in your terminal

------
exabrial
> 'LC_ALL=C'

Wuff, reminds me of the completely incompatible difference between BSD sed and
GNU sed

------
the_jeremy
Shoutout to
[RipGrep](https://github.com/BurntSushi/ripgrep),
which is generally faster, has more intelligent defaults (searches cwd by
default, ignores files matching .gitignore), and can search through only
certain text files (like your .java and .py files, say). Not affiliated, just
found it worth the effort to learn some slightly different flags, though many
are the same as normal grep.

~~~
ainar-g
> ignores files matching .gitignore

Maybe it's just me, but that sounds like a bad default. I can definitely
imagine people being confused by that.

~~~
burntsushi
They are. That's why it's always mentioned in the first few sentences of docs
(man page, --help, README). With that said, this default is one of ripgrep's
defining features and is something that users consistently report as one of
their favorite things about ripgrep.

You can disable all smart filtering (gitignore, hidden, binary) with `rg -uuu
foo`. That will search the same stuff that `grep -r foo ./` will.

------
jodrellblank
Drum banging time: another good use for the UTF8 BOM.

