Tell HN: People forget that you can stick any data at the end of a bash script
321 points by BasedAnon on July 5, 2023 | 141 comments
This is a neat trick I've used to write self-extracting software or scripts that extract files from archives by just using

    tail -c <number of bytes for the binary> $0
All you have to do is make sure you append an explicit 'exit' to the end of your program before your new 'data section', so that bash won't parse any of the 'data section'.

One thing to bear in mind is that if you append binary data, it will be corrupted if you save the file in most text editors, so when I want to make changes I just delete all the binary and reappend it.
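
A minimal sketch of the whole pattern, with a made-up payload size (in practice you'd compute the byte count of the archive when appending it):

    #!/bin/bash
    # hypothetical: size in bytes of the appended archive
    PAYLOAD_SIZE=123456
    # grab the last PAYLOAD_SIZE bytes of this very file and unpack them
    tail -c "$PAYLOAD_SIZE" "$0" > /tmp/payload.tar.gz
    tar -xzf /tmp/payload.tar.gz
    exit 0
    # (binary 'data section' appended after this point)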




If you care less about space efficiency and more about maintainability of the script, you can also encode the binary as base64 and put an

  echo '...base64 data...' | base64 -d > somefile
in your script.

Or add compression to reclaim at least some of the wasted space:

  echo '...base64 gzipped data...' | base64 -d | gunzip > somefile
Also note that bash accepts line breaks in quoted strings and the base64 utility has an "ignore garbage" option that lets it skip over e.g. whitespace in its input. You can use those to break up the base64 over multiple lines:

  echo '
    ...base64 gzipped data...
    ...more data...
    ...even more data...
  ' | base64 -di | gunzip > somefile


If you care about maintainability, you keep the binary data out of the source file and have a build process.


I meant "maintainability" in the sense that you can open it in a text editor without either corrupting the file or having the editor crash.

It's also useful if you want to check the file into a git repo and want to keep using the diff and merge tools.


For something small, I would take a data.bin file rather than a build process. But yes.


The "build process" in this case is concatenation.


You can also use here-documents to avoid hitting any argv length limits:

    { base64 -d | gunzip > output; } <<EOF12345
    ...data...
    EOF12345


An even simpler way would be to include a marker to denote the end of the shell script, and the start of the data. For example, if you put this in extract.sh

    #!/bin/sh
    sed -E '1,/^START-OF-TAR-DATA$/d' "$0" | tar xvzf -
    exit
    START-OF-TAR-DATA
and then run:

    cat extract.sh ../foobar.tar.gz > foobar.tar.gz.sh
You can then run foobar.tar.gz.sh to self-extract. And you still get the benefit of being able to modify the shell script without needing to count lines or characters, and without sacrificing any compression.


You've already got a marker to denote the end of the shell script baked in there: it's ^exit$
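
i.e. something like this, reusing the unindented exit line as the delimiter (just a sketch; see the caveats discussed below):

    #!/bin/sh
    sed '1,/^exit$/d' "$0" | tar xvzf -
    exit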


There may be more than one exit.


There will be only one that is neither preceded by indentation nor followed by an exit code, so that it could match ^exit$, unless you contrive some hypothetical nonsense purely for the sake of contrarianism.

Any reasonable person will indent a conditional exit within the block testing its condition, and more than one unconditional exit doesn't make sense.


It would still make me uncomfortable that a 2nd exit or lack of indent would break my script. And it's less clear than an explicit marker.


Ah fair enough. I didn’t think about using the indent, that’s pretty clever.


Is there an encoding that is less wasteful than base64 but not vulnerable to text editor corruption issues? I think avoiding 0x00 to 0x20 should be enough to not get corrupted by text editors, though base64 avoids a lot more than that.


If you can count on every printable ascii character being not-mangled, you can use ascii85/base85/Z85 (5 "ascii characters" to 4 bytes) instead of base64.


there's also base91, with an efficiency of 6.5 bits of data per printable character, compared to 6.4 with ascii85, 6.0 with base64


There's probably a base(bigger number) with Unicode chars today


base65536, and look who the author is :-D

https://github.com/qntm/base65536


Who is the author?




But you need to make sure to use utf-16 or utf-32 instead of utf-8, or you may be worse off.


Those get mangled by text editors that don't support them.


While a couple of people suggested Base65536, that encoding isn't particularly compact, and it can't be as elegant as 65536 would suggest because it has to dodge special cases in unicode.

It's almost always the case that either Base32768 is denser, or encodings with 2^17 or 2^20 characters are denser.


At that point you're basically doing yEnc.


if you mean the thing you want to encode is mostly-ascii, then https://en.wikipedia.org/wiki/Quoted-printable ... it's a real throwback, I've not seen this in the wild since the 90s, but it's there in the python standard library (quopri), perl (MIME::QuotedPrint) etc


Base85, also called Ascii85. Also yEnc.


base16


Woah, why the downvotes.


Beware that often you need to append -n to echo to not include a newline when you base64 encode/decode something.


I think the -i flag when decoding should handle this and cause the decoder to skip the newlines.


Just to be sure I’m following you correctly, what is the advantage of zipping the base64 data vs having the original binary, zipped if you like?


As I understood it, you base64 the zipped data on input and do the reverse on output.

The reasoning being that the base64'd binary data is safe from being corrupted when the file is edited in text editors, in response to the warning stated in the last paragraph of the original post.


The idea is to first zip the binary, then base64 the zipped data. Conversely, the script first decodes the base64 to a zipped binary, then unzips the binary.

It's just to mitigate the wastefulness of base64 encoding. You end up with a file that is text editor friendly and not quite as bloated as directly encoding the binary would be - but of course the file is still larger than simply appending the binary directly like in the OP.

Also, if you don't care about text editor friendliness, you could indeed just zip the binary and then append it to the script for an even smaller file.
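
Spelled out, the two directions look something like this (mirroring the snippets upthread; the file name is a placeholder):

  # at build time: compress first, then base64-encode the compressed bytes,
  # and paste the output into the script
  gzip -c somefile | base64

  # in the script: decode the base64, then decompress
  echo '...base64 of the gzipped file...' | base64 -d | gunzip > somefile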


The build process would produce the concatted file


This trick is used in the demoscene. Instead of using -c, I use -n,

  tail -n +2 $0
The -n +2 option means “starting at line 2”, which is what you want if you cram your script into one line. You can make an executable packed with lzma this way,

  a=`mktemp`;tail -n+2 $0|unxz>$a;chmod +x $a;$a;rm $a;exit
This is the polite way to do it, using mktemp. You can save some bytes if you don’t care about that stuff.


There must be a way to run something without needing a temp file...


If you're willing assume a Linux platform and a Python installation, you can do something like:

  echo -n "base64encodedelf" | base64 --decode | python3 -c 'import os;import sys;f=os.memfd_create(os.MFD_CLOEXEC);os.write(f,sys.stdin.buffer.read());os.execve(f,["dummy"],[])'


Surprising that no combination of bash symbols lets you do this.


Neat. But is the base64 even necessary?


I don’t know that every editor today will gracefully handle a blob that isn’t valid UTF-8. Emacs needed a library (so-long) to disable some features that choke trying to render very long lines.


You'd edit the shell script without the binary at the end, and once you're done you append the blob.


Yes. You can load the decompressed code into memory, mprotect() it to give it execution permissions, then jump in. Hard to do in a shell, though.


Basically everything in Linux will create a temp file one way or another, even pipes. For you to take a binary and run it directly, it has to have an inode. At best you can use Python & the ctypes module to write a program into part of Python's memory and trick it into continuing execution from there.


Pipes are not temp files, they’re more like kernel buffers associated with no file. When I think of temp file, I think of something that is at least associated with a filesystem.

The reason that running a binary needs a file is because execve() takes a path as an argument. But, as you said, there are other ways to load code into memory.


> Basically everything in Linux will create a temp file one way or another even pipes.

pipe(2) doesn't create a file, at least not in the sense that I usually think about files (something that's accessible through the filesystem).


Ruby (and earlier, Perl) formalised this with the __END__ section: https://www.honeybadger.io/blog/data-and-end-in-ruby/



Back in the day I made a Perl script that would use an inline encryption algorithm to decode a payload and execute it in memory so it would never hit the disk. Data after the __DATA__ line.


While applying for a new job once, I made a self-unpacking CV in Ruby using this trick; the whole binary content was compressed with zlib, and I also added some blank padding to get 'nice' numbers, so I could do stuff like DATA.seek(1337 * 32, IO::SEEK_CUR). Needless to say, nobody appreciated the idea; then again, what was I expecting? :D


Are you sure that Perl took it from ruby and not the other way around?

(edit: a subsequent correction has obsoleted this comment)


They said "earlier, Perl".


yup. after that you can use the global var DATA to access the data injected after the __END__


When I was first learning Ruby knowing Perl rather well, it was discovering that it supported DATA/__END__ that made me feel like I was at home ...



Thank you. As a shell amateur I was having trouble wrapping my head around the original description.


Shell archive it was called? There used to be a lot of installers like that.



Exactly. Very popular way of distributing software e.g. on alt.sources, back in the day.


Minor nit: every "shar" I've seen (from distant memory) used a "here document" rather than appending (possibly binary) data to the end of a shell script.

https://en.wikipedia.org/wiki/Here_document


Hmm, I have a Yocto-generated SDK installer handy and it seems to do just this:

    payload_offset=$(($(grep -na -m1 "^MARKER:$" "$0"|cut -d':' -f1) + 1))

and at the end of the code it just has:

    exit 0

    MARKER:


Since zip files use a directory at the end, you can make a kind of mullet file - script at the front, archive at the back. I generated single-file runnable Java binaries like that at one point.


> mullet file

That's a great expression


In German that would be VokuHila: Vorne Kurz, Hinten Lang

Front Short, Back Long

https://de.wikipedia.org/wiki/Vokuhila


I note that in Danish it is called "Svenskerhår" [1], which I think means "Swedish hair", and in Polish, "Czeski piłkarz" [2], or "Czech footballer"! Meanwhile in Swedish, it is "hockeyfrilla" [3], the "hockey frill".

[1] https://da.wikipedia.org/wiki/Frisure

[2] https://pl.wikipedia.org/wiki/Czeski_pi%C5%82karz

[3] https://sv.wikipedia.org/wiki/Hockeyfrilla


Czech Footballer HAHAHA Amazing


Similar trick to bundle a python script with self-contained dependencies (with some limitations, and doesn't include python itself): https://n8henrie.com/2022/08/easily-create-almost-standalone...



Ha, turns out I just wrote this helper function a few weeks ago, inspired by Perl and Ruby:

    #!/usr/bin/env bash

    # read data starting from the provided section marker up to the next one or EOF
    function section() {
        local section="$1"
        local source="${BASH_SOURCE[0]}"
    
        awk '/^__[A-Z0-9]+__$/{f=0} f{print} /^'"${section}"'$/{f=1}' "${source}"
    }
    
    section __JSON__ | jq
    section __YAML__ | ruby -ryaml -e 'p YAML.load(STDIN.read)'
    
    exit
    
    __JSON__
    { "a": 1 }
    __YAML__
    b:
      - 1
      - 2
      - 3
My only wish is that shellcheck had a directive to stop yelling at me starting at a certain line.

Usually I augment it with such functions for clarity:

    # whatever raw data
    function data() {
        section __DATA__
    }

    # man/perldoc like
    function doc() {
        section __DOC__
    }

    # command line help
    function help() {
        section __HELP__
    }


> My only wish is that shellcheck had a directive to stop yelling at me starting at a certain line.

But it does, no? Shellcheck can be configured to ignore a check by putting a "disable" directive in a config file or even as a comment in your Bash source file. That would work if it's always the same annoying SCxxx warning you get.


Directives apply line by line. The only ones that apply to a bigger scope must be placed on the line right after the shebang, and they apply to the whole file.

https://github.com/koalaman/shellcheck/wiki/Directive

> Directives that replace or are immediately after the shebang apply to the entire script. Otherwise, they are scoped to the command that follows it

https://github.com/koalaman/shellcheck/wiki/Ignore

> Note that the directive must be on the first line after the shebang with versions before 0.4.6. As of 0.4.6 comments and whitespace are allowed before file-wide directives.


There's an issue already about it: https://github.com/koalaman/shellcheck/issues/760


In Perl, __DATA__ indicates the beginning of the data section of the file. A portable way to provide test data or sample data.

https://perldoc.perl.org/functions/__DATA__


That's how I made a bash backdoor once. It was just a script somewhere on the FS, until it unpacked itself and executed the rest of the rootkit.

Long story but trust me that I had good intentions.


This is a great trick, but no one should ever run someone else's script that does this unless they have verified the script line by line beforehand.


Maybe? People run all manner of binaries/installers without checking them; I'm not sure why these sorts of things require any EXTRA scrutiny.


EXACTLY! I can't stand the overblown security posturing with regards to things like bash and git, even python.

Meanwhile everyone in the world, INCLUDING cyber-security professionals, seems to think it's perfectly OK to download setup.exe or install.msi from sketchy driver mfgs. and run them. But SHARs and curl | sh are too scary!

It's not that any of these are a good idea, it's the blind spot for traditional binary installers that's annoying. How do you audit them? Why are they OK? (Because they have to be.)


curl | sh is actually bad because a completely innocuous script can become dangerous with a network hiccup, e.g.

  rm -rf /path/to/directory
becomes

  rm -rf /path
if the connection accidentally drops half way. Or something less dramatic, but still leaving you in a broken state.

Some people took the “don’t curl | sh” advice as “you have to inspect every line of a shell script installer you download”, which is of course absurd.


Ideally, scripts that are in any danger of being truncated are supposed to be wrapped in immediately invoked functions exactly because of this gotcha, and most (but not all) of the ones I inspect before running are.
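
The wrapper pattern looks roughly like this (a generic sketch, not any particular installer):

  #!/bin/sh
  main() {
    # nothing below runs until the closing brace and the final call have arrived,
    # so a truncated download fails with a syntax error instead of half-running
    rm -rf /path/to/directory
    # ... rest of the installer ...
  }
  main "$@"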


that's a very interesting point and example. TCP is normally lossless and byte-for-byte before data is forwarded to the application. I think your example may illustrate a loophole - truncation.

It'd only work if, like your example, it was the truncation that caused the flub. You can't have random corruption.

Does curl in a pipeline really behave that way - if the connection is interrupted (TCP RESET), does it just end the pipe? It probably actually does, from what I see.

It's an interesting edge case.

Personally I use, which would seem more immune:

    curl -o /tmp/a
    vi /tmp/a
    {get bored easily, fuckit}
    sh /tmp/a


Or a malicious server trying to detect if you're using `curl | sh` https://news.ycombinator.com/item?id=17636032


Could be confirmation bias/rage on my part of course, but I never took...

> Some people took the “don’t curl | sh” advice as “you have to inspect every line of a shell script installer you download”, which is of course absurd.

... to ever mean anything BUT "inspect every line". Which is, as you say, completely absurd.


Sure, but that's turtles all the way down... any time you run untrusted code, you are making a risk based decision, usually based on the provenance of the code.


I don't think I've ever read through the Nvidia binary drivers that way. (They're named *.run but are basically shar files)


Java JAR files are similar, but reversed. You can add anything you want to the beginning of the JAR file (or is it any ZIP file?) so long as it doesn't include the Zip file header "PK". So, I use this to prepend a bash script that ultimately calls

    java -jar $0
It makes it very easy to setup and use Java based command line programs on a server.


this sounds incredibly useful but I couldn't get it to work. I just get

    java.util.zip.ZipException: invalid CEN header (bad signature)
        at java.base/java.util.zip.ZipFile$Source.zerror(ZipFile.java:1623)
if I try to do anything with a JAR file that has leading text. I'm creating it just using

   echo 'java -jar $0' | cat - test.jar > test.run.jar
Is there more to it?


Technically, you should update the offset to the central directory in the Zip footer, along with the offsets to each file header in each central directory entry. If you don’t, the zip file reader has to apply some heuristics to locate the central directory; not all readers implement these heuristics, and those that do won’t always be robust.

The “unzip” utility can be useful as a sanity check; run “unzip -t” to test the integrity of the file.
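
If Info-ZIP's zip is available, its -A (--adjust-sfx) option exists for exactly this prepended-stub case; a sketch with placeholder file names:

    cat stub.sh app.jar > app.run
    zip -A app.run       # rewrite the central directory offsets to account for the stub
    chmod +x app.run
    unzip -t app.run     # sanity check, as above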


Did you add a shebang line?

I have a small bash stub that I use. It’s roughly (I also add JAVA_OPTS, etc…):

    #!/bin/bash
    java -jar $0
Then…

    cat stub.sh myproject.jar > myexec
See: https://github.com/compgen-io/ngsutilsj/blob/master/src/scri...


https://github.com/puniverse/capsule/blob/master/capsule-uti...

This is what the (now defunct) Capsule project prepends to get this same effect


This is my default approach to writing installers for the Unices. The program is compressed and added to the end of the script, and the script does the unpacking and any needed setup/configuration for the specific platform it's getting installed on.

I don't append it in binary form, though. I uuencode it. That way, there is no danger in using text editors.


Why uuencode? Base64 is the defacto standard these days.


Sorry, I did mean base64. I have a bad habit of calling all "binary as text" encodings "uuencode". I usually catch myself before I put it in writing, though.


I've used both, but only briefly. I think I used uuencode when using uucp. And Base64 in one of my Python programs.

What are their pros and cons, in your opinion?


Base 64 is slightly more space efficient. Other than that it's just more popular and better supported.


Got it, thanks.

Yes, uuencode / uudecode are probably older too.

They are from the uucp dialup comms era of networking.


See https://man.freebsd.org/cgi/man.cgi?query=shar&sektion=1&for... for a tool to generate these types of archives.



I can vaguely remember that many programs used to install themselves this way under Linux.


It was used on Unix systems even before that.


definitely used something similar on VAX/VMS called VMS_SHARE (https://www.glaver.org/ftp/multinet-contributed-software/vms...) circa '90-91

In fact I found an old archive of mine floating around on Usenet and wrote a Python script to unpack it. Looking at the original, it was using a scripting-language bootstrap to make a COM script unpack the embedded original code.


Lots of commercial Linux software use this still for installing their stuff. It’s a neat trick


I've seen it recently with the Conda and Mamba package managers.


"$0" otherwise it won't work for paths with spaces


This reminds me of ZX Spectrum Basic where all the graphics, sound, and level layouts were defined using DATA lines at the end of the program.


You could also put the binary data in the first line of the Basic program after the ‘rem’ command, change the line number to 0 using the poke command, so that it’s not possible to edit this line. The second line would run the code using ‘randomize usr’. There were also fun tricks with control sequences, that would hide the ‘rem’ command and the line number, and put something like “Cracked by Bill Gilbert (c) 1982” instead. Gosh, why I still remember all this nonsense after all these years…


Or any machine code routines you wanted to POKE into memory.

A suppressed obscure part of my lizard brain secretly wishes I could just code for 8bit computers from the 80s, just with all the modern niceties like text editors, assemblers and emulators etc.


Me too!…


Makeself archives are a classic self-extracting tarball format that does exactly that...
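
For reference, invoking the stock makeself.sh looks roughly like this (the directory, output name, label, and startup script here are placeholders):

    makeself.sh ./payload-dir myapp-installer.run "My App 1.0" ./install.sh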


"All you have to do is make sure you append an explicit 'exit' to the end of your program before your new 'data section', so that bash won't parse any of the 'data section'."

Or just use exec.

     exec tail -c [number of bytes for the binary] $0


....that's horrid. Why would you do that to your fellow humans?

just use

    cat >outfile <<EOF
    some
    data
    EOF
add base64 if binary

edit: after looking thru the thread I am deeply disappointed so few people know of that feature.


>deeply disappointed

I am deeply disappointed you didn't know this at some point in the past. An alternative oulook: https://xkcd.com/1053/


I expected that in a forum with far more than 10k users this would be common enough to be the first suggestion, or the post's content itself. Not, apparently, some rare piece of unknown knowledge that I only thought was normal because the corp I work for used it often in CM.

And to clarify, I'm not really disappointed in people, just that this is unknown.


One “naughty” thing you can do is write invisible data into the last block of a file…

- truncate the file to extend it to the end of the last block

- write data to that area

- truncate the file back to its original size

An edit of that file will likely lose you data though.
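
A sketch of those three steps, assuming a 4K block size (whether the hidden bytes actually survive the final truncate is filesystem-dependent, per the caveat above):

    f=somefile
    size=$(stat -c %s "$f")
    blk=4096                                               # assumed filesystem block size
    truncate -s $(( (size + blk - 1) / blk * blk )) "$f"   # extend to the end of the last block
    printf 'hidden' | dd of="$f" bs=1 seek="$size" conv=notrunc
    truncate -s "$size" "$f"                               # shrink back to the original size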


I think this is how GOG ships the Linux version of Battletech.


I believe this is how GOG ships all of its Linux titles, all of the installs I've used from them are downloaded as a single *.sh file. I just checked an example game, and it looks to be using this method.


BASIC and Perl had or have something like that too.

IIRC, Perl copied it from BASIC, because BASIC came much before Perl.

And, again, IIRC, I've read about the shar (shell archive) method that someone else commented about in this thread (and which even has a Wikipedia entry), in either the classic Kernighan and Pike book, The Unix Programming Environment (which I've recommended here multiple times before), or in some Unix man pages, long ago.

So it's quite an old method.


I remember writing DATA lines on commodore basic. But you could put them on any line.


Yes, the DATA statement was the way of doing it in BASIC.


I did a similar thing for a lowish-volume embedded product. The update files are just bash scripts with a tar file cat'd onto them. The unit just looks for a particular file on an external flash drive; the bash script runs, copies off the tar, and checks that it has the right hash. Super simple and flexible when customers need me to do something special, like extracting a specific log onto a flash drive.
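
A hypothetical sketch of what such an update script could look like (the marker, paths, and hash placeholder are all made up):

    #!/bin/bash
    payload_line=$(awk '/^__PAYLOAD__$/{print NR + 1; exit}' "$0")
    tail -n +"$payload_line" "$0" > /tmp/update.tar
    echo "<expected-sha256>  /tmp/update.tar" | sha256sum -c - || exit 1
    tar -xf /tmp/update.tar -C /opt/device
    exit 0
    __PAYLOAD__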


This reminds me of a job I had 15+ years ago where we did code reviews by emailing files to one another with our changes. It worked like this with the first part of the file being a script and the end of the file being a base64 encoded zip of the changed files. We had tooling that would pack them, but unpacking was done by execution.

What could possibly go wrong with emailing executable scripts?


> What could possibly go wrong with emailing executable scripts?

Server side malware filters will strip the attachment.


I use this at work for batch scripts which call R code for some of their functionality. It's very handy to give somebody who's not very technology-literate a solution that is a single .bat file which Windows is happy to run by double-clicking, rather than a directory of files which must be stored together in order to work.


It's also good for signed bash scripts.


Von Neumann architecture to the extreme :)


A very large Electronic Medical Records company shipped an extremely large shell script to us for an install.

Upon examination it contained binary data and a command to extract it to a file and then installed the application.

This was the “efficient” way to ship and install the binary.


This works for any sh-type script, not just bash :) It will work with sh, ksh, and even [t]csh.


I use a fun little hack, a la awk:

    #!/usr/local/bin/bash

    echo "HELLO"

    TAIL_REMOTE_MARKER=`awk '/^__THE_REMOTE_PART__/{flag=1;next}/^__END_THE_REMOTE_PART__/{flag=0;exit}flag' ${0}`

    eval "$TAIL_REMOTE_MARKER"

    exit 0

    __THE_REMOTE_PART__

    echo "WORLD"

    __END_THE_REMOTE_PART__


I seem to recall that you can do the opposite as well: stash some extra data at the end of a binary file. The 'tclkit' system used this to package up an executable with the scripts you wanted to ship.


The Löve game development framework packages games like this. The games are nominally Lua scripts, but if you cat a zip file to the end of the Lua script the framework lets you access assets in that zip as if they were in the game's cwd.

It's a cool abstraction, during development you can have assets just living on the file system and then "deploy" a flat file that accesses those assets the same way.


That's what uuencode / uudecode were once used for.


portswigger does that for the burpsuite installers.

https://portswigger-cdn.net/burp/releases/download?product=c...


>portswigger does that for the burpsuite installers.

Wow, that triggered my wordplay radar, which I'm working on as a fun side line these days, thanks :)

port, suite (sweet)

swig, burp

Heh.


I used to do something similar for Windows executable files. Append a large file to the end as necessary.


This is a malware technique.

I am not saying don't do it. But that is mostly where I see this type of trick.


Malware is about intent and consent, not executable format.


Security software does not flag based on intent or consent.


Security software also isn't the one-stop-for-everything solution to security, but merely a helper.


You missed my point: maybe don't write scripts that superficially look like malware. Unless of course you like unpleasant surprises.


I vaguely remember this is what Ocaml does for one format of its executable.


Sadly, it won't work with my favourite curl | sh.


I don't understand this website, it is too hard and I don't understand anything. Can anyone help me with this?




