$ unrar e shakespeare.part001.rar
unknown archive type, only plain RAR 2.0 supported(normal and solid archives), SFX and Volumes are NOT supported!
All OK
$ sudo apt install rar
$ rar x shakespeare.part001.rar
... shakespeare.html OK
All OK
I discovered that there are two versions of unrar:
$ dpkg -l "*unrar*"
||/ Name           Version             Architecture  Description
un  unrar          <none>              <none>        (no description available)
ii  unrar-free     1:0.0.1+cvs2014070  amd64         Unarchiver for .rar files
un  unrar-nonfree  <none>              <none>        (no description available)
You probably have the nonfree package. I have no idea what the difference between them is.
Looks like the original "unrar" has some licensing restrictions (for compression purposes, not decompression). The "unrar-free" version was created as pure FOSS in response, but its performance isn't quite up to par. If there are any compatibility issues, it's likely the latter version that's having them. No idea what "unrar-nonfree" is; I'm guessing it's just another name for the first "unrar".
7-zip reads the zip file by scanning from the beginning of the file for the first entry signature (the signature need not be at the start of the file, which is why this trick works at all).
WinRAR reads the central directory located at the end of the file, which is technically how you're supposed to do it, but entry offsets given in the central directory are messed up by the existence of the JPG data at the beginning of the file. There also appears to be ~61KB of garbage data at the end of the file, which will trip up some zip readers that use the central directory.
tl;dr: The zip file is technically malformed; whether or not a given archive reader will choke on it depends on which technique it uses and how forgiving its implementation is.
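For the curious, a rough way to see which camp a given tool falls into (the filename here is just a placeholder; zip and unzip are the usual Info-ZIP tools):
$ unzip -l photo_with_zip.jpg   # finds the central directory, warns about the prepended bytes, and compensates
$ zip -A photo_with_zip.jpg     # "adjust offsets": rewrites the central directory so strict readers work too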
This file, pocorgtfo16.pdf, is a polyglot that is valid as a PDF document, a ZIP archive, and a Bash script that runs a Python webserver hosting Kaitai Struct's WebIDE, allowing you to view the file's own annotated bytes.
> 0x17 is a PDF that's also a valid ZIP and valid firmware for Apollo Guidance Computer.
Wait, what?
Edit: 0x15 has the additional feature that viewing it in Chrome makes my work machine throw a "Threat detected: A threat has been blocked and quarantined" message, so I think I'm going to avoid this site now. Not to say that I don't trust this random PDF hacker (...that sounded less sarcastic in my head), but I don't need IT asking me pointed questions about what sites I've been visiting.
PoC||GTFO is a well known project, and has a tradition of turning their PDFs into absurdly polyglot files. Your Chrome is giving you a false positive here.
binwalk DqteCf6WsAAhqwV.jpg
DECIMAL HEXADECIMAL DESCRIPTION
--------------------------------------------------------------------------------
0 0x0 JPEG image data, JFIF standard 1.01
182 0xB6 Zip archive data, at least v1.0 to extract, ..., name: shakespeare.part001.rar
...
1971177 0x1E13E9 End of Zip archive
Guessing Twitter doesn't scale/thumbnail/compress already tiny images... which is probably an oversight in this case because the file size is disproportionately large.
They don't address the metadata in compression, which is probably what the bug report [1] was about. This thumbnail is over 2MB. With the metadata removed, it's only about 2KB.
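If you want to confirm that it's the metadata (rather than the pixels) carrying most of that, exiftool can strip everything in one go; a quick sketch, using the filename from the binwalk output above:
$ exiftool -all= DqteCf6WsAAhqwV.jpg   # removes all metadata blocks, ICC profile included; the original is kept as *_original
$ ls -l DqteCf6WsAAhqwV.jpg*           # compare the stripped copy's size against the original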
Yes, if you open the file as text you will find the ICC profile magic number (acsp) followed by a bunch of random bytes, and then the zip magic (PK\x03\x04). It is quite easy to append arbitrary data to JPEG files [0]; however, there is no guarantee that it will survive image processing, so it was necessary to hide it within an ICC profile. As for zip, unzip will happily decode any data stream as long as it makes sense and ignore the parts that don't.
I can't quite figure out what additional tricks the author has used to disperse the embedded data inside the image body to evade entropy analysis, as shown by the binwalk analysis posted in this thread; however, image processing software seems to have no trouble parsing the file and extracting only the image data without the embedded zip. [1]
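If you want to spot those magic bytes yourself, a quick-and-dirty search works (assuming GNU grep; the filename is the one from the binwalk output above):
$ grep -aboP 'acsp|PK\x03\x04' DqteCf6WsAAhqwV.jpg   # -a treat binary as text, -b print byte offsets, -o print only the matches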
This is hardly new. I remember collecting ebooks on 4chan circa 2005 via this method. The fact that it was a jpg allowed you to post it on the imageboard with the zip/RAR file therein containing a txt/doc/rtf of the ebook.
And yet, nobody has done the same for Twitter until now.
The difference is that twitter applies a series of operations to all uploaded images (stripping EXIF data, recompressing, etc.), which would normally be difficult to work around.
Did people do this back in the day? 4chan used to be totally fine with just uploading a jpeg concatenated with a zip, but I haven't seen this ICC profile trick before today.
I tried doing something similar with an mp3 first, and that worked. You can't upload mp3 files to twitter, but I have one here: https://instaud.io/2RLM Play it on the page and it's an mp3; download it via the buttons and it's also a valid zip file with a jpg in it.
So then I tried the same thing with MP4 video on twitter. Twitter didn't like that at all :) https://imgur.com/a/O0QFZC9
Oh I did this too a few years ago. Except I used gifs instead of jpegs, and then wrote a Ruby wrapper for backing up files to Flickr. Blog post about it here: https://namwen.svbtle.com/hoardr
Anybody have any insight into what kind of trickery goes into this? I know a decent amount about both of the compression algorithms, but I've forgotten the structure of the file formats so I can't tell if this required any really funny business.
Basically the "top level" section of a zip file (the central directory) lives at the end of the file and contains pointers back into the file for the actual data. This is so that you can keep adding to a zip file without having to rewrite anything you have done previously - just append a few more files and then write an updated directory at the end.
JPEG segments start with two special opening bytes (a marker), usually followed by a length and then the actual data. So it would be trivial to have the last segment in a jpeg file end with the zip directory bytes. That directory could point back to data stored in other segments of the jpeg.
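The simplest version of this (not the application-segment variant discussed further down the thread) is just concatenation plus an offset fix, something like the following (filenames made up; zip -A is the offset fix mentioned in replies below):
$ cat cover.jpg payload.zip > polyglot.jpg   # viewers stop at the JPEG end-of-image marker, so the image still displays
$ zip -A polyglot.jpg                        # adjust the central-directory offsets for strict zip readers
$ unzip -l polyglot.jpg                      # the archive is still listable and extractable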
I can't exactly remember what JPG and ZIP look like, but yeah pretty much, I've done it before.
What's surprising is that it works on Twitter. I tried the same thing with image hosts and the data got scrubbed. I guess they don't re-encode small images but I dunno.
I did not. This is just a response to the parent comment not to the post itself.
(Although if they do process this file and you're still able to pull off this trick, that would seem even more... interesting on their side than leaving images below a certain size untouched.)
The zip offset fields would be wrong, so your unzip must have some error correction. Use "zip -A zipfile" on Linux to fix them. As mentioned though, image upload sites would likely scrub out the extra data.
The only trickery involved is making sure it survives twitter's compression and image cleaning algorithms. Other than that, programs that read images will ignore garbage data after what should be there for a jpg or whatever, and programs that open archives will ignore anything that comes before the archive header, so it's literally as straightforward as appending a .rar file at the end of a .jpg file.
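For example, with a rar (filenames made up; this should work as long as the leading image isn't too large, since unrar scans forward for the RAR signature, the same mechanism SFX archives rely on, though unrar-free is pickier, as the top of the thread shows):
$ cat cover.jpg payload.rar > combined.jpg
$ unrar l combined.jpg   # skips the leading JPEG data and lists the embedded archive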
It's similar in the sense that there's more than one possible operation on any file (decompress it, or compress it), and different in that the file type doesn't change.
Very thought provoking. The file size is 1.93 MB. I wonder what dimensions a disk just big enough to store that much would have. Not long ago I read of a suggestion to archive information as DNA (for its durability, iirc). But in terms of physical space required to store stuff, is DNA more or less efficient than today's disks?
So Borges' Library of Babel -- every possible 3200 character book from a 25-character alphabet; essentially the 'space' equivalent of the infinite monkey theorem's 'time' domain -- is sufficiently large that if we took this universe and shrunk it down to the Planck length (the smallest length there is) and filled another universe with these femtouniverses, and then scaled that one down likewise, and repeated this process EIGHTY times, we'd be able to fit all the books in.
And Hamlet would be split amongst dozens of these books.
> Can the technique be used to bypass content censorship (e.g. in China)?
Steganography can be used to hide data in clever ways, but it isn't a substitute for encryption -- anybody who knows whatever trick you used can extract the payload. You could always encrypt the payload, but all you're doing is giving your adversary another (trivial) hoop to jump through when you could've just encrypted the message to begin with.
> Can it be made resistant to detection?
Yes and no. There are some clever steganographic techniques that take advantage of PNG and JPEG implementation details to foil basic entropy checks, but anybody who knows the algorithm can trivially extract the payload. In other words, it's security by obscurity, not by any sort of strong cryptographic property.
> You could always encrypt the payload, but all you're doing is giving your adversary another (trivial) hoop to jump through when you could've just encrypted the message to begin with.
Wouldn't "an encrypted message that no one is sure you sent in the first place" sometimes be more useful than "an encrypted message that any eavesdropper knows you sent" in oppressive-surveillance-state scenarios? (It seems that, if you find some subset of bits in a JPEG/PNG that normally have random distribution and don't affect the image that much, putting an encrypted message into those bits might be indistinguishable from a "completely normal" image even to a well-informed attacker.)
That's true, the least significant bits of a png encoding of a photograph could be effectively random (if the photo had high noise) and could be replaced with random looking data.
Given that there are a lot of different methods and many different file formats, I do not think this can be automated to the extent of catching every single file with steganographic content without a massive amount of false positives.
Entropy and metadata analysis gets you 95% there, and the rest is cat-and-mouse with whatever the latest paper is. Importantly, your adversary doesn't need to extract the payload, only detect that there is one, to perform filtering.
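A rough first pass with off-the-shelf tools might look like this (the filename is a placeholder):
$ binwalk upload.jpg      # signature scan: flags embedded zip/rar headers outright
$ binwalk -E upload.jpg   # entropy scan over the whole file
$ exiftool upload.jpg     # metadata dump: oversized or odd metadata blocks stand out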
> without a massive amount of false positives
China, at the very least, has no problems with this!
Not really; the data is still there. There are just two different file contents saved into one, playing with the zip and jpg formats so that both of them are still readable.
For those who don't know the trick: ZIP files have their index at the end of the file. So you can add a zip file to anything else and have it unzippable.
This was done because it allows adding files to a zip archive without rewriting the whole file: just start writing where the old index starts and add an updated index at the end. Please note that over the years there have been multiple methods of doing this, including partial indexes which don't even rewrite the index.
If the file you're adding things to is also tolerant of wrong file sizes and extra data at the end (like JPG and many others), you can just cat the two together and the result works as both.
That's probably working only because of an unzip that deals with problems. You're probably seeing an error like "warning [foo.zip]: 12345 extra bytes at beginning or within zipfile". The offsets in the zip header would be wrong. "zip -A file" would fix them. Most image upload sites would probably strip the data anyway though.
It's astonishing that it's as simple as CATting a zip file to the end of a jpg. I feel there are consequences here for any website that accepts image uploads.
Yep, you're right. It's doing something with 64k JFIF application segments ... wtf.
Well, it's still using the trick I pointed out, placing the ZIP file index at the end.
So from one viewpoint it's a JFIF ("jpg") file with large application segments containing the zipfile data for the shakespeare.part0xx.rar files.
From another viewpoint, reading from the back of the file, it's an incrementally updated zip file (not compressed in one go), with the garbage (the "overwritten" bytes) from the zipfile updates forming a valid JFIF file.
There was a similar technique used in the tv show Mr. Robot, season 3 episode 9, to hide some secret information (trying not to spoil anything). Ryan Kazanciyan, the Technical Consultant, wrote a blog piece on this technique.
I thought it was assumed that the files, once downloaded, would be distributed locally on floppies. The sub-archives were usually 1.44MB to fit on a 3.5".
When did resumable BBS download protocols become a thing? ZMODEM at least was widely used by the early '90s.
I dealt with this problem on a photo site 15 years ago where people were embedding RAR pieces in their JPGs to use us as a distribution repository. The bandwidth increase was a giveaway so I just wrote an incoming processor that re-compressed everything that had more than X bytes per pixel. I can't believe Twitter doesn't have similar logic given its scale.
Then it's not recompression. Recompression is pulling (let's say) a JPG into memory so it becomes a bitmap and writing it back out to a JPG. If you want to keep metadata you only keep a specific set and you cap the value lengths.
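A minimal bash sketch of that kind of incoming processor, assuming ImageMagick and GNU coreutils are available; the 0.5 bytes-per-pixel threshold, the quality setting and the filenames are all made up:
w=$(identify -format '%w' upload.jpg)                 # image width in pixels
h=$(identify -format '%h' upload.jpg)                 # image height in pixels
if [ "$(stat -c%s upload.jpg)" -gt $(( w * h / 2 )) ]; then   # more than ~0.5 bytes per pixel?
    convert upload.jpg -strip -quality 85 clean.jpg   # decode to a bitmap, drop metadata, write a fresh JPEG
fi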
Probably easier. The MP3 standard didn't include any standard for tags, so the tag standards evolved separately. Basically, you can append any random crap you want at the end of an mp3. Since a zip file has its headers at the bottom, it wouldn't be hard to do. I don't know if popular mp3 sites scrub/transform mp3 files in the way image sites do to images though.
Edit: Tried it, and got it to work. The zip command on Linux has a -A option that will fix the offset headers after you cat an mp3 and a zip file together. Just "cat file.mp3 file.zip > newfile.mp3; zip -A newfile.mp3" and you're done.
I did notice that some mp3 file upload sites reprocess the mp3 file and strip out the extra bits. Some don't though.
While technically it would probably fit the definition, usually what's meant by steganography regarding images is hiding the data within the image itself (using differences in pixel values that are not easy to spot). As far as I understand it, this just makes use of the file structure. It's beyond me, though, how it survives the twitter thumbnailer. I would assume that one of its purposes is to strip any unnecessary metadata from the image, making it as small as possible.
The easiest implementation would be to use the 2 least significant bits in the blue channel, and the least significant bit of the red channel, to encode your data payload. Just sticking the data in the metadata portions of an image file seems like hiding behind the only bush in an otherwise featureless field. ("Reginald Maudlin, will you stand up, please?")
My auntie takes more pleasure from watching the news to see what colour tie the newsreader is wearing than from watching the news to find out what is happening in the world. I personally believe that she is missing out on something. I also think that she could watch something on the other side if it is fashion that really interests her, particularly when her comments on ties prevent others in the room from actually hearing the news.
I had a school friend that used 7" records as frisbees and made 12" records into plant pot holders. This was fine until he started doing this with my prized records.
It is the same here: there is a disconnect going on, a lack of appreciation for Shakespeare, which is more than mere 'classic literature'; it is an education.
I don't think you have any basis for claiming intellectual superiority when you started this by judging the education and interests of a person from a single thing they've done.
Shakespeare wrote for a wide audience. In his plays there was plenty to entertain people from all walks of life regardless of education level; in fact, literacy was not a prerequisite at all. I make no claims of intellectual superiority. I do have thespian friends with whom I read Shakespeare so that when we go to the Globe we have a little bit of an idea of what we are seeing. From this I appreciate how my thespian friends value this cornerstone of English culture. Consequently I believe something is lost when the great works are reduced to a tweet. Had the post been made by someone who knew and identified with the literature - a fan - then I might have seen this as admirable fan love.
There is nothing new in hiding text in images, in fact I do this for myself to see if images can do better in Google SEO, and if indexed images can be searched for by unique phrases hidden in the binary as well as the EXIF data. I do this with the permission of the owner of the images and I don't blindly take images or text for this that others might consider sacred.
I think that Shakespeare 'himself' would have been okay with having his complete works stuffed into a tweet as 'he' did steal most of his stories from elsewhere.
I know that the real reason is a requirement for length 3 combined with a slight preference for vowel elision, but I just realized that if one were so inclined it could be interpreted as a snub.
Joint Photographic Experts Group -> Joint Photographic Group
I never really understood why people still adhere to the 3 character file extension limit when that is a throwback to FAT32 on pre-Windows 95 (and some very old mainframes which shouldn't be internet connected anyway). Yet some people still favor the 3 character extension despite it looking objectively uglier (as a dyslexic I do find them harder to read) and dropping a solitary vowel in a file extension saves you nothing in terms of development time / file size / etc.
So I honestly don't get why people still use them.
Extensions that particularly annoy me are:
yml instead of yaml
Why do people do this when even the earliest YAML spec is much newer than the last of the systems that had a 3 character extension limit?
jpg instead of jpeg
Nearly all of the .jpgs in the last 10 years would be too large to load on any system that couldn't handle a 4 character extension anyway. So backwards compatibility isn't even an argument here.
htm instead of html
Similar argument as JPG except this time it's more to do with the HTML specification and browsers capable of rendering them.
Not sure why you'd say "objectively" when it's clearly not, the hint being the word "ugly."
I can certainly appreciate the limitation of "extensions will always be three characters". In fact, I wasn't exposed to extensions habitually longer than that until I did iOS dev. "Main.storyboard" "Model.xcdatamodeld" So if we're going to talk about ugly, I'd start there. ;)
Sorry, as another poster commented, I did mean "subjectively" (doh!)
xcdatamodeld is definitely ugly (though main.storyboard is OK in my personal opinion), and I'm sure there will be other examples where file extensions have been taken too far. I think if one wants to use long extensions then the extension needs to at least be readable at a glance (eg xcdatamodeld is ugly to read; even if you switched it the other way around, xcdatamodeld.model would still be unpleasant). So I'd argue that xcdatamodeld is more an example of a poor naming convention than a problem with longer extensions (I say this with no experience of iOS development; however, I see the same problem just as acutely in the other languages and platforms I've developed on).
It should also be noted that my argument was really more focused on people who shorten already short 4 character acronyms - just for the purpose of making it 3 letters (eg the yaml and html examples I gave).
> 3 character file extension limit when that is a throwback to FAT32 on pre-Windows 95
The 3 character file extension is much older than FAT32. It was part of the original FAT that was created nearly 20 years earlier, and later part of FAT12 and FAT16. But it predated all that in various other microcomputer DOS, as well as minicomputer systems going back into the 60s.
Yes, definitely! What I meant by my statement was that no system has been dependent on a separate extension meta-property since FAT32 (pre-vfat). But yes, you're absolutely right that the 3 character extension standard is much older than FAT32.
I didn't think it was quite as old as the 60s though? CP/M and FAT are mid-to-late 70s. VMS was around the same time too, or maybe slightly earlier. What else stored the extension as a separate meta-property to the file name?