A <noscript> tag would be even more suitable, but I agree with the principle. I added a link to view the demo without downloading the file, see https://gildas-lormeau.github.io/Polyglot-HTML-ZIP-PNG/demo.... (it was not working previously because GitHub serves pages in UTF-8).
Indeed, for example the HTML files used for the presentation slides [1] use <noframe> tags to keep the HTML well-formed. This point is addressed in the conclusion of the presentation.
Note that on iOS the HTML page may not work at all: when it's opened from the filesystem, it's displayed by a viewer that doesn't support JS rather than by Safari.
You're right, SingleFile (which is capable of saving pages in this format) does a little better than the demo, but it can also be optimized. In fact, I chose the JSON format to keep things as simple and didactic as possible for the presentation. I think I need to use your suggestions to optimize this structure in SingleFile ;)
Note that you can also take advantage of the fact that a ZIP can be password-protected and make your web page secret! For example https://gildas-lormeau.github.io/private/ (password: "thisisapage").
If you are loading external libraries, as in this example, your encrypted data is at risk. It would be better to include the decryption code directly in the JS, or to embed a JS zlib implementation.
It's possible to define the Content Security Policy with a <META> tag in the "bootstrap page" and prevent this kind of security issue, e.g. <META http-equiv="content-security-policy" content="connect-src 'self' data: blob:;">
I don't think that will prevent data exfiltration. Malicious javascript could create e.g. an img element with the data to exfiltrate stored in a query parameter of the image URL.
If we make it strict enough to block exfiltration, it'll block the external libraries from loading. So that means we have to load our scripts from the same origin instead of external origins (as jclarkcom suggested).
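For what it's worth, a policy that is "strict enough" in that sense might look roughly like the sketch below. The helper names (`buildStrictCsp`, `cspMetaTag`) and the exact directive list are my own assumptions, not what SingleFile or the demo emits. Note that it blocks every external script, which is exactly the trade-off described above, and that CSP does not cover every channel (top-level navigations, for instance), so treat it as an illustration rather than a complete defense.

```javascript
// Hypothetical helper: builds a restrictive policy string for the bootstrap page.
// The directive list is an assumption; a real page may need more or fewer entries.
function buildStrictCsp() {
  const directives = {
    "default-src": ["'none'"],          // anything not listed below is blocked
    "script-src": ["'unsafe-inline'"],  // only inline scripts, no external origins
    "style-src": ["'unsafe-inline'"],
    "img-src": ["data:", "blob:"],      // blocks <img src="https://evil.example/?d=...">
    "font-src": ["data:", "blob:"],
    "media-src": ["data:", "blob:"],
    "connect-src": ["data:", "blob:"],  // fetch/XHR cannot reach the network
  };
  return Object.entries(directives)
    .map(([name, sources]) => `${name} ${sources.join(" ")}`)
    .join("; ");
}

// Hypothetical helper: wraps the policy in the <meta> tag form mentioned earlier.
function cspMetaTag(policy) {
  return `<meta http-equiv="content-security-policy" content="${policy}">`;
}

console.log(cspMetaTag(buildStrictCsp()));
```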
But the whole reason for CSP was to allow us to use external libraries without exfiltration risk. If we stop using external libraries, then our motivation for using CSP is gone. So CSP is useless for the purpose of this conversation.
I think there's been a misunderstanding: an error in the article suggested that zip.min.js is not inlined in the page. This error has since been corrected. I'm sorry for that. The goal is obviously to create pages that work offline, as shown in the demo.
The saved page is encoded in windows-1252. It includes "consolidation data" to read the ZIP data as text from the DOM and undo the replacement of \r and \r\n occurrences (this is the only data loss, and it affects approx. 1% of the ZIP data); see the links below for more info.
If "CR" is the only bad byte, that means that 255/256 of the symbols are okay to use. That beats UTF-16 embedded in a string, where only 63481/65536 of the symbols are okay to use.
My approach was to use very large integers. You can split the input file into blocks of X bits, then represent that block as X+1 bits. The output is bigger because it can't have any forbidden bytes in there.
For the case of 255 of 256 symbols, packing 1415 bits of data into 1416 bits of space is the most efficient block size (before reaching a ridiculously large size) at 0.0706215% expansion. (For an infinite block size, you'd have an expansion of 1 - (log base 256 of 255), or 0.070582%)
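Those numbers can be checked with a few lines of JavaScript. This is only a sanity check of the arithmetic; the constant names are mine. The expansion here is measured as a fraction of the output size, which is what the figures above correspond to.

```javascript
// 177 output bytes, each taking one of 255 allowed values.
const BLOCK_BYTES = 177n;
const DATA_BITS = 1415n;

// Number of distinct encoded blocks: 255^177.
const capacity = 255n ** BLOCK_BYTES;

// 1415 bits of data fit, 1416 bits would not.
const fits = capacity >= 2n ** DATA_BITS;
const oneMoreBitFits = capacity >= 2n ** (DATA_BITS + 1n);

// Expansion as a fraction of the output, for this block size and in the limit.
const blockExpansion = 1 - 1415 / 1416;
const limitExpansion = 1 - Math.log2(255) / 8;

console.log(fits, oneMoreBitFits);
console.log((blockExpansion * 100).toFixed(7) + "%");
console.log((limitExpansion * 100).toFixed(6) + "%");
```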
Encoding: Turn 1415 bits of data into a very large number. Repeatedly divide and modulo by 255, giving 177 digits in the range 0-254, least significant digit first. Then add 1 to every digit that is "CR" (13) or larger. Now you have 1416 bits (177 bytes) of encoded data, none of which can be "CR".
Decoding: Read the bytes starting from the most significant digit (the last byte, if the encoder emitted the least significant digit first). Decode each byte back to 0-254 by subtracting 1 if it's greater than "CR". Multiply your big number by 255 and add the digit. At the end, you'll have a really big number that holds 1415 bits of data. This would be 177 big multiplies, and 177 big adds.
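A minimal sketch of that encoding step using the built-in BigInt, assuming the block has already been turned into one number. The function name and the least-significant-digit-first byte order are my assumptions.

```javascript
const CR = 13;            // the forbidden byte value (carriage return)
const DATA_BITS = 1415n;  // bits of input per block
const BLOCK_BYTES = 177;  // bytes of output per block (1416 bits)

// Encodes one block, given as a BigInt below 2^1415, into 177 bytes without CR.
// Digits are emitted least significant first (an assumption of this sketch).
function encodeBlock(value) {
  if (value < 0n || value >= 1n << DATA_BITS) {
    throw new RangeError("value does not fit in 1415 bits");
  }
  const out = new Uint8Array(BLOCK_BYTES);
  for (let i = 0; i < BLOCK_BYTES; i++) {
    let digit = Number(value % 255n); // base-255 digit, 0..254
    value /= 255n;
    if (digit >= CR) digit += 1;      // skip over 13, giving 0..12 and 14..255
    out[i] = digit;
  }
  return out;
}
```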
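The straightforward decoder is equally short. Again a sketch with BigInt: it assumes the encoder wrote the least significant base-255 digit first, so it walks the block from the last byte to the first.

```javascript
const CR = 13; // the forbidden byte value (carriage return)

// Decodes one encoded block back into a BigInt.
// Assumes byte 0 holds the least significant base-255 digit.
function decodeBlock(bytes) {
  let value = 0n;
  for (let i = bytes.length - 1; i >= 0; i--) {
    const b = bytes[i];
    const digit = b > CR ? b - 1 : b;      // undo the shift past CR: back to 0..254
    value = value * 255n + BigInt(digit);  // one big multiply and one big add
  }
  return value;
}
```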
Decoding (the faster way):
JavaScript uses floats, but you can treat them as 48-bit integers. Just watch out for the bitwise operators: they will truncate results down to 32 bits. That means you should use actual multiplication and division instead of bit shifting.
6 bytes at a time: 48 bits can hold 6 bytes. With normal floating point math, you can multiply each byte by 255^0, 255^1, 255^2, 255^3, 255^4, 255^5, and sum them together. Then you multiply-and-add these 6-byte chunks to a big int. Then the operations afterwards use big ints. First 6 bytes get multiplied by 255^0, next 6 bytes get multiplied by 255^6, then 255^12, 255^18, etc. The whole thing is summed together. This cuts it down to 30 bigint multiply-and-adds (30 multiplies and 30 adds).
Homemade bigint: It's an array of doubles, but used as 48-bit integers. Compared to the actual BigInt, it removes all allocations, and you can access the bits inside directly, speeding up the part where you extract bits from the number. The only mathematical operation required for decoding is the "multiply and accumulate" operation. Using the homemade bigint sped things up dramatically.
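To make the truncation concrete, here is a tiny example with a 41-bit value (the variable names are just for illustration):

```javascript
// A value that needs more than 32 bits but is still an exact double.
const x = 2 ** 40 + 256;

// Bitwise shift: x is first truncated to a 32-bit integer (256), then shifted.
const viaShift = x >> 8;                     // 1

// Plain arithmetic keeps all the bits.
const viaDivision = Math.floor(x / 2 ** 8);  // 4294967297 (2^32 + 1)

console.log(viaShift, viaDivision);
```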
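Roughly, the 6-bytes-at-a-time version could look like the sketch below. It still uses the built-in BigInt for the big part, and it folds the chunks in with Horner's rule starting from the most significant chunk, which gives the same sum as multiplying chunk k by 255^(6k). Function names and the least-significant-first byte order are my assumptions.

```javascript
const CR = 13;
// Powers of 255 that fit exactly in a double (255^6 is below 2^48).
const POW255 = [1, 255, 255 ** 2, 255 ** 3, 255 ** 4, 255 ** 5];
const CHUNK_BASE = BigInt(255 ** 6);

// Combines up to 6 bytes starting at `start` into one Number, using float math only.
function chunkValue(bytes, start) {
  let sum = 0;
  const end = Math.min(start + 6, bytes.length);
  for (let i = start; i < end; i++) {
    const digit = bytes[i] > CR ? bytes[i] - 1 : bytes[i]; // back to 0..254
    sum += digit * POW255[i - start];
  }
  return sum;
}

// Decodes a block with one BigInt multiply-and-add per 6-byte chunk
// (30 chunks for a 177-byte block).
function decodeBlockFast(bytes) {
  let value = 0n;
  const chunkCount = Math.ceil(bytes.length / 6);
  for (let c = chunkCount - 1; c >= 0; c--) {
    value = value * CHUNK_BASE + BigInt(chunkValue(bytes, c * 6));
  }
  return value;
}
```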
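Here is one way such a homemade bigint could be laid out. To be clear, this is my own guess at a workable layout, not the code described above: each double holds a 24-bit limb and the multiplier is 255^3, so every intermediate product stays below 2^49 and is exact in a double. With this layout the decoder consumes 3 bytes per multiply-and-accumulate instead of 6.

```javascript
const CR = 13;
const LIMB_BASE = 2 ** 24;   // each limb holds 24 bits (assumed layout)
const MULTIPLIER = 255 ** 3; // below 2^24, so limb * MULTIPLIER + carry < 2^49

// In place: limbs = limbs * multiplier + addend. Limbs are little-endian doubles.
function mulAdd(limbs, multiplier, addend) {
  let carry = addend;
  for (let i = 0; i < limbs.length; i++) {
    const t = limbs[i] * multiplier + carry; // exact: well below 2^53
    const q = Math.floor(t / LIMB_BASE);     // division, not >>, to keep all bits
    limbs[i] = t - q * LIMB_BASE;
    carry = q;
  }
  if (carry !== 0) throw new RangeError("limb array too small");
}

// Decodes a block (least significant digit first) into a preallocated limb array.
function decodeBlockLimbs(bytes) {
  const limbs = new Float64Array(Math.ceil((bytes.length * 8) / 24));
  const groups = Math.ceil(bytes.length / 3);
  for (let g = groups - 1; g >= 0; g--) {
    let chunk = 0;
    for (let k = 2; k >= 0; k--) {
      const i = g * 3 + k;
      if (i < bytes.length) {
        const digit = bytes[i] > CR ? bytes[i] - 1 : bytes[i];
        chunk = chunk * 255 + digit;
      }
    }
    mulAdd(limbs, MULTIPLIER, chunk);
  }
  return limbs;
}
```

Extracting the decoded bits afterwards is then plain array indexing plus `Math.floor` and `%` on 24-bit values, with no allocation per operation.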
---
So then, that's a lot of math just to avoid escaping (or fixing up) your bytes, but I think that would get close to the minimum possible expansion.
[1] https://harddrop.com/wiki/T-Spin_Guide