
My adventures in random data compression - Uptrenda
http://roberts.pm/random_data_compression
======
jsnell
This is pure nonsense. Random data is provably not compressible. So all the
complex schemes the author came up with are useless, and it's not just a lack
of CPU power that kept them from showing results.

~~~
magicalhippo
Reminds me of when I was about 13 and I had this brilliant idea for an
insanely good compression scheme that could handle random data.

I was teaching myself programming using QBasic and I had just discovered the
random number function. My idea was essentially to find a random seed which
would make the random number generator generate the same byte sequence as the
input block. Then you just had to store the random seed, huge win!
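In rough Python terms (a toy sketch, not my actual QBasic code - the 16-bit
seed range, the byte-matching loop, and the standard-library PRNG are all just
illustrative), the idea was something like:

    import random

    def compress_block(block, seed_bits=16):
        """Search for a PRNG seed whose output reproduces the block."""
        for seed in range(2 ** seed_bits):
            rng = random.Random(seed)
            if bytes(rng.randrange(256) for _ in range(len(block))) == block:
                return seed  # the whole block "compresses" to this one seed
        return None  # only 2**16 sequences are reachable, so almost every block fails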

Of course, it didn't take me too long until I realized just how long it would
take to compress, and I hadn't even thought about the limited number of byte
sequences a 16-bit random seed could generate or the fact that the random
numbers it would generate aren't all that random...

But for a few hours there I felt really, really clever :)

------
jonstewart
As I started reading this, I thought, well, he means “high entropy” when he
says random, and the scheme will be about finding bits of structure here and
there, and he brought up Golomb Coding, and that seems like a reasonable
place to start... and then I realized he really did mean random and he really
did think the blockchain was the way to go and he really did use a cobbled-
together Spark installation rather than learn SIMD/GPU programming... and,
yeah, it’s all nonsense, and the proof unsurprisingly does not fit in the
margin.

------
rambojazz
Well, he's not the first, nor will he be the last, to try this nonsense.
There was even a patent for compressing random data! Maybe he'll learn
something useful in the process, like why he's wrong. Why am I not surprised
to see blockchains linked with this :)

------
jere
I'm fairly certain compressing random data is nonsensical. In fact there's a
cash prize challenge for doing it that has been unmet for 20 years.
[https://marknelson.us/posts/2012/10/09/the-random-
compressio...](https://marknelson.us/posts/2012/10/09/the-random-compression-
challenge-turns-ten.html)

Not sure I can explain it properly myself, but I think it's impossible by
definition. Intuitively, if you could compress any random data then you should
be able to compress 2 bits into 1, which is clearly impossible.
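
To make the 2-bits-into-1 point concrete, here's a toy sketch (Python, purely
illustrative): enumerate every possible mapping from 2-bit messages to 1-bit
codes and check that none of them avoids collisions.

    from itertools import product

    messages = ['00', '01', '10', '11']   # every 2-bit input
    codes = ['0', '1']                     # every 1-bit output

    # A "compressor" assigns one of the 2 codes to each of the 4 messages.
    collision_free = [m for m in product(codes, repeat=len(messages))
                      if len(set(m)) == len(messages)]
    print(collision_free)  # [] -- no assignment is injective, so some inputs must collide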

~~~
alkonaut
It's possible to compress (some) random data, based on just luck. It's not
generally possible to compress _any_ random data.

If I generate a thousand random 1024 byte sequences, I'm sure I could pick a
few of them that I can compress, so long as I don't have to compress all of
them.

It's also possible that the total size of my compressed subset plus the
uncompressed rest is smaller than the total size of the original sequences!
This still doesn't mean I
could compress _any_ random sequence, just specific ones. Ones that had lower
entropy, by pure luck.

To translate to your analogy: I can make a compressor that compresses 2 bits
into 1, BUT it only works for 50% of all two-bit sequences...
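
Something like this toy sketch (Python, purely illustrative - the "lucky" half
is just the sequences that happen to start with 0):

    def compress(bits):
        # Only sequences starting with '0' are lucky enough to shrink.
        if bits in ('00', '01'):
            return bits[1]
        raise ValueError("this scheme can't shrink this input")

    def decompress(code):
        return '0' + code

    for bits in ('00', '01', '10', '11'):
        try:
            print(bits, '->', compress(bits), '->', decompress(compress(bits)))
        except ValueError:
            print(bits, '-> stored uncompressed')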

------
Uptrenda
Author here.

I think most of the criticism of random data compression is far too sweeping
and absurd. For example, saying that it's impossible to compress random data
because, if it were possible, you could compress data forever and make any
file a single bit... is basically reductio ad absurdum.

I don't see anyone making that claim (and I certainly didn't in my article
where I specified a limit of 1k.) People like to bring up the pigeonhole
principle as if it's a mathematical proof that there can never be a practical
algorithm for compressing random data. But the truth is that if there are so
many different ways to represent data, we might never run into a situation
where it's actually relevant.

The other fault I see in these comments is the knee-jerk reaction over using a
PoW-inspired algorithm for set reduction. Why is that? Could it be because
blockchain technology is no longer in your favourite hype cycle and you need
to signal to all your tech buddies that you're in the know by bandwagoning
like everyone else? PoW is far superior to other ways to reduce the sets and
has the added bonus that nonces are more compressible.

The point I got to with this research is having 14 bits left over per set to
store offsets after filtering results. I was running some code yesterday and
noticed that using 3 x 3-bit fields to store the zero-bit counts of the hashes
produced low offsets. I think this could be compressed with the same simple
quotient idea I already wrote about, so it fits in the 14 bits and you can
quickly filter and recover the sets and dereference the original data.

This would reduce 1024 bytes to around 1015 bytes, which would be more than
enough to solve the 20+ year compression challenge and refute claims that
random data can't be compressed. Of course, the cluster that I built to run my
code is too slow and I'm getting tired of waiting, so I will need a break from
this. In the meantime you can stick to the insults and other predictions in
the comments. As we all know from experience: Hacker News is a positive and
constructive community that is excellent at predicting the future.

~~~
filmfact
Let's restructure the problem slightly, so the objection is clearer.

Instead of binary data, think of decimal numbers. We'll take a five-digit
number and compress it with a function F to a four-digit one.

We could come up with many different ways of doing this, but let's just throw
away the most significant digit.

F(00123) = 0123
F(01234) = 1234

Great - but now what happens with a number that actually uses all five digits?

F(12345) = 2345

We lose information. We can’t encode the fifth digit in any of the other
digits because that would change their meaning.

It should be evident that there is no scheme where you can encode a five-digit
number as a four-digit number without either being lossy:

For some X: F’(F(X)) != X

Or having a restricted domain, i.e. F’(F(X)) = X only where digit 5 = 0
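
As a toy Python sketch of the digit-dropping scheme above (purely illustrative):

    def F(x):
        """'Compress' a 5-digit number by dropping its most significant digit."""
        return x % 10000

    def F_inv(y):
        """Decompress: the best we can do is pick one preimage per 4-digit value."""
        return y

    # Lossless only on the restricted range where the dropped digit is 0.
    print(all(F_inv(F(x)) == x for x in range(10000)))   # True
    print(all(F_inv(F(x)) == x for x in range(100000)))  # False: 12345 and 02345 both map to 2345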

Increasing the number of digits just obfuscates the problem by moving it from
the realm of numbers, where it’s clear there can be no one-to-one, invertible
mapping between two sets of unequal cardinality, to the realm of ‘data’, where
we can confuse the issue.

At the end of the day, a function that only accepts 2^(1015 * 8) unique input
values can have at most 2^(1015 * 8) unique outputs. And while that’s a lot of
outputs, it’s far fewer than the 2^(1024 * 8) values you need to be able to
represent in order to compress any random input.

I’ll leave you with a quote from Oscar Wilde: “A true friend stabs you in the
front, not the back.”

~~~
twhitmore
Exactly. Each compressed input can only decompress to one output.

But the number of possible compressed inputs is far smaller than the space of
possible outputs. Therefore many outputs cannot be reproduced.

Real compression algorithms for structured data preserve the above properties.
All outputs are representable, but they devote shorter codes (and thus most of
the coding symbol space) to the most common patterns. Since coding symbol
space is finite, random or less compressible data are left with longer
symbols/bitstrings than they would occupy uncompressed. For a reasonably
structured file, the overall outcome is compression. However, random files see
a small increase in size due to coding overhead.
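
A quick illustration (Python, with zlib standing in for any real compressor;
a sketch, not a benchmark):

    import os
    import zlib

    structured = (b"the quick brown fox jumps over the lazy dog " * 25)[:1024]
    random_block = os.urandom(1024)

    # Repetitive, structured input shrinks dramatically.
    print(len(zlib.compress(structured)))     # well under 1024 bytes
    # Random input comes out slightly *larger* than 1024 bytes: coding overhead.
    print(len(zlib.compress(random_block)))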

