Now, we all know that it can take a while to find a long sequence of digits in π,
so for practical reasons, we should break the files up into smaller chunks that
can be more readily found.
In this implementation, to maximise performance, we consider each individual byte
of the file separately, and look it up in π.
Or - what a genius idea! - we could use pi itself as an implicit lookup table: record a position, and let the sequence of n bytes at that position of pi stand in for the data.
To store a number in the range 0 to 4,294,967,295 (2^32 values), you need exactly 4 bytes.
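To make that per-byte scheme concrete, here's a minimal sketch in Python. It works on the decimal digits of pi rather than whatever digit base the actual project uses, and the mpmath library and the 20000-digit cutoff are my choices, not anything from the original:

```python
# Toy per-byte "pi compression": store each byte as the 4-byte offset of the
# first occurrence of its 3-digit decimal form in pi's decimal digits.
import struct
from mpmath import mp

mp.dps = 20000                            # how many decimal digits of pi to search
PI_DIGITS = str(mp.pi).replace(".", "")   # "31415926535..."

def compress_byte(b: int) -> bytes:
    pos = PI_DIGITS.find(f"{b:03d}")
    if pos < 0:
        raise ValueError("pattern not found; generate more digits")
    return struct.pack("<I", pos)         # exactly 4 bytes, as noted above

def decompress_byte(chunk: bytes) -> int:
    pos, = struct.unpack("<I", chunk)
    return int(PI_DIGITS[pos:pos + 3])

assert decompress_byte(compress_byte(0x41)) == 0x41
```

Every 1-byte input becomes a 4-byte offset, so this sketch expands the data fourfold before we even start, which is of course the joke.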
Here is the pigeonhole principle on Wikipedia: https://en.wikipedia.org/wiki/Pigeonhole_principle
This is the same tradeoff that all compression schemes make. gzip can compress some files, but others will grow. The hope is that the domain of files you actually want to apply it to will shrink instead of grow.
That said, given that the digits of pi are more or less uniformly distributed, there is no particular reason to believe that common bit patterns will be likely to appear sooner in pi than uncommon bit patterns. So it is unlikely that it will compress very many of the files you'd have sitting around on your hard drive.
But the pigeonhole principle alone does not prove this.
If you have a 100-gigabyte file you want to "compress" with Pi, all you have to do is find the beginning of that exact sequence in Pi and write down its location and the size (100 gigabytes). As long as the binary representation of the location within Pi is less than 100 gigabytes, the file is now "compressed". Why wouldn't this work?
It would work, exactly like you say. However, you seem to intuitively be vastly underestimating how far into pi you have to go to find a particular bit pattern.
To sharpen your intuition of this, I recommend this website: http://pi.nersc.gov/
I tried searching pi for the string "zzzz" (this is 20 bits of information according to their somewhat weird encoding scheme). The decimal position in pi for this 20-bit string was 3725869808, which takes 32 bits to represent.
The same will be true in most cases -- the offset into pi will take more space to represent than the data itself! However, in some cases data absolutely can be compressed with this pi scheme. Just not often enough to be actually useful.
However, the fact that the data will usually be larger is not what the pigeonhole principle is about.
All the pigeonhole principle says is: it's not possible for every 100 GB file to have an offset in pi that takes less than 100 GB to represent. The pigeonhole principle still allows some files to be compressible with pi, just not all.
To prove this is simple: take every possible 100 GB file (there are 2^(100G) of them). Now suppose that for every single one you can actually find a location in pi whose binary representation takes less than 100 GB. If you can do this, then at least 2 of the input files must have mapped to the same location in pi (because there were 2^(100G) distinct input files but fewer than 2^(100G) pi offsets that we are allowed to use). Therefore, once "compressed", you can't tell the difference between the two input files that both mapped to the same location in pi!
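Here's the same counting argument at a toy scale (16 bits standing in for 100 GB is my simplification):

```python
# There are 2**n distinct n-bit files, but only 2**n - 1 bit strings strictly
# shorter than n bits, so at least two files would have to share a shorter encoding.
n = 16                                              # stand-in for "100 GB", in bits
n_files = 2 ** n
n_shorter_strings = sum(2 ** k for k in range(n))   # lengths 0 .. n-1
print(n_files, n_shorter_strings, n_files > n_shorter_strings)   # 65536 65535 True
```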
Compression can only work because, for every byte you shave off a file you care about, you concede adding a byte to the representation of some file you don't care about.
> write down its location and the size
The location of any arbitrary sequence in pi is on average a much larger number than the sequence itself.
The idea that a single reference point in a set is larger than the entire set itself is also ridiculous. If I have a 100-gigabyte file, and I want to point to the position in the middle, that is position 53687091200. That is an 11 byte string pointing to a single point in a 107 billion byte long file.
You just do the same thing with Pi, except in this case Pi is the 107-billion-byte file, and position 53687091200 points to the file's beginning point. The only thing you store is 11 bytes + the size of the file. (Of course you wouldn't store the location in ASCII, as binary would be a much smaller representation, but for these purposes I'm just using ASCII.) Obviously the size of Pi is infinite and the position would be much farther than 50 billion in, but the idea is the same.
Please tell me in simpler terms how this idea does not work?
The crucial part is that the size of the position will be larger than the file you are seeking. This is unintuitive, but necessarily true for most possible files because of the pigeonhole principle. If you don't believe me, just try it out: pick random 4-digit sequences, find their indexes in pi, and the mean of those indexes will be more than 4 digits long. The same is true of sequences of any length.
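If you'd rather run that experiment than take it on faith, here's a rough sketch; mpmath and the sample size are my choices:

```python
# Find the first index of random 4-digit strings in pi's decimal digits and
# average them; the mean comes out around 10^4, i.e. longer than the data itself.
import random
from mpmath import mp

mp.dps = 200000
digits = str(mp.pi).replace(".", "")

positions = [digits.find(f"{random.randrange(10000):04d}") for _ in range(100)]
print(sum(positions) / len(positions))
```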
> I don't get this at all. You're talking about multiple files, which has nothing at all to do with my proposal.
No, I'm talking about multiple possible files, of which you must be able to compress and decompress any single one for your proposal to work in general.
General compression is mathematically proven to be impossible. If you have a compression algorithm f() which can take any input, in order for it to be able to compress some input of length x into a form that is shorter than x, it must also compress some other input of length x into a form that is longer than x.
Or, to put it another way: the sum of len(f(y)) over all possible y with len(y) = x is always greater than or equal to the sum of len(y).
Digits (a 0 followed by the digits of pi):  0 3 1 4 1 5 9 2 6 5 3 5 8 9 7
Index:                                      0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

First index at which each decimal digit appears:
0 - 0
1 - 2
2 - 7
3 - 1
4 - 3
5 - 5
6 - 8
7 - 14
8 - 12
9 - 6
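That little table is easy to reproduce (the leading 0 in front of pi's digits is taken from the table as written):

```python
# Index of the first occurrence of each decimal digit in "0" + the digits of pi.
digits = "0" + "31415926535897"
for d in "0123456789":
    print(d, "-", digits.find(d))
```

Even at this tiny scale, two of the ten digits (7 and 8) already need a two-digit index.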
Now, extrapolate this out to every possible 100GB file, and the binary representation of PI.
One drawback (on top of astronomically slow extraction time...), is that this still requires us to go ahead and actually write the data in uncompressed form onto a hard drive somewhere. So storage space is still taken up at some point in the world, but at least we can erase and reuse that space as soon as we like, and transmit the data elsewhere with minuscule bandwidth.
Of course, all this really amounts to "Any data you are actually interested in presumably has a short description (e.g., something like 'A complete audio and video recording of every occurrence in Eurasia over the 10,000 years beginning with 6000 BC') and thus, in perfectly compressed form, you will never have any need for large numbers of bits". Or, put another way, "There can't be more than [some reasonably finite number] pieces of data you will ever actually be interested in in your very finite life, so as far as you're concerned, log_2([said number]) bits suffices for everything".
That being said, I would be more concerned by the philosophical and practical consequences of such capability, namely the ability to read hard drives in normally inaccessible locations and hard drives in the future.
Now, I have done a little calculation:
The age of the universe is about 8 * 10^60 in Planck time units.
So if the simulation is, for example, something like a cellular automaton with a Planck-time (5.4 * 10^(-44) sec) step size, then, assuming we use fixed-point numbers, 26 bytes suffice (25 bytes are just not enough: 256^25 = 1.6 * 10^60).
Surprisingly, the size of the universe is about 5.4 * 10^61 in Planck-length units.
So again, we need 26 bytes for a super-fine representation. That's 26 bytes for each component, so altogether we need 104 bytes.
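A quick check of those byte counts (reading the 104-byte total as one time plus three space coordinates is my interpretation):

```python
# Back-of-the-envelope check of the byte counts above.
import math

age_in_planck_times    = 8e60     # age of the universe, as stated above
size_in_planck_lengths = 5.4e61   # size of the universe, as stated above

def bytes_needed(n: float) -> int:
    return math.ceil(math.log(n, 256))

print(bytes_needed(age_in_planck_times))     # 26  (256**25 ~ 1.6e60 falls short)
print(bytes_needed(size_in_planck_lengths))  # 26
print(4 * 26)                                # 104 bytes for one time + three space coordinates
```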
Very hypothetical of course :)
0 1 10 11 100 101 110 111 1000...
"You can't copyright a fact (like a number), but you can copyright a creative work, like a song or a piece of software. But since one can be transformed into another, copyright law is logically INCOHERENT."
Does a number that represents a range within Pi also get coloured as copyrighted, if your intent while computing it was to search for a copyright-coloured sequence of bits within Pi?
I think that for a lawyer it doesn't really matter; it doesn't matter which technique you're using to encode your data, as long as somebody can access that content and you did that on purpose. They're good at dealing with loosely defined things like intent, better than with formally defined things.
I wonder what amount (and progress) of AI research has been done in this area; not all illogical things are bullshit, however you might want to feel about them. Anti-digital-rights sentiment (disclaimer: I personally deplore much of the consequences about enforcement of digital rights, so I share that sentiment at many levels) sometimes can cloud judgement, and I've seen many people invoke rational thinking so well they successfully miss the point.
e.g., the first instance of "256" in pi starts at the 1750th digit. So in that case you're getting a 'compression' rate of -33% if we go by the count of decimal digits used.
That said, I wonder if "but remember all that storage space we saved by moving our data into π? Why don't we store our file locations there!?!" wasn't meant as some kind of hint...
Most likely, a 5-digit combination will only be found in Pi at a location above 10000; for example, I once located my 6-digit phone number at position 685214, which was not actually helpful at all.
Further, we are not sure whether Pi is normal, hence a better idea would be to use a simple computable normal sequence.
Just yesterday I uploaded a paper that presents an idea for compressing random data: https://www.academia.edu/7620004/Advanced_Compression_Techni...
It proposes packing multiple bytes, represented by their positions in a computable number series, into a small representation and generating them on the fly when required (it needs a lot of improvement to actually apply).
For example, the byte 0xFF, which is the number 255, first occurs at the 1168th value of Pi. This means instead of storing 255, you're now storing 1168, or 0x490, requiring an extra half-byte. However, 0x328, or the number 808, first occurs at the 105th value of pi, or 0x69, requiring one less half-byte.
How does this system provide better compression? The way I see it, the best case scenario would be if no sequence from 000 to 255 was ever repeated in Pi (or rather, not until every pattern in that sequence has been covered), in this case the compression ratio should be exactly 0%, no net gain or loss.
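One way to answer that empirically is to tally, for every byte value, whether the decimal position of its first occurrence in pi takes fewer, the same, or more digits than the 3-digit value itself. A rough sketch (mpmath and the digit count are my choices):

```python
# Compare, for each byte value 0..255, the length of its first-occurrence
# position in pi's decimal digits against the 3 digits of the value itself.
from mpmath import mp

mp.dps = 100000
digits = str(mp.pi).replace(".", "")

shorter = same = longer = 0
for b in range(256):
    width = len(str(digits.find(f"{b:03d}")))
    if width < 3:
        shorter += 1
    elif width == 3:
        same += 1
    else:
        longer += 1

print(shorter, same, longer)   # only a small minority of values get a shorter position
```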
I think the lyrics are contained in the celestial music itself, and such music is contained in the silence. Silence or 0 is the holy grail of 100% compression, creatio ex nihilo. Perhaps you could find such a song in Pi in a lifelong quest, but it is a much deeper mystery that Pi and 0 are related. Euler is rumoured to have remarked it to be proof of God's existence. But even if such proof exists, it's of no value compared to its beauty.
"After proving Euler's identity during a lecture, Benjamin Peirce, a noted American 19th-century philosopher, mathematician, and professor at Harvard University, stated that "it is absolutely paradoxical; we cannot understand it, and we don't know what it means, but we have proved it, and therefore we know it must be the truth." Stanford University mathematics professor Keith Devlin has said, "Like a Shakespearean sonnet that captures the very essence of love, or a painting that brings out the beauty of the human form that is far more than just skin deep, Euler's equation reaches down into the very depths of existence."
Because your index size would be at least equal to, or larger than, your original data.
The only way to get a smaller index is to look for recurrences and include only the most frequently recurring sequences up to some cutoff (there would be a tradeoff and an optimal cutoff for the compression ratio), leaving the other sequences uncompressed. Only this way can you achieve compression ratios smaller than 100%.
Here's a link in case you have trouble locating it within Pi: http://hyperdiscordia.crywalt.com/library_of_babel.html
I hope the author's pending patent is limited to pi though, so I can work on ef^H^H. Hmm, I see why maybe the author preferred pi over e.
Pi can be infinite and non-repeating (as it's irrational) and still, say, only sparsely contain 5s after the 100 trillionth digit (or wherever we've calculated it to so far). Unlikely, but irrationality alone doesn't rule it out.
There are some normal numbers that are known, but they seem hard to construct (to me, not a number theorist), like Champernowne's number, which is the concatenation of all natural numbers (clearly any number will be in its expansion by definition, but it won't be very good for compression purposes due to the indexing issues highlighted elsewhere).
Some further reading: http://math.stackexchange.com/questions/216343/does-pi-conta..., http://mathoverflow.net/questions/51853/what-is-the-state-of..., http://en.wikipedia.org/wiki/Complexity_function.
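To see why Champernowne's number doesn't dodge the indexing problem either, here's a toy sketch (the helper function and the sample targets are mine):

```python
# Build a prefix of Champernowne's constant (0.123456789101112...) and find
# where some sample strings first appear; the index is roughly as long as the
# string itself, so the same offset problem applies.
def champernowne_digits(n_digits: int) -> str:
    parts, total, i = [], 0, 1
    while total < n_digits:
        s = str(i)
        parts.append(s)
        total += len(s)
        i += 1
    return "".join(parts)[:n_digits]

digits = champernowne_digits(200000)
for target in ["7", "42", "31337"]:
    print(target, "->", digits.find(target))
```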
It's basically scanning a random byte-stream for a 200-byte long exact match. 200 bytes, 1600 bits, or 2^1600 different possible sequences, making the odds 1/2^1600 that any particular 200 bytes pulled out will match the bytes you are looking for.
π: Not quite the definition of normal, but equivalent.
Am I right in assuming that the decompression step is several orders of magnitude faster than the compression phase?
What if the offset within pi is so large that any representation of it is larger than my data?
A similar argument works for all compression algorithms and all sizes. It is flatly impossible to compress all data all of the time.
Pedantically speaking, any data passing through a cryptographic hash algorithm is being compressed.
strlen(alix) < strlen(1188606148)