
File system that stores location of file in Pi - morisy
https://github.com/philipl/pifs
======
zaroth

      Now, we all know that it can take a while to find a long sequence of digits in π,
      so for practical reasons, we should break the files up into smaller chunks that
      can be more readily found.
    
      In this implementation, to maximise performance, we consider each individual byte
      of the file separately, and look it up in π.
    

Definitely worth a chuckle. Very cute idea and implementation.

~~~
microcolonel
You laugh now, but when we develop a trivial method to calculate pi and other
irrational constants to quadrillions of digits, this will be wonderful.

~~~
Tuna-Fish
By the pigeonhole principle, no matter how fast you can calculate pi, you
cannot actually use this to compress data. The index to the relevant sequence
is, on average, at least as large as the data to be stored.

~~~
peterwwillis
I don't get how this applies.

If you have a 100-gigabyte file you want to "compress" with Pi, all you have
to do is find the beginning of that exact sequence in Pi and write down its
location and the size (100gigabytes). As long as the binary representation of
the location within Pi was less than 100 gigabytes, it is now "compressed".
Why wouldn't this work?

~~~
haberman
> Why wouldn't this work?

It _would_ work, exactly like you say. However, you seem to intuitively be
vastly underestimating how far into pi you have to go to find a particular bit
pattern.

To sharpen your intuition of this, I recommend this website:
[http://pi.nersc.gov/](http://pi.nersc.gov/)

I tried searching pi for the string "zzzz" (this is 20 bits of information
according to their somewhat weird encoding scheme). The decimal position in pi
for this 20-bit string was 3725869808, which takes 32 bits to represent.

The same will be true in most cases -- the offset into pi will take more space
to represent than the data itself! However, in some cases data absolutely can
be compressed with this pi scheme. Just not often enough to be actually
useful.

However, the fact that the data will _usually_ be larger is not what the
pigeonhole principle is about.

All the pigeonhole principle says is: it's not possible for _every_ 100 GB
file to have an offset in pi that takes less than 100GB to represent. The
pigeonhole principle still allows _some_ files to be compressible with pi,
just not all.

To prove this is simple: take every possible 100GB file (there are 2 __100G of
them). Now let's suppose that for every single one you can actually find a
location in pi whose binary representation is less than 100GB to represent. If
you can do this, then it means that at least 2 of the input files mapped to
the same location in pi (because there were 2 __100G distinct input files but
less than 2 __100G pi offsets that we are allowed to use). Therefore, once
"compressed", you can't tell the difference between the two input files that
both mapped to the same location in pi!
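
The counting step in this argument can be checked by brute force for tiny files. A minimal sketch in Python, using 3-bit files as a stand-in for 100 GB ones; the helper names `bitstrings` and `shorter_codes` are made up for illustration:

```python
from itertools import product

def bitstrings(length):
    """Return every bit string of exactly `length` bits."""
    return ["".join(bits) for bits in product("01", repeat=length)]

def shorter_codes(n):
    """Return every bit string strictly shorter than n bits (including the empty one)."""
    codes = []
    for length in range(n):
        codes.extend(bitstrings(length))
    return codes

n = 3  # stand-in for the 100 GB case; the counts scale as 2**n and 2**n - 1
files = bitstrings(n)
codes = shorter_codes(n)
print(len(files), "files,", len(codes), "shorter codes")  # 8 files, 7 shorter codes
```

There is always one fewer shorter code than there are files, so at least two files would have to share a code.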

~~~
haberman
Aw shoot, where my comment above says "2100G" that was supposed to be
"2^100G". HN ate my double-asterisk.

------
nadam
When I was young I had this idea that any hard drive can be compressed into
100 bytes. The compressed data is a 4-dimensional vector; each component of the
vector is a 25-byte floating point number, and together they represent the
space-time coordinates of the hard drive. (For example, my hard drive on March 3,
1994 at 23:00:45.456 at a specific place in Budapest.) The extractor algorithm
just has to simulate the universe from the big bang up until the given time, read
the state of the atoms at the specified location, recognize the hard drive,
and read the data from it. (Provided that the universe is deterministic, and
what seems to be random in quantum mechanics can be simulated with a
pseudorandom number generator.)

~~~
erikb
Well, let's assume that the past is constantly changing. The path leading from
the start of the universe to the writing of your 4D vector would also change
and therefore your vector itself might change automatically with every change
to space-time. Or would that still be called deterministic?

~~~
nsajko
Could you explain how/why you think the past is constantly changing?

------
baddox
If you're going to use a normal number for this purpose, why not choose a much
nicer one? Let's use a number such that its binary representation is the
concatenation of consecutive ascending binary numbers.

    
    
        0 1 10 11 100 101 110 111 1000...
    

becomes

    
    
        0.0110111001011101111000...
    

It's much easier to demonstrate that this number is normal than to do so for
pi. It's also much easier to calculate the nth digit, and to find an
occurrence of a given string of bits.
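
A minimal sketch of this construction in Python, assuming a finite prefix is enough for the pattern being searched; the function names and the default prefix size are illustrative choices, not part of any existing tool:

```python
def binary_concat_prefix(count):
    """Concatenate the binary forms of 0, 1, ..., count - 1."""
    return "".join(format(i, "b") for i in range(count))

def first_offset(pattern, count=4096):
    """0-based offset of the first occurrence of `pattern` in the prefix, or -1."""
    return binary_concat_prefix(count).find(pattern)

print(binary_concat_prefix(9))   # 0110111001011101111000
print(first_offset("111"))       # 4
```

Any pattern that starts with 1 is the binary form of some integer, so it is guaranteed to appear no later than the point where that integer is written; a pattern with leading zeros appears inside the integer obtained by putting a 1 in front of it.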

------
zachrose
Obligatory Dinosaur Comics:
[http://www.qwantz.com/index.php?comic=353](http://www.qwantz.com/index.php?comic=353)

"You can't copyright a fact (like a number), but you can copyright a creative
work, like a song or a piece of software. But since one can be transformed
into another, copyright law is logically INCOHERENT."

~~~
pasiaj
What Colour are your bits?

[http://ansuz.sooke.bc.ca/entry/23](http://ansuz.sooke.bc.ca/entry/23)

~~~
ithkuil
Wow, great article!

Does a number that represents a range within Pi also get coloured as
copyrighted if your intent while computing it was to search for a
copyright-coloured sequence of bits within Pi?

I think that for a lawyer it doesn't really matter; it doesn't matter which
technique you're using to encode your data, as long as somebody can access
that content and you did that on purpose. They're good at dealing with loosely
defined things like intent, better than with formally defined things.

I wonder what amount (and progress) of AI research has been done in this area;
not all illogical things are bullshit, however you might want to feel about
them. Anti-digital-rights sentiment (disclaimer: I personally deplore many of
the consequences of enforcing digital rights, so I share that
sentiment at many levels) sometimes can cloud judgement, and I've seen many
people invoke rational thinking so well they successfully miss the point.

------
bonchibuji
Isn't this one of the April 1st jokes from 2012? Most of the commits were made
on March 31, 2012[1]. And there's even a reference to the pi joke[2].

[1]
[https://github.com/philipl/pifs/commits/master](https://github.com/philipl/pifs/commits/master)

[2]
[http://www.netfunny.com/rhf/jokes/01/Jun/pi.html](http://www.netfunny.com/rhf/jokes/01/Jun/pi.html)

------
peterkelly
Great, now we're going to see a DMCA takedown for π as it contains copyrighted
content.

~~~
cetu86
Yeah, and bomb building instructions and child pornography.

------
ttflee
Like other lossless compression algorithms, there always exist some blobs of
data, where the length of the location plus metadata exceeds that of the
original blob, due to the pigeon hole principle. The trouble in the case of
pi-fad is that probably we will not know whether the location is longer or not
before it is ever actually computed.

~~~
EarthLaunch
Quantum computing will abstract that away so we don't need to know whether the
location is longer or not before using it.

------
TOMDM
I love pieces of code like this, it appeals to me in the same way sleepsort
does. A superficial understanding of it might make you think it would be worth
it, but really, while it may work, it's better left as a joke.

------
bunderbunder
I'm skeptical that this could really save any space. Just speculating here,
really, but it seems like on average the amount of space needed to store the
starting index of an arbitrary string of digits in pi should be greater than
(or at least comparable to) the size of the string itself.

e.g., the first instance of "256" in pi starts at the 1750th digit. So in that
case you're getting a 'compression' rate of -33% if we go by the count of
decimal digits used.
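
A rough heuristic for this intuition, assuming the digits of pi behave like independent, uniformly random decimal digits (an assumption only, since pi is not known to be normal), for a fixed string of k digits:

```latex
\Pr[\text{the string starts at a given position}] = 10^{-k},
\qquad
\mathbb{E}[\text{position of first occurrence}] \approx 10^{k},
\qquad
\text{digits needed to write that position} \approx \log_{10} 10^{k} = k .
```

Under that assumption the index is typically about as long as the string it points to, which is consistent with the 4-digit index for the 3-digit string "256" above.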

~~~
hashmymustache
To be fair, it compressed my 93 Gb file into 6 bytes. Incidentally, the file
stored the first 100 billion decimals of pi.

~~~
vanderZwan
Ah, the LenPEG approach.

[http://www.dangermouse.net/esoteric/lenpeg.html](http://www.dangermouse.net/esoteric/lenpeg.html)

------
skhavari
Hooli is gonna be pissed when they learn that Pi-ed Piper nailed a compression
algorithm.

------
TheAuditor
I had played with this idea some time back and gave up after some very
specific flaws became clear.

There is a good probability that a 5-digit combination will first be found in
Pi at a location above 10000. For example, I once located my 6-digit phone
number at position 685214, which was not actually helpful at all.

Further, we are not sure whether Pi is normal, hence the better idea would be
to use a simple computable normal series.

Just yesterday I uploaded a paper that presents an idea for compressing
random data:
[https://www.academia.edu/7620004/Advanced_Compression_Techni...](https://www.academia.edu/7620004/Advanced_Compression_Technique_For_Random_Data)
It proposes pushing multiple bytes, represented by positions in a computable
number series, into a small representation and generating them on the go
when required. (It needs a lot of improvement before it can actually be applied.)

------
braydenjw
I'm not sure I understand how this would compress files. I mean, the only way
it could is if the decimal place in Pi at which the byte occurs is
significantly less than the value of the byte itself. Statistically, this
would happen less than 50% of the time, the other 50+% of the time occurring
at a higher decimal place. I don't see this providing any real compression
benefit.

For example, the byte 0xFF, which is the number 255, first occurs at the
1168th value of Pi. This means instead of storing 255, you're now storing 1168,
or 0x490, requiring an extra half-byte. However, 0x328, or number 808, first
occurs at the 105th value of pi, or 0x69, requiring one less half-byte.

How does this system provide better compression? The way I see it, the best
case scenario would be if no sequence from 000 to 255 was ever repeated in Pi
(or rather, not until every pattern in that sequence has been covered), in
this case the compression ratio should be exactly 0%, no net gain or loss.
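
Lookups like these can be reproduced with a short script. This is a minimal sketch that computes decimals of pi with Machin's formula in integer arithmetic, which is a choice made here for brevity and not necessarily how pifs generates digits; the function names and the guard-digit count are assumptions:

```python
def arctan_inv(x, unity):
    """Return arctan(1/x) scaled by `unity`, using the alternating Taylor series."""
    total = term = unity // x
    x2 = x * x
    n, sign = 3, -1
    while term:
        term //= x2                  # term is now unity / x**n
        total += sign * (term // n)
        sign = -sign
        n += 2
    return total

def pi_decimals(count, guard=10):
    """Return the first `count` decimals of pi as a string (Machin's formula)."""
    unity = 10 ** (count + guard)    # extra guard digits absorb truncation error
    pi_scaled = 4 * (4 * arctan_inv(5, unity) - arctan_inv(239, unity))
    return str(pi_scaled // 10 ** guard)[1:]  # drop the leading "3"

def first_position(pattern, decimals):
    """1-based position of `pattern` among the decimals, or None if absent."""
    index = decimals.find(pattern)
    return index + 1 if index >= 0 else None

decimals = pi_decimals(2000)
for pattern in ("808", "255"):
    print(pattern, first_position(pattern, decimals))
```

Comparing `len(pattern)` with the number of digits in the reported position gives the per-pattern gain or loss described above.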

------
fluff3141592653
I've been looking for the lyrics to the song that, when sung, will bring about
peace on this planet. Now to hear that the file containing these lyrics is
already contained in pi is revelatory. Could someone please give me the index
and length of the file? I've got some singing to do.

~~~
fsiefken
As Dr. Ellie whispers in Contact: "No words to describe it" and in the Dark
Crystal Jen's song is without words.

I think the lyrics are contained in the celestial music itself, and such music
is contained in the silence. Silence or 0 is the holy grail of 100%
compression, creatio ex nihilo. Perhaps you could find such a song in Pi in a
lifelong quest, but it's a much deeper mystery that Pi and 0 are mysteriously
related. Euler is rumoured to have remarked it to be proof of God's existence.
But even if such proof exists it's of no value compared to its beauty.

"After proving Euler's identity during a lecture, Benjamin Peirce, a noted
American 19th-century philosopher, mathematician, and professor at Harvard
University, stated that "it is absolutely paradoxical; we cannot understand
it, and we don't know what it means, but we have proved it, and therefore we
know it must be the truth." Stanford University mathematics professor Keith
Devlin has said, "Like a Shakespearean sonnet that captures the very essence
of love, or a painting that brings out the beauty of the human form that is
far more than just skin deep, Euler's equation reaches down into the very
depths of existence."

[http://en.wikipedia.org/wiki/Euler's_identity](http://en.wikipedia.org/wiki/Euler's_identity)

------
EdwardCoffin
This reminds me of Frederik Pohl's [1] book The Gold at Starbow's End, in
which Gödelization [2] is used to compress a huge message into a very short
one. There's a brief description of that part of the book at MathFiction [3]

[1]
[http://en.wikipedia.org/wiki/Frederik_Pohl](http://en.wikipedia.org/wiki/Frederik_Pohl)

[2]
[http://www.encyclopediaofmath.org/index.php/Gödelization](http://www.encyclopediaofmath.org/index.php/Gödelization)

[3]
[http://kasmana.people.cofc.edu/MATHFICT/mfview.php?callnumbe...](http://kasmana.people.cofc.edu/MATHFICT/mfview.php?callnumber=mf1033)

------
cettox
As many have pointed out using the pigeonhole principle, it is not practical to
create a compression index (a lookup index where you map actual data to some
kind of addresses, preferably smaller than the sequences) using every possible
n-byte sequence of your data!

Because your index size would be at least equal to or larger than your original
data.

The only way to get a smaller compression index is to look for recurrences
and include only the most frequently recurring sequences, up to some
number (there would be a tradeoff and an optimal number for compression ratio),
leaving other sequences uncompressed. Only this way can you achieve
compression ratios smaller than 100%.

------
pbhjpbhj
I scanned the responses and saw only one that mentioned that pi is not proved
(or possibly also provably) normal. That comment was downvoted.

------
andybak
If anyone here hasn't read The Library of Babel yet, then now is a good time.

Here's a link in case you have trouble locating it within Pi:
[http://hyperdiscordia.crywalt.com/library_of_babel.html](http://hyperdiscordia.crywalt.com/library_of_babel.html)

------
Jack5500
This project was posted before and hasn't been updated since. I doubt that it
is still in development

~~~
XorNot
It's also pretty obviously an elaborate mathematics joke.

~~~
E_Carefree
Yes, but it is conceivable. It's like a code. You could securely save files with
two numbers. Just a position and length.

~~~
vidarh
It would only be secure if nobody knows what the position refers to. And the
position is likely to be longer than the data, so you might as well use proper
encryption.

------
bmh100
This is extremely clever and something I have wanted to do for a while. If you
are interested in contributing to a fun, small Clojure project, stop by:
[https://github.com/bmhimes/clojure-pifs](https://github.com/bmhimes/clojure-pifs)

------
tluyben2
Reminds me of Jan Sloot;
[http://en.wikipedia.org/wiki/Jan_Sloot](http://en.wikipedia.org/wiki/Jan_Sloot).
It was like an April Fools' joke, but a lot of _big_ people fell for it at the time.

------
tragomaskhalos
If you use base 11 you get the added bonus of proving the existence of god
!([http://en.wikipedia.org/wiki/Contact_(novel)](http://en.wikipedia.org/wiki/Contact_\(novel\)))

------
andrewfong
Reminds me of this SMBC:
[https://medium.com/the-nib/jesus-is-destroying-civilization-...](https://medium.com/the-nib/jesus-is-destroying-civilization-a2ac3c553d47)

------
andrey-p
I prefer the infinite monkey database [1] myself.

[1]: [https://github.com/brycebaril/infinite-monkey-db](https://github.com/brycebaril/infinite-monkey-db)

------
aaron695
I'm not sure pi is proven to contain all sequences of digits. Anyone care to
link a proof? The joke may be on them, and they might not really understand pi
at all.

~~~
pbhjpbhj
It isn't proven to be "normal"
([http://en.wikipedia.org/wiki/Normal_number](http://en.wikipedia.org/wiki/Normal_number)).
There is no guarantee that any particular sequence is in pi until you've
searched and found it. It's a very common fallacy that because the expansion
is infinite and non-repeating it should contain every possible sequence, very
simple counter examples exist.

Pi can be infinite and non-repeating (as it's irrational) and still only
sparsely contain 5s after the 100 trillionth digit (or however far we've
calculated it so far), unlikely as that is.

There are some normal numbers that are known but they seem hard to construct
(to me, not a number theorist), like Champernowne's number which is the
concatenation of all natural numbers (clearly any number will be in its
expansion by definition but it won't be very good for compression purposes due
to the indexing issues highlighted elsewhere).

Some further reading:
[http://math.stackexchange.com/questions/216343/does-pi-conta...](http://math.stackexchange.com/questions/216343/does-pi-contain-all-possible-number-combinations),
[http://mathoverflow.net/questions/51853/what-is-the-state-of...](http://mathoverflow.net/questions/51853/what-is-the-state-of-our-ignorance-about-the-normality-of-pi),
[http://en.wikipedia.org/wiki/Complexity_function](http://en.wikipedia.org/wiki/Complexity_function).

~~~
colanderman
If the idea that an infinite, non-repeating pattern might _not_ contain every
possible sequence seems strange, consider Penrose Tilings [1]. Penrose Tilings
are infinite geometric patterns which never repeat, yet clearly don't contain
every image known to man.

[1]
[http://en.wikipedia.org/wiki/Penrose_tiling](http://en.wikipedia.org/wiki/Penrose_tiling)

------
somid3
Fascinating implementation I must admit. How does i/o performance change as
the byte-length or chunks vary in size from 3 to 200 bytes?

~~~
ch4ch4
I think it would probably take forever for the initial lookup, because the
probability of matching any 3 byte sequence is higher than matching a 200
bytes sequence?

~~~
zaroth
_Literally_ forever, right?

It's basically scanning a random byte-stream for a 200-byte long exact match.
200 bytes, 1600 bits, or 2^1600 different possible sequences, making the odds
1/2^1600 that any particular 200 bytes pulled out will match the bytes you are
looking for.
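
To put numbers on that, here is a small arithmetic sketch in Python. It assumes, as the comment does, that the stream behaves like uniformly random bytes, which is not proven for pi:

```python
bits = 200 * 8                      # a 200-byte chunk
patterns = 2 ** bits                # number of distinct 200-byte values

print(bits)                         # 1600
print(len(str(patterns)))           # 482 decimal digits
print((patterns - 1).bit_length())  # 1600 bits to index that many positions
```

So even if the match exists, a typical offset would itself need on the order of 1600 bits, the same size as the chunk being looked up.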

~~~
pndmnm
In fact, it's still not known if pi is normal (contains all finite patterns of
numbers[π]), so you can't guarantee that any search will terminate.

π: Not quite the definition of normal, but equivalent.

~~~
calvins
Even if pi isn't normal, there are plenty of normal numbers to choose from
(almost all of the reals are normal, in fact), including some really simple
and predictable ones like Champernowne's constant (in base 10:
0.1234567891011121314...) that would support simpler index calculations than
pi.
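
As a minimal sketch of why the index calculation is simpler here: in base-10 Champernowne's constant, the place where an integer n is written can be computed directly, with no searching. The function name is hypothetical, and the offset it returns is a guaranteed occurrence of n's digits, not necessarily the first one:

```python
def champernowne_offset(n):
    """0-based offset after the decimal point where the integer n >= 1 is written."""
    digits = len(str(n))
    offset = 0
    # Skip all the shorter integers: 9 one-digit, 90 two-digit, 900 three-digit, ...
    for d in range(1, digits):
        offset += 9 * 10 ** (d - 1) * d
    # Skip the integers with the same digit count that come before n.
    offset += (n - 10 ** (digits - 1)) * digits
    return offset

print(champernowne_offset(10))    # 9
print(champernowne_offset(100))   # 189
```

The offset grows faster than n itself, so the indexing problem raised elsewhere in the thread still applies.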

------
alixaxel
This is genius!

Am I right in assuming that the decompression step is several orders of
magnitude faster than the compression phase?

------
mavdi
Smartest thing I've seen in months, not so much the speed, but the compression
value of it is great

------
haddr
really cool idea, but not sure if actually true:
[http://math.stackexchange.com/questions/216343/does-pi-conta...](http://math.stackexchange.com/questions/216343/does-pi-contain-all-possible-number-combinations)

------
tiku
if you have this large base file with pi-numbers, you could use it to compress
data right? and with the current internet speeds, pi-storing in the cloud
could be an option. or hell, even distributed pi files :)

------
soheil
The location of where the data is is no less complex than the actual data.

------
Igglyboo
His Weissman score must be off the charts.

------
dvanduzer
has anybody run bonnie++ benchmarks with this fs?

------
serge2k
> They said 100% compression was impossible? You're looking at it!

If the offset within pi is so large that any representation of it is larger
than my data?

~~~
patio11
Yep. Consider the minimum case: assume we've described a process for finding
any bitstream we want in pi while necessarily saving at least one bit. Attempt
to do so for the bitstreams 00, 01, 10, and 11. If we compress 2 bits to 1
bit, by the pigeonhole principle, at least one of 0 and 1 has to represent at
least 2 distinct bitstreams, which means we have lost data.

A similar argument works for all compression algorithms and all sizes. It is
flatly impossible to compress all data all of the time.
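
The 2-bit case is small enough to check exhaustively. A minimal sketch in Python that tries every possible assignment of 1-bit codes to the four inputs and counts how many are invertible; `count_lossless` is a name invented for this example:

```python
from itertools import product

def count_lossless(inputs, outputs):
    """Count the input-to-output assignments in which no two inputs share a code."""
    lossless = 0
    for codes in product(outputs, repeat=len(inputs)):
        if len(set(codes)) == len(inputs):  # all codes distinct means invertible
            lossless += 1
    return lossless

print(count_lossless(["00", "01", "10", "11"], ["0", "1"]))  # 0
```

None of the 16 candidate "compressors" keeps all four inputs distinguishable.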

~~~
delinka
"It is flatly impossible to compress all data all of the time." Might I add
"...such that the uncompressed data is recoverable."

Pedantically speaking, any data passing through a cryptographic hash algorithm
is being compressed.
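
A small illustration of that point with Python's standard `hashlib`; the inputs are arbitrary examples:

```python
import hashlib

short = hashlib.sha256(b"pi").digest()
long_ = hashlib.sha256(b"pi" * 1_000_000).digest()

# Both digests are 32 bytes, no matter how large the input was.
print(len(short), len(long_))  # 32 32
```

The 2 MB input shrinks to 32 bytes, but nothing can turn the digest back into the input, which is the "recoverable" caveat above.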

~~~
rmc
Pedantically, that's not compression.

~~~
delinka
Indeed it is:
[https://en.wikipedia.org/wiki/One-way_compression_function](https://en.wikipedia.org/wiki/One-way_compression_function)

