File system that stores location of file in Pi (github.com/philipl)
310 points by morisy on July 11, 2014 | 98 comments



  Now, we all know that it can take a while to find a long sequence of digits in π,
  so for practical reasons, we should break the files up into smaller chunks that
  can be more readily found.

  In this implementation, to maximise performance, we consider each individual byte
  of the file separately, and look it up in π.
Definitely worth a chuckle. Very cute idea and implementation.
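
A minimal sketch of that byte-at-a-time scheme (Python; the hardcoded digit string and the 3-digit decimal encoding are my stand-ins for illustration -- the real project does its lookups differently):

    # Store each byte as the index of its first occurrence in pi's digits.
    # PI_DIGITS is a tiny stand-in; a real store needs vastly more digits.
    PI_DIGITS = "1415926535897932384626433832795028841971693993751058209749445923"

    def store_byte(b):
        """Index in pi's digits where the 3-digit decimal form of b first appears."""
        idx = PI_DIGITS.find("%03d" % b)
        if idx == -1:
            raise ValueError("need more digits of pi")
        return idx

    def load_byte(idx):
        """Read the byte back out of pi: three digits starting at idx."""
        return int(PI_DIGITS[idx:idx + 3])

Note the punchline hiding in plain sight: the index returned by store_byte typically takes more digits to write down than the byte it "stores".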


If you used small chunks (say, up to four bytes) you could just have a lookup table storing all the indices.

Or - what a genius idea! - we could use Pi itself as a lookup table and use the sequence of n bytes at that position of pi as an implicit lookup table.


There are 4,294,967,296 possibilities in 4 bytes.

To store a number in the range 0 to 4,294,967,295, you need exactly 4 bytes.


Not all 32-bit sequences can be found in the first 4,294,967,296 binary digits of pi. Hence you will either need more than 4 bytes to encode a 4-byte value as a position in pi, or a lookup table to fit the 4,294,967,296 numbers, some of which will be substantially larger than 4,294,967,296, into 4,294,967,296 slots.


To save us from searching the number space past the 4.3-ish billionth slot, we could just store raw the sequences that we didn't find in that part of pi. I propose a bit to indicate whether the current sequence is a pi-sequence or just raw data.
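
Roughly, as a sketch (Python; find_in_pi is a hypothetical lookup against a precomputed digit store, not anything the project actually ships):

    MAX_INDEX = 2**32  # only search the first ~4.3 billion slots

    def encode_chunk(chunk, find_in_pi):
        """Return (flag, payload): flag 1 = pi index, flag 0 = raw bytes."""
        idx = find_in_pi(chunk)        # hypothetical: first index of chunk, or None
        if idx is not None and idx < MAX_INDEX:
            return (1, idx)            # found early enough: store the position
        return (0, chunk)              # escape hatch: store the chunk verbatim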


You laugh now, but when we develop a trivial method to calculate pi and other irrational constants to quadrillions of digits, this will be wonderful.


By the pigeonhole principle, no matter how fast you can calculate pi, you cannot actually use this to compress data. The index of the relevant sequence is on average >= the size of the data to be stored.


I was thinking so hard trying to give myself a clear explanation of why this type of compression was impossible! Thank you.

Here is the pigeonhole principle on wikipedia:

http://en.wikipedia.org/wiki/Pigeonhole_principle


The pigeonhole principle says that you cannot use pi to compress all data. You could certainly use this scheme to compress some data. For example, if you were storing a file containing the contents of pi (from the beginning), it would compress very well with this scheme.

This is the same tradeoff that all compression schemes make. gzip can compress some files, but others will grow. The hope is that the domain of files you actually want to apply it to will shrink instead of grow.

That said, given that the digits of pi are more or less uniformly distributed, there is no particular reason to believe that common bit patterns will be likely to appear sooner in pi than uncommon bit patterns. So it is unlikely that it will compress very many of the files you'd have sitting around on your hard drive.

But the pigeonhole principle alone does not prove this.


I know :D I tried it for the past 6 months and then gave up on it 2 months back, for the same reason. However, that search gave rise to a better idea for compressing files. It's not yet usable, but it doesn't have all these issues. If you are interested you can read it at https://www.academia.edu/7620004/Advanced_Compression_Techni... It is based on representing data as a position in a computable number series, not as the values themselves.


I don't get how this applies.

If you have a 100-gigabyte file you want to "compress" with Pi, all you have to do is find the beginning of that exact sequence in Pi and write down its location and the size (100 gigabytes). As long as the binary representation of the location within Pi is less than 100 gigabytes, it is now "compressed". Why wouldn't this work?


> Why wouldn't this work?

It would work, exactly like you say. However, you seem to intuitively be vastly underestimating how far into pi you have to go to find a particular bit pattern.

To sharpen your intuition of this, I recommend this website: http://pi.nersc.gov/

I tried searching pi for the string "zzzz" (this is 20 bits of information according to their somewhat weird encoding scheme). The decimal position in pi for this 20-bit string was 3725869808, which takes 32 bits to represent.

The same will be true in most cases -- the offset into pi will take more space to represent than the data itself! However, in some cases data absolutely can be compressed with this pi scheme. Just not often enough to be actually useful.

However, the fact that the data will usually be larger is not what the pigeonhole principle is about.

All the pigeonhole principle says is: it's not possible for every 100 GB file to have an offset in pi that takes less than 100GB to represent. The pigeonhole principle still allows some files to be compressable with pi, just not all.

To prove this is simple: take every possible 100GB file (there are 2^100G of them). Now let's suppose that for every single one you can actually find a location in pi whose binary representation is less than 100GB to represent. If you can do this, then it means that at least 2 of the input files mapped to the same location in pi (because there were 2^100G distinct input files but less than 2^100G pi offsets that we are allowed to use). Therefore, once "compressed", you can't tell the difference between the two input files that both mapped to the same location in pi!


Let's say you want to store a 100kB file. There are 2^(8 x 102400) possible different 100kB files. The lowest amount of data that lets you distinguish the one you want from all the possible candidates is 100kB. If you could compress any file with the same algorithm, it would mean you could decompress the exact same compressed file into multiple different files.

Compression is a tradeoff: for every byte you shave off a file you care about, you must concede a byte added to the representation of some file you don't care about.

> write down its location and the size

The location of any arbitrary sequence in pi is on average a much larger number than the sequence itself.


I don't get this at all. You're talking about multiple files, which has nothing at all to do with my proposal. And I have no idea what your second paragraph means at all.

The idea that a single reference point in a set is larger than the entire set itself is also ridiculous. If I have a 100-gigabyte file, and I want to point to the position in the middle, that is position 53687091200. That is an 11 byte string pointing to a single point in a 107 billion byte long file.

You just do the same thing with Pi, except in this case Pi is the 107-billion-byte file, and position 53687091200 points to the file's beginning point. The only thing you store is 11 bytes + the size of the file. (Of course you wouldn't store the location in ASCII, as binary would be a much smaller representation, but for these purposes I'm just using ASCII.) Obviously the size of Pi is infinite and the position would be much deeper in than 50 billion, but the idea is the same.

Please tell me in simpler terms why this idea does not work.


> and the position would be much deeper in than 50 billion, but the idea is the same.

The crucial part is that the size of the position will be larger than the file you are seeking. This is unintuitive, but necessarily true for most possible files because of the pigeonhole principle. If you don't believe me, just try it out: pick random 4-digit sequences and find their indexes in pi. The mean of their indexes will be more than 4 digits long. The same is true of sequences of any length.
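
Here is one way to run that experiment (Python; using mpmath for the digits is my assumption -- any digit source works):

    # Mean first-occurrence index of random 4-digit strings in pi's digits.
    import random
    from mpmath import mp

    mp.dps = 100_000                       # work with 100k decimal digits
    digits = str(mp.pi).replace(".", "")   # "31415926..."

    indexes = []
    for _ in range(200):
        needle = "%04d" % random.randrange(10_000)
        idx = digits.find(needle)
        if idx >= 0:                       # essentially always found in 100k digits
            indexes.append(idx)

    print(sum(indexes) / len(indexes))     # comes out around 10,000 -- a 5-digit number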

> I don't get this at all. You're talking about multiple files, which has nothing at all to do with my proposal.

No, I'm talking about multiple possible files, of which you must be able to compress and decompress any single one for your proposal to work in general.

General compression is mathematically proven to be impossible. If you have a compression algorithm f() which can take any input, in order for it to be able to compress some input of length x into a form that is shorter than x, it must also compress some other input of length x into a form that is longer than x.

Or, to put it another way: the sum of len(f(y)) over all possible y with len(y) = x is always greater than or equal to the sum of len(y).


Let's say you wanted to encode a single decimal digit using this PI method. So, to start, you list out PI until you get all the possible decimal digits:

    03 14159 26535 897
    01 23456 78901 234
There, that's all of them. I cheated a bit and put a zero in the front, as the first zero in the expansion is another 10 digits or so away. So now, we can represent any decimal digit by indexing into PI!

    0 - 0
    1 - 2
    2 - 7
    3 - 1
    4 - 3
    5 - 5
    6 - 8
    7 - 14
    8 - 12
    9 - 6
So, what did we accomplish? Well, zero and five stay the same. We successfully renamed 1-4, 6, and 9 to be new single digits. And now 7 and 8 take two decimal digits.

Now, extrapolate this out to every possible 100GB file, and the binary representation of PI.
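
The renaming table above can be reproduced in a couple of lines (Python):

    # First index of each decimal digit in the (zero-prefixed) expansion above.
    PI_PREFIX = "0314159265358979"

    for d in "0123456789":
        print(d, "->", PI_PREFIX.index(d))   # matches the table: 7 and 8 need two digits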


For most candidates, the binary representation of the location will be 100GB or more.


I bet I could take a few moments and come up with a proof that such a method would require P=NP.


I can't find an authoritative source on this, but the algorithm used by SuperPi: https://en.wikipedia.org/wiki/Gauss%E2%80%93Legendre_algorit... seems polynomial in the number of digits you want to calculate, at first glance.


Even if that's the case, wouldn't the number of digits of pi that must be computed to find an occurrence of a random bit string of length n not be polynomial in n?


And what will you do when you know that 0xdeadbeef is located at index 54'896'523'871? Store the index as a 64-bit number? ;) Edit: typo


Could have made it even better by considering individual bits separately, and using their positions.


When I was young I had this idea that any hard drive can be compressed into 100 bytes. The compressed data is a 4-dimensional vector; each component of the vector is a 25-byte floating point number, and the vector represents the space-time coordinates of the hard drive. (For example, my hard drive on March 3, 1994 at 23:00:45.456, at a specific place in Budapest.) The extractor algorithm just has to simulate the universe from the big bang up until the given time, read the state of the atoms at the specified location, recognize the hard drive, and read the data from it. (Provided that the universe is deterministic, and what seems to be random in quantum mechanics can be simulated with a pseudorandom number generator.)


That's quite clever. Of course, only 2^800 hard drives can actually be represented in this way, but it would, by definition, be in principle capable of covering everything we would ever desire to place on a hard drive anywhere (during the relevant time period, in the relevant space, to the relevant resolution...).

One drawback (on top of the astronomically slow extraction time...) is that this still requires us to go ahead and actually write the data in uncompressed form onto a hard drive somewhere. So storage space is still taken up at some point in the world, but at least we can erase and reuse that space as soon as we like, and transmit the data elsewhere with minuscule bandwidth.

Of course, all this really amounts to "Any data you are actually interested in presumably has a short description (e.g., something like 'A complete audio and video recording of every occurrence in Eurasia over the 10,000 years beginning with 6000 BC') and thus, in perfectly compressed form, you will never have any need for large numbers of bits". Or, put another way, "There can't be more than [some reasonably finite number] pieces of data you will ever actually be interested in in your very finite life, so as far as you're concerned, log_2([said number]) bits suffices for everything".


Even if all your assumptions hold, of course, you can still only represent 2^800 universes, and thus only 2^800 different file systems. That's probably sufficient for cases with infrequent read-write cycles, assuming that your floats tend to cover useful time-space coordinates (i.e. where and when a hard drive could plausibly reside). I'm not sure how "space-time coordinates" would work (where is the origin and what is the granularity?), so I don't know if it's likely.

That being said, I would be more concerned by the philosophical and practical consequences of such capability, namely the ability to read hard drives in normally inaccessible locations and hard drives in the future.


" That's probably sufficient for cases with infrequent read-write cycles, assuming that your floats tend to cover useful time-space coordinates (i.e. where and when a hard drive could plausibly reside). I'm not sure how "space-time coordinates" would work (where is the origin and what is the granularity?), so I don't know if it's likely."

Now I have done a little calculation:

The age of the universe is 8 * 10^60 in Planck time units:

http://www.wolframalpha.com/input/?i=universe+age+%2F+planck...

So if the simulation is, for example, something like a cellular automaton with a Planck-time (5.4 * 10^(-44) sec) step size, then assuming we use fixed-point numbers, 26 bytes suffice. (25 bytes are just not enough: 256^25 = 1.6 * 10^60.)

Surprisingly, the size of the universe is 5.4 * 10^61 in Planck-length units:

http://www.wolframalpha.com/input/?i=planck+length+universe+...

So again we need 26 bytes, for a super fine representation. I was one byte off on each component, so we need 104 bytes altogether.
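
The arithmetic checks out; a quick sanity check of those two estimates (Python):

    # Bits (and bytes) needed for one coordinate in Planck units.
    import math

    for label, v in [("time", 8e60), ("length", 5.4e61)]:
        bits = math.log2(v)
        print(label, "->", round(bits, 1), "bits ->", math.ceil(bits / 8), "bytes")
    # time   -> 202.3 bits -> 26 bytes
    # length -> 205.1 bits -> 26 bytes (and 256**25 ~ 1.6e60 is indeed too small)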


For what it's worth, despite the common perception that the Planck length is the grid scale on which the world is quantized, or the smallest measurable distance, or various such things, the Planck length is not actually known to have any particular physical significance; it's just the length that falls out of combining various other constants of significance. It's not unreasonable to guess that it will eventually turn out to have some significance in itself, but if there is any, it is as yet unknown.


It sounds like you probably don't need or want floats then. Just a uniform number format for the coordinates should suffice.


Well, let's assume that the past is constantly changing. The path leading from the start of the universe to the writing of your 4D vector would also change, and therefore your vector itself might change automatically with every change to space-time. Or would that still be called deterministic?


Could you explain how/why you think the past is constantly changing?


Good point. It might be best to use a 5D vector to account for this.


"the universe is deterministic" - have you developed your thoughts on this since?


Not too much, just one thing: if we could simulate the universe, then of course there would be a simulation in the simulation, etc... And maybe there is a fundamental yet undiscovered 'law of computation' that a sufficiently complex simulation can embed itself only on a slower timescale, so even if everything is deterministic, we cannot, even theoretically, calculate fast enough to predict the future.


Embedding itself is one thing, but what about embedding a small portion of itself? Suppose the area to be simulated was merely the Solar System (with the rest faked), and the size of the computer far larger...

Very hypothetical of course :)


If you're going to use a normal number for this purpose, why not choose a much nicer one? Let's use a number such that its binary representation is the concatenation of consecutive ascending binary numbers.

    0 1 10 11 100 101 110 111 1000...
becomes

    0.0110111001011101111000...
It's much easier to demonstrate that this number is normal than to do so for pi. It's also much easier to calculate the nth digit, and to find an occurrence of a given string of bits.
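
For illustration, building and searching that constant takes a few lines (Python):

    # The binary Champernowne-style constant: concatenated binary integers.
    def champernowne_bits(n_numbers):
        return "".join(format(i, "b") for i in range(n_numbers))

    bits = champernowne_bits(100_000)
    print(bits[:22])             # 0110111001011101111000 -- as above
    print(bits.find("110111"))   # first occurrence of an arbitrary target pattern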


Obligatory Dinosaur Comics: http://www.qwantz.com/index.php?comic=353

"You can't copyright a fact (like a number), but you can copyright a creative work, like a song or a piece of software. But since one can be transformed into another, copyright law is logically INCOHERENT."


What Colour are your bits?

http://ansuz.sooke.bc.ca/entry/23


Wow, great article!

Does a number that represents a range within Pi also get coloured as copyrighted, if your intent while computing it was to search for a copyright-coloured sequence of bits within Pi?

I think that for a lawyer it doesn't really matter; it doesn't matter which technique you're using to encode your data, as long as somebody can access that content and you did that on purpose. They're good at dealing with loosely defined things like intent, better than with formally defined things.

I wonder what amount (and progress) of AI research has been done in this area; not all illogical things are bullshit, however you might feel about them. Anti-digital-rights sentiment (disclaimer: I personally deplore many of the consequences of enforcing digital rights, so I share that sentiment on many levels) can sometimes cloud judgement, and I've seen many people invoke rational thinking so well they successfully miss the point.


Oh! I love this article! Very enlightening. Thanks for reminding me about it!


Godel's lesser-known Nonliability Theorem has as of yet failed to gain much traction in the MPAA.


You can't copyright a number, but you can make it illegal[0].

[0]:https://en.wikipedia.org/wiki/Illegal_number


Isn't this one of the April 1st jokes from 2012? Most of the commits were made on March 31, 2012[1]. And there's even a reference to the pi joke[2].

[1] https://github.com/philipl/pifs/commits/master

[2] http://www.netfunny.com/rhf/jokes/01/Jun/pi.html


Great, now we're going to see a DMCA takedown for π as it contains copyrighted content.


Yeah, and bomb building instructions and child pornography.


As with other lossless compression algorithms, there always exist some blobs of data for which the length of the location plus metadata exceeds that of the original blob, due to the pigeonhole principle. The trouble in the case of pifs is that we probably will not know whether the location is longer until it has actually been computed.


Quantum computing will abstract that away so we don't need to know whether the location is longer or not before using it.


I love pieces of code like this; it appeals to me in the same way sleepsort does. A superficial understanding might make you think it would be worth it, but really, while it may work, it's better left as a joke.


I'm skeptical that this could really save any space. Just speculating here, really, but it seems like on average the amount of space needed to store the starting index of an arbitrary string of digits in pi should be greater than (or at least comparable to) the size of the string itself.

e.g., the first instance of "256" in pi starts at the 1750th digit. So in that case you're getting a 'compression' rate of -33% if we go by the count of decimal digits used.


To be fair, it compressed my 93 GB file into 6 bytes. Incidentally, the file stored the first 100 billion decimals of pi.



Nope, sry: "In this implementation, to maximise performance, we consider each individual byte of the file separately, and look it up in π."


It's obvious that, like many compression algorithms, this one is best suited for certain specific applications such as yours.


You should really consider commenting on the project pointing this out to the author; I think he'll be disappointed to know that he hasn't discovered a way to magically compress random data.

That said, I wonder if "but remember all that storage space we saved by moving our data into π? Why don't we store our file locations there!?!" wasn't meant as some kind of hint...


... and that's the joke.


Don't worry, it scales well!


Hooli is gonna be pissed when they learn that Pi-ed Piper nailed a compression algorithm.


I had played with this idea some time back and gave up after some very specific flaws became clear.

A 5-digit combination is most probably found in Pi at a location above 10000; for example, I once located my 6-digit phone number at position 685214, which was not actually helpful at all.

Further, we are not sure that Pi is normal, hence the better idea would be to use a simple computable normal series.

Just yesterday I uploaded a paper that presents an idea for compressing random data: https://www.academia.edu/7620004/Advanced_Compression_Techni... It proposes pushing multiple bytes, represented by positions in a computable number series, into a small representation, and generating them on the fly when required. (It needs a lot of improvement to actually apply.)


I'm not sure I understand how this would compress files. I mean, the only way it could is if the decimal place in Pi at which the byte occurs is significantly less than the value of the byte itself. Statistically, this would happen less than 50% of the time, the other 50+% of the time occurring at a higher decimal place. I don't see this providing any real compression benefit.

For example, the byte 0xFF, which is the number 255, first occurs at the 1168th digit of Pi. This means instead of storing 255, you're now storing 1168, or 0x490, requiring an extra half-byte. However, 0x328, or the number 808, first occurs at the 105th digit of pi, or 0x69, requiring one less half-byte.

How does this system provide better compression? The way I see it, the best-case scenario would be if no sequence from 000 to 255 were ever repeated in Pi (or rather, not until every pattern in that sequence has been covered); in that case the compression ratio should be exactly 0%, no net gain or loss.
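
You can check that intuition over all 256 byte values (Python; mpmath for the digits is my assumption, and the 3-digit decimal encoding mirrors the examples above):

    from mpmath import mp

    mp.dps = 100_000
    digits = str(mp.pi).replace(".", "")

    raw = stored = 0
    for b in range(256):
        idx = digits.find("%03d" % b)
        if idx < 0:
            continue               # not in this window; would need more digits
        raw += 3                   # digits in the plain 3-digit form
        stored += len(str(idx))    # digits in the index we store instead
    print(raw, stored)             # stored > raw: the "compressed" form is bigger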


I've been looking for the lyrics to the song that, when sung, will bring about peace on this planet. Now to hear that the file containing these lyrics is already contained in pi is revelatory. Could someone please give me the index and length of the file? I've got some singing to do.


As Dr. Ellie whispers in Contact, "No words to describe it", and in The Dark Crystal, Jen's song is without words.

I think the lyrics are contained in the celestial music itself, and such music is contained in the silence. Silence, or 0, is the holy grail of 100% compression, creatio ex nihilo. Perhaps you could find such a song in Pi in a lifelong quest, but it's a much deeper mystery that Pi and 0 are mysteriously related. Euler is rumoured to have remarked it to be proof of God's existence. But even if such proof exists, it's of no value compared to its beauty.

"After proving Euler's identity during a lecture, Benjamin Peirce, a noted American 19th-century philosopher, mathematician, and professor at Harvard University, stated that "it is absolutely paradoxical; we cannot understand it, and we don't know what it means, but we have proved it, and therefore we know it must be the truth." Stanford University mathematics professor Keith Devlin has said, "Like a Shakespearean sonnet that captures the very essence of love, or a painting that brings out the beauty of the human form that is far more than just skin deep, Euler's equation reaches down into the very depths of existence."

http://en.wikipedia.org/wiki/Euler's_identity


This reminds me of Frederik Pohl's [1] book The Gold at Starbow's End, in which Gödelization [2] is used to compress a huge message into a very short one. There's a brief description of that part of the book at MathFiction [3]

[1] http://en.wikipedia.org/wiki/Frederik_Pohl

[2] http://www.encyclopediaofmath.org/index.php/Gödelization

[3] http://kasmana.people.cofc.edu/MATHFICT/mfview.php?callnumbe...


As many have pointed out using the pigeonhole principle, it is not practical to create a compression index (a lookup index where you map actual data to some kind of addresses, preferably smaller than the sequences) using every possible n-byte sequence of your data!

Because your index would be at least as large as your original data.

The only way to get a smaller compression index is to look for recurrences, and to include only the most recurrent sequences up to some cutoff (there is a tradeoff and an optimal cutoff for the compression ratio), leaving the other sequences uncompressed. Only this way can you achieve compression ratios smaller than 100%.


I scanned the responses and saw only one that mentioned that pi is not proved (and possibly even provably not) normal. That comment was downvoted.


If anyone here hasn't read The Library of Babel yet, then now is a good time.

Here's a link in case you have trouble locating it within Pi: http://hyperdiscordia.crywalt.com/library_of_babel.html


This project was posted before and hasn't been updated since. I doubt that it is still in development


It's also pretty obviously an elaborate mathematics joke.


Yes, but it's conceivable. It's like a code. You could securely save files with two numbers: just a position and a length.


It would only be secure if nobody knows what the position refers to. And the position is likely to be longer than the data, so you might as well use proper encryption.


Because nobody likes Pi anymore, we need the Tau filesystem.


Nah, e is obviously the best transcendental number.

I hope the author's pending patent is limited to pi though, so I can work on ef^H^H. Hmm, I see why maybe the author preferred pi over e.


This is extremely clever and something I have wanted to do for a while. If you are interested in contributing to a fun, small Clojure project, stop by: https://github.com/bmhimes/clojure-pifs


Reminds me of Jan Sloot: http://en.wikipedia.org/wiki/Jan_Sloot. It was like an April Fools' joke, but a lot of big people fell for it at the time.


If you use base 11 you get the added bonus of proving the existence of God! (http://en.wikipedia.org/wiki/Contact_(novel))



I prefer the infinite monkey database [1] myself.

[1]: https://github.com/brycebaril/infinite-monkey-db


I'm not sure pi is proven to contain every sequence of digits. Anyone care to link a proof? The joke may be on them, and they might not really understand pi at all.


It isn't proven to be "normal" (http://en.wikipedia.org/wiki/Normal_number). There is no guarantee that any particular sequence is in pi until you've searched and found it. It's a very common fallacy that because the expansion is infinite and non-repeating it should contain every possible sequence; very simple counterexamples exist.

Pi could be infinite and non-repeating (as it's irrational) and yet only sparsely contain 5s after the 100 trillionth digit (or however far we've calculated it so far). Unlikely, but not ruled out.

There are some normal numbers that are known, but they seem hard to construct (to me, not a number theorist), like Champernowne's number, which is the concatenation of all natural numbers (clearly any number will be in its expansion by definition, but it won't be very good for compression purposes due to the indexing issues highlighted elsewhere).

Some further reading: http://math.stackexchange.com/questions/216343/does-pi-conta..., http://mathoverflow.net/questions/51853/what-is-the-state-of..., http://en.wikipedia.org/wiki/Complexity_function.


If the idea that an infinite, non-repeating pattern might not contain every possible digit seems strange, consider Penrose Tilings [1]. Penrose Tilings are infinite geometric patterns which never repeat, yet clearly don't contain every image known to man.

[1] http://en.wikipedia.org/wiki/Penrose_tiling


Fascinating implementation, I must admit. How does I/O performance change as the byte length of chunks varies from 3 to 200 bytes?


I think it would probably take forever for the initial lookup, because the probability of matching any 3-byte sequence is higher than matching a 200-byte sequence?


Literally forever, right?

It's basically scanning a random byte-stream for a 200-byte-long exact match. 200 bytes is 1600 bits, or 2^1600 different possible sequences, making the odds 1/2^1600 that any particular 200 bytes pulled out will match the bytes you are looking for.
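
Back-of-the-envelope (Python; big ints make the scale easy to see):

    # Expected positions to scan before matching n random bytes is ~256**n.
    n = 200
    expected = 256 ** n                # == 2**1600 positions on average
    print(expected.bit_length())       # 1601 -- the offset alone outweighs the data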


In fact, it's still not known if pi is normal (contains all finite patterns of digits[π]), so you can't guarantee that any search will terminate.

π: Not quite the definition of normal, but implied by it.


Even if pi isn't normal, there are plenty of normal numbers to choose from (almost all of the reals are normal, in fact), including some really simple and predictable ones like Champernowne's constant (in base 10: 0.1234567891011121314...) that would support simpler index calculations than pi.
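
For what it's worth, "simpler index calculations" is an understatement: the position of any integer in Champernowne's decimal constant has a closed form (Python sketch):

    def champernowne_index(n):
        """0-based digit index where the integer n begins in 0.123456789101112..."""
        d = len(str(n))
        idx = sum(9 * 10 ** (k - 1) * k for k in range(1, d))  # all shorter numbers
        idx += (n - 10 ** (d - 1)) * d                         # earlier d-digit numbers
        return idx

    print(champernowne_index(10))   # 9 -- "10" starts right after "123456789"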


This is genius!

Am I right in assuming that the decompression step is several orders of magnitude faster than the compression phase?


Smartest thing I've seen in months; not so much the speed, but the compression value of it is great.


really cool idea, but not sure if actually true: http://math.stackexchange.com/questions/216343/does-pi-conta...


If you have this large base file with the digits of pi, you could use it to compress data, right? And with current internet speeds, pi storage in the cloud could be an option. Or hell, even distributed pi files :)


The location of the data is no less complex than the actual data.


His Weissman score must be off the charts.


has anybody run bonnie++ benchmarks with this fs?


> They said 100% compression was impossible? You're looking at it!

If the offset within pi is so large that any representation of it is larger than my data?


Yep. Consider the minimum case: assume we've described a process for finding any bitstream we want in pi while necessarily saving at least one bit. Attempt to do so for the bitstreams 00, 01, 10, and 11. If we compress 2 bits to 1 bit, by the pigeonhole principle, at least one of 0 and 1 has to represent at least 2 distinct bitstreams, which means we have lost data.

A similar argument works for all compression algorithms and all sizes. It is flatly impossible to compress all data all of the time.
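
The two-bit case is small enough to check exhaustively (Python):

    # Every possible mapping of the four 2-bit strings to 1-bit outputs collides.
    from itertools import product

    for mapping in product("01", repeat=4):    # one output bit per input string
        assert len(set(mapping)) < 4           # at most 2 distinct outputs for 4 inputs
    print("no injective mapping exists -- some data is always lost")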


"It is flatly impossible to compress all data all of the time." Might I add "...such that the uncompressed data is recoverable."

Pedantically speaking, any data passing through a cryptographic hash algorithm is being compressed.


Pedantically, that's not compression.



Case in point: http://pi.nersc.gov/cgi-bin/pi.cgi?word=alix&format=char

strlen(alix) < strlen(1188606148)


Shush. You'll ruin it :)



