
πfs – A data-free filesystem - rnhmjoj
https://github.com/philipl/pifs
======
jboggan
I've been working on an alternative implementation of this, but using tau
instead. I'm not sure my calculations are correct, but I think it offers a
2x speedup over this method.

It's still in stealth mode but I'm hoping to unveil it sometime in late June.

~~~
abulman
Since the code already exists within pi - or tau - surely all you have to do
is find the working code within it.

I'm sure that you can do that before noon of the first of next month.

~~~
jboggan
Actually I was going to unveil another project that day. I've been excitedly
working on this idea since I had the epiphany last night. So, I was checking
our logs over the weekend and noticed that no jobs or processes ran between 2
and 3 AM early Sunday morning. I thought it was a fluke of our system, but I
started poking around in unrelated logs and I found the same thing. I even
looked at a couple of public systems and saw the same thing. Now I don't know
why this is but apparently NO Unix systems (and I think maybe Windows too) ran
any processes during that window. Is it some bug in the OS? Maybe, but I can
exploit it.

So I've designed a once-a-year data processing system. It's going to queue up
all your requests throughout the year and save them for this "quiet window"
and then distribute all the jobs across all network connected computers
running this new code. I think I'll make a pull request to Linux for maximum
coverage. The best thing is, there is absolutely no performance hit or impact
on everyone's servers because they're ALREADY not being used.

~~~
hinkley
I heard that all the servers in Arizona still processed data during that time.
Sounds like a conspiracy to me.

~~~
hueving
Arizona is able to compute during that time period due to the residual heat in
the desert providing energy.

------
Retr0spectrum
This concept reminds me of
[https://libraryofbabel.info/](https://libraryofbabel.info/)

"At present it contains all possible pages of 3200 characters, about 104677
books."
([https://libraryofbabel.info/About.html](https://libraryofbabel.info/About.html))

~~~
jlebar
* "At present it contains all possible pages of 3200 characters, about 104677 books." *

Copy-paste error caused a dropped exponent. It's 10^4677 books.

~~~
Retr0spectrum
Oops, it's too late to edit now.

------
cyanbane
I think it is fair to say my gene sequence is somewhere in there. It is also
fair to say that my gene sequence is somewhere with alterations where my
stomach processes Chipotle burritos with less... fanfare. So it sounds like I
need to come up with some kind of Cas9-ish process for finding this better
burrito processing version of my gene sequence inside pi. The better version
of my 'life of pi' for burritos.

------
torvald
Of the more obscure file systems, pingfs is still my favorite.
[https://github.com/yarrick/pingfs](https://github.com/yarrick/pingfs)

~~~
pimlottc
> pingfs is a filesystem where the data is stored only in the Internet itself,
> as ICMP Echo packets (pings) travelling from you to remote servers and back
> again.

So... if your Internet connection goes down, you lose all your data?

~~~
jonknee
> So... if your Internet connection goes down, you lose all your data?

So it is like every cloud service...

------
d4nt
Supposing I found the index location of the next Star Wars movie in pi, would
it be copyright infringement for me to mention that on a forum?

~~~
BinaryIdiot
I'm not sure it would be, but I'm not expecting this to happen during any of
our lifetimes :)

~~~
Terribledactyl
Let's redefine copyright timelines in terms of computational effort to find
its location in pi.

------
univacky
Do we know that every possible finite sequence exists in pi?

~~~
falcolas
IIRC, since there are an infinite number of digits in Pi, and there are no
infinitely repeating finite sequences in Pi, there are therefore an infinite
number of finite sequences in Pi.

Granted, the index might take more space to represent than the actual
sequence you're indexing, but...

~~~
skykooler
"An infinite number of finite sequences" does not mean "all finite sequences".
For example, the number 1.10100100010000100001... contains an infinite number
of finite sequences, but does not contain, for example, any sequences with a
2, or even the sequence '10101'.
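
Both claims about that number are easy to verify mechanically; a quick Python
sketch generating the first few dozen runs of the pattern bears them out:

```python
# The number 1.101001000100001...: a 1 followed by ever-longer runs of 0s.
digits = "".join("1" + "0" * k for k in range(1, 60))

assert "2" not in digits       # no digit 2 anywhere
assert "10101" not in digits   # '10101' never appears either
assert "1000001" in digits     # yet ever-longer distinct sequences do
```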

~~~
aesthetics1
You cannot bend the rules in this way and still equate it to the question at
hand - you're introducing characters outside of base2 in your example. If you
stuck within the realm of base2, '2' in its binary form certainly appears in
the pattern.

Not only that, but Pi is believed to be normal (all digits distributed
evenly), and your sequence clearly is not.

I understand what you are saying - indeed there are infinite finite sequences
- but it just does not apply here.

~~~
evanb
The example isn't base two. It's base 10. It just happens to be a number that
only has 0s and 1s. Like eleven. Or 1000.

------
ape4
I wonder if it might be easier to start with an immutable tree of typical
office documents. Then the filesystem just stores the diff. E.g. you save a
memo document and it's pretty much the same as one of the standard memos.
Think of the savings.

------
drivingmenuts
Someone needs to alert the White House about this. Terrorists implementing
piFS will have access to all the secrets of the US.

The movie industry should just shut down now. Any two people implementing piFS
will have already shared all possible movie files, including future,
unreleased films.

Your medical records are now available to anyone with time and patience.

The horror!

(Seriously, we should see if we can wind up Spicer about that first one and
see if he explodes).

------
jlas
> Now, we all know that it can take a while to find a long sequence of digits
> in π, so for practical reasons, we should break the files up into smaller
> chunks that can be more readily found. In this implementation, to maximise
> performance, we consider each individual byte of the file separately, and
> look it up in π.

So basically like a regular filesystem, except now you look up each byte
every time you intend to use it?

~~~
jerf
This is a reference to a common situation that arises anywhere on the
internet that compression is discussed. There is a segment of people who seem
to find it personally, mathematically, or even perhaps morally offensive that
there can exist no algorithm that can take all possible inputs and be
guaranteed to emit something compressed by at least one bit. This seems to be
true despite the fact that the proof of this statement is very, very simple;
it can be sketched on a literal post-it note and you could easily convince an
open-minded 10-year-old of its truth.

In those situations, you can bet money that some compression scheme will be
proposed that fits into at least one of the two categories "did not check to
see if it actually compresses data" or "works by hiding bits where you forgot
to count, but they still count".

"I'll just give an offset into pi" is in the first category; it does,
technically, work, in the sense that it can validly represent the original
data, but it turns out that in order to represent the offsets into pi it takes
more bits than the content had in the first place. (You can get lucky every
once in a while, and store 14159265 as "0/8", but the vast, vast majority of
sequences get expanded.) _Everyone_ proposes this one. It's almost funny how
often it gets proposed, given that it is actually a very slow method for
making your data bigger. Don't hold your breath for your Fields Medal for that
one.[1]
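
The offset blow-up is easy to see empirically. A minimal Python sketch, using
a random hex stream as a stand-in for pi's digits (which are believed, though
not proven, to be statistically similar), stores one offset per byte the way
πfs does:

```python
import random

random.seed(0)
# Stand-in for pi's hex digits: a long random hex stream (pi is believed,
# though not proven, to behave statistically like one).
stream = "".join(random.choice("0123456789abcdef") for _ in range(1_000_000))

data = bytes(range(256))  # a sample "file": every byte value once

# One offset per byte, as pifs does.
offsets = [stream.find(f"{b:02x}") for b in data]
assert all(o >= 0 for o in offsets)

# Offsets are variable-sized, so store them fixed-width to stay decodable.
width = max(o.bit_length() for o in offsets)
print(f"file itself:   {len(data) * 8} bits")
print(f"offsets alone: {len(offsets) * width} bits ({width} bits each)")
```

The offsets alone already take more bits than the file, before counting any
bookkeeping such as chunk lengths.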

An example of the second case is "I'll just store the first few bytes in the
filename of the file and then 'forget' that I have to count the filenames in
the size of the compressed content". Vigorous, month-long debates about
whether or not something "counts" as part of the content will follow, which
can be resolved by pointing out that the challenge is basically to give a
stream of bytes over the network to a computer that only has the algorithm on
it that fully reconstructs all the content, but this rarely satisfies Our Hero
who has discovered perfect compression.

I add the word "moral" not just as a jab, but as the only way I have to
explain how these people react to things like pointing out "if that worked, I
could compress all files to one bit/one byte" or "I can't actually reconstruct
the original content with that information". They get _offended_.

[1]: Just for fun, let's implement the ideal "index into a sequence"
compression algorithm. This is not written to proof standards, but "give the
reader the flavor of what is going on" standards. (Technically it assumes the
proof, in an attempt to illuminate it from another direction.) The problem
with pi is that it tends to repeat a lot. Let's chunk up 8 bits into a byte,
since that's pretty common, and let's build the idea sequence of bytes that we
can index into easily. For completeness, let's just index the 8-bit bytes in a
one sequence:

    00000000000000010000001000000011

and so on, 0, 1, 2, etc, up to 255. The most efficient way to index this list
is to break it up along the 8-bit boundaries, because you'll find (if you try
it) that any more clever attempts to exploit overlapping numbers will end up
at best matching that performance and at worst taking more bits. Therefore, we
can index into that list to get a 0 as... 0. And the index for 1 is... uhhh...
1. And 143 is... 143. And so on.

You'll find this approach, extended into as many consecutive bits in a row as
you please, will _wildly outperform_ pi as a sequence to index into to get
values. Because this is, of course, just the identity transform, and pi in
practice does _way worse_ than that. This transform is also wildly more
efficient than the pi indexing operation, having computational complexity
O(zero), if you'll pardon me the abuse of notation.
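
The footnote's point can be made concrete in a few lines of Python: building
that sequence and indexing along 8-bit boundaries, the "compressed" index of
each byte is just the byte itself.

```python
# The "ideal" lookup sequence: every 8-bit value, concatenated in order.
seq = "".join(f"{b:08b}" for b in range(256))

def aligned_index(byte: int) -> int:
    # Search only along 8-bit boundaries, as the footnote suggests.
    pattern = f"{byte:08b}"
    for i in range(256):
        if seq[8 * i : 8 * i + 8] == pattern:
            return i
    raise ValueError("byte not found")

# The index of byte b is just b: the identity transform.
assert all(aligned_index(b) == b for b in range(256))
```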

~~~
TeMPOraL
> _the proof of this statement is very, very simple; it can be sketched on a
> literal post-it note and you could easily convince an open-minded 10-year-
> old of its truth_

Could you write that proof here, for the benefit of tired HNers who are just
finishing their day of work?

~~~
rmidthun
Pigeonhole Principle. The number of messages of bit length N is 2^N. You are
trying to map these into 2^M compressed strings, M < N. By the Pigeonhole (or
Dirichlet's Box) Principle, at least one compressed string will have more than
one message mapped to it. So it cannot be reversed.
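
Spelled out for N = 3, which is small enough to enumerate exhaustively in
Python:

```python
from itertools import product

n = 3
# All 2^n possible messages of n bits:
messages = ["".join(bits) for bits in product("01", repeat=n)]
# Every possible strictly shorter output, including the empty string:
shorter = ["".join(bits) for k in range(n) for bits in product("01", repeat=k)]

# 8 messages but only 1 + 2 + 4 = 7 shorter strings: any compressor that
# always shortens must map two messages to the same output, so it cannot
# be reversed.
assert len(messages) == 2 ** n
assert len(shorter) == 2 ** n - 1
```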

~~~
jerf
Yup, that's the one.

The "literally sketchable on a post-it note" is that you can show this for 1
bit, 2 bits, and 3 bits pretty easily, and 4 bits if you sketch small enough.
At that point it's not hard to see this goes to infinity, maybe not rigorously
enough to pass for a proof, but pretty strong. (It's a fairly small step to an
inductive proof from there, though this verges on a trivial proof by the time
you've laid all the bits out.)

------
pimlottc
This brings to mind a practical question - what's the largest valid file
that's been located in the digits of pi? Has anyone yet found a valid jpg,
say? Or gif? PDF? I suppose a text file would be easiest.

~~~
dannypgh
"Text files" are perhaps underspecified (what character set?) but a good
fraction of bytes would, by themselves, count as valid text files.

~~~
pimlottc
We can be generous and say any common character set - ASCII, Latin-1, UTF-8,
UTF-16, etc. It'd be interesting to see the longest intelligible phrase found
so far.

------
stdgy
Haha, I love it.

To anyone confused: look at today's date!

~~~
Twirrim
They were talking about Pi day on the radio. "Today is the day the whole world
celebrates Pi day". No, it really doesn't. It's just the US, the only country
in the world that uses the MM/DD/YYYY order.

~~~
random_rr
Well, since you're being pedantic... this _is_ the only day the whole world
celebrates Pi Day, because it only occurs in the US date format. So... their
statement is correct.

~~~
iandioch
We can celebrate pi day too, just to less accuracy. The third of January?

~~~
oneeyedpigeon
Or, you could celebrate pi time every single day, to as great a level of
accuracy as you choose :)

------
omginternets
The indexing kinda sucks though. I can never seem to find what I'm looking
for.

------
narrator
If you could convert all algorithms to constant-time operations, you could
simulate the universe at any point in the future or the past and just read
the bits out of simulated future RAM, as long as a correct algorithm was used
to read and write bits to disk.

------
kukx
I wonder, assuming the computation was instantaneous, how effective it would
be for compression.

~~~
greggyb
If computation is instantaneous, then you could recurse down to a single
pointer, which points to a pair of pointers, which each point to another pair,
and so on until you have an arbitrary amount of your data. Since computation
is instantaneous, it is instantaneous to compute this first pointer for the
contents of any hard disk. And since computation is instantaneous, rebuilding
the files on the disk from that pointer is instantaneous.

Your compression ratio would approach infinity.

~~~
greggyb
Nevermind, I neglected to take into account the length of that pointer:

[https://news.ycombinator.com/item?id=13870098](https://news.ycombinator.com/item?id=13870098)

------
gesman
The problem with that is that the length of the offset pointing to your data
is going to be geometrically (or worse?) bigger than the size of your data.

Thus - storing and maintaining the offset in π becomes more expensive than
storing the data elsewhere.

------
remx

    As long as you know the index into π of your file and its length, it's a
    simple task to extract the file using the Bailey–Borwein–Plouffe formula.
    Similarly, you can use the formula to initially find the index of your file.

    Now, we all know that it can take a while to find a long sequence of digits
    in π, so for practical reasons, we should break the files up into smaller
    chunks that can be more readily found.

    In this implementation, to maximise performance, we consider each individual
    byte of the file separately, and look it up in π.


I'm not entirely convinced my files would be present in those smaller chunks.
You would need a copy of π going way past the 5 trillion digit world record.
Even then it's not guaranteed the bytes needed would be there.

~~~
daxelrod
Bailey–Borwein–Plouffe lets you calculate the nth binary digit of pi without
calculating 0..n-1.

You would not need to store the entire number.

[https://en.wikipedia.org/wiki/Bailey%E2%80%93Borwein%E2%80%9...](https://en.wikipedia.org/wiki/Bailey%E2%80%93Borwein%E2%80%93Plouffe_formula)
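
The digit-extraction trick itself fits in a few lines. A minimal Python
sketch of the standard BBP spigot (float precision limits how deep it can
reach, but it never stores the earlier digits):

```python
def _series(j: int, n: int) -> float:
    # Fractional part of sum over k of 16^(n-k) / (8k + j).
    s = 0.0
    for k in range(n + 1):  # modular exponentiation keeps terms small
        s = (s + pow(16, n - k, 8 * k + j) / (8 * k + j)) % 1.0
    for k in range(n + 1, n + 8):  # a few tail terms suffice
        s += 16.0 ** (n - k) / (8 * k + j)
    return s % 1.0

def pi_hex_digit(n: int) -> int:
    """The n-th hexadecimal digit of pi after the point (0-indexed)."""
    x = (4 * _series(1, n) - 2 * _series(4, n)
         - _series(5, n) - _series(6, n)) % 1.0
    return int(x * 16)

# pi = 3.243f6a8885... in hexadecimal
print("".join(f"{pi_hex_digit(i):x}" for i in range(10)))  # prints "243f6a8885"
```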


------
daemonk
I work in bioinformatics. I remember writing a rough implementation of this
for the human genome during my PhD as a way to compress sequence information.
Turns out it doesn't perform better than gzip.

~~~
taejo
Did it perform better than _cat_?

------
725686
I bet that the index will be larger than the data for most cases... but then
you could store the index of the index... of the index, and the number of
times this goes on.

------
funkaster
lol, this reminds me of the time when I took the numerical calculus class at
uni and came up with the best compression algorithm ever: compress a file by
using the bytes as points, so the only thing you need to recreate the file is
to calculate the Newton polynomial for that set of points. Of course, that
also requires the metadata for the Newton polynomial, and since you don't
want a lossy algorithm, you need to make sure all points are represented.

------
psadri
Is this inspired by "Hard-Boiled Wonderland and the End of the World" by
Murakami?

A really great literary read btw.

------
ronreiter
Seriously for a second - can you mathematically prove that a certain sequence
does not exist in pi?

~~~
Someone
I don't think we know whether that question is even decidable, but we
certainly cannot (yet?).

As other posts already said, if, as many think, π is _normal_
([https://en.m.wikipedia.org/wiki/Normal_number](https://en.m.wikipedia.org/wiki/Normal_number)),
every sequence appears in it. If it isn't normal, the question is still open.

------
meehow
The README file is more hilarious than the source code.

------
calebm
Hey, if CPU is abundant and memory is limited, why not?

~~~
robotresearcher
Because it's likely that the size of the index into pi is larger than the data
itself.

~~~
calebm
Good point. Actually, that brings up an interesting question: what is the
mathematical relationship between the size of the input and expected offset?

~~~
ronreiter
Probably not a good one; it gets worse the more you increase your data size.
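
Concretely: in a (believed) normal sequence, a k-digit pattern first appears
around position 10^k on average, so the offset takes roughly as many digits
to write as the pattern itself. A quick Python sketch, with a random digit
stream standing in for π (an assumption, since π's normality is unproven):

```python
import random

random.seed(1)
# Random digits as a stand-in for pi's decimal expansion.
stream = "".join(random.choice("0123456789") for _ in range(2_000_000))

means = {}
for k in (1, 2, 3, 4):
    offsets = []
    for _ in range(200):
        pattern = "".join(random.choice("0123456789") for _ in range(k))
        pos = stream.find(pattern)
        if pos >= 0:
            offsets.append(pos)
    means[k] = sum(offsets) / len(offsets)
    print(f"{k}-digit pattern: mean first offset ~ {means[k]:.0f}")
```

The mean offset grows by roughly a factor of 10 per extra digit of data.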

------
mdekkers
πfs for Hadoop!

