
Three Types of Data - brundolf
https://www.brandonsmith.ninja/blog/three-types-of-data
======
typon
Out of the Tar Pit paper:
https://github.com/papers-we-love/papers-we-love/blob/master/design/out-of-the-tar-pit.pdf

It describes these categories in a lot more detail and the equivalents are:

Constants => User specifications

State => Essential state

Cached values => Accidental state

------
andrewflnr
This seems like a decent model for a lot of situations, but the difference
between "development time" and "run time" gets increasingly blurry in the real
world when you reload config live, do canary deployments of new code, etc. I
think there's really just state with a spectrum of different lifetimes, from
years to nano-seconds, and also cached/pure-functional data.

~~~
Twisol
Right: any batch processing program can be thought of as a pure function of
its inputs, but internally it is likely to be constructed as a series of
smaller programs. The constants of a lower-level program are the mutable and
cached state of the higher level program, frozen for the duration of
execution. And the subprograms may themselves instantiate and operate over
locally-scoped mutable state.

It’s a good model, and I think gets a lot of things right. But there are
definitely nuances.

~~~
brundolf
Yes, there's a certain amount of relativism to it. A function that's
externally pure may have its own internal State, just like an entire program
that's externally pure may have its own internal state. A piece of data
categorized this way must come with a "Relative to what?"
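
As a minimal sketch of that relativity (the function name `sumOfSquares` is made up for illustration): the accumulator below is mutable State relative to the function body, yet relative to any caller the function is pure.

```javascript
// Hypothetical example: `total` is mutable state *inside* the function,
// but callers only ever see a pure mapping from input to output.
function sumOfSquares(numbers) {
  let total = 0; // internal State, relative to this function
  for (const n of numbers) {
    total += n * n;
  }
  return total; // from the outside: same input, same output
}

const input = Object.freeze([1, 2, 3]);
console.log(sumOfSquares(input)); // 14
console.log(sumOfSquares(input)); // 14 again; nothing leaks between calls
```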

~~~
andrewflnr
> A piece of data categorized this way must come with a "Relative to what?"

This would be a good caveat to add to the original post. :)

------
brundolf
Author here: this is one of those ideas that's been stewing in my head for
quite some time and I'm only about 80% happy with how it came out as words.
Let me know if I can provide further examples or elaboration.

----------------------

Edit: I wrote this example in response to a comment which has since been
deleted, so I'll post it here instead

Let's say your program stores the positions of two entities that can change
arbitrarily over time:

    
    
      let pos1 = { x: 0, y: 1, z: 2 };
      let pos2 = { x: 1, y: 3, z: 0 };
    

And you also want to work with the distance between them:

    
    
      let distance = Math.sqrt(
        (pos2.x - pos1.x) * (pos2.x - pos1.x) +
        (pos2.y - pos1.y) * (pos2.y - pos1.y) +
        (pos2.z - pos1.z) * (pos2.z - pos1.z));
    

When do you do this computation?

If "distance" is thought of like any other state, it's unclear, and it's easy
for it to get out of sync with the values it's derived from. Maybe you have
some sort of core update or rendering phase and you re-compute it there. Maybe
you try and re-compute it every time one of the two values gets modified,
either by constraining their modification within methods or by somehow
observing their changes. Maybe you have a data structure that allows you to
easily compare them to the values the previous computation came from. Deciding
which of these strategies to take is non-trivial, but you can simplify the
question a little bit by seeing "distance" as not being a normal part of
state.

If you pull it out into a pure function:

    
    
      function getDistance(a, b) {
        // Uses only its arguments, so the function is actually pure
        return Math.sqrt(
          (b.x - a.x) * (b.x - a.x) +
          (b.y - a.y) * (b.y - a.y) +
          (b.z - a.z) * (b.z - a.z));
      }
    

then "updating" it becomes a singular, clear action:

    
    
      let distance;
      function updateDistance() {
        distance = getDistance(pos1, pos2);
      }
    

And then _when it gets computed_ becomes an independent question from _how it
gets computed_. You can update it eagerly, or lazily, or implicitly. You can
use comparisons, or observables, or whatever.
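
To make that separation concrete, here is a rough sketch of two of those strategies sharing one pure function. The names `makeEagerScene` and `makeImplicitScene` are invented for this example and are not from the post; positions are assumed to be replaced wholesale rather than mutated in place.

```javascript
// How it gets computed: one pure function, shared by both strategies.
function getDistance(a, b) {
  const dx = b.x - a.x;
  const dy = b.y - a.y;
  const dz = b.z - a.z;
  return Math.sqrt(dx * dx + dy * dy + dz * dz);
}

// When it gets computed, option 1 (eager): recompute on every write.
function makeEagerScene(pos1, pos2) {
  let cachedDistance = getDistance(pos1, pos2);
  return {
    setPos1(p) { pos1 = p; cachedDistance = getDistance(pos1, pos2); },
    setPos2(p) { pos2 = p; cachedDistance = getDistance(pos1, pos2); },
    get distance() { return cachedDistance; },
  };
}

// Option 2 (implicit): store nothing, derive the value on every read.
function makeImplicitScene(pos1, pos2) {
  return {
    setPos1(p) { pos1 = p; },
    setPos2(p) { pos2 = p; },
    get distance() { return getDistance(pos1, pos2); },
  };
}

const origin = { x: 0, y: 0, z: 0 };
const eager = makeEagerScene(origin, { x: 3, y: 4, z: 0 });
const implicit = makeImplicitScene(origin, { x: 3, y: 4, z: 0 });
eager.setPos2({ x: 0, y: 0, z: 2 });
implicit.setPos2({ x: 0, y: 0, z: 2 });
console.log(eager.distance, implicit.distance); // 2 2
```

Callers read `.distance` the same way in both cases, so switching strategies later does not touch `getDistance` or any code that uses the value.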

The benefit becomes more clear when the value isn't a simple number, but a
whole object or object graph. By making it immutable, you have much more
leeway when it comes to "refreshing" it, because you can guarantee you won't
be losing any meaningful information.

None of these ideas are especially novel or profound, but as a mental
framework they've shed a whole lot of clarity for me in my work over the last
year or two.

~~~
choward
> And then when it gets computed becomes an independent question from how it
> gets computed. You can update it eagerly, or lazily, or implicitly. You can
> use comparisons, or observables, or whatever.

But what about where distance is being used (read)? How do you know whether
updateDistance() has already been called? It seems that with this approach it
would still be important to have a function for accessing it. But then that
function would need access to some sort of state that says whether or not the
data is stale. Something like this:

    
    
        let isDistanceStale = false;
        let distance = 0;
        let pos1 = { x: 0, y: 0, z: 0 };
        let pos2 = { x: 0, y: 0, z: 0 };
    
        function updatePos1(newPos1) {
          pos1 = newPos1;
          isDistanceStale = true;
        }
    
        function updatePos2(newPos2) {
          pos2 = newPos2;
          isDistanceStale = true;
        }
    
        function getDistance() {
          if (isDistanceStale) {
            distance = calculateDistance(pos1, pos2);
            isDistanceStale = false;
            return distance;
          } else {
            return distance;
          }
        }
    
        function calculateDistance(a, b) {
          // Reads only its parameters, not the outer pos1/pos2
          return Math.sqrt(
            (b.x - a.x) * (b.x - a.x) +
            (b.y - a.y) * (b.y - a.y) +
            (b.z - a.z) * (b.z - a.z));
        }
    

Gross. This is why it's better not to make these kinds of optimizations unless
there really is a performance issue. Otherwise, just let it get computed every
time or compute it every time pos1 or pos2 changes. No need to do lazy
evaluation. If you really want that, there are languages that have it built
in. Otherwise you'll just be fighting the language.

~~~
brundolf
The idea would be that you abstract it as a pure function and then you can
layer on caching later, when you find out it's needed. The idea is not to
_always_ cache values, but to _recognize_ when that's what you're doing and
keep it separate from the rest of your state.
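
One way this layering might look, as a sketch only: `cacheLast` is a hypothetical helper, and it assumes positions are replaced with new objects rather than mutated in place, so a reference comparison is enough to detect a change. The call counter exists purely to show when recomputation happens.

```javascript
// Hypothetical helper: remembers only the most recent call.
// Assumes arguments are replaced, never mutated, so === detects changes.
function cacheLast(fn) {
  let lastArgs = null;
  let lastResult;
  return function (...args) {
    const hit =
      lastArgs !== null &&
      lastArgs.length === args.length &&
      args.every((arg, i) => arg === lastArgs[i]);
    if (!hit) {
      lastResult = fn(...args);
      lastArgs = args;
    }
    return lastResult;
  };
}

// The pure function stays exactly as it was before caching was added.
function getDistance(a, b) {
  const dx = b.x - a.x;
  const dy = b.y - a.y;
  const dz = b.z - a.z;
  return Math.sqrt(dx * dx + dy * dy + dz * dz);
}

// Demonstration only: count how often the real computation runs.
let computeCount = 0;
const cachedDistance = cacheLast((a, b) => {
  computeCount += 1;
  return getDistance(a, b);
});

const pos1 = { x: 0, y: 0, z: 0 };
let pos2 = { x: 3, y: 4, z: 0 };
cachedDistance(pos1, pos2); // computes 5
cachedDistance(pos1, pos2); // same references: served from the cache
pos2 = { ...pos2, z: 12 };  // replaced, not mutated
cachedDistance(pos1, pos2); // recomputes 13
console.log(computeCount);  // 2
```

The cached copy lives inside the wrapper, apart from `pos1` and `pos2`, and can be thrown away at any time without losing information.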

------
slx26
Nice, I've also thought a lot about cached values in the same way. In the past
I implemented a meta-programmed system in Ruby to deal with cached values so
they would be easy to use and automatically dropped when other related state
changed... and I really started to think about the concept of "derived state".
I feel it should be something implemented in common programming languages. I
believe it could be extremely helpful. Does someone know if this exists in
some language?

I don't know if there's any other tricky implementation part, but from what
I've seen you could simply define something like this:

    
    
      struct Human:
         name String
         birth Date
         derived age Integer
    
      function derive age:
         return (time.Now() - self.birth).Years().Floor()
    

And the compiler should have everything it needs. Sure, this example is pretty
annoying because time changes all the time, so it doesn't look like you can
cache much with such a naive approach, but I suck at examples (surely people
working on languages could come up with more interesting approaches, like
adding ways to schedule the cached value to be preferably kept until X time
later or whatever).
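
For comparison, JavaScript's accessor properties get part of the way there: a getter reads like a field but is computed on access. This is only a rough approximation of the idea above, since the language does no caching or invalidation on its own. The helper `ageInYears` is a name made up for this sketch, and it works in local time.

```javascript
// Pure helper: whole years elapsed between `birth` and `now` (local time).
function ageInYears(birth, now) {
  let years = now.getFullYear() - birth.getFullYear();
  const hadBirthdayThisYear =
    now.getMonth() > birth.getMonth() ||
    (now.getMonth() === birth.getMonth() && now.getDate() >= birth.getDate());
  if (!hadBirthdayThisYear) years -= 1;
  return years;
}

class Human {
  constructor(name, birth) {
    this.name = name;
    this.birth = birth;
  }

  // Derived, never stored: recomputed from `birth` on every read.
  get age() {
    return ageInYears(this.birth, new Date());
  }
}

const someone = new Human("Ada", new Date(2000, 5, 15));
console.log(ageInYears(someone.birth, new Date(2020, 5, 14))); // 19
console.log(ageInYears(someone.birth, new Date(2020, 5, 15))); // 20
```

Reading `someone.age` looks like a field access, but there is nothing to keep in sync because the value is never stored.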

------
sorokod
About cached values you say: "Synchronizing" them is always as simple as a
single, controlled operation.

Others have stated that: "There are only two hard things in Computer Science:
cache invalidation and naming things"

~~~
pkage
I don't think that's what he's arguing though—it's not a cache like memcached
or anything, it's much more abstract than that. He's saying that calculating
values derived from the program state should be as simple as possible, and not
have other side effects.

~~~
sorokod
Caching comes with a bunch of strings attached of which recalculating the
values is the least problematic.

~~~
Twisol
I think the name “cached value” is a red herring, as far as the mindset being
described is concerned. “Derived value” might be closer to the mark; the
values are pure functions of other things, and to change the derived value you
must change the things it depends on. There need not be any mention of storing
that derived value in a cache and somehow invalidating it.

------
twhitmore
To perhaps offer a slightly wider perspective on kinds of data & lifetimes:

- Constant data
- Long-term configuration
- Medium-term configuration
- Client/account data
- Business transactions
- Transitory processing work

------
fpoling
The cache here means a read cache. A write cache, that is, a delayed state
update for performance reasons, probably needs its own category.
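
A rough sketch of why it might not fit the read-cache category, assuming a plain `Map` as the backing store and a hypothetical `makeWriteBuffer` helper: until the flush, the pending writes hold information that exists nowhere else, so they cannot simply be dropped and recomputed.

```javascript
// Hypothetical write-behind buffer: writes are staged and applied in a batch.
function makeWriteBuffer(store) {
  const pending = new Map();
  return {
    // Cheap and deferred: nothing reaches the store yet.
    write(key, value) { pending.set(key, value); },
    // Staged writes take priority over what the store still holds.
    read(key) {
      return pending.has(key) ? pending.get(key) : store.get(key);
    },
    // Apply all staged writes, then forget them.
    flush() {
      for (const [key, value] of pending) store.set(key, value);
      pending.clear();
    },
  };
}

const store = new Map([["score", 1]]);
const buffer = makeWriteBuffer(store);
buffer.write("score", 2);
console.log(store.get("score"), buffer.read("score")); // 1 2
buffer.flush();
console.log(store.get("score")); // 2
```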

