Three Types of Data

typon · on Feb 16, 2020

The "Out of the Tar Pit" paper: https://github.com/papers-we-love/papers-we-love/blob/master...

It describes these categories in a lot more detail and the equivalents are:

Constants => User specifications

State => Essential state

Cached values => Accidental state

andrewflnr · on Feb 15, 2020

This seems like a decent model for a lot of situations, but the difference between "development time" and "run time" gets increasingly blurry in the real world when you reload config live, do canary deployments of new code, etc. I think there's really just state with a spectrum of different lifetimes, from years to nano-seconds, and also cached/pure-functional data.

Twisol · on Feb 15, 2020

Right, any batch processing program can be thought of as a pure function of its inputs, but internally it is likely to be constructed as a series of smaller programs. The constants of a lower-level program are the mutable and cached state of the higher-level program, frozen for the duration of execution. And the subprograms may themselves instantiate and operate over locally-scoped mutable state.

It’s a good model, and I think gets a lot of things right. But there are definitely nuances.

brundolf · on Feb 15, 2020

Yes, there's a certain amount of relativism to it. A function that's externally pure may have its own internal state, just like an entire program that's externally pure may have its own internal state. A piece of data categorized this way must come with a "Relative to what?"
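A small sketch of that relativity, with a made-up function name (`totalPathLength` is only for illustration). From the caller's side it is pure; inside, it keeps a mutable accumulator that lives only for the duration of the call:

```javascript
// Externally pure: the same input always gives the same output,
// and the input array is never modified.
function totalPathLength(points) {
  let total = 0; // internal state, scoped to this one call
  for (let i = 1; i < points.length; i++) {
    const prev = points[i - 1];
    const curr = points[i];
    total += Math.hypot(curr.x - prev.x, curr.y - prev.y, curr.z - prev.z);
  }
  return total;
}
```

Relative to the caller, `total` does not exist as state at all; relative to the loop body, it is ordinary mutable state.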

andrewflnr · on Feb 15, 2020

> A piece of data categorized this way must come with a "Relative to what?"

This would be a good caveat to add to the original post. :)

brundolf · on Feb 15, 2020

Author here: this is one of those ideas that's been stewing in my head for quite some time and I'm only about 80% happy with how it came out as words. Let me know if I can provide further examples or elaboration.

----------------------

Edit: I wrote this example in response to a comment which has since been deleted, so I'll post it here instead

Let's say your program stores the positions of two entities that can change arbitrarily over time:

  let pos1 = { x: 0, y: 1, z: 2 };
  let pos2 = { x: 1, y: 3, z: 0 };

And you also want to work with the distance between them:

  let distance = Math.sqrt(
    (pos2.x - pos1.x) * (pos2.x - pos1.x) +
    (pos2.y - pos1.y) * (pos2.y - pos1.y) +
    (pos2.z - pos1.z) * (pos2.z - pos1.z));

When do you do this computation?

If "distance" is thought of like any other state, it's unclear, and it's easy for it to get out of sync with the values it's derived from. Maybe you have some sort of core update or rendering phase and you re-compute it there. Maybe you try and re-compute it every time one of the two values gets modified, either by constraining their modification within methods or by somehow observing their changes. Maybe you have a data structure that allows you to easily compare them to the values the previous computation came from. Deciding which of these strategies to take is non-trivial, but you can simplify the question a little bit by seeing "distance" as not being a normal part of state.

If you pull it out into a pure function:

  function getDistance(a, b) {
    // Use the parameters rather than the outer pos1/pos2, so the function is actually pure
    return Math.sqrt(
      (b.x - a.x) * (b.x - a.x) +
      (b.y - a.y) * (b.y - a.y) +
      (b.z - a.z) * (b.z - a.z));
  }

then "updating" it becomes a singular, clear action:

  let distance;
  function updateDistance() {
    distance = getDistance(pos1, pos2);
  }

And then when it gets computed becomes an independent question from how it gets computed. You can update it eagerly, or lazily, or implicitly. You can use comparisons, or observables, or whatever.

The benefit becomes more clear when the value isn't a simple number, but a whole object or object graph. By making it immutable, you have much more leeway when it comes to "refreshing" it, because you can guarantee you won't be losing any meaningful information.

None of these ideas are especially novel or profound, but as a mental framework they've shed a whole lot of clarity for me in my work over the last year or two.

choward · on Feb 15, 2020

> And then when it gets computed becomes an independent question from how it gets computed. You can update it eagerly, or lazily, or implicitly. You can use comparisons, or observables, or whatever.

But what about where distance is being used (read)? How do you know if updateDistance() got called already? It seems with this approach it would still be important to have a function for accessing it. But then that function would need access to some sort of state that says whether or not the data is stale. Something like this:

    let isDistanceStale = false;
    let distance = 0;
    let pos1 = { x: 0, y: 0, z: 0 };
    let pos2 = { x: 0, y: 0, z: 0 };

    function updatePos1(newPos1) {
      pos1 = newPos1;
      isDistanceStale = true;
    }

    function updatePos2(newPos2) {
      pos2 = newPos2;
      isDistanceStale = true;
    }

    function getDistance() {
      if (isDistanceStale) {
        distance = calculateDistance(pos1, pos2);
        isDistanceStale = false;
        return distance;
      } else {
        return distance;
      }
    }

    function calculateDistance(a, b) {
      // Use the parameters, not the outer pos1/pos2
      return Math.sqrt(
        (b.x - a.x) * (b.x - a.x) +
        (b.y - a.y) * (b.y - a.y) +
        (b.z - a.z) * (b.z - a.z));
    }

Gross. This is why it's better not to make these kinds of optimizations unless there really is a performance issue. Otherwise, just let it get computed every time or compute it every time pos1 or pos2 changes. No need to do lazy evaluation. If you really want that, there are languages that have it built in. Otherwise you'll just be fighting the language.

brundolf · on Feb 15, 2020

The idea would be that you abstract it as a pure function and then you can layer on caching later, when you find out it's needed. The idea is not to always cache values, but to recognize when that's what you're doing and keep it separate from the rest of your state.
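A minimal sketch of that layering, assuming positions are replaced with new objects rather than mutated in place (so a `===` check is enough to notice a change). `memoizeLast` is a hypothetical helper written for this example, not something from a library:

```javascript
// Pure: the result depends only on the two arguments.
function getDistance(a, b) {
  return Math.hypot(b.x - a.x, b.y - a.y, b.z - a.z);
}

// Hypothetical helper: remembers only the most recent call.
// Assumes arguments are replaced, never mutated, so === detects change.
function memoizeLast(fn) {
  let lastArgs = null;
  let lastResult;
  return function (...args) {
    const hit =
      lastArgs !== null &&
      lastArgs.length === args.length &&
      args.every((arg, i) => arg === lastArgs[i]);
    if (!hit) {
      lastResult = fn(...args);
      lastArgs = args;
    }
    return lastResult;
  };
}

// The cache is added afterwards, without touching getDistance itself.
const getDistanceCached = memoizeLast(getDistance);
```

Callers that don't need the cache keep calling `getDistance`; the staleness bookkeeping lives in one wrapper instead of being spread through the rest of the state.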

heavenlyblue · on Feb 15, 2020

What does this solve? If you compute this lazily you’ll end up in a bunch of situations which require synchronisation primitives, or you’d end up recomputing these values too many times.

Also the distance function can often be replaced by a distance to the power of two (since then the cost of running sqrt is not paid). This is often the case in my work - rearchitecting the application in such a way that you don’t need caching in the first place and being explicit.
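A rough sketch of the squared-distance idea, assuming the caller only needs to compare against a threshold rather than know the actual distance. `distanceSquared` and `isWithinRange` are made-up names for illustration:

```javascript
// Squared Euclidean distance: no sqrt needed.
function distanceSquared(a, b) {
  const dx = b.x - a.x;
  const dy = b.y - a.y;
  const dz = b.z - a.z;
  return dx * dx + dy * dy + dz * dz;
}

// For non-negative values, d <= r is equivalent to d*d <= r*r,
// so the comparison can be done entirely on squared values.
function isWithinRange(a, b, range) {
  return distanceSquared(a, b) <= range * range;
}
```

With the positions from the earlier example, `distanceSquared(pos1, pos2)` is 1 + 4 + 4 = 9, so the two entities are within a range of 3 but not within 2.9.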

loopz · on Feb 15, 2020

Data exists for higher purposes. You can't say 4 6.9 0034 has much intrinsic meaning. In fact, modern approaches seek to encapsulate the data itself, and instead build abstractions to better model circumstances and behaviour, in effect hiding or masking data.

The troubles begin when working directly in the data models, leading to coupling, dependencies, side-effects and narrow perspectives on how to accomplish better designs. In the beginning it seems more powerful, until enough complexity creeps in to warrant a headache examination.

carapace · on Feb 15, 2020

Have you read "Programming Pearls" by Jon Bentley?

jmchuster · on Feb 16, 2020

I know he's very popular here on HN, but just wanted to ask if you had watched any of Rich Hickey's talks. It sounds like the end insight you'll end up on is what he describes as the "epochal time model".

slx26 · on Feb 16, 2020

Nice, I've also thought a lot about cached values in the same way. In the past I implemented a meta-programmed system in Ruby to deal with cached values so they would be easy to use and automatically dropped when other related state changed... and I really started to think about the concept of "derived state". I feel it should be something implemented in common programming languages. I believe it could be extremely helpful. Does someone know if this exists in some language?

I don't know if there's any other tricky implementation part, but from what I've seen you could simply define something like this:

  struct Human:
     name String
     birth Date
     derived age Integer

  function derive age:
     return (time.Now() - self.birth).Years().Floor()

And the compiler should have everything it needs. Sure, this example is pretty annoying because time changes all the time, so it doesn't look like you can cache much with such a naive approach, but I suck at examples (surely people working on languages could come up with more interesting approaches, like adding ways to schedule the cached value to be preferably kept until X time later or whatever).
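For comparison, here is roughly what the uncached version of that looks like with a plain JavaScript getter. This is only a sketch: the `ageAt` method is an addition so the example is deterministic, and nothing here tracks dependencies or caches automatically the way the proposed `derived` keyword would:

```javascript
class Human {
  constructor(name, birth) {
    this.name = name;
    this.birth = birth; // a Date
  }

  // Whole years between birth and `now`, compared in UTC.
  ageAt(now) {
    const years = now.getUTCFullYear() - this.birth.getUTCFullYear();
    const beforeBirthday =
      now.getUTCMonth() < this.birth.getUTCMonth() ||
      (now.getUTCMonth() === this.birth.getUTCMonth() &&
        now.getUTCDate() < this.birth.getUTCDate());
    return beforeBirthday ? years - 1 : years;
  }

  // Derived: recomputed on every read, never stored on the object.
  get age() {
    return this.ageAt(new Date());
  }
}
```

Reading `human.age` looks like a field access, but there is no stored value that can go out of sync with `birth`.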

sorokod · on Feb 15, 2020

About cached values you say: "Synchronizing" them is always as simple as a single, controlled operation.

Others have stated that: "There are only two hard things in Computer Science: cache invalidation and naming things"

pkage · on Feb 15, 2020

I don't think that's what he's arguing though—it's not a cache like memcached or anything, it's much more abstract than that. He's saying that calculating values derived from the program state should be as simple as possible, and not have other side effects.

sorokod · on Feb 15, 2020

Caching comes with a bunch of strings attached of which recalculating the values is the least problematic.

Twisol · on Feb 15, 2020

I think the name “cached value” is a red herring, as far as the mindset being described is concerned. “Derived value” might be closer to the mark; the values are pure functions of other things, and to change the derived value you must change the things it depends on. There need not be any mention of storing that derived value in a cache and somehow invalidating it.

Yen · on Feb 15, 2020

I've always heard it as

"There are two hard problems in Computer Science: cache invalidation, naming things, and off-by-one errors"

brundolf · on Feb 15, 2020

Yes :) This does not solve the problem of cache invalidation, it only separates it from other concerns so that it can be focused on explicitly.

white-flame · on Feb 15, 2020

"Cached values" isn't a great name here. He should probably call them "Derived values".

brundolf · on Feb 15, 2020

The word "cache" was used to encompass remote values too; i.e. ones that an external system may derive from its own state, but where that component is behind a black box.

twhitmore · on Feb 16, 2020

To perhaps offer a slightly wider perspective on kinds of data & lifetimes:

- Constant data

- Long-term configuration

- Medium-term configuration

- Client/account data

- Business transactions

- Transitory processing work

_0w8t · on Feb 16, 2020

The cache here means read-cache. A write-cache, that is, delaying a state update for performance reasons, probably needs its own category.
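A tiny sketch of why the two differ, assuming the backing store is a plain `Map` (`WriteBuffer` is a made-up name). Unlike a read-cache, the pending entries hold information that exists nowhere else until `flush()` runs, so they can't simply be thrown away and recomputed:

```javascript
class WriteBuffer {
  constructor(store) {
    this.store = store;       // backing Map, assumed slow to write to
    this.pending = new Map(); // writes not yet applied to the store
  }

  // Delay the state update: record it locally only.
  set(key, value) {
    this.pending.set(key, value);
  }

  // Reads must see pending writes first, then fall back to the store.
  get(key) {
    return this.pending.has(key) ? this.pending.get(key) : this.store.get(key);
  }

  // Apply all delayed updates in one go.
  flush() {
    for (const [key, value] of this.pending) {
      this.store.set(key, value);
    }
    this.pending.clear();
  }
}
```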