What if "data" is a really bad idea?

Data like that sentence? Or all of the other sentences in this chat? I find 'data' hard to consider a bad idea in and of itself, i.e. if data == information, records of things known/uttered at a point in time. Could you talk more about data being a bad idea?

What is "data" without an interpreter (and when we send "data" somewhere, how can we send it so its meaning is preserved?)

Data without an interpreter is certainly subject to (multiple) interpretation :) For instance, the implications of your sentence weren't clear to me, in spite of it being in English (evidently, not indicated otherwise). Some metadata indicated to me that you said it (should I trust that?), and when. But these seem to be questions of quality of representation/conveyance/provenance (agreed, important) rather than critiques of data as an idea. Yes, there is a notion of sufficiency ('42' isn't data).

Data is an old and fundamental idea. Machine interpretation of un- or under-structured data is fueling a ton of utility for society. None of the inputs to our sensory systems are accompanied by explanations of their meaning. Data - something given, seems the raw material of pretty much everything else interesting, and interpreters are secondary, and perhaps essentially, varied.

There are lots of "old and fundamental" ideas that are not good anymore, if they ever were.

The point here is that you were able to find the interpreter of the sentence and ask a question, but the two were still separated. For important negotiations we don't send telegrams, we send ambassadors.

This is what objects are all about, and it continues to be amazing to me that the real necessities and practical necessities are still not at all understood. Bundling an interpreter for messages doesn't prevent the message from being submitted for other possible interpretations, but there simply has to be a process that can extract signal from noise.

This is particularly germane to your last paragraph. Please think especially hard about what you are taking for granted in your last sentence.

Without the 'idea' of data we couldn't even have a conversation about what interpreters interpret. How could it be a "really bad" idea? Data needn't be accompanied by an interpreter. I'm not saying that interpreters are unimportant/uninteresting, but they are separate. Nor have I said or implied that data is inherently meaningful.

Take a stream of data from a seismometer. The seismometer might just record a stream of numbers. It might put them on a disk. Completely separate from that, some person or process, given the numbers and the provenance alone (these numbers are from a seismometer), might declare "there is an earthquake coming". But no object sent an "earthquake coming" "message". The seismometer doesn't "know" an earthquake is coming (nor does the earth, the source of the 'messages' it records), so it can't send a "message" incorporating that "meaning". There is no negotiation or direct connection between the source and the interpretation.
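
A minimal sketch of that separation, in Python. The names `Reading`, `record` and `looks_like_tremor`, the source label, and the threshold are all hypothetical, chosen only for illustration; real seismic analysis is far more involved. The recording side tags numbers with provenance and time, and a separately written downstream function supplies the interpretation:

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Reading:
    source: str       # provenance label (hypothetical)
    timestamp: float  # seconds since some agreed start
    value: float      # raw amplitude; the sensor attaches no meaning to it


def record(source, samples):
    """Sensor side: tag each number with provenance and time, nothing more."""
    return [Reading(source, float(t), v) for t, v in enumerate(samples)]


def looks_like_tremor(readings, threshold=5.0):
    """Downstream interpreter, written independently of the sensor.

    The threshold is an arbitrary illustrative value, not a real criterion.
    """
    return mean(abs(r.value) for r in readings) > threshold


quiet = record("seismometer-7", [0.1, -0.2, 0.3, -0.1])
active = record("seismometer-7", [6.0, -9.5, 12.0, -7.5])

print(looks_like_tremor(quiet), looks_like_tremor(active))
```

Nothing in `record` knows about tremors; a different consumer could apply a different interpretation to the same stored readings.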

We will soon be drowning in a world of IoT sensors sending context-or-provenance-tagged but otherwise semantic-free data (necessarily, due to constraints, without accompanying interpreters) whose implications will only be determined by downstream statistical processing, aggregation etc, not semantic-rich messaging.

If you meant to convey "data alone makes for weak messages/ambassadors", well ok. But richer messages will just bottom out at more data (context metadata, semantic tagging, all more data). Ditto, as someone else said, any accompanying interpreter (e.g. bytecode? - more data needing interpretation/execution). Data remains a perfectly useful and more fundamental idea than "message". In any case, I thought we were talking about data, not objects. I don't think there is a conflict between these ideas.

2nd Paragraph: How do they know they are even bits? How do they know the bits are supposed to be numbers? What kind of numbers? Relating to what?


It contravenes the common and historical use of the word 'data' to imply undifferentiated bits/scribbles. It means facts/observations/measurements/information and you must at least grant it sufficient formatting and metadata to satisfy that definition. The fact that most data requires some human involvement for interpretation (e.g. pointing the right program at the right data) in no way negates its utility (we've learned a lot about the universe by recording data and analyzing it over the centuries), even though it may be insufficient for some bootstrapping system you envision.

I think what Alan was getting at is that what you see as "data" is in fact, at its basis, just signal, and only signal; a wave pattern, for example, but even calling it a "wave pattern" suggests interpretation. What I think he's trying to get across is there is a phenomenon being generated by something, but it requires something else--an interpreter--to even consider it "data" in the first place. As you said, there are multiple ways to interpret that phenomenon, but considering "data" as irreducible misses that point, because the concept of data requires an interpreter to even call it that. Its very existence as a concept from a signal presupposes an interpretation. And I think what he might have been getting at is, "Let's make that relationship explicit." Don't impose a single interpretation on signal by making "data" irreducible. Expose the interpretation by making it explicit, along with the signal, in how one might design a system that persists, processes, and transmits data.

If we can't agree on what words mean we can't communicate. This discussion is undermined by differing meanings for "data", to no purpose. You can of course instead send me a program that (better?) explains yourself, but I don't trust you enough to run it :)

The defining aspect of data is that it reflects a recording of some facts/observations of the universe at some point in time (this is what 'data' means, and meant long before programmers existed and started applying it to any random updatable bits they put on disk). A second critical aspect of data is that it doesn't and can't do anything, i.e. have effects. A third aspect is that it does not change. That static nature is essential, and what makes data a "good idea", where a "good idea" is an abstraction that correlates with reality - people record observations and those recordings (of the past) are data. Other than in this conversation apparently, if you say you have some data, I know what you mean (some recorded observations). Interpretation of those observations is completely orthogonal.

Nothing about the idea of 'data' implies a lack of formatting/labeling/use of common language to convey the facts/observations, in fact it requires it. Data is not merely a signal and that is why we have two different ideas/words. '42' is not, itself, a fact (datum). What constitutes minimal sufficiency of 'data' is a useful and interesting question. E.g. should data always incorporate time, what are the tradeoffs of labeling being in- or out-of-band, per datom or dataset, how to handle provenance etc. That has nothing to do with data as an idea and everything to do with representing data well.

But equating any such labeling with more general interpretation is a mistake. For instance, putting facts behind a dynamic interpreter (one that could answer the same question differently at different times, mix facts with opinions/derivations or have effects) certainly exceeds (and breaks) the idea of data. Which is precisely why we need the idea of data, so we can differentiate and talk about when that is and is not happening - am I dealing with facts, an immutable observation of the past ("the king is dead"), or just temporary (derived) opinions ("there may be a revolt"). Consider the difference between a calculation involving (several times) a fact (date-of-birth) vs a live-updated derivation (age). The latter can produce results that don't add up. 'date-of-birth' is data and 'age' (unless temporally-qualified, 'as-of') is not.
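
The date-of-birth vs age distinction can be sketched in a few lines of Python. This is only an illustration: the birth date and the function name `age_as_of` are made up. The fact is stored once and never changes, while the derived value is computed against an explicit 'as-of' date:

```python
from datetime import date

# A recorded fact: fixed once written down (hypothetical value).
DATE_OF_BIRTH = date(1990, 6, 15)


def age_as_of(date_of_birth: date, as_of: date) -> int:
    """Derive age in whole years, explicitly qualified by an 'as-of' date."""
    had_birthday = (as_of.month, as_of.day) >= (date_of_birth.month, date_of_birth.day)
    return as_of.year - date_of_birth.year - (0 if had_birthday else 1)


# Same fact, two different derivations, depending only on when you ask.
before = age_as_of(DATE_OF_BIRTH, date(2020, 6, 14))
after = age_as_of(DATE_OF_BIRTH, date(2020, 6, 15))
print(before, after)
```

A calculation that reads the fact several times always sees the same value; one that reads an unqualified "current age" several times may not.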

When interacting with an ambassador one may or may not get the facts, and may get different answers at different times. And one must always fear that some question you ask will start a war. Science couldn't have happened if consuming and reasoning about data had that irreproducibility and risk.

'Data' is not a universal idea, i.e. a single primordial idea that encompasses all things. But the idea that dynamic objects/ambassadors (whatever their other utility) can substitute for facts (data) is a bad idea (does not correspond to reality). Facts are things that have happened, and things that have happened have happened (are not opinions), cannot change and cannot introduce new effects. Data/facts are not in any way dynamic (they are accreting, that's all). Sometimes we want the facts, and other times we want someone to discuss them with. That's why there is more than one good idea.

Data is as bad an idea as numbers, facts and record keeping. These are all great ideas that can be realized more or less well. I would certainly agree that data (the maintenance of facts) has been bungled badly in programming thus far, and lay no small part of the blame on object- and place-oriented programming.

Why do you limit the meaning of 'data' to facts and/or observations?

"datum" means "a thing given" - a fact or presumed fact.


I think in the Science of Process that is being related as a desirable goal, everything would necessarily be a dynamic object (or perhaps something similar to this but fuzzier or more relational or different in some other way, but definitely dynamic) because data by itself is static while the world itself is not.

Your selection of data is arbitrary.

Not only is your perception based on an interpreter, but how can you be sure that you were even given all of the relevant bits? Or, even what the bits really meant/are?

Of course the selection of data is arbitrary -- but Rich gives us a definition, which he makes abundantly clear and uses consistently. All definitions can be considered arbitrary. He's not making any claim that we have all the relevant bits of data or that we can be sure what the data really means or represents.

But we can expound on this problem in general. In any experiment where we gather data, how can we be sure we have collected a sufficient quantity to justify conclusions (and even if we are using statistical methods that our underlying assumptions are indeed consistent with reality) and that we have accrued all the necessary components? What you're really getting at is an __epistemological__ problem.

My school of thought is that the only way to proceed is to do our best with the data we have. We'll make mistakes, but that's better than the alternative (not extrapolating on data at all.)

I hope we can do our best, I'm just not sure there is really a satisfactory way to define/measure/judge that we have actually done so....

Isn't the interpreter code itself data in the sense that it has no meaning without something (a machine) to run it? How do you avoid having to send an interpreter for the interpreter and so on?

Yes, so think about how to make this work "nicely" in an Intergalactic Network ...

It can't be turtles all the way down, so maybe set theory?

A good question isn't it?

For parallel ideas and situation, take a look at Lincos https://en.wikipedia.org/wiki/Lincos_(artificial_language)

Thank you! I started to think along those lines too, thanks to Carl Sagan's novel Contact. That was the first thing that came to mind.

Now the question is, what if some "objects" are more advanced than others, and what if an advanced object sends a message concealing a Trojan horse? I think this question was also brought up in the novel/movie...

I think this is a real-life, practical showstopper for developing this concept...

Thanks for the reference. I've been trying to think along these lines.

Wow. Thanks.

I think the object is a very powerful idea for wrapping "local" context. But in a network (communication) environment, it is still challenging to handle "remote" context with objects. That is why we have APIs and serialization/deserialization overhead.

In the ideal homogeneous world of Smalltalk, it is less of an issue. But if you want a Windows machine to talk to a Unix machine, the remote context becomes an issue.

In principle we can send a Windows VM along with the message from Windows and a Unix VM (docker?) with a message from Unix, if that is a solution.

This is why "the objects of the future" have to be ambassadors that can negotiate with other objects they've never seen.

Think about this as one of the consequences of massive scaling ...

Along this line of logic, perhaps the future of AI is not "machine learning from big data" (a lot of buzz words) but computers that generate runtime interpreters for new contexts.

It's not "Big Data" but "Big Meaning"

When high bandwidth communication is omnipresent, is "portability" of the interpreter really something to optimize for?

How can you find it?

The association between "patterns" and interpretation becomes an "object" when this is part of the larger scheme. When you've just got bits and you send them somewhere, you don't even have "data" anymore.

Even with something like EDI or XML, think about what kinds of knowledge and process are actually needed to even do the simplest things.

Sounds pretty much like the problem of establishing contact with an alien civilization. Definitely set theory, prime numbers, arithmetic and so on... I guess at some point, objects will be equipped with general intelligence for such negotiations if they are to be true digital ambassadors!

It's hard for me to grasp what this negotiation would look like. Particularly with objects that haven't encountered each other. It just seems like such a huge problem.

I don't really know anything at all about microbiology, but maybe climbing the ladder of abstraction to small insects like ants. There is clearly negotiation and communication happening there, but I have to think it's pretty well bounded. Even if one ant encountered another ant, and needed to communicate where food was, it's with a fixed set of semantics that are already understood by both parties.

Or with honeybees, doing the communication dance. I have no idea if the communication goes beyond "food here" or if it's "we need to decide who to send out."

It seems like you have to have learning in the object to really negotiate with something it hasn't encountered before. Maybe I'm making things too hard.

Maybe "can we communicate" is the first negotiation, and if not, give up.

It is worth thinking of an analogy to TCP/IP -- what is the smallest thing that could be universal that will allow everything else to happen?

I remember at one point after listening to one of your talks about TCP/IP as a very good OO system, and pondering the question of how to make software like that, an idea that came to mind was, "Translation as computation." I was combining the concept that as implemented, TCP/IP is about translation between packet-switching systems, so a semantic TCP/IP would be a system that translates between different machine models, though, in terms of my skill, the best that I could imagine was "compilers as translators," which I don't think cuts it, because compilers don't embody a machine model. They assume it. However, perhaps it's not necessary to communicate machine models explicitly, since such a system could translate between them re. what state means. This would involve simulating state to satisfy local operation requirements while actual state is occurring, and will eventually be communicated. I've heard you reference McCarthy's situation calculus re. this.

Well, there's the old Component Object Model and cousins ... under this model an object a encountering a new object b will, essentially, ask 'I need this service performed, can you perform it for me?' If b can perform the service, a makes use of it; if not, not.
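
A toy sketch of that "can you perform this service?" exchange, in Python. It is loosely in the spirit of COM's `QueryInterface` but is not the COM API; `Component`, `query_service` and `request` are hypothetical names invented for this example:

```python
class Component:
    """Hypothetical component that advertises services by name."""

    def __init__(self, **services):
        self._services = dict(services)

    def query_service(self, name):
        """Return a callable for the named service, or None if unsupported."""
        return self._services.get(name)


def request(provider, name, *args):
    """Object a asks object b for a service and uses it only if offered."""
    service = provider.query_service(name)
    if service is None:
        return None  # "if not, not"
    return service(*args)


b = Component(to_upper=str.upper)
print(request(b, "to_upper", "food here"))
print(request(b, "translate", "food here"))
```

Note that both sides still have to agree beforehand on what the name "to_upper" means, which is the part of the negotiation this sketch takes for granted.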

Another technique that occurs to me is from type theory ... here, instead of objects we'll talk in terms of values and functions, which have types. So e.g. a function a encountering a new function b will examine b's type and thereby figure out if it can/should call it or not. E.g., b might be called toJson and have type (in Haskell notation) ToJson a => a -> Text, so the function a knows that if it can give toJson any value which has a ToJson typeclass instance, it'll get back a Text value, or in other words toJson is a JSON encoder function, and thus it may want to call it.
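
A rough Python analogue of that idea, as a sketch only. Python's `typing.Protocol` check happens at run time and looks only at method names, so it is much weaker than a Haskell typeclass constraint; `ToJson`, `Point` and `encode_if_possible` are hypothetical names:

```python
import json
from typing import Protocol, runtime_checkable


@runtime_checkable
class ToJson(Protocol):
    """Anything that can render itself as JSON text."""

    def to_json(self) -> str: ...


class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def to_json(self) -> str:
        return json.dumps({"x": self.x, "y": self.y})


def encode_if_possible(value):
    """Inspect the value's shape first; call to_json only if it is offered."""
    if isinstance(value, ToJson):
        return value.to_json()
    return None


print(encode_if_possible(Point(1, 2)))
print(encode_if_possible(object()))
```

The caller decides whether to proceed by examining a declared interface rather than by trial and error, which is the point of the type-based approach.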

Alan, what is your view on the Olive Executable Archive? https://olivearchive.org/

The Internet Archive (http://archive.org) is doing the same thing. They have old software stored that you can run in online emulators. I only wish they had instructions for how to use the emulators. The old keyboards and controllers are not like today's.

Their larger goals are important.

Do you think they are on the right path to their larger goals?

I think for so many important cases, this is almost the only way to do it. The problems were caused by short-sighted vendors and programmers getting locked into particular computers and OS software.

For contrast, one could look at a much more compact way to do this that -- with more foresight -- was used at Parc, not just for the future, but to deal gracefully with the many kinds of computers we designed and built there.

Elsewhere in this AMA I mentioned an example of this: a resurrected Smalltalk image from 1978 (off a disk pack that Xerox had thrown away) that was quite easy to bring back to life because it was already virtualized "for eternity".

This is another example of "trying to think about scaling" -- in this case temporally -- when building systems ....

The idea was that you could make a universal computer in software that would be smaller than almost any media made in it, so ...

I agree that the "image" idea is more powerful than the "data" idea.

However, since the PC revolution, the mainstream seems to have taken the "data" path, for whatever technical or non-technical reasons.

How do you envision the "coming back" of the image path, via either bypassing the data path or merging with it, in a not-so-faraway future?

Over all of history, there is no accounting for what "the mainstream" decides to believe and do. Many people (wrongly) think that "Darwinian processes" optimize, but any biologist will point out that they only "tend to fit to the environment". So if your environment is weak or uninteresting ...

This also obtains for "thinking" and it took a long time for humans to even imagine thinking processes that could be stronger than cultural ones.

We've only had them for a few hundred years (with a few interesting blips in the past), and they are most definitely not "mainstream".

Good ideas usually take a while to have and to develop -- so when the mainstream has a big enough disaster to make it think about change rather than more epicycles, it will still not allocate enough time for a really good change.

At Parc, the inventions that made it out pretty unscathed were the ones for which there was really no alternative and/or no one was already doing: Ethernet, GUI, parts of the Internet, Laser Printer, etc.

The programming ideas on the other hand were -- I'll claim -- quite a bit better, but (a) most people thought they already knew how to program and (b) Intel, Motorola thought they already knew how to design CPUs, and were not interested in making the 16 bit microcoded processors that would allow the much higher level languages at Parc to run well in the 80s.

It seems that barriers to entry in hardware innovation are getting higher and higher due to high-risk industrial efforts. In the meantime, barriers to entry in software are getting lower and lower due to improvements in tooling for both software and hardware.

On the other hand due to the exponential growth of software dependency, "bad ideas" in software development are getting harder and harder to remove and the social cost of "green field" software innovation is also getting higher and higher.

How do we solve these issues in the coming future?

I don't know.

But e.g. the possibilities for "parametric" parallel computing solutions (via FPGAs and other configurable HW) have not even been scratched (too many people trying to do either nothing or just conventional stuff).

Some of the FPGA modules (like the BEE3) will slip into a Blades slot, etc.

Similarly, there is nothing to prevent new SW from being done in non-dependent ways (meaning the initial dependencies to hook up into the current world can be organized to be gradually removable, and the new stuff need not have the same kind of crippling dependencies).

For example, a lot can be done -- especially in a learning curve -- if e.g. a subset of Javascript in a browser (etc) can really be treated as a "fast enough piece of hardware" (of not great design) -- and just "not touch it with human hands". (This is awful in a way, but it's really a question of "really not writing 'machine code' ").

Part of this is to admit to the box, but not accept that the box is inescapable.

Thank you Alan for your deep wisdom and crystal vision.

It is the best online conversation I have ever experienced.

It also reminded me of inspiring conversations with Jerome Bruner at his New York City apartment 15 years ago. (I was working on some project with his wife's NYU social psychology group at the time.) As a Physics Ph.D. student, I never imagined I could become so interested in the Internet and education in the spirit of Licklider and Doug Engelbart.


You probably know that our mutual friend and mentor Jerry Bruner died peacefully in his sleep a few weeks ago at the age of 100, and with much of his joie de vivre beautifully still with him. There will never be another Jerry.

>Please think especially hard about what you are taking for granted in your last sentence.

Any Meaning can only be the Interpretation of a Model/Signal?

Information in "entropy" sense is objective and meaningless. Meaning only exists within a context. If we think "data" represent information, "interpreters" bring us context and therefore meaning.
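
To make the "objective and meaningless" point concrete, here is a small Python sketch. It assumes the simplest possible model, the empirical per-symbol distribution of a string; the function name and sample strings are illustrative only. Under that model a readable sentence and a scramble of its characters have the same entropy:

```python
import math
from collections import Counter


def shannon_entropy(message: str) -> float:
    """Entropy in bits per symbol of the message's empirical symbol counts."""
    counts = Counter(message)
    total = len(message)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


meaningful = "the king is dead"
scrambled = "".join(sorted(meaningful))  # same symbols, no readable meaning

print(shannon_entropy(meaningful), shannon_entropy(scrambled))
```

The measure depends only on symbol frequencies; whatever distinguishes the sentence from the scramble has to come from the context the reader brings.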

Thank you - I was beginning to wonder if anyone in this conversation understood this. It is really the key to meaningfully (!!) move forward in this stuff.

The more meaning you pack into a message, the harder the message is to unpack.

So there's this inherent tradeoff between "easy to process" and "expressive" -- and I imagine deciding which side you want to lean toward depends on the context.

Check this out for a practical example: https://www.practicingruby.com/articles/information-anatomy

(not a Ruby article, but instead about essential structure of messages, loosely inspired by ideas in Gödel, Escher, Bach)

So the idea is to always send the interpreter, along with the data? They should always travel together?

Interesting. But, practically, the interpreter would need to be written in such a way that it works on all target systems. The world isn't set up for that, although it should be.

Hm, I now realize your point about HTML being idiotic. It should be a description, along with instructions for parsing and displaying it (?)

TCP/IP is "written in such a way that it works on all target systems". This partially worked because it was early, partly because it is small and simple, partly because it doesn't try to define structures on the actual messages, but only minimal ones on the "envelopes". And partly because of the "/" which does not force a single theory.
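
As a toy illustration of "minimal structure on the envelope, none on the message", here is a Python sketch. It is not the real IP header layout; `Envelope`, `next_hop`, the addresses and the routing table are all hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    source: str       # sender address (illustrative format)
    destination: str  # receiver address
    payload: bytes    # opaque to the network; never parsed in transit


def next_hop(envelope: Envelope, routes: dict) -> str:
    """Choose a next hop from the envelope alone, ignoring the payload."""
    return routes.get(envelope.destination, routes["default"])


routes = {"host-b": "link-1", "default": "link-0"}  # hypothetical routing table
text = Envelope("host-a", "host-b", "hello".encode("utf-8"))
binary = Envelope("host-a", "host-b", bytes([0, 255, 17]))

print(next_hop(text, routes), next_hop(binary, routes))
```

Both envelopes are forwarded identically because the forwarding step assumes nothing about what the bytes mean; that job is left entirely to the two ends.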

This -- and the Parc PUP "internet" which preceded it and influenced it -- are examples of trying to organize things so that modules can interact universally with minimal assumptions on both sides.

The next step -- of organizing a minimal basis for inter-meanings -- not just internetworking -- was being thought about heavily in the 70s while the communications systems ideas were being worked on, but was quite to the side, and not mature enough to be made part of the apparatus when "Flag Day" happened in 1983.

What is the minimal "stuff" that could be part of the "TCP/IP" apparatus that could allow "meanings" to be sent, not just bits -- and what assumptions need to be made on the receiving end to guarantee the safety of a transmitted meaning?

Would some kind of IDL not be enough to allow meanings to be sent?

Now it's too late to fix.

I don't think it's too late, but it would require fairly large changes in perspective in the general computing community about computing, about scaling, about visions and goals.

Data, and the entirety of human understanding and knowledge derived from recording, measurement and analysis of data, predates computing, so I don't see the relevance of these recent, programming-centric notions in a discussion of its value.

Wouldn't Mr. Kay say that it is education that builds the continuity of the entirety of human understanding? Greek philosophy and astronomy survived in the Muslim world and not in the European, though both possessed plenty of texts, because only the former had an education system that could bootstrap a mind to think in a way capable of understanding and adding to the data. Ultimately, every piece of data is reliant on each generation of humans equipping enough of their children with a mindset capable of using it intelligently.

The value of data is determined by the intelligence of those interpreting it, not those who recorded it.

Of course, this dynamic is sometimes positive. The Babylonians kept excellent astronomical records though apparently making little theoretical advance in understanding them. Greeks with an excellent grasp of geometry put that data to much better use very quickly. But if they had had to wait to gather the data themselves, one can imagine them waiting a long time.

This kind of gets into philosophy, but a metaphor I came up with for thinking about this (another phrase for it is "thought experiment") is:

If I speak something to a rock, what is it to the rock? Is it "signal," or "data"?

Making the concept a little more interesting, what if I resonate the rock with a sound frequency? What is that to the rock? Is that "signal," or "data"?

Up until the Rosetta Stone was found, Egyptian hieroglyphs were indecipherable. Could data be gathered from them, nevertheless? Sure. Researchers could determine what pigments were used, and/or what tools were used to create them, but they couldn't understand the messages. It wasn't "data" up to that point. It was "noise."

I hope I am not giving the impression that I am a postmodernist who is out here saying, "Data is meaningless." That's not what I'm saying. I am saying meaning is not self-evident from signal. The concept of data requires the ability to interpret signal for meaning to be acquired.

Computing has existed for thousands of years. We just have machines do some of it now.

What if "data" is a really great idea?

Your blog looks very interesting. You should share some links to it here on Hacker News!

Thanks. I share it wherever I think it will add to the discussion.

Yes, I think if we could get rid of this notion we could probably move in interesting directions. Another way to look at it: if we take any object of sufficient complexity in the universe, how could it interact with other objects of sufficient complexity? If we look at humans as first-order augmentation devices for other humans, it's notable that the complexity of their internal state is much higher than the complexity of their input over any sufficiently small time frame (whatever measurement you decide to take). Basically, the whole state is encoded internally, by means of successive undifferentiated input. In that sense, neural networks, for example, don't work with data as such: data presupposes an internal structure that is absent from an input from the standpoint of the network itself. It is its job to convert that input to something we can reasonably call "data". Moreover, this knowledge is encoded in its internal state, which is essentially the "interpreter" bundled in.
Another angle that I like to think from is this: TRIZ has a concept of an ideal device, something performing its function with the minimum overhead required; best of all, the function is performed by itself, in the absence of any device. If we imagine the computer (in a very generic sense) to be such a device, it stands to reason that ideally it will require minimal, or even no, input. Obviously this means that we don't need to encode meaning or interpretation into it through directed formal input. The only way for that to happen is for a computer to have a sufficiently complex internal state, capable of converting directed, or even self-acquired, input into whatever we can eventually call "data". This logic could possibly be applied to some minimal object: we could look for a unit capable of performing a specific function on a defined range of inputs, building the meaning from its internal state.
The second task, then, would be to find a way to compose those objects, provided they have no common internal state, and to build systems in which the combination of those states would render a larger possible field of operation. A third interesting question would be: how can we build up the internal state of another object, provided we want to feed it input requiring interpretation further down the line, building up from whatever minimum we already have?

Welcome to Claude Shannon! It's not about the message but about the receiver ...

Actually it is as much about the sender and the message as the receiver.

Sure, the message matters insomuch as it contains any information the receiver might be able to receive, but that doesn't guarantee it will be received, so how much does a message really matter? I don't see how the sender matters that much (unless perhaps the sender and receiver are linked, for example, they exchange some kind of abstract interpreter for the message). But does the message matter on its own if it is encrypted so well that it is indistinguishable from noise to any but one particular receiver? It's just noise without the receiver. I'm not sure what was meant, but this is the best I can do in understanding it.
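
The encrypted-message case can be sketched with a one-time pad in Python. This is an illustration under simple assumptions (a random key as long as the message, used once), not a recommendation for real-world cryptography; `xor_bytes` and the sample message are made up:

```python
import secrets


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR data with a key of equal length (a one-time pad when the key is random)."""
    if len(key) != len(data):
        raise ValueError("key must be exactly as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))


message = b"food here"
key = secrets.token_bytes(len(message))  # shared only with the intended receiver

ciphertext = xor_bytes(message, key)     # looks like random bytes to everyone else
recovered = xor_bytes(ciphertext, key)   # the receiver holding the key gets it back

print(ciphertext, recovered)
```

Without the key, the ciphertext is consistent with every possible message of the same length, so in that sense all of the "meaning" lives with the receiver who can interpret it.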

Data isn't the carrier, it isn't the signal (information), and it certainly isn't the meaning (interpretation). A reasonable first approximation is that data is _message_.

Data is semantically defined by the processes using/interpreting it. Not by the data itself. So Rich Hickey is right and Alan Kay is wrong.
