

Metadata vs Data: a wholly artificial distinction - edw519
http://blogs.fluidinfo.com/fluidDB/2009/09/05/metadata-vs-data-a-wholly-artificial-distinction/

======
andreyf
_At an architectural level, I think that’s entirely wrong._

In my applications, I've encountered the opposite. At the architecture level,
_data_ is usually stored in a relational database, where the nodes of the
indexes' B-trees are sized for the disk and processor cache, foreign keys are
defined and enforced, and transactions ensure consistency. Meta-data, on the
other hand, is significantly more dynamic (its structure changes as the
application evolves) and several orders of magnitude smaller (so it needs no
indexes), and is therefore better stored in XML, YAML, or JSON. I've certainly
seen people confuse the two - usually by creating a ton of structured DB
tables that should be replaced with a config file.
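
A minimal sketch of that split, in Python (the table, column, and config keys
here are hypothetical, purely for illustration): data goes in an indexed
relational store, while metadata lives in a small schemaless document that is
cheap to change.

```python
import json
import sqlite3

# Data: structured, indexed, optimized for scale.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT)")
db.execute("CREATE INDEX idx_accounts_owner ON accounts (owner)")
db.execute("INSERT INTO accounts (owner) VALUES (?)", ("alice",))

# Meta-data: small, schemaless, optimized for change -- adding a new
# feature flag is a one-line edit, no migration required.
config = json.loads('{"page_size": 50, "features": {"beta_search": true}}')

rows = db.execute(
    "SELECT id FROM accounts WHERE owner = ?", ("alice",)
).fetchall()
print(rows, config["features"]["beta_search"])
```

The index makes the lookup scale with data volume; the JSON blob makes the
configuration trivial to evolve - exactly the trade-off described above.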

The fundamental difference I see is this: _data_ should be optimized for
scale. It needs to be well-structured, so your architecture can use that
structure to scale as the amount of data grows. _Meta-data_ should be
optimized for change, so its structure can evolve with your application. Now,
there are many ways to structure data, so there are certainly many choices in
how (and how much) you want to structure it, but the trade-off is real: you
either spend time defining and maintaining structure to optimize performance,
or you ignore performance and structure for the sake of change and development
speed.

It's really the same question as with type systems in programming languages -
type definitions are meta-data about code which lets compilers/interpreters
optimize performance, at the expense of a developer having to write and
maintain type definitions.
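
The analogy can be made concrete in Python (an illustration I'm adding, not
from the original comment; note Python's interpreter doesn't actually use
annotations for optimization the way static compilers use type declarations):
type annotations are metadata attached to the code itself, inspectable by
tools without changing what the code computes.

```python
# Annotations are metadata about the function, not part of its behavior.
def total(prices: list[float], tax_rate: float) -> float:
    return sum(prices) * (1 + tax_rate)

# The metadata rides along with the code, available for tools to inspect.
print(total.__annotations__)

# The computation itself is unaffected by the annotations.
print(total([10.0, 20.0], 0.1))
```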

That said, I agree that for applications which don't require exceptional
efficiency, there's a lot to be gained by unifying all data at one point in
the structured/unstructured spectrum. But this doesn't mean the spectrum
ceases to exist.

~~~
gruseom
I don't think you're contradicting the post. He's talking about the meaning of
the concepts and you're talking about how to get efficient implementations.
Both are important, but if you didn't have to worry about efficiency, it's
pretty obvious that the data/metadata distinction is a logical false turn. (To
use your terminology, why wouldn't we want everything to be fast, dynamic, and
unified?)

The post does hint at this by suggesting that the concept of metadata itself
arose from tradeoffs imposed by hardware. Where I think you differ from Terry
is that while he implies that it's time we moved beyond all that, you're
saying that the tradeoffs (e.g. between scale and dynamism) remain
fundamental. That's a legitimate concern. It's not obvious that we can have
our cake and eat it too. It will be interesting to see how FluidDB performs at
scale over time.

We've been thinking a lot about these issues in our startup too, because we
need both speed and dynamism. The solution we're arriving at is to exploit
the static (structured) cases to get as much speed out of them as we can, be
fast-ish on the most common dynamic cases (which thankfully are relatively
simple), and be correct-but-slow on the complex dynamic cases. What we won't
do, however, unless absolutely forced to, is introduce architectural
distinctions between these. We want a consistent, regular system (no two types
of data, no two types of anything) that happens to run very fast when handed
common usage patterns but still allows for arbitrary flexibility. What we're
betting is that the usage patterns will be something like 90% static, 9%
dynamic-but-simple, and 1% dynamic-and-complex; we want that to map to 90%
super-fast, 9% fast-enough, and 1% slow.

~~~
andreyf
_To use your terminology, why wouldn't we want everything to be fast, dynamic,
and unified?_

My assumption was that this is provably impossible - that well-structured data
will always be more efficient to interact with - but on second thought, I
can't think of a good reason why. On the contrary, it seems computational
speed is unbounded, while the database size and speed we need are (for most
current problems) bounded. Hence, "relatively slow" is going to be "fast
enough" more and more often.

~~~
gruseom
_on second thought, I can't think of a good reason for that_

Isn't it simply that if you know the structure in advance, you don't have to
figure out what it is at runtime?

I agree that the tradeoff becomes moot if you can get both sides fast enough.
The question is whether present-day hardware has gotten us to that point.
That's why FluidDB is an interesting test case of whether a system supporting
arbitrary flexibility can perform at scale. I assume Terry isn't being naive
about this, which means he must be making some tradeoffs somewhere. The
question I'd like the answer to is, what battles is he consciously _not_
trying to fight (at a design level)? Terry Jones, if you're reading this,
please speak up :)

------
jzdziarski
I'm not entirely sure the author himself understands the difference between
data and metadata. It has nothing to do with importance; rather, metadata is
"data about data". The author seems to think it has something to do with a
distinction between system and user data, which isn't the case. It's an
important logical abstraction to make: "data" is a noun, whereas metadata can
be thought of as a collection of adjectives.

Now if you want to start discussing why "data about data" should be considered
part of the data itself, that's an argument we can have... but that doesn't
appear to be the goal of this article (at least if it is, he doesn't make much
of an argument on that level).

~~~
gruseom
Huh? He knows perfectly well what the definition of "metadata" is. It's not
that hard a concept. Nor is it hard to see why it's muddled, or why systems
that make an architectural distinction between the two (data and metadata) are
clumsy and rigid. In fact, the whole point is rather obvious (or ought to be),
but he makes up for it by offering specific insights. I thought the historical
speculation was particularly good.

------
holygoat
I prefer to think of the distinction as _fluid_.

What one application considers metadata is another application's data.

From the apparently trivial (file sizes, line numbers) to the domain specific
(date of joining social network; file server cluster that hosts this image),
each tool doing the work has a very different viewpoint.

The motivation for abandoning the distinction is this fluidity, not that there
is no distinction.

(Sidenote: I like the Clojure definition of metadata:

"It is used to convey information to the compiler about types, but can also be
used by application developers for many purposes, annotating data sources,
policy etc.

"An important thing to understand about metadata is that it is not considered
to be part of the value of an object. As such, metadata does not impact
equality (or hash codes). Two objects that differ only in metadata are
equal.")
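
The Clojure semantics quoted above can be mimicked in Python (a hypothetical
wrapper of my own for illustration - Clojure's actual API is `with-meta` and
`meta`): metadata rides along with a value but is excluded from equality and
hashing.

```python
class WithMeta:
    """A value plus attached metadata that doesn't affect equality."""

    def __init__(self, value, meta=None):
        self.value = value
        self.meta = meta or {}

    def __eq__(self, other):
        # Equality considers only the value, never the metadata.
        return isinstance(other, WithMeta) and self.value == other.value

    def __hash__(self):
        # Hashing likewise ignores metadata (value must be hashable).
        return hash(self.value)


a = WithMeta((1, 2, 3), {"source": "api"})
b = WithMeta((1, 2, 3), {"source": "cache"})
print(a == b)  # objects differing only in metadata compare equal
```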

------
QE2
_so hard to find, for example, a file with "accounts" in its name and
"automobiles" in the contents ... Consider the UNIX find command which
searches based on file metadata and the grep command which searches file
contents. Combining the two is not easy._

    grep -i automobiles `find / -type f -name '*accounts*'`

_Interestingly, Amazon are currently being sued because they threw away
someone’s metadata in the process of removing a copy of Orwell’s 1984 from a
Kindle. You can bet the metadata was removed automatically when the content
was removed._

Actually, they are restoring the file to anyone who wants it, including all
notes.

http://www.engadget.com/2009/09/04/amazon-offers-to-give-back-your-kindles-copy-of-1984/

~~~
icodestuff
This is at the application level. The point is that there's no way to do this
as one search without building an index a la Spotlight. You can't write a
program called "nameandcontents" that does the above without doing exactly
what find and grep do - there is always the data equivalent of NUMA; call it
Non-Uniform Data Access. The author wants UDA at the architectural level, not
just the application level.

------
robertk
A meta-distinction.

------
humble
It's just data - all the way up and all the way down (??)

