
Column Names as Contracts - MaysonL
https://emilyriederer.netlify.app/post/column-name-contracts/
======
> Using controlled vocabularies for column names is a low-tech, low-friction
> approach to building a shared understanding of how each field in a data set
> is intended to work.

Can't stress enough how useful this has been working on data science teams. There is
really an art to getting a group of people to standardize around a vocabulary
of column-name prefixes/suffixes without being overly prescriptive.

Being able to unlock nifty tools / behaviors by naming your columns certain
ways seems crazy powerful. (Like a form of function dispatching)
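
As a rough sketch of that dispatch idea in Python/pandas (the prefixes and rules here are invented, not from the article):

```python
import pandas as pd

# Hypothetical controlled vocabulary: each prefix implies a validation rule.
# IND_* columns are 0/1 indicators, AMT_* are non-negative amounts, ID_* are unique keys.
CHECKS = {
    "IND_": lambda s: s.isin([0, 1]).all(),
    "AMT_": lambda s: (s >= 0).all(),
    "ID_":  lambda s: s.is_unique,
}

def validate(df):
    """Dispatch a check to each column based on its name prefix."""
    for col in df.columns:
        for prefix, check in CHECKS.items():
            if col.startswith(prefix):
                assert check(df[col]), f"{col} violates its {prefix} contract"

df = pd.DataFrame({"ID_USER": [1, 2, 3], "IND_ACTIVE": [0, 1, 1], "AMT_PAID": [9.99, 0.0, 5.0]})
validate(df)  # passes silently; a column breaking its contract would raise
```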

~~~
lowwave
Yeah, convention over configuration is great! It contributed to the great
success of frameworks like Ruby on Rails as well.

------
Cieplak
This sounds a lot like Hungarian notation [1].

[1]
[https://en.wikipedia.org/wiki/Hungarian_notation](https://en.wikipedia.org/wiki/Hungarian_notation)

~~~
goto11
Yeah, and like Hungarian notation it is mostly useful in an untyped system
(like BCPL where Hungarian was invented).

If you have a type system, let it do the work for you.
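
A rough illustration of that point in Python (the types and names here are invented): with distinct types, the checker catches the mix-up that a name prefix only hints at.

```python
from typing import NewType

# Instead of encoding the kind in the name (iUserId, iOrderId), encode it in the type.
UserId = NewType("UserId", int)
OrderId = NewType("OrderId", int)

def cancel_order(order: OrderId) -> None:
    print(f"cancelling order {order}")

uid = UserId(42)
cancel_order(uid)  # a type checker (e.g. mypy) flags this; prefixed plain ints would not
```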

One of the most bizarre conventions I have seen is prefixing table names with
"TBL_" and column names "COL_". I suspect some people just really like
abbreviations.

~~~
mgkimsal
Haven't seen a COL_ prefix in a long time, but I still run across 'TBL_'
prefixes sometimes. I've seen it in practice for more than 20 years and never
saw any value. Then... I realized you might be pulling from a 'view' vs. an
actual table, and perhaps there's a bit of value in that knowledge?

I did run across someone joining a view against a table that was itself joined
to another view built by joining against that original table, and the query
did a few of these, and people wondered why it got slow. If the names had been
prefixed with view_, perhaps it might have been more obvious this would cause
problems? I think not, though - the original person who put it together was
apparently a DBA by trade and insisted this was normal and good, and 'my' code
(which... he'd also written?) was the culprit. "Worked fine when I was there."
Well... there were 1200 rows at that time; by 60000 rows, we ran into
large perf problems.

------
ZeroClickOk
Now I remember my pain working in a system with hundreds of tables and some
4-5 different column naming schemes :/

------
fmjrey
Reminds me of what clojure spec is trying to do: create data specifications
that can be named and reused via a registry. The metadata is just data about
the shape of the data, so it can serve as documentation, validation, data
generation, property testing, etc. Spec names are namespaced, but no specific
guidelines are given; it's just organizational.
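
A loose Python analogue of that registry idea (clojure.spec itself looks quite different; these names are invented):

```python
# Registry mapping a namespaced name to a predicate, loosely mimicking
# clojure.spec registration such as (s/def ::order/amount pos?).
REGISTRY = {}

def defspec(name, pred):
    REGISTRY[name] = pred

def valid(name, value):
    return REGISTRY[name](value)

defspec("order/amount", lambda x: isinstance(x, (int, float)) and x > 0)
defspec("order/id", lambda x: isinstance(x, int) and x >= 0)

print(valid("order/amount", 9.99))  # True
print(valid("order/amount", -1))    # False
```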

Separately, clojure also allows namespaced names for map entries, but again
provides no specific guidelines on what the namespaces should be: in code this
could be a library name, but it could also be an entity name, as suggested by
datomic best practices.

The combination of both features, namespaced var/entry names associated with
namespaced specs, seems to make more sense than name-part conventions, but if
you're not clojure all the way then the latter might be the only way.

------
paperwork
I really like the idea of being more thoughtful about naming columns and being
more explicit about the “type” of data contained in them.

Is this idea already known among data modelers or data engineers?

I’d love to read any other references, if available.

~~~
sixdimensional
I am a data architect in my day job. Within the realm of data management, I'd
say "metadata management" [1] is the general category this fits within.

I would say, yes this idea is known/very common, as data architecture is as
much about the descriptive language we use as anything. I mean, "business
glossaries", taxonomy, even just naming conventions [2] in coding, these are
all related.

If you build enough databases/tables or even code yourself, you inevitably
come across the "how to name things" problem [3]. If all you have to sort on
for the known meaning of a thing (column, table, file, etc.) is a single
string value, then encoding meaning into it is quite common. This way, a sort
creates a kind of "grouping". Many database vendors follow standard naming
conventions - Oracle, for example [4]. It is considered a best practice, when
designing/building the metadata for a large system, to establish a naming
convention. Among other things, it makes finding things easier and opens up
all kinds of potential for automation.

You get all kinds of variations on this, such as whether "ID" should come as a
prefix ("ID_") or a suffix ("_ID"). One's initial thought is to use it as a
prefix so all the related types group together, but that becomes much more
difficult if you want to sort items by their functional area (e.g.
DRIVER_ID, DRIVER_IND, etc.).
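
A quick Python illustration of the sorting trade-off (column names invented):

```python
# Prefix style groups by kind; suffix style groups by functional area.
prefix_style = ["DT_TRIP", "ID_DRIVER", "ID_VEHICLE", "IND_DRIVER"]
suffix_style = ["DRIVER_ID", "DRIVER_IND", "TRIP_DT", "VEHICLE_ID"]

print(sorted(prefix_style))  # ['DT_TRIP', 'ID_DRIVER', 'ID_VEHICLE', 'IND_DRIVER'] - IDs cluster
print(sorted(suffix_style))  # ['DRIVER_ID', 'DRIVER_IND', 'TRIP_DT', 'VEHICLE_ID'] - DRIVER_* clusters
```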

One other place you see something similar is in "smart numbers" which is an
eternal argument - should I use a "dumb identifier" (GUID, integer) or a
"smart one" (one encoding additional meaning) [5].

I mean, basically, any time you can encode information in the meta-data of
data, I think you can then operate on it by following "convention over
configuration" (as mentioned elsewhere in the discussion comments).

The only problem I see is that such conventions can at times be limiting,
depending on the length of your metadata columns and the variability you are
trying to capture. This is why I believe metadata is generally better
separated from, and linked to, the data it describes: this decoupling allows
for much more descriptive metadata than one could encode in a single string
value. Certainly, you can get a long way with an approach like this, but I
suspect you would run into 80/20-rule limitations.

Using naming in this way is a form of tight coupling, which could in some
cases be seen as an anti-pattern in terms of metadata flexibility.

[1]
[https://en.wikipedia.org/wiki/Metadata_management](https://en.wikipedia.org/wiki/Metadata_management)

[2]
[https://en.wikipedia.org/wiki/Naming_convention_(programming...](https://en.wikipedia.org/wiki/Naming_convention_\(programming\))

[3]
[https://martinfowler.com/bliki/TwoHardThings.html](https://martinfowler.com/bliki/TwoHardThings.html)

[4] [https://oracle-base.com/articles/misc/naming-conventions](https://oracle-base.com/articles/misc/naming-conventions)

[5]
[https://en.wikipedia.org/wiki/Smart_number](https://en.wikipedia.org/wiki/Smart_number)

~~~
westurner
In terms of database normalization, delimiting multiple fields within a column
name field violates the "atomic columns" requirement of the first through
sixth normal forms (1NF - 6NF).

[https://en.wikipedia.org/wiki/Database_normalization](https://en.wikipedia.org/wiki/Database_normalization)

Are there standards for storing columnar metadata (that is, metadata about the
columns; or column-level metadata)?

In terms of columns, SQL has (implicit ordinal, name, type) and then primary
key, index, and [foreign key] constraints.

RDFS (RDF Schema) is an open W3C linked data standard. An rdf:Property may
have an rdfs:domain and an rdfs:range; the possible datatypes are listed as
instances of rdfs:range. Primitive datatypes are often drawn from XSD (XML
Schema Definition) or [https://schema.org/](https://schema.org/). An
rdfs:Class instance may be within the rdfs:domain and/or the rdfs:range of an
rdf:Property.

RDFS is generally not sufficient for data validation; there are a number of
standards which build upon RDFS: W3C SHACL (Shapes Constraint Language) and
W3C CSVW (CSV on the Web).

There is some existing work on merging JSON Schema and SHACL.

CSVW builds upon the W3C "Model for Tabular Data and Metadata on the Web",
which supports arbitrary "annotations" on columns. CSVW can be represented in
any RDF serialization: Turtle/TriG/N3, RDF/XML, JSON-LD.

[https://www.w3.org/TR/tabular-data-primer/](https://www.w3.org/TR/tabular-data-primer/)

[https://www.w3.org/TR/tabular-data-model/](https://www.w3.org/TR/tabular-data-model/) :

> _an annotated tabular data model: a model for tables that are annotated with
> metadata. Annotations provide information about the cells, rows, columns,
> tables, and groups of tables […]_
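
For a flavor of what those column annotations look like, here is a minimal CSVW metadata document, built as a Python dict (the file and column names are hypothetical):

```python
import json

# Minimal CSVW metadata per the W3C Tabular Data Model: the tableSchema
# annotates each column with a name, datatype, and constraints.
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "payments.csv",
    "tableSchema": {
        "columns": [
            {"name": "id_user", "datatype": "integer", "required": True},
            {"name": "amt_paid", "datatype": {"base": "decimal", "minimum": 0}},
            {"name": "dt_paid", "datatype": {"base": "date", "format": "yyyy-MM-dd"}},
        ],
        "primaryKey": "id_user",
    },
}
print(json.dumps(metadata, indent=2))
```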

...

From
[https://twitter.com/westurner/status/901992073846456321](https://twitter.com/westurner/status/901992073846456321)
:

> _" 7 metadata header rows (column label, property URI path, DataType, unit,
> accuracy, precision, significant figures)"
> [https://wrdrd.github.io/docs/consulting/linkedreproducibilit...](https://wrdrd.github.io/docs/consulting/linkedreproducibility#csv-
> csvw-and-metadata-rows) _

...

From
[https://twitter.com/westurner/status/1295774405923147778](https://twitter.com/westurner/status/1295774405923147778)
:

> _Relevant: [https://discuss.ossdata.org/](https://discuss.ossdata.org/) topics: "Linked Data formats, tools, challenges, opportunities; CSVW, [https://schema.org/Dataset](https://schema.org/Dataset), [https://schema.org/ScholarlyArticle](https://schema.org/ScholarlyArticle)" [https://discuss.ossdata.org/t/linked-data-formats-tools-chal...](https://discuss.ossdata.org/t/linked-data-formats-tools-challenges-opportunities-csvw-schema-org-dataset-schema-org-scholarlyarticle/160) _

> _" A dataframe protocol for the PyData ecosystem"
> [https://discuss.ossdata.org/t/a-dataframe-protocol-for-
> the-p...](https://discuss.ossdata.org/t/a-dataframe-protocol-for-the-pydata-
> ecosystem/267) _

> _A .meta protocol should implement the W3C Tabular Data Model: [...]_

...

The various methods of doing CSV2RDF and R2RML (SQL/RDB-to-RDF mapping) each
have a way to specify additional metadata annotations. None stuffs data into a
column name (which I'm also guilty of doing with e.g. "columnspecs" in a small
line-parsing utility called pyline that can cast columns to Python types and
output JSON lines).
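
A minimal sketch of that columnspec idea (this is not pyline's actual syntax; the format here is invented):

```python
import json

# Hypothetical "columnspec": name:type pairs describing each whitespace-delimited column.
COLSPEC = "user_id:int,amt_paid:float,ind_active:bool"

CASTERS = {
    "int": int,
    "float": float,
    "bool": lambda s: s.lower() in ("1", "true", "y"),
    "str": str,
}

def lines_to_json(lines, spec):
    """Cast each column per the spec and emit one JSON object per line."""
    cols = [(name, CASTERS[typ]) for name, typ in
            (field.split(":") for field in spec.split(","))]
    for line in lines:
        yield json.dumps({name: cast(v) for (name, cast), v in zip(cols, line.split())})

for row in lines_to_json(["42 19.99 true"], COLSPEC):
    print(row)  # {"user_id": 42, "amt_paid": 19.99, "ind_active": true}
```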

...

Even JSON5 is insufficient when it comes to representing e.g. complex
fractions: there must be a tbox (schema) in order to read the data out of the
abox (assertions; e.g. JSON). JSON-LD is sufficient for representation; and
there are also specs like RDFS, SHACL, and CSVW.

Abox: [https://en.wikipedia.org/wiki/Abox](https://en.wikipedia.org/wiki/Abox)

~~~
sixdimensional
I see the line of thinking you're going down. There are ISO standards for data
types; in a sense, I can see why one would seek a standard language for
defining the metadata/specification of a type as data. I'll have to think
about that some more... In a way, a regex could be seen as a compact form of
expressing the capability of a column in terms of value ranges or domains, but
for defining the meaning of the data, not so much.

Your interpretation of the atomic columns requirement is a little different
than my understanding. That requirement of normalization only applies to the
"cells" of columnar data, it says nothing about encoding meaning into column
names, which are themselves simply descriptive metadata.

I mean, for sure you wouldn't want to encode many values/meanings into a
column name (some systems have length restrictions that would make that
impossible, and I'm not sure it makes sense anyway), but I'm just pointing out
that technically the spec does not make that illegal. Certainly, adding minor
annotations within the name of a column, separated by a supported delimiter,
does not, in my opinion, violate normalization rules at all. I mean things
like "ID_" or similar.

Have you looked at INFORMATION_SCHEMA in SQL databases [1]? You mentioned SQL
metadata and constraints; that is as close to a standard feature for querying
that information as there is. Some databases do it in similar but non-standard
ways (Oracle, for example).

Also, though not standard, many relational databases support extended
properties or metadata on objects (tables, views, columns, etc.); you can
often come up with your own scheme, although I rarely see people utilize these
features [2][3].
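
A hedged sketch of querying that standard view from Python, assuming a PostgreSQL database and psycopg2 (the connection string and table names are hypothetical):

```python
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical connection details
cur = conn.cursor()

# INFORMATION_SCHEMA.COLUMNS is the standard, cross-database way to read
# column-level metadata: one row per column with its declared type.
cur.execute("""
    SELECT table_name, column_name, data_type, is_nullable
    FROM information_schema.columns
    WHERE table_schema = 'public'
    ORDER BY table_name, ordinal_position
""")
for table, column, dtype, nullable in cur.fetchall():
    print(table, column, dtype, nullable)

# Postgres's COMMENT ON is one example of attaching free-form metadata
# directly to a column rather than packing it into the name.
cur.execute("COMMENT ON COLUMN public.drivers.driver_id IS 'surrogate key'")
conn.commit()
```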

At some point it feels like we are talking more about type definitions and
annotations, applied to data columns.

Maybe something like BNF [4], but purely for data table columns (which are
essentially types)?

[1]
[https://en.wikipedia.org/wiki/Information_schema](https://en.wikipedia.org/wiki/Information_schema)

[2] [http://www.postgresql.org/docs/current/static/sql-comment.ht...](http://www.postgresql.org/docs/current/static/sql-comment.html)

[3] [https://docs.microsoft.com/en-us/sql/relational-databases/sy...](https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-addextendedproperty-transact-sql)

[4]
[https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form](https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form)

------
anamax
There are two hard problems in computer science: naming things, cache
invalidation, and off-by-one errors.

The first is about people, not algorithms.

