
Naming things (2015) [pdf] - petethomas
http://www2.stat.duke.edu/~rcs46/lectures_2015/01-markdown-git/slides/naming-slides/naming-slides.pdf
======
pjc50
The real lesson of this discussion seems to be: metadata has failed our
expectations.

All this ancillary stuff that we'd like attached to files, like dates, client
names and projects, versions and so on, are metadata. Some systems keep
metadata _in_ files: EXIF, Word, PDF. Some systems have conventions for this
instead: header blocks in source code. But if neither of those applies? Only
place you can put it reliably is the filename :(

~~~
jasode
_> lesson of this discussion seems to be: metadata has failed our
expectations._

I've written several "disk and file catalog" utilities over the years so I
inevitably spent a lot of time thinking about the "metadata" problem.

I think the issue is that it's _impossible_ to solve metadata in a universal
way that satisfies all scenarios. This is why metadata often ends up being
inscribed into the filename. It's the "least bad" solution.

Let's take one example: scientific data in csv files. Typical
Comma-Separated-Value files do not have metadata fields such as author, measuring
device, timestamp of readings, GPS coordinates. (Yes, csv files sometimes have
a first line for "column names" which is arguably metadata but that's not the
higher-level metadata I'm talking about.)

Exactly _where_ does one put that high-level metadata?

1) If one makes a new pseudo-standard that treats any lines at the top of the
csv beginning with "//" as metadata, then modifying any metadata of a 100GB
csv file (e.g. changing the author from "John Doe" to "Jacob Doe") means
rewriting the whole 100GB file to add 1 byte. As a related issue, let's say
you have hash of "e1bb76e7391b93eb12" for the csv file. You really want a
stable hash that represents the actual "raw data" of the csv file. You don't
necessarily want the hash to change just because the metadata changed. In this
case, embedding metadata into the file itself makes certain operations worse
since typical hash utilities don't have "intelligence" about which parts of
the file are "important" for hashing. (A similar problem is scanning mp3 files
for duplicates. If 2 mp3 files have bit-identical audio output but the
metadata tags are different, are they the same or different?!? It depends.)

2) If you put metadata at the end, typical utilities won't know about
it. (UNIX has "tail" command but standard MS Windows does not. The tail
command is also unstructured and read-only which makes it a non-solution for
managing end-of-file metadata fields. Also, the "quick" view of GUI file
managers show the top of the file and not the bottom of it.)

3) If you put metadata in a separate file, it easily gets lost. File managers
like MacOS Finder and MS Windows Explorer don't know when 2 files are supposed
to be "treated as one unit" vs separately.

4) If you try to put metadata in a separate special area using OS file system
features such as MS "NTFS alternate data streams" or Mac OS X "resource forks",
it will get lost when transferring across incompatible filesystems or
uploading to Amazon S3.
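
The hashing concern in point 1 can be sketched in Python. This is only an illustration under assumptions: the "//" prefix is the hypothetical pseudo-standard from point 1, and `data_hash` and the sample rows are made-up names and data, not an existing tool. The idea is to hash only the data lines so the digest stays stable when the metadata changes.

```python
import hashlib
import io


def data_hash(stream, marker=b"//"):
    """Hash only the data lines, skipping leading lines that start with `marker`.

    `marker` is the hypothetical metadata prefix from point 1, not a real standard.
    """
    digest = hashlib.sha256()
    in_header = True
    for line in stream:
        if in_header and line.startswith(marker):
            continue  # metadata line at the top: excluded from the hash
        in_header = False  # first data line seen: everything after is data
        digest.update(line)
    return digest.hexdigest()


# Two made-up files that differ only in the author metadata line.
original = io.BytesIO(b"// author: John Doe\ntime,value\n0,1.5\n1,1.7\n")
renamed = io.BytesIO(b"// author: Jacob Doe\ntime,value\n0,1.5\n1,1.7\n")

same_data = data_hash(original) == data_hash(renamed)
print(same_data)
```

An ordinary whole-file hash would differ between the two files, which is exactly the tradeoff described above: the utility has to know the convention before it can ignore the metadata.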

If one is feeling uncharitable, one could say that MS WinFS[1] was a spectacular
failed attempt at unifying metadata. (A relational database that makes
metadata more of a 1st class concept.) Nobody has tried it on that level
since. Even Apple's new APFS file system didn't have the same metadata
ambitions as WinFS.

The combination of tradeoffs leads everybody to re-invent the idea of
embedding metadata (including namespacing hierarchies) into filenames. The
article's suggestions for scientific data filenames look very similar to
filenames that companies end up using for ETL pipelines.[2]

[1] [https://en.wikipedia.org/wiki/WinFS](https://en.wikipedia.org/wiki/WinFS)

[2]
[https://en.wikipedia.org/wiki/Extract,_transform,_load](https://en.wikipedia.org/wiki/Extract,_transform,_load)

~~~
demonshalo
Couldn't we just come up with some convention for "expanded filenames" where
the meta-data is included in the file name itself? In the UI portion, you see
what you see now, no difference, but say anything after the // delimiter in
the file name is considered meta data and not shown in the windows/terminal
UI.

Not sure if it's a good solution but if I were to put the meta data somewhere
I would somehow try to put it in the identifier of the file (the name) as it
is data that would help me identify the file AND its content!

~~~
dsp1234
This is partially what filesystem forks/streams were supposed to be for. The
big problem, as mentioned, is that this works for the OS, but isn't somehow
transferred to third parties.

[https://blogs.technet.microsoft.com/askcore/2013/03/24/alter...](https://blogs.technet.microsoft.com/askcore/2013/03/24/alternate-data-streams-in-ntfs/)

[https://en.wikipedia.org/wiki/Fork_(file_system)](https://en.wikipedia.org/wiki/Fork_\(file_system\))

~~~
demonshalo
thanks for the links!

------
dghf
The problem I've found with file names like those described as "awesome" in
the fifth slide is that if you have a bunch of them open at once, your
taskbar/switcher/windows menu truncates them all to something like
"2013-06-26_BRAFWTNEG...", making finding the one you want a bit more
burdensome.

Jakob Nielsen had a post (a link for which I cannot find) recommending that
web-page titles put the most specific information at the beginning. Doing
something similar with file names (e.g., calling them
"H01_MutantFraction...2013-06-26.csv", etc) would trade some of the advantages
of the proposed scheme for speed of finding and switching between files when
you're actually using them.

------
k_sze
If you look at the filename examples, there seems to be an _implicit_
suggestion of naming a group of related files using a common prefix.

If one needs to distinguish _groups_ of files, why not just put them in
_directories_? That's the reason directories exist, no?

I can somewhat understand if some (bad) software is written to look for files
only in a single directory and you have to put everything there. But
otherwise, it seems pretty pointless to use a common prefix and make filenames
longer.

~~~
jimktrains2
You lose the information in the directory title if the file is downloaded or
viewed in an application title bar.

~~~
nerdinja
yeah, thinking along these lines filenames themselves are the easiest way to
display the contents of a file, and that data travels with each and every file
no matter where they're mv'd to, uploaded, deployed, shared, etc.

It's like brand name packaging, all the information including nutrition packed
neatly on the outside. You don't go to the store and buy 'bread' you buy
'2017-08-18-00-natures-own-dbl-fiber-wheat'.

This whole pdf resonated with me because it made me realize I'd developed
these almost identical practices without knowing it. Mostly over time, through
trial and error and a kind of natural selection, for the sake of ease.

Cool stuff.

------
stevewillows
I'm constantly harping on everybody to pay attention to their file naming.

I'm a graphic designer, so for me everything is
Client/YYMM-Project/_FINAL/YYMM-COLLATERAL-NAME

Within each project there is a _PROCESS folder with a _ELEMENTS subfolder for
pieces the client has given me to work with.

For invoices I do YYMMDD-ClientName-Project-Sum.pdf. When the invoice is paid,
I rename the file to add -PAID- before the client name. It's simple, but it's
allowed me to easily track and maintain projects and billing over the years.
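
A rename like this is easy to script. The following Python sketch assumes the YYMMDD-ClientName-Project-Sum.pdf layout described above; the function name `mark_paid` and the sample invoice name are invented for illustration.

```python
def mark_paid(filename):
    """Insert -PAID- after the leading date, i.e. before the client name."""
    date, rest = filename.split("-", 1)  # split off the YYMMDD prefix only
    if rest.startswith("PAID-"):
        return filename  # already marked: leave the name unchanged
    return f"{date}-PAID-{rest}"


# Hypothetical invoice name following the scheme above.
print(mark_paid("170818-AcmeCo-Logo-1200.pdf"))
```

Because the marker goes in a fixed position, paid and unpaid invoices for the same date still sort next to each other.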

If I end up working for another 83 years, I guess I'll pad the year with a
0...

Proper file management is an undervalued skill and should be taught both in
school and in corporate environments. In an old tech job we had a public
folder on the server that was total chaos. So many people insisted on naming
their files MAY-%day%-%contents%-%personsname% -- and, as you'd expect, people
spent countless hours per year trying to hunt down that one file so-and-so
worked on before they left for another job.

~~~
dghf
> I'm a graphic designer, so for me everything is
> Client/YYMM-Project/_FINAL/YYMM-COLLATERAL-NAME

> Within each project there is a _PROCESS folder with a _ELEMENTS subfolder
> for pieces the client has given me to work with.

Out of curiosity, why do you have the year/month in the names of both the
project folder and the collateral file?

And why do you start the names of your final/process/elements folders with
underscores?

~~~
dkarl
I can't speak to stevewillows' answer, but my girlfriend uses a similar naming
scheme in her architecture work. The project folder contains a year-month
prefix indicating when the project started, and any year-month prefixes below
that indicate when that part of the project was initiated. It maps the project
in time for her: this folder tells her when she first talked to the client,
this folder tells her when design started, this folder tells her when she got
the first construction bids. The date on the top-level folder is extremely
useful when looking through old projects. Just from consulting those files so
often, she knows when most of her major projects were started and finished and
how long each phase took, which is something I don't remember (and can't
reconstruct) for most of the projects I've worked on. She has folders like
this going back ten years. I can't even imagine what that would look like for
my projects. I'm a bit jealous.

------
efitz
The author forgot to write down the most important reason for the whole
exercise: file transfer.

Modern operating systems index the contents of your files, so finding all your
files on project "foo" is only a search away. If you are a GUI user, then file
naming isn't really that important for locating data on your machine.

File names matter because there are 40 years of cruft out there that
absolutely refuse to move metadata along with files. So you can touch your
files to set dates, organize them in directories or tag them to your heart's
content in Mac OS or Windows, but you will lose all that information when you
attach the file to an email or put it in DropBox.

So you only have a choice of two places to put metadata in such a way that the
metadata will be carried along with the file: the file name or the file
contents.

Putting the data in the file contents lacks discoverability and in many cases
the applications you use to manipulate the files don't allow for additional
metadata anyway. Also, some file types (Word .docx files, jpegs, MP3s) get
their metadata updated and/or scrambled when you open them with specific
applications. So really your only valid choice is to put it in the file name.

The author's specific recommendations (use underscores and hyphens for
delimiting) assume that you really want to access the files with the command
line and use globbing. Other than that implication, the recommendations are
sound.
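
As a rough illustration of why consistent delimiters help with globbing, here is a minimal Python sketch. The filenames and the four field names are made up for the example (they are not taken from the slides), and it assumes underscores separate fields while hyphens only appear inside a field.

```python
import fnmatch
from pathlib import Path

# Hypothetical filenames: date_project_instrument_well.csv
names = [
    "2017-08-18_projectfoo_plate-reader_A01.csv",
    "2017-08-18_projectfoo_plate-reader_A02.csv",
    "2017-08-19_projectbar_plate-reader_A01.csv",
]


def fields(filename):
    """Split the name (without extension) into its underscore-delimited fields."""
    return Path(filename).stem.split("_")


# A shell-style glob picks out one project without opening any file.
foo_files = fnmatch.filter(names, "*_projectfoo_*")

# The same delimiter lets a script recover the metadata from the name.
date, project, instrument, well = fields(names[0])
print(foo_files)
print(date, project, instrument, well)
```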

------
avian
> avoid [...] accented characters

It's certainly good advice and I definitely avoid using non-ASCII characters
in filenames in practice. But I can't help thinking that advice like that is
why support for Unicode is still buggy in many places.

I see nothing fundamentally wrong with using non-ASCII for filenames (and the
slides don't give any reasoning), if only random software wouldn't mangle
encodings, sorting order, or plain refuse to accept such filenames.

~~~
tellarin
In my view the main problem is in sharing those files with others. Other
systems may not support those chars or the other users may not be aware of how
they are sorted and even how to type them in a search.

------
alexilliamson
For those who don't know Jenny Bryan, she is a wonderful force for good in the
R community. It seems like these filenames are a little contentious here on
HN, but IMO this will always be a big improvement over the file-naming
practices of someone who has given little or no thought to the topic. Which I
am guessing was her intended audience.

If you're getting into R or data analysis, check out
[http://stat545.com/topics.html](http://stat545.com/topics.html). She has put
a lot of thought into the project management aspects of carrying out a data
science project that don't get discussed as often as other, sexier topics.

------
andrewflnr
My only real concern is with left-padding numbers with 0s when you don't know
in advance how big the numbers are going to get. Do you pad to 2 or 3 digits
or...

~~~
gdickie
When you have a long list of numbered filenames with no padding, pipe the list
through "sort -V" for "version number" ordering.

Edit: GlitchMr has a better solution below, "ls -v".

------
GlitchMr
Why left-pad numbers when you could simply use better tools that sort numbers
properly (instead of using ASCII order, which doesn't even make sense for
ordering purposes)? For instance, use `ls -v` instead of `ls` (possibly as an
alias).
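
For scripts, the same idea can be approximated in Python. This is a simplified sketch of a "natural" sort key, not a reimplementation of `ls -v` or `sort -V`, whose exact rules differ; `natural_key` and the sample names are made up for illustration.

```python
import re


def natural_key(name):
    """Split a name into text and digit runs so digit runs compare as integers."""
    parts = re.split(r"(\d+)", name)
    # With a capturing group, odd positions hold the digit runs.
    return [int(part) if index % 2 else part for index, part in enumerate(parts)]


names = ["file10.txt", "file2.txt", "file1.txt"]

plain = sorted(names)                    # character-by-character order
natural = sorted(names, key=natural_key)  # numeric-aware order
padded = sorted(f"file{n:03d}.txt" for n in (10, 2, 1))  # zero-padding alternative

print(plain)
print(natural)
print(padded)
```

The padded variant sorts correctly with any tool, at the cost of guessing the width up front; the natural key needs no guess but only helps where you control the sorting.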

------
Kagerjay
The pdf didn't mention the most important point: keep it short and to the
point.

It's so hard to use a CLI and type that filename every time, or to navigate to
it in a file explorer, when it's so long to read. The important bits of info
should always be at the start.

ISO date conventions aren't necessary, since most files have metadata
associated with them (created and modified dates), so adding an ISO date is
redundant for human-made files. You can always use a bulk renamer at any point.

But my point still stands: just sort regularly and add some sorting identifier
at the front of the file name. Depending on who is working on the file and in
what context, a simple number at the front suffices (01, 02, 03, etc.), or it
can be a word with a version number at the end.

Lastly, the author didn't mention folder names. Those need to be one or two
words at most, to help segregate information if there are lots of files in
that one folder.

If you're creating machine-made / autogenerated saved reports, following a
standard ISO date convention makes sense though, with regexable slugs, etc.

------
hyporthogon
There's some empirical work on how developers encounter and respond to naming
anti-patterns, e.g.
[http://www.veneraarnaoudova.com/wp-content/uploads/2014/10/2...](http://www.veneraarnaoudova.com/wp-content/uploads/2014/10/2014-EMSE-Arnaodova-et-al-Perception-LAs.pdf)
and an associated Google Scholar search:
[https://scholar.google.com/scholar?hl=en&as_sdt=0,34&q=lingu...](https://scholar.google.com/scholar?hl=en&as_sdt=0,34&q=linguistic+anti+patterns+and+how+developers)

------
adams_at
I always name from general to specific — left to right.

Example: client_project_element-name_20170818.txt

When sorted by name, similar items are grouped together.

------
snarfy
MMDDYYYY

I like to call this middle endian.

------
euske
For some reason I want to avoid names beginning with numbers. All my filenames
can be used as an identifier (except the extension part).

i.e. [a-zA-Z][0-9a-zA-Z_]+\.[a-z0-9]+
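
A quick way to check names against that pattern, reading the escaped dot as a literal dot, is a small Python sketch like the one below. The helper name `is_identifier_style` and the sample filenames are made up for illustration.

```python
import re

# The pattern from the comment above: a letter, then one or more letters,
# digits, or underscores, then a dot and a lowercase alphanumeric extension.
IDENTIFIER_NAME = re.compile(r"[a-zA-Z][0-9a-zA-Z_]+\.[a-z0-9]+")


def is_identifier_style(filename):
    """Return True if the whole filename matches the identifier-style pattern."""
    return IDENTIFIER_NAME.fullmatch(filename) is not None


for name in ["notes_2017.txt", "2017_notes.txt", "my-notes.txt"]:
    print(name, is_identifier_style(name))
```

Note that, as written, the pattern requires at least two characters before the dot, so a name like "a.txt" does not match.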

------
heinrichhartman
I name my files like this:

    
    
      2017-04-04 #S4907 #Choir List of names.pdf
      2017-04-05 #EXCITE #S=5005 Notes on Data Repositories.pdf
      2017-04-06 #ARAG #S=5031 #Amount=14.2 #CUR=EUR Invoice.pdf
    

with the following semantic:

* date. Every file name is prefixed with the ISO date to facilitate sorting

* tags. The syntax #<tagname> to categorize documents with tags.

* key-value pairs. The syntax #<key>=<value> lets us attach structured information to the document name.

and keep them in a single large folder.
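
To make the scheme concrete, here is a minimal Python sketch of how such a name could be parsed. It is not the code from the pile tool; `parse_name` is a hypothetical helper, and it assumes that tokens are separated by single spaces and that every tag or key-value token starts with "#", as in the examples above.

```python
from pathlib import Path


def parse_name(filename):
    """Split a name into date, tags, key-value pairs, title, and extension."""
    path = Path(filename)
    date, *tokens = path.stem.split(" ")  # the first token is the ISO date
    tags, pairs, title = [], {}, []
    for token in tokens:
        if token.startswith("#") and "=" in token:
            key, value = token[1:].split("=", 1)
            pairs[key] = value  # e.g. "#CUR=EUR" -> {"CUR": "EUR"}
        elif token.startswith("#"):
            tags.append(token[1:])  # e.g. "#ARAG" -> "ARAG"
        else:
            title.append(token)  # everything else is the free-text title
    return {
        "date": date,
        "tags": tags,
        "pairs": pairs,
        "title": " ".join(title),
        "ext": path.suffix,
    }


info = parse_name("2017-04-06 #ARAG #S=5031 #Amount=14.2 #CUR=EUR Invoice.pdf")
print(info)
```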

On top of this, I have written some shell tooling to normalize, view, and
shuffle around those documents:
[https://github.com/HeinrichHartmann/pile](https://github.com/HeinrichHartmann/pile)

E.g. `pile extract EXCITE` will extract all files with tag #EXCITE to a
separate folder named #EXCITE. There is also a HTML form that helps with
proper naming of new files.

File management is still a pain for me, but this at least gives me some
confidence that I can retrieve stored documents reasonably well. I hope that
one day I'll be able to auto-generate expense reports and tax filings from
properly tagged filenames.

~~~
grif-fin
You are suggesting a file name format with spaces in it?

~~~
heinrichhartman
yeah. I just can't stand all those dashes ;)

Just make sure to quote your variables in bash.

------
mulmen
I'm not a fan of dates at the beginning of the file name. If a file name needs
a date I always put it at the end so future versions group when sorting the
files.

------
stevenschmatz
Nice rules :) I especially like the use of underline-delimited metadata in
filenames, saved me on a huge research deadline once.

------
deGravity
It's a nice set of guidelines, but why the snail on page 15?

~~~
jesusth1
I think it's to embrace the slug:
[https://en.wikipedia.org/wiki/Semantic_URL#Slug](https://en.wikipedia.org/wiki/Semantic_URL#Slug)

------
sguav
Considering that most of the world uses the "little-endian" format for writing dates
(
[https://en.wikipedia.org/wiki/File:Date_format_by_country_(n...](https://en.wikipedia.org/wiki/File:Date_format_by_country_\(new\).png)
), how come ISO 8601 settled on "big-endian"?

Not so practical for anything else but file naming IMHO...

~~~
mulmen
First off and most importantly ISO 8601 is a standard for _data interchange_
so being easy to parse visually and with a computer is a feature. ISO 8601
groups and sorts well without needing special rules and remains consistent all
the way from the year to the millisecond. It is easier to parse visually when
you are looking at a list of values, especially if they are similar.

Second, "little endian" dates are inconsistent because the year is still big
endian. If you want to remain consistent you would have to write the current
year as 1720 (or even 7120!) because the years (or decades) are smaller than
the century. To achieve the consistency of ISO 8601 with little endian time
you would also have to write seconds before minutes and minutes before hours.
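
The sorting claim is easy to check with a small Python sketch. The three dates are arbitrary sample values chosen for illustration; the point is only that plain string sorting of ISO 8601 dates matches chronological order, while day-first strings do not.

```python
from datetime import date

# Arbitrary sample dates, deliberately out of order.
days = [date(2017, 8, 18), date(2016, 12, 1), date(2017, 1, 30)]
chronological = sorted(days)

iso = [d.isoformat() for d in days]           # YYYY-MM-DD
dmy = [d.strftime("%d-%m-%Y") for d in days]  # DD-MM-YYYY

# Compare plain string sorting against true chronological order.
iso_sorts_chronologically = sorted(iso) == [d.isoformat() for d in chronological]
dmy_sorts_chronologically = sorted(dmy) == [d.strftime("%d-%m-%Y") for d in chronological]

print(iso_sorts_chronologically)
print(dmy_sorts_chronologically)
```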

~~~
sguav
Right, very good point about the endianness of the year. Makes sense now.

------
kutkloon7
Those filenames look _terrible_ to me.

BRAFWTNEGASSAY doesn't make any sense. To distinguish between the filenames you
actually have to read the full filename, and with such long filenames chances
are they won't be fully displayed, so that you have
2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFrac... for the first
four files.

Keeping cruft out of your filenames seems like a much, much better way to name
files. Also, most systems keep track of the creation date, no need to keep it
in the filename. I think it's better to give files an id.

~~~
jpcosta
BRAFWTNEGASSAY would make sense to the owner of the file or someone working in
that particular project. Consider it a project name, or a keyword that is
relevant in that particular context. If you're working with files from
different sources with multiple contributors this sort of approach works
brilliantly.

You could have named it differently:
2013-06-26_KUTKLOON7_Plasmid-Cellline-100-1MutantFrac

Creation date can sometimes be lost if you copy/move the file between
different media.

