Taxonomy Is Hard (autodidacts.io)
132 points by serhack_ on Nov 1, 2022 | 82 comments



Tagging and categorizing are two subtly different things to do. Having dealt with a lot of real world data, all I can say is that getting your hands on consistently tagged or categorized data is hard and gets harder the more data sources you have.

A real world example of how tagging can be both super useful and get out of hand is OpenStreetMap. The only metadata allowed there is tags. The OSM community depends on people using tags correctly. But of course the tagging is incomplete, inconsistent, and subject to regional and data-source-specific variations, which makes interpreting the tags a bit of a dark art.

But it still adds up to a very complete and rapidly evolving world map. So, there's that.

I've had the pleasure of seeing the documentation for Navteq's internal metadata schema for their maps while I still worked for Nokia Maps, which at the time owned Navteq. These days the whole thing is known as Here Maps. This was a PDF of around 4K pages. Thousands of attributes. Lots of weird little details related to traffic lights, subway entrances and exits, and other features you have on maps. This stuff gets complicated quickly.

Two very different approaches to the same problem. I think I like the OSM way a bit better. Neither is easy.

My job at the time was trying to align some of that data with data we got from external data sources like TripAdvisor, Qype, HRS, and a few others. Way too much time got invested in dumbing down categories, mapping one to the other, and trying to make sense of stuff. We had lots of issues with duplicate POIs because of all sorts of subtle differences in how different data sources were annotated with categories, tags, etc. Some of the data just isn't that good, complete, or consistent and you have to deal with that.


My thinking is that categorization is a dead end. Here is a problem: take a car and a truck. Now slowly, bit by bit, morph the car into the truck. At what point is the car a truck? Or is it ever a truck? I personally cannot see this as a problem that can be solved. Maybe you can do better.

My conclusion is that composition is a better method: has-a. And composition by capability is even better. Not great, but better. I was part of several efforts by standards groups to come up with SNMP standards for storage devices. What is a disk? And it has a controller, right? Wait, now the controller is in the disk. Wait, now there are RAID systems. Wait...

Trying to build a system of composable tags for functionality seems like it might have a chance. Can it store something? Can it retrieve it? Does it have a retrieval time? Etc.
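One way to sketch this "describe by capability" idea is with duck typing. The class and protocol names below are illustrative, not from any standard; this is just a minimal example of classifying a device by what it can do rather than what it "is":

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class Store(Protocol):
    def store(self, key: str, data: bytes) -> None: ...

@runtime_checkable
class Retrieve(Protocol):
    def retrieve(self, key: str) -> bytes: ...

class RamDisk:
    """Toy device: it advertises capabilities simply by implementing them."""
    def __init__(self):
        self._blocks = {}
    def store(self, key, data):
        self._blocks[key] = data
    def retrieve(self, key):
        return self._blocks[key]

def capabilities(device) -> set:
    """Describe a device as the set of things it can do."""
    return {c.__name__ for c in (Store, Retrieve) if isinstance(device, c)}
```

A controller-in-a-disk or a RAID array then just reports a different capability set, without anyone having to decide which taxonomic box it belongs in.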

Another annoying problem is the categories on Craigslist. Where do cooking pots and pans go? For sale. A locality. Household items. Really? Nothing under cooking or even kitchen? I'm not saying this is wrong, I'm just thinking about how I go about finding stuff other than just "search title".


Discrete categorization of continuous phenomena is always going to be inherently arbitrary. It's important to keep in mind what your goals are. This sort of categorization shouldn't be done with the goal of finding some sort of fundamental platonic truth; that's not going to work. Rather, discrete categorization is performed simply because humans find these categories useful when communicating.

The fuzzy line between truck and car isn't a problem when you approach categorization with this mindset. If you see a bank robber fleeing in an El Camino, you can tell the cops they fled in a "truckish car" or a "carish truck"; you don't have to neatly categorize an El Camino as one or the other to get your point across, but the arbitrary categories still help you communicate the idea.

If you're tagging pictures and come across an El Camino, and you don't have a "coupe utility vehicle" tag, you can simply tag the El Camino as both a truck and a car.


This runs the very real risk of thinking you can overload onto something all the "tags" that you want.

That is, embrace the restrictions that come from a taxonomy. It is a bit of a lie, but a lot of it is defined by contrasts, and those contrasts help build things.

What distinguishes a car from a truck? One is oriented more toward holding people than holding cargo. But if you don't already know what a truck or a car is, you are unlikely to know from just that description. Indeed, you could wind up with a cargo van. Which is different from a passenger van, in much the same way.

To that end, we present the taxonomy as a hierarchy of descriptive properties when, in many cases, it is a hierarchy of representative samples. In programming, we distinguish between class-based and prototype-based object-oriented representation. In reality, most taxonomies present class-style relationships, but using prototype-based representation from a population.


> I personally cannot see this as a problem that can be solved. Maybe you can do better.

Both library science and cognitive linguistics have solutions for/discuss this problem.


I would say what they call 'composition by capability' is a theory of categorization, in fact similar to Wittgenstein's family resemblance theory.

That said, it's not obvious that the underlying data model needs or should map to a theoretical model for how human cognition works. Maybe it should and it's worth considering the theoretical landscape before setting out, but there are other features that come into play.

I know very little about library science, but I assume they take a more practical (for this task) approach to this that's worth looking into.


You should definitely look into it, especially if you like pedantic arguments.


Not disagreeing but unfortunately a lot of data sources have different opinions on this. And sometimes categorizations are actually useful indicators to tell what is what.

Composable and namespaced tags are exactly what we do at my company. It gives us a lot of flexibility and mirrors what OSM does. You can do a lot of duck typing against such a system, and it will work as well as your data allows without breaking completely when you get bad data. You just lose some features.
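A minimal sketch of what "duck typing against namespaced tags" might look like, in the OSM-like `namespace:value` style. The function names and the degradation policy (skip malformed tags rather than fail) are my own illustration, not any specific company's or OSM's implementation:

```python
def parse_tags(raw_tags):
    """Split 'namespace:value' strings into {namespace: set(values)}.
    Malformed entries are skipped, not fatal -- bad data degrades
    the result instead of breaking it."""
    parsed = {}
    for tag in raw_tags:
        if ":" not in tag:
            continue  # bad data: ignore and lose a feature, don't crash
        ns, _, value = tag.partition(":")
        parsed.setdefault(ns, set()).add(value)
    return parsed

def is_restaurant(tags):
    """Duck-typing-style check: anything tagged amenity:restaurant counts,
    regardless of whatever else it is tagged as."""
    return "restaurant" in tags.get("amenity", set())
```

An untagged or half-tagged POI simply fails the check instead of poisoning the whole pipeline.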


> At what point is the car a truck?

When you can use it as a truck (i.e. when it can do what the car isn't adequate for).


All true. Taxonomy is indeed hard. But, does it actually matter?

It seems what matters is not how files are stored/organized, but how one can find the files one is looking for. Taxonomy is mostly a search problem.

Yet, although one is usually capable of remembering specific or unique details about a file, it's still incredibly hard to search for a file or its contents effectively.

Dropbox, to pick just one example, ditched all its advanced search functionality a couple of years ago and only lets you search for any keyword (not all of them), with forced stemming and no filtering of the results.

I seem to remember there were startups a decade ago trying to address the search problem on the desktop but they all got acquired or folded; I don't understand why. The problem is real and seems quite solvable; yet it seems there's no actual market for it. It's a bit of a mystery.


> it's still incredibly hard to search for a file or its contents effectively.

Underappreciated aspect of this is trust. As much as "taxonomy is mostly a search problem", search is in large part a trust problem - trust that the search was done exhaustively, and if it returned no results, it means there aren't any.

This is a problem that IMO most offerings are blind to. One big offender for me is Windows Explorer. It has a search tool that can search in names, metadata, and file content. Yet in the past, it frequently failed to find files I knew were there. Truth is, it may not even have been looking in the right place - it depends on what's been indexed and how, which is information you can find somewhere in the system, but notably not in the search interface itself. I stopped using it long ago, as I don't trust it at an emotional level. I prefer to literally walk the filesystem structure by hand.

And, on Windows, I at least get that option. This is the fallback, the baseline: even if I don't trust a search engine to be exhaustive, I can cope with it as long as I can perform an exhaustive search manually. I.e. as long as I have a way to list everything. But this, then, is almost universally missing from cloud offerings.

(There's a whole rant to be written about the incredibly dumb idea of hiding the filesystem / database from the end-user, but I'll skip it today.)

> there were startups a decade ago trying to address the search problem on the desktop but they all got acquired or folded; I don't understand why. The problem is real and seems quite solvable; yet it seems there's no actual market for it. It's a bit of a mystery.

Nobody really wants to solve this problem anymore, because it conflicts with the major thing vendors want: for you to move your data into their cloud. The desktop, and the ability to own your data, survives only because most cloud systems are still shit. This might eventually change, and in the meantime, the market definitely isn't interested in helping you own your data.


I think what you're describing is a database query, not search. Heuristic search is going to be non-exhaustive by definition.

Desktop-focused databases are of course widely available (Paradox, Access, LibreOffice Base, etc.), but the volume of data you can manage with them is limited. I.e. non-toy datasets will naturally be hosted on some sort of "cloud".


> I think what you're describing is database query, not search.

What's the difference?

> Heuristic search is going to be non-exhaustive by definition.

To me, a non-exhaustive search is broken by design, because it will miss results. Making data accessible only through such search means there will be a hole in the system into which some of the data falls, never to be seen again.


Elasticsearch in particular can do the kind of queries most people expect from a database, see

https://www.elastic.co/kibana/

At work we have a search interface powered by Postgres that uses the full text index but also uses GIN indexes on arrays to index things based on categories/tags.


That argument is much more convincing in the "open world" of internet searches, where enumerating all results is simply not going to happen.

In the scenario where you're searching for a file in a group of low-hundreds of files, it's nothing but a bad cop-out excuse.


> because it conflicts with the major thing vendors want

Maybe, but that doesn't explain why there are no startups on the case, that wouldn't have this conflict of interest.


You'd think it wouldn't be that hard to build a personal search engine on top of Lucene or Elasticsearch and on one level it isn't. But there are two very hard problems.

(1) Performance. Back in the day, people frequently turned full-text indexing on Windows off because it would slow down their computer too much. People won't be happy with the overhead of a search engine that is always scanning tens of GBs of documents.

(2) Search quality. People are familiar with Google being an effective search engine, but they've certainly tried a number of search tools with terrible relevance scoring and probably learnt that it's not worth trying the search on a web site or in an application's help. In Elasticsearch's case, the default similarity is BM25, but that has two tunable parameters. There are other similarities you could use, but most of them have tunable parameters too. It makes a real difference which you choose, and there is a methodology for tuning them that is now built into Elasticsearch.

https://www.elastic.co/guide/en/elasticsearch/reference/curr...
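For readers unfamiliar with it, BM25 and its two tunable parameters can be sketched in a few lines. This is a plain-Python illustration of the standard formula, not Elasticsearch's actual implementation: `k1` controls term-frequency saturation and `b` controls document-length normalization, and the defaults below mirror the commonly cited 1.2 / 0.75:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each doc (a list of terms) against a query with BM25.
    k1: how quickly repeated terms stop adding score (saturation).
    b:  how strongly long documents are penalized (length normalization)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                      # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)                 # term frequency within this doc
        s = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

Tuning means picking `k1` and `b` against a set of judged queries, which is exactly the methodology the linked Elasticsearch docs describe.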

I talked to about 20 vendors of full-text search products and found that only 2 of them regularly evaluated the quality of their results, and 1 of those just did it so they could get some advertising by being on the TREC leaderboard. They told me over and over again that customers didn't care about search quality; they just wanted to see a list of 350+ data sources that the product could index.


> They told over and over again that customers didn't care about search quality, they just wanted to see a list of 350+ data sources that the product could index.

This seems to describe corporate buyers ticking boxes on a form, more than actual users.


Sometimes those box tickers are your customers, as they're the ones paying you. That's how you get Enterprise Software.


Who would fund those, and where's the potential for outsized growth?


I think the problem is there's no demand, not that incumbents are fighting it. But I also think it's weird and a mystery that nobody wants good search.

The trust issue you mention may be a clue.


> I think the problem is there's no demand

Computing is a supply-driven market. The demand is there, it's just partially latent, partially ignored. The vast majority of technology users have no choice but to choose from what's being offered, and the minority of tech-savvy users with opinions are increasingly too small a niche to support "power user" tools.


There isn't even an open source product with traction.


ripgrep solves the "precise full-text search" problem quite nicely.


There is that, and locate.

Most people have Word Documents, PDF files, and other things that need a more complex indexing strategy. Also a lot of people have lots of image and audio files which pose their own challenges, namely indexing textual metadata and possibly some indexing of the content.


> startups a decade ago trying to address the search problem on the desktop but they all got acquired or folded; I don't understand why.

Work migrated to the cloud.


True if you know what you're looking for. But apparently discovery is highly underrated by many.


Yes! I think this is a key concept that is often overlooked. Taxonomies -- and ontologies for that matter -- can often serve as a guide to the corpus, particularly when a user has less familiarity with the knowledge space.

And yes, any taxonomy is by definition always a subjective, biased construct. But this is often useful in helping to reveal the preferences and motivations behind the creation and curation of the corpus itself.


I'd guess only a small portion of people actually care about that feature. People are maybe better at remembering what's in their files by name alone than we give them credit for, or willing to just look through things to find them, or better organized via folders than we might assume.

Important for technical users, not so much regular people.


For Windows there's Everything, a pretty good search tool.

https://www.voidtools.com/support/everything/


macOS and Windows got built-in search around 15 years ago. Anything more advanced would be very niche, so there's probably no market left.


More like 25 years ago. https://en.wikipedia.org/wiki/Mac_OS_8 had it in 1998 (possibly earlier. It isn’t clear to me whether Sherlock became PowerPC native in 8.5 or was new and PowerPC native in 8.5), Windows NT in 1996/2000 (https://devblogs.microsoft.com/windows-search-platform/the-e...)


Windows search only finds things sporadically though. I turned off search indexing on my Mac because otherwise I'd have to go out for coffee when I open any application.


Taxonomy is a demon that separates people into perfectionists and non-perfectionists just before dragging both kinds into a hell of exceptions, weird relations, and impractical locations. Perfectionists get stuck spending infinite amounts of time engineering the taxonomy; non-perfectionists face the quirks later.

Tags are better but can turn out to be even harder (for similar reasons, amplified combinatorially).

Labels are the most practical. The Gmail inventor was a genius. Folders must be gone (except for system files).


How are "labels" different from tags?


"Label" — if I understand GP's point — is just a random string you attach to an item. "Amsterdam" could be a label, and it can be attached to a PDF of an old map of the city of Amsterdam, to a photo of your neighbor's dog named Amsterdam, or to an expense report for the project you're building on Amsterdam street. You decide what to attach it to, and it's only important what "Amsterdam" means to you. The search for "Amsterdam" label would bring you all of the above.

"Tag" — if I understand GP's point — implies some structure. In the above examples, it would rather be "ByLocation::Planet Earth::Europe::Netherlands::Amsterdam", "BySubject::Animals::Dogs::Amsterdam(MyNeighborsDog)", and "MyProjects::ByYear::2021::ProjectOnAmsterdamStreet" — or similar. In this case, if you're interested in the neighbor's dog, you're searching by its specific tag, or — if you don't remember its name — a search for "BySubject::Animals::Dogs" might help you. Any such search will also keep the other "Amsterdam" results away.

The problem with tags is how exactly you implement them. Your project on Amsterdam street may have begun in 2021 but still be ongoing in 2022 with no end in sight; or you may have forgotten its dates altogether. Additionally, "MyProjects::ByLocation::MyCity::AmsterdamStreet" is not an invalid way to tag that project.
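The contrast above can be made concrete in a few lines. The item names and helper functions are made up for illustration; the point is that structured tags support ancestor searches, while flat labels match anywhere and pull in every "Amsterdam" at once:

```python
items = {
    "old_map.pdf":   ["ByLocation::Europe::Netherlands::Amsterdam"],
    "dog_photo.jpg": ["BySubject::Animals::Dogs::Amsterdam"],
    "expenses.xlsx": ["MyProjects::ByYear::2021::AmsterdamStreet"],
}

def find_by_prefix(items, prefix):
    """Structured tags: search by any ancestor segment of the hierarchy."""
    return sorted(name for name, tags in items.items()
                  if any(t == prefix or t.startswith(prefix + "::")
                         for t in tags))

def find_by_label(items, label):
    """Flat labels: a plain string matches wherever it occurs."""
    return sorted(name for name, tags in items.items()
                  if any(label in t for t in tags))
```

Searching the hierarchy for `BySubject::Animals::Dogs` finds only the dog photo; a flat "Amsterdam" label search returns all three items, which is sometimes exactly what you want and sometimes noise.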


You don't care about how they (individual labels) relate to each other.

E.g. the "javascript" tag implies the "programming" tag. I can even speculate there is probably some trait (linguists may suggest one) which applies to some programming languages as well as to some spoken languages. This way tagging arguably can become even harder than taxonomy.

While a label is just a label.

See also: https://news.ycombinator.com/item?id=33248391


Nice article. As a data hoarder this is something I've run into several times over the years. Still haven't found a great solution. If xattrs were more universally supported, then that would probably be the best solution. Instead, I've come to specialized solutions for different data types.

For research papers (in PDF), I have a half-baked python solution I wrote myself that cobbles together the cermine pdf parser/content extractor, the whoosh full text search engine, and an ncurses-based interface.

For personal images, I use the elodie CLI tool, but I'd like to move away from it as I don't like how it modifies files by embedding metadata in them. For research/computer vision data, I use custom tooling based on sidecar files and a pg database kept in sync with the sidecars. For audio samples, I just use a commercial solution, Sononym, that uses an sqlite database.

For other miscellaneous use cases, I've also used TMSU. Pretty nice as a more general purpose solution, except for the inherent issues mentioned by the article.

So yeah, I agree it's a hard problem.


Taxonomies are not only hard, but impossible. Tagging is hardly better. The problem you run into with tagging is that one day you call them `photos` and the next day you call them `pictures` or `pics` or `photography` or `disneyland trip 2019` or nothing at all, and then the day after that you can't find anything. The only solution is constant maintenance. Organization of non-trivial amounts of information is an ongoing problem, not solvable per se. You can hack out a path through the jungle, but the jungle just keeps growing.


I think the tagging issue can be solved by a system that gives the user suggestions from what they've entered before, and if they need to create a new tag, it has to be an explicit "create new tag" action so they don't end up with similar tags due to typos: "pictures", "Pictures", "Picures"...
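A rough sketch of that suggestion step, using the standard library's `difflib` for fuzzy matching (the function name and cutoff are my own choices, just to illustrate the "suggest before create" flow):

```python
import difflib

def suggest_tags(candidate, existing, cutoff=0.8):
    """Return existing tags close to the candidate, case-insensitively,
    so 'Picures' surfaces 'pictures' before a 'create new tag' action."""
    lowered = {t.lower(): t for t in existing}
    matches = difflib.get_close_matches(candidate.lower(), lowered,
                                        n=3, cutoff=cutoff)
    return [lowered[m] for m in matches]
```

Only when this list comes back empty would the UI offer the explicit "create new tag" button.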

But yeah, I can't argue that tagging solves the problem at hand.


In practice that doesn't solve the problem as much as you'd expect. Consider two or more people maintaining the same system. You are tagging something, and the tags that first come to mind don't appear among the suggested ones: is that because you're the first person to ever tag something like this, or is it that you're describing it in a different way than someone else did last year?

Your options for finding out are:

1. Step back and do some research, look at all the tags and find ones that someone might have used to describe something similar, and then check the original item to make sure.

2. Ask your team if anybody has tagged something like this before.

3. Institute a sort of merge request process for new item tags.

4. Make up a new tag and move on.

Guess which option people usually go with?


I wonder whether language models could be used to solve this problem. Convert every tag to a latent space vector and show the user all of the closest existing tags in a dropdown with the option to create a new one at the bottom.
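A sketch of that idea. Everything here is illustrative: a real system would call an actual language-model embedding API, so the `embed` function below is only a crude character-bigram stand-in to make the sketch self-contained and runnable:

```python
import math

def embed(text):
    """Placeholder for a real language-model embedding: a sparse
    character-bigram count vector, enough to rank similar strings."""
    vec = {}
    t = text.lower()
    for i in range(len(t) - 1):
        bg = t[i:i + 2]
        vec[bg] = vec.get(bg, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def closest_tags(new_tag, existing, n=3):
    """Rank existing tags by similarity to a proposed one, for a
    'did you mean?' dropdown with 'create new tag' at the bottom."""
    q = embed(new_tag)
    return sorted(existing, key=lambda t: -cosine(q, embed(t)))[:n]
```

With real embeddings, "photograph" and "photos" land near each other even though no string matching connects them.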


That would be a cool tool to have. In the end, it still comes down to humans being conscientious every time they add something to the library. The equivalent would be having really nice software development lifecycle processes: test suites, merge request gating, etc, and once in a while (ha ha) people still manage to push out half-baked changes when they're in a hurry. In the end it's a human issue, tools only do so much. And in domains outside of coding, it's harder to find regressions automatically, so they tend to go uncaught and accumulate.


Taking this a bit further, do we actually need “tags” that are identical? Rather than filter by a tag, perhaps you could enter a label or set of labels and sort by which results minimize a distance function in latent space. No matching or conscientiousness necessary.


One thing I've done ever since the days of del.icio.us is look for the tags that have the fewest matches -- most often they were typos, synonyms, or singular/plural confusion.
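That heuristic is easy to automate. A minimal version (function name mine) counts tag usage across a collection and flags the rarest ones for review:

```python
from collections import Counter

def suspicious_tags(tagged_items, max_count=1):
    """Tags attached to very few items are often typos, synonyms,
    or singular/plural variants worth merging or fixing."""
    counts = Counter(tag for tags in tagged_items for tag in tags)
    return sorted(t for t, c in counts.items() if c <= max_count)
```

Running it periodically turns the del.icio.us-era manual scan into a one-liner.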


Tagging seems like a solution. But it isn't.

Specific problems with tagging:

- Need to tag every file (whereas with folders, you just navigate to the folder and everything you store there is in that folder)

- Takes too long

- Too much thinking overhead (at the time of storing)

- To be effective, you have to enter values for all tag categories (e.g. project, type, etc.). If anything is missed for a file, that file will never be found.

- You have to remember what tag categories (e.g. project, type, etc.) you have used. If you don't use one for 3 months, you've forgotten it.

- You have to remember the enumeration you are using for some tag categories (e.g. for type you might decide to use only photo, video, and music. Now you have to remember that. You also have to remember it's "photo", not "image")

- If tags were the solution, they would already be used everywhere.

The tagging system SEEMS like a good solution, but once you go deeper, it just doesn't work.


If you tag using RDF[1] and use OWL[2] to define both the tag names and values, then it really works pretty well. And with some entailment rules[3] you can even make tags on folders become tags on the items inside them, or you can just solve this with a slightly more complex SPARQL query[4][5].

For an example of where such a system is used, see:

- WikiData[6]: They use RDF but they don't use OWL, they have a similar though less formal way of defining types, and then entailment has to be encoded in SPARQL queries.

- schema.org[7]: Uses RDF and also provide OWL specifications.

[1]: https://www.w3.org/TR/rdf11-primer/

[2]: https://www.w3.org/TR/owl2-primer/

[3]: https://www.w3.org/TR/rdf11-mt/#entailment-rules-informative

[4]: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/...

[5]: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/...

[6]: https://www.wikidata.org/wiki/Wikidata:Main_Page

[7]: https://schema.org/
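To illustrate the entailment idea without pulling in an RDF library, here is a toy fixpoint rule over plain triples. The predicates `inside` and `tag` are made up for this sketch; a real system would express the same rule in RDFS/OWL or a SPARQL query as described above:

```python
def entail(triples):
    """Apply one entailment rule until fixpoint: a 'tag' on a folder
    propagates to everything 'inside' it, including nested folders."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        new = {(item, "tag", value)
               for (item, p1, folder) in facts if p1 == "inside"
               for (f2, p2, value) in facts if p2 == "tag" and f2 == folder}
        if not new <= facts:
            facts |= new
            changed = True
    return facts
```

The loop re-runs the rule until nothing new is derived, so a tag on a top-level folder reaches files two levels down.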


What I realized the other day is that the property restrictions in OWL are a lot like the options you see on a typical search interface, say,

https://ropercenter.cornell.edu/ipoll/

It is hidden in plain sight; for instance, if you read this

https://www.w3.org/TR/owl-ref/#Restriction

you might not realize that there are some other axioms documented elsewhere in the standard that let you query on "less than", "greater than", "matches a regex", etc. Instead of fighting with Protege or writing Turtle, you should be able to compose queries out of restrictions plus complements, unions, and intersections the way you snap together a program in Scratch, but somehow the idea that OWL works as a query language (logically define a class in terms of its properties) has eluded people completely.


Maybe automatic tagging with the help of some kind of AI would work?


Doesn't even need to be AI. I've done something very similar to this with text mining, regex, and some XQuery/shell. The biggest problem, though, is convincing people that taxonomy doesn't belong in filenames, but that's built into specifications. Kind of a brick wall, that.


For one thing you can train a model that will tag the same way you do.


Fun fact: Taxonomist is actually a role at many of the top tech companies. Much of the faceted search experiences are manually determined by taxonomists. Example: search cars allows you to filter by brand, color, engine type, etc vs searching furniture allows filtering by dimensions. Facebook, Walmart, etc. employ a few of these folks.


Well, if symlinks seem inelegant, there are hard links as well, you know. Ultimate tag system: for each tag, make a directory with the corresponding name and fill it with hard links to the appropriate files, stored in one big directory of mud.

Of course, it all has to fit on a single disk drive, and actually deleting a file is difficult, but hey, that must be easily solvable; details are left as an exercise for the reader.
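The scheme is a few lines of standard library code (function name mine; the one-filesystem restriction shows up as `os.link` failing across mount points):

```python
import os

def tag_file(storage_path, tags_root, tag):
    """Tag a file by hard-linking it into a per-tag directory.
    Both names refer to the same inode, so there is no duplication --
    but everything must live on one filesystem."""
    tag_dir = os.path.join(tags_root, tag)
    os.makedirs(tag_dir, exist_ok=True)
    link = os.path.join(tag_dir, os.path.basename(storage_path))
    os.link(storage_path, link)
    return link
```

Deleting "the file" then means removing every link, which is exactly the difficult part the comment alludes to.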


That's what I was thinking, too. Maybe also add some content-addressability or timestamp IDs into the mix to make renaming a safe operation. And just FYI: there is a reason why Git puts objects in subdirectories. Filesystems are surprisingly bad at handling directories with extreme numbers of entries, so it might be wise to do what Git does in the main directory.


In my "personal digital garden" evolution, I've reached a timeline-level taxonomy with Emacs/org-mode/org-roam/org-attach etc.:

- New textual entries (headings) go into monthly notes ($org-roam-directory/timeline/year/$month-name.org); they might have attached files or not, of course. Attached files are generally linked directly in the heading or inside its textual content for single-click quick access and glance view. Doing so avoids having too many too-small files, or too-big ones that operate slowly;

- Another subdir of org-roam-directory is for "topics", one note per topic, linking or org-transcluding (slow and a bit limited, but still useful) the collected entries in timeline style;

- Another is a workdir where I craft my catalogue (using org-mode drawers created with templates to allow easy org-ql queries) and queries to explore my notes in different views. It's not as easy as TiddlyWiki's transparent transclusion, but it allows a certain degree of practical usability, fine-grained selection, and easy composition.

MOST of my files and config live either as org-attachments or are tangled from org-mode. So yes, taxonomy is hard, but we have tools to master it IF we decide to discover them and invest time in improving our digital garden for real, instead of leaving the classic mess of files hoping for some miracle "application" that automagically solves all issues. Unfortunately, lack of interest leaves such systems too little developed to be as effective as they could be...

My personal experience is:

- We need taxonomy anyway; mere full-text searching with extras à la Google suffices for a certain percentage but fails beyond that;

- We need taxonomies that are a bit flexible in storage terms and can change at a slow pace;

- We need integration, which is NOT possible in ALL modern software; for that we need classic desktops where the OS was a framework/live image and anything is just a module of it, a bit of code. With end-user programming concepts, because no UI can be effective enough in "no code" style and no "modern programming" styles are usable for user programming.

A bottom line: people should learn a bit about information management at school, from how a library or a pharmacy organizes books/meds on their shelves, to books' indices and personal information archives. Nothing exaggerated, but the bare minimum to understand how to manage data, digital and physical, in various forms for a lifetime...


Many people here are saying this is a search problem, but actually there are two distinct ways of finding information: search and browse. Unfortunately it is hard to support both without a lot of work.

Sometimes you want to locate a specific item - in which case you need a good way of searching - and sometimes you want to browse through related information so you want to see a hierarchical structure.

Google Drive was originally designed on the principle that search was all you needed, so it was all tag-based. And it was terrible as soon as you had a lot of data. So Google was forced to introduce the ability to create a folder structure.


I agree. Set theory is more powerful and flexible than taxonomic trees (although not perfect). It's why I believe the future is Table Oriented Programming (TOP), where code blocks are either in or managed by RDBMS. Code-centric tools rely too much on file trees and other trees. If you instead try to design your stack and/or language around sets, you'll probably end up with something similar to TOP.

https://news.ycombinator.com/item?id=33413124&p=2#33415249


Been there and tried them all. One day you realise there isn’t a perfect approach and you must settle and compromise. I settled on project based [0]. When you notice too much repetition - and it happens more rarely than you may think - it’s time to simply consider a new project and a symlink. Not pure but is simple and practical.

[0] https://github.com/slowernews/hamster-system#hamster-folder-...



Plain old unix find with grep, locate or xapian help me navigate. Remembering to run updatedb is the pain. I won't put it in cron because grinding disks every night annoys the hell out of me.


I store my files chronologically. Every time I want to store something I want to keep, I make a folder with the name format "2022-11-01 something something". The text here is just a short description in natural language. If I feel like it I add a tag here like invoice or photos. The point is to make it searchable. I easily find most things I am looking for with Directory Opus and Recoll.
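That naming convention can be captured in a tiny helper (function name mine), the point being that an ISO date prefix keeps folders sorted chronologically by default:

```python
import datetime
import re

def dated_folder(description, when=None):
    """Build a '2022-11-01 short description' folder name; the ISO
    date prefix makes plain alphabetical sorting chronological."""
    when = when or datetime.date.today()
    desc = re.sub(r"\s+", " ", description).strip()
    return f"{when.isoformat()} {desc}"
```

Optional tag words like "invoice" or "photos" just go into the description, where any full-text or filename search will find them.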


I use this simple system too, except for media. Everything else goes in a new folder inside my "2022" folder.

I tried for a few years to store things in a more hierarchical way or use tags, but it was too much mental effort to think about it every time. I also tried something similar to symlinks when google drive used to support putting files in multiple locations, and it was surprising how confusing it was to manage. Unless you apply symlinks or tags in a very consistent way, it just ends up being frustrating finding something that you think tagged or linked in one way but actually, you tagged it in another way or forgot to tag it at all.


Taxonomies are one of the funnest things I have worked with (in a very limited capacity) in the Wordpress implementation (about 1.5 million indexed pages).

I think spatial datasets/spatial aptitude either makes them relevant or just an untapped avenue for exploration.

Interesting article and most likely part of 21st century technology on many.... levels.


Separating functions helps us use the right tool for the job. Taxonomies are for semantics, and the file system is for retrievability. The comfort of hierarchies makes it easy to try and do both simultaneously.

- From computer science, we know graphs give us expressive modeling capabilities. I sometimes use mermaid ER diagrams as a concept map to capture complex relationships between files and concepts.

- From library science, faceted classification works well for extensive collections because inserting a new entry does not require thinking about existing entries. I maintain entries in a spreadsheet for extensive collections that matter to me. Note: Facets are meant for unchanging or infrequently changing properties. Creating a concept map and maintaining a faceted classification system take work, so I only use them for things that are very important to me.
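The faceted-classification idea can be sketched in a few lines: each entry carries independent facet values, and inserting a new entry never requires reorganizing the existing ones; a query just filters on facets. (The facet names and entries below are made up for illustration.)

```python
# Hypothetical faceted catalog: each entry has independent facet values.
catalog = [
    {"title": "Arduino Cookbook",  "medium": "book", "topic": "electronics", "status": "keep"},
    {"title": "tax-return-2021",   "medium": "pdf",  "topic": "finance",     "status": "archive"},
    {"title": "pico-micropython",  "medium": "book", "topic": "programming", "status": "keep"},
]

def query(catalog, **facets):
    """Return entries matching every given facet value."""
    return [e for e in catalog
            if all(e.get(k) == v for k, v in facets.items())]

books = query(catalog, medium="book")
print([e["title"] for e in books])
```

Note how adding a fourth entry needs no thought about where it "belongs" relative to the others, which is the property that makes facets cheap to maintain.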

90% of files I only care about for a short amount of time. I use the file system to co-locate the files I'm currently working on (so a project) but then archive all of it when I move on to something else.

The trade-off is that I give up on sharing files between projects. I don't want to deal with references. I copy from the archive when I need to. On the rare occasion when I need to reconcile the same file between projects, I do it manually. What helps is working on only a few projects at the same time.

TL;DR: Archive more. Use high-investment techniques only for the small percentage of files that really matter.


Categorization is order, tagging is chaos. No order is perfect, but it enables people to talk about the same entities and to agree on certain ideas. Tagging is when everyone invents their own labels and conventions, requiring tons of "smart" algorithms to make sense of the mess.


> Linux is still in the stone age when it comes to tagging. Common Linux filesystems (including ext4 & ZFS) support extended attributes, but I'm not aware of any Linux distro or file manager that includes tagging features based on them (or embedded metadata, for that matter).

KDE has decent support for tags in extended attributes. Tags are shown and can be edited in Dolphin (the file manager) and are indexed by Baloo. This is far from the stone age the author claims!
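Linux exposes these extended attributes directly, so a rough sketch of tag storage over them is possible in a few lines. This assumes the `user.xdg.tags` attribute with comma-separated values (the convention Dolphin/Baloo follow, to the best of my knowledge) and a filesystem with user-xattr support:

```python
import os

# Sketch of file tagging via Linux extended attributes.
# "user.xdg.tags" with comma-separated values is assumed here.
ATTR = "user.xdg.tags"

def encode_tags(tags):
    return ",".join(tags).encode("utf-8")

def decode_tags(raw):
    return raw.decode("utf-8").split(",") if raw else []

def set_tags(path, tags):
    # Requires an xattr-capable filesystem (ext4, XFS, Btrfs, ZFS, ...).
    os.setxattr(path, ATTR, encode_tags(tags))

def get_tags(path):
    try:
        return decode_tags(os.getxattr(path, ATTR))
    except OSError:  # attribute absent, or filesystem lacks xattr support
        return []
```

The fragility is visible in the except clause: tags silently vanish on filesystems (or file transfers) that drop xattrs, which is part of why tagging support feels patchy.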


It's hard for everything in every way.

Biology. Everything is a fish or nothing is a fish. Trees don't exist. Tomatoes are fruits as are cucumbers, pumpkins, bell peppers, and most things we don't consider fruits. But all fruits are also vegetables. Strawberries are neither straw nor berries. Etc.

When they said the two hardest things in computer science were naming things and cache invalidation, it was partly because naming things is a hard problem in every discipline.


The big central problem here is that non-trivial taxonomies aren't trees but graphs. Trying to get a tree-based filesystem to represent a taxonomy means you're forcing a graph into a tree. Symlinks help because they turn your tree into a graph (albeit one that breaks too easily; I think that could be fixed, though). But in the end, a traditional filesystem is a poor way to represent a taxonomy.
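The tree-vs-graph point can be made concrete: give each category a set of parents instead of exactly one, and "what does this belong to?" becomes a reachability query. (A toy sketch with invented category names.)

```python
# A taxonomy as a graph: categories may have several parents,
# which a tree-shaped filesystem cannot express directly.
parents = {
    "tomato":    ["fruit", "vegetable"],   # botanically one, culinarily the other
    "fruit":     ["plant-part"],
    "vegetable": ["plant-part"],
}

def ancestors(node, parents):
    """All categories reachable upward from node (iterative DFS, cycle-safe)."""
    seen, stack = set(), list(parents.get(node, []))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, []))
    return seen

print(ancestors("tomato", parents))  # reaches both branches
```

A single-parent dict would be a tree; allowing lists of parents is exactly the step that symlinks try (and partly fail) to retrofit onto a filesystem.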


If you really want to organize files based on semantics, you might generate file names from, say, a SHA-3 hash of the content, but keep the attributes of the files in a database for lookup.

The trouble with that though is that people have different perspectives on documents.

The librarian in me wants to ingest a document and never modify it, such that content addressable storage is what I want. I want to attach metadata in an external database.

There's another culture, though, where people edit documents, most notably in Adobe's tools, which will try to save a JPEG even if all you did was print it! Adobe developed

https://en.wikipedia.org/wiki/Extensible_Metadata_Platform

which embeds metadata in the files which fits that point of view.
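A minimal sketch of the "librarian" model described above: store each document under its SHA-3 content hash (so it is never modified in place) and keep the mutable metadata in an external SQLite database. The table layout and function names are invented for illustration:

```python
import hashlib
import sqlite3

# Content-addressable storage sketch: immutable content keyed by hash,
# mutable metadata in a separate database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE meta (hash TEXT PRIMARY KEY, title TEXT, tags TEXT)")

def ingest(content: bytes, title: str, tags: str) -> str:
    h = hashlib.sha3_256(content).hexdigest()
    # In a real store we would also write `content` to e.g. store/<hash>.
    db.execute("INSERT OR REPLACE INTO meta VALUES (?, ?, ?)", (h, title, tags))
    return h

h = ingest(b"hello world", "greeting", "demo,example")
title, = db.execute("SELECT title FROM meta WHERE hash = ?", (h,)).fetchone()
print(h[:12], title)
```

The editing culture breaks this model, of course: the moment a tool rewrites the bytes, the hash (and hence the name) changes, which is why embedded metadata like XMP suits that workflow better.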


I guess the lesson to be learned is that you will need to support multiple classification systems. For biological taxonomy, both morphological and genetic classification make sense.

In the health data exchange format FHIR, identifiers and codings have a system plus a value/code, and you can usually specify multiple of them.
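A FHIR CodeableConcept illustrates this directly: one clinical concept can carry several codings, each qualified by the system it comes from, so multiple classification schemes coexist on one record. (Sketched here as a plain dict; the SNOMED and ICD-10 codes shown are ones commonly cited for hypertension.)

```python
# Sketch of a FHIR CodeableConcept with codings from two systems.
condition_code = {
    "coding": [
        {"system": "http://snomed.info/sct", "code": "38341003"},
        {"system": "http://hl7.org/fhir/sid/icd-10", "code": "I10"},
    ],
    "text": "Hypertension",
}

def codes_for(concept, system):
    """Pick out the codes a given classification system assigns to a concept."""
    return [c["code"] for c in concept["coding"] if c["system"] == system]

print(codes_for(condition_code, "http://snomed.info/sct"))
```

Keying every code by its system URI is what lets the two taxonomies disagree without colliding.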


Can anyone recommend a paper or book that approaches this from first principles, e.g. are some limitations/abilities due to mathematical structures such as functions (folders), relations (tags), etc?



The trouble with that list is that those are supposed to be pigeonhole categories.

"Those that belong to the emperor" is an alright category as any animal could belong or not belong to the emperor. You get in trouble working in a system where that has to be disjoint from "bird".

"Those that look like flies from a long way away" is a category that contains all of them.

A while back I was interested in databases like DBpedia and Freebase where you don't really state that anything is disjoint from anything else as most of the categories overlap with other categories. For instance, a person can be an Emperor and an Actor and a Wrestler... See Nero Claudius. Wikipedia doesn't necessarily distinguish between a video game, a manga and an anime that have the same title, so unless you are going to split the topic you're going to have something that is all three even though somebody might think those classes are disjoint.

Lately I have been involved in OWL modelling of financial messages and there, as in some other domains, most classes are disjoint -- in some cases that's a deliberate decision of the modellers, in other cases it is fundamental to the platform I am sucking data out of.


Taxonomies are a crutch that simplify a complex problem - usually too much to be useful.

Multidimensional latent spaces based on content and other characteristics of the object is the real solution here.


So... tags with magnitude?


Totally agree that it's difficult.

> Symlinks are brittle.

On macOS, aliases are a lot less brittle. You can move aliased files around, and they usually won't lose their connection.


Preferably whatever web server you use will allow you to assign sections to multiple categories. I agree project based is usually the way to go though.


This is not taxonomy, it is classification.


This discussion crops up every so often, e.g. at https://news.ycombinator.com/item?id=29141800 . Here I repost a composite of my old comments https://news.ycombinator.com/item?id=14542595 https://news.ycombinator.com/item?id=14546682 from a previous occasion when this was discussed https://news.ycombinator.com/item?id=14537650 . Anyone who is serious about this stuff should probably start here.

> Well, since you ask, here's Hans Reiser's old stuff:

https://reiser4.wiki.kernel.org/index.php/Future_Vision

https://reiser4.wiki.kernel.org/index.php/V4

(and http://lwn.net/2001/1108/a/reiser4-transaction.php3 )

. And here's some emails etc. I wrote in response:

https://web.archive.org/web/20040728044342/http://www.st-and...

https://marc.info/?l=linux-kernel&m=111624697710426

https://www.mail-archive.com/reiserfs-list@namesys.com/msg09...

https://www.mail-archive.com/reiserfs-list@namesys.com/msg20...

https://www.mail-archive.com/reiserfs-list@namesys.com/msg20...

https://www.mail-archive.com/reiserfs-list@namesys.com/msg20...

https://www.mail-archive.com/reiserfs-list@namesys.com/msg20...

, plus some of the discussion threaded from those posts. (Sorry, my stuff needs rewriting and updating but I'm not in the position to do it at present. If there's anything you would like to ask about please do. https://news.ycombinator.com/item?id=9809041 and https://news.ycombinator.com/item?id=10548477 touch on things that are a bit further down the line, but related—in particular, to the handling of "internal metadata" and files with a compound internal structure.)



tl;dr:

* Partitioning files into folders is most likely wrong: Most things need to be in multiple folders.

* (Sidenote: Symlinks are not a good solution)

* Tagging would be best but no good support from the OS for metadata

* We're working on something


Categorising information into taxonomies is like trying to hammer a square peg into a round hole; sometimes necessary, but never cleanly solvable. As someone once said: a book (article, webpage, whatever) is rarely about one thing.

This is a topic that is at the top of my mind as I grapple with organising my growing gemini/gopher site. Is it better to index, list a table of contents, search, or try to classify it with the Dewey Decimal Classification (DDC)?

The DDC has come under criticism, and librarians have put a lot of effort into moving away from it. I doubt that the effort was justified. It boils down to this: you have to put a book in a library somewhere, and that somewhere has to boil down to a taxonomy.

To illustrate the problem: is a book about programming microcontrollers a book about programming, or is it primarily about microcontrollers? The Arduino Cookbook is in DDC 621.3810285536 (yes, really; that's obviously extreme, though). That's part of the electronics section, which seems fair enough to me. So far so good. But "Beginning MicroPython with the Raspberry Pi Pico: Build Electronics ..." is in section 005.13, which is programming. A completely different place. "Programming with STM32: Getting started with the nucleo board" is in 005.262, which is also programming. But why 005.262 rather than 005.13? It almost seems that whoever is classifying these books has no idea what they're doing ;)

I could go on at length about the confusions I have in trying to place my content. In the end, you have to make a somewhat arbitrary decision and just go with it.

Tables of contents work reasonably well within a book. Subjects are often non-intersecting, so they can be treated separately. For the most part, anyway.

A solution which is fairly reasonable is to index your site. Indices are useful because they allow you to take multiple views on something, thereby eliminating the taxonomy problem.

I'm not a great fan of tagging. It is too much of a scattergun approach for my liking. Perhaps it has some merit, though.

Then there's textual searching. In fact, that's how I relocated some of my notes. So, text search it is, then? Well, not quite. It seemed like a good system for my site, which is focussed, but it has problems scaling. I don't want millions of results, a la Google; I want a few relevant ones.

This is even a problem with search engines for the gemini and gopher protocols, where nobody is even trying to game the system. I often end up with a lot of similar stuff at the top that I am not interested in.

Oddly, for gemini, I prefer the "Collaborative Directory of Geminispace" over at gemini://cdg.thegonz.net/ , which is a taxonomy of categories, the very thing that I have doubts about.

So, in summary, it's not easy.



