I am endlessly fascinated with content tagging systems (twitter.com/hillelogram)
590 points by redbar0n on Oct 18, 2022 | 266 comments



Instagram's tagging system was actually really effective at categorizing content and discovery because each hashtag was treated as a node in a (giant) graph, where each node has multiple properties, including post count (number of posts using a tag), 'velocity' (number of posts using a particular tag per unit time), etc. I could write up a big post about it, as I made a study of it when I created a web app for finding the most relevant tags a few years ago.
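To make the "hashtag as a graph node" idea concrete, here is a minimal Python sketch. It is not Instagram's implementation: the class names, the co-occurrence edges, and the one-hour velocity window are all assumptions for illustration. Only post count and velocity come from the description above.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class TagNode:
    """One hashtag, carrying the two properties described above."""
    name: str
    post_times: list[float] = field(default_factory=list)  # timestamps in seconds

    @property
    def post_count(self) -> int:
        return len(self.post_times)

    def velocity(self, now: float, window: float = 3600.0) -> float:
        """Posts per second using this tag within the last `window` seconds."""
        recent = sum(1 for t in self.post_times if now - window <= t <= now)
        return recent / window


class TagGraph:
    def __init__(self) -> None:
        self.nodes: dict[str, TagNode] = {}
        # Assumption: an edge's weight is the number of posts using both tags.
        self.edges: dict[frozenset[str], int] = defaultdict(int)

    def add_post(self, tags: set[str], timestamp: float) -> None:
        for tag in tags:
            self.nodes.setdefault(tag, TagNode(tag)).post_times.append(timestamp)
        for a, b in combinations(sorted(tags), 2):
            self.edges[frozenset((a, b))] += 1

    def related(self, tag: str) -> list[tuple[str, int]]:
        """Tags that co-occur with `tag`, strongest first."""
        out = []
        for pair, weight in self.edges.items():
            if tag in pair:
                (other,) = pair - {tag}
                out.append((other, weight))
        return sorted(out, key=lambda item: (-item[1], item[0]))


graph = TagGraph()
graph.add_post({"goth", "makeup"}, 100.0)
graph.add_post({"goth", "makeup", "lipstick"}, 200.0)
graph.add_post({"goth"}, 5000.0)
```

Walking `related()` from tag to tag is one way a tool could let users navigate toward a tightly related set of tags.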

All that to say there was a lot to their system and it worked because users became aware that they were rewarded for using the most relevant tags. Using irrelevant tags was punished. This guided users towards using a mix of relevant popular and niche tags to maximize their reach, which, in turn, further improved the tagging system.

Instagram's tagging system isn't as important anymore as their algorithm has deemphasized it, in favor of other methods for classification and discovery, but there were a couple of golden years where it worked very well. Most users still look back on those years as the 'good times' even if they don't know exactly why. I'd go so far as to say they ruined the app after they deemphasized tags (and added way too many ads)


I went and got my laptop to type up a reply to this:

Instagram's tagging system was and is atrocious in combination with their discovery mechanisms and the incentives they create.

A real example, this has been true for years: I want to look at pictures of Jennifer Lawrence's makeup because she, like me, has hooded eyes and that makes for a useful reference. I go to Instagram imagining that I will find fan accounts posting pictures. I search for #jenniferlawrence. 2.8 million posts. Nice.

https://www.instagram.com/explore/tags/jenniferlawrence/

Only three of the top nine pictures present there today (same on mobile for me) are of Jennifer Lawrence.

This is what happens when all the engagement is counted in one big engagement bucket, all eyeballs are equal eyeballs, and all likes are equal likes. There are two gorgeous pictures of Anne Hathaway here, and I'm sure they've gotten great engagement, but they are absolute shit at being pictures of Jennifer Lawrence. So now I have to scroll through a ton of absolutely irrelevant nonsense – with attendant ads, let's not forget the ads – if I actually want to use this tag for its sole raison d'etre.

For contrast, consider what happens when a picture is cross-posted on Reddit. If I upvote it in one subreddit, I am giving an engagement signal specific to that subreddit – and I as one user may choose to downvote that same picture in another if I think it doesn't fit there!

This isn't limited to low-effort reposting of professional photographers' work. It happens in less egregious ways for many, many tags in many areas I've seen on IG for quite a few years now (though I can't speak for the whole history of the app). Artists tagging media they're not using, etc. etc.


Isolated hashtags on IG were often not useful for finding specific content. To find what you want you'd need an interface to a page that showed posts across multiple hashtags which they didn't provide.

For example, posts that contain both #jenniferlawrence and #hoodedeyes would be what you are looking for, and if you interacted with enough posts containing both of those tags you would end up seeing the content you want, you just can't do it directly by navigating to those hashtag pages individually.

I used to provide this in my free hashtag suggestion app (which I kept online because it was, IMO, the best at what it did and still gets 500k hits a month). The app allows users to iteratively navigate the hashtag graph to a final set of tightly related tags. Anyway, during this refinement process users would see a grid of post thumbnails where the posts contain the desired hashtags. Facebook eventually tightened access to IG's API and blocked my ability to provide this thumbnail grid, but the rest of the app is still standing.


> For example, posts that contain both #jenniferlawrence and #hoodedeyes would be what you are looking for

I mean, not to overwork it, but in this example, this isn't accurate: I just want pictures of Jennifer Lawrence, not the much smaller slice of content where the poster had her particular eye shape in mind. Also, I don't want to only "end up seeing" pictures of Jennifer Lawrence in my main feed, I want to be able to go find them when I want them. Instagram's ontology is designed to not make this possible, because everything about it is meant to facilitate tube-feeding, and that is why I feel so strongly that it is trash.

(We could also talk about how similar phenomena manifested elsewhere: the #goth tag on Tumblr in my Tumblr days was unusable because of the quantity of non-goths looking at a lowkey photo and deciding that was the best descriptor – so the actual #goth content was found in .... #gothgoth. And if you're wondering if this was discovered and subsequently chased out to #gothgothgoth, You Have Understood The Problem)


Right, but it was never meant to specialize in user-directed filtering and discovery to the level you are describing.

To your second point, every social context that gets 'cool' eventually gets LCDed into mediocrity (even HN). You have to outrun the noise, as you described in your #gothgothgoth example


>Right, but it was never meant to specialize in user-directed filtering and discovery to the level you are describing.

this, and what you said earlier

>Isolated hashtags on IG were often not useful for finding specific content. To find what you want you'd need an interface to a page that showed posts across multiple hashtags which they didn't provide.

forgive my naivete but I wonder exactly what could possibly have been good about their tag system if it didn't help users narrow down to just the content they wanted?


You ever just put a show on or pick up a book spontaneously, without really having a particular show or book in mind before you started?

This is kind of like that, call it 'directed spontaneity'. You open the app to see interesting stuff, but you don't know what exactly that might be in advance. The app learns some basic stuff about you using your interactions on content then tries to guess. You either like the stuff the app shows you or you don't.

The hashtag system was very good at filtering content to what you generally like in your feed and the Explore view, but it wasn't set up for you to fully guide your own experience in search of very particular content using individual hashtags.


> Right, but it was never meant to specialize in user-directed filtering and discovery

We agree:

> Instagram's ontology is designed to not make this possible, because everything about it is meant to facilitate tube-feeding

So as the user-directed ontology component of the IG ecosystem, it's trash, and I will die mad about it.

> every social context that gets 'cool' eventually gets LCDed into mediocrity (even HN)

I want to distinguish my complaint here from a gatekeeper's lament. This isn't about, say, nu-goth showing up and coming to eat the fashion scene, as Tumblr did indeed instigate. I'm talking about pictures being tagged with "goth" that no one looking for "goth" content would consider relevant (a closeup of coral red lipstick, say), but which would show up as "top #goth posts" because they were highly engaged with in the other contexts people saw them. (To a lesser extent, also Etsy spam in #gothgoth, but this is more what you're talking about)


I just want to say that your comments made me laugh super hard, but truly get to the heart of why a blunt engagement metric is a trash metric. (I also study gatekeeping and intermediaries, so it’s interesting from that angle also.)


Some day some master's student is going to write a thesis on the history of the #witchcraft tag on Tumblr and how it functioned as a social space and I only hope my username does not appear. :D


We agree on the intended use of the app being more about passive consumption of a feed of interesting content, not active, thorough exploration of particular topics.

I'm not sure why you think the whole hashtag system was trash because of this one aspect. Most of the evidence I see is contrary to your statements.

To reiterate, the strength of the tagging system is the relationship between the tags you interact with, not any given individual tag. Also, individual tags are not exclusively literal descriptors for their content. They may be, or they may be used like: 'people who like #jenniferlawrence will also like #...'


> active, thorough exploration of particular topics

> discovery to the level you are describing

It is not a thorough or sophisticated search-engine-type task to "use a celebrity's name as a tag to look at pictures of a celebrity".

I'm whinging about this because I enjoy Instagram for (yes, largely passive) consumption of the kinds of things people post on Instagram, and I think we should be honest about the brave new world we're living in. Part of that means owning up to how the sophisticated engagement-driving media have made certain aspects of interacting with each other's stuff worse, not better.

> Also, individual tags are not exclusively literal descriptors for their content. They may be, or they may be used like: 'people who like #jenniferlawrence will also like #...'

This contradicts the original

> All that to say there was a lot to their system and it worked because users became aware that they were rewarded for using the most relevant tags. Using irrelevant tags was punished.

unless we are using a definition of "relevant" entirely gleaned from a tube-feeding targeted-advertising mindset, where my intent and desire at any given time is immaterial, the actual content of any particular post is immaterial, and queries can only be made via the general profile of my eyeballs' interaction patterns.

(This works really well for paid ads! It really really does, for everyone involved! The Instagram Etsy and Amazon ads I see are extremely good at showing me niche-ass things I want to buy, far better than the content recommended to me on Etsy and Amazon.)

Tags are used as vehicles for content promotion, and the race to the bottom has rendered them so useless that they're not even good for that (at one time they were valuable to catch people browsing tags, but who's going to browse tags when they're this incoherent) so it's not surprising the app has deemphasized them. I think that product choice is pretty strong evidence for my take that they weren't working well, tbh.

On TikTok, the even-more-cursed Instagram Of (Ten Minutes In) The Future, accurate tags come off as cringey and desperate, and are used either as blatantly ironic jokes (the sponsored ones, especially) or Skinner-pigeon dancing-to-attract-the-algorithm (lookin' at you, #fyp). You can see why: if people come to associate tagging with thirsty self-promoting behavior, that's not a good look on your #aesthetic videos. In many contexts, evidence of the social media hustle you're involved in is disqualifying; it's easier to feign naivete, "oh damn this blew up", if your effortful video editing was sent down to people's tubes without your particular direction.

None of this was inevitable, because deciding to rank by engagement bucketed by "users who have a general profile of liking X" instead of "users who are looking at the X tag" is not an inevitable choice, even if the knock-on effects of the incentives it creates are inevitable after that choice.


> unless we are using a definition of "relevant" ..

Yea, this is the crux. Using a particular tag in your post is a prediction that people who appreciate the content clustered around that tag will like your post. A good portion of the time that is based on accuracy, but sometimes it isn't, as I already explained (especially in the most used tags). Instagram was never set up to have the specificity of a search engine.

I agree tags are less useful now, because the feedback loop is broken after Facebook's changes, but it wasn't always like this. Every point I've made is about a particular period in time, not the present.


>To your second point, every social context that gets 'cool' eventually gets LCDed into mediocrity (even HN).

The Law of Shitty Clickthroughs by Andrew Chen illustrates and details this very well: https://andrewchen.com/the-law-of-shitty-clickthroughs/


> And if you're wondering if this was discovered and subsequently chased out to #gothgothgoth, You Have Understood The Problem

Excellently put.

To me, it seems that the problem is that tagging is adversarial in any system where spamming can be rewarded.


Yeah – I'm not even sure we can call it adversarial where there seems to be no signal for relevancy. I'm sure there are "good" reasons for not having one (where by "good" I mean "profitable") but oof it sucks (esp. relative to the much simpler Web2.0 organization of e.g. Reddit)


It's funny because one of the most common arguments I see inside Reddit communities is irrelevant posts getting upvoted in a subreddit, and wanting the mods to step in - or not. People just see a post on their frontpage that they like and they upvote it, rarely stopping to look at what subreddit it's from and whether the post is a good fit for that subreddit.

I suppose it probably still works better than Instagram.


Fully agreed – it at least seems like you could tune for this in your less deterministic algorithmic sorting, too. There's no Reddit Constitution that says an upvote is an upvote is an upvote – so you could weight upvotes issued from people viewing the subreddits' pages more strongly than those from people viewing their home feeds, upvotes from subscribers' home feeds more strongly than nonsubscribers' /all feeds, etc. etc., downvotes mutatis mutandis
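A rough Python sketch of that weighting idea, purely illustrative: the context names and the weight values are made up, and nothing here reflects how Reddit actually scores posts.

```python
from dataclasses import dataclass

# Hypothetical weights: a vote cast while browsing the subreddit itself
# counts most, a vote from a non-subscriber's /all feed counts least.
CONTEXT_WEIGHTS = {
    "subreddit_page": 1.0,
    "subscriber_home_feed": 0.6,
    "all_feed": 0.3,
}


@dataclass(frozen=True)
class Vote:
    direction: int  # +1 for an upvote, -1 for a downvote
    context: str    # where the voter saw the post; a key of CONTEXT_WEIGHTS


def weighted_score(votes: list[Vote]) -> float:
    """Sum votes, scaling each by the context it was cast in."""
    return sum(v.direction * CONTEXT_WEIGHTS[v.context] for v in votes)


votes = [
    Vote(+1, "subreddit_page"),
    Vote(+1, "all_feed"),
    Vote(-1, "subscriber_home_feed"),
]
score = weighted_score(votes)
```

Three raw votes that would net +1 under "an upvote is an upvote" come out lower here, because the only full-weight vote is the one cast from the subreddit's own page.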


What the actual fuck, NONE of the nine pictures are of JL, although all nine posts tagged every living actress, celebrity, and their mom.


Please write that big post! Sounds interesting


I second this! Sounds like an interesting read! :)


I third this!


I'll add that if you do write it, email us a heads-up at hn@ycombinator.com so we can consider putting it in the second-chance pool (https://news.ycombinator.com/item?id=26998309).


I'd also be interested in reading about that - in particular are things like 'post count' computed and stored (i.e. not normalised)? How do you cope with keeping such analytics current as the source changes?

(When oh when will postgres get automatic and incremental materialised views?)


I noticed this as a regular user. I’m curious to see what it is like now.


> in favor of other methods for classification and discovery

Aka the toxic engagement trap. I miss the days when unfathomable AI didn't dictate what's popular.


> could write up a big post about it

Please DO.


I adore tagging systems and have worked on them in several different applications and implementations, but there are always pitfalls and trade offs, and it’s possible to bury yourself

Nowadays I nearly always store the assigned tags as an integer array column in Postgres, then use the intarray extension to handle the arbitrary boolean expression searches like “((1|2)&(3)&(!5))”. I still have a tags table that stores all the metadata / hierarchy / rules, but for performance I don’t use a join table. This has solved most of my problems. Supertags just expand to OR statements when I generate the expression. Performance has been excellent even with large tables thanks to pg indexing.
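For anyone wondering what "supertags just expand to OR statements" might look like, here is a small Python sketch that builds an intarray-style expression string. The nested-tuple expression format and the `supertags` mapping are my own assumptions, as is the choice to OR a supertag with its children rather than replace it. The resulting string is meant for intarray's `query_int` type.

```python
from typing import Optional, Union

# An expression is a tag id, or a tuple like ("and", a, b), ("or", a, b), ("not", a).
Expr = Union[int, tuple]


def to_query_int(expr: Expr, supertags: Optional[dict] = None) -> str:
    """Render a nested expression as an intarray query_int string."""
    supertags = supertags or {}
    if isinstance(expr, int):
        # Assumption: a supertag matches itself or any of its child tags.
        ids = [expr, *supertags.get(expr, [])]
        if len(ids) == 1:
            return str(expr)
        return "(" + "|".join(str(i) for i in ids) + ")"
    op, *args = expr
    parts = [to_query_int(arg, supertags) for arg in args]
    if op == "not":
        if len(parts) != 1:
            raise ValueError("'not' takes exactly one operand")
        return "!" + parts[0]
    if op in ("and", "or"):
        if not parts:
            raise ValueError(f"'{op}' needs at least one operand")
        sep = "&" if op == "and" else "|"
        return "(" + sep.join(parts) + ")"
    raise ValueError(f"unknown operator: {op!r}")


query = ("and", ("or", 1, 2), 3, ("not", 5))
plain = to_query_int(query)
expanded = to_query_int(query, supertags={3: [30, 31]})
# Hypothetical usage, with a made-up table and column:
#   SELECT id FROM items WHERE tags @@ %s::query_int
```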


Do you index arrays? What index type is that? Any tips?

I’ve used array column in PG before, haven’t indexed arrays though.


AFAIK, postgres first got its reputation of high performance because of array indexes.

People usually go with GIN indexes, that can be used on the contains, overlaps or equals comparisons.


Yes, it’s explained in the intarray doc here. GiST is the one I use, but as it states GIN should be faster on reads. I haven’t really thought about that in many years, I should run some perf tests.

https://www.postgresql.org/docs/9.1/intarray.html


Read perf would be good in isolation, sure; but what about the lock contention over those rows due to the frequent writes to the tags column? Usually "taggings" have a much higher rate-of-change than the objects they tag. If you're not at least keeping a thing_taggings has-one table (thing bigserial, tags intarray) separate from your things table, I could see this degrading performance for any query that wants to touch the table.


I just made tags NoSQL: they're stored as JSON data in pg, with a user set attached to each tag.

I think the relation was that each user could have multiple posts that each contain multiple tags.

Each tag is global, with user IDs stored under it and the user ID as the key. Or something approximating that.

Mainly to not need insane queries for many to many relationships. O(1) baby.


Can you search by tag expressions like “((1|2)&(!3))” with your method? I’m not understanding it completely. I often use json_data but I didn’t consider tagging methods because of the overhead it has.


The tradeoff here is that you lose the foreign key constraint, correct? So if you delete a tag, there is no way for the database to automatically remove all references to it. Or is there some way to do this now?


> So if you delete a tag, there is no way for the database to automatically remove all references to it.

I'm not sure about implementation/support in Postgres specifically, but in the general case of a column of tag bitfields, the database could easily maintain a global popcount (ie, "number of rows with this tag") and soft-delete flag for each tag, and clear any soft-deleted tags on (possibly-only-write-)access. When a soft-deleted tag reaches popcount == zero, it counts as garbage collected and can be reused for a new tag.
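Here is an in-memory Python sketch of that scheme, just to show the bookkeeping. It is not a Postgres feature, and the class and method names are invented: rows hold a tag bitfield, each tag bit has a popcount, and soft-deleted bits are cleared lazily whenever a row is touched.

```python
class TagStore:
    def __init__(self) -> None:
        self.rows: dict[int, int] = {}       # row id -> bitfield of tag bits
        self.popcount: dict[int, int] = {}   # tag bit -> number of rows carrying it
        self.soft_deleted: set[int] = set()  # bits deleted but still present in rows
        self.reusable: list[int] = []        # bits that have been garbage collected

    def add_tag(self, row_id: int, bit: int) -> None:
        self._scrub(row_id)
        if bit in self.soft_deleted:
            raise ValueError("tag is soft-deleted")
        mask = 1 << bit
        current = self.rows.get(row_id, 0)
        if not current & mask:
            self.rows[row_id] = current | mask
            self.popcount[bit] = self.popcount.get(bit, 0) + 1

    def delete_tag(self, bit: int) -> None:
        """Soft delete: rows are cleaned up later, when they are accessed."""
        self.soft_deleted.add(bit)
        self._collect(bit)

    def tags(self, row_id: int) -> set[int]:
        self._scrub(row_id)
        bits = self.rows.get(row_id, 0)
        return {b for b in range(bits.bit_length()) if (bits >> b) & 1}

    def _scrub(self, row_id: int) -> None:
        """Clear any soft-deleted bits from one row and update popcounts."""
        if row_id not in self.rows:
            return
        bits = self.rows[row_id]
        for bit in list(self.soft_deleted):
            mask = 1 << bit
            if bits & mask:
                bits &= ~mask
                self.popcount[bit] -= 1
                self._collect(bit)
        self.rows[row_id] = bits

    def _collect(self, bit: int) -> None:
        """Once no row carries a soft-deleted bit, free it for reuse."""
        if bit in self.soft_deleted and self.popcount.get(bit, 0) == 0:
            self.soft_deleted.discard(bit)
            self.popcount.pop(bit, None)
            self.reusable.append(bit)
```

This version scrubs on reads as well as writes. The "possibly-only-write" variant would skip `_scrub` in `tags()` and mask out soft-deleted bits in the return value instead.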


But then again, the reasons for _deleting_ a tag are very low. What really happens is that you want your searches on that tag to just return nothing, and that's more of an application level responsibility. Your #absolutely_vile_tag_that_would_get_you_in_jail just enters a blacklist, and you're done.


Yes, but that’s easily handled with a trigger. My first implementation actually had a regular join table of items_tags which used a trigger to update the items.tags intarray. Wasn’t super performant but let us use our existing templates for 1-many to implement the UI.

Nowadays you can just use a tagging component with integrated search for the UI.


Right. More like NoSQL FKs.

How high is the business risk if you have a random tag with no name? Skip its display in the UI.


Would you mind sharing a simple example that demonstrates this? Sounds great!


I worked with the Wikipedia category system a few years ago, and you could see the problems with hierarchical tagging systems right in action back then. (Though it may have gotten better in the meantime)

The system appeared simple: There were just two relations, "Article A is a member of category B" and "Category X is a subcategory of category Y".

However, in practice, the community was using this system to represent a whole host of wildly different relationships between items, often with different implications for what a category actually applied to.

E.g., if A has a subcategory B, this could mean one of several things: B might be an additional constraint on the items in A ("American writers" -> "19th century American writers"), the things in B might be more specific than the things in A ("Writers" -> "Novelists"), A might apply to the concept B, not the things in B ("Occupations" -> "Writers") or A might refer to the category B ("Categories with more than 100 entries" -> "Writers") and on and on...

Of course those different aspects could even be combined. E.g. "Categories with more than 100 entries" might have a child "Categories with more than 100 entries in need of review", which represents a constraint but might itself contain fewer than 100 entries...

The basic question "Is item X in category Y" becomes impossible to answer generally, because there is no clear indication if a category only applies to its direct children or to all of its descendants or only to the subcategories themselves.

I'm sure there are sophisticated ontological systems which would allow users to specify all those different relationships separately. I'm also pretty sure that users would become sloppy after a short time or would disagree which particular relationship to use in a particular situation...


There are only two kinds of relation here, “subset of” and “instance of” (aka “element of”, type-token).

The category-category relations are intended to always be a subset relation. The article-category relations are intended to always be an instance-of relation.

- “19th century American writers” is a subset of “American writers”.

=> Both are a category, so no problem.

- “Novelists” is a subset of “Writers”.

=> Both are a category, so no problem.

- “Writer(s)” is an instance of “Occupation”.

=> Here the problem is that “Writers” is a category. It would be okay if it was an article “Writer (occupation)”.

- “Writers” is an instance of “Categories with more than 100 entries”.

=> Here, again, the problem is that “Writers” is a category, and having an instance-of relation between categories is not an intended/supported use-case.

This could conceivably be solved by supporting an instance-of relation between categories, in addition to the existing subset (subcategory) relation. It could be called a meta-category relation. Then you could have the category of occupation categories.

Another way to put this is that categories have to be typed: a category contains either (just) articles, or it contains (just) categories. Subcategories then must match the type of their supercategories and correspondingly must contain either articles or categories.

Basically, Wikipedia’s type system is not expressive enough to allow everything people would want to express in it.


> Another way to put this is that categories have to be typed: a category contains either (just) articles, or it contains (just) categories.

The problem with using such a strict type system as a tagging-system is exacerbated by the cases where:

1. Someone adds an article to a category (they tag the article), but then wants to add a subcategory to that category. Now the category contains both articles and subcategories (violating the constraint). So the user would have to move all its articles into its subcategories for the constraint to be satisfied. This can be an enormous amount of work (needing to invent new subcategories for all the articles in the category not fitting into the specific subcategory they had in mind).

2. Someone wants to add an article, but only has a vague idea of a super-category in which it would fit. Now they have to exhaustively crawl/navigate the tree of sub-categories until they find the leaf sub-categories which contain only articles, which is a place they could put it. The input barrier thus becomes high (which is antithetical to how people expect to use tags).


Re 1: I think you misunderstood. The restriction of either articles or categories is for the instance-of relation. (The subset-of relation is naturally restricted to categories to begin with.) If a category contains articles, it can’t also contain categories as elements (instance-of relation). But it can contain subcategories as subsets.

Analogy: The set of real numbers has the set of natural numbers as a subset, but it doesn’t have the set of natural numbers as an element, because the set of natural numbers is not a real number — the individual natural numbers are.

Likewise, the category “Occupations” may contain articles describing occupations, and it may have a subcategory “Clerical occupations” (subset-of relation), but it cannot contain the category “Writers” as an element (in an instance-of relation as with the articles), because writers are not a subset of occupations.

Furthermore, as an example of a meta-category, the category “Categories with more than 100 entries” may contain the category “Occupations” as an element (but not as a subcategory!), and hence cannot contain any articles as elements.

The element type of the category “Categories with more than 100 entries” is categories, and the element type of “Occupations” is articles. The point is that you can’t mix both types of elements within the same category. This is independent from subcategories. Any category can have subcategories, the only condition being that the subcategories must have the same element type as the supercategory.

The idea is that a category can have both subset-of and instance-of relations at the same time (and each relation needs to be marked as such in the system), but the instance-of relation is restricted to be either articles or categories, but not both.
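A toy Python model of the rules described above, as a sketch only. It is not how MediaWiki works; the `Category` class, its method names, and the use of plain strings for articles are all assumptions. Each category declares one element type, elements must match it (instance-of), and subcategories must share it (subset-of).

```python
class Category:
    def __init__(self, name: str, element_type: str) -> None:
        if element_type not in ("article", "category"):
            raise ValueError("element_type must be 'article' or 'category'")
        self.name = name
        self.element_type = element_type
        self.elements: list = []             # instance-of relation
        self.subcategories: list = []        # subset-of relation

    def add_element(self, item) -> None:
        """Instance-of: the item must match this category's element type."""
        kind = "category" if isinstance(item, Category) else "article"
        if kind != self.element_type:
            raise TypeError(f"{self.name!r} holds {self.element_type}s, not {kind}s")
        self.elements.append(item)

    def add_subcategory(self, sub: "Category") -> None:
        """Subset-of: the subcategory must have the same element type."""
        if sub.element_type != self.element_type:
            raise TypeError(f"{sub.name!r} cannot be a subset of {self.name!r}")
        self.subcategories.append(sub)


occupations = Category("Occupations", "article")
occupations.add_element("Writer (occupation)")        # an article, as a plain string
occupations.add_subcategory(Category("Clerical occupations", "article"))

big = Category("Categories with more than 100 entries", "category")
big.add_element(occupations)                           # a meta-category element

writers = Category("Writers", "article")
try:
    occupations.add_element(writers)                   # rejected: not an article
    rejected = False
except TypeError:
    rejected = True
```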

Re 2: I believe that problem goes away, given the above.


The issue is that the system has nodes and edges, but no concept of distinct graphs. That leaves you trying to fit all notable human knowledge onto a single graph, which is non-optimal. Whether it’s also a DAG, tree, or something else doesn’t even matter.

Ontologies are like languages. There is no correct one. What matters is how good a fit it is for the problem at hand and that you’re all using the same one! If half the people are using Italian and half Spanish, it’s going to be a disaster. I wouldn’t use APL to write a UI and I wouldn’t architect a computer system in Shipibo.

Similarly, if I’m bird watching, “Birds of Northern California” is very useful. Organizing them by genus is less useful to me in that moment, but it’s not wrong.


I don't think you necessarily need multiple graphs; just labeled edges.


You just need some way to interact with it as multiple graphs. Some variation of labeled edges is probably the best.


In your examples, would the edges be like:

tagged_with_italian_tag vs. tagged_with_spanish_tag ?

tagged_with_genus_tag vs. tagged_with_geo_tag ?

Would that afford such multiple graphs?


There are a bunch of ways to do it. You could use the Entity-Attribute-Value model[0]. Then it's (California Quail, Region, California), (California Quail, Genus, Callipepla). You could do relational tables, with a through table for each taxonomy. Or, one through table with tags. That's like your comment.

[0] https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80...
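A tiny Python sketch of the Entity-Attribute-Value option, to show how each attribute acts as its own "graph" over the same entities. The two California Quail triples are from the comment above; the other rows and the helper names are added purely for illustration.

```python
# Each fact is an (entity, attribute, value) triple.
TRIPLES = {
    ("California Quail", "Region", "California"),
    ("California Quail", "Genus", "Callipepla"),
    ("Gambel's Quail", "Region", "Arizona"),      # illustrative row
    ("Gambel's Quail", "Genus", "Callipepla"),    # illustrative row
    ("Steller's Jay", "Region", "California"),    # illustrative row
    ("Steller's Jay", "Genus", "Cyanocitta"),     # illustrative row
}


def entities_with(triples, attribute: str, value: str) -> list[str]:
    """All entities that have the given value for the given attribute."""
    return sorted({e for e, a, v in triples if a == attribute and v == value})


def values_of(triples, entity: str, attribute: str) -> list[str]:
    """All values one entity has for one attribute."""
    return sorted({v for e, a, v in triples if e == entity and a == attribute})


by_region = entities_with(TRIPLES, "Region", "California")
by_genus = entities_with(TRIPLES, "Genus", "Callipepla")
```

The bird-watching view and the taxonomy view are just two different attribute filters over one table, so neither taxonomy has to be the "correct" one.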


Isn't this literally just saying we need another layer of categorization on top of the categorization layer?


It’s saying you need support for multiple types of categories. You could use the same system to organize itself. No need for a meta layer.


Perhaps "adjacent to" rather than "on top of"? I've started looking at this kind of problem in terms of DB queries or set relations. Even "organization" can be a set relation if there are the right bits of metadata in place.


The problem might not be with hierarchical tagging systems, but with the specific hierarchical tagging system they use at Wikipedia.

Imagine another system with the following categories:

* People:ByOccupation:Creative:Writers

* Time:CommonEra:ByCentury:19

* Location:Earth:Americas:NorthAmerica:USA

In this scheme of things, e.g. Mark Twain would be tagged with all three. "19th century American writers" (which includes Mark Twain) would not be a category but a saved search. (Other saved searches — which would also include Mark Twain — would be "19th century people from Americas" or "Stuff from Planet Earth").
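A quick Python sketch of "categories as saved searches", assuming tags are colon-separated paths and a query matches any tag it is a path prefix of. The Mark Twain tags are the three listed above; the second entity and its Europe path are hypothetical, added only so the searches have something to exclude.

```python
ENTITIES = {
    "Mark Twain": {
        "People:ByOccupation:Creative:Writers",
        "Time:CommonEra:ByCentury:19",
        "Location:Earth:Americas:NorthAmerica:USA",
    },
    # Hypothetical second entity; the Europe path is invented for this sketch.
    "Jane Austen": {
        "People:ByOccupation:Creative:Writers",
        "Time:CommonEra:ByCentury:19",
        "Location:Earth:Europe:UK",
    },
}


def has_tag(entity_tags: set[str], query: str) -> bool:
    """True if `query` equals, or is an ancestor path of, one of the tags."""
    wanted = query.split(":")
    return any(tag.split(":")[: len(wanted)] == wanted for tag in entity_tags)


def saved_search(entities: dict, *queries: str) -> list[str]:
    """Entities matching every query path (a logical AND)."""
    return sorted(
        name
        for name, tags in entities.items()
        if all(has_tag(tags, q) for q in queries)
    )


american_writers_19c = saved_search(
    ENTITIES,
    "People:ByOccupation:Creative:Writers",
    "Time:CommonEra:ByCentury:19",
    "Location:Earth:Americas:NorthAmerica:USA",
)
stuff_from_earth = saved_search(ENTITIES, "Location:Earth")
```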


Suppose you have someone who did a bunch of writing in America, then moved to Europe and became famous as an inventor there. Under your proposal, this person has both Location:Europe and Location:USA, and both Occupation:Writer and Occupation:Inventor. They therefore show up for queries for European writers and American inventors, neither of which was intended; I bet we can come up with situations where the false positives are even worse. The presence of those tags has to be interpreted in light of each other.

If you do this naively I think it's pretty clear you've either sacrificed expressivity or made the system a LOT more complicated/harder to understand. At best you end up with some kind of product structure in (what is no longer just) the set of tags on an article. You can think of explicating an implicit product structure in joining "American" and "Writer" in the same object. But I think if you've started talking about compound tags, you're really talking about something other than a tagging system.


I think the only feature you need to express this is to be able to limit a tag to the context of another tag. It's slightly different than a compound tag because each tag can still be used independently.

I experimented with a system like this recently[0] that used two different tag notations that seemed to make the mixing more intuitive. I didn't have enough time to iterate on it further or build it more seriously, but I think there is potential in this area.

[0] https://youtu.be/bi3YkY7UKmM

----

To be more specific, your entity could be tagged as Location:Europe in the context of Occupation:Inventor, and Location:USA in the context of Occupation:Writer. I still think the entity should match queries for any of the 4 tags, it just shouldn't match a query {Occupation:Inventor/Location:USA}
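One way to sketch that matching rule in Python, under a simplifying assumption: instead of one tag being "in the context of" another, tags that belong together sit in the same unordered group. This is not the notation from the linked video, and the entity and function names are hypothetical.

```python
# Each entity carries a list of tag groups; tags in one group qualify each other.
WRITER_TURNED_INVENTOR = [
    frozenset({"Occupation:Writer", "Location:USA"}),
    frozenset({"Occupation:Inventor", "Location:Europe"}),
]


def matches(tag_groups: list, *query_tags: str) -> bool:
    """True if a single group contains every queried tag.

    A one-tag query matches if the tag appears anywhere; a multi-tag
    query only matches when the tags were applied together.
    """
    wanted = set(query_tags)
    return any(wanted <= group for group in tag_groups)


single = matches(WRITER_TURNED_INVENTOR, "Location:USA")
together = matches(WRITER_TURNED_INVENTOR, "Occupation:Inventor", "Location:Europe")
crossed = matches(WRITER_TURNED_INVENTOR, "Occupation:Inventor", "Location:USA")
```

Here `single` and `together` are true while `crossed` is false, which is the behavior described above: all four tags match on their own, but {Occupation:Inventor/Location:USA} does not.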


That introduces a dependency, or at least ordering, between the concepts of Location and Occupation that I'm not sure should exist, much less which direction it should point. It works for baking:skilled because skill level is inherently part of the property of being a baker, and skill is undefined without a thing to be skilled at, whereas someone can easily reside in a location with no occupation ({Location:USA/Occupation:Layabout}?) or have an occupation with unknown/unfixed location ({Occupation:Inventor/Location:Nomad}?).

And if you try to create a synthetic context to place both tags under, you get... compound tags, or close enough as makes no difference to me. :) I'd 1000x rather start from there, and special case it to return results for each tag individually, than start introducing spurious orderings or dependencies. (ed: maybe it would be clearer to say "composite tags", as in tags composed out of other tags?)


Perhaps I misunderstood the example but I thought the point was precisely that there is a dependency between Occupation and Location which individual tags cannot express...?


> I'm sure there are sophisticated ontological systems which would allow users to specify all those different relationships separately. I'm also pretty sure that users would become sloppy after a short time or would disagree which particular relationship to use in a particular situation

I think the problem is allowing users to freely tag, then. There should be easily accessed guidelines about how each tag should be used, and people who are constantly moving them, correcting them, and updating usage guidelines.

We need the ability to implement governance systems on top of web 2.0+ style content systems. People should be able to vote for representatives (with any number of voting systems), create committees, submit changes to be voted on, etc. Instead we usually work based on hierarchical dictatorships or imagined consensus. People need organizational management tools baked into software, because organization of information depends on it. Instead of proposing a new committee to come up with the schema of everything, we need better tools that enable users to build committees.


The fundamental tension in tagging systems, to me, is whether tagging is a feature the software offers to the user or a task the user performs to assist the software.

In the first case, you want freewheeling and tolerate ontological inconsistencies because you want to offer flexibility to users and will capture hard to quantify emergent benefits (some made up examples: "try the tag user233-favorite, I keep discovering awesome articles!", "the physicist-needed tag has highlighted a lot of misinformation surrounding quantum physics and relativity"). People use it to the extent it is useful.

The other way, with formal semantics, governance (which you made some very wise points about), etc allows the software to reply to queries like "19th-century + Missouri + humorists" in a performant and authoritative way. It's not really a feature so much as it is a way to enable other features.


I recently ran into the same kind of problem in Wikidata.

https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Onto...

A typical problem is of the "light rail (Q1268865) is data visualization (Q6504956)" kind. This specific one is fixed, but there are many similar ones.

https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/...

https://www.wikidata.org/wiki/Wikidata:Project_chat#Ontology...


Many comments below are hinting at - but not naming - triplestores. "A has relationship X with B". This is how wikidata works.

Learning about those and learning how to query wikidata just blew my mind.


If it isn't too much trouble, I would love to see an example of a particularly complex query that can be done on top of this...Pseudocode or just a text description is fine, it doesn't have to be precise syntax.


I was toying with the first world war at the time. You could query famous soldiers born in the same home town as the current president, for example.
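To make the "A has relationship X with B" idea concrete, here is a minimal sketch of that kind of query against a toy in-memory triplestore in Python. It is not Wikidata's real data model or SPARQL; every entity name and predicate here (`born_in`, `holds_office`, and so on) is invented for illustration.

```python
# Toy triplestore: every fact is a (subject, predicate, object) tuple.
# All names and predicates are made up for this example.
TRIPLES = {
    ("alice", "occupation", "soldier"),
    ("alice", "born_in", "springfield"),
    ("bob", "occupation", "soldier"),
    ("bob", "born_in", "shelbyville"),
    ("carol", "holds_office", "president"),
    ("carol", "born_in", "springfield"),
}


def match(subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in TRIPLES
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def soldiers_from_presidents_home_town():
    """Join patterns: president -> birthplace -> people born there -> soldiers."""
    results = set()
    for president, _, _ in match(predicate="holds_office", obj="president"):
        for _, _, town in match(subject=president, predicate="born_in"):
            for person, _, _ in match(predicate="born_in", obj=town):
                if match(subject=person, predicate="occupation", obj="soldier"):
                    results.add(person)
    return sorted(results)
```

A real triplestore does the same joins declaratively, so you state the patterns and let the engine pick the join order.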


> I'm sure there are sophisticated ontological systems which would allow users to specify all those different relationships separately. I'm also pretty sure that users would become sloppy after a short time or would disagree which particular relationship to use in a particular situation...

You might be interested in SNOMED CT, a way to describe medical concepts. It does something rather similar.


I encountered the same problem a few years ago and indeed realized that using categories to understand what type of article a thing was (person? subject? event?) was utterly useless, for the reasons you describe.

On the other hand, I discovered that infoboxes (the data in the top-right box on most pages) were generally extremely reliable, if frustrating to parse.


The infoboxes are created from a query to Wikidata, which you can query yourself! No scraping necessary! https://query.wikidata.org/

You'll want to learn SPARQL, but if you know SQL it's not so bad to pick up.


As far as I can tell, that is not the case, sadly.

Right now it appears that only 3,975 articles have infoboxes auto-generated from Wikidata. [1] The wikitext contains something like "{{Wikidata Infobox ...}}" instead of just "{{Infobox ...}}".

If you look up a popular article like Barack Obama [2], it's just a traditional hand-edited infobox. In fact, one of the first lines of data says "Vice President = Joe Biden", while the Wikidata entry for Barack Obama [3] doesn't reference Biden anywhere -- so not only is the Wikipedia infobox not generated from Wikidata, but Wikidata isn't pulling all the relevant info from Wikipedia either.

Back when I had been working on my project, I'd hoped Wikidata could be a solution but it was far too incomplete and information was regularly out of date. Perhaps (hopefully) it's better now, but it's clearly not being used to power infoboxes yet except in a tiny number of cases. (Which actually complicates things more now, since anybody parsing Wikipedia infoboxes now has to deal separately with the 3,975 ones that grab from Wikidata, since none of the actual data is copied over into the wikitext...)

[1] https://en.wikipedia.org/wiki/Category:Articles_with_infobox...

[2] https://en.wikipedia.org/wiki/Barack_Obama

[3] https://www.wikidata.org/wiki/Q76


For sure, thank you for the correction. I was under the impression its role was broader.


Wikidata is not a solution at all.

I recently ran into the same kind of problem in Wikidata.

https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Onto...

A typical problem is of the "light rail (Q1268865) is data visualization (Q6504956)" kind. This specific one is fixed, but there are many similar ones.

https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/...

https://www.wikidata.org/wiki/Wikidata:Project_chat#Ontology...


> "Categories with more than 100 entries" might have a child "Categories with more than 100 entries in need of review"

This should have been specified by 2 separate tags: "Categories with more than 100 entries" and "In need of review".


("Occupations" -> "Writers") seems wrong. Why would you do this? Same for ("Categories with more than 100 entries" -> "Writers").

This seems like trying to put a tag on a category entity instead of creating a tag hierarchy.

Those 2 should be stored using different relationship type mechanisms.

("categoryTag", <SourceTag>, <DestinationTag>)

ex: ("categoryTag", "Occupations", "Writers")

and

("parentTag",<tagName1>, <tagName2>)

ex: ("parentTag", "American writers", "19th century American writers")
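A minimal Python sketch of that idea, assuming relations are stored as typed triples and traversal only ever follows one relation type at a time. The ("parentTag", "Writers", "American writers") row and the function names are my own additions for illustration.

```python
# Tag-to-tag relations as typed triples, so "meta" relations (categoryTag)
# and genuine hierarchy (parentTag) are never mixed during traversal.
RELATIONS = [
    ("categoryTag", "Occupations", "Writers"),
    ("parentTag", "Writers", "American writers"),  # hypothetical extra row
    ("parentTag", "American writers", "19th century American writers"),
]


def children(tag, relation_type):
    """Direct children of `tag` under one relation type only."""
    return [dst for (rel, src, dst) in RELATIONS if rel == relation_type and src == tag]


def descendants(tag, relation_type="parentTag"):
    """All tags reachable from `tag` by following a single relation type."""
    found, stack = [], [tag]
    while stack:
        for child in children(stack.pop(), relation_type):
            if child not in found:
                found.append(child)
                stack.append(child)
    return found
```

With this split, asking for the descendants of "Occupations" via parentTag returns nothing, which is the point: the category relation does not leak into the hierarchy.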


Indeed. It's a bit like if a programming language was trying to represent base classes and meta classes using the same mechanism.

My guess is that no one realized the need for "meta" categories when the system was implemented, so later the existing hierarchy was simply co-opted instead of implementing a new functionality for that use case.

As long as the categories are only used by human editors and use is only within some small subcommunity, it can work quite well. The problem starts if you want to combine categories used by different communities or if you (or your program) lack the domain knowledge to understand which nodes represent "meta" categories.

As another poster said, the better approach to use Wikipedia data for automated processing is using infoboxes or the explicitly machine-readable Wikidata repository. The category system looks machine-readable at first glance but really isn't.


An exactly analogous problem exists in the Collections hierarchy at the Internet Archive, of uploaded/digitized material (not the Wayback Machine web captures).

A single graph is applied locally with very different semantics; and absent a distinct tagging system, collection membership is sometimes used to mark material for treatment in some way.


Clearly the solution to all of this would be the category of all those categories that do not contain themselves.


This seems like one of those Eternal Problems that people, whether librarians, programmers, or hobbyists, stumble across, think they'll make headway in, then discover that they've really managed to progress just a few feet across a vast and hostile surface of landmines, pitfalls, and lures. Each "obvious" step (I'll have parent relations to define a context!) is only yet another bargain with the Devil, who laughs at your precautions.


Tagging’s pain is that it’s a problem easy enough that you can come up with plenty of ideas without prior knowledge. Its bane is that it is, in this sense, similar to bikeshedding: everyone can have an opinion about it. Fortunately, it’s only appealing to people who enjoy exploring problems.


I guess if you're really focused on it. I built a content tagging system for an old employer that would attempt to guess context based on keywords and associations but give the writer of the content the final say in what's actually being tagged.

Sure, I could have spent a thousand hours refining it, but the improvement would have been marginal and it still would need human interaction.


Was it used for content related to that particular business? I think as long as you have relatively limited variety, you can make something that works well enough.


The employer was a publisher. Writers would submit articles, the system would automatically tag them, and the system was good enough that usually the writer or editor would only need to weed out a false positive or two, which is about the same result as all these machine learning use cases with thousands of hours of dev time.

IIRC, terms were weighted, so that some of them needed to have more instances in the articles than others in order to be included in the final tag results.

Locations were one-offs, but specific topical items required more mentions because of false positives. And then there were things we called branch-offs. Branch-off tags occurred when a topic was mentioned enough to be a tag but there's another name that some segment of the population would know it by.

For example, the fish known as the white crappie is known in Louisiana as sac-au-lait, but people would also spell the word sacalait or sac-a-lait. So when we would get an article from an author in the Carolinas, they would have no familiarity with a term that is the dominant one in Louisiana, but the software would add the tag anyway, which also exposed it to our site search.
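Roughly, the weighting and branch-off behaviour could be sketched like this in Python. This is not the original system; the thresholds, the term list, and the function name are assumptions chosen to mirror the description.

```python
import re

# Hypothetical rules: minimum number of mentions before a term becomes a tag.
# Locations need one mention; topical terms need more to limit false positives.
MIN_MENTIONS = {"louisiana": 1, "white crappie": 2}

# "Branch-off" tags: regional names added whenever the main tag is applied.
BRANCH_OFFS = {"white crappie": ["sac-au-lait", "sacalait", "sac-a-lait"]}


def suggest_tags(text):
    """Suggest tags for an article; a human editor still has the final say."""
    lowered = text.lower()
    tags = []
    for term, threshold in MIN_MENTIONS.items():
        mentions = len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
        if mentions >= threshold:
            tags.append(term)
            tags.extend(BRANCH_OFFS.get(term, []))
    return tags
```

The output is only a suggestion list, matching the workflow where the writer or editor removes the odd false positive.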


Similarly, I think if you have a limited number of people doing the classification, you can also make a good shot at it.


> I can't find anything on how to design and implement anything more than the barebones basics of a system.

All of this stuff (horse/horses etc) is extensively discussed, maybe look under "taxonomy" or "ontology".

Now, whether you want to use any of those solutions or not or find the discussion useful or not... if you aren't finding anything about it at all, you aren't looking in the right places.

(I learned about it in librarian school)


To be fair to OP, the biggest hurdle in learning anything is knowing what questions to ask. When you don't have ontology as part of your vocabulary it's hard to find literature regarding, say, "comparison of ontologies for user-generated text content".

I suppose this flows back into library science, which is all about systematizing where to look for answers to questions, but I'm always astonished to find that there's oceans of literature and research in questions I haven't even thought to ask.


I think OP is referring to finding software-engineering related design discussions surrounding tagging systems, but yes, I’m sure there is a great depth of ontology material and librarian knowledge that could add to software system designs.


> (I learned about it in librarian school)

As the rest of us learned during the first tagging boom, the librarian is the natural apex predator of tagging.


I've been a librarian for more than 15 years and I can only speak from personal experience when I say that I am the apex predator of nothing. Every once in a while I will get it in my head to systematize my personal knowledge base with a controlled vocabulary and ontology and I just fall on my face. I really want it for some twisted reason, though.

Turns out LC subject headings -- for all their failures -- are pretty good.


Library of Congress classifications and subject headings (those are two separate things, for those unfamiliar) are not perfect, but they're pretty good, apply to a huge corpus, and to my mind most importantly, have evolved over a bit more than a century under numerous circumstances, including an absolute explosion of published materials, substantial changes to understanding organisation and classification of knowledge, and an awareness of the social and cultural aspects of these (as well as the institutional bias that's often embodied within them). That is, they have evolved a change management process.

The Classifications are substantively hierarchical, though that's really an outgrowth of the fact that they're used to locate books within physical shelf space, in which a record must occupy an address (physical space), and given that the Library's settled on subject classification as its storage and retrieval basis, this maps what's effectively a folded linear structure (shelf space) onto the multidimensional subject classification. It's not ideal, but it's workable. And many of the quirks of the LoCCS come out of the fact that it addresses both the composition (comprehensive, but still US-centred) and process (shelving, search, and retrieval) of the Library.

The Subject Headings are not hierarchical, though they're structured. In particular, they're relational, with numerous subject headings referring to others. There's some parent-child relations (though the top level hierarchy is broad), numerous retired classifications, and many "use that instead of this" notes.

(I've made ... some progress ... at a structured parsing of the subject headings, though that work's been stranded Because Reasons.)


I've learned to accept that my personal life and knowledge management is going to be a mess. (I'm also a librarian). I just don't want to do more organizing when I get home. I do also feel the temptation to do it 'right' once in a while, but it never sticks. I'd wager a lot of it has to do with the fact that managing an ontology completely on your own just sucks.


> controlled vocabulary

Are you using English? English words can almost mean whatever you want them to. Perhaps design your own language that removes ambiguity. Probably requires a knowledge of philosophy to distinguish between say concrete and abstract, good luck.

Maybe start with correcting the ontology of: https://cuberule.com/ (which takes a geometric approach to defining food types).

Also perhaps decide whether you want to work top-down like a directory tree (or Dewey Decimal?): resulting in standard book classification issues. Or bottom up: resulting in conflicts and discrepancies - https://news.ycombinator.com/item?id=33254025


> Perhaps design your own language that removes ambiguity.

That's what a controlled vocabulary is. It's essentially a set of tags which are clearly defined. So instead of #horses being defined purely by the word "horses," it has an attached definition along the lines of, "The category 'horses' includes equine biology, sports relating to horses, the cultural history of horses, and all other topics involving real horses. Metaphorical horses such as saw horses are not included." Tags like #horse would be redirected to #horses, since there is only one canonical horse tag in the vocabulary.
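As a rough Python sketch of that structure: one canonical term per concept, a scope note attached to it, and redirects from non-preferred forms. The `Term` class, the "equine" redirect, and the function name are illustrative assumptions, not any standard's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    name: str
    scope_note: str  # the human-readable definition attached to the tag


# Hypothetical vocabulary with a single canonical horse tag.
VOCABULARY = {
    "horses": Term("horses", "Real horses: biology, sport, cultural history. Not saw horses."),
}

# Non-preferred forms redirect to the canonical term.
REDIRECTS = {"horse": "horses", "equine": "horses"}


def canonicalize(raw_tag):
    """Map user input to a canonical term, or raise if it is not in the vocabulary."""
    key = raw_tag.strip().lstrip("#").lower()
    key = REDIRECTS.get(key, key)
    if key not in VOCABULARY:
        raise KeyError(f"{raw_tag!r} is not in the controlled vocabulary")
    return VOCABULARY[key]
```

Rejecting unknown tags outright is one design choice; a looser system might queue them for review by whoever maintains the vocabulary.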


Librarians are the people that we (technologists) should learn from. But all I see is programmers trying to invent things from first principles.


Eh, as the librarian who wrote the post you're replying to... I am actually ambivalent.

I wish librarianship as a field and industry were more what I'd fantasize it should/could be, but it's not so much.


How so?

What's missing / what would you remove and/or change?


The problem isn't knowing what the problem is (taxonomy and ontology), but how to implement it effectively.

I've seen enough of Hillel's posts over the years that I am fairly sure he is aware of taxonomy/ontology too.


Yeah, the content for learning has been around for over a decade or more.

Plus we have plenty of content for AI now

https://towardsdatascience.com/machine-learning-classifiers-...


Can you link some resources about it then?


This is a good basic overview, goes beyond tagging/indexing, was the textbook in LIS501 Information Organization and Access at UIUC-GSLIS (now the iSchool at Illinois) in 2006:

https://mitpress.mit.edu/9780262512619/the-intellectual-foun...

Controlled vocab standards:

https://www.niso.org/publications/ansiniso-z3919-2005-r2010

(this one is deprecated in favor of the one that follows)

https://www.niso.org/schemas/iso25964

https://www.w3.org/2004/02/skos/

The book we used in my thesaurus construction class at UIUC:

https://www.alastore.ala.org/content/essential-thesaurus-con...

My favorite intro to semantic modeling with RDF/OWL/SPARQL:

http://workingontologist.org/

Topic Maps are dead but I still have a soft spot for them:

https://www.isotopicmaps.org/

I also recommend Heather Hedden, linked in jrockhind's post.


I could, but honestly I'd just be googling "taxonomy". But ok that's not entirely true, I know how to refine my search and recognize when something is what I'm thinking of, from some familiarity with the field.

(But if you want to look around, in addition to "taxonomy" and "ontology", other good terms are "information architecture" and "controlled vocabulary").

These are not things I have vetted, this is literally just me googling and taking a quick skim...

https://blog.optimalworkshop.com/how-to-develop-a-taxonomy-f...

https://www.uxbooth.com/articles/introduction-to-taxonomies/

https://www.nngroup.com/articles/taxonomy-101/

http://accidental-taxonomist.blogspot.com/2020/11/what-it-th...

Or how about some textbooks:

https://narrowgaugebooks.indielite.org/book/9781627055802

https://www.hedden-information.com/accidental-taxonomist/


This is German, but I found it very good:

https://www.isi.hhu.de/fileadmin/redaktion/Fakultaeten/Philo...

Books:

* Cataloging the World

* Organising Knowledge. Taxonomies, Knowledge and Organisational Effectiveness

* The Intellectual Foundation of Information Organization

* The Oxford Guide to Library Research



I'm surprised I haven't seen more discussion of how tags are an entry point into plain-old data architecture. It should be obvious that by the time you're using tags for queries like "start-date: BEFORE 2022-03-01", you've created an inner-platform where you're building a plain-old relational database on top of your tags. Stop what you're doing and elevate "start date" out of tag-land and into a more structured representation with more application support.

Many enterprise databases add a memo field called "Comments" to almost every table. Clients very often end up coming up with their own guidelines about how to embed various information in the comments fields that the primary structure is missing. Looking over how clients are using the "comments" fields is a great way to discover new things that should be formally incorporated into the structure of your data architecture. Similarly with tags.

Look at tags as a starting point for adding a bit of loose structure to the frontiers of your data architecture. Mix them in with more structured data architecture. Be ready to "graduate" tags up to the next level of structure when it becomes appropriate. Stop worrying about how to make tagging perfect and embrace it for what it is: an easy way to get started on modeling the parts of the domain that you haven't spent a long time thinking about yet. A good way to understand how users want to use your system. Something you're always revisiting, cleaning up, and using as a source of inspiration. If you see some tags getting out of hand, don't try to improve your tagging system; instead take what those tags are trying to represent and add more structured fields and queries for them. This pipeline of less to more structure should be constantly playing out in a healthy, evolving system.
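One cheap way to spot tags that are ready to "graduate" is to look for prefixes that users keep treating as field names. A minimal Python sketch, assuming tags shaped like `key:value` and an arbitrary usage threshold; both the convention and the function name are assumptions.

```python
from collections import Counter


def graduation_candidates(tags, min_uses=3):
    """Find tag prefixes used like fields, e.g. 'start-date:2022-03-01'."""
    prefixes = Counter(tag.split(":", 1)[0] for tag in tags if ":" in tag)
    return [prefix for prefix, uses in prefixes.most_common() if uses >= min_uses]
```

Anything this returns is a hint that a real column with real query support is overdue.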


Relatedly, comments fields are the bane of data compliance exercises. You think you’ve caught everywhere a customer’s information might be stored, and then at the last minute you find out support have been putting phone numbers in the comments field because they had nowhere else to put it.


sounds like a savior!


What a weird decision to store dates in tags, it looks like bad design. Fully agree with the primer to data architecture, I've seen people almost getting to the point of writing a DSL over tags. Madness!


As someone who dabbled in adding basic tagging to a database recently, this take feels absolutely spot-on. I’ll draw on it when evolving our system forward.


I've been wanting to make a datalog tagging system for a few things for a while now but don't have the energy to actually do it. Essentially the idea is to encode relationships allowing for very specific queries like: "show me pictures of a person wearing a green hat looking at another person" which is not something most tagging systems could reasonably do.

Breaking that down, that'd be something like:

  wearing(person1, hat), is_hat(hat), is_green(hat), is_person(person1), is_person(person2), looking_at(person1, person2).

I wanted to apply this to Brazilian Jiu Jitsu videos to be able to answer very specific queries like, "matches where player 1 gets a takedown, gets swept by player 2, and player 2 wins by submission". A sufficiently well tagged data set would let you find specific stories and sequences of events in a way that I don't think a non-computational query system could do.

Most of the effort and value around a system like this would be building a community of people to tag the data and tools to make that tagging easy... and perhaps a more user friendly query UI.
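For a feel of how the green-hat query evaluates, here is a brute-force Python sketch over a handful of facts. A real Datalog engine would do this far more efficiently; the fact names mirror the query, while the entities ("ann", "ben", "hat1") are invented.

```python
from itertools import permutations

# Hypothetical facts extracted from one tagged picture.
FACTS = {
    ("is_person", "ann"),
    ("is_person", "ben"),
    ("is_hat", "hat1"),
    ("is_green", "hat1"),
    ("wearing", "ann", "hat1"),
    ("looking_at", "ann", "ben"),
}

ENTITIES = sorted({arg for fact in FACTS for arg in fact[1:]})


def green_hat_wearer_looking_at_someone():
    """Try every binding of (person1, person2, hat) to distinct entities."""
    matches = []
    for person1, person2, hat in permutations(ENTITIES, 3):
        goals = [
            ("wearing", person1, hat),
            ("is_hat", hat),
            ("is_green", hat),
            ("is_person", person1),
            ("is_person", person2),
            ("looking_at", person1, person2),
        ]
        # The query succeeds only if every goal is a known fact.
        if all(goal in FACTS for goal in goals):
            matches.append((person1, person2, hat))
    return matches
```

Using `permutations` forces the three variables to be distinct, which matches "another person" in the English version of the query.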


So you might be interested to know that medical information is described in the way that you propose. SNOMED CT [0] uses a standardized set of "relationships" between "concepts".

[0] https://en.wikipedia.org/wiki/SNOMED_CT?wprov=sfla1


I think tag aliases are fine, but in my opinion, tags should not have hierarchies. That is just opening the can of ontology worms, and most systems are ill-equipped to deal with ontologies...including ontological systems.

Tags are just dumb strings which label data. They are basically KeyValues, where the value is just always equal to True. We don't think of KVs as hierarchical unless they are explicitly a path string, and in that case, they are forced to be a plain tree with no cycles or diamonds.


Not having tag hierarchies doesn't fix the difficulty of classification, it just handwaves it away. There will always need to be (super)tags that are collections of other tags, where it is a bug for an item that has a particular tag to not also have another, related tag. The question should be how you're going to handle that, not if you're going to handle it, or you'll end up with a lot of broken tags of dubious usefulness.

Tags are just dumb strings that label data, but tags are also data. If I can't label tag:"red" a tag:"colored" in your system, it's not great. It's not much better if I'm labeling things tag:"colored-red" because if I'm doing that and there's no central validation to add semantics to that relationship, I'm going to end up with tag:"red" things, tag:"colored" things, tag:"colored-red" things, and probably even tag:"color-red" and tag:"red-color" things.

edit: what's so bad about cycles when it comes to a tag being assigned another tag that has been assigned the original tag? It's just a mutual implication. There's nothing wrong to me with adding a single tag and seeing five more added automatically. It means that you're building a knowledge base.


> It means that you're building a knowledge base.

That's precisely the problem. You started with tags and now you are building a knowledge base. You wanted a banana and now you have a gorilla holding a banana and the whole jungle. If you want to build a knowledge base, use links/URIs/ontologies.

You'll find out the moment you add cycles the algorithms get way more intense. And then once you have stronger algos, you want more search power. Next thing you know you are bikeshedding about things like "apple" is both "fruit" and "tech company" so you need tags-of-tags etc. Just build a knowledge base if you need a knowledge base. Otherwise tags are just a way to do faceted search.

They aren't mutually exclusive, either.

OpenTelemetry, for example, has both tags and references.


This is just search with synonym analyzer / partial match. If you make the tag search dynamic, you’ll find the tags you’re looking for quickly.


Nothing you say is necessarily the case, and is dependent on implementation. Take "value is just always equal to true", well, no, not if your key is a predicate. "Color:red" is more powerful than "#red" or "red:true", and "color:[lookup-ID-for-red-concept]" is substantially more powerful than both.


>"I think tag aliases are fine, but in my opinion, tags should not have hierarchies."

Many years ago I developed a proprietary database for a media-related product. It was a NoSQL Entity-Attribute-Value database where an Attribute was basically a tag. Tags had no hierarchy, but the query language allowed specifying a sequence of attributes like Genre, Artist, Album, Title. When that sequence was not empty, the result set would be a tree where each level corresponded to an attribute position as defined in the query.
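A small Python sketch of that query-time hierarchy, assuming flat (entity, attribute, value) rows and nested dicts as the result tree. This is a guess at the shape of the idea, not the proprietary system; the rows and the function name are illustrative.

```python
# Hypothetical entity-attribute-value rows: (entity_id, attribute, value).
ROWS = [
    (1, "Genre", "Jazz"), (1, "Artist", "Miles Davis"), (1, "Title", "So What"),
    (2, "Genre", "Jazz"), (2, "Artist", "John Coltrane"), (2, "Title", "Naima"),
    (3, "Genre", "Rock"), (3, "Artist", "Queen"), (3, "Title", "Bohemian Rhapsody"),
]


def as_tree(rows, attribute_order):
    """Group entities into nested dicts, one level per attribute in the query."""
    entities = {}
    for entity_id, attribute, value in rows:
        entities.setdefault(entity_id, {})[attribute] = value

    tree = {}
    for attrs in entities.values():
        node = tree
        for attribute in attribute_order:
            # Descend one level, creating the branch on first use.
            node = node.setdefault(attrs[attribute], {})
    return tree
```

The stored tags stay flat; the hierarchy exists only in the order the caller asks for, so `["Artist", "Genre"]` gives a different tree from the same rows.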


Org mode approaches this by making hierarchies and inheritance optional. I personally like both, but I acknowledge (as was mentioned in the tweets) that hierarchies can get to be very convoluted if you don't work to maintain them sensibly.


What I like most about org mode tags is that regular expressions can be subtags (or "members of a group tag" in org mode lingo). So you can specify a hierarchy where the parents have children you don't know in advance.


Optional forests of hierarchy trees are where it's at. Essentially don't encode everything into one gigantic one.

Sometimes you know that users are going to tag `laptop` a bunch and want that to also drag in `personal computer` (but not all `PC`s are `laptop`s) or that `blue dress` is also a `dress` and don't want to hard code special cases.

That said, if you are going to do this, then you must have it controlled by an admin/moderator. Maybe allow for hierarchy request submissions but have it moderated. There is at least one public system where this just works to my knowledge and a bunch of self-hosted ones as well.


> They are basically KeyValues, where the value is just always equal to True

That would be a set of values :-)


Sure, but my point there was that it's really easy to use KVs as a backing store for tags, which you can implement anywhere and easily serialize. With sets, if you have to do something like transform them to JSON, you have a choice to make: dict:true, or list.


I understand your pain, but want to make you aware that LINQ has become so powerful, especially with lazy evaluation and expression trees, that hierarchical views of tags are really basically simple and actually just one more method of visualizing data...


One example of an unexpectedly rich and deep tagging ontology is the Danbooru "Anime" image board [NSFW] https://danbooru.donmai.us/


Yeah, danbooru or similar image boards basically have all the things talked about in this tweet thread.

They have tag aliases, meta-tags and so-called "tag implications".

The last one is basically sub-tags but with more flexibility and dead simple to implement: if A implies B, then tagging an image with A will automatically tag it with B. So you can tag "American Male Novelist", and then the system will automatically add "American", "Male", "Novelist", "Writer", etc. (after such implications were added).

It is much easier than Wikipedia's categories, but Wikipedia's way is of course intentional because categories are meant to have a stronger hierarchy than mere tags.
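Tag implications really are short to implement. A minimal Python sketch, assuming a plain dict of implications; the table contents and the function name are made up, and this is not Danbooru's actual code. The writer/author pair is a deliberate cycle to show that expansion still terminates.

```python
# Hypothetical implication table: applying the key also applies each value.
IMPLICATIONS = {
    "american_male_novelist": {"american", "male", "novelist"},
    "novelist": {"writer"},
    "writer": {"author"},
    "author": {"writer"},  # deliberate cycle: a mutual implication
}


def expand(tags):
    """Return the tags plus everything they imply, transitively."""
    result = set()
    pending = list(tags)
    while pending:
        tag = pending.pop()
        if tag in result:
            continue  # already handled; this is also what breaks cycles
        result.add(tag)
        pending.extend(IMPLICATIONS.get(tag, ()))
    return result
```

Whether to expand at tagging time (store the implied tags) or at query time (store only what the user typed) is a separate trade-off; this sketch works for either.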


How much content they've actually put in their tagging system is just as interesting as how the tagging system works.


And the content itself, I'd say.

But yeah, I've always been impressed by how detailed and specific their tags were. That's A LOT of work! The power of porn, I guess. And as an added bonus, those tags now power AIs able to generate custom hentai on demand!


Unfortunately, since their tags are so abundant, most of them (mainly the "too specific" ones) are far from exhaustive/complete in terms of being actually used. And things like hair colors are extremely subjective so you're not sure if an image is going to be tagged as brown_hair or red_hair.

If you want to find some images by tags, you'd better stick with more generic ones.


There is a safe-for-work, or at least safer-for-work version of the site: https://safebooru.donmai.us/

(It is of course based on the tagging system: every post is tagged by its "safeness" level.)


I know this is not reddit. But why do you even know this link and its tagging system...


I'm not scared away by things that might offend the puritanically inclined and I'm interested in ontologies and this is a fascinating one.

There was some drama about someone training a Stable-Diffusion-alike by ripping their dataset that brought it to my attention.


Danbooru is one of the most popular anime image boards.

Anyone who's into Anime (not just for hentai) probably knows.


I created a new kind of object store where tagging is one of its key features. Each data object (called a Didget - short for Data Widget) can have a set of contextual tags attached. This is true whether the Didget holds file data like a photo, a document, or a piece of software; or if it holds other kinds of structured or semi-structured data (relational tables, folders, configuration, etc.).

Each defined tag has a data type (STRING, INTEGER, DATETIME, etc.) and a 2 level context. Like a column in a relational table within a columnar store; all the values for the same defined tag are stored together. This makes querying extremely fast.

So you can define tags like Person.FirstName, Event.Wedding, FileSystem.Extension and then attach values to files and other kinds of content. You can then query the system (e.g. Find all photos where Person.FirstName = 'Billy') based on their tags.

I have created containers with 200M of these objects and put a dozen or so tags on each one. It can run queries that return in just a couple of seconds.
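To illustrate the general idea (not the actual Didget implementation, which I have no details on), here is a tiny Python sketch of typed, context-qualified tags stored column-style, one mapping per tag. The tag names come from the description; the storage layout and function names are assumptions.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical tag definitions: "Context.Name" -> required value type.
TAG_TYPES = {"Person.FirstName": str, "Event.Wedding": bool, "Photo.Taken": datetime}

# Column-style storage: one dict per tag, mapping object id -> value, so a
# query on one tag never touches the values of any other tag.
COLUMNS = defaultdict(dict)


def attach(object_id, tag, value):
    """Attach a typed tag value to an object, rejecting the wrong type."""
    expected = TAG_TYPES[tag]
    if not isinstance(value, expected):
        raise TypeError(f"{tag} expects {expected.__name__}, got {type(value).__name__}")
    COLUMNS[tag][object_id] = value


def find(tag, value):
    """Return ids of all objects whose `tag` equals `value`."""
    return sorted(oid for oid, stored in COLUMNS[tag].items() if stored == value)
```

A real columnar store would add indexes and compression on top, but the shape of the query is the same: scan one column, return ids.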

Demo Video: https://www.youtube.com/watch?v=dWIo6sia_hw


Is your design open-source? Do you have an API? Would like to learn/help


As many commenters have mentioned (as does the article) hierarchical tags are a pain, if not an impossibility to get right. Related tags, though, can be done on the cheap and are surprisingly powerful, fun and cool under the right conditions.

Say you have a massive database of photos, each photo having tags. As example we'll use the tag "United States", which is used as a tag on 50,000 photos. Next, you go over each of those 50,000 photos and check which other tags were used, and sort them by occurrence.

This reveals useful and often surprising implicit relations between tags. The relation can be of any type, hierarchical or otherwise. It reveals relations never explicitly mapped or maintained. It's organic, which kind of fits the philosophy of tagging.
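The whole technique fits in a few lines. A Python sketch, assuming photos are held as a mapping from id to a set of tags; the sample data and the function name are invented.

```python
from collections import Counter

# Hypothetical photo -> tags data.
PHOTOS = {
    "p1": {"United States", "New York", "skyline"},
    "p2": {"United States", "New York"},
    "p3": {"United States", "desert"},
    "p4": {"France", "skyline"},
}


def related_tags(tag, photos=PHOTOS):
    """Count which other tags appear on photos carrying `tag`, most common first."""
    counts = Counter()
    for tags in photos.values():
        if tag in tags:
            counts.update(tags - {tag})
    return counts.most_common()
```

At real scale you would likely normalise by each tag's overall frequency, otherwise globally popular tags show up as "related" to everything.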


A few months ago I worked on some proof-of-concept code for searching tagged data: https://github.com/aaviator42/Cha

I now work full-time in a role where part of my duties is designing a content tagging system and its search functionalities. It's very interesting and fun! Lots of puzzles.

How do you weigh different tags? How do you do fuzzy searching ('city' should match with plural ('cities'), misspellings ('citys'), etc)?

How do you program the system so that 'hotdog' is not matched with 'hot' and 'dog'? What about synonyms? What about regional terminology and synonym tables?

Then there's one-to-one and one-to-many and many-to-one mapping.

As a side project I'm also working on a beta public search engine that I'll launch on HN sometime in the next year or so, where I'm having similar puzzles.


Seems to me like a lot of these are solved by a dedicated search engine?

I see it in this business all the time how people try to reinvent the wheel and end up writing their own search engine, thinking it's just another small in-house project, but they quickly run into the difficult problems like, well, these.


Nice. Your Cha project has the beginnings of a search engine with cosine similarity, even though I don't think you intend to take it there. Tagged items are like a precomputed inverted index, and the matching is a search on that index.
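To spell out the inverted-index view, here is a minimal Python sketch that ranks items by cosine similarity between binary tag vectors. It is an illustration of the general technique, not how Cha works; the items and function name are assumptions.

```python
import math
from collections import defaultdict

# Hypothetical items and their tags.
ITEMS = {
    "doc1": {"city", "travel", "food"},
    "doc2": {"city", "architecture"},
    "doc3": {"hiking", "travel"},
}

# Inverted index: tag -> set of items carrying it.
INDEX = defaultdict(set)
for item, tags in ITEMS.items():
    for tag in tags:
        INDEX[tag].add(item)


def search(query_tags):
    """Rank items sharing at least one tag with the query, best match first."""
    query = set(query_tags)
    candidates = set().union(*(INDEX.get(tag, set()) for tag in query))
    scored = []
    for item in candidates:
        overlap = len(query & ITEMS[item])
        # Cosine similarity for 0/1 vectors: |A & B| / sqrt(|A| * |B|).
        score = overlap / math.sqrt(len(query) * len(ITEMS[item]))
        scored.append((item, round(score, 3)))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

Only items found through the index are scored, which is what keeps this cheap when most items share no tags with the query.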


Yep! I'm not planning on working further on Cha at the moment, but I've learnt a lot since I initially wrote the PoC code and that is probably how I'd go about it.


> How do you program the system so that 'hotdog' is not matched with 'hot' and 'dog'?

That sounds like a very good use case for word embeddings.


How do you deal with "hotdog" possibly being a noun (several meanings), or proper noun (several meanings), or verb, or interjection?


e621 frequently has to deal with characters with the same name, or an artist with the same name as a character. they just make ambiguous tags have a special syntax. so if bob was an artist, but also had a character named bob, it would just be bob_(bob) for the character and bob_(artist) for the artist. and if someone tried to tag something as just “bob” they would be told to be more specific. searching for all bobs can be done with bob_(*).

so hotdog could have hotdog_(food), hotdog_(interjection), and hot dogs (the animal) would be two tags: hot and dog.

it’s not the cleanest solution, but it works well enough.
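A small sketch of that convention, assuming tags are plain strings and wildcard search is shell-style globbing. The tag set and function names are hypothetical, not e621's implementation.

```python
import fnmatch

# Hypothetical tag vocabulary using the name_(qualifier) convention.
KNOWN_TAGS = {"bob_(artist)", "bob_(bob)", "hotdog_(food)", "hot", "dog"}

def validate(tag):
    """Reject a bare tag when qualified variants of it exist."""
    variants = sorted(fnmatch.filter(KNOWN_TAGS, tag + "_(*)"))
    if variants:
        raise ValueError(f"'{tag}' is ambiguous; use one of {variants}")
    return tag

def search(pattern):
    """Shell-style wildcard search over the known tags."""
    return sorted(fnmatch.filter(KNOWN_TAGS, pattern))

print(search("bob_(*)"))
```

Here `validate("bob")` raises and lists the qualified options, while `validate("hot")` passes because no `hot_(...)` variant exists.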


I'm so happy to see people talk about this! I too am endlessly fascinated with content tagging systems.

Hillel's thoughts are completely unsurprising to me so I guess I've come to similar conclusions.

I do notice that we seem to care about different things though - where Hillel appears to focus on tag types (and the implementation challenges that go with that) I focus more on human factors like what problem are we solving? for who? How do we maintain relevance (and power) in tagging systems (and for who?)

I'm of the opinion that tagging systems should not be made by the few for the many but by each person for themselves. Which, of course, sucks because that puts the onus on everyone who wants tagged content to do their own work. But I believe the output of that investment would be quite valuable and useful!

An easy example I could use might be recommendation engines. Assume I have a database of tags (a tag cloud?), and I know you have similar interests to me. If you also have a tag cloud, I could input links to both of our tag clouds into a purpose-built recommendation engine to discover new content I might not have consumed yet.
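Roughly what I have in mind, as a sketch: treat each tag cloud as a mapping from item to tags, check that the two vocabularies overlap enough, then surface the other person's items that I have not tagged yet. The Jaccard threshold and all names here are assumptions.

```python
def jaccard(a, b):
    """Overlap between two tag sets, from 0.0 (disjoint) to 1.0 (identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(my_cloud, their_cloud, min_overlap=0.2):
    """Suggest items the other person tagged that I have not, provided our
    tag vocabularies overlap enough to suggest shared interests."""
    my_tags = {t for tags in my_cloud.values() for t in tags}
    their_tags = {t for tags in their_cloud.values() for t in tags}
    if jaccard(my_tags, their_tags) < min_overlap:
        return []
    unseen = {item: tags for item, tags in their_cloud.items() if item not in my_cloud}
    # Rank unseen items by how many of their tags I already use.
    return sorted(unseen, key=lambda item: (-len(set(unseen[item]) & my_tags), item))

# Hypothetical tag clouds keyed by URL.
mine = {"url1": {"python", "search"}, "url2": {"tagging"}}
theirs = {"url2": {"tagging", "ux"}, "url3": {"python", "tagging"}, "url4": {"cooking"}}
print(recommend(mine, theirs))
```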


> I could use might be recommendation engines. Assume I have a database of tags (a tag cloud?), and I know you have similar interests to me. If you also have a tag cloud

This was the first "naive" implementation on finclout. Every post gets automatically scanned for ranked keywords and then matched with other known entities about the post. We also collect tags from the user and have users verify keyword matches.


What made you move away from that "naive" implementation? What kind of implementation do you now employ?


This reminds me of a talk from Clay Shirky about categorization and general ontology. It's interesting to read in hindsight, because it's from when recommendation algorithms were in their infancy.

Warning PDF: https://ia800203.us.archive.org/10/items/Ontology_is_Overrat...

> This is what we're starting to see with del.icio.us, with Flickr, with systems that are allowing for and aggregating tags. The signal benefit of these systems is that they don't recreate the structured, hierarchical categorization so often forced onto us by our physical systems. Instead, we're dealing with a significant break -- by letting users tag URLs and then aggregating those tags, we're going to be able to build alternate organizational systems, systems that, like the Web itself, do a better job of letting individuals create value for one another, often without realizing it.


Thank you for this link. I’ve been looking for a good discussion of the browse vs search argument and this is very, very good


I've done this professionally in a couple different settings, from building topic classifiers for news events (it is sometimes hard to know when one news event should stop and another start) to creating tagging systems for audio recordings of group conversations (where topics often merge in and out of each other, often within a single sentence).

I'm currently working on classifying non-speech, non-musical sound and it can be useful to piggyback on an existing knowledge system, though they tend to be industry-specific. As an example, Google's ontology for sound identification [1] is a nice starting point for general classification, whereas the taxonomy [2] used by the audio post-production industry (sound effects, foley, etc) is structurally quite different (which isn't surprising, but it sure is fun!). From a totally different field (electro-acoustic composition), the work of Michel Chion and Pierre Schaeffer [3] add psychoacoustic elements to more traditional measurable characteristics, i.e. how the sound is perceived and comprehended is just as important as its medium of travel and its source. It is helpful to see what others have done before you so you can pick and choose elements of their work to incorporate into your own.

1: https://github.com/audioset/ontology

2: https://docs.google.com/spreadsheets/d/1b2UhKpcOAE-jd1edOsxC...

3: [big pdf!] https://monoskop.org/images/0/01/Chion_Michel_Guide_To_Sound...


The Google list is interesting. It only has main category and subcategory. Causes problems like “SCIFI WEAPON SCIWeap = Lightsabers, exotic sci-fi weapons. Not a 'blaster' which would go in LASERS-GUN.” because they created another main category LASERS for the desired four extra sub-subcategories (hmmmm, how should that be spelled? Categorically another conundrum.)


Anyone have a suggestion for a tagging filesystem that is maintained? Or if not a filesystem, something that at least works? I still feel like this is the best way to organize personal photos and media, and while https://www.tagsistant.net/ is pretty good it hasn't been updated in 6 years and is fairly buggy.


Dr. Karl Voit did his dissertation designing a tag-based file system. I don't know what the status is today, but the dissertation itself may be a decent place to start your search.

https://karl-voit.at/tagstore/en/papers.shtml


I haven't tried it yet but https://tmsu.org/ is actively maintained and looks nice.


MacOS has tags. Right click any file in finder, select "Tags..."

No idea if they are implemented at a filesystem level but there are various tools for finding things by tag


Edit: ok, wow I didn't even check the link before writing my suggestion but it's kinda freaky how similar they are. Even used a scifi movie as the example.

I have a horrible idea that I haven't actually tried out, and I don't know how many filesystems it would work on anyway, but hard links...

Create a directory that holds your files. Organise them however you like - some arbitrary subdirectory structure using dates, names etc.

Create another top-level directory called "Tags", and build a directory structure that supports your needs.

Write a stupid "tag" shell script and a shell tab-completion script for it that lets you tag any item in the "real" directory structure using tab completion on the original file name and the tag. When you hit the return key the script creates a hard link for the original file in the chosen tag directory.

Example: tag "files/movies/the matrix.mkv" "tags/movies/sci-fi"

Now you can browse 'tags/movies/sci-fi'.

No "real" coding skills needed. You reorganize your tag directories and files by moving them around, and if you've done your shell scripts properly it shouldn't care as long as the top-level directories don't change. Limits? On Linux with ext4 a file can have up to 65000 hard links (other filesystems differ), so I don't think it'll be an issue.

The many problems I see are name clashes, directories not supported (by hard links), cross file system links not supported (by hard links) and file deletions (don't work as expected). The tag script could handle the former. For deletions, you could create a "delete" that gets the file inode and deletes tags first, using 'find $tagsdir -inum $inode -delete'
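For what it's worth, here is the same idea sketched in Python rather than shell, to show the name-clash check and the delete-by-inode step. It is untested beyond a temp directory, the function names are made up, and it assumes the "files" and "tags" trees live on the same filesystem and do not overlap.

```python
import os
import tempfile

def tag(path, tag_dir):
    """Tag a file by hard-linking it into tag_dir; returns the link path."""
    os.makedirs(tag_dir, exist_ok=True)
    link = os.path.join(tag_dir, os.path.basename(path))
    if os.path.exists(link):
        if os.path.samefile(path, link):
            return link  # already tagged
        raise FileExistsError(f"name clash in {tag_dir}: {link}")
    os.link(path, link)  # fails across filesystems and for directories
    return link

def delete_everywhere(path, tags_root):
    """Remove every hard link to `path` under tags_root, then the file itself."""
    for root, _dirs, files in os.walk(tags_root):
        for name in files:
            candidate = os.path.join(root, name)
            if os.path.samefile(candidate, path):  # same device and inode
                os.remove(candidate)
    os.remove(path)

# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as base:
    movie = os.path.join(base, "files", "movies", "the matrix.mkv")
    os.makedirs(os.path.dirname(movie))
    with open(movie, "w") as f:
        f.write("placeholder")
    link = tag(movie, os.path.join(base, "tags", "movies", "sci-fi"))
    linked_ok = os.path.samefile(movie, link)
    delete_everywhere(movie, os.path.join(base, "tags"))
    link_gone = not os.path.exists(link)
    original_gone = not os.path.exists(movie)
```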


> The many problems […]

…, hard-links are incompatible with programs that attempt "atomic" file saves instead of rewriting a file in-place, …


I just gave up and mimicked tags with symlinks and subfolders. ie "foo" is tagged "todo" if there's a symlink to it in "Tags/todo/".

It works surprisingly well, since I can manage it with standard shell scripting.


Interestingly, this is effectively a tag hierarchy (orthogonal to the content), which is also a DAG.

https://twitter.com/mikeybtags/status/1582509479806980096?s=...

I wonder what disadvantages such a tag hierarchy has?

I imagine enforcing the acyclicity is hard/effortful.

Using (inherently hierarchical) folders (with no symlinks for folders) enforces the acyclicity constraint, and avoids diamond issues, so that’s neat.
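If you do allow multiple parents, the acyclicity check itself is not much code. A sketch, assuming the hierarchy is stored as a hypothetical `parents` mapping from each tag to its set of parent tags: before adding an edge, walk up from the proposed parent and refuse if you reach the child.

```python
def would_create_cycle(parents, child, new_parent):
    """True if making new_parent a parent of child would close a loop,
    i.e. child is new_parent or already one of its ancestors."""
    stack, seen = [new_parent], set()
    while stack:
        node = stack.pop()
        if node == child:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return False

def add_parent(parents, child, new_parent):
    """Add the edge only if the hierarchy stays acyclic."""
    if would_create_cycle(parents, child, new_parent):
        raise ValueError(f"{new_parent} -> {child} would create a cycle")
    parents.setdefault(child, set()).add(new_parent)

parents = {}
add_parent(parents, "sci-fi", "movies")
add_parent(parents, "movies", "media")
print(would_create_cycle(parents, "media", "sci-fi"))
```

This only prevents cycles; diamonds (a tag reachable through two different parents) are still allowed and need their own handling.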


Directory Opus embeds labels into the file system:

https://www.gpsoft.com.au/help/opus12/index.html#!Documents/...


unix has tags, they are known as hardlinks.


You can misuse them as such (I've done it myself, albeit on Windows), but they're not really made-to-measure for that use case. E.g. deleting files that have been "tagged" that way becomes more cumbersome (because to the OS all hardlinks are created equal, so deleting the "primary" file doesn't automatically delete the tags, too), it's incompatible with any program that uses "atomic" file saves instead of modifying files in-place, the UI for viewing and editing "tags" is not really there, …


there's a massive difference between tagging-for-self-recall and tagging-for-other-recall. when i invented tagging the first was paramount, but the latter has become dominant and has very different design considerations

one interesting note: you can infer a bunch of hierarchical information since people frequently tag from broader to more specific, topicwise.

some things can be tagged by multiple people and you can thus infer synonyms as well. this can thus be fixed in search.


"When I invited tagging" is such a flex. But creating delicious gives you some credible claims there.


Content tagging in online systems has existed since at least the 1970s, with the earliest example I can think of being MEDLINE https://en.wikipedia.org/wiki/MEDLINE


metadata is not tagging. keywords are not tagging.


Right, MEDLINE's "tagging" system is MeSH, which is a large controlled vocabulary. MEDLINE does contain bibliographic data + journal keywords, but its real value add is MeSH, which is used for search, related publication identification, etc. in PubMed.

https://en.wikipedia.org/wiki/Medical_Subject_Headings


I don't get to use it much these days


One weird content tagging system I recall was Amazon's "Amapedia" (https://en.wikipedia.org/wiki/Amapedia). It was a product wiki, a way for people to curate information of all sorts about Amazon products. It allowed each product to be arbitrarily tagged. It was short-lived, failed, and abandoned, for all of the reasons you'd immediately expect.

What was neat about it was that it must have involved someone a little too interested in set theory. A product was an article, and a product could have tags, but tags were themselves articles, and so tags could also have tags, and those tags could also belong to tags, etc.

The whole system was focused on these tags. If you wanted to compare two products, you'd compare the pages, and the comparison would focus on the differences in the tags of the two pages. Tags could have values, too, so products could have a "RAM" tag, and each RAM tag would have an associated value for that page, but the RAM page itself would have general information about RAM as a concept (which would probably have tags itself...). Searching worked the same way. You could search for pages with certain tags or tags whose values were greater/less/equal to whatever values.
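From memory, the data model looked something like the sketch below. This is my reconstruction for illustration, not Amapedia's actual schema: every page maps tag names to an optional value, and a tag name can itself be a page with its own tags.

```python
import operator

# Hypothetical pages: each maps tag name -> value (None for a plain tag).
pages = {
    "laptop-a": {"laptop": None, "RAM": 16, "weight_kg": 1.3},
    "laptop-b": {"laptop": None, "RAM": 8, "weight_kg": 1.1},
    "RAM": {"computer-component": None},  # a tag is itself a page with tags
}

OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}

def search(tag, op=None, value=None):
    """Pages carrying `tag`, optionally filtered by a comparison on its value."""
    hits = []
    for name, tags in pages.items():
        if tag not in tags:
            continue
        if op is None or (tags[tag] is not None and OPS[op](tags[tag], value)):
            hits.append(name)
    return sorted(hits)

def compare(a, b):
    """Tags whose presence or value differs between two pages."""
    keys = pages[a].keys() | pages[b].keys()
    return {k: (pages[a].get(k), pages[b].get(k))
            for k in keys if pages[a].get(k) != pages[b].get(k)}

print(search("RAM", ">", 8))
print(compare("laptop-a", "laptop-b"))
```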

Anyway, it was a fun and interesting way to do content tagging that did not work out.


My similar issue is with names in source code.

I'd like tooling that fuzzy-matches names and interrogates the contributor about the changes being checked in. Questions to ask the contributor: are the names similar to any of these other names? Is there an opportunity to use the same name, or are they different concepts?

Code grows and grows and becomes harder to grep if things are named inconsistently.


This is the reason the Semantic Web never took off—people on the internet can't even agree on what a "sandwich" is, let alone the exact hierarchy of ontology.

This is an area where large language models have a role to play—whatever you're hoping to achieve with user-generated tags can probably be achieved with ML-powered associations or navigation. And the potential benefit is that it could be tailored to each user—so you're only surfacing "Hot Dogs" when certain users click "Sandwich."


I thought that the cube rule of food generally settled the sandwich debate. A hot dog is not a type of sandwich, being surrounded on three sides. Instead, it is a type of taco.


Haha yes, and I would hate to meet the Turing-complete tagging system that could capture this nuance!


This is what we do for the most part. Two tiers of 'tags'. One is curated and required, the other is an embedding.


We spent a lot of time building tagging systems to organize technology skills on https://www.moonlightwork.com.

The coolest part was training a collaborative filter on the tags. So, when you add "Django" as a skill, it could recommend "Python" as a related skill. This made for some refined user experiences.

Getting typeahead search right took a lot of refinement. Here is some of the logic we ended up implementing over time:

1. Exact matches get prioritized first (e.g. "Go")

2. Abbreviations support (e.g., "AWS" for "Amazon web services" or "ROR" for "Ruby on Rails")

3. Names that start with the query should go before non-leading matches (e.g., "Ru" should return "ruby" before "task runner")

4. We tracked an "Aliases" column for each tag to enhance search. So, "golang" was an alias for "go".
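A simplified sketch of how those four rules can combine into one ranking. This is not our production code: the tag table is made up, and abbreviations are folded into the aliases list for brevity.

```python
# Hypothetical skill tags with an "aliases" column, as in rule 4.
TAGS = {
    "go": ["golang"],
    "ruby": [],
    "ruby on rails": ["ror"],
    "amazon web services": ["aws"],
    "task runner": [],
    "django": [],
}

def rank(tag, aliases, q):
    """Lower is better; None means no match."""
    names = [tag] + aliases
    if q in names:
        return 0  # rules 1, 2, 4: exact match on name, abbreviation, or alias
    if any(n.startswith(q) for n in names):
        return 1  # rule 3: leading match
    if any(q in n for n in names):
        return 2  # non-leading match
    return None

def typeahead(query, limit=5):
    """Return matching tags, best rank first, ties broken alphabetically."""
    q = query.strip().lower()
    if not q:
        return []
    scored = [(rank(tag, aliases, q), tag) for tag, aliases in TAGS.items()]
    return [tag for r, tag in sorted(s for s in scored if s[0] is not None)][:limit]

print(typeahead("Ru"))
print(typeahead("go"))
```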


It's crazily sad that non-invasive tagging (without embedding into the file body) is not standardized across OSes and file systems. The only system I know of that supports tags is KDE/Dolphin/Baloo; outside KDE, tagging seems to be supported only by a handful of incompatible third-party apps.

Sadly I don't expect much progress to happen in this area. Almost nobody cares about storing and organizing of files locally nowadays.

I hope it is going to be done sooner or later (there isn't much to do: just standardize some xattrs and something like an RDF schema to be used in an alternative FS stream, and add support for these to the standard file management and search tools; this is orders of magnitude easier than implementing a new FS), but probably not soon. It would take huge luck to get any resources allocated to this.


The hierarchical nature of the information he's talking about really reminds me of the ontologies and terminologies that are used in healthcare to organize medical information. E.g. Ibuprofen 10mg Tab < Ibuprofen < NSAID < ... < Therapeutic Chemical.

This is a field that I'm only tangentially familiar with, but it's a fascinating discipline trying to group and manage all of the different categories of healthcare data. You can use the RxNav tool to look at the RxNorm terminology, which is only 1 of many terminology systems.

https://mor.nlm.nih.gov/RxNav/search?searchBy=String&searchT...


Openstreetmap is map data that is basically coordinates with tags on them and relations between those tags. I guess this is true for most GIS software but there is very little 2D map data that can not be described in the OSM tagging model.

You can never express everything with tags, you need stats and metadata on metadata, documentation and a strong heterogeneity which also need to be able to adapt to new ideas.

https://wiki.openstreetmap.org/wiki/Tags https://wiki.openstreetmap.org/wiki/Map_features


https://wiki.openstreetmap.org/wiki/Tagging_mailing_list ( https://lists.openstreetmap.org/pipermail/tagging/ ) is a fascinating, hilarious and interesting place.

Basically it is about an endless attempt to classify at least part of reality, in an organically growing worldwide project based on a bunch of passionate obsessive hobbyists with overly strong opinions.

With a bonus of a bunch of politics, confusion and passion.

https://wiki.openstreetmap.org/wiki/Overpass_API/Overpass_AP... is likely of interest.



Look into AI systems from the 1960's and you will find Semantic Networks. If you just need categories you can go with taxonomies and folksonomies. If you want to (over?) formalize and describe mainly non-agentive structure you look at ontologies.


I don't know how people deal with tags. It adds so much friction to me. Naming tags, deciding what rules this tag is supposed to have, deciding what stuff is tagged. I tried the firm approach of being extremely discrete with tags and it took a lot of effort, and I've tried the loose approach of tagging things if they are even slightly related which imo defeated the whole purpose of organizing things to make it easy to find them later if a lot of tangentially related things share the same tags.

Folders seem a lot more straightforward for me at least, and if I need something in two places at once, there's always ln -s


I am too but I've given up. I've collected a lot of data over the years and spent a lot of time trying to organize it so I can find relevant connections. It's just too time consuming. I've decided discerning relationships in unstructured data is where I want to focus.


I'm largely the same with the exception of music. I don't use online music services, and have had the same (growing) collection of mp3/m4a/flac files since the late 90s. I have a few custom tag fields that I maintain in an ad-hoc fashion. First, I'll add new music to an inbox and give it a few listens. If anything jumps out at me I'll give it a rating, and maybe add some tags under 'mood' and 'instruments'. And that's about it. It doesn't have to be perfect and it doesn't require a lot of time. Maybe I'll spend 30 minutes a week going through my recent play history tagging specific tracks that I like.

It's nice to be able to put on "uplifting saxophone" or "dark dramatic piano"


Look at wikidata, RDF and semantic web. This is somewhat a well solved problem that should not be solved differently again.


Tags are arguably superior to folders for organising files; unfortunately the major OSes don't seem to agree. I've been using the same (expanded, adapted) folder structure for all my files since I got my first computer, and it's survived multiple OS migrations, being synced between multiple devices with different form factors, multiple changes in life circumstances (school, undergrad, postgrad, work),... I love tags and I've used them in some parts (eg in my old mp3 collection, for academic papers, for Anki flash cards) and I'd love to use a (simple and dumb, not rich enough to enable set theory paradoxes) tagging system to organise my files instead.

However, my experience has left me convinced that the only truly long-term solution for your personal data are flat files sitting on your hard drive inside a simple hierarchical folder structure. Anything else is likely going to rot at some point, after a system change, after some BigTech decides they want to use something else, after a start-up disappears, or it's going to keep you locked into some walled garden. Unless there's something I've missed, if so please let me know.


Very true. That’s why I’ve got endnote as my digital card catalog for papers. (Yes, there may be better reference tools but I’m too invested at this point…)


> It gets even more complex if tags can have multiple parents, like Wikipedia categories. "American Male Novelists" is a subtag of "American Male Writers" and "American Novelists". Now we have diamond problems, redundancy, a whole host of other edge cases.

I don't understand this problem. I would think that you would have

tag:american

tag:male

tag:novelist

tag:writer,

and tag:novelist would itself be tagged as tag:writer, because all novelists are writers.
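In code, what I'm picturing is roughly this sketch: items carry only flat tags, a separate (hypothetical) implication table says novelist implies writer, and a query matches against the expanded set.

```python
# Hypothetical implication table: tagging X also implies each tag listed.
IMPLIES = {"novelist": {"writer"}, "poet": {"writer"}}

def expand(tags):
    """Add every tag implied, directly or transitively, by the given tags."""
    result, stack = set(), list(tags)
    while stack:
        t = stack.pop()
        if t not in result:
            result.add(t)
            stack.extend(IMPLIES.get(t, ()))
    return result

def matches(item_tags, query_tags):
    """True if the item carries, or implies, every tag in the query."""
    return set(query_tags) <= expand(item_tags)

author = {"american", "male", "novelist"}
print(matches(author, {"american", "writer"}))
```

"American Male Novelists" then becomes a query (american AND male AND novelist) rather than a tag with two parents.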


I've been dabbling in personal knowledge bases for a long time now. I remember when I discovered tags -- thought it was the best thing ever. The first good implementation in the wild (for me) was del.icio.us. Eventually I ran into all the problems that the linked thread describes. "Movie" or "movies"? "Book" or "books"?

In any case, I still think flat tag lists are better than a directory tree structure ("Content/Movies" vs "movies, movie, entertainment, science fiction, space travel, aliens").

A recent innovation that I'm enjoying is backlinks. I believe roam research was the first major player that showed you related entries via the links that you included, even though a similar concept existed forever. Then you can generate clouds of relationships and find concepts visually [0].

0: https://noduslabs.com/cases/visualize-connections-notes-roam...


> backlinks

> recent innovation

Ted Nelson is rolling in his Xanadu


100% there was prior art to this, I was thinking zettelkasten. Didn't know about Xanadu though!


Tags are beautiful. They enable a non-hierarchical way of linking elements together so they form a graph. And graphs are beautiful. But they are also messy and bring a whole cohort of problems that you wouldn't have with trees.

The problem with tags is that they are the first and often only metadata available to represent the complex relationships between elements. So everything goes in it: tags for the semantics (ontology is a rabbit hole in itself), tags for relations with other items, and not forgetting the tags for project management (priorities, people, milestone,...).

Want to empower your tags, for instance adding hierarchy or dynamic tags? Then every tag will get these features and associated problems. A solution would be to have tags of different "types", each processed differently, and migrate the metadata from a "bag of tags" to "a bag of bags of tags". But then tagging wouldn't be as simple as writing a name in a field.


I'd love to know what those prolific Spotify engineers think of this.

That was a joke because Spotify doesn't let you tag music.


(Objects in 100s of playlists)


A big miss on the list is that words (and so tags) do not mean the same things to each person, and do not even mean the same things in different contexts


Why twitter man.. these questions are clearly important but there is a space to discuss them https://matrix.to/#/#datalisp:matrix.org


I had to click like 5 links from that link in order to get to a site which requires me to sign in before allowing me to see the content. I still have no idea what I'm supposed to be seeing. And no idea what the connection between "datalisp" and content tagging systems is.

Maybe that's why twitter man?


Yeah I am not a fan of the login walls and all that either but there is a reason that we should try to use free and open source software and currently matrix is the option that is convincing enough to use.

Datalisp.is the web of trust / semantic web / whatever. It doesn't exist outside my head currently but it also exists in lots of other heads (at least bits and pieces) so I believe we should manifest it.


Seriously. I'm not a twitter fan, but even so, it's a short-form medium. Why do people abuse it like this, especially with great content? What's so bad about tweeting a link to a blog?

Anyhow, I use threadreaderapp to get through the frustrating twitter UI and the ways that it is abused: https://threadreaderapp.com/thread/1534301374166474752.html


> Why do people abuse it like this, especially with great content?

Probably to get a wider audience to actually read and engage with the ideas, and to crowdsource relevant information from said audience.

> What's so bad about tweeting a link to a blog?

Probably an 80%+ reduction (total guess) in the number of people who engage directly with the content and author.


"Advice: don't let the tag predicates refer to other tags"

But then how would I search by the tag of all tags that do not tag themselves???


I can't wait for the author of this thread to discover the AO3 tagging system, which is, frankly, a masterpiece that demonstrates how effective community management can lead to extremely good tagging and categorization, with very little miscategorization.

https://www.wired.com/story/archive-of-our-own-fans-better-t...

https://archiveofourown.org/faq/tags


> The only system I know that does that is the fanfiction site AO3, where teams of volunteers manually create aliases from, say, "snarry" to "Harry/Snape"

They seem aware already.


The AO3 tagging system badly needs pruning. I hesitate to make examples, as the specificity will serve as a "call out," but quite a lot of authors throw in single-use, digressive tags as some kind of commentary on their own work. Huge meandering swaths of crap tags, and the people who make them ought to have their permissions to create tags revoked.


I kind of disagree with this. Tags are dual use in AO3, specifically they serve as a way to find specific stories with specific thematic or plot elements, but they additionally serve as a free expression of the author because it's the author who chooses which if any tags they want to use to describe their piece. When an author gets to decide the categories of a work, the categorization also becomes an expression.

Consider the flavor of "Dead Dove: Do Not Eat" tag, which serves both as an author's expression of warning the reader and also a category of fanfic that is expected to have transgressive elements. Just tagging, idk, "child endangerment" completely misses the point of "Dead Dove: Do Not Eat" comparatively.


I will paraphrase this to avoid a callout, but "no regenerating limbs those arms are toast sorry QA despises them" is not a useful tag. (This is a mild example, I've seen far worse)

First, it is a single-use tag. Tags are for categories, not solo entries. Solo entries explode the tagspace to no good end.

Second, that expression belongs in the summation of the work, or just about anywhere else. Tags are for other people to use to find similar works or for readers to look for things based on their interests. Metadata is not for artistic expression, unless you're one of those people who believes that artists ought to be able to choose their own Library of Congress call numbers and such, people who want to include "elephant" in the metadata despite the work having nothing to do with elephants.


I think you're missing the point that, in AO3 specifically, tags are not solely metadata. Tags are also artistic expression in the context of AO3. That's the thing. AO3 doesn't function like the Library of Congress, and there are no librarians that are independently assigning categories to fanfic. An author can choose to opt out of tags entirely, and people cannot put tags on other people's fanfic even if it's relevant and would benefit that work's findability. The simple mechanism of the author having sole control of what tags they want to apply to the work causes the act of tagging to also serve the purpose of artistic expression-- this results in spontaneous tags going from single-use to culturally known, such as "no beta we die like men", and therefore I think arguably useful but only in the context of AO3.


> An author can choose to opt out of tags entirely, and people cannot put tags on other people's fanfic even if it's relevant and would benefit that work's findability

Curious about how this doesn't render the entire system near-useless? In my experience with other sites with user-generated content that allow tagging, this decision always makes the whole system way worse, because the OP alone is almost never going to be aware of all possible tags that are applicable to whatever it is they posted, and will instead just take the first 3-5 words that pop into their head and stick those in the tags field. The end result is a tagging system that barely works; you can search for a tag but you'll miss tons of stuff, and you can filter out a tag but you'll still see tons of stuff in that category. And if you ever find a hyper-specific tag you really enjoy it'll only have like 5 items in it even if there are hundreds or thousands it could be applicable to.

Don't get me wrong, the wiki-style approach of just letting anyone edit tags has its own issues, but it does at least result in tags on everything being at least mostly complete, and actually useful for finding what you want (or filtering out things you don't want).


> Curious about how this doesn't render the entire system near-useless? In my experience with other sites with user-generated content that allow tagging, this decision always makes the whole system way worse, because the OP alone is almost never going to be aware of all possible tags that are applicable to whatever it is they posted, and will instead just take the first 3-5 words that pop into their head and stick those in the tags field.

A few things make this work brilliantly:

- authors are encouraged to tag as much as they want with whatever they want

- tags have an autocompletion to help authors select tags on keywords

- authors are prolific fanfic readers themselves and are therefore usually extremely familiar with the tag system

- manual tag linking means searching for one tag will also return results for all related or near-identical tags, a linking which has an extremely high success rate due to dedicated and extremely knowledgeable volunteers

The overall result is that authors tag prolifically, and reuse prolific tags from others, and ultimately search isn't strongly affected because the entire readerbase is hyper-knowledgeable. Check out the extremely specific fanfic-only "hanahaki disease" tag description in ao3 and you'll quickly see that any variety of related tags, with any level of hyperspecificity (some tags have neither "hanahaki" nor "disease"!), will appear when searching for any of them, including hanahaki disease in other languages!: https://archiveofourown.org/tags/Hanahaki%20Disease
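Conceptually the linking behaves like the sketch below. I don't know how AO3 implements it internally; the alias table and work titles are invented for the example ("flower coughing illness" in particular is a made-up variant).

```python
# Hypothetical alias table maintained by volunteers: variant -> canonical tag.
CANONICAL = {
    "snarry": "Harry/Snape",
    "harry/snape": "Harry/Snape",
    "hanahaki": "Hanahaki Disease",
    "hanahaki disease": "Hanahaki Disease",
    "flower coughing illness": "Hanahaki Disease",
}

def canonical(tag):
    """Canonical form of a tag, or None for an unlinked one-off tag."""
    return CANONICAL.get(tag.strip().lower())

def search(works, query):
    """Works carrying any tag linked to the same canonical tag as the query."""
    target = canonical(query)
    if target is None:
        return []
    return sorted(title for title, tags in works.items()
                  if any(canonical(t) == target for t in tags))

# Hypothetical works and their author-chosen tags.
works = {
    "Fic A": ["Snarry", "no beta we die like men"],
    "Fic B": ["Harry/Snape"],
    "Fic C": ["Flower Coughing Illness"],
}
print(search(works, "snarry"))
```

In this model an unlinked one-off tag never resolves to a canonical tag, so it neither helps nor pollutes a tag search.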


Then tags in AO3 are just more of the text and not much of a finding aid. You can't have both.


Tags end up being an excellent finding aid due to the strength of the community's tag linking, you see. So they serve both purposes.


"no regenerating limbs those arms are toast sorry QA despises them" just isn't useful if I want to locate a particular text, other than "I'm liable to get a Tumblr-stink off of this crap."

And your defense of this is really ... internal, as in, this all looks like a lot of in-jokes to an outsider who is new to AO3, or even new to a particular fandom. If someone doesn't know the slang, the in-joke reference, it's still unhelpful.


> "no regenerating limbs those arms are toast sorry QA despises them" just isn't useful if I want to locate a particular text, other than "I'm liable to get a Tumblr-stink off of this crap."

Yeah, but you're not looking for that tag, and that tag wouldn't affect your search in any way. That's the thing. You're approaching tags like they can only ever be used one way, and yes they can be that, and also other things that don't affect your personal use. So when you search for your specific tag, all synonymous tags will also appear, and all superfluous tags don't affect your search. A one-off tag doesn't affect your ability to search for multi-use tags.

EDIT: Additionally, the fact the tag exists has also helpfully indicated to you that this is a fic you probably don't want to read because of the author's cultural hinting through their use of tags. You're proving my point here-- the one-off tag doesn't affect your ability to search for your specific fandom or tropes, but also it allows you to pick flavors of fanfic you want from that search because of your dislike of one-off tags.


You have it backward: I found the fic through other means entirely and eventually dropped it. When I encountered it again on AO3 (it was a cross-post), I said "Oh, look at those horrible tags." It was notable in the fact that I said "I need to keep this one handy the next time I end up having yet another conversation with someone about how much tagging sucks on AO3." Because this isn't the first time someone has brought it up to me.

They just crap up the results if I am searching for "regeneration" or "limbs." If something is used more than one way, yes, it does affect my personal use because it means "more stuff I have to filter through." When you search, what you do not want is extraneous results. That's the whole point of searching! And I guess my library experience is showing, but AO3 just reeks of amateur hour shenanigans. I predict that at some point there will be a movement to clean up that kind of junk.


Wait so, this tag you didn't like didn't even stop you from finding the fic? It didn't clog up your search at all because it wasn't even in your search when you found the fic you were looking for? What's the problem exactly? You're approaching this with a library lens but it's not a library! It was never even intended to be a library!

Additionally, it doesn't show up when you're searching for regeneration or limbs because it's a one-off tag and therefore isn't linked to the rest of the tag network. I suppose it would be a problem if you put it in the general search, but you'd also be catching anything with limb in the title, or limb in the author's name, too. I think this is coming from a place of multiple misunderstandings of how tags work from both a technical and a cultural standpoint.


And that's a problem, isn't it? I shouldn't have to be immersed in a culture to use the system. You've traded usability and user experience for ... a cultural in-joke. "Hi, this is AO3, and our tags aren't anything like anyone else's tags, but we're still gonna call them tags" is a problem. It's like if I made a search bar and it returned random results. It says search, but culturally, we give you random results. That's how we do it.

That's why we developed librarianship.


But it's not a problem because it doesn't affect your ability to search. One-off tags do not enter the tag search results. I'm super confused why this isn't obvious and intuitive to you.


> Tags are also artistic expression in the context of AO3.

Seems to be similar on Tumblr.


He is already familiar with it:

https://twitter.com/hillelogram/status/1579942625242931200?s...

Hillel actually mentioned it cursorily in the thread.


A taxonomy or hierarchical system sometimes also helps, e.g. on E621: https://e621.net/wiki_pages/23556 (NSFW if you scroll at all or click anything).


It's literally the third tweet in his thread.


They mention it in the 4th post in the thread.

I never heard of it though, what's so good about it?


A lot of the items described are problems in ontologies


Yeah. A tag is a predicate. Sub-tags are implication (male author => author). Tag aliases are equivalence (implication in both directions).


Isn't it super-tags that are implication? A male author (sub-tag) implies it's an author (super-tag). But an author does not necessarily imply it is a male author.
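
To make the direction concrete, here is a minimal Python sketch, assuming implications are stored as directed edges from sub-tag to super-tag and aliases as edges in both directions. `TagOntology` and its method names are made up for illustration; nothing here comes from a real tagging system.

```python
from collections import defaultdict


class TagOntology:
    """Toy model (hypothetical): implications are directed edges,
    aliases are edges in both directions."""

    def __init__(self):
        self.implies = defaultdict(set)

    def add_implication(self, sub_tag, super_tag):
        # sub_tag => super_tag, e.g. "male author" => "author"
        self.implies[sub_tag].add(super_tag)

    def add_alias(self, a, b):
        # Equivalence is implication in both directions.
        self.add_implication(a, b)
        self.add_implication(b, a)

    def expand(self, tags):
        """Return the given tags plus everything they transitively imply."""
        seen = set()
        stack = list(tags)
        while stack:
            tag = stack.pop()
            if tag in seen:
                continue  # also keeps alias cycles from looping forever
            seen.add(tag)
            stack.extend(self.implies[tag])
        return seen


ontology = TagOntology()
ontology.add_implication("male author", "author")
ontology.add_alias("sci-fi", "science fiction")

expanded = ontology.expand({"male author", "sci-fi"})
```

Expanding "male author" pulls in "author", but expanding "author" alone pulls in nothing else, which is the asymmetry being pointed out.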


In my app, users apply a set of tags to a note, but then the app automatically creates hierarchical associations in a tree. There are an exponential number of associations between tags (at one point the design was failing because it was trying to prebuild 100k+ GUI items for these cross-referenced tags), so I had to virtualize the intersection of tags at the exact moment a user expands a tree item.

You cannot plan what tag search will lead you back to the data you want, so every node in the graph must be bidirectional.
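
A rough Python sketch of that idea, assuming a two-way index (note to tags, tag to notes) and computing the children of a tree item only when it is expanded. `TagIndex` and `children` are hypothetical names, not the app's actual code.

```python
from collections import defaultdict


class TagIndex:
    """Hypothetical bidirectional index: every association is stored both ways."""

    def __init__(self):
        self.tags_by_note = defaultdict(set)
        self.notes_by_tag = defaultdict(set)

    def tag(self, note_id, *tags):
        for t in tags:
            self.tags_by_note[note_id].add(t)
            self.notes_by_tag[t].add(note_id)

    def children(self, path):
        """Tags that co-occur with every tag in `path`, computed on demand
        instead of prebuilding every possible intersection."""
        if not path:
            return set(self.notes_by_tag)  # top level: all known tags
        notes = set.intersection(
            *(self.notes_by_tag.get(t, set()) for t in path)
        )
        co_occurring = set()
        for note_id in notes:
            co_occurring |= self.tags_by_note[note_id]
        return co_occurring - set(path)


index = TagIndex()
index.tag("n1", "python", "snake")
index.tag("n2", "python", "coding")
index.tag("n3", "coding", "rust")

top_level = index.children(())
under_python = index.children(("python",))
```

Only the expanded path is ever intersected, so the cost follows what the user actually opens rather than the number of possible tag combinations.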


I hacked together a small extension to tag hacker news stories. A small presentation here,

https://datum.alwaysdata.net/static/extension/index.html

With the js files for the extension.

The motivation to finish it partly came from this hn thread. https://news.ycombinator.com/item?id=32970560


People look forward to a visit with the ontologist the way they do a visit with the orthodontist.


My (Chomskyish) hierarchy of tag systems goes something like:

tagged data

key=value tagged data

hierarchically tagged data (we just found the Unix filesystem!)

hierarchical key = value tagged data (oh damn, it's ldap, we dug too deep.)


When I started creating a simple blog system as a newbie developer, I needed to design its category/tagging system. Then I was surprised by the lack of good resources on how to design such basic features. I just wanted to know several design patterns and their pros and cons, but I couldn't find any, so I ended up designing my own crappy system.

I hope someone wrote articles on it with actual DB schemas.
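
For what it's worth, one common pattern is a plain many-to-many join table. Below is a minimal sketch using Python's built-in `sqlite3`, assuming a blog with `posts`, `tags`, and a `post_tags` link table. The table, column, and function names are illustrative only, and this is one design among several.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE tags  (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE post_tags (
    post_id INTEGER NOT NULL REFERENCES posts(id),
    tag_id  INTEGER NOT NULL REFERENCES tags(id),
    PRIMARY KEY (post_id, tag_id)
);
-- Second index so "all posts for a tag" is as cheap as "all tags for a post".
CREATE INDEX post_tags_by_tag ON post_tags (tag_id, post_id);
""")


def add_post(title, tag_names):
    post_id = conn.execute(
        "INSERT INTO posts (title) VALUES (?)", (title,)
    ).lastrowid
    for name in tag_names:
        conn.execute("INSERT OR IGNORE INTO tags (name) VALUES (?)", (name,))
        tag_id = conn.execute(
            "SELECT id FROM tags WHERE name = ?", (name,)
        ).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO post_tags VALUES (?, ?)", (post_id, tag_id)
        )
    return post_id


def posts_with_all(tag_names):
    """Titles of posts that carry every tag in tag_names."""
    names = list(set(tag_names))
    placeholders = ",".join("?" * len(names))
    rows = conn.execute(
        f"""
        SELECT p.title
        FROM posts p
        JOIN post_tags pt ON pt.post_id = p.id
        JOIN tags t ON t.id = pt.tag_id
        WHERE t.name IN ({placeholders})
        GROUP BY p.id
        HAVING COUNT(DISTINCT t.id) = ?
        ORDER BY p.id
        """,
        (*names, len(names)),
    ).fetchall()
    return [title for (title,) in rows]


add_post("Hello", ["intro", "meta"])
add_post("Schemas", ["db", "meta"])
```

The upside is that renaming a tag is a single-row update and tag counts are a simple `GROUP BY`; the downside is a join on every tag query, which is the trade-off people usually weigh against storing tags as an array or text column on the post itself.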


One area that's illuminating is the effort to annotate the results of whole-genome sequencing projects. Tagging stretches of the genome which represent coherent units of some sort, and then relating them to some functional capability of the organism, is not at all a solved problem.

Here's an overview from 2011 where they're struggling to even get a good tagging system up for single-celled microorganisms (a much easier problem than multicellular genomes like humans):

https://pubmed.ncbi.nlm.nih.gov/22180819/

> "Highlights include the development of annotation assessment tools, community acceptance of protein naming standards, comparison of annotation resources to provide consistent annotation, and improved tracking of the evidence used to generate a particular annotation. The development of a set of minimal standards, including the requirement for annotated complete prokaryotic genomes to contain a full set of ribosomal RNAs, transfer RNAs, and proteins encoding core conserved functions, is an historic milestone."


We have a big tagging problem where I work, and yesterday I tried using GPT-3 to assist. Worked well!

code and context: https://github.com/airbytehq/airbyte/issues/17893


I’ve written a tagging system from scratch for an existing system and it was one of the most interesting things I’ve worked out. I had total control over how it was implemented and I think I came up with a really nice, minimalist, scalable way to tag things, and to search them.


Care to elaborate? I'm also working on a categorization/tagging system - albeit a simple one - and I find myself in a struggle to keep it accessible enough to use on one hand and advanced enough to actually add value on the other hand.


If you share a brief overview of it, you might receive some good critique of it here (surfacing potential pitfalls others have run into with similar systems). Besides, it might be helpful to others.


Generally you either use Latent Dirichlet Allocation, exact tags, or a mixture of both. I structure the metric space to weight exact tags more heavily than LDA, whereas you can then create two more classes in that LDA space, of the heavier similar tags and then the description.


Those interested in the state of the art of professional tagging systems in cultural heritage may have a look into the CIDOC Conceptual Reference Model (CRM): https://www.cidoc-crm.org/


I like how they worked out an advanced tagging system's requirements in about a dozen tweets, starting with the most basic tagging system and working up through a tag hierarchy to a tree to a DAG, and then even talking about K/V tags, etc.


Surprised no one has nailed a use case for semantic tags and their associations. Python and snake doesn’t require hierarchies to differentiate from Python and coding. Why aren’t co-occurrences within and between content samples enough?
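
As a rough illustration of the co-occurrence idea, here is a small Python sketch that counts which tags appear together across items and ranks the neighbours of a given tag. The sample data and function names are invented, and it says nothing about whether raw counts are enough in practice.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tag_sets):
    """Count how often each unordered pair of tags appears on the same item."""
    counts = Counter()
    for tags in tag_sets:
        for a, b in combinations(sorted(set(tags)), 2):
            counts[(a, b)] += 1
    return counts


def related(tag, counts):
    """Tags seen alongside `tag`, most frequent first."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == tag:
            scores[b] += n
        elif b == tag:
            scores[a] += n
    return scores.most_common()


# Invented sample: "python" shows up in two different senses.
samples = [
    {"python", "snake", "reptile"},
    {"python", "coding", "django"},
    {"python", "coding"},
    {"snake", "reptile"},
]
counts = cooccurrence_counts(samples)
neighbours = related("python", counts)
```

The neighbours of "python" split into a coding cluster and a reptile cluster without any hierarchy being declared, which is the intuition behind the question.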


I set out building my first full-stack webapp [0] to make a custom theme-based tagging/organizational system for musical ideas. I did not initially realize all the hairy design choices inherent in this domain, but have found it humbling and educational.

Remaining features to be implemented include in-app audio recording, editing, and custom labeling outside of the main tree structured organizational system.

I'd appreciate any thoughts or suggestions if anyone cares to take a look!

[0] https://www.soundseeker.app/


Eh, the diamond problem and transitive issues don't exist because what is being reduced to is simply a set and membership. If expansions / aliases / synonyms / multi-membership produce overlaps, who cares, it's a set of hashes. The overwrites only represent wasted computation.

Really this is a simpler version of multiple inheritance. You don't have the issue of conflicting method signatures and implementations, only names.

The only danger is names meaning different things. You need your tags to be relatively unique to the meaning.


Maybe it's the project I am working on but right now I see the ideal search interface to be something like an OWL class axiom, that is, I am searching for instances of a class that has the following restrictions:

   * subclass of Actor
   * subclass of Singer
   * has been in at least 7 movies
   * was born after December 3, 1980
   * has been married to at most 3 other people

These can be intersected, unioned, complemented, etc.
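
A toy Python sketch of what such composable restrictions could look like, with `&`, `|` and `~` standing in for intersection, union and complement. This is not OWL, and `Person`, `Restriction` and the helper names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable


@dataclass(frozen=True)
class Person:
    name: str
    classes: frozenset
    movie_count: int
    born: date
    spouse_count: int


@dataclass(frozen=True)
class Restriction:
    """Wraps a predicate so restrictions can be combined with &, | and ~."""
    test: Callable[[Person], bool]

    def __and__(self, other):   # intersection
        return Restriction(lambda p: self.test(p) and other.test(p))

    def __or__(self, other):    # union
        return Restriction(lambda p: self.test(p) or other.test(p))

    def __invert__(self):       # complement
        return Restriction(lambda p: not self.test(p))


def subclass_of(cls):
    return Restriction(lambda p: cls in p.classes)


def at_least_movies(n):
    return Restriction(lambda p: p.movie_count >= n)


def born_after(day):
    return Restriction(lambda p: p.born > day)


def at_most_spouses(n):
    return Restriction(lambda p: p.spouse_count <= n)


query = (
    subclass_of("Actor")
    & subclass_of("Singer")
    & at_least_movies(7)
    & born_after(date(1980, 12, 3))
    & at_most_spouses(3)
)

# Invented sample data.
people = [
    Person("A", frozenset({"Actor", "Singer"}), 9, date(1985, 5, 1), 1),
    Person("B", frozenset({"Actor"}), 12, date(1990, 1, 1), 0),
    Person("C", frozenset({"Actor", "Singer"}), 8, date(1975, 2, 2), 2),
]
matches = [p.name for p in people if query.test(p)]
```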


It sounds like what you want is SQL.

There is no good solution for the cultural problem that a written language is somehow unsuitable for end users. But personally I have spent way too many hours trying to make a search interface, only to realize at the end that not only is my interface complicated and hard to use, it still has only a fraction of the descriptive power a SQL query has. At times I am tempted to make full use of the built-in database permissions and let the user just type queries directly, but this suggestion is always vetoed.


I am increasingly hating twitter being used for blogging.


If you don’t want to think too hard, just funnel the tag information into a search engine like Elasticsearch.

It already handles stemming, stop words, aliases, etc.


Sounds like they are trying to embed the search semantics in the data storage. Why not treat search as a distinct problem?


Yes. Clicking a tag is like searching for a single word. The crux is that tags add useful metadata that may not be in the content it is supposed to tag. Maybe instead of modeling tags separately from content, metadata should simply be joined into the content itself at the end, and then searched using the same text search tools used for content?


I was fascinated by ontologies 10 years ago. Since then, I've been studying the human brain, only to realize that this is an effort to basically build a software version of the human brain. Maybe it's possible, but it's definitely not feasible in 99.9% of cases. The closest thing we have is some machine learning approaches.


If you're at the point where you're adding hierarchies to your tags, I think you're fighting a losing battle. At that point, why not do what Google does and just make a BERT embedding? No way you're going to manually achieve the full extent of complexity of how humans group and describe things.


My current solution to this problem is just putting a JSONB column in relevant tables. GIN indexes do the heavy lifting as needed.

This lets us implement arbitrary, queryable ontologies on top of the data without requiring further database instrumentation (aside from creating an index now and then).


Also great on the topic of tagging, with more information about the AO3 scheme: https://idlewords.com/talks/fan_is_a_tool_using_animal.htm


I was interested in that too. I stopped as soon as I realized that any good search in a tagging system would be just a full-text search. E-commerce catalogs have detailed filters, but I think people use at most 2 properties in addition to a simple name search.


Approximate date is the bugbear of photo tagging. EXIF, Dublin Core, and vendors can't agree on what to do. Camera manufacturers don't care because at the time of the shot, the date is fixed. The problem is archival work: scanned and copied pre-digital material.


For what it's worth, ExifTool (and by extension, PhotoStructure) supports 0 for month and day. The problem is that almost all other applications won't see this as a valid date.

And I've struggled with how to convert this "fuzzy" date into something that sorts with other assets that _do_ have an exact date. Should they all live on the first of the month? In the middle of the fuzzy date range? Midnight? Noon?
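
One possible answer, purely as a sketch and not what ExifTool or PhotoStructure actually does: treat a fuzzy date as a range, sort by the middle of that range, and break ties so exact dates come before fuzzier ones. `FuzzyDate` is a made-up type, with 0 meaning "unknown" for month and day.

```python
import calendar
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FuzzyDate:
    year: int
    month: int = 0  # 0 = unknown
    day: int = 0    # 0 = unknown; ignored when the month is unknown

    def bounds(self):
        """Earliest and latest calendar day this fuzzy date could mean."""
        if self.month == 0:
            return date(self.year, 1, 1), date(self.year, 12, 31)
        if self.day == 0:
            last = calendar.monthrange(self.year, self.month)[1]
            return (
                date(self.year, self.month, 1),
                date(self.year, self.month, last),
            )
        exact = date(self.year, self.month, self.day)
        return exact, exact

    def sort_key(self):
        """Middle of the range first, then range width (exact dates win ties)."""
        start, end = self.bounds()
        width = end - start
        return (start + width / 2, width)


photos = [
    FuzzyDate(1900),          # sometime in 1900
    FuzzyDate(1900, 7, 2),    # exactly 2 July 1900
    FuzzyDate(1900, 7),       # sometime in July 1900
    FuzzyDate(1899, 12, 31),  # exactly 31 December 1899
]
ordered = sorted(photos, key=FuzzyDate.sort_key)
```

Sorting by midpoint keeps a year-only photo from piling up on 1 January, at the cost of placing it at a date nobody claimed; keeping the bounds around at least lets a date-range search stay honest about the uncertainty.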


Yes, I like your software and paid for it. But as you note it's not a solution guaranteed to be portable to another photo framework or even Microsoft file tag management.

DC discussed approximate dates at some point. What does 'circa' 1900 mean to searches by date? Is 1892 circa 1900?

It's hard.



A very insightful thread by Hillel Wayne on content tagging systems and their challenges.

Their ubiquitous use (in library and information sciences, and popular social networks like Instagram, Twitter, and Pinterest), their deceptive ease of implementation, and "obvious advantages" over hierarchies/folders mean that almost every developer has run (or will run) into them at one point or another.

Feel free to comment with good theory and case studies on tagging systems. (It's especially interesting with good case studies for how to model an advanced tag system in a graph database).


> It's especially interesting with good case studies for how to model an advanced tag system in a graph database

I wouldn't accuse it of being a good tag system, nor a true graph database, but one thing to look at is Semantic MediaWiki. It's a MediaWiki extension which takes Categories as a starting point, and extends it quite far with e.g. relations and key-value pairs.

One interesting feature of Semantic MediaWiki is called "Concepts" which are essentially "computed tags." They can be used in place of Categories in most places, but while Categories are set by editors on a piece of content, Concepts are defined by a query against Categories or other properties. This can help bridge gaps between different types of tags that represent different ways of thinking about the content.


Empornium aka luminance has a great tagging system.


Is there an optimal tagging system, performance wise? Seems like there could be a database just for tagging.


Probably a graph database, considering graph DBs are optimised for JOINs (they don't need to do them, thanks to direct relationships between individual records, a.k.a. index-free adjacency). The question is how you would effectively model the tag system in a graph DB, as there are several ways to do it.


A lot of the author's questions can be answered by "use an inverted index".
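
For anyone who hasn't met the term, a minimal Python sketch of an inverted index over tags: a map from each tag to the set of documents carrying it, so boolean tag queries become set operations. `InvertedIndex` and its `search` parameters are illustrative names, and lowercasing is the only normalization assumed here.

```python
from collections import defaultdict


class InvertedIndex:
    """Hypothetical tag index: tag -> set of document ids (the postings)."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.all_docs = set()

    def add(self, doc_id, tags):
        self.all_docs.add(doc_id)
        for tag in tags:
            self.postings[tag.lower()].add(doc_id)

    def search(self, all_of=(), any_of=(), none_of=()):
        result = set(self.all_docs)
        for tag in all_of:      # AND: intersect the postings
            result &= self.postings.get(tag.lower(), set())
        if any_of:              # OR: union the postings, then intersect
            result &= set().union(
                *(self.postings.get(t.lower(), set()) for t in any_of)
            )
        for tag in none_of:     # NOT: subtract the postings
            result -= self.postings.get(tag.lower(), set())
        return result


idx = InvertedIndex()
idx.add("a", ["Python", "snake"])
idx.add("b", ["python", "coding"])
idx.add("c", ["rust", "coding"])

python_docs = idx.search(all_of=["python"])
```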


I don't know much about this topic.

The only thing I learned: if you think you have a taxonomy, then you don't.


Yet we still can't search for multiple hash tags on instagram.


Pro tip: use stemming!


I am endlessly fascinated by how twitter has now become a dumping ground for complex topics that are difficult to read and follow. But what happened to the old blogs?


I have a really high standard for my blog posts. They go through several rounds of rewrites, with feedback from friends, before I'm happy with them. That plus the length (median ~2000 words) means that most of my blog posts take weeks or months to write. I can hammer out a tweetstorm in 20 minutes.

(Also, tweets are a fun format! I want each tweet to be a complete idea, which is hard when you have only 280 characters.)


It's lower effort to make a stream of consciousness post one sentence at a time, and as a bonus, there's a built in audience / discovery network where they're posting.


Lower effort for whom? Back when I were a lad, we were told to write so that our readers did not have to work to understand us. The point of writing is to be understood. Old man yells at cloud.


Lower effort to the writer, obviously.

The point of posting on Twitter is not to be understood, it's to be retweeted.


I think it's helpful to keep in mind that with most of the examples that get shared around, the choice for the author was not a string of tweets vs blog post, but rather a string of tweets vs not sharing at all.


How does one go back and edit a stream of consciousness like that into an actual coherent thought later though?

I was just having a conversation similar to this where it was explained "this is just how people my age do things". While attempting to avoid boomer/millennial tropes, this does make me wonder how much different schooling is now vs then (hoping to avoid those memes too).

I was always getting in trouble for just saying whatever came to mind vs slowing down to think if it really needed to be said or more specifically how it was said.


Nothing has happened to them. I have a few hundred distinct bookmarked blogs, if not over a thousand, and obviously my bookmark collection is a tiny fraction of what actually exists. They're still there.



