I am endlessly fascinated with content tagging systems (twitter.com/hillelogram)
590 points by redbar0n on Oct 18, 2022 | 266 comments

Instagram's tagging system was actually really effective at categorizing content and discovery because each hashtag was treated as a node in a (giant) graph, where each node has multiple properties, including post count (number of posts using a tag), 'velocity' (number of posts using a particular tag per unit time), etc. I could write up a big post about it, as I made a study of it when I created a web app for finding the most relevant tags a few years ago.
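
As a rough illustration of the node properties described above, here is a minimal Python sketch. It is not Instagram's implementation; the class name `TagNode`, the `add_post` helper, the one-hour default window, and the sample posts are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class TagNode:
    """One hashtag in the graph (hypothetical structure)."""
    name: str
    post_times: list[float] = field(default_factory=list)    # post timestamps, in seconds
    neighbors: dict[str, int] = field(default_factory=dict)  # co-occurring tag -> shared posts

    @property
    def post_count(self) -> int:
        """Number of posts using this tag."""
        return len(self.post_times)

    def velocity(self, now: float, window: float = 3600.0) -> float:
        """Posts per second over the trailing window ending at `now`."""
        recent = sum(1 for t in self.post_times if now - window <= t <= now)
        return recent / window


def add_post(graph: dict[str, TagNode], tags: list[str], timestamp: float) -> None:
    """Record one post: update each tag's timestamps and its co-occurrence edges."""
    for tag in tags:
        node = graph.setdefault(tag, TagNode(tag))
        node.post_times.append(timestamp)
        for other in tags:
            if other != tag:
                node.neighbors[other] = node.neighbors.get(other, 0) + 1


# Made-up sample data.
graph: dict[str, TagNode] = {}
add_post(graph, ["goth", "makeup"], timestamp=0.0)
add_post(graph, ["goth", "fashion"], timestamp=1800.0)
add_post(graph, ["goth", "makeup"], timestamp=3000.0)
```

The co-occurrence counts in `neighbors` are what would let an app walk from one tag to closely related ones.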

All that to say there was a lot to their system and it worked because users became aware that they were rewarded for using the most relevant tags. Using irrelevant tags was punished. This guided users towards using a mix of relevant popular and niche tags to maximize their reach, which, in turn, further improved the tagging system.

Instagram's tagging system isn't as important anymore as their algorithm has deemphasized it, in favor of other methods for classification and discovery, but there were a couple of golden years where it worked very well. Most users still look back on those years as the 'good times' even if they don't know exactly why. I'd go so far as to say they ruined the app after they deemphasized tags (and added way too many ads)

I went and got my laptop to type up a reply to this:

Instagram's tagging system was and is atrocious in combination with their discovery mechanisms and the incentives they create.

A real example, this has been true for years: I want to look at pictures of Jennifer Lawrence's makeup because she, like me, has hooded eyes and that makes a useful reference. I go to Instagram imagining that I will find fan accounts posting pictures. I search for #jenniferlawrence. 2.8 million posts. Nice.


Only three of the top nine pictures present there today (same on mobile for me) are of Jennifer Lawrence.

This is what happens when all the engagement is counted in one big engagement bucket, all eyeballs are equal eyeballs, and all likes are equal likes. There are two gorgeous pictures of Anne Hathaway here, and I'm sure they've gotten great engagement, but they are absolute shit at being pictures of Jennifer Lawrence. So now I have to scroll through a ton of absolutely irrelevant nonsense – with attendant ads, let's not forget the ads – if I actually want to use this tag for its sole raison d'etre.

For contrast, consider what happens when a picture is cross-posted on Reddit. If I upvote it in one subreddit, I am giving an engagement signal specific to that subreddit – and I as one user may choose to downvote that same picture in another if I think it doesn't fit there!

This isn't limited to low-effort reposting of professional photographers' work. It happens in less egregious ways for many, many tags in many areas I've seen on IG for quite a few years now (though I can't speak for the whole history of the app). Artists tagging media they're not using, etc. etc.

Isolated hashtags on IG were often not useful for finding specific content. To find what you want you'd need an interface to a page that showed posts across multiple hashtags which they didn't provide.

For example, posts that contain both #jenniferlawrence and #hoodedeyes would be what you are looking for, and if you interacted with enough posts containing both of those tags you would end up seeing the content you want, you just can't do it directly by navigating to those hashtag pages individually.
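
The multi-hashtag page described above amounts to intersecting the post sets of several tags. A minimal sketch, assuming an in-memory inverted index; the function names and the three sample posts are invented for illustration and say nothing about how Instagram stores tags.

```python
from collections import defaultdict


def build_index(posts: dict[int, set[str]]) -> dict[str, set[int]]:
    """Map each tag to the set of post ids that carry it."""
    index: dict[str, set[int]] = defaultdict(set)
    for post_id, tags in posts.items():
        for tag in tags:
            index[tag].add(post_id)
    return index


def posts_with_all(index: dict[str, set[int]], tags: list[str]) -> set[int]:
    """Return the posts that carry every one of the given tags."""
    sets = [index.get(tag, set()) for tag in tags]
    if not sets:
        return set()
    return set.intersection(*sets)


# Made-up sample posts: id -> tags.
posts = {
    1: {"jenniferlawrence", "hoodedeyes"},
    2: {"jenniferlawrence"},
    3: {"hoodedeyes", "makeup"},
}
index = build_index(posts)
both = posts_with_all(index, ["jenniferlawrence", "hoodedeyes"])
```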

I used to provide this in my free hashtag suggestion app (which I kept online because it was, IMO, the best at what it did and still gets 500k hits a month). The app allows users to iteratively navigate the hashtag graph to a final set of tightly related tags. Anyway, during this refinement process users would see a grid of post thumbnails where the posts contain the desired hashtags. Facebook eventually tightened access to IG's API and blocked my ability to provide this thumbnail grid, but the rest of the app is still standing.

> For example, posts that contain both #jenniferlawrence and #hoodedeyes would be what you are looking for

I mean, not to overwork it, but in this example, this isn't accurate: I just want pictures of Jennifer Lawrence, not the much smaller slice of content where the poster had her particular eye shape in mind. Also, I don't want to only "end up seeing" pictures of Jennifer Lawrence in my main feed, I want to be able to go find them when I want them. Instagram's ontology is designed to not make this possible, because everything about it is meant to facilitate tube-feeding, and that is why I feel so strongly that it is trash.

(We could also talk about how similar phenomena manifested elsewhere: the #goth tag on Tumblr in my Tumblr days was unusable because of the quantity of non-goths looking at a lowkey photo and deciding that was the best descriptor – so the actual #goth content was found in .... #gothgoth. And if you're wondering if this was discovered and subsequently chased out to #gothgothgoth, You Have Understood The Problem)

Right, but it was never meant to specialize in user-directed filtering and discovery to the level you are describing.

To your second point, every social context that gets 'cool' eventually gets LCDed into mediocrity (even HN). You have to outrun the noise, as you described in your #gothgothgoth example

>Right, but it was never meant to specialize in user-directed filtering and discovery to the level you are describing.

this, and what you said earlier

>Isolated hashtags on IG were often not useful for finding specific content. To find what you want you'd need an interface to a page that showed posts across multiple hashtags which they didn't provide.

forgive my naivete but I wonder exactly what could possibly have been good about their tag system if it didn't help users narrow down to just the content they wanted?

You ever just put a show on or pick up a book spontaneously, without really having a particular show or book in mind before you started?

This is kind of like that, call it 'directed spontaneity'. You open the app to see interesting stuff, but you don't know what exactly that might be in advance. The app learns some basic stuff about you using your interactions on content then tries to guess. You either like the stuff the app shows you or you don't.

The hashtag system was very good at filtering content to what you generally like in your feed and the Explore view, but it wasn't set up for you to fully guide your own experience in search of very particular content using individual hashtags.

> Right, but it was never meant to specialize in user-directed filtering and discovery

We agree:

> Instagram's ontology is designed to not make this possible, because everything about it is meant to facilitate tube-feeding

So as the user-directed ontology component of the IG ecosystem, it's trash, and I will die mad about it.

> every social context that gets 'cool' eventually gets LCDed into mediocrity (even HN)

I want to distinguish my complaint here from a gatekeeper's lament. This isn't about, say, nu-goth showing up and coming to eat the fashion scene, as Tumblr did indeed instigate. I'm talking about pictures being tagged with "goth" that no one looking for "goth" content would consider relevant (a closeup of coral red lipstick, say), but which would show up as "top #goth posts" because they were highly engaged with in the other contexts people saw them. (To a lesser extent, also Etsy spam in #gothgoth, but this is more what you're talking about)

I just want to say that your comments made me laugh super hard, but truly get to the heart of why a blunt engagement metric is a trash metric. (I also study gatekeeping and intermediaries, so it’s interesting from that angle also.)

Some day some master's student is going to write a thesis on the history of the #witchcraft tag on Tumblr and how it functioned as a social space and I only hope my username does not appear. :D

We agree on the intended use of the app being more about passive consumption of a feed of interesting content, not active, thorough exploration of particular topics.

I'm not sure why you think the whole hashtag system was trash because of this one aspect. Most of the evidence I see is contrary to your statements.

To reiterate, the strength of the tagging system is the relationship between the tags you interact with, not any given individual tag. Also, individual tags are not exclusively literal descriptors for their content. They may be, or they may be used like: 'people who like #jenniferlawrence will also like #...'

> active, thorough exploration of particular topics

> discovery to the level you are describing

It is not a thorough or sophisticated search-engine-type task to "use a celebrity's name as a tag to look at pictures of a celebrity".

I'm whinging about this because I enjoy Instagram for (yes, largely passive) consumption of the kinds of things people post on Instagram, and I think we should be honest about the brave new world we're living in. Part of that means owning up to how the sophisticated engagement-driving media have made certain aspects of interacting with each other's stuff worse, not better.

> Also, individual tags are not exclusively literal descriptors for their content. They may be, or they may be used like: 'people who like #jenniferlawrence will also like #...'

This contradicts the original

> All that to say there was a lot to their system and it worked because users became aware that they were rewarded for using the most relevant tags. Using irrelevant tags was punished.

unless we are using a definition of "relevant" entirely gleaned from a tube-feeding targeted-advertising mindset, where my intent and desire at any given time is immaterial, the actual content of any particular post is immaterial, and queries can only be made via the general profile of my eyeballs' interaction patterns.

(This works really well for paid ads! It really really does, for everyone involved! The Instagram Etsy and Amazon ads I see are extremely good at showing me niche-ass things I want to buy, far better than the content recommended to me on Etsy and Amazon.)

Tags are used as vehicles for content promotion, and the race to the bottom has rendered them so useless that they're not even good for that (at one time they were valuable to catch people browsing tags, but who's going to browse tags when they're this incoherent) so it's not surprising the app has deemphasized them. I think that product choice is pretty strong evidence for my take that they weren't working well, tbh.

On TikTok, the even-more-cursed Instagram Of (Ten Minutes In) The Future, accurate tags come off as cringey and desperate, and are used either as blatantly ironic jokes (the sponsored ones, especially) or Skinner-pigeon dancing-to-attract-the-algorithm (lookin' at you, #fyp). You can see why: if people come to associate tagging with thirsty self-promoting behavior, that's not a good look on your #aesthetic videos. In many contexts, evidence of the social media hustle you're involved in is disqualifying; it's easier to feign naivete, "oh damn this blew up", if your effortful video editing was sent down to people's tubes without your particular direction.

None of this was inevitable, because deciding to rank by engagement bucketed by "users who have a general profile of liking X" instead of "users who are looking at the X tag" is not an inevitable choice, even if the knock-on effects of the incentives it creates are inevitable after that choice.

> unless we are using a definition of "relevant" ..

Yea, this is the crux. Using a particular tag in your post is a prediction that people who appreciate the content clustered around that tag will like your post. A good portion of the time that is based on accuracy, but sometimes it isn't, as I already explained (especially in the most used tags). Instagram was never set up to have the specificity of a search engine.

I agree tags are less useful now, because the feedback loop is broken after Facebook's changes, but it wasn't always like this. Every point I've made is about a particular period in time, not the present.

>To your second point, every social context that gets 'cool' eventually gets LCDed into mediocrity (even HN).

The Law of Shitty Clickthroughs by Andrew Chen illustrates and details this very well: https://andrewchen.com/the-law-of-shitty-clickthroughs/

> And if you're wondering if this was discovered and subsequently chased out to #gothgothgoth, You Have Understood The Problem

Excellently put.

To me, it seems that the problem is that tagging is adversarial in any system where spamming can be rewarded.

Yeah – I'm not even sure we can call it adversarial where there seems to be no signal for relevancy. I'm sure there are "good" reasons for not having one (where by "good" I mean "profitable") but oof it sucks (esp. relative to the much simpler Web2.0 organization of e.g. Reddit)

It's funny because one of the most common arguments I see inside Reddit communities is irrelevant posts getting upvoted in a subreddit, and wanting the mods to step in - or not. People just see a post on their frontpage that they like and they upvote it, rarely stopping to look at what subreddit it's from and whether the post is a good fit for that subreddit.

I suppose it probably still works better than Instagram.

Fully agreed – it at least seems like you could tune for this in your less deterministic algorithmic sorting, too. There's no Reddit Constitution that says an upvote is an upvote is an upvote – so you could weight upvotes issued from people viewing the subreddits' pages more strongly than those from people viewing their home feeds, upvotes from subscribers' home feeds more strongly than nonsubscribers' /all feeds, etc. etc., downvotes mutatis mutandis
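
A minimal sketch of that weighting idea, in Python. The comment only gives an ordering of contexts, so the numeric weights and the context names below are assumptions picked for illustration, not anything Reddit does.

```python
# Hypothetical weights: a vote cast on the subreddit's own page counts most,
# one from a subscriber's home feed less, one from a non-subscriber's /all feed least.
CONTEXT_WEIGHTS = {
    "subreddit_page": 1.0,
    "subscriber_home": 0.5,
    "all_feed": 0.25,
}


def weighted_score(votes: list[tuple[int, str]]) -> float:
    """Sum votes, each given as (direction, context) with direction +1 or -1."""
    return sum(direction * CONTEXT_WEIGHTS[context] for direction, context in votes)


votes = [
    (+1, "subreddit_page"),
    (+1, "all_feed"),
    (-1, "subscriber_home"),
]
score = weighted_score(votes)
```

Downvotes use the same table with the sign flipped, which is the "mutatis mutandis" part.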

What the actual fuck, NONE of the nine pictures are of JL. Although all nine posts tagged every living actress, celebrity and their mom.

Please write that big post! Sounds interesting

I second this! Sounds like an interesting read! :)

I third this!

I'll add that if you do write it, email us a heads-up at hn@ycombinator.com so we can consider putting it in the second-chance pool (https://news.ycombinator.com/item?id=26998309).

I'd also be interested in reading about that - in particular are things like 'post count' computed and stored (i.e. not normalised)? How do you cope with keeping such analytics current as the source changes?

(When oh when will postgres get automatic and incremental materialised views?)

I noticed this as a regular user. I’m curious to see what it is like now.

> in favor of other methods for classification and discovery

Aka the toxic engagement trap. I miss the days when unfathomable AI didn't dictate what's popular.

> could write up a big post about it

Please DO.

I adore tagging systems and have worked on them in several different applications and implementations, but there are always pitfalls and trade offs, and it’s possible to bury yourself

Nowadays I nearly always store the assigned tags as an integer array column in Postgres, then use the intarray extension to handle the arbitrary boolean expression searches like “((1|2)&(3)&(!5))”. I still have a tags table that stores all the metadata / hierarchy / rules, but for performance I don’t use a join table. This has solved most of my problems. Supertags just expand to OR statements when I generate the expression. Performance has been excellent even with large tables thanks to pg indexing.
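
To make the expression-building step concrete, here is a small Python sketch of expanding supertags into OR groups before handing the string to intarray. The `SUPERTAGS` mapping and the function names are hypothetical. The `@@` operator and the `query_int` type come from the intarray extension, and the `items.tags` column name is borrowed from a later comment in this thread.

```python
# Hypothetical supertag table: supertag id -> ids of the tags it stands for.
SUPERTAGS = {100: [1, 2]}


def term(tag_id: int) -> str:
    """Render one tag id, expanding a supertag into a parenthesized OR group."""
    members = SUPERTAGS.get(tag_id)
    if members is None:
        return str(tag_id)
    return "(" + "|".join(str(m) for m in members) + ")"


def build_query(required: list[int], excluded: list[int]) -> str:
    """AND together the required tags and the negated excluded tags."""
    parts = [term(t) for t in required] + [f"!{term(t)}" for t in excluded]
    return "&".join(parts)


expr = build_query(required=[100, 3], excluded=[5])
# expr can then be bound as a parameter in something like:
#   SELECT id FROM items WHERE tags @@ %s::query_int
```

Passing the expression as a bound parameter rather than interpolating it keeps the query safe even if tag ids ever come from user input.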

Do you index arrays? What index type is that? Any tips?

I’ve used array column in PG before, haven’t indexed arrays though.

AFAIK, postgres first got its reputation of high performance because of array indexes.

People usually go with GIN indexes, which can be used for the contains, overlaps, and equals comparisons.

Yes, it’s explained in the intarray doc here. GiST is the one I use, but as it states GIN should be faster on reads. I haven’t really thought about that in many years, I should run some perf tests.


Read perf would be good in isolation, sure; but what about the lock contention over those rows due to the frequent writes to the tags column? Usually "taggings" have a much higher rate-of-change than the objects they tag. If you're not at least keeping a thing_taggings has-one table (thing bigserial, tags intarray) separate from your things table, I could see this degrading performance for any query that wants to touch the table.

I just made tags nosql, it's stored as json data in pg with a user set to each tag.

I think the relation was that each user could have multiple posts that each contain multiple tags.

Each tag is global and stores its users, with the user ID as the key. Or something approximating that.

Mainly to not need insane queries for many to many relationships. O(1) baby.

Can you search by tag expressions like “((1|2)&(!3))” with your method? I’m not understanding it completely. I often use json_data but I didn’t consider tagging methods because of the overhead it has.

The tradeoff here is that you lose the foreign key constraint, correct? So if you delete a tag, there is no way for the database to automatically remove all references to it. Or is there some way to do this now?

> So if you delete a tag, there is no way for the database to automatically remove all references to it.

I'm not sure about implementation/support in Postgres specifically, but in the general case of a column of tag bitfields, the database could easily maintain a global popcount (ie, "number of rows with this tag") and soft-delete flag for each tag, and clear any soft-deleted tags on (possibly-only-write-)access. When a soft-deleted tag reaches popcount == zero, it counts as garbage collected and can be reused for a new tag.
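
A minimal in-memory sketch of that soft-delete scheme, in Python. Plain sets stand in for the per-row tag bitfields, and the `TagRegistry` class and its method names are made up for this example; it is not a description of anything Postgres provides.

```python
class TagRegistry:
    """Tracks per-tag row counts and lazily clears soft-deleted tags."""

    def __init__(self) -> None:
        self.popcount: dict[int, int] = {}  # tag id -> number of rows carrying it
        self.soft_deleted: set[int] = set()
        self.free_ids: list[int] = []       # garbage-collected ids, ready for reuse

    def attach(self, row_tags: set[int], tag_id: int) -> None:
        """Add a live tag to a row and bump its count."""
        if tag_id in self.soft_deleted:
            raise ValueError(f"tag {tag_id} is deleted")
        if tag_id not in row_tags:
            row_tags.add(tag_id)
            self.popcount[tag_id] = self.popcount.get(tag_id, 0) + 1

    def soft_delete(self, tag_id: int) -> None:
        """Flag a tag as deleted without touching any rows yet."""
        self.soft_deleted.add(tag_id)
        self._collect(tag_id)

    def touch(self, row_tags: set[int]) -> None:
        """On access, strip soft-deleted tags from the row."""
        for tag_id in row_tags & self.soft_deleted:
            row_tags.discard(tag_id)
            self.popcount[tag_id] -= 1
            self._collect(tag_id)

    def _collect(self, tag_id: int) -> None:
        """Free the id once no row carries the soft-deleted tag."""
        if tag_id in self.soft_deleted and self.popcount.get(tag_id, 0) == 0:
            self.soft_deleted.discard(tag_id)
            self.popcount.pop(tag_id, None)
            self.free_ids.append(tag_id)


registry = TagRegistry()
row_a: set[int] = set()
row_b: set[int] = set()
registry.attach(row_a, 7)
registry.attach(row_b, 7)
registry.soft_delete(7)  # still on two rows, so not collected yet
registry.touch(row_a)    # cleared from row_a on access
```

Queries would additionally need to ignore soft-deleted ids until every row has been touched.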

But then again, the reasons for _deleting_ a tag are very few. What really happens is that you want your searches on that tag to just return nothing, and that's more of an application level responsibility. Your #absolutely_vile_tag_that_would_get_you_in_jail just enters a blacklist, and you're done.

Yes, but that’s easily handled with a trigger. My first implementation actually had a regular join table of items_tags which used a trigger to update the items.tags intarray. Wasn’t super performant but let us use our existing templates for 1-many to implement the UI.

Nowadays you can just use a tagging component with integrated search for the UI.

Right. More like nosql FKs.

How high is the business risk if you have a random tag with no name? Skip its display in the UI.

Would you mind sharing a simple example that demonstrates this? Sounds great!

I worked with the Wikipedia category system a few years ago, and you could see the problems with hierarchical tagging systems right in action back then. (Though it may have gotten better in the meantime)

The system appeared simple: There were just two relations, "Article A is a member of category B" and "Category X is a subcategory of category Y".

However, in practice, the community was using this system to represent a whole host of wildly different relationships between items, often with different implications for what a category actually applied to.

E.g., if A has a subcategory B, this could mean one of several things: B might be an additional constraint on the items in A ("American writers" -> "19th century American writers"), the things in B might be more specific than the things in A: ("Writers" -> "Novelists"), A might apply to the concept B, not the things in B ("Occupations" -> "Writers") or A might refer to the category B ("Categories with more than 100 entries" -> "Writers") and on and on...

Of course those different aspects could even be combined. E.g. "Categories with more than 100 entries" might have a child "Categories with more than 100 entries in need of review", which represents a constraint but might itself contain less than 100 entries...

The basic question "Is item X in category Y" becomes impossible to answer generally, because there is no clear indication whether a category applies only to its direct children, to all of its descendants, or only to the subcategories themselves.

I'm sure there are sophisticated ontological systems which would allow users to specify all those different relationships separately. I'm also pretty sure that users would become sloppy after a short time or would disagree which particular relationship to use in a particular situation...

There are only two kinds of relation here, “subset of” and “instance of” (aka “element of”, type-token).

The category-category relations are intended to always be a subset relation. The article-category relations are intended to always be an instance-of relation.

- “19th century American writers” is a subset of “American writers”.

=> Both are a category, so no problem.

- “Novelists” is a subset of “Writers”.

=> Both are a category, so no problem.

- “Writer(s)” is an instance of “Occupation”.

=> Here the problem is that “Writers” is a category. It would be okay if it was an article “Writer (occupation)”.

- “Writers” is an instance of “Categories with more than 100 entries".

=> Here, again, the problem is that “Writers” is a category, and having an instance-of relation between categories is not an intended/supported use-case.

This could conceivably be solved by supporting an instance-of relation between categories, in addition to the existing subset (subcategory) relation. It could be called a meta-category relation. Then you could have the category of occupation categories.

Another way to put this is that categories have to be typed: a category contains either (just) articles, or it contains (just) categories. Subcategories then must match the type of their supercategories and correspondingly must contain either articles or categories.

Basically, Wikipedia’s type system is not expressive enough to allow everything people would want to express in it.

> Another way to put this is that categories have to be typed: a category contains either (just) articles, or it contains (just) categories.

The problem with using such a strict type system as a tagging-system is exacerbated by the cases where:

1. Someone adds an article to a category (they tag the article), but then want to add a subcategory to that category. Now the category contains both articles and subcategories (violating the constraint). So the user would have to move all its articles into its subcategories for the constraint to be satisfied. This can be an enormous amount of work (needing to invent new subcategories for all the articles in the category not fitting into the specific subcategory they had in mind).

2. Someone wants to add an article, but only has a vague idea of a super-category in which it would fit. Now they have to exhaustively crawl/navigate the tree of sub-categories until they find the leaf sub-categories that contain only articles, which are the only places they could put it. The input barrier thus becomes high (which is antithetical to how people expect to use tags).

Re 1: I think you misunderstood. The restriction of either articles or categories is for the instance-of relation. (The subset-of relation is naturally restricted to categories to begin with.) If a category contains articles, it can’t also contain categories as elements (instance-of relation). But it can contain subcategories as subsets.

Analogy: The set of real numbers has the set of natural numbers as a subset, but it doesn’t have the set of natural numbers as an element, because the set of natural numbers is not a real number — the individual natural numbers are.

Likewise, the category “Occupations” may contain articles describing occupations, and it may have a subcategory “Clerical occupations” (subset-of relation), but it cannot contain the category “Writers” as an element (in an instance-of relation as with the articles), because writers are not a subset of occupations.

Furthermore, as an example of a meta-category, the category “Categories with more than 100 entries” may contain the category “Occupations” as an element (but not as a subcategory!), and hence cannot contain any articles as elements.

The element type of the category “Categories with more than 100 entries” is categories, and the element type of “Occupations” is articles. The point is that you can’t mix both types of elements within the same category. This is independent from subcategories. Any category can have subcategories, the only condition being that the subcategories must have the same element type as the supercategory.

The idea is that a category can have both subset-of and instance-of relations at the same time (and each relation needs to be marked as such in the system), but the instance-of relation is restricted to be either articles or categories, but not both.

Re 2: I believe that problem goes away, given the above.
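
The typed-category rule above can be sketched in a few lines of Python. This is only an illustration of the proposal, not how MediaWiki works; the `Category` class, its field names, and the use of plain strings for articles are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    name: str
    element_type: str  # "article" or "category": what it may hold as elements
    elements: list = field(default_factory=list)       # instance-of members
    subcategories: list = field(default_factory=list)  # subset-of children

    def add_element(self, item) -> None:
        """Instance-of: the item must match this category's element type."""
        kind = "category" if isinstance(item, Category) else "article"
        if kind != self.element_type:
            raise TypeError(
                f"{self.name!r} expects element type {self.element_type!r}, got {kind!r}"
            )
        self.elements.append(item)

    def add_subcategory(self, sub: "Category") -> None:
        """Subset-of: the subcategory must share this category's element type."""
        if sub.element_type != self.element_type:
            raise TypeError(f"{sub.name!r} and {self.name!r} have different element types")
        self.subcategories.append(sub)


occupations = Category("Occupations", "article")
occupations.add_element("Writer (occupation)")  # an article, as a plain string
occupations.add_subcategory(Category("Clerical occupations", "article"))

meta = Category("Categories with more than 100 entries", "category")
meta.add_element(occupations)  # a category used as an element

writers = Category("Writers", "article")
try:
    occupations.add_element(writers)  # rejected: a category in an article category
except TypeError as err:
    rejected = str(err)
```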

The issue is that the system has nodes and edges, but no concept of distinct graphs. That leaves you trying to fit all notable human knowledge onto a single graph, which is non-optimal. Whether it’s also a DAG, tree, or something else doesn’t even matter.

Ontologies are like languages. There is no correct one. What matters is how good a fit it is for the problem at hand and that you’re all using the same one! If half the people are using Italian and half Spanish, it’s going to be a disaster. I wouldn’t use APL to write a UI and I wouldn’t architect a computer system in Shipibo.

Similarly, if I’m bird watching, “Birds of Northern California” is very useful. Organizing them by genus is less useful to me in that moment, but it’s not wrong.

I don't think you necessarily need multiple graphs; just labeled edges.

You just need some way to interact with it as multiple graphs. Some variation of labeled edges is probably the best.

In your examples, would the edges be like:

tagged_with_italian_tag vs. tagged_with_spanish_tag ?

tagged_with_genus_tag vs. tagged_with_geo_tag ?

Would that afford such multiple graphs?

There are a bunch of ways to do it. You could use the Entity-Attribute-Value model[0]. Then it's (California Quail, Region, California), (California Quail, Genus, Callipepla). You could do relational tables, with a through table for each taxonomy. Or, one through table with tags. That's like your comment.
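
A small Python sketch of the Entity-Attribute-Value option, showing how one list of triples can be read as a separate graph per attribute. The first two rows come from the comment above; the Gambel's Quail rows and the `graph_for` helper are additions for illustration.

```python
from collections import defaultdict

# (entity, attribute, value) triples; the attribute acts as the edge label.
triples = [
    ("California Quail", "Region", "California"),
    ("California Quail", "Genus", "Callipepla"),
    ("Gambel's Quail", "Region", "Arizona"),    # illustrative extra row
    ("Gambel's Quail", "Genus", "Callipepla"),  # illustrative extra row
]


def graph_for(attribute: str) -> dict[str, set[str]]:
    """Project the triples onto one attribute: value -> entities tagged with it."""
    graph: dict[str, set[str]] = defaultdict(set)
    for entity, attr, value in triples:
        if attr == attribute:
            graph[value].add(entity)
    return dict(graph)


by_genus = graph_for("Genus")
by_region = graph_for("Region")
```

Each call gives one taxonomy's view of the same data, which is the "interact with it as multiple graphs" idea without storing multiple graphs.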


Isn't this literally just saying we need another layer of categorization on top of the categorization layer?

It’s saying you need support for multiple types of categories. You could use the same system to organize itself. No need for a meta layer.

Perhaps "adjacent to" rather than "on top of"? I've started looking at this kind of problem in terms of DB queries or set relations. Even "organization" can be a set relation if there are the right bits of metadata in place.

The problem might not be with hierarchical tagging systems, but with the specific hierarchical tagging system they use at Wikipedia.

Imagine another system with the following categories:

* People:ByOccupation:Creative:Writers

* Time:CommonEra:ByCentury:19

* Location:Earth:Americas:NorthAmerica:USA

In this scheme of things, e.g. Mark Twain would be tagged with all three. "19th century American writers" (which includes Mark Twain) would not be a category but a saved search. (Other saved searches — which would also include Mark Twain — would be "19th century people from Americas" or "Stuff from Planet Earth").
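
The saved-search idea can be sketched as prefix matching over colon-separated tag paths. A minimal Python version follows; the function names are made up, the Mark Twain tags are the three listed above, and the second item is a placeholder added so the searches have something to exclude.

```python
def matches(tag: str, prefix: str) -> bool:
    """True if the tag equals the prefix or sits below it in the hierarchy."""
    return tag == prefix or tag.startswith(prefix + ":")


def saved_search(items: dict[str, set[str]], prefixes: list[str]) -> list[str]:
    """Return the items that have at least one tag under every given prefix."""
    return sorted(
        name
        for name, tags in items.items()
        if all(any(matches(tag, prefix) for tag in tags) for prefix in prefixes)
    )


items = {
    "Mark Twain": {
        "People:ByOccupation:Creative:Writers",
        "Time:CommonEra:ByCentury:19",
        "Location:Earth:Americas:NorthAmerica:USA",
    },
    # Placeholder entry, not from the thread.
    "Placeholder painter": {
        "People:ByOccupation:Creative:Painters",
        "Time:CommonEra:ByCentury:20",
        "Location:Earth:Europe",
    },
}

american_writers_19c = saved_search(items, [
    "People:ByOccupation:Creative:Writers",
    "Time:CommonEra:ByCentury:19",
    "Location:Earth:Americas:NorthAmerica:USA",
])
people_from_americas_19c = saved_search(items, [
    "People",
    "Time:CommonEra:ByCentury:19",
    "Location:Earth:Americas",
])
stuff_from_earth = saved_search(items, ["Location:Earth"])
```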

Suppose you have someone who did a bunch of writing in America, then moved to Europe and became famous as an inventor there. Under your proposal, this person has both Location:Europe and Location:USA, and both Occupation:Writer and Occupation:Inventor. They therefore show up for queries for European writers and American inventors, neither of which was intended; I bet we can come up with situations where the false positives are even worse. The presence of those tags has to be interpreted in light of each other.

If you do this naively I think it's pretty clear you've either sacrificed expressivity or made the system a LOT more complicated/harder to understand. At best you end up with some kind of product structure in (what is no longer just) the set of tags on an article. You can think of explicating an implicit product structure in joining "American" and "Writer" in the same object. But I think if you've started talking about compound tags, you're really talking about something other than a tagging system.

I think the only feature you need to express this is to be able to limit a tag to the context of another tag. It's slightly different than a compound tag because each tag can still be used independently.

I experimented with a system like this recently[0] that used two different tag notations that seemed to make the mixing more intuitive. I didn't have enough time to iterate on it further or build it more seriously, but I think there is potential in this area.

[0] https://youtu.be/bi3YkY7UKmM


To be more specific, your entity could be tagged as Location:Europe in the context of Occupation:Inventor, and Location:USA in the context of Occupation:Writer. I still think the entity should match queries for any of the 4 tags, it just shouldn't match a query {Occupation:Inventor/Location:USA}
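
One way to sketch that matching behavior in Python is to store tags in groups that belong together. This is an assumption about the representation, not the system from the video: groups here are unordered, so the sketch does not capture which tag is the context of which.

```python
# Each group holds tags that apply together for this entity.
entity = [
    frozenset({"Occupation:Inventor", "Location:Europe"}),
    frozenset({"Occupation:Writer", "Location:USA"}),
]


def matches_any(groups: list[frozenset[str]], tag: str) -> bool:
    """A single-tag query matches if the tag appears in any group."""
    return any(tag in group for group in groups)


def matches_together(groups: list[frozenset[str]], tags: list[str]) -> bool:
    """A combined query matches only if one group contains all the tags."""
    wanted = set(tags)
    return any(wanted <= group for group in groups)


american_inventor = matches_together(entity, ["Occupation:Inventor", "Location:USA"])
american_writer = matches_together(entity, ["Occupation:Writer", "Location:USA"])
```

All four tags still match on their own via `matches_any`, while the unintended "American inventor" combination does not.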

That introduces a dependency, or at least ordering, between the concepts of Location and Occupation that I'm not sure should exist, much less which direction it should point. It works for baking:skilled because skill level is inherently part of the property of being a baker, and skill is undefined without a thing to be skilled at, whereas someone can easily reside in a location with no occupation ({Location:USA/Occupation:Layabout}?) or have an occupation with unknown/unfixed location ({Occupation:Inventor/Location:Nomad}?).

And if you try to create a synthetic context to place both tags under, you get... compound tags, or close enough as makes no difference to me. :) I'd 1000x rather start from there, and special case it to return results for each tag individually, than start introducing spurious orderings or dependencies. (ed: maybe it would be clearer to say "composite tags", as in tags composed out of other tags?)

Perhaps I misunderstood the example but I thought the point was precisely that there is a dependency between Occupation and Location which individual tags cannot express...?

> I'm sure there are sophisticated ontological systems which would allow users to specify all those different relationships separately. I'm also pretty sure that users would become sloppy after a short time or would disagree which particular relationship to use in a particular situation

I think the problem is allowing users to freely tag, then. There should be easily accessed guidelines about how each tag should be used, and people who are constantly moving them, correcting them, and updating usage guidelines.

We need the ability to implement governance systems on top of web 2.0+ style content systems. People should be able to vote for representatives (with any number of voting systems), create committees, submit changes to be voted on, etc. Instead we usually work based on hierarchical dictatorships or imagined consensus. People need organizational management tools baked into software, because organization of information depends on it. Instead of proposing a new committee to come up with the schema of everything, better tools that enable users to build committees.

The fundamental tension in tagging systems, to me, is whether tagging is a feature the software offers to the user or a task the user performs to assist the software.

In the first case, you want freewheeling and tolerate ontological inconsistencies because you want to offer flexibility to users and will capture hard to quantify emergent benefits (some made up examples: "try the tag user233-favorite, I keep discovering awesome articles!", "the physicist-needed tag has highlighted a lot of misinformation surrounding quantum physics and relativity"). People use it to the extent it is useful.

The other way, with formal semantics, governance (which you made some very wise points about), etc allows the software to reply to queries like "19th-century + Missouri + humorists" in a performant and authoritative way. It's not really a feature so much as it is a way to enable other features.
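
To make the second case concrete, here is a minimal sketch of how a query like "19th-century + Missouri + humorists" can be answered with an inverted index and set intersection. The item names and tag assignments are made up for illustration, and a real system would sit on a database rather than in-memory dicts.

```python
from collections import defaultdict

# Hypothetical sample data: item names and tags are invented for this sketch.
ITEM_TAGS = {
    "article-1": {"19th-century", "missouri", "humorists"},
    "article-2": {"19th-century", "missouri", "politicians"},
    "article-3": {"20th-century", "missouri", "humorists"},
}


def build_index(item_tags):
    """Map each tag to the set of items that carry it."""
    index = defaultdict(set)
    for item, tags in item_tags.items():
        for tag in tags:
            index[tag].add(item)
    return index


def query_all(index, *tags):
    """Return the items that carry every requested tag."""
    if not tags:
        return set()
    # Start from the rarest tag so the working set stays small.
    ordered = sorted(tags, key=lambda t: len(index.get(t, ())))
    result = set(index.get(ordered[0], ()))
    for tag in ordered[1:]:
        result &= index.get(tag, set())
    return result


INDEX = build_index(ITEM_TAGS)
print(query_all(INDEX, "19th-century", "missouri", "humorists"))
```

This only gives authoritative answers if the tags themselves are applied consistently, which is exactly where the governance question comes in.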

I recently ran into the same kind of problem in Wikidata.

A typical problem is of the "light rail (Q1268865) is data visualization (Q6504956)" kind. This specific one has been fixed, but there are many similar ones.



Many comments below are hinting at - but not naming - triplestores. "A has relationship X with B". This is how wikidata works.

Learning about those and learning how to query wikidata just blew my mind.

If it isn't too much trouble, I would love to see an example of a particularly complex query that can be done on top of this...Pseudocode or just a text description is fine, it doesn't have to be precise syntax.

I was toying with the first world war at the time. You could query famous soldiers born in the same home town as the current president, for example.
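
For anyone who wants to see the shape of that query, here is a rough sketch over a toy in-memory triple set. This is not how Wikidata is actually queried (that would be SPARQL against its query service); every entity, predicate name, and fact below is a placeholder invented for illustration.

```python
# Toy triple store: (subject, predicate, object). All values are placeholders.
TRIPLES = {
    ("president_p", "holds_office", "president"),
    ("president_p", "born_in", "town_x"),
    ("soldier_a", "occupation", "soldier"),
    ("soldier_a", "born_in", "town_x"),
    ("soldier_b", "occupation", "soldier"),
    ("soldier_b", "born_in", "town_y"),
    ("writer_c", "occupation", "writer"),
    ("writer_c", "born_in", "town_x"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Yield triples that fit the pattern; None acts as a wildcard."""
    for s, p, o in triples:
        if subject is not None and s != subject:
            continue
        if predicate is not None and p != predicate:
            continue
        if obj is not None and o != obj:
            continue
        yield (s, p, o)


def soldiers_from_presidents_town(triples):
    """Join three patterns: who is president, where they were born, who else is from there."""
    presidents = {s for s, _, _ in match(triples, predicate="holds_office", obj="president")}
    towns = {
        o
        for pres in presidents
        for _, _, o in match(triples, subject=pres, predicate="born_in")
    }
    soldiers = {s for s, _, _ in match(triples, predicate="occupation", obj="soldier")}
    return {
        s
        for s in soldiers
        for _, _, town in match(triples, subject=s, predicate="born_in")
        if town in towns
    }


print(sorted(soldiers_from_presidents_town(TRIPLES)))
```

The point is that each "A has relationship X with B" fact is stored once, and complex questions fall out of joining simple patterns.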

> I'm sure there are sophisticated ontological systems which would allow users to specify all those different relationships separately. I'm also pretty sure that users would become sloppy after a short time or would disagree which particular relationship to use in a particular situation...

You might be interested in SNOMED CT, a way to describe medical concepts. It does something rather similar.

I encountered the same problem a few years ago and indeed realized that using categories to understand what type of article a thing was (person? subject? event?) was utterly useless, for the reasons you describe.

On the other hand, I discovered that infoboxes (the data in the top-right box on most pages) were generally extremely reliable, if frustrating to parse.

The infoboxes are created from a query to Wikidata, which you can query yourself! No scraping necessary! https://query.wikidata.org/

You'll want to learn SPARQL, but if you know SQL it's not so bad to pick up.

As far as I can tell, that is not the case, sadly.

Right now it appears that only 3,975 articles have infoboxes auto-generated from Wikidata. [1] The wikitext contains something like "{{Wikidata Infobox ...}}" instead of just "{{Infobox ...}}".

If you look up a popular article like Barack Obama [2], it's just a traditional hand-edited infobox. In fact, one of the first lines of data says "Vice President = Joe Biden", while the Wikidata entry for Barack Obama [3] doesn't reference Biden anywhere -- so not only is the Wikipedia infobox not generated from Wikidata, but Wikidata isn't pulling all the relevant info from Wikipedia either.

Back when I had been working on my project, I'd hoped Wikidata could be a solution but it was far too incomplete and information was regularly out of date. Perhaps (hopefully) it's better now, but it's clearly not being used to power infoboxes yet except in a tiny number of cases. (Which actually complicates things more now, since anybody parsing Wikipedia infoboxes now has to deal separately with the 3,975 ones that grab from Wikidata, since none of the actual data is copied over into the wikitext...)

[1] https://en.wikipedia.org/wiki/Category:Articles_with_infobox...

[2] https://en.wikipedia.org/wiki/Barack_Obama

[3] https://www.wikidata.org/wiki/Q76

For sure, thank you for the correction, I was under the impression its role was broader.

Wikidata is not a solution at all.

> "Categories with more than 100 entries" might have a child "Categories with more than 100 entries in need of review"

This should have been specified by 2 separate tags: "Categories with more than 100 entries" and "In need of review".

("Occupations" -> "Writers") seems wrong. Why would you do this? Same for ("Categories with more than 100 entries" -> "Writers").

This seems like trying to put a tag on a category entity instead of creating a tag hierarchy.

Those 2 should be stored using different relationship type mechanisms.

("categoryTag", <SourceTag>, <DestinationTag>)

ex: ("categoryTag", "Occupations", "Writers")


("parentTag", <tagName1>, <tagName2>)

ex: ("parentTag", "American writers", "19th century American writers")
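
A minimal sketch of what keeping the two relationship types apart could look like, assuming the tuples above are stored as (relation type, source, destination). The ("parentTag", "Writers", "American writers") row is an extra assumption added so the traversal has something to walk.

```python
from collections import defaultdict

# (relation_type, source, destination). The second row is an assumed example.
RELATIONS = [
    ("categoryTag", "Occupations", "Writers"),
    ("parentTag", "Writers", "American writers"),
    ("parentTag", "American writers", "19th century American writers"),
]


def edges_of_type(relations, relation_type):
    """Collect source -> [destinations] for a single relation type."""
    edges = defaultdict(list)
    for rel, src, dst in relations:
        if rel == relation_type:
            edges[src].append(dst)
    return edges


def descendants(relations, root, relation_type="parentTag"):
    """Walk only edges of one relation type, so meta relations never leak in."""
    edges = edges_of_type(relations, relation_type)
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


print(sorted(descendants(RELATIONS, "Writers")))
print(sorted(descendants(RELATIONS, "Occupations")))
```

Because the hierarchy walk never follows "categoryTag" edges, asking for everything under "Occupations" does not accidentally pull in every writer.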

Indeed. It's a bit like if a programming language was trying to represent base classes and meta classes using the same mechanism.

My guess is that no one realized the need for "meta" categories when the system was implemented, so later the existing hierarchy was simply co-opted instead of implementing a new functionality for that use case.

As long as the categories are only used by human editors and use is only within some small subcommunity, it can work quite well. The problem starts if you want to combine categories used by different communities or if you (or your program) lack the domain knowledge to understand which nodes represent "meta" categories.

As another poster said, the better approach to use Wikipedia data for automated processing is using infoboxes or the explicitly machine-readable Wikidata repository. The category system looks machine-readable at first glance but really isn't.

An exactly analogous problem exists in the Collections hierarchy at the Internet Archive, of uploaded/digitized material (not the Wayback Machine web captures).

A single graph is applied locally with very different semantics; and absent a distinct tagging system, collection membership is sometimes used to mark material for treatment in some way.

Clearly the solution to all of this would be the category of all those categories that do not contain themselves.

This seems like one of those Eternal Problems that people, whether librarians, programmers, or hobbyists, stumble across, think they'll make headway in, then discover that they've really managed to progress just a few feet across a vast and hostile surface of landmines, pitfalls, and lures. Each "obvious" step (I'll have parent relations to define a context!) is only yet another bargain with the Devil, who laughs at your precautions.

Tagging’s pain is that it’s a problem easy enough that you can come up with plenty of ideas without prior knowledge. Its bane is that it is, in this sense, similar to bikeshedding: everyone can have an opinion about it. Fortunately, it’s only appealing to people who enjoy exploring problems.

I guess if you're really focused on it. I built a content tagging system for an old employer that would attempt to guess context based on keywords and associations but give the writer of the content the final say in what's actually being tagged.

Sure, I could have spent a thousand hours refining it, but the improvement would have been marginal and it still would need human interaction.

Was it used for content related to that particular business? I think as long as you have relatively limited variety, you can make something that works well enough.

The employer was a publisher. Writers would submit articles, the system would automatically tag them, and the system was good enough that usually the writer or editor would only need to weed out a false positive or two, which is about the same result as all these machine learning use cases with thousands of hours of dev time.

IIRC, terms were weighted, so that some of them needed to have more instances in the articles than others in order to be included in the final tag results.

Locations were one-offs, but specific topical items required more mentions because of false positives. And then there were things we called branch-offs. Branch-off tags occurred when a topic was mentioned enough to be a tag but there's another name that some segment of the population would know it by.

For example, the fish known as the white crappie is known in Louisiana as sac-au-lait, but people would also spell the word sacalait or sac-a-lait. So when we would get an article from an author in the Carolinas, they have no familiarity with a term that is the dominant one in Louisiana, but the software would add the tag anyway, which also exposed it to our site search.
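
A rough reconstruction of that idea, not the original system: each tag has a list of spellings that count as a mention and a minimum mention count, and an assigned tag can pull in "branch-off" names. The thresholds, rule tables, and function names here are all assumptions for illustration.

```python
import re

# Hypothetical rules: tag -> (spellings that count as a mention, minimum mentions).
TAG_RULES = {
    "louisiana": (["louisiana"], 1),  # locations are one-offs
    "white crappie": (
        ["white crappie", "sac-au-lait", "sacalait", "sac-a-lait"],
        2,  # topical terms need more mentions to avoid false positives
    ),
}

# Branch-offs: other names some readers know the topic by.
BRANCH_OFFS = {"white crappie": ["sac-au-lait"]}


def count_mentions(text, phrases):
    """Count whole-phrase, case-insensitive occurrences of any spelling."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
        for phrase in phrases
    )


def suggest_tags(text):
    """Suggest tags; the writer or editor still has the final say."""
    tags = []
    for tag, (phrases, minimum) in TAG_RULES.items():
        if count_mentions(text, phrases) >= minimum:
            tags.append(tag)
            tags.extend(BRANCH_OFFS.get(tag, []))
    return tags


article = (
    "The white crappie bite was strong this spring. "
    "Anglers in the Carolinas caught white crappie all morning."
)
print(suggest_tags(article))
```

The Carolinas article never uses the Louisiana name, but the branch-off still attaches it, so a site search for sac-au-lait finds the piece.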

Similarly, I think if you have a limited number of people doing the classification, you can also make a good shot at it.

> I can't find anything on how to design and implement anything more than the barebones basics of a system.

All of this stuff (horse/horses etc) is extensively discussed, maybe look under "taxonomy" or "ontology".

Now, whether you want to use any of those solutions or not or find the discussion useful or not... if you aren't finding anything about it at all, you aren't looking in the right places.

(I learned about it in librarian school)

To be fair to OP, the biggest hurdle in learning anything is knowing what questions to ask. When you don't have ontology as part of your vocabulary it's hard to find literature regarding, say, "comparison of ontologies for user-generated text content".

I suppose this flows back into library science, which is all about systematizing where to look for answers to questions, but I'm always astonished to find that there's oceans of literature and research in questions I haven't even thought to ask.

I think OP is referring to finding software-engineering related design discussions surrounding tagging systems, but yes, I’m sure there is a great depth of ontology material and librarian knowledge that could add to software system designs.

> (I learned about it in librarian school)

As the rest of us learned during the first tagging boom, the librarian is the natural apex predator of tagging.

I've been a librarian for more than 15 years and I can only speak from personal experience when I say that I am the apex predator of nothing. Every once in a while I will get it in my head to systematize my personal knowledge base with a controlled vocabulary and ontology and I just fall on my face. I really want it for some twisted reason, though.

Turns out LC subject headings -- for all their failures -- are pretty good.

Library of Congress classifications and subject headings (those are two separate things, for those unfamiliar) are not perfect, but they're pretty good, apply to a huge corpus, and to my mind most importantly, have evolved over a bit over a century under numerous circumstances, including an absolute explosion of published materials, substantial changes to understanding organisation and classification of knowledge, and an awareness of the social and cultural aspects of these (as well as the institutional bias that's often embodied within them). That is, they have evolved a change management process.

The Classifications are substantively hierarchical, though that's really an outgrowth of the fact that they're used to locate books within physical shelf space, in which a record must occupy an address (physical space), and given that the Library's settled on subject classification as its storage and retrieval basis, this maps what's effectively a folded linear structure (shelf space) onto the multidimensional subject classification. It's not ideal, but it's workable. And many of the quirks of the LoCCS come out of the fact that it addresses both the composition (comprehensive, but still US-centred) and process (shelving, search, and retrieval) of the Library.

The Subject Headings are not hierarchical, though they're structured. In particular, they're relational, with numerous subject headings referring to others. There's some parent-child relations (though the top level hierarchy is broad), numerous retired classifications, and many "use that instead of this" notes.

(I've made ... some progress ... at a structured parsing of the subject headings, though that work's been stranded Because Reasons.)

I've learned to accept that my personal life and knowledge management is going to be a mess. (I'm also a librarian). I just don't want to do more organizing when I get home. I do also feel the temptation to do it 'right' once in a while, but it never sticks. I'd wager a lot of it has to do with the fact that managing an ontology completely on your own just sucks.

> controlled vocabulary

Are you using English? English words can almost mean whatever you want them to. Perhaps design your own language that removes ambiguity. Probably requires a knowledge of philosophy to distinguish between say concrete and abstract, good luck.

Maybe start with correcting the ontology of: https://cuberule.com/ (which takes a geometric approach to defining food types).

Also perhaps decide whether you want to work top-down like a directory tree (or Dewey Decimal?): resulting in standard book classification issues. Or bottom up: resulting in conflicts and discrepancies - https://news.ycombinator.com/item?id=33254025

> Perhaps design your own language that removes ambiguity.

That's what a controlled vocabulary is. It's essentially a set of tags which are clearly defined. So instead of #horses being defined purely by the word "horses," it has an attached definition along the lines of, "The category 'horses' includes equine biology, sports relating to horses, the cultural history of horses, and all other topics involving real horses. Metaphorical horses such as saw horses are not included." Tags like #horse would be redirected to #horses, since there is only one canonical horse tag in the vocabulary.
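
As a small illustration of that, a controlled vocabulary can be sketched as a table of canonical terms with definitions plus a table of redirects. The names and the shortened definition text below are assumptions for the example, not any particular standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    name: str
    definition: str


# Canonical terms with scope notes (definition text abbreviated for the sketch).
VOCABULARY = {
    "horses": Term(
        "horses",
        "Equine biology, horse sports, the cultural history of horses, and "
        "other topics involving real horses. Metaphorical horses such as saw "
        "horses are not included.",
    ),
}

# Variant spellings that redirect to the one canonical tag.
REDIRECTS = {"horse": "horses"}


def canonicalize(raw_tag):
    """Resolve a user-typed tag to its canonical term, or fail loudly."""
    key = raw_tag.strip().lstrip("#").lower()
    key = REDIRECTS.get(key, key)
    if key not in VOCABULARY:
        raise KeyError(f"'{raw_tag}' is not in the controlled vocabulary")
    return VOCABULARY[key]


print(canonicalize("#horse").name)
```

Rejecting unknown tags outright is one design choice; a friendlier system might queue them for review by whoever maintains the vocabulary.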

Librarians are the people that we (technologists) should learn from. But all I see is programmers trying to invent things from first principles.

Eh, as the librarian who wrote the post you're replying to... I am actually ambivalent.

I wish librarianship as a field and industry were more what I'd fantasize it should/could be, but it's not so much.

How so?

What's missing / what would you remove and/or change?

The problem isn't knowing what the problem is (taxonomy and ontology), but how to implement it effectively.

I've seen enough of Hillel's posts over the years that I am fairly sure he is aware of taxonomy/ontology too.

Yeah, the content for learning has been around for over a decade or more.

Plus we have plenty of content for AI now


Can you link some resources about it then?

This is a good basic overview, goes beyond tagging/indexing, was the textbook in LIS501 Information Organization and Access at UIUC-GSLIS (now the iSchool at Illinois) in 2006:


Controlled vocab standards:


(this one is deprecated in favor of the one that follows)



The book we used in my thesaurus construction class at UIUC:


My favorite intro to semantic modeling with RDF/OWL/SPARQL:


Topic Maps are dead but I still have a soft spot for them:


I also recommend Heather Hedden, linked in jrockhind's post.

I could, but honestly I'd just be googling "taxonomy". But ok that's not entirely true, I know how to refine my search and recognize when something is what I'm thinking of, from some familiarity with the field.

(But if you want to look around, in addition to "taxonomy" and "ontology", other good terms are "information architecture" and "controlled vocabulary").

These are not things I have vetted, this is literally just me googling and taking a quick skim...





Or how about some textbooks:



This is German, but I found it very good:



* Cataloging the World

* Organising Knowledge. Taxonomies, Knowledge and Organisational Effectiveness

* The Intellectual Foundation of Information Organization

* The Oxford Guide to Library Research

I'm surprised I haven't seen more discussion of how tags are an entry point into plain-old data architecture. It should be obvious that by the time you're using tags for queries like "start-date: BEFORE 2022-03-01", you've created an inner-platform where you're building a plain-old relational database on top of your tags. Stop what you're doing and elevate "start date" out of tag-land and into a more structured representation with more application support.

Many enterprise databases add a memo field called "Comments" to almost every table. Clients very often end up coming up with their own guidelines about how to embed various information in the comments fields that the primary structure is missing. Looking over how clients are using the "comments" fields is a great way to discover new things that should be formally incorporated into the structure of your data architecture. Similarly with tags.

Look at tags as a starting point for adding a bit of loose structure to the frontiers of your data architecture. Mix them in with more structured data architecture. Be ready to "graduate" tags up to the next level of structure when it becomes appropriate. Stop worrying about how to make tagging perfect and embrace it for what it is: an easy way to get started on modeling the parts of the domain that you haven't spent a long time thinking about yet. A good way to understand how users want to use your system. Something you're always revisiting, cleaning up, and using as a source of inspiration. If you see some tags getting out of hand, don't try to improve your tagging system; instead take what those tags are trying to represent and add more structured fields and queries for them. This pipeline of less to more structure should be constantly playing out in a healthy, evolving system.
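
One way to spot tags that are ready to "graduate" is to look for tags that are really key/value pairs in disguise. This is just a sketch under the assumption that such tags use a "key: value" shape; the sample tags and the threshold are made up.

```python
from collections import Counter


def graduation_candidates(tags, min_count=2):
    """Return tag keys used as 'key: value' at least min_count times.

    These keys are hints that a structured field (with a real type and
    real queries) may be worth adding to the data model.
    """
    keys = Counter()
    for tag in tags:
        key, sep, value = tag.partition(":")
        if sep and key.strip() and value.strip():
            keys[key.strip().lower()] += 1
    return [key for key, count in keys.most_common() if count >= min_count]


# Hypothetical tags pulled from a system's tag table.
SAMPLE_TAGS = [
    "start-date: 2022-03-01",
    "start-date: 2021-11-15",
    "urgent",
    "owner: alice",
    "Start-Date: 2022-01-09",
]

print(graduation_candidates(SAMPLE_TAGS))
```

Here "start-date" shows up three times with a value attached, which suggests a proper date column would serve users better than a tag.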

Relatedly, comments fields are the bane of data compliance exercises. You think you’ve caught everywhere a customer’s information might be stored, and then at the last minute you find out support have been putting phone numbers in the comments field because they had nowhere else to put it.

sounds like a savior!

What a weird decision to store dates in tags, it looks like bad design. Fully agree with the primer on data architecture. I've seen people almost get to the point of writing a DSL over tags. Madness!

As someone who dabbled in adding basic tagging to a database recently, this take feels absolutely spot-on. I’ll draw on it when evolving our system forward.
