Identifiers are better off without meaning (varoa.net)
33 points by srvaroa 14 days ago | 34 comments



This problem seems easily solvable. Just use the last part for resolution. E-commerce shops use slug ids, which always carry meaning in URLs for SEO and UX purposes.

So there are constantly URLs that wouldn't resolve to anything if the full URL path were used.

But you just use the product id part only and then either redirect to the new categories or keep the same URL. Up to you.

I feel like all cases are really solvable and slug ids or ids with meaning are actually great.

So I am talking about URLs like /electronics/smartphones/apple-iphone-blue-32gb etc.

This is very good for UX and usability as well.


Wikipedia does pretty well with redirects and disambiguation pages. Email can be forwarded. It's more difficult if you don't have a redirect mechanism.

I don't think any of us would prefer a meaningless number to a username? But you can create a new account if you wish.

And of course we use meaningful names in source code.

One scheme I like for URLs is a meaningless id followed by an SEO string in a URL, where only the number is used and it redirects if the SEO string doesn't match.
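
A rough sketch of that id-plus-slug scheme, in Python. The `/p/<id>-<slug>` URL shape, the `PRODUCTS` table and the `resolve` function are all made up for illustration; only the numeric part is used for lookup, and a stale or missing slug gets a redirect to the canonical URL.

```python
import re

# Hypothetical catalogue keyed by an opaque numeric id; the slug is display-only.
PRODUCTS = {
    4217: "apple-iphone-blue-32gb",
}

# Assumed URL shape: /p/<numeric id>[-<seo slug>]
URL_PATTERN = re.compile(r"^/p/(\d+)(?:-([a-z0-9-]*))?$")


def resolve(path):
    """Return (status, canonical_url_or_id) for a request path."""
    match = URL_PATTERN.match(path)
    if match is None:
        return 404, None
    product_id = int(match.group(1))
    slug = PRODUCTS.get(product_id)
    if slug is None:
        return 404, None
    canonical = f"/p/{product_id}-{slug}"
    if path != canonical:
        # Stale or missing slug: the lookup used only the number, so redirect.
        return 301, canonical
    return 200, product_id
```

Renaming a product then only means changing the slug in one place; old links keep working because the number never changes.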


One situation where we made the decision to go against this advice was prefixing a type to our identifiers. This was only useful for humans to be able to distinguish the id types from each other so they wouldn't make mistakes when querying our service (passing a Foo id instead of a Bar id) since the id formats were otherwise identical.
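
For illustration, a minimal sketch of what a type prefix check could look like. The `foo`/`bar` prefixes, the underscore separator and both function names are assumptions; the comment doesn't say what the real format was.

```python
import uuid

# Hypothetical prefixes; the actual format used is not given in the comment.
PREFIXES = {"foo": "Foo", "bar": "Bar"}


def new_id(prefix):
    """Create an id whose prefix names its type and whose body is opaque."""
    if prefix not in PREFIXES:
        raise ValueError(f"unknown id prefix: {prefix!r}")
    return f"{prefix}_{uuid.uuid4().hex}"


def require_type(identifier, expected_prefix):
    """Reject an id of the wrong type before it reaches a query."""
    prefix, sep, body = identifier.partition("_")
    if not sep or prefix != expected_prefix or not body:
        raise ValueError(
            f"expected a {PREFIXES[expected_prefix]} id, got {identifier!r}"
        )
    return body
```

The prefix only says what kind of thing the id names, so passing a Foo id where a Bar id is expected fails loudly instead of silently matching nothing.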

We also made the decision that the geographical cluster doing the processing would be embedded in the id. This may have been a mistake; we never got to the point where it caused problems, but if there had been a major reorganization then yeah, it could have been problematic.


A product I worked on used a complex scheme of encoding some 4-5 ids into an integer by reserving certain integer ranges etc. It was super inflexible and predictably caused issues when their 16-bit ids weren't enough for new projects.

It was also very difficult to work with. I had to refactor the lovecraftian mess that generated them for an entire project of thousands of nodes. If they didn't come out exactly the same, a technician would have to spend days manually updating physical units to match the new ids. Thank God for unit tests.
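
To make the failure mode concrete, here is a small sketch of packing several ids into one integer. The four field names and the fixed 16-bit layout are assumptions for illustration; the real product reserved integer ranges in some way the comment doesn't spell out.

```python
# Assumed layout for illustration only: four 16-bit fields in one integer.
FIELDS = ("site", "line", "unit", "node")
BITS = 16
MASK = (1 << BITS) - 1


def pack(**values):
    """Pack one value per field into a single integer, first field highest."""
    packed = 0
    for name in FIELDS:
        value = values[name]
        if not 0 <= value <= MASK:
            # The inflexibility in question: a field outgrows its reserved range.
            raise OverflowError(f"{name}={value} does not fit in {BITS} bits")
        packed = (packed << BITS) | value
    return packed


def unpack(packed):
    """Recover the individual fields from a packed integer."""
    values = {}
    for name in reversed(FIELDS):
        values[name] = packed & MASK
        packed >>= BITS
    return values
```

Once any field needs more than 65535 values, every packed id already handed out depends on the old layout, which is why widening it later is so painful.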


I actually had the experience of using meaningful identifiers in a previous company. At that time it was really handy in a lot of day to day stuff, but I'm now thinking we might not have used them long enough to have run into the problems described here.


I feel often this is one of the major points in a trade-off: How likely is it that this choice will be thrown back in our face during the life of the application? Realistically many / most components get a rewrite soon enough. Or should have. And when they don't, it means they made back their cost many times over. After that, we don't have to actively look for trouble - we can at least modularize things somewhat so we don't have to rewrite the whole thing at once.


Personally, what I got from his story is that the management is out of touch with the technology. The only real concern I would have with meaningful identifiers is that the public gets to decipher what the underlying data is, or aspects of the data. It’s just a privacy question: you're trading privacy for the ability to avoid bottlenecks like reading a hard drive. As far as the management goes, well, you can simply add additional indexes to that same id or data. It’s really not a big deal.


We use a Snowflake ID (the ID scheme not the company) variant that is both meaningful and gets us IDs for the next 200 or so years before having to do anything and I don't expect that will ever be an issue.
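
For readers who haven't seen one, a minimal Snowflake-style generator might look like this. It assumes the commonly described 41/10/12 bit split and a made-up epoch; it is not the commenter's variant, whose layout isn't given.

```python
import time

# Assumed layout (the commonly described 41/10/12 split), not the variant above.
TIMESTAMP_BITS, WORKER_BITS, SEQUENCE_BITS = 41, 10, 12
EPOCH_MS = 1_700_000_000_000  # hypothetical custom epoch, in milliseconds


class SnowflakeGenerator:
    def __init__(self, worker_id):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self.last_ms = -1
        self.sequence = 0

    def next_id(self, now_ms=None):
        now_ms = int(time.time() * 1000) if now_ms is None else now_ms
        elapsed = now_ms - EPOCH_MS
        if not 0 <= elapsed < (1 << TIMESTAMP_BITS):
            raise OverflowError("timestamp outside the representable range")
        if now_ms < self.last_ms:
            raise RuntimeError("clock moved backwards")
        if now_ms == self.last_ms:
            self.sequence += 1
            if self.sequence >= (1 << SEQUENCE_BITS):
                raise RuntimeError("sequence exhausted for this millisecond")
        else:
            self.last_ms, self.sequence = now_ms, 0
        return (
            (elapsed << (WORKER_BITS + SEQUENCE_BITS))
            | (self.worker_id << SEQUENCE_BITS)
            | self.sequence
        )


def decode(snowflake):
    """Split an id back into (timestamp_ms, worker_id, sequence)."""
    sequence = snowflake & ((1 << SEQUENCE_BITS) - 1)
    worker = (snowflake >> SEQUENCE_BITS) & ((1 << WORKER_BITS) - 1)
    elapsed = snowflake >> (WORKER_BITS + SEQUENCE_BITS)
    return elapsed + EPOCH_MS, worker, sequence


def lifespan_years(timestamp_bits):
    """Years covered by a millisecond timestamp of the given width."""
    return (1 << timestamp_bits) / 1000 / 86400 / 365.25
```

With millisecond resolution, 41 timestamp bits cover roughly 69 years and 43 bits roughly 278, so a lifetime of around 200 years implies a wider timestamp or a coarser clock than this sketch uses.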


Rdio used a one-letter prefix for entity types. The company was bought & sold before it became an issue.

I agree with the author. But practically speaking, for every example where semantic IDs broke down, there are probably 10 examples where it worked out fine. Just a thought.


Discipline? Doesn't the fine article point it out themselves? Carefully chosen degree of meaning embedded in an identifier has many practical uses. After that what matters is the tradeoff: Choosing well what gets embedded or doesn't - and so what will require a lookup always or only occasionally? And additional headaches such as caching that lookup? Choosing well which libraries code-in the embedding - as opposed to all over the place? Not trusting that the embedding is "self-documenting"? Etc, etc.

The author provides several examples, and they were all chosen because there was a benefit. Although some were poorly chosen like a group name rather than a function (everyone remembers individuals' emails rather than functions). Whether it was worth it is easy to dispute after that choice caused a headache but is "sore loser bias": it doesn't account for all the worthwhile effective choices elsewhere. Nor does it account for the effectiveness it bought the project in the meantime.

All the way to the extreme: do we prefer blog URLs that mention at least some category, date and a few words of subject line. Or do we go with opaque machine generated ones? Several lines long for good measure? Does the fact that you will never have to rename the opaque ones justify inflicting them on the users? How likely are you to ever rename? Some people will still choose the opaque URL! Do they earn points with their readers?


> Nor does it account for the effectiveness it bought the project in the meantime.

This is a great point! I should have mentioned it in the article, but in the examples mentioned, part of the nuisance was that people involved could envision alternative solutions that would have also been effective without causing most of the long lasting trouble.

But, sometimes that's not possible (I do mention that they are not always avoidable).


I and many colleagues fought this war for decades in a major international electrical engineering company. I don't think we ever convinced more than a tiny fraction of people to stop embedding meaning in the names of things despite the obvious trouble it caused.

I wonder now that I am retired if we should perhaps just give up the fight and instead concentrate on mitigation.


I remember this lesson being taught in the early 1980s. I guess "in software we step on the feet of giants" is still true.


Is this an argument against strongly typed identifiers, or just an argument against adding too much meaning to the identifier?

I think if the identifier just adds what type of data it is identifying, the extra meaning will only become obsolete at the same time that data does. And the extra type information can help avoid/debug problems where the wrong type of id was used.

Compare with strongly typed ids/type branding:

- https://hw.leftium.com/#/item/39174998

- https://www.peakscale.com/strongly-typed-ids/

- https://andrewlock.net/using-strongly-typed-entity-ids-to-av...
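
As a rough illustration of the type-branding idea (not taken from any of the linked posts), the type can live in the program rather than in the id itself. `FooId`, `BarId` and `load_foo` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FooId:
    value: int


@dataclass(frozen=True)
class BarId:
    value: int


def load_foo(foo_id: FooId) -> str:
    """Accept only FooId; a BarId with the same number is rejected."""
    if not isinstance(foo_id, FooId):
        raise TypeError(f"expected FooId, got {type(foo_id).__name__}")
    return f"foo #{foo_id.value}"
```

Here the stored value stays an opaque number, and a static type checker can catch the mix-up before the runtime check ever fires.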


I think you could extend this argument to include the philosophy that databases should be glorified key-value stores, and semantics should only be handled by application code.

I think a key point in the article is that "models become obsolete faster than we’d like".

On the other hand, you could argue that it's simply necessary to put in the effort to keep the data model up to date with your current needs, through API versioning, database migrations, etc.

Honestly I'm not sure which approach is less messy; maybe it depends on the team.


No.

The difference is that IDs are part of the public API.

Your database schema (KV or otherwise) is not.


Generally the data model of the API resembles the data model of the DB schema. IDs that are generic numbers could be considered an aspect of a "flat" data model, where the application code assigns semantics to lists of data that are less strictly structured (and have less strict identifiers).


But what difference is there if you use KV stores or DBMS with strict schema? There is no consequence on your public interface.

That was the comment.


I am currently using identifiers that encode the source and type in them. Mainly did it to make sure various backoffice IDs never overlap, but added type since it is sometimes useful to identify object type just by looking at the ID. This article is a bit discouraging, but I have seen this also work well in practice, so I guess we shall see how it works out. In any case the problems mentioned are unicorn++ problems and I would be happy if we need to tackle them :)


My main advice would be to obfuscate those identifiers in a way that you can use the semantics in the very specific system you want/intend to, but prevent the rest of the organization / systems / external world from using them for their own purposes. At the end of the day it's about not turning them into a public API (which means you lose control over them).


Yes, I am using a 2-digit code for source and 2 digits for type (0-9).


Any semantic meaning in identifiers means all your data is completely denormalized. 1NF specifically requires storing all data as irreducible columns[0]. Most 1NF violations aren't actually all that consequential beyond not being able to do useful JOINs on the data in the column, but doing it in a candidate key column is a bottomless pit of update hazards.

[0] This means no comma-separated lists in strings, JSON columns, serialized PHP objects, and so on.


Right: When making the tradeoff, consider how painful it would be if you ever have to rename. All the way to your users refusing to even pay attention to a change in identifiers they have cached. How complex a translation layer might you have to add? It might be very manageable. It might be just plain dangerous.


The correct answer is to give identifiers to people, give them meaning, and then not use those identifiers in any way except for that.

In the backend use your own primary key, put a leading index on that other thing, it can even be shitty and long but if you have the first 50 chars indexed it will be fast as hell to lookup 99.999% of cases.
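
A small sketch of that split using SQLite from Python. The table and column names are invented, and the index here covers the whole column rather than a 50-character prefix, since prefix-length indexes are a feature of specific databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE product (
        id INTEGER PRIMARY KEY,      -- surrogate key, used for all references
        public_name TEXT NOT NULL    -- human-facing identifier, free to change
    );
    CREATE UNIQUE INDEX product_public_name ON product (public_name);
    CREATE TABLE order_line (
        product_id INTEGER NOT NULL REFERENCES product (id),
        quantity INTEGER NOT NULL
    );
    """
)
conn.execute(
    "INSERT INTO product (id, public_name) VALUES (1, 'apple-iphone-blue-32gb')"
)
conn.execute("INSERT INTO order_line (product_id, quantity) VALUES (1, 2)")


def lookup(public_name):
    """Translate the human-facing name to the internal key via the index."""
    row = conn.execute(
        "SELECT id FROM product WHERE public_name = ?", (public_name,)
    ).fetchone()
    return None if row is None else row[0]


# Renaming touches one row; nothing that references the surrogate key changes.
conn.execute("UPDATE product SET public_name = 'iphone-blue-32gb' WHERE id = 1")
```

After the rename, `order_line` still points at product 1, and only the name-to-id lookup sees the new value.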


Reminds me of SQL 101.

Natural keys serve as a great primary key when contextual meaning is important. A surrogate key is a key which does not have any contextual or business meaning.


Most of these issues can be avoided if you properly encapsulate the identifiers so that you do not have arbitrary third-parties trying to locally interpret meaning from the identifiers beyond identity. There are additional issues with collisions when using identifiers with semantic structure between systems.

If you are going to add semantic structure to an identifier, which is frequently useful and a good idea, best practice is usually to encrypt it before sending it to the external world. Encrypting a UUID-like structure is approximately free on modern computers.
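
To show the shape of the idea only, here is a toy keyed permutation over 128-bit values built as a small Feistel network on top of HMAC-SHA256 from the standard library. Everything in it (round count, function names) is an assumption for illustration, and it is not reviewed cryptography; a real system would presumably use a vetted block cipher from a maintained library.

```python
import hashlib
import hmac

ROUNDS = 4
HALF_BITS = 64
HALF_MASK = (1 << HALF_BITS) - 1


def _round_value(key, round_no, half):
    """Keyed round function: HMAC of the round number and one 64-bit half."""
    message = bytes([round_no]) + half.to_bytes(8, "big")
    digest = hmac.new(key, message, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")


def obfuscate(key, internal_id):
    """Map a 128-bit internal id to a 128-bit external id (toy, illustrative)."""
    if not 0 <= internal_id < (1 << (2 * HALF_BITS)):
        raise ValueError("id must fit in 128 bits")
    left, right = internal_id >> HALF_BITS, internal_id & HALF_MASK
    for round_no in range(ROUNDS):
        left, right = right, left ^ _round_value(key, round_no, right)
    return (left << HALF_BITS) | right


def reveal(key, external_id):
    """Invert obfuscate() by running the rounds in reverse order."""
    if not 0 <= external_id < (1 << (2 * HALF_BITS)):
        raise ValueError("id must fit in 128 bits")
    left, right = external_id >> HALF_BITS, external_id & HALF_MASK
    for round_no in reversed(range(ROUNDS)):
        left, right = right ^ _round_value(key, round_no, left), left
    return (left << HALF_BITS) | right
```

The external value is the same size as the internal one, so it can still be carried around as a UUID-shaped value, e.g. `uuid.UUID(int=obfuscate(key, internal.int))`.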


External users are just one of several problems.

The essence of a thing still changes after its meaningful identifier was assigned, yet it's a problem to change a thing's identifier.

The identifier should be nothing other than an identifier. Its properties are both infinite and mutable.

As a baby admin I had the genius idea to name servers after the state they were in once we started renting racks scattered around the country. Completely stupid. oh3 and pa6 etc continued to exist as entities long after they had been migrated or failed-over to their hot backups in other locations.

I'm thick, so I still didn't get it when I realized the state names were wrong, and so the next plan was hostnames/cnames based on roles instead of physical location. Exactly the same problem.

Super simple baby example, and applies the same to everything else. It wasn't only stupid for that one case and reasonable in other cases. It's the same wrong in all cases.


> If you are going to add semantic structure to an identifier, which is frequently useful and a good idea, best practice is usually to encrypt it before sending it to the external world. Encrypting a UUID-like structure is approximately free on modern computers.

Been there. Problem is, now you can’t rotate your keys without breaking users and everyone and everything needs access to this key. This means the key is going to leak sooner or later. Also, someone will inevitably create an endpoint that does not encrypt, nay obfuscate, the identifier. Might as well not have bothered to obfuscate the ID to begin with.


If you cannot change your semantic identifiers because you have shared them with the outside world, and they are not willing to follow your change, then encryption won’t make a difference, will it?


> best practice is usually to encrypt it before sending it to the external world

Yeah, although encryption is basically a way of hiding semantics from the external world (or, everyone else but whoever generates them) no?


Is it just me, or is this just a special case of indirection? For example, virtual function pointers versus hardcoding method calls as jumps directly to some code. Pointers and indirection in general give you a place where you can update associations, at the cost of one extra lookup on every access! In databases, you often have many-to-many join tables, just in case. Those tables give you total flexibility later to change associations or even introduce new ones. So instead of having your identifiers point directly to the thing, simply have such a table for one more lookup in between.
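
A tiny sketch of that extra lookup, with invented data: opaque ids name the things, and the human-facing names live in a separate table that can be repointed.

```python
# Hypothetical data: opaque ids for the things, meaningful names kept separately.
SERVERS = {
    "srv-7f3a": {"location": "OH", "role": "db"},
}
ALIASES = {
    "oh3": "srv-7f3a",
}


def resolve(name):
    """One extra lookup: alias -> opaque id -> record."""
    return SERVERS[ALIASES[name]]


def repoint(alias, new_id):
    """Change what a name refers to without touching the thing's identity."""
    if new_id not in SERVERS:
        raise KeyError(new_id)
    ALIASES[alias] = new_id
```

The cost is the additional hop on every access; the benefit is that a name going stale is a one-row update instead of a rename everywhere.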

At my company, we often ran into the same question over and over, namely, whether a convention should go one way or the other way. And in almost every case, we found it’s better to just make a general implementation with the options being available to be supplied at runtime, or in a configuration file. In other words, don’t choose, implement a more general solution. That has become the policy in our company.


I've heard the second paragraph described as:

Bad architects make decisions. Good architects make deciding harmless.


The only constant is that things will change; good architecture accommodates that by only making the necessary decisions to provide safety and usability, with the hooks and studs to build what's needed now.


> Addresses make notable examples. The “complex and idiosyncratic” Japanese address system reflects the organic growth of its urban areas. In British postal codes the final part can designate anything from a street to a flat depending on the amount of mail received by the premises.

Those systems are actually useful, you know. I have a friend who used to live at a place with an address like "Northern Living Block, 157". There were about 300 buildings in total in that block, and the numbers were assigned to the buildings pretty randomly, so it was impossible to navigate unless you were either given explicit directions or had a map with you.

The routing info has to live somewhere, you know. Pushing it into the IDs means that you don't need to "have an entry in some database saying '1|INFRA|HOST|12' RUNS '1|APM|APPLICATION|23'", you don't have to update/delete it as needed, you don't need to look it up and deal with caching issues, etc.



