The root of the issue here is that URLs are trying to be human-meaningful and machine-meaningful at the same time, but those requirements are fundamentally incompatible.
Humans work well with ambiguity and context. When your coworker says "Bob's birthday is this weekend", you know she means her husband Bob, not Bob from accounting who nobody likes. And you even prefer that system to having an unambiguous human identifier, even a friendly one like "Bob-4592-daring-weasel-horseradish".
Machines, on the other hand, hate ambiguity and context. Every bit of context is an extra bit of state that has to be stored somewhere, and now all your results are actually statistical guesses - how inelegant!
In the early days of computing, there was no separation between the internals of the machine and its interface. If you worked on a computer, you were as much the mechanic as the driver. We got used to usernames, filenames, and hostnames because they were a decent compromise; they were meaningful enough to humans, and unambiguous enough for machines, so we could use them as a kind of human-computer pidgin.
But we don't need them anymore, and they were never really very good at either job anyway. Google's (probably accidental) discovery was that we were using the web wrong. Everyone was building web directories and portals because they thought that URLs weren't discoverable, but the real problem was that they weren't usable. Search was the first human interface to the web.
So Google's going to kill the URL, Facebook's going to kill the username, and someone (apparently not Microsoft) is going to kill the filename. There'll be much wailing and gnashing of teeth from the old guard while it happens, but someday our grandchildren will grow up never having to memorise an arbitrary sequence of characters for a computer, and I think that's a future to look forward to.
> Machines, on the other hand, hate ambiguity and context.
When I ask my car to call my wife using only her first name, it suggests a list of 3 people who I'm not even sure how they got in my contacts list. Siri, on the other hand, gets it right every time with the exact same request. I wouldn't say my car hates ambiguity; rather, the programmers failed to bridge the human/machine interaction gap and meet the person halfway. ("If you want to talk to a computer, you have to think like one.")
I'd say it's programmers or deadlines that cause the extra work of accounting for ambiguous data to get skipped. It doesn't take a neural net to look at the recently called list for the most frequent or even most recently dialed [wife's first name].
One irony of your "Bob" example is that sometimes using someone's last name actually adds ambiguity: "It's Bob Lingendorfer's birthday this weekend!" ... "Who is Bob Lingendorfer? ... Ohhh, you mean your husband!".
Maybe it's not irony, it's just that people read a lot into data and might assume that all of it is relevant to the task at hand. My car kind of does the opposite and lazily stops at the first three "close enough" hits on my wife's name.
One thing that worries me about computers working with all that contextual information is that they then need to know all that information.
And since computing is so centralized these days, this means that whatever company made the software needs to know that context about you too.
There's something to be said for computers staying dumb. I'm okay with my co-workers knowing my social graph well enough to recognize my spouse's first name by context. I'm not okay with faceless corporations or governments having that same information.
Very good point. Can't disagree with you. I am ok, however, with a contacts system letting me specify a single name nickname that it prioritizes in matching / searches.
And I'm probably also ok with the computer knowing as much about me as my cellular provider does, since all that is probably hoovered up already. Why should Siri be dumber than the feds?
To take this further afield, it would be interesting to interact with a "smart" assistant that only learned from info likely to be accessible to third parties (law enforcement and/or aggregators), as a demonstration of the risk & power.
that’s funny. i have the exact inverse problem. when i ask siri to call my wife ( by her first name only ) it gives me a list of two to pick from, whereas my car does the opposite and calls my wife.
Why don't either of you just tell Siri who your wife is? You can say "My wife is" and her name, it will verify that it found the right one, and after that you can just say "call my wife", "sms my wife", etc. You can do the same thing with your boss ("call my boss") and various other tags.
Usernames and filenames are not just compromises, nor arbitrary sequences of characters.
Usernames reflect a fundamental human desire to create an alter ego free from the burden of their legal name and the socioeconomic context they're in. If Samuel Clemens were a blogger, he would write under the username @marktwain. Alonso Quixano might call himself @donquixote69. Anakin Skywalker would want to be known (and feared) as @darth_vader, not because his real name is unusable, but because he prefers to be called Darth Vader.
People have had titles and pseudonyms for ages. Usernames are a continuation of this tradition, not merely an invention of the 20th century. The global uniqueness requirement is of course rather silly, but enforcing a real-name policy on everyone is just as silly. If our grandchildren have no concept of usernames/handles/whatever, it might be more a sign of great oppression and loss of privacy than of technological progress.
Ditto for filenames. We programmers have a habit of using weird filenames that really do look like arbitrary sequences of characters, but most of the rest of the world just uses human-readable filenames like "Financial report 3Q 2017". Change a few numbers inside, and it's still "Financial report 3Q 2017", content-addressing be damned. The document might not be stored as a physical file in the future, but then again, have files ever been physical? Filenames are just labels that we stick on a logical chunk of information. Implementation details can differ, but the concept itself is not going anywhere as long as humans like to put stable labels on mutable things. (This, unfortunately, tends to escape notice when your concept art for a filename-less system only contains a handful of photographs with pretty thumbnails.)
> Filenames are just labels that we stick on a logical chunk of information. Implementation details can differ, but the concept itself is not going anywhere as long as humans like to put stable labels on mutable things.
This is the point that I think is completely lost on the author of the article, probably because of a focus on API design. It's a good thing that we can replace that dog-eared copy of Moby Dick with a shiny new one when the time comes, and our users don't need to change their URLs.
APIs are intended to be used primarily by machines, so it's fine for the URL structure to preference the predictable uniqueness of ids. However, for most URLs intended for use by humans, the forces are different.
A human-readable URL is not a pointer, it's a symlink.
All good points, however one thing missing is that humans also want to be able to refer specifically to "that dog-eared copy of Moby Dick". Facts like "that dog-eared copy of Moby Dick is missing page 34" or "that dog-eared copy of Moby Dick is actually a super valuable early edition" should not change their referent when the library gets a shiny new copy of the book.
And that's exactly how I read the article: both mutable and immutable references are nice to have for different use cases.
Yes, that's true abstractly. However, A, those sorts of references are much less common in web pages than in physical descriptions (at least in my estimation) (though they're very common in APIs), and B, those repointable references are not the same as a search - I want to uniquely refer to the current value of this pointer, while allowing the publisher to relink as appropriate.
The article reads as universal URL design advice, but I'd argue the points only really apply to APIs.
Not so sure about the "well" part there. I've encountered people who love to make guesses about the context (and others who actually wish you'd do the same). That, coupled with ambiguity, creates disasters ranging from ordering the wrong lunch to broken relationships.
I'd rather have humans take less pride in being ambiguous and make attempts to be as precise as possible.
There's a video of Dijkstra talking about Mozart and Beethoven as opposite poles -- the former wrote everything neat and right, the latter kept revising by gluing bits of paper in his scores. To further mark his position, Dijkstra at some point stopped typesetting his papers altogether and began writing them right the first time.
So there's this whole ambiguity aversion spectrum. Maybe it correlates to the autism spectrum, maybe it doesn't. It's arguably much more important. Even in mathematics you have Poincaré, a demigod among men who kept publishing papers with significant mistakes, while in the social sciences you have people like Niklas Luhmann and Bruno Latour who approach their subjects with utmost precision and dedication to detail.
I'm a more ambiguous, big-picture-even-in-small-problems thinker; and I thrive with more detail-oriented coworkers that walk me through the trees as I walk them through the forest. This has a lot to do with me being able to think in very ambiguous terms and narrow down as needed to interact or provide for the needs of others. Left to my own devices I come up with extremely abstract philosophical theories that are not useful at all! Conversely left to their own devices precision people become paperclip optimizers.
I want to speculate further into "edgy" territory: maybe the whole gender divide that seems to come up in psychometrics and the labor market and so on is really an ambiguity/precision divide. The evolution of technology has actually increased the value of ambiguity, as computers do much of the precision work for us -- maybe making tech "woman-friendly" is rather about identifying those big-picture/detail-oriented complementarities.
> The root of the issue here is that URLs are trying to be human-meaningful and machine-meaningful at the same time, but those requirements are fundamentally incompatible.
The TLDR of TFA is that an API can support both human-meaningful and machine-meaningful URLs.
Not really. TFA doesn't talk about what happens when the search fails. sgentle is talking about using human-meaningful urls as identifiers, which doesn't work when the search fails.
You have to remember that broad support for Japanese characters in URLs is a fairly recent development, and hasn't really caught on.
Advertisers want people to be able to type something in their native script in order to get to the product website. So while an English advertisement campaign might tell people to go directly to johnnysmattresses.com, a Japanese campaign couldn't do this, and would instead ask people to search for ジョニーの布団.
I don’t disagree with your point, but aren’t there languages that do take into account the context in which functions are called in addition to the parameters and namespaces?
Both R and Perl seem like ones where it wouldn't be extremely strange for the function to also look back to the context of the calling function. Then it could find out if the two parties had an affinity for this person, and whether it was a conversation about something like figuring out an excuse to miss a party or one like finding a gift, in order to work out which Bob was meant.
You could easily have a bijective encoding at a frontend proxy that translates between the above and e.g.
> "4592-13f7-de41-203a"
(i.e. discards the descriptive part of the slug, and then reverses the unique words back into their index-positions in the same static 64k-word dictionary used for generation, resulting in a regular UUID.)
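A minimal sketch of what that proxy-side translation could look like (the dictionary contents, the exact slug layout, and the function names here are all assumptions for illustration):

```typescript
// Sketch only: assumes a static, ordered 64k-word dictionary shared with
// whatever generated the friendly slugs in the first place.
const wordList: string[] = [/* 65,536 words in a fixed order */];
const wordIndex = new Map(wordList.map((w, i) => [w, i] as [string, number]));

// "Bob-4592-daring-weasel-horseradish" -> "4592-13f7-de41-203a"
function slugToId(slug: string): string {
  const groups: string[] = [];
  for (const part of slug.toLowerCase().split("-")) {
    if (/^[0-9a-f]{4}$/.test(part)) {
      groups.push(part); // already a 16-bit hex group
    } else if (wordIndex.has(part)) {
      // each dictionary word encodes 16 bits: its index, as 4 hex digits
      groups.push(wordIndex.get(part)!.toString(16).padStart(4, "0"));
    }
    // anything else ("Bob") is the descriptive part and gets discarded
  }
  return groups.join("-");
}

// The reverse direction, used when generating the friendly form.
function idToSlug(id: string, descriptive: string): string {
  const [first, ...rest] = id.split("-");
  const words = rest.map((g) => wordList[parseInt(g, 16)]);
  return [descriptive, first, ...words].join("-");
}
```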
Great post - I quite like the stackoverflow.com style of `stackoverflow.com/questions/<question-id>/<question-title>`, where <question-title> can be changed to anything, and the link still works.
This allows for easy URL readability, while also having a unique ID.
In the context of this post (the library example) that would look like
1) there are now an infinite number of URLs for every one of your pages that may end up separately stored on various services (mitigated for only some kinds of service if you redirect to correct),
2) if the title changes the URLs distributed are now permanently wrong as they stored part of the content (and if you redirect to correct, can lead to temporary loops due to caches),
3) the URL is now extremely long and since most users don't know if a given website does this weird "part of the URL is meaningless" thing there are tons of ways of manually sharing the URL that are now extremely laborious,
4) you have now made content that users think should somehow be "readable" but which doesn't even try to be canonical... so users who share the links will think "the person can read the URL, so I won't include more context" and the person receiving the links thinks "and the URL has the title, which I can trust more than what some random user adds".
The only website I have ever seen which I feel truly understands that people misuse and abuse title slugs and actively forces people to not use them is Hacker News (which truncates all URLs in a way I find glorious), which is why I am going to link to this question on Stack Exchange that will hopefully give you some better context "manually".
Many web browsers don't even show the URL anymore: the pretense that the URL should somehow be readable is increasingly difficult to defend. A URL should sometimes still be short and easy to type, and these title-slug URLs certainly don't have that property.
If anything, other critical properties of a URL are that they are permanent and canonical, and neither of these properties tend to be satisfied well by websites that go with title slugs, and while including the ID in there mitigates the problem it leaves it in some confusing middle-land where part of the URL has this property and part of it doesn't.
If you are going to insist upon doing this, how about doing it using a # on the page, so at least everyone had a chance to know that it is extra, random data that can be dropped from the URL without penalty and might not come from the website and so shouldn't be trusted?
(edit to add:) BTW, if you didn't know you could do this, Twitter is the most epic source of "part of the URL has no meaning" that I have ever run across, as almost no one realizes it due to where it is placed in the URL.
> 1) there are now an infinite number of URLs for every one of your pages that may end up separately stored on various services (mitigated for only some kinds of service if you redirect to correct)
No need to redirect, that's what canonical links are for:
I don't disagree in that I mostly dislike URL slugs, too. Except for some hub pages ("photos", "blog", etc.), a numerical ID is more than enough. But the combination of ordering and display modes and filtering can still amount to a huge number of combinations, so canonical links are still needed - to have as many options for the user as possible and allow them all to be bookmarked, but also give search engines a hint on what minor permutations they can ignore safely.
I wish search engines would completely ignore words in the URL. If it's not in the page (or the "metadata" of actual content on pages linking to it, and so on), screw the URL. If it is in the page (and the URL), you don't need the URL. As long as they are incentivized, we'll have fugly URL schemes.
1) and 2) are not a problem if the server accepts any value for the title token (which is the case on stack exchange)
3) is not a problem for hyperlinks (url not visible) or even for direct links (not a burdensome length), and if you care about a short url an even shorter form is available
4) seems like a feature? the person sending the link will only ever include as much information as they deem necessary anyway. If the recipient wants more info they'll either request it or click the link.
Trust is an interesting point, but you can equally put literally anything in the client-side anchor (eg. meta.stackexchange.com/questions/148454/#definitely-not-a-rick-roll), so I don't see what a viable alternative would be.
The usual way I've seen to deal with this kind of ambiguity is by doing a 301 redirect so that bookmarks get changed and the url in the address bar is also changed. It doesn't fix external parties linking to the site with the now deprecated url but there was never anything you could reasonably do about that.
> If you are going to insist upon doing this, how about doing it using a # on the page, so at least everyone had a chance to know that it is extra, random data that can be dropped from the URL without penalty and might not come from the website and so shouldn't be trusted?
The fragment doesn't get indexed by search engines so not many will see it. Along with that, in my understanding, having something human readable in the URL helps with SEO in at least Google and Bing, so doing this could hurt your search rankings, which isn't a good thing.
Minor correction, because dealing with this is a part of my job:
Almost no browsers have implemented changing bookmarks in response to 301 redirects. Link has further context and some testing.
301 may be dangerous, because browsers cache them.
Suppose the client follows a link to old-slug after the slug has been changed to new-slug. The server responds 301 → new-slug. The client caches that redirect, so that if you request old-slug it will immediately take you to new-slug without querying the server.
Then the object's slug is changed back to old-slug (perhaps the change was made in error). Now a request to new-slug produces a 301 → old-slug. This likewise is cached, and a client may now be stuck in an infinite redirect loop.
I’m not sure if this is actually what browsers do; they might detect the loop and decide to throw away their cached redirects. I haven’t tested it; but I wouldn’t count on it.
This used to be a serious problem. It may be fixed now, but Firefox would eternally cache 301s unless explicitly told not to. This is why I configure all of my servers to disallow caching of 301s.
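The comment doesn't say which server or mechanism is used; as one hedged illustration, a plain Node handler can mark the redirect itself as uncacheable, so a slug that later flips back can't strand clients in a cached loop:

```typescript
import { createServer } from "node:http";

// Hypothetical slug-redirect table; in practice this would come from the app.
const redirects = new Map([["/old-slug", "/new-slug"]]);

createServer((req, res) => {
  const target = redirects.get(req.url ?? "");
  if (target) {
    res.writeHead(301, {
      Location: target,
      "Cache-Control": "no-store", // forbid caching the redirect itself
    });
    res.end();
    return;
  }
  res.writeHead(404);
  res.end();
}).listen(8080);
```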
Minor nitpick, I'm not sure if exact match in URL slugs matters from Google's perspective very much. I do read that searchers' eyes can be drawn towards the exact match (which are frequently bolded in the SERPs), possibly leading to a higher clickthrough rate.
It's been a while since I was looking at how google's crawler worked. For items that had multiple ways of navigating there, I remember using the link rel="canonical" to let google know where the page would have been if not for the category information etc in the url.
1: so what? I use this for my blog (cryptologie.net) and this has never been a problem. Search engines handle that quite well.
2: no. The URL is not wrong. Rather it won’t describe the content perfectly anymore. If this is an issue you can attribute a new ID to your page.
3: that's why you have url shorteners. But what's wrong with a long url? And how does it complicate sharing it? To share you copy/paste the url. Nothing changed. And now the url describes the content! (That's the reason we do it.)
4: that’s a good thing!
So yeah. I’ll keep doing this for my blog and I hope websites like SO keep doing that as well
>>> the pretense that the URL should somehow be readable is increasingly difficult to defend
I think I have a defense for this. I consistently long press links on mobile to see the url before deciding whether to load the page or not. Just to see if I can be bothered.
> 3) the URL is now extremely long and since most users don't know if a given website does this weird "part of the URL is meaningless" thing there are tons of ways of manually sharing the URL that are now extremely laborious,
I'm missing something -- what does length have to do with the difficulty of sharing a URL? I can't remember the last time I typed out any URL past the TLD.
Of course the difference is that Hacker News doesn't disseminate URLs of that form, but that doesn't mean someone couldn't pollute the internet with them.
> there are now an infinite number of URLs for every one of your pages that may end up separately stored on various services
What services? Web crawlers? I'm sure the ones I would care about are smart enough to know how this works. There are many ways infinite valid URLs can be made. Query params, subdomains and hashroutes to name a few.
> if the title changes the URLs distributed are now permanently wrong as they stored part of the content (and if you redirect to correct, can lead to temporary loops due to caches),
You don't redirect. The server doesn't even look at the slug part of the URL for routing purposes. You can change the url with javascript post-load if it bothers you (as stackoverflow does). Cache loops are an entirely avoidable problem here.
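A minimal sketch of that post-load rewrite (the canonical path here is a made-up example; in practice it would come from the page's own data): the slug in the address bar is corrected client-side, with no redirect and therefore nothing for caches to hold on to.

```typescript
// Rewrite the address bar to the canonical slug without navigating.
const canonicalPath = "/questions/12345/actual-current-title"; // assumed value
if (window.location.pathname !== canonicalPath) {
  history.replaceState(history.state, "", canonicalPath);
}
```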
> the URL is now extremely long and since most users don't know if a given website does this weird "part of the URL is meaningless" thing there are tons of ways of manually sharing the URL that are now extremely laborious
Extremely long and extremely laborious seems a bit of an exaggeration. Most users copy and paste, no? Adding a few characters of a human readable tag doesn't warrant this response I feel. Especially when the benefit means that if I copy and paste a url into someplace, I can quickly error-check it to make sure it's the title I mean. When using the share button, the de-slugged URL can be given.
> users who share the links will think "the person can read the URL, so I won't include more context" and the person receiving the links thinks "and the URL has the title, which I can trust more than what some random user adds".
I guess? I won't bother with a rebuttal because this issue seems so minor. The benefit far outweighs some users maybe providing less context because the link url made them do it. If someone says "My typescript wont compile because of my constructor overloading or something please help", I can send stuff like:
which I think is so much more useful than just IDs.
> Many web browsers don't even show the URL anymore: the pretense that the URL should somehow be readable is increasingly difficult to defend
Most do. Even still, the address bar is not the only place a URL is seen. Links in text all over the internet have URLs - particularly when shared in unformatted text (ie not anchor tags). And URLs should be readable to some extent. Would you suggest that all pages might as well be unique IDs? A URL like:
I meant in the context of sharing links, either on a board like this or in a text. But that does bring up a good point of how many users know how to copy/paste?
Among all internet users, I would conservatively assume 30%+ do. Among people who have posted a link to social media or forums, I would assume 80%+. But I'd be interested to see how off I am.
There's a reason there are those share buttons on every website that's chasing viral traffic.
I suspect most people who share on Facebook share via those, or via the Facebook app's own internal web viewer. I would assume Twitter is a bit more savvy, but I still would not bet strongly that a majority of people on Twitter know about copy-paste.
> there are now an infinite number of URLs for every one of your pages that may end up separately stored on various services (mitigated for only some kinds of service if you redirect to correct),
Not all services allow you to change the title (and therefore mutate the slug), but situations where changing the title changes the slug are so infrequent (and the consequences in this case so inconsequential) that this is a problem mostly in theory. It's a minuscule price to pay for semantically useful URLs.
So does reddit. Go to any comment section. You can remove the latter part with the title and only leave the identifier, and the link will still work. The short link actually only contains the identifier.
I think the concern is in the way it obscures the target. Replace "Moby Dick" with a Chuck Tingle (warning, probably nsfw) book. Now that second link is a serious problem.
I see what you're saying, but it doesn't seem like much more than a funny gag you might pull on a friend.
If a website is concerned about that case, then instead of letting it inform their URL design, they should have a "Warning: Adult content. [Continue] [Back]" interstitial like Reddit or Steam.
I'm not even sure it's a serious problem - a possible annoyance, and perhaps, for a spammy site owner, maybe even a feature. But as a web user, I'm not really fond of that added uncertainty.
You don't necessarily have to redirect, but you should at least include `<link rel="canonical" href="..." />` (as given example StackOverflow does) so that search robots and other website (scrape and/or API) clients know which one is the canonical path, to avoid duplicate efforts.
Yes, the best approach is probably both, but it matters more that crawlers know the canonical paths than that users do, and a crawler ignoring rel="canonical" is likely not much better than (or as buggy as) a crawler ignoring robots.txt; it's a specification they can ignore at their own peril.
The article talks about referring to resources by using URLs containing opaque ID numbers versus URLs containing human-readable hierarchical paths and names. They give examples like bank accounts and library books.
This problem about naming URLs is also present in file system design. File names can be short, meaningful, context-sensitive, and human-friendly; or they can be long, unique, and permanent. For example, a photo might be named IMG_1234.jpg or Mountain.jpg, or it can be named 63f8d706e07a308964e3399d9fbf8774d37493e787218ac055a572dfeed49bbe.jpg. The problem with the short names is that they can easily collide, and often change at the whim of the user. The article highlights the difference between the identity of an object (the permanent long name) versus searching for an object (the human-friendly path, which could return different results each time).
For decades, the core assumption in file system design is to provide hierarchical paths that refer to mutable files. A number of alternative systems have sprouted which upend this assumption - by having all files be immutable, addressed by hash, and searchable through other mechanisms. Examples include Git version control, BitTorrent, IPFS, Camlistore, and my own unnamed proposal: https://www.nayuki.io/page/designing-a-better-nonhierarchica... . (Previous discussion: https://news.ycombinator.com/item?id=14537650 )
Personally, I think immutable files present a fascinating opportunity for exploration, because they make it possible to create stable metadata. In a mutable hierarchical file system, metadata (such as photo tags or song titles) can be stored either within the file itself, or in a separate file that points to the main file. But "pointers" in the form of hard links or symlinks are brittle, hence storing metadata as a separate file is perilous. Moreover, the main file can be overwritten with completely different data, and the metadata can become out of date. By contrast, if the metadata points to the main data by hash, then the reference is unambiguous, and the metadata can never accidentally point to the "wrong" file in the future.
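A small sketch of that idea, under the assumption of a flat content-addressed store sitting on top of a normal filesystem (the paths and metadata shape are illustrative):

```typescript
import { createHash } from "node:crypto";
import { mkdirSync, readFileSync, writeFileSync } from "node:fs";

// Name a file by the hash of its bytes; metadata then points at the hash,
// so it can never silently end up describing different content.
function storeImmutable(path: string): string {
  const bytes = readFileSync(path);
  const digest = createHash("sha256").update(bytes).digest("hex");
  mkdirSync("store", { recursive: true });
  writeFileSync(`store/${digest}`, bytes); // content-addressed copy
  return digest;
}

const photoHash = storeImmutable("Mountain.jpg");
const metadata = {
  refersTo: photoHash, // unambiguous: these exact bytes, not "whatever is at this path"
  tags: ["mountains", "holiday"],
  takenAt: "2017-08-12",
};
writeFileSync(`store/${photoHash}.meta.json`, JSON.stringify(metadata, null, 2));
```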
A long time ago, around when I was first taking systems programming courses, I had this vision for a filesystem and file explorer that would do exactly what you say. I imagined an entire OS without any filepaths for user data (in the traditional, hierarchical sense). My opinion (both now and back then) was that tree structures as a personal data filing system almost always made more of a mess than it actually solved. Especially for non-techies.
Rather, everything would automatically be ingested, collated, categorized, and (of course) searchable by a wide range of metadata. Much of it would be automatic, but it would also support hand-tagging files with custom metadata, like project or event names, and custom "categorizers" for more specialized file types.
Depending on the types of files, you could imagine rich views on top -- like photos getting their own part of the system with time-series exploration tools, geolocation, and person-tagging with face recognition, or audio files being automatically surfaced in a media library, with heuristics used to classify by artist, genre, etc. But these views would be fundamentally separate from the underlying data, and any mutations would be stored as new versions on top of underlying, immutable files, making it easy to move things between views or upgrade the higher level software that depended on views.
This was years ago, and I never got around to doing any of that (it would've been a massive project that likely would've fallen flat on its face). And now, in a roundabout kind of way, we've ended up with cloud-based systems that accomplish a lot of what I had imagined. I'd go so far as to say that local filesystems are quickly becoming obsolete for the average computer-user, especially those who are primarily on phones and tablets. It's a lot more distributed across 3rd party services than what I had in mind, but that at least makes it "safer" from being lost all at once (despite numerous privacy concerns).
Part of that is kind of what Apple has been going for the past couple of years with macOS, even though they haven't gone all the way by removing the hierarchical part (since there is so much legacy software and users would revolt).
A new user profile will come with a prominent "All my files" live search shortcut that just shows all your files in a jumble sorted by when you last used them. Then they expect you to search and filter through them by metadata (which is automatically extracted/indexed by Spotlight). Then you can save these searches/filters as saved searches which are live-updating virtual folders.
If you were new to modern macOS (and iOS with the Files app) you might end up with something similar. Applications dump things in the main Documents folder (with user-chosen names, but those are necessary metadata). You can then tag items with various labels (essentially adding more metadata), and everything is searchable through Spotlight or the search function of the file manager using the user-given name, tags, and metadata (documents edited today, Pages files).
Photos and videos are managed entirely in the Photos app, and organised almost exactly according to your suggested categories (literally called Memories (for events), Places, People). iTunes handles audio files automatically (you can sync your own files into Apple Music, where they're categorised in the same way as any other music).
As I understand it, APFS also handles copying and modifying in a similar way to your description, where a copy of a file is treated as a mutation of the previous version.
Everything is even synced through iCloud to all your devices, with all macOS devices keeping a rather complete copy, unless they run out of disk space.
This would require someone to have their first experience of computing in the modern Apple ecosystem (literally iOS 11 and up) to avoid preconceptions about filesystems, since traditional folders are still supported, but it's possible.
One thing I'd love to see, in conjunction with this, is some kind of MVCC with snapshot transactions on filesystem level. So you don't really mutate files - you create new versions of them, and then old versions get GC'd eventually if nothing references them (which may not be the case if you e.g. have a backup).
Problem is, our existing file I/O APIs are very much centered around the notion of mutable files, and globally shared state with no change isolation.
The article's main insight: "URLs based on hierarchical names are actually the URLs of search results rather than the URLs of the entities in those search results".
In the most technical sense both are searches encoded into URI form. The search for the (hopefully) GUID just happens to be for a specific mechanical object, while the other describes the taxonomic categorization of what a matching item would look like.
Though their "/search?kind=book&title=moby-dick&shelf=american-literature" example is fundamentally different in that all filters (being URL query parameters) are optional and can be arbitrarily combined.
I didn't quite understand the point of the hierarchical "search URL" when you have the /search one implemented, and they go on to say you could implement both if you have the time and energy.
The Internet Archive's Wayback Machine kind of has an optional filter in a traditional URL scheme - you can replace the date in a Wayback Machine URL with an asterisk as a wildcard and you'll get either the only entry it can find or a list of dates.
"The case for identifiers" is really more of a case for surrogate keys. Surrogate keys need not be opaque, but rather are distinguished by the fact that they're assigned by an authority and may be completely unrelated to the properties of an entity.
Natural keys, meaning entity identification by some unique combination of properties, are hard to get right (oops, your email address isn't unique, or it's a mailing list) and a pain to translate into a name (`where x = x' and y = y' and z = z'`, or `/x/x'/y/y'/z/z'`, etc.).
Surrogate keys, on the other hand, make it easy to identify one and only one object forever, but only so long as everybody uses the same key for the same thing.
And as mentioned in the article, the most appropriate is usually both. Often you don't have the surrogate key, so you need to look up by the natural key, but when you do have the surrogate key, it's fastest and most likely to be correct if you use that in your naming scheme.
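As a rough sketch of "both" (the entity shape and the in-memory table are assumptions for illustration): the surrogate key backs the permalink, while the natural attributes only ever drive a search.

```typescript
interface Account {
  id: string;                   // surrogate key: assigned once, never reused
  holder: string;               // natural attributes: all of them mutable
  kind: "checking" | "savings";
}

const accounts: Account[] = [
  { id: "7f9c2ba4-33f1-4c41-b3a7-0a004b1f20e1", holder: "Jane Doe", kind: "checking" },
];

// Permalink lookup, e.g. /accounts/{id}: unambiguous and stable.
const byId = (id: string) => accounts.find((a) => a.id === id);

// Natural-key lookup, e.g. /accounts/{holder}/{kind}: convenient, but it is a
// search that can return zero or several results, and the key itself can change.
const byNaturalKey = (holder: string, kind: Account["kind"]) =>
  accounts.filter((a) => a.holder === holder && a.kind === kind);
```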
Something was bugging me about this, but I had to think hard to figure it out.
The article is largely based on a misguided premise: the idea that URLs should be conceptualized as either names or identifiers. URLs are neither: they are addresses of web pages. The things located at the URL may have names or identifiers, but by design of the web the stuff located at an address is mutable while the address is immutable.
This is an important point because it breaks the analogies to books or bank accounts. A physical copy of Moby Dick is a thing that may be located at a given address, or not. The work of fiction "Moby Dick" has an ISBN number, but the ISBN number is metadata, not an address. A bank account number is also metadata, not an address.
So I get the feeling that URLs should be conceptualized as addresses first and foremost. This isn't a magic bullet for the problem the blog post addresses (how to design URLs) but I think it gives some perspective:
* If the "thing" at the URL will always be conceptually the same "thing", but its name or other metadata may change, it makes sense to assign that thing a unique identifier and use this as part of the URL. (Because the thing with this ID will always be found at this address.)
* If the name of the stuff located at the URL is never going to change, it makes sense to use the name as part of the URL. (Because the stuff with this name will always be found there.)
* "Search results" as discussed in the blog post are a special case of the previous point: if a URL will always contain search results for a certain query, it makes sense to use the name of the query as part of the URL.
* There are also URLs that fall outside the name or identifier paradigms. http://www.ycombinator.com/about/ is the address of a bunch of stuff, which is not necessarily a single coherent thing with either an ID number or a name, but is a very reasonable address at which some content may be located.
Maybe this is all obvious, but it really helps me think about the issue, whereas the blog post confused some things for me, so I thought I'd share.
> The things located at the URL may have names or identifiers, but by design of the web the stuff located at an address is mutable while the address is immutable.
> I'm not sure about that. An address makes no promises (technically speaking) about what you will find at that address.
I don't see how that's relevant. An address, in principle, merely designates a particular location, perhaps physical like a street address, or logical like a memory address. In the context of a search or lookup, you can obtain what's contained at that address.
Similarly, a URL designates a particular resource location, as exemplified by its full name, Uniform Resource Locator. In the context of a client/server request, you can similarly obtain a representation of what's at a URL.
"The downside of the second example URL is that if a book or shelf changes its name, references to it based on hierarchical names like this one in the example URL will break."
The author appears to have forgotten about 3xx redirection codes which were intended to solve that very problem.
3xx redirection requires the backing-store to maintain some kind of permanent edit history, and is therefore not necessarily something one can assume one will have.
There's also the problem of aliasing; if another book by the same name is later added to the shelf, the hierarchical name now references an entirely different resource.
Bypassing black lists when posting links while still benefiting from crawlers following the links comes to mind.
During the 2000s, checking every link posted to a forum or blog was way too expensive, so they had blacklists of dirty words to keep porn sites from spamming and getting juice during the PageRank golden years, when any back reference mattered.
Hence it was just easier, to avoid the filters, to create non-blacklisted domain names with redirections.
Then another trick was to write a perfectly legitimate page, get google to index it, then redirect that page to the less legitimate page. Because at the time Google refreshed once a week (or a month...), you'd get plenty of traffic and revenue for long enough to be worth it. If you sold niche porn and viagra, that is.
Another one was just to set up fake sites with different URL schemes with stats on them, and get a regular update on which URL formats were getting the best hits. At the time URLs were very important in getting points. Then you would regularly update your most important sites' URL schemes accordingly, several times a year if needed.
I have a hard time believing that modern search engines are so incapable that they have to devalue redirects to the point that honest users have to worry about it.
Well that's just what I know about the things we did then. I'm not working in porn anymore, so I'm missing the new cool tricks, or abuses, depending on your point of view. But the community is VERY creative.
The last time I massively changed URLs for a client website and noticed a significant drop in traffic that took a few months to recover from was years ago. So the situation might have changed. But I'm not going to test that assumption with my clients' money :)
It's been a long time since I accidentally got to porn on the internet. I think that kind of thing is mostly dead (although some of the 'related articles' sections look pretty iffy), and instead porn monetization has moved towards people intentionally looking for porn.
They also tended to use usergroup permissions to restrict new users from posting URLs before they'd been a member for a certain amount of time (or posted a certain amount of messages) or had early posts subject to moderator approval if a link was included.
Never saw a lot of sites using 'innocent' URLs to sneak porn onto random internet forums in the way you describe, because said forums would simply treat any spammed link as suspicious regardless of what it claimed to be.
In many disciplines, data are highly decentralized across thousands of online databases (repositories, registries, and knowledgebases). Wringing value from such databases depends on the discipline of data science and on the humble bricks and mortar that make integration possible; identifiers are a core component of this integration infrastructure. Drawing on our experience and on work by other groups, we outline 10 lessons we have learned about the identifier qualities and best practices that facilitate large-scale data integration. Specifically, we propose actions that identifier practitioners (database providers) should take in the design, provision and reuse of identifiers. We also outline the important considerations for those referencing identifiers in various circumstances, including by authors and data generators. While the importance and relevance of each lesson will vary by context, there is a need for increased awareness about how to avoid and manage common identifier problems, especially those related to persistence and web-accessibility/resolvability. We focus strongly on web-based identifiers in the life sciences; however, the principles are broadly relevant to other disciplines.
Identifying changing "stuff" in the real world is for me a fundamental topic of any serious data modeling for any kind of software (be it an API, a traditional database stuff, etc). Identity is also at the center of the entity concept of Domain-Driven Design (see the seminal book of Eric Evans on that: https://www.amazon.com/Domain-Driven-Design-Tackling-Complex...).
I started changing my way of looking at identity by reading the rationale of clojure (https://clojure.org/about/state#_working_models_and_identity) -> "Identities are mental tools we use to superimpose continuity on a world which is constantly, functionally, creating new values of itself."
Nice, this reflects the choice I've made with a recent API design. This is especially important for entity names you don't control.
For example, we ingest gamertags and IDs from players of Xbox Live, PSN, Steam, Origin, Battle.net, etc. - each have their own requirements in terms of what is allowed in a username, and even whether or not they're unique. Often you can't ensure a user is unique by their gamertag alone. You can't even ensure uniqueness based on gamertag and platform name. Reality is that search is almost always required in these cases, and that's why we've implemented search in the way described in this article, with each result pointing to a GUID representing a gamer persona.
> Reality is that search is almost always required in these cases, and that's why we've implemented search in the way described in this article, with each result pointing to a GUID representing a gamer persona
This also solves the technical† challenge of handling renaming, even within a single platform. (Steam, I hate you.)
† Another challenge is social, esp. regarding abuse.
Missing for me: Timestamps. A lot of data is sufficiently unique if prefixed with a timestamp, which could be as simple and readable as /2017/10/17/my-great-blog-post/
I'm a fan of formatting UUIDs in Base62 (which is like Base64 but doesn't require any non-alphanumeric characters). I have used Base62-encoded UUIDs in URLs and APIs on several occasions. It's not standard, but if you google around you'll see that it's gaining popularity, because of the shorter identifier length.
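One non-authoritative way to do that conversion is to treat the UUID's 128 bits as a single big integer (the alphabet order and the fixed 22-character width here are conventions, not a standard):

```typescript
import { randomUUID } from "node:crypto";

const ALPHABET =
  "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

// 62^22 > 2^128, so 22 characters always suffice for a full UUID.
function uuidToBase62(uuid: string): string {
  let n = BigInt("0x" + uuid.replace(/-/g, ""));
  let out = "";
  while (n > 0n) {
    out = ALPHABET[Number(n % 62n)] + out;
    n /= 62n;
  }
  return out.padStart(22, "0"); // fixed width keeps parsing and sorting simple
}

// e.g. "a49a9762-3790-4b4f-adbf-4577a35b1df7" -> a 22-character token
console.log(uuidToBase62(randomUUID()));
```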
The author took an easy way out by recommending a canonical identifier based URL and a named URL, and then choosing a library as an example.
Books in a library are seldom renamed, if ever. The named URL would be almost as permanent as the canonical URL.
However in their earlier example of a bank account, a personal account name is typically the account holder's name and the type of account, and both of these could be subject to change as a result of marriage, death, or a change in products offered by a bank. Even then, the rate of change is low.
A better example that the author could have (should have?) used is that of a news website where the article title may change frequently and yet there is a desire to make the link indicate the type of content at the destination... this is the real crux of the issue.
On a news site a canonical identifier driven URL may be correct... but does not sell or communicate the story behind the link and the link is likely to be shared without context. Sure you may see `example.com/news/a49a9762-3790-4b4f-adbf-4577a35b1df7` but this could be any news... it is far less obvious what is behind the link than the banking example as diversity in news stories is huge.
Yet the named URL would likely fail too, as once created and shared it should not mutate or at least should remain working... and yet the story title is likely to be sub-edited multiple times as news evolves.
The best scheme was not even mentioned in the article... combining an identifier with a vanity named part: `example.org/news/a49a9762-3790-4b4f-adbf-4577a35b1df7_choosing_between_names_identifiers_URLs`. The named part can vary as it is not actually used for lookup; only the prefix identifier is used for lookup.
Though that has its own downside... one can conjure up misleading named sections for valid identifiers to misdirect and mislead.
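A sketch of how routing for that scheme might look, assuming the server matches only the UUID prefix and treats everything after the underscore as decoration:

```typescript
// Only the identifier is used for lookup; the vanity part may be absent,
// stale, or outright misleading, and the resolved article is the same.
const NEWS_URL = /^\/news\/([0-9a-f-]{36})(?:_.*)?$/;

function resolveArticleId(path: string): string | null {
  const match = NEWS_URL.exec(path);
  return match ? match[1] : null;
}

// Both resolve to the same id:
resolveArticleId("/news/a49a9762-3790-4b4f-adbf-4577a35b1df7_choosing_between_names_identifiers_URLs");
resolveArticleId("/news/a49a9762-3790-4b4f-adbf-4577a35b1df7");
```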
Odd that the article doesn't seem to mention the considerations of whether IDs are a) globally unique and b) unguessable, or the huge difference between the URL-parameter and directory styles - in the directory style a parameter's identity is inferred from its position, making all parameters required, and omitting the final one effectively defaults it to *.
There's also the locality aspect of the problem, which is unaddressed. Typically humans resolve ambiguity in a finite namespace. E.g. there are only a few Bobs I know of. If a single human were asked to resolve a Bob without context, it would be a hard problem. I think all name resolution problems are about identification on the basis of attributes, and a URL in a certain sense is supposed to model enough attributes to help us resolve this. We have modeled systems unlike humans, not with distributed and local information, but looking at URL resolution using a central brain of sorts.
> You also need to be careful about how you store your identifiers—the identifiers that should be stored persistently by the API implementation are almost always the identifiers that were used to form the permalinks. Using names to represent references or identity in a database is rarely the right thing to do—if you see names in a database used this way, you should examine that usage carefully.
What does this mean? Is it just to say don't use the name hierarchy but rather the permalink-key as identity in the database?
Isn't one problem with this that intermediate caches now have two resources that represent the same thing, so invalidation of intermediate caches will be nearly impossible?
Why not make every URL that's shown in the title bar a permalink by default?
That way, you have the best of both worlds in all cases.
If an object tries to use the same URL as another object (which was used first), then a new URL must be generated (just add something at the end of the name).
Because then you are compromising the utility of your url-as-search semantics, complicating your implementation, and probably distorting your data and/or schema. A better solution is to make the distinction between "display" and "permanent" url a first class concept.
For example, a post with title "post title" will get url "post-title".
Then a second post with title "post title" will get url "post-title-1".
Since there's only one URL part associated with each post, it's a unique identifier.
This gets rid of the ugly id in the URL, for epic URL awesomeness.
Furthermore, if you edit the first post to have "new post title" then its URL will update to "new-post-title", but "post-title" will still redirect to "new-post-title".
Someday I'm gonna open source a lib that lets you easily add awesome URLs to your app. :)
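In the meantime, a minimal sketch of the behaviour described above (all names are placeholders; a real library would persist these maps rather than keep them in memory):

```typescript
const slugToPost = new Map<string, number>(); // every slug ever assigned
const postToSlug = new Map<number, string>(); // the current canonical slug

const slugify = (title: string) =>
  title.toLowerCase().trim().replace(/[^a-z0-9]+/g, "-").replace(/(^-|-$)/g, "");

// "post title" -> "post-title"; a second post with the same title -> "post-title-1"
function assignSlug(postId: number, title: string): string {
  let slug = slugify(title);
  for (let i = 1; slugToPost.has(slug) && slugToPost.get(slug) !== postId; i++) {
    slug = `${slugify(title)}-${i}`;
  }
  slugToPost.set(slug, postId); // old slugs stay in the map and keep resolving
  postToSlug.set(postId, slug);
  return slug;
}

// Old slugs answer with a redirect to the current canonical one.
function lookup(slug: string): { postId: number; redirectTo?: string } | null {
  const postId = slugToPost.get(slug);
  if (postId === undefined) return null;
  const canonical = postToSlug.get(postId)!;
  return slug === canonical ? { postId } : { postId, redirectTo: canonical };
}

assignSlug(1, "post title");     // "post-title"
assignSlug(2, "post title");     // "post-title-1"
assignSlug(1, "new post title"); // "new-post-title"; "post-title" now redirects to it
```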
Good advice. Interesting that Canonical URLs aren't mentioned.
But the sheer arrogance of serving a webpage that doesn't render any text unless you execute their JavaScript really annoys me. It's not a fancy interactive web-app, it's a webpage with some text on it.
Your argument holds for web apps where it might be extra work to do progressive enhancement. But this is literally a webpage of text. It is more work to get JS involved.
Humans using off the shelf browsers aren't the only ones who consume webpages.
> But this is literally a webpage of text. It is more work to get JS involved. Humans using off the shelf browsers aren't the only ones who consume webpages.
Sort of, the contents of the post are in a database somewhere. It's not like someone uploaded a .html page to Blogspot and they converted it into JS. The JS makes it easier for users to customize templates.
The main reason you'd want to avoid doing something like this is that the Googlebot would penalize you, but somehow I doubt Google is concerned with that.
That said, it has a <noscript> version that seems to work fine (I turned off JS and it renders as expected).
> Sort of, the contents of the post are in a database somewhere. It's not like someone uploaded a .html page to Blogspot and they converted it into JS. The JS makes it easier for users to customize templates.
That's not always the case, either, however. There are a great many (a majority?) database-driven websites out there with framework-rendered templates; Django, Flask, Ruby on Rails, etc. They are not constructed using -- nor are they dependent upon -- JavaScript.
That's a bit disingenuous though because that's not because of user choice. JS has the enviable position of being the only language blessed to have an interpreter in the browser, and this decision and its consequences are foisted upon you regardless of whether you wanted it or not.
That argument also doesn't address OP's complaint: regardless of whether everyone has JS and uses it, the page is only rendering text, why is JS even necessary? It's not a web app, it doesn't have any special functionality etc, it doesn't have any legitimate reason to use JS, but for whatever reason, we're forced to use it anyway.
The notion of surfing the web without JavaScript enabled is increasingly antiquated. You can't even log into Google without JS enabled; it's necessary to mandate it because of iframe attacks.
Not all web pages are (or at least need to be) web apps. Logging into an account vs reading a static page is apples to oranges.
Mandating JS to get any content, no matter how static, seems like the start of the death of e.g. Linked Data and the web as an open, standards-based platform. I know I'm in the minority but diversity is a strength, and there are few places more important than the web.
I don't see why JavaScript is antithetical to linked data. If it's because the web page can't be statically analyzed, that's a solvable problem---you dynamically execute the page in a JavaScript sandbox.
To whom is it increasingly antiquated? Are you calling me old or what? I always use uMatrix, and allow the pages I want to load JavaScript, and I actually try to get more and more people to do this when they say they have issues with a billion popups and ads. Using ad blockers should increasingly be a positive trend, not antiquated, unless you want to be part of a botnet.