utf16 is more efficient if you have non-english text, utf8 wastes space with longer multi-byte sequences. but the real reason to always use nvarchar is that it remains sargable when varchar parameters are implicitly cast to nvarchar.
UTF-16 is maybe better if your text is mostly made of codepoints which need 3 UTF-8 code units but only one (thus 2 bytes) UTF-16 code unit. This is extremely rare for general text and so you definitely shouldn't begin by assuming UTF-16 is a good choice without having collected actual data.
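The per-script costs are easy to check empirically. A quick sketch comparing encoded sizes (the sample strings are arbitrary, chosen only to hit each code point range):

```python
# Compare encoded byte counts for a few scripts (arbitrary sample text).
samples = {
    "latin":   "Hello, world",   # ASCII: 1 byte UTF-8, 2 bytes UTF-16
    "greek":   "Καλημέρα",       # U+0080..U+07FF: 2 bytes in both
    "chinese": "你好世界",        # U+0800..U+FFFF: 3 bytes UTF-8, 2 UTF-16
    "emoji":   "🙂🙃",           # above U+FFFF: 4 bytes in both
}
for name, s in samples.items():
    u8 = len(s.encode("utf-8"))
    u16 = len(s.encode("utf-16-le"))  # -le variant: no BOM prepended
    print(f"{name:8} utf-8={u8:3}  utf-16={u16:3}")
```

Only the U+0800..U+FFFF row comes out smaller in UTF-16; everything else ties or favors UTF-8.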
The old defense of 16-bit chars, popping up in 2026 still! UTF-8 is efficient enough for all general-purpose uses.
If you're storing gigabytes of non-latin-alphabet text, and your systems are constrained enough that it makes a difference, 16-bit is always there. But I'd still recommend that anyone starting a system today not worry and use utf8 for everything.
What do you mean by non-english text? I don't think "Ä" will be more efficient in utf16 than in utf8. Or do you mean utf16 wins for non-latin scripts with variable width? I always had the impression that utf8 wins on the vast majority of symbols, and that for very complex variable-width character sets it depends on the code point range whether utf16 can accommodate them. On a tangent, I wonder if emojis would fit that bill too..
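These per-character questions can be answered directly in any language that exposes both encodings, e.g. in Python:

```python
# "Ä" (U+00C4) costs 2 bytes in UTF-8 and 2 in UTF-16: a tie.
assert len("Ä".encode("utf-8")) == 2
assert len("Ä".encode("utf-16-le")) == 2   # -le: no BOM prepended

# Emojis sit above U+FFFF: 4 bytes in UTF-8 and a surrogate pair
# (also 4 bytes) in UTF-16, so another tie.
assert len("🙂".encode("utf-8")) == 4
assert len("🙂".encode("utf-16-le")) == 4

# UTF-16 only wins in U+0800..U+FFFF (CJK, Devanagari, etc.):
assert len("अ".encode("utf-8")) == 3   # DEVANAGARI LETTER A
assert len("अ".encode("utf-16-le")) == 2
```

So "Ä" is a wash, emojis are a wash, and the UTF-16 advantage is confined to the 3-bytes-in-UTF-8 range.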
I am not sure if you mean me, as I just asked a question. I wonder what the best way is to handle this disparity for international software. It seems like either you punish the Latin alphabets, or the others.
> I wonder what the best way is to handle this disparity for international software. It seems like either you punish the Latin alphabets, or the others.
there are over a million code points in unicode, thousands for latin and other language-agnostic symbols, emojis, etc. utf-8 is designed to be backwards compatible with ascii, not to efficiently encode all of unicode. utf-16 is the reasonably efficient compromise for native unicode applications, hence it being the internal string format in C#, sql server, and such.
the folks bleating about utf-8 being the best choice make the same mistake as the "utf-8 everywhere manifesto" guys: stats skewed by a web/american-centric bias. sure, utf-8 is more efficient when your text is 99% markup and generally devoid of non-latin scripts, but that's not my database and probably not most people's
> sure, utf-8 is more efficient when your text is 99% markup and generally devoid of non-latin scripts, but that's not my database and probably not most people's
I think this website's audience begs to differ. But if you develop for S.Asia, I can see the pendulum swinging to utf-16. But even then you have to account for this:
«UTF-16 is often claimed to be more space-efficient than UTF-8 for East Asian languages, since it uses two bytes for characters that take 3 bytes in UTF-8. Since real text contains many spaces, numbers, punctuation, markup (for e.g. web pages), and control characters, which take only one byte in UTF-8, this is only true for artificially constructed dense blocks of text. A more serious claim can be made for Devanagari and Bengali, which use multi-letter words and all the letters take 3 bytes in UTF-8 and only 2 in UTF-16.»¹
In the same vein, with reference to³:
«The code points U+0800–U+FFFF take 3 bytes in UTF-8 but only 2 in UTF-16. This led to the idea that text in Chinese and other languages would take more space in UTF-8. However, text is only larger if there are more of these code points than 1-byte ASCII code points, and this rarely happens in real-world documents due to spaces, newlines, digits, punctuation, English words, and markup.»²
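That dilution effect is easy to demonstrate. A sketch with hypothetical snippets (the CJK text and markup are arbitrary, chosen only to illustrate the point):

```python
# Dense CJK-only text vs the same text inside typical HTML markup.
dense = "统一码联盟维护统一码标准"                      # 12 CJK characters
markup = '<p class="intro">统一码联盟维护统一码标准</p>'

for label, s in (("dense", dense), ("with markup", markup)):
    u8, u16 = len(s.encode("utf-8")), len(s.encode("utf-16-le"))
    winner = "utf-16" if u16 < u8 else "utf-8"
    print(f"{label:12} utf-8={u8:3}  utf-16={u16:3}  smaller: {winner}")
```

The dense block is smaller in UTF-16, but once ordinary 1-byte-in-UTF-8 markup surrounds it, UTF-8 comes out ahead.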
The .net ecosystem isn't happy with utf-16 being the default, but it is there in .net and Windows for historical reasons.
«Microsoft has stated that "UTF-16 [..] is a unique burden that Windows places on code that targets multiple platforms"»¹
the talk page behind the utf-16 wiki is actually quite interesting. it seems the manifesto guys tried to push their agenda there, and the allusions to "real text" with missing citations are a remnant of that. obv there's no such thing as "real text" and the statements about it containing many spaces and punctuation are nonsense (many languages do not delimit words with spaces, plenty of text is not mostly markup, and so on..)
despite the frothing horde of web developers desperate to consider utf-16 harmful, it's still a fact that the consortium optimized unicode for 16 bits (https://www.unicode.org/notes/tn12) and their initial guidance, utf-8 for compatibility and portability (like on the web) and utf-16 for efficiency and processing (like in a database, or in memory), is still sound.
Interesting link! It shows its age though (22 years), as it makes the point that utf-16 is already the "most dominant processing format". But if that were the deciding factor, then utf-8 would be today's recommendation, as utf-8 is now the default for online data exchange and storage; all my software assumes utf-8 as the default as well. But I can't speak for people living and trading in places like S.Asia, like you.
If one develops for clients requiring a varying set of textual scripts, one could sidestep an ideological discussion and just make an educated guess about the ratio of utf-8 vs utf-16 penalties. That should not be complicated; sometimes utf-8 would require one more byte than utf-16 would, sometimes it's the other way around.
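That educated guess reduces to one number: the expected fraction of 1-byte-in-UTF-8 (ASCII) characters. A back-of-the-envelope sketch, assuming the non-ASCII part lives in the 3-byte UTF-8 range (U+0800..U+FFFF):

```python
def bytes_per_char(ascii_fraction: float) -> tuple[float, float]:
    """Average bytes/char for UTF-8 vs UTF-16, assuming non-ASCII text
    falls in U+0800..U+FFFF (3 bytes UTF-8, 2 bytes UTF-16)."""
    utf8 = 1 * ascii_fraction + 3 * (1 - ascii_fraction)
    utf16 = 2.0  # both ranges cost 2 bytes/char in UTF-16
    return utf8, utf16

# Break-even at 50% ASCII: below it UTF-16 is smaller, above it UTF-8.
for a in (0.2, 0.5, 0.8):
    u8, u16 = bytes_per_char(a)
    print(f"ascii={a:.0%}: utf-8 ~ {u8:.1f} B/char, utf-16 = {u16:.1f} B/char")
```

Under these assumptions the whole decision collapses to: is the text more or less than half ASCII?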
The non-sargability is an optimizer deficiency IMO. It could attempt the cast just like this article does manually in code: if the cast succeeds, use the index; if it fails, fall back to scanning and casting a million rows the other way.
implicit casts should only widen, to avoid quiet information loss. if the optimizer behaved as you suggest, the query could return incorrect results, potentially more rows than expected, with even worse consequences
It should not return incorrect results: if the nvarchar parameter only contains ascii it will cast perfectly, and if it doesn't, take the slow scan path. It's a simple check, the same work it's doing for every row in the current behavior, except done once and more restricted. Can you give me an example of an incorrect result here?
I am not talking about the default cast behavior from nvarchar to varchar, but about a specific narrow check the optimizer could use when building the plan: ascii or not. There is no information loss, because it does the same thing as before if the one-time parameter check does not pass.
By far the most common cause of this situation is ascii-only data arriving in an nvarchar parameter: as in this example, the client language uses an nvarchar equivalent for all strings, which is pretty much universal nowadays, and that is the default conversion when using a sql client library. One must remember to cast explicitly rather than letting the db do it for you, and that mismatch with the expected behavior is the source of much confusion.
This would be purely an optimization, a fast-path check that otherwise falls back to the current slow path: always correct results, and much faster ones when the string is ascii-only.
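The proposed fast path can be sketched in Python (a hypothetical illustration of the idea, not actual SQL Server optimizer behavior; `index_seek` and `scan_and_cast` are made-up stand-ins for the two plan shapes):

```python
def lookup(param: str, index_seek, scan_and_cast):
    """Hypothetical plan choice: if the nvarchar-style parameter is pure
    ASCII, narrow it once (lossless) and seek; else take the slow path."""
    if param.isascii():
        # one-time, lossless narrowing of the parameter -> sargable seek
        return index_seek(param)
    # non-ASCII parameter: cast every row the other way, as today
    return scan_and_cast(param)

# Demo: record which path was taken for each parameter.
taken = []
fast = lambda p: taken.append("seek")
slow = lambda p: taken.append("scan")
lookup("ORD-12345", fast, slow)   # ASCII -> index seek
lookup("Ördér-1", fast, slow)     # non-ASCII -> scan and cast
print(taken)
```

The key property is that the check runs once per parameter, not once per row, and the non-ASCII case degrades to exactly the current behavior.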
I guess it's somewhat known that you can trivially fake a repo w/a fork like this but it still feels like a bigger security risk than the "this commit comes from another repository" banner gives it credit for:
But this commit doesn't even have to belong to the preceding repository. You can reference a commit on a fork. Great way to sneak in an xz-utils style backdoor into critical CI workflows.
GitHub just doesn't care about security. Actions is a security disaster and has been for over a decade. They would rather spend years migrating to Azure for no reason and have multiple outages a week than do anything anybody cares about.
> But this commit doesn't even have to belong to the preceding repository. You can reference a commit on a fork. Great way to sneak in an xz-utils style backdoor into critical CI workflows.
Wow. Does the SHA need to belong to a fork of the repo? Or is GitHub just exposing all (public?) repo commits as a giant content-addressable store?
It appears that under their system all forks belong to the same repo (I imagine they just make a _fork/<forkname> ref in git when something is forked off the main repo), presumably to save on storage. And so accessing a single commit doesn't really care about its origin (finding which branch(es) a commit belongs to would be a lot of work).
yikes.. there should be the cli equivalent of that warning banner at the very least. combine this with something like gitc0ffee and it's downright dangerous
A YAML linter for it, too. I was appreciating the cron input overlay in the current GitHub Actions VS Code extension: in ghost text beside a cron: 'something' input it gives you a human-readable description. It seems like it could do a similar thing for actions commit refs, showing a simple verification of whether the ref corresponds to a tag in that repo.
Yeah, the way Github connects forks behind the scenes has created so many gotchas like this. I'm sure it's a nightmare to fix at this point, but they definitely hold some responsibility here.
I've seen it used to impersonate github themselves and serve backdoored versions of their software (the banner is pretty easy to avoid: link to the readme of the malicious commit with an anchor tag and put a nice big download link in it).
In git a commit is a full tree snapshot, even though most commit views only show the diff with the previous commit. npm is using the commit hash as a "version number" and grabbing the full git tree snapshot for that point in time. (Just like in git you can always `git checkout b181e0` to end up in a "detached HEAD" state at that commit's tree. So many developers were doing that unintentionally, which is why `git switch` requires the `--detach` flag to check out the tree at a commit, but the same thing is possible with `git switch --detach b181e0`.)
I've had this idea too, and I think about it every time I'm on a PR with lots of whitespace/non-functional noise: how nice it would be if source code weren't just text and I could be looking at a cleaner, higher-level diff instead.. I think you have to go higher than an AST though; it should at least be language-aware
(Author) In my current codebase, I preserve the whitespace nodes. Whitespace changes do not affect the other nodes though. My first attempt to recover whitespace algorithmically didn't exactly fail; rather, I was unable to verify it was OK enough. We clang-format or gofmt the entire thing anyway, and whitespace changes are mostly noise, but I have not found a 100%-sure approach yet.
I think about eg the "using" section at the top of a .cs file where order doesn't matter and it's common for folks to use the "Remove and Sort Usings" feature in VS.. if that were modeled as a set then diffs would consist only of added/removed items and a re-ordering wouldn't even be representable. And then every other manner of refactor that noises up a PR: renaming stuff, moving code around, etc. in my fantasies some perfect high-level model would separate everything that matters from everything that doesn't and when viewing PRs or change history we could tick "ignore superficial changes" to cut thru all the noise when looking for something specific
..to my mind such a thing could only be language-specific, and the model for C# is probably something similar to Roslyn's internals (it keeps "trivia" nodes separate but still models the using section as a list for some reason). having it all in a queryable database would be glorious for change analysis
Some languages are unfortunately whitespace-sensitive, so a generic VCS cannot discard whitespace at all. But maybe the diffing tools themselves could be made language-aware and hide non-meaningful changes.
hiding non-meaningful changes is not enough. when a block in python changes its indentation, i want to see that the block is otherwise unchanged. so indentation changes simply need to be marked differently. if a tool can do that, then it will also work with code where indentation is optional, allowing me to cleanly indent code without messing up the diff.
i saw a diff tool that marked only the characters that changed. that would work here.
Both svn and git have command line options for diff to hide whitespace changes. It's always disappointed me that various web UIs don't have a toggle for it.
hiding whitespace changes is not the same as marking them as whitespace changes. in particular, hiding messes up the proper display of indentation, so it's not always useful.
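A minimal sketch of the marking idea, built on Python's stdlib difflib by matching lines with leading whitespace stripped (an illustration only, not any existing tool):

```python
import difflib

def classify(old: list[str], new: list[str]):
    """Tag each line: ' ' untouched, '~' indentation-only change,
    '-'/'+' real content removed/added."""
    sm = difflib.SequenceMatcher(a=[l.lstrip() for l in old],
                                 b=[l.lstrip() for l in new])
    out = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            # stripped content matched; check if only indentation moved
            for i, j in zip(range(i1, i2), range(j1, j2)):
                out.append((" " if old[i] == new[j] else "~", new[j]))
        else:
            out += [("-", l) for l in old[i1:i2]]
            out += [("+", l) for l in new[j1:j2]]
    return out

# Reindenting a line is tagged '~', not shown as a remove/add pair:
print(classify(["if x:", "do()"], ["if x:", "    do()"]))
```

A viewer could then render `~` lines dimmed but still correctly indented, instead of hiding them outright.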
In my investigations ages ago [1] I felt the trick was to go lower than an AST. ASTs by nature generally have to be language-aware and vary so much from language to language that trying to diff them generically is rough. I didn't solve the language-aware part, but I did have some really good luck with tokenizers intended for syntax highlighting. Because they are intended for syntax highlighting they are fast, efficient, and generally work well with "malformed"/in-progress code (which is what you want for source control, where saving in-progress steps can be important/useful/necessary).
It still needs to be language-aware to know which token grammar to use, but syntax highlighting as a field has a relatively well defined shared vocabulary of output token types, which lends to some flexibility in changing the language on the fly with somewhat minimal shifts (particularly things like JS to TS where the base grammars share a lot of tokens).
I didn't do much more with it than generate simple character-based diffs that seemed like improvements over comparable line-based diffs, but I got interesting results in my experiments and beat some simple benchmarks when comparing against other character-based diff tools of the time.
(That experiment was done in the context of darcs exploring character-based diffs as a way to improve its CRDT-like source control. I still don't think darcs has the proposed character-based patch type. In theory, I could update the experiment and attempt to use it as a git mergetool, but I don't know if it provides as many benefits as a git mergetool than it might in a patch theory VCS like darcs or pijul.)
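The tokenize-then-diff pipeline described above can be sketched with a toy tokenizer standing in for a real highlighting grammar (the regex below is purely illustrative; real highlighters use per-language grammars):

```python
import difflib
import re

# Toy stand-in for a syntax-highlighting tokenizer: identifiers,
# numbers, double-quoted strings, or any single non-space character.
TOKEN = re.compile(r'[A-Za-z_]\w*|\d+|"[^"]*"|\S')

def token_diff(old: str, new: str) -> list[str]:
    """Diff at token granularity instead of line granularity."""
    a, b = TOKEN.findall(old), TOKEN.findall(new)
    # keep only context/removed/added token lines, drop diff headers
    return [line for line in difflib.unified_diff(a, b, lineterm="")
            if line[:1] in "-+ " and not line.startswith(("---", "+++"))]

# A rename surfaces as one token swapped, not a whole rewritten line:
print(token_diff("total = price * qty", "total = cost * qty"))
```

Because the units are tokens rather than lines, a single renamed identifier produces a one-token remove/add instead of marking the entire line as changed.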
this right here. not just overmoderated, but the mods were wrong-headed from the start, believing it was more important to protect some sacred archive than for users to have good experiences.
SO was so elite it basically committed suicide rather than let the influx of noobs and their noob questions and noob answers kill the site
It's funny, because I had a similar question but wanted to be able to materialize a view in Microsoft SQL Server, and ChatGPT went around in circles suggesting invalid solutions.
There were about 4 possibilities that I had tried before going to ChatGPT; it went through all 4, then when the fourth one failed it gave me the first one again.
You can't use the free chat client for questions like that in my experience. Almost guaranteed to waste your time. Try the big-3 thinking models (ChatGPT 5.2 Pro, Gemini 3 Pro, and Claude Opus 4.5).
probably trying to repro the crazy success of vscode, surely electron is the magic sauce and not the dream team of devs. azure data studio should've proved that you can't just sprinkle electron dust and get a winner.
sadly, I loved azure data studio despite its being afflicted with electron, but it became so bug-infested they had to completely abandon it.