Hacker News: dspillett's comments

> I'm not sure why anyone would choose varchar for a column in 2026

The same string takes roughly half the storage space, meaning more rows per page, a smaller working set in memory for the same queries, and less IO. Any indexes on those columns will be similarly smaller too. So if you are storing things that you know won't break out of the standard ASCII set⁰, stick with [VAR]CHARs¹; otherwise use N[VAR]CHARs.

Of course if you can guarantee that your stuff will be used on recent enough SQL Server versions that are configured to support UTF8 collations, then default to that instead unless you expect data in a character set where that might increase the data size over UTF16. You'll get the same size benefit for pure ASCII without losing wider character set support.

Furthermore, if you are using row or page compression it doesn't really matter: your wide-character strings will effectively be UTF8 encoded anyway. But be aware that there is a CPU hit for processing compressed rows and pages on every access, because they remain compressed in memory as well as on disk.
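The halving described above is easy to demonstrate (Python as a neutral stand-in: `utf-16-le` approximates NVARCHAR storage, UTF-8 over pure ASCII approximates VARCHAR; the sample string is made up for illustration):

```python
text = "plain ASCII product code: SKU-12345"

narrow = text.encode("utf-8")     # 1 byte per ASCII character, like a VARCHAR
wide = text.encode("utf-16-le")   # 2 bytes per character, like an NVARCHAR

# For pure-ASCII data the wide form is exactly double the size.
assert len(wide) == 2 * len(narrow)
print(len(narrow), len(wide))
```

The same doubling applies to any index entries built over the column, which is where the working-set argument comes from.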

--------

[0] Codes with fixed ranges, etc.

[1] Some would say it the other way around: “use NVARCHAR if you think there might be any non-ASCII characters”. But defaulting to NVARCHAR and moving to VARCHAR only if you are confident is the safer approach IMO.


utf16 is more efficient if you have non-english text, utf8 wastes space with longer multi-byte sequences. but the real reason to always use nvarchar is that it remains sargable when varchar parameters are implicitly cast to nvarchar.

UTF-16 may be better if your text is mostly made of codepoints which need 3 UTF-8 code units but only one UTF-16 code unit (thus 2 bytes). This is extremely rare for general text, so you definitely shouldn't begin by assuming UTF-16 is a good choice without having collected actual data.

The old defense of 16-bit chars, popping up in 2026 still! Utf8 is efficient enough for all general purpose uses.

If you're storing gigabytes of non-latin-alphabet text, and your systems are constrained enough that it makes a difference, 16-bit is always there. But I'd still recommend anyone starting a system today to not worry and use utf8 for everything.


it certainly isn't the best choice for sql server, see: https://sqlquantumleap.com/2018/09/28/native-utf-8-support-i...

What do you mean by non-english text? I don't think "Ä" will be more efficient in utf16 than in utf8. Or do you mean utf16 wins for non-latin scripts with variable width? I always had the impression that utf8 wins on the vast majority of symbols, and that for very complex variable-width character sets it depends on the width whether utf16 can accommodate it. On a tangent, I wonder if emojis would fit that bill too.

Japanese, Chinese, Korean and Indic scripts are mostly 2 bytes per character on UTF-16 and mostly 3 bytes per character in UTF-8.
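The per-character arithmetic is easy to verify (Python here as a neutral stand-in; `utf-16-le` is used so the BOM isn't counted). The emoji tangent above resolves the same way in both encodings: astral-plane codepoints take 4 bytes either way.

```python
samples = {
    "Ä":  "Latin-1 range",                # 2 bytes in both encodings
    "語": "CJK",                           # 3 bytes UTF-8, 2 bytes UTF-16
    "अ":  "Devanagari",                    # 3 bytes UTF-8, 2 bytes UTF-16
    "😀": "emoji (astral plane)",          # 4 bytes in both (surrogate pair)
}
for ch, label in samples.items():
    print(f"{label}: utf-8={len(ch.encode('utf-8'))} "
          f"utf-16={len(ch.encode('utf-16-le'))} bytes")
```

So the 3-vs-2 advantage for UTF-16 is real for CJK and Indic letters specifically, and a wash for Latin-1 accents and emoji.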

Really, as an East Asian language user the rest of the comments here make me want to scream.

I am not sure if you mean me, as I just asked a question. I wonder what the best way is to handle this disparity for international software. It seems like either you punish the Latin alphabets, or the others.

> I wonder what the best way is to handle this disparity for international software. It seems like either you punish the Latin alphabets, or the others.

there are over a million codepoints in unicode, only a few thousand of which cover latin scripts and other language-agnostic symbols, emoji, etc. utf-8 is designed to be backwards compatible with ascii, not to efficiently encode all of unicode. utf-16 is the reasonably efficient compromise for native unicode applications, hence it being the internal string format in C# and sql server and such.

the folks bleating about utf-8 being the best choice make the same mistake as the "utf-8 everywhere manifesto" guys: stats skewed by a web/american-centric bias. sure, utf-8 is more efficient when your text is 99% markup and generally devoid of non-latin scripts, but that's not my database and probably not most people's


  > sure utf-8 is more efficient when your text is 99% markup and generally devoid of non-latin scripts, that's not my database and probably not most peoples
I think this website's audience begs to differ. But if you develop for S.Asia, I can see the pendulum swinging to utf-16. But even then you have to account for this:

  «UTF-16 is often claimed to be more space-efficient than UTF-8 for East Asian languages, since it uses two bytes for characters that take 3 bytes in UTF-8. Since real text contains many spaces, numbers, punctuation, markup (for e.g. web pages), and control characters, which take only one byte in UTF-8, this is only true for artificially constructed dense blocks of text. A more serious claim can be made for Devanagari and Bengali, which use multi-letter words and all the letters take 3 bytes in UTF-8 and only 2 in UTF-16.»¹
In the same vein, with reference to³:

  «The code points U+0800–U+FFFF take 3 bytes in UTF-8 but only 2 in UTF-16. This led to the idea that text in Chinese and other languages would take more space in UTF-8. However, text is only larger if there are more of these code points than 1-byte ASCII code points, and this rarely happens in real-world documents due to spaces, newlines, digits, punctuation, English words, and markup.»²
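The quoted point is easy to reproduce: even a CJK-heavy snippet loses the density battle once ordinary markup surrounds it. A toy illustration (Python; the snippet is made up, not a measured corpus):

```python
# CJK payload wrapped in the kind of ASCII markup real web text carries.
doc = '<p class="post">日本語のテキスト</p>\n<p class="post">短い文</p>'

u8 = len(doc.encode("utf-8"))
u16 = len(doc.encode("utf-16-le"))
print(u8, u16)  # the 1-byte ASCII markup tips the total in UTF-8's favour
```

Strip the tags and the balance flips the other way, which is exactly the "artificially constructed dense blocks of text" caveat from the first quote.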

The .net ecosystem isn't happy with utf-16 being the default, but it is there in .net and Windows for historical reasons.

  «Microsoft has stated that "UTF-16 [..] is a unique burden that Windows places on code that targets multiple platforms"»¹

___

1. https://en.wikipedia.org/wiki/UTF-16#Efficiency

2. https://en.wikipedia.org/wiki/UTF-8#Comparison_to_UTF-16

3. https://kitugenz.com/


hn often makes me want to scream

The non-sargability is an optimizer deficiency IMO. It could attempt the cast just like this article does manually in code: if that succeeds, use the index; if it fails, fall back to scanning and casting the other way a million times.

implicit casts should only widen, to avoid quiet information loss. if the optimizer behaved as you suggest, the query could return incorrect results, and potentially more rows than expected, with even worse consequences

It should not return incorrect results: if the nvarchar only contains ascii it will cast perfectly, and if it doesn't then you do the slow scan path. It's a simple check, the same work it's doing for every row in the current behaviour, except done one time and more restricted. Can you give me an example of an incorrect result here?

I am not talking about the default cast behavior from nvarchar to varchar, but a specific narrow check the optimizer can use to make a decision in the plan (ascii or not) with no information loss, because it will do the same thing as before if the parameter does not pass the one-time check.

By far the most common cause of this situation is using only ascii in an nvarchar because, as in this example, the client language uses an nvarchar equivalent for all strings, which is pretty much universal nowadays, and that is the default conversion when using a sql client library. One must remember to explicitly cast rather than the db doing it for you, which is the expected behaviour and the source of much confusion.

This would be purely an optimization: a fast-path check, otherwise fall back to the current slow path. Correct results always, with much faster results if only ascii is present in the string.
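The proposed fast path can be sketched in a few lines (Python stand-in for what the optimizer would do; `lookup` and the dict-as-index are purely illustrative, not SQL Server internals):

```python
def lookup(varchar_index, nvarchar_param):
    """varchar_index: dict standing in for an index on a varchar column.
    nvarchar_param: the wide string parameter arriving from the client."""
    try:
        # One-time check: does the parameter fit the narrow type losslessly?
        nvarchar_param.encode("ascii")
    except UnicodeEncodeError:
        # Slow path (current behaviour): widen and compare every stored key.
        return [v for k, v in varchar_index.items() if k == nvarchar_param]
    # Fast path: the downcast loses nothing, so an index seek is safe.
    v = varchar_index.get(nvarchar_param)
    return [] if v is None else [v]

print(lookup({"alice": 1, "bob": 2}, "alice"))  # seek path
print(lookup({"alice": 1, "bob": 2}, "zoë"))    # scan path
```

The point of the sketch is that the per-parameter check happens once, whereas the current behaviour widens every stored key on every row.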


> ants are smarter than they look.

Many moons ago I had a big pot of rhubarb in my back yard¹ and was initially irritated by the appearance of ants and aphids, until I took a moment to watch them and realise that the ants were bringing in the aphids and tending to them. The buggers were farming. The ants can't digest the leaves of the rhubarb, but the aphids can and excrete a sugary by-product that the ants “milk” from them. It is a fascinating bit of nature to read into. They even defend the aphids from predators and so forth, so it isn't a bad life for them either.

--------

[1] Not a euphemism for a lovely garden in that case, it was literally about a square yard of concrete behind the mid-terrace I was renting.


I've not used PHP in anger in well over a decade, but if the general environment out there is anything like it was back then, there are likely a lot of people, mostly on cheap shared hosting arrangements, running PHP versions older than that and for the most part knowing no better.

That isn't the fault of the language of course, but a valid reason for some of the “ick” reaction some get when it is mentioned.


PHP had its issues like every language, but also a minimal memory footprint, XML/SOAP parser, and several SQL database cursor options.

Most modern web languages like nodejs are far worse due to dependency rot, and poor REST design pattern implementations. =3


> languages like nodejs are far worse due to dependency rot

Yep. Node-based projects sometimes get an “ick” reaction from me similar to PHP ones for that reason. In this case it also isn't really the language's fault, but the way people have built the ecosystem around it.


> Also the language that has made me millions over my career with no degree.

Well done.

> Also the language that allows people to be up and running in seconds (with or without AI).

People getting up and running without any opportunity to be taught about security concerns (even ones as simple as the risks of inadequate input validation), especially considering the infamous inconsistency of PHP's APIs which can lead to significant foot-guns, is both a blessing and a curse… Essentially a precursor to some of the crap that is starting to be published now via vibe-coding with little understanding.


Not OP, and I'm no expert in the area at all, but I _do_ have a feeling that there have been quite a few such issues posted here and elsewhere that I read in the last year.

https://www.cve.org/CVERecord/SearchResults?query=io_uring seems to back that up. Only one relevant CVE listed there for 2026 so far, for more than two per month on average in 2025. Caveat: I've not looked into the severity and ease of exploit for any of those issues listed.


Did you read the CVEs? Half of these aren't vulnerabilities. One allows the root user to create a kernel thread and then block its shutdown for several minutes. Another is that if you do something that's obviously stupid, you don't get an event notification for it.

Remember the Linux kernel's policy of assigning a CVE to every single bug, in protest at the stupid way CVEs were being assigned before that.


> Did you read the CVEs?

You obviously didn't read to the end of my little post, yet feel righteous enough to throw that out…

> One allows the root user to create a kernel thread and then block its shutdown for several minutes.

Which as part of a compromise chain could cause a DoS issue that might be able to bypass common protections like cgroup imposed limits.


If we apply risk/reward analysis, how probable is such a chain of exploits? If you already got local root, you might as well do a little bit more than a simple DoS.

Depending on how much performance would be gained by using io_uring in a particular case, and how many layers of protection exist around your server, it might be a risk worth taking.


> Accepting AI-rewriting as relicensing could spell the end of Copyleft

The more restrictive licences perhaps, though only if the rewriter convinces everyone that they can properly maintain the result. For ancient projects that aren't actively maintained anyway (because they are essentially done at this point) this might make little difference, but for active projects any new features and fixes might result in either manual reimplementation in the rewritten version or the clean-room process being repeated completely for the whole project.

> chardet 7.0 is a ground-up, MIT-licensed rewrite of chardet. Same package name, same public API —

(from the github description)

The “same name” part to me feels somewhat disingenuous. It isn't the same thing so it should have a different name to avoid confusion, even if that name is something very similar to the original like chardet-ng or chardet-ai.


This is super interesting. Exploring the basis for Free Software (the 4 liberties, Richard Stallman)... if AI-code is effectively under Public Domain, wouldn't that actually be even MORE defensive than relying on copyright to be able to generate copyleft? Wouldn't the rewrite of code (previously under any license, and maybe even unknown to the LLM) constitute a massive win for the population in general, because now their 4 liberties are more attainable through the extensive use of LLMs to generate code?

Many copyleft licences give more rights to the user of the software than being public domain would.

A bit of public domain code can be used in a hidden way in perpetuity.

A bit of code covered by AGPL3 (for instance) (and other GPLs depending on context) can be used for free too, but with the extra requirement that users be given a copy of the code, and derivative works, upon request.

This is why the corps like MIT and similar licences and won't touch anything remotely like the GPL (even the LGPL, which only covers derivative works of the library, not the wider project). The MIT licence can be largely treated as public domain.


Who cares if it can be maintained. The system now penalizes the original creator for creating it and gives thieves the ability to conduct legal theft at a gargantuan scale, the only limit being how creative the abuser is in making money.

With the incentives set up like that, the era of open software cooperation would be ended rapidly.


> Who cares if it can be maintained.

People who understand and care about the implications of https://xkcd.com/2347/

Which admittedly is not nearly enough of us…


> how can a2mark ensure that AI did NOT do a clean-room conforming rewrite?

In cases like this it is usually incumbent on the entity claiming the clean-room situation was pure to show their working. For instance how Compaq clean-room cloned the IBM BIOS chip¹ was well documented (the procedures used, records of comms by the teams involved) where some other manufacturers did face costly legal troubles from IBM.

So the question is “is the clean-room claim sufficiently backed up to stand legal tests?” [and moral tests, though the AI world generally doesn't care about failing those]

--------

[1] the one part of their PCs that was not essentially off-the-shelf, so once it could be reliably legally mimicked this created an open IBM PC clone market


I don't remember him calling Linus a terrorist, though there were others that associated anything with a copyleft licence to be the loony left (or the commie left).

He certainly referred to both him and Linux as cancers though, that I do remember. He later changed his mind on that, and IIRC may even have publicly apologised for those statements.


He said Linux is a cancer, which was a stupid thing to say, but not the same as calling Linus a cancer. I say plenty of bad things about software that I would not say about the people who create it. I think Next.js is awful to use but that doesn't mean I think everyone at Vercel is an awful person, for example.

He may not have used the word cancer with respect to individuals, I can't find any such reference in a quick search, but he certainly had harsh words to say about proponents of Linux/OSS/similar.

> Small counties generate huge revenues with traffic cameras.

Whether or not that is true (I suspect it is), the best way to avoid fines for breaking traffic regulations is to not break traffic regulations. They can't make anything from you if you don't.


Until they start changing speed limits, adjusting the timing on yellow lights, or just saying you ran a stop sign when you didn't and - oops! - they happened to have their dashcam off or their car angled so the actual intersection was just out of view.

If they are that corrupt then you have problems beyond traffic fines. Get your own dash cam and such so you can prove they are lying. No, in an ideal world you shouldn't have to, but if you have a corrupt police force you aren't living in an ideal world.

AGPLv3 largely does, if you can and do enforce it in some way when breaches happen.
