Hacker News new | past | comments | ask | show | jobs | submit login
Why do database columns have a character length of 191? (grouparoo.com)
124 points by andyjih_ 5 months ago | hide | past | favorite | 35 comments

The article is missing important notions (and at risk of being misleading) about index storage and access.

It is correct that InnoDB indexes used to have a per-entry limit of 767 bytes; this is however the index left prefix. It means that one can still index longer strings, but infix/suffix index access can't be performed. Prefix index access can (limited to the number of chars).

Now, if one is indexing very long strings, they should really think twice about what they're doing, as some other data structures may be more appropriate. Indexing is a balancing act, and this is not a problem of MySQL, it's a inherent problem of database systems. One may think they're going to add a 2048-bytes index and speed up access, while they may be actually slowing down the system (ie. because full table scan may be faster, and the index slows down writing and query-optimizing).

To be noted that nowadays the limit is not 767 bytes, but 3072, at least, for the default row format[¹].


I always wondered how the index prefixing impacts RTL language strings.

It doesn't. LTR/RTL only applies at display time -- the first character of a string is still stored first, regardless of whether it's going to show up on the left or the right side of the screen.

As I understand it, the bytes are ordered the same (without regarding big vs little endian). The RTL (or LTR) has to do with the visual representation and can be set by a nonprinting character.




If you are building software that depends on a SQL database, always specify your encodings (and explicitly name your indexes)! I previously built an enterprise product that didn't specify our string encodings in its initial migration and spent years fielding support tickets and dealing with the hell of making migrations that could support the default encodings across platforms until we could justify copying the whole column.

Postgres largely avoids this problem by using utf8 as the default pretty much everywhere, but MySQL definitely has platforms that ship with defaults as latin1.

> Postgres largely avoids this problem by using utf8 as the default pretty much everywhere

That's… not actually really relevant to the issue. And you can configure postgres to use a non-utf8 encoding as its default.

Postgres avoids the issue by automatically moving large value outside of the row itself ("TOAST" storage). That's also why you can have an unbounded vachar in postgres, and there's no real difference between varchar, varchar(n) and text. And char(n) is slower than the others: char(n) will always pad to (n), while varchar and text will only use whatever they need (+ some overhead).

TOAST works in terms of bytes, so your UTF8 might move to the next storage location earlier than if you'd use a fixed-size single-byte encoding. It just doesn't matter that much (it'll just be a bit slower) unless you actually reach the limit of TOAST storage itself (1GB per value).

And Postgres has its own issue with fixed-size storage e.g. identifiers (names of tables, columns, indexes, constraints, etc…) can't be more than 63 bytes, they'll be silently and implicitly truncated to that.

It's true that Postgres does this. But the problem I was highlighting was the developer-error of lacking to explicitly declare an encoding. It's only really related to the OP link insofar-as when you fail to do so on MySQL, you'll receive error messages that inform you of the dreaded 767-byte index.

You can actually have a similar error in postgres, it's just that the index limit is at 2730 bytes, and it's unlikely you'll have indexed a column which can get content more than 2.5k.

It does happen though, as I've discovered (repeatedly, because I'm a slow learner).

MySQL finally ships with utf8mb4 by default in MySQL 8, but anything older still uses latin1 / latin1_swedish_ci by default and should be converted to utf8mb4.

It's been years and I still get mad every time I think about the fact that 'utf8' is a broken encoding and they had to make a new one called utf8mb4.

I have to trust my database. If I can't trust the authors of the database, then how do I trust the database?

To save a Google for anyone like me - there’s nothing wrong with utf8. The problem was that MySQL had a broken implementation of utf8 (“utf8mb3”) which only supported Unicode characters which fit in 3 bytes. As a result you couldn’t save any data in your database which contained emoji.

This is upsetting because until recently if you asked for a utf8 column, this is what you got instead. You got something which was almost, but not quite compatible with utf8.

Interestingly, this was a mistake even the creators of UTF-8 wanted to make

> The 4, 5, and 6 byte sequences are only there for political reasons. I would prefer to delete these

Obviously 5 and 6 byte sequences are consigned to history but 4 byte sequences turned out to be useful for more than "political reasons".


(scroll right to the bottom)

I ran into this in Amazon's RDS not so long ago (3 years?) where I wasted a good few hours figuring out why mysql was mangling our emojis. A good test for this is trying to insert the character, which will only work with utf8mb4 as it uses the full 4 bytes, which mysql's broken 3 byte utf-8 implementation simply does not support. This is also the reason for 191 character limit as 191 is 3/4 of the 255 bytes you'd be able to store with their broken implementation. Which of course 1/3rd of the maximum entry size for innodb of 764 bytes (the last one is \0), or 4*191. They never changed the underlying implementation. Hence the weird limit.

You also need to be aware that MySQL's "latin1" is not latin-1. It's a nonstandard variant of windows CP1252 with 8 arbitrary extra characters.

191 is a magic number for VARCHAR columns in MySQL when innodb_large_prefix is disabled, which was the default for MySQL <= 5.7.6.

Here is a short list of MySQL magic numbers for column types.

Space efficiency and performance:

- VARBINARY(255) / VARCHAR(255) with single-byte character sets: 255 is the biggest size that requires only 1 byte of storage overhead.

- VARCHAR(63) with utf8mb4: 63 is the biggest size that requires only 1 byte of storage overhead because 64 × 4 > 255.

- VARBINARY(512) / VARCHAR(512): the largest columns of their types that are compatible with in-memory temporary tables when UNION or UNION ALL is used. Larger columns will use on-disk tables instead.

Limitations of 3072-byte index keys (innodb_large_prefix enabled; default for MySQL >= 5.7.7):

- VARBINARY(3072) / VARCHAR(3072) with single-byte character sets: 3072 is the biggest size that can be fully indexed.

- VARCHAR(768) with utf8mb4: 768 is the biggest size that can be fully indexed because 769 × 4 > 3072.

Limitations of 767-byte index keys (innodb_large_prefix disabled; default for MySQL <= 5.7.6):

- VARBINARY(767) / VARCHAR(767) with single-byte character sets: 767 is the biggest size that can be fully indexed without enabling innodb_large_prefix.

- VARCHAR(191) with utf8mb4: 191 is the biggest size that can be fully indexed without enabling innodb_large_prefix because 192 × 4 > 767. This is what the OP is about. VARCHAR(768), mentioned above, should be more relevant nowadays.

Different magic numbers come in if you are using VARCHAR with utf8 (the MySQL character set, not the real UTF-8). I'll let you work out the details, but the magic numbers are 85, 255, 1024.

Fascinating! But it replaces one question with another:

> The way innodb MySQL databases worked was that you can only have 767 bytes for an index

Was somebody a Boeing fan or is there another story behind the 767 figure?

EDIT: aaand I figured it out. 256 * 3 is 768, then take away one for the null terminator. Wild.

> EDIT: aaand I figured it out. 256 * 3 is 768, then take away one for the null terminator. Wild.

That's the consequence not the cause. The cause is

> If you reduce the InnoDB page size to 8KB or 4KB by specifying the innodb_page_size option when creating the MySQL instance, the maximum length of the index key is lowered proportionally, based on the limit of 3072 bytes for a 16KB page size. That is, the maximum index key length is 1536 bytes when the page size is 8KB, and 768 bytes when the page size is 4KB.

So some ancient MySQL probably used 4KB page which only had spacer for 768 bytes index records (probably because of the page structure). When the page size was increased, the key length was likely kept.

Incidentally, Postgres' default (BTree) index has a higher-but-not-infinite limit of 2730 bytes. It can be surprising when you index a varchar column without restricting column length and one day inserts start blowing up because someone tried inserting 5k worth of crap in the column.

Nice extra context, thanks!

You've now introduced two more magic constants, though, and so the question continues: where do those come from? Since 3072 is 3 * (16KB / 16), I'm guessing the 3 still comes from the original article's `Latin1` 3-byte encoding origin story. The 1/16th sounds like a balancing factor between the width of an index element and the number of elements per page.

Additional information/investigation I found: https://dba.stackexchange.com/questions/57005/why-are-innodb...

These kind of stories are really interesting because they provide so much context to questions that I couldn't answer up to now!

Like why my MySQL couldn't create indexes on data I migrated from PostgreSQL. Granted, that ship has long since sailed and we decided to stick with PostgreSQL, but nice to know the reason none the less!

This number is based on the character set of the column type. utf8 has more space available so you get 255 with utf8mb4 you get 191 which is mysql's default now.

> All modern popular relational database support (almost) unlimited sized strings with a text or blob-type column, so why not use that? The reason is indexes.

Indexes are not the only reason. Typical databases (mysql, mssql etc.) are row oriented so they store rows of data in blocks on disk. If you have a text or blob field it is potentially unbounded in size and if the field does get too big it will typically be moved 'off the row' and replaced with a pointer to the full object. If instead you have a varchar field then it's size is bounded and the db can keep all the data in contiguous blocks on the disk. Having all your data in contiguous blocks can improve read performance (though it's less of an issue with SSDs).

PostgreSQL is also row oriented but can store small text and blob (bytes) fields in the main table. PostgreSQL determines per value if it fits in the main table storage or not. This means that varchar and text are virtually identical.

This is done for all variably sized data types, including third party GIS polygons.

Postgres always amazes me how well designed it is. It’s too bad MySQL has the brand recognition it does because Postgres is just far and away the better choice for almost everything.

This reminds me of the times when I wanted to do things "the right way" so instead of using phpMyAdmin I switched to using MySQL Workbench (which I no longer use now, it was slow and painful). Its default VARCHAR size was an oddly specific 45 characters.

Edit: Seems like someone else was wondering about the same thing, with no clear answer: https://stackoverflow.com/questions/13174307/mysql-workbench...

Site was not loading for me so here is an archived version to save you guys a click:


History, like this, is great to understand and learn from.

Newbie database question:

I'm surprised the database cares at all about field length for indexing purposes, because I'd assume for indexing purposes it would hash the field and be done with it; a hash is a great tool for doing quick equality-comparisons.

Is this not sufficient because the index also needs to support sorting, so the original values matter to high resolution?

Databases use btree indices for two reasons:

1. They provide much better locality and updatability, because databases would historically get the data from extremely slow IO (spinning rust) btrees are a great fit, they're also extremely predictable (search, insert, and delete work in O(log n)).

2. hash indices only support exact matches, btrees support prefix and range matching, and they are ordered, so when all the stars align you don't even need to touch the data pages, you can serve an entire query by just reading the index.

Most databases allow opt-in hash indexes (and even other more complicated forms of indexes), but these are either much less flexible (hash) or much more expensive (gist/gin). The btree remains an excellent default.

Prefixes are a neat trick. One nice consequence is that an index for the column pair (a, b) is also an index for (a).

And if the cardinality of a is low, it can even be an index for b! This is called “skip scan”.

Oracle has supported this for a long time, and apparently newer versions of MySQL as well.

Yup, also that `like ‘foo%’` can use the index.

Hash indexing is sometimes used, but B-trees are more common so you can select with ORDER BY and also because you can insert additional records as many times as you want, by splitting nodes now and then, but never having to rehash the whole index.

Hash index is only useful for exact matches, a full table scan would be required for any other kinds of queries. That is why most database engines have settled with B+ tree.

Regardless the type of index, field length still matters if you care about performance. Smaller fields and smaller indices means you can fits more things (the working set) in RAM and less I/O when you have to hit disks.

Hash tables perform poorly when they need to be resized periodically as the table grows (as everything needs to get rehashed). Using a tree structure instead ensures consistent performance.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact