It is correct that InnoDB indexes used to have a per-entry limit of 767 bytes; however, that limit applies to the index's left prefix. You can still index longer strings, but infix/suffix index access can't be performed; prefix access can (limited to that number of characters).
Now, if one is indexing very long strings, they should really think twice about what they're doing, as some other data structure may be more appropriate. Indexing is a balancing act, and this is not a problem with MySQL; it's an inherent problem of database systems. One may think they're going to add a 2048-byte index and speed up access, while they may actually be slowing the system down (e.g. because a full table scan may be faster, and the index slows down writing and query optimizing).
Note that nowadays the limit is not 767 bytes but 3072, at least for the default row format[¹].
Postgres largely avoids this problem by using UTF-8 as the default pretty much everywhere, but MySQL definitely has platforms that ship with latin1 as the default.
That's… not actually relevant to the issue. And you can configure Postgres to use a non-UTF-8 encoding as its default.
Postgres avoids the issue by automatically moving large values outside of the row itself ("TOAST" storage). That's also why you can have an unbounded varchar in Postgres, and there's no real difference between varchar, varchar(n), and text. And char(n) is slower than the others: char(n) will always pad to n characters, while varchar and text only use whatever they need (plus some overhead).
TOAST works in terms of bytes, so your UTF-8 data might move to out-of-line storage earlier than if you used a fixed-size single-byte encoding. It just doesn't matter that much (it'll be a bit slower) unless you actually reach the limit of TOAST storage itself (1GB per value).
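A minimal sketch of that bytes-vs-characters distinction. The ~2KB figure below is Postgres's default TOAST threshold (TOAST_TUPLE_THRESHOLD, roughly 2KB); the strings themselves are made up for illustration:

```python
# Same character count, very different byte counts under UTF-8.
ascii_text = "a" * 1500   # 1 byte per character
cjk_text = "日" * 1500     # 3 bytes per character in UTF-8

TOAST_THRESHOLD = 2000  # Postgres's default is roughly 2KB (an approximation here)

for label, s in [("ascii", ascii_text), ("cjk", cjk_text)]:
    nbytes = len(s.encode("utf-8"))
    print(f"{label}: {len(s)} chars, {nbytes} bytes, "
          f"TOAST candidate: {nbytes > TOAST_THRESHOLD}")
```

The 1500-character CJK string weighs 4500 bytes and crosses the threshold, while the ASCII one, with the same character count, does not.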
And Postgres has its own issue with fixed-size storage: e.g. identifiers (names of tables, columns, indexes, constraints, etc.) can't be more than 63 bytes; longer names are silently and implicitly truncated to that length.
It does happen though, as I've discovered (repeatedly, because I'm a slow learner).
I have to trust my database. If I can't trust the authors of the database, then how do I trust the database?
This is upsetting because until recently, if you asked for a utf8 column, this is what you got instead: something which was almost, but not quite, compatible with UTF-8.
> The 4, 5, and 6 byte sequences are only there for political reasons. I would prefer to delete these
Obviously 5 and 6 byte sequences are consigned to history but 4 byte sequences turned out to be useful for more than "political reasons".
(scroll right to the bottom)
Here is a short list of MySQL magic numbers for column types.
Space efficiency and performance:
- VARBINARY(255) / VARCHAR(255) with single-byte character sets: 255 is the biggest size that requires only 1 byte of storage overhead.
- VARCHAR(63) with utf8mb4: 63 is the biggest size that requires only 1 byte of storage overhead because 64 × 4 > 255.
- VARBINARY(512) / VARCHAR(512): the largest columns of their types that are compatible with in-memory temporary tables when UNION or UNION ALL is used. Larger columns will use on-disk tables instead.
Limitations of 3072-byte index keys (innodb_large_prefix enabled; default for MySQL >= 5.7.7):
- VARBINARY(3072) / VARCHAR(3072) with single-byte character sets: 3072 is the biggest size that can be fully indexed.
- VARCHAR(768) with utf8mb4: 768 is the biggest size that can be fully indexed because 769 × 4 > 3072.
Limitations of 767-byte index keys (innodb_large_prefix disabled; default for MySQL <= 5.7.6):
- VARBINARY(767) / VARCHAR(767) with single-byte character sets: 767 is the biggest size that can be fully indexed without enabling innodb_large_prefix.
- VARCHAR(191) with utf8mb4: 191 is the biggest size that can be fully indexed without enabling innodb_large_prefix because 192 × 4 > 767. This is what the OP is about. VARCHAR(768), mentioned above, should be more relevant nowadays.
Different magic numbers come in if you are using VARCHAR with utf8 (the MySQL character set, not the real UTF-8). I'll let you work out the details, but the magic numbers are 85, 255, 1024.
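The arithmetic behind all of these magic numbers can be checked mechanically. A sketch, using the worst-case bytes per character from the thread (1 for single-byte charsets, 4 for utf8mb4, 3 for MySQL's old "utf8"); `max_chars` is a hypothetical helper, not a MySQL function:

```python
def max_chars(byte_budget, bytes_per_char):
    """Largest declared length whose worst case still fits in byte_budget bytes."""
    return byte_budget // bytes_per_char

# 1-byte VARCHAR length prefix: payload must fit in 255 bytes.
print(max_chars(255, 1))   # 255 for single-byte charsets
print(max_chars(255, 4))   # 63 for utf8mb4
print(max_chars(255, 3))   # 85 for MySQL "utf8"

# Full-column index keys: 767-byte and 3072-byte limits.
print(max_chars(767, 4))   # 191
print(max_chars(767, 3))   # 255
print(max_chars(3072, 4))  # 768
print(max_chars(3072, 3))  # 1024
```

This reproduces the 63/191/768 figures above, and the 85/255/1024 figures for the 3-byte "utf8" charset.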
> The way innodb MySQL databases worked was that you can only have 767 bytes for an index
Was somebody a Boeing fan or is there another story behind the 767 figure?
EDIT: aaand I figured it out. 256 * 3 is 768, then take away one for the null terminator. Wild.
That's the consequence, not the cause. The cause is:
> If you reduce the InnoDB page size to 8KB or 4KB by specifying the innodb_page_size option when creating the MySQL instance, the maximum length of the index key is lowered proportionally, based on the limit of 3072 bytes for a 16KB page size. That is, the maximum index key length is 1536 bytes when the page size is 8KB, and 768 bytes when the page size is 4KB.
So some ancient MySQL probably used 4KB pages, which only had space for 768-byte index records (probably because of the page structure). When the page size was increased, the key length limit was likely kept.
Incidentally, Postgres' default (btree) index has a higher-but-not-infinite limit of 2730 bytes. It can be surprising when you index a varchar column without restricting the column length and one day inserts start blowing up because someone tried inserting 5KB worth of crap into the column.
You've now introduced two more magic constants, though, and so the question continues: where do those come from? Since 3072 is 3 * (16KB / 16), I'm guessing the 3 still comes from the original article's `Latin1` 3-byte encoding origin story. The 1/16th sounds like a balancing factor between the width of an index element and the number of elements per page.
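Assuming that proportionality holds exactly, the limit works out to 3/16 of the page size. A sketch (the 3/16 ratio is inferred from the quoted numbers, not from InnoDB's source):

```python
def max_index_key(page_size):
    # The documented limits scale as 3/16 of the InnoDB page size:
    # 16KB -> 3072, 8KB -> 1536, 4KB -> 768.
    return page_size * 3 // 16

for ps in (4096, 8192, 16384):
    print(f"{ps // 1024}KB page -> {max_index_key(ps)}-byte key limit")
```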
Like why my MySQL couldn't create indexes on data I migrated from PostgreSQL. Granted, that ship has long since sailed and we decided to stick with PostgreSQL, but it's nice to know the reason nonetheless!
Indexes are not the only reason. Typical databases (MySQL, MSSQL, etc.) are row-oriented, so they store rows of data in blocks on disk. If you have a text or blob field, it is potentially unbounded in size, and if the field gets too big it will typically be moved 'off the row' and replaced with a pointer to the full object. If instead you have a varchar field, then its size is bounded and the DB can keep all the data in contiguous blocks on disk. Having all your data in contiguous blocks can improve read performance (though it's less of an issue with SSDs).
This is done for all variably sized data types, including third party GIS polygons.
Edit: Seems like someone else was wondering about the same thing, with no clear answer: https://stackoverflow.com/questions/13174307/mysql-workbench...
I'm surprised the database cares at all about field length for indexing purposes, because I'd assume it would just hash the field and be done with it; a hash is a great tool for quick equality comparisons.
Is this not sufficient because the index also needs to support sorting, so the original values matter to high resolution?
1. They provide much better locality and updatability. Because databases would historically get their data from extremely slow I/O (spinning rust), btrees are a great fit; they're also extremely predictable (search, insert, and delete all work in O(log n)).
2. Hash indices only support exact matches; btrees support prefix and range matching, and they are ordered, so when all the stars align you don't even need to touch the data pages: you can serve an entire query by just reading the index.
Most databases allow opt-in hash indexes (and even other more complicated forms of indexes), but these are either much less flexible (hash) or much more expensive (gist/gin). The btree remains an excellent default.
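A toy illustration of that difference, using a sorted list as a stand-in for a btree and a dict as a stand-in for a hash index (the data is made up):

```python
import bisect

words = sorted(["apple", "apricot", "banana", "blueberry", "cherry"])
hash_index = {w: True for w in words}

# Hash index: exact match only. There is no cheap way to ask it
# for "every key starting with 'ap'".
print("banana" in hash_index)  # exact lookup works fine

# Ordered index: prefix/range queries fall out of binary search.
def prefix_scan(sorted_keys, prefix):
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi]

print(prefix_scan(words, "ap"))  # ['apple', 'apricot']
print(prefix_scan(words, "b"))   # ['banana', 'blueberry']
```

The prefix scan never looks at anything but the sorted keys themselves, which is the ordered-index analogue of serving a query entirely from the index.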
Oracle has supported this for a long time, and apparently newer versions of MySQL as well.
Regardless of the type of index, field length still matters if you care about performance. Smaller fields and smaller indices mean you can fit more things (the working set) in RAM and do less I/O when you have to hit disk.