Opened 12 years ago
Closed 4 days ago
#18392 closed New feature (fixed)
Make MySQL backend default to utf8mb4 encoding
Reported by: | EmilStenstrom | Owned by: | bcail |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | utf8mb4 mysql |
Cc: | kitsunde@…, django@…, cvrebert, moggers87, flisky, cmawebsite@…, contact@…, Jonas Trappenberg, ask@…, django@…, emorley@…, clokep@…, Adam Johnson, Clifford Gama, bcail | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Background:
I just ran into a problem with iPhone emoji characters being saved into MySQL. The text was cut off after the first emoji character. After some research I found which explains how it works: http://mzsanford.wordpress.com/2010/12/28/mysql-and-unicode/
The recommendation is to use MySQL 5.5, and the "utf8mb4" encoding.
Suggestion:
Make "utf8mb4" the default encoding for MySQL 5.5 and up.
Change History (60)
comment:1 by , 12 years ago
Triage Stage: | Unreviewed → Accepted |
---|
comment:2 by , 12 years ago
I don't know enough about MySQL and the ORM to answer your questions, I hope someone else does.
comment:3 by , 12 years ago
Ok, there's some trickiness here. Indexes in InnoDB tables can't be longer than 255 chars with utf8, but only 191 chars with utf8mb4. This means that the default indexes that Django makes for CharField(max_length=255) is too long, and will break things (break what? I'm running a migration that converts all my tables to utf8mb4 automatically, and setting utf8mb4 on a long charfield breaks because the index is too long).
From the official docs:
"InnoDB has a maximum index length of 767 bytes, so for utf8 or utf8mb4 columns, you can index a maximum of 255 or 191 characters, respectively. If you currently have utf8 columns with indexes longer than 191 characters, you will need to index a smaller number of characters. In an InnoDB table, these column and index definitions are legal: col1 VARCHAR(500) CHARACTER SET utf8, INDEX (col1(255)) To use utf8mb4 instead, the index must be smaller: col1 VARCHAR(500) CHARACTER SET utf8mb4, INDEX (col1(191))"
From: http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-upgrading.html
comment:4 by , 12 years ago
Triage Stage: | Accepted → Design decision needed |
---|
I am marking this as design decision needed. This will need a good solution which guarantees trouble-free upgrades for current users. While it would be nice to have full UTF8 support with MySQL, I don't think it is worth risking a breakage for existing users.
So, looking for good solutions here. Does MySQL infer the used character set from some runtime variable (perhaps something set at CREATE DATABASE time)? If so, we could just check the used charset for the db, and use that for test database creation and index creation. If so, this would be a simple solution to this problem.
comment:5 by , 12 years ago
I don't believe setting the connection charset to this new utf8mb4 encoding (if we're talking to a server that supports it) would cause any problem with the indexing: that's a problem that comes into play when you change the database charset itself. Whether we need to change that isn't clear to me. It's possible that these new characters (stored in a DB with utf8mb4 encoding) would be transmitted over a connection that has charset set simply to utf8. If so then there's really no point in making any connection charset change. So one open question here is: do you get an error or incorrect behavior trying to read/write these characters over a utf8 connection created by Django when operating on a database with utf8mb4 encoding?
Traditionally Django has not gone beyond advising in the documentation what charset to use (uft8) for the database. Django doesn't attempt to set the charset for stuff it creates to utf8, it just uses the default charset for the DB, which is set before Django ever gets involved. Given this new MySQL 5.5 support for "more better" unicode we probably need to update the docs to mention the new option for database charset.
Having the index creation fail for a too-long CharField is a nuisance. I know we've had tickets before that dealt with this issue, but I don't recall offhand what the status is. Django tries to disallow creation of a CharField where the index creation will fail, but MySQL makes it incredibly difficult to figure out what the right value for "max allowed" is. This new encoding just makes for more of a mess there.
comment:6 by , 12 years ago
Maybe this ticket just needs some exploration before going further: A simple test that tries to save and their fetch some 4-byte unicode characters to MySQL... I think that would reveal what the minimal changes are, for full unicode support to be possible.
follow-up: 8 comment:7 by , 12 years ago
Keywords: | hack utf8mb4 mysql added |
---|
As proof of concept I tried inserting into mysql the following values in a varchar.
😄😃😊☺😉😍😘😚
And got a "Warning: Incorrect string value" from mysql (which python treats as an exception)
As a way to test it. The hack consists in adding self.query('SET NAMES utf8mb4') in MySQLdb.connections in Connection.set_character_set function as shown here: http://pastebin.com/MW5BgRgP
Of course the correct way would be to change this in django when setting up the cursor connection.
follow-up: 9 comment:8 by , 12 years ago
Replying to rogeliorv:
As a way to test it. The hack consists in adding self.query('SET NAMES utf8mb4') in MySQLdb.connections in Connection.set_character_set function as shown here: http://pastebin.com/MW5BgRgP
Of course the correct way would be to change this in django when setting up the cursor connection.
Did your hack remove the exception? What was the rationale behind the hack? What's the next step?
comment:9 by , 12 years ago
Has patch: | set |
---|---|
Keywords: | hack removed |
Needs tests: | set |
Patch needs improvement: | set |
Replying to EmilStenstrom:
Replying to rogeliorv:
As a way to test it. The hack consists in adding self.query('SET NAMES utf8mb4') in MySQLdb.connections in Connection.set_character_set function as shown here: http://pastebin.com/MW5BgRgP
Of course the correct way would be to change this in django when setting up the cursor connection.
Did your hack remove the exception? What was the rationale behind the hack? What's the next step?
Yes, the hack removed the exception. The rationale followed was to make the mysql client to use a certain encoding.
The next step is to make django's mysql connections to use utf8mb4 by default or otherwise make it more configurable. Since utf8bm4 is utf8 compatible, there should be no extra changes in that regard.
To achieve this django.db.base.cursor should be changed in class DatabaseWrapper function _cursor, (complete function definition here http://pastebin.com/A6dMEMd4):
kwargs = {
"conv": django_conversions,
"charset": "utf8mb4",
"use_unicode": True,
}
Unfortunately this won't work unless we also change MySQLdb.connections class Connection function set_character_set:
Change the two bottom lines to (complete function definition here: http://pastebin.com/AMN1B8za)
#Hack so data can be decoded/encoded using python's utf8 since
# python does not know about mysql utf8mb4
if charset == 'utf8mb4':
charset = 'utf8
self.string_decoder.charset = charset
self.unicode_literal.charset = charset
This will guarantee you can use special characets like 😄😃😊☺😉😍😘😚
Unlike the previous hack, which worked on reading/writing data, this patch only allows me to read data in utfmb4 format, but now I've hit an error on insertion/creation where I get 'Cursor' object has no attribute '_last_executed'. I will report evidence on this error as I find it. All your help regarding this error is appreciated.
You can reach me via twitter, @rogeliorv
comment:10 by , 12 years ago
As a workaround, you can make python understand 'utf8mb4' as an alias for 'utf8':
import codecs codecs.register(lambda name: codecs.lookup('utf8') if name == 'utf8mb4' else None)
comment:11 by , 12 years ago
Cc: | added |
---|
comment:12 by , 12 years ago
There is a fix for this issue in the upcoming MySQLdb-python release (https://github.com/farcepest/MySQLdb1/tree/utf8mb4)
This should be used in together with the appropriate OPTIONS in the database config:
DATABASES = { 'default': { 'ENGINE': 'django.db.backends.mysql', 'OPTIONS': {'charset': 'utf8mb4'}, (...)
comment:13 by , 12 years ago
The fix mentioned above has been merged to master ( https://github.com/farcepest/MySQLdb1) and released (http://pypi.python.org/pypi/MySQL-python/1.2.4b5)
comment:14 by , 12 years ago
Cc: | added |
---|
comment:15 by , 12 years ago
Type: | Uncategorized → New feature |
---|
comment:16 by , 12 years ago
Triage Stage: | Design decision needed → Accepted |
---|
This is just another case of MySQL being purposefully and irreversibly brain-damaged. What it calls utf8 in actually a non-standard 3-bytes encoding unrelated to utf8.
Django should give the option to use utf8mb4, and maybe recommend it in the docs (if it works well; non-default features of MySQL rarely work well).
The comments above suggest this isn't possible right now, although I'm still confused as to the nature of the problem. Accepting on this basis.
comment:17 by , 11 years ago
Cc: | added |
---|
comment:18 by , 11 years ago
#21308 describes an issue with running tests on MySQL when charset=utf8mb4 that I've marked as a duplicate of this.
follow-up: 20 comment:19 by , 11 years ago
This InnoDB restriction was fixed in MySQL 5.5.14. You have to set the following:
innodb_large_prefix=ON
innodb_file_per_table=ON
innodb_file_format=Barracuda
and CREATE TABLE
or ALTER TABLE
with the ROW_FORMAT=DYNAMIC
attribute.
follow-up: 21 comment:20 by , 11 years ago
Cc: | added |
---|
Replying to anonymous:
*snip*
and
CREATE TABLE
orALTER TABLE
with theROW_FORMAT=DYNAMIC
attribute.
I'm not aware of any way to specify options like ROW_FORMAT=DYNAMIC
for create/alter statements in Django other than specifying SET ROW_FORMAT=DYNAMIC
with 'init_command' - but that adds overhead to each connection (and feels a bit hacky imo)
comment:21 by , 11 years ago
Replying to moggers87:
Replying to anonymous:
*snip*
and
CREATE TABLE
orALTER TABLE
with theROW_FORMAT=DYNAMIC
attribute.
I'm not aware of any way to specify options likeROW_FORMAT=DYNAMIC
for create/alter statements in Django other than specifyingSET ROW_FORMAT=DYNAMIC
with 'init_command' - but that adds overhead to each connection (and feels a bit hacky imo)
Having run into the same issue my conclusion is that the only proper solution is to switch to PostgreSQL. It would appear that MySQL has extremely bad support for indexing on utf8mb4 fields.
comment:22 by , 10 years ago
Cc: | added |
---|
comment:23 by , 10 years ago
Cc: | added |
---|
One solution would be to reduce the INDEX size to 191 for mysql, like the example above:
col1 VARCHAR(500) CHARACTER SET utf8mb4, INDEX (col1(191))"
follow-up: 25 comment:24 by , 10 years ago
https://mathiasbynens.be/notes/mysql-utf8mb4 has more information on this. Ideally it would be wonderful if Django users could set an option and have Django migration backend generate all the code for the migration to the full utf8mb4 encoding for all databases, tables, columns and corresponding changes to all indexes as well. And then on next major version this option could become the new default.
comment:25 by , 9 years ago
Replying to aigarius:
[...] have Django migration backend generate all the code for the migration [...]
Even the utf8 to utf8mb4 migration, which would be the easiest in terms of required changes, may cause data loss if you have indexes that are longer than 191 characters and should thus probably be inspected and fixed up individually and manually. For that reason, it seems impossible to me to automate. That being said, clear migration instructions would probably be helpful for many users.
Maybe this could be another backend-specific setting, similar to what was implemented for integer types in 1506c71a95cd7f58fbc6363edf2ef742c58d2487? Although, it only applying to index columns may make it way more painful to implement.
comment:26 by , 9 years ago
For a first step, I propose a setting in DATABASES OPTIONS that tells Django the maximum index size to use when creating new indexes. I think it should default to 191. That way Django won't ever reduce the size of existing indexes, only new indexes.
If we have that setting, it's at least _possible_ to use utf8mb4. We could then down the road have Django try to default to using utf8mb4 encoding for new databases/tables/columns.
(We need this for #20846.)
Ohh and it _is_ possible to have indexes longer than 191, it just requires changing your db configuration, so Django should allow longer than 191 if it's told that it's ok.
comment:27 by , 9 years ago
Will that setting work nicely with migrations though? I think we need to know the index names for some operations like AlterField
. It seems problematic if we have a way that users can vary the index names without updating existing names.
comment:28 by , 9 years ago
I was thinking it wouldn't actually change the name of the index, but I haven't actually looked at the code. :)
comment:29 by , 9 years ago
In case it's not clear, here's the problematic scenario I see:
- Indexes are created with max index name length=191 and truncated accordingly.
- Developer increases max index name length setting to 200 (no changes to index names in the database).
- Now migrations can't operate on existing indexes because the 200 character index names it generates aren't what's in the database.
comment:30 by , 9 years ago
I'm not thinking of limiting the _name_ of the index. The issue is "the maximum number of characters that can be indexed".
comment:31 by , 9 years ago
Thanks Collin, in that case your proposal makes more sense to me. It could be nice to get a consensus from more MySQL users though.
comment:32 by , 9 years ago
Cc: | added |
---|
comment:33 by , 9 years ago
Cc: | added |
---|
comment:34 by , 9 years ago
Cc: | added |
---|
comment:35 by , 9 years ago
Cc: | added |
---|
By way of consensus: I agree that if possible, going forward, Django ought to seek to use utf8mb4, because for better or worse, the world is becoming more mobile oriented, and with it more emoji-laden; the non-mb4 charsets/collations choke on such things, which is just plain unfortunate.
If a suitable backwards-compatible, migration-friendly patch can't be achieved, it should at least be mentioned in the docs somewhere, IMHO (searching for utf8mb4
or emoji
currently yields no results, so I assume there are none squirreled away). Whether that mention is a recommendation or not is probably still unclear.
comment:36 by , 9 years ago
Slightly off-topic but maybe interesting: The Wordpress team did convert all their users from utf8 to utf8mb4 according to this blog post: https://make.wordpress.org/core/2015/04/02/the-utf8mb4-upgrade/.
Looking at the code it seems they just dropped and recreated all the indexes: https://github.com/WordPress/WordPress/blob/master/wp-admin/includes/upgrade.php#L2687
And then converted all the tables one by one: https://github.com/WordPress/WordPress/blob/master/wp-admin/includes/upgrade.php#L1951
comment:37 by , 9 years ago
Yes, I based my proposal off of what WordPress did. WordPress limited the length of the index without limiting the length of the field itself. Django currently doesn't have that option.
comment:38 by , 9 years ago
Cc: | added |
---|
comment:39 by , 8 years ago
Cc: | added |
---|
comment:40 by , 8 years ago
The skips for the tests added in 1a9f6db5ffd2d5e71d73340ab59476572e05a728 should be removed or modified when completing this ticket. Should we skip them conditionally based on the test charset or should we require running the MySQL tests with utf8mb4?
comment:41 by , 7 years ago
I suggest to begin with a very minimal patch like this PR, which will at least allow users to begin converting some database columns to utf8mb4
through custom migrations, and use these columns in their code (where indexing doesn't come in their way).
Working on index issues can come later, and will be needed to run the Django test suite with utf8mb4
.
comment:42 by , 7 years ago
Oh, now I realize that utf8mb4
can also be set in DATABASES OPTIONS. Still, using it in Django by default is a strong signal.
comment:43 by , 7 years ago
Needs tests: | unset |
---|---|
Patch needs improvement: | unset |
I provided a more comprehensive patch.
comment:44 by , 7 years ago
Patch needs improvement: | set |
---|---|
Summary: | Use utf8mb4 encoding with MySQL 5.5 → Make MySQL backend default to utf8mb4 encoding |
There's an outstanding issue to fix on the pull request and Claude said, "I'm not sure if I'll have time to continue working on this, so if anyone wants to take this patch further, feel free!"
comment:45 by , 7 years ago
As a workaround, I came up with this monkey patch that limits the index size. We put this in our migrations/init.py:
from django.db.models.fields import CharField def _create_index_sql(self, model, fields, suffix="", sql=None): """ Return the SQL statement to create the index for one or several fields. `sql` can be specified if the syntax differs from the standard (GIS indexes, ...). """ tablespace_sql = self._get_index_tablespace_sql(model, fields) idx_columns = [] for field in fields: c = field.column if isinstance(field, CharField): if field.max_length > 255: idx_columns.append(self.quote_name(c) + '(255)') else: idx_columns.append(self.quote_name(c)) else: idx_columns.append(self.quote_name(c)) columns = [field.column for field in fields] sql_create_index = sql or self.sql_create_index return sql_create_index % { "table": self.quote_name(model._meta.db_table), "name": self.quote_name(self._create_index_name(model, columns, suffix=suffix)), "using": "", "columns": ", ".join(column for column in idx_columns), "extra": tablespace_sql, } from django.db.backends.mysql.schema import DatabaseSchemaEditor DatabaseSchemaEditor._create_index_sql = _create_index_sql
comment:46 by , 6 years ago
Can anyone clarify the process to migrate the default Django generated MySQL schema to a utf8mb4 friendly one?
comment:47 by , 5 years ago
Cc: | added |
---|
For all - Django-MySQL has a system to check to warn if you're not usnig utf8mb4: https://django-mysql.readthedocs.io/en/latest/checks.html#django-mysql-w003-utf8mb4
Replying to Arthur Pemberton:
Can anyone clarify the process to migrate the default Django generated MySQL schema to a utf8mb4 friendly one?
See this post: https://mathiasbynens.be/notes/mysql-utf8mb4 .
follow-up: 49 comment:48 by , 3 years ago
I submitted a utf8mb4 fix on github and was redirected here to this old relic.
https://github.com/django/django/pull/14563
Lots of stuff had changed since the issue was first open nine years ago. My two cents:
- In v5.7, innodb indexes no longer limits the 767 bytes hardcap on utf8mb4 indexes.
https://dev.mysql.com/doc/relnotes/mysql/5.7/en/news-5-7-7.html#mysqld-5-7-7-feature
When innodb_file_format is set to Barracuda, innodb_large_prefix=ON allows index key prefixes longer than 767 bytes (up to 3072 bytes) for tables that use a Compressed or Dynamic row format.
- MySQL 8 will default use db-wide utf8mb4
https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-1.html#mysqld-8-0-1-charset
Important Change: The default character set has changed from latin1 to utf8mb4. These system variables are affected:
IMHO Django's decision to hack MySQL's default settings to less supported utf8mb3 (aka the utf8) would be unwise. Maybe it was a proper compromise 9 years ago, but it will be a liability in the years to come.
Maybe at least we can add some notes to alarm the readers?
CREATE DATABASE <dbname> CHARACTER SET utf8;
This would be a huge misleading mistake on official Django doc.
@felixxm @pope1ni
comment:49 by , 3 years ago
Replying to lambdaq:
Lots of stuff had changed since the issue was first open nine years ago.
Yes, it has. But it is still not necessarily straightforward.
- In v5.7, innodb indexes no longer limits the 767 bytes hardcap on utf8mb4 indexes.
https://dev.mysql.com/doc/relnotes/mysql/5.7/en/news-5-7-7.html#mysqld-5-7-7-feature
When innodb_file_format is set to Barracuda, innodb_large_prefix=ON allows index key prefixes longer than 767 bytes (up to 3072 bytes) for tables that use a Compressed or Dynamic row format.
Based on your linked release notes, MySQL 5.7.7 changed the defaults of some settings to the following:
innodb_file_format=Barracuda # Previous default was Antelope innodb_large_prefix=ON # Previous default was OFF
These allow indexing strings up to 768 characters instead of 191 characters for utf8mb4 which should eliminate the problem of . I don't think we need to worry about that new upper limit being an issue as 255 was the previous cap anyway with utf8 (a.k.a. utf8mb3).
It should be noted that these options are also deprecated as of that release and removed in 8.0.0. The release notes also state that using non-default values for the above settings in MySQL 5.7.7+ will log a deprecation warning.
The underlying problem, however, is what row format is used. It is necessary for the row format of a table to be COMPRESSED
or DYNAMIC
for large key index support. These were unavailable with the previous default file format configuration (Antelope) and only REDUNDANT
and COMPACT
could be used.
Note that even in MySQL 8.0.0+ all of these row formats can be used, will not be changed automatically during upgrade and thus the migration of existing projects from utf8mb3 → utf8mb4 (with 255 → 191 characters) is still a potential problem.
That said, the default value of innodb_default_row_format added in MySQL 5.7.9 is DYNAMIC
. (See here also.) Note that in 5.6 the default value was COMPACT
.
I found the following upgrade details which state:
In MySQL 5.7.9,
DYNAMIC
replacesCOMPACT
as the implicit default row format for InnoDB tables. A new configuration option,innodb_default_row_format
, specifies the default InnoDB row format. Permitted values includeDYNAMIC
(the default),COMPACT
, andREDUNDANT
.
After upgrading to 5.7.9, any new tables that you create use the row format defined by
innodb_default_row_format
unless you explicitly define a row format (ROW_FORMAT
).
For existing tables that do not explicitly define a
ROW_FORMAT
option or that useROW_FORMAT=DEFAULT
, any operation that rebuilds a table also silently changes the row format of the table to the format defined byinnodb_default_row_format
. Otherwise, existing tables retain their current row format setting. For more information, see Defining the Row Format of a Table.
So maybe things are not as bad as they seem after all.
- MySQL 8 will default use db-wide utf8mb4
https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-1.html#mysqld-8-0-1-charset
Important Change: The default character set has changed from latin1 to utf8mb4. These system variables are affected:
IMHO Django's decision to hack MySQL's default settings to less supported utf8mb3 (aka the utf8) would be unwise. Maybe it was a proper compromise 9 years ago, but it will be a liability in the years to come.
Maybe at least we can add some notes to alarm the readers?
CREATE DATABASE <dbname> CHARACTER SET utf8;
This would be a huge misleading mistake on official Django doc.
@felixxm @pope1ni
It would be nice to sort this out, not that I'm volunteering. You should reignite the discussion on the DevelopersMailingList if you want to take this on.
Given how things have changed maybe there is an easy path forward now and we could consider the following:
- Require MySQL 5.7.9+. We can't drop 5.7 entirely as it is supported until October 2023. See SupportedDatabaseVersions.
- Add a system check to ensure that the configuration options are set to expected values:
# For MySQL < 8.0.0: innodb_file_format=Barracuda innodb_large_prefix=ON # For all versions: innodb_default_row_format=DYNAMIC
- Add a system check that the
ROW_FORMAT
of all tables isDEFAULT
,DYNAMIC
, orCOMPRESSED
. (See here.) - Change the documentation and connection configuration to use
utf8mb4
. - Add plenty of warnings into the release notes detailing the restrictions and how to fix any issues in preparation for upgrading Django.
One other question is how this all affects MariaDB which is supported by django.db.backends.mysql
. I haven't looked into that.
comment:50 by , 2 months ago
Cc: | added |
---|---|
Owner: | changed from | to
Status: | new → assigned |
Version: | 1.4 → dev |
comment:52 by , 2 months ago
Cc: | added |
---|
comment:53 by , 2 months ago
What if we changed the defaults in django to "utf8mb4", but put prominent notes in the documentation for how users can set the database options to use "utf8mb3"? We could link to a page like this, and then users could stay on utf8mb3 if they need to, and django wouldn't need to try to automatically migrate data.
comment:55 by , 6 weeks ago
Owner: | removed |
---|---|
Status: | assigned → new |
comment:57 by , 4 days ago
Owner: | set to |
---|---|
Status: | new → assigned |
comment:58 by , 4 days ago
Patch needs improvement: | unset |
---|
comment:59 by , 4 days ago
Triage Stage: | Accepted → Ready for checkin |
---|
To me it seems the character encoding is set on connect to 'charset': 'utf8'. Is it enough to change this from 'utf8' to 'utf8mb4' to change this default. If not, where should this default encoding be defined? Does this mean that any text column must be VARCHAR(N) CHARACTER SET utf8mb4 on creation?
I am accepting this as to me the change sounds valid.