Opened 13 years ago

Closed 12 years ago

#16052 closed Bug (wontfix)

MySQL: CharField's with utf8_bin collation return as str, should return as unicode.

Reported by: Graham King Owned by: Graham King
Component: Database layer (models, ORM) Version: dev
Severity: Normal Keywords: mysql utf_bin unicode
Cc: Triage Stage: Design decision needed
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

On MySQL, a CharField with default collation comes out as unicode, but with utf8_bin it comes out as str.

Adding this line into the conversion dictionary (in django/db/backends/mysql/base.py) fixes it:

 FIELD_TYPE.VAR_STRING: [(None, force_unicode)],

Patch includes two tests in regressiontests/backends/. The full test suite passes with this patch applied.

Attachments (1)

mysql_utf8bin_unicode.diff (4.5 KB ) - added by Graham King 13 years ago.
Tries to guess whether a VAR_STRING has binary or unicode data, and converts appropriately.

Download all attachments as: .zip

Change History (6)

comment:1 by Graham King, 13 years ago

Owner: changed from nobody to Graham King

comment:2 by Karen Tracey, 13 years ago

Triage Stage: UnreviewedDesign decision needed

We tried to do something like this before, see r8318 and r8329. You'll notice we were trying to specify FIELD_TYPE.VARCHAR, not FIELD_TYPE.VAR_STRING and upon further investigation we found that the VARCHAR variant was never in fact returned by the MySQL C API, so the code added to attempt to convert to unicode wasn't ever called. The attempt was reverted in r8380.

It was reverted rather than switched to specify VAR_STRING because VAR_STRING is also the type reported for a MySQL column with type VARBINARY: truly binary data. Attempting to convert truly binary data to unicode (assuming utf-8 encoding) could generate an exception or could corrupt the data; it didn't seem appropriate to be attempting the conversion for all types MySQL reported as VAR_STRING types.

Now, Django doesn't actually support binary fields (never generates a (VAR)BINARY column), but back then there was some hope it might in the future. (Though I've not heard any mention of binary field support in a while.) However, the ability to pretty easily create/use custom fields means there might be people out there with custom fields that use the MySQL VARBINARY type...I fear this proposed change would break such fields.

Also note that this deviation from the norm of returning unicode for character data is documented in the database notes: http://docs.djangoproject.com/en/1.3/ref/databases/#collation-settings. If/when we do find a way to safely change the low-level behavior here those notes will need to be revised as well.

by Graham King, 13 years ago

Attachment: mysql_utf8bin_unicode.diff added

Tries to guess whether a VAR_STRING has binary or unicode data, and converts appropriately.

comment:3 by Graham King, 13 years ago

Thanks for the explanation. I thought it was a little too easy! :-)

It seems the 'standard' way of guessing if we have binary or text data is to look for a null byte. Apparently that's what 'grep' and 'git diff' do. I've changed the patch to do that, and added tests. A VAR_STRING now becomes str if it's binary data (or utf-16), unicode otherwise.

Null byte idea is from here: http://stackoverflow.com/questions/898669/how-can-i-detect-if-a-file-is-binary-non-text-in-python

Hopefully this means we can do both: not break (VAR)BINARY columns, and make utf8_bin CharField less surprising.

comment:4 by Aymeric Augustin, 13 years ago

UI/UX: unset

Change UI/UX from NULL to False.

comment:5 by Aymeric Augustin, 12 years ago

Resolution: wontfix
Status: newclosed

Django gained a binary field in 8ee1eddb7e148de89aebde9e68da495633fc1ec9.

Unless I missed something, the technique of looking for a null byte is very fragile. If the data stored in a binary field happens not to contain a null byte, it will be returned as unicode. You could iterate over 1000 rows, get 993 str and 7 unicode!

Note: See TracTickets for help on using tickets.
Back to Top