Opened 16 years ago
Last modified 13 years ago
#8102 closed
get_or_create is unable to handle unicode characters — at Version 15
Reported by: | Dave Naffziger | Owned by: | nobody |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Keywords: | ||
Cc: | Triage Stage: | Accepted | |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | yes |
Easy pickings: | no | UI/UX: | no |
Description (last modified by )
Models.py:
class Article(models.Model): headline = models.CharField(max_length=100) Shell session: >>> from mysite.models import Article >>> string = u'\xae' >>> a = Article.objects.get_or_create(headline = string) >>> a (<Article: Article object>, True) >>> a = Article.objects.get_or_create(headline = string) >>> a (<Article: Article object>, True)
This seems to be recently introduced (I'm on 8125) and only noticed the bug after a recent update.
This is a pretty major bug and I ran out of time tonight to dig into it, but I wanted to at raise the issue now so that others can look at it.
This is going to create serious backwards incompatibility issues in people's databases if they are running with this bug.
Change History (19)
comment:1 by , 16 years ago
Resolution: | → worksforme |
---|---|
Status: | new → closed |
comment:2 by , 16 years ago
Mysql.
I need to invest some time to deal with the latest BackwardsIncompatibleChanges then I'll try updating and will try again to recreate the problem.
comment:3 by , 16 years ago
Resolution: | worksforme |
---|---|
Status: | closed → reopened |
I updated to r8207. The bug still exists.
I've reproduced this on:
Windows, Python 2.5.2, MySQL
Linux, Python 2.5.2, MySQL
comment:4 by , 16 years ago
Component: | Core framework → Database wrapper |
---|---|
milestone: | → 1.0 |
follow-up: 12 comment:6 by , 16 years ago
OK, I've been able to recreate this if I change my table charset to latin1. I had been using a default of utf8, and with that I could not recreate. So key to hitting the error seems to be (I'm guess) having a non-utf8 charset for the column/table.
by , 16 years ago
Use COLLATE UTF8_BIN not BINARY to force case-sensitive comparisons on MySQL
comment:8 by , 16 years ago
milestone: | 1.0 → 1.0 beta |
---|---|
Triage Stage: | Unreviewed → Accepted |
Attached a patch that changes case-sensitive comparisons from using BINARY to COLLATE UTF8_BIN. That way MySQL doesn't neglect to do the charset conversion between table charset and utf8 (what Django always provides). Tested it with a latin1 table and the provided model. Before the patch I observed the failure described, after the patch case-sensitive comparisons correctly retrieve the matched row:
>>> from mysqlci.models import Article >>> string = u'\xea' >>> a1, created1 = Article.objects.get_or_create(headline = string) >>> a1 <Article: Article pk=4, headline=ê> >>> created1 True >>> a2, created2 = Article.objects.get_or_create(headline = string) >>> a2 <Article: Article pk=4, headline=ê> >>> created2 False >>> Article.objects.filter(headline__contains = string) [<Article: Article pk=4, headline=ê>] >>> Article.objects.filter(headline__startswith = string) [<Article: Article pk=4, headline=ê>] >>> Article.objects.filter(headline__endswith = string) [<Article: Article pk=4, headline=ê>] >>> Article.objects.filter(headline__regex = string) [<Article: Article pk=4, headline=ê>] >>>
Upped the milestone to beta since I do think this is a pretty serious error that should get addressed ASAP, core devs may of course override that.
comment:9 by , 16 years ago
Patch needs improvement: | set |
---|
Patch fixes problems for (VAR)CHAR fields but introduces problems for non-string types, so is no good. You can't specify COLLATE for, say, an integer field. Still investigating how to fix this.
follow-up: 11 comment:10 by , 16 years ago
OK, I've got an alternative version of the patch that for iexact matches casts what's being compared to a CHAR. That avoids problems with specifying COLLATE for a column type that doesn't support such, but I'm unsure of the performance implications of this. I did not include the cast for the other lookup types, since it seems they only make sense when dealing with string-types? But I could be missing something there.
I ran the test suite using a default latin1 charset (which means some testcases will fail, because some tests use data that is not representable in latin1) both with the patch and without. Differences observed:
First, one more failure without the patch than with. That is, the patch fixes one testcase (in regressiontests/string_lookup/models.py) that was known to fail with a non-UTF8 database for MySQL. Specifically these lookups:
# Regression tests for #3937: make sure we can use unicode characters in # queries. # BUG: These tests fail on MySQL, but it's a problem with the test setup. A # properly configured UTF-8 database can handle this. >>> fx = Foo(name='Bjorn', friend=u'François') >>> fx.save() >>> Foo.objects.get(friend__contains=u'\xe7') <Foo: Foo Bjorn> # We can also do the above query using UTF-8 strings. >>> Foo.objects.get(friend__contains='\xc3\xa7') <Foo: Foo Bjorn>
succeed with the patch.
The other failures (same both with and without the patch) relate to attempting to store data that can't be represented in latin1 and the fact that True/False on MySQL comes back as 1/0. (I ran with MyISAM as the storage engine in order to avoid other known(?) problems with InnoDB and serializers.)
Second, the tests took 3% (28 min. vs. 27) longer to run WITHOUT the patch than with. So there does not appear to be a huge performance hit with this, though any difference could be being overwhelmed by the other overhead of running the tests.
I'll attach the updated patch but I'm not at all sure this is the right way to fix this problem. Someone with more than half a clue on mapping Django lookups to SQL might have a better approach.
by , 16 years ago
Attachment: | 8102-2.diff added |
---|
Cast exact lookups to CHAR before specifying COLLATE to avoid problems with specifying collate on numeric/binary columns
by , 16 years ago
Attachment: | nopatch_failures.txt added |
---|
For reference, test failures without the patch
comment:11 by , 16 years ago
Replying to kmtracey:
OK, I've got an alternative version of the patch that for iexact matches...
Typo, I meant EXACT matches there.
comment:12 by , 16 years ago
Uhm, do we have to care about non-utf8 databases?
"Make sure your database is configured to be able to store arbitrary string data. Normally, this means giving it an encoding of UTF-8 or UTF-16."
comment:13 by , 16 years ago
latin1 is the default MySQL encoding, so yes I think it is important to support it. The current Django docs come nowhere close to saying that a database encoding of utf-8 is required for proper interaction with Django. The very next sentence of what you quoted says:
"If you use a more restrictive encoding — for example, latin1 (iso8859-1) — you won’t be able to store certain characters in the database, and information will be lost."
That says if you don't use utf-8 you won't be able to store certain data, not that Django will exhibit crippling bugs. It makes perfect sense that if I choose a latin1 encoding for my db I won't be able to store, say, Japanese characters in it. These characters simply cannot be represented in latin1 encoding.
It does not make sense that I cannot successfully do an exact lookup on, say, little e with circumflex. It's a perfectly legitimate character in both latin1 and Unicode, it's just got a different binary representation in latin1 vs. utf8.
comment:15 by , 16 years ago
Description: | modified (diff) |
---|
Unbreak the description so that my eyes stop watering.
I'm not 100% certain this is a real bug. If you're using latin1 as your database coding, then Django cannot be expected to adjust for that because it should not have to introspect the server/client differences. Could well be a case of "Doctor, it hurts when I do this...".
Still there's a lot of other good collate-related stuff in here, so leaving it open because there's some stuff here we need to apply.
I cannot recreate this on r8204 with either MySQL or sqlite backends. In both cases, using the model you show (except with a
__unicode__
method added that prints the pk), I get:So I see it created with the first call and not with the second. What DB are you using?