Opened 16 years ago
Last modified 3 years ago
#9682 new Bug
icontains can be case-sensitive on MySQL
Reported by: | Owned by: | nobody | |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 1.0 |
Severity: | Normal | Keywords: | |
Cc: | Tom Carrick | Triage Stage: | Someday/Maybe |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Firstly, this is a known problem, see groups.google.com/group/django-users/browse_thread/thread/2936ad3387c4b369.
Secondly, I disagree with the ‘You set the database collation and Django respects that’ approach shown there. For icontains and MySQL, Django generates ‘x LIKE '%y%'’. So, the programmer has told Django explicitly, ‘I want case-insensitive comparison’, and Django tells MySQL, ‘We want default comparison’. This is not field_icontains but rather some field_usingdefaultsettingscontains.
In fact, here’s how Django defines operators for the MySQL backend:
operators = { 'exact': '= %s', 'iexact': 'LIKE %s', 'contains': 'LIKE BINARY %s', 'icontains': 'LIKE %s', 'regex': 'REGEXP BINARY %s', 'iregex': 'REGEXP %s', 'gt': '> %s', 'gte': '>= %s', 'lt': '< %s', 'lte': '<= %s', 'startswith': 'LIKE BINARY %s', 'endswith': 'LIKE BINARY %s', 'istartswith': 'LIKE %s', 'iendswith': 'LIKE %s', }
So, case-sensitivity is explicitly requested, while case-insensitivity is implied.
The aforementioned thread says that ‘case-sensitive matching with the default utf_*_ci collations’ and overall making ‘things like that work regardless of collation’ is ‘really, really hard’. That sounds strange, because case-sensitive matching is no harder than a simple memcpy, it’s case-insensitive matching that is problematic.
However, MySQL doesn’t seem to implement the Unicode collation trickery correctly. It thinks that 'ı' is LIKE 'i' and 'groß' is not LIKE 'GROSS'. (It does perform LOWER('COMMITTED' COLLATE utf8_turkish_ci) correctly, however.) What about altering the above table, making it 'COLLATE utf8_general_ci LIKE %s' for cases where the default collation isn’t case-insensitive?
You’re right Unicode is quite hard to implement properly. However, for something_icontains the database does have to determine some collation, and if the default collation is case insensitive, some fallback has to be derived. Maybe the most reasonable solution would be a setting like FORCE_COLLATION_FOR_CI (maybe MySQL-specific, maybe it can be usable for other DBMSs), that would determine how to collate values for which no collation is specified?
It would solve a practical problem: I want iexact to differ from exact and icontains to differ from contains (please note these completely reasonable requirements conflict on current MySQL under current Django implementation regardless of collation settings). I’ve read what Django documentation has to say about collation, and while you might be justified in having ‘exact’ return what the database considers exact (for example, for purposes of UNIQUE indices), it’s wrong not to provide a lookup option that would consistently test for case-sensitive equality across DBMSs. Maybe in addition to contains and icontains something like scontains is also needed?
To summarize, I propose two things:
- For those who have case-insensitive collation in MySQL: a lookup option that would always invoke case-sensitive matching. Maybe the ‘s’ prefix for ‘sensitive’, or ‘b’ for binary: bcontains, bexact, bregex and so on.
- For case-sensitive people: an option to set a collation for case-insensitive matches, for example in the settings module.
Both options preserve backwards compatibility and both should be easy to implement. (People on KDE bugzilla have told me I should never tell them whether I think something is easy to implement, but nevertheless it’s a couple of lines for both changes. :-)
Attachments (1)
Change History (18)
by , 16 years ago
Attachment: | django-force-collation-for-ci.diff added |
---|
comment:1 by , 16 years ago
Triage Stage: | Unreviewed → Someday/Maybe |
---|
You've left out of your estimate the lines of doc required to explain the new lookup types and settings you propose, and the estimated person-hours per year needed to answer questions related to them on their mailing list and IRC, and to deal with the resulting tickets when someone sets things up incorrectly, or wants to use different collations for different tables/columns, etc. :-)
If we were to fix MySQL exact/iexact lookup, this would not be the approach I would take. I think new lookup types and settings would just add to the confusion. Rather I'd prefer to make the existing lookups work properly regardless of database collation.
That's not particularly hard to do either, in terms of lines of code. One way I experimented with doing it was adding a new call into the database backend to get an appropriate collation string for a given lookup type and column datatype. On all DBs except MySQL this was a no-op, but MySQL returned (for string-type data) either the utf8 binary or case-insensitive collation as appropriate for the given lookup type. (Given where the collate was placed it was OK to always use a utf8 collation since Django always uses utf8 on the DB connection.) This worked correctly but had hideous performance on my DB, which happens to use latin1 encoding, since the specification of a utf8 collation for a latin1 column resulted in inability to use indexes for the lookups (see last comment on #8102).
At that point I dropped looking into it, since I don't actually need anything beyond what Django does today for my own use -- it was mainly an academic exercise to see if I couldn't get both lookup types working properly regardless of collation on MySQL. If I were to look at it again I'd be seeing if I can't somehow get from MySQLdb what the collation in use is and only specify an alternate (utf8) one if necessary, or see if different placement of collate in the statement and use of a collation that matched the DB charset in use didn't cause such a performance hit, etc. That is I'd prefer to put the complexity of fixing this into the code rather than push it off onto the user in the form of alternative lookup types and settings.
I'm going to put this into Someday/Maybe because this is a wart that would be nice to fix. But (I believe) we need a solution that doesn't add new lookup types or settings, performs well for MySQL default configurations, and does not add too much complexity to the database layer (I do not even know if my collate lookup call would have been accepted).
comment:2 by , 16 years ago
Thanks for the quick reply.
I think that three lookup types are actually needed. One case-sensitive, one case-insensitive, and one that uses the default database settings. Consider:
if model.objects.filter(pk="TeSt").count() == 0: ... model.objects.create(pk="TeSt", ...)
Here the intent is to check whether a record with the given PK would be allowed by the database engine, regardless of whether the field has case-sensitive collation or not. Neither _exact nor _inexact would work for sure.
By the way, what about cutting this Gordian knot by simply making _exact strictly case-sensitive, _inexact strictly insensitive, and if none is specified, using the database settings? This would be compatible with existing code that simply says get(fieldname="string"), it would differ in _exact becoming completely exact (this is what could break code, but hardly much of it).
Another issue I seem not to have expressed clearly is the collation for fields that are case-sensitive by default when case-insensitive matching is requested. There’s nowhere for MySQL to derive that from. Suppose I look for status_iexact="committed"; should "COMMITTED" match? It should unless Turkish collation is needed, and even then one might argue that while "Türkiye" = "TÜRKİYE", "committed" != "COMMİTTED" but rather "committed" = "COMMITTED" (the rather well-known SVN bug). So the perfect solution would be a way to specify a collation for each string lookup, but until then at least a way to specify global case-insensitive collation like my patch does is required. You see, we do need a way to be able to compare utf8_bin fields case-insensitively, Django doesn’t provide a single means to do that.
comment:3 by , 16 years ago
Please go back and look at the long history of this issue in other tickets and on the django-developers mailing list before suggesting that "hardly breaking much code" is something we should do. Firstly, breaking any code isn't an option (there's no way for people to tell that there code is suddenly broken and it's more or less locked in by our API stability guarantee). Secondly, there are lots of subtle corners here, as you'll see from related tickets in Trac and they are precisely the worst cases to break in any way.
The two big problems are that MySQL's collation stuff is a little awkward to work with and the MySQLdb wrapper isn't currently able to handle things properly. We have to work with the restrictions we have.
comment:4 by , 16 years ago
Let me get it very straight and simple.
- Currently either exact = iexact or contains = icontains, there’s no setting under which all four would be distinct, which is definitely an error.
- I propose adding one entry to the settings module, the patch above has perfect backward compatibility and makes icontains behave reasonably on utf8_bin fields.
- It would be great to be able to specify collation explicitly for all kinds of lookups, for example, to order records differently for German ("Ö" < "P") and Swedish ("Ö" > "Z") users, but I understand it needs design decisions and can be hard to implement. (It could look like ‘country_iexact=collate("TÜRKİYE", "tr_TR.UTF-8")’ and ‘order_by=[collate("last_name", "de_CH.UTF-8")]’.)
Yes, I did look at other issues and I searched the mailing lists. Of other issues #7789 is like this one, but the result of it was ‘Changed "exact" matches in MySQL to use the database's native collation’ and that bug was about having _exact do the same comparison the database does internally for PK, but what I request here is, simply put, a way of using icontains for utf8_bin fields—while, as I said in the bug description, having _exact be CI or CS depending on configuration can be acceptable, _icontains must be CI as the documentation describes it. Don’t you find the definitions of exact, iexact, contains and icontains in the code excerpt above not completely correct?
comment:5 by , 16 years ago
I strongly vote for this. It solves the problem how to insert "COLLATE <collation>" into the query, invoked by normal filters (without writing SQL from scratch).
Why would one need to use COLLATE? E.g. in case, when he has a database containing strings with accented characters, and this database is in the different collation than utf8_general_ci - e.g. utf8_czech_ci. In this case, sql select e.g. "SELECT ... LIKE %cviceni% ...
" does not hit strings containing "cvičení", but it does, when "SELECT ... COLLATE utf8_general_ci LIKE %cviceni% ...
" . What is interesting, if ORDER BY
is used, results are ordered by the original database collation (czech in this case), what might be exactly what he needs.
comment:6 by , 14 years ago
Specifying COLLATE in comparisons and ORDER BY clause is a very common MySQL pattern when dealing with languages with accented characters. I'd love see this in Django.
comment:7 by , 14 years ago
Severity: | → Normal |
---|---|
Type: | → Bug |
comment:10 by , 11 years ago
Just bumped into this recently and also want to vote for this patch. How/When can we see it in django?
comment:11 by , 11 years ago
Just bumped into this recently as well. We have columns that are utf8_bin, and iexact operators on Django mysql ORM do _not_ perform case-insensitive lookups on these columns.
comment:12 by , 11 years ago
Same here: MySQL+utf8_bin. I spent an hour to find out that it is a bug in Django...
comment:13 by , 10 years ago
Same problem here. Is there some workaround? It would be great to have some warning in documentation.
comment:14 by , 10 years ago
Hi, to support CI search with utf8_bin collation, wrap firstly your queryset into the following ext() decorator and you're done in nearly every case:
# coding: UTF-8 from django.db.models.query import EmptyQuerySet try: from django.db.backends.mysql.base import DatabaseOperations as MySqlDatabaseOperations from django.db.models.query import QuerySet from django.db.models import sql from django.db.models.sql.where import WhereNode import copy class ExtMySqlDatabaseOperations(MySqlDatabaseOperations): def lookup_cast(self, lookup_type): if lookup_type in ('iexact', 'icontains', 'istartswith', 'iendswith'): return "LOWER(%s)" return super(ExtMySqlDatabaseOperations, self).lookup_cast(lookup_type) class ExtWhereNode(WhereNode): def make_atom(self, child, qn, connection): lvalue, lookup_type, value_annotation, params_or_value = child old_ops = None if type(connection.ops) in (MySqlDatabaseOperations, ExtMySqlDatabaseOperations): if lookup_type in ('iexact', 'icontains', 'istartswith', 'iendswith'): params_or_value = params_or_value.lower() old_ops = connection.ops connection.ops = ExtMySqlDatabaseOperations(connection) atom = WhereNode.make_atom(self, (lvalue, lookup_type, value_annotation, params_or_value), qn, connection) if old_ops: connection.ops = old_ops return atom @classmethod def deep_convert(cls, node): if type(node) is WhereNode: copy = ExtWhereNode(connector=node.connector, negated=node.negated) copy.children = map(cls.deep_convert, node.children) copy.subtree_parents = map(cls.deep_convert, node.subtree_parents) return copy return node class ExtQuerySet(QuerySet): def __init__(self, model=None, query=None, using=None): if not query: query = sql.Query(model, where = ExtWhereNode) elif type(query.where) is not ExtWhereNode: newquery = sql.Query(model, where = ExtWhereNode) newquery.where = ExtWhereNode.deep_convert(query.where) query = newquery super(ExtQuerySet, self).__init__(model = model, query = query, using = using) def ext(qs): if isinstance(qs, EmptyQuerySet): return qs else: return ExtQuerySet(model=qs.model, query = qs.query, using=qs._db) except: def ext(qs): return qs
comment:15 by , 9 years ago
Readers should note that COLLATE
is SQL-standard, though unfortunately the collation names are not. You will be able to use a solution that implements the COLLATE
clause in PostgreSQL too, e.g.
SELECT * FROM collate_demo WHERE test_col > 'A' COLLATE "C" ORDER BY test_col COLLATE "C";
to use POSIX byte-order based collation.
It'll be important to support the COLLATE
clause for index creation too. Otherwise users will get upset when their queries don't use the index that seems to be appropriate for the query.
Most of the time it's best to specify COLLATE
on a column in DDL, not in a query. That way queries using the column and indexes on the column automatically use the desired collation. You can "declare" a given column to be binary-sorted text where you don't care about national language collation.
CREATE TABLE some_table ( ...., test_field text COLLATE "C", ... );
I strongly recommend a solution using the COLLATE
clause, but keep in mind that it's not a global query modifier. It affects the operator or ORDER BY
clause it's attached to, not the whole query.
Users can work around this in the mean time by ALTER
ing their columns to give them a COLLATE
clause. e.g.
ALTER TABLE collate_demo ALTER COLUMN test_col TYPE text COLLATE "C";
Keep the existing type, of course. Note that this might re-create indexes, which could take a while and hold an exclusive lock.
There's probably a way to do this via Django's DDL support at CREATE TABLE
time or via a schema migration, but I'll leave that to people who know Django.
(I don't use Django, I contribute to PostgreSQL and I'm here because of some research I was doing for a Stack Overflow user who runs Django on Pg).
comment:16 by , 4 years ago
Cc: | added |
---|---|
Resolution: | → fixed |
Status: | new → closed |
Django now supports the Collate()
database function (see #21181, 63300f7e686c2c452763cb512df9abf7734fd588) and specifying a database collation for the field (see #31777, e387f191f76777015b6ea687ce83cdb05ee47cee). Both solution can be used to fix the original issue.
IMO there is no need for a new lookups or settings.
comment:17 by , 3 years ago
Resolution: | fixed |
---|---|
Status: | closed → new |
The addition of Collate()
is great, but it didn't solve the main problem of this ticket. Django should properly use the existing lookups independent of the database collation. I came to this ticket because I realized that is not possible to use ModelAdmin.search_fields
in the way is presented in the docs:
When somebody does a search in the admin search box, Django splits the search query into words and returns all objects that contain each of the words, case-insensitive...
I discover this after the need of use MySQL instead of PostgreSQL.
A patch implementing proposal #2.