Opened 7 years ago

Closed 3 years ago

#28949 closed Bug (wontfix)

Multibyte table name or column name causes miscalculation of the length of index name.

Reported by: Pak Youngrok
Owned by: Jacob Walls
Component: Migrations
Version: 2.0
Severity: Normal
Keywords: migration multibyte index
Cc:
Triage Stage: Unreviewed
Has patch: yes
Needs documentation: no
Needs tests: no
Patch needs improvement: yes
Easy pickings: no
UI/UX: no

Description

Django migrations automatically create indexes whose names consist of the table name, the column names, a hash, and a suffix. When the generated index name is longer than self.connection.ops.max_name_length(), Django shortens it. However, it measures the length of the Python string (in characters), which does not match the length the database sees. The length should be calculated after encoding the name with the database encoding. Because of this issue, migration fails under the following conditions:

  • long multibyte model names
  • two multibyte models related by a foreign key
  • the foreign key field is a CharField (or a subclass of it)

Under these conditions, the migration tries to create two indexes (a normal index and a LIKE index) whose names are identical except for the suffix (the latter ends in _like). Measured as strings, both index names are shorter than the maximum name length, but measured as bytes both are longer, so a name conflict is raised.
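The mismatch can be reproduced in plain Python. This is a minimal sketch, not Django's code: the table name, column name, and hash value are made up for illustration, the 63-byte limit is an assumption (it is PostgreSQL's default identifier limit), and truncate_to_bytes is a hypothetical helper that imitates a database clipping an identifier at a byte limit without splitting a character.

```python
# Hypothetical identifiers; only the character/byte arithmetic matters here.
MAX_LENGTH = 63  # assumed limit, counted in bytes by the database

table_name = "주문" * 10   # 20 characters, 3 bytes each in UTF-8
column_name = "고객_id"

index_name = "%s_%s_%s" % (table_name, column_name, "1a2b3c4d")
like_name = index_name + "_like"


def truncate_to_bytes(name, limit):
    """Clip name to at most `limit` UTF-8 bytes, dropping any partial character."""
    return name.encode("utf-8")[:limit].decode("utf-8", errors="ignore")


for name in (index_name, like_name):
    # The character count passes the check, the byte count does not.
    print(len(name), len(name.encode("utf-8")))

# Both names collapse to the same identifier once clipped to the byte limit.
print(truncate_to_bytes(index_name, MAX_LENGTH) == truncate_to_bytes(like_name, MAX_LENGTH))
```

Because the two names differ only after the byte limit, a database that clips identifiers by bytes would see them as the same name.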

long multibyte table name and foreign key name.

Here is the code:
https://github.com/django/django/blob/4420761ea9457d386b2000cf9df5b2f6f88f8f91/django/db/backends/base/schema.py#L873

        index_name = '%s_%s_%s' % (table_name, '_'.join(column_names), hash_suffix_part)
        if len(index_name) <= max_length:
            return index_name

Django assumes that all databases use UTF-8 encoding, so the code should be fixed like this:

        index_name = '%s_%s_%s' % (table_name, '_'.join(column_names), hash_suffix_part)
        if len(index_name.encode('utf8')) <= max_length:
            return index_name

The code that shortens the name should also be fixed. Taking a third of each part and re-joining them is not a good strategy in a multibyte world; it can also cause miscalculation. I think taking a very small number of characters from the table and column names, such as 2 or 3, and joining them with the original hash would be a safe solution.
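The shortening idea suggested here could look roughly like the sketch below. It is not the patch that was submitted for this ticket: make_index_name, byte_len, and the keep parameter are hypothetical names, and the sketch assumes that the database limit is counted in UTF-8 bytes, which may not hold for every backend.

```python
def byte_len(text):
    """Length of text in UTF-8 bytes (assumes the database uses UTF-8)."""
    return len(text.encode("utf-8"))


def make_index_name(table_name, column_names, hash_suffix_part, max_length, keep=3):
    """Return the full name if it fits in max_length bytes, else a short form."""
    full = "%s_%s_%s" % (table_name, "_".join(column_names), hash_suffix_part)
    if byte_len(full) <= max_length:
        return full
    # Keep only the first few characters of each part and rely on the hash
    # for uniqueness. Slicing a str never splits a multibyte character.
    short_parts = [table_name[:keep]] + [column[:keep] for column in column_names]
    short = "%s_%s" % ("_".join(short_parts), hash_suffix_part)
    if byte_len(short) > max_length:
        raise ValueError("index name still exceeds %d bytes" % max_length)
    return short


print(make_index_name("orders", ["customer_id"], "1a2b3c4d", 63))
print(make_index_name("주문" * 10, ["고객_id"], "1a2b3c4d_like", 63))
```

With keep=3 each part contributes at most 12 bytes in UTF-8, so the result stays small unless the index covers many columns; that case is why the sketch raises an error rather than returning an over-long name.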

Change History (10)

comment:1 by Tim Graham, 7 years ago

Triage Stage: Unreviewed → Accepted

comment:2 by Abhishek Gautam, 7 years ago

Owner: changed from nobody to Abhishek Gautam
Status: new → assigned

comment:3 by Abhishek Gautam, 7 years ago

As we just need a unique name for an index, can we create index_name as:

    index_name = '%s%s' % (self._digest(*([table_name] + column_names)), suffix)

_digest function will be:

    @classmethod
    def _digest(cls, *args):
        """
        Generate a 128-bit MD5 digest (32 hexadecimal characters) of a set of
        arguments that can be used to shorten identifying names.
        """
        h = hashlib.md5()
        for arg in args:
            h.update(force_bytes(arg))
        return h.hexdigest()

Using the _digest method we get a 32-character hexadecimal string, to which we append the suffix, so the length of index_name is 32 + the length of the suffix.
Since the suffix is very short, the length of index_name will not exceed 40.
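A standalone version of this proposal might look like the sketch below. It is an illustration under assumptions, not Django's implementation: digest and digest_index_name are hypothetical module-level functions, and str(arg).encode("utf-8") stands in for Django's force_bytes so the example runs without Django installed.

```python
import hashlib


def digest(*args):
    """Return the MD5 hex digest (32 ASCII characters) of the given arguments."""
    h = hashlib.md5()
    for arg in args:
        # Stand-in for force_bytes(arg); assumes UTF-8.
        h.update(str(arg).encode("utf-8"))
    return h.hexdigest()


def digest_index_name(table_name, column_names, suffix=""):
    """Build an index name from the digest and suffix only."""
    return "%s%s" % (digest(table_name, *column_names), suffix)


name = digest_index_name("주문" * 10, ["고객_id"], "_like")
print(name, len(name), len(name.encode("utf-8")))
```

The digest is pure ASCII, so the character count and the byte count of the name are equal regardless of the table and column names. One caveat of feeding the arguments to the hash without a separator is that different splits of the same text, such as ("ab", "c") and ("a", "bc"), produce the same digest.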


comment:4 by Abhishek Gautam, 7 years ago

Owner: Abhishek Gautam removed
Status: assigned → new

comment:5 by Jacob Walls, 3 years ago

Has patch: set
Owner: set to Jacob Walls
Status: new → assigned

comment:6 by Mariusz Felisiak, 3 years ago

Needs tests: set
Patch needs improvement: set

comment:7 by Jacob Walls, 3 years ago

Needs tests: unset
Patch needs improvement: unset

comment:8 by Mariusz Felisiak, 3 years ago

Triage Stage: Accepted → Ready for checkin

comment:9 by Mariusz Felisiak, 3 years ago

Patch needs improvement: set
Triage Stage: Ready for checkin → Accepted

comment:10 by Mariusz Felisiak, 3 years ago

Resolution: wontfix
Status: assigned → closed
Triage Stage: Accepted → Unreviewed

Closing per discussion. We cannot use encode() because identifier limits are expressed in characters, not bytes, and a character can take 2, 3, or 4 bytes. The result may also depend on the encoding of the operating system or database, so it's not feasible to prepare a fully backward compatible solution. I'd say that if you decided to use non-ASCII chars in identifiers, you actually did this to yourself. Any solution would be error-prone.
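The point about variable byte widths and encoding dependence can be seen with a few sample characters. This is only an illustration: the sample characters and the choice of UTF-16 as a second encoding are assumptions made for the example, not something discussed in the ticket.

```python
# One character each: ASCII, Latin-1 supplement, Hangul, and an emoji.
samples = ["a", "\u00e9", "\ud55c", "\U0001F600"]

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
utf16_lengths = [len(ch.encode("utf-16-le")) for ch in samples]

for ch, n8, n16 in zip(samples, utf8_lengths, utf16_lengths):
    # Each sample is a single Python character, yet its byte size varies.
    print(repr(ch), len(ch), n8, n16)
```

Every sample has a character length of 1, while the byte length ranges from 1 to 4 in UTF-8 and is 2 or 4 in UTF-16, so a single byte-based check cannot match every database configuration.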
