Opened 14 months ago

Closed 13 months ago

Last modified 13 months ago

#34944 closed Bug (fixed)

Missing or misinferred attributes in output fields of generated fields

Reported by: Paolo Melchiorre Owned by: Mariusz Felisiak
Component: Database layer (models, ORM) Version: 5.0
Severity: Release blocker Keywords: field, database, generated, output_field
Cc: Lily Foote, Mariusz Felisiak Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Paolo Melchiorre)

Continuing with my experiments with the generated fields I found these two error situations which I reported together because they are very connected and could lead to a single solution.

I do not rule out that there are other examples of use of the generated fields that could lead to the same errors and if they come to mind, please add them in the comments.

Misinferred attributes

In the common example of a field generated as a concatenation of two CharField in a Model:

class Person(models.Model):
    first_name = models.CharField(max_length=255)
    last_name = models.CharField(max_length=255)
    full_name = models.GeneratedField(
        expression=Concat(
            "first_name", models.Value(" "), "last_name"
        ),
        db_persist=True,
    )

The SQL code for the generated column has an automatically inferred max length of only 255 characters (varchar(255)), while the field should be able to contain strings of 511 characters (varchar(511)):

CREATE TABLE "community_person" (
    "id" integer NOT NULL PRIMARY KEY AUTOINCREMENT,
    "first_name" varchar(255) NOT NULL,
    "last_name" varchar(255) NOT NULL,
    "full_name" varchar(255) GENERATED ALWAYS AS (
        COALESCE("first_name", '') ||
        COALESCE(COALESCE(' ', '') ||
        COALESCE("last_name", ''), '')
    ) STORED
);

Proposal

To solve the problem you could alternatively:

  1. make the maximum length extraction process smarter for the automatically created output fields
  2. mandatorily require specifying the output field when some of the fields involved could lead to situations like the ones above

Missing attributes

If in the previous example, I explicitly specify the output field without attributes, the migration is generated without generating errors:

class Person(models.Model):
    first_name = models.CharField(max_length=255)
    last_name = models.CharField(max_length=255)
    full_name = models.GeneratedField(
        expression=Concat(
            "first_name", models.Value(" "), "last_name"
        ),
        db_persist=True,
        output_field=models.CharField(),
    )

The SQL code contains an incorrect varchar(None) type:

CREATE TABLE "community_person" (
    "id" integer NOT NULL PRIMARY KEY AUTOINCREMENT,
    "first_name" varchar(255) NOT NULL,
    "last_name" varchar(255) NOT NULL,
    "full_name" varchar(None) GENERATED ALWAYS AS (
        COALESCE("first_name", '') ||
        COALESCE(COALESCE(' ', '') ||
        COALESCE("last_name", ''), '')
    ) STORED
);

which will generate an error when trying to apply the migration to the database:

sqlite3.OperationalError: near "None": syntax error

Proposal

To solve the problem you could alternatively:

  1. make the SQL code generation process smarter so as to generate working SQL code
  2. request to specify mandatory attributes for the output fields raising an error in the migration creation phase

Cumulative proposal

To solve both problems shown above, given the little time remaining before the release of the stable version of Django 5, the two most drastic solutions from both of the above proposals could be adopted.

Required output field

Always require specifying the output field (except when you are sure that the extracted type cannot generate error situations?)

Example of error message:

django.core.exceptions.FieldError: Expression doesn't contain an explicit type. You must set output_field.

Required attributes

Request to specify mandatory attributes for the output fields raising an error in the migration creation phase.

$ python3 -m manage makemigrations
SystemCheckError: System check identified some issues:

ERRORS:
community.Person.full_name: (fields.E120) CharFields must define a 'max_length' attribute.

Change History (38)

comment:1 by Paolo Melchiorre, 14 months ago

Description: modified (diff)

comment:2 by Natalia Bidart, 14 months ago

Cc: Lily Foote Mariusz Felisiak added
Triage Stage: UnreviewedAccepted

Thank you Paolo! I reproduced both scenarios, accepting.

comment:3 by Om Dahale, 14 months ago

Owner: changed from nobody to Om Dahale
Status: newassigned

Hello, I would like to work on this issue.
As pointed out by Paolo, I can -

  1. make the output_field required when it can generate errors
  2. and ask to specify the required attributes like max_length on migration creation phase.

in reply to:  3 ; comment:4 by Paolo Melchiorre, 14 months ago

Replying to Om Dahale:

Hello, I would like to work on this issue.

Since this issue is a release blocker I hope you have time to propose an MR ASAP :)

As pointed out by Paolo, I can -

  1. make the output_field required when it can generate errors
  2. and ask to specify the required attributes like max_length on migration creation phase.

Precisely, the two points I proposed as a solution are the ones that can be implemented faster than others.

For the point above:

  1. "when it can generate errors" it can be difficult in all the cases, maybe you can start asking every time for output_field
  2. "attributes like max_length" we need to add tests for all model field types

comment:5 by David Sanders, 14 months ago

Hm I think there are 2 tickets here… or at least 2 PRs. Perhaps forum discussion is required but I don't think we ought to go down the route of trying to make resolve_expression() smart enough to determine that concat'ing 2 varchars means we need to add the max-lengths together – there may be a can of worms here since expressions could be ever so complex? 🤔

We definitely need to fix the broken "None" in the DDL 👍

Always require specifying the output field (except when you are sure that the extracted type cannot generate error situations?)

Perhaps 👍 "Explicit is better than implicit" and all that, though I think that documenting "For complex expressions consider always declaring an output_field" is a nice option.

I'm keen to hear Mariusz & Lily's thoughts.

in reply to:  4 comment:6 by Om Dahale, 14 months ago

Replying to Paolo Melchiorre:

Since this issue is a release blocker I hope you have time to propose an MR ASAP :)

Okay

Precisely, the two points I proposed as a solution are the ones that can be implemented faster than others.

For the point above:

  1. "when it can generate errors" it can be difficult in all the cases, maybe you can start asking every time for output_field
  2. "attributes like max_length" we need to add tests for all model field types

Noted

in reply to:  5 ; comment:7 by Om Dahale, 14 months ago

Replying to David Sanders:

Hm I think there are 2 tickets here… or at least 2 PRs. Perhaps forum discussion is required but I don't think we ought to go down the route of trying to make resolve_expression() smart enough to determine that concat'ing 2 varchars means we need to add the max-lengths together – there may be a can of worms here since expressions could be ever so complex? 🤔

We definitely need to fix the broken "None" in the DDL 👍

Always require specifying the output field (except when you are sure that the extracted type cannot generate error situations?)

Perhaps 👍 "Explicit is better than implicit" and all that, though I think that documenting "For complex expressions consider always declaring an output_field" is a nice option.

I'm keen to hear Mariusz & Lily's thoughts.

So shall I continue my work on the 2 points given above?

in reply to:  5 comment:8 by Paolo Melchiorre, 14 months ago

Replying to David Sanders:

Perhaps forum discussion is required ...

I started a discussion in the forum:
https://forum.djangoproject.com/t/release-blocker-missing-or-misinferred-attributes-in-output-fields-of-generated-fields/25103

... I don't think we ought to go down the route of trying to make resolve_expression() smart enough to determine that concat'ing 2 varchars means we need to add the max-lengths together ... 🤔

In fact, in my proposal for the "Cumulative proposal", I excluded this hypothesis.

We definitely need to fix the broken "None" in the DDL 👍

Yes, I think it absolutely needs to be resolved.

Always require specifying the output field (except when you are sure that the extracted type cannot generate error situations?)

Perhaps 👍 "Explicit is better than implicit" and all that, though I think that documenting "For complex expressions consider always declaring an output_field" is a nice option.

It could be a good solution, but it should be highlighted that deriving the output_field from the expression could hide pitfalls even in apparently simple cases such as concatenating two CharField on SQLite. Not all Django developers look under the hood of the ORM or check the SQL code generated by migrations.

Perhaps it might be safer to always require the user to specify an output_field.

I'm keen to hear Mariusz & Lily's thoughts.

Me too, and also the people who collaborated on this feature like Simon, Lily, and Adam.

in reply to:  7 ; comment:9 by Paolo Melchiorre, 14 months ago

Replying to Om Dahale:

So shall I continue my work on the 2 points given above?

I think you should.

in reply to:  9 comment:10 by Om Dahale, 14 months ago

Replying to Paolo Melchiorre:

Replying to Om Dahale:

So shall I continue my work on the 2 points given above?

I think you should.

Okay

comment:11 by Lily Foote, 14 months ago

I'm curious if there are other expressions that could lead to an increased column size or if it's just Concat. Maybe we could add something to Concat (and other relevant expressions) that can express that column size change?

If that's not feasible I would lean towards detecting the case where the column size could be invalid and require output_field explicitly in those cases. I'd be sad if GeneratedField always required an explicit output_field.

in reply to:  11 comment:12 by Paolo Melchiorre, 14 months ago

I published an introduction to database generated columns, using SQLite and the new GeneratedField added in Django 5.0

In this article, I shared the work I have done in testing GenerateField using SQLite as a database backend, and I hope it can be useful to people working in this issue.

https://www.paulox.net/2023/11/07/database-generated-columns-part-1-django-and-sqlite/

in reply to:  11 comment:13 by Paolo Melchiorre, 14 months ago

Replying to Lily Foote:

I'm curious if there are other expressions that could lead to an increased column size or if it's just Concat.

Unfortunately, many other cases come to mind, for example a field generated as a product of two IntegerField which should be a BigIntegerField.

If that's not feasible I would lean towards detecting the case where the column size could be invalid and require output_field explicitly in those cases. I'd be sad if GeneratedField always required an explicit output_field.

I agree with you but from the various tests I have done there are more and more cases in which the automatically derived output field has problems.

comment:14 by Simon Charette, 14 months ago

Resolving the output_field of combined expressions is a complex task that the ORM only has limited support for #26355, #31506, #33397, this is not a problem specific to GenerateField.

It is particularly tricky in the case of varchar(x) + varchar(y) as some backends restrict the allowed length of varchar (e.g. MySQL at 255) and using a TextField might come with backend specific limitations.

Our stance so far has been to make a best effort guess and refuse the temptation to guess.

The SQL code for the generated column has an automatically inferred max length of only 255 characters (varchar(255))

I wish we were more rigid in this particular case when dealing with parameterized fields such as CharFied, ArrayField, DecimalField but it's non-trivial and often backward incompatible.

I suggest that we take the same approach here and expect the user to specify an explicit output_field.

For the second part of the problem, CharField(None), I think that the addition of a check is in order. Adding a GeneratedField._check_output_field that simply delegates to self.output_field.check(**kwargs) seems like the way to go to catch all misuse of output_field.

Version 2, edited 14 months ago by Simon Charette (previous) (next) (diff)

comment:15 by Om Dahale, 14 months ago

Just fyi I am getting the hang of it now and will be raising a PR soon 🙂

comment:16 by David Sanders, 14 months ago

Thanks Simon. I think we all seem to be in consensus about explicit output_field… Lily & I were chatting on Discord (Paolo I believe you're on there too) about how it seems unavoidable. 👍

@Om Dahale Thanks, I was going to suggest actually starting with an easier ticket as I assumed you were a new contributor? But if you feel like you're up to the task then good luck, just be aware that someone else may need to "take over" if things stall seeing as this is a release blocker ;)

comment:17 by Simon Charette, 14 months ago

FWIW I create a draft MR to demonstrate how Concat resolving could be enhanced but I still think that we should not included it in 5.0. Might be worth a follow up cleanup/optimization ticket though.

in reply to:  14 comment:18 by Paolo Melchiorre, 14 months ago

Replying to Simon Charette:

Resolving the output_field of combined expressions is a complex task that the ORM only has limited support for #26355, #31506, #33397, this is not a problem specific to GenerateField.

Thanks for pointing this out, it helped me put this issue in the right perspective.

It is particularly tricky in the case of varchar(x) + varchar(y) as some backends restrict the allowed length of varchar (e.g. MySQL at 255) ...

I would add that, to complicate things, the restrictions on the various types of fields also vary between different versions of the same database.

The SQL code for the generated column has an automatically inferred max length of only 255 characters (varchar(255))

... In this particular case it's tempting to adjust the logic of Concat._resolve_output_field to be smarter and generate a CharField(max_length=511) but I suspect it might have unintended effect and not something we'd like to include in 5.0 at this point.
I suggest that we take the same approach here and expect the user to specify an explicit output_field.

I totally agree, and thanks for rephrasing my initial proposal more clearly.

For the second part of the problem, CharField(None), I think that the addition of a check is in order. Adding a GeneratedField._check_output_field that simply delegates to self.output_field.check(**kwargs) seems like the way to go to catch all misuse of output_field.

Thanks for pointing out this approach.

in reply to:  16 comment:19 by Paolo Melchiorre, 14 months ago

Replying to David Sanders:

Thanks Simon. I think we all seem to be in consensus about explicit output_field… Lily & I were chatting on Discord (Paolo I believe you're on there too) about how it seems unavoidable. 👍

Given the different time zone, I have only now read Discord, but I repeat that, as I suspected in the description of the issue, it is inevitable at this point to request an explicit output_field.

... just be aware that someone else may need to "take over" if things stall seeing as this is a release blocker ;)

Thanks for making that clear.

in reply to:  17 comment:20 by Paolo Melchiorre, 14 months ago

Replying to Simon Charette:

FWIW I create a draft MR to demonstrate how Concat resolving could be enhanced but I still think that we should not included it in 5.0. Might be worth a follow up cleanup/optimization ticket though.

To avoid mixing too many things here, I have already opened Issue #34954

comment:21 by Mariusz Felisiak <felisiak.mariusz@…>, 14 months ago

In 8b1acc04:

Refs #30446, Refs #34944 -- Fixed crash when adding GeneratedField with string Value().

This should allow smarter output_field inferring in functions dealing
with text expressions.

Regression in f333e3513e8bdf5ffeb6eeb63021c230082e6f95.

comment:22 by Mariusz Felisiak <felisiak.mariusz@…>, 14 months ago

In 73869a51:

[5.0.x] Refs #30446, Refs #34944 -- Fixed crash when adding GeneratedField with string Value().

This should allow smarter output_field inferring in functions dealing
with text expressions.

Regression in f333e3513e8bdf5ffeb6eeb63021c230082e6f95.

Backport of 8b1acc0440418ac8f45ba48e2dfcf5126c83341b from main

in reply to:  14 ; comment:23 by Om Dahale, 14 months ago

Replying to Simon Charette:

For the second part of the problem, CharField(None), I think that the addition of a check is in order. Adding a GeneratedField._check_output_field that simply delegates to self.output_field.check(**kwargs) seems like the way to go to catch all misuse of output_field.

Hello, Simon!
I raised a PR can you please have a look and let me know if I am going in the right direction?

in reply to:  16 comment:24 by Om Dahale, 14 months ago

Replying to David Sanders:

@Om Dahale Thanks, I was going to suggest actually starting with an easier ticket as I assumed you were a new contributor? But if you feel like you're up to the task then good luck, just be aware that someone else may need to "take over" if things stall seeing as this is a release blocker ;)

Yes, understood.

comment:26 by Om Dahale, 14 months ago

Reposting from PR..

I did this and now receiving the message to specify max_length

def check(self, **kwargs):
      databases = kwargs.get("databases") or []
      return [
          *super().check(**kwargs),
          *self._check_output_field(**kwargs),
          *self._check_supported(databases),
          *self._check_persistence(databases),
      ]

def _check_output_field(self, **kwargs):
        # Ignore field name checks as the output_field is not a field.
        self.output_field.name = self.name + "_output_field"
        self.output_field.model = self.model
        return self.output_field.check(**kwargs)

is this correct?

comment:27 by Om Dahale, 14 months ago

I am available full day today so let's wrap this up quickly

in reply to:  23 comment:28 by Om Dahale, 13 months ago

Hello, Paolo Simon David
Can you help me out, just let me know if it is correct and if not then please point out the mistakes?

comment:29 by Mariusz Felisiak, 13 months ago

Owner: changed from Om Dahale to Mariusz Felisiak

comment:30 by Natalia Bidart, 13 months ago

Patch needs improvement: set

There is a minor typo but otherwise PR looks good.

comment:31 by Mariusz Felisiak <felisiak.mariusz@…>, 13 months ago

In de4884b1:

Reverted "Refs #30446, Refs #34944 -- Fixed crash when adding GeneratedField with string Value()."

This reverts commit 8b1acc0440418ac8f45ba48e2dfcf5126c83341b.

comment:32 by Mariusz Felisiak <felisiak.mariusz@…>, 13 months ago

In 5b1d0a6:

[5.0.x] Reverted "Refs #30446, Refs #34944 -- Fixed crash when adding GeneratedField with string Value()."

This reverts commit 8b1acc0440418ac8f45ba48e2dfcf5126c83341b.

Backport of de4884b114534f43c49cf8c5b7f10181e737f4e9 from main

comment:33 by Mariusz Felisiak, 13 months ago

Patch needs improvement: unset

comment:34 by Natalia Bidart, 13 months ago

Triage Stage: AcceptedReady for checkin

comment:35 by Mariusz Felisiak <felisiak.mariusz@…>, 13 months ago

Resolution: fixed
Status: assignedclosed

In 5875f03c:

Fixed #34944 -- Made GeneratedField.output_field required.

Regression in f333e3513e8bdf5ffeb6eeb63021c230082e6f95.

comment:36 by Mariusz Felisiak <felisiak.mariusz@…>, 13 months ago

In c705625:

Refs #34944 -- Propagated system checks for GeneratedField.output_field.

comment:37 by Mariusz Felisiak <felisiak.mariusz@…>, 13 months ago

In ddbe5c8:

[5.0.x] Fixed #34944 -- Made GeneratedField.output_field required.

Regression in f333e3513e8bdf5ffeb6eeb63021c230082e6f95.

Backport of 5875f03ce61b85dfd9ad34f7b871c231c358d432 from main

comment:38 by Mariusz Felisiak <felisiak.mariusz@…>, 13 months ago

In fcc55f8c:

[5.0.x] Refs #34944 -- Propagated system checks for GeneratedField.output_field.

Backport of c705625ebff0141ed2b95dd3c8174bda8270a47f from main

Note: See TracTickets for help on using tickets.
Back to Top