Opened 3 years ago

Last modified 7 months ago

#33647 assigned Bug

bulk_update silently truncating values for size limited fields

Reported by: jerch Owned by: Akash Kumar Sen
Component: Database layer (models, ORM) Version: 4.0
Severity: Normal Keywords:
Cc: Simon Charette Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

On postgres backend, bulk_update passes overlong values for size limited fields along without any notification/exception, instead truncating the value.

Repro:

Code highlighting:

# some model to repro
class TestModel(models.Model):
    name = models.CharField(max_length=32)

# in the shell
>>> from bulk_test.models import TestModel
>>> tm=TestModel(name='hello')
>>> tm.save()
>>> tm.name
'hello'
>>> tm.name='m'*100
>>> tm.save()  # good, raises:
...
django.db.utils.DataError: value too long for type character varying(32)

>>> TestModel.objects.all().values('name')
<QuerySet [{'name': 'hello'}]>
>>> TestModel.objects.all().update(name='z'*100)  # good, raises as well:
...
django.db.utils.DataError: value too long for type character varying(32)

>>> TestModel.objects.all().values('name')
<QuerySet [{'name': 'hello'}]>
>>> TestModel.objects.bulk_update([tm], ['name'])  # not raising, instead truncating:
1
>>> TestModel.objects.all().values('name')
<QuerySet [{'name': 'mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm'}]>

Not sure, if this is intended/expected behavior, well it is inconsistent to .save or .update, which both raise here. I only tested postgres backend for this, it may apply to other size limiting databases as well (sqlite itself is not affected, as it does not limit values).

If this is intended, it may be a good idea to at least document the slightly different behavior, so users are aware of it, and can prepare their code to avoid silent truncation with follow-up errors. A better way prolly would fix bulk_update to spot value overflows and raise, but I am not sure, if thats feasible.

Change History (26)

comment:1 by Simon Charette, 3 years ago

Triage Stage: UnreviewedAccepted
Type: UncategorizedBug

I manage to reproduce, this is due to requires_casted_case_in_updates=True on the Postgres backend does a silent ::varchar(2) cast on the CASE statement.

diff --git a/tests/queries/test_bulk_update.py b/tests/queries/test_bulk_update.py
index bc252c21c6..f7244aab72 100644
--- a/tests/queries/test_bulk_update.py
+++ b/tests/queries/test_bulk_update.py
@@ -3,7 +3,7 @@
 from django.core.exceptions import FieldDoesNotExist
 from django.db.models import F
 from django.db.models.functions import Lower
-from django.db.utils import IntegrityError
+from django.db.utils import DataError, IntegrityError
 from django.test import TestCase, override_settings, skipUnlessDBFeature

 from .models import (
@@ -259,6 +259,14 @@ def test_ipaddressfield(self):
                     CustomDbColumn.objects.filter(ip_address=ip), models
                 )

+    def test_charfield_constraint(self):
+        article = Article.objects.create(
+            name="a" * 20, created=datetime.datetime.today()
+        )
+        article.name = "b" * 50
+        with self.assertRaises(DataError):
+            Article.objects.bulk_update([article], ["name"])
+
     def test_datetime_field(self):
         articles = [
             Article.objects.create(name=str(i), created=datetime.datetime.today())

We'll need to find an elegant way to cast to varchar instead of varchar(N)

in reply to:  1 comment:2 by jerch, 3 years ago

Replying to Simon Charette:

... this is due to requires_casted_case_in_updates=True on the Postgres backend does a silent ::varchar(2) cast on the CASE statement.

Then only postgres is affected here? (from the code it seems that other backends dont set this flag...)

For postgres the next question would be, if other data types with contraints are affected as well (basically any type, that allows narrowing by type(???) notation), or if this is a varchar only edge case. From https://www.postgresql.org/docs/current/datatype.html possible candidates for type != type(???) behavior are:

  • bit
  • bit varying
  • character
  • character varying
  • interval
  • numeric
  • time
  • timestamp

Imho django uses most of these for some field type (beside bit/bit varying?)

In general the broader "super" type with no constraints can be derived in postgres like this:

Code highlighting:

postgres=# select 'varchar(5)'::regtype;
      regtype      
-------------------
 character varying
(1 row)

Maybe it is enough to apply the super type to the cast in that line https://github.com/django/django/blob/a1e4e86f923dc8387b0a9c3025bdd5d096a6ebb8/django/db/models/query.py#L765?

Version 0, edited 3 years ago by jerch (next)

comment:3 by Simon Charette, 3 years ago

Cc: Simon Charette added

Then only postgres is affected here?

yep, if you set this feature flag to False on the Postgres backend and run the queries.test_bulk_update you'll be able to see why it was added in the first place.

In general the broader "super" type with no constraints can be derived in postgres like this:
...
Maybe it is enough to apply the super type to the cast in that line

That could be one approach yes, we'd likely need to adapt CAST to allow for such usage though. Not sure of what the argument should be named though, maybe generic which defaults to False? Not sure what generic=True would mean in the case of Cast(expr, ArrayField(CharField(max_length=20), generic=True) would it be ::varchar[] or ::array.

comment:4 by Simon Charette, 3 years ago

An alternative approach might to commit to dropping the whole CASE/WHEN wrapping altogether on backends that support it as described in #29771.

We know the underlying expression construction approach performs poorly #31202 and the following form doesn't suffer for the type inference issue we're experiencing here

UPDATE test_model SET name = v.name
FROM (VALUES
   (1, aaaaaaaaaa),
   (2, bbbbbbbbbb)
) AS v(id, name)
WHERE test_model.id = v.id

in reply to:  4 comment:5 by jerch, 3 years ago

Replying to Simon Charette:

An alternative approach might to commit to dropping the whole CASE/WHEN wrapping altogether on backends that support it as described in #29771.

Hmm, yes perfwise I totally agree - the UPDATE FROM VALUES variants are much better for pumping tons of individual values. I already tried to address that pattern in https://github.com/netzkolchose/django-fast-update, with string formatting atm. The runtime numbers there speak for themselves. But for a serious integration in the ORM there are some obstacles to overcome:

  • f-expressions wont work anymore (at least not without big workarounds)
  • profound ORM integration needs serious rework on the update sql compiler (or even a separate one just for the update + values pattern)
  • depends on recent db engines (+distinction of mysql8 vs. mariadb)

Regarding f-expressions - idk if thats a biggie: bulk_update always occured to me as an interface to pump individual values rather than doing column trickery with it, so I would not mind, if that interface does not support f-expressions anymore. But thats just me, if people insist on using f-expressions here as well, this would need a workaround of unknown complexity.

Imho the second point can be done, it just needs someone with time and enough dedication (well I can try that, but would need serious help, as I lack deeper knowledge of ORM internals).

The last point is more tricky - how to deal with older or incompatible database engines here? Create a fallback (current implementation)? Or just confront users with "nope, not supported here, get a newer/compatible db engine"? It also raises the question, where to park the actual implementation - while the ORM itself could blueprint the UPDATE FROM VALUES pattern in ORM style, the backends would have to translate it into their very own style, or even substyles for mysql (mysql8 != mariadb here).

I guess that such a ground-shaking change to django would need some sort of consensus first?

Last edited 3 years ago by jerch (previous) (diff)

in reply to:  4 comment:6 by jerch, 3 years ago

Replying to Simon Charette:

An alternative approach might to commit to dropping the whole CASE/WHEN wrapping altogether on backends that support it as described in #29771.

We know the underlying expression construction approach performs poorly #31202 and the following form doesn't suffer for the type inference issue we're experiencing here

UPDATE test_model SET name = v.name
FROM (VALUES
   (1, aaaaaaaaaa),
   (2, bbbbbbbbbb)
) AS v(id, name)
WHERE test_model.id = v.id

Can you give me a pointer, how to get a discussion about that rolling? Or who to contact? I already wrote about my implementation of that idea 3 months ago, no response from anyone. Then I put everything into a neat package with extensive tests, no response (beside some 3rd person giving really helpful feedback). I even wrote a mailing list request about it yday to discuss some details - no response either (though its still pretty fresh). I really dont know what else to do. Could it be that no one is actually interested in a revamped bulk_update implementation in django? Or is django development known to have a very slow pace / being in maintenance mode mostly? I dont want to blame anyone - could all be me pushing the wrong buttons, but I've never faced it to that degree in any other OSS project.

comment:7 by Carlton Gibson, 3 years ago

Link to the mailing list thread

Hi Jörg — thanks for the input here. Sorry you're feeling frustrated.

Could it be that no one is actually interested in a revamped bulk_update implementation in django? Or is django development known to have a very slow pace / being in maintenance mode mostly?

So there's three points there:

  • I suspect it's not lots of people who are directly vested, but there are a number of regular contributors to the ORM (Simon included) and I'd imagine this is a topic of interest, but, as you've already pointed out in your mailing list post, there are several tradeoffs to consider, and it'll need some thought. Folks have limit bandwidth: that doesn't entail no interest. I hope that's clear.
  • Django does have a slow pace. That's OK. After 16+ years, that's proven to be one of its strengths. It's a big project, with a lot of surface area, and (again) folks have limited bandwidth. It's one reason why third-party packages, such as the one you've done, are a good way to go, as they allow a faster pace, and a sandbox to work on issues.
  • Despite the slow pace, Django is in anything but maintenance mode: you need only look at the release notes over the last few major releases to see that new features are constantly being worked on and delivered. If you zoom-out from any particular issue, I contest, the development pace is actually quite rapid for a project of Django's size and maturity (despite being "slow" on the surface.)

We're currently heads-down working towards the feature freeze for Django 4.1 — there is no chance (I'd think) of this getting addressed for that. That leaves a realistic opportunity to discuss it for Django 4.2, and if you're keen, and the technical questions can be resolved, there's no reason it couldn't get in for that. If we miss that, then the next one... — Again zooming out, it soon fades that it took x-cycles to get any particular feature work completed.

Looking at the timestamps on the discussion here, not much time has passed between comments. I'd suggest a little patience, and working on the third-party implementation to resolve any outstanding issues in that time. If it's ready™ following up on the mailing list thread may be appropriate to let folks know they can give it a try.

I hope that all makes sense, and helps anchor expectations. There's a nice comic here which I always try to keep in mind.

Kind Regards,

Carlton

Last edited 3 years ago by Carlton Gibson (previous) (diff)

comment:8 by jerch, 3 years ago

Hey Carlton,

thanks for the headsup. I didnt meant to sound like a drama queen, sorry if it came that way. Also I am not eager to push my ideas through at any price, since I could be totally on the wrong track, simply for reasons I've overlooked. So it is more about getting feedback at all, whether things go into the right direction or not.

What I've learned from >20ys OSS contributions - giving ppl early feedback helps to keep them engaged, and lowers the risk of time consuming dead end implementations (time on both ends, the implementer and the reviewers/maintainers), esp. when the bigger picture needs to be addressed (API changes, bigger codebase changes involved at several places). I know that django is a very big codebase with lots of legacy, which imho makes it even harder for someone from outside to get involved. This for sure is a balancing act to maintain. Ofc slow pace is not a bad thing - in fact I like django for not buying every shiny new idea in town, as it gives very solid development experience (using django myself since version 0.8).
At this point I wonder if the separated issue tracker in trac vs. repo in github might be part of a communication/transfer issue? (At least for me as maintainer of several projects github/gitlab made conceptual discussions and overall communications alot easier than back in SVN/mailing list times...)

I hope I did not derail this issue too much. :)

comment:9 by Carlton Gibson, 3 years ago

Hi Jörg.

...giving ppl early feedback helps to keep them engaged, and lowers the risk of time consuming dead end implementations (time on both ends, the implementer and the reviewers/maintainers), esp. when the bigger picture needs to be addressed (API changes, bigger codebase changes involved at several places)

Sure. I'm not sure the 11 days since you opened the ticket is that much for folks to come to a view. If your workaround is preforming well for you, that's good input. Otherwise you may need a little patience, though I see some input on the mailing list... 🙂

...if the separated issue tracker in trac...

That's out of scope for this one I'm afraid 😅. (There are mailing list discussions about it, but it's not trivial, not least because of the history in Trac... — As I say out of scope for here... 😬)

Thanks.

in reply to:  9 comment:10 by jerch, 3 years ago

Replying to Carlton Gibson:

... I'm not sure the 11 days since you opened the ticket is that much for folks to come to a view. If your workaround is preforming well for you, that's good input. Otherwise you may need a little patience, though I see some input on the mailing list... 🙂

Well this ticket here is only loosely related and much younger (found the issue while doing some bulk_update tests). Ah whatever, will just wait on the mailing list input...

comment:11 by Dan Hamilton, 16 months ago

Owner: changed from nobody to Dan Hamilton
Status: newassigned

comment:12 by Dan Hamilton, 16 months ago

Owner: Dan Hamilton removed
Status: assignednew

comment:13 by Priyank Panchal, 15 months ago

Owner: set to Priyank Panchal
Status: newassigned

comment:14 by David Sanders, 14 months ago

Hi Priyank,

Just checking are you still interested in working on this?

comment:15 by Priyank Panchal, 14 months ago

Owner: Priyank Panchal removed
Status: assignednew

comment:16 by Priyank Panchal, 14 months ago

Hello, I have attempted to address this issue ,and I've identified the problem. It appears that the problem arises whenever the CAST() function is executed. it seems that changing these parameters to False is not an option. Additionally, the issue with SQLite occurs when all characters are stored in the database.and I'm still interested working on this tickets. What would be the best approach to resolve this problem?

Last edited 14 months ago by Priyank Panchal (previous) (diff)

comment:17 by Akash Kumar Sen, 13 months ago

Has patch: set
Owner: set to Akash Kumar Sen
Status: newassigned

Sorry I missed your comment and accidentally created a patch, Let's connect if you need help in some other ORM ticket.

Patch - https://github.com/django/django/pull/17363

in reply to:  17 comment:18 by Mariusz Felisiak, 13 months ago

Has patch: unset

Replying to Akash Kumar Sen:

Sorry I missed your comment and accidentally created a patch, Let's connect if you need help in some other ORM ticket.

Patch - https://github.com/django/django/pull/17363

We cannot regress Cast() to avoid truncating values in bulk_update().

comment:19 by Akash Kumar Sen, 13 months ago

We cannot regress Cast() to avoid truncating values in bulk_update() .

I don't think we are regressing the Cast() here the query generated is

UPDATE "queries_article" SET "name" = (CASE WHEN ("queries_article"."id" = 1) THEN bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb WHEN ("queries_article"."id" = 2) THEN bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb ELSE NULL END)::varchar WHERE "queries_article"."id" IN (1, 2)

it is just generating the super type i.e. varchar instead of varchar(20) in case of a CharField

If you can explain a little further that would be great Mariusz

Last edited 13 months ago by Akash Kumar Sen (previous) (diff)

in reply to:  19 comment:20 by Mariusz Felisiak, 13 months ago

If you can explain a little further that would be great Mariusz

Explained in PR.

comment:21 by Akash Kumar Sen, 13 months ago

Has patch: set

comment:22 by Mariusz Felisiak, 13 months ago

Patch needs improvement: set

comment:23 by Akash Kumar Sen, 13 months ago

I have checked your GitHub comment. Any suggestions you have in mind?
Initially I went for updating the compiler itself, but that seems to be a much more tedious task. I have one more hacky Idea like this which is as follows:

  • Introduce a new database function named CastSuperType that will always cast the super type for every possible arguments.
  • Like varchar for varchar(20) and similar equivalents for all the other fields that supports casting.
Last edited 13 months ago by Akash Kumar Sen (previous) (diff)

in reply to:  23 comment:24 by Mariusz Felisiak, 13 months ago

Has patch: unset
Patch needs improvement: unset

Replying to Akash Kumar Sen:

I have checked your GitHub comment. Any suggestions you have in mind?
Initially I went for updating the compiler itself, but that seems to be a much more tedious task. I have one more hacky Idea like this which is as follows:

  • Introduce a new database function named CastSuperType that will always cast the super type for every possible arguments.
  • Like varchar for varchar(20) and similar equivalents for all the other fields that supports casting.

This is a tricky issue to solve, and we cannot move forward with a stub solution just for this reason. I suspect that we will need to revisit bulk_update() to make it work properly, but I don't have any specific advice.

comment:25 by Akash Kumar Sen, 13 months ago

Following this approach mentioned by Simon in comment 3 would be reasonable I think.

That could be one approach yes, we'd likely need to adapt CAST to allow for such usage though. Not sure of what the argument should be named though, maybe generic which
defaults to False? Not sure what generic=True would mean in the case of Cast(expr, ArrayField(CharField(max_length=20), generic=True) would it be ::varchar[] or ::array.

comment:26 by Craig de Stigter, 7 months ago

related: #35362

Note: See TracTickets for help on using tickets.
Back to Top