Opened 3 years ago
Last modified 7 months ago
#33647 assigned Bug
bulk_update silently truncating values for size limited fields
Reported by: | jerch | Owned by: | Akash Kumar Sen |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 4.0 |
Severity: | Normal | Keywords: | |
Cc: | Simon Charette | Triage Stage: | Accepted |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
On the postgres backend, bulk_update passes overlong values for size-limited fields along without any notification/exception and instead silently truncates them.
Repro:
Code highlighting:
# some model to repro
class TestModel(models.Model):
    name = models.CharField(max_length=32)

# in the shell
>>> from bulk_test.models import TestModel
>>> tm = TestModel(name='hello')
>>> tm.save()
>>> tm.name
'hello'
>>> tm.name = 'm' * 100
>>> tm.save()   # good, raises:
...
django.db.utils.DataError: value too long for type character varying(32)
>>> TestModel.objects.all().values('name')
<QuerySet [{'name': 'hello'}]>
>>> TestModel.objects.all().update(name='z' * 100)   # good, raises as well:
...
django.db.utils.DataError: value too long for type character varying(32)
>>> TestModel.objects.all().values('name')
<QuerySet [{'name': 'hello'}]>
>>> TestModel.objects.bulk_update([tm], ['name'])   # not raising, instead truncating:
1
>>> TestModel.objects.all().values('name')
<QuerySet [{'name': 'mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm'}]>
Not sure if this is intended/expected behavior; it is inconsistent with .save or .update, which both raise here. I only tested the postgres backend, it may apply to other size-limiting databases as well (sqlite itself is not affected, as it does not limit values).
If this is intended, it may be a good idea to at least document the slightly different behavior, so users are aware of it and can prepare their code to avoid silent truncation with follow-up errors. A better way would probably be to fix bulk_update to spot value overflows and raise, but I am not sure if that is feasible.
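In the meantime a minimal caller-side guard can avoid the silent truncation (my own sketch, not part of Django's API; it runs full model validation, which may be more than strictly needed):

def safe_bulk_update(queryset, objs, fields, **kwargs):
    # sketch only: run model validation first so overlong CharField values
    # raise a ValidationError instead of being truncated on postgres;
    # note that full_clean() also runs the other field/model validations
    for obj in objs:
        obj.full_clean(validate_unique=False)
    return queryset.bulk_update(objs, fields, **kwargs)

# e.g.: safe_bulk_update(TestModel.objects, [tm], ['name'])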
Change History (26)
follow-up: 2 comment:1 by , 3 years ago
Triage Stage: | Unreviewed → Accepted |
---|---|
Type: | Uncategorized → Bug |
comment:2 by , 3 years ago
Replying to Simon Charette:
... this is due to requires_casted_case_in_updates=True on the Postgres backend does a silent ::varchar(2) cast on the CASE statement.
Then only postgres is affected here? (From the code it seems that other backends don't set this flag...)
For postgres the next question would be whether other data types with constraints are affected as well (basically any type that allows narrowing by type(???) notation), or whether this is a varchar-only edge case. From https://www.postgresql.org/docs/current/datatype.html possible candidates for type != type(???) behavior are:
- bit
- bit varying
- character
- character varying
- interval
- numeric
- time
- timestamp
Imho django uses most of these for some field type (besides bit/bit varying?).
In general the broader "super" type with no constraints can be derived in postgres like this:
Code highlighting:
postgres=# select 'varchar(5)'::regtype;
      regtype
-------------------
 character varying
(1 row)
Maybe it is enough to apply the super type to the cast in that line https://github.com/django/django/blob/a1e4e86f923dc8387b0a9c3025bdd5d096a6ebb8/django/db/models/query.py#L765?
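For context, the code around that line builds roughly the following per field (paraphrased from django/db/models/query.py, not verbatim; Case/When/Value/Cast are the usual django.db.models expressions, batch_objs and field come from the surrounding method):

when_statements = []
for obj in batch_objs:
    attr = getattr(obj, field.attname)
    if not hasattr(attr, "resolve_expression"):
        attr = Value(attr, output_field=field)
    when_statements.append(When(pk=obj.pk, then=attr))
case_statement = Case(*when_statements, output_field=field)
if connections[self.db].features.requires_casted_case_in_updates:
    # the linked line: on postgres this renders as ::varchar(N),
    # and that explicit cast is what silently truncates overlong values
    case_statement = Cast(case_statement, output_field=field)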
Edit:
Just tested it - affected by silent truncation are bit, bit varying, character and character varying. All others specify precision, which is either not explicitly set by django, or would raise anyway for non-suitable values (that's the case for numeric).
Edit2:
ArrayFields might also be affected for these 4 types, as select '{1234567890}'::varchar(6)[] also silently truncates the content.
comment:3 by , 3 years ago
Cc: | added |
---|
Then only postgres is affected here?
yep, if you set this feature flag to False on the Postgres backend and run the queries.test_bulk_update tests, you'll be able to see why it was added in the first place.
In general the broader "super" type with no constraints can be derived in postgres like this:
...
Maybe it is enough to apply the super type to the cast in that line
That could be one approach, yes; we'd likely need to adapt CAST to allow for such usage though. Not sure what the argument should be named, maybe generic which defaults to False? Not sure what generic=True would mean in the case of Cast(expr, ArrayField(CharField(max_length=20)), generic=True) - would it be ::varchar[] or ::array?
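To make the idea concrete, a rough sketch of such a "generic" cast (hypothetical code, the name and the split-on-"(" trick are only for illustration; ArrayField would still need an answer to the ::varchar[] vs ::array question):

from django.db.models import Func

class GenericCast(Func):
    # hypothetical sketch: like Cast(), but targets the parameter-free
    # "super" type, e.g. ::varchar instead of ::varchar(32)
    function = "CAST"
    template = "%(function)s(%(expressions)s AS %(db_type)s)"

    def __init__(self, expression, output_field):
        super().__init__(expression, output_field=output_field)

    def as_sql(self, compiler, connection, **extra_context):
        # cast_db_type() yields e.g. "varchar(32)"; strip the length/precision
        # suffix to get the unconstrained base type
        db_type = self.output_field.cast_db_type(connection)
        extra_context["db_type"] = db_type.split("(", 1)[0]
        return super().as_sql(compiler, connection, **extra_context)

# usage inside bulk_update()'s cast step, instead of Cast(case_statement, output_field=field):
# case_statement = GenericCast(case_statement, field)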
follow-ups: 5 6 comment:4 by , 3 years ago
An alternative approach might be to commit to dropping the whole CASE/WHEN wrapping altogether on backends that support it, as described in #29771.
We know the underlying expression construction approach performs poorly (#31202) and the following form doesn't suffer from the type inference issue we're experiencing here:
UPDATE test_model
SET name = v.name
FROM (VALUES
    (1, 'aaaaaaaaaa'),
    (2, 'bbbbbbbbbb')
) AS v(id, name)
WHERE test_model.id = v.id
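Purely as an illustration (the ORM does not emit this today), the same statement can be issued by hand from Django with placeholder values, reusing the table/column names from the SQL above:

from django.db import connection

rows = [(1, "aaaaaaaaaa"), (2, "bbbbbbbbbb")]
placeholders = ", ".join(["(%s, %s)"] * len(rows))
params = [value for row in rows for value in row]
with connection.cursor() as cursor:
    cursor.execute(
        "UPDATE test_model SET name = v.name "
        f"FROM (VALUES {placeholders}) AS v(id, name) "
        "WHERE test_model.id = v.id",
        params,
    )
# the plain assignment name = v.name goes through postgres' assignment cast,
# which (unlike the explicit ::varchar(N) cast) raises on overlong values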
comment:5 by , 3 years ago
Replying to Simon Charette:
An alternative approach might be to commit to dropping the whole CASE/WHEN wrapping altogether on backends that support it, as described in #29771.
Hmm, yes performance-wise I totally agree - the UPDATE FROM VALUES variants are much better for pumping tons of individual values. I already tried to address that pattern in https://github.com/netzkolchose/django-fast-update, with string formatting atm. The runtime numbers there speak for themselves. But for a serious integration in the ORM there are some obstacles to overcome:
- F-expressions won't work anymore (at least not without big workarounds)
- profound ORM integration needs serious rework of the update SQL compiler (or even a separate one just for the update + values pattern)
- it depends on recent db engines (plus a distinction of mysql8 vs. mariadb)
Regarding F-expressions - I don't know if that's a biggie: bulk_update always seemed to me like an interface to pump individual values rather than doing column trickery with it, so I would not mind if that interface no longer supported F-expressions. But that's just me; if people insist on using F-expressions here as well, this would need a workaround of unknown complexity.
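For reference, this is the kind of expression-valued update the current CASE/WHEN implementation accepts (as far as I can tell), and which has no direct counterpart in a plain VALUES list (tm/TestModel are from the repro in the description):

from django.db.models import F, Value
from django.db.models.functions import Concat

# the expression ends up as the THEN branch of the generated CASE statement
tm.name = Concat(F("name"), Value("!"))
TestModel.objects.bulk_update([tm], ["name"])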
Imho the second point can be done, it just needs someone with time and enough dedication (well I can try that, but would need serious help, as I lack deeper knowledge of ORM internals).
The last point is more tricky - how to deal with older or incompatible database engines here? Create a fallback (current implementation)? Or just confront users with "nope, not supported here, get a newer/compatible db engine"? It also raises the question, where to park the actual implementation - while the ORM itself could blueprint the UPDATE FROM VALUES pattern in ORM style, the backends would have to translate it into their very own style, or even substyles for mysql (mysql8 != mariadb here).
I guess that such a ground-shaking change to django would need some sort of consensus first?
comment:6 by , 3 years ago
Replying to Simon Charette:
An alternative approach might be to commit to dropping the whole CASE/WHEN wrapping altogether on backends that support it, as described in #29771.
We know the underlying expression construction approach performs poorly (#31202) and the following form doesn't suffer from the type inference issue we're experiencing here:
UPDATE test_model
SET name = v.name
FROM (VALUES
    (1, 'aaaaaaaaaa'),
    (2, 'bbbbbbbbbb')
) AS v(id, name)
WHERE test_model.id = v.id
Can you give me a pointer on how to get a discussion about that rolling? Or who to contact? I already wrote about my implementation of that idea 3 months ago, no response from anyone. Then I put everything into a neat package with extensive tests, no response (besides some 3rd person giving really helpful feedback). I even wrote a mailing list request about it yesterday to discuss some details - no response either (though it's still pretty fresh). I really don't know what else to do. Could it be that no one is actually interested in a revamped bulk_update implementation in django? Or is django development known to have a very slow pace / being in maintenance mode mostly? I don't want to blame anyone - could all be me pushing the wrong buttons, but I've never faced it to that degree in any other OSS project.
comment:7 by , 3 years ago
Link to the mailing list thread
Hi Jörg — thanks for the input here. Sorry you're feeling frustrated.
Could it be that no one is actually interested in a revamped bulk_update implementation in django? Or is django development known to have a very slow pace / being in maintenance mode mostly?
So there are three points there:
- I suspect it's not that lots of people are directly vested, but there are a number of regular contributors to the ORM (Simon included) and I'd imagine this is a topic of interest; but, as you've already pointed out in your mailing list post, there are several tradeoffs to consider, and it'll need some thought. Folks have limited bandwidth: that doesn't entail no interest. I hope that's clear.
- Django does have a slow pace. That's OK. After 16+ years, that's proven to be one of its strengths. It's a big project, with a lot of surface area, and (again) folks have limited bandwidth. It's one reason why third-party packages, such as the one you've done, are a good way to go, as they allow a faster pace, and a sandbox to work on issues.
- Despite the slow pace, Django is in anything but maintenance mode: you need only look at the release notes over the last few major releases to see that new features are constantly being worked on and delivered. If you zoom-out from any particular issue, I contest, the development pace is actually quite rapid for a project of Django's size and maturity (despite being "slow" on the surface.)
We're currently heads-down working towards the feature freeze for Django 4.1 — there is no chance (I'd think) of this getting addressed for that. That leaves a realistic opportunity to discuss it for Django 4.2, and if you're keen, and the technical questions can be resolved, there's no reason it couldn't get in for that. If we miss that, then the next one... — Again zooming out, it soon fades that it took x-cycles to get any particular feature work completed.
Looking at the timestamps on the discussion here, not much time has passed between comments. I'd suggest a little patience, and working on the third-party implementation to resolve any outstanding issues in that time. If it's ready™ following up on the mailing list thread may be appropriate to let folks know they can give it a try.
I hope that all makes sense, and helps anchor expectations. There's a nice comic here which I always try to keep in mind.
Kind Regards,
Carlton
comment:8 by , 3 years ago
Hey Carlton,
thanks for the heads-up. I didn't mean to sound like a drama queen, sorry if it came across that way. Also I am not eager to push my ideas through at any price, since I could be totally on the wrong track, simply for reasons I've overlooked. So it is more about getting feedback at all, whether things go into the right direction or not.
What I've learned from >20 years of OSS contributions - giving people early feedback helps to keep them engaged, and lowers the risk of time-consuming dead-end implementations (time on both ends, the implementer and the reviewers/maintainers), especially when the bigger picture needs to be addressed (API changes, bigger codebase changes involved at several places). I know that django is a very big codebase with lots of legacy, which imho makes it even harder for someone from outside to get involved. This for sure is a balancing act to maintain. Of course a slow pace is not a bad thing - in fact I like django for not buying every shiny new idea in town, as it gives a very solid development experience (I've been using django myself since version 0.8).
At this point I wonder if the separate issue tracker in trac vs. the repo on github might be part of a communication/transfer issue? (At least for me as a maintainer of several projects, github/gitlab made conceptual discussions and overall communication a lot easier than back in SVN/mailing-list times...)
I hope I did not derail this issue too much. :)
follow-up: 10 comment:9 by , 3 years ago
Hi Jörg.
...giving ppl early feedback helps to keep them engaged, and lowers the risk of time consuming dead end implementations (time on both ends, the implementer and the reviewers/maintainers), esp. when the bigger picture needs to be addressed (API changes, bigger codebase changes involved at several places)
Sure. I'm not sure the 11 days since you opened the ticket is that much time for folks to come to a view. If your workaround is performing well for you, that's good input. Otherwise you may need a little patience, though I see some input on the mailing list... 🙂
...if the separated issue tracker in trac...
That's out of scope for this one I'm afraid 😅. (There are mailing list discussions about it, but it's not trivial, not least because of the history in Trac... — As I say out of scope for here... 😬)
Thanks.
comment:10 by , 3 years ago
Replying to Carlton Gibson:
... I'm not sure the 11 days since you opened the ticket is that much time for folks to come to a view. If your workaround is performing well for you, that's good input. Otherwise you may need a little patience, though I see some input on the mailing list... 🙂
Well this ticket here is only loosely related and much younger (found the issue while doing some bulk_update tests). Ah whatever, will just wait on the mailing list input...
comment:11 by , 16 months ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:12 by , 16 months ago
Owner: | removed |
---|---|
Status: | assigned → new |
comment:13 by , 15 months ago
Owner: | set to |
---|---|
Status: | new → assigned |
comment:14 by , 14 months ago
Hi Priyank,
Just checking: are you still interested in working on this?
comment:15 by , 14 months ago
Owner: | removed |
---|---|
Status: | assigned → new |
comment:16 by , 14 months ago
Hello, I have attempted to address this issue, and I've identified the problem. It appears that the problem arises whenever the CAST() function is executed, and it seems that changing these parameters to False is not an option. Additionally, with SQLite all characters are stored in the database, so the issue does not occur there. I'm still interested in working on this ticket. What would be the best approach to resolve this problem?
follow-up: 18 comment:17 by , 13 months ago
Has patch: | set |
---|---|
Owner: | set to |
Status: | new → assigned |
Sorry I missed your comment and accidentally created a patch. Let's connect if you need help in some other ORM ticket.
comment:18 by , 13 months ago
Has patch: | unset |
---|
Replying to Akash Kumar Sen:
Sorry I missed your comment and accidentally created a patch. Let's connect if you need help in some other ORM ticket.
We cannot regress Cast() to avoid truncating values in bulk_update().
follow-up: 20 comment:19 by , 13 months ago
I don't think we are regressing the Cast() here; the query generated is
UPDATE "queries_article"
SET "name" = (
    CASE
        WHEN ("queries_article"."id" = 1) THEN 'bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb'
        WHEN ("queries_article"."id" = 2) THEN 'bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb'
        ELSE NULL
    END
)::varchar
WHERE "queries_article"."id" IN (1, 2)
It is just generating the super type, i.e. varchar instead of varchar(20), in the case of a CharField.
If you can explain a little further that would be great @felixxm
comment:20 by , 13 months ago
If you can explain a little further that would be great Mariusz
Explained in PR.
comment:21 by , 13 months ago
Has patch: | set |
---|
comment:22 by , 13 months ago
Patch needs improvement: | set |
---|
follow-up: 24 comment:23 by , 13 months ago
I have checked your GitHub comment. Any suggestions you have in mind?
Initially I went for updating the compiler itself, but that seems to be a much more tedious task. I have one more hacky idea, which is as follows:
- Introduce a new database function named CastSuperType that will always cast to the super type for every possible argument - like varchar for varchar(20), and similar equivalents for all the other fields that support casting.
- I also thought of having a different SQLUpdateCompiler for PostgreSQL, but as the UpdateQuery (https://github.com/Akash-Kumar-Sen/django/blob/bulk_update/django/db/models/sql/subqueries.py#L48) code is shared between the databases, I am having a hard time doing that.
comment:24 by , 13 months ago
Has patch: | unset |
---|---|
Patch needs improvement: | unset |
Replying to Akash Kumar Sen:
I have checked your GitHub comment. Any suggestions you have in mind?
Initially I went for updating the compiler itself, but that seems to be a much more tedious task. I have one more hacky idea, which is as follows:
- Introduce a new database function named CastSuperType that will always cast to the super type for every possible argument - like varchar for varchar(20), and similar equivalents for all the other fields that support casting.
- I also thought of having a different SQLUpdateCompiler for PostgreSQL, but as the UpdateQuery (https://github.com/Akash-Kumar-Sen/django/blob/bulk_update/django/db/models/sql/subqueries.py#L48) code is shared between the databases, I am having a hard time doing that.
This is a tricky issue to solve, and we cannot move forward with a stub solution just for this reason. I suspect that we will need to revisit bulk_update() to make it work properly, but I don't have any specific advice.
comment:25 by , 13 months ago
Following this approach mentioned by Simon in comment 3 would be reasonable I think.
That could be one approach, yes; we'd likely need to adapt CAST to allow for such usage though. Not sure what the argument should be named, maybe generic which defaults to False? Not sure what generic=True would mean in the case of Cast(expr, ArrayField(CharField(max_length=20)), generic=True) - would it be ::varchar[] or ::array?
I managed to reproduce, this is due to requires_casted_case_in_updates=True on the Postgres backend does a silent ::varchar(2) cast on the CASE statement. We'll need to find an elegant way to cast to varchar instead of varchar(N).