#17788 closed Bug (fixed)
bulk_create() "too many SQL variables" error
Reported by: | Alpár Jüttner | Owned by: | Anssi Kääriäinen |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 1.4 |
Severity: | Release blocker | Keywords: | |
Cc: | anssi.kaariainen@…, michal.petrucha@…, alasdair@… | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
I tried to bulk_create() lots of (e.g. >300) objects with Django 1.4b1 + sqlite backend. It fails with

django.db.utils.DatabaseError: too many SQL variables

This makes the use of bulk_create() quite risky.
Wouldn't it be possible to automatically break up the list of objects into smaller pieces when it turns out to be too large?
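A user-level workaround in the spirit of this proposal is to split the object list into fixed-size chunks before handing them to bulk_create(). A minimal, framework-free sketch (the `chunked` helper and the 100-item batch size are illustrative choices, not Django API):

```python
def chunked(seq, size):
    """Yield successive slices of `seq` containing at most `size` items."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# Hypothetical usage: instead of one huge bulk_create() call,
# insert the objects batch by batch.
# for batch in chunked(objs, 100):
#     SomeModel.objects.bulk_create(batch)
```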
Attachments (3)
Change History (62)
follow-up: 3 comment:1 by , 13 years ago
Cc: | added |
---|---|
Triage Stage: | Unreviewed → Design decision needed |
comment:2 by , 13 years ago
Triage Stage: | Design decision needed → Accepted |
---|
See also #16426 for a problem with the same root reason in a feature introduced in 1.3.
comment:3 by , 13 years ago
Replying to akaariai:
Yes, it would be possible. The biggest problem is actually knowing how many parameters the backend happens to support. For most backends the limit isn't variable amount, it is the query string length. Because of this it is impossible to make the bulk_insert immune to all batch size errors. (No, calculating the query string length isn't a sane option).
What if we always break up the list to bulks consisting of no more than e.g. 20 elements each? This should be safe for all backends. Does it have any noticeable performance penalty?
follow-up: 5 comment:4 by , 13 years ago
I also thought about that. I think a better approach would be to say that every backend has a 1000 parameter limit. That way, if you are inserting 2-column rows (m2m tables, for example), you could insert 499 items in one query. If they were 20-column items, then you would get to insert 49, and so on.
I think 20 objects per call is a little low anyhow.
And the above is a bit theoretical anyway, as that change can't be done now.
I think the best approach is to set a 1000 parameter limit for SQLite, no limit for others, and give a way to define the batch size by a kwarg. I will ask pgsql-general if they have some suggestions about this. I know doing a WHERE id IN large_list isn't efficient, but I don't know if this is true for inserts.
I don't think there is any possibility to get the above into 1.4. There is just too little time to fine-tune all the details (transaction handling etc). In addition the above is a new feature. Correct me if I am wrong, I would love to have a more reliable bulk_insert available.
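The parameter arithmetic discussed above can be sketched as follows. `max_batch_size` is a hypothetical helper, not Django API; 999 is SQLite's actual compile-time default for bound variables, which reproduces the 499 and 49 figures mentioned (999 // 2 and 999 // 20):

```python
def max_batch_size(num_fields, param_limit=999):
    """Largest number of rows whose placeholders fit under the
    backend's bound-parameter limit (always at least one row)."""
    return max(param_limit // num_fields, 1)

max_batch_size(2)   # narrow 2-column m2m rows
max_batch_size(20)  # wide 20-column rows
```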
follow-up: 21 comment:5 by , 13 years ago
Replying to akaariai:
The above is a bit theoretical. I don't think that change will get accepted for 1.4, and after that it will be a backwards-incompatible change.
It would be very sad. I think bulk_create() is pretty much unusable right now, because it is unsafe to use.
I don't see any reasonable way of handling this issue at user level.
If nothing changes before the release, at least the following warning should be added to the doc of bulk_create().
WARNING. This function crashes if you try to create more than a certain number of objects. This number depends on the database backend, and also on the number of fields in the object. In short: don't use this function.
I don't think there is any possibility to get the above into 1.4. There is just too little time to fine-tune all the details (transaction handling etc). In addition the above is a new feature.
The solution you propose may be considered a new feature, but the problem itself is a bug. As far as the bulk_create() functionality is concerned, it is pretty much a critical one.
comment:6 by , 13 years ago
Yes, it is a somewhat dangerous operation. It will seem to work in testing but crash in production. For that reason it would be good to have some (large) limits for other backends which have query size limits, too. MySQL has no practical limit (16MB by default), Oracle has a limit of 64K which you could actually hit if you try hard enough.
It is notable that this same problem is present when you do qs.filter(pk__in=[large_list]), and that one cannot be fixed in any transparent way.
It could be possible to have a simple fix for sqlite3: just split the inserts so that they have less than 1000 parameters each. No changes for other backends, no ability to override the batch size. It would fix this ticket's immediate problem, and could perhaps get into 1.4.
I asked pgsql-performance if there is any wisdom about preferred batch sizes. I suspect there is no point at which the performance will predictably collapse, so there is no need for default batch size for PostgreSQL. But lets see if something pops up. The thread is this one: http://archives.postgresql.org/pgsql-performance/2012-02/msg00265.php
Maybe I should open another ticket for the bigger changes? That way this ticket could be dedicated to just fixing the SQLite3 problem.
follow-up: 15 comment:7 by , 13 years ago
Talked this over briefly with Jacob in IRC - it's not a release blocker, since it seems likely that the answer "use a better database" suffices to prevent it being an actual problem in production (except maybe on Oracle? not clear to me). I'd still like to get it fixed for 1.4 if we can, and something like a bulk_insert_chunk_size on the db backend seems pretty reasonable.
comment:8 by , 13 years ago
Severity: | Normal → Release blocker |
---|
Actually, marking it a release blocker because at the very least it needs a doc fix before release; an actual fix is not a release blocker but would be nice. Thanks claudep for pointing this out.
comment:9 by , 13 years ago
I am going to test if Oracle has any batch size limits.
For PostgreSQL there is no practical limit. I tested this and once you go to millions of objects you could OOM your server, but most likely you will OOM your Python process before that.
As far as I know, MySQL does not have any practical limit either. 16MB of query string should be enough for everybody(tm).
Oracle is an open question, and then there are all the third-party databases. It might be worth having an API like get_next_batch(param_count, params). For example, SQLite could return min(1000/param_count, len(params)/param_count).
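A rough sketch of the get_next_batch idea floated above. The function name comes from the comment; the flat-parameter-list assumption and the hard-coded limit are mine, and this is not an actual Django API:

```python
def get_next_batch(param_count, params):
    """How many rows to put in the next query, given `param_count`
    bound variables per row and a flat list of remaining parameters."""
    limit = 1000  # SQLite's variable limit; effectively unlimited elsewhere
    total_rows = len(params) // param_count
    return min(limit // param_count, total_rows)
```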
comment:10 by , 13 years ago
OK, seems like I am not going to test this on Oracle. Installing Oracle XE is ridiculously hard, and now I have a half-configured Oracle XE installed...
So, if somebody could test the following on Oracle:
```python
class SomeModel(models.Model):
    f = models.CharField(max_length=100)

SomeModel.objects.bulk_create(SomeModel(f='a'*100) for i in range(0, 1000))
```
If that works, increase the 1000 to say 10000. If no problems, I guess Oracle is safe enough for 1.4.
comment:11 by , 13 years ago
For the record here's the diff I used:
```diff
Index: tests/regressiontests/test_17788/__init__.py
===================================================================
Index: tests/regressiontests/test_17788/tests.py
===================================================================
--- tests/regressiontests/test_17788/tests.py	(revision 0)
+++ tests/regressiontests/test_17788/tests.py	(revision 0)
@@ -0,0 +1,8 @@
+from django.test import TestCase
+
+from .models import SomeModel
+
+class SomeTest(TestCase):
+
+    def test_bulk_insert(self):
+        SomeModel.objects.bulk_create(SomeModel(f='abc') for i in range(0, 10000))
Index: tests/regressiontests/test_17788/models.py
===================================================================
--- tests/regressiontests/test_17788/models.py	(revision 0)
+++ tests/regressiontests/test_17788/models.py	(revision 0)
@@ -0,0 +1,4 @@
+from django.db import models
+
+class SomeModel(models.Model):
+    f = models.CharField(max_length=100)
```
This test passes under Oracle.
comment:12 by , 13 years ago
As the field length is just 3 chars the amount of data from the parameters is roughly 30KB. If you make the field length, say 30 chars, then it should trigger the problem if it exists at all.
I suspect it doesn't. I have googled this a little, and people do not complain about this often. If the parameters were calculated into the query string length, then I expect there would be more complaints.
comment:13 by , 13 years ago
I just ran Aymeric's test inserting 10K SomeModel instances, setting the value of its f field to both a 30-char string and a 300-char string, without failures. Database is Oracle Database 10g Express Universal Edition 10.2.0.1 as per OracleTestSetup.
comment:14 by , 13 years ago
There is already a supports_1000_query_parameters db backend feature, set to False for SQLite and True for all other backends. That seems kind of odd, as if it should be max_query_parameters = 1000 instead, but it was probably done that way so it could be used with skipUnlessDBFeature in the tests, which is the only place it's currently used.
Whether we leave that feature as it is or change it to a number, it could be used to resolve both this and #16426. Though if SQLite is really the only affected backend, it's awfully tempting to just say "get a better database" rather than introducing a bunch of extra chunking complexity in both deletion and bulk insertion, just for SQLite's sake.
follow-up: 26 comment:15 by , 13 years ago
Replying to carljm:
[...] since it seems likely that the answer "use a better database" suffices to prevent it being an actual problem in production [...]
My problem is that although we "use a better database" for production, we use SQLite during development, because it is easier to set up, to create and share snapshots, etc.
I believe we are not alone with this development model.
If we accept that "correct behavior" should have priority over "performance", then on SQLite bulk_create() may simply create the objects one by one. It is slow (on SQLite), but at least it works, and one may hope for better performance on a "better database". So, it would already be much better than having something unusable.
How does the batch size affect performance? As far as I understand, there is an inevitable processing cost for inserting the data, plus the per-query overhead of processing the SQL. The bulk_create() feature aims to reduce the second one. Thus, even a batch size as small as 10 already reduces the overhead to one tenth. Although I haven't tested it, my feeling is that most of the performance gain can already be achieved using very small batches; large batches will bring negligible further improvement.
Finally, using reasonably sized batches may be beneficial even on "better databases", as a huge single SQL insert command may cause problems (such as timeouts) on heavily loaded systems.
comment:16 by , 13 years ago
OK, so it seems safe to say Oracle doesn't suffer from this. You need 64K of SQL string without the parameters, and that will fit a lot of placeholders. I checked DB2 and SQL Server and they seem to be safe enough from this issue, too.
So, it seems the only problematic backend is SQLite with its 1000 parameters limit. Fixing bulk_create is not that hard, but I have no idea how hard it is to fix delete. If delete is hard, is it worth fixing bulk_create alone? If both are somewhat easy then I think this should be done. It will make testing a little bit easier. If this is fixed, there is at least one 2500 object creating test in queries which should be fixed, too. Makes the test runner a few seconds faster.
comment:17 by , 13 years ago
Attached is a patch that fixes the bulk_create issue. I used get_batch_size which can inspect the fields and the objects in the query.
The patch also implements a default limit of 100000 objects in one batch. Dropping one zero from the end would be wise, or if you prefer no limit, that's OK with me too. It would be trivial to implement a batch_size kwarg for bulk_create if that is wanted down the road.
In addition to .delete() there is also prefetch_related, which doesn't work if the inner sets are large enough (note: I haven't actually tested this). My opinion is that these aren't critical issues, but it would be nice to abstract the backend differences away if possible. However, there is always going to be one issue: id__in queries cannot be abstracted (splitting the query breaks sorting), so there will always be cases where SQLite doesn't work correctly.
I noticed one issue with the bulk_create feature which might need some attention. It goes to some effort to allow mixing pk-set and pk-not-set objects in one batch. In my opinion that is downright dangerous to allow, as the pk field must be an autofield for the insert to work. Saving self-defined values to an autofield is something to be avoided, as that can break the sequence. It is fine to do bulk inserts with an autofield PK set; assume the user knows what he is doing, maybe he is going to reset the PK sequence after the operation or something. But I can't see any valid use case for mixing PK-set and PK-not-set objects in one batch. The last added test in the patch doesn't work on SQLite for the very reason that mixing things in this way isn't safe. So, my question is: should this be disallowed? After 1.4 is released, it can't be. This isn't anything critical IMHO, but it would be good to discourage unsafe coding practices.
by , 13 years ago
Attachment: | batch_bulk_insert.diff added |
---|
comment:18 by , 13 years ago
Possibly a stupid question, but what if the model has a foreign key to itself? Isn't the chunking likely to break foreign key constraints?
Even though the patch is rather straightforward, I'm really wary of unexpected side effects, on the eve of a release candidate.
I suggest we document the problem as a known limitation of the SQLite backend in 1.4 and fix it properly later -- along with the other problems triggered by this limitation, like #16426.
comment:19 by , 13 years ago
Owner: | changed from | to
---|
comment:20 by , 13 years ago
Good catch. It does indeed break self-referencing foreign keys. Although the breakage happens only when the batch needs to be split, and in that case you currently get an error anyway... Still, the pk set/not set mixed in one query could already give you problems with foreign keys.
Leaving all this as a 1.5 problem sounds like a good plan at this point.
follow-up: 27 comment:21 by , 13 years ago
I'm taking care of adding a warning. It boils down to: "don't use SQLite if your application has requirements that exceed SQLite's limits."
For the record, I disagree strongly with the wording suggested in comment 5. We won't spread FUD like "this number depends on the database backend", nor cargo-cult advice like "don't use this function."
by , 13 years ago
Attachment: | 17788-warning-for-1.4.diff added |
---|
comment:23 by , 13 years ago
Severity: | Release blocker → Normal |
---|
comment:25 by , 13 years ago
Owner: | changed from | to
---|
comment:26 by , 13 years ago
Replying to alpar:
Replying to carljm:
[...] since it seems likely that the answer "use a better database" suffices to prevent it being an actual problem in production [...]
My problem is that although we "use a better database" for production, we use SQLite during development, because it is easier to set up, to create and share snapshots, etc.
...and it will quietly accept plenty of error cases that your production database will choke on, thus hiding errors during development that will NOT be hidden in production. Get a better database, even in development.
comment:27 by , 13 years ago
Replying to aaugustin:
For the record, I disagree strongly with the wording suggested in comment 5.
That wording was used only to emphasize that I found it a major issue. I didn't intend that text for actual inclusion in the doc, either. Sorry for not being clearer about it.
Replying to kmtracey:
My problem is that although we "use a better database" for production, we use SQLite during development, because it is easier to set up, to create and share snapshots, etc.
...and it will quietly accept plenty of error cases that your production database will choke on, thus hiding errors during development that will NOT be hidden in production. Get a better database, even in development.
I don't think this is the right place for debating over development practices. Like it or not, many people use SQLite in some stages of development even when eventually a more capable database is used for production. And those people are affected by the issue.
comment:28 by , 13 years ago
'Nice' to read I'm not the only one; I just ran into this problem trying to play with the latest stable django-cms on 1.4b1rev17666.
The "use a better database" solution (Postgres in our case) more or less solved the problem.
comment:29 by , 13 years ago
I don't think the following FUD/cargo cult like statement in [17653] is correct:
If your application's performance requirements exceed SQLite's limits, it's recommended to switch to another database engine, such as PostgreSQL.
- The whole thing has nothing to do with "application's performance requirements". Creating 300 objects at once is certainly not a performance requirement that SQLite is incapable of meeting.
- There is an obvious solution to the problem even using SQLite: splitting up the bulk to smaller pieces, either internally by Django or at the user level.
comment:30 by , 13 years ago
So, I would like to implement my own safe version of bulk_create() at user level, like this:

```python
def safe_bulk_create(objs):
    BULK_SIZE = 900 // FIELD_NUM
    for i in range(0, len(objs), BULK_SIZE):
        objs[0].__class__.objects.bulk_create(objs[i:i + BULK_SIZE])
```

This works well for me so far, both on SQLite and on PostgreSQL, where it does not cause a significant performance penalty compared to using bulk_create() directly.
However, I couldn't figure out how to calculate the number of fields in the object. What is the recommended way to do that?
comment:31 by , 13 years ago
@alpar I think you can use the not-so-internal model options: len(obj.__class__._meta.fields)
comment:32 by , 13 years ago
Cc: | added |
---|
I'll just chime in: we're about to hit this limit with the Django test suite. The more models the test suite contains, the more permissions are created in django.contrib.auth.management.create_permissions using bulk_insert. Currently on trunk this creates 996 objects; working on a development branch with some extra tests makes it impossible to run the test suite against SQLite.
comment:33 by , 13 years ago
My opinion is that this should be fixed for SQLite. I am not too sure about the rest of the problematic operations (prefetch, delete etc), but this one is so simple to fix that it should be done. The price of having the possibility of self-referential foreign key errors on large batches is less than having the 1000 parameter limit for bulk_create. As said before, currently you will get the 1000 parameters error anyways in the problematic situations.
It seems that SQLite supports deferrable foreign keys. I wonder why those aren't used on SQLite when they are used on PostgreSQL and Oracle?
comment:34 by , 13 years ago
Hit this today with MySQL inserting ~19000 rows of a simple model with 5 fields, including the primary key. MySQL gave me this: http://dev.mysql.com/doc/refman/5.1/en/packet-too-large.html
I had to bump max_allowed_packet from 1M to 2M to get the insert to work; I can see bulk inserting a model that had large TextField() values being particularly troublesome. Note that 1M is the out of the box default with MySQL.
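For reference, bumping that limit is a MySQL server setting in my.cnf (the 2M value mirrors the workaround described above; pick a size appropriate for your data):

```ini
[mysqld]
# Default is 1M; large bulk inserts can exceed it.
max_allowed_packet = 2M
```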
comment:35 by , 13 years ago
I don't think we are going to have a default on MySQL. The reason is that it is pretty much impossible to guess the correct batch size (2 objects might be too much). However, we could expose a batch_size parameter and suggest users set it to something reasonably low.
In addition, can we please fix the SQLite "too many variables" error? That one hits me 9 times out of 10 when doing database performance testing. Is there opposition to adding a batch_size parameter to bulk_create which defaults to unlimited, except on SQLite where the default is set so that the variable limit is not hit? Yes, there could be problems for self-referential foreign keys. But that query is already doomed, as this would just trade "too many variables" for "foreign key broken" in the rare case there is actually a problem.
comment:36 by , 13 years ago
Just as an additional data point for fixing this on SQLite: SQLite doesn't actually enforce foreign key constraints, so there is little risk of breaking self-referential foreign key constraints.
Any objections for the following plan?
- Add a batch_size variable to bulk_create.
- By default the batch_size is unlimited, but on SQLite it is restricted so that a single query does not get too many variables.
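The plan above might look roughly like this on the backend side. Class and method names here are illustrative sketches (the committed patch used a DatabaseOperations-level bulk_batch_size hook, as comment 48 mentions), and 999 is SQLite's compile-time default for bound variables:

```python
class BaseDatabaseOperations:
    def bulk_batch_size(self, fields, objs):
        # Default: no practical limit, insert everything in one batch.
        return len(objs)

class SQLiteDatabaseOperations(BaseDatabaseOperations):
    # SQLite's compile-time default for bound variables
    # (SQLITE_LIMIT_VARIABLE_NUMBER).
    MAX_VARS = 999

    def bulk_batch_size(self, fields, objs):
        # Restrict the batch so one query never exceeds the variable limit.
        return (self.MAX_VARS // len(fields)) if fields else len(objs)
```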
comment:38 by , 13 years ago
I agree with getting this fixed, and hopefully in an incremental to 1.4. My company *will not* migrate to Django 1.4 unless the test suite passes on SQLite (which we use for trunk development), which it currently does not. I have a very hard time believing that this is not relatively common among Django dev houses as sqlite is the simplest way to get your application going without worrying about setting up and configuring a more reliable database.
comment:39 by , 13 years ago
You may want to point out to the people who set this rule that requiring the test suite to pass on a database you aren't using in production doesn't make a lot of sense :)
And yes, we'll backport the fix to 1.4.1.
comment:40 by , 13 years ago
Owner: | changed from | to
---|
I guess I should assign this to myself.
For 1.4.1 we can't add the batch_size variable (a new feature). So, would it make sense to add it for 1.5 and just use the default of "limited by variable amount for SQLite, otherwise unlimited" in 1.4.1 without the chance of overriding it?
comment:41 by , 13 years ago
Somehow I'm not convinced by the idea of adding a parameter to bulk_create. I feel it is an attempt to fix this at the wrong level. From an API perspective, it seems wrong to me to hard-code a number specifying the batch size into the actual call to bulk_create, since the appropriate number is completely backend-dependent.
I think it would make more sense for this to go into the backend-specific code, something like a DatabaseOperations.max_batch_size that would take an InsertQuery object to calculate the value based on the number of fields inserted, or something like that. There's a similar (albeit simpler) mechanism for __in lookups, DatabaseOperations.max_in_list_size.
But maybe it's just me abstracting things too much.
comment:42 by , 13 years ago
The current plan is to add a batch_size argument, by which you can override the default for the used backend. The default is based on variable amount limit on SQLite, otherwise it is unlimited. (The current patch has a "large enough" default limit, but that will likely be changed to unlimited).
The only place where you really need the batch size argument is with MySQL. MySQL has a 1MB limit for query string, and I believe it will be too complicated to calculate the query string size using the given instances. So, the user can limit the size if he needs to, but there is no default limit in place, as we really don't know what the default should be.
On other core backends there are no practical query size limits, or at least I could not hit any limits.
Still, maybe it would be better to handle this at one level up from bulk_insert. So, you would do something like this:
```python
for batch in connection.get_batches(objs):
    SomeModel.objects.bulk_create(batch)
```
This makes it very explicit what is happening, and you could use different batch generators if needed, without every user being forced to rewrite the batching logic himself.
I like the automatic batch size limit more, but it might be that I am a little fixated to that approach...
comment:43 by , 13 years ago
On second thought, I agree with the batch_size approach, as long as it works by default where it is possible.
The problem with connection.get_batches is that:
- you usually don't work with connection objects (at least with basic ORM usage; it's a bit different when you're using complicated custom SQL)
- it introduces boilerplate you'd have to add everywhere bulk_insert is used anyway...
So, all in all, I withdraw my previous objection.
comment:44 by , 13 years ago
Has patch: | set |
---|
I finally managed to update the patch. It is as close to committable as I managed to get it. The part I am unsure about is the documentation changes. Please verify. In addition to the batching there is one notable change: SQLite3's can_combine_inserts_with_and_without_auto_increment_pk was changed from True to False. The reason for this change is that when doing multiple batches (due to the parameter limit) things get weird when combining pk set/unset objects. If you have these PK values in your batches:
5, None, 7, None and 9, None, 11, None
when the first batch is inserted you get PK values 1, 2, 5, 7; for the second batch you will get 8, 9, 9, 11, because the PK field always starts from max(id) + 1... In short: this behavior is not wanted.
There is also https://github.com/akaariai/django/tree/ticket_17788 for those who prefer github.
by , 13 years ago
Attachment: | ticket_17788.diff added |
---|
comment:45 by , 12 years ago
Overall the patch looks good to me; only one remark: personally I'd use a non-recursive implementation of _batched_insert. Maybe it's just me being excessively reluctant to use recursion when it can be rewritten 1:1 into a loop. Also, in extreme cases where the number of batches is high, the running time of this method will be O(n²), as a new copy of the whole list of remaining objects is created in each iteration.
Anyway, I don't know how much of an issue these concerns are.
comment:46 by , 12 years ago
I have updated the patch to remove the recursion from _batched_insert: https://github.com/akaariai/django/compare/ticket_17788
comment:48 by , 12 years ago
s/valuer/value/ in the bulk_batch_size docstring. In bulk_batch_size for SQLite3, I would rather use the // operator (in Python 3, '/' would return a float). I would also say in the docstring that 999 is a compile-time default in SQLite (SQLITE_LIMIT_VARIABLE_NUMBER).
comment:49 by , 12 years ago
Corrected the review issues in: https://github.com/akaariai/django/commit/332fb9308c39249cdd38a034a3565778636c373b with some other improvements, too. See the commit message for details.
BTW I think this way of Git usage could be better than producing constantly rebased versions - it should be easier to see what changes have been made after the review. It is another issue if the review-fix commits should be merged or squashed when committing to the master repo. IMO squashed into one commit...
comment:50 by , 12 years ago
Thanks for the improvements. Please squash, unless the individual commits have real value (for example, to distinguish formatting edits from real code changes).
comment:51 by , 12 years ago
Cc: | added |
---|
comment:52 by , 12 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
comment:53 by , 12 years ago
Resolution: | fixed |
---|---|
Severity: | Normal → Release blocker |
Status: | closed → reopened |
Version: | 1.4-beta-1 → 1.4 |
I have created a backpatch of the committed fix in https://github.com/akaariai/django/tree/ticket_17788_backpatch.
I will wait for last comments before proceeding with the backpatch. I am a little less enthusiastic about backpatching this than before, but I think I am still going to do it, as that was the decision some time ago.
comment:54 by , 12 years ago
I'd be +1 to backporting since I currently have to apply this to each 1.4.X release to make things work for my installation (we have a lot of permissions and use SQLite for testing so the tests don't run without this patch).
comment:55 by , 12 years ago
No '-' votes and some '+' votes. I will try to commit this to 1.4 during the weekend.
comment:56 by , 12 years ago
For the record, +1 on a backport from me, too. If you're looking for a procedural reason to justify this -- "Major problem with new feature" is one of the criteria for a backport; if we'd had this patch prior to the 1.4 release, I'd warrant we would have held up the 1.4 release to make sure it got in.
comment:57 by , 12 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
comment:59 by , 12 years ago
I was wondering about the release notes myself when committing. I looked through the notes of older releases and it seemed most commits to back branches are not mentioned in the notes. So, I left the note out.
It might be a good idea to add this, and any non-trivial changes to the release notes of back-branches. Opinions?
The way forward would be defining a max_variables_per_query for each backend. For sqlite3 it is 1000, for other backends it could be None for no limit. There could also be a kwarg "batch_size" by which you could override the default batch size given by the backend.