Opened 7 years ago

Closed 7 years ago

Last modified 5 years ago

#28641 closed New feature (needsinfo)

Improvements to QuerySet.bulk_create()

Reported by: Дилян Палаузов Owned by: nobody
Component: Database layer (models, ORM) Version: 1.11
Severity: Normal Keywords:
Cc: Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

https://docs.djangoproject.com/en/1.11/ref/models/querysets/#bulk-create does not state what bulk_create returns, so that the implementation can be optimizerd by making bulk_create() return None.

As QuerySet.get_or_create() ensures, that after the operation an object exists in the database, bulk_create() can do the same on Postgresql using "ON CONFLICT DO NOTHING", possibly taking a parameter. I guess currently bulk_create fails as whole, if any INSERT cannot be performed, so that a developer has to do "for x in Y: x.get_or_create()" instead of buld_create([x for x in Y]).

INSERT (for bulk_create) can also use RETURNING to get the ids of the newly created objects which can be used then to send signals. Is there any particular reason, why on Postgresql bulk_create() does not send pre_save / post_save signals?

Change History (5)

comment:1 by Дилян Палаузов, 7 years ago

Part 1: Send post_save signal in bulk_create:

diff  --git a/docs/ref/models/querysets.txt b/docs/ref/models/querysets.txt
--- a/django/db/models/query.py
+++ b/django/db/models/query.py
@@ -20,6 +20,7 @@ from django.db.models.expressions import F
 from django.db.models.fields import AutoField
 from django.db.models.functions import Trunc
 from django.db.models.query_utils import InvalidQuery, Q
+from django.db.models.signals import post_save
 from django.db.models.sql.constants import CURSOR
 from django.utils import six, timezone
 from django.utils.deprecation import RemovedInDjango20Warning
@@ -417,7 +418,7 @@ class QuerySet(object):
         #    insert into the childmost table.
         # We currently set the primary keys on the objects when using
         # PostgreSQL via the RETURNING ID clause. It should be possible for
-        # Oracle as well, but the semantics for  extracting the primary keys is
+        # Oracle as well, but the semantics for extracting the primary keys is
         # trickier so it's not done yet.
         assert batch_size is None or batch_size > 0
         # Check that the parents share the same concrete model with the our
@@ -448,6 +449,11 @@ class QuerySet(object):
                     obj_without_pk._state.adding = False
                     obj_without_pk._state.db = self.db
 
+
+        if not self.model._meta.auto_created and (not objs_without_pk or connection.features.can_return_ids_from_bulk_insert):
+            for obj in objs_with_pk + objs_without_pk:
+                post_save.send(sender=self.model.__class__, instance=obj, created=True, using=self.db)
+
         return objs
 
     def get_or_create(self, defaults=None, **kwargs):

diff --git a/docs/ref/models/querysets.txt b/docs/ref/models/querysets.txt
--- a/docs/ref/models/querysets.txt
+++ b/docs/ref/models/querysets.txt
@@ -1937,14 +1937,18 @@ are)::
 
 This has a number of caveats though:
 
-* The model's ``save()`` method will not be called, and the ``pre_save`` and
-  ``post_save`` signals will not be sent.
+* The model's ``save()`` method will not be called, and the ``pre_save``
+  signal will not be sent.  The ``post_save`` signal will be sent, if there
+  are no conflicts at database level and either all objs have set their primary
+  keys before calling bulk_create, or if the backend is PostgreSQL.
 * It does not work with child models in a multi-table inheritance scenario.
 * If the model's primary key is an :class:`~django.db.models.AutoField` it
   does not retrieve and set the primary key attribute, as ``save()`` does,
   unless the database backend supports it (currently PostgreSQL).
 * It does not work with many-to-many relationships.
 
+Returns the passed objects, possibly setting their primary keys.
+
 .. versionchanged:: 1.10
 
     Support for setting primary keys on objects created using ``bulk_create()``

comment:2 by Дилян Палаузов, 7 years ago

Component: UncategorizedDatabase layer (models, ORM)
Type: UncategorizedNew feature

comment:3 by Tim Graham, 7 years ago

Resolution: needsinfo
Status: newclosed
Summary: QuerySet.bulk_create()Improvements to QuerySet.bulk_create()

Hi, please try to keep a ticket focused on one issue. I'm going to close this ticket for now because it's too unfocused but I'll leave some ideas here about your proposals:

  1. Most likely we're not going to break backwards compatibility and change the return type of bulk_create() (there would need to be a really compelling reason to do that). Perhaps the return type should be added to the documentation.
  1. I'm not sure if it's common to use bulk_create() in situations where a race condition might be present. Most likely a discussion on the DevelopersMailingList is needed to form a consensus about your proposal regarding "ON CONFLICT DO NOTHING."
  1. There's some discussion in the comments of #19527 about a version of QuerySet.bulk_create() that sends signals. Again, this isn't something to change lightly as it has an impact on performance.

If you want to pursue these issues, I'd say the next step is to try searching Trac using a Google search with site:code.djangoproject.com and related keywords to find past discussion. You can also search the archives of the DevelopersMailingList. Then you would make a post to the DevelopersMailingList that summarizes past discussion of the issue(s) and includes your proposal about how to move forward. If the discussion yields a consensus, then we can open a more focused ticket (or tickets) with the improvement(s).

Thanks!

comment:4 by Tim Graham, 7 years ago

#28668 is created for ON CONFLICT DO NOTHING support.

comment:5 by Marcin Nowak, 5 years ago

Hi Tim.

Based on my long-term experience the inconsistency in sending signals is a source of many logical errors. They're hard to debug, because signals are implicit.
Please consider an example:

signals.post_save.connect(fn, sender=SomeModel)
signals.post_delete.connect(fn, sender=SomeModel)

[ ... ]

to_create = [ ... ]  # list of model instances to create
to_update = SomeModel.objects.filter(...)  # queryset of models to update


SomeModel.objects.bulk_create(to_create)

for obj in to_update:
    obj.field1 = ...
    obj.save()

It's not obvious whether nor when fn() will be called. In fact it will be called for to_update objects, but not for to_create.

The code must be changed in a way, where fn() must be called explicitely for each instance created by bulk_create:

signals.post_save.connect(fn, sender=SomeModel)
signals.post_delete.connect(fn, sender=SomeModel)

[ ... ]

to_create = [ ... ]  # list of model instances to create
to_update = SomeModel.objects.filter(...)  # queryset of models to update


SomeModel.objects.bulk_create(to_create)
for obj in to_create:
    fn(obj)

for obj in to_update:
    obj.field1 = ...
    obj.save()

The code is now more explicit. You see when and where fn() is called. But there is a catch - fn() is also called during obj.save() in update section.
In terms of code readability it's even worse than first example.

In such case a developer will decide to remove signals, but by that way he will introduce regression and another software failure, because he don't know what and where emits them. A developer will conclude that signals are bad by design.

We all know that signals are helpful, because they're extension-like interfaces for 3rd-party apps (like contrib.admin). But they must be consistent - every save (create, update) must emit (post_)save signal, and every delete must emit (post_)delete signal. Without that there is a use case, where you can't do what developer really need: do something after each save of the model for in-project and 3rd-party apps at the same time.

Note: See TracTickets for help on using tickets.
Back to Top