Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#33516 closed Bug (invalid)

QuerySet update() under postgres with select_for_update(skip_locked=True) results in neither FOR UPDATE nor SKIP LOCKED

Reported by: Grant Gainey Owned by: nobody
Component: Database layer (models, ORM) Version: 3.2
Severity: Normal Keywords:
Cc: Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Versions:

$ cat /etc/fedora-release 
Fedora release 34 (Thirty Four)

$ pip freeze | grep -i django
Django==3.2.12
django-crispy-forms==1.13.0
django-currentuser==0.5.3
django-extensions==3.1.5
django-filter==21.1
django-guardian==2.4.0
django-guid==3.2.1
django-import-export==2.7.1
django-lifecycle==0.9.3
django-readonly-field==1.1.1
django-storages==1.12.3
djangorestframework==3.13.1
djangorestframework-queryfields==1.0.0

$ psql --version
psql (PostgreSQL) 13.4

Problem Statement

pulpcore is a Django app that does A Lot Of Concurrent Stuff. One of those Things includes "update the last-touched timestamp of large batches of entities at the end of a workflow". We use queryset.update(timestamp_of_interest=now()) for this.

Batch updates don't guarantee row-update-order in Postgresql. In heavy-concurrency contexts, when updating overlapping content in multiple threads, this can result in deadlocks as postgres acquires row-locks in arbitrary orders.

To get around this, in raw postgres, one uses UPDATE...WHERE id IN (SELECT id FROM table WHERE expression ORDER BY id FOR UPDATE). The SELECT-FOR-UPDATE-ORDER-BY enforces row-lock-ordering prior to the UPDATE, and everyone is happily synchronized and all the dining philosophers get to eat.

If (as in our specific use-case) you don't actually care who updates the row, just that it is being updated "now()-ish", you can regain some concurrency by using Postgres' SKIP LOCKED functionality.

So - we attempted to use the following to accomplish that in our update workflow:

TestModel.objects.filter(pk__in=content_q).order_by("pk").select_for_update(
        skip_locked=True
    ).update(timestamp_of_interest=now())

By the time we get to Postgres, this does not emit FOR UPDATE or SKIP LOCKED, and therefore doesn't solve our deadlock problem. Cue sad-trombone.

Specifically, from the driver-example below, this code:

def update_timestamp(index):
    content_q = (
        TestModel.objects.filter()
            .order_by("pk")
            .values_list("pk", flat=True)
    )
    TestModel.objects.filter(pk__in=content_q).order_by("pk").select_for_update(
        skip_locked=True
    ).update(timestamp_of_interest=now())

generates this SQL:

2022-02-15 19:12:47.276 UTC [337361] LOG:  statement: UPDATE "core_testmodel" SET "timestamp_of_interest" = '2022-02-15T19:12:47.187475+00:00'::timestamptz WHERE "core_testmodel"."pulp_id" IN (SELECT U0."pulp_id" FROM "core_testmodel" U0)

Running the driver-example will generate deadlocks; the more TestModel entries exist, the more often you hit a deadlock. With 5K entries, test runs hit deadlocks ~5-10 times. With 20K entries, it was more like ~60-70% of the 100 runs.

Workaround

Right now, we implemented a workaround in raw SQL. You can see it here:

https://github.com/pulp/pulpcore/pull/2159/files#diff-81f6a78175bb93934b6beff952646d3ca1ef3731f1ff14492d4ec77bfc3fdf82R132-R141

We would much much rather not have to use raw-sql.

Full Example:

  • Here's a really simple model, stored as pulpcore/apps/models/test.py
    import uuid
    from django.db import models
    from django_lifecycle import LifecycleModel
    
    
    class TestModel(LifecycleModel):
        pulp_id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
        timestamp_of_interest = models.DateTimeField(auto_now=True)
    
  • Here's a migration that creates that model:
    # Generated by Django 3.2.12 on 2022-02-15 18:53
    
    from django.db import migrations, models
    import django_lifecycle.mixins
    import uuid
    
    
    class Migration(migrations.Migration):
    
        dependencies = [
        ]
    
        operations = [
            migrations.CreateModel(
                name='TestModel',
                fields=[
                    ('pulp_id', models.UUIDField(default=uuid.uuid4, editable=False, primary_key=True, serialize=False)),
                    ('timestamp_of_interest', models.DateTimeField(auto_now=True)),
                ],
                options={
                    'abstract': False,
                },
                bases=(django_lifecycle.mixins.LifecycleModelMixin, models.Model),
            ),
        ]
    
  • Here's some python that adds 5000 rows to the resulting table (there is probably a better/faster way to do this, but this is obvious):
    from pulpcore.app.models import TestModel
    
    for tm in range(5000):
        TestModel.objects.create()
    
  • Turning on statement-loggin in Postgres and restarting:
    $ sudo vim /var/lib/pgsql/data/postgresql.conf
    ...
    log_statement = 'all'                   # none, ddl, mod, all
    $ sudo systemctl restart postgresql
    
  • And here, at last, is driver-code that will generate deadlocks and let us inspect the emitted SQL:
    from threading import Thread
    from pulpcore.app.models import TestModel
    from django.utils.timezone import now
    
    
    durations = []
    
    def update_timestamp(index):
        content_q = (
            TestModel.objects.filter()
                .order_by("pk")
                .values_list("pk", flat=True)
        )
        TestModel.objects.filter(pk__in=content_q).order_by("pk").select_for_update(
            skip_locked=True
        ).update(timestamp_of_interest=now())
        print(">>>done")
    
    for r in range(10):
        threads = []
        for i in range(10):
            threads.append(Thread(target=update_timestamp, args=(i,)))
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    

Change History (2)

comment:1 by Simon Charette, 3 years ago

Resolution: invalid
Status: newclosed

In the latest version of Django the ordering should not be cleared when select_for_update is used on the subquery provided to a __in lookup.

def update_timestamp(index):
    content_q = (
        TestModel.objects.order_by("pk")
        .select_for_update(skip_locked=True)
        .values_list("pk", flat=True)
    )
    TestModel.objects.filter(pk__in=content_q).update(timestamp_of_interest=now())

There was some recent changes in this area in 4.0 and I haven't confirmed if this is an issue on 3.2 but in all cases order_by.select_for_update has to be called on the subquery and not the outer one for this to work as the resulting SQL hints at and that's likely why your solution isn't working here.

comment:2 by Grant Gainey, 3 years ago

Ah-ha! Perfect, that explains everything!

Two comments/suggestions:

  • Could it be possible for select_for_update() to throw an exception when it's not going to be emitted? Would have helped me target the problem way sooner if the method yelled at me for using it incorrectly
  • Similarly - when called 'correctly', select_for_update() throws an exception if you're not in a transaction. When called incorrectly, it doesn't. Would be great if the behavior was consistent.

For posterity's sake, below is my test-script using the suggested changes, that works on a 20K table with no deadlocks. Thanks for the pointer!

import statistics
import time
from threading import Thread
from pulpcore.app.models import Content  
from django.db import transaction
from django.utils.timezone import now

durations = []

def update_timestamp(index):

    print(">>>in update_timedstamp index {}".format(index))
    start = time.time()
    with transaction.atomic():
        content_q = (
            Content.objects.filter()
                .order_by("pk")
                .select_for_update(skip_locked=True)
                .values_list("pk", flat=True)
        )
        Content.objects.filter(pk__in=content_q).update(timestamp_of_interest=now())
    end = time.time()
    durations.append(end-start)
    print(">>>done")

for r in range(10):
    threads = []
    for i in range(10):
        threads.append(Thread(target=update_timestamp, args=(i,)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()

print("Avg time : {}".format(sum(durations) / len(durations)))
print("Median time : {}".format(statistics.median(durations)))
print("StdDev : {}".format(statistics.stdev(durations)))

And the emitted SQL:

2022-02-16 13:46:02.517 UTC [368804] LOG:  statement: UPDATE "core_content" SET "timestamp_of_interest" = '2022-02-16T13:46:02.516956+00:00'::timestamptz WHERE "core_content"."pulp_id" IN (SELECT U0."pulp_id" FROM "core_content" U0 ORDER BY U0."pulp_id" ASC FOR UPDATE SKIP LOCKED)
Note: See TracTickets for help on using tickets.
Back to Top