Opened 10 years ago

Last modified 4 months ago

#24810 new New feature

Reopen database connection automatically when no transaction is active

Reported by: Aymeric Augustin Owned by: nobody
Component: Database layer (models, ORM) Version: dev
Severity: Normal Keywords:
Cc: depaolim@…, binary@…, Marti Raudsepp, dms-cat, amureki, Kai Schlamp, Ülgen Sarıkavak Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: yes Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

This is a follow-up of #15802.

When the database closes the connection, for instance because the database server was restarted, an exception will be raised the next time user code attempts to use the database. Django doesn't attempt to do anything about it.

This is expected to result in at most one 500 error for each thread that serves requests, since database connections are thread-local. This doesn't sound horrific. After all the database was restarted while the website was serving requests.

It can also cause management commands to crash, which isn't the end of the world either since management commands should fall into one of the following categories:

  • short-lived, idempotent and scheduled to run regularly
  • long-lived and supervised to run constantly
  • run manually

If the connection was in autocommit mode and an operation fails because the database has closed the connection, theoretically, it's safe the reopen the connection and retry the operation.

In that case, Django could continue to operate transparently instead of raising an exception.

This may not be easy to implement.

Change History (21)

comment:1 by Anssi Kääriäinen, 10 years ago

It must be noted that SQL connections can have state even when not in a transaction. These could be for example session variables, per-connection configuration (SET search_path = xxx), temporary tables etc. While Django doesn't use any of these directly, breaking Django for users who do use connection state should be avoided.

If we are going to reopen connections automatically we need some safe-guards against above problems. As we can't detect automatically when users have session state in their connections, the users should tell us about this. I think opt-out (or maybe opt-in?) flag in the connection's settings could work. Something like set OPTIONS['stateful_connection'] = True, and Django will not try to reopen failed connections.

Other than the above remark I don't see a problem with this approach. Still, without further evidence that these kinds of failures are common, I believe this will require too much complexity compared to what is gained.

comment:2 by Marco De Paoli, 10 years ago

Cc: depaolim@… added

comment:3 by Sergey Dobrov, 9 years ago

Yes, I have the same problem in the daemon tasks that access DB through django models.

I don't see a problem with the connection state, actually, as we can say to do all the necessary preparations in the connection_created signal and then newly created connection can be set up correctly?

comment:4 by Sergey Dobrov, 9 years ago

Cc: binary@… added

comment:5 by Mike Charnoky, 8 years ago

It should be noted that in the case where a daemon is accessing the DB via the ORM, if the db connection is closed for some reason (usually the db server restarting), then ALL subsequent DB read queries FAIL. At that point, the process can no longer perform any read query since the db connection is closed. I'm not concerned with retrying statements that failed during the time the db was down, but it does seem like a bug that the ORM completely fails to reconnect AT ALL. The funny thing is: if and only if the process happens to perform a DB write after the DB connection gets closed, then the ORM will open a new connection.

comment:6 by Aymeric Augustin, 8 years ago

Yes, these are the reasons why this ticket exists. Patches *that do not ignore transactional integrity* welcome.

in reply to:  6 comment:7 by Sergey Dobrov, 8 years ago

Replying to aaugustin:

Yes, these are the reasons why this ticket exists. Patches *that do not ignore transactional integrity* welcome.

So I gave it a humble try.

There is no problem in case the trouble happened inside a transaction because in case error happened and was caught by the Atomic, the connection will be closed in one of the places like this. And next transaction will work without any problem as connection will be re-established because django is now aware that the current one is not an option any more.

It's still a problem in autocommit mode though because in this case connection won't be closed and will remain in a broken state, so in case we try to obtain a cursor again, we'll get the InterfaceError exception. So the code like:

def do_query():
    return Model.objects.first()

def loop():
    while True:
        try:
            print do_query()
        except:
            print 'shit happened'
        sleep(5)

will stop working after first disconnect.

At first I tried to intercept InterfaceError at the moment we are obtaining a cursor and OperationalErorr at the execute method, but then I realized that such a problem may happen in other methods, like fetchone or anywhere else, so eventually I came to just intercept it in the DatabaseErrorWrapper and close the connection if one of those errors occured but only in case we are not inside a transaction, as this case is covered already and it may break that code. Here's the commit: https://github.com/jbinary/django/commit/568a4eb573c7fce2fdec2d1578b2db4f3d0e51fd

After that, that code above starts to work even after disconnect.

I have no any idea if this is the right approach, and I have no idea how to write tests for this yet. But if anyone will tell this one is a worthy approach, I'll try to think about tests too. (but advices are welcome, ofc)

comment:8 by Sergey Dobrov, 8 years ago

Has patch: set
Needs tests: set

comment:9 by Marti Raudsepp, 6 years ago

Cc: Marti Raudsepp added

comment:10 by dms-cat, 5 years ago

Cc: dms-cat added

comment:11 by Hugo Osvaldo Barrera, 5 years ago

I've been trying to address this with no luck.

My main problem is accurately reproducing this issue -- it seems to happen from time to time in production for me. The DB connection seems to die for some reason, and the next request fails with InterfaceError.

However, testing locally I tried to reproduce this by:

  1. Start Django and Postgres.
  2. Execute a Request (which does a DB call). It works!
  3. Soft-exit Postgres
  4. Execute a Request (which does a DB call). It still works! 😓

I tried as an alternative killing postgres in (3), but the result is the same.

As per Sentry, I'm hitting this error a few hundred times a month, so I'd really like to address this.

Any hints on how to accurately reproduce this?

in reply to:  11 comment:12 by Russ Ferriday, 4 years ago

Replying to Hugo Osvaldo Barrera:

As per Sentry, I'm hitting this error a few hundred times a month, so I'd really like to address this.

Any hints on how to accurately reproduce this?

Hello Hugo,
I'm seeing this less than once a month, and I would like to see improvement, but my time is limited.
Today's occurrence was not related to database outage, or maintenance, which seems to be the usual glib explanation for the error.
My application is used cyclically, in a particular season. The database is NOT in the same google region as the application server. Yes, that's not ideal, and I may need to correct this before the coming season.
I suspect that progress on this may be helped by Gremlins, mentioned here: https://www.qualimente.com/2016/04/26/introduction-to-failure-testing-with-docker/
If gremlins could force the error to occur, there would be a chance to develop solutions.

comment:13 by Daniel Hahler, 3 years ago

Hugo,

you should be able to reproduce this when restarting PostgreSQL after getting the cursor, i.e.:

from django.db import connection


cursor = connection.cursor()
__import__("time").sleep(5)  # restart postgresql
cursor.execute("select 1")

This then fails with the following for me on Django 3.2.8:

Traceback (most recent call last):
  File "…/django/db/backends/utils.py", line 82, in _execute
    return self.cursor.execute(sql)
psycopg2.errors.AdminShutdown: terminating connection due to administrator command
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "…/django/db/backends/utils.py", line 66, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "…/django/db/backends/utils.py", line 75, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "…/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "…/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "…/django/db/backends/utils.py", line 82, in _execute
    return self.cursor.execute(sql)
django.db.utils.OperationalError: terminating connection due to administrator command
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

(Note that in this case the first exception is of type AdminShutdown, whereas on Sentry I see psycopg2.OperationalError: server closed the connection unexpectedly there.)

Just for reference: there exists https://github.com/jdelic/django-dbconn-retry, but it has its (open) issues also.

comment:14 by Johannes Maron, 3 years ago

Type: New featureUncategorized

I can second this issues. We see this happening thousands of times, but still at random (we have a lot of traffic). We have a relatively small stack, and I scanned it completely to see if we have any place that closes the connection prematurely. This is not the case. Therefore, I would contest the ticket status and would propose to move this to bug.
This is certainly unexpected for a framework that comes with a very sophisticated ORM and is probably not debuggable or mitigable event for most experienced developers.

comment:15 by Carlton Gibson, 3 years ago

Type: UncategorizedNew feature

I think it was originally specified as a New Feature since it's never been supported. (It seems moot: any patch will only be applied to the development branch.)

comment:16 by amureki, 14 months ago

Cc: amureki added

comment:17 by Kai Schlamp, 14 months ago

Cc: Kai Schlamp added

comment:18 by Kai Schlamp, 14 months ago

This issue also bit me. I have a long-running management command that accesses the database only a few times a day, and I often get an OperationalError that the database connection is already closed. I tried to work around it by using db.close_old_connections(), but then my tests using pytest failed as it also closed some other connections. I wonder if CONN_HEALTH_CHECKS makes a difference here, but I guess that it only works with a request inside a view.

comment:19 by Simon Charette, 14 months ago

This ticket relates to #14845 which ought to document the database connection-creation and management process.

If that can be of any help an established pattern when dealing with long running management commands is to call close_old_connections in strategic locations alike to how it's called between each HTTP requests that Django serves.

Last edited 14 months ago by Simon Charette (previous) (diff)

comment:20 by Ülgen Sarıkavak, 10 months ago

Cc: Ülgen Sarıkavak added

comment:21 by HaukurPall, 4 months ago

In our system we use FastAPI with uvicorn and the Django ORM (using psycopg). When the database is restarted Django does not reinitialize the connection and it is broken until the system is restarted. psycopg reports the connection as BAD but it is not reestablished.

I was able to resolve the issue by call close_old_connections but care needs to be taken in how it is done via FastAPI.

My solution was to create an async exception_handler and wrap the close_old_connections in a sync_to_async.

If the close_old_connections is not wrapped, it still runs but it has no observable effect. This function does call BaseDatabaseWrapper.close() which is marked as async_unsafe but still allows the close_old_connections to be called from async context. I don't understand how that is possible.

I also tried making the exception_handler synchronous, iterate over the connections and call BaseDatabaseWrapper.connect() on the default connection (we only have one) but then I noticed that the underlying threads started reporting different memory addresses for the connection. I.e. the underlying psycopg connection remains the same (BAD) but in my exception_handler the connection is ok and the connection is restarted, with a new connection object which is not used by other parts of django.

So this seems to also be a threading issue?

Note: See TracTickets for help on using tickets.
Back to Top