Opened 10 years ago
Last modified 2 months ago
#24810 new New feature
Reopen database connection automatically when no transaction is active
Reported by: | Aymeric Augustin | Owned by: | nobody |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | depaolim@…, binary@…, Marti Raudsepp, dms-cat, amureki, Kai Schlamp, Ülgen Sarıkavak | Triage Stage: | Accepted |
Has patch: | yes | Needs documentation: | no |
Needs tests: | yes | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
This is a follow-up of #15802.
When the database closes the connection, for instance because the database server was restarted, an exception will be raised the next time user code attempts to use the database. Django doesn't attempt to do anything about it.
This is expected to result in at most one 500 error for each thread that serves requests, since database connections are thread-local. This doesn't sound horrific. After all the database was restarted while the website was serving requests.
It can also cause management commands to crash, which isn't the end of the world either since management commands should fall into one of the following categories:
- short-lived, idempotent and scheduled to run regularly
- long-lived and supervised to run constantly
- run manually
If the connection was in autocommit mode and an operation fails because the database has closed the connection, theoretically, it's safe the reopen the connection and retry the operation.
In that case, Django could continue to operate transparently instead of raising an exception.
This may not be easy to implement.
Change History (21)
comment:1 by , 10 years ago
comment:2 by , 10 years ago
Cc: | added |
---|
comment:3 by , 9 years ago
Yes, I have the same problem in the daemon tasks that access DB through django models.
I don't see a problem with the connection state, actually, as we can say to do all the necessary preparations in the connection_created signal and then newly created connection can be set up correctly?
comment:4 by , 9 years ago
Cc: | added |
---|
comment:5 by , 8 years ago
It should be noted that in the case where a daemon is accessing the DB via the ORM, if the db connection is closed for some reason (usually the db server restarting), then ALL subsequent DB read queries FAIL. At that point, the process can no longer perform any read query since the db connection is closed. I'm not concerned with retrying statements that failed during the time the db was down, but it does seem like a bug that the ORM completely fails to reconnect AT ALL. The funny thing is: if and only if the process happens to perform a DB write after the DB connection gets closed, then the ORM will open a new connection.
follow-up: 7 comment:6 by , 8 years ago
Yes, these are the reasons why this ticket exists. Patches *that do not ignore transactional integrity* welcome.
comment:7 by , 8 years ago
Replying to aaugustin:
Yes, these are the reasons why this ticket exists. Patches *that do not ignore transactional integrity* welcome.
So I gave it a humble try.
There is no problem in case the trouble happened inside a transaction because in case error happened and was caught by the Atomic, the connection will be closed in one of the places like this. And next transaction will work without any problem as connection will be re-established because django is now aware that the current one is not an option any more.
It's still a problem in autocommit mode though because in this case connection won't be closed and will remain in a broken state, so in case we try to obtain a cursor again, we'll get the InterfaceError exception. So the code like:
def do_query(): return Model.objects.first() def loop(): while True: try: print do_query() except: print 'shit happened' sleep(5)
will stop working after first disconnect.
At first I tried to intercept InterfaceError at the moment we are obtaining a cursor and OperationalErorr at the execute method, but then I realized that such a problem may happen in other methods, like fetchone or anywhere else, so eventually I came to just intercept it in the DatabaseErrorWrapper and close the connection if one of those errors occured but only in case we are not inside a transaction, as this case is covered already and it may break that code. Here's the commit: https://github.com/jbinary/django/commit/568a4eb573c7fce2fdec2d1578b2db4f3d0e51fd
After that, that code above starts to work even after disconnect.
I have no any idea if this is the right approach, and I have no idea how to write tests for this yet. But if anyone will tell this one is a worthy approach, I'll try to think about tests too. (but advices are welcome, ofc)
comment:8 by , 8 years ago
Has patch: | set |
---|---|
Needs tests: | set |
comment:9 by , 6 years ago
Cc: | added |
---|
comment:10 by , 5 years ago
Cc: | added |
---|
follow-up: 12 comment:11 by , 4 years ago
I've been trying to address this with no luck.
My main problem is accurately reproducing this issue -- it seems to happen from time to time in production for me. The DB connection seems to die for some reason, and the next request fails with InterfaceError
.
However, testing locally I tried to reproduce this by:
- Start Django and Postgres.
- Execute a Request (which does a DB call). It works!
- Soft-exit Postgres
- Execute a Request (which does a DB call). It still works! 😓
I tried as an alternative killing postgres in (3), but the result is the same.
As per Sentry, I'm hitting this error a few hundred times a month, so I'd really like to address this.
Any hints on how to accurately reproduce this?
comment:12 by , 4 years ago
Replying to Hugo Osvaldo Barrera:
As per Sentry, I'm hitting this error a few hundred times a month, so I'd really like to address this.
Any hints on how to accurately reproduce this?
Hello Hugo,
I'm seeing this less than once a month, and I would like to see improvement, but my time is limited.
Today's occurrence was not related to database outage, or maintenance, which seems to be the usual glib explanation for the error.
My application is used cyclically, in a particular season. The database is NOT in the same google region as the application server. Yes, that's not ideal, and I may need to correct this before the coming season.
I suspect that progress on this may be helped by Gremlins, mentioned here: https://www.qualimente.com/2016/04/26/introduction-to-failure-testing-with-docker/
If gremlins could force the error to occur, there would be a chance to develop solutions.
comment:13 by , 3 years ago
Hugo,
you should be able to reproduce this when restarting PostgreSQL after getting the cursor, i.e.:
from django.db import connection cursor = connection.cursor() __import__("time").sleep(5) # restart postgresql cursor.execute("select 1")
This then fails with the following for me on Django 3.2.8:
Traceback (most recent call last): File "…/django/db/backends/utils.py", line 82, in _execute return self.cursor.execute(sql) psycopg2.errors.AdminShutdown: terminating connection due to administrator command server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The above exception was the direct cause of the following exception: Traceback (most recent call last): File "<console>", line 1, in <module> File "…/django/db/backends/utils.py", line 66, in execute return self._execute_with_wrappers(sql, params, many=False, executor=self._execute) File "…/django/db/backends/utils.py", line 75, in _execute_with_wrappers return executor(sql, params, many, context) File "…/django/db/backends/utils.py", line 84, in _execute return self.cursor.execute(sql, params) File "…/django/db/utils.py", line 90, in __exit__ raise dj_exc_value.with_traceback(traceback) from exc_value File "…/django/db/backends/utils.py", line 82, in _execute return self.cursor.execute(sql) django.db.utils.OperationalError: terminating connection due to administrator command server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request.
(Note that in this case the first exception is of type AdminShutdown
, whereas on Sentry I see psycopg2.OperationalError: server closed the connection unexpectedly
there.)
Just for reference: there exists https://github.com/jdelic/django-dbconn-retry, but it has its (open) issues also.
comment:14 by , 3 years ago
Type: | New feature → Uncategorized |
---|
I can second this issues. We see this happening thousands of times, but still at random (we have a lot of traffic). We have a relatively small stack, and I scanned it completely to see if we have any place that closes the connection prematurely. This is not the case. Therefore, I would contest the ticket status and would propose to move this to bug.
This is certainly unexpected for a framework that comes with a very sophisticated ORM and is probably not debuggable or mitigable event for most experienced developers.
comment:15 by , 2 years ago
Type: | Uncategorized → New feature |
---|
I think it was originally specified as a New Feature since it's never been supported. (It seems moot: any patch will only be applied to the development branch.)
comment:16 by , 13 months ago
Cc: | added |
---|
comment:17 by , 12 months ago
Cc: | added |
---|
comment:18 by , 12 months ago
This issue also bit me. I have a long-running management command that accesses the database only a few times a day, and I often get an OperationalError that the database connection is already closed. I tried to work around it by using db.close_old_connections(), but then my tests using pytest failed as it also closed some other connections. I wonder if CONN_HEALTH_CHECKS makes a difference here, but I guess that it only works with a request inside a view.
comment:19 by , 12 months ago
This ticket relates to #14845 which ought to document the database connection-creation and management process.
If that can be of any help an established pattern when dealing with long running management commands is to call close_old_connections
in strategic locations alike to how it's called between each HTTP requests that Django serves.
comment:20 by , 8 months ago
Cc: | added |
---|
comment:21 by , 2 months ago
In our system we use FastAPI with uvicorn and the Django ORM (using psycopg). When the database is restarted Django does not reinitialize the connection and it is broken until the system is restarted. psycopg reports the connection as BAD but it is not reestablished.
I was able to resolve the issue by call close_old_connections but care needs to be taken in how it is done via FastAPI.
My solution was to create an async exception_handler and wrap the close_old_connections in a sync_to_async.
If the close_old_connections is not wrapped, it still runs but it has no observable effect. This function does call BaseDatabaseWrapper.close() which is marked as async_unsafe but still allows the close_old_connections to be called from async context. I don't understand how that is possible.
I also tried making the exception_handler synchronous, iterate over the connections and call BaseDatabaseWrapper.connect() on the default connection (we only have one) but then I noticed that the underlying threads started reporting different memory addresses for the connection. I.e. the underlying psycopg connection remains the same (BAD) but in my exception_handler the connection is ok and the connection is restarted, with a new connection object which is not used by other parts of django.
So this seems to also be a threading issue?
It must be noted that SQL connections can have state even when not in a transaction. These could be for example session variables, per-connection configuration (SET search_path = xxx), temporary tables etc. While Django doesn't use any of these directly, breaking Django for users who do use connection state should be avoided.
If we are going to reopen connections automatically we need some safe-guards against above problems. As we can't detect automatically when users have session state in their connections, the users should tell us about this. I think opt-out (or maybe opt-in?) flag in the connection's settings could work. Something like set
OPTIONS['stateful_connection'] = True
, and Django will not try to reopen failed connections.Other than the above remark I don't see a problem with this approach. Still, without further evidence that these kinds of failures are common, I believe this will require too much complexity compared to what is gained.