Opened 11 years ago
Closed 11 years ago
#21597 closed Bug (wontfix)
(2006, 'MySQL server has gone away') in django1.6 when wait_timeout passed
Reported by: | Owned by: | Aymeric Augustin | |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | 1.6 |
Severity: | Normal | Keywords: | mysql |
Cc: | Tim Graham, victorgama, markus.magnuson@… | Triage Stage: | Accepted |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | yes |
Easy pickings: | no | UI/UX: | no |
Description (last modified by )
EDIT -- THE SOLUTION TO THIS PROBLEM IS EXPLAINED IN COMMENT 29. JUST DO WHAT IT SAYS. THANK YOU!
In django 1.6, when the wait_timeout passed (of mysql), then DB access cause the (2006, 'MySQL server has gone away') error.
This was not the case in django 1.5.1
I've noticed this error when using workers that run the django code (using gearman).
To reproduce:
Set the timeout to low value by editing /etc/mysql/my.cnf
add the following under [mysqld]
wait_timeout = 10
interactive_timeout = 10
Then
% python manage.py shell >>> # access DB >>> import django.contrib.auth.models >>> print list(django.contrib.auth.models.User.objects.all()) >>> import time >>> time.sleep(15) >>> print list(django.contrib.auth.models.User.objects.all())
Now you get the error.
Simple solution I found on the web is to call django.db.close_connection() before the access
>>> import django.db >>> django.db.close_connection() >>> print list(django.contrib.auth.models.User.objects.all())
works ok.
Attachments (1)
Change History (45)
by , 11 years ago
comment:1 by , 11 years ago
comment:3 by , 11 years ago
Description: | modified (diff) |
---|
This might be a side effect of using autocommit. Since Django 1.5 kept a transaction open in that case, MySQL couldn't close the connection. Now it can.
Could you try the following snippet to confirm my hypothesis?
>>> import django.contrib.auth.models >>> from django.db import transaction >>> with transaction.atomic(): ... print list(django.contrib.auth.models.User.objects.all()) ... import time ... time.sleep(15) ... print list(django.contrib.auth.models.User.objects.all())
You shouldn't get a timeout when you run this on Django 1.6.
comment:4 by , 11 years ago
I ran into the same issue after upgrading to Django 1.6. In my case it affects long-running (multiple days) WorkerThreads, although I am not using any wait_timeout setting. It looks like the connection just times out without Django noticing/handling it.
Code which is triggering this:
cursor = connections['legacydb'].cursor() cursor.execute(query) result = cursor.fetchall() cursor.close()
The above code works for a few hours and after a while the OperationalError 2006 is triggered on cursor.execute(). This behavior was not present in early Django versions and is a critical regression (aka: code which used to work now raises an Exception → severity should be set to release blocker).
BTW: The legacydb in the example does not support transactions (MyISAM).
comment:5 by , 11 years ago
AFAIK the default timeout on MySQL is 28800 seconds (8 hours).
http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_wait_timeout
comment:6 by , 11 years ago
Triage Stage: | Unreviewed → Accepted |
---|
@aaugustin: even with transaction.atomic()
(as of comment:3), I'm able to reproduce the failure.
comment:7 by , 11 years ago
Query comparison (as in OP example, without using transaction.atomic()
):
Django 1.5 | Django 1.6 |
---|---|
562 Connect root@localhost on test_ 562 Query SET NAMES utf8 562 Query set autocommit=0 562 Query SET SQL_AUTO_IS_NULL = 0 562 Query SELECT `auth_user`.`id`, ... FROM `auth_user` 563 Connect root@localhost on test_ 563 Query SET NAMES utf8 563 Query set autocommit=0 563 Query SET SQL_AUTO_IS_NULL = 0 563 Query SELECT `auth_user`.`id`,... FROM `auth_user` |
560 Connect root@localhost on test_ 560 Query SET NAMES utf8 560 Query set autocommit=0 560 Query SET SQL_AUTO_IS_NULL = 0 560 Query set autocommit=1 560 Query set autocommit=0 560 Query SELECT `auth_user`.`id`, ... FROM `auth_user` 561 Connect root@localhost on test_ 561 Query SET NAMES utf8 561 Query set autocommit=0 561 Query SET SQL_AUTO_IS_NULL = 0 561 Query set autocommit=1 561 Query SELECT `auth_user`.`id`, ... FROM `auth_user` |
comment:8 by , 11 years ago
Could this actually be a failure to re-establish a connection to MySQL after receiving the 2006 error? I get the same symptoms in production now when a MySQL server is restarted - all workers start getting 2006 errors and do not recover. Only gunicorn restart helps then. You could reproduce it locally by calling "print list(django.contrib.auth.models.User.objects.all())" every second and then restarting MySQL server. The connection does not recover even after the server is back up and accepting connections. This was tested with CONN_MAX_AGE > 0 but lower than wait_timeout on the MySQL server.
comment:9 by , 11 years ago
For now we are using the following workaround for long-running workers:
from django.db import connection ... def is_connection_usable(): try: connection.connection.ping() except: return False else: return True ... def do_work(): while(True): # Endless loop that keeps the worker going (simplified) if not is_connection_usable(): connection.close() try: do_a_bit_of_work() except: logger.exception("Something bad happened, trying again") sleep(1)
comment:10 by , 11 years ago
Maybe we can validate that connection is usable in ensure_connection ?
something like this:
*** Django-1.6.1/django/db/backends/__init__.py 2014-01-23 16:57:15.927687924 +0400 --- /usr/local/lib/python2.7/dist-packages/django/db/backends/__init__.py 2014-01-23 16:56:21.000000000 +0400 *************** *** 119,125 **** """ Guarantees that a connection to the database is established. """ ! if self.connection is None or not self.is_usable(): with self.wrap_database_errors: self.connect() --- 119,125 ---- """ Guarantees that a connection to the database is established. """ ! if self.connection is None: with self.wrap_database_errors: self.connect()
are there any caveats?
comment:12 by , 11 years ago
Hi,
to me it seems to be a bug. Old django would close every connection right away, django 1.6 checks with CONN_MAX_AGE:
- It gets CONN_MAX_AGE from DATABASES, sets close_at:
max_age = self.settings_dict['CONN_MAX_AGE'] self.close_at = None if max_age is None else time.time() + max_age
- Actually the code above affects close_if_unusable_or_obsolete, which closes the connection if 'self.close_at is not None and time.time() >= self.close_at'
- close_if_unusable_or_obsolete itself is being called by close_old_connections, which in turn is a request handler for signals.request_started and signals.request_finished.
We have a worker, which is effectively a django app but but it doesn't process any HTTP requests. In fact that makes all connections persistent because close_old_connections never gets called.
Please advise.
Thanks
comment:13 by , 11 years ago
Hi,
Without transactions you hit the Gone Away if the sleep is longer than MySQL's wait_timeout:
mysql> set global wait_timeout=10;
>>> import django.contrib.auth.models >>> import time >>> print list(django.contrib.auth.models.User.objects.all()) >>> time.sleep(15) >>> print list(django.contrib.auth.models.User.objects.all())
According to MySQL/python documentation this should not be a problem. If you add the INTERACTIVE bit to the client connection flags in db/backends/mysql/base.py, you regain the MySQL drivers' auto-reconnect feature and everything works as before (I think that in Django 1.5 you ran into trouble with a transaction that ran longer than wait_timeout too).
from kwargs['client_flag'] = CLIENT.FOUND_ROWS
to kwargs['client_flag'] = CLIENT.FOUND_ROWS | CLIENT.INTERACTIVE
I haven't looked into the origins of this line, but maybe it is the real culprit for the recent Gone Away issues.
comment:14 by , 11 years ago
Hi Jeroen!
It seems like adding CLIENT.INTERACTIVE flag just tells the driver to switch from checking on wait_timeout to taking interactive_timeout into account. I set interactive_timeout=10 and was able to reproduce this problem.
Both of these values are 8 hours by default, but once your code has been inactive for that long, mysql drops the connection and the client fails next time it tries to access some data. It looks perfectly right to catch this error in the code, call django.db.close_connection() every time or whatever, but I think that maybe connection persistence logic needs a bit of fine-tuning so that we can control persistence without relying on signals.request_started/request_finished.
comment:16 by , 11 years ago
I've checked with my existing long running processes on Django 1.5 installations, running the same mysql-python and libmysqlclient.so, with tcpdump:
They do set the interactive flag on the MySQL connection. That explains for me why I never experienced the Gone Away's before. Django doesn't notice that the underlying connection went away and came back.
I haven't had enough time to find out (dig down deep enough) what makes this flag appear on the connection in the 1.5 situation and what changed in 1.6 that is relevant to this problem. (a) My suspicion is that it isn't directly related to the connection persistence mechanism. (b) To me it doesn't seem to be in any way related to the transactions mechanisms: My transactions happen fast enough and it's fine that things break if the transaction takes longer than wait_timeout (e.g. more than two minutes); The same application that works fine in 1.5 also works in 1.6 with the interactive flag set.
JeroenP
comment:17 by , 11 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:18 by , 11 years ago
https://github.com/django/django/pull/2454 I've submitted pull request
comment:19 by , 11 years ago
Cc: | added |
---|---|
Has patch: | set |
When I run the MySQL tests using djangocore-box, I get the traceback below when the test suite concludes. The PR above resolves this error.
Traceback (most recent call last): File "/django/tests/runtests.py", line 374, in <module> options.failfast, args) File "/django/tests/runtests.py", line 216, in django_tests test_labels or get_installed(), extra_tests=extra_tests) File "/django/django/test/runner.py", line 149, in run_tests self.teardown_databases(old_config) File "/django/django/test/runner.py", line 124, in teardown_databases connection.creation.destroy_test_db(old_name, self.verbosity) File "/django/django/db/backends/creation.py", line 452, in destroy_test_db self._destroy_test_db(test_database_name, verbosity) File "/django/django/db/backends/creation.py", line 466, in _destroy_test_db % self.connection.ops.quote_name(test_database_name)) File "/django/django/db/backends/utils.py", line 59, in execute return self.cursor.execute(sql, params) File "/django/django/db/utils.py", line 94, in __exit__ six.reraise(dj_exc_type, dj_exc_value, traceback) File "/django/django/db/backends/utils.py", line 57, in execute return self.cursor.execute(sql) File "/django/django/db/backends/mysql/base.py", line 128, in execute return self.cursor.execute(query, args) File "/home/vagrant/.virtualenvs/py2.7/local/lib/python2.7/site-packages/MySQLdb/cursors.py", line 174, in execute self.errorhandler(self, exc, value) File "/home/vagrant/.virtualenvs/py2.7/local/lib/python2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler raise errorclass, errorvalue django.db.utils.OperationalError: (2006, 'MySQL server has gone away')
comment:20 by , 11 years ago
Patch needs improvement: | set |
---|
I believe this patch breaks the transactional integrity guarantees made by transaction.atomic
.
Django mustn't attempt to reconnect until the connection has been dropped properly, possibly after waiting for the exit of an atomic block.
comment:21 by , 11 years ago
A fix for this is still badly required, as this is a horrible regression. Please mark this bug as a 'release blocker', as it breaks existing code and makes upgrading to Django 1.6 impossible for quite a few people.
follow-up: 24 comment:22 by , 11 years ago
Severity: | Normal → Release blocker |
---|
Unless I missed something, this only affects people that have management commands running for more than 8 hours (or MySQL's connection timeout if it has been changed from the default). Is that correct?
comment:23 by , 11 years ago
Hi, this absolutely true for me. However I've managed to come up with a workaround, we've wrapped some of our code with decorators which implicitly call close_old_connections() routine, so that django now respects CONN_MAX_AGE parameter to some extent.
comment:24 by , 11 years ago
Replying to aaugustin:
Unless I missed something, this only affects people that have management commands running for more than 8 hours (or MySQL's connection timeout if it has been changed from the default). Is that correct?
As far as I can tell I think this is correct. At least this is the place where I was hitting this bug in my environment (before downgrading to 1.5 again because of this).
comment:25 by , 11 years ago
Maybe this helps someone like me who uses Django and SQLAlchemy. I was experiencing this problem every morning with an API (not used for several hours at night) which had CONN_MAX_AGE=3600
and MySQL server with wait_timeout = 28800
. I even changed settings which are used by manage.py
(to run migrations and collect static assets) to not use persistent connections CONN_MAX_AGE=0
, but the problem persisted. Then I noticed that this is actually happening in the code where SQLAlchemy accesses DB, so I googled and found this post: http://nathansnoggin.blogspot.co.uk/2013/11/integrating-sqlalchemy-into-django.html Followed the instructions in the blog post. Now that my SQLAlchemy does not handle connections (just uses whatever Django supplies) the problem is gone.
comment:26 by , 11 years ago
Having the same problem in Django 1.6.3 in a process that is run indefinitely via a management command. I am attempting a build now where at the start of each loop I call:
db.close_old_connections()
We'll see if this solves the problem or not.
comment:28 by , 11 years ago
Owner: | changed from | to
---|
comment:29 by , 11 years ago
Resolution: | → wontfix |
---|---|
Status: | assigned → closed |
Actually this is the intended behavior after #15119. See that ticket for the rationale.
If you hit this problem and don't want to understand what's going on, don't reopen this ticket, just do this:
- RECOMMENDED SOLUTION: close the connection with
from django.db import connection; connection.close()
when you know that your program is going to be idle for a long time. - CRAPPY SOLUTION: increase wait_timeout so it's longer than the maximum idle time of your program.
In this context, idle time is the time between two successive database queries.
comment:30 by , 11 years ago
Could someone give a little more context to this issue? I ran into it the very first morning after upgrading from 1.5.x to 1.6.x, I am guessing because the app was idle for a few hours (2am to 8am). I had 'CONN_MAX_AGE' set to 15mins, but have had wait_timeout and interactive_timeout set to 5mins.
So in my case, I don't exactly know my program is going to be idle for a long time, though my wait_timeout didn't seem that low. I've preemptively adjusted my wait_timeout to 30mins and my interactive_timeout to 60mins, but a bit more explanation of the issue. I don't yet see the relation to #15119.
Also, this issue should probably be mentioned in the docs somewhere.
comment:31 by , 11 years ago
My preferred solution is to increase the wait_timeout to 86400 (24hr) on sessions from processes that are long lived. Otherwise I have to do a close before any blocking call to other systems, e.g. redis, which may block for a month or return in a split second. These blocking calls are mostly in loops; Under load they are repeatedly called. I am not about to add a connection.close() call in those code paths. (And this is where the MySQL reconnect behavior worked fine, it only kicked in whén there was a timeout).
comment:32 by , 11 years ago
Cc: | added |
---|
I just opened an pull request that may address this issue: https://github.com/django/django/pull/2740
@aaugustin can you review it?
comment:33 by , 11 years ago
I believe this patch breaks the transactional integrity guarantees made by transaction.atomic
.
Django mustn't attempt to reconnect until the connection has been dropped properly, possibly after waiting for the exit of an atomic block.
(Yes, I've just copy-pasted comment 20, because that's the answer to all naive reconnection attempts. If this bug was that easy to fix, I would have done it.)
comment:34 by , 11 years ago
What is the correct solution for management commands that may take longer than the regular web queries? This issue was closed with more questions than answers.
comment:35 by , 11 years ago
I already answered in comment 29.
If people read answers instead of asking the same question again and again, it would remain more visible.
comment:36 by , 11 years ago
Description: | modified (diff) |
---|
comment:37 by , 11 years ago
Description: | modified (diff) |
---|
comment:38 by , 11 years ago
I know I'll be hated for this, but it actually happens even for regular web queries. We installed 1.6.5 on one of our production servers yesterday and we received a hundred or so tracebacks from some of our views. I can't do the math, but it's probably less then 1%. Still, with 1.5 we had no issues.
To give you an idea, the problem arises both from select and insert statements. They could take maybe fractions of second. I triple checked MySql variables and wait_timeout
is 28800, while connection_timeout
is 10 seconds (which, by the way, should raise a Lost connection to MySQL server
).
I tried switching from CONN_MAX_AGE = 0
to CONN_MAX_AGE = 2
and nothing changed, so maybe it's not related with the new persistent connection mechanism.
The traceback is slightly different from the one in comment 19:
Traceback (most recent call last): [...] File "/home/django/VIRTUALENVS/multi/lib/python2.7/site-packages/django/db/models/query.py", line 96, in __iter__ self._fetch_all() File "/home/django/VIRTUALENVS/multi/lib/python2.7/site-packages/django/db/models/query.py", line 857, in _fetch_all self._result_cache = list(self.iterator()) File "/home/django/VIRTUALENVS/multi/lib/python2.7/site-packages/django/db/models/query.py", line 220, in iterator for row in compiler.results_iter(): File "/home/django/VIRTUALENVS/multi/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 713, in results_iter for rows in self.execute_sql(MULTI): File "/home/django/VIRTUALENVS/multi/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 786, in execute_sql cursor.execute(sql, params) File "/home/django/VIRTUALENVS/multi/lib/python2.7/site-packages/django/db/backends/util.py", line 53, in execute return self.cursor.execute(sql, params) File "/home/django/VIRTUALENVS/multi/lib/python2.7/site-packages/django/db/utils.py", line 99, in __exit__ six.reraise(dj_exc_type, dj_exc_value, traceback) File "/home/django/VIRTUALENVS/multi/lib/python2.7/site-packages/django/db/backends/util.py", line 53, in execute return self.cursor.execute(sql, params) File "/home/django/VIRTUALENVS/multi/lib/python2.7/site-packages/django/db/backends/mysql/base.py", line 124, in execute return self.cursor.execute(query, args) File "/home/django/VIRTUALENVS/multi/lib/python2.7/site-packages/MySQLdb/cursors.py", line 205, in execute self.errorhandler(self, exc, value) File "/home/django/VIRTUALENVS/multi/lib/python2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler raise errorclass, errorvalue OperationalError: (2006, 'MySQL server has gone away')
comment:39 by , 11 years ago
Resolution: | wontfix |
---|---|
Status: | closed → new |
That's a different issue and a good reason to reopen the ticket until we figure it out.
Do you have any way to check if MySQL drops connections while a view function is executing? In that case, Django 1.5 reopens a connection automatically (possibly breaking transactional integrity) while Django 1.6 raises an error.
comment:40 by , 11 years ago
I was able to reproduce the error. First of all, while the global wait_timeout
was 28800, the session value was just 30 seconds.
That means the if the time between two queries in the same view is longer than that, or if a view runs just one query but after 30 seconds from its invocation (for example for a very long computation in a middleware or a forced sleep) it raises the infamous MySQL has gone away
.
For example:
def my_view(request): import time time.sleep(31) print User.objects.get(pk=1) return HttpResponse()
and I think this is the expected behavior.
The problem is that we run some pretty sophisticated monitoring application (namely New Relic) and the longest trace I can see is a 3.5 seconds view which was probably a glitch from MySQL (it took 3.1 seconds to select a user using his pk).
For now, we solved the most severe view error by closing the connection at the beginning of it (forcing the creation of a new one), but I'll keep monitoring the situation and I'll get back to you because I'm starting to think that it has to do with our particular configuration. Django 1.6 has been out for too long for us being the first to notice that.
comment:41 by , 11 years ago
Cc: | added |
---|
comment:42 by , 11 years ago
Severity: | Release blocker → Normal |
---|
It has been suggested to call close_if_unusable_or_obsolete before handling each HTTP request. However, I don't think that makes much sense. If the connection to MySQL times out between two HTTP requests, you should either stop using persistent connections (with CONN_MAX_AGE = 0) or increase wait_timeout. I don't like the idea of adding overhead to Django to compensate for inconsistent settings.
What I don't understand is that you tried CONN_MAX_AGE = 0 and that didn't help. I fail to see why closing the connection before each request would help if closing it after each request does nothing. Theoretically closing after a request is a safer behavior than closing before the next request -- you never run into timeouts. Could you double-check your tests?
comment:43 by , 11 years ago
We closed the connection at the beginning of the view, not at the beginning of the request. So, presumably, the timeout happened inside a middleware but it didn't bubble up before the first query in the view.
We ended up increasing wait_timeout (we kept it at a fairly low value as we had a large number of sleeping queries) and that solved the issue.
As far as I'm concerned I think that is exactly the expected behavior and as such the ticket can be closed.
comment:44 by , 11 years ago
Resolution: | → wontfix |
---|---|
Status: | new → closed |
Cool. Thanks for taking the time to investigate and report your results!
Are you using persistent connections (CONN_MAX_AGE > 0)?