Opened 2 months ago

Closed 5 weeks ago

Last modified 3 weeks ago

#35757 closed Bug (needsinfo)

memcached.PyMemcacheCache reentrancy problem with ASGI-based runserver

Reported by: Harm Verhagen Owned by:
Component: Core (Cache system) Version: 5.1
Severity: Normal Keywords: asgi, async
Cc: Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Harm Verhagen)

I found that using memcached.PyMemcacheCache with runserver has reentrancy problems.

I implemented a json api in django, when issuing multiple request simultaneously from javascript, the requests sometimes crash.

I have urls that do something like this.

from django.core.cache import cache

def example_url(request):
    cache.incr("some key", 1)   # count requests

    return JsonResponse(data=data, status=201)

When issuing 5-10 requests simultaneously on a runserver, it often (not always) get this crash


  File "..../views/api.py", line 2512, in example_url
    cache.incr("some key", 1)
  File "./venv/lib/python3.12/site-packages/django/core/cache/backends/memcached.py", line 110, in incr
    val = self._cache.incr(key, delta)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./venv/lib/python3.12/site-packages/pymemcache/client/hash.py", line 350, in incr
    return self._run_cmd("incr", key, False, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./venv/lib/python3.12/site-packages/pymemcache/client/hash.py", line 322, in _run_cmd
    return self._safely_run_func(client, func, default_val, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./venv/lib/python3.12/site-packages/pymemcache/client/hash.py", line 211, in _safely_run_func
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "./venv/lib/python3.12/site-packages/pymemcache/client/base.py", line 827, in incr
    results = self._misc_cmd([cmd], b"incr", noreply)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./venv/lib/python3.12/site-packages/pymemcache/client/base.py", line 1291, in _misc_cmd
    buf, line = _reader(self.sock, buf)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "./venv/lib/python3.12/site-packages/pymemcache/client/base.py", line 1658, in _readline
    buf = _recv(sock, RECV_SIZE)
          ^^^^^^^^^^^^^^^^^^^^^^
  File "./venv/lib/python3.12/site-packages/pymemcache/client/base.py", line 1750, in _recv
    return sock.recv(size)

 OSError: [Errno 9] Bad file descriptor

The following code looked a bit suspicious.

cache.backends.memcached.py:99 BaseMemcachedCache

    def close(self, **kwargs):
        # Many clients don't clean up connections properly.
        self._cache.disconnect_all()

Doesn't this race? closing a all connections, might close a connection of a parallel request that was just opened but not yet finished.

Notes

I'm not putting a heavy load on runserver, its a simple javascript app, that happens to do 10 requests on a single page. Django should be able to handle that, also in a dev environment

Workaround

Adding the --noasgi option seems to help.

./manage.py runserver --noasgi

Weird enough the (to me) more obvious, --nothreading option does _not_ help. That crashes in the same way.

NB: this workaround breaks django channels, so if you have websockets in your app, then this "workaround" is not usable.

Version info

Django==5.1.1
pymemcache==4.0.0
python 3.12.6

channels==3.0.5             <--- related?
channels-redis==3.4.1

Change History (22)

comment:1 by Harm Verhagen, 2 months ago

Description: modified (diff)

comment:2 by Harm Verhagen, 2 months ago

Description: modified (diff)

comment:3 by Harm Verhagen, 2 months ago

Description: modified (diff)

comment:4 by Harm Verhagen, 2 months ago

Summary: memcached.PyMemcacheCache reentrancy problem (OSError: [Errno 9] Bad file descriptor)memcached.PyMemcacheCache reentrancy problem with ASGI-based runserver

comment:5 by Harm Verhagen, 2 months ago

Description: modified (diff)

comment:6 by Harm Verhagen, 2 months ago

Description: modified (diff)

comment:7 by Harm Verhagen, 2 months ago

Description: modified (diff)

comment:8 by Harm Verhagen, 2 months ago

Its not just a reentrancy problem. Wrapping the calls to cache.incr() in a global mutex (threading.Lock(), just for testing purposes) does not seem to make any difference.

comment:9 by Harm Verhagen, 2 months ago

Description: modified (diff)

comment:10 by Sarah Boyce, 2 months ago

Resolution: invalid
Status: newclosed

This report seems better suited to be a support request. The best place to get answers to your issue is using any of the user support channels from this link.

Trac is for issues that are confirmed to be in Django. If, after debugging, you find out that this is indeed a bug in Django, please re-open with the specific details and please be sure to include a small Django project to reproduce or a failing test case.

Thank you!

comment:11 by Harm Verhagen, 2 months ago

Last edited 6 weeks ago by Harm Verhagen (previous) (diff)

comment:12 by Harm Verhagen, 7 weeks ago

I reproduced this problem with a small Django project.
It seems to be related with channels/daphne

As there is a testcase, I re-opened the ticket for re-evaluation.
Please advise if more information is required.

I created a new project according the to polls example. The problem occurs when adding 'channels'/'daphne' to INSTALLED_APPS (without even using channels).

polls.views.py

from django.http import HttpResponse
from django.core.cache import cache

def incr():
    key = 'example:key'
    try:
        val = cache.incr(key, 1)
    except ValueError:
        cache.set(key, 1)
        val = 1

    return val

def index(request):
    v = incr()
    return HttpResponse("Hello, world. You're at the polls index. %d" % v)

settings.py

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.PyMemcacheCache',
        # 'BACKEND': 'django.core.cache.backends.locmem.LocMemCache',           # using LocMemCache 'fixes' the problem
        'LOCATION': "localhost:11211"
    },
}

INSTALLED_APPS = [
    'daphne',     # must be before staticfiles   (removing this fixes the problem (with channels 4)
    'channels',   

    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
]

ASGI_APPLICATION = "mysite.asgi.application"                # required for channels

fire up runserver and browse to http://localhost:8000/polls/, and see everything working (no exceptions)
so far, so good....

Steps to reproduce:

run the following script get_parallel.sh

#!/bin/bash
#
# do requests in parallel
# https://code.djangoproject.com/ticket/35757
url='http://localhost:8000/polls/'

curl $url&
curl $url&
curl $url&
curl $url

Result

Exceptions as shown in original report. (OSError: [Errno 9] Bad file descriptor)
After a few runs runserver stops serving requests completely, all requests, even sequential ones from a browser, get a 500 with this traceback.

Traceback (most recent call last):
  File "venv/lib/python3.12/site-packages/asgiref/sync.py", line 518, in thread_handler
    raise exc_info[1]
  File "venv/lib/python3.12/site-packages/django/core/handlers/exception.py", line 42, in inner
    response = await get_response(request)
  File "venv/lib/python3.12/site-packages/asgiref/sync.py", line 518, in thread_handler
    raise exc_info[1]
  File "/venv/lib/python3.12/site-packages/django/core/handlers/base.py", line 253, in _get_response_async
    response = await wrapped_callback(
  File "venv/lib/python3.12/site-packages/asgiref/sync.py", line 468, in __call__
    ret = await asyncio.shield(exec_coro)
  File ".../Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/venv/lib/python3.12/site-packages/asgiref/sync.py", line 522, in thread_handler
    return func(*args, **kwargs)
  File "mysite/polls/views.py", line 20, in index
    incr()
  File "mysite/polls/views.py", line 11, in incr
    val = cache.incr(key, 1)
  File "venv/lib/python3.12/site-packages/django/core/cache/backends/memcached.py", line 110, in incr
    val = self._cache.incr(key, delta)
  File "venv/lib/python3.12/site-packages/pymemcache/client/hash.py", line 350, in incr
    return self._run_cmd("incr", key, False, *args, **kwargs)
  File "venv/lib/python3.12/site-packages/pymemcache/client/hash.py", line 314, in _run_cmd
    client = self._get_client(key)
  File "venv/lib/python3.12/site-packages/pymemcache/client/hash.py", line 182, in _get_client
    raise MemcacheError("All servers seem to be down right now")

pymemcache.exceptions.MemcacheError: All servers seem to be down right now

When this happens, only a restart the runserver makes the system serve requests again.

Version info

with channels 3 the problem appears (and disappears when removed) when adding 'channels' to INSTALLED_APPS
with channels 4 the problem appears when adding 'channels' & 'daphne' to INSTALLED_APPS, disappears when 'daphne' is commented out.

channels==4.1.0
daphne==4.1.2
Django==5.1.1
python 3.12.6

Version 7, edited 6 weeks ago by Harm Verhagen (previous) (next) (diff)

comment:13 by Harm Verhagen, 6 weeks ago

Resolution: invalid
Status: closednew

in reply to:  13 comment:14 by Natalia Bidart, 6 weeks ago

Component: UncategorizedCore (Cache system)
Keywords: asgi async added
Resolution: duplicate
Status: newclosed
Type: UncategorizedBug

Replying to Harm Verhagen:

Hello Harm! Thank you for the extra details. After some investigation, I believe this is a duplicate of #33625, so I'll be closing as such.

comment:15 by Harm Verhagen, 6 weeks ago

I don't think its a duplicate of #33625 (although it seems related)

I reopened for reconsideration.

Why this is different?

#33625 is an optimization problem. All works fine, but could be optimized for performance. The connection is opened/closed which adds (a bit of) latency to each request. (which might be acceptable, or not depending on your usecase)

This problem (#35757) however is about that the system actually breaks. When running the test above. runserver stops working completely, until restarted.
If you are developing a javascript app, that issues multiple requests simultaniously, then runserver (in this configuration) is unusabe.

Another reason why its no duplicate. I can reproduce the problem also with django.core.cache.backends.memcached.PyLibMCCache instead of '....memcached.PyMemcacheCache.
PyLibMCCache already does provide connection pooling, so the optimization requested in #33625 is not applicable for PyLibMCCache.

Reproducibility

Where you able to reproduce the problem with the example attached? (the problem that runserver actually gives 500's and eventually stops working completely)

Results with PyLibMCCache

NB: this might be a different problem, triggered by the same testcase

Result (PyLibMCCache)

When running requests in parallel, some of the requests get a 500.

Internal Server Error: /polls/
Traceback (most recent call last):
  File "venv/lib/python3.12/site-packages/asgiref/sync.py", line 518, in thread_handler
    raise exc_info[1]
  File "venv/lib/python3.12/site-packages/django/core/handlers/exception.py", line 42, in inner
    response = await get_response(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "venv/lib/python3.12/site-packages/asgiref/sync.py", line 518, in thread_handler
    raise exc_info[1]
  File "venv/lib/python3.12/site-packages/django/core/handlers/base.py", line 253, in _get_response_async
    response = await wrapped_callback(
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "venv/lib/python3.12/site-packages/asgiref/sync.py", line 468, in __call__
    ret = await asyncio.shield(exec_coro)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.12/3.12.6/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "venv/lib/python3.12/site-packages/asgiref/sync.py", line 522, in thread_handler
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "django-35757/mysite/polls/views.py", line 22, in index
    v = incr()
        ^^^^^^
  File "/tmp/dbg/django-35757/mysite/polls/views.py", line 11, in incr
    val = cache.incr(key, 1)
          ^^^^^^^^^^^^^^^^^^
  File "/Users/harmverhagen/Documents/symeon/intemo/sensoren/cityznz/cityzenz/api/venv/lib/python3.12/site-packages/django/core/cache/backends/memcached.py", line 110, in incr
    val = self._cache.incr(key, delta)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pylibmc.Error: 1 keys failed
HTTP GET /polls/ 500 [5.13, 127.0.0.1:63568]

NB: Sometimes I even noticed a python crash ! (with PyLibMCCache only)

HTTP GET /polls/ 200 [0.11, 127.0.0.1:63819]
Python(51613,0x17d037000) malloc: double free for ptr 0x13980a400
Python(51613,0x17d037000) malloc: *** set a breakpoint in malloc_error_break to debug

So, maybe the PyLibMCCache issue is something different. (broken pylibmc install on my mac, or something). If so ignore the PyLibMCCache part of this comment, and lets just focus on PyMemcacheCache, if that reproduces for you.

Last edited 6 weeks ago by Harm Verhagen (previous) (diff)

comment:16 by Harm Verhagen, 6 weeks ago

Resolution: duplicate
Status: closednew

in reply to:  16 comment:17 by Natalia Bidart, 6 weeks ago

Replying to Harm Verhagen:

Hello Harm, thanks for the extra details. I haven't been able to reproduce yet. While I re-try to do that, could you please confirm if this is reproducible with a production web server? (that means, without using the development server provided by runserver).

Thank you!

comment:18 by Harm Verhagen, 6 weeks ago

I've not seen this issue on production, only runserver.

We run production with
daphne <ourappname>.asgi:application -b 0.0.0.0 -p 8000
In a container, behind nginx. app configured to use PyMemcacheCache

Last edited 6 weeks ago by Harm Verhagen (previous) (diff)

comment:19 by Carlton Gibson, 6 weeks ago

Removing daphne (or channels in older versions) restores use of Django’s own runserver command, so at that point you’re no longer running ASGI.

I would say the comment about the race condition looks most likely, but I’m confused by the last remark that it doesn’t occur in production. (Daphe’s runserver command does very little except run daphne)

My memcached chops are not what they could be. (I haven’t used it in production since before 2010 IIRC so… 🤷) Short of a clear diagnosis, it’s hard to know what to suggest.

comment:20 by Harm Verhagen, 6 weeks ago

Removing daphne (or channels in older versions) restores use of Django’s own runserver command, so at that point you’re no longer running ASGI.

ok, clear. that explains why that 'fixes' the problem.

comment:21 by Carlton Gibson, 5 weeks ago

Resolution: needsinfo
Status: newclosed

I've not seen this issue on production, only runserver.

That still doesn’t make 100% sense to me.

Can I ask you to open an issue on the Daphne repo, providing a full reproduce project there please?

https://github.com/django/daphne

I’ll try and take a look, but short of something more concrete I can’t say that this is an issue in Django per se.
I’m going to mark this as needsinfo in the meantime.

comment:22 by Harm Verhagen, 3 weeks ago

Filed in daphne repo as issue 534

Last edited 3 weeks ago by Harm Verhagen (previous) (diff)
Note: See TracTickets for help on using tickets.
Back to Top