#32831 closed Cleanup/optimization (fixed)
Some cache tests are flaky (e.g. cache.tests.BaseMemcachedTests.test_touch)
Reported by: | Chris Jerdonek | Owned by: | Wassef Ben Ahmed |
---|---|---|---|
Component: | Core (Cache system) | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | Triage Stage: | Ready for checkin | |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description (last modified by )
The following tests occasionally fail in CI:
cache.tests.PyLibMCCacheTests.test_touch
cache.tests.PyMemcacheCacheTests.test_touch
    Traceback (most recent call last):
      File "/home/jenkins/workspace/pull-requests-bionic/database/sqlite3/label/bionic-pr/python/python3.8/tests/cache/tests.py", line 481, in test_touch
        self.assertIs(cache.touch('expire1'), True)
    AssertionError: False is not True
The line that fails is here: https://github.com/django/django/blob/ed3af3ff4b3cfb72de598f1b39a1028ba3826efb/tests/cache/tests.py#L481
Both test classes inherit from BaseMemcachedTests, which inherits from BaseCacheTests.
Change History (23)
comment:1 by , 3 years ago
Description: | modified (diff) |
---|
comment:2 by , 3 years ago
Triage Stage: | Unreviewed → Accepted |
---|
comment:3 by , 3 years ago
Description: | modified (diff) |
---|
comment:4 by , 3 years ago
Could this be due to different concrete subclasses of BaseMemcachedTests stepping on each other during a parallel run? I ask because different test classes are still using the same cache key.
comment:5 by , 3 years ago
Another cache-related flaky test is cache.tests.CacheMiddlewareTest.test_cache_page_timeout.
comment:6 by , 3 years ago
A possible pattern for fixing issues like this would be to keep retrying the assertion in a while loop, with a small sleep between attempts and a relatively large cumulative time limit before giving up. That would make the test pass as quickly as possible, without waiting longer than necessary. I'm thinking of something like:
    import time

    def assert_with_retries(self, assert_func, *args, interval_time=0.1, total_time=10, **kwargs):
        # Keep re-running the assertion until it passes, sleeping briefly between
        # attempts, and give up once total_time seconds have elapsed.
        start = time.time()
        while True:
            try:
                assert_func(*args, **kwargs)
            except Exception as exc:
                if time.time() - start > total_time:
                    raise RuntimeError('Total retry time exceeded.') from exc
            else:
                break
            time.sleep(interval_time)
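As a hypothetical usage note (not from the original comment): since the helper re-runs assert_func with the same already-evaluated arguments, the flaky expression itself would need to be wrapped in a callable so that it is re-computed on every attempt, e.g.:

    # Hypothetical call site: the lambda makes cache.touch() run again on each retry,
    # rather than being evaluated once before the helper is entered.
    self.assert_with_retries(lambda: self.assertIs(cache.touch('expire1'), True))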
comment:7 by , 3 years ago
Something else that occurred to me is that it could only help to start each test by asserting that each key being used isn't already set, and to end each test by clearing the key in a finally block. The former would help to show whether a test isolation issue is at fault.
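A minimal sketch of that pattern, assuming a hypothetical test body and reusing the 'expire1' key from the traceback above (cache.get, cache.set, cache.touch, and cache.delete are the standard Django cache API):

    def test_touch(self):
        # If another test class already wrote this key, fail loudly up front to
        # expose the isolation problem rather than a confusing later assertion.
        self.assertIsNone(cache.get('expire1'), "test isolation issue: 'expire1' already set")
        try:
            cache.set('expire1', 'value', timeout=10)
            self.assertIs(cache.touch('expire1'), True)
        finally:
            # Always clear the key so a failure here cannot leak into other tests.
            cache.delete('expire1')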
comment:8 by , 3 years ago
Summary: | Flaky test in CI: cache.tests.BaseMemcachedTests.test_touch → Some cache tests are flaky (e.g. cache.tests.BaseMemcachedTests.test_touch) |
---|
comment:9 by , 3 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:10 by , 3 years ago
Another flaky test failure: cache.tests.MemcachedCacheTests.test_forever_timeout(), at this line: https://github.com/django/django/blob/1697098795707bd39501d9ceddd3d9be93f8d549/tests/cache/tests.py#L606 (again with pull-requests-bionic/database=spatialite,label=bionic-pr,python=python3.8).
comment:11 by , 3 years ago
Owner: | removed |
---|---|
Status: | assigned → new |
comment:12 by , 10 months ago
I started some debugging efforts in this forum post and landed a fix for test_touch when using the filebased cache backend.
comment:13 by , 8 months ago
Cache flaky test in main-random:
    > git init /home/jenkins/workspace/main-random/database/spatialite/label/focal/python/python3.12 # timeout=10
    ...
    Using shuffle seed: 5744808359 (generated)
    ...
    ======================================================================
    FAIL [0.004s]: test_lru_get (cache.tests.LocMemCacheTests.test_lru_get)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/home/jenkins/workspace/main-random/database/spatialite/label/focal/python/python3.12/django/test/utils.py", line 446, in inner
        return func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
      File "/home/jenkins/workspace/main-random/database/spatialite/label/focal/python/python3.12/tests/cache/tests.py", line 1406, in test_lru_get
        self.assertEqual(cache.get(key), key)
    AssertionError: None != 0
    ----------------------------------------------------------------------
    Ran 17769 tests in 1613.662s

    FAILED (failures=1, skipped=1160, expected failures=5)
comment:14 by , 7 months ago
Another one, different but similar:
    ...
    Using shuffle seed: 5526860306 (generated)
    ...
    ======================================================================
    ERROR [0.003s]: test_lru_incr (cache.tests.LocMemCacheTests.test_lru_incr)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/home/jenkins/workspace/main-random/database/spatialite/label/focal/python/python3.12/django/test/utils.py", line 446, in inner
        return func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
      File "/home/jenkins/workspace/main-random/database/spatialite/label/focal/python/python3.12/tests/cache/tests.py", line 1433, in test_lru_incr
        self.assertEqual(cache.incr(key), key + 1)
                         ^^^^^^^^^^^^^^^
      File "/home/jenkins/workspace/main-random/database/spatialite/label/focal/python/python3.12/django/core/cache/backends/locmem.py", line 70, in incr
        raise ValueError("Key '%s' not found" % key)
    ValueError: Key ':1:0' not found
    ----------------------------------------------------------------------
    Ran 17776 tests in 1566.550s

    FAILED (errors=1, skipped=1161, expected failures=5)
comment:15 by , 7 months ago
I get a sense that a lot of these failures are due to tests using the same global store at the same time, as Chris pointed out. A strong supporting signal is that cache.clear is called at setUp time, which (e.g. in the case of the memcached backend tests) could cause keys to be unexpectedly evicted.
A solution here could be to run these test cases in their own separate key space (e.g. with a cache key prefix, or with two distinct memcached instances), or to serialize them so that no tests share the same resource (e.g. using a multiprocessing.Lock acquired in setUp and released in tearDown). A sketch of the key-prefix idea is shown below.
There might be more to it though, as I'm unsure what the djangoci setup looks like. Is there one memcached / Redis instance per job, or are they shared among them?
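A minimal sketch of the key-prefix approach, under the assumptions that KEY_PREFIX is set through Django's standard CACHES setting, that a local memcached instance is listening on the default port, and that the class and prefix names below are purely illustrative:

    from django.test import SimpleTestCase, override_settings

    # Illustrative only: give this test class its own key space so that a parallel
    # run of another memcached-backed test class cannot evict or overwrite its keys.
    @override_settings(
        CACHES={
            "default": {
                "BACKEND": "django.core.cache.backends.memcached.PyMemcacheCache",
                "LOCATION": "127.0.0.1:11211",        # assumed local memcached instance
                "KEY_PREFIX": "pymemcache_isolated",  # hypothetical per-class prefix
            }
        }
    )
    class IsolatedPyMemcacheTests(SimpleTestCase):
        ...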
comment:16 by , 5 months ago
Owner: | set to |
---|---|
Status: | new → assigned |
comment:17 by , 4 months ago
To my understanding, so far we don't know exactly which tests are flaky: mainly the cases mentioned above, plus anything that uses a shared medium.
As @Simon Charette suggested, this could be addressed by figuring out where to set locks, though he also pointed out that it might get tricky depending on how djangoci works (I have no clue). We might need to handle the lock differently depending on the platform running the test, or use something like "portalocker", which covers cross-platform locks but forces the adoption of a new dependency.
Expanding on the earlier suggestion by @Chris Jerdonek about assertion retries: why not apply the retry at the level of the whole test instead of individual assertions, as you'd typically do with some pytest extensions? Then we'd be assured the whole test passed, without having to reason line by line about what went wrong.
An implementation goes something like:
    import time
    from functools import wraps

    def retry(retries=3, delay=1):
        # Re-run the decorated test up to `retries` times, sleeping `delay`
        # seconds between attempts, before letting the failure propagate.
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                attempts = 0
                while attempts < retries:
                    try:
                        return func(*args, **kwargs)
                    except AssertionError:
                        attempts += 1
                        if attempts >= retries:
                            raise
                        time.sleep(delay)
            return wrapper
        return decorator
Only AssertionError is caught and retried, so other exceptions still propagate immediately. I'm not sure that's the right scope, though (maybe a shared medium causes you to pluck a None or something, which could surface as a different exception type, like the ValueError in comment:14).
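A hypothetical usage sketch of that decorator on one of the failing tests (the class and method names come from the failures above; how the real test classes are wired up may differ):

    # Hypothetical override: the whole test method is re-run on AssertionError,
    # so a transient eviction by a concurrent test class no longer fails the build.
    class PyMemcacheCacheTests(BaseMemcachedTests, TestCase):
        @retry(retries=3, delay=1)
        def test_touch(self):
            super().test_touch()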
comment:18 by , 4 months ago
Has patch: | set |
---|
comment:19 by , 3 months ago
Patch needs improvement: | set |
---|
comment:20 by , 3 months ago
Patch needs improvement: | unset |
---|
comment:21 by , 2 months ago
Triage Stage: | Accepted → Ready for checkin |
---|
Yes, these tests are flaky (and probably 2-3 more cache tests). We tried to fix them in the past but without success (see PR).