Opened 12 years ago

Closed 12 years ago

Last modified 12 years ago

#20485 closed Cleanup/optimization (fixed)

Find faster ways for test fixture loading

Reported by: Anssi Kääriäinen Owned by: Aymeric Augustin
Component: Testing framework Version: dev
Severity: Normal Keywords:
Cc: Triage Stage: Ready for checkin
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Currently Django's test suite uses about 70s for fixture loading when running full test suite on in-memory SQLite. Considering that after #20483 Django's test suite takes around 210s in total, this means 1/3 of time is used for fixture loading. Most of the time in fixture loading is used for finding the fixture file. Django needs to do effectively this for each file, for each test:

for each app in INSTALLED_APPS:
    for each suffix in combo(('json', 'yaml', 'xml') X ('bz2', 'gz', 'zip')):
        try to open file() - if succeeds add to fixture files, else skip

Unsurprisingly this is expensive. In addition fixture loading after the file has been found isn't that fast, it is actually somewhat faster to create the models by SomeModel.objects.create() than by fixtures.

I have done some testing and it shows that of the 70s, around 50-60s should be avoidable.

Some options:

  • Somehow use absolute paths for the fixtures. Maybe by using path(__file__) + '/fixtures/' + fixture_name. A subset of this is to always use, say gz compressed files which removes the need to check for combos.
  • Limit fixture searching to current app (if requested): maybe by something like './somefixture' -> search only from current app. 'somefixture' -> search everywhere.
  • Cache the existing fixtures. That is, go only once through the directories, check what files exists and cache that information, then check against the cached information instead of trying to open each possible combo in each possible directory.
  • Do not use fixtures for Django's test data.

Of the above I like the last option as I find reading models created in Python code much easier than definitions in fixtures (especially, what relations does given instance have). Unfortunately converting the fixtures takes time. In addition, I am sure some developers favour fixtures instead of models created in Python code.

Marking as accepted as in "lets see what can be done here".

Attachments (1)

20485-prewalk-fixtures-dirs.diff (8.2 KB ) - added by Aymeric Augustin 12 years ago.

Download all attachments as: .zip

Change History (7)

comment:1 by Anssi Kääriäinen, 12 years ago

I created some proof-of-concept code for this, see: https://github.com/akaariai/django/commit/448b5d76d30b5af4682702ec21c8fe85cc1bb7d3 (applies over isolated_apps patch).

Runtime of Django's test suite is reduced from ~210 to ~160. So, around 50 seconds gone.

The code isn't commit quality. It uses a syntax of fixtures = [(__file__, 'fixturename.json')] to indicate that fixtures from a directory relative to current test class should be loaded. This causes a couple of problems, but is enough to demonstrate the speed gain. Better ideas & implementations welcome.

I am not sure if my initial analysis of all the suffix combinations causing the problem is correct. While going through the tests it became evident that there are multiple fixtures with the same name in different test apps. If I am mistaken any time 'testdata.json' is defined as a fixture for a test class, then all different testdata.json files will be loaded. Another similar issue is that there are a couple of 'initial_data.json' files in the test suite, and so every flush will load these files.

comment:2 by Aymeric Augustin, 12 years ago

Would it be faster to call os.listdir on the fixtures directory of each application, and then do everything in Python instead of hitting the filesystem repeatedly?

comment:3 by Aymeric Augustin, 12 years ago

I'm attaching a patch that caches the contents of fixture directories in memory and checks this cache before attempting to hit the filesystem.

This patch makes the test suite 10% faster (540s vs. 600s) on my laptop with a SSD.

I didn't include the current directory in the list of directories whose content is cached, because it could contain a lot of files, and the cache would then use lots of memory for no good reason. A regression ensues: fixtures cannot be found the current directory. (Two tests fail because of this.)

This could be fixed by adding an explicit search inside the current directory if no fixture is found... but then the patch starts looking bad, and I didn't go further.


I think I have a better idea to make fixture search decently fast.

Here's how I would load a given fixture label:

  1. Take the basename of the label and build all the combinations of format, compression, etc. These are the targets we're looking for.
  2. Build a list of prefixes:
    • If label is a relative path, [<app>/fixtures/<label> for each app] + [<fixture_dir>/<label> for each fixture dir] + [<label>]
    • If label is an absolute path, [<label>]
  3. For each prefix, run glob.glob(<prefix>*) to obtain a list of candidates, and split them in dirname + basename. The basenames are the candidates.
  4. Check the intersection of the set of targets and the set of candidates; these are the fixtures that must be loaded.
  5. Once this is done for all prefixes, cache the list of all fixtures found for this label.

This minimizes filesystems hits by making the search for a single label efficient, and by performing it only once.

by Aymeric Augustin, 12 years ago

comment:4 by Aymeric Augustin, 12 years ago

Has patch: set
Owner: changed from nobody to Aymeric Augustin
Status: newassigned
Triage Stage: AcceptedReady for checkin

I created a pull request with the technique described in my previous comment. The speedup is still around 10% as expected.

https://github.com/django/django/pull/1237

Feedback welcome! Yes, it's a major refactoring of loaddata, I know it's quite hard to review...

comment:5 by Aymeric Augustin <aymeric.augustin@…>, 12 years ago

Resolution: fixed
Status: assignedclosed

In 51aa000378c00a442273f01142acdebc94dec68d:

Fixed #20485 -- Refactored loaddata for speed.

Thanks Anssi for reporting this performance bottleneck.

comment:6 by Aymeric Augustin <aymeric.augustin@…>, 12 years ago

In 6900cb79dcbc502e0b18476e53ad493b7069ffb8:

Fixed small regression from 51aa000378.

A test failed if the path to the Django checkout contained a dot.

Refs #20485.

Note: See TracTickets for help on using tickets.
Back to Top