Opened 16 years ago

Closed 6 months ago

Last modified 6 months ago

#8307 closed Cleanup/optimization (fixed)

ImageFile use of width_field and height_field is slow with remote storage backends

Reported by: sebastian.serrano@… Owned by: Jacob
Component: File uploads/storage Version: dev
Severity: Normal Keywords: imagefile height width
Cc: Alan Justino da Silva, wiktor@…, walter+django@… Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

im attaching a patch that could make it work

Attachments (3)

images.py.diff (801 bytes ) - added by anonymous 16 years ago.
images.1.diff (995 bytes ) - added by sebastian.serrano@… 16 years ago.
images.py.2.diff (750 bytes ) - added by Alan Justino da Silva 13 years ago.
Let the storage backend disable touching the image file to get height/width

Download all attachments as: .zip

Change History (28)

by anonymous, 16 years ago

Attachment: images.py.diff added

comment:1 by anonymous, 16 years ago

milestone: 1.0

comment:2 by oyvind, 16 years ago

Has patch: set
Keywords: imagefile height width added
Triage Stage: UnreviewedDesign decision needed

Seems like a bug to me

comment:3 by Eric Holscher, 16 years ago

Is this the same bug as #8208 ?

comment:4 by anonymous, 16 years ago

no, is not the same bug. this one is related with the use of the width_field and height_field

comment:5 by Jacob, 16 years ago

Owner: changed from nobody to Jacob
Status: newassigned
Triage Stage: Design decision neededAccepted

comment:6 by Marty Alchin, 16 years ago

Needs documentation: set

This is a tough one. On one hand, I can appreciate the performance gain of using the cache fields, since they're already available, and they *should* always be in sync with what's on the file. But I can't help thinking that I like the idea of ImageFile.width and ImageFile.height always returning the actual dimensions of the file, no matter what.

If they used the cache fields and those ever get out of sync with the file itself (which is possible any number of ways, depending on your application), the only way to get the true dimensions would be to use django.core.files.images.get_image_dimensions manually. And, after all, isn't the point of having properties for width and height to avoid having to call that manually? Plus, using get_image_dimensions directly means you'll have to cache those values manually if you expect to need them more than once.

I suppose I'm -0 on this ticket; someone with a decent argument might be able to sway me. Either way, though, I think the docs could be a bit more explicit about where the width and height values come from.

comment:7 by sebastian.serrano@…, 16 years ago

About your worries, the width and height values are updated every time the imagefield save method is fired, they could only be out of sync from a bug or a wrong use of the FileStorage mechanism (like setting the ._name field attr manually).
Also I have been using the patch I submit and has been working fine for me.
Without this patch the people that use remote filestorage backends (like S3, nirvanix or mogilefs), suffer a big performance hit for having to load the entire file every time a Model with an imagefield gets loaded. That means in this scenario a lot of more traffic, less performance and also a monetary cost.

English is not my primary language, sorry if something is not clear.
PD: im attaching another patch that could be more safe.

by sebastian.serrano@…, 16 years ago

Attachment: images.1.diff added

comment:8 by Marty Alchin, 16 years ago

For the record, here are a few of the ways I know of that the two could get out of sync:

  • A bug in Django (as you mention) which shouldn't happen, but is still possible
  • Improperly using Django's API (as you mention) to cause a mismatch
  • An admin or staff member could manually update a file directly, without using Django
  • A cron job could rename, resize or otherwise alter files
  • An admin or staff member could manually update the database directly, without using the model's save() method
  • A monkeypatch (mistakenly or not) altered or removed the code to update width_field and height_field
  • A custom ImageFile subclass (mistakenly or not) sets the width_field and height_field to the wrong values
  • ... and the list goes on

As for the performance (and potentially monetary) hit when accessing files, the width and height shouldn't be loaded when retrieving the file; they only get populated when actually accessing them. If you find they are in fact getting accessed every time regardless of whether they're being used, that's a legitimate bug that should be filed separately.

I have no doubt that the attached patch works correctly for the behavior you want. I'm just not sure it's the behavior Django should use.

comment:9 by Jacob, 16 years ago

Resolution: wontfix
Status: assignedclosed

Yeah, I like the access methods always getting the *real* file sizes.

Think of it this way: Model.image_width is a number the model knows about, but to get the real story, check Model.image.width -- all sorts of things can happen on the disk behind your back!

comment:10 by Jacob, 13 years ago

milestone: 1.0

Milestone 1.0 deleted

comment:11 by Alan Justino da Silva, 13 years ago

Cc: Alan Justino da Silva added
Easy pickings: unset
Resolution: wontfix
Severity: Normal
Status: closedreopened
Summary: ImageFile doesn't use the width_field and height_field for cacheImageFile use of width_field and height_field is slow with remote storage backends
Type: Cleanup/optimization
UI/UX: unset

What about letting the storage to choose if is worth to touch the file just to get its height/width?

This solves the problem for me. And it is a huge problem: My EC2 instance with S3 backend wastes about 4sec more each pageview just to know the width/height of 8 photos. And looks like the absent of this info harms not my app.

Is this an acceptable solution?

def _get_image_dimensions(self):
    if not hasattr(self, '_dimensions_cache'):
        if getattr(self.storage, 'IGNORE_IMAGE_DIMENSIONS', False):
            self._dimensions_cache = (0, 0) # Too costly to hit the file?
        else:
            close = self.closed
            self.open()
            self._dimensions_cache = get_image_dimensions(self, close=close)
    return self._dimensions_cache

by Alan Justino da Silva, 13 years ago

Attachment: images.py.2.diff added

Let the storage backend disable touching the image file to get height/width

comment:12 by Aymeric Augustin, 12 years ago

Status: reopenednew

comment:13 by Wiktor, 11 years ago

Cc: wiktor@… added

comment:14 by Walter Doekes, 11 years ago

Cc: walter+django@… added

comment:15 by Mariusz Felisiak, 10 months ago

#35139 was a duplicate.

comment:16 by john-parton, 10 months ago

Here's the contents from my duplicate issue

I have prepped a github repo here with the basic tests: https://github.com/john-parton/django-image-field-extra-read

Conditions for behavior

You must have an ImageField with the width_field or height_field arguments set.

Description of current behavior

When a model is saved, the image file is written out using the Storage API, and then
in order for the width and height fields to be updated, the file is read back out
and then the width and height are extracted from the image.

In the case the storage is local, the performance impact is probably negligible,
unless the application is seriously IO constrained, however if the storage is
remote, the performance impact is significant, and there can be other impacts on
operations.

For instance, if using S3 as backing, more GET operations are performed than
strictly necessary. This could be a few dollars a day of operational costs if your
scale is small, but can be significant if egress charges are high.

As another example, CloudFlare Images rate limits requests. This effectively cuts
the rate limit in half because every save operations requires an additional GET.

Proposed behavior

The proposed behavior is to simple read the image file which is resident in memory
without reading it back out from the storage backend.

Possible breaking issues

The vast majority of storage backends and use cases likely guarantee that if you
write a file into storage, and then retrieve it, you will get the same file back.

However, for some image-specific services, they will compress or crush larger images.

For users who specifically have this use case, they may end up with the width_field
and height_field not representing the actual size of the image in the store, but
rather the size of the image at time of upload.

Explanation of current behavior

It looks like when a model's save() method is called, the field's pre_save() method
is called which results in the descriptor for the field having its get method
called and then immediately having its set method called with a similar value.

The effect is to coerce the value of the field to ImageFieldFile which is a subclass
of ImageFile. The ImageFieldFile instance is assigned a property of file which
is the wrapped original value.

The image field then saves and persists the data using the storage API, and then the
wrapped file isn't referred to later to get the width and height. When the width and
height are requested, the file is read back out of storage.

Proposed fix

No specific fix at this time.

Mitigating breaking issues

Considering how unusual this use case is, it may be sufficient to document the change
in behavior and provide a code snippet to wire up a signal handler to do the
additional read for those users who have the unusual storage backends and actually
care about the width/height being what is on disk. This would also allow users to
customize the behavior. For instance, maybe if the image is under a certain resolution,
the storage provider guarantees they don't mangle the image. A user could enshrine
that logic in the signal handler, so they could still get the performance uplift where
appropriate.

Summary

It seems pretty unlikely that this is the intended behavior, and seems to largely be a byproduct of heavy descriptor use and "magic" in the code. I've tried mitigating this in my application code, but it requires me to monkey patch some private methods, and I'm not even sure I got it right. Any attempt to "fix" this at a level higher than Django's internals will just result in an unmaintainable mess. Back when Django was new, I'm not sure I would make this argument, but now storage is usually backed by some API like S3, I think it makes sense to make avoiding an extra network request the "sane default."

comment:17 by john-parton, 10 months ago

I've got a fix that I think should pretty much always work.

Like I alluded to earlier, it requires some deep knowledge of Django's internals. You have to know exactly when to side-step a descriptor and what the purpose of the descriptor is.

Basically, the FileField.save() method will call the descriptors set method with the value being a string. That's required, because the storage backend might return a different string than you originally fed in. I believe the motivating reason for this was to avoid accidentally overwriting files. The other function of the descriptor is to calculate and store the width and height values if it's configured. However, the descriptor's set method had to of been already called to even get to this point in the save flow. Because the set method of the descriptor basically only serves this one function and the function has been satisfied earlier, I just side step the descriptor by manipulating the instances dict directly.

Here's the meat of the code: https://github.com/john-parton/django-image-field-extra-read/blob/bugfix/proposed-fix/test_app/fields.py

Of course, I wouldn't want to make a change like this without some tests. As I said, I believe the primary purpose of the descriptor use there is update the string value of the file field, which I have made sure that it's updated to a non-conflicting value on subsequent saves.

Here's a test case that I wrote which captures some of the spirit of what I'm trying to do: https://github.com/john-parton/django-image-field-extra-read/blob/bugfix/proposed-fix/test_app/tests.py

I'm happy to take feedback, or I can work on building a proper patch.

in reply to:  17 ; comment:18 by Alan Justino da Silva, 10 months ago

Replying to john-parton:

I've got a fix that I think should pretty much always work.
...
Basically, the FileField.save() method will call the descriptors set method with the value being a string.

First, thanks for refreshing this 15yo issue, john-parton.

That said, I do not see how this solution fixes a template page rendering ~10 images from a remote storage as there is no .save() involved for using existing images:

When I reopened the issue, my application needed only the URL of the files, already stored in the database, thus a O(1) call to DB. That is because I needed not to put height and width in the <img /> HTML tag, as the browser deals with whatever it fetches from the URL. If grabbing the URL keeps forcing to .get_image_dimentions(), then this call becomes O(10) calls to the storage, meaning O(10) HTTP HEAD calls to S3. That is why I proposed the storage to decide if this should be done, or cached, or faked.

Do you think that letting the storage proxy the image dimensions would be another way to fix your issues as well?

in reply to:  18 comment:19 by john-parton, 10 months ago

Replying to Alan Justino da Silva:

For existing images, if you have a width_field and height_field set, then .get_image_dimensions() is not ordinarily called. Usually the width_field and height_field are referred to. That's the point of those fields.

My patch does nothing if your ImageField doesn't have a width_field or height_field, and that's intentional. Adding the width_field and height_field should already solve the behavior you're describing.

If you could provide an actual test case where you think the behavior isn't correct, I'd really appreciate it. Ideally a mini Django project using my branch with a clear minimal reproduction of the undesired behavior you're referring it.

Thanks.

EDIT Here's the current working PR https://github.com/django/django/pull/17775

I have a test for the regression and explicit tests for the behavior I'm trying to fix. It's VERY limited in scope because I don't want to break backwards compat.

DOUBLE EDIT I think the behavior that you're describing has been fixed elsewhere in the descriptor logic some time between twelve years ago and today.

Last edited 10 months ago by john-parton (previous) (diff)

comment:20 by john-parton, 10 months ago

I added some tests. Big test is that there's a model with a Storage backend that raises an exception if you try and read from it. Test suite appears to be passing. Added a small note about the old behavior in the docs.

comment:21 by john-parton, 10 months ago

Needs documentation: unset

I added a small amount of documentation. A 'version changed' describing the old behavior and under which circumstances the new behavior might affect a user.

comment:22 by john-parton, 7 months ago

I've created a different pull request with a different approach, which is better than my other patch in at least two ways:

  • It removes an extra container class I was using as a signal for when to update
  • It removes an extra keyword argument I had to add (returning the function signature to as it was before the patch)

https://github.com/django/django/pull/18070

I think we need to come to consensus on whether this is a bug that we can just patch, or if this change is significant enough that we need to introduce a setting to opt-in to the behavior (most likely changing the setting in a future release).

comment:23 by Sarah Boyce, 7 months ago

Patch needs improvement: set

in reply to:  18 ; comment:24 by Sarah Boyce, 6 months ago

Has patch: unset
Patch needs improvement: unset
Resolution: fixed
Status: newclosed

Replying to Alan Justino da Silva:

That said, I do not see how this solution fixes a template page rendering ~10 images from a remote storage as there is no .save() involved for using existing images:

When I reopened the issue, my application needed only the URL of the files, already stored in the database, thus a O(1) call to DB. That is because I needed not to put height and width in the <img /> HTML tag, as the browser deals with whatever it fetches from the URL. If grabbing the URL keeps forcing to .get_image_dimentions(), then this call becomes O(10) calls to the storage, meaning O(10) HTTP HEAD calls to S3. That is why I proposed the storage to decide if this should be done, or cached, or faked.

Hi Alan, the issue you are describing was fixed in #34517.
I will close this issue and reopen #35139 as this is a separate concern.

in reply to:  24 comment:25 by john-parton, 6 months ago

I think it's very important that they provide a minimal test.

Based on my research, I do not think the scenario they are describing is actually a problem anymore, and had been fixed sometime in the last twelve years by another change.

I could be wrong, of course, but without actual code to pore over, my gut says it's fixed.

Of course discussion should probably move over there.

Replying to Sarah Boyce:

Replying to Alan Justino da Silva:

That said, I do not see how this solution fixes a template page rendering ~10 images from a remote storage as there is no .save() involved for using existing images:

When I reopened the issue, my application needed only the URL of the files, already stored in the database, thus a O(1) call to DB. That is because I needed not to put height and width in the <img /> HTML tag, as the browser deals with whatever it fetches from the URL. If grabbing the URL keeps forcing to .get_image_dimentions(), then this call becomes O(10) calls to the storage, meaning O(10) HTTP HEAD calls to S3. That is why I proposed the storage to decide if this should be done, or cached, or faked.

Hi Alan, the issue you are describing was fixed in #34517.
I will close this issue and reopen #35139 as this is a separate concern.

Last edited 6 months ago by john-parton (previous) (diff)
Note: See TracTickets for help on using tickets.
Back to Top