Opened 15 years ago

Closed 14 years ago

Last modified 12 years ago

#11663 closed Uncategorized (wontfix)

Delete orphaned replaced files

Reported by: Chris Beaven Owned by: Chris Beaven
Component: File uploads/storage Version: dev
Severity: Normal Keywords:
Cc: Johannes Bornhold Triage Stage: Design decision needed
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

When using a model instance with a FileField (or subclass), you can update the image by setting a new file as an instance matching the field. This leaves the potential of an orphaned file.

A common case is that you have a ModelForm (bound to an existing instance) for which you use to upload a new file:

  • The form is saved in your view which overwrites the linked FieldFile with a new one (which will be based on the data from the UploadedFile).
  • Now there is no reference to the old file, so even custom storage can't touch it (unless you use an upload_to callable which names it the same and your custom storage is changed to dangerously blindly overwrite files with the same name).

Note, the delete orphan logic already exists - FileField attaches a post_delete signal to it's Model class which does this.

This is obviously a change in functionality. To preserve backwards compatibility, I recommend that this is a new argument on FileField (off by default).

Attachments (3)

11663.diff (19.6 KB ) - added by Chris Beaven 15 years ago.
Patch with comprehensive tests
11663.2.diff (21.0 KB ) - added by Chris Beaven 15 years ago.
11663.3.diff (23.1 KB ) - added by Chris Beaven 14 years ago.
With some simple docs (and updated to r13350)

Download all attachments as: .zip

Change History (25)

by Chris Beaven, 15 years ago

Attachment: 11663.diff added

Patch with comprehensive tests

comment:1 by Chris Beaven, 15 years ago

Needs documentation: set
Owner: changed from nobody to Chris Beaven
Status: newassigned

comment:2 by Marty Alchin, 15 years ago

On the two prior tickets where Chris has been pushing this, there were notes indicating that this behavior was easy to achieve with the file-storage refactor that went into Django 1.0. Unfortunately, what I had been discussing was simply the ability to overwrite existing files easily enough. What Chris -- and many others -- are asking for here is the ability to make that decision on a per-instance basis, which currently isn't easy.

To clarify the comment about "dangerously" blindly overwriting files, Chris is referring to the situation where one user uploads a file and another uploads another with the same name, overwriting the first user's file. That was my suggestion in the past, and I do have to agree with Chris that it was a naive approach that has some pretty significant downsides.

comment:3 by Malcolm Tredinnick, 15 years ago

I know naming things is hard in computer science, but come on... safe_delete_replaced?! Implies there is also dangerous_delete_replaced. All the usages of safe_ prefixes in this patch make my brain hurt. Either we're deleting or we're not. It's neither safe nor dangerous, so let's try to avoid confusing things with non-useful information in the parameter name.

Why does the save() method needs an extra parameter? Deleting is a property of the field itself (if you only want to delete sometimes, it's pretty easy already to override the save() method). I'm still fairly under-enthused by this idea, but at least let's make the patch correct so that we can see what's being evaluated.

by Chris Beaven, 15 years ago

Attachment: 11663.2.diff added

comment:4 by Chris Beaven, 15 years ago

Thanks for your initial comments, Malcolm. I've uploaded a new version addressing these.

  • "safe" implied only deleting orphans. But you are correct, it wasn't a very good term.
  • This went through several iterations - save() doesn't need an extra property now and this has been removed.

I'm a bit saddened as to lack of enthusiasm for inclusion.

comment:5 by Alex Gaynor, 15 years ago

Triage Stage: UnreviewedDesign decision needed

comment:6 by Ole Laursen, 15 years ago

Just two quick comments from a luser: I fail to see the rationale in leaving orphaned files around, why would you make this configurable and turned off instead of just assuming it to be default behaviour? (On a similar note, I fail to see how two different objects can point to the same file unless someone has hacked the field paths, in which case they asked for it. So the database lookup appears unnecessary. But I guess it's just a carry over. By the way, does this field install an index? Otherwise, what happens when you try to delete user profiles with images on a site with 1.000.000 users?)

The other comment is that this still doesn't not fix issue 4339 since there is no easy knob for telling Django "this filename, and I mean it, no underscores" for fields where you do not use the original filename but auto-generate one (say based on DB id).

But it's great to see you're working on it. :)

comment:7 by Chris Beaven, 15 years ago

It is currently optional because it is a change in functionality. Currently, files aren't touched if the model is updated (they are already removed if the model is deleted). A site may be relying on this behaviour. I didn't want to rock the boat more than possible.

But let's leave design discussion on the Django Developers google group. Add a comment here with a link to discussion thread.

Regarding #4339, if that's what it was after, then it's a straight dupe of the original meaning of #2983, which was closed as a wontfix (as this is possible by using custom FileStorage). Again, if you contest this then bring it up in the google group.

comment:9 by Johannes Bornhold, 14 years ago

Cc: Johannes Bornhold added

Is there any progress on this ticket?

I was just hit by the behavior of FileField, and did notice it only notice it by accident. So if there are good reasons not to change the API, it may be a good thing to add a note or a warning note to the documentation of FileField, so users of FileField get a real chance to notice this behavior.

This just would have saved me some hours of googling around, so I am willing to prepare a patch for the docs if it is appreciated, this patch should be easier to integrate (no side effects in functionality and no API change) until a design decision will be done in the far future. ;-)

comment:10 by metamemetics <dholl57@…>, 14 years ago

Hello this problem was bothering me A LOT.
I just added a custom save method for my model and it seems to be working fine though:

def save(self, *args, **kwargs):
    try:
        this = MyModelName.objects.get(id=self.id)
        if this.MyImageFieldName != self.MyImageFieldName:
            this.MyImageFieldName.delete()
    except: pass
    super(MyModelName, self).save(*args, **kwargs)

Hopefully this helps anyone looking for a quick work around, let me know if it doesn't work.

by Chris Beaven, 14 years ago

Attachment: 11663.3.diff added

With some simple docs (and updated to r13350)

comment:11 by Chris Beaven, 14 years ago

Needs documentation: unset

comment:12 by dogada, 14 years ago

IMO when you just need to update existing file, Django should be able to do this (with a flag like reuse_file=True) instead of creating new and then deleting old file. To achieve this I use following workaround for unfortunately missed base functionality.

import os
from base64 import b32encode
from django.db.models import FileField
from django.db.models.fields.files import FileDescriptor

def generate_random_filename(request_filename, default_ext=''):
    ext = os.path.splitext(request_filename)[1] or default_ext
    # base32 produces url-friendly strings, but shorter than hex.                                                              
    rnd_string = b32encode(os.urandom(settings.SAFE_IMAGE_BYTE_LENGTH))
    filename = rnd_string.lower() + ext
    return filename

def _initial_field_name(field_name):
    return '_initial_%s_name' % field_name

class StableFileDescriptor(FileDescriptor):
    """                                                                                                                        
    The descriptor allows to remember original filename of a FileField.                                                      
    """

    def __set__(self, instance, value):
        if isinstance(value, basestring): # filename of current file loaded from database                                      
            instance.__dict__[_initial_field_name(self.field.name)] = value
        super(StableFileDescriptor, self).__set__(instance, value)


class StableFileField(FileField):
    descriptor_class = StableFileDescriptor

    def generate_filename(self, instance, request_filename):
        filename = getattr(instance, _initial_field_name(self.name), None)
        if not filename:
            filename = generate_random_filename(request_filename)
        return super(StableFileField, self).generate_filename(instance, filename)

comment:13 by Chris Beaven, 14 years ago

Resolution: wontfix
Status: assignedclosed

Talking with malcolm, this has about a 0.01% chance of making it into django. On that note, I'm wontfixing the issue.

Oh well, I did learn a lot about the internals :)

comment:14 by Malcolm Tredinnick, 14 years ago

For the record (since I just got a Trac collision whilst also wontfix'ing this), my logic is: Django shouldn't be in the business of deleting files, which is more of a system-level task. We have no way of knowing what else on the system is using those files. If we *don't* delete a file and the user wants to remove it, they can use the os module. If we *do* delete a file and that wasn't the right thing (from the point of view of something else on the system), the user is screwed. We cannot guarantee that removal is correct, so we shouldn't do it.

Writing a utility function to check for orphaned files and clean them up is a simple enough operation. Whether to include it in Django or not is debateable (it's kind of like the cleanup sessions management command, I guess) and the topic for another ticket. Doing it automatically feels too dangerous to be correct at any time.

comment:15 by anonymous, 14 years ago

People who need the cleanup command that Malcolm mentions will perhaps find https://github.com/mrts/django-commands/blob/devel/django_commands/management/commands/file_cleanup.py useful.

in reply to:  10 comment:16 by webinvent, 13 years ago

Easy pickings: unset
Severity: Normal
Type: Uncategorized
UI/UX: unset

Hi,
The solution by metamemetics works perfectly for me!
Thank you

Last edited 13 years ago by webinvent (previous) (diff)

comment:17 by banerjs.sid, 13 years ago

Having recently had this problem on my application, I will risk the wrath of the community in adding onto an already over-discussed, and perhaps a dead thread.

Malcolm's concerns regarding the deletion of files is valid, and honestly, I would think twice about deleting any file from my server (I am speaking from personal experience). However that said, I think that it is important to realize that more often than not, the files that are deleted are unimportant and a useless waste of disk space that many of us are paying for.

I think that as "The Web framework for perfectionists with deadlines" it is important that Django makes the development process quicker by allowing us developers to simply toggle one or more field options for file and image uploads rather than having us set up cron jobs or the like to deal with the orphaned files. It may be simple, but it takes time, and makes the process much more tedious.

Also to lay some of other Malcolm's fears to rest, I do not think that any developer worth any amount of programming talent would blame Django if an important file was deleted. The blame will lay solely on the developers inability to keep mission critical files separate from not as important ones.

Please do consider applying this as a fix to a future Django release.

comment:18 by DanielClemente, 13 years ago

Django cannot know whether the file is used somewhere else, but sometimes *the developer* does know this for certain and wants to delete old files. There should be an easy way to do it (say, by toggling a parameter after reading the documentation for some seconds) because there are enough interested users who are currently spending hours with this issue.

The proposed half-baked workarounds are not trivial enough; none worked for me at the first try.
I propose having FileField(upload_to='sw',replace=True), where replace=False would be the default.

comment:19 by anonymous, 13 years ago

+1 for beeing able to do something like this:

FileField(upload_to='sw',replace=True)

@DanielClemente:
The only "quick and dirty workaround" i found for this is: (see #4339 )

from django.core.files.storage import FileSystemStorage
class OverwriteStorage(FileSystemStorage):
    def _save(self, name, content):
        if self.exists(name):
            self.delete(name)
        return super(OverwriteStorage, self)._save(name, content)
    def get_available_name(self, name):
        return name

comment:20 by Ole Laursen, 13 years ago

I hate to add to the comment spam, but really, people, seeing that this is WONTFIX and that 1.3 has removed feature that a file field will delete it's associated file when the object is deleted, it's clear that Malcolm's idea of never touching the file system has won this battle.

As much as you may disagree with this, please don't comment here. It's useless and will just spam all of us. If you want to do something about this, maybe use your energy to code a new file field that does in fact clean up?

Regarding the file_cleanup.py command link posted above: there's a possible race condition there if a file ends up on disk before the corresponding object is saved in the database. If you want to make a non-racy garbage collector for this, you need to be really careful.

comment:21 by anonymous, 13 years ago

There is a package for this.

http://pypi.python.org/pypi/django-orphaned

comment:22 by anonymous, 12 years ago

why not to compare files also by CRC together with assesing filename existence and returning name from _save is both are the same?

Note: See TracTickets for help on using tickets.
Back to Top