Opened 14 years ago

Last modified 2 months ago

#13960 new New feature

abstract file upload/download handling

Reported by: Waldemar Kornewald Owned by: nobody
Component: File uploads/storage Version: dev
Severity: Normal Keywords:
Cc: hv@… Triage Stage: Someday/Maybe
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: yes
Easy pickings: no UI/UX: no

Description

This is a proposal to add abstract file upload and download handling via a backend mechanism. This allows to support App Engine, async S3 uploads, and many other file handling solutions via a single API. The end-user simply specifies the desired backends for his hosting solution in the settings.

See also: http://groups.google.com/group/django-developers/browse_thread/thread/8ac5846e47d96df5

Attachments (4)

filetransfers.diff (18.5 KB ) - added by Waldemar Kornewald 14 years ago.
patch against trunk
filetransfers.2.diff (17.3 KB ) - added by Waldemar Kornewald 14 years ago.
patch against trunk
filetransfers.3.diff (20.8 KB ) - added by Waldemar Kornewald 14 years ago.
patch with RequestForm, RequestModelForm. now independent of models (uses string ID instead of model and field)
filetransfers.4.diff (30.9 KB ) - added by Waldemar Kornewald 14 years ago.
added admin support for file uploads and implemented request-based formsets

Download all attachments as: .zip

Change History (19)

by Waldemar Kornewald, 14 years ago

Attachment: filetransfers.diff added

patch against trunk

by Waldemar Kornewald, 14 years ago

Attachment: filetransfers.2.diff added

patch against trunk

comment:1 by Waldemar Kornewald, 14 years ago

The second patch gets rid of the FileField convenience API. Now the low-level API supports field names (strings) in addition to field instances. This also simplified the patch a little bit.

by Waldemar Kornewald, 14 years ago

Attachment: filetransfers.3.diff added

patch with RequestForm, RequestModelForm. now independent of models (uses string ID instead of model and field)

by Waldemar Kornewald, 14 years ago

Attachment: filetransfers.4.diff added

added admin support for file uploads and implemented request-based formsets

comment:2 by Waldemar Kornewald, 14 years ago

A sample app demonstrating an async upload process can be found here:
http://bitbucket.org/wkornewald/upload-sample-trunk-filetransfers

Note that this even works with the admin. The file first gets sent to a view at /upload and then the processed request gets forwarded to the admin view.

Also, here's my development branch:
http://bitbucket.org/wkornewald/django-trunk-filetransfers

comment:3 by Jannis Leidel, 14 years ago

Is there a reason why the backends are called "*FileBackend" while they are actual "*TransferBackends"?

Also, why this configuration based approach? Couldn't the FileField gain an additional transfer parameter similar to storage? Alternatively, couldn't we also extend the already existing storage API?

comment:4 by Waldemar Kornewald, 14 years ago

I just left the naming and other low-priority questions to the last design step. But you're right, TransferBackend would be much better.

We can't extend the existing storage API because the storage mechanism (e.g., file system) doesn't necessarily know anything about the upload mechanism (e.g., streaming in-memory) and both don't know anything about the download mechanism (e.g. X-Fileserve). These are three completely separate things. Only in some cases they belong together (e.g., Amazon S3, Google App Engine Blobstore, etc.).

This also means we can't use it as a transfer parameter to FileField. The API is not limited to the storage mechanism.

The whole approach of adding project-specific details directly to the models and app code makes Django apps less reusable. The configuration-based approach allows the same app to be used with App Engine, S3, your local home server, and any other hosting solution and that's the goal behind this API: Abstracting the file transfer and storage, so reusable apps can be written.

in reply to:  4 ; comment:5 by Jannis Leidel, 14 years ago

Replying to wkornewald:

We can't extend the existing storage API because the storage mechanism (e.g., file system) doesn't necessarily know anything about the upload mechanism (e.g., streaming in-memory) and both don't know anything about the download mechanism (e.g. X-Fileserve). These are three completely separate things. Only in some cases they belong together (e.g., Amazon S3, Google App Engine Blobstore, etc.).

This isn't true, the storage framework knows about the upload handler (django.core.files.uploadhandler) and misses a download feature, which is why it should be done by extending the storage API instead of inventing a separate layer. If -- for ambiguity reasons -- it makes sense to name the download part "transfer", I'm +1 on it, otherwise I don't see a reason to consider your idea. Note, I acknowledge the general idea of having an abstraction for downloading uploaded files.

This also means we can't use it as a transfer parameter to FileField. The API is not limited to the storage mechanism.

But to what else?

The whole approach of adding project-specific details directly to the models and app code makes Django apps less reusable. The configuration-based approach allows the same app to be used with App Engine, S3, your local home server, and any other hosting solution and that's the goal behind this API: Abstracting the file transfer and storage, so reusable apps can be written.

While I see your point, I do think that 3rd party apps can easily make the transfer parameter configurable, if required, without having to rely on a global style setting. This is a matter of "convention over configuration", e.g. using a different transfer method for every file field in a project is by far less likely than using the same for all. It's good to be able to specify a different default backend, and be able to override that on a per-field base.

comment:6 by Waldemar Kornewald, 14 years ago

OK, I think your suggestion does have valid points, so here are a few thoughts:

The upload preparation process (see prepare_upload() function) is most probably tightly 1:1-coupled with FileUploadHandler. IOW, two upload backends will never share the same upload preparation process (unless it needs no preparation, at all). So, as a first step, what do you think about moving prepare_upload() to FileUploadHandler?

The "post-upload -> storage" process has an m:n mapping from FileUploadHandler to Storage. At least the tempfile and the in-memory upload handlers are both used with different storage handlers (file system, MongoDB, S3, etc.). Merging those two backends into one backend wouldn't make sense because it would lead to an unnecessary explosion in the number of backends (m*n combinations), so FileUploadHandler and Storage would stay two different backends. This already separates the file handling process into two parts.

The download process consists of these parts:

  1. serving the file via a Django view
  2. generating a download URL which can either
    1. point to a Django view (always the case for non-public downloads)
    2. point to a direct publicly accessible download URL (not all backends support this, so a fallback to 2.1 must be supported)

Part (1) might actually be handled by the storage backend, but there's no tight 1:1 coupling between those two. Obviously, one storage mechanism can have different file serving mechanisms which means it's at least 1:n. The important question is: Can one file serving mechanism also apply to multiple storage mechanisms (i.e.: m:n)? A download mechanism that streams the download via HttpResponse might be shared between multiple storage mechanisms which means that many backends would have to be provided in two or more variations (FileStorageStreaming, FileStorageXSendfile, S3Streaming, S3Redirect, etc.) and hopefully those variations are compatible with each other, so you can replace your streaming download with XSendfile without having to migrate the data stored in your FileFields. Also, while Django could provide streaming downloads in the base backend it would lead to unnecessary code duplication if another download mechanism is shared between multiple storage backends (e.g., redirect to MEDIA_URL, if that makes sense). I have strong doubts that it's a good idea to combine the file serving mechanism with Storage. What's the benefit of combining those backends, anyway? Removing one entry from settings.py? The risk is that we're overlooking something and at some point the combination of both backends has more disadvantages than the advantages that come with a *minor* settings simplification.

Case (2) is about handling public and private download URLs. Many projects will have both types of downloads. Some files are publicly accessible and thus should be served directly and efficiently without bothering a Django instance. Other files are are only accessible by certain users, so those downloads have to be handled by a Django view which can check permissions and do other fancy stuff before passing on to the backend (X-Sendfile or a streaming HttpResponse, etc.). Since in many cases both types of downloads (public and private) need to be supported we'll need to specify different download_url() and prepare_upload()/FileUploadHandler backend combinations for different apps or models. Almost every storage option provides one or more solutions for private downloads and one or more solutions for public downloads. Again, this will lead to an explosion of combinations.

There is also the possibility to combine upload and download backends, but since almost every storage solution will work with at least two upload mechanisms and sometimes three or more download mechanisms I'm sure nobody wants to have six or more backends per storage mechanism to handle all possible combinations.

Finally, the TransferBackend API also provides a mechanism for selecting the Storage backend for a FileField without hard-coding it in your models.py. Specifying only one backend isn't sufficient because we need to support public and private uploads and in some cases these are even on different hosting solutions (e.g. for performance reasons or whatever). This mechanism can't be added to Storage because it doesn't really make sense for Storage to be responsible for selecting which Storage should be used. So, even if we combine all of the other APIs into Storage we're still left with a TransferBackend which selects the Storage.

That's why I don't really see how we can combine TransferBackend with Storage. Only the prepare_upload() function might need to be moved from TransferBackend to FileUploadHandler. Well, I'm open for suggestions if you know a better design. Also, apart from this adding a few more *optional* entries to settings.py (in simple cases you won't need a TransferBackend), what's the big issue with this approach, anyway?

comment:7 by Waldemar Kornewald, 14 years ago

BTW, regarding letting 3rd-party app developers implement a custom transfer backend selection mechanism: I don't think it's a good idea to pass this decision as an optional gimmick to app developers. In that case, only a small number of apps will implement an extra backend API because it's too much work and that'll mean that users who have private and public downloads have to hack every lazily developed app. If Django can provide a built-in mechanism for backend selection that's much better for everyone. It means less code in 3rd-party apps and a standard way to select backends instead of having different solutions in every 3rd-party app.

in reply to:  5 comment:8 by Waldemar Kornewald, 14 years ago

Replying to jezdez:

Replying to wkornewald:

This also means we can't use it as a transfer parameter to FileField. The API is not limited to the storage mechanism.

But to what else?

I feel like this might be important to answer: Imagine the case where a user uploads an image which gets resized/converted. Or imagine a CSV or Excel file which gets converted into DB entities. In these cases there's no direct upload to some FileField.

comment:9 by Russell Keith-Magee, 14 years ago

milestone: 1.3
Patch needs improvement: set
Triage Stage: UnreviewedDesign decision needed

The broad use case is valid, but there is still a lot of design work to do here.

comment:10 by Waldemar Kornewald, 14 years ago

Cc: wkornewald@… added

comment:11 by Thomas Güttler, 14 years ago

Cc: hv@… added

comment:12 by Graham King, 14 years ago

Severity: Normal
Type: New feature

comment:13 by Waldemar Kornewald, 13 years ago

Cc: wkornewald@… removed
Easy pickings: unset
UI/UX: unset

comment:14 by Jacob, 12 years ago

Triage Stage: Design decision neededSomeday/Maybe

Marking someday/maybe: this might be something worth doing, but there's still a lot of work left. Until there's a really solid bit of code it's hard to say with certainty if this should go in or not.

comment:15 by Ahmed Ibrahim, 2 months ago

bumping this, is this still valid?

Note: See TracTickets for help on using tickets.
Back to Top