Opened 12 years ago

Last modified 7 years ago

#20562 new Cleanup/optimization

Docs: How to use django ORM with multiprocessing

Reported by: Thomas Güttler
Owned by: nobody
Component: Documentation
Version: 1.5
Severity: Normal
Keywords:
Cc:
Triage Stage: Accepted
Has patch: no
Needs documentation: no
Needs tests: no
Patch needs improvement: no
Easy pickings: no
UI/UX: no

Description

There are several tickets closed as "invalid" which were submitted because the user had problems using the Django ORM with the multiprocessing library.

Please add some documentation on how to use it.

Main part: restart the database connection after fork()....

Change History (12)

comment:1 by Baptiste Mispelon, 12 years ago

Cc: bmispelon@… added
Easy pickings: set
Triage Stage: Unreviewed → Accepted
Type: Uncategorized → New feature

I seem to have commented on the wrong ticket.

Sorry for the noise :/

Last edited 12 years ago by Baptiste Mispelon

comment:2 by Baptiste Mispelon, 12 years ago

Cc: bmispelon@… removed
Easy pickings: unset
Triage Stage: Accepted → Unreviewed
Type: New feature → Uncategorized

comment:3 by Anssi Kääriäinen, 12 years ago

Triage Stage: Unreviewed → Accepted

I believe the only thing we can document is "don't use fork()". Closing the connection after fork() might be too late (who says that it is safe to close a connection from the child?). You need to do it before fork(). This might work, or it might not. What about an in-memory SQLite database, will that work? And so on...

The problem isn't that we aren't willing to make fork() work, or document how you can use fork() with Django. The problem is that in general the libraries used by Django aren't fork() safe. We can't work around that.

I am marking this as accepted. We should at least mention that you should not use fork(). In addition we should maybe recommend alternatives to fork(). I don't believe we should mention that "you can use fork() if you do the following things". It will be nearly impossible to actually guarantee that will be true.

comment:4 by Thomas Güttler, 12 years ago

My rule of thumb: "fork() before connection.cursor is created. If it is None, it is safe to fork()". The same goes for other connections (for example memcached).

comment:5 by Tim Graham, 11 years ago

Type: Uncategorized → Cleanup/optimization

comment:6 by Héctor Urbina, 10 years ago

Hello akaariai,

So, did you have anything in mind when you said "recommend alternatives to fork"? I'm facing a problem with a for loop that iterates over a Django model queryset (and so makes DB queries) and takes too long. Each iteration makes independent calculations, so is it possible to distribute them somehow? Anyone?

Last edited 10 years ago by Héctor Urbina

comment:7 by Nikolas N, 10 years ago

I'm also curious if anyone is using the django orm from multiple processes.

comment:8 by Aymeric Augustin, 10 years ago

Most users of the Django ORM use it from multiple processes, since most production WSGI servers use multiple processes :-)

This question isn't specific to Django. The general problem is that you can't carry sockets across fork.

If your Django process has network connections to remote data stores and you want to fork, you need to close these connections before forking. (Usually, they're reopened automatically on the next access.)

In practice, application servers fork before Django does anything, so this issue only arises when you fork in a management command, typically because you're trying to use multiprocessing.
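The "close before forking, reopen on next access" pattern can be sketched with the standard library alone. This is only an illustration: LazyDB is a hypothetical stand-in for a connection wrapper that opens its connection on first use, SQLite stands in for a remote data store, and the "fork" start method is requested explicitly (so it assumes a POSIX system). In Django itself the closing step would typically be django.db.connections.close_all().

```python
import multiprocessing
import os
import sqlite3
import tempfile


class LazyDB:
    """Hypothetical wrapper that opens its connection on first use."""

    def __init__(self, path):
        self.path = path
        self.conn = None

    def execute(self, sql, params=()):
        if self.conn is None:  # (re)open on demand
            self.conn = sqlite3.connect(self.path)
        return self.conn.execute(sql, params)

    def close(self):
        if self.conn is not None:
            self.conn.close()
            self.conn = None


def child(db, queue):
    # db.conn is None here, so this call opens a connection owned by the child.
    queue.put(db.execute("SELECT 41 + 1").fetchone()[0])


def run():
    with tempfile.TemporaryDirectory() as tmp:
        db = LazyDB(os.path.join(tmp, "demo.sqlite3"))
        db.execute("SELECT 1")  # the parent uses the database first
        db.close()              # ...and closes it before forking
        ctx = multiprocessing.get_context("fork")
        queue = ctx.Queue()
        proc = ctx.Process(target=child, args=(db, queue))
        proc.start()
        result = queue.get(timeout=30)
        proc.join()
        return result, db.conn is None
```

Calling run() forks one child that reopens the database on its own. The parent's wrapper stays closed until the parent next touches it.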

in reply to:  8 comment:9 by Moritz Sichert, 10 years ago

Replying to aaugustin:

This question isn't specific to Django. The general problem is that you can't carry sockets across fork.

fork() copies the whole file descriptor table to the child process, so sockets are definitely carried across to the child process.

It's just that you have to explicitly and carefully handle the sockets to open and close them in the correct processes.
And that's where Django and probably most of the db driver libraries fall short.
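A small sketch of the point about descriptor inheritance, assuming a POSIX system where os.fork() is available. The function name demo_inherited_socket is made up for this illustration. The child writes to a socket that was created in the parent before the fork, which only works because the descriptor table was copied.

```python
import os
import socket


def demo_inherited_socket():
    # Both ends are created in the parent, before the fork.
    parent_sock, child_sock = socket.socketpair()
    pid = os.fork()
    if pid == 0:
        # Child: the inherited descriptor is usable as-is.
        parent_sock.close()
        child_sock.sendall(b"hello from child")
        child_sock.close()
        os._exit(0)
    # Parent: close the end that belongs to the child, then read.
    child_sock.close()
    data = parent_sock.recv(1024)
    parent_sock.close()
    os.waitpid(pid, 0)
    return data
```

Each process closes the end it does not own. That bookkeeping is the "explicit and careful handling" described above, and it is easy to get wrong when the socket is buried inside a database driver.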

comment:10 by Aymeric Augustin, 10 years ago

OK, I'm out of my depth there. All I know is that, if multiple children attempt to use a connection established in the parent, you get timeouts, probably because packets are sent back to the parent.

Is there something like a "pre-fork" signal that Django could react to? I'm not aware of such a thing.
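For reference, Python 3.7 and later offer os.register_at_fork(), which comes close to such a hook on POSIX systems. Its callbacks run around os.fork(), and therefore around the multiprocessing "fork" start method, but not around ordinary subprocess launches. The sketch below is an assumption-laden illustration, not something Django is known to do: LazyConnection is a hypothetical placeholder for a wrapper that reconnects on demand.

```python
import os


class LazyConnection:
    """Hypothetical stand-in for a wrapper that reconnects on demand."""

    def __init__(self):
        self.handle = None

    def ensure_open(self):
        if self.handle is None:
            self.handle = object()  # placeholder for a real socket
        return self.handle

    def close(self):
        self.handle = None


conn = LazyConnection()
# Run conn.close() in the parent immediately before every os.fork().
os.register_at_fork(before=conn.close)


def fork_and_report():
    conn.ensure_open()
    r, w = os.pipe()
    pid = os.fork()  # the "before" hook fires here
    if pid == 0:
        os.close(r)
        os.write(w, b"closed" if conn.handle is None else b"open")
        os._exit(0)
    os.close(w)
    status = os.read(r, 16)
    os.close(r)
    os.waitpid(pid, 0)
    return status
```

Because the hook runs before the fork, neither the parent nor the child holds the old handle afterwards, and both would reconnect on their next access.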

comment:11 by Thomas Güttler, 10 years ago

I created this ticket two years ago. I think a guideline is enough here.

I resolved my issues by checking that no ORM code gets executed before the multiprocessing module spawns the workers.

In other words:

  1. start the workers via multiprocessing.
  2. connect to DB.

If you have N workers, you need N connections to the database.

I think no change to the django code base is necessary. Just docs.
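The guideline above (start the workers, then connect, one connection per worker) might look like the following sketch. It uses only the standard library, with SQLite standing in for the real database. The names init_worker, count_rows and run are hypothetical, and the "fork" start method is requested explicitly, so it assumes a POSIX system.

```python
import multiprocessing
import os
import sqlite3
import tempfile

_conn = None  # per-process connection, created only inside a worker


def init_worker(db_path):
    # Runs once in each worker, after the worker process has started.
    global _conn
    _conn = sqlite3.connect(db_path)


def count_rows(_):
    (n,) = _conn.execute("SELECT COUNT(*) FROM item").fetchone()
    return n, os.getpid()


def run(n_workers=2):
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "demo.sqlite3")
        # One-off setup; this connection is closed before any worker starts.
        setup = sqlite3.connect(db_path)
        setup.execute("CREATE TABLE item (id INTEGER PRIMARY KEY)")
        setup.executemany("INSERT INTO item (id) VALUES (?)",
                          [(i,) for i in range(5)])
        setup.commit()
        setup.close()
        ctx = multiprocessing.get_context("fork")
        with ctx.Pool(n_workers, initializer=init_worker,
                      initargs=(db_path,)) as pool:
            return pool.map(count_rows, range(4))
```

With N workers the initializer runs N times, so N connections are opened, and none of them is shared with the parent.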

in reply to:  8 comment:12 by Antony V. Badaykin, 7 years ago

Replying to Aymeric Augustin:

Most users of the Django ORM use it from multiple processes, since most production WSGI servers use multiple processes :-)

This question isn't specific to Django. The general problem is that you can't carry sockets across fork.

If your Django process has network connections to remote data stores and you want to fork, you need to close these connections before forking. (Usually, they're reopened automatically on the next access.)

In practice, application servers fork before Django does anything, so this issue only arises when you fork in a management command, typically because you're trying to use multiprocessing.

What about adding a top-level wrapper, for example MultiprocessingCommand, that takes care of the connections and so on?
