#17258 closed New feature (fixed)
Move threading.local from DatabaseWrapper to connections dictionary
Reported by: | Anssi Kääriäinen | Owned by: | nobody |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | anssi.kaariainen@… | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
It might be a good idea to move the threadlocality of connections from the DatabaseWrapper to the django.db.connections dictionary. This would be useful if you want to do routing of connections on per-view or per-method basis. The use case is when you have a writing view in your application, you might want to route all queries to the master database, not just the writing queries. Now your view will work on a consistent snapshot all the time. This would also allow for much easier generation of separate connections.
Currently, if you go on and change the DEFAULT_DB_ALIAS to point to "master" in the connections dictionary, this will result in global routing of the DEFAULT_DB_ALIAS, that is, all threads will see the master connection in the DEFAULT_DB_ALIAS, not just your thread. In other words, this is not doable currently.
I would like to work on an external project "db_helpers", which would provide a function wrapper for this. After this change, you could do this:
@route_connections(default="master", close_connections=True): def my_writing_view(...): ... or @route_connections(default=db_helpers.separate_connection("master")): def log_stuff_in_separate_connection(...): ...
And the implementation of route_connections would be:
def route_connections(f): # first handle special kwargs, like _close_connections old_connections = {} try: for k, v in kwargs: old_connections[k] = connections[k] connections[k] = connections[v] # or new connection... # call the wrapped method finally: for k in old_connections.keys(): connections[k] = old_connections[k]
Is the change of threading.local from DatabaseWrapper to the connections dictionary something that has a chance to get included in Django? I think this could be a pretty simple thing to do (if you can call anything involving threads simple). Of course, I am willing to do the investigation & the patch if there is a chance for this.
Attachments (7)
Change History (31)
comment:1 by , 13 years ago
Triage Stage: | Unreviewed → Design decision needed |
---|---|
Type: | Uncategorized → New feature |
Version: | → SVN |
comment:2 by , 13 years ago
First, everything user-visible would be done in a project external to Django. For routing connections (instead of queries) some changes in django.db.connections is needed. The idea behind this is that you could make an one-db application multi-db app by just changing where 'default' points - for writing views it points to master, else it points to one of the slaves.
I do not mean to change the value of DEFAULT_DB_ALIAS, I just mean to change where it points at (that is, connectionsdefault is changed for one thread, not the word 'default', not the DatabaseWrapper
it points to).
Yes, there are some drawbacks... Maybe the global state is a stupid thing to do, but on the other hand it is pretty hard to pass the correct connection you want to use from middleware, so that it is used in other middleware, and in all the queries (user.get_profile, get_permissions, model validation etc). Separate connections to the same database are also somewhat hard to do correctly.
Anyways, this would not be django-core stuff, but a prerequisite is moving the thread-locality of connections from DatabaseWrapper
to django.db.connections dictionary. This involves some problems, because threading.local objects have some special GC behavior when dealing with cyclic references, and DatabaseWrapper
happens to have many cyclic references. This would mean that separate threads' connections are GCed (and closed) with different timing than they are now. But this timing is Python implementation dependent, for pypy there would not be any change for example... And closing connections manually for separate threads should be done anyways.
comment:3 by , 13 years ago
Hmmh, my descriptions above aren't as clear as they could be. So, another try. I suggest that Django would:
- Make
BaseDatabaseWrapper
a subclass of object instead of threading.local - Make django.db.connections threadlocal.
Benefits:
BaseDatabaseWrapper
is a complex object, and it is subclassed a lot. django.db.connections is a simple dictionary. Making the simple object threading.local instead of the complex object would be a good thing IMHO.- You can share connections between threads (needed for SQLite in-memory databases): Pass the connections you want to share to new thread, and in the new thread assign the passed in connections to django.db.connections.
- You can change per-thread where each connections
[alias]
points to. So, you could do this in a middleware:old_conn = django.db.connections['default'] try: django.db.connections['default'] = django.db.connections['master'] # for writing transactions. (Or just for POST connections) django.db.connections['default'] = django.db.connections['slaveN'] # for read-only transactions. finally: django.db.connections['default'].close() django.db.connections['default'] = old_conn
- Implementing a connection pool: Make a
ConnectionPoolWrapper
(a fake backend), assign it to django.db.connections[alias]
. TheConnectionPoolWrapper
would then pass most attribute accesses down to one of the pooled connections (which would be regular backend objects). - It would be fairly easy to have multiple connections to the same database in one thread: just assign a new instance of a wanted connection to django.db.connections['another_connection`], and then do queries with .using('another_connection'). (Or use the above mentioned snippet to assign it to one of the existing connections). Useful for logging and testing for example. Currently Django creates another thread when it wants a separate connection for testing (see select_for_update tests for example).
- Implementing a connection pool: Make a
The important thing to note is that the examples above would not be Django-core. Moving thread-locality from BaseDatabaseWrapper
to django.db.connections would allow 3rd party projects to do the above things. As far as I see, they are not doable at the moment, or at least not in any trivial way.
by , 13 years ago
Attachment: | 17258.thread-local-connections.diff added |
---|
comment:4 by , 13 years ago
Triage Stage: | Design decision needed → Accepted |
---|
Accepting as the benefits suggested in comment:3 all seem valid and as we already have an immediate need for this, i.e. to allow in-memory sqlite database connections being shared between multiple threads for Selenium tests (ref #2879).
The attached patch turns the DatabaseWrapper
instances into global objects and makes the django.db.connections
and django.db.connection
references thread-local.
As discussed with akaariai on IRC, there is a potential risk that garbage collection may not be handled properly since DatabaseWrapper
has cyclic references. So we'd need to do further testing to be sure this change doesn't cause memory leaks or unexpected behaviours.
comment:5 by , 13 years ago
Cc: | added |
---|---|
Has patch: | set |
Do you need to make django.db.connection threading.local? As is, it is fetching / assigning the connection using connections, but connections is already threading.local. So, would this work exactly as is, but with making the DefaultConnectionProxy
subclass of object instead?
About the garbage collection risk, there are a couple of things to note:
- The garbage collection closes the underlying connection, so you would a connection leak if the GC fails to collect the connection object.
- Circular references +
__del__
make objects in cyclic references non-collectable. See: http://docs.python.org/library/gc.html#gc.garbage. - However, if the cycle is caused by a threading.local object, 2. isn't in effect. (Yes, this is weird).
- On standard Python the timing of the garbage collection is altered. Non-cyclic (or cycles containing threading.local objects) are collected immediately, but objects with referential cycles only when the GC happens to run next time. On
PyPy
, the objects are collected only when the garbage collector happens to run, so no change forPyPy
users. In general, if your code relies on the exact timing of GC, you are doing it wrong.
It so happens that the BaseDatabaseWrapper
is currently threading.local, and has referential cycles (wrapper -> introspection -> wrapper etc). So, by removing the threading.local, we remove rule 3, and rule 2 is in effect. Luckily there is no __del__
defined for the wrapper, so we should be relatively safe, except for the altered timing. External backends which define __del__
would have a really hard bug to debug.
This should not affect web-serving at all, as after each request connections are closed explicitly. But testing & scripts and things like that could potentially have a problem.
The long-term solution would be to break the cycles in BaseDatabaseWrapper
. However that is going to be an invasive patch, and would break most external backends.
All in all, I think the patch should be safe from GC point of view. The patch looks good (on a quick glance), except potentially for the duplicate threading.local.
by , 13 years ago
Attachment: | 17258.thread-local-connections.2.diff added |
---|
comment:6 by , 13 years ago
Thanks akaariai. DefaultConnectionProxy
now inherits from object
. I've also tweaked the tests to pass on Postgres.
I now have just one concern about the patch. Some third party code may expect django.db.connection
to be a DatabseWrapper
instance and therefore would break. This is not a huge deal, but it'd be nice if this could be achieved without having to use a proxy class.
comment:7 by , 13 years ago
I am afraid there is no easy way for connection to be DBWrapper
instance instead of the proxy. If you do in your code:
from django.db import connection
on module level, and the DBWrapper
isn't threading.local, then that connection must be a proxy. BTW this also means that if somebody does this for example:
def somefunc(connection=connections['some_alias']): # use connection in some way
that would break if used in multithreaded fashion. You are now sharing a single connection instead of a threading.local object.
Would there be any point in making an assert in some proper place of BaseDatabaseWrapper
that a single connection is not shared between threads (unless explicitly allowed). That would be really helpful for those who would hit multithread-bugs due to this change. I haven't looked where that check could be made so that it is checked often enough to actually catch something, but seldom enough to not cause performance regressions. Maybe ._cursor() for each backend could do this check?
In short, this change can cause connections to be shared when not actually wanted. Before you would be sharing a threading.local, and thus get different connections, now you would be sharing the actual connection. Allowing this is actually one of the aims of this patch, but this might break user code in hard-to-debug ways. So, a guard against this and a way to explicitly tell that "I am sharing this connection" would be a good thing IMHO.
by , 13 years ago
Attachment: | 17258.thread-local-connections.3.diff added |
---|
POC: check sharing of connections between threads
comment:8 by , 13 years ago
I like this idea. Pysqlite by default prevents connections to be passed around between multiple threads, unless you explicitly provide the 'check_same_thread=False' parameter.
See how this is implemented: http://www.google.com/codesearch#aEvhAxCkZ8U/src/connection.c&q=check_same_thread%20package:http://pysqlite%5C.googlecode%5C.com&l=838
And where this function is called: http://www.google.com/codesearch#search/&q=pysqlite_check_thread%20package:http://pysqlite%5C.googlecode%5C.com&type=cs
I've implemented something similar in the latest patch. Thoughts?
by , 13 years ago
Attachment: | 17258.thread-local-connections.4.diff added |
---|
Similar approach to pysqlite's check_same_thread
comment:9 by , 13 years ago
akaariai: I hadn't realised you had posted a patch as well. It seems like we've both used a similar approach. Let's merge the good bits from both patches ;)
comment:10 by , 13 years ago
It seems your approach is better. I actually don't know what would be the good bits to merge from my patch :)
by , 13 years ago
Attachment: | 17258.thread-local-connections.5.diff added |
---|
comment:11 by , 13 years ago
After a long discussion with akaariai on IRC, I've updated the patch. The most notable changes are:
- All underlying SQLite connections (i.e.
sqlite3.DatabaseWrapper.connection
) systematically receivecheck_same_thread=False
on opening. - Modified the API namings, in particular
check_same_thread
(defaulting toTrue
) becomesallow_thread_sharing
(defaulting toFalse
).
Some documentation is now necessary:
- Release notes announcing that the database connections are now global objects (only their references are thread-local), that
django.db.connection
now is a proxy, and that sqlite db connections can be shared between threads. - Notes in the SQLite doc about the change in behaviour: https://docs.djangoproject.com/en/dev/ref/databases/#sqlite-notes
In the code, maybe a warning or exception could be generated if DatabaseWrapper.connection
is directly accessed and DatabaseWrapper.allow_thread_sharing
is False
with SQLite. I'm not sure if/how this is achievable.
There is also a question as whether or not we want to officially document DatabaseWrapper.allow_thread_sharing
property yet.
comment:12 by , 13 years ago
One more thing. It'd be useful to see if this change wouldn't affect performance too much due to all the added conditional tests.
comment:13 by , 13 years ago
Quick review: The attribute check_same_thread is used in the exception risen from validate_thread_sharing (instead of the allow_thread_sharing). If it is not public API, should it be mentioned at all? I kinda like the idea of making it semi-public...
I haven't done any performance testing, but I would be surprised if the checks would cause any noticeable performance regressions. The checks are in places where the actual operation is likely going to be much more expensive than the check. Maybe the overhead in .cursor() could be noticed if you repeatedly call it and don't actually do anything with the cursor. But that is hardly an use case to optimize for...
Re the needed release notes. Here is the things I thought would be good to mention (lots of polish needed):
- If you are passing connections (that is, connections[some_alias] instead of just some_alias) between threads, you are now sharing the actual connection instead of having different connections to the same database. Using the same connection in different threads is an error, so this should not accidentally hit you. Pass the alias instead and use connections[alias] in the other thread to get a separate connection.
- If you just use the ORM or connection.cursor() for raw SQL, then nothing to see here.
- django.db.connection is now a proxy for the
DatabaseWrapper
. If you need to have the real DBWrapper for some reason, use connections[DEFAULT_DB_ALIAS] instead. - (if allow_thread_sharing is made public): While it is possible to pass connections between threads now, Django does not make any effort to synchronize access to the underlying connection implementation. Concurrency behavior is defined by the underlying implementation. Check their docs for details.
Did I miss something? Is this overly cautious/verbose? I wonder what to say about garbage collection. It is very technical, and the risk sounds a lot worse than it actually is. Maybe it would be good to keep the release notes section very short, and have a longer explanation somewhere else (check django-developers thread X or Trac post Y for more details?).
by , 13 years ago
Attachment: | 17258.thread-local-connections.6.diff added |
---|
comment:14 by , 13 years ago
I believe it's ready to go. You may do a quick scan of the patch before I commit.
comment:15 by , 13 years ago
I wonder if a normal user who just uses the ORM and does nothing special will understand the release notes. However, I believe it is fine to adjust the release notes later on, if need be. Also: Finally, while it is now possible to pass connections between threads... section hints that it is possible to pass connections between threads. However, using public API, you are not allowed to do that.
I will try to run this through djangobench to see if there is any performance problems lurking in there. Otherwise looks good to me.
comment:16 by , 13 years ago
Yes, perhaps the release notes are overly technical. The fact we're not publicising allow_thread_sharing
perhaps also makes it a bit confusing ("it is now possible to share connections between threads but we won't tell you how").
The two main things that have changed and could surprise some users are that django.db.connection
now is a proxy, and that underlying sqlite connections are thread-shareable. The former is straight forward, the latter is a bit awkward without telling more details about the new behaviour.
comment:17 by , 13 years ago
OK, benchmarks ran, nothing repeatable found. There is something wrong with my computer or djangobench, because in the attached benchmarks query_update and query_get show bad results, but when I retested them, the results were not any slower. I wonder if it is the garbage collector which affect the results. The tests were run as:
djangobench --trials 400 --control master --experiment connections_threadlocal --vcs git
Re the release notes. Maybe the best thing is to let them just be, and fine tune them later on. Get opinions of more people, and then tune them. Waiting for the opinions before commit will make this a 1.5 feature...
comment:18 by , 13 years ago
Triage Stage: | Accepted → Ready for checkin |
---|
In my experience djangobench's results are to be taken with a grain of salt as many external factors seem to influence them. As long as nothing is repeatable then we should be safe. I'll go ahead with it.
comment:19 by , 13 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
In [17205]:
(The changeset message doesn't reference this ticket)
comment:21 by , 13 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
This change leads to a failure under Python 3.x because the SQLite DatabaseWrapper
is no longer hashable. I noticed that the BaseDatabaseWrapper
defines __eq__
but not __hash__
- does this not prevent its instances from being keys in dictionaries / members of sets?
If you add hash = object.hash, then the problem seemingly goes away.
follow-up: 23 comment:22 by , 13 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
Please don't reopen this ticket, if there's a new issue as a result open a new ticket.
If I understand correctly, your goal is to avoid adding a
.using(...)
clause to everyQuerySet
in the view.The drawback is that we introduce some state — basically, a global (thread-local) variable that defines the "current default" database connection.
Two other remarks:
close_connection
a separate topic?