Opened 12 years ago
Closed 12 years ago
#18748 closed Cleanup/optimization (fixed)
Remove dupe-avoidance logic from the ORM
Reported by: | Anssi Kääriäinen | Owned by: | nobody |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | Triage Stage: | Ready for checkin | |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
There is some complicated-looking and zero documented logic related to dupe avoidance in sql/query.py and sql/compiler.py. I began to wander what exactly the dupe avoidance does. It turns out it does nothing. I removed the dupe-avoidance, and all tests still pass.
The main benefit is easier to follow code. There should be a slight speed advantage, too, but it is so small it is unmeasurable. However I did confirm the removal of dupe-avoidance doesn't lead to speed problems using djangobench.
Patch can be found from: https://github.com/akaariai/django/tree/dupe_away
I am marking this as RFC. I will likely commit this one soon if nobody complains.
Change History (10)
comment:1 by , 12 years ago
comment:2 by , 12 years ago
Triage Stage: | Ready for checkin → Accepted |
---|
Quick testing and seems this isn't the case. I added a join counter (how many times ' JOIN ' is found in the executed SQL) and ran queries, select_related, select_related_regress, many_to_many, aggregation_regress and select_related tests. All produce exactly as many joins.
Another reason to think this will not reduce joins is that the dedupe is used to mark joins to _exclude_ from reuse, so the dedupe would cause more joins to appear if anything. Although, the code is complex and it is sometimes hard to see what exactly is happening...
I will still downgrade this from RFC, this probably is a bit too large change to just push in...
comment:3 by , 12 years ago
If this isn't a correctness issue, and it doesn't cause extra JOINs then I'm comfortable with you landing this, based on the analysis you've posted. I can take a look if you want, but I don't think that's necessary.
comment:4 by , 12 years ago
Triage Stage: | Accepted → Ready for checkin |
---|
I am not too afraid re the code changes, the bigger question is if there is something the dedupe logic is doing and I just haven't spotted it. To me it seems there isn't. The code creates same amount of joins and the generated queries are correct, at least according to test suite.
Todays useless fact: There are 29176 instances of the word 'JOIN' in SQL executed by the testing framework (both with and without patch).
Marking again RFC.
comment:5 by , 12 years ago
Alex pinged me to take a look at this. I am *very* suspicious of removing exclusions from the "def join(...)" part. I have a memory it was important at the time. The reason for this being there may have vanished in the interim (it was 5 years ago and prior to the SQL-compiler changes), but I'll try to puzzle it out this weekend.
The reason for it would have been to avoid inadvertent merging of table joins when we need to force joining separate copies of the same table. On the other hand it's been around much longer than the real monster join bugs prior to 1.0, so the effect may be duplicated in other code, removing the need for this,
comment:6 by , 12 years ago
The real problem I was trying to solve when starting this was "what is this dedupe stuff really used for". The approach I used for finding that out was "lets just remove it and see if anything breaks". So, it wouldn't surprise me if something actually breaks...
comment:7 by , 12 years ago
I did some digging, and the dupe avoidance seems to be about cases where there are multiple paths from model A to model B, but through different columns. Something like this:
class B: pass class A: b1 = models.ForeignKey(B) b2 = models.ForeignKey(B)
However, this case is already handled in join() by having the column in the join identifier. That is, we will no accidentally reuse join through b1 for b2.
To me it seems there is no need for dupe avoidance any more.
comment:8 by , 12 years ago
OK, I now know what exclude was used for - when combining queries and the rhs query has two joins for the same connection we must exclude the alias used for the first connection. As an example:
q1 = qs.filter(m2m__f=1) q2 = qs.filter(m2m__f=1).filter(m2m__f=2)
Now, we have two joins for m2m in q2, and specifically they have the same connections for the m2m joins. If we do (q1|q2), then we will reuse all joins for the first m2m join. If we don't exclude the alias used for the first m2m join, then we will again reuse the same join for the second m2m join. This would result in having just one join in the combined query. This is wrong, as the original rhs query had two joins.
I have done a fix for this in the https://github.com/akaariai/django/compare/dupe_away, specifically in https://github.com/akaariai/django/commit/da96154a72c4c4ca53cce198a82bb6af3f87ff64 - note that the exclude arg is still gone, instead I use the reuse arg here.
I also reworked the .join() implementation (https://github.com/akaariai/django/commit/b42280f45c7cd0fcde835ee4665e9d67327fda1d). I feel the code is much improved, but I can't claim to understand the existing code in full. I am afraid that there are other regressions lurking...
If anybody can review this I would much appreciate it!
comment:9 by , 12 years ago
I have updated the patch. It consist of two major parts:
- dupe avoidance removal
- refactoring the signature and implementation of sql.Query.join()
In a perfect world these should likely be two separate patches and tickets. However, I have the code ready in one patch, and I think now is a good time to commit this to master. Splitting this to two patches seems somewhat laborious for little benefit. The patch is still manageable in size.
Code available in: https://github.com/akaariai/django/compare/ticket_18748
The patch removes around 120 lines of code and adds 10 new comment lines.
Unless objections I will commit this to master.
comment:10 by , 12 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
I managed to miss the "Fixed #18748" from the commit message... So, fixed in [68847135bc9acb2c51c2d36797d0a85395f0cd35].
I think this may be to avoid duplicate JOINs, so it's not a correctness issue, but rather a performance issue, can you try some queries that look like they might go through that code and make sure the JOINs look ok?