#22268 closed Cleanup/optimization (fixed)
Document values_list() behavior for ManyToManyField
Reported by: | Kal Sze | Owned by: | Mikkel Bach Mortensen |
---|---|---|---|
Component: | Documentation | Version: | dev |
Severity: | Normal | Keywords: | orm, values_list, ManyToManyField |
Cc: | anubhav9042@…, pirosb3, saivnm5@… | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | yes | UI/UX: | no |
Description
Suppose you have a schools
app, with these model definitions:
from django.db import models class School(models.Model): name = models.CharField(unique=True, max_length=255) class Class(models.Model): school = models.ForeignKey(School) name = models.CharField(unique=True, max_length=255) students = models.ManyToManyField('Student') class Student(models.Model): surname = models.CharField(max_length=255) given_name = models.CharField(max_length=255)
Now, try this in the Django shell:
$ python manage.py shell >>> from schools.models import School, Class, Student # Create the school >>> concordia = School(name='Concordia University') >>> concordia.save() # Create the Software Engineering class >>> soen = Class(school=concordia, name='Software Engineering') >>> soen.save() # Create the Computer Engineering class >>> coen = Class(school=concordia, name='Computer Engineering') >>> coen.save() # Create a student >>> john_smith = Student(surname='Smith', given_name='John') >>> john_smith.save() # Add this student into one of the classes >>> soen.students.add(john_smith) # Now make a query using values_list >>> students = Class.objects.values_list('students', flat=True) # How many students are there supposed to be in this `students` QuerySet? >>> print students.count() 1 # What if we iterate over it? >>> for s in students: >>> print s 1 None # Wait, what!? >>> print len(list(students)) >>> 2
Attachments (1)
Change History (49)
comment:1 by , 11 years ago
Type: | Uncategorized → Bug |
---|
comment:2 by , 11 years ago
Triage Stage: | Unreviewed → Accepted |
---|---|
Version: | 1.6 → master |
comment:3 by , 11 years ago
One thing I want to clear out is that I think that values
and values_list
return dictionary/tuple type data for all objects present, hence this might not be a bug.
eg.
Class.objects.values_list()
will give you something like
[(1L, 1L, u'Software Engineering'), (2L, 1L, u'Computer Engineering')]
thus for
Class.objects.values_list('students')
you are bound to get
[(1L,), (None,)]
Thoughts??
comment:5 by , 11 years ago
Cc: | added |
---|
comment:6 by , 11 years ago
I have start looking at this ticket, that is currently replicated on my machine.
The query generated by the values_list statement above is:
(u'SELECT "m2m_and_m2o_class_students"."student_id" FROM "m2m_and_m2o_class" LEFT OUTER JOIN "m2m_and_m2o_class_students" ON ( "m2m_and_m2o_class"."id" = "m2m_and_m2o_class_students"."class_id" )', ())
I think the issue here is related to the LEFT OUTER JOIN. We could modify ValuesListQuerySet.iterator and make changes to avoid this, but I am not sure it should be done here.
Can anyone give me some advice? am I looking in the correct place?
Thanks,
Dan
comment:7 by , 11 years ago
And I believe the problem is even more serious when the values_list is querying for a nullable field of a ManyToManyField. For example, if I modify the Student class and add a new field:
class Student(models.Model): surname = models.CharField(max_length=255) given_name = models.CharField(max_length=255) year_of_birth = models.SmallIntegerField(null=True)
And then I make this query:
>>> Class.objects.values_list('students__year_of_birth', flat=True) [None, None]
It becomes ambiguous where the None
s come from: one of the None
s is from john_smith, who has a null year_of_birth
, the other None
is because the other class has no student at all.
comment:8 by , 11 years ago
First thing we should make clear is that whether we want to change the current behaviour or not.
Please see my previous comment: values
and values_list
give result for all objects present.
comment:9 by , 11 years ago
Since values
and values_list
return a value for all objects present(here Class
).
Therefore if a class doesn't have that field present, it will have to return some value, if None
is not demanded for that field then what should we be wanting in place.
comment:10 by , 11 years ago
Cc: | added |
---|
I understand that it is unclear if this is an error or expected behavour, and that changing this will lead to backwards compatibility issues.
I still believe this should be modified in the future. With None being a possible value in values_list, it makes developer's lives more difficult, as they need to check when iterating over the list.
Another possible solution would be to make a documentation patch, the developer needs to be aware of this.
comment:11 by , 11 years ago
Changing this to not include the None value is definitely backwards compatibility breaking. So far, in this ticket, we have not been able to reach a clear consensus on what the correct behaviour is. To me, the current behaviour does not seem unreasonable. The issue raised in comment:7 is valid, but I don't see how removing extra None's would help - this is simply not the right query for that question.
Given that there is no strong case for removing the None's, I think we should not break backwards compatibility, and therefore keep the existing behaviour. If you disagree, please make sure to clarify exactly how you would like the behaviour to change. Otherwise, I do think a documentation patch would be good.
follow-up: 13 comment:12 by , 11 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
I am also in favour of doc patch...as current behaviour is correct.
follow-up: 14 comment:13 by , 11 years ago
On the one hand, I do understand that a change in behaviour would break backward compatibility. On the other hand, I tend to agree with comment:10 here, in that the behaviour should be modified in the future. To me, the correct behaviour would be to not include the extra None
s. In my example, the ones from the null year_of_birth
are fine and should be included. Another point of consideration that I would raise is whether there is any reason to believe that people are relying on the current behaviour (I think it would be quite strange for people to be relying on ambiguity).
Would it be OK to document it as a quirk that is subject to change in the future, and not as the correct bahaviour?
Regarding comment:11, I think there are possible legitimate uses for this kind of query. For instance, if you want a Cartesian product count of some M2M relation, and then run it through the Python collections.Counter or something.
comment:14 by , 11 years ago
Replying to k_sze:
On the one hand, I do understand that a change in behaviour would break backward compatibility. On the other hand, I tend to agree with comment:10 here, in that the behaviour should be modified in the future. To me, the correct behaviour would be to not include the extra
None
s.
This is what I want to ask, if we do not want those extra None
s, then what do we return in its place.
If we just don't return anything at all, it is going to be more problematic.
Consider three Class
objects, out of which the second object doesn't have student
, then if we return something like ['some_id','some_id']
for this, would it not be confusing for similar case when third object doesn't have student
.
comment:15 by , 11 years ago
Actually the whole behaviour of values()
and values_list()
seems to be wrong once you mix M2M fields into the picture. I just tried creating more Class and Student objects:
# New student, Joe Blo >>> joe_blo = Student(surname='Blo', given_name='Joe') >>> joe_blo.save() # New class, Discrete Mathematics >>> discrete_math = Class(name='Discrete Mathematics', school=concordia) >>> discrete_math.save() # Enroll both john_smith and joe_blo in discrete_math discrete_math.students.add(joe_blo) discrete_math.students.add(john_smith)
First, let's look at values()
queries:
>>> Class.objects.values() [{u'id': 1, 'name': u'Software Engineering', 'school_id': 1}, {u'id': 2, 'name': u'Computer Engineering', 'school_id': 1}, {u'id': 4, 'name': u'Discrete Mathematics', 'school_id': 1}]
That's 3 dictionaries because we now have 3 Class objects. That seems fine. On the other hand, why don't students appear here?
So, let's specifically ask the query to include students
:
>>> Class.objects.values('name', 'students') [{'students': 1, 'name': u'Software Engineering'}, {'students': None, 'name': u'Computer Engineering'}, {'students': 1, 'name': u'Discrete Mathematics'}, {'students': 2, 'name': u'Discrete Mathematics'}]
That's 4 dictionaries even though we only have 3 Class objects (soen, coen, and discrete_math). Can you imagine what would happen if the Class object had more M2M fields? The number of dictionaries returned would probably explode exponentially! Why aren't the students combined into a tuple or a list? Shouldn't it be 3 dictionaries like this:
[ {'students': (1,), 'name': u'Software Engineering'}, {'students': (,), 'name': u'Computer Engineering'}, # Note the empty tuple for students, instead of None {'students': (1, 2), 'name': u'Discrete Mathematics'}, ]
Now, let's look at values_list()
queries:
>>> Class.objects.values_list() [(1, 1, u'Software Engineering'), (2, 1, u'Computer Engineering'), (4, 1, u'Discrete Mathematics')]
3 tuples. Again, this seems fine except for the fact that students aren't included.
>>> Class.objects.values_list('name', 'students') [(u'Software Engineering', 1), (u'Computer Engineering', None), (u'Discrete Mathematics', 1), (u'Discrete Mathematics', 2)]
4 tuples even though we have 3 Class objects only! Shouldn't this be 3 tuples like this:
[ (u'Software Engineering', (1,)), (u'Computer Engineering', (,)), # Note the empty tuple instead of None (u'Discrete Mathematics', (1, 2)), ]
comment:16 by , 11 years ago
To demonstrate how bad things become once you have multiple M2M fields, I created a new Language
model and added it as a M2M field in the Class model:
from django.db import models # other Model definitions skipped ... class Class(models.Model): school = models.ForeignKey(School) name = models.CharField(unique=True, max_length=255) students = models.ManyToManyField('Student') available_in_languages = models.ManyToManyField('Language') class Language(models.Model): name = models.CharField(max_length=255)
Now I create new languages and add them to the Discrete Mathematics class (which now has 2 students, if you recall):
# French >>> french = Language(name='French') >>> french.save() # English >>> english = Language(name='English') >>> english.save() # Add them to Discrete Mathematics class discrete_math = Class.objects.get(name='Discrete Mathematics') discrete_math.available_in_languages.add(french) discrete_math.available_in_languages.add(english)
Now I make a values()
query:
>>> Class.objects.values('available_in_languages', 'students') [{'students': 1, 'available_in_languages': None}, {'students': None, 'available_in_languages': None}, {'students': 1, 'available_in_languages': 2}, {'students': 2, 'available_in_languages': 2}, {'students': 1, 'available_in_languages': 1}, {'students': 2, 'available_in_languages': 1}]
That's 6 dictionaries now!
Similarly, with values_list()
:
>>> Class.objects.values_list('available_in_languages', 'students') [(None, 1), (None, None), (2, 1), (2, 2), (1, 1), (1, 2)]
6 tuples even though we only have 3 Class objects.
comment:17 by , 11 years ago
According to me the query generated for Class.objects.values('available_in_languages', 'students')
:
SELECT `tic_22268_class_available_in_languages`.`language_id`, `tic_22268_class_students`.`student_id` FROM `tic_22268_class` LEFT OUTER JOIN `tic_22268_class_available_in_languages` ON (`tic_22268_class`.`id` = `tic_22268_class_available_in_languages`.`class_id`) LEFT OUTER JOIN `tic_22268_class_students` ON (`tic_22268_class`.`id` = `tic_22268_class_students`.`class_id`)
is correct as it yield the following result
language_id student_id NULL 1 NULL NULL 1 1 1 2 2 1 2 2
Although the result is correct as we want combination of available_in_languages
and students
for Class
objects. However it is confusing.
I propose the following:
Whenever there is a case of M2M in values() or values_list(), we could introduce additional id
into the query, which will help in making the result meaningful and understandable.
We make the query into:
SELECT `tic_22268_class_available_in_languages`.`language_id`, `tic_22268_class_students`.`student_id`, `tic_22268_class_students`.`id` FROM `tic_22268_class` LEFT OUTER JOIN `tic_22268_class_available_in_languages` ON (`tic_22268_class`.`id` = `tic_22268_class_available_in_languages`.`class_id`) LEFT OUTER JOIN `tic_22268_class_students` ON (`tic_22268_class`.`id` = `tic_22268_class_students`.`class_id`)
resulting in
language_id student_id id NULL 1 1 NULL NULL 2 1 1 3 1 2 3 2 1 3 2 2 3
The output also becomes:
[{'students': 1L, 'available_in_languages': None, 'id': 1L}, {'students': None, 'available_in_languages': None, 'id': 2L}, {'students': 1L, 'available_in_languages': 1L, 'id': 3L}, {'students': 2L, 'available_in_languages': 1L, 'id': 3L}, { 'students': 1L, 'available_in_languages': 2L, 'id': 3L}, {'students': 2L, 'available_in_languages': 2L, 'id': 3L}]
}}}
Now the result is as required, since we wanted the combination of values: 'available_in_languages' and 'students' along with 'id' of the Class
object to which it belongs.
I am adding a diff.
comment:18 by , 11 years ago
Added the diff.
Although I am not sure if we should add it at the place where we have added it.
Apart from the place in the diff, we could also add it in _setup_query
of ValuesQuerySet.
by , 11 years ago
Attachment: | 22268.diff added |
---|
follow-up: 21 comment:19 by , 11 years ago
I would argue against solving the problem this way.
- Letting the number of results grow combinatorially seems questionable. According to the current documentation for
values()
andvalues_list()
, each result is supposed to correspond to one model object. Granted, the documentation doesn't say that the query cannot return multiple results for each model object, it still feels wrong because, in the current behaviour, none of the results represents a coherent big picture of one model object; for each model object, the values are either getting duplicated or fragmented across multiple dictionaries or tuples. - I suspect that including the
id
(orpk
) still doesn't solve the ambiguity that arises when you query for a nullable field of an M2M field (e.g.students__year_of_birth
). You still getNone
for both a) nullyear_of_birth
or b) no student at all. I can't confirm this yet because the diff doesn't seem to work against Django 1.6.2 (what version does the diff target?).
By the way, the problem is not limited to M2M fields, but also to ForeignKeys; e.g. if you have a Province
model and a City
model, and the City
has a ForeignKey
on Province
, and you do Province.objects.values('cities')
, you get the same problem.
This is looking like a rabbit hole. :(
comment:20 by , 11 years ago
The examples in comment:15 and comment:16 do seem odd. However, have a close look at #5768, where this behaviour seems to have been added. I haven't read it in detail, but there may be some explanation in there of the choices made.
comment:21 by , 11 years ago
Replying to k_sze:
I would argue against solving the problem this way.
- Letting the number of results grow combinatorially seems questionable. According to the current documentation for
values()
andvalues_list()
, each result is supposed to correspond to one model object. Granted, the documentation doesn't say that the query cannot return multiple results for each model object, it still feels wrong because, in the current behaviour, none of the results represents a coherent big picture of one model object; for each model object, the values are either getting duplicated or fragmented across multiple dictionaries or tuples.- I suspect that including the
id
(orpk
) still doesn't solve the ambiguity that arises when you query for a nullable field of an M2M field (e.g.students__year_of_birth
). You still getNone
for both a) nullyear_of_birth
or b) no student at all. I can't confirm this yet because the diff doesn't seem to work against Django 1.6.2 (what version does the diff target?).By the way, the problem is not limited to M2M fields, but also to ForeignKeys; e.g. if you have a
Province
model and aCity
model, and theCity
has aForeignKey
onProvince
, and you doProvince.objects.values('cities')
, you get the same problem.
This is looking like a rabbit hole. :(
Diff targets master.
Even if two None
s are returned from pk
value we know that which corresponds to what. What should we give instead of None
follow-ups: 23 25 comment:22 by , 11 years ago
Although I have another idea in mind:
We just add pk
to the query and then when final result is obtained we can club the dicts/tuples having same pk
and remove pk
.
Thoughts??
I am suggesting changes this way becoz I think that the query generated is correct.
comment:23 by , 11 years ago
Replying to anubhav9042:
Although I have another idea in mind:
We just addpk
to the query and then when final result is obtained we can club the dicts/tuples having samepk
and removepk
.
Thoughts??
I am suggesting changes this way becoz I think that the query generated is correct.
And also remove the ones where the pk
is None
(due to the LEFT OUTER JOIN
)?
follow-up: 26 comment:24 by , 11 years ago
Another idea that I have, which is probably crazy:
Instead of retrieving the M2M or reverse foreign key values in the same SQL query, use a clever combination of:
prefetch_related()
- subclassing
dict
andtuple
So values()
and values_list
return these subclassed dicts and tuples, which lazily construct and return the relevant M2M or reverse foreign key elements when they are accessed.
In the end, the subclassed dict
and tuple
would work somewhat like the Django model, except that you access things in the form of dictionaries and tuples, and you limit the elements that can appear in them.
comment:25 by , 11 years ago
Although I have another idea in mind:
We just addpk
to the query and then when final result is obtained we can club the dicts/tuples having samepk
and removepk
.
Thoughts??
I am suggesting changes this way becoz I think that the query generated is correct.
I tried to do this way, but clubbing those dics/tuples is not easy as they are not exactly dics/tuple rather QuerySets.
Will keep trying and post back soon
follow-up: 27 comment:26 by , 11 years ago
In the end, the subclassed
dict
andtuple
would work somewhat like the Django model, except that you access things in the form of dictionaries and tuples, and you limit the elements that can appear in them.
Will think
Although the same difficulty might come here as well.
comment:27 by , 11 years ago
Replying to anubhav9042:
In the end, the subclassed
dict
andtuple
would work somewhat like the Django model, except that you access things in the form of dictionaries and tuples, and you limit the elements that can appear in them.
Will think
Although the same difficulty might come here as well.
My idea is for these subclassed dict
and tuple
to be used recursively. And then because they are constructed lazily using QuerySets, you won't get combinatorial explosions, so you won't have to do any clubbing.
comment:28 by , 11 years ago
For better discussion, I have opened a thread:
https://groups.google.com/forum/#!topic/django-developers/DAslY6GI1O8
comment:29 by , 11 years ago
As discussed on IRC and mailing list, the solution has come out to be a doc update for this behaviour rather than tweaking the code any further.
comment:30 by , 11 years ago
Component: | Database layer (models, ORM) → Documentation |
---|---|
Type: | Bug → Cleanup/optimization |
comment:31 by , 9 years ago
Easy pickings: | set |
---|---|
Owner: | removed |
Status: | assigned → new |
Summary: | values_list() on a ManyToManyField returns extra None's when iterated over. → Document values_list() behavior for ManyToManyField |
comment:32 by , 9 years ago
Owner: | set to |
---|---|
Status: | new → assigned |
comment:33 by , 9 years ago
From the django documentation https://docs.djangoproject.com/en/dev/ref/models/querysets/#values :
Because ManyToManyField attributes and reverse relations can have multiple related rows, including these can have a multiplier effect on the size of your result set. This will be especially pronounced if you include multiple such fields in your values() query, in which case all possible combinations will be returned.
I believe that gives a fair amount of warning to the user as to the behaviour of .values()
with ManyToManyField attributes. What we can do is:
- Add a similar warning under the
.values_list()
reference. - Or modify the above warning.
The modification I am suggesting would be in the last line:
This will be especially pronounced if you include multiple such fields in your values() or values_list() query, in which case all possible combinations (including None) will be returned.
comment:34 by , 9 years ago
Cc: | added |
---|
comment:35 by , 9 years ago
Modifying the existing warning might make sense. An example may also help.
comment:36 by , 9 years ago
Added a pull request on Github https://github.com/django/django/pull/5976
Following timgraham's suggestion of adding an example to illustrate behavior.
comment:37 by , 9 years ago
Owner: | changed from | to
---|---|
Triage Stage: | Accepted → Ready for checkin |
Assigning to myself to expedite the process of closing this ticket, it has been open for far too long.
comment:38 by , 9 years ago
Has patch: | set |
---|---|
Triage Stage: | Ready for checkin → Accepted |
comment:39 by , 9 years ago
Patch needs improvement: | set |
---|
Left comments for improvement on the pull request in comment 36.
comment:40 by , 9 years ago
How about adding the same warning for values_list()
as there is for values()
(as suggested in comment:33) and add an example in this warning itself.
Would that meet our requirements ?
comment:41 by , 9 years ago
I'd rather not duplicate a large block of test if possible. Did you read my comments on the the pull request?
comment:42 by , 9 years ago
Owner: | removed |
---|---|
Status: | assigned → new |
comment:43 by , 9 years ago
Owner: | set to |
---|---|
Status: | new → assigned |
comment:44 by , 9 years ago
Patch needs improvement: | unset |
---|---|
Triage Stage: | Accepted → Ready for checkin |
Hi,
Indeed, there seem to be something going wrong here.
I tried it out and the bug seems present on older versions too (I tried all the way to 1.4).
Thanks.