Opened 8 years ago
Closed 4 years ago
#27719 closed New feature (fixed)
Add queryset.alias() to mimic .annotate() for aggregations without loading data
Reported by: | Marc Tamlyn | Owned by: | Alexandr Tatarinov |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | Triage Stage: | Ready for checkin | |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
At the moment the Expressions API currently does not allow creating a value for later use in the query without returning that data from the database. In some cases this can be quite expensive over the pipe, and can also influence query execution where functional indexes are involved. I would like to introduce an API like alias()
, which does the same thing as annotate()
to allow reuse in later querysets but doesn't return the value.
To be explicit, we are changing a query from one of the form:
SELECT to_tsvector('english', "some_field"), "some", "other", "fields" FROM "table" WHERE to_tsvector('english', "some_field") @@ plainto_tsquery("search term") ... LIMIT 10;
to one of the form:
SELECT "some", "other", "fields" FROM "table" WHERE to_tsvector('english', "some_field") @@ plainto_tsquery("search term") ... LIMIT 10;
Change History (18)
comment:1 by , 8 years ago
comment:2 by , 8 years ago
Could this be solved by allowing expressions to be passed to filter()
instead? #25367.
comment:3 by , 8 years ago
Whilst support in all the places which can use an annotated name would solve this kind of performance problem, there are loads of ways you can use a field name after it's defined.
.filter(name__lookup=value)
.order_by('name')
.annotate(SomeOtherExpression('name'))
.values('name')
(and related APIs)`
.distinct()
.date()
(and datetimes
)
.defer('name')
(and only
)
I know that some of these places already support expressions. For very complex expressions there's utility in being able to "name" the expression as some kind of virtual field. If you're within the same function then a python reference is usually sufficient (except for the __lookup
case), but if you have for example a manager method which would annotate a certain field and you use it elsewhere then things get messy quickly.
comment:4 by , 8 years ago
Triage Stage: | Unreviewed → Accepted |
---|
That makes sense, thanks for the clarification.
comment:5 by , 8 years ago
I'm kind of on the fence with this. The 90% case is expressions in WHERE clauses. We need expression support in filter/exclude, including the ability to filter without a left-hand-side (column reference) for things such as EXISTS( subq ).
Having alias()
feels like a fairly big hack.
Let's look at your list of example usages of aliases:
.filter(name__lookup=value) - Yes, Yes, Yes. But solved by expressions-in-filters. .order_by('name') - Already supports expressions .annotate(SomeOtherExpression('name')) - Already supported. .values('name') (and related APIs)` - If you want the values, then it's already in the select list and should be kept there. But expressions in values works now, no? .distinct() - Already in the select list, refer by name. .date() (and datetimes) - unsure about this, fairly uncommon? .defer('name') (and only) - it's already in the select list
Out of all of the above, except for filter
we always want the expression in the select list anyway. And when we use defer, we still want it in the annotations list so we can use it later (I think).
Am I missing something? The only time you might not want the value in the select list is if you're running a filter and possibly an order by. maybe an annotation over another expression where the other expression shouldn't be selected, but I feel that'd be a very small set of users, just to avoid creating two objects that are effectively the same.
comment:6 by , 8 years ago
I take your point about the select list, most of those APIs (e.g. values, distinct) do indeed require the field to be selected making alias slightly redundant.
For me it's very explicitly not a hack, and the fact we can use annotate is the hack. But that's bikeshedding.
More practically, consider this code:
query = SearchQuery('some term') return qs.alias( main_vector=SearchVector('title', 'body'), secondary_vector=SearchVector('parent__title', 'parent__body'), ).alias( vector=F('main_vector') + F('secondard_vector'), ).alias( rank=Case( When(vector__search=query, then=SearchRank('vector', query)), When(secondary_vector__search=query, then=SearchRank('secondary_vector', query)), # other cases here ) ).order_by('-rank')
(I'm aware this example is contrived, but I have some pretty complex annotations for reporting, and they're even more fun with SubQuery
... In most of these cases it doesn't matter that this field is in the select_list, although it doesn't need to be.
Here I've actually used the "alias approach" and the "variable approach" - one for vectors and one for queries.
The alias approach has allowed me to:
- Incrementally build up expressions in the query
- Name values and potentially reuse them at a later time in the query creation phase - perhaps I have some other function which takes the returned queryset from this method and counts the records where the rank is > 0.7 to give an indication of strong matching results, then without the naming from annotate or alias I'd have to access or recreate the component expressions. Rank is an int, so again it doesn't really matter to have it queryable.
I think #25367 is an important piece of work, and it would help solve the performance problem by allowing unselected values. I think though that there is a merit in being able to create a "virtual field" using an expression, and use it in subsequent querysets. This would be an easier solution to this particular performance issue, and it also gives the user explicit API control over the select list where needed, rather than implicitly by how you used your expression. It's not an either/or scenario.
comment:7 by , 7 years ago
In case it helps, I have got an issue using annotate to simply select some values and performance related to it and the way it interacts with count: https://groups.google.com/forum/#!topic/django-users/q6XdfyK29HA
Looks like alias might help this issue.
comment:8 by , 6 years ago
FWIW alias(...)
was initially proposed in https://code.djangoproject.com/ticket/16187#comment:28 but was later removed.
comment:9 by , 5 years ago
There was a recent post to django-developers where a user was trying to use annotate(foo=expr).defer('foo')
for this exact purpose. Their use case was more about avoiding the double evaluation of an EXISTS
expression which is probably fixed on the master branch by 1ca825e4dc186da2b93292b5c848a3e5445968d7 and could be addressed by #25367 which was previously mentioned but it remains that adding annotation support to defer
and only
could be another solution for this issue.
comment:10 by , 4 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
I am working on this one. I've had a lot of cases where unneeded expressions ended up in a select clause.
When someone builds complex queries spanning multiple methods and modules, this can save a lot of CPU (for ORM to build the query) and query execution time.
My suggestion is to support only the usages below, values(), distinct() and aggregate() will require to use .annotate() to add the expression into the select list.
.alias(alias=SomeExpression()).annotate(SomeOtherExpression('alias')) .alias(alias=SomeExpression()).filter(alias__lookup=value) .alias(alias=SomeExpression()).order_by('alias') .alias(alias=SomeExpression()).update(field=F('alias'))
Also automatically supported: earliest/latest and dates/datetimes.
And if you want to use values() or distinct(), promote alias to annotation
.alias(alias=SomeExpression()).annotate(alias=F('alias')).distinct('alias')
The implementation relies on annotation_mask to remove an expression from the select list, all other logic remains the same: despite the expression is not selected, the joins and group by are the same as for annotation.
comment:12 by , 4 years ago
Has patch: | set |
---|
comment:13 by , 4 years ago
Patch needs improvement: | set |
---|
Left some comments for improvements on the PR but pending that the proposed changes LGTM.
comment:15 by , 4 years ago
Patch needs improvement: | set |
---|
comment:16 by , 4 years ago
Patch needs improvement: | unset |
---|
comment:17 by , 4 years ago
Triage Stage: | Accepted → Ready for checkin |
---|
Doing a quick example query: In a table with a relatively large text field, but just 2,400 rows, the difference between returning all rows and all fields without the ts_vector() call and returning all with it was 12.7ms vs 24,000ms.
Doing a full sequential scan (no index) with both calculating the to_tsvector() for return and in query predicate is 27,000ms without recalculating the to_tsvector for return, 46,700ms with recalculating it.
After having created a functional index on the to_tsvector() call, and using a relatively common search term, the time to return without recalculating the to_tsvector was 1.7ms, vs 21,200ms with recalculating the ts_vector.
So, in any scenario, not having to recalcuate the to_tsvector() is a win.