Opened 5 years ago
Last modified 3 years ago
#31304 new New feature
PostgreSQL full-text search employs coalesce function for non-null single-column searches with SearchVector
Reported by: | Paul Boddie | Owned by: | nobody |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | PostgreSQL text search FTS coalesce SearchVector |
Cc: | Paul Boddie, Simon Charette | Triage Stage: | Accepted |
Has patch: | no | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
When following the PostgreSQL full-text search documentation for Django (https://docs.djangoproject.com/en/2.2/ref/contrib/postgres/search/), the search lookup...
Table.objects.filter(column__search="keyword")
...produces SQL of the following form:
to_tsvector(column) @@ plainto_tsquery('keyword')
However, the PostgreSQL documentation notes that such expressions will be unable to take advantage of indexes created on the column:
"Only text search functions that specify a configuration name can be used in expression indexes [...] Because the two-argument version of to_tsvector was used in the index above, only a query reference that uses the 2-argument version of to_tsvector with the same configuration name will use that index."
https://www.postgresql.org/docs/11/textsearch-tables.html
Introducing a SearchQuery object employing the config parameter...
Table.objects.filter(column__search=SearchQuery("keyword", config="simple"))
...produces SQL of the following form:
to_tsvector(column) @@ plainto_tsquery('simple', 'keyword')
The Django documentation suggests using an annotation employing a SearchVector as follows:
Table.objects.annotate(search=SearchVector("column", config="simple")).filter(search=SearchQuery("keyword", config="simple"))
The resulting SQL generated by Django is then as follows:
to_tsvector('simple', coalesce(column, '')) @@ plainto_tsquery('simple', 'keyword')
Unfortunately, the use of coalesce now blocks any application of an index on the column.
What seems to be possible, however, is to modify the SQL generation to avoid using coalesce where it can be determined that the operand given to to_tsvector will not yield a null value. This should produce the following more desirable SQL:
to_tsvector('simple', column) @@ plainto_tsquery('simple', 'keyword')
A patch is provided as a suggestion of how this issue might be fixed.
Attachments (1)
Change History (15)
by , 5 years ago
Attachment: | patch-django-2.2.9-postgres-search.diff added |
---|
comment:1 by , 5 years ago
Cc: | added |
---|---|
Component: | Uncategorized → Database layer (models, ORM) |
comment:2 by , 5 years ago
comment:3 by , 5 years ago
I guess this ticket could also re-purposed as a new feature to make __search=SearchQuery
force the usage of the rhs's config
for the lhs.
Something along
-
django/contrib/postgres/lookups.py
diff --git a/django/contrib/postgres/lookups.py b/django/contrib/postgres/lookups.py index cc5bc022c6..1cb03b9510 100644
a b 1 1 from django.db.models import Lookup, Transform 2 2 from django.db.models.lookups import Exact, FieldGetDbPrepValueMixin 3 3 4 from .search import SearchVector, SearchVectorExact, SearchVectorField 4 from .search import ( 5 SearchQuery, SearchVector, SearchVectorExact, SearchVectorField 6 ) 5 7 6 8 7 9 class PostgresSimpleLookup(FieldGetDbPrepValueMixin, Lookup): … … class SearchLookup(SearchVectorExact): 57 59 58 60 def process_lhs(self, qn, connection): 59 61 if not isinstance(self.lhs.output_field, SearchVectorField): 60 self.lhs = SearchVector(self.lhs) 62 config = None 63 if isinstance(self.rhs, SearchQuery): 64 config = self.rhs.config 65 self.lhs = SearchVector(self.lhs, config=config) 61 66 lhs, lhs_params = super().process_lhs(qn, connection) 62 67 return lhs, lhs_params
That would make filter(column__search=SearchQuery('simple', config='simple'))
result in the desired to_tsvector('simple', column) @@ plainto_tsquery('simple', 'keyword')
.
FWIW that's already something we do for the SearchVectorExact
lookup.
comment:4 by , 5 years ago
The motivation for eliminating coalesce is that it is superfluous when searching a non-null column already indexed appropriately. I accept that in the general case, where people may be combining columns, there may be a need to coalesce null values, but I would regard it as pretty inelegant to have to create an index coalescing values that would never be null. (One might argue that PostgreSQL could itself eliminate any redundant coalesce operation specified for an index, and I don't know why it doesn't in this case.)
The suggestion to make the search lookup employ any indicating configuration from the operand is a good one. Indeed, it surprises me that this isn't already done because it appears to be an easy mistake when manually preparing queries that could be eliminated in a framework like this.
I'll try and make a pull request on GitHub.
comment:5 by , 5 years ago
Triage Stage: | Unreviewed → Accepted |
---|---|
Type: | Uncategorized → New feature |
Accepting based on the discussion so far. Thanks both.
comment:6 by , 5 years ago
Paul, I did more investigation and it looks like we can't also drop the Coalesce
even in cases where there's a single expression and no concatenation needs to take place. The reason for that is that to_tsvector('')
and to_tsvector(null)
are not equivalent as the latter will result in SQL Unknown
when used with the @@
operator. That's problematic when negating expressions as both to_tsvector(null) @@ to_tsquery('test')
and NOT to_tsvector(null) @@ to_tsquery('test')
result in the same value effectively breaking ~Q
and exclude
.
Regarding the .null
handling I also did a bit of investigation and here's a case where checking for output_field.null
would be a false negative
class Manager(models.Model): name = models.TextField() class Store(models.Model): manager = models.ForeignKey(Manager) second_manager = models.ForeignKey(Manager, null=True) Store.objects.annotate( search=SearchVector('manager__name', 'second_manager__name') ).filter(search='Victor')
In this particular case the Col
that 'second_manager__name'
resolves to would have Manager.name
as output_field
but it's not null=True
even it's nullable if store.second_manager is None
. In order to properly determine whether or not the expression should be coalesced we need to annotate Col
with a flag that takes the whole relationship nullability in account.
comment:7 by , 5 years ago
Thanks for the comment, Simon.
I wouldn't expect to_tsvector to treat an empty string and null in the same way - this being SQL after all - but the aim would be to avoid coalesce where null could never occur. This then becomes an issue of knowing when this might be guaranteed.
I'm still getting familiar with the ORM, but I imagine that there could easily be cases where column values are null even if the table columns are declared not null. For instance, any query involving an outer join could produce null values for a "not null" column. In such cases, the characteristics of the output column would need to be defined by identifying its role in the query, pretty much as you say.
I am inclined to think that making my own lookup would be the safest thing to do for now, with the lookups provided also considering the issue of transferring the configuration.
comment:8 by , 5 years ago
Cc: | added |
---|
I wouldn't expect to_tsvector to treat an empty string and null in the same way - this being SQL after all - but the aim would be to avoid coalesce where null could never occur. This then becomes an issue of knowing when this might be guaranteed.
Right the main issue here is that SearchVector
(just like Concat
for example) has been coalescing nulls to empty strings forever so we can't just change it now without breaking backward compatibility.
I'm still getting familiar with the ORM, but I imagine that there could easily be cases where column values are null even if the table columns are declared not null. For instance, any query involving an outer join could produce null values for a "not null" column. In such cases, the characteristics of the output column would need to be defined by identifying its role in the query, pretty much as you say.
Right comment:6 provides another example of that. The JOIN
resolution logic knows about these things but it's not surfaced at the Expression
introspection level right now so we can't even start deprecating the current behaviour if we wanted to. I plan on working on patch that exposes an Expression.nullable
property that gets assigned on field reference resolution but that's not an easy thing to get right as you might suspect.
I am inclined to think that making my own lookup would be the safest thing to do for now, with the lookups provided also considering the issue of transferring the configuration.
I tend to agree, are you still interested in submitting a PR that does a config
assignment in SearchLookup.process_lhs
?
comment:9 by , 5 years ago
Has patch: | unset |
---|---|
Version: | 2.2 → master |
comment:10 by , 5 years ago
I can probably look into it, given that I have managed to write a lookup that seems to have the desired behaviour. Should it be a different issue, though?
comment:14 by , 3 years ago
Until this is addressed, assuming you know expressions you are dealing with are NOT NULL
, a limited replacement for SearchVector
is the following
from django.contrib.postgres.search import SearchConfig, SearchVectorField from django.db.models import Func from django.db.models.expressions import ExpressionList class SearchVector(Func): """ Replacement of `django.contrib.postgres.search.SearchVector` that works around limitations of the later with regards to indexing. See https://code.djangoproject.com/ticket/31304#comment:6 """ function = 'to_tsvector' output_field = SearchVectorField() def __init__(self, *expressions, config=None): expressions = ( SearchConfig.from_parameter(config), ExpressionList(*expressions, arg_joiner=" || ' ' || "), ) super().__init__(*expressions)
Hey Paul,
Could you elaborate on that? What's preventing you from creating a GIN functional index on
to_tsvector('simple', coalesce(column, ''))
instead?I'd suggest you submit your patch as a Github PR so CI can run the full suite against it.
I suspect the
.null
check won't be enough as some column values mapped tonull=False
fields can end up being null when outer joins are involved and it won't be reflected inoutput_field
. In short we can't drop theCoalesce
wrapping unless we can guarantee we don't break backward compatibility here; I wish we forced the usage ofCoalesce
from the beginning instead but that's not the case.