Opened 5 years ago

Last modified 4 years ago

#31304 new New feature

PostgreSQL full-text search employs coalesce function for non-null single-column searches with SearchVector

Reported by: Paul Boddie Owned by: nobody
Component: Database layer (models, ORM) Version: dev
Severity: Normal Keywords: PostgreSQL text search FTS coalesce SearchVector
Cc: Paul Boddie, Simon Charette Triage Stage: Accepted
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

When following the PostgreSQL full-text search documentation for Django (https://docs.djangoproject.com/en/2.2/ref/contrib/postgres/search/), the search lookup...

Table.objects.filter(column__search="keyword")

...produces SQL of the following form:

to_tsvector(column) @@ plainto_tsquery('keyword')

However, the PostgreSQL documentation notes that such expressions will be unable to take advantage of indexes created on the column:

"Only text search functions that specify a configuration name can be used in expression indexes [...] Because the two-argument version of to_tsvector was used in the index above, only a query reference that uses the 2-argument version of to_tsvector with the same configuration name will use that index."

https://www.postgresql.org/docs/11/textsearch-tables.html

Introducing a SearchQuery object employing the config parameter...

Table.objects.filter(column__search=SearchQuery("keyword", config="simple"))

...produces SQL of the following form:

to_tsvector(column) @@ plainto_tsquery('simple', 'keyword')

The Django documentation suggests using an annotation employing a SearchVector as follows:

Table.objects.annotate(search=SearchVector("column", config="simple")).filter(search=SearchQuery("keyword", config="simple"))

The resulting SQL generated by Django is then as follows:

to_tsvector('simple', coalesce(column, '')) @@ plainto_tsquery('simple', 'keyword')

Unfortunately, the use of coalesce now blocks any application of an index on the column.

What seems to be possible, however, is to modify the SQL generation to avoid using coalesce where it can be determined that the operand given to to_tsvector will not yield a null value. This should produce the following more desirable SQL:

to_tsvector('simple', column) @@ plainto_tsquery('simple', 'keyword')

A patch is provided as a suggestion of how this issue might be fixed.

Attachments (1)

patch-django-2.2.9-postgres-search.diff (1.2 KB ) - added by Paul Boddie 5 years ago.

Download all attachments as: .zip

Change History (15)

by Paul Boddie, 5 years ago

comment:1 by Paul Boddie, 5 years ago

Cc: Paul Boddie added
Component: UncategorizedDatabase layer (models, ORM)

comment:2 by Simon Charette, 5 years ago

Hey Paul,

Unfortunately, the use of coalesce now blocks any application of an index on the column.

Could you elaborate on that? What's preventing you from creating a GIN functional index on to_tsvector('simple', coalesce(column, '')) instead?

I'd suggest you submit your patch as a Github PR so CI can run the full suite against it.

I suspect the .null check won't be enough as some column values mapped to null=False fields can end up being null when outer joins are involved and it won't be reflected in output_field. In short we can't drop the Coalesce wrapping unless we can guarantee we don't break backward compatibility here; I wish we forced the usage of Coalesce from the beginning instead but that's not the case.

comment:3 by Simon Charette, 5 years ago

I guess this ticket could also re-purposed as a new feature to make __search=SearchQuery force the usage of the rhs's config for the lhs.

Something along

  • django/contrib/postgres/lookups.py

    diff --git a/django/contrib/postgres/lookups.py b/django/contrib/postgres/lookups.py
    index cc5bc022c6..1cb03b9510 100644
    a b  
    11from django.db.models import Lookup, Transform
    22from django.db.models.lookups import Exact, FieldGetDbPrepValueMixin
    33
    4 from .search import SearchVector, SearchVectorExact, SearchVectorField
     4from .search import (
     5    SearchQuery, SearchVector, SearchVectorExact, SearchVectorField
     6)
    57
    68
    79class PostgresSimpleLookup(FieldGetDbPrepValueMixin, Lookup):
    class SearchLookup(SearchVectorExact):  
    5759
    5860    def process_lhs(self, qn, connection):
    5961        if not isinstance(self.lhs.output_field, SearchVectorField):
    60             self.lhs = SearchVector(self.lhs)
     62            config = None
     63            if isinstance(self.rhs, SearchQuery):
     64                config = self.rhs.config
     65            self.lhs = SearchVector(self.lhs, config=config)
    6166        lhs, lhs_params = super().process_lhs(qn, connection)
    6267        return lhs, lhs_params

That would make filter(column__search=SearchQuery('simple', config='simple')) result in the desired to_tsvector('simple', column) @@ plainto_tsquery('simple', 'keyword').

Version 1, edited 5 years ago by Simon Charette (previous) (next) (diff)

comment:4 by Paul Boddie, 5 years ago

The motivation for eliminating coalesce is that it is superfluous when searching a non-null column already indexed appropriately. I accept that in the general case, where people may be combining columns, there may be a need to coalesce null values, but I would regard it as pretty inelegant to have to create an index coalescing values that would never be null. (One might argue that PostgreSQL could itself eliminate any redundant coalesce operation specified for an index, and I don't know why it doesn't in this case.)

The suggestion to make the search lookup employ any indicating configuration from the operand is a good one. Indeed, it surprises me that this isn't already done because it appears to be an easy mistake when manually preparing queries that could be eliminated in a framework like this.

I'll try and make a pull request on GitHub.

comment:5 by Carlton Gibson, 5 years ago

Triage Stage: UnreviewedAccepted
Type: UncategorizedNew feature

Accepting based on the discussion so far. Thanks both.

comment:6 by Simon Charette, 5 years ago

Paul, I did more investigation and it looks like we can't also drop the Coalesce even in cases where there's a single expression and no concatenation needs to take place. The reason for that is that to_tsvector('') and to_tsvector(null) are not equivalent as the latter will result in SQL Unknown when used with the @@ operator. That's problematic when negating expressions as both to_tsvector(null) @@ to_tsquery('test') and NOT to_tsvector(null) @@ to_tsquery('test') result in the same value effectively breaking ~Q and exclude.

Regarding the .null handling I also did a bit of investigation and here's a case where checking for output_field.null would be a false negative

class Manager(models.Model):
    name = models.TextField()

class Store(models.Model):
    manager = models.ForeignKey(Manager)
    second_manager = models.ForeignKey(Manager, null=True)


Store.objects.annotate(
    search=SearchVector('manager__name', 'second_manager__name')
).filter(search='Victor')

In this particular case the Col that 'second_manager__name' resolves to would have Manager.name as output_field but it's not null=True even it's nullable if store.second_manager is None. In order to properly determine whether or not the expression should be coalesced we need to annotate Col with a flag that takes the whole relationship nullability in account.

comment:7 by Paul Boddie, 5 years ago

Thanks for the comment, Simon.

I wouldn't expect to_tsvector to treat an empty string and null in the same way - this being SQL after all - but the aim would be to avoid coalesce where null could never occur. This then becomes an issue of knowing when this might be guaranteed.

I'm still getting familiar with the ORM, but I imagine that there could easily be cases where column values are null even if the table columns are declared not null. For instance, any query involving an outer join could produce null values for a "not null" column. In such cases, the characteristics of the output column would need to be defined by identifying its role in the query, pretty much as you say.

I am inclined to think that making my own lookup would be the safest thing to do for now, with the lookups provided also considering the issue of transferring the configuration.

comment:8 by Simon Charette, 5 years ago

Cc: Simon Charette added

I wouldn't expect to_tsvector to treat an empty string and null in the same way - this being SQL after all - but the aim would be to avoid coalesce where null could never occur. This then becomes an issue of knowing when this might be guaranteed.

Right the main issue here is that SearchVector (just like Concat for example) has been coalescing nulls to empty strings forever so we can't just change it now without breaking backward compatibility.

I'm still getting familiar with the ORM, but I imagine that there could easily be cases where column values are null even if the table columns are declared not null. For instance, any query involving an outer join could produce null values for a "not null" column. In such cases, the characteristics of the output column would need to be defined by identifying its role in the query, pretty much as you say.

Right comment:6 provides another example of that. The JOIN resolution logic knows about these things but it's not surfaced at the Expression introspection level right now so we can't even start deprecating the current behaviour if we wanted to. I plan on working on patch that exposes an Expression.nullable property that gets assigned on field reference resolution but that's not an easy thing to get right as you might suspect.

I am inclined to think that making my own lookup would be the safest thing to do for now, with the lookups provided also considering the issue of transferring the configuration.

I tend to agree, are you still interested in submitting a PR that does a config assignment in SearchLookup.process_lhs?

comment:9 by Mariusz Felisiak, 5 years ago

Has patch: unset
Version: 2.2master

comment:10 by Paul Boddie, 5 years ago

I can probably look into it, given that I have managed to write a lookup that seems to have the desired behaviour. Should it be a different issue, though?

comment:11 by Simon Charette, 5 years ago

Should it be a different issue, though?

Yeah I think it should.

comment:12 by Mariusz Felisiak <felisiak.mariusz@…>, 5 years ago

In 7c8b043a:

Refs #31304 -- Made search lookup default to its rhs' config.

This make the SearchLookup lookup more coherent with its
SearchVectorExact base which configures its rhs SearchQuery with its
lhs' config.

comment:13 by Mariusz Felisiak <felisiak.mariusz@…>, 5 years ago

In e2417010:

Refs #31304 -- Added tests for using search lookup with combined SearchQuerys with the same config.

comment:14 by Simon Charette, 4 years ago

Until this is addressed, assuming you know expressions you are dealing with are NOT NULL, a limited replacement for SearchVector is the following

from django.contrib.postgres.search import SearchConfig, SearchVectorField
from django.db.models import Func
from django.db.models.expressions import ExpressionList


class SearchVector(Func):
    """
    Replacement of `django.contrib.postgres.search.SearchVector` that
    works around limitations of the later with regards to indexing.

    See https://code.djangoproject.com/ticket/31304#comment:6
    """

    function = 'to_tsvector'
    output_field = SearchVectorField()

    def __init__(self, *expressions, config=None):
        expressions = (
            SearchConfig.from_parameter(config),
            ExpressionList(*expressions, arg_joiner=" || ' ' || "),
        )
        super().__init__(*expressions)
Note: See TracTickets for help on using tickets.
Back to Top