Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#33459 closed Cleanup/optimization (fixed)

Explain how to optimize full text search with SearchVectorField and GinIndex

Reported by: Thomas Aglassinger Owned by: nobody
Component: Documentation Version: 4.0
Severity: Normal Keywords: postgres
Cc: Triage Stage: Accepted
Has patch: yes Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description (last modified by Thomas Aglassinger)

The current documentation section on SearchVectorField at https://docs.djangoproject.com/en/dev/ref/contrib/postgres/search/#searchvectorfield does not explain how to use GinIndex or GistIndex to increase the performance of the search. It currently only describes how to add a SearchVectorField. To my understaning this somewhat does improve the performance on a linear scale by removing the need to parse the fields to search with Postgres' full text search parser. However also indexing this field would typically improve performance by a mangitude.

I eventually managed to piece this together from an article found at http://logan.tw/posts/2017/12/30/full-text-search-with-django-and-postgresql/ but believe this fairly standard use case should be covered in the Django documentation already.

So I propose to add a few paragraphs that show how to add a SearchVectorField to a model with a GinIndex, compute a search vector from multiple fields and then perform a ranked search on it.

For the related pull request, see <https://github.com/django/django/pull/15350>

I don't consider the current patch to be final, things to discuss:

  • Should the section on "SearchVectorField" be ranamed to "SearchVectorField and indexing"?
  • Should the section on "Performance" be included into the section on "SearchVectorField"? Currently it describes the problem well but I found the solution of pointing to the Postgres documentation unhelpful. If GinIndex is mention later anyway, the pointer to the postgres documentation could be added afterwards for further reading.
  • Is it alright to extend the Entry model from the previous chapter, or should I add a separate model like SearchableEntry? The first approach might confuse readers if they skim over the part where Entry gets redefined and think it's the same model as in other chapters.

Also it might be helpful to include a "full text search how-to" for example describing how to efficiently search a database of news articles in multiple languages. While the current reference documentation explains search configurations well enough, the later examples (rightfully) omit it to keep the explanations focused. This however limits their usefulness for skimming and copying the examples.

If you are interested, I could write such a how-to.

Change History (15)

comment:1 by Thomas Aglassinger, 3 years ago

Description: modified (diff)

comment:2 by Thomas Aglassinger, 3 years ago

Description: modified (diff)

comment:3 by Thomas Aglassinger, 3 years ago

Description: modified (diff)

comment:4 by Mariusz Felisiak, 3 years ago

Resolution: wontfix
Status: newclosed

Thanks for this ticket, however indexes are standard optimization techniques that are helpful in most cases not only for SearchVectorField and as such it's already mentioned in docs. I don't see any particular reason to document this in detail for SearchVectorField, the current docs seems sufficient to me . It's looks like a topic for a nice blog post.

in reply to:  4 ; comment:5 by Thomas Aglassinger, 3 years ago

Replying to Mariusz Felisiak:

Thanks for this ticket, however indexes are standard optimization techniques that are helpful in most cases not only for SearchVectorField and as such it's already mentioned in docs. I don't see any particular reason to document this in detail for SearchVectorField, the current docs seems sufficient to me.

The difference with full text search is that you cannot use the standard indexes but have to use the particular GinIndex or GistIndex only provided by postgres. Also there is no real point using them with the text fields of your model, which I had intuitively assumed. They are only really useful in combination with the SearchVectorField.

The current documentation does not point this out despite it being rather simple to show with an example.

It's looks like a topic for a nice blog post.

There are plenty of blogs on the topic (I linked only one), but they are much more elaborate and cover more ground. I believe adding a simple example of combining SearchVectorField with GinIndex would help quite a few people that try to utilize the full text search in an efficient manner.

in reply to:  5 ; comment:6 by Mariusz Felisiak, 3 years ago

The difference with full text search is that you cannot use the standard indexes but have to use the particular GinIndex or GistIndex only provided by postgres.

We could clarify this with a small diff, e.g.:

  • docs/ref/contrib/postgres/search.txt

    diff --git a/docs/ref/contrib/postgres/search.txt b/docs/ref/contrib/postgres/search.txt
    index cfed877d9c..4304d47552 100644
    a b run into performance problems. Full text search is a more intensive process  
    255255than comparing the size of an integer, for example.
    256256
    257257In the event that all the fields you're querying on are contained within one
    258 particular model, you can create a functional index which matches the search
    259 vector you wish to use. The PostgreSQL documentation has details on
     258particular model, you can create a functional
     259:class:`GIN <django.contrib.postgres.indexes.GinIndex>` or
     260:class:`GiST <django.contrib.postgres.indexes.GistIndex>` index which matches
     261the search vector you wish to use. The PostgreSQL documentation has details on
    260262`creating indexes for full text search
    261263<https://www.postgresql.org/docs/current/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX>`_.
    262264

Also there is no real point using them with the text fields of your model, which I had intuitively assumed.

It's already documented in the Performance section: "can create a functional index which matches the search vector you wish to use".

in reply to:  6 ; comment:7 by Thomas Aglassinger, 3 years ago

Replying to Mariusz Felisiak:

We could clarify this with a small diff, e.g.: ...

Yes, that would be a welcome improvement.

Also there is no real point using them with the text fields of your model, which I had intuitively assumed.

It's already documented in the Performance section: "can create a functional index which matches the search vector you wish to use".

I found this too abstract to apply immediately and would have liked an example similar to the one included in the proposed patch. At least something like:

    from django.contrib.postgres.indexes import GinIndex
    from django.contrib.postgres.search import SearchVectorField

    class Entry(models.Model):
        ...  # same data fields as before
        search_vector = models.SearchVectorField(null=True)

        class Meta:
            indexes = [GinIndex(fields=['search_vector']]

Yes, everything shown here is explained one way or the other in the paragraphs before but it took me a while to turn these general and abstract descriptions into the code above. Unlike all the other parts in this chapter, which have a concise and well written example for the concepts they explain.

Last edited 3 years ago by Thomas Aglassinger (previous) (diff)

in reply to:  7 ; comment:8 by Mariusz Felisiak, 3 years ago

I found this too abstract to apply immediately and would have liked an example similar to the one included in the proposed patch. At least something like:

    from django.contrib.postgres.indexes import GinIndex
    from django.contrib.postgres.search import SearchVectorField

    class Entry(models.Model):
        ...  # same data fields as before
        search_vector = models.SearchVectorField(null=True)

        class Meta:
            indexes = [GinIndex(fields=['search_vector']]

Yes, everything shown here is explained one way or the other in the paragraphs before but it took me a while to turn these general and abstract descriptions into the code above. Unlike all the other parts in this chapter, which have a concise and well written example for the concepts they explain.

In the Performance section we're talking about "a functional index which matches the search vector you wish to use" so rather

GinIndex(SearchVector('body_text'), name='body_search_vector_idx')

What do you think about adding?

  • docs/ref/contrib/postgres/search.txt

    diff --git a/docs/ref/contrib/postgres/search.txt b/docs/ref/contrib/postgres/search.txt
    index cfed877d9c..fa369670bb 100644
    a b run into performance problems. Full text search is a more intensive process  
    255255than comparing the size of an integer, for example.
    256256
    257257In the event that all the fields you're querying on are contained within one
    258 particular model, you can create a functional index which matches the search
    259 vector you wish to use. The PostgreSQL documentation has details on
     258particular model, you can create a functional
     259:class:`GIN <django.contrib.postgres.indexes.GinIndex>` or
     260:class:`GiST <django.contrib.postgres.indexes.GistIndex>` index which matches
     261the search vector you wish to use. For example::
     262
     263    GinIndex(SearchVector('body_text'), name='body_search_vector_idx')
     264
     265The PostgreSQL documentation has details on
    260266`creating indexes for full text search
    261267<https://www.postgresql.org/docs/current/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX>`_.
    262268

Would you like to prepare a patch? (if you like it of course)

comment:9 by Claude Paroz, 3 years ago

Patch needs improvement: set
Resolution: wontfix
Status: closednew
Triage Stage: UnreviewedAccepted

in reply to:  8 comment:10 by Thomas Aglassinger, 3 years ago

Replying to Mariusz Felisiak:

In the Performance section we're talking about "a functional index which matches the search vector you wish to use" so rather

GinIndex(SearchVector('body_text'), name='body_search_vector_idx')

What do you think about adding?

Agreed, this would be a good example for the performance section.

We still don't have an example for the SearchVectorField though but I guess people have to live with that for the time being.

Would you like to prepare a patch? (if you like it of course)

If it saves you work, I can do that. I can't really take credits for the contents though as they are too different to my proposal. I also would make a new PR to avoid clutter in the repository log.

Let me know if I should.

comment:11 by Mariusz Felisiak, 3 years ago

If it saves you work, I can do that. I can't really take credits for the contents though as they are too different to my proposal.

You've created this ticket and discussed options with me, so for me you are equally responsible for the current form of proposed clarification. You've deserved to be a Django contributor.

I also would make a new PR to avoid clutter in the repository log.

Yes we need a new PR.

in reply to:  11 comment:12 by Thomas Aglassinger, 3 years ago

Replying to Mariusz Felisiak:

Yes we need a new PR.

See https://github.com/django/django/pull/15371.

comment:13 by GitHub <noreply@…>, 3 years ago

Resolution: fixed
Status: newclosed

In 3a9b8b25:

Fixed #33459 -- Clarified index type in full text search docs.

comment:14 by Mariusz Felisiak <felisiak.mariusz@…>, 3 years ago

In 9a9c5b40:

[4.0.x] Fixed #33459 -- Clarified index type in full text search docs.

Backport of 3a9b8b25d48c5768633e73edc5ddca20c3fd716c from main

comment:15 by Mariusz Felisiak, 3 years ago

Patch needs improvement: unset
Note: See TracTickets for help on using tickets.
Back to Top