Opened 4 years ago
Closed 3 years ago
#32492 closed New feature (fixed)
Add django API for Postgres pg_trgm word_similarity, fuzzy full-text search.
Reported by: | Taneli | Owned by: | Nikita Marchant |
---|---|---|---|
Component: | contrib.postgres | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | Paolo Melchiorre | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description (last modified by )
Use case: filter queryset by an inexact substring. For example, retrieve an object with the following headline Dogecoin is following bitcoin in its dramatic rise when searching for dogge.
Currently it is possible to filter a queryset on the basis of trigram similarity between the search string and full text stored in a column.
Author.objects.annotate(similarity=TrigramSimilarity('name', test),).filter(similarity__gt=0.3)
This is a wrapper around the similarity function of the pg_trgm extension. While it allows comparing full strings, i.e. searching for 'doge' would find 'dogs' or 'dogge' it is useless for fuzzy searching of substrings.
SELECT similarity('dogge', 'doge'); --------- 0.57 SELECT similarity('dogge', 'dogecoin is following bitcoin'); ------------ 0.1
word_similarity does take into account the word boundaries
SELECT word_similarity('doge', 'dogecoin is following bitcoin'); -------------- 0.5
Adding a django API to word_similarity would allow for better fuzzy fulltext search without a need to use either raw SQL or external tools like elasticsearch.
Change History (13)
comment:1 by , 4 years ago
Description: | modified (diff) |
---|
comment:2 by , 4 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:3 by , 4 years ago
Component: | Uncategorized → contrib.postgres |
---|---|
Triage Stage: | Unreviewed → Accepted |
Type: | Uncategorized → New feature |
Version: | 3.1 → master |
comment:4 by , 3 years ago
FWIW, I use pg_trgm, and I also have a connection_created
receiver that sets pg_trgm.similarity_threshold
. It might be nice to include the ability to set this in some way, so that other operations that use this can be included.
comment:5 by , 3 years ago
Owner: | removed |
---|
comment:6 by , 3 years ago
Owner: | set to |
---|
comment:7 by , 3 years ago
Hi 👋
I've open a pull request (https://github.com/django/django/pull/14833) implementing the trigram_word_similar
lookup and updating the documentation.
I also wrote a test, but for it to be meaningful, i needed a longer string that the 16 chars available on CharFieldModel.field
in the tests. I did increase the limit to 64 and updated the migration at the same time. Is this the right approach or should it create a new model or a new migration ? (As the test database is recreated almost each time, i guessed that modifying a migration is not a big deal but i could be wrong).
I ran the tests on PostgreSQL 13.4 on Mac with Python 3.9.1
I also have some code (adding TrigramWordSimilarity
and TrigramWordDistance
) almost ready for the rest of the ticket but i still have some troubles/questions because the order of the arguments of WORD_SIMILARITY()
is meaningful, unlike for SIMILARITY()
. Should we discuss it here or on GitHub ?
comment:8 by , 3 years ago
Has patch: | set |
---|
comment:10 by , 3 years ago
Needs documentation: | unset |
---|---|
Patch needs improvement: | unset |
comment:11 by , 3 years ago
Triage Stage: | Accepted → Ready for checkin |
---|
comment:12 by , 3 years ago
Cc: | added |
---|
Thanks. We should add also
TrigramWordDistance()
with the<<->
operator.