Opened 3 years ago

Closed 3 years ago

#32803 closed New feature (duplicate)

Not obvious interaction between Postgresql trgm index and icontains lookup

Reported by: Dmitri Emelianov Owned by: nobody
Component: Database layer (models, ORM) Version: dev
Severity: Normal Keywords: postgresql icontains lookup search
Cc: Triage Stage: Unreviewed
Has patch: no Needs documentation: no
Needs tests: no Patch needs improvement: no
Easy pickings: no UI/UX: no

Description

Hi! Long post ahead.

tldr; standard lookup is not working with standard index as expected

Using LIKE UPPER in icontains is not an accident, it is even stated that it's faster

# Use UPPER(x) for case-insensitive lookups; it's faster.
if lookup_type in ('iexact', 'icontains', 'istartswith', 'iendswith'):
    lookup = 'UPPER(%s)' % lookup

However, the main reason I've created the ticket is not the speed, but usage icontains together with trgm index.

Let's create a simple test model:

class Book(models.Model):
    name = models.TextField()

and prefil it with kind of random data.

Random function (idea taken from https://stackoverflow.com/a/5478750/1185696):

CREATE OR REPLACE FUNCTION get_random_name() RETURNS text as $$
SELECT array_to_string(array 
       ( 
              select substr('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', trunc(random() * 62)::integer + 1, 1)
              FROM   generate_series(1, (random() * 15)::integer)), '') || ' ' ||  array_to_string(array 
       ( 
              select substr('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789', trunc(random() * 62)::integer + 1, 1)
              from generate_series(1, (random() * 15)::integer)), '');
$$
VOLATILE
LANGUAGE SQL;

-- Example output: M2Fib gJr4CHLnZHcr
-- I want data to be kind of "real" if it is could be said about random string

->

INSERT INTO book_book(name) SELECT get_random_name() FROM generate_series(1, 10000);

Now let's do some search.

Book.objects.filter(name__icontains='orc')

# SELECT "book_book"."id", "book_book"."name" FROM "book_book" WHERE UPPER("book_book"."name"::text) LIKE UPPER('%orc%') ORDER BY "book_book"."id" ASC
# ->  Seq Scan on book_book  (cost=0.00..220.00 rows=80 width=24) (actual time=5.755..5.755 rows=0 loops=1)
#        Filter: (upper(name) ~~ '%ORC%'::text)
#        Rows Removed by Filter: 9996

Seq scan, totally understandable, no complains here. Now let's do 1m entries:

->  Parallel Seq Scan on book_book  (cost=0.00..8865.51 rows=3098 width=24) (actual time=2.024..341.389 rows=143 loops=3)
               Filter: (upper(name) ~~ '%ORC%'::text)
               Rows Removed by Filter: 369857
 Planning Time: 0.214 ms
 Execution Time: 374.926 ms

Still OK, but let's do better by adding indices. Since pg_trgm is very popular solution for text search, let's use it (I am not sure I know the alternatives aside from tsvector for such kind of queries):

indexes = [
            GinIndex(name='test_index', fields=['name'], opclasses=['gin_trgm_ops']),
        ]

Repeat query:

->  Parallel Seq Scan on book_book  (cost=0.00..14641.50 rows=46 width=24) (actual time=331.265..331.266 rows=0 loops=3)
               Filter: (upper(name) ~~ '%ORC%'::text)
               Rows Removed by Filter: 370000
 Planning Time: 1.524 ms
 Execution Time: 368.272 ms

Looks like index is not working.

Change query a bit:

EXPLAIN ANALYZE SELECT "book_book"."id", "book_book"."name" FROM "book_book" WHERE "book_book"."name"::text ILIKE '%orc%' ORDER BY "book_book"."id" ASC;

->

  ->  Bitmap Heap Scan on book_book  (cost=20.86..426.28 rows=111 width=24) (actual time=0.158..1.441 rows=430 loops=1)
         Recheck Cond: (name ~~* '%orc%'::text)
         Heap Blocks: exact=414
         ->  Bitmap Index Scan on test_index  (cost=0.00..20.83 rows=111 width=0) (actual time=0.075..0.076 rows=430 loops=1)
               Index Cond: (name ~~* '%orc%'::text)
 Planning Time: 0.178 ms
 Execution Time: 1.553 ms

I call "UPPPER" function usage as "implementation detail", because postgres have standard operator. And the main issue is that implementation detail adds additional complexity to index and you basically have to remember to use "upper" in all trgm indices. icontains is standard lookup. trgm index is kind of standard these days. I believe, one would expect pretty straightforward interaction between lookup and index (lookup should become faster). Current behavior, when default index is basically not working for default lookup is not obvious for me. This can be easily avoided just by using standard, out of the box operator.

Now let's do some measures to check if it is faster or not. I will use same data generator, I will use "orc" query always, I will always pick average of 3 measurements. No indices applied.

PostgreSQL 12.6 (Ubuntu 12.6-0ubuntu0.20.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit
+
en_US.UTF-8 collation

Results:

                         10k                     100k                     1m                      10m
ILIKE                     9.0                     111                    629                     4247 
UPPER                     6.6                     72                     448                     3175 
LOWER                     7.1                     71                     462                     2962

https://stackoverflow.com/questions/20336665/lower-like-vs-ilike question shows similar difference, but notes that everything could be more complicated.

Indead, in my environment, ILIKE shows lower performance and I have no explanation about that.

However, I would call this results implementation detail as well, because the main question is using "standard" operators and predictable interactions rather than performance. If one is concerned about performance, it could use an index. I also understand, that starting from 3.2 it could be solved by custom operator in Index, but how many people would expect that default won't work? And what about indices created before 3.2? Especially when docs clearly tells (with shady SQL equivalent term) that "icontains" will use ILIKE. I guess this either could be changed to ILIKE (with performance penalty and broken existing custom UPPER indices), either heavily documented - icontains lookup should clearly tells which query it will use and which index should be used. Also GinIndex/GistIndex could use a note about TEXT fields (that you probably want to use UPPER function for text).

Change History (1)

comment:1 by Mariusz Felisiak, 3 years ago

Component: contrib.postgresDatabase layer (models, ORM)
Owner: set to nobody
Resolution: duplicate
Status: newclosed
Type: UncategorizedNew feature

Closing as duplicate of #32485 and #20775.

Note: See TracTickets for help on using tickets.
Back to Top