Opened 9 years ago
Closed 9 years ago
#25811 closed Cleanup/optimization (fixed)
Error querying models in different databases in one queryset
Reported by: | Edwar Baron | Owned by: | nobody |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | database |
Cc: | edwar.baron@…, Anssi Kääriäinen | Triage Stage: | Ready for checkin |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Model1: Deportes from database1
Model2: Hecho1_VentasCadenasJuegos from database2
Code:
Deportes.objects.filter( pk__in=Hecho1_VentasCadenasJuegos.objects.all().values_list( 'juegos__deporte_id', flat=True ).distinct('juegos__deporte_id') )
Error:
Traceback (most recent call last): File "/usr/local/lib/python3.5/site-packages/django/db/backends/utils.py", line 64, in execute return self.cursor.execute(sql, params) psycopg2.ProgrammingError: relation "admin_datamart_hecho1_ventascadenasjuegos" does not exist LINE 1: ...ISTINCT ON (U1."deporte_id") U1."deporte_id" FROM "admin_dat... ^ The above exception was the direct cause of the following exception: Traceback (most recent call last): File "<console>", line 4, in <module> File "/usr/local/lib/python3.5/site-packages/django/db/models/query.py", line 138, in __repr__ data = list(self[:REPR_OUTPUT_SIZE + 1]) File "/usr/local/lib/python3.5/site-packages/django/db/models/query.py", line 162, in __iter__ self._fetch_all() File "/usr/local/lib/python3.5/site-packages/django/db/models/query.py", line 965, in _fetch_all self._result_cache = list(self.iterator()) File "/usr/local/lib/python3.5/site-packages/django/db/models/query.py", line 238, in iterator results = compiler.execute_sql() File "/usr/local/lib/python3.5/site-packages/django/db/models/sql/compiler.py", line 840, in execute_sql cursor.execute(sql, params) File "/usr/local/lib/python3.5/site-packages/django/db/backends/utils.py", line 79, in execute return super(CursorDebugWrapper, self).execute(sql, params) File "/usr/local/lib/python3.5/site-packages/django/db/backends/utils.py", line 64, in execute return self.cursor.execute(sql, params) File "/usr/local/lib/python3.5/site-packages/django/db/utils.py", line 97, in __exit__ six.reraise(dj_exc_type, dj_exc_value, traceback) File "/usr/local/lib/python3.5/site-packages/django/utils/six.py", line 658, in reraise raise value.with_traceback(tb) File "/usr/local/lib/python3.5/site-packages/django/db/backends/utils.py", line 64, in execute return self.cursor.execute(sql, params) django.db.utils.ProgrammingError: relation "admin_datamart_hecho1_ventascadenasjuegos" does not exist LINE 1: ...ISTINCT ON (U1."deporte_id") U1."deporte_id" FROM "admin_dat...
Viewing the SQL generated
SELECT "admin_juego_deportes"."id", "admin_juego_deportes"."nombre", "admin_juego_deportes"."logo", "admin_juego_deportes"."orden", "admin_juego_deportes"."orden_equipos", "admin_juego_deportes"."count_apuesta", "admin_juego_deportes"."fondoweb", "admin_juego_deportes"."cantidad", "admin_juego_deportes"."cantidad_tiempos", "admin_juego_deportes"."runline_positivo", "admin_juego_deportes"."ganador_empate_not_null", "admin_juego_deportes"."resultado", "admin_juego_deportes"."created_at", "admin_juego_deportes"."updated_at" FROM "admin_juego_deportes" WHERE "admin_juego_deportes"."id" IN (SELECT DISTINCT ON (U1."deporte_id") U1."deporte_id" FROM "admin_datamart_hecho1_ventascadenasjuegos" U0 INNER JOIN "admin_datamart_dimensionjuegos" U1 ON ( U0."juegos_id" = U1."id" )) ORDER BY "admin_juego_deportes"."nombre" ASC
Solution:
Deportes.objects.filter( pk__in=list(Hecho1_VentasCadenasJuegos.objects.all().values_list( 'juegos__deporte_id', flat=True ).distinct('juegos__deporte_id')) )
Viewing the SQL generated
SELECT "admin_juego_deportes"."id", "admin_juego_deportes"."nombre", "admin_juego_deportes"."logo", "admin_juego_deportes"."orden", "admin_juego_deportes"."orden_equipos", "admin_juego_deportes"."count_apuesta", "admin_juego_deportes"."fondoweb", "admin_juego_deportes"."cantidad", "admin_juego_deportes"."cantidad_tiempos", "admin_juego_deportes"."runline_positivo", "admin_juego_deportes"."ganador_empate_not_null", "admin_juego_deportes"."resultado", "admin_juego_deportes"."created_at", "admin_juego_deportes"."updated_at" FROM "admin_juego_deportes" WHERE "admin_juego_deportes"."id" IN (1, 2) ORDER BY "admin_juego_deportes"."nombre" ASC
Nevertheless, the problem is given in the subquery that incrupta in the first query.
In the ORM should be verified that both models belong to different data base and generate the correct SQL, evaluating the second and not generate a subquery.
Change History (18)
comment:1 by , 9 years ago
Resolution: | → invalid |
---|---|
Status: | new → closed |
Summary: | Error in consultation with models of different database → Error querying across foreign keys with models in different databases |
comment:2 by , 9 years ago
You're right actually Django does not currently provide this support, but both models are not directly related, plus I have a properly configured router according to specifications.
I show you some models:
class Deportes(models.Model): ... class DimensionJuegos(models.Model): ... deporte_id = models.IntegerField() ...
Django so no support for relationships between different database, the ORM should be able to identify subqueries that are different databases and evaluate them before processing the first query, which is a optimization generate internal inquiry but as subquery in this case it is not the right way, that only works when the models are in the same database.
Perhaps lazy functionality Django queries that are evaluated only when needed, but to build it must be verified in the generated subqueries.
By the way sorry for my bad English,
Regards.
comment:3 by , 9 years ago
Cc: | added |
---|---|
Resolution: | invalid |
Status: | closed → new |
comment:4 by , 9 years ago
Summary: | Error querying across foreign keys with models in different databases → Error querying models in different databases in one queryset |
---|---|
Triage Stage: | Unreviewed → Accepted |
OP is right to reopen -- the query had nothing to do with relations, only an inner query on a separate database.
A documentation fix is a much more likely fix than behavior change, though.
comment:5 by , 9 years ago
Well, given the way the queryset are evaluated the only solution is to force the evaluation of internal consultation, exactly as specified above with "list ()", but what happens when large projects where time is it separate the database ?, possibly like me exist any code with nested queries, Django should be able to generate differences that and not only evaluate the subquery separately.
I like it try to force this behavior, I have some experience using Django but still not at such a high level, but with some tutoring in things that maybe do not understand well make it, you think? I would try to fix or someone else will?
Excuse my bad English.
Regards.
comment:6 by , 9 years ago
Cc: | added |
---|
Maybe Anssi (akaariai) could say if this is feasible and guide you if so.
comment:7 by , 9 years ago
Thanks, I'm already writing an email to akaariai@…, will await their response.
I've been seeing some code, exactly:
https://github.com/django/django/blob/stable/1.8.x/django/db/models/sql/compiler.py
Line number: 882
return '%s.%s IN (%s)' % (qn(alias), qn2(columns[0]), sql), params
more or less there, shoulds verify that the added subquery belongs to the same database, otherwise not allowed and should evaluate the subquery but values.
I do not have much experience but I would like to make my first contribution, maybe not the right place in the code to solve the problem, but if we could you help me add the right solution.
comment:8 by , 9 years ago
Type: | Bug → Cleanup/optimization |
---|---|
Version: | 1.8 → master |
That would be roughly the right place @ebar0n, but you might want to try fixing it at a higher level (callers of "as_subquery_condition") if you go down that route.
My concern here is one of surprise. Users are mostly aware that passing in a queryset as an __IN
lookup results in a single query that has a subquery. It is what we document to happen. If we silently change this behaviour based on that subquery having to execute on a different database, then we're going from a potential efficient subquery to a potential extremely bad query that pulls out thousands/millions rows back to Django. Erroring in this condition seems like a good idea, so the user can explicitly wrap the subquery in a list().
Providing a nicer error message with a hint to list() the subquery seems like the right solution here.
comment:9 by , 9 years ago
It is effectively efficiently perform the subquery, but I think Django must be able to identify the model referenced in the subquery belongs to a different base data to the initial consultation; Only then the subquery should be evaluated before generating the entire query, and generate the subquery and avoid error.
Rest when they belong to the same database operation should be such as this, because it is more optimal.
That would be the need to use the same functional code with multiple database support.
Then you advise me? I'm anxious to try to fix it.
Excuse my bad English is not my native language.
follow-up: 12 comment:10 by , 9 years ago
It would also be necessary to specify that functionality in the documentation __IN
, that would be the only exception to not generate the subquery.
comment:11 by , 9 years ago
I've got the solution, you can see it in https://github.com/django/django/pull/5732
Tests conducted using debugsqlshell:
Two models that belong to the same database
Deportes.objects.filter( pk__in=Equipos.objects.all().values_list( 'deporte_id', flat=True ).distinct('deporte_id') ) ... SELECT "admin_juego_deportes"."id", "admin_juego_deportes"."nombre", "admin_juego_deportes"."logo", "admin_juego_deportes"."orden", "admin_juego_deportes"."orden_equipos", "admin_juego_deportes"."count_apuesta", "admin_juego_deportes"."fondoweb", "admin_juego_deportes"."cantidad", "admin_juego_deportes"."cantidad_tiempos", "admin_juego_deportes"."runline_positivo", "admin_juego_deportes"."ganador_empate_not_null", "admin_juego_deportes"."resultado", "admin_juego_deportes"."created_at", "admin_juego_deportes"."updated_at" FROM "admin_juego_deportes" WHERE "admin_juego_deportes"."id" IN (SELECT DISTINCT ON (U0."deporte_id") U0."deporte_id" FROM "admin_juego_equipos" U0) ORDER BY "admin_juego_deportes"."nombre" ASC LIMIT 21 [291.41ms]
1 models, using as an argument a list
Deportes.objects.filter( pk__in=[1, 2] ) ... SELECT "admin_juego_deportes"."id", "admin_juego_deportes"."nombre", "admin_juego_deportes"."logo", "admin_juego_deportes"."orden", "admin_juego_deportes"."orden_equipos", "admin_juego_deportes"."count_apuesta", "admin_juego_deportes"."fondoweb", "admin_juego_deportes"."cantidad", "admin_juego_deportes"."cantidad_tiempos", "admin_juego_deportes"."runline_positivo", "admin_juego_deportes"."ganador_empate_not_null", "admin_juego_deportes"."resultado", "admin_juego_deportes"."created_at", "admin_juego_deportes"."updated_at" FROM "admin_juego_deportes" WHERE "admin_juego_deportes"."id" IN (1, 2) ORDER BY "admin_juego_deportes"."nombre" ASC LIMIT 21 [0.88ms]
Models belonging to two different databases
Deportes.objects.filter( pk__in=DimensionJuegos.objects.all().values_list( 'deporte_id', flat=True ) ) ... SELECT U0."deporte_id" FROM "admin_datamart_dimensionjuegos" U0 [0.68ms] SELECT "admin_juego_deportes"."id", "admin_juego_deportes"."nombre", "admin_juego_deportes"."logo", "admin_juego_deportes"."orden", "admin_juego_deportes"."orden_equipos", "admin_juego_deportes"."count_apuesta", "admin_juego_deportes"."fondoweb", "admin_juego_deportes"."cantidad", "admin_juego_deportes"."cantidad_tiempos", "admin_juego_deportes"."runline_positivo", "admin_juego_deportes"."ganador_empate_not_null", "admin_juego_deportes"."resultado", "admin_juego_deportes"."created_at", "admin_juego_deportes"."updated_at" FROM "admin_juego_deportes" WHERE "admin_juego_deportes"."id" IN (1) ORDER BY "admin_juego_deportes"."nombre" ASC LIMIT 21 [1.74ms]
comment:12 by , 9 years ago
Replying to ebar0n:
It would also be necessary to specify that functionality in the documentation
__IN
, that would be the only exception to not generate the subquery.
You seem to be missing @jarshwah 's main concern: Your fix is all nice and well when the subquery returns a small number of records. But assume it can return thousands of records -- even if the external query returns only a few in the end. Say "find customers with last name Baron who made a purchase", written as
Customer.objects.filter(last_name='Baron', customer_id__in= Purchase.objects.all().values_list('customer_id'))
(I know this doesn't look realistic, it is simplified for the sake of example, but believe me, there are plenty of such queries).
Now, as long as we're all in one database, this may well be efficient enough, because only the small number of records will finally be returned, and the database can usually make sure to execute such a query efficiently internally.
But with your fix, and if Purchase
is on a different database, this query suddenly becomes a performance disaster. Further, there is no way for Django to tell if the inner query is "fast enough".
This is why Josh prefers that the fix will be, instead of "just go and get it", to make the error informative and tell the developer how to fix it. For what it's worth, I agree.
Josh, is it possible for a user to override the built-in lookups? @ebar0n's concern seems to be that there are a lot of queries in his code-base which would need to be changed if we don't fix it as he suggests, and where performance is probably not an issue. If he can just override __in
with a lookup that does what he wants, then we can have our cake and he can eat it too.
comment:13 by , 9 years ago
Thanks Shai for clearing that up. Your suggestion about implementing a custom IN lookup is a good one. That is definitely a way users such as ebar0n can opt in to new behaviour. The only issue is that they'd need to carry around and share the code outside of django. It would also be an all-or-nothing proposition. One __in
lookup can be registered per field type, so it's not really something you'd want provided by third party libs.
I still think a nicer error message would be good though, so I won't "wontfix" the ticket. But I don't think we should accept the change that ebar0n is putting forward.
ebar0n, you can use the code you've written in your own application by doing the following:
from django.db.models.lookups import In class MultiDBIn(In): def process_rhs(self, compiler, connection): db_rhs = getattr(self.rhs, 'db', None) # if the argument for the {{__in}} is a subquery, check if the db is different from the main query, # and if they are true evaluate the subquery because Django does not provide support for multiple # base relations of data if (db_rhs and db_rhs != connection.alias) or self.rhs_is_direct_value(): # rhs should be an iterable, we use batch_process_rhs # to prepare/transform those values rhs = list(self.rhs) if not rhs: from django.db.models.sql.datastructures import EmptyResultSet raise EmptyResultSet sqls, sqls_params = self.batch_process_rhs(compiler, connection, rhs) placeholder = '(' + ', '.join(sqls) + ')' return (placeholder, sqls_params) else: return super(In, self).process_rhs(compiler, connection) IntegerField.register_lookup(MultiDBIn) CharField.register_lookup(MultiDBIn) # .. and any other fields you'd want this behaviour to apply to
Thank you ebar0n for working on this and finding a solution, but I'm afraid it's not one that we should accept.
comment:14 by , 9 years ago
Thanks for the clarification, I fully understand the performance problems that this may cause, but because I'm pretty sure that when a developer decides to use different database and do this kind of consultation should take into account this potential problem (if you have a subquery with thousands of records), if it continues in the same database which is running as the subquery is generating optimally. Also in the official documentation shall specify the variant behavior.
But then I respect their opinions; I will correct the code to add the exception as suggested, so that other users do not fall into the same error time and tarden known to occur. and only I customize my code.
Thank you for your help.
comment:15 by , 9 years ago
Ready was added and the change https://github.com/django/django/pull/5732
Then pull that can be approved?
comment:16 by , 9 years ago
Has patch: | set |
---|
Per the multiple databases documentation: