#16187 closed Cleanup/optimization (fixed)
Refactor lookup system
Reported by: | Ulrich Petri | Owned by: | Anssi Kääriäinen |
---|---|---|---|
Component: | Database layer (models, ORM) | Version: | dev |
Severity: | Normal | Keywords: | |
Cc: | charette.s@…, sfllaw@…, FunkyBob, mpessas@…, loic@… | Triage Stage: | Accepted |
Has patch: | yes | Needs documentation: | no |
Needs tests: | no | Patch needs improvement: | yes |
Easy pickings: | no | UI/UX: | no |
Description
There are a few tickets that propose adding new lookup types [1][2][3][4] and
also one [5] that suggests adding the possibility for custom fields to have the
ability to add their own lookup types.
The current implementaion is not able to distinguish between field names and
lookup types when querying through a relation.
I briefly talked about this topic with Russel, Janis and Alex at the
djangocon.eu 2011 sprints. The consensus was that it would be worthwile to do a
refactor in this area. It would also allow django to have "awesome fields" as
Alex called it :)
I'm starting this ticket to track work in this area and also "bundle" the other
related tickets. I'll also attach a patch with a basic testcase that fails with
the current implementation.
Attachments (6)
Change History (47)
by , 13 years ago
Attachment: | tests-t16187.patch added |
---|
comment:1 by , 13 years ago
Triage Stage: | Unreviewed → Accepted |
---|
by , 13 years ago
Attachment: | tests-t16187-2.patch added |
---|
comment:2 by , 13 years ago
comment:3 by , 13 years ago
Has patch: | set |
---|---|
Owner: | changed from | to
Patch needs improvement: | set |
The t16187-proof-of-concept patch - as the name says - is a (quick and dirty) proof of concept (the testsuite passes, including those tests in tests-t16187-2.patch)
This means that fields may be named like lookup types and new lookup types can be added at will without backwards compatibility concerns.
This patch definitely needs improvement / cleanup before its RFC.
by , 13 years ago
Attachment: | t16187-proof-of-concept.patch added |
---|
comment:4 by , 13 years ago
See also #11670 which tackles the conflicts that already exist with the 'year', 'month' and 'day' lookups.
by , 13 years ago
Attachment: | 16187.lookup-refactor.diff added |
---|
comment:5 by , 13 years ago
I've updated the patch to avoid some test failures with foreign keys by adding the following special case:
if len(part) > 3 and part[-3:] == '_id' field, _, _, _ = model._meta.get_field_by_name(part[:-3]) ...
That feels a bit hacky but it works.
I'd really like to push this over the line for 1.4 as it's causing several issues in the ORM and the admin. So I'm wondering: What is the roadmap for this? How could the current patch be improved? What API is considered to make the lookups configurable?
I'm a total noob in regards to the ORM codebase but I'm keen to help out under some supervision :)
comment:6 by , 13 years ago
Not this ticket's problem, but it would make sense to do even bigger refactor in the direction that first do a traversal of the lookup path and determine if there are multi-joins in the path, raise errors if the path is not valid etc, and only after that do setup_joins with the valid path. This ticket's patch is one step in that direction, because this already traverses the lookup path upfront. The next step would be to check errors already at this point and use the info gathered during the traversal for setup_joins calls.
That being said, I will try to review (as in point out what needs improvement in my opinion) tomorrow.
comment:7 by , 13 years ago
So here's my idea for how custom field lookups should work (not necessarily related to this ticket):
(1) First the Query identifies the root object of the lookup, this should either be a field, or an aggregate I think.
(2) Then, it traverses the remaining bits, at each section it calls a method on the return value from either part 1, or the previous item in part two. This method should return a representation of the next step, so for example an fkey__field
would return something that says "I am a field traversal", and it could raise some exception to bail out (in particular this should be used on the final step, in order to indicate that an __
lookup like __lt
should use the comparison operator).
(3) Finally the last step of this should add itself to the query. In theory this calls for an even bigger refactor, because really it should add itself to the compiler at the very end, but I'm not sure if that's possible ATM, and it shouldn't block this happening.
comment:8 by , 13 years ago
I didn't have time to do the promised review before. But here it is:
I think the big issue is how to restructure the code in a way that there is not duplication of the lookup path resolving code. Currently you need to first walk the path in resolving the final field and lookup part of the path, and then again in setup_joins(). This means you have to do pretty much the same code in two places, which is IMHO not good. This is not an easy issue to solve, as setup_joins refactoring is needed, and that piece of code isn't the easiest to refactor.
I think there are some mistakes in the patch:
- IMHO it would be good to have the lookup path resolving done in different method than add_filter, if for no other reason than readability.
- I don't like this logic:
for ...: try: ... try: ... except FieldDoesNotExist: if really_not_exists: raise ... except FieldDoesNotExist: do stuff
Related to that, you are looking for aggregates only after you have walked the path, this results inFoo.objects.filter(id__someaggregate__exact=1)
being valid. I think the aggregate resolving must be done upfront.
As it happens, I had some code related to one of my projects lying around which could be reused for this. I have attached it as a patch. It does more than is needed (it actually tries to resolves the joins needed), and because of that there is a lot of duplication of code to setup_joins. But it does pass the test suite (well, there is one test failing, but it not failing before is a bug in current implementation). I hope it gives some ideas about how to solve this issue. It is incomplete and downright uqly in places, though.
I don't understand the last step's problem Alex mentioned above. Why can't you just add the lookup normally into the where / having clause, and resolve it into SQL in the compiler in the normal way:
# compiler.py:72-73 where, w_params = self.query.where.as_sql(qn=qn, connection=self.connection) having, h_params = self.query.having.as_sql(qn=qn, connection=self.connection)
Or is the idea that the lookup can return something else than things going into the where / having parts of the query?
Last: I had some problems making this work with GenericRelations. Is there some reason why they are using ManyToManyRel instead of ManyToOneRel? That seems just wrong...
comment:9 by , 13 years ago
akaariai, I agree that it'd be nice to avoid multiple traversals of the lookup query and that, to do so, some big-ish refactoring would be required. In the meantime, I have posted a patch (with tests) in #11670 with a slightly simpler algorithm than the ones originally posted here and there. That patch aims to fix the possible collisions between genuine field names and built-in lookup types.
I'm not sure whether that bug should be fixed in priority or if it should wait for the general refactoring to happen.
comment:10 by , 13 years ago
I've done some further refactoring in the latest patch for #11670 (see specifically ticket:11670#comment:15) following akaariai's suggestion to modify setup_joins
in comment:8. Some help pushing that one over the line would be appreciated.
comment:11 by , 12 years ago
I have done some initial work for custom lookups (and refactoring of the lookup system in general) - work available from https://github.com/akaariai/django/compare/ticket_10790_new...ticket_16187 (this is based on work done in #10790). The commit messages are worth a read.
The modeltests/field_subclassing/* files shows one trivial custom lookup. However, it should be possible to define complex lookups too.
All tests pass.
That was the positive news. The negative news are:
- The where.py / fields.py / lookups.py interaction is a total mess (it was messy before, now it is a bit worse).
- The API of Lookup is totally unfinished. We need to think how Lookup and Field interact (Lookup defines boolean filter conditions, Field datatypes)
- There should be easier way to add lookups to existing fields than subclassing SomeField.register_lookup comes to mind here.
- There should be (maybe?) better ways to write multibackend compliant Lookups.
- There is just one core lookup "BackwardsCompatLookup". All existing lookups should be defined as different Lookup subclasses.
- Probably a lot more...
Cleaning the mess in where.py & friends can be done later. What I did doesn't make it much worse than what it is already. The cleanup and core lookup rewrite is going to take some time, but it should not be particularly hard to do.
The API however is going to be somewhat hard to define. What does Field do, what does Lookup do. Currently Field still knows the lookup types, and we definitely want to change that. How to do that in a way that is even somewhat backwards compatible is going to be interesting job.
Still, I would say that the hard parts are now done.
comment:12 by , 12 years ago
Cc: | added |
---|---|
Version: | 1.3 → master |
comment:13 by , 12 years ago
Just a little bit more about the Lookup <-> Field interaction problem. Currently, value preparation for lookups is done by get_prep_lookup and get_db_prep_lookup. These are documented and thus public API methods. These do things like convert the given value to proper format for icontains queries ('foo' to 'foo%'). The base implementation of these methods is a long if-else chain listing different lookup names and then passing the actual conversion to the connection used.
We likely would want to move these methods to Lookup. They actually are already in the current Lookup but the implementation just passes the value onto the Field. But if we want to stop passing the implementation to the field, then we are breaking backwards compatibility. We could perhaps detect if the given field has overridden get_db_prep_lookup and get_prep_lookup, if so, pass to field and raise deprecation warning. This is likely going to be somewhat complex... For one we have to still define these methods in Django's core fields as otherwise subclasses calling super() will break...
I will next go through some of the basic lookup types to see how much work converting one lookup is. Will report back...
comment:14 by , 12 years ago
I have now something that seems to be fully working: https://github.com/akaariai/django/compare/ticket_16187
The above branch contains #10790 changes, too. After I commit #10790 changes I am going to rebase the above branch on master.
There is a lot to do still:
- API design (The Lookup <-> Field interface and the Backend <-> Lookup interface, how to register new lookups per field)
- How to implement backwards compatibility (the above branch should be 100% backwards compatible, it essentially contains both the new and old code...).
- Cleaning up the implementation
- ...
Adding new lookups seems to be somewhat easy, though it requires subclassing. Having the ability to hook a custom lookup to a given field instance would be really nice to have...
The new code seems to be a little bit faster for WhereNode generation (around 5% faster query execution speed for "complex" filter clause).
An example of a custom lookup can be found from the tests: https://github.com/akaariai/django/compare/ticket_16187#L16R75
In summary: a lot of work to do still but this feature will likely be in 1.6.
comment:15 by , 12 years ago
Cc: | added |
---|
comment:16 by , 12 years ago
#10790 just went in, rebased branch available from https://github.com/akaariai/django/compare/ticket_16187_new
The commit history is ugly... but the basic status is that the above branch does implement custom lookups in barckwards compatible way. The next steps would be to convert the GIS ORM to use the custom lookups, and then to try to implement things like datefield__year__2000=1
.
After that the big question is how exactly to handle backwards compatibility. Otherwise I don't see currently any major problems for this feature.
comment:17 by , 12 years ago
I have rebased the work of ticket_16187_new on current master, and I have also created a branch which cherry-picks moving the names_to_path call into add_filter(). This is a needed change in any case, so that we can handle the split_exclude() case properly (currently an F() expression is added two times into the query when doing a split_exclude()).
cherry-pick available from: https://github.com/akaariai/django/compare/django:master...ticket_16187_cherrypick
I am not entirely happy with the cherry-pick branch, but it is IMO move into right direction. So, I will likely commit the cherry-pick into master as separate (rebased) commit from the rest of stuff.
comment:18 by , 12 years ago
I implemented one nested lookup, year__somelookup
, where somelookup is a lookup supported by IntegerField.
I also converted django-hstore to work with the new lookup code, and added some nested lookups to hstore field. See this django-developers post for more details.
comment:19 by , 12 years ago
The following things now work: using .values(), .values_list(), aggregates (.annotate() and .aggregate()), .order_by(), distinct(*fields) and F() expressions with custom lookups. This means one can do things like:
Book.objects.order_by('pub_date__year')
This should be very useful when used with hstore or other similar fields. Consider something like:
SportActivity.objects.filter(type=RUNNING).aggregate(Avg('data__distance'))
where data is json field for example. In this example you store different types of sport activities in a DB table with some common fields, and store the details in the "data" field. But you are still able to access the details through the ORM.
I have update django-hstore (available from https://github.com/akaariai/django-hstore/compare/custom_lookups), it contains now one .order_by() and one .aggreagte() example which drills into the hstore.
As an example from the tests:
assertEqual(DataBag.objects.aggregate(avg=Avg('data__v__asint')), {'avg': 1.5})
generates this SQL:
SELECT AVG(("app_databag"."data" -> %s)::INTEGER) AS "avg" FROM "app_databag", ['v']
The current Django code can be found from a new branch: https://github.com/akaariai/django/tree/nested_lookups.
comment:20 by , 12 years ago
I think I know how to implement the nested lookups. Here is the idea: A lookup is something that has get_lookup() method if it supports nesting. Nesting is supported for those lookups that transform the column value in some way (for example, __year
lookup will extract the year from the column). These lookups will respond to as_sql() method. Similarly fields have get_lookup() method. Finally a new class Column is introduced, this too responds to as_sql() method.
So for example mydatefield__year__exact=2000
would resolve as follows:
- resolve field (mydatefield from table "mytable" found).
- field.get_lookup('year') -> resolves to YearLookup instance which is given a Column("mytable", "mydatefield") as lhs value.
- resolve 'exact' from the 'year' lookup. Will return an ExactLookup which is given the YearLookup as lhs value.
- as the ExactLookup is a final lookup, it will be given 2000 as a value (set_lookup_value() maybe?).
- The ExactLookup is added to WhereNode as is (no Constraint objects, just plain where.add(found_lookup, connector)).
When transforming the value to SQL at compiler stage the following things happen:
- The ExactLookup.as_sql() is called.
- The ExactLookup will call lhs.as_sql(), thus the YearLookup as_sql() is called.
- The YearLookup.as_sql() is called, this will again call the given Column's as_sql() and resolve to "mytable"."mydatefield".
- The YearLookup will wrap the "mytable"."mydatefield" to EXTRACT 'YEAR' FROM "mytable"."mydatefield".
- The ExactLookup will return EXTRACT 'YEAR' FROM "mytable"."mydatefield" = %s, (2000,).
So, a lookup .as_sql() works in two ways depending if the field has been given a value or not - it either does a transformation of the value (cast, extract, ...), or it produces boolean condition. There are a lot of details to make this work correctly. How to treat SQLEvaluators, subqueries, many special cases like isnull checks, how to handle nullability - certain lookups can transform non-null values to non-nullable values, how to handle return type information - from Column through lookup transformations, what is the final type? And so on...
Still, the above idea should have a solid foundation for making things work cleanly. For backwards compatibility I think the best approach is to have the current code as much as-is somewhere, and use that for the backwards compatibility period. The main problem is field.get[_db]_prep_lookup() methods which needs to be called both for new code and old code... Lets see how to make that work nicely.
comment:21 by , 12 years ago
Cc: | added |
---|
comment:22 by , 11 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
I am working on this again. I decided to start from implementing current base lookups as classes. The results can be found from: https://github.com/akaariai/django/tree/custom_lookups
Before there is full support for custom lookups, the following things need to be done:
- Add a way to nest lookups. There is some work on that already in the above branch. This shouldn't be too hard.
- Backwards compatibility - Maybe just returning None from a field's get_lookup() could enable backwards compat mode. This mode should work exactly like before, so those who have overridden WhereNodes etc will still get a working implementation.
- Add a way to register lookups to fields (I think either instances or classes is wanted behaviour).
- Add the ability to use "extracts" in .values(), .order_by() etc. The basic idea is that a Lookup itself can't be used in such situations, you will need to write an Extract class.
- Add ability to ask if given lookup ('myfieldcustomlookup') is allowed for given queryset - admin needs this. I think adding QuerySet.is_allowed_lookup(somelookup) is the way to go.
- Of course, also docs & tests needed.
Of the above the field registration API is the biggest question.
comment:23 by , 11 years ago
I just force-pushed an update to https://github.com/akaariai/django/tree/custom_lookups. I am going to use force-push for some time still.
The basic chained lookup functionality is there, but the implementation is far from finished (the code needs heavy cleanup, and additional features are needed, too).
From the TODO list in above comment nesting and registering lookups are implemented. Backwards compatibility should be there, but I haven't tested that too much. Extract and lookup_exists() functionality is still on TODO list, and no docs written yet. There are a couple of failing test cases related to error handling. The error messages will likely need to be changed anyway, so I am going to fix these later on.
So, work is progressing, but there is still a lot to do.
comment:24 by , 11 years ago
I have done some investigation into how GIS works, particularly how GIS transforms select columns. It is obvious there is room for big cleanup in there (see gis compiler, and its overridden get_columns/get_default_columns). Unfortunately this means a lot of work, but it seems this is worth it. The get_*_columns() methods are very much non-DRY. It should be possible to make GIS a lot less dependent on private ORM APIs.
Another possible painpoint is that query.select and query.related_select_cols will get a totally new signature. If subclassers override these, it will be hard to guarantee any form of backwards compatibility. It is possible to keep backwards compat paths, too, but this needs to be done in multiple places. The end result will not look pretty good. It might be enough to add some way to override the SQL output for columns. I have thought of optional connection.col_as_sql() method which can be used for things like django-nonrel.
Still one point: the new way of generating SQL for select queries will be a bit slower than the current way, maybe in the range of 2-10% for common query types. I think this performance penalty is acceptable given what will be gained. Also, I think the new way will be a bit easier to optimize (for example, there is a possibility to cache both Col instances, and their output when tables aren't aliased).
comment:25 by , 11 years ago
The changes are cascading a lot.
Current status of the patch (https://github.com/akaariai/django/compare/custom_lookups) is that it not only rewrites how lookups are implemented. It also rewrites how SELECT clause generation works, and how values are converted from db values to Python equivalents. The last change simplifies a lot of GIS and .dates()/.datetimes() queries. There isn't a lot of work to do to remove GeoQuery and GeoCompiler. Even now gis needs a lot less hacks to work.
Actually I think a good goal is to make GIS as much as possible public-API. If that can be done, then it is a strong signal that things are working as expected.
I believe the changes could be really tough for those who subclass SQLCompiler liberally (think django-nonrel).
comment:26 by , 11 years ago
A hack-less gis is definitely a strong indicator that this is going in a good direction. I think there will be a relatively small number of projects subclassing SQLCompiler so I'm not massively worried if we break them, so long as the resulting code will be easier for projects like that to work with in the future. What's your instinct about the possibilities of nonrel with this new style?
comment:27 by , 11 years ago
I think django-nonrel should be possible. It might be harder to do (maybe even a lot harder). Currently you can overwrite compiler and wherenode implementations for nonrel and you are done. The new way requires also rewrite of lookups and columns. This should be possible with added hooks (there is currently a dummy hook for lookup rewriting, but no hook for column rewriting). In addition I believe you still need to overwrite wherenode and compiler.
It must be said that it is hard for me to see if nonrel is possible to write - I haven't done any work on nonrel...
I have once again force-updated the branch. The select clause handling should be a lot better now. A lot of select clause generation has changed - query.select, extra and aggregates are gone, they are all collapsed into custom_select OrderedDict. Unfortunately this way is also a bit slower than current master for some use cases. Interestingly enough the reason seems to be OrderedDict creation. Python's OrderedDict implementation is just slow... Otherwise, there is a lot of API cleanup (no rewriting, just cleanup) and then docs + more tests.
comment:28 by , 11 years ago
A lot more work done. Biggest change is that now there is support for .alias(somecol=SomeSQLProducingObject) and .annotate(somecol=SomeSQLProducingObject). These can be used with other queryset operations. Best way to see what this means is to look at custom_lookups/tests.py - there are some tests for ARRAY_AGG and case when ... then ... end structures. Another case that uses the .annotate() feature is dates() and datetimes() implementations.
comment:29 by , 11 years ago
Recent developments: GeoWhereNode, GeoSQLCompiler and GeoQuery are gone.
GeoWhereNode was converted to GeoLookup. GeoLookup could be improved a lot, but for now on just doing exactly what the old code seems good enough. In general the new custom lookup and custom annotations features could be used to make the implementation of GIS a lot better (from ability to nest lookups to plain bugfixing).
I went into some trouble to remove all usage of GeoSQLCompiler and GeoQuery. The compiler subclass resulted in lots of clutter in different DB backends, and there wasn't almost anything left in any case. The GeoQuery removal was mostly motivated by principle of "least private API usage".
I am now waiting for composite fields implementation to get in to master. It should make some parts of custom_lookups code simpler (mainly in Query.build_lookup()). After that the API still needs to be cleaned up and documented. Finally I need to check how 3rd party SQL backends or non-SQL backends can adapt to the code.
comment:30 by , 11 years ago
Anssi - I'd like to review this as and when you're ready, but I'm not particularly familiar with how things worked before. I feel it would be a good exercise for me getting to know the ORM a bit more, as well as hopefully a useful fresh set of eyes. It's not a small patch, so any guidance you can give as to where to start would be greatly appreciated.
comment:31 by , 11 years ago
Just take simple queries and see how they are built and executed by the ORM.
Try to pick one area to concentrate at a time. For the lookup refactor checking where clause entries generation and execution is naturally important. The basic path for generating a clause is qs.filter() -> _filter_or_exclude() -> query.add_q() -> query._add_q() -> query.build_filter() -> query.setup_joins(). At execution time the query.where clause is converted to SQL by WhereNode.as_sql() (and WhereNode.make_atom() as part of that).
Try to ignore HAVING clause generation at start. If you wonder why something is needed, change it and run queries tests - that test module alone covers most of ORM features. The broken tests should tell you why any particular part is needed.
comment:32 by , 11 years ago
Cc: | added |
---|
comment:33 by , 11 years ago
Cc: | added |
---|
comment:34 by , 11 years ago
I have done another implementation of custom lookups. I intentionally did *not* implement support for using nested lookups outside of filtering. That is .order_by('datefield__year')
isn't supported. The reason for this decision is that implementing that in one go leads to a huge patch (the previous try was nearing 2000 lines of ORM code changes).
I think the code portions of the patch are actually pretty much ready. The biggest things are adding deprecation warnings for old-style lookups, and ensuring Django doesn't use those anywhere. In addition admin needs some work so that custom lookups are usable in admin, too. There is some code messiness inside sql/query.py due to the support for both old-style and new-style lookups. I don't think that needs to be cleaned up before deprecation removal.
The current work is available from https://github.com/akaariai/django/compare/lookups_3. The docs additions contain examples of how to write custom lookups and extracts.
The API feels correct to me now. 3rd party backends can customize SQL as needed, and writing simple extracts is straightforward. Complex cases can be handled, too. The docs of course need a lot of rewriting. But I am hopeful that we can get this into Django soon. Please help by testing the branch!
comment:35 by , 11 years ago
I did a little more work on the API. I am confident it is general enough yet simple enough now.
I created a quick-and-dirty HStoreField implementation, I'll attach a ready to run project. Quick instructions: edit database settings to point to a db with hstore extension installed, syncdb and run python tester.py. The interesting bits are in testproject/hstore.py. I'll paste that portion in here:
from django.db.models import TextField, Lookup, Extract # This one implements "get value of key X from the hstore". The # value is extracted from the hstore, and so further lookups on # the value are possible. The extracted values are of type text # in the DB, so output_type = TextField(). class HStoreExtract(Extract): output_type = TextField() def as_sql(self, qn, connection): lhs, params = qn.compile(self.lhs) # extracts know the lookups used for creating them, # so init_lookups[0] gives us the 'somekey' in # hstorefield__somekey='foo' params.append(self.init_lookups[0]) return '%s -> %%s' % (lhs), params # This one implements "does hstore contain key X". This is a # a lookup on the hstore field itself. class HStoreContains(Lookup): lookup_name = 'hcontains' def as_sql(self, qn, connection): lhs, params = self.process_lhs(qn, connection) rhs, rhs_params = self.process_rhs(qn, connection) params.extend(rhs_params) return '%s ? %s' % (lhs, rhs), params # Main HStoreField class HStoreField(TextField): # Following two methods needed for createdb & save support. def db_type(self, connection): return 'hstore' def get_db_prep_save(self, value, connection): data = [] for k, v in value.items(): data.append('%s=>%s' % (k, v)) return ','.join(data) def get_lookup(self, lookup): found_lookup = None # It is possible to rename Django's own lookups so that conflicts # between common names (contains, exact for example) can be avoided. if lookup.startswith('dj_'): # If it started with 'dj_' use Django's own lookups found_lookup = super(HStoreField, self).get_lookup(lookup[3:]) # If it wasn't something existing, then interpret as extract key from # hstore if found_lookup is None: return HStoreExtract else: return found_lookup HStoreField.register_lookup(HStoreContains)
Using HStoreField it is possible to do queries like qs.filter(hstoredata__somekey__icontains='foobar')
.
The test project contains some examples which go a bit further.
Now, docs need rewriting and this one is RFC.
by , 11 years ago
Attachment: | lookup_hstore_test.tar.gz added |
---|
An example project with hstore lookups
comment:36 by , 11 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
New tests-patch. There was a small error in the old one