Right, the good news is that this isn't a regression from 7f65974f8219729c047fbbf8cd5cc9d80faefe77.
- The new example case also fails on v2.2.3 and co.
- The suggested regex change is in a part of the code that 7f65974f8219729c047fbbf8cd5cc9d80faefe77 didn't touch. (Which is why the new case fails, I suppose :)
I don't want to accept a tweaking of the regex here. Rather, we should move to using html5lib, as Florian suggests.
Possibly this would entail small changes in behaviour around edge cases, to be called out in release notes, but it would be a big win overall.
This has previously been discussed by the Security Team as the required way forward.
I've updated the title/description and will Accept accordingly.
I've attached an initial WIP patch by Florian of an html5lib implementation of the core _truncate_html() method.
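To illustrate the general idea of parser-based truncation (as opposed to regex matching), here is a minimal sketch. It is not Florian's patch and does not use html5lib: it swaps in the standard library's html.parser so that it runs without extra dependencies. The names truncate_html_sketch, _Truncator and VOID_ELEMENTS are hypothetical, and it deliberately ignores things like the trailing ellipsis and word-based truncation.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag (assumption: a short, common subset).
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class _Truncator(HTMLParser):
    """Copies markup through until `limit` text characters have been emitted."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.remaining = limit
        self.out = []
        self.open_tags = []  # stack of elements still awaiting a close tag
        self.done = limit <= 0

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.done or tag not in self.open_tags:
            return
        # Close everything up to and including the matching element.
        while self.open_tags:
            open_tag = self.open_tags.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        if self.done:
            return
        # Count decoded characters, then re-escape them for output.
        chunk = data[:self.remaining]
        self.remaining -= len(chunk)
        self.out.append(escape(chunk, quote=False))
        if self.remaining == 0:
            self.done = True


def truncate_html_sketch(html, limit):
    """Truncate to `limit` text characters, keeping tags balanced."""
    parser = _Truncator(limit)
    parser.feed(html)
    parser.close()
    # Close whatever was still open when the limit was reached.
    closing = "".join(f"</{tag}>" for tag in reversed(parser.open_tags))
    return "".join(parser.out) + closing


print(truncate_html_sketch("<p>Hello <b>brave new</b> world</p>", 9))
```

The point is only that a tokenizer tracks the open-element stack for us, so the truncated output stays balanced without any regex over the markup.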
An implementation of strip_tags() using bleach would go something like:
bleach.clean(text, tags=[], strip=True, strip_comments=True)
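For comparison, the same "let a parser do the work" approach can be sketched with only the standard library. This is an illustrative stand-in using html.parser, not bleach and not a proposed patch; strip_tags_sketch and _TextOnly are hypothetical names, and its edge-case behaviour will not necessarily match what bleach.clean() produces.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes; tags and comments are simply not copied."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_tags_sketch(text):
    """Return `text` with tags and comments removed."""
    parser = _TextOnly()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)


print(strip_tags_sketch("<p>Hello <b>world</b><!-- note --></p>"))
```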
Thomas, would you be willing/keen to take on changes like these? If so, I'm very happy to give input or assist in any way. :)