#36190 closed Cleanup/optimization (wontfix)
High memory usage of CommonPasswordValidator
Reported by: | Michel Le Bihan | Owned by: | Priyanshu Singh Panda |
---|---|---|---|
Component: | contrib.auth | Version: | dev |
Severity: | Normal | Keywords: | CommonPasswordValidator |
Cc: | Priyanshu Singh Panda | Triage Stage: | Accepted |
Has patch: | no | Needs documentation: | yes |
Needs tests: | no | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
Hello,
I noticed that CommonPasswordValidator has very high memory usage. Loading the piotrcki-wordlist-top10m.txt list, which has 9,872,702 entries, results in 755 MB of memory usage even though the list is only 87 MB on disk. Since CommonPasswordValidator basically only needs to check for membership, I think that using a variant of a Bloom filter would be much better than using a hash table (set()).
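For illustration, a minimal Bloom filter membership test of the kind being proposed might look like the sketch below; the blake2b-based hashing and the sizing defaults are assumptions for the example, not anything Django ships.

```python
# Minimal Bloom filter sketch (illustrative only, not Django code).
# Sizing uses the standard formulas m = -n*ln(p)/(ln 2)^2 and k = (m/n)*ln 2.
import hashlib
import math


class BloomFilter:
    def __init__(self, n_items, fp_rate=1e-4):
        self.m = max(1, int(-n_items * math.log(fp_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray(self.m // 8 + 1)

    def _positions(self, item):
        # Derive k positions from salted blake2b digests of the password.
        for i in range(self.k):
            digest = hashlib.blake2b(
                item.encode("utf-8"), salt=i.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest[:8], "little") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )
```

With those sizing formulas, ~10 million entries at a 10^-4 false-positive rate need roughly 23 MB of bit array, compared with the ~755 MB reported for the set().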
Attachments (2)
Change History (16)
comment:1 by , 7 days ago
Component: | Uncategorized → contrib.auth |
---|---|
Keywords: | CommonPasswordValidator added |
comment:2 by , 6 days ago
Owner: | set to |
---|---|
Status: | new → assigned |
Triage Stage: | Unreviewed → Accepted |
Type: | Uncategorized → Cleanup/optimization |
follow-up: 4 comment:3 by , 5 days ago
Resolution: | → needsinfo |
---|---|
Status: | assigned → closed |
Triage Stage: | Accepted → Unreviewed |
Michel can you please share steps to reproduce? This will help in investigating and resolving any issues.
comment:4 by , 4 days ago
Replying to Sarah Boyce:
Michel can you please share steps to reproduce? This will help in investigating and resolving any issues
I was able to reproduce this using the "piotrcki-wordlist-top10m.txt" file, which contains around 10 million commonly used passwords. CommonPasswordValidator loads the file and stores the passwords in a set. By default, Django's bundled list contains only about 20,000 passwords, which is much smaller than this file, so the larger list requires far more memory.
    try:
        with gzip.open(password_list_path, "rt", encoding="utf-8") as f:
            self.passwords = {x.strip() for x in f}
    except OSError:
        with open(password_list_path) as f:
            self.passwords = {x.strip() for x in f}
I'm currently working on addressing the memory issue caused by loading such a large file. Please assign this task to me so I can continue working on optimizing it.
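The ticket doesn't include the exact measurement script, but a rough reproduction along these lines should show the effect; the wordlist path and the sys.getsizeof-based size estimate are assumptions, not the reporter's exact method.

```python
# Rough reproduction sketch (wordlist path and size estimate are assumptions).
import sys
import time

start = time.perf_counter()
with open("piotrcki-wordlist-top10m.txt", encoding="utf-8", errors="ignore") as f:
    passwords = {line.strip() for line in f}
print("load seconds:", time.perf_counter() - start)
print("entries:", len(passwords))

# Approximate deep size: the set container plus each unique string object.
size = sys.getsizeof(passwords) + sum(sys.getsizeof(p) for p in passwords)
print("approx MB:", size / 1024 / 1024)
```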
follow-up: 6 comment:5 by , 4 days ago
Hello,
Thanks for looking at this. How are you planning to reduce memory usage?
comment:6 by , 4 days ago
Replying to Michel Le Bihan:
Hello,
Thanks for looking at this. How are you planning to reduce memory usage?
I plan to optimize memory and processing time by implementing a Bloom filter. This will reduce memory usage by storing only essential data and speed up lookups using a probabilistic approach. I have also validated the improvements using the @profile function, which tracks memory usage per line of code. After implementing the Bloom filter, both time and memory usage have been significantly reduced.
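The ticket doesn't name the profiling tool, but a per-line @profile decorator matches the memory_profiler package, so the validation presumably looked something like this (file name and path are placeholders):

```python
# Assuming @profile refers to the memory_profiler package
# (pip install memory-profiler); running this script prints per-line
# memory usage for the decorated function.
from memory_profiler import profile


@profile
def load_passwords(path="piotrcki-wordlist-top10m.txt"):
    with open(path, encoding="utf-8", errors="ignore") as f:
        return {line.strip() for line in f}


if __name__ == "__main__":
    load_passwords()
```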
follow-up: 8 comment:7 by , 4 days ago
How will you deal with false positives? Maybe using a sorted array of 64-bit hashes and doing a binary search on it would have a much lower FP ratio.
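A sketch of that sorted-hash idea, assuming blake2b truncated to 64 bits and an array('Q') index; the helper names are illustrative:

```python
# Sorted 64-bit-hash sketch: ~8 bytes per password, exact binary-search
# lookups, and false positives only on 64-bit hash collisions.
import hashlib
from array import array
from bisect import bisect_left


def hash64(password):
    return int.from_bytes(
        hashlib.blake2b(password.encode("utf-8"), digest_size=8).digest(), "little"
    )


def build_index(passwords):
    return array("Q", sorted(hash64(p) for p in passwords))


def probably_common(index, password):
    h = hash64(password)
    i = bisect_left(index, h)
    return i < len(index) and index[i] == h


index = build_index(["123456", "password", "qwerty"])
print(probably_common(index, "password"))   # True
print(probably_common(index, "S3cure!x9"))  # almost certainly False
```

At ~10 million entries this is about 75 MB of hashes, which lines up with the figure in the benchmark later in this ticket.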
comment:8 by , 4 days ago
Replying to Michel Le Bihan:
How will you deal with false positives? Maybe using a sorted array of 64-bit hashes and doing a binary search on it would have a much lower FP ratio.
Both approaches offer reasonable ways to handle false positives. I've already set a custom false positive rate for the Bloom filter, which provides a good balance between memory usage and performance. However, as you mentioned, a sorted array of 64-bit hashes with binary search could push the false positive rate even lower.
Would you recommend switching to this method or keeping the Bloom filter with the custom FP rate? I can also implement the sorted-hash approach if needed for further optimization.
Can you let me know how you'd like to proceed, and kindly assign this task to me.
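For reference, the standard Bloom filter sizing formulas make the trade-off concrete; the target false-positive rates below are just illustrative choices:

```python
# m = -n * ln(p) / (ln 2)^2 bits and k = (m / n) * ln 2 hash functions
# for n items at target false-positive rate p.
import math

n = 9_872_702
for p in (1e-3, 1e-4, 1e-6):
    m = -n * math.log(p) / math.log(2) ** 2
    k = m / n * math.log(2)
    print(f"p={p:g}: ~{m / 8 / 1024 / 1024:.1f} MB, k={round(k)} hashes")
```

That works out to roughly 17 MB, 23 MB and 34 MB respectively, compared with about 75 MB for a flat array of 64-bit hashes.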
comment:10 by , 4 days ago
Resolution: | needsinfo |
---|---|
Status: | closed → new |
comment:11 by , 4 days ago
Cc: | added |
---|---|
Needs documentation: | set |
Status: | new → assigned |
Triage Stage: | Unreviewed → Ready for checkin |
Version: | 5.1 → dev |
comment:12 by , 4 days ago
Triage Stage: | Ready for checkin → Accepted |
---|---|
comment:13 by , 4 days ago
Resolution: | → wontfix |
---|---|
Status: | assigned → closed |
There is often a trade-off between memory usage and performance.
For CommonPasswordValidator to compromise on accuracy via a Bloom filter would be backwards incompatible, and therefore not acceptable without a strong consensus from the community.
In Django, you can write your own password validators. In this case, a custom very large file is being used, so it might make sense to use a custom validator with a Bloom filter to deal with that very large file. Such a custom validator could be provided by a third-party package.
Anyone is welcome to discuss this further on the Django forum.
Note that any PR to improve the current state should not compromise on accuracy and needs to include performance and memory benchmarking.
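A custom validator of that kind could follow Django's documented validator interface (a class with validate() and get_help_text()); the sketch below is only an outline, and the BloomFilter import is a hypothetical helper such as the one sketched in the description, not an existing package:

```python
# Outline of a custom validator using Django's validator interface.
# BloomFilter is a hypothetical helper (e.g. the sketch in the description);
# no such package is implied to exist.
from django.core.exceptions import ValidationError
from django.utils.translation import gettext as _

from myproject.bloom import BloomFilter  # hypothetical module


class BloomCommonPasswordValidator:
    def __init__(self, password_list_path, expected_items=10_000_000, fp_rate=1e-4):
        self.filter = BloomFilter(expected_items, fp_rate)
        with open(password_list_path, encoding="utf-8", errors="ignore") as f:
            for line in f:
                self.filter.add(line.strip().lower())

    def validate(self, password, user=None):
        if password.strip().lower() in self.filter:
            raise ValidationError(
                _("This password is too common."),
                code="password_too_common",
            )

    def get_help_text(self):
        return _("Your password can't be a commonly used password.")
```

It would be wired up through AUTH_PASSWORD_VALIDATORS, with the constructor arguments passed via the OPTIONS dict.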
comment:14 by , 4 days ago
Here is a simple benchmark:
    (venv) michel@debian:/dev/shm$ python3 test.py
    Reading password list took: 3.177845547
    Checking if password is in list took: 2.2280000000485245e-06
    Password is in list: False
    Size of passwords in MB: 755.3031387329102
    (venv) michel@debian:/dev/shm$ python3 test2.py
    Reading password list took: 17.459581553
    Checking if password is in list took: 1.1341000000442136e-05
    Password is in list: False
    Size of passwords in MB: 75.32260131835938
As you can see, reading the password list takes much longer and is not really optimized. However, checking whether a password is in the list is basically instant, and the memory usage reduction is significant. As for false positives, I think they are extremely unlikely; 64 bits is really a lot.
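As a back-of-the-envelope check of the collision risk with 64-bit hashes:

```python
# Probability that a password NOT in the list matches one of the n stored
# 64-bit hashes is roughly n / 2**64.
n = 9_872_702
print(n / 2**64)  # ~5.4e-13 per lookup
```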