#11565 closed (wontfix)
Rate Limit Error Emails
Reported by: | Andrew Wilkinson | Owned by: | nobody |
---|---|---|---|
Component: | Core (Other) | Version: | dev |
Severity: | Keywords: | ||
Cc: | Simon Litchfield | Triage Stage: | Design decision needed |
Has patch: | yes | Needs documentation: | no |
Needs tests: | yes | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
We like the emails we receive when an exception occurs on the live site. However, if an error repeatedly occurs (e.g. the database goes down) we get spammed with a huge number of emails.
I'm attaching a patch which will only send an email once every settings.ERROR_EMAIL_RATE_LIMIT minutes for each error. It determines whether it has sent an email about a particular error by taking an MD5 of the traceback and storing it in whatever cache is configured. If the same MD5 already exists in the cache, then it doesn't send an email.
The big problem with this approach is what if your cache starts to produce errors? Certainly memcached is designed so that if the server goes away all keys appear not to be set, rather than an error. I'm not sure about how the database cache backend would work in this situation.
Any suggestions on how to improve the patch are appreciated. If it's ok then I'll produce a patch for the documentation too.
Cheers,
Andrew
Attachments (2)
Change History (10)
comment:1 by , 15 years ago
comment:2 by , 15 years ago
That's a fair point, I'm just wary about using an except block which catches all exceptions. I guess in this case it's the lesser of two evils. I'll produce a better patch for the code and docs today.
by , 15 years ago
Attachment: | error_emails_rate_limit.patch added |
---|
comment:3 by , 15 years ago
I've attached an improved patch including documentation. Tests will take me a bit longer to create as I don't think the current error emailing behavior is tested.
comment:4 by , 15 years ago
Triage Stage: | Unreviewed → Design decision needed |
---|
comment:5 by , 14 years ago
a year passed, design decision still needed.
this feature would make one nice django feature (error emails) actually useful for production without hacks. While the proposed solution is not perfect, a way to limit total error mail count to prevent mail overload on a busy site would be great.
comment:6 by , 14 years ago
Cc: | added |
---|---|
Needs tests: | set |
See discussion here
Python logging support- way to go. Arecibo- sounds cool.
Before it even gets there though, I need my errors rate limited. Why bother email, Python logging, Arecibo, or anything else with thousands+ of errors a second when every single one of those errors are identical.
I like Andrew's approach, so I decided to improve on it. Such that if cache is down, we use local memory (limited, of course). Updated patch attached.
by , 14 years ago
Attachment: | error_email_rate_limit_v2.diff added |
---|
Falls back to memory if cache isn't available
comment:7 by , 14 years ago
Resolution: | → wontfix |
---|---|
Status: | new → closed |
Is there any reason you couldn't do this within a custom logging handler, as suggested by Russell on that thread? Why do you need your errors rate limited at that level of the code, rather than at another level? Something is dealing with thousands of errors a second, and with your implementation it is forced to be in-process Python code, while with a custom logging handler even the duplicate detection could be passed off to another process. (or not, whatever you decide).
If we added this, we would also need to decide on the definition of 'duplicate' - you have decided that it is based solely on the traceback, which is fine. But I can see other possibilities - like wanting to know how many times the error is happening per user, for sites with fewer users for example.
To me it seems this is better done within a custom logging handler, so that it can be developed and improved independently of Django itself. That's what Russell clearly thought too, in the thread on django-developers, so I'm going to close WONTFIX.
comment:8 by , 14 years ago
Oops, sorry Luke. Somehow I failed to realise the logging stuff had hit trunk already- which, by the way, is superb!
Here's the rate limiter as a logging filter
Surely the answer to your question about the cache throwing errors is to use the cache very defensively? Just put broad try/catch around both calls to the cache, and if anything fails, ignore those errors and make sure you send the e-mail.