#1180 closed defect (fixed)
Django session key generation flawed
Reported by: | | Owned by: | Malcolm Tredinnick |
---|---|---|---|
Component: | contrib.sessions | Version: | dev |
Severity: | normal | Keywords: | |
Cc: | Triage Stage: | Accepted | |
Has patch: | yes | Needs documentation: | no |
Needs tests: | yes | Patch needs improvement: | no |
Easy pickings: | no | UI/UX: | no |
Description
session_key = md5.new(str(random.randint(0, sys.maxint - 1)) + SECRET_KEY).hexdigest()
This is used throughout Django, and it often generates duplicate keys. Today I spent 5 hours trying to find out what was causing my site to break, and it was this (since I used this algorithm in another context without checking whether a session with the given key already exists).
I propose the following:
session_key = md5.new( str(random.randint(0, sys.maxint - 1)) + "#" + str(random.randint(0, sys.maxint - 1)) + "#" + str(time.time()) + "#").hexdigest()
secret_key is pretty useless here.
Regards
Attachments (4)
Change History (46)
comment:1 by , 19 years ago
secret_key isn't useless - it's needed, because randoms aren't truly random and so might be predictable. But I agree that adding more variable data to the material we build digests from would be in order to prevent duplicates. I would opt for adding time.ctime() to the mix instead of yet another non-random random number, as that would be much better at preventing duplicates.
comment:2 by , 19 years ago
Whatever the solution, hugo, it'd be better than what is there currently. For some reason the generator was producing duplicate keys every couple of seconds (I have 30,000 people online on my site, so they generate a lot of 'logins') with the SECRET_KEY plus one random number of the original cookie generator.
And yeah, SECRET_KEY can perhaps help, but I don't really think anyone would predict two pseudo-random numbers and the current time including milliseconds.
comment:3 by , 19 years ago
Hmm. Actually the get_new_session_key function checks the generated session key for uniqueness before returning. OK, it doesn't register the new session key with the database right away, so another thread/process _can_ produce the same key, if the keys aren't actually used for storing stuff in the database (and since the system won't know what keys it handed out, it will happily mix sessions from different users). This gets worse over time, of course.
After checking the session middleware source (that's the only place where get_new_session_key is used), I still wonder how that can actually lead to problems - the session_key is only generated when the session is sent out, and in those cases it will be stored in the database, and so shouldn't ever be regenerated (as get_new_session_key explicitly checks for that).
Anyway, I wrote a short test script:
{{{
from django.models.core.sessions import get_new_session_key

h = {}
for i in range(0, 999999):
    k = get_new_session_key()
    if h.has_key(k):
        print 'duplicate', k
    h[k] = True
    if i % 1000 == 0:
        print 'generated', i
}}}
After around 200,000 generated keys, the system starts to produce more and more duplicates. This is to be expected, since the random number generator isn't really a good thing to solely base those session keys on. And since the test script doesn't register new session keys, it will happily produce duplicates.
After changing the algorithm to the following:
session_key = md5.new(str(random.randint(0, sys.maxint - 1)) + str(random.randint(0, sys.maxint - 1)) + SECRET_KEY).hexdigest()
The system ran fine without duplicates up to far higher numbers - I stopped at one million (and the hash took 0.5 GB by then ;-) ). So it looks much better, and I think it's OK to just use the double random number (I actually would have expected much worse behaviour from the random number generator - seems they did some good work on that one).
Of course, if there are ever session_keys reused in some application without actually storing them in the database, you still will get duplicates after some time - but it's at least much better than with the original algorithm.
I am still curious why your site produces session keys that aren't registered with the database, though. Do you do anything special with session keys outside the standard stuff? It might be that different processes/threads produce the same random number, of course. But that doesn't really sound likely (though of course, as always, how stuff sounds isn't necessarily how stuff is ...)
comment:4 by , 19 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
comment:5 by , 19 years ago
Please reopen if you still have the duplication problem - we may eventually have to throw some more data into the mix to prevent duplicates (time and pid are two that come to mind). It's hard to reproduce a highly parallel production environment in test settings, so it's better to get direct feedback from the production system.
One thing I did find about the random number generator in Python: it isn't thread safe; threads might produce duplicate random numbers. Could that be the case on your system? Are you running a multi-threaded environment or a multi-process one?
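For illustration only -- a minimal sketch (not from this ticket) of one way to sidestep shared-generator state in a threaded server: give each thread its own lazily created, independently seeded Random instance. The seeding inputs here are just plausible examples.
{{{
import os, random, threading, time

_local = threading.local()

def thread_safe_randint(a, b):
    # Lazily create one Random instance per thread so no state is shared.
    rng = getattr(_local, 'rng', None)
    if rng is None:
        # Seed from per-thread/per-process data so threads diverge even
        # if they start at the same moment.
        seed = '%s-%s-%s' % (os.getpid(),
                             threading.currentThread().getName(),
                             time.time())
        rng = random.Random(seed)
        _local.rng = rng
    return rng.randint(a, b)
}}}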
comment:6 by , 19 years ago
Usually a timestamp is part of the input data (at least in all session key generators I have seen so far). Random generators can't be truly random, only pseudo-random, unless they measure some hardware parameter like fan speed or, better, use a noise generator. I suggest adding time.ctime() to the input as well.
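A sketch of what this comment suggests, assuming the era's md5 module and a SECRET_KEY stand-in for the Django setting -- the only change to the existing generator is mixing time.ctime() into the digest input:
{{{
import md5, random, sys, time

SECRET_KEY = 'stand-in for settings.SECRET_KEY'

def get_session_key():
    # Random number + wall-clock time + site secret, hashed together.
    return md5.new(str(random.randint(0, sys.maxint - 1)) +
                   time.ctime() + SECRET_KEY).hexdigest()
}}}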
comment:7 by , 19 years ago
As written, I have run tests to see what the duplicate rate of that is. Actually, the random generator in Python has a period of over 6 billion (according to the docs), and we only take a rather small part of that (up to sys.maxint, which is usually 2**31 - 1 on 32-bit systems). That was the problem with the former approach: random numbers in the range 0..maxint just repeated too often. Now a pair of random numbers in the range 0..maxint (out of a random stream with a period of over 6 billion) is needed to produce a duplicate. time.ctime() (or time.time()) can be added if we really still see duplication happening. And that's best discovered on wojtek's site, since I don't know of any other site with such heavy traffic :-)
comment:8 by , 19 years ago
My concern is more about session hijacking than about the randomness of the session key. Pseudo-random generators need some sort of seed, which they change after every returned value. If the seed is not stored when the machine reboots, or the generator is recreated, the returned random numbers are the same every run. I haven't looked at which seed the random module uses, but adding a timestamp makes such attacks much, much harder -- combined with a private site key, nearly impossible.
comment:9 by , 19 years ago
You can't guess a session key when part of the key generation process is a secret you don't know. We don't use only random numbers; we use random numbers plus the secret key. And the latter isn't known to the end user, so he can't guess the next key, even if he succeeds in finding the exact point in time of a random number generator with more than 6 billion steps in one round ...
Actually, adding the time isn't really much more secure than going without it - although the Python library returns milliseconds, depending on the system it might actually just fake those and in reality return times with a much coarser stepping. And in those cases, if the server is synchronized to some outside time server, it might even be feasible to guess the time :-) (although admittedly it's near impossible to guess the time on systems with high-resolution timers).
That's not to say that it isn't a good idea to add the time - whether or not they add it is up to the core devs. I only (hopefully) fixed the "duplication happens too soon" problem. How they rate the likelihood of guessing the next session key with the current algorithm is up to adrian and jacob - I applied as little change as possible to fix the given problem. I am usually a bit reluctant to add more if the problem is already fixed. It's not _my_ source ;-)
comment:10 by , 19 years ago
I know that there is a secret key. That's very useful. The idea which popped into my mind was that you could use a script to collect some session keys, for example. When the server restarts, the chances that the keys you saved before might be generated again could be higher than random guesses, which are nearly impossible due to the length of the md5. Adding a timestamp fixes this, because the timestamp is always different. It's not that you could recalculate the key; md5 is a crypto hash, which makes it impossible to guess what went in.
comment:11 by , 19 years ago
This discussion really should take place on the list, not on the ticket. (yes, I am guilty of that, too).
comment:12 by , 19 years ago
You may have a look at the GUID recipe in Python Cookbook for solving this issue:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/163604
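Recipes of that kind generally hash together the host address, the current time and a random draw; a simplified sketch of that shape (not the verbatim recipe from the link):
{{{
import md5, random, socket, time

def make_guid():
    # Host identity + time + randomness, so two machines generating
    # keys simultaneously still diverge.
    try:
        host = socket.gethostbyname(socket.gethostname())
    except socket.error:
        host = str(random.random())
    data = '%s %s %s' % (time.time(), random.random(), host)
    return md5.new(data).hexdigest()
}}}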
comment:13 by , 19 years ago
I tested the GUID recipe with md5 and the secret key on 2 computers, generating 10 million keys each at the same time. There have not been any duplicates.
So I think this piece of code in django/models/core.py:
{{{
while 1:
    session_key = md5.new(str(random.randint(0, sys.maxint - 1)) + SECRET_KEY).hexdigest()
    try:
        get_object(session_key__exact=session_key)
    except SessionDoesNotExist:
        break
return session_key
}}}
may be replaced with:
{{{
session_key = md5.new(GUID.generate() + SECRET_KEY).hexdigest()
}}}
Benefit: unique without the need to access the database, and really fast (more than 20,000 session keys per second).
Drawback: Additional third party dependency.
comment:15 by , 18 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
Guys
I've got multiple reports from users that they were able, by some chance, to get into another user's session, so I think the problem still exists -- and I can explain why: it exists 'by design'.
I've released a 'newsession' module which is compatible with current sessions and has no such problem.
Gonna put a patch here.
comment:17 by , 18 years ago
Triage Stage: | Unreviewed → Accepted |
---|
#3716 spawned some fairly intensive discussion
Here's a recap -
ubernostrum says the problem appears to be this:
1. User A visits the site and server process A generates a session key, checks the database to see whether that key is in use, finds that it is not, and assigns that key to User A.
2. At or around the same time, User B visits the site and server process B generates the same session key (probably because both processes started out with the same seed for Python's RNG). Because User A's session has not yet been serialized and stored, server process B also finds that the key is not in use, and assigns that key to User B.
3. At some later point, both sessions are serialized and stored; the first one executes an INSERT -- because Django does not distinguish INSERT from UPDATE for purposes of a model's 'save' method -- and the second executes an UPDATE on that row. Assume for the moment that User B's session gets the INSERT and User A's session gets the UPDATE.
4. When User B comes back to the site, he finds that the Django session framework thinks he is User A.
There are several possible approaches here:
1. Manually seeding the RNG with a value unique to the server process, or to the request, would move the probability of key collision closer to what it "should" be, thus dramatically reducing the problem.
2. Changing the session framework's behavior so that it stores the session in the database as soon as it finds an unused key (e.g., by looping with 'get_or_create' until it gets a new object) would drastically reduce the odds of two independent processes with the same key each thinking that their key is unused (see the sketch after this list).
3. Changing Django's ORM to separate INSERT from UPDATE would make it impossible for two processes to each execute a 'save' on sessions with the same key and think that they each created a new row.
Options 1 and 2 are relatively easy and could be implemented quickly. Option 3 is attractive, but would be harder to do.
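A rough sketch of option 2 against the era's Session model (field names as in contrib.sessions; the key recipe itself is illustrative): loop until get_or_create actually creates a row, so the key is claimed in the database the moment it is chosen.
{{{
import md5, random, sys, time
from django.conf import settings
from django.contrib.sessions.models import Session

def new_session(session_data, expire_date):
    while 1:
        key = md5.new('%s%s%s' % (random.randint(0, sys.maxint - 1),
                                  time.time(),
                                  settings.SECRET_KEY)).hexdigest()
        session, created = Session.objects.get_or_create(
            session_key=key,
            defaults={'session_data': session_data,
                      'expire_date': expire_date})
        if created:
            # The row now exists, so a concurrent process generating the
            # same key will see it as taken and loop again.
            return session
}}}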
Benjamin Slavin suggests:
Do we really need to re-seed the RNG? What if we did something like:
{{{
session_key = md5.new("%x%s%s" % (random.getrandbits(64),
                                  datetime.datetime.utcnow(),
                                  settings.SECRET_KEY)).hexdigest()
}}}
datetime.datetime.utcnow could be replaced with whatever would be used for the seed. You could include os.getpid if you desire.
The general consensus is that this should be addressed after 0.96 is released, and the patch in #3716 from Anton Khalikov is a good starting point.
follow-up: 19 comment:18 by , 18 years ago
comment:19 by , 18 years ago
Replying to mtredinnick:
(In [4771]) Reduced the chances of session object collision. The window of opportunity is
now about five Python instructions in get_or_create(). This doesn't guarantee
no collisions, but should fix many occurrences. Refs #1180.
I suggest a few more changes:
models.py (code for the save function in Session Manager)
{{{
def save(self, session_key, session_dict, expire_date):
    "Prior to a call to this function, the session key will be created/stored"
    try:
        s = self.get(session_key=session_key)
    except ObjectDoesNotExist:
        raise  # Caller should handle the exception
    if session_dict:
        s.session_data = self.encode(session_dict)
        s.expire_date = expire_date
        s.save()
    else:
        s.delete()  # Clear sessions with no data.
        s = None
    return s
}}}
models.py (session_key is no longer a primary key)
session_key = models.CharField(_('session key'), maxlength=40, unique=True)
and finally in middleware.py (added an error check in process_response)
{{{
if new_session is not None:
    response.set_cookie(settings.SESSION_COOKIE_NAME, session_key,
                        max_age=max_age, expires=expires,
                        domain=settings.SESSION_COOKIE_DOMAIN)
}}}
This would permanently avoid collisions. It does, however, introduce the problem of a ProgrammingError exception in case of a collision. For that, there seems to be no better solution than to lock the session table prior to the update.
I have also refined the save method in Session Manager and added an error check in the middleware.
comment:20 by , 17 years ago
We recently ran into a problem when using Django over FastCGI. It appears that Python's own random number generator is initialized with a random seed just before the processes are forked. This way all FastCGI processes get the "same" random number generator. The probability of duplicates rose so high that we had to additionally add the PID value to the session key generation salt.
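That workaround amounts to something like the following sketch (SECRET_KEY stands in for the Django setting): the PID differs between forked workers even when the inherited RNG state does not.
{{{
import md5, os, random, sys, time

SECRET_KEY = 'stand-in for settings.SECRET_KEY'

def get_session_key():
    # os.getpid() salts the hash differently in every forked worker.
    return md5.new('%s%s%s%s' % (random.randint(0, sys.maxint - 1),
                                 os.getpid(), time.time(),
                                 SECRET_KEY)).hexdigest()
}}}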
follow-up: 22 comment:21 by , 17 years ago
Does the session_key field necessarily need to be a PK? If we change the Session model to have a plain auto-incremented integer primary key and use unique=True for the session_key field, collisions can be avoided completely, since there will be no chance to INSERT two different sessions with equal session keys.
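The proposed model change, sketched with the era's field keywords (maxlength, as in the patch above); with no primary_key=True, Django adds an auto-incremented id column and the key is merely unique:
{{{
from django.db import models

class Session(models.Model):
    # Unique but not primary: a duplicate key now fails the INSERT
    # instead of silently turning into an UPDATE.
    session_key = models.CharField(maxlength=40, unique=True)
    session_data = models.TextField()
    expire_date = models.DateTimeField()
}}}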
comment:22 by , 17 years ago
Replying to Densetsu no Ero-sennin <densetsu.no.ero.sennin@gmail.com>:
Does the session_key field necessarily need to be a PK? If we change the Session model to have a plain auto-incremented integer primary key and use unique=True for the session_key field, collisions can be avoided completely, since there will be no chance to INSERT two different sessions with equal session keys.
Actually there is, because some databases (SQLite) can re-use primary key values if rows have been deleted from the table, which means it's possible for two users to get the same "incremented" primary key value on their sessions.
comment:23 by , 17 years ago
it's possible for two users to get the same "incremented" primary key value on their sessions
Can't agree with that. Primary keys are unique by definition, even if they are reused somehow.
But anyway, that does not really matter for us. We just need to impose a UNIQUE constraint on session_key, which would guarantee that all sessions in the DB have different keys. Unfortunately, Django tries to be smart and treats primary key fields specially. So when we save() a new session with a newly created key and another session with this key already exists, Django happily runs UPDATE instead of simply trying to INSERT and failing. What we need is a way to tell Django not to UPDATE here (save(update=False) or something like that). If that's not possible or difficult to implement, we can just create a separate PK field and add a unique=True constraint to the session_key field. Any solution will do.
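Sketching the INSERT-and-fail idea: with the unique constraint in place, a duplicate key surfaces as an IntegrityError and the caller retries with a fresh key. save(force_insert=True) did not exist yet when this comment was written (the force-insert behaviour arrived later; see comment:36), so this is an illustration of the idea, and generate_key/make_session are hypothetical helpers.
{{{
from django.db import IntegrityError

def create_session(generate_key, make_session):
    # generate_key() -> candidate key string;
    # make_session(key) -> unsaved Session instance.
    while 1:
        session = make_session(generate_key())
        try:
            session.save(force_insert=True)  # INSERT or raise, never UPDATE
            return session
        except IntegrityError:
            continue  # key already taken; try another one
}}}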
comment:24 by , 17 years ago
Has patch: | set |
---|---|
Version: | → SVN |
Not sure if this is still an issue, but since it is still open I figured I would throw my hat in the ring.
I am using uuid, which follows http://www.faqs.org/rfcs/rfc4122.html, to generate the session_key.
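A sketch of the uuid-based generation (the stdlib uuid module ships with Python 2.5; the attached file backports it for 2.3/2.4). uuid4 is a random RFC 4122 UUID, drawn from os.urandom where available; SECRET_KEY stands in for the setting.
{{{
import md5, uuid

SECRET_KEY = 'stand-in for settings.SECRET_KEY'

def get_session_key():
    # Hash the UUID with the secret so the key keeps the familiar
    # 32-character hex format.
    return md5.new(uuid.uuid4().hex + SECRET_KEY).hexdigest()
}}}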
by , 17 years ago
Attachment: | sessions.uuid.patch added |
---|
by , 17 years ago
The uuid file is a fallback for Python 2.3 & 2.4. I put it in django/utils.
comment:25 by , 16 years ago
Needs tests: | set |
---|
Just checking in on this since it is still in reopened status, and what's in SVN is still an inferior way of creating unique session IDs (I still get about 10% duplication with SVN).
comment:26 by , 16 years ago
The main reason why people get duplicate keys is sys.maxint and the birthday paradox. On 32-bit systems, sys.maxint == (2 << 30) - 1 == 2147483647.
Due to the paradox, sqrt(n) is roughly the number of picks you need for a 50% chance of a collision when picking items uniformly at random from an inexhaustible set of n items.
Thus random.randint(0, sys.maxint - 1), which translates to random.randint(0, 2147483646) on 32-bit systems, will have a 50% chance of a collision once sqrt(2147483646) =~ 46341 random values have been picked. Testing proves that collisions really happen around that bound:
threadtest.py:
{{{
from __future__ import with_statement
import sys, threading, time, random, md5

cache = {}

class RandThread(threading.Thread):
    lock = threading.RLock()
    def run(self):
        while 1:
            rand = random.randint(0, 2147483646)  # 32-bit
            with self.lock:
                if rand in cache:
                    print 'collision, cache size: %d' % len(cache)
                    sys.exit(0)
                else:
                    cache[rand] = True

for i in xrange(100):
    t = RandThread()
    t.run()
}}}
{{{
$ python threadtest.py
collision, cache size: 63059
$ python threadtest.py
collision, cache size: 28648
$ python threadtest.py
collision, cache size: 42447
$ python threadtest.py
collision, cache size: 88932
$ python threadtest.py
collision, cache size: 42349
}}}
No heavy-duty UUID machinery is required; we just need to say explicitly how many keys we want with low collision probability, i.e. how many random bits we want. 63 bits (sys.maxint on 64-bit systems) is quite enough in my opinion (I'm certain that those who reported this issue were using 32-bit Python), as collision probability only becomes significant at around 3,037,000,499 keys:
{{{
>>> sys.maxint
9223372036854775807
>>> math.sqrt(sys.maxint - 1)
3037000499.9760499
>>> sys.maxint == (2 << 62) - 1
True
}}}
In conclusion, I propose we update the key generation algorithm as follows:
{{{
- session_key = md5.new("%s%s%s%s" % (random.randint(0, sys.maxint - 1),
-     pid, time.time(), settings.SECRET_KEY)).hexdigest()
+ session_key = md5.new("%s%s%s%s" %
+     (random.randint(0, (2 << 62) - 2), pid,  # always use 63 bits of randomness
+      time.time(), settings.SECRET_KEY)).hexdigest()
}}}
---
Another minor problem is seeding the random number generator. Since Python 2.4, it is seeded with entropy from the operating system if available (urandom). This is all well and good. Python 2.3, however, doesn't use OS entropy, and thus the collision probability between two simultaneously created processes increases. But we should address this only if people get hit by it and report it.
comment:27 by , 16 years ago
Component: | Admin interface → django.contrib.sessions |
---|---|
milestone: | → 1.0 |
This is in scope for 1.0. Will upload a patch shortly.
comment:28 by , 16 years ago
"Python 2.3 however doesn't use OS entropy and thus collision probability between two simultaneously created processes increases" -- as pid is used, this shouldn't be a problem.
comment:29 by , 16 years ago
mrts, we're all well aware of the birthday paradox, but I wonder whether your tests are really indicating the problem you think they are. What you need to test is not the probability that, after generating X keys, an already-seen key will be regenerated; rather, you need to test the probability that the same key will be generated twice within an extremely small window of time (roughly speaking, the amount of time needed to execute the first "except" block in django.db.models.query.get_or_create).
So... is anyone actually seeing collisions happen with visible effects on a live site?
comment:30 by , 16 years ago
ubernostrum, Digitalxero@… reports that he gets duplicates. Digitalxero@…, can you please supply the following information: platform (Windows, Linux, ...), architecture (32 or 64 bits), and how you detect the duplicates.
Anyhow, I don't see any reason *not* to use 63 bits of randomness regardless of platform to minimize collision probability. The overhead of returning a long instead of a platform int is minimal.
Should the status be set to design decision needed?
I for one don't get duplicates with a reasonably large number of users -- but I'm running on 64 bits.
comment:31 by , 16 years ago
Yes, with the system as it is in SVN I am seeing collisions about 10% of the time on a live site; with the UUID method I see no collisions. I have not had a chance to test use_63bit_random on my site yet, but in some personal testing I saw no collisions, and it is about twice as fast as the UUID method and only about 15 ms slower than the current method, so it is a better implementation.
comment:33 by , 16 years ago
Digitalxero, can you please describe your setup (platform, architecture) and how the duplicates manifest? Otherwise the issue won't be fixed. See also http://groups.google.com/group/django-developers/browse_thread/thread/8cb4edee0db52197
comment:34 by , 16 years ago
As one of the people experiencing issues with the session collisions I will attempt to explain how it manifests and my setup.
My server is a shared host, running Apache and Python 2.5, and Django is configured through FastCGI (my host won't let me use mod_python). The way it seems to work is that a new thread is spawned on each page request and terminated when the request is finished.
The issue manifests when the sessions system generates a duplicate session id for two different users. Only one entry is entered into the DB, so both users are using the same session id, and thus can (and do) have incorrect permissions and data. I am detecting the collisions by setting a cookie value with the username they logged in with and checking it against the username in the session data. With the SVN implementation of session generation I get about 10% of my users getting an incorrect session when they log in; with the latest patch I have gotten 0 collisions.
The issue may actually be with the get_or_create() method of blocking collisions, since if it generates a duplicate id it just grabs that data from the table and runs with it (I think), but whatever the actual cause, the latest patch prevents it on my setup.
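The cross-check described above, sketched as a per-request guard (the cookie name and the wiring into a view are hypothetical):
{{{
from django.contrib.auth import logout

def check_session_integrity(request):
    # 'login_name' is a hypothetical cookie set at login time.
    cookie_user = request.COOKIES.get('login_name')
    session_user = request.session.get('login_name')
    if cookie_user is not None and cookie_user != session_user:
        # The session evidently belongs to someone else: drop it
        # rather than serve another user's data.
        logout(request)
}}}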
by , 16 years ago
Attachment: | ticket_1180__rev_8168-getrandbits.diff added |
---|
Uses getrandbits to get random bits, rather than hacking it with random.randint
comment:35 by , 16 years ago
getrandbits is available since Python 2.4; randint is required for 2.3 compatibility. See http://docs.python.org/lib/module-random.html
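The compatibility concern is easy to handle with a feature check; a sketch that prefers getrandbits and falls back to randint on 2.3:
{{{
import random, sys

if hasattr(random, 'getrandbits'):      # Python 2.4+
    def rand63():
        return random.getrandbits(63)
else:                                   # Python 2.3 fallback
    def rand63():
        # Glue two platform-int draws together; roughly 62 bits of
        # randomness on a 32-bit build.
        return (random.randint(0, sys.maxint - 1) << 31) | \
               random.randint(0, sys.maxint - 1)
}}}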
comment:36 by , 16 years ago
Note to whoever is following this: r8267 brought in the force update/insert behaviour. I'm planning to update the patch accordingly when I get time.
comment:37 by , 16 years ago
@mrts: Don't bother updating the patch. I'm fixing a whole bunch of session-related problems in one go and it's just going to be duplicated work. That will be committed later on today, most likely.
comment:38 by , 16 years ago
Owner: | changed from | to
---|---|
Status: | reopened → new |
comment:39 by , 16 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Fixed in [8340] (I wrote the wrong ticket number in the commit message).
comment:40 by , 16 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
Malcolm, there's a minor glitch in [8340]: you assign to randint in source:django/trunk/django/contrib/sessions/backends/base.py#L18 but use randrange in source:django/trunk/django/contrib/sessions/backends/base.py#L138.
So either use randint in line 138, or assign randrange = random.SystemRandom().randrange.
comment:41 by , 16 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
*sigh* That's an entirely new bug, not part of this ticket. The main issue in this ticket has been fixed.
I've opened #8310 for the implementation error. Otherwise the history of the changes gets mixed up.