Django's contribution data
Inspired by David Eaves, here's some information on accessing Django's contribution data.
Please take it, mash it up, and show us the results!
If there's other data you'd like to see, please get in touch (jacob -at- jacobian.org) and let me know. I'll do my best!
Trac's database
Data dumps out of Trac, our ticket tracking software. You could use this to get information about our ticket workflow, patches, etc.
There are two ways to access the data: Trac's RPC interface and the daily data dumps.
Daily data dumps
These are direct data dumps of the Trac database, collected nightly, in various formats. They're sanitized to remove some tables with sensitive info (session data, etc.) but are otherwise complete.
Dumps are currently available in the following formats:
- CSV (tar'd & bzipped directory; one CSV file per table; ~35MB).
The database schema is documented at http://trac.edgewall.org/wiki/TracDev/DatabaseSchema. The most interesting tables are probably the ticket and ticket_change tables. ticket_change, in particular, contains each change ever made to a ticket and so probably has some of the most interesting data available.
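As a starting point for exploring ticket_change, here is a minimal sketch that counts how often each ticket field was changed, reading straight from the compressed archive. It rests on several assumptions that are not confirmed above: that the archive contains a member named ticket_change.csv, that each CSV file starts with a header row using the column names from the Trac schema (in particular a column named field), and that the files are UTF-8 encoded. The function name and the archive path are hypothetical.

```python
import csv
import io
import posixpath
import tarfile
from collections import Counter


def count_changes_by_field(archive_path, member_name="ticket_change.csv"):
    """Count ticket_change rows per changed field in a tar.bz2 dump.

    Assumes (not confirmed by the dump documentation) that the CSV has a
    header row and a column called "field", as in the Trac schema.
    """
    counts = Counter()
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        for member in archive:
            # Match on the base name so any top-level directory is ignored.
            if not member.isfile():
                continue
            if posixpath.basename(member.name) != member_name:
                continue
            raw = archive.extractfile(member)
            # newline="" lets the csv module handle embedded line breaks.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for row in csv.DictReader(text):
                counts[row["field"]] += 1
    return counts
```

Calling count_changes_by_field on the downloaded archive and then most_common(10) on the result would list the ten most frequently changed fields, if the assumptions about the file layout hold.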
Trac's RPC interface
Trac has an XML-RPC and a JSON-RPC interface. You can view some documentation of these APIs at:
https://code.djangoproject.com/xmlrpc
Note
You'll need to be logged in to access this page and to access the data. If you need to create an account, the sign-up page is at https://www.djangoproject.com/accounts/register/.
The base URL you'll use for the XML-RPC and JSON-RPC APIs is:
https://{username}:{password}@code.djangoproject.com/login/rpc
The easiest way to access these APIs is with Python's xmlrpclib library. Here's a quick example:
    >>> import xmlrpclib
    >>> rpc_url = "https://USERNAME:PASSWORD@code.djangoproject.com/login/rpc"
    >>> trac = xmlrpclib.ServerProxy(rpc_url)

    # Get a single ticket's info.
    >>> ticket, time_created, time_changed, attributes = trac.ticket.get(1337)
    >>> attributes['resolution']
    'wontfix'

    # Perform a search: counts the open (i.e. not-closed) tickets.
    # Query syntax is documented at http://trac.edgewall.org/wiki/TracQuery#QueryLanguage
    >>> not_closed = trac.ticket.query('status=!closed&max=5000')
    >>> len(not_closed)
    1850
Please be careful here. There are APIs that write data and using them could look like spam, so please ask me (jacob -at- jacobian.org) for permission first!
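On Python 3 the xmlrpclib module is named xmlrpc.client. The sketch below wraps the same read-only query in small helpers; the helper names are hypothetical. It percent-encodes the username and password before placing them in the URL, on the assumption that credentials may contain characters such as "@" or ":" that would otherwise break URL parsing. Nothing here touches the network until count_open_tickets is called.

```python
import xmlrpc.client
from urllib.parse import quote

# Base URL pattern taken from the text above.
RPC_TEMPLATE = "https://{username}:{password}@code.djangoproject.com/login/rpc"


def build_rpc_url(username, password):
    """Return the RPC URL with the credentials percent-encoded."""
    return RPC_TEMPLATE.format(
        username=quote(username, safe=""),
        password=quote(password, safe=""),
    )


def connect(username, password):
    """Create an XML-RPC proxy; no request is sent until a method is called."""
    return xmlrpc.client.ServerProxy(build_rpc_url(username, password))


def count_open_tickets(trac):
    """Read-only query: number of tickets whose status is not 'closed'."""
    return len(trac.ticket.query("status=!closed&max=5000"))
```

A typical session would be trac = connect("USERNAME", "PASSWORD") followed by count_open_tickets(trac). Only ticket.get and ticket.query, both shown earlier, are used, so the sketch stays on the read-only side of the API.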
Repository data/dumps
Data and dumps from our source control repository. You could use this to mine information about who's committing code, when, etc.
There are a few ways of accessing this data: querying the SVN repo, using the GitHub API, and downloading SVN data dumps in a variety of formats.
Querying the SVN repo
Django's SVN repository is at http://code.djangoproject.com/svn/django/; you can use the svn client binary to interact with this as a sort of "API". In particular, most svn commands take a --xml argument to return data in XML. For example, to get information about a particular commit you might do something like:
    $ svn log http://code.djangoproject.com/svn/django/trunk -r1234 --xml
    <?xml version="1.0"?>
    <log>
    <logentry
       revision="1234">
    <author>jacob</author>
    <date>2005-11-14T18:50:13.298556Z</date>
    <msg>Added NOINDEX tag to debug 500 page (for robots)</msg>
    </logentry>
    </log>
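If you'd rather consume that XML from a script, one option is to run the svn client as a subprocess and parse its output with Python's standard library. This is a rough sketch, not the only way to do it: the function names are hypothetical, and it assumes the svn binary is on your PATH and that the output has the log/logentry shape shown in the sample above.

```python
import subprocess
import xml.etree.ElementTree as ET

TRUNK_URL = "http://code.djangoproject.com/svn/django/trunk"


def fetch_svn_log_xml(revision, url=TRUNK_URL):
    """Run `svn log --xml` for one revision and return the raw XML bytes."""
    result = subprocess.run(
        ["svn", "log", url, "-r", str(revision), "--xml"],
        check=True,
        capture_output=True,
    )
    return result.stdout


def parse_svn_log(xml_data):
    """Turn `svn log --xml` output into a list of plain dictionaries."""
    entries = []
    root = ET.fromstring(xml_data)
    for entry in root.iter("logentry"):
        entries.append(
            {
                "revision": int(entry.get("revision")),
                # author can be missing for some commits, hence findtext.
                "author": entry.findtext("author"),
                "date": entry.findtext("date"),
                "message": entry.findtext("msg", default=""),
            }
        )
    return entries
```

parse_svn_log(fetch_svn_log_xml(1234)) would then return a one-element list describing the commit from the sample output.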
There are also a number of libraries in Python (and other languages) that can access SVN directly. pysvn seems to be a popular choice.
The GitHub API
Django's repository is mirrored onto GitHub (http://github.com/django/django), which means you can use GitHub's API to pull commit data. For example:
    $ curl -i https://api.github.com/repos/django/django/git/commits/a0d59b49019d65b38c5612eb0b4fab0bb37271ae
    HTTP/1.1 200 OK
    Server: nginx/1.0.4
    Date: Wed, 07 Sep 2011 16:38:12 GMT
    Content-Type: application/json
    Connection: keep-alive
    Status: 200 OK
    X-RateLimit-Limit: 5000
    X-RateLimit-Remaining: 4994
    Content-Length: 995

    {
      "parents": [
        {
          "url": "https://api.github.com/repos/django/django/git/commits/6465e005fd564bd75ba64f2f09d5824ed2455c9c",
          "sha": "6465e005fd564bd75ba64f2f09d5824ed2455c9c"
        }
      ],
      "committer": {
        "date": "2005-11-14T10:50:13-08:00",
        "name": "jacob",
        "email": "jacob@bcc190cf-cafb-0310-a4f2-bffc1f526a37"
      },
      "author": {
        "date": "2005-11-14T10:50:13-08:00",
        "name": "jacob",
        "email": "jacob@bcc190cf-cafb-0310-a4f2-bffc1f526a37"
      },
      "message": "Added NOINDEX tag to debug 500 page (for robots)\n\ngit-svn-id: http://code.djangoproject.com/svn/django/trunk@1234 bcc190cf-cafb-0310-a4f2-bffc1f526a37\n",
      "url": "https://api.github.com/repos/django/django/git/commits/a0d59b49019d65b38c5612eb0b4fab0bb37271ae",
      "sha": "a0d59b49019d65b38c5612eb0b4fab0bb37271ae",
      "tree": {
        "url": "https://api.github.com/repos/django/django/git/trees/a5d296a396f5bbf70d074ce09fa947f95cd91523",
        "sha": "a5d296a396f5bbf70d074ce09fa947f95cd91523"
      }
    }
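The same request can be made from Python. The sketch below fetches a commit with the standard library and reduces the response to a few fields, including the original SVN revision, which it pulls out of the git-svn-id line in the commit message. The function names are hypothetical, and the regular expression assumes the git-svn-id line always has the form shown in the sample response.

```python
import json
import re
import urllib.request

API_ROOT = "https://api.github.com/repos/django/django/git/commits/"

# Matches e.g. "git-svn-id: http://.../trunk@1234 <uuid>" and captures 1234.
SVN_ID = re.compile(r"git-svn-id: \S+@(\d+) ")


def fetch_commit(sha):
    """Fetch one commit object from the GitHub API as a dictionary."""
    with urllib.request.urlopen(API_ROOT + sha) as response:
        return json.load(response)


def summarize_commit(commit):
    """Reduce a commit object to the fields most useful for mining."""
    message = commit["message"]
    match = SVN_ID.search(message)
    return {
        "sha": commit["sha"],
        "author": commit["author"]["name"],
        "date": commit["author"]["date"],
        # First line of the message, without the git-svn-id trailer.
        "subject": message.splitlines()[0] if message else "",
        "svn_revision": int(match.group(1)) if match else None,
    }
```

summarize_commit(fetch_commit("a0d59b49019d65b38c5612eb0b4fab0bb37271ae")) would map the GitHub commit above back to SVN revision 1234. Keep an eye on the X-RateLimit-Remaining header if you fetch many commits this way.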
SVN data dumps
Finally, for convenience, we provide a couple of full dumps of repository data for off-line processing:
- Complete SVN log (bzipped XML; ~1 MB). This is the complete output of svn log --xml.
- Full SVN dump (bzipped SVN dump; ~200 MB, expands to ~1.8 GB). This is the output of svnadmin dump.
Each dump is updated nightly.
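For the complete SVN log, a streaming parse keeps memory use low even though the file holds every commit. Here is a minimal sketch that counts commits per author directly from the bzipped file. The function name and file path are hypothetical, and it assumes the dump has the same log/logentry structure as regular svn log --xml output.

```python
import bz2
import xml.etree.ElementTree as ET
from collections import Counter


def commits_per_author(log_path):
    """Count logentry elements per author in a bzipped `svn log --xml` file."""
    counts = Counter()
    with bz2.open(log_path, "rb") as handle:
        # iterparse yields each element once it has been read completely.
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag == "logentry":
                author = elem.findtext("author", default="(no author)")
                counts[author] += 1
                # Drop the entry's children so memory stays flat.
                elem.clear()
    return counts
```

Calling most_common() on the result gives a quick ranking of committers by number of commits.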
Questions?
If you've got questions, please contact Jacob Kaplan-Moss (jacob -at- jacobian.org).