Filed under Devops, django, programming, python, Tech.

One of the trickier parts of maintaining a frequently updated web application is deploying those updates without annoying the site’s users.  Updating the software that runs a web site can involve uploading new code, upgrading dependencies, changing the database schema, restarting several servers, working around long-running background tasks, and more.  Doing all this in the most obvious, straightforward manner typically means stopping everything, doing the deployment, and then restarting it all (which means the site is unavailable in the meantime).  Many sites used to handle this by scheduling downtime for updates late at night when not many people are using the site, but this has several drawbacks:

  • Sites with a global audience don’t really have a time when anywhere near all their users are asleep
  • It becomes a big deal to deploy new features (because it means taking the site down), so pressure to do it infrequently conflicts with pressure to fix bugs and release new features quickly
  • Nobody really wants to be up at 3 AM deploying new code when most of the people who could help if something goes wrong are asleep

So what you really want is a way to deploy a new version of the site without any of your users even noticing until they spot the improvements.  Given the huge number of web sites out there implemented in different languages, frameworks, and operating systems, people have come up with a variety of ways to attempt this.  What I’m going to outline here is what we’re currently doing to quickly deploy new versions of our main Django application in the middle of the day, while most of the company is online and working, without interrupting any user requests or long-running tasks.

Tools

First, a quick overview of what we’re working with:

  • Django is our web application framework, and contains utilities for collecting static assets for deployment, etc.
  • Fabric for breaking the deployment into steps and executing each one on an assortment of remote servers with different roles
  • Chef for updating server configurations and system dependencies
  • Virtualenv for maintaining a set of Python libraries using exactly the versions we want, independent of what our operating system currently packages by default
  • South for updating the database schema (soon to be replaced with Django 1.7’s built-in migrations)
  • uWSGI for linking our web application processes to our web server
  • Celery for running scheduled and asynchronous tasks that don’t need to finish in the duration of a single web request

There’s actually a lot more software than this involved in running the site, but these are some of the key tools that make it possible to perform seamless deployments.  The main deployment Fabric task will end up looking something like this:
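As a minimal sketch of that overall shape, here is the sequence written in plain Python so that it runs without Fabric installed.  Only build_archive and deliver_tar are names taken from this post; the other step names, the step decorator, and the deploy function are hypothetical stand-ins, and each placeholder step just records its own name where a real one would run commands on remote servers.

```python
from functools import wraps

executed = []  # names of the steps that have run, in order


def step(func):
    """Record each step as it runs; a real step would run remote commands."""
    @wraps(func)
    def wrapper():
        executed.append(func.__name__)
        return func()
    return wrapper


@step
def build_archive():
    """Package the new code once."""


@step
def deliver_tar():
    """Upload the archive to every server that needs it."""


@step
def pause_celery():
    """Let running background tasks finish, then stop the workers."""


@step
def update_virtualenv():
    """Install the library versions the new code expects."""


@step
def migrate_database():
    """Apply schema and data migrations."""


@step
def swap_symlink():
    """Point the 'current code' link at the new release."""


@step
def update_server_config():
    """Apply server configuration changes."""


@step
def restart_servers():
    """Start Celery again and gracefully reload the web processes."""


DEPLOY_STEPS = [
    build_archive, deliver_tar, pause_celery, update_virtualenv,
    migrate_database, swap_symlink, update_server_config, restart_servers,
]


def deploy(steps=DEPLOY_STEPS):
    """Run each step in order; an exception in any step aborts the rest."""
    for run_step in steps:
        run_step()


deploy()
print(executed)
```

Keeping the steps in one ordered list makes the sequence easy to read and means a failure partway through stops the deployment before the later, riskier steps run.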

 

Deployment: Getting the Code to the Servers

The first steps in our Fabric script for performing deployments are to prepare an archive of the new code (build_archive) and get it to all the servers that need to run it (deliver_tar).  This is really the easy part, as long as we upload the code someplace other than where the current code is running.

We only need to build the archive to be uploaded once (no matter how many different servers we’re deploying it to), and we can upload it to all the servers it needs to go to in parallel.  So far we haven’t done anything to interrupt the normal operation of the site.  Implementation details of the tasks have been omitted here to make the high-level organization clearer (and because they can vary a lot between sites depending on how things have been configured).
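To illustrate the "build once, deliver in parallel" idea, here is a rough sketch using only the standard library.  The function names match the tasks mentioned above, but the bodies are assumptions rather than our actual implementation: local temporary directories stand in for remote servers, and a real task would upload over SSH instead of copying files.

```python
import os
import shutil
import tarfile
import tempfile
from concurrent.futures import ThreadPoolExecutor


def build_archive(source_dir, archive_path):
    """Pack the code once; the same file is reused for every server."""
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(source_dir, arcname="app")
    return archive_path


def deliver_tar(archive_path, upload_dirs):
    """Copy the archive to every destination in parallel.

    Local directories stand in for servers here; a real task would upload
    to a directory other than the one serving the live code.
    """
    def deliver(upload_dir):
        return shutil.copy(archive_path, upload_dir)

    with ThreadPoolExecutor(max_workers=len(upload_dirs)) as pool:
        return list(pool.map(deliver, upload_dirs))


# Usage with throwaway directories
workspace = tempfile.mkdtemp()
source = os.path.join(workspace, "src")
os.mkdir(source)
with open(os.path.join(source, "settings.py"), "w") as handle:
    handle.write("DEBUG = False\n")

servers = [tempfile.mkdtemp(prefix="server%d-" % n) for n in range(3)]
archive = build_archive(source, os.path.join(workspace, "release.tar.gz"))
delivered = deliver_tar(archive, servers)
```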

Pause Asynchronous Tasks

Before we start making changes that could potentially impact how the site behaves, we’re going to pause all our asynchronous task processing.
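A task like this can be limited to the relevant hosts with a small role-filtering decorator.  The sketch below is one way such a decorator might be written, not necessarily how ours works: the env object is a stand-in for Fabric 1.x’s env (which provides host_string and roledefs), the host names are made up, and pause_celery is a hypothetical task name.

```python
from functools import wraps
from types import SimpleNamespace

# Stand-in for Fabric 1.x's `env`; in a real fabfile this would be
# `from fabric.api import env`. The host names below are made up.
env = SimpleNamespace(
    host_string=None,
    roledefs={
        "app": ["app1.example.com", "app2.example.com"],
        "celery": ["celery1.example.com"],
    },
)


def only_roles(*roles):
    """Skip the decorated task unless the current host has one of `roles`."""
    def decorator(task):
        @wraps(task)
        def wrapper(*args, **kwargs):
            allowed = {
                host for role in roles for host in env.roledefs.get(role, [])
            }
            if env.host_string not in allowed:
                return None  # not relevant on this host, so do nothing
            return task(*args, **kwargs)
        return wrapper
    return decorator


@only_roles("celery")
def pause_celery():
    """Ask the Celery workers on this host to finish up and stop."""
    # A real task would invoke the Celery init script here.
    return "stopped celery on %s" % env.host_string


env.host_string = "app1.example.com"
print(pause_celery())   # None: skipped on an app server
env.host_string = "celery1.example.com"
print(pause_celery())   # runs on the Celery server
```

As far as I understand it, Fabric’s built-in roles decorator chooses which hosts a task runs against when it is invoked directly, whereas a filter like this lets a task be called from inside a larger task and quietly do nothing on hosts outside the named roles.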

The only_roles decorator here lets us perform certain tasks only on the servers for which they’re relevant, even though the overall deployment task is being executed on a larger set of servers (app servers, Celery servers, etc.).  Stopping Celery this early is actually overkill, but our site has been designed such that anything in a Celery task can wait for minutes or even hours without inconveniencing anybody too badly; they’re things that we’d like to be done soon, but don’t need to be done right now.  Note from the docstring that we’re using a start/stop/restart script for Celery which initially sends each Celery process a SIGINT signal asking it nicely to wrap up whatever it’s doing and not start anything new from the queue.  Only if that takes an unusually long time will we actually resort to a KILL signal which interrupts the task (if this happens, we get an error email about it and write up a ticket to either optimize the code or subdivide it into smaller tasks).
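The "ask nicely first, force only as a last resort" logic can be sketched in a few lines.  This is an illustration under assumptions rather than our actual init script: stop_gracefully is a hypothetical helper, it works on a subprocess.Popen handle rather than on Celery’s own PID files, and the timeout is whatever "unusually long" means for your tasks.

```python
import signal
import subprocess


def stop_gracefully(process, timeout, polite_signal=signal.SIGINT):
    """Ask `process` to exit, escalating to SIGKILL after `timeout` seconds.

    Returns True if the process exited on its own, False if it was killed.
    """
    process.send_signal(polite_signal)
    try:
        process.wait(timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        process.kill()  # SIGKILL cannot be caught or ignored
        process.wait()  # reap the child so it does not linger as a zombie
        return False
```

The boolean return value is what makes the follow-up possible: a False result is the cue to send the error email and file a ticket about the task that would not finish in time.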

Install and Upgrade Dependencies

Next we update the virtualenv on each server to use the correct versions of the libraries used by the code we’re deploying.

This is where we’re starting to take some risks, and this could be improved a little.  Because we’ve stopped all of our Celery processes, all of our web application processes have been running for a while and already loaded the code into memory, and the web servers are serving collected static assets from the previous deployment directory, the application isn’t really looking at this virtualenv on the filesystem much anymore.  But we do impose a limit on how many requests each Django process is allowed to handle before we restart it (to minimize risk of memory leaks and such), so bad timing could lead to one of them restarting in the middle of our deployment and a few requests hitting an odd in-between state of the code.  In practice this hasn’t happened very often, but the next improvement here is probably to clone the virtualenv in use, update the clone, and swap a symbolic link to point to the new one just before reloading the web application processes.

Update the Database

Not every deployment includes a database schema change, but it’s not uncommon either.

This is an area where we have to be particularly careful.  Because the main relational database is a single resource shared between all the application servers, it’s in constant use and changes really can’t be made to an offline copy and swapped in (the data is constantly changing).  So we need to take care that our migrations are backwards-compatible and finish pretty quickly.  This presentation (slides here) gives a pretty good overview of the potential problems and how to avoid them.  The main points I’d emphasize:

  • Do not create new fields on existing tables as not-null (especially on PostgreSQL; it triggers a full table rebuild).  If necessary, change the field type to not-null in a future migration after all records have been verified to have real values in it.
  • Perform data migrations in batched transactions.  A single huge transaction can be much slower and gives you no feedback on how much longer it’ll take.
  • Test migrations against a production-scale database before running them for real on production, to make sure they’ll finish in a reasonable length of time.
  • Don’t assume atomic deployments.  This whole deployment process takes a nonzero amount of time, so you can’t assume that a data migration will catch every single record created by the old code or that the new code will already be in place immediately when the migration finishes.
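
The first two points can be sketched with the standard library’s sqlite3 module standing in for the production database.  Everything named here (the account table, the display_name column, the backfill_in_batches helper) is hypothetical, and a real South or Django migration would express the same steps through the migration framework instead of raw SQL.

```python
import sqlite3


def backfill_in_batches(conn, batch_size=1000):
    """Fill in a newly added nullable column, committing after every batch."""
    total = 0
    while True:
        cursor = conn.execute(
            "UPDATE account SET display_name = username "
            "WHERE id IN (SELECT id FROM account "
            "             WHERE display_name IS NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # short transactions hold locks only briefly
        if cursor.rowcount == 0:
            return total
        total += cursor.rowcount
        print("backfilled %d rows so far" % total)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, username TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO account (username) VALUES (?)",
    [("user%d" % n,) for n in range(2500)],
)

# Step 1: add the column as nullable so existing rows remain valid.
conn.execute("ALTER TABLE account ADD COLUMN display_name TEXT")
conn.commit()

# Step 2: backfill in small transactions, with progress feedback.
updated = backfill_in_batches(conn, batch_size=1000)
```

Because the loop keeps going until no unfilled rows remain, it also picks up rows that old code inserts while the backfill is running, which helps with the last point about non-atomic deployments.  Making the column not-null would be a separate, later migration.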

Swap the Deployment Directory Link

Now that all the changes are in place, we update the symbolic link which points at the currently deployed code on each server.

This one in particular we want to do as late as possible, and only after all the preceding steps are fully complete, because this is where all the processes look when loading the software that runs the site.  Everything needs to be lined up and consistent when that happens.
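One common way to make the swap itself safe is to create the new link under a temporary name and rename it over the old one, since a rename on a POSIX filesystem replaces the destination atomically.  The sketch below assumes that approach; swap_symlink and the directory names are illustrative, with temporary directories standing in for release directories on a server.

```python
import os
import tempfile


def swap_symlink(link_path, new_target):
    """Point `link_path` at `new_target` without a moment where it is missing."""
    temp_link = link_path + ".new"
    if os.path.lexists(temp_link):
        os.remove(temp_link)  # clear leftovers from an interrupted run
    os.symlink(new_target, temp_link)
    os.replace(temp_link, link_path)  # rename: atomic on POSIX filesystems


# Usage with throwaway directories standing in for two releases
base = tempfile.mkdtemp()
old_release = os.path.join(base, "release-1")
new_release = os.path.join(base, "release-2")
os.mkdir(old_release)
os.mkdir(new_release)

current = os.path.join(base, "current")
swap_symlink(current, old_release)  # the first deployment creates the link
swap_symlink(current, new_release)  # later deployments replace it
print(os.readlink(current))
```

Deleting the old link and then creating a new one would leave a brief window in which the path does not exist at all, which is exactly the kind of in-between state this step is trying to avoid.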

Update the Server Configuration

Now we make any changes to init scripts, user permissions, system library versions, server-specific settings files, etc.

Depending on exactly how your application code is linked to your server configuration, this could well be more appropriate to do elsewhere in the deployment process.  We do it here because we have Chef configured to generate some server-specific settings files based on templates in the application directory, so we need to have the new code fully in place before it runs in order for the settings to be correctly populated.

Restart the Servers

With all the code and database changes in place, it’s time to actually start switching over to the new code.

For Celery, this is pretty easy; we stopped all the Celery processes earlier in the deployment process, so we just need to start them again.  The processes handling web requests are where we need to be careful; basically, we want to allow requests that are currently in progress to finish while starting new processes with the updated code to handle all future incoming requests.  This is actually pretty difficult to get just right; rather than failing to do the topic justice here, I’ll just point you to the very detailed page in the uWSGI documentation on the art of graceful reloading.
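As a starting point only, the simplest form of graceful reload is to send SIGHUP to the uWSGI master process, which uWSGI treats as a request to reload gracefully.  The sketch below assumes uWSGI was started with its pidfile option so the master’s PID can be read from a file; graceful_reload and the path passed to it are hypothetical, and the more careful strategies in the uWSGI documentation are deliberately not attempted here.

```python
import os
import signal


def graceful_reload(pidfile_path):
    """Send SIGHUP to the master process whose PID is stored in `pidfile_path`."""
    with open(pidfile_path) as pidfile:
        master_pid = int(pidfile.read().strip())
    os.kill(master_pid, signal.SIGHUP)
    return master_pid
```

A call such as graceful_reload("/var/run/uwsgi/app.pid") would then be the last web-server step of the deployment, with the path adjusted to wherever your own configuration writes the pidfile.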

Room for Improvement

By following the outline above, we’ve managed to significantly improve our deployments.  A non-developer can click a button to launch a Jenkins job which runs the deployment script, deploying code which has already been tested in QA to production.  Within a few minutes, the new code is up and running on all the production servers with users none the wiser, unless maybe they suddenly notice “hey, did that page look this good before?”  When we were newer to all this, a deployment often prompted one or two dozen error emails as user requests and background tasks were rudely interrupted; now, a single error email arriving around the time of a deployment is cause for filing a new ticket to prevent it from happening again.

Still, there’s room for improving the process:

  • I mentioned earlier that updating a clone of the virtualenv could reduce the risk of loading new library code too early; I’ve also seen interesting arguments for deploying the new code bundled with all dependencies as a native package, so that may be worth pursuing.
  • It would be nice to be able to wait longer before pausing the Celery workers, while remaining confident that the web processes will keep humming along happily in the meantime.
  • We could definitely get better at writing good database migrations and/or creating tools to automatically check a migration for potential problems before it goes live.
  • The database settings can probably be tweaked for better overall performance, which would also speed up our migrations.
  • The overall deployment is reasonably fast, but even faster would be good; some of the steps could probably stand to be optimized a bit (more efficient collection of static assets, etc.)

If anybody has good suggestions for even more stealthy deployments, I’d love to hear about them in the comments!

Tags: celery, deployments, deployments nobody notices, Django, site performance, websites

2 Responses to “Stealthy Django Deployments”

  1. Mike Sokolov

    Nice post, Jeremy! I wonder if you’ve considered running a warm postgres backup (like this: and performing migrations there?

    -Mike

  2. Vasyl Dizhak

    Thank you for sharing this great post. I would really like to give “smoother database migrations” a try, but I have found them very hard to set up and maintain, so our monitoring system still fires exceptions about restarted application processes that are already running the new code. Another quick note, about the only_roles decorator: I assume it does the same thing as fabric.decorators.roles?