Sunday, September 9, 2012

Static files on Heroku with django-pipeline and S3

Static files are always a nasty bit, even more so when you serve them from a completely different web server.
I was recently required to do so for a Django project hosted on Heroku. Serving your static files from the web dyno is strongly discouraged, so I went with S3.

requirements

  1. Serve the static files from S3
  2. Compile, minify, and combine the JS/CSS
  3. When working locally, serve the files from Django without changing them
  4. Don't rely on browser cache expiration - manage the versions of the static files explicitly

Since step 2 might take some time, I didn't want it to block the loading of the dynos, so I didn't want to call collectstatic on the dyno.

Moreover, I wanted the version of the static files to be perfectly synced with the server code, i.e. every version that is uploaded should have a corresponding static file package that is linked to it and is immutable in the same sense that a git commit is immutable.
This is not a common requirement, but it makes a lot of sense: a great majority of the static files ARE code, and a mismatch between versions could cause unpredictable behavior.

solution


overview

  1. Use django-pipeline to define packages (while still getting the original, unprocessed files in the local environment)
  2. When deploying a new version, collect the files using Django's collectstatic command
  3. Use Django's CachedFilesMixin for static file version management
  4. Upload the files to S3 with s3cmd
  5. Commit the hashed names of the static files to the code - this synchronizes the static file version with the code version

Defining packages

Using django-pipeline, you can define the different packages and also include files that require compilation (like Less or CoffeeScript). This is done in the settings file, like so:

# CSS files that I want to package
PIPELINE_CSS = {
    'css_package': {
        'source_filenames': (
            'file1.css',
            'file2.less',
        ),
        'output_filename': 'package.css', # Must be in the root folder or we will have relative link problems
    },
}

# JS files that I want to package
PIPELINE_JS = {
    'js_package': {
        'source_filenames': (
            'file1.js',
            'file2.coffee',
        ),
        'output_filename': 'package.js', # Must be in the root folder or we will have relative link problems
    }
}

PIPELINE_YUI_BINARY = ...
PIPELINE_COMPILERS = (
    'pipeline.compilers.coffee.CoffeeScriptCompiler',
    'pipeline.compilers.less.LessCompiler',
)

PIPELINE_COFFEE_SCRIPT_BINARY = 'coffee'
PIPELINE_LESS_BINARY = ...
# Storage for finding and compiling in local environment
PIPELINE_STORAGE = 'pipeline.storage.PipelineFinderStorage'
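Once packages are defined, they are referenced by name from the templates. With the django-pipeline version I was using at the time, the template tags look roughly like this (a sketch; 'css_package' and 'js_package' are the package names defined above):

{% load compressed %}

{% compressed_css 'css_package' %}
{% compressed_js 'js_package' %}

In the local environment pipeline emits one tag per original source file, while in production it emits a single tag pointing at the packaged (and, as described below, hashed) file.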

collecting files and adding version management

Collection is composed of a few steps:
  1. Find all the static files in all the apps this project is using (via INSTALLED_APPS)
  2. Copy all the files to the same root folder on the local machine
  3. Create packages according to the pipeline settings
  4. Append the MD5 hash of each file to its name (so file.js is renamed to file.****.js)
  5. Go over CSS files that contain imports and image references (like url()) and rewrite each path to the hashed name of the referenced resource
This can be done by using a custom storage for the staticfiles app.

# Local location to keep static files before uploading them to S3
# This should be some temporary location and NOT committed to source control
STATIC_ROOT = ...
# Storage for collection, processing and serving in production
STATICFILES_STORAGE = 'myapp.storage.PipelineCachedStorage'

And the storage is simply:


from django.contrib.staticfiles.storage import CachedFilesMixin, StaticFilesStorage
from pipeline.storage import PipelineMixin

class PipelineCachedStorage(PipelineMixin, CachedFilesMixin, StaticFilesStorage):
    pass

So whenever we execute the collectstatic management command, we get all the steps described above.
One caveat you might encounter: during step 5, if a referenced resource is not found, an exception is raised and the collection won't continue. For example, if one of the CSS files in one of the apps you are using (possibly a third-party app) references a background image that does not exist, the collection process will fail when it reaches that file.
This is a bit too strict in my opinion, so I used a derived version of CachedFilesMixin that is more lenient:

from django.contrib.staticfiles.storage import CachedFilesMixin

class MyCachedFilesMixin(CachedFilesMixin):
    def hashed_name(self, name, *a, **kw):
        try:
            return super(MyCachedFilesMixin, self).hashed_name(name, *a, **kw)
        except ValueError:
            # The referenced resource is missing - warn and keep the original name
            print 'WARNING: Failed to find file %s. Cannot generate hashed name' % (name,)
            return name
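If you want this lenient behaviour, the derived mixin simply takes the place of CachedFilesMixin in the storage class, along these lines (a sketch based on the storage defined earlier):

class PipelineCachedStorage(PipelineMixin, MyCachedFilesMixin, StaticFilesStorage):
    # Same storage as before, but a broken reference only produces a
    # warning instead of failing the whole collection
    pass

The same substitution works for the S3 storage shown further down.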

Upload the files to S3

To upload the files, I use s3cmd, which is faster than anything else I have tried. You can actually set Django to upload the files directly to S3 during collection, but it will be much slower and will result in more S3 activity than doing it this way.

You can sync the local folder with the S3 bucket this way:

s3cmd sync collected/ s3://mybucket -v -P

Notice that you can do this without harming the current version in production: static files that have changed will have a different file name, because we added the MD5 hash to their name.
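As an illustration of the naming scheme, this is roughly what CachedFilesMixin does under the hood (a sketch, not the actual Django code): the first 12 hex characters of the MD5 of the file's contents are inserted before the extension.

import hashlib
import os

def illustrate_hashed_name(path):
    # Roughly how CachedFilesMixin builds the hashed name:
    # hash the file contents and embed the hash in the file name
    with open(path, 'rb') as f:
        digest = hashlib.md5(f.read()).hexdigest()[:12]
    root, ext = os.path.splitext(path)
    return '%s.%s%s' % (root, digest, ext)

# e.g. illustrate_hashed_name('package.css') -> 'package.1a2b3c4d5e6f.css'

Since the hash changes whenever the content changes, old and new versions never collide in the bucket.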

To make Django create links to the files on S3 we use django-storages. We update the production settings with the AWS credentials and use an S3BotoStorage with a corresponding STATIC_URL:


from boto.s3.connection import ProtocolIndependentOrdinaryCallingFormat

AWS_STORAGE_BUCKET_NAME = os.environ.get('AWS_STORAGE_BUCKET_NAME')
AWS_ACCESS_KEY_ID = os.environ.get('AWS_ACCESS_KEY_ID')
AWS_SECRET_ACCESS_KEY = os.environ.get('AWS_SECRET_ACCESS_KEY')
# Environment variables are strings, so compare explicitly; should only be 'True' in production
AWS_ENABLED = os.environ.get('AWS_ENABLED', 'True') == 'True'
AWS_S3_CALLING_FORMAT = ProtocolIndependentOrdinaryCallingFormat()
AWS_QUERYSTRING_AUTH = False

STATIC_URL = '//s3.amazonaws.com/%s/' % AWS_STORAGE_BUCKET_NAME if AWS_ENABLED else '/static/'
STATICFILES_STORAGE = 'myapp.storage.S3PipelineStorage' if AWS_ENABLED else 'myapp.storage.PipelineCachedStorage'

A few notes about these settings:
  • AWS_ENABLED should only be true in production, so we are not using S3 when working locally
  • AWS_S3_CALLING_FORMAT now defaults to the S3 subdomain bucket URL, which is great for CNAMEs, but Chrome does not like it when you download assets directly from *.s3.amazonaws.com and raises sporadic security errors, so I prefer to keep using the original URL scheme
  • AWS_QUERYSTRING_AUTH is disabled because there are currently too many bugs that produce a wrong signature when you use S3BotoStorage and CachedFilesMixin together. Hopefully, that will change soon
Also notice that I changed STATICFILES_STORAGE to 'myapp.storage.S3PipelineStorage' in production. This is the S3 equivalent of what we have in the local environment:

from storages.backends.s3boto import S3BotoStorage

class S3PipelineStorage(PipelineMixin, CachedFilesMixin, S3BotoStorage):
    pass
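With this storage in place, any static URL that Django generates points at the hashed file on S3. A quick sanity check from a Django shell (assuming the production settings above are active and the 'staticfiles' cache described in the next section is populated):

from django.contrib.staticfiles.storage import staticfiles_storage

# Maps the original name to its hashed version and builds the S3 URL
print staticfiles_storage.url('package.css')
# roughly: //s3.amazonaws.com/mybucket/package.1a2b3c4d5e6f.css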

Linking the static files version to the code version

So now we have different versions of the static files residing side by side on S3 without interfering with each other. The last issue is to make sure each code version is linked to the correct static files version. Since we don't want the resources themselves to be available on the web dyno, we need to keep a separate mapping between each original file name and its versioned (hashed) file name.
One way to do so is by using a filesystem-based cache. When files are collected, CachedFilesMixin uses a Django cache backend called 'staticfiles' (or the default one if that is not defined) to keep the file name mapping. Using a filesystem-based cache, we can keep this mapping after the collection and then commit it to the code so it will be available to the web dyno when we push.
To add the filesystem-based cache:
 
CACHES = {
    ...,
    'staticfiles' : {
        'BACKEND': 'django.core.cache.backends.filebased.FileBasedCache',
        'LOCATION': os.path.join(PROJECT_ROOT, 'static_cache'),
        'TIMEOUT': 100 * 365 * 24 * 60 * 60, # A hundred years!
        'OPTIONS': {
            'MAX_ENTRIES': 100 * 1000
        }
    },
}

Notice the cache is kept inside the project directory so it will be picked up by git.
The deployment script now contains:

rm -rf static_cache
manage.py collectstatic --noinput
s3cmd sync collected/ s3://mybucket -v -P
git add static_cache
git commit static_cache -m "updated static files cache directory"

The cache is deleted at the beginning of the process and afterwards committed to the git repository (we commit just the folder that contains the cache, regardless of the status of the rest of the repository).
Again, this does not change anything in production. To do the actual deployment we just push to Heroku and immediately get all the code changes together with the static files changes.

Problem solved - let's go eat!
