Thursday, September 27, 2012

Django app reset (with south)

When developing a new Django app, it is common to make lots of changes in the models.py module.
In order to actually test the new app, you need to update the database with the new schema.
However, Django's syncdb does not update existing tables, only adds missing ones. 

There are a few database migration tools out there, but south is by far the most common one.
South excels at small changes, like adding a field or removing a constraint. In the early stages of app development, however, you might make large changes rapidly and, at the same time, not care too much about the existing data in the database (for that particular app).

So in the process of your development, you might want to do some kind of "app resetting", meaning - 'drop all the tables for this app and recreate them according to the new models definition'.
As common as it seemed to me, I couldn't find a solution for that procedure in either the native Django API or the community.
The closest options are sqlclear, which prints the SQL statements to drop the tables for an app, and flush, which actually resets the entire database. Obviously, these solutions are not south-friendly.

But I wanted something that actually resets a single app and also plays nicely with south.

Enter "south_reset", a management command that is just a few south commands sewn together, but I still found it useful enough to share.

The usage is pretty straightforward: just list the app names you want to reset.
The optional "soft" flag means that the migrations are merged into a single initial migration without actually running them (just faking it), so the data persists for the app. This is useful when you have made lots of migrations and you want to get rid of the clutter, but still keep the existing data in the database.

Note that you should be very careful with this command if you are deploying the code somewhere, since you might get ghost migrations.

So, without further ado, here is the gist:

from optparse import make_option
from os.path import dirname, join, abspath, basename
import shutil
import subprocess
from django.core.management.base import AppCommand


class Command(AppCommand):
    help = "Reset south migrations for app"
    option_list = AppCommand.option_list + (
        make_option('--soft',
            action = 'store_true',
            dest = 'soft',
            default = False,
            help = "Just merge the migrations and update the south DB accordingly, doesn't change actual app tables"),
        make_option('--database',
            action = 'store',
            dest = 'database',
            default = 'default',
            help = 'Nominates a database to synchronize. Defaults to the "default" database.'),
    )

    def handle_app(self, app, soft = False, verbosity = 1, **options):
        printer = Printer(verbosity = verbosity)

        def manage_call(*a, **kw):
            b = ['--%s%s' % (key, ('=' + value) if value!=True else '') for key, value in kw.iteritems() if value != False]
            final_args = ['python', 'manage.py'] + list(a) + b
            printer('Calling: '+ ' '.join(final_args), 3)
            return subprocess.call(final_args)

        app_name = basename(dirname(app.__file__))
        printer('Resetting %s (%s)' % (app_name, 'soft' if soft else 'hard'))
        migrations_dir = abspath(join(dirname(app.__file__), 'migrations'))
        printer('Rolling back migrations', 2)
        manage_call('migrate', app_name, 'zero', fake = soft, verbosity = verbosity, database = options['database'])
        printer('Deleting migrations folder', 2)
        try:
            shutil.rmtree(migrations_dir)
        except Exception, ex:
            printer('Failed to delete migrations folder, probably does not exist')
            pass
        printer('Creating new initial migration', 2)
        manage_call('schemamigration', app_name, initial = True, verbosity = verbosity)
        printer('Migrating to new initial', 2)
        manage_call('migrate', app_name, fake = soft, verbosity = verbosity, database = options['database'])
        printer('Done')


class Printer(object):
    def __init__(self, verbosity = 1):
        self.verbosity = int(verbosity)

    def __call__(self, s, verbosity = 1, prefix = '***'):
        if verbosity <= self.verbosity:
            print prefix, s
If you are not familiar with management commands, you need to put this script under a "management/commands" folder in any of your apps. More information is available in the Django documentation on custom management commands.
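
The manage_call helper in the command turns keyword arguments into manage.py flags. Here is a standalone sketch of just that conversion, written for Python 3 so it can be tried outside Django. The function name build_manage_args is hypothetical, and unlike the original it compares with "is True" / "is False" and formats values with "%s", so treat it as an illustration rather than a drop-in replacement.

```python
def build_manage_args(*args, **flags):
    """Build the argument list for a manage.py call (illustrative only)."""
    cli_flags = []
    for key, value in flags.items():
        if value is False:
            continue                              # disabled flags are dropped
        if value is True:
            cli_flags.append('--%s' % key)        # boolean switch, e.g. --fake
        else:
            cli_flags.append('--%s=%s' % (key, value))
    return ['python', 'manage.py'] + list(args) + cli_flags


soft_reset = build_manage_args('migrate', 'myapp', 'zero',
                               fake=True, verbosity='2', database='default')
hard_reset = build_manage_args('migrate', 'myapp', 'zero',
                               fake=False, verbosity='2', database='default')
```

With soft = True the rollback is only faked (the --fake switch is present), while a hard reset omits the switch and really drops the tables.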

Sunday, September 16, 2012

Django Cache Chaining

In a previous post, I used Django's FileBasedCache to synchronize the static files version with the code version. One of the advantages of this method was that there is no performance hit when a process is recycled, since the cache starts "full".
On the other hand, there was also no performance improvement over time. A file system cache can work great on a local machine with an SSD, but on a cloud machine, where the storage might not be as close, it is significantly slower than local RAM, or maybe even than a memcached instance.

So essentially, I needed to cache my cache.

One cache to rule them all


Django supports multiple cache backends, so you can define a local memory cache backend and a file-based backend. What I wanted to create is a cache backend that chains the two together.

So here is the interface I wanted:
  • get - try getting the item from the first cache in the chain; if it exists, return it, otherwise go to the next cache.
    If there is a hit in a deeper cache backend, update all the caches in the chain up to it
  • set - set the item in all the caches in the chain
That turned out to be very simple to implement:


from django.core.cache.backends.base import BaseCache
from django.core.cache import get_cache
from lock_factory import LockFactory

class ChainedCache(BaseCache):
    def __init__(self, name, params):
        BaseCache.__init__(self, params)
        self.caches = [get_cache(cache_name) for cache_name in params.get('CACHES', [])]
        self.debug = params.get('DEBUG', False)

    def add(self, key, value, timeout=None, version=None):
        """
        Set a value in the cache if the key does not already exist. If
        timeout is given, that timeout will be used for the key; otherwise
        the default cache timeout will be used.

        Returns True if the value was stored, False otherwise.
        """
        if self.has_key(key, version=version):
            return False
        self.set(key, value, timeout=timeout, version=version)
        return True

    def get(self, key, default=None, version=None):
        """
        Fetch a given key from the cache. If the key does not exist, return
        default, which itself defaults to None.
        """
        def recurse_get(cache_number = 0):
            if cache_number >= len(self.caches): return None
            cache = self.caches[cache_number]
            value = cache.get(key, version=version)
            if value is None:
                value = recurse_get(cache_number + 1)
                # Keep the value from the next cache in this cache for next time
                if value is not None: cache.set(key, value, version = version) # Got to use the default timeout...
            else:
                if self.debug: print 'CACHE HIT FOR', key, 'ON LEVEL', cache_number
            return value

        value = recurse_get()
        if value is None:
            if self.debug: print 'CACHE MISS FOR', key
            return default
        return value

    def set(self, key, value, timeout=None, version=None):
        """
        Set a value in the cache. If timeout is given, that timeout will be
        used for the key; otherwise the default cache timeout will be used.
        """
        # Just to be sure we don't get a race condition between different caches, lets use a lock here
        with LockFactory.get_lock(self.make_key(key, version = version)):
            for cache in self.caches:
                cache.set(key, value, timeout = timeout, version = version)

    def delete(self, key, version=None):
        """
        Delete a key from the cache, failing silently.
        """
        # Just to be sure we don't get a race condition between different caches, lets use a lock here
        with LockFactory.get_lock(self.make_key(key, version = version)):
            for cache in self.caches:
                cache.delete(key, version = version)

    def clear(self):
        """Remove *all* values from the cache at once."""
        for cache in reversed(self.caches):
            cache.clear()


# For backwards compatibility
class CacheClass(ChainedCache):
    pass
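
To see the get/set behaviour in isolation, here is a framework-free sketch in which each cache level is a plain dict. The class name ChainedDictCache and the sample keys are made up for the illustration; it has no timeouts, versions or locking, so it only mirrors the chaining logic, not the real backend.

```python
class ChainedDictCache(object):
    """Toy stand-in for the chained backend: every level is a plain dict."""

    def __init__(self, levels):
        self.levels = levels  # fastest cache first

    def get(self, key, default=None):
        for depth, level in enumerate(self.levels):
            if key in level:
                value = level[key]
                # Back-fill every faster level that missed on the way down
                for faster in self.levels[:depth]:
                    faster[key] = value
                return value
        return default

    def set(self, key, value):
        # Writes go to every level in the chain
        for level in self.levels:
            level[key] = value


memory = {}
disk = {"logo.png": "logo.abc123.png"}   # pretend this survived a restart
cache = ChainedDictCache([memory, disk])
first = cache.get("logo.png")            # found in "disk", copied into "memory"
```

After the first get, the key lives in both levels, so the next lookup never reaches the slower one.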

And here are the settings:
CACHES = {
    'staticfiles' : {
        'BACKEND' : 'chained_cache.ChainedCache',
        'CACHES' : ['staticfiles-mem', 'staticfiles-filesystem'],
        'DEBUG' : False,
    },
    'staticfiles-filesystem' : {
        'BACKEND': 'django.core.cache.backends.filebased.FileBasedCache',
        'LOCATION': os.path.join(PROJECT_ROOT, 'static_cache'),
        'TIMEOUT': 100 * 365 * 24 * 60 * 60, # A hundred years!
        'OPTIONS': {
            'MAX_ENTRIES': 100 * 1000
        }
    },
    'staticfiles-mem' : {
        'BACKEND': 'django.core.cache.backends.locmem.LocMemCache',
        'LOCATION': 'staticfiles-mem'
    }
}



You can also get the code in this gist

A few notes:
  • I am using a named lock factory, which is also useful for other stuff. You can check it out in the gist.
    Django is not strict about cache backends being thread safe, so you can remove the lock altogether, but I prefer it this way
  • Calling "get" might have the side effect of setting the item on the cache backends that missed. This might cause the item's timeout to be larger than originally requested, but no larger than the sum of the default timeouts of the cache backends in the chain
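
The lock factory itself is not shown in this post. As a rough idea of what a named lock factory can look like, here is a minimal sketch. The class name NamedLockFactory and its internals are my assumptions; the version in the gist may well differ, and the only thing taken from the post is the get_lock(name) call used as a context manager.

```python
import threading


class NamedLockFactory(object):
    """Hands out one re-entrant lock per name (hypothetical sketch)."""

    _guard = threading.Lock()   # protects the registry itself
    _locks = {}

    @classmethod
    def get_lock(cls, name):
        with cls._guard:
            lock = cls._locks.get(name)
            if lock is None:
                lock = cls._locks[name] = threading.RLock()
            return lock


# Same name -> same lock object, so writers of the same key serialize
with NamedLockFactory.get_lock("staticfiles:1:logo.png"):
    pass  # cache.set(...) on every backend would go here
```

Note that this registry grows with the number of distinct names and only coordinates threads inside a single process.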

Problem solved - let's go eat!

Sunday, September 9, 2012

Staticfiles on heroku with django pipeline and S3

Static files are always a nasty bit, even more so when you serve them from a completely different web server.
I was recently required to do so for a Django project that is hosted on Heroku. It is strongly discouraged to serve your static files from the web dyno, so I went with S3.

requirements

  1. Static files are served from S3
  2. Compile, minify, and combine the JS/CSS
  3. When working locally - serve the files from Django without changing them.
  4. Don't rely on browser cache expiration - manage the versions of the static files

Since the process in 2 might take some time, I didn't want it to block the loading of the dynos, so I didn't want to call collectstatic on the dyno.

Moreover, I wanted the version of the files to be perfectly synced with the server code, i.e. every version that is uploaded should have a corresponding static file package that is somehow linked to it and is immutable in the same sense that a git commit is immutable.
This is not a common requirement but it makes a lot of sense, since a great majority of the static files ARE code, and a mismatch between versions could cause unpredictable behaviors.

solution


overview

  1. Use django-pipeline to define packages (while still getting the original files on local env)
  2. When deploying a new version, collect the files using Django's "collectstatic".
  3. Use Django's CachedFilesMixin for static files version management
  4. Upload the files to S3 with s3cmd
  5. Commit the hash names of the static files to the code - this synchronizes the file version with the code version

Defining packages

Using django-pipeline, you can define the different packages, and also include files that require compilation (like less or coffeescript). This is done in the settings file, like so:

# CSS files that I want to package
PIPELINE_CSS = {
    'css_package': {
        'source_filenames': (
            r'file1.css',
            r'file2.less',
            ),
        'output_filename': 'package.css', # Must be in the root folder or we will have relative links problems
    },
}
PIPELINE_JS = {
    'js_package' : {
        'source_filenames': (
            r'file1.js',
            r'file2.coffee',
            ),
        'output_filename': 'package.js', # Must be in the root folder or we will have relative links problems
    }
}

PIPELINE_YUI_BINARY = ...
PIPELINE_COMPILERS = (
    'pipeline.compilers.coffee.CoffeeScriptCompiler',
    'pipeline.compilers.less.LessCompiler',
)

PIPELINE_COFFEE_SCRIPT_BINARY = 'coffee'
PIPELINE_LESS_BINARY = ...
# Storage for finding and compiling in local environment
PIPELINE_STORAGE = 'pipeline.storage.PipelineFinderStorage'

collecting files and adding version management

Collection is composed of a few steps:
  1. Find all the static files in all the apps this project is using (via INSTALLED_APPS)
  2. Copy all the files to the same root folder on the local env
  3. Create packages according to the pipeline settings.
  4. Append the md5 hash of each file to its name (so file.js is renamed to file.****.js)
  5. Go over CSS files that have imports and image referencing (like url()), and change the path to the new file name of that resource
This can be done by using a custom storage for the staticfiles app.
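
Steps 4 and 5 can be illustrated without Django. The sketch below is not the code of CachedFilesMixin; the function names, the regular expression and the 12-character digest length are assumptions made for the illustration.

```python
import hashlib
import posixpath
import re


def hashed_name(name, content, digest_length=12):
    """Step 4: insert (part of) the md5 of the content before the extension."""
    root, ext = posixpath.splitext(name)
    digest = hashlib.md5(content).hexdigest()[:digest_length]
    return "%s.%s%s" % (root, digest, ext)


_URL_RE = re.compile(r"""url\(\s*(['"]?)([^'")]+)\1\s*\)""")


def rewrite_css_urls(css, name_map):
    """Step 5: point url(...) references at the hashed file names."""
    def replace(match):
        quote, target = match.group(1), match.group(2).strip()
        # Unknown targets are left untouched instead of raising
        return "url(%s%s%s)" % (quote, name_map.get(target, target), quote)
    return _URL_RE.sub(replace, css)


name_map = {"img/bg.png": hashed_name("img/bg.png", b"fake image bytes")}
css = rewrite_css_urls('body { background: url("img/bg.png"); }', name_map)
```

Because the name depends only on the content, an unchanged file keeps its name between deployments, and a changed file gets a new one.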

# Local location to keep static files before uploading them to S3
# This should be some temporary location and NOT committed to source control
STATIC_ROOT = ...
# Storage for collection, processing and serving in production
STATICFILES_STORAGE = 'myapp.storage.PipelineCachedStorage'

And the storage is simply:


class PipelineCachedStorage(PipelineMixin, CachedFilesMixin, StaticFilesStorage):
    pass

So whenever we execute the collectstatic management command, we get all the steps that are described above.
One caveat that you might encounter is that during step 5, if a resource is not found, it will raise an exception and won't continue. For example, if one of the CSS files in one of the apps you are using (might be 3rd party) is referencing a background image that does not exist, the collection process will fail when it reaches that file.
This is a bit too strict in my opinion so I used a derived version of the CachedFilesMixin that is less strict:

class MyCachedFilesMixin(CachedFilesMixin):
    def hashed_name(self, name, *a, **kw):
        try:
            return super(MyCachedFilesMixin, self).hashed_name(name, *a, **kw)
        except ValueError:
            print 'WARNING: Failed to find file %s. Cannot generate hashed name' % (name,)
            return name

Upload the files to S3

To upload the files, I use s3cmd, which is faster than anything else I have tried. You can actually set Django to upload the files directly to S3 when collecting, but it will be much slower and will result in more S3 activity than doing it this way.

You can sync the local folder with the S3 bucket this way:

s3cmd sync collected/ s3://mybucket -v -P

Notice you can do this without harming the current version in production, because static files that have changed will have a different file name thanks to the MD5 hash we added to their name.

To make Django create links to the files on S3 we use django-storages. We update the production version with the AWS settings and use an S3BotoStorage with a corresponding STATIC_URL:


AWS_STORAGE_BUCKET_NAME = os.environ.get('AWS_STORAGE_BUCKET_NAME')
AWS_ACCESS_KEY_ID = os.environ.get('AWS_ACCESS_KEY_ID')
AWS_SECRET_ACCESS_KEY = os.environ.get('AWS_SECRET_ACCESS_KEY')
AWS_ENABLED = os.environ.get('AWS_ENABLED', True) # Should only be True in production
AWS_S3_CALLING_FORMAT = ProtocolIndependentOrdinaryCallingFormat()
AWS_QUERYSTRING_AUTH = False

STATIC_URL = '//s3.amazonaws.com/%s/' % AWS_STORAGE_BUCKET_NAME  if AWS_ENABLED else '/static/'
STATICFILES_STORAGE = 'myapp.storage.S3PipelineStorage' if AWS_ENABLED else 'myapp.storage.PipelineCachedStorage'

A few notes about these settings:
  • AWS_ENABLED should only be true in production, so we are not using S3 when working locally
  • AWS_S3_CALLING_FORMAT now defaults to the S3 subdomain bucket URL, which is great for CNAMEs, but Chrome does not like it when you directly download assets from *.s3.amazonaws.com and raises sporadic security errors, so I prefer to keep using the original URL scheme
  • AWS_QUERYSTRING_AUTH is disabled because there are currently too many bugs that make the signature wrong when you use S3BotoStorage and CachedFilesMixin together. Hopefully, that will change soon
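
One thing to keep in mind with a flag like AWS_ENABLED: os.environ.get returns strings, and any non-empty string (including "False" or "0") is truthy in Python. A small helper can make the intent explicit. The name env_flag and the accepted spellings are my own choices, not part of Django or django-storages.

```python
import os

_TRUE_VALUES = frozenset(["1", "true", "yes", "on"])


def env_flag(name, default=False, environ=None):
    """Read a boolean switch from the environment (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default          # variable not set at all
    return raw.strip().lower() in _TRUE_VALUES


# The dicts stand in for os.environ so the behaviour is easy to see
on_heroku = env_flag("AWS_ENABLED", environ={"AWS_ENABLED": "True"})
on_laptop = env_flag("AWS_ENABLED", environ={})
```

With a default of False, a local machine that never sets the variable keeps serving files from /static/.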
Also notice that I changed the STATICFILES_STORAGE to be  'myapp.storage.S3PipelineStorage' on production. This is the S3 equivalent of what we have on local env:

class S3PipelineStorage(PipelineMixin, CachedFilesMixin, S3BotoStorage):
    pass

Linking the static files version to the code version

So now we have different versions of the static files residing side by side on S3 without interfering. The last issue is to make sure each code version is linked to the correct static files version. Since we don't want the resources themselves to be available on the web dyno, we need to keep a separate mapping between file name and the versioned file name (with the hash).
One way to do so is by using a filesystem based cache. When files are collected, the CachedFilesMixin uses a Django Cache backend called 'staticfiles' (or the default if that is not defined) to keep the file names mapping. Using a filesystem based cache we can keep this mapping after the collection and then commit it to the code so it will be available to the web dyno when we push.
To add the filesystem based cache:
 
CACHES = {
    ...,
    'staticfiles' : {
        'BACKEND': 'django.core.cache.backends.filebased.FileBasedCache',
        'LOCATION': os.path.join(PROJECT_ROOT, 'static_cache'),
        'TIMEOUT': 100 * 365 * 24 * 60 * 60, # A hundred years!
        'OPTIONS': {
            'MAX_ENTRIES': 100 * 1000
        }
    },
}

Notice the cache is kept inside the project directory so it will be picked up by git.
The deployment script now contains:

rm -rf static_cache
manage.py collectstatic --noinput
s3cmd sync collected/ s3://bucket -v -P
git add static_cache
git commit static_cache -m "updated static files cache directory"

The cache is deleted in the beginning of the process and afterwards committed to the git repository (we commit just the folder that contains the cache, regardless of the status of the repository).
Again, this does not change anything on production. To do the actual deployment we just push to heroku and we immediately get all the code changes with the staticfiles changes.

Problem solved - let's go eat!

Wednesday, August 22, 2012

Analytical fix for IE rotation origin

The problem

For a recent project I needed to create a client side image editor that allows rotating HTML elements.
The editor outputs an HTML page that contains the elements with the rotation applied on them as CSS rules.

The specifications were as follows:
  1. The editor should run on a modern browser (i.e. - not IE<9)
  2. The result page should run on an old browser (i.e. including IE >= 7)
  3. The result page should contain only HTML and CSS, not JS (since JS might be disabled on the viewer).
In modern browsers, CSS transforms can rotate elements and you can also specify the rotation origin. However, in IE<9, they are not available so we are left with the Microsoft specific "DXImageTransform".

The problem with DXImageTransform is that the transform origin is not controllable by CSS and is calculated differently from the default in other browsers (which is to rotate around the center of the element).
So for IE, we need to have specific CSS rules that will fix the position of the rotated image to match the result we would get on a modern browser.

How IE positions rotated items

Disclaimer: I understood the following from reading stuff online and playing with the browser. I am not certain this is accurate or independent of other properties.

For the following demo, I assume you are using a modern browser. Consider two nested divs, each 100x100 pixels in size (a red div inside a blue div):


Since they are completely on top of each other, you can only see the inner red div. Now let's rotate the inner div (using our current modern browser css transform).



As you 'should' see, the inner div is rotated by 40 degrees while remaining concentric with the container div.
For that reason, it "sticks out" from all four sides of the container.
In IE, however, we would have gotten something like this:



This is because IE moves the element just enough so that it never "sticks out" past the top or left side, only the bottom and the right.

Existing solutions

There are several plugins (jQuery or not) that handle cross browser rotations. Most of them work by first doing a feature detection and then applying the css accordingly.
They handle the IE problem by measuring the element size before and after the rotation. After the rotation, the offsetWidth and offsetHeight properties of the rotated element give the size of the bounding rectangle of the element and not of the element itself. The difference between the center of the bounding rectangle and the element itself is the dislocation we need to account for - problem solved.

However... I couldn't use this solution since it requires the fix to be calculated in JS on an older browser. Since my editor runs on a modern browser, it can't measure the size difference, and since I am not allowed to use JS on the viewer, I can't apply the rotation programmatically there either.

My solution

For my case, I have to calculate all the fixes in advance, so instead of measuring the bounding rectangle size in the browser, I am calculating it analytically using simple trigonometry.
I will not elaborate on the calculation here unless someone explicitly wants me to, but here are the results:
// These variables are all you need to calculate the fix
var h = 80; // height in pixels
var w = 20; // width in pixels
var deg = 30; // rotation degrees
////////////////////////////////////////////////////////

// Calculate the radian and wrap around PI/2
var rad = deg * Math.PI / 180;
rad %= 2 * Math.PI;
if (rad < 0) rad += 2 * Math.PI;
rad %= Math.PI;
if (rad > Math.PI / 2) rad = Math.PI - rad;

// Precalculate cos and sin
// Note: "sin" holds sin(-rad), i.e. -sin(rad), so it is negative for 0 < rad <= PI/2
var cos = Math.cos(rad);
var sin = Math.sin(-rad);

var top_fix = (h - h * cos + w * sin)/2;
var left_fix = (w - w * cos + h * sin)/2;
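
For readers who do want the calculation, here is a sketch of one way to arrive at these expressions. It rests on the behaviour described earlier (IE keeps the top-left corner of the rotated element's bounding box where the unrotated element's top-left corner was), so the same disclaimer applies. The symbols W', H' and θ are my notation: θ is the rotation angle folded into [0, π/2] as in the code, and W', H' are the width and height of the bounding box of the rotated element. In a modern browser the bounding box stays centered on the element, so its corner is offset by half the size difference.

```latex
\begin{aligned}
W' &= w\cos\theta + h\sin\theta, \qquad H' = h\cos\theta + w\sin\theta \\
\text{left\_fix} &= \frac{w - W'}{2} = \frac{w - w\cos\theta - h\sin\theta}{2} \\
\text{top\_fix} &= \frac{h - H'}{2} = \frac{h - h\cos\theta - w\sin\theta}{2}
\end{aligned}
```

The code stores sin(-rad) = -sin θ in the variable sin, which is why the sine terms appear there with a plus sign.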

Now that I have the numbers, I still need to apply them without using JS.
To do that, I add another div in between the container and the inner div with IE conditional css (for details on this great method see here).
We style the div with a relative position and the left and top fixes according to the calculation.
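
Since the fix is pure arithmetic, it can also be precomputed outside the browser. Below is a Python port of the calculation as a sketch; the function names and the idea of rendering the wrapper's inline style as a string are my assumptions, not part of the original editor.

```python
import math


def ie_rotation_fix(width, height, degrees):
    """Return (top_fix, left_fix) in pixels for an element rotated in IE<9."""
    # Fold the angle into [0, pi/2]; Python's % is non-negative for a positive modulus
    rad = math.radians(degrees) % math.pi
    if rad > math.pi / 2:
        rad = math.pi - rad
    cos, sin = math.cos(rad), math.sin(rad)
    top_fix = (height - height * cos - width * sin) / 2.0
    left_fix = (width - width * cos - height * sin) / 2.0
    return top_fix, left_fix


def ie_wrapper_style(width, height, degrees):
    """Inline style for the extra div that only IE<9 should pick up."""
    top_fix, left_fix = ie_rotation_fix(width, height, degrees)
    return "position: relative; top: %.2fpx; left: %.2fpx;" % (top_fix, left_fix)


top_fix, left_fix = ie_rotation_fix(20, 80, 30)  # same inputs as the JS snippet
style = ie_wrapper_style(20, 80, 30)
```

For a 20x80 element rotated by 30 degrees this gives a small positive top fix and a negative left fix of roughly 18.7 pixels.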

Demo

To see this at work, check out this fiddle. It should show a blue rotated div in a modern browser and a red rotated div in IE<9, both in the same position. Notice the calculation of the IE fix is analytical and does not use any DOM methods, so it can run on any browser a priori with the same results.

Problem solved - let's go eat!

Sunday, June 17, 2012

PIL, text and RTL

Recently, I needed to create an image using the Python Imaging Library (PIL). Among other things, the image contains text in Hebrew that should fit in a predefined column width. This presented a few issues: first, I needed to handle an RTL language in PIL, and second, I needed to break the line of text into several lines so they don't exceed the defined column width.
This is a trivial task in HTML but not so much in a graphic library like PIL.

Drawing RTL Text

In PIL, text is drawn on an image via the text method of the ImageDraw instance. However, drawing RTL text this way renders the characters in reverse order. You might be tempted to just reverse the string, but that would give wrong results when other symbols appear in the text, like numbers and parentheses.
Fortunately, there is a great python library for that called pybidi that reorders the text symbols according to the language they are written in.
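
To see why plain reversal isn't enough, here is a tiny demonstration that needs neither PIL nor pybidi. The sample string (the Hebrew word for "price" followed by a number) is my own example.

```python
# "price 123" in Hebrew, written in logical (typing) order
logical = u"\u05de\u05d7\u05d9\u05e8 123"

# Naive fix: reverse the whole string so the renderer shows Hebrew right-to-left
naive = logical[::-1]

# The Hebrew letters now come out in the right visual order,
# but the digits were reversed too, so the number reads "321"
digits_after_reversal = naive[:3]
```

A bidi-aware reordering keeps the digit run in its original order while flipping the Hebrew run, which is exactly the part plain slicing cannot do.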

The other issue we are facing is that the text is positioned relatively to the top left corner. In an RTL scenario you would probably want to align the text to the top right corner of the block, so that different lines would start on the same horizontal position.
PIL doesn't have a straightforward method for doing this, so we need to hack it a bit. The idea is to calculate where the text block would end if it started where we want it, and then feed that ending position to the text method. Luckily, ImageDraw has a method (called textsize) to calculate the width (in pixels) that a rendered line of text will require without actually rendering it. So the final method is:
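from PIL import ImageDraw
from bidi.algorithm import get_display

class ImageDrawRTL(ImageDraw.ImageDraw):
    def text_rtl(self, pos, text, font, fill):
        # Reorder the characters for display (handles mixed RTL/LTR runs)
        text = get_display(text)
        width, height = self.textsize(text, font = font)
        # pos is the top RIGHT corner, so shift left by the rendered width
        self.text((pos[0]-width, pos[1]), text,  font = font, fill = fill)
        return width, height

# Replace the class on the module so new ImageDraw instances get text_rtl
ImageDraw.ImageDraw = ImageDrawRTL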
Notice I am using a subclass of the original ImageDraw and I replace the class on the module since the instance is created by an internal factory method.

Breaking a text line

The other issue I had to work out is that the line of text I am given might be rendered outside the given column width. So I need to break the single line of text into multiple lines and render them one after the other.
To calculate the optimal breaks in the text, I used a recursive binary search, mainly because it's the coolest way I could think of. There is probably a faster way of doing this by trying to estimate the break position and looking for a space near it. In any case, this is what I came up with:
import math
import re

def text_break(self, text, font, column_width, space_index = None, space_indexes = None, start_space_index = 0, end_space_index = None):
    if space_indexes is None: # Do this once to save some time
        space_indexes = [m.start() for m in re.finditer(' ', text)] + [len(text)]
    if space_index is None:
        space_index = len(space_indexes) - 1
    if end_space_index is None:
        end_space_index = len(space_indexes)
    index = space_indexes[space_index]
    width, _ = self.textsize(text[:index], font = font)
    if width <= column_width:
        if index == len(text): # Entire text can be inserted in a single column
            return [text]
        # Check if the next word can also be inserted
        width, _ = self.textsize(text[:space_indexes[space_index + 1]], font = font)
        if width <= column_width: # Next word can also be inserted in this column so this is not the breaking point
            return self.text_break(text, font, column_width, space_index = int(math.ceil(float(space_index + end_space_index)/2)), space_indexes = space_indexes, start_space_index = space_index, end_space_index = end_space_index)
        else: # This is the breaking point, so break the text
            return [text[:index]] + self.text_break(text[index+1:], font, column_width)
    else: # Text is too big
        return self.text_break(text, font, column_width, space_index = int(math.floor(float((start_space_index + space_index)/2))), space_indexes = space_indexes, start_space_index = start_space_index, end_space_index = space_index)
Again, you need to inject this method to the ImageDraw instance (by inheritance or otherwise).
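
For comparison, here is a simpler greedy variant, which is a different technique from the binary search above: it adds one word at a time and starts a new line when the next word no longer fits. It is only a sketch. The measure parameter is a stand-in for something like lambda s: draw.textsize(s, font = font)[0], and defaults to len so the example runs without PIL.

```python
def greedy_break(text, column_width, measure=len):
    """Break text into lines no wider than column_width (greedy, word by word)."""
    lines, current = [], ""
    for word in text.split(" "):
        candidate = word if not current else current + " " + word
        # A word that is too wide on its own still gets a line to itself
        if not current or measure(candidate) <= column_width:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


lines = greedy_break("aa bb cc dd", 5)
```

The greedy version measures the text once per word, so it calls the measuring function more often than the binary search on long lines, but it cannot loop forever on a word that is wider than the column.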

So this is it for today, hope you've found it useful...

Saturday, June 16, 2012

My first post

As the title implies, this is my first blog post ever, and as such it should be quite boring.

I am a physicist by academic background and a programmer by actual work, so I guess you can say I'm an engineer. I spend most of my time between server (Python) and web development.

I am writing this blog to keep my legacy alive for future generations. Well, not really, future generations will probably look back at this and think "wow, they were all idiots!", and in many ways, they will be correct. But seriously, from time to time I encounter some very weird ungooglable problems that I need to work out, and it would be cool if someone else would find my solutions helpful in any way.

Well enough chitchat then, let's get to work!