Tuesday, 16 June 2015

How Djangae's App Engine Datastore Connector Works

One of my many responsibilities is being the BDFL of the Djangae project. Djangae is a Django application which provides a compatibility layer to allow Django to integrate well with Google App Engine.

The most complex part of the Djangae project is the "datastore connector". This is a Django ORM backend which allows much (but importantly, not all!) of the Django ORM to function using App Engine's non-relational Datastore as the database.

In this post I'll summarize the key tricks we use to make the Django ORM work on the Datastore.

The goal of the connector is to make as much of the Django ORM function as possible, without causing more queries than the user expects. To do this we use a few clever tricks. The main ones are:
  • Query normalization
  • Pre-computed fields
  • Unique constraint markers
  • Context and memcache caching

Query Normalization

This is where the magic happens! The App Engine Datastore allows queries with OR filters that have up to 30 branches; under the hood such queries are called "MultiQueries". We use this feature to support all kinds of queries that you wouldn't believe to be possible on the Datastore. We do this by transforming the Django WHERE tree into what's known as Disjunctive Normal Form (DNF): an equivalent filter which is a series of AND branches under a single OR node. Provided that the normalized version of the query has no more than 30 AND branches, we can build an App Engine query with it and it should work (provided you don't have more than one inequality).

The code for this can be found here.
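
As a rough illustration of the idea (this is not Djangae's actual code), here is a minimal sketch of DNF normalization in plain Python. The Filter, And and Or types, the to_dnf and normalize functions and the MAX_BRANCHES constant are all hypothetical names invented for this example, and negation is left out entirely to keep it short.

```python
from collections import namedtuple
from itertools import product

# Hypothetical stand-ins for the nodes of a WHERE tree (illustration only).
Filter = namedtuple("Filter", "field op value")
And = namedtuple("And", "children")
Or = namedtuple("Or", "children")

MAX_BRANCHES = 30  # the MultiQuery branch limit mentioned above


def to_dnf(node):
    """Return a list of AND branches; each branch is a list of Filter leaves."""
    if isinstance(node, Or):
        # An OR just concatenates the branches of its children.
        branches = []
        for child in node.children:
            branches.extend(to_dnf(child))
        return branches
    if isinstance(node, And):
        # An AND distributes over its children: take one branch from each
        # child and merge them, for every possible combination.
        child_branches = [to_dnf(child) for child in node.children]
        return [
            [leaf for branch in combo for leaf in branch]
            for combo in product(*child_branches)
        ]
    # A leaf filter is a single branch containing a single filter.
    return [[node]]


def normalize(tree):
    """Normalize the tree and refuse queries that need too many branches."""
    branches = to_dnf(tree)
    if len(branches) > MAX_BRANCHES:
        raise ValueError("query needs %d branches" % len(branches))
    return branches


# (a = 1 OR a = 2) AND (b = 3 OR b = 4) becomes four AND branches.
example = And([
    Or([Filter("a", "=", 1), Filter("a", "=", 2)]),
    Or([Filter("b", "=", 3), Filter("b", "=", 4)]),
])
example_branches = normalize(example)
```

Note how quickly the branches multiply: every extra OR group under an AND multiplies the branch count, which is why the 30 branch limit is easy to hit with a few IN-style filters.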

Pre-computed Fields

The Datastore supports the following operations: =, <, >, <=, >=, !=. What it doesn't support is iexact, istartswith, endswith etc., which is a bit of an issue when you are trying to support the Django ORM.

Fortunately, there is a solution! The Datastore supports multi-valued fields. These "list properties" can contain several values per column (up to 500), and if any of them match your filter, the entity will be returned. We can use this functionality to create pre-computed fields when we save the entity, which we can then query on depending on the filter. For example, if I want to filter on __icontains, the Datastore connector will add an additional hidden property which is made up of all the combinations of consecutive letters (substrings) in the field's value. When I subsequently perform an icontains query on that field, the connector will manipulate the query to instead do an equality filter on the generated hidden property. Because the hidden property is multi-valued, if our lookup value is one of the generated values the entity is returned. In Djangae we do similar "special indexing" for many other Django queries.
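
To make the icontains trick concrete, here is a small sketch of how such a hidden property could be generated and queried. The function names are made up for this example, and lowercasing both sides is my assumption about how the case-insensitive part would be handled; Djangae's real indexers may well differ in the details.

```python
def icontains_index_values(value):
    """Return every distinct lowercased substring of value, sorted."""
    lowered = value.lower()
    return sorted({
        lowered[start:end]
        for start in range(len(lowered))
        for end in range(start + 1, len(lowered) + 1)
    })


def matches_icontains(indexed_values, lookup):
    """An icontains lookup becomes an equality check against the list."""
    return lookup.lower() in indexed_values


# Values that would be stored in the hidden list property for "Bob".
bob_values = icontains_index_values("Bob")
# bob_values == ['b', 'bo', 'bob', 'o', 'ob']
```

A string of length n has up to n(n+1)/2 substrings, so the generated list grows quickly. This kind of index is only practical on short fields.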

This all happens almost transparently in Djangae, because we have an auto-generated registry of such indexes called djangaeidx.yaml. I say almost transparently because the first time you perform an iexact query or similar, that lookup and field will be added to the registry, but you'll need to resave your entities for them to get the new generated hidden properties. The same is true after deployment: keep an eye on your djangaeidx.yaml, and if it changes you'll need to resave all your entities so that they will be returned by the query.

Unique Constraint Markers

The Datastore doesn't support unique constraints (which is a *huge* flaw IMO). The only thing guaranteed to be unique is the key of an entity. Django expects unique constraints to work: if you mark a field as unique=True, you'd expect that attempting to create a duplicate will cause an error.

With Djangae, we abuse the fact that keys are unique to support unique constraints. When you save an instance with a unique or unique_together, we create an entity which represents the field(s) and value(s). This "marker" is linked to the instance which holds that unique combination. This is all done transactionally using independent transactions. If you attempt to create a duplicate instance with the same value, a key existence check is performed on the unique marker table. If the marker exists, and the instance it's linked to exists, you'll get an IntegrityError.
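
The marker logic can be sketched with a couple of dictionaries standing in for the Datastore. Everything here (FakeDatastore, marker_key, save and this local IntegrityError class) is a hypothetical stand-in for illustration; the real connector does this with Datastore entities inside transactions, and also has to clean up old markers when a value changes, which this sketch skips.

```python
class IntegrityError(Exception):
    """Local stand-in for Django's IntegrityError."""


class FakeDatastore(object):
    """Two dicts playing the role of the entity and marker tables."""

    def __init__(self):
        self.entities = {}  # (kind, id) -> field data
        self.markers = {}   # marker key -> (kind, id) of the owning entity


def marker_key(kind, data, unique_fields):
    """Build a key that encodes the unique field(s) and value(s)."""
    return (kind,) + tuple((name, data[name]) for name in sorted(unique_fields))


def save(store, kind, entity_id, data, unique_fields):
    entity_key = (kind, entity_id)
    key = marker_key(kind, data, unique_fields)
    owner = store.markers.get(key)
    # Only complain if the marker points at a *different* entity which
    # still exists; a stale marker is simply taken over.
    if owner is not None and owner != entity_key and owner in store.entities:
        raise IntegrityError("duplicate value for %r" % (unique_fields,))
    store.markers[key] = entity_key
    store.entities[entity_key] = data


store = FakeDatastore()
save(store, "User", 1, {"email": "a@example.com"}, ["email"])
try:
    save(store, "User", 2, {"email": "a@example.com"}, ["email"])
    duplicate_rejected = False
except IntegrityError:
    duplicate_rejected = True
```

Because the marker's key is derived from the values themselves, checking for a duplicate is a key lookup rather than a query, which is what makes the check reliable.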

There are some caveats:
  • It hits write performance. There is 1 extra write per unique combination whenever you save an instance.
  • Saving entities in any way except through the ORM can break the unique markers. You could end up with markers which no longer match an entity (which is handled in the constraint logic), or even worse, you can have an entity without a counterpart marker.
You can disable this functionality on a per-model level if you so wish, or globally. But if you disable this because you don't want uniqueness to work, you should probably just not use unique and unique_together, otherwise you're just lulling yourself into a false sense of security. Djangae provides a tool called uniquetool which allows generation and repair of markers if they get out of sync.

Context and Memcache Caching

The last issue that we have to solve in the Datastore is that of "eventual consistency". On the Datastore, if you perform a query which isn't filtered on a key in some way (either directly, or via an ancestor) the results you'll get back will likely be stale. You might get old data, get entities back that have been deleted, miss entities that were recently created, or get entities back which no longer match your query. This "eventual consistency" is what powers the Datastore's ability to scale, but it's a right pain in the behind to code against.

To help defend against "eventual consistency", Djangae has a two-level caching layer in its connector. Firstly, there is a thread-local "context" cache, then there is a second-level "memcache" cache. There are many rules about when it is permitted to read the caches, or update the caches, most of them surrounding transactions. But the main thing to know is that if you are querying on a unique combination or a key, Djangae will look in the caches before performing a query. The exception is when you are inside a transaction, where the rules get complicated.
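
The lookup order can be sketched like this. TwoLevelCache is a made-up class for illustration, a plain dict stands in for memcache, and all of the transaction rules are ignored, so treat it as a picture of the read path only rather than how Djangae implements it.

```python
import threading


class TwoLevelCache(object):
    """Thread-local "context" cache in front of a shared cache."""

    def __init__(self, shared_cache):
        self._local = threading.local()
        self._shared = shared_cache  # stand-in for memcache

    def _context(self):
        # Each thread lazily gets its own private dict.
        if not hasattr(self._local, "data"):
            self._local.data = {}
        return self._local.data

    def get(self, key, fetch):
        """Look in the context cache, then the shared cache, then fetch."""
        context = self._context()
        if key in context:
            return context[key]
        if key in self._shared:
            value = self._shared[key]
        else:
            value = fetch(key)  # the real Datastore lookup would go here
            self._shared[key] = value
        context[key] = value
        return value

    def set(self, key, value):
        """Record a freshly saved value in both levels."""
        self._context()[key] = value
        self._shared[key] = value


fetched = []


def fetch_from_datastore(key):
    fetched.append(key)
    return {"key": key}


shared = {}
cache = TwoLevelCache(shared)
first = cache.get("user-1", fetch_from_datastore)
second = cache.get("user-1", fetch_from_datastore)  # served from the context cache
```

Writing saved values into both levels is what lets a read straight after a save see the new data, even if a Datastore query would still return the old version.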

Again, caveats:
  • Caching is only a protection against consistency issues; it does not prevent them. The real solution is to structure your models in a way that you always query against a set of keys where possible (see RelatedSetField and RelatedListField in Djangae).
  • You must use Djangae's atomic and non_atomic decorators. Do not use the built-in db.run_in_transaction or you're in for a world of pain. Djangae's caching is heavily tied into its transaction decorators.

Summary

So, there you have it, a quick overview of the magical parts of the Djangae Datastore connector!
