Tuesday, 21 October 2014

On Google App Engine, Ancestor Queries are Almost Never What You Need


Recently, the company where I work announced the alpha release of Djangae, a compatibility layer that allows your Django application to run on App Engine and store its data in the App Engine Datastore. One of the things missing from the alpha was support for the Datastore's so-called "ancestor queries".

The App Engine Datastore is a remarkable feat of engineering. It's a non-relational database, which can scale to store mind-boggling amounts of data and deal with crazy high amounts of traffic. Of course, the sacrifice is that it's non-relational - so there are no joins, aggregate queries or the like. And if you want to count things then expect it to take some time!

Behind the scenes, your data is seamlessly replicated and distributed across Google's servers, which makes it extraordinarily reliable and performant.

Google achieves this by dividing your entities into "entity groups". The Datastore allows you to mark entities as being in a group by specifying their Ancestor when you create them. When you do this, the path to the root ancestor forms part of your entity's primary key and each member of the tree is part of the same entity group. Each entity group has its own index, and updates within the entity group are consistent.

If you edit an entity and then query for it using the entity's ancestor, your results are strongly consistent. However, if you query without specifying the entity's ancestor, your results will likely be stale. Your query might return entities that have been deleted, might fail to return new entities, or might return entities with stale data. This is because, when you don't specify an ancestor, your query looks at the global index of all entities, which is not strongly consistent but eventually consistent (updates will lag for a few seconds). Eventual consistency is a pain to work with.

When you perform an ancestor query, the Datastore only looks at the entity group's local index, so you've already eliminated every entity in your datastore except for the ones below the specified ancestor. That is why consistent results are possible.

There are a bunch of drawbacks to using ancestor queries, though:
  • All entities within the same group are limited to an overall total of 1 write per second
  • To look up a descendant by key, you need to know the entire path to the root ancestor
  • Keys just get confusing (does it have a parent? Can I look it up by kind and ID?)
  • Moving descendants around means destroying and recreating them, and transferring any references to their old key
These are some pretty annoying drawbacks, and I often wonder why Google decided to make the entity group part of the key, rather than having another ID property built into each entity (e.g. __group__), which would alleviate some of the issues. But then again, I didn't build the Datastore, and I'm sure there are reasons.
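To make the key-related drawbacks concrete, here is a toy sketch in plain Python. It is not the Datastore API: the tuple-based keys, the `make_key` and `entity_group` helpers, and the `store` dict are all made up for illustration. It only assumes what is described above, namely that the path to the root ancestor is part of the key.

```python
# Toy model only: keys are tuples of (kind, id) pairs, not real Datastore keys.

def make_key(*path):
    """Build a key from a flat path: kind, id, kind, id, ..."""
    if len(path) % 2 != 0:
        raise ValueError("path must be a sequence of kind/id pairs")
    return tuple(zip(path[0::2], path[1::2]))


def entity_group(key):
    """In this toy model the entity group is identified by the root of the path."""
    return key[0]


store = {}  # hypothetical stand-in for the Datastore

author = make_key("Author", 1)
book = make_key("Author", 1, "Book", 7)  # descendant: the ancestor path is part of the key
store[author] = {"name": "Ann"}
store[book] = {"title": "Bananas"}

# Both keys share the root ("Author", 1), so they are in the same entity group.
same_group = entity_group(author) == entity_group(book)

# Kind and ID alone are not enough to find a descendant by key.
found_without_path = store.get(make_key("Book", 7))
found_with_path = store.get(book)

# "Moving" the book to another author changes its key: delete and recreate,
# and anything that stored the old key is now pointing at nothing.
moved = make_key("Author", 2, "Book", 7)
store[moved] = store.pop(book)
```

In this sketch the lookup without the full path simply misses, and the move leaves the old key dangling, which is the same shape of problem the list above describes.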

Anyway, there is an alternative to ancestor queries which works better in nearly all situations: you can maintain a list of child IDs on the parent object. App Engine's list properties allow you to store up to 500 items (e.g. integers) in a single field. These items are indexed. By storing the IDs of related child entities on a logical parent, you gain the following things:
  • Each entity has its own entity group, so suddenly the write rate isn't so bad
  • You can get all child entities consistently, at once by doing a Get rather than a query (you can then filter them in memory)
  • You can query for the parent object by child ID (as list properties are indexed)
  • Your entities can all be looked up with their ID
  • Migrating child entities to a different parent just means updating the list on the parent
Djangae will automatically transform a PK filter into a Datastore Get, which means MyModel.objects.filter(pk__in=parent.child_ids, username="bananas") would do a consistent Get and then return only the results where the username was 'bananas'. You can also use post-save/delete signals etc. to keep the list up to date.
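The pattern can be sketched in plain Python. This is a toy in-memory stand-in, not Djangae or the App Engine API: the `datastore` dict and the `get_multi`, `children_of` and `move_child` helpers are hypothetical names used only to show the shape of the idea (fetch by key, filter in memory, move a child by editing lists).

```python
# Toy in-memory stand-in for the Datastore; not the Djangae or App Engine API.
datastore = {
    ("Child", 1): {"username": "bananas"},
    ("Child", 2): {"username": "apples"},
    ("Child", 3): {"username": "bananas"},
    ("Parent", 10): {"child_ids": [1, 2]},
    ("Parent", 11): {"child_ids": [3]},
}


def get_multi(kind, ids):
    """Fetch entities directly by key; no index is consulted."""
    return [(i, datastore[(kind, i)]) for i in ids if (kind, i) in datastore]


def children_of(parent_id, **filters):
    """Get every child listed on the parent, then filter the results in memory."""
    parent = datastore[("Parent", parent_id)]
    children = get_multi("Child", parent["child_ids"])
    return [
        child_id
        for child_id, child in children
        if all(child.get(field) == value for field, value in filters.items())
    ]


def move_child(child_id, old_parent_id, new_parent_id):
    """Re-parenting only touches the two lists; the child's key never changes."""
    datastore[("Parent", old_parent_id)]["child_ids"].remove(child_id)
    datastore[("Parent", new_parent_id)]["child_ids"].append(child_id)
```

For example, `children_of(10, username="bananas")` fetches children 1 and 2 by key and keeps only the one whose username matches, and `move_child(3, 11, 10)` re-parents child 3 without recreating it.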

This is why Djangae doesn't have ancestor support yet: in nearly every situation I've come across, denormalizing child IDs has been a better solution than building an ancestor tree. We'll get ancestor support into Djangae in time, but don't wait for it. Just structure your data smartly.
