README

StorageEngineForDocumentDB

Temporalize pluggable storage engine for Microsoft's Azure DocumentDB

Source code

Source Repository

Description

Temporalize is a temporally aware data store like Datomic, except that it is built for node.js and uses a MongoDB-like query syntax instead of Datalog.

StorageEngineForDocumentDB is a self-balancing storage engine for DocumentDB written originally for Temporalize but generalized to be useful elsewhere. Like Azure's own DocumentDB partition resolvers, this uses a consistent hashing approach, although this storage engine has a bigger interface than a mere partition resolver. I felt this larger surface area was necessary considering the cohesion between the partitioning scheme and things like supporting reads and writes during balancing.

This storage engine also provides temporal as well as cross-partition transaction and querying support.

I understand that a balancing resolver for node.js is on the Azure roadmap, but with a Fall 2016 estimate. I needed this before then. Also, it's unlikely to have the temporal and cross-partition query and transaction support that I require. So, I built my own.

Note, this has been designed to enable pluggable storage for Temporalize and other consumers so you can substitute in an alternative database (MS SQL Server, MongoDB, etc.) at a later date. Identifier choices enable this. It's called topLevelPartition and secondLevelPartition rather than databasePartition and collectionPartition to avoid DocumentDB-specific terminology. That said, the partitioning system was designed with DocumentDB's capabilities, performance characteristics, and capacities in mind, so I'm uncertain how well it will translate. It may be more efficient to have more top-level partitioning and less second-level partitioning when using another database system. This design allows for that flexibility which is hopefully all you need to make it efficient on another database.

Features

NoSQL (even when backed by an SQL database)
By default, maintains complete history of all entities (think immutable data store like Datomic)
Multi-document ACID transactions even across partitions
MongoDB-like query syntax
Query back in time
Incremental aggregation (think materialized views) incrementally updated by exploiting "the past never changes" design
Built-in User and Tenant (think Organization) entities with simple permission model

Planned:

Override default permission model
Filtered events. Will maintain at each servo so will have a 30 second delay. By storing the _ValidFrom when the eventFilter is added, we should be able to catch up with a flurry of events after the delay but the receiver must be able to handle duplicate events.
Hierarchy support

Inter-servo communication

The storage engine is designed to be run simultaneously on many node.js "servos". A single document in the entire system, {id: 'storage-engine-config'}, is used to coordinate, so no other inter-servo communication is required. The storage-engine-config is stored in @firstTopLevelID/@firstSecondLevelID and looks like this:

{
  id: 'storage-engine-config',  # There is only one of these so we've hard-coded the id.
  mode: 'RUNNING',
  lastValidFrom: '0001-01-01T00:00:00.000Z',
  partitionConfig: {
    topLevelPartitions: {
      'first': {
        id: 'first',
        secondLevelPartitions: {'0': {id: '0'}, '1': {id: '1'}, ...}
      }, ...
    }
    topLevelLookupMap: {'default': 'first', 'key-customer': 'second'}
  }
}

Two levels of partitioning

When instantiated, you tell it which field to use for top-level partitioning and which to use for second-level partitioning. For this DocumentDB-specific implementation, those map to databases and collections, respectively. I think this will work well with up to 40 collections per database, so you have a long way to go before you need to partition across databases. You can have a maximum of 5 databases without requesting a limit increase from Azure. At 10 GB per collection, 5 databases × 40 collections gives us 2 TB of total capacity without limit increases.

Top-level partitioning is done with a lookup map. Second-level partitioning is done using consistent hashing.

The topLevelLookupMap gives you manual control of the top-level partitioning. Once you start to need a second top-level partition, you'll need to manually specify which topLevelPartitionKey values go to which topLevelPartitions. It will use the default top-level partition if the key you provide is not in the map. This should work well in multi-tenant design where top-level partitioning is done at the tenant level. For instance, you can move your biggest customer off of the default 'first' top-level partition to a new top-level partition by specifying a single key in the topLevelLookupMap.
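As a rough illustration of the lookup described above, here is a minimal sketch. The function name resolveTopLevelPartition is hypothetical and is not part of this package's API; only the shape of topLevelLookupMap comes from the config shown earlier.

```javascript
// Hypothetical helper (not the package's API): map a topLevelPartitionKey
// to a top-level partition ID, falling back to 'default' when the key is
// not listed in the map.
function resolveTopLevelPartition(topLevelLookupMap, topLevelPartitionKey) {
  const known = Object.prototype.hasOwnProperty.call(topLevelLookupMap, topLevelPartitionKey);
  return known ? topLevelLookupMap[topLevelPartitionKey] : topLevelLookupMap['default'];
}

// Same map as in the storage-engine-config example.
const exampleLookupMap = { 'default': 'first', 'key-customer': 'second' };
console.log(resolveTopLevelPartition(exampleLookupMap, 'key-customer'));
console.log(resolveTopLevelPartition(exampleLookupMap, 'some-other-tenant'));
```

Moving a large tenant then amounts to adding one entry to the map; every other tenant keeps resolving to the default partition.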

The second-level partition is chosen using consistent hashing from the second-level partitions specified for the chosen top-level partition.
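To make the idea of consistent hashing concrete, here is a minimal hash-ring sketch. It is not the algorithm this package actually uses: the FNV-1a hash, the virtual-node count, and the names fnv1a, buildRing, and resolveSecondLevelPartition are all assumptions chosen for illustration.

```javascript
// 32-bit FNV-1a string hash (an assumed stand-in for whatever hash the engine uses).
function fnv1a(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Place several virtual nodes per partition on the ring, sorted by position.
function buildRing(partitionIDs, virtualNodesPerPartition = 100) {
  const ring = [];
  for (const id of partitionIDs) {
    for (let v = 0; v < virtualNodesPerPartition; v++) {
      ring.push({ position: fnv1a(id + '#' + v), id });
    }
  }
  ring.sort((a, b) => a.position - b.position || (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));
  return ring;
}

// Pick the first virtual node at or after the key's hash, wrapping around the ring.
function resolveSecondLevelPartition(ring, secondLevelPartitionKey) {
  const h = fnv1a(String(secondLevelPartitionKey));
  let lo = 0;
  let hi = ring.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (ring[mid].position < h) lo = mid + 1; else hi = mid;
  }
  return ring[lo === ring.length ? 0 : lo].id;
}
```

The property that matters for balancing is that adding a partition only moves keys onto the new partition; keys never shuffle between the existing ones.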

Temporalize uses _EntityID for second-level partitioning and _TenantID for top-level partitioning so this is the default, but can be overridden during instantiation.

Temporal support

Optional support for Richard Snodgrass-style temporal databases is implemented but only the mono-temporal valid-time form using the default temporalPolicy = 'VALID_TIME'. It can be disabled with temporalPolicy = 'NONE'. Later, we can support bi-temporal operation with temporalPolicy = 'VALID_AND_TRANSACTION_TIME'.

Even when temporalPolicy = 'NONE', the _ValidFrom field is added to each write to maintain transaction support.

Unlike my designs of the Rally LBAPI and the pre-Rally Temporalize prototype, there is never a gap for deletion. If an entity is deleted, a new document version with _Deleted = true is created but with all of the latest field values in place. Note, you can continue to update deleted entities. There is an explicit undelete command that will create a new document version where the _Deleted field is missing. Every query automatically gets {$and: [oldQuery, {_Deleted: {$exists: false}}]} unless the includeDeleted parameter is true. I confirmed that this converts correctly to NOT IS_DEFINED and behaves as expected with _Deleted missing and _Deleted = true.
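The automatic query wrapping described above can be sketched as follows. The function name addDeletedFilter is hypothetical; only the resulting query shape and the includeDeleted flag come from the description.

```javascript
// Hypothetical helper: wrap a caller's query so deleted entities are
// excluded unless includeDeleted is true.
function addDeletedFilter(oldQuery, includeDeleted = false) {
  if (includeDeleted) {
    return oldQuery;
  }
  return { $and: [oldQuery, { _Deleted: { $exists: false } }] };
}

console.log(JSON.stringify(addDeletedFilter({ State: 'Open' })));
```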

A storage-engine-config.lastValidFrom value is updated upon each write. For now, there is a single such value for the entire system. Later, we may include one per top-level partition (or maybe one per partition). Note that successive transactions will have different _ValidFrom values even if they both came in during the same ms. The second one will have 1 ms added to its time. This means that we cannot support more than 1000 write transactions per second steady state, so if that limit is being approached, we'll upgrade to have a different lastValidFrom per top-level partition (or maybe per partition). If memory serves, Rally only had a hundred per second or less steady state.

The design decision to add a ms was made so that paging operations can key off of the _ValidTo time. Each later page will be > (not >=) the _ValidFrom from the last row of the prior page. I had to jump through hoops to eliminate duplicates when doing incremental aggregations with Rally's LBAPI. This design will avoid that hoop jumping. This might also enable the use of the BETWEEN operator if >/>=/</<= operations are less efficient considering range indexes on _ValidFrom and _ValidTo.

Note, I still may use DocumentDB's continuation token inside this storage engine for paging in some circumstances if it's superior to querying with {$gt: lastValidFrom}.
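The "add 1 ms" rule for successive transactions can be sketched like this. The function name nextValidFrom and the choice to take the later of the current time and lastValidFrom + 1 ms are my reading of the rule, not code from this package.

```javascript
// Hypothetical helper: return a _ValidFrom that is strictly greater than
// lastValidFrom. If the clock has not moved past lastValidFrom, add 1 ms.
function nextValidFrom(lastValidFrom, now = new Date()) {
  const lastMs = new Date(lastValidFrom).getTime();
  const nowMs = now.getTime();
  const nextMs = nowMs > lastMs ? nowMs : lastMs + 1;
  return new Date(nextMs).toISOString();
}

// Two writes arriving in the same millisecond get distinct timestamps.
const sameInstant = new Date('2016-01-01T00:00:00.000Z');
const firstWrite = nextValidFrom('2015-12-31T23:59:59.000Z', sameInstant);
const secondWrite = nextValidFrom(firstWrite, sameInstant);
console.log(firstWrite, secondWrite);
```

Because every _ValidFrom is unique and increasing, a page query can use a strict greater-than on the last row's _ValidFrom without skipping or repeating rows.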

The fetching and incrementing of storage-engine-config.lastValidFrom is done in a sproc so it's transactional. Even with parallel updates that interleave the read and write of storage-engine-config document, we'll be OK because only one of those will work (etag). This calls for appropriate retry functionality.
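The etag-based retry mentioned above might look roughly like the sketch below. It uses an in-memory stand-in rather than the DocumentDB client, and the names InMemoryConfigStore, replaceIfMatch, and updateWithRetry are all hypothetical.

```javascript
// In-memory stand-in for the storage-engine-config document (not DocumentDB).
class InMemoryConfigStore {
  constructor(doc) {
    this.doc = { ...doc };
    this.etag = 1;
  }
  read() {
    return { doc: { ...this.doc }, etag: this.etag };
  }
  // The replace only succeeds if the caller's etag still matches.
  replaceIfMatch(doc, etag) {
    if (etag !== this.etag) return false;
    this.doc = { ...doc };
    this.etag += 1;
    return true;
  }
}

// Read, mutate, and write back; on an etag conflict, re-read and try again.
function updateWithRetry(store, mutate, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const { doc, etag } = store.read();
    const updated = mutate(doc);
    if (store.replaceIfMatch(updated, etag)) return updated;
  }
  throw new Error('Gave up after ' + maxAttempts + ' etag conflicts');
}
```

A real implementation would likely add a delay between attempts; that is omitted here to keep the sketch short.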

Queries

When querying, if a topLevelPartitionKey is provided, queries will only go to the top-level partition indicated by the lookup using that key. Similarly, if both a topLevelPartitionKey and a secondLevelPartitionKey are provided, it will restrict the query to the one partition. If neither is provided, then queries go to all partitions. In any case, results are aggregated before being passed back to the caller.
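The routing rules above can be sketched as a function that lists the partitions a query should visit. The name targetPartitions and the pickSecondLevel callback (standing in for the consistent-hashing step) are hypothetical; the partitionConfig shape follows the storage-engine-config example.

```javascript
// Hypothetical sketch: decide which {topLevelID, secondLevelID} pairs to query.
function targetPartitions(partitionConfig, pickSecondLevel, topLevelPartitionKey, secondLevelPartitionKey) {
  const { topLevelPartitions, topLevelLookupMap } = partitionConfig;
  const allIn = (topLevelID) =>
    Object.keys(topLevelPartitions[topLevelID].secondLevelPartitions)
      .map((secondLevelID) => ({ topLevelID, secondLevelID }));

  // No top-level key: fan out to every partition. (Treating a lone
  // secondLevelPartitionKey this way is an assumption of this sketch.)
  if (topLevelPartitionKey === undefined) {
    return Object.keys(topLevelPartitions).flatMap((id) => allIn(id));
  }

  const known = Object.prototype.hasOwnProperty.call(topLevelLookupMap, topLevelPartitionKey);
  const topLevelID = known ? topLevelLookupMap[topLevelPartitionKey] : topLevelLookupMap['default'];

  // Top-level key only: every second-level partition under that top level.
  if (secondLevelPartitionKey === undefined) {
    return allIn(topLevelID);
  }

  // Both keys: exactly one partition.
  const ids = Object.keys(topLevelPartitions[topLevelID].secondLevelPartitions);
  return [{ topLevelID, secondLevelID: pickSecondLevel(ids, secondLevelPartitionKey) }];
}
```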

Query language

The query language closely resembles MongoDB's as implemented by my [sql-from-mongo](https://www.npmjs.com/package/sql-from-mongo) npm package. This makes it possible to implement a compatible storage engine for a different SQL or NoSQL database.

Balancing

Automatic balancing occurs when adding/removing capacity.

Queries during balancing are supported by sending the query to the places specified by both the old config and the new one.
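One way to picture querying under both configs is to take the union of the partitions each config would select. This is only a sketch; the function name partitionsDuringBalancing and the {topLevelID, secondLevelID} target shape are assumptions for illustration.

```javascript
// Hypothetical sketch: combine the targets chosen under the old config and
// the new config, dropping duplicates so each partition is queried once.
function partitionsDuringBalancing(oldTargets, newTargets) {
  const seen = new Set();
  const result = [];
  for (const target of [...oldTargets, ...newTargets]) {
    const key = JSON.stringify([target.topLevelID, target.secondLevelID]);
    if (!seen.has(key)) {
      seen.add(key);
      result.push(target);
    }
  }
  return result;
}
```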

Calling addCollections assumes that the collectionIDs are stringified numbers like the example above. If you want complete control over the collection IDs or if you want to change the database-level partitioning, you must call updatePartitionConfig directly. Be careful to only add to the end of collections and move a limited portion at a time between databases or balancing can take a long time. That said, it's less expensive to add multiple keys to the topLevelLookupMap than it is to provide them one at a time. So trade off between expense and time to balance.

We could add automatic adding of databases ('second', 'third', etc.) at a later date if necessary but then it would have to automatically figure out what tenants to put in which databases. We'll cross that bridge if we come to it.

When starting balancing, there is a built-in delay to allow all servos to get the updated config before the balancing actually starts. However, the new databases and collections are already created before the new partitionConfig reaches any of the other servos, so the worst that will happen is that a few queries go to one or more empty collections in addition to the old collections that still house the data. Note, collections are not automatically deleted once removed from the partitionConfig.

Reads and writes are still enabled during balancing but balancing adds a lot of load so either use the msBetweenPagesDelay parameter to throttle back the balancing or make sure your balancing occurs in off-peak times.

If a read occurs after the document is written to the new collection but before it's deleted from the old one, then it's possible that a document will be duplicated, so we may add a document deduplication step when operating in 'BALANCING' mode. For now, we just live with this small risk of duplication since it will resolve itself as soon as the delete occurs.

Writes are supported during balancing. If temporalPolicy is 'VALID_TIME' then it will first move all docs with the same secondLevelPartitionKey to the new collection and do the update in the new location. If the temporalPolicy is 'NONE' it will just write it to the new location.

Install

npm install --save storage-engine-for-documentdb

Usage

Changelog

0.2.0 - 2015-12-06 - Totally refactored
0.1.0 - 2015-05-11 - Initial release

License

For evaluation and non-commercial use only. Contact me for a commercial license.

temporalize-documentdb

Usage no npm install needed!