In a previous post I described the impact of the CAP theorem on data systems and the distinction between ACID and BASE semantics. In this post I will focus on the use of BASE semantics in Windows Azure Storage.
PDC 2008 CTP
At the 2008 PDC, Microsoft announced CTP versions of Windows Azure including two storage products: Windows Azure Table and SQL Data Services. (SDS). It was observed immediately that they were very similar – both used BASE semantics. This caused confusion and consternation. Confusion as to the distinction of their functionality. Consternation that there was no offering with ACID semantics – there was no SQL Server in the cloud.
In Spring 2009, Microsoft formally withdrew SDS and announced it would be replaced by what is now termed SQL Azure Database – a restricted-functionality version of SQL Server hosted in an Azure datacenter. SQL Azure Database is expected to go to CTP in August 2009.
Consequently, Windows Azure will have two storage products, Windows Azure Table using BASE semantics and SQL Azure Database using ACID semantics. As I pointed out in the earlier post they are complementary in that it is likely that any large-scale service could use both Windows Azure Storage AND SQL Azure Database. The former for scale-out storage and the latter for its transactional and search functionality.
As described in the previous post, the CAP theorem states it is impossible to simultaneously satisfy consistency, availability and partition tolerance in any distributed data system. Systems satisfying ACID semantics are focused on providing consistency at the expense of availability. Systems satisfying BASE semantics are focused on providing availability at the expense of consistency.
In an article, Eventually Consistent, on what makes a data system eventually consistent Werner Vogels describes a data system with: N nodes storing data replicas; W replicas that must each write a change for it to be successful; and R replicas which respond to each read request. He writes:
If W+R > N, then the write set and the read set always overlap and one can guarantee strong consistency. In the primary-backup RDBMS scenario, which implements synchronous replication, N=2, W=2, and R=1. No matter from which replica the client reads, it will always get a consistent answer. In asynchronous replication with reading from the backup enabled, N=2, W=1, and R=1. In this case R+W=N, and consistency cannot be guaranteed.
To understand the behavior of a replicated data system it is not sufficient to know the number of data replicas it is also necessary to know the manner in which data is written to and read from these replicas. In short, it is necessary to know the values of N, W and R.
Partitions in Windows Azure Table
In Windows Azure Table, tables contain entities which are collections of name-value pairs. An entity has a compound primary key which is the combination of PartitionKey and RowKey. The PartitionKey specifies entity locality in that entities with the same PartitionKey are stored on the same storage node – they are in the same partition. Entities with distinct PartitionKey values may be stored on different storage nodes and presumably be subject to network partitioning in the CAP sense.
The David Chappell whitepaper, Introducing Windows Azure, indicates that N is 3:
Regardless of how data is stored—in blobs, tables, or queues—all information held in Windows Azure storage is replicated three times. This replication allows fault tolerance, since losing a copy isn’t fatal.
The Windows Azure Table whitepaper indicates that Windows Azure Table supports strong consistency for single entity transactions suggesting that W + R > 3. This leaves me puzzled as to the possibility of and effect of a network partition occurring between the replicated data of a single partition.
The Windows Azure Storage May CTP introduced the concept of entity group transactions described as:
For the entities stored within the same table and same partition (i.e., they have the same partition key value), the application can atomically perform a transaction involving those entities.
Entity Group Transactions implement ACID semantics for a group of related operations on a single partition. Presumably, implementing entity group transactions across partitions would run into the partition tolerance problem implied by the CAP theorem. While snapshot isolation is supported within a single partition it is explicitly not supported across partitions.