In this article, let’s discuss how partitioning works in Azure Table storage.
Every row / data object in the table is called an entity.
Every entity has two keys – a partition key and a row key. The partition key determines the physical partition the entity belongs to, and the row key uniquely identifies the entity within that partition.
So, every entity in a table has a primary key, which is the combination of the partition key and the row key. The clustered index sorts all records first by partition key and then by row key. The comparison used is lexical, meaning “111” appears before “2”.
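This lexical ordering is easy to see in plain Python (no Azure SDK involved); the key values below are hypothetical examples. Zero-padding numeric keys to a fixed width is a common way to restore numeric order:

```python
numeric_ids = [2, 111, 30]

# Lexical sort, as Azure Table storage does with keys: "111" < "2" < "30"
lexical = sorted(str(i) for i in numeric_ids)

# Zero-padding the keys to a fixed width restores the numeric order.
padded = sorted(str(i).zfill(6) for i in numeric_ids)

print(lexical)  # ['111', '2', '30']
print(padded)   # ['000002', '000030', '000111']
```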
Every entity in the table also has one more property, a timestamp, which exists for traceability purposes. Its value is the date and time at which the entity was last modified.
Understanding the Partitions
Every partition is stored on a partition server, and each partition server can serve one or more partitions. So some servers may hold several partitions while others hold only one.
How does this affect your application?
A partition server rate-limits the number of entities it can serve from any one partition over time; specifically, a single partition has a scalability target of 500 entities per second. When a partition becomes too hot (too active), the partition server throttles the throughput.
So, the scalability of your application depends on how the data is distributed across partition servers. Servers that encounter high traffic for their partitions might not be able to sustain a high throughput, so the storage system load-balances partitions onto other servers to increase throughput.
For optimal load balancing of traffic, you should use more partitions so that Azure Table storage can distribute the partitions to more partition servers.
Entity Group Transactions
Entity Group Transactions let you execute a group of storage operations atomically. You can batch up to 100 storage operations together, and a batch is possible only if all operations share the same partition key. These transactions reduce the number of individual requests submitted to storage and hence improve the throughput of the storage service.
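The two constraints above (one partition key per batch, at most 100 operations) can be mirrored in a small helper. This is a plain-Python sketch, not the Azure SDK; the `batch_entities` name and the sample entities are illustrative assumptions:

```python
from collections import defaultdict

MAX_BATCH_SIZE = 100  # Entity Group Transaction limit per batch

def batch_entities(entities):
    """Group entities by PartitionKey and split each group into batches of
    at most MAX_BATCH_SIZE, since all operations in one Entity Group
    Transaction must share the same partition key."""
    by_partition = defaultdict(list)
    for entity in entities:
        by_partition[entity["PartitionKey"]].append(entity)

    batches = []
    for group in by_partition.values():
        for i in range(0, len(group), MAX_BATCH_SIZE):
            batches.append(group[i:i + MAX_BATCH_SIZE])
    return batches

# 250 entities spread over 2 partitions -> 125 each -> batches of 100 + 25.
entities = [{"PartitionKey": f"p{i % 2}", "RowKey": str(i)} for i in range(250)]
batches = batch_entities(entities)
print(len(batches))  # 4
```

Each resulting batch could then be submitted as one transaction instead of 250 individual requests.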
When you specify a partition key, all records that share that value are placed in a single partition. If, over time, the application accumulates many unique partition key values, the Azure Table service may internally create range partitions, in which a range of partition keys is stored together in one partition.
If your partition key increases with every record (e.g. 1, 2, 3 and so on) or decreases with every new record (e.g. 999, 998, 997 and so on), all new writes land in the first or last partition. The servers holding those partitions carry most of the load, which can affect the performance of your application.
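One common way around this hot-partition pattern is to prefix the key with a stable hash bucket so consecutive writes spread across several partitions. This is a sketch under assumptions: the `NUM_BUCKETS` value and the `partition_key` helper are hypothetical, not part of any Azure API:

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical number of logical partitions to spread writes over

def partition_key(record_id: int) -> str:
    """Derive a stable bucket from the record id so that consecutive ids
    land in different partitions instead of all hitting the first or last
    partition server."""
    digest = hashlib.md5(str(record_id).encode()).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return f"{bucket:02d}"

# Consecutive ids now map to a spread of partition keys:
print([partition_key(i) for i in range(5)])
```

The trade-off is that range queries across ids now have to fan out over all buckets, so this suits write-heavy, point-read workloads.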
Understanding your data
Table storage is not like a relational database. In relational database tables you can define multiple indexes to make data retrieval efficient, and you can create indexes even after the tables exist.
In Table storage, there is only one index, composed of the partition key and row key, and there is no alternative. Hence it becomes immensely important to understand the nature of your data, the quantity of data, and how queries will be performed against those data objects.
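The consequence of the single index can be illustrated with a plain-Python model (the table contents here are made up): a lookup by both keys is a direct hit, while filtering on any other property must examine every entity.

```python
# Model a table as a dict keyed by (PartitionKey, RowKey).
table = {
    ("orders", "001"): {"Customer": "alice", "Total": 40},
    ("orders", "002"): {"Customer": "bob", "Total": 25},
    ("orders", "003"): {"Customer": "alice", "Total": 70},
}

# Point query on (PartitionKey, RowKey): served directly by the index.
entity = table[("orders", "002")]

# Filter on any other property: no secondary index, so every entity is scanned.
alice_orders = [e for e in table.values() if e["Customer"] == "alice"]
print(len(alice_orders))  # 2
```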
Choosing the right partition key
The core of Table storage design is scalability. If you want your application to be scalable, you will have to put a lot of thought into choosing the partition key.
The number of partitions and the number of records in each partition affects the scalability and performance of your application. It can be challenging to determine the PartitionKey based on the partition size, especially if the distribution of values is hard to predict. A good rule of thumb is to use multiple, smaller partitions. Many table partitions make it easier for Azure Table storage to manage the storage nodes the partitions are served from.
Stress testing for verification
It is better to stress-test the application with sample, production-like data in Table storage. This surfaces performance bottlenecks and helps you verify whether the partition key choice still makes sense. Based on the results of stress testing, you may decide to re-design the partition and row keys if required.
In the cloud world, we always plan for failure scenarios. It is important that your application can handle them, to ensure that no important updates to application data go missing.
If you have designed your table storage and you are seeing a lot of server timeout or server busy responses, it is better to implement retry strategies for your storage operations.
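A minimal retry sketch with exponential backoff and jitter is shown below. The `StorageError` class, the `with_retries` helper, and the retryable status codes are illustrative assumptions, not a specific Azure SDK API (the official SDKs ship their own configurable retry policies):

```python
import random
import time

RETRYABLE = {408, 500, 503}  # assumed retryable statuses: timeout, server error, server busy

class StorageError(Exception):
    """Hypothetical error carrying the HTTP status of a failed storage call."""
    def __init__(self, status):
        super().__init__(f"status {status}")
        self.status = status

def with_retries(operation, max_attempts=5, base_delay=0.5):
    """Run `operation`, retrying retryable failures with exponential
    backoff plus random jitter; re-raise anything else."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except StorageError as err:
            if err.status not in RETRYABLE or attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)
```

The jitter keeps many throttled clients from retrying in lockstep and re-hitting the hot partition at the same instant.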
I hope this article helps you learn more about Table storage. Let me know your thoughts.