In Part 1 of this deep dive, we discussed in detail how shards work and how records are distributed across them. Building on our hypothetical problem, we will use this post to explore how we can further scale that solution. As a refresher, we had data arriving at a velocity that requires scaling; however, our chosen partition keys have relatively low cardinality. This scenario calls for some careful scaling decisions.
The issue identified above has a couple of solutions. You could simply increase the number of shards with the UpdateShardCount API call. This operation is subject to the requirements outlined in the previous blog post and lets AWS split or merge shards using its own algorithm. AWS will attempt to produce shards whose hash ranges are as even as possible. In our case, if we decide to double our shard count by calling UpdateShardCount with a target of 4, we may end up with shards covering the ranges [0,1], [2,3,4], [5,6], and [7,8,9]. With this division, we immediately notice that shard 1 will receive double the record count of shards 2 and 3, while shard 4 will contain no records at all, as illustrated in the image below.
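As a rough sketch of what this resize looks like in practice, assuming the boto3 SDK and a hypothetical stream named orders-stream, a uniform-scaling resize is a single call:

```python
import boto3

kinesis = boto3.client("kinesis")

# Ask Kinesis to resize the stream to 4 shards. AWS decides how to
# split or merge the existing shards; UNIFORM_SCALING is the only
# scaling type this API currently supports.
# "orders-stream" is a hypothetical stream name used for illustration.
response = kinesis.update_shard_count(
    StreamName="orders-stream",
    TargetShardCount=4,
    ScalingType="UNIFORM_SCALING",
)

print(response["CurrentShardCount"], "->", response["TargetShardCount"])
```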
Another way of handling this issue is to split and merge shards yourself. If you first merge your shards into a single shard (or use UpdateShardCount to scale down to 1 shard), you can then perform the SplitShard API call. This call lets you provide your own starting hash value for the shard in question: it takes one shard and produces two new shards whose hash ranges are [0,n-1] and [n,9], where n is the starting hash value you choose. Looking at our example above, if we merged back to a single shard with the hash range [0,9], we could then split it with a starting hash value of 3. This creates two shards with the ranges [0,2] and [3,9]. Even though new partition keys will not be evenly distributed between these shards, we do know that our existing keys will be. If we introduced a new key such as bam, it would have a 30% chance of being mapped to shard 1 and a 70% chance of being mapped to shard 2.
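To make the manual approach concrete, here is a hedged sketch of merging two shards and then splitting the result at a hash key of our choosing. The stream name and shard IDs are placeholders, and note that real Kinesis hash keys span 0 to 2^128 - 1, so the simplified [0,9] range from our example corresponds to a much larger boundary value in practice:

```python
import boto3

kinesis = boto3.client("kinesis")
stream = "orders-stream"  # hypothetical stream name

# Merge two adjacent shards back into a single shard.
# The shard IDs below are placeholders; look up the real ones
# with DescribeStream or ListShards.
kinesis.merge_shards(
    StreamName=stream,
    ShardToMerge="shardId-000000000000",
    AdjacentShardToMerge="shardId-000000000001",
)

# Wait for the stream to return to ACTIVE before resharding again.
kinesis.get_waiter("stream_exists").wait(StreamName=stream)

# Split the merged shard at a starting hash key of our choosing.
# In the simplified [0,9] example this boundary is 3; against the
# real key space it is the equivalent point 30% of the way through.
new_starting_hash_key = str((2**128 - 1) * 3 // 10)

kinesis.split_shard(
    StreamName=stream,
    ShardToSplit="shardId-000000000002",  # the shard produced by the merge
    NewStartingHashKey=new_starting_hash_key,
)
```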
An interesting thing to consider when performing these operations is that Kinesis does not retroactively move records out of previous shards. If you take the time to review the DescribeStream API call, you will notice two values: ParentShardId and AdjacentParentShardId. These values are how Kinesis tracks the fact that, for some period of time, records may exist across multiple shards. Referring back to our example, if we start with shards with IDs 0 and 1, performing the merge creates a new shard with ID 2 whose parent shard ID is 0 and whose adjacent parent shard ID is 1. Performing the split then creates two more shards with IDs 3 and 4; their parent shard IDs are both 2, and they have no adjacent parent shard ID.
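A quick way to see this lineage for yourself, again assuming a hypothetical stream name, is to list each shard's hash range and parents after a resharding operation:

```python
import boto3

kinesis = boto3.client("kinesis")

# Print each shard's hash range and lineage. Closed (parent) shards
# remain listed, and their records stay readable, until the stream's
# retention period expires.
description = kinesis.describe_stream(StreamName="orders-stream")  # hypothetical name
for shard in description["StreamDescription"]["Shards"]:
    print(
        shard["ShardId"],
        shard["HashKeyRange"]["StartingHashKey"],
        "-",
        shard["HashKeyRange"]["EndingHashKey"],
        "parent:", shard.get("ParentShardId"),
        "adjacent parent:", shard.get("AdjacentParentShardId"),
    )
```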
If you now take the time to read through the restrictions and considerations for scaling in the UpdateShardCount API documentation, they should begin to make sense.
Although we don't have the ability to alter the underlying hash function AWS uses for Kinesis, an alternative solution is to provide our own hash value as an explicit hash key. When posting a record to Kinesis, you have the opportunity to override the generated hash with an explicit hash value. The primary concern with this approach is that you have to maintain that hash value consistently and indefinitely, and make certain that any change is a well-documented event: altering the manual hash value can cause records to be published to a different shard, resulting in a loss of ordering over a short period of time.
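Below is a sketch of overriding the hash for a single record with the ExplicitHashKey parameter of PutRecord. The stream name, partition key, and hash value are all illustrative; in a real producer the key-to-hash mapping would need to be maintained consistently, as discussed above:

```python
import boto3

kinesis = boto3.client("kinesis")

# Publish a record while pinning it to a hash value of our choosing.
# ExplicitHashKey overrides the hash Kinesis would otherwise derive
# from the partition key, so this record lands in whichever shard's
# hash range contains that value.
kinesis.put_record(
    StreamName="orders-stream",                  # hypothetical stream name
    PartitionKey="bam",                          # still stored with the record
    ExplicitHashKey=str((2**128 - 1) // 2),      # illustrative: middle of the key space
    Data=b'{"example": "payload"}',
)
```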
These are not the only options for further scaling your solution. Kinesis has a fairly large feature set, with other advanced features that can help you optimize further. Join us in the next blog post in this series as we dive into features such as Parallelization Factor and Enhanced Fan-Out.
To continue reading this series, check out:
Is on-demand pricing really ‘serverless’ pricing?