I wanted to write a few lines of guidance on this topic, since it is a recurring question when configuring storage, not only in the cloud but also on bare-metal servers.
What is throughput on a volume?
Throughput is the amount of data transferred to or from a storage device per unit of time (typically per second).
The throughput consumed on a volume is calculated using this formula:
IOPS (I/O operations per second) x BS (block size) = Throughput
As an example, if we are writing at 1200 ops/sec and each write is around 125 KB, the total throughput is about 150 MB/sec.
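The arithmetic above can be sketched in a few lines of Python. The function name is made up for illustration, and the sketch assumes decimal units (1 MB = 1000 KB); with binary units (KiB/MiB) the result for the same inputs would be slightly lower.

```python
def throughput_mb_per_s(iops: float, block_size_kb: float) -> float:
    """Throughput = IOPS x block size, converted from KB/s to MB/s.

    Assumes decimal units (1 MB = 1000 KB).
    """
    return iops * block_size_kb / 1000.0


# The example from the text: 1200 ops/sec at roughly 125 KB per write.
print(throughput_mb_per_s(1200, 125))  # 150.0
```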
Why is this important?
This is important because we have to be aware of the maximum total throughput of a specific volume versus the maximum total throughput of the instance.
If your instance type (or server) can drive 1250MiB/s (e.g. m4.16xl) and your EBS volume tops out at 500MiB/s (e.g. ST1), you will not only hit a bottleneck when writing to that volume, but throttling may also occur (e.g. with EBS on cloud services).
How do I find the maximum throughput for EC2 instances and EBS volumes?
Here is the documentation on maximum instance throughput for every EC2 instance type: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-optimized.html
And here is the documentation on maximum EBS volume throughput: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html
How do I solve the problem?
If the instance/server has more throughput capability than the volume, add volumes or split the storage capacity across more volumes, so that the load/throughput is distributed across them.
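As a rough sizing aid, the idea can be sketched as follows. This assumes the load is spread evenly across the volumes (for example through striping), which the text does not spell out; the function names are hypothetical, and both limits must be given in the same unit.

```python
import math


def volumes_needed(instance_limit: float, volume_limit: float) -> int:
    """Smallest number of volumes whose combined limit reaches the instance limit."""
    return math.ceil(instance_limit / volume_limit)


def aggregate_throughput(instance_limit: float, volume_limit: float, volume_count: int) -> float:
    """Combined throughput is capped by whichever side saturates first.

    Assumes I/O is distributed evenly across all volumes.
    """
    return min(instance_limit, volume_limit * volume_count)


# Illustrative figures: an instance limited to 500 and volumes limited to 160 each.
print(volumes_needed(500, 160))           # 4
print(aggregate_throughput(500, 160, 1))  # 160
print(aggregate_throughput(500, 160, 3))  # 480
print(aggregate_throughput(500, 160, 4))  # 500
```

With a single volume the volume is the bottleneck; once enough volumes are added, the instance limit becomes the ceiling and further volumes no longer help throughput.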
As an example, here are some metrics with different volume configurations:
1 x 3000GB – 9000IOPS volume:
3 x 1000GB – 3000IOPS volume:
Look at some of the metrics. Both configurations use the same instance type (m4.10xl – 500MB/s throughput) and the same volume type (GP2 – 160MB/s throughput), and run the same job:
- Using 1 volume, write/read latency is around 20-25 ms/op, which is high compared to the 3 x 1000GB configuration.
- Using 1 volume, the average queue length is 25. The queue depth is the number of pending I/O requests from your application to your volume. For maximum consistency, a Provisioned IOPS volume must maintain an average queue depth (rounded to the nearest whole number) of one for every 500 provisioned IOPS in a minute. In this scenario, 9000/500 = 18, so a queue length of 18 or higher is needed to reach 9000 IOPS.
- Burst Balance is 100%, which is OK, but if this balance drops to zero (which will happen if the volume's capacity keeps being exceeded), all requests will be throttled and you may start seeing I/O errors.
- In both scenarios, the average write size is fairly large (around 125KiB/op), which will typically cause the volume to hit its throughput limit before it hits its IOPS limit.
- Using 1 volume, the write rate is around 1200 ops/sec. With a write size of around 125KB, that consumes about 150MB/sec (IOPS x BS = Throughput).
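Two of the figures in the list above can be checked with a short sketch. The function names are hypothetical, the 500-IOPS-per-queue-slot ratio is the one quoted in the text, and the throughput cap is treated as 160 MiB/s with 125 KiB writes purely for illustration.

```python
def iops_at_throughput_cap(throughput_limit_mib_s: float, io_size_kib: float) -> float:
    """IOPS at which a volume saturates its throughput limit for a given I/O size."""
    return throughput_limit_mib_s * 1024 / io_size_kib


def target_queue_depth(provisioned_iops: float, iops_per_queue_slot: float = 500) -> int:
    """Average queue depth needed to drive the provisioned IOPS.

    Uses the rule of thumb quoted in the text: one outstanding I/O
    per 500 provisioned IOPS, rounded to the nearest whole number.
    """
    return round(provisioned_iops / iops_per_queue_slot)


# With ~125 KiB writes, a 160 MiB/s cap is reached at about 1310 IOPS,
# far below the 9000 IOPS the single large volume is provisioned for.
print(iops_at_throughput_cap(160, 125))  # 1310.72
print(target_queue_depth(9000))          # 18
```

Under these assumptions, the observed rate of about 1200 ops/sec is already close to the throughput ceiling, which is consistent with the volume being throughput-bound rather than IOPS-bound.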