When choosing a storage system, many customers fall back on performance benchmarking to differentiate between solutions. While rival storage solutions do have distinct differentiators, some customers rarely use capabilities beyond the basics, so a quantifiable metric is needed. In this instance, that metric is performance, in the form of IOPS or throughput.
So let's look at the key metrics around performance: IOPS, latency and throughput. I won't go into depth describing what each of these is, since many have already written about them in detail, but here is a quick recap.
IOPS - I/Os Per Second. Effectively the number of IO requests serviced per second. This is usually the headline number that most storage vendors provide.
Latency - The time taken for each corresponding IO to complete. It is quoted less often, but good storage vendors are fairly transparent about it.
Throughput - A measure of data bandwidth in MB/s or GB/s. Throughput is derived from IOPS x the average block size of the payload. Most, if not all, vendors use 4k blocks as their benchmark block size.
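The relationship between the three is simple enough to sketch. Here is a minimal calculation of the throughput formula above; the numbers are purely illustrative, not from any specific array:

```python
def throughput_mb_s(iops: int, block_size_kb: float) -> float:
    """Derive throughput (MB/s) from IOPS and average block size (KB)."""
    return iops * block_size_kb / 1024  # 1024 KB per MB

# e.g. a vendor quoting 100,000 IOPS at a 4k block size
print(throughput_mb_s(100_000, 4))  # -> 390.625 MB/s
```

Note that the same 100,000 IOPS at a 64k block size would be over 6 GB/s, which is why an IOPS number without a block size is not very meaningful.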
With that out of the way, let's look at how these metrics correlate with one another.
With traditional spinning hard drives, a SAN with an average latency of 10-15ms was historically considered a fairly performant system. All-flash systems have changed that expectation, and today we often observe 1-3ms of latency on average. Since every storage subsystem has some limit on how many IOs it can service, latency climbs as the drives saturate.
So instead of asking "how many IOPS can your system perform?", a better question is: what is the maximum IOPS your system can sustain at an average latency of 3ms? A system capable of 1M IOPS is unusable for most workloads if the latency is >30ms.
Many synthetic load generators only generate a fixed block size (4k, 8k, etc.), but in real deployments it is unlikely that all blocks are the same size. In fact, 4k blocks are fairly uncommon outside a few select applications. So how does block size play into this?
Without going too deep into the details, changing the block size typically changes the IOPS number, as in the example below.
If a 4k block size provides 100,000 IOPS, then at the same bandwidth:
2k blocks will provide 200,000 IOPS,
8k blocks will provide 50,000 IOPS, and
16k blocks will provide 25,000 IOPS.
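The pattern above can be sketched as a simple inverse relationship: at a fixed bandwidth ceiling, halving the block size roughly doubles the IOPS. The baseline numbers here are the illustrative ones from the example, not measurements:

```python
# Illustrative baseline from the example above: 100k IOPS at 4k blocks.
BASELINE_IOPS = 100_000
BASELINE_BLOCK_KB = 4

def iops_at_block_size(block_kb: float) -> float:
    """IOPS at a given block size, assuming a fixed bandwidth ceiling."""
    return BASELINE_IOPS * BASELINE_BLOCK_KB / block_kb

for kb in (2, 4, 8, 16):
    print(f"{kb:>2}k blocks -> {iops_at_block_size(kb):>9,.0f} IOPS")
```

In practice the scaling is not perfectly linear (per-IO overheads matter more at small blocks), but it is a reasonable first-order model.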
As you can see, while anyone can quote you best-case IOPS numbers, depending on your workload profile it is highly unlikely you will hit those numbers in production.
Throughput is really the result of the above: block size x IOPS. It is most critical in sequential performance discussions such as backups, where large amounts of data need to be written to storage very quickly. You will often see 128k, 256k or larger blocks used there, which is why backup appliances typically quote throughput numbers, not IOPS.
So how best to test vSAN?
Many customers and partners have configured a single VM with one VMDK and run a load generator like IOMeter against it. Shortly after, I get a phone call asking why their all-flash cluster isn't as performant as they thought it would be. So why is that?
Understanding how data is placed on vSAN is a good starting point for understanding its performance.
In the diagram above, a VM is created with a policy of FTT=1, FTM=1 (mirroring). By default, this policy places 2 mirrored copies of the VMDK on 2 individual spindles/drives. So although there are 4 drives per node and 16 drives in total across the cluster, when load is generated against that particular VM and VMDK, only 2 of the 16 drives service the IOs. Performance can potentially be improved by adding Stripe Width = 2 to the policy, striping each mirrored copy across 2 drives (4 in total).
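As a rough rule of thumb (my own simplification, not a vSAN API), the number of capacity drives actively servicing a single VMDK's IO under mirroring is the number of mirror copies multiplied by the stripe width:

```python
def drives_servicing_io(ftt: int, stripe_width: int) -> int:
    """Approximate capacity drives servicing one VMDK's IO (mirroring)."""
    mirrors = ftt + 1  # FTT=1 mirroring keeps 2 copies of each object
    return mirrors * stripe_width

total_drives = 16  # 4 nodes x 4 drives, as in the example above
print(drives_servicing_io(ftt=1, stripe_width=1), "of", total_drives)  # 2 of 16
print(drives_servicing_io(ftt=1, stripe_width=2), "of", total_drives)  # 4 of 16
```

This is why a single-VM, single-VMDK benchmark exercises only a small fraction of the cluster's drives.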
It is important to understand that vSAN was designed as a scalable distributed storage subsystem, and it does well when there is a high degree of parallelism in the workload. The more VMs and VMDKs you have, the more balanced and performant the subsystem becomes.
So we often suggest loading each server with 4 or more VMs, each with at least 8 VMDKs, so that VMDKs are distributed evenly across all drives in the cluster. One important thing to note: if the expectation is consistent performance across all nodes (if node-1 is doing 10k IOPS, node-2, node-3 and node-4 should do exactly the same), you will quickly find that this is not always the case.
For example, take a 3-node cluster with 10 VMs, each with 1 VMDK. There will be a total of (10 VMDKs x 2 copies) 20 objects to place across the 3 nodes. You will likely observe a placement of 6-7-7, with one node potentially driving less IO than the others.
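The arithmetic behind that example, ignoring witness components for simplicity:

```python
# 10 VMDKs with FTT=1 mirroring means 20 mirror components to place
# across 3 nodes; an even spread is impossible, so one node gets less.
vmdks, copies, nodes = 10, 2, 3
objects = vmdks * copies  # 20
base, extra = divmod(objects, nodes)
placement = sorted(base + (1 if i < extra else 0) for i in range(nodes))
print(placement)  # -> [6, 7, 7]
```

20 does not divide evenly by 3, so no placement algorithm can give identical per-node load here.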
Read Caching & Write Caching
In every disk group within vSAN, dedicated flash drives are used for caching writes, and in hybrid systems, reads as well. This is very similar to traditional storage where you have a small amount of cache that is used to buffer writes and provide quick access to frequently used data.
Let's discuss hybrid systems first. The larger the cache, the longer frequently accessed data resides in cache, which in turn results in better performance. There will be the occasional read request that is not in cache, which we term a "Read Cache Miss", and which then fetches data directly from the spinning-drive tier.
With that understanding, we typically size the caching tier to serve 90% of the workload, leaving the remaining 10% as read cache misses serviced by the spinning drives, which are sized accordingly. If the workload sent to vSAN is 100% random and the caching drives have insufficient capacity to buffer it, the potential for a performance impact rises because data is constantly fetched from the slower spinning tier.
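A common back-of-the-envelope model makes the cost of cache misses concrete. This is my own illustrative sketch, not vSAN's actual sizing methodology, and the latency figures are assumptions in line with the earlier discussion (roughly 1ms flash, 12ms spinning disk):

```python
def effective_read_latency_ms(hit_rate: float,
                              cache_ms: float = 1.0,
                              disk_ms: float = 12.0) -> float:
    """Hit/miss-weighted average read latency across the two tiers."""
    return hit_rate * cache_ms + (1 - hit_rate) * disk_ms

print(effective_read_latency_ms(0.90))  # 90% hits: ~2.1ms average
print(effective_read_latency_ms(0.50))  # undersized cache: ~6.5ms average
```

Even dropping from a 90% to a 50% hit rate roughly triples the average read latency, which is why an undersized cache tier dominates hybrid performance.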
All-flash systems are a little more straightforward, because the caching tier is strictly for writes and the capacity tier strictly for reads.
With any synthetic test (IOmeter, HCIBench, vdbench, HammerDB, etc.), it goes without saying that it is crucial to understand what we want to achieve with it.
For example, some will test 4k blocks at 100% read. This workload gives the highest performance numbers because there is little overhead in performing a read, and reads are sometimes serviced from cache. But is this particular workload realistic and applicable to your environment? Most enterprise environments are heavily skewed towards 70-80% reads and 20-30% writes, and the average block size is rarely 4k.
With regards to testing tools, I often recommend HCIBench because it does a lot of the heavy lifting for me. It creates a bunch of Photon VMs, uses the vdbench engine, provides a nice GUI front end to define my workload, and spits out a summary at the end. Alternatively, there is the ever-popular IOMeter, which requires an OS to be set up, configured with the profile and executed individually. Both work, so it's a matter of preference.
I tend to shy away from consumer drive-testing tools, because documentation on how they work is limited.
Hardware, hardware, hardware
As much as vSAN is coined Software Defined Storage (SDS), there is still a large dependency on hardware. There is a slew of supported drives and controllers on the VCG list, and software can only do so much if the hardware tier has a ceiling. It is also important that the supported firmware and drivers are loaded on that hardware to ensure it works optimally with vSAN.
It may come as a surprise to many, but for every 10 performance discussions I have, 8 are about unsupported hardware, firmware or drivers.
The above is by no means exhaustive, but it serves as an introduction to performance testing on vSAN and storage in general. Happy benchmarking!