Brief Problem Description
When backing up a Hyper-V VM, the true sequential read capacity of the source storage is not fully utilized, even though no other bottleneck (CPU, network, target disk, etc.) exists in the backup system. This reduces job rates far below what the physical hardware is capable of. Other applications (for example, a simple file copy) achieve sequential read performance far greater than what Backup Exec reads from the source disk.
The issue also manifests as inexplicable reductions in job rate when any other IO is present on the storage array.
Problem Cause
The Backup Exec remote agent only ever sends one outstanding read request to the source storage. Keeping such a short queue depth effectively tells the source storage that it is keeping up with the workload required by the sequential read stream. The adaptive read-ahead of the source storage never scales up its read-ahead size because the remote agent's sequential read stream never applies any higher IO pressure. This leads to much lower sequential read throughput than the source array is capable of.
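To make the cause concrete, here is a minimal illustrative sketch (Python, purely for explanation; this is not how the agent is actually written) of a queue-depth-1 sequential read: each request is issued only after the previous one completes, so the array never sees a backlog that would trigger its adaptive read-ahead to scale up.

    # Illustrative only: queue-depth-1 sequential reading, analogous to the
    # agent's current pattern. The 1 MiB chunk size is an assumption.
    CHUNK = 1024 * 1024

    def read_qd1(path):
        total = 0
        with open(path, "rb", buffering=0) as f:   # no Python-level buffering
            while True:
                data = f.read(CHUNK)   # issue one read, then block until it completes
                if not data:
                    return total
                total += len(data)
                # Only after this read returns does the next request go out, so
                # the storage never has more than one request queued.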
Workaround
Nearly all arrays use adaptive read-ahead, which gives the best results in most situations, provided that applications requiring higher sequential read performance apply enough IO pressure to cause the adaptive read-ahead to scale to match the desired workload. However, when the source application does not apply appropriate IO pressure, a partially effective workaround is to manually specify a large read-ahead amount for the volume in question before the backup and return it to adaptive afterward.
In my own testing, this workaround provided a 50% increase in job rate for two concurrent backups versus the adaptive setting, which never scales read-ahead because it never sees IO pressure. The difference is the elimination of periods where the source agent needlessly waits on IO from disk while CPU stays relatively low. With the workaround, CPU usage of the single core used by the dedup thread stays consistently high, since it is not needlessly waiting on storage.
However, this workaround has several problems:
This is a “dumb” setting. While it increases read-ahead for the desired backup sequential read stream, it also increases read-ahead for any other sequential read streams on the same volume.
This reduces the amount of read-ahead time and buffer space that can be afforded to the backup read stream, since it is shared with other sequential IO, even very small IO. If there is any other sequential IO on the volume, a manual read-ahead setting will never reach the performance of the adaptive mechanism scaling read-ahead for the backup read stream alone.
It also means that even small sequential non-backup reads (say, two successive 64K reads) will invoke a much longer read-ahead and cause more disks (probably all disks on the array) to seek in order to satisfy the read-ahead setting. This amounts to more time spent seeking that could have been spent reading.
A large static read-ahead setting is in no way appropriate for normal production use, so it requires scripts that run before and after the backup to turn it on and off (a rough sketch of such a script is shown below).
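For completeness, below is a rough sketch of the kind of pre/post script the workaround requires. The "arraycli" command, its arguments, and the volume name are placeholders rather than a real vendor CLI; substitute whatever management commands your particular array actually provides, and attach the script to the job's pre- and post-command settings.

    # Placeholder pre/post script: "arraycli", its options, and "Volume01" are
    # hypothetical stand-ins for your array vendor's real management CLI.
    import subprocess
    import sys

    VOLUME = "Volume01"   # hypothetical source volume

    def set_read_ahead(mode):
        # mode is "max" before the backup job and "adaptive" after it
        subprocess.run(["arraycli", "volume", "set", VOLUME,
                        "--read-ahead", mode], check=True)

    if __name__ == "__main__":
        # Invoke as "python readahead.py pre" before the job and
        # "python readahead.py post" after it.
        set_read_ahead("max" if sys.argv[1] == "pre" else "adaptive")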
True Solution
A real solution to this product issue is to keep more outstanding IOs against the source storage. In my specific situation, other applications that keep a queue depth of 4 see dramatically better sequential read performance.
Most customers would probably want a registry setting to control this.
Call the setting, placed on the source host's remote agent, "ReadQueueDepth" or "ReadPressure" or similar, and allow it to be set to 1, 2, 4, or 8:
1 = Least aggressive, lowest backup performance, least impact on other IO. Other IO on the volume or source array is likely to reduce the backup job rate.
2 = Default (a default of 1 for a high-performance sequential read application is pretty silly).
4 = Aggressive, high backup performance.
8 = Very aggressive, highest backup performance. This puts high IO pressure on the source storage to cause adaptive read-ahead algorithms to provide the best rate. It may have a significant impact on other IO happening on the source storage array.
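To illustrate what such a setting would do, here is a rough Python sketch of the idea (my own assumptions about chunk size and threading, not how the remote agent is implemented): it keeps queue_depth sequential read requests in flight against the source file instead of one. A queue_depth of 1 reproduces today's behavior, while 4 or 8 applies enough pressure for the array's adaptive read-ahead to scale up.

    # Illustrative sketch only: keep `queue_depth` reads in flight against one
    # file. Chunk size, threading approach, and the example path are assumptions.
    import os
    import threading
    from concurrent.futures import ThreadPoolExecutor

    CHUNK = 4 * 1024 * 1024   # assumed per-request size

    def read_with_queue_depth(path, queue_depth=4):
        size = os.path.getsize(path)
        offsets = iter(range(0, size, CHUNK))   # chunk offsets, handed out in order
        lock = threading.Lock()

        def worker():
            done = 0
            # Each worker opens its own handle so its reads can be outstanding
            # at the same time as the other workers' reads.
            with open(path, "rb", buffering=0) as f:
                while True:
                    with lock:
                        offset = next(offsets, None)   # claim the next chunk of the stream
                    if offset is None:
                        return done
                    f.seek(offset)
                    done += len(f.read(CHUNK))

        with ThreadPoolExecutor(max_workers=queue_depth) as pool:
            futures = [pool.submit(worker) for _ in range(queue_depth)]
            return sum(f.result() for f in futures)

    # queue_depth=1 mimics today's behavior; 4 or 8 supplies the IO pressure that
    # lets adaptive read-ahead scale. The path below is purely hypothetical.
    # read_with_queue_depth(r"D:\snapshots\vm.vhdx", queue_depth=8)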
In my case, I'd probably set it to the highest setting, which for my client-side dedup Hyper-V backup would cause me to hit a single-CPU-core bottleneck at around 300 MB/sec. Then, by running multiple jobs from multiple volumes, I can use another core and get more aggregate throughput; I don't see any reason why I couldn't hit 600 MB/sec nearly continuously with concurrency (two jobs at roughly 300 MB/sec each). To put it simply, this one change would provide roughly four times the backup throughput on the same physical hardware versus the default queue depth of 1, with which I get maybe 150 MB/sec on average, with wide variations. If there happens to be a lot of other IO on the source array during my backup window, this setting would be even more beneficial.
While at first glance it may appear that this would only benefit high-bandwidth storage situations (such as mine, with the source SAS-attached and the network 10 GbE), I believe it would benefit lower-bandwidth storage situations as well, since it will tend to reduce IO waits at the source in all situations. There is no reason to keep the queue depth so short when the agent already knows the next gigabytes of requests it will issue against some large VHD file snapshot.