In this post, we'll take a quick look at rsync ("remote sync") and parallel rsync, a way to increase the efficiency and speed of traditional rsync. At VividCortex, we've found each to be effective and handy at various times.
Rsync is a tool for copying files between volumes on the same server or on separate servers. The advantage of rsync is that instead of copying data blindly, it compares the source and destination directories, so that only the difference between the two is sent through the network (or between volumes).
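To make the "only the difference" behavior concrete, here is a minimal sketch in Python that asks rsync what it would transfer without copying anything. The function names (`rsync_command`, `preview_sync`) and the paths in the usage note are hypothetical, chosen for illustration; the rsync flags used (`--archive`, `--itemize-changes`, `--dry-run`) are standard options. It assumes rsync is installed and on the `PATH`.

```python
import subprocess


def rsync_command(source, destination, dry_run=True):
    """Build an rsync invocation that mirrors `source` into `destination`."""
    # --archive preserves permissions, timestamps, and symlinks and recurses;
    # --itemize-changes prints one line per item rsync decides to update.
    cmd = ["rsync", "--archive", "--itemize-changes"]
    if dry_run:
        # --dry-run reports what would change without writing anything.
        cmd.append("--dry-run")
    cmd += [source, destination]
    return cmd


def preview_sync(source, destination):
    """Return the itemized list of changes rsync would make (hypothetical helper)."""
    result = subprocess.run(
        rsync_command(source, destination, dry_run=True),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()
```

Calling `preview_sync("/data/", "/backup/")` on two directories that are already in sync should return few or no lines, since rsync has nothing new to send.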
Rsync can still be slow in certain situations, however, especially when there's a high volume of data that needs to be copied. In such a case, the process can take hours. Additionally, if the volume's I/O has high latency (such as when cold Amazon EBS volumes are involved), throughput can suffer, because rsync copies only one chunk of data at a time.
As VividCortex engineer Alejandro Martinez summarized for me, our team recently had to re-create a replica for a MySQL server with more than three terabytes of data. We first tried standard rsync to handle the re-creation, but the copy took far too long. We checked CPU, network, and I/O consumption, and none of them was anywhere close to capacity. (We suspected that I/O latency was the primary culprit.)
We've found that in situations like the one described above, we can significantly speed up a very slow copying process by running several rsync processes at a time, each with a subset of the data. This is where parallel rsync comes in: as opposed to standard rsync, parallel rsync isn't limited to copying a single chunk of data at a time and can, instead, copy several pieces side-by-side, hence its name.
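The idea of "several rsync processes, each with a subset of the data" can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code of any particular wrapper: every function name here is hypothetical, the split is done only on the top-level entries of the source directory, and both directories are assumed to be locally accessible. Buckets are balanced by size with a simple largest-first heuristic.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def entry_size(path):
    """Total bytes under `path`, which may be a file or a directory tree."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path):  # skip broken symlinks and special files
                total += os.path.getsize(file_path)
    return total


def split_into_buckets(entries, bucket_count):
    """Assign (name, size) pairs to buckets: largest first, into the lightest bucket."""
    buckets = [[] for _ in range(bucket_count)]
    totals = [0] * bucket_count
    for name, size in sorted(entries, key=lambda entry: entry[1], reverse=True):
        lightest = totals.index(min(totals))
        buckets[lightest].append(name)
        totals[lightest] += size
    return buckets


def bucket_command(source_dir, names, destination_dir):
    """Build one rsync invocation that copies the named entries into destination_dir."""
    sources = [os.path.join(source_dir, name) for name in names]
    return ["rsync", "--archive", *sources, destination_dir]


def parallel_rsync(source_dir, destination_dir, workers=4):
    """Run one rsync process per bucket, all at the same time."""
    entries = [
        (name, entry_size(os.path.join(source_dir, name)))
        for name in os.listdir(source_dir)
    ]
    buckets = [bucket for bucket in split_into_buckets(entries, workers) if bucket]
    os.makedirs(destination_dir, exist_ok=True)
    commands = [bucket_command(source_dir, bucket, destination_dir) for bucket in buckets]
    # Threads are enough here: the real work happens in the rsync child processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))
    return [result.returncode for result in results]
```

A call such as `parallel_rsync("/mnt/source", "/mnt/target", workers=8)` (paths are placeholders) would start up to eight rsync processes. Splitting only at the top level is coarse: if one entry holds most of the data, one process ends up doing most of the work, which is why a purpose-built wrapper that buckets files while scanning is the better choice for real transfers.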
Parallel rsync can be set up using a wrapper such as Multi-Stream-rsync, which is described as follows:
"[Multi-Stream-rsync] will split the transfer in multiple buckets while the source is scanned… The main limitation is it does not handle remote source or target directory, they must be locally accessible (local disk, nfs/cifs/other mountpoint)."
This particular wrapper is simple to install, consisting of a single Python file. The ultimate benefit is maximized usage of available bandwidth, and the requirements are minimal.
As is the case with many system solutions, rsync is an excellent tool to have at hand, but you need a working knowledge of it to get the most value from it, and there are scenarios where parallel rsync will be the better choice. It's important to recognize when you're in such a situation. If you've had issues where rsync simply seemed too slow, try parallel rsync instead. Is this something you've experienced or experimented with yourself? Let us know in the comments below.
If you'd like to hear more from a VividCortex engineer on how to optimize your systems and workflow, watch a recording of Preetam Jinka's webinar "How to Be a Performance-Driven Engineer."