Fastest way to compress (i.e. archive into a single file) millions of small files on a powerful cluster to speed up transfer of files


IMPORTANT NOTE: Compression is NOT the goal, archiving/taping (packing all of the files into a single archive) is the goal.

I want to back up a single directory, which contains hundreds of sub-directories and millions of small files (< 800 KB). When using rsync to copy these files from one machine to another remote machine, I have noticed that the transfer speed is painfully low, only around 1 MB/s, whereas when I copy huge files (e.g. 500 GB) the transfer rate is around 120 MB/s. So the network connection is not the problem.

In this situation, moving only 200 GB of such small files took me about 40 hours. So I am thinking of packing the entire directory containing these files into a single archive, transferring the archive to the remote machine, and then unpacking it there. I am not expecting this approach to reduce 40 hours to 5 hours, but I suspect it would definitely take less than 40 hours.

I have access to a cluster with 14 CPU cores (56 threads — Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz) and 128 GB RAM. Therefore, CPU/RAM power is not a problem.

But what is the fastest and most efficient way to create a single archive out of so many files? I currently only know about these approaches:

However, I do not know which of these is faster, or how the parameters should be tuned for maximum speed (for example, is it better to use all CPU cores with 7zip, or just one?).

N.B. Archive size and compression ratio do NOT matter at all. I am NOT trying to save space. I am only trying to pack so many files into a single archive so that the transfer runs at 120 MB/s instead of 1 MB/s.

RELATED: How to make 7-Zip faster

Best Answer

Use tar, but forgo the gzipping part. The whole point of tar is to convert files into a single stream (it stands for tape archive). Depending on your process, you could write the stream to a disk and copy that; more efficiently, you could pipe it (for example via SSH) to the other machine and unpack it on the far end at the same time.

Because the process is IO rather than CPU intensive, parallelizing it won't help much, if at all. You will reduce the transfer size (if files are not exactly divisible by block size), and you will save a lot by not having the back-and-forth of negotiating each file individually.

To create an uncompressed tar file:

tar -cf archive.tar /path/to/files

To stream across the network:

tar -cf - /path/to/files | ssh user@dest.domain 'cd /dest/dir && tar -xf -'
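You can sanity-check the pipeline locally before involving SSH, by piping one tar straight into another. This is only a sketch with made-up /tmp paths standing in for the real source directory and the remote side:

```shell
# Stand-ins for the real source directory and the remote destination
src=/tmp/stream_demo_src
dst=/tmp/stream_demo_dst
mkdir -p "$src" "$dst"
printf 'payload\n' > "$src/f.txt"

# tar -cf - writes the archive stream to stdout; the second tar reads
# it from stdin, exactly as it would on the far end of an ssh pipe
tar -cf - -C /tmp stream_demo_src | tar -xf - -C "$dst"
```

If the files reappear under $dst, the same command with `ssh user@host 'tar -xf -'` in the middle will behave identically over the network.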

Note: If writing an intermediate file to a hard drive as per example 1, it may actually be faster to gzip the archive if the data compresses reasonably well, because that reduces the amount written to disk, which is the slow part of the process.
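As a sketch of that intermediate-file variant (again with made-up /tmp paths in place of your real directories): tar's `-z` flag gzips the stream as it is written, so fewer bytes hit the disk, and `-xz` reverses both steps on the destination:

```shell
# Stand-ins for the real source directory and the unpack location
srcdir=/tmp/tar_demo_src
outdir=/tmp/tar_demo_out
mkdir -p "$srcdir" "$outdir"
printf 'sample\n' > "$srcdir/a.txt"

# -z compresses with gzip on the fly while the archive is written
tar -czf /tmp/archive.tar.gz -C /tmp tar_demo_src

# After copying archive.tar.gz across, unpack it on the destination
tar -xzf /tmp/archive.tar.gz -C "$outdir"
```

On a many-core machine, gzip's single thread can become the bottleneck; a parallel drop-in replacement such as pigz (if installed) avoids that, but plain `-z` is the portable default.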