Linux – Create many tar files from a directory with 500000 files

Tags: bash, linux, parallel processing, tar, zsh

I have a directory containing around 500k files, and want to slice them into t tar files.

Put formally, let's call the files file_0, ..., file_{N-1}, where N is around 500k. I want to create t tar files, each containing T = N/t files, where the i-th tar file contains

file_(i*T), ..., file_((i+1)*T - 1),    i in {0, ..., t-1}

What's an efficient way to do this? I was going to write a Python script that just loops over the N files, divides them into t folders, and then calls tar in each, but this feels far from optimal. I have many cores on the server and feel like this should happen in parallel.

Best Answer

You can use Python's concurrent.futures module, which distributes a queue of jobs across a pool of worker threads and keeps consuming the queue until every job has run.

  1. Generate a list of lists of file names, like [[f_0..f_{n-1}], [f_n..f_{2n-1}], ...]
  2. Use a ThreadPoolExecutor to work through this list with as many threads as your computer can run. It can look like this:
import os
import sys
from concurrent.futures import ThreadPoolExecutor
import subprocess
import itertools
import math


def main(p, num_tar_files):
    files = list(split_files_in(p, num_tar_files))
    tar_up = tar_up_fn(p)
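    # Cap the pool at the core count instead of one thread per archive;
    # each worker only waits on its tar subprocess.
    max_workers = max(1, min(len(files), os.cpu_count() or 1))
    with ThreadPoolExecutor(max_workers) as executor: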
        archives = list(executor.map(tar_up, itertools.count(), files))
        print("\n {} archives generated".format(len(archives)))


def split_files_in(p, num_slices):
    files = sorted(os.listdir(p))
    N = len(files)
    T = int(math.ceil(N / num_slices))  # means last .tar might contain <T files
    for i in range(0, N, T):
        yield files[i:i+T]


def tar_up_fn(p):
    def tar_up(i, files):
        # abspath drops a trailing "/" so the directory name is never empty
        dir_name = os.path.basename(os.path.abspath(p))
        tar_file_name = "{}_{:05d}.tar".format(dir_name, i)
        print('Tarring {}'.format(tar_file_name))
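        # Feed the names on stdin (-T -) so a large slice cannot exceed the
        # kernel's argument-length limit; assumes no newlines in file names.
        subprocess.run(["tar", "-cf", tar_file_name, "-T", "-"],
                       input=os.fsencode("\n".join(files)), cwd=p, check=True)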
        return tar_file_name
    return tar_up


if __name__ == '__main__':
    main(sys.argv[1], int(sys.argv[2]))
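
If you would rather not depend on an external tar binary, the same split-and-pool idea can be sketched with the standard-library tarfile module. This is only a sketch under a few assumptions: the helper names chunk, write_archive and make_archives are made up for illustration, the archives go to a separate output directory instead of the source directory, and no compression is applied. tarfile is pure Python, so it is not guaranteed to be faster than running tar in subprocesses; measure on your own data before choosing.

```python
import math
import os
import tarfile
from concurrent.futures import ThreadPoolExecutor


def chunk(names, num_chunks):
    """Split names into at most num_chunks contiguous slices of near-equal size."""
    if not names:
        return []
    size = math.ceil(len(names) / num_chunks)  # last slice may be shorter
    return [names[i:i + size] for i in range(0, len(names), size)]


def write_archive(src_dir, out_dir, index, names):
    """Write one uncompressed tar holding the given files from src_dir."""
    base = os.path.basename(os.path.abspath(src_dir))
    archive_path = os.path.join(out_dir, "{}_{:05d}.tar".format(base, index))
    with tarfile.open(archive_path, "w") as tar:
        for name in names:
            # arcname stores the bare file name instead of the full path
            tar.add(os.path.join(src_dir, name), arcname=name)
    return archive_path


def make_archives(src_dir, out_dir, num_archives, max_workers=None):
    """Create the archives in parallel and return their paths in order."""
    names = sorted(os.listdir(src_dir))
    slices = chunk(names, num_archives)
    os.makedirs(out_dir, exist_ok=True)
    with ThreadPoolExecutor(max_workers or os.cpu_count() or 1) as executor:
        futures = [executor.submit(write_archive, src_dir, out_dir, i, s)
                   for i, s in enumerate(slices)]
        # result() re-raises any exception from a worker
        return [f.result() for f in futures]
```

Calling make_archives("/path/to/data", "/path/to/out", 20) would then write data_00000.tar through data_00019.tar into the output directory (both paths are placeholders).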