bash – Calculating Number of Files for Batch Processing in Bash

Tags: bash, linux, ulimit

For example, I have a directory with multiple files created this way:

touch files/{1..10231}_file.txt

I want to move them into a new directory, new_files_dir.

The simplest way to do this is:

for filename in files/*; do
    mv "${filename}" -t "new_files_dir"
done

This script runs for about 10 seconds on my computer, which is slow. The slowness comes from executing the mv command once for every file.

###Edit start###

I have realized that, in my example, the simplest way would be just

mv files/* -t new_files_dir

or, if that fails with "Argument list too long":

printf '%s\0' files/* | xargs -0 mv -t new_files_dir

but the case above is only part of the task. The whole task is described in this question: Moving large number of files into directories based on file names in linux.
So the files must be moved into corresponding subdirectories, where the correspondence is based on a number in the filename. This is the reason for the for loop and the other oddities in my code snippets.
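
For illustration, a rough sketch of that kind of loop (the grouping into batches of 1000 by the leading number and the target directory names are my own assumptions here, not the exact scheme from the linked question):

for filename in files/*_file.txt; do
    name=${filename##*/}          # strip the directory part
    num=${name%_file.txt}         # strip the common suffix, leaving the number
    dir="new_files_dir/$(( (num - 1) / 1000 + 1 ))"   # pick a subdirectory based on that number
    mkdir -p "$dir"
    mv "${filename}" -t "$dir"
done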

###Edit end###

It is possible to speed up this process by passing a batch of files to the mv command instead of a single file, like this:

batch_num=1000

# Count the files in the directory
shopt -s nullglob
file_list=(files/*)
file_num=${#file_list[@]}

# Common suffix shared by every filename
suffix='_file.txt'

for((from = 1, to = batch_num; from <= file_num; from += batch_num, to += batch_num)); do
    if ((to > file_num)); then
        to="$file_num"
    fi  

    # Generate the filenames with `seq` and pass them to `xargs`
    seq -f "files/%.f${suffix}" "$from" "$to" |
    xargs -n "${batch_num}" mv -t "new_files_dir"
done

In this case the script runs for 0.2 seconds, so performance has increased roughly 50-fold.

But there is a problem: at any moment the program can fail with "Argument list too long", because I can't guarantee that the combined length of a batch of filenames is less than the maximum allowable length.

My idea is to calculate the batch_num:

batch_num = "max allowable length" / "longest filename length"

and then use this batch_num in xargs.
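
A rough sketch of that idea in bash (the per-argument overhead of 1 NUL byte plus an 8-byte pointer is my assumption, and the "ARG_MAX minus environment" estimate is exactly the part I am not sure about):

arg_max=$(getconf ARG_MAX)      # overall limit for execve()
env_size=$(env | wc -c)         # rough size of the environment

# Length of the longest filename that will be passed
longest=0
for f in files/*; do
    (( ${#f} > longest )) && longest=${#f}
done

# Assume each argument costs its length + 1 (NUL) + 8 (argv[] pointer) bytes
batch_num=$(( (arg_max - env_size) / (longest + 1 + 8) ))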

Thus, the question: how can the maximum allowable length be calculated?


Here is what I have tried so far:

  1. The overall limit can be found this way:

     $ getconf ARG_MAX
     2097152
    
  2. The environment variables contribute to the argument size too, so they should probably be subtracted from ARG_MAX:

     $ env | wc -c
     3403
    
  3. Wrote a function to determine the maximum number of equal-length filenames that can be passed, by trying different counts until the right value is found (binary search is used).

     function find_max_file_number {
         # Binary search for the largest number of copies of $1 that can be
         # passed to an external command without hitting E2BIG
         right=2000000
         left=1
         name=$1
         while ((left < right)); do
             mid=$(((left + right) / 2))

             # Try to pass $mid copies of the name to an external /bin/true
             if /bin/true $(yes "$name" | head -n "$mid") 2>/dev/null; then
                 left=$((mid + 1))
             else
                 right=$((mid - 1))
             fi
         done
         echo "Number of ${#name} byte(s) filenames:" $((mid - 1))
     }
    
     find_max_file_number A
     find_max_file_number AA
     find_max_file_number AAA
    

    Output:

     Number of 1 byte(s) filenames: 209232
     Number of 2 byte(s) filenames: 190006
     Number of 3 byte(s) filenames: 174248
    

    But I can't understand the logic/relation behind these results yet (a rough model that comes close is sketched after this list).

  4. Tried values from this answer for the calculation, but they didn't fit.

  5. Wrote a C program to calculate the total size of the passed arguments. The result of this program is close, but some unaccounted-for bytes remain:

     $ ./program {1..91442}_file.txt
    
     arg strings size: 1360534
     number of pointers to strings 91443
    
     argv size:  1360534 + 91443 * 8 = 2092078
     envp size:  3935
    
     Overall (argv_size + env_size + sizeof(argc)):  2092078 + 3935 + 4 = 2096017
     ARG_MAX: 2097152
    
     ARG_MAX - overall = 1135 # <--- Enough bytes are
                              # left, but no additional
                              # filenames are permitted.
    
     $ ./program {1..91443}_file.txt
     bash: ./program: Argument list too long
    

    program.c

     #include <stdio.h>
     #include <string.h>
     #include <unistd.h>
    
     int main(int argc, char *argv[], char *envp[]) {
         size_t chr_ptr_size = sizeof(argv[0]);
         // The arguments array total size calculation
         size_t arg_strings_size = 0;
         size_t str_len = 0;
         for(int i = 0; i < argc; i++) {
             str_len = strlen(argv[i]) + 1;
             arg_strings_size += str_len;
     //      printf("%zu:\t%s\n\n", str_len, argv[i]);
         }
    
         size_t argv_size = arg_strings_size + argc * chr_ptr_size;
         printf( "arg strings size: %zu\n"
                 "number of pointers to strings %i\n\n"
                 "argv size:\t%zu + %i * %zu = %zu\n",
                  arg_strings_size,
                  argc,
                  arg_strings_size,
                  argc,
                  chr_ptr_size,
                  argv_size
             );
    
     // The environment variables array total size calculation
         size_t env_size = 0;
         for (char **env = envp; *env != 0; env++) {
           char *thisEnv = *env;
           env_size += strlen(thisEnv) + 1 + sizeof(thisEnv);
         }
    
         printf("envp size:\t%zu\n", env_size);
    
         size_t overall = argv_size + env_size + sizeof(argc);
    
         printf( "\nOverall (argv_size + env_size + sizeof(argc)):\t"
                 "%zu + %zu + %zu = %zu\n",
                  argv_size,
                  env_size,
                  sizeof(argc),
                  overall);
         // Find ARG_MAX by system call
         long arg_max = sysconf(_SC_ARG_MAX);
    
         printf("ARG_MAX: %li\n\n", arg_max);
         printf("ARG_MAX - overall = %li\n", arg_max - (long) overall);
    
         return 0;
     }
    

    I have asked a question about the correctness of this program on StackOverflow: The maximum summarized size of argv, envp, argc (command line arguments) is always far from the ARG_MAX limit.
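
For reference, if I assume that each argument costs its length plus 1 byte for the terminating NUL plus 8 bytes for its argv[] pointer, and that the environment size is subtracted from ARG_MAX, I get numbers close to (but not exactly equal to) the measured ones from point 3:

$ echo $(( (2097152 - 3403) / (1 + 1 + 8) ))   # 1-byte names, measured: 209232
209374
$ echo $(( (2097152 - 3403) / (2 + 1 + 8) ))   # 2-byte names, measured: 190006
190340
$ echo $(( (2097152 - 3403) / (3 + 1 + 8) ))   # 3-byte names, measured: 174248
174479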

Best Answer

Just use a shell where mv is or can be made a builtin, and you won't have the problem (which is a limitation of the execve() system call, so it only applies to external commands). It will also matter much less how many times you call mv.

zsh, busybox sh, ksh93 (depending on how it was built) are some of those shells. With zsh:

#! /bin/zsh -

zmodload zsh/files # makes mv and a few other file manipulation commands builtin
batch=1000
files=(files/*(N))

for ((start = 1; start <= $#files; start += batch)) {
  (( end = start + batch - 1))
  mkdir -p ${start}_${end} || exit
  mv -- $files[start,end] ${start}_${end}/ || exit
}

The execve() E2BIG limit applies differently depending on the system (and version thereof) and can depend on things like the stack size limit. It generally takes into account the size of each of the argv[] and envp[] strings (including the terminating NUL character), and often the size of those arrays of pointers (and the terminating NULL pointer) as well, so it depends both on the size and on the number of arguments. Beware that the shell can set some env vars at the last minute as well (like the _ one that some shells set to the path of the command being executed).

It can also depend on the type of executable (ELF, script, binfmt_misc). For instance, for scripts, execve() ends up doing a second execve() with a generally longer arg list (["myscript", "arg", NULL] becomes ["/path/to/interpreter" or "myscript" depending on the system, "-<option>" if any on the shebang, "myscript", "arg"]).

Also beware that some commands end up executing other commands with the same list of args and possibly some extra env vars. For instance, sudo cmd arg runs cmd arg with SUDO_COMMAND=/path/to/cmd arg in its environment (doubling the space required to hold the list of arguments).

You may be able to come up with the right algorithm for your current Linux kernel version, with the current version of your shell and the specific command you want to execute, to maximise the number of arguments you can pass to execve(), but that may no longer be valid for the next version of the kernel/shell/command. It would be better to take the xargs approach and give yourself enough slack to account for all those extra variations, or to use xargs itself.

GNU xargs has a --show-limits option that details how it handles it:

$ getconf ARG_MAX
2097152
$ uname -rs
Linux 5.7.0-3-amd64
$ xargs --show-limits < /dev/null
Your environment variables take up 3456 bytes
POSIX upper limit on argument length (this system): 2091648
POSIX smallest allowable upper limit on argument length (all systems): 4096
Maximum length of command we could actually use: 2088192
Size of command buffer we are actually using: 131072
Maximum parallelism (--max-procs must be no greater): 2147483647

You can see ARG_MAX is 2MiB in my case, xargs thinks it could use up to 2088192, but chooses to limit itself to 128KiB.

You can see that in action as well:

$ yes '""' | xargs -s 230000 | head -1 | wc -c
229995
$ yes '""' | strace -fe execve xargs -s 240000 | head -1 | wc -c
[...]
[pid 25598] execve("/bin/echo", ["echo", "", "", "", ...], 0x7ffe2e742bf8 /* 47 vars */) = -1 E2BIG (Argument list too long)
[pid 25599] execve("/bin/echo", ["echo", "", "", "", ...], 0x7ffe2e742bf8 /* 47 vars */) = 0
[...]
119997

It could not pass 239,995 empty arguments (with a total string size of 239,995 bytes for the NUL delimiters, fitting in that 240,000-byte buffer), so it tried again with half as many. That's a small amount of string data, but you have to consider that the pointer list for those strings is 8 times as big; adding both up gives 239,995 × (1 + 8) = 2,159,955 bytes, which is over the 2MiB ARG_MAX.

When I did the same kind of tests over 6 years ago in that Q&A here, with Linux 3.11, I was getting a different behaviour which had already changed recently at the time, showing that the exercise of coming up with the right algorithm to maximise the number of arguments to pass is a bit pointless.

Here, with an average file path size of 32 bytes and a 128KiB buffer, that's still 4096 filenames passed per mv invocation, and the cost of starting mv is already negligible compared to the cost of renaming/moving all those files.

For a less conservative buffer size (to pass to xargs -s) that should still work for any arg list, at least with past versions of Linux, you could do:

$ (env | wc; getconf ARG_MAX) | awk '
  {env = $1 * 8 + $3; getline; printf "%d\n", ($0 - env) / 9 - 4096}'
228499

Where we compute a high estimate of the space used by the environment (the number of lines in the env output should be at least as big as the number of envp[] pointers we passed to env, and we count 8 bytes for each of those pointers plus their string size, including the NULs, which env replaced with newlines), subtract that from ARG_MAX, divide by 9 to cover the worst-case scenario of a list of empty args, and allow 4KiB of slack.
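
Applied to the original problem, that computed size can then be passed to xargs (a sketch; 228499 is simply whatever the pipeline above printed on this system):

$ printf '%s\0' files/* | xargs -r0 -s 228499 mv -t new_files_dir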

Note that if you limit the stack size to 4MiB or below (with limit stacksize 4M in zsh for instance), that becomes more conservative than GNU xargs's default buffer size (which remains 128KiB in my case and fails to pass a list of empty args properly).

$ limit stacksize 4M
$ (env | wc; getconf ARG_MAX) | awk '
  {env = $1 * 8 + $3; getline; printf "%d\n", ($0 - env) / 9 - 4096}'
111991
$ xargs --show-limits < /dev/null |& grep actually
Maximum length of command we could actually use: 1039698
Size of command buffer we are actually using: 131072
$ yes '""' | xargs  | head -1 | wc -c
65193
$ yes '""' | xargs -s 111991 | head -1 | wc -c
111986