For example, I have a directory with multiple files created this way:

    touch files/{1..10231}_file.txt

I want to move them into a new directory, new_files_dir.
The simplest way to do this is:

    for filename in files/*; do
        mv "$filename" -t "new_files_dir"
    done

This script takes 10 seconds on my computer, which is slow. The slowness is caused by the execution of the `mv` command once for every file.
###Edit start###

I have realized that in my example the simplest way would be just

    mv files/* -t new_files_dir

or, if "Argument list too long" occurs:

    printf '%s\0' files/* | xargs -0 mv -t new_files_dir

but the case above is only part of the task. The whole task is in this question: Moving large number of files into directories based on file names in linux. The files must be moved into corresponding subdirectories, where the correspondence is based on a number in the filename. This is the reason for the `for` loop and the other oddities in my code snippets.

###Edit end###
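As a sketch of that task (my assumptions: one subdirectory per thousand files and hypothetical names like `new_files_dir/dir_N`; see the linked question for the real layout), each file is moved into a subdirectory chosen by the number in its name:

```shell
mkdir -p files
touch files/{1..1500}_file.txt

# Move n_file.txt into new_files_dir/dir_<n / 1000>,
# so dir_0 gets 1..999, dir_1 gets 1000..1999, and so on.
for f in files/*_file.txt; do
    n=${f#files/}        # strip the directory prefix
    n=${n%_file.txt}     # strip the common suffix, leaving the number
    d="new_files_dir/dir_$((n / 1000))"
    mkdir -p "$d"
    mv "$f" "$d"
done
```

This per-file loop is exactly the slow shape measured above; the batching discussed next is about speeding up this kind of loop while keeping the per-file directory choice.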
It is possible to speed up this process by passing a batch of files to the `mv` command instead of a single file, like this:
    batch_num=1000

    # Count the files in the directory
    shopt -s nullglob
    file_list=(files/*)
    file_num=${#file_list[@]}

    # The common part of every filename
    suffix='_file.txt'

    for ((from = 1, to = batch_num; from <= file_num; from += batch_num, to += batch_num)); do
        if ((to > file_num)); then
            to="$file_num"
        fi
        # Generate the filenames with `seq` and pass them to `xargs`
        seq -f "files/%.f${suffix}" "$from" "$to" |
            xargs -n "${batch_num}" mv -t "new_files_dir"
    done
In this case the script takes 0.2 seconds, so the performance has increased by a factor of 50. But there is a problem: at any moment the program can fail with "Argument list too long", because I can't guarantee that the combined length of a batch of filenames stays below the maximum allowable length.
My idea is to calculate `batch_num`:

    batch_num = "max allowable length" / "longest filename length"

and then use this `batch_num` with `xargs`.
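A sketch of that calculation, assuming (this is exactly the open question) that the maximum allowable length is roughly `ARG_MAX` minus the environment size, and that each argument costs its length plus a terminating NUL plus an 8-byte `argv[]` pointer:

```shell
mkdir -p files
touch files/{1..10231}_file.txt

arg_max=$(getconf ARG_MAX)
env_size=$(env | wc -c)

# Longest filename among the ones to move
longest=0
for f in files/*; do
    if ((${#f} > longest)); then
        longest=${#f}
    fi
done

# Assumed per-argument cost: string bytes + NUL + pointer (8 bytes on 64-bit)
batch_num=$(( (arg_max - env_size) / (longest + 1 + 8) ))
echo "batch_num: $batch_num"
```

Under that model, `xargs -n "$batch_num"` would be safe; the rest of this question is about how accurate the model really is.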
Thus, the question: how can the max allowable length be calculated?
Here is what I have done so far:

- The overall length can be found this way:

        $ getconf ARG_MAX
        2097152
- The environment variables contribute to the argument size too, so they should probably be subtracted from `ARG_MAX`:

        $ env | wc -c
        3403
- Made a method to determine the max number of equal-length filenames by trying different amounts of files until the right value is found (binary search is used):

        function find_max_file_number {
            right=2000000
            left=1
            name=$1
            while ((left < right)); do
                mid=$(((left + right) / 2))
                if /bin/true $(yes "$name" | head -n "$mid") 2>/dev/null; then
                    left=$((mid + 1))
                else
                    right=$((mid - 1))
                fi
            done
            echo "Number of ${#name} byte(s) filenames:" $((mid - 1))
        }
        find_max_file_number A
        find_max_file_number AA
        find_max_file_number AAA

  Output:

        Number of 1 byte(s) filenames: 209232
        Number of 2 byte(s) filenames: 190006
        Number of 3 byte(s) filenames: 174248
But I can't understand the logic/relation behind these results yet.
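For what it's worth, one model that comes close to these numbers (an assumption at this point, not a verified rule): each argument costs its string length, plus 1 NUL byte, plus 8 bytes for its `argv[]` pointer, giving a count of about `ARG_MAX / (len + 9)`:

```shell
arg_max=2097152   # getconf ARG_MAX from above

# Predicted max count for 1-, 2- and 3-byte names
for len in 1 2 3; do
    echo "$len byte(s): $(( arg_max / (len + 1 + 8) ))"
done
```

This predicts 209715, 190650 and 174762, each a few hundred above the measured values, so something (the environment, kernel bookkeeping, or the search's own imprecision) is still unaccounted for.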
- Tried the values from this answer for the calculation, but they didn't fit.
- Wrote a C program to calculate the total size of the passed arguments. The result of this program is close, but some uncounted bytes remain:

        $ ./program {1..91442}_file.txt
        arg strings size: 1360534
        number of pointers to strings 91443

        argv size:      1360534 + 91443 * 8 = 2092078
        envp size:      3935

        Overall (argv_size + env_size + sizeof(argc)):  2092078 + 3935 + 4 = 2096017
        ARG_MAX: 2097152

        ARG_MAX - overall = 1135  # <--- Enough bytes are left,
                                  #      but no additional filenames
                                  #      are permitted.
        $ ./program {1..91443}_file.txt
        bash: ./program: Argument list too long
program.c

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char *argv[], char *envp[])
    {
        size_t chr_ptr_size = sizeof(argv[0]);

        // The arguments array total size calculation
        size_t arg_strings_size = 0;
        size_t str_len = 0;
        for (int i = 0; i < argc; i++) {
            str_len = strlen(argv[i]) + 1;
            arg_strings_size += str_len;
            // printf("%zu:\t%s\n\n", str_len, argv[i]);
        }

        size_t argv_size = arg_strings_size + argc * chr_ptr_size;
        printf(
            "arg strings size: %zu\n"
            "number of pointers to strings %i\n\n"
            "argv size:\t%zu + %i * %zu = %zu\n",
            arg_strings_size,
            argc,
            arg_strings_size,
            argc,
            chr_ptr_size,
            argv_size
        );

        // The environment variables array total size calculation
        size_t env_size = 0;
        for (char **env = envp; *env != 0; env++) {
            char *thisEnv = *env;
            env_size += strlen(thisEnv) + 1 + sizeof(thisEnv);
        }
        printf("envp size:\t%zu\n", env_size);

        size_t overall = argv_size + env_size + sizeof(argc);
        printf(
            "\nOverall (argv_size + env_size + sizeof(argc)):\t"
            "%zu + %zu + %zu = %zu\n",
            argv_size, env_size, sizeof(argc), overall);

        // Find ARG_MAX by system call
        long arg_max = sysconf(_SC_ARG_MAX);
        printf("ARG_MAX: %li\n\n", arg_max);
        printf("ARG_MAX - overall = %li\n", arg_max - (long) overall);

        return 0;
    }
I have asked a question about the correctness of this program on StackOverflow: The maximum summarized size of argv, envp, argc (command line arguments) is always far from the ARG_MAX limit.
Best Answer
Just use a shell where `mv` is or can be made a builtin, and you won't have the problem (which is a limitation of the `execve()` system call, so it only arises with external commands). It will also not matter as much how many times you call `mv`. `zsh`, `busybox sh` and `ksh93` (depending on how it was built) are some of those shells. With `zsh`, for instance.
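The `zsh` snippet that belongs here was lost in formatting; a minimal sketch of the idea, assuming `zsh` is available (its `zsh/files` module provides `mv` and friends as builtins, so no `execve()` call is made and no E2BIG limit is involved):

```shell
mkdir -p files new_files_dir
touch files/{1..10231}_file.txt

# With mv as a builtin, one call handles any number of arguments.
if command -v zsh >/dev/null 2>&1; then
    zsh -c 'zmodload zsh/files && mv files/* new_files_dir'
else
    mv files/* new_files_dir    # plain external mv as a fallback
fi
```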
The `execve()` E2BIG limit applies differently depending on the system (and version thereof), and can depend on things like the stack size limit. It generally takes into account the size of each of the `argv[]` and `envp[]` strings (including the terminating NUL character), and often the size of those arrays of pointers (and the terminating NULL pointer) as well, so it depends both on the size and on the number of arguments. Beware that the shell can set some env vars at the last minute as well (like the `_` one that some shells set to the path of the command being executed).

It can also depend on the type of executable (ELF, script, binfmt_misc). For instance, for scripts, `execve()` ends up doing a second `execve()` with a generally longer arg list (`["myscript", "arg", NULL]` becomes `["/path/to/interpreter" or "myscript" depending on the system, "-<option>" if any on the shebang, "myscript", "arg"]`).

Also beware that some commands end up executing other commands with the same list of args and possibly some extra env vars. For instance, `sudo cmd arg` runs `cmd arg` with `SUDO_COMMAND=/path/to/cmd arg` in its environment (doubling the space required to hold the list of arguments).

You may be able to come up with the right algorithm for your current Linux kernel version, with the current version of your shell and the specific command you want to execute, to maximise the number of arguments you can pass to `execve()`, but that may no longer be valid with the next version of the kernel/shell/command. A better approach is to take `xargs`'s approach and give yourself enough slack to account for all those extra variations, or simply to use `xargs` itself.

GNU `xargs` has a `--show-limits` option that details how it handles this. You can see `ARG_MAX` is 2MiB in my case; `xargs` thinks it could use up to 2088192 bytes, but chooses to limit itself to 128KiB.

That is just as well: it could not pass 239,995 empty arguments (with a total string size of 239,995 bytes for the NUL delimiters, so fitting in that 240,000-byte buffer), so it tried again with half as many. That's a small amount of data, but you have to consider that the pointer list for those strings is 8 times as big, and if we add those up, we get over 2MiB.
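The `--show-limits` output being discussed was lost in formatting; on a GNU system it looks roughly like the sample in the comments below (the numbers are illustrative and system-dependent):

```shell
# With empty input and -r (--no-run-if-empty), nothing is executed;
# xargs just reports its limits on stderr, along the lines of:
#   Your environment variables take up 3403 bytes
#   POSIX upper limit on argument length (this system): 2091749
#   POSIX smallest allowable upper limit on argument length (all systems): 4096
#   Maximum length of command we could actually use: 2088346
#   Size of command buffer we are actually using: 131072
xargs --show-limits -r < /dev/null
```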
When I did the same kind of tests over 6 years ago in that Q&A here, with Linux 3.11, I was getting a different behaviour, which had already changed recently at the time; that shows the exercise of coming up with the right algorithm to maximise the number of arguments to pass is a bit pointless.
Here, with an average file path size of 32 bytes and a 128KiB buffer, that's still 4096 filenames passed to `mv`, and the cost of starting `mv` is already becoming negligible compared to the cost of renaming/moving all those files.

For a less conservative buffer size (to pass to `xargs -s`) that should still work for any arg list, with past versions of Linux at least, you could compute a high estimate of the space used by the environment (the number of lines in `env` output should be at least as big as the number of `envp[]` pointers we passed to `env`, so count 8 bytes for each of those, plus their size, including the NULs which `env` replaced with newlines), subtract that from `ARG_MAX`, divide by 9 to cover the worst-case scenario of a list of empty args, and keep 4KiB of slack.
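The command itself was lost in formatting; a reconstruction from that description (a sketch, not necessarily the original answer's exact code):

```shell
mkdir -p files new_files_dir
touch files/{1..10231}_file.txt

# High estimate of environment size: its byte count (with NLs standing in
# for NULs) plus 8 bytes per envp[] pointer (one per line of env output).
# wc -lc prints lines first, then bytes.
set -- $(env | wc -lc)
env_lines=$1 env_bytes=$2

# Divide by 9: worst case is all-empty args, each costing 1 string byte
# + 8 pointer bytes; then keep 4KiB of slack.
buf_size=$(( ($(getconf ARG_MAX) - env_bytes - env_lines * 8) / 9 - 4096 ))

printf '%s\0' files/* | xargs -0 -s "$buf_size" mv -t new_files_dir
```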
limit stacksize 4M
inzsh
for instance), that becomes more conservative than GNUxargs
's default buffer size (which remains 128K in my case and fails to pass a list of empty vars properly).