Bash Script – Handling YouTube Filenames with Non-English Characters

ascii bash shell-script unicode

I downloaded many YouTube videos and want to process them using bash scripts. However, the filenames contain all kinds of special and non-ASCII characters.

How do I handle this in a bash script?

Let's say I want to create a symbolic link to each such file in a folder:

# Write filenames to filelist.txt in parent folder
ls ./* > ../filelist.txt

# Create sym links for all files in filelist.txt
counter=0
while read video_name; 
  do 
  counter=$((counter+1)); 

  ln -s $video_name  link_name_${counter}.mp4

done < ../filelist.txt

The above script does not work because of the special characters in the filenames.

Here are some example filenames:

पेट (Stomach) कम करने के लिए  5 योग आसन-3G4pEY5njYE.mp4
मन शांत करने के लिए करे वृक्षासन योग _ स्वामी रामदेव-sPytQlaxoIg.mp4
वृक्षासन करने का तरीका और फायदे _ Swami Ramdev-A-2d04ON9hA.mp4

Bonus:
I also would like to have "leading zeros" when printing the counter variable, but that's not crucial.

Best Answer

Variables in the shell can contain any character, except for the NUL character, just like filenames in the filesystem. You should therefore not have any problem storing the filenames in variables, unless you read the mangled output of ls, which will possibly be modified for display purposes (ls output is strictly for looking at).

In the edited question, you additionally read the filenames from a text file with read and the default value of $IFS (which determines aspects of how read works). This would strip flanking whitespace from the lines read from the file, and may interpret the \ character specially if it occurs in the input. Also note that technically, filenames may contain newline characters, so storing them as a newline-delimited list (lines in a text file) limits the types of names that can be used.
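
If the names really have to pass through a text file, the two effects described above can be seen, and avoided, with a small sketch like the following. The temporary file and the sample name are made up for illustration. Setting IFS to an empty value for the read call keeps flanking whitespace, and -r stops read from treating the backslash specially. Names that contain newlines would still be broken by any line-based list.

```shell
# Hypothetical demo file containing one awkward name
# (flanking spaces and a backslash).
list=$(mktemp)
printf '%s\n' '  back\slash name.mp4  ' > "$list"

# Default read: flanking whitespace is trimmed and the backslash is consumed.
read plain < "$list"

# IFS= and -r: the line is stored exactly as written.
IFS= read -r safe < "$list"

printf 'plain: [%s]\n' "$plain"
printf 'safe:  [%s]\n' "$safe"

rm -f -- "$list"
```

In a loop over a list file, the same idea would read as while IFS= read -r video_name; do ...; done < ../filelist.txt.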

You also need to quote the expansion of variables. Your filenames contain spaces, and without quoting the expansion of $video_name, the shell would split the value into multiple words and pass these words (after additionally performing filename globbing with each of them as a pattern) as separate arguments to ln -s.
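
The splitting can be made visible with a tiny helper that only reports how many arguments it received. The function name count_args is a hypothetical helper for this illustration; the filename is one of the examples from the question.

```shell
# Hypothetical helper: print the number of arguments received.
count_args() { echo "$#"; }

video='वृक्षासन करने का तरीका और फायदे _ Swami Ramdev-A-2d04ON9hA.mp4'

unquoted=$(count_args $video)      # value is split on whitespace into several words
quoted=$(count_args "$video")      # value is passed as one single argument

echo "unquoted expansion: $unquoted argument(s)"
echo "quoted expansion:   $quoted argument(s)"
```

With the quoted expansion, ln -s would receive the whole filename as one argument, which is what the script needs.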

Don't use ls to generate the list of the filenames, and quote the expansions of all variables:

counter=0

for video in ./*; do
    counter=$(( counter + 1 ))
    ln -s -- "$video" "link_name_$counter.mp4"
done

Note that the above code creates the symbolic links in the current directory. If you run it a second time, it picks up these links and creates further links to them. It would be better to create the links in a separate directory, to use a more careful filename globbing pattern in the loop so that the links are not matched, or to explicitly test for links in the loop and skip them:

counter=0

for video in ./*; do
    [ -L "$video" ] && continue    # skip symbolic links
    counter=$(( counter + 1 ))
    ln -s -- "$video" "link_name_$counter.mp4"
done
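
The first alternative mentioned above, creating the links in a separate directory, might look like the following sketch. It builds a throwaway sandbox with mktemp so that it can be run safely; the directory names videos and links and the two sample filenames are assumptions for the demo. The link target is written as an absolute path because a relative target such as ./name.mp4 would be resolved relative to the directory holding the link, not the directory holding the video.

```shell
# Throwaway sandbox; all paths and sample names here are hypothetical.
work=$(mktemp -d)
mkdir -- "$work/videos" "$work/links"
touch -- "$work/videos/योग आसन sample.mp4" "$work/videos/second sample.mp4"

cd -- "$work/videos" || exit 1

counter=0

for video in ./*.mp4; do
    [ -e "$video" ] || continue        # the pattern matched nothing
    counter=$(( counter + 1 ))
    # Absolute target, so the link still resolves from inside ../links.
    ln -s -- "$PWD/${video#./}" "../links/link_name_$counter.mp4"
done

ls -l ../links
```

Because the links live outside the directory being globbed, running the loop again does not pick them up as input (ln would merely refuse to overwrite the existing links).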

To get a zero-filled counter with four digits, you may use

printf -v zcounter '%.4d' "$counter"

This prints the re-formatted counter directly to the zcounter variable. You would then use that variable in generating the filename. Or you could just generate the name of the symbolic link in one go in this way:

counter=0

for video in ./*; do
    [ -L "$video" ] && continue    # skip symbolic links
    counter=$(( counter + 1 ))

    printf -v linkname 'link_name_%.4d.mp4' "$counter"
    ln -s -- "$video" "$linkname"
done
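
Note that printf -v is a bash extension. In a shell without it, a command substitution can capture the formatted number instead. The sketch below uses a few arbitrary counter values to show the padding; numbers wider than four digits are not truncated.

```shell
# Portable fallback: capture printf's output instead of using printf -v.
# The counter values are arbitrary samples.
for counter in 7 42 12345; do
    zcounter=$(printf '%.4d' "$counter")
    linkname="link_name_$zcounter.mp4"
    printf '%s\n' "$linkname"
done
```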
