Xargs to extract filename

Tags: awk, echo, find, xargs

I would like to find all the .html files in a folder and append [file](./file.html) to another file called index.md. I tried the following command:

ls | awk "/\.html$/" | xargs -0 -I @@ -L 1 sh -c 'echo "[${@@%.*}](./@@)" >> index.md'

But it doesn't substitute @@ inside the command. What am I doing wrong?

Note: file names can contain any valid characters, including spaces.


Clarification:

index.md should contain one line of the form [file](./file.html) for each file, where file is the actual file name in the folder.
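For example, for a folder containing foo.html and bar baz.html (hypothetical names), index.md should end up as:

[foo](./foo.html)
[bar baz](./bar baz.html)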

Best Answer

Just do:

for f in *.html; do printf '%s\n' "[${f%.*}](./$f)"; done > index.md

Use set -o nullglob (zsh, yash) or shopt -s nullglob (bash) for *.html to expand to nothing instead of *.html (or report an error in zsh) when there's no html file. With zsh, you can also use *.html(N) or in ksh93 ~(N)*.html.
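A minimal sketch with bash, for instance:

shopt -s nullglob
for f in *.html; do printf '%s\n' "[${f%.*}](./$f)"; done > index.md
# if there is no .html file, the loop body never runs and index.md stays empty;
# without nullglob, the loop would run once with f set to the literal string *.html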

Or with one printf call with zsh:

files=(*.html)
rootnames=(${files:r})
printf '[%s](./%s)\n' ${rootnames:^files} > index.md
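Here the :r modifier strips the extension from each array element, and ${a:^b} zips two arrays element by element, so printf consumes the pairs two at a time. A small illustration (the file names are hypothetical):

files=(foo.html 'bar baz.html')
rootnames=(${files:r})              # -> foo  'bar baz'
print -rl -- ${rootnames:^files}    # -> foo / foo.html / bar baz / bar baz.html, one per line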

Note that, depending on which markdown syntax you're using, you may have to HTML-encode the title part and URI-encode the URI part if the file names contain some problematic characters. Not doing so could even end up introducing a form of XSS vulnerability depending on context. With ksh93, you can do it with:

for file in *.html; do
  # ${ cmd; } is ksh93's form of command substitution, run without a subshell
  title=${ printf %H "${file%.*}"; }
  title=${title//$'\n'/"<br/>"}
  uri=${ printf '%#H' "$file"; }
  uri=${uri//$'\n'/%0A}
  printf '%s\n' "[$title]($uri)"
done > index.md

Where %H¹ does the HTML encoding and %#H the URI encoding, but we still need to address newline characters separately.

Or with perl:

perl -MURI::Encode=uri_encode -MHTML::Entities -CLSA -le '
  for (<*.html>) {
     $uri = uri_encode("./$_");
     s/\.html\z//;
     $_ = encode_entities $_;
     s:\n:<br/>:g;
     print "[$_]($uri)"
  }'

This uses <br/> for newline characters. You may want to use ␤ instead, or more generally decide on some alternative representation for non-printable characters.

There are a few things wrong in your code:

  • parsing the output of ls
  • using a $ that is meant to be literal inside double quotes
  • using awk for something grep can do (not wrong per se, just overkill)
  • using xargs -0 when the input is not NUL-delimited
  • -I conflicts with -L 1: -L 1 runs one command per line of input with each word of the line passed as a separate argument, while -I @@ runs one command per line of input with the whole line (minus trailing blanks, and with quotes still processed) substituted for @@
  • embedding the @@ replacement string inside the code argument of sh, which is a command injection vulnerability (see the sketch after this list)
  • in sh, the var in ${var%.*} must be a variable name; it doesn't work on arbitrary text such as @@
  • using echo for arbitrary data.
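To see why embedding the replacement string in the sh code is dangerous, consider a file whose name contains shell syntax. A minimal sketch of the injection (the file name here is hypothetical):

printf '%s\n' '$(date).html' |
  xargs -I @@ sh -c 'echo "[@@](./@@)"'
# sh is handed the code:  echo "[$(date).html](./$(date).html)"
# so the $(date) command substitution is executed by sh; with a malicious
# file name, arbitrary commands would run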

If you wanted to use xargs -0, you'd need something like:

printf '%s\0' * | grep -z '\.html$' | xargs -r0 sh -c '
  for file do
    printf "%s\n" "[${file%.*}](./$file)"
  done' sh > index.md

  • replacing ls with printf '%s\0' * to get NUL-delimited output
  • replacing awk with grep -z (a GNU extension) to process that NUL-delimited output
  • using xargs -r0 (-r and -0 are GNU extensions) without any -n/-L/-I, because since we're spawning a sh anyway, we might as well have it process as many files as possible
  • having xargs pass the words as extra arguments to sh (where they become the positional parameters inside the inline code), rather than embedding them inside the code argument
  • which means we can easily store them in variables (here with for file do, which loops over the positional parameters by default) and use the ${param%pattern} parameter expansion operator (see the sketch after this list)
  • using printf instead of echo.
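A minimal illustration of how the arguments after the inline code become positional parameters (the file names here are hypothetical):

sh -c 'for file do printf "%s\n" "[${file%.*}](./$file)"; done' sh 'a b.html' 'c.html'
# [a b](./a b.html)
# [c](./c.html)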

It goes without saying that it makes little sense to do this rather than running the for loop directly over the *.html files, as in the first example.


¹ It doesn't seem to work properly for multibyte characters in my version of ksh93 though (ksh93u+ on a GNU system)
