Efficient data extraction from multiple files to a single CSV file

csv, text processing, xml, zsh

I have a large collection of XML files with the same exact structure:

$ cat file_<ID>.xml
... 
 ... 
   ...
      <double>1.2342</double>
      <double>2.3456</double>
      ...
   ...
 ... 
... 

where the number of such <double> entries in each XML file is fixed and known (in my particular case, 168).

I need to build a single CSV file with the content of all these XML files stored as follows:

file_0001 1.2342 2.3456 ... 
file_0002 1.2342 2.3456 ... 

etc.

How can I do this efficiently?


The best I have come up with is this:

#!/usr/bin/env zsh

for x in $path_to_xmls/*.xml; do 

    # 1) Get the doubles ignoring everything else
    # 2) Remove line breaks within the same file
    # 3) Add a new line at the end to construct the CSV file
    # 4) Join the columns together

    grep -F '<double>' "$x" | \
    sed -r 's/.*>([0-9]+\.*[0-9]*).*/\1/' | \
    tr '\n' ' ' | sed -e '$a\' >> table_numbers.csv

    echo ${x:t} >> file_IDs.csv
done
    
paste file_IDs.csv table_numbers.csv > final_table.csv

When I time the above script in a folder with ~10K XML files I get:

./from_xml_to_csv.sh  100.45s user 94.84s system 239% cpu 1:21.48 total

Not terrible, but I am hoping to work with 100x or 1000x more files. How can I make this processing more efficient?

Also, with my solution above, could I ever end up in a situation where the glob expansion reaches a limit, e.g. when working with millions of files? (the typical "too many args" problem).

Update

For anyone interested in a great solution to this problem, please read @mikeserve's answer. It is the fastest and the one that scales up the best by far.

Best Answer

Regarding glob expansion possibly exceeding a limit - yes and no. The shell's already running, and so it won't stop - ARG_MAX only limits the argument list handed to a newly exec'd command, not the shell's own internal expansion. But if you were to pass the entire globbed array as arguments to a single command, then yes, that is a definite possibility. The portable and robust way to handle this involves find...

find . \! -name . -prune -name pattern -type f -exec cat {} + | ...

...which will only cat regular files in the current directory with a name which matches pattern, but will also only invoke cat as many times as is necessary to avoid exceeding ARG_MAX.
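
For instance, a minimal sketch along those lines (assuming the XML files sit in the current directory, and with a simple line count standing in for the real processing):

# cat every regular *.xml file in the current directory only (no recursion),
# batching the arguments so no single cat invocation exceeds ARG_MAX,
# then count the <double> lines across all of them
find . \! -name . -prune -name '*.xml' -type f -exec cat {} + |
    grep -cF '<double>'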

In fact, though, since you have GNU sed, we can almost do the whole thing with just sed in a find command.

cd /path/to/xmls
find . \! -name . -prune -name \*.xml -type f -exec  \
    sed -sne'1F;$x;/\n*\( \)*<\/*double>/!d' \
        -e  '$s//\1/gp;H' {} + | paste -d\\0 - -
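
The sed stage is meant to emit two lines per file - the filename (printed by GNU sed's F command) followed by the space-joined values - and paste -d\\0 - - then glues each consecutive pair of lines together with no delimiter. A tiny stand-alone sketch of just that pairing step (the two sample lines are made up):

printf '%s\n' './file_0001.xml' ' 1.2342 2.3456' | paste -d\\0 - -
# prints: ./file_0001.xml 1.2342 2.3456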

I thought of another way. This will be very fast, but it absolutely depends on there being exactly 168 matches per file, and there can be only one . dot in each filename.

(   export LC_ALL=C; set '' - -
    while [ "$#" -lt 168 ]; do set "$@$@"; done
    shift "$((${#}-168))"
    find . \! -name . -prune -name \*.xml -type f      \
              -exec  grep -F '<double>' /dev/null {} + |
    tr \<: '>>' | cut -d\> -f1,4 | paste -d\  "$@"     |
    sed 'h;s|./[^>]*>||g;x;s|\.x.*||;s|..||;G;s|\n| |'
)
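
For reference, a hedged usage sketch: if the block above were saved to a file (extract.sh is a made-up name), the table the question asks for could be produced and sanity-checked roughly like this - the field count check simply relies on the fixed 168 doubles per file:

cd /path/to/xmls
sh extract.sh > final_table.csv
# every row should hold 1 filename + 168 values = 169 space-separated fields
awk 'NF != 169 { print "bad row: " FNR; bad = 1 } END { exit bad }' final_table.csv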

As requested, here's a little breakdown of how this command works:

  1. ( ... )

    • In the first place, the whole little script is run within its own subshell because we alter a few global properties of the shell environment in the course of its execution (the locale and the positional parameters), and this way, when the job is done, everything we altered is restored to its original value - whatever it was.
  2. export LC_ALL=C; set '' - -
    • By setting the current locale to C we can save our filters a lot of effort. In a UTF-8 locale any character might be represented by one or several bytes apiece, and any character found will need to be selected out of many thousands of possibilities. In the C locale each character is a single byte, and there are only 128 of them. It makes character matching a much quicker affair overall.
    • The set statement changes the shell's positional parameters. Doing set '' - - sets $1 to the empty string, and $2 and $3 to -.
  3. while ... set "$@$@"; done; shift ...
    • Basically the whole point of this statement is to get an array of 168 dashes. We'll be using paste later to replace sequential sets of 167 newlines with spaces, while preserving the 168th. The simplest way to do this is to give it 168 argument references to - stdin and tell it to paste all of those together. (A scaled-down demonstration of this step, together with the paste in step 7, follows this list.)
  4. find ... -exec grep -F '<double>' /dev/null ...
    • The find bit has been previously discussed, but with grep we print only those lines which can be matched against the -Fixed string <double>. By making grep's first argument /dev/null - which is a file that can never match our string - we ensure that grep is always searching 2 or more file arguments for every invocation. When invoked with 2 or more named search files grep will always print the filename like ./file_000.xml: at the head of every output line.
  5. tr \<: '>>'
    • Here we translate every occurrence in grep's output of either : or < characters to >.
    • At this point a sample matched line will look like ./file_000.xml> >double>0.0000>/double>.
  6. cut -d\> -f1,4
    • cut keeps only the 1st and 4th of the fields divided by > chars and strips everything else from its output.
    • At this point a sample matched line will look like ./file_000.xml>0.0000. (A one-line demonstration of steps 4-6 follows this list.)
  7. paste -d\  "$@"
    • Already discussed, but here we paste input lines in batches of 168.
    • At this point 168 matched lines occur together like: ./file_000.xml>0.000 ... ./file_000.xml>0.167
  8. sed 'h;s|./[^>]*>||g;x;s|\.x.*||;s|..||;G;s|\n| |'
    • Now the faster, smaller utilities have already done the majority of the work. On a multicore system, they've probably even done it concurrently. And those utilities - especially cut and paste - are far faster at what they do than any attempt at emulation we might make with higher-level utilities like sed, or, even worse, awk. But I've taken it as far as I can with them, and now I have to call on sed.
    • First I hold a copy of every input line, then I globally remove every occurrence of the pattern ./[^>]*> in pattern space - so every occurrence of the filename. At this point sed's pattern space looks like: 0.000 0.001 ... 0.167
    • Then I exchange hold and pattern spaces and remove everything from the first filename's .xml extension to the end of the saved copy of the line. I then strip the first two chars - the leading ./ - and at this point pattern space looks like file_000.
    • So all that remains is to stick them together. I Get a copy of hold space appended to pattern space following a \newline char, then I s///ubstitute the \newline for a space.
    • And so, finally, pattern space looks like file_000 0.000...0.167. And that is what sed writes to output for each file find passes to grep. (A worked one-line example of this sed script follows below.)
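
A scaled-down demonstration of steps 3 and 7, using 4 dashes instead of 168 so the effect is visible at a glance (the input letters are made up):

# build an array of 4 dashes exactly the way the script builds 168 of them;
# the leading empty '' parameter is just a placeholder that gets shifted away
set '' - -
while [ "$#" -lt 4 ]; do set "$@$@"; done
shift "$(($# - 4))"
# paste now folds every 4 input lines into one space-separated output line
printf '%s\n' a b c d e f g h | paste -d' ' "$@"
# prints:
#   a b c d
#   e f g h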
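
A one-line demonstration of what steps 4-6 do to a single line of grep output (the sample line is made up, but shaped like grep's filename-prefixed output):

printf '%s\n' './file_0001.xml:      <double>1.2342</double>' |
tr \<: '>>' |      # every ':' and '<' becomes '>'
cut -d\> -f1,4     # keep only the filename and the number
# prints: ./file_0001.xml>1.2342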
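
And finally the step-8 sed script run by hand over one shortened pasted line, with three value fields standing in for the real 168:

printf '%s\n' './file_000.xml>0.000 ./file_000.xml>0.001 ./file_000.xml>0.002' |
sed 'h;s|./[^>]*>||g;x;s|\.x.*||;s|..||;G;s|\n| |'
# prints: file_000 0.000 0.001 0.002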