I have a large collection of XML files, all with exactly the same structure:
```
$ cat file_<ID>.xml
...
<double>1.2342</double>
<double>2.3456</double>
...
```
where the number of such `<double>` entries in each XML file is fixed and known (in my particular case, 168). I need to build a single CSV file with the content of all these XML files stored as follows:
```
file_0001 1.2342 2.3456 ...
file_0002 1.2342 2.3456 ...
```
etc.
How can I do this efficiently?
The best I have come up with is this:
```
#!/usr/bin/env zsh

for x in $path_to_xmls/*.xml; do
    # 1) Get the doubles, ignoring everything else
    # 2) Remove line breaks within the same file
    # 3) Add a newline at the end to finish this row of the CSV file
    grep -F '<double>' $x | \
    sed -r 's/.*>([0-9]+\.?[0-9]*)<.*/\1/' | \
    tr '\n' ' ' | sed -e '$a\' >> table_numbers.csv

    # 4) Record this file's name (zsh's :t strips the directory part)
    echo ${x:t} >> file_IDs.csv
done

# Join the two columns together
paste file_IDs.csv table_numbers.csv > final_table.csv
```
When I time the above script in a folder with ~10K XML files I get:

```
./from_xml_to_csv.sh  100.45s user 94.84s system 239% cpu 1:21.48 total
```

Not terrible, but I am hoping to work with 100x or 1000x more files. How can I make this processing more efficient?
Also, with my solution above, could I ever end up in a situation where the glob expansion reaches a limit, e.g. when working with millions of files (the typical "too many args" problem)?
Update
For anyone interested in a great solution to this problem, please read @mikeserv's answer. It is the fastest, and by far the one that scales best.
Best Answer
Regarding glob expansion possibly exceeding a limit - yes and no. The shell's already running, and so it won't stop. But if you were to pass the entire globbed array as arguments to a single command, then yes, that is a definite possibility. The portable and robust way to handle this involves `find`:
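A minimal sketch (here `$pattern` is a stand-in for whatever name filter fits your files):

```
find . \( -name "$pattern" -type f \) -exec cat {} +
```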
...which will only `cat` regular files in the current directory with a name which matches the pattern, but will also only invoke `cat` as many times as is necessary to avoid exceeding `ARG_MAX`.
In fact, though, since you have a GNU `sed`, we can almost do the whole thing with just `sed` in a `find` script.

I thought of another way. This will be very fast, but it absolutely depends on there being exactly 168 matches per file, and there can be only the one `.` dot in the filenames.

As requested, here's a little breakdown of how this command works:
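What follows is a sketch of the whole command, assembled from the pieces explained below; the glob pattern, the output file name, and the exact seed for the argument-building loop are assumptions (this version builds the list from `-` parameters only):

```
( export LC_ALL=C                      # byte-wise matching; see below
  set -- - -                           # seed the positional parameters
  while [ "$#" -lt 168 ]; do           # double them: 2, 4, 8, ..., 256
      set -- "$@" "$@"
  done
  shift "$(($# - 168))"                # keep exactly 168 '-' args for paste
  find . \( -name '*.xml' -type f \) \
         -exec grep -F '<double>' /dev/null {} + |
  tr '<:' '>>'      |                  # every : and < becomes >
  cut -d '>' -f1,4  |                  # keep the filename and value fields
  paste -d ' ' "$@" |                  # join the lines in batches of 168
  sed 'h;s|./[^>]*>||g;x;s|\.xml.*||;s|..||;G;s|\n| |'
) > final_table.csv
```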
`( ... )`

The whole pipeline runs inside a subshell, so the `export`ed locale and the rewritten positional parameters are discarded when it completes instead of leaking into the calling shell.
`export LC_ALL=C; set '' - -`

By ensuring that every utility in the pipeline runs in the `C` locale we can save our filters a lot of effort. In a UTF-8 locale any char might be represented by one or several bytes apiece, and any char found will need to be selected out of a group of many thousand possibles. In the C locale each char is a single byte, and there are only 128 of them. It makes char matching a much quicker affair overall.

The `set` statement changes the shell's positional parameters. Doing `set '' - -` sets `$1` to the empty (null) string, and `$2` and `$3` to `-`.
`while ... set "$@$@"; done; shift ...`

This builds up the argument list that `paste` will use later to replace sequential sets of 167 newlines with spaces, while preserving every 168th. The simplest way to do this is to give it 168 argument references to `-` stdin and tell it to paste all of those together.
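To see the effect, here is the same trick scaled down to batches of 3 on some hypothetical input:

```
$ set -- - - -
$ printf '%s\n' a b c d e f | paste -d ' ' "$@"
a b c
d e f
```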
`find ... -exec grep -F '<double>' /dev/null ...`
The `find` bit has been previously discussed, but with `grep` we print only those lines which can be matched against the `-F`ixed string `<double>`. By making `grep`'s first argument `/dev/null` - which is a file that can never match our string - we ensure that `grep` is always searching 2 or more file arguments for every invocation. When invoked with 2 or more named search files, `grep` will always print the filename, like `file_000.xml:`, at the head of every output line.
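For instance, given a file holding one indented `<double>` line, such an invocation looks like this (a hypothetical sample):

```
$ grep -F '<double>' /dev/null ./file_000.xml
./file_000.xml: <double>0.0000</double>
```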
`tr \<: '>>'`

Here we translate every occurrence in `grep`'s output of either `:` or `<` characters to `>`, so a matched line becomes `./file_000.xml> >double>0.0000>/double>`.
`cut -d\> -f1,4`

`cut` strips from its output everything in its input which cannot be found within either the 1st or 4th fields as divided by `>` chars, leaving `./file_000.xml>0.0000`.
`paste -d\ "$@"`

Here `paste` joins its input lines together in batches of 168: `./file_000.xml>0.000 ... ./file_000.xml>0.167`.
`sed 'h;s|./[^>]*>||g;x;s|\.xml.*||;s|..||;G;s|\n| |'`

`cut` and `paste` are far faster at what they do than any attempt at emulation we might make with higher-level utilities like `sed`, or, even worse, `awk`. But I've taken it as far as I can imagine taking it that way, and I have to call on `sed`.

First I `h`old a copy of every input line, then I `g`lobally remove every occurrence of the pattern `./[^>]*>` in pattern space - so every occurrence of the filename. At this point `sed`'s pattern space looks like `0.000 0.001 ... 0.167`.
Next I e`x`change the hold and pattern spaces and remove everything from `\.xml` on - so everything from the first filename onward in the saved copy of the line. I then strip the first two chars - the leading `./` - and at this point pattern space looks like `file_000`.
Last, I `G`et a copy of the `h`old space appended to pattern space following a `\n`ewline char, then I `s///`ubstitute the `\n`ewline for a space. The result is `file_000 0.000 ... 0.167`, and that is what `sed` writes to output for each file `find` passes to `grep`.
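Traced on a single pasted line (shortened here to two values per file for readability), the script does:

```
h              hold    = ./file_000.xml>0.000 ./file_000.xml>0.001
s|./[^>]*>||g  pattern = 0.000 0.001
x              pattern = ./file_000.xml>0.000 ./file_000.xml>0.001
s|\.xml.*||    pattern = ./file_000
s|..||         pattern = file_000
G              pattern = file_000\n0.000 0.001
s|\n| |        pattern = file_000 0.000 0.001
```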