I have a shell script which is in essence a sed script with some checks. The goal of the script is to convert the header of a file from:
&FCI
NORB=280,
NELEC=78,
MS2=0,
UHF=.FALSE.,
ORBSYM=1,1,1,1,1,1,1,1,<...>
&END
1.48971678130072078261E+01 1 1 1 1
-1.91501428271686324756E+00 1 1 2 1
4.38796949990802698238E+00 1 1 2 2
to:
&FCI NORB=280, NELEC=78, MS2=0, UHF=.FALSE.,
ORBSYM=1,1,1,1,1,1,1,1,<...>
ISYM=1,
/
1.48971678130072078261E+01 1 1 1 1
-1.91501428271686324756E+00 1 1 2 1
4.38796949990802698238E+00 1 1 2 2
This is the script:
#!/bin/bash
# $1 : FCIDUMP file to convert from "new format" to "old format"
if [ ${#} -ne 1 ]
then
echo "Usage: fcidump_new2old FCIDUMPFILE" 1>&2
exit 1
fi
if egrep '&FCI ([a-zA-Z2 ]*=[0-9 ]*,){2,}' ${1} > /dev/null
then
echo "The provided file is already in old FCIDUMP format." 1>&2
exit 2
fi
sed '
1,20 {
:a; N; $!ba
s/\(=[^,]*,\)\n/\1 /g
s/\(&FCI\)\n/\1 /
s/ORBSYM/\n&/g
s/&END/ISYM=1,\n\//
}' -i "${1}"
exit 0
This script works for "small" files, but now I have encountered a file of approx. 9 gigabytes, and the script crashes with this "super clear" error message:
script.sh: line 24: 406089 Killed sed '
1,20 {
:a; N; $!ba
s/\(=[^,]*,\)\n/\1 /g
s/\(&FCI\)\n/\1 /
s/ORBSYM/\n&/g
s/&END/ISYM=1,\n\//
}' -i "${1}"
How can I make this sed script really look only at the header, so that it can handle such big files? The ugly hardcoded "20" is there, by the way, because I do not know anything better.
Extra info:
- After trying some things, I saw that strange files were produced: sedexG4Lg, sedQ5olGZ, sedXVma1Y, sed21enyi, sednzenBn, sedqCeeey, sedzIWMUi. All were empty except sednzenBn, which was like the input file but only half of it.
- Discarding the -i flag and redirecting the output to another file gives an empty file.
Best Answer
General method
Use light-weight tools to manage huge files: head and tail to create a head file and a data file, sed to edit the small head file, and cat to concatenate the modified head file and the data file. See also: Efficient way to print lines from a massive file using awk, sed, or something else?
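A minimal sketch of the head/tail/cat idea on a tiny stand-in file (all file names here are placeholders, and the sed edit is reduced to a single substitution for the demo):

```shell
# Tiny stand-in for the huge FCIDUMP file.
printf '&FCI\nNORB=4,\n&END\n1.0 1 1 1 1\n2.0 1 1 2 1\n' > sample.txt

# Head file: print every line and quit at the first '&END',
# so the multi-gigabyte tail is never processed.
sed '/&END/q' sample.txt > headfile

# Data file: everything after the header.
hlines=$(wc -l < headfile)
tail -n +"$((hlines + 1))" sample.txt > datafile

# Edit only the small head file, then glue the two parts back together.
sed -i 's|&END|/|' headfile
cat headfile datafile > sample.new
```

Because sed only ever sees the few header lines, memory use stays constant no matter how large the data part is.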
Another method is to use split.
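The split approach can be sketched with csplit, a pattern-based relative of split (the input file name is an assumption for the demo; csplit writes the pieces as xx00, xx01, ...):

```shell
# Tiny stand-in file.
printf '&FCI\nNORB=4,\n&END\n1.0 1 1 1 1\n' > sample2.txt

# Split before the line after '&END':
# xx00 = header (up to and including &END), xx01 = data.
csplit -s sample2.txt '/&END/+1'
```

After editing xx00, the pieces can be concatenated with cat as above.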
Test
I tested with your header and a file of 1080000000 numbered lines (19 GiB), 1080000007 lines in total, and it worked: the output file (1080000004 lines) was written in 5 minutes on my old HP xw8400 workstation (including the time to type the command that starts the shell script).
The big write operations were between the system partition on an SSD and a data partition on an HDD.
Shellscript
You need enough free space in the file system that holds /tmp for the huge temporary 'data' file: more than 9 GB according to your original question. This may seem an awkward way to do things, but it works for huge files without crashing the tools. You may have to store the temporary 'data' file somewhere else, for example on an external drive (which will probably be slower).
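Putting the pieces together, a sketch of the whole conversion, with the original sed commands applied to the header only (file names such as fcidump.test and the /tmp template are assumptions for the demo; the tiny test input stands in for the 9 GB file):

```shell
#!/bin/bash
# Tiny stand-in input; in real use this would be the 9 GB FCIDUMP file.
printf '&FCI\nNORB=280,\nNELEC=78,\nMS2=0,\nUHF=.FALSE.,\nORBSYM=1,1,\n&END\n1.0 1 1 1 1\n' > fcidump.test
file=fcidump.test

# The temporary 'data' file lives in /tmp, so /tmp needs more free
# space than the size of the input file.
tmpdata=$(mktemp /tmp/fcidump_data.XXXXXX)

# Header only: print every line and quit at the first '&END',
# so sed never reads the huge data part.
sed '/&END/q' "$file" > "$file.head"
hlines=$(wc -l < "$file.head")

# Everything after the header is copied unchanged.
tail -n +"$((hlines + 1))" "$file" > "$tmpdata"

# The original transformations, now run on a few lines instead of 9 GB.
sed -i '
:a; N; $!ba
s/\(=[^,]*,\)\n/\1 /g
s/\(&FCI\)\n/\1 /
s/ORBSYM/\n&/g
s/&END/ISYM=1,\n\//
' "$file.head"

# Concatenate the modified header and the untouched data, then clean up.
cat "$file.head" "$tmpdata" > "$file"
rm -f "$file.head" "$tmpdata"
```

Only the header ever sits in sed's pattern space, so the :a; N; $!ba slurp loop is harmless here, and the range hack (1,20) is no longer needed.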