Ubuntu – Sed script crashing on big file

command-line sed text-processing

I have a shell script that is essentially a sed script with some checks. The goal of the script is to convert the header of a file from

&FCI
NORB=280,
NELEC=78,
MS2=0,
UHF=.FALSE.,
ORBSYM=1,1,1,1,1,1,1,1,<...>
&END
  1.48971678130072078261E+01   1   1   1   1
 -1.91501428271686324756E+00   1   1   2   1
  4.38796949990802698238E+00   1   1   2   2

to

&FCI NORB=280, NELEC=78, MS2=0, UHF=.FALSE., 
ORBSYM=1,1,1,1,1,1,1,1,<...>
ISYM=1,
/
  1.48971678130072078261E+01   1   1   1   1
 -1.91501428271686324756E+00   1   1   2   1
  4.38796949990802698238E+00   1   1   2   2

This is the script:

#!/bin/bash

# $1 : FCIDUMP file to convert from "new format" to "old format"

if [ ${#} -ne 1 ]
then
  echo "Syntaxis: fcidump_new2old FCIDUMPFILE" 1>&2
  exit 1
fi

if egrep '&FCI ([a-zA-Z2 ]*=[0-9 ]*,){2,}' ${1} > /dev/null
then
  echo "The provided file is already in old FCIDUMP format." 1>&2
  exit 2
fi

sed '
1,20 {
   :a; N; $!ba
   s/\(=[^,]*,\)\n/\1 /g
   s/\(&FCI\)\n/\1 /
   s/ORBSYM/\n&/g
   s/&END/ISYM=1,\n\//
}' -i "${1}"

exit 0

This script works for "small" files, but now I have encountered a file of approximately 9 gigabytes, and the script crashes with this "super clear" error message:

script.sh: line 24: 406089 Killed                  sed '
1,20 {
   :a; N; $!ba
   s/\(=[^,]*,\)\n/\1 /g
   s/\(&FCI\)\n/\1 /
   s/ORBSYM/\n&/g
   s/&END/ISYM=1,\n\//
}' -i "${1}"

How can I make this sed script really look only at the header, so that it can handle such big files? The ugly hardcoded "20" is there, by the way, because I do not know anything better.

Extra info:

  • After trying some things, I saw that strange files were produced: sedexG4Lg, sedQ5olGZ, sedXVma1Y, sed21enyi, sednzenBn, sedqCeeey, sedzIWMUi. All were empty except sednzenBn, which was like the input file but contained only half of it.

  • Dropping the -i flag and redirecting the output to another file gives an empty file.

Best Answer

General method

  • You can split the file into a header and a second file with the data lines.
  • Then you can easily edit the header separately with your current sed command.
  • Finally you can concatenate the edited header and the file with the data lines.
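
A minimal sketch of these three steps as one bash function, assuming GNU grep, head, tail and sed. The function name convert_header is made up for illustration. It is a variation on the method: the edited header and the untouched data lines are streamed straight into the output file, so no temporary 'data' file is written.

```bash
#!/bin/bash
# convert_header SRC DST   (hypothetical helper name, for illustration only)
convert_header() {
  local src=$1 dst=$2 n
  # Line number of the first '&END', i.e. the last line of the header.
  n=$(grep -m 1 -n '&END' "$src" | cut -d: -f1)
  [ -n "$n" ] || { echo "no &END marker found in $src" 1>&2; return 1; }
  {
    # Steps 1 and 2: only the header lines ever reach sed,
    # so its pattern space stays small.
    head -n "$n" "$src" | sed '
      :a; N; $!ba
      s/\(=[^,]*,\)\n/\1 /g
      s/\(&FCI\)\n/\1 /
      s/ORBSYM/\n&/g
      s/&END/ISYM=1,\n\//'
    # Step 3: append the data lines unchanged.
    tail -n +"$((n + 1))" "$src"
  } > "$dst"
}
```

Usage would be something like convert_header file.in file.out. The input file is read twice (once by head, once by tail), which is the price for not storing the data lines separately.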

Light-weight tools to manage huge files

Test

  • I tested with your header and a file with 1080000000 numbered lines (size 19 GiB), 1080000007 lines in total, and it worked. The output file (with 1080000004 lines) was written in 5 minutes on my old HP xw8400 workstation (including the time to type the command that starts the shellscript).

    $ ls -lh --time-style=full-iso huge*
    -rw-r--r-- 1 sudodus sudodus 19G 2018-12-15 19:50:45.278328120 +0100 huge.in
    -rw-r--r-- 1 sudodus sudodus 19G 2018-12-15 19:55:46.808798456 +0100 huge.out
    
  • The big write operations were between the system partition on an SSD and a data partition on an HDD.
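
A scaled-down sketch of how such a test input could be built, assuming the header has been saved in a file of its own and that numbered lines from seq are good enough as stand-in data. The function name make_test_input and the file names in the usage note are hypothetical.

```bash
#!/bin/bash
# make_test_input HEADERFILE COUNT OUTFILE   (hypothetical helper)
# Writes the header followed by COUNT numbered lines to OUTFILE.
make_test_input() {
  local header=$1 count=$2 out=$3
  { cat "$header"; seq "$count"; } > "$out"
}
```

For example, make_test_input header.txt 1080000000 huge.in with a 7-line header would give a file of 1080000007 lines. Try it with a small count first.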

Shellscript

You need enough free space in the file system that holds /tmp for the huge temporary 'data' file: more than 9 GB according to your original question.

$ LANG=C df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       106G   32G   69G  32% /

This may seem an awkward way to do things, but it works for huge files without crashing the tools. You may need to store the temporary 'data' file somewhere else, for example on an external drive (but that will probably be slower).
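
A minimal sketch of checking the free space up front and of putting the temporary files in a directory of your choice, assuming GNU stat and mktemp plus POSIX df and awk. The helper names has_room_for and make_workdir are made up for illustration, and the shellscript below does not use them.

```bash
#!/bin/bash
# has_room_for FILE DIR   (hypothetical helper)
# Succeeds if the file system holding DIR has more free space than FILE is large.
has_room_for() {
  local file=$1 dir=$2 need_kb avail_kb
  # File size in bytes (GNU stat), rounded up to 1024-byte blocks.
  need_kb=$(( ($(stat -c %s "$file") + 1023) / 1024 ))
  # 'df -Pk' prints one line per file system; field 4 is the available space.
  avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -gt "$need_kb" ]
}

# make_workdir [DIR]   (hypothetical helper)
# Creates a private temporary directory under DIR (default /tmp) and prints its name.
make_workdir() {
  mktemp -d -p "${1:-/tmp}" fcidump.XXXXXX
}
```

A script could then call has_room_for "$1" /tmp before splitting the file, and fall back to another directory (for example a mount point on an external drive) if the check fails. A private directory from mktemp also avoids clashes when two conversions run at the same time.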

#!/bin/bash

# $1 : FCIDUMP file to convert from "new format" to "old format"
# $2 : output file to write in "old format"

if [ $# -ne 2 ]
then
  echo "Syntaxis: $0 fcidumpfile oldstylefile " 1>&2
  echo "Example:  $0 file.in file.out" 1>&2
  exit 1
fi

if [ "$1" == "$2" ]
then
  echo "The names of the input file and output file must differ" 1>&2
  exit 2
fi

endheader="$(grep -m 1 -n '&END' "$1" | cut -d: -f1)"
if [ "$endheader" == "" ]
then
  echo "Bad input file: the end marker of the header was not found" 1>&2
  exit 3
fi
#echo "endheader=$endheader"

< "$1" head -n "$endheader" > /tmp/header
#cat /tmp/header

if grep -E '&FCI ([a-zA-Z2 ]*=[0-9 ]*,){2,}' /tmp/header > /dev/null
then
  echo "The provided file is already in old FCIDUMP format." 1>&2
  exit 4
fi

# edit /tmp/header in place with sed
sed '
{
:a; N; $!ba
s/\(=[^,]*,\)\n/\1 /g
s/\(&FCI\)\n/\1 /
s/ORBSYM/\n&/g
s/&END/ISYM=1,\n\//
}' -i /tmp/header 

if [ $? -ne 0 ]
then
  echo "Failed to convert the header format in /tmp/header" 1>&2
  exit 5
fi

< "$1" tail -n +$(($endheader+1)) > /tmp/tailer

if [ $? -ne 0 ]
then
  echo "Failed to create the 'data' file /tmp/tailer" 1>&2
  exit 6
fi

#echo "---"
#cat /tmp/tailer
#echo "---"

cat /tmp/header /tmp/tailer > "$2"

exit 0