Ubuntu – Sed script crashing on big file

Tags: command-line, sed, text-processing

I have a shell script which is in essence a sed script with some checks. The goal of the script is to convert the header of a file from the first format shown below to the second one.

&FCI
NORB=280,
NELEC=78,
MS2=0,
UHF=.FALSE.,
ORBSYM=1,1,1,1,1,1,1,1,<...>
&END
  1.48971678130072078261E+01   1   1   1   1
 -1.91501428271686324756E+00   1   1   2   1
  4.38796949990802698238E+00   1   1   2   2

&FCI NORB=280, NELEC=78, MS2=0, UHF=.FALSE., 
ORBSYM=1,1,1,1,1,1,1,1,<...>
ISYM=1,
/
  1.48971678130072078261E+01   1   1   1   1
 -1.91501428271686324756E+00   1   1   2   1
  4.38796949990802698238E+00   1   1   2   2

This is the script:

#!/bin/bash

# $1 : FCIDUMP file to convert from "new format" to "old format"

if [ ${#} -ne 1 ]
then
  echo "Syntaxis: fcidump_new2old FCIDUMPFILE" 1>&2
  exit 1
fi

if egrep '&FCI ([a-zA-Z2 ]*=[0-9 ]*,){2,}' ${1} > /dev/null
then
  echo "The provided file is already in old FCIDUMP format." 1>&2
  exit 2
fi

sed '
1,20 {
   :a; N; $!ba
   s/\(=[^,]*,\)\n/\1 /g
   s/\(&FCI\)\n/\1 /
   s/ORBSYM/\n&/g
   s/&END/ISYM=1,\n\//
}' -i "${1}"

exit 0

This script works for "small" files, but now I have encountered a file of approximately 9 gigabytes, and the script crashes with this "super clear" error message:

script.sh: line 24: 406089 Killed                  sed '
1,20 {
   :a; N; $!ba
   s/\(=[^,]*,\)\n/\1 /g
   s/\(&FCI\)\n/\1 /
   s/ORBSYM/\n&/g
   s/&END/ISYM=1,\n\//
}' -i "${1}"

How can I make this sed script really look only at the header, so that it can handle such big files? The ugly hardcoded "20" is there, by the way, because I do not know anything better.

Extra info:

After trying some things, I saw that strange files had been produced: sedexG4Lg, sedQ5olGZ, sedXVma1Y, sed21enyi, sednzenBn, sedqCeeey, sedzIWMUi. All were empty except sednzenBn, which was like the input file but contained only half of it.
Discarding the -i flag and redirecting the output to another file gives an empty file.

Best Answer

General method

You can split each file into a header file and a second file with the data lines.
Then you can easily edit the header separately with your current sed command.
Finally, you can concatenate the modified header and the file with the data lines.
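
A likely reason the original command gets killed (an interpretation, not something stated in the question): the `:a; N; $!ba` loop is entered on line 1 and keeps appending lines until the end of the file, so the `1,20` address does not limit how much sed reads, and the whole file ends up in memory. A minimal sketch with GNU sed on a 30-line input shows the loop running past line 20:

```shell
#!/bin/bash
# Feed 30 numbered lines to a block that is addressed to lines 1-20 only.
# The loop starts on line 1 and appends every following line (N) until the
# last line ($), so lines 21-30 are swallowed as well.
joined=$(seq 1 30 | sed -n '
1,20 {
   :a; N; $!ba
   s/\n/,/g
   p
}')
echo "$joined"   # one single line holding all numbers from 1 to 30
```

Splitting off the header first means sed only ever sees a handful of lines.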

Light-weight tools to manage huge files

You can use head and tail to create a head file and a data file.
You can use cat to concatenate the modified head file and the data file.
See also: "Efficient way to print lines from a massive file using awk, sed, or something else?"
Another method is to use split.

Test

I tested with your header and a file with 1080000000 numbered lines (size 19 GiB), 1080000007 lines in total, and it worked. The output file (with 1080000004 lines) was written in 5 minutes on my old HP xw8400 workstation (including the time to type the command that starts the shellscript).
```
$ ls -lh --time-style=full-iso huge*
-rw-r--r-- 1 sudodus sudodus 19G 2018-12-15 19:50:45.278328120 +0100 huge.in
-rw-r--r-- 1 sudodus sudodus 19G 2018-12-15 19:55:46.808798456 +0100 huge.out
```
The big write operations were between the system partition on an SSD and a data partition on an HDD.

Shellscript

You need enough free space in the file system where you have /tmp for the huge temporary 'data' file, more than 9 GB according to your original question.

$ LANG=C df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       106G   32G   69G  32% /

This may seem an awkward way to do things, but it works for huge files without crashing the tools. You may need to store the temporary 'data' file somewhere else, for example on an external drive (but that will probably be slower).

#!/bin/bash

# $1 : FCIDUMP file to convert from "new format" to "old format"
# $2 : output file that receives the converted ("old format") FCIDUMP

if [ $# -ne 2 ]
then
  echo "Syntaxis: $0 fcidumpfile oldstylefile " 1>&2
  echo "Example:  $0 file.in file.out" 1>&2
  exit 1
fi

if [ "$1" == "$2" ]
then
  echo "The names of the input file and output file must differ" 1>&2
  exit 2
fi

endheader="$(grep -m 1 -n '&END' "$1" | cut -d: -f1)"
if [ "$endheader" == "" ]
then
  echo "Bad input file: the end marker of the header was not found" 1>&2
  exit 3
fi
#echo "endheader=$endheader"

< "$1" head -n "$endheader" > /tmp/header
#cat /tmp/header

if grep -E '&FCI ([a-zA-Z2 ]*=[0-9 ]*,){2,}' /tmp/header  > /dev/null
then
  echo "The provided file is already in old FCIDUMP format." 1>&2
  exit 4
fi

# run sed in place (-i) on /tmp/header; only the short header is read into memory
sed '
{
:a; N; $!ba
s/\(=[^,]*,\)\n/\1 /g
s/\(&FCI\)\n/\1 /
s/ORBSYM/\n&/g
s/&END/ISYM=1,\n\//
}' -i /tmp/header 

if [ $? -ne 0 ]
then
  echo "Failed to convert the header format in /tmp/header"
  exit 5
fi

< "$1" tail -n +$(($endheader+1)) > /tmp/tailer

if [ $? -ne 0 ]
then
  echo "Failed to create the 'data' file /tmp/tailer"
  exit 6
fi

#echo "---"
#cat /tmp/tailer
#echo "---"

cat /tmp/header /tmp/tailer > "$2"

exit 0
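
If free space for the temporary 'data' file is a problem, the same idea can be sketched without any temporary files by streaming both parts straight into the output. This is a variant under assumptions, not the script that was tested above: the function name `convert_fcidump` is hypothetical, GNU sed is assumed, and the checks for the old format are left out for brevity.

```shell
#!/bin/bash
# Hypothetical helper (not part of the tested script above): convert the
# header and copy the data lines directly into the output file.
convert_fcidump() {
  local in=$1 out=$2 endheader
  # Line number of the first '&END', as in the script above.
  endheader=$(grep -m 1 -n '&END' "$in" | cut -d: -f1)
  if [ -z "$endheader" ]
  then
    echo "Bad input file: the end marker of the header was not found" 1>&2
    return 3
  fi
  {
    # Only the header lines reach sed, so its pattern space stays small.
    head -n "$endheader" "$in" | sed '
      :a; N; $!ba
      s/\(=[^,]*,\)\n/\1 /g
      s/\(&FCI\)\n/\1 /
      s/ORBSYM/\n&/g
      s/&END/ISYM=1,\n\//'
    # The data lines are copied unchanged, without an intermediate file.
    tail -n +"$((endheader + 1))" "$in"
  } > "$out"
}
```

It would be called as `convert_fcidump file.in file.out`. The input file is read twice (once by head, once by tail), which costs time but no extra disk space beyond the output file.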

Related Solutions

Ubuntu – using Sed to search and replace text in XML file

The issue is that your search pattern contains /, which you are also using as the delimiter of the s command. You need to use another character as the delimiter, or escape each / in the pattern and replacement:

sed -i 's#<!--UpdateAccountGUIDs>UpdateAndExit</UpdateAccountGUIDs-->#<UpdateAccountGUIDs>UpdateAndExit</UpdateAccountGUIDs>#' File.XML

sed -i 's/<!--UpdateAccountGUIDs>UpdateAndExit<\/UpdateAccountGUIDs-->/<UpdateAccountGUIDs>UpdateAndExit<\/UpdateAccountGUIDs>/' File.XML
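
Both forms can be tried without touching a real file. The following is a small sketch that uses a shortened, made-up tag (`Flag`) in place of the real XML:

```shell
#!/bin/bash
line='<!--Flag>On</Flag-->'   # hypothetical sample line, not from File.XML

# Alternative delimiter (#): the slashes in the pattern need no escaping.
with_hash=$(printf '%s\n' "$line" | sed 's#<!--Flag>On</Flag-->#<Flag>On</Flag>#')

# Default delimiter (/): every literal slash has to be escaped as \/.
with_slash=$(printf '%s\n' "$line" | sed 's/<!--Flag>On<\/Flag-->/<Flag>On<\/Flag>/')

echo "$with_hash"    # <Flag>On</Flag>
echo "$with_slash"   # <Flag>On</Flag>
```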

Note that you should never use regular expressions to parse [X]HTML.

Finally, as a general rule, when working with regular expressions, less is more. You should try to specify the simplest possible exclusive pattern rather than repeat all text. This not only makes your code much easier to read, it also avoids problems like the one you were facing. For example:

sed -i -r 's/<!--(UpdateAccountGUIDs.+)-->/<\1>/' File.XML

Here, the -r enables extended regular expression syntax so we can use () to capture a group (without needing to escape the parentheses) and then refer to the captured text as \1. Note that sed has no non-greedy quantifiers, so .+ is greedy: the command looks for a comment that starts with UpdateAccountGUIDs, extends to the last end-of-comment marker (-->) on the line, and then does the replacement. That is fine as long as a line contains at most one such comment.
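
One caveat is worth checking before relying on this pattern: sed's regular expressions (basic and extended) have no non-greedy quantifiers, so the captured group runs to the last `-->` on the line, not the first. A small sketch with GNU sed and a made-up tag `A`, using a line that holds two comments:

```shell
#!/bin/bash
two='<!--A>x</A--> <!--A>y</A-->'   # hypothetical line with two comments

# .+ is greedy: the match starts at the first "<!--" and ends at the LAST "-->".
greedy=$(printf '%s\n' "$two" | sed -E 's/<!--(A.+)-->/<\1>/')

echo "$greedy"   # <A>x</A--> <!--A>y</A>
```

With at most one comment per line, the greedy and the intended behaviour coincide.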

Ubuntu – sed: difference between "sed 's/\-.*//'" and "sed '/\-/s/\-.*//gw'"

There is no difference between these in effect:

sed 's/\-.*//g'
sed '/\-/s/\-.*//g'

The first form acts on all lines; the second form uses an address, so it acts only on those lines which match /-/. Since the substitution pattern itself contains -, both commands end up changing only the lines that contain -.

Now if you'd used /Raja/ as an address instead, the substitution would have been performed only on lines that contain Raja, so you'd only have seen the last line changed.
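
The effect of the address can be checked on a tiny input. This is a sketch with made-up sample lines (the original data is not shown in the question); a literal - needs no backslash here:

```shell
#!/bin/bash
# Hypothetical sample: two lines with a dash, one without.
sample=$(printf '%s\n' 'Raja-01' 'plain' 'Ravi-02')

# No address: the substitution is attempted on every line.
no_address=$(printf '%s\n' "$sample" | sed 's/-.*//g')

# Address /-/: the substitution is attempted only on lines containing a dash.
dash_address=$(printf '%s\n' "$sample" | sed '/-/s/-.*//g')

# Address /Raja/: only lines containing Raja are changed.
raja_address=$(printf '%s\n' "$sample" | sed '/Raja/s/-.*//g')

echo "$no_address"     # Raja, plain, Ravi (one per line)
echo "$dash_address"   # identical to the output above
echo "$raja_address"   # Raja, plain, Ravi-02 (one per line)
```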