Shell – Remove all lines except D

shell-script

I have a scenario where three huge files, Test.txt, Test2.txt and Test3.txt, have the following contents.

H|||||||||||||||||||||||
D||||||||||||||||||||||||
D|||||||||||||||||||||||
H|||||||||||||||||||||
D||||||||||||||||||||||||
D||||||||||||||||||||||||
T||||||||||||||||||||||||

I have to delete all lines except the D lines.
The result should look like this in all three files (more than 10 GB):

D||||||||||||||||||||||||
D|||||||||||||||||||||||
D||||||||||||||||||||||||
D||||||||||||||||||||||||

After retaining only the D lines in Test.txt, Test2.txt and Test3.txt,
I have to merge them into a new file.

I have done this using sed:

sed '/^\('D'\)|/!d' $Filename.txt >>  $NewFilename.txt

But because the files are huge, it is taking a very long time.

Is there a more efficient way to do this with another command?

Best Answer

cat Test.txt Test2.txt Test3.txt | LC_ALL=C grep '^D' > newfile.txt

Or:

for file in Test.txt Test2.txt Test3.txt; do
  LC_ALL=C grep '^D' < "$file"
done > newfile.txt

Or, if your grep supports the -h option as GNU grep does (to avoid printing file names):

LC_ALL=C grep -h '^D' Test.txt Test2.txt Test3.txt > newfile.txt

Setting LC_ALL=C stops grep from trying to parse the input as UTF-8. Because the pattern ^D is anchored, grep only needs to look at the first character of each line. grep, especially GNU grep, is generally a lot faster than sed.
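
Before running this on the real files, it may be worth trying the filter on a small sample. The following is only a sketch: the scratch directory, the sample records and the name newfile_strict.txt are made up for illustration. It also shows one difference from the sed expression in the question: '^D' keeps any line that begins with the letter D, while '^D|' additionally requires the separator right after it (in grep's default basic regular expressions, | is a literal character).

```shell
# Hypothetical scratch directory and sample data, for illustration only.
dir=$(mktemp -d)

printf '%s\n' 'H|a|b' 'D|1|2' 'D|3|4' 'T|x|y' > "$dir/Test.txt"
printf '%s\n' 'H|c|d' 'D|5|6' 'T|x|y'         > "$dir/Test2.txt"
printf '%s\n' 'D|7|8' 'Dummy|9' 'T|x|y'       > "$dir/Test3.txt"

# Keep only lines whose first character is D, from all three files in order.
cat "$dir/Test.txt" "$dir/Test2.txt" "$dir/Test3.txt" |
  LC_ALL=C grep '^D' > "$dir/newfile.txt"

# Stricter variant: require the | separator right after the D,
# as the sed expression in the question did.
cat "$dir/Test.txt" "$dir/Test2.txt" "$dir/Test3.txt" |
  LC_ALL=C grep '^D|' > "$dir/newfile_strict.txt"
```

With this sample, the first output also contains the Dummy|9 line, and the strict one does not. If every D record in the real data is followed by |, both patterns give the same result.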

Related Solutions

Shell Script – Adding Lines to the Beginning and End of a Huge File

sed -i uses temporary files as an implementation detail, which is what you are experiencing. However, prepending data to the beginning of a data stream without overwriting the existing contents requires rewriting the file; there is no way around that, even when avoiding sed -i.

If rewriting the file is not an option, you might consider manipulating it when it is read, for example:

{ echo some prepended text ; cat file ; } | command
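
The same grouping can add text at the end as well. A minimal sketch, assuming a throwaway sample file; the variable names and the sample lines are hypothetical, and the command substitution merely stands in for whatever command consumes the stream:

```shell
# Hypothetical sample file, for illustration only.
file=$(mktemp)
printf '%s\n' 'line 1' 'line 2' > "$file"

# Emit a header, the unchanged file contents, then a footer as one stream.
# Here the stream is captured in a variable; in practice it would be piped
# to the consuming command instead.
combined=$( { echo 'some prepended text'; cat "$file"; echo 'some appended text'; } )

printf '%s\n' "$combined"
```

The file on disk is never modified, so no temporary copy of the huge file is needed.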

Also, sed is for editing streams -- a file is not a stream. Use a program that is meant for this purpose, like ed or ex. The -i option to sed is not only not portable, it will also break any symlinks to your file, since it essentially deletes it and recreates it, which is pointless.

You can do this in a single command with ed like so:

ed -s file << 'EOF'
0a
prepend these lines
to the beginning
.
$a
append these lines
to the end
.
w
EOF

Note that depending on your implementation of ed, it may use a paging file, requiring you to have at least that much space available.

Shell – How to write a sed script to delete numbers from a line

Given this input, you want to keep the first and last fields. Pretty simple with awk:

awk '{print $1, $NF}' filename

Using sed, this removes every space-delimited, digit-only word (the \+ quantifier is a GNU extension):

sed ':a; s/ [[:digit:]]\+ / /; ta'
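
To compare the two approaches, here is a small sketch on a made-up input line (the sample text and variable names are assumptions, not from the question). The sed loop is spelled with separate -e expressions and [[:digit:]][[:digit:]]* instead of \+, which should also work with non-GNU sed:

```shell
# Hypothetical sample line: a word, some numeric fields, and a final word.
input='alpha 10 20 30 omega'

# awk: print only the first and last whitespace-separated fields.
awk_out=$(printf '%s\n' "$input" | awk '{print $1, $NF}')

# sed: repeatedly delete one digit-only word that has a space on both
# sides, branching back to label a until no substitution happens.
sed_out=$(printf '%s\n' "$input" |
  sed -e ':a' -e 's/ [[:digit:]][[:digit:]]* / /' -e 'ta')

printf '%s\n' "$awk_out" "$sed_out"
```

On this input both commands print alpha omega. They differ when a middle field is not purely numeric: awk drops it anyway, while sed leaves it in place.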
