Remove Duplicates – Remove Partial Duplicates Consecutive Lines but Keep First and Last

awksedtext processing

I have log files with a time stamp and six values in each line
i want to reduce the amount of data, by removing consecutive lines with the same values (ignoring time stamps) and keeping the first and last line of each duplicate set. Preferably using a bash script. It should be a magic sed or awk command combination.

Even if i have to parse the file multiple times, reading 3 lines at a time and removing the middle one, is a good solution.

original file:

1447790360      99999   99999   20.25   20.25   20.25   20.50
1447790362      20.25   20.25   20.25   20.25   20.25   20.50
1447790365      20.25   20.25   20.25   20.25   20.25   20.50
1447790368      20.25   20.25   20.25   20.25   20.25   20.50
1447790371      20.25   20.25   20.25   20.25   20.25   20.50
1447790374      20.25   20.25   20.25   20.25   20.25   20.50
1447790377      20.25   20.25   20.25   20.25   20.25   20.50
1447790380      20.25   20.25   20.25   20.25   20.25   20.50
1447790383      20.25   20.25   20.25   20.25   20.25   20.50
1447790386      20.25   20.25   20.25   20.25   20.25   20.50
1447790388      20.25   20.25   99999   99999   99999   99999
1447790389      99999   99999   20.25   20.25   20.25   20.50
1447790391      20.00   20.25   20.25   20.25   20.25   20.50
1447790394      20.25   20.25   20.25   20.25   20.25   20.50
1447790397      20.25   20.25   20.25   20.25   20.25   20.50
1447790400      20.25   20.25   20.25   20.25   20.25   20.50

desired result:

1447790360      99999   99999   20.25   20.25   20.25   20.50
1447790362      20.25   20.25   20.25   20.25   20.25   20.50
1447790386      20.25   20.25   20.25   20.25   20.25   20.50
1447790388      20.25   20.25   99999   99999   99999   99999
1447790389      99999   99999   20.25   20.25   20.25   20.50
1447790391      20.00   20.25   20.25   20.25   20.25   20.50
1447790394      20.25   20.25   20.25   20.25   20.25   20.50
1447790400      20.25   20.25   20.25   20.25   20.25   20.50

Best Answer

With awk one liner:

awk '{n=$2$3$4$5$6$7}l1!=n{if(p)print l0; print; p=0}l1==n{p=1}{l0=$0; l1=n}END{print}' file

The whole point is to manipulate few variables: n stores all fields except first in current line, l1 the same for previous line and l0 the whole previous line. The p is just a flag to mark if previous line was already printed.

Related Solutions

Shell – Print first and last line of all files in folder

So I like sed the answer can be

for file in file.log.*
do
   echo "file: $file"
   echo -n "first line: "
   cat "$file" | sed -n '/^\s*$/!{p;q}'
   echo -n "last line: "
   tac "$file" | sed -n '/^\s*$/!{p;q}'
done

Remove duplicate lines while keeping the order of the lines

I doubt it will make a difference but, just in case, here's how to do the same thing in Perl:

perl -ne 'print if ++$k{$_}==1' out.txt

If the problem is keeping the unique lines in memory, that will have the same issue as the awk you tried. So, another approach could be:

cat -n out.txt | sort -k2 -k1n  | uniq -f1 | sort -nk1,1 | cut -f2-

How it works:

On a GNU system, cat -n will prepend the line number to each line following some amount of spaces and followed by a <tab> character. cat pipes this input representation to sort.
sort's -k2 option instructs it only to consider the characters from the second field until the end of the line when sorting, and sort splits fields by default on white-space (or cat's inserted spaces and <tab>).
When followed by -k1n, sort considers the 2nd field first, and then secondly—in the case of identical -k2 fields—it considers the 1st field but as sorted numerically. So repeated lines will be sorted together but in the order they appeared.
The results are piped to uniq—which is told to ignore the first field (-f1 - and also as separated by whitespace)—and which results in a list of unique lines in the original file and is piped back to sort.
This time sort sorts on the first field (cat's inserted line number) numerically, getting the sort order back to what it was in the original file and pipes these results to cut.
Lastly, cut removes the line numbers that were inserted by cat. This is effected by cut printing only from the 2nd field through the end of the line (and cut's default delimiter is a <tab> character).

To illustrate:

$ cat file
bb
aa
bb
dd
cc
dd
aa
bb
cc
$ cat -n file | sort -k2 | uniq -f1 | sort -k1 | cut -f2-
bb
aa    
dd
cc

Best Answer

Related Solutions

Shell – Print first and last line of all files in folder

Remove duplicate lines while keeping the order of the lines

Related Question