Remove Consecutive Partial-Duplicate Lines but Keep the First and Last

awk, sed, text processing

I have log files with a timestamp and six values on each line.
I want to reduce the amount of data by removing consecutive lines that have the same values (ignoring the timestamps), while keeping the first and last line of each duplicate set. Preferably this would be a bash script, ideally a single sed or awk command.

Even a solution that has to parse the file multiple times, reading three lines at a time and removing the middle one, would be fine.

Original file:

1447790360      99999   99999   20.25   20.25   20.25   20.50
1447790362      20.25   20.25   20.25   20.25   20.25   20.50
1447790365      20.25   20.25   20.25   20.25   20.25   20.50
1447790368      20.25   20.25   20.25   20.25   20.25   20.50
1447790371      20.25   20.25   20.25   20.25   20.25   20.50
1447790374      20.25   20.25   20.25   20.25   20.25   20.50
1447790377      20.25   20.25   20.25   20.25   20.25   20.50
1447790380      20.25   20.25   20.25   20.25   20.25   20.50
1447790383      20.25   20.25   20.25   20.25   20.25   20.50
1447790386      20.25   20.25   20.25   20.25   20.25   20.50
1447790388      20.25   20.25   99999   99999   99999   99999
1447790389      99999   99999   20.25   20.25   20.25   20.50
1447790391      20.00   20.25   20.25   20.25   20.25   20.50
1447790394      20.25   20.25   20.25   20.25   20.25   20.50
1447790397      20.25   20.25   20.25   20.25   20.25   20.50
1447790400      20.25   20.25   20.25   20.25   20.25   20.50

Desired result:

1447790360      99999   99999   20.25   20.25   20.25   20.50
1447790362      20.25   20.25   20.25   20.25   20.25   20.50
1447790386      20.25   20.25   20.25   20.25   20.25   20.50
1447790388      20.25   20.25   99999   99999   99999   99999
1447790389      99999   99999   20.25   20.25   20.25   20.50
1447790391      20.00   20.25   20.25   20.25   20.25   20.50
1447790394      20.25   20.25   20.25   20.25   20.25   20.50
1447790400      20.25   20.25   20.25   20.25   20.25   20.50

Best Answer

With an awk one-liner:

awk '{n=$2 FS $3 FS $4 FS $5 FS $6 FS $7} l1!=n{if(p)print l0; print; p=0} l1==n{p=1} {l0=$0; l1=n} END{if(p)print l0}' file

The whole point is to keep track of a few variables: n holds all fields except the first for the current line, l1 holds the same for the previous line, and l0 holds the whole previous line. p is a flag that is set when the previous line repeated the one before it and therefore has not been printed yet.
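
The same logic can be spelled out as a longer script, which may be easier to follow and adapt. This is only a sketch under a few assumptions: the function name dedupe_runs and the variable names are made up for illustration, the key is built from every field after the first (so it is not tied to exactly six values), and the fields are joined with a separator so adjacent values cannot run together. The last line is printed in END only if it was held back, so a file that ends in a non-repeated line does not get that line twice.

```shell
#!/bin/sh
# Hypothetical helper: reads log lines on stdin, writes the reduced log to stdout.
dedupe_runs() {
    awk '
    {
        # Key = all fields except the timestamp, joined with the field separator.
        key = $2
        for (i = 3; i <= NF; i++) key = key FS $i

        if (NR == 1 || key != prev_key) {
            # A new run starts. If the previous run had more than one line,
            # its last line is still pending, so print it first.
            if (pending) print prev_line
            print            # first line of the new run
            pending = 0
        } else {
            # Same values as the previous line: hold this line back for now.
            pending = 1
        }
        prev_line = $0
        prev_key = key
    }
    END {
        # Flush the last line of the final run if it was held back.
        if (pending) print prev_line
    }
    '
}
```

It could be used as dedupe_runs < input.log > reduced.log, where both file names are placeholders. Only one previous line is kept in memory, so the file is read in a single pass regardless of its size.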
