Bash – Reformat tables

bash, shell, table

I have some tables (table.txt) that were built incorrectly and contain redundant results, as follows:

YEAR MONTH DAY RES
1971 1     1   245
1971 1     2   587
...
1971 12    31  685
1971 1     1   245
1971 1     2   587
...
1971 12    31  685
1972 1     1   549
1972 1     2   746
...

Instead I would like to have:

YEAR MONTH DAY RES
1971 1     1   245
1971 1     2   587
...
1971 12    31  685
1972 1     1   549
1972 1     2   746
...

So the problem is that the results appear twice in the table. In the example above, after the year '1971' I would expect the year '1972', not '1971' again. Is there a way to delete the redundant results using sh/bash?

I should note that my data run day by day from 1971 through 2099, and that they have exactly the same format even after the year 2000, as follows:

YEAR MONTH DAY RES
1971 1     1   245
1971 1     2   587
...
2000 1     1   875
2000 1     2   456
...
2099 12    31  321

Best Answer

Here are two mutually exclusive sed loops:

sed -ne'p;/ 12 * 31 /!d;:n' -e'n;//!bn' <<""
YEAR MONTH DAY RES
1971 1     1   245
1971 1     2   587
...
1971 12    31  685
1971 1     1   245
1971 1     2   587
...
1971 12    31  685
1972 1     1   549
1972 1     2   746
...
1972 12    31  999
1972 1     1   933
1972 1     2   837
...
1972 12    31  343

YEAR MONTH DAY RES
1971 1     1   245
1971 1     2   587
...
1971 12    31  685
1972 1     1   549
1972 1     2   746
...
1972 12    31  999

Basically sed has two states - print and eat. In the first state - the print state - sed prints every input line with the p command and then checks it against the / 12 * 31 / pattern. If the current pattern space does not match (the ! negates the address), it is deleted, and sed pulls in the next input line and starts the script again from the top - at the print command - without attempting to run anything that follows the delete command.

When an input line does match / 12 * 31 /, however, sed falls through to the second half of the script - the eat loop. First it defines a branch label named n (:n); then it overwrites the current pattern space with the next input line (n), and then it compares the current pattern space to the last matched pattern (the empty // address). Because the line that matched before has just been overwritten with the next one, the first iteration of this eat loop doesn't match, and every time it does not (!), sed branches back to the :n label to get the next input line and once again compare it to the last matched pattern.

When another match is finally made - some 365 lines later - sed completes its script without printing the line (-n suppresses the automatic print), pulls in the next input line, and starts again from the top at the print command in its first state. So each loop state falls through to the next on the same key and does as little as possible in the meantime to find the next key.
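
For reference, the same script can be written one command per line with comments. This is only a sketch of the walkthrough above: dedupe_years is a made-up wrapper name, the comments are not part of the original script, and it assumes a sed that accepts comment lines and leading blanks inside a script (GNU sed does).

```shell
# Hypothetical wrapper around the sed script from the answer.
# Reads the doubled table on standard input, writes the cleaned table.
dedupe_years() {
    sed -n '
        # print state: print the current line
        p
        # anything but a Dec 31 line is deleted, which restarts
        # the script from the top with the next input line
        / 12 * 31 /!d
        # eat state: replace the pattern space with the next line
        # (nothing is printed under -n) until a Dec 31 line shows up
        :n
        n
        //!bn
    '
}
```

Piping the doubled sample table from above through dedupe_years should give the same result as the one-liner, for example: dedupe_years <table.txt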

Note that the entire script completes without invoking a single editing routine, and that it needs only to compile the single regexp. The automaton that results is very simple - it understands only [123 ] and [^123 ]. What's more, at least half of the comparisons will very likely be made without any compilations, because the only address referenced in the eat loop at all is the // empty one. sed can therefore complete that loop entirely with a single regexec() call per input line. sed may do something similar for the print loop as well.


timed


I was curious about how the various answers here might perform, and so I came up with my own table:

dash <<""
    d=0 D=31 IFS=: set 1970 1
    while   case  "$*:${d#$D}" in (*[!:]) ;;
            ($(($1^($1%4)|(d=0))):1:)
                     D=29 set $1 2;;
            (*:1:)   D=28 set $1 2;;
            (*[3580]:)
                     D=30 set $1 $(($2+1));;
            (*:)     D=31 set $(($1+!(t<730||(t=0)))) $(($2%12+1))
            esac
    do      printf  '%-6d%-4d%-4d%d\n' "$@" $((d+=1)) $((t+=1))
    done|   head    -n1000054 >/tmp/dates

dash <<<''  6.62s user 6.95s system 166% cpu 8.156 total

That puts a million+ lines in /tmp/dates and doubles the output for each of years 1970 - 3338. The file looks like:

tail -n1465 </tmp/dates | head; echo; tail </tmp/dates

3336  12  27  728
3336  12  28  729
3336  12  29  730
3336  12  30  731
3336  12  31  732
3337  1   1   1
3337  1   2   2
3337  1   3   3
3337  1   4   4
3337  1   5   5

3338  12  22  721
3338  12  23  722
3338  12  24  723
3338  12  25  724
3338  12  26  725
3338  12  27  726
3338  12  28  727
3338  12  29  728
3338  12  30  729
3338  12  31  730

...some of it anyway.
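
If the dash generator is hard to follow, a plainer way to build a doubled test table is sketched below. It is not the script that produced /tmp/dates: make_doubled_table is a hypothetical helper, it applies the full Gregorian leap-year rule, and it restarts the RES counter for each copy so that the two copies of a year are identical, as in the question.

```shell
# Hypothetical helper: print every day of each year from $1 to $2 twice,
# in the same column layout as /tmp/dates. RES is just a day-of-year
# counter here, not real data.
make_doubled_table() {
    awk -v first="$1" -v last="$2" 'BEGIN {
        split("31 28 31 30 31 30 31 31 30 31 30 31", mdays, " ")
        for (y = first; y <= last; y++) {
            # 1 for a leap year, 0 otherwise (Gregorian rule)
            leap = (y % 4 == 0 && y % 100 != 0) || y % 400 == 0
            for (copy = 1; copy <= 2; copy++) {
                n = 0
                for (m = 1; m <= 12; m++) {
                    days = mdays[m] + (m == 2 && leap)
                    for (d = 1; d <= days; d++)
                        printf "%-6d%-4d%-4d%d\n", y, m, d, ++n
                }
            }
        }
    }'
}
```

Something like make_doubled_table 1971 2099 >/tmp/doubled (the path is only an example) gives a file sized like the question's data to try the commands on.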

And then I tried the different commands on it:

for  cmd in "sort -uVk1,3" \
            "sed -ne'p;/ 12 * 31 /!d;:n' -e'n;//!bn'" \
            "awk '"'{u=$1 $2 $3 $4;if (!a[u]++) print;}'\'
do   eval   "time ($cmd|wc -l)" </tmp/dates
done

500027
( sort -uVk1,3 | wc -l; ) \
1.85s user 0.11s system 280% cpu 0.698 total

500027
( sed -ne'p;/ 12 * 31 /!d;:n' -e'n;//!bn' | wc -l; ) \
0.64s user 0.09s system 110% cpu 0.659 total

500027
( awk '{u=$1 $2 $3 $4;if (!a[u]++) print;}' | wc -l; ) \
1.46s user 0.15s system 104% cpu 1.536 total

The sort and sed commands both completed in less than half the time awk did - and these results were typical. I did run them several times. It appears all of the commands are writing out the correct number of lines as well - and so they probably all work.
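
Equal line counts do not prove that the outputs are identical. A stricter check, sketched here, runs two of the commands on the same file and compares the results byte for byte. same_output is a hypothetical helper, and it evals its arguments in the same way as the timing loop above, so only pass it command lines you trust.

```shell
# Hypothetical helper.
# $1: input file; $2 and $3: command lines that read standard input.
# Returns 0 if both commands produce identical output, 1 if they differ.
same_output() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 2
    eval "$2" <"$1" >"$tmp_a"
    eval "$3" <"$1" >"$tmp_b"
    cmp -s "$tmp_a" "$tmp_b"
    status=$?
    rm -f "$tmp_a" "$tmp_b"
    return "$status"
}
```

It could be called as: same_output /tmp/dates "sort -uVk1,3" "sed -ne'p;/ 12 * 31 /!d;:n' -e'n;//!bn'" && echo same || echo different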

sort and sed were fairly well neck and neck - with sed generally a hair ahead - for completion time on every run, but sort does more actual work to achieve its results than either of the other two commands. It runs parallel jobs to complete its task and benefits a great deal from my multi-core cpu. awk and sed both peg the single core assigned to them for the entire time they process.

The results here are from a standard, up-to-date GNU sed, but I did try another. In fact, I tried all three commands with other binaries, but only the sed command actually worked with my heirloom tools. The others, I guess because of non-standard syntax, simply quit with an error before getting off the ground.

It is good to use standard syntax when possible - that way you can freely use simpler, more honed, and more efficient implementations in many cases:

PATH=/usr/heirloom/bin/posix2001:$PATH; time ...

500027
( sed -ne'p;/ 12 * 31 /!d;:n' -e'n;//!bn' | wc -l; ) \
0.31s user 0.12s system 136% cpu 0.318 total
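
On the standard-syntax point: -V is not a POSIX sort option, which may be part of why the sort command failed with the other binaries. A sketch of a sort call that sticks to POSIX options follows. It is untested against the heirloom tools, dedupe_with_sort is a made-up name, and it assumes the duplicated rows are identical, since -u keeps only one line per date.

```shell
# Hypothetical wrapper: sort by year, month and day as numbers and keep
# one line per date. A non-numeric header line compares as 0 and so
# sorts to the top.
dedupe_with_sort() {
    sort -u -k1,1n -k2,2n -k3,3n
}
```

It reads the table on standard input in the same way as the other commands, for example: dedupe_with_sort <table.txt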