I have some tables (table.txt) that were built incorrectly and contain redundant results, as follows:
YEAR MONTH DAY RES
1971 1 1 245
1971 1 2 587
...
1971 12 31 685
1971 1 1 245
1971 1 2 587
...
1971 12 31 685
1972 1 1 549
1972 1 2 746
...
Instead I would like to have:
YEAR MONTH DAY RES
1971 1 1 245
1971 1 2 587
...
1971 12 31 685
1972 1 1 549
1972 1 2 746
...
So the problem is that the results are presented twice in the table. That means (with the provided example) that after year 1971 I should expect year 1972, not 1971 again. Is there a way to delete the redundant results using sh/bash?
I should note that my data run day by day from 1971 through 2099, and that they have exactly the same format even after year 2000, as follows:
YEAR MONTH DAY RES
1971 1 1 245
1971 1 2 587
...
2000 1 1 875
2000 1 2 456
...
2099 12 31 321
Best Answer
Here are two mutually exclusive `sed` loops:

Basically `sed` has two states - print and eat. In the first state - the `p`rint state - `sed` automatically `p`rints every input line then checks it against the `/ 12 * 31 /` pattern. If the current pattern space does `!`not match it is `d`eleted and `sed` pulls in the next input line and starts the script again from the top - at the `p`rint command - without attempting to run anything that follows the `d`elete command at all.

When an input line does match `/ 12 * 31 /`, however, `sed` falls through to the second half of the script - the eat loop. First it defines a branch `:`label named `n`; then it overwrites the current pattern space with the `n`ext input line, and then it compares the current pattern space to the `//` last matched pattern. Because the line that matched it before has just been overwritten with the `n`ext one, the first iteration of this eat loop doesn't match, and every time it does `!`not, `sed` `b`ranches back to the `:n` label to get the `n`ext input line and once again compare it to the `//` last matched pattern.

When another match is finally made - some 365 `n`ext lines later - `sed` does `-n`ot automatically print it when it completes its script, pulls in the next input line, and starts again from the top at the `p`rint command in its first state. So each loop state will fall through to the next on the same key and do as little as possible in the meantime to find the next key.

Note that the entire script completes without invoking a single editing routine, and that it needs only to compile the single regexp. The automaton that results is very simple - it understands only `[123 ]` and `[^123 ]`. What's more, at least half of the comparisons will very likely be made without any compilations, because the only address referenced in the eat loop at all is the `//` empty one. `sed` can therefore complete that loop entirely with a single `regexec()` call per input line. `sed` may do similar for the `p`rint loop as well.

timed
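For reference while reading the timings, the script described above can be reconstructed from that description. This is a minimal sketch, assuming `-n` plus only the commands named in the walkthrough (`p`, `d`, `:n`, `n`, `b`); the function name `dedup_years` is hypothetical and not from the answer:

```shell
# Reconstructed from the prose description; treat as a sketch.
# dedup_years is a hypothetical wrapper name.
dedup_years() {
    # print state: p prints the line; unless it is a "12 31" line, d restarts
    #              the script with the next input line
    # eat state:   n overwrites the pattern space with the next line (nothing
    #              is printed because of -n); //!bn loops back to :n until the
    #              last-used regexp matches again
    sed -n -e 'p;/ 12 * 31 /!d;:n' -e 'n;//!bn' "$@"
}
```

With no arguments the function reads standard input, so `dedup_years table.txt` and `dedup_years < table.txt` should behave the same. The label is kept at the end of its own `-e` fragment so that the script does not depend on how a given `sed` terminates label names.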
I was curious about how the various answers here might perform, and so I came up with my own table:
That puts a million+ lines in `/tmp/dates` and doubles the output for each of years 1970 - 3338. The file looks like:

...some of it anyway.
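A table of this shape could be generated in several ways. The following is only a sketch and not the command used for the timings: the function name `gen_dates` is hypothetical, and the `RES` column is arbitrary filler computed from the date so that both copies of a year are identical.

```shell
# Hypothetical generator; RES values are made-up filler, not real results.
gen_dates() {   # usage: gen_dates FIRST_YEAR LAST_YEAR
    awk -v first="$1" -v last="$2" 'BEGIN {
        split("31 28 31 30 31 30 31 31 30 31 30 31", len, " ")
        print "YEAR MONTH DAY RES"
        for (y = first; y <= last; y++) {
            leap = (y % 4 == 0 && y % 100 != 0) || y % 400 == 0
            for (pass = 1; pass <= 2; pass++)          # emit every year twice
                for (m = 1; m <= 12; m++) {
                    days = len[m] + (m == 2 && leap)   # 29 days in a leap February
                    for (d = 1; d <= days; d++)
                        print y, m, d, (y * 31 + m * 7 + d) % 1000
                }
        }
    }'
}
```

Running `gen_dates 1970 3338 > /tmp/dates` would write a file covering the same range of years, with each year appearing twice in a row.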
And then I tried the different commands on it:
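A minimal sketch of such a comparison follows. It assumes the other answers used the common `awk '!seen[$0]++'` and `sort -u` idioms, which is a guess rather than something stated here, and `compare_dedup` is a hypothetical helper name. It only checks that the three commands write the same number of lines; it does not time them.

```shell
# Sketch only: the awk and sort commands are assumed stand-ins for the
# other answers, and compare_dedup is a hypothetical name.
compare_dedup() {   # usage: compare_dedup FILE
    printf 'sed %s\n'  "$(sed -n -e 'p;/ 12 * 31 /!d;:n' -e 'n;//!bn' "$1" | wc -l | tr -d ' ')"
    printf 'awk %s\n'  "$(awk '!seen[$0]++' "$1" | wc -l | tr -d ' ')"
    # sort -u also reorders the lines, so only its line count is comparable
    printf 'sort %s\n' "$(sort -u "$1" | wc -l | tr -d ' ')"
}
```

In bash, an individual command can then be timed with the `time` keyword, for example `time sort -u /tmp/dates >/dev/null`.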
The `sort` and `sed` commands both completed in less than half the time `awk` did - and these results were typical. I did run them several times. It appears all of the commands are writing out the correct number of lines as well - and so they probably all work.

`sort` and `sed` were fairly well neck and neck - with `sed` generally a hair ahead - for completion time for every run, but `sort` does more actual work to achieve its results than either of the other two commands. It is running parallel jobs to complete its task and benefits a great deal from my multi-core cpu. `awk` and `sed` both peg the single-core assigned them for the entire time they process.

The results here are from a standard, up-to-date GNU `sed`, but I did try another. In fact, I tried all three commands with other binaries, but only the `sed` command actually worked with my heirloom tools. The others, as I guess due to non-standard syntax, simply quit with error before getting off the ground.

It is good to use standard syntax when possible - you can freely use more simple, honed, and efficient implementations in many cases that way: