To remove duplicates based on a single column, you can use awk:
awk '!seen[$1]++' input-file > output-file
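This prints a line only the first time its first field ($1) appears: seen[$1]++ evaluates to 0 on the first occurrence, and the negation makes that true. You can see a fuller explanation for this in this Unix & Linux post.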
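As a quick illustration of that one-liner, here is a minimal sketch run on made-up data. The file names sample.txt and deduped.txt and the sample lines are hypothetical, chosen only to show which line of each key survives.

```shell
# Hypothetical sample data: field 1 is the key, field 2 a version number.
printf '%s\n' 'a 1' 'b 1' 'a 2' 'c 1' 'b 2' > sample.txt

# Keep only the first line seen for each value of field 1.
awk '!seen[$1]++' sample.txt > deduped.txt

# Show the result: the first line for each of the keys a, b and c.
cat deduped.txt
```

Note that the duplicates do not need to be adjacent here, because the seen array remembers every key for the whole run.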
Removing the older lines, that is, keeping only the last line of each group of duplicates, is more complicated. Provided that duplicates are always adjacent, you can do:
awk 'prev && ($1 != prev) {print seen[prev]} {seen[$1] = $0; prev = $1} END {print seen[$1]}' input-file > output-file
Here, in the middle block, {seen[$1] = $0} saves the current line ($0) to the seen array with the first field ($1) as index, then saves the first field in the prev variable. This prev is used in the first block when processing the next line.
In the first block, we then check that prev is set (only true from the second line onwards) and that it differs from the current first field (prev was set while processing the previous line). If both conditions hold, we have moved past a group of duplicates and can print the last line saved for it. In the END block, we do that again for the last group.
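To make the walkthrough concrete, here is a small sketch on invented data. The file names and sample lines are hypothetical. The first command is the one from the answer and assumes adjacent duplicates. The second is a variant that is not part of the original answer: it drops the adjacency assumption by remembering the last line per key and the order in which keys first appear.

```shell
# Hypothetical data with adjacent duplicates; later lines are the newer ones.
printf '%s\n' 'a old' 'a new' 'b only' 'c old' 'c mid' 'c new' > grouped.txt

# The command from the answer: print the last line of each adjacent group.
awk 'prev && ($1 != prev) {print seen[prev]} {seen[$1] = $0; prev = $1} END {print seen[$1]}' grouped.txt > latest.txt
cat latest.txt

# Hypothetical data where duplicates of a key are NOT adjacent.
printf '%s\n' 'a old' 'b only' 'a new' > scattered.txt

# Variant (an assumption, not from the answer): record each key once in
# order[], overwrite last[key] on every line, and print everything at END.
awk '!($1 in last) {order[++n] = $1} {last[$1] = $0} END {for (i = 1; i <= n; i++) print last[order[i]]}' scattered.txt > latest_scattered.txt
cat latest_scattered.txt
```

The variant holds one line per distinct key in memory, while the original command only ever needs the current group, so the original is the better fit for large, already grouped input. One caveat worth checking against your data: the prev test is false when the key is 0 or empty, so such a group would be skipped; testing NR > 1 instead of prev avoids that.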
Best Answer
There are ongoing discussions on this matter at community.metabrainz.org (see https://community.metabrainz.org/t/how-can-i-remove-all-of-my-duplicate-music/20495/8 for example).
Additionally, a page on wiki.musicbrainz.org shows an example of how you could find some duplicates:
https://wiki.musicbrainz.org/History:Find_Duplicate_Music_Files
The wiki page boils down to this: get the deprecated libtunepimp (https://wiki.musicbrainz.org/History:libtunepimp/Download), install it, and use the included trm utility in the following script:
Afterwards, /tmp/trmdupls.log should contain a list of all those duplicates. Make sure there are no false positives in it before deleting any of those files.