Sorting based on last occurrence of character

sort

I want to sort the file based on

  • the text before the first comma, and
  • as a second-level sort, the date (YYYYMMDD) in the file name, as in the sample input:

Sample Input:

AX,NO,NO,\PathAX/subj/defn/some-file-name-20151229.txt,
CXX,NO,NO,\PathCXX/subj/defn/some-file-name-20151229.txt,
CXX,YES,YES,\Path/subj/defn/some-file-20140503.txt,6
DY,YES,MAYBE,\PathDY/subj/defn/some-file-name-20140720.txt,6

I can solve the first part with sort -t, -k1,1, but I don't know how to do the second part, as the number of dashes in the file name is not fixed. Thus I can't just specify the column number with - as the separator.

Desired output:

AX,NO,NO,\PathAX/subj/defn/some-file-name-20151229.txt,
CXX,YES,YES,\Path/subj/defn/some-file-20140503.txt,6
CXX,NO,NO,\PathCXX/subj/defn/some-file-name-20151229.txt,
DY,YES,MAYBE,\PathDY/subj/defn/some-file-name-20140720.txt,6

Notes:

In case it helps:

  • The YYYYMMDD follows the last dash in the file name and precedes the file extension.
  • The dash prior to YYYYMMDD is always present, thus at least one dash.
  • There are no other dots in the file, except for the file extension.

I can think of a two-pass process where the last - is replaced by some other character that does not appear in the file, the file is sorted, and the character is then replaced back with a dash, but I am wondering if there is an easier way.

Platform:

  • MacOS 10.9.5

Best Answer

A general approach for this kind of problem (sorting on a field that just can't be described as the Nth on the line) is to rewrite the lines to prepend the sorting key(s), then sort, then remove the sorting keys. You can use more flexible tools such as sed or awk to determine the sorting keys.

I don't know exactly how you determine where the date is. I'll go for the last sequence of 8 digits on the line; tweak as necessary.

sed 's/.*\([0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\)/\1,&/; t; s/^/,/' |
sort -t, -k2,2 -k1,1 |
cut -d, -f 2-

The t; s/^/,/ bit in the sed script prepends a comma (an empty sorting key) if the line doesn't contain any sequence of 8 digits. Without it, the final cut step would remove the first comma-separated field of such a line.
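
The same prepend, sort, remove idea can also be sketched with awk, using the rule stated in the question (the date follows the last dash and ends at the file extension) rather than searching for 8 digits. This is only a sketch under assumptions that are not part of the answer: the path is taken to be the 4th comma-separated field, and sort_by_key_then_date is a hypothetical helper name.

```shell
# Hypothetical helper: reads lines on stdin and writes them sorted by
# field 1, then by the date that follows the last dash in field 4.
# Steps: awk prepends the date as a temporary first field, sort orders on
# the original field 1 (now field 2) and then on the date (now field 1),
# and cut drops the temporary field again.
sort_by_key_then_date() {
    awk -F, '{
        # Assumption: the path is the 4th comma-separated field.
        n = split($4, parts, "-")   # parts[n] is the text after the last dash
        key = parts[n]
        sub(/\..*$/, "", key)       # strip the extension, leaving YYYYMMDD
        print key "," $0            # prepend the date as a temporary field
    }' |
    sort -t, -k2,2 -k1,1 |
    cut -d, -f2-
}
```

It could be called as sort_by_key_then_date < input.csv, where input.csv is a placeholder for the file to sort. Splitting on the dash avoids hard-coding the number of digits, but it relies on the question's guarantee that at least one dash is present and that the only dot in the file name is the one before the extension.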
