SSIS – How to Parse Extensions from Path-Filename Strings

sql serverssisssis-2012

I'm importing a large set (126M rows) of file data from CSVs. Import works fine, except I'm trying to derive a column as I import rather than trying to script it out afterward (due to the number of rows).
SSIS is not terribly helpful with string manipulation functions, and the possibilities I have are pretty complicated as well.

The derived column needs to reflect the file extensions. File extensions range from 1-10 characters (anything longer can be truncated at 10), and are (of course) separated from the file name and path by a period. They don't correspond to a concise list of file types (something like "docx, xlsx, accdb, msg" are Office types). The file path may also have periods in it. And in some cases it will not have files at the end of the path, just a folder name itself.
Some examples:

\\Server\Share\named resource
\\Server\Share\this.looks.like.a.file_but.it.isnt\its_a_directory
\\Server\Share\Group\Year.Month.Day.Subfolder\File.ext
\\Server\Share\Team Folder 1.404\another.file.here.extension

So I need to be able to extract the extensions up to 10 characters.
I tried TOKEN() but with folder names possibly having periods, that didn't work well – adding \ as a token delimiter helped some, but I still got odd extensions from folder names.
I can't find a combination of SUBSTRING(), RIGHT(), FINDSTRING(), and/or TOKEN() that will meet the rules and the derived column definition doesn't allow logic like IF or IIF.

Some false positives are expected and I plan to sort them out after importing. If it helps here, I have a second column in the CSVs which is extracted by SSIS, it's the size of the file (for folders it populates a 0). I haven't gotten this to matter either, because of the lack of IF or IIF in derived column definition.

Best Answer

Firstly, the IF/IIF you are looking for is the "?" Conditional Operator

? Conditional

One way to implement the Derived Column would be to parse the string like this

 TOKENCOUNT( TOKEN( REVERSE( RTRIM( name ) ), "\\", 1), "." ) == 1
 ? "" : REVERSE( TOKEN( TOKEN( REVERSE( RTRIM( name ) ), "\\", 1), ".", 1) )

In this case, "name" is the column that holds the string to be parsed.

What we are doing here is:

  1. Remove any trailing white-space with RTRIM( name )
  2. Reverse the string so we are dealing with the end as the beginning with REVERSE()
  3. Grab the first "token" delimited by a backslash using TOKEN()
  4. Count how many "." delimiters are in the result using TOKENCOUNT()
  5. Test the count of "." delimited tokens is equal to 1 using ?
  6. If the count is 1, return an empty string - no extension.
  7. Otherwise return the first token before the "." using TOKEN()
  8. Reverse the extension to get back to "normal" using REVERSE()