I have a question on how to parse out a column and hoping a SSIS/Database guru can help out 🙂
Here is the problem. I have a column saved as a string/varchar format in SQL Server, and need to parse out data from it. I have Business Intelligence Development Studio(BIDS) and SSIS support. I am familiar with using the tool to use derived columns and splitting the data like that using BIDS.
The problem is the data, and here's an example of it.
Easy to parse with derived columns
24 Year
2 Month
3 Month
2 Month
8 Year
7 Year
No clue how to approach to parse
1x Month
3x per Month
1-2 x per month
6 days per month
So what i want to essentially do is, if data is standard I want to split the column into two columns, Number(int) and DayType(string), and if it falls into the 2nd category (erratic data), put it in a 3rd column, Other. Any suggestions on how to approach this problem?
Added Clarification: I want to put the non standard data into another 3rd column, so I can look at the data, and run commands on that particular column, to edit out some of the fluff, then make the data workable in the application.
For example, 3x per month, take out the "x per" and then parse that into the columns Number, and Daytype.
Also where there are cases of 1-2, I would like to take the higher number, and drop the lower number, so for example 4-6 days a week, keep 6, drop "a week" and keep days
Best Answer
For something like this the optimal solution of course is to control your input. That said, the reality is you have to parse the supplied input.
For something as complex as your parsing, I'd skip the
Derived Column transformation
and go straight for aScript transformation
. I select my source column,Input
and create three output columns: number, trash and Interval. number and Interval will hold the parsed values while trash will only be populated when the script can't make heads or tails from the input.I use two member variables, numbersRegex and periodDomain. periodDomain is just a list with the acceptable values. For string comparisons, I force everything to lowercase and hope for English. numbersRegex is a regular expression that is used to identify digits in a string.
For every row that comes in, the script will split the Input value based on whitespace. For each of those tokens, I test whether the token has a digit in it. If it does, we'll call the
GetBiggestNumber
method. Otherwise, we'll call theValidatePeriodDomain
Once all the tokens have been processed, then it's important to make certain both values have been set.GetBiggestNumber
attempts to look at all the groupings of number and find the largest set.ValidatePeriodDomain
attempts to compare the current value to a known list of acceptable values.Using the above script, you can see how it slices and dices the Input data. I made a minor change between the code that generated the below screenshot and what's posted. I observed that the input value 9000 was assigned to Row.number but as that Row never had an Interval assigned, I deferred the actual Row population to the end of the script (it was in the