For something like this, the optimal solution of course is to control your input. That said, the reality is that you have to parse the input you're supplied.
For something as complex as your parsing, I'd skip the Derived Column transformation and go straight for a Script transformation. I select my source column, Input, and create three output columns: number, trash, and Interval. number and Interval will hold the parsed values, while trash will only be populated when the script can't make heads or tails of the input.
I use two member variables, numbersRegex and periodDomain. periodDomain is just a list with the acceptable values. For string comparisons, I force everything to lowercase and hope for English. numbersRegex is a regular expression that is used to identify digits in a string.
For every row that comes in, the script splits the Input value on whitespace. For each of those tokens, I test whether the token contains a digit. If it does, we call the GetBiggestNumber method; otherwise, we call the ValidatePeriodDomain method. Once all the tokens have been processed, it's important to make certain both values have been set.
GetBiggestNumber looks at all the groups of digits in a token and returns the largest value.
ValidatePeriodDomain compares the current token against a known list of acceptable values.
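If it helps to see the logic outside SSIS first, here's an illustrative Python paraphrase of the same token loop (the regex and domain list mirror the script below; this is a sketch, not the component itself):

```python
import re

NUMBERS = re.compile(r"\d+")
PERIOD_DOMAIN = {"year", "month"}

def parse(value):
    """Split on whitespace; digit-bearing tokens yield the biggest
    number, other tokens are checked against the period domain."""
    number, period = None, None
    for token in value.split():
        if NUMBERS.search(token):
            # largest run of digits within this token
            number = max(int(m) for m in NUMBERS.findall(token))
        elif token.lower() in PERIOD_DOMAIN:
            period = token
    if number is None or period is None:
        return None  # this row would land in the trash column
    return number, period
```

For example, `parse("7 Year")` yields both values, while an input like `"9000"` alone fails the final check and would be routed to trash.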
using System.Text.RegularExpressions;
using System.Collections.Generic;
/// <summary>
/// The amazing script transformation
/// </summary>
[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
Regex numbersRegex;
List<string> periodDomain;
/// <summary>
/// Initialize member variables
/// </summary>
public override void PreExecute()
{
base.PreExecute();
// match one or more consecutive digits
this.numbersRegex = new Regex(@"\d+", RegexOptions.Compiled);
this.periodDomain = new List<string>(){ "year", "month" };
}
/// <summary>
/// Parse the incoming data
/// </summary>
/// <param name="Row">The row that is currently passing through the component</param>
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
string[] parts = Row.Input.Split();
string period = string.Empty;
int? foo = null;
foreach (string token in parts)
{
// try to do something with it
// If the token has a digit in it, then we'll extract the largest value
// if no digits, then the first token matching our domain is preserved
if (this.numbersRegex.IsMatch(token))
{
foo = GetBiggestNumber(token);
}
else
{
if (ValidatePeriodDomain(token))
{
period = token;
}
}
}
// At this point, we've processed all the tokens.
// If the local variables are still in their initial states, then we didn't find
// anything of note and need to populate the Row.trash column instead.
// We use local variables because output columns are write-only; you can't read back Row.number
if (period == string.Empty || (foo == null))
{
Row.trash = Row.Input;
}
else
{
Row.number = foo.Value;
Row.Interval = period;
}
}
private bool ValidatePeriodDomain(string token)
{
return (this.periodDomain.Contains(token.ToLower()));
}
private int? GetBiggestNumber(string token)
{
int? bigOne = null;
int? current = null;
// Get all the groups of numbers and compare them
foreach (Match item in this.numbersRegex.Matches(token))
{
current = int.Parse(item.Value);
if (!bigOne.HasValue || current.Value > bigOne.Value)
{
bigOne = current;
}
}
return bigOne;
}
}
Using the above script, you can see how it slices and dices the Input data. I made a minor change between the code that generated the below screenshot and what's posted: I observed that the input value 9000 was assigned to Row.number, but as that row never had an Interval assigned, I deferred the actual Row population to the end of the script (it was originally inside the foreach loop).
Usually the way you'd go about this is to have a Script component and write your VB / C# code to handle this type of input. If you have a variable number of columns, then you should forget about SSIS being able to handle this, since it doesn't work well with an unbounded number of columns (tables, after all, have a finite number of columns).
Having said this, if you have a finite number of columns in the input file, then what you can do is specify the delimiter (the semicolon in the above case), have SSIS split the data into the columns, and then do the ETL process on them, if needed.
If you have an upper limit on the number of columns, but sometimes get files with fewer columns, then place a script task before the DFT module to append the extra delimiters that are needed.
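The delimiter-padding idea can be sketched per line; here's a minimal Python illustration of what such a script task would do (the column count of 8 and the semicolon delimiter are assumptions for the example):

```python
EXPECTED_COLUMNS = 8  # assumed upper limit of columns
DELIMITER = ";"

def pad_line(line):
    """Append delimiters so every line reaches the expected column count."""
    missing = EXPECTED_COLUMNS - (line.count(DELIMITER) + 1)
    return line + DELIMITER * max(missing, 0)
```

A short row like `a;b;c` gets five trailing semicolons appended, while a full-width row passes through unchanged, so the flat file source downstream always sees the same layout.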
Since the first row looks like the header row (column names) you can set the flat file source to skip the first row.
Another alternative, if you have a small number of file formats, would be to use a different name for each type of log file and have an individual DFT module handle each type.
Hope this helps.
Best Answer
You don't really have much control over how the SSIS Balanced Data Distributor distributes data, but you could influence the order in which the data is received by the BDD.
The pipeline buffer size can be influenced by properties such as DefaultBufferMaxRows and DefaultBufferSize, or AutoAdjustBufferSize. Check the docs on data flow performance features for details on how it's all calculated.
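As a back-of-the-envelope illustration (not the engine's exact algorithm; the defaults shown are the documented SSIS defaults of 10 MB and 10,000 rows), the row count per buffer is bounded by whichever limit is hit first:

```python
def rows_per_buffer(row_size_bytes,
                    default_buffer_size=10_485_760,
                    default_buffer_max_rows=10_000):
    """Rough estimate of rows per pipeline buffer: the byte budget
    divided by the row size, capped by the max-rows property."""
    return min(default_buffer_max_rows,
               default_buffer_size // row_size_bytes)
```

With a 2,000-byte row, the byte budget is the binding constraint; with a 100-byte row, DefaultBufferMaxRows caps the buffer first.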
So in short, you'll have a hard time figuring out exactly how many records are sent out each time the BDD switches to a new output.
The BDD distributes data by taking whatever is in the incoming buffer and sending it to the next output. Knowing that, you could perhaps sort the incoming data in a way that lets you avoid locks later on. Remember, though, that each flow after the BDD will probably do some processing of its own, and there is no way to tell whether each path will process its data in the same amount of time, so sorting the data before the BDD doesn't guarantee it's still in the same order later on.