We need to do some reporting on values that are usually mixed strings of numbers and letters that need to be sorted 'naturally'. Things like, e.g. "P7B18" or "P12B3". @The strings will mostly be sequences of letters then numbers alternating. The number of these segments and the length of each could vary, though.
We'd like the numeric portions of these to be sorted in numeric order. Obviously, if I just handle those string values directly with ORDER BY
, then "P12B3" is going to come before "P7B18", since "P1" is earlier than "P7", but I'd like the reverse, as "P7" naturally precedes "P12".
I'd also like to be able to do range comparisons, e.g. @bin < 'P13S6'
or some such. I don't have to handle floating point or negative numbers; these will strictly be non-negative integers that we're dealing with. String lengths and number of segments could potentially be arbitrary, with no fixed upper bounds.
In our case, string casing isn't important, though if there's a way to do this in a collation-aware fashion, others might find that useful. The ugliest part of all this is I'd like to be able to do both ordering, and range filtering in the WHERE
clause.
If I were doing this in C#, it would be a pretty simple task: do some parsing to separate the alpha from the numeric, implement IComparable, and you're basically done. SQL Server, of course, doesn't appear to offer any similar functionality, at least as far as I'm aware.
Anybody know any good tricks to make this work? Is there some little-publicized ability to create custom CLR types that implement IComparable and have this behave as expected? I'm also not opposed to Stupid XML Tricks (see also: list concatenation), and I've got CLR regex matching/extracting/replacement wrapper functions available on the server as well.
EDIT:
As a slightly more detailed example, I'd want the data to behave something like this.
SELECT bin FROM bins ORDER BY bin
bin
--------------------
M7R16L
P8RF6JJ
P16B5
PR7S19
PR7S19L
S2F3
S12F0
i.e. break the strings into tokens of all letters or all numbers, and sort them either alphabetically or numerically respectively, with the leftmost tokens being the most significant sorting term. Like I mentioned, piece of cake in .NET if you implement IComparable, but I don't know how (or if) you can do that sort of thing in SQL Server. It's certainly not something I've ever come across in 10 or so years of working with it.
Best Answer
Want a sensible, efficient means of sorting numbers in strings as actual numbers? Consider voting for my Microsoft Connect suggestion: Support "natural sorting" / DIGITSASNUMBERS as a Collation option
There is no easy, built-in means of doing this, but here is a possibility:
Normalize the strings by reformatting them into fixed-length segments:
VARCHAR(50) COLLATE Latin1_General_100_BIN2
. The max length of 50 might need to be adjusted based on the max number of segments and their potential maximum lengths.AFTER [or FOR] INSERT, UPDATE
Trigger such that you are guaranteed of properly setting the value for all records, even those coming in via ad hoc queries, etc. Of course, that scalar UDF can also be handled via SQLCLR, but it would need to be tested to determine which one was actually more efficient. **UPPER()
function to the final result of all segments (so that it only needs to be done once and not per segment). This will allow for proper sorting given the binary collation of the sort column.AFTER INSERT, UPDATE
Trigger on the table that calls the UDF to set the sort column. To improve performance, use theUPDATE()
function to determine if this code column is even in theSET
clause of theUPDATE
statement (simplyRETURN
if false), and then join theINSERTED
andDELETED
pseudo-tables on the code column to only process rows that have changes in the code value. Be sure to specifyCOLLATE Latin1_General_100_BIN2
on that JOIN condition to ensure accuracy in determining if there is a change.Example:
In this approach, you can sort via:
And you can do range filtering via:
or:
Both the
ORDER BY
and theWHERE
filter should use the binary collation defined forSortColumn
due to Collation Precedence.Equality comparisons would still be done on the original value column.
Other thoughts:
Use a SQLCLR UDT. This might could work, though it is unclear if it presents a net-gain as compared to the approach described above.
Yes, a SQLCLR UDT can have its comparison operators overridden with custom algorithms. This handles situations where the value is being compared to either another value that is already the same custom type, or one that needs to be implicitly converted. This should handle the range filter in a
WHERE
condition.With regards to sorting the UDT as a regular column type (not a computed column), this is only possible if the UDT is "byte ordered". Being "byte ordered" means that the binary representation of the UDT (which can be defined in the UDT) naturally sorts in the appropriate order. Assuming that the binary representation is handled similarly to the approach described above for the VARCHAR(50) column that has fixed-length segments that are padded, that would qualify. Or, if it was not easy to ensure that the binary representation would naturally be ordered in the proper way, you could expose a method or property of the UDT that outputs a value that would be properly ordered, and then create a
PERSISTED
computed column on that method or property. The method needs to be deterministic and marked asIsDeterministic = true
.Benefits of this approach are:
Parse
method of the UDT takes in theP7B18
value and converts it, then you should be able to simply insert the values naturally asP7B18
. And with the implicit conversion method set in the UDT, the WHERE condition would also allow for using simply P7B18`.Consequences of this approach are:
PERSISTED
computed column on a property or method of the UDT, then you would get the representation returned by the property or method. If you want the originalP7B18
value, then you need to call a method or property of the UDT that is coded to return that representation. Since you have to override theToString
method anyway, that is a good candidate for providing this.It is unclear (at least to me right now since I have not tested this part) how easy/difficult it would be to make any changes to the binary representation. Changing the stored, sortable representation might require dropping and re-adding the field. Also, dropping the Assembly containing the UDT would fail if used in either manner, so you would want to make sure that there was nothing else in the Assembly besides this UDT. You can
ALTER ASSEMBLY
to replace the definition, but there are some restrictions on that.On the other hand, the
VARCHAR()
field is data that is disconnected from the algorithm so it would only require updating the column. And if there are tens of millions of rows (or more) then that can be done in a batched approach.Implement the ICU library which actually allows for doing this alphanumeric sorting. While highly functional, the library only comes in two languages: C/C++ and Java. Which means you might need to either do some tweaks to get it to work in Visual C++, or there is the off chance that the Java code can be converted to MSIL using IKVM. There are one or two .NET side projects linked on that site that provide a COM interface that can be accessed in managed code, but I believe they have not been updated in a while and I have not tried them. The best-bet here would be to handle this in the app layer with the goal of generating sort keys. The sort keys would then be saved into a new sort column.
This might not be the most practical approach. However, it is still very cool that such an ability exists. I provided a more detailed walk-through of an example of this in the following answer:
Is there a collation to sort the following strings in the following order 1,2,3,6,10,10A,10B,11?
But the pattern being dealt with in that question is a bit simpler. For an example showing that the type of pattern being dealt with in this Question also works, please go to the following page:
ICU Collation Demo
Under "Settings", set the "numeric" option to "on" and all of the others should be set to "default". Next, to the right of the "sort" button, uncheck the option for "diff strengths" and check the option for "sort keys". Then replace the list of items in the "Input" text area with the following list:
Click the "sort" button. The "Output" text area should display the following:
Please note that the sort keys are structure in multiple fields, separated by commas. Each field needs to be sorted independently, so that presents another small problem to solve if needing to implement this in SQL Server.
** If there is any concern about performance regarding the use of User-Defined Functions, please note that the proposed approaches make minimal use of them. In fact, the main reason for storing the normalized value was to avoid calling a UDF per each row of each query. In the primary approach, the UDF is used to set the value of
SortColumn
, and that is only done uponINSERT
andUPDATE
via the Trigger. Selecting values is much more common than inserting and updating, and some value are never updated. Per eachSELECT
query that uses theSortColumn
for a range filter in theWHERE
clause, the UDF is only needed one time per each of the range_start and range_end values to get the normalized values; the UDF is not called per-row.With regards to the UDT, the usage is actually the same as with the scalar UDF. Meaning, inserting and updating would call the normalization method once per each row to set the value. Then, the normalization method would be called once per query per each range_start and range_value in a range filter, but not per row.
A point in favor of handling the normalization entirely in a SQLCLR UDF is that given it is not doing any data access and is deterministic, if it is marked as
IsDeterministic = true
, then it can participate in parallel plans (which might help theINSERT
andUPDATE
operations) whereas a T-SQL UDF will prevent a parallel plan from being used.