Are strings in table columns stored as raw bit patterns or as Unicode? Does the database engine load them into memory in a character representation different from the disk-persisted one for better performance?
I want to know for the purposes of index creation and optimal query writing, since select statements pull data that matches a where clause.
Here is an example query:
select name from carTable where name = 'FordEscort97'
Here is the carTable:
name columnA columnB columnC
FordEscort91
FordEscort92
FordEscort93
FordEscort94
FordEscort95
FordEscort96
FordEscort97
FordEscort98
Would queries perform faster if the column were designed with the year at the left of the stored strings (assuming the year component of the string is the most unique part)?
If there is a character-by-character, left-to-right matching process, it could be more resource-intensive than comparing hashes of the bit patterns of the strings. If the pattern being matched (the anchor string in the where clause) were a substring with a wildcard, each comparison operation would be more CPU-intensive: when characters early in the column value match the wildcard in the leftmost part of the string, a second potential match may begin late in the comparison, in the rightmost part. Recognizing a non-match early would avoid much of that evaluation. But I don't know for sure that matching is done character by character.
There is an implication for creating indexes on unique columns. While the likelihood of uniqueness of the strings in a column should be considered, the layout of the data within those strings can be chosen arbitrarily. Would an index provide a greater benefit if the indexed column's uniqueness occurred mostly within the leftmost characters rather than the rightmost? Does it depend on which database you are using (e.g., Oracle, MySQL, Postgres)? Sometimes the underlying data is stored in hexadecimal format or as bit patterns.
Best Answer
Computers only deal with 1's and 0's (i.e. binary); datatypes indicate how to interpret that info.
If your data has "components" that need to be parsed out, that will never perform well regardless of RDBMS or indexing. The entire purpose of having different fields and different datatypes is to represent each component individually according to what it actually is. As @Michael pointed out in his comment, your one field is poorly modeled and should be three fields: make, model, and year.
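As a sketch of that redesign (the table, column, and index names here are illustrative, not from the original schema):

```sql
-- Hypothetical redesign: one component per column, typed appropriately.
CREATE TABLE carTable (
    make      VARCHAR(20) NOT NULL,  -- e.g. 'Ford'
    model     VARCHAR(30) NOT NULL,  -- e.g. 'Escort'
    modelYear SMALLINT    NOT NULL   -- e.g. 1997 (a number, not a substring)
);

-- Index the columns you actually filter on; a numeric year compares
-- cheaply, with no character-by-character scanning.
CREATE INDEX IX_carTable_year_make_model
    ON carTable (modelYear, make, model);

-- The original query becomes simple, index-friendly equality predicates:
SELECT make, model, modelYear
FROM   carTable
WHERE  modelYear = 1997 AND make = 'Ford' AND model = 'Escort';
```

With the year isolated as a numeric column, the question of where the "unique part" sits inside a string disappears entirely.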
Then you can index them as appropriate. Comparing 91 (or, better, 1991) to 1991 as a numeric comparison is faster than partial-matching that value as a string against 1991FordEscort, or even than comparing it if it were a separate string field. The degree to which it is faster depends on several factors, including the size of the field, the field's collation, and vendor-specific implementations and features.
Left-to-right vs right-to-left, case-sensitive vs case-insensitive, etc. is usually determined by the collation of the field and the language/locale/LCID of the database. But searching on fields can't be done by matching hashes: hashes can only rule out exact matches; they cannot prove a positive match (due to collisions) or help with partial matches. If you need strict character matching, look into using a BINARY collation.
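To illustrate the collation and wildcard points (MySQL syntax is assumed here; other vendors spell the collation differently):

```sql
-- A trailing wildcard keeps the leftmost characters anchored, so a
-- B-tree index on name can still be used to seek to matching rows:
SELECT name FROM carTable WHERE name LIKE 'FordEscort%';

-- A leading wildcard defeats that index: there is no fixed prefix to
-- seek on, so every row's string must be examined:
SELECT name FROM carTable WHERE name LIKE '%97';

-- Strict, byte-for-byte matching (case- and accent-sensitive) via a
-- binary collation, assuming a utf8mb4 column:
SELECT name FROM carTable
WHERE  name = 'FordEscort97' COLLATE utf8mb4_bin;
```

This is also why, if you must keep a single string column, putting the most selective characters at the left helps: both index seeks and string comparisons proceed left to right.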