How is encoding done in SQL database tables for pattern matching

Tags: collation, encoding, string-representation, unicode

Are strings in table columns represented as bit patterns or Unicode? Does the database engine bring them into memory in a different character set from the disk-persisted representation for performance?

I want to know for the purposes of index creation and optimal query writing. Select statements pull data that matches a where clause.

Here is an example query:

select name from carTable where name = 'FordEscort97'

Here is the carTable:

name            columnA    columnB    columnC
FordEscort91
FordEscort92
FordEscort93
FordEscort94
FordEscort95
FordEscort96
FordEscort97
FordEscort98

Would the queries perform faster if the column were designed so that the year appeared at the left of the name in the stored strings (assuming the year component of the string is the most unique part)?

If there is a character-by-character, left-to-right matching process, the process could be more resource-intensive than evaluating hashes of the bit patterns representing the strings being compared. If the pattern being matched (the anchor string in the WHERE clause) were a substring with a wildcard, each comparison operation would be more CPU-intensive: a column value might start out matching the wildcard in the leftmost part of the string, while a second potential match could begin in the rightmost part, late in the comparison. Recognizing a non-match early would obviate much of that evaluation. But I don't know for sure that there is a character-by-character pattern match.
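To make the wildcard case concrete, these are the two shapes of pattern I have in mind against my carTable; my understanding (which I would like confirmed) is that only the first can make use of an index on name:

-- Fixed prefix: an index on name could seek to 'FordEscort9...'
-- and scan only the matching range.
select name from carTable where name like 'FordEscort9%'

-- Leading wildcard: the match could begin anywhere in the string,
-- so every row's value would generally have to be examined.
select name from carTable where name like '%97'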

There is an implication for creating indexes on unique columns. While the likelihood of uniqueness of the strings in a column should be considered, how the data is laid out within the column is a design choice. Could an index provide a greater benefit if the column's uniqueness occurred mostly within the leftmost characters rather than the rightmost characters? Does it depend on the RDBMS you are using (e.g., Oracle, MySQL, Postgres, etc.)? Sometimes the underlying data is stored in hexadecimal format or as bit patterns.
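For example, here is the kind of difference I am asking about, using an index on the column as currently stored (purely illustrative):

create index ix_carTable_name on carTable (name)

-- Year at the right: the first ten characters ('FordEscort') are
-- identical in every row, so comparisons diverge only at the end.
select name from carTable where name = 'FordEscort97'

-- Year at the left (the proposed layout): the very first characters
-- would already distinguish the rows.
select name from carTable where name = '97FordEscort'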

Best Answer

Are strings in table columns represented as bit patterns or Unicode?

Computers only deal with 1s and 0s (i.e., binary); datatypes indicate how to interpret that information.

Would the queries perform faster if the column were designed so that the year appeared at the left of the name in the stored strings (assuming the year component of the string is the most unique part)?

If your data has "components" that need to be parsed out, that will never perform well regardless of RDBMS or indexing. The entire purpose of having different fields and different datatypes is to represent each component individually according to what it actually is. As @Michael pointed out in his comment, your one field is poorly modeled and should be three fields:

Make  VARCHAR(something)
Model VARCHAR(something)
Year  INT (or maybe SMALLINT)
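As a sketch, the remodeled table and its indexes might look like the following (the names, sizes, and choice of indexes are placeholders, not a prescription):

-- Hypothetical remodeled table (called "cars" here to distinguish it
-- from the original carTable): each component is its own column.
CREATE TABLE cars (
    Make  VARCHAR(50) NOT NULL,
    Model VARCHAR(50) NOT NULL,
    Year  SMALLINT    NOT NULL
);

-- Index what you actually filter on; a composite index suits queries
-- that combine the components.
CREATE INDEX ix_cars_year       ON cars (Year);
CREATE INDEX ix_cars_make_model ON cars (Make, Model);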

Then you can index them as appropriate. Comparing 91 (or, really, 1991) to 1991 as a numeric comparison is faster than partially matching that value as a string against '1991FordEscort', or even comparing it against a separate string field. The degree to which it is faster depends on several factors, including the size of the field, the field's collation, and vendor-specific implementations and features.
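To illustrate, a lookup against the hypothetical cars table above becomes a plain numeric comparison that an index can seek on, instead of a partial string match:

-- Numeric equality on a SMALLINT is cheap to evaluate, and it can
-- seek directly into ix_cars_year.
SELECT Make, Model, Year
FROM   cars
WHERE  Year = 1991 AND Make = 'Ford' AND Model = 'Escort';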

If there is a character-by-character, left-to-right matching process, the process could be more resource-intensive than evaluating hashes of the bit patterns representing the strings being compared.

Left-to-right vs right-to-left, case-sensitive vs case-insensitive, etc. are usually determined by the collation of the field and the language/locale/LCID of the database. But searching on fields can't be a matter of matching hashes: hashes can only rule out exact matches; they cannot prove a positive match (due to collisions) or help with partial matches. If you need strict character-for-character matching, look into using a BINARY collation.
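For example, a binary collation can be applied right in the query; the collation names below are real SQL Server and MySQL names, but check what your vendor and version actually provide:

-- SQL Server: byte-wise comparison, no linguistic rules applied.
SELECT name FROM carTable
WHERE  name = 'FordEscort97' COLLATE Latin1_General_BIN2;

-- MySQL (assuming a utf8mb4 column): the binary collation for
-- that character set.
SELECT name FROM carTable
WHERE  name = 'FordEscort97' COLLATE utf8mb4_bin;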