T-sql – Lookup Table Decisions – Lookup Based On Characters In Specific Positions

schemat-sql

(I do not know of a better wording for the title, but I am open to any edits on that)

Disclaimer: not in anyway close to being a DBA, but I have at least basic knowledge. I am a software developer that deals with front and back end, and I am using T-SQL.

Say that I am managing distribution center data, and there was a requirement that each product needs a numbering system that identifies specific information, such that position 3 specified category, position 4 specified exception, and position 5, 6, and 7 "could" specify type.

For positions 5, 6, and 7, they are used in order to define the type, where if 5 has a valid type value, then 6 and 7 do not represent the logical type, then if 5 does not have a valid value, then check 6 and so on.

For that example, order is determined by the starting index (5 is first, 6 is second, and 7 is last). An instance has come up where position 6 has a specific value that forces the order to change. Let us say that when position 6 is N, then its order is determined to be after position 7.

  • Position 5
    • Order 1
      • A
      • B
  • Position 6
    • Order 2
      • 2
    • Order 4
      • N
  • Position 7
    • Order 3
      • 1

As far as a lookup table, the current concept is to take the substring of the provided value and use the specified position as the starting index. The idea of a variable length has been reviewed, and the idea has sparked some variations that could be achieved for the "newer" case.

Here are three ideas on usage for the lookup:

  1. Specify each index and corresponding value, specify precedence, on joins use ROW_NUMBER in respect to order
  2. Specify the base index, specify the value as each possible combination (such as 'BN1' ties to the lookup for position 7, or 'A21' ties to position 5)
  3. The same concept as 2, except that the values are stored with regular expressions ('[^AB]N[^1]' ties to position 6)

For idea 3, it seems like an attempt to push the row count down and does not give a whole lot to helping the subsequent queries that could be used for the table.

For idea 2 and 3, the length of the value would usually be 3, so that each position is represented, but when position 5 occurs the length used can be 1.

For idea 1, one of the main queries that I see when using that style of table is to have the ROW_NUMBER being used in a sub-query, then the main query will have a ROW_NUMBER = 1 check to get the lowest order represented.

All three ideas work, and there are not many differences query-wise that I can see. As for the table itself, the differences are mainly in the possible size for the reference values (larger to accept regular expressions or combinations rather than single characters), and a column for order. It is difficult to see if there will be any changes in the foreseeable future where more values or different ordering styles could be added.

I am having trouble seeing where the over-engineering line is for this specifically, since it is a smaller dataset.

As far as the ideas above, are there any pros and cons that are not currently being expressed? For the table and queries that would be used, is there a concept/format with lookup tables that proves to be more self-explanatory?

it is more likely that there will not be changes to the table for some time (decisions made at a much higher level and tend to be slow) and that someone new (to the team or the tables) would be reviewing this table and associated queries.

Here is the rough draft table schema as a script:

CREATE TABLE ProductNumberLookup
(
    StartingIndex INT NOT NULL,
    OrderId INT NOT NULL, --This column does not exist for ideas 2 and 3
    ValueLength INT NOT NULL,
    ValueDescription VARCHAR(20) NOT NULL,
    TypeValue VARCHAR(20) NOT NULL
)

I am not necessarily looking for the best practices, and I can see where the above content could be from that side, but I am looking for any perspectives that are being missed when considering design in respect to the ideas specified above. I also do not know if the usage and questions around it makes this mostly subjective. Please let me know if any re-wording will help.

Edit (Requested concrete examples):

Product Id examples:

  • BCDS192
    • would give back null since there is no associated value
  • DFS2N31
    • would give back the type value associated with position 7 being 1
  • SNJ4XN1
    • would give back the type value associated with position 7 being 1 because position 6 has the value making it last in ordering
  • CVB2A21
    • would give back the type value associated with 5 being A since it is first in order the other 2 positions are not used even though they have a valid value

As for type value expected:

  • position 5
    • when A, get Type Value (future ordered)
    • when B, get Type Value (back dated)
  • position 6
    • when 2, get Type value (order/payment in future)
    • when N, get Type value (re-order/damaged on transport)
  • position 7
    • When 1, get Type Value (re-order/payment in past)

These are arbitrary results, but the outcome follows the pattern.

The query would contain a join similar to this:

Left Join ProductNumberLookup pnl
On SUBSTRING (provided.ProductId, pnl.startingindex, pnl.valuelength) = pnl.valuedescription

Best Answer

I'm not sure I grok the intended results here, but consider these additions:

  • A calculated and persisted column on your Inventory table (is that a good name for the first table?) on SUBSTRING(ProductID, 5, 3).
  • Create the proposed ProductNumberLookup table, with a VARCHAR(3) for the PK. Depending on the universe of valid product codes, you may be able to populate this once; otherwise, you may need to check for new codes on a nightly or near-real-time basis.
  • Join Inventory to ProductNumberLookup on I.CalculatedProductCode = PNL.ProductCode. There should be exactly one record in PNL for each record in Inventory.

Product ID "12345" will have calculated code "5" and match the lookup record "5". Product ID "1234567" will have calculated code "567" and match the lookup record "567". The join will be narrow (three bytes, assuming you're just using ANSI for your codes) and SARGable, and the lookup table will be small (no more than ~50,000 records), so this should be fast.

The downside is an extra field in your Inventory table, which I'm guessing is very large. If that's unacceptable, you can refrain from persisting the column, but you'll probably need an index scan every time you join. I trust that there is an index on I.ProductID.


As far as keeping the ProductNumberLookup table populated, that should not be a problem. Assuming there are only (26 letters + 10 numbers + null) ^ 3 = ~50,000 possibilities, you can easily populate this table with all possible values at creation time. For any product codes which are not currently valid, set ValueDescription and TypeValue to "Invalid" or "Unknown". Any reports which use this table can INNER JOIN, confident that there will be a record on the other end.

The nice thing about this approach is that if someone creates a new product code without telling you about it (and you know they will), their reports will still total correctly.

If you choose instead to only populate the table with known good values, you'll need to periodically scan the Inventory table for missing values:

SELECT DISTINCT CalculatedProductCode
FROM Inventory AS I
WHERE NOT EXISTS (SELECT * FROM ProductNumberLookup AS PNL WHERE I.CalculatedProductCode = PNL.ProductCode)

...and insert the resulting values. However often you run it, there will be a window in which Inventory records might have no matching ProductNumberLookup record, possibly causing reports to under-count, or NULL exceptions to propagate through your apps. That might be acceptable if the universe of product codes was huge, but with just three characters I'd stick to populating all 50,000 ahead of time.

Rather than running this scan on a scheduled basis, you could put an INSERT/UPDATE trigger on Inventory, to insert the calculated code into ProductNumberLookup if it does not already exist. There would be overhead, but it may be acceptable.