PostgreSQL – How to Use Column with Hash Based on Other Columns as Unique Index

postgresqlpostgresql-9.3

My previous problem is still not fixed: We cannot set a unique index because NULL values are allowed…

we use this table in our database:

CREATE TABLE offer (
   offer_id   serial PRIMARY KEY
 , product_id int NOT NULL REFERENCES product
 , price_old  numeric(10,2);
 , price      numeric(10,2); 
 , price_alt  text          -- overrules price if present
 , valid_from timestamp NOT NULL
 , valid_to   timestamp     -- optional
 --   more attributes of the offer
 , CONSTRAINT some_kind_of_price_required
      CHECK (price IS NOT NULL OR price_alt IS NOT NULL)
);

And we use a unique index which doesn't work:
ALTER TABLE offers ADD CONSTRAINT offer_unique_index UNIQUE(product_id, price_old, price, price_alt, valid_from, valid_to);

But the unique index doesn't work because price_old, price and price_alt can be NULL…. It's possible that only price is filled, only price and price_old or only price_alt…

Possible solution:

Now I might have a solution, but I don't know how to realize it exactly…

I thought if I could use NEWID() or MD5() to create an extra column named "uid" (which contains a hash based on the data of all other columns) and add a unique_index on that "uid" column, it could fix the problem. So when I try to insert the exact same values (prices, valid dates etc.), it will create a same hash (because the values are the same, obviously) and that triggers the unique index violation for the hash column.

Nice idea? But more important: possible?

Best Answer

You could use a UNIQUE INDEX instead of the UNIQUE CONSTRAINT using the function coalesce to treat null as a regular value for your uniqueness:

CREATE UNIQUE INDEX ON offer (
       product_id,
       coalesce(price_old,-1),
       coalesce(price,-1),
       coalesce(price_alt,''),
       valid_from,
       valid_to);

This will enforce your uniqueness as you described.

For details between a Unique constraint and a unique index,see: https://stackoverflow.com/questions/23542794/postgres-unique-constraint-vs-index

Related Solutions

PostgreSQL – Return Unique Column Combinations Based on Where Clause

The execution plan shown does not seem to match the big SELECT DISTINCT query because the Sort and Unique steps are missing. Anyway you are correct than when retrieving ~50% of a table, index don't help. The best strategy is a big sequential scan of the main table and only fast hardware helps with that.

For the 2nd part of the question:

How would I go about selecting only the unique combinations of adjacent columns? Is this too complicated a task to perform through a database query? Would it speed up the query?

To remove duplicate combinations of adjacent columns, the structure of the resultset should be changed so that each output row has only one couple of adjacent columns along with their corresponding dimensions in the parallel coordinates graph. Well, except that the dimension for the 2nd column is not necessary since it's always the dimension for the other column plus one.

In one single query, this could be written like this:

WITH logs as (
  SELECT log_time_mapped, syslog_priority_mapped, 
     operation_mapped, message_code_mapped, protocol_mapped, 
     source_ip_mapped, destination_ip_mapped, 
     source_port_mapped, destination_port_mapped, 
     destination_service_mapped, direction_mapped, 
     connections_built_mapped, connections_torn_down_mapped, 
     hourofday_mapped, meridiem_mapped 
  FROM firewall_logs_mapped 
  WHERE operation = 'Built')
SELECT DISTINCT 1, log_time_mapped, syslog_priority_mapped FROM logs
UNION ALL
SELECT DISTINCT 2, syslog_priority_mapped, operation_mapped FROM logs
UNION ALL
SELECT DISTINCT 3, operation_mapped, message_code_mapped FROM logs
UNION ALL
...etc...
SELECT DISTINCT 14,  hourofday_mapped, meridiem_mapped FROM logs
;

The first SELECT DISTINCT subquery extracts the lines to draw between dimensions 1 and 2, the next subquery between dimensions 2 and 3, and so on. DISTINCT eliminates duplicates, so the client side doesn't have to do it. The UNION ALL concatenates the results without any further processing.

However it's a heavy query and it's dubious that it would be any faster than what you're already doing.

The WITH subquery is likely to gets slowly materialized on disk, so it might be interesting to compare the execution time with this other form repeating the same condition:

SELECT DISTINCT 1, log_time_mapped, syslog_priority_mapped
   FROM firewall_logs_mapped WHERE operation = 'Built'
UNION ALL
SELECT DISTINCT 2, syslog_priority_mapped, operation_mapped
   FROM firewall_logs_mapped WHERE operation = 'Built'
UNION ALL
SELECT DISTINCT 3, operation_mapped, message_code_mapped
   FROM firewall_logs_mapped WHERE operation = 'Built'
...etc...
;

PostgreSQL – Unique Index vs Unique Constraint with Operator Class

For a moment I thought one might be able to use a pre-existing text_pattern_ops index with the USING INDEX clause when adding a UNIQUE CONSTRAINT. But that fails, because:

ERROR: index "book2_name_like" does not have default sorting behavior

Per documentation:

The index cannot have expression columns nor be a partial index. Also, it must be a b-tree index with default sort ordering. These restrictions ensure that the index is equivalent to one that would be built by a regular ADD PRIMARY KEY or ADD UNIQUE command.

For instance, a unique index like that would allow a FK constraint referencing it, but perform terribly, because it does not support standard operators.
Per documentation:

Note that you should also create an index with the default operator class if you want queries involving ordinary <, <=, >, or >= comparisons to use an index.

So to answer the question:

If you need a UNIQUE CONSTRAINT (among other reasons: to reference it with a FK), your first variant with constraint and index is the only option. Additionally, the default operator class of the index created by the constraint supports more operations (like sorting in default sort order).

If you don't need any of that go with your second variant because, obviously, just one index is cheaper to maintain: just a UNIQUE text_pattern_ops index.

Differences between index and constraint:

How does PostgreSQL enforce the UNIQUE constraint / what type of index does it use?

Alternative with COLLATE "C"

Instead of creating two indexes, there is another alternative for xxx_pattern_ops indexes that may be preferable. The documentation:

The difference from the default operator classes is that the values are compared strictly character by character rather than according to the locale-specific collation rules. This makes these operator classes suitable for use by queries involving pattern matching expressions (LIKE or POSIX regular expressions) when the database does not use the standard "C" locale.

And:

The index automatically uses the collation of the underlying column.

You can create the column without collation (using COLLATE "C"). Then the default operator class behaves the same way as text_pattern_ops would - plus the index can be used with all the standard operators.

CREATE TABLE book2 (
   book_id serial PRIMARY KEY,
   book    text NOT NULL COLLATE "C" UNIQUE  -- that's all!
);

Now, LIKE can use the index:

SELECT * FROM book2 WHERE book LIKE 'foo%';

But ILIKE still can't:

SELECT * FROM book2 WHERE book ILIKE 'foo%';

SQL Fiddle.

Consider a trigram index using the additional module pg_trgm for a more general solution.:

Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL

Best Answer

Related Solutions

PostgreSQL – Return Unique Column Combinations Based on Where Clause

PostgreSQL – Unique Index vs Unique Constraint with Operator Class

Alternative with COLLATE "C"

Related Question