MySQL – How to Create PostgreSQL-like Partial Index

indexMySQL

I have big data where I only select a small interval of data at a time such that the selection is always in a sequence.
I am trying to implement PostgreSQL like Partial index in MySQL which is targeted for such purposes.
I am not sure if the partial unique constraint is the same as the one which I want.

Code in PostgreSQL 9.4

CREATE UNIQUE INDEX dir_events
    ON events (measurement_id)
    USING btree
    (eventBody)
    WHERE is_active;

Attempt on ypercube's partial index in MySQL

CREATE UNIQUE INDEX dir_events
    [index_type] -- TODO what here?
    ON events (measurement_id, is_active)
    [index_type] -- TODO what here?

How can you create PostgreSQL-like partial index in MySQL 5.5 or similar?

Best Answer

Neither MySQL nor the siblings (MariaDB, Drizzle, etc) have implemented partial indexes.

What you can do, with this restriction in mind:

a) make a simple (not partial) index on (is_active, measurement_id). It will be used in queries where the partial index would. Of course if the is_active column is 3% True and 97% false, this index will be much bigger (than a partial index). But still smaller than the table and useful for these queries.
Another limitation is the index cannot be UNIQUE with this solution so the constraint is not enforced. If the index is created with UNIQUE, the uniqueness will be enforced for rows with is_active = FALSE as well. I assume you don't want that:
```
CREATE INDEX dir_events
    ON events (is_active, measurement_id)
    USING btree ;
```
b1) (the simple variation of b): add another table in your design, with only the primary key columns of events and a foreign key to events. This table should only have rows where the is_active is true in the original table (this will be enforced by your application/procedures). Queries with is_active = TRUE would be changed to join to that table (instead of the WHERE condition.)
The UNIQUE is not enforced either with this solution but the queries would only do a simple join (to a much smaller index) and should be quite efficient:
```
CREATE TABLE events_active
( event_id INT NOT NULL,         -- assuming an INT primary key on events
  PRIMARY KEY (event_id),
  FOREIGN KEY (event_id)
    REFERENCES events (event_id)
) ;

INSERT INTO events_active 
  (event_id)
SELECT event_id
FROM events
WHERE is_active = TRUE ;
```
b2) a more complex solution: add another table in your design, with only the primary key columns of the table and measurement_id. As in previous suggestion, this table should only have rows where the is_active is true in the original table (this will be enforced by your application/procedures, too). Then use this table only instead for queries that have WHERE is_active = TRUE and need only the measurement_id column. If more columns are needed from events, you'll have to join, as before.
The UNIQUE constraint can be enforced with this solution. The duplication of measurement_id column can also be secured to be consistent (with an extra unique constraint on events and a composite foreign key):
```
ALTER TABLE events
  ADD UNIQUE (event_id, measurement_id) ;

CREATE TABLE events_active
( event_id INT NOT NULL,
  measurement_id INT NOT NULL.
  PRIMARY KEY (event_id, measurement_id),
  UNIQUE (measurement_id),
  FOREIGN KEY (event_id, measurement_id)
    REFERENCES events (event_id, measurement_id)
) ;

INSERT INTO events_active 
  (event_id, measurement_id)
SELECT event_id, measurement_id
FROM events
WHERE is_active = TRUE ;
```
c) maybe the simplest of all: use PostgreSQL. I'm sure there are packages for your Linux distribution. They may not be the latest version of Postgres but partial indexes were added in 7.0 (or earlier?) so you shouldn't have a problem. Plus, I'm confident that you could install the latest version in almost any Linux distribution - even with a little hassle. You only need to install it once.

Related Solutions

PostgreSQL – Return Unique Column Combinations Based on Where Clause

The execution plan shown does not seem to match the big SELECT DISTINCT query because the Sort and Unique steps are missing. Anyway you are correct than when retrieving ~50% of a table, index don't help. The best strategy is a big sequential scan of the main table and only fast hardware helps with that.

For the 2nd part of the question:

How would I go about selecting only the unique combinations of adjacent columns? Is this too complicated a task to perform through a database query? Would it speed up the query?

To remove duplicate combinations of adjacent columns, the structure of the resultset should be changed so that each output row has only one couple of adjacent columns along with their corresponding dimensions in the parallel coordinates graph. Well, except that the dimension for the 2nd column is not necessary since it's always the dimension for the other column plus one.

In one single query, this could be written like this:

WITH logs as (
  SELECT log_time_mapped, syslog_priority_mapped, 
     operation_mapped, message_code_mapped, protocol_mapped, 
     source_ip_mapped, destination_ip_mapped, 
     source_port_mapped, destination_port_mapped, 
     destination_service_mapped, direction_mapped, 
     connections_built_mapped, connections_torn_down_mapped, 
     hourofday_mapped, meridiem_mapped 
  FROM firewall_logs_mapped 
  WHERE operation = 'Built')
SELECT DISTINCT 1, log_time_mapped, syslog_priority_mapped FROM logs
UNION ALL
SELECT DISTINCT 2, syslog_priority_mapped, operation_mapped FROM logs
UNION ALL
SELECT DISTINCT 3, operation_mapped, message_code_mapped FROM logs
UNION ALL
...etc...
SELECT DISTINCT 14,  hourofday_mapped, meridiem_mapped FROM logs
;

The first SELECT DISTINCT subquery extracts the lines to draw between dimensions 1 and 2, the next subquery between dimensions 2 and 3, and so on. DISTINCT eliminates duplicates, so the client side doesn't have to do it. The UNION ALL concatenates the results without any further processing.

However it's a heavy query and it's dubious that it would be any faster than what you're already doing.

The WITH subquery is likely to gets slowly materialized on disk, so it might be interesting to compare the execution time with this other form repeating the same condition:

SELECT DISTINCT 1, log_time_mapped, syslog_priority_mapped
   FROM firewall_logs_mapped WHERE operation = 'Built'
UNION ALL
SELECT DISTINCT 2, syslog_priority_mapped, operation_mapped
   FROM firewall_logs_mapped WHERE operation = 'Built'
UNION ALL
SELECT DISTINCT 3, operation_mapped, message_code_mapped
   FROM firewall_logs_mapped WHERE operation = 'Built'
...etc...
;

MySQL – Creating Unique Index for Specific Column Values

You don't have to check what value you could use in "deleted" if you have a way of generating unique integers, and MySQL does... the UUID_SHORT() function generates a unique and monotonic BIGINT UNSIGNED value every time it's called.

The function is good for 16.7M new values per second, every second, and will never return a duplicate (or smaller than previous) value as long as the @@SERVER_ID remains the same and you don't shut down the server, set the system clock backward in time, and restart the server... and even if you do, the odds of duplicate values are still very small, but no longer truly impossible.

You could also make "deleted" be a DATETIME(6) or TIMESTAMP(6) on MySQL 5.6, using '0000-00-00 00:00:00.000000' to mean "not deleted" and NOW() to mean deleted.

Try as I might, I have not been able to get MySQL to produce a NOW() collision down to the microsecond.

Attempt on ypercube's partial index in MySQL

Best Answer

Related Solutions

PostgreSQL – Return Unique Column Combinations Based on Where Clause

MySQL – Creating Unique Index for Specific Column Values

Related Question