The execution plan shown does not seem to match the big SELECT DISTINCT
query because the Sort
and Unique
steps are missing. Anyway you are correct than when retrieving ~50% of a table, index don't help. The best strategy is a big sequential scan of the main table and only fast hardware helps with that.
For the 2nd part of the question:
How would I go about selecting only the unique combinations of
adjacent columns? Is this too complicated a task to perform through a
database query? Would it speed up the query?
To remove duplicate combinations of adjacent columns, the structure of the resultset should be changed so that each output row has only one couple of adjacent columns along with their corresponding dimensions in the parallel coordinates graph. Well, except that the dimension for the 2nd column is not necessary since it's always the dimension for the other column plus one.
In one single query, this could be written like this:
WITH logs as (
SELECT log_time_mapped, syslog_priority_mapped,
operation_mapped, message_code_mapped, protocol_mapped,
source_ip_mapped, destination_ip_mapped,
source_port_mapped, destination_port_mapped,
destination_service_mapped, direction_mapped,
connections_built_mapped, connections_torn_down_mapped,
hourofday_mapped, meridiem_mapped
FROM firewall_logs_mapped
WHERE operation = 'Built')
SELECT DISTINCT 1, log_time_mapped, syslog_priority_mapped FROM logs
UNION ALL
SELECT DISTINCT 2, syslog_priority_mapped, operation_mapped FROM logs
UNION ALL
SELECT DISTINCT 3, operation_mapped, message_code_mapped FROM logs
UNION ALL
...etc...
SELECT DISTINCT 14, hourofday_mapped, meridiem_mapped FROM logs
;
The first SELECT DISTINCT
subquery extracts the lines to draw between dimensions 1 and 2, the next subquery between dimensions 2 and 3, and so on. DISTINCT
eliminates duplicates, so the client side doesn't have to do it. The UNION ALL
concatenates the results without any further processing.
However it's a heavy query and it's dubious that it would be any faster than what you're already doing.
The WITH
subquery is likely to gets slowly materialized on disk, so it might be interesting to compare the execution time with this other form repeating the same condition:
SELECT DISTINCT 1, log_time_mapped, syslog_priority_mapped
FROM firewall_logs_mapped WHERE operation = 'Built'
UNION ALL
SELECT DISTINCT 2, syslog_priority_mapped, operation_mapped
FROM firewall_logs_mapped WHERE operation = 'Built'
UNION ALL
SELECT DISTINCT 3, operation_mapped, message_code_mapped
FROM firewall_logs_mapped WHERE operation = 'Built'
...etc...
;
You don't have to check what value you could use in "deleted" if you have a way of generating unique integers, and MySQL does... the UUID_SHORT()
function generates a unique and monotonic BIGINT UNSIGNED
value every time it's called.
The function is good for 16.7M new values per second, every second, and will never return a duplicate (or smaller than previous) value as long as the @@SERVER_ID
remains the same and you don't shut down the server, set the system clock backward in time, and restart the server... and even if you do, the odds of duplicate values are still very small, but no longer truly impossible.
You could also make "deleted" be a DATETIME(6)
or TIMESTAMP(6)
on MySQL 5.6, using '0000-00-00 00:00:00.000000' to mean "not deleted" and NOW()
to mean deleted.
Try as I might, I have not been able to get MySQL to produce a NOW()
collision down to the microsecond.
Best Answer
Neither MySQL nor the siblings (MariaDB, Drizzle, etc) have implemented partial indexes.
What you can do, with this restriction in mind:
a) make a simple (not partial) index on
(is_active, measurement_id)
. It will be used in queries where the partial index would. Of course if theis_active
column is 3% True and 97% false, this index will be much bigger (than a partial index). But still smaller than the table and useful for these queries.Another limitation is the index cannot be
UNIQUE
with this solution so the constraint is not enforced. If the index is created withUNIQUE
, the uniqueness will be enforced for rows withis_active = FALSE
as well. I assume you don't want that:b1) (the simple variation of b): add another table in your design, with only the primary key columns of
events
and a foreign key toevents
. This table should only have rows where theis_active
is true in the original table (this will be enforced by your application/procedures). Queries withis_active = TRUE
would be changed to join to that table (instead of theWHERE
condition.)The
UNIQUE
is not enforced either with this solution but the queries would only do a simple join (to a much smaller index) and should be quite efficient:b2) a more complex solution: add another table in your design, with only the primary key columns of the table and
measurement_id
. As in previous suggestion, this table should only have rows where theis_active
is true in the original table (this will be enforced by your application/procedures, too). Then use this table only instead for queries that haveWHERE is_active = TRUE
and need only themeasurement_id
column. If more columns are needed fromevents
, you'll have tojoin
, as before.The
UNIQUE
constraint can be enforced with this solution. The duplication ofmeasurement_id
column can also be secured to be consistent (with an extra unique constraint onevents
and a composite foreign key):c) maybe the simplest of all: use PostgreSQL. I'm sure there are packages for your Linux distribution. They may not be the latest version of Postgres but partial indexes were added in 7.0 (or earlier?) so you shouldn't have a problem. Plus, I'm confident that you could install the latest version in almost any Linux distribution - even with a little hassle. You only need to install it once.