The execution plan shown does not seem to match the big SELECT DISTINCT query because the Sort and Unique steps are missing. Anyway, you are correct that when retrieving ~50% of a table, indexes don't help. The best strategy is a big sequential scan of the main table, and only fast hardware helps with that.
For the 2nd part of the question:
How would I go about selecting only the unique combinations of
adjacent columns? Is this too complicated a task to perform through a
database query? Would it speed up the query?
To remove duplicate combinations of adjacent columns, the structure of the result set should be changed so that each output row has only one pair of adjacent columns, along with their corresponding dimensions in the parallel coordinates graph. Well, except that the dimension for the 2nd column is not necessary, since it's always the dimension of the other column plus one.
In one single query, this could be written like this:
WITH logs AS (
   SELECT log_time_mapped, syslog_priority_mapped,
          operation_mapped, message_code_mapped, protocol_mapped,
          source_ip_mapped, destination_ip_mapped,
          source_port_mapped, destination_port_mapped,
          destination_service_mapped, direction_mapped,
          connections_built_mapped, connections_torn_down_mapped,
          hourofday_mapped, meridiem_mapped
   FROM   firewall_logs_mapped
   WHERE  operation = 'Built')
SELECT DISTINCT 1, log_time_mapped, syslog_priority_mapped FROM logs
UNION ALL
SELECT DISTINCT 2, syslog_priority_mapped, operation_mapped FROM logs
UNION ALL
SELECT DISTINCT 3, operation_mapped, message_code_mapped FROM logs
UNION ALL
...etc...
SELECT DISTINCT 14, hourofday_mapped, meridiem_mapped FROM logs
;
The first SELECT DISTINCT subquery extracts the lines to draw between dimensions 1 and 2, the next subquery between dimensions 2 and 3, and so on. DISTINCT eliminates duplicates, so the client side doesn't have to do it. The UNION ALL concatenates the results without any further processing.
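As a rough illustration of the shape of that result set, here is a small Python sketch that generates the UNION ALL query from a column list and runs it. It assumes SQLite (via the standard sqlite3 module) as a stand-in for PostgreSQL, a shortened three-column table, and made-up sample values; the helper name adjacent_pairs_sql and the aliases dim, left_val and right_val are hypothetical.

```python
import sqlite3

# Hypothetical, shortened column list; the real table has 15 mapped columns.
COLUMNS = ["log_time_mapped", "syslog_priority_mapped", "operation_mapped"]


def adjacent_pairs_sql(columns, table="firewall_logs_mapped"):
    """Build one SELECT DISTINCT per pair of adjacent columns, glued by UNION ALL."""
    parts = [
        f"SELECT DISTINCT {i} AS dim, {left} AS left_val, {right} AS right_val "
        f"FROM {table} WHERE operation = 'Built'"
        for i, (left, right) in enumerate(zip(columns, columns[1:]), start=1)
    ]
    return "\nUNION ALL\n".join(parts)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE firewall_logs_mapped ("
    "operation TEXT, log_time_mapped REAL, "
    "syslog_priority_mapped REAL, operation_mapped REAL)"
)
conn.executemany(
    "INSERT INTO firewall_logs_mapped VALUES (?, ?, ?, ?)",
    [
        ("Built", 0.1, 0.5, 0.9),
        ("Built", 0.1, 0.5, 0.9),     # exact duplicate of the first row
        ("Built", 0.2, 0.5, 0.9),     # new pair for dimension 1, duplicate for 2
        ("Teardown", 0.3, 0.7, 0.4),  # filtered out by the WHERE clause
    ],
)

# Each output row is (dimension, value on that axis, value on the next axis).
rows = sorted(conn.execute(adjacent_pairs_sql(COLUMNS)).fetchall())
for row in rows:
    print(row)
```

Three "Built" rows collapse into two distinct line segments between dimensions 1 and 2 and a single one between dimensions 2 and 3, which is the reduction the client would otherwise have to do itself.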
However, it's a heavy query, and it's doubtful that it would be any faster than what you're already doing.
The WITH subquery is likely to get materialized on disk, which is slow, so it might be interesting to compare the execution time with this other form that repeats the same condition:
SELECT DISTINCT 1, log_time_mapped, syslog_priority_mapped
FROM firewall_logs_mapped WHERE operation = 'Built'
UNION ALL
SELECT DISTINCT 2, syslog_priority_mapped, operation_mapped
FROM firewall_logs_mapped WHERE operation = 'Built'
UNION ALL
SELECT DISTINCT 3, operation_mapped, message_code_mapped
FROM firewall_logs_mapped WHERE operation = 'Built'
...etc...
;
NEW is a record, not a table.
Slightly modified setup
CREATE TABLE product (
  product_id   serial PRIMARY KEY,
  product_name text UNIQUE NOT NULL  -- must be UNIQUE
);

CREATE TABLE purchase (
  purchase_id serial PRIMARY KEY,
  product_id  int REFERENCES product,
  when_bought date
);

CREATE VIEW purchaseview AS
SELECT pu.purchase_id, pr.product_name, pu.when_bought
FROM   purchase pu
LEFT   JOIN product pr USING (product_id);

INSERT INTO product(product_name) VALUES ('foo');
product_name has to be UNIQUE, or the lookup on this column could find multiple rows, which would lead to all kinds of confusion.
1. Simple solution
For your simple example, only looking up the single column product_id, a lowly correlated subquery is simplest and fastest:
CREATE OR REPLACE FUNCTION insert_purchaseview_func()
  RETURNS trigger AS
$func$
BEGIN
   INSERT INTO purchase(product_id, when_bought)
   SELECT (SELECT product_id FROM product WHERE product_name = NEW.product_name), NEW.when_bought
   RETURNING purchase_id
   INTO   NEW.purchase_id;  -- generated serial ID for RETURNING - if needed

   RETURN NEW;
END
$func$  LANGUAGE plpgsql;

CREATE TRIGGER insert_productview_trig
INSTEAD OF INSERT ON purchaseview
FOR EACH ROW EXECUTE PROCEDURE insert_purchaseview_func();
No additional variables. No CTE (would only add cost and noise). Columns from NEW are spelled out once only (your point 1).
The appended RETURNING purchase_id INTO NEW.purchase_id takes care of your point 2: now the returned row includes the newly generated purchase_id.
If the product is not found (NEW.product_name does not exist in table product), the purchase is still inserted and product_id is NULL. This may or may not be desirable.
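To see that NULL behaviour in action without a PostgreSQL server, here is a minimal sketch using Python's sqlite3 module. It is an analogue, not the function above: SQLite has no PL/pgSQL, so the trigger body is plain SQL, the serial columns become INTEGER PRIMARY KEY, the date column is stored as text, and the sample dates and the product 'bar' are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT UNIQUE NOT NULL
);
CREATE TABLE purchase (
    purchase_id INTEGER PRIMARY KEY,
    product_id  INTEGER REFERENCES product,
    when_bought TEXT
);
CREATE VIEW purchaseview AS
SELECT pu.purchase_id, pr.product_name, pu.when_bought
FROM   purchase pu
LEFT   JOIN product pr USING (product_id);

-- SQLite analogue of the INSTEAD OF trigger; the lookup is a scalar subquery.
CREATE TRIGGER insert_purchaseview_trig
INSTEAD OF INSERT ON purchaseview
BEGIN
    INSERT INTO purchase (product_id, when_bought)
    VALUES ((SELECT product_id FROM product
             WHERE  product_name = NEW.product_name),
            NEW.when_bought);
END;

INSERT INTO product (product_name) VALUES ('foo');
INSERT INTO purchaseview (product_name, when_bought) VALUES ('foo', '2024-01-01');
INSERT INTO purchaseview (product_name, when_bought) VALUES ('bar', '2024-01-02');
""")

# 'foo' resolves to its product_id; the unknown 'bar' yields NULL (None in Python).
rows = conn.execute(
    "SELECT purchase_id, product_id, when_bought FROM purchase ORDER BY purchase_id"
).fetchall()
print(rows)
```

The scalar subquery returns NULL when nothing matches, so the second purchase is stored without a product, which is exactly the case to decide on.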
2. To skip the row instead (and possibly raise a WARNING / EXCEPTION):
CREATE OR REPLACE FUNCTION insert_purchaseview_func()
  RETURNS trigger AS
$func$
BEGIN
   INSERT INTO purchase AS pu
         (product_id, when_bought)
   SELECT pr.product_id, NEW.when_bought
   FROM   product pr
   WHERE  pr.product_name = NEW.product_name
   RETURNING pu.purchase_id
   INTO   NEW.purchase_id;  -- generated serial ID for RETURNING - if needed

   IF NOT FOUND THEN  -- insert was canceled for missing product
      RAISE WARNING 'product_name % not found! Skipping INSERT.', quote_literal(NEW.product_name);
   END IF;

   RETURN NEW;
END
$func$  LANGUAGE plpgsql;
This piggybacks NEW columns to SELECT .. FROM product. If the product is found, everything proceeds normally. If not, no row is returned from the SELECT and no INSERT happens. The special PL/pgSQL variable FOUND is only true if the last SQL query processed at least one row.
Could be EXCEPTION instead of WARNING to raise an error and roll back the transaction. But I'd rather declare purchase.product_id NOT NULL and insert unconditionally (query 1 or similar), to the same effect: raises an exception if product_id is NULL. Simpler, cheaper.
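The skip-if-missing idea does not depend on triggers: an INSERT ... SELECT that finds no source row inserts nothing. A minimal sketch with Python's sqlite3 module, assuming SQLite in place of PostgreSQL; the helper insert_purchase is hypothetical, the sample values are made up, and the total_changes comparison is only a rough stand-in for PL/pgSQL's FOUND.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT UNIQUE NOT NULL
);
CREATE TABLE purchase (
    purchase_id INTEGER PRIMARY KEY,
    product_id  INTEGER REFERENCES product,
    when_bought TEXT
);
INSERT INTO product (product_name) VALUES ('foo');
""")


def insert_purchase(product_name, when_bought):
    """Insert only if the product exists; return True if a row was inserted."""
    before = conn.total_changes
    conn.execute(
        "INSERT INTO purchase (product_id, when_bought) "
        "SELECT pr.product_id, ? FROM product pr WHERE pr.product_name = ?",
        (when_bought, product_name),
    )
    # Rough stand-in for FOUND: did the statement insert any row?
    return conn.total_changes > before


found_foo = insert_purchase("foo", "2024-01-01")
found_bar = insert_purchase("bar", "2024-01-02")  # unknown product, skipped
purchase_count = conn.execute("SELECT count(*) FROM purchase").fetchone()[0]
print(found_foo, found_bar, purchase_count)
```

With the NOT NULL alternative, the caller would get an error for 'bar' instead of a silently skipped row.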
3. For multiple lookups
CREATE OR REPLACE FUNCTION insert_purchaseview_func()
  RETURNS trigger AS
$func$
BEGIN
   INSERT INTO purchase AS pu
         (product_id, when_bought)          -- more columns?
   SELECT pr.product_id, i.when_bought      -- more columns?
   FROM  (SELECT NEW.*) i                   -- see below
   LEFT   JOIN product pr USING (product_name)
-- LEFT   JOIN tbl2    t2 USING (t2_name)   -- more lookups?
   RETURNING pu.purchase_id                 -- more columns?
   INTO   NEW.purchase_id;                  -- more columns?

   RETURN NEW;
END
$func$  LANGUAGE plpgsql;
The LEFT JOINs make the INSERT unconditional again. Use JOIN instead to skip if one is not found.
FROM (SELECT NEW.*) i transforms the record NEW into a derived table with a single row, which can be used like any table in the FROM clause - what you were looking for, initially.
Altering a column's position is going to require a full table rewrite. My suggestion is not to alter column positions. However, there is an abundance of people doing this, and there are numerous ways to do it. In my experience,
My suggestion is to use pg_dump.
Dump with --column-inserts.
Remove from the dump what you don't need.