\df *crypt
in psql reveals the argument types of the pgcrypto encrypt and decrypt functions (as do the pgcrypto docs):
List of functions
Schema | Name | Result data type | Argument data types | Type
--------+-----------------+------------------+--------------------------+--------
...
public | decrypt | bytea | bytea, bytea, text | normal
public | encrypt | bytea | bytea, bytea, text | normal
...
so both the encrypt and decrypt functions expect the key to be bytea. As per the error message, "you might need to add explicit type casts".
However, it works fine here on Pg 9.1, so I suspect there's more to it than you've shown. Perhaps you have another function also named encrypt with three arguments?
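If another overload is shadowing the pgcrypto one, schema-qualifying the function and casting the arguments explicitly should disambiguate. A sketch, assuming pgcrypto is installed in the public schema:

```sql
-- Explicit casts plus schema qualification, so function resolution
-- can't pick up a different encrypt() by accident.
SELECT public.encrypt('data'::bytea, 'key'::bytea, 'aes');
```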
Here's how it works on a clean Pg 9.1:
regress=# create table demo(pw bytea);
CREATE TABLE
regress=# insert into demo(pw) values ( encrypt( 'data', 'key', 'aes') );
INSERT 0 1
regress=# select decrypt(pw, 'key', 'aes') FROM demo;
decrypt
------------
\x64617461
(1 row)
regress=# select convert_from(decrypt(pw, 'key', 'aes'), 'utf-8') FROM demo;
convert_from
--------------
data
(1 row)
Awooga! Awooga! Key exposure risk, extreme admin caution required!
BTW, please think carefully about whether pgcrypto is really the right choice. Keys in your queries can be revealed in pg_stat_activity and in the system logs via log_statement, or via crypto statements that fail with an error. IMO it's frequently better to do crypto in the application.
Witness this session, with client_min_messages raised so you can see what'd appear in the logs:
regress=# SET client_min_messages = 'DEBUG'; SET log_statement = 'all';
regress=# select decrypt(pw, 'key', 'aes') from demo;
LOG: statement: select decrypt(pw, 'key', 'aes') from demo;
LOG: duration: 0.710 ms
decrypt
------------
\x64617461
(1 row)
Whoops, the key is possibly exposed in the logs if log_min_messages is low enough. It's now on the server's storage, along with the encrypted data. Fail. The same issue arises without log_statement if an error occurs that causes the statement to get logged, or possibly if auto_explain is enabled.
Exposure via pg_stat_activity is also possible. Open two sessions, and:
- S1: BEGIN;
- S1: LOCK TABLE demo;
- S2: select decrypt(pw, 'key', 'aes') from demo;
- S1: select * from pg_stat_activity where current_query ILIKE '%decrypt%' AND procpid <> pg_backend_pid();
Whoops! There goes the key again. It can be reproduced without the LOCK TABLE by an unprivileged attacker; it's just harder to time it right. The attack via pg_stat_activity can be avoided by revoking access to pg_stat_activity from public, but it just goes to show that it might not be best to send your key to the DB unless you know your app is the only thing that will ever access it. Even then, I don't like to.
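The revocation mentioned above is straightforward; a sketch (run per database, as a superuser):

```sql
-- Stop unprivileged users reading other sessions' queries via this view.
-- Superusers still see everything, and monitoring users will need
-- access granted back some other way.
REVOKE ALL ON pg_stat_activity FROM PUBLIC;
```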
If it's passwords, should you store them at all?
Furthermore, if you're storing passwords, don't two-way encrypt them; if at all possible, salt and hash them and store the result. You usually don't need to be able to recover the password cleartext, only to confirm that the stored hash matches the hash of the password the user sends at login, computed with the same salt.
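pgcrypto itself provides crypt() and gen_salt() for exactly this. A minimal sketch (table and column names are illustrative):

```sql
-- Store a salted hash, not the password. gen_salt('bf') generates a
-- random Blowfish (bcrypt-style) salt, which crypt() embeds in the hash.
CREATE TABLE users (username text PRIMARY KEY, pw_hash text NOT NULL);

INSERT INTO users VALUES ('alice', crypt('s3cret', gen_salt('bf')));

-- To check a login attempt, re-hash the supplied password with the
-- stored salt (crypt() reads it back out of pw_hash) and compare:
SELECT (pw_hash = crypt('s3cret', pw_hash)) AS password_ok
FROM users WHERE username = 'alice';
```

Note that the cleartext password still appears in the statement text, so the logging and pg_stat_activity caveats above apply here too.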
If it's auth, let someone else do it for you
Even better, don't store the password at all, authenticate against LDAP, SASL, Active Directory, an OAuth or OpenID provider, or some other external system that's already designed and working.
No automatic predicate pushdown for CTEs
PostgreSQL 9.3 doesn't do predicate pushdown for CTEs.
An optimizer that does predicate pushdown can move WHERE clauses into inner queries, with the goal of filtering out irrelevant data as early as possible. As long as the new query is logically equivalent, the engine still fetches all the relevant data and produces the correct result, only more quickly.
Core developer Tom Lane alludes to the difficulty of determining logical equivalence on the pgsql-performance mailing list.
CTEs are also treated as optimization fences; this is not so much an
optimizer limitation as to keep the semantics sane when the CTE contains
a writable query.
The optimizer doesn't distinguish read-only CTEs from writable ones, so is overly conservative when considering plans. The 'fence' treatment stops the optimizer from moving the where clause inside the CTE, although we can see it is safe to do so.
We can wait for the PostgreSQL team to improve CTE optimization, but for now to get good performance you have to change your writing style.
Rewrite for performance
The question already shows one way to get a better plan. Duplicating the filter condition essentially hard-codes the effect of predicate pushdown.
In both plans, the engine copies result rows to a worktable so it can sort them. The larger the worktable, the slower the query.
The first plan copies all the rows in the base tables to the worktable and scans that to find the result. To make things even slower, the engine must scan the whole worktable because it has no indexes.
That's a ridiculous amount of unnecessary work. It reads all the data in the base tables twice to find the answer, when there are just an estimated 5 matching rows out of an estimated 19350 rows in the base tables.
The second plan uses the indexes to find the matching rows and copies just those to the worktable. The index effectively filtered the data for us.
On page 85 of The Art of SQL, Stéphane Faroult reminds us of user expectations:
To a very large extent, end users adjust their patience to the number of rows they expect: when they ask for one needle, they pay little attention to the size of the haystack.
The second plan scales with the needle, so is more likely to keep your users happy.
Rewrite for maintainability
The new query is harder to maintain because you can introduce a defect by changing one filter expression but not the other.
Wouldn't it be great if we could write everything just once and still get good performance?
We can. The optimizer does predicate pushdown for subqueries.
A simpler example is easier to explain.
CREATE TABLE a (c INT);
CREATE TABLE b (c INT);
CREATE INDEX a_c ON a(c);
CREATE INDEX b_c ON b(c);
INSERT INTO a SELECT 1 FROM generate_series(1, 1000000);
INSERT INTO b SELECT 2 FROM a;
INSERT INTO a SELECT 3;
This creates two tables, each with an indexed column. Together they contain a million 1s, a million 2s, and one 3.
You can find the needle 3 using either of these queries.
-- CTE
EXPLAIN ANALYZE
WITH cte AS (
SELECT c FROM a
UNION ALL
SELECT c FROM b
)
SELECT c FROM cte WHERE c = 3;
-- Subquery
EXPLAIN ANALYZE
SELECT c
FROM (
SELECT c FROM a
UNION ALL
SELECT c FROM b
) AS subquery
WHERE c = 3;
The plan for the CTE is slow. The engine scans three tables and reads about four million rows. It takes nearly 1000 milliseconds.
CTE Scan on cte (cost=33275.00..78275.00 rows=10000 width=4) (actual time=471.412..943.225 rows=1 loops=1)
Filter: (c = 3)
Rows Removed by Filter: 2000000
CTE cte
-> Append (cost=0.00..33275.00 rows=2000000 width=4) (actual time=0.011..409.573 rows=2000001 loops=1)
-> Seq Scan on a (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.010..114.869 rows=1000001 loops=1)
-> Seq Scan on b (cost=0.00..18850.00 rows=1000000 width=4) (actual time=5.530..104.674 rows=1000000 loops=1)
Total runtime: 948.594 ms
The plan for the subquery is fast. The engine just seeks each index. It takes less than a millisecond.
Append (cost=0.42..8.88 rows=2 width=4) (actual time=0.021..0.038 rows=1 loops=1)
-> Index Only Scan using a_c on a (cost=0.42..4.44 rows=1 width=4) (actual time=0.020..0.021 rows=1 loops=1)
Index Cond: (c = 3)
Heap Fetches: 1
-> Index Only Scan using b_c on b (cost=0.42..4.44 rows=1 width=4) (actual time=0.016..0.016 rows=0 loops=1)
Index Cond: (c = 3)
Heap Fetches: 0
Total runtime: 0.065 ms
See SQLFiddle for an interactive version.
Best Answer
Your aggregate function is smart and fast, but there is a bug. If one string completely matches the tail of another, the UNION ALL part kicks in to return LEAST($1, $2). That must instead be something like CASE WHEN length($1) > length($2) THEN $2 ELSE $1 END. Test with 'match' and 'aa_match'. (See fiddle below.)
Plus, make the function IMMUTABLE STRICT: NULL values are then ignored, and empty strings lead to a zero-length common suffix. You may want to treat these special cases differently ...
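Since your original function isn't reproduced here, the following is only an illustrative alternative rather than your code with the fix applied: a self-contained common-suffix aggregate whose transition function handles the full-tail-match case correctly (all names are hypothetical):

```sql
-- Longest common suffix of two strings.
-- IMMUTABLE STRICT, so NULL inputs are skipped by the aggregate.
CREATE FUNCTION common_suffix(a text, b text) RETURNS text
  LANGUAGE sql IMMUTABLE STRICT AS
$$
  SELECT right(a, coalesce(
           (SELECT max(i)
            FROM generate_series(1, least(length(a), length(b))) i
            WHERE right(a, i) = right(b, i)), 0))
$$;

CREATE AGGREGATE common_suffix_agg(text) (
  SFUNC = common_suffix,
  STYPE = text
);

-- The problem case: LEAST('match', 'aa_match') would give 'aa_match',
-- but the actual common suffix is 'match'.
SELECT common_suffix_agg(s) FROM (VALUES ('match'), ('aa_match')) t(s);
```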
While we only need the length of the common suffix, a very simple FINALFUNC returns just that:
Then your query can look like:
... using the custom aggregate as a window function.
db<>fiddle here
I also tested with Postgres 9.4, and it should work with your outdated Postgres 9.1, but that's too old for me to test. Consider upgrading to a current version.