PostgreSQL – What exactly does PostgreSQL (or other databases) do internally when you “prepare” a query rather than just running it directly?

optimization, postgresql, prepared-statement

When I face a huge number of repeated INSERTs in a loop, I tend to first create a "prepare skeleton query" prior to the loop, and in the loop, I simply "execute" this prepared query and send it all the values. I heard long ago, and can understand in some abstract sense, that this is more optimized than just having the loop with a normal parameterized query in it.

However, I don't understand what exactly PG is doing that makes it so much faster. If it even is so much faster. I frankly have never done any real benchmarks to compare the two different methods.

Also, shouldn't PG be "warmed up" to a query once you repeat it several (let alone many) times in very short succession, and perhaps do the same thing as I do manually, but internally, when I just use a normal parameterized query?

I find myself constantly second-guessing what the database is up to internally. I have no real grasp of how "smart" it is. I fear that I'm doing a lot of things which are meaningless because it's already taken care of internally by its smart code.

Maybe "preparing" and "executing" is an archaic practice which has no real benefit these days?

It sounds like PG is basically allocating resources to "prepare" for upcoming huge amounts of similar INSERT queries, but I don't understand what exactly it would do differently compared to just executing them one by one. I also don't understand what the point would be of a non-INSERT prepared query/statement.

PS: Just to clear up any confusion: I never use non-parameterized queries, whether they are prepared or not. A lot of people confuse "parameterized queries" and "prepared statements". Even I called the latter "prepared queries" in this question…

Best Answer

Actually, the difference between a parameterized statement and a prepared statement is not that big in PostgreSQL: the former is internally an unnamed prepared statement, and the main difference is that it is automatically deallocated when the next unnamed prepared statement comes along.

When you prepare a statement, PostgreSQL remembers that statement in your current session (nothing is shared between sessions).
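To make that lifecycle concrete, here is a minimal sketch. The table `items` and the statement name `insert_item` are made up for illustration and are not from the question.

```sql
-- Hypothetical table, used only for illustration.
CREATE TEMP TABLE items (id integer, name text);

-- Parse and analyze the statement once; $1 and $2 are parameter placeholders.
PREPARE insert_item (integer, text) AS
    INSERT INTO items (id, name) VALUES ($1, $2);

-- Run it as often as needed with different values.
EXECUTE insert_item(1, 'first');
EXECUTE insert_item(2, 'second');

-- Lists only the prepared statements of the current session.
SELECT name, statement FROM pg_prepared_statements;

-- Free it explicitly; otherwise it goes away when the session ends.
DEALLOCATE insert_item;
```

Running the `SELECT` from a second connection should return no row for `insert_item`, which is one way to see that nothing is shared between sessions.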

For the first five executions, PostgreSQL generates a custom query plan that takes the parameter values into account. After that, it builds a generic plan that is independent of the parameter values, and if that generic plan is estimated to be no more expensive than the average of the previous custom plans, it is used from that point on.
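If you want to see which kind of plan a prepared statement is currently using, `EXPLAIN EXECUTE` is one way to check. The sketch below assumes a hypothetical table `events` and statement `events_by_kind`; the `plan_cache_mode` setting only exists in PostgreSQL 12 and later.

```sql
-- Hypothetical table and statement, for illustration only.
CREATE TEMP TABLE events (id integer, kind text);

PREPARE events_by_kind (text) AS
    SELECT * FROM events WHERE kind = $1;

-- With a custom plan, the filter condition shows the actual value ('click').
-- With the generic plan, it shows the placeholder $1 instead.
EXPLAIN EXECUTE events_by_kind('click');

-- PostgreSQL 12+: override the heuristic for this session.
-- Other values are force_custom_plan and auto (the default).
SET plan_cache_mode = force_generic_plan;
EXPLAIN EXECUTE events_by_kind('click');
```

Forcing one mode or the other is mostly useful for experiments, or for the rare statement where the automatic choice turns out badly.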

For INSERT statements like the ones you describe, PostgreSQL will always switch to the generic plan. The main advantage there is that the query doesn't have to be planned any more. With simple statements, planning time can exceed execution time, so the savings can be considerable.

PostgreSQL only caches query plans for prepared statements and statements in PL/pgSQL functions.

The other advantage of prepared (or parameterized) statements is that they avoid the danger of SQL injection.

If you have to insert a lot of rows, using COPY would be even faster.
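A bulk load with COPY might look roughly like this. The table, the column list, and the file path are all placeholders, and the file is assumed to be CSV with a header row.

```sql
-- Hypothetical target table.
CREATE TEMP TABLE items (id integer, name text);

-- Server-side load: the path is a placeholder and must be readable
-- by the PostgreSQL server process.
COPY items (id, name)
    FROM '/path/to/items.csv'
    WITH (FORMAT csv, HEADER true);

-- From psql, the \copy meta-command reads a file on the client instead:
-- \copy items (id, name) FROM 'items.csv' WITH (FORMAT csv, HEADER true)
```

Most client libraries also expose COPY through their own API, so the data does not have to pass through a file first.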

Related Solutions

Postgresql – PL/pgSQL functions and optimizations fences

Functions written in language SQL can under certain conditions (e.g., no side effects) be "inlined" (at the discretion of the optimizer), in which case they will take part in the query plan optimization. Functions written in all other languages (including PL/pgSQL) will not take part in the optimization of the main query, so they are indeed optimization fences, as you call it.

On the other hand, a query run by a PL/pgSQL function will be compiled and prepared the first time it is run, so subsequent calls of that function (even as part of the same top-level query) will be faster. SQL functions do not implicitly prepare the queries they run, so if inlining of an SQL function fails, repeated calls will probably be slower.

In practice, it depends on the specific circumstances which is faster. Generally, I stick to the basic rule of using LANGUAGE SQL when it's possible and LANGUAGE plpgsql only when I need the additional expressive power. But it's definitely a useful microoptimization strategy to compare implementations in the two languages.
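One way to check whether a given SQL function was inlined is to compare the `EXPLAIN (VERBOSE)` output of two equivalent functions. The function names below are hypothetical, and this is only a sketch of the comparison, not a benchmark.

```sql
-- Hypothetical functions that compute the same thing in both languages.
CREATE FUNCTION add_one_sql(i integer) RETURNS integer
    LANGUAGE sql IMMUTABLE
    AS 'SELECT i + 1';

CREATE FUNCTION add_one_plpgsql(i integer) RETURNS integer
    LANGUAGE plpgsql IMMUTABLE
    AS $$ BEGIN RETURN i + 1; END $$;

-- If the SQL function was inlined, the Output line shows the bare
-- expression (x + 1) rather than a call to add_one_sql.
EXPLAIN (VERBOSE)
SELECT add_one_sql(x) FROM generate_series(1, 3) AS g(x);

-- The PL/pgSQL function always appears as an opaque function call.
EXPLAIN (VERBOSE)
SELECT add_one_plpgsql(x) FROM generate_series(1, 3) AS g(x);
```

For actual timings, `EXPLAIN (ANALYZE)` over a larger series gives a rough first comparison.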

Sql-server – Specifying columns in LEFT JOIN through subquery beneficial performance wise

For most simple cases, like yours, there will not be any difference. SQL Server is lazy: it will figure out the easiest way of completing your query, and it does a very good job of it (most of the time).

The best way of analyzing the difference yourself is to look at the query plan of both queries.

Given this example of two queries similar to yours:

-- Query 1: join the two tables directly.
SELECT
    c.CustomerId,
    o.OrderId,
    o.ArticleId,
    o.CustomerId
FROM
    dbo.Customer c
INNER JOIN
    dbo.orders o
    ON o.CustomerId = c.CustomerId;


-- Query 2: join against a subquery that pre-selects the needed columns.
SELECT
    c.CustomerId,
    o.OrderId,
    o.ArticleId,
    o.CustomerId
FROM
    dbo.Customer c
INNER JOIN
(
    SELECT
        o.OrderId,
        o.ArticleId,
        o.CustomerId
    FROM dbo.orders o
) AS o
    ON o.CustomerId = c.CustomerId;

The query plans for both will look exactly the same.

If you hover your mouse over the Index Scan operator, you will see the Output List at the bottom of the popup. The Output List shows exactly which columns will be "extracted" from the table. Both queries produce the same Output Lists, which means that there is no benefit to "pre-selecting" the columns because SQL Server is already doing this for you.

This is true in your example because you explicitly list columns from both tables in your SELECT clause. If you used SELECT * instead, there could be a difference.
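If you prefer not to use the graphical plan, the same comparison can be sketched in T-SQL by asking for the estimated plan as XML. This assumes a tool that understands the `GO` batch separator (such as SSMS or sqlcmd), and it reuses the table names from the example above.

```sql
-- Return the estimated plan as XML instead of executing the statements
-- that follow. SET SHOWPLAN_XML must be the only statement in its batch.
SET SHOWPLAN_XML ON;
GO

SELECT
    c.CustomerId,
    o.OrderId,
    o.ArticleId
FROM dbo.Customer AS c
INNER JOIN dbo.orders AS o
    ON o.CustomerId = c.CustomerId;
GO

SET SHOWPLAN_XML OFF;
GO
```

In the returned XML, the OutputList element of each operator should list the same columns that the hover popup shows.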
