You're not seeing the benefits of parallel execution because both insert methods use single-row inserts, so parallel execution never kicks in!
Parallel execution is only available for bulk operations; lots of tiny operations don't qualify as a set operation. You're inserting rows one by one here, so you should try it with a bulk operation instead (INSERT /*+ APPEND */ INTO ... SELECT, CREATE TABLE ... AS SELECT, and so on) and see whether you can achieve some performance improvement.
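For example, a direct-path, parallel bulk insert might look like the sketch below. The table names and degree of parallelism are placeholders, not taken from the question:

```sql
-- Hypothetical sketch: table names and DOP (4) are placeholders
ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ APPEND PARALLEL(target_table, 4) */ INTO target_table
SELECT /*+ PARALLEL(source_table, 4) */ *
FROM   source_table;

-- A direct-path insert must be committed before the table can be re-read
COMMIT;
```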
You can tell when Oracle executes in parallel because the explain plan will display parallel (PX) operations.
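To check, generate the plan and look for PX steps, along these lines (again a sketch with assumed table names):

```sql
EXPLAIN PLAN FOR
INSERT /*+ APPEND PARALLEL(target_table, 4) */ INTO target_table
SELECT /*+ PARALLEL(source_table, 4) */ * FROM source_table;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-- A parallel plan contains operations such as PX COORDINATOR,
-- PX SEND QC (RANDOM) and LOAD AS SELECT
```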
Using T-SQL scalar functions will frequently lead to performance problems* because SQL Server makes a separate function call (using a whole new T-SQL context) for each row. In addition, parallel execution is disallowed for the whole query.
T-SQL scalar functions can also make it difficult to troubleshoot performance problems (whether those problems are caused by the function or not). The function appears as a 'black box' to the query optimizer: it is assigned a fixed low estimated cost, regardless of the actual content of the function.
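For illustration, even a trivial scalar function of this shape (a hypothetical example, not the function from the question) is called once per row and prevents a parallel plan for any query that uses it:

```sql
-- Hypothetical T-SQL scalar function: invoked separately per row,
-- assigned a fixed low cost estimate, and blocks parallelism
CREATE FUNCTION dbo.TryToSmallDateTime (@s varchar(12))
RETURNS smalldatetime
AS
BEGIN
    RETURN TRY_CONVERT(smalldatetime, @s);
END;
```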
You will probably be better off using the new TRY_CONVERT function in SQL Server 2012:
SELECT
InsertID,
dt1 = TRY_CONVERT(smalldatetime, MangledDateTime1),
dt2 = TRY_CONVERT(smalldatetime, MangledDateTime2),
dt3 = TRY_CONVERT(smalldatetime, MangledDateTime3)
FROM dbo.RawData;
╔══════════╦═════════════════════╦═════════════════════╦═════════════════════╗
║ InsertID ║ dt1 ║ dt2 ║ dt3 ║
╠══════════╬═════════════════════╬═════════════════════╬═════════════════════╣
║ 1 ║ 2000-10-10 00:00:00 ║ NULL ║ NULL ║
║ 1 ║ NULL ║ 2013-06-30 00:00:00 ║ NULL ║
║ 1 ║ NULL ║ NULL ║ 2013-06-30 00:00:00 ║
╚══════════╩═════════════════════╩═════════════════════╩═════════════════════╝
After the edit to the question
I see the function contains some specific logic. You could still look to use TRY_CONVERT as part of that, but you should definitely convert the scalar function to an in-line function. In-line functions (RETURNS TABLE) use a single SELECT statement and are expanded into the calling query and fully optimized, in much the same way views are. It can be helpful to think of in-line functions as parameterized views.
For example, an approximate translation of the scalar function to an in-line version is:
CREATE FUNCTION dbo.CleanDate
(@UnformattedString varchar(12))
RETURNS TABLE
AS RETURN
SELECT Result =
-- Successful conversion or NULL after
-- workarounds applied in CROSS APPLY
-- clauses below
TRY_CONVERT(smalldatetime, ca3.string)
FROM
(
-- Logic starts here
SELECT
CASE
WHEN @UnformattedString IS NULL
THEN NULL
WHEN LEN(@UnformattedString) <= 1
THEN NULL
WHEN LEN(@UnformattedString) = 12
THEN LEFT(@UnformattedString, 8)
ELSE @UnformattedString
END
) AS Input (string)
CROSS APPLY
(
-- Next stage using result so far
SELECT
CASE
WHEN @UnformattedString = '20000000'
THEN '20790606'
ELSE Input.string
END
) AS ca1 (string)
CROSS APPLY
(
-- Next stage using result so far
SELECT CASE
WHEN LEFT(ca1.string, 2) = '00' THEN '20' + RIGHT(ca1.string, 6)
WHEN LEFT(ca1.string, 2) = '18' THEN '19' + RIGHT(ca1.string, 6)
WHEN LEFT(ca1.string, 2) = '19' THEN ca1.string
WHEN LEFT(ca1.string, 2) = '20' THEN ca1.string
WHEN LEN(ca1.string) <> 6 THEN '20' + RIGHT(ca1.string, 6)
ELSE ca1.string
END
) AS ca2 (string)
CROSS APPLY
(
-- Next stage using result so far
SELECT
CASE
WHEN TRY_CONVERT(integer, LEFT(ca2.string, 4)) > YEAR(GETDATE())
THEN '20790606'
WHEN YEAR(GETDATE()) - TRY_CONVERT(integer, LEFT(ca2.string, 4)) >= 100
THEN '20790606'
ELSE ca2.string
END
) AS ca3 (string);
The function used on the sample data:
SELECT
InsertID,
Result1 = CD1.Result,
Result2 = CD2.Result,
Result3 = CD3.Result
FROM dbo.RawData AS RD
CROSS APPLY dbo.CleanDate(RD.MangledDateTime1) AS CD1
CROSS APPLY dbo.CleanDate(RD.MangledDateTime2) AS CD2
CROSS APPLY dbo.CleanDate(RD.MangledDateTime3) AS CD3;
Output:
╔══════════╦═════════════════════╦═════════════════════╦═════════════════════╗
║ InsertID ║ Result1 ║ Result2 ║ Result3 ║
╠══════════╬═════════════════════╬═════════════════════╬═════════════════════╣
║ 1 ║ 2000-10-10 00:00:00 ║ 2079-06-06 00:00:00 ║ NULL ║
║ 1 ║ 2079-06-06 00:00:00 ║ 2013-06-30 00:00:00 ║ 2079-06-06 00:00:00 ║
║ 1 ║ 2000-10-10 00:00:00 ║ 2079-06-06 00:00:00 ║ 2013-06-30 00:00:00 ║
╚══════════╩═════════════════════╩═════════════════════╩═════════════════════╝
*CLR scalar functions have a much faster invocation path than T-SQL scalar functions and do not prevent parallelism.
Best Answer
If a function contains a query that uses a CTE, and that function is used by a parallel worker process, the query and the CTE will only be executed in the private process context of that parallel worker. There is no shared state involved in creating or scanning the CTE.
So it is safe to mark the function PARALLEL SAFE. The quotation from the documentation says that a CTE defined in a query that should be parallelized can only be scanned by the leader process, not by the parallel workers, precisely for the reason that the CTE is not shared between processes. This has no impact on CTEs defined in queries running in functions called by parallel workers, because such nested queries won't be parallelized anyway: