Are there standard SQL functions with side effects?

functions sql-standard

Do the SQL standards define functions with side-effects?

For example, do they have functions for writing into files* or to update values in certain columns of a table when you do something like

SELECT myfunction(params...);

I've seen these sometimes, but I am just curious if the SQL standards do the same.

* This is not a question about PostgreSQL specifically. I'm only using the side-effect examples I see in PostgreSQL.

Best Answer

You have a few different questions in here.

Q: What are ANSI standard SQL functions?

ANSI standard functions are things like AVG, COUNT, MIN, MAX. They're covered in the 1992 ANSI standard, but that's one heck of a dry, boring read.

Q: Do ANSI standard SQL functions change data in the database?

No. You can use them to change data - for example, I can say:

INSERT INTO dbo.MyReport SELECT MAX(SalespersonRevenue) FROM dbo.Sales

But by itself, the use of AVG, COUNT, MIN, MAX, etc. shouldn't change data permanently inside your database.

Q: Does the ANSI standard allow me to write my own functions?

Yes, but the exact implementation varies from vendor to vendor. The functions you write may conform to the ANSI language standard, but what you do inside your function can be horrifically awful, like creating side effects.

When discussing intended behavior, it's possible to get a cross-platform answer.
When discussing side effects, it is not.
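To make the point about user-written functions concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. It is not SQL Server, and it is not anything the SQL standard prescribes; it only shows that once a vendor lets you register your own function, a plain SELECT can change state outside the query. The names noisy_double and audit_log are hypothetical.

```python
import sqlite3

# Hypothetical in-memory "audit log"; the name is illustrative only.
audit_log = []

def noisy_double(x):
    """Return 2*x, but also record the call: a side effect hidden in a function."""
    audit_log.append(x)
    return 2 * x

conn = sqlite3.connect(":memory:")
# Register the Python callable as a one-argument SQL scalar function.
conn.create_function("noisy_double", 1, noisy_double)

result = conn.execute("SELECT noisy_double(21)").fetchone()[0]
print(result)     # the value returned to the query
print(audit_log)  # state changed outside the query by the same SELECT
conn.close()
```

The query looks read-only, yet the list grows on every call. That is the kind of behavior that differs from vendor to vendor.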

Q: Can I create my own function to write data?

Why sure, if you're creative. I'm a Microsoft SQL Server guy, so I'm going to focus on that platform. Books Online's function page says:

User-defined functions cannot be used to perform actions that modify the database state.

To which I say:

You're not my real dad.

So here's how I'd break the rules. Warning: very bad ideas follow.

- In your function, query a new table specially created for this evil purpose, and then create something that watches the table for SELECT statements and fires an action (Extended Events, auditing, or a Profiler trace). You can hook together a Rube Goldberg sort of contraption to perform work based on those SELECT statements.
- In the function, call CLR code - heck, you can even call a web service. That web service could very well push data back into your own database.
- In the function, call xp_cmdshell and do something through the command prompt. (HT @AaronBertrand in the comments.)

All of these examples have huge drawbacks in the form of performance and transactional consistency. You just asked if it could theoretically be done, and the answer there is yes. I wouldn't ever use any of those in my own code - I'd step back and ask, "What's the business goal I'm trying to achieve here, and is there a way I can do it to achieve performance and transactional consistency?" If you'd like specific advice on those, I'd ask a separate Stack question with specifics.

Related Solutions

PostgreSQL – is it possible to use a plpgsql function in a bash script

As you were hinted in comments above, your RBAR approach might be very inefficient. Consider the suggestions there.

Also, I am not going into details about the different approaches to UPSERT, as it is a very broad topic, especially when one wants it to be concurrency-safe. PostgreSQL 9.5 helps a lot in this regard.

So, to your actual question: it is not directly possible to write PL/pgSQL 'commands' interspersed with bash, but the task is definitely solvable.

For this, you need to define a function like this (left out the complexity to show my point more concisely):

CREATE OR REPLACE FUNCTION dynamic_upsert(
    table_to_upsert name, 
    key_column name, 
    value_column name, 
    key integer, 
    value text
)
RETURNS void
LANGUAGE plpgsql
AS $BODY$
BEGIN
    ...
    EXECUTE $$ UPDATE $$ || quote_ident(table_to_upsert) || 
            $$ SET $$ || quote_ident(value_column) || $$ = $2 
               WHERE $$ || quote_ident(key_column) || $$  = $1 $$ 
        USING key, value;
    ...
END;
$BODY$;

Please note that this is fully dynamic in that you can choose any table, match against any column of it and modify any (other) column. The type of the key (integer) and the value (text) are fixed, though.

Now you can call this from a shell script like

psql -h $DBHOST -U $DBUSER -d $DBNAME \
     -c "SELECT dynamic_upsert('$TABLE', '$KEY_COL', '$VAL_COL', $key, '$value')"

If you are still RBARing, this will produce an additional overhead of connecting to the database for every single change, so be patient (or set up a connection pool).

Notes:

When using concatenation, one has to be absolutely sure to quote the object names properly. This is done here using quote_ident(), which is the way to go if you are concatenating the dynamic statement. In newer versions (from 9.1), one would use the format() function to avoid the mess:

EXECUTE format($$ UPDATE %I SET %I = $2 WHERE %I = $1$$, 
               table_to_upsert, value_column, key_column) 
USING key, value;

Here I am using dollar quoting when building the dynamic query. This allows me to use 'normal' syntax, as opposed to doubling single quotes, for example (not present in this example). This way most editors will highlight the statements nicely.
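The split between quoted identifiers and bound values can also be sketched outside PL/pgSQL. The following is a minimal Python illustration against the standard-library sqlite3 module, not PostgreSQL. The quote_ident helper here is a simplified stand-in for PostgreSQL's function of the same name (it always adds quotes), and dynamic_update, the table name, and the column names are hypothetical.

```python
import sqlite3

def quote_ident(name: str) -> str:
    """Simplified stand-in for PostgreSQL's quote_ident(): always wrap the
    identifier in double quotes and double any embedded double quote."""
    return '"' + name.replace('"', '""') + '"'

def dynamic_update(conn, table, key_column, value_column, key, value):
    # Identifiers are spliced into the statement only after quoting;
    # the data values travel separately as bound parameters.
    sql = "UPDATE {} SET {} = ? WHERE {} = ?".format(
        quote_ident(table), quote_ident(value_column), quote_ident(key_column)
    )
    return conn.execute(sql, (value, key)).rowcount

conn = sqlite3.connect(":memory:")
# A table name containing a space shows why the quoting matters.
conn.execute('CREATE TABLE "my table" (id INTEGER PRIMARY KEY, label TEXT)')
conn.execute('INSERT INTO "my table" (id, label) VALUES (1, ?)', ("old",))

changed = dynamic_update(conn, "my table", "id", "label", 1, "new")
label = conn.execute('SELECT label FROM "my table" WHERE id = 1').fetchone()[0]
print(changed, label)
```

As in the EXECUTE ... USING form above, only the object names are concatenated; the key and value never become part of the statement text.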

Sql-server – How to implement a Set based algorithm/UDF

With regards to methodology, I believe you are barking up the wrong b-tree ;-).

What we know:

First, let's consolidate and review what we know about the situation:

Somewhat complex calculations need to be performed:
- This needs to happen on every row of this table.
- The algorithm changes frequently.
- The algorithm ... [uses] values from some columns to manipulate other columns
- Current processing time is: 7 hours
The table:
- contains 800,000 rows.
- has 38 columns.
The application back-end:
- is Microsoft Excel.
- uses VBA (Visual Basic for Applications)** as the language.
Database is SQL Server 2014, Enterprise Edition.

There is a Stored Procedure that is called for every row:

This takes 50 ms (on avg, I assume) to run.
It returns approximately 4000 rows.

The definition (at least in part) is:

SELECT AVG([AD_Sensor_Data])
           OVER (ORDER BY [RowID] ROWS BETWEEN 5 PRECEDING AND 5 FOLLOWING)
           as 'AD_Sensor_Data'
FROM   [AD_Points]
WHERE  [FileID] = @FileID
ORDER BY [RowID] ASC

What we can surmise:

Next, we can look at all of these data points together to see if we can synthesize additional details that will help us find one or more bottlenecks, and either point towards a solution, or at least rule some possible solutions out.

The current direction of thought in the comments is that the major issue is data transfer between SQL Server and Excel. Is that really the case? If the Stored Procedure is called for each of the 800,000 rows and takes 50 ms per each call (i.e. per each row), that adds up to 40,000 seconds (not ms). And that is equivalent to 666 minutes (hhmm ;-), or just over 11 hours. Yet the whole process was said to take only 7 hours to run. We are already 4 hours over the total time, and we haven't even added in time to do the calculations or save the results back to SQL Server. So something is not right here.

Looking at the definition of the Stored Procedure, there is only an input parameter for @FileID; there isn't any filter on @RowID. So I suspect that one of the following two scenarios is happening:

This Stored Procedure does not actually get called per each row, but instead per each @FileID, which appears to span approximately 4000 rows. If the stated 4000 rows returned is a fairly consistent amount, then there are only 200 of those groupings in the 800,000 rows. And 200 executions taking 50 ms each amounts to only 10 seconds out of that 7 hours.
If this stored procedure actually does get called for every row, then wouldn't the first time a new @FileID is passed in take slightly longer to pull new rows into the Buffer Pool, but then the next 3999 executions would typically return faster due to already being cached, right?
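The arithmetic behind the two scenarios can be checked in a few lines. This is only a back-of-the-envelope sketch in Python using the figures quoted above (800,000 rows, 50 ms per call, roughly 4000 rows per FileID); the variable names are made up for illustration.

```python
ROWS = 800_000          # rows in the table
MS_PER_CALL = 50        # stated average duration of one stored procedure call
ROWS_PER_FILE = 4_000   # approximate rows returned per FileID

# Scenario A: one call per row.
per_row_seconds = ROWS * MS_PER_CALL / 1000
per_row_hours = per_row_seconds / 3600

# Scenario B: one call per FileID.
file_count = ROWS // ROWS_PER_FILE
per_file_seconds = file_count * MS_PER_CALL / 1000

print(per_row_seconds, round(per_row_hours, 1))  # seconds and hours for per-row calls
print(file_count, per_file_seconds)              # number of FileIDs and total seconds
```

Scenario A alone would exceed the reported 7 hours, while scenario B is negligible, which is why neither points at the stored procedure as the bottleneck.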

I think that focusing on this "filter" Stored Procedure, or any data transfer from SQL Server to Excel, is a red herring.

For the moment, I think the most relevant indicators of lackluster performance are:

There are 800,000 rows
The operation works on one row at a time
The data is being saved back to SQL Server, hence "[uses] values from some columns to manipulate other columns" [ my emphasis ;-) ]

I suspect that:

while there is some room for improvement on the data retrieval and calculations, making those better wouldn't amount to a significant reduction in processing time.
the major bottleneck is issuing 800,000 separate UPDATE statements, which is 800,000 separate transactions.

My recommendation (based on currently available information):

Your biggest area of improvement would be to update multiple rows at one time (i.e. in one transaction). You should update your process to work in terms of each FileID instead of each RowID. So:
1. read in all 4000 rows of a particular FileID into an array
2. the array should contain elements representing the fields being manipulated
3. cycle through the array, processing each row as you currently do
4. once all rows in the array (i.e. for this particular FileID) have been calculated:
  1. start a transaction
  2. call each update per each RowID
  3. if no errors, commit the transaction
  4. if an error occurred, rollback and handle appropriately
If your clustered index isn't already defined as (FileID, RowID) then you should consider that (as @MikaelEriksson suggested in a comment on the Question). It won't help these singleton UPDATEs, but it would at least slightly improve the aggregate operations, such as what you are doing in that "filter" stored procedure since they are all based on FileID.
You should consider moving the logic to a compiled language. I would suggest creating a .NET WinForms app or even Console App. I prefer Console App as it is easy to schedule via SQL Agent or Windows Scheduled Tasks. It shouldn't matter whether it is done in VB.NET or C#. VB.NET might be a more natural fit for your developer, but there will still be some learning curve.

I don't see any reason at this point to move to SQLCLR. If the algorithm changes frequently, it would get annoying to have to re-deploy the Assembly all of the time. Rebuilding a Console App and having the .exe get placed into the proper shared folder on the network, such that you just run the same program and it just happens to always be up-to-date, should be fairly easy to do.

I don't think moving the processing fully into T-SQL would help if the problem is what I suspect and you are just doing one UPDATE at a time.
If the processing is moved into .NET, you can then make use of Table-Valued Parameters (TVPs) such that you would pass the array into a Stored Procedure that would call an UPDATE that JOINs to the TVP table variable and is hence a single transaction. The TVP should be faster than doing 4000 INSERTs grouped into a single transaction. But the gain coming from using TVPs over 4000 INSERTs in 1 transaction likely won't be as significant as the improvement seen when moving from 800,000 separate transactions to only 200 transactions of 4000 rows each.

The TVP option is not natively available for the VBA side, but someone came up with a work-around that might be worth testing:

How do I improve the database performance when going from VBA to SQL Server 2008 R2?
IF the filter proc is only using FileID in the WHERE clause, and IF that proc is really being called per every row, then you can save some processing time by caching the results of the first run and using them for the rest of the rows per that FileID, right?
Once you get the processing done per FileID, then we can start talking about parallel processing. But that might not be necessary at that point :). Given that you are dealing with 3 fairly major non-ideal parts: Excel, VBA, and 800k transactions, any talk of SSIS, or parallelograms, or who-knows-what, is premature optimization / cart-before-the-horse type stuff. If we can get this 7 hour process down to 10 minutes or less, would you still be thinking of additional ways to make it faster? Is there a target completion time that you have in mind? Keep in mind that once processing is done on a per FileID basis, if you had a VB.NET Console App (i.e. command-line .EXE), there would be nothing stopping you from running a few of those FileIDs at a time :), whether via SQL Agent CmdExec step or Windows Scheduled Tasks, etc.

AND, you can always take a "phased" approach and make a few improvements at a time. Such as starting with doing the updates per FileID and hence using one transaction for that group. Then, see if you can get the TVP working. Then see about taking that code and moving it to VB.NET (and TVPs work in .NET so it will port nicely).
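The first phase (read all rows for one FileID, calculate in memory, then write them back inside a single transaction) can be sketched as follows. This is a Python illustration against the standard-library sqlite3 module, not VBA/ADO or SQL Server. The Result column, the process_file function, and the doubling "calculation" are placeholders, since the real algorithm and schema are not shown.

```python
import sqlite3

def process_file(conn, file_id):
    """Read every row for one FileID, compute in memory, write back in one transaction."""
    rows = conn.execute(
        "SELECT RowID, AD_Sensor_Data FROM AD_Points WHERE FileID = ? ORDER BY RowID",
        (file_id,),
    ).fetchall()
    # Placeholder calculation; the real algorithm is not part of this sketch.
    updates = [(value * 2.0, row_id) for row_id, value in rows]

    conn.execute("BEGIN")  # one transaction for the whole FileID
    try:
        conn.executemany("UPDATE AD_Points SET Result = ? WHERE RowID = ?", updates)
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # undo the partial batch, then let the caller decide
        raise
    return len(updates)

# isolation_level=None puts sqlite3 in autocommit mode so BEGIN/COMMIT are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE AD_Points ("
    "RowID INTEGER PRIMARY KEY, FileID INTEGER, AD_Sensor_Data REAL, Result REAL)"
)
conn.executemany(
    "INSERT INTO AD_Points (RowID, FileID, AD_Sensor_Data) VALUES (?, ?, ?)",
    [(i, i % 2, float(i)) for i in range(1, 9)],  # tiny sample: 2 FileIDs, 8 rows
)

file_ids = [r[0] for r in conn.execute("SELECT DISTINCT FileID FROM AD_Points ORDER BY FileID")]
processed = sum(process_file(conn, f) for f in file_ids)
print(processed)
```

The number of transactions now equals the number of FileIDs rather than the number of rows, which is the change expected to matter most.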

What we do not know that could still help:

Does the "filter" Stored Procedure run per RowID or per FileID? Do we even have the full definition of that Stored Procedure?
Full schema of the table. How wide is this table? How many variable length fields are there? How many fields are NULLable? If any are NULLable, how many contain NULLs?
Indexes for this table. Is it partitioned? Is either ROW or PAGE Compression being used?
How large is this table in terms of MB / GB?
How is index maintenance handled for this table? How fragmented are the indexes? How up to date are the statistics?
Do any other processes write to this table while this 7 hour process is taking place? Possible source of contention.
Do any other processes read from this table while this 7 hour process is taking place? Possible source of contention.

UPDATE 1:

** There seems to be some confusion about what VBA (Visual Basic for Applications) is and what can be done with it, so this is just to make sure we are all on the same web-page:

VBA is not Visual Basic (stand-alone language), nor is it VBScript (works in IIS and Windows Scripting Host). VBA is a reduced form of VB that is specific to Microsoft Office products, to extend their capabilities. Hence, there is no real way to simply "remove" the VBA code from Excel in order to get Excel out of the picture.
Can VBA Code Be Run Outside of MS Applications?
Run VBA outside of Office applications
Run Excel Macro from Outside Excel Using VBScript From Command Line

UPDATE 2:

One more point to consider: How are connections being handled? Is the VBA code opening and closing the Connection per each operation, or does it open the connection at the start of the process and close it at the end of the process (i.e. 7 hours later)? Even with connection pooling (which, by default, should be enabled for ADO), there should still be quite an impact between opening and closing once as opposed to opening and closing either 800,200 or 1,600,000 times. Those values are based on at least 800,000 UPDATEs plus either 200 or 800k EXECs (depending on how often the filter stored procedure is actually being executed).

This issue of too many connections is automatically mitigated by the recommendation I outlined above. By creating a transaction and doing all of the UPDATEs within that transaction, you are going to be keeping that connection open and reusing it for each UPDATE. Whether or not the connection is kept open from the initial call to get the 4000 rows per the specified FileID, or closed after that "get" operation and opened again for the UPDATEs, is far less impacting since we are now talking about a difference of either 200 or 400 total connections across the entire process.

UPDATE 3:

I did some quick testing. Please keep in mind that this is a rather small scale test, and not the exact same operation (pure INSERT vs EXEC + UPDATE). However, the differences in timing related to how connections and transactions are handled are still relevant, hence the information can be extrapolated to having a relatively similar impact here.

Test Parameters:

SQL Server 2012 Developer Edition (64-bit), SP2

Table:

 CREATE TABLE dbo.ManyInserts
 (
    RowID INT NOT NULL IDENTITY(1, 1) PRIMARY KEY,
    InsertTime DATETIME NOT NULL DEFAULT (GETDATE()),
    SomeValue BIGINT NULL
 );

Operation:

INSERT INTO dbo.ManyInserts (SomeValue) VALUES ({LoopIndex * 12});

Total Inserts per each test: 10,000
Resets per each test: TRUNCATE TABLE dbo.ManyInserts; (given the nature of this test, doing the FREEPROCCACHE, FREESYSTEMCACHE, and DROPCLEANBUFFERS didn't seem to add much value.)
Recovery Model: SIMPLE (and maybe 1 GB free in the Log file)
Tests that use Transactions only use a single Connection regardless of how many Transactions.

Results:

Test                                   Milliseconds
-------                                ------------
10k INSERTs across 10k Connections     3968 - 4163
10k INSERTs across 1 Connection        3466 - 3654
10k INSERTs across 1 Transaction       1074 - 1086
10k INSERTs across 10 Transactions     1095 - 1169

As you can see, even if the ADO connection to the DB is already being shared across all operations, grouping them into batches using an explicit transaction (the ADO object should be able to handle this) is guaranteed to significantly (i.e. over 2x improvement) reduce the overall process time.
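The shape of such a test can be sketched as below. This is a Python harness against an in-memory SQLite database, so its timings are not comparable to the SQL Server numbers above, and it does not cover the multiple-connection cases. The table is a cut-down version of dbo.ManyInserts, and the row count is reduced to keep the sketch quick.

```python
import sqlite3
import time

N = 1_000  # fewer rows than the 10,000 used above; must be divisible by the batch count

def fresh_connection():
    # Autocommit mode: every statement commits on its own unless BEGIN is issued.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE ManyInserts (RowID INTEGER PRIMARY KEY, SomeValue INTEGER)")
    return conn

def insert_autocommit(conn):
    """N INSERTs, each one its own implicit transaction."""
    for i in range(N):
        conn.execute("INSERT INTO ManyInserts (SomeValue) VALUES (?)", (i * 12,))

def insert_batched(conn, batches=1):
    """The same N INSERTs grouped into `batches` explicit transactions."""
    size = N // batches
    for b in range(batches):
        conn.execute("BEGIN")
        for i in range(b * size, (b + 1) * size):
            conn.execute("INSERT INTO ManyInserts (SomeValue) VALUES (?)", (i * 12,))
        conn.execute("COMMIT")

def timed(fn, *args):
    conn = fresh_connection()
    start = time.perf_counter()
    fn(conn, *args)
    elapsed_ms = (time.perf_counter() - start) * 1000
    count = conn.execute("SELECT COUNT(*) FROM ManyInserts").fetchone()[0]
    conn.close()
    return elapsed_ms, count

results = {
    "autocommit": timed(insert_autocommit),
    "1 transaction": timed(insert_batched, 1),
    "10 transactions": timed(insert_batched, 10),
}
for name, (ms, count) in results.items():
    print(f"{name:16s} {ms:8.1f} ms  rows={count}")
```

Running the equivalent against the real server, with the real UPDATE statement, is the only way to get numbers that apply to the 7 hour process.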
