SQL Server – Extremely large import taking over 24hrs

sql-server, sql-server-2016

We have a query that we run to import a Master Do Not Call List into our database with about 200 million records.

We first mark the existing numbers in the database table with an expiration date. This process takes about 19 hours or more and eats up the transaction log and the tempdb log.

The second part of the query is the import from a text file located on the C: drive. This part also takes quite some time, and last time SQL Server Management Studio crashed, so the query tanked and did not finish.

Below is the query that we run:

[screenshot of the query]

To give some background on the server setup:
Windows Server 2016 Standard
Intel Xeon, 8 cores @ 2.59 GHz
128 GB RAM
SQL Server 2016 Standard

Data volume located on 3PAR 8200 SAN spinning drives.

Log volume located on 3PAR 8200 SAN spinning drives (current size is now 1.46 TB because we ran out of space three times).

Tempdb data and log volumes located on the 3PAR SAN as well (one volume).

Is there a better way to run this query and make it more efficient? Should we create a job with two steps, one for the expiration update and a second for the bulk insert? Should we move the data and log volumes to SSDs on the 3PAR…currently they are located on spinning drives.

What do you think we can do to make the process more efficient?

Any information would be greatly appreciated!
Thanks!

Best Answer

The first problem I see is that you are performing a massive update (19+ hours in duration) in a single run, and that is why your server is running low on resources. You need to perform the update in batches. It will be faster, and your server setup should be able to cope with the resource load.

Try the following:

DECLARE @Rows INT,
        @BatchSize INT,
        @NewExpiration DATETIME;

-- Capture the target expiration date once, so the cutoff does not drift
-- between batches and already-updated rows are not picked up again.
SET @NewExpiration = DATEADD(DAY, 5, GETDATE());

SET @BatchSize = 2000;

SET @Rows = @BatchSize;

-- Loop until a batch touches fewer rows than the batch size,
-- i.e. there is nothing left to update.
WHILE (@Rows = @BatchSize)
  BEGIN
      UPDATE TOP (@BatchSize) tab
      SET    tab.expiration = @NewExpiration
      FROM   I3dialer.dbo.MasterDNC tab
      WHERE  tab.expiration < @NewExpiration;

      SET @Rows = @@ROWCOUNT;
  END;

Before testing this, make sure there is an index on the expiration column to speed up the query.
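If there isn't one already, something along these lines should do; this is a minimal sketch, and the index name is just an example (the table and column names are taken from the batch update above):

-- Hypothetical index name; table/column names as used in the batch update.
CREATE NONCLUSTERED INDEX IX_MasterDNC_expiration
    ON I3dialer.dbo.MasterDNC (expiration);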

For the second part of your problem, increasing the bulk insert batch size to, say, 10000 should help a bit. You can experiment with this to find your optimal batch size, for example testing against 1 million records to see whether there is any performance difference. There are also other things to consider: you should first load the data into an empty table, perform whatever operations you need to clean up your data, and finally insert it into your main table.
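As a rough sketch of that load step, assuming a hypothetical staging table, file path, and delimiter (adjust these to match your actual file):

-- Staging table name, file path, and terminators below are assumptions.
BULK INSERT I3dialer.dbo.MasterDNC_Stage
FROM 'C:\imports\MasterDNC.txt'
WITH (
    BATCHSIZE       = 10000,  -- commit every 10,000 rows instead of one giant transaction
    TABLOCK,                  -- allows minimal logging when loading an empty table
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n'
);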

In summary, you should now have a three-part process:

  1. Perform the batched update on your main table
  2. Load the data into an empty temporary or staging table
  3. Insert the data from your staging table into your main table (a sketch of this step follows the list)
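A minimal sketch of step 3, under the same assumptions as above; the column name phone_number and the NOT EXISTS check are only examples of the kind of cleanup you might do, not a statement of what your actual logic should be:

-- Hypothetical column list; replace with the actual columns in your file.
INSERT INTO I3dialer.dbo.MasterDNC (phone_number)
SELECT s.phone_number
FROM   I3dialer.dbo.MasterDNC_Stage s
WHERE  NOT EXISTS (SELECT 1
                   FROM   I3dialer.dbo.MasterDNC m
                   WHERE  m.phone_number = s.phone_number);  -- skip numbers already present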

You should definitely see better performance after implementing this. Do share your results.