SQL Server – Fast Methods to Load Large Amounts of Test Data

bulk-insert, optimization, performance, sql-server

I'm trying to populate a table with a load of dummy data so I can play about with optimisation etc.

I've got the following:

DECLARE @RowCount INT
DECLARE @Random INT
DECLARE @Upper INT

SET @Upper = 1000
SET @RowCount = 0

WHILE @RowCount < 3000000
BEGIN

    SELECT @Random = ROUND(@Upper * RAND(), 0)

    INSERT INTO [dbo].[Test]
               ([Id]
               ,[OtherKey]
               ,[Description])
         VALUES
               (@RowCount
               ,@Random
               ,CAST(@Random AS VARCHAR(max)))

    SET @RowCount = @RowCount + 1
END

However, this seems very slow.

Is there a better way to automate loading semi-random rows into a database table?

New Script

This one seems to be pretty quick:

USE [Test]
GO

/****** Object:  Table [dbo].[Test]    Script Date: 17/10/2016 21:22:39 ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[Test](
    [Id] [int] NOT NULL,
    [OtherKey] [int] NOT NULL,
    [Description] [varchar](max) NOT NULL,
    [Time] [datetime] NOT NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

GO

-- Add sample data

DECLARE @RowCount INT
DECLARE @Random INT
DECLARE @Upper INT

SET @Upper = 1000
SET @RowCount = 0

WHILE @RowCount < 1000000
BEGIN

    SELECT @Random = ROUND(@Upper * RAND(), 0)

    INSERT INTO [dbo].[Test]
               ([Id]
               ,[OtherKey]
               ,[Description]
               ,[Time])
         VALUES
               (@RowCount
               ,@Random
               ,CAST(@Random AS VARCHAR(max))
               ,GETDATE())

    SET @RowCount = @RowCount + 1
END

GO

/****** Object:  Index [IX_ID]    Script Date: 17/10/2016 22:18:48 ******/
CREATE CLUSTERED INDEX [IX_ID] ON [dbo].[Test]
(
    [Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO



/****** Object:  Index [IX_OtherKey]    Script Date: 17/10/2016 21:22:46 ******/
CREATE NONCLUSTERED INDEX [IX_OtherKey] ON [dbo].[Test]
(
    [OtherKey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

If anyone has a faster way, though, I'd be very happy to hear it.

Best Answer

Here are a few guiding principles for loading non-trivial amounts of test data:

1) When possible use minimal logging.

If you can avoid writing unnecessary data to the transaction log, great, why not do that? This requires your test database to use a recovery model of simple (or bulk-logged). You also need to be careful to follow the rules that allow you to get minimal logging. If you need to insert into a table with a clustered index that already contains data, trace flag 610 can help.
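
As a minimal sketch (assuming your test database really is called [Test] and you're free to change its recovery model), switching to simple recovery looks like this:

-- Illustrative only: put the test database into the simple recovery model
-- so that bulk loads can qualify for minimal logging.
ALTER DATABASE [Test] SET RECOVERY SIMPLE;

-- Check which recovery model is currently in effect.
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = N'Test';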

2) Avoid table and column DDL choices which contribute to unnecessary overhead.

As you discovered, it is sometimes more efficient to create a clustered index after the data is loaded rather than before, although it can go either way depending on the data. Creating the clustered index afterwards can be faster simply because SQL Server can insert data into an existing heap in parallel (as of SQL Server 2016) and can build the clustered index in parallel. Building the index afterwards won't be a minimally logged operation and it requires all of the table's data to be sorted, so this might not work if your development system isn't sized large enough.

As you also discovered, definitely create nonclustered indexes after you have loaded the data instead of before.

I believe that there is some overhead to using the VARCHAR(MAX) datatype, so I would avoid it if possible.
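
Putting those two points together, a rough sketch of the DDL side (reusing the table and index names from your script, but with VARCHAR(100) and no indexes until after the load) could look like this:

-- Sketch: define the test table as a plain heap with a narrower Description.
CREATE TABLE dbo.Test
(
    [Id] INT NOT NULL,
    [OtherKey] INT NOT NULL,
    [Description] VARCHAR(100) NOT NULL,
    [Time] DATETIME NOT NULL
);

-- ...load the data here, then build the indexes afterwards.
CREATE CLUSTERED INDEX IX_ID ON dbo.Test ([Id]);
CREATE NONCLUSTERED INDEX IX_OtherKey ON dbo.Test ([OtherKey]);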

3) Avoid row-by-row operations when possible. SQL Server is generally more efficient with set-based solutions.

There are a few row-by-row operations in your code. You are running a WHILE loop and processing just one row at a time. You're also creating a transaction for every row that you insert. There must be some overhead to creating and committing a transaction, right? Why pay that for every row? If your system is big enough you can often insert all of your test data with a single query, especially if you are able to use minimal logging.
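
If you do stick with a loop for some reason, one cheap intermediate improvement (a sketch only, using the table from your new script, and still not as good as the set-based approach below) is to wrap the whole loop in a single explicit transaction so the commit cost is paid once instead of once per row:

DECLARE @RowCount INT = 0;
DECLARE @Random INT;

-- Sketch: one explicit transaction around the whole loop instead of an
-- autocommit transaction for every single INSERT.
BEGIN TRANSACTION;

WHILE @RowCount < 1000000
BEGIN
    SET @Random = ROUND(1000 * RAND(), 0);

    INSERT INTO dbo.Test ([Id], [OtherKey], [Description], [Time])
    VALUES (@RowCount, @Random, CAST(@Random AS VARCHAR(100)), GETDATE());

    SET @RowCount = @RowCount + 1;
END;

COMMIT TRANSACTION;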

4) Use parallelism when possible.

The code that you have runs one row at a time and cannot take advantage of multiple cores on the server.

I will give you one way to approach problems like this, but this query may not meet your needs exactly.

WITH
L0 AS (SELECT 1 AS c UNION ALL SELECT 1),
L1 AS (SELECT 1 AS c FROM L0 A CROSS JOIN L0 B),
L2 AS (SELECT 1 AS c FROM L1 A CROSS JOIN L1 B),
L3 AS (SELECT 1 AS c FROM L2 A CROSS JOIN L2 B),
L4 AS (SELECT 1 AS c FROM L3 A CROSS JOIN L3 B),
L5 AS (SELECT 1 AS c FROM L4 A CROSS JOIN L4 B),
NUMS AS (SELECT 1 AS NUM FROM L5)   
SELECT TOP 1000000
  CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS INT) [ID]
, CAST(t.RAND_VALUE AS INT) [OTHERKEY]
, CAST(t.RAND_VALUE AS VARCHAR(100)) [DESCRIPTION]
, CAST(GETDATE() AS DATETIME) [TIME]
INTO [X_JRO_TEST_RAND_2]
FROM NUMS CROSS JOIN (SELECT ROUND(1000 * RAND(CHECKSUM(NEWID())), 0) RAND_VALUE) t;

I will break the query down for you using the guiding principles that I listed at the top of the post.

1) SELECT INTO creates a HEAP table as part of the insert. If the database has a recovery model of simple then this operation will be minimally logged. I can save work by not writing unnecessary data to the transaction log.

2) The table does not have any indexes before loading data. I'm also using VARCHAR(100) instead of VARCHAR(MAX).

3) To generate numbers efficiently I'm using a technique popularized (and probably created) by Itzik Ben-Gan. All of the CTEs are there to generate 1,000,000 rows. I would check out the article for more details on how the approach works. To make RAND() return a different value for each row, I used one of the techniques documented in this Stack Overflow post.
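
As a quick, standalone illustration of that last point (not part of the query above): an unseeded RAND() is evaluated once per statement and returns the same value for every row, while seeding it with CHECKSUM(NEWID()) produces a fresh value per row.

-- Sketch: compare the two behaviours side by side.
SELECT TOP (5)
       RAND()                  AS same_for_every_row
     , RAND(CHECKSUM(NEWID())) AS different_for_every_row
FROM sys.objects;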

4) As of SQL Server 2014 the query optimizer can insert data into a heap in parallel when using SELECT INTO syntax. As of SQL Server 2016 the query optimizer can insert data into a heap in parallel even without SELECT INTO syntax. On my test system the query runs in parallel.
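
As a rough sketch of that second case (assuming SQL Server 2016 or later and an existing heap to load into; the table names are just the ones from this post), a plain INSERT ... SELECT can also go parallel, but it needs the TABLOCK hint:

-- Sketch: parallel insert into an existing heap without SELECT INTO
-- (SQL Server 2016+); the WITH (TABLOCK) hint is what allows it, and under
-- the simple recovery model it also allows minimal logging into a heap.
INSERT INTO dbo.Test WITH (TABLOCK) ([Id], [OtherKey], [Description], [Time])
SELECT [ID], [OTHERKEY], [DESCRIPTION], [TIME]
FROM dbo.[X_JRO_TEST_RAND_2];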

On a test system here, your code took 10 minutes 30 seconds for a million rows, but the query above took only 23 seconds. I only measured the data loading step and didn't bother with the indexes. It's worth pointing out that there are probably even more efficient ways to generate data in the form you wanted, and that my code does not generate different values for the TIME column.

It's also worth pointing out that if your solution runs fast enough for your purposes, great, go ahead and keep it. Sometimes it isn't necessary to get every last bit of performance out of code, especially code run behind the scenes for development purposes.