YOUR QUERY
SELECT post.postid, post.attach FROM newbb_innopost AS post WHERE post.threadid = 51506;
At first glance, that query should only touch 1.1597% (62510 out of 5390146) of the table's rows. It should be fast given the key distribution of threadid 51506.
REALITY CHECK
No matter which version of MySQL (Oracle, Percona, MariaDB) you use, none of them can escape the one enemy they all have in common: the InnoDB architecture.
CLUSTERED INDEX
Please keep in mind that each threadid entry in a secondary index carries a copy of the primary key. This means that every read from the secondary index must be followed by a primary key lookup within the ClusteredIndex (internally named gen_clust_index). In the ClusteredIndex, each InnoDB page contains both row data and PRIMARY KEY index info. See my post Best of MyISAM and InnoDB for more info.
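A quick way to see this at work is to run EXPLAIN against the query above (nothing new here, just the original query): unless the chosen index covers every selected column, each matching index entry still costs an extra lookup into the ClusteredIndex.
EXPLAIN SELECT post.postid, post.attach FROM newbb_innopost AS post WHERE post.threadid = 51506;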
REDUNDANT INDEXES
You have a lot of clutter in the table because some indexes have the same leading columns. MySQL and InnoDB have to navigate through that index clutter to reach the needed BTREE nodes. You should reduce the clutter by running the following:
ALTER TABLE newbb_innopost
DROP INDEX threadid,
DROP INDEX threadid_2,
DROP INDEX threadid_visible_dateline,
ADD INDEX threadid_visible_dateline_index (`threadid`,`visible`,`dateline`,`userid`)
;
Why strip down these indexes?
- The first three indexes start with threadid
- threadid_2 and threadid_visible_dateline start with the same three columns
- threadid_visible_dateline does not need postid since it is the PRIMARY KEY and is already embedded in every secondary index entry
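Before dropping anything, you can confirm the overlap yourself from information_schema (a quick sanity check using only standard columns of the STATISTICS table):
SELECT INDEX_NAME, GROUP_CONCAT(COLUMN_NAME ORDER BY SEQ_IN_INDEX) AS index_columns
FROM information_schema.STATISTICS
WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = 'newbb_innopost'
GROUP BY INDEX_NAME;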
BUFFER CACHING
The InnoDB Buffer Pool caches data and index pages. MyISAM only caches index pages.
Just in this area alone, MyISAM does not waste time caching data. That's because it's not designed to cache data. InnoDB caches every data page and index page (and its grandmother) it touches. If your InnoDB Buffer Pool is too small, you could be caching pages, invalidating pages, and removing pages all in one query.
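You can gauge whether the Buffer Pool is churning like that with the standard counters; a steadily climbing Innodb_buffer_pool_reads relative to Innodb_buffer_pool_read_requests means pages keep missing the cache:
SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';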
TABLE LAYOUT
You could shave off some space from each row by revisiting importthreadid and importpostid. You have them as BIGINTs; together they take up 16 bytes per row in the ClusteredIndex. You should run this:
SELECT importthreadid,importpostid FROM newbb_innopost PROCEDURE ANALYSE();
This will recommend what data types these columns should be for the given dataset.
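For example, if ANALYSE() reports that both columns fit in INT UNSIGNED (an assumption about your data; check its output first, and keep whatever NOT NULL/DEFAULT attributes your current definition has), the change would halve those 16 bytes:
ALTER TABLE newbb_innopost MODIFY importthreadid INT UNSIGNED, MODIFY importpostid INT UNSIGNED;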
CONCLUSION
MyISAM has a lot less to contend with than InnoDB, especially in the area of caching.
While you revealed the amount of RAM (32GB) and the version of MySQL (Server version: 10.0.12-MariaDB-1~trusty-wsrep-log mariadb.org binary distribution, wsrep_25.10.r4002), there are still other pieces to this puzzle you have not revealed:
- The InnoDB settings
- The number of cores
- Other settings from my.cnf
If you can add these things to the question, I can further elaborate.
UPDATE 2014-08-28 11:27 EDT
You should increase threading
innodb_read_io_threads = 64
innodb_write_io_threads = 16
innodb_log_buffer_size = 256M
I would consider disabling the query cache (See my recent post Why query_cache_type is disabled by default start from MySQL 5.6?)
query_cache_size = 0
I would preserve the Buffer Pool across restarts
innodb_buffer_pool_dump_at_shutdown=1
innodb_buffer_pool_load_at_startup=1
Increase purge threads (if you do DML on multiple tables)
innodb_purge_threads = 4
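Collected together, the my.cnf additions would look like this (just the settings above under [mysqld]; merge them with what you already have):
[mysqld]
innodb_read_io_threads = 64
innodb_write_io_threads = 16
innodb_log_buffer_size = 256M
query_cache_size = 0
innodb_buffer_pool_dump_at_shutdown = 1
innodb_buffer_pool_load_at_startup = 1
innodb_purge_threads = 4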
GIVE IT A TRY !!!
What is up with FROM part JOIN model ON 1=1? This is the same as FROM part, model, which is a cartesian join and will result in a very large number of rows. Is that join supposed to be like that?
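For reference, the ON 1=1 spelling and an explicit cross join are interchangeable; both of these produce the same cartesian product (the selected columns are just for illustration):
SELECT part.id, model.id FROM part INNER JOIN model ON 1=1;
SELECT part.id, model.id FROM part CROSS JOIN model;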
You will likely help us help you if you provide details about the tables involved. Please "script" the definition of the tables, along with any indexes defined on those tables.
This sounds like a classic case of parameter sniffing resulting in good plan/bad plan choices for various scenarios in your data.
You may be able to get more reliable performance by making SQL Server cache different plans for different scenarios by using sp_executesql, as in the following example:
CREATE PROCEDURE [dbo].[create_grid_materials2]
(
      @partlistid bigint
    , @pid bigint
    , @masterid bigint
)
AS
BEGIN
    DECLARE @cmd NVARCHAR(MAX);

    -- @partlistid and @masterid are inlined as literals so that each
    -- combination of their values gets its own cached plan; @pid remains
    -- a true parameter of the dynamic statement.
    SET @cmd = N'
    INSERT INTO material (partid, personid, modelID)
    SELECT
          partid   = part.id
        , personid = @pid
        , modelid  = model.id
    FROM part
    INNER JOIN model ON 1=1
    WHERE (
        model.masterid = ' + CONVERT(NVARCHAR(50), @masterid) + N'
        AND model.modelSetID IS NULL
        AND part.partlistid = ' + CONVERT(NVARCHAR(50), @partlistid) + N'
        AND (
            part.partType = 100
            OR part.partType = 120
            OR part.partType = 130
        )
    )
    AND NOT EXISTS (
        SELECT 1
        FROM material AS a1
        WHERE a1.partid = part.id
        AND a1.personid = @pid
        AND a1.modelid = model.id
    )';

    -- sp_executesql requires an NVARCHAR parameter list, and the declared
    -- type must match the procedure parameter (bigint, not int).
    DECLARE @Params NVARCHAR(200);
    SET @Params = N'@pid BIGINT';

    EXEC sys.sp_executesql @cmd
        , @Params
        , @pid = @pid;
END
The above code will cause a new plan to be generated for each combination of @partlistid and @masterid.
The presumption here is some combinations of those two variables lead to a very small number of rows, whereas some combinations lead to a very large number of rows.
Forcing a plan for each combination allows SQL Server to generate more efficient plans for each. I've explicitly not included @pid since you probably want to try it with a fairly small number of combinations first; adding a third variable to the mix would make for an exponentially larger number of possible plans.
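A hypothetical invocation (the ID values are made up for illustration) would look like this; each distinct (@partlistid, @masterid) pair produces a different dynamic SQL text and therefore its own cached plan:
EXEC dbo.create_grid_materials2 @partlistid = 42, @pid = 1001, @masterid = 7;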
Best Answer
There is no built-in functionality for this.
The stored procedure route will work, but you might also want to look into an ETL package (like SSIS) to do this as it may be easier to maintain.
You could also look into partitioning on CreateDate for all your tables in order to efficiently drop older data, though that would still require some custom code to rotate the partitions. The tables could, however, share a common partition scheme and function if the DDL for CreateDate is the same in each, as sketched below.
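A minimal sketch of such a shared setup, with hypothetical names and boundary dates (the rotation code would add new boundaries and drop the oldest as time moves on):
CREATE PARTITION FUNCTION pf_CreateDateMonthly (datetime)
AS RANGE RIGHT FOR VALUES ('20240101', '20240201', '20240301');
CREATE PARTITION SCHEME ps_CreateDateMonthly
AS PARTITION pf_CreateDateMonthly ALL TO ([PRIMARY]);
Any table whose CreateDate has that same type can then be created ON ps_CreateDateMonthly (CreateDate).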
If you were to upgrade, you could also look at using Temporal Tables or Stretch Database from SQL Server 2016, though those aren't out-of-the-box auto-archive features, strictly speaking.
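As a rough sketch of the temporal route (table and column names here are hypothetical, and this requires SQL Server 2016 or later):
CREATE TABLE dbo.EventsDemo
(
    Id int IDENTITY PRIMARY KEY,
    CreateDate datetime2 NOT NULL,
    ValidFrom datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo datetime2 GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.EventsDemoHistory));
SQL Server then moves superseded row versions into the history table on its own, which approximates archiving, but purging the history still takes a retention job of your own.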