Mysql – Database design that handles data growth with time

database-designmaintenanceMySQL

I have a big table that stores video rental data with the following columns:

id, title, language, duration, owner, remarks, closing_date_for_loan

Assuming every day there are thousands of data inserts to this table, then within a year, I could have a million rows of data. The search on the data record involves range query on several columns and will always contain a WHERE closing_date_for_loan > NOW() condition.

To maintain this table, I could perform a query to transfer data with closing_date_for_loan < NOW() out of this big table periodically so that the table won't get too big causing excessive query times. I am looking for a more automated way of handling such data growth similar to how a log rotation works. Any ideas?

Additional Note:

I have tested a few composite index and the query time can range from a few seconds to 50s if the row gets to 5 million. Range queries can be hard to optimize, so I am looking for other ways like keeping the table to a manageable size.

Best Answer

Your first tactic should be good indexing. Somewhere down the road, you'll want to consider partitioning.

You always want to think about sargable expressions.

Related Solutions

Sql-server – Database structure, relational or data warehouse

It seems you want to aggregate location based statistics over time for rainfall. A database structure like the one below would let you do that. The 'data source' could be just a filename, or some indication as to where it came from.

create table DimDataSource (
       DataSourceID      int identity (1,1) not null
       DataSourceDesc    nvarchar (100)  -- May need unicode for file names
)
go

alter table DimDataSource
  add constraint PK_DataSource
      primary key clustered (DataSourceID)
go

create table DimLocation (
       LocationID        int identity (1,1) not null
       LocationDesc      varchar (50)
)
go

alter table DimLocation
  add constraint PK_Location
      primary key clusterd (LocationID)
go

create table DimDate (
       DateID           smalldatetime not null  -- 'Date' is a reserved word
      ,MonthID          int not null
      ,MonthDesc        varchar (15)
      ,QuarterID        int not null
      ,QuarterDesc      varchar (15)
      ,YearID
)
go

alter table DimDate
  add constraint PK_Date
      primary key clustered (DateID)
go

create table DimTime (
       TimeID           time not null  -- 'Time' is a reserved word
      ,Hour             int not null
)
go

alter table DimTime
  add constraint PK_Time
      primary key clustered (TimeID)
go


-- If the table is <50GB, don't bother with partitioning, but put a clustered
-- index on DateID or LocationID and DateID, depending on how you normally expect
-- to query the data.

create table FactRainfall (
       RainfallID        int identity (1,1) not null -- May need a wider type if >4B rows.
                                                     -- SSAS likes an identity column for
                                                     -- incremental loads
      ,DataSourceID      int not null
      ,LocationID        int not null
      ,DateID            smalldatetime not null
      ,TimeID            time not null
      ,Rainfall          float
)
go

-- Add foreign keys as necessary

Populate the dimensions with the appropriate list of locations, date ranges, time of day to the right grain and one data source record per file. This table will also allow you to put a cube over the top, or can be flattened with a view, which will help people using tools like Excel or stats packages to get and use the data.

How to design this database

Indexes don't have to be unique. Primary keys do.

The purpose of a Primary Key is to uniquely identify a single row of data. If you don't have something that is naturally unique then add a column such as an identity column and define it as the primary key. It's standard practice, and will help you later if you need to update a row.

If the table is being queried for analysis or reporting a single table is fine. Do some analysis on the types of queries you are performing and add indexes on relevant key columns. Good indexes will significantly improve query performance.

For example if you are looking for results for a single store, by adding an index on the store id, you could exclude the other 299 stores from the select. This reduces IO and speeds up the query. If you have years worth of data but are only looking for things that happened in the last week then adding an index to a date column may be a big help.

Look at your queries and see if there fields you are regularly filtering on. Start with those.

Ensure the db has updated statistics for the table and see if performance improves.

The results you get will vary depending upon what you are trying to do. If you are trying to aggregate records (i.e total sales for each store) then the query may still have to read the whole table.

Below I've added a script with a basic test example.

USE [YOURDBNAME]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

-- Create Test Table. 

CREATE TABLE [dbo].[rewards](
    [id] [int] IDENTITY(1,1) NOT NULL,
    [fname] [varchar](50) NULL,
    [lname] [varchar](50) NULL,
    [tstamp] [datetime] NULL,
    [card_id] [int] NULL

) ON [PRIMARY] 

-- Add Primary key. This can be done at table creation. Or added after as shown below.

ALTER TABLE rewards ADD CONSTRAINT pk_rewards PRIMARY KEY NONCLUSTERED  ([id])

-- Add an index on columns that your query will use.

CREATE CLUSTERED INDEX [IDX01] ON [dbo].[rewards]
(   [card_id] ASC,
    [fname] ASC,
    [lname] ASC 
) ON [PRIMARY]

-- Create some test data. 

INSERT INTO [dbo].[rewards] VALUES ('Allan2', 'Aadvark2', getdate(), 2345) ;
INSERT INTO [dbo].[rewards] VALUES ('Billy2', 'Babo0n', getdate(), 2356) ;
INSERT INTO [dbo].[rewards] VALUES ('Chester2', 'Cheetah', getdate(), 2367) ;
INSERT INTO [dbo].[rewards] VALUES ('Doodoo', 'Dog', getdate(), 2376) ;
INSERT INTO [dbo].[rewards] VALUES ('Esme', 'Elephant', getdate(), 1235) ;
INSERT INTO [dbo].[rewards] VALUES ('Freddy', 'Fox', getdate(), 1239) ;

GO

-- run the following query repeatedly to generate volume. 
INSERT INTO [dbo].[rewards]
           ([fname]
           ,[lname]
           ,[tstamp]
           ,[card_id])     
     select [fname]
           ,[lname]
           ,[tstamp]
           ,[card_id]
           from [dbo].[rewards] ;

GO 

SET STATISTICS TIME ON
GO 

-- Test query. This shows that the card 1234 is use by 2 different customers. 

Select r.card_id, count(*) 
from (
    select card_id, fname, lname
    from [dbo].[rewards] 
    group by card_id, fname, lname ) r

group by r.card_id
having count(*) > 1 ;

GO 

SET STATISTICS TIME OFF

GO 

-- I just tested this with 3,538,944 rows. And I get SQL Server Execution Times:  CPU time = 1904 ms,  elapsed time = 490 ms.
-- On repeated execution I can reliably get this under .5 of a second.

Your results may vary depending upon machine setup and load, But I think you should resonably expect to get your query down to ~1 sec or less.

Best Answer

Related Solutions

Sql-server – Database structure, relational or data warehouse

How to design this database

Related Question