How to design this database

database-design

So I have a ton of transaction logs in pipe-separated format spread across a few dozen files. Concatenated into one file, it's around 2.4 million lines.

Each transaction has a store number and a company name and ~30 fields total. Right now, I have all 2.4m records in ONE table with no indexes. A simple query takes like 30 seconds to run and it's honestly unacceptable to me.

The thing is, there are no unique pieces of data that I can create primary keys with. Each store can have multiple transactions, and each transaction number can have multiple lines (multiple products, returns, etc.). There are 300+ stores, so a table for each store doesn't really seem realistic since I would have to join all of them just to get the data I need. What can I do?

Best Answer

Indexes don't have to be unique. Primary keys do.

The purpose of a Primary Key is to uniquely identify a single row of data. If you don't have something that is naturally unique then add a column such as an identity column and define it as the primary key. It's standard practice, and will help you later if you need to update a row.
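For example (just a sketch, assuming the imported data sits in a single table called dbo.transactions; the table, column, and constraint names below are placeholders for your own):

-- Add a surrogate identity column; SQL Server back-fills values for the existing rows.
ALTER TABLE dbo.transactions ADD trans_row_id INT IDENTITY(1,1) NOT NULL;

-- Make it the primary key. Keeping it nonclustered leaves the clustered index free
-- for the columns you actually filter on, as in the test script further down.
ALTER TABLE dbo.transactions ADD CONSTRAINT pk_transactions PRIMARY KEY NONCLUSTERED (trans_row_id);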

If the table is being queried for analysis or reporting, a single table is fine. Do some analysis on the types of queries you are running and add indexes on the relevant key columns. Good indexes will significantly improve query performance.

For example, if you are looking for results for a single store, an index on the store ID lets the query skip the rows for all the other stores. This reduces IO and speeds up the query. If you have years' worth of data but are only looking at things that happened in the last week, then an index on a date column may be a big help.
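As a sketch (again assuming a placeholder dbo.transactions table, with store_number and trans_date standing in for whatever your columns are called):

-- Index the columns you filter on most often.
CREATE NONCLUSTERED INDEX IX_transactions_store ON dbo.transactions (store_number);
CREATE NONCLUSTERED INDEX IX_transactions_date  ON dbo.transactions (trans_date);

-- A filter like this can now use an index seek rather than reading every row
-- (the exact plan depends on how selective the filter is).
SELECT *
FROM dbo.transactions
WHERE store_number = 117;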

Look at your queries and see if there are fields you are regularly filtering on. Start with those.

Ensure the database has up-to-date statistics for the table and see if performance improves.
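For example (dbo.transactions is again a placeholder name):

-- Refresh statistics on the one big table...
UPDATE STATISTICS dbo.transactions;

-- ...or on every table in the database.
EXEC sp_updatestats;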

The results you get will vary depending on what you are trying to do. If you are aggregating records (e.g. total sales for each store) then the query may still have to read the whole table.
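For instance, a per-store total like this still has to touch every row, because every row contributes to some store's total (placeholder names again):

SELECT store_number, SUM(sale_amount) AS total_sales
FROM dbo.transactions
GROUP BY store_number;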

Below I've added a script with a basic test example.

USE [YOURDBNAME]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

-- Create Test Table. 

CREATE TABLE [dbo].[rewards](
    [id] [int] IDENTITY(1,1) NOT NULL,
    [fname] [varchar](50) NULL,
    [lname] [varchar](50) NULL,
    [tstamp] [datetime] NULL,
    [card_id] [int] NULL

) ON [PRIMARY] 

-- Add Primary key. This can be done at table creation. Or added after as shown below.

ALTER TABLE rewards ADD CONSTRAINT pk_rewards PRIMARY KEY NONCLUSTERED  ([id])

-- Add an index on columns that your query will use.

CREATE CLUSTERED INDEX [IDX01] ON [dbo].[rewards]
(   [card_id] ASC,
    [fname] ASC,
    [lname] ASC 
) ON [PRIMARY]

-- Create some test data. 

INSERT INTO [dbo].[rewards] VALUES ('Allan2', 'Aadvark2', getdate(), 2345) ;
INSERT INTO [dbo].[rewards] VALUES ('Billy2', 'Babo0n', getdate(), 2356) ;
INSERT INTO [dbo].[rewards] VALUES ('Chester2', 'Cheetah', getdate(), 2367) ;
INSERT INTO [dbo].[rewards] VALUES ('Doodoo', 'Dog', getdate(), 2376) ;
INSERT INTO [dbo].[rewards] VALUES ('Esme', 'Elephant', getdate(), 1235) ;
INSERT INTO [dbo].[rewards] VALUES ('Freddy', 'Fox', getdate(), 1239) ;

GO

-- run the following query repeatedly to generate volume. 
INSERT INTO [dbo].[rewards]
           ([fname]
           ,[lname]
           ,[tstamp]
           ,[card_id])     
     select [fname]
           ,[lname]
           ,[tstamp]
           ,[card_id]
           from [dbo].[rewards] ;
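-- (In SSMS or sqlcmd you can also change the GO below to e.g. "GO 20" to run this batch
--  20 times in a row, doubling the row count each time.)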

GO 

SET STATISTICS TIME ON
GO 

-- Test query. Finds any card_id used by more than one distinct customer (fname/lname pair).
-- With the seed rows above no card qualifies; the point is to time the query over volume.

Select r.card_id, count(*) 
from (
    select card_id, fname, lname
    from [dbo].[rewards] 
    group by card_id, fname, lname ) r

group by r.card_id
having count(*) > 1 ;

GO 

SET STATISTICS TIME OFF

GO 

-- I just tested this with 3,538,944 rows and got SQL Server Execution Times: CPU time = 1904 ms, elapsed time = 490 ms.
-- On repeated execution I can reliably get this under 0.5 seconds.

Your results will vary depending upon machine setup and load, but I think you can reasonably expect to get your query down to ~1 second or less.
