Assuming you are using SQL Server, you could adapt this to work:
CREATE TABLE [dbo].[IP2Location](
[ID] [int] IDENTITY(1,1) NOT NULL CONSTRAINT PK_IP2Location PRIMARY KEY CLUSTERED,
[ip] [bigint] NULL,
[country_code] [varchar](2) NULL,
[country_name] [varchar](255) NULL,
[region_name] [varchar](255) NULL,
[city_name] [varchar](255) NULL,
[latitude] [float] NULL,
[longitude] [float] NULL,
[zip_code] [varchar](255) NULL,
[time_zone] [varchar](255) NULL,
[isp] [varchar](255) NULL,
[domain] [varchar](255) NULL,
[net_speed] [varchar](255) NULL,
[idd_code] [varchar](255) NULL,
[area_code] [varchar](255) NULL,
[IP2LocationTimeStamp] [rowversion] NOT NULL -- rowversion is the current name for the deprecated timestamp type
);
/*
Create a non-clustered index on the numeric IP address to speed lookups
I indexed the country_code field here as well since our lookups use it.
I INCLUDED country_name, region_name, and time_zone since that is what we
typically request in a SELECT statement from this table.
*/
CREATE NONCLUSTERED INDEX [IX_IP2Location_IP2LocationServiceQuery] ON [dbo].[IP2Location]
(
[ip] ASC,
[country_code] ASC
)
INCLUDE ( [country_name],
[region_name],
[time_zone])
GO
/*
Create BIGINT versions of the dotted-ip field ip_address
*/
/* PARSENAME splits a four-part dotted name; part 4 is the leftmost octet.
   Assumes the table also holds the original dotted-decimal column ip_address. */
UPDATE IP2Location SET ip =
      (CAST(PARSENAME(ip_address, 4) AS BIGINT) * 0x01000000)
    + (CAST(PARSENAME(ip_address, 3) AS BIGINT) * 0x00010000)
    + (CAST(PARSENAME(ip_address, 2) AS BIGINT) * 0x00000100)
    +  CAST(PARSENAME(ip_address, 1) AS BIGINT);
GO
/*
Table-valued function returning location info for the given numeric IP address.
For example:
SELECT country_code, country_name, region_name, city_name, latitude, longitude, time_zone
FROM dbo.GetLocationFromIPv4(16777215);
*/
CREATE FUNCTION [dbo].[GetLocationFromIPv4]
(
@IP bigint
)
RETURNS @IPLoc TABLE
(
country_code varchar(2)
, country_name varchar(255)
, region_name varchar(255)
, city_name varchar(255)
, latitude float
, longitude float
, time_zone varchar(255)
, isp varchar(255)
, domain varchar(255)
, net_speed varchar(255)
, idd_code varchar(255)
, area_code varchar(255)
)
AS
BEGIN
/*
TVF to return (typically) a single record from IP2Location as quickly as possible
*/
INSERT INTO @IPLoc
SELECT country_code
, country_name
, region_name
, city_name
, latitude
, longitude
, time_zone
, isp
, domain
, net_speed
, idd_code
, area_code
FROM IP2Location
WHERE ip = @IP;
RETURN
END
GO
/*
Table-valued function returning location info for the given dotted-decimal IP address.
For example:
SELECT country_code, country_name, region_name, city_name, latitude, longitude, time_zone
FROM dbo.GetLocationFromIPv4String('192.168.0.1');
*/
CREATE FUNCTION [dbo].[GetLocationFromIPv4String]
(
@IP nvarchar(15)
)
RETURNS @IPLoc TABLE
(
country_code varchar(2)
, country_name varchar(255)
, region_name varchar(255)
, city_name varchar(255)
, latitude float
, longitude float
, time_zone varchar(255)
, isp varchar(255)
, domain varchar(255)
, net_speed varchar(255)
, idd_code varchar(255)
, area_code varchar(255)
)
AS
BEGIN
/*
TVF to return (typically) a single record from IP2Location as quickly as possible
*/
DECLARE @IPInt64 bigint;
DECLARE @n1 bigint;
DECLARE @n2 bigint;
DECLARE @n3 bigint;
DECLARE @n4 bigint;
-- PARSENAME splits a four-part dotted name; part 4 is the leftmost octet.
set @n1 = CAST(PARSENAME(@IP, 4) AS bigint);
set @n2 = CAST(PARSENAME(@IP, 3) AS bigint);
set @n3 = CAST(PARSENAME(@IP, 2) AS bigint);
set @n4 = CAST(PARSENAME(@IP, 1) AS bigint);
SET @IPInt64 = (@n1 * 0x01000000) + (@n2 * 0x00010000) + (@n3 * 0x00000100) + @n4;
INSERT INTO @IPLoc
SELECT TOP (1) country_code
, country_name
, region_name
, city_name
, latitude
, longitude
, time_zone
, isp
, domain
, net_speed
, idd_code
, area_code
FROM IP2Location
WHERE ip = @IPInt64;
RETURN
END
GO
I can think of three solutions: EAV, XML, and Sparse Columns. The last is vendor-specific and may not be useful to you.
Whichever method you choose, you may wish to consider storing the original request data in a raw format, in a table or flat file. It will make it easy to try new ways of storing the data, allow you to reload data if you discover a mistake with the way you're parsing your requests, and offer opportunities for parsing the API requests using batch processing or "big data" tools if you find that your data warehouse isn't able to efficiently deal with the data.
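As a minimal sketch of that raw-capture idea (the table and column names here are my own invention, not from the question):

```sql
-- Hypothetical raw-capture table: one row per API request, payload untouched.
CREATE TABLE dbo.ApiRequestRaw (
    RequestID   bigint IDENTITY(1,1) NOT NULL
                CONSTRAINT PK_ApiRequestRaw PRIMARY KEY CLUSTERED,
    ReceivedAt  datetime2(3) NOT NULL
                CONSTRAINT DF_ApiRequestRaw_ReceivedAt DEFAULT (SYSUTCDATETIME()),
    RawPayload  nvarchar(max) NOT NULL  -- the request exactly as received
);
```

If you later change how you parse requests, you can truncate the derived tables and reload them entirely from this one.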
EAV considerations
EAV/KVS, as you've described it above, is likely to be the most straightforward implementation.
Unfortunately it's also going to be very expensive: to get any sort of efficient queries on commonly used keys you'll need an index on the key column, which could become very fragmented, and querying for particular values would be extremely expensive.
You may be able to reduce the cost of indexing or index scans by supporting your EAV store with materialised views (many vendors support this) for querying keys or values that you care about.
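A hedged sketch of what that EAV layout might look like, with the secondary index that makes key/value lookups possible (names are illustrative):

```sql
-- Illustrative EAV layout; RequestID references whatever identifies a request.
CREATE TABLE dbo.ApiRequestEAV (
    RequestID  bigint        NOT NULL,
    AttrKey    varchar(100)  NOT NULL,
    AttrValue  nvarchar(400) NULL,
    CONSTRAINT PK_ApiRequestEAV PRIMARY KEY CLUSTERED (RequestID, AttrKey)
);
-- Supports "find all requests with this key (and value)" at the cost of
-- maintaining a second, insert-heavy index.
CREATE NONCLUSTERED INDEX IX_ApiRequestEAV_KeyValue
    ON dbo.ApiRequestEAV (AttrKey, AttrValue);
```

The clustered key makes "all attributes of one request" cheap; it's the reverse direction (all requests with one attribute) that drives the indexing cost discussed above.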
XML
Most enterprise database systems offer very mature XML handling, including validation, indexing, and sophisticated querying.
Loading the API request into the database as XML would provide one tuple per request, which logically might be a bit more palatable to you than having an unknown number of rows in an EAV table.
Whether this is efficient would depend a lot on your RDBMS vendor and your implementation.
The biggest downside is that this is probably the most complicated way of managing the data, short of string manipulation of the original request!
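To make the XML option concrete, a sketch in SQL Server terms (the element and attribute names are invented for illustration):

```sql
-- Illustrative: store each request as XML and index it for path queries.
CREATE TABLE dbo.ApiRequestXml (
    RequestID bigint IDENTITY(1,1) NOT NULL
              CONSTRAINT PK_ApiRequestXml PRIMARY KEY CLUSTERED,
    Payload   xml NOT NULL
);
CREATE PRIMARY XML INDEX PXML_ApiRequestXml_Payload
    ON dbo.ApiRequestXml (Payload);

-- Pull one parameter out of every request with an XQuery path expression.
SELECT RequestID,
       Payload.value('(/request/param[@name="user_id"]/@value)[1]',
                     'varchar(50)') AS user_id
FROM dbo.ApiRequestXml;
```

Other vendors have equivalents (Oracle XMLType, PostgreSQL xml/xpath), but the indexing and query syntax differ considerably.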
Sparse Columns / traditional tables
It's possible that you could load your data into a traditional table structure, with one column per key.
SQL Server's Sparse Columns feature is a great alternative to an EAV store. A table with Sparse Columns behaves much the same as a normal table, except that it can have up to 30,000 columns, and NULL values in sparse columns consume no space in the table.
Combining them with Filtered Indexes (another SQL Server specific feature) can provide an extremely efficient alternative to an EAV store if you're frequently querying for a couple of specific columns and/or values.
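A sketch of the combination, assuming a handful of hypothetical keys promoted to columns:

```sql
-- Illustrative sparse-column table: NULLs in SPARSE columns consume no space,
-- so most rows can leave most columns empty at little cost.
CREATE TABLE dbo.ApiRequestWide (
    RequestID  bigint IDENTITY(1,1) NOT NULL
               CONSTRAINT PK_ApiRequestWide PRIMARY KEY CLUSTERED,
    user_id    varchar(50)  SPARSE NULL,
    campaign   varchar(100) SPARSE NULL,
    referrer   varchar(400) SPARSE NULL
);
-- Filtered index: only rows that actually have a campaign value are indexed,
-- keeping the index tiny relative to the table.
CREATE NONCLUSTERED INDEX IX_ApiRequestWide_Campaign
    ON dbo.ApiRequestWide (campaign)
    WHERE campaign IS NOT NULL;
```

A query like `WHERE campaign = 'spring_sale'` can then be satisfied from that small filtered index rather than a scan.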
Using a traditional table with other vendors may be viable - IBM supports over 700 columns per table and Oracle about 1000, and features such as compression or Oracle's treatment of trailing nulls might mean that you can store your API data fairly efficiently.
The obvious downside of this approach is that as you added new keys to your API, you'd need to adjust your schema accordingly.
Best Answer
You are mixing up three orthogonal concepts: data model (star schema), workload characteristics (OLTP vs. OLAP), and physical data organisation (columnar).
Your data model has no bearing on whether column-organised tables are appropriate for you; however, data organisation must reflect the nature of your workload (i.e. queries). For example,
SELECT * FROM...
cannot possibly perform well in a columnar database.
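To illustrate with SQL Server's columnstore feature (the `FactSales` table here is hypothetical):

```sql
-- Illustrative: column-organise a fact table (SQL Server 2014+).
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactSales ON dbo.FactSales;

-- Plays to columnar strengths: touches two columns, aggregates many rows,
-- so only two column segments need to be read.
SELECT ProductKey, SUM(SalesAmount) AS TotalSales
FROM dbo.FactSales
GROUP BY ProductKey;

-- Works against the organisation: every column of each qualifying row
-- must be fetched and stitched back together into a full tuple.
SELECT * FROM dbo.FactSales WHERE OrderDateKey = 20240101;
```

The same star schema serves both queries; it's the shape of the workload, not the data model, that decides whether columnar storage helps.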