I can think of three solutions - EAV, XML, and Sparse Columns. The last is vendor-specific and may not be useful to you.
Whichever method you choose, you may wish to consider storing the original request data in a raw format, in a table or flat file. It will make it easy to try new ways of storing the data, allow you to reload data if you discover a mistake with the way you're parsing your requests, and offer opportunities for parsing the API requests using batch processing or "big data" tools if you find that your data warehouse isn't able to efficiently deal with the data.
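As a rough sketch of that idea (the table and column names here are my own, purely illustrative), a raw staging table can be as simple as:
CREATE TABLE dbo.ApiRequestRaw
(
    RequestID  BIGINT NOT NULL IDENTITY(1, 1)
        CONSTRAINT PK_ApiRequestRaw PRIMARY KEY CLUSTERED,
    ReceivedAt DATETIME2 NOT NULL
        CONSTRAINT DF_ApiRequestRaw_ReceivedAt DEFAULT (SYSDATETIME()),
    RawRequest NVARCHAR(MAX) NOT NULL -- the request exactly as received, unparsed
);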
EAV considerations
EAV/KVS, as you've described it above, is likely to be the most straightforward implementation.
Unfortunately it's also going to be very expensive: to get any sort of efficient query on commonly used keys you'll need an index on the key column, which can become heavily fragmented, and without one, querying for particular keys means scanning the entire store.
You may be able to reduce the cost of indexing or index scans by supporting your EAV store with materialised views (many vendors support this) for querying keys or values that you care about.
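As a sketch of the EAV shape being discussed (names are mine, not from the original), the store and its supporting index might look like:
CREATE TABLE dbo.ApiRequestValue
(
    RequestID      BIGINT NOT NULL,        -- points back to the raw request
    AttributeName  VARCHAR(100) NOT NULL,  -- the key from the API request
    AttributeValue NVARCHAR(4000) NULL,
    CONSTRAINT PK_ApiRequestValue
        PRIMARY KEY CLUSTERED (RequestID, AttributeName)
);
-- This index is what makes "find all requests with key X" workable,
-- and it is also the index that is prone to fragmentation:
CREATE NONCLUSTERED INDEX IX_ApiRequestValue_AttributeName
    ON dbo.ApiRequestValue (AttributeName)
    INCLUDE (AttributeValue);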
XML
Most enterprise database systems offer very mature XML handling, including validation, indexing, and sophisticated querying.
Loading the API request into the database as XML would provide one tuple per request, which logically might be a bit more palatable to you than having an unknown number of rows in an EAV table.
Whether this is efficient would depend a lot on your RDBMS vendor and your implementation.
The biggest downside is complexity: of the options here, XML handling is probably the only one that is more complicated to work with than doing string manipulation on the original request yourself!
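On SQL Server, for instance, a minimal sketch might look like the following (the element names are invented for illustration):
CREATE TABLE dbo.ApiRequestXml
(
    RequestID   BIGINT NOT NULL
        CONSTRAINT PK_ApiRequestXml PRIMARY KEY CLUSTERED,
    RequestData XML NOT NULL
);
-- A primary XML index shreds the XML up front to speed up querying:
CREATE PRIMARY XML INDEX PXML_ApiRequestXml
    ON dbo.ApiRequestXml (RequestData);

-- Extract a typed value from each request that matches an XQuery predicate:
SELECT RequestID,
       RequestData.value('(/Request/CustomerID)[1]', 'INT') AS CustomerID
FROM dbo.ApiRequestXml
WHERE RequestData.exist('/Request[Status="active"]') = 1;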
Sparse Columns / traditional tables
It's possible that you could load your data into a traditional table structure, with one column per key.
SQL Server's Sparse Columns feature is a great alternative to an EAV store. A table with Sparse Columns behaves much the same as a normal table, except that it can have up to 30,000 columns, and NULL values in sparse columns consume no space in the table.
Combining them with Filtered Indexes (another SQL Server specific feature) can provide an extremely efficient alternative to an EAV store if you're frequently querying for a couple of specific columns and/or values.
Using a traditional table with other vendors may be viable - IBM supports over 700 columns per table and Oracle about 1000, and features such as compression or Oracle's treatment of trailing nulls might mean that you can store your API data fairly efficiently.
The obvious downside of this approach is that as you added new keys to your API, you'd need to adjust your schema accordingly.
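A sketch of the SQL Server variant, with invented column names, might be:
CREATE TABLE dbo.ApiRequestSparse
(
    RequestID  BIGINT NOT NULL
        CONSTRAINT PK_ApiRequestSparse PRIMARY KEY CLUSTERED,
    CustomerID INT SPARSE NULL,       -- NULLs in sparse columns consume no space
    OrderID    INT SPARSE NULL,
    Remarks    NVARCHAR(500) SPARSE NULL
);
-- A filtered index contains only the rows that actually carry a CustomerID,
-- so it stays small even when the column is rarely populated:
CREATE NONCLUSTERED INDEX IX_ApiRequestSparse_CustomerID
    ON dbo.ApiRequestSparse (CustomerID)
    WHERE CustomerID IS NOT NULL;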
Here is a structure that is at least very close.
There is no programmatic way to get the parameters (unfortunately); you need to format them into XML to pass in.
The Login that initiates a SQL Server Agent job seems to be logged only in the message column of msdb.dbo.sysjobhistory, for step_id = 0. That value can be extracted, but not during the execution of the job.
You get the ObjectID to pass in from @@PROCID.
Below is the schema (2 tables) and stored procedures (3 procs). The concept is to separate "init", "in process", and "completed (success or error)" logging. This allows certain columns to be set only at the appropriate time (e.g. DatabaseID, StartedAt, etc. only need to be set at the very beginning). Separating the types of event also makes it easier to have event-specific logic (yes, you can have that even in a single proc, but then you still have all of the input parameters when each event type only needs a subset of them).
A "process" record gets updated via its IDENTITY (and clustered PK) value. This is another benefit of separating the event types: it makes it much easier to capture SCOPE_IDENTITY() and pass it back to be used by the two other logging stored procedures. If a stored procedure fails and doesn't reach the CATCH block, there is no need to worry about accidentally updating that process record, as the next time any stored procedure (that is being logged) starts, it will get a new / unique ID to update.
Cleanup (optional) and Schema
/* -- optional cleanup
DROP PROCEDURE [dbo].[ProcessLogDemo];
DROP PROCEDURE [Logging].[ProcessLog_Log];
DROP PROCEDURE [Logging].[ProcessLog_Start];
DROP PROCEDURE [Logging].[ProcessLog_Stop];
DROP TABLE [Logging].[ProcessLog];
DROP TABLE Logging.[Status];
DROP SCHEMA [Logging];
*/
CREATE SCHEMA [Logging];
GO
Tables and Indexes
CREATE TABLE Logging.[Status]
(
[StatusID] TINYINT NOT NULL
CONSTRAINT [PK_Status] PRIMARY KEY CLUSTERED,
[StatusName] VARCHAR(50) NOT NULL
);
CREATE TABLE [Logging].[ProcessLog]
(
ProcessLogID INT NOT NULL IDENTITY(-2147483648, 1) -- start at INT min value
CONSTRAINT [PK_ProcessLog] PRIMARY KEY CLUSTERED,
DatabaseID INT NOT NULL,
ObjectID INT NULL, -- NULL = ad hoc query
SessionID SMALLINT NOT NULL
CONSTRAINT [DF_ProcessLog_SessionID] DEFAULT (@@SPID),
Step TINYINT NOT NULL, -- if you have more than 255 steps, consult psychiatrist
StatusID TINYINT NOT NULL
CONSTRAINT [FK_ProcessLog_Status]
FOREIGN KEY REFERENCES [Logging].[Status]([StatusID]),
Remarks NVARCHAR(MAX) NULL, -- or maybe VARCHAR(MAX)?
Params XML NULL,
RowsSelected INT NULL,
RowsInserted INT NULL,
RowsUpdated INT NULL,
RowsDeleted INT NULL,
StartedBy [sysname] NULL,
StartedAt DATETIME2 NOT NULL
CONSTRAINT [DF_ProcessLog_StartedAt] DEFAULT (SYSDATETIME()),
UpdatedAt DATETIME2 NULL, -- use to show progress / "heartbeat"
StoppedAt DATETIME2 NULL
);
GO
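The original doesn't show the rows for Logging.Status, but the procedures below reference StatusID values 1 through 4, so the lookup data would be something along these lines (the names are my guesses, inferred from the comments in the procedures):
INSERT INTO Logging.[Status] ([StatusID], [StatusName])
VALUES (1, 'Started'),
       (2, 'In process'),
       (3, 'Success'),
       (4, 'Fail');
GO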
Stored Procedure to call at the very beginning of "logged" stored procedures
CREATE PROCEDURE [Logging].[ProcessLog_Start]
(
@DatabaseID INT,
@ObjectID INT,
@Params XML,
@ProcessLogID INT = NULL OUTPUT
)
AS
SET NOCOUNT ON;
-- First, capture the MAX "instance_id" from sysjobhistory if this process is a SQL
-- Server Agent job (use later to get the "invoked by" Login), else grab the Login.
DECLARE @StartedBy [sysname];
IF (EXISTS(
SELECT *
FROM sys.dm_exec_sessions sdes
WHERE sdes.[session_id] = @@SPID
AND sdes.[program_name] LIKE N'SQLAgent - TSQL JobStep (%'))
BEGIN
DECLARE @JobID UNIQUEIDENTIFIER;
SELECT @JobID = CONVERT(UNIQUEIDENTIFIER,
CONVERT(BINARY(16),
SUBSTRING(sdes.[program_name],
CHARINDEX(N'(Job 0x', sdes.[program_name]) + 5,
34), 1
)
)
FROM sys.dm_exec_sessions sdes
WHERE sdes.[session_id] = @@SPID;
--SELECT @JobID;
SELECT @StartedBy = N'sysjobhistory.instance_id: '
+ CONVERT(NVARCHAR(20), MAX(sjh.[instance_id]))
FROM msdb.dbo.sysjobhistory sjh
WHERE sjh.[job_id] = @JobID;
END;
ELSE
BEGIN
SET @StartedBy = ORIGINAL_LOGIN();
END;
-- Now it should be safe to create a new entry
INSERT INTO [Logging].[ProcessLog] ([DatabaseID], [ObjectID], [Step], [StatusID],
[Params], [StartedBy])
VALUES (@DatabaseID, @ObjectID, 0, 1, @Params, @StartedBy);
SET @ProcessLogID = SCOPE_IDENTITY();
GO
Stored Procedure to call after all but the final step
CREATE PROCEDURE [Logging].[ProcessLog_Log]
(
@ProcessLogID INT,
@Step TINYINT,
@RowsSelected INT = NULL,
@RowsInserted INT = NULL,
@RowsUpdated INT = NULL,
@RowsDeleted INT = NULL
)
AS
SET NOCOUNT ON;
UPDATE pl
SET pl.[StatusID] = 2, -- In process
pl.[Step] = @Step,
pl.[UpdatedAt] = SYSDATETIME(),
pl.[RowsSelected] = ISNULL(@RowsSelected, pl.[RowsSelected]),
pl.[RowsInserted] = ISNULL(@RowsInserted, pl.[RowsInserted]),
pl.[RowsUpdated] = ISNULL(@RowsUpdated, pl.[RowsUpdated]),
pl.[RowsDeleted] = ISNULL(@RowsDeleted, pl.[RowsDeleted])
FROM [Logging].[ProcessLog] pl
WHERE pl.[ProcessLogID] = @ProcessLogID;
IF (@@ROWCOUNT = 0)
BEGIN
RAISERROR('No initial or in-process record for ProcessLogID = %d !', 16, 1,
@ProcessLogID);
RETURN;
END;
GO
Stored Procedure to call after the final step and/or in a CATCH block
CREATE PROCEDURE [Logging].[ProcessLog_Stop]
(
@ProcessLogID INT,
@Step TINYINT,
@StatusID TINYINT,
@Remarks NVARCHAR(MAX) = NULL,
@RowsSelected INT = NULL,
@RowsInserted INT = NULL,
@RowsUpdated INT = NULL,
@RowsDeleted INT = NULL
)
AS
SET NOCOUNT ON;
UPDATE pl
SET pl.[StatusID] = @StatusID, -- 3 = Success, 4 = Fail
pl.[Step] = @Step,
pl.[Remarks] = @Remarks,
pl.[StoppedAt] = SYSDATETIME(),
pl.[RowsSelected] = ISNULL(@RowsSelected, pl.[RowsSelected]),
pl.[RowsInserted] = ISNULL(@RowsInserted, pl.[RowsInserted]),
pl.[RowsUpdated] = ISNULL(@RowsUpdated, pl.[RowsUpdated]),
pl.[RowsDeleted] = ISNULL(@RowsDeleted, pl.[RowsDeleted])
FROM [Logging].[ProcessLog] pl
WHERE pl.[ProcessLogID] = @ProcessLogID;
IF (@@ROWCOUNT = 0)
BEGIN
RAISERROR('No initial or in-process record for ProcessLogID = %d !', 16, 1,
@ProcessLogID);
RETURN;
END;
GO
Demo stored procedure (input parameters are formatted as XML)
The reason for putting "StepNumber" in a variable is so that the value can be passed to the CATCH block. The @StepNumber variable is incremented prior to each operation. If the operation succeeds, that value is used to call the "Log" stored procedure, which captures the number of rows affected for that step and the time it was called. If the operation fails, that same @StepNumber value is used to call the "Stop" stored procedure, which marks the process as "failed" and passes in the error message. This makes the data less confusing, since the Step column on failed records will show the step that was actually being worked on when the error occurred.
CREATE PROCEDURE [dbo].[ProcessLogDemo]
(
@Param1 INT,
@Param2 DATETIME,
@Param3 NVARCHAR(50) = NULL
)
AS
SET NOCOUNT ON;
DECLARE @ProcessID INT,
@DB_ID INT = DB_ID(),
@Params XML,
@StepNumber TINYINT;
SET @Params = (
SELECT @Param1 AS [Param1],
@Param2 AS [Param2],
@Param3 AS [Param3]
FOR XML PATH(N'Params')
); -- missing elements mean the value == NULL
--SELECT @Params;
BEGIN TRY
EXEC [Logging].[ProcessLog_Start]
@DatabaseID = @DB_ID,
@ObjectID = @@PROCID,
@Params = @Params,
@ProcessLogID = @ProcessID OUTPUT;
SET @StepNumber = 1;
-- do something
EXEC [Logging].[ProcessLog_Log]
@ProcessLogID = @ProcessID,
@Step = @StepNumber,
@RowsSelected = @@ROWCOUNT;
SET @StepNumber = 2;
-- do something else
EXEC [Logging].[ProcessLog_Log]
@ProcessLogID = @ProcessID,
@Step = @StepNumber,
@RowsUpdated = @@ROWCOUNT;
SET @StepNumber = 3;
-- do final thingy
EXEC [Logging].[ProcessLog_Stop]
@ProcessLogID = @ProcessID,
@Step = @StepNumber,
@StatusID = 3, -- success
@RowsInserted = @@ROWCOUNT;
END TRY
BEGIN CATCH
DECLARE @ErrorMessage NVARCHAR(MAX) = ERROR_MESSAGE();
EXEC [Logging].[ProcessLog_Stop]
@ProcessLogID = @ProcessID,
@Step = @StepNumber,
@StatusID = 4, -- fail
@Remarks = @ErrorMessage;
END CATCH;
GO
NOTES:
With regards to getting the "invoked by" Login for SQL Server Agent jobs: the step_id = 0 record (which is the only place that this info exists) does not exist until the job completes (success or failure). Hence it isn't available while the stored procedure is running, let alone at the beginning. For now we capture MAX(sjh.[instance_id]) FROM msdb.dbo.sysjobhistory sjh for the currently executing job on the current session. Later (i.e. after the job completes), that placeholder can be replaced with the Login that invoked the job.
I would generally recommend against adding this type of logging to Stored Procedures that are executed very frequently as the additional read and write operations will have a negative impact on performance.
ADDENDUM
Here is an Inline Table-Valued Function (ITVF) to get the job outcome info (including the "invoked by" user or schedule or whatever) based on the instance_id value that was captured into the ProcessLog.StartedBy column. The instance_id value returned in the result set comes from the step_id = 0 row.
CREATE FUNCTION dbo.GetSqlServerAgentJobOutcome
(
@InstanceID INT
)
RETURNS TABLE
AS RETURN
WITH cte AS
(
SELECT TOP (1)
sjh.[instance_id],
sjh.job_id,
sjh.[message],
sjh.[run_date],
sjh.[run_time],
sjh.[run_duration],
sjh.[run_status],
sjh.[sql_message_id],
sjh.[sql_severity],
(CHARINDEX(N' was invoked by ', sjh.[message]) + 16) AS [invoker_begin],
CHARINDEX(N'. The last step to run', sjh.[message]) AS [invoker_end]
FROM msdb.dbo.sysjobhistory sjh
WHERE sjh.[job_id] = (SELECT sjh2.[job_id]
FROM msdb.dbo.sysjobhistory sjh2
WHERE sjh2.[instance_id] = @InstanceID)
AND sjh.[step_id] = 0
AND sjh.[instance_id] >= @InstanceID
ORDER BY instance_id ASC
)
SELECT [instance_id], [job_id],
--[message],
[run_date], [run_time],
[run_duration], [run_status],
[sql_message_id], [sql_severity],
SUBSTRING([message], invoker_begin, ([invoker_end] - [invoker_begin]))
AS [InvokedBy]
FROM cte;
GO
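A usage sketch for back-filling the invoker after the job completes, assuming StartedBy still holds the 'sysjobhistory.instance_id: N' placeholder text (that prefix is 27 characters, so the number starts at position 28):
UPDATE pl
SET pl.[StartedBy] = job.[InvokedBy]
FROM [Logging].[ProcessLog] pl
CROSS APPLY dbo.GetSqlServerAgentJobOutcome(
         CONVERT(INT, SUBSTRING(pl.[StartedBy], 28, 20))) job
WHERE pl.[StartedBy] LIKE N'sysjobhistory.instance_id:%';
Rows for jobs that haven't completed yet simply won't produce a match from the ITVF and will be picked up by a later run.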
Best Answer
In Option 2 the Status_Code appears as a non-key attribute in two tables: Study and Contract. But which of those entities does it belong to? And what are the relative cardinalities of those entities?
Looking at Option 1, the Status_Code belongs to the Contract; there can be multiple Study rows for a given Contract.
So if you were to adopt Option 2, you give OLTP a headache: any time there's a change in the Status of a Contract, transaction processing needs to replicate that change to all of its Study rows, with an obvious risk of getting out of synch. That's called an update anomaly. Denormalising the model like Option 2 is legitimate in data warehouses, because we want fast reporting. It is unwise in a transaction-processing-oriented data model, because of the processing cost and the risk of the data getting out of synch.
So for an application focussing on transaction processing, Option 1 is indicated. And yes, you would create a Join view: Study > Contract > Status.
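A sketch of that view (the key and description column names are assumptions, since the original schema isn't shown):
CREATE VIEW dbo.StudyContractStatus
AS
SELECT s.[Study_ID],          -- column names below are assumed for illustration
       c.[Contract_ID],
       st.[Status_Code],
       st.[Status_Description]
FROM [Study] s
INNER JOIN [Contract] c
        ON c.[Contract_ID] = s.[Contract_ID]
INNER JOIN [Status] st
        ON st.[Status_Code] = c.[Status_Code];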