I can think of three solutions - EAV, XML, and Sparse Columns. The last is vendor-specific and may not be useful to you.
Whichever method you choose, you may wish to consider storing the original request data in a raw format, in a table or flat file. It will make it easy to try new ways of storing the data, allow you to reload data if you discover a mistake with the way you're parsing your requests, and offer opportunities for parsing the API requests using batch processing or "big data" tools if you find that your data warehouse isn't able to efficiently deal with the data.
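As a rough sketch of that idea (the table and column names here are my own, purely illustrative), a raw staging table can be as simple as:
CREATE TABLE dbo.ApiRequestRaw
(
    RequestID  BIGINT NOT NULL IDENTITY(1, 1)
        CONSTRAINT PK_ApiRequestRaw PRIMARY KEY CLUSTERED,
    ReceivedAt DATETIME2 NOT NULL
        CONSTRAINT DF_ApiRequestRaw_ReceivedAt DEFAULT (SYSDATETIME()),
    RawRequest NVARCHAR(MAX) NOT NULL -- the request exactly as received, unparsed
);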
EAV considerations
EAV/KVS, as you've described it above, is likely to be the most straightforward implementation.
Unfortunately it's also going to be very expensive: to get any sort of efficient query on commonly used keys you'll need an index on the key column, which can become heavily fragmented, and without one, querying for particular keys means scanning the entire store.
You may be able to reduce the cost of indexing or index scans by supporting your EAV store with materialised views (many vendors support this) for querying keys or values that you care about.
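As a sketch of the EAV shape being discussed (names are mine, not from the original), the store and its supporting index might look like:
CREATE TABLE dbo.ApiRequestValue
(
    RequestID      BIGINT NOT NULL,        -- points back to the raw request
    AttributeName  VARCHAR(100) NOT NULL,  -- the key from the API request
    AttributeValue NVARCHAR(4000) NULL,
    CONSTRAINT PK_ApiRequestValue
        PRIMARY KEY CLUSTERED (RequestID, AttributeName)
);
-- This index is what makes "find all requests with key X" workable,
-- and it is also the index that is prone to fragmentation:
CREATE NONCLUSTERED INDEX IX_ApiRequestValue_AttributeName
    ON dbo.ApiRequestValue (AttributeName)
    INCLUDE (AttributeValue);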
XML
Most enterprise database systems offer very mature XML handling, including validation, indexing, and sophisticated querying.
Loading the API request into the database as XML would provide one tuple per request, which logically might be a bit more palatable to you than having an unknown number of rows in an EAV table.
Whether this is efficient would depend a lot on your RDBMS vendor and your implementation.
The biggest downside is complexity: of the options here, XML handling is probably the only one that is more complicated to work with than doing string manipulation on the original request yourself!
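On SQL Server, for instance, a minimal sketch might look like the following (the element names are invented for illustration):
CREATE TABLE dbo.ApiRequestXml
(
    RequestID   BIGINT NOT NULL
        CONSTRAINT PK_ApiRequestXml PRIMARY KEY CLUSTERED,
    RequestData XML NOT NULL
);
-- A primary XML index shreds the XML up front to speed up querying:
CREATE PRIMARY XML INDEX PXML_ApiRequestXml
    ON dbo.ApiRequestXml (RequestData);

-- Extract a typed value from each request that matches an XQuery predicate:
SELECT RequestID,
       RequestData.value('(/Request/CustomerID)[1]', 'INT') AS CustomerID
FROM dbo.ApiRequestXml
WHERE RequestData.exist('/Request[Status="active"]') = 1;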
Sparse Columns / traditional tables
It's possible that you could load your data into a traditional table structure, with one column per key.
SQL Server's Sparse Columns feature is a great alternative to an EAV store. A table with Sparse Columns behaves much the same as a normal table, except that it can have up to 30,000 columns, and NULL values in sparse columns consume no space in the table.
Combining them with Filtered Indexes (another SQL Server specific feature) can provide an extremely efficient alternative to an EAV store if you're frequently querying for a couple of specific columns and/or values.
Using a traditional table with other vendors may be viable - IBM supports over 700 columns per table and Oracle about 1000, and features such as compression or Oracle's treatment of trailing nulls might mean that you can store your API data fairly efficiently.
The obvious downside of this approach is that as you added new keys to your API, you'd need to adjust your schema accordingly.
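A sketch of the SQL Server variant, with invented column names, might be:
CREATE TABLE dbo.ApiRequestSparse
(
    RequestID  BIGINT NOT NULL
        CONSTRAINT PK_ApiRequestSparse PRIMARY KEY CLUSTERED,
    CustomerID INT SPARSE NULL,       -- NULLs in sparse columns consume no space
    OrderID    INT SPARSE NULL,
    Remarks    NVARCHAR(500) SPARSE NULL
);
-- A filtered index contains only the rows that actually carry a CustomerID,
-- so it stays small even when the column is rarely populated:
CREATE NONCLUSTERED INDEX IX_ApiRequestSparse_CustomerID
    ON dbo.ApiRequestSparse (CustomerID)
    WHERE CustomerID IS NOT NULL;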
Here is a structure that is at least very close.
There is no programmatic way to get the parameters (unfortunately); you need to format them into XML to pass in.
The Login that initiates a SQL Server Agent job seems to be logged only in the message column of msdb.dbo.sysjobhistory, for step_id = 0. That value can be extracted, but not during the execution of the job.
You get the ObjectID to pass in from @@PROCID.
Below is the schema (2 tables) and stored procedures (3 procs). The concept is to separate "init", "in process", and "completed (success or error)" logging. This allows certain columns to be set only at the appropriate time (e.g. DatabaseID, StartedAt, etc. only need to be set at the very beginning). Separating the types of event also makes it easier to have event-specific logic (yes, you can have that even in a single proc, but then you still have all of the input parameters when each event type only needs a subset of them).
A "process" record gets updated via its IDENTITY (and clustered PK) value. This is another benefit of separating the event types: it makes it much easier to capture SCOPE_IDENTITY() and pass it back to be used by the two other logging stored procedures. If a stored procedure fails and doesn't reach the CATCH block, there is no need to worry about accidentally updating that process record, as the next time any stored procedure (that is being logged) starts, it will get a new / unique ID to update.
Cleanup (optional) and Schema
/* -- optional cleanup
DROP PROCEDURE [dbo].[ProcessLogDemo];
DROP PROCEDURE [Logging].[ProcessLog_Log];
DROP PROCEDURE [Logging].[ProcessLog_Start];
DROP PROCEDURE [Logging].[ProcessLog_Stop];
DROP TABLE [Logging].[ProcessLog];
DROP TABLE Logging.[Status];
DROP SCHEMA [Logging];
*/
CREATE SCHEMA [Logging];
GO
Tables and Indexes
CREATE TABLE Logging.[Status]
(
[StatusID] TINYINT NOT NULL
CONSTRAINT [PK_Status] PRIMARY KEY CLUSTERED,
[StatusName] VARCHAR(50) NOT NULL
);
CREATE TABLE [Logging].[ProcessLog]
(
ProcessLogID INT NOT NULL IDENTITY(-2147483648, 1) -- start at INT min value
CONSTRAINT [PK_ProcessLog] PRIMARY KEY CLUSTERED,
DatabaseID INT NOT NULL,
ObjectID INT NULL, -- NULL = ad hoc query
SessionID SMALLINT NOT NULL
CONSTRAINT [DF_ProcessLog_SessionID] DEFAULT (@@SPID),
Step TINYINT NOT NULL, -- if you have more than 255 steps, consult psychiatrist
StatusID TINYINT NOT NULL
CONSTRAINT [FK_ProcessLog_Status]
FOREIGN KEY REFERENCES [Logging].[Status]([StatusID]),
Remarks NVARCHAR(MAX) NULL, -- or maybe VARCHAR(MAX)?
Params XML NULL,
RowsSelected INT NULL,
RowsInserted INT NULL,
RowsUpdated INT NULL,
RowsDeleted INT NULL,
StartedBy [sysname] NULL,
StartedAt DATETIME2 NOT NULL
CONSTRAINT [DF_ProcessLog_StartedAt] DEFAULT (SYSDATETIME()),
UpdatedAt DATETIME2 NULL, -- use to show progress / "heartbeat"
StoppedAt DATETIME2 NULL
);
GO
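The original doesn't show the rows for Logging.Status, but the procedures below reference StatusID values 1 through 4, so the lookup data would be something along these lines (the names are my guesses, inferred from the comments in the procedures):
INSERT INTO Logging.[Status] ([StatusID], [StatusName])
VALUES (1, 'Started'),
       (2, 'In process'),
       (3, 'Success'),
       (4, 'Fail');
GO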
Stored Procedure to call at the very beginning of "logged" stored procedures
CREATE PROCEDURE [Logging].[ProcessLog_Start]
(
@DatabaseID INT,
@ObjectID INT,
@Params XML,
@ProcessLogID INT = NULL OUTPUT
)
AS
SET NOCOUNT ON;
-- First, capture the MAX "instance_id" from sysjobhistory if this process is a SQL
-- Server Agent job (use later to get the "invoked by" Login), else grab the Login.
DECLARE @StartedBy [sysname];
IF (EXISTS(
SELECT *
FROM sys.dm_exec_sessions sdes
WHERE sdes.[session_id] = @@SPID
AND sdes.[program_name] LIKE N'SQLAgent - TSQL JobStep (%'))
BEGIN
DECLARE @JobID UNIQUEIDENTIFIER;
SELECT @JobID = CONVERT(UNIQUEIDENTIFIER,
CONVERT(BINARY(16),
SUBSTRING(sdes.[program_name],
CHARINDEX(N'(Job 0x', sdes.[program_name]) + 5,
34), 1
)
)
FROM sys.dm_exec_sessions sdes
WHERE sdes.[session_id] = @@SPID;
--SELECT @JobID;
SELECT @StartedBy = N'sysjobhistory.instance_id: '
+ CONVERT(NVARCHAR(20), MAX(sjh.[instance_id]))
FROM msdb.dbo.sysjobhistory sjh
WHERE sjh.[job_id] = @JobID;
END;
ELSE
BEGIN
SET @StartedBy = ORIGINAL_LOGIN();
END;
-- Now it should be safe to create a new entry
INSERT INTO [Logging].[ProcessLog] ([DatabaseID], [ObjectID], [Step], [StatusID],
[Params], [StartedBy])
VALUES (@DatabaseID, @ObjectID, 0, 1, @Params, @StartedBy);
SET @ProcessLogID = SCOPE_IDENTITY();
GO
Stored Procedure to call after all but the final step
CREATE PROCEDURE [Logging].[ProcessLog_Log]
(
@ProcessLogID INT,
@Step TINYINT,
@RowsSelected INT = NULL,
@RowsInserted INT = NULL,
@RowsUpdated INT = NULL,
@RowsDeleted INT = NULL
)
AS
SET NOCOUNT ON;
UPDATE pl
SET pl.[StatusID] = 2, -- In process
pl.[Step] = @Step,
pl.[UpdatedAt] = SYSDATETIME(),
pl.[RowsSelected] = ISNULL(@RowsSelected, pl.[RowsSelected]),
pl.[RowsInserted] = ISNULL(@RowsInserted, pl.[RowsInserted]),
pl.[RowsUpdated] = ISNULL(@RowsUpdated, pl.[RowsUpdated]),
pl.[RowsDeleted] = ISNULL(@RowsDeleted, pl.[RowsDeleted])
FROM [Logging].[ProcessLog] pl
WHERE pl.[ProcessLogID] = @ProcessLogID;
IF (@@ROWCOUNT = 0)
BEGIN
RAISERROR('No initial or in-process record for ProcessLogID = %d !', 16, 1,
@ProcessLogID);
RETURN;
END;
GO
Stored Procedure to call after the final step and/or in a CATCH block
CREATE PROCEDURE [Logging].[ProcessLog_Stop]
(
@ProcessLogID INT,
@Step TINYINT,
@StatusID TINYINT,
@Remarks NVARCHAR(MAX) = NULL,
@RowsSelected INT = NULL,
@RowsInserted INT = NULL,
@RowsUpdated INT = NULL,
@RowsDeleted INT = NULL
)
AS
SET NOCOUNT ON;
UPDATE pl
SET pl.[StatusID] = @StatusID, -- 3 = Success, 4 = Fail
pl.[Step] = @Step,
pl.[Remarks] = @Remarks,
pl.[StoppedAt] = SYSDATETIME(),
pl.[RowsSelected] = ISNULL(@RowsSelected, pl.[RowsSelected]),
pl.[RowsInserted] = ISNULL(@RowsInserted, pl.[RowsInserted]),
pl.[RowsUpdated] = ISNULL(@RowsUpdated, pl.[RowsUpdated]),
pl.[RowsDeleted] = ISNULL(@RowsDeleted, pl.[RowsDeleted])
FROM [Logging].[ProcessLog] pl
WHERE pl.[ProcessLogID] = @ProcessLogID;
IF (@@ROWCOUNT = 0)
BEGIN
RAISERROR('No initial or in-process record for ProcessLogID = %d !', 16, 1,
@ProcessLogID);
RETURN;
END;
GO
Demo stored procedure (input parameters are formatted as XML)
The reason for putting "StepNumber" in a variable is so that the value can be passed to the CATCH block. The @StepNumber variable is incremented prior to each operation. If the operation succeeds, that value is used to call the "Log" stored procedure, which captures the number of rows affected for that step and the time it was called. If the operation fails, that same @StepNumber value is used to call the "Stop" stored procedure, which marks the process as "failed" and passes in the error message. This makes the data less confusing, since the Step column on failed records will show the step that was actually being worked on when the error occurred.
CREATE PROCEDURE [dbo].[ProcessLogDemo]
(
@Param1 INT,
@Param2 DATETIME,
@Param3 NVARCHAR(50) = NULL
)
AS
SET NOCOUNT ON;
DECLARE @ProcessID INT,
@DB_ID INT = DB_ID(),
@Params XML,
@StepNumber TINYINT;
SET @Params = (
SELECT @Param1 AS [Param1],
@Param2 AS [Param2],
@Param3 AS [Param3]
FOR XML PATH(N'Params')
); -- missing elements mean the value == NULL
--SELECT @Params;
BEGIN TRY
EXEC [Logging].[ProcessLog_Start]
@DatabaseID = @DB_ID,
@ObjectID = @@PROCID,
@Params = @Params,
@ProcessLogID = @ProcessID OUTPUT;
SET @StepNumber = 1;
-- do something
EXEC [Logging].[ProcessLog_Log]
@ProcessLogID = @ProcessID,
@Step = @StepNumber,
@RowsSelected = @@ROWCOUNT;
SET @StepNumber = 2;
-- do something else
EXEC [Logging].[ProcessLog_Log]
@ProcessLogID = @ProcessID,
@Step = @StepNumber,
@RowsUpdated = @@ROWCOUNT;
SET @StepNumber = 3;
-- do final thingy
EXEC [Logging].[ProcessLog_Stop]
@ProcessLogID = @ProcessID,
@Step = @StepNumber,
@StatusID = 3, -- success
@RowsInserted = @@ROWCOUNT;
END TRY
BEGIN CATCH
DECLARE @ErrorMessage NVARCHAR(MAX) = ERROR_MESSAGE();
EXEC [Logging].[ProcessLog_Stop]
@ProcessLogID = @ProcessID,
@Step = @StepNumber,
@StatusID = 4, -- fail
@Remarks = @ErrorMessage;
END CATCH;
GO
NOTES:
With regards to getting the "invoked by" Login for SQL Server Agent jobs: the step_id = 0 record (which is the only place that this info exists) does not exist until the job completes (success or failure). Hence it isn't available while the stored procedure is running, let alone at the beginning. For now we capture MAX(sjh.[instance_id]) FROM msdb.dbo.sysjobhistory sjh for the currently executing job on the current session. Later (i.e. after the job completes), that placeholder can be replaced with the Login that invoked the job.
I would generally recommend against adding this type of logging to Stored Procedures that are executed very frequently as the additional read and write operations will have a negative impact on performance.
ADDENDUM
Here is an Inline Table-Valued Function (ITVF) to get the job outcome info (including the "invoked by" user or schedule or whatever) based on the instance_id value that was captured into the ProcessLog.StartedBy column. The instance_id value returned in the result set comes from the step_id = 0 row.
CREATE FUNCTION dbo.GetSqlServerAgentJobOutcome
(
@InstanceID INT
)
RETURNS TABLE
AS RETURN
WITH cte AS
(
SELECT TOP (1)
sjh.[instance_id],
sjh.job_id,
sjh.[message],
sjh.[run_date],
sjh.[run_time],
sjh.[run_duration],
sjh.[run_status],
sjh.[sql_message_id],
sjh.[sql_severity],
(CHARINDEX(N' was invoked by ', sjh.[message]) + 16) AS [invoker_begin],
CHARINDEX(N'. The last step to run', sjh.[message]) AS [invoker_end]
FROM msdb.dbo.sysjobhistory sjh
WHERE sjh.[job_id] = (SELECT sjh2.[job_id]
FROM msdb.dbo.sysjobhistory sjh2
WHERE sjh2.[instance_id] = @InstanceID)
AND sjh.[step_id] = 0
AND sjh.[instance_id] >= @InstanceID
ORDER BY instance_id ASC
)
SELECT [instance_id], [job_id],
--[message],
[run_date], [run_time],
[run_duration], [run_status],
[sql_message_id], [sql_severity],
SUBSTRING([message], invoker_begin, ([invoker_end] - [invoker_begin]))
AS [InvokedBy]
FROM cte;
GO
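A usage sketch for back-filling the invoker after the job completes, assuming StartedBy still holds the 'sysjobhistory.instance_id: N' placeholder text (that prefix is 27 characters, so the number starts at position 28):
UPDATE pl
SET pl.[StartedBy] = job.[InvokedBy]
FROM [Logging].[ProcessLog] pl
CROSS APPLY dbo.GetSqlServerAgentJobOutcome(
         CONVERT(INT, SUBSTRING(pl.[StartedBy], 28, 20))) job
WHERE pl.[StartedBy] LIKE N'sysjobhistory.instance_id:%';
Rows for jobs that haven't completed yet simply won't produce a match from the ITVF and will be picked up by a later run.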
Best Answer
In Option 2 the Status_Code appears as a non-key attribute in two tables: Study and Contract. But which of those entities does it belong to? And what are the relative cardinalities of those entities?
Looking at Option 1, the Status_Code belongs to the Contract; there can be multiple Study rows for a given Contract.
So if you were to adopt Option 2, you give OLTP a headache: any time there's a change in the Status of a Contract, transaction processing needs to replicate that change to all of its Study rows, with an obvious risk of getting out of synch. That's called an update anomaly. Denormalising the model like Option 2 is legitimate in data warehouses, because we want fast reporting. It is unwise in a transaction-processing-oriented data model, because of the processing cost and the risk of the data getting out of synch.
So for an application focussing on transaction processing, Option 1 is indicated. And yes, you would create a Join view: Study > Contract > Status.
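A sketch of that view (the key and description column names are assumptions, since the original schema isn't shown):
CREATE VIEW dbo.StudyContractStatus
AS
SELECT s.[Study_ID],          -- column names below are assumed for illustration
       c.[Contract_ID],
       st.[Status_Code],
       st.[Status_Description]
FROM [Study] s
INNER JOIN [Contract] c
        ON c.[Contract_ID] = s.[Contract_ID]
INNER JOIN [Status] st
        ON st.[Status_Code] = c.[Status_Code];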