Designing an Accumulating Snapshot with Repeated Groups: Need Advice

data-warehouse database-design

Situation

I need to create a pie chart that can be shown at the following levels: District, Region, and School.

At the student level, it has to show the student's trend over a set of periods: "I1", "G1", "I2", "G2", "I3", "G3", "I4", "G4".

The pie chart will show the number and percentage of students with 1, 2, 3, or no risk factors. The risk factors are Academics, Discipline, and Attendance. I have been given a formula to determine whether a student is at risk in each of these categories.

Our data warehouse follows the Ralph Kimball approach.

I have a working query at the school level that applies the given formula through subselects and sums/counts and returns the following:

STUDENT_ID  STUDENT_NAME  PERIOD  SCHOOL_YEAR  ACD_RISK  DISC_RISK  ATT_RISK
1           Viriato       I1      2010/2011    1         1          1
1           Viriato       G1      2010/2011    1         0          0
2           Teodosio      I1      2010/2011    0         1          0
...

The query that gives me the total number of students with 0, 1, 2, or 3 risk factors, by counting and summing over the output above, performs well at the school level. At the region and district levels, however, its performance is not acceptable.
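The count-and-sum step described above can be sketched roughly as follows, using Python's built-in sqlite3 module and an in-memory table. The table name student_risk, the function risk_distribution, and the row for student 3 are assumptions made for illustration; the real query would also join to the school, region, or district dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student_risk ("
    "student_id INTEGER, period TEXT, school_year TEXT, "
    "acd_risk INTEGER, disc_risk INTEGER, att_risk INTEGER)"
)
conn.executemany(
    "INSERT INTO student_risk VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "I1", "2010/2011", 1, 1, 1),
        (1, "G1", "2010/2011", 1, 0, 0),
        (2, "I1", "2010/2011", 0, 1, 0),
        (3, "I1", "2010/2011", 0, 0, 0),  # hypothetical extra student
    ],
)


def risk_distribution(conn, school_year, period):
    """Return {risk_factor_count: (student_count, pct_of_students)}."""
    rows = conn.execute(
        """
        SELECT acd_risk + disc_risk + att_risk AS risk_count,
               COUNT(*) AS students
        FROM student_risk
        WHERE school_year = ? AND period = ?
        GROUP BY acd_risk + disc_risk + att_risk
        ORDER BY risk_count
        """,
        (school_year, period),
    ).fetchall()
    total = sum(students for _, students in rows)
    return {k: (n, 100.0 * n / total) for k, n in rows}


dist = risk_distribution(conn, "2010/2011", "I1")
for risk_count, (students, pct) in dist.items():
    print(risk_count, students, round(pct, 1))
```

Each pie slice is one key of the returned dictionary. Risk-factor counts that no student has (here, 2) simply do not appear.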

Solution

I need to create an Accumulating Snapshot.

Concerns

In chapter 12 (page 244) of "The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling", Ralph Kimball says this about the accumulating snapshot:

A single row represents the complete history of something.

Solution A: Prior to Reading Kimball

Should I have a job that produces a table with the result set above? Then I would have 8 rows per student per year, not the single row that Kimball suggests.

Solution B: After Reading Kimball

student_id,
name,
school_year,
period,
att_risk_I1,
acd_risk_I1,
disc_risk_I1,
att_risk_G1,
acd_risk_G1,
disc_risk_G1,
...
att_risk_G4,
acd_risk_G4,
disc_risk_G4

As you can see, with Solution B I would have one row per student per year, and I could generate a trend graph for that student all the way from elementary school to high school with 12 rows (grades 1 through 12). However, I am not sure I like having ATT_RISK, DISC_RISK, and ACD_RISK repeated 8 times.
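To make the two shapes concrete, here is a minimal sketch of how the one-row-per-student-per-year layout of Solution B could be derived from the eight-rows-per-year layout of Solution A with conditional aggregation. It uses Python's built-in sqlite3 module; the table name student_risk and the helper build_pivot_sql are assumptions, and the risk column names follow the ACD_RISK, DISC_RISK, and ATT_RISK names from the query output above.

```python
import sqlite3

PERIODS = ["I1", "G1", "I2", "G2", "I3", "G3", "I4", "G4"]
RISKS = ["att_risk", "acd_risk", "disc_risk"]


def build_pivot_sql():
    """Build a query with one column per (period, risk) pair: 8 x 3 = 24."""
    cols = [
        f"MAX(CASE WHEN period = '{p}' THEN {r} END) AS {r}_{p}"
        for p in PERIODS
        for r in RISKS
    ]
    return (
        "SELECT student_id, school_year, "
        + ", ".join(cols)
        + " FROM student_risk"
        + " GROUP BY student_id, school_year"
        + " ORDER BY student_id, school_year"
    )


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets each row be read by column name
conn.execute(
    "CREATE TABLE student_risk ("
    "student_id INTEGER, period TEXT, school_year TEXT, "
    "acd_risk INTEGER, disc_risk INTEGER, att_risk INTEGER)"
)
conn.executemany(
    "INSERT INTO student_risk VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "I1", "2010/2011", 1, 1, 1),
        (1, "G1", "2010/2011", 1, 0, 0),
        (2, "I1", "2010/2011", 0, 1, 0),
    ],
)

# One dictionary per student per school year; periods with no data are None.
wide_rows = [dict(row) for row in conn.execute(build_pivot_sql())]
print(wide_rows[0]["acd_risk_I1"], wide_rows[0]["acd_risk_G1"])
```

Periods that have not happened yet come back as NULL (None in Python), which is the usual way an accumulating snapshot row fills in over time.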

Which approach would you take?

Best Answer

It's not entirely clear what is being asked here, but I think the simplest solution is the one with one row per student. That lets you aggregate further later, however you want to do it.

So this is a vote for Solution B.

Related Solutions

Sql-server – Data Model For Summarizing Student Info

Calculate those counts on the fly.

For most data sets of this nature, I would imagine that calculating the count of tardiness events would be relatively cheap. How many rows do you expect there to be in the long run in your students or attendance tables? Probably on the order of tens of thousands at most. Contrast that with calculating an account balance from its transaction history in a database for a large bank with billions of transactions. In that case I would consider persisting a summary of these aggregations somewhere.
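A minimal sketch of the on-the-fly option, using Python's built-in sqlite3 module. It assumes an attendance table with one row per tardiness event; the event_date column, the index name, and the tardy_counts function are illustrative assumptions, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE attendance (
        student_id INTEGER NOT NULL,
        year       INTEGER NOT NULL,
        event_date TEXT    NOT NULL  -- hypothetical column
    );
    -- An ordinary index is usually enough at this scale.
    CREATE INDEX ix_attendance_student_year
        ON attendance (student_id, year);

    INSERT INTO attendance (student_id, year, event_date) VALUES
        (1, 2012, '2012-09-03'),
        (1, 2012, '2012-10-11'),
        (1, 2013, '2013-01-15'),
        (2, 2013, '2013-02-20');
    """
)


def tardy_counts(conn):
    """Count tardiness events per student per year at query time."""
    return conn.execute(
        """
        SELECT student_id, year, COUNT(*) AS tardy_count
        FROM attendance
        GROUP BY student_id, year
        ORDER BY student_id, year
        """
    ).fetchall()


counts = tardy_counts(conn)
print(counts)
```

Nothing is stored apart from the base rows, so there is no summary to keep in sync.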

If you wanted to pre-calculate this data anyway, I would summarize the count of tardiness events per student per year in an indexed view as suggested in the top answer here.

For example:

CREATE VIEW dbo.tardiness_summary
WITH SCHEMABINDING
AS
SELECT
     student_id
   , year
   , COUNT_BIG(*) AS tardy_count
FROM dbo.attendance
GROUP BY 
     student_id
   , year
;

CREATE UNIQUE CLUSTERED INDEX IX_tardiness_summary
ON dbo.tardiness_summary (
     student_id
   , year
)
;

This is not a flexible approach, however. Because the view is schema-bound to its base table, any change to the table that affects the view definition requires dropping or altering the view first, and the view's index then has to be rebuilt. Indexed views also have many restrictions on how they may be created or queried. Their advantage is that the summary is guaranteed to stay in sync with its source, because the database engine does that work for you.

Need help designing table with list of IDs to store

Do not use a list of ids. You will regret it. Use junction tables.

DDL:

create table organization (
  organization_id bigserial primary key,
  name text not null
);

create table individual (
  individual_id bigserial primary key,
  email_address varchar(255) not null
);

create table organization_member (
  organization_id bigint references organization(organization_id),
  individual_id bigint references individual(individual_id),

  primary key (organization_id, individual_id)
);

create table email_message (
  email_message_id bigserial primary key,
  send_at timestamp not null,
  subject varchar(80),
  body text
);

create table email_message_organization (
  email_message_id bigint references email_message(email_message_id),
  organization_id bigint references organization(organization_id),

  primary key (email_message_id, organization_id)
);

DML:

-- add an org:

insert into organization (name) values ('Acme, Inc');

-- add an individual:

insert into individual (email_address) values ('user@example.com');

-- associate individual w org:

insert into organization_member values (1,1);

-- add an email message:

insert into email_message (send_at, subject) values ('2013-07-31 12:12:12', 'test');

-- associate an email message with an org:

insert into email_message_organization values (1, 1);

-- when it's time to send the email, get the right people:

select
  e.subject,
  e.body,
  i.email_address
from individual i
join organization_member om using (individual_id)
join email_message_organization emo using (organization_id)
join email_message e using (email_message_id);

-- send the emails, record that they were sent in another table

Consider the Party model for representing individuals and organizations. It is not what I used here, but it is worth looking up.
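For orientation, one common reading of the Party model is a shared supertype table with individuals and organizations as subtypes of it. The sketch below, using Python's built-in sqlite3 module, is only an assumption about how that might look here; the party and party_membership tables and their columns are hypothetical and do not come from the schema above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
conn.executescript(
    """
    -- Supertype: every individual and organization is a party.
    CREATE TABLE party (
        party_id   INTEGER PRIMARY KEY,
        party_type TEXT NOT NULL
            CHECK (party_type IN ('individual', 'organization'))
    );
    CREATE TABLE individual (
        party_id      INTEGER PRIMARY KEY REFERENCES party (party_id),
        email_address TEXT NOT NULL
    );
    CREATE TABLE organization (
        party_id INTEGER PRIMARY KEY REFERENCES party (party_id),
        name     TEXT NOT NULL
    );
    -- Membership is still a junction table, now between two parties.
    CREATE TABLE party_membership (
        organization_id INTEGER NOT NULL REFERENCES organization (party_id),
        member_id       INTEGER NOT NULL REFERENCES party (party_id),
        PRIMARY KEY (organization_id, member_id)
    );

    INSERT INTO party (party_id, party_type)
        VALUES (1, 'organization'), (2, 'individual');
    INSERT INTO organization (party_id, name) VALUES (1, 'Acme, Inc');
    INSERT INTO individual (party_id, email_address)
        VALUES (2, 'user@example.com');
    INSERT INTO party_membership (organization_id, member_id) VALUES (1, 2);
    """
)

# Individuals that belong to each organization.
members = conn.execute(
    """
    SELECT o.name, i.email_address
    FROM party_membership pm
    JOIN organization o ON o.party_id = pm.organization_id
    JOIN individual i ON i.party_id = pm.member_id
    """
).fetchall()
print(members)
```

Because member_id points at party rather than individual, this layout would also let one organization be a member of another.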