Sql-server – Query to return fields of distinct values per key

aggregatedata-warehousesql serversql-server-2012

In the process of building a data warehouse we are creating some aggregation tables for a reporting engine. We would like to build a single aggregation table from the fact table that can exist as list of all possible values of certain fields that are present in the fact table.

Assume that all of the source data is in a single flat Fact table. Imagine something like this:

|Org   |Gender| Age|State|
--------------------------
|     1|     M|   1|   FL|
|     1|     F|   2|   VA|
|     1|     M|   3|   CA|
|     1|     M|   2|   NJ|
|     2|     M|   2|   VA|
|     2|     M|   3|   VA|
|     2|     M|   5|   VA|
(Times 400+ million, but for only about 100 Orgs...)

I would like to create an aggregation table that has the following output structure:

|Org   |Gender| Age|State|
--------------------------
|     1|     M|   1|   FL|
|     1|     F|   2|   VA|
|     1|  NULL|   3|   CA|
|     1|  NULL|NULL|   NJ|
|     2|     M|   2|   VA|
|     2|  NULL|   3| NULL|
|     2|  NULL|   5| NULL|

So that for each Org, the non-NULL DISTINCT list in each column represents the list of distinct values in the Fact table for each column. I can't use a DISTINCT operator because that will look for the list of distinct combinations of fields, not where each field is distinct and the number of rows for each Org is at most the number of distinct values for that Org.

Added comments:
This is for a Data Warehouse, so the source data started off in the transactional relational database that is the companies working database. It's then denormalized into a Fact table and set of Dimension tables that support the possible values of other data. I don't have access to the original relational DB for this project.

Purpose:
But we are now putting an ad-hoc reporting interface on top of the data warehouse and in order to efficiently query the Fact and Dimension tables we have to build some aggregation tables so that the selection fields of the reporting interface don't have to query the hundreds of millions of rows in the fact+dimension tables to find the possible values for each org to present in select dropdowns after the Org is chosen.

I've done a lot of googling and haven't found much, but I don't even know what you would call this kind of structure. "Columnar distinct?"

The database I'm using is SQL Server 2012, if it matters.

Best Answer

Without knowing anything about the source data, perhaps this would do what you want?

USE Test;
GO
CREATE TABLE GENDER
(
    ORG INT NOT NULL
    , GENDER VARCHAR(1) NOT NULL
);

CREATE TABLE AGE
(
    ORG INT NOT NULL
    , AGE TINYINT
);

CREATE TABLE STATES
(
    ORG INT NOT NULL
    , STATENAME VARCHAR(255)
);

INSERT INTO Gender (ORG, GENDER) VALUES (1, 'M');
INSERT INTO Gender (ORG, GENDER) VALUES (1, 'F');
INSERT INTO Gender (ORG, GENDER) VALUES (2, 'M');
INSERT INTO Gender (ORG, GENDER) VALUES (2, 'F');
INSERT INTO Gender (ORG, GENDER) VALUES (3, 'M');
INSERT INTO Gender (ORG, GENDER) VALUES (3, 'F');

INSERT INTO AGE (ORG, AGE) VALUES (1,27);
INSERT INTO AGE (ORG, AGE) VALUES (1,28);
INSERT INTO AGE (ORG, AGE) VALUES (1,29);
INSERT INTO AGE (ORG, AGE) VALUES (1,30);
INSERT INTO AGE (ORG, AGE) VALUES (2,37);
INSERT INTO AGE (ORG, AGE) VALUES (2,38);
INSERT INTO AGE (ORG, AGE) VALUES (2,39);
INSERT INTO AGE (ORG, AGE) VALUES (2,40);
INSERT INTO AGE (ORG, AGE) VALUES (3, 2);

INSERT INTO STATES (ORG, STATENAME) VALUES (1,'FL');
INSERT INTO STATES (ORG, STATENAME) VALUES (1,'GA');
INSERT INTO STATES (ORG, STATENAME) VALUES (1,'MN');
INSERT INTO STATES (ORG, STATENAME) VALUES (1,'NM');
INSERT INTO STATES (ORG, STATENAME) VALUES (2,'FL');
INSERT INTO STATES (ORG, STATENAME) VALUES (2,'MN');
INSERT INTO STATES (ORG, STATENAME) VALUES (2,'NM');
INSERT INTO STATES (ORG, STATENAME) VALUES (3,'FL');
INSERT INTO STATES (ORG, STATENAME) VALUES (3,'GA');
INSERT INTO STATES (ORG, STATENAME) VALUES (3,'NM');

CREATE TABLE FACTS
(
    ORG INT NOT NULL
    , GENDER VARCHAR(1) NULL
    , AGE INT NULL
    , STATENAME VARCHAR(255) NULL
);

INSERT INTO FACTS (ORG, GENDER, AGE, STATENAME)
SELECT ORG, GENDER, NULL, NULL
FROM GENDER
GROUP BY ORG, GENDER
UNION ALL
SELECT ORG, NULL, AGE, NULL
FROM AGE
GROUP BY ORG, AGE
UNION ALL
SELECT ORG, NULL, NULL, STATENAME
FROM STATES;

SELECT *
FROM FACTS
ORDER BY ORG;

The results:

enter image description here

This will create a FACTS table that has all the data from several source tables. As @ypercube and @jon-seigel said, this really doesn't make much sense; perhaps we are missing something compelling about your setup.

If this is not what you were expecting, please provide the source tables, and any other pertinent details.