Getting multiValued data into Solr via CSV:
The Solr documentation describes a "split" function in UpdateCSV; essentially, it parses a field value with a second CSV parser. See Solr - UpdateCSV - split. The parameters look like this (adjust the field name, separator, and encapsulator as necessary):
f.fieldA.split=true&f.fieldA.separator=%2C&f.fieldA.encapsulator='
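For context, a complete update request might look like the following (a sketch only; the host, core name, and field name are placeholders, and depending on your Solr version the endpoint may be /update/csv instead of /update):

http://localhost:8983/solr/mycore/update?commit=true&f.fieldA.split=true&f.fieldA.separator=%2C&f.fieldA.encapsulator=%27

You would post this with Content-Type text/csv and the CSV file as the request body. Note the encapsulator's single quote is URL-encoded as %27 here.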
Getting multiValued data from separate fields to maintain position:
Since asking this question, I've done some reading about dimensional models. It seems that what I was trying to do is poor design, because it places too many expectations on the application, too much complexity in the warehouse, or both.
When trying to preserve the relationship between two field values on a single record, it's better to store them both separately and together. Here's a comparison of my former input to the new input:
Former CSV input:
name|licenseState|licenseType
Josh|MA,CA|123,456
Fred|MD,OH|789,123
Transformed CSV input:
name|licenseState|licenseType|licenseStateType
Josh|MA,CA|123,456|MA123,CA456
Fred|MD,OH|789,123|MD789,OH123
This way your application can use the licenseState and licenseType dimension values independently, or it can use the licenseStateType dimension values, all without requiring complicated app or warehouse logic.
If you are confident in the integrity of the data being imported, it may be a good idea to disable all the constraints on your database before beginning your inserts and then re-enable them after the fact.
See this helpful Stack Overflow answer from a while back: Can foreign key constraints be temporarily disabled using T-SQL?
This will save you the headache of having to layer the inserts in order to respect the existing constraints of the database you are loading into.
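In SQL Server, a minimal sketch looks like this (the table name is a placeholder; re-enabling WITH CHECK forces the constraints to be re-validated against the newly loaded rows):

-- Disable all foreign key and check constraints on the target table
ALTER TABLE dbo.FactLicense NOCHECK CONSTRAINT ALL;

-- ... perform the bulk inserts here ...

-- Re-enable the constraints and re-validate the loaded data
ALTER TABLE dbo.FactLicense WITH CHECK CHECK CONSTRAINT ALL;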
In terms of the actual inserts themselves, I'd come down on the side of not using cursors. Not only is the process slow, but cursors take up a large amount of memory and create database locks. If you are cursoring through a very large number of rows, you also run the risk of quickly growing the database logs; if the server is only an average one, space may eventually become a concern. Try to take a set-based approach for the additional inserts your process needs.
For example, do this if you can:
INSERT INTO t1 (col1)
SELECT col1 FROM t2;
instead of this:
...
INSERT INTO t1 (col1) VALUES ('foo');
INSERT INTO t1 (col1) VALUES ('bar');
INSERT INTO t1 (col1) VALUES ('baz');
...
Best Answer
While you don't spell out the specific problem with loading your fact table, I'm going to assume the question approximates "how do I translate AppName to an AppKey so I can store that ID in my fact table?"
What you're looking to do is search your dimension table for a given value and return the key if it's found. Otherwise, we typically have a sentinel value to indicate an unknown value. Since you say you've already loaded your dimension tables, this shouldn't be an issue, but it's a best practice all the same.
If you are doing this through SSMS, you would write a query approximating the following:
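A minimal sketch, assuming a staging table dbo.kaggle_data with an AppName column, a dimension table dbo.DimApplication keyed by AppKey, and a sentinel key of -1 for unknown members (all of these names are assumptions, so adjust them to your schema):

SELECT
    KD.*,
    -- Fall back to the sentinel key when no dimension row matches
    ISNULL(DA.AppKey, -1) AS AppKey
FROM dbo.kaggle_data AS KD
LEFT JOIN dbo.DimApplication AS DA
    ON DA.AppName = KD.AppName;

The same pattern repeats for each remaining dimension: one LEFT JOIN per lookup, with ISNULL guarding against unmatched rows.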
If you were looking to use SSIS, I suggest you read the excellent Stairway to Integration Services series by Andy Leonard. In short, you'll pull your base kaggle_data table into a Data Flow Task via an OLE DB Source component, and then use a series of Lookup components to augment or enrich your data flow with the keys from the dimension tables. Finally, you'd land that data in an OLE DB Destination (not an OLE DB Command).
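The reference query for each Lookup component is typically just the business key and surrogate key from the dimension, e.g. (reusing the assumed DimApplication naming from above):

SELECT AppKey, AppName
FROM dbo.DimApplication;

Rows that find no match can be redirected to the Lookup's no-match output and assigned the sentinel key there.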