Database Design help

database-designdatabase-recommendation

I am building an application for comparing two or more songs. We have millions of songs. Currently we have access to raw data files (form a company like soundcloud) consisting of

1) Daily Users.csv with new signups

2) Daily Tracks.csv with new songs released

3) Daily streams.csv with today's streams.

For my application, the input will be two or more songs which has to be compared. The songs will be compared based on the data of users who listen to the songs.
For the database design, I am thinking of relational database with

1) One Table for for users data

2) One table for tracks data.

3) Millions of tables; one for each song. The reason for this is that on query, it will be very easy to pick up relevant tables and compute statistics for each table.

Moreover, for a particular song, for a particular user, we want to store ordered time series data of the source of stream ( collection, album page, search). I am thinking of storing it as a json in Stream Source field in Song_ID tables.

enter image description here

Is it a good relational database design? Or should I switch to non relational databases? In that case which database would you suggest?

Best Answer

If you wish to compute statistical data a Relational Database is generally better than a non-relational as all the links for your statistics will be inherent in your table design. Millions of rows will require watching your performance and adjusting appropriately but is not in the danger zone architecturally speaking, and since your data is being inserted once a day that gives you a reliable window for any index rebuilds, table statistic updates, etc...

However, do not create millions of tables for each song, that is a very bad idea. You'll instead want to create a series of tables representing the relations between your Users and the Songs that you will pull the statistics from. Based on your examples I would just add a single linking table between Users and Tracks instead of your millions of tables, it should look something like this:

CREATE TABLE UserTracks
(
    userID INT
    ,trackID INT
    ,streamSource NVARCHAR(100) 
    ,CONSTRAINT FK_UserTracks_Users FOREIGN KEY (userID) REFERENCES Users (userID)
    ,CONSTRAINT FK_UserTracks_Tracks FOREIGN KEY (trackID) REFERENCES Tracks (trackID)
)
GO
CREATE CLUSTERED INDEX CI_UserTracks ON dbo.UserTracks (userID,trackID)
GO

Since it's a pure joining table between your other two tables creating the Clustered Index on the join columns that would be present most of the time should help with performance. If for some statistics you only need to do half the join (for example Users per track or Tracks per User) you'll probably want to add two more indexes like the following:

GO
CREATE INDEX NCI_UserTracks_userID_streamSource ON dbo.UserTracks (userID) INCLUDE (streamSource)
GO
CREATE INDEX NCI_UserTracks_trackID_streamSource ON dbo.UserTracks (trackID) INCLUDE (streamSource)
GO

Those should cover most of your potential performance problems right off the bat. Remember, clean INNER JOINs are your friend, make your structures do most of the work for you, avoid trying to piece the data together yourself via complex dynamic queries.