PostgreSQL Optimization – Best Way to Store Large Lists of User Statistics

optimizationpostgresql

I am trying to figure out the best way to store user statistics data in a table. For each user property, there can be anywhere from one to a couple thousand integer values (realistically, these would would average to a number in the hundreds). I know that PostgreSQL allows columns to contain integer[] values, but is this really the best solution when I could potentially be storing a a couple hundred or a thousand integers in one cell?

Here is example table with a few of the user properties I might store in it:

| UserStatistics                                                                                      |
|-----------------------------------------------------------------------------------------------------|
| user_id | user_song_likes | user_album_likes | user_artist_likes | user_followers | user_following  |
|-----------------------------------------------------------------------------------------------------|
| 1       | [12,56,34,11,7] | [1,154,32,33,84] | [5,14,54,43,10,2] | [55,63,23,2,4] | [43,78,2,4,22]  |
|-----------------------------------------------------------------------------------------------------|
| 2       | [1,2,3,4,87,89] | [62,71,4,6,12,2] | [13,17,35,76,1,2] | [91,17,7,87,9] | [21,17,79,98,1] |
|-----------------------------------------------------------------------------------------------------|
| 3       | [45,53,42,19,2] | [7,14,144,67,93] | [13,14,2,23,3,73] | [11,56,5,12,1] | [34,91,23,10,5] |
|-----------------------------------------------------------------------------------------------------|

In this example, the user_song_likes array will realistically hold hundreds of values (I scaled down the array for the sake of this example). So will the user_album_likes array. user_artist_likes will most likely come close to a hundred values, but it will definitely surpass that for some users. user_followers will average a between 20 and 80 values, but there are going to be a few select users with tens of thousands of followers. user_following is going to average between 100 and 2,000 values for almost every user.

All this considered, I think it would be very inefficient to store user statistics with integer[] columns. What are my alternative options? What would be a reasonably efficient way to store this data?

Best Answer

As others have mentioned, I'd look to normalize the data in your UserStatistics table. You could do something stupid simple as a generic Likes table and a Follows table (don't take my poor names, I'm sure you can come up with better ones :).

The Likes table would have its own primary key, and a foreign key field referencing user_id, and it could have a column called like_id and a column called like_type. The like_id would store one row per ID from the user_song_likes, user_album_likes, and user_artist_likes columns of your previous schema, and the like_type would represent if it's of type song, album, or artist.

You could then implement the same idea for your Follows table (e.g. store one row per ID of each follower and following in a generic follow_id column with a follow_type column to determine which is which). Using a more normalized schema (even though this one is generic to keep things simple) is a more relational form of storing your data and will be more efficient for querying.