PostgreSQL Optimization – Best Way to Store Large Lists of User Statistics

optimizationpostgresql

I am trying to figure out the best way to store user statistics data in a table. For each user property, there can be anywhere from one to a couple thousand integer values (realistically, these would would average to a number in the hundreds). I know that PostgreSQL allows columns to contain integer[] values, but is this really the best solution when I could potentially be storing a a couple hundred or a thousand integers in one cell?

Here is example table with a few of the user properties I might store in it:

| UserStatistics                                                                                      |
|-----------------------------------------------------------------------------------------------------|
| user_id | user_song_likes | user_album_likes | user_artist_likes | user_followers | user_following  |
|-----------------------------------------------------------------------------------------------------|
| 1       | [12,56,34,11,7] | [1,154,32,33,84] | [5,14,54,43,10,2] | [55,63,23,2,4] | [43,78,2,4,22]  |
|-----------------------------------------------------------------------------------------------------|
| 2       | [1,2,3,4,87,89] | [62,71,4,6,12,2] | [13,17,35,76,1,2] | [91,17,7,87,9] | [21,17,79,98,1] |
|-----------------------------------------------------------------------------------------------------|
| 3       | [45,53,42,19,2] | [7,14,144,67,93] | [13,14,2,23,3,73] | [11,56,5,12,1] | [34,91,23,10,5] |
|-----------------------------------------------------------------------------------------------------|

In this example, the user_song_likes array will realistically hold hundreds of values (I scaled down the array for the sake of this example). So will the user_album_likes array. user_artist_likes will most likely come close to a hundred values, but it will definitely surpass that for some users. user_followers will average a between 20 and 80 values, but there are going to be a few select users with tens of thousands of followers. user_following is going to average between 100 and 2,000 values for almost every user.

All this considered, I think it would be very inefficient to store user statistics with integer[] columns. What are my alternative options? What would be a reasonably efficient way to store this data?

Best Answer

As others have mentioned, I'd look to normalize the data in your UserStatistics table. You could do something stupid simple as a generic Likes table and a Follows table (don't take my poor names, I'm sure you can come up with better ones :).

The Likes table would have its own primary key, and a foreign key field referencing user_id, and it could have a column called like_id and a column called like_type. The like_id would store one row per ID from the user_song_likes, user_album_likes, and user_artist_likes columns of your previous schema, and the like_type would represent if it's of type song, album, or artist.

You could then implement the same idea for your Follows table (e.g. store one row per ID of each follower and following in a generic follow_id column with a follow_type column to determine which is which). Using a more normalized schema (even though this one is generic to keep things simple) is a more relational form of storing your data and will be more efficient for querying.

Related Solutions

Mongodb – Schema design for privacy settings in MongoDB

It's important to remember that since there are multiple ways to design schema in MongoDB for a given set of data, it's critical to consider the types of reads and writes you will be making from your application.

Assuming that you will be querying for a subset of documents that a given user can see, it's likely that you will want to structure the document collection in a way that will give you the ability to query for all "visible" documents in a single query.

Document structure such as:

{ _id: ...,
  public: true,
  groups: ['group1', 'group2',...],
  users: ['user1','user2',...],
}

You could now query for all documents visible to user X if you have a list of groups user X belongs to by querying:

db.documents.find( { $or : [
     { public : true },
     { users  : X },
     { groups : { $in : [list-of-X's-groups] } }
] } )

Postgresql – postgres single rows vs array

Regardless of EC2,

do what makes sense for the data
don't pre-optimize

Columns are not for similar-data that serve the same purpose. If you're writing entries use rows. If you're describing them use columns. Let aggregates do the voodoo of taking the data from the rows and rolling it up to something useful.

Now for your application, if I was doing GPS stuff, I would not be using separate columns for lat and long but instead I would use

PostGIS and the GEOGRAPHY type,
or, if not possible, the point(long, lat) (at least it supports <@> (distance)) as well as some other useful operators and you can upgrade later

Best Answer

Related Solutions

Mongodb – Schema design for privacy settings in MongoDB

Postgresql – postgres single rows vs array

Related Question