Database Design – Storing Multi-Select Data with Single-Select Data

database-design

I've got a database with some multi-select fields. To keep it simple, let's just say those fields are MatchGender and Interests. I want users to be able to find people who match (hence Gender is a multi-select; they can choose to match against one or both genders) and interests/hobbies. Along with those fields, there are several "single-select" fields such as the user's city, state, their gender, etc…

I'm trying to figure out the best way to store this data so it's easily searchable. Do I store the multi-select data in a single field that is comma delimited? Do I store it in a separate table with a UserID? Pros/Cons of each?

By the way, this is a SQL Server back-end database that will be accessed through an ASP web interface.

Best Answer

Specific answers

First of all, here is my personal approach to your specific questions:

Do I store the multi-select data in a single field that is comma delimited?

No, I do not recommend you to proceed that way. Since you have the chance to make use of a relational database management system (RDBMS), I personally consider that it would be highly convenient and reasonable to design and implement your database in a relational fashion. I am convinced that there is no need to use a format as limited as CSV when one can take advantage of the robustness of relational tools. If you opt for the CSV method, you would be adding unnecessary complexity to your project.

Pros. CSV format has its use in certain contexts, e. g., supplying information interchange between applications that handle data in incompatible formats. Another feature of CSV is that it is easily modified by spreadsheet aplications such as MS Excel or LibreOffice Calc. Anyway, due to the fact that you are going to use a RDBMS, I do not think any of these facts is relevant in terms of database design or storage in a relational platform. If you come across the need to deliver some of your data in CSV format in the future, then you can export your structured data to a comma delimited file without undermining your database soundness.
Cons. As you may already know, by storing CSV data in a single column, you would be breaking First Normal Form. This way, having unnormalized data (nonsimple or multivalued columns), you lose many of the essential technical advantages that are provided by RDBMS which are firmly grounded in relational theory. Since many people have rightly enumerated the abundant problems that arise by storing data in CSV format in an RDBMS, I will only mention that you would be putting at risk the data integrity (which is widely considered one of the most important properties of a database), the performance of your system would be substantially diminished and you will end up re-implementing in your own application many of the capabilities that RDBMS offer natively, which have been vastly optimized over the years.

Do I store it in a separate table with a UserID?

Yes, to me that is the best approach and can be easily implemented in a relational system by means of normalization.

As I understand your specifications, MatchGender represents the gender of potential partners with which a given User is interested in getting in touch. So, the assertions a User seeks one-to-many MatchGenders and a MatchGender is sought by one-to-many Users are business rules of particular importance in this specific context.

If the above is correct, then there is a many-to-many relationship between User and MatchGender, and this kind of scenarios are very common in relational database development. I will model this aspect and offer an implementation example below, in my suggestion.

Pros. You can expect predictable results since your database will be built on relational principles. You can take full advantage of RDBMS capabilities, such as aggregate functions, JOINs, referential integrity, etc. You can prevent data redundancy and full table scans. Your database structure will be more flexible for future changes or extensions. You can deliver database performance that largely meets the needs of your clients. Your data retrieval will be straightforward.
Cons. At first glance, the relational approach may seem a bit cumbersome, since it requires the database designers and programmers to be more disciplined but, as the project grows, that discipline will pay off.

Other relevant aspects of your scenario

I want to describe some other points about my understanding of your situation. This way, I think that the statements a User pursues one-to-many Hobbies and a Hobby is pursued by one-to-many Users are important business rules as well. So, there is also a many-to-many relationship between the entities User and Hobby, and this also has to be modelled and implemented.

I assume that you will have a set of Hobbies that will be inserted into the database by some sort of system administrator. If so, a HobbyCode that uniquely identifies a given occurence of a Hobby may be defined and you can set it as the meaningful PRIMARY KEY of such table.

Modeling and implementation suggestion

Therefore, to deal with your situation, I propose the following IDEF1X data model:

Users Interests DM

And then, the next DDL can be derived and physically implemented with ease from such data model:

CREATE TABLE Gender
(
    GenderCode      CHAR(1)  NOT NULL,
    Name            CHAR(11) NOT NULL,
    CreatedDateTime DATETIME NOT NULL,
    CONSTRAINT PK_Gender PRIMARY KEY CLUSTERED 
    (
        GenderCode ASC 
    ),
    CONSTRAINT UIX_Gender_Name UNIQUE
    (
        Name
    )
)

CREATE TABLE Hobby
(
    HobbyCode       CHAR(2)  NOT NULL,
    Name            CHAR(30) NOT NULL,
    Descriptionn    CHAR(90) NOT NULL,
    CreatedDateTime DATETIME NOT NULL,
    CONSTRAINT PK_Hobby PRIMARY KEY CLUSTERED 
    ( 
        HobbyCode ASC 
    ),
    CONSTRAINT UIX_Hobby_Name UNIQUE
    (
        Name
    )
)

CREATE TABLE UserProfile
(
    UserId          INT      NOT NULL,
    FirstName       CHAR(30) NOT NULL,
    LastName        CHAR(30) NOT NULL,
    BirthDate       DATETIME NOT NULL,
    GenderCode      CHAR(1)  NOT NULL,
    CreatedDateTime DATETIME NOT NULL,
    CONSTRAINT PK_UserProfile PRIMARY KEY CLUSTERED 
    (
        UserId ASC
    ),
    CONSTRAINT UIX_UserProfile_FirstName_LastName_and_BirthDate UNIQUE
    (
        FirstName ASC,
        LastName  ASC,
        BirthDate ASC
    ),
    CONSTRAINT FK_UserProfile_Gender FOREIGN KEY 
    (
        GenderCode
    ) 
    REFERENCES Gender (GenderCode)
)

CREATE TABLE UserMatchGender
(
    UserId          INT      NOT NULL,
    GenderCode      CHAR(1)  NOT NULL,
    CreatedDateTime DATETIME NOT NULL,
    CONSTRAINT PK_UserMatchGender PRIMARY KEY CLUSTERED 
    (
        UserId     ASC,
        GenderCode ASC
    ),
    CONSTRAINT FK_UserMatchGender_UserProfile FOREIGN KEY 
    (
        UserId
    )
    REFERENCES UserProfile (UserId),
    CONSTRAINT FK_UserMatchGender_Gender FOREIGN KEY 
    (
        GenderCode
    )
    REFERENCES Gender (GenderCode)
)

CREATE TABLE UserHobby
(
    UserId          INT      NOT NULL,
    HobbyCode       CHAR(2)  NOT NULL,
    CreatedDateTime DATETIME NOT NULL,
    CONSTRAINT PK_UserHobby PRIMARY KEY CLUSTERED 
    (
        UserId    ASC,
        HobbyCode ASC
    ),
    CONSTRAINT FK_UserHobby_UserProfile FOREIGN KEY 
    (
        UserId
    )
    REFERENCES UserProfile (UserId),
    CONSTRAINT FK_UserHobby_Hobby FOREIGN KEY 
    (
        HobbyCode
    )
    REFERENCES Hobby (HobbyCode)
)

Related Solutions

PostgreSQL Database Design – Need Help Designing a Database

If you have an index on LastLoggedIn, deleting a few tens of thousands of records should be fast without needing to partition. I ran a small test on MS SQL:

CREATE TABLE Temp.Temp
(
UserID  INT NOT NULL PRIMARY KEY,
LastLoggedIn    DATETIME2(0) NOT NULL INDEX,
Dummy1  VARCHAR(30),
Dummy2  INT
)

I insert ~57 M dummy records, then selected a LastLoggedIn date such that 35,000 records were at least that old. Deleting these records took less than a second. This is on a four-core machine with 128 GB of memory, under light load; your mileage may vary.

Of course, if you run this purge only every few weeks, you may have millions of records to delete rather than thousands (300 K records took seven seconds for me; 1.3 M records took 30 seconds). But in that scenario, it's part of your monthly maintenance window rather than daily upkeep, and a brief delay should be acceptable.

If you get poor performance with just an index, then sure, consider partitioning. As others have pointed out, there will be a significant cost to UPDATE operations, a greater overhead than maintaining an index. You will have many more updates than purges, and the latter can happen during relatively idle periods, so I would worry about the updates more.

If you need always-on uptime, and the delete locks the table for an unacceptable length of time, that might justify using partitioning, but even there I'd look for alternatives. You could SELECT a list of old users and then delete them one at a time (confirming their LastLoggedIn dates first of course) with a cursor, for example.

I'm an MS SQL guy, so my apologies if any of this is incorrect for PostgreSQL.

Mysql – Storing FTSE stock data

What I would do is to create a table with the stock_id (that can be the alphanumeric code or a integer), the timestamp of the measurement and the current value. That is your entry data, 3 columns.

From that point you can add columns for calculations (the difference absolute or percent) with the previous value. Having all in the same table will simplify the model and ease your queries. Try to create a date (not timestamp) column and create a partition by it. It may lighten a bit the access to the table as long as you set it in your queries.