Joining on a M-N lookup table vs keyword searching a single column

performancequery-performanceschema

I am in the process of writing an image gallery webapp similar to that of Danbooru, and in the process, decided to take a look at danbooru's database schema for an idea about how things are structured. In doing so, I've noticed that the 'uploads' table contains a column named tag_string, which I'm guessing contains comma-separated tags. There's also a tags table which (again, by the looks of it) contains metadata about individual tags.

This got me thinking about the practicality of my initial design whereby I have a M-N lookup table specifically for mapping N individual tags to M different files. My design would use SQL Join to get a result set containing all of the images I'm looking for, while I can forsee danbooru's solution using a search on the text column containing the list of tags associated with that image.

My lookup table could concievably get very large (in terms of # of rows) if every file contains upwards of 20 or more tags, whereas the other solution only extends the size of the files table by one column.

With these differences in mind, what are the pros and cons of the different implementations?

For reference, here is an incomplete (but contains enough information regarding this question) design of my schema.

Best Answer

It is best to store your tags in a separate table because of the many to many relationship. With something like tagging images you can end up with a very large number of comma separated values that would need to be parsed any time you search your database which will lead to massive performance problems as your data grows.

Going that route and searching for a tag or partial tag will require you to parse through multiple strings multiple times. But if all you are doing is displaying the data to be read then comma separation is a fantastic idea and it would be much faster than joining to another table.

Here is great explanation on why, in general, comma separation is not a good practice.

--EDIT

After looking through the githuh link you have posted and the web site, I think Danbooru is doing a combination of two things. If I had to quickly guess, they are keeping the comma separated values of tags to make it easy and fast to view the text upon a mouse hover. They also seem to be inserting the tags into a well structured set of tables. It may be nice to keep that in mind for your practices but I would recommend leaving that out if you can.

Related Solutions

SQL Server – Is Table Partitioning Improving Performance?

Using partitioning is only going to help your query performance if the partitioning scheme is built to serve your specific queries.

You're going to have to review your query patterns and see how they are accessing the table in order to identify the best approach. The reason for this is you can only partition on a single column (the partitioning key) and this is what will be used for partition elimination.

There are two factors that affect if partition elimination can occur and how well it will perform:

Partition Key - Partitioning can only occur on a single column and your query must include that column. For example, if your table is partitioned on date and your query uses that date column, then partition elimination should occur. However, if you don’t include the partition key within the query predicate, the engine can not perform elimination.
Granularity - If your partitions are to big, you won’t gain any benefit from elimination because it will still pull back more data than it needs to. However, make it to small and it becomes difficult to manage.

In many ways, partitioning is just like using any other index, with some added benefits. However, you don’t realize those benefits unless you're dealing with incredibly large tables. Personally, I don't even consider partitioning until my table is over 250 GB in size. Most of the time, well defined indexing will cover many of the use cases on tables smaller than that. Based on your description, you're not seeing huge data growth, so it could be that a properly index table will perform just fine for your table.

I would strongly recommend that you review whether partitioning is actually necessary to solve your issues. One would usually partition a very large table for the purpose of:

Distributing data between different types of disk so that more “active” data can be placed on faster, more expensive storage while less active data is placed on cheaper, slower storage. This is mostly a cost savings measure.
Assisting in index maintenance for extremely large tables. Since you can rebuild partitions individually, this can assist in keeping indexes properly maintained with minimal impact.
Leveraging partitioning for improved archival process. See sliding windows.

Database Design – Schema for Storing Historical Data on Shared Identifiers

Shouldn't be too difficult, the below assumes you are using SQL Server. My design would be to have a table called "CurrentLogOn", which would look like:

LogOnID (PK - AutoIncrementKey)
UserName
IPAddress
LogOnTime (DateTime, default to GETDATE())
LogOffTime (DateTime, no default)

You would then want a second table called "HistoricalLogOn", which would look similar to the first. The reason to separate the tables would be so CurrentLogOn table operates with the smallest number of records for performance reasons.

HistoricalLogOnID (PK - AutoIncrementKey)
LogOnID
UserName
IPAddress
LogOnTime (DateTime, default to GETDATE())
LogOffTime (DateTime, no default)

I would then place a trigger on the CurrentLogOn table. The trigger would be an INSERT trigger (would trigger when a record is inserted) and would look like:

CREATE TRIGGER UpdateUserLogOn
AFTER INSERT
AS
UPDATE CurrentLogOn
SET LogOffTime = GETDATE()
WHERE UserName = (SELECT UserName FROM Inserted)
AND LogOffTime IS NULL
END

This should update the "old" logon every time the user logs in again. Note that this assumes a record is added to the table each time a user logs on.

Lastly, at the end of the day or at some point during the day, you will want to move all "old" data into the history table. To do so you could run a job that does two operations:

INSERT INTO HistoricalLogOn (LogOnID, UserName, IPAddress, LogOnTime, LogOffTime)
SELECT LogOnID, UserName, IPAddress, LogOnTime, LogOffTime
FROM CurrentLogOn
WHERE LogOffTime IS NOT NULL;
GO
DELETE FROM CurrentLogOn
WHERE LogOffTime IS NOT NULL;
GO

This last part will insert the old records into the history table and subsequently delete them from the Current table.

Best Answer

Related Solutions

SQL Server – Is Table Partitioning Improving Performance?

Database Design – Schema for Storing Historical Data on Shared Identifiers

Related Question