PostgreSQL – Using GIN Index on Integer Array for Tag Sets

arraydatabase-designindexpostgresql

In PostgreSQL, what are the pros and cons of using a GIN index on a column of type integer[] to model sets of tags, instead of doing a relation table?

I am trying to design a simple data model with a table called post where each post can have a set of tags, similarly to Instagram or StackExchange.
It seems that I would be able to search for a post that contains a set of tags without having to join two tables, but I won't be able to easily get the list of tag strings for one specific post.

Best Answer

I think you already more or less answered your own question.

While a GIN index can speed up searching inside an array quite significantly, you have to be aware that relational database engines are designed to do joins very efficiently. Also, when you need the actual tags, you will have to join the tag table regardless of you use the array approach or not.

Furthermore, storing the tags as an array of integer makes enforcing referential integrity (that it is impossible to insert a nonexistent tag ID into the array) quite complicated. The only way I can imagine this is creating a trigger, which never makes changing the data faster.

The best way to choose among these alternatives is, though, to create a sufficiently big amount of test data and measure the performance yourself.

Related Solutions

Mysql – Need advice for a ‘tag’ table design

I would go for option 1 (two tables).

If you only plan to have 1000 content items (and in 3 years, not now), I don't think you will have many performance problems unless each content item is tagged with a dozen tags where most are unique to that item. And even then I don't think you'll see much of a performance problems. It also simplifies your structure if you have 1 tag table for every content type and then you decide to add/remove/change content types later.

If you're really not sure, do you have the time/resources to build two sample DBs, one based on each structure, populate them with what you think is a realistic load (you can write a script/program to generate & insert test data, right? with "boring" tags and content such as "content_1..content_2000 and tag_1... tag_5000) and actually test which performs better?

Database Design – Poor Man’s Referential Integrity Schema Design Pattern

Your design looks a bit like the "supertype/subtype" pattern. Search for that and for "table inheritance". It needs quite a lot of work to be able to enforce integrity constraints though.

You are missing a generic Fruit table (that's the "supertype") and a FruitType table to store the alllowed fruit types:

FruitType 
    fruit_type PK

Fruit
    fruit_type PK, FK -> FruitType (fruit_type)
    fruit_id   PK

Then the 3 (or 4 or more) tables would be (the "subtype" tables):

Apple
    fruit_type 
    fruit_id PK
    (fruit_type, fruit_id) FK -> Fruit (fruit_type, fruit_id)
    CHECK (fruit_type = 'Apple')

Banana 
    fruit_type PK
    fruit_id PK
    (fruit_type, fruit_id) FK -> Fruit (fruit_type, fruit_id)
    CHECK (fruit_type = 'Banana')

Orange
    fruit_type PK
    fruit_id PK
    (fruit_type, fruit_id) FK -> Fruit (fruit_type, fruit_id)
    CHECK (fruit_type = 'Orange')

And any other table can reference the Fruit table:

FruitPack 
    fruitpack_id PK 
    destination

FruitPackFruits 
    fruitpack_id FK -> FruitPack (fruitpack_id)
    fruit_id     
    fruit_type
    (fruit_type, fruit_id) FK -> Fruit (fruit_type, fruit_id)

It doesn't look very nice and one column in every "fruit" table seems redundant as it has one and only one allowed value. And every time you need to add a new fruit (say Cherry), you have to add a row in the table FruitType and a new table (Cherry), similar to the other ones. So, it works better if your design is more or less stable. If you find that you may need to add a new "fruit" every few days or if you have a thousand (or more!) different fruits, it's not the best way.

On the other hand, it enforces integrity and you can't insert cherries into the Bananas or oranges into the Apples.

Best Answer

Related Solutions

Mysql – Need advice for a ‘tag’ table design

Database Design – Poor Man’s Referential Integrity Schema Design Pattern

Related Question