Postgresql – capability and performance of tags implemented as array vs text vs full text

arrayfull-text-searchpostgresql

In terms of search capability and performance, what are the pros & cons of implementing tags as an array or text or full text field?

I am seeing limitations of capability in choosing to implement tags with an array, because it seems to me that if an array contains ['juicy fruit'] then someone searching for 'fruit' will not find that record (for example, tags && ARRAY['fruit'] will not find it). The person who created that record would have had to enter something more like ['juicy fruit', 'juicy', 'fruit'] to have their record found with just a search for 'fruit' or 'juicy'. Whereas if I implement tags as text, then a search for 'fruit' will find 'juicy fruit', and further still, if I implement tags as full text then I will find 'juicy fruit' when searching with the string 'fruits' (plural). Furthermore, I think there will be no performance loss in going for full text search. Thoughts?

But maybe the whole point of tags is exact match.

Best Answer

You can get search functionality with arrays by using the ANY/SOME or generate_subscripts functions. However, the PostgreSQL documentation notes the following:

Tip: Arrays are not sets; searching for specific array elements can be a sign of database misdesign. Consider using a separate table with a row for each item that would be an array element. This will be easier to search, and is likely to scale better for a large number of elements.

So it looks like you're on the right track, because you're correct about not losing performance (you would more likely gain it) and the other methods allow greater flexibility. Tags themselves are about exact matches, but most sites have a way to search tags too (for instance, this very site).

If you implement tag storage as full text, that doesn't preclude matching, including regex, with simpler expressions such as LIKE/ILIKE. A common scenario is to store documents (the chunks of text being searched) as text or varying character and store the tsvector type necessary for full-text search in a separate column. See the PostgreSQL documentation for more on creating a full text index.

The pros and cons should then be evident: using an array limits scalability and search options because it lacks text indexing, stemming and inflection support, but is easy to use when an exact match is desired. Using a text datatype is more flexible when searching for strings. Using a text datatype with a generated tsvector column provides the most flexibility and good performance in all scenarios, but requires more storage.