Postgresql – How to implement full-text search on a single value in a text[] column

arraydatabase-designfull-text-searchperformancepostgresql

I am using Postgresql 9.3 on an Ubuntu 12.4 machine.

I've only recently started studying database and database design, and I am to create a database that includes text that will later be searched through. I've read up on tsvectors and stuff better than WHERE column LIKE 'query', but I don't have much experience with it. I am thinking of building a table where one column is of type text[], and another one will be the tsvector build from the text. I will use and array of text because some of the text will be displayed to the user with different formatting etc. I have never worked with array columns before. My questions are:

  • Is using a text array a good idea at all in this case?
  • Will the text array work well with the tsvectors?
  • Will using text[] compromise my performance?
  • Are there any pitfalls I should stay alert for?

Thanks for the attention, and this is my first post ever here, so if I messed up on this question do tell me. P.S.: I've searched this topic here beforehand, I hope I am not creating a redundant question. P.P.S.: Not a native speaker, sorry for the weird language.

EDIT: Thanks for the answers! The texts in the array won't be of fixed size. They will range from 6 ~ 20 lines, but they need to be editable, and some of the texts inside an array will be formatted in a different manner when displayed to the user. I'm not using multiple text columns because some of the entries in the table will have a single, two, or more elements inside the array. I know how each element will be formatted because every row will also have a "type" column, which will define it's formatting.

Best Answer

There are a lot of problems with using text[] like this, but fundamentally I agree @jjanes when he says,

How will you know which member of the array gets what kind of formatting? If that logic is hard-coded into the position of the member in the array, then why not just have multiple columns and key the formatting to the column names?

You can keep this more direct and simpler by following a golden rule,

  • If the elements in the array can not be shuffled, then the order matters and the elements are ordinal
  • If the order matters, you should not be using an ARRAY type, regardless of Full Text Search.

Moreover, the to_tsvector call is a pretty bad idea, this design is problematic too. ARRAY is not for schema-less design, and the position in the array should not itself be information need to render the data in the app.

As a side note, in a normalized display you can do weighing/ranking with FTS. In order to that with your schema, you'd have to move from the schemaless abuse of text[] to an actual schema'd design anyway.

Abomination

If you decide to go down this route, this should be substantially faster and cleaner. But, this is a horrible idea.

CREATE AGGREGATE tsvector_agg (tsvector) (
  SFUNC = tsvector_concat,
  STYPE = tsvector
);

CREATE OR REPLACE FUNCTION text_array_to_tsvector( mytext text[], out tsv tsvector )
AS $func$
  BEGIN
    SELECT INTO tsv
      tsvector_agg(to_tsvector(t))
    FROM unnest(mytext) AS t;
    RETURN;
  END;
$func$ LANGUAGE plpgsql
IMMUTABLE;

Compare the two,

EXPLAIN ANALYZE SELECT to_tsvector(array_to_string(ARRAY['foo', 'bar', 'baz'],' '::text));
EXPLAIN ANALYZE SELECT text_array_to_tsvector(ARRAY['foo', 'bar', 'baz']);

See also