PostgreSQL Performance – How to Store Short Stories for Access to Individual Sentences

database-designfull-text-searchperformancepostgresqlpostgresql-performancerails

I am building a database for the first time ever (using PostgreSQL), and am very conflicted over the most efficient/logical way of storing a body of text (aka, a story). The conflict stems from the fact that the user will access text bodies in two ways:

1) access the entire body of the story on click of the story name.

2) the user can input a word or phrase into a search bar, which will return all sentences (not the whole stories) in which the word/phrase is found (meaning that it could potentially return many sentences from many stories).

There will be a great ("infinite") number of stories, and about 40 sentences per story, although it is free text so some stories will contain a few hundred sentences.

My initial DB design was to have a Story model (I'm using Ruby on Rails) with a story_id, story_title, and author_id_fk, and then to have a Storyline model with storyline_id, storyline, and story_id_fk.

However, I'm now doubting myself and think that maybe the best way to do it is to slap the body of the story onto a 4th column in the Story model called story_text, where I will store an array of strings (aka, the original text parsed into its corresponding sentences), and then the Storyline model can either simply not exist (in which case the appropriate item from the array would be called when needed – less normal, but also perhaps more efficient..?), or to keep the Storyline model, but have it contain a reference to the appropriate storyline as opposed to the actual text itself.

Any thoughts or suggestions would be much appreciated!

Best Answer

The best solution heavily depends on the definition of "best". You may be looking for simplicity, reliability, performance or disk space optimization or other things.

Specifications

infinite amount of stories, and about ~40 sentences per story

Few sentences, possibly many stories.

user can input a word or phrase

We need phrase search. (Maybe across sentence boundaries?)

Also assuming:

  • "Sentences" are substrings simply separated by dot and space('. ' ).
  • Resources (RAM, disk space) may be a limiting factor.
  • Retrieving the whole body and searching for individual sentences are equally frequent tasks.

Solution

Postgres 9.6 introduced exactly what you need. The release notes:

Full-text search can now search for phrases (multiple adjacent words)

The latest Postgres release and a text search index will go a long way for performance in either case.

You already considered to:

slap the body of the story onto a 4th column in the Story model

But an array adds 24 bytes + 1 or more bytes per sentence and makes indexing much more complicated. (As well as pretty much all other operations.) I would exclude the array solution.

An additional storyline table with one row per sentence makes indexing simple again, but adds ~ 40 bytes per sentence. Finding and retrieving sentences is simple. But all write operations are more complicated.

About storage size in Postgres:

Consider a single table with body as plain text column. Splitting the string into sentences on the fly is fast (with a simple definition of "sentence"). Smaller total storage size may be more important for performance with big tables (fewer pages to fetch). And we only need to index one column. In short:

  • dead simple storage and handling
  • minimum disk space
  • still a hot contender for best performance

Table:

CREATE TABLE story (
  story_id  serial PRIMARY KEY
, author_id int REFERENCES author  -- NOT NULL?
, title text                       -- NOT NULL?
, body  text                       -- NOT NULL?
);

FTS index, assuming English language:

CREATE INDEX story_body_english_fts_idx ON story USING GIN (to_tsvector('english', body));

Query to return whole story:

SELECT *
FROM   story
WHERE  to_tsvector('english', body) @@ phraseto_tsquery('english', 'Lewis Carroll');

Query to return sentences:

SELECT story_id, sentence_nr, sentence
FROM   story, unnest(string_to_array(body, '. ')) WITH ORDINALITY x(sentence, sentence_nr)
WHERE  to_tsvector('english', body)     @@ phraseto_tsquery('english', 'Lewis Carroll')
AND    to_tsvector('english', sentence) @@ phraseto_tsquery('english', 'Lewis Carroll');

The query demonstrates a phrase example. But it works for single words as well.

The repeated WHERE condition is logically redundant but necessary to use the index on body. This retrieves only stories with qualifying sentences, then the recheck on unnested sentences (repeating the same condition) is cheap.

About unnest(string_to_array(body, '. ')) WITH ORDINALITY:

Note: The check on the whole text body can find phrases across sentence boundaries - unlike searching individual sentences. But the recheck on sentence excludes such matches. You can fine-tune one or the other way ...

There are many more options with FTS: prefix matching, other languages, no language (no stemming or thesaurus), phrases with words in between ...