I am building a database for the first time ever (using PostgreSQL), and am very conflicted over the most efficient/logical way of storing a body of text (aka, a story). The conflict stems from the fact that the user will access text bodies in two ways:
1) access the entire body of the story on click of the story name.
2) the user can input a word or phrase into a search bar, which will return all sentences (not the whole stories) in which the word/phrase is found (meaning that it could potentially return many sentences from many stories).
There will be a great ("infinite") number of stories, and about 40 sentences per story, although it is free text so some stories will contain a few hundred sentences.
My initial DB design was to have a Story model (I'm using Ruby on Rails) with a story_id
, story_title
, and author_id_fk
, and then to have a Storyline model with storyline_id
, storyline
, and story_id_fk
.
However, I'm now doubting myself and think that maybe the best way to do it is to slap the body of the story onto a 4th column in the Story model called story_text
, where I will store an array of strings (aka, the original text parsed into its corresponding sentences), and then the Storyline model can either simply not exist (in which case the appropriate item from the array would be called when needed – less normal, but also perhaps more efficient..?), or to keep the Storyline model, but have it contain a reference to the appropriate storyline as opposed to the actual text itself.
Any thoughts or suggestions would be much appreciated!
Best Answer
The best solution heavily depends on the definition of "best". You may be looking for simplicity, reliability, performance or disk space optimization or other things.
Specifications
Few sentences, possibly many stories.
We need phrase search. (Maybe across sentence boundaries?)
Also assuming:
'. '
).Solution
Postgres 9.6 introduced exactly what you need. The release notes:
The latest Postgres release and a text search index will go a long way for performance in either case.
You already considered to:
But an array adds 24 bytes + 1 or more bytes per sentence and makes indexing much more complicated. (As well as pretty much all other operations.) I would exclude the array solution.
An additional
storyline
table with one row per sentence makes indexing simple again, but adds ~ 40 bytes per sentence. Finding and retrieving sentences is simple. But all write operations are more complicated.About storage size in Postgres:
Consider a single table with
body
as plaintext
column. Splitting the string into sentences on the fly is fast (with a simple definition of "sentence"). Smaller total storage size may be more important for performance with big tables (fewer pages to fetch). And we only need to index one column. In short:Table:
FTS index, assuming English language:
Query to return whole story:
Query to return sentences:
The query demonstrates a phrase example. But it works for single words as well.
The repeated
WHERE
condition is logically redundant but necessary to use the index onbody
. This retrieves only stories with qualifying sentences, then the recheck on unnested sentences (repeating the same condition) is cheap.About
unnest(string_to_array(body, '. ')) WITH ORDINALITY
:Note: The check on the whole text body can find phrases across sentence boundaries - unlike searching individual sentences. But the recheck on
sentence
excludes such matches. You can fine-tune one or the other way ...There are many more options with FTS: prefix matching, other languages, no language (no stemming or thesaurus), phrases with words in between ...