Simply put: I have a Postgres column containing ordinary prose, and I would like to determine the x
most commonly used words across all rows ("word" being a group of characters delimited by spaces, but not a stop word).
I've found two solutions that nearly hit the mark:
SELECT *
FROM ts_stat($$SELECT to_tsvector('english', title) FROM item$$)
ORDER BY ndoc DESC
LIMIT 50;
This is great, except it returns word stems.
SELECT UNNEST(string_to_array(title, ' ')) AS word, COUNT(*) AS ct
FROM item
GROUP BY 1
ORDER BY 2 DESC
LIMIT 50;
This one returns the full words, but includes stop words.
For the sake of simplicity, assume the stop words can be found in a table stop_words (lowercase_stopword text PRIMARY KEY).
Can anyone help me over the line?
Best Answer
Your first query was pretty close. To avoid the unwanted stemming, create a text search configuration using a simple dictionary, which does not stem.
I suggest using a separate schema for text search objects, but that's totally optional:
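A minimal sketch of what that could look like (the schema name ts and the object names english_simple_dict / english_simple are placeholders, and STOPWORDS = english uses the built-in English stop word file rather than the stop_words table from the question):

CREATE SCHEMA ts;  -- optional home for text search objects

-- A dictionary based on the "simple" template: drops stop words, no stemming.
CREATE TEXT SEARCH DICTIONARY ts.english_simple_dict (
    TEMPLATE  = pg_catalog.simple
  , STOPWORDS = english            -- built-in English stop word file
);

-- A new configuration copied from "simple"; only asciiword tokens are
-- routed through the stop-word-aware dictionary.
CREATE TEXT SEARCH CONFIGURATION ts.english_simple (COPY = simple);
ALTER TEXT SEARCH CONFIGURATION ts.english_simple
    ALTER MAPPING FOR asciiword WITH ts.english_simple_dict;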
Then your query works, and very fast, too:
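Under those assumptions (configuration ts.english_simple from the sketch above, the same item.title column as in the question), the adapted query might look like this:

SELECT *
FROM   ts_stat($$SELECT to_tsvector('ts.english_simple', title) FROM item$$)
ORDER  BY ndoc DESC
LIMIT  50;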
dbfiddle here
This operates on lowercase words without stemming and doesn't break for non-ASCII letters.
Background
Read the chapter Simple Dictionary in the manual.
The exact definition of a "word" is a tricky matter. The default text search parser (currently it's the only one) identifies 23 different types of tokens.
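One way to list them is the built-in ts_token_type() function, for example:

SELECT tokid, alias, description
FROM   ts_token_type('default');   -- 23 rows for the default parser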
Built-in text search configurations map most of those token types to (built-in) dictionaries; in psql, \dF+ english shows the mappings of the english config. The demo above creates a new config based on the simple config, and since all English stop words are of type asciiword, we only need to map this one type to remove stop words; no stemming or anything else.
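To verify how the sketched configuration treats individual tokens, ts_debug() can be used (assuming the ts.english_simple name from above):

SELECT token, dictionaries, lexemes
FROM   ts_debug('ts.english_simple', 'The quick Foxes are jumping');
-- Stop words such as "The" and "are" produce an empty lexemes array;
-- "Foxes" and "jumping" come back lowercased but unstemmed.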