PostgreSQL – Finding the most commonly used non-stop words in a column

full-text-search · postgresql · string-manipulation

Simply put, I have a Postgres column containing ordinary prose and would like to determine the x most commonly used words ("word" being a group of characters delimited by spaces, but not a stop word) across all rows.

I've found two solutions that nearly hit the mark:

SELECT *                                       
FROM   ts_stat($$SELECT to_tsvector('english', title) FROM item$$) 
ORDER  BY ndoc DESC
LIMIT  50;

This is great, except it returns word stems.

SELECT   UNNEST(string_to_array(title, ' ')) AS word, COUNT(*) AS ct
FROM     item 
GROUP    BY 1 
ORDER    BY 2 DESC
LIMIT    50;

This one returns the full words, but includes stop words.

For the sake of simplicity: assume stop words can be found in the table stop_words (lowercase_stopword text PRIMARY KEY).
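In case it helps, a minimal sketch of that table (the sample entries are illustrative only; a real stop word list would be much longer):

```sql
CREATE TABLE stop_words (lowercase_stopword text PRIMARY KEY);

-- Illustrative entries only
INSERT INTO stop_words (lowercase_stopword)
VALUES ('the'), ('a'), ('an'), ('of'), ('and');
```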

Can anyone help me over the line?

Best Answer

Your first query was pretty close. To remove the unwanted stemming, create a text search configuration with a simple dictionary that does not stem.

I suggest using a separate schema for text search objects, but that's totally optional:

CREATE SCHEMA ts;
GRANT USAGE ON SCHEMA ts TO public;
COMMENT ON SCHEMA ts IS 'text search objects';

CREATE TEXT SEARCH DICTIONARY ts.english_simple_dict (
    TEMPLATE = pg_catalog.simple
  , STOPWORDS = english
);

CREATE TEXT SEARCH CONFIGURATION ts.english_simple (COPY = simple);
ALTER  TEXT SEARCH CONFIGURATION ts.english_simple
   ALTER MAPPING FOR asciiword WITH ts.english_simple_dict;  -- 1, 'Word, all ASCII'
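To sanity-check the new configuration, ts_debug shows how each token is classified and which dictionary handles it (a quick sketch; the sample sentence is arbitrary):

```sql
-- Stop words like "The" and "over" are consumed by the dictionary
-- (empty lexemes); other words pass through lower-cased and unstemmed.
SELECT alias, token, dictionaries, lexemes
FROM   ts_debug('ts.english_simple', 'The quick foxes jumped over lazy dogs');
```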

Then your query works, and it is very fast, too:

SELECT *                                       
FROM   ts_stat($$SELECT to_tsvector('ts.english_simple', title) FROM item$$) 
ORDER  BY ndoc DESC
LIMIT  50;

dbfiddle here

This operates with lower case words without stemming and doesn't break for non-ASCII letters.

Background

Read the chapter Simple Dictionary in the manual.

The exact definition of a "word" is a tricky matter. The default text search parser (currently it's the only one) identifies 23 different types of tokens. See:

SELECT * FROM ts_token_type('default');

Built-in text search configurations map most of those to (built-in) dictionaries. Mappings for the english config:

SELECT tt.*, m.mapdict::regdictionary AS dictionary
FROM   pg_ts_config_map m
LEFT   JOIN   ts_token_type(3722) tt ON tt.tokid = m.maptokentype
WHERE  mapcfg = 'english'::regconfig  --  'ts.english_simple'::regconfig
ORDER  BY tt.tokid;

The demo above creates a new configuration based on the built-in simple configuration. Since all English stop words are of type asciiword, mapping only this token type is enough to remove stop words; no stemming or other processing is applied.
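If you would rather not create text search objects at all, the second query from the question can be combined with the stop_words table directly via an anti-join (a sketch assuming the table from the question; note this keeps the crude space-delimited notion of "word" rather than the text search parser's tokenization):

```sql
SELECT lower(word) AS word, count(*) AS ct
FROM   item, unnest(string_to_array(title, ' ')) AS word
WHERE  NOT EXISTS (
   SELECT FROM stop_words s
   WHERE  s.lowercase_stopword = lower(word)
   )
GROUP  BY 1
ORDER  BY 2 DESC
LIMIT  50;
```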