PostgreSQL – Finding the most commonly used non-stop words in a column

full-text-search · postgresql · string-manipulation

Simply put, I have a Postgres column containing ordinary prose and would like to determine the x most commonly used words ("word" being a group of characters delimited by spaces, but not a stop word) across all rows.

I've found two solutions that nearly hit the mark:

SELECT *                                       
FROM   ts_stat($$SELECT to_tsvector('english', title) FROM item$$) 
ORDER  BY ndoc DESC
LIMIT  50;

This is great, except it returns word stems.

SELECT   UNNEST(string_to_array(title, ' ')) AS word, COUNT(*) AS ct
FROM     item 
GROUP    BY 1 
ORDER    BY 2 DESC
LIMIT    50;

This one returns the full words, but includes stop words.

For the sake of simplicity: assume stop words can be found in the table stop_words (lowercase_stopword text PRIMARY KEY).
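In case it helps, a minimal sketch of that table (the sample entries are illustrative only; a real stop word list would be much longer):

```sql
CREATE TABLE stop_words (lowercase_stopword text PRIMARY KEY);

-- Illustrative entries only
INSERT INTO stop_words (lowercase_stopword)
VALUES ('the'), ('a'), ('an'), ('of'), ('and');
```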

Can anyone help me over the line?

Best Answer

Your first query was pretty close. To remove the unwanted stemming, create a text search configuration with a simple dictionary that does not stem.

I suggest using a separate schema for text search objects, but that's totally optional:

CREATE SCHEMA ts;
GRANT USAGE ON SCHEMA ts TO public;
COMMENT ON SCHEMA ts IS 'text search objects';

CREATE TEXT SEARCH DICTIONARY ts.english_simple_dict (
    TEMPLATE = pg_catalog.simple
  , STOPWORDS = english
);

CREATE TEXT SEARCH CONFIGURATION ts.english_simple (COPY = simple);
ALTER  TEXT SEARCH CONFIGURATION ts.english_simple
   ALTER MAPPING FOR asciiword WITH ts.english_simple_dict;  -- 1, 'Word, all ASCII'
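To sanity-check the new configuration, ts_debug shows how each token is classified and which dictionary handles it (a quick sketch; the sample sentence is arbitrary):

```sql
-- Stop words like "The" and "over" are consumed by the dictionary
-- (empty lexemes); other words pass through lower-cased and unstemmed.
SELECT alias, token, dictionaries, lexemes
FROM   ts_debug('ts.english_simple', 'The quick foxes jumped over lazy dogs');
```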

Then your query works, and it is very fast, too:

SELECT *                                       
FROM   ts_stat($$SELECT to_tsvector('ts.english_simple', title) FROM item$$) 
ORDER  BY ndoc DESC
LIMIT  50;

dbfiddle here

This operates with lower case words without stemming and doesn't break for non-ASCII letters.

Background

Read the chapter Simple Dictionary in the manual.

The exact definition of a "word" is a tricky matter. The default text search parser (currently it's the only one) identifies 23 different types of tokens. See:

SELECT * FROM ts_token_type('default');

Built-in text search configurations map most of those to (built-in) dictionaries. Mappings for the english config:

SELECT tt.*, m.mapdict::regdictionary AS dictionary
FROM   pg_ts_config_map m
LEFT   JOIN   ts_token_type(3722) tt ON tt.tokid = m.maptokentype
WHERE  mapcfg = 'english'::regconfig  --  'ts.english_simple'::regconfig
ORDER  BY tt.tokid;

The demo above creates a new configuration based on the built-in simple configuration. Since all English stop words are of type asciiword, mapping only this token type is enough to remove stop words; no stemming or other processing is applied.
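If you would rather not create text search objects at all, the second query from the question can be combined with the stop_words table directly via an anti-join (a sketch assuming the table from the question; note this keeps the crude space-delimited notion of "word" rather than the text search parser's tokenization):

```sql
SELECT lower(word) AS word, count(*) AS ct
FROM   item, unnest(string_to_array(title, ' ')) AS word
WHERE  NOT EXISTS (
   SELECT FROM stop_words s
   WHERE  s.lowercase_stopword = lower(word)
   )
GROUP  BY 1
ORDER  BY 2 DESC
LIMIT  50;
```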