Postgresql – Return top 100 words from strings

full-text-search, postgresql

Not sure this is even possible but I was wondering if there is any way of using PostgreSQL to return the top 100 most occurring words from a database column containing strings of varying lengths?

The table contains ~ 250k rows, with each row containing a message along with a lot of other metadata. So, is there a way of sifting through every word in each row, counting the total number of times it appears, and comparing it to the counts of all the other words in the same column to return the top 100?

If it is easier, the metadata can be dropped to leave only the column containing the Tweets.

Ideally, I would like to be able to perform what this website does to give the same output, only with many more records.

Best Answer

Assuming that multiple "words" are separated by a space character.
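
For a self-contained sketch, assume a minimal table like this (the names `tbl` and `message` are taken from the queries below; adapt them to your actual schema):

```sql
-- hypothetical test setup matching the queries in this answer
CREATE TABLE tbl (
   id      serial PRIMARY KEY
 , message text NOT NULL
);

INSERT INTO tbl (message) VALUES
  ('the quick brown fox')
, ('the lazy dog and the quick fox');
```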

It's not clear whether you want to compare total counts across the whole table or counts per row. For the whole table:

SELECT word, count(*) AS ct
FROM   tbl, unnest(string_to_array(message, ' ')) word  -- implicit LATERAL join
GROUP  BY 1
ORDER  BY 2 DESC
LIMIT  100;  -- top 100

Or, faster, but less clean, with the set-returning function in the SELECT list:

SELECT unnest(string_to_array(message, ' ')) AS word, count(*) AS ct
FROM   tbl
GROUP  BY 1
ORDER  BY 2 DESC
LIMIT  100;
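
If you want counts per row instead, add the row's primary key to the grouping. A sketch, assuming `tbl` has an `id` column:

```sql
-- word counts per row (id is an assumed primary key column)
SELECT id, word, count(*) AS ct
FROM   tbl, unnest(string_to_array(message, ' ')) word
GROUP  BY id, word
ORDER  BY id, ct DESC;
```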

For natural language texts you could do a lot more: remove punctuation characters, compare word stems, etc. Full Text Search provides various tools for this. The debugging function ts_stat() seems particularly handy here. For an English text:

SELECT word, ndoc, nentry
FROM   ts_stat($$SELECT to_tsvector('english', message) FROM tbl$$) 
ORDER  BY nentry DESC
LIMIT  100;

ndoc .. the number of rows (documents) the word appears in
nentry .. the total number of occurrences (can be greater than ndoc when a word appears multiple times in the same text)

The query returns the most common words in the column across the whole table - just like the first query, but now punctuation and stop words are trimmed and words are reduced to their English word stem.
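
To see the effect of stop-word removal and stemming, you can inspect to_tsvector() directly. For example:

```sql
SELECT to_tsvector('english', 'The quick foxes are jumping over the lazy dogs!');
-- 'dog':9 'fox':3 'jump':5 'lazi':8 'quick':2
```

Stop words ("the", "are", "over") are dropped, and the remaining words are reduced to their stems with their positions in the text.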

To find the words contained in the most rows, use instead:

ORDER  BY ndoc DESC

One would think it should be possible to employ a functional (expression) GIN index on to_tsvector('english', message) to make this very fast. But I did not find a way in a quick test.
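
For reference, such an index would be created like this (though, as noted above, ts_stat() did not seem to use it in my test):

```sql
CREATE INDEX tbl_message_tsv_idx ON tbl
USING gin (to_tsvector('english', message));
```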