Postgresql – Return top 100 words from strings

full-text-search, postgresql

Not sure this is even possible but I was wondering if there is any way of using PostgreSQL to return the top 100 most occurring words from a database column containing strings of varying lengths?

The table contains ~ 250k rows, with each row containing a message along with a lot of other metadata. So, is there a way of sifting through every word in each row, counting the total number of times it appears, and comparing it to the counts of all the other words in the same column to return the top 100?

If it is easier, the metadata can be dropped to leave only the column containing the Tweets.

Ideally, I would like to be able to perform what this website does to give the same output, only with many more records.

Best Answer

Assuming that multiple "words" are separated by a space character.
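
For a self-contained sketch, assume a minimal table like this (the names `tbl` and `message` are taken from the queries below; adapt them to your actual schema):

```sql
-- hypothetical test setup matching the queries in this answer
CREATE TABLE tbl (
   id      serial PRIMARY KEY
 , message text NOT NULL
);

INSERT INTO tbl (message) VALUES
  ('the quick brown fox')
, ('the lazy dog and the quick fox');
```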

It's not clear whether you want to compare total counts across the whole table or counts per row. For the whole table:

SELECT word, count(*) AS ct
FROM   tbl, unnest(string_to_array(message, ' ')) word  -- implicit LATERAL join
GROUP  BY 1
ORDER  BY 2 DESC
LIMIT  100;  -- top 100

Or, faster, but less clean, with the set-returning function in the SELECT list:

SELECT unnest(string_to_array(message, ' ')) AS word, count(*) AS ct
FROM   tbl
GROUP  BY 1
ORDER  BY 2 DESC
LIMIT  100;
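
If you want counts per row instead, add the row's primary key to the grouping. A sketch, assuming `tbl` has an `id` column:

```sql
-- word counts per row (id is an assumed primary key column)
SELECT id, word, count(*) AS ct
FROM   tbl, unnest(string_to_array(message, ' ')) word
GROUP  BY id, word
ORDER  BY id, ct DESC;
```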

For natural language texts you could do a lot more: remove punctuation characters, compare word stems, etc. Full Text Search provides various tools for this. The debugging function ts_stat() seems particularly handy here. For an English text:

SELECT word, ndoc, nentry
FROM   ts_stat($$SELECT to_tsvector('english', message) FROM tbl$$) 
ORDER  BY nentry DESC
LIMIT  100;

ndoc .. the number of rows (documents) the word appears in
nentry .. the total number of occurrences (can be greater than ndoc when a word appears multiple times in the same text)

The query returns the most common words in the column across the whole table - just like the first query, but now punctuation and stop words are trimmed and words are reduced to their English word stem.
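
To see the effect of stop-word removal and stemming, you can inspect to_tsvector() directly. For example:

```sql
SELECT to_tsvector('english', 'The quick foxes are jumping over the lazy dogs!');
-- 'dog':9 'fox':3 'jump':5 'lazi':8 'quick':2
```

Stop words ("the", "are", "over") are dropped, and the remaining words are reduced to their stems with their positions in the text.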

To find the words contained in the most rows, use instead:

ORDER  BY ndoc DESC

One would think it should be possible to employ a functional (expression) GIN index on to_tsvector('english', message) to make this very fast. But I did not find a way in a quick test.
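
For reference, such an index would be created like this (though, as noted above, ts_stat() did not seem to use it in my test):

```sql
CREATE INDEX tbl_message_tsv_idx ON tbl
USING gin (to_tsvector('english', message));
```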