To reiterate:

1. You currently have a table (Table_B) with a list of words you want to make sure your users are not searching Table_A with.
2. You have successfully been able to query Table_A against Table_B, which contains your list of bad words.
3. Going from the first point to the second, your end goal is to know what would be returned if users searched for those bad words - which bad word correlates to a record being returned from Table_A.
What you need to do to resolve point 3 is query sys.dm_fts_parser. There is an excellent example by Jefferson Elias on SQL Shack. I'll post the relevant bit in case it is removed or broken by the internet later:
How to check the results of a Full-Text parsing
There are two ways to check how the Full-Text feature parses a given text,
depending on the source of the text.
Source of the text is a String
If you want to quickly check what keywords you would get for a particular
string, you might want to use the sys.dm_fts_parser built-in function.
Here is an example of a call to that function.
The first parameter is the string that has to be parsed. The second
parameter is the language identifier. Here, it's set to 0, which means
it's neutral. The third parameter is the identifier of the stoplist.
Here, no stoplist is used. The last parameter tells this function
whether or not to be sensitive to accents. Here, we asked for
insensitivity. In other words, this function will take the information
you would provide when creating a Full-Text Index.
select * from sys.dm_fts_parser(
    '" dsolkjfdskljfsd dfsd-MMM-236.127 dojfdslfkjds"',  -- string to parse
    0,     -- language identifier: 0 = neutral
    NULL,  -- no stoplist
    0      -- accent-insensitive
) ;
If a table is already created with a Full-Text index, we would use
another dynamic management function (DMF) called
sys.dm_fts_index_keywords, which takes as parameters:

- The database identifier in which it should look
- The object identifier in that database

It returns a dataset with a hexadecimal representation of the keyword,
its corresponding form in the plain text, the identifier of the column
in which the keyword has been found, and finally the number of
documents where this keyword can be found.
You will find below a T-SQL query to get back the keywords found by the
Full-Text feature in our dbo.DM_OBJECT_FILE table, as well as its
result set.
select *
from sys.dm_fts_index_keywords(DB_ID(), OBJECT_ID('dbo.DM_OBJECT_FILE'));
This method should show you what would be returned based on your criteria. To help with point 1, you may want to look into the Stopword, Stoplist, and Thesaurus features to help turn the words from Table_B into usable search terms. Alternatively, you may need to implement triggers to keep the terms from being searched.
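For instance, a custom stoplist suppresses specific terms at indexing time, so searches for them return nothing. A minimal sketch - myStoplist, SearchColumn, and PK_Table_A are hypothetical names, and the word values would come from Table_B:

-- Create a stoplist and add a term to it (1033 = US English).
CREATE FULLTEXT STOPLIST myStoplist;
ALTER FULLTEXT STOPLIST myStoplist ADD 'badword' LANGUAGE 1033;

-- The stoplist is attached when (re)creating the Full-Text index, e.g.:
-- CREATE FULLTEXT INDEX ON dbo.Table_A (SearchColumn)
--     KEY INDEX PK_Table_A
--     WITH STOPLIST = myStoplist;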
I imagine this code would work:
Declare @AllWords nvarchar(4000) = '';

-- Concatenate every word from Table_B into a single OR'd query string.
-- Each word is wrapped in double quotes so multi-word terms are treated
-- as phrases by the parser; no extra single quotes are needed when
-- passing a variable.
Select @AllWords = @AllWords
    + Case When @AllWords = '' Then '' Else ' OR ' End
    + '"' + Word + '"'
From Table_B;

select * from sys.dm_fts_parser(
    @AllWords,
    0,     -- neutral language
    NULL,  -- no stoplist
    0      -- accent-insensitive
) ;
Postgres text search only supports prefix matching, not the postfix matching you would need here.
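For illustration, a prefix match (the supported direction) looks like this - tbl and col are placeholder names:

-- 'wunder:*' matches words starting with 'wunder' (e.g. 'wunderbar'),
-- but there is no built-in way to match words ending in a string.
SELECT *
FROM   tbl
WHERE  to_tsvector('simple', col) @@ to_tsquery('simple', 'wunder:*');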
Depending on exact requirements a number of workarounds come to mind.
If all leading zeros are insignificant, you could remove those from the indexed column in an expression index like:
CREATE INDEX foo_idx ON tbl
USING GIN (to_tsvector('simple', regexp_replace(col, '\m0+', '', 'g')));
Or use a regular expression fitting your actual requirements.
Then match the index in queries like:
SELECT *
FROM tbl
WHERE to_tsvector('simple', regexp_replace(col, '\m0+', '', 'g'))
@@ to_tsquery ('simple', '2ed2323 & Wunder:*');
Remove leading zeroes from the tsquery
value as well - if those can be contained:
...
@@ to_tsquery( 'simple', regexp_replace(my_querystring, '\m0+', '', 'g'));
Best Answer
Assuming that multiple "words" are separated by a space character.
It's not clear whether you want to compare total counts across the whole table or counts per row. For the whole table:
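A minimal sketch of such a query, assuming a table tbl with a text column col:

-- Split each text on spaces and count every resulting word.
SELECT word, count(*) AS ct
FROM   tbl, unnest(string_to_array(col, ' ')) AS word
GROUP  BY word
ORDER  BY ct DESC;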
Or, faster, but less clean, with the set-returning function in the SELECT list:
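A sketch under the same assumptions (tbl, col):

-- The set-returning unnest() sits in the inner SELECT list.
SELECT word, count(*) AS ct
FROM  (SELECT unnest(string_to_array(col, ' ')) AS word FROM tbl) sub
GROUP  BY word
ORDER  BY ct DESC;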
For natural language texts you could do a lot more: remove punctuation characters, compare word stems, etc. Full Text Search provides various tools. The debugging function ts_stat() seems particularly handy for this. For an English text:
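A sketch of such a call - message and tbl are placeholder names:

-- ts_stat() takes a query string returning a tsvector and aggregates it.
SELECT *
FROM   ts_stat($$SELECT to_tsvector('english', message) FROM tbl$$)
ORDER  BY nentry DESC, ndoc DESC, word;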
ndoc .. the number of rows
nentry .. the number of occurrences (can be bigger with multiple instances in a single text)

The query returns the most common words in the column across the whole table - just like the first query, but now punctuation and stop words are trimmed and words are reduced to their English word stem.
To find the words contained in the most rows, use instead:
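A sketch under the same assumptions, ordering on ndoc (row count) first:

SELECT *
FROM   ts_stat($$SELECT to_tsvector('english', message) FROM tbl$$)
ORDER  BY ndoc DESC, nentry DESC, word;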
One would think it should be possible to employ a functional GIN index on
to_tsvector('english', message)
to make this very fast. But I did not find a way in a quick test.