Mysql – MariaDB FULLTEXT search with short / mandatory words

full-text-searchmariadbMySQLselect

Just checking I'm understanding this correctly:

CREATE TABLE customer (
    id INT NOT NULL AUTO_INCREMENT,
    name TINYTEXT NOT NULL,
    PRIMARY KEY (id),
    FULLTEXT (name)
) ENGINE = InnoDB;

INSERT INTO customer VALUES (1, "ABC.DEF");
INSERT INTO customer VALUES (2, "ABC_DEF");
INSERT INTO customer VALUES (3, "ABC'DEF");
INSERT INTO customer VALUES (4, "ABC.DE");
INSERT INTO customer VALUES (5, "ABC.DFF");

Where I've got innodb_ft_min_token_size set to 3 (default is 4).

When running:

SELECT
    c.*,
    MATCH (name) AGAINST ("+ABC +DEF" IN BOOLEAN MODE) AS m
FROM
    customer AS c

Customers 1 and 3 match, because the . and ' are seen as word separators (annoying for O'Brien).

For customer 2, because the underscore gets the whole name treated as a single word, the "DEF" word cannot be found.

If I change the MATCH to "+ABC +DE".

1, 2, or 3 do not match because this is using a full word match ("+DE" does not match "DEF").

4 does not match because… innodb_ft_min_token_size is set to 3?

As in, the 2 letter "DE" word is not in the FULLTEXT INDEX?

If I change the MATCH to use asterisks (e.g. "+ABC* +DE*"), that will use prefix matching.

But will only add customers 1 and 3 to the selection.

Because the 2 letter "DE" word for customer 4 is not in the FULLTEXT INDEX?

If I change the MATCH to use "+ABC.DE*", it matches all of them.

Note how they all get the same rank (even customer 5), and this is no different to "+ABC*", where MATCH seems to be seeing the "DE*" as a separate word, and not matching it against anything.

Whereas "+ABC DE*" is explicitly keeping it as a separate word, and the scores are handled appropriately.

While the individual points make sense, I'm not sure this creates a good system.

For a bit more consistency, I'm wondering if the database should ignore short words (tokens) in the MATCH query, in the same way it does when building the FULLTEXT INDEX.

Only because I don't think "+DE" will ever do anything useful when the min token size is 3; and it's not exactly easy for the developer to identify what the individual words in the FULLTEXT INDEX will be (i.e. to remove them).

Best Answer

FULLTEXT is imperfect. You are doing the 'right thing' to analyze the data it this way. Now switch from "abc" and "def" to strings that are realistic in your dataset.

Here are a couple of tricks that I have used reasonably successfully.

Put a + in front of long words, don't put a + in front of short words.
Use this combo to check for things that need LIKE (or RLIKE) but desire the speed of FULLTEXT:
```
  WHERE MATCH(text) AGAINST ('abc def IN BOOLEAN MODE')
    AND text LIKE '%abc+def%'
```

That is, use FT to find rows with the "words", then use LIKE to make sure they are adjacent and/or contain certain punctuation.

Related Solutions

Like predicate to match a whole word only

Your best bet is probably to use REGEXP, which needs some special installation prior to being usable.

In that case, you'd use:

WHERE         
    name REGEXP 'e\w*o'

This regular expression means: (a literal 'e') followed by (zero or more word characters) followed by (a literal 'o').

This approach won't shy away from cases like 'This hello will work'

References:

Alternative:

If you can't install the required packages to have REGEXP available, you can probably use a user-defined function to perform the "word-by-word" LIKE check. Let's imagine it's called check_words_like(sentence, like_predicate), I'd use in the following fashion:

SELECT
    whatever
FROM
    (SELECT
        name, whatever
    FROM
        tbl
    WHERE
        name like '%e%o%'
    ) AS q1
WHERE
    check_words_like(q1.name, '%e%o%') ;

This still requires some C programming and interfacing, and might be as complicated as getting the right REGEXP package.

Sql-server – SQL Server: no search results caused by noise words

Simply deactivating the stop list fixed the problems. There may be better solutions but this worked for me.

Use database; 
ALTER FULLTEXT INDEX ON table
   SET STOPLIST OFF;
GO

Best Answer

Related Solutions

Like predicate to match a whole word only

Sql-server – SQL Server: no search results caused by noise words

Related Question