Mysql – MariaDB FULLTEXT search with short / mandatory words

full-text-searchmariadbMySQLselect

Just checking I'm understanding this correctly:

CREATE TABLE customer (
    id INT NOT NULL AUTO_INCREMENT,
    name TINYTEXT NOT NULL,
    PRIMARY KEY (id),
    FULLTEXT (name)
) ENGINE = InnoDB;

INSERT INTO customer VALUES (1, "ABC.DEF");
INSERT INTO customer VALUES (2, "ABC_DEF");
INSERT INTO customer VALUES (3, "ABC'DEF");
INSERT INTO customer VALUES (4, "ABC.DE");
INSERT INTO customer VALUES (5, "ABC.DFF");

Where I've got innodb_ft_min_token_size set to 3 (default is 4).


When running:

SELECT
    c.*,
    MATCH (name) AGAINST ("+ABC +DEF" IN BOOLEAN MODE) AS m
FROM
    customer AS c

Customers 1 and 3 match, because the . and ' are seen as word separators (annoying for O'Brien).

For customer 2, because the underscore gets the whole name treated as a single word, the "DEF" word cannot be found.


If I change the MATCH to "+ABC +DE".

1, 2, or 3 do not match because this is using a full word match ("+DE" does not match "DEF").

4 does not match because… innodb_ft_min_token_size is set to 3?

As in, the 2 letter "DE" word is not in the FULLTEXT INDEX?


If I change the MATCH to use asterisks (e.g. "+ABC* +DE*"), that will use prefix matching.

But will only add customers 1 and 3 to the selection.

Because the 2 letter "DE" word for customer 4 is not in the FULLTEXT INDEX?


If I change the MATCH to use "+ABC.DE*", it matches all of them.

Note how they all get the same rank (even customer 5), and this is no different to "+ABC*", where MATCH seems to be seeing the "DE*" as a separate word, and not matching it against anything.

Whereas "+ABC DE*" is explicitly keeping it as a separate word, and the scores are handled appropriately.


While the individual points make sense, I'm not sure this creates a good system.

For a bit more consistency, I'm wondering if the database should ignore short words (tokens) in the MATCH query, in the same way it does when building the FULLTEXT INDEX.

Only because I don't think "+DE" will ever do anything useful when the min token size is 3; and it's not exactly easy for the developer to identify what the individual words in the FULLTEXT INDEX will be (i.e. to remove them).

Best Answer

FULLTEXT is imperfect. You are doing the 'right thing' to analyze the data it this way. Now switch from "abc" and "def" to strings that are realistic in your dataset.

Here are a couple of tricks that I have used reasonably successfully.

  • Put a + in front of long words, don't put a + in front of short words.

  • Use this combo to check for things that need LIKE (or RLIKE) but desire the speed of FULLTEXT:

      WHERE MATCH(text) AGAINST ('abc def IN BOOLEAN MODE')
        AND text LIKE '%abc+def%'
    

That is, use FT to find rows with the "words", then use LIKE to make sure they are adjacent and/or contain certain punctuation.