Mysql – Get stopwords of a InnoDB fulltext index

full-text-searchMySQL

I can set stopwords for a MySQL fulltext index with SET innodb_ft_user_stopword_table ... before creating the fulltext index.

But how can I figure out the existing stopwords of before created fulltext index?

I can request information_schema.INNODB_FT_DEFAULT_STOPWORD for the default stopwords of the server. But I would like to know the used stopwords for a specific table.

In the MariaDB knowlegebase, I found a description how to determine the stopwords: https://mariadb.com/kb/en/library/full-text-index-stopwords/#innodb-stopwords

But I can't set innodb_ft_user_stopword_table to a "normal" table with several fields. (I'm using MariaDB 10.1.39.)

A way to find out the used stopwords for MyISAM tables, I would like to know too.

Best Answer

The easiest way is to get it from the MySQL Docs.

When you go to https://dev.mysql.com/doc/refman/5.5/en/fulltext-stopwords.html you will see the 543 words that are the default stopwords for MyISAM fulltext indexes. There is 'Copy to Clipboard' icon in the upper right corner of that list. Click on that.

You could then construct an SQL file to build that list.

EXAMPLE

I copied that to local text file in Ubuntu

The file looks like this:

root@LAPTOP-U6EJ8KII:~/DBA# cat stuff.txt
a's           able          about         above         according
accordingly   across        actually      after         afterwards
again         against       ain't         all           allow
allows        almost        alone         along         already
also          although      always        am            among
amongst       an            and           another       any
anybody       anyhow        anyone        anything      anyway
anyways       anywhere      apart         appear        appreciate
appropriate   are           aren't        around        as
aside         ask           asking        associated    at
available     away          awfully       be            became
because       become        becomes       becoming      been
before        beforehand    behind        being         believe
below         beside        besides       best          better
between       beyond        both          brief         but
by            c'mon         c's           came          can
can't         cannot        cant          cause         causes
certain       certainly     changes       clearly       co
com           come          comes         concerning    consequently
consider      considering   contain       containing    contains
corresponding could         couldn't      course        currently
definitely    described     despite       did           didn't
different     do            does          doesn't       doing
don't         done          down          downwards     during
each          edu           eg            eight         either
else          elsewhere     enough        entirely      especially
et            etc           even          ever          every
everybody     everyone      everything    everywhere    ex
exactly       example       except        far           few
fifth         first         five          followed      following
follows       for           former        formerly      forth
four          from          further       furthermore   get
gets          getting       given         gives         go
goes          going         gone          got           gotten
greetings     had           hadn't        happens       hardly
has           hasn't        have          haven't       having
he            he's          hello         help          hence
her           here          here's        hereafter     hereby
herein        hereupon      hers          herself       hi
him           himself       his           hither        hopefully
how           howbeit       however       i'd           i'll
i'm           i've          ie            if            ignored
immediate     in            inasmuch      inc           indeed
indicate      indicated     indicates     inner         insofar
instead       into          inward        is            isn't
it            it'd          it'll         it's          its
itself        just          keep          keeps         kept
know          known         knows         last          lately
later         latter        latterly      least         less
lest          let           let's         like          liked
likely        little        look          looking       looks
ltd           mainly        many          may           maybe
me            mean          meanwhile     merely        might
more          moreover      most          mostly        much
must          my            myself        name          namely
nd            near          nearly        necessary     need
needs         neither       never         nevertheless  new
next          nine          no            nobody        non
none          noone         nor           normally      not
nothing       novel         now           nowhere       obviously
of            off           often         oh            ok
okay          old           on            once          one
ones          only          onto          or            other
others        otherwise     ought         our           ours
ourselves     out           outside       over          overall
own           particular    particularly  per           perhaps
placed        please        plus          possible      presumably
probably      provides      que           quite         qv
rather        rd            re            really        reasonably
regarding     regardless    regards       relatively    respectively
right         said          same          saw           say
saying        says          second        secondly      see
seeing        seem          seemed        seeming       seems
seen          self          selves        sensible      sent
serious       seriously     seven         several       shall
she           should        shouldn't     since         six
so            some          somebody      somehow       someone
something     sometime      sometimes     somewhat      somewhere
soon          sorry         specified     specify       specifying
still         sub           such          sup           sure
t's           take          taken         tell          tends
th            than          thank         thanks        thanx
that          that's        thats         the           their
theirs        them          themselves    then          thence
there         there's       thereafter    thereby       therefore
therein       theres        thereupon     these         they
they'd        they'll       they're       they've       think
third         this          thorough      thoroughly    those
though        three         through       throughout    thru
thus          to            together      too           took
toward        towards       tried         tries         truly
try           trying        twice         two           un
under         unfortunately unless        unlikely      until
unto          up            upon          us            use
used          useful        uses          using         usually
value         various       very          via           viz
vs            want          wants         was           wasn't
way           we            we'd          we'll         we're
we've         welcome       well          went          were
weren't       what          what's        whatever      when
whence        whenever      where         where's       whereafter
whereas       whereby       wherein       whereupon     wherever
whether       which         while         whither       who
who's         whoever       whole         whom          whose
why           will          willing       wish          with
within        without       won't         wonder        would
wouldn't      yes           yet           you           you'd
you'll        you're        you've        your          yours
yourself      yourselves    zero
root@LAPTOP-U6EJ8KII:~/DBA#

I ran a loop across the file to tokenize it into another file.

root@LAPTOP-U6EJ8KII:~/DBA# for TAG in `cat stuff.txt` ; do echo ${TAG}; done > stopwords_table.sql

Then, I jumped into vi and added SQL around the words and created this:

root@LAPTOP-U6EJ8KII:~/DBA# cat stopwords_table.sql
DROP DATABASE IF EXISTS cust_stop;
CREATE DATABASE cust_stop;
USE cust_stop
CREATE TABLE stopwords (VALUE VARCHAR(20)) ENGINE=InnoDB;
INSERT INTO stopwords (VALUE) VALUES
('a''s'),
('able'),
('about'),
('above'),
('according'),
('accordingly'),
('across'),
('actually'),
('after'),
('afterwards'),
('again'),
('against'),
('ain''t'),
('all'),
('allow'),
('allows'),
('almost'),
('alone'),
('along'),
('already'),
('also'),
('although'),
('always'),
('am'),
('among'),
('amongst'),
('an'),
('and'),
('another'),
('any'),
('anybody'),
('anyhow'),
('anyone'),
('anything'),
('anyway'),
('anyways'),
('anywhere'),
('apart'),
('appear'),
('appreciate'),
('appropriate'),
('are'),
('aren''t'),
('around'),
('as'),
('aside'),
('ask'),
('asking'),
('associated'),
('at'),
('available'),
('away'),
('awfully'),
('be'),
('became'),
('because'),
('become'),
('becomes'),
('becoming'),
('been'),
('before'),
('beforehand'),
('behind'),
('being'),
('believe'),
('below'),
('beside'),
('besides'),
('best'),
('better'),
('between'),
('beyond'),
('both'),
('brief'),
('but'),
('by'),
('c''mon'),
('c''s'),
('came'),
('can'),
('can''t'),
('cannot'),
('cant'),
('cause'),
('causes'),
('certain'),
('certainly'),
('changes'),
('clearly'),
('co'),
('com'),
('come'),
('comes'),
('concerning'),
('consequently'),
('consider'),
('considering'),
('contain'),
('containing'),
('contains'),
('corresponding'),
('could'),
('couldn''t'),
('course'),
('currently'),
('definitely'),
('described'),
('despite'),
('did'),
('didn''t'),
('different'),
('do'),
('does'),
('doesn''t'),
('doing'),
('don''t'),
('done'),
('down'),
('downwards'),
('during'),
('each'),
('edu'),
('eg'),
('eight'),
('either'),
('else'),
('elsewhere'),
('enough'),
('entirely'),
('especially'),
('et'),
('etc'),
('even'),
('ever'),
('every'),
('everybody'),
('everyone'),
('everything'),
('everywhere'),
('ex'),
('exactly'),
('example'),
('except'),
('far'),
('few'),
('fifth'),
('first'),
('five'),
('followed'),
('following'),
('follows'),
('for'),
('former'),
('formerly'),
('forth'),
('four'),
('from'),
('further'),
('furthermore'),
('get'),
('gets'),
('getting'),
('given'),
('gives'),
('go'),
('goes'),
('going'),
('gone'),
('got'),
('gotten'),
('greetings'),
('had'),
('hadn''t'),
('happens'),
('hardly'),
('has'),
('hasn''t'),
('have'),
('haven''t'),
('having'),
('he'),
('he''s'),
('hello'),
('help'),
('hence'),
('her'),
('here'),
('here''s'),
('hereafter'),
('hereby'),
('herein'),
('hereupon'),
('hers'),
('herself'),
('hi'),
('him'),
('himself'),
('his'),
('hither'),
('hopefully'),
('how'),
('howbeit'),
('however'),
('i''d'),
('i''ll'),
('i''m'),
('i''ve'),
('ie'),
('if'),
('ignored'),
('immediate'),
('in'),
('inasmuch'),
('inc'),
('indeed'),
('indicate'),
('indicated'),
('indicates'),
('inner'),
('insofar'),
('instead'),
('into'),
('inward'),
('is'),
('isn''t'),
('it'),
('it''d'),
('it''ll'),
('it''s'),
('its'),
('itself'),
('just'),
('keep'),
('keeps'),
('kept'),
('know'),
('known'),
('knows'),
('last'),
('lately'),
('later'),
('latter'),
('latterly'),
('least'),
('less'),
('lest'),
('let'),
('let''s'),
('like'),
('liked'),
('likely'),
('little'),
('look'),
('looking'),
('looks'),
('ltd'),
('mainly'),
('many'),
('may'),
('maybe'),
('me'),
('mean'),
('meanwhile'),
('merely'),
('might'),
('more'),
('moreover'),
('most'),
('mostly'),
('much'),
('must'),
('my'),
('myself'),
('name'),
('namely'),
('nd'),
('near'),
('nearly'),
('necessary'),
('need'),
('needs'),
('neither'),
('never'),
('nevertheless'),
('new'),
('next'),
('nine'),
('no'),
('nobody'),
('non'),
('none'),
('noone'),
('nor'),
('normally'),
('not'),
('nothing'),
('novel'),
('now'),
('nowhere'),
('obviously'),
('of'),
('off'),
('often'),
('oh'),
('ok'),
('okay'),
('old'),
('on'),
('once'),
('one'),
('ones'),
('only'),
('onto'),
('or'),
('other'),
('others'),
('otherwise'),
('ought'),
('our'),
('ours'),
('ourselves'),
('out'),
('outside'),
('over'),
('overall'),
('own'),
('particular'),
('particularly'),
('per'),
('perhaps'),
('placed'),
('please'),
('plus'),
('possible'),
('presumably'),
('probably'),
('provides'),
('que'),
('quite'),
('qv'),
('rather'),
('rd'),
('re'),
('really'),
('reasonably'),
('regarding'),
('regardless'),
('regards'),
('relatively'),
('respectively'),
('right'),
('said'),
('same'),
('saw'),
('say'),
('saying'),
('says'),
('second'),
('secondly'),
('see'),
('seeing'),
('seem'),
('seemed'),
('seeming'),
('seems'),
('seen'),
('self'),
('selves'),
('sensible'),
('sent'),
('serious'),
('seriously'),
('seven'),
('several'),
('shall'),
('she'),
('should'),
('shouldn''t'),
('since'),
('six'),
('so'),
('some'),
('somebody'),
('somehow'),
('someone'),
('something'),
('sometime'),
('sometimes'),
('somewhat'),
('somewhere'),
('soon'),
('sorry'),
('specified'),
('specify'),
('specifying'),
('still'),
('sub'),
('such'),
('sup'),
('sure'),
('t''s'),
('take'),
('taken'),
('tell'),
('tends'),
('th'),
('than'),
('thank'),
('thanks'),
('thanx'),
('that'),
('that''s'),
('thats'),
('the'),
('their'),
('theirs'),
('them'),
('themselves'),
('then'),
('thence'),
('there'),
('there''s'),
('thereafter'),
('thereby'),
('therefore'),
('therein'),
('theres'),
('thereupon'),
('these'),
('they'),
('they''d'),
('they''ll'),
('they''re'),
('they''ve'),
('think'),
('third'),
('this'),
('thorough'),
('thoroughly'),
('those'),
('though'),
('three'),
('through'),
('throughout'),
('thru'),
('thus'),
('to'),
('together'),
('too'),
('took'),
('toward'),
('towards'),
('tried'),
('tries'),
('truly'),
('try'),
('trying'),
('twice'),
('two'),
('un'),
('under'),
('unfortunately'),
('unless'),
('unlikely'),
('until'),
('unto'),
('up'),
('upon'),
('us'),
('use'),
('used'),
('useful'),
('uses'),
('using'),
('usually'),
('value'),
('various'),
('very'),
('via'),
('viz'),
('vs'),
('want'),
('wants'),
('was'),
('wasn''t'),
('way'),
('we'),
('we''d'),
('we''ll'),
('we''re'),
('we''ve'),
('welcome'),
('well'),
('went'),
('were'),
('weren''t'),
('what'),
('what''s'),
('whatever'),
('when'),
('whence'),
('whenever'),
('where'),
('where''s'),
('whereafter'),
('whereas'),
('whereby'),
('wherein'),
('whereupon'),
('wherever'),
('whether'),
('which'),
('while'),
('whither'),
('who'),
('who''s'),
('whoever'),
('whole'),
('whom'),
('whose'),
('why'),
('will'),
('willing'),
('wish'),
('with'),
('within'),
('without'),
('won''t'),
('wonder'),
('would'),
('wouldn''t'),
('yes'),
('yet'),
('you'),
('you''d'),
('you''ll'),
('you''re'),
('you''ve'),
('your'),
('yours'),
('yourself'),
('yourselves'),
('zero');

root@LAPTOP-U6EJ8KII:~/DBA#

NOTE: There are 47 words with an apostrophe in it (such as ain't and can't). So, I had to change them to double single quotes in the SQL file above.

All you need to do is execute the aforementioned SQL. This would create the database cust_stop. with the table stopwords. The table has a single column called VALUE (as specified in the MariaDB documentation).

After you created the stopword table, now configure MariaDB.conf with innodb_ft_server_stopword_table:

innodb_ft_server_stopword_table=cust_stop/stopwords

and restart MariaDB.

GIVE IT A TRY !!!

Note: If you are doing this in Galera, MariaDB will have a little fit because every table must have a primary key. You could either configure Galera not to be strict (be permissive) or just use the stopword file option instead.

Please see my old posts on making a text file stopword list.

UPDATE 2019-08-06 10:08 EDT

The link you gave to the MariaDB Documentation gives you the answer.

Just run

SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;

Unfortunately, MyISAM stopwords are not exposed to the information_schema like InnoDB stopwords are exposed. My answer simply provides a way to populate the INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD table if you wanted these stopwords exposed to InnoDB.

Please keep in mind that for InnoDB, 36 words is the "normal" table and is exposed. For MyISAM, the 543 word "normal" table is not exposed.

UPDATE 2019-08-06 12:43 EDT

If you are interested in knowing what stopwords are used in a MyISAM table with a fulltext index, then you will have to use myisam_ftdump. I have only referred to it once in this forum (mysql MATCH relavancy score).

I have it installed on my laptop for MySQL 5.5

C:\MySQL55\bin>myisam_ftdump
Use: myisam_ftdump <table_name> <index_num>
  -h, --help          Display help and exit.
  -?, --help          Synonym for -h.
  -c, --count         Calculate per-word stats (counts and global weights).
  -d, --dump          Dump index (incl. data offsets and word weights).
  -l, --length        Report length distribution.
  -s, --stats         Report global stats.
  -v, --verbose       Be verbose.

Variables (--variable-name=value)
and boolean options {FALSE|TRUE}  Value (after reading options)
--------------------------------- ----------------------------------------
verbose                           FALSE

C:\MySQL55\bin>

When you run myisam_ftdump -c against a MyISAM table, you will get each word and its count. You will have to parse the words out of that output and compare each word to a text file that contains the 543 MyISAM stopwords. Any word in the Fulltext Dump that are in that list is using the stopword. I will leave that as an exercise for you.

Please keep in mind that myisam_ftdump is a utility program, not a client program. What's the difference ??? A client program (such as mysql, mysqldump, mysqladmin) requires you to supply a username and password to access the table in mysqld. A utility program (such as mysaimpack, myisamchk) can be used against a table but no process should be accessing the MyISAM table (no open file handles, no running SQL). This means that you will have to copy the MyISAM table to another folder outside of datadir to use the myisam_ftdump against that MyISAM in the outside folder or you risk corrupting the table.

For the difference between client and utility programs, click the following:

UPDATE 2019-08-07 10:24 EDT

Here is a desperate query

SELECT DISTINCT A.value FROM
INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD A,mytable B
WHERE
    B.mycolumn LIKE CONCAT(A.value,' %') OR
    B.mycolumn LIKE CONCAT('% ',A.value) OR
    B.mycolumn LIKE CONCAT('% ',A.value, '%')
;

This is an ugly Cartesian Join, the worst of its kind. It should locate a stopword at the beginning, end or middle of a text field.

Best Answer

EXAMPLE

GIVE IT A TRY !!!

UPDATE 2019-08-06 10:08 EDT

UPDATE 2019-08-06 12:43 EDT

UPDATE 2019-08-07 10:24 EDT

Related Solutions

Mysql – why/how does the number of matched columns influences the way of excecuting a query

Postgresql – get the union of the a Postgres fulltext index

Related Question