Mysql – How to compare across rows in large table

MySQLperformance

I need some help with a complicated query. Here are the dataset structures I'm dealing with. This is a MySQL database.

*transactions_table*

zip  | date
-----------------
1234 | 2010-05-01
4567 | 2010-03-10
8901 | 2010-02-25


*stores_table*

brand_id | zip
---------------
1        | 1234
2        | 4567
3        | 8901
4        | 2222
5        | 6666

*manager_table*

brand_id | name
---------------
1        | jane smith
1        | joe smith
2        | ellen katz
3        | foo fuzz
4        | bar bazz

My goal is to get all the transactions in stores that are managed by managers whose names do not match for a regex on the final word of the name column. The number of managers per store can be 1 or 2; it's inconsistent.

I can join all the data together, but then I can't figure out how to perform that last filter step. Here's what I have so far:

SELECT
manager.name, store.zip, txn.date
FROM
transactions_table txn
JOIN
stores_table store
  ON
  txn.zip = store.zip
JOIN
manager_table manager
  ON
  store.brand_id = manager.brand_id

Which gives me a derived table that looks like:

name       | zip  | date
------------------------------
jane smith | 1234 | 2010-05-01
joe smith  | 1234 | 2010-05-01
ellen katz | 4567 | 2010-03-10
foo fuzz   | 8901 | 2010-02-25

But I want to exclude the first two rows, because they represent a transaction at a store managed by people with names that match on their last name.

I want the dataset to look like:

name       | zip  | date
------------------------------
ellen katz | 4567 | 2010-03-10
foo fuzz   | 8901 | 2010-02-25

A couple notes:

  • I realize the domain this represents seems a little weird, but it's a simplified example for the sake of trying to nail down this query
  • scale is an issue. the real manager_table is about 38MM rows and the real stores_table is about 18MM, so I'd like to avoid a join on a derived table is possible

Best Answer

You could get the last word by first using reverse() to reverse it. So the last space becomes the first, so that the right one is taken even if there is more than one space in the name. Find the position of that space with locate(). With a substring() from the first character to the found position of the space minus one you'll get the last word reversed. Reverse this and have your last word.

reverse(substring(reverse(manager.name), 1, locate(' ', reverse(manager.name)) - 1))

For comparison we it actually doesn't matter if the word is reversed or not and if the space is still included. So we can simplify it to:

substring(reverse(manager.name), 1, locate(' ', reverse(manager.name)))

But we have to take care that at least one space is included as otherwise we'd do a substring(..., 1, 0) (locate() returns 0 if no match was found) which will result in the empty string. So two names without a space would match even if they were different, as the empty string is equal to the empty string. So we additionally need to check if:

locate(' ', manager.name) > 0

The order of the characters doesn't matter here, so we don't need to use reverse().

So you could use something like:

SELECT manager.name,
       store.zip,
       txn.date
       FROM transactions_table txn
            INNER JOIN stores_table store
                       ON txno.zip = store.zip
            INNER JOIN (SELECT *
                               FROM manager_table managero
                               WHERE NOT EXISTS (SELECT *
                                                        FROM manager_table manageri
                                                        WHERE AND locate(' ', managero.name) > 0
                                                              AND locate(' ', manageri.name) > 0
                                                              AND manageri.name <> managero.name
                                                              AND manageri.brand_id = managero.brand_id
                                                              AND substring(reverse(manageri.name), 1, locate(' ', reverse(manageri.name)) - 1) = substring(reverse(managero.name), 1, locate(' ', reverse(managero.name)) - 1)) manager
                       ON storeo.brand_id = managero.brand_id;

It will only join the set of managers, where no other manager with the same last word in name and the same brand_id exists. However note, that this fails to exclude managers without a space in their names, even if the names are equal. That is due to the fact, that you didn't reveal if there is any id column in manager_table. Only name seems to be unique. So there is no way to tell if the inner row is the same as the outer row without checking for the whole name. If there is and id column (e.g. named id) you can rewrite it to:

SELECT manager.name,
       store.zip,
       txn.date
       FROM transactions_table txn
            INNER JOIN stores_table store
                       ON txno.zip = store.zip
            INNER JOIN (SELECT *
                               FROM manager_table managero
                               WHERE NOT EXISTS (SELECT *
                                                        FROM manager_table manageri
                                                        WHERE manageri.id <> managero.id
                                                              AND (locate(' ', managero.name) > 0
                                                                   AND locate(' ', manageri.name) > 0
                                                                   AND manageri.brand_id = managero.brand_id
                                                                   AND substring(reverse(manageri.name), 1, locate(' ', reverse(manageri.name)) - 1) = substring(reverse(managero.name), 1, locate(' ', reverse(managero.name)) - 1)) manager
                                                                    OR manageri.name = managero.name)
                       ON storeo.brand_id = managero.brand_id;

That will also find duplicate managers with the same name, without a space in it.

Note: I assumed you're not already using MySQL 8.0. Version 8.0 introduced regexp_replace and regexp_substring providing possible easier alternatives to the string reversing.