Mysql – How to compare across rows in large table

MySQLperformance

I need some help with a complicated query. Here are the dataset structures I'm dealing with. This is a MySQL database.

*transactions_table*

zip  | date
-----------------
1234 | 2010-05-01
4567 | 2010-03-10
8901 | 2010-02-25


*stores_table*

brand_id | zip
---------------
1        | 1234
2        | 4567
3        | 8901
4        | 2222
5        | 6666

*manager_table*

brand_id | name
---------------
1        | jane smith
1        | joe smith
2        | ellen katz
3        | foo fuzz
4        | bar bazz

My goal is to get all the transactions in stores that are managed by managers whose names do not match for a regex on the final word of the name column. The number of managers per store can be 1 or 2; it's inconsistent.

I can join all the data together, but then I can't figure out how to perform that last filter step. Here's what I have so far:

SELECT
manager.name, store.zip, txn.date
FROM
transactions_table txn
JOIN
stores_table store
  ON
  txn.zip = store.zip
JOIN
manager_table manager
  ON
  store.brand_id = manager.brand_id

Which gives me a derived table that looks like:

name       | zip  | date
------------------------------
jane smith | 1234 | 2010-05-01
joe smith  | 1234 | 2010-05-01
ellen katz | 4567 | 2010-03-10
foo fuzz   | 8901 | 2010-02-25

But I want to exclude the first two rows, because they represent a transaction at a store managed by people with names that match on their last name.

I want the dataset to look like:

name       | zip  | date
------------------------------
ellen katz | 4567 | 2010-03-10
foo fuzz   | 8901 | 2010-02-25

A couple notes:

I realize the domain this represents seems a little weird, but it's a simplified example for the sake of trying to nail down this query
scale is an issue. the real manager_table is about 38MM rows and the real stores_table is about 18MM, so I'd like to avoid a join on a derived table is possible

Best Answer

You could get the last word by first using reverse() to reverse it. So the last space becomes the first, so that the right one is taken even if there is more than one space in the name. Find the position of that space with locate(). With a substring() from the first character to the found position of the space minus one you'll get the last word reversed. Reverse this and have your last word.

reverse(substring(reverse(manager.name), 1, locate(' ', reverse(manager.name)) - 1))

For comparison we it actually doesn't matter if the word is reversed or not and if the space is still included. So we can simplify it to:

substring(reverse(manager.name), 1, locate(' ', reverse(manager.name)))

But we have to take care that at least one space is included as otherwise we'd do a substring(..., 1, 0) (locate() returns 0 if no match was found) which will result in the empty string. So two names without a space would match even if they were different, as the empty string is equal to the empty string. So we additionally need to check if:

locate(' ', manager.name) > 0

The order of the characters doesn't matter here, so we don't need to use reverse().

So you could use something like:

SELECT manager.name,
       store.zip,
       txn.date
       FROM transactions_table txn
            INNER JOIN stores_table store
                       ON txno.zip = store.zip
            INNER JOIN (SELECT *
                               FROM manager_table managero
                               WHERE NOT EXISTS (SELECT *
                                                        FROM manager_table manageri
                                                        WHERE AND locate(' ', managero.name) > 0
                                                              AND locate(' ', manageri.name) > 0
                                                              AND manageri.name <> managero.name
                                                              AND manageri.brand_id = managero.brand_id
                                                              AND substring(reverse(manageri.name), 1, locate(' ', reverse(manageri.name)) - 1) = substring(reverse(managero.name), 1, locate(' ', reverse(managero.name)) - 1)) manager
                       ON storeo.brand_id = managero.brand_id;

It will only join the set of managers, where no other manager with the same last word in name and the same brand_id exists. However note, that this fails to exclude managers without a space in their names, even if the names are equal. That is due to the fact, that you didn't reveal if there is any id column in manager_table. Only name seems to be unique. So there is no way to tell if the inner row is the same as the outer row without checking for the whole name. If there is and id column (e.g. named id) you can rewrite it to:

SELECT manager.name,
       store.zip,
       txn.date
       FROM transactions_table txn
            INNER JOIN stores_table store
                       ON txno.zip = store.zip
            INNER JOIN (SELECT *
                               FROM manager_table managero
                               WHERE NOT EXISTS (SELECT *
                                                        FROM manager_table manageri
                                                        WHERE manageri.id <> managero.id
                                                              AND (locate(' ', managero.name) > 0
                                                                   AND locate(' ', manageri.name) > 0
                                                                   AND manageri.brand_id = managero.brand_id
                                                                   AND substring(reverse(manageri.name), 1, locate(' ', reverse(manageri.name)) - 1) = substring(reverse(managero.name), 1, locate(' ', reverse(managero.name)) - 1)) manager
                                                                    OR manageri.name = managero.name)
                       ON storeo.brand_id = managero.brand_id;

That will also find duplicate managers with the same name, without a space in it.

Note: I assumed you're not already using MySQL 8.0. Version 8.0 introduced regexp_replace and regexp_substring providing possible easier alternatives to the string reversing.

Answer to question 1

This should do it for you

UPDATE user_activity a
INNER JOIN user_profile b 
USING (user_profile_id)
SET a.user_profile_id_int = b.user_profile_id_int;

This will work in MySQL. If you are not used to that JOIN syntax, do this:

UPDATE user_activity a
INNER JOIN user_profile b 
ON a.user_profile_id = b.user_profile_id
SET a.user_profile_id_int = b.user_profile_id_int;

Both should work.

Answer to question 2

Your query, in theory, works. However, look at what it is doing:

A table scan on user_activity, an indexed lookup of the user_profile_id_int using the PRIMARY KEY of user_profile, and an in-place update of the current row in user_activity.

The query is hitting two tables and two primary keys, back and forth, on a per-row basis. All steps slow each other down. Thus, you get a longer running query.

CAVEAT

Adding a compound index on user_profile should speed things up:

ALTER TABLE user_profile ADD INDEX (user_profile_id,user_profile_id_int);

Mysql – How to speed up queries on a large 220 million rows table (9 gig data)

Thoughts on the issue, thrown in random order:

The obvious index for this query is: (rated_user_id, rating). A query that gets data for only one of the million users and needs 17 seconds is doing something wrong: reading from the (rated_user_id, rater_user_id) index and then reading from the table the (hundreds to thousands) values for the rating column, as rating is not in any index. So, the query has to read many rows of the table which are located in many different disk locations.
Before starting adding numerous indexes in the tables, try to analyze the performance of the whole database, the whole set of slow queries, examine again the choices of the datatypes, the engine you use and the configuration settings.
Consider moving to a newer version of MySQL, 5.1, 5.5 or even 5.6 (also: Percona and MariaDB versions.) Several benefits as bugs have been corrected, the optimizer improved and you can set the low threshold for slow queries to less than 1 second (like 10 milliseconds). This will give you far better info about slow queries.
The choice for the datatype of rating is weird. VARCHAR(1)? Why not CHAR(1)? Why not TINYINT? This will save you some space, both tin the table and in the indexes that (will) include that column. A varchar(1) column needs one more byte over char(1) and if they are utf8, the (var)char columns will need 3 (or 4) bytes, instead of 1 (tinyint).

Best Answer

Related Solutions

Mysql – Update Large Table

Answer to question 1

Answer to question 2

CAVEAT

Mysql – How to speed up queries on a large 220 million rows table (9 gig data)

Related Question