Mysql – Complex join 3 tables with sum and group by

group byjoin;MySQLsum

I'm trying to generate a report for one of our departments that pulls data from 3 tables. The basic structure is below:

CREATE TABLE `evd_us_county_mio_2018q4` (
    `state_code` INT(11) NULL DEFAULT NULL,
    `state_abbr` VARCHAR(10) NULL DEFAULT NULL,
    `county_code` VARCHAR(10) NULL DEFAULT NULL,
    `county_name` VARCHAR(100) NULL DEFAULT NULL,
    `vehicle_category` VARCHAR(100) NULL DEFAULT NULL,
    `vehicle_segmentation` VARCHAR(200) NULL DEFAULT NULL,
    `vehicle_make` VARCHAR(150) NULL DEFAULT NULL,
    `vehicle_model` VARCHAR(200) NULL DEFAULT NULL,
    `vehicle_model_year` INT(11) NULL DEFAULT NULL,
    `vehicle_fuel_type` VARCHAR(50) NULL DEFAULT NULL,
    `vehicle_engine_displacement_ccs` VARCHAR(50) NULL DEFAULT NULL,
    `vehicle_engine_cylinders` INT(11) NULL DEFAULT NULL,
    `vehicle_count` INT(11) NULL DEFAULT NULL
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB
;

enter image description here

CREATE TABLE `codes_usa` (
    `id` MEDIUMINT(9) NOT NULL AUTO_INCREMENT,
    `company_id` SMALLINT(6) NOT NULL,
    `postal_code` VARCHAR(20) NOT NULL,
    `place_name` VARCHAR(180) NOT NULL COMMENT 'city',
    `admin_name1` VARCHAR(100) NOT NULL COMMENT 'state',
    `admin_code1` VARCHAR(20) NOT NULL COMMENT 'state_alpha',
    `admin_name2` VARCHAR(100) NOT NULL COMMENT 'county_name',
    `admin_code2` VARCHAR(20) NOT NULL,
    `admin_name3` VARCHAR(100) NOT NULL,
    `admin_code3` VARCHAR(20) NOT NULL,
    `latitude` FLOAT(10,6) NOT NULL,
    `longitude` FLOAT(10,6) NOT NULL,
    `accuracy` TINYINT(3) UNSIGNED NOT NULL,
    PRIMARY KEY (`id`),
    INDEX `postal_code` (`postal_code`),
    INDEX `company_id` (`company_id`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
AUTO_INCREMENT=43644
;

enter image description here

CREATE TABLE `company` (
    `id` SMALLINT(5) NOT NULL AUTO_INCREMENT,
    `account_number` VARCHAR(10) NOT NULL,
    `distributorship` TINYINT(1) UNSIGNED NOT NULL DEFAULT '1' COMMENT 'added 11-21-17 for the Directory page to distinguish between companies that were actual distributorships and other companies or resellers',
    `name` VARCHAR(70) NULL DEFAULT NULL,
    `address` VARCHAR(100) NULL DEFAULT NULL,
    `address2` VARCHAR(100) NULL DEFAULT NULL,
    `address3` VARCHAR(100) NULL DEFAULT NULL,
    `address4` VARCHAR(100) NULL DEFAULT NULL,
    `city` VARCHAR(60) NULL DEFAULT NULL,
    `state_province_region` VARCHAR(60) NULL DEFAULT NULL,
    `postal_code` VARCHAR(12) NULL DEFAULT NULL,
    `country` CHAR(60) NULL DEFAULT NULL,
    `shipping_address` VARCHAR(100) NULL DEFAULT NULL,
    `shipping_address2` VARCHAR(100) NULL DEFAULT NULL,
    `shipping_address3` VARCHAR(100) NULL DEFAULT NULL,
    `shipping_address4` VARCHAR(100) NULL DEFAULT NULL,
    `shipping_city` VARCHAR(60) NULL DEFAULT NULL,
    `shipping_state_province_region` VARCHAR(60) NULL DEFAULT NULL,
    `shipping_postal_code` VARCHAR(12) NULL DEFAULT NULL,
    `shipping_country` VARCHAR(60) NULL DEFAULT NULL,
    `UPS_address` VARCHAR(100) NULL DEFAULT NULL,
    `UPS_address2` VARCHAR(100) NULL DEFAULT NULL,
    `UPS_address3` VARCHAR(100) NULL DEFAULT NULL,
    `UPS_address4` VARCHAR(100) NULL DEFAULT NULL,
    `UPS_city` VARCHAR(60) NULL DEFAULT NULL,
    `UPS_state_province_region` VARCHAR(60) NULL DEFAULT NULL,
    `UPS_postal_code` VARCHAR(12) NULL DEFAULT NULL,
    `UPS_country` VARCHAR(60) NULL DEFAULT NULL,
    `phone` VARCHAR(25) NULL DEFAULT NULL,
    `fax` VARCHAR(25) NULL DEFAULT NULL,
    `web` VARCHAR(100) NULL DEFAULT NULL,
    `zone` TINYINT(1) NULL DEFAULT NULL,
    `zone_country_codes` VARCHAR(255) NULL DEFAULT 'USA' COMMENT 'Country code to reference the postal_codes table',
    `date_established` DATE NULL DEFAULT NULL,
    `order_frequency` TINYINT(3) NULL DEFAULT NULL,
    `on_hand_inventory` TINYINT(3) NULL DEFAULT NULL,
    `active` TINYINT(1) NOT NULL DEFAULT '0',
    `last_updated` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    `deactivation_date` DATE NOT NULL DEFAULT '0000-00-00',
    PRIMARY KEY (`id`),
    UNIQUE INDEX `account_number` (`account_number`),
    INDEX `company_name` (`name`),
    INDEX `zone_country_codes` (`zone_country_codes`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
AUTO_INCREMENT=908
;

enter image description here

UPDATE
I originally though my results were accurate; after showing a sample to the department that will be using the data, and looking at previous years numbers, I determined my query is still not quite right. The following query is not returning accurate results:

SELECT a.state_abbr, a.county_name, SUM(a.vehicle_count) AS vehicle_count, c.NAME AS distributorship
FROM import_experian.evd_us_county_mio_2018q4 a
LEFT JOIN postal.codes_usa b
     ON a.state_abbr=b.admin_code1 AND a.county_name = b.admin_name2
    LEFT JOIN headquarters.company c
     ON b.company_id=c.id
WHERE a.vehicle_category = 'Offroad'
GROUP BY a.state_abbr, a.county_name
ORDER BY a.state_abbr, a.county_name
LIMIT 100

enter image description here

I know this is inaccruate because querying for the total vehicle count in AK only returns ~27k:

SELECT a.state_abbr, sum(a.vehicle_count) AS vehicle_count
FROM import_experian.evd_us_county_mio_2018q4 a 
WHERE a.vehicle_category = 'Offroad' AND a.state_abbr = 'AK'

enter image description here

and GROUP BY a.county_name shows far fewer vehicles in each county as well:

SELECT a.state_abbr, a.county_name, sum(a.vehicle_count) AS vehicle_count
FROM import_experian.bg_products_evd_us_county_mio_2018q4 a 
WHERE a.vehicle_category = 'Offroad' AND a.state_abbr = 'AK'
GROUP BY a.county_name

enter image description here

It seems when I addd the JOIN statements that the SUM() of vehicle_count is being multiplied if a company exists. For example, notice that without the JOIN statement, Anchorage shows 8,193 for the vehicle_count which is the correct number, the same for Fairbanks North Star:

enter image description here

Comparing the above results to the result from the query with JOIN statements reveals something interesting:

enter image description here

Achorage is not associated with any company at this time and the SUM() is accurate while Fairbanks North Star does have a company associated an the SUM() is X times the true value.

What am I missing here? How can I fix my query?

Thank you!

Best Answer

You will have to add a full GROUP BY to get deterministic result:

SELECT a.state_abbr, a.county_name, SUM(a.vehicle_count) AS vehicle_count, C.name
FROM import_experian.bg_products_evd_us_county_mio_2018q4 a 
JOIN bgprod_postal.codes_usa b
     ON a.state_abbr=b.admin_code1 AND a.county_name = b.admin_name2
JOIN bgprod_headquarters.company c
     ON b.company_id=c.id
WHERE a.vehicle_category = 'Offroad'
GROUP BY a.state_abbr, a.county_name, C.name 
LIMIT 100;

For efficiency, you may want to create an index on:

CREATE INDEX ... ON import_experian.bg_products_evd_us_county_mio_2018q4
    (state_abbr, county_name, vehicle_category);

If vehicle_category is very selective you may change that to:

CREATE INDEX ... ON import_experian.bg_products_evd_us_county_mio_2018q4
    (vehicle_category, state_abbr, county_name);

In addition to that, you may want to create an index on:

CREATE INDEX ... ON bgprod_postal.codes_usa
    (admin_code1, admin_name2);

One way to add a YEAR dimension to the result, is to add a constant to each leg of a union like:

SELECT 2018 as origin, a.state_abbr, a.county_name, SUM(a.vehicle_count) AS vehicle_count, C.name
FROM import_experian.bg_products_evd_us_county_mio_2018q4 a 
JOIN bgprod_postal.codes_usa b
     ON a.state_abbr=b.admin_code1 AND a.county_name = b.admin_name2
JOIN bgprod_headquarters.company c
     ON b.company_id=c.id
WHERE a.vehicle_category = 'Offroad'
GROUP BY a.state_abbr, a.county_name, C.name 
UNION ALL
SELECT 2019 as origin, a.state_abbr, a.county_name, SUM(a.vehicle_count) AS vehicle_count, C.name
FROM import_experian.bg_products_evd_us_county_mio...
...

That you get duplicate results is a consequence of that you don't have any real keys in your tables (an autogenerated id is useless when it comes to integrity issues). To figure out what is wrong you need to think through what the real keys should be, group by them and apply having count(1) > 1. Those rows are your culprits.