I'm trying to generate a report for one of our departments that pulls data from 3 tables. The basic structure is below:
CREATE TABLE `evd_us_county_mio_2018q4` (
`state_code` INT(11) NULL DEFAULT NULL,
`state_abbr` VARCHAR(10) NULL DEFAULT NULL,
`county_code` VARCHAR(10) NULL DEFAULT NULL,
`county_name` VARCHAR(100) NULL DEFAULT NULL,
`vehicle_category` VARCHAR(100) NULL DEFAULT NULL,
`vehicle_segmentation` VARCHAR(200) NULL DEFAULT NULL,
`vehicle_make` VARCHAR(150) NULL DEFAULT NULL,
`vehicle_model` VARCHAR(200) NULL DEFAULT NULL,
`vehicle_model_year` INT(11) NULL DEFAULT NULL,
`vehicle_fuel_type` VARCHAR(50) NULL DEFAULT NULL,
`vehicle_engine_displacement_ccs` VARCHAR(50) NULL DEFAULT NULL,
`vehicle_engine_cylinders` INT(11) NULL DEFAULT NULL,
`vehicle_count` INT(11) NULL DEFAULT NULL
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB
;
CREATE TABLE `codes_usa` (
`id` MEDIUMINT(9) NOT NULL AUTO_INCREMENT,
`company_id` SMALLINT(6) NOT NULL,
`postal_code` VARCHAR(20) NOT NULL,
`place_name` VARCHAR(180) NOT NULL COMMENT 'city',
`admin_name1` VARCHAR(100) NOT NULL COMMENT 'state',
`admin_code1` VARCHAR(20) NOT NULL COMMENT 'state_alpha',
`admin_name2` VARCHAR(100) NOT NULL COMMENT 'county_name',
`admin_code2` VARCHAR(20) NOT NULL,
`admin_name3` VARCHAR(100) NOT NULL,
`admin_code3` VARCHAR(20) NOT NULL,
`latitude` FLOAT(10,6) NOT NULL,
`longitude` FLOAT(10,6) NOT NULL,
`accuracy` TINYINT(3) UNSIGNED NOT NULL,
PRIMARY KEY (`id`),
INDEX `postal_code` (`postal_code`),
INDEX `company_id` (`company_id`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
AUTO_INCREMENT=43644
;
CREATE TABLE `company` (
`id` SMALLINT(5) NOT NULL AUTO_INCREMENT,
`account_number` VARCHAR(10) NOT NULL,
`distributorship` TINYINT(1) UNSIGNED NOT NULL DEFAULT '1' COMMENT 'added 11-21-17 for the Directory page to distinguish between companies that were actual distributorships and other companies or resellers',
`name` VARCHAR(70) NULL DEFAULT NULL,
`address` VARCHAR(100) NULL DEFAULT NULL,
`address2` VARCHAR(100) NULL DEFAULT NULL,
`address3` VARCHAR(100) NULL DEFAULT NULL,
`address4` VARCHAR(100) NULL DEFAULT NULL,
`city` VARCHAR(60) NULL DEFAULT NULL,
`state_province_region` VARCHAR(60) NULL DEFAULT NULL,
`postal_code` VARCHAR(12) NULL DEFAULT NULL,
`country` CHAR(60) NULL DEFAULT NULL,
`shipping_address` VARCHAR(100) NULL DEFAULT NULL,
`shipping_address2` VARCHAR(100) NULL DEFAULT NULL,
`shipping_address3` VARCHAR(100) NULL DEFAULT NULL,
`shipping_address4` VARCHAR(100) NULL DEFAULT NULL,
`shipping_city` VARCHAR(60) NULL DEFAULT NULL,
`shipping_state_province_region` VARCHAR(60) NULL DEFAULT NULL,
`shipping_postal_code` VARCHAR(12) NULL DEFAULT NULL,
`shipping_country` VARCHAR(60) NULL DEFAULT NULL,
`UPS_address` VARCHAR(100) NULL DEFAULT NULL,
`UPS_address2` VARCHAR(100) NULL DEFAULT NULL,
`UPS_address3` VARCHAR(100) NULL DEFAULT NULL,
`UPS_address4` VARCHAR(100) NULL DEFAULT NULL,
`UPS_city` VARCHAR(60) NULL DEFAULT NULL,
`UPS_state_province_region` VARCHAR(60) NULL DEFAULT NULL,
`UPS_postal_code` VARCHAR(12) NULL DEFAULT NULL,
`UPS_country` VARCHAR(60) NULL DEFAULT NULL,
`phone` VARCHAR(25) NULL DEFAULT NULL,
`fax` VARCHAR(25) NULL DEFAULT NULL,
`web` VARCHAR(100) NULL DEFAULT NULL,
`zone` TINYINT(1) NULL DEFAULT NULL,
`zone_country_codes` VARCHAR(255) NULL DEFAULT 'USA' COMMENT 'Country code to reference the postal_codes table',
`date_established` DATE NULL DEFAULT NULL,
`order_frequency` TINYINT(3) NULL DEFAULT NULL,
`on_hand_inventory` TINYINT(3) NULL DEFAULT NULL,
`active` TINYINT(1) NOT NULL DEFAULT '0',
`last_updated` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`deactivation_date` DATE NOT NULL DEFAULT '0000-00-00',
PRIMARY KEY (`id`),
UNIQUE INDEX `account_number` (`account_number`),
INDEX `company_name` (`name`),
INDEX `zone_country_codes` (`zone_country_codes`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
AUTO_INCREMENT=908
;
UPDATE
I originally though my results were accurate; after showing a sample to the department that will be using the data, and looking at previous years numbers, I determined my query is still not quite right. The following query is not returning accurate results:
SELECT a.state_abbr, a.county_name, SUM(a.vehicle_count) AS vehicle_count, c.NAME AS distributorship
FROM import_experian.evd_us_county_mio_2018q4 a
LEFT JOIN postal.codes_usa b
ON a.state_abbr=b.admin_code1 AND a.county_name = b.admin_name2
LEFT JOIN headquarters.company c
ON b.company_id=c.id
WHERE a.vehicle_category = 'Offroad'
GROUP BY a.state_abbr, a.county_name
ORDER BY a.state_abbr, a.county_name
LIMIT 100
I know this is inaccruate because querying for the total vehicle count in AK only returns ~27k:
SELECT a.state_abbr, sum(a.vehicle_count) AS vehicle_count
FROM import_experian.evd_us_county_mio_2018q4 a
WHERE a.vehicle_category = 'Offroad' AND a.state_abbr = 'AK'
and GROUP BY a.county_name
shows far fewer vehicles in each county as well:
SELECT a.state_abbr, a.county_name, sum(a.vehicle_count) AS vehicle_count
FROM import_experian.bg_products_evd_us_county_mio_2018q4 a
WHERE a.vehicle_category = 'Offroad' AND a.state_abbr = 'AK'
GROUP BY a.county_name
It seems when I addd the JOIN
statements that the SUM()
of vehicle_count
is being multiplied if a company exists. For example, notice that without the JOIN
statement, Anchorage shows 8,193
for the vehicle_count
which is the correct number, the same for Fairbanks North Star:
Comparing the above results to the result from the query with JOIN
statements reveals something interesting:
Achorage is not associated with any company at this time and the SUM() is accurate while Fairbanks North Star does have a company associated an the SUM() is X times the true value.
What am I missing here? How can I fix my query?
Thank you!
Best Answer
You will have to add a full GROUP BY to get deterministic result:
For efficiency, you may want to create an index on:
If vehicle_category is very selective you may change that to:
In addition to that, you may want to create an index on:
One way to add a YEAR dimension to the result, is to add a constant to each leg of a union like:
That you get duplicate results is a consequence of that you don't have any real keys in your tables (an autogenerated id is useless when it comes to integrity issues). To figure out what is wrong you need to think through what the real keys should be, group by them and apply having count(1) > 1. Those rows are your culprits.