MySQL Performance – Running Query on Improperly Indexed 600GB Table

Tags: index, innodb, MySQL, performance, query-performance

This is my first post on here and I'm really baffled by this. I have joined a team, and somewhere in the past a horrific table was created that has bloated to 600+ GB of data.

Originally this data was supposed to come in for one state, but data from the entire country has been accumulating here for a long period of time.

I have been put in charge of running a query to isolate a LATITUDE/LONGITUDE bounding box, within a certain date range, in this monstrous data set.

To make things even worse, the database is located on an old server that runs on a quad P3 processor with only about 4 GB of RAM, so it's incredibly slow.

Here is the query I would like to execute but, for obvious reasons, cringe just looking at:

INSERT INTO gpsdata.vehicle_location_wa SELECT * FROM gpsdata.vehicle_location 
WHERE LOCATION_TIMESTAMP BETWEEN '2014-03-20' AND '2014-07-20' 
AND LATITUDE BETWEEN 46.86275036 AND 47.80149551
AND LONGITUDE BETWEEN -122.44599707 AND -116.94458691;

Table structure:

CREATE TABLE `vehicle_location` (
  `SEQ_ID` decimal(10,0) DEFAULT NULL,
  `DEVICE_ID` varchar(100) COLLATE utf8_unicode_ci DEFAULT NULL,
  `DATA_TYPE` varchar(80) COLLATE utf8_unicode_ci DEFAULT NULL,
  `DATA_DESC` varchar(180) COLLATE utf8_unicode_ci DEFAULT NULL,
  `SPEED` decimal(6,3) DEFAULT '0.000',
  `SPEED_UOM` char(3) COLLATE utf8_unicode_ci DEFAULT NULL,
  `DIRECTION` decimal(3,0) DEFAULT NULL,
  `DURATION` decimal(10,0) DEFAULT '0',
  `LATITUDE` decimal(16,13) DEFAULT NULL,
  `LONGITUDE` decimal(16,13) DEFAULT NULL,
  `GPS_STATUS` tinyint(4) DEFAULT NULL,
  `LOCATION_TIMESTAMP` datetime DEFAULT NULL,
  `TIMEZONE` tinyint(1) DEFAULT NULL,
  `DST` tinyint(1) DEFAULT NULL,
  `STATUS` tinyint(1) DEFAULT NULL,
  `MILEAGE` decimal(15,2) DEFAULT '0.00',
  `QUEUE_TIME` datetime DEFAULT NULL,
  `OBSERVATION_TIME` datetime DEFAULT NULL,
  `ID` bigint NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`ID`),
  KEY `DEVICE_ID` (`DEVICE_ID`),
  KEY `Location_Timestamp` (`LOCATION_TIMESTAMP`),
  KEY `Data_Type` (`DATA_TYPE`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

Some things I thought about doing:

  • Selecting only by date range into a new table with two extra indexes, on LATITUDE and LONGITUDE; a rough sketch follows this list [seems like the best option]
  • Moving the database onto a much faster machine with loads of RAM (this has its own issues: the whole database is 3.5 TB of data on a RAID)
  • Adding an index to the table (gasp)
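
Roughly what I have in mind for the first option (the table and index names are just placeholders):

    -- Stage only the date range into a new table, then index the coordinates.
    -- Note that CREATE TABLE ... LIKE also copies the original keys, which the
    -- bulk INSERT will have to maintain; this is only a sketch.
    CREATE TABLE gpsdata.vehicle_location_daterange LIKE gpsdata.vehicle_location;

    INSERT INTO gpsdata.vehicle_location_daterange
    SELECT * FROM gpsdata.vehicle_location
    WHERE LOCATION_TIMESTAMP BETWEEN '2014-03-20' AND '2014-07-20';

    -- Add the coordinate indexes after loading, so the bulk insert
    -- doesn't have to maintain them row by row
    ALTER TABLE gpsdata.vehicle_location_daterange
        ADD KEY lat (LATITUDE),
        ADD KEY lon (LONGITUDE);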

I'm hoping someone with a bit more experience can help guide me in the right direction or affirm some of my thinking before I start executing queries that could take days to process.

Thank you for your help!

Best Answer

You have a big problem for one reason: B-tree indexes (the only index type InnoDB offers) are highly inefficient for filtering on more than one range.

There is one thing that you must understand: in general, using more than one index per table access is not possible or not efficient (there are some cases where an index_merge union is faster, but those are exceptions). In general, for AND conditions you want a single index over several columns.

So, no problem then, is it? We just index (LOCATION_TIMESTAMP, LATITUDE, LONGITUDE), right? Sorry, no. Because you filter with a range on every column, only the first column of the index will be used: rows matching the LOCATION_TIMESTAMP range are contiguous in the index, but within that slice the LATITUDE and LONGITUDE values are scattered, so the remaining conditions can only be applied as filters, not as index seeks. This is easy to see on a drawing of the tree, so I hope you can trust me on it.

There is an exception to this: a new feature in 5.6 called index condition pushdown (ICP). It doesn't solve the problem completely, but it makes the 3-column index more useful than a 1-column one.
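
To illustrate, a sketch of how this shows up in EXPLAIN (the index name ts_lat_lon is made up, and on a 600+ GB table the ALTER itself would run for a very long time):

    ALTER TABLE gpsdata.vehicle_location
        ADD KEY ts_lat_lon (LOCATION_TIMESTAMP, LATITUDE, LONGITUDE);

    EXPLAIN SELECT * FROM gpsdata.vehicle_location
    WHERE LOCATION_TIMESTAMP BETWEEN '2014-03-20' AND '2014-07-20'
      AND LATITUDE BETWEEN 46.86275036 AND 47.80149551
      AND LONGITUDE BETWEEN -122.44599707 AND -116.94458691;
    -- key_len will show that only the LOCATION_TIMESTAMP prefix of the
    -- index is used for the range scan; on 5.6 the Extra column adds
    -- "Using index condition" (ICP), meaning the LATITUDE/LONGITUDE checks
    -- are at least evaluated against index entries instead of full rows.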

So what are the options?

  1. Convert some of the conditions to the ref type; that means converting them into equality checks. That is something that can be done with the timestamps. It may sound a bit weird, but indexing (LOCATION_TIMESTAMP, LATITUDE) and having a query like this:

    WHERE LOCATION_TIMESTAMP IN ('2014-03-20', '2014-03-21', ..., '2014-07-20') AND 
    LATITUDE BETWEEN 46.86275036 AND 47.80149551;
    

    may actually be much faster. Of course, this depends on the selectivity of the clauses, etc. Note also that LOCATION_TIMESTAMP is a DATETIME, so an IN list of bare dates only matches rows at exactly midnight; in practice you would need to enumerate every distinct value or maintain a separate DATE column to index.

    Can we make the index useful for the coordinates as well? In some cases, yes: you could create an additional column called LATITUDE_INT holding FLOOR(LATITUDE), and write something like:

    WHERE LOCATION_TIMESTAMP IN ('2014-03-20', '2014-03-21', ..., '2014-07-20') AND 
    LATITUDE_INT IN (46, 47) AND
    LATITUDE BETWEEN 46.86275036 AND 47.80149551 AND 
    LONGITUDE BETWEEN -122.44599707 AND -116.94458691;
    

    All 3 columns of an index on (LOCATION_TIMESTAMP, LATITUDE_INT, LONGITUDE) will be used in this case: two equality (ref) lookups plus a range on LONGITUDE, with the precise LATITUDE condition applied as a filter. But to be frank, this starts to become a bit crazy. A sketch of the required schema change follows.
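
    A sketch of the extra column and index (names are illustrative; 5.5 has no generated columns, so LATITUDE_INT has to be kept in sync by the application or by triggers, and on a table this size both statements are very expensive):

    -- Discretized latitude plus a 3-column index usable by the query above
    ALTER TABLE gpsdata.vehicle_location
        ADD COLUMN LATITUDE_INT smallint DEFAULT NULL,
        ADD KEY ts_latint_lon (LOCATION_TIMESTAMP, LATITUDE_INT, LONGITUDE);

    -- Backfill existing rows (a full-table update)
    UPDATE gpsdata.vehicle_location
    SET LATITUDE_INT = FLOOR(LATITUDE);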

  2. Use the geospatial extensions. MySQL has support for spatial types like POINT, designed precisely for geographical applications, with its own index format (R-trees). These can index several dimensions at the same time (so they speed up multi-range searches) and come with their own optimised set of bounding-box functions (MBRContains() is what you want).

    This is great, isn't it? Bad news: it is only available for MyISAM in 5.5 and 5.6, and it will only be available for InnoDB in 5.7. The migration is also not very transparent (not as easy as just adding an index).
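
    A rough sketch of that migration, assuming a MyISAM copy on 5.5/5.6 (the table, column, and index names are illustrative):

    -- Spatial indexes here require MyISAM and a NOT NULL geometry column
    CREATE TABLE gpsdata.vehicle_location_geo (
        ID BIGINT NOT NULL,
        LOCATION_TIMESTAMP datetime DEFAULT NULL,
        PT point NOT NULL,  -- POINT(X, Y); longitude-first is a convention choice
        PRIMARY KEY (ID),
        SPATIAL KEY (PT)
    ) ENGINE=MyISAM;

    INSERT INTO gpsdata.vehicle_location_geo
    SELECT ID, LOCATION_TIMESTAMP, POINT(LONGITUDE, LATITUDE)
    FROM gpsdata.vehicle_location;

    -- The polygon is the bounding box from the original query
    SELECT * FROM gpsdata.vehicle_location_geo
    WHERE MBRContains(
            GeomFromText('POLYGON((-122.44599707 46.86275036,
                                   -116.94458691 46.86275036,
                                   -116.94458691 47.80149551,
                                   -122.44599707 47.80149551,
                                   -122.44599707 46.86275036))'),
            PT)
      AND LOCATION_TIMESTAMP BETWEEN '2014-03-20' AND '2014-07-20';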

So, the fact is that with a table that is effectively unindexed for this query, your options are very narrow, especially on 5.5. If you are a bit flexible about the table structure (partitioning, as you suggested) or willing to perform other tricks, there are ways to work around it, but it needs some care.

Some people end up combining MySQL with external tools precisely for this reason, and we will not have a real solution until 5.7.

Finally, there is the problem of the sheer size of the table. InnoDB is only efficient when most of the non-leaf pages of the primary key fit into memory. As ypercube says, a 3.5 TB database on a 4 GB server is something scary; inserting into that table must be slow. You will need more RAM, and you should think about partitioning the table, for example by state (if that works for your queries) or by timestamp, to take advantage of manual or automatic pruning.
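
A minimal sketch of range partitioning by timestamp (the partition names and boundaries are illustrative; MySQL requires the partitioning column to be part of every unique key, so the primary key has to be widened first, and on a table this size each ALTER is a full rebuild):

    -- Widening the PK also implies making LOCATION_TIMESTAMP NOT NULL,
    -- which fails if existing rows contain NULL timestamps
    ALTER TABLE gpsdata.vehicle_location
        DROP PRIMARY KEY,
        ADD PRIMARY KEY (ID, LOCATION_TIMESTAMP);

    ALTER TABLE gpsdata.vehicle_location
        PARTITION BY RANGE (TO_DAYS(LOCATION_TIMESTAMP)) (
            PARTITION p2014h1 VALUES LESS THAN (TO_DAYS('2014-07-01')),
            PARTITION p2014h2 VALUES LESS THAN (TO_DAYS('2015-01-01')),
            PARTITION pmax    VALUES LESS THAN MAXVALUE
        );

With something like that in place, the date predicate in the original query would only touch the relevant partitions.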