I have been trying to create a "gaps and islands" identification routine in MySQL. See the previous question regarding setup here and computational resources required here.
I increased the RAM of the database server to 4 Gb. The query was running ok, but now I am having an issue where the database server runs out of disk space while running the query.
The query looks at the entire database and inserts islands that it finds into a table called shutdown_events
. I use INSERT IGNORE
and look at the whole database so that I don't cut off islands that span some sort of a timestamp constraint (I try adding this in, read on..).
I created the view with the following SQL (there are ~30 dr.*_sd
values to include)
create view sd_pivot as
select dr.wellsite_id, 'EngOilP_sd' as sd, dr.timestamp, dr.EngOilP_sd as val from datarecords dr
union all
select dr.wellsite_id, 'Stg1ScrbLVL_sd', dr.timestamp, dr.Stg1ScrbLVL_sd from datarecords dr
union all
select dr.wellsite_id, 'Stg2ScrbLVL_sd', dr.timestamp, dr.Stg2ScrbLVL_sd from datarecords dr
union all
...
and then the SQL that populates the shutdownevents
table is here:
INSERT IGNORE INTO shutdownevents (wellsite_id, sd_name, start, end)
SELECT t.*
FROM (
SELECT wellsite_id, sd, MIN(timestamp) AS starttime, MAX(timestamp) AS endtime
FROM (
SELECT
sd_p.*
, @val_change := IF(@prev_val != sd_p.val, @val_change + 1, @val_change) AS vc
, @prev_val := sd_p.val
FROM
sd_pivot sd_p
, (SELECT @prev_val := NULL, @val_change := 0) var_init_subquery
ORDER BY wellsite_id, sd, timestamp
) sq
WHERE val = 1
# AND timestamp > '{two_weeks_ago}'
GROUP BY sd, vc
) t
ORDER BY wellsite_id, sd, starttime
In my testing database, I added the line (commented out in the above SQL) AND timestamp > '{two_weeks_ago}'
to only find islands that occur after that date. Note that I sacrifice the beginning of islands that span that date. I thought that this would prevent the query from trying to look at so much data and thus using so much disk space.
The problem is that trying to run the above query still eventually uses up all of the available disk space and subsequently fails. The database server has 30 Gb of disk space. Before I run the query, there is ~2 Gb used up (28 Gb free).
The database is about 1.5 Gb in size. 99.99% of the database size is the datarecords
table. It has > 2M rows, the other tables are a few hundred rows at most.
Here is a shot from htop
on the database server once the disk has run out of memory and the query is about to fail.
The actual disk space is being taken up in the folder /tmp
and the file names are like below
-rw-rw---- 1 mysql mysql 5.9G Nov 30 16:40 #sql_3a1_0.MAD
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:22 #sql_3a1_0.MAI
-rw-rw---- 1 mysql mysql 2.6G Nov 30 17:09 #sql_3a1_1.MAD
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:40 #sql_3a1_1.MAI
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:22 #sql_3a1_3.MAD
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:22 #sql_3a1_3.MAI
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:22 #sql_3a1_5.MAD
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:22 #sql_3a1_5.MAI
-rw-rw---- 1 mysql mysql 5.9G Nov 30 16:44 #sql_3a1_6.MAD
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:29 #sql_3a1_6.MAI
-rw-rw---- 1 mysql mysql 2.3G Nov 30 17:09 #sql_3a1_7.MAD
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:45 #sql_3a1_7.MAI
-rw-rw---- 1 mysql mysql 5.9G Nov 30 16:45 #sql_3a1_8.MAD
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:30 #sql_3a1_8.MAI
-rw-rw---- 1 mysql mysql 2.1G Nov 30 17:09 #sql_3a1_9.MAD
-rw-rw---- 1 mysql mysql 8.0K Nov 30 16:45 #sql_3a1_9.MAI
If I $ service mysql restart
, all of the above files disappear and I get my disk space back.
So, I this question has a couple answers addressing one or more of the following;
- How might I rewrite the query to not be so resource intensive?
- Is there a different way to approach this problem?
At the end of the day, I simply need to identify events that occur and their duration using their respective boolean column that updates every few minutes
EDIT
Here is a link to the EXPLAIN
output of the query. On the second tab is a DESCRIBE
of datarecords
The line where I attempt to limit the date range of the query is calculated in the python
script. The actual SQL to the database ends up being
...
AND timestamp > '2015-11-20'
...
Best Answer
Your problem is due to 2 reported bugs:
UNION
orUNION ALL
MySQL creates a temporary working table. While it is justifiable forUNION
, it is unnecessary forUNION ALL
Bug #50674: Do not create temporary tables for UNION ALL
This bug was scheduled to be fixed in ver. 5.7.3.
Bug #44626: Tables are reopened multiple times if used in same query, buffers allocated too
So in your case it creates 38 temporary tables consuming lots of resources.
One possible solution
is to use dynamic SQL and
CURSOR
to go through every of 38 columns. This way you don't need a view with 38UNION ALL
statements.