Get qualifying rows only
One way:
WITH x AS (
SELECT *
,row_number() OVER (PARTITION BY reference ORDER BY date_out DESC) AS rn
FROM tbl
)
, y AS (
SELECT *
FROM x
WHERE x.rn = 1
AND date_out >= '2012-10-01'
AND date_out < '2012-11-01'
)
, z AS (
SELECT x.*
FROM x
JOIN y USING (reference)
WHERE x.rn = 2
AND x.centreid = 1
)
SELECT y.taskid, y.reference, y.centreid, y.date_out
FROM y
JOIN z USING (reference)
UNION ALL
SELECT taskid, reference, centreid, date_out
FROM z
ORDER BY reference, date_out;
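If you want to watch the rn = 1 / rn = 2 pairing work without a Postgres instance at hand, here is a minimal sketch using Python's sqlite3 (needs SQLite 3.25+ for window functions; the sample rows and schema are made up to match the queries above):

```python
import sqlite3

# Assumed illustrative schema: (taskid, reference, centreid, date_out).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tbl (taskid INTEGER, reference INTEGER, centreid INTEGER, date_out TEXT);
INSERT INTO tbl VALUES
  (1, 100, 1, '2012-09-15'),
  (2, 100, 2, '2012-10-05'),   -- rn = 1 for ref 100, falls in October
  (3, 200, 1, '2012-10-10'),   -- rn = 1 for ref 200, but its rn = 2 row fails centreid = 1
  (4, 200, 3, '2012-09-01');
""")
rows = con.execute("""
WITH x AS (
    SELECT *, row_number() OVER (PARTITION BY reference
                                 ORDER BY date_out DESC) AS rn
    FROM tbl
)
SELECT y.taskid, z.taskid          -- latest row paired with its predecessor
FROM x y
JOIN x z USING (reference)
WHERE y.rn = 1
  AND y.date_out >= '2012-10-01' AND y.date_out < '2012-11-01'
  AND z.rn = 2
  AND z.centreid = 1
""").fetchall()
print(rows)  # [(2, 1)] -- only reference 100 qualifies
```

Reference 200 drops out because its second-latest row has centreid = 3, exactly as the WHERE clause on the rn = 2 row demands.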
Another way:
WITH x AS (
SELECT *
,row_number() OVER (PARTITION BY reference ORDER BY date_out DESC) AS rn
FROM tbl
)
,y AS (
SELECT x.*, y.taskid AS taskid2, y.centreid AS centreid2, y.date_out AS date_out2
FROM x
JOIN x y USING (reference)
WHERE x.rn = 1
AND x.date_out >= '2012-10-01'
AND x.date_out < '2012-11-01'
AND y.rn = 2
AND y.centreid = 1
)
SELECT y.taskid, y.reference, y.centreid, y.date_out
FROM y
UNION ALL
SELECT y.taskid2, y.reference, y.centreid2, y.date_out2
FROM y
ORDER BY reference, date_out;
I'd expect the second one to be faster, but that depends on your data distribution. Test with EXPLAIN ANALYZE.
Get all rows for qualifying references
WITH x AS (
SELECT *
,row_number() OVER (PARTITION BY reference ORDER BY date_out DESC) AS rn
FROM tbl
)
,y AS (
SELECT reference
FROM x
JOIN x y USING (reference)
WHERE x.rn = 1
AND x.date_out >= '2012-10-01'
AND x.date_out < '2012-11-01'
AND y.rn = 2
AND y.centreid = 1
)
SELECT *
FROM tbl
JOIN y USING (reference)
ORDER BY reference, date_out;
-> sqlfiddle
Answer to follow-up in comment
Separate groups if more than 30 days between entries.
WITH a AS (
SELECT *
,lag(date_out) OVER (PARTITION BY reference ORDER BY date_out DESC) AS last_date
,CASE WHEN date_out >
(lag(date_out) OVER (PARTITION BY reference ORDER BY date_out DESC) - 30)
THEN 0 ELSE 1
END AS step
FROM tbl
)
,b AS (
SELECT *
,sum(step) OVER (PARTITION BY reference ORDER BY date_out DESC) AS grp
FROM a
)
,c AS (
SELECT *
,row_number() OVER (PARTITION BY reference, grp ORDER BY date_out DESC) AS rn
FROM b
)
,d AS (
SELECT reference, grp
FROM c
JOIN c d USING (reference, grp)
WHERE c.rn = 1
AND c.date_out >= '2012-10-01'
AND c.date_out < '2012-11-01'
AND d.rn = 2
AND d.centreid = 1
)
SELECT b.taskid, b.reference, b.centreid, b.date_out
FROM b
JOIN d USING (reference, grp)
ORDER BY reference, date_out;
-> sqlfiddle
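The step/sum "gaps and islands" trick in that query can be tried out with Python's sqlite3 as well (SQLite 3.25+; note that SQLite needs date(..., '-30 days') where Postgres can simply subtract an integer from a date; the sample rows are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl (reference INTEGER, date_out TEXT)")
con.executemany("INSERT INTO tbl VALUES (?, ?)", [
    (100, '2012-10-20'),
    (100, '2012-10-01'),   # 19-day gap to the next entry -> same group
    (100, '2012-07-01'),   # 92-day gap -> starts a new group
])
rows = con.execute("""
WITH a AS (
    SELECT *,
           CASE WHEN date_out >
                date(lag(date_out) OVER (PARTITION BY reference
                                         ORDER BY date_out DESC), '-30 days')
                THEN 0 ELSE 1 END AS step
    FROM tbl
)
SELECT date_out,
       sum(step) OVER (PARTITION BY reference ORDER BY date_out DESC) AS grp
FROM a
ORDER BY date_out DESC
""").fetchall()
print(rows)  # [('2012-10-20', 1), ('2012-10-01', 1), ('2012-07-01', 2)]
```

Each row where the gap exceeds 30 days (or where lag() is NULL, i.e. the newest row) contributes a step of 1, and the running sum turns those steps into group numbers.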
But while pure SQL is a beauty, I would solve this procedurally in a plpgsql function, very similar to this recent answer on SO. It would be faster, because it can be done in a single table scan.
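The shape of that one-pass logic can be sketched outside the database. This is a hypothetical Python analogue, not the plpgsql function itself: rows arrive pre-sorted (as a cursor ordered by reference, date_out DESC would deliver them), and we only ever hold the top two rows per reference:

```python
from itertools import groupby
from operator import itemgetter

# Invented sample rows: (taskid, reference, centreid, date_out),
# already sorted by reference, date_out DESC.
rows = [
    (2, 100, 2, '2012-10-05'),
    (1, 100, 1, '2012-09-15'),
    (3, 200, 1, '2012-10-10'),
    (4, 200, 3, '2012-09-01'),
]

result = []
for reference, grp in groupby(rows, key=itemgetter(1)):
    latest = next(grp)            # what rn = 1 picks out
    previous = next(grp, None)    # what rn = 2 picks out, if it exists
    if (previous is not None
            and '2012-10-01' <= latest[3] < '2012-11-01'   # date_out in October
            and previous[2] == 1):                          # centreid = 1
        result.extend([latest, previous])
print(result)  # the two qualifying rows of reference 100
```

One ordered pass, no self-join, no second window scan; that is why the procedural version can win.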
OBSERVATION #1
You have this:
CREATE TABLE mytable (id1 int, id2 int, score float) ENGINE=MyISAM;
LOAD DATA INFILE '50-billion-records.txt' INTO mytable (id1, id2, score);
ALTER TABLE mytable ADD INDEX id1_index (id1);
This is how you should load the MyISAM table:
CREATE TABLE mytable (id1 int, id2 int, score float,key (id1)) ENGINE=MyISAM;
ALTER TABLE mytable DISABLE KEYS;
LOAD DATA INFILE '50-billion-records.txt' INTO mytable (id1, id2, score);
ALTER TABLE mytable ENABLE KEYS;
When you disable keys, it stops the .MYI file from updating nonunique indexes. During the LOAD DATA INFILE, the .MYI file will not grow because the table contains no unique indexes or primary key, so the load primarily focuses on writing the .MYD.
The ENABLE KEYS phase will then do a read pass through the .MYD and linearly build all the nonunique indexes. In your case, it will build the id1 index.
OBSERVATION #2
As for the bug you seem to be experiencing, think of this:
- 25 billion rows
- 4-byte integer for id1
- 100 billion bytes = 93.13G
- That's bigger than the 80G myisam_sort_buffer_size
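Quick sanity check of that arithmetic:

```python
# 25 billion rows x 4-byte id1 values, expressed in GiB.
rows = 25_000_000_000
bytes_per_id1 = 4
total_bytes = rows * bytes_per_id1    # 100 billion bytes
gib = total_bytes / 2**30
print(f"{gib:.2f} GiB")               # ~93.13 GiB, over the 80G buffer
```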
IMPLICATIONS
- It is possible that incomplete or aborted indexing operations lost index pages that should have been written to the .MYI.
- It is also possible that doing DISABLE KEYS makes your query run because the id1 index is somehow being ignored. This could be the case since the EXPLAIN plan says type ALL and Possible Keys NULL. That's a full table scan, so the SELECT query will work.
- With the keys enabled but incompletely written, missing nonleaf node info would lead to SELECT queries that are destined to fail, because some of the id1 values are not among the nonleaf nodes in the .MYI.
SUGGESTIONS
Try running the LOAD DATA INFILE with my proposed code:
CREATE TABLE mytable (id1 int, id2 int, score float,key (id1)) ENGINE=MyISAM;
ALTER TABLE mytable DISABLE KEYS;
LOAD DATA INFILE '50-billion-records.txt' INTO mytable (id1, id2, score);
ALTER TABLE mytable ENABLE KEYS;
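As an aside, the same build-the-index-after-the-load pattern can be illustrated with Python's sqlite3. SQLite has no DISABLE/ENABLE KEYS, so creating the index after the bulk load stands in for the ENABLE KEYS pass; the row count here is made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mytable (id1 INTEGER, id2 INTEGER, score REAL)")
# Bulk-load first, with no index to maintain row by row.
con.executemany("INSERT INTO mytable VALUES (?, ?, ?)",
                ((i, i * 2, i * 0.5) for i in range(10_000)))
# One linear index build afterwards -- the ENABLE KEYS analogue.
con.execute("CREATE INDEX id1_index ON mytable (id1)")
n = con.execute("SELECT count(*) FROM mytable WHERE id1 = 42").fetchone()[0]
print(n)  # 1 -- the index is complete and usable after the load
```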
Also, raise myisam_sort_buffer_size to 100G.
Give it a Try !!!
UPDATE 2013-08-12 21:37 EDT
Repair By KeyCache is the result of hitting the max sort_buffer_size and myisam_sort_file_size; when that happens, MySQL elects to sift through keys in the key cache instead.
There are three (3) options you could further adjust:
OPTION #1
Set the sort_buffer_size to 16K. That's the minimum value allowed.
OPTION #2
Set the tmp_table_size to 1K. That's the minimum value allowed.
OPTION #3
Set the myisam_sort_buffer_size to its max value of 9223372036854775807
This should further prevent the Repair By KeyCache problem.
Give it a Try !!!
Best Answer
Broadly speaking, on a MyISAM table with a range scan, the process is:
- Find the first matching record on the BTREE (inside the .MYI file) and access the row result (on the .MYD file) - Handler_read_key
- Scan the index sequentially from there and access each subsequent row - Handler_read_next
You can actually confirm this plan by observing that you get a range join type on EXPLAIN and by watching the Handler_* counters on SHOW SESSION STATUS.
Theoretically, the first step is O(log n), where n is the number of records indexed (the table size), while the second is O(m), where m is the number of records returned. So, theoretically, a larger table will take longer to return the records. Why do I say theoretically? Because the O() notation can be deceitful if you do not take the constants into account. Indexes usually end up in memory, while rows (especially on MyISAM, which has no dedicated buffer for data) may be on disk, so the difference in performance between the two operations is large. Also, MyISAM has problems with large tables anyway, so the number of index levels tends not to grow too large.
Let me show you an unrelated graph:
In the above graph, the full table scan (blue line) should be flat, because all rows are examined regardless, but it is not, mainly because at that point reading and returning 16M rows is more costly than returning only 1.
So the answer is: both operations take time, and which one dominates depends on the actual values of m and n, plus the state of the database in terms of hardware speed (memory, disk) and the state of the buffers (filesystem cache, key buffer). In typical usage, an index scan for a single row is a very fast operation, but it depends what you compare it to, and whether you take extreme cases into account, like large tables where the BTREE index doesn't fit into memory.
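To put rough numbers on the O(log n) step: with an assumed fan-out of 1024 keys per BTREE page (an illustrative figure, not a MyISAM constant), the index depth grows very slowly with table size:

```python
import math

# Estimated B-tree depth = ceil(log_fanout(n)); fanout of 1024 is assumed.
fanout = 1024
depths = [math.ceil(math.log(n, fanout)) for n in (10**6, 10**9, 10**12)]
print(depths)  # [2, 3, 4]
```

A million-fold increase in table size adds only a couple of index levels, while the row-fetch cost stays simply proportional to m, the rows returned; that asymmetry is the whole point above.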