Here is another database problem I stumbled upon.
I have a date-range partitioned MyIsam look-up table with 200M records and ~150 columns. On this Table I need to perform cascading SELECT-Statements to filter the data. Output:
filter 126M
filter 110M
filter 40M
filter 5M
filter 100k
Every single SELECT is highly complex with regex (=no index possible) and multiple comparisons, which is why I want them to query the least amount of rows possible.
There are about 500 unique filters and around 200 constant users. Every filter needs to be run for each user, in total around 100k combinations.
Big question: Is there a way for each subsequent SELECT statement to query only the previous subset?
Example: Filter #5 should only have to query the 5M rows out of query 4 to get those 100k results. At the moment it has to scan through all 200M records.
EDIT
current approach:
cache table
CREATE TABLE IF NOT EXISTS `cache` (
`filter_id` int(11) NOT NULL,
`user_id` int(11) NOT NULL,
`lookup_id` int(11) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
ALTER TABLE `cache`
ADD PRIMARY KEY (`filter_id`,`user_id`);
This would contain the relation between individual data-rows from the lookup table and the filters. PLUS I'd be able to use the rrimary index to get all of the lookup_ids from the previous filter.
Query for subsequent filters:
SELECT SUM( column), COUNT(*)
FROM cache c
LEFT JOIN lookup_table l ON c.lookup_id= l.id
WHERE
c.filter_id = 1
AND c. user_id= x
AND l.regex_column = preg_rlike...
Best Answer
If
filter 110M
is exactly a subset offilter 126M
, then appending moreAND
withWHERE
would have done the job.If that is complex to accomplish try to CREATE VIEW of the previous SELECT statement and the next SELECT statement should query from the view.