I'm looking at analyzing a kaggle dataset which is pretty normalized, specifically joining, grouping, slicing and dicing the 3 tables patients(2M) <-> diagnosis(700M) <-> doctors(2M)
.
I have 30GB RAM, 300GB SSD machine, so I'm able to analyze using a good old c program (I practically malloc and load everything to memory…), but I want to employ the power of ad-hoc SQL queries.
I tried different MySql, SQLLite and even mini-hadoop configurations, both on my machine and on google compute cloud and amazon EC (just to make sure that its not a config/env problem).
I can't find anything that will give me results in less than a couple of minutes to any ad-hoc query.. and if I didn't go thru the hassle of building indexes ad-hoc performance was atrocious (hours or so).
Am I missing something obvious? If its possible for my c program to fit everything in memory and do crude analysis, isn't there a database which loads everything to the buffer cache and gives lightning-fast performance for all the wonderful SQL analytics?
I know this might some a bit naive, but please go easy on me – I'm used to huge Teradata/Hadoop clusters doing heavylifting of many TB of data in very good timings, so I expected this to be very easy to handle…
Best Answer
(Too long for a comment...)
MySQL...
SHOW VARIABLES LIKE '%buffer%';
-- the main configuraton is the cache allocation. But it depends on which Engine you are using and how much RAM you have.Sorry, but we need to check the details. Please provide
SHOW CREATE TABLE
, some slow queries plus theirEXPLAINs
.Are you CPU-bound? Then the queries and/or indexes need attending to.
Are you I/O-bound? Then probably indexes.
Edit 1
Glancing at the "schema", I see something naughty:
diagnosis_code
isNUMBER
in one table andVARCHAR(30)
in another. This makes it impossible to use an index whenJOINing
.So, I repeat, let's see the MySQL version of
CREATE TABLE
and the queries.