Less RAM Than Index_Length in MyISAM – MySQL Optimization

Tags: index, myisam, MySQL

I'm working on a database to store "time series" data (the value of X was Y at this time). The rows themselves are very small and static in size, with the primary key consisting of two SMALLINT columns, one TINYINT column, and one TIMESTAMP column. The index usage per row is very low (about 12 bytes), but the database will be used to store a very large amount of data.

So the problem is that the server will wind up having less physical RAM than the index_length in MySQL for that table. What are the implications of that happening? I know in theory Linux can swap the memory to disk, but will that duplicate the disk usage (since a .MYI file already exists)? What are the performance implications of not being able to store the entire index in RAM? Can I still expect sub-10ms SELECTs with SATA II drives in RAID 1?
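For reference, both numbers in that comparison are easy to check: the index size comes from information_schema, and the default key cache size from the server variables (a quick sketch, using the monitor.result table described below):

SELECT index_length
FROM information_schema.TABLES
WHERE table_schema = 'monitor' AND table_name = 'result';

SHOW GLOBAL VARIABLES LIKE 'key_buffer_size';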

In response to the first comment for more info

My question was more theoretical than practical at the moment. The project I'm working on is well funded enough that we can technically afford the RAM costs, but I'd prefer to know the implications of not having enough RAM to cover the indexes. That said, I'll attempt to answer.

Technically the project has unlimited RAM, so the only reason to have less of it is to keep costs down.

The data is stored in MyISAM tables for "historical" storage purposes, but exists in an NDBCluster for the first 24 hours or so (the NDB Cluster uses about 4x as much index RAM as MyISAM).

I can certainly upgrade the RAM, but doing so adds a lot of complexity.

The current index size is 2.29MB, but that number is rather meaningless on its own; right now I am just testing the index usage for the data structure. The 2.29MB covers 155,301 rows (about 15.5 bytes per row).
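That per-row figure is just index_length divided by the row count, both of which MySQL reports directly (a quick check; MyISAM maintains an exact row count, so the division is reliable):

SELECT index_length, table_rows,
       ROUND(index_length / table_rows, 1) AS index_bytes_per_row
FROM information_schema.TABLES
WHERE table_schema = 'monitor' AND table_name = 'result';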

So there is only 1 table I actually care about. The rest of them are very small in size. The structure of the table is as follows:

CREATE TABLE IF NOT EXISTS `monitor`.`result` (
  `server` SMALLINT UNSIGNED NOT NULL,
  `ref_id` SMALLINT UNSIGNED NOT NULL,
  `request` TINYINT UNSIGNED NOT NULL,
  `recorded` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `resolution` TINYINT NOT NULL,
  `value` MEDIUMINT UNSIGNED NOT NULL,
  PRIMARY KEY (`server`, `request`, `recorded`, `ref_id`)
) ENGINE = MyISAM;

The reason a "ref_id" column exists is to narrow down what the data set refers to, past the server level. For instance, we may have stats about a user or a device on a server.

Why I would need so much RAM

It might seem like the table above wouldn't use that much RAM, and for most practical cases it doesn't. But I would like to store as much data as possible; I understand that I can store less data, but I'd like the resolution of the data to be as high as it can be. Disk space is cheap, so I'm not remotely concerned with that cost; RAM, on the other hand, can become expensive. Even though the business model makes it feasible not to worry about RAM, I'd like to keep costs down wherever possible.

To put it in perspective, I'd like to store at maximum 100 stats every minute for each server that is monitored. You can see how the number of rows gets large quickly with a thousand servers (100 × 1000 × 1440 minutes/day × 365 days = 52,560,000,000 rows per year). The budget per year on a thousand servers is $120,000 (a lot of RAM), but the whole point is to keep costs down.
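As a worked version of that arithmetic (an estimate only, using the ~15.5 bytes/row index figure measured above):

-- 100 stats/min x 1000 servers x 1440 min/day x 365 days,
-- and the index that many rows would need at ~15.5 bytes each
SELECT 100 * 1000 * 1440 * 365 AS rows_per_year,
       ROUND(100 * 1000 * 1440 * 365 * 15.5 / POW(1024, 3), 1) AS index_gib_per_year;

That works out to roughly 760 GiB of index per year, which is where the "hundreds of gigabytes" mentioned below comes from.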

Refining the question

I really appreciate the answers that have been provided so far, so I'll get a little more specific to address my remaining concerns.

The answers so far have led me to think I simply need to do some benchmarking of my own and see what comes of it (development for ya!). So really, at this point, the "problem" I face is that the RAM usage will inevitably be hundreds of gigabytes.

Question 1) So if I decide to go the route of putting an incredibly large amount of data into RAM, it will need to be spread across a bunch of servers. I do this already with NDBCluster, but NDBCluster uses about 3 times as much RAM to store identical data (roughly 48 bytes vs. 15 bytes per row). What is the accepted method for storing that much data in RAM across a cluster of servers? Should I implement some application-level system for integrating with a bunch of MyISAM servers?

Question 2) Is MyISAM even the right choice of database engine? I tested a bit with InnoDB and it seemed to use a lot more RAM than MyISAM for the index. What about non-MySQL solutions?

Question 3) Is storing the index on disk even worthwhile? At that point, should I skip creating an index altogether if it won't be in RAM anyway? (I seriously doubt it.)

Question 4) If I go the route of not putting the data in RAM, what sort of disk setup is recommended for this project? A RAID of SSDs?

Question 5) Is it worthwhile at all to leave the value and resolution columns out of the index? How much of a performance penalty are we talking about, assuming the index is on disk and not in RAM?

Thank you so much for any advice. I'll be sure to select an answer once these questions are answered (if possible).

Best Answer

OBSERVATION #1

Performance implications should become quickly apparent if the index pages of monitor.result experience two things:

  • Experience 1) Swapping of the RAM making up the MyISAM Key Cache
  • Experience 2) Rotation in and out of the MyISAM Key Cache (sized by key_buffer_size)

Experience #1 is virtually unavoidable. As for Experience #2, it could result in needed index pages being purged from the MyISAM Key Cache in the face of more recent queries against other MyISAM tables. Those index pages must then be brought back in by querying the corresponding table. The two experiences together can make for slower-than-expected queries, even on relatively small tables.
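One rough way to see whether Experience #2 is happening is to watch the key cache hit counters; Key_reads climbing relative to Key_read_requests means index pages are regularly being re-read from disk:

SHOW GLOBAL STATUS LIKE 'Key_read%';
-- Key_read_requests = key block reads satisfied from the key cache
-- Key_reads         = key block reads that had to go to the .MYI file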

However, you could minimize or neutralize any ill effects of swapping by assigning the index of monitor.result to a dedicated MyISAM Key Cache, one that will contain only index pages from monitor.result.

How do you do that ???

First, recall that you mentioned the index usage for monitor.result was 2.29MB. You can create the dedicated key cache at that size plus a little headroom, say 2.5MB, like this:

SET GLOBAL monitor_result_private_cache.key_buffer_size = 1024 * 512 * 5;
CACHE INDEX monitor.result IN monitor_result_private_cache;
LOAD INDEX INTO CACHE monitor.result;

This will perform the following steps:

  1. Create the Key Cache (SET GLOBAL ... key_buffer_size)
  2. Assign the Key Cache to the MyISAM table using CACHE INDEX
  3. Load the index pages of the assigned table into its cache using LOAD INDEX INTO CACHE

This would conveniently keep the index pages of that table from ever leaving the cache. The only way index pages would leave is if INSERTs into monitor.result grow the index beyond 2.5MB, so you must choose enough headroom to accommodate many future INSERTs.
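To keep an eye on that headroom, you can read the dedicated cache's size back and compare it against the table's current index_length (a quick check, nothing more):

SELECT @@monitor_result_private_cache.key_buffer_size AS cache_bytes;

SELECT index_length AS index_bytes
FROM information_schema.TABLES
WHERE table_schema = 'monitor' AND table_name = 'result';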

OBSERVATION #2

I also noticed the index you laid out for monitor.result:

PRIMARY KEY (`server`, `request`, `recorded`, `ref_id`)

If any of your queries against monitor.result resemble something like this:

SELECT resolution, value FROM monitor.result
WHERE server = 200 AND ref_id = 50 AND ... ;

then you can speed up such queries by reordering the PRIMARY KEY:

CREATE TABLE IF NOT EXISTS `monitor`.`result` (
  `server` SMALLINT UNSIGNED NOT NULL,
  `ref_id` SMALLINT UNSIGNED NOT NULL,
  `request` TINYINT UNSIGNED NOT NULL,
  `recorded` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `resolution` TINYINT NOT NULL,
  `value` MEDIUMINT UNSIGNED NOT NULL,
  PRIMARY KEY (`server`, `ref_id`, `request`, `recorded`)
) ENGINE = MyISAM;

or by adding a UNIQUE index:

CREATE TABLE IF NOT EXISTS `monitor`.`result` (
  `server` SMALLINT UNSIGNED NOT NULL,
  `ref_id` SMALLINT UNSIGNED NOT NULL,
  `request` TINYINT UNSIGNED NOT NULL,
  `recorded` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `resolution` TINYINT NOT NULL,
  `value` MEDIUMINT UNSIGNED NOT NULL,
  PRIMARY KEY (`server`, `request`, `recorded`, `ref_id`),
  UNIQUE KEY uniqndx1 (`server`, `ref_id`, `request`, `recorded`)
) ENGINE = MyISAM;

If you add the UNIQUE index, you must roughly double the dedicated key cache accordingly, since it now has to hold two indexes of similar size.
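With the ~2.5MB sizing used above, that would look something like this:

-- Two indexes of similar size now live in this cache, so size for both
SET GLOBAL monitor_result_private_cache.key_buffer_size = 1024 * 1024 * 5;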

OBSERVATION #3

You mentioned a SATA drive: a good choice for archival, low-update historical data. Any MyISAM table on a SATA drive that has a dedicated key cache should not be bothered by the index lookup, but the data retrieval time from the drive is something you would have to benchmark to see if you can live with the running times.
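A minimal sketch of such a benchmark, using the session profiler available in 5.x servers (the WHERE values are placeholders):

SET profiling = 1;
SELECT resolution, value
FROM monitor.result
WHERE server = 200 AND ref_id = 50
  AND recorded >= NOW() - INTERVAL 1 DAY;
SHOW PROFILES;  -- wall-clock duration, including the data read from disk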

Here is an alternative:

Create an index that has every column

CREATE TABLE IF NOT EXISTS `monitor`.`result` (
  `server` SMALLINT UNSIGNED NOT NULL,
  `ref_id` SMALLINT UNSIGNED NOT NULL,
  `request` TINYINT UNSIGNED NOT NULL,
  `recorded` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `resolution` TINYINT NOT NULL,
  `value` MEDIUMINT UNSIGNED NOT NULL,
  PRIMARY KEY (`server`, `ref_id`, `request`, `recorded`, `resolution`, `value`)
) ENGINE = MyISAM;

What does this do? It provides retrieval of entire rows strictly from the index. Combining this with a dedicated key cache, you would essentially have the whole table in RAM. All queries would be fulfilled by the index and never touch the table, REGARDLESS of whether the table sits on SAS, SATA, SSD, or even stone.
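You can verify that behavior with EXPLAIN: given the all-column PRIMARY KEY above, the Extra column should report "Using index", meaning the query is satisfied entirely from the index and the .MYD data file is never read (the WHERE values are placeholders):

EXPLAIN SELECT resolution, value
FROM monitor.result
WHERE server = 200 AND ref_id = 50;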

UPDATE 2012-01-26 18:18 EDT

Question 1: You may want to look into memcached. I believe there is a version of InnoDB with a memcached plugin; at least, that's what I've heard.

Question 2: InnoDB is for transactional tables. If you have archive data, compressed MyISAM tables should fill the bill. In fact, you could look into the ARCHIVE storage engine.
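Here is a hypothetical sketch of that last idea (the table name result_2011 is made up; note that ARCHIVE supports no indexes, so the table is defined without the primary key):

CREATE TABLE monitor.result_2011 (
  `server` SMALLINT UNSIGNED NOT NULL,
  `ref_id` SMALLINT UNSIGNED NOT NULL,
  `request` TINYINT UNSIGNED NOT NULL,
  `recorded` TIMESTAMP NOT NULL,
  `resolution` TINYINT NOT NULL,
  `value` MEDIUMINT UNSIGNED NOT NULL
) ENGINE = ARCHIVE;

-- Copy a closed-out year into the compressed archive table
INSERT INTO monitor.result_2011
SELECT server, ref_id, request, recorded, resolution, value
FROM monitor.result
WHERE recorded < '2012-01-01';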

Question 3: Storing the index on disk (MyISAM, InnoDB) is standard behavior and cannot be changed. You have to use special commands or run special queries to preload the caches, as shown above.

Question 4: RAID-10 for write-heavy workloads, SSD for read-heavy workloads. Watch your disk surface temperatures!!!

Question 5: If the table is strictly for holding historical info, there is no need for overkill. As long as the table is rarely read, there is no need for special caching considerations.