PostgreSQL – Identifying Missing or Orphan Indexes

indexpostgresqlpostgresql-9.4

In the process of finding duplicate indexes, I've been using the following query to identify them:

SELECT 
indrelid::regclass AS TableName 
,array_agg(indexrelid::regclass) AS Indexes 
FROM pg_index 
GROUP BY 
indrelid 
,indkey 
HAVING COUNT(*) > 1;

From which I get an entry with the following data:

tablename   indexes
distribucion.asignacion {distribucion.distribucion_unique_asignacion_imei,distribucion.pk_distribucion_asignacion}

Too check that the index are actually duplicate, I manually see the columns involving each reported index, with this query:

select * from pg_indexes where tablename='<table_name>';

But when I search for the above indexes for ''distribucion.asignacion'' table I get no results.

What could be wrong? Have these indexes been left "orphan", if so, may be there are other orphan structures too in my database, so how can I find such?

Best Answer

pg_indexes.tablename only contains the table name, not the schema name. The schema name is available in the column schemaname.

So you need to use

select * 
from pg_indexes 
where tablename ='asignacion'
  and schemaname = 'distribucion';

Related Solutions

Algorithmic order of table indexes for table operations

Things are more complicated than that. Here are a few points of consideration.

First, this entire discussion assumes B-Trees or B+ Trees (Hence the o(log(n))). There are other types of indexes, like hash indexes, where access is in O(1). Your question insinuates you're looking up values using "equals" search (e.g. looking for X=17). But in this particular scenario, a Hash index is preferable, when possible.

I do agree though that most indexes you'll find today are B/B+ Trees, so let's continue with this assumption.

You've also implicitly indicated that there's always only one resulting row in a SELECT, which is hardly a representative case; plenty times do we look for 1,000 rows at a time. But let's continue with the assumption of a single matching row.

Your next assumption is that searches are always done by the indexed column. This is fine, but I'm just noting that DELETEing a record by some unindexed column Y turns out to be more expensive: you're both wasting O(n) time in finding the record, and then paying an additional O(log(n)) for updating the index.

But let us continue with the assumption that we're only discussing queries which are looking up at indexed columns.

Some tables use unclustered index format (which fits into your calculations) - the table is one entity and the index is another. Others use clustered index format: table rows (or rather blocks of rows, or rather yet pages of rows) are actually stored as leaves inside the clustering index. In such scenario, you will pay O(log(n)) for finding a record in an INSERT command. An optimization for that is in the case you're inserting to the end of the table, and a decent implementation would hold a pointer to the last record/page in the index. (Oh, yes, you should not the possibility that your record gets INSERTed to the middle of the table).

Actually, records could get inserted to the middle of the table even in unclustered tables; it would make sense to spend more search time so as to avoid fragmentation, and at least one implementation that I know of does that. I'm assuming other may, too.

Deleting/Inserting an index from a BTree is O(1) on average, but may cost up to log(n) operations in case of propagated page merging/splitting.

Also, the fact that always comes as a surprise to many, is that sometimes a table scan is faster than index lookup. This is particularly true for queries resulting in multiple rows. It turns out looking up the index adds overhead; when compared to full table scan the overhead could actually make total cost higher. For single row lookup the vast majority of index lookups should be faster than a full table scan.

But do consider the following general convention: you pay with dollars any action that accesses disk. You pay with nickels actions that act on in memory data. This is actually at the heart of database disk I/O optimization. If index pages are on disk, and table pages happen to be in memory, you will possibly pay less for table scan.

And that's where havoc comes in: it really all depends on your workloads, on your memory size, on your dataset size.

Did you ever take a math class where you had to solve some complex integral? It took you hours to solve it, and got point off for missing some tiny minus sign somewhere?

Did you even take a physics class where you had to solve some complex integral? The professor would just throw away chunks of the equation, saying "this is neglect-able", and you would rip your hair off? WHY is it neglect-able? Why not other things?

Computer science is based on math. Computers are based on physics. They are physical objects. They need to spin disks, access a memory bus, manage billions of transistors... You just can't actually anticipate what will happen and put it all under some equation.

It may just turn out that for some particular dataset your entire equation does not hold water. In other times, it may be just fine.

MySQL – Optimizing Indexes for Better Performance

I will answer each the the three questions

QUESTION #1

Since I'll probably move to mysql cluster, does it make sense to switch from InnoDB to NDB engine and use HASH indexes on "category" and "status" columns?

ANSWER TO QUESTION #1

Hash indexes are for one-to-one lookups. Hash indexes are only available for the MEMORY Storage Engine (See my May 17, 2011 post Why does MySQL not have hash indices on MyISAM or InnoDB?)

QUESTION #2

I read how btrees stores data. If most of my queries involve a "WHERE category = x AND status = y", should I add 3 different indexes: one on category, one on status, and one on the combination of both?

ANSWER TO QUESTION #2

You do not want single column indexes. MySQL could still use them if there are no compound index present. It will make MySQL work harder generating results by merging lookups from two separate indexes,

You are way better off with a compound index, an index on both category and status.

In your particular case

the index could be (status,category) if you order categories within a status.
the index could be (category,status) if you order status values under a category (if the number of status values is high)
Since you do an ORDER BY within a (status,category) combination, your compound would benefit even further from one or both of these combined indexes
- (status,category,created_at)
- (category,status,created_at)
Please see my posts on using compound indexes
- Sep 18, 2012 : How are multiple indexes used in a query by MySQL? (Discusses INDEX_MERGE briefly)
- Apr 11, 2014 : Why is MySQL not using the index with the higher cardinality?

QUESTION #3

"show warnings" doesn't show anything useful about what mysql is trying to warn me about: what's wrong with my query?

ANSWER TO QUESTION #3

Without seeing the actual warning message, I can't tell you. Note this: You ran EXPLAIN twice on the same query and got two slightly different results. This was due to the choose of keys. InnoDB tends to take guess by passing through index pages inside the BTREE nodes of the indexes and chooses based on cardinality.

In your case, you should run just once

OPTIMIZE TABLE object;

This will defrag the table and generate a clean set of index statistics.

You could slightly improve the query by writing it as an INNER JOIN

select SQL_NO_CACHE count(*) from object
inner join offer on object.id = offer.id
where object.category = "bid"
and status = "pending"
order by created_at;

You should also think about the fact that offer no PRIMARY KEY.

You should run this query

SELECT COUNT(1),id,category from offer
GROUP BY id,category HAVING COUNT(1) > 1;

If you get no results back, then that can be your PRIMARY KEY. Thus, the offer table should be

CREATE TABLE `offer`
(
    id BIGINT UNSIGNED,
    `category` VARCHAR(31), /* genre, type, category */
    amount DECIMAL(12,3),
    PRIMARY KEY (id,category)
) ENGINE=InnoDB;