PostgreSQL: Find rows for which any index of any value contained is over a given size

index, postgresql

I have a database originally built with a PostgreSQL instance compiled with a custom page size much larger than the default. I'm trying to migrate its contents to a standard, 8k-page database.

Because a B-tree index entry in PostgreSQL must fit in roughly one third of a page, some data contained in the old database cannot be represented in the new one without modifying index definitions. Making schema changes is not an immediately available option; dropping rows (from the tables I expect to be involved) is.
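For illustration, this untested sketch reproduces the failure I'm trying to avoid on a stock 8k build (the names are placeholders; SET STORAGE EXTERNAL just defeats compression so the index has to store the full value, and the reported maximum varies by version, 2712 bytes on older releases and 2704 with btree version 4):

 CREATE TABLE demo (v text);
 ALTER TABLE demo ALTER COLUMN v SET STORAGE EXTERNAL;  -- disable compression
 CREATE INDEX demo_v_idx ON demo (v);
 INSERT INTO demo VALUES (repeat('x', 3000));
 -- ERROR: index row size ... exceeds btree version 4 maximum 2704
 --        for index "demo_v_idx"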

Finding the overall size of an index in PostgreSQL is easy and well-documented. Finding the individual rows whose associated index entries are over a threshold is something I'm having trouble with.


How can I find rows (across all tables in a schema) containing indexed values whose index entries are too large to represent in a standard PostgreSQL build (which is to say, longer than 2712 bytes)?

Best Answer

Let me begin this reply with a caveat: I've never encountered this exact problem, so I don't fully understand the nature of it. Still, I'll offer some advice that will hopefully give you the insight you need to complete the task.

pageinspect extension

Recent versions of PostgreSQL ship with a contrib extension, pageinspect, that lets you examine the individual pages of tables and indexes at a low level.

First things first: to use it, all you need to do is run the command (as a superuser, since the extension's functions are superuser-only by default)

 CREATE EXTENSION pageinspect;

and it's ready to go.
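If you want to confirm the extension is available on your installation first, a quick look at the standard pg_available_extensions view will tell you:

 SELECT name, default_version, installed_version
 FROM pg_available_extensions
 WHERE name = 'pageinspect';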

Now, once again, I don't fully understand the issue, but it appears you are concerned with the size of the index entry, not the actual row in the heap.

In that case, you can query the pages of your B-tree index with pageinspect's bt_page_items() function (block numbers start at 1; block 0 is the metapage):

SELECT * FROM bt_page_items('idx_123', 1);

and you'll get back some results like

 itemoffset |  ctid   | itemlen | nulls | vars |          data
------------+---------+---------+-------+------+-------------------------
          1 | (1,182) |      16 | f     | f    | 00 3e 9b ac b2 02 01 00
          2 | (0,1)   |      16 | f     | f    | 00 20 d6 fa 65 02 01 00
          3 | (0,2)   |      16 | f     | f    | 00 09 7b 30 66 02 01 00
          4 | (0,3)   |      16 | f     | f    | 00 f2 1f 66 66 02 01 00
          5 | (0,4)   |      16 | f     | f    | 00 db c4 9b 66 02 01 00
          6 | (0,5)   |      16 | f     | f    | 00 c4 69 d1 66 02 01 00
 ...

The itemlen field shows the number of bytes each index entry occupies, though you should verify that against your own data. You can compare it with your pre-determined limit of 2712 bytes and collect the ctids of the entries that violate the condition. One caveat: only leaf pages carry ctids that point at heap tuples; on internal pages, the ctid points to a child index page instead.
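For a quick one-off survey, an untested sketch along these lines reports the largest entry on each page of an index (idx_123 is again a placeholder, block 0 is skipped because it's the metapage, and pages deleted by VACUUM may need extra handling depending on your version):

 SELECT blkno, max(i.itemlen) AS max_itemlen
 FROM generate_series(
        1,
        pg_relation_size('idx_123') / current_setting('block_size')::bigint - 1
      ) AS blkno,
      LATERAL bt_page_items('idx_123', blkno::int) AS i
 GROUP BY blkno
 ORDER BY max_itemlen DESC
 LIMIT 20;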

Broadly speaking, then, you need a PL/pgSQL function that iterates over the pages of your index(es), checks itemlen, and collects the ctids of the offending heap entries. Since you state it is OK to drop the violating rows, you can then wrap it all up with a

DELETE FROM my_table WHERE ctid = ANY (*** your ctid list ***);
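Putting that together, here is an untested PL/pgSQL sketch of the whole procedure. The function name and signature are mine; bt_page_stats() and bt_page_items() are the documented pageinspect functions, but I have not accounted for pages deleted by VACUUM or for the posting-list tuples that deduplication creates on PostgreSQL 13 and later, so treat it as a starting point:

 -- Return the heap ctids of all entries in the given index whose
 -- index row is larger than `threshold` bytes.
 CREATE OR REPLACE FUNCTION oversized_index_ctids(idx regclass, threshold int)
 RETURNS SETOF tid
 LANGUAGE plpgsql AS
 $$
 DECLARE
     nblocks int := pg_relation_size(idx) / current_setting('block_size')::int;
     st      record;
 BEGIN
     -- Block 0 is the B-tree metapage; real pages start at block 1.
     FOR blkno IN 1 .. nblocks - 1 LOOP
         SELECT type, btpo_next INTO st
         FROM bt_page_stats(idx::text, blkno);

         -- Only leaf pages ('l') hold ctids that point into the heap.
         CONTINUE WHEN st.type <> 'l';

         RETURN QUERY
             SELECT i.ctid
             FROM bt_page_items(idx::text, blkno) AS i
             WHERE i.itemlen > threshold
               -- On non-rightmost pages (btpo_next <> 0) the first item
               -- is the "high key", not a heap pointer; skip it.
               AND (st.btpo_next = 0 OR i.itemoffset > 1);
     END LOOP;
 END;
 $$;

which you could then use like

 DELETE FROM my_table
 WHERE ctid = ANY (ARRAY(SELECT oversized_index_ctids('idx_123', 2712)));

Just make sure nothing else is writing to the table while you do this: ctids are only stable as long as the rows are not updated or moved in the meantime.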

I hope this helps!