Playing a bit with pg_buffercache, I could get answers to some of your questions.
- This is quite obvious, but the results for (5) also show that the answer is YES.
- I have yet to set up a good example for this; for now it is more yes than no :) (See my edit below - the answer is NO.)
- Since the planner is what decides whether to use an index or not, we can say YES, it decides caching (but this is more complicated).
- The exact details of caching could be derived from the source code. I couldn't find too much on this topic, except this one (see the author's answer, too). However, I'm pretty sure that this again is far more complicated than a simple yes or no. (Again, my edit gives some idea - since the cache size is limited, those 'sensible' indexes compete for the available space. If there are too many of them, they will keep evicting each other from the cache - so the answer is rather NO.)
- As a simple query with `pg_buffercache` shows, the answer is a definitive YES. It is worth noting that temporary table data are not cached here.
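To get a feel for the space those indexes compete for, you can check the cache size and how much of it is currently unassigned (a quick sketch using `pg_buffercache`; unused buffers show a NULL `relfilenode`):

```sql
-- total size of the shared buffer cache the indexes compete for
SHOW shared_buffers;

-- buffers not currently assigned to any relation
SELECT count(*) AS free_buffers
FROM pg_buffercache
WHERE relfilenode IS NULL;
```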
EDIT
I've found Jeremiah Peschka's terrific article about table and index storage. With the information from there, I could answer (2) as well. I set up a small test so you can check these things yourself.
-- we will need two extensions
CREATE EXTENSION pg_buffercache;
CREATE EXTENSION pageinspect;
-- a very simple test table
CREATE TABLE index_cache_test (
id serial
, blah text
);
-- I am a bit megalomaniac here, but I will use this for other purposes as well
INSERT INTO index_cache_test
SELECT i, i::text || 'a'
FROM generate_series(1, 1000000) a(i);
-- let's create the index to be cached
CREATE INDEX idx_cache_test ON index_cache_test (id);
-- now we can have a look at what is cached
SELECT c.relname,count(*) AS buffers
FROM
pg_class c
INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
GROUP BY c.relname
ORDER BY 2 DESC LIMIT 10;
relname | buffers
----------------------------------+---------
index_cache_test | 2747
pg_statistic_relid_att_inh_index | 4
pg_operator_oprname_l_r_n_index | 4
... (others are all pg_something, which are not interesting now)
-- this shows that the whole table is cached and our index is not in use yet
-- now we can check which row is where in our index
-- in the ctid column, the first number shows the page, so
-- all rows starting with the same number are stored in the same page
SELECT * FROM bt_page_items('idx_cache_test', 1);
itemoffset | ctid | itemlen | nulls | vars | data
------------+---------+---------+-------+------+-------------------------
1 | (1,164) | 16 | f | f | 6f 01 00 00 00 00 00 00
2 | (0,1) | 16 | f | f | 01 00 00 00 00 00 00 00
3 | (0,2) | 16 | f | f | 02 00 00 00 00 00 00 00
4 | (0,3) | 16 | f | f | 03 00 00 00 00 00 00 00
5 | (0,4) | 16 | f | f | 04 00 00 00 00 00 00 00
6 | (0,5) | 16 | f | f | 05 00 00 00 00 00 00 00
...
64 | (0,63) | 16 | f | f | 3f 00 00 00 00 00 00 00
65 | (0,64) | 16 | f | f | 40 00 00 00 00 00 00 00
-- with the information obtained, we can write a query which is supposed to
-- touch only a single page of the index
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM index_cache_test
WHERE id BETWEEN 10 AND 20 ORDER BY id
;
Index Scan using idx_cache_test on index_cache_test (cost=0.00..8.54 rows=9 width=4) (actual time=0.031..0.042 rows=11 loops=1)
Index Cond: ((id >= 10) AND (id <= 20))
Buffers: shared hit=4
Total runtime: 0.094 ms
(4 rows)
-- let's have a look at the cache again (the query remains the same as above)
relname | buffers
----------------------------------+---------
index_cache_test | 2747
idx_cache_test                   |       4
...
-- and compare it to a bigger index scan:
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM index_cache_test
WHERE id <= 20000 ORDER BY id
;
Index Scan using idx_cache_test on index_cache_test (cost=0.00..666.43 rows=19490 width=4) (actual time=0.072..19.921 rows=20000 loops=1)
Index Cond: (id <= 20000)
Buffers: shared hit=4 read=162
Total runtime: 24.967 ms
(4 rows)
-- this already shows that something was in the cache and further pages were read from disk
-- but to be sure, a final glance at cache contents:
relname | buffers
----------------------------------+---------
index_cache_test | 2691
idx_cache_test                   |      58
-- note that some of the table pages have disappeared (evicted to make room)
-- but, more importantly, a bigger part of our index is now cached
All in all, this shows that indexes and tables can be cached page by page, so the answer to (2) is NO.
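You can also see the page-by-page behaviour from the other direction: on newer PostgreSQL versions (9.4+), the `pg_prewarm` extension lets you load an explicit block range of a relation into the cache. A sketch, reusing the index from above:

```sql
CREATE EXTENSION pg_prewarm;

-- load only blocks 0..9 of the index into shared buffers
SELECT pg_prewarm('idx_cache_test', 'buffer', 'main', 0, 9);
```

Checking `pg_buffercache` afterwards should show exactly those pages cached.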
And a final one to illustrate temporary tables not being cached in shared buffers:
CREATE TEMPORARY TABLE tmp_cache_test AS
SELECT * FROM index_cache_test ORDER BY id FETCH FIRST 20000 ROWS ONLY;
EXPLAIN (ANALYZE, BUFFERS) SELECT id FROM tmp_cache_test ORDER BY id;
-- checking the buffer cache now shows no sign of the temp table
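The reason is that temporary tables are cached in each backend's private `temp_buffers`, not in `shared_buffers`, and `pg_buffercache` only sees the shared cache. You can check (and, per session, raise) that private cache size:

```sql
SHOW temp_buffers;

-- can only be changed before the first use of a temp table in the session
SET temp_buffers = '64MB';
```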
If I were you, I would run a trace specific to hits on that table. It shouldn't be overly intensive, since you are restricting it to just queries against that table from your application. Capture just the minimum set of events needed by the DTA (Database Tuning Advisor). Run it for a day here and a day there, making sure you get some end-of-week days and some end-of-month days. Then run the whole lot through the DTA.
Here is why: I'm willing to bet that you have specific combinations of columns that come up more often than others. You can create more complex indexes based on that information. You might also find that you can create some correlated statistics - that is, statistics covering more than one column. For example, creating a statistic on City and State together may improve queries against those two columns.
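A multi-column statistic is a one-liner; the table and column names below are just placeholders for whatever your trace turns up:

```sql
-- hypothetical table/column names; adjust to your schema
CREATE STATISTICS Stat_City_State
ON dbo.CustomerAddress (City, State);
```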
However, make sure you don't create too many indexes. On a table that large, I'm guessing you do a fair number of writes, and every additional index will slow them down. Of course, you may do most of your writes during a batch process.
Also, make sure you set up an automatic process to update your statistics periodically. With that many rows, the statistics aren't going to update on their own very often - by default, only once 500 + 20% of the rows have changed, and at 500 million rows that is a LOT.
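A scheduled job along these lines (the table name is hypothetical) keeps them fresh:

```sql
-- refresh statistics for one big table with a full scan
UPDATE STATISTICS dbo.BigTable WITH FULLSCAN;

-- or, more coarsely, for every table in the database
EXEC sp_updatestats;
```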
Best Answer
(Note: I'm not a deep expert in the Pg guts. This is only my understanding. You should really read the mvcc and internals sections of the manual).
Yes, initially the row with `id = 1` is still in the index, despite the index's `WHERE` clause excluding it now that it has been updated with `archived = true`. At some point down the track the index entry will be purged by `VACUUM` or autovacuum, freeing the space.

The reason for this is PostgreSQL's MVCC design. Other concurrent transactions might still be able to see the old version of the row with `id = 1, archived = false` - either `READ COMMITTED` transactions with a long-running statement, or `SERIALIZABLE` transactions that don't see changes committed after they began. If the index entry were removed when the row was updated, those transactions wouldn't find the row in index scans and would produce the wrong result.

Technically, what's actually happening is that the old row still exists; it has just had an `xmax` set so newer transactions ignore it when they see it. After the update, the index points to that row, same as before. A new row with `archived = true` has been inserted with an `xmin` above the old row's `xmax`, so any given transaction sees either the old or the new version, never both. That new row is not added to the index.

Eventually there will be no open transaction that can still see the old row, so `VACUUM` will come by and remove the old row and its index entry, freeing the space for re-use.

Note that the index itself has no row visibility information. You can - and often do - have many versions of a row, each with index entries. When a transaction looks for matching index entries, it reads all of the versions (unless it can tell from the visibility map that it doesn't need to) and ignores all but the one whose `xmin` and `xmax` make it visible to that transaction.

This is one of many good reasons to ensure that autovacuum runs frequently. It reduces the number of dead index entries and dead rows you have lying around wasting space and I/O.