PostgreSQL – Why is the index on the join keys not used in a very simple hash join case?

Tags: execution-plan, index, performance, postgresql

I have two tables:

create table animal
(
    aid integer,
    cid integer,
    aname varchar(255) default NULL::character varying,
    species text
);

create table daily_feeds
(
    aid integer,
    zid integer,
    shift integer,
    menu integer
);

and a query:

SELECT aname, shift
FROM animal
NATURAL JOIN daily_feeds
WHERE menu = 1;

The table animal contains around 40,000 rows; the table daily_feeds contains around 80,000 rows, 2 for each animal.

I thought that adding an index on the columns used for the join, animal.aid and daily_feeds.aid, could result in a different execution plan and improved performance.

create index aaid on animal (aid);

create index daid on daily_feeds using btree (aid);

This doesn't happen. Why is that?

Obviously this is a very simple and academic example that I'm trying to use to learn more about database query optimization.

EDIT:

Added execution plan when no indexes are present:
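
(The plan below is EXPLAIN ANALYZE output; it can be reproduced with:)

EXPLAIN ANALYZE
SELECT aname, shift
FROM animal
NATURAL JOIN daily_feeds
WHERE menu = 1;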

Hash Join  (cost=1439.24..2488.23 rows=499 width=10) (actual time=11.701..55.260 rows=487 loops=1)
    Hash Cond: (animal.aid = daily_feeds.aid)
    ->  Seq Scan on animal  (cost=0.00..694.00 rows=40000 width=10) (actual time=0.007..20.775 rows=40000 loops=1)
    ->  Hash  (cost=1433.00..1433.00 rows=499 width=8) (actual time=11.590..11.590 rows=487 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 28kB
        ->  Seq Scan on daily_feeds  (cost=0.00..1433.00 rows=499 width=8) (actual time=0.011..11.218 rows=487 loops=1)
            Filter: (menu = 1)
            Rows Removed by Filter: 79513
Planning time: 0.124 ms
Execution time: 55.521 ms

EDIT 2: another question:

Since I have two indexes, I can also cluster the tables on them, physically ordering each table by the aid attribute.

cluster animal using aaid;
cluster daily_feeds using daid;

If I do that, the execution plan still doesn't change. Isn't a merge join a more effective join strategy when both tables are sorted on the key column?

I'm not really sure about the behavior of Postgres here.
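
One way I can test this is to discourage the hash join for the session and see what the planner picks instead (enable_hashjoin is a standard planner setting; turning it off is a diagnostic trick, not a production fix):

-- Disable hash joins for this session (the planner adds a large
-- cost penalty), forcing it to cost a merge or nested-loop join.
SET enable_hashjoin = off;

EXPLAIN ANALYZE
SELECT aname, shift
FROM animal
NATURAL JOIN daily_feeds
WHERE menu = 1;

-- Restore the default behaviour.
RESET enable_hashjoin;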

Best Answer

Providing information on the actual query plan used would help.

Speaking in general, the DB engine would have to use the two indexes to identify rows in both tables that have matching aid values; then, look up all the actual rows in the daily_feeds table to see if menu is 1, and then look up the rows in animal to get the aname values.

If there were a large number of rows to be returned, and if those are your complete table definitions, then the planner may well have decided it would be faster to simply load the tables and scan them to find the matching rows and apply the menu check (at which point aname is already there and available).


Let me note here that I'm most familiar with the way SQL Server uses indexes. It's possible that some of the following may not directly apply to PostgreSQL. However, most of the basics seem to match up, and I'm not going to delve into specifics enough to be likely to go too far wrong.

The optimizer has three ways to find rows in daily_feeds. It can simply look through the entire table; it can use the aid index to seek to a particular value or range of values of aid, then use the index information to go to the correct pages in the table and fetch the rows; or it can treat the index as if it were a table and scan that (possibly looking up rows in the full table, if not all the data it needs is in the index).
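
As an illustration of that last option in PostgreSQL (a sketch, not from the original question; the index name is made up), an index carrying every daily_feeds column this query touches can be scanned in place of the table, which PostgreSQL calls an index-only scan:

-- Hypothetical covering index: menu leads (for the filter), with
-- aid and shift included so the query never touches the heap.
create index daily_feeds_covering on daily_feeds (menu, aid, shift);

-- Index-only scans also depend on the visibility map being
-- current, so vacuum the table after building the index.
vacuum analyze daily_feeds;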

There are many considerations an optimizer must weigh when figuring out how to get the data it needs. One factor is the number of pages it might have to load. Loading pages from disk is one of the most time-consuming tasks for a database, so if it can limit the number of pages it needs to read, that should speed up the query.
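
In PostgreSQL you can see the planner's own page and row estimates in the pg_class catalog (relpages and reltuples are kept up to date by VACUUM and ANALYZE):

-- How many 8 kB pages each table occupies, per the planner's stats.
SELECT relname, relpages, reltuples
FROM pg_class
WHERE relname IN ('animal', 'daily_feeds');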

For this query, we need to find two columns in animal (aid and aname), and three in daily_feeds (aid, menu, shift).

It has two ways to identify the rows it's interested in: rows where an aid in animal matches an aid in daily_feeds, and rows where menu is 1. I'd guess that statistics would indicate (and a foreign key relationship certainly would, if one actually existed) that all or almost all rows from both tables would pass the aid = aid check. That doesn't do a good job of reducing the number of rows we're looking for. However, menu = 1 knocks out a lot of rows. Since menu isn't in an index, we have to go to the table to filter on it.
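
Had there been an index on menu (a hypothetical one; the name below is mine), the engine could have used it to find the ~500 matching rows instead of scanning the whole table:

-- Hypothetical index on the filter column, so the planner can
-- locate the menu = 1 rows without reading all of daily_feeds.
create index daily_feeds_menu on daily_feeds (menu);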

Once we've got the daily_feeds rows we want, we need to find the matching rows in animal to pick up aname. We can expect to be looking for around 500 rows, and must assume (worst case) they're evenly distributed through the animal data. I'll make another assumption here: that (even with a text column) the rows in the animal table aren't terribly wide. If the basic table is stored in 500 pages (ignoring the text data; it may be stored off-row, but we don't need it for this query), then the odds are that all the pages will have to be loaded to get the rows we need.

Now, there are two paths to proceed:

  • Use the aaid index on animal to find the 500 aid values we're looking for; then, using the index, go to the pages where the data lives and pull the rows from each page.
  • Load the animal table and scan it directly for the 500 aid values we're looking for.

The optimizer decided to use option 2. Looking at it from this perspective, that doesn't necessarily seem that odd.
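
If you want to see what option 1 would have cost, PostgreSQL lets you penalize sequential scans for the current session (enable_seqscan is a standard planner setting; this is a diagnostic aid, not a production tuning knob):

-- Strongly discourage option 2 so the planner reveals its
-- index-based alternative, with its estimated and actual cost.
SET enable_seqscan = off;

EXPLAIN ANALYZE
SELECT aname, shift
FROM animal
NATURAL JOIN daily_feeds
WHERE menu = 1;

-- Put the planner back to normal.
RESET enable_seqscan;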

Note: the engine may decide this entire plan before it takes its first step. This can occasionally lead to bad plans (if, for instance, it turned out that menu = 1 only brought back 5 rows, then using the index might be faster). SQL Server maintains statistics on columns that help it estimate cardinality (how well a known value for column X will limit the target rows) to inform these decisions. PostgreSQL has a statistics facility as well; I'm not sure whether it's directly comparable.
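
In PostgreSQL those statistics come from ANALYZE and are visible in the pg_stats view; for instance, the row estimate for menu = 1 is derived from entries like these:

-- Refresh the optimizer statistics for the table.
ANALYZE daily_feeds;

-- n_distinct and the most-common-values list are what the planner
-- uses to estimate how many rows "menu = 1" will return.
SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'daily_feeds' AND attname = 'menu';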