PostgreSQL – How to Index Two Tables with One-to-Many Relation

postgresql

Take the following two tables:

Table A has ~100k rows and 50 columns and is updated only twice per month. How long this update takes is not an issue (as long as it doesn't take more than one day, which I think is unlikely :D). It will grow very slightly, by no more than 10k rows per year.
Table B has ~25k rows and 25 columns; column F is a foreign key to column C of table A. This table is updated 100-1000 times a day, mostly to add new rows, so it will easily grow to hundreds of thousands of rows.

Table A has a one-to-many relation to table B. Both column F and column C must be not null and each has its own index. Column C is also unique, but it is not the primary key of table A. Every row of table B corresponds to exactly one row in table A.

Both tables are queried many thousands of times every day, individually or through their relation, by an application that allows a few dozen different types of query.

Is it a good idea to index every column of table A that is involved in a query, plus make a composite index for every type of query? If not, what are the possible drawbacks?
For table B, is it a good idea to make a composite index only for the most time-consuming queries? Or should I also make a composite index for every column in table B that is being queried, since those queries may vary in their parameters but not in their form?

Best Answer

The thing about indexes is that they are excellent tools for speeding up queries, but they bring along overhead as well.

If you choose to index every column in a table that is involved in a query, you have chosen to store considerably more data. This affects both the storage required and the overhead of maintaining all those indexes on every insert and update. (Or of maintaining all columns in a very wide index.)
In my past experience, having many indexes, especially several indexes with similar distribution statistics, can make it easier for the server to select a poor index and thus a poor plan.
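To put a number on the storage side of that overhead, the built-in size functions can be compared for a table and its indexes. This is only a sketch: the name table_b is a placeholder for the question's table B, not a name taken from the original post.

```sql
-- 'table_b' is a hypothetical name standing in for table B.
-- Compare the size of the table data with the combined size of its indexes.
SELECT pg_size_pretty(pg_relation_size('table_b')) AS table_size,
       pg_size_pretty(pg_indexes_size('table_b'))  AS all_indexes_size;
```

If the combined index size approaches or exceeds the table size on a table that takes frequent inserts, that is a hint the write overhead deserves a closer look.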

My experience shows that indexing just about everything will hurt your performance overall, for the reasons mentioned above.

Both table A and table B should only contain the indexes that would be of best use for your queries.
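As a rough illustration of what "only the indexes of best use" might look like for the layout in the question, here is a minimal sketch. All table, column, and index names (table_a, table_b, c, f, status, created_at) are assumptions made up for the example, and the composite index presumes one frequent query shape that the question does not actually specify.

```sql
-- Hypothetical schema; only the A.c / B.f relation comes from the question.
CREATE TABLE table_a (
    id bigserial PRIMARY KEY,
    c  integer NOT NULL UNIQUE   -- the UNIQUE constraint already creates an index on c
);

CREATE TABLE table_b (
    id         bigserial PRIMARY KEY,
    f          integer NOT NULL REFERENCES table_a (c),
    status     text NOT NULL,                        -- assumed filter column
    created_at timestamptz NOT NULL DEFAULT now()    -- assumed filter column
);

-- PostgreSQL does not index the referencing column of a foreign key
-- automatically, so the index used for joins on f is created explicitly.
CREATE INDEX table_b_f_idx ON table_b (f);

-- One composite index for an assumed frequent, expensive query shape:
--   SELECT ... FROM table_b WHERE status = $1 AND created_at >= $2
CREATE INDEX table_b_status_created_idx ON table_b (status, created_at);
```

Column order in the composite index matters: the equality column comes first here so that the range condition on created_at can be applied within each status value.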

Of course, your first indexing decisions may need to be refined over time as you learn more about your database's behavior. Do not be afraid to add additional indexes when needed. And do not be afraid to drop an index that is not being useful.
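One way to find candidates for dropping is the statistics view pg_stat_user_indexes, which counts how often each index has been scanned since statistics were last reset. The query below is a sketch; the index name in the DROP statement is hypothetical.

```sql
-- Indexes that have never been used by a scan since the last statistics reset,
-- largest first.
SELECT schemaname,
       relname      AS table_name,
       indexrelname AS index_name,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;

-- Hypothetical index name; CONCURRENTLY avoids blocking writes during the drop.
DROP INDEX CONCURRENTLY IF EXISTS table_b_some_unused_idx;
```

An index that backs a PRIMARY KEY or UNIQUE constraint can show zero scans and still be required, so the list needs a manual review before anything is dropped.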

Related Solutions

Postgresql – Indexing to improve performance of range queries

If you have an index on created then the planner will need to choose between using that index or the PK (or a full table scan) - it will not benefit from both at the same time.

--EDIT

As pointed out by @jug in the comments below, this is not accurate at least since 8.1: the planner may choose to build two in-memory bitmaps and combine them to get the result set. This gets more expensive as the tables get bigger, so the planner may choose not to do this depending on the size of the table and the estimated cost of using one index and then filtering.

--END EDIT
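Whether the planner actually combines the two indexes can be checked with EXPLAIN. The sketch below reuses the foo table from this answer and assumes an index on created exists; the values in the IN list are made up for illustration.

```sql
-- Assumes foo has a primary key on foo_id and a separate index on created.
EXPLAIN
SELECT *
FROM foo
WHERE foo_id IN (1, 2, 3)                  -- illustrative values only
  AND created > 1234 AND created <= 6789;
```

If both indexes are combined, the plan contains a BitmapAnd node above two Bitmap Index Scan nodes. Otherwise it typically shows a single index scan with the other condition applied as a filter.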

The new index will only be helpful if in some cases using it is more efficient than access via the PK. The kind of things that could make this likely include:

A large number of (...) in SELECT * FROM foo WHERE foo_id IN (...) AND created > 1234 AND created <= 6789
A small range, eg created > 6780 AND created <= 6790

Unless one or both is likely to happen, you should not create the secondary index. If they might, it would be best to test each scenario with and without the index to see whether any performance benefit is worth the cost (eg increased storage and overhead for insert and update operations).
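A with/without comparison along those lines could look like the following sketch. It relies on PostgreSQL's transactional DDL so the trial index disappears on ROLLBACK. The index name foo_created_idx is an assumption, and because CREATE INDEX blocks writes to foo until the transaction ends, this is better run against a test copy than a busy production table.

```sql
-- Baseline: plan and timing without the secondary index.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM foo WHERE created > 6780 AND created <= 6790;

BEGIN;
CREATE INDEX foo_created_idx ON foo (created);   -- hypothetical index name
ANALYZE foo;                                     -- refresh planner statistics

-- Same query with the index available.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM foo WHERE created > 6780 AND created <= 6790;
ROLLBACK;                                        -- discards the trial index
```

Repeating the pair of EXPLAIN runs for each query shape of interest shows where the index changes the plan and by how much.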

PostgreSQL Column Types – Numeric vs Integer for Size and Performance

Would this be a good idea?

No.

would numeric(6,0) take less bytes?

No.

test=> SELECT pg_column_size(INT4 '999999'), pg_column_size(NUMERIC(6,0) '999999');
 pg_column_size | pg_column_size 
----------------+----------------
              4 |             10
(1 row)

how about the performance (this table is being queried a lot)?

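Slower. numeric is an arbitrary-precision type stored as a variable-length array of base-10000 digits, so its arithmetic is done in software rather than with native integer instructions.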
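Both claims can be checked on your own server with a quick experiment. This is a sketch using throwaway temporary tables; the names t_int and t_num are made up, and actual sizes and timings depend on the PostgreSQL version and hardware.

```sql
-- Two temporary tables holding the same values, one per type.
CREATE TEMP TABLE t_int AS
    SELECT g::int4 AS v FROM generate_series(1, 999999) AS g;
CREATE TEMP TABLE t_num AS
    SELECT g::numeric(6,0) AS v FROM generate_series(1, 999999) AS g;

-- On-disk size of each table.
SELECT pg_size_pretty(pg_relation_size('t_int')) AS int4_size,
       pg_size_pretty(pg_relation_size('t_num')) AS numeric_size;

-- Compare the execution times reported for the same aggregate.
EXPLAIN ANALYZE SELECT sum(v) FROM t_int;
EXPLAIN ANALYZE SELECT sum(v) FROM t_num;
```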
