Postgresql – Guidelines on best indexing strategy for varying searches on 20+ columns

indexindex-tuningperformancepostgresql

I’m running a search engine for cars. It’s backed by a postgresql 9.3 installation.

Now I’m unsure about the best approach/strategy on doing index optimization for the fronted search.

The problem:

The table containing the cars holds a around 1,5 million rows. People that searches for cars needs different criteria to search by. Some search by brand/model, some by year, some by mileage, some by price and some by special equipment etc. etc. – and often they combine a whole bunch of criteria together. Of cause some, like brand/mode and price, are used more frequently than others. In total we offer: 9 category criteria like brand/model or body type, plus 5 numeric criteria like price or mileage, plus 12 boolean criteria like equipment. Lastly people can order the results by different columns (year, price, mileage and a score we create about the cars). By default we order by our own generated score.

What I’ve done so far:

I have analyzed the usage of the criteria “lightly”, and created a few indexes (10). Among those, are e.g. indexes on price, mileage and a combined index on brand/model. Since we are only interested in showing results for cars which is actually for sale, the indexes are made as partial indexes on a sales state column.

Questions:

How would you go about analyzing and determining what columns should be indexed, and how?
What is the best strategy when optimizing indexes for searches happening on 20 + columns, where the use and the combinations varies a lot? (To just index everything, to index some of the columns, to do combined indexes, to only do single column indexes etc. etc.)
I expect that it does not make sense to index all columns?
I expect it does not make sense to index boolean columns?
Is it better to do a combined index on 5 frequently used columns rather than having individual indexes on each of them?
Would it be a good idea to have all indexes sorted by my default sorting?
Do you have so experiences with other approaches that could greatly improve performance (e.g. forcing indexes to stay in memory etc.)?

Best Answer

Capture the queries that people actually make.
Analyze data from #1; then devise short, multi-column, indexes (2, maybe 3, columns each)
No. Especially not for yes/no flags. It's ok to combine such, as in #2.
See #3.
The order of columns is important in an index. A 5-column index is (roughly) equivalent to the 1-col index on its first column, but is not useful if you don't query on that column.
Probably.
Forcing into memory is counter-productive because it takes away space for caching other stuff.

More discussion: http://mysql.rjweb.org/doc.php/eav

Related Solutions

Sql-server – Indexing strategy for dynamic predicate

If I were you I would run a trace specific to hits on that table. It shouldn't be overly intensive since you are restricting it to just queries against that table, from your application. Run just the minimum needed by the DTA (Database Tuning Advisor). Run it for a day here, a day there, make sure you get some end of week days and some end of month days. Then run the whole lot through the DTA.

Here is why, I'm willing to bet that you have specific combinations of columns that are going to come up more often than not. You can create more complex indexes based on that information. You might also find that you can create some correlated statistics. Basically statistics that have more than one column. For example creating a statistic on City and State together may improve queries against those two columns.

However make sure you don't create to many indexes. On a table that large I'm guessing you do a fair number of writes and every additional index added will slow them down. Of course you may do most of your writes during a batch process.

Also make sure that you put an automatic process to update your statistics periodically. With that many rows the statistics aren't going to update on their own very often. Only once 500+20% of the rows have changed, in at 500 mil rows that's a LOT.

Sql-server – How to improve performance if index is not Most SELECTIVE

Without having much detail, I can't recommend much.

One thing that does jump out at me is that it's very likely you can improve performance on the table by normalizing it! The presence of so many duplicated (so few unique) values in the columns you listed suggests that perhaps many others in the table are not normalized, as well. I'm suggesting making the Name column an int (or even smallint) with a foreign key to a Names table, and the Current_state column bit (or alternately a tinyint) with a foreign key to a WhateverStates table. You would have to, of course, change your data access code to deal with this indirection, but that is nothing more than the basic job relational database developers have always had to do.

Normalizing will reduce the number of bytes per row, increasing the number of rows per page, reducing the number of pages that have to be read to satisfy any particular query, helping performance across the board! Right now the columns given require likely close to 34 bytes each. After the change I suggest, those columns will only require 11 bytes each. Of course, I haven't seen your whole table--your rows may be so big that it doesn't matter.

What columns and datatypes are in the clustered index (if there is one)? This can radically affect the size of the nonclustered indexes, again affecting performance in exactly the way I described (rows per page).

When you do query based on non-selective columns such as Current_state, what other columns are always or almost always included? It may be okay for you to have a nonselective column in an index if the index also contains a more selective column (or that in conjunction with the less selective column is more highly selective). If on the other hand you generally query often for rows based on the single column Current_state = 'Pending', then you can add a filtered index:

CREATE INDEX IX_YourTable_Pending ON dbo.YourTable (ClusteredColumnsInOrder)
WHERE Current_State = 'Pending'; --SQL 2008 and up only

This technique could help you even when you also include other columns: you would want to put those in the index instead of the ClusteredColumnsInOrder columns I suggested (which was just a tricky way to not put any additional columns into the nonclustered index, since--remember, now--nonclustered indexes always have all the columns of the clustered index implicitly included). Or, if you only pull a very few other columns, you can make your nonclustered index cover the query by adding INCLUDE (AdditionalColumn1, AdditionalColumn2) so that the query engine doesn't have to go back to the clustered index to satisfy the query.

You haven't provided very much information such as full table schema, sample data, and sample queries, and without those it's going to be pretty hard to give you very specific advice about what to do.

One thing I can say, though, is that indiscriminately throwing indexes at the table may not improve things much and could in fact hurt performance of your system overall.

If the hints I have given you here don't seem to help much, then I recommend that you do come back with some of the additional info I mentioned so that we can do a better job of assisting you.

Best Answer

Related Solutions

Sql-server – Indexing strategy for dynamic predicate

Sql-server – How to improve performance if index is not Most SELECTIVE

Related Question