PostgreSQL – Create index on very large table with many shared values

Tags: index, postgresql

I'm looking to create an index on a large table (~50 million rows) on a field with lots of non-unique values.

Table schema looks like:

 Column |         Type          | Modifiers | Storage  | Stats target | Description 
--------+-----------------------+-----------+----------+--------------+-------------
 gid    | character varying(20) |           | extended |              | 
 word   | character varying(30) |           | extended |              | 
 stat   | double precision      |           | plain    |              | 
Has OIDs: no

I want to create an index on the 'word' column. There is a fairly regular pattern where each word appears about 1000 times. I need to run fast SELECT * FROM mytable WHERE word = 'something'; queries. Creating a regular B-Tree index on these tables takes a ton of time, but it does substantially improve performance.
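
For reference, the index being built is a plain B-Tree along these lines (table and index names are placeholders):

-- plain B-Tree on the word column (names are placeholders)
CREATE INDEX mytable_word_idx ON mytable (word);
-- CREATE INDEX CONCURRENTLY would avoid blocking writes during the build,
-- at the cost of a slower build overall.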

I'm uncomfortable with my current solution for several reasons:

(1) The choice of a B-Tree index wasn't particularly deliberate. Are there alternative indexing schemes that perform better on fields with highly duplicated values?

(2) I'm in a production environment where these tables pop into and out of existence fairly regularly. Because not all tables will be heavily queried, I've opted to build indexes on a table only when certain applications (outside the DB) are triggered, so that I know 10k+ queries will hit that table+field. Waiting 20 minutes while an index is created isn't ideal, however. The situation is delicate: the speed-up gained from the index competes with the initial time sink required to create it. Are there 'cheaper' indexes to create, perhaps ones that perform slightly worse overall than a B-Tree but have a lower initial creation cost?

Best Answer

For starters, gid should probably be a numeric type: integer should be good enough, or bigint if the key space might outgrow it. That means a much smaller footprint, faster processing than with character data, and faster and smaller indexes.
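
For illustration only, assuming every existing gid value is a plain digit string, the column could be converted in place (note that this rewrites the whole table):

ALTER TABLE big_tbl
    ALTER COLUMN gid TYPE bigint USING gid::bigint;  -- assumes all values cast cleanly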

More importantly, to improve performance I suggest database normalization.

Quote:

There is a fairly regular pattern where each word appears about 1000 times.

Create a separate table for unique words:

CREATE TABLE word (
   word_id serial
 , word    text
);

Fill it with unique instances of word in your big_tbl:

INSERT INTO word (word)
SELECT DISTINCT word
FROM   big_tbl
ORDER  BY word;

The ORDER BY is optional and not needed for the query at hand, but it speeds up the index creation below and might be cheaper overall.

The table should be small in comparison: with ~1000 occurrences per word, only ~50k rows for the 50M rows in your big table.
Add indexes after filling the table:

ALTER TABLE word
    ADD CONSTRAINT word_word_uni UNIQUE (word) -- essential
  , ADD CONSTRAINT word_word_id_pkey PRIMARY KEY (word_id);  -- expendable?

If these are read-only tables, you can do without the PK; it's not relevant to the operations at hand.

Replace your big table with a much smaller new table. You may have to lock the big table to avoid concurrent writes (see the sketch after the rename below). Concurrent reads are not a problem.

CREATE TABLE big_tbl_new AS
SELECT b.gid      -- or the suggested smaller, faster numeric replacement
     , w.word_id, b.stat
FROM   big_tbl b
JOIN   word w USING (word)
ORDER  BY word;   -- sorting by word helps query at hand

The ORDER BY clusters the data (once), making the query at hand faster because far fewer blocks have to be read (unless your data is mostly clustered already). The sort carries a cost; weigh cost and benefit once more.

DROP TABLE big_tbl;     -- make sure your new table has all the data!
ALTER TABLE big_tbl_new RENAME TO big_tbl;
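
To guard against concurrent writes during the swap, one option is to wrap the whole sequence in a single transaction with an explicit lock; a minimal sketch (SHARE mode blocks writes but still allows reads until the DROP takes its exclusive lock):

BEGIN;
LOCK TABLE big_tbl IN SHARE MODE;   -- blocks writes, allows concurrent reads

CREATE TABLE big_tbl_new AS
SELECT b.gid, w.word_id, b.stat
FROM   big_tbl b
JOIN   word w USING (word)
ORDER  BY word;

DROP TABLE big_tbl;
ALTER TABLE big_tbl_new RENAME TO big_tbl;

COMMIT;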

Recreate indexes:

ALTER TABLE big_tbl ADD CONSTRAINT big_tbl_gid_pkey PRIMARY KEY (gid);  -- expendable?
CREATE INDEX big_tbl_word_id_idx ON big_tbl (word_id);  -- essential

Your query looks like this now and should be faster:

SELECT b.*
FROM   word w
JOIN   big_tbl b USING (word_id)
WHERE  w.word = 'something';

This is meant as a one-time operation to reorganize your data. Keep the new form, and consider keeping the indexes permanently as well.

All of this together (including the new indexes) should occupy about half the disk space you had before, also cutting creation time in half (at least). Index creation should be considerably faster, and so should the query. If RAM is a limiting factor, these modifications pay off double.
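
To check the before/after footprint, the standard size functions can be used, for example:

SELECT pg_size_pretty(pg_total_relation_size('big_tbl')) AS big_tbl_size
     , pg_size_pretty(pg_total_relation_size('word'))    AS word_size;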

If you also have to write to the table, it becomes more expensive (but you did not mention anything about that). You would need to adjust your logic for DELETE / UPDATE / INSERT.
Example for INSERT: fetch the word_id for an existing word, or insert a new row into word and return the new word_id. Details:
How do I insert a row which contains a foreign key?
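
For illustration, a sketch of that INSERT logic, assuming PostgreSQL 9.5+ for ON CONFLICT, the UNIQUE constraint on word.word from above, and hypothetical parameters $1 (gid), $2 (word), $3 (stat):

-- insert the word if it is new, then insert the big_tbl row with its word_id
WITH ins AS (
   INSERT INTO word (word)
   VALUES ($2)
   ON     CONFLICT (word) DO NOTHING
   RETURNING word_id
   )
INSERT INTO big_tbl (gid, word_id, stat)
SELECT $1, w.word_id, $3
FROM  (
   SELECT word_id FROM ins                    -- word was just inserted
   UNION  ALL
   SELECT word_id FROM word WHERE word = $2   -- word already existed
   ) w
LIMIT  1;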