Postgresql – Improve query speed for large, wide table

postgresql, postgresql-9.4, table

I have created a medical claims database using PostgreSQL 9.4.1 and pgAdmin III. There is one table with ~61 million rows and 103 columns. The table has the structure given below. It is static and will not be updated. (Most columns are omitted here to save space.)

    CREATE TABLE claims
    (
      spending double precision,
      releaseid bigint NOT NULL,
      priceinclude bytea,
      costsharing double precision,
      hcciprofcode bytea,
      memberlinkeid bigint,
      age integer,
      orgid integer,
      gender character(1),
      membercity character varying(22),
      memberstate character(2),
      zip character varying(5),
      principaldiagnosis character varying(6),
      CONSTRAINT claims_pkid PRIMARY KEY (releaseid)
    )
    WITH (
      OIDS=FALSE
    );

Indexes:
"claims_pkid" PRIMARY KEY, btree (releaseid)
"claims_age_idx" btree (age)
"claims_memberlinkeid_idx" btree (memberlinkeid)
"claims_principaldiagnosis_idx" btree (principaldiagnosis)
"claims_serviceprovider_idx" btree (serviceprovider)
"claims_zip_idx" btree (zip)

The indexes shown above are the only ones I've added so far. I am running the database on the following setup:

  • 64-bit Windows 7
  • 32.0 GB RAM
  • Intel Core i7-4770 CPU @ 3.40 GHz

My main question is what I can do to improve query speed without buying a dedicated server. I am trying to demonstrate a proof of concept to my supervisors, to show them that SQL would be faster than the statistical package they currently use, Stata. I have changed the following PostgreSQL settings from their defaults:

shared_buffers=8GB
effective_cache_size=24GB
work_mem=1GB
maintenance_work_mem=1GB
checkpoint_segments=128
checkpoint_completion_target=0.9
wal_buffers=-1
default_statistics_target=500
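
For reference, settings like these can also be applied with ALTER SYSTEM instead of editing postgresql.conf by hand; a sketch for a few of them (shared_buffers only takes effect after a server restart):

    -- Sketch: applying a few of the settings above via ALTER SYSTEM
    ALTER SYSTEM SET shared_buffers = '8GB';           -- needs a server restart
    ALTER SYSTEM SET effective_cache_size = '24GB';
    ALTER SYSTEM SET work_mem = '1GB';

    -- Reload the configuration for settings that do not require a restart
    SELECT pg_reload_conf();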

However, the queries are still slow. For example, the following query

EXPLAIN ANALYZE VERBOSE
SELECT * FROM claims
WHERE gender='F'
AND age BETWEEN 16 AND 20
LIMIT 100000;

returns this:

Limit  (cost=0.00..226622.43 rows=100000 width=427) (actual time=0.029..9509.028 rows=100000 loops=1)
  Output: spending, releaseid, priceinclude, costsharing, hcciprofcode, htype, hcciotpcode, drgmpr, medsurg, typecode, claimcat, memberlinkeid, age, orgid, insurancetype, gender, membercity, memberstate, zip, admissiondate, serviceprovider, servicenpi, ser (...)
  ->  Seq Scan on public.claims  (cost=0.00..4778859.76 rows=2108732 width=427) (actual time=0.028..9497.420 rows=100000 loops=1)
        Output: spending, releaseid, priceinclude, costsharing, hcciprofcode, htype, hcciotpcode, drgmpr, medsurg, typecode, claimcat, memberlinkeid, age, orgid, insurancetype, gender, membercity, memberstate, zip, admissiondate, serviceprovider, servicenp (...)
        Filter: ((claims.age >= 16) AND (claims.age <= 20) AND (claims.gender = 'F'::bpchar))
        Rows Removed by Filter: 3194024
Planning time: 0.190 ms
Execution time: 9519.455 ms

Should I be creating a new view or a smaller table for the frequently used columns (something like the sketch below)? I have never created a database this large, so I am not sure what the best approach is.
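
What I have in mind is along these lines, with the column list purely illustrative; since the table is static, a materialized view seems like a candidate:

    -- Illustrative sketch: a narrow materialized view over frequently used columns
    CREATE MATERIALIZED VIEW claims_slim AS
    SELECT releaseid, age, gender, zip, principaldiagnosis, spending
    FROM claims;

    CREATE INDEX claims_slim_gender_age_idx ON claims_slim (gender, age);

    -- Only needed if the underlying table ever changes
    REFRESH MATERIALIZED VIEW claims_slim;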

UPDATE: My goal is to be able to query this database in a reasonable amount of time for exploratory data analysis. 15 seconds for queries is reasonable; 5 minutes for queries on the whole table is not. I am not sure how best to deal with this large table in order to achieve that goal, which is what I would like help with. (Apologies if this is a little vague.)

Best Answer

For your example query (SELECT * FROM claims WHERE gender='F' AND age BETWEEN 16 AND 20 LIMIT 100000;), try adding a multicolumn index on (gender, age); the single-column index on age already exists and covers only part of the filter, so the planner fell back to a sequential scan. More generally, I suggest monitoring your queries with EXPLAIN (ANALYZE) and the pg_stat_statements extension (the PostgreSQL counterparts of tools like SQL Server Profiler and the Database Engine Tuning Advisor), and then adding suitable indexes and making suitable changes based on what you see.
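
A minimal sketch of both suggestions (the index name is just an example; pg_stat_statements also needs shared_preload_libraries = 'pg_stat_statements' in postgresql.conf and a server restart):

    -- Multicolumn index matching the WHERE clause of the example query
    CREATE INDEX claims_gender_age_idx ON claims (gender, age);

    -- Collect per-query statistics (after adding pg_stat_statements to
    -- shared_preload_libraries and restarting the server)
    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

    -- Then look for the slowest queries by total time
    SELECT query, calls, total_time, rows
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 10;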