PostgreSQL – How to Optimize Slow Query Selecting Single Row from Range

execution-plan, performance, postgresql, query-performance

I have imported a copy of the ip2location_db11 lite database, which contains 3,319,097 rows, and I am looking to optimize a numeric range query, where the low and high values are in separate columns of the table (ip_from, ip_to).

Importing the database:

CREATE TABLE ip2location_db11
(
  ip_from bigint NOT NULL, -- First IP address in netblock.
  ip_to bigint NOT NULL, -- Last IP address in netblock.
  country_code character(2) NOT NULL, -- Two-character country code based on ISO 3166.
  country_name character varying(64) NOT NULL, -- Country name based on ISO 3166.
  region_name character varying(128) NOT NULL, -- Region or state name.
  city_name character varying(128) NOT NULL, -- City name.
  latitude real NOT NULL, -- City latitude. Default to capital city latitude if city is unknown.
  longitude real NOT NULL, -- City longitude. Default to capital city longitude if city is unknown.
  zip_code character varying(30) NOT NULL, -- ZIP/Postal code.
  time_zone character varying(8) NOT NULL, -- UTC time zone (with DST supported).
  CONSTRAINT ip2location_db11_pkey PRIMARY KEY (ip_from, ip_to)
);
\copy ip2location_db11 FROM 'IP2LOCATION-LITE-DB11.CSV' WITH CSV QUOTE AS '"';
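
For reference, ip_from and ip_to hold IPv4 addresses encoded as bigint values (a*2^24 + b*2^16 + c*2^8 + d), which is where constants such as 2538629520 come from. A minimal conversion sketch (the sample address here is illustrative, not taken from the data):

-- Convert a dotted-quad IPv4 address to the bigint encoding used by
-- IP2Location: a*2^24 + b*2^16 + c*2^8 + d.
SELECT (split_part(ip, '.', 1)::bigint << 24)
     + (split_part(ip, '.', 2)::bigint << 16)
     + (split_part(ip, '.', 3)::bigint << 8)
     +  split_part(ip, '.', 4)::bigint AS ip_as_bigint
FROM (VALUES ('1.0.0.10')) AS t(ip);  -- 16777226

-- Equivalent using the built-in inet type:
SELECT '1.0.0.10'::inet - '0.0.0.0'::inet;  -- 16777226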

My first naive indexing attempt was to create separate indices on each of those columns, which resulted in a sequential scan with query times of 400ms:

account=> CREATE INDEX ip_from_db11_idx ON ip2location_db11 (ip_from);
account=> CREATE INDEX ip_to_db11_idx ON ip2location_db11 (ip_to);

account=> EXPLAIN ANALYZE VERBOSE SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;

                                                          QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.ip2location_db11  (cost=0.00..48930.99 rows=43111 width=842) (actual time=286.714..401.805 rows=1 loops=1)
   Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
   Filter: (('2538629520'::bigint >= ip2location_db11.ip_from) AND ('2538629520'::bigint <= ip2location_db11.ip_to))
   Rows Removed by Filter: 3319096
 Planning time: 0.155 ms
 Execution time: 401.834 ms
(6 rows)

account=> \d ip2location_db11
          Table "public.ip2location_db11"
    Column    |          Type          | Modifiers
--------------+------------------------+-----------
 ip_from      | bigint                 | not null
 ip_to        | bigint                 | not null
 country_code | character(2)           | not null
 country_name | character varying(64)  | not null
 region_name  | character varying(128) | not null
 city_name    | character varying(128) | not null
 latitude     | real                   | not null
 longitude    | real                   | not null
 zip_code     | character varying(30)  | not null
 time_zone    | character varying(8)   | not null
Indexes:
    "ip2location_db11_pkey" PRIMARY KEY, btree (ip_from, ip_to)
    "ip_from_db11_idx" btree (ip_from)
    "ip_to_db11_idx" btree (ip_to)

My second attempt was to create a multi-column btree index, which resulted in an index scan with query times of around 290ms (note that the planner actually chose the existing single-column ip_to_db11_idx for this plan, not the new index):

account=> CREATE INDEX ip_range_db11_idx ON ip2location_db11 (ip_from,ip_to);

account=> EXPLAIN ANALYZE VERBOSE SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using ip_to_db11_idx on public.ip2location_db11 (cost=0.43..51334.91 rows=756866 width=69) (actual time=1.109..289.143 rows=1 loops=1)
   Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
   Index Cond: ('2538629520'::bigint <= ip2location_db11.ip_to)
   Filter: ('2538629520'::bigint >= ip2location_db11.ip_from)
   Rows Removed by Filter: 1160706
 Planning time: 0.324 ms
 Execution time: 289.172 ms
(7 rows)

account=> \d ip2location_db11
          Table "public.ip2location_db11"
    Column    |          Type          | Modifiers
--------------+------------------------+-----------
 ip_from      | bigint                 | not null
 ip_to        | bigint                 | not null
 country_code | character(2)           | not null
 country_name | character varying(64)  | not null
 region_name  | character varying(128) | not null
 city_name    | character varying(128) | not null
 latitude     | real                   | not null
 longitude    | real                   | not null
 zip_code     | character varying(30)  | not null
 time_zone    | character varying(8)   | not null
Indexes:
    "ip2location_db11_pkey" PRIMARY KEY, btree (ip_from, ip_to)
    "ip_from_db11_idx" btree (ip_from)
    "ip_range_db11_idx" btree (ip_from, ip_to)
    "ip_to_db11_idx" btree (ip_to)

Update: As requested in the comments, I have re-run the above query. The timings of the first 15 queries after re-creating the table were 165ms, 65ms, 86ms, 83ms, 86ms, 64ms, 85ms, 811ms, 868ms, 845ms, 810ms, 781ms, 797ms, 890ms, 806ms:

account=> EXPLAIN (ANALYZE, VERBOSE, BUFFERS, TIMING) SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.ip2location_db11  (cost=28200.29..76843.12 rows=368789 width=842) (actual time=64.866..64.866 rows=1 loops=1)
   Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
   Recheck Cond: (('2538629520'::bigint >= ip2location_db11.ip_from) AND ('2538629520'::bigint <= ip2location_db11.ip_to))
   Heap Blocks: exact=1
   Buffers: shared hit=8273
   ->  Bitmap Index Scan on ip_range_db11_idx  (cost=0.00..28108.09 rows=368789 width=0) (actual time=64.859..64.859 rows=1 loops=1)
         Index Cond: (('2538629520'::bigint >= ip2location_db11.ip_from) AND ('2538629520'::bigint <= ip2location_db11.ip_to))
         Buffers: shared hit=8272
 Planning time: 0.099 ms
 Execution time: 64.907 ms
(10 rows)

account=> EXPLAIN (ANALYZE, VERBOSE, BUFFERS, TIMING) SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.ip2location_db11  (cost=0.00..92906.18 rows=754776 width=69) (actual time=577.234..811.757 rows=1 loops=1)
   Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
   Filter: (('2538629520'::bigint >= ip2location_db11.ip_from) AND ('2538629520'::bigint <= ip2location_db11.ip_to))
   Rows Removed by Filter: 3319096
   Buffers: shared hit=33 read=43078
 Planning time: 0.667 ms
 Execution time: 811.783 ms
(7 rows)

Sample rows from the imported CSV file:

"0","16777215","-","-","-","-","0.000000","0.000000","-","-"
"16777216","16777471","AU","Australia","Queensland","Brisbane","-27.467940","153.028090","4000","+10:00"
"16777472","16778239","CN","China","Fujian","Fuzhou","26.061390","119.306110","350004","+08:00"

Is there a better way to index this table that would improve the query, or is there a more efficient query that would get me the same result?

Best Answer

This is a slightly different solution to those already offered, which involved using spatial indexes to do some tricks.

Instead, it's worth remembering that with IP addresses you cannot have overlapping ranges: a range A -> B cannot intersect a range X -> Y in any way. Knowing this, you can change your SELECT query slightly to take advantage of this trait, and you do not need any "clever" indexing at all. In fact, you only need to index your ip_from column.

Previously, the query being analyzed was:

SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;

Let's assume that the range that 2538629520 falls into happens to be 2538629512 to 2538629537.

Note: It really does not matter what the range is; this is just to help demonstrate the pattern we can take advantage of.

From this we can assume that the next ip_from value is 2538629538. We don't actually need to worry about any records from that ip_from value upward; all we actually care about is the range where ip_from equals 2538629512.

Knowing this fact, our query actually becomes (in English):

Find me the maximum ip_from value that is less than or equal to my IP address. Show me the record where you find this value.

Or in other words: find me the ip_from value at or just before my IP address and give me that record.

Because we never have overlapping ip_from to ip_to ranges, this holds true and allows us to write the query as:

SELECT *
FROM ip2location
WHERE ip_from = (
    SELECT MAX(ip_from)
    FROM ip2location
    WHERE ip_from <= 2538629520
);
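
A caveat worth adding (my note, not part of the original answer): this relies on every address falling inside some block. If the data could contain gaps between netblocks, an address inside a gap would match the preceding block, so re-checking ip_to makes the lookup gap-safe (the answer's ip2location table name is kept here; substitute ip2location_db11 for the question's table):

SELECT *
FROM ip2location
WHERE ip_from = (
    SELECT MAX(ip_from)
    FROM ip2location
    WHERE ip_from <= 2538629520
)
AND ip_to >= 2538629520;  -- returns no row if the address falls in a gap

An equivalent formulation, which the planner can also satisfy with a single backward index seek:

SELECT *
FROM ip2location
WHERE ip_from <= 2538629520
ORDER BY ip_from DESC
LIMIT 1;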

Back to the indexing to take advantage of all this. All we are actually searching on is ip_from, and we are doing integer comparisons. The MAX(ip_from) subquery has PostgreSQL seek straight to the last index entry at or below our value. This is good because we can seek right to that record and then not worry about any other records at all.

All we really need is an index like:

CREATE UNIQUE INDEX CONCURRENTLY ix_ip2location_ipFrom ON public.ip2location (ip_from);

We can make the index unique because ranges never overlap, so no two records share an ip_from value. I would even make this column the primary key myself.
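
A sketch of that suggestion, assuming the table does not already have a competing primary key: PostgreSQL can promote an existing unique index to back the constraint (an unquoted index name folds to lower case, hence ix_ip2location_ipfrom):

ALTER TABLE public.ip2location
  ADD CONSTRAINT ip2location_pkey PRIMARY KEY USING INDEX ix_ip2location_ipfrom;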

With this index and this query, the explain plan is:

 Index Scan using ix_ip2location_ipfrom on public.ip2location  (cost=0.90..8.92 rows=1 width=69) (actual time=0.530..0.533 rows=1 loops=1)
   Output: ip2location.ip_from, ip2location.ip_to, ip2location.country_code, ip2location.country_name, ip2location.region_name, ip2location.city_name, ip2location.latitude, ip2location.longitude, ip2location.zip_code, ip2location.time_zone
   Index Cond: (ip2location.ip_from = $1)
   InitPlan 2 (returns $1)
     ->  Result  (cost=0.46..0.47 rows=1 width=8) (actual time=0.452..0.452 rows=1 loops=1)
           Output: $0
           InitPlan 1 (returns $0)
             ->  Limit  (cost=0.43..0.46 rows=1 width=8) (actual time=0.443..0.444 rows=1 loops=1)
                   Output: ip2location_1.ip_from
                   ->  Index Only Scan using ix_ip2location_ipfrom on public.ip2location ip2location_1  (cost=0.43..35440.79 rows=1144218 width=8) (actual time=0.438..0.438 rows=1 loops=1)
                         Output: ip2location_1.ip_from
                         Index Cond: ((ip2location_1.ip_from IS NOT NULL) AND (ip2location_1.ip_from >= '2538629520'::bigint))
                         Heap Fetches: 0

To give you an idea of the improvement in query performance with this approach, I tested it on my Raspberry Pi. The original approach took approximately 4 seconds; this approach takes approximately 120ms. The big win comes from individual row seeks versus scans. The original query would suffer extremely with low range values, as more of the table needs to be considered in the results, whereas this query exhibits consistent performance across the whole range of values.
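
If this lookup is on a hot path, one way to package it (a hypothetical wrapper, not something from the original answer) is a simple SQL function, so callers do not have to repeat the subquery; the gap-safe ip_to re-check from above is included:

CREATE FUNCTION lookup_ip(p_ip bigint)
RETURNS SETOF ip2location
LANGUAGE sql STABLE AS $$
    SELECT *
    FROM ip2location
    WHERE ip_from = (SELECT MAX(ip_from) FROM ip2location WHERE ip_from <= p_ip)
      AND ip_to >= p_ip;
$$;

-- Usage:
SELECT country_code, city_name FROM lookup_ip(2538629520);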

Hope this helps, and that my explanation makes sense to you all.