I have imported a copy of the ip2location_db11 lite database, which contains 3,319,097 rows, and I am looking to optimize a numeric range query where the low and high values are in separate columns of the table (ip_from, ip_to).
Importing the database:
CREATE TABLE ip2location_db11
(
ip_from bigint NOT NULL, -- First IP address in netblock.
ip_to bigint NOT NULL, -- Last IP address in netblock.
country_code character(2) NOT NULL, -- Two-character country code based on ISO 3166.
country_name character varying(64) NOT NULL, -- Country name based on ISO 3166.
region_name character varying(128) NOT NULL, -- Region or state name.
city_name character varying(128) NOT NULL, -- City name.
latitude real NOT NULL, -- City latitude. Default to capital city latitude if city is unknown.
longitude real NOT NULL, -- City longitude. Default to capital city longitude if city is unknown.
zip_code character varying(30) NOT NULL, -- ZIP/Postal code.
time_zone character varying(8) NOT NULL, -- UTC time zone (with DST supported).
CONSTRAINT ip2location_db11_pkey PRIMARY KEY (ip_from, ip_to)
);
\copy ip2location_db11 FROM 'IP2LOCATION-LITE-DB11.CSV' WITH CSV QUOTE AS '"';
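As an aside (not part of the original post): the lookup key used throughout is the integer form of an IPv4 address. In PostgreSQL, inet arithmetic can convert in both directions; for example, the value 2538629520 used in the queries below corresponds to 151.80.105.144:

```sql
-- Dotted quad to the bigint form used by ip2location
-- (inet - inet yields a bigint offset from 0.0.0.0):
SELECT inet '151.80.105.144' - inet '0.0.0.0' AS ip_num;   -- 2538629520

-- And back again (inet + bigint yields an inet):
SELECT inet '0.0.0.0' + 2538629520 AS ip;                  -- 151.80.105.144
```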
My first naive indexing attempt was to create separate indices on each of those columns, which resulted in a sequential scan with query times of 400ms:
account=> CREATE INDEX ip_from_db11_idx ON ip2location_db11 (ip_from);
account=> CREATE INDEX ip_to_db11_idx ON ip2location_db11 (ip_to);
account=> EXPLAIN ANALYZE VERBOSE SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on public.ip2location_db11 (cost=0.00..48930.99 rows=43111 width=842) (actual time=286.714..401.805 rows=1 loops=1)
Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
Filter: (('2538629520'::bigint >= ip2location_db11.ip_from) AND ('2538629520'::bigint <= ip2location_db11.ip_to))
Rows Removed by Filter: 3319096
Planning time: 0.155 ms
Execution time: 401.834 ms
(6 rows)
account=> \d ip2location_db11
Table "public.ip2location_db11"
Column | Type | Modifiers
--------------+------------------------+-----------
ip_from | bigint | not null
ip_to | bigint | not null
country_code | character(2) | not null
country_name | character varying(64) | not null
region_name | character varying(128) | not null
city_name | character varying(128) | not null
latitude | real | not null
longitude | real | not null
zip_code | character varying(30) | not null
time_zone | character varying(8) | not null
Indexes:
"ip2location_db11_pkey" PRIMARY KEY, btree (ip_from, ip_to)
"ip_from_db11_idx" btree (ip_from)
"ip_to_db11_idx" btree (ip_to)
My second attempt was to create a multi-column btree index. The planner still chose a single-column index scan (on ip_to_db11_idx, not the new index), with query times of 290ms:
account=> CREATE INDEX ip_range_db11_idx ON ip2location_db11 (ip_from,ip_to);
account=> EXPLAIN ANALYZE VERBOSE SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using ip_to_db11_idx on public.ip2location_db11 (cost=0.43..51334.91 rows=756866 width=69) (actual time=1.109..289.143 rows=1 loops=1)
Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
Index Cond: ('2538629520'::bigint <= ip2location_db11.ip_to)
Filter: ('2538629520'::bigint >= ip2location_db11.ip_from)
Rows Removed by Filter: 1160706
Planning time: 0.324 ms
Execution time: 289.172 ms
(7 rows)
n4l_account=> \d ip2location_db11
Table "public.ip2location_db11"
Column | Type | Modifiers
--------------+------------------------+-----------
ip_from | bigint | not null
ip_to | bigint | not null
country_code | character(2) | not null
country_name | character varying(64) | not null
region_name | character varying(128) | not null
city_name | character varying(128) | not null
latitude | real | not null
longitude | real | not null
zip_code | character varying(30) | not null
time_zone | character varying(8) | not null
Indexes:
"ip2location_db11_pkey" PRIMARY KEY, btree (ip_from, ip_to)
"ip_from_db11_idx" btree (ip_from)
"ip_range_db11_idx" btree (ip_from, ip_to)
"ip_to_db11_idx" btree (ip_to)
Update: As requested in the comments, I have re-run the above query. The timings of the first 15 queries after re-creating the table were 165ms, 65ms, 86ms, 83ms, 86ms, 64ms, 85ms, 811ms, 868ms, 845ms, 810ms, 781ms, 797ms, 890ms, 806ms:
account=> EXPLAIN (ANALYZE, VERBOSE, BUFFERS, TIMING) SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.ip2location_db11 (cost=28200.29..76843.12 rows=368789 width=842) (actual time=64.866..64.866 rows=1 loops=1)
Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
Recheck Cond: (('2538629520'::bigint >= ip2location_db11.ip_from) AND ('2538629520'::bigint <= ip2location_db11.ip_to))
Heap Blocks: exact=1
Buffers: shared hit=8273
-> Bitmap Index Scan on ip_range_db11_idx (cost=0.00..28108.09 rows=368789 width=0) (actual time=64.859..64.859 rows=1 loops=1)
Index Cond: (('2538629520'::bigint >= ip2location_db11.ip_from) AND ('2538629520'::bigint <= ip2location_db11.ip_to))
Buffers: shared hit=8272
Planning time: 0.099 ms
Execution time: 64.907 ms
(10 rows)
account=> EXPLAIN (ANALYZE, VERBOSE, BUFFERS, TIMING) SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Seq Scan on public.ip2location_db11 (cost=0.00..92906.18 rows=754776 width=69) (actual time=577.234..811.757 rows=1 loops=1)
Output: ip_from, ip_to, country_code, country_name, region_name, city_name, latitude, longitude, zip_code, time_zone
Filter: (('2538629520'::bigint >= ip2location_db11.ip_from) AND ('2538629520'::bigint <= ip2location_db11.ip_to))
Rows Removed by Filter: 3319096
Buffers: shared hit=33 read=43078
Planning time: 0.667 ms
Execution time: 811.783 ms
(7 rows)
Sample rows from the imported CSV file:
"0","16777215","-","-","-","-","0.000000","0.000000","-","-"
"16777216","16777471","AU","Australia","Queensland","Brisbane","-27.467940","153.028090","4000","+10:00"
"16777472","16778239","CN","China","Fujian","Fuzhou","26.061390","119.306110","350004","+08:00"
Is there a better way to index this table that would improve the query, or is there a more efficient query that would get me the same result?
Best Answer
This is a little bit of a different solution to those already offered, which involved using spatial indexes to do some tricks.
Instead, it's worth remembering that with IP addresses you cannot have overlapping ranges. That is, A -> B cannot intersect X -> Y in any way. Knowing this, you can change your SELECT query slightly to take advantage of this trait, and in doing so you do not need any "clever" indexing at all. In fact, you only need to index your ip_from column.
Previously, the query being analyzed was:
SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
Let's assume that the range that 2538629520 falls into happens to be 2538629512 to 2538629537. From this we can assume that the next ip_from value is 2538629538. We don't actually need to worry about any records at or above this ip_from value; indeed, all we actually care about is the single range where ip_from equals 2538629512.
Knowing this fact, our query actually becomes (in English): find the row with the greatest ip_from at or below the target address, then check that the target does not exceed that row's ip_to.
Because we never have overlapping ranges of ip_from to ip_to, this holds true and allows us to write the query as a single-row seek.
Back to the indexing to take advantage of all this. All we're actually searching on is ip_from, and we are doing integer comparisons. The aggregate over ip_from has PostgreSQL seek straight to the first available record. This is good because we can seek right to that record and then not worry about any other records at all.
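The answer's original SQL is not reproduced in this copy. A sketch consistent with the description above (an assumption on my part: the aggregate seek is expressed as MAX over ip_from at or below the target, which an ip_from index can satisfy with a single backward seek, followed by an ip_to sanity check) might look like:

```sql
-- Hypothetical reconstruction, not the answer's verbatim SQL:
-- seek the last range starting at or before the target address,
-- then confirm the target actually falls inside that range.
SELECT *
FROM   ip2location_db11
WHERE  ip_from = (SELECT MAX(ip_from)
                  FROM   ip2location_db11
                  WHERE  ip_from <= 2538629520)
  AND  ip_to  >= 2538629520;  -- returns no row if the target lies in a gap
```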
All we really need is an index like:
CREATE UNIQUE INDEX CONCURRENTLY ix_ip2location_ipFrom ON public.ip2location(ip_from);
We can make the index unique because we will not have overlapping records. I would even make this column the primary key myself.
With this index and this query, the explain plan (not preserved in this copy) comes down to a single index seek on ip_from returning one candidate row.
To give you an idea of the improvement in query performance with this approach, I tested it on my Raspberry Pi. The original approach took approximately 4 seconds; this approach takes approximately 120ms. The big win comes from individual row seeks versus scans. The original query would suffer extremely with low range values, as more of the table has to be considered in the results; this query exhibits consistent performance across the whole range of values.
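If many addresses need resolving at once, the same single-seek pattern can be applied once per input row with a LATERAL join. A sketch, assuming a hypothetical input table ips(ip_num bigint) that is not part of the original post:

```sql
-- Hypothetical batch lookup: one backward index seek per input address.
SELECT i.ip_num, loc.country_code, loc.city_name
FROM   ips AS i
LEFT   JOIN LATERAL (
         SELECT d.country_code, d.city_name, d.ip_to
         FROM   ip2location_db11 AS d
         WHERE  d.ip_from <= i.ip_num    -- seek to the last candidate range
         ORDER  BY d.ip_from DESC
         LIMIT  1
       ) AS loc
       ON loc.ip_to >= i.ip_num;         -- NULL columns if ip_num falls in a gap
```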
Hope this helps and my explanation makes sense to you all.