How to Get PostgreSQL FDW to Push Down LIMIT to Backend Server

Tags: performance, postgresql-fdw, postgresql-performance, sharding

I've set up a PostgreSQL FDW server with the following table, sharded by user_id over four servers:

CREATE TABLE my_big_table
(
    user_id bigint NOT NULL,
    serial bigint NOT NULL,         -- external, incrementing only
    some_object_id bigint NOT NULL,
    timestamp_ns bigint NOT NULL,
    object_type smallint NOT NULL,
    other_type smallint NOT NULL,
    data bytea
) PARTITION BY HASH (user_id);
CREATE SERVER shardA
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host '192.168.200.11', port '5432', dbname 'postgres', fetch_size '10000');
 .
 .
 .
CREATE SERVER shardD
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host '192.168.200.14', port '5432', dbname 'postgres', fetch_size '10000');
CREATE FOREIGN TABLE my_big_table_mod4_s0 PARTITION OF my_big_table
    FOR VALUES WITH (MODULUS 4, REMAINDER 0) SERVER shardA
    OPTIONS (table_name 'my_big_table_mod4_s0');
 .
 .
 .
CREATE FOREIGN TABLE my_big_table_mod4_s3 PARTITION OF my_big_table
    FOR VALUES WITH (MODULUS 4, REMAINDER 3) SERVER shardD
    OPTIONS (table_name 'my_big_table_mod4_s3');
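
(User mappings are omitted above for brevity; each shard additionally needs one along these lines, where the credentials are placeholders:)

CREATE USER MAPPING FOR CURRENT_USER
    SERVER shardA
    OPTIONS (user 'postgres', password 'secret');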

On the backend servers I have set up a table with several indexes, its data CLUSTERed by (user_id, serial) across several partitions. I don't think those details are very relevant for my actual question, though.

The common query against my cluster follows this pattern:

SELECT * from my_big_table
WHERE
  user_id = 12345     -- only 1 user, always! --> single foreign server.
ORDER BY serial DESC  -- get 'newest' 90% of the time, 10% ASC
LIMIT 1000;           -- max limit 1000, sometimes less

For users with < 1000 records: all fine, no problem.

For users with > 100,000 records I see the issue leading to poor performance: EXPLAIN shows that the sort and the LIMIT happen on the FDW server and are not pushed down. Why?
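
For reference, the plan below was produced with a query of this shape (length(data) instead of data to keep the output readable; note the different user_id):

EXPLAIN (VERBOSE)
SELECT serial, some_object_id, timestamp_ns, object_type, other_type, length(data)
FROM my_big_table
WHERE user_id = 4560084
ORDER BY serial DESC
LIMIT 1000;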

 Limit  (cost=927393.08..927395.58 rows=1000 width=32)
   Output: my_big_table_mod4_s0.serial, my_big_table_mod4_s0.some_object_id, my_big_table_mod4_s0.timestamp_ns, my_big_table_mod4_s0.object_type, my_big_table_mod4_s0.other_type, (length(my_big_table_mod4_s0.data))
   ->  Sort  (cost=927393.08..931177.06 rows=1513592 width=32)
         Output: my_big_table_mod4_s0.serial, my_big_table_mod4_s0.some_object_id, my_big_table_mod4_s0.timestamp_ns, my_big_table_mod4_s0.object_type, my_big_table_mod4_s0.other_type, (length(my_big_table_mod4_s0.data))
         Sort Key: my_big_table_mod4_s0.serial DESC
         ->  Foreign Scan on public.my_big_table_mod4_s0  (cost=5318.35..844404.46 rows=1513592 width=32)
               Output: my_big_table_mod4_s0.serial, my_big_table_mod4_s0.some_object_id, my_big_table_mod4_s0.timestamp_ns, my_big_table_mod4_s0.object_type, my_big_table_mod4_s0.other_type, length(my_big_table_mod4_s0.data)
               Remote SQL: SELECT serial, some_object_id, timestamp_ns, object_type, other_type, data FROM public.my_big_table_mod4_s0 WHERE ((user_id = 4560084))
 JIT:
   Functions: 3
   Options: Inlining true, Optimization true, Expressions true, Deforming true

Takeaways from the above:

  • A single backend server is selected: OK! (solved with this)
  • Remote SQL: SELECT [...] shows no ORDER BY and no LIMIT. Problem.

Running the same query directly on the backend server shows this:

 Limit  (cost=1.74..821.42 rows=1000 width=32)
   Output: my_big_table_mod4_s0_part123.serial, my_big_table_mod4_s0_part123.some_object_id, my_big_table_mod4_s0_part123.timestamp_ns, my_big_table_mod4_s0_part123.object_type, my_big_table_mod4_s0_part123.other_type, (length(my_big_table_mod4_s0_part123.data))
   ->  Append  (cost=1.74..1240669.45 rows=1513592 width=32)
         ->  Index Scan Backward using my_big_table_mod4_s0_part123_pkey on public.my_big_table_mod4_s0_part123  (cost=0.43..290535.67 rows=355620 width=32)
               Output: my_big_table_mod4_s0_part123.serial, my_big_table_mod4_s0_part123.some_object_id, my_big_table_mod4_s0_part123.timestamp_ns, my_big_table_mod4_s0_part123.object_type, my_big_table_mod4_s0_part123.other_type, length(my_big_table_mod4_s0_part123.data)
               Index Cond: (my_big_table_mod4_s0_part123.user_id = 4560084)
         ->  Index Scan Backward using [... other partitions ...]

What I have tried:

  • As postgres_fdw is still under active development, I tried newer versions: both 11.4 and 12-beta2, for the FDW instance as well as the backend servers. No difference observed.
  • Running ANALYZE against the foreign table (on the FDW instance). This takes a huge amount of time; it looks like it does a full table scan of the remote table. No difference in query planning.
  • Changing the value of fetch_size on the remote SERVER object. No difference.
  • Setting use_remote_estimate=true on the remote SERVER object. No difference.
  • Setting fdw_tuple_cost=100 on the remote SERVER object. Sorting now happens on the remote server, but the LIMIT is still not pushed down. (The sketch after this list shows how these options were applied.)
  • Looking online for other people seeing this turned up only this related post: Missed LIMIT clause pushdown in FDW API

    But that thread mentions commits fixing this in 9.7 and whatnot, while I'm running 11.4 and 12-beta2. Or do I misunderstand it?

    And the post Strange cost estimation for foreign tables shows a nice example of tuning FDW, but unfortunately does not cover my issue with LIMIT pushdown.

  • Having a quick look at the PostgreSQL source code, I noticed this statement, perhaps relevant for FDW, maybe not (source):

    We can't push sub-select containing LIMIT/OFFSET to workers as
    there is no guarantee that the row order will be fully
    deterministic, and applying LIMIT/OFFSET will lead to
    inconsistent results at the top-level. (In some cases, where
    the result is ordered, we could relax this restriction. But it
    doesn't currently seem worth expending extra effort to do so.)

  • Having another look at the source code, I found this promising commit (d50d172e51):

    This provides the ability for postgres_fdw to handle SELECT commands
    so that it 1) skips the LockRows step (if any) (note that this is
    safe since it performs early locking) and 2) pushes down the LIMIT
    and/or OFFSET restrictions (if any) to the remote side. This doesn't
    handle the INSERT/UPDATE/DELETE cases.

    It adds a regression test for exactly my case!

    -- and both ORDER BY and LIMIT can be shipped
    EXPLAIN (VERBOSE, COSTS OFF)
      SELECT * FROM ft1 t1 WHERE t1.c1 === t1.c2 order by t1.c2 limit 1;
                                                                            QUERY PLAN                                                                         
    ------------------------------------------------------------------------------------------------------------------------------------------------------------
    Foreign Scan on public.ft1 t1
      Output: c1, c2, c3, c4, c5, c6, c7, c8
      Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" WHERE (("C 1" OPERATOR(public.===) c2)) ORDER BY c2 ASC NULLS LAST LIMIT 1::bigint
    

    which should be part of 12-beta2, which I'm running already…
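
For reference, the ANALYZE and server-option experiments from the list above were applied roughly like this (shardA shown; the fetch_size value is illustrative):

ANALYZE my_big_table_mod4_s0;                                 -- very slow; appears to scan the whole remote table
ALTER SERVER shardA OPTIONS (SET fetch_size '1000');          -- already set at CREATE SERVER time, hence SET
ALTER SERVER shardA OPTIONS (ADD use_remote_estimate 'true');
ALTER SERVER shardA OPTIONS (ADD fdw_tuple_cost '100');       -- moved the sort to the remote side, but not the LIMIT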

I noticed I should use indexes in DESC order here, but that's not so relevant right now.
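
(On each backend partition that would be something like the following; the index name is illustrative.)

CREATE INDEX my_big_table_user_serial_desc_idx
    ON my_big_table_mod4_s0_part123 (user_id, serial DESC);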

Best Answer

LIMIT pushdown indeed works as expected since version 12 with commit d50d172e51, but only for non-partitioned (non-sharded) tables.

When the query is run directly against the foreign table (my_big_table_mod4_s0), the LIMIT is pushed down correctly.
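
For illustration, a query of this shape produces Remote SQL that includes both the ORDER BY and the LIMIT:

SELECT * FROM my_big_table_mod4_s0
WHERE user_id = 12345
ORDER BY serial DESC
LIMIT 1000;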

I reported this as a bug; I don't see a technical reason why this should not work when partitioning (partition pruning) is involved.

Update: it turns out this is not really a bug, given the complexity of the planner and partition pruning in combination with FDW, but rather a feature request. The original author of the aforementioned commit indicates that work on this may be done for PostgreSQL 13. :-)

Lesson learned: FDW is not really an efficient query router for all types of queries (yet).

My work-around for now with partitioned (sharded) tables on FDW is a plpgsql function that determines the foreign table name from the declarative partitioning layout (modulus 4 over user_id in my case) and queries it directly. (I believe fully including that is out of scope here.)
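
A minimal sketch of the idea (the function name and signature are mine; adjust to taste). One subtlety: with PARTITION BY HASH the remainder is computed from the hash of user_id, not from mod(user_id, 4), so the sketch probes satisfies_hash_partition() to find the matching partition:

CREATE OR REPLACE FUNCTION newest_rows(p_user_id bigint, p_limit int DEFAULT 1000)
    RETURNS SETOF my_big_table
    LANGUAGE plpgsql STABLE AS
$$
DECLARE
    r int;  -- hash-partition remainder this user_id maps to
BEGIN
    -- Probe which of the 4 hash partitions user_id falls into.
    SELECT g INTO r
      FROM generate_series(0, 3) AS g
     WHERE satisfies_hash_partition('my_big_table'::regclass, 4, g, p_user_id);

    -- Query the matching foreign table directly: against a plain foreign
    -- table (no partitioning involved) PostgreSQL 12 ships ORDER BY + LIMIT.
    RETURN QUERY EXECUTE format(
        'SELECT * FROM %I WHERE user_id = $1 ORDER BY serial DESC LIMIT $2',
        'my_big_table_mod4_s' || r)
    USING p_user_id, p_limit;
END;
$$;

Called as SELECT * FROM newest_rows(12345); the occasional ASC case could be handled with an extra direction parameter.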