We are running a pg-xl cluster:
1 gtm
2 coords
2 datanodes (+2 slaves)
Everything works ok except for some medium queries (10 joins + group by). These queries always take 10.xx seconds. There is a 10-second timeout somewhere in pg-xl.
If I analyze the query I can see that it takes way less:
Planning time: 5.718 ms
Execution time: 57.537 ms
On the datanode logs, after 10 second, I can see: WARNING: SQueue p_1_3831_11, timeout while waiting for Consumers finishing
I tried to play with the shared_queues/shared_queue_size parameters but no luck.
Does anyone of you know where this timeout can come from?
Thanks in advance for your help.
EDIT:
I can reproduce with a simple LEFT JOIN (invoice table has 900k rows and quote table has 150k):
EXPLAIN
SELECT *
FROM quote
LEFT JOIN invoice ON invoice.doc_id = quote.doc_id AND invoice.account_id = 99171 AND invoice.state = 0
WHERE quote.quote_id = 5880 AND quote.account_id = 99171 AND quote.state = 0;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Remote Subquery Scan on all (data01,data02) (cost=0.28..16.33 rows=1 width=4910)
-> Nested Loop Left Join (cost=0.28..16.33 rows=1 width=4910)
Join Filter: (invoice.doc_id = quote.doc_id)
-> Remote Subquery Scan on all (data01,data02) (cost=100.14..109.31 rows=1 width=1142)
Distribute results by H: doc_id
-> Index Scan using idx_81855_account_id_3 on quote (cost=0.14..8.16 rows=1 width=1142)
Index Cond: ((account_id = 99171) AND (state = 0))
Filter: (quote_id = 5880)
-> Materialize (cost=100.14..111.93 rows=1 width=3768)
-> Remote Subquery Scan on all (data01,data02) (cost=100.14..111.93 rows=1 width=3768)
Distribute results by H: doc_id
-> Index Scan using idx_82126_account_id_6 on invoice (cost=0.14..8.16 rows=1 width=3768)
Index Cond: ((account_id = 99171) AND (state = 0))
(13 rows)
On data-01, the transactions are fine:
LOG: duration: 0.088 ms statement: SET global_session TO coord01_23261;SET parentPGXCPid TO 23261;
LOG: duration: 0.012 ms statement: BEGIN
LOG: duration: 0.020 ms statement: SET LOCAL coordinator_lxid TO "24";
LOG: duration: 0.738 ms parse p_1_5add_3a: Remote Subplan
LOG: duration: 0.063 ms statement: SET global_session TO coord01_23261;SET parentpgxcpid TO "23261";SET parentPGXCPid TO 13090;
LOG: duration: 0.011 ms statement: BEGIN
LOG: duration: 0.016 ms statement: SET LOCAL coordinator_lxid TO "45";
LOG: duration: 0.194 ms parse p_1_5add_38: Remote Subplan
LOG: duration: 0.069 ms statement: SET global_session TO coord01_23261;SET parentpgxcpid TO "23261";SET parentPGXCPid TO 10474;
LOG: duration: 1.667 ms bind p_1_5add_3a/p_1_5add_3a: Remote Subplan
LOG: duration: 0.011 ms statement: BEGIN
LOG: duration: 0.194 ms parse p_1_5add_39: Remote Subplan
LOG: duration: 0.062 ms bind p_1_5add_38/p_1_5add_38: Remote Subplan
LOG: duration: 0.053 ms execute p_1_5add_38/p_1_5add_38: Remote Subplan
LOG: duration: 0.018 ms statement: SET LOCAL coordinator_lxid TO "45";
LOG: duration: 0.520 ms execute p_1_5add_3a/p_1_5add_3a: Remote Subplan
LOG: duration: 0.176 ms parse p_1_5add_38: Remote Subplan
LOG: duration: 0.157 ms parse p_1_5add_39: Remote Subplan
LOG: duration: 0.029 ms bind p_1_5add_38/p_1_5add_38: Remote Subplan
LOG: duration: 0.011 ms execute p_1_5add_38/p_1_5add_38: Remote Subplan
LOG: duration: 0.045 ms bind p_1_5add_39/p_1_5add_39: Remote Subplan
LOG: duration: 0.462 ms execute p_1_5add_39/p_1_5add_39: Remote Subplan
LOG: duration: 0.023 ms statement: COMMIT TRANSACTION
LOG: duration: 0.036 ms statement: RESET ALL;RESET SESSION AUTHORIZATION;RESET transaction_isolation;
LOG: duration: 0.488 ms statement: COMMIT TRANSACTION
WARNING: SQueue p_1_5add_39, timeout while waiting for Consumers finishing
LOG: duration: 0.053 ms statement: COMMIT TRANSACTION
LOG: duration: 0.044 ms statement: RESET ALL;RESET SESSION AUTHORIZATION;RESET transaction_isolation;
LOG: duration: 0.046 ms statement: RESET ALL;RESET SESSION AUTHORIZATION;RESET transaction_isolation;
But on data-02, we can see the 10 second timeout
LOG: duration: 0.087 ms statement: SET global_session TO coord01_23261;SET parentPGXCPid TO 23261;
LOG: duration: 0.013 ms statement: BEGIN
LOG: duration: 0.020 ms statement: SET LOCAL coordinator_lxid TO "24";
LOG: duration: 0.060 ms statement: SET global_session TO coord01_23261;SET parentpgxcpid TO "23261";SET parentPGXCPid TO 13090;
LOG: duration: 0.773 ms parse p_1_5add_3a: Remote Subplan
LOG: duration: 0.010 ms statement: BEGIN
LOG: duration: 0.016 ms statement: SET LOCAL coordinator_lxid TO "45";
LOG: duration: 0.057 ms statement: SET global_session TO coord01_23261;SET parentpgxcpid TO "23261";SET parentPGXCPid TO 10474;
LOG: duration: 0.177 ms parse p_1_5add_38: Remote Subplan
LOG: duration: 0.173 ms parse p_1_5add_39: Remote Subplan
LOG: duration: 0.037 ms bind p_1_5add_38/p_1_5add_38: Remote Subplan
LOG: duration: 0.035 ms execute p_1_5add_38/p_1_5add_38: Remote Subplan
LOG: duration: 0.010 ms statement: BEGIN
LOG: duration: 0.016 ms statement: SET LOCAL coordinator_lxid TO "45";
LOG: duration: 0.194 ms parse p_1_5add_38: Remote Subplan
LOG: duration: 1.666 ms bind p_1_5add_3a/p_1_5add_3a: Remote Subplan
LOG: duration: 0.165 ms parse p_1_5add_39: Remote Subplan
LOG: duration: 0.026 ms bind p_1_5add_38/p_1_5add_38: Remote Subplan
LOG: duration: 0.003 ms execute p_1_5add_38/p_1_5add_38: Remote Subplan
LOG: duration: 0.038 ms bind p_1_5add_39/p_1_5add_39: Remote Subplan
LOG: duration: 0.454 ms execute p_1_5add_39/p_1_5add_39: Remote Subplan
LOG: duration: 1.506 ms execute p_1_5add_3a/p_1_5add_3a: Remote Subplan
LOG: duration: 0.018 ms statement: COMMIT TRANSACTION
LOG: duration: 0.032 ms statement: RESET ALL;RESET SESSION AUTHORIZATION;RESET transaction_isolation;
WARNING: SQueue p_1_5add_39, timeout while waiting for Consumers finishing
LOG: duration: 0.050 ms statement: COMMIT TRANSACTION
LOG: duration: 0.048 ms statement: RESET ALL;RESET SESSION AUTHORIZATION;RESET transaction_isolation;
LOG: duration: 10010.353 ms statement: COMMIT TRANSACTION
LOG: duration: 0.047 ms statement: RESET ALL;RESET SESSION AUTHORIZATION;RESET transaction_isolation;
Best Answer
You probably encountered this Postgres-XL bug: SQueue X, timeout while waiting for Consumers finishing
Cause
The linked report has all the details, the gist of it is that there is an implementation bug when you have queries that:
EXPLAIN
, if you see anyDistribute results by:
in the output, it means some data will be distributed between nodes during query execution)Symptoms
EXPLAIN ANALYZE
reports much different execution time: it shows the time without the bug!WARNING: SQueue p_1_2_3, timeout while waiting for Consumers finishing
in your postgres logsWorkarounds
It's a bug in Postgres-XL, so you can only work around it, here are some options:
EXPLAIN
)