Postgresql – Londiste replication fails intermittently

postgresql, replication

I have skytools set up for PostgreSQL replication, and it keeps failing intermittently. When I check the londiste status I get the following error:

$ londiste /etc/skytools/mydb_data_0.ini status
Queue: mydb_data   Local node: mydb_data_0

mydb_data (root)
  |                   Tables: 18/0/0
  |                   Lag: 9s, Tick: 2620740
  +--: mydb_data_0 (leaf)
  |                   Tables: 18/0/0
  |                   Lag: 15m8s, Tick: 2620670
  |                   ERR: mydb_data_0: Lost position: batch 2620669..2620669, dst has 2620670
  +--: mydb_data_1 (leaf)
                      Tables: 18/0/0
                      Lag: 9s, Tick: 2620740

I really don't understand what's going wrong. I get the same error message in the Postgres logs as well:

Exception: Lost position: batch 2620669..2620669, dst has 2620670
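
If it helps to see where each side thinks the consumer is, the positions can be compared roughly like this (a sketch only: pgq.get_consumer_info() is a standard PgQ function, I am assuming Skytools 3 keeps the destination-side position in pgq_node.local_state, and the database names below are placeholders for the actual connections):

# on the provider (root) database: where PgQ thinks each consumer is
$ psql mydb_data -c "SELECT queue_name, consumer_name, last_tick, lag FROM pgq.get_consumer_info();"
# on the subscriber: the locally tracked position (assumption: pgq_node.local_state)
$ psql mydb_data_0 -c "SELECT queue_name, consumer_name, last_tick_id FROM pgq_node.local_state;"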

I found this article about the error I'm getting. It says you have to use the --reset option of the worker to reset the queue position on the remote side, and then issue wait-sync to get the table queue moving again.
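
If I understand it correctly, the sequence the article describes boils down to these two commands (using the same ini file as above; wait-sync is a standard londiste subcommand):

$ londiste /etc/skytools/mydb_data_0.ini worker --reset
$ londiste /etc/skytools/mydb_data_0.ini wait-sync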

So I did this,

$ londiste /etc/skytools/mydb_data_0.ini worker --reset
Ignoring stale pidfile
2016-12-23 17:00:34,278 15245 INFO Resetting queue tracking on dst side

It resets the queue successfully, but when I check the londiste status I get this error:

$ londiste /etc/skytools/mydb_data_0.ini status
Queue: mydb_data   Local node: mydb_data_0

mydb_data (root)
  |                   Tables: 18/0/0
  |                   Lag: 9s, Tick: 2620740
  +--: mydb_data_0 (leaf)
  |                   Tables: 18/0/0
  |                   Lag: 15m8s, Tick: 2620670
  |                   ERR: mydb_data_0: [ev_id=84594950,ev_txid=702851528] duplicate key value violates unique constraint "dmn_pkey"
  +--: mydb_data_1 (leaf)
                      Tables: 18/0/0
                      Lag: 9s, Tick: 2620740

I don't know what is causing this; can you please guide me?

PostgreSQL version: 9.5, Skytools version: 3.2

Update

I found these skytools logs for the master db:

2016-12-27 05:36:23,369 15563 ERROR Job mydb_data_0 crashed: Lost position: batch 2681655..2681655, dst has 2681656
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/skytools-3.0/skytools/scripting.py", line 578, in run_func_safely
    r = func()
  File "/usr/lib/python2.7/dist-packages/skytools-3.0/pgq/cascade/consumer.py", line 199, in work
    return BaseConsumer.work(self)
  File "/usr/lib/python2.7/dist-packages/skytools-3.0/pgq/baseconsumer.py", line 257, in work
    self._launch_process_batch(db, batch_id, ev_list)
  File "/usr/lib/python2.7/dist-packages/skytools-3.0/pgq/baseconsumer.py", line 286, in _launch_process_batch
    self.process_batch(db, batch_id, list)
  File "/usr/lib/python2.7/dist-packages/skytools-3.0/pgq/cascade/consumer.py", line 168, in process_batch
    if self.is_batch_done(state, self.batch_info, dst_db):
  File "/usr/lib/python2.7/dist-packages/skytools-3.0/pgq/cascade/worker.py", line 185, in is_batch_done
    done = CascadedConsumer.is_batch_done(self, state, batch_info, dst_db)
  File "/usr/lib/python2.7/dist-packages/skytools-3.0/pgq/cascade/consumer.py", line 254, in is_batch_done
    prev_tick, cur_tick, dst_tick))
Exception: Lost position: batch 2681655..2681655, dst has 2681656
2016-12-27 05:37:11,988 18190 INFO Resetting queue tracking on dst side
2016-12-27 05:37:47,776 18578 INFO pgq.maint_operations is installed
2016-12-27 05:37:48,038 18578 INFO {count: 0, duration: 0.295}
2016-12-27 05:37:48,115 18578 ERROR Job mydb_data_0 got error on connection 'db': duplicate key value violates unique constraint "dmn_pkey"
DETAIL:  Key (id)=(31780560) already exists..   Query: update only public.percent_bleed set percent_redir_sid = '43 ...
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/skytools-3.0/skytools/scripting.py", line 578, in run_func_safely
    r = func()
  File "/usr/lib/python2.7/dist-packages/skytools-3.0/pgq/cascade/consumer.py", line 199, in work
    return BaseConsumer.work(self)

Best Answer

Sorry for posting this as an answer, but I cannot comment.

You got a pretty straightforward message:

ERR: mydb_data_0: [ev_id=84594950,ev_txid=702851528] duplicate key value violates unique constraint "dmn_pkey"

It says you are trying to insert a value that is already present in the table. You will have to work on your enqueue/dequeue process to avoid duplicates.
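
As a starting point, you could check on the subscriber which table the dmn_pkey constraint belongs to and look at the row that already exists there. This is only a sketch: the id value is taken from the DETAIL line in your logs, the database name is a placeholder for your subscriber connection, and I am assuming the primary key column is called id.

$ psql mydb_data_0 -c "SELECT conrelid::regclass AS conflicting_table FROM pg_constraint WHERE conname = 'dmn_pkey';"
# then, substituting the table name returned above for <conflicting_table>:
$ psql mydb_data_0 -c "SELECT * FROM <conflicting_table> WHERE id = 31780560;"

If that row turns out to be stale on the subscriber, reconciling or removing it before restarting the worker is one way to let replication resume; otherwise the enqueue/dequeue process itself needs to stop producing the same key twice.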