Is wal_keep_segments required with streaming replication when archive and restore commands are set?



  • I have a 2-node PostgreSQL (12.9) cluster with streaming replication enabled. I have also set the archive and restore commands, which I verified are working as expected. No replication slots or wal_keep_segments are configured.

    archive_command ='test ! -f /archive_location/%f && cp %p /archive_location/%f'
    restore_command = 'cp /archive_location/%f %p'
    

    In this setup, I notice that if the standby disconnects after receiving a partial WAL file, and that same WAL segment is then archived and recycled from pg_wal on the primary, the standby never catches up with the primary after it reconnects.

    The error seen on standby is - ERROR:  requested WAL segment 0000000100000000000000XX has already been removed.

    Debugging revealed the following -

    1- The issue is only seen when a partial WAL file has been written on the standby at the time of disconnection. Since streaming replication is enabled, postgres doesn't wait for the 16MB WAL segment to fill up before shipping WAL to the standby, so there can be a partially written WAL segment on the standby at any point. Confirmed by a file compare of the same WAL segment on primary and standby -

    cmp 0000000100000000000000XX /tmp/0000000100000000000000XX
    0000000100000000000000XX /tmp/0000000100000000000000XX differ: byte 15722105, line 68077
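
    The comparison above can be scripted. This is a minimal sketch, assuming both copies of the segment have been copied to the same host; the function name first_diff_offset is hypothetical and not part of PostgreSQL.

```shell
# Hypothetical helper: print the 1-based byte offset at which two copies of a
# WAL segment first differ. Prints nothing if the files are identical.
first_diff_offset() {
    # cmp reports "<a> <b> differ: byte N, line M" (some implementations say
    # "char" instead of "byte"), so the pattern accepts either word.
    cmp "$1" "$2" 2>/dev/null |
        sed -n 's/.* differ: [a-z]* \([0-9][0-9]*\),.*/\1/p'
}
```

    For example, first_diff_offset 0000000100000000000000XX /tmp/0000000100000000000000XX would print the offset from which the standby's copy diverges.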
    

    On reconnection, postgres first tries restoring the WAL from the archive but, on failure, falls back to streaming the same WAL from the primary over the TCP connection. Since wal_keep_segments is not set, this WAL has already been recycled from pg_wal on the primary, and the standby is now stuck in this loop -

    628ed196.bd8bf 0DEBUG:  00000: switched WAL source from stream to archive after failure
    628ed196.bd8bf 0DEBUG:  00000: record with incorrect prev-link 1/FDFF2008 at 0/EDEFE678
    2022-05-26 06:32:19 IST   776383 628ed196.bd8bf 0LOCATION:  ReadRecord, xlog.c:4348
    2022-05-26 06:32:19 IST   776383 628ed196.bd8bf 0DEBUG:  00000: switched WAL source from archive to stream after failure
    2022-05-26 06:32:19 IST   776629 628ed19b.bd9b5 0FATAL:  XX000: could not receive data from WAL stream: ERROR:  requested WAL segment 0000000100000000000000XX has already been removed
    

    Given this, it looks like wal_keep_segments is needed even when the archive and restore commands are set for streaming replication. Is this true? Am I missing something?

    Edit - Adding necessary log info here for improved readability -

    628ed196.bd8bf 0DEBUG:  00000: switched WAL source from stream to archive after failure
    2022-05-26 06:32:19 IST   776383 628ed196.bd8bf 0LOCATION:  WaitForWALToBecomeAvailable, xlog.c:12208
    2022-05-26 06:32:19 IST   776383 628ed196.bd8bf 0DEBUG:  00000: record with incorrect prev-link 1/FDFF2008 at 0/EDEFE678
    2022-05-26 06:32:19 IST   776383 628ed196.bd8bf 0LOCATION:  ReadRecord, xlog.c:4348
    2022-05-26 06:32:19 IST   776383 628ed196.bd8bf 0DEBUG:  00000: switched WAL source from archive to stream after failure
    2022-05-26 06:32:19 IST   776381 628ed196.bd8bd 0LOCATION:  LogChildExit, postmaster.c:3697
    2022-05-26 06:32:19 IST   776629 628ed19b.bd9b5 0LOG:  00000: started streaming WAL from primary at 0/ED000000 on timeline 1
    2022-05-26 06:32:19 IST   776629 628ed19b.bd9b5 0LOCATION:  WalReceiverMain, walreceiver.c:372
    2022-05-26 06:32:19 IST   776629 628ed19b.bd9b5 0FATAL:  XX000: could not receive data from WAL stream: ERROR:  requested WAL segment 0000000100000000000000ED has already been removed
    


  • No, if you have restore_command defined properly on the standby server, you don't need wal_keep_segments (wal_keep_size in later versions) on the primary.

    If you define restore_command on the standby, it will restore the WAL segment from the archive. You say that the WAL segment gets archived and recycled on the primary, so the standby must be able to get it from the archive.

    The standby should try to fetch the WAL segment from archive, even if it already has a partially filled local WAL segment with the same name. I would consider anything else a bug.

    Are you sure that restore_command is set on the standby? Don't you get any messages about "restoring WAL segment from archive"?
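
    One way to check whether the standby actually invokes restore_command, and whether the requested segment is found, is to wrap the cp in a small logging script. The following is only a sketch: the function name restore_wal, the ARCHIVE_DIR and RESTORE_LOG variables, and the log path are assumptions made for illustration, and the script path in the comment is a placeholder.

```shell
#!/bin/sh
# Hypothetical logging wrapper for restore_command. It could be configured as:
#   restore_command = '/path/to/restore_wal.sh %f %p'
# ARCHIVE_DIR and RESTORE_LOG are illustrative names, not PostgreSQL settings.
restore_wal() {
    fname=$1    # %f: file name requested by the standby
    dest=$2     # %p: path the file should be copied to
    archive_dir=${ARCHIVE_DIR:-/archive_location}
    log=${RESTORE_LOG:-/tmp/restore_wal.log}

    if [ -f "$archive_dir/$fname" ]; then
        echo "$(date '+%Y-%m-%d %H:%M:%S') found $fname" >> "$log"
        cp "$archive_dir/$fname" "$dest"
    else
        echo "$(date '+%Y-%m-%d %H:%M:%S') missing $fname" >> "$log"
        return 1    # nonzero exit reports that the file is not in the archive
    fi
}
# In a standalone script, finish with:  restore_wal "$@"
```

    If the log shows no request for the segment in question, restore_command is probably not in effect on the standby. Some "missing" entries are normal, because the standby also asks for files that are not in the archive, such as timeline history files.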



