Log Shipping is failing with a 'Could not cleanup History'



  • We're experiencing a weird problem with our SQL Server 2019 Log shipping (15.0.4198.2).

    We have 4 databases that are log shipped, with a monitoring instance. The log tasks all fire at the same time.

    However every so often the jobs are failing to finish. (As in sqllogsip.exe would never exit.) When it happens it seems to be the majority of the 4 fail.

    Adding -querytimeout 60 to the task lets the job complete with a timeout. (So I guess the default is infinite timeout?) However the error is a curious one:

    2022-03-11 15:31:06.62 *** Error: Could not cleanup history.(Microsoft.SqlServer.Management.LogShipping) ***

    2022-03-11 15:31:06.62 *** Error: Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.(.Net SqlClient Data Provider) ***

    2022-03-11 15:31:06.62 *** Error: The wait operation timed out() ***

    What could be causing this? The jobs work most of the time.

    Edit: This is the sort of distribution of failures we're seeing. (I have a 1min timeout. The backup itself always takes around a few seconds.)

    enter image description here



  • This step is executing the sp_cleanup_log_shipping_history system stored procedure.

    From https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-cleanup-log-shipping-history-transact-sql?view=sql-server-ver15 :

    sp_cleanup_log_shipping_history must be run from the master database on any log shipping server. This stored procedure cleans up local and remote copies of log_shipping_monitor_history_detail and log_shipping_monitor_error_detail based on history retention period.

    Note that even though the docs specify that the cleanup procedure must be run from the master database, the two tables being cleaned up live in the msdb database. The clustering key on both of these tables is agent_id, agent_type, which are the same two parameters on the cleanup procedure, and both have a non-clustered index on log_time_utc. Thus, we can expect that they are decently indexed for cleanup.

    The specific code path is that sys.sp_cleanup_log_shipping_history calls sys.sp_MSprocesslogshippingretentioncleanup (located in the resource database, and thus a little tougher to inspect), which eventually runs this code:

        -- cleanup history_detail
    
    delete from msdb.dbo.log_shipping_monitor_history_detail
    where agent_id = @agent_id
        and agent_type = @agent_type
        and log_time_utc < @cutoff_time_utc
    
    -- cleanup error_detail
    
    delete from msdb.dbo.log_shipping_monitor_error_detail
    where agent_id = @agent_id
        and agent_type = @agent_type
        and log_time_utc < @cutoff_time_utc
    

    Given the intermittent nature of the failures, I suspect that you are encountering some sort of blocking that prevents the cleanup from running normally. The most likely scenario is when you have multiple log shipping jobs running in parallel, and they are concurrently trying to write to the same table.

    To know the reason for sure, you would need to reproduce the issue, or catch it with monitoring, or catch it in the action, you can gather details on blocking or wait types being experienced.

    Lets assume it's blocking...

    The easiest option is to stagger your jobs so that they no longer run in parallel. By eliminating concurrency, you should eliminate blocking.

    Another option would be to add ideal covering indexes to speed your deletes. I'm not able to reproduce your issue, but I suspect something along the lines of this might help:

    CREATE NONCLUSTERED INDEX MyNewIndex_history_detail
        ON dbo.log_shipping_monitor_history_detail
         (agent_id,agent_type,log_time_utc);
    

    CREATE NONCLUSTERED INDEX MyNewIndex_error_detail
    ON dbo.log_shipping_monitor_history_detail
    (agent_id,agent_type,log_time_utc);




Suggested Topics

  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2