Recover 3-node Galera cluster from crash



  • this morning we got a power failure and all (physical) servers were abruplty shut down. I got a problem on my MySQL Galera cluster (3 nodes). I was able to start 1 node. On other two, MySQL cannot start.

    This is my current working node:

    node1:# cat /var/lib/mysql/grastate.dat
    # GALERA saved state
    version: 2.1
    uuid:    638cf118-e0d5-11e3-a36d-560a2eb8e3ca
    seqno:   -1
    cert_index:
            node1:# 
    node1:# 
    node1:# cat /var/lib/mysql/gvwstate.dat
    my_uuid: 911fc8ad-2003-11e9-ba40-4eae4c9ae759
    #vwbeg
    view_id: 3 911fc8ad-2003-11e9-ba40-4eae4c9ae759 438
    bootstrap: 0
    member: 911fc8ad-2003-11e9-ba40-4eae4c9ae759 0
    #vwend
    node1:# 
    node1:# 
    

    This is node2:

    node2:# cat /var/lib/mysql/grastate.dat
    # GALERA saved state
    version: 2.1
    uuid:    00000000-0000-0000-0000-000000000000
    seqno:   -1
    cert_index:
            node2:# 
    node2:# 
    node2:# cat /var/lib/mysql/gvwstate.dat
    cat: /var/lib/mysql/gvwstate.dat: No such file or directory
    

    And this is node3:

    node3:# cat /var/lib/mysql/grastate.dat
    # GALERA saved state
    version: 2.1
    uuid:    638cf118-e0d5-11e3-a36d-560a2eb8e3ca
    seqno:   1937427725
    cert_index:
    node3:# 
    node3:# 
    node3:# cat /var/lib/mysql/gvwstate.dat
    cat: /var/lib/mysql/gvwstate.dat: No such file or directory
    node3:# 
    

    This is my error if I try to start node2, for example:

    220112 18:39:00 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
    220112 18:39:00 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.6D9DCm' --pid-file='/var/lib/mysql/node2-recover.pid'
    220112 18:39:00 [Note] /usr/sbin/mysqld (mysqld 5.5.47-MariaDB-1~wheezy-wsrep-log) starting as process 10414 ...
    220112 18:39:11 mysqld_safe WSREP: Recovered position 00000000-0000-0000-0000-000000000000:-1
    220112 18:39:11 [Note] WSREP: wsrep_start_position var submitted: '00000000-0000-0000-0000-000000000000:-1'
    220112 18:39:11 [Note] /usr/sbin/mysqld (mysqld 5.5.47-MariaDB-1~wheezy-wsrep-log) starting as process 10559 ...
    220112 18:39:11 [Note] WSREP: Read nil XID from storage engines, skipping position init
    220112 18:39:11 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/galera/libgalera_smm.so'
    220112 18:39:11 [Note] WSREP: wsrep_load(): Galera 3.9(rXXXX) by Codership Oy  loaded successfully.
    220112 18:39:11 [Note] WSREP: CRC-32C: using hardware acceleration.
    220112 18:39:11 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1
    220112 18:39:11 [Note] WSREP: Passing config to GCS: base_host = 10.10.2.42; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.npvo = false; pc.recover
    220112 18:39:11 [Note] WSREP: Service thread queue flushed.
    220112 18:39:11 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
    220112 18:39:11 [Note] WSREP: wsrep_sst_grab()
    220112 18:39:11 [Note] WSREP: Start replication
    220112 18:39:11 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
    220112 18:39:11 [Note] WSREP: protonet asio version 0
    220112 18:39:11 [Note] WSREP: Using CRC-32C for message checksums.
    220112 18:39:11 [Note] WSREP: backend: asio
    220112 18:39:11 [Warning] WSREP: access file(gvwstate.dat) failed(No such file or directory)
    220112 18:39:11 [Note] WSREP: restore pc from disk failed
    220112 18:39:11 [Note] WSREP: GMCast version 0
    220112 18:39:11 [Note] WSREP: (8d5827a6, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
    220112 18:39:11 [Note] WSREP: (8d5827a6, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
    220112 18:39:11 [Note] WSREP: EVS version 0
    220112 18:39:11 [Note] WSREP: gcomm: connecting to group 'mcsgalera1', peer '10.10.2.41:,10.10.2.42:,10.10.2.44:'
    220112 18:39:11 [Warning] WSREP: (8d5827a6, 'tcp://0.0.0.0:4567') address 'tcp://10.10.2.42:4567' points to own listening address, blacklisting
    220112 18:39:11 [Note] WSREP: (8d5827a6, 'tcp://0.0.0.0:4567') address 'tcp://10.10.2.42:4567' pointing to uuid 8d5827a6 is blacklisted, skipping
    220112 18:39:11 [Note] WSREP: (8d5827a6, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: 
    220112 18:39:11 [Note] WSREP: declaring 911fc8ad at tcp://10.10.2.41:4567 stable
    220112 18:39:11 [Note] WSREP: Node 911fc8ad state prim
    220112 18:39:11 [Note] WSREP: view(view_id(PRIM,8d5827a6,439) memb {
            8d5827a6,0
            911fc8ad,0
    } joined {
    } left {
    } partitioned {
    })
    220112 18:39:11 [Note] WSREP: save pc into disk
    220112 18:39:11 [Note] WSREP: discarding pending addr without UUID: tcp://10.10.2.44:4567
    220112 18:39:11 [Note] WSREP: gcomm: connected
    220112 18:39:11 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
    220112 18:39:11 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
    220112 18:39:11 [Note] WSREP: Opened channel 'mcsgalera1'
    220112 18:39:11 [Note] WSREP: Waiting for SST to complete.
    220112 18:39:11 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
    220112 18:39:11 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 8da4e645-73ce-11ec-8cd1-dff855ac7d35
    220112 18:39:11 [Note] WSREP: STATE EXCHANGE: sent state msg: 8da4e645-73ce-11ec-8cd1-dff855ac7d35
    220112 18:39:11 [Note] WSREP: STATE EXCHANGE: got state msg: 8da4e645-73ce-11ec-8cd1-dff855ac7d35 from 0 (node2)
    220112 18:39:11 [Note] WSREP: STATE EXCHANGE: got state msg: 8da4e645-73ce-11ec-8cd1-dff855ac7d35 from 1 (node1)
    220112 18:39:11 [Note] WSREP: Quorum results:
            version    = 3,
            component  = PRIMARY,
            conf_id    = 7,
            members    = 1/2 (joined/total),
            act_id     = 1937583591,
            last_appl. = -1,
            protocols  = 0/7/3 (gcs/repl/appl),
            group UUID = 638cf118-e0d5-11e3-a36d-560a2eb8e3ca
    220112 18:39:11 [Note] WSREP: Flow-control interval: [23, 23]
    220112 18:39:11 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 1937583591)
    220112 18:39:11 [Note] WSREP: State transfer required: 
            Group state: 638cf118-e0d5-11e3-a36d-560a2eb8e3ca:1937583591
            Local state: 00000000-0000-0000-0000-000000000000:-1
    220112 18:39:11 [Note] WSREP: New cluster view: global state: 638cf118-e0d5-11e3-a36d-560a2eb8e3ca:1937583591, view# 8: Primary, number of nodes: 2, my index: 0, protocol version 3
    220112 18:39:11 [Warning] WSREP: Gap in state sequence. Need state transfer.
    220112 18:39:11 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.10.2.42' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --parent '10559''
    WSREP_SST: [INFO] Streaming with xbstream (20220112 18:39:12.142)
    WSREP_SST: [INFO] Using socat as streamer (20220112 18:39:12.143)
    WSREP_SST: [INFO] Stale sst_in_progress file: /var/lib/mysql//sst_in_progress (20220112 18:39:12.148)
    WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20220112 18:39:12.168)
    220112 18:39:12 [Note] WSREP: Prepared SST request: xtrabackup-v2|10.10.2.42:4444/xtrabackup_sst//1
    220112 18:39:12 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
    220112 18:39:12 [Note] WSREP: REPL Protocols: 7 (3, 2)
    220112 18:39:12 [Note] WSREP: Service thread queue flushed.
    220112 18:39:12 [Note] WSREP: Assign initial position for certification: 1937583591, protocol version: 3
    220112 18:39:12 [Note] WSREP: Service thread queue flushed.
    220112 18:39:12 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (638cf118-e0d5-11e3-a36d-560a2eb8e3ca): 1 (Operation not permitted)
             at galera/src/replicator_str.cpp:prepare_for_IST():456. IST will be unavailable.
    220112 18:39:12 [Note] WSREP: Member 0.0 (node2) requested state transfer from '*any*'. Selected 1.0 (node1)(SYNCED) as donor.
    220112 18:39:12 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 1937583591)
    220112 18:39:12 [Note] WSREP: Requesting state transfer: success, donor: 1
    WSREP_SST: [INFO] WARNING: Stale temporary SST directory: /var/lib/mysql//.sst from previous state transfer (20220112 18:39:12.834)
    WSREP_SST: [INFO] Proceeding with SST (20220112 18:39:12.837)
    WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20220112 18:39:12.837)
    WSREP_SST: [INFO] Cleaning the existing datadir and innodb-data/log directories (20220112 18:39:12.838)
    removed `/var/lib/mysql/ib_logfile0'
    removed `/var/lib/mysql/ibdata1'
    removed `/var/lib/mysql/aria_log_control'
    removed `/var/lib/mysql/aria_log.00000001'
    removed `/var/lib/mysql/ib_logfile1'
    WSREP_SST: [INFO] Cleaning the binlog directory /var/log/mysql as well (20220112 18:39:13.022)
    removed `/var/log/mysql/mariadb-bin.000001'
    WSREP_SST: [INFO] Waiting for SST streaming to complete! (20220112 18:39:13.026)
    220112 18:39:14 [Note] WSREP: (8d5827a6, 'tcp://0.0.0.0:4567') turning message relay requesting off
    xbstream: Can't create/write to file '././backup-my.cnf' (Errcode: 17 - File exists)
    xbstream: failed to create file.
    2022/01/12 18:39:23 socat[10874] E write(1, 0x19293d0, 8192): Broken pipe
    WSREP_SST: [ERROR] Error while getting data from donor node:  exit codes: 1 1 (20220112 18:39:23.331)
    WSREP_SST: [ERROR] Cleanup after exit with status:32 (20220112 18:39:23.333)
    220112 18:39:23 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.10.2.42' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --parent '10559': 32 (Broken pipe)
    220112 18:39:23 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
    220112 18:39:23 [ERROR] WSREP: SST failed: 32 (Broken pipe)
    220112 18:39:23 [ERROR] Aborting
    

    220112 18:39:23 [Warning] WSREP: 1.0 (node1): State transfer to 0.0 (node2) failed: -22 (Invalid argument)
    220112 18:39:23 [ERROR] WSREP: gcs/src/gcs_group.cpp:int gcs_group_handle_join_msg(gcs_group_t*, const gcs_recv_msg_t*)():731: Will never receive state. Need to abort.
    220112 18:39:23 [Note] WSREP: gcomm: terminating thread
    220112 18:39:23 [Note] WSREP: gcomm: joining thread
    220112 18:39:23 [Note] WSREP: gcomm: closing backend
    220112 18:39:24 [Note] WSREP: view(view_id(NON_PRIM,8d5827a6,439) memb {
    8d5827a6,0
    } joined {
    } left {
    } partitioned {
    911fc8ad,0
    })
    220112 18:39:24 [Note] WSREP: view((empty))
    220112 18:39:24 [Note] WSREP: gcomm: closed
    220112 18:39:24 [Note] WSREP: /usr/sbin/mysqld: Terminated.
    220112 18:39:24 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended

    how can I recover my cluster? Thank you very much



  • Looks like this is the issue:

    xbstream: Can't create/write to file '././backup-my.cnf' (Errcode: 17 - File exists).
    

    Maybe check if you see that file anywhere in your datadir (/var/lib/mysql/ or a sub-directory like /var/lib/mysql/.sst/) and either either delete it or set the permissions so that your mysql user can overwrite it next time it tries to write there.




Suggested Topics

  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2
  • 2