MHA failover GTID 专题

这里以masterha_master_switch为背景详解各种可能遇到的场景

假定环境(经典三节点)

host_1(host_1:3306) (current master) +--host_2(host_2:3306 slave[candidate master]) +--host_3(host_3:3306 etl)

一、Master : MySQL down

1.1 etl 延迟8小时

配置文件中加上no_check_delay=0 即可忽略报错

1.2 slave(候选master)比etl还要落后更多

1.2.1 当master的部分日志还没传递两个slave，这时候master 上的MySQL挂了

### 模拟现场，现场的3台DB gtid状态* master  host_2dba:lc> show master status;+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+| File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                        |+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+|  host_2.000002 |     2885 |              |                  | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446362 |+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+1 row in set (0.00 sec)* slave (candidate master)  host_1           Retrieved_Gtid_Set: ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446353            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446353                Auto_Position: 1* etl (other slave)  host_3           Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:4-16,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446353-446356            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446356                Auto_Position: 1### 切换日志masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host= host_2  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_errorThu Nov  9 10:43:49 2017 - [info] MHA::MasterFailover version 0.56.Thu Nov  9 10:43:49 2017 - [info] Starting master failover.Thu Nov  9 10:43:49 2017 - [info]Thu Nov  9 10:43:49 2017 - [info] * Phase 1: Configuration Check Phase..Thu Nov  9 10:43:49 2017 - [info]Thu Nov  9 10:43:50 2017 - [info] HealthCheck: SSH to  host_2 is reachable.Thu Nov  9 10:43:50 2017 - [info] Binlog server  host_2 is reachable.Thu Nov  9 10:43:50 2017 - [info] HealthCheck: SSH to  host_1 is reachable.Thu Nov  9 10:43:50 2017 - [info] Binlog server  host_1 is reachable.Thu Nov  9 10:43:50 2017 - [info] HealthCheck: SSH to  host_3 is reachable.Thu Nov  9 10:43:50 2017 - [info] Binlog server  host_3 is reachable.Thu Nov  9 10:43:51 2017 - [warning] SQL Thread is stopped(no error) on  host_1( host_1:3306)Thu Nov  9 10:43:51 2017 - [warning] SQL Thread is stopped(no error) on  host_3( host_3:3306)Thu Nov  9 10:43:51 2017 - [info] GTID failover mode = 1Thu Nov  9 10:43:51 2017 - [info] Dead Servers:Thu Nov  9 10:43:51 2017 - [info]    host_2( host_2:3306)Thu Nov  9 10:43:51 2017 - [info] Checking master reachability via MySQL(double check)...Thu Nov  9 10:43:51 2017 - [info]  ok.Thu Nov  9 10:43:51 2017 - [info] Alive Servers:Thu Nov  9 10:43:51 2017 - [info]    host_1( host_1:3306)Thu Nov  9 10:43:51 2017 - [info]    host_3( host_3:3306)Thu Nov  9 10:43:51 2017 - [info] Alive Slaves:Thu Nov  9 10:43:51 2017 - [info]    host_1( host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledThu Nov  9 10:43:51 2017 - [info]     GTID ONThu Nov  9 10:43:51 2017 - [info]     Replicating from  host_2( host_2:3306)Thu Nov  9 10:43:51 2017 - [info]     Primary candidate for the new Master (candidate_master is set)Thu Nov  9 10:43:51 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledThu Nov  9 10:43:51 2017 - [info]     GTID ONThu Nov  9 10:43:51 2017 - [info]     Replicating from  host_2( host_2:3306)Thu Nov  9 10:43:51 2017 - [info]     Not candidate for the new Master (no_master is set)Thu Nov  9 10:43:51 2017 - [info]  Starting SQL thread on  host_1( host_1:3306) ..Thu Nov  9 10:43:51 2017 - [info]   done.Thu Nov  9 10:43:51 2017 - [info]  Starting SQL thread on  host_3( host_3:3306) ..Thu Nov  9 10:43:51 2017 - [info]   done.Thu Nov  9 10:43:51 2017 - [info] Starting GTID based failover.Thu Nov  9 10:43:51 2017 - [info]Thu Nov  9 10:43:51 2017 - [info] ** Phase 1: Configuration Check Phase completed.Thu Nov  9 10:43:51 2017 - [info]Thu Nov  9 10:43:51 2017 - [info] * Phase 2: Dead Master Shutdown Phase..Thu Nov  9 10:43:51 2017 - [info]Thu Nov  9 10:43:51 2017 - [info] HealthCheck: SSH to  host_2 is reachable.Thu Nov  9 10:43:51 2017 - [info] Forcing shutdown so that applications never connect to the current master..Thu Nov  9 10:43:51 2017 - [info] Executing master IP deactivation script:Thu Nov  9 10:43:51 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host= host_2 --orig_master_ip= host_2 --orig_master_port=3306 --command=stopssh --ssh_user=rootThu Nov  9 10:43:53 2017 - [info]  done.Thu Nov  9 10:43:53 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.Thu Nov  9 10:43:53 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.Thu Nov  9 10:43:53 2017 - [info]Thu Nov  9 10:43:53 2017 - [info] * Phase 3: Master Recovery Phase..Thu Nov  9 10:43:53 2017 - [info]Thu Nov  9 10:43:53 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..Thu Nov  9 10:43:53 2017 - [info]Thu Nov  9 10:43:53 2017 - [info] The latest binary log file/position on all slaves is  host_2.000002:1115Thu Nov  9 10:43:53 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:4-16,Thu Nov  9 10:43:53 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):Thu Nov  9 10:43:53 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledThu Nov  9 10:43:53 2017 - [info]     GTID ONThu Nov  9 10:43:53 2017 - [info]     Replicating from  host_2( host_2:3306)Thu Nov  9 10:43:53 2017 - [info]     Not candidate for the new Master (no_master is set)Thu Nov  9 10:43:53 2017 - [info] The oldest binary log file/position on all slaves is  host_2.000002:230Thu Nov  9 10:43:53 2017 - [info] Retrieved Gtid Set: ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446353Thu Nov  9 10:43:53 2017 - [info] Oldest slaves:Thu Nov  9 10:43:53 2017 - [info]    host_1( host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledThu Nov  9 10:43:53 2017 - [info]     GTID ONThu Nov  9 10:43:53 2017 - [info]     Replicating from  host_2( host_2:3306)Thu Nov  9 10:43:53 2017 - [info]     Primary candidate for the new Master (candidate_master is set)Thu Nov  9 10:43:53 2017 - [info]Thu Nov  9 10:43:53 2017 - [info] * Phase 3.3: Determining New Master Phase..Thu Nov  9 10:43:53 2017 - [info]Thu Nov  9 10:43:53 2017 - [info] Searching new master from slaves..Thu Nov  9 10:43:53 2017 - [info]  Candidate masters from the configuration file:Thu Nov  9 10:43:53 2017 - [info]    host_1( host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledThu Nov  9 10:43:53 2017 - [info]     GTID ONThu Nov  9 10:43:53 2017 - [info]     Replicating from  host_2( host_2:3306)Thu Nov  9 10:43:53 2017 - [info]     Primary candidate for the new Master (candidate_master is set)Thu Nov  9 10:43:53 2017 - [info]  Non-candidate masters:Thu Nov  9 10:43:53 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledThu Nov  9 10:43:53 2017 - [info]     GTID ONThu Nov  9 10:43:53 2017 - [info]     Replicating from  host_2( host_2:3306)Thu Nov  9 10:43:53 2017 - [info]     Not candidate for the new Master (no_master is set)Thu Nov  9 10:43:53 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..Thu Nov  9 10:43:53 2017 - [info]   Not found.Thu Nov  9 10:43:53 2017 - [info]  Searching from all candidate_master slaves..Thu Nov  9 10:43:53 2017 - [info] New master is  host_1( host_1:3306)Thu Nov  9 10:43:53 2017 - [info] Starting master failover..Thu Nov  9 10:43:53 2017 - [info]Thu Nov  9 10:43:53 2017 - [info]Thu Nov  9 10:43:53 2017 - [info] * Phase 3.3: New Master Recovery Phase..Thu Nov  9 10:43:53 2017 - [info]Thu Nov  9 10:43:53 2017 - [info]  Waiting all logs to be applied..Thu Nov  9 10:43:53 2017 - [info]   done.Thu Nov  9 10:43:53 2017 - [info]  Replicating from the latest slave  host_3( host_3:3306) and waiting to apply..Thu Nov  9 10:43:53 2017 - [info]  Waiting all logs to be applied on the latest slave..Thu Nov  9 10:43:53 2017 - [info]  Resetting slave  host_1( host_1:3306) and starting replication from the new master  host_3( host_3:3306)..Thu Nov  9 10:43:53 2017 - [info]  Executed CHANGE MASTER.Thu Nov  9 10:43:54 2017 - [info]  Slave started.Thu Nov  9 10:43:54 2017 - [info]  Waiting to execute all relay logs on  host_1( host_1:3306)..Thu Nov  9 10:43:54 2017 - [info]  master_pos_wait( host_3.000049:18041) completed on  host_1( host_1:3306). Executed 0 events.Thu Nov  9 10:43:54 2017 - [info]   done.Thu Nov  9 10:43:54 2017 - [info]   done.Thu Nov  9 10:43:54 2017 - [info] -- Saving binlog from host  host_2 started, pid: 150294Thu Nov  9 10:43:54 2017 - [info] -- Saving binlog from host  host_1 started, pid: 150295Thu Nov  9 10:43:54 2017 - [info] -- Saving binlog from host  host_3 started, pid: 150297Thu Nov  9 10:43:54 2017 - [info]Thu Nov  9 10:43:54 2017 - [info] Log messages from  host_1 ...Thu Nov  9 10:43:54 2017 - [info]Thu Nov  9 10:43:54 2017 - [info] Fetching binary logs from binlog server  host_1..Thu Nov  9 10:43:54 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000002  --start_pos=1115 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171109104349.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.binThu Nov  9 10:43:54 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?Thu Nov  9 10:43:54 2017 - [info] End of log messages from  host_1.Thu Nov  9 10:43:54 2017 - [warning] Got error from  host_1.Thu Nov  9 10:43:54 2017 - [info]Thu Nov  9 10:43:54 2017 - [info] Log messages from  host_3 ...Thu Nov  9 10:43:54 2017 - [info]Thu Nov  9 10:43:54 2017 - [info] Fetching binary logs from binlog server  host_3..Thu Nov  9 10:43:54 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000002  --start_pos=1115 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171109104349.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.binThu Nov  9 10:43:54 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?Thu Nov  9 10:43:54 2017 - [info] End of log messages from  host_3.Thu Nov  9 10:43:54 2017 - [warning] Got error from  host_3.Thu Nov  9 10:43:55 2017 - [info]Thu Nov  9 10:43:55 2017 - [info] Log messages from  host_2 ...Thu Nov  9 10:43:55 2017 - [info]Thu Nov  9 10:43:54 2017 - [info] Fetching binary logs from binlog server  host_2..Thu Nov  9 10:43:54 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000002  --start_pos=1115 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171109104349.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.binThu Nov  9 10:43:55 2017 - [info] scp from root@ host_2:/var/log/masterha/mha_test/saved_binlog_binlog1_20171109104349.binlog to local:/var/log/masterha/mha_test/saved_binlog_ host_2_binlog1_20171109104349.binlog succeeded.Thu Nov  9 10:43:55 2017 - [info] End of log messages from  host_2.Thu Nov  9 10:43:55 2017 - [info] Saved mysqlbinlog size from  host_2 is 6047 bytes.Thu Nov  9 10:43:55 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_ host_2_binlog1_20171109104349.binlog ..Thu Nov  9 10:43:55 2017 - [info] Differential log apply from binlog server succeeded.Thu Nov  9 10:43:55 2017 - [info] Getting new master's binlog name and position..Thu Nov  9 10:43:55 2017 - [info]   host_1.000053:3624Thu Nov  9 10:43:55 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=' host_1', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';Thu Nov  9 10:43:55 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set:  host_1.000053, 3624, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,Thu Nov  9 10:43:55 2017 - [info] Executing master IP activate script:Thu Nov  9 10:43:55 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host= host_2 --orig_master_ip= host_2 --orig_master_port=3306 --new_master_host= host_1 --new_master_ip= host_1 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'Thu Nov  9 10:43:57 2017 - [info]  OK.Thu Nov  9 10:43:57 2017 - [info] Setting read_only=0 on  host_1( host_1:3306)..Thu Nov  9 10:43:57 2017 - [info]  ok.Thu Nov  9 10:43:57 2017 - [info] ** Finished master recovery successfully.Thu Nov  9 10:43:57 2017 - [info] * Phase 3: Master Recovery Phase completed.Thu Nov  9 10:43:57 2017 - [info]Thu Nov  9 10:43:57 2017 - [info] * Phase 4: Slaves Recovery Phase..Thu Nov  9 10:43:57 2017 - [info]Thu Nov  9 10:43:57 2017 - [info]Thu Nov  9 10:43:57 2017 - [info] * Phase 4.1: Starting Slaves in parallel..Thu Nov  9 10:43:57 2017 - [info]Thu Nov  9 10:43:57 2017 - [info] -- Slave recovery on host  host_3( host_3:3306) started, pid: 155162. Check tmp log /var/log/masterha/mha_test/ host_3_3306_20171109104349.log if it takes time..Thu Nov  9 10:43:58 2017 - [info]Thu Nov  9 10:43:58 2017 - [info] Log messages from  host_3 ...Thu Nov  9 10:43:58 2017 - [info]Thu Nov  9 10:43:57 2017 - [info]  Resetting slave  host_3( host_3:3306) and starting replication from the new master  host_1( host_1:3306)..Thu Nov  9 10:43:57 2017 - [info]  Executed CHANGE MASTER.Thu Nov  9 10:43:58 2017 - [info]  Slave started.Thu Nov  9 10:43:58 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,Thu Nov  9 10:43:58 2017 - [info] End of log messages from  host_3.Thu Nov  9 10:43:58 2017 - [info] -- Slave on host  host_3( host_3:3306) started.Thu Nov  9 10:43:58 2017 - [info] All new slave servers recovered successfully.Thu Nov  9 10:43:58 2017 - [info]Thu Nov  9 10:43:58 2017 - [info] * Phase 5: New master cleanup phase..Thu Nov  9 10:43:58 2017 - [info]Thu Nov  9 10:43:58 2017 - [info] Resetting slave info on the new master..Thu Nov  9 10:43:58 2017 - [info]   host_1: Resetting slave info succeeded.Thu Nov  9 10:43:58 2017 - [info] Master failover to  host_1( host_1:3306) completed successfully.Thu Nov  9 10:43:58 2017 - [info]Thu Nov  9 10:43:58 2017 - [info] Sending mail..

1.2.2 当master的所有日志已经传递到1个etl，这时候master 上的MySQL挂了

### 模拟现场，现场的3台DB gtid状态* master  host_1dba:lc> show master status;+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+| File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                        |+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+|  host_1.000053 |     5229 |              |                  | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-21,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446362 |+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+1 row in set (0.00 sec)* slave (candidate master)  host_2           Retrieved_Gtid_Set:            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446362                Auto_Position: 1* etl (other slave)  host_3           Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:17-21,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446357-446362            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-21,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446362                Auto_Position: 1### 切换日志masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host= host_1  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_errorThu Nov  9 10:59:14 2017 - [info] MHA::MasterFailover version 0.56.Thu Nov  9 10:59:14 2017 - [info] Starting master failover.Thu Nov  9 10:59:14 2017 - [info]Thu Nov  9 10:59:14 2017 - [info] * Phase 1: Configuration Check Phase..Thu Nov  9 10:59:14 2017 - [info]Thu Nov  9 10:59:15 2017 - [info] HealthCheck: SSH to  host_2 is reachable.Thu Nov  9 10:59:15 2017 - [info] Binlog server  host_2 is reachable.Thu Nov  9 10:59:15 2017 - [info] HealthCheck: SSH to  host_1 is reachable.Thu Nov  9 10:59:15 2017 - [info] Binlog server  host_1 is reachable.Thu Nov  9 10:59:15 2017 - [info] HealthCheck: SSH to  host_3 is reachable.Thu Nov  9 10:59:16 2017 - [info] Binlog server  host_3 is reachable.Thu Nov  9 10:59:16 2017 - [warning] SQL Thread is stopped(no error) on  host_2( host_2:3306)Thu Nov  9 10:59:16 2017 - [info] GTID failover mode = 1Thu Nov  9 10:59:16 2017 - [info] Dead Servers:Thu Nov  9 10:59:16 2017 - [info]    host_1( host_1:3306)Thu Nov  9 10:59:16 2017 - [info] Checking master reachability via MySQL(double check)...Thu Nov  9 10:59:16 2017 - [info]  ok.Thu Nov  9 10:59:16 2017 - [info] Alive Servers:Thu Nov  9 10:59:16 2017 - [info]    host_2( host_2:3306)Thu Nov  9 10:59:16 2017 - [info]    host_3( host_3:3306)Thu Nov  9 10:59:16 2017 - [info] Alive Slaves:Thu Nov  9 10:59:16 2017 - [info]    host_2( host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledThu Nov  9 10:59:16 2017 - [info]     GTID ONThu Nov  9 10:59:16 2017 - [info]     Replicating from  host_1( host_1:3306)Thu Nov  9 10:59:16 2017 - [info]     Primary candidate for the new Master (candidate_master is set)Thu Nov  9 10:59:16 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledThu Nov  9 10:59:16 2017 - [info]     GTID ONThu Nov  9 10:59:16 2017 - [info]     Replicating from  host_1( host_1:3306)Thu Nov  9 10:59:16 2017 - [info]     Not candidate for the new Master (no_master is set)Thu Nov  9 10:59:16 2017 - [info]  Starting SQL thread on  host_2( host_2:3306) ..Thu Nov  9 10:59:16 2017 - [info]   done.Thu Nov  9 10:59:16 2017 - [info] Starting GTID based failover.Thu Nov  9 10:59:16 2017 - [info]Thu Nov  9 10:59:16 2017 - [info] ** Phase 1: Configuration Check Phase completed.Thu Nov  9 10:59:16 2017 - [info]Thu Nov  9 10:59:16 2017 - [info] * Phase 2: Dead Master Shutdown Phase..Thu Nov  9 10:59:16 2017 - [info]Thu Nov  9 10:59:16 2017 - [info] HealthCheck: SSH to  host_1 is reachable.Thu Nov  9 10:59:16 2017 - [info] Forcing shutdown so that applications never connect to the current master..Thu Nov  9 10:59:16 2017 - [info] Executing master IP deactivation script:Thu Nov  9 10:59:16 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host= host_1 --orig_master_ip= host_1 --orig_master_port=3306 --command=stopssh --ssh_user=rootThu Nov  9 10:59:20 2017 - [info]  done.Thu Nov  9 10:59:20 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.Thu Nov  9 10:59:20 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.Thu Nov  9 10:59:20 2017 - [info]Thu Nov  9 10:59:20 2017 - [info] * Phase 3: Master Recovery Phase..Thu Nov  9 10:59:20 2017 - [info]Thu Nov  9 10:59:20 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..Thu Nov  9 10:59:20 2017 - [info]Thu Nov  9 10:59:20 2017 - [info] The latest binary log file/position on all slaves is  host_1.000053:5229Thu Nov  9 10:59:20 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:17-21,Thu Nov  9 10:59:20 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):Thu Nov  9 10:59:20 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledThu Nov  9 10:59:20 2017 - [info]     GTID ONThu Nov  9 10:59:20 2017 - [info]     Replicating from  host_1( host_1:3306)Thu Nov  9 10:59:20 2017 - [info]     Not candidate for the new Master (no_master is set)Thu Nov  9 10:59:20 2017 - [info] The oldest binary log file/position on all slaves is  host_1.000053:3624Thu Nov  9 10:59:20 2017 - [info] Oldest slaves:Thu Nov  9 10:59:20 2017 - [info]    host_2( host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledThu Nov  9 10:59:20 2017 - [info]     GTID ONThu Nov  9 10:59:20 2017 - [info]     Replicating from  host_1( host_1:3306)Thu Nov  9 10:59:20 2017 - [info]     Primary candidate for the new Master (candidate_master is set)Thu Nov  9 10:59:20 2017 - [info]Thu Nov  9 10:59:20 2017 - [info] * Phase 3.3: Determining New Master Phase..Thu Nov  9 10:59:20 2017 - [info]Thu Nov  9 10:59:20 2017 - [info] Searching new master from slaves..Thu Nov  9 10:59:20 2017 - [info]  Candidate masters from the configuration file:Thu Nov  9 10:59:20 2017 - [info]    host_2( host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledThu Nov  9 10:59:20 2017 - [info]     GTID ONThu Nov  9 10:59:20 2017 - [info]     Replicating from  host_1( host_1:3306)Thu Nov  9 10:59:20 2017 - [info]     Primary candidate for the new Master (candidate_master is set)Thu Nov  9 10:59:20 2017 - [info]  Non-candidate masters:Thu Nov  9 10:59:20 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledThu Nov  9 10:59:20 2017 - [info]     GTID ONThu Nov  9 10:59:20 2017 - [info]     Replicating from  host_1( host_1:3306)Thu Nov  9 10:59:20 2017 - [info]     Not candidate for the new Master (no_master is set)Thu Nov  9 10:59:20 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..Thu Nov  9 10:59:20 2017 - [info]   Not found.Thu Nov  9 10:59:20 2017 - [info]  Searching from all candidate_master slaves..Thu Nov  9 10:59:20 2017 - [info] New master is  host_2( host_2:3306)Thu Nov  9 10:59:20 2017 - [info] Starting master failover..Thu Nov  9 10:59:20 2017 - [info]Thu Nov  9 10:59:20 2017 - [info]Thu Nov  9 10:59:20 2017 - [info] * Phase 3.3: New Master Recovery Phase..Thu Nov  9 10:59:20 2017 - [info]Thu Nov  9 10:59:20 2017 - [info]  Waiting all logs to be applied..Thu Nov  9 10:59:20 2017 - [info]   done.Thu Nov  9 10:59:20 2017 - [info]  Replicating from the latest slave  host_3( host_3:3306) and waiting to apply..Thu Nov  9 10:59:20 2017 - [info]  Waiting all logs to be applied on the latest slave..Thu Nov  9 10:59:20 2017 - [info]  Resetting slave  host_2( host_2:3306) and starting replication from the new master  host_3( host_3:3306)..Thu Nov  9 10:59:20 2017 - [info]  Executed CHANGE MASTER.Thu Nov  9 10:59:21 2017 - [info]  Slave started.Thu Nov  9 10:59:21 2017 - [info]  Waiting to execute all relay logs on  host_2( host_2:3306)..Thu Nov  9 10:59:21 2017 - [info]  master_pos_wait( host_3.000049:22035) completed on  host_2( host_2:3306). Executed 0 events.Thu Nov  9 10:59:21 2017 - [info]   done.Thu Nov  9 10:59:21 2017 - [info]   done.Thu Nov  9 10:59:21 2017 - [info] -- Saving binlog from host  host_2 started, pid: 184482Thu Nov  9 10:59:21 2017 - [info] -- Saving binlog from host  host_1 started, pid: 184483Thu Nov  9 10:59:21 2017 - [info] -- Saving binlog from host  host_3 started, pid: 184487Thu Nov  9 10:59:21 2017 - [info]Thu Nov  9 10:59:21 2017 - [info] Log messages from  host_2 ...Thu Nov  9 10:59:21 2017 - [info]Thu Nov  9 10:59:21 2017 - [info] Fetching binary logs from binlog server  host_2..Thu Nov  9 10:59:21 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_1.000053  --start_pos=5229 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171109105914.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.binThu Nov  9 10:59:21 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?Thu Nov  9 10:59:21 2017 - [info] End of log messages from  host_2.Thu Nov  9 10:59:21 2017 - [warning] Got error from  host_2.Thu Nov  9 10:59:21 2017 - [info]Thu Nov  9 10:59:21 2017 - [info] Log messages from  host_3 ...Thu Nov  9 10:59:21 2017 - [info]Thu Nov  9 10:59:21 2017 - [info] Fetching binary logs from binlog server  host_3..Thu Nov  9 10:59:21 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_1.000053  --start_pos=5229 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171109105914.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.binThu Nov  9 10:59:21 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?Thu Nov  9 10:59:21 2017 - [info] End of log messages from  host_3.Thu Nov  9 10:59:21 2017 - [warning] Got error from  host_3.Thu Nov  9 10:59:22 2017 - [info]Thu Nov  9 10:59:22 2017 - [info] Log messages from  host_1 ...Thu Nov  9 10:59:22 2017 - [info]Thu Nov  9 10:59:21 2017 - [info] Fetching binary logs from binlog server  host_1..Thu Nov  9 10:59:21 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_1.000053  --start_pos=5229 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171109105914.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.binThu Nov  9 10:59:22 2017 - [info] scp from root@ host_1:/var/log/masterha/mha_test/saved_binlog_binlog2_20171109105914.binlog to local:/var/log/masterha/mha_test/saved_binlog_ host_1_binlog2_20171109105914.binlog succeeded.Thu Nov  9 10:59:22 2017 - [info] End of log messages from  host_1.Thu Nov  9 10:59:22 2017 - [info] Saved mysqlbinlog size from  host_1 is 800 bytes.Thu Nov  9 10:59:22 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_ host_1_binlog2_20171109105914.binlog ..Thu Nov  9 10:59:22 2017 - [info] Differential log apply from binlog server succeeded.Thu Nov  9 10:59:22 2017 - [info] Getting new master's binlog name and position..Thu Nov  9 10:59:22 2017 - [info]   host_2.000003:1680Thu Nov  9 10:59:22 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=' host_2', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';Thu Nov  9 10:59:22 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set:  host_2.000003, 1680, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-21,Thu Nov  9 10:59:22 2017 - [info] Executing master IP activate script:Thu Nov  9 10:59:22 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host= host_1 --orig_master_ip= host_1 --orig_master_port=3306 --new_master_host= host_2 --new_master_ip= host_2 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'Thu Nov  9 10:59:24 2017 - [info]  OK.Thu Nov  9 10:59:24 2017 - [info] Setting read_only=0 on  host_2( host_2:3306)..Thu Nov  9 10:59:24 2017 - [info]  ok.Thu Nov  9 10:59:24 2017 - [info] ** Finished master recovery successfully.Thu Nov  9 10:59:24 2017 - [info] * Phase 3: Master Recovery Phase completed.Thu Nov  9 10:59:24 2017 - [info]Thu Nov  9 10:59:24 2017 - [info] * Phase 4: Slaves Recovery Phase..Thu Nov  9 10:59:24 2017 - [info]Thu Nov  9 10:59:24 2017 - [info]Thu Nov  9 10:59:24 2017 - [info] * Phase 4.1: Starting Slaves in parallel..Thu Nov  9 10:59:24 2017 - [info]Thu Nov  9 10:59:24 2017 - [info] -- Slave recovery on host  host_3( host_3:3306) started, pid: 189393. Check tmp log /var/log/masterha/mha_test/ host_3_3306_20171109105914.log if it takes time..Thu Nov  9 10:59:25 2017 - [info]Thu Nov  9 10:59:25 2017 - [info] Log messages from  host_3 ...Thu Nov  9 10:59:25 2017 - [info]Thu Nov  9 10:59:24 2017 - [info]  Resetting slave  host_3( host_3:3306) and starting replication from the new master  host_2( host_2:3306)..Thu Nov  9 10:59:24 2017 - [info]  Executed CHANGE MASTER.Thu Nov  9 10:59:25 2017 - [info]  Slave started.Thu Nov  9 10:59:25 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-21,Thu Nov  9 10:59:25 2017 - [info] End of log messages from  host_3.Thu Nov  9 10:59:25 2017 - [info] -- Slave on host  host_3( host_3:3306) started.Thu Nov  9 10:59:25 2017 - [info] All new slave servers recovered successfully.Thu Nov  9 10:59:25 2017 - [info]Thu Nov  9 10:59:25 2017 - [info] * Phase 5: New master cleanup phase..Thu Nov  9 10:59:25 2017 - [info]Thu Nov  9 10:59:25 2017 - [info] Resetting slave info on the new master..Thu Nov  9 10:59:25 2017 - [info]   host_2: Resetting slave info succeeded.Thu Nov  9 10:59:25 2017 - [info] Master failover to  host_2( host_2:3306) completed successfully.Thu Nov  9 10:59:25 2017 - [info]Thu Nov  9 10:59:25 2017 - [info] Sending mail..

1.3 slave(候选master)的日志是最新的，比etl要多

1.3.1 当master的部分日志还没传递两个slave，这时候master 上的MySQL挂了

masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host= host_1  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_errorTue Nov  7 17:11:29 2017 - [info] MHA::MasterFailover version 0.56.Tue Nov  7 17:11:29 2017 - [info] Starting master failover.Tue Nov  7 17:11:29 2017 - [info]Tue Nov  7 17:11:29 2017 - [info] * Phase 1: Configuration Check Phase..Tue Nov  7 17:11:29 2017 - [info]Tue Nov  7 17:11:29 2017 - [info] HealthCheck: SSH to  host_2 is reachable.Tue Nov  7 17:11:29 2017 - [info] Binlog server  host_2 is reachable.Tue Nov  7 17:11:29 2017 - [info] HealthCheck: SSH to  host_1 is reachable.Tue Nov  7 17:11:30 2017 - [info] Binlog server  host_1 is reachable.Tue Nov  7 17:11:30 2017 - [info] HealthCheck: SSH to  host_3 is reachable.Tue Nov  7 17:11:30 2017 - [info] Binlog server  host_3 is reachable.Tue Nov  7 17:11:30 2017 - [warning] SQL Thread is stopped(no error) on  host_2( host_2:3306)Tue Nov  7 17:11:30 2017 - [warning] SQL Thread is stopped(no error) on  host_3( host_3:3306)Tue Nov  7 17:11:30 2017 - [info] GTID failover mode = 1Tue Nov  7 17:11:30 2017 - [info] Dead Servers:Tue Nov  7 17:11:30 2017 - [info]    host_1( host_1:3306)Tue Nov  7 17:11:30 2017 - [info] Checking master reachability via MySQL(double check)...Tue Nov  7 17:11:30 2017 - [info]  ok.Tue Nov  7 17:11:30 2017 - [info] Alive Servers:Tue Nov  7 17:11:30 2017 - [info]    host_2( host_2:3306)Tue Nov  7 17:11:30 2017 - [info]    host_3( host_3:3306)Tue Nov  7 17:11:30 2017 - [info] Alive Slaves:Tue Nov  7 17:11:30 2017 - [info]    host_2( host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledTue Nov  7 17:11:30 2017 - [info]     GTID ONTue Nov  7 17:11:30 2017 - [info]     Replicating from  host_1( host_1:3306)Tue Nov  7 17:11:30 2017 - [info]     Primary candidate for the new Master (candidate_master is set)Tue Nov  7 17:11:30 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledTue Nov  7 17:11:30 2017 - [info]     GTID ONTue Nov  7 17:11:30 2017 - [info]     Replicating from  host_1( host_1:3306)Tue Nov  7 17:11:30 2017 - [info]     Not candidate for the new Master (no_master is set)Tue Nov  7 17:11:30 2017 - [info]  Starting SQL thread on  host_2( host_2:3306) ..Tue Nov  7 17:11:30 2017 - [info]   done.Tue Nov  7 17:11:30 2017 - [info]  Starting SQL thread on  host_3( host_3:3306) ..Tue Nov  7 17:11:30 2017 - [info]   done.Tue Nov  7 17:11:30 2017 - [info] Starting GTID based failover.Tue Nov  7 17:11:30 2017 - [info]Tue Nov  7 17:11:30 2017 - [info] ** Phase 1: Configuration Check Phase completed.Tue Nov  7 17:11:30 2017 - [info]Tue Nov  7 17:11:30 2017 - [info] * Phase 2: Dead Master Shutdown Phase..Tue Nov  7 17:11:30 2017 - [info]Tue Nov  7 17:11:30 2017 - [info] HealthCheck: SSH to  host_1 is reachable.Tue Nov  7 17:11:31 2017 - [info] Forcing shutdown so that applications never connect to the current master..Tue Nov  7 17:11:31 2017 - [info] Executing master IP deactivation script:Tue Nov  7 17:11:31 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host= host_1 --orig_master_ip= host_1 --orig_master_port=3306 --command=stopssh --ssh_user=rootTue Nov  7 17:11:33 2017 - [info]  done.Tue Nov  7 17:11:33 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.Tue Nov  7 17:11:33 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.Tue Nov  7 17:11:33 2017 - [info]Tue Nov  7 17:11:33 2017 - [info] * Phase 3: Master Recovery Phase..Tue Nov  7 17:11:33 2017 - [info]Tue Nov  7 17:11:33 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..Tue Nov  7 17:11:33 2017 - [info]Tue Nov  7 17:11:33 2017 - [info] The latest binary log file/position on all slaves is  host_1.000051:13508Tue Nov  7 17:11:33 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:3-8Tue Nov  7 17:11:33 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):Tue Nov  7 17:11:33 2017 - [info]    host_2( host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledTue Nov  7 17:11:33 2017 - [info]     GTID ONTue Nov  7 17:11:33 2017 - [info]     Replicating from  host_1( host_1:3306)Tue Nov  7 17:11:33 2017 - [info]     Primary candidate for the new Master (candidate_master is set)Tue Nov  7 17:11:33 2017 - [info] The oldest binary log file/position on all slaves is  host_1.000051:11918Tue Nov  7 17:11:33 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:2-3,Tue Nov  7 17:11:33 2017 - [info] Oldest slaves:Tue Nov  7 17:11:33 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledTue Nov  7 17:11:33 2017 - [info]     GTID ONTue Nov  7 17:11:33 2017 - [info]     Replicating from  host_1( host_1:3306)Tue Nov  7 17:11:33 2017 - [info]     Not candidate for the new Master (no_master is set)Tue Nov  7 17:11:33 2017 - [info]Tue Nov  7 17:11:33 2017 - [info] * Phase 3.3: Determining New Master Phase..Tue Nov  7 17:11:33 2017 - [info]Tue Nov  7 17:11:33 2017 - [info] Searching new master from slaves..Tue Nov  7 17:11:33 2017 - [info]  Candidate masters from the configuration file:Tue Nov  7 17:11:33 2017 - [info]    host_2( host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledTue Nov  7 17:11:33 2017 - [info]     GTID ONTue Nov  7 17:11:33 2017 - [info]     Replicating from  host_1( host_1:3306)Tue Nov  7 17:11:33 2017 - [info]     Primary candidate for the new Master (candidate_master is set)Tue Nov  7 17:11:33 2017 - [info]  Non-candidate masters:Tue Nov  7 17:11:33 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledTue Nov  7 17:11:33 2017 - [info]     GTID ONTue Nov  7 17:11:33 2017 - [info]     Replicating from  host_1( host_1:3306)Tue Nov  7 17:11:33 2017 - [info]     Not candidate for the new Master (no_master is set)Tue Nov  7 17:11:33 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..Tue Nov  7 17:11:33 2017 - [info] New master is  host_2( host_2:3306)Tue Nov  7 17:11:33 2017 - [info] Starting master failover..Tue Nov  7 17:11:33 2017 - [info]Tue Nov  7 17:11:33 2017 - [info]Tue Nov  7 17:11:33 2017 - [info] * Phase 3.3: New Master Recovery Phase..Tue Nov  7 17:11:33 2017 - [info]Tue Nov  7 17:11:33 2017 - [info]  Waiting all logs to be applied..Tue Nov  7 17:11:33 2017 - [info]   done.Tue Nov  7 17:11:33 2017 - [info] -- Saving binlog from host  host_2 started, pid: 54677Tue Nov  7 17:11:33 2017 - [info] -- Saving binlog from host  host_1 started, pid: 54681Tue Nov  7 17:11:33 2017 - [info] -- Saving binlog from host  host_3 started, pid: 54683Tue Nov  7 17:11:33 2017 - [info]Tue Nov  7 17:11:33 2017 - [info] Log messages from  host_3 ...Tue Nov  7 17:11:33 2017 - [info]Tue Nov  7 17:11:33 2017 - [info] Fetching binary logs from binlog server  host_3..Tue Nov  7 17:11:33 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_1.000051  --start_pos=13508 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171107171129.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.binTue Nov  7 17:11:33 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?Tue Nov  7 17:11:33 2017 - [info] End of log messages from  host_3.Tue Nov  7 17:11:33 2017 - [warning] Got error from  host_3.Tue Nov  7 17:11:33 2017 - [info]Tue Nov  7 17:11:33 2017 - [info] Log messages from  host_2 ...Tue Nov  7 17:11:33 2017 - [info]Tue Nov  7 17:11:33 2017 - [info] Fetching binary logs from binlog server  host_2..Tue Nov  7 17:11:33 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_1.000051  --start_pos=13508 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171107171129.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.binTue Nov  7 17:11:33 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?Tue Nov  7 17:11:33 2017 - [info] End of log messages from  host_2.Tue Nov  7 17:11:33 2017 - [warning] Got error from  host_2.Tue Nov  7 17:11:33 2017 - [info]Tue Nov  7 17:11:33 2017 - [info] Log messages from  host_1 ...Tue Nov  7 17:11:33 2017 - [info]Tue Nov  7 17:11:33 2017 - [info] Fetching binary logs from binlog server  host_1..Tue Nov  7 17:11:33 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_1.000051  --start_pos=13508 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171107171129.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.binTue Nov  7 17:11:33 2017 - [info] scp from root@ host_1:/var/log/masterha/mha_test/saved_binlog_binlog2_20171107171129.binlog to local:/var/log/masterha/mha_test/saved_binlog_ host_1_binlog2_20171107171129.binlog succeeded.Tue Nov  7 17:11:33 2017 - [info] End of log messages from  host_1.Tue Nov  7 17:11:33 2017 - [info] Saved mysqlbinlog size from  host_1 is 8578 bytes.Tue Nov  7 17:11:33 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_ host_1_binlog2_20171107171129.binlog ..Tue Nov  7 17:11:33 2017 - [info] Differential log apply from binlog server succeeded.Tue Nov  7 17:11:33 2017 - [info] Getting new master's binlog name and position..Tue Nov  7 17:11:33 2017 - [info]   host_2.000001:5048Tue Nov  7 17:11:33 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=' host_2', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';Tue Nov  7 17:11:33 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set:  host_2.000001, 5048, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,Tue Nov  7 17:11:33 2017 - [info] Executing master IP activate script:Tue Nov  7 17:11:33 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host= host_1 --orig_master_ip= host_1 --orig_master_port=3306 --new_master_host= host_2 --new_master_ip= host_2 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'Tue Nov  7 17:11:36 2017 - [info]  OK.Tue Nov  7 17:11:36 2017 - [info] Setting read_only=0 on  host_2( host_2:3306)..Tue Nov  7 17:11:36 2017 - [info]  ok.Tue Nov  7 17:11:36 2017 - [info] ** Finished master recovery successfully.Tue Nov  7 17:11:36 2017 - [info] * Phase 3: Master Recovery Phase completed.Tue Nov  7 17:11:36 2017 - [info]Tue Nov  7 17:11:36 2017 - [info] * Phase 4: Slaves Recovery Phase..Tue Nov  7 17:11:36 2017 - [info]Tue Nov  7 17:11:36 2017 - [info]Tue Nov  7 17:11:36 2017 - [info] * Phase 4.1: Starting Slaves in parallel..Tue Nov  7 17:11:36 2017 - [info]Tue Nov  7 17:11:36 2017 - [info] -- Slave recovery on host  host_3( host_3:3306) started, pid: 58422. Check tmp log /var/log/masterha/mha_test/ host_3_3306_20171107171129.log if it takes time..Tue Nov  7 17:11:37 2017 - [info]Tue Nov  7 17:11:37 2017 - [info] Log messages from  host_3 ...Tue Nov  7 17:11:37 2017 - [info]Tue Nov  7 17:11:36 2017 - [info]  Resetting slave  host_3( host_3:3306) and starting replication from the new master  host_2( host_2:3306)..Tue Nov  7 17:11:36 2017 - [info]  Executed CHANGE MASTER.Tue Nov  7 17:11:37 2017 - [info]  Slave started.Tue Nov  7 17:11:37 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,Tue Nov  7 17:11:37 2017 - [info] End of log messages from  host_3.Tue Nov  7 17:11:37 2017 - [info] -- Slave on host  host_3( host_3:3306) started.Tue Nov  7 17:11:37 2017 - [info] All new slave servers recovered successfully.Tue Nov  7 17:11:37 2017 - [info]Tue Nov  7 17:11:37 2017 - [info] * Phase 5: New master cleanup phase..Tue Nov  7 17:11:37 2017 - [info]Tue Nov  7 17:11:37 2017 - [info] Resetting slave info on the new master..Tue Nov  7 17:11:37 2017 - [info]   host_2: Resetting slave info succeeded.Tue Nov  7 17:11:37 2017 - [info] Master failover to  host_2( host_2:3306) completed successfully.Tue Nov  7 17:11:37 2017 - [info]Tue Nov  7 17:11:37 2017 - [info] Sending mail..

1.3.2 当master的所有日志已经传递slave，这时候master 上的MySQL挂了

masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host= host_2  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_errorTue Nov  7 15:56:11 2017 - [info] MHA::MasterFailover version 0.56.Tue Nov  7 15:56:11 2017 - [info] Starting master failover.Tue Nov  7 15:56:11 2017 - [info]Tue Nov  7 15:56:11 2017 - [info] * Phase 1: Configuration Check Phase..Tue Nov  7 15:56:11 2017 - [info]Tue Nov  7 15:56:11 2017 - [info] HealthCheck: SSH to  host_2 is reachable.Tue Nov  7 15:56:12 2017 - [info] Binlog server  host_2 is reachable.Tue Nov  7 15:56:12 2017 - [info] HealthCheck: SSH to  host_1 is reachable.Tue Nov  7 15:56:12 2017 - [info] Binlog server  host_1 is reachable.Tue Nov  7 15:56:12 2017 - [info] HealthCheck: SSH to  host_3 is reachable.Tue Nov  7 15:56:13 2017 - [info] Binlog server  host_3 is reachable.Tue Nov  7 15:56:13 2017 - [warning] SQL Thread is stopped(no error) on  host_1( host_1:3306)Tue Nov  7 15:56:13 2017 - [warning] SQL Thread is stopped(no error) on  host_3( host_3:3306)Tue Nov  7 15:56:13 2017 - [info] GTID failover mode = 1Tue Nov  7 15:56:13 2017 - [info] Dead Servers:Tue Nov  7 15:56:13 2017 - [info]    host_2( host_2:3306)Tue Nov  7 15:56:13 2017 - [info] Checking master reachability via MySQL(double check)...Tue Nov  7 15:56:13 2017 - [info]  ok.Tue Nov  7 15:56:13 2017 - [info] Alive Servers:Tue Nov  7 15:56:13 2017 - [info]    host_1( host_1:3306)Tue Nov  7 15:56:13 2017 - [info]    host_3( host_3:3306)Tue Nov  7 15:56:13 2017 - [info] Alive Slaves:Tue Nov  7 15:56:13 2017 - [info]    host_1( host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledTue Nov  7 15:56:13 2017 - [info]     GTID ONTue Nov  7 15:56:13 2017 - [info]     Replicating from  host_2( host_2:3306)Tue Nov  7 15:56:13 2017 - [info]     Primary candidate for the new Master (candidate_master is set)Tue Nov  7 15:56:13 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledTue Nov  7 15:56:13 2017 - [info]     GTID ONTue Nov  7 15:56:13 2017 - [info]     Replicating from  host_2( host_2:3306)Tue Nov  7 15:56:13 2017 - [info]     Not candidate for the new Master (no_master is set)Tue Nov  7 15:56:13 2017 - [info]  Starting SQL thread on  host_1( host_1:3306) ..Tue Nov  7 15:56:13 2017 - [info]   done.Tue Nov  7 15:56:13 2017 - [info]  Starting SQL thread on  host_3( host_3:3306) ..Tue Nov  7 15:56:13 2017 - [info]   done.Tue Nov  7 15:56:13 2017 - [info] Starting GTID based failover.Tue Nov  7 15:56:13 2017 - [info]Tue Nov  7 15:56:13 2017 - [info] ** Phase 1: Configuration Check Phase completed.Tue Nov  7 15:56:13 2017 - [info]Tue Nov  7 15:56:13 2017 - [info] * Phase 2: Dead Master Shutdown Phase..Tue Nov  7 15:56:13 2017 - [info]Tue Nov  7 15:56:13 2017 - [info] HealthCheck: SSH to  host_2 is reachable.Tue Nov  7 15:56:13 2017 - [info] Forcing shutdown so that applications never connect to the current master..Tue Nov  7 15:56:13 2017 - [info] Executing master IP deactivation script:Tue Nov  7 15:56:13 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host= host_2 --orig_master_ip= host_2 --orig_master_port=3306 --command=stopssh --ssh_user=rootTue Nov  7 15:56:16 2017 - [info]  done.Tue Nov  7 15:56:16 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.Tue Nov  7 15:56:16 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.Tue Nov  7 15:56:16 2017 - [info]Tue Nov  7 15:56:16 2017 - [info] * Phase 3: Master Recovery Phase..Tue Nov  7 15:56:16 2017 - [info]Tue Nov  7 15:56:16 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..Tue Nov  7 15:56:16 2017 - [info]Tue Nov  7 15:56:16 2017 - [info] The latest binary log file/position on all slaves is  host_2.000049:11291Tue Nov  7 15:56:16 2017 - [info] Retrieved Gtid Set: ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:3-446352Tue Nov  7 15:56:16 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):Tue Nov  7 15:56:16 2017 - [info]    host_1( host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledTue Nov  7 15:56:16 2017 - [info]     GTID ONTue Nov  7 15:56:16 2017 - [info]     Replicating from  host_2( host_2:3306)Tue Nov  7 15:56:16 2017 - [info]     Primary candidate for the new Master (candidate_master is set)Tue Nov  7 15:56:16 2017 - [info] The oldest binary log file/position on all slaves is  host_2.000049:10703Tue Nov  7 15:56:16 2017 - [info] Retrieved Gtid Set: ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:3-446350Tue Nov  7 15:56:16 2017 - [info] Oldest slaves:Tue Nov  7 15:56:16 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledTue Nov  7 15:56:16 2017 - [info]     GTID ONTue Nov  7 15:56:16 2017 - [info]     Replicating from  host_2( host_2:3306)Tue Nov  7 15:56:16 2017 - [info]     Not candidate for the new Master (no_master is set)Tue Nov  7 15:56:16 2017 - [info]Tue Nov  7 15:56:16 2017 - [info] * Phase 3.3: Determining New Master Phase..Tue Nov  7 15:56:16 2017 - [info]Tue Nov  7 15:56:16 2017 - [info] Searching new master from slaves..Tue Nov  7 15:56:16 2017 - [info]  Candidate masters from the configuration file:Tue Nov  7 15:56:16 2017 - [info]    host_1( host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledTue Nov  7 15:56:16 2017 - [info]     GTID ONTue Nov  7 15:56:16 2017 - [info]     Replicating from  host_2( host_2:3306)Tue Nov  7 15:56:16 2017 - [info]     Primary candidate for the new Master (candidate_master is set)Tue Nov  7 15:56:16 2017 - [info]  Non-candidate masters:Tue Nov  7 15:56:16 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledTue Nov  7 15:56:16 2017 - [info]     GTID ONTue Nov  7 15:56:16 2017 - [info]     Replicating from  host_2( host_2:3306)Tue Nov  7 15:56:16 2017 - [info]     Not candidate for the new Master (no_master is set)Tue Nov  7 15:56:16 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..Tue Nov  7 15:56:16 2017 - [info] New master is  host_1( host_1:3306)Tue Nov  7 15:56:16 2017 - [info] Starting master failover..Tue Nov  7 15:56:16 2017 - [info]Tue Nov  7 15:56:16 2017 - [info]Tue Nov  7 15:56:16 2017 - [info] * Phase 3.3: New Master Recovery Phase..Tue Nov  7 15:56:16 2017 - [info]Tue Nov  7 15:56:16 2017 - [info]  Waiting all logs to be applied..Tue Nov  7 15:56:16 2017 - [info]   done.Tue Nov  7 15:56:16 2017 - [info] -- Saving binlog from host  host_2 started, pid: 79759Tue Nov  7 15:56:16 2017 - [info] -- Saving binlog from host  host_1 started, pid: 79768Tue Nov  7 15:56:16 2017 - [info] -- Saving binlog from host  host_3 started, pid: 79770Tue Nov  7 15:56:17 2017 - [info]Tue Nov  7 15:56:17 2017 - [info] Log messages from  host_1 ...Tue Nov  7 15:56:17 2017 - [info]Tue Nov  7 15:56:16 2017 - [info] Fetching binary logs from binlog server  host_1..Tue Nov  7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000049  --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.binTue Nov  7 15:56:17 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?Tue Nov  7 15:56:17 2017 - [info] End of log messages from  host_1.Tue Nov  7 15:56:17 2017 - [warning] Got error from  host_1.Tue Nov  7 15:56:17 2017 - [info]Tue Nov  7 15:56:17 2017 - [info] Log messages from  host_3 ...Tue Nov  7 15:56:17 2017 - [info]Tue Nov  7 15:56:16 2017 - [info] Fetching binary logs from binlog server  host_3..Tue Nov  7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000049  --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.binTue Nov  7 15:56:17 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?Tue Nov  7 15:56:17 2017 - [info] End of log messages from  host_3.Tue Nov  7 15:56:17 2017 - [warning] Got error from  host_3.Tue Nov  7 15:56:17 2017 - [info]Tue Nov  7 15:56:17 2017 - [info] Log messages from  host_2 ...Tue Nov  7 15:56:17 2017 - [info]Tue Nov  7 15:56:16 2017 - [info] Fetching binary logs from binlog server  host_2..Tue Nov  7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000049  --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.binTue Nov  7 15:56:17 2017 - [info] scp from root@ host_2:/var/log/masterha/mha_test/saved_binlog_binlog1_20171107155611.binlog to local:/var/log/masterha/mha_test/saved_binlog_ host_2_binlog1_20171107155611.binlog succeeded.Tue Nov  7 15:56:17 2017 - [info] End of log messages from  host_2.Tue Nov  7 15:56:17 2017 - [info] Saved mysqlbinlog size from  host_2 is 768 bytes.Tue Nov  7 15:56:17 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_ host_2_binlog1_20171107155611.binlog ..Tue Nov  7 15:56:17 2017 - [info] Differential log apply from binlog server succeeded.Tue Nov  7 15:56:17 2017 - [info] Getting new master's binlog name and position..Tue Nov  7 15:56:17 2017 - [info]   host_1.000051:11449Tue Nov  7 15:56:17 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=' host_1', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';Tue Nov  7 15:56:17 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set:  host_1.000051, 11449, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1,Tue Nov  7 15:56:17 2017 - [info] Executing master IP activate script:Tue Nov  7 15:56:17 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host= host_2 --orig_master_ip= host_2 --orig_master_port=3306 --new_master_host= host_1 --new_master_ip= host_1 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'Tue Nov  7 15:56:20 2017 - [info]  OK.Tue Nov  7 15:56:20 2017 - [info] Setting read_only=0 on  host_1( host_1:3306)..Tue Nov  7 15:56:20 2017 - [info]  ok.Tue Nov  7 15:56:20 2017 - [info] ** Finished master recovery successfully.Tue Nov  7 15:56:20 2017 - [info] * Phase 3: Master Recovery Phase completed.Tue Nov  7 15:56:20 2017 - [info]Tue Nov  7 15:56:20 2017 - [info] * Phase 4: Slaves Recovery Phase..Tue Nov  7 15:56:20 2017 - [info]Tue Nov  7 15:56:20 2017 - [info]Tue Nov  7 15:56:20 2017 - [info] * Phase 4.1: Starting Slaves in parallel..Tue Nov  7 15:56:20 2017 - [info]Tue Nov  7 15:56:20 2017 - [info] -- Slave recovery on host  host_3( host_3:3306) started, pid: 85941. Check tmp log /var/log/masterha/mha_test/ host_3_3306_20171107155611.log if it takes time..Tue Nov  7 15:56:21 2017 - [info]Tue Nov  7 15:56:21 2017 - [info] Log messages from  host_3 ...Tue Nov  7 15:56:21 2017 - [info]Tue Nov  7 15:56:20 2017 - [info]  Resetting slave  host_3( host_3:3306) and starting replication from the new master  host_1( host_1:3306)..Tue Nov  7 15:56:20 2017 - [info]  Executed CHANGE MASTER.Tue Nov  7 15:56:21 2017 - [info]  Slave started.Tue Nov  7 15:56:21 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1,Tue Nov  7 15:56:21 2017 - [info] End of log messages from  host_3.Tue Nov  7 15:56:21 2017 - [info] -- Slave on host  host_3( host_3:3306) started.Tue Nov  7 15:56:21 2017 - [info] All new slave servers recovered successfully.Tue Nov  7 15:56:21 2017 - [info]Tue Nov  7 15:56:21 2017 - [info] * Phase 5: New master cleanup phase..Tue Nov  7 15:56:21 2017 - [info]Tue Nov  7 15:56:21 2017 - [info] Resetting slave info on the new master..Tue Nov  7 15:56:21 2017 - [info]   host_1: Resetting slave info succeeded.Tue Nov  7 15:56:21 2017 - [info] Master failover to  host_1( host_1:3306) completed successfully.Tue Nov  7 15:56:21 2017 - [info]Tue Nov  7 15:56:21 2017 - [info] Sending mail..

1.4 slave(候选master）上面有大事务在跑

1000s的大查询

无影响，正常切换

flush tables with readlock

dba:(none)> show processlist;+----+------+----------------------+------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+| Id | User | Host                 | db   | Command | Time | State                        | Info                                                                                                 |+----+------+----------------------+------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+| 63 | dba  | localhost            | NULL | Query   |    0 | starting                     | show processlist                                                                                     || 65 | dba  | xx:11164   | NULL | Sleep   |  121 |                              | NULL                                                                                                 || 83 | dba  | new master:49022 | NULL | Query   |  176 | Waiting for global read lock | BINLOG 'GpAKWhNYUy1LMAAAAGYHAAAAAG0AAAAAAAEAAmxjAAh0X2NoYXJfMgACAw8CLAECGpAKWh5YUy1LJwAAAI0HAAAAAG |+----+------+----------------------+------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+3 rows in set (0.00 sec)

1.5 binlog server 不同场景的测试

dead_master上的最后部分日志没有传递到slave和etl的情况, 然而slave的日志也落后etl （这是最严苛的情况）

binlog server 写3台

masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host= host_1  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_errorTue Nov  7 15:56:17 2017 - [info] Log messages from  host_1 ...Tue Nov  7 15:56:17 2017 - [info]Tue Nov  7 15:56:16 2017 - [info] Fetching binary logs from binlog server  host_1..Tue Nov  7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000049  --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.binTue Nov  7 15:56:17 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?Tue Nov  7 15:56:17 2017 - [info] End of log messages from  host_1.Tue Nov  7 15:56:17 2017 - [warning] Got error from  host_1.Tue Nov  7 15:56:17 2017 - [info]Tue Nov  7 15:56:17 2017 - [info] Log messages from  host_3 ...Tue Nov  7 15:56:17 2017 - [info]Tue Nov  7 15:56:16 2017 - [info] Fetching binary logs from binlog server  host_3..Tue Nov  7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000049  --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.binTue Nov  7 15:56:17 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?Tue Nov  7 15:56:17 2017 - [info] End of log messages from  host_3.Tue Nov  7 15:56:17 2017 - [warning] Got error from  host_3.Tue Nov  7 15:56:17 2017 - [info]Tue Nov  7 15:56:17 2017 - [info] Log messages from  host_2 ...Tue Nov  7 15:56:17 2017 - [info]Tue Nov  7 15:56:16 2017 - [info] Fetching binary logs from binlog server  host_2..Tue Nov  7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000049  --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.binTue Nov  7 15:56:17 2017 - [info] scp from root@ host_2:/var/log/masterha/mha_test/saved_binlog_binlog1_20171107155611.binlog to local:/var/log/masterha/mha_test/saved_binlog_ host_2_binlog1_20171107155611.binlog succeeded.Tue Nov  7 15:56:17 2017 - [info] End of log messages from  host_2.Tue Nov  7 15:56:17 2017 - [info] Saved mysqlbinlog size from  host_2 is 768 bytes.Tue Nov  7 15:56:17 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_ host_2_binlog1_20171107155611.binlog ..Tue Nov  7 15:56:17 2017 - [info] Differential log apply from binlog server succeeded.

binlog server 只写master

masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host= host_2  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_errorThu Nov  9 11:20:04 2017 - [info] -- Saving binlog from host  host_2 started, pid: 117389Thu Nov  9 11:20:05 2017 - [info]Thu Nov  9 11:20:05 2017 - [info] Log messages from  host_2 ...Thu Nov  9 11:20:05 2017 - [info]Thu Nov  9 11:20:04 2017 - [info] Fetching binary logs from binlog server  host_2..Thu Nov  9 11:20:04 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000004  --start_pos=1115 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171109111957.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.binThu Nov  9 11:20:05 2017 - [info] scp from root@ host_2:/var/log/masterha/mha_test/saved_binlog_binlog1_20171109111957.binlog to local:/var/log/masterha/mha_test/saved_binlog_ host_2_binlog1_20171109111957.binlog succeeded.Thu Nov  9 11:20:05 2017 - [info] End of log messages from  host_2.Thu Nov  9 11:20:05 2017 - [info] Saved mysqlbinlog size from  host_2 is 4444 bytes.Thu Nov  9 11:20:05 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_ host_2_binlog1_20171109111957.binlog ..Thu Nov  9 11:20:05 2017 - [info] Differential log apply from binlog server succeeded.

binlog server 只写slave

### 3台服务器的GTID状态* master   host_1dba:lc> show master status;+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+| File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                        |+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+|  host_1.000055 |     6016 |              |                  | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446369 |+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+1 row in set (0.00 sec)* slave  host_2            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-21,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446369                Auto_Position: 1* etl  host_3           Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:22-25,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446366-446369            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-25,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446369                Auto_Position: 1### 切换日志masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host= host_1  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_errorThu Nov  9 15:00:09 2017 - [info] MHA::MasterFailover version 0.56.Thu Nov  9 15:00:09 2017 - [info] Starting master failover.Thu Nov  9 15:00:09 2017 - [info]Thu Nov  9 15:00:09 2017 - [info] * Phase 1: Configuration Check Phase..Thu Nov  9 15:00:09 2017 - [info]Thu Nov  9 15:00:09 2017 - [info] HealthCheck: SSH to  host_2 is reachable.Thu Nov  9 15:00:09 2017 - [info] Binlog server  host_2 is reachable.Thu Nov  9 15:00:10 2017 - [warning] SQL Thread is stopped(no error) on  host_2( host_2:3306)Thu Nov  9 15:00:10 2017 - [warning] SQL Thread is stopped(no error) on  host_3( host_3:3306)Thu Nov  9 15:00:10 2017 - [info] GTID failover mode = 1Thu Nov  9 15:00:10 2017 - [info] Dead Servers:Thu Nov  9 15:00:10 2017 - [info]    host_1( host_1:3306)Thu Nov  9 15:00:10 2017 - [info] Checking master reachability via MySQL(double check)...Thu Nov  9 15:00:10 2017 - [info]  ok.Thu Nov  9 15:00:10 2017 - [info] Alive Servers:Thu Nov  9 15:00:10 2017 - [info]    host_2( host_2:3306)Thu Nov  9 15:00:10 2017 - [info]    host_3( host_3:3306)Thu Nov  9 15:00:10 2017 - [info] Alive Slaves:Thu Nov  9 15:00:10 2017 - [info]    host_2( host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledThu Nov  9 15:00:10 2017 - [info]     GTID ONThu Nov  9 15:00:10 2017 - [info]     Replicating from  host_1( host_1:3306)Thu Nov  9 15:00:10 2017 - [info]     Primary candidate for the new Master (candidate_master is set)Thu Nov  9 15:00:10 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledThu Nov  9 15:00:10 2017 - [info]     GTID ONThu Nov  9 15:00:10 2017 - [info]     Replicating from  host_1( host_1:3306)Thu Nov  9 15:00:10 2017 - [info]     Not candidate for the new Master (no_master is set)Thu Nov  9 15:00:10 2017 - [info]  Starting SQL thread on  host_2( host_2:3306) ..Thu Nov  9 15:00:10 2017 - [info]   done.Thu Nov  9 15:00:10 2017 - [info]  Starting SQL thread on  host_3( host_3:3306) ..Thu Nov  9 15:00:10 2017 - [info]   done.Thu Nov  9 15:00:10 2017 - [info] Starting GTID based failover.Thu Nov  9 15:00:10 2017 - [info]Thu Nov  9 15:00:10 2017 - [info] ** Phase 1: Configuration Check Phase completed.Thu Nov  9 15:00:10 2017 - [info]Thu Nov  9 15:00:10 2017 - [info] * Phase 2: Dead Master Shutdown Phase..Thu Nov  9 15:00:10 2017 - [info]Thu Nov  9 15:00:10 2017 - [info] HealthCheck: SSH to  host_1 is reachable.Thu Nov  9 15:00:10 2017 - [info] Forcing shutdown so that applications never connect to the current master..Thu Nov  9 15:00:10 2017 - [info] Executing master IP deactivation script:Thu Nov  9 15:00:10 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host= host_1 --orig_master_ip= host_1 --orig_master_port=3306 --command=stopssh --ssh_user=rootThu Nov  9 15:00:17 2017 - [info]  done.Thu Nov  9 15:00:17 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.Thu Nov  9 15:00:17 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.Thu Nov  9 15:00:17 2017 - [info]Thu Nov  9 15:00:17 2017 - [info] * Phase 3: Master Recovery Phase..Thu Nov  9 15:00:17 2017 - [info]Thu Nov  9 15:00:17 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..Thu Nov  9 15:00:17 2017 - [info]Thu Nov  9 15:00:17 2017 - [info] The latest binary log file/position on all slaves is  host_1.000055:4090Thu Nov  9 15:00:17 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:22-25,Thu Nov  9 15:00:17 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):Thu Nov  9 15:00:17 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledThu Nov  9 15:00:17 2017 - [info]     GTID ONThu Nov  9 15:00:17 2017 - [info]     Replicating from  host_1( host_1:3306)Thu Nov  9 15:00:17 2017 - [info]     Not candidate for the new Master (no_master is set)Thu Nov  9 15:00:17 2017 - [info] The oldest binary log file/position on all slaves is  host_1.000055:2806Thu Nov  9 15:00:17 2017 - [info] Oldest slaves:Thu Nov  9 15:00:17 2017 - [info]    host_2( host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledThu Nov  9 15:00:17 2017 - [info]     GTID ONThu Nov  9 15:00:17 2017 - [info]     Replicating from  host_1( host_1:3306)Thu Nov  9 15:00:17 2017 - [info]     Primary candidate for the new Master (candidate_master is set)Thu Nov  9 15:00:17 2017 - [info]Thu Nov  9 15:00:17 2017 - [info] * Phase 3.3: Determining New Master Phase..Thu Nov  9 15:00:17 2017 - [info]Thu Nov  9 15:00:17 2017 - [info] Searching new master from slaves..Thu Nov  9 15:00:17 2017 - [info]  Candidate masters from the configuration file:Thu Nov  9 15:00:17 2017 - [info]    host_2( host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledThu Nov  9 15:00:17 2017 - [info]     GTID ONThu Nov  9 15:00:17 2017 - [info]     Replicating from  host_1( host_1:3306)Thu Nov  9 15:00:17 2017 - [info]     Primary candidate for the new Master (candidate_master is set)Thu Nov  9 15:00:17 2017 - [info]  Non-candidate masters:Thu Nov  9 15:00:17 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledThu Nov  9 15:00:17 2017 - [info]     GTID ONThu Nov  9 15:00:17 2017 - [info]     Replicating from  host_1( host_1:3306)Thu Nov  9 15:00:17 2017 - [info]     Not candidate for the new Master (no_master is set)Thu Nov  9 15:00:17 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..Thu Nov  9 15:00:17 2017 - [info]   Not found.Thu Nov  9 15:00:17 2017 - [info]  Searching from all candidate_master slaves..Thu Nov  9 15:00:17 2017 - [info] New master is  host_2( host_2:3306)Thu Nov  9 15:00:17 2017 - [info] Starting master failover..Thu Nov  9 15:00:17 2017 - [info]Thu Nov  9 15:00:17 2017 - [info]Thu Nov  9 15:00:17 2017 - [info] * Phase 3.3: New Master Recovery Phase..Thu Nov  9 15:00:17 2017 - [info]Thu Nov  9 15:00:17 2017 - [info]  Waiting all logs to be applied..Thu Nov  9 15:00:17 2017 - [info]   done.Thu Nov  9 15:00:17 2017 - [info]  Replicating from the latest slave  host_3( host_3:3306) and waiting to apply..Thu Nov  9 15:00:17 2017 - [info]  Waiting all logs to be applied on the latest slave..Thu Nov  9 15:00:17 2017 - [info]  Resetting slave  host_2( host_2:3306) and starting replication from the new master  host_3( host_3:3306)..Thu Nov  9 15:00:17 2017 - [info]  Executed CHANGE MASTER.Thu Nov  9 15:00:18 2017 - [info]  Slave started.Thu Nov  9 15:00:18 2017 - [info]  Waiting to execute all relay logs on  host_2( host_2:3306)..Thu Nov  9 15:00:18 2017 - [info]  master_pos_wait( host_3.000049:25843) completed on  host_2( host_2:3306). Executed 0 events.Thu Nov  9 15:00:18 2017 - [info]   done.Thu Nov  9 15:00:18 2017 - [info]   done.Thu Nov  9 15:00:18 2017 - [info] -- Saving binlog from host  host_2 started, pid: 175683Thu Nov  9 15:00:18 2017 - [info]Thu Nov  9 15:00:18 2017 - [info] Log messages from  host_2 ...Thu Nov  9 15:00:18 2017 - [info]Thu Nov  9 15:00:18 2017 - [info] Fetching binary logs from binlog server  host_2..Thu Nov  9 15:00:18 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_1.000055  --start_pos=4090 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171109150009.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.binThu Nov  9 15:00:18 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?Thu Nov  9 15:00:18 2017 - [info] End of log messages from  host_2.Thu Nov  9 15:00:18 2017 - [warning] Got error from  host_2.Thu Nov  9 15:00:18 2017 - [info] Getting new master's binlog name and position..Thu Nov  9 15:00:18 2017 - [info]   host_2.000005:1390Thu Nov  9 15:00:18 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=' host_2', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';Thu Nov  9 15:00:18 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set:  host_2.000005, 1390, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-25,Thu Nov  9 15:00:18 2017 - [info] Executing master IP activate script:Thu Nov  9 15:00:18 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host= host_1 --orig_master_ip= host_1 --orig_master_port=3306 --new_master_host= host_2 --new_master_ip= host_2 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'Thu Nov  9 15:00:22 2017 - [info]  OK.Thu Nov  9 15:00:22 2017 - [info] Setting read_only=0 on  host_2( host_2:3306)..Thu Nov  9 15:00:22 2017 - [info]  ok.Thu Nov  9 15:00:22 2017 - [info] ** Finished master recovery successfully.Thu Nov  9 15:00:22 2017 - [info] * Phase 3: Master Recovery Phase completed.Thu Nov  9 15:00:22 2017 - [info]Thu Nov  9 15:00:22 2017 - [info] * Phase 4: Slaves Recovery Phase..Thu Nov  9 15:00:22 2017 - [info]Thu Nov  9 15:00:22 2017 - [info]Thu Nov  9 15:00:22 2017 - [info] * Phase 4.1: Starting Slaves in parallel..Thu Nov  9 15:00:22 2017 - [info]Thu Nov  9 15:00:22 2017 - [info] -- Slave recovery on host  host_3( host_3:3306) started, pid: 180681. Check tmp log /var/log/masterha/mha_test/ host_3_3306_20171109150009.log if it takes time..Thu Nov  9 15:00:23 2017 - [info]Thu Nov  9 15:00:23 2017 - [info] Log messages from  host_3 ...Thu Nov  9 15:00:23 2017 - [info]Thu Nov  9 15:00:22 2017 - [info]  Resetting slave  host_3( host_3:3306) and starting replication from the new master  host_2( host_2:3306)..Thu Nov  9 15:00:22 2017 - [info]  Executed CHANGE MASTER.Thu Nov  9 15:00:23 2017 - [info]  Slave started.Thu Nov  9 15:00:23 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-25,Thu Nov  9 15:00:23 2017 - [info] End of log messages from  host_3.Thu Nov  9 15:00:23 2017 - [info] -- Slave on host  host_3( host_3:3306) started.Thu Nov  9 15:00:23 2017 - [info] All new slave servers recovered successfully.Thu Nov  9 15:00:23 2017 - [info]Thu Nov  9 15:00:23 2017 - [info] * Phase 5: New master cleanup phase..Thu Nov  9 15:00:23 2017 - [info]Thu Nov  9 15:00:23 2017 - [info] Resetting slave info on the new master..Thu Nov  9 15:00:23 2017 - [info]   host_2: Resetting slave info succeeded.Thu Nov  9 15:00:23 2017 - [info] Master failover to  host_2( host_2:3306) completed successfully.Thu Nov  9 15:00:23 2017 - [info]Thu Nov  9 15:00:23 2017 - [info] Sending mail..

结论：由于binlog server没有配置master，所以会丢失master没有传递过来的事务日志

好在，slave和etl之间会互相change master，所以尽管slave（candidate master）的日志落后，最终也还是用etl的日志补齐了slave缺失的日志。

binlog server 啥都不写

### 3台DB的GTID状态* master  host_2dba:lc> show master status;+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+| File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                        |+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+|  host_2.000005 |     5785 |              |                  | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446378 |+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+1 row in set (0.00 sec)* slave  host_1           Retrieved_Gtid_Set:            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446369                Auto_Position: 1* etl  host_3           Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:26-31,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446370-446372            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446372                Auto_Position: 1### 切换日志Thu Nov  9 16:22:41 2017 - [info] MHA::MasterFailover version 0.56.Thu Nov  9 16:22:41 2017 - [info] Starting master failover.Thu Nov  9 16:22:41 2017 - [info]Thu Nov  9 16:22:41 2017 - [info] * Phase 1: Configuration Check Phase..Thu Nov  9 16:22:41 2017 - [info]Thu Nov  9 16:22:41 2017 - [warning] SQL Thread is stopped(no error) on  host_1( host_1:3306)Thu Nov  9 16:22:41 2017 - [warning] SQL Thread is stopped(no error) on  host_3( host_3:3306)Thu Nov  9 16:22:41 2017 - [info] GTID failover mode = 1Thu Nov  9 16:22:41 2017 - [info] Dead Servers:Thu Nov  9 16:22:41 2017 - [info]    host_2( host_2:3306)Thu Nov  9 16:22:41 2017 - [info] Checking master reachability via MySQL(double check)...Thu Nov  9 16:22:41 2017 - [info]  ok.Thu Nov  9 16:22:41 2017 - [info] Alive Servers:Thu Nov  9 16:22:41 2017 - [info]    host_1( host_1:3306)Thu Nov  9 16:22:41 2017 - [info]    host_3( host_3:3306)Thu Nov  9 16:22:41 2017 - [info] Alive Slaves:Thu Nov  9 16:22:41 2017 - [info]    host_1( host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledThu Nov  9 16:22:41 2017 - [info]     GTID ONThu Nov  9 16:22:41 2017 - [info]     Replicating from  host_2( host_2:3306)Thu Nov  9 16:22:41 2017 - [info]     Primary candidate for the new Master (candidate_master is set)Thu Nov  9 16:22:41 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledThu Nov  9 16:22:41 2017 - [info]     GTID ONThu Nov  9 16:22:41 2017 - [info]     Replicating from  host_2( host_2:3306)Thu Nov  9 16:22:41 2017 - [info]     Not candidate for the new Master (no_master is set)Thu Nov  9 16:22:41 2017 - [info]  Starting SQL thread on  host_1( host_1:3306) ..Thu Nov  9 16:22:41 2017 - [info]   done.Thu Nov  9 16:22:41 2017 - [info]  Starting SQL thread on  host_3( host_3:3306) ..Thu Nov  9 16:22:41 2017 - [info]   done.Thu Nov  9 16:22:41 2017 - [info] Starting GTID based failover.Thu Nov  9 16:22:41 2017 - [info]Thu Nov  9 16:22:41 2017 - [info] ** Phase 1: Configuration Check Phase completed.Thu Nov  9 16:22:41 2017 - [info]Thu Nov  9 16:22:41 2017 - [info] * Phase 2: Dead Master Shutdown Phase..Thu Nov  9 16:22:41 2017 - [info]Thu Nov  9 16:22:42 2017 - [info] HealthCheck: SSH to  host_2 is reachable.Thu Nov  9 16:22:42 2017 - [info] Forcing shutdown so that applications never connect to the current master..Thu Nov  9 16:22:42 2017 - [info] Executing master IP deactivation script:Thu Nov  9 16:22:42 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host= host_2 --orig_master_ip= host_2 --orig_master_port=3306 --command=stopssh --ssh_user=root===================    swift vip :  tgw_vip from  host_2 is deleted  ==============================--2017-11-09 16:22:42--  http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi正在连接 tgw_server:80... 已连接。已发出 HTTP 请求，正在等待回应... 200 OK长度：未指定 [text/html]正在保存至: “STDOUT”     0K                                                        9.79M=0s2017-11-09 16:22:44 (9.79 MB/s) - 已写入标准输出 [38]Thu Nov  9 16:22:44 2017 - [info]  done.Thu Nov  9 16:22:44 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.Thu Nov  9 16:22:44 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.Thu Nov  9 16:22:44 2017 - [info]Thu Nov  9 16:22:44 2017 - [info] * Phase 3: Master Recovery Phase..Thu Nov  9 16:22:44 2017 - [info]Thu Nov  9 16:22:44 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..Thu Nov  9 16:22:44 2017 - [info]Thu Nov  9 16:22:44 2017 - [info] The latest binary log file/position on all slaves is  host_2.000005:4015Thu Nov  9 16:22:44 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:26-31,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446370-446372Thu Nov  9 16:22:44 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):Thu Nov  9 16:22:44 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledThu Nov  9 16:22:44 2017 - [info]     GTID ONThu Nov  9 16:22:44 2017 - [info]     Replicating from  host_2( host_2:3306)Thu Nov  9 16:22:44 2017 - [info]     Not candidate for the new Master (no_master is set)Thu Nov  9 16:22:44 2017 - [info] The oldest binary log file/position on all slaves is  host_2.000005:3130Thu Nov  9 16:22:44 2017 - [info] Oldest slaves:Thu Nov  9 16:22:44 2017 - [info]    host_1( host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledThu Nov  9 16:22:44 2017 - [info]     GTID ONThu Nov  9 16:22:44 2017 - [info]     Replicating from  host_2( host_2:3306)Thu Nov  9 16:22:44 2017 - [info]     Primary candidate for the new Master (candidate_master is set)Thu Nov  9 16:22:44 2017 - [info]Thu Nov  9 16:22:44 2017 - [info] * Phase 3.3: Determining New Master Phase..Thu Nov  9 16:22:44 2017 - [info]Thu Nov  9 16:22:44 2017 - [info] Searching new master from slaves..Thu Nov  9 16:22:44 2017 - [info]  Candidate masters from the configuration file:Thu Nov  9 16:22:44 2017 - [info]    host_1( host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledThu Nov  9 16:22:44 2017 - [info]     GTID ONThu Nov  9 16:22:44 2017 - [info]     Replicating from  host_2( host_2:3306)Thu Nov  9 16:22:44 2017 - [info]     Primary candidate for the new Master (candidate_master is set)Thu Nov  9 16:22:44 2017 - [info]  Non-candidate masters:Thu Nov  9 16:22:44 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledThu Nov  9 16:22:44 2017 - [info]     GTID ONThu Nov  9 16:22:44 2017 - [info]     Replicating from  host_2( host_2:3306)Thu Nov  9 16:22:44 2017 - [info]     Not candidate for the new Master (no_master is set)Thu Nov  9 16:22:44 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..Thu Nov  9 16:22:44 2017 - [info]   Not found.Thu Nov  9 16:22:44 2017 - [info]  Searching from all candidate_master slaves..Thu Nov  9 16:22:44 2017 - [info] New master is  host_1( host_1:3306)Thu Nov  9 16:22:44 2017 - [info] Starting master failover..Thu Nov  9 16:22:44 2017 - [info]From: host_2( host_2:3306) (current master) +-- host_1( host_1:3306) +-- host_3( host_3:3306)To: host_1( host_1:3306) (new master) +-- host_3( host_3:3306)Thu Nov  9 16:22:44 2017 - [info]Thu Nov  9 16:22:44 2017 - [info] * Phase 3.3: New Master Recovery Phase..Thu Nov  9 16:22:44 2017 - [info]Thu Nov  9 16:22:44 2017 - [info]  Waiting all logs to be applied..Thu Nov  9 16:22:44 2017 - [info]   done.Thu Nov  9 16:22:44 2017 - [info]  Replicating from the latest slave  host_3( host_3:3306) and waiting to apply..Thu Nov  9 16:22:44 2017 - [info]  Waiting all logs to be applied on the latest slave..Thu Nov  9 16:22:44 2017 - [info]  Resetting slave  host_1( host_1:3306) and starting replication from the new master  host_3( host_3:3306)..Thu Nov  9 16:22:44 2017 - [info]  Executed CHANGE MASTER.Thu Nov  9 16:22:45 2017 - [info]  Slave started.Thu Nov  9 16:22:45 2017 - [info]  Waiting to execute all relay logs on  host_1( host_1:3306)..Thu Nov  9 16:22:45 2017 - [info]  master_pos_wait( host_3.000049:28663) completed on  host_1( host_1:3306). Executed 0 events.Thu Nov  9 16:22:45 2017 - [info]   done.Thu Nov  9 16:22:45 2017 - [info]   done.Thu Nov  9 16:22:45 2017 - [info] Getting new master's binlog name and position..Thu Nov  9 16:22:45 2017 - [info]   host_1.000056:1170Thu Nov  9 16:22:45 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=' host_1', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';Thu Nov  9 16:22:45 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set:  host_1.000056, 1170, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446372Thu Nov  9 16:22:45 2017 - [info] Executing master IP activate script:Thu Nov  9 16:22:45 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host= host_2 --orig_master_ip= host_2 --orig_master_port=3306 --new_master_host= host_1 --new_master_ip= host_1 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'Unknown option: new_master_userUnknown option: new_master_password===================    swift vip :  tgw_vip to  host_1  is added  ==============================Thu Nov  9 16:22:47 2017 - [info]  OK.Thu Nov  9 16:22:47 2017 - [info] Setting read_only=0 on  host_1( host_1:3306)..Thu Nov  9 16:22:47 2017 - [info]  ok.Thu Nov  9 16:22:47 2017 - [info] ** Finished master recovery successfully.Thu Nov  9 16:22:47 2017 - [info] * Phase 3: Master Recovery Phase completed.Thu Nov  9 16:22:47 2017 - [info]Thu Nov  9 16:22:47 2017 - [info] * Phase 4: Slaves Recovery Phase..Thu Nov  9 16:22:47 2017 - [info]Thu Nov  9 16:22:47 2017 - [info]Thu Nov  9 16:22:47 2017 - [info] * Phase 4.1: Starting Slaves in parallel..Thu Nov  9 16:22:47 2017 - [info]Thu Nov  9 16:22:47 2017 - [info] -- Slave recovery on host  host_3( host_3:3306) started, pid: 112317. Check tmp log /var/log/masterha/mha_test/ host_3_3306_20171109162241.log if it takes time..Thu Nov  9 16:22:48 2017 - [info]Thu Nov  9 16:22:48 2017 - [info] Log messages from  host_3 ...Thu Nov  9 16:22:48 2017 - [info]Thu Nov  9 16:22:47 2017 - [info]  Resetting slave  host_3( host_3:3306) and starting replication from the new master  host_1( host_1:3306)..Thu Nov  9 16:22:47 2017 - [info]  Executed CHANGE MASTER.Thu Nov  9 16:22:48 2017 - [info]  Slave started.Thu Nov  9 16:22:48 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446372) completed on  host_3( host_3:3306). Executed 0 events.Thu Nov  9 16:22:48 2017 - [info] End of log messages from  host_3.Thu Nov  9 16:22:48 2017 - [info] -- Slave on host  host_3( host_3:3306) started.Thu Nov  9 16:22:48 2017 - [info] All new slave servers recovered successfully.Thu Nov  9 16:22:48 2017 - [info]Thu Nov  9 16:22:48 2017 - [info] * Phase 5: New master cleanup phase..Thu Nov  9 16:22:48 2017 - [info]Thu Nov  9 16:22:48 2017 - [info] Resetting slave info on the new master..Thu Nov  9 16:22:49 2017 - [info]   host_1: Resetting slave info succeeded.Thu Nov  9 16:22:49 2017 - [info] Master failover to  host_1( host_1:3306) completed successfully.Thu Nov  9 16:22:49 2017 - [info]----- Failover Report -----bak_mha_test: MySQL Master failover  host_2( host_2:3306) to  host_1( host_1:3306) succeededMaster  host_2( host_2:3306) is down!Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.Started automated(non-interactive) failover.Invalidated master IP address on  host_2( host_2:3306)Selected  host_1( host_1:3306) as a new master. host_1( host_1:3306): OK: Applying all logs succeeded. host_1( host_1:3306): OK: Activated master IP address. host_3( host_3:3306): OK: Slave started, replicating from  host_1( host_1:3306) host_1( host_1:3306): Resetting slave info succeeded.Master failover to  host_1( host_1:3306) completed successfully.

结论：由于binlog server没有配置master，所以会丢失master没有传递过来的事务日志

好在，slave和etl之间会互相change master，所以尽管slave（candidate master）的日志落后，最终也还是用etl的日志补齐了slave缺失的日志。

1.6 如果MHA过程中失败，是否可以重新执行MHA的failover呢？

99%的场景都是可以重新执行的

1%的场景不能再次执行，执行会报错

一般这种场景就是：已经failover到最后的change master阶段，这样主从结构已经变更，MHA无法重新走一遍。
不过，即便到这步骤失败了，表示master的日志已经补完，由于是gtid模式，自己再让slave change master到最新的master即可，最后ACTIVE new ip和readonly=1就好了

Thu Nov  9 16:49:39 2017 - [info] MHA::MasterFailover version 0.56.Thu Nov  9 16:49:39 2017 - [info] Starting master failover.Thu Nov  9 16:49:39 2017 - [info]Thu Nov  9 16:49:39 2017 - [info] * Phase 1: Configuration Check Phase..Thu Nov  9 16:49:39 2017 - [info]Thu Nov  9 16:49:39 2017 - [info] GTID failover mode = 1Thu Nov  9 16:49:39 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln169] Detected dead master  host_1( host_1:3306) does not match with specified dead master  host_2( host_2:3306)!Thu Nov  9 16:49:39 2017 - [error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln177] Got ERROR:  at /usr/bin/masterha_master_switch line 53

1.7 Master：MySQL down小结

1. failover最终命令masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host= host_2  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error2. binlog server建议配置master就可以了[binlog1]$master_ip只配置slave，或者没有配置，会导致丢失部分没有从master传递过来的日志事务

二、Master : Server down

2.1 etl 延迟8小时

同1.1 结论

2.2 slave(候选master)比etl还要落后更多

2.2.1 当master的部分日志还没传递两个slave，这时候master server挂了

### 3台DB的GTID状态* master  host_2dba:lc> show master status;+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+| File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                        |+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+|  host_2.000008 |     5445 |              |                  | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392 |+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+1 row in set (0.00 sec)* slave  host_1           Retrieved_Gtid_Set:            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446385                Auto_Position: 1* etl  host_3           Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:46-50,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446386-446388            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446388                Auto_Position: 1### 模拟故障场景* 隔离master的网络，让其等同于down机master> iptables -A INPUT -p tcp -s other_host --dport 22 -j ACCEPTmaster> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP### 切换日志masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host= host_2  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_errorFri Nov 10 11:12:38 2017 - [info] MHA::MasterFailover version 0.56.Fri Nov 10 11:12:38 2017 - [info] Starting master failover.Fri Nov 10 11:12:38 2017 - [info]Fri Nov 10 11:12:38 2017 - [info] * Phase 1: Configuration Check Phase..Fri Nov 10 11:12:38 2017 - [info]Fri Nov 10 11:13:28 2017 - [warning] HealthCheck: Got timeout on checking SSH connection to  host_2! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.Fri Nov 10 11:13:28 2017 - [warning] Failed to SSH to binlog server  host_2Fri Nov 10 11:13:29 2017 - [info] HealthCheck: SSH to  host_1 is reachable.Fri Nov 10 11:13:29 2017 - [info] Binlog server  host_1 is reachable.Fri Nov 10 11:13:29 2017 - [info] HealthCheck: SSH to  host_3 is reachable.Fri Nov 10 11:13:29 2017 - [info] Binlog server  host_3 is reachable.Fri Nov 10 11:13:29 2017 - [warning] SQL Thread is stopped(no error) on  host_1( host_1:3306)Fri Nov 10 11:13:29 2017 - [warning] SQL Thread is stopped(no error) on  host_3( host_3:3306)Fri Nov 10 11:13:29 2017 - [info] GTID failover mode = 1Fri Nov 10 11:13:29 2017 - [info] Dead Servers:Fri Nov 10 11:13:29 2017 - [info]    host_2( host_2:3306)Fri Nov 10 11:13:29 2017 - [info] Checking master reachability via MySQL(double check)...Fri Nov 10 11:13:30 2017 - [info]  ok.Fri Nov 10 11:13:30 2017 - [info] Alive Servers:Fri Nov 10 11:13:30 2017 - [info]    host_1( host_1:3306)Fri Nov 10 11:13:30 2017 - [info]    host_3( host_3:3306)Fri Nov 10 11:13:30 2017 - [info] Alive Slaves:Fri Nov 10 11:13:30 2017 - [info]    host_1( host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledFri Nov 10 11:13:30 2017 - [info]     GTID ONFri Nov 10 11:13:30 2017 - [info]     Replicating from  host_2( host_2:3306)Fri Nov 10 11:13:30 2017 - [info]     Primary candidate for the new Master (candidate_master is set)Fri Nov 10 11:13:30 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledFri Nov 10 11:13:30 2017 - [info]     GTID ONFri Nov 10 11:13:30 2017 - [info]     Replicating from  host_2( host_2:3306)Fri Nov 10 11:13:30 2017 - [info]     Not candidate for the new Master (no_master is set)Fri Nov 10 11:13:30 2017 - [info]  Starting SQL thread on  host_1( host_1:3306) ..Fri Nov 10 11:13:30 2017 - [info]   done.Fri Nov 10 11:13:30 2017 - [info]  Starting SQL thread on  host_3( host_3:3306) ..Fri Nov 10 11:13:30 2017 - [info]   done.Fri Nov 10 11:13:30 2017 - [info] Starting GTID based failover.Fri Nov 10 11:13:30 2017 - [info]Fri Nov 10 11:13:30 2017 - [info] ** Phase 1: Configuration Check Phase completed.Fri Nov 10 11:13:30 2017 - [info]Fri Nov 10 11:13:30 2017 - [info] * Phase 2: Dead Master Shutdown Phase..Fri Nov 10 11:13:30 2017 - [info]Fri Nov 10 11:14:20 2017 - [warning] HealthCheck: Got timeout on checking SSH connection to  host_2! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.Fri Nov 10 11:14:20 2017 - [info] Forcing shutdown so that applications never connect to the current master..Fri Nov 10 11:14:20 2017 - [info] Executing master IP deactivation script:Fri Nov 10 11:14:20 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host= host_2 --orig_master_ip= host_2 --orig_master_port=3306 --command=stopssh: connect to host  host_2 port 22: Connection timed out===================    swift vip :  tgw_vip from  host_2 is deleted  ==============================--2017-11-10 11:14:27--  http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi正在连接 tgw_server:80... 已连接。已发出 HTTP 请求，正在等待回应... 200 OK长度：未指定 [text/html]正在保存至: “STDOUT”     0K                                                        11.4M=0s2017-11-10 11:16:27 (11.4 MB/s) - 已写入标准输出 [38]Fri Nov 10 11:16:27 2017 - [info]  done.Fri Nov 10 11:16:27 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.Fri Nov 10 11:16:27 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.Fri Nov 10 11:16:27 2017 - [info]Fri Nov 10 11:16:27 2017 - [info] * Phase 3: Master Recovery Phase..Fri Nov 10 11:16:27 2017 - [info]Fri Nov 10 11:16:27 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..Fri Nov 10 11:16:27 2017 - [info]Fri Nov 10 11:16:27 2017 - [info] The latest binary log file/position on all slaves is  host_2.000008:4265Fri Nov 10 11:16:27 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:46-50,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446386-446388Fri Nov 10 11:16:27 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):Fri Nov 10 11:16:27 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledFri Nov 10 11:16:27 2017 - [info]     GTID ONFri Nov 10 11:16:27 2017 - [info]     Replicating from  host_2( host_2:3306)Fri Nov 10 11:16:27 2017 - [info]     Not candidate for the new Master (no_master is set)Fri Nov 10 11:16:27 2017 - [info] The oldest binary log file/position on all slaves is  host_2.000008:3380Fri Nov 10 11:16:27 2017 - [info] Oldest slaves:Fri Nov 10 11:16:27 2017 - [info]    host_1( host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledFri Nov 10 11:16:27 2017 - [info]     GTID ONFri Nov 10 11:16:27 2017 - [info]     Replicating from  host_2( host_2:3306)Fri Nov 10 11:16:27 2017 - [info]     Primary candidate for the new Master (candidate_master is set)Fri Nov 10 11:16:27 2017 - [info]Fri Nov 10 11:16:27 2017 - [info] * Phase 3.3: Determining New Master Phase..Fri Nov 10 11:16:27 2017 - [info]Fri Nov 10 11:16:27 2017 - [info] Searching new master from slaves..Fri Nov 10 11:16:27 2017 - [info]  Candidate masters from the configuration file:Fri Nov 10 11:16:27 2017 - [info]    host_1( host_1:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledFri Nov 10 11:16:27 2017 - [info]     GTID ONFri Nov 10 11:16:27 2017 - [info]     Replicating from  host_2( host_2:3306)Fri Nov 10 11:16:27 2017 - [info]     Primary candidate for the new Master (candidate_master is set)Fri Nov 10 11:16:27 2017 - [info]  Non-candidate masters:Fri Nov 10 11:16:27 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledFri Nov 10 11:16:27 2017 - [info]     GTID ONFri Nov 10 11:16:27 2017 - [info]     Replicating from  host_2( host_2:3306)Fri Nov 10 11:16:27 2017 - [info]     Not candidate for the new Master (no_master is set)Fri Nov 10 11:16:27 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..Fri Nov 10 11:16:27 2017 - [info]   Not found.Fri Nov 10 11:16:27 2017 - [info]  Searching from all candidate_master slaves..Fri Nov 10 11:16:27 2017 - [info] New master is  host_1( host_1:3306)Fri Nov 10 11:16:27 2017 - [info] Starting master failover..Fri Nov 10 11:16:27 2017 - [info]From: host_2( host_2:3306) (current master) +-- host_1( host_1:3306) +-- host_3( host_3:3306)To: host_1( host_1:3306) (new master) +-- host_3( host_3:3306)Fri Nov 10 11:16:27 2017 - [info]Fri Nov 10 11:16:27 2017 - [info] * Phase 3.3: New Master Recovery Phase..Fri Nov 10 11:16:27 2017 - [info]Fri Nov 10 11:16:27 2017 - [info]  Waiting all logs to be applied..Fri Nov 10 11:16:27 2017 - [info]   done.Fri Nov 10 11:16:27 2017 - [info]  Replicating from the latest slave  host_3( host_3:3306) and waiting to apply..Fri Nov 10 11:16:27 2017 - [info]  Waiting all logs to be applied on the latest slave..Fri Nov 10 11:16:27 2017 - [info]  Resetting slave  host_1( host_1:3306) and starting replication from the new master  host_3( host_3:3306)..Fri Nov 10 11:16:27 2017 - [info]  Executed CHANGE MASTER.Fri Nov 10 11:16:28 2017 - [info]  Slave started.Fri Nov 10 11:16:28 2017 - [info]  Waiting to execute all relay logs on  host_1( host_1:3306)..Fri Nov 10 11:16:28 2017 - [info]  master_pos_wait( host_3.000049:40136) completed on  host_1( host_1:3306). Executed 0 events.Fri Nov 10 11:16:28 2017 - [info]   done.Fri Nov 10 11:16:28 2017 - [info]   done.Fri Nov 10 11:16:28 2017 - [info] -- Saving binlog from host  host_2 started, pid: 43038Fri Nov 10 11:16:28 2017 - [info] -- Saving binlog from host  host_1 started, pid: 43039Fri Nov 10 11:16:28 2017 - [info] -- Saving binlog from host  host_3 started, pid: 43041Fri Nov 10 11:16:28 2017 - [info]Fri Nov 10 11:16:28 2017 - [info] Log messages from  host_2 ...Fri Nov 10 11:16:28 2017 - [info] End of log messages from  host_2.Fri Nov 10 11:16:28 2017 - [warning] SSH is not reachable on  host_2. SkippingFri Nov 10 11:16:28 2017 - [info]Fri Nov 10 11:16:28 2017 - [info] Log messages from  host_1 ...Fri Nov 10 11:16:28 2017 - [info]Fri Nov 10 11:16:28 2017 - [info] Fetching binary logs from binlog server  host_1..Fri Nov 10 11:16:28 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000008  --start_pos=4265 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171110111238.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.binFailed to save binary log: Binlog not found from /data/mysql.bin! If you got this error at MHA Manager, please set "master_binlog_dir=/path/to/binlog_directory_of_the_master" correctly in the MHA Manager's configuration file and try again. at /usr/bin/save_binary_logs line 123    eval {...} called at /usr/bin/save_binary_logs line 70    main::main() called at /usr/bin/save_binary_logs line 66Fri Nov 10 11:16:28 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?Fri Nov 10 11:16:28 2017 - [info] End of log messages from  host_1.Fri Nov 10 11:16:28 2017 - [warning] Got error from  host_1.Fri Nov 10 11:16:28 2017 - [info]Fri Nov 10 11:16:28 2017 - [info] Log messages from  host_3 ...Fri Nov 10 11:16:28 2017 - [info]Fri Nov 10 11:16:28 2017 - [info] Fetching binary logs from binlog server  host_3..Fri Nov 10 11:16:28 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000008  --start_pos=4265 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171110111238.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.binFailed to save binary log: Binlog not found from /data/mysql.bin! If you got this error at MHA Manager, please set "master_binlog_dir=/path/to/binlog_directory_of_the_master" correctly in the MHA Manager's configuration file and try again. at /usr/bin/save_binary_logs line 123    eval {...} called at /usr/bin/save_binary_logs line 70    main::main() called at /usr/bin/save_binary_logs line 66Fri Nov 10 11:16:28 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?Fri Nov 10 11:16:28 2017 - [info] End of log messages from  host_3.Fri Nov 10 11:16:28 2017 - [warning] Got error from  host_3.Fri Nov 10 11:16:28 2017 - [info] Getting new master's binlog name and position..Fri Nov 10 11:16:28 2017 - [info]   host_1.000058:4059Fri Nov 10 11:16:28 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=' host_1', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';Fri Nov 10 11:16:28 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set:  host_1.000058, 4059, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446388Fri Nov 10 11:16:28 2017 - [info] Executing master IP activate script:Fri Nov 10 11:16:28 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host= host_2 --orig_master_ip= host_2 --orig_master_port=3306 --new_master_host= host_1 --new_master_ip= host_1 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'Unknown option: new_master_userUnknown option: new_master_password===================    swift vip :  tgw_vip to  host_1  is added  ==============================Fri Nov 10 11:16:30 2017 - [info]  OK.Fri Nov 10 11:16:30 2017 - [info] ** Finished master recovery successfully.Fri Nov 10 11:16:30 2017 - [info] * Phase 3: Master Recovery Phase completed.Fri Nov 10 11:16:30 2017 - [info]Fri Nov 10 11:16:30 2017 - [info] * Phase 4: Slaves Recovery Phase..Fri Nov 10 11:16:30 2017 - [info]Fri Nov 10 11:16:30 2017 - [info]Fri Nov 10 11:16:30 2017 - [info] * Phase 4.1: Starting Slaves in parallel..Fri Nov 10 11:16:30 2017 - [info]Fri Nov 10 11:16:30 2017 - [info] -- Slave recovery on host  host_3( host_3:3306) started, pid: 46878. Check tmp log /var/log/masterha/mha_test/ host_3_3306_20171110111238.log if it takes time..Fri Nov 10 11:16:31 2017 - [info]Fri Nov 10 11:16:31 2017 - [info] Log messages from  host_3 ...Fri Nov 10 11:16:31 2017 - [info]Fri Nov 10 11:16:30 2017 - [info]  Resetting slave  host_3( host_3:3306) and starting replication from the new master  host_1( host_1:3306)..Fri Nov 10 11:16:30 2017 - [info]  Executed CHANGE MASTER.Fri Nov 10 11:16:31 2017 - [info]  Slave started.Fri Nov 10 11:16:31 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446388) completed on  host_3( host_3:3306). Executed 0 events.Fri Nov 10 11:16:31 2017 - [info] End of log messages from  host_3.Fri Nov 10 11:16:31 2017 - [info] -- Slave on host  host_3( host_3:3306) started.Fri Nov 10 11:16:31 2017 - [info] All new slave servers recovered successfully.Fri Nov 10 11:16:31 2017 - [info]Fri Nov 10 11:16:31 2017 - [info] * Phase 5: New master cleanup phase..Fri Nov 10 11:16:31 2017 - [info]Fri Nov 10 11:16:31 2017 - [info] Resetting slave info on the new master..Fri Nov 10 11:16:31 2017 - [info]   host_1: Resetting slave info succeeded.Fri Nov 10 11:16:31 2017 - [info] Master failover to  host_1( host_1:3306) completed successfully.Fri Nov 10 11:16:31 2017 - [info]----- Failover Report -----bak_mha_test: MySQL Master failover  host_2( host_2:3306) to  host_1( host_1:3306) succeededMaster  host_2( host_2:3306) is down!Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.Started automated(non-interactive) failover.Invalidated master IP address on  host_2( host_2:3306)Selected  host_1( host_1:3306) as a new master. host_1( host_1:3306): OK: Applying all logs succeeded. host_1( host_1:3306): OK: Activated master IP address. host_3( host_3:3306): OK: Slave started, replicating from  host_1( host_1:3306) host_1( host_1:3306): Resetting slave info succeeded.Master failover to  host_1( host_1:3306) completed successfully.Fri Nov 10 11:16:31 2017 - [info] Sending mail..### 最后一步很重要如果dead master之后又活过来了，那么这一步要做dead_master> /usr/local/realserver/RS_TUNL0/etc/setup_rs.sh -chttp://gitlab.corp.anjuke.com/_dba/architecture/blob/master/personal/Keithlan/other/share/tools/always_used_command.md  ==》 tgw章节详细描述

结论：由于master 已挂，然而最后的日志没有传递到其他服务器，所以会丢失master没有传递过来的事务日志

好在，slave和etl之间会互相change master，所以尽管slave（candidate master）的日志落后，最终也还是用etl的日志补齐了slave缺失的日志。

2.2.2 当master的所有日志已经传递到1个etl，这时候master server挂了

测试省略，和2.2.1基本一样

结论：由于master上的所有日志全部传递到etl，所以最后是不会丢失master上任何数据的。

2.3 slave(候选master)的日志是最新的，比etl要多

2.3.1 当master的部分日志还没传递两个slave，这时候master server挂了

测试省略，和2.2.1基本一样

结论：由于master 已挂，然而最后的日志没有传递到其他服务器，所以会丢失master没有传递过来的事务日志

好在，slave和etl之间会互相change master，所以尽管slave（candidate master）的日志落后，最终也还是用etl的日志补齐了slave缺失的日志。

2.3.2 当master的所有日志已经传递slave，这时候master server挂了

测试省略，和2.2.1基本一样

结论：由于master上的所有日志全部传递到slave，所以最后是不会丢失master上任何数据的。

2.4 slave(候选master）上面有大事务在跑

1000s的大查询

同1.4结论

flush tables with readlock

同1.4结论

2.5 binlog server 不同场景的测试

dead_master上的最后部分日志没有传递到slave和etl的情况, 然而slave的日志也落后etl （这是最严苛的情况）

binlog server 写3台

2.2.1 测试的就是这种情况，详细日志切换请看2.2.1

结论：由于binlog server配置了3台，但是由于master server已经挂掉，无法从master的binlog server上获取日志，所以会丢失master上没有传递的日志事务

binlog server 只写master

### 3台DB的gtid 状态* master  host_1dba:lc> show master status;+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+| File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                        |+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+|  host_1.000058 |     8517 |              |                  | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-60,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392 |+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+1 row in set (0.00 sec)* slave  host_2           Retrieved_Gtid_Set:            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392                Auto_Position: 1* etl  host_3           Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:51-55,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446389-446392            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-55,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392                Auto_Position: 1### 模拟故障master> iptables -A INPUT -p tcp -s other_host --dport 22 -j ACCEPTmaster> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP### 故障切换masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host= host_1  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_errorFri Nov 10 14:15:51 2017 - [info] MHA::MasterFailover version 0.56.Fri Nov 10 14:15:51 2017 - [info] Starting master failover.Fri Nov 10 14:15:51 2017 - [info]Fri Nov 10 14:15:51 2017 - [info] * Phase 1: Configuration Check Phase..Fri Nov 10 14:15:51 2017 - [info]Fri Nov 10 14:16:41 2017 - [warning] HealthCheck: Got timeout on checking SSH connection to  host_1! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.Fri Nov 10 14:16:41 2017 - [warning] Failed to SSH to binlog server  host_1Fri Nov 10 14:16:41 2017 - [error][/usr/share/perl5/vendor_perl/MHA/ServerManager.pm, ln239] Binlog Server is defined but there is no alive server.Fri Nov 10 14:16:41 2017 - [error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln177] Got ERROR:  at /usr/share/perl5/vendor_perl/MHA/MasterFailover.pm line 2082

结论： binlog server 必须要配置一个活的 server，如果只配置master，如果master挂了，那么就等于一个都没有，MHA不会切换

binlog server 只写slave

### 3台DB的gtid 状态* master  host_1dba:lc> show master status;+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+| File                | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                        |+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+|  host_1.000058 |     8517 |              |                  | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-60,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392 |+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+1 row in set (0.00 sec)* slave  host_2           Retrieved_Gtid_Set:            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392                Auto_Position: 1* etl  host_3           Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:51-55,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446389-446392            Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-55,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392                Auto_Position: 1### 模拟故障master> iptables -A INPUT -p tcp -s other_host --dport 22 -j ACCEPTmaster> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP### 故障切换Fri Nov 10 14:29:50 2017 - [info] MHA::MasterFailover version 0.56.Fri Nov 10 14:29:50 2017 - [info] Starting master failover.Fri Nov 10 14:29:50 2017 - [info]Fri Nov 10 14:29:50 2017 - [info] * Phase 1: Configuration Check Phase..Fri Nov 10 14:29:50 2017 - [info]Fri Nov 10 14:29:50 2017 - [info] HealthCheck: SSH to  host_2 is reachable.Fri Nov 10 14:29:50 2017 - [info] Binlog server  host_2 is reachable.Fri Nov 10 14:29:50 2017 - [info] HealthCheck: SSH to  host_3 is reachable.Fri Nov 10 14:29:50 2017 - [info] Binlog server  host_3 is reachable.Fri Nov 10 14:29:50 2017 - [warning] SQL Thread is stopped(no error) on  host_2( host_2:3306)Fri Nov 10 14:29:50 2017 - [warning] SQL Thread is stopped(no error) on  host_3( host_3:3306)Fri Nov 10 14:29:50 2017 - [info] GTID failover mode = 1Fri Nov 10 14:29:50 2017 - [info] Dead Servers:Fri Nov 10 14:29:50 2017 - [info]    host_1( host_1:3306)Fri Nov 10 14:29:50 2017 - [info] Checking master reachability via MySQL(double check)...Fri Nov 10 14:29:51 2017 - [info]  ok.Fri Nov 10 14:29:51 2017 - [info] Alive Servers:Fri Nov 10 14:29:51 2017 - [info]    host_2( host_2:3306)Fri Nov 10 14:29:51 2017 - [info]    host_3( host_3:3306)Fri Nov 10 14:29:51 2017 - [info] Alive Slaves:Fri Nov 10 14:29:51 2017 - [info]    host_2( host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledFri Nov 10 14:29:51 2017 - [info]     GTID ONFri Nov 10 14:29:51 2017 - [info]     Replicating from  host_1( host_1:3306)Fri Nov 10 14:29:51 2017 - [info]     Primary candidate for the new Master (candidate_master is set)Fri Nov 10 14:29:51 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledFri Nov 10 14:29:51 2017 - [info]     GTID ONFri Nov 10 14:29:51 2017 - [info]     Replicating from  host_1( host_1:3306)Fri Nov 10 14:29:51 2017 - [info]     Not candidate for the new Master (no_master is set)Fri Nov 10 14:29:51 2017 - [info]  Starting SQL thread on  host_2( host_2:3306) ..Fri Nov 10 14:29:51 2017 - [info]   done.Fri Nov 10 14:29:51 2017 - [info]  Starting SQL thread on  host_3( host_3:3306) ..Fri Nov 10 14:29:52 2017 - [info]   done.Fri Nov 10 14:29:52 2017 - [info] Starting GTID based failover.Fri Nov 10 14:29:52 2017 - [info]Fri Nov 10 14:29:52 2017 - [info] ** Phase 1: Configuration Check Phase completed.Fri Nov 10 14:29:52 2017 - [info]Fri Nov 10 14:29:52 2017 - [info] * Phase 2: Dead Master Shutdown Phase..Fri Nov 10 14:29:52 2017 - [info]Fri Nov 10 14:30:42 2017 - [warning] HealthCheck: Got timeout on checking SSH connection to  host_1! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.Fri Nov 10 14:30:42 2017 - [info] Forcing shutdown so that applications never connect to the current master..Fri Nov 10 14:30:42 2017 - [info] Executing master IP deactivation script:Fri Nov 10 14:30:42 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host= host_1 --orig_master_ip= host_1 --orig_master_port=3306 --command=stopssh: connect to host  host_1 port 22: Connection timed out===================    swift vip :  tgw_vip from  host_1 is deleted  ==============================--2017-11-10 14:30:49--  http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi正在连接 tgw_server:80... 已连接。已发出 HTTP 请求，正在等待回应... 200 OK长度：未指定 [text/html]正在保存至: “STDOUT”     0K                                                        12.1M=0s2017-11-10 14:32:47 (12.1 MB/s) - 已写入标准输出 [38]Fri Nov 10 14:32:47 2017 - [info]  done.Fri Nov 10 14:32:47 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.Fri Nov 10 14:32:47 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.Fri Nov 10 14:32:47 2017 - [info]Fri Nov 10 14:32:47 2017 - [info] * Phase 3: Master Recovery Phase..Fri Nov 10 14:32:47 2017 - [info]Fri Nov 10 14:32:47 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..Fri Nov 10 14:32:47 2017 - [info]Fri Nov 10 14:32:47 2017 - [info] The latest binary log file/position on all slaves is  host_1.000058:6912Fri Nov 10 14:32:47 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:51-55,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446389-446392Fri Nov 10 14:32:47 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):Fri Nov 10 14:32:47 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledFri Nov 10 14:32:47 2017 - [info]     GTID ONFri Nov 10 14:32:47 2017 - [info]     Replicating from  host_1( host_1:3306)Fri Nov 10 14:32:47 2017 - [info]     Not candidate for the new Master (no_master is set)Fri Nov 10 14:32:47 2017 - [info] The oldest binary log file/position on all slaves is  host_1.000058:5307Fri Nov 10 14:32:47 2017 - [info] Oldest slaves:Fri Nov 10 14:32:47 2017 - [info]    host_2( host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledFri Nov 10 14:32:47 2017 - [info]     GTID ONFri Nov 10 14:32:47 2017 - [info]     Replicating from  host_1( host_1:3306)Fri Nov 10 14:32:47 2017 - [info]     Primary candidate for the new Master (candidate_master is set)Fri Nov 10 14:32:47 2017 - [info]Fri Nov 10 14:32:47 2017 - [info] * Phase 3.3: Determining New Master Phase..Fri Nov 10 14:32:47 2017 - [info]Fri Nov 10 14:32:47 2017 - [info] Searching new master from slaves..Fri Nov 10 14:32:47 2017 - [info]  Candidate masters from the configuration file:Fri Nov 10 14:32:47 2017 - [info]    host_2( host_2:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledFri Nov 10 14:32:47 2017 - [info]     GTID ONFri Nov 10 14:32:47 2017 - [info]     Replicating from  host_1( host_1:3306)Fri Nov 10 14:32:47 2017 - [info]     Primary candidate for the new Master (candidate_master is set)Fri Nov 10 14:32:47 2017 - [info]  Non-candidate masters:Fri Nov 10 14:32:47 2017 - [info]    host_3( host_3:3306)  Version=5.7.13-log (oldest major version between slaves) log-bin:enabledFri Nov 10 14:32:47 2017 - [info]     GTID ONFri Nov 10 14:32:47 2017 - [info]     Replicating from  host_1( host_1:3306)Fri Nov 10 14:32:47 2017 - [info]     Not candidate for the new Master (no_master is set)Fri Nov 10 14:32:47 2017 - [info]  Searching from candidate_master slaves which have received the latest relay log events..Fri Nov 10 14:32:47 2017 - [info]   Not found.Fri Nov 10 14:32:47 2017 - [info]  Searching from all candidate_master slaves..Fri Nov 10 14:32:47 2017 - [info] New master is  host_2( host_2:3306)Fri Nov 10 14:32:47 2017 - [info] Starting master failover..Fri Nov 10 14:32:47 2017 - [info]From: host_1( host_1:3306) (current master) +-- host_2( host_2:3306) +-- host_3( host_3:3306)To: host_2( host_2:3306) (new master) +-- host_3( host_3:3306)Fri Nov 10 14:32:47 2017 - [info]Fri Nov 10 14:32:47 2017 - [info] * Phase 3.3: New Master Recovery Phase..Fri Nov 10 14:32:47 2017 - [info]Fri Nov 10 14:32:47 2017 - [info]  Waiting all logs to be applied..Fri Nov 10 14:32:47 2017 - [info]   done.Fri Nov 10 14:32:47 2017 - [info]  Replicating from the latest slave  host_3( host_3:3306) and waiting to apply..Fri Nov 10 14:32:47 2017 - [info]  Waiting all logs to be applied on the latest slave..Fri Nov 10 14:32:47 2017 - [info]  Resetting slave  host_2( host_2:3306) and starting replication from the new master  host_3( host_3:3306)..Fri Nov 10 14:32:47 2017 - [info]  Executed CHANGE MASTER.Fri Nov 10 14:32:48 2017 - [info]  Slave started.Fri Nov 10 14:32:48 2017 - [info]  Waiting to execute all relay logs on  host_2( host_2:3306)..Fri Nov 10 14:32:48 2017 - [info]  master_pos_wait( host_3.000049:42954) completed on  host_2( host_2:3306). Executed 0 events.Fri Nov 10 14:32:48 2017 - [info]   done.Fri Nov 10 14:32:48 2017 - [info]   done.Fri Nov 10 14:32:48 2017 - [info] -- Saving binlog from host  host_2 started, pid: 76664Fri Nov 10 14:32:48 2017 - [info] -- Saving binlog from host  host_3 started, pid: 76665Fri Nov 10 14:32:48 2017 - [info]Fri Nov 10 14:32:48 2017 - [info] Log messages from  host_2 ...Fri Nov 10 14:32:48 2017 - [info]Fri Nov 10 14:32:48 2017 - [info] Fetching binary logs from binlog server  host_2..Fri Nov 10 14:32:48 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_1.000058  --start_pos=6912 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171110142950.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.binFailed to save binary log: Binlog not found from /data/mysql.bin! If you got this error at MHA Manager, please set "master_binlog_dir=/path/to/binlog_directory_of_the_master" correctly in the MHA Manager's configuration file and try again. at /usr/bin/save_binary_logs line 123    eval {...} called at /usr/bin/save_binary_logs line 70    main::main() called at /usr/bin/save_binary_logs line 66Fri Nov 10 14:32:48 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?Fri Nov 10 14:32:48 2017 - [info] End of log messages from  host_2.Fri Nov 10 14:32:48 2017 - [warning] Got error from  host_2.Fri Nov 10 14:32:48 2017 - [info]Fri Nov 10 14:32:48 2017 - [info] Log messages from  host_3 ...Fri Nov 10 14:32:48 2017 - [info]Fri Nov 10 14:32:48 2017 - [info] Fetching binary logs from binlog server  host_3..Fri Nov 10 14:32:48 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_1.000058  --start_pos=6912 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171110142950.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log  --binlog_dir=/data/mysql.binFailed to save binary log: Binlog not found from /data/mysql.bin! If you got this error at MHA Manager, please set "master_binlog_dir=/path/to/binlog_directory_of_the_master" correctly in the MHA Manager's configuration file and try again. at /usr/bin/save_binary_logs line 123    eval {...} called at /usr/bin/save_binary_logs line 70    main::main() called at /usr/bin/save_binary_logs line 66Fri Nov 10 14:32:48 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?Fri Nov 10 14:32:48 2017 - [info] End of log messages from  host_3.Fri Nov 10 14:32:48 2017 - [warning] Got error from  host_3.Fri Nov 10 14:32:48 2017 - [info] Getting new master's binlog name and position..Fri Nov 10 14:32:48 2017 - [info]   host_2.000008:6895Fri Nov 10 14:32:48 2017 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=' host_2', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';Fri Nov 10 14:32:48 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set:  host_2.000008, 6895, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-55,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392Fri Nov 10 14:32:48 2017 - [info] Executing master IP activate script:Fri Nov 10 14:32:48 2017 - [info]   /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host= host_1 --orig_master_ip= host_1 --orig_master_port=3306 --new_master_host= host_2 --new_master_ip= host_2 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'Unknown option: new_master_userUnknown option: new_master_password===================    swift vip :  tgw_vip to  host_2  is added  ==============================Fri Nov 10 14:32:51 2017 - [info]  OK.Fri Nov 10 14:32:51 2017 - [info] ** Finished master recovery successfully.Fri Nov 10 14:32:51 2017 - [info] * Phase 3: Master Recovery Phase completed.Fri Nov 10 14:32:51 2017 - [info]Fri Nov 10 14:32:51 2017 - [info] * Phase 4: Slaves Recovery Phase..Fri Nov 10 14:32:51 2017 - [info]Fri Nov 10 14:32:51 2017 - [info]Fri Nov 10 14:32:51 2017 - [info] * Phase 4.1: Starting Slaves in parallel..Fri Nov 10 14:32:51 2017 - [info]Fri Nov 10 14:32:51 2017 - [info] -- Slave recovery on host  host_3( host_3:3306) started, pid: 80398. Check tmp log /var/log/masterha/mha_test/ host_3_3306_20171110142950.log if it takes time..Fri Nov 10 14:32:52 2017 - [info]Fri Nov 10 14:32:52 2017 - [info] Log messages from  host_3 ...Fri Nov 10 14:32:52 2017 - [info]Fri Nov 10 14:32:51 2017 - [info]  Resetting slave  host_3( host_3:3306) and starting replication from the new master  host_2( host_2:3306)..Fri Nov 10 14:32:51 2017 - [info]  Executed CHANGE MASTER.Fri Nov 10 14:32:52 2017 - [info]  Slave started.Fri Nov 10 14:32:52 2017 - [info]  gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-55,ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392) completed on  host_3( host_3:3306). Executed 0 events.Fri Nov 10 14:32:52 2017 - [info] End of log messages from  host_3.Fri Nov 10 14:32:52 2017 - [info] -- Slave on host  host_3( host_3:3306) started.Fri Nov 10 14:32:52 2017 - [info] All new slave servers recovered successfully.Fri Nov 10 14:32:52 2017 - [info]Fri Nov 10 14:32:52 2017 - [info] * Phase 5: New master cleanup phase..Fri Nov 10 14:32:52 2017 - [info]Fri Nov 10 14:32:52 2017 - [info] Resetting slave info on the new master..Fri Nov 10 14:32:52 2017 - [info]   host_2: Resetting slave info succeeded.Fri Nov 10 14:32:52 2017 - [info] Master failover to  host_2( host_2:3306) completed successfully.Fri Nov 10 14:32:52 2017 - [info]----- Failover Report -----bak_mha_test: MySQL Master failover  host_1( host_1:3306) to  host_2( host_2:3306) succeededMaster  host_1( host_1:3306) is down!Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.Started automated(non-interactive) failover.Invalidated master IP address on  host_1( host_1:3306)Selected  host_2( host_2:3306) as a new master. host_2( host_2:3306): OK: Applying all logs succeeded. host_2( host_2:3306): OK: Activated master IP address. host_3( host_3:3306): OK: Slave started, replicating from  host_2( host_2:3306) host_2( host_2:3306): Resetting slave info succeeded.Master failover to  host_2( host_2:3306) completed successfully.Fri Nov 10 14:32:52 2017 - [info] Sending mail..

结论： binlog server 配置成多台slave，这是正确的方案。由于master 挂了，master没有传递过来的binlog会丢失，这是没办法的. 好在，其余slave自动补齐现有日志

binlog server 啥都不写

会切换成功，由于master 挂了，master没有传递过来的binlog会丢失好在，其余slave自动补齐现有日志

2.6 如果MHA过程中失败，是否可以重新执行MHA的failover呢？

同1.6结论

三、遇到的坑

3.1 交互模式下，如果没有及时敲'YES'，则终止切换

四、总结

MHA + GTID 模式，重点配置和用法如下：

1. commandmasterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf  --dead_master_host= host_1  --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error2. binlog server在配置文件中对 master，slave，etl 都写在binlog server中。对MySQL down 和 DB server down  综合考虑下，建议这样配置。3. tgw 清理dead master 如果还可以起来，那么必须在上面执行： /usr/local/realserver/RS_TUNL0/etc/setup_rs.sh -c原因可参看：http://gitlab.corp.anjuke.com/_dba/architecture/blob/master/personal/Keithlan/other/share/tools/always_used_command.md ==> TGW 章节

五、流程简介

* Phase 1: Configuration Check Phase.. HealthCheck: SSH N台DB是否reachable Binlog server： N台DB 是否reachable GTID failover mode = ？ Dead Servers is ？ Primary candidate for the new Master (candidate_master is set) ？* Phase 2: Dead Master Shutdown Phase.. Executing master IP deactivation script： TGW-vip delete操作 shutdown_script： ？* Phase 3: Master Recovery Phase.. * Phase 3.1: Getting Latest Slaves Phase.. Latest slaves ，file position ？ Oldest slaves ， file position ？ * Phase 3.3: Determining New Master Phase.. 选择哪个slave为new master * Phase 3.3: New Master Recovery Phase.. Replicating from the latest slave and waiting to apply.. --让new master change master 到 latest slave Waiting all logs to be applied on the latest slave.. --让new master跟 latest slave的日志保持一致 Saving binlog from binlog server。。。 --根据配置的binlog server，开始生成latest slave和dead master之间的diff日志 Applying differential binlog --apply这些差异日志到new master，让new master执行完所有缺失的日志 Getting new master's binlog name and position.. --获取new master现在的binlog file pos All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST, MASTER_PORT=3306, MASTER_AUTO_POSITION=1 Executing master IP activate script: --TGW-vip 激活操作，并且设置readonly=0* Phase 4: Slaves Recovery Phase.. (并行操作) Resetting slave and starting replication from the new master* Phase 5: New master cleanup phase.. Resetting slave info on the new master.