MySQL High Availability- P9

Chia sẻ: Thanh Cong | Ngày: | Loại File: PDF | Số trang:50

lượt xem

MySQL High Availability- P9

Mô tả tài liệu
  Download Vui lòng tải xuống để xem tài liệu đầy đủ

MySQL High Availability- P9: A lot of research has been done on replication, but most of the resulting concepts are never put into production. In contrast, MySQL replication is widely deployed but has never been adequately explained. This book changes that. Things are explained here that were previously limited to people willing to read a lot of source code and spend a lot of time debugging it in production, including a few late-night sessions.

Chủ đề:

Nội dung Text: MySQL High Availability- P9

  1. Administrator. We will look at the MySQL Administrator in “Replication Monitoring with MySQL Administrator” on page 381. Monitoring Commands for the Slave The SHOW SLAVE STATUS command displays information about the slave’s binary log, its connection to the server, and replication activity, including the name and offset position of the current binlog file. This information is vital in diagnosing slave performance, as we have seen in previous chapters. Example 10-5 shows the result of a typical SHOW SLAVE STATUS command executed on a server running MySQL version 5.5. Example 10-5. The SHOW SLAVE STATUS command mysql> SHOW SLAVE STATUS \G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: localhost Master_User: rpl Master_Port: 3306 Connect_Retry: 60 Master_Log_File: mysql-bin.000002 Read_Master_Log_Pos: 39016226 Relay_Log_File: relay-bin.000004 Relay_Log_Pos: 9353715 Relay_Master_Log_File: mysql-bin.000002 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 25263417 Relay_Log_Space: 39016668 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: 66 Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: Monitoring Slaves | 377 Please purchase PDF Split-Merge on to remove this watermark.
  2. Replicate_Ignore_Server_Ids: Master_Server_Id: 1 1 row in set (0.00 sec) There is a lot of information here. This command is the most important command for replication. It is a good idea to study the details of each item presented. Rather than listing the information item by item, we present the information from the perspective of an administrator. That is, the information is normally inspected with a specific goal in mind. Thus, we group the information into categories for easier reference. These categories include master connection information, slave performance, log information, filtering, log performance, and error conditions. The most important piece of information is the first column. This tells you the current status of the I/O thread. It presents one of several states: connecting to the master, waiting for events from the master, reconnecting to the master, etc. The information displayed about the master connection includes the current hostname of the master, the user account used to connect, and the port the slave is connected to on the master. Toward the bottom of the listing is the SSL connection information (if you are using an SSL connection). The next category includes information about the binary log on the master and the relay log on the slave. The filename and position of each are displayed. It is important to note these values whenever you diagnose replication problems. Of particular note is Relay_Master_Log_File, which shows the filename of the master binary log where the most recent event from the relay log has been executed. Replication filtering configuration lists all of the slave-side replication filters. Check here if you are uncertain how your filters are set up. Also included is the last error number and text for the slave and the I/O and SQL threads. Beyond the state values for the slave threads, this information is most often examined when there is an error. It can be helpful to check this information first when encountering errors on the slave, before examining the error log, as this information is the most current and normally gives you the reason for the failure. There is also information about the configuration of the slave, including the settings for the skip counter and the until conditions. See the online MySQL Reference Man ual for more information about these fields. Near the bottom of the list is the current error information. This includes errors for the slave’s I/O and SQL threads. These values should always be 0 for a properly functioning slave. Some of the more important performance columns are discussed in more detail here: 378 | Chapter 10: Replication Monitoring Please purchase PDF Split-Merge on to remove this watermark.
  3. Connect_Retry The number of seconds that expire between retry connect attempts. This value should always be low, but you may want to set it higher if you have a case where the slave is having issues connecting to the master. Exec_Master_Log_Pos This shows the position of the last event executed from the master’s binary log. Relay_Log_Space The total size of all of the relay logfiles. You can use this to determine if you need to purge the relay logs in the event you are running low on disk space. Seconds_Behind_Master The number of seconds between the time an event was executed and the time the event was written in the master’s binary log. A high value here can indicate signif- icant replication lag. We discuss replication lag in an upcoming section. The value for Seconds_Behind_Master could become stale when replication stops due to network failures, loss of heartbeat from the master, etc. It is most meaningful when replication is running. If your slave has binary logging enabled, the SHOW BINARY LOGS command displays the list of binlog files available on the slave and their sizes in bytes. Example 10-6 shows the results of a typical SHOW BINARY LOGS command. Example 10-6. The SHOW BINARY LOGS command on the slave mysql> SHOW BINARY LOGS; +------------------+-----------+ | Log_name | File_size | +------------------+-----------+ | slave-bin.000001 | 5151604 | | slave-bin.000002 | 1030108 | | slave-bin.000003 | 1030044 | +------------------+-----------+ 3 rows in set (0.00 sec) You can rotate the relay log on the slave with the FLUSH LOGS command. You can also use the SHOW BINLOG EVENTS command to show events in the binary log on the slave if the slave has binary logging enabled. The difference between showing events on the slave and showing them on the master is you want to specify the binlog filename on the slave as shown in the SHOW BINARY LOGS output. Example 10-7 shows the binlog events from a typical replication configuration. Monitoring Slaves | 379 Please purchase PDF Split-Merge on to remove this watermark.
  4. Example 10-7. The SHOW BINLOG EVENTS command (statement-based) mysql> SHOW BINLOG EVENTS IN 'slave-bin.000001' FROM 2701 LIMIT 2 \G *************************** 1. row *************************** Log_name: slave-bin.000001 Pos: 2701 Event_type: Query Server_id: 1 End_log_pos: 3098 Info: use `employees`; CREATE TABLE salaries ( emp_no INT NOT NULL, salary INT NOT NULL, from_date DATE NOT NULL, to_date DATE NOT NULL, KEY (emp_no), FOREIGN KEY (emp_no) REFERENCES employees (emp_no) ON DELETE CASCADE, PRIMARY KEY (emp_no, from_date) ) *************************** 2. row *************************** Log_name: slave-bin.000001 Pos: 3098 Event_type: Query Server_id: 1 End_log_pos: 3405 Info: use `employees`; INSERT INTO `departments` VALUES ('d001','Marketing'),('d002','Finance'), ('d003','Human Resources'),('d004','Production'), ('d005','Development'),('d006','Quality Management'), ('d007','Sales'),('d008','Research'), ('d009','Customer Service') 2 rows in set (0.01 sec) In MySQL versions 5.5 and later, you can also inspect the slave’s relay log with SHOW RELAYLOG EVENTS. Slave Status Variables There are only a few status variables for monitoring the slave. These include counters that indicate how many times a slave-related command was issued on the master and statistics for key slave operations. The first four listed here are simply counters of the various slave-related commands. The values should correspond with the frequency of the maintenance of your slaves. If they do not, you may want to investigate the possi- bility that there are more slaves in your topology than you expected or that a particular slave is being restarted too frequently. 380 | Chapter 10: Replication Monitoring Please purchase PDF Split-Merge on to remove this watermark.
  5. Com_show_slave_hosts The number of times the SHOW SLAVE HOSTS command was issued. Com_show_slave_status The number of times the SHOW SLAVE STATUS command was issued. Com_slave_start The number of times the SLAVE START command was issued. Com_slave_stop The number of times the SLAVE STOP command was issued. Slave_heartbeat_period The current configuration for the number of seconds that elapse between heartbeat checks of the master. Slave_open_temp_tables The number of temporary tables the slave’s SQL thread is using. A high value can indicate the slave is overburdened. Slave_received_heartbeats The count of heartbeat replies from the master. This value should correspond roughly to the elapsed time since the slave was restarted divided by the heartbeat interval. Slave_retried_transactions The number of times the SQL thread has retried transactions since the slave was started. Slave_running Simply displays ON if the slave is connected to the master and the I/O and SQL threads are executing without error. Replication Monitoring with MySQL Administrator You have seen how you can use the MySQL Administrator to monitor network traffic and storage engines. It also has a simple display for monitoring the master and slave in a replication topology. You can view basic information about replication on the Rep- lication Status tab. However, to get the most out of this information, you should start your slaves with the --report_host startup option, providing a unique name for each slave. Figure 10-1 shows the MySQL Administrator running on a master with one connected slave. If there were slaves connected without the --report_host option, they would be omitted from the list. If you run the MySQL Administrator on a slave, you will only see the slave’s informa- tion. Figure 10-2 shows the MySQL Administrator running on the slave. Replication Monitoring with MySQL Administrator | 381 Please purchase PDF Split-Merge on to remove this watermark.
  6. Figure 10-1. The MySQL Administrator running on the master Figure 10-2. The MySQL Administrator running on the slave 382 | Chapter 10: Replication Monitoring Please purchase PDF Split-Merge on to remove this watermark.
  7. In Figures 10-1 and 10-2, the information displayed includes the hostname, server ID, port, kind (master or slave), a general status, the logfile (binlog filename), and the current log position. Figure 10-1 shows the replication topology listing all of the con- nected slaves. This report can be handy when you want to get an at-a-glance status of your servers. Other Items to Consider This section discusses some additional considerations for monitoring replication. It includes special networking considerations and monitoring lag (delays in replication). Networking If you have limited networking bandwidth, high contention for the bandwidth, or simply a very slow connection, you can improve replication performance by using compression. You can configure compression using the slave_compressed_protocol variable. In cases where network bandwidth is not a problem but you have data that you want to protect while in transit from the master to the slaves, you can use an SSL connection. You can configure the SSL connection using the CHANGE MASTER command. See the sec- tion titled “Setting Up Replication Using SSL” in the online MySQL Reference Man ual for details on using SSL connections in replication. Another networking configuration you may want to consider is using master heart- beats. You have seen where this information is shown on the SHOW SLAVE STATUS com- mand. A heartbeat is a mechanism to automatically check connection status between a master and a slave. It can detect levels of connectivity in milliseconds. Master heart- beat is used in replication scenarios where the slave must be kept in sync with the master with little or no delay. Having the capability to detect when a threshold expires ensures the delay is identified before replication is halted on the slave. You can configure master heartbeat using a parameter in the CHANGE MASTER command with the master_heartbeat_period= setting (added in MySQL version 5.4.4), where the value is the number of seconds at which you want the heartbeat to occur. You can monitor the status of the heartbeat with the following commands: SHOW STATUS like 'slave_heartbeat period' SHOW STATUS like 'slave_received_heartbeats' Monitor and Manage Slave Lag Periods of massive updates, overburdened slaves, or other significant network per- formance events can cause your slaves to lag behind the master. When this happens, the slaves are not processing the events in their relay logs fast enough to keep up with the changes sent from the master. Other Items to Consider | 383 Please purchase PDF Split-Merge on to remove this watermark.
  8. As you saw with the SHOW SLAVE STATUS command, Seconds_Behind_Master can show indications that the slave is running behind the master. This field tells you by how many seconds the slave’s SQL thread is behind the slave’s I/O thread—that is, how far behind the slave is in processing the incoming events from the master. The slave uses the time- stamps of the events to calculate this value. When the SQL thread on the slave reads an event from the master, it calculates the difference in the timestamp. The following excerpt shows a condition in which the slave is 146 seconds behind the master. In this case, the slave is more than two minutes behind; this can be a problem if your appli- cation is relying on the slaves to provide timely information. mysql> SHOW SLAVE STATUS \G ... Seconds_Behind_Master: 146 ... The SHOW PROCESSLIST command (run on the slave) can also provide an indication of how far behind the slave is. Here, we see the number of seconds that the SQL thread is behind, measured using the difference between the timestamp of the last replicated event and the real time of the slave. For example, if your slaves have been offline for 30 minutes and have reconnected to the master, you would expect to see a value of ap- proximately 1,800 seconds in the Time field of the SHOW PROCESSLIST results. The excerpt below shows this condition. Large values in this field are indicative of significant delays that can result in stale data on the slaves. mysql> SHOW PROCESSLIST \G ... Time: 1814 ... Depending on how your replication topology is designed, you may be replicating data for load balancing. In this case, you typically use multiple slaves, directing a portion of the application or users to the slaves for SELECT queries, thereby reducing the burden on the master. Causes and Cures for Slave Lag Slave lag can be a nuisance for some replication users. The main reason for lag is the single-threaded nature of the slave (actually, there are two threads, but only one exe- cutes events and this is the main culprit in slave lag). For example, a master with a multiple-core CPU can run multiple transactions in parallel and will be faster than a slave that is executing transactions (events from the binary log) in a single thread. We have already discussed some ways to detect slave lag. In this section, we discuss some common causes and solutions for reducing slave lag. There are several causes for slave lag (e.g., network latency). It is possible the slave I/O thread is delayed in reading events from the logs. The most common reason for slave lag is simply that the slave has a single thread to execute all events, whereas the master has potentially many threads executing in parallel. Some other causes include 384 | Chapter 10: Replication Monitoring Please purchase PDF Split-Merge on to remove this watermark.
  9. long-running queries with inefficient joins, I/O-bound reads from disk, lock conten- tion, and InnoDB thread concurrency issues. Now that you know more about what causes slave lag, let us examine some things you can do to minimize it: Organize your data You can see performance improvements by normalizing your data and by using sharding to distribute your data. This helps eliminate duplication of data, but as you saw in Chapter 8, duplication of some data (such as lookup text) can actually improve performance. The idea here is to use just enough normalization and sharding to improve performance without going too far. This is something only you, the owner of the data, can determine either through experience or experi- mentation. Divide and conquer We know that adding more slaves to handle the queries (scale-out) is a good way to improve performance, but not scaling out enough could still result in slave lag if the slaves are processing a much greater number of queries. In extreme cases, you can see slave lag on all of the slaves. To combat this, consider segregating your data using replication filtering to replicate different databases among your slaves. You can still use scale-out, but in this case you use an intermediary slave for each group of databases you filter, then scale from there. Identify long-running queries and refactor them If long-running queries are the source of slave lag, consider refactoring the query or the operation or application to issue shorter queries or more compact transac- tions. However, if you use this technique combined with replication filtering, you must use care when issuing transactions that span the replication filter groups. Once you divide a long-running query that should be an atomic operation (a trans- action) across slaves, you run the risk of causing data integrity problems. Load balancing You can also use load balancing to redirect your queries to different slaves. This may reduce the amount of time each slave is spending answering queries, thereby leaving more computational time to process replication events. Ensure you are using the latest hardware Clearly, having the best hardware for the job normally equates to better perform- ance. At the very least, you should ensure your slave servers are configured to their optimal hardware capabilities and are at least as powerful as the master. Reduce lock contention Table locks for MyISAM and row-level locks for InnoDB can cause slave lag. If you have queries that result in a lot of locks on MyISAM or InnoDB tables, consider refactoring the queries to avoid as many locks as possible. Other Items to Consider | 385 Please purchase PDF Split-Merge on to remove this watermark.
  10. Conclusion This chapter concludes our discussion of the many ways you can monitor MySQL, and provides a foundation for you to implement your own schedules for monitoring virtu- ally every aspect of the MySQL server. Now that you know the basics of operating system monitoring, database performance, and MySQL monitoring and benchmarking, you have the tools and knowledge to suc- cessfully tune your server for optimal performance. Joel smiled as he compiled his report about the replication issue. He paused and glanced at his doorway. He could almost sense it coming. “Joel!” Joel jumped, unable to believe his prediction. “I’ve got the replication problem solved, sir,” he said quickly. “Great! Send me the details when you get a moment.” “I also discovered some interesting things about the order processing system.” He no- ticed Mr. Summerson’s eyebrow raise slightly in anticipation. Joel continued, “It seems we have sized the buffer pool incorrectly. I think I can make some improvements in that area as well.” Mr. Summerson said, “Monitoring again?” “Yes, sir. I’ve got some reports on the InnoDB storage engine. I’ll include that in my email, too.” “Good work. Good work indeed.” Joel knew that look. His boss was thinking again, and that always led to more work. Joel was surprised when his boss simply walked away slowly. “Well, it seems I finally stumped him.” 386 | Chapter 10: Replication Monitoring Please purchase PDF Split-Merge on to remove this watermark.
  11. CHAPTER 11 Replication Troubleshooting The message subject was simply “Fix the Seattle server.” Joel knew such cryptic subject lines came from only one person. A quick scan of the message header confirmed the email was from Mr. Summerson. Joel opened the message and read the contents. “The Seattle server is acting up again. I think the replication thingy is hosed. Make this your top priority.” “OK,” Joel muttered to himself. Because the monitoring reports he had produced last week showed no anomalies and he was sure the replication setup was correct the last time he checked, Joel wasn’t sure how to attack the problem. But he knew where to find the answers. “It looks like I need to read that replication troubleshooting chapter after all.” A familiar head appeared in his doorway. Joel decided to perform a preemptive ma- neuver by saying, “I’m on it.” This resulted in a nod and a casual salute as his boss continued down the hall. MySQL replication is usually trouble-free and rarely needs tuning or tweaking once the topology is active and properly configured. However, there are times when things can go wrong. Sometimes an error is manifested, and you have clear evidence with which to start your investigations. Other times the condition or problem is easily understood, but the causes of the more difficult problems that can arise are not so obvious. Fortu- nately, you can resolve these problems if you follow some simple guidelines and prac- tices for troubleshooting replication. This chapter presents these ideas by focusing on techniques to resolve replication problems. We begin with a description of what can go wrong, then we discuss the basic tools available to help troubleshoot problems, and we conclude with some strategies for solving and preventing replication problems. 387 Please purchase PDF Split-Merge on to remove this watermark.
  12. Troubleshooting replication problems involving the MySQL Cluster follows the same procedures presented in this chapter. If you are having problems with MySQL Cluster, see Chapter 15 for troubleshooting cluster failures and startup issues. Seasoned computer users understand that computing systems are prone to occasional failures. Information technology professionals make it part of their creed to prevent failures and ensure reliable access and data to users. However, even properly managed systems can have issues. MySQL replication is no exception. In particular, the slave state is not crash-safe. This means that if the MySQL instance on the slave crashes, it is possible the slave will stop in an undefined state. In the worst case, the relay log or the file could be corrupt. Indeed, the more complex the topology (including load and database complexity) and the more diverse the roles are among the nodes in the topology, the more likely some- thing will go wrong. That doesn’t mean replication cannot scale—on the contrary, you have seen how replication can easily scale to massive replication topologies. What we are saying is that when replication problems occur, they are usually the result of an unexpected action or configuration change. What Can Go Wrong There are many things that can go wrong to disrupt replication. MySQL replication is most susceptible to problems with data, be it data corruption or unintended interrup- tions in the replication stream. System crashes that result in an unsafe and uncontrolled termination of MySQL can also cause replication restarting issues. You should always prepare a backup of your data before changing anything to fix the problem. In some cases the backup will contain data that is corrupt or missing, but the benefits are still valid, specifically, that no matter what you do, you can at least return the data to the state at the time of the error. You’d be surprised how easy it is to make a bad situation worse. In this section, we begin exploring replication troubleshooting by describing the most common failures in MySQL replication. These are some of the more frequently en- countered replication problems. While the list is not complete in the sense that it in- cludes all possible replication problems, it does give you an idea of the types of things that can go wrong. We include a brief statement of some likely causes for each. Problems on the Master While most errors will manifest on the slave, look to this section for potential solutions for problems originating on the master. Administrators sometimes automatically 388 | Chapter 11: Replication Troubleshooting Please purchase PDF Split-Merge on to remove this watermark.
  13. suspect the slave. You should take a look at both the master and the slave when diag- nosing replication problems. Master crashed and memory tables are in use When the master is restarted, any data for memory tables is purged (as is normal for the memory storage engine). However, if a table that uses the memory storage engine (hence, a memory table) is being replicated, the slave may have outdated data if it wasn’t restarted (the server, not the slave). Fortunately, when the first access to the memory table occurs after a restart, a special delete event is sent to the slaves to signal the slaves to purge the data, thereby synchro- nizing the data. However, the interval between when the table is referenced and when the replication event is transmitted can result in the slave having outdated data. To avoid this problem, use a script to first purge the data, then repopulate it on the master at startup using the init_file option. For example, if you have a memory table that stores frequently used data, create a file like the following and reference it with the init_file option: # Force slaves to purge data DELETE FROM db1.mem_zip; # Repopulate the data INSERT INTO ... The first command is a delete query, which will be replicated to the slaves when rep- lication is restarted. Following that are statements to repopulate the data. In this way, you can ensure there is no gap where the slave could have out-of-date information in a memory table. Master crashed and binary log events are missing It is possible for the master to fail and not write recent events to the binary log on disk. That is, if the server crashes before MySQL flushes its binary events cache to disk (in the binary log), those cached events can be lost. This is usually indicated by an error on the slave stating that the binary log offset event is missing or does not exist. In this case, the slave is attempting to reconnect on restart using the last known binlog file and position of the master, and while the binlog file may exist, the offset does not because the events that incremented the offset were not written to disk. Unfortunately, there is no way to retrieve the lost binlog events. To solve this problem, you must check the current binlog position on the master and use this information to tell the slave to start at the next known event on the master. Be sure to check the data on both your master and slave once the slave is synchronized. It is also possible that some of the events that were lost on the master were applied to the data prior to the crash. You should always compare the tables in question on the What Can Go Wrong | 389 Please purchase PDF Split-Merge on to remove this watermark.
  14. master to determine if there are differences between the master and the slave. This situation is rare, but it can cause problems later on if an update for a row is executed on the master against one of these missing events, which then causes a failure when run on the slave. In this case, the slave is attempting to run an update on rows that do not exist. For example, consider a scenario of a fictional, simplified database for an auto dealer where information about cars for sale is stored in tables corresponding to new and used cars. The tables are set up with autoincrement keys. On the master, the following happens: INSERT INTO auto.used_cars VALUES (2004, 'Porsche', 'Cayman', 23100, 'blue'); A crash occurs after the following statement is executed but before it is written to the binary log: UPDATE auto.used_cars SET color = 'white' WHERE id = 17; In this case, the update query was lost during the crash on the master. When the slave attempts to restart, an error is generated. You can resolve the problem using the sug- gestion just shown. A check on the number of rows on the master and slave shows the same row count. Notice the update that corrected the color of the 2004 Porsche to white instead of blue. Now consider what will happen when a salesperson tries to help a customer find the blue Porsche of her dreams by executing this query on the slave: SELECT * FROM auto.used_cars WHERE make = 'Porsche' AND model = 'Cayman' AND color = 'blue'; Will the salesperson who runs the query discover he has a blue Porsche Cayman for sale? A good auto salesperson always ensures he has the car on the lot by visual in- spection, but for argument’s sake let us assume he is too busy to do so and tells his customer he has the car of her dreams. Imagine his embarrassment (and loss of a sale) when his customer arrives to test-drive the car only to discover that it is white. To prevent loss of data should the master crash, turn on sync_binlog (set to 1) at startup or in your configuration file. This will tell the master to flush an event to the binary log immediately. While this may cause a noticeable performance drop for InnoDB, the protection afforded could be great if you cannot afford to lose any changes to the data (but you may lose the last event, depending on when the crash occurred). While this academic example may not seem too bad, consider the possibilities of a missing update to a medical database or a database that contains scientific data. Clearly, a missing update, even a seemingly simple one, can cause problems for your users. Indeed, the above scenario can be considered a form of data corruption. Always check the contents of your tables when encountering this problem. In this case, crash recovery 390 | Chapter 11: Replication Troubleshooting Please purchase PDF Split-Merge on to remove this watermark.
  15. ensures the binary log and InnoDB are consistent when sync_binlog=1, but it otherwise has no effect for MyISAM tables. Query runs fine on the master but not on the slave While not strictly a problem on the master, it is sometimes possible that a query (e.g., an update or insert command) will run properly on the master but not on the slave. There are many causes of this type of error, but most point to a referential integrity issue or a configuration problem on the slave or the database. The most common cause of this error is a query referencing a table that does not exist on the slave or that has a different signature (different columns or column types). In this case, you must change the slave to match the server in order to properly execute the query. In some cases, it is possible the query is referencing a table that is not replicated. For example, if you are using any of the replication filtering startup options (a quick check of the master and slave status will confirm this), it is possible that the database the query is referencing is not on the slave. In this situation, you must either adjust your filters accordingly or manually add the missing tables to the missing database on the slave. In other cases, the cause of a failed query can be more complex, such as character set issues, corrupt tables, or even corrupt data. If you confirm your slave is configured the same as your master, you may need to diagnose the query manually. If you cannot correct the problem on the slave, you may need to perform the update manually and tell the slave to skip the event that contains the failed query. To skip an event on a slave, use the sql_slave_skip_counter variable and specify the number of events from the master you want to skip. Sometimes this is the fastest way to restart replication. Table corruption after a crash If your master or slave crashes and, after restarting them both, you find one or more tables are corrupt or find that they are marked as crashed by MyISAM, you will need to fix these problems before restarting replication. You can detect which tables are corrupt by examining the server’s logfiles, looking for errors like the following: ... [ERROR] /usr/bin/mysqld: Table 'db1.t1' is marked as crashed and should be repaired ... You can use the following command to perform optimization and repair in one step to repair all of the tables for a given database (in this case, db1). mysqlcheck -u -p --check --optimize --auto-repair db1 What Can Go Wrong | 391 Please purchase PDF Split-Merge on to remove this watermark.
  16. For MyISAM tables, you can use the myisam-recover option to turn on automatic recovery. There are four modes of recovery. See the online MySQL Reference Manual for more details. Once you have repaired the affected tables, you must also determine if the tables on the slave have been corrupted. This is necessary if the master and slave share the same data center and the failure was environmental (e.g., they were connected to the same power source). Always perform a backup on a table before repairing it. In some cases a repair operation can result in data loss or leave the table in an unknown state. It is also possible that a repair can leave the master and slave out of sync, especially if there is data loss as a result of the repair. You may need to compare the data in the affected table to ensure the master and slave are synchronized. If they are not, you may need to reload the data for the affected table on the slave if the slave is missing data, or copy data from the slave if the master is missing data. Binary log is corrupt on the master If a server crash or disk problem results in a corrupt binary log on the master, you cannot restart replication. There are many causes and types of corruption that can occur in the binary log, but all result in the inability to execute one or more events on the slave, often resulting in errors such as “could not parse relay log event.” In this case, you must carefully examine the binary log for recoverable events and rotate the logs on the master with the FLUSH LOGS command. There may be data loss on the slave as a result and the slave will most definitely fail in this scenario. The best recovery method is to resynchronize the slave with the master using a reliable backup and re- covery tool. In addition to rotating the logs, you can ensure any data loss is minimized and get replication restarted without errors. In some cases, if it is easy to determine how many events were corrupted or missing, it may be possible to skip the corrupted events by using the sql_slave_skip_counter on the slave. You can determine this by comparing the master’s binlog reference on the slave to the current binlog position on the master. Killing long-running queries for nontransactional tables If you are forced to terminate a query that is modifying a nontransactional table, it is possible the query has been replicated to and executed on the slave. When this occurs, it is likely the changes on the master will be different than on the slave. 392 | Chapter 11: Replication Troubleshooting Please purchase PDF Split-Merge on to remove this watermark.
  17. For example, if you terminate a query that updates 400 out of the 600 rows in a table such that only 200 of the 400 changes are complete, it is possible that the slave com- pleted all 400 updates. Thus, whenever you terminate a query that updates data on the master, you need to confirm the change has not executed on the slave and if it has (or even as a precaution), you should resynchronize the data on the slave once you’ve corrected the table on the master. Usually in this case, you will fix the master and then make a backup of the data on the master and restore it on the slave. Problems on the Slave Most problems you will encounter will be the result of some error on the slave. In some situations, like those described in the previous section, it may be a problem that origi- nated on the master, but it almost always will be seen on the slave in one form or another. The following sections list some of the common problems on the slave. Use Binary Logging on the Slave One way to ensure a more robust slave is to turn on binary logging using the log-slave- updates option. This will cause the slave to log the events it executes from its relay log, thereby creating a binary log that you can use to replay events on the slave in the event that the relay log (or the data) becomes corrupt. Slave server crashed and replication won’t start When a slave server crashes, it is usually easy to reestablish replication with the master once you determine the last known good event executed on the slave. You can see this by examining the SHOW SLAVE STATUS output. However, where there are errors regarding account access, it is possible that replication cannot be restarted. This can be the result of authentication problems (e.g., the slave’s replication account was deleted) or corrupted tables on the master or slave(s). In these cases, you are likely to see connection errors in the console and logs for the slave MySQL server. When this occurs, always check the permissions of the replication user on the master. Ensure the proper privileges are granted to the user defined in either your configuration file or on your CHANGE MASTER command. The privileges should be similar to the following: GRANT REPLICATION SLAVE ON *.* TO 'rpl_user'@'%' IDENTIFIED BY 'password_here'; You can change this command to suit your needs as a means to solve this problem. What Can Go Wrong | 393 Please purchase PDF Split-Merge on to remove this watermark.
  18. Slave connection times out and reconnects frequently If you have multiple slaves in your topology and have either not set the server_id option or have the same value for server_id for two or more of your slaves, you may have conflicting server IDs. When this happens, one of the slaves may exhibit frequent time- outs or drop and reconnect sequences. This problem is simply due to the nonunique IDs among your slaves and can be difficult to diagnose (or, we should say, it’s easy to misdiagnose as a connection problem). You should always check the error log of the master and slave for error messages. In this case, it is likely the error will contain the nature of the timeout. To prevent this type of problem, always ensure that all of your servers have a server_id option set either in the configuration file or in the startup command line. Query results are different on the slave than on the master One of the more difficult problems to detect occurs when the query results performed on one or more slaves do not match that of the master. It is possible you may never notice the problem. The problem could be as simple or innocuous as sort order issues, or as severe as missing or extra rows in the result set. The main causes of this type of problem are character set differences between the master and slave. For example, the master can be configured with one character set and col- lation defaults while one or more slaves are configured with another. If your users start complaining of extra or missing rows or differing result orders, you should check the character set setting first on both the master and your slaves. Another possible cause of this problem is using different default storage engines on the master and slave—for example, if you use the MyISAM storage engine on the master and use the InnoDB storage engine on the slave. In this case, it is entirely likely that the query results will be in different orders if you used an ALTER TABLE command that changed the storage engine to one that has a different collation than the master. Perhaps a more subtle cause of this type of problem is when the table definitions differ on the master and slave. It is possible to have differences in which a subset of the columns for a given table is the same and either some initial columns or ending columns (order is important here) are missing on the slave. There are many potential errors when you use this feature, but it can sometimes result in the expectation that the data for some columns is replicated but the slave doesn’t have the columns defined. While having fewer columns on the slave may be desired, a careless user can achieve this accidentally by dropping columns in such a way that replication can still proceed. In some cases, the SELECT queries executed on the slave will fail when referencing the missing columns, thereby giving you a clue to the problem. Other times you can simply be missing data in your applications. 394 | Chapter 11: Replication Troubleshooting Please purchase PDF Split-Merge on to remove this watermark.
  19. A common user error that can result in differences in query results between the master and slave is making other types of changes to the tables or databases executed on the slave but not executed on the master. That is, a user performs some nonreplicated data manipulation on the slave that changes a table signature but does not execute the same on the master. When this occurs, queries can return either the wrong results, wrong columns, wrong order, or extra data, or simply fail due to referencing missing columns. It is always a good precaution to check the layout of a table involved in these types of problems to ensure it is the same on the master and slave. If it is not, resynchronize the table and retry the query. Slave issues errors when attempting to restart with SSL Problems related to SSL connections are typically the usual permission issues described previously. In this case, the privileges granted must also include the REQUIRE SSL option as shown below. Be sure to check that the replication user exists and has the correct privileges. GRANT REPLICATION SLAVE ON *.* TO 'rpl_user'@'%' IDENTIFIED BY 'password_here' REQUIRE SSL; Other issues related to restarting replication when SSL connections are used are missing certificate files or incorrect values for the SSL-related options in the configuration file (e.g., ssl-ca, ssl-cert, and ssl-key) or the related options in the CHANGE MASTER com- mand (e.g., MASTER_SSL_CA, MASTER_SSL_CAPATH, MASTER_SSL_CERT, and MASTER_SSL_KEY). Be sure to check your settings and paths to ensure nothing has changed since the last time replication was started. Memory table data goes missing If one or more of your databases uses the memory storage engine, the data contained in these tables will be lost when a slave server is restarted (the server, not the slave threads). This is expected, as data in memory tables does not survive a restart. The table configuration still exists and the table can be accessed, but the data has been purged. It is possible that when a slave server is restarted, queries directed to the memory table fail (e.g., UPDATE) or query results are inaccurate (e.g., SELECT). Thus, the error may not occur right away and could be as simple as missing rows in a query result. To avoid this problem, you should carefully consider the use of memory tables in your databases. You should not create memory tables on the master to be updated on the slaves via replication without procedures in place to recover the data for the tables in the event of a crash or planned restart of the server. For example, you can execute a script before you start replication that copies the data for the table from the master. If the data is derived, use a script to repopulate the data on the slave. Other things to consider are filtering out the table during replication or possibly not using the memory storage engine for any replicated table. What Can Go Wrong | 395 Please purchase PDF Split-Merge on to remove this watermark.
  20. Temporary tables are missing after a slave crash If your replicated databases and queries make use of temporary tables, you should consider some important facts about temporary tables. When a slave is restarted, its temporary tables are lost. If any temporary tables were replicated from the master and you cannot restart the slave from that point, you may have to manually create the tables or skip the queries that reference the temporary tables. This scenario often results in the case where a query will not execute on one or more slaves. The resolution to this problem is similar to missing memory tables. Specifically, in order to get the query to execute, you may have to manually re-create the temporary tables or resynchronize the data on the slave with the data on the master and skip the query when restarting the slave. Slave is slow and is not synced with the master In slave lag, also called excessive lag, the slave cannot process all of the events from the master fast enough to avoid delays in updates of the data. In the most extreme cases, the updates to the data on the slave become out of date and cause incorrect results. For example, if a slave server in a ticketing agency is many minutes behind the master, it is possible the ticketing agency can sell seats that are no longer available (i.e., they have been marked as “sold” on the master but the slave did not get the updates until too late). We discussed this problem in previous chapters, but a summary of the resolution is still relevant here. To detect the problem, monitor the slave’s SHOW SLAVE STATUS output and examine the Seconds_Behind_Master column to ensure the value is within tolerance for your application. To solve the problem, consider moving some of the databases to other slaves, reducing the number of databases being replicated to the slave, improving network delays (if any), and making data storage improvements. For example, you can relieve the slave of processing extraneous events by using an additional slave for bulk or expensive data updates. You can relieve the replication load by making updates on a separate slave and applying the changes using a reliable backup and restore method on all of the other machines in the topology. Data loss after a slave crash It is possible that a slave server may crash and not record the last known master binlog position. This information is saved in the file. When this occurs, the slave will attempt to restart at the wrong (older) position and therefore attempt to execute some queries that may have already been executed. This normally results in query errors; you can handle this by skipping the duplicate events. However, it is also possible these duplicate events can cause the data to be changed (corrupted) so that the slave is no longer in sync with the master. Unfortunately, these types of problems are not that easy to detect. Careful examination of the logfiles may 396 | Chapter 11: Replication Troubleshooting Please purchase PDF Split-Merge on to remove this watermark.
Đồng bộ tài khoản