replication « Remi Bergsma's blog

Fixing an out-of-sync or corrupt MySQL slave

20 November 2012

Sometimes a MySQL slave gets corrupted, or its data becomes unreliable in some other way. Usually I clone the data from a slave that is still OK and fix it from there. However, today I ran into an issue that made me doubt the data on every slave. To be absolutely sure the data is consistent on both master and slave, I decided to deploy a new slave from a dump of the master, and then redeploy the other slaves from that new slave, like I normally do with a script.

This blog post describes both methods of restoring replication.

Restoring data directly from the master
We will create a dump on the master server and load it on a slave. To make sure nothing changes during the dump, we take a global read lock. Reads keep working, but writes block until we unlock, so choose the right time for this maintenance.

To lock all tables run:

FLUSH TABLES WITH READ LOCK;

With the lock held, record the master's current binary log file and position. We need them later to tell the slaves where to continue reading updates from the master.

SHOW MASTER STATUS\G

Example output:

File: bin-log.002402
Position: 20699406
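If you want to script this, the two values can be pulled out of the \G output. A minimal POSIX sh sketch, assuming the vertical ‘Key: value’ layout shown above; the helper name ‘master_coords’ is made up for this example:

```shell
# master_coords (hypothetical helper): read the vertical (\G) output of
# SHOW MASTER STATUS on stdin and print "<binlog file> <position>".
# Exits 1 if either field is missing.
master_coords() {
    awk -F': *' '
        $1 ~ /^ *File$/     { file = $2 }
        $1 ~ /^ *Position$/ { pos = $2 }
        END {
            if (file == "" || pos == "") exit 1
            print file, pos
        }'
}

# Usage (with the lock still held in the first session):
#   mysql -u username -p -e 'SHOW MASTER STATUS\G' | master_coords
```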

Now create an SQL dump of the databases. Do this in another session and keep the first one open: the read lock is released as soon as the session that took it is closed, so closing it would let writes in while you dump.

mysqldump -u username -ppassword --add-drop-database --databases databasename > masterdump.sql

After the dump is complete, go back to the first session and release the lock:

UNLOCK TABLES;

This is all we need to do on the master.

Restoring from an already running slave
As an alternative to dumping the master, you can also use a slave’s data. This has the advantage that the master is never locked, so service is not interrupted. On the other hand, you have to be sure this slave’s data is correct.

First, stop the slave:

STOP SLAVE;

And verify that it has stopped:

SHOW SLAVE STATUS\G

Output:

Slave_IO_Running: No
Slave_SQL_Running: No
Master_Log_File: bin-log.002402
Read_Master_Log_Pos: 20699406

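Record the ‘Relay_Master_Log_File’ and ‘Exec_Master_Log_Pos’ values from the full output. Together they give the position in the master's binary log up to which this slave has executed events; we need them later to instruct the new slave. Don't use ‘Master_Log_File’ and ‘Read_Master_Log_Pos’: they show how far the IO thread has read, which can be ahead of what has actually been applied.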
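When scripting this, make sure you pick up the executed position rather than the read position. A minimal POSIX sh sketch (the helper name ‘slave_exec_coords’ is hypothetical) that refuses to print anything while a replication thread is still running:

```shell
# slave_exec_coords (hypothetical helper): read SHOW SLAVE STATUS\G output on
# stdin and print "<Relay_Master_Log_File> <Exec_Master_Log_Pos>", i.e. the
# master coordinates up to which this slave has executed events.
# Exit 2: a replication thread is still running (the position may still move).
# Exit 1: the fields were not found.
slave_exec_coords() {
    awk -F': *' '
        $1 ~ /^ *Slave_IO_Running$/      { io = $2 }
        $1 ~ /^ *Slave_SQL_Running$/     { sql = $2 }
        $1 ~ /^ *Relay_Master_Log_File$/ { file = $2 }
        $1 ~ /^ *Exec_Master_Log_Pos$/   { pos = $2 }
        END {
            if (io != "No" || sql != "No") exit 2
            if (file == "" || pos == "") exit 1
            print file, pos
        }'
}

# Usage, after STOP SLAVE:
#   mysql -u user -p -e 'SHOW SLAVE STATUS\G' | slave_exec_coords
```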

Create a sql dump of the slave’s data:

/usr/bin/mysqldump --add-drop-database -ppassword -u user -h mysqlserver --databases databasename > slavedump.sql

Now that we have a dump, we can start the slave again.

START SLAVE;

Between stopping and starting the slave, everything keeps working, except that updates from the master are not applied. As soon as you start the slave again, it catches up with the master.

This method has the advantage that it is easily scriptable. Whenever there’s a problem, you run a script with the above commands and have everything fixed in a matter of seconds. That’s a real time saver!
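A minimal sketch of what such a script could look like in POSIX sh. It assumes credentials in ~/.my.cnf (so no password on the command line), runs on the trusted slave itself, and uses a hypothetical function name ‘dump_from_slave’. The slave is started again even if the dump fails, so it never stays stopped:

```shell
# dump_from_slave (hypothetical): stop the slave, record its executed master
# coordinates, dump the database, start the slave again, print the coordinates.
# Assumes credentials in ~/.my.cnf; file and database names are placeholders.
dump_from_slave() {
    out=${1:-slavedump.sql}
    db=${2:-databasename}

    mysql -e 'STOP SLAVE' || return 1

    # Relay_Master_Log_File / Exec_Master_Log_Pos: last executed master event.
    coords=$(mysql -e 'SHOW SLAVE STATUS\G' | awk -F': *' '
        $1 ~ /^ *Relay_Master_Log_File$/ { file = $2 }
        $1 ~ /^ *Exec_Master_Log_Pos$/   { pos = $2 }
        END { if (file == "" || pos == "") exit 1; print file, pos }') || {
        mysql -e 'START SLAVE'
        return 1
    }

    # Dump, then restart the slave whether or not the dump succeeded.
    mysqldump --add-drop-database --databases "$db" > "$out"
    rc=$?
    mysql -e 'START SLAVE'
    [ "$rc" -eq 0 ] || return "$rc"

    echo "$coords"
}

# Usage on the trusted slave:
#   dump_from_slave /tmp/slavedump.sql databasename
```

The printed file and position are what goes into master_log_file and master_log_pos on the new slave.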

Setting up the new slave
Use scp to copy the SQL dump created above to the new slave. Alternatively, you can run the ‘mysqldump’ command on the new slave itself, pointing -h at the server you dump from. Then log in to MySQL on the new slave and run:

STOP SLAVE;
RESET SLAVE;

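Restore the SQL dump (the example uses masterdump.sql; use your slave dump's filename if you followed the second method):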

mysql -ppassword -u user databasename < masterdump.sql

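The slave now holds the data as of the position you recorded. Next, tell it where in the master's binary log to resume: use ‘File’ and ‘Position’ from SHOW MASTER STATUS, or ‘Relay_Master_Log_File’ and ‘Exec_Master_Log_Pos’ from SHOW SLAVE STATUS, depending on the method you chose. In both cases master_host is the master itself, not the slave you dumped from.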

CHANGE MASTER TO
 master_host='mysqlmaster',
 master_user='replicate_user',
 master_password='replicate_password',
 master_log_file='bin-log.002402',
 master_log_pos=20699406;
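To avoid typos when copying the coordinates, the statement can also be generated. A sketch with a hypothetical helper ‘change_master_sql’; it does no quoting or escaping, so it assumes the values contain no single quotes:

```shell
# change_master_sql (hypothetical): print a CHANGE MASTER TO statement.
# Arguments: host user password log_file log_pos. No escaping is done,
# so values must not contain single quotes.
change_master_sql() {
    host=$1 user=$2 pass=$3 file=$4 pos=$5
    # The position must be a plain non-negative integer.
    case $pos in
        ''|*[!0-9]*) echo "invalid log position: $pos" >&2; return 1 ;;
    esac
    printf "CHANGE MASTER TO master_host='%s', master_user='%s', master_password='%s', master_log_file='%s', master_log_pos=%s;\n" \
        "$host" "$user" "$pass" "$file" "$pos"
}

# Usage on the new slave, with the coordinates recorded earlier:
#   change_master_sql mysqlmaster replicate_user replicate_password bin-log.002402 20699406 | mysql
```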

Then start the slave:

START SLAVE;

And check the status after a few seconds:

SHOW SLAVE STATUS\G

Output:

Slave_IO_Running: Yes
Slave_SQL_Running: Yes

The slave now runs again with up to date data!
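To run this check from a script or a monitoring job, a hypothetical helper ‘replication_ok’ can read the same output and succeed only when both threads report Yes. A minimal sketch; it does not look at replication lag:

```shell
# replication_ok (hypothetical): read SHOW SLAVE STATUS\G output on stdin and
# exit 0 only if both the IO and the SQL thread report "Yes".
replication_ok() {
    awk -F': *' '
        $1 ~ /^ *Slave_IO_Running$/  { io = $2 }
        $1 ~ /^ *Slave_SQL_Running$/ { sql = $2 }
        END {
            if (io == "Yes" && sql == "Yes") exit 0
            exit 1
        }'
}

# Usage, e.g. from cron:
#   mysql -e 'SHOW SLAVE STATUS\G' | replication_ok || echo "replication broken" >&2
```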

Posted in Linux sysadmin. Tags: howto, mysql, replication

Replacing both DRBD nodes while keeping the storage online

19 November 2012

DRBD (Distributed Replicated Block Device) is an open source storage solution that is best compared with a RAID-1 (mirror) between two servers. I’ve implemented it for both our cloud storage and our cloud management servers.

We’re in the process of replacing both cloud storage nodes (that is: everything except the disks and their RAID arrays), and of course no downtime is allowed. Although DRBD is made for redundancy (one node can be offline without impact), completely replacing a node is a bit tricky.

Preventing a ‘split brain’ situation
The most important thing to remember is that only one storage node may be the active (primary) node at any time. If this is violated, a so-called ‘split brain’ happens. DRBD has ways to recover from such a state, but it is best to prevent it from happening.

When discussing this project in our team, it was suggested to boot the replaced storage node without any network cables. Usually this is a safe way to prevent a node from interacting with others. In this case, it is not such a good idea: since the new secondary server has the same disks and configuration as the old one, unexpected things may happen. When booting without network cables attached, the nodes cannot find one another and the newly booted secondary may decide to become primary itself. A split brain situation then occurs: both nodes are primary at the same time and access the data. You won’t be able to recover from this unless you manually decide which node's data survives (and lose the changes on the other node). In our case that would be easy to decide, but it’s a lot of unnecessary trouble.

Instead, boot the replaced node with network cables connected, so the replication network is up and both nodes see each other immediately. In our case, this means connecting the 10Gbps link between the nodes, which DRBD uses for syncing and Heartbeat uses for sending its heartbeats. This prevents the split brain state and starts syncing immediately.

Note: If you want to replace everything including the disks, you’ll have to manually join the cluster with the new secondary node and then sync the data. In this case it doesn’t matter whether the networking cables are connected or not, since this new node won’t be able to become primary anyway.

The procedure
Back to our case: replacing all hardware except the disks. We successfully replaced both nodes using this procedure:

1. Shut down the secondary node.
2. Replace the hardware and install the existing disks and the 10Gbps card.
3. Boot the node with at least the 10Gbps connection active.
4. The node should sync with the primary.
5. When syncing finishes, redundancy is restored.
6. Make sure all other network connections are working. Since the mainboard was replaced, some MAC addresses changed; update the udev rules accordingly.
7. When all is fine, check that DRBD and Heartbeat are running without errors on both nodes.
8. Stop Heartbeat on the primary node. A fail-over to the new secondary node will occur.
9. If all went well, you can now safely shut down the old primary.
10. Replace the hardware and install the existing disks and the 10Gbps card.
11. Boot the node with at least the 10Gbps connection active.
12. The node should sync with the (new) primary.
13. When syncing finishes, redundancy is restored.
14. Make sure all other network connections are working. Since the mainboard was replaced, some MAC addresses changed; update the udev rules accordingly.
15. The old primary is now the secondary.
16. If you want, initiate another fail-over. (We didn’t, since both nodes are equally powerful.)
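Whether a node has finished syncing can be read from /proc/drbd. A minimal sketch, assuming the DRBD 8.x layout where each resource line contains fields such as ‘cs:Connected’ and ‘ds:UpToDate/UpToDate’; the helper name ‘drbd_in_sync’ is hypothetical:

```shell
# drbd_in_sync (hypothetical): read /proc/drbd-style text on stdin (DRBD 8.x
# layout assumed) and exit 0 only if every resource line reports a Connected
# peer and UpToDate disks on both sides.
drbd_in_sync() {
    awk '
        / cs:/ {
            n++
            if ($0 !~ / cs:Connected / || $0 !~ / ds:UpToDate\/UpToDate /) bad++
        }
        END {
            if (n > 0 && bad == 0) exit 0
            exit 1
        }'
}

# Usage on either node, e.g. before stopping Heartbeat on the primary:
#   drbd_in_sync < /proc/drbd && echo "redundancy restored"
```

A resource that is still syncing (cs:SyncSource or cs:SyncTarget) or has lost its peer (cs:WFConnection, cs:StandAlone) makes the check fail, which is the moment not to trigger a fail-over.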

Congratulations: the cluster is redundant again with the new hardware!

Using the above procedure, we replaced both nodes of our DRBD storage cluster without any downtime.

Posted in Linux sysadmin. Tags: drbd, howto, migration, redundancy, replication, storage

Fixing MySQL replication after a faulty query

28 September 2012

Today I ran into an issue with MySQL replication that prevented updates from the master from appearing on the slaves. When checking the slave status, MySQL reports:

Slave_IO_State: Waiting for master to send event
Slave_IO_Running: Yes
Slave_SQL_Running: No
Last_Error: Query caused different errors on master and slave.

Apparently, a query failed on the master (for example because the query itself was wrong) and it failed on the slave as well. But since the error messages differed slightly, MySQL treated this as a mismatch and replication stopped at this query. Manual intervention is required to tell MySQL what to do next.

This is what I used to fix it:

mysql -p
STOP SLAVE;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
START SLAVE;

I logged in to MySQL (line 1), stopped the slave threads (line 2), skipped the one faulty query (line 3) and started the slave threads again (final line). The status now reports that both the IO and the SQL thread are running.

mysql> show slave status\G

Output:

 Slave_IO_State: Waiting for master to send event
 Slave_IO_Running: Yes
 Slave_SQL_Running: Yes
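Skipping an event means the slave never applies it, so it is only safe when, as here, the statement failed on the master as well. A minimal sketch that reads Last_Error first and only skips for this particular error; the helper names ‘last_sql_error’ and ‘skip_if_error_mismatch’ are hypothetical, and credentials are assumed to come from ~/.my.cnf:

```shell
# last_sql_error (hypothetical): read SHOW SLAVE STATUS\G output on stdin.
# If the SQL thread has stopped, print its Last_Error and exit 0; else exit 1.
last_sql_error() {
    awk '
        /^ *Slave_SQL_Running:/ { sql = $2 }
        /^ *Last_Error:/        { sub(/^ *Last_Error: */, ""); err = $0 }
        END {
            if (sql == "No") { print err; exit 0 }
            exit 1
        }'
}

# skip_if_error_mismatch (hypothetical): read the same status on stdin and
# skip exactly one event, but only for the "different errors" case.
skip_if_error_mismatch() {
    err=$(last_sql_error) || return 0    # SQL thread is running: nothing to do
    case $err in
        *'different errors on master and slave'*)
            mysql -e 'STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE' ;;
        *)
            printf 'not skipping, inspect manually: %s\n' "$err" >&2
            return 1 ;;
    esac
}

# Usage on the broken slave:
#   mysql -e 'SHOW SLAVE STATUS\G' | skip_if_error_mismatch
```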

Posted in Linux sysadmin. Tags: error, howto, mysql, replication

Moving a MySQL slave to a new replication master

10 July 2012

I’m upgrading our MySQL master/slave setup and am moving it to new (virtual) hardware in our cloud environment. One of the things I did last night was moving the MySQL slaves to a new master that I had prepared in the new environment. This post describes how I connected the slaves to their new master in the cloud.

First, you’ll need to make sure the new master has the same data as the old one.
1. Make sure no more updates occur on the old master
2. Create a sql dump of the master using mysqldump
3. Import that dump into the new master using the mysql command-line tool

At this point both masters should have the same data.
4. Now, shut down the old master as it can be retired 😉
5. Before allowing write access to the new master, note its position by executing this query:

mysql> show master status\G
File: mn-bin.000005
Position: 11365777
Binlog_Do_DB: database_name
Binlog_Ignore_DB:
1 row in set (0.00 sec)

We’ll need this information later on when instructing the slaves to connect to their new master.

6. It’s now safe to allow write access again to the new master
7. Run this on each slave to connect it to the new master:

STOP SLAVE;
CHANGE MASTER TO
master_host='master_hostname',
master_user='replicate_user',
master_password='password',
master_log_file='mn-bin.000005',
master_log_pos=11365777;
START SLAVE;

Note the ‘master_log_file’ and ‘master_log_pos’. Their values are the ones we selected from the master at step 5. Then check if it worked (allow a few seconds to connect):

mysql> show slave status\G

Look for these lines, they should say ‘Yes’:

Slave_IO_Running: Yes
Slave_SQL_Running: Yes

And the status should be:

Slave_IO_State: Waiting for master to send event

That’s it, the slave is now connected to a new master. Test it by updating the master, and checking whether the slave receives the update too.
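That test can be scripted too. A minimal sketch, assuming a hypothetical table ‘repl_check’ in the replicated database and credentials in ~/.my.cnf; ‘check_replication’ is a made-up name. With statement-based logging, a binlog-do-db filter (like the ‘Binlog_Do_DB: database_name’ above) looks at the default database, so the database is passed as the default rather than prefixed to the table name:

```shell
# check_replication (hypothetical): write a token on the master and wait until
# it shows up on the slave. Assumes this table exists in the replicated db:
#   CREATE TABLE repl_check (id INT PRIMARY KEY, token VARCHAR(32));
# Arguments: master_host slave_host database [tries, default 10]
check_replication() {
    master=$1 slave=$2 db=$3 tries=${4:-10}
    token=$(date +%s)

    # Write on the master, with $db as the default database.
    mysql -h "$master" -e "REPLACE INTO repl_check (id, token) VALUES (1, '$token')" "$db" || return 1

    # Poll the slave once per second until the token arrives.
    i=0
    while [ "$i" -lt "$tries" ]; do
        got=$(mysql -h "$slave" -N -B -e 'SELECT token FROM repl_check WHERE id = 1' "$db")
        if [ "$got" = "$token" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1    # not replicated within $tries attempts
}

# Usage:
#   check_replication master_hostname slave_hostname database_name && echo OK
```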

Posted in Linux sysadmin. Tags: howto, migration, mysql, replication
