redis重做从库时报Connectionwithmasterlost错误

背景：之前同事初始化redis实例的时候把monitor命令给rename了，今天刚好这组实例有问题需要通过monitor命令来帮助分析问题，结果发现monitor命令用不了，因为redis命令重命名不支持动态修改，于是就想删除monitor命令后赶快重启redis从库。重启完后，发现主从复制状态一直处于down的状态，于是开始排查问题

站在用户的角度思考问题，与客户深入沟通，找到南丰网站设计与南丰网站推广的解决方案，凭借多年的经验，让设计与互联网技术结合，创造个性化、用户体验好的作品，建站类型包括：成都做网站、网站建设、外贸营销网站建设、企业官网、英文网站、手机端网站、网站推广、域名注册、网站空间、企业邮箱。业务覆盖南丰地区。

先看redis从库的日志：

43037:S 20 Apr 06:12:38.047 MASTER <-> SLAVE sync started
43037:S 20 Apr 06:12:38.047 Non blocking connect for SYNC fired the event.
43037:S 20 Apr 06:12:38.048 Master replied to PING, replication can continue...
43037:S 20 Apr 06:12:38.048 Partial resynchronization not possible (no cached master)
43037:S 20 Apr 06:12:39.112 Full resync from master: 96f2ae75d50e1f8b69737509d5b32b2da660e7c0:885061114038
43037:S 20 Apr 06:19:16.258 MASTER <-> SLAVE sync: receiving 20819568576 bytes from master
43037:S 20 Apr 06:20:08.203 MASTER <-> SLAVE sync: Flushing old data
43037:S 20 Apr 06:24:36.036 MASTER <-> SLAVE sync: Loading DB in memory
43037:S 20 Apr 06:28:18.538 MASTER <-> SLAVE sync: Finished with success
43037:S 20 Apr 06:28:19.782 Background append only file rewriting started by pid 173002
43037:S 20 Apr 06:28:19.982 # Connection with master lost.
43037:S 20 Apr 06:28:19.982 Caching the disconnected master state.
43037:S 20 Apr 06:28:20.984 Connecting to MASTER 10.93.157.52:6385
43037:S 20 Apr 06:28:20.985 MASTER <-> SLAVE sync started
43037:S 20 Apr 06:28:20.985 Non blocking connect for SYNC fired the event.
43037:S 20 Apr 06:28:20.985 Master replied to PING, replication can continue...
43037:S 20 Apr 06:28:20.986 Trying a partial resynchronization (request 96f2ae75d50e1f8b69737509d5b32b2da660e7c0:885062375607).
43037:S 20 Apr 06:28:22.073 Full resync from master: 96f2ae75d50e1f8b69737509d5b32b2da660e7c0:885240485270
43037:S 20 Apr 06:28:22.073 Discarding previously cached master state.
43037:S 20 Apr 06:33:30.800 # Timeout receiving bulk data from MASTER... If the problem persists try to set the 'repl-timeout' parameter in redis.conf to a larger value.
43037:S 20 Apr 06:33:30.801 Connecting to MASTER 10.93.157.52:6385
43037:S 20 Apr 06:33:30.802 MASTER <-> SLAVE sync started
43037:S 20 Apr 06:33:30.802 Non blocking connect for SYNC fired the event.
43037:S 20 Apr 06:33:30.802 Master replied to PING, replication can continue...
43037:S 20 Apr 06:33:30.803 Partial resynchronization not possible (no cached master)
43037:S 20 Apr 06:34:27.458 AOF rewrite child asks to stop sending diffs.
173002:C 20 Apr 06:34:27.470 Parent agreed to stop sending diffs. Finalizing AOF...
173002:C 20 Apr 06:34:27.470 Concatenating 1.19 MB of AOF diff received from parent.
173002:C 20 Apr 06:34:27.477 SYNC append only file rewrite performed
173002:C 20 Apr 06:34:28.757 AOF rewrite: 119 MB of memory used by copy-on-write
43037:S 20 Apr 06:34:29.961 Background AOF rewrite terminated with success
43037:S 20 Apr 06:34:29.961 Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
43037:S 20 Apr 06:34:29.961 Background AOF rewrite finished successfully
43037:S 20 Apr 06:35:08.471 Full resync from master: 96f2ae75d50e1f8b69737509d5b32b2da660e7c0:885393456775
43037:S 20 Apr 06:41:49.489 MASTER <-> SLAVE sync: receiving 20821399746 bytes from master
43037:S 20 Apr 06:42:28.911 # I/O error trying to sync with MASTER: connection lost
43037:S 20 Apr 06:42:32.642 Connecting to MASTER 10.93.157.52:6385
43037:S 20 Apr 06:42:32.646 MASTER <-> SLAVE sync started
43037:S 20 Apr 06:42:32.646 Non blocking connect for SYNC fired the event.
43037:S 20 Apr 06:42:32.647 Master replied to PING, replication can continue...
43037:S 20 Apr 06:42:32.647 Partial resynchronization not possible (no cached master)
43037:S 20 Apr 06:42:33.755 Full resync from master: 96f2ae75d50e1f8b69737509d5b32b2da660e7c0:885541422071
43037:S 20 Apr 06:49:15.956 MASTER <-> SLAVE sync: receiving 20821403571 bytes from master
43037:S 20 Apr 06:50:16.781 MASTER <-> SLAVE sync: Flushing old data
43037:S 20 Apr 06:54:23.078 MASTER <-> SLAVE sync: Loading DB in memory
43037:S 20 Apr 06:58:10.123 MASTER <-> SLAVE sync: Finished with success
43037:S 20 Apr 06:58:11.317 Background append only file rewriting started by pid 223387
43037:S 20 Apr 06:58:11.536 # Connection with master lost.
43037:S 20 Apr 06:58:11.536 * Caching the disconnected master state.

再来看一下redis主库的日志：

304369:M 20 Apr 05:13:00.197 Slave 10.93.157.16:6383 asks for synchronization
304369:M 20 Apr 05:13:00.197 Full resync requested by slave 10.93.157.16:6383
304369:M 20 Apr 05:13:00.197 Starting BGSAVE for SYNC with target: disk
304369:M 20 Apr 05:13:00.902 Background saving started by pid 366254
366254:C 20 Apr 05:19:14.460 DB saved on disk
366254:C 20 Apr 05:19:14.961 RDB: 4613 MB of memory used by copy-on-write
304369:M 20 Apr 05:19:15.579 Background saving terminated with success
304369:M 20 Apr 05:20:18.303 Synchronization with slave 10.93.157.16:6383 succeeded
304369:M 20 Apr 05:22:32.216 # Client id=1768461 addr=10.93.157.16:26912 fd=10 name= age=572 idle=0 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=11450 omem=278945640 events=rw cmd=psync scheduled to
be closed ASAP for overcoming of output buffer limits.
304369:M 20 Apr 05:22:32.216 # Connection with slave 10.93.157.16:6383 lost.
304369:M 20 Apr 05:28:27.651 Slave 10.93.157.16:6383 asks for synchronization
304369:M 20 Apr 05:28:27.651 Unable to partial resync with slave 10.93.157.16:6383 for lack of backlog (Slave request was: 884116305579).
304369:M 20 Apr 05:28:27.651 Starting BGSAVE for SYNC with target: disk
304369:M 20 Apr 05:28:28.356 Background saving started by pid 396084
304369:M 20 Apr 05:32:37.471 # Client id=1768945 addr=10.93.157.16:21854 fd=22 name= age=250 idle=250 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16366 oll=11126 omem=271085384 events=r cmd=psync scheduled to
be closed ASAP for overcoming of output buffer limits.
304369:M 20 Apr 05:32:37.538 # Connection with slave 10.93.157.16:6383 lost.

从日志中我们可以看到在redis主库在接到从库要求重新同步数据的时候先生成一个rdb文件，再通过psync来做部分同步，可以看到问题就出在部分同步这一块，由于client-output-buffer-limit值设置太小，导致导数发送失败。失败后，从库继续发起数据同步的请求，但是每次都失败，redis主库反复地生成rdb文件，虽然是有fork一个子进程，但是对redis主库的吞吐量是有一定影响的。

现在我们来解决这个问题，先看一下目前 client-output-buffer-limit的大小：
127.0.0.1:6385> config get client-output-buffer-limit
1) "client-output-buffer-limit"
2) "normal 0 0 0 slave 268435456 67108864 60 pubsub 33554432 8388608 60"
可以看到，目前的限制是最大256M和60s内不超过64M,从我们上面日志信息可以看到psync的数据明显是大于256M的。我们进行如下设置把很限制调大：
config set client-output-buffer-limit 'slave 1073741824 268435456 60'
调完后再观察，发现从库的的复制状态很快就变成了up。
我们上面的调整是加大复制输出缓冲区，还有一个办法就是关闭复制输出缓冲区的限制：
config set client-output-buffer-limit 'slave 0 0 0'

文章标题：redis重做从库时报Connectionwithmasterlost错误
本文链接：http://chengdu.cdxwcx.cn/article/pshege.html

甜橘子，专注成都网站制作网站设计与营销型网站建设与优化

首页

网站建设

网站制作案例

解决方案

网站设计报价

网站制作动态

关于我们

联系我们

成都网站建设设计将想法与焦点和您一起共享

redis重做从库时报Connectionwithmasterlost错误

其他资讯

java熬夜加班代码 java程序员加班多吗

四川sap系统灭菌柜验证的简单介绍

php数据库用户注册 PHP数据库用户注册功能页面设置美观

帝国软件cms万能标签帝国cms万能接口

go语言自行开发p2p go语言开发教程

甜橘子，专注成都网站制作网站设计与营销型网站建设与优化

成都网站建设设计 将想法与焦点和您一起共享

redis重做从库时报Connectionwithmasterlost错误

其他资讯

java熬夜加班代码 java程序员加班多吗

四川sap系统灭菌柜验证的简单介绍

php数据库用户注册 PHP数据库用户注册功能页面设置美观

帝国软件cms万能标签 帝国cms万能接口

go语言自行开发p2p go语言开发教程

成都网站建设设计将想法与焦点和您一起共享

帝国软件cms万能标签帝国cms万能接口