Symptom

In haas, the regionserver kept raising killed and running alerts. The HBase regionserver log shows the following exception:


2017-08-29 11:14:33,703 WARN [regionserver/hostname/172.42.0.1:60023] regionserver.HRegionServer: error telling master we are up
com.google.protobuf.ServiceException: org.apache.hadoop.net.ConnectTimeoutException: 10000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending local=/172.42.0.1:41015 remote=hostname/10.95.26.153:60003]
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:231)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
        at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerStartup(RegionServerStatusProtos.java:8277)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2167)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:826)
        at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 10000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending local=/172.42.0.1:41015 remote=hostname/10.95.26.153:60003]
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:532)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupConnection(RpcClientImpl.java:408)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:714)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:894)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:863)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1214)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
        ... 5 more
2017-08-29 11:14:33,703 WARN [regionserver/hostname/172.42.0.1:60023] regionserver.HRegionServer: reportForDuty failed; sleeping and then retrying.
2017-08-29 11:14:36,704 INFO [regionserver/hostname/172.42.0.1:60023] regionserver.HRegionServer: reportForDuty to master=hostname,60003,1502412914707 with port=60023, startcode=1503975994142
2017-08-29 11:14:36,704 DEBUG [regionserver/hostname/172.42.0.1:60023] ipc.AbstractRpcClient: Use SIMPLE authentication for service RegionServerStatusService, sasl=false
2017-08-29 11:14:36,704 DEBUG [regionserver/hostname/172.42.0.1:60023] ipc.AbstractRpcClient: Connecting to hostname/10.95.26.153:60003

The log shows that the regionserver is using the IP address of docker0, whereas the master is addressed by its bond0 IP address.

Every TCP session from the regionserver to the master is stuck in SYN_SENT.

[hadoop@hostname logs]$ netstat -an | egrep 6000
tcp        0      1 172.42.0.1:43184        10.95.26.153:60003      SYN_SENT
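When a host has many sockets, the same check can be scripted. The following is a minimal sketch, assuming the usual six-column `netstat -an` TCP layout; the function name `syn_sent_connections` and the sample string are illustrative only, not part of any HBase tooling.

```python
def syn_sent_connections(netstat_output, remote_port):
    """Return (local, remote) pairs for TCP sockets stuck in SYN_SENT toward remote_port."""
    stuck = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: proto, recv-q, send-q, local address, foreign address, state
        if len(fields) != 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if state == "SYN_SENT" and remote.rsplit(":", 1)[-1] == str(remote_port):
            stuck.append((local, remote))
    return stuck


# Sample line taken from the netstat output above
sample = "tcp        0      1 172.42.0.1:43184        10.95.26.153:60003      SYN_SENT"
print(syn_sent_connections(sample, 60003))
```

A non-empty result whose local addresses are all in the docker0 subnet points at the same problem as the log.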

Cause

The error log shows that the HBase regionserver's local address is in the 172.x subnet, while the remote master is in the 10.x subnet:

[connection-pending local=/172.42.0.1:41015 remote=hostname/10.95.26.153:60003]

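This can happen when a server has multiple network interfaces. By default, the region server decides which interface to bind to by looking up its primary host name and performing a DNS lookup. On this host, given the routing table, the region server ended up on the docker0 interface even though the master communicates over bond0.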
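A quick way to see which address the local host name resolves to is sketched below. It is only a rough Python analogue of a host name lookup, not the code path HBase itself runs, and the function name `resolve_local_address` is hypothetical.

```python
import socket


def resolve_local_address():
    """Return the IPv4 address the local host name resolves to, or None if lookup fails."""
    try:
        return socket.gethostbyname(socket.gethostname())
    except OSError:
        # Covers socket.gaierror when the host name cannot be resolved
        return None


print(resolve_local_address())
```

If this prints an address in the docker0 subnet rather than the bond0 one, the host name resolution on that machine is a likely suspect.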

[hadoop@hostname logs]$ netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         10.90.18.4      0.0.0.0         UG        0 0          0 bond0
10.0.0.0        10.90.18.1      255.0.0.0       UG        0 0          0 bond0
10.90.18.0      0.0.0.0         255.255.254.0   U         0 0          0 bond0
10.232.0.0      10.90.18.4      255.255.224.0   UG        0 0          0 bond0
100.64.0.0      10.90.18.1      255.192.0.0     UG        0 0          0 bond0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 bond0
172.16.0.0      10.90.18.1      255.240.0.0     UG        0 0          0 bond0
172.42.0.0      0.0.0.0         255.255.0.0     U         0 0          0 docker0
192.168.0.0     10.90.18.1      255.255.0.0     UG        0 0          0 bond0
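To read the table mechanically, the sketch below applies longest-prefix matching, the usual rule for picking a route, to the rows above. It is an illustration only: `ROUTES` and `select_route` are names made up here, and the kernel's real lookup also considers metrics and policy rules that are ignored.

```python
import ipaddress

# (destination, genmask, gateway, interface) rows copied from the netstat -rn output
ROUTES = [
    ("0.0.0.0", "0.0.0.0", "10.90.18.4", "bond0"),
    ("10.0.0.0", "255.0.0.0", "10.90.18.1", "bond0"),
    ("10.90.18.0", "255.255.254.0", "0.0.0.0", "bond0"),
    ("10.232.0.0", "255.255.224.0", "10.90.18.4", "bond0"),
    ("100.64.0.0", "255.192.0.0", "10.90.18.1", "bond0"),
    ("169.254.0.0", "255.255.0.0", "0.0.0.0", "bond0"),
    ("172.16.0.0", "255.240.0.0", "10.90.18.1", "bond0"),
    ("172.42.0.0", "255.255.0.0", "0.0.0.0", "docker0"),
    ("192.168.0.0", "255.255.0.0", "10.90.18.1", "bond0"),
]


def select_route(address, routes=ROUTES):
    """Longest-prefix match: return (network, gateway, iface) for address, or None."""
    ip = ipaddress.ip_address(address)
    best = None
    for dest, mask, gateway, iface in routes:
        net = ipaddress.ip_network(f"{dest}/{mask}")
        if ip in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, gateway, iface)
    return best


# The master address is reached through bond0 ...
print(select_route("10.95.26.153"))
# ... while the regionserver's local address belongs to the docker0 subnet
print(select_route("172.42.0.1"))
```

Under these assumptions, traffic to the master at 10.95.26.153 matches 10.0.0.0/8 on bond0, whereas the source address 172.42.0.1 lies in the docker0 network 172.42.0.0/16, which is the mismatch seen in the log.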

Solution

Set the hbase.regionserver.dns.interface parameter in hbase-site.xml to force the region server to use bond0 at startup, so that it communicates with the HBase master over the correct network interface.

<property>
    <name>hbase.regionserver.dns.interface</name>
    <value>bond0</value>
</property>

Then restart the regionserver.
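Before restarting, it can be worth confirming that the property is really present in the file the regionserver reads. The sketch below parses a Hadoop-style configuration document and returns a property value; `get_property` and the inline `SAMPLE` string are illustrative, and on a real host the text would come from the actual hbase-site.xml, whose path depends on the installation.

```python
import xml.etree.ElementTree as ET


def get_property(xml_text, name):
    """Return the <value> of the named <property>, or None if it is not set."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


# Sample configuration wrapping the property from this post
SAMPLE = """<configuration>
  <property>
    <name>hbase.regionserver.dns.interface</name>
    <value>bond0</value>
  </property>
</configuration>"""

print(get_property(SAMPLE, "hbase.regionserver.dns.interface"))
```

A result of `None` would mean the setting never reached the file, in which case a restart alone would not change which interface is used.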