TCP Network Interface Binding Problem

Symptoms

In haas, the regionserver kept raising alternating killed and running alerts. The HBase regionserver log showed the following exception:


2017-08-29 11:14:33,703 WARN [regionserver/hostname/172.42.0.1:60023] regionserver.HRegionServer: error telling master we are up
com.google.protobuf.ServiceException: org.apache.hadoop.net.ConnectTimeoutException: 10000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending local=/172.42.0.1:41015 remote=hostname/10.95.26.153:60003]
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:231)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerStartup(RegionServerStatusProtos.java:8277)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2167)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:826)
    at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 10000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending local=/172.42.0.1:41015 remote=hostname/10.95.26.153:60003]
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:532)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
    at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupConnection(RpcClientImpl.java:408)
    at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:714)
    at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:894)
    at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:863)
    at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1214)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
    ... 5 more
2017-08-29 11:14:33,703 WARN [regionserver/hostname/172.42.0.1:60023] regionserver.HRegionServer: reportForDuty failed; sleeping and then retrying.
2017-08-29 11:14:36,704 INFO [regionserver/hostname/172.42.0.1:60023] regionserver.HRegionServer: reportForDuty to master=hostname,60003,1502412914707 with port=60023, startcode=1503975994142
2017-08-29 11:14:36,704 DEBUG [regionserver/hostname/172.42.0.1:60023] ipc.AbstractRpcClient: Use SIMPLE authentication for service RegionServerStatusService, sasl=false
2017-08-29 11:14:36,704 DEBUG [regionserver/hostname/172.42.0.1:60023] ipc.AbstractRpcClient: Connecting to hostname/10.95.26.153:60003

The regionserver is using the IP address of docker0, whereas the master is on the IP address of bond0.

Every TCP session from the regionserver to the master is stuck in SYN_SENT.

[hadoop@hostname logs]$ netstat -an | egrep 6000
tcp 0 1 172.42.0.1:43184 10.95.26.153:60003 SYN_SENT
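
When there are many connections, the same check can be scripted. The following is a minimal sketch that counts TCP states for sessions going to a given remote port; the function name `tcp_states` is hypothetical, and it assumes the six-column `netstat -an` layout shown above.

```python
from collections import Counter


def tcp_states(netstat_output, remote_port):
    """Count TCP states for sessions whose remote port matches remote_port."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: proto, recv-q, send-q, local, remote, state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        remote = fields[4]
        if remote.rsplit(":", 1)[-1] == str(remote_port):
            states[fields[5]] += 1
    return states


# Sample line copied from the netstat output above
sample = "tcp 0 1 172.42.0.1:43184 10.95.26.153:60003 SYN_SENT"
print(tcp_states(sample, 60003))
```

A count that stays in SYN_SENT and never reaches ESTABLISHED suggests the handshake is not completing.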

Cause

The error log shows that the HBase regionserver's local subnet is 172.x, while the remote master's subnet is 10.x:

[connection-pending local=/172.42.0.1:41015 remote=hostname/10.95.26.153:60003]
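
To pull the two endpoints out of many such log lines, a small parser can help. This is only a sketch: the regular expression and the name `endpoints` are assumptions based on the single message format shown above, not part of HBase.

```python
import re

# Matches the "local=/ip:port remote=host/ip:port" part of the message
PENDING = re.compile(
    r"local=/(?P<lip>[\d.]+):(?P<lport>\d+) "
    r"remote=[^/\s]*/(?P<rip>[\d.]+):(?P<rport>\d+)"
)


def endpoints(message):
    """Return (local_ip, local_port, remote_ip, remote_port), or None."""
    m = PENDING.search(message)
    if m is None:
        return None
    return (m["lip"], int(m["lport"]), m["rip"], int(m["rport"]))


line = "[connection-pending local=/172.42.0.1:41015 remote=hostname/10.95.26.153:60003]"
print(endpoints(line))
```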

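This can happen when a server has multiple network interfaces. By default, the region server decides which interface to bind to by looking up the host's primary hostname and resolving it through DNS. On this host the region server ended up on the docker0 address, even though the master communicates over bond0. The routing table below shows that the master's 10.x subnet is reached through bond0, while 172.42.0.0 belongs to docker0.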

[hadoop@hostname logs]$ netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         10.90.18.4      0.0.0.0         UG        0 0          0 bond0
10.0.0.0        10.90.18.1      255.0.0.0       UG        0 0          0 bond0
10.90.18.0      0.0.0.0         255.255.254.0   U         0 0          0 bond0
10.232.0.0      10.90.18.4      255.255.224.0   UG        0 0          0 bond0
100.64.0.0      10.90.18.1      255.192.0.0     UG        0 0          0 bond0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 bond0
172.16.0.0      10.90.18.1      255.240.0.0     UG        0 0          0 bond0
172.42.0.0      0.0.0.0         255.255.0.0     U         0 0          0 docker0
192.168.0.0     10.90.18.1      255.255.0.0     UG        0 0          0 bond0
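
To see which interface each address in the log maps to, the table can be evaluated with a longest-prefix match. The sketch below is a simplification that ignores metrics and policy routing; the `ROUTES` list is transcribed from the table above, and `outgoing_interface` is a hypothetical helper name.

```python
import ipaddress

# (destination, genmask, interface), transcribed from the routing table above
ROUTES = [
    ("0.0.0.0", "0.0.0.0", "bond0"),
    ("10.0.0.0", "255.0.0.0", "bond0"),
    ("10.90.18.0", "255.255.254.0", "bond0"),
    ("10.232.0.0", "255.255.224.0", "bond0"),
    ("100.64.0.0", "255.192.0.0", "bond0"),
    ("169.254.0.0", "255.255.0.0", "bond0"),
    ("172.16.0.0", "255.240.0.0", "bond0"),
    ("172.42.0.0", "255.255.0.0", "docker0"),
    ("192.168.0.0", "255.255.0.0", "bond0"),
]


def outgoing_interface(destination, routes=ROUTES):
    """Return the interface of the most specific route containing destination."""
    addr = ipaddress.ip_address(destination)
    best = None
    for dest, mask, iface in routes:
        net = ipaddress.ip_network(f"{dest}/{mask}")
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, iface)
    return best[1] if best else None


print(outgoing_interface("10.95.26.153"))  # master address from the log
print(outgoing_interface("172.42.0.1"))    # regionserver local address from the log
```

The master's address matches a bond0 route, while the regionserver's local address falls in the docker0 subnet, which is consistent with the mismatch seen in the log.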

Solution

Set the hbase.regionserver.dns.interface parameter in hbase-site.xml to force the region server to use bond0 at startup, so that it communicates with the HBase master over the correct network interface.
<property>
    <name>hbase.regionserver.dns.interface</name>
    <value>bond0</value>
</property>

Then restart the regionserver.
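
To confirm that the setting is actually present in the file, the property can be read back with a short script. This is a sketch only: `get_property` is a hypothetical helper, and `SAMPLE` assumes the property sits inside the usual `<configuration>` root element of hbase-site.xml.

```python
import xml.etree.ElementTree as ET


def get_property(xml_text, name):
    """Return the value of the named property, or None if it is not set."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


# Assumed minimal hbase-site.xml content, for illustration only
SAMPLE = """<configuration>
  <property>
    <name>hbase.regionserver.dns.interface</name>
    <value>bond0</value>
  </property>
</configuration>"""

print(get_property(SAMPLE, "hbase.regionserver.dns.interface"))
```

In practice the text would be read from the real hbase-site.xml on the regionserver host instead of `SAMPLE`.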