Hadoop HA cluster: NameNode dies unexpectedly on startup (troubleshooting and fix)

NameNode error log

2025-05-21 16:14:12,218 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:12,219 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:12,251 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:13,144 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 7009 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2025-05-21 16:14:13,220 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:13,223 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:13,257 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:14,149 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 8014 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2025-05-21 16:14:14,227 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:14,238 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:14,260 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:15,151 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 9016 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2025-05-21 16:14:15,238 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:15,250 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:15,268 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:16,164 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 10029 ms (timeout=20000 ms) for a response for selectInputStreams. No responses yet.
2025-05-21 16:14:16,239 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:16,285 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:16,285 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:16,302 WARN org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input streams from QJM to [192.168.191.111:8485, 192.168.191.112:8485, 192.168.191.113:8485]. Skipping.
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
192.168.191.112:8485: Call From node01/192.168.191.111 to node02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
192.168.191.113:8485: Call From node01/192.168.191.111 to node03:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
192.168.191.111:8485: Call From node01/192.168.191.111 to node01:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
    at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:286)
    at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:485)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:269)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1672)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1705)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:297)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:449)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:399)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:416)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:482)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:412)
2025-05-21 16:14:16,367 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services started for standby state
2025-05-21 16:14:16,369 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem write lock held for 10180 ms via java.lang.Thread.getStackTrace(Thread.java:1559)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:273)
org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:225)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1614)
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:337)
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:449)
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:399)
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:416)
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:482)
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:412)
    Number of suppressed write-lock reports: 0
    Longest write-lock held interval: 10180.0
    Total suppressed write-lock held time: 0.0
2025-05-21 16:14:16,414 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Edit log tailer interrupted
java.lang.InterruptedException: sleep interrupted
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:469)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:399)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:416)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:482)
    at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:412)
2025-05-21 16:14:16,421 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for active state
2025-05-21 16:14:16,466 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Starting recovery process for unclosed journal segments...
2025-05-21 16:14:17,717 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:17,718 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:17,722 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:18,720 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:18,723 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:18,724 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:19,753 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:19,754 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:19,754 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:20,755 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:20,772 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:20,773 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:21,756 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:21,814 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:21,901 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:22,776 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:22,958 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:22,973 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:23,789 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:23,990 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:24,001 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:24,798 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:25,030 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:25,049 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:25,814 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:26,057 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:26,086 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:26,862 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node01/192.168.191.111:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:27,103 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node02/192.168.191.112:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:27,148 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: node03/192.168.191.113:8485. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2025-05-21 16:14:27,191 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [192.168.191.111:8485, 192.168.191.112:8485, 192.168.191.113:8485], stream=null))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
192.168.191.111:8485: Call From node01/192.168.191.111 to node01:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
192.168.191.113:8485: Call From node01/192.168.191.111 to node03:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
192.168.191.112:8485: Call From node01/192.168.191.111 to node02:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
    at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:286)
    at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.createNewUniqueEpoch(QuorumJournalManager.java:204)
    at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnfinalizedSegments(QuorumJournalManager.java:443)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet$6.apply(JournalSet.java:616)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:385)
    at org.apache.hadoop.hdfs.server.namenode.JournalSet.recoverUnfinalizedSegments(JournalSet.java:613)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.recoverUnclosedStreams(FSEditLog.java:1602)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1223)
    at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1887)
    at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
    at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:49)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1746)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1723)
    at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
    at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
2025-05-21 16:14:27,196 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [192.168.191.111:8485, 192.168.191.112:8485, 192.168.191.113:8485], stream=null))
2025-05-21 16:14:27,426 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at node01/192.168.191.111
************************************************************/

Why the error occurs

When we run start-dfs.sh, the default startup order is namenode => datanode => journalnode => zkfc. If the JournalNodes do not start on the same machine as the NameNode, network latency can easily leave the NN unable to connect to the JNs, the active/standby election cannot complete, and the freshly started active NameNode suddenly dies, leaving only a standby. The NameNode does retry at startup while waiting for the JNs to come up, but the retry count is limited, so under poor network conditions the retries can be exhausted before the JNs are ready and the startup fails. In the log above this shows as "Connection refused" on port 8485 (the JournalNode RPC port) for all three nodes.
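Before applying a fix, it is worth confirming that this really is the failure mode. A quick check, as a sketch using the node names above (jps ships with the JDK; netstat is the same tool the stop script below uses), is to verify on each node whether a JournalNode process exists and whether anything is listening on its RPC port 8485:

# Run on each of node01, node02, node03:
jps | grep JournalNode        # is a JournalNode JVM running at all?
netstat -tuln | grep 8485     # is anything listening on the JN RPC port?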

  • A: Manually restart the NameNode that died (the one that was active). This skips the wait for the JournalNodes over a slow network; once both NameNodes have connected to the JournalNodes and the election has completed, the failure no longer occurs. (Commands for A and B are sketched after this list.)

  • B: Start the JournalNodes first, then run start-dfs.sh.

  • C: Raise the NN's retry count (or timeout) toward the JNs so that ordinary startup delays and network latency can be tolerated.
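A minimal sketch of the commands for options A and B, assuming Hadoop 3.x (where hdfs --daemon replaces the older hadoop-daemon.sh; the same commands appear in the wrapper script later in this post) and the node names above:

# Option B: bring the JournalNodes up first, then start HDFS normally
ssh node01 "hdfs --daemon start journalnode"
ssh node02 "hdfs --daemon start journalnode"
ssh node03 "hdfs --daemon start journalnode"
start-dfs.sh

# Option A: if the active NameNode has already died, restart it on its host
hdfs --daemon start namenode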

Solution

  1. In hdfs-site.xml, add the retry count the NN uses when connecting to the JNs. The default is 10 retries at 1000 ms each, so it should be raised on a slow network; here it is set to 30 (a fuller snippet with the matching retry interval follows after this list):
    <property>
        <name>ipc.client.connect.max.retries</name>
        <value>30</value>
    </property>
  2. Start the JournalNodes first, then start DFS.
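As a hedged extension of step 1: ipc.client.connect.max.retries is a generic IPC client setting from core-default.xml, which pairs it with ipc.client.connect.retry.interval (default 1000 ms) for the sleep between attempts; both work from hdfs-site.xml because the NameNode loads it alongside core-site.xml. The 2000 ms interval below is an illustrative value, not one taken from the original setup:

    <property>
        <name>ipc.client.connect.max.retries</name>
        <value>30</value>
    </property>
    <property>
        <name>ipc.client.connect.retry.interval</name>
        <value>2000</value>
    </property>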

Wrapping it up: a simple cluster startup script

  • Start ZooKeeper first, then Hadoop, and then Spark and the rest, in that order.

Cluster startup script

#!/bin/bash

if [ $# -lt 1 ]
then
    echo "No Args Input..."
    exit 1
fi

# Start the ZooKeeper cluster
start_zookeeper() {
    echo " =================== Starting the zookeeper cluster ==================="
    # Start zookeeper on node01, node02, node03
    for node in node01 node02 node03
    do
        echo " --------------- Starting zookeeper on $node ---------------"
        ssh $node "/opt/yjx/zookeeper-3.4.5/bin/zkServer.sh start"
    done
    # Check zookeeper status on every node
    echo " --------------- Checking zookeeper status ---------------"
    for node in node01 node02 node03
    do
        echo "Checking zookeeper status on $node"
        ssh $node "/opt/yjx/zookeeper-3.4.5/bin/zkServer.sh status"
    done
}

# Stop the ZooKeeper cluster
stop_zookeeper() {
    echo " =================== Stopping the zookeeper cluster ==================="
    # Stop zookeeper on node01, node02, node03
    for node in node01 node02 node03
    do
        echo " --------------- Checking whether Zookeeper is running on $node ---------------"
        # Check whether any process is listening on port 2181
        process_check=$(ssh $node "netstat -tuln | grep 2181")
        if [ -z "$process_check" ]; then
            echo "Zookeeper is not running on $node, skipping stop"
        else
            echo "Zookeeper is running on $node, stopping it"
            ssh $node "/opt/yjx/zookeeper-3.4.5/bin/zkServer.sh stop"
        fi
    done
}

case $1 in
"start")
    # Start ZooKeeper first, then Hadoop
    start_zookeeper
    echo " =================== Starting the hadoop cluster ==================="
    #ssh node01 "/opt/yjx/hadoop-3.1.2/sbin/hadoop-daemon.sh start journalnode"
    #ssh node02 "/opt/yjx/hadoop-3.1.2/sbin/hadoop-daemon.sh start journalnode"
    #ssh node03 "/opt/yjx/hadoop-3.1.2/sbin/hadoop-daemon.sh start journalnode"
    echo " --------------- Starting journalnode ---------------"
    ssh node01 "hdfs --daemon start journalnode"
    ssh node02 "hdfs --daemon start journalnode"
    ssh node03 "hdfs --daemon start journalnode"
    echo " --------------- Starting hdfs ---------------"
    ssh node01 "/opt/yjx/hadoop-3.1.2/sbin/start-dfs.sh"
    echo " --------------- Starting yarn ---------------"
    ssh node01 "/opt/yjx/hadoop-3.1.2/sbin/start-yarn.sh"
    echo " --------------- Starting historyserver ---------------"
    ssh node01 "/opt/yjx/hadoop-3.1.2/bin/mapred --daemon start historyserver"
;;
"stop")
    # Stop Hadoop first, then ZooKeeper
    echo " =================== Stopping the hadoop cluster ==================="
    echo " --------------- Stopping historyserver ---------------"
    ssh node01 "/opt/yjx/hadoop-3.1.2/bin/mapred --daemon stop historyserver"
    echo " --------------- Stopping yarn ---------------"
    ssh node01 "/opt/yjx/hadoop-3.1.2/sbin/stop-yarn.sh"
    echo " --------------- Stopping hdfs ---------------"
    ssh node01 "/opt/yjx/hadoop-3.1.2/sbin/stop-dfs.sh"
    # Stop the ZooKeeper cluster
    stop_zookeeper
;;
*)
    echo "Input Args Error..."
;;
esac
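Assuming the script is saved under a hypothetical name such as cluster.sh and made executable, it is invoked like this:

chmod +x cluster.sh
./cluster.sh start   # ZooKeeper -> JournalNodes -> HDFS -> YARN -> history server
./cluster.sh stop    # history server -> YARN -> HDFS -> ZooKeeper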