Huawei Cloud User Guide

  • Issue background and symptom: Disk usage is uneven across the disks of a DataNode on a single node. For example: 189-39-235-71:~ # df -h Filesystem Size Used Avail Use% Mounted on /dev/xvda 360G 92G 250G 28% / /dev/xvdb 900G 700G 200G 78% /srv/BigData/hadoop/data1 /dev/xvdc 900G 700G 200G 78% /srv/BigData/hadoop/data2 /dev/xvdd 900G 700G 200G 78% /srv/BigData/hadoop/data3 /dev/xvde 900G 700G 200G 78% /srv/BigData/hadoop/data4 /dev/xvdf 900G 10G 890G 2% /srv/BigData/hadoop/data5 189-39-235-71:~ #
  • Solution: Change the DataNode volume choosing policy parameter dfs.datanode.fsdataset.volume.choosing.policy to org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy, save the change, and restart the affected services or instances. With this policy, the DataNode selects disks by remaining space and prefers the disks with more free space when storing data replicas, so data newly written to this DataNode goes preferentially to the disks with more free space. The usage of the disks that are already highly used will only decrease gradually, as the service deletes aged data from HDFS.
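For reference, a minimal sketch of the hdfs-site.xml entry this change corresponds to (parameter name and value are taken from the solution above; on an MRS cluster, apply the change through Manager rather than editing the file by hand):
<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>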
  • Issue background and symptom: When running distcp to copy files across clusters, some files fail to copy with the error "Source and target differ in block-size. Use -pb to preserve block-sizes during copy." Caused by: java.io.IOException: Check-sum mismatch between hdfs://10.180.144.7:25000/kylin/kylin_default_instance_prod/parquet/f2e72874-f01c-45ff-b219-207f3a5b3fcb/c769cd2d-575a-4459-837b-a19dd7b20c27/339114721280/0.parquettar and hdfs://10.180.180.194:25000/kylin/kylin_default_instance_prod/parquet/f2e72874-f01c-45ff-b219-207f3a5b3fcb/.distcp.tmp.attempt_1523424430246_0004_m_000019_2. Source and target differ in block-size. Use -pb to preserve block-sizes during copy. Alternatively, skip checksum-checks altogether, using -skipCrc. (NOTE: By skipping checksums, one runs the risk of masking data-corruption during file-transfer.) at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.compareCheckSums(RetriableFileCopyCommand.java:214)
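A hedged example of rerunning the copy with block sizes preserved, as the error message suggests (the source and target paths below are placeholders, not the actual paths from the failure):
hadoop distcp -pb hdfs://10.180.144.7:25000/source/dir hdfs://10.180.180.194:25000/target/dir
The -pb option keeps the source block size on the target, so the per-block checksum comparison no longer fails.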
  • Issue background and symptom: After a quota is set on a directory, writing a file to the directory fails with "The DiskSpace quota of /tmp/tquota2 is exceeded". [omm@189-39-150-115 client]$ hdfs dfs -put switchuser.py /tmp/tquota2 put: The DiskSpace quota of /tmp/tquota2 is exceeded: quota = 157286400 B = 150 MB but diskspace consumed = 402653184 B = 384 MB
  • Cause analysis: HDFS supports setting a space quota on a directory, which limits the maximum space that files under the directory may occupy. For example, the following command limits the "/tmp/tquota2" directory to 150 MB of written file data (file size multiplied by the number of replicas): hadoop dfsadmin -setSpaceQuota 150M /tmp/tquota2 The following command shows the quota set on a directory, where SPACE_QUOTA is the configured space quota and REM_SPACE_QUOTA is the remaining space quota: hdfs dfs -count -q -h -v /tmp/tquota2 Figure 1 Viewing the quota set on the directory. Log analysis: the following log shows that writing the file requires 384 MB while the current space quota is only 150 MB, so the quota is insufficient. Before a file is written, the space it requires is block size multiplied by the number of replicas: 128 MB * 3 replicas = 384 MB. [omm@189-39-150-115 client]$ [omm@189-39-150-115 client]$ hdfs dfs -put switchuser.py /tmp/tquota2 put: The DiskSpace quota of /tmp/tquota2 is exceeded: quota = 157286400 B = 150 MB but diskspace consumed = 402653184 B = 384 MB
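If the write is legitimate, the quota can be raised or cleared with hdfs dfsadmin; the 400M value below is only an illustration:
hdfs dfsadmin -setSpaceQuota 400M /tmp/tquota2    # raise the space quota (file size * replica count must fit)
hdfs dfsadmin -clrSpaceQuota /tmp/tquota2         # or remove the space quota entirely
hdfs dfs -count -q -h -v /tmp/tquota2             # verify SPACE_QUOTA and REM_SPACE_QUOTA afterwards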
  • Cause analysis: Data transfer between the HDFS client and server uses the RPC protocol, which supports multiple protection modes controlled by the hadoop.rpc.protection parameter. If the client and the server configure different values for hadoop.rpc.protection, the error No common protection layer between client and server is reported. hadoop.rpc.protection specifies how data is transferred between nodes, using one of the following modes. privacy: data is transferred after authentication and with encryption; this mode reduces performance. authentication: data is transferred after authentication only, without encryption; this mode preserves performance but carries security risks. integrity: data is transferred after authentication with an integrity check, but without encryption. To keep data secure, use the modes without encryption with caution.
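A minimal sketch of the relevant core-site.xml entry; the value shown is an assumption and simply has to match on both client and server (on an MRS cluster the server side is changed through Manager):
<property>
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>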
  • Issue background and symptom: HDFS operations from the shell client or other clients fail with "No common protection layer between client and server". On a machine outside the cluster, any hadoop command, such as hadoop fs -ls /, fails, and the lowest-level error is "No common protection layer between client and server". 2017-05-13 19:14:19,060 | ERROR | [pool-1-thread-1] | Server startup failure | org.apache.sqoop.core.SqoopServer.initializeServer(SqoopServer.java:69) org.apache.sqoop.common.SqoopException: MAPRED_EXEC_0028:Failed to operate HDFS - Failed to get the file /user/loader/etl_dirty_data_dir status at org.apache.sqoop.job.mr.HDFSClient.fileExist(HDFSClient.java:85) ... at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Failed on local exception: java.io.IOException: Couldn't setup connection for loader/hadoop@HADOOP.COM to loader37/10.162.0.37:25000; Host Details : local host is: "loader37/10.162.0.37"; destination host is: "loader37":25000; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776) ... ... 10 more Caused by: java.io.IOException: Couldn't setup connection for loader/hadoop@HADOOP.COM to loader37/10.162.0.37:25000 at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:674 ... 28 more Caused by: javax.security.sasl.SaslException: No common protection layer between client and server at com.sun.security.sasl.gsskerb.GssKrb5Client.doFinalHandshake(GssKrb5Client.java:251) ... at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:720)
  • Adjusting the HDFS shell client log level. Temporary adjustment: after the shell client window is closed, the log level reverts to the default. Run export HADOOP_ROOT_LOGGER=<log level>,console to adjust the shell client log level; for example, export HADOOP_ROOT_LOGGER=DEBUG,console sets it to DEBUG, and export HADOOP_ROOT_LOGGER=ERROR,console sets it to ERROR. Permanent adjustment: add "export HADOOP_ROOT_LOGGER=<log level>,console" to the HDFS client environment configuration file "/opt/client/HDFS/component_env" (replace "/opt/client" with the actual client installation path), run source /opt/client/bigdata_env, and then rerun the client command. Parent topic: Using HDFS
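A short usage sketch of the temporary adjustment (the /tmp path is only an example; INFO,console is the Hadoop default):
export HADOOP_ROOT_LOGGER=DEBUG,console
hdfs dfs -ls /tmp                         # command output now includes DEBUG-level client log messages
export HADOOP_ROOT_LOGGER=INFO,console    # restore the default level for this shell session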
  • Cause analysis: The client log or the NameNode run log "/var/log/Bigdata/hdfs/nn/hadoop-omm-namenode-XXX.log" contains the error The directory item limit of /tmp is exceeded:, which means that the number of items in the /tmp directory exceeds the limit of 1048576. 2018-03-14 11:18:21,625 | WARN | IPC Server handler 62 on 25000 | DIR* NameSystem.startFile: /tmp/test.txt The directory item limit of /tmp is exceeded: limit=1048576 items=1048577 | FSNamesystem.java:2334 The limit is controlled by the dfs.namenode.fs-limits.max-directory-items parameter, which defines the maximum number of directories or files (non-recursive) allowed in a single directory; the default value is 1048576 and the value range is 1 to 6400000.
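To check how many items the directory actually holds, a rough count of its direct children can be taken (the limit in this error is per directory, non-recursive):
hdfs dfs -ls /tmp | wc -l     # approximate number of direct children of /tmp (includes one "Found N items" header line)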
  • Cause analysis: The log FileNotFoundException...No lease on...File does not exist indicates that the file was deleted while it was being operated on. Search the HDFS NameNode audit log (/var/log/Bigdata/audit/hdfs/nn/hdfs-audit-namenode.log on the active NameNode) for the file name to confirm when the file was created. Then search the NameNode audit logs covering the time range from the file creation to the exception to check whether the file was deleted or moved to another directory. If the file itself was not deleted or moved, its parent directory or a higher-level directory may have been deleted or moved, so continue searching the upper-level directories; in this example, the parent directory of the file was deleted. 2017-05-31 02:04:08,286 | INFO | IPC Server handler 30 on 25000 | allowed=true ugi=appUser@HADOOP.COM (auth:TOKEN) ip=/192.168.1.22 cmd=delete src=/user/sparkhive/warehouse/daas/dsp/output/_temporary dst=null perm=null proto=rpc | FSNamesystem.java:8189 The log above shows that the appUser user on node 192.168.1.22 deleted /user/sparkhive/warehouse/daas/dsp/output/_temporary. You can run zgrep "<file name>" *.zip to search the contents of zipped logs, as sketched below.
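A hedged sketch of the audit-log search described above (the audit log file names are assumptions based on the path given in the analysis; the file and directory names are taken from this example):
cd /var/log/Bigdata/audit/hdfs/nn
grep "part-r-00007" hdfs-audit-namenode.log | grep "cmd=create"        # find the creation record of the file
zgrep "_temporary" hdfs-audit-namenode.log*.zip | grep "cmd=delete"    # search rotated zipped logs for a delete of the parent directory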
  • Issue background and symptom: All map tasks of a MapReduce job succeed, but the reduce tasks fail, and the log contains the exception "FileNotFoundException...No lease on...File does not exist". Error: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): No lease on /user/sparkhive/warehouse/daas/dsp/output/_temporary/1/_temporary/attempt_1479799053892_17075_r_000007_0/part-r-00007 (inode 6501287): File does not exist. Holder DFSClient_attempt_1479799053892_17075_r_000007_0_-1463597952_1 does not have any open files. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3350) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3442) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3409) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:789)
  • Cause analysis: The NameNode log "/var/log/Bigdata/hdfs/nn/hadoop-omm-namenode-<hostname>.log" shows that writes to the file were attempted repeatedly until they finally failed. 2015-07-13 10:05:07,847 | WARN | org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@36fea922 | DIR* NameSystem.internalReleaseLease: Failed to release lease for file /hive/order/OS_ORDER._8.txt._COPYING_. Committed blocks are waiting to be minimally replicated. Try again later. | FSNamesystem.java:3936 2015-07-13 10:05:07,847 | ERROR | org.apache.hadoop.hdfs.server.namenode.LeaseManager$Monitor@36fea922 | Cannot release the path /hive/order/OS_ORDER._8.txt._COPYING_ in the lease [Lease. Holder: DFSClient_NONMAPREDUCE_-1872896146_1, pendingcreates: 1] | LeaseManager.java:459 org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: DIR* NameSystem.internalReleaseLease: Failed to release lease for file /hive/order/OS_ORDER._8.txt._COPYING_. Committed blocks are waiting to be minimally replicated. Try again later. at FSNamesystem.internalReleaseLease(FSNamesystem.java:3937) Root cause analysis: the uploaded file is corrupted, so the upload fails. Verification: copying the file with cp or scp also fails, which confirms that the file itself is corrupted.
  • Issue background and symptom: Closing a file written by the HDFS client fails, and the client reports that the last data block does not have enough replicas. Client log: 2015-05-27 19:00:52.811 [pool-2-thread-3] ERROR: /tsp/nedata/collect/UGW/ugwufdr/20150527/10/6_20150527105000_20150527105500_SR5S14_1432723806338_128_11.pkg.tmp1432723806338 close hdfs sequence file fail (SequenceFileInfoChannel.java:444) java.io.IOException: Unable to close file because the last block does not have enough number of replicas. at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2160) at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2128) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70) at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103) at com.huawei.pai.collect2.stream.SequenceFileInfoChannel.close(SequenceFileInfoChannel.java:433) at com.huawei.pai.collect2.stream.SequenceFileWriterToolChannel$FileCloseTask.call(SequenceFileWriterToolChannel.java:804) at com.huawei.pai.collect2.stream.SequenceFileWriterToolChannel$FileCloseTask.call(SequenceFileWriterToolChannel.java:792) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)
  • Cause analysis: The HDFS client starts writing a block. For example, the HDFS client started writing /20150527/10/6_20150527105000_20150527105500_SR5S14_1432723806338_128_11.pkg.tmp1432723806338 at 2015-05-27 18:50:24,232, and the allocated block is blk_1099105501_25370893. 2015-05-27 18:50:24,232 | INFO | IPC Server handler 30 on 25000 | BLOCK* allocateBlock: /20150527/10/6_20150527105000_20150527105500_SR5S14_1432723806338_128_11.pkg.tmp1432723806338. BP-1803470917-192.168.57.33-1428597734132 blk_1099105501_25370893{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-b2d7b7d0-f410-4958-8eba-6deecbca2f87:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-76bd80e7-ad58-49c6-bf2c-03f91caf750f:NORMAL|RBW]]} | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveAllocatedBlock(FSNamesystem.java:3166) After writing, the HDFS client called fsync. 2015-05-27 19:00:22,717 | INFO | IPC Server handler 22 on 25000 | BLOCK* fsync: 20150527/10/6_20150527105000_20150527105500_SR5S14_1432723806338_128_11.pkg.tmp1432723806338 for DFSClient_NONMAPREDUCE_-120525246_15 | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.fsync(FSNamesystem.java:3805) The HDFS client then calls close to close the file. After receiving the close request, the NameNode checks the completion state of the last block; the file can be closed only after enough DataNodes have reported the block as complete. The completion state is checked by the checkFileProgress function, which prints the following: 2015-05-27 19:00:27,603 | INFO | IPC Server handler 44 on 25000 | BLOCK* checkFileProgress: blk_1099105501_25370893{blockUCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ef5fd3c9-5088-4813-ae9a-34a0714ec3a3:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-f863e30f-ce5b-48cc-9cca-72f64c558adc:NORMAL|RBW]]} has not reached minimal replication 1 | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkFileProgress(FSNamesystem.java:3197) 2015-05-27 19:00:28,005 | INFO | IPC Server handler 45 on 25000 | BLOCK* checkFileProgress: blk_1099105501_25370893{blockUCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ef5fd3c9-5088-4813-ae9a-34a0714ec3a3:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-f863e30f-ce5b-48cc-9cca-72f64c558adc:NORMAL|RBW]]} has not reached minimal replication 1 | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkFileProgress(FSNamesystem.java:3197) 2015-05-27 19:00:28,806 | INFO | IPC Server handler 63 on 25000 | BLOCK* checkFileProgress: blk_1099105501_25370893{blockUCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ef5fd3c9-5088-4813-ae9a-34a0714ec3a3:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-f863e30f-ce5b-48cc-9cca-72f64c558adc:NORMAL|RBW]]} has not reached minimal replication 1 | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkFileProgress(FSNamesystem.java:3197) 2015-05-27 19:00:30,408 | INFO | IPC Server handler 43 on 25000 | BLOCK* checkFileProgress: blk_1099105501_25370893{blockUCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ef5fd3c9-5088-4813-ae9a-34a0714ec3a3:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-f863e30f-ce5b-48cc-9cca-72f64c558adc:NORMAL|RBW]]} has not reached minimal replication 1 | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkFileProgress(FSNamesystem.java:3197) 2015-05-27 19:00:33,610 | INFO | IPC Server handler 37 on 25000 | BLOCK* checkFileProgress: blk_1099105501_25370893{blockUCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ef5fd3c9-5088-4813-ae9a-34a0714ec3a3:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-f863e30f-ce5b-48cc-9cca-72f64c558adc:NORMAL|RBW]]} has not reached minimal replication 1 | 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkFileProgress(FSNamesystem.java:3197) 2015-05-27 19:00:40,011 | INFO | IPC Server handler 37 on 25000 | BLOCK* checkFileProgress: blk_1099105501_25370893{blockUCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ef5fd3c9-5088-4813-ae9a-34a0714ec3a3:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-f863e30f-ce5b-48cc-9cca-72f64c558adc:NORMAL|RBW]]} has not reached minimal replication 1 | org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkFileProgress(FSNamesystem.java:3197) The NameNode printed checkFileProgress multiple times because the HDFS client retried the close several times, and each close failed because the block state did not yet meet the requirement. The number of client retries is determined by the dfs.client.block.write.locateFollowingBlock.retries parameter, which defaults to 5, so checkFileProgress appears 6 times in the NameNode log. However, about 0.5 s later the DataNodes reported that the block had been written successfully. 2015-05-27 19:00:40,608 | INFO | IPC Server handler 60 on 25000 | BLOCK* addStoredBlock: blockMap updated: 192.168.10.21:25009 is added to blk_1099105501_25370893{blockUCState=COMMITTED, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-ef5fd3c9-5088-4813-ae9a-34a0714ec3a3:NORMAL|RBW], ReplicaUnderConstruction[[DISK]DS-f863e30f-ce5b-48cc-9cca-72f64c558adc:NORMAL|RBW]]} size 11837530 | org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.logAddStoredBlock(BlockManager.java:2393) 2015-05-27 19:00:48,297 | INFO | IPC Server handler 37 on 25000 | BLOCK* addStoredBlock: blockMap updated: 192.168.10.10:25009 is added to blk_1099105501_25370893 size 11837530 | org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.logAddStoredBlock(BlockManager.java:2393) Possible reasons for the delayed block-written notification from the DataNodes include network bottlenecks and CPU bottlenecks. If close were called again at that point, or if more close retries were allowed, close would return successfully. It is therefore recommended to appropriately increase the dfs.client.block.write.locateFollowingBlock.retries parameter (default 5); the retry intervals are 400 ms, 800 ms, 1600 ms, 3200 ms, 6400 ms, and 12800 ms, so the close call can take up to 25.2 seconds to return.
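A hedged sketch of raising the client-side retry count in the client's hdfs-site.xml (the value 10 is only an illustration; larger values lengthen the worst-case close time accordingly):
<property>
  <name>dfs.client.block.write.locateFollowingBlock.retries</name>
  <value>10</value>
</property>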
  • Solution: Run ulimit -a to check the maximum number of file handles configured on the affected node; if the value is small, change it to 640000. Figure 1 Checking the number of file handles. Run vi /etc/security/limits.d/90-nofile.conf to edit the file handle setting; if this file does not exist, create it and modify it as shown below. Figure 2 Modifying the number of file handles. Open a new terminal window and run ulimit -a to check whether the change took effect; if not, repeat the steps above. Restart the DataNode instances from the Manager page.
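The figure is not reproduced here; typical /etc/security/limits.d/90-nofile.conf content, assuming the 640000 value recommended above, would look like this:
# raise the open-file limit for all users (soft and hard)
*  soft  nofile  640000
*  hard  nofile  640000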
  • Cause analysis: The DataNode log "/var/log/Bigdata/hdfs/dn/hadoop-omm-datanode-XXX.log" contains the exception java.io.IOException: Too many open files. 2016-05-19 17:18:59,126 | WARN | org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@142ff9fa | YSDN12:25009:DataXceiverServer: | org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:160) java.io.IOException: Too many open files at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:241) at sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:100) at org.apache.hadoop.hdfs.net.TcpPeerServer.accept(TcpPeerServer.java:134) at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:137) at java.lang.Thread.run(Thread.java:745) If a DataNode log prints "Too many open files", that node has run out of file handles, so opening file handles fails, the client retries writing the data to other DataNodes, and the overall symptom is that writing files is very slow or fails.
  • Cause analysis: The NameNode native web UI shows a large number of missing blocks. Figure 1 Missing blocks. The Datanode Information page of the native UI shows 10 fewer DataNodes than actually exist. Figure 2 Checking the number of DataNodes. The DataNode run log "/var/log/Bigdata/hdfs/dn/hadoop-omm-datanode-<hostname>.log" contains the following error; the key error message is Clock skew too great. Figure 3 DataNode run log error.
  • Cause analysis: Searching for "WARN" in the NameNode run log (/var/log/Bigdata/hdfs/nn/hadoop-omm-namenode-XXX.log) shows that a large amount of time is spent in garbage collection; in the following example a pause of about 63 seconds was detected. 2017-01-22 14:52:32,641 | WARN | org.apache.hadoop.util.JvmPauseMonitor$Monitor@1b39fd82 | Detected pause in JVM or host machine (eg GC): pause of approximately 63750ms GC pool 'ParNew' had collection(s): count=1 time=0ms GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=63924ms | JvmPauseMonitor.java:189 Analysis of the NameNode log "/var/log/Bigdata/hdfs/nn/hadoop-omm-namenode-XXX.log" shows that the NameNode is waiting for block reports and that the total number of blocks is very large, about 36.29 million in the following example. 2017-01-22 14:52:32,641 | INFO | IPC Server handler 8 on 25000 | STATE* Safe mode ON. The reported blocks 29715437 needs additional 6542184 blocks to reach the threshold 0.9990 of total blocks 36293915. Open the Manager page and check the GC_OPTS parameter configured for the NameNode. Figure 1 Checking the GC_OPTS parameter of the NameNode. The mapping between NameNode memory configuration and data volume is listed in Table 1. Table 1 NameNode memory configuration versus data volume (number of file objects: reference value): 10,000,000: "-Xms6G -Xmx6G -XX:NewSize=512M -XX:MaxNewSize=512M"; 20,000,000: "-Xms12G -Xmx12G -XX:NewSize=1G -XX:MaxNewSize=1G"; 50,000,000: "-Xms32G -Xmx32G -XX:NewSize=2G -XX:MaxNewSize=3G"; 100,000,000: "-Xms64G -Xmx64G -XX:NewSize=4G -XX:MaxNewSize=6G"; 200,000,000: "-Xms96G -Xmx96G -XX:NewSize=8G -XX:MaxNewSize=9G"; 300,000,000: "-Xms164G -Xmx164G -XX:NewSize=12G -XX:MaxNewSize=12G".
  • Solution: This solution uses port 20051 as an example; the solution when port 20050 is occupied is similar. Log in as the root user to the node where the DBService installation error is reported and run netstat -nap | grep 20051 to find the process occupying port 20051. Use the kill command to forcibly terminate the process using port 20051. About 2 minutes later, run netstat -nap | grep 20051 again to check whether any process still occupies the port. Identify the service the occupying process belongs to and change that service to another port. Run find . -name "*20051*" in the "/tmp" and "/var/run/MRS-DBService/" directories respectively and delete all files found, as sketched below. Log in to Manager and restart the DBService service.
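A hedged command sketch of the port check and cleanup steps above (the PID is hypothetical):
netstat -nap | grep 20051                 # find the process occupying port 20051
kill -9 <pid>                             # forcibly terminate it
cd /tmp && find . -name "*20051*" -exec rm -f {} \;
cd /var/run/MRS-DBService/ && find . -name "*20051*" -exec rm -f {} \;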
  • Cause analysis: The HMaster log (/var/log/Bigdata/hbase/hm/hbase-omm-xxx.log) shows that the sum of hbase.regionserver.global.memstore.size and hfile.block.cache.size is greater than 0.8, which prevents startup; the configured values need to be adjusted so that their sum is below 0.8. The out logs of HMaster and RegionServer (/var/log/Bigdata/hbase/hm/hbase-omm-xxx.out and /var/log/Bigdata/hbase/rs/hbase-omm-xxx.out) report Unrecognized VM option. Unrecognized VM option Error: Could not create the Java Virtual Machine. Error: A fatal exception has occurred. Program will exit. Checking the GC_OPTS parameters reveals an extra space, for example -D sun.rmi.dgc.server.gcInterval=0x7FFFFFFFFFFFFFE.
  • Cause analysis: This problem is usually caused by slow HDFS performance, which makes the health check time out and triggers the monitoring alarm. It can be confirmed as follows. First check the HMaster log ("/var/log/Bigdata/hbase/hm/hbase-omm-xxx.log") and confirm that it does not frequently print GC-related messages such as "system pause" or "jvm". Then confirm that the alarm is caused by slow HDFS performance in any of the following three ways: use a client to verify by entering the HBase command line with hbase shell, running list, and checking how long it takes; enable HDFS debug logging, list a path that has many subdirectories (hadoop fs -ls /XXX/XXX), and check how long it takes; or print a jstack of the HMaster process: su - omm jps jstack pid As shown in the figure below, the jstack output is stuck in DFSClient.listPaths. Figure 1 Exception
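A hedged sketch of the jstack check described above (the HMaster PID comes from the jps output; the grep pattern matches the stack frame mentioned in the analysis):
su - omm
jps | grep HMaster                                       # note the PID printed in the first column
jstack <pid> | grep -B 5 -A 20 "DFSClient.listPaths"     # threads stuck here point to HDFS as the bottleneck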
  • Cause analysis: Check the RegionServer log (/var/log/Bigdata/hbase/rs/hbase-omm-xxx.log). Run lsof -i:21302 (the port number is 16020 in MRS 1.7.X and later) to get the PID, then check the corresponding process by PID; the RegionServer port turns out to be occupied by DFSZkFailoverController. "/proc/sys/net/ipv4/ip_local_port_range" shows "9000 65500", so the ephemeral port range overlaps with the MRS product port range, because the preinstall operation was not performed during installation.
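A hedged sketch of checking and narrowing the ephemeral port range so that it no longer overlaps MRS service ports (the 32768-61000 range is an assumption, not an MRS-mandated value; running preinstall normally configures this):
cat /proc/sys/net/ipv4/ip_local_port_range                              # currently "9000 65500" in this case
echo "32768 61000" > /proc/sys/net/ipv4/ip_local_port_range             # apply immediately
echo "net.ipv4.ip_local_port_range = 32768 61000" >> /etc/sysctl.conf   # persist across reboots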
  • Solution: Modify or add the custom configuration "allow.everyone.if.no.acl.found" and set it to "true", then restart the Kafka service. Alternatively, delete the ACLs set on the topic. For example: kinit test_user This must be done as a Kafka administrator user (a member of the kafkaadmin group). For example: kafka-acls.sh --authorizer-properties zookeeper.connect=10.5.144.2:2181/kafka --remove --allow-principal User:test_user --producer --topic topic_acl kafka-acls.sh --authorizer-properties zookeeper.connect=10.5.144.2:2181/kafka --remove --allow-principal User:test_user --consumer --topic topic_acl --group test
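To verify the removal, the remaining ACLs on the topic can be listed with the same ZooKeeper address as above:
kafka-acls.sh --authorizer-properties zookeeper.connect=10.5.144.2:2181/kafka --list --topic topic_acl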
  • Symptom: When Spark Streaming is used to consume messages of a specified topic in Kafka, no data can be obtained from Kafka and the following error is reported: Error getting partition metadata. Exception in thread "main" org.apache.spark.SparkException: Error getting partition metadata for 'testtopic'. Does the topic exist? org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366) org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366) scala.util.Either.fold(Either.scala:97) org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365) org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:422) com.xxx.bigdata.spark.examples.FemaleInfoCollectionPrint$.main(FemaleInfoCollectionPrint.scala:45) com.xxx.bigdata.spark.examples.FemaleInfoCollectionPrint.main(FemaleInfoCollectionPrint.scala) sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) java.lang.reflect.Method.invoke(Method.java:498) org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:762) org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:183) org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:208) org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:123) org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
  • Solution: Set the custom parameter "allow.everyone.if.no.acl.found" to "true" and restart the Kafka service. Alternatively, log in as a user that has the required permissions, for example: kinit test_user Or grant the user the required permissions; this must be done as a Kafka administrator user (a member of the kafkaadmin group). For example: kafka-acls.sh --authorizer-properties zookeeper.connect=10.5.144.2:2181/kafka --topic topic_acl --consumer --add --allow-principal User:test --group test [root@10-10-144-2 client]# kafka-acls.sh --authorizer-properties zookeeper.connect=8.5.144.2:2181/kafka --list --topic topic_acl Current ACLs for resource `Topic:topic_acl`: User:test_user has Allow permission for operations: Describe from hosts: * User:test_user has Allow permission for operations: Write from hosts: * User:test has Allow permission for operations: Describe from hosts: * User:test has Allow permission for operations: Write from hosts: * User:test has Allow permission for operations: Read from hosts: * Or add the user to the kafka group or the kafkaadmin group.
  • Solution: Set the custom configuration "allow.everyone.if.no.acl.found" to "true" and restart the Kafka service. Alternatively, log in as a user that has the required permissions, for example: kinit test_user Or grant the user the required permissions; this must be done as a Kafka administrator user (a member of the kafkaadmin group). For example: kafka-acls.sh --authorizer-properties zookeeper.connect=10.5.144.2:2181/kafka --topic topic_acl --producer --add --allow-principal User:test [root@10-10-144-2 client]# kafka-acls.sh --authorizer-properties zookeeper.connect=8.5.144.2:2181/kafka --list --topic topic_acl Current ACLs for resource `Topic:topic_acl`: User:test_user has Allow permission for operations: Describe from hosts: * User:test_user has Allow permission for operations: Write from hosts: * User:test has Allow permission for operations: Describe from hosts: * User:test has Allow permission for operations: Write from hosts: * Or add the user to the kafka group or the kafkaadmin group.
  • Solution: Appropriately increase the heap memory (-Xmx) value. Compare file and directory permissions with a node where Flume starts normally and correct any incorrect file or directory permissions. Reconfigure JAVA_HOME: on the client, replace the JAVA_HOME value in the "${install_home}/fusioninsight-flume-<Flume component version>/conf/ENV_VARS" file; on the server, replace the JAVA_HOME value in the "ENV_VARS" file in the "etc" directory. The correct JAVA_HOME value can be obtained by logging in to a node where Flume starts normally and running echo ${JAVA_HOME}. ${install_home} is the installation path of the Flume client.
  • Cause analysis: The Flume heap memory is set to a value larger than the remaining memory of the machine; check the Flume startup log: [CST 2019-02-26 13:31:43][INFO] [[checkMemoryValidity:124]] [GC_OPTS is invalid: Xmx(40960000MB) is bigger than the free memory(56118MB) in system.] [9928] Flume file or directory permissions are abnormal; the UI or backend reports the following: [2019-02-26 13:38:02]RoleInstance prepare to start failure [{ScriptExecutionResult=ScriptExecutionResult [exitCode=126, output=, errMsg=sh: line 1: /opt/Bigdata/MRS_XXX/install/FusionInsight-Flume-1.9.0/flume/bin/flume-manage.sh: Permission denied JAVA_HOME is configured incorrectly; check the Flume agent startup log: Info: Sourcing environment configuration script /opt/FlumeClient/fusioninsight-flume-1.9.0/conf/flume-env.sh + '[' -n '' ']' + exec /tmp/MRS-Client/MRS_Flume_ClientConfig/JDK/jdk-8u18/bin/java '-XX:OnOutOfMemoryError=bash /opt/FlumeClient/fusioninsight-flume-1.9.0/bin/out_memory_error.sh /opt/FlumeClient/fusioninsight-flume-1.9.0/conf %p' -Xms2G -Xmx4G -XX:CMSFullGCsBeforeCompaction=1 -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSCompactAtFullCollection -Dkerberos.domain.name=hadoop.hadoop.com -verbose:gc -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/Bigdata//flume-client-1/flume/flume-root-20190226134231-%p-gc.log -Dproc_org.apache.flume.node.Application -Dproc_name=client -Dproc_conf_file=/opt/FlumeClient/fusioninsight-flume-1.9.0/conf/properties.properties -Djava.security.krb5.conf=/opt/FlumeClient/fusioninsight-flume-1.9.0/conf//krb5.conf -Djava.security.auth.login.config=/opt/FlumeClient/fusioninsight-flume-1.9.0/conf//jaas.conf -Dzookeeper.server.principal=zookeeper/hadoop.hadoop.com -Dzookeeper.request.timeout=120000 -Dflume.instance.id=884174180 -Dflume.agent.name=clientName1 -Dflume.role=client -Dlog4j.configuration.watch=true -Dlog4j.configuration=log4j.properties -Dflume_log_dir=/var/log/Bigdata//flume-client-1/flume/ -Dflume.service.id=flume-client-1 -Dbeetle.application.home.path=/opt/FlumeClient/fusioninsight-flume-1.9.0/conf/service -Dflume.called.from.service -Dflume.conf.dir=/opt/FlumeClient/fusioninsight-flume-1.9.0/conf -Dflume.metric.conf.dir=/opt/FlumeClient/fusioninsight-flume-1.9.0/conf -Dflume.script.home=/opt/FlumeClient/fusioninsight-flume-1.9.0/bin -cp '/opt/FlumeClient/fusioninsight-flume-1.9.0/conf:/opt/FlumeClient/fusioninsight-flume-1.9.0/lib/*:/opt/FlumeClient/fusioninsight-flume-1.9.0/conf/service/' -Djava.library.path=/opt/FlumeClient/fusioninsight-flume-1.9.0/plugins.d/native/native org.apache.flume.node.Application --conf-file /opt/FlumeClient/fusioninsight-flume-1.9.0/conf/properties.properties --name client /opt/FlumeClient/fusioninsight-flume-1.9.0/bin/flume-ng: line 233: /tmp/FusionInsight-Client/Flume/FusionInsight_Flume_ClientConfig/JDK/jdk-8u18/bin/java: No such file or directory
  • Cause analysis: Flume file or directory permissions are abnormal; after the restart, the Manager page reports the following: [2019-02-26 13:38:02]RoleInstance prepare to start failure [{ScriptExecutionResult=ScriptExecutionResult [exitCode=126, output=, errMsg=sh: line 1: /opt/Bigdata/MRS_XXX/install/FusionInsight-Flume-1.9.0/flume/bin/flume-manage.sh: Permission denied
  • Cause analysis: The server is configured incorrectly and the listening port fails to start, for example the server-side Avro Source is configured with a wrong IP address or with a port that is already in use. Check the Flume run log: 2016-08-31 17:28:42,092 | ERROR | [lifecycleSupervisor-1-9] | Unable to start EventDrivenSourceRunner: { source:Avro source avro_source: { bindAddress: 10.120.205.7, port: 21154 } } - Exception follows. | org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:253) java.lang.RuntimeException: org.jboss.netty.channel.ChannelException: Failed to bind to: /192.168.205.7:21154 If encrypted transmission is used, the certificate or password may be wrong: 2016-08-31 17:15:59,593 | ERROR | [conf-file-poller-0] | Source avro_source has been removed due to an error during configuration | org.apache.flume.node.AbstractConfigurationProvider.loadSources(AbstractConfigurationProvider.java:388) org.apache.flume.FlumeException: Avro source configured with invalid keystore: /opt/Bigdata/MRS_XXX/install/FusionInsight-Flume-1.9.0/flume/conf/flume_sChat.jks Or the client cannot communicate with the server: PING 192.168.85.55 (10.120.85.55) 56(84) bytes of data. From 192.168.85.50 icmp_seq=1 Destination Host Unreachable From 192.168.85.50 icmp_seq=2 Destination Host Unreachable From 192.168.85.50 icmp_seq=3 Destination Host Unreachable From 192.168.85.50 icmp_seq=4 Destination Host Unreachable