华为云用户手册

  • 原因分析 修改“$Flink_HOME/conf”目录下的“log4j.properties”文件,控制的是JobManager和TaskManager的算子内的日志输出,输出的日志会打印到对应的yarn contain中,可以在Yarn WebUI查看对应日志。 MRS 3.1.0及之后版本的Flink 1.12.0版本开始默认的日志框架是log4j2,配置的方式跟之前log4j的方式有区别,使用如log4j日志规则不会生效。
  • 处理步骤 分别登录主、备Master节点。 执行cd /srv/BigData/命令进入到备份文件所在目录。 执行unlink LocalBackup命令删除LocalBackup软连接。 执行mkdir -p LocalBackup命令创建LocalBackup目录。 执行chown -R omm:wheel LocalBackup命令修改文件所属用户、群组。 执行chmod 700 LocalBackup命令修改文件读写权限。 登录MRS Manager页面重新执行周期备份。
  • 解决办法 停止HBase组件。 在HBase客户端使用hbase用户登录认证,执行如下命令。 例如: hadoop03:~ # source /opt/client/bigdata_envhadoop03:~ # kinit hbasePassword for hbase@HADOOP.COM: hadoop03:~ # hbase zkcli 删除zk中acl表信息。 例如: [zk: hadoop01:24002,hadoop02:24002,hadoop03:24002(CONNECTED) 0] deleteall /hbase/table/hbase:acl[zk: hadoop01:24002,hadoop02:24002,hadoop03:24002(CONNECTED) 0] deleteall /hbase/table-lock/hbase:acl 启动HBase组件。
  • 问题背景与现象 在使用Kafka客户端命令创建Topic时,发现创建Topic Partition的Leader显示为none。 [root@10-10-144-2 client]# kafka-topics.sh --create --replication-factor 1 --partitions 2 --topic test --zookeeper 10.6.92.36:2181/kafkaCreated topic "test". [root@10-10-144-2 client]# kafka-topics.sh --describe --zookeeper 10.6.92.36:2181/kafkaTopic:test PartitionCount:2 ReplicationFactor:2 Configs: Topic: test Partition: 0 Leader: none Replicas: 2,3 Isr: Topic: test Partition: 1 Leader: none Replicas: 3,1 Isr:
  • 原因分析 查看HMaster的运行日志,发现有报大量的如下错误: 2018-03-26 11:10:54,185 | INFO | hadoopc1h3,21300,1522031630949_splitLogManager__ChoreService_1 | total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2Fhadoopc1h1%2C21302%2C1520214023667-splitting%2Fhadoopc1h1%252C21302%252C1520214023667.default.1520584926990=last_update = 1522033841041 last_version = 34255 cur_worker_name = hadoopc1h3,21302,1520943011826 status = in_progress incarnation = 3 resubmits = 3 batch = installed = 1 done = 0 error = 0} | org.apache.hadoop.hbase.master.SplitLogManager$TimeoutMonitor.chore(SplitLogManager.java:745)2018-03-26 11:11:00,185 | INFO | hadoopc1h3,21300,1522031630949_splitLogManager__ChoreService_1 | total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2Fhadoopc1h1%2C21302%2C1520214023667-splitting%2Fhadoopc1h1%252C21302%252C1520214023667.default.1520584926990=last_update = 1522033841041 last_version = 34255 cur_worker_name = hadoopc1h3,21302,1520943011826 status = in_progress incarnation = 3 resubmits = 3 batch = installed = 1 done = 0 error = 0} | org.apache.hadoop.hbase.master.SplitLogManager$TimeoutMonitor.chore(SplitLogManager.java:745)2018-03-26 11:11:06,185 | INFO | hadoopc1h3,21300,1522031630949_splitLogManager__ChoreService_1 | total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2Fhadoopc1h1%2C21302%2C1520214023667-splitting%2Fhadoopc1h1%252C21302%252C1520214023667.default.1520584926990=last_update = 1522033841041 last_version = 34255 cur_worker_name = hadoopc1h3,21302,1520943011826 status = in_progress incarnation = 3 resubmits = 3 batch = installed = 1 done = 0 error = 0} | org.apache.hadoop.hbase.master.SplitLogManager$TimeoutMonitor.chore(SplitLogManager.java:745)2018-03-26 11:11:10,787 | INFO | RpcServer.reader=9,bindAddress=hadoopc1h3,port=21300 | Kerberos principal name is hbase/hadoop.hadoop.com@HADOOP.COM | org.apache.hadoop.hbase.ipc.RpcServer$Connection.readPreamble(RpcServer.java:1532)2018-03-26 11:11:12,185 | INFO | hadoopc1h3,21300,1522031630949_splitLogManager__ChoreService_1 | total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2Fhadoopc1h1%2C21302%2C1520214023667-splitting%2Fhadoopc1h1%252C21302%252C1520214023667.default.1520584926990=last_update = 1522033841041 last_version = 34255 cur_worker_name = hadoopc1h3,21302,1520943011826 status = in_progress incarnation = 3 resubmits = 3 batch = installed = 1 done = 0 error = 0} | org.apache.hadoop.hbase.master.SplitLogManager$TimeoutMonitor.chore(SplitLogManager.java:745)2018-03-26 11:11:18,185 | INFO | hadoopc1h3,21300,1522031630949_splitLogManager__ChoreService_1 | total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2Fhadoopc1h1%2C21302%2C1520214023667-splitting%2Fhadoopc1h1%252C21302%252C1520214023667.default.1520584926990=last_update = 1522033841041 last_version = 34255 cur_worker_name = hadoopc1h3,21302,1520943011826 status = in_progress incarnation = 3 resubmits = 3 batch = installed = 1 done = 0 error = 0} | org.apache.hadoop.hbase.master.SplitLogManager$TimeoutMonitor.chore(SplitLogManager.java:745) 节点上下电,RegionServer的wal分裂失败导致。
  • 解决办法 执行select count(*) from table_name;前确认需要查询的数据量大小,确认是否需要在beeline中显示如此数量级的数据。 如数量在一定范围内需要显示,请调整hive客户端的jvm参数, 在hive客户端目录/Hive下的component_env中添加export HIVE_OPTS=-Xmx1024M(具体数值请根据业务调整),并重新执行source 客户端目录/bigdata_env配置环境变量。
  • 解决办法 停止HBase组件。 通过hdfs fsck命令检查/hbase/WALs文件的健康状态。 hdfs fsck /hbase/WALs 输出如下表示文件都正常,如果有异常则需要先处理异常的文件,再执行后面的操作。 The filesystem under path '/hbase/WALs' is HEALTHY 备份/hbase/WALs文件。 hdfs dfs -mv /hbase/WALs /hbase/WALs_old 新建/hbase/WALs目录。 hdfs dfs -mkdir /hbase/WALs 必须保证路径权限是hbase:hadoop。 启动HBase组件。
  • 解决方案 DBservice服务不可用请参考ALM-27001 DBService服务不可用。 HDFS服务不可用请参考ALM-14000 HDFS服务不可用。 ZooKeeper服务不可用请参考ALM-13000 ZooKeeper服务不可用。 LDAP/KrbServer服务不可用请参考ALM-25000 LdapServer服务不可用/ALM-25500 KrbServer服务不可用。 MetaStore实例不可用请参考ALM-16004 Hive服务不可用。
  • 原因分析 使用客户端命令,打印NoAuthException异常。 Error while executing topic command org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /config/topicsorg.I0Itec.zkclient.exception.ZkException: org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /config/topics at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:68) at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:685) at org.I0Itec.zkclient.ZkClient.create(ZkClient.java:304) at org.I0Itec.zkclient.ZkClient.createPersistent(ZkClient.java:213) at kafka.utils.ZkUtils$.createParentPath(ZkUtils.scala:215) at kafka.utils.ZkUtils$.updatePersistentPath(ZkUtils.scala:338) at kafka.admin.AdminUtils$.writeTopicConfig(AdminUtils.scala:247) 通过客户端命令klist查询当前认证用户: [root@10-10-144-2 client]# klistTicket cache: FILE:/tmp/krb5cc_0Default principal: test@HADOOP.COMValid starting Expires Service principal01/25/17 11:06:48 01/26/17 11:06:45 krbtgt/HADOOP.COM@HADOOP.COM 如上例中当前认证用户为test。 通过命令id查询用户组信息。 [root@10-10-144-2 client]# id testuid=20032(test) gid=10001(hadoop) groups=10001(hadoop),9998(ficommon),10003(kafka)
  • 原因分析 查看/var/log/Bigdata/dbservice/DB/gaussdb.log日志没有内容。 查看/var/log/Bigdata/dbservice/scriptlog/preStartDBService.log日志,发现如下信息,判断为配置信息丢失。 The program "gaussdb" was found by "/opt/Bigdata/MRS_xxx/install/dbservice/gaussdb/bin/gs_guc)But not was not the same version as gs_guc.Check your installation. 比对主备DBServer节点/srv/BigData/dbdata_service/data目录下的配置文件发现差距比较大。
  • 解决办法 Hue配置过期,重启Hue服务即可。 在MRS 2.0.1及之后版本,单Master节点的集群Hue服务需要手动修改配置。 登录Master节点。 执行hostname -i获取本机IP。 执行如下命令获取“HUE_FLOAT_IP”的地址: grep "HUE_FLOAT_IP" ${BIGDATA_HOME}/MRS_Current/1_*/etc*/ENV_VARS, 其中MRS以实际文件名为准。 比较本机IP和“HUE_FLOAT_IP”的值是否相同,若不相同,请修改“HUE_FLOAT_IP”的值为本机IP。 重启Hue服务。
  • 问题背景与现象 在使用Kafka客户端命令创建Topic时,发现Topic无法被创建。 kafka-topics.sh --create --replication-factor 1 --partitions 2 --topic test --zookeeper 192.168.234.231:2181 提示错误NoNodeException: KeeperErrorCode = NoNode for /brokers/ids。 具体如下: Error while executing topic command : org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/ids[2017-09-17 16:35:28,520] ERROR org.I0Itec.zkclient.exception.ZkNoNodeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/idsat org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:47)at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:995)at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:675)at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:671)at kafka.utils.ZkUtils.getChildren(ZkUtils.scala:541)at kafka.utils.ZkUtils.getSortedBrokerList(ZkUtils.scala:176)at kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:235)at kafka.admin.TopicCommand$.createTopic(TopicCommand.scala:105)at kafka.admin.TopicCommand$.main(TopicCommand.scala:60)at kafka.admin.TopicCommand.main(TopicCommand.scala)Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/idsat org.apache.zookeeper.KeeperException.create(KeeperException.java:115)at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2256)at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2284)at org.I0Itec.zkclient.ZkConnection.getChildren(ZkConnection.java:114)at org.I0Itec.zkclient.ZkClient$4.call(ZkClient.java:678)at org.I0Itec.zkclient.ZkClient$4.call(ZkClient.java:675)at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:985)... 8 more (kafka.admin.TopicCommand$)
  • 原因分析 使用客户端命令,打印NoNodeException异常。 Error while executing topic command : org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/ids[2017-09-17 16:35:28,520] ERROR org.I0Itec.zkclient.exception.ZkNoNodeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/idsat org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:47)at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:995)at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:675)at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:671)at kafka.utils.ZkUtils.getChildren(ZkUtils.scala:541)at kafka.utils.ZkUtils.getSortedBrokerList(ZkUtils.scala:176)at kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:235)at kafka.admin.TopicCommand$.createTopic(TopicCommand.scala:105)at kafka.admin.TopicCommand$.main(TopicCommand.scala:60)at kafka.admin.TopicCommand.main(TopicCommand.scala) 通过Manager查看Kafka服务是否处于正常状态。 检查客户端命令中ZooKeeper地址是否正确,访问ZooKeeper上所存放的Kafka信息,其路径(Znode)应该加上/kafka,发现配置中缺少/kafka: [root@10-10-144-2 client]# kafka-topics.sh --create --replication-factor 1 --partitions 2 --topic test --zookeeper 192.168.234.231:2181
  • 问题背景与现象 在使用Kafka客户端命令创建Topic时,发现Topic无法被创建。 kafka-topics.sh --create --zookeeper 192.168.234.231:2181/kafka --replication-factor 1 --partitions 2 --topic test 提示错误NoAuthException,KeeperErrorCode = NoAuth for /config/topics。 具体如下: Error while executing topic command org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /config/topicsorg.I0Itec.zkclient.exception.ZkException: org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /config/topics at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:68) at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:685) at org.I0Itec.zkclient.ZkClient.create(ZkClient.java:304) at org.I0Itec.zkclient.ZkClient.createPersistent(ZkClient.java:213) at kafka.utils.ZkUtils$.createParentPath(ZkUtils.scala:215) at kafka.utils.ZkUtils$.updatePersistentPath(ZkUtils.scala:338) at kafka.admin.AdminUtils$.writeTopicConfig(AdminUtils.scala:247)
  • 问题背景与现象 业务拓扑代码中配置参数topology.worker.childopts不生效,关键日志如下: [main] INFO b.s.StormSubmitter - Uploading topology jar /opt/jar/example.jar to assigned location: /srv/BigData/streaming/stormdir/nimbus/inbox/stormjar-8d3b778d-69ea-4fbe-ba88-01aa2036d753.jarStart uploading file '/opt/jar/example.jar' to '/srv/BigData/streaming/stormdir/nimbus/inbox/stormjar-8d3b778d-69ea-4fbe-ba88-01aa2036d753.jar' (65574612 bytes)[==================================================] 65574612 / 65574612File '/opt/jar/example.jar' uploaded to '/srv/BigData/streaming/stormdir/nimbus/inbox/stormjar-8d3b778d-69ea-4fbe-ba88-01aa2036d753.jar' (65574612 bytes)[main] INFO b.s.StormSubmitter - Successfully uploaded topology jar to assigned location: /srv/BigData/streaming/stormdir/nimbus/inbox/stormjar-8d3b778d-69ea-4fbe-ba88-01aa2036d753.jar[main] INFO b.s.StormSubmitter - Submitting topology word-count in distributed mode with conf {"topology.worker.childopts":"-Xmx4096m","storm.zookeeper.topology.auth.scheme":"digest","storm.zookeeper.topology.auth.payload":"-5915065013522446406:-6421330379815193999","topology.workers":1}[main] INFO b.s.StormSubmitter - Finished submitting topology: word-count 通过ps -ef | grep worker命令查看worker进程信息如下:
  • 解决办法 如果想要修改拓扑的JVM参数,可以在命令中直接修改topology.worker.gc.childopts这个参数或者在服务端修改该参数,当topology.worker.gc.childopts为 "-Xms4096m -Xmx4096m -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M"时,效果如下: [main-SendThread(10.7.61.88:2181)] INFO o.a.s.s.o.a.z.ClientCnxn - Socket connection established, initiating session, client: /10.7.61.88:44694, server: 10.7.61.88/10.7.61.88:2181[main-SendThread(10.7.61.88:2181)] INFO o.a.s.s.o.a.z.ClientCnxn - Session establishment complete on server 10.7.61.88/10.7.61.88:2181, sessionid = 0x16037a6e5f092575, negotiated timeout = 40000[main-EventThread] INFO o.a.s.s.o.a.c.f.s.ConnectionStateManager - State change: CONNECTED[main] INFO b.s.u.StormBoundedExponentialBackoffRetry - The baseSleepTimeMs [1000] the maxSleepTimeMs [1000] the maxRetries [1][main] INFO o.a.s.s.o.a.z.Login - successfully logged in.[main-EventThread] INFO o.a.s.s.o.a.z.ClientCnxn - EventThread shut down for session: 0x16037a6e5f092575[main] INFO o.a.s.s.o.a.z.ZooKeeper - Session: 0x16037a6e5f092575 closed[main] INFO b.s.StormSubmitter - Uploading topology jar /opt/jar/example.jar to assigned location: /srv/BigData/streaming/stormdir/nimbus/inbox/stormjar-86855b6b-133e-478d-b415-fa96e63e553f.jarStart uploading file '/opt/jar/example.jar' to '/srv/BigData/streaming/stormdir/nimbus/inbox/stormjar-86855b6b-133e-478d-b415-fa96e63e553f.jar' (74143745 bytes)[==================================================] 74143745 / 74143745File '/opt/jar/example.jar' uploaded to '/srv/BigData/streaming/stormdir/nimbus/inbox/stormjar-86855b6b-133e-478d-b415-fa96e63e553f.jar' (74143745 bytes)[main] INFO b.s.StormSubmitter - Successfully uploaded topology jar to assigned location: /srv/BigData/streaming/stormdir/nimbus/inbox/stormjar-86855b6b-133e-478d-b415-fa96e63e553f.jar[main] INFO b.s.StormSubmitter - Submitting topology word-count in distributed mode with conf {"storm.zookeeper.topology.auth.scheme":"digest","storm.zookeeper.topology.auth.payload":"-7360002804241426074:-6868950379453400421","topology.worker.gc.childopts":"-Xms4096m -Xmx4096m -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M","topology.workers":1}[main] INFO b.s.StormSubmitter - Finished submitting topology: word-count 通过ps -ef | grep worker命令查看worker进程信息如下:
  • 问题背景与现象 在使用Kafka客户端命令设置Topic ACL权限时,发现Topic无法被设置。 kafka-topics.sh --delete --topic test4 --zookeeper 10.5.144.2:2181/kafka 提示错误ERROR kafka.admin.AdminOperationException: Error while deleting topic test4。 具体如下: Error while executing topic command : Error while deleting topic test4[2017-01-25 14:00:20,750] ERROR kafka.admin.AdminOperationException: Error while deleting topic test4at kafka.admin.TopicCommand$$anonfun$deleteTopic$1.apply(TopicCommand.scala:177)at kafka.admin.TopicCommand$$anonfun$deleteTopic$1.apply(TopicCommand.scala:162)at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)at kafka.admin.TopicCommand$.deleteTopic(TopicCommand.scala:162)at kafka.admin.TopicCommand$.main(TopicCommand.scala:68)at kafka.admin.TopicCommand.main(TopicCommand.scala) (kafka.admin.TopicCommand$)
  • 原因分析 Driver端异常: 16/05/11 18:10:56 INFO Client: client token: N/Adiagnostics: Application application_1462441251516_0024 failed 2 times due to AM Container for appattempt_1462441251516_0024_000002 exited with exitCode: 10For more detailed output, check the application tracking page:https://hdnode5:26001/cluster/app/application_1462441251516_0024 Then click on links to logs of each attempt.Diagnostics: Exception from container-launch.Container id: container_1462441251516_0024_02_000001 在ApplicationMaster日志中,异常如下: 2016-05-12 10:21:23,715 | ERROR | [main] | Failed to connect to driver at 192.168.30.57:23867, retrying ... | org.apache.spark.Logging$class.logError(Logging.scala:75)2016-05-12 10:21:24,817 | ERROR | [main] | Failed to connect to driver at 192.168.30.57:23867, retrying ... | org.apache.spark.Logging$class.logError(Logging.scala:75)2016-05-12 10:21:24,918 | ERROR | [main] | Uncaught exception: | org.apache.spark.Logging$class.logError(Logging.scala:96)org.apache.spark.SparkException: Failed to connect to driver!at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver(ApplicationMaster.scala:426)at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:292)…2016-05-12 10:21:24,925 | INFO | [Thread-1] | Unregistering ApplicationMaster with FAILED (diag message: Uncaught exception: org.apache.spark.SparkException: Failed to connect to driver!) | org.apache.spark.Logging$class.logInfo(Logging.scala:59) Spark-client模式任务Driver运行在客户端节点上(通常是集群外的某个节点),启动时先在集群中启动AppMaster进程,进程启动后要向Driver进程注册信息,注册成功后,任务才能继续。从AppMaster日志中可以看出,无法连接至Driver,所以任务失败。
  • 解决办法 请检查Driver进程所在的IP是否可以ping通。 启动一个Spark PI任务,会有类似如下打印信息。 16/05/11 18:07:20 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.100:23662]16/05/11 18:07:20 INFO Utils: Successfully started service 'sparkDriver' on port 23662. 在该节点,也就是2中示例的192.168.1.100上执行netstat - anp | grep 23662看下此端口是否打开,如下打印标明,相关端口是打开的。 tcp 0 0 ip:port :::* LISTEN 107274/java tcp 0 0 ip:port ip:port ESTABLISHED 107274/java 在AppMaster启动的节点执行telnet 192.168.1.100 23662看下是否可以连通该端口,请使用root用户和omm用户都执行一遍。 如果出现Escape character is '^]'类似打印则说明可以连通,如果出现connection refused则表示失败,无法连接到相关端口。 如果相关端口打开,但是从别的节点无法连通到该端口,则需要排查下相关网络配置。 23662这个端口每次都是随机的,所以要根据自己启动任务打开的端口来测试。
  • 原因分析 使用客户端命令,打印AdminOperationException异常。 通过客户端命令klist查询当前认证用户: [root@10-10-144-2 client]# klistTicket cache: FILE:/tmp/krb5cc_0Default principal: test@HADOOP.COMValid starting Expires Service principal01/25/17 11:06:48 01/26/17 11:06:45 krbtgt/HADOOP.COM@HADOOP.COM 如上例中当前认证用户为test。 通过命令id查询用户组信息 [root@10-10-144-2 client]# id testuid=20032(test) gid=10001(hadoop) groups=10001(hadoop),9998(ficommon),10003(kafka)
  • 问题背景与现象 有MapReduce任务所有map任务均成功,但reduce任务失败,查看日志发现报异常“FileNotFoundException...No lease on...File does not exist”。 Error: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): No lease on /user/sparkhive/warehouse/daas/dsp/output/_temporary/1/_temporary/attempt_1479799053892_17075_r_000007_0/part-r-00007 (inode 6501287): File does not exist. Holder DFSClient_attempt_1479799053892_17075_r_000007_0_-1463597952_1 does not have any open files.at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3350)at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3442)at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3409)at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:789)
  • 原因分析 查看Kerberos日志“/var/log/Bigdata/kerberos/krb5kdc.log”,发现有集群外的IP使用Kafka用户连接,导致多次认证失败,最终导致Kafka账号被锁定。 Jul 11 02:49:16 192-168-1-91 krb5kdc[1863](info): AS_REQ (2 etypes {18 17}) 192.168.1.93: NEEDED_PREAUTH: kafka/hadoop.hadoop.com@HADOOP.COM for krbtgt/HADOOP.COM@HADOOP.COM, Additional pre-authentication requiredJul 11 02:49:16 192-168-1-91 krb5kdc[1863](info): preauth (encrypted_timestamp) verify failure: Decrypt integrity check failedJul 11 02:49:16 192-168-1-91 krb5kdc[1863](info): AS_REQ (2 etypes {18 17}) 192.168.1.93: PREAUTH_FAILED: kafka/hadoop.hadoop.com@HADOOP.COM for krbtgt/HADOOP.COM@HADOOP.COM, Decrypt integrity check failed
  • 问题背景与现象 给某目录设置quota后,往目录中写文件失败,出现如下问题“The DiskSpace quota of /tmp/tquota2 is exceeded”。 [omm@189-39-150-115 client]$ hdfs dfs -put switchuser.py /tmp/tquota2put: The DiskSpace quota of /tmp/tquota2 is exceeded: quota = 157286400 B = 150 MB but diskspace consumed = 402653184 B = 384 MB
  • 原因分析 HDFS支持设置某目录的配额,即限制某目录下的文件最多占用空间大小,例如如下命令是设置“/tmp/tquota”目录最多写入150MB的文件(文件大小*副本数)。 hadoop dfsadmin -setSpaceQuota 150M /tmp/tquota2 使用如下命令可以查看目录设置的配额情况,SPACE_QUOTA是设置的空间配额,REM_SPACE_QUOTA是当前剩余的空间配额。 hdfs dfs -count -q -h -v /tmp/tquota2 图1 查看目录设置的配额 日志分析 ,如下日志说明写入文件需要消耗384M,但是当前的空间配额是150M,因此空间不足。写文件前,需要的剩余空间是:块大小*副本数,128M*3副本=384M。 [omm@189-39-150-115 client]$ [omm@189-39-150-115 client]$ hdfs dfs -put switchuser.py /tmp/tquota2put: The DiskSpace quota of /tmp/tquota2 is exceeded: quota = 157286400 B = 150 MB but diskspace consumed = 402653184 B = 384 MB
  • 原因分析 FileNotFoundException...No lease on...File does not exist,该日志说明文件在操作的过程中被删除了。 搜索HDFS的NameNode的审计日志(Active NameNode的/var/log/Bigdata/audit/hdfs/nn/hdfs-audit-namenode.log)搜索文件名,确认文件的创建时间。 搜索文件创建到出现异常时间范围的NameNode的审计日志,搜索该文件是否被删除或者移动到其他目录。 如果该文件没有被删除或者移动,可能是该文件的父目录,或者更上层目录被删除或者移动,需要继续搜索上层目录。如本样例中,是文件的父目录被删除。 2017-05-31 02:04:08,286 | INFO | IPC Server handler 30 on 25000 | allowed=true ugi=appUser@HADOOP.COM (auth:TOKEN) ip=/192.168.1.22 cmd=delete src=/user/sparkhive/warehouse/daas/dsp/output/_temporary dst=null perm=null proto=rpc | FSNamesystem.java:8189 如上日志说明:192.168.1.22 节点的appUser用户删除了/user/sparkhive/warehouse/daas/dsp/output/_temporary。 可以使用zgrep "文件名" *.zip命令搜索zip包的内容。
  • 问题背景与现象 新建Kafka集群,部署Broker节点数为2,使用Kafka客户端可以正常生产,但是无法正常消费。Consumer消费数据失败,提示GROUP_COORDINATOR_NOT_AVAILABLE,关键日志如下: 2018-05-12 10:58:42,561 | INFO | [kafka-request-handler-3] | [GroupCoordinator 2]: Preparing to restabilize group DemoConsumer with old generation 118 | kafka.coordinator.GroupCoordinator (Logging.scala:68)2018-05-12 10:59:13,562 | INFO | [executor-Heartbeat] | [GroupCoordinator 2]: Preparing to restabilize group DemoConsumer with old generation 119 | kafka.coordinator.GroupCoordinator (Logging.scala:68)
  • 问题背景与现象 在使用Kafka客户端命令创建Topic时,发现Topic无法被创建。 kafka-topics.sh --create --replication-factor 2 --partitions 2 --topic test --zookeeper 192.168.234.231:2181 提示错误replication factor larger than available brokers。 具体如下: Error while executing topic command : replication factor: 2 larger than available brokers: 0[2017-09-17 16:44:12,396] ERROR kafka.admin.AdminOperationException: replication factor: 2 larger than available brokers: 0at kafka.admin.AdminUtils$.assignReplicasToBrokers(AdminUtils.scala:117)at kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:403)at kafka.admin.TopicCommand$.createTopic(TopicCommand.scala:110)at kafka.admin.TopicCommand$.main(TopicCommand.scala:61)at kafka.admin.TopicCommand.main(TopicCommand.scala) (kafka.admin.TopicCommand$)
  • 问题背景与现象 客户端安装成功,执行客户端命令例如yarn-session.sh时报错,提示如下: [root@host01 bin]# yarn-session.sh2018-10-25 01:22:06,454 | ERROR | [main] | Error while trying to split key and value in configuration file /opt/flinkclient/Flink/flink/conf/flink-conf.yaml:80: "security.kerberos.login.keytab: " | org.apache.flink.configuration.GlobalConfiguration (GlobalConfiguration.java:160)Exception in thread "main" org.apache.flink.configuration.IllegalConfigurationException: Error while parsing YAML configuration file :80: "security.kerberos.login.keytab: "
  • 原因分析 使用客户端命令,打印replication factor larger than available brokers异常。 Error while executing topic command : replication factor: 2 larger than available brokers: 0[2017-09-17 16:44:12,396] ERROR kafka.admin.AdminOperationException: replication factor: 2 larger than available brokers: 0at kafka.admin.AdminUtils$.assignReplicasToBrokers(AdminUtils.scala:117)at kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:403)at kafka.admin.TopicCommand$.createTopic(TopicCommand.scala:110)at kafka.admin.TopicCommand$.main(TopicCommand.scala:61)at kafka.admin.TopicCommand.main(TopicCommand.scala) (kafka.admin.TopicCommand$) 通过Manager参看Kafka服务是否处于正常状态,当前可用Broker是否小于设置的replication-factor。 检查客户端命令中ZooKeeper地址是否正确,访问ZooKeeper上所存放的Kafka信息,其路径(Znode)应该加上/kafka,发现配置中缺少/kafka。 [root@10-10-144-2 client]# kafka-topics.sh --create --replication-factor 2 --partitions 2 --topic test --zookeeper 192.168.234.231:2181
  • 解决办法 保证Kafka服务处于正常状态,且可用Broker不小于设置的replication-factor。 创建命令中ZooKeeper地址信息需要添加/kafka。 [root@10-10-144-2 client]# kafka-topics.sh --create --replication-factor 1 --partitions 2 --topic test --zookeeper 192.168.234.231:2181/kafka
共99316条