Huawei Cloud User Manual

  • Symptom: The client is installed successfully, but running a client command such as yarn-session.sh fails with the following error:
    [root@host01 bin]# yarn-session.sh
    2018-10-25 01:22:06,454 | ERROR | [main] | Error while trying to split key and value in configuration file /opt/flinkclient/Flink/flink/conf/flink-conf.yaml:80: "security.kerberos.login.keytab: " | org.apache.flink.configuration.GlobalConfiguration (GlobalConfiguration.java:160)
    Exception in thread "main" org.apache.flink.configuration.IllegalConfigurationException: Error while parsing YAML configuration file :80: "security.kerberos.login.keytab: "
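    The error indicates that line 80 of flink-conf.yaml cannot be split into a key and a value. As a hedged illustration only (the keytab path and principal below are placeholders, not taken from this case), a well-formed entry pairs each key with a non-empty value on a single line:
    security.kerberos.login.keytab: /opt/flinkclient/user.keytab
    security.kerberos.login.principal: test@HADOOP.COM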
  • Cause Analysis: The /var/log/Bigdata/dbservice/DB/gaussdb.log log is empty. The /var/log/Bigdata/dbservice/scriptlog/preStartDBService.log log contains the following message, which indicates that configuration information has been lost:
    The program "gaussdb" was found by "/opt/Bigdata/MRS_xxx/install/dbservice/gaussdb/bin/gs_guc" but was not the same version as gs_guc. Check your installation.
    Comparing the configuration files in the /srv/BigData/dbdata_service/data directory on the active and standby DBServer nodes shows that they differ considerably.
  • Cause Analysis: The DBService backup page displays the following error messages:
    Clear temporary files at backup checkpoint DBService_test_DBService_DBService_20180326155921 that fialed last time.
    Temporyary files at backup checkpoint DBService_test_DBService_DBService20180326155921 that failed last time are cleared successfully.
    The /var/log/Bigdata/dbservice/scriptlog/backup.log file shows that logging has stopped and contains no backup-related information. The /var/log/Bigdata/controller/backupplugin.log log on the active OMS node contains the following error (172.16.4.200 is the floating IP address of DBService):
    result error is ssh:connect to host 172.16.4.200 port 22 : Connection refused
    DBService backup failed.
  • Cause Analysis: The client command reports a "replication factor larger than available brokers" exception:
    Error while executing topic command : replication factor: 2 larger than available brokers: 0
    [2017-09-17 16:44:12,396] ERROR kafka.admin.AdminOperationException: replication factor: 2 larger than available brokers: 0
    at kafka.admin.AdminUtils$.assignReplicasToBrokers(AdminUtils.scala:117)
    at kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:403)
    at kafka.admin.TopicCommand$.createTopic(TopicCommand.scala:110)
    at kafka.admin.TopicCommand$.main(TopicCommand.scala:61)
    at kafka.admin.TopicCommand.main(TopicCommand.scala)
    (kafka.admin.TopicCommand$)
    Check on Manager whether the Kafka service is in the normal state and whether the number of available Brokers is smaller than the configured replication-factor. Check whether the ZooKeeper address in the client command is correct: the Kafka information stored in ZooKeeper must be accessed under the /kafka znode path, and /kafka is missing from the command:
    [root@10-10-144-2 client]# kafka-topics.sh --create --replication-factor 2 --partitions 2 --topic test --zookeeper 192.168.234.231:2181
  • Solution: Ensure that the Kafka service is in the normal state and that the number of available Brokers is not smaller than the configured replication-factor. Append /kafka to the ZooKeeper address in the creation command:
    [root@10-10-144-2 client]# kafka-topics.sh --create --replication-factor 1 --partitions 2 --topic test --zookeeper 192.168.234.231:2181/kafka
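    As a hedged check before creating the Topic (assuming the standard ZooKeeper client script zkCli.sh shipped with the cluster is used; the address reuses the one from this example), the Brokers currently registered under the /kafka znode can be listed to confirm that enough are available:
    zkCli.sh -server 192.168.234.231:2181
    ls /kafka/brokers/ids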
  • Symptom: A Topic cannot be created with the Kafka client command:
    kafka-topics.sh --create --replication-factor 2 --partitions 2 --topic test --zookeeper 192.168.234.231:2181
    The error "replication factor larger than available brokers" is reported, as follows:
    Error while executing topic command : replication factor: 2 larger than available brokers: 0
    [2017-09-17 16:44:12,396] ERROR kafka.admin.AdminOperationException: replication factor: 2 larger than available brokers: 0
    at kafka.admin.AdminUtils$.assignReplicasToBrokers(AdminUtils.scala:117)
    at kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:403)
    at kafka.admin.TopicCommand$.createTopic(TopicCommand.scala:110)
    at kafka.admin.TopicCommand$.main(TopicCommand.scala:61)
    at kafka.admin.TopicCommand.main(TopicCommand.scala)
    (kafka.admin.TopicCommand$)
  • Symptom: A Topic cannot be created with the Kafka client command:
    kafka-topics.sh --create --replication-factor 1 --partitions 2 --topic test --zookeeper 192.168.234.231:2181
    The error "NoNodeException: KeeperErrorCode = NoNode for /brokers/ids" is reported, as follows:
    Error while executing topic command : org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/ids
    [2017-09-17 16:35:28,520] ERROR org.I0Itec.zkclient.exception.ZkNoNodeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/ids
    at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:47)
    at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:995)
    at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:675)
    at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:671)
    at kafka.utils.ZkUtils.getChildren(ZkUtils.scala:541)
    at kafka.utils.ZkUtils.getSortedBrokerList(ZkUtils.scala:176)
    at kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:235)
    at kafka.admin.TopicCommand$.createTopic(TopicCommand.scala:105)
    at kafka.admin.TopicCommand$.main(TopicCommand.scala:60)
    at kafka.admin.TopicCommand.main(TopicCommand.scala)
    Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/ids
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2256)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2284)
    at org.I0Itec.zkclient.ZkConnection.getChildren(ZkConnection.java:114)
    at org.I0Itec.zkclient.ZkClient$4.call(ZkClient.java:678)
    at org.I0Itec.zkclient.ZkClient$4.call(ZkClient.java:675)
    at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:985)
    ... 8 more
    (kafka.admin.TopicCommand$)
  • Cause Analysis: The client command reports a NoNodeException exception:
    Error while executing topic command : org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/ids
    [2017-09-17 16:35:28,520] ERROR org.I0Itec.zkclient.exception.ZkNoNodeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/ids
    at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:47)
    at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:995)
    at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:675)
    at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:671)
    at kafka.utils.ZkUtils.getChildren(ZkUtils.scala:541)
    at kafka.utils.ZkUtils.getSortedBrokerList(ZkUtils.scala:176)
    at kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:235)
    at kafka.admin.TopicCommand$.createTopic(TopicCommand.scala:105)
    at kafka.admin.TopicCommand$.main(TopicCommand.scala:60)
    at kafka.admin.TopicCommand.main(TopicCommand.scala)
    Check on Manager whether the Kafka service is in the normal state. Check whether the ZooKeeper address in the client command is correct: the Kafka information stored in ZooKeeper must be accessed under the /kafka znode path, and /kafka is missing from the command:
    [root@10-10-144-2 client]# kafka-topics.sh --create --replication-factor 1 --partitions 2 --topic test --zookeeper 192.168.234.231:2181
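    A hedged example of the corrected command, with /kafka appended to the ZooKeeper address as described above (same fix as in the previous solution):
    [root@10-10-144-2 client]# kafka-topics.sh --create --replication-factor 1 --partitions 2 --topic test --zookeeper 192.168.234.231:2181/kafka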
  • Cause Analysis: The client command reports a NoAuthException exception. Run the klist command on the client to query the currently authenticated user:
    [root@10-10-144-2 client]# klist
    Ticket cache: FILE:/tmp/krb5cc_0
    Default principal: test@HADOOP.COM
    Valid starting     Expires            Service principal
    01/25/17 11:06:48  01/26/17 11:06:45  krbtgt/HADOOP.COM@HADOOP.COM
    In this example, the currently authenticated user is test. Run the id command to query the user's group information:
    [root@10-10-144-2 client]# id test
    uid=20032(test) gid=10001(hadoop) groups=10001(hadoop),9998(ficommon),10003(kafka)
  • Symptom: Topic ACL permissions cannot be set with the Kafka client command:
    kafka-acls.sh --authorizer-properties zookeeper.connect=10.5.144.2:2181/kafka --topic topic_acl --producer --add --allow-principal User:test_acl
    The error "NoAuthException: KeeperErrorCode = NoAuth for /kafka-acl-changes/acl_changes_0000000002" is reported, as follows:
    Error while executing ACL command: org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /kafka-acl-changes/acl_changes_0000000002
    org.I0Itec.zkclient.exception.ZkException: org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /kafka-acl-changes/acl_changes_0000000002
    at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:68)
    at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:995)
    at org.I0Itec.zkclient.ZkClient.delete(ZkClient.java:1038)
    at kafka.utils.ZkUtils.deletePath(ZkUtils.scala:499)
    at kafka.common.ZkNodeChangeNotificationListener$$anonfun$purgeObsoleteNotifications$1.apply(ZkNodeChangeNotificationListener.scala:118)
    at kafka.common.ZkNodeChangeNotificationListener$$anonfun$purgeObsoleteNotifications$1.apply(ZkNodeChangeNotificationListener.scala:112)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at kafka.common.ZkNodeChangeNotificationListener.purgeObsoleteNotifications(ZkNodeChangeNotificationListener.scala:112)
    at kafka.common.ZkNodeChangeNotificationListener.kafka$common$ZkNodeChangeNotificationListener$$processNotifications(ZkNodeChangeNotificationListener.scala:97)
    at kafka.common.ZkNodeChangeNotificationListener.processAllNotifications(ZkNodeChangeNotificationListener.scala:77)
    at kafka.common.ZkNodeChangeNotificationListener.init(ZkNodeChangeNotificationListener.scala:65)
    at kafka.security.auth.SimpleAclAuthorizer.configure(SimpleAclAuthorizer.scala:136)
    at kafka.admin.AclCommand$.withAuthorizer(AclCommand.scala:73)
    at kafka.admin.AclCommand$.addAcl(AclCommand.scala:80)
    at kafka.admin.AclCommand$.main(AclCommand.scala:48)
    at kafka.admin.AclCommand.main(AclCommand.scala)
    Caused by: org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /kafka-acl-changes/acl_changes_0000000002
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:117)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:1416)
    at org.I0Itec.zkclient.ZkConnection.delete(ZkConnection.java:104)
    at org.I0Itec.zkclient.ZkClient$11.call(ZkClient.java:1042)
    at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:985)
  • Symptom: A Topic cannot be created with the Kafka client command:
    kafka-topics.sh --create --zookeeper 192.168.234.231:2181/kafka --replication-factor 1 --partitions 2 --topic test
    The error "NoAuthException: KeeperErrorCode = NoAuth for /config/topics" is reported, as follows:
    Error while executing topic command org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /config/topics
    org.I0Itec.zkclient.exception.ZkException: org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /config/topics
    at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:68)
    at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:685)
    at org.I0Itec.zkclient.ZkClient.create(ZkClient.java:304)
    at org.I0Itec.zkclient.ZkClient.createPersistent(ZkClient.java:213)
    at kafka.utils.ZkUtils$.createParentPath(ZkUtils.scala:215)
    at kafka.utils.ZkUtils$.updatePersistentPath(ZkUtils.scala:338)
    at kafka.admin.AdminUtils$.writeTopicConfig(AdminUtils.scala:247)
  • Cause Analysis: The client command reports a NoAuthException exception:
    Error while executing topic command org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /config/topics
    org.I0Itec.zkclient.exception.ZkException: org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /config/topics
    at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:68)
    at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:685)
    at org.I0Itec.zkclient.ZkClient.create(ZkClient.java:304)
    at org.I0Itec.zkclient.ZkClient.createPersistent(ZkClient.java:213)
    at kafka.utils.ZkUtils$.createParentPath(ZkUtils.scala:215)
    at kafka.utils.ZkUtils$.updatePersistentPath(ZkUtils.scala:338)
    at kafka.admin.AdminUtils$.writeTopicConfig(AdminUtils.scala:247)
    Run the klist command on the client to query the currently authenticated user:
    [root@10-10-144-2 client]# klist
    Ticket cache: FILE:/tmp/krb5cc_0
    Default principal: test@HADOOP.COM
    Valid starting     Expires            Service principal
    01/25/17 11:06:48  01/26/17 11:06:45  krbtgt/HADOOP.COM@HADOOP.COM
    In this example, the currently authenticated user is test. Run the id command to query the user's group information:
    [root@10-10-144-2 client]# id test
    uid=20032(test) gid=10001(hadoop) groups=10001(hadoop),9998(ficommon),10003(kafka)
  • Cause Analysis: The client command reports an AdminOperationException exception. Run the klist command on the client to query the currently authenticated user:
    [root@10-10-144-2 client]# klist
    Ticket cache: FILE:/tmp/krb5cc_0
    Default principal: test@HADOOP.COM
    Valid starting     Expires            Service principal
    01/25/17 11:06:48  01/26/17 11:06:45  krbtgt/HADOOP.COM@HADOOP.COM
    In this example, the currently authenticated user is test. Run the id command to query the user's group information:
    [root@10-10-144-2 client]# id test
    uid=20032(test) gid=10001(hadoop) groups=10001(hadoop),9998(ficommon),10003(kafka)
  • Symptom: A Topic cannot be deleted with the Kafka client command:
    kafka-topics.sh --delete --topic test4 --zookeeper 10.5.144.2:2181/kafka
    The error "ERROR kafka.admin.AdminOperationException: Error while deleting topic test4" is reported, as follows:
    Error while executing topic command : Error while deleting topic test4
    [2017-01-25 14:00:20,750] ERROR kafka.admin.AdminOperationException: Error while deleting topic test4
    at kafka.admin.TopicCommand$$anonfun$deleteTopic$1.apply(TopicCommand.scala:177)
    at kafka.admin.TopicCommand$$anonfun$deleteTopic$1.apply(TopicCommand.scala:162)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at kafka.admin.TopicCommand$.deleteTopic(TopicCommand.scala:162)
    at kafka.admin.TopicCommand$.main(TopicCommand.scala:68)
    at kafka.admin.TopicCommand.main(TopicCommand.scala)
    (kafka.admin.TopicCommand$)
  • Symptom: The topology.worker.childopts parameter configured in the service topology code does not take effect. Key logs are as follows:
    [main] INFO b.s.StormSubmitter - Uploading topology jar /opt/jar/example.jar to assigned location: /srv/BigData/streaming/stormdir/nimbus/inbox/stormjar-8d3b778d-69ea-4fbe-ba88-01aa2036d753.jar
    Start uploading file '/opt/jar/example.jar' to '/srv/BigData/streaming/stormdir/nimbus/inbox/stormjar-8d3b778d-69ea-4fbe-ba88-01aa2036d753.jar' (65574612 bytes)
    [==================================================] 65574612 / 65574612
    File '/opt/jar/example.jar' uploaded to '/srv/BigData/streaming/stormdir/nimbus/inbox/stormjar-8d3b778d-69ea-4fbe-ba88-01aa2036d753.jar' (65574612 bytes)
    [main] INFO b.s.StormSubmitter - Successfully uploaded topology jar to assigned location: /srv/BigData/streaming/stormdir/nimbus/inbox/stormjar-8d3b778d-69ea-4fbe-ba88-01aa2036d753.jar
    [main] INFO b.s.StormSubmitter - Submitting topology word-count in distributed mode with conf {"topology.worker.childopts":"-Xmx4096m","storm.zookeeper.topology.auth.scheme":"digest","storm.zookeeper.topology.auth.payload":"-5915065013522446406:-6421330379815193999","topology.workers":1}
    [main] INFO b.s.StormSubmitter - Finished submitting topology: word-count
    The worker process information queried with the ps -ef | grep worker command is as follows:
  • Solution: To modify the JVM parameters of a topology, change the topology.worker.gc.childopts parameter directly in the command or modify the parameter on the server side. When topology.worker.gc.childopts is set to "-Xms4096m -Xmx4096m -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M", the result is as follows:
    [main-SendThread(10.7.61.88:2181)] INFO o.a.s.s.o.a.z.ClientCnxn - Socket connection established, initiating session, client: /10.7.61.88:44694, server: 10.7.61.88/10.7.61.88:2181
    [main-SendThread(10.7.61.88:2181)] INFO o.a.s.s.o.a.z.ClientCnxn - Session establishment complete on server 10.7.61.88/10.7.61.88:2181, sessionid = 0x16037a6e5f092575, negotiated timeout = 40000
    [main-EventThread] INFO o.a.s.s.o.a.c.f.s.ConnectionStateManager - State change: CONNECTED
    [main] INFO b.s.u.StormBoundedExponentialBackoffRetry - The baseSleepTimeMs [1000] the maxSleepTimeMs [1000] the maxRetries [1]
    [main] INFO o.a.s.s.o.a.z.Login - successfully logged in.
    [main-EventThread] INFO o.a.s.s.o.a.z.ClientCnxn - EventThread shut down for session: 0x16037a6e5f092575
    [main] INFO o.a.s.s.o.a.z.ZooKeeper - Session: 0x16037a6e5f092575 closed
    [main] INFO b.s.StormSubmitter - Uploading topology jar /opt/jar/example.jar to assigned location: /srv/BigData/streaming/stormdir/nimbus/inbox/stormjar-86855b6b-133e-478d-b415-fa96e63e553f.jar
    Start uploading file '/opt/jar/example.jar' to '/srv/BigData/streaming/stormdir/nimbus/inbox/stormjar-86855b6b-133e-478d-b415-fa96e63e553f.jar' (74143745 bytes)
    [==================================================] 74143745 / 74143745
    File '/opt/jar/example.jar' uploaded to '/srv/BigData/streaming/stormdir/nimbus/inbox/stormjar-86855b6b-133e-478d-b415-fa96e63e553f.jar' (74143745 bytes)
    [main] INFO b.s.StormSubmitter - Successfully uploaded topology jar to assigned location: /srv/BigData/streaming/stormdir/nimbus/inbox/stormjar-86855b6b-133e-478d-b415-fa96e63e553f.jar
    [main] INFO b.s.StormSubmitter - Submitting topology word-count in distributed mode with conf {"storm.zookeeper.topology.auth.scheme":"digest","storm.zookeeper.topology.auth.payload":"-7360002804241426074:-6868950379453400421","topology.worker.gc.childopts":"-Xms4096m -Xmx4096m -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M","topology.workers":1}
    [main] INFO b.s.StormSubmitter - Finished submitting topology: word-count
    The worker process information queried with the ps -ef | grep worker command is as follows:
  • Cause Analysis: The following exception is printed on the Driver side:
    Exception Occurs: BadPadding
    16/02/22 14:25:38 ERROR Schema: Failed initialising database.
    Unable to open a test connection to the given database. JDBC url = jdbc:postgresql://ip:port/sparkhivemeta, username = spark. Terminating connection pool (set lazyInit to true if you expect to start your database after your app).
    SparkSQL tasks need to access DBService to obtain metadata, and the client must decrypt an encrypted password to do so. In this case the user did not follow the procedure: the step of configuring the environment variables was skipped, and a default JDK version exists in the client environment variables. As a result, the decryption program invoked during decryption fails, and the user is locked.
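    A minimal sketch of the missed step, assuming the client is installed in /opt/client and using a placeholder user name (both are assumptions, not taken from this case): source the client environment variables so that the JDK and decryption tool shipped with the client are used, then authenticate before running Spark commands.
    source /opt/client/bigdata_env
    kinit sparkuser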
  • Cause Analysis: The Driver side reports that the token for connecting to HDFS cannot be found:
    16/03/22 20:37:10 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 192 for admin) can't be found in cache
    16/03/22 20:37:10 WARN Client: Failed to cleanup staging dir .sparkStaging/application_1458558192236_0003
    org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 192 for admin) can't be found in cache
    The native Yarn web UI shows that the ApplicationMaster fails to start twice and the task exits, as shown in Figure 1.
    Figure 1 ApplicationMaster startup failure
    The ApplicationMaster log contains the following exception:
    Exception in thread "main" java.lang.ExceptionInInitializerError
    Caused by: org.apache.spark.SparkException: Unable to load YARN support
    Caused by: java.lang.IllegalArgumentException: Can't get Kerberos realm
    Caused by: java.lang.reflect.InvocationTargetException
    Caused by: KrbException: Cannot locate default realm
    Caused by: KrbException: Generic error (description in e-text) (60) - Unable to locate Kerberos realm
    org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1410)
    ... 86 more
    Caused by: javax.jdo.JDOFatalInternalException: Unexpected exception caught.
    NestedThrowables:java.lang.reflect.InvocationTargetException
    ... 110 more
    When the task is submitted in yarn-cluster mode with ./spark-submit --class yourclassname --master yarn-cluster /yourdependencyjars, the Driver starts inside the cluster. Because the client's spark.driver.extraJavaOptions is loaded, the corresponding kdc.conf file cannot be found at that path on the cluster nodes, so the information required for Kerberos authentication cannot be obtained and the ApplicationMaster fails to start.
  • Solution: When submitting the task from the client, specify a custom spark.driver.extraJavaOptions parameter on the command line so that the task does not automatically load spark.driver.extraJavaOptions from the "spark-defaults.conf" file in the client path. Alternatively, when starting the Spark task, use "--conf" to specify the Driver configuration, as follows (the quoted part after the "=" sign of spark.driver.extraJavaOptions must not be omitted):
    ./spark-submit --class yourclassname --master yarn-cluster --conf spark.driver.extraJavaOptions=" -Dlog4j.configuration=file:/opt/client/Spark/spark/conf/log4j.properties -Djetty.version=x.y.z -Dzookeeper.server.principal=zookeeper/hadoop.794bbab6_9505_44cc_8515_b4eddc84e6c1.com -Djava.security.krb5.conf=/opt/client/KrbClient/kerberos/var/krb5kdc/krb5.conf -Djava.security.auth.login.config=/opt/client/Spark/spark/conf/jaas.conf -Dorg.xerial.snappy.tempdir=/opt/client/Spark/tmp -Dcarbon.properties.filepath=/opt/client/Spark/spark/conf/carbon.properties" ../yourdependencyjars
  • Cause Analysis: The Driver log directly shows that the requested executor memory exceeds the cluster limit:
    ... INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (6144 MB per container)
    ... ERROR SparkContext: Error initializing SparkContext.
    java.lang.IllegalArgumentException: Required executor memory (10240+1024 MB) is above the max threshold (6144 MB) of this cluster!
    When a Spark task is submitted to Yarn, the resources used by the executors that run the tasks are managed by Yarn. The error shows that the user requested 10 GB of memory per executor when starting the executors, which exceeds the maximum container memory configured in Yarn, so the task cannot start.
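    A hedged illustration of bringing the request back under the 6144 MB container limit (the class and JAR names reuse the placeholders from this manual; the memory value is only an example): request less executor memory so that executor memory plus overhead fits within one container, or increase the Yarn container maximum on the server side.
    ./spark-submit --class yourclassname --master yarn-cluster --executor-memory 4g /yourdependencyjars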
  • Solution:
    Cause: ZooKeeper connection failure. The Kafka client times out when connecting to the ZooKeeper service. Check the network connectivity between the client and ZooKeeper. If the network connection fails, view the ZooKeeper service information on the Manager page.
    Figure 1 ZooKeeper service information
    If the configuration is incorrect, correct the ZooKeeper address in the client command.
    Cause: Topic deletion is disabled in the Kafka server configuration. On the Manager page, change delete.topic.enable to true, save the configuration, and restart the service.
    Figure 2 Modifying delete.topic.enable
    The client query command no longer shows Topic test:
    kafka-topics.sh --list --zookeeper 192.168.0.122:24002/kafka
    In the log directory of the node running as Controller, controller.log shows "Deletion of topic test successfully":
    2016-03-10 10:39:40,665 | INFO | [delete-topics-thread-3] | [Partition state machine on Controller 3]: Invoking state change to OfflinePartition for partitions [test,2],[test,15],[test,6],[test,16],[test,12],[test,7],[test,10],[test,13],[test,9],[test,19],[test,3],[test,5],[test,1],[test,0],[test,17],[test,8],[test,4],[test,11],[test,14],[test,18] | kafka.controller.PartitionStateMachine (Logging.scala:68)
    2016-03-10 10:39:40,668 | INFO | [delete-topics-thread-3] | [Partition state machine on Controller 3]: Invoking state change to NonExistentPartition for partitions [test,2],[test,15],[test,6],[test,16],[test,12],[test,7],[test,10],[test,13],[test,9],[test,19],[test,3],[test,5],[test,1],[test,0],[test,17],[test,8],[test,4],[test,11],[test,14],[test,18] | kafka.controller.PartitionStateMachine (Logging.scala:68)
    2016-03-10 10:39:40,977 | INFO | [delete-topics-thread-3] | [delete-topics-thread-3], Deletion of topic test successfully completed | kafka.controller.TopicDeletionManager$DeleteTopicsThread (Logging.scala:68)
    Cause: Some Kafka nodes are stopped or faulty. Start the stopped Broker instances. The client query command no longer shows Topic test:
    kafka-topics.sh --list --zookeeper 192.168.0.122:24002/kafka
    In the log directory of the node running as Controller, controller.log shows "Deletion of topic test successfully":
    2016-03-10 11:17:56,463 | INFO | [delete-topics-thread-3] | [Partition state machine on Controller 3]: Invoking state change to NonExistentPartition for partitions [test,4],[test,1],[test,8],[test,2],[test,5],[test,9],[test,7],[test,6],[test,0],[test,3] | kafka.controller.PartitionStateMachine (Logging.scala:68)
    2016-03-10 11:17:56,726 | INFO | [delete-topics-thread-3] | [delete-topics-thread-3], Deletion of topic test successfully completed | kafka.controller.TopicDeletionManager$DeleteTopicsThread (Logging.scala:68)
    Cause: Automatic topic creation is enabled in the Kafka configuration and the Producer has not been stopped. Stop the related applications, change "auto.create.topics.enable" to "false" on the Manager page, save the configuration, and restart the service.
    Figure 3 Modifying auto.create.topics.enable
    Run the delete operation again.
  • Symptom: When a Consumer tries to consume messages of a specified Topic from Kafka, no data can be obtained from Kafka. The following error is reported:
    org.apache.kafka.common.protocol.types.SchemaException: Error reading field 'brokers': Error reading field 'host': Error reading string of length 28271, only 593 bytes available
  • Symptom: A new Kafka cluster is deployed with two Broker nodes. Production with the Kafka client works normally, but consumption fails: the Consumer cannot consume data and reports GROUP_COORDINATOR_NOT_AVAILABLE. Key logs are as follows:
    2018-05-12 10:58:42,561 | INFO | [kafka-request-handler-3] | [GroupCoordinator 2]: Preparing to restabilize group DemoConsumer with old generation 118 | kafka.coordinator.GroupCoordinator (Logging.scala:68)
    2018-05-12 10:59:13,562 | INFO | [executor-Heartbeat] | [GroupCoordinator 2]: Preparing to restabilize group DemoConsumer with old generation 119 | kafka.coordinator.GroupCoordinator (Logging.scala:68)
  • Cause Analysis: The user running the command is different from the user who submitted the process whose pid information is being viewed. Storm introduced the per-user task execution feature: when a worker process is started, its uid and gid are changed to the submitting user and ficommon so that the logviewer can access the worker process logs while keeping log file permissions at 640. As a result, after switching to the submitting user, commands such as jstack and jmap fail against the Worker process, because the submitting user's default gid is not ficommon. The submitting user's gid must be changed to 9998 (ficommon) with an LDAP command before these commands can be run.
  • Reference Information: nimbus.task.launch.secs and supervisor.worker.start.timeout.secs are the timeout tolerances for topology startup on the nimbus side and the supervisor side respectively. In general, the value of nimbus.task.launch.secs should be greater than or equal to that of supervisor.worker.start.timeout.secs (equal or slightly larger is recommended; if it is much larger, task reassignment efficiency suffers). See the example values after this item.
    nimbus.task.launch.secs: if nimbus does not receive a heartbeat from the topology's tasks within the time configured by this parameter, it reassigns the topology (to another supervisor) and refreshes the task information in ZooKeeper. The supervisor reads the task information from ZooKeeper and compares it with the topologies it has started; if a topology no longer belongs to it, it deletes that topology's metadata, that is, the /srv/Bigdata/streaming_data/stormdir/supervisor/stormdist/{worker-id} directory.
    supervisor.worker.start.timeout.secs: after the supervisor starts a worker, if it does not receive a heartbeat from the worker within the time configured by this parameter, the supervisor actively stops the worker and waits for it to be rescheduled. When the service takes a long time to start, increase this value appropriately so that the worker can start successfully.
    If supervisor.worker.start.timeout.secs is larger than nimbus.task.launch.secs, the supervisor keeps letting the worker start because its tolerance time has not expired, while nimbus already considers the startup timed out and assigns the service to another host. The supervisor's background thread then detects the inconsistency and deletes the topology metadata, so when the worker later tries to read stormconf.ser during startup, the file no longer exists and a FileNotFoundException is thrown.
    nimbus.task.timeout.secs and supervisor.worker.timeout.secs are the corresponding timeout tolerances for heartbeat reporting while a topology is running, on the nimbus side and the supervisor side respectively. In general, nimbus.task.timeout.secs should be greater than or equal to supervisor.worker.timeout.secs (equal or slightly larger is recommended), for the same reason as above.
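    Example values only (illustrative, not mandated by this manual) that satisfy the startup-timeout relationship described above, with the nimbus-side tolerance equal to the supervisor-side tolerance:
    supervisor.worker.start.timeout.secs: 600
    nimbus.task.launch.secs: 600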
  • Possible Causes: The Worker process fails to start, which triggers Nimbus to reassign the task and start the Worker on another Supervisor. Because the Worker keeps restarting after each failure, the Worker node keeps changing and the Worker log is empty. There are two possible reasons for the Worker startup failure:
    The submitted JAR file contains a "storm.yaml" file. Storm requires that each "classpath" contain only one "storm.yaml" file; if there is more than one, an exception occurs. When a topology is submitted with the Storm client, the client "classpath" configuration differs from that of the Eclipse remote submission mode: the client automatically adds the user's JAR file to the "classpath", so the "classpath" ends up containing two "storm.yaml" files.
    The Worker process takes a long time to initialize, exceeding the Worker startup timeout configured for the Storm cluster, so the Worker is killed and keeps being reassigned.
  • Solution:
    Authentication exception: log in to the client node, go to the client directory, and run the following commands to resubmit the task (replace the service JAR file and the topology as required):
    source bigdata_env
    kinit username
    storm jar storm-starter-topologies-0.10.0.jar storm.starter.WordCountTopology test
    Topology package exception: check the service JAR file, delete the storm.yaml file from it, and resubmit the task (a hedged sketch follows this item).
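    A hedged sketch of checking for and removing an embedded storm.yaml, assuming the standard jar and zip tools are available on the client node (the JAR name is a placeholder):
    jar tf example.jar | grep storm.yaml
    zip -d example.jar storm.yaml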
  • Issue 1: Balance fails with "Access denied"
    Details: start-balancer.sh is executed, and the "hadoop-root-balancer-<hostname>.out" log shows "Access denied for user test1. Superuser privilege is required":
    cat /opt/client/HDFS/hadoop/logs/hadoop-root-balancer-host2.out
    Time Stamp Iteration# Bytes Already Moved Bytes Left To Move Bytes Being Moved
    INFO: Watching file:/opt/client/HDFS/hadoop/etc/hadoop/log4j.properties for changes with interval : 60000
    org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Access denied for user test1. Superuser privilege is required
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkSuperuserPrivilege(FSPermissionChecker.java:122)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkSuperuserPrivilege(FSNamesystem.java:5916)
    Root cause: an administrator account is required to run balance.
    Solution:
    Security cluster: authenticate as hdfs or another user in the supergroup group, then run balance (see the sketch after this item).
    Normal (non-security) cluster: run su - hdfs on the client before running the HDFS balance command.
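    A minimal sketch for the security cluster case, assuming the client environment file bigdata_env is in /opt/client and the hdfs user's credentials are available (the threshold value is only an example):
    source /opt/client/bigdata_env
    kinit hdfs
    start-balancer.sh -threshold 10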
  • Issue 2: Balance fails because the /system/balancer.id file is abnormal
    Details: a Balance process is started on the HDFS client and is stopped abnormally. When the Balance operation is run again, it fails:
    org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.protocol.RecoveryInProgressException): Failed to APPEND_FILE /system/balancer.id for DFSClient because lease recovery is in progress. Try again later.
    Root cause: normally, after HDFS finishes a Balance operation it automatically releases the "/system/balancer.id" file, and Balance can then be run again. In this scenario, however, the first Balance operation was stopped abnormally, so when the second Balance operation is performed the "/system/balancer.id" file still exists. This triggers an append /system/balancer.id operation, which causes the Balance operation to fail.
    Solution:
    Method 1: wait until the hard-limit lease period of one hour has passed and the original client has released the lease, then run the second Balance operation.
    Method 2: delete the "/system/balancer.id" file from HDFS, then run the next Balance operation (see the sketch after this item).
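    A minimal sketch of Method 2, assuming the client has already been authenticated as a user with permission on /system (the threshold value is only an example):
    hdfs dfs -rm /system/balancer.id
    start-balancer.sh -threshold 10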
  • Cause Analysis: When a DataNode writes blocks to its disks, two policies are available: "round robin" and "prefer the disk with the most remaining space"; the default is "round robin".
    Parameter: dfs.datanode.fsdataset.volume.choosing.policy
    Possible values:
    Round robin: org.apache.hadoop.hdfs.server.datanode.fsdataset.RoundRobinVolumeChoosingPolicy
    Prefer the disk with the most remaining space: org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy
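    A hedged illustration of switching to the available-space policy (in MRS this parameter is normally changed on the Manager configuration page; the equivalent hdfs-site.xml entry below is shown only for reference):
    <property>
      <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
      <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
    </property>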