Huawei Cloud User Manual

  • Solution
    Stop the HBase component.
    On the HBase client, authenticate as user hbase and run the following commands. For example:
      hadoop03:~ # source /opt/client/bigdata_env
      hadoop03:~ # kinit hbase
      Password for hbase@HADOOP.COM:
      hadoop03:~ # hbase zkcli
    Delete the acl table information from ZooKeeper. For example:
      [zk: hadoop01:24002,hadoop02:24002,hadoop03:24002(CONNECTED) 0] deleteall /hbase/table/hbase:acl
      [zk: hadoop01:24002,hadoop02:24002,hadoop03:24002(CONNECTED) 0] deleteall /hbase/table-lock/hbase:acl
    Start the HBase component.
  • Issue Background
    HBase fails to start on a newly installed cluster, and the RegionServer log reports the following error:
      2018-02-24 16:53:03,863 | ERROR | regionserver/host3/187.6.71.69:21302 | Master passed us a different hostname to use; was=host3, but now=187-6-71-69 | org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:1386)
  • Solution
    Check whether the MapReduce job being launched reads a very large number of HDFS files. If it does, reduce the file count: merge the small files in advance, or use CombineInputFormat to reduce the number of files the job reads.
    Increase the memory used when the hadoop command runs. This memory is set on the client: in the file <client installation directory>/HDFS/component_env, raise the -Xmx value of CLIENT_GC_OPTS from its default, for example to 512m, then run source component_env so the change takes effect (see the sketch below).
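    A minimal sketch of the change, assuming the client is installed under /opt/client (adjust to your installation path; if your CLIENT_GC_OPTS carries other options, edit only the -Xmx value):
      # In /opt/client/HDFS/component_env, raise the client JVM heap limit:
      export CLIENT_GC_OPTS="-Xmx512m"
      # Reload the environment so subsequent hadoop commands pick up the new value:
      source /opt/client/HDFS/component_env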
  • Kafka New and Old API Usage in Different Topic Access Scenarios
    Scenario 1: accessing a Topic with ACLs configured
      New API — the user must meet one of the following conditions: belongs to the system administrator group; belongs to the kafkaadmin group; belongs to the kafkasuperuser group; or is a user in the kafka group that has been granted access. Two combinations are possible:
        Client parameters security.protocol=SASL_PLAINTEXT and sasl.kerberos.service.name=kafka; no server parameter; port: sasl.port (default 21007).
        Client parameters security.protocol=SASL_SSL and sasl.kerberos.service.name=kafka; server parameter ssl.mode.enable set to true; port: sasl-ssl.port (default 21009).
      Old API — not applicable.
    Scenario 2: accessing a Topic without ACLs
      New API:
        User in the system administrator, kafkaadmin, or kafkasuperuser group; client parameters security.protocol=SASL_PLAINTEXT and sasl.kerberos.service.name=kafka; no server parameter; port: sasl.port (default 21007).
        User in the kafka group; same client parameters; server parameter allow.everyone.if.no.acl.found set to true; port: sasl.port (default 21007).
        User in the system administrator, kafkaadmin, or kafkasuperuser group; client parameters security.protocol=SASL_SSL and sasl.kerberos.service.name=kafka; server parameter ssl-enable set to "true"; port: sasl-ssl.port (default 21009).
        User in the kafka group; same client parameters; server parameters allow.everyone.if.no.acl.found set to "true" and ssl-enable set to "true"; port: sasl-ssl.port (default 21009).
        No user group requirement; client parameter security.protocol=PLAINTEXT; server parameter allow.everyone.if.no.acl.found set to "true"; port: port (default 21005).
        No user group requirement; client parameter security.protocol=SSL; server parameters allow.everyone.if.no.acl.found set to "true" and ssl-enable set to "true"; port: ssl.port (default 21008).
      Old Producer — server parameter allow.everyone.if.no.acl.found set to "true"; port: port (default 21005).
      Old Consumer — server parameter allow.everyone.if.no.acl.found set to "true"; ZooKeeper service port: clientPort (default 24002).
    A client-side configuration sketch for the SASL_PLAINTEXT case follows.
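    For illustration, a minimal client-side sketch for the Kerberos-authenticated plaintext case above (SASL_PLAINTEXT on sasl.port 21007). The broker address and the file name client.properties are placeholders, and the user must already have run kinit as an account meeting the group conditions listed above:
      # Contents of client.properties (hypothetical file name):
      security.protocol=SASL_PLAINTEXT
      sasl.kerberos.service.name=kafka
      # Produce to the topic through the SASL port with the new Producer API:
      kafka-console-producer.sh --broker-list <broker-ip>:21007 --topic test --producer.config client.properties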
  • Kafka Access Protocols
    Kafka currently supports four access protocol types: PLAINTEXT, SSL, SASL_PLAINTEXT, and SASL_SSL.
    When the Kafka service starts, it listens for PLAINTEXT and SASL_PLAINTEXT access by default. The SSL and SASL_SSL protocol types can be enabled by setting the Kafka service configuration parameter ssl.mode.enable to "true".
    A brief description of the four protocol types (protocol type; description; supported APIs; default port):
      PLAINTEXT — plaintext access without authentication; new and old APIs; 9092
      SASL_PLAINTEXT — plaintext access with Kerberos authentication; new API; 21007
      SSL — SSL-encrypted access without authentication; new API; 9093
      SASL_SSL — SSL-encrypted access with Kerberos authentication; new API; 21009
  • Kafka APIs in Brief
    New Producer API: the interface defined in org.apache.kafka.clients.producer.KafkaProducer; kafka-console-producer.sh uses this API by default.
    Old Producer API: the interface defined in kafka.producer.Producer; kafka-console-producer.sh calls this API when the --old-producer option is added.
    New Consumer API: the interface defined in org.apache.kafka.clients.consumer.KafkaConsumer; kafka-console-consumer.sh calls this API when the --new-consumer option is added.
    Old Consumer API: the interface defined in kafka.consumer.ConsumerConnector; kafka-console-consumer.sh uses this API by default.
    The new Producer API and the new Consumer API are collectively referred to below as the new API. Example invocations follow.
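    For illustration, hedged console-tool invocations for each of the four APIs, assuming the Kafka version shipped with the cluster still accepts the --old-producer and --new-consumer switches described above; broker and ZooKeeper addresses are placeholders, and the ports follow the defaults listed in this section:
      # New Producer API (the script's default):
      kafka-console-producer.sh --broker-list <broker-ip>:9092 --topic test
      # Old Producer API:
      kafka-console-producer.sh --old-producer --broker-list <broker-ip>:9092 --topic test
      # New Consumer API:
      kafka-console-consumer.sh --new-consumer --bootstrap-server <broker-ip>:9092 --topic test --from-beginning
      # Old Consumer API (the script's default):
      kafka-console-consumer.sh --zookeeper <zk-ip>:24002/kafka --topic test --from-beginning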
  • Issue Background and Symptom
    When creating a Topic with the Kafka client command, the Leader of the created Topic partitions shows as none:
      [root@10-10-144-2 client]# kafka-topics.sh --create --replication-factor 1 --partitions 2 --topic test --zookeeper 10.6.92.36:2181/kafka
      Created topic "test".
      [root@10-10-144-2 client]# kafka-topics.sh --describe --zookeeper 10.6.92.36:2181/kafka
      Topic:test    PartitionCount:2    ReplicationFactor:2    Configs:
      Topic: test    Partition: 0    Leader: none    Replicas: 2,3    Isr:
      Topic: test    Partition: 1    Leader: none    Replicas: 3,1    Isr:
  • Cause Analysis
    There are two common reasons for a process being terminated unexpectedly:
    1. The Java process was terminated for OOM. Java processes are generally configured with an OOM killer, which terminates the process when OOM is detected; the OOM message is usually printed to the out log. Check the run log (for example, the DataNode log at /var/log/Bigdata/hdfs/dn/hadoop-omm-datanode-<hostname>.log) for an OutOfMemory message (see the grep sketch below).
    2. The process was terminated by another process or manually. Inspect the DataNode run log (/var/log/Bigdata/hdfs/dn/hadoop-omm-datanode-<hostname>.log): "RECEIVED SIGNAL 15" is received before the health check fails. In the following example, the DataNode was terminated at 11:04:48 and started again about two minutes later at 11:06:52:
      2018-12-06 11:04:48,433 | ERROR | SIGTERM handler | RECEIVED SIGNAL 15: SIGTERM | LogAdapter.java:69
      2018-12-06 11:04:48,436 | INFO | Thread-1 | SHUTDOWN_MSG:
      /************************************************************
      SHUTDOWN_MSG: Shutting down DataNode at 192-168-235-85/192.168.235.85
      ************************************************************/ | LogAdapter.java:45
      2018-12-06 11:06:52,744 | INFO | main | STARTUP_MSG:
    This log shows that the DataNode was first shut down by another process, the health check then failed, and two minutes later NodeAgent started the DataNode process again.
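    A quick check for the OOM case, assuming the DataNode log path above (replace <hostname> with the actual node name):
      # Search the DataNode run log for OutOfMemory messages:
      grep -i "OutOfMemory" /var/log/Bigdata/hdfs/dn/hadoop-omm-datanode-<hostname>.log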
  • Procedure
    Enable operating system audit logging and add an audit rule for the kill command; this identifies which process sent the signal.
    Impact:
      Writing audit logs consumes some operating system performance; analysis shows the impact is below 1%.
      Audit logs occupy some disk space. The volume is small (MB level), and the default aging and free-disk-space checks prevent the disk from filling up.
    Locating the sender (perform the following on every node where the DataNode process may have restarted):
      Log in to the node as user root and run service auditd status to check the service state:
        Checking for service auditd running
      If the service is not running, restart it with service auditd restart (no impact; takes under one second):
        Shutting down auditd done
        Starting auditd done
      Temporarily add an audit rule for the kill command.
      Add the rule:
        auditctl -a exit,always -F arch=b64 -S kill -S tkill -S tgkill -F a1!=0 -k process_killed
      List the rules:
        auditctl -l
      After a process is terminated unexpectedly, query the kill history with ausearch -k process_killed. In the output, a0 is the PID of the terminated process (in hexadecimal) and a1 is the signal of the kill command (see the conversion sketch below).
    Verification:
      From the MRS page, restart one instance on the node, for example a DataNode.
      Run ausearch -k process_killed and confirm that a record is printed.
      For example, ausearch -k process_killed | grep ".sh" shows that the hdfs-daemon-ada* script shut down the DataNode process.
    Removing the kill audit rule:
      Run service auditd restart; the temporarily added kill audit rule is cleared.
      Run auditctl -l; if no related rule is shown, the rule has been removed.
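    A small helper for reading the ausearch output, assuming a record whose a0 field is, say, 32f1 (hexadecimal; the value in your output will differ):
      # Convert the hexadecimal a0 field to a decimal PID:
      printf '%d\n' 0x32f1
      # Prints 13041 — the PID of the process that was killed.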
  • Issue Background and Symptom
    Two users with identical permissions, testuser and bdpuser, are used with Flink.
    Creating a Flink cluster as user testuser works, but after switching to user bdpuser, running yarn-session.sh to create a Flink cluster fails with:
      2019-01-02 14:28:09,098 | ERROR | [main] | Ensure path threw exception | org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl (CuratorFrameworkImpl.java:566)
      org.apache.flink.shaded.zookeeper.org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /flink/application_1545397824912_0022
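    A hedged diagnostic: inspect the ACL on the ZooKeeper node named in the error, assuming the ZooKeeper client script is available and the quorum listens on the cluster's client port (24002 in this manual's examples):
      zkCli.sh -server <zk-ip>:24002
      # Inside the ZooKeeper shell, show which principals may access the Flink HA node:
      getAcl /flink/application_1545397824912_0022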
  • Solution
    For MRS 2.x and earlier, proceed as follows:
    Method 1: disable Flink SSL communication encryption by modifying the client configuration file conf/flink-conf.yaml:
      security.ssl.internal.enabled: false
    Method 2: keep Flink SSL communication encryption enabled (leave security.ssl.internal.enabled at its default) and configure SSL correctly:
      1. If the keystore or truststore path is configured as a relative path, the directory from which the Flink client command is executed must be able to access that relative path directly:
           security.ssl.internal.keystore: ssl/flink.keystore
           security.ssl.internal.truststore: ssl/flink.truststore
         Add the -t option to the Flink CLI yarn-session.sh command to ship the keystore and truststore files to every execution node:
           yarn-session.sh -t ssl/
      2. If the keystore or truststore path is configured as an absolute path, the keystore and truststore files must be placed at that absolute path on the Flink client and on every node:
           security.ssl.internal.keystore: /opt/client/Flink/flink/conf/flink.keystore
           security.ssl.internal.truststore: /opt/client/Flink/flink/conf/flink.truststore
    For MRS 3.x and later, proceed as follows:
    Method 1: disable Flink SSL communication encryption by modifying the client configuration file conf/flink-conf.yaml:
      security.ssl.enabled: false
    Method 2: keep Flink SSL communication encryption enabled (leave security.ssl.enabled at its default) and configure SSL correctly:
      1. If the keystore or truststore path is configured as a relative path, the directory from which the Flink client command is executed must be able to access that relative path directly (see the layout sketch below):
           security.ssl.keystore: ssl/flink.keystore
           security.ssl.truststore: ssl/flink.truststore
         Add the -t option to the Flink CLI yarn-session.sh command to ship the keystore and truststore files to every execution node:
           yarn-session.sh -t ssl/
      2. If the keystore or truststore path is configured as an absolute path, the keystore and truststore files must be placed at that absolute path on the Flink client and on every node:
           security.ssl.keystore: /opt/client/Flink/flink/conf/flink.keystore
           security.ssl.truststore: /opt/client/Flink/flink/conf/flink.truststore
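    An illustrative layout for the relative-path case (Method 2, item 1), assuming the client is installed under /opt/client and the keystore and truststore files have already been generated; everything except the configured relative path ssl/ is a placeholder:
      cd /opt/client/Flink/flink      # the directory from which the client command is run
      ls ssl/                         # flink.keystore  flink.truststore — the relative path named in flink-conf.yaml
      bin/yarn-session.sh -t ssl/     # -t ships the ssl/ directory to every execution node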
  • Issue Background and Symptom
    When creating a Flink cluster, the yarn-session.sh command hangs for a while and then fails with:
      2018-09-20 22:51:16,842 | WARN | [main] | Unable to get ClusterClient status from Application Client | org.apache.flink.yarn.YarnClusterClient (YarnClusterClient.java:253)
      org.apache.flink.util.FlinkException: Could not connect to the leading JobManager. Please check that the JobManager is running.
      at org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:861)
      at org.apache.flink.yarn.YarnClusterClient.getClusterStatus(YarnClusterClient.java:248)
      at org.apache.flink.yarn.YarnClusterClient.waitForClusterToBeReady(YarnClusterClient.java:516)
      at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:717)
      at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:514)
      at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:511)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:422)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
      at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
      at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:511)
      Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader gateway.
      at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:79)
      at org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:856)
      ... 10 common frames omitted
      Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
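    A hedged first check, assuming the Yarn client is available on the node: confirm whether the Flink session application actually reached the RUNNING state on Yarn before the client timed out:
      # List applications and their states; look for the Flink session entry:
      yarn application -list -appStates ALL | grep -i flink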
  • Issue Background and Symptom
    The client is installed successfully, but running a client command such as yarn-session.sh fails with:
      [root@host01 bin]# yarn-session.sh
      2018-10-25 19:27:01,397 | ERROR | [main] | Error while trying to split key and value in configuration file /opt/flinkclient/Flink/flink/conf/flink-conf.yaml:81: "security.kerberos.login.principal:pippo " | org.apache.flink.configuration.GlobalConfiguration (GlobalConfiguration.java:160)
      Exception in thread "main" org.apache.flink.configuration.IllegalConfigurationException: Error while parsing YAML configuration file :81: "security.kerberos.login.principal:pippo "
      at org.apache.flink.configuration.GlobalConfiguration.loadYAMLResource(GlobalConfiguration.java:161)
      at org.apache.flink.configuration.GlobalConfiguration.loadConfiguration(GlobalConfiguration.java:112)
      at org.apache.flink.configuration.GlobalConfiguration.loadConfiguration(GlobalConfiguration.java:79)
      at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:482)
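    The parser fails because the key and value on line 81 are not separated: YAML requires a space after the colon. A corrected sketch of the offending line (pippo is the principal taken from the error message):
      security.kerberos.login.principal: pippo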
  • Solution
    1. Download the user's keytab authentication file from MRS and place it in a directory on the node where the Flink client is installed.
    2. Configure the following in flink-conf.yaml (a consolidated sketch follows this list):
       The keytab path:
         security.kerberos.login.keytab: /home/flinkuser/keytab/abc222.keytab
       /home/flinkuser/keytab/abc222.keytab is the user directory, i.e. the directory chosen in step 1. Make sure the client user has permission on that directory.
       The principal name:
         security.kerberos.login.principal: abc222
    3. In HA mode, if ZooKeeper is configured, the ZooKeeper Kerberos authentication settings are also required:
         zookeeper.sasl.disable: false
         security.kerberos.login.contexts: Client
    4. If Kerberos authentication is also required between the Kafka client and the Kafka broker, configure:
         security.kerberos.login.contexts: Client,KafkaClient
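    For reference, a consolidated flink-conf.yaml sketch combining the settings above, assuming HA with ZooKeeper and a Kerberos-authenticated Kafka; the keytab path and principal are the examples from the steps:
      security.kerberos.login.keytab: /home/flinkuser/keytab/abc222.keytab
      security.kerberos.login.principal: abc222
      zookeeper.sasl.disable: false
      security.kerberos.login.contexts: Client,KafkaClient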
  • Cause Analysis
    In a security cluster, Flink must perform security authentication, and the client had not been configured for it.
    The Flink system uses two authentication mechanisms:
      Kerberos authentication: used between the Flink Yarn client, Yarn ResourceManager, JobManager, HDFS, TaskManager, Kafka, and ZooKeeper.
      Yarn's internal authentication mechanism: used between the Yarn ResourceManager and the ApplicationMaster (AM).
    A security cluster installation requires both Kerberos authentication and security cookie authentication. The log shows that the configuration item "security.kerberos.login.keytab:" in the configuration file is wrong: security had not been configured.
  • Issue Background and Symptom
    The client is installed successfully, but running a client command such as yarn-session.sh fails with:
      [root@host01 bin]# yarn-session.sh
      2018-10-25 01:22:06,454 | ERROR | [main] | Error while trying to split key and value in configuration file /opt/flinkclient/Flink/flink/conf/flink-conf.yaml:80: "security.kerberos.login.keytab: " | org.apache.flink.configuration.GlobalConfiguration (GlobalConfiguration.java:160)
      Exception in thread "main" org.apache.flink.configuration.IllegalConfigurationException: Error while parsing YAML configuration file :80: "security.kerberos.login.keytab: "
  • Cause Analysis
    The log /var/log/Bigdata/dbservice/DB/gaussdb.log contains nothing.
    The log /var/log/Bigdata/dbservice/scriptlog/preStartDBService.log contains the following, which suggests that configuration information has been lost:
      The program "gaussdb" was found by " /opt/Bigdata/MRS_xxx/install/dbservice/gaussdb/bin/gs_guc)
      But not was not the same version as gs_guc.
      Check your installation.
    Comparing the configuration files under /srv/BigData/dbdata_service/data on the active and standby DBServer nodes shows large differences (a comparison sketch follows).
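    A hedged way to compare the two nodes' configuration directories, assuming password-free ssh between the nodes; <standby-node> is a placeholder for the standby DBServer host:
      # Compare the data directory listings of the active and standby DBServer nodes:
      diff <(ls -l /srv/BigData/dbdata_service/data) <(ssh <standby-node> "ls -l /srv/BigData/dbdata_service/data")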
  • Cause Analysis
    The DBService backup page shows the following error messages:
      Clear temporary files at backup checkpoint DBService_test_DBService_DBService_20180326155921 that fialed last time.
      Temporyary files at backup checkpoint DBService_test_DBService_DBService20180326155921 that failed last time are cleared successfully.
    The log /var/log/Bigdata/dbservice/scriptlog/backup.log stops printing and contains no backup-related information.
    The log /var/log/Bigdata/controller/backupplugin.log on the active OMS node shows the following error (172.16.4.200 is the DBService floating IP; a connectivity check follows):
      result error is ssh:connect to host 172.16.4.200 port 22 : Connection refused
      DBService backup failed.
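    A hedged connectivity check from the active OMS node, mirroring what the backup plugin attempts; the floating IP is the one from the log, and the omm account is an assumption (use whichever account the backup plugin connects with):
      # Verify that sshd on the DBService floating IP is reachable on port 22:
      ssh -v omm@172.16.4.200 "echo ok"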
  • Issue Background and Symptom
    When creating a Topic with the Kafka client command, the Topic cannot be created:
      kafka-topics.sh --create --replication-factor 2 --partitions 2 --topic test --zookeeper 192.168.234.231:2181
    The error "replication factor larger than available brokers" is reported:
      Error while executing topic command : replication factor: 2 larger than available brokers: 0
      [2017-09-17 16:44:12,396] ERROR kafka.admin.AdminOperationException: replication factor: 2 larger than available brokers: 0
      at kafka.admin.AdminUtils$.assignReplicasToBrokers(AdminUtils.scala:117)
      at kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:403)
      at kafka.admin.TopicCommand$.createTopic(TopicCommand.scala:110)
      at kafka.admin.TopicCommand$.main(TopicCommand.scala:61)
      at kafka.admin.TopicCommand.main(TopicCommand.scala)
      (kafka.admin.TopicCommand$)
  • Cause Analysis
    The client command prints a "replication factor larger than available brokers" exception:
      Error while executing topic command : replication factor: 2 larger than available brokers: 0
      [2017-09-17 16:44:12,396] ERROR kafka.admin.AdminOperationException: replication factor: 2 larger than available brokers: 0
      at kafka.admin.AdminUtils$.assignReplicasToBrokers(AdminUtils.scala:117)
      at kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:403)
      at kafka.admin.TopicCommand$.createTopic(TopicCommand.scala:110)
      at kafka.admin.TopicCommand$.main(TopicCommand.scala:61)
      at kafka.admin.TopicCommand.main(TopicCommand.scala)
      (kafka.admin.TopicCommand$)
    Check through Manager whether the Kafka service is in a normal state and whether the number of available Brokers is smaller than the configured replication-factor.
    Check whether the ZooKeeper address in the client command is correct. The Kafka information stored in ZooKeeper lives under the /kafka znode, so the path must include /kafka; the command below is missing it:
      [root@10-10-144-2 client]# kafka-topics.sh --create --replication-factor 2 --partitions 2 --topic test --zookeeper 192.168.234.231:2181
  • Solution
    Make sure the Kafka service is in a normal state and the number of available Brokers is not smaller than the configured replication-factor.
    Append /kafka to the ZooKeeper address in the create command:
      [root@10-10-144-2 client]# kafka-topics.sh --create --replication-factor 1 --partitions 2 --topic test --zookeeper 192.168.234.231:2181/kafka
  • Issue Background and Symptom
    When creating a Topic with the Kafka client command, the Topic cannot be created:
      kafka-topics.sh --create --replication-factor 1 --partitions 2 --topic test --zookeeper 192.168.234.231:2181
    The error "NoNodeException: KeeperErrorCode = NoNode for /brokers/ids" is reported:
      Error while executing topic command : org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/ids
      [2017-09-17 16:35:28,520] ERROR org.I0Itec.zkclient.exception.ZkNoNodeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/ids
      at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:47)
      at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:995)
      at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:675)
      at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:671)
      at kafka.utils.ZkUtils.getChildren(ZkUtils.scala:541)
      at kafka.utils.ZkUtils.getSortedBrokerList(ZkUtils.scala:176)
      at kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:235)
      at kafka.admin.TopicCommand$.createTopic(TopicCommand.scala:105)
      at kafka.admin.TopicCommand$.main(TopicCommand.scala:60)
      at kafka.admin.TopicCommand.main(TopicCommand.scala)
      Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/ids
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
      at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2256)
      at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2284)
      at org.I0Itec.zkclient.ZkConnection.getChildren(ZkConnection.java:114)
      at org.I0Itec.zkclient.ZkClient$4.call(ZkClient.java:678)
      at org.I0Itec.zkclient.ZkClient$4.call(ZkClient.java:675)
      at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:985)
      ... 8 more
      (kafka.admin.TopicCommand$)
  • Cause Analysis
    The client command prints a NoNodeException:
      Error while executing topic command : org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/ids
      [2017-09-17 16:35:28,520] ERROR org.I0Itec.zkclient.exception.ZkNoNodeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/ids
      at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:47)
      at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:995)
      at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:675)
      at org.I0Itec.zkclient.ZkClient.getChildren(ZkClient.java:671)
      at kafka.utils.ZkUtils.getChildren(ZkUtils.scala:541)
      at kafka.utils.ZkUtils.getSortedBrokerList(ZkUtils.scala:176)
      at kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:235)
      at kafka.admin.TopicCommand$.createTopic(TopicCommand.scala:105)
      at kafka.admin.TopicCommand$.main(TopicCommand.scala:60)
      at kafka.admin.TopicCommand.main(TopicCommand.scala)
    Check through Manager whether the Kafka service is in a normal state.
    Check whether the ZooKeeper address in the client command is correct. The Kafka information stored in ZooKeeper lives under the /kafka znode, so the path must include /kafka; the command below is missing it (see the check sketch after this item):
      [root@10-10-144-2 client]# kafka-topics.sh --create --replication-factor 1 --partitions 2 --topic test --zookeeper 192.168.234.231:2181
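    A hedged way to confirm where the Kafka metadata actually lives in ZooKeeper, assuming the ZooKeeper client script is available on the node:
      zkCli.sh -server 192.168.234.231:2181
      # Inside the ZooKeeper shell, /brokers/ids does not exist at the root,
      # but it does exist under the /kafka znode:
      ls /brokers/ids        # fails with "Node does not exist"
      ls /kafka/brokers/ids  # lists the registered Broker IDs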
  • Issue Background and Symptom
    When setting Topic ACL permissions with the Kafka client command, the ACL cannot be set:
      kafka-acls.sh --authorizer-properties zookeeper.connect=10.5.144.2:2181/kafka --topic topic_acl --producer --add --allow-principal User:test_acl
    The error "NoAuthException: KeeperErrorCode = NoAuth for /kafka-acl-changes/acl_changes_0000000002" is reported:
      Error while executing ACL command: org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /kafka-acl-changes/acl_changes_0000000002
      org.I0Itec.zkclient.exception.ZkException: org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /kafka-acl-changes/acl_changes_0000000002
      at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:68)
      at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:995)
      at org.I0Itec.zkclient.ZkClient.delete(ZkClient.java:1038)
      at kafka.utils.ZkUtils.deletePath(ZkUtils.scala:499)
      at kafka.common.ZkNodeChangeNotificationListener$$anonfun$purgeObsoleteNotifications$1.apply(ZkNodeChangeNotificationListener.scala:118)
      at kafka.common.ZkNodeChangeNotificationListener$$anonfun$purgeObsoleteNotifications$1.apply(ZkNodeChangeNotificationListener.scala:112)
      at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
      at kafka.common.ZkNodeChangeNotificationListener.purgeObsoleteNotifications(ZkNodeChangeNotificationListener.scala:112)
      at kafka.common.ZkNodeChangeNotificationListener.kafka$common$ZkNodeChangeNotificationListener$$processNotifications(ZkNodeChangeNotificationListener.scala:97)
      at kafka.common.ZkNodeChangeNotificationListener.processAllNotifications(ZkNodeChangeNotificationListener.scala:77)
      at kafka.common.ZkNodeChangeNotificationListener.init(ZkNodeChangeNotificationListener.scala:65)
      at kafka.security.auth.SimpleAclAuthorizer.configure(SimpleAclAuthorizer.scala:136)
      at kafka.admin.AclCommand$.withAuthorizer(AclCommand.scala:73)
      at kafka.admin.AclCommand$.addAcl(AclCommand.scala:80)
      at kafka.admin.AclCommand$.main(AclCommand.scala:48)
      at kafka.admin.AclCommand.main(AclCommand.scala)
      Caused by: org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /kafka-acl-changes/acl_changes_0000000002
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:117)
      at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
      at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:1416)
      at org.I0Itec.zkclient.ZkConnection.delete(ZkConnection.java:104)
      at org.I0Itec.zkclient.ZkClient$11.call(ZkClient.java:1042)
      at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:985)
  • Cause Analysis
    The client command prints a NoAuthException.
    Query the currently authenticated user with the klist client command:
      [root@10-10-144-2 client]# klist
      Ticket cache: FILE:/tmp/krb5cc_0
      Default principal: test@HADOOP.COM
      Valid starting       Expires              Service principal
      01/25/17 11:06:48    01/26/17 11:06:45    krbtgt/HADOOP.COM@HADOOP.COM
    In this example the currently authenticated user is test.
    Query the user's group membership with the id command (see the membership check below):
      [root@10-10-144-2 client]# id test
      uid=20032(test) gid=10001(hadoop) groups=10001(hadoop),9998(ficommon),10003(kafka)
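    A hedged check based on the access-scenario table earlier in this section, where administrative Kafka operations require membership in the kafkaadmin or kafkasuperuser group; the test user above belongs only to hadoop, ficommon, and kafka:
      # Empty output means the user is in neither administrative group:
      id test | grep -oE 'kafkaadmin|kafkasuperuser'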
  • Cause Analysis
    The client command prints a NoAuthException:
      Error while executing topic command org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /config/topics
      org.I0Itec.zkclient.exception.ZkException: org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /config/topics
      at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:68)
      at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:685)
      at org.I0Itec.zkclient.ZkClient.create(ZkClient.java:304)
      at org.I0Itec.zkclient.ZkClient.createPersistent(ZkClient.java:213)
      at kafka.utils.ZkUtils$.createParentPath(ZkUtils.scala:215)
      at kafka.utils.ZkUtils$.updatePersistentPath(ZkUtils.scala:338)
      at kafka.admin.AdminUtils$.writeTopicConfig(AdminUtils.scala:247)
    Query the currently authenticated user with the klist client command:
      [root@10-10-144-2 client]# klist
      Ticket cache: FILE:/tmp/krb5cc_0
      Default principal: test@HADOOP.COM
      Valid starting       Expires              Service principal
      01/25/17 11:06:48    01/26/17 11:06:45    krbtgt/HADOOP.COM@HADOOP.COM
    In this example the currently authenticated user is test.
    Query the user's group membership with the id command:
      [root@10-10-144-2 client]# id test
      uid=20032(test) gid=10001(hadoop) groups=10001(hadoop),9998(ficommon),10003(kafka)
  • Issue Background and Symptom
    When creating a Topic with the Kafka client command, the Topic cannot be created:
      kafka-topics.sh --create --zookeeper 192.168.234.231:2181/kafka --replication-factor 1 --partitions 2 --topic test
    The error "NoAuthException, KeeperErrorCode = NoAuth for /config/topics" is reported:
      Error while executing topic command org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /config/topics
      org.I0Itec.zkclient.exception.ZkException: org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth for /config/topics
      at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:68)
      at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:685)
      at org.I0Itec.zkclient.ZkClient.create(ZkClient.java:304)
      at org.I0Itec.zkclient.ZkClient.createPersistent(ZkClient.java:213)
      at kafka.utils.ZkUtils$.createParentPath(ZkUtils.scala:215)
      at kafka.utils.ZkUtils$.updatePersistentPath(ZkUtils.scala:338)
      at kafka.admin.AdminUtils$.writeTopicConfig(AdminUtils.scala:247)
  • Issue Background and Symptom
    When deleting a Topic with the Kafka client command, the Topic cannot be deleted:
      kafka-topics.sh --delete --topic test4 --zookeeper 10.5.144.2:2181/kafka
    The error "ERROR kafka.admin.AdminOperationException: Error while deleting topic test4" is reported:
      Error while executing topic command : Error while deleting topic test4
      [2017-01-25 14:00:20,750] ERROR kafka.admin.AdminOperationException: Error while deleting topic test4
      at kafka.admin.TopicCommand$$anonfun$deleteTopic$1.apply(TopicCommand.scala:177)
      at kafka.admin.TopicCommand$$anonfun$deleteTopic$1.apply(TopicCommand.scala:162)
      at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
      at kafka.admin.TopicCommand$.deleteTopic(TopicCommand.scala:162)
      at kafka.admin.TopicCommand$.main(TopicCommand.scala:68)
      at kafka.admin.TopicCommand.main(TopicCommand.scala)
      (kafka.admin.TopicCommand$)
  • Cause Analysis
    The client command prints an AdminOperationException.
    Query the currently authenticated user with the klist client command:
      [root@10-10-144-2 client]# klist
      Ticket cache: FILE:/tmp/krb5cc_0
      Default principal: test@HADOOP.COM
      Valid starting       Expires              Service principal
      01/25/17 11:06:48    01/26/17 11:06:45    krbtgt/HADOOP.COM@HADOOP.COM
    In this example the currently authenticated user is test.
    Query the user's group membership with the id command:
      [root@10-10-144-2 client]# id test
      uid=20032(test) gid=10001(hadoop) groups=10001(hadoop),9998(ficommon),10003(kafka)
  • Issue Background and Symptom
    The parameter topology.worker.childopts configured in the topology code does not take effect. Key log output:
      [main] INFO b.s.StormSubmitter - Uploading topology jar /opt/jar/example.jar to assigned location: /srv/BigData/streaming/stormdir/nimbus/inbox/stormjar-8d3b778d-69ea-4fbe-ba88-01aa2036d753.jar
      Start uploading file '/opt/jar/example.jar' to '/srv/BigData/streaming/stormdir/nimbus/inbox/stormjar-8d3b778d-69ea-4fbe-ba88-01aa2036d753.jar' (65574612 bytes)
      [==================================================] 65574612 / 65574612
      File '/opt/jar/example.jar' uploaded to '/srv/BigData/streaming/stormdir/nimbus/inbox/stormjar-8d3b778d-69ea-4fbe-ba88-01aa2036d753.jar' (65574612 bytes)
      [main] INFO b.s.StormSubmitter - Successfully uploaded topology jar to assigned location: /srv/BigData/streaming/stormdir/nimbus/inbox/stormjar-8d3b778d-69ea-4fbe-ba88-01aa2036d753.jar
      [main] INFO b.s.StormSubmitter - Submitting topology word-count in distributed mode with conf {"topology.worker.childopts":"-Xmx4096m","storm.zookeeper.topology.auth.scheme":"digest","storm.zookeeper.topology.auth.payload":"-5915065013522446406:-6421330379815193999","topology.workers":1}
      [main] INFO b.s.StormSubmitter - Finished submitting topology: word-count
    The worker process information viewed with ps -ef | grep worker is as follows: