MAPREDUCE服务 MRS-批量写入Hudi表:批量写入Hudi表

时间:2024-10-22 09:17:14

批量写入Hudi表

  1. 引入Hudi包生成测试数据,参考使用Spark Shell创建Hudi表章节的24
  2. 写入Hudi表,写入命令中加入参数:option("hoodie.datasource.write.operation", "bulk_insert"),指定写入方式为bulk_insert,如下所示:
    df.write.format("org.apache.hudi").
    options(getQuickstartWriteConfigs).
    option("hoodie.datasource.write.precombine.field", "ts").
    option("hoodie.datasource.write.recordkey.field", "uuid").
    option("hoodie.datasource.write.partitionpath.field", "").
    option("hoodie.datasource.write.operation", "bulk_insert").
    option("hoodie.table.name", tableName).
    option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.NonpartitionedKeyGenerator").
    option("hoodie.datasource.hive_sync.enable", "true").
    option("hoodie.datasource.hive_sync.partition_fields", "").
    option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.NonPartitionedExtractor").
    option("hoodie.datasource.hive_sync.table", tableName).
    option("hoodie.datasource.hive_sync.use_jdbc", "false").
    option("hoodie.bulkinsert.shuffle.parallelism", 4).
    mode(Overwrite).
    save(basePath)
    • 示例中各参数介绍请参考表1
    • 使用spark datasource接口更新Mor表,Upsert写入小数据量时可能触发更新数据的小文件合并,使在Mor表的读优化视图中能查到部分更新数据。
    • 当update的数据对应的base文件是小文件时,insert中的数据和update中的数据会被合在一起和base文件直接做合并产生新的base文件,而不是写log。
support.huaweicloud.com/cmpntguide-mrs/mrs_01_24035.html