Importing Data from OBS into a ModelArts Dataset - Huawei Cloud

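AI Development Platform ModelArts - Importing Data from an OBS Directory into a Dataset: Annotation Status of File-Based Data

The annotation status of data is either "Unannotated" or "Annotated".

Unannotated: only the annotation objects (the images, text, and other items to be annotated) are imported. The annotation content (the annotation results) is not imported.

Annotated: both the annotation objects and the annotation content are imported. Datasets in "free format" currently do not support importing annotation content.

To make sure the annotation content can be read correctly, data must be stored strictly according to the specifications:

When importing from a directory, select an "annotation format" and store the data as that format requires. For the detailed specifications, see the annotation format section.

When importing from a manifest, the manifest file must comply with the manifest specifications.

If you select "Annotated" as the annotation status, make sure the directory or manifest file meets the corresponding format specifications; otherwise the import may fail. After importing annotated files, check that the imported data is in the annotated state.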

AI Development Platform ModelArts: Importing Data from OBS into a ModelArts Dataset

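AI Development Platform ModelArts - Importing Data from an OBS Directory into a Dataset: Importing a Table Dataset from OBS

ModelArts supports importing table data, that is, CSV files, from OBS.

Notes on importing table datasets:

The import succeeds only if the schema of the data source matches the schema specified when the dataset was created. The schema is the set of column names and types of the table. Once specified at dataset creation, it cannot be modified.

When CSV files are imported from OBS, data types are not validated, but the number of columns must match the dataset schema. If a value has an invalid format, it is set to null; see Table 4.

Requirements for the imported CSV files: select the directory that contains the files, and make sure the number of columns in each CSV file matches the dataset schema. The schema of the CSV files can be obtained automatically.

├─dataset-import-example
│      table_import_1.csv
│      table_import_2.csv
│      table_import_3.csv
│      table_import_4.csv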
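Because the import only requires the column count to match the schema, it can help to check the files locally before uploading them to OBS. The following is a minimal sketch of such a check, not part of ModelArts: the function name `find_column_mismatches` and the sample data are hypothetical, and the check only counts columns, as the import rule above describes.

```python
import csv
import io


def find_column_mismatches(csv_text, expected_columns):
    """Return (row_number, column_count) for every row whose column
    count differs from the expected schema width."""
    mismatches = []
    reader = csv.reader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        if not row:  # csv.reader yields [] for blank lines; skip them
            continue
        if len(row) != expected_columns:
            mismatches.append((row_number, len(row)))
    return mismatches


# Hypothetical three-column data; the last row is missing a value.
sample = "name,age,city\nAlice,30,Paris\nBob,25\n"
print(find_column_mismatches(sample, expected_columns=3))
```

For files on disk, the same function can be applied to the text read from each CSV file in the import directory.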


AI Development Platform ModelArts - Specifications for Importing from a Manifest File: Text Named Entities

    {
        "source":"content://Michael Jordan is the most famous basketball player in the world.",
        "usage":"TRAIN",
        "annotation":[
            {
                "type":"modelarts/text_entity",
                "name":"Person",
                "property":{
                    "@modelarts:start_index":0,
                    "@modelarts:end_index":14
                },
                "annotated-by":"human",
                "creation-time":"2019-01-23 11:30:30"
            },
            {
                "type":"modelarts/text_entity",
                "name":"Category",
                "property":{
                    "@modelarts:start_index":34,
                    "@modelarts:end_index":44
                },
                "annotated-by":"human",
                "creation-time":"2019-01-23 11:30:30"
            }
        ]
    }

The "source", "usage", and "annotation" parameters are the same as for image classification; see Table 1 for details.

The parameters of "property" are described in Table 6. For example, when "source" is "content://Michael Jordan" and you want to extract "Michael", the "start_index" is 0 and the "end_index" is 7.

Table 6 Parameters of "property"

Parameter | Data type | Description
@modelarts:start_index | Integer | Start position of the text. Values start at 0, and the character at start_index is included.
@modelarts:end_index | Integer | End position of the text. The character at end_index is not included.
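The inclusive start and exclusive end described in Table 6 follow the same convention as Python slicing, so the annotated spans can be recovered with `text[start:end]`. The sketch below illustrates this on the example above. The helper name `extract_entities` is hypothetical, and it assumes that the indices count characters after the "content://" prefix, which is consistent with the indices in the example.

```python
CONTENT_PREFIX = "content://"


def extract_entities(sample):
    """Return (label, text span) pairs for the text_entity annotations."""
    source = sample["source"]
    # Assumption: indices refer to the text after the "content://" prefix.
    if source.startswith(CONTENT_PREFIX):
        text = source[len(CONTENT_PREFIX):]
    else:
        text = source
    entities = []
    for annotation in sample.get("annotation", []):
        if annotation.get("type") != "modelarts/text_entity":
            continue
        prop = annotation["property"]
        start = prop["@modelarts:start_index"]  # included
        end = prop["@modelarts:end_index"]      # excluded
        entities.append((annotation["name"], text[start:end]))
    return entities


sample = {
    "source": "content://Michael Jordan is the most famous "
              "basketball player in the world.",
    "annotation": [
        {"type": "modelarts/text_entity", "name": "Person",
         "property": {"@modelarts:start_index": 0,
                      "@modelarts:end_index": 14}},
        {"type": "modelarts/text_entity", "name": "Category",
         "property": {"@modelarts:start_index": 34,
                      "@modelarts:end_index": 44}},
    ],
}
print(extract_entities(sample))
```

Python counts Unicode code points here. Whether ModelArts counts characters the same way for text outside the Basic Multilingual Plane is not stated in the source, so treat that case with care.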


AI Development Platform ModelArts - Specifications for Importing from a Manifest File: Speech Segmentation

    {
        "source":"s3://path/to/audio1.wav",
        "usage":"TRAIN",
        "annotation":[
            {
                "type":"modelarts/audio_segmentation",
                "property":{
                    "@modelarts:start_time":"00:01:10.123",
                    "@modelarts:end_time":"00:01:15.456",
                    "@modelarts:source":"Tom",
                    "@modelarts:content":"How are you?"
                },
                "annotated-by":"human",
                "creation-time":"2019-01-23 11:30:30"
            },
            {
                "type":"modelarts/audio_segmentation",
                "property":{
                    "@modelarts:start_time":"00:01:22.754",
                    "@modelarts:end_time":"00:01:24.145",
                    "@modelarts:source":"Jerry",
                    "@modelarts:content":"I'm fine, thank you."
                },
                "annotated-by":"human",
                "creation-time":"2019-01-23 11:30:30"
            }
        ]
    }

The "source", "usage", and "annotation" parameters are the same as for image classification; see Table 1 for details.

The parameters of "property" are described in Table 10.

Table 10 Parameters of "property"

Parameter | Data type | Description
@modelarts:start_time | String | Start time of the sound, in the format "hh:mm:ss.SSS", where "hh" is hours, "mm" is minutes, "ss" is seconds, and "SSS" is milliseconds.
@modelarts:end_time | String | End time of the sound, in the format "hh:mm:ss.SSS", where "hh" is hours, "mm" is minutes, "ss" is seconds, and "SSS" is milliseconds.
@modelarts:source | String | Source of the sound.
@modelarts:content | String | Content of the sound.
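The "hh:mm:ss.SSS" strings in Table 10 are easier to compare and validate once converted to a number. The sketch below converts a timestamp to milliseconds and computes the duration of a segment. The function names are hypothetical, and the strict two-digit and three-digit widths are an assumption taken from the format string, not a rule stated by ModelArts.

```python
import re

# Assumption: exactly two digits for hh, mm, ss and three for SSS.
_TIME_RE = re.compile(r"(\d{2}):(\d{2}):(\d{2})\.(\d{3})")


def to_milliseconds(timestamp):
    """Convert an "hh:mm:ss.SSS" string to a number of milliseconds."""
    match = _TIME_RE.fullmatch(timestamp)
    if match is None:
        raise ValueError(f"not in hh:mm:ss.SSS format: {timestamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def segment_duration_ms(prop):
    """Return end_time minus start_time for one "property" object."""
    start = to_milliseconds(prop["@modelarts:start_time"])
    end = to_milliseconds(prop["@modelarts:end_time"])
    if end < start:
        raise ValueError("end_time is earlier than start_time")
    return end - start


first_segment = {
    "@modelarts:start_time": "00:01:10.123",
    "@modelarts:end_time": "00:01:15.456",
}
print(segment_duration_ms(first_segment))
```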


AI Development Platform ModelArts - Specifications for Importing from a Manifest File: Text Triplets

    {
        "source":"content://\"Three Body\" is a series of long science fiction novels created by Liu Cix.",
        "usage":"TRAIN",
        "annotation":[
            {
                "type":"modelarts/text_entity",
                "name":"Person",
                "id":"E1",
                "property":{
                    "@modelarts:start_index":67,
                    "@modelarts:end_index":74
                },
                "annotated-by":"human",
                "creation-time":"2019-01-23 11:30:30"
            },
            {
                "type":"modelarts/text_entity",
                "name":"Book",
                "id":"E2",
                "property":{
                    "@modelarts:start_index":0,
                    "@modelarts:end_index":12
                },
                "annotated-by":"human",
                "creation-time":"2019-01-23 11:30:30"
            },
            {
                "type":"modelarts/text_triplet",
                "name":"Author",
                "id":"R1",
                "property":{
                    "@modelarts:from":"E1",
                    "@modelarts:to":"E2"
                },
                "annotated-by":"human",
                "creation-time":"2019-01-23 11:30:30"
            },
            {
                "type":"modelarts/text_triplet",
                "name":"Works",
                "id":"R2",
                "property":{
                    "@modelarts:from":"E2",
                    "@modelarts:to":"E1"
                },
                "annotated-by":"human",
                "creation-time":"2019-01-23 11:30:30"
            }
        ]
    }
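In the example above, each relation (`modelarts/text_triplet`) points at two entities through "@modelarts:from" and "@modelarts:to". A manifest in which a relation names an entity ID that does not exist is easy to produce by accident, so a consistency check can be useful before importing. The sketch below is one hypothetical way to do this. The function name `check_triplets` is an assumption, the ID patterns follow the "E+number" and "R+number" formats given for triplets in Table 3, and the reading that from/to must refer to entity IDs in the same sample is inferred from the example rather than stated explicitly.

```python
import re


def check_triplets(annotations):
    """Return a list of problems found in entity and relation IDs."""
    problems = []
    entity_ids = set()
    # First pass: collect entity IDs of the form E<number>.
    for annotation in annotations:
        if annotation.get("type") == "modelarts/text_entity":
            entity_id = annotation.get("id", "")
            if re.fullmatch(r"E\d+", entity_id):
                entity_ids.add(entity_id)
            else:
                problems.append(f"entity id {entity_id!r} is not E<number>")
    # Second pass: check relation IDs and the entities they refer to.
    for annotation in annotations:
        if annotation.get("type") != "modelarts/text_triplet":
            continue
        relation_id = annotation.get("id", "")
        if not re.fullmatch(r"R\d+", relation_id):
            problems.append(f"relation id {relation_id!r} is not R<number>")
        prop = annotation.get("property", {})
        for key in ("@modelarts:from", "@modelarts:to"):
            if prop.get(key) not in entity_ids:
                problems.append(
                    f"relation {relation_id!r}: {key} refers to "
                    f"unknown entity {prop.get(key)!r}"
                )
    return problems


annotations = [
    {"type": "modelarts/text_entity", "name": "Person", "id": "E1"},
    {"type": "modelarts/text_entity", "name": "Book", "id": "E2"},
    {"type": "modelarts/text_triplet", "name": "Author", "id": "R1",
     "property": {"@modelarts:from": "E1", "@modelarts:to": "E2"}},
]
print(check_triplets(annotations))
```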


AI Development Platform ModelArts - Specifications for Importing from a Manifest File: Speech Content

    {
        "source":"s3://path/to/audio1.wav",
        "annotation":[
            {
                "type":"modelarts/audio_content",
                "property":{
                    "@modelarts:content":"Today is a good day."
                },
                "annotated-by":"human",
                "creation-time":"2019-01-23 11:30:30"
            }
        ]
    }

The "source", "usage", and "annotation" parameters are the same as for image classification; see Table 1 for details.

The "@modelarts:content" parameter in "property" is of type String and holds the speech content.


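AI Development Platform ModelArts - Specifications for Importing from a Manifest File: Image Classification

    {
        "source":"s3://path/to/image1.jpg",
        "usage":"TRAIN",
        "hard":"true",
        "hard-coefficient":0.8,
        "id":"0162005993f8065ef47eefb59d1e4970",
        "annotation": [
            {
                "type": "modelarts/image_classification",
                "name": "cat",
                "property": {
                    "color":"white",
                    "kind":"Persian cat"
                },
                "hard":"true",
                "hard-coefficient":0.8,
                "annotated-by":"human",
                "creation-time":"2019-01-23 11:30:30"
            },
            {
                "type": "modelarts/image_classification",
                "name":"animal",
                "annotated-by":"modelarts/active-learning",
                "confidence": 0.8,
                "creation-time":"2019-01-23 11:30:30"
            }],
        "inference-loc":"/path/to/inference-output"
    }

Table 1 Field description

Field | Required | Description
source | Yes | URI of the annotated object. For the data source types and examples, see Table 2.
usage | No | Empty by default. Possible values: TRAIN (the object is used for training), EVAL (the object is used for evaluation), TEST (the object is used for testing), INFERENCE (the object is used for inference). If the field is not given, the user decides how to use the object.
id | No | Sample ID exported by the system. It can be omitted when importing.
annotation | No | If not set, the object is unannotated. The value is a list of objects; see Table 3 for the parameters.
inference-loc | No | Present when the file is generated by an inference service. It gives the location of the inference output file.

Table 2 Data source types

Type | Example
OBS | "source":"s3://path-to-jpg"
Content | "source":"content://I love machine learning"

Table 3 annotation object description

Field | Required | Description
type | Yes | Label type. Possible values: image_classification (image classification), text_classification (text classification), text_entity (text named entity), object_detection (object detection), audio_classification (sound classification), audio_content (sound content), audio_segmentation (sound start and end points).
name | Yes/No | Required for classification and optional for other types. In this example it is the image classification name.
id | Yes/No | Label ID. Required for triplets and optional for other types. In a triplet, the entity label ID has the format "E+number", for example "E1" or "E2", and the relation label ID has the format "R+number", for example "R1" or "R2".
property | No | Attributes of the annotation. In this example, cat has two attributes: color and kind.
hard | No | Whether the annotation is a hard example. "True" means it is a hard example; "False" means it is not.
annotated-by | No | Defaults to "human", which indicates manual annotation.
creation-time | No | Time when the annotation was created. This is the time the user wrote the annotation, not the time the manifest was generated.
confidence | No | Confidence of a machine annotation, in the range 0 to 1.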
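The required and optional fields in Table 1 and Table 3 can be turned into a small pre-import check. The sketch below is a hypothetical validator, not the check that ModelArts itself performs: the function name `validate_sample` is an assumption, and the type strings use the "modelarts/" prefix seen in the examples, whereas Table 3 lists the values without it.

```python
# Assumption: type strings carry the "modelarts/" prefix, as in the examples.
CLASSIFICATION_TYPES = {
    "modelarts/image_classification",
    "modelarts/text_classification",
    "modelarts/audio_classification",
}
ALLOWED_USAGE = {"TRAIN", "EVAL", "TEST", "INFERENCE"}


def validate_sample(sample):
    """Return a list of rule violations for one manifest sample."""
    errors = []
    if not sample.get("source"):
        errors.append('missing required field "source"')
    usage = sample.get("usage")
    if usage and usage not in ALLOWED_USAGE:  # empty usage is allowed
        errors.append(f"unknown usage {usage!r}")
    for index, annotation in enumerate(sample.get("annotation", [])):
        ann_type = annotation.get("type")
        if not ann_type:
            errors.append(f'annotation {index}: missing "type"')
        if ann_type in CLASSIFICATION_TYPES and not annotation.get("name"):
            errors.append(f'annotation {index}: classification needs "name"')
        if ann_type == "modelarts/text_triplet" and not annotation.get("id"):
            errors.append(f'annotation {index}: triplet needs "id"')
        confidence = annotation.get("confidence")
        if confidence is not None and not 0 <= confidence <= 1:
            errors.append(f"annotation {index}: confidence outside 0 to 1")
    return errors


good_sample = {
    "source": "s3://path/to/image1.jpg",
    "usage": "TRAIN",
    "annotation": [
        {"type": "modelarts/image_classification", "name": "cat"},
        {"type": "modelarts/image_classification", "name": "animal",
         "annotated-by": "modelarts/active-learning", "confidence": 0.8},
    ],
}
print(validate_sample(good_sample))
```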


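AI Development Platform ModelArts - Specifications for Importing from a Manifest File: Text Classification

    {
        "source": "content://I like this product",
        "id":"XGDVGS",
        "annotation": [
            {
                "type": "modelarts/text_classification",
                "name": "positive",
                "annotated-by": "human",
                "creation-time": "2019-01-23 11:30:30"
            }
        ]
    }

The content part of "source" is the text being annotated (UTF-8 encoded; Chinese is allowed). The other parameters are the same as for image classification; see Table 1.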
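Text classification samples like the one above are often generated from existing (text, label) pairs. The sketch below builds such samples with the standard `json` module. The function name `build_text_classification_manifest` is hypothetical, and writing one JSON object per line is an assumption about the manifest layout that this section does not state. `ensure_ascii=False` keeps Chinese text readable in the UTF-8 output instead of escaping it.

```python
import json


def build_text_classification_manifest(pairs, creation_time=None):
    """Build manifest text from (text, label) pairs.

    Assumption: one JSON object per line.
    """
    lines = []
    for text, label in pairs:
        annotation = {
            "type": "modelarts/text_classification",
            "name": label,
            "annotated-by": "human",
        }
        if creation_time is not None:
            annotation["creation-time"] = creation_time
        sample = {"source": "content://" + text, "annotation": [annotation]}
        # ensure_ascii=False leaves non-ASCII text unescaped (UTF-8 friendly).
        lines.append(json.dumps(sample, ensure_ascii=False))
    return "\n".join(lines)


manifest = build_text_classification_manifest(
    [("I like this product", "positive"), ("我喜欢这个产品", "positive")],
    creation_time="2019-01-23 11:30:30",
)
print(manifest)
```

When saving the result, open the file with `encoding="utf-8"` so the encoding matches the requirement above.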


AI Development Platform ModelArts - Specifications for Importing from a Manifest File: Sound Classification

    {
        "source": "s3://path/to/pets.wav",
        "annotation": [
            {
                "type": "modelarts/audio_classification",
                "name":"cat",
                "annotated-by":"human",
                "creation-time":"2019-01-23 11:30:30"
            }
        ]
    }

The "source", "usage", and "annotation" parameters are the same as for image classification; see Table 1 for details.
