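Trusted Intelligent Computing Service (TICS) - Preparing Data: Preparing Local Horizontal Federated Data Resources
Preparing Local Horizontal Federated Data Resources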
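- Upload the dataset files (job participants)
Upload the dataset files to the mount path of the compute node so that the scripts executed by the compute node can read them. With host mounting, upload the files to the mount path on the host machine. With OBS mounting, use the Object Storage Service (OBS) provided by Huawei Cloud to upload the files to the object bucket that the current compute node uses.
Figure 5 Object bucket name
This example uses host mounting:
- Create a host-mounted compute node, Agent1, with the mount path /tmp/tics1/.
- Use a file upload tool to upload the dataset folder, which contains the dataset iris1.csv, to the /tmp/tics1/ directory on the host.
The content of iris1.csv is as follows: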
    sepal_length,sepal_width,petal_length,petal_width,class
    5.1,3.5,1.4,0.3,Iris-setosa
    5.7,3.8,1.7,0.3,Iris-setosa
    5.1,3.8,1.5,0.3,Iris-setosa
    5.4,3.4,1.7,0.2,Iris-setosa
    5.1,3.7,1.5,0.4,Iris-setosa
    4.6,3.6,1,0.2,Iris-setosa
    5.1,3.3,1.7,0.5,Iris-setosa
    4.8,3.4,1.9,0.2,Iris-setosa
    5,3,1.6,0.2,Iris-setosa
    5,3.4,1.6,0.4,Iris-setosa
    5.2,3.5,1.5,0.2,Iris-setosa
    5.2,3.4,1.4,0.2,Iris-setosa
    4.7,3.2,1.6,0.2,Iris-setosa
    4.8,3.1,1.6,0.2,Iris-setosa
    5.4,3.4,1.5,0.4,Iris-setosa
    5.2,4.1,1.5,0.1,Iris-setosa
    5.5,4.2,1.4,0.2,Iris-setosa
    4.9,3.1,1.5,0.1,Iris-setosa
    5,3.2,1.2,0.2,Iris-setosa
    5.5,3.5,1.3,0.2,Iris-setosa
    4.9,3.1,1.5,0.1,Iris-setosa
    4.4,3,1.3,0.2,Iris-setosa
    5.1,3.4,1.5,0.2,Iris-setosa
    5,3.5,1.3,0.3,Iris-setosa
    4.5,2.3,1.3,0.3,Iris-setosa
    4.4,3.2,1.3,0.2,Iris-setosa
    5,3.5,1.6,0.6,Iris-setosa
    5.1,3.8,1.9,0.4,Iris-setosa
    4.8,3,1.4,0.3,Iris-setosa
    5.1,3.8,1.6,0.2,Iris-setosa
    4.6,3.2,1.4,0.2,Iris-setosa
    5.3,3.7,1.5,0.2,Iris-setosa
    5,3.3,1.4,0.2,Iris-setosa
    6.8,2.8,4.8,1.4,Iris-versicolor
    6.7,3,5,1.7,Iris-versicolor
    6,2.9,4.5,1.5,Iris-versicolor
    5.7,2.6,3.5,1,Iris-versicolor
    5.5,2.4,3.8,1.1,Iris-versicolor
    5.5,2.4,3.7,1,Iris-versicolor
    5.8,2.7,3.9,1.2,Iris-versicolor
    6,2.7,5.1,1.6,Iris-versicolor
    5.4,3,4.5,1.5,Iris-versicolor
    6,3.4,4.5,1.6,Iris-versicolor
    6.7,3.1,4.7,1.5,Iris-versicolor
    6.3,2.3,4.4,1.3,Iris-versicolor
    5.6,3,4.1,1.3,Iris-versicolor
    5.5,2.5,4,1.3,Iris-versicolor
    5.5,2.6,4.4,1.2,Iris-versicolor
    6.1,3,4.6,1.4,Iris-versicolor
    5.8,2.6,4,1.2,Iris-versicolor
    5,2.3,3.3,1,Iris-versicolor
    5.6,2.7,4.2,1.3,Iris-versicolor
    5.7,3,4.2,1.2,Iris-versicolor
    5.7,2.9,4.2,1.3,Iris-versicolor
    6.2,2.9,4.3,1.3,Iris-versicolor
    5.1,2.5,3,1.1,Iris-versicolor
    5.7,2.8,4.1,1.3,Iris-versicolor
    6.3,3.3,6,2.5,Iris-virginica
    5.8,2.7,5.1,1.9,Iris-virginica
    7.1,3,5.9,2.1,Iris-virginica
    6.3,2.9,5.6,1.8,Iris-virginica
    6.5,3,5.8,2.2,Iris-virginica
    7.6,3,6.6,2.1,Iris-virginica
    4.9,2.5,4.5,1.7,Iris-virginica
    7.3,2.9,6.3,1.8,Iris-virginica
    6.7,2.5,5.8,1.8,Iris-virginica
    7.2,3.6,6.1,2.5,Iris-virginica
    6.5,3.2,5.1,2,Iris-virginica
    6.4,2.7,5.3,1.9,Iris-virginica
    6.8,3,5.5,2.1,Iris-virginica
    5.7,2.5,5,2,Iris-virginica
    5.8,2.8,5.1,2.4,Iris-virginica
    6.4,3.2,5.3,2.3,Iris-virginica
    6.5,3,5.5,1.8,Iris-virginica
    7.7,3.8,6.7,2.2,Iris-virginica
    7.7,2.6,6.9,2.3,Iris-virginica
    6,2.2,5,1.5,Iris-virginica
    6.9,3.2,5.7,2.3,Iris-virginica
    5.6,2.8,4.9,2,Iris-virginica
    7.7,2.8,6.7,2,Iris-virginica
    6.3,2.7,4.9,1.8,Iris-virginica
    6.7,3.3,5.7,2.1,Iris-virginica
    7.2,3.2,6,1.8,Iris-virginica
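- So that the compute node program inside the container has permission to read the files, run chown -R 1000:1000 /tmp/tics1/ to change the owner and group of the files under the mount directory to 1000:1000.
- On a second host, create the compute node Agent2 with the mount path /tmp/tics2/. Upload the dataset folder, which contains the dataset iris2.csv, to that directory on the host, and change the owner in the same way.
The content of iris2.csv is as follows: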
    sepal_length,sepal_width,petal_length,petal_width,class
    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3,1.4,0.2,Iris-setosa
    4.7,3.2,1.3,0.2,Iris-setosa
    4.6,3.1,1.5,0.2,Iris-setosa
    5,3.6,1.4,0.2,Iris-setosa
    5.4,3.9,1.7,0.4,Iris-setosa
    4.6,3.4,1.4,0.3,Iris-setosa
    5,3.4,1.5,0.2,Iris-setosa
    4.4,2.9,1.4,0.2,Iris-setosa
    4.9,3.1,1.5,0.1,Iris-setosa
    5.4,3.7,1.5,0.2,Iris-setosa
    4.8,3.4,1.6,0.2,Iris-setosa
    4.8,3,1.4,0.1,Iris-setosa
    4.3,3,1.1,0.1,Iris-setosa
    5.8,4,1.2,0.2,Iris-setosa
    5.7,4.4,1.5,0.4,Iris-setosa
    5.4,3.9,1.3,0.4,Iris-setosa
    7,3.2,4.7,1.4,Iris-versicolor
    6.4,3.2,4.5,1.5,Iris-versicolor
    6.9,3.1,4.9,1.5,Iris-versicolor
    5.5,2.3,4,1.3,Iris-versicolor
    6.5,2.8,4.6,1.5,Iris-versicolor
    5.7,2.8,4.5,1.3,Iris-versicolor
    6.3,3.3,4.7,1.6,Iris-versicolor
    4.9,2.4,3.3,1,Iris-versicolor
    6.6,2.9,4.6,1.3,Iris-versicolor
    5.2,2.7,3.9,1.4,Iris-versicolor
    5,2,3.5,1,Iris-versicolor
    5.9,3,4.2,1.5,Iris-versicolor
    6,2.2,4,1,Iris-versicolor
    6.1,2.9,4.7,1.4,Iris-versicolor
    5.6,2.9,3.6,1.3,Iris-versicolor
    6.7,3.1,4.4,1.4,Iris-versicolor
    5.6,3,4.5,1.5,Iris-versicolor
    5.8,2.7,4.1,1,Iris-versicolor
    6.2,2.2,4.5,1.5,Iris-versicolor
    5.6,2.5,3.9,1.1,Iris-versicolor
    5.9,3.2,4.8,1.8,Iris-versicolor
    6.1,2.8,4,1.3,Iris-versicolor
    6.3,2.5,4.9,1.5,Iris-versicolor
    6.1,2.8,4.7,1.2,Iris-versicolor
    6.4,2.9,4.3,1.3,Iris-versicolor
    6.6,3,4.4,1.4,Iris-versicolor
    6.8,2.8,4.8,1.4,Iris-versicolor
    6.2,2.8,4.8,1.8,Iris-virginica
    6.1,3,4.9,1.8,Iris-virginica
    6.4,2.8,5.6,2.1,Iris-virginica
    7.2,3,5.8,1.6,Iris-virginica
    7.4,2.8,6.1,1.9,Iris-virginica
    7.9,3.8,6.4,2,Iris-virginica
    6.4,2.8,5.6,2.2,Iris-virginica
    6.3,2.8,5.1,1.5,Iris-virginica
    6.1,2.6,5.6,1.4,Iris-virginica
    7.7,3,6.1,2.3,Iris-virginica
    6.3,3.4,5.6,2.4,Iris-virginica
    6.4,3.1,5.5,1.8,Iris-virginica
    6,3,4.8,1.8,Iris-virginica
    6.9,3.1,5.4,2.1,Iris-virginica
    6.7,3.1,5.6,2.4,Iris-virginica
    6.9,3.1,5.1,2.3,Iris-virginica
    5.8,2.7,5.1,1.9,Iris-virginica
    6.8,3.2,5.9,2.3,Iris-virginica
    6.7,3.3,5.7,2.5,Iris-virginica
    6.7,3,5.2,2.3,Iris-virginica
    6.3,2.5,5,1.9,Iris-virginica
    6.5,3,5.2,2,Iris-virginica
    6.2,3.4,5.4,2.3,Iris-virginica
    5.9,3,5.1,1.8,Iris-virginica
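In horizontal federated learning the participants hold different samples that share the same feature columns, so it can help to confirm that each participant's CSV carries the same header before uploading. The following is a minimal sketch of such a check, not part of TICS: the helper names summarize_csv and check_same_schema are hypothetical, and the two excerpts are only the first rows of iris1.csv and iris2.csv shown above.

```python
import csv
import io
from collections import Counter

# Column layout used by both sample datasets in this guide.
EXPECTED_HEADER = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]


def summarize_csv(text):
    """Return (header, per-class row counts) for CSV text (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    counts = Counter(row[-1] for row in reader if row)
    return header, counts


def check_same_schema(*csv_texts):
    """Raise ValueError unless every CSV text has the expected header."""
    for index, text in enumerate(csv_texts, start=1):
        header, _ = summarize_csv(text)
        if header != EXPECTED_HEADER:
            raise ValueError(f"dataset {index} has an unexpected header: {header}")
    return True


# Short excerpts only; in practice read the full files from the mount path.
IRIS1_EXCERPT = """sepal_length,sepal_width,petal_length,petal_width,class
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
"""
IRIS2_EXCERPT = """sepal_length,sepal_width,petal_length,petal_width,class
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3,1.4,0.2,Iris-setosa
"""

if __name__ == "__main__":
    check_same_schema(IRIS1_EXCERPT, IRIS2_EXCERPT)
    print(summarize_csv(IRIS1_EXCERPT)[1])
```

To check real files, read each one with open(path, newline="") and pass its text to the same functions.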
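- Prepare the model file and initial weights (job initiator)
The job initiator provides the model and, optionally, the initial weights, uploads them to the mount directory of Agent1, and runs chown -R 1000:1000 /tmp/tics1/ to change the owner and group of the files under the mount directory.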
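After running chown, it may be worth confirming that nothing under the mount directory was missed. The sketch below is an illustration rather than a TICS tool: find_wrong_owner is a hypothetical helper that walks a directory and lists every directory and file whose owner or group is not the expected 1000:1000.

```python
import os


def find_wrong_owner(root, uid=1000, gid=1000):
    """List paths under root (root included) whose uid or gid differs.

    Hypothetical helper; the defaults match the 1000:1000 used in this guide.
    """
    mismatched = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Check the directory itself, then every file directly inside it.
        for path in [dirpath] + [os.path.join(dirpath, name) for name in filenames]:
            info = os.lstat(path)
            if info.st_uid != uid or info.st_gid != gid:
                mismatched.append(path)
    return mismatched


if __name__ == "__main__":
    # An empty list means every checked path is owned by 1000:1000.
    print(find_wrong_owner("/tmp/tics1/"))
```

If the list is not empty, rerunning the chown command on the mount directory should bring the remaining paths in line.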
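Use Python code to create the model file and save it as the binary file model.h5. Taking the iris dataset as an example, generate the following model: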
    import tensorflow as tf
    import keras

    # Three fully connected layers: 4 input features -> 4 -> 6 -> 3 classes
    model = keras.Sequential([
        keras.layers.Dense(4, activation=tf.nn.relu, input_shape=(4,)),
        keras.layers.Dense(6, activation=tf.nn.relu),
        keras.layers.Dense(3, activation='softmax')
    ])
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    # Save the model as an HDF5 binary file
    model.save("d:/model.h5")
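The initial weights are an array of floating-point numbers that corresponds to the model. The result result_1 produced by federated learning training can be used as the initial weights. A sample follows: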
-0.23300957679748535,0.7804553508758545,0.0064492723904550076,0.5866460800170898,0.676144003868103,-0.7883696556091309,0.5472091436386108,-0.20961782336235046,0.58524489402771,-0.5079598426818848,-0.47474920749664307,-0.3519996106624603,-0.10822880268096924,-0.5457949042320251,-0.28117161989212036,-0.7369481325149536,-0.04728877171874046,0.003856887575238943,0.051739662885665894,0.033792052417993546,-0.31878742575645447,0.7511205673217773,0.3158722519874573,-0.7290999293327332,0.7187696695327759,0.09846954792737961,-0.06735057383775711,0.7165604829788208,-0.730293869972229,0.4473201036453247,-0.27151209115982056,-0.6971480846405029,0.7360773086547852,0.819558322429657,0.4984433054924011,0.05300116539001465,-0.6597640514373779,0.7849202156066895,0.6896201372146606,0.11731931567192078,-0.5380218029022217,0.18895208835601807,-0.18693888187408447,0.357051283121109,0.05440644919872284,0.042556408792734146,-0.04341210797429085,0.0,-0.04367709159851074,-0.031455427408218384,0.24731603264808655,-0.062861368060112,-0.4265706539154053,0.32981523871421814,-0.021271884441375732,0.15228557586669922,0.1818728893995285,0.4162319302558899,-0.22432318329811096,0.7156463861465454,-0.13709741830825806,0.7237883806228638,-0.5489991903305054,0.47034209966659546,-0.04692812263965607,0.7690137028694153,0.40263476967811584,-0.4405142068862915,0.016018997877836227,-0.04845477640628815,0.037553105503320694
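Because the weights must correspond to the model, a quick length check can catch a mismatched file early. The iris model above has Dense layers of 4, 6, and 3 units on 4 input features, which gives (4×4+4) + (4×6+6) + (6×3+3) = 71 parameters, and the sample above contains 71 values. The sketch below reproduces that arithmetic; dense_param_count and parse_weights are hypothetical helpers, and it assumes only that the weights are a flat comma-separated list as in the sample.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases in a stack of fully connected layers.

    layer_sizes is the input width followed by each layer's unit count.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # kernel entries plus one bias per unit
    return total


def parse_weights(text):
    """Parse a comma-separated string of floats into a list."""
    return [float(value) for value in text.strip().split(",") if value.strip()]


if __name__ == "__main__":
    # 4 input features, then Dense(4), Dense(6), Dense(3) as in the iris model.
    print(dense_param_count([4, 4, 6, 3]))
```

Comparing len(parse_weights(text)) with dense_param_count([4, 4, 6, 3]) for a weights file confirms that the array has the size the model expects.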
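- Write the training script (job initiator)
The job initiator also needs to write the federated learning training script, in which the user implements the logic for reading data, training the model, evaluating the model, and obtaining the evaluation metrics. The compute node passes the path attribute from the dataset configuration file to the training script as a parameter.
The JobParam attributes are as follows: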
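    class JobParam:
        """Training script parameters"""
        # Job ID
        job_id = ''
        # Current round
        round = 0
        # Number of iterations (epochs)
        epoch = 0
        # Model file path
        model_file = ''
        # Dataset path
        dataset_path = ''
        # Whether to run evaluation only
        eval_only = False
        # Weights file
        weights_file = ''
        # Output path
        output = ''
        # Other parameters as a JSON string
        param = ''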
A sample training script for the iris dataset, iris_train.py, is as follows:
    # -*- coding: utf-8 -*-
    import getopt
    import glob
    import os
    import sys

    import keras

    import horizontal.horizontallearning as hl


    def train():
        # Parse the command-line input
        jobParam = JobParam()
        jobParam.parse_from_command_line()
        job_type = 'evaluation' if jobParam.eval_only else 'training'
        print(f"Starting round {jobParam.round} {job_type}")

        # Load the model and set the initial weights
        model = keras.models.load_model(jobParam.model_file)
        hl.set_model_weights(model, jobParam.weights_file)

        # Load data, train, and evaluate -- implemented by the user
        print(f"Load data {jobParam.dataset_path}")
        train_x, test_x, train_y, test_y, class_dict = load_data(jobParam.dataset_path)

        if not jobParam.eval_only:
            b_size = 1
            model.fit(train_x, train_y, batch_size=b_size, epochs=jobParam.epoch, shuffle=True, verbose=1)
            print(f"Training job [{jobParam.job_id}] finished")
        # Named eval_result to avoid shadowing the built-in eval()
        eval_result = model.evaluate(test_x, test_y, verbose=0)
        print("Evaluation on test data: loss = %0.6f accuracy = %0.2f%% \n" % (eval_result[0], eval_result[1] * 100))

        # Save the result in JSON format -- the user collects the evaluation metrics
        result = {}
        result['loss'] = eval_result[0]
        result['accuracy'] = eval_result[1]

        # Generate the result file
        hl.save_train_result(jobParam, model, result)


    # Read the CSV dataset and split it into a training set and a test set.
    # CSV_FILE_PATH: path of a CSV file, or of a directory containing CSV files
    def load_data(CSV_FILE_PATH):
        import pandas as pd
        from sklearn.model_selection import train_test_split
        from sklearn.preprocessing import LabelBinarizer

        # Directory dataset: read every CSV file in the directory
        if os.path.isdir(CSV_FILE_PATH):
            print(f'read file folder [{CSV_FILE_PATH}]')
            all_csv_path = glob.glob(os.path.join(CSV_FILE_PATH, '*.csv'))
            all_csv_path.sort()
            csv_list = []
            for csv_path in all_csv_path:
                csv_list.append(pd.read_csv(csv_path))
            IRIS = pd.concat(csv_list)
        # Single CSV file
        else:
            IRIS = pd.read_csv(CSV_FILE_PATH)

        target_var = 'class'  # target variable
        # Features of the dataset
        features = list(IRIS.columns)
        features.remove(target_var)
        # Classes of the target variable
        Class = IRIS[target_var].unique()
        # Dictionary mapping each class to an integer
        Class_dict = dict(zip(Class, range(len(Class))))
        # Add a target column that holds the encoded target variable
        IRIS['target'] = IRIS[target_var].apply(lambda x: Class_dict[x])
        # One-hot encode the target variable
        lb = LabelBinarizer()
        lb.fit(list(Class_dict.values()))
        transformed_labels = lb.transform(IRIS['target'])
        y_bin_labels = []  # names of the one-hot encoded columns
        for i in range(transformed_labels.shape[1]):
            y_bin_labels.append('y' + str(i))
            IRIS['y' + str(i)] = transformed_labels[:, i]

        # Split the dataset into a training set and a test set
        train_x, test_x, train_y, test_y = train_test_split(IRIS[features], IRIS[y_bin_labels],
                                                            train_size=0.7, test_size=0.3, random_state=0)
        return train_x, test_x, train_y, test_y, Class_dict


    class JobParam:
        """Training script parameters"""
        # required parameters
        job_id = ''
        round = 0
        epoch = 0
        model_file = ''
        dataset_path = ''
        eval_only = False

        # optional parameters
        weights_file = ''
        output = ''
        param = ''

        def parse_from_command_line(self):
            """Parse the job parameters from the command line"""
            opts, args = getopt.getopt(sys.argv[1:], 'hn:w:',
                                       ['round=', 'epoch=', 'model_file=', 'eval_only', 'dataset_path=',
                                        'weights_file=', 'output=', 'param=', 'job_id='])
            for key, value in opts:
                if key in ['--round']:
                    self.round = int(value)
                if key in ['--epoch']:
                    self.epoch = int(value)
                if key in ['--model_file']:
                    self.model_file = value
                if key in ['--eval_only']:
                    self.eval_only = True
                if key in ['--dataset_path']:
                    self.dataset_path = value
                if key in ['--weights_file']:
                    self.weights_file = value
                if key in ['--output']:
                    self.output = value
                if key in ['--param']:
                    self.param = value
                if key in ['--job_id']:
                    self.job_id = value


    if __name__ == '__main__':
        train()
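To see how the long options in parse_from_command_line map onto values without running a full job, the parsing can be exercised on its own. The sketch below is a standalone illustration that uses the same option names as the sample script. The function parse_job_args and the argument values are hypothetical, and the exact command line that the compute node issues is not shown in this guide.

```python
import getopt

# Same long options as the sample script; a trailing "=" means a value is required.
LONG_OPTIONS = ['round=', 'epoch=', 'model_file=', 'eval_only', 'dataset_path=',
                'weights_file=', 'output=', 'param=', 'job_id=']


def parse_job_args(argv):
    """Parse training-script arguments into a dict (hypothetical helper)."""
    params = {'round': 0, 'epoch': 0, 'eval_only': False}
    opts, _args = getopt.getopt(argv, 'hn:w:', LONG_OPTIONS)
    for key, value in opts:
        name = key.lstrip('-')
        if name in ('round', 'epoch'):
            params[name] = int(value)   # numeric options
        elif name == 'eval_only':
            params[name] = True         # flag without a value
        else:
            params[name] = value        # paths and free-form strings
    return params


if __name__ == '__main__':
    # Hypothetical arguments, for illustration only.
    demo_argv = ['--job_id', 'demo-job', '--round', '1', '--epoch', '5',
                 '--model_file', 'model.h5', '--dataset_path', 'dataset/iris1.csv']
    print(parse_job_args(demo_argv))
```

Passing --eval_only with no value switches the script to evaluation only, which mirrors the branch in train() that skips model.fit.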