AI Development Platform ModelArts - Viewing Notebook Instance Events
Viewing Notebook Instance Events
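Throughout the lifecycle of a Notebook instance, key operations such as creating, starting, stopping, and changing the flavor of the instance, together with the instance's running status, are recorded in the background. You can view these events on the Notebook instance details page to see the details of the instance's running or abnormal status. On the right side of the event list, you can refresh the events manually or set them to refresh automatically every 30 seconds, 1 minute, or 5 minutes.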
Event name | Description | Severity |
---|---|---|
Scheduled | The instance has been scheduled. | Minor |
PullingImage | The image is being pulled. | Minor |
PulledImage | The image has been pulled. | Minor |
NotebookHealthy | The instance is running and healthy. | Major |
CreateNotebookFailed | Failed to create the instance. | Critical |
PullImageFailed | Failed to pull the image. | Critical |
FailedCreate | Failed to create notebook container. Please contact SRE to check node {node_name} | Critical |
CreateContainerError | Failed to create container. Please contact SRE to check node {node_name} | Critical |
FailedAttachVolume | Failed to attach volume. Please contact SRE to check node {node_name} | Major |
MountVolumeFailed | Mount volume failed; Check whether the DEW secret is correct if the instance cannot change to running in five minutes | Critical |
MountVolumeFailed | Mount volume failed; Check if vpc of sfs-turbo is interconnected if the instance cannot change to running in five minutes | Critical |
MountVolumeFailed | Mount volume failed; Please contact SRE to check node {node_name} if the instance cannot change to running in five minutes | Critical |
Event name | Description | Severity |
---|---|---|
EmptyDirExceeded | Usage of empty-dir volume exceeds its limit. A new container will be scheduled and created automatically soon. | Critical |
NodeResourcePressure | Insufficient node resources. A new container will be scheduled and created automatically soon. | Critical |
EphemeralStorageExceeded | Local ephemeral storage exceeds its limit. A new container will be scheduled and created automatically soon. | Critical |
FailedToStartContainer | Failed to start container. Please contact SRE to check node {node_name} | Critical |
Scheduled | The instance has been scheduled. | Minor |
PullingImage | The image is being pulled. | Minor |
PulledImage | The image has been pulled. | Minor |
NotebookHealthy | The instance is running and healthy. | Major |
RunHookScript | The custom script is being run. | Minor |
StartNotebookFailed | Failed to start the instance. | Critical |
PullImageFailed | Failed to pull the image. | Critical |
CreateKernelFailed | Failed to create the Jupyter kernel because the conda command is unavailable. (The jupyter launcher page does not contain the kernel due to conda environment issues, please ensure that {conda_env} is available and the command: {conda_cmdt} env list can be run properly) | Major |
CreateKernelFailed | Failed to create the Jupyter kernel because of a permission issue. (The jupyter launcher page does not contain the kernel due to permission issues, please ensure that the uid {ma_uid} have write permissions to {conda_path}) | Major |
ConfigurationError | Failed to add the ModelArts SDK and ma-cli paths to the conda env because the conda command is unavailable. (The modelarts sdk and cli is unavailable in the conda envs due to conda environment issues, please ensure that the {conda_env} is available and the command: {conda_cmd} env list can be run properly) | Major |
ConfigurationError | Failed to add the ModelArts SDK and ma-cli paths to the conda env because of a permission issue. (The modelarts sdk and cli is unavailable in the conda env due to permission issues,please ensure that the uid {ma_uid} have write permissions to {conda_path}) | Major |
FailedToPullImageReason | Failed to pull image. Please make sure the image exists in SWR repo, otherwise contact SRE to check node {node_name} | Major |
FailedToPullImageReason | Failed to pull image. Please contact SRE to check node {node_name} | |

Note: {node_name} is a variable that stands for the node name. It usually takes the form of an IP address, for example 192.168.1.1.
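
Several of the descriptions above contain the phrase `check node {node_name}`. As a minimal sketch, assuming an event description has been copied from the console as plain text, the node name could be extracted with a regular expression. The function name `extract_node_name` and the sample message are illustrative assumptions and are not part of ModelArts.

```python
import re
from typing import Optional

# Matches the "check node <name>" phrase used in the event descriptions.
# Any run of non-space characters after "check node" is taken as the name.
_NODE_PATTERN = re.compile(r"check node (\S+)")


def extract_node_name(description: str) -> Optional[str]:
    """Return the node name found in an event description, or None."""
    match = _NODE_PATTERN.search(description)
    return match.group(1) if match else None


# Hypothetical sample: the message template with {node_name} filled in.
sample = "Failed to pull image. Please contact SRE to check node 192.168.1.1"
print(extract_node_name(sample))
```

Having the node name at hand may be useful when reporting the problem, since the messages ask for that node to be checked.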
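Event name | Description | Severity |
---|---|---|
StopNotebook | The instance is stopped. | Major |
StopNotebookResourceIdle | The instance is about to be stopped automatically, or has been stopped automatically, because its resources are idle. | Major |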
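Event name | Description | Severity |
---|---|---|
UpdateName | The instance name is updated. | Minor |
UpdateDescription | The instance description is updated. | Minor |
UpdateFlavor | The instance flavor is updated. | Major |
UpdateImage | The instance image is updated. | Major |
UpdateStorageSize | The instance storage is being expanded. (User %s is updating storage size from %sGB to %sGB) | Major |
UpdateStorageSize | The instance storage expansion is complete. (User %s updated storage size successfully) | Major |
UpdateKeyPair | The instance key pair is configured. (User %s updated the instance keypair to "{%s}") | Major |
UpdateKeyPair | The instance key pair is updated. (User %s updated the instance keypair from %s to %s) | Major |
UpdateWhitelist | The instance access whitelist is updated. | Major |
UpdateHook | The custom script is updated. | Major |
UpdateStorageSizeFailed | Storage expansion failed because the resource is sold out. (The EVS disk is sold out) | Critical |
UpdateStorageSizeFailed | Storage expansion failed because of an internal error. (The EVS disk size updated failed. Operations and maintenance personnel are handling the problem) | Critical |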
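
The `%s` placeholders in the UpdateStorageSize message are filled in with the user name and the old and new sizes. The following is a rough sketch, assuming both sizes appear as whole numbers of GB, of how such a message could be parsed back into its parts. The names `StorageResize` and `parse_storage_resize`, and the sample user `alice`, are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class StorageResize:
    user: str
    old_gb: int
    new_gb: int


# Mirrors the template "User %s is updating storage size from %sGB to %sGB".
# Assumption: both sizes are integers.
_RESIZE_PATTERN = re.compile(
    r"User (.+?) is updating storage size from (\d+)GB to (\d+)GB"
)


def parse_storage_resize(description: str) -> Optional[StorageResize]:
    """Parse an UpdateStorageSize description, or return None if it does not match."""
    match = _RESIZE_PATTERN.search(description)
    if match is None:
        return None
    return StorageResize(
        user=match.group(1),
        old_gb=int(match.group(2)),
        new_gb=int(match.group(3)),
    )


# Hypothetical filled-in message.
print(parse_storage_resize("User alice is updating storage size from 50GB to 100GB"))
```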
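Event name | Description | Severity |
---|---|---|
SaveImage | The image is saved. | Major |
SavedImageFailed | Failed to save the image because of processes in the D state. (There are processes in 'D' status, please check process status using 'ps -aux' and kill all the 'D' status processes) | Critical |
SavedImageFailed | Failed to save the image because of the image size. (Container size %dG is greater than threshold %dG) | Critical |
SavedImageFailed | Failed to save the image because of the layer limit. (Too many layers in your image) | Critical |
SavedImageFailed | Failed to save the image because the task timed out. (Operations personnel are handling the problem) | Critical |
SavedImageFailed | Failed to save the image because of an SWR fault. (Failed to save the image because the SWR service is faulty) | Critical |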
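
The first SavedImageFailed message asks you to check process status with `ps -aux` and look for processes in the 'D' state. As a minimal sketch, assuming the usual `ps aux` column layout on Linux (USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND), the following function picks out such processes from captured output. The function name and the sample output are illustrative assumptions.

```python
from typing import List, Tuple


def find_d_state_processes(ps_output: str) -> List[Tuple[int, str]]:
    """Return (pid, command) pairs for processes whose STAT starts with 'D'."""
    results = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        # Split into at most 11 fields so the command keeps its spaces.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue  # blank or malformed line
        stat = fields[7]
        if stat.startswith("D"):
            results.append((int(fields[1]), fields[10]))
    return results


# Hypothetical captured output; real output will differ.
sample_output = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
ma-user      1  0.0  0.1  10000  2000 ?        Ss   10:00   0:00 /bin/bash
ma-user     42  0.0  0.5  50000  9000 ?        D    10:05   0:01 dd if=/dev/zero of=/tmp/x
"""
print(find_d_state_processes(sample_output))
```

In a terminal inside the instance, the text could instead come from `subprocess.run(["ps", "-aux"], capture_output=True, text=True).stdout`.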
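Event name | Description | Severity |
---|---|---|
NotebookUnhealthy | The instance is unhealthy. | Critical |
OutOfMemory | The instance was killed because it ran out of memory (OOM). | Critical |
JupyterProcessKilled | The Jupyter process was killed. | Critical |
CacheVolumeExceedQuota | The total size of files in the /cache directory exceeds the maximum limit. | Critical |
NotebookHealthy | The instance has recovered from the unhealthy state to the healthy state. | Major |
EVSSoldOut | EVS storage is sold out. | Critical |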
Event name | Description | Severity |
---|---|---|
DynamicMountStorage | OBS storage is mounted. | Major |
DynamicUnmountStorage | OBS storage is unmounted. | Major |
Event name | Description | Severity |
---|---|---|
RefreshCredentialsFailed | User authentication failed. | Critical |
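
When an instance produces many events, it can help to look at the most severe ones first. The sketch below groups events by the three severity levels used in the tables above, written here as Minor, Major, and Critical. It assumes the events have already been collected as (event name, severity) pairs, for example by copying them from the details page. The function name and the sample list are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Severity labels, most severe first (assumed English labels for the
# three levels used in the event tables).
SEVERITY_ORDER = ["Critical", "Major", "Minor"]


def group_by_severity(events: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group event names by severity, most severe level first."""
    grouped = defaultdict(list)
    for name, severity in events:
        grouped[severity].append(name)
    # Keep only the levels that actually occur, in SEVERITY_ORDER.
    return {sev: grouped[sev] for sev in SEVERITY_ORDER if sev in grouped}


# Hypothetical event sequence built from names in the tables.
sample_events = [
    ("Scheduled", "Minor"),
    ("PullingImage", "Minor"),
    ("PullImageFailed", "Critical"),
    ("NotebookHealthy", "Major"),
    ("OutOfMemory", "Critical"),
]
for severity, names in group_by_severity(sample_events).items():
    print(severity, names)
```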