Comments (20)
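Thanks for the reply, and sorry for the late response @gshilei
Storing serviceStatuses separately, the same way as podStatuses, makes later extension easier. Adding the custom service time parameter in the endpoints handling and letting KusciaTask pick up the change through List-Watch keeps the two sides decoupled. The improvements you pointed out make sense and the approach is feasible, so I am going to start development along these lines.
I plan to finish development this week and submit a PR next week after verification.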
from kuscia.
hdkoutann Give it to me
from kuscia.
Hi @hdkoutann, you can download the dependent images as follows:
docker pull secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia-deps:0.1.0b0
docker tag secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia-deps:0.1.0b0 docker.io/secretflow/kuscia-deps:0.1.0b0
docker pull secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia-envoy:0.2.0b0
docker tag secretflow-registry.cn-hangzhou.cr.aliyuncs.com/secretflow/kuscia-envoy:0.2.0b0 docker.io/secretflow/kuscia-envoy:0.2.0b0
from kuscia.
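- Overview
- Pod resource timing
Two timestamps are covered: when the node creates the pod and when the pod starts successfully.
The TaskStatus definition of KusciaTask already contains podStatuses, which lists every pod that has to be created for the parties of the task. Add two fields to podStatuses in the current version:
- podCreateTime (pod creation time)
- podStartupTime (pod startup time)
podStatuses:
  alice/secretflow-task-psi-0:
    podCreateTime: "2023-06-26T03:46:58Z"
    podStartupTime: "2023-06-26T03:46:58Z"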
- Service (Endpoints) resource timing
The current KusciaTask CRD definition has no service-related content, so fields need to be added.
Add a new envoyServices field under podStatuses, together with a processing timestamp, decorateTime:
podStatuses:
  alice/secretflow-task-psi-0:
    podCreateTime: "2023-06-26T03:46:58Z"
    podStartupTime: "2023-06-26T03:46:58Z"
    envoyServices:
      - serviceName: task-template-psi-0-global
        decorateTime: "2023-06-26T03:46:58Z"
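- Detailed design
- Adding the pod creation and startup times
The KusciaTask handling logic watches pod status and updates accordingly. Add a check on the pod status in pkg\controllers\kusciatask\handler\running_handler.go:
When the pod is in the Pending state, update the podCreateTime field in podStatuses.
When the pod is in the Running state, update the podStartupTime field in podStatuses.
This should also cover the case where a pod crashes and is restarted automatically.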
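A rough sketch of the phase check proposed here, written in Python with plain dicts rather than the actual Go types in running_handler.go. The function name `record_pod_times` and the extra `phase` bookkeeping key are hypothetical, added only so that repeated reconciles do not keep overwriting a timestamp; the real handler may track transitions differently.

```python
def record_pod_times(pod_statuses, pod_key, phase, now):
    """Update the proposed time fields when the observed pod phase changes.

    pod_statuses: dict shaped like the podStatuses field in this proposal
    pod_key:      "<namespace>/<pod name>", e.g. "alice/secretflow-task-psi-0"
    phase:        pod phase string as reported by Kubernetes ("Pending", "Running", ...)
    now:          RFC 3339 timestamp string supplied by the caller
    """
    entry = pod_statuses.setdefault(pod_key, {})
    if entry.get("phase") == phase:
        # No transition since the last reconcile: keep the recorded times.
        return entry
    entry["phase"] = phase  # hypothetical bookkeeping key, not part of the proposal
    if phase == "Pending":
        entry["podCreateTime"] = now
    elif phase == "Running":
        entry["podStartupTime"] = now
    return entry


if __name__ == "__main__":
    statuses = {}
    key = "alice/secretflow-task-psi-0"
    record_pod_times(statuses, key, "Pending", "2023-06-26T03:46:58Z")
    record_pod_times(statuses, key, "Running", "2023-06-26T03:47:05Z")
    print(statuses[key])
```

Because the times are rewritten on every phase transition, a pod that goes back to Pending after a crash and then to Running again would get fresh values, which is the restart case mentioned above.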
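- Adding the envoy completion time
The envoy post-processing logic lives in pkg\gateway\controller\endpoints.go, which creates the envoy node after observing a Service change event.
The Kuscia Service is available from the Service change event. The pod can be found through the Service's OwnerReferences, and the KusciaTask through the pod's OwnerReferences. The KusciaTask obtained this way is then used to update the decorateTime field of envoyServices.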
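The owner lookup described here could look roughly like the following. This is only an illustration in Python over dict-shaped objects: `find_owner`, `task_for_service`, `set_decorate_time` and the `pods_by_name` lookup table are hypothetical names, and it assumes the Service really carries an owner reference of kind Pod, as the proposal states.

```python
def find_owner(obj, kind):
    """Return the name of the first ownerReference of the given kind, or None."""
    refs = obj.get("metadata", {}).get("ownerReferences") or []
    for ref in refs:
        if ref.get("kind") == kind:
            return ref.get("name")
    return None


def task_for_service(service, pods_by_name):
    """Walk Service -> Pod -> KusciaTask through ownerReferences.

    Returns (pod_name, task_name), with None for any link that is missing.
    """
    pod_name = find_owner(service, "Pod")
    pod = pods_by_name.get(pod_name) if pod_name else None
    if pod is None:
        return pod_name, None
    return pod_name, find_owner(pod, "KusciaTask")


def set_decorate_time(pod_status, service_name, now):
    """Record decorateTime for one service in the proposed envoyServices list."""
    services = pod_status.setdefault("envoyServices", [])
    for svc in services:
        if svc.get("serviceName") == service_name:
            svc["decorateTime"] = now
            return
    services.append({"serviceName": service_name, "decorateTime": now})
```

The lookup returns None as soon as one link in the chain is missing, so a Service that was not created for a KusciaTask is simply skipped.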
from kuscia.
Would this design and implementation be feasible?
from kuscia.
@hdkoutann Thanks for your active participation. Here are some suggestions on the design above:
- Pod resource timing
  - podStartupTime can be split at a finer granularity into scheduledTime/readyTime. The podStatuses structure would be:
    podStatuses:
      alice/secretflow-task-psi-0:
        ...
        createdTime: "2023-06-26T03:46:58Z" -> corresponds to the pod's metadata.creationTimestamp
        scheduledTime: "2023-06-26T03:47:01Z" -> corresponds to the pod's status.conditions[PodScheduled].lastTransitionTime
        readyTime: "2023-06-26T03:47:05Z" -> corresponds to the pod's status.conditions[Ready].lastTransitionTime
- Service (Endpoints) resource timing
  - Add a new serviceStatuses field to hold the service timing information
  - Each service entry has two fields: createdTime and readyTime
  - readyTime: when the gateway handles the watched service and endpoints, it ends by calling AddEnvoyCluster. Once the envoy cluster has been added successfully, the gateway adds the kuscia.secretflow/ready-time annotation to the corresponding service. When the KusciaTask controller sees the service change through the List-Watch mechanism, it updates the serviceStatuses field in the KusciaTask.
    podStatuses:
      ...
    serviceStatuses:
      alice/secretflow-task-xxx-single-psi-0-spu:
        createdTime: "2023-06-26T03:46:58Z" -> corresponds to the service's metadata.creationTimestamp
        readyTime: "2023-06-26T03:46:59Z" -> corresponds to the service's annotation kuscia.secretflow/ready-time: 2023-06-26T03:46:59Z
      alice/secretflow-task-xxx-single-psi-0-global:
        createdTime: "2023-06-26T03:46:58Z"
        readyTime: "2023-06-26T03:46:59Z"
      bob/secretflow-task-xxx-single-psi-0-spu:
        createdTime: "2023-06-26T03:46:58Z"
        readyTime: "2023-06-26T03:46:59Z"
      bob/secretflow-task-xxx-single-psi-0-global:
        createdTime: "2023-06-26T03:46:58Z"
        readyTime: "2023-06-26T03:46:59Z"
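To make the serviceStatuses mapping concrete, here is a small Python sketch over dict-shaped Service objects. The annotation key comes from the comment above; the helper names and the `<namespace>/<name>` key format are assumptions read off the example keys, not something the controller code is known to do.

```python
READY_TIME_ANNOTATION = "kuscia.secretflow/ready-time"


def service_status(service):
    """Derive one serviceStatuses entry from a Service object (as a dict)."""
    meta = service.get("metadata", {})
    status = {"createdTime": meta.get("creationTimestamp")}
    # The gateway is expected to set this annotation after AddEnvoyCluster succeeds.
    ready = (meta.get("annotations") or {}).get(READY_TIME_ANNOTATION)
    if ready:
        status["readyTime"] = ready
    return status


def build_service_statuses(services):
    """Key each entry by "<namespace>/<name>" (assumed from the example above)."""
    statuses = {}
    for svc in services:
        meta = svc.get("metadata", {})
        key = f"{meta.get('namespace')}/{meta.get('name')}"
        statuses[key] = service_status(svc)
    return statuses
```

A service that the gateway has not processed yet has no annotation, so its entry carries only createdTime until the next watch event arrives.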
from kuscia.
make image is extremely slow on my Mac and basically never finishes. Is there any way to speed it up? @gshilei
from kuscia.
Could you check which step of the make process is slow?
from kuscia.
Downloading the kuscia-envoy and kuscia-deps dependencies is very slow.
from kuscia.
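I set up a CentOS virtual machine and it does not work there either. Is the machine underpowered, or is there a problem with the network configuration?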
from kuscia.
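From the above, the two base images that the Kuscia image build depends on, secretflow/kuscia-envoy:0.2.0b0 and secretflow/kuscia-deps:0.1.0b0, are slow to download. You could try configuring a registry mirror (https://gist.github.com/y0ngb1n/7e8f16af3242c7815e7ca2f0833d3ea6) and see whether it helps.
Alternatively, pull the two images to your local machine manually before building the Kuscia image.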
from kuscia.
https://gist.github.com/y0ngb1n/7e8f16af3242c7815e7ca2f0833d3ea6 This link does not open for me.
from kuscia.
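Hi @hdkoutann, sorry for the late reply. The time fields under podStatuses need one more adjustment so that the timestamp of each stage is shown at a finer granularity. My suggestion:
The four time fields below will all be defined up front. Except for scheduleTime, which is left empty in this iteration, each field is filled in from the actual value.
podStatuses:
  alice/secretflow-task-psi-0:
    ...
    createTime: "2023-06-26T03:46:58Z" -> 1. pod creation time, from pod.metadata.creationTimestamp
    scheduleTime: "2023-06-26T03:40:00Z" -> 2. pod scheduling time, not filled in this iteration
    startTime: "2023-06-26T03:47:02Z" -> 3. time the pod was accepted by the agent, from pod.status.startTime
    readyTime: "2023-06-26T03:47:05Z" -> 4. time the pod was brought up by the agent, from pod.status.conditions[Ready].lastTransitionTime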
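For reference, the field mapping above can be expressed as a short Python sketch over a dict-shaped Pod object. `condition_time` and `pod_times` are illustrative names only, and taking readyTime only when the Ready condition has status "True" is an assumption on my side; the comment above just names the condition.

```python
def condition_time(pod, cond_type):
    """Return lastTransitionTime of a pod condition that is currently True."""
    conditions = pod.get("status", {}).get("conditions") or []
    for cond in conditions:
        if cond.get("type") == cond_type and cond.get("status") == "True":
            return cond.get("lastTransitionTime")
    return None


def pod_times(pod):
    """Collect the four proposed podStatuses time fields from one pod."""
    return {
        "createTime": pod.get("metadata", {}).get("creationTimestamp"),
        "scheduleTime": None,  # defined but left empty in this iteration
        "startTime": pod.get("status", {}).get("startTime"),
        "readyTime": condition_time(pod, "Ready"),
    }
```

Fields that are not available yet come back as None, so a pod that has been created but not yet accepted by the agent would only have createTime set.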
from kuscia.
Understood, I will adjust it.
from kuscia.
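Adjusted. The createTime and scheduleTime fields are named differently from the previously defined createdTime and scheduledTime. Do those need to be renamed as well so everything is consistent?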
from kuscia.
These two field names need to be changed:
- createdTime becomes createTime
- scheduledTime becomes scheduleTime
from kuscia.
Done. I also renamed the service's createTime so the style stays consistent @gshilei
from kuscia.
Verified OK @gshilei
from kuscia.