Release 0.9.1
Fixed
- Fix a bug that caused pytorchjob to fail when running with RDMA
- Fix incorrect display of GPU core resources on nodes
- Fix a bug that added the evaluator and tensorboard to the pod group
Changed
- Refactor installation
- Change restful-serving to http-serving for deployment services
- Optimize the operators so that Completed jobs are not added to the queue
Added
- Adapt modeljob to support helm3
- Cron workload supports custom labels
- Java SDK supports submitting training jobs with --label
- Add resource limits for tfjob
- Add subpathexpr for job