Terjemahan disediakan oleh mesin penerjemah. Jika konten terjemahan yang diberikan bertentangan dengan versi bahasa Inggris aslinya, utamakan versi bahasa Inggris.
catatan
Untuk informasi selengkapnya, lihat .
kubectl get pods -n kubeflow
NAME READY STATUS RESTARTS AGE training-operator-658c68d697-46zmn 1/1 Running 0 90s
kubectl apply -f/path/to/training_job.yaml
kubectl get -o yamltraining-job-nkubeflow
kubectl delete -f/path/to/training_job.yaml
pytorchjob.kubeflow.org "training-job" deleted
sagemaker.amazonaws.com/node-health-status: Schedulable
apiVersion: "kubeflow.org/v1" kind: PyTorchJob metadata: name: pytorch-simple namespace: kubeflowannotations: { // config for job auto resume sagemaker.amazonaws.com/enable-job-auto-resume: "true" sagemaker.amazonaws.com/job-max-retry-count: "2" }spec: pytorchReplicaSpecs: ...... Worker: replicas: 10restartPolicy: OnFailuretemplate: spec: nodeSelector: sagemaker.amazonaws.com/node-health-status: Schedulable
kubectl describe pytorchjob -n kubeflow<job-name>
Start Time: 2024-07-11T05:53:10Z Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal SuccessfulCreateService 9m45s pytorchjob-controller Created service: pt-job-1-worker-0 Normal SuccessfulCreateService 9m45s pytorchjob-controller Created service: pt-job-1-worker-1 Normal SuccessfulCreateService 9m45s pytorchjob-controller Created service: pt-job-1-master-0 Warning PyTorchJobRestarting 7m59s pytorchjob-controller PyTorchJob pt-job-1 is restarting because 1 Master replica(s) failed. Normal SuccessfulCreatePod 7m58s (x2 over 9m45s) pytorchjob-controller Created pod: pt-job-1-worker-0 Normal SuccessfulCreatePod 7m58s (x2 over 9m45s) pytorchjob-controller Created pod: pt-job-1-worker-1 Normal SuccessfulCreatePod 7m58s (x2 over 9m45s) pytorchjob-controller Created pod: pt-job-1-master-0 Warning PyTorchJobRestarting 7m58s pytorchjob-controller PyTorchJob pt-job-1 is restarting because 1 Worker replica(s) failed.
Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal SuccessfulCreatePod 19m pytorchjob-controller Created pod: pt-job-2-worker-0 Normal SuccessfulCreateService 19m pytorchjob-controller Created service: pt-job-2-worker-0 Normal SuccessfulCreatePod 19m pytorchjob-controller Created pod: pt-job-2-master-0 Normal SuccessfulCreateService 19m pytorchjob-controller Created service: pt-job-2-master-0 Normal SuccessfulCreatePod 4m48s pytorchjob-controller Created pod: pt-job-2-worker-0 Normal SuccessfulCreatePod 4m48s pytorchjob-controller Created pod: pt-job-2-master-0