Run a TrainJob
This page shows how to leverage Kueue’s scheduling and resource management capabilities when running Kubeflow Trainer TrainJobs.
This guide is for batch users who have a basic understanding of Kueue. For more information, see Kueue’s overview.
Overview
Kubeflow Trainer v2 introduces the TrainJob API that works seamlessly with Kueue for batch scheduling and resource management. TrainJobs can be configured to use either:
- ClusterTrainingRuntime: Cluster-scoped training runtimes that can be used across all namespaces
- TrainingRuntime: Namespace-scoped training runtimes that are only available within a specific namespace
Kueue manages TrainJobs by scheduling their underlying JobSets according to available quota and priority.
Before you begin
- Check administer cluster quotas for details on the initial cluster setup.
- Install Kubeflow Trainer v2. Check the Trainer installation guide.
  Note: The minimum required Trainer version is v2.0.0.
- Enable the TrainJob integration in Kueue. You can modify the Kueue configuration of an installed release to include TrainJob as an allowed workload; a sketch follows this list.
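One way to enable the integration is to add the TrainJob framework to the integrations section of the Kueue Configuration (for example, in the kueue-manager-config ConfigMap). A minimal sketch, assuming the integration is named trainer.kubeflow.org/trainjob in your Kueue release:

apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
integrations:
  frameworks:
    - "batch/job"
    - "jobset.x-k8s.io/jobset"
    - "trainer.kubeflow.org/trainjob" # assumed integration name; verify against your Kueue version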
TrainJob definition
a. Queue selection
The target local queue should be specified in the metadata.labels section of the TrainJob configuration:
metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue
b. Suspend field
By default, Kueue will set suspend to true via webhook and unsuspend it when the TrainJob is admitted:
spec:
  suspend: true
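Once the TrainJob is admitted, Kueue sets suspend back to false and the underlying JobSet starts. A quick way to observe the flag, using placeholder names for your TrainJob and namespace:

kubectl -n <namespace> get trainjob <name> -o jsonpath='{.spec.suspend}'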
Using ClusterTrainingRuntime
ClusterTrainingRuntimes are cluster-scoped resources that define training configurations accessible across all namespaces. They are typically created by platform administrators.
Example: PyTorch Distributed Training with ClusterTrainingRuntime
First, create a ClusterTrainingRuntime for PyTorch distributed training:
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed
  labels:
    trainer.kubeflow.org/framework: torch
spec:
  mlPolicy:
    numNodes: 1
    torch:
      numProcPerNode: auto
  template:
    spec:
      replicatedJobs:
        - name: node
          template:
            metadata:
              labels:
                trainer.kubeflow.org/trainjob-ancestor-step: trainer
            spec:
              template:
                spec:
                  containers:
                    - name: trainer
                      image: pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime
Now, create a TrainJob that references this ClusterTrainingRuntime and will be scheduled by Kueue:
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: pytorch-distributed
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  runtimeRef:
    name: torch-distributed
    kind: ClusterTrainingRuntime
  trainer:
    image: docker.io/kubeflow/pytorch-dist-mnist-test:v1.0
    command:
      - torchrun
      - /workspace/examples/mnist/mnist.py
    numNodes: 2
    resourcesPerNode:
      requests:
        cpu: "4"
        memory: "8Gi"
        nvidia.com/gpu: "1"
Key Points:
- The kueue.x-k8s.io/queue-name label assigns this TrainJob to the user-queue LocalQueue
- The runtimeRef points to the ClusterTrainingRuntime named torch-distributed
- Kueue will manage the lifecycle and admission of this TrainJob based on available quota
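To try the example end to end, apply both manifests and check that Kueue created and admitted a Workload for the TrainJob. A sketch, assuming you saved the manifests as cluster-training-runtime.yaml and trainjob.yaml:

kubectl apply -f cluster-training-runtime.yaml
kubectl apply -f trainjob.yaml
# The Workload created for the TrainJob shows whether it has been admitted
kubectl -n default get workloads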
Using TrainingRuntime (Namespace-scoped)
TrainingRuntimes are namespace-scoped resources that provide more granular control per namespace. They are useful when different teams need customized training configurations.
Example: Custom PyTorch Training with TrainingRuntime
Create a namespace-scoped TrainingRuntime:
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainingRuntime
metadata:
  name: torch-custom
  namespace: team-a
spec:
  mlPolicy:
    numNodes: 1
    torch:
      numProcPerNode: auto
  template:
    spec:
      replicatedJobs:
        - name: node
          template:
            metadata:
              labels:
                trainer.kubeflow.org/trainjob-ancestor-step: trainer
            spec:
              template:
                spec:
                  containers:
                    - name: trainer
                      image: pytorch/pytorch:2.7.1-cuda12.8-cudnn9-runtime
                      env:
                        - name: CUSTOM_ENV
                          value: "team-a-value"
Create a TrainJob that uses this namespace-scoped runtime:
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: pytorch-custom
  namespace: team-a
  labels:
    kueue.x-k8s.io/queue-name: team-a-queue
spec:
  runtimeRef:
    name: torch-custom
    kind: TrainingRuntime
    apiGroup: trainer.kubeflow.org
  trainer:
    image: docker.io/team-a/custom-training:latest
    numNodes: 1
    resourcesPerNode:
      requests:
        cpu: "2"
        memory: "4Gi"
Key Points:
- The TrainingRuntime is created in the same namespace as the TrainJob (team-a)
- The runtimeRef specifies kind: TrainingRuntime to use the namespace-scoped runtime
- Each namespace can have its own customized runtimes with different configurations
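For this TrainJob to be admitted, a LocalQueue named team-a-queue must exist in the team-a namespace. A minimal sketch, assuming a ClusterQueue named cluster-queue has already been created by an administrator:

apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-queue
  namespace: team-a
spec:
  clusterQueue: cluster-queue # assumed ClusterQueue name; adjust to your cluster setup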
Using Workload Priority
To prioritize TrainJobs, use Kueue’s workload priority classes. See Run job with WorkloadPriority for details on configuring and using workload priority classes.
TrainJobs use the same priority mechanism as other Kueue workloads via the kueue.x-k8s.io/priority-class label.
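For example, assuming an administrator has created a WorkloadPriorityClass named sample-priority, add the label next to the queue name in the TrainJob metadata:

metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue
    kueue.x-k8s.io/priority-class: sample-priority # assumed WorkloadPriorityClass name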
LLM Fine-Tuning with Kueue
Kubeflow Trainer v2 supports LLM fine-tuning with TorchTune and DeepSpeed. For comprehensive examples, see:
- Fine-tune Llama-3.2-1B with Alpaca Dataset
- Fine-tune Qwen2.5-1.5B with Alpaca Dataset
- T5 Fine-Tuning with DeepSpeed
To use Kueue scheduling with these examples, add the queue label to your TrainJob:
metadata:
  labels:
    kueue.x-k8s.io/queue-name: gpu-queue # Add this label for Kueue scheduling
spec:
  runtimeRef:
    name: torchtune-llama3.2-1b
    kind: ClusterTrainingRuntime
  # ... rest of the TrainJob spec as shown in Kubeflow examples
Differences from Kubeflow Training Operator V1
Important
Kubeflow Trainer v2 introduces a new API that is not compatible with Training Operator v1 APIs (PyTorchJob, TFJob, etc.). The key differences are:
- Unified API: TrainJob replaces framework-specific CRDs like PyTorchJob, TFJob
- Runtime-based: Training configurations are defined in reusable Runtimes
- Built on JobSet: Uses Kubernetes JobSet as the underlying infrastructure
- Better integration: Native support for Kueue scheduling from the start
For migration guidance, refer to the Kubeflow Trainer documentation.
Best Practices
- Use ClusterTrainingRuntimes for common patterns: Create cluster-scoped runtimes for frequently used training configurations
- Use TrainingRuntimes for team-specific needs: Leverage namespace-scoped runtimes for customizations per team
- Set appropriate resource requests: Ensure your TrainJob resource requests match the ResourceFlavor in your ClusterQueue
- Monitor quota usage: Use kubectl get clusterqueue to track resource utilization (see the commands after this list)
- Use priority classes: Assign priorities to TrainJobs to ensure critical workloads are scheduled first
- Test with small configurations: Before scaling up, test your TrainJob configuration with minimal resources
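For quota monitoring, the following commands list admitted workload counts and show per-flavor usage in the ClusterQueue status; the ClusterQueue name cluster-queue is an assumption:

kubectl get clusterqueue
kubectl describe clusterqueue cluster-queue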
Additional Resources
- Kubeflow Trainer Documentation
- Kueue Concepts
- Run job with WorkloadPriority
- Monitor Pending Workloads
- Kubeflow Python SDK
Troubleshooting
For general troubleshooting guidance, see the Kueue troubleshooting guide.
For TrainJob-specific issues, verify that the referenced ClusterTrainingRuntime or TrainingRuntime exists:
kubectl get clustertrainingruntime
kubectl get trainingruntime -n <namespace>
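If the runtime exists but the TrainJob stays suspended, inspect the Workload object that Kueue created for it; its conditions explain why admission is still pending:

kubectl -n <namespace> get workloads
kubectl -n <namespace> describe workload <workload-name>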