Session Outline
Training an LLM in a multi-node setup is a complex and expensive process. Training failures can’t be eliminated, but the downtime they cause can be reduced. In this talk at the Data Innovation Summit 2024, Filipp Fisin from Nebius AI gives an overview of techniques for more resilient training that his team has found useful in their JAX-based multi-node training setup.
Key Takeaways
- Multi-node training orchestration in Kubernetes via Argo with automatic failure recovery
- A special kind of Kubernetes health check to detect when a training process is stuck (see the heartbeat sketch below)
- Techniques to efficiently save and load terabyte-scale checkpoints (see the checkpointing sketch below)
- The XLA compilation cache (see the configuration sketch below)
- GPU node monitoring and auto-cordoning (see the cordoning sketch below)
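A common way to implement such a "stuck training" health check (the outline doesn't show Nebius's exact mechanism) is to have the training loop touch a heartbeat file on every step and expose a small probe that fails once the heartbeat goes stale, so the kubelet restarts a hung worker. A minimal Python sketch; the file path and staleness threshold are purely illustrative:

```python
import sys
import time
from pathlib import Path

HEARTBEAT_FILE = Path("/tmp/train_heartbeat")  # hypothetical path
MAX_STALENESS_SEC = 600  # assume a healthy step never takes longer than 10 minutes


def touch_heartbeat() -> None:
    """Called from the training loop after every completed step."""
    HEARTBEAT_FILE.touch()


def liveness_probe() -> int:
    """Return 0 if the heartbeat is fresh, 1 if training appears stuck."""
    try:
        age = time.time() - HEARTBEAT_FILE.stat().st_mtime
    except FileNotFoundError:
        return 1
    return 0 if age < MAX_STALENESS_SEC else 1


if __name__ == "__main__":
    sys.exit(liveness_probe())
```

The probe would be wired into the pod spec as an `exec` liveness probe; after repeated failures the container is restarted and training resumes from the latest checkpoint.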
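Saving and restoring terabyte-scale state efficiently usually means asynchronous, sharded checkpointing, where each host writes only the array shards it owns and the training loop doesn't block on storage I/O. The talk outline doesn't name a library, so the Orbax-based sketch below is an assumption; the directory and save cadence are illustrative:

```python
import orbax.checkpoint as ocp

CKPT_DIR = "/mnt/checkpoints/run-001"  # illustrative shared-storage path

# With async checkpointing (the default in recent Orbax releases), save() returns
# quickly and each host writes its own shards in a background thread.
options = ocp.CheckpointManagerOptions(
    max_to_keep=3,            # keep a few recent checkpoints for recovery
    save_interval_steps=500,  # illustrative cadence
)
mngr = ocp.CheckpointManager(CKPT_DIR, options=options)


def maybe_save(step: int, state) -> None:
    # The manager itself skips steps that don't match save_interval_steps.
    mngr.save(step, args=ocp.args.StandardSave(state))


def restore_latest(abstract_state):
    # On restart, resume from the newest complete checkpoint if one exists.
    step = mngr.latest_step()
    if step is None:
        return None, 0
    state = mngr.restore(step, args=ocp.args.StandardRestore(abstract_state))
    return state, step
```

Before a clean shutdown, `mngr.wait_until_finished()` ensures the last asynchronous save has actually completed.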
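JAX can persist XLA compilation results to disk, so a restarted worker can skip the often lengthy recompilation of the training step. The flags below are JAX's persistent compilation cache options; the cache directory (ideally on storage shared between nodes) is illustrative:

```python
import jax

# Persist compiled XLA executables so a restarted worker reuses them
# instead of recompiling the training step from scratch.
jax.config.update("jax_compilation_cache_dir", "/mnt/shared/jax_cache")  # illustrative path

# Only cache programs whose compilation took a non-trivial amount of time.
jax.config.update("jax_persistent_cache_min_compile_time_secs", 1.0)
```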
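Auto-cordoning takes a node out of scheduling as soon as its GPUs start reporting problems, so a recovered job isn't placed back onto broken hardware. A heavily simplified sketch using the official Kubernetes Python client; the node condition checked here is hypothetical and would in practice come from a GPU health monitor (e.g. node-problem-detector or DCGM-based checks):

```python
from kubernetes import client, config

config.load_incluster_config()  # or load_kube_config() when run outside the cluster
v1 = client.CoreV1Api()

SUSPECT_CONDITION = "GpuXidError"  # hypothetical condition set by a GPU health monitor


def cordon_unhealthy_gpu_nodes() -> None:
    for node in v1.list_node().items:
        conditions = node.status.conditions or []
        unhealthy = any(
            c.type == SUSPECT_CONDITION and c.status == "True" for c in conditions
        )
        if unhealthy and not node.spec.unschedulable:
            # Mark the node unschedulable, equivalent to `kubectl cordon <node>`.
            v1.patch_node(node.metadata.name, {"spec": {"unschedulable": True}})
            print(f"cordoned {node.metadata.name}")


if __name__ == "__main__":
    cordon_unhealthy_gpu_nodes()
```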