Fail Fast & Recover Faster: Infrastructure Resilience of Multi-Node LLM Training – Filipp Fisin, Nebius AI

Data Innovation Summit 2024

Session Outline

Training an LLM in a multi-node setup is complex and expensive. Training failures can’t be eliminated, but downtime can be reduced. In this talk at the Data Innovation Summit 2024, Filipp Fisin from Nebius AI gives an overview of techniques for more resilient training that the Nebius AI team has found useful in its JAX-based multi-node training setup.

Key Takeaways

  • Multi-node training orchestration in Kubernetes via Argo with automatic failure recovery
  • A special type of Kubernetes health check to detect whether a training process is stuck
  • Techniques to efficiently save and load terabyte-scale checkpoints
  • XLA compilation cache
  • GPU node monitoring and auto-cordoning