A PyTorch framework for training transformer language models, with support for Mixture of Experts (MoE) and Mixture of Depths (MoD) architectures and DeepSpeed integration. It implements models from 70M to 300B parameters and provides automatic dataset processing, distributed training, and memory management.
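
To give a feel for what the parameter range means once MoE is involved, here is a rough sketch that estimates total versus per-token active parameters for a decoder-only transformer. It is not this framework's API: `ModelShape`, `count_params`, and every field name are hypothetical, and the estimate assumes a standard block (four attention projections, a two-matrix feed-forward layer per expert, tied input and output embeddings) while ignoring biases and normalization weights. The example shapes are illustrative and are not taken from the repository's presets.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ModelShape:
    """Hypothetical shape description; field names are illustrative only."""
    n_layers: int
    d_model: int
    vocab_size: int
    n_experts: int = 1   # 1 means a dense (non-MoE) feed-forward block
    top_k: int = 1       # experts activated per token
    ffn_mult: int = 4    # feed-forward hidden size = ffn_mult * d_model


def count_params(shape: ModelShape) -> Tuple[int, int]:
    """Return (total, active-per-token) estimates, ignoring biases and norms."""
    d = shape.d_model
    attn = 4 * d * d                      # Q, K, V and output projections
    ffn = 2 * d * (shape.ffn_mult * d)    # up and down projections of one expert
    # A linear router is only needed when there is more than one expert.
    router = d * shape.n_experts if shape.n_experts > 1 else 0
    embed = shape.vocab_size * d          # assumed tied with the output head
    used_experts = min(shape.top_k, shape.n_experts)
    total = embed + shape.n_layers * (attn + shape.n_experts * ffn + router)
    active = embed + shape.n_layers * (attn + used_experts * ffn + router)
    return total, active


# Illustrative shapes only (assumed, not the framework's configurations).
dense = ModelShape(n_layers=12, d_model=768, vocab_size=50257)
moe = ModelShape(n_layers=12, d_model=768, vocab_size=50257, n_experts=8, top_k=2)

dense_total, dense_active = count_params(dense)
moe_total, moe_active = count_params(moe)
print(f"dense: total={dense_total:,} active={dense_active:,}")
print(f"moe:   total={moe_total:,} active={moe_active:,}")
```

Under these assumptions, a dense model uses all of its parameters for every token, while an MoE model stores far more parameters than it activates per token. That gap is why total parameter count alone is a poor guide to per-token compute when comparing MoE and dense configurations.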