Collaborative Compression for Large-Scale MoE Deployment on Edge
September 13 • 14:30 - 14:55
Location: Venue 6 - B01
The Mixture of Experts (MoE) architecture enables scaling large language models while keeping computation costs low. However, ultra-large MoE models with hundreds of billions of parameters require massive memory and storage, making edge deployment challenging. This presentation introduces a comprehensive compression framework that combines expert pruning, MoE-specific mixed-precision quantization, and activation optimization. By reducing both model weight size and activation memory usage, the framework achieves the first efficient deployment of a model as large as DeepSeek-V3 under a 128 GB memory constraint, outperforming uniform low-bit quantization methods.
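As a rough illustration of how expert pruning and per-expert mixed-precision quantization trade off against a fixed memory budget, the sketch below estimates total weight memory for an MoE model. All shapes, bit-widths, and pruning ratios are placeholder assumptions (loosely inspired by DeepSeek-V3's published scale), not the presenters' actual configuration or results.

```python
# Illustrative sketch only: estimate MoE weight memory under expert pruning
# plus per-expert mixed-precision quantization, and compare it with a uniform
# low-bit baseline against a 128 GB budget. All model numbers are assumptions.

from dataclasses import dataclass

@dataclass
class MoEConfig:
    num_layers: int = 58              # assumed number of MoE layers
    experts_per_layer: int = 256      # assumed routed experts per layer
    params_per_expert: float = 44e6   # assumed parameters per expert (rough)
    dense_params: float = 20e9        # assumed shared/dense parameters

def weight_memory_gb(cfg: MoEConfig,
                     keep_ratio: float,
                     expert_bits: list,
                     dense_bits: float = 8.0) -> float:
    """Weight memory in GB after pruning and per-expert quantization.

    keep_ratio  -- fraction of experts kept after pruning
    expert_bits -- bit-width assigned to each kept expert (mixed precision)
    dense_bits  -- bit-width for shared (non-expert) parameters
    """
    kept = int(cfg.experts_per_layer * keep_ratio)
    assert len(expert_bits) == kept, "one bit-width per kept expert"
    expert_bytes = sum(b / 8 * cfg.params_per_expert for b in expert_bits) * cfg.num_layers
    dense_bytes = dense_bits / 8 * cfg.dense_params
    return (expert_bytes + dense_bytes) / 1e9

cfg = MoEConfig()

# Uniform 4-bit baseline: keep all experts, quantize everything to 4 bits.
uniform = weight_memory_gb(cfg, keep_ratio=1.0,
                           expert_bits=[4.0] * cfg.experts_per_layer,
                           dense_bits=4.0)

# Hypothetical pruned + mixed-precision setting: keep half the experts,
# give the most-routed quarter of them 4 bits and the rest 2 bits
# (importance scores would come from routing statistics in practice).
kept = int(cfg.experts_per_layer * 0.5)
bits = [4.0] * (kept // 4) + [2.0] * (kept - kept // 4)
mixed = weight_memory_gb(cfg, keep_ratio=0.5, expert_bits=bits, dense_bits=4.0)

print(f"uniform 4-bit : {uniform:6.1f} GB")
print(f"pruned + mixed: {mixed:6.1f} GB  (budget: 128 GB)")
```

With these placeholder numbers the uniform 4-bit baseline lands well above 128 GB, while the pruned, mixed-precision configuration falls under it; the talk's framework additionally targets activation memory, which this weight-only estimate does not model.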