ai-next

Design of the UCM Inference Architecture for Sparse Attention Acceleration

September 14

11:25 - 12:00

Location: Keynote Venue - 318 & 328

As large-model parameter counts and context windows grow in tandem, dense attention has become the dominant bottleneck for online inference: the attention score matrix grows quadratically with sequence length, and per-token decode latency grows linearly with context length. The industry commonly mitigates this with KVCache plus speculative decoding, but that combination still falls short in extreme ultra-long-sequence scenarios. This talk focuses on the sparse-attention paradigm and presents, for the first time, the architectural design and implementation experience behind the sparse inference stack of our self-developed UCM inference memory data manager, covering algorithms, plugin design, and software implementation.
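As a rough illustration of the sparse-attention idea the talk builds on (not UCM's actual algorithm), the sketch below restricts each query to its top-k highest-scoring keys, so the softmax and value aggregation touch O(n_q * top_k) entries instead of O(n_q * n_kv). The shapes, the `top_k` value, and the NumPy formulation are all assumptions made for the example.

```python
import numpy as np

def topk_sparse_attention(q, k, v, top_k=64):
    """Toy top-k sparse attention: each query attends only to its top_k keys.

    Note that the scores are still computed densely here for simplicity;
    production systems estimate or index the top keys to avoid materializing
    the full (n_q, n_kv) score matrix.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (n_q, n_kv)
    # Unordered indices of the top_k largest scores per query.
    idx = np.argpartition(-scores, top_k - 1, axis=-1)[:, :top_k]
    picked = np.take_along_axis(scores, idx, axis=-1)  # (n_q, top_k)
    weights = np.exp(picked - picked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over kept keys
    return np.einsum("qk,qkd->qd", weights, v[idx])    # gather values, combine

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 64))      # 8 decode queries, head dim 64
k = rng.standard_normal((4096, 64))   # 4096 cached keys (long context)
v = rng.standard_normal((4096, 64))
print(topk_sparse_attention(q, k, v).shape)  # (8, 64)
```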

Speakers