NexNPU

Overview

To keep pace with the exponential growth of AI, Neural Processing Units (NPUs) have become a cornerstone of modern computing infrastructure. The continued evolution of AI relies heavily on advances in NPUs, which demand collaborative efforts from both academia and industry. Currently, a gap exists between the exploration of new ideas in academia and state-of-the-art implementations in industry. While academia continuously proposes novel systems and architectures, the lack of an open, industry-standard baseline limits the community’s ability to validate these innovations and integrate them into production-grade systems. We need an open-source NPU ecosystem that lets the community innovate collaboratively across the entire stack.

In this tutorial, we present a roadmap for building an open-source ecosystem for NPU design and implementation. We first introduce the architecture and software stack of modern NPUs and highlight the missing pieces for an open NPU ecosystem. Then, we survey the open tool landscape for NPU development, including architectural design, compiler stack, system software, and hardware prototypes, and discuss the capabilities and limitations of existing tools. To facilitate NPU architecture design, we introduce NeuSim and discuss how users can use it to perform fast architectural design space exploration of NPU chips. Finally, we use the Google TPU stack as a case study to demonstrate the importance of building the NPU ecosystem across the full stack. The tutorial concludes with perspectives on future research and community efforts toward an open, extensible, and production-relevant NPU ecosystem.

Schedule

13:30 – 17:00 EDT · Room 302A

Time	Title	Speaker	Description
13:30 – 13:40	Opening: Why Do We Need an Open NPU Ecosystem?	Jian Huang	Introduce the vision, scope, and roadmap of building an open NPU ecosystem.
13:40 – 14:10	The NPU Hardware and Software Stack: The Missing Pieces and Open Research Problems	Yuqi Xue	Introduce the basic NPU hardware and software components and how they differ from the GPU ecosystem, highlighting the missing pieces of the NPU ecosystem and the open research problems.
14:10 – 14:40	Open Tools for NPU Development Lifecycle	Yuqi Xue	Survey existing open tools across the NPU development lifecycle, including architectural design, compiler stack, system software, and hardware prototypes. Discuss the capabilities and limitations of existing tools and what remains to be done.
14:40 – 15:30	Architectural Design Space Exploration for NPUs: Methodology and Case Studies	Yuqi Xue	Discuss the methodology and workflow for architectural design space exploration of NPUs using the NeuSim simulator.
15:30 – 16:00	Coffee Break
16:00 – 16:45	Demystifying the Google TPU Stack: From Silicon Architecture to Custom Pallas Kernels	Srinath Mandalapu (Google)	Introduce the Google TPU stack as a case study for building NPU ecosystems. Abstract: Building an open and innovative NPU/TPU ecosystem requires developers to understand exactly how AI hardware executes code. This concise, 45-minute tutorial provides a fast-paced, bottom-up tour of the Google TPU computing stack. We will break down the entire stack across three core layers. (1) The Hardware Layer: A look at the physical core architecture, including Matrix Multiply Units (MXU), Vector Processing Units (VPU), and Vector Registers (VREGs), and at how DMA engines move data across HBM, VMEM, and SMEM using specific tiling and memory layouts. (2) The Compiler Layer (XLA): Tracking the lifecycle of a program as it lowers from high-level JAX tracing down to JAXpr, StableHLO, and low-level optimization (LLO) into instruction-level VLIW bundles. (3) The Scaling & Customization Layer: Demystifying distributed execution by contrasting automatic JIT collectives with device-level explicit sharding (ShardMap), concluding with a practical introduction to writing custom hardware kernels using Pallas. By bridging the gap between silicon primitives and compiler abstractions, this talk equips developers with the mental models needed to build, optimize, and advocate for next-generation, open accelerator software stacks.
16:45 – 17:00	Conclusion: Toward an End-to-End Open NPU Ecosystem	Yuqi Xue	Summarize what exists today and what is still missing for an open NPU ecosystem, and outline plans for future work.

Speakers

Yuqi Xue

University of Illinois Urbana-Champaign

Yuqi Xue is a Ph.D. candidate at the University of Illinois Urbana-Champaign. He received his B.Sc. degree in computer engineering from the University of Illinois Urbana-Champaign. His research explores hardware/software co-design techniques for building efficient ecosystems for Neural Processing Units (NPUs). His research on NPUs has been published at top-tier conferences, including ISCA and MICRO, and has received the MICRO Best Paper Runner-Up, the MICRO Best Artifact Award, and IEEE Micro Top Picks. He also received the Rambus Computer Engineering Fellowship, the Dan Vivoli Endowed Fellowship, the Qualcomm Graduate Award, and the Yi-Min Wang and Pi-Yu Chung Research Award.

Srinath Mandalapu

Google

Srinath Mandalapu is a Senior Staff Engineer at Google specializing in the co-design of ML frameworks and TPU hardware. As a technical leader within Google's ML Frameworks organization, he serves as a primary architect for low-level Pallas kernels utilized in MaxText distributed training and vLLM TPU inference. His recent work focuses on the large-scale optimization of state-of-the-art open-source models—including DeepSeekV3, Llama, and Qwen—where he drives performance at the intersection of low-level hardware utilization (TPU/SparseCore) and high-level framework efficiency. Prior to Google, Srinath led distributed training and framework migrations (to PyTorch) at Meta and architected foundational advertising systems at Yahoo! that generated billions in annual revenue. He holds an M.S. in Computer Engineering from UC Santa Cruz and is an alumnus of the UC Berkeley Engineering Leadership Professional Program.

Jian Huang

University of Illinois Urbana-Champaign

Jian Huang is an Associate Professor and Y. T. Lo Faculty Fellow in the ECE department and an affiliated Associate Professor in the Siebel School of Computing and Data Science at the University of Illinois at Urbana-Champaign. He received his Ph.D. in Computer Science from the Georgia Institute of Technology in 2017. His research interests include computer systems and architecture, sustainable AI infrastructure, memory/storage systems, data systems, systems security, distributed systems, and especially their intersections. Most recently, he has been working on sustainable AI infrastructure. His research contributions have been published at top-tier computer architecture and systems conferences such as ISCA, MICRO, ASPLOS, OSDI, and SOSP. His work has received the USENIX Best Paper Award, the MICRO Best Paper Runner-Up, the MICRO Best Artifact Award, multiple IEEE Micro Top Picks (and Honorable Mentions), and the Microsoft Research Outstanding Project Award. He also received the inaugural SIGMICRO Early Career Award, the NSF CAREER Award, the NSF CRII Award, the Dean's Award for Early Innovation, the NetApp Faculty Fellowship Award, and the Google Faculty Research Award. He is a member of the MICRO Hall of Fame and the founder of the Workshop on Hot Topics in System Infrastructure.

Organizers

Yuqi Xue

University of Illinois Urbana-Champaign

yuqixue2@illinois.edu

Jian Huang

University of Illinois Urbana-Champaign

jianh@illinois.edu

Tools & Resources

NeuSim — Open-Source NPU Simulator (GitHub)

V10: Hardware-Assisted NPU Multi-tenancy (ISCA ’23)

Neu10: Hardware-Assisted Virtualization of NPUs (MICRO ’24)

ReGate: Enabling Power Gating in NPUs (MICRO ’25)