Scaling Down: Optimizing Foundation Models for Edge Deployment
Foundation models have revolutionized AI across modalities, but their deployment on edge devices remains limited by resource constraints such as compute, memory, and energy. In this talk, we explore how to optimize foundation models for on-device applications, focusing on algorithm–hardware co-design strategies that bridge the gap between large-scale pretraining and real-world efficiency. We begin by examining how far quantization can be pushed. By uncovering empirical scaling laws, we demonstrate that ultra-low-bit quantization can lie on the accuracy–efficiency Pareto frontier relative to 4-bit and higher-bit alternatives. Next, we discuss how the core principles of efficient modeling extend to vision-language systems. Finally, we present a novel approach to post-training adaptation via direct parameter mixing, enabling fast customization of large models without additional compute or data.
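
To make the ultra-low-bit idea concrete, the sketch below fake-quantizes a weight tensor with a symmetric, per-tensor, round-to-nearest grid and compares reconstruction error across bit widths. This is a minimal illustration under assumed design choices; the abstract does not specify the talk's actual quantization scheme or scaling-law methodology, and `quantize_symmetric` is an invented name.

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Fake-quantize `w` to a symmetric uniform grid with `bits` bits
    (round-to-nearest, per-tensor scale). Illustrative sketch only."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 1 for 2-bit, 7 for 4-bit
    scale = w.abs().max() / qmax         # map the largest weight to qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return q * scale                     # dequantize back to float

w = torch.randn(4096, 4096)
for bits in (2, 3, 4, 8):
    err = (w - quantize_symmetric(w, bits)).pow(2).mean().item()
    print(f"{bits}-bit MSE: {err:.6f}")
```

Sweeping bit width this way is one simple route to the kind of accuracy-versus-precision trade-off curves that a Pareto-frontier comparison rests on.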
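
The abstract likewise does not spell out the parameter-mixing recipe. As a hypothetical illustration of the general idea of adapting a model by operating directly on weights rather than retraining, the sketch below linearly interpolates two compatible checkpoints, in the spirit of generic model-merging approaches; `mix_parameters`, `alpha`, and the checkpoint filenames are all invented for this example and should not be read as the talk's method.

```python
import torch

def mix_parameters(base: dict, donor: dict, alpha: float = 0.5) -> dict:
    """Linearly interpolate two state dicts with identical architectures:
    mixed = (1 - alpha) * base + alpha * donor.
    A single pass over the parameters: no gradients, no training data."""
    return {
        name: (1.0 - alpha) * base[name] + alpha * donor[name]
        for name in base
    }

# Hypothetical usage with two checkpoints of the same architecture:
# base = torch.load("base_model.pt")
# donor = torch.load("finetuned_model.pt")
# model.load_state_dict(mix_parameters(base, donor, alpha=0.3))
```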