Gimlet Labs Blog
A blog about our lab's research on high-performance AI systems.
Announcing Gimlet's Series A Raise
By Zain Asgar, Michelle Nguyen, Omid Azizi, James Bartlett, Natalie Serrino
Low-Latency Inference with Speculative Decoding on d-Matrix Corsair and GPU
We evaluated running gpt-oss-120b with a 1.6B-parameter speculative decoder on d-Matrix Corsair. Compared to the same speculative decoder running on a GPU at equivalent energy consumption, the Corsair-based solution delivers a 2-5X end-to-end request speedup on configurations optimized for interactivity, and up to a 10X end-to-end speedup on energy-optimized configurations.
March 11, 2026
By James Bartlett, Natalie Serrino, Zain Asgar, Sudeep Bhoja, Prashant Nair, Nikhil Ghanathe, Nithesh Kurella
The emerging role of SRAM-centric chips in AI inference
In this post, we discuss the major differences between GPUs and SRAM-centric accelerators (e.g. Cerebras, Groq, and d-Matrix), explaining why near-compute versus far-compute memory is the key tradeoff these architectures make, and what it means for inference workloads.
March 5, 2026
By Natalie Serrino, Zain Asgar
Introducing Gimlet Labs: AI Infrastructure for the Agentic Era
We're excited to finally share what we've been building at Gimlet Labs. Our mission is to make AI workloads 10X more efficient by expanding the pool of usable compute and improving how it's orchestrated.
October 22, 2025
By Zain Asgar, Michelle Nguyen, Omid Azizi, Natalie Serrino
Designing infrastructure for running efficient AI workloads
AI workloads are shifting from simple LLM inference to complex, multi-model workflows. To run them efficiently at scale, we need a system that can dynamically decompose workloads, plan and schedule them, and map execution to the right hardware.
October 20, 2025
By Michelle Nguyen, Zain Asgar
Benchmarking AI-generated CUDA kernels on an H100
We extended our kernel generation research to CUDA, benchmarking on an H100, where the generated kernels achieve roughly 1.8X speedups over baseline PyTorch (including torch.compile).
October 18, 2025
By Taras Sereda, Natalie Serrino, Zain Asgar, Burak Bartan
Splitting LLM inference across different hardware platforms
Separating the prefill and decode stages of LLM inference improves token throughput because the two stages have different resource needs. Although most deployments use NVIDIA hardware for both stages, multivendor disaggregation can actually improve efficiency while maintaining SLAs. Based on our modeling of NVIDIA B200s and Intel Gaudi 3, common workloads can see a 1.7X TCO improvement compared to single-vendor disaggregation.
October 13, 2025
By Zain Asgar, Michelle Nguyen, Sachin Katti, Natalie Serrino
Speeding up PyTorch inference on Apple devices with AI-generated Metal kernels
Our lab investigated whether frontier models can write optimized GPU kernels for Apple devices to speed up inference. We found that they can: our AI-generated Metal kernels were 1.24X faster across KernelBench v0.1 problems and 1.87X faster across KernelBench v0 problems.
August 26, 2025
By Taras Sereda, Natalie Serrino, Zain Asgar