Gimlet Labs Blog

A blog about our lab's research on high-performance AI systems.

Announcing Gimlet's Series A Raise

Today, we're announcing our $80M Series A raise, led by Menlo Ventures and joined by Eclipse, Factory, Prosperity7, and Triatomic.

March 23, 2026
By Zain Asgar, Michelle Nguyen, Omid Azizi, James Bartlett, Natalie Serrino

Announcement · Gimlet Labs

Low-Latency Inference with Speculative Decoding on d-Matrix Corsair and GPU

We evaluated running gpt-oss-120b with a 1.6B parameter speculative decoder on d-Matrix Corsair. Compared to the same speculative decoder running on GPU at equivalent energy consumption, the Corsair-based solution delivers a 2-5X end-to-end request speedup on configurations optimized for interactivity, and up to a 10X end-to-end speedup on energy-optimized configurations.

March 11, 2026
By James Bartlett, Natalie Serrino, Zain Asgar, Sudeep Bhoja, Prashant Nair, Nikhil Ghanathe, Nithesh Kurella

Inference · Performance · Hardware · SRAM · Speculative Decoding · d-Matrix

The emerging role of SRAM-centric chips in AI inference

In this post, we'll discuss the major differences between GPUs and SRAM-centric accelerators (e.g. Cerebras, Groq, and d-Matrix), explain why near-compute versus far-compute memory is the key tradeoff these architectures make, and explore what this means for inference workloads.
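A quick back-of-envelope calculation shows why this memory tradeoff matters for decode. Generating each token streams roughly all model weights from memory, so single-stream decode speed is bounded by bandwidth over weight size. All numbers below are illustrative assumptions, not vendor specifications.

```python
# Roofline-style bound for memory-bound decode: tokens/sec per sequence is
# at most memory_bandwidth / bytes_of_weights_read_per_token.
# The sizes and bandwidths below are assumed placeholder values.

def decode_tok_per_s(weight_bytes, mem_bw_bytes_per_s):
    # Upper bound on single-stream decode rate for a memory-bound model.
    return mem_bw_bytes_per_s / weight_bytes

weights = 60e9     # ~60 GB of active weights (assumed)
far_mem = 3.35e12  # far-compute memory (HBM-class), ~3.35 TB/s (assumed)
sram_bw = 100e12   # near-compute SRAM aggregate bandwidth, ~100 TB/s (assumed)

print(f"far-compute bound:  {decode_tok_per_s(weights, far_mem):7.0f} tok/s")
print(f"near-compute bound: {decode_tok_per_s(weights, sram_bw):7.0f} tok/s")
```

The orders-of-magnitude bandwidth gap between far-compute DRAM and near-compute SRAM translates directly into the decode-rate ceiling, which is the crux of the architectural tradeoff the post examines.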

March 5, 2026
By Natalie Serrino, Zain Asgar

Inference · Performance · Hardware · SRAM

Introducing Gimlet Labs: AI Infrastructure for the Agentic Era

We're excited to finally share what we've been building at Gimlet Labs. Our mission is to make AI workloads 10X more efficient by expanding the pool of usable compute and improving how it's orchestrated.

October 22, 2025
By Zain Asgar, Michelle Nguyen, Omid Azizi, Natalie Serrino

Announcement · Gimlet Labs

Designing infrastructure for running efficient AI workloads

AI workloads are shifting from simple LLM inference to complex, multi-model workflows. To run them efficiently at scale, we need a system that can dynamically decompose workloads, plan and schedule them, and map execution to the right hardware.

October 20, 2025
By Michelle Nguyen, Zain Asgar

Inference · Performance · TCO

Benchmarking AI-generated CUDA kernels on an H100

We extended our kernel generation research to CUDA, benchmarking on an H100, where generated kernels achieve roughly a 1.8X speedup over baseline PyTorch (including torch.compile).

October 18, 2025
By Taras Sereda, Natalie Serrino, Zain Asgar, Burak Bartan

Kernel Optimization · Performance · NVIDIA

Splitting LLM inference across different hardware platforms

Separating the prefill and decode stages of LLM inference improves token throughput because the two stages have different resource needs. Although most deployments use NVIDIA hardware for both stages, multivendor disaggregation can improve efficiency while maintaining SLAs. Based on our models using NVIDIA B200s and Intel Gaudi 3, common workloads can see a 1.7X TCO improvement compared to single-vendor disaggregation.
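The economics can be sketched with simple per-stage cost arithmetic: compute-bound prefill and bandwidth-bound decode can each run on whichever hardware is cheapest per unit of its bottleneck resource. Every price and throughput below is a made-up placeholder, not a B200 or Gaudi 3 figure.

```python
# Illustrative TCO sketch for prefill/decode disaggregation. Per-stage cost
# per token is (hourly price) / (tokens per hour); blend by the workload's
# prefill/decode token mix. All numbers are invented for illustration.

def cost_per_mtok(prefill_rate, prefill_price, decode_rate, decode_price,
                  prefill_frac=0.5):
    """Dollars per 1M tokens, given per-stage throughput (tok/s) and $/hour."""
    p_cost = prefill_price / 3600 / prefill_rate   # $/token for prefill
    d_cost = decode_price / 3600 / decode_rate     # $/token for decode
    return 1e6 * (prefill_frac * p_cost + (1 - prefill_frac) * d_cost)

# Single-vendor: both stages on the premium accelerator (assumed numbers).
single = cost_per_mtok(50_000, 12.0, 2_000, 12.0)
# Multivendor: decode moved to a cheaper part with adequate bandwidth.
multi = cost_per_mtok(50_000, 12.0, 1_500, 4.0)
print(f"single-vendor: ${single:.2f}/Mtok   multivendor: ${multi:.2f}/Mtok")
```

Because decode dominates the token-cost blend in this toy example, pairing it with cheaper hardware lowers the blended cost even though that hardware decodes somewhat more slowly.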

October 13, 2025
By Zain Asgar, Michelle Nguyen, Sachin Katti, Natalie Serrino

Inference · Performance · TCO · Intel · NVIDIA

Speeding up PyTorch inference on Apple devices with AI-generated Metal kernels

Our lab investigated whether frontier models can write optimized GPU kernels for Apple devices to speed up inference. We found that they can: our AI-generated Metal kernels were 1.24x faster across KernelBench v0.1 problems, and 1.87x faster across KernelBench v0 problems.

August 26, 2025
By Taras Sereda, Natalie Serrino, Zain Asgar

Kernel Optimization · Performance · Apple Silicon