Overview

This project applies Karpathy's autoresearch concept to LLM inference optimization by building an automated system in which AI agents iteratively improve inference speed while satisfying quality constraints. The system runs continuous experiments on Apple Silicon hardware, revealing which optimizations actually work in practice and which fail.
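The core loop this describes can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the function and parameter names (`run_experiment`, `quality_floor`) are assumptions, and `benchmark` stands in for the locked evaluation harness, which returns a (speed, quality) pair.

```python
def run_experiment(baseline, candidate, benchmark, quality_floor=0.95):
    """One iteration of the autoresearch loop (illustrative names):
    measure the candidate against the locked benchmark and keep it
    only if it is faster without dropping below the quality floor."""
    base_speed, _ = benchmark(baseline)
    cand_speed, cand_quality = benchmark(candidate)
    if cand_speed > base_speed and cand_quality >= quality_floor:
        return "keep"
    return "revert"  # failed attempts are still logged, not discarded


# Stub benchmark for demonstration: each "implementation" just reports
# its own (tokens/sec, quality) pair when called.
def stub_benchmark(impl):
    return impl()

slow_but_good = lambda: (100.0, 1.00)
fast_but_bad = lambda: (200.0, 0.50)
fast_and_good = lambda: (200.0, 0.98)

run_experiment(slow_but_good, fast_but_bad, stub_benchmark)   # "revert"
run_experiment(slow_but_good, fast_and_good, stub_benchmark)  # "keep"
```

The asymmetry is the point: a change that doubles speed but halves quality is rejected just as firmly as one that simply slows things down.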

The Breakdown

  • Bounded optimization framework - Creates a constrained search space where an AI agent can only modify inference.py while prepare.py remains locked, preventing the agent from "winning" by changing the evaluation criteria
  • Reversible experimentation with quality gates - Every code change is automatically tested against fixed benchmarks across 5 prompt types, with changes reverted if they improve speed but hurt output quality
  • Real hardware constraints reveal optimization limits - Testing on Apple Silicon with MLX shows which theoretical speedups actually work in practice versus which ones fail due to hardware bottlenecks
  • Stable evaluation harness design - The core insight is that prepare.py (the evaluation) is more important than inference.py (the code being optimized), ensuring consistent measurement across experiments
  • Graveyard of failed optimizations - Unlike typical AI demos that only show wins, this approach deliberately tracks and learns from failed attempts, noise, and fake improvements
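The "prepare.py is more important than inference.py" insight can be enforced mechanically: freeze the benchmark definition and fingerprint it, so any drift in the evaluation is detected before two experiments are compared. The sketch below is an assumption about how such a harness might look, the five prompt categories are invented placeholders, not the project's actual set.

```python
import hashlib
import json
import statistics
import time

# Stand-in for the locked prepare.py: five fixed prompt categories.
# These names and prompts are illustrative, not the project's real ones.
EVAL_PROMPTS = {
    "short_qa": "What is 2 + 2?",
    "summarize": "Summarize: the quick brown fox jumps over the lazy dog.",
    "code": "Write a Python one-liner that reverses a list.",
    "reasoning": "If all A are B and all B are C, are all A C?",
    "long_form": "Explain caching in two sentences.",
}

def eval_fingerprint():
    """Hash the benchmark definition so any accidental edit to the
    evaluation shows up as a fingerprint mismatch, not a fake win."""
    blob = json.dumps(EVAL_PROMPTS, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def measure(infer_fn, repeats=3):
    """Median tokens/sec over the fixed prompts, repeated to damp
    run-to-run noise. Whitespace tokens are a crude proxy here."""
    rates = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        n_tokens = sum(len(infer_fn(p).split()) for p in EVAL_PROMPTS.values())
        rates.append(n_tokens / max(time.perf_counter() - t0, 1e-9))
    return statistics.median(rates)
```

Because only `inference.py` (here, `infer_fn`) varies between runs, a change in `eval_fingerprint()` immediately flags that the agent escaped its sandbox and touched the evaluation itself.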
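Separating real improvements from the noise and "fake improvements" the graveyard tracks requires more than comparing two single timings. A simple guard, sketched here under assumed names (the project's actual statistical test, if any, isn't described), is to demand that the median speedup exceed both a minimum relative gain and the observed run-to-run spread:

```python
import statistics

def is_real_speedup(base_rates, cand_rates, min_gain=0.03):
    """Crude noise filter: accept a candidate only if its median
    tokens/sec beats the baseline median by more than min_gain
    (relative) AND by more than the combined spread of both samples."""
    base_med = statistics.median(base_rates)
    cand_med = statistics.median(cand_rates)
    spread = statistics.stdev(base_rates) + statistics.stdev(cand_rates)
    return (cand_med - base_med) > max(min_gain * base_med, spread)


# A ~20% jump well outside the noise band passes; a ~1% wiggle does not.
is_real_speedup([100, 101, 99, 100, 100.5], [120, 121, 119, 120, 120.5])  # True
is_real_speedup([100, 101, 99, 100, 100.5], [101, 99, 102, 100, 101])     # False
```

Rejected candidates still carry information: logging them with their measurements is what turns the graveyard into a record of which optimization families fail on this hardware.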