Overview

This episode explores the massive engineering scale behind AWS S3, the world’s largest cloud storage service. Milan, VP of data and analytics at AWS, who has run S3 for 13 years, reveals how S3 manages hundreds of exabytes across tens of millions of hard drives while maintaining reliability. The discussion covers technical challenges such as consistency models, failure handling, and formal methods, topics that most AWS engineers rarely discuss publicly.

Key Takeaways

  • Scale requires thinking in unprecedented units - S3 manages hundreds of exabytes across tens of millions of hard drives, demonstrating that truly large systems operate at scales that challenge human comprehension
  • Consistency models can evolve in mature systems - S3’s transition from eventual to strong consistency shows that fundamental architectural decisions can be changed even in massive production systems with careful engineering
  • Physical infrastructure still matters in cloud computing - despite abstractions, S3’s reliability ultimately depends on managing millions of physical servers and drives across global data centers
  • Failure handling becomes a core discipline at scale - concepts like correlated failure and crash consistency become daily operational concerns rather than theoretical edge cases when managing exabytes of data
  • Formal methods become essential for correctness - at S3’s scale, mathematical proofs and formal verification are necessary tools rather than academic luxuries to ensure system reliability
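The second takeaway, the move from eventual to strong consistency, comes down to read-after-write behavior: under eventual consistency a read issued right after a write may hit a replica that has not yet applied it, while under strong consistency the write is not acknowledged until every replica can serve it. A toy replicated key-value store makes the difference concrete. This is a hypothetical sketch for intuition only, not S3's actual design; the names `ReplicatedStore` and `converge` are invented for this example.

```python
import random

class ReplicatedStore:
    """Toy model of read-after-write behavior under two consistency modes.

    Illustrative only -- real systems like S3 use far more sophisticated
    replication and metadata protocols than this sketch.
    """

    def __init__(self, replicas=3, strong=False):
        self.strong = strong
        self.replicas = [dict() for _ in range(replicas)]
        self.pending = []  # writes not yet applied to every replica

    def put(self, key, value):
        # The write always lands on one replica immediately.
        self.replicas[0][key] = value
        if self.strong:
            # Strong consistency: replicate everywhere *before*
            # acknowledging the write, so any later read sees it.
            for replica in self.replicas[1:]:
                replica[key] = value
        else:
            # Eventual consistency: acknowledge now, replicate later.
            self.pending.append((key, value))

    def get(self, key, rng=random):
        # A read may be served by any replica, including a stale one.
        return rng.choice(self.replicas).get(key)

    def converge(self):
        # Background anti-entropy: apply lagging writes to all replicas.
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()


# With strong=True, a get() immediately after put() can never return a
# stale result; with strong=False it can, until converge() has run.
strong = ReplicatedStore(strong=True)
strong.put("photo.jpg", b"v1")
assert all(r.get("photo.jpg") == b"v1" for r in strong.replicas)

eventual = ReplicatedStore(strong=False)
eventual.put("photo.jpg", b"v1")
stale_read = eventual.replicas[1].get("photo.jpg")  # None: not replicated yet
eventual.converge()
fresh_read = eventual.get("photo.jpg")  # b"v1" on every replica now
```

The sketch also hints at why the transition discussed in the episode is hard: making `put` wait for all replicas changes the latency and failure behavior of every write path, which is why retrofitting strong consistency onto a live system at S3's scale is a significant engineering feat.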

Topics Covered