Overview
This episode explores the engineering scale behind AWS S3, the world’s largest cloud storage service. Milan, VP of data and analytics at AWS who has run S3 for 13 years, explains how S3 manages hundreds of exabytes across tens of millions of hard drives while maintaining reliability. The discussion covers technical topics that AWS engineers rarely discuss publicly, including consistency models, failure handling, and formal methods.
Key Takeaways
- Scale requires thinking in unfamiliar units - S3 manages hundreds of exabytes across tens of millions of hard drives, a scale that is hard to reason about intuitively
- Consistency models can evolve in mature systems - S3’s transition from eventual to strong consistency shows that, with careful engineering, fundamental architectural decisions can be changed even in massive production systems
- Physical infrastructure still matters in cloud computing - despite the abstractions, S3’s reliability ultimately depends on managing millions of physical servers and tens of millions of drives across global data centers
- Failure handling becomes a core discipline at scale - concepts like correlated failure and crash consistency become daily operational concerns rather than theoretical edge cases when managing exabytes of data
- Formal methods become essential for correctness - at S3’s scale, mathematical proofs and formal verification are necessary tools for ensuring correctness rather than academic luxuries
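To make the eventual-versus-strong distinction in the takeaways above concrete, here is a minimal Python sketch. It is a toy model, not a description of how S3 is built: the `ToyStore` class, its two in-memory replicas, and the `sync` step are all hypothetical names introduced only for illustration.

```python
class ToyStore:
    """Hypothetical two-replica key-value store (illustration only, not S3's design)."""

    def __init__(self, strong: bool):
        self.strong = strong
        self.primary = {}   # replica that accepts writes
        self.replica = {}   # replica that serves reads in this toy model
        self.pending = []   # writes not yet applied to the read replica

    def put(self, key, value):
        self.primary[key] = value
        if self.strong:
            # Strong mode: the read replica is updated before put() returns,
            # so any later read sees the new value.
            self.replica[key] = value
        else:
            # Eventual mode: the read replica catches up some time later.
            self.pending.append((key, value))

    def sync(self):
        """Apply all outstanding writes to the read replica."""
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()

    def get(self, key):
        return self.replica.get(key)


eventual = ToyStore(strong=False)
eventual.put("report.csv", "v1")
stale_read = eventual.get("report.csv")   # replica has not caught up yet
eventual.sync()
later_read = eventual.get("report.csv")   # now visible

strong = ToyStore(strong=True)
strong.put("report.csv", "v1")
immediate_read = strong.get("report.csv")  # visible right after the write
```

In the eventual mode a read issued immediately after a write can miss it; in the strong mode the same read always observes it. The hard part at S3's scale, which this sketch deliberately ignores, is providing that guarantee across many machines and in the presence of failures.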
Topics Covered
- 0:00 - Introduction and S3 Scale Overview: Introduction to Milan and discussion of S3’s scale (500 trillion objects, hundreds of exabytes, hundreds of millions of transactions per second)
- 3:00 - Hardware Infrastructure Deep Dive: Physical infrastructure behind S3 (tens of millions of hard drives across millions of servers in 120 Availability Zones and 38 Regions)
- 6:00 - Customer Scale and Data Lakes: Discussion of individual customers with exabytes of data and the concept of data lakes vs data oceans
- 9:00 - Consistency Model Evolution: How S3 transitioned from eventual consistency to strong consistency and the engineering complexity behind this change
- 15:00 - Failure Handling and Reliability: Correlated failure, crash consistency, and failure allowances, and how S3 engineers think about these concepts
- 20:00 - Formal Methods and Correctness: The role of formal methods in ensuring correctness at S3’s massive scale
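The scale figures quoted at 0:00 can be combined into a rough back-of-envelope estimate of average object size. The sketch below is only an illustration: the episode gives "hundreds of exabytes" rather than an exact figure, so the 100 to 900 exabyte bounds are an assumed reading of that phrase, and decimal exabytes (10^18 bytes) are assumed as well.

```python
EXABYTE = 10**18  # decimal exabyte in bytes (assumption; the notes do not say which unit)

OBJECTS = 500 * 10**12  # "500 trillion objects", as quoted in the notes

# Assumed bounds for "hundreds of exabytes"; the exact total is not given.
STORED_BYTES_LOW = 100 * EXABYTE
STORED_BYTES_HIGH = 900 * EXABYTE


def average_object_bytes(total_bytes: int, objects: int) -> float:
    """Mean object size in bytes for a given total and object count."""
    return total_bytes / objects


low_avg = average_object_bytes(STORED_BYTES_LOW, OBJECTS)
high_avg = average_object_bytes(STORED_BYTES_HIGH, OBJECTS)

print(f"average object size: {low_avg / 1e3:.0f} KB to {high_avg / 1e6:.1f} MB")
# prints "average object size: 200 KB to 1.8 MB"
```

Under these assumptions the mean object is somewhere between a few hundred kilobytes and a couple of megabytes. This is an average only and says nothing about the actual distribution of object sizes.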