PostgreSQL Disaster Recovery at Scale: Lessons from Amazon RDS Operations
Level: Intermediate
Amazon Relational Database Service (Amazon RDS) operates a variety of large, very high-throughput PostgreSQL databases in unusual architectures. Alongside tackling performance challenges over the years, we have been investing in the disaster recovery capabilities of our Postgres databases and would like to share our experience.
In this technical session, we'll examine our comprehensive approach to disaster prevention and recovery, sharing insights from managing business-critical database operations at scale.
This session covers:
- Strategies for recovering from logical unavailability caused by database overload (topics: monitoring and resource management)
- Recovery from storage, hardware, or networking failures (topics: Multi-AZ deployments, read replicas)
- Addressing logical data corruption (topics: point-in-time recovery (PITR), delayed replicas, change logs, pg_dump and backups)
- Patterns that enhance resilience and prevent disasters (topics: preventive triggers, access control, database row-level signatures)
- Developing and maintaining operational readiness within the team (topics: automated testing and DR training)
Speaker
Andrei Dukhounik