The outage was caused by an infrastructure configuration change yesterday in a Redis replication layer which caused crash loops 12 hours after being applied.
The main issue was the final phone alert routing which was broken in our setup, not waking up the right people.
Both these issues have been fixed now.