Partial planner outage

Incident Report for Iternio Status

Postmortem

We had a partial outage of the planner function in all environments from 13:10 to 13:25 UTC because a central database was unavailable.

The direct cause of the outage was three Elasticsearch (ES) nodes crashing simultaneously, which made crucial data inaccessible. The crash appears to have been triggered by our offsite backup, which internally takes a snapshot of the ES data. The backup worked fine until we recently upgraded the underlying Linux platform the nodes run on. The same crash has occurred on both an ES 7 and an ES 8 cluster at different times, so it appears to be an unresolved issue in ES.

We are now investigating whether we need to downgrade the Linux platform or use a different backup solution.

Posted Sep 09, 2024 - 14:06 UTC

Resolved

This incident has been resolved.
Posted Sep 09, 2024 - 14:02 UTC

Monitoring

The Elasticsearch database instances have restarted and restored their state, and everything is operational again.
Posted Sep 09, 2024 - 14:00 UTC

Identified

A number of database instances crashed simultaneously. They are now back online, and we are starting to investigate why this occurred.
Posted Sep 09, 2024 - 13:29 UTC

Update

We are continuing to investigate this issue.
Posted Sep 09, 2024 - 13:27 UTC

Investigating

We are currently investigating this issue.
Posted Sep 09, 2024 - 13:27 UTC
This incident affected: ABetterRouteplanner (ABRP), Iternio Planning API, OEM1 Planner API, and OEM2 Planner API.