On 7th December 2023 at 09:35, we experienced a critical incident impacting some institutions for approximately one hour.
Incident Summary:
Issue: 5xx responses from API gateway.
Impact: Customers may have experienced 5xx responses for some institutions.
Causes: A platform deployment.
Resolution: We finished rolling back the offending deployment within 26 minutes of the critical incident being opened.
Timeline:
09:37: Impact started. Customers experiencing 5xx responses for some institutions.
09:48: First alert detected by the monitoring system.
10:10: Critical incident reported internally.
10:20: Root cause identified.
10:29: Rollback of the offending deployment initiated.
10:37: Rollback completed. 5xx responses cease.
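The intervals implied by the timeline above can be checked with a short calculation. This is only an illustrative sketch: the helper name minutes_between is hypothetical, and the timestamps are copied from the timeline entries and assumed to fall on the same day.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps (hypothetical helper)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps taken from the timeline above.
impact_start = "09:37"
first_alert = "09:48"
rollback_done = "10:37"

# Time from start of impact to the first monitoring alert.
time_to_detect = minutes_between(impact_start, first_alert)

# Time from start of impact to the end of 5xx responses.
impact_duration = minutes_between(impact_start, rollback_done)

print(f"Time to first alert: {time_to_detect} minutes")
print(f"Impact duration: {impact_duration} minutes")
```

The impact duration computed this way is consistent with the "approximately one hour" stated in the summary.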
Root Causes:
The incident was triggered by a breaking change introduced by Yapily during the deployment of a platform release. The main issue was aggressive consumption of database connections, which caused instances to run out of memory and restart.
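As a general illustration of how connection consumption can be kept in check, the sketch below shows a fixed-size connection pool. It is not the code involved in this incident or the fix that was applied: the class name BoundedPool, the connect callable, and the max_size and acquire_timeout parameters are all hypothetical. The idea is that each instance holds at most max_size connections, and callers that cannot obtain one fail fast instead of opening more.

```python
import queue
from contextlib import contextmanager


class BoundedPool:
    """Hypothetical fixed-size pool: at most max_size connections ever exist."""

    def __init__(self, connect, max_size, acquire_timeout=1.0):
        self._connect = connect          # assumed: callable that opens one connection
        self._timeout = acquire_timeout  # seconds to wait for a free slot
        self._slots = queue.Queue(maxsize=max_size)
        for _ in range(max_size):
            # None marks a slot whose connection has not been opened yet.
            self._slots.put(None)

    @contextmanager
    def connection(self):
        try:
            conn = self._slots.get(timeout=self._timeout)
        except queue.Empty:
            # Fail fast rather than opening an unbounded number of connections.
            raise TimeoutError("connection pool exhausted") from None
        try:
            if conn is None:
                conn = self._connect()  # open lazily on first use of this slot
            yield conn
        finally:
            # Always return the slot, so the cap stays at max_size.
            self._slots.put(conn)
```

A caller would wrap each database operation in "with pool.connection() as conn:", so that memory held by connections is bounded by max_size regardless of request volume.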
Learnings:
To improve the stability and reliability of our services, we have taken the following actions in response to this incident: optimising memory efficiency through code-level improvements, enhancing monitoring to highlight memory inefficiencies earlier in the release cycle, and speeding up resource scaling should a similar incident occur. Additionally, we are exploring release strategies that allow traffic to be redirected quickly to healthy, stable versions during potential disruptions, and we are optimising how we serve consents from the consent store.
We apologise for the disruption that this incident caused you and your customers.