
Payment Service Updates - Resolved

September 20, 5:00 AM EST - September 20, 2:43 PM EST

Users were unable to access GitKraken Client or GitLens and received a subscription renewal message

Last week, we experienced our first major incident that resulted in users being unable to use GitKraken Client and GitLens. This report provides an overview of the incident, the steps taken to resolve it, and the steps we are taking to prevent a recurrence.

Incident Background:

On September 20, at 9:00 am UTC (5:00 am EST), we deployed changes to our payment-service which included infrastructure updates and new features. However, certain configuration options were not correctly set in our production environment and observability platform.

These changes led to subscriptions mistakenly being marked as expired. We also experienced increased response times in our api-service, creating a bottleneck. Only the api-service was affected; other services such as projects, providers, referrals, and notifications operated normally.
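Misconfigurations like this can be caught before they cause user-facing damage by refusing to start when required settings are absent. Below is a minimal sketch of that fail-fast pattern; the variable names are hypothetical and are not GitKraken's actual configuration.

```python
# Sketch: fail-fast validation of required environment configuration at
# service startup, so a missing production setting aborts the deploy
# instead of causing downstream errors (e.g. subscriptions wrongly
# treated as expired). All names here are illustrative assumptions.
import os

REQUIRED_VARS = [
    "PAYMENT_DB_URL",          # hypothetical database connection string
    "OBSERVABILITY_API_KEY",   # hypothetical monitoring-platform key
    "SUBSCRIPTION_GRACE_DAYS", # hypothetical renewal grace period
]

def validate_config(env=None):
    """Return the required settings, or raise before serving traffic."""
    env = os.environ if env is None else env
    missing = [v for v in REQUIRED_VARS if not env.get(v)]
    if missing:
        raise RuntimeError(
            "Refusing to start; missing config: " + ", ".join(missing)
        )
    return {v: env[v] for v in REQUIRED_VARS}
```

Wiring a check like this into the deploy pipeline turns a silent misconfiguration into an immediate, attributable startup failure.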

Impact:

Users were unable to access GitKraken Client or GitLens and received a subscription renewal message. Concurrently, API response times extended to 1-2 minutes.

Timeline:

    Deployment and Initial Discovery

    • 5:00 am (EST): Payment service deployed; loss of monitoring visibility noted.
    • ~6:00 am (EST): Manual testing revealed performance issues.

    Investigation and Initial Responses

    • 6:55 am to 8:36 am (EST): Multiple alarms received; limited visibility from monitoring tools. Rollback attempts were made, but issues persisted. Configuration error in payment service identified.

    Further Analysis and Rollbacks

    • 9:00 am to 10:30 am (EST): Continued performance issues led to team discussions and another attempt at a rollback. Monitoring tools highlighted prolonged api-service response times.

    Resolution and Monitoring

    • 11:00 am to ~2:00 pm (EST): Monitoring tools were toggled to manage traffic; api-service performance gradually improved.
    • ~2:30 pm (EST): After thorough checks and tests, an all-clear was issued.

Forward-Looking Remediation Steps

  • Deployment Tests: Introduce user-workflow tests during deployments to detect issues early.
  • Rollback Procedures: Improve documentation and procedures for quickly reverting problematic production pushes.
  • Access and Permissions: Broaden permissions for developers to modify configurations and access key systems, ensuring availability across different time zones.
  • Customer Communication: Enhance coordination with Customer Support and automate the service status page for more transparent customer updates.
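The first remediation step, user-workflow tests during deployments, could look like the sketch below: a small gate that runs end-to-end checks and blocks the rollout if any fail or exceed a latency budget. The check names and thresholds are assumptions for illustration, not GitKraken's actual pipeline.

```python
# Sketch of a post-deployment smoke gate: run user-workflow checks
# (e.g. subscription lookup, license validation) and pass only if every
# check succeeds within a latency budget. Checks here are stubs standing
# in for real API calls; names and limits are hypothetical.
import time

def run_check(name, fn):
    """Run one check, capturing success/failure and elapsed time."""
    start = time.monotonic()
    try:
        ok = bool(fn())
    except Exception:
        ok = False
    return name, ok, time.monotonic() - start

def smoke_test(checks, max_latency=2.0):
    """Return True only if all checks pass within the latency budget.

    A False result should block the rollout and trigger a rollback.
    """
    results = [run_check(name, fn) for name, fn in checks]
    return all(ok and elapsed <= max_latency for _, ok, elapsed in results)

# Stubbed example of the user workflows the report mentions:
checks = [
    ("subscription-status", lambda: True),  # e.g. subscription shows active
    ("license-validation", lambda: True),   # e.g. client license check succeeds
]
print(smoke_test(checks))
```

Running this gate automatically after each deploy (and on a schedule) would have surfaced both the expired-subscription responses and the 1-2 minute api-service latencies within minutes of the 5:00 am push.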