Recording system malfunction and short access issues to the web application
Incident Report for Gong
Postmortem

At a Glance

Our recording system malfunctioned beginning on May 6 at ~13:45 UTC. A series of successive fixes fully restored the system at ~17:20 UTC. For a short part of this period, users of the Gong web application had trouble logging in.

Background

1. On May 5, we released a software update to the recording system. We monitored the system afterward and observed no noticeable issues.

2. On May 6 at ~13:45 UTC, we received an alert indicating that our recording system “brain” was malfunctioning. We observed unusually high load on the recording database. It also appeared that our cloud provider had rolled over the database from the primary instance to a secondary instance, and then back. We reached out to our cloud provider for assistance. In parallel, we performed emergency database adjustments, which appeared to resolve the issues.

3. While monitoring the system, we noticed that the issues were resurfacing. Based on the recommendation of our cloud provider, we upgraded the recording database at ~16:45 UTC.

4. While the database upgrade addressed the original issues, it caused connectivity issues in the recording system “brain” at around 17:00 UTC. Consequently, for a short period (~10 min), users of the Gong web application had trouble logging in, because the web application was attempting to connect to the partially responsive recording servers to display the up-to-date status of calls.

5. At ~17:20 UTC, we restarted the recording system “brain”, and the issues were deemed resolved. Calls starting at around that time were recorded successfully.

6. During this incident, our global status page was not updated.

Analysis and Response

After further investigation, we’ve determined that the issue stemmed from the combination of the new, faulty recording system release and a high-load situation. This combination placed excessive load on the recording database.

We’ve since rolled back the faulty recording system update, and will perform further testing before rolling it out to the production environment again.

We’ve also identified several remediation steps to reduce the chance of future incidents:

1. We will decouple the web application from the recording system, so that if the recording system is slow to respond, web application users will still be able to use the application.

2. We will review our testing, launch and monitoring procedures for the recording system. Specifically, we will review the alerting thresholds, to provide earlier detection of similar issues.

3. We will review our rollover process to ensure it does not result in system issues.

4. We will refresh our guidance to our on-call team to ensure more reliable communication to customers.
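
As an illustration of the first remediation step, the sketch below shows one common way to keep a slow dependency from blocking a caller: bound the call-status lookup with a timeout and fall back to a placeholder value. It is not our actual implementation. The function names, the timeout value, and the "status unavailable" placeholder are all hypothetical and chosen only for the example.

```python
import concurrent.futures
import time

# Hypothetical placeholder shown when the recording system does not answer in time.
STATUS_UNAVAILABLE = "status unavailable"

# Small shared pool used to run status lookups off the request path.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_status_or_fallback(fetch_status, call_id, timeout_s=0.5,
                            fallback=STATUS_UNAVAILABLE):
    """Return the call status, or `fallback` if the lookup is slow or fails.

    `fetch_status` stands in for whatever client talks to the recording
    system; it is an assumption of this sketch, not a real API.
    """
    future = _executor.submit(fetch_status, call_id)
    try:
        # Wait at most `timeout_s` seconds for the recording system.
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Slow response: degrade gracefully instead of blocking the user.
        return fallback
    except Exception:
        # Connection errors and similar failures are treated the same way.
        return fallback


# Stand-in lookups used only to exercise the three cases.
def healthy_lookup(call_id):
    return "recording"


def slow_lookup(call_id):
    time.sleep(0.3)
    return "recording"


def failing_lookup(call_id):
    raise ConnectionError("recording system unreachable")
```

With this shape, a login or page load can proceed with a placeholder status when the recording servers are only partially responsive. Note that the timed-out lookup keeps running in its worker thread, so a production design would also need to bound the pool or shed load (for example with a circuit breaker).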

Posted May 07, 2019 - 09:16 PDT

Resolved
Recording system malfunction
Posted May 06, 2019 - 12:30 PDT